All posts by Danilo Poccia

New for Amazon CodeGuru – Python Support, Security Detectors, and Memory Profiling

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-codeguru-python-support-security-detectors-and-memory-profiling/

Amazon CodeGuru is a developer tool that helps you improve your code quality and has two main components:

  • CodeGuru Reviewer uses program analysis and machine learning to detect potential defects that are difficult to find in your code and offers suggestions for improvement.
  • CodeGuru Profiler collects runtime performance data from your live applications, and provides visualizations and recommendations to help you fine-tune your application performance.

Today, I am happy to announce three new features:

  • Python Support for CodeGuru Reviewer and Profiler (Preview) – You can now use CodeGuru to improve applications written in Python. Before this release, CodeGuru Reviewer could analyze Java code, and CodeGuru Profiler supported applications running on a Java virtual machine (JVM).
  • Security Detectors for CodeGuru Reviewer – A new set of detectors for CodeGuru Reviewer to identify security vulnerabilities and check for security best practices in your Java code.
  • Memory Profiling for CodeGuru Profiler – A new visualization of memory retention per object type over time. This makes it easier to find memory leaks and optimize how your application is using memory.

Let’s see these functionalities in more detail.

Python Support for CodeGuru Reviewer and Profiler (Preview)
Python Support for CodeGuru Reviewer is available in Preview and offers recommendations on how to improve the Python code of your applications in multiple categories such as concurrency, data structures and control flow, scientific/math operations, error handling, using the standard library, and of course AWS best practices.

You can now also use CodeGuru Profiler to collect runtime performance data from your Python applications and get visualizations to help you identify how code is running on the CPU and where time is consumed. In this way, you can detect the most expensive lines of code of your application. Focusing your tuning activities on those parts helps you reduce infrastructure cost and improve application performance.
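
To profile your own Python application, you typically start the agent inside the process. Here is a minimal sketch using the codeguru_profiler_agent package; the profiling group name is an assumption, and the profiling group must be created beforehand:

from codeguru_profiler_agent import Profiler

def main():
    # Application code goes here
    total = sum(i * i for i in range(1_000_000))
    print(total)

if __name__ == "__main__":
    # "MyPythonApp" is a hypothetical profiling group name.
    Profiler(profiling_group_name="MyPythonApp").start()
    main()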

Let’s see the CodeGuru Reviewer in action with some Python code. When I joined AWS eight years ago, one of the first projects I created was a Filesystem in Userspace (FUSE) interface to Amazon Simple Storage Service (S3) called yas3fs (Yet Another S3-backed File System). It was inspired by the more popular s3fs-fuse project but rewritten from scratch to implement a distributed cache synchronized by Amazon Simple Notification Service (SNS) notifications (now, thanks to the many contributors, it’s using S3 event notifications). It was also a good excuse for me to learn more about Python programming and S3. It’s a personal project that I made available as open source at the time. Today, if you need a shared file system, you can use Amazon Elastic File System (EFS).

In the CodeGuru console, I associate the yas3fs repository. You can associate repositories from GitHub, including GitHub Enterprise Cloud and GitHub Enterprise Server, Bitbucket, or AWS CodeCommit.

After that, I can get a code review from CodeGuru in two ways:

  • Automatically, when I create a pull request. This is a great way to use it as you and your team are working on a code base.
  • Manually, creating a repository analysis to get a code review for all the code in one branch. This is useful to start using CodeGuru with an existing code base.

Since I just associated the whole repository, I go for a full analysis and specify the name of the branch to review (apologies, I was still using master at the time, now I use main for new projects).
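
If you prefer to script this step, you can also start a full repository analysis with the AWS SDK for Python (boto3). This is a rough sketch; the association ARN is a placeholder for the one returned when the repository is associated:

import boto3

client = boto3.client("codeguru-reviewer")

# The association ARN below is a placeholder for the repository association created above.
response = client.create_code_review(
    Name="yas3fs-full-repository-analysis",
    RepositoryAssociationArn="arn:aws:codeguru-reviewer:us-east-1:123412341234:association/EXAMPLE",
    Type={"RepositoryAnalysis": {"RepositoryHead": {"BranchName": "master"}}},
)
print(response["CodeReview"]["State"])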

After a few minutes, the code review is completed, and there are 14 recommendations. Not bad, but I can definitely improve the code. Here are a few of the recommendations I get. I was using exceptions and global variables too much at the time.
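
As a made-up illustration (not one of the actual findings from this review), here is the kind of broad exception handling that such recommendations typically point out, together with a narrower alternative:

# Hypothetical example of a pattern a code review may flag: a broad except clause
# silently swallows unrelated errors and makes debugging harder.
def read_config(path):
    try:
        with open(path) as f:
            return f.read()
    except Exception:
        return None

# Catching only the expected error keeps unexpected failures visible.
def read_config_better(path):
    try:
        with open(path) as f:
            return f.read()
    except FileNotFoundError:
        return None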

Security Detectors for CodeGuru Reviewer
The new CodeGuru Reviewer Security Detector uses automated reasoning to analyze all code paths and find potential security issues deep in your Java code, even ones that span multiple methods and files and that may involve multiple sequences of operations. To build this detector, we used lessons learned and best practices from Amazon’s 20+ years of experience.

The Security Detector also identifies security vulnerabilities in the top 10 Open Web Application Security Project (OWASP) categories, such as weak hash encryption.

If the security detector discovers an issue, it offers a suggested remediation along with an explanation. In this way, it’s much easier to follow security best practices for AWS APIs, such as those for AWS Key Management Service (KMS) and Amazon Elastic Compute Cloud (EC2), and for common Java cryptography and TLS/SSL libraries.

With help from the security detector, security engineers can focus on architectural and application-specific security best-practices, and code reviewers can focus their attention on other improvements.

Memory Profiling for CodeGuru Profiler
For applications running on a JVM, CodeGuru Profiler can now show the Heap Summary, a consolidated view of memory retention during a time frame, tracking both overall sizes and number of objects per object type (such as String, int, char[], and custom types). These metrics are presented in a timeline graph, so that you can easily spot trends and peaks of memory utilization per object type.

Here are a couple of scenarios where this can help:

Memory Leaks – A constantly growing memory utilization curve for one or more object types may indicate a leak (intended here as unnecessary retention of memory objects by the application), possibly leading to out-of-memory errors and application crashes.

Memory Optimizations – Having a breakdown of memory utilization per object type is a step beyond traditional memory utilization monitoring, based solely on JVM-level metrics like total heap usage. By knowing that an unexpectedly high amount of memory has been associated with a specific object type, you can focus your analysis and optimization efforts on the parts of your application that are responsible for allocating and referencing objects of that type.

For example, here is a graph showing how memory is used by a Java application over an interval of time. Apart from the total capacity available and the used space, I can see how memory is being used by some specific object types, such as byte[], java.lang.UUID, and the entries of a java.util.LinkedHashMap. The continuous growth over time of the memory retained by these object types is suspicious. There is probably a memory leak I have to investigate.

In the table just below, I have a longer list of object types allocating memory on the heap. The first three are selected and for that reason are shown in the graph above. Here, I can inspect other object types and select them to see their memory usage over time. It looks like the three I already selected are the ones most at risk of being affected by a memory leak.

Available Now
These new features are available today in all regions where Amazon CodeGuru is offered. For more information, please see the AWS Regional Services table.

There are no pricing changes for Python support, security detectors, and memory profiling. You pay for what you use without upfront fees or commitments.

Learn more about Amazon CodeGuru and start using these new features today to improve the code quality of your applications.  

Danilo

New – SaaS Lens in AWS Well-Architected Tool

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-saas-lens-in-aws-well-architected-tool/

To help you build secure, high-performing, resilient, and efficient solutions on AWS, in 2015 we publicly launched the AWS Well-Architected Framework. It started as a single whitepaper but has expanded to include domain-specific lenses, hands-on labs, and the AWS Well-Architected Tool (available at no cost in the AWS Management Console) that provides a mechanism for regularly evaluating your workloads, identifying high risk issues, and recording your improvements.

To offer more workload-specific advice, in 2017 we extended the framework with the concept of “lens” to go beyond a general perspective and enter specific technology domains. Now, to help accelerate building Software-as-a-Service (SaaS) solutions, the AWS SaaS Factory team has led an effort to build a new AWS Well-Architected SaaS Lens.

SaaS is a licensing and delivery model by which software is centrally managed and hosted by a provider and available to customers on a subscription basis. In this way, software providers can innovate rapidly, optimize their costs, and gain operational efficiencies. At the same time, customers benefit from simplified IT management, speed, and a pay-for-what-you-use business model.

The Well-Architected SaaS Lens adds questions to the tool that are tailored to SaaS workloads and intended to drive critical thinking for developing and operating SaaS workloads. Each question has a list of best practices, and each best practice has a list of improvement plans to help guide you in implementing them. AWS Solution Architects from the AWS SaaS Factory Program, having worked with thousands of software developers and AWS Partners, view these well-architected patterns as a key component of building and operating a SaaS architecture on AWS.

Using the SaaS Lens in the Well-Architected Tool
In the Well-Architected Tool console, I start by defining my workload. Today, I’m reviewing a pre-production environment of a SaaS application. It’s just a minimum viable product (MVP) version of what I want to build, with just enough features to be usable and get early feedback.

Now, I can choose which lenses to apply. The AWS Well-Architected Framework is there by default. I select the SaaS Lens. This adds a set of additional questions that help me understand how to design, deploy, and architect my SaaS workload following the framework best practices. Other lenses are available in the tool, for example the Serverless Lens described here.

Now, I start my review. Many questions in the SaaS Lens are focused on how you are managing a multi-tenant application. This is the first question for the Operational Excellence pillar. I can also add some notes to explain my answer better or take note of what I want to improve.

I don’t need to answer all questions to start improving my SaaS application. For example, this is the improvement plan based on my answer to the previous question. For each point here, I can click and get more information on how to implement that on AWS.

Moving to the Reliability pillar, I feel more confident because of the techniques I used to separate individual tenants of my SaaS application in their own “sandbox” environment.

As I expected, no risks are detected this time!

When I finish reviewing the SaaS Lens for my workload, I get an overview of the detected risks. Here, I can also save a milestone that I can use later to compare my status and estimate my improvements.

Just below that, I get a suggestion on what to focus on next. Again, I can click and get in-depth suggestions on how to mitigate the risk.

As often happens in IT services, this is an iterative process. The AWS Well-Architected Tool helps quantify the risks and gives me a path to follow to continuously improve my SaaS application.

Available Now
The SaaS Lens is available today in all regions where the AWS Well-Architected Tool is offered, as described in the AWS Regional Services List. It can be applied to existing workloads, or used for new workloads you define in the tool.

There are no costs in using the AWS Well-Architected Tool; you can use it to improve the application you are working on, or to get visibility into multiple workloads used by the department or area you are working with.

Learn more about the new SaaS Lens and get started today with the AWS Well-Architected Tool!

Danilo

New for AWS Lambda – Container Image Support

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-aws-lambda-container-image-support/

With AWS Lambda, you upload your code and run it without thinking about servers. Many customers enjoy the way this works, but if you’ve invested in container tooling for your development workflows, it’s not easy to use the same approach to build applications using Lambda.

To help you with that, you can now package and deploy Lambda functions as container images of up to 10 GB in size. In this way, you can also easily build and deploy larger workloads that rely on sizable dependencies, such as machine learning or data intensive workloads. Just like functions packaged as ZIP archives, functions deployed as container images benefit from the same operational simplicity, automatic scaling, high availability, and native integrations with many services.

We are providing base images for all the supported Lambda runtimes (Python, Node.js, Java, .NET, Go, Ruby) so that you can easily add your code and dependencies. We also have base images for custom runtimes based on Amazon Linux that you can extend to include your own runtime implementing the Lambda Runtime API.

You can deploy your own arbitrary base images to Lambda, for example images based on Alpine or Debian Linux. To work with Lambda, these images must implement the Lambda Runtime API. To make it easier to build your own base images, we are releasing Lambda Runtime Interface Clients implementing the Runtime API for all supported runtimes. These implementations are available via native package managers, so that you can easily pick them up in your images, and are being shared with the community using an open source license.

We are also releasing as open source a Lambda Runtime Interface Emulator that enables you to perform local testing of the container image and check that it will run when deployed to Lambda. The Lambda Runtime Interface Emulator is included in all AWS-provided base images and can be used with arbitrary images as well.

Your container images can also use the Lambda Extensions API to integrate monitoring, security and other tools with the Lambda execution environment.

To deploy a container image, you select one from an Amazon Elastic Container Registry repository. Let’s see how this works in practice with a couple of examples, first using an AWS-provided image for Node.js, and then building a custom image for Python.

Using the AWS-Provided Base Image for Node.js
Here’s the code (app.js) for a simple Node.js Lambda function generating a PDF file using the PDFKit module. Each time it is invoked, it creates a new mail containing random data generated by the faker.js module. The output of the function uses the Amazon API Gateway response syntax to return the PDF file.

const PDFDocument = require('pdfkit');
const faker = require('faker');
const getStream = require('get-stream');

exports.lambdaHandler = async (event) => {

    const doc = new PDFDocument();

    const randomName = faker.name.findName();

    doc.text(randomName, { align: 'right' });
    doc.text(faker.address.streetAddress(), { align: 'right' });
    doc.text(faker.address.secondaryAddress(), { align: 'right' });
    doc.text(faker.address.zipCode() + ' ' + faker.address.city(), { align: 'right' });
    doc.moveDown();
    doc.text('Dear ' + randomName + ',');
    doc.moveDown();
    for(let i = 0; i < 3; i++) {
        doc.text(faker.lorem.paragraph());
        doc.moveDown();
    }
    doc.text(faker.name.findName(), { align: 'right' });
    doc.end();

    const pdfBuffer = await getStream.buffer(doc);
    const pdfBase64 = pdfBuffer.toString('base64');

    const response = {
        statusCode: 200,
        headers: {
            'Content-Length': Buffer.byteLength(pdfBase64),
            'Content-Type': 'application/pdf',
            'Content-disposition': 'attachment;filename=test.pdf'
        },
        isBase64Encoded: true,
        body: pdfBase64
    };
    return response;
};

I use npm to initialize the package and add the three dependencies I need in the package.json file. In this way, I also create the package-lock.json file. I am going to add it to the container image to have a more predictable result.

$ npm init
$ npm install pdfkit
$ npm install faker
$ npm install get-stream

Now, I create a Dockerfile to create the container image for my Lambda function, starting from the AWS provided base image for the nodejs12.x runtime:

FROM amazon/aws-lambda-nodejs:12
COPY app.js package*.json ./
RUN npm install
CMD [ "app.lambdaHandler" ]

The Dockerfile is adding the source code (app.js) and the files describing the package and the dependencies (package.json and package-lock.json) to the base image. Then, I run npm to install the dependencies. I set the CMD to the function handler, but this could also be done later as a parameter override when configuring the Lambda function.

I use the Docker CLI to build the random-letter container image locally:

$ docker build -t random-letter .

To check if this is working, I start the container image locally using the Lambda Runtime Interface Emulator:

$ docker run -p 9000:8080 random-letter:latest

Now, I test a function invocation with cURL. Here, I am passing an empty JSON payload.

$ curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'

If there are errors, I can fix them locally. When it works, I move to the next step.

To upload the container image, I create a new ECR repository in my account and tag the local image to push it to ECR. To help me identify software vulnerabilities in my container images, I enable ECR image scanning.

$ aws ecr create-repository --repository-name random-letter --image-scanning-configuration scanOnPush=true
$ docker tag random-letter:latest 123412341234.dkr.ecr.sa-east-1.amazonaws.com/random-letter:latest
$ aws ecr get-login-password | docker login --username AWS --password-stdin 123412341234.dkr.ecr.sa-east-1.amazonaws.com
$ docker push 123412341234.dkr.ecr.sa-east-1.amazonaws.com/random-letter:latest

Here I am using the AWS Management Console to complete the creation of the function. You can also use the AWS Serverless Application Model, which has been updated to add support for container images.

In the Lambda console, I click on Create function. I select Container image, give the function a name, and then Browse images to look for the right image in my ECR repositories.

Screenshot of the console.

After I select the repository, I use the latest image I uploaded. When I select the image, Lambda translates that to the underlying image digest (shown on the right of the tag in the image below). You can see the digests of your images locally with the docker images --digests command. In this way, the function keeps using the same image even if the latest tag is later reassigned to a newer image, and you are protected from unintentional deployments. You can update the image to use in the function code. Updating the function configuration has no impact on the image used, even if the tag was reassigned to another image in the meantime.

Screenshot of the console.

Optionally, I can override some of the container image values. I am not doing this now, but in this way I can create images that can be used for different functions, for example by overriding the function handler in the CMD value.

Screenshot of the console.

I leave all other options to their default and select Create function.
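
If you prefer to script the function creation instead of using the console, a minimal sketch with the AWS SDK for Python (boto3) could look like this; the execution role ARN is an assumption, and the image URI reuses the placeholder account ID from the push commands above:

import boto3

lambda_client = boto3.client("lambda")

response = lambda_client.create_function(
    FunctionName="random-letter",
    PackageType="Image",
    Code={"ImageUri": "123412341234.dkr.ecr.sa-east-1.amazonaws.com/random-letter:latest"},
    Role="arn:aws:iam::123412341234:role/random-letter-execution-role",  # hypothetical role
    MemorySize=1024,
    Timeout=30,
    # Optional overrides of the container image configuration, as in the console:
    # ImageConfig={"Command": ["app.lambdaHandler"]},
)
print(response["State"])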

When creating or updating the code of a function, the Lambda platform optimizes new and updated container images to prepare them to receive invocations. This optimization takes a few seconds or minutes, depending on the size of the image. After that, the function is ready to be invoked. I test the function in the console.

Screenshot of the console.

It’s working! Now let’s add the API Gateway as a trigger. I select Add Trigger and add the API Gateway using an HTTP API. For simplicity, I leave the authentication of the API open.

Screenshot of the console.

Now, I click on the API endpoint a few times and download a few random mails.

Screenshot of the console.

It works as expected! Here are a few of the PDF files that are generated with random data from the faker.js module.

Output of the sample application.

 

Building a Custom Image for Python
Sometimes you need to use your custom container images, for example to follow your company guidelines or to use a runtime version that we don’t support.

In this case, I want to build an image to use Python 3.9. The code (app.py) of my function is very simple: I just want to say hello and report the version of Python being used.

import sys
def handler(event, context): 
    return 'Hello from AWS Lambda using Python' + sys.version + '!'

As I mentioned before, we are sharing with you open source implementations of the Lambda Runtime Interface Clients (which implement the Runtime API) for all the supported runtimes. In this case, I start with a Python image based on Alpine Linux. Then, I add the Lambda Runtime Interface Client for Python (link coming soon) to the image. Here’s the Dockerfile:

# Define global args
ARG FUNCTION_DIR="/home/app/"
ARG RUNTIME_VERSION="3.9"
ARG DISTRO_VERSION="3.12"

# Stage 1 - bundle base image + runtime
# Grab a fresh copy of the image and install GCC
FROM python:${RUNTIME_VERSION}-alpine${DISTRO_VERSION} AS python-alpine
# Install GCC (Alpine uses musl but we compile and link dependencies with GCC)
RUN apk add --no-cache \
    libstdc++

# Stage 2 - build function and dependencies
FROM python-alpine AS build-image
# Install aws-lambda-cpp build dependencies
RUN apk add --no-cache \
    build-base \
    libtool \
    autoconf \
    automake \
    libexecinfo-dev \
    make \
    cmake \
    libcurl
# Include global args in this stage of the build
ARG FUNCTION_DIR
ARG RUNTIME_VERSION
# Create function directory
RUN mkdir -p ${FUNCTION_DIR}
# Copy handler function
COPY app/* ${FUNCTION_DIR}
# Optional – Install the function's dependencies
# RUN python${RUNTIME_VERSION} -m pip install -r requirements.txt --target ${FUNCTION_DIR}
# Install Lambda Runtime Interface Client for Python
RUN python${RUNTIME_VERSION} -m pip install awslambdaric --target ${FUNCTION_DIR}

# Stage 3 - final runtime image
# Grab a fresh copy of the Python image
FROM python-alpine
# Include global arg in this stage of the build
ARG FUNCTION_DIR
# Set working directory to function root directory
WORKDIR ${FUNCTION_DIR}
# Copy in the built dependencies
COPY --from=build-image ${FUNCTION_DIR} ${FUNCTION_DIR}
# (Optional) Add Lambda Runtime Interface Emulator and use a script in the ENTRYPOINT for simpler local runs
ADD https://github.com/aws/aws-lambda-runtime-interface-emulator/releases/latest/download/aws-lambda-rie /usr/bin/aws-lambda-rie
RUN chmod 755 /usr/bin/aws-lambda-rie
COPY entry.sh /
ENTRYPOINT [ "/entry.sh" ]
CMD [ "app.handler" ]

The Dockerfile this time is more elaborate, building the final image in three stages, following the Docker best practice of multi-stage builds. You can use this three-stage approach to build your own custom images:

  • Stage 1 is building the base image with the runtime, Python 3.9 in this case, plus GCC that we use to compile and link dependencies in stage 2.
  • Stage 2 is installing the Lambda Runtime Interface Client and building function and dependencies.
  • Stage 3 is creating the final image adding the output from stage 2 to the base image built in stage 1. Here I am also adding the Lambda Runtime Interface Emulator, but this is optional, see below.

I create the entry.sh script below to use it as ENTRYPOINT. It executes the Lambda Runtime Interface Client for Python. If the execution is local, the Runtime Interface Client is wrapped by the Lambda Runtime Interface Emulator.

#!/bin/sh
if [ -z "${AWS_LAMBDA_RUNTIME_API}" ]; then
    exec /usr/bin/aws-lambda-rie /usr/local/bin/python -m awslambdaric
else
    exec /usr/local/bin/python -m awslambdaric
fi

Now, I can use the Lambda Runtime Interface Emulator to check locally if the function and the container image are working correctly:

$ docker run -p 9000:8080 lambda/python:3.9-alpine3.12

Not Including the Lambda Runtime Interface Emulator in the Container Image
It’s optional to add the Lambda Runtime Interface Emulator to a custom container image. If I don’t include it, I can test locally by installing the Lambda Runtime Interface Emulator in my local machine following these steps:

  • In Stage 3 of the Dockerfile, I remove the commands copying the Lambda Runtime Interface Emulator (aws-lambda-rie) and the entry.sh script. I don’t need the entry.sh script in this case.
  • I use this ENTRYPOINT to start by default the Lambda Runtime Interface Client:
    ENTRYPOINT [ "/usr/local/bin/python", "-m", "awslambdaric" ]
  • I run these commands to install the Lambda Runtime Interface Emulator in my local machine, for example under ~/.aws-lambda-rie:
mkdir -p ~/.aws-lambda-rie
curl -Lo ~/.aws-lambda-rie/aws-lambda-rie https://github.com/aws/aws-lambda-runtime-interface-emulator/releases/latest/download/aws-lambda-rie
chmod +x ~/.aws-lambda-rie/aws-lambda-rie

When the Lambda Runtime Interface Emulator is installed on my local machine, I can mount it when starting the container. The command to start the container locally now is (assuming the Lambda Runtime Interface Emulator is at ~/.aws-lambda-rie):

docker run -d -v ~/.aws-lambda-rie:/aws-lambda -p 9000:8080 \
       --entrypoint /aws-lambda/aws-lambda-rie lambda/python:3.9-alpine3.12 \
       /lambda-entrypoint.sh app.handler

Testing the Custom Image for Python
Either way, when the container is running locally, I can test a function invocation with cURL:

curl -XPOST "http://localhost:9000/2015-03-31/functions/function/invocations" -d '{}'

The output is what I am expecting!

"Hello from AWS Lambda using Python3.9.0 (default, Oct 22 2020, 05:03:39) \n[GCC 9.3.0]!"

I push the image to ECR and create the function as before. Here’s my test in the console:

Screenshot of the console.

My custom container image based on Alpine is running Python 3.9 on Lambda!

Available Now
You can use container images to deploy your Lambda functions today in US East (N. Virginia), US East (Ohio), US West (Oregon), Asia Pacific (Tokyo), Asia Pacific (Singapore), Europe (Ireland), Europe (Frankfurt), South America (São Paulo). We are working to add support in more Regions soon. The container image support is offered in addition to ZIP archives and we will continue to support the ZIP packaging format.

There are no additional costs to use this feature. You pay for the ECR repository and the usual Lambda pricing.

You can use container image support in AWS Lambda with the console, AWS Command Line Interface (CLI), AWS SDKs, AWS Serverless Application Model, and solutions from AWS Partners, including Aqua Security, Datadog, Epsagon, HashiCorp Terraform, Honeycomb, Lumigo, Pulumi, Stackery, Sumo Logic, and Thundra.

This new capability opens up new scenarios, simplifies the integration with your development pipeline, and makes it easier to use custom images and your favorite programming platforms to build serverless applications.

Learn more and start using container images with AWS Lambda.

Danilo

New for AWS Lambda – Functions with Up to 10 GB of Memory and 6 vCPUs

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-aws-lambda-functions-with-up-to-10-gb-of-memory-and-6-vcpus/

AWS Lambda runs your code on a highly available and scalable compute infrastructure so that you can focus on what you want to build. Do you want to get the advantages of Lambda for workloads that are memory or computationally intensive? Wait no more!

Starting today, you can allocate up to 10 GB of memory to a Lambda function. This is more than a 3x increase compared to previous limits. Lambda allocates CPU and other resources linearly in proportion to the amount of memory configured. That means you can now have access to up to 6 vCPUs in each execution environment. In this way, your multithreaded and multiprocess applications run faster. Since Lambda charges are proportional to memory configured and function duration (GB-seconds), the additional costs for using more memory may be offset by lower duration. I have more on this in the example below.

With more memory and CPU power, and support for the AVX2 instruction set, new use cases — such as machine learning applications; batch and extract, transform, load (ETL) jobs; modelling; genomics; gaming; high-performance computing (HPC); and media processing — become easier to implement and scale with Lambda functions.

Let’s see how this works in practice!

Lambda Function Performance as Memory Increases
When I first wrote about the capability of mounting a shared Amazon Elastic File System (EFS) for Lambda functions, one of the examples I used was a function doing machine learning inference to classify images of birds. The function is using PyTorch to run the inference, applying a pre-trained machine learning model.

Now, I can execute the same function in the updated Lambda execution environment. Let’s see how increasing memory affects the duration of the function. Here are the results of using memory configurations between 1 and 10 GB. To get these numbers, I ran 20 invocations for each memory configuration. Then, I computed the average duration, discarding function initializations. To avoid possible outliers, I also excluded from the average the top and bottom 10% of reported durations. Based on the results, I estimated the charges I would have for 1 million invocations with each configuration.
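
For reference, this is roughly the post-processing described above, expressed as a small Python sketch; the sample durations are hypothetical and only illustrate the trimming step:

# Average the measured durations after discarding the top and bottom 10%.
def trimmed_average(durations_ms, trim_ratio=0.10):
    values = sorted(durations_ms)
    k = int(len(values) * trim_ratio)
    trimmed = values[k:len(values) - k] if k > 0 else values
    return sum(trimmed) / len(trimmed)

# 20 hypothetical invocation durations (in ms) for a single memory configuration
samples = [3660, 3702, 3689, 3655, 3741, 3698, 3677, 3703, 3651, 3690,
           3712, 3684, 3667, 3705, 3693, 3676, 3721, 3659, 3688, 3697]
print(round(trimmed_average(samples), 1))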

Graph showing Function Duration and Charges for 1M Invocations as Memory Increases

As you can see, the function is able to use the additional CPU power that comes with more memory, decreasing the duration of the invocations. What is interesting is the impact of increasing memory on my costs.

Lambda charges are related to memory and duration, so if I increase memory and this is reducing duration by the same proportion, the overall charges are about the same. For example, looking at the graph above, when I configure 5 GB of memory, I have the same costs as when I have 1 GB of memory (about $61 for one million invocations), but the function is 5x faster. If I need lower latency, I can increase memory up to 10 GB, where the function is 7.6x faster and I pay a little more ($80 for one million invocations).

Depending on your code and business case, you can find out which memory configuration gives the optimal trade-off between cost and performance. To help you with that, my colleague and friend Alex Casalboni started the AWS Lambda Power Tuning project to help you optimize your Lambda functions in a data-driven way. This open source tool is really useful and has been improved by the support of many contributors. Give it a try!

In my tests, PyTorch is also using the optimizations of the Advanced Vector Extensions 2 (AVX2) instruction set, now available in the Lambda execution environment. With the AVX2 instruction set, the processor allows running a certain set of operations simultaneously. This is extremely beneficial for applications with operations that can run in parallel such as matrix multiplication. As a result, using AVX2 can improve performance by increasing CPU throughput per cycle. This typically helps compute intensive workloads such as machine learning inference, multimedia processing, scientific simulations, and financial modeling applications.
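
As a rough illustration of why vectorization helps, a matrix multiplication is made of many independent multiply-accumulate operations that vector instructions can batch together. The snippet below uses NumPy purely as an example; whether AVX2 is actually used depends on how the underlying math library was built:

import time
import numpy as np

# Two hypothetical 1024x1024 matrices; the product below consists of many
# independent multiply-accumulate operations that vectorized kernels can batch.
a = np.random.rand(1024, 1024).astype(np.float32)
b = np.random.rand(1024, 1024).astype(np.float32)

start = time.perf_counter()
c = a @ b
print(f"matmul took {time.perf_counter() - start:.4f}s, checksum {c.sum():.1f}")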

Available Now
AWS Lambda support for larger functions is available in Africa (Cape Town), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), EU (Frankfurt), EU (Ireland), EU (London), EU (Milan), EU (Paris), EU (Stockholm), South America (São Paulo), US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon).

You can configure up to 10 GB of memory for new or existing Lambda functions using the AWS Management Console, AWS Command Line Interface (CLI), AWS SDKs, and Serverless Application Model.

Here’s a snapshot of the new console experience. We replaced the slider with a field, and you can now configure memory in 1 MB increments (it was 64 MB increments before). In this way, the console works similarly to the Lambda API, which has always accepted memory configurations with 1 MB granularity.

There is no change in Lambda pricing: you pay for requests and usage, with duration and Provisioned Concurrency charged at a rate proportional to the amount of memory configured.

Start using Lambda functions with up to 10 GB of memory and 6 vCPUs today.

Danilo

New for AWS Lambda – 1ms Billing Granularity Adds Cost Savings

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-aws-lambda-1ms-billing-granularity-adds-cost-savings/

What I like about AWS Lambda is that it lets you run code without provisioning or managing servers, and you pay only for what you use. Since we launched Lambda in 2014, you have been charged for the number of times your code is triggered (requests) and for the time your code executes, rounded up to the nearest 100ms (duration).

Starting today, we are rounding up duration to the nearest millisecond with no minimum execution time.

With this new pricing, you are going to pay less most of the time, but it’s going to be more noticeable when you have functions whose execution time is much lower than 100ms, such as low latency APIs.

For example, let’s look at a simple web app that I have running. In the Amazon CloudWatch Logs, for each invocation there is a REPORT line. To improve readability, I am breaking the REPORT line into multiple lines here:

REPORT RequestId: 35a7e0cb-4902-490d-b8d3-eb315dded660
Duration: 27.40 ms  Billed Duration: 100 ms Memory Size: 1024 MB  Max Memory Used: 472 MB

With 1ms billing granularity that becomes:

REPORT RequestId: a24d03b5-429d-4ca3-a490-878a52a0182f
Duration: 27.55 ms  Billed Duration: 28 ms Memory Size: 1024 MB  Max Memory Used: 472 MB

My application doesn’t have a lot of traffic, so let’s do a simple production scenario. Let’s say I have 100,000 users for a web/mobile app. I expect each user to call this function via the web/mobile app about 20 times per day. The duration of those invocations is on average 28ms. Each month, I should expect:

  • 100,000 users * 20 invocations * 30 days = 60 million invocations.

Let’s estimate the costs in US East (N. Virginia). For simplicity, I am not considering the Lambda free tier.

The Lambda monthly request charges are unchanged:

  • 60 million invocations * $0.20 per 1M requests = $12

To that, I have to add compute charges based on duration.

The Lambda monthly compute charges with the old 100ms rounded up pricing would have been:

  • 60 million invocations * 100 ms * 1 GB memory * $0.0000166667 for every GB-second = $100

With the new 1ms billing granularity, the duration costs are:

  • 60 million invocations * 28 ms * 1 GB memory * $0.0000166667 for every GB-second = $28

For this scenario, overall costs including request and compute charges are much cheaper ($40) than before ($112).
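
If you want to verify the arithmetic, here is a small script that reproduces the estimates above, using the same prices and durations:

# Reproduce the monthly cost estimate above for 100 ms vs 1 ms billing granularity.
invocations = 100_000 * 20 * 30            # 60 million invocations per month
request_price = 0.20 / 1_000_000           # $ per request
gb_second_price = 0.0000166667             # $ per GB-second
memory_gb = 1.0

request_charges = invocations * request_price                      # $12

old_compute = invocations * 0.100 * memory_gb * gb_second_price    # 100 ms rounding -> ~$100
new_compute = invocations * 0.028 * memory_gb * gb_second_price    # 28 ms billed    -> ~$28

print(round(request_charges + old_compute))   # ~112
print(round(request_charges + new_compute))   # ~40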

With this pricing, there is now more of an incentive to optimize the duration of functions even if it is already well below 100ms. Your engineering efforts can reduce costs even more.

If you increase memory to get more CPU power and speed up your functions, you now get the benefit of a lower billed duration below 100ms as well. That means that increasing performance and reducing latency is going to be cheaper than before.

We are applying 1ms billing granularity for duration, including when you have Provisioned Concurrency enabled, in all AWS Regions with the exception of those based in China starting with the December 2020 billing period. Regions in China will get the change from January.

Enjoy the new pricing!

Danilo

Coming Soon – EC2 C6gn Instances – 100 Gbps Networking with AWS Graviton2 Processors

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/coming-soon-ec2-c6gn-instances-100-gbps-networking-with-aws-graviton2-processors/

Based on the amazing feedback from customers such as Snap, NextRoll, Intuit, SmugMug, and Honeycomb, who are running their workloads on Amazon Elastic Compute Cloud (EC2) instances powered by AWS Graviton2, today we are announcing an addition to our broad Arm-based Graviton2 portfolio: C6gn instances. They deliver up to 100 Gbps network bandwidth, up to 38 Gbps Amazon Elastic Block Store (EBS) bandwidth, up to 40% higher packet processing performance, and up to 40% better price/performance versus comparable current generation x86-based network optimized instances.

Compared to C6g instances, this new instance type provides 4x higher network bandwidth, 4x higher packet processing performance, and 2x higher EBS bandwidth. This means that customers with workloads that need high networking bandwidth such as high performance computing (HPC), network appliance, real-time video communications, and data analytics, will be able to bring their biggest and most challenging applications to Arm and take advantage of the performance and cost-optimization.

C6gn instances will be available in 8 sizes:

| Name | vCPUs | Memory (GiB) | Network Bandwidth (Gbps) | EBS Throughput (Gbps) |
|---|---|---|---|---|
| c6gn.medium | 1 | 2 | Up to 25 | Up to 9.5 |
| c6gn.large | 2 | 4 | Up to 25 | Up to 9.5 |
| c6gn.xlarge | 4 | 8 | Up to 25 | Up to 9.5 |
| c6gn.2xlarge | 8 | 16 | Up to 25 | Up to 9.5 |
| c6gn.4xlarge | 16 | 32 | 25 | 9.5 |
| c6gn.8xlarge | 32 | 64 | 50 | 19 |
| c6gn.12xlarge | 48 | 96 | 75 | 28.5 |
| c6gn.16xlarge | 64 | 128 | 100 | 38 |

The new instances are built on the AWS Nitro System, a collection of AWS-designed hardware and software innovations that maximize resource efficiency. C6gn instances support Elastic Fabric Adapter (EFA) on the c6gn.16xlarge sizes for workloads that can take advantage of lower network latency (such as HPC and video processing) and use Message Passing Interface (MPI) for highly scalable clusters. These new instances also fully support network frameworks like Data Plane Development Kit (DPDK), making it easier to migrate network appliance workloads.

Coming Soon
EC2 C6gn instances will be available later this month and make it easier to optimize costs for HPC and workloads that require high network bandwidth and low latency. Let me know what you are going to build with them!

To get practice with the AWS Graviton2 architecture, you can try t4g.micro instances for free for up to 750 hours per month until March 31st, 2021.

Learn more about EC2 C6gn instances today.

Danilo

Introducing Amazon Managed Workflows for Apache Airflow (MWAA)

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/introducing-amazon-managed-workflows-for-apache-airflow-mwaa/

As the volume and complexity of your data processing pipelines increase, you can simplify the overall process by decomposing it into a series of smaller tasks and coordinate the execution of these tasks as part of a workflow. To do so, many developers and data engineers use Apache Airflow, a platform created by the community to programmatically author, schedule, and monitor workflows. With Airflow you can manage workflows as scripts, monitor them via the user interface (UI), and extend their functionality through a set of powerful plugins. However, manually installing, maintaining, and scaling Airflow, and at the same time handling security, authentication, and authorization for its users takes much of the time you’d rather use to focus on solving actual business problems.

For these reasons, I am happy to announce the availability of Amazon Managed Workflows for Apache Airflow (MWAA), a fully managed service that makes it easy to run open-source versions of Apache Airflow on AWS, and to build workflows to execute your extract-transform-load (ETL) jobs and data pipelines.

Airflow workflows retrieve input from sources like Amazon Simple Storage Service (S3) using Amazon Athena queries, perform transformations on Amazon EMR clusters, and can use the resulting data to train machine learning models on Amazon SageMaker. Workflows in Airflow are authored as Directed Acyclic Graphs (DAGs) using the Python programming language.

A key benefit of Airflow is its open extensibility through plugins which allows you to create tasks that interact with AWS or on-premises resources required for your workflows including AWS Batch, Amazon CloudWatch, Amazon DynamoDB, AWS DataSync, Amazon ECS and AWS Fargate, Amazon Elastic Kubernetes Service (EKS), Amazon Kinesis Firehose, AWS Glue, AWS Lambda, Amazon Redshift, Amazon Simple Queue Service (SQS), and Amazon Simple Notification Service (SNS).

To improve observability, Airflow metrics can be published as CloudWatch Metrics, and logs can be sent to CloudWatch Logs. Amazon MWAA provides automatic minor version upgrades and patches by default, with an option to designate a maintenance window in which these upgrades are performed.

You can use Amazon MWAA with these three steps:

  1. Create an environment – Each environment contains your Airflow cluster, including your scheduler, workers, and web server.
  2. Upload your DAGs and plugins to S3 – Amazon MWAA loads the code into Airflow automatically (see the sketch after this list).
  3. Run your DAGs in Airflow – Run your DAGs from the Airflow UI or command line interface (CLI) and monitor your environment with CloudWatch.
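
To automate step 2, you can copy the files to the S3 bucket used by the environment with the AWS SDK for Python (boto3). This is a minimal sketch; the bucket name and key paths are assumptions that should match your environment configuration:

import boto3

s3 = boto3.client("s3")

# Hypothetical bucket; MWAA requires the name to start with "airflow-".
bucket = "airflow-my-environment-bucket"

# Upload the DAG file to the configured DAGs folder; Amazon MWAA loads it automatically.
s3.upload_file("movie_list_dag.py", bucket, "dags/movie_list_dag.py")

# Optionally upload the requirements and plugins referenced by the environment.
s3.upload_file("requirements.txt", bucket, "requirements.txt")
s3.upload_file("plugins.zip", bucket, "plugins.zip")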

Let’s see how this works in practice!

How to Create an Airflow Environment Using Amazon MWAA
In the Amazon MWAA console, I click on Create environment. I give the environment a name and select the Airflow version to use.

Then, I select the S3 bucket and the folder to load my DAG code. The bucket name must start with airflow-.

Optionally, I can specify a plugins file and a requirements file:

  • The plugins file is a ZIP file containing the plugins used by my DAGs.
  • The requirements file describes the Python dependencies to run my DAGs.

For plugins and requirements, I can select the S3 object version to use. In case the plugins or the requirements I use create a non-recoverable error in my environment, Amazon MWAA will automatically roll back to the previous working version.


I click Next to configure the advanced settings, starting with networking. Each environment runs in an Amazon Virtual Private Cloud using private subnets in two availability zones. Web server access to the Airflow UI is always protected by a secure login using AWS Identity and Access Management (IAM). However, you can choose to have web server access on a public network so that you can log in over the Internet, or on a private network in your VPC. For simplicity, I select a Public network. I let Amazon MWAA create a new security group with the correct inbound and outbound rules. Optionally, I can add one or more existing security groups to fine-tune control of inbound and outbound traffic for my environment.

Now, I configure my environment class. Each environment includes a scheduler, a web server, and a worker. Workers automatically scale up and down according to my workload. We provide you a suggestion on which class to use based on the number of DAGs, but you can monitor the load on your environment and modify its class at any time.

Encryption is always enabled for data at rest, and while I can select a customized key managed by AWS Key Management Service (KMS) I will instead keep the default key that AWS owns and manages on my behalf.

For monitoring, I publish environment performance to CloudWatch Metrics. This is enabled by default, but I can disable CloudWatch Metrics after launch. For the logs, I can specify the log level and which Airflow components should send their logs to CloudWatch Logs. I leave the default to send only the task logs and use log level INFO.

I can modify the default settings for Airflow configuration options, such as default_task_retries or worker_concurrency. For now, I am not changing these values.

Finally, but most importantly, I configure the permissions that will be used by my environment to access my DAGs, write logs, and run DAGs accessing other AWS resources. I select Create a new role and click on Create environment. After a few minutes, the new Airflow environment is ready to be used.

Using the Airflow UI
In the Amazon MWAA console, I look for the new environment I just created and click on Open Airflow UI. A new browser window is created and I am authenticated with a secure login via AWS IAM.

There, I look for a DAG that I put on S3 in the movie_list_dag.py file. The DAG is downloading the MovieLens dataset, processing the files on S3 using Amazon Athena, and loading the result to a Redshift cluster, creating the table if missing.

Here’s the full source code of the DAG:

from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.operators import HttpSensor, S3KeySensor
from airflow.contrib.operators.aws_athena_operator import AWSAthenaOperator
from airflow.utils.dates import days_ago
from datetime import datetime, timedelta
from io import StringIO
from io import BytesIO
from time import sleep
import csv
import requests
import json
import boto3
import zipfile
import io
s3_bucket_name = 'my-bucket'
s3_key='files/'
redshift_cluster='redshift-cluster-1'
redshift_db='dev'
redshift_dbuser='awsuser'
redshift_table_name='movie_demo'
test_http='https://grouplens.org/datasets/movielens/latest/'
download_http='http://files.grouplens.org/datasets/movielens/ml-latest-small.zip'
athena_db='demo_athena_db'
athena_results='athena-results/'
create_athena_movie_table_query="""
CREATE EXTERNAL TABLE IF NOT EXISTS Demo_Athena_DB.ML_Latest_Small_Movies (
  `movieId` int,
  `title` string,
  `genres` string 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://pinwheeldemo1-pinwheeldagsbucketfeed0594-1bks69fq0utz/files/ml-latest-small/movies.csv/ml-latest-small/'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'skip.header.line.count'='1'
); 
"""
create_athena_ratings_table_query="""
CREATE EXTERNAL TABLE IF NOT EXISTS Demo_Athena_DB.ML_Latest_Small_Ratings (
  `userId` int,
  `movieId` int,
  `rating` int,
  `timestamp` bigint 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://pinwheeldemo1-pinwheeldagsbucketfeed0594-1bks69fq0utz/files/ml-latest-small/ratings.csv/ml-latest-small/'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'skip.header.line.count'='1'
); 
"""
create_athena_tags_table_query="""
CREATE EXTERNAL TABLE IF NOT EXISTS Demo_Athena_DB.ML_Latest_Small_Tags (
  `userId` int,
  `movieId` int,
  `tag` int,
  `timestamp` bigint 
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'
WITH SERDEPROPERTIES (
  'serialization.format' = ',',
  'field.delim' = ','
) LOCATION 's3://pinwheeldemo1-pinwheeldagsbucketfeed0594-1bks69fq0utz/files/ml-latest-small/tags.csv/ml-latest-small/'
TBLPROPERTIES (
  'has_encrypted_data'='false',
  'skip.header.line.count'='1'
); 
"""
join_tables_athena_query="""
SELECT REPLACE ( m.title , '"' , '' ) as title, r.rating
FROM demo_athena_db.ML_Latest_Small_Movies m
INNER JOIN (SELECT rating, movieId FROM demo_athena_db.ML_Latest_Small_Ratings WHERE rating > 4) r on m.movieId = r.movieId
"""
def download_zip():
    s3c = boto3.client('s3')
    indata = requests.get(download_http)
    n=0
    with zipfile.ZipFile(io.BytesIO(indata.content)) as z:       
        zList=z.namelist()
        print(zList)
        for i in zList: 
            print(i) 
            zfiledata = BytesIO(z.read(i))
            n += 1
            s3c.put_object(Bucket=s3_bucket_name, Key=s3_key+i+'/'+i, Body=zfiledata)
def clean_up_csv_fn(**kwargs):    
    ti = kwargs['task_instance']
    queryId = ti.xcom_pull(key='return_value', task_ids='join_athena_tables' )
    print(queryId)
    athenaKey=athena_results+"join_athena_tables/"+queryId+".csv"
    print(athenaKey)
    cleanKey=athena_results+"join_athena_tables/"+queryId+"_clean.csv"
    s3c = boto3.client('s3')
    obj = s3c.get_object(Bucket=s3_bucket_name, Key=athenaKey)
    infileStr=obj['Body'].read().decode('utf-8')
    outfileStr=infileStr.replace('"e"', '') 
    outfile = StringIO(outfileStr)
    s3c.put_object(Bucket=s3_bucket_name, Key=cleanKey, Body=outfile.getvalue())
def s3_to_redshift(**kwargs):    
    ti = kwargs['task_instance']
    queryId = ti.xcom_pull(key='return_value', task_ids='join_athena_tables' )
    print(queryId)
    athenaKey='s3://'+s3_bucket_name+"/"+athena_results+"join_athena_tables/"+queryId+"_clean.csv"
    print(athenaKey)
    sqlQuery="copy "+redshift_table_name+" from '"+athenaKey+"' iam_role 'arn:aws:iam::163919838948:role/myRedshiftRole' CSV IGNOREHEADER 1;"
    print(sqlQuery)
    rsd = boto3.client('redshift-data')
    resp = rsd.execute_statement(
        ClusterIdentifier=redshift_cluster,
        Database=redshift_db,
        DbUser=redshift_dbuser,
        Sql=sqlQuery
    )
    print(resp)
    return "OK"
def create_redshift_table():
    rsd = boto3.client('redshift-data')
    resp = rsd.execute_statement(
        ClusterIdentifier=redshift_cluster,
        Database=redshift_db,
        DbUser=redshift_dbuser,
        Sql="CREATE TABLE IF NOT EXISTS "+redshift_table_name+" (title	character varying, rating	int);"
    )
    print(resp)
    return "OK"
DEFAULT_ARGS = {
    'owner': 'airflow',
    'depends_on_past': False,
    'email': ['[email protected]'],
    'email_on_failure': False,
    'email_on_retry': False 
}
with DAG(
    dag_id='movie-list-dag',
    default_args=DEFAULT_ARGS,
    dagrun_timeout=timedelta(hours=2),
    start_date=days_ago(2),
    schedule_interval='*/10 * * * *',
    tags=['athena','redshift'],
) as dag:
    check_s3_for_key = S3KeySensor(
        task_id='check_s3_for_key',
        bucket_key=s3_key,
        wildcard_match=True,
        bucket_name=s3_bucket_name,
        s3_conn_id='aws_default',
        timeout=20,
        poke_interval=5,
        dag=dag
    )
    files_to_s3 = PythonOperator(
        task_id="files_to_s3",
        python_callable=download_zip
    )
    create_athena_movie_table = AWSAthenaOperator(task_id="create_athena_movie_table",query=create_athena_movie_table_query, database=athena_db, output_location='s3://'+s3_bucket_name+"/"+athena_results+'create_athena_movie_table')
    create_athena_ratings_table = AWSAthenaOperator(task_id="create_athena_ratings_table",query=create_athena_ratings_table_query, database=athena_db, output_location='s3://'+s3_bucket_name+"/"+athena_results+'create_athena_ratings_table')
    create_athena_tags_table = AWSAthenaOperator(task_id="create_athena_tags_table",query=create_athena_tags_table_query, database=athena_db, output_location='s3://'+s3_bucket_name+"/"+athena_results+'create_athena_tags_table')
    join_athena_tables = AWSAthenaOperator(task_id="join_athena_tables",query=join_tables_athena_query, database=athena_db, output_location='s3://'+s3_bucket_name+"/"+athena_results+'join_athena_tables')
    create_redshift_table_if_not_exists = PythonOperator(
        task_id="create_redshift_table_if_not_exists",
        python_callable=create_redshift_table
    )
    clean_up_csv = PythonOperator(
        task_id="clean_up_csv",
        python_callable=clean_up_csv_fn,
        provide_context=True     
    )
    transfer_to_redshift = PythonOperator(
        task_id="transfer_to_redshift",
        python_callable=s3_to_redshift,
        provide_context=True     
    )
    check_s3_for_key >> files_to_s3 >> create_athena_movie_table >> join_athena_tables >> clean_up_csv >> transfer_to_redshift
    files_to_s3 >> create_athena_ratings_table >> join_athena_tables
    files_to_s3 >> create_athena_tags_table >> join_athena_tables
    files_to_s3 >> create_redshift_table_if_not_exists >> transfer_to_redshift

In the code, different tasks are created using operators like PythonOperator, for generic Python code, or AWSAthenaOperator, to use the integration with Amazon Athena. To see how those tasks are connected in the workflow, you can look at the last few lines, which I repeat here (without indentation) for simplicity:

check_s3_for_key >> files_to_s3 >> create_athena_movie_table >> join_athena_tables >> clean_up_csv >> transfer_to_redshift
files_to_s3 >> create_athena_ratings_table >> join_athena_tables
files_to_s3 >> create_athena_tags_table >> join_athena_tables
files_to_s3 >> create_redshift_table_if_not_exists >> transfer_to_redshift

The Airflow code is overloading the right shift >> operator in Python to create a dependency, meaning that the task on the left should be executed first, and the output passed to the task on the right. Looking at the code, this is quite easy to read. Each of the four lines above is adding dependencies, and they are all evaluated together to execute the tasks in the right order.
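
To see why this syntax works, here is a tiny, self-contained illustration (not Airflow’s actual implementation) of how overloading __rshift__ lets the >> operator record dependencies between tasks:

# Minimal illustration of building a dependency graph with the >> operator.
class Task:
    def __init__(self, name):
        self.name = name
        self.downstream = []

    def __rshift__(self, other):
        # "self >> other" records that other runs after self, and returns other
        # so that chains like a >> b >> c read left to right.
        self.downstream.append(other)
        return other

extract, transform, load = Task("extract"), Task("transform"), Task("load")
extract >> transform >> load
print([t.name for t in extract.downstream])    # ['transform']
print([t.name for t in transform.downstream])  # ['load']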

In the Airflow console, I can see a graph view of the DAG to have a clear representation of how tasks are executed:

Available Now
Amazon Managed Workflows for Apache Airflow (MWAA) is available today in US East (N. Virginia), US West (Oregon), US East (Ohio), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Sydney), Europe (Ireland), Europe (Frankfurt), and Europe (Stockholm). You can launch a new Amazon MWAA environment from the console, AWS Command Line Interface (CLI), or AWS SDKs. Then, you can develop workflows in Python using Airflow’s ecosystem of integrations.

With Amazon MWAA, you pay based on the environment class and the workers you use. For more information, see the pricing page.

Upstream compatibility is a core tenet of Amazon MWAA. Our code changes to the Apache Airflow platform are released back to open source.

With Amazon MWAA you can spend more time building workflows for your engineering and data science tasks, and less time managing and scaling the infrastructure of your Airflow platform.

Learn more about Amazon MWAA and get started today!

Danilo

Announcing AWS Glue DataBrew – A Visual Data Preparation Tool That Helps You Clean and Normalize Data Faster

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/announcing-aws-glue-databrew-a-visual-data-preparation-tool-that-helps-you-clean-and-normalize-data-faster/

To be able to run analytics, build reports, or apply machine learning, you need to be sure the data you’re using is clean and in the right format. That’s the data preparation step that requires data analysts and data scientists to write custom code and do many manual activities. First, you need to look at the data, understand which possible values are present, and build some simple visualizations to understand if there are correlations between the columns. Then, you need to check for strange values outside of what you’re expecting, such as weather temperature above 200℉ (93℃) or speed of a truck above 200 mph (322 km/h), or for data that is missing. Many algorithms need values to be rescaled to a specific range, for example between 0 and 1, or normalized around the mean. Text fields need to be set to a standard format, and may require advanced transformations such as stemming.

That’s a lot of work. For this reason, I am happy to announce that today AWS Glue DataBrew is available, a visual data preparation tool that helps you clean and normalize data up to 80% faster so you can focus more on the business value you can get.

DataBrew provides a visual interface that quickly connects to your data stored in Amazon Simple Storage Service (S3), Amazon Redshift, Amazon Relational Database Service (RDS), any JDBC accessible data store, or data indexed by the AWS Glue Data Catalog. You can then explore the data, look for patterns, and apply transformations. For example, you can apply joins and pivots, merge different data sets, or use functions to manipulate data.

Once your data is ready, you can immediately use it with AWS and third-party services to gain further insights, such as Amazon SageMaker for machine learning, Amazon Redshift and Amazon Athena for analytics, and Amazon QuickSight and Tableau for business intelligence.

How AWS Glue DataBrew Works
To prepare your data with DataBrew, you follow these steps:

  • Connect one or more datasets from S3 or the Glue data catalog (S3, Redshift, RDS). You can also upload a local file to S3 from the DataBrew console. CSV, JSON, Parquet, and .XLSX formats are supported.
  • Create a project to visually explore, understand, combine, clean, and normalize data in a dataset. You can merge or join multiple datasets. From the console, you can quickly spot anomalies in your data with value distributions, histograms, box plots, and other visualizations.
  • Generate a rich data profile for your dataset with over 40 statistics by running a job in the profile view.
  • When you select a column, you get recommendations on how to improve data quality.
  • You can clean and normalize data using more than 250 built-in transformations. For example, you can remove or replace null values, or create encodings. Each transformation is automatically added as a step to build a recipe.
  • You can then save, publish, and version recipes, and automate the data preparation tasks by applying recipes on all incoming data. To apply recipes to or generate profiles for large datasets, you can run jobs.
  • At any point in time, you can visually track and explore how datasets are linked to projects, recipes, and job runs. In this way, you can understand how data flows and what changes were made. This information is called data lineage and can help you find the root cause in case of errors in your output.
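
The console drives all of these steps, but they can also be scripted with the AWS SDK. Here is a minimal boto3 sketch, assuming a CSV file already stored in S3 and an already published recipe; the bucket, role ARN, and resource names are placeholders.

import boto3

databrew = boto3.client("databrew")

# Register a CSV file in S3 as a DataBrew dataset (names and bucket are placeholders)
databrew.create_dataset(
    Name="Comments",
    Input={"S3InputDefinition": {"Bucket": "my-databrew-bucket", "Key": "input/comments.csv"}},
)

# Create a job that applies a published recipe to the dataset and writes Parquet output
databrew.create_recipe_job(
    Name="comments-prep-job",
    DatasetName="Comments",
    RecipeReference={"Name": "Comments-recipe", "RecipeVersion": "1.0"},
    RoleArn="arn:aws:iam::123412341234:role/DataBrewRole",
    Outputs=[{"Format": "PARQUET", "Location": {"Bucket": "my-databrew-bucket", "Key": "output/"}}],
)

# Run the job on the full dataset
databrew.start_job_run(Name="comments-prep-job")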

Let’s see how this works with a quick demo!

Preparing a Sample Dataset with AWS Glue DataBrew
In the DataBrew console, I select the Projects tab and then Create project. I name the new project Comments. A new recipe is also created and will be automatically updated with the data transformations that I will apply next.

I choose to work on a New dataset and name it Comments.

Here, I select Upload file and in the next dialog I upload a comments.csv file I prepared for this demo. In a production use case, you would probably connect an existing source on S3 or in the Glue Data Catalog. For this demo, I specify the S3 destination for storing the uploaded file. I leave Encryption disabled.

The comments.csv file is very small, but will help show some common data preparation needs and how to complete them quickly with DataBrew. The format of the file is comma-separated values (CSV). The first line contains the name of the columns. Then, each line contains a text comment and a numerical rating made by a customer (customer_id) about an item (item_id). Each item is part of a category. For each text comment, there is an indication of the overall sentiment (comment_sentiment). Optionally, when giving the comment, customers can enable a flag to ask to be contacted for further support (support_needed).

Here’s the content of the comments.csv file:

customer_id,item_id,category,rating,comment,comment_sentiment,support_needed
234,2345,"Electronics;Computer", 5,"I love this!",Positive,False
321,5432,"Home;Furniture",1,"I can't make this work... Help, please!!!",negative,true
123,3245,"Electronics;Photography",3,"It works. But I'd like to do more",,True
543,2345,"Electronics;Computer",4,"Very nice, it's going well",Positive,False
786,4536,"Home;Kitchen",5,"I really love it!",positive,false
567,5432,"Home;Furniture",1,"I doesn't work :-(",negative,true
897,4536,"Home;Kitchen",3,"It seems OK...",,True
476,3245,"Electronics;Photography",4,"Let me say this is nice!",positive,false

In the Access permissions, I select an AWS Identity and Access Management (IAM) role that gives DataBrew read permissions to my input S3 bucket. Only roles where DataBrew is the service principal for the trust policy are shown in the DataBrew console. To create one in the IAM console, select DataBrew as the trusted entity.

If the dataset is big, you can use Sampling to limit the number of rows to use in the project. These rows can be selected at the beginning, at the end, or randomly through the data. You are going to use projects to create recipes, and then jobs to apply recipes to all the data. Depending on your dataset, you may not need access to all the rows to define the data preparation recipe.

Optionally, you can use Tagging to manage, search, or filter resources you create with AWS Glue DataBrew.

The project is now being prepared and in a few minutes I can start exploring my dataset.

In the Grid view, the default when I create a new project, I see the data as it has been imported. For each column, there is a summary of the range of values that have been found. For numerical columns, the statistical distribution is given.

In the Schema view, I can drill down on the schema that has been inferred, and optionally hide some of the columns.

In the Profile view, I can run a data profile job to examine and collect statistical summaries about the data. This is an assessment in terms of structure, content, relationships, and derivation. For a large dataset, this is very useful to understand the data. For this small example the benefits are limited, but I run it nonetheless, sending the output of the profile job to a different folder in the same S3 bucket I use to store the source data.

When the profile job has succeeded, I can see a summary of the rows and columns in my dataset, how many columns and rows are valid, and correlations between columns.

Here, if I select a column, for example rating, I can drill down into specific statistical information and correlations for that column.

Now, let’s do some actual data preparation. In the Grid view, I look at the columns. The category contains two pieces of information, separated by a semicolon. For example, the category of the first row is “Electronics;Computer.” I select the category column, then click on the column actions (the three small dots on the right of the column name) and there I have access to many transformations that I can apply to the column. In this case, I select to split the column on a single delimiter. Before applying the changes, I quickly preview them in the console.

I use the semicolon as the delimiter, and now I have two columns, category_1 and category_2. I use the column actions again to rename them to category and subcategory. Now, for the first row, category contains Electronics and subcategory contains Computer. All these changes are added as steps to the project recipe, so that I’ll be able to apply them to similar data.

The rating column contains values between 1 and 5. For many algorithms, I prefer to have this kind of value normalized. In the column actions, I use min-max normalization to rescale the values between 0 and 1. More advanced techniques are available, such as mean or Z-score normalization. A new rating_normalized column is added.
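
For reference, min-max normalization rescales each value x to (x − min) / (max − min). A rough pandas equivalent of what this DataBrew step produces, assuming the comments.csv file shown above, is:

import pandas as pd

df = pd.read_csv("comments.csv")

# Rescale ratings from the 1-5 range to the 0-1 range
rating_min, rating_max = df["rating"].min(), df["rating"].max()
df["rating_normalized"] = (df["rating"] - rating_min) / (rating_max - rating_min)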

I look into the recommendations that DataBrew gives for the comment column. Since it’s text, the suggestion is to use a standard case format, such as lowercase, capital case, or sentence case. I select lowercase.

The comments contain free text written by customers. To simplify further analytics, I use word tokenization on the column to remove stop words (such as “a,” “an,” “the”), expand contractions (so that “don’t” becomes “do not”), and apply stemming. The destination for these changes is a new column, comment_tokenized.

I still have some special characters in the comment_tokenized column, such as an emoticon :-). In the column actions, I select to clean and remove special characters.

I look into the recommendations for the comment_sentiment column. There are some missing values. I decide to fill the missing values with a neutral sentiment. Now, I still have values written with a different case, so I follow the recommendation to use lowercase for this column.

The comment_sentiment column now contains three different values (positive, negative, or neutral), but many algorithms prefer to have one-hot encoding, where there is a column for each of the possible values, and these columns contain 1, if that is the original value, or 0 otherwise. I select the Encode icon in the menu bar and then One-hot encode column. I leave the defaults and apply. Three new columns for the three possible values are added.
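
To make the effect of these last two steps concrete, here is a rough pandas equivalent of filling the missing sentiment, lowercasing it, and one-hot encoding the result; the column names DataBrew generates may differ from the ones used here.

import pandas as pd

df = pd.read_csv("comments.csv")

# Fill missing sentiment with "neutral" and normalize case
df["comment_sentiment"] = df["comment_sentiment"].fillna("neutral").str.lower()

# One-hot encode: one 0/1 column per sentiment value
one_hot = pd.get_dummies(df["comment_sentiment"], prefix="comment_sentiment")
df = pd.concat([df, one_hot], axis=1)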

The support_needed column is recognized as boolean, and its values are automatically formatted to a standard format. I don’t have to do anything here.

The recipe for my dataset is now ready to be published and can be used in a recurring job processing similar data. I didn’t have a lot of data, but the recipe can be used with much larger datasets.

In the recipe, you can find a list of all the transformations that I just applied. When running a recipe job, output data is available in S3 and ready to be used with analytics and machine learning platforms, or to build reports and visualization with BI tools. The output can be written in a different format than the input, for example using a columnar storage format like Apache Parquet.

Available Now

AWS Glue DataBrew is available today in US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (Frankfurt), Asia Pacific (Tokyo), and Asia Pacific (Sydney).

It’s never been easier to prepare your data for analytics, machine learning, or BI. In this way, you can really focus on getting the right insights for your business instead of writing custom code that you then have to maintain and update.

To practice with DataBrew, you can create a new project and select one of the sample datasets that are provided. That’s a great way to understand all the available features and how you can apply them to your data.

Learn more and get started with AWS Glue DataBrew today.

Danilo

New – Archive and Replay Events with Amazon EventBridge

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-archive-and-replay-events-with-amazon-eventbridge/

Event-driven architectures use events to share information between the components of one or more applications. Events tell us that “something has happened”, maybe you received an API request, a file has been uploaded to a storage platform, or a database record has been updated. Business events describe something related to your activities, for example that […]

New – Application Load Balancer Support for End-to-End HTTP/2 and gRPC

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-application-load-balancer-support-for-end-to-end-http-2-and-grpc/

Thanks to its efficiency and support for numerous programming languages, gRPC is a popular choice for microservice integrations and client-server communications. gRPC is a high performance remote procedure call (RPC) framework using HTTP/2 for transport and Protocol Buffers to describe the interface. To make it easier to use gRPC with your applications, Application Load Balancer (ALB) […]

Introducing Amazon SNS FIFO – First-In-First-Out Pub/Sub Messaging

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/introducing-amazon-sns-fifo-first-in-first-out-pub-sub-messaging/

When designing a distributed software architecture, it is important to define how services exchange information. For example, the use of asynchronous communication decouples components and simplifies scaling, reducing the impact of changes and making it easier to release new features.

The two most common forms of asynchronous service-to-service communication are message queues and publish/subscribe messaging:

  • With message queues, messages are stored on the queue until they are processed and deleted by a consumer. On AWS, Amazon Simple Queue Service (SQS) provides a fully managed message queuing service with no administrative overhead.
  • With pub/sub messaging, a message published to a topic is delivered to all subscribers to the topic. On AWS, Amazon Simple Notification Service (SNS) is a fully managed pub/sub messaging service that enables message delivery to a large number of subscribers. Each subscriber can also set a filter policy to receive only the messages that it cares about.

You can use topics when you want to fan out messages to multiple applications, and queues when you want to send messages to one application. Using topics and queues together, you can decouple microservices, distributed systems, and serverless applications.

With SQS, you can use FIFO (First-In-First-Out) queues to preserve the order in which messages are sent and received, and to prevent a message from being processed more than once.

Introducing SNS FIFO Topics
Today, we are adding similar capabilities for pub/sub messaging with the introduction of SNS FIFO topics, providing strict message ordering and deduplicated message delivery to one or more subscribers.

FIFO topics manage ordering and deduplication similar to FIFO queues:

Ordering – You configure a message group by including a message group ID when publishing a message to a FIFO topic. For each message group ID, all messages are sent and delivered in order of their arrival. For example, to ensure the delivery of messages related to the same customer in order, you can publish these messages to the topic using the customer’s account number as the message group ID. There is no limit on the number of message groups with FIFO topics and queues. You don’t need to declare the message group ID in advance; any value will work. If you don’t have a logical distinction between messages, you can simply use the same message group ID for all of them and have a single group of ordered messages. The message group ID is passed to any subscribed FIFO queue.

Deduplication – Distributed systems (like SNS) and client applications sometimes generate duplicate messages. You can avoid duplicate message deliveries from the topic in two ways: either by enabling content-based deduplication on the topic, or by adding a deduplication ID to the messages that you publish. With content-based deduplication, SNS uses a SHA-256 hash of the message body to generate the message deduplication ID. After a message with a specific deduplication ID is published successfully, there is a 5-minute interval during which any message with the same deduplication ID is accepted but not delivered. If you subscribe a FIFO queue to a FIFO topic, the deduplication ID is passed to the queue, and SQS uses it to avoid delivering duplicate messages.
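
For example, here is a minimal boto3 sketch that publishes an update to a FIFO topic using the customer’s account number as the message group ID. The topic ARN is a placeholder, and the explicit deduplication ID is only needed if content-based deduplication is not enabled on the topic.

import boto3

sns = boto3.client("sns")

sns.publish(
    TopicArn="arn:aws:sns:us-east-2:123412341234:updates.fifo",  # placeholder ARN
    Message='{"customer": "0123456789", "update": "new shipping address"}',
    MessageGroupId="0123456789",            # ordering is preserved within this group
    MessageDeduplicationId="update-00042",  # optional when content-based deduplication is enabled
)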

You can use FIFO topics and queues together to simplify the implementation of applications where the order of operations and events is critical, or when you cannot tolerate duplicates. For example, to process financial operations and inventory updates, or to asynchronously apply commands that you receive from a client device. FIFO queues can use message filtering in FIFO topics to selectively receive only a subset of messages rather than every message published to the topic.

How to Use SNS FIFO Topics
A common scenario where FIFO topics can help is when you receive updates that need to be processed in order. For example, I can use a FIFO topic to receive updates from an application where my customers edit their account profiles. Then, I subscribe an SQS FIFO queue to the FIFO topic, and use the queue as a trigger for a Lambda function that applies the account updates to an Amazon DynamoDB table used by my Customer management system that needs to be kept in sync.

The decoupling introduced by the FIFO topic makes it easier to add new functionality with minimal impact to existing applications. For example, to reward my loyal customers with additional promotions, I add a new Loyalty application that is storing information in a relational database managed by Amazon Aurora. To keep the customer’s information stored in the Loyalty database in sync with my other applications, I can subscribe a new FIFO queue to the same FIFO topic, and add a new Lambda function that receives customer updates in the same order as they are generated, and applies them to the Loyalty database. In this way, I don’t need to change code and configuration of other applications to integrate the new Loyalty app.

First, I create two FIFO queues in the SQS console, leaving all options to their defaults:

  • The customer.fifo queue to process updates in my Customer management system.
  • The loyalty.fifo queue to help me collect and store customer updates for the Loyalty application.

In the SNS console, I create the updates.fifo topic. I select FIFO as type, and enable Content-based message deduplication.

Then, I subscribe the customer.fifo and loyalty.fifo queues to the topic.

To be able to receive messages, I add a statement to the access policy of both queues granting the updates.fifo topic permissions to send messages to the queues. For example, for the customer.fifo queue the statement is:

{
  "Effect": "Allow",
  "Principal": {
    "Service": "sns.amazonaws.com"
  },
  "Action": "SQS:SendMessage",
  "Resource": "arn:aws:sqs:us-east-2:123412341234:customer.fifo",
  "Condition": {
    "ArnLike": {
      "aws:SourceArn": "arn:aws:sns:us-east-2:123412341234:updates.fifo"
    }
  }
}

Now, I use the SNS console to publish 4 messages in sequence. For all messages, I use the same message group ID. In this way, they are all in the same message group. The only part that is different is the message body, where I use in order:

  • Update One
  • Update Two
  • Update Three
  • Update One

In the SQS console, I see that only 3 messages have been delivered to the FIFO queues:

Why is that? When I created the FIFO topic, I enabled content-based deduplication. The 4 messages were sent within the 5-minute deduplication window. The last message has been recognized as a duplicate of the first one and has not been delivered to the subscribed queues.

Let’s see the actual messages in the queues. I use the AWS Command Line Interface (CLI) to receive the messages from SQS, and the jq command-line JSON processor to format the output and get only the Message in the Body.

Here are the messages in the customer.fifo queue:

$ aws sqs receive-message --queue-url https://sqs.us-east-2.amazonaws.com/123412341234/customer.fifo --max-number-of-messages 10 | jq '.Messages[].Body | fromjson | .Message'

"Update One"
"Update Two"
"Update Three"

And these are the messages in the loyalty.fifo queue:

$ aws sqs receive-message --queue-url https://sqs.us-east-2.amazonaws.com/123412341234/loyalty.fifo --max-number-of-messages 10 | jq '.Messages[].Body | fromjson | .Message'

"Update One"
"Update Two"
"Update Three"

As expected, the 3 messages with unique content have been delivered to both queues in the same order as they were sent.

Available Now
You can use SNS FIFO topics in all commercial regions. You can process up to 300 transactions per second (TPS) per FIFO topic or FIFO queue. With SNS, you pay only for what you use; you can find more information on the pricing page.

To learn more, please see the documentation.

Danilo

Store and Access Time Series Data at Any Scale with Amazon Timestream – Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/store-and-access-time-series-data-at-any-scale-with-amazon-timestream-now-generally-available/

Time series are a very common data format that describes how things change over time. Some of the most common sources are industrial machines and IoT devices, IT infrastructure stacks (such as hardware, software, and networking components), and applications that share their results over time. Managing time series data efficiently is not easy because the data model doesn’t fit general-purpose databases.

For this reason, I am happy to share that Amazon Timestream is now generally available. Timestream is a fast, scalable, and serverless time series database service that makes it easy to collect, store, and process trillions of time series events per day up to 1,000 times faster and at as little as 1/10th the cost of a relational database.

This is made possible by the way Timestream is managing data: recent data is kept in memory and historical data is moved to cost-optimized storage based on a retention policy you define. All data is always automatically replicated across multiple availability zones (AZ) in the same AWS region. New data is written to the memory store, where data is replicated across three AZs before returning success of the operation. Data replication is quorum based such that the loss of nodes, or an entire AZ, does not disrupt durability or availability. In addition, data in the memory store is continuously backed up to Amazon Simple Storage Service (S3) as an extra precaution.

Queries automatically access and combine recent and historical data across tiers without the need to specify the storage location, and support time series-specific functionalities to help you identify trends and patterns in data in near real time.

There are no upfront costs, you pay only for the data you write, store, or query. Based on the load, Timestream automatically scales up or down to adjust capacity, without the need to manage the underlying infrastructure.

Timestream integrates with popular services for data collection, visualization, and machine learning, making it easy to use with existing and new applications. For example, you can ingest data directly from AWS IoT Core, Amazon Kinesis Data Analytics for Apache Flink, AWS IoT Greengrass, and Amazon MSK. You can visualize data stored in Timestream from Amazon QuickSight, and use Amazon SageMaker to apply machine learning algorithms to time series data, for example for anomaly detection. You can use Timestream fine-grained AWS Identity and Access Management (IAM) permissions to easily ingest or query data from an AWS Lambda function. We are providing the tools to use Timestream with open source platforms such as Apache Kafka, Telegraf, Prometheus, and Grafana.

Using Amazon Timestream from the Console
In the Timestream console, I select Create database. I can choose to create a Standard database or a Sample database populated with sample data. I proceed with a standard database and I name it MyDatabase.

All Timestream data is encrypted by default. I use the default master key, but you can use a customer managed key that you created using AWS Key Management Service (KMS). In that way, you can control the rotation of the master key, and who has permissions to use or manage it.

I complete the creation of the database. Now my database is empty. I select Create table and name it MyTable.

Each table has its own data retention policy. First, data is ingested in the memory store, where it can be stored from a minimum of one hour to a maximum of a year. After that, it is automatically moved to the magnetic store, where it can be kept from a minimum of one day to a maximum of 200 years, after which it is deleted. In my case, I select 1 hour of memory store retention and 5 years of magnetic store retention.
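
The same retention configuration can be applied programmatically. Here is a minimal sketch with the boto3 timestream-write client, using the database and table names from this walkthrough (5 years ≈ 1,825 days):

import boto3

timestream_write = boto3.client("timestream-write")

timestream_write.create_table(
    DatabaseName="MyDatabase",
    TableName="MyTable",
    RetentionProperties={
        "MemoryStoreRetentionPeriodInHours": 1,      # recent data kept in the memory store
        "MagneticStoreRetentionPeriodInDays": 1825,  # ~5 years in the magnetic store
    },
)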

When writing data in Timestream, you cannot insert data that is older than the retention period of the memory store. For example, in my case I will not be able to insert records older than 1 hour. Similarly, you cannot insert data with a future timestamp.

I complete the creation of the table. As you noticed, I was not asked for a data schema. Timestream will automatically infer that as data is ingested. Now, let’s put some data in the table!

Loading Data in Amazon Timestream
Each record in a Timestream table is a single data point in the time series and contains:

  • The measure name, type, and value. Each record can contain a single measure, but different measure names and types can be stored in the same table.
  • The timestamp of when the measure was collected, with nanosecond granularity.
  • Zero or more dimensions that describe the measure and can be used to filter or aggregate data. Records in a table can have different dimensions.

For example, let’s build a simple monitoring application collecting CPU, memory, swap, and disk usage from a server. Each server is identified by a hostname and has a location expressed as a country and a city.

In this case, the dimensions would be the same for all records:

  • country
  • city
  • hostname

Records in the table are going to measure different things. The measure names I use are:

  • cpu_utilization
  • memory_utilization
  • swap_utilization
  • disk_utilization

Measure type is DOUBLE for all of them.

For the monitoring application, I am using Python. To collect monitoring information I use the psutil module that I can install with:

pip3 install psutil

Here’s the code for the collect.py application:

import time
import boto3
import psutil

from botocore.config import Config

DATABASE_NAME = "MyDatabase"
TABLE_NAME = "MyTable"

COUNTRY = "UK"
CITY = "London"
HOSTNAME = "MyHostname" # You can make it dynamic using socket.gethostname()

INTERVAL = 1 # Seconds

def prepare_record(measure_name, measure_value):
    record = {
        'Time': str(current_time),
        'Dimensions': dimensions,
        'MeasureName': measure_name,
        'MeasureValue': str(measure_value),
        'MeasureValueType': 'DOUBLE'
    }
    return record


def write_records(records):
    try:
        result = write_client.write_records(DatabaseName=DATABASE_NAME,
                                            TableName=TABLE_NAME,
                                            Records=records,
                                            CommonAttributes={})
        status = result['ResponseMetadata']['HTTPStatusCode']
        print("Processed %d records. WriteRecords Status: %s" %
              (len(records), status))
    except Exception as err:
        print("Error:", err)


if __name__ == '__main__':

    session = boto3.Session()
    write_client = session.client('timestream-write', config=Config(
        read_timeout=20, max_pool_connections=5000, retries={'max_attempts': 10}))
    query_client = session.client('timestream-query')

    dimensions = [
        {'Name': 'country', 'Value': COUNTRY},
        {'Name': 'city', 'Value': CITY},
        {'Name': 'hostname', 'Value': HOSTNAME},
    ]

    records = []

    while True:

        current_time = int(time.time() * 1000)
        cpu_utilization = psutil.cpu_percent()
        memory_utilization = psutil.virtual_memory().percent
        swap_utilization = psutil.swap_memory().percent
        disk_utilization = psutil.disk_usage('/').percent

        records.append(prepare_record('cpu_utilization', cpu_utilization))
        records.append(prepare_record(
            'memory_utilization', memory_utilization))
        records.append(prepare_record('swap_utilization', swap_utilization))
        records.append(prepare_record('disk_utilization', disk_utilization))

        print("records {} - cpu {} - memory {} - swap {} - disk {}".format(
            len(records), cpu_utilization, memory_utilization,
            swap_utilization, disk_utilization))

        if len(records) == 100:
            write_records(records)
            records = []

        time.sleep(INTERVAL)

I start the collect.py application. Every 100 records, data is written to the MyTable table:

$ python3 collect.py
records 4 - cpu 31.6 - memory 65.3 - swap 73.8 - disk 5.7
records 8 - cpu 18.3 - memory 64.9 - swap 73.8 - disk 5.7
records 12 - cpu 15.1 - memory 64.8 - swap 73.8 - disk 5.7
. . .
records 96 - cpu 44.1 - memory 64.2 - swap 73.8 - disk 5.7
records 100 - cpu 46.8 - memory 64.1 - swap 73.8 - disk 5.7
Processed 100 records. WriteRecords Status: 200
records 4 - cpu 36.3 - memory 64.1 - swap 73.8 - disk 5.7
records 8 - cpu 31.7 - memory 64.1 - swap 73.8 - disk 5.7
records 12 - cpu 38.8 - memory 64.1 - swap 73.8 - disk 5.7
. . .

Now, in the Timestream console, I see the schema of the MyTable table, automatically updated based on the data ingested:

Note that, since all measures in the table are of type DOUBLE, the measure_value::double column contains the value for all of them. If the measures were of different types (for example, INT or BIGINT), I would have more columns (such as measure_value::int and measure_value::bigint).

In the console, I can also see a recap of which kinds of measures I have in the table, their corresponding data type, and the dimensions used for that specific measure:

Querying Data from the Console
I can query time series data using SQL. The memory store is optimized for fast point-in-time queries, while the magnetic store is optimized for fast analytical queries. However, queries automatically process data on all stores (memory and magnetic) without having to specify the data location in the query.

I am running queries straight from the console, but I can also use JDBC connectivity to access the query engine. I start with a basic query to see the most recent records in the table:

SELECT * FROM MyDatabase.MyTable ORDER BY time DESC LIMIT 8

Let’s try something a little more complex. I want to see the average CPU utilization aggregated by hostname in 5-minute intervals for the last two hours. I filter records based on the content of measure_name. I use the function bin() to round time to a multiple of an interval size, and the function ago() to compare timestamps:

SELECT hostname,
       bin(time, 5m) as binned_time,
       avg(measure_value::double) as avg_cpu_utilization
  FROM MyDatabase.MyTable
 WHERE measure_name = 'cpu_utilization'
   AND time > ago(2h)
 GROUP BY hostname, bin(time, 5m)
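
The same query can be run programmatically with the timestream-query client (the one that collect.py already creates but doesn’t use). A minimal sketch, printing the scalar values of each returned row and paginating with NextToken:

import boto3

query_client = boto3.client("timestream-query")

query = """
SELECT hostname,
       bin(time, 5m) AS binned_time,
       avg(measure_value::double) AS avg_cpu_utilization
  FROM MyDatabase.MyTable
 WHERE measure_name = 'cpu_utilization'
   AND time > ago(2h)
 GROUP BY hostname, bin(time, 5m)
"""

next_token = None
while True:
    kwargs = {"QueryString": query}
    if next_token:
        kwargs["NextToken"] = next_token
    response = query_client.query(**kwargs)
    for row in response["Rows"]:
        # Each row is a list of data points; print their scalar values
        print([datum.get("ScalarValue") for datum in row["Data"]])
    next_token = response.get("NextToken")
    if not next_token:
        break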

When collecting time series data you may miss some values. This is quite common especially for distributed architectures and IoT devices. Timestream has some interesting functions that you can use to fill in the missing values, for example using linear interpolation, or based on the last observation carried forward.

More generally, Timestream offers many functions that help you to use mathematical expressions, manipulate strings, arrays, and date/time values, use regular expressions, and work with aggregations/windows.

To experience what you can do with Timestream, you can create a sample database and add the two IoT and DevOps datasets that we provide. Then, in the console query interface, look at the sample queries to get a glimpse of some of the more advanced functionalities:

Using Amazon Timestream with Grafana
One of the most interesting aspects of Timestream is the integration with many platforms. For example, you can visualize your time series data and create alerts using Grafana 7.1 or higher. The Timestream plugin is part of the open source edition of Grafana.

I add a new GrafanaDemo table to my database, and use another sample application to continuously ingest data. The application simulates performance data collected from a microservice architecture running on thousands of hosts.

I install Grafana on an Amazon Elastic Compute Cloud (EC2) instance and add the Timestream plugin using the Grafana CLI.

$ grafana-cli plugins install grafana-timestream-datasource

I use SSH Port Forwarding to access the Grafana console from my laptop:

$ ssh -L 3000:<EC2-Public-DNS>:3000 -N -f ec2-user@<EC2-Public-DNS>

In the Grafana console, I configure the plugin with the right AWS credentials, and the Timestream database and table. Now, I can select the sample dashboard, distributed as part of the Timestream plugin, using data from the GrafanaDemo table where performance data is continuously collected:

Available Now
Amazon Timestream is available today in US East (N. Virginia), Europe (Ireland), US West (Oregon), and US East (Ohio). You can use Timestream with the console, the AWS Command Line Interface (CLI), AWS SDKs, and AWS CloudFormation. With Timestream, you pay based on the number of writes, the data scanned by the queries, and the storage used. For more information, please see the pricing page.

You can find more sample applications in this repo. To learn more, please see the documentation. It’s never been easier to work with time series, including data ingestion, retention, access, and storage tiering. Let me know what you are going to build!

Danilo

New EC2 T4g Instances – Burstable Performance Powered by AWS Graviton2 – Try Them for Free

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-t4g-instances-burstable-performance-powered-by-aws-graviton2/

Two years ago Amazon Elastic Compute Cloud (EC2) T3 instances were first made available, offering a very cost effective way to run general purpose workloads. While current T3 instances offer sufficient compute performance for many use cases, many customers have told us that they have additional workloads that would benefit from increased peak performance and lower cost.

Today, we are launching T4g instances, a new generation of low-cost burstable instances powered by AWS Graviton2, a processor custom built by AWS using 64-bit Arm Neoverse cores. Using T4g instances you can enjoy a performance benefit of up to 40% at a 20% lower cost in comparison to T3 instances, providing the best price/performance for a broader spectrum of workloads.

T4g instances are designed for applications that don’t use CPU at full power most of the time, using the same credit model as T3 instances with unlimited mode enabled by default. Examples of production workloads that require high CPU performance only during times of heavy data processing are web/application servers, small/medium data stores, and many microservices. Compared to previous generations, the performance of T4g instances makes it possible to migrate additional workloads such as caching servers, search engine indexing, and e-commerce platforms.

T4g instances are available in 7 sizes providing up to 5 Gbps of network and up to 2.7 Gbps of Amazon Elastic Block Store (EBS) performance:

Name          vCPUs   Baseline Performance/vCPU   CPU Credits Earned/Hour   Memory
t4g.nano      2       5%                          6                         0.5 GiB
t4g.micro     2       10%                         12                        1 GiB
t4g.small     2       20%                         24                        2 GiB
t4g.medium    2       20%                         24                        4 GiB
t4g.large     2       30%                         36                        8 GiB
t4g.xlarge    4       40%                         96                        16 GiB
t4g.2xlarge   8       40%                         192                       32 GiB

Free Trial
To make it easier to develop, test, and run your applications on T4g instances, all AWS customers are automatically enrolled in a free trial on the t4g.micro size. From September 2020 until December 31, 2020, you can run a t4g.micro instance and automatically get 750 free hours per month deducted from your bill, including any CPU credits during the free 750 hours of usage. The 750 hours are calculated in aggregate across all regions. For details on the terms and conditions of the free trial, please refer to the EC2 FAQs.

During the free trial, have a look at this getting started guide on using the Arm-based AWS Graviton processors. There, you can find suggestions on how to build and optimize your applications, using different programming languages and operating systems, and on managing container-based workloads. Some of the tips are specific to the Graviton processor, but most of the content works generally for anyone using Arm to run their code.

Using T4g Instances
You can start an EC2 instance in different ways, for example using the EC2 console, the AWS Command Line Interface (CLI), AWS SDKs, or AWS CloudFormation. For my first T4g instance, I use the AWS CLI:

$ aws ec2 run-instances \
  --instance-type t4g.micro \
  --image-id ami-09a67037138f86e67 \
  --security-groups MySecurityGroup \
  --key-name my-key-pair

The Amazon Machine Image (AMI) I am using is based on Amazon Linux 2. Other platforms are available, such as Ubuntu 18.04 or newer, Red Hat Enterprise Linux 8.0 and newer, and SUSE Enterprise Server 15 and newer. You can find additional AMIs in the AWS Marketplace, for example Fedora, Debian, NetBSD, CentOS, and NGINX Plus. For containerized applications, Amazon ECS and Amazon Elastic Kubernetes Service optimized AMIs are available as well.

The security group I selected gives me SSH access to the instance. I connect to the instance and do a general update:

$ sudo yum update -y

Since the kernel has been updated, I reboot the instance.

I’d like to set up this instance as a development environment. I can use it to build new applications, or to recompile my existing apps to the 64-bit Arm architecture. To install most development tools, such as Git, GCC, and Make, I use this group of packages:

$ sudo yum groupinstall -y "Development Tools"

AWS is working with several open source communities to drive improvements to the performance of software stacks running on AWS Graviton2. For example, you can see our contributions to PHP for Arm64 in this post.

Using the latest versions helps you obtain maximum performance from your Graviton2-based instances. The amazon-linux-extras command enables new versions for some of my favorite programming environments:

$ sudo amazon-linux-extras enable golang1.11 corretto8 php7.4 python3.8 ruby2.6

The output of the amazon-linux-extras command tells me which packages to install with yum:

$ yum clean metadata
$ sudo yum install -y golang java-1.8.0-amazon-corretto \
  php-cli php-pdo php-fpm php-json php-mysqlnd \
  python38 ruby ruby-irb rubygem-rake rubygem-json rubygems

Let’s check the versions of the tools that I just installed:

$ go version
go version go1.13.14 linux/arm64
$ java -version
openjdk version "1.8.0_265"
OpenJDK Runtime Environment Corretto-8.265.01.1 (build 1.8.0_265-b01)
OpenJDK 64-Bit Server VM Corretto-8.265.01.1 (build 25.265-b01, mixed mode)
$ php -v
PHP 7.4.9 (cli) (built: Aug 21 2020 21:45:13) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
$ python3.8 -V
Python 3.8.5
$ ruby -v
ruby 2.6.3p62 (2019-04-16 revision 67580) [aarch64-linux]

It looks like I am ready to go! Many more packages are available with yum, such as MariaDB and PostgreSQL. If you’re interested in databases, you might also want to try the preview of Amazon RDS powered by AWS Graviton2 processors.

Available Now
T4g instances are available today in US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Tokyo, Mumbai), and Europe (Frankfurt, Ireland).

You now have a broad choice of Graviton2-based instances to better optimize your workloads for cost and performance: low-cost burstable general-purpose (T4g), general-purpose (M6g), compute-optimized (C6g), and memory-optimized (R6g) instances. Local NVMe-based SSD storage options are also available.

You can use the free trial to develop new applications, or migrate your existing workloads to the AWS Graviton2 processor. Let me know how that goes!

Danilo

AWS Named as a Cloud Leader for the 10th Consecutive Year in Gartner’s Infrastructure & Platform Services Magic Quadrant

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-named-as-a-cloud-leader-for-the-10th-consecutive-year-in-gartners-infrastructure-platform-services-magic-quadrant/

At AWS, we strive to provide you with a technology platform that allows for agile development, rapid deployment, and unlimited scale, so that you can free up your resources to focus on innovation for your customers. It’s greatly rewarding to see our efforts recognized not just by our customers, but also by leading analysts.

This year, Gartner announced a new Magic Quadrant for Cloud Infrastructure and Platform Services (CIPS). This is an evolution of their Magic Quadrant for Cloud Infrastructure as a Service (IaaS) for which AWS has been named as a Leader for nine consecutive years.

Customers are using the cloud in broad ways, beyond foundational compute, networking, and storage services. We believe that, for this reason, Gartner is expanding the scope to include additional platform as a service (PaaS) capabilities, and is extending coverage to areas such as managed database services, serverless computing, and developer tools.

Today, I am happy to share that AWS has been named as a Leader in the Magic Quadrant for Cloud Infrastructure and Platform Services, and placed highest in Ability to Execute and furthest in Completeness of Vision.

More information on the features and factors that our customers examine when choosing a cloud provider is available in the full report.

Danilo

Gartner, Magic Quadrant for Cloud Infrastructure and Platform Services, Raj Bala, Bob Gill, Dennis Smith, David Wright, Kevin Ji, 1 September 2020 – Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

New – Using Amazon GuardDuty to Protect Your S3 Buckets

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-using-amazon-guardduty-to-protect-your-s3-buckets/

As we anticipated in this post, the anomaly and threat detection for Amazon Simple Storage Service (S3) activities that was previously available in Amazon Macie has now been enhanced and reduced in cost by over 80% as part of Amazon GuardDuty. This expands GuardDuty threat detection coverage beyond workloads and AWS accounts to also help you protect your data stored in S3.

This new capability enables GuardDuty to continuously monitor and profile S3 data access events (usually referred to as data plane operations) and S3 configurations (control plane APIs) to detect suspicious activities such as requests coming from an unusual geo-location, disabling of preventative controls such as S3 Block Public Access, or API call patterns consistent with an attempt to discover misconfigured bucket permissions. To detect possibly malicious behavior, GuardDuty uses a combination of anomaly detection, machine learning, and continuously updated threat intelligence. For your reference, here’s the full list of GuardDuty S3 threat detections.

When threats are detected, GuardDuty produces detailed security findings to the console and to Amazon EventBridge, making alerts actionable and easy to integrate into existing event management and workflow systems, or trigger automated remediation actions using AWS Lambda. You can optionally deliver findings to an S3 bucket to aggregate findings from multiple regions, and to integrate with third party security analysis tools.
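
As an example of making findings actionable, here is a minimal boto3 sketch that creates an EventBridge rule matching GuardDuty findings and targets a Lambda function. The function ARN is a placeholder, and you would also need to grant EventBridge permission to invoke the function (for example, with the Lambda add-permission API).

import json
import boto3

events = boto3.client("events")

# Match all GuardDuty findings delivered to the default event bus
events.put_rule(
    Name="guardduty-findings",
    EventPattern=json.dumps({
        "source": ["aws.guardduty"],
        "detail-type": ["GuardDuty Finding"],
    }),
)

# Send matching findings to a remediation Lambda function (placeholder ARN)
events.put_targets(
    Rule="guardduty-findings",
    Targets=[{
        "Id": "remediation-function",
        "Arn": "arn:aws:lambda:us-east-1:123412341234:function:guardduty-remediation",
    }],
)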

If you are not using GuardDuty yet, S3 protection will be on by default when you enable the service. If you are using GuardDuty, you can simply enable this new capability with one click in the GuardDuty console or through the API. For simplicity, and to optimize your costs, GuardDuty has now been integrated directly with S3. In this way, you don’t need to manually enable or configure S3 data event logging in AWS CloudTrail to take advantage of this new capability. GuardDuty also intelligently processes only the data events that can be used to generate threat detections, significantly reducing the number of events processed and lowering your costs.

If you are part of a centralized security team that manages GuardDuty across your entire organization, you can manage all accounts from a single account using the integration with AWS Organizations.

Enabling S3 Protection for an AWS Account
I already have GuardDuty enabled for my AWS account in this region. Now, I want to add threat detection for my S3 buckets. In the GuardDuty console, I select S3 Protection and then Enable. That’s it. To be more protected, I repeat this process for all regions enabled in my account.
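
Enabling it through the API is a single call per detector. A minimal boto3 sketch, assuming GuardDuty is already enabled in the region and using the S3Logs data source of the update_detector call:

import boto3

guardduty = boto3.client("guardduty")

# There is one detector per account per region when GuardDuty is enabled
detector_id = guardduty.list_detectors()["DetectorIds"][0]

# Turn on S3 protection (S3 data event monitoring) for this detector
guardduty.update_detector(
    DetectorId=detector_id,
    DataSources={"S3Logs": {"Enable": True}},
)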

After a few minutes, I start seeing new findings related to my S3 buckets. I can select each finding to get more information on the possible threat, including details on the source actor and the target action.

After a few days, I select the Usage section of the console to monitor the estimated monthly costs of GuardDuty in my account, including the new S3 protection. I can also find which S3 buckets contribute the most to the costs. Well, it turns out I didn’t have lots of traffic on my buckets recently.

Enabling S3 Protection for an AWS Organization
To simplify management of multiple accounts, GuardDuty uses its integration with AWS Organizations to allow you to delegate an account to be the administrator for GuardDuty for the whole organization.

Now, the delegated administrator can enable GuardDuty for all accounts in the organization in a region with one click. You can also set Auto-enable to ON to automatically include new accounts in the organization. If you prefer, you can add accounts by invitation. You can then go to the S3 Protection page under Settings to enable S3 protection for the entire organization.

When selecting Auto-enable, the delegated administrator can also choose to enable S3 protection automatically for new member accounts.

Available Now
As always, with Amazon GuardDuty, you only pay for the quantity of logs and events processed to detect threats. This includes API control plane events captured in CloudTrail, network flow captured in VPC Flow Logs, DNS request and response logs, and with S3 protection enabled, S3 data plane events. These sources are ingested by GuardDuty through internal integrations when you enable the service, so you don’t need to configure any of these sources directly. The service continually optimizes logs and events processed to reduce your cost, and displays your usage split by source in the console. If configured in multi-account, usage is also split by account.

There is a 30-day free trial for the new S3 threat detection capabilities. This applies as well to accounts that already have GuardDuty enabled and add the new S3 protection capability. During the trial, the estimated cost based on your S3 data event volume is calculated in the GuardDuty console Usage tab. In this way, while you evaluate these new capabilities at no cost, you can understand what your monthly spend would be.

GuardDuty for S3 protection is available in all regions where GuardDuty is offered. For regional availability, please see the AWS Region Table. To learn more, please see the documentation.

Danilo

Find Your Most Expensive Lines of Code – Amazon CodeGuru Is Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/find-your-most-expensive-lines-of-code-amazon-codeguru-is-now-generally-available/

Bringing new applications into production, maintaining their code base as they grow and evolve, and responding to operational issues at the same time is a challenging task. For this reason, you can find many ideas on how to structure your teams, which methodologies to apply, and how to safely automate your software delivery pipeline.

At re:Invent last year, we introduced in preview Amazon CodeGuru, a developer tool powered by machine learning that helps you improve your applications and troubleshoot issues with automated code reviews and performance recommendations based on runtime data. During the last few months, many improvements have been launched, including a more cost-effective pricing model, support for Bitbucket repositories, and the ability to start the profiling agent using a command line switch, so that you no longer need to modify the code of your application, or add dependencies, to run the agent.

You can use CodeGuru in two ways:

  • CodeGuru Reviewer uses program analysis and machine learning to detect potential defects that are difficult for developers to find, and recommends fixes in your Java code. The code can be stored in GitHub (now also in GitHub Enterprise), AWS CodeCommit, or Bitbucket repositories. When you submit a pull request on a repository that is associated with CodeGuru Reviewer, it provides recommendations for how to improve your code. Each pull request corresponds to a code review, and each code review can include multiple recommendations that appear as comments on the pull request.
  • CodeGuru Profiler provides interactive visualizations and recommendations that help you fine-tune your application performance and troubleshoot operational issues using runtime data from your live applications. It currently supports applications written in Java virtual machine (JVM) languages such as Java, Scala, Kotlin, Groovy, Jython, JRuby, and Clojure. CodeGuru Profiler can help you find the most expensive lines of code, in terms of CPU usage or introduced latency, and suggest ways you can improve efficiency and remove bottlenecks. You can use CodeGuru Profiler in production, and when you test your application with a meaningful workload, for example in a pre-production environment.

Today, Amazon CodeGuru is generally available with the addition of many new features.

In CodeGuru Reviewer, we included the following:

  • Support for GitHub Enterprise – You can now scan your pull requests and get recommendations against your source code on GitHub Enterprise on-premises repositories, together with a description of what’s causing the issue and how to remediate it.
  • New types of recommendations to solve defects and improve your code – For example, checking input validation, to avoid issues that can compromise security and performance, and looking for multiple copies of code that do the same thing.

In CodeGuru Profiler, you can find these new capabilities:

  • Anomaly detection – We automatically detect anomalies in the application profile for those methods that represent the highest proportion of CPU time or latency.
  • Lambda function support – You can now profile AWS Lambda functions just like applications hosted on Amazon Elastic Compute Cloud (EC2) and containerized applications running on Amazon ECS and Amazon Elastic Kubernetes Service, including those using AWS Fargate.
  • Cost of issues in the recommendation report – Recommendations contain actionable resolution steps which explain what the problem is, the CPU impact, and how to fix the issue. To help you better prioritize your activities, you now have an estimation of the savings introduced by applying the recommendation.
  • Color-my-code – In the visualizations, to help you easily find your own code, we are coloring your methods differently from frameworks and other libraries you may use.
  • CloudWatch metrics and alerts – To keep track and monitor efficiency issues that have been discovered.

Let’s see some of these new features at work!

Using CodeGuru Reviewer with a Lambda Function
I create a new repo in my GitHub account, and leave it empty for now. Locally, where I am developing a Lambda function using the Java 11 runtime, I initialize my Git repo and add only the README.md file to the master branch. In this way, I can add all the code as a pull request later and have it go through a code review by CodeGuru.

git init
git add README.md
git commit -m "First commit"

Now, I add the GitHub repo as origin, and push my changes to the new repo:

git remote add origin https://github.com/<my-user-id>/amazon-codeguru-sample-lambda-function.git
git push -u origin master

I associate the repository in the CodeGuru console:

When the repository is associated, I create a new dev branch, add all my local files to it, and push it remotely:

git checkout -b dev
git add .
git commit -m "Code added to the dev branch"
git push --set-upstream origin dev

In the GitHub console, I open a new pull request by comparing changes across the two branches, master and dev. I verify that the pull request is able to merge, then I create it.

Since the repository is associated with CodeGuru, a code review is listed as Pending in the Code reviews section of the CodeGuru console.

After a few minutes, the code review status is Completed, and CodeGuru Reviewer issues a recommendation on the same GitHub page where the pull request was created.

Oops! I am creating the Amazon DynamoDB service object inside the function invocation method. In this way, it cannot be reused across invocations. This is not efficient.

To improve the performance of my Lambda function, I follow the CodeGuru recommendation, and move the declaration of the DynamoDB service object to a static final attribute of the Java application object, so that it is instantiated only once, during function initialization. Then, I follow the link in the recommendation to learn more best practices for working with Lambda functions.
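
The fix here is in Java, but the same principle applies to Python Lambda functions like the ones discussed earlier in this post: create service clients once, outside the handler, so they are reused across invocations. A minimal sketch (the table name and event fields are placeholders):

import boto3

# Created once per execution environment, during function initialization,
# and reused across invocations
dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("my-table")  # placeholder table name


def handler(event, context):
    # The handler reuses the client instead of recreating it on every invocation
    table.put_item(Item={"id": event["id"], "payload": event.get("payload", "")})
    return {"status": "ok"}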

Using CodeGuru Profiler with a Lambda Function
In the CodeGuru console, I create a MyServerlessApp-Development profiling group and select the Lambda compute platform.

Next, I give the AWS Identity and Access Management (IAM) role used by my Lambda function permissions to submit data to this profiling group.

Now, the console is giving me all the info I need to profile my Lambda function. To configure the profiling agent, I use a couple of environment variables:

  • AWS_CODEGURU_PROFILER_GROUP_ARN to specify the ARN of the profiling group to use.
  • AWS_CODEGURU_PROFILER_ENABLED to enable (TRUE) or disable (FALSE) profiling.

I follow the instructions (for Maven and Gradle) to add a dependency, and include the profiling agent in the build. Then, I update the code of the Lambda function to wrap the handler function inside the LambdaProfiler provided by the agent.

To generate some load, I start a few scripts invoking my function using the Amazon API Gateway as trigger. After a few minutes, the profiling group starts to show visualizations describing the runtime behavior of my Lambda function.

For example, I can see how much CPU time is spent in the different methods of my function. At the bottom, there are the entry point methods. As I scroll up, I find methods that are called deeper in the stack trace. I right-click and hide the LambdaRuntimeClient methods to focus on my code. Note that my methods are colored differently than those in the packages I am using, such as the AWS SDK for Java.

I am mostly interested in what happens in the handler method invoked by the Lambda platform. I select the handler method, and now it becomes the new “base” of the visualization.

As I move my pointer on each of my methods, I get more information, including an estimation of the yearly cost of running that specific part of the code in production, based on the load experienced by the profiling agent during the selected time window. In my case, the handler function cost is estimated to be $6. If I select the two main functions above, I have an estimation of $3 each. The cost estimation works for code running on Lambda functions, EC2 instances, and containerized applications.

Similarly, I can visualize Latency, to understand how much time is spent inside the methods in my code. I keep the Lambda function handler method selected to drill down into what is under my control, and see where time is being spent the most.

The CodeGuru Profiler is also providing a recommendation based on the data collected. I am spending too much time (more than 4%) in managing encryption. I can use a more efficient crypto provider, such as the open source Amazon Corretto Crypto Provider, described in this blog post. This should lower the time spent to what is expected, about 1% of my profile.

Finally, I edit the profiling group to enable notifications. In this way, if CodeGuru detects an anomaly in the profile of my application, I am notified in one or more Amazon Simple Notification Service (SNS) topics.

Available Now
Amazon CodeGuru is available today in 10 regions, and we are working to add more regions in the coming months. For regional availability, please see the AWS Region Table.

CodeGuru helps you improve your application code and reduce compute and infrastructure costs with an automated code reviewer and application profiler that provide intelligent recommendations. Using visualizations based on runtime data, you can quickly find the most expensive lines of code of your applications. With CodeGuru, you pay only for what you use. Pricing is based on the lines of code analyzed by CodeGuru Reviewer, and on sampling hours for CodeGuru Profiler.

To learn more, please see the documentation.

Danilo

AWS Solutions Constructs – A Library of Architecture Patterns for the AWS CDK

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-solutions-constructs-a-library-of-architecture-patterns-for-the-aws-cdk/

Cloud applications are built using multiple components, such as virtual servers, containers, serverless functions, storage buckets, and databases. Being able to provision and configure these resources in a safe, repeatable way is incredibly important to automate your processes and let you focus on the unique parts of your implementation.

With the AWS Cloud Development Kit, you can leverage the expressive power of your favorite programming languages to model your applications. You can use high-level components called constructs, preconfigured with “sensible defaults” that you can customize, to quickly build a new application. The CDK provisions your resources using AWS CloudFormation to get all the benefits of managing your infrastructure as code. One of the reasons I like the CDK is that you can compose and share your own custom components as higher-level constructs.

As you can imagine, there are recurring patterns that can be useful to more than one customer. For this reason, today we are launching the AWS Solutions Constructs, an open source extension library for the CDK that provides well-architected patterns to help you build your unique solutions. CDK constructs mostly cover single services. AWS Solutions Constructs provide multi-service patterns that combine two or more CDK resources, and implement best practices such as logging and encryption.

Using AWS Solutions Constructs
To see the power of a pattern-based approach, let’s take a look at how that works when building a new application. As an example, I want to build an HTTP API to store data in an Amazon DynamoDB table. To keep the content of the table small, I can use DynamoDB Time to Live (TTL) to expire items after a few days. After the TTL expires, data is deleted from the table and sent, via DynamoDB Streams, to an AWS Lambda function to archive the expired data on Amazon Simple Storage Service (S3).

To build this application, I can use a few components:

  • An Amazon API Gateway endpoint for the API.
  • A DynamoDB table to store data.
  • A Lambda function to process the API requests, and store data in the DynamoDB table.
  • DynamoDB Streams to capture data changes.
  • A Lambda function processing data changes to archive the expired data.

Can I make it simpler? Looking at the available patterns in the AWS Solutions Constructs, I find two that can help me build my app:

  • aws-apigateway-lambda, a Construct that implements an API Gateway REST API connected to a Lambda function. As an example of the “sensible defaults” used by AWS Solutions Constructs, this pattern enables CloudWatch logging for the API Gateway.
  • aws-dynamodb-stream-lambda, a Construct implementing a DynamoDB table streaming data changes to a Lambda function with the least privileged permissions.

To build the final architecture, I simply connect those two Constructs together:

I am using TypeScript to define the CDK stack, and Node.js for the Lambda functions. Let’s start with the CDK stack:

 

import * as cdk from '@aws-cdk/core';
import * as lambda from '@aws-cdk/aws-lambda';
import * as apigw from '@aws-cdk/aws-apigateway';
import * as dynamodb from '@aws-cdk/aws-dynamodb';
import { ApiGatewayToLambda } from '@aws-solutions-constructs/aws-apigateway-lambda';
import { DynamoDBStreamToLambda } from '@aws-solutions-constructs/aws-dynamodb-stream-lambda';

export class DemoConstructsStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const apiGatewayToLambda = new ApiGatewayToLambda(this, 'ApiGatewayToLambda', {
      deployLambda: true,
      lambdaFunctionProps: {
        code: lambda.Code.fromAsset('lambda'),
        runtime: lambda.Runtime.NODEJS_12_X,
        handler: 'restApi.handler'
      },
      apiGatewayProps: {
        defaultMethodOptions: {
          authorizationType: apigw.AuthorizationType.NONE
        }
      }
    });

    const dynamoDBStreamToLambda = new DynamoDBStreamToLambda(this, 'DynamoDBStreamToLambda', {
      deployLambda: true,
      lambdaFunctionProps: {
        code: lambda.Code.fromAsset('lambda'),
        runtime: lambda.Runtime.NODEJS_12_X,
        handler: 'processStream.handler'
      },
      dynamoTableProps: {
        tableName: 'my-table',
        partitionKey: { name: 'id', type: dynamodb.AttributeType.STRING },
        timeToLiveAttribute: 'ttl'
      }
    });

    const apiFunction = apiGatewayToLambda.lambdaFunction;
    const dynamoTable = dynamoDBStreamToLambda.dynamoTable;

    dynamoTable.grantReadWriteData(apiFunction);
    apiFunction.addEnvironment('TABLE_NAME', dynamoTable.tableName);
  }
}

At the beginning of the stack, I import the standard CDK constructs for the Lambda function, the API Gateway endpoint, and the DynamoDB table. Then, I add the two patterns from the AWS Solutions Constructs, ApiGatewayToLambda and DynamoDBStreamToLambda.

After declaring the two ApiGatewayToLambda and DynamoDBStreamToLambda constructs, I store the Lambda function, created by the ApiGatewayToLambda construct, and the DynamoDB table, created by DynamoDBStreamToLambda, in two variables.

At the end of the stack, I “connect” the two patterns together by granting permissions to the Lambda function to read/write in the DynamoDB table, and add the name of the DynamoDB table to the environment of the Lambda function, so that it can be used in the function code to store data in the table.

The code of the two Lambda functions is in the lambda folder of the CDK application. I am using the Node.js 12 runtime.

The restApi.js function implements the API and writes data to the DynamoDB table. The URL path is used as the partition key, and all the query string parameters in the URL are stored as attributes. The TTL for the item is computed by adding a time window of 7 days to the current time.

const { DynamoDB } = require("aws-sdk");

const docClient = new DynamoDB.DocumentClient();

const TABLE_NAME = process.env.TABLE_NAME;
const TTL_WINDOW = 7 * 24 * 60 * 60; // 7 days expressed in seconds

exports.handler = async function (event) {

  const item = event.queryStringParameters;
  item.id = event.pathParameters.proxy;

  const now = new Date(); 
  item.ttl = Math.round(now.getTime() / 1000) + TTL_WINDOW;

  let statusCode = 204;

  try {
    // The DocumentClient promise rejects on error, so failures are handled here
    await docClient.put({
      TableName: TABLE_NAME,
      Item: item
    }).promise();
  } catch (err) {
    console.error('request: ', JSON.stringify(event, undefined, 2));
    console.error('error: ', err);
    statusCode = 500;
  }

  return {
    statusCode: statusCode
  };
};

The processStream.js function is processing data capture records from the DynamoDB Stream, looking for the items deleted by TTL. The archive functionality is not implemented in this sample code.

exports.handler = async function (event) {
  event.Records.forEach((record) => {
    console.log('Stream record: ', JSON.stringify(record, null, 2));
    if (record.userIdentity &&
      record.userIdentity.type == "Service" &&
      record.userIdentity.principalId == "dynamodb.amazonaws.com") {

      // Record deleted by DynamoDB Time to Live (TTL)

      // I can archive the record to S3, for example using Kinesis Data Firehose.
    }
  });
};

Let’s see if this works! First, I need to install all dependencies. To simplify dependency management, each release of AWS Solutions Constructs is linked to the corresponding version of the CDK. In this case, I am using version 1.46.0 for both the CDK and the AWS Solutions Constructs patterns. The first three commands install plain CDK constructs. The last two commands install the AWS Solutions Constructs patterns I am using for this application.

npm install @aws-cdk/aws-lambda@1.46.0
npm install @aws-cdk/aws-apigateway@1.46.0
npm install @aws-cdk/aws-dynamodb@1.46.0
npm install @aws-solutions-constructs/aws-apigateway-lambda@1.46.0
npm install @aws-solutions-constructs/aws-dynamodb-stream-lambda@1.46.0

Now, I build the application and use the CDK to deploy the application.

npm run build
cdk deploy

Toward the end of the output of the cdk deploy command, a green check mark tells me that the deployment of the stack is complete. Just below, in the Outputs, I find the endpoint of the API Gateway.

 ✅  DemoConstructsStack

Outputs:
DemoConstructsStack.ApiGatewayToLambdaLambdaRestApiEndpoint9800D4B5 = https://1a2c3c4d.execute-api.eu-west-1.amazonaws.com/prod/

I can now use curl to test the API:

curl "https://1a2c3c4d.execute-api.eu-west-1.amazonaws.com/prod/danilop?name=Danilo&amp;company=AWS"

Let’s have a look at the DynamoDB table:

The item is stored, and the TTL is set. After a week, the item will be deleted and sent via DynamoDB Streams to the processStream.js function.
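
Instead of the console, I can also run a quick check with the AWS SDK. Here is a minimal Python sketch, assuming the table name and partition key defined in the CDK stack above:

import boto3

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table('my-table')  # table name set in dynamoTableProps above

# The URL path ('danilop') was used as the partition key by the restApi.js function
item = table.get_item(Key={'id': 'danilop'})['Item']
print(item)  # includes the 'name' and 'company' attributes and the computed 'ttl'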

After I complete my testing, I use the CDK again to quickly delete all resources created for this application:

cdk destroy

Available Now
The AWS Solutions Constructs are available now for TypeScript and Python. The AWS Solutions Builders team is working to make these constructs also available when using Java and C# with the CDK. Stay tuned. There is no cost for using the AWS Solutions Constructs or the CDK; you only pay for the resources created when deploying the stack.

In this first release, 25 patterns are included, covering lots of different use cases. Which new patterns and features should we focus on next? Give us your feedback in the open source project repository!

Danilo

New – A Shared File System for Your Lambda Functions

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-a-shared-file-system-for-your-lambda-functions/

I am very happy to announce that AWS Lambda functions can now mount an Amazon Elastic File System (EFS), a scalable and elastic NFS file system storing data within and across multiple availability zones (AZ) for high availability and durability. In this way, you can use a familiar file system interface to store and share data across all concurrent execution environments of one, or more, Lambda functions. EFS supports full file system access semantics, such as strong consistency and file locking.

To connect an EFS file system with a Lambda function, you use an EFS access point, an application-specific entry point into an EFS file system that includes the operating system user and group to use when accessing the file system and the file system permissions, and that can limit access to a specific path in the file system. This helps keep the file system configuration decoupled from the application code.

You can access the same EFS file system from multiple functions, using the same or different access points. For example, using different EFS access points, each Lambda function can access different paths in a file system, or use different file system permissions.

You can share the same EFS file system with Amazon Elastic Compute Cloud (EC2) instances, containerized applications using Amazon ECS and AWS Fargate, and on-premises servers. Following this approach, you can use different computing architectures (functions, containers, virtual servers) to process the same files. For example, a Lambda function reacting to an event can update a configuration file that is read by an application running on containers. Or you can use a Lambda function to process files uploaded by a web application running on EC2.

In this way, some use cases are much easier to implement with Lambda functions. For example:

  • Processing or loading data larger than the space available in /tmp (512MB).
  • Loading the most updated version of files that change frequently.
  • Using data science packages that require storage space to load models and other dependencies.
  • Saving function state across invocations (using unique file names, or file system locks).
  • Building applications requiring access to large amounts of reference data.
  • Migrating legacy applications to serverless architectures.
  • Interacting with data intensive workloads designed for file system access.
  • Partially updating files (using file system locks for concurrent access).
  • Moving a directory and all its content within a file system with an atomic operation.

Creating an EFS File System
To mount an EFS file system, your Lambda functions must be connected to an Amazon Virtual Private Cloud that can reach the EFS mount targets. For simplicity, I am using here the default VPC that is automatically created in each AWS Region.

Note that, when connecting Lambda functions to a VPC, networking works differently. If your Lambda functions are using Amazon Simple Storage Service (S3) or Amazon DynamoDB, you should create a gateway VPC endpoint for those services. If your Lambda functions need to access the public internet, for example to call an external API, you need to configure a NAT Gateway. I usually don’t change the configuration of my default VPCs. If I have specific requirements, I create a new VPC with private and public subnets using the AWS Cloud Development Kit, or use one of these AWS CloudFormation sample templates. In this way, I can manage networking as code.

In the EFS console, I select Create file system and make sure that the default VPC and its subnets are selected. For all subnets, I use the default security group that gives network access to other resources in the VPC using the same security group.

In the next step, I give the file system a Name tag and leave all other options to their default values.

Then, I select Add access point. I use 1001 for the user and group IDs and limit access to the /message path. In the Owner section, used to create the folder automatically when first connecting to the access point, I use the same user and group IDs as before, and 750 for permissions. With these permissions, the owner can read, write, and execute files; users in the same group can only read; other users have no access.
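
The same access point can also be created programmatically. Here is a minimal boto3 sketch mirroring the values I used in the console; the file system ID is hypothetical:

import boto3

efs = boto3.client('efs')

# Create an access point limited to the /message path, enforcing user/group ID 1001
# and creating the folder with 750 permissions on first use.
response = efs.create_access_point(
    FileSystemId='fs-12345678',  # hypothetical file system ID
    PosixUser={'Uid': 1001, 'Gid': 1001},
    RootDirectory={
        'Path': '/message',
        'CreationInfo': {
            'OwnerUid': 1001,
            'OwnerGid': 1001,
            'Permissions': '750'
        }
    }
)
# The ARN is what the Lambda file system configuration expects
print(response['AccessPointArn'])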

I go on, and complete the creation of the file system.

Using EFS with Lambda Functions
To start with a simple use case, let’s build a Lambda function implementing a MessageWall API to add, read, or delete text messages. Messages are stored in a file on EFS so that all concurrent execution environments of that Lambda function see the same content.

In the Lambda console, I create a new MessageWall function and select the Python 3.8 runtime. In the Permissions section, I leave the default. This will create a new AWS Identity and Access Management (IAM) role with basic permissions.

When the function is created, in the Permissions tab I click on the IAM role name to open the role in the IAM console. Here, I select Attach policies to add the AWSLambdaVPCAccessExecutionRole and AmazonElasticFileSystemClientReadWriteAccess AWS managed policies. In a production environment, you can restrict access to a specific VPC and EFS access point.

Back in the Lambda console, I edit the VPC configuration to connect the MessageWall function to all subnets in the default VPC, using the same default security group I used for the EFS mount points.

Now, I select Add file system in the new File system section of the function configuration. Here, I choose the EFS file system and access point I created before. For the local mount point, I use /mnt/msg and Save. This is the path where the access point will be mounted, and corresponds to the /message folder in my EFS file system.

In the Function code editor of the Lambda console, I paste the following code and Save.

import os
import fcntl

MSG_FILE_PATH = '/mnt/msg/content'


def get_messages():
    try:
        with open(MSG_FILE_PATH, 'r') as msg_file:
            fcntl.flock(msg_file, fcntl.LOCK_SH)
            messages = msg_file.read()
            fcntl.flock(msg_file, fcntl.LOCK_UN)
    except:
        messages = 'No message yet.'
    return messages


def add_message(new_message):
    with open(MSG_FILE_PATH, 'a') as msg_file:
        fcntl.flock(msg_file, fcntl.LOCK_EX)
        msg_file.write(new_message + "\n")
        fcntl.flock(msg_file, fcntl.LOCK_UN)


def delete_messages():
    try:
        os.remove(MSG_FILE_PATH)
    except:
        pass


def lambda_handler(event, context):
    method = event['requestContext']['http']['method']
    if method == 'GET':
        messages = get_messages()
    elif method == 'POST':
        new_message = event['body']
        add_message(new_message)
        messages = get_messages()
    elif method == 'DELETE':
        delete_messages()
        messages = 'Messages deleted.'
    else:
        messages = 'Method unsupported.'
    return messages

I select Add trigger and in the configuration I select the Amazon API Gateway. I create a new HTTP API. For simplicity, I leave my API endpoint open.

With the API Gateway trigger selected, I copy the endpoint of the new API I just created.

I can now use curl to test the API:

$ curl https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
No message yet.
$ curl -X POST -H "Content-Type: text/plain" -d 'Hello from EFS!' https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
Hello from EFS!

$ curl -X POST -H "Content-Type: text/plain" -d 'Hello again :)' https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
Hello from EFS!
Hello again :)

$ curl https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
Hello from EFS!
Hello again :)

$ curl -X DELETE https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
Messages deleted.

$ curl https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
No message yet.

It would be relatively easy to add unique file names (or specific subdirectories) for different users and extend this simple example into a more complete messaging application. As a developer, I appreciate the simplicity of using a familiar file system interface in my code. However, depending on your requirements, EFS throughput configuration must be taken into account. See the section Understanding EFS performance later in the post for more information.

Now, let’s use the new EFS file system support in AWS Lambda to build something more interesting. For example, let’s use the additional space available with EFS to build a machine learning inference API processing images.

Building a Serverless Machine Learning Inference API
To create a Lambda function implementing machine learning inference, I need to be able, in my code, to import the necessary libraries and load the machine learning model. Often, when doing so, the overall size of those dependencies goes beyond the current AWS Lambda deployment package size limits. One way of solving this is to carefully minimize the libraries to ship with the function code, and then download the model from an S3 bucket straight to memory (up to 3 GB, including the memory required for processing the model) or to /tmp (up to 512 MB). This custom minimization and download of the model has never been easy to implement. Now, I can use an EFS file system.

The Lambda function I am building this time needs access to the public internet to download a pre-trained model and the images to run inference on. So I create a new VPC with public and private subnets, and configure a NAT Gateway and the route table used by the private subnets to give access to the public internet. Using the AWS Cloud Development Kit, it’s just a few lines of code.
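
This is not the exact stack I used, but as a reference, a minimal CDK (Python, version 1.x) sketch of such a VPC could look like the following; the stack and construct names are made up, and ec2.Vpc creates the public and private subnets, route tables, and NAT gateway by default:

from aws_cdk import core
from aws_cdk import aws_ec2 as ec2

class InferenceVpcStack(core.Stack):
    def __init__(self, scope: core.Construct, id: str, **kwargs) -> None:
        super().__init__(scope, id, **kwargs)

        # Public and private subnets in two AZs, with a single NAT gateway
        # providing outbound internet access from the private subnets.
        ec2.Vpc(self, "InferenceVpc", max_azs=2, nat_gateways=1)

app = core.App()
InferenceVpcStack(app, "InferenceVpcStack")
app.synth()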

I create a new EFS file system and an access point in the new VPC using similar configurations as before. This time, I use /ml for the access point path.

Then, I create a new MLInference Lambda function with the same setup as before for permissions, and connect the function to the private subnets of the new VPC. Machine learning inference is quite a heavy workload, so I select 3 GB for memory and 5 minutes for timeout. In the File system configuration, I add the new access point and mount it under /mnt/inference.

The machine learning framework I am using for this function is PyTorch, and I need to put the libraries required to run inference in the EFS file system. I launch an Amazon Linux EC2 instance in a public subnet of the new VPC. In the instance details, I select one of the availability zones where I have an EFS mount point, and then Add file system to automatically mount the same EFS file system I am using for the function. For the security groups of the EC2 instance, I select the default security group (to be able to mount the EFS file system) and one that gives inbound access to SSH (to be able to connect to the instance).

I connect to the instance using SSH and create a requirements.txt file containing the dependencies I need:

torch
torchvision
numpy

The EFS file system is automatically mounted by EC2 under /mnt/efs/fs1. There, I create the /ml directory and change the owner of the path to the user and group I am using now that I am connected (ec2-user).

$ sudo mkdir /mnt/efs/fs1/ml
$ sudo chown ec2-user:ec2-user /mnt/efs/fs1/ml

I install Python 3 and use pip to install the dependencies in the /mnt/efs/fs1/ml/lib path:

$ sudo yum install python3
$ pip3 install -t /mnt/efs/fs1/ml/lib -r requirements.txt

Finally, I give ownership of the whole /ml path to the user and group I used for the EFS access point:

$ sudo chown -R 1001:1001 /mnt/efs/fs1/ml

Overall, the dependencies in my EFS file system are using about 1.5 GB of storage.

I go back to the MLInference Lambda function configuration. Depending on the runtime you use, you need to find a way to tell the runtime where to look for dependencies if they are not included with the deployment package or in a layer. In the case of Python, I set the PYTHONPATH environment variable to /mnt/inference/lib.
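
If you prefer not to rely on an environment variable, a minimal alternative, assuming the same /mnt/inference/lib mount path, is to extend the module search path in code before importing the dependencies:

import sys

# Add the EFS path to the module search path before importing heavy dependencies
sys.path.insert(0, '/mnt/inference/lib')

import torch  # now resolved from the EFS file system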

I am going to use PyTorch Hub to download this pre-trained machine learning model to recognize the kind of bird in a picture. The model I am using for this example is relatively small, about 200 MB. To cache the model on the EFS file system, I set the TORCH_HOME environment variable to /mnt/inference/model.

All dependencies are now in the file system mounted by the function, and I can type my code straight in the Function code editor. I paste the following code to have a machine learning inference API:

import urllib
import json
import os

import torch
from PIL import Image
from torchvision import transforms

transform_test = transforms.Compose([
    transforms.Resize((600, 600), Image.BILINEAR),
    transforms.CenterCrop((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

model = torch.hub.load('nicolalandro/ntsnet-cub200', 'ntsnet', pretrained=True,
                       **{'topN': 6, 'device': 'cpu', 'num_classes': 200})
model.eval()


def lambda_handler(event, context):
    url = event['queryStringParameters']['url']

    img = Image.open(urllib.request.urlopen(url))
    scaled_img = transform_test(img)
    torch_images = scaled_img.unsqueeze(0)

    with torch.no_grad():
        top_n_coordinates, concat_out, raw_logits, concat_logits, part_logits, top_n_index, top_n_prob = model(torch_images)

        _, predict = torch.max(concat_logits, 1)
        pred_id = predict.item()
        bird_class = model.bird_classes[pred_id]
        print('bird_class:', bird_class)

    return json.dumps({
        "bird_class": bird_class,
    })

I add the API Gateway as trigger, similarly to what I did before for the MessageWall function. Now, I can use the serverless API I just created to analyze pictures of birds. I am not really an expert in the field, so I looked for a couple of interesting images on Wikipedia:

I call the API to get a prediction for these two pictures:

$ curl https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MLInference?url=https://path/to/image/atlantic-puffin.jpg

{"bird_class": "106.Horned_Puffin"}

$ curl https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MLInference?url=https://path/to/image/western-grebe.jpg

{"bird_class": "053.Western_Grebe"}

It works! Looking at Amazon CloudWatch Logs for the Lambda function, I see that the first invocation, when the function loads and prepares the pre-trained model for inference on CPUs, takes about 30 seconds. To avoid a slow response, or a timeout from the API Gateway, I use Provisioned Concurrency to keep the function ready. The next invocations take about 1.8 seconds.
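
Provisioned Concurrency can be set in the console or programmatically. As a sketch, assuming a published version of the function (Provisioned Concurrency cannot be set on $LATEST), the boto3 call could look like this:

import boto3

lambda_client = boto3.client('lambda')

# Keep one execution environment initialized and ready to respond
lambda_client.put_provisioned_concurrency_config(
    FunctionName='MLInference',
    Qualifier='1',  # hypothetical published version (or an alias name)
    ProvisionedConcurrentExecutions=1
)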

Understanding EFS Performance
When using EFS with your Lambda function, it is very important to understand how EFS performance works. For throughput, each file system can be configured to use bursting or provisioned mode.

When using bursting mode, all EFS file systems, regardless of size, can burst at least to 100 MiB/s of throughput. Those over 1 TiB in the standard storage class can burst to 100 MiB/s per TiB of data stored in the file system. EFS uses a credit system to determine when file systems can burst. Each file system earns credits over time at a baseline rate that is determined by the size of the file system that is stored in the standard storage class. A file system uses credits whenever it reads or writes data. The baseline rate is 50 KiB/s per GiB of storage.

You can monitor the use of credits in CloudWatch: each EFS file system has a BurstCreditBalance metric. If you see that you are consuming all credits, and the BurstCreditBalance metric is going to zero, you should enable provisioned throughput mode for the file system, from 1 to 1024 MiB/s. There is an additional cost when using provisioned throughput, based on how much throughput you are adding on top of the baseline rate.

To avoid running out of credits, you should think of the throughput as the average you need during the day. For example, if you have a 10GB file system, you have 500 KiB/s of baseline rate, and every day you can read/write 500 KiB/s * 3600 seconds * 24 hours = 43.2 GiB.

If the libraries and everything your function needs to load during initialization are about 2 GiB, and you only access the EFS file system during function initialization, like in the MLInference Lambda function above, that means you can initialize your function (for example, because of updates or scaling up activities) about 20 times per day. That’s not a lot, and you would probably need to configure provisioned throughput for the EFS file system.

If you have 10 MiB/s of provisioned throughput, then every day you have 10 MiB/s * 3600 seconds * 24 hours = 864 GiB to read or write. If you only use the EFS file system at function initialization to read about 2 GB of dependencies, it means that you can have 400 initializations per day. That may be enough for your use case.
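
To make these back-of-the-envelope estimates repeatable, here is a small Python sketch using the same assumptions; it uses binary units throughout, so the results are slightly lower than the rounded figures above:

# Rough estimates only; check the EFS documentation for exact burst behavior.
SECONDS_PER_DAY = 24 * 60 * 60

def baseline_throughput_kib_s(storage_gib):
    # Baseline rate in bursting mode: 50 KiB/s per GiB stored in the standard class
    return 50 * storage_gib

def daily_budget_gib(throughput_kib_s):
    # Data you can read/write per day at a given average throughput
    return throughput_kib_s * SECONDS_PER_DAY / (1024 * 1024)

def initializations_per_day(dependencies_gib, throughput_kib_s):
    return daily_budget_gib(throughput_kib_s) / dependencies_gib

# 10 GiB file system, 2 GiB of dependencies loaded at initialization
baseline = baseline_throughput_kib_s(10)              # 500 KiB/s
print(round(daily_budget_gib(baseline), 1))           # ~41.2 GiB per day
print(round(initializations_per_day(2, baseline)))    # ~21 initializations per day

# With 10 MiB/s of provisioned throughput
provisioned = 10 * 1024                               # KiB/s
print(round(daily_budget_gib(provisioned)))           # ~844 GiB per day
print(round(initializations_per_day(2, provisioned))) # ~422 initializations per day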

In the Lambda function configuration, you can also use the reserved concurrency control to limit the maximum number of execution environments used by a function.

If, by mistake, the BurstCreditBalance goes down to zero, and the file system is relatively small (for example, a few GiBs), there is the possibility that your function gets stuck and can’t execute fast enough before reaching the timeout. In that case, you should enable (or increase) provisioned throughput for the EFS file system, or throttle your function by setting the reserved concurrency to zero to avoid all invocations until the EFS file system has enough credits.
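
For example, to throttle a function while the file system earns back burst credits, you can set its reserved concurrency to zero. A minimal boto3 sketch, assuming the MLInference function name:

import boto3

lambda_client = boto3.client('lambda')

# Setting reserved concurrency to 0 blocks all invocations of the function
lambda_client.put_function_concurrency(
    FunctionName='MLInference',
    ReservedConcurrentExecutions=0
)

# Later, remove the limit (or set an appropriate maximum)
lambda_client.delete_function_concurrency(FunctionName='MLInference')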

Understanding Security Controls
When using EFS file systems with AWS Lambda, you have multiple levels of security controls. I’m doing a quick recap here because they should all be considered during the design and implementation of your serverless applications. You can find more info on using IAM authorization and access points with EFS in this post.

To connect a Lambda function to an EFS file system, you need:

  • Network visibility in terms of VPC routing/peering and security group.
  • IAM permissions for the Lambda function to access the VPC and mount (read only or read/write) the EFS file system.
  • You can specify in the IAM policy conditions which EFS access point the Lambda function can use.
  • The EFS access point can limit access to a specific path in the file system.
  • File system security (user ID, group ID, permissions) can limit read, write, or executable access for each file or directory mounted by a Lambda function.

The Lambda function execution environment and the EFS mount point use industry-standard Transport Layer Security (TLS) 1.2 to encrypt data in transit. You can provision Amazon EFS to encrypt data at rest. Data encrypted at rest is transparently encrypted while being written, and transparently decrypted while being read, so you don’t have to modify your applications. Encryption keys are managed by the AWS Key Management Service (KMS), eliminating the need to build and maintain a secure key management infrastructure.

Available Now
This new feature is offered in all regions where AWS Lambda and Amazon EFS are available, with the exception of the regions in China, where we are working to make this integration available as soon as possible. For more information on availability, please see the AWS Region table. To learn more, please see the documentation.

EFS for Lambda can be configured using the console, the AWS Command Line Interface (CLI), the AWS SDKs, and the Serverless Application Model. This feature allows you to build data intensive applications that need to process large files. For example, you can now unzip a 1.5 GB file in a few lines of code, or process a 10 GB JSON document. You can also load libraries or packages that are larger than the 250 MB package deployment size limit of AWS Lambda, enabling new machine learning, data modelling, financial analysis, and ETL jobs scenarios.
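
As an illustration of the first example, here is a minimal sketch of a Lambda function unzipping an archive stored on an EFS mount; the paths are hypothetical and depend on your file system and access point configuration:

import zipfile

# Hypothetical paths under the local mount point configured for the function
ARCHIVE_PATH = '/mnt/data/archive.zip'
OUTPUT_PATH = '/mnt/data/extracted'

def lambda_handler(event, context):
    with zipfile.ZipFile(ARCHIVE_PATH) as archive:
        names = archive.namelist()
        archive.extractall(OUTPUT_PATH)
    return {'extracted_files': len(names)}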

Amazon EFS for Lambda is supported at launch in AWS Partner Network solutions, including Epsagon, Lumigo, Datadog, HashiCorp Terraform, and Pulumi.

There is no additional charge for using EFS from Lambda functions. You pay the standard price for AWS Lambda and Amazon EFS. Lambda execution environments always connect to the right mount target in an AZ and not across AZs. You can connect to EFS in the same AZ via a cross-account VPC, but there can be data transfer costs for that. We do not support cross-region or cross-AZ connectivity between EFS and Lambda.

Danilo

Welcome to the Serverless-First Function Virtual Events

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/serverless-first-function-virtual-events/

When you develop a serverless application, you can focus on the core features you want to build, instead of worrying about managing and operating servers, databases, or storage systems.

To simplify adoption and use of serverless technologies, we launched many new features in the last few months. For example, just to pick up a few:

To help you and your organization get the most out of the cloud, we organized a set of virtual events called the AWS Serverless-First Function:

  • Last Thursday, May 21, we had the first event, Serverless for your Organization. Between an opening by Dr. Werner Vogels, Amazon CTO, and closing statements by Jeff Barr, AWS VP and Chief Evangelist, we had an agenda fully packed with tips to bring the benefits of serverless to your organization, including a customer case study by Gillian McCann, Head of Cloud Engineering and AI at Workgrid Software.
  • On Thursday, May 28, we have the second event, Serverless for your Application, a full day of incremental, hands-on sessions that demonstrate end-to-end best practices for building serverless applications. To start, we’ll discuss how we use serverless at AWS, and the benefits AWS teams get from adopting a serverless-first approach. This will be followed by deep-dive sessions on topics such as security, performance, and observability. You can still register for this event here.

A Few Highlights from May 21
There were too many great moments to list them all here. If you missed the first event (or to review again some of the ideas and resources that were shared) here are my favorites:

  • Dr. Werner Vogels started the event by defining modern applications, and addressing the importance of culture and adaptability when building them. These are two of the key ingredients to a serverless-first approach. He mentioned the Amazon Builders’ Library (a great resource for learning more about Amazon’s own technical journey), including multiple articles by one of the other presenters for the day, Sr. Principal Engineer, David Yanacek. He also discussed how the AWS Well-Architected Framework’s Serverless Lens (now also available in the AWS Well-Architected Tool) can help you measure your architectures against best practices and identify areas for improvement for all of your serverless applications. Among many examples mentioned during the day, you can find the journeys of iRobot and Fender to serverless-first captured in this whitepaper.
  • David Richardson, VP of Serverless at AWS, discussed the importance of taking a serverless-approach when building modern applications, and covered a few of the recent launches I mentioned above that make building on serverless even better. Considerations around the total cost of ownership (TCO) were part of the discussion of many sessions, so we shared links to two whitepapers, this one by IDC and this one by Deloitte, that help companies evaluate the real impact of their technology choices and understand the long-term return on investment (ROI) of choosing a serverless-first approach.
  • Adrian Cockcroft, VP of Cloud Architecture Strategy, highlighted an array of objections to using serverless and the multiple ways in which we’ve solved for each of them. He paired these solutions with insightful sessions delivered at re:Invent last year, including Serverless at scale: Design patterns and optimizations and Moving to event-driven architectures. He also discussed “relics” from the past… There are indeed ways to migrate an IBM mainframe to microservices on AWS Lambda and to refactor a U.S. Department of Defense mainframe to AWS!
  • Gillian McCann, Head of Cloud Engineering and AI at Workgrid Software, delivered a compelling behind-the-scenes story of Workgrid Software’s learnings, challenges, and successes using a serverless-first approach. I enjoy real-life customer stories; they are a great way to learn and, for us at AWS, to get feedback.

Watch (Again) and Join Us on May 28
It was great to see all the questions, answers, and content being shared via the live chat. You can watch (or re-watch) the sessions from May 21 on-demand here on Twitch.

There’s still time to join us for the second event on May 28! You can find the full agenda and register here.

Danilo

New – Enhanced Amazon Macie Now Available with Substantially Reduced Pricing

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-enhanced-amazon-macie-now-available/

Amazon Macie is a fully managed service that helps you discover and protect your sensitive data, using machine learning to automatically spot and classify data for you.

Over time, Macie customers told us what they liked, and what they didn’t. The service team has worked hard to address this feedback, and today I am very happy to share that we are making available a new, enhanced version of Amazon Macie!

This new version has simplified the pricing plan: you are now charged based on the number of Amazon Simple Storage Service (S3) buckets that are evaluated, and the amount of data processed for sensitive data discovery jobs. The new tiered pricing plan has reduced the price by 80%. With higher volumes, you can reduce your costs by more than 90%.

At the same time, we have introduced many new features:

  • Expanded sensitive data discovery, including updated machine learning models for personally identifiable information (PII) detection, and customer-defined sensitive data types using regular expressions.
  • Multi-account support with AWS Organizations.
  • Full API coverage for programmatic use of the service with AWS SDKs and AWS Command Line Interface (CLI).
  • Expanded regional availability to 17 Regions.
  • A new, simplified free tier and free trial to help you get started and understand your costs.
  • A completely redesigned console and user experience.

Macie is now tightly integrated with S3 in the backend, providing more advantages:

  • Enabling S3 data events in AWS CloudTrail is no longer a requirement, further reducing overall costs.
  • There is now a continual evaluation of all buckets, issuing security findings for any public buckets, unencrypted buckets, and buckets shared with (or replicated to) an AWS account outside of your Organization.

The anomaly detection features monitoring S3 data access activity previously available in Macie are now in private beta as part of Amazon GuardDuty, and have been enhanced to include deeper capabilities to protect your data in S3.

Enabling Amazon Macie
In the Macie console, I select Enable Macie. If you use AWS Organizations, you can delegate an AWS account to administer Macie for your Organization.

After it has been enabled, Amazon Macie automatically provides a summary of my S3 buckets in the region, and continually evaluates those buckets to generate actionable security findings for any unencrypted or publicly accessible data, including buckets shared with AWS accounts outside of my Organization.

Below the summary, I see the top findings by type and by S3 bucket. Overall, this page provides a great overview of the status of my S3 buckets.

In the Findings section I have the full list of findings, and I can select them to archive, unarchive, or export them. I can also select one of the findings to see the full information collected by Macie.

Findings can be viewed in the web console and are sent to Amazon CloudWatch Events for easy integration with existing workflow or event management systems, or to be used in combination with AWS Step Functions to take automated remediation actions. This can help meet regulations such as the Payment Card Industry Data Security Standard (PCI DSS), the Health Insurance Portability and Accountability Act (HIPAA), the General Data Protection Regulation (GDPR), and the California Consumer Privacy Act (CCPA).
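
As a sketch of that integration, assuming the findings are published with the aws.macie event source (check the Macie documentation for the exact event format) and that an SNS topic is already available, a rule forwarding findings to the topic could look like this:

import json
import boto3

events = boto3.client('events')

# Hypothetical SNS topic receiving Macie findings
# (the topic's access policy must allow events.amazonaws.com to publish)
TOPIC_ARN = 'arn:aws:sns:us-east-1:123456789012:macie-findings'

# Match events published by Macie; the source value is an assumption
events.put_rule(
    Name='macie-findings-to-sns',
    EventPattern=json.dumps({'source': ['aws.macie']})
)

events.put_targets(
    Rule='macie-findings-to-sns',
    Targets=[{'Id': 'sns-target', 'Arn': TOPIC_ARN}]
)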

In the S3 Buckets section, I can search and filter on buckets of interest to create sensitive data discovery jobs across one or multiple buckets to discover sensitive data in objects, and to check encryption status and public accessibility at object level. Jobs can be executed once, or scheduled daily, weekly, or monthly.

For jobs, Amazon Macie automatically tracks changes to the buckets and only evaluates new or modified objects over time. In the additional settings, I can include or exclude objects based on tags, size, file extensions, or last modified date.

To monitor my costs, and the use of the free trial, I look at the Usage section of the console.

Creating Custom Data Identifiers
Amazon Macie natively supports the most common sensitive data types, including personally identifiable information (PII) and credential data. You can extend that list with custom data identifiers to discover proprietary or unique sensitive data for your business.

For example, companies often have a specific syntax for their employee IDs. A possible syntax is a capital letter, defining whether this is a full-time or part-time employee, followed by a dash and then eight digits. Possible values in this case are F-12345678 or P-87654321.

To create this custom data identifier, I enter a regular expression (regex) to describe the pattern to match:

[A-Z]-\d{8}

To avoid false positives, I require that the employee keyword is found near the identifier (by default, less than 50 characters apart). I use the Evaluate box to test that this configuration works with sample text, then I select Submit.
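
Before submitting, I can also sanity check the pattern locally. Here is a minimal Python sketch; the proximity check is only a rough approximation of how Macie uses keywords:

import re

EMPLOYEE_ID = re.compile(r'[A-Z]-\d{8}')
KEYWORD = 'employee'
MAX_DISTANCE = 50  # characters between keyword and identifier

samples = [
    'New employee record: F-12345678',   # should produce a finding
    'Part-time employee P-87654321',     # should produce a finding
    'Invoice number 12345678',           # no identifier, no finding
]

for text in samples:
    match = EMPLOYEE_ID.search(text)
    keyword_nearby = (
        match is not None and
        KEYWORD in text.lower() and
        abs(text.lower().find(KEYWORD) - match.start()) <= MAX_DISTANCE
    )
    print(f'{text!r} -> {match.group(0) if keyword_nearby else "no finding"}')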

Available Now
For Amazon Macie regional availability, please see the AWS Region Table. You can find more information on how to use the new enhanced Macie in the documentation.

This release of Amazon Macie remains optimized for S3. However, anything you can get into S3, permanently or temporarily, in an object format supported by Macie, can be scanned for sensitive data. This allows you to expand the coverage to data residing outside of S3 by pulling data out of custom applications, databases, and third-party services, temporarily placing it in S3, and using Amazon Macie to identify sensitive data.

For example, we’ve made this even easier with RDS and Aurora now supporting snapshots to S3 in Apache Parquet, which is a format Macie supports. Similarly, in DynamoDB, you can use AWS Glue to export tables to S3, which can then be scanned by Macie. With the new API and SDK coverage, you can use the new enhanced Amazon Macie as a building block in an automated process exporting data to S3 to discover and protect your sensitive data across multiple sources.

Danilo