All posts by Danilo Poccia

Introducing Amazon SNS FIFO – First-In-First-Out Pub/Sub Messaging

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/introducing-amazon-sns-fifo-first-in-first-out-pub-sub-messaging/

When designing a distributed software architecture, it is important to define how services exchange information. For example, the use of asynchronous communication decouples components and simplifies scaling, reducing the impact of changes and making it easier to release new features.

The two most common forms of asynchronous service-to-service communication are message queues and publish/subscribe messaging:

  • With message queues, messages are stored on the queue until they are processed and deleted by a consumer. On AWS, Amazon Simple Queue Service (SQS) provides a fully managed message queuing service with no administrative overhead.
  • With pub/sub messaging, a message published to a topic is delivered to all subscribers to the topic. On AWS, Amazon Simple Notification Service (SNS) is a fully managed pub/sub messaging service that enables message delivery to a large number of subscribers. Each subscriber can also set a filter policy to receive only the messages that it cares about.

You can use topics when you want to fan out messages to multiple applications, and queues when you want to send messages to one application. Using topics and queues together, you can decouple microservices, distributed systems, and serverless applications.

With SQS, you can use FIFO (First-In-First-Out) queues to preserve the order in which messages are sent and received, and to ensure that a message is not processed more than once.

Introducing SNS FIFO Topics
Today, we are adding similar capabilities for pub/sub messaging with the introduction of SNS FIFO topics, providing strict message ordering and deduplicated message delivery to one or more subscribers.

FIFO topics manage ordering and deduplication similar to FIFO queues:

Ordering – You configure a message group by including a message group ID when publishing a message to a FIFO topic. For each message group ID, all messages are sent and delivered in order of their arrival. For example, to ensure the delivery of messages related to the same customer in order, you can publish these messages to the topic using the customer’s account number as the message group ID. There is no limit on the number of message groups with FIFO topics and queues. You don’t need to declare the message group ID in advance; any value will work. If you don’t have a logical distinction between messages, you can simply use the same message group ID for all and have a single group of ordered messages. The message group ID is passed to any subscribed FIFO queue.

Deduplication – Distributed systems (like SNS) and client applications sometimes generate duplicate messages. You can avoid duplicated message deliveries from the topic in two ways: either by enabling content-based deduplication on the topic, or by adding a deduplication ID to the messages that you publish. With message content-based deduplication, SNS uses a SHA-256 hash to generate the message deduplication ID using the body of the message. After a message with a specific deduplication ID is published successfully, there is a 5-minute interval during which any message with the same deduplication ID is accepted but not delivered. If you subscribe a FIFO queue to a FIFO topic, the deduplication ID is passed to the queue and it is used by SQS to avoid duplicate messages being received.
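To make the deduplication rule above concrete, here is a minimal in-memory sketch in Python. It only models the behavior described in this post (a SHA-256 hash of the message body used as the deduplication ID, and a 5-minute window). The class name DedupWindow and its accept method are hypothetical and are not part of any AWS SDK; the real service may differ in details such as what happens exactly at the window boundary.

```python
import hashlib

DEDUP_WINDOW_SECONDS = 5 * 60  # the 5-minute interval described above


class DedupWindow:
    """Toy model of content-based deduplication (hypothetical, not an AWS API)."""

    def __init__(self):
        self._seen = {}  # deduplication ID -> time the ID was first accepted

    def accept(self, body, now):
        """Return True if the message would be delivered, False if it is a duplicate."""
        dedup_id = hashlib.sha256(body.encode("utf-8")).hexdigest()
        first_seen = self._seen.get(dedup_id)
        if first_seen is not None and now - first_seen < DEDUP_WINDOW_SECONDS:
            return False  # accepted by the topic, but not delivered
        self._seen[dedup_id] = now
        return True


window = DedupWindow()
bodies = ["Update One", "Update Two", "Update Three", "Update One"]
# Publish one message per second; the fourth repeats the first body.
delivered = [body for t, body in enumerate(bodies) if window.accept(body, now=t)]
print(delivered)
```

In this model, the repeated body is dropped because it arrives well inside the window. The same body published after the window has passed would be delivered again.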

You can use FIFO topics and queues together to simplify the implementation of applications where the order of operations and events is critical, or when you cannot tolerate duplicates, for example to process financial operations and inventory updates, or to asynchronously apply commands that you receive from a client device. FIFO queues can use message filtering in FIFO topics to selectively receive only a subset of messages rather than every message published to the topic.

How to Use SNS FIFO Topics
A common scenario where FIFO topics can help is when you receive updates that need to be processed in order. For example, I can use a FIFO topic to receive updates from an application where my customers edit their account profiles. Then, I subscribe an SQS FIFO queue to the FIFO topic, and use the queue as a trigger for a Lambda function that applies the account updates to an Amazon DynamoDB table used by my Customer management system, which needs to be kept in sync.

The decoupling introduced by the FIFO topic makes it easier to add new functionality with minimal impact to existing applications. For example, to reward my loyal customers with additional promotions, I add a new Loyalty application that is storing information in a relational database managed by Amazon Aurora. To keep the customer’s information stored in the Loyalty database in sync with my other applications, I can subscribe a new FIFO queue to the same FIFO topic, and add a new Lambda function that receives customer updates in the same order as they are generated, and applies them to the Loyalty database. In this way, I don’t need to change code and configuration of other applications to integrate the new Loyalty app.

First, I create two FIFO queues in the SQS console, leaving all options to their defaults:

  • The customer.fifo queue to process updates in my Customer management system.
  • The loyalty.fifo queue to help me collect and store customer updates for the Loyalty application.

In the SNS console, I create the updates.fifo topic. I select FIFO as type, and enable Content-based message deduplication.

Then, I subscribe the customer.fifo and loyalty.fifo queues to the topic.
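The console steps above can also be scripted. The following Python sketch only builds the request parameters, so it runs without AWS credentials. The helper names fifo_setup_requests and subscribe_request are hypothetical. The attribute names (FifoQueue, FifoTopic, ContentBasedDeduplication) are the ones I would expect to pass to the SQS create_queue, SNS create_topic, and SNS subscribe calls in boto3; please check them against the current SDK documentation before relying on them.

```python
def fifo_setup_requests(topic_name, queue_names):
    """Return request parameters for one FIFO topic and its FIFO queues."""
    topic = {
        "Name": topic_name,  # FIFO names end with ".fifo"
        "Attributes": {
            "FifoTopic": "true",
            "ContentBasedDeduplication": "true",
        },
    }
    queues = [
        {"QueueName": name, "Attributes": {"FifoQueue": "true"}}
        for name in queue_names
    ]
    return topic, queues


def subscribe_request(topic_arn, queue_arn):
    """Return request parameters for subscribing a queue to a topic."""
    return {"TopicArn": topic_arn, "Protocol": "sqs", "Endpoint": queue_arn}


topic, queues = fifo_setup_requests(
    "updates.fifo", ["customer.fifo", "loyalty.fifo"])
print(topic["Name"], [q["QueueName"] for q in queues])
# With boto3 installed and credentials configured, these could be used as:
#   boto3.client("sqs").create_queue(**queues[0])
#   boto3.client("sns").create_topic(**topic)
```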

To be able to receive messages, I add a statement to the access policy of both queues granting the updates.fifo topic permissions to send messages to the queues. For example, for the customer.fifo queue the statement is:

{
  "Effect": "Allow",
  "Principal": {
    "Service": "sns.amazonaws.com"
  },
  "Action": "SQS:SendMessage",
  "Resource": "arn:aws:sqs:us-east-2:123412341234:customer.fifo",
  "Condition": {
    "ArnLike": {
      "aws:SourceArn": "arn:aws:sns:us-east-2:123412341234:updates.fifo"
    }
  }
}
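Since the same statement is needed for both queues, it can be generated instead of edited by hand. This is a small Python sketch that reproduces the statement above for any queue and topic ARN and wraps it in a complete policy document. The function name queue_policy is hypothetical, and the "Version" value is the usual policy language version that I am assuming here.

```python
import json


def queue_policy(queue_arn, topic_arn):
    """Build an access policy letting one SNS topic send messages to one queue."""
    statement = {
        "Effect": "Allow",
        "Principal": {"Service": "sns.amazonaws.com"},
        "Action": "SQS:SendMessage",
        "Resource": queue_arn,
        "Condition": {"ArnLike": {"aws:SourceArn": topic_arn}},
    }
    return {"Version": "2012-10-17", "Statement": [statement]}


TOPIC_ARN = "arn:aws:sns:us-east-2:123412341234:updates.fifo"
for queue in ("customer.fifo", "loyalty.fifo"):
    queue_arn = "arn:aws:sqs:us-east-2:123412341234:" + queue
    policy_json = json.dumps(queue_policy(queue_arn, TOPIC_ARN), indent=2)
    print(policy_json)
    # The JSON string could then be set as the queue's "Policy" attribute, e.g.
    #   boto3.client("sqs").set_queue_attributes(
    #       QueueUrl=..., Attributes={"Policy": policy_json})
```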

Now, I use the SNS console to publish 4 messages in sequence. For all messages, I use the same message group ID. In this way, they are all in the same message group. The only part that is different is the message body, where I use in order:

  • Update One
  • Update Two
  • Update Three
  • Update One
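The same four publishes could be prepared programmatically. The Python sketch below only builds the parameters for each call, so it runs offline. The group ID value and the helper name build_publish_params are hypothetical. No deduplication ID is set because content-based deduplication is enabled on the topic.

```python
TOPIC_ARN = "arn:aws:sns:us-east-2:123412341234:updates.fifo"
GROUP_ID = "customer-updates"  # hypothetical; any value works


def build_publish_params(topic_arn, group_id, body):
    """Build keyword arguments for publishing one message to a FIFO topic."""
    return {
        "TopicArn": topic_arn,
        "Message": body,
        # Messages sharing a group ID are delivered in the order published.
        "MessageGroupId": group_id,
    }


bodies = ["Update One", "Update Two", "Update Three", "Update One"]
requests = [build_publish_params(TOPIC_ARN, GROUP_ID, body) for body in bodies]
print(len(requests), "publish requests prepared")
# With boto3 installed and credentials configured, they could be sent in order:
#   sns = boto3.client("sns")
#   for params in requests:
#       sns.publish(**params)
```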

In the SQS console, I see that only 3 messages have been delivered to the FIFO queues:

Why is that? When I created the FIFO topic, I enabled content-based deduplication. The 4 messages were sent within the 5-minute deduplication window. The last message was recognized as a duplicate of the first one and was not delivered to the subscribed queues.

Let’s see the actual messages in the queues. I use the AWS Command Line Interface (CLI) to receive the messages from SQS, and the jq command-line JSON processor to format the output and get only the Message in the Body.

Here are the messages in the customer.fifo queue:

$ aws sqs receive-message --queue-url https://sqs.us-east-2.amazonaws.com/123412341234/customer.fifo --max-number-of-messages 10 | jq '.Messages[].Body | fromjson | .Message'

"Update One"
"Update Two"
"Update Three"

And these are the messages in the loyalty.fifo queue:

$ aws sqs receive-message --queue-url https://sqs.us-east-2.amazonaws.com/123412341234/loyalty.fifo --max-number-of-messages 10 | jq '.Messages[].Body | fromjson | .Message'

"Update One"
"Update Two"
"Update Three"

As expected, the 3 messages with unique content have been delivered to both queues in the same order as they were sent.
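If you prefer Python to jq, the same extraction is a few lines of code. This sketch mirrors the filter used above ('.Messages[].Body | fromjson | .Message'): each Body is a JSON document produced by SNS, and the published text is in its Message field. The sample_response dictionary is a hypothetical, heavily trimmed stand-in for the output of receive-message.

```python
import json


def extract_messages(response):
    """Return the SNS Message field from each SQS message body, in order."""
    return [
        json.loads(message["Body"])["Message"]
        for message in response.get("Messages", [])
    ]


# Hypothetical, trimmed example of what receive-message returns.
sample_response = {
    "Messages": [
        {"Body": json.dumps({"Type": "Notification", "Message": text})}
        for text in ("Update One", "Update Two", "Update Three")
    ]
}

print(extract_messages(sample_response))
```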

Available Now
You can use SNS FIFO topics in all commercial regions. You can process up to 300 transactions per second (TPS) per FIFO topic or FIFO queue. With SNS, you pay only for what you use. You can find more information on the pricing page.

To learn more, please see the documentation.

Danilo

Store and Access Time Series Data at Any Scale with Amazon Timestream – Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/store-and-access-time-series-data-at-any-scale-with-amazon-timestream-now-generally-available/

Time series are a very common data format that describes how things change over time. Some of the most common sources are industrial machines and IoT devices, IT infrastructure stacks (such as hardware, software, and networking components), and applications that share their results over time. Managing time series data efficiently is not easy because the data model doesn’t fit general-purpose databases.

For this reason, I am happy to share that Amazon Timestream is now generally available. Timestream is a fast, scalable, and serverless time series database service that makes it easy to collect, store, and process trillions of time series events per day up to 1,000 times faster and at as little as 1/10th the cost of a relational database.

This is made possible by the way Timestream manages data: recent data is kept in memory and historical data is moved to cost-optimized storage based on a retention policy you define. All data is always automatically replicated across multiple availability zones (AZ) in the same AWS region. New data is written to the memory store, where data is replicated across three AZs before returning success of the operation. Data replication is quorum based such that the loss of nodes, or an entire AZ, does not disrupt durability or availability. In addition, data in the memory store is continuously backed up to Amazon Simple Storage Service (S3) as an extra precaution.

Queries automatically access and combine recent and historical data across tiers without the need to specify the storage location, and support time series-specific functionalities to help you identify trends and patterns in data in near real time.

There are no upfront costs; you pay only for the data you write, store, or query. Based on the load, Timestream automatically scales up or down to adjust capacity, without the need to manage the underlying infrastructure.

Timestream integrates with popular services for data collection, visualization, and machine learning, making it easy to use with existing and new applications. For example, you can ingest data directly from AWS IoT Core, Amazon Kinesis Data Analytics for Apache Flink, AWS IoT Greengrass, and Amazon MSK. You can visualize data stored in Timestream from Amazon QuickSight, and use Amazon SageMaker to apply machine learning algorithms to time series data, for example for anomaly detection. You can use Timestream fine-grained AWS Identity and Access Management (IAM) permissions to easily ingest or query data from an AWS Lambda function. We are providing the tools to use Timestream with open source platforms such as Apache Kafka, Telegraf, Prometheus, and Grafana.

Using Amazon Timestream from the Console
In the Timestream console, I select Create database. I can choose to create a Standard database or a Sample database populated with sample data. I proceed with a standard database and I name it MyDatabase.

All Timestream data is encrypted by default. I use the default master key, but you can use a customer managed key that you created using AWS Key Management Service (KMS). In that way, you can control the rotation of the master key, and who has permissions to use or manage it.

I complete the creation of the database. Now my database is empty. I select Create table and name it MyTable.

Each table has its own data retention policy. First, data is ingested in the memory store, where it can be stored from a minimum of one hour to a maximum of a year. After that, it is automatically moved to the magnetic store, where it can be kept from a minimum of one day to a maximum of 200 years, after which it is deleted. In my case, I select 1 hour of memory store retention and 5 years of magnetic store retention.

When writing data in Timestream, you cannot insert data that is older than the retention period of the memory store. For example, in my case I will not be able to insert records older than 1 hour. Similarly, you cannot insert data with a future timestamp.
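The retention settings and the write rule above can be expressed in a few lines of Python. This is a simplified sketch: the function names are hypothetical, 5 years is approximated as 5 x 365 days, and the check ignores any tolerance the service may apply around the boundaries. The two key names in retention_properties are the ones I would expect the Timestream CreateTable API to use for its retention properties.

```python
MEMORY_RETENTION_HOURS = 1          # value selected for MyTable
MAGNETIC_RETENTION_DAYS = 5 * 365   # 5 years, assuming 365-day years


def retention_properties(memory_hours, magnetic_days):
    """Build the retention settings for a table."""
    return {
        "MemoryStoreRetentionPeriodInHours": memory_hours,
        "MagneticStoreRetentionPeriodInDays": magnetic_days,
    }


def is_writable(record_time_ms, now_ms, memory_hours=MEMORY_RETENTION_HOURS):
    """Return True if a record is neither too old nor in the future."""
    oldest_allowed_ms = now_ms - memory_hours * 3600 * 1000
    return oldest_allowed_ms <= record_time_ms <= now_ms


now_ms = 1_600_000_000_000  # arbitrary fixed "current time" for the example
print(retention_properties(MEMORY_RETENTION_HOURS, MAGNETIC_RETENTION_DAYS))
print(is_writable(now_ms - 30 * 60 * 1000, now_ms))      # 30 minutes old
print(is_writable(now_ms - 2 * 3600 * 1000, now_ms))     # 2 hours old
print(is_writable(now_ms + 60 * 1000, now_ms))           # 1 minute ahead
```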

I complete the creation of the table. As you noticed, I was not asked for a data schema. Timestream will automatically infer that as data is ingested. Now, let’s put some data in the table!

Loading Data in Amazon Timestream
Each record in a Timestream table is a single data point in the time series and contains:

  • The measure name, type, and value. Each record can contain a single measure, but different measure names and types can be stored in the same table.
  • The timestamp of when the measure was collected, with nanosecond granularity.
  • Zero or more dimensions that describe the measure and can be used to filter or aggregate data. Records in a table can have different dimensions.

For example, let’s build a simple monitoring application collecting CPU, memory, swap, and disk usage from a server. Each server is identified by a hostname and has a location expressed as a country and a city.

In this case, the dimensions would be the same for all records:

  • country
  • city
  • hostname

Records in the table are going to measure different things. The measure names I use are:

  • cpu_utilization
  • memory_utilization
  • swap_utilization
  • disk_utilization

Measure type is DOUBLE for all of them.

For the monitoring application, I am using Python. To collect monitoring information I use the psutil module that I can install with:

pip3 install psutil

Here’s the code for the collect.py application:

import time
import boto3
import psutil

from botocore.config import Config

DATABASE_NAME = "MyDatabase"
TABLE_NAME = "MyTable"

COUNTRY = "UK"
CITY = "London"
HOSTNAME = "MyHostname" # You can make it dynamic using socket.gethostname()

INTERVAL = 1 # Seconds

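def prepare_record(measure_name, measure_value):
    # current_time and dimensions are module-level names set in the main block below
    record = {
        'Time': str(current_time),  # milliseconds since the epoch
        'Dimensions': dimensions,
        'MeasureName': measure_name,
        'MeasureValue': str(measure_value),
        'MeasureValueType': 'DOUBLE'
    }
    return record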


def write_records(records):
    try:
        result = write_client.write_records(DatabaseName=DATABASE_NAME,
                                            TableName=TABLE_NAME,
                                            Records=records,
                                            CommonAttributes={})
        status = result['ResponseMetadata']['HTTPStatusCode']
        print("Processed %d records. WriteRecords Status: %s" %
              (len(records), status))
    except Exception as err:
        print("Error:", err)


if __name__ == '__main__':

    session = boto3.Session()
    write_client = session.client('timestream-write', config=Config(
        read_timeout=20, max_pool_connections=5000, retries={'max_attempts': 10}))
    query_client = session.client('timestream-query')

    dimensions = [
        {'Name': 'country', 'Value': COUNTRY},
        {'Name': 'city', 'Value': CITY},
        {'Name': 'hostname', 'Value': HOSTNAME},
    ]

    records = []

    while True:

        current_time = int(time.time() * 1000)
        cpu_utilization = psutil.cpu_percent()
        memory_utilization = psutil.virtual_memory().percent
        swap_utilization = psutil.swap_memory().percent
        disk_utilization = psutil.disk_usage('/').percent

        records.append(prepare_record('cpu_utilization', cpu_utilization))
        records.append(prepare_record(
            'memory_utilization', memory_utilization))
        records.append(prepare_record('swap_utilization', swap_utilization))
        records.append(prepare_record('disk_utilization', disk_utilization))

        print("records {} - cpu {} - memory {} - swap {} - disk {}".format(
            len(records), cpu_utilization, memory_utilization,
            swap_utilization, disk_utilization))

        if len(records) == 100:
            write_records(records)
            records = []

        time.sleep(INTERVAL)

I start the collect.py application. Every 100 records, data is written to the MyTable table:

$ python3 collect.py
records 4 - cpu 31.6 - memory 65.3 - swap 73.8 - disk 5.7
records 8 - cpu 18.3 - memory 64.9 - swap 73.8 - disk 5.7
records 12 - cpu 15.1 - memory 64.8 - swap 73.8 - disk 5.7
. . .
records 96 - cpu 44.1 - memory 64.2 - swap 73.8 - disk 5.7
records 100 - cpu 46.8 - memory 64.1 - swap 73.8 - disk 5.7
Processed 100 records. WriteRecords Status: 200
records 4 - cpu 36.3 - memory 64.1 - swap 73.8 - disk 5.7
records 8 - cpu 31.7 - memory 64.1 - swap 73.8 - disk 5.7
records 12 - cpu 38.8 - memory 64.1 - swap 73.8 - disk 5.7
. . .
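The script repeats the same dimensions and measure type in every record and passes an empty CommonAttributes value. As a possible refinement, the shared fields can be moved into the common attributes so that they are sent once per request. The Python sketch below shows the idea on plain dictionaries. The helper name split_common and the SHAREABLE tuple are hypothetical, and I am assuming that the service accepts these three fields as common attributes.

```python
SHAREABLE = ("Dimensions", "MeasureValueType", "Time")  # fields considered for sharing


def split_common(records):
    """Move fields that are identical in every record into common attributes."""
    if not records:
        return {}, []
    first = records[0]
    common = {
        key: first[key]
        for key in SHAREABLE
        if key in first and all(r.get(key) == first[key] for r in records)
    }
    trimmed = [
        {k: v for k, v in record.items() if k not in common}
        for record in records
    ]
    return common, trimmed


dimensions = [{"Name": "hostname", "Value": "MyHostname"}]
records = [
    {"Time": "1000", "Dimensions": dimensions, "MeasureName": "cpu_utilization",
     "MeasureValue": "31.6", "MeasureValueType": "DOUBLE"},
    {"Time": "2000", "Dimensions": dimensions, "MeasureName": "cpu_utilization",
     "MeasureValue": "18.3", "MeasureValueType": "DOUBLE"},
]
common, trimmed = split_common(records)
print(sorted(common))
# The result could then be passed as
#   write_client.write_records(..., Records=trimmed, CommonAttributes=common)
```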

Now, in the Timestream console, I see the schema of the MyTable table, automatically updated based on the data ingested:

Note that, since all measures in the table are of type DOUBLE, the measure_value::double column contains the value for all of them. If the measures were of different types (for example, INT or BIGINT) I would have more columns (such as measure_value::int and measure_value::bigint).

In the console, I can also see a recap of which kinds of measures I have in the table, their corresponding data types, and the dimensions used for each specific measure:

Querying Data from the Console
I can query time series data using SQL. The memory store is optimized for fast point-in-time queries, while the magnetic store is optimized for fast analytical queries. However, queries automatically process data on all stores (memory and magnetic) without having to specify the data location in the query.

I am running queries straight from the console, but I can also use JDBC connectivity to access the query engine. I start with a basic query to see the most recent records in the table:

SELECT * FROM MyDatabase.MyTable ORDER BY time DESC LIMIT 8

Let’s try something a little more complex. I want to see the average CPU utilization aggregated by hostname in 5-minute intervals for the last two hours. I filter records based on the content of measure_name. I use the function bin() to round time to a multiple of an interval size, and the function ago() to compare timestamps:

SELECT hostname,
       bin(time, 5m) as binned_time,
       avg(measure_value::double) as avg_cpu_utilization
  FROM MyDatabase.MyTable
 WHERE measure_name = 'cpu_utilization'
   AND time > ago(2h)
 GROUP BY hostname, bin(time, 5m)
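To see what the query computes, here is the same aggregation on a small list in plain Python. This is only an illustration of the logic, not how Timestream runs the query. I am assuming that bin() rounds a timestamp down to a multiple of the interval. The function names and the sample points are hypothetical.

```python
from collections import defaultdict


def bin_time(timestamp_s, interval_s):
    """Round a timestamp (in seconds) down to a multiple of the interval."""
    return timestamp_s - (timestamp_s % interval_s)


def average_by_bin(points, interval_s=300):
    """Average values per (hostname, 5-minute bin).

    points is an iterable of (hostname, timestamp_s, value) tuples.
    """
    groups = defaultdict(list)
    for hostname, timestamp_s, value in points:
        groups[(hostname, bin_time(timestamp_s, interval_s))].append(value)
    return {key: sum(values) / len(values) for key, values in groups.items()}


points = [
    ("MyHostname", 0, 10.0),
    ("MyHostname", 299, 30.0),   # same 5-minute bin as the first point
    ("MyHostname", 300, 50.0),   # first point of the next bin
]
print(average_by_bin(points))
```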

When collecting time series data you may miss some values. This is quite common especially for distributed architectures and IoT devices. Timestream has some interesting functions that you can use to fill in the missing values, for example using linear interpolation, or based on the last observation carried forward.
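The two strategies mentioned above are easy to compare on a short series. The following Python sketch is a conceptual illustration on evenly spaced samples, where None marks a missing value. It does not use the Timestream functions, and the names fill_locf and fill_linear are hypothetical.

```python
def fill_locf(values):
    """Fill gaps with the last observation carried forward."""
    filled, last = [], None
    for value in values:
        if value is not None:
            last = value
        filled.append(last)
    return filled


def fill_linear(values):
    """Fill gaps between two known values by linear interpolation.

    Gaps before the first or after the last known value are left as None.
    """
    filled = list(values)
    known = [i for i, value in enumerate(values) if value is not None]
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = values[left] + step * (i - left)
    return filled


series = [10.0, None, None, 40.0, None]
print(fill_locf(series))
print(fill_linear(series))
```

Note the difference on the trailing gap: carrying the last observation forward fills it, while interpolation has no right-hand value to work with and leaves it empty.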

More generally, Timestream offers many functions that help you to use mathematical expressions, manipulate strings, arrays, and date/time values, use regular expressions, and work with aggregations/windows.

To experience what you can do with Timestream, you can create a sample database and add the two IoT and DevOps datasets that we provide. Then, in the console query interface, look at the sample queries to get a glimpse of some of the more advanced functionalities:

Using Amazon Timestream with Grafana
One of the most interesting aspects of Timestream is the integration with many platforms. For example, you can visualize your time series data and create alerts using Grafana 7.1 or higher. The Timestream plugin is part of the open source edition of Grafana.

I add a new GrafanaDemo table to my database, and use another sample application to continuously ingest data. The application simulates performance data collected from a microservice architecture running on thousands of hosts.

I install Grafana on an Amazon Elastic Compute Cloud (EC2) instance and add the Timestream plugin using the Grafana CLI.

$ grafana-cli plugins install grafana-timestream-datasource

I use SSH Port Forwarding to access the Grafana console from my laptop:

$ ssh -L 3000:<EC2-Public-DNS>:3000 -N -f <user>@<EC2-Public-DNS>

In the Grafana console, I configure the plugin with the right AWS credentials, and the Timestream database and table. Now, I can select the sample dashboard, distributed as part of the Timestream plugin, using data from the GrafanaDemo table where performance data is continuously collected:

Available Now
Amazon Timestream is available today in US East (N. Virginia), Europe (Ireland), US West (Oregon), and US East (Ohio). You can use Timestream with the console, the AWS Command Line Interface (CLI), AWS SDKs, and AWS CloudFormation. With Timestream, you pay based on the number of writes, the data scanned by the queries, and the storage used. For more information, please see the pricing page.

You can find more sample applications in this repo. To learn more, please see the documentation. It’s never been easier to work with time series, including data ingestion, retention, access, and storage tiering. Let me know what you are going to build!

Danilo

New EC2 T4g Instances – Burstable Performance Powered by AWS Graviton2 – Try Them for Free

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-t4g-instances-burstable-performance-powered-by-aws-graviton2/

Two years ago, Amazon Elastic Compute Cloud (EC2) T3 instances were first made available, offering a very cost-effective way to run general purpose workloads. While current T3 instances offer sufficient compute performance for many use cases, many customers have told us that they have additional workloads that would benefit from increased peak performance and lower cost.

Today, we are launching T4g instances, a new generation of low cost burstable instance type powered by AWS Graviton2, a processor custom built by AWS using 64-bit Arm Neoverse cores. Using T4g instances you can enjoy a performance benefit of up to 40% at a 20% lower cost in comparison to T3 instances, providing the best price/performance for a broader spectrum of workloads.

T4g instances are designed for applications that don’t use CPU at full power most of the time, using the same credit model as T3 instances with unlimited mode enabled by default. Examples of production workloads that require high CPU performance only during times of heavy data processing are web/application servers, small/medium data stores, and many microservices. Compared to previous generations, the performance of T4g instances makes it possible to migrate additional workloads such as caching servers, search engine indexing, and e-commerce platforms.

T4g instances are available in 7 sizes providing up to 5 Gbps of network and up to 2.7 Gbps of Amazon Elastic Block Store (EBS) performance:

Name         vCPUs  Baseline Performance/vCPU  CPU Credits Earned/Hour  Memory
t4g.nano     2      5%                         6                        0.5 GiB
t4g.micro    2      10%                        12                       1 GiB
t4g.small    2      20%                        24                       2 GiB
t4g.medium   2      20%                        24                       4 GiB
t4g.large    2      30%                        36                       8 GiB
t4g.xlarge   4      40%                        96                       16 GiB
t4g.2xlarge  8      40%                        192                      32 GiB
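The credits column in the table follows from the other two. One CPU credit corresponds to one vCPU running at 100% for one minute, so an instance earns vCPUs x baseline x 60 credits per hour. The short Python check below recomputes the column from the values in the table. The function name is hypothetical.

```python
# name: (vCPUs, baseline performance per vCPU in percent, credits earned per hour)
T4G_SIZES = {
    "t4g.nano": (2, 5, 6),
    "t4g.micro": (2, 10, 12),
    "t4g.small": (2, 20, 24),
    "t4g.medium": (2, 20, 24),
    "t4g.large": (2, 30, 36),
    "t4g.xlarge": (4, 40, 96),
    "t4g.2xlarge": (8, 40, 192),
}


def credits_earned_per_hour(vcpus, baseline_percent):
    """Credits per hour = vCPUs x baseline fraction x 60 minutes."""
    return vcpus * baseline_percent * 60 // 100


for name, (vcpus, baseline, listed) in T4G_SIZES.items():
    computed = credits_earned_per_hour(vcpus, baseline)
    print(name, computed, "matches" if computed == listed else "differs")
```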

Free Trial
To make it easier to develop, test, and run your applications on T4g instances, all AWS customers are automatically enrolled in a free trial on the t4g.micro size. Starting September 2020 until December 31st 2020, you can run a t4g.micro instance and automatically get 750 free hours per month deducted from your bill, including any CPU credits during the free 750 hours of usage. The 750 hours are calculated in aggregate across all regions. For details on terms and conditions of the free trial, please refer to the EC2 FAQs.

During the free trial, have a look at this getting started guide on using the Arm-based AWS Graviton processors. There, you can find suggestions on how to build and optimize your applications, using different programming languages and operating systems, and on managing container-based workloads. Some of the tips are specific for the Graviton processor, but most of the content works generally for anyone using Arm to run their code.

Using T4g Instances
You can start an EC2 instance in different ways, for example using the EC2 console, the AWS Command Line Interface (CLI), AWS SDKs, or AWS CloudFormation. For my first T4g instance, I use the AWS CLI:

$ aws ec2 run-instances \
  --instance-type t4g.micro \
  --image-id ami-09a67037138f86e67 \
  --security-groups MySecurityGroup \
  --key-name my-key-pair
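The same launch can be prepared from Python. This sketch only builds the parameters, so it runs without credentials. The helper name run_instances_params is hypothetical, and the parameter names are the ones I would expect the boto3 EC2 run_instances call to accept, including the MinCount and MaxCount values that the CLI fills in by default.

```python
def run_instances_params(image_id, key_name, security_group,
                         instance_type="t4g.micro"):
    """Build keyword arguments for launching a single instance."""
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        "KeyName": key_name,
        "SecurityGroups": [security_group],
        "MinCount": 1,
        "MaxCount": 1,
    }


params = run_instances_params(
    image_id="ami-09a67037138f86e67",
    key_name="my-key-pair",
    security_group="MySecurityGroup",
)
print(params["InstanceType"])
# With boto3 installed and credentials configured:
#   boto3.client("ec2").run_instances(**params)
```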

The Amazon Machine Image (AMI) I am using is based on Amazon Linux 2. Other platforms are available, such as Ubuntu 18.04 or newer, Red Hat Enterprise Linux 8.0 or newer, and SUSE Linux Enterprise Server 15 or newer. You can find additional AMIs in the AWS Marketplace, for example Fedora, Debian, NetBSD, CentOS, and NGINX Plus. For containerized applications, Amazon ECS and Amazon Elastic Kubernetes Service optimized AMIs are available as well.

The security group I selected gives me SSH access to the instance. I connect to the instance and do a general update:

$ sudo yum update -y

Since the kernel has been updated, I reboot the instance.

I’d like to set up this instance as a development environment. I can use it to build new applications, or to recompile my existing apps to the 64-bit Arm architecture. To install most development tools, such as Git, GCC, and Make, I use this group of packages:

$ sudo yum groupinstall -y "Development Tools"

AWS is working with several open source communities to drive improvements to the performance of software stacks running on AWS Graviton2. For example, you can see our contributions to PHP for Arm64 in this post.

Using the latest versions helps you obtain maximum performance from your Graviton2-based instances. The amazon-linux-extras command enables new versions for some of my favorite programming environments:

$ sudo amazon-linux-extras enable golang1.11 corretto8 php7.4 python3.8 ruby2.6

The output of the amazon-linux-extras command tells me which packages to install with yum:

$ yum clean metadata
$ sudo yum install -y golang java-1.8.0-amazon-corretto \
  php-cli php-pdo php-fpm php-json php-mysqlnd \
  python38 ruby ruby-irb rubygem-rake rubygem-json rubygems

Let’s check the versions of the tools that I just installed:

$ go version
go version go1.13.14 linux/arm64
$ java -version
openjdk version "1.8.0_265"
OpenJDK Runtime Environment Corretto-8.265.01.1 (build 1.8.0_265-b01)
OpenJDK 64-Bit Server VM Corretto-8.265.01.1 (build 25.265-b01, mixed mode)
$ php -v
PHP 7.4.9 (cli) (built: Aug 21 2020 21:45:13) ( NTS )
Copyright (c) The PHP Group
Zend Engine v3.4.0, Copyright (c) Zend Technologies
$ python3.8 -V
Python 3.8.5
$ ruby -v
ruby 2.6.3p62 (2019-04-16 revision 67580) [aarch64-linux]
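The Go and Ruby outputs above report linux/arm64 and aarch64-linux. If a script needs to behave differently on Arm and x86 hosts, for example to pick the right binary to download, it can make the same check at run time. This is a small Python sketch; the helper name is_arm64 is hypothetical, and the set of names covers the values I would expect platform.machine() to return on 64-bit Arm systems.

```python
import platform

ARM64_NAMES = {"aarch64", "arm64"}  # names commonly reported for 64-bit Arm


def is_arm64(machine=None):
    """Return True if the given (or current) machine type is 64-bit Arm."""
    if machine is None:
        machine = platform.machine()
    return machine.lower() in ARM64_NAMES


print("Running on:", platform.machine(), "- 64-bit Arm:", is_arm64())
```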

It looks like I am ready to go! Many more packages are available with yum, such as MariaDB and PostgreSQL. If you’re interested in databases, you might also want to try the preview of Amazon RDS powered by AWS Graviton2 processors.

Available Now
T4g instances are available today in US East (N. Virginia, Ohio), US West (Oregon), Asia Pacific (Tokyo, Mumbai), and Europe (Frankfurt, Ireland).

You now have a broad choice of Graviton2-based instances to better optimize your workloads for cost and performance: low cost burstable general-purpose (T4g), general purpose (M6g), compute optimized (C6g) and memory optimized (R6g) instances. Local NVMe-based SSD storage options are also available.

You can use the free trial to develop new applications, or migrate your existing workloads to the AWS Graviton2 processor. Let me know how that goes!

Danilo

AWS Named as a Cloud Leader for the 10th Consecutive Year in Gartner’s Infrastructure & Platform Services Magic Quadrant

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-named-as-a-cloud-leader-for-the-10th-consecutive-year-in-gartners-infrastructure-platform-services-magic-quadrant/

At AWS, we strive to provide you a technology platform that allows for agile development, rapid deployment, and unlimited scale, so that you can free up your resources to focus on innovation for your customers. It’s greatly rewarding to see our efforts recognized not just by our customers, but also by leading analysts.

This year, Gartner announced a new Magic Quadrant for Cloud Infrastructure and Platform Services (CIPS). This is an evolution of their Magic Quadrant for Cloud Infrastructure as a Service (IaaS) for which AWS has been named as a Leader for nine consecutive years.

Customers are using the cloud in broad ways, beyond foundational compute, networking and storage services. We believe for this reason, Gartner is expanding the scope to include additional platform as a service (PaaS) capabilities, and is extending coverage for areas such as managed database services, serverless computing, and developer tools.

Today, I am happy to share that AWS has been named as a Leader in the Magic Quadrant for Cloud Infrastructure and Platform Services, and placed highest in Ability to Execute and furthest in Completeness of Vision.

More information on the features and factors that our customers examine when choosing a cloud provider is available in the full report.

Danilo

Gartner, Magic Quadrant for Cloud Infrastructure and Platform Services, Raj Bala, Bob Gill, Dennis Smith, David Wright, Kevin Ji, 1 September 2020 – Gartner does not endorse any vendor, product or service depicted in its research publications, and does not advise technology users to select only those vendors with the highest ratings or other designation. Gartner research publications consist of the opinions of Gartner’s research organization and should not be construed as statements of fact. Gartner disclaims all warranties, expressed or implied, with respect to this research, including any warranties of merchantability or fitness for a particular purpose.

New – Using Amazon GuardDuty to Protect Your S3 Buckets

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-using-amazon-guardduty-to-protect-your-s3-buckets/

As we anticipated in this post, the anomaly and threat detection for Amazon Simple Storage Service (S3) activities that was previously available in Amazon Macie has now been enhanced and reduced in cost by over 80% as part of Amazon GuardDuty. This expands GuardDuty threat detection coverage beyond workloads and AWS accounts to also help you protect your data stored in S3.

This new capability enables GuardDuty to continuously monitor and profile S3 data access events (usually referred to as data plane operations) and S3 configurations (control plane APIs) to detect suspicious activities such as requests coming from an unusual geo-location, disabling of preventative controls such as S3 block public access, or API call patterns consistent with an attempt to discover misconfigured bucket permissions. To detect possibly malicious behavior, GuardDuty uses a combination of anomaly detection, machine learning, and continuously updated threat intelligence. For your reference, here’s the full list of GuardDuty S3 threat detections.

When threats are detected, GuardDuty produces detailed security findings to the console and to Amazon EventBridge, making alerts actionable and easy to integrate into existing event management and workflow systems, or to use as triggers for automated remediation actions with AWS Lambda. You can optionally deliver findings to an S3 bucket to aggregate findings from multiple regions, and to integrate with third-party security analysis tools.
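As an illustration of the EventBridge integration, here is a minimal sketch of a Lambda-style handler in Python that picks out S3-related findings. Several things here are assumptions: the event pattern values, the idea that S3 finding types contain ":S3/", and the location of the type and severity fields under detail. The sample event is hypothetical and heavily trimmed, so please compare it with a real finding before using this logic.

```python
# Assumed EventBridge rule pattern for GuardDuty findings.
EVENT_PATTERN = {
    "source": ["aws.guardduty"],
    "detail-type": ["GuardDuty Finding"],
}


def is_s3_finding(event):
    """Return True if the finding type looks like an S3 detection."""
    return ":S3/" in event.get("detail", {}).get("type", "")


def handler(event, context=None):
    """Summarize S3 findings; ignore everything else."""
    if not is_s3_finding(event):
        return None
    detail = event["detail"]
    # A real function could notify a team or start a remediation here.
    return {"type": detail["type"], "severity": detail.get("severity")}


sample_event = {  # hypothetical, trimmed finding event
    "source": "aws.guardduty",
    "detail-type": "GuardDuty Finding",
    "detail": {"type": "Policy:S3/BucketBlockPublicAccessDisabled", "severity": 2},
}
print(handler(sample_event))
```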

If you are not using GuardDuty yet, S3 protection will be on by default when you enable the service. If you are using GuardDuty, you can simply enable this new capability with one click in the GuardDuty console or through the API. For simplicity, and to optimize your costs, GuardDuty has now been integrated directly with S3. In this way, you don’t need to manually enable or configure S3 data event logging in AWS CloudTrail to take advantage of this new capability. GuardDuty also intelligently processes only the data events that can be used to generate threat detections, significantly reducing the number of events processed and lowering your costs.

If you are part of a centralized security team that manages GuardDuty across your entire organization, you can manage all accounts from a single account using the integration with AWS Organizations.

Enabling S3 Protection for an AWS Account
I already have GuardDuty enabled for my AWS account in this region. Now, I want to add threat detection for my S3 buckets. In the GuardDuty console, I select S3 Protection and then Enable. That’s it. To be more protected, I repeat this process for all regions enabled in my account.
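
The same step can be scripted through the API. The sketch below assumes the boto3 GuardDuty client and its update_detector call with the DataSources parameter; the helper names are mine, and the list of regions is something you would supply yourself.

```python
def enable_s3_protection(guardduty):
    """Turn on S3 protection for every detector visible to this client."""
    updated = []
    for detector_id in guardduty.list_detectors()["DetectorIds"]:
        guardduty.update_detector(
            DetectorId=detector_id,
            DataSources={"S3Logs": {"Enable": True}},
        )
        updated.append(detector_id)
    return updated


def enable_in_regions(regions):
    """Repeat the step in each region; needs boto3 and AWS credentials."""
    import boto3  # imported lazily so the helper above can be tried offline

    return {
        region: enable_s3_protection(boto3.client("guardduty", region_name=region))
        for region in regions
    }
```

Passing the client in as an argument keeps the first helper easy to exercise with a stub before pointing it at a real account.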

After a few minutes, I start seeing new findings related to my S3 buckets. I can select each finding to get more information on the possible threat, including details on the source actor and the target action.

After a few days, I select the Usage section of the console to monitor the estimated monthly costs of GuardDuty in my account, including the new S3 protection. I can also see which S3 buckets contribute the most to the costs. Well, it turns out I didn’t have lots of traffic on my buckets recently.

Enabling S3 Protection for an AWS Organization
To simplify management of multiple accounts, GuardDuty uses its integration with AWS Organizations to allow you to delegate an account to be the administrator for GuardDuty for the whole organization.

Now, the delegated administrator can enable GuardDuty for all accounts in the organization in a region with one click. You can also set Auto-enable to ON to automatically include new accounts in the organization. If you prefer, you can add accounts by invitation. You can then go to the S3 Protection page under Settings to enable S3 protection for your entire organization.

When selecting Auto-enable, the delegated administrator can also choose to enable S3 protection automatically for new member accounts.
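
For delegated administrators who prefer the API, a sketch of the equivalent call follows. It assumes the boto3 update_organization_configuration call with the AutoEnable and DataSources parameters; the helper name and the detector ID passed to it are placeholders.

```python
def auto_enable_s3_protection(guardduty, detector_id):
    """Ask GuardDuty to enable itself and S3 protection for new member accounts."""
    params = {
        "DetectorId": detector_id,
        "AutoEnable": True,
        # Assumed parameter shape for the S3 data source setting.
        "DataSources": {"S3Logs": {"AutoEnable": True}},
    }
    guardduty.update_organization_configuration(**params)
    return params
```

The function returns the parameters it sent, which makes it simple to log what was changed in the administrator account.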

Available Now
As always, with Amazon GuardDuty, you only pay for the quantity of logs and events processed to detect threats. This includes API control plane events captured in CloudTrail, network flow captured in VPC Flow Logs, DNS request and response logs, and with S3 protection enabled, S3 data plane events. These sources are ingested by GuardDuty through internal integrations when you enable the service, so you don’t need to configure any of these sources directly. The service continually optimizes logs and events processed to reduce your cost, and displays your usage split by source in the console. If configured in multi-account, usage is also split by account.

There is a 30-day free trial for the new S3 threat detection capabilities. This also applies to accounts that already have GuardDuty enabled and add the new S3 protection capability. During the trial, the estimated cost based on your S3 data event volume is calculated in the GuardDuty console Usage tab. In this way, while you evaluate these new capabilities at no cost, you can understand what your monthly spend would be.

GuardDuty for S3 protection is available in all regions where GuardDuty is offered. For regional availability, please see the AWS Region Table. To learn more, please see the documentation.

Danilo

Find Your Most Expensive Lines of Code – Amazon CodeGuru Is Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/find-your-most-expensive-lines-of-code-amazon-codeguru-is-now-generally-available/

Bringing new applications into production, maintaining their code base as they grow and evolve, and at the same time responding to operational issues, is a challenging task. For this reason, you can find many ideas on how to structure your teams, on which methodologies to apply, and how to safely automate your software delivery pipeline.

At re:Invent last year, we introduced in preview Amazon CodeGuru, a developer tool powered by machine learning that helps you improve your applications and troubleshoot issues with automated code reviews and performance recommendations based on runtime data. During the last few months, many improvements have been launched, including a more cost-effective pricing model, support for Bitbucket repositories, and the ability to start the profiling agent using a command line switch, so that you no longer need to modify the code of your application, or add dependencies, to run the agent.

You can use CodeGuru in two ways:

  • CodeGuru Reviewer uses program analysis and machine learning to detect potential defects that are difficult for developers to find, and recommends fixes in your Java code. The code can be stored in GitHub (now also in GitHub Enterprise), AWS CodeCommit, or Bitbucket repositories. When you submit a pull request on a repository that is associated with CodeGuru Reviewer, it provides recommendations for how to improve your code. Each pull request corresponds to a code review, and each code review can include multiple recommendations that appear as comments on the pull request.
  • CodeGuru Profiler provides interactive visualizations and recommendations that help you fine-tune your application performance and troubleshoot operational issues using runtime data from your live applications. It currently supports applications written in Java virtual machine (JVM) languages such as Java, Scala, Kotlin, Groovy, Jython, JRuby, and Clojure. CodeGuru Profiler can help you find the most expensive lines of code, in terms of CPU usage or introduced latency, and suggest ways you can improve efficiency and remove bottlenecks. You can use CodeGuru Profiler in production, and when you test your application with a meaningful workload, for example in a pre-production environment.

Today, Amazon CodeGuru is generally available with the addition of many new features.

In CodeGuru Reviewer, we included the following:

  • Support for GitHub Enterprise – You can now scan your pull requests and get recommendations against your source code on GitHub Enterprise on-premises repositories, together with a description of what’s causing the issue and how to remediate it.
  • New types of recommendations to solve defects and improve your code – For example, checking input validation, to avoid issues that can compromise security and performance, and looking for multiple copies of code that do the same thing.

In CodeGuru Profiler, you can find these new capabilities:

  • Anomaly detection – We automatically detect anomalies in the application profile for those methods that represent the highest proportion of CPU time or latency.
  • Lambda function support – You can now profile AWS Lambda functions just like applications hosted on Amazon Elastic Compute Cloud (EC2) and containerized applications running on Amazon ECS and Amazon Elastic Kubernetes Service, including those using AWS Fargate.
  • Cost of issues in the recommendation report – Recommendations contain actionable resolution steps which explain what the problem is, the CPU impact, and how to fix the issue. To help you better prioritize your activities, you now have an estimation of the savings introduced by applying the recommendation.
  • Color-my-code – In the visualizations, to help you easily find your own code, we are coloring your methods differently from frameworks and other libraries you may use.
  • CloudWatch metrics and alerts – To keep track of and monitor efficiency issues that have been discovered.

Let’s see some of these new features at work!

Using CodeGuru Reviewer with a Lambda Function
I create a new repo in my GitHub account, and leave it empty for now. Locally, where I am developing a Lambda function using the Java 11 runtime, I initialize my Git repo and add only the README.md file to the master branch. In this way, I can add all the code as a pull request later and have it go through a code review by CodeGuru.

git init
git add README.md
git commit -m "First commit"

Now, I add the GitHub repo as origin, and push my changes to the new repo:

git remote add origin https://github.com/<my-user-id>/amazon-codeguru-sample-lambda-function.git
git push -u origin master

I associate the repository in the CodeGuru console:

When the repository is associated, I create a new dev branch, add all my local files to it, and push it remotely:

git checkout -b dev
git add .
git commit -m "Code added to the dev branch"
git push --set-upstream origin dev

In the GitHub console, I open a new pull request by comparing changes across the two branches, master and dev. I verify that the pull request is able to merge, then I create it.

Since the repository is associated with CodeGuru, a code review is listed as Pending in the Code reviews section of the CodeGuru console.

After a few minutes, the code review status is Completed, and CodeGuru Reviewer issues a recommendation on the same GitHub page where the pull request was created.

Oops! I am creating the Amazon DynamoDB service object inside the function invocation method. In this way, it cannot be reused across invocations. This is not efficient.

To improve the performance of my Lambda function, I follow the CodeGuru recommendation, and move the declaration of the DynamoDB service object to a static final attribute of the Java application object, so that it is instantiated only once, during function initialization. Then, I follow the link in the recommendation to learn more best practices for working with Lambda functions.
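
My function is written in Java, but the idea behind the recommendation is not specific to one language. As a sketch in Python, the snippet below uses a stand-in FakeDynamoDBClient class (hypothetical, not an SDK class) to show the difference between creating the client once at initialization and creating it on every invocation.

```python
class FakeDynamoDBClient:
    """Stand-in for an SDK client; counts how many times it is constructed."""

    instances = 0

    def __init__(self):
        FakeDynamoDBClient.instances += 1

    def put_item(self, item):
        return {"stored": item}


# Created once, when the execution environment loads the module.
CLIENT = FakeDynamoDBClient()


def handler_reusing_client(event, context):
    # Recommended: reuse the client created during initialization.
    return CLIENT.put_item(event)


def handler_creating_client(event, context):
    # Pattern flagged in the review: a new client on every invocation.
    return FakeDynamoDBClient().put_item(event)
```

With a real SDK client, the construction cost includes loading configuration and setting up connections, which is why moving it out of the invocation path helps.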

Using CodeGuru Profiler with a Lambda Function
In the CodeGuru console, I create a MyServerlessApp-Development profiling group and select the Lambda compute platform.

Next, I give the AWS Identity and Access Management (IAM) role used by my Lambda function permissions to submit data to this profiling group.

Now, the console is giving me all the info I need to profile my Lambda function. To configure the profiling agent, I use a couple of environment variables:

  • AWS_CODEGURU_PROFILER_GROUP_ARN to specify the ARN of the profiling group to use.
  • AWS_CODEGURU_PROFILER_ENABLED to enable (TRUE) or disable (FALSE) profiling.

I follow the instructions (for Maven and Gradle) to add a dependency, and include the profiling agent in the build. Then, I update the code of the Lambda function to wrap the handler function inside the LambdaProfiler provided by the agent.

To generate some load, I start a few scripts invoking my function using the Amazon API Gateway as trigger. After a few minutes, the profiling group starts to show visualizations describing the runtime behavior of my Lambda function.

For example, I can see how much CPU time is spent in the different methods of my function. At the bottom, there are the entry point methods. As I scroll up, I find methods that are called deeper in the stack trace. I right-click and hide the LambdaRuntimeClient methods to focus on my code. Note that my methods are colored differently than those in the packages I am using, such as the AWS SDK for Java.
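
To get an intuition for what this kind of visualization represents, here is a small sketch of how sampled stack traces can be aggregated per method. This is not how CodeGuru Profiler is implemented, and the method names in the usage example are made up; it only illustrates why entry points sit at the bottom with the widest bars.

```python
from collections import Counter


def aggregate_samples(stacks):
    """Count samples per method: 'total' includes callees, 'self' is the top frame only."""
    total, self_time = Counter(), Counter()
    for stack in stacks:  # each stack is ordered from entry point to leaf
        for method in set(stack):  # set() avoids double counting recursive calls
            total[method] += 1
        self_time[stack[-1]] += 1
    return total, self_time


def share(counter, method, sample_count):
    """Fraction of all samples attributed to a method."""
    return counter[method] / sample_count
```

For example, with four samples where a hypothetical handler appears in all of them and saveItem in three, handler spans the full width of the graph and saveItem covers three quarters of it.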

I am mostly interested in what happens in the handler method invoked by the Lambda platform. I select the handler method, and now it becomes the new “base” of the visualization.

As I move my pointer on each of my methods, I get more information, including an estimation of the yearly cost of running that specific part of the code in production, based on the load experienced by the profiling agent during the selected time window. In my case, the handler function cost is estimated to be $6. If I select the two main functions above, I have an estimation of $3 each. The cost estimation works for code running on Lambda functions, EC2 instances, and containerized applications.

Similarly, I can visualize Latency, to understand how much time is spent inside the methods in my code. I keep the Lambda function handler method selected to drill down into what is under my control, and see where time is being spent the most.

The CodeGuru Profiler is also providing a recommendation based on the data collected. I am spending too much time (more than 4%) in managing encryption. I can use a more efficient crypto provider, such as the open source Amazon Corretto Crypto Provider, described in this blog post. This should lower the time spent to what is expected, about 1% of my profile.

Finally, I edit the profiling group to enable notifications. In this way, if CodeGuru detects an anomaly in the profile of my application, I am notified in one or more Amazon Simple Notification Service (SNS) topics.

Available Now
Amazon CodeGuru is available today in 10 regions, and we are working to add more regions in the coming months. For regional availability, please see the AWS Region Table.

CodeGuru helps you improve your application code and reduce compute and infrastructure costs with an automated code reviewer and application profiler that provide intelligent recommendations. Using visualizations based on runtime data, you can quickly find the most expensive lines of code of your applications. With CodeGuru, you pay only for what you use. Pricing is based on the lines of code analyzed by CodeGuru Reviewer, and on sampling hours for CodeGuru Profiler.

To learn more, please see the documentation.

Danilo

AWS Solutions Constructs – A Library of Architecture Patterns for the AWS CDK

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-solutions-constructs-a-library-of-architecture-patterns-for-the-aws-cdk/

Cloud applications are built using multiple components, such as virtual servers, containers, serverless functions, storage buckets, and databases. Being able to provision and configure these resources in a safe, repeatable way is incredibly important to automate your processes and let you focus on the unique parts of your implementation.

With the AWS Cloud Development Kit, you can leverage the expressive power of your favorite programming languages to model your applications. You can use high-level components called constructs, preconfigured with “sensible defaults” that you can customize, to quickly build a new application. The CDK provisions your resources using AWS CloudFormation to get all the benefits of managing your infrastructure as code. One of the reasons I like the CDK is that you can compose and share your own custom components as higher-level constructs.

As you can imagine, there are recurring patterns that can be useful to more than one customer. For this reason, today we are launching the AWS Solutions Constructs, an open source extension library for the CDK that provides well-architected patterns to help you build your unique solutions. CDK constructs mostly cover single services. AWS Solutions Constructs provide multi-service patterns that combine two or more CDK resources, and implement best practices such as logging and encryption.

Using AWS Solutions Constructs
To see the power of a pattern-based approach, let’s take a look at how that works when building a new application. As an example, I want to build an HTTP API to store data in an Amazon DynamoDB table. To keep the content of the table small, I can use DynamoDB Time to Live (TTL) to expire items after a few days. After the TTL expires, data is deleted from the table and sent, via DynamoDB Streams, to an AWS Lambda function to archive the expired data on Amazon Simple Storage Service (S3).

To build this application, I can use a few components:

  • An Amazon API Gateway endpoint for the API.
  • A DynamoDB table to store data.
  • A Lambda function to process the API requests, and store data in the DynamoDB table.
  • DynamoDB Streams to capture data changes.
  • A Lambda function processing data changes to archive the expired data.

Can I make it simpler? Looking at the available patterns in the AWS Solutions Constructs, I find two that can help me build my app:

  • aws-apigateway-lambda, a Construct that implements an API Gateway REST API connected to a Lambda function. As an example of the “sensible defaults” used by AWS Solutions Constructs, this pattern enables CloudWatch logging for the API Gateway.
  • aws-dynamodb-stream-lambda, a Construct implementing a DynamoDB table streaming data changes to a Lambda function with the least privileged permissions.

To build the final architecture, I simply connect those two Constructs together:

I am using TypeScript to define the CDK stack, and Node.js for the Lambda functions. Let’s start with the CDK stack:

 

import * as cdk from '@aws-cdk/core';
import * as lambda from '@aws-cdk/aws-lambda';
import * as apigw from '@aws-cdk/aws-apigateway';
import * as dynamodb from '@aws-cdk/aws-dynamodb';
import { ApiGatewayToLambda } from '@aws-solutions-constructs/aws-apigateway-lambda';
import { DynamoDBStreamToLambda } from '@aws-solutions-constructs/aws-dynamodb-stream-lambda';

export class DemoConstructsStack extends cdk.Stack {
  constructor(scope: cdk.Construct, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    const apiGatewayToLambda = new ApiGatewayToLambda(this, 'ApiGatewayToLambda', {
      deployLambda: true,
      lambdaFunctionProps: {
        code: lambda.Code.fromAsset('lambda'),
        runtime: lambda.Runtime.NODEJS_12_X,
        handler: 'restApi.handler'
      },
      apiGatewayProps: {
        defaultMethodOptions: {
          authorizationType: apigw.AuthorizationType.NONE
        }
      }
    });

    const dynamoDBStreamToLambda = new DynamoDBStreamToLambda(this, 'DynamoDBStreamToLambda', {
      deployLambda: true,
      lambdaFunctionProps: {
        code: lambda.Code.fromAsset('lambda'),
        runtime: lambda.Runtime.NODEJS_12_X,
        handler: 'processStream.handler'
      },
      dynamoTableProps: {
        tableName: 'my-table',
        partitionKey: { name: 'id', type: dynamodb.AttributeType.STRING },
        timeToLiveAttribute: 'ttl'
      }
    });

    const apiFunction = apiGatewayToLambda.lambdaFunction;
    const dynamoTable = dynamoDBStreamToLambda.dynamoTable;

    dynamoTable.grantReadWriteData(apiFunction);
    apiFunction.addEnvironment('TABLE_NAME', dynamoTable.tableName);
  }
}

At the beginning of the stack, I import the standard CDK constructs for the Lambda function, the API Gateway endpoint, and the DynamoDB table. Then, I add the two patterns from the AWS Solutions Constructs, ApiGatewayToLambda and DynamoDBStreamToLambda.

After declaring the two ApiGatewayToLambda and DynamoDBStreamToLambda constructs, I store the Lambda function, created by the ApiGatewayToLambda construct, and the DynamoDB table, created by DynamoDBStreamToLambda, in two variables.

At the end of the stack, I “connect” the two patterns together by granting permissions to the Lambda function to read/write in the DynamoDB table, and add the name of the DynamoDB table to the environment of the Lambda function, so that it can be used in the function code to store data in the table.

The code of the two Lambda functions is in the lambda folder of the CDK application. I am using the Node.js 12 runtime.

The restApi.js function implements the API and writes data to the DynamoDB table. The URL path is used as partition key, and all the query string parameters in the URL are stored as attributes. The TTL for the item is computed by adding a time window of 7 days to the current time.

const { DynamoDB } = require("aws-sdk");

const docClient = new DynamoDB.DocumentClient();

const TABLE_NAME = process.env.TABLE_NAME;
const TTL_WINDOW = 7 * 24 * 60 * 60; // 7 days expressed in seconds

exports.handler = async function (event) {

  // queryStringParameters is null when the URL has no query string
  const item = event.queryStringParameters || {};
  item.id = event.pathParameters.proxy;

  const now = new Date(); 
  item.ttl = Math.round(now.getTime() / 1000) + TTL_WINDOW;

  let statusCode = 204;

  try {
    // put() rejects on failure, so the error is caught here instead of being read from the response
    await docClient.put({
      TableName: TABLE_NAME,
      Item: item
    }).promise();
  } catch (err) {
    console.error('request: ', JSON.stringify(event, undefined, 2));
    console.error('error: ', err);
    statusCode = 500;
  }

  return {
    statusCode: statusCode
  };
};
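
To check the mapping from request to item without deploying anything, the same logic can be sketched in Python. The build_item name and the optional now argument are mine, added only so that the TTL can be verified with a fixed timestamp.

```python
import time

TTL_WINDOW = 7 * 24 * 60 * 60  # 7 days expressed in seconds


def build_item(event, now=None):
    """Map an API Gateway proxy event to the item stored in the table."""
    now = time.time() if now is None else now
    # Query string parameters become attributes; they are None when absent.
    item = dict(event.get("queryStringParameters") or {})
    item["id"] = event["pathParameters"]["proxy"]  # URL path is the partition key
    item["ttl"] = round(now) + TTL_WINDOW  # epoch seconds, as expected by DynamoDB TTL
    return item
```

Calling build_item with a path of danilop and the name and company parameters returns an item with those two attributes plus id and ttl.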

The processStream.js function is processing data capture records from the DynamoDB Stream, looking for the items deleted by TTL. The archive functionality is not implemented in this sample code.

exports.handler = async function (event) {
  event.Records.forEach((record) => {
    console.log('Stream record: ', JSON.stringify(record, null, 2));
    // userIdentity is only present on records generated by a service, such as TTL deletions
    if (record.userIdentity &&
      record.userIdentity.type === "Service" &&
      record.userIdentity.principalId === "dynamodb.amazonaws.com") {

      // Record deleted by DynamoDB Time to Live (TTL)
      
      // I can archive the record to S3, for example using Kinesis Data Firehose.
    }
  });
};
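
As a possible first step toward the archive functionality, the sketch below (in Python, with helper names of my own) filters the TTL deletions and converts the old image of each item into a plain dictionary. It assumes that the stream is configured to include old images and, to stay short, it only handles string and number attributes.

```python
from decimal import Decimal

TTL_PRINCIPAL = "dynamodb.amazonaws.com"


def is_ttl_delete(record):
    """True for REMOVE records generated by the DynamoDB TTL process."""
    identity = record.get("userIdentity") or {}
    return (
        record.get("eventName") == "REMOVE"
        and identity.get("type") == "Service"
        and identity.get("principalId") == TTL_PRINCIPAL
    )


def plain_value(attr):
    """Convert a typed attribute such as {"S": "x"} or {"N": "1"} (S and N only)."""
    if "S" in attr:
        return attr["S"]
    if "N" in attr:
        return Decimal(attr["N"])
    raise ValueError(f"unsupported attribute type: {attr}")


def expired_items(event):
    """Collect the items deleted by TTL, ready to be written to an archive."""
    items = []
    for record in event.get("Records", []):
        if is_ttl_delete(record):
            old_image = record["dynamodb"].get("OldImage", {})
            items.append({key: plain_value(value) for key, value in old_image.items()})
    return items
```

The returned list could then be serialized and sent to S3, for example through Kinesis Data Firehose as suggested in the comment of the function.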

Let’s see if this works! First, I need to install all dependencies. To simplify dependencies, each release of AWS Solutions Constructs is linked to the corresponding version of the CDK. In this case, I am using version 1.46.0 for both the CDK and the AWS Solutions Constructs patterns. The first three commands install plain CDK constructs. The last two commands install the AWS Solutions Constructs patterns I am using for this application.

npm install @aws-cdk/aws-lambda@1.46.0
npm install @aws-cdk/aws-apigateway@1.46.0
npm install @aws-cdk/aws-dynamodb@1.46.0
npm install @aws-solutions-constructs/aws-apigateway-lambda@1.46.0
npm install @aws-solutions-constructs/aws-dynamodb-stream-lambda@1.46.0

Now, I build the application and use the CDK to deploy the application.

npm run build
cdk deploy

Towards the end of the output of the cdk deploy command, a green light is telling me that the deployment of the stack is completed. Right after that, in the Outputs, I find the endpoint of the API Gateway.

 ✅  DemoConstructsStack

Outputs:
DemoConstructsStack.ApiGatewayToLambdaLambdaRestApiEndpoint9800D4B5 = https://1a2c3c4d.execute-api.eu-west-1.amazonaws.com/prod/

I can now use curl to test the API:

curl "https://1a2c3c4d.execute-api.eu-west-1.amazonaws.com/prod/danilop?name=Danilo&company=AWS"

Let’s have a look at the DynamoDB table:

The item is stored, and the TTL is set. After a week, the item will be deleted and sent via DynamoDB Streams to the processStream.js function.

After I complete my testing, I use the CDK again to quickly delete all resources created for this application:

cdk destroy

Available Now
The AWS Solutions Constructs are available now for TypeScript and Python. The AWS Solutions Builders team is working to make these constructs also available when using Java and C# with the CDK, so stay tuned. There is no cost in using the AWS Solutions Constructs or the CDK; you only pay for the resources created when deploying the stack.

In this first release, 25 patterns are included, covering lots of different use cases. Which new patterns and features should we focus on now? Give us your feedback in the open source project repository!

Danilo

New – A Shared File System for Your Lambda Functions

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-a-shared-file-system-for-your-lambda-functions/

I am very happy to announce that AWS Lambda functions can now mount an Amazon Elastic File System (EFS), a scalable and elastic NFS file system storing data within and across multiple availability zones (AZ) for high availability and durability. In this way, you can use a familiar file system interface to store and share data across all concurrent execution environments of one, or more, Lambda functions. EFS supports full file system access semantics, such as strong consistency and file locking.

To connect an EFS file system with a Lambda function, you use an EFS access point, an application-specific entry point into an EFS file system that includes the operating system user and group to use when accessing the file system, file system permissions, and can limit access to a specific path in the file system. This helps keep file system configuration decoupled from the application code.

You can access the same EFS file system from multiple functions, using the same or different access points. For example, using different EFS access points, each Lambda function can access different paths in a file system, or use different file system permissions.

You can share the same EFS file system with Amazon Elastic Compute Cloud (EC2) instances, containerized applications using Amazon ECS and AWS Fargate, and on-premises servers. Following this approach, you can use different computing architectures (functions, containers, virtual servers) to process the same files. For example, a Lambda function reacting to an event can update a configuration file that is read by an application running on containers. Or you can use a Lambda function to process files uploaded by a web application running on EC2.

In this way, some use cases are much easier to implement with Lambda functions. For example:

  • Processing or loading data larger than the space available in /tmp (512MB).
  • Loading the most updated version of files that change frequently.
  • Using data science packages that require storage space to load models and other dependencies.
  • Saving function state across invocations (using unique file names, or file system locks).
  • Building applications requiring access to large amounts of reference data.
  • Migrating legacy applications to serverless architectures.
  • Interacting with data intensive workloads designed for file system access.
  • Partially updating files (using file system locks for concurrent access).
  • Moving a directory and all its content within a file system with an atomic operation.
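
As an illustration of saving state across invocations with file system locks, here is a minimal sketch of a counter stored in a file. The increment_counter name is mine, and the path you pass would be somewhere under the mount point of your function; the sketch assumes a Linux environment where fcntl is available.

```python
import fcntl
import os


def increment_counter(path):
    """Read, increment, and write back an integer stored in a file, under an exclusive lock."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o640)
    with os.fdopen(fd, "r+") as state_file:
        fcntl.flock(state_file, fcntl.LOCK_EX)  # blocks until other writers are done
        try:
            current = int(state_file.read() or 0)  # an empty file counts as zero
            state_file.seek(0)
            state_file.truncate()
            state_file.write(str(current + 1))
            state_file.flush()  # write the new value before releasing the lock
        finally:
            fcntl.flock(state_file, fcntl.LOCK_UN)
    return current + 1
```

Because the read and the write happen while the lock is held, concurrent execution environments calling this function on the same file do not overwrite each other's updates.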

Creating an EFS File System
To mount an EFS file system, your Lambda functions must be connected to an Amazon Virtual Private Cloud that can reach the EFS mount targets. For simplicity, I am using here the default VPC that is automatically created in each AWS Region.

Note that, when connecting Lambda functions to a VPC, networking works differently. If your Lambda functions are using Amazon Simple Storage Service (S3) or Amazon DynamoDB, you should create a gateway VPC endpoint for those services. If your Lambda functions need to access the public internet, for example to call an external API, you need to configure a NAT Gateway. I usually don’t change the configuration of my default VPCs. If I have specific requirements, I create a new VPC with private and public subnets using the AWS Cloud Development Kit, or use one of these AWS CloudFormation sample templates. In this way, I can manage networking as code.

In the EFS console, I select Create file system and make sure that the default VPC and its subnets are selected. For all subnets, I use the default security group that gives network access to other resources in the VPC using the same security group.

In the next step, I give the file system a Name tag and leave all other options to their default values.

Then, I select Add access point. I use 1001 for the user and group IDs and limit access to the /message path. In the Owner section, used to create the folder automatically when first connecting to the access point, I use the same user and group IDs as before, and 750 for permissions. With these permissions, the owner can read, write, and execute files. Users in the same group can read and execute, but not write. Other users have no access.
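
If you want to double-check what a numeric permission value means before using it in an access point, the Python standard library can decode it. The helper names below are mine; the sketch renders the bits the way ls -l would and tests individual permissions.

```python
import stat


def describe_mode(mode):
    """Render directory permission bits the way `ls -l` would."""
    return stat.filemode(stat.S_IFDIR | mode)


def can(mode, who, what):
    """Check one permission bit, for example can(0o750, 'group', 'w')."""
    shift = {"owner": 6, "group": 3, "other": 0}[who]
    bit = {"r": 4, "w": 2, "x": 1}[what]
    return bool((mode >> shift) & bit)
```

Note that the value is octal, so it is written as 0o750 in Python rather than 750.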

I go on, and complete the creation of the file system.

Using EFS with Lambda Functions
To start with a simple use case, let’s build a Lambda function implementing a MessageWall API to add, read, or delete text messages. Messages are stored in a file on EFS so that all concurrent execution environments of that Lambda function see the same content.

In the Lambda console, I create a new MessageWall function and select the Python 3.8 runtime. In the Permissions section, I leave the default. This will create a new AWS Identity and Access Management (IAM) role with basic permissions.

When the function is created, in the Permissions tab I click on the IAM role name to open the role in the IAM console. Here, I select Attach policies to add the AWSLambdaVPCAccessExecutionRole and AmazonElasticFileSystemClientReadWriteAccess AWS managed policies. In a production environment, you can restrict access to a specific VPC and EFS access point.

Back in the Lambda console, I edit the VPC configuration to connect the MessageWall function to all subnets in the default VPC, using the same default security group I used for the EFS mount points.

Now, I select Add file system in the new File system section of the function configuration. Here, I choose the EFS file system and access point I created before. For the local mount point, I use /mnt/msg and Save. This is the path where the access point will be mounted, and corresponds to the /message folder in my EFS file system.

In the Function code editor of the Lambda console, I paste the following code and Save.

import os
import fcntl

MSG_FILE_PATH = '/mnt/msg/content'


def get_messages():
    try:
        with open(MSG_FILE_PATH, 'r') as msg_file:
            fcntl.flock(msg_file, fcntl.LOCK_SH)
            messages = msg_file.read()
            fcntl.flock(msg_file, fcntl.LOCK_UN)
    except OSError:  # for example, the file does not exist yet
        messages = 'No message yet.'
    return messages


def add_message(new_message):
    with open(MSG_FILE_PATH, 'a') as msg_file:
        fcntl.flock(msg_file, fcntl.LOCK_EX)
        msg_file.write(new_message + "\n")
        msg_file.flush()  # write the buffered data before releasing the lock
        fcntl.flock(msg_file, fcntl.LOCK_UN)


def delete_messages():
    try:
        os.remove(MSG_FILE_PATH)
    except FileNotFoundError:  # nothing to delete
        pass


def lambda_handler(event, context):
    method = event['requestContext']['http']['method']
    if method == 'GET':
        messages = get_messages()
    elif method == 'POST':
        new_message = event['body']
        add_message(new_message)
        messages = get_messages()
    elif method == 'DELETE':
        delete_messages()
        messages = 'Messages deleted.'
    else:
        messages = 'Method unsupported.'
    return messages

I select Add trigger and in the configuration I select the Amazon API Gateway. I create a new HTTP API. For simplicity, I leave my API endpoint open.

With the API Gateway trigger selected, I copy the endpoint of the new API I just created.

I can now use curl to test the API:

$ curl https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
No message yet.
$ curl -X POST -H "Content-Type: text/plain" -d 'Hello from EFS!' https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
Hello from EFS!

$ curl -X POST -H "Content-Type: text/plain" -d 'Hello again :)' https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
Hello from EFS!
Hello again :)

$ curl https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
Hello from EFS!
Hello again :)

$ curl -X DELETE https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
Messages deleted.

$ curl https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MessageWall
No message yet.

It would be relatively easy to add unique file names (or specific subdirectories) for different users and extend this simple example into a more complete messaging application. As a developer, I appreciate the simplicity of using a familiar file system interface in my code. However, depending on your requirements, EFS throughput configuration must be taken into account. See the section Understanding EFS performance later in the post for more information.
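
As a starting point for the per-user extension, here is a sketch that keeps one file per user under the same mount point. The user_file_path helper, the allowed character set, and the .txt suffix are my own choices; the validation is there so that a user ID cannot be used to reach files outside the base directory.

```python
import fcntl
import os
import re

MSG_DIR = "/mnt/msg"  # local mount point configured for the function


def user_file_path(user_id, base_dir=MSG_DIR):
    """Build a per-user file path, rejecting IDs that could escape base_dir."""
    if not re.fullmatch(r"[A-Za-z0-9_-]{1,64}", user_id):
        raise ValueError("invalid user id")
    return os.path.join(base_dir, f"{user_id}.txt")


def add_message(user_id, new_message, base_dir=MSG_DIR):
    with open(user_file_path(user_id, base_dir), "a") as msg_file:
        fcntl.flock(msg_file, fcntl.LOCK_EX)
        msg_file.write(new_message + "\n")
        msg_file.flush()  # write the data before releasing the lock
        fcntl.flock(msg_file, fcntl.LOCK_UN)


def get_messages(user_id, base_dir=MSG_DIR):
    try:
        with open(user_file_path(user_id, base_dir), "r") as msg_file:
            fcntl.flock(msg_file, fcntl.LOCK_SH)
            messages = msg_file.read()
            fcntl.flock(msg_file, fcntl.LOCK_UN)
    except FileNotFoundError:
        messages = "No message yet."
    return messages
```

The user ID could come from the URL path or from an authorizer on the API; where it comes from is outside the scope of this sketch.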

Now, let’s use the new EFS file system support in AWS Lambda to build something more interesting. For example, let’s use the additional space available with EFS to build a machine learning inference API processing images.

Building a Serverless Machine Learning Inference API
To create a Lambda function implementing machine learning inference, I need to be able, in my code, to import the necessary libraries and load the machine learning model. Often, when doing so, the overall size of those dependencies goes beyond the current AWS Lambda limits in the deployment package size. One way of solving this is to carefully minimize the libraries to ship with the function code, and then download the model from an S3 bucket straight to memory (up to 3 GB, including the memory required for processing the model) or to /tmp (up to 512 MB). This custom minimization and download of the model has never been easy to implement. Now, I can use an EFS file system.

The Lambda function I am building this time needs access to the public internet to download a pre-trained model and the images to run inference on. So I create a new VPC with public and private subnets, and configure a NAT Gateway and the route table used by the private subnets to give access to the public internet. Using the AWS Cloud Development Kit, it’s just a few lines of code.

I create a new EFS file system and an access point in the new VPC using similar configurations as before. This time, I use /ml for the access point path.

Then, I create a new MLInference Lambda function with the same set up as before for permissions and connect the function to the private subnets of the new VPC. Machine learning inference is quite a heavy workload, so I select 3 GB for memory and 5 minutes for timeout. In the File system configuration, I add the new access point and mount it under /mnt/inference.

The machine learning framework I am using for this function is PyTorch, and I need to put the libraries required to run inference in the EFS file system. I launch an Amazon Linux EC2 instance in a public subnet of the new VPC. In the instance details, I select one of the availability zones where I have an EFS mount point, and then Add file system to automatically mount the same EFS file system I am using for the function. For the security groups of the EC2 instance, I select the default security group (to be able to mount the EFS file system) and one that gives inbound access to SSH (to be able to connect to the instance).

I connect to the instance using SSH and create a requirements.txt file containing the dependencies I need:

torch
torchvision
numpy

The EFS file system is automatically mounted by EC2 under /mnt/efs/fs1. There, I create the /ml directory and change the owner of the path to the user and group I am using now that I am connected (ec2-user).

$ sudo mkdir /mnt/efs/fs1/ml
$ sudo chown ec2-user:ec2-user /mnt/efs/fs1/ml

I install Python 3 and use pip to install the dependencies in the /mnt/efs/fs1/ml/lib path:

$ sudo yum install python3
$ pip3 install -t /mnt/efs/fs1/ml/lib -r requirements.txt

Finally, I give ownership of the whole /ml path to the user and group I used for the EFS access point:

$ sudo chown -R 1001:1001 /mnt/efs/fs1/ml

Overall, the dependencies in my EFS file system are using about 1.5 GB of storage.

I go back to the MLInference Lambda function configuration. Depending on the runtime you use, you need to tell it where to look for dependencies if they are not included with the deployment package or in a layer. In the case of Python, I set the PYTHONPATH environment variable to /mnt/inference/lib.

I am going to use PyTorch Hub to download this pre-trained machine learning model to recognize the kind of bird in a picture. The model I am using for this example is relatively small, about 200 MB. To cache the model on the EFS file system, I set the TORCH_HOME environment variable to /mnt/inference/model.

All dependencies are now in the file system mounted by the function, and I can type my code straight in the Function code editor. I paste the following code to have a machine learning inference API:

import urllib.request  # importing only "urllib" does not guarantee that urllib.request is loaded
import json
import os

import torch
from PIL import Image
from torchvision import transforms

transform_test = transforms.Compose([
    transforms.Resize((600, 600), Image.BILINEAR),
    transforms.CenterCrop((448, 448)),
    transforms.ToTensor(),
    transforms.Normalize((0.485, 0.456, 0.406), (0.229, 0.224, 0.225)),
])

model = torch.hub.load('nicolalandro/ntsnet-cub200', 'ntsnet', pretrained=True,
                       **{'topN': 6, 'device': 'cpu', 'num_classes': 200})
model.eval()


def lambda_handler(event, context):
    url = event['queryStringParameters']['url']

    img = Image.open(urllib.request.urlopen(url))
    scaled_img = transform_test(img)
    torch_images = scaled_img.unsqueeze(0)

    with torch.no_grad():
        top_n_coordinates, concat_out, raw_logits, concat_logits, part_logits, top_n_index, top_n_prob = model(torch_images)

        _, predict = torch.max(concat_logits, 1)
        pred_id = predict.item()
        bird_class = model.bird_classes[pred_id]
        print('bird_class:', bird_class)

    return json.dumps({
        "bird_class": bird_class,
    })

I add the API Gateway as trigger, similarly to what I did before for the MessageWall function. Now, I can use the serverless API I just created to analyze pictures of birds. I am not really an expert in the field, so I looked for a couple of interesting images on Wikipedia:

I call the API to get a prediction for these two pictures:

$ curl https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MLInference?url=https://path/to/image/atlantic-puffin.jpg

{"bird_class": "106.Horned_Puffin"}

$ curl https://1a2b3c4d5e.execute-api.us-east-1.amazonaws.com/default/MLInference?url=https://path/to/image/western-grebe.jpg

{"bird_class": "053.Western_Grebe"}

It works! Looking at Amazon CloudWatch Logs for the Lambda function, I see that the first invocation, when the function loads and prepares the pre-trained model for inference on CPUs, takes about 30 seconds. To avoid a slow response, or a timeout from the API Gateway, I use Provisioned Concurrency to keep the function ready. The next invocations take about 1.8 seconds.

Understanding EFS Performance
When using EFS with your Lambda function, it is very important to understand how EFS performance works. For throughput, each file system can be configured to use bursting or provisioned mode.

When using bursting mode, all EFS file systems, regardless of size, can burst at least to 100 MiB/s of throughput. Those over 1 TiB in the standard storage class can burst to 100 MiB/s per TiB of data stored in the file system. EFS uses a credit system to determine when file systems can burst. Each file system earns credits over time at a baseline rate that is determined by the size of the file system that is stored in the standard storage class. A file system uses credits whenever it reads or writes data. The baseline rate is 50 KiB/s per GiB of storage.

You can monitor the use of credits in CloudWatch: each EFS file system has a BurstCreditBalance metric. If you see that you are consuming all credits, and the BurstCreditBalance metric is going to zero, you should enable provisioned throughput mode for the file system, from 1 to 1024 MiB/s. There is an additional cost when using provisioned throughput, based on how much throughput you are adding on top of the baseline rate.

To avoid running out of credits, you should think of the throughput as the average you need during the day. For example, if you have a 10 GiB file system, you have 500 KiB/s of baseline rate, and every day you can read/write 500 KiB/s * 3,600 seconds * 24 hours = 43,200,000 KiB, or about 41 GiB.

If the libraries and everything your function needs to load during initialization are about 2 GiB, and you only access the EFS file system during function initialization, like in the MLInference Lambda function above, that means you can initialize your function (for example because of updates or scaling up activities) about 20 times per day. That’s not a lot, and you would probably need to configure provisioned throughput for the EFS file system.

If you have 10 MiB/s of provisioned throughput, then every day you have 10 MiB/s * 3,600 seconds * 24 hours = 864,000 MiB, or about 844 GiB, to read or write. If you only use the EFS file system at function initialization to read about 2 GiB of dependencies, it means that you can have about 400 initializations per day. That may be enough for your use case.
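The arithmetic behind these estimates can be reproduced with a short script. This is only a sketch that uses binary units throughout, the 50 KiB/s per GiB baseline rate quoted above, and an assumed 2 GiB read per initialization; the function names are made up for illustration.

```python
KIB = 1024
MIB = 1024 ** 2
GIB = 1024 ** 3
SECONDS_PER_DAY = 24 * 3600

# Baseline rate from the text: 50 KiB/s for each GiB stored in standard storage.
BASELINE_PER_GIB_STORED = 50 * KIB


def daily_budget_bytes(throughput_bytes_per_s: float) -> float:
    """Bytes that can be read or written in a day at a sustained rate."""
    return throughput_bytes_per_s * SECONDS_PER_DAY


def initializations_per_day(throughput_bytes_per_s: float, bytes_per_init: float) -> int:
    """Whole function initializations the daily budget can pay for."""
    return int(daily_budget_bytes(throughput_bytes_per_s) // bytes_per_init)


bursting = 10 * BASELINE_PER_GIB_STORED  # 10 GiB stored -> 500 KiB/s baseline
provisioned = 10 * MIB                   # 10 MiB/s provisioned throughput

for label, rate in (("bursting baseline", bursting), ("provisioned", provisioned)):
    print(f"{label}: {daily_budget_bytes(rate) / GIB:.1f} GiB/day, "
          f"{initializations_per_day(rate, 2 * GIB)} initializations of 2 GiB")
```

Changing the stored size, the provisioned rate, or the bytes read per initialization shows quickly whether bursting mode is enough for a given workload.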

In the Lambda function configuration, you can also use the reserved concurrency control to limit the maximum number of execution environments used by a function.

If, by mistake, the BurstCreditBalance goes down to zero, and the file system is relatively small (for example, a few GiBs), there is the possibility that your function gets stuck and can’t execute fast enough before reaching the timeout. In that case, you should enable (or increase) provisioned throughput for the EFS file system, or throttle your function by setting the reserved concurrency to zero to avoid all invocations until the EFS file system has enough credits.

Understanding Security Controls
When using EFS file systems with AWS Lambda, you have multiple levels of security controls. I’m doing a quick recap here because they should all be considered during the design and implementation of your serverless applications. You can find more info on using IAM authorization and access points with EFS in this post.

To connect a Lambda function to an EFS file system, you need:

  • Network visibility in terms of VPC routing/peering and security group.
  • IAM permissions for the Lambda function to access the VPC and mount (read only or read/write) the EFS file system.
  • You can specify in the IAM policy conditions which EFS access point the Lambda function can use.
  • The EFS access point can limit access to a specific path in the file system.
  • File system security (user ID, group ID, permissions) can limit read, write, or executable access for each file or directory mounted by a Lambda function.
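To make the IAM part of this list more concrete, here is a minimal sketch that builds a policy statement allowing a function to mount (and optionally write to) a file system only through one access point. The ARNs and the helper name are hypothetical; the permissions the function needs to access the VPC are separate and not shown.

```python
import json

# Hypothetical ARNs, for illustration only.
FILE_SYSTEM_ARN = "arn:aws:elasticfilesystem:us-east-1:123456789012:file-system/fs-12345678"
ACCESS_POINT_ARN = (
    "arn:aws:elasticfilesystem:us-east-1:123456789012:access-point/fsap-1234567890abcdef0"
)


def efs_access_policy(file_system_arn: str, access_point_arn: str, read_only: bool = False) -> dict:
    """Build an IAM policy that restricts EFS client access to a single access point."""
    actions = ["elasticfilesystem:ClientMount"]
    if not read_only:
        actions.append("elasticfilesystem:ClientWrite")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": actions,
                "Resource": file_system_arn,
                # Only requests made through this access point are allowed.
                "Condition": {
                    "StringEquals": {"elasticfilesystem:AccessPointArn": access_point_arn}
                },
            }
        ],
    }


print(json.dumps(efs_access_policy(FILE_SYSTEM_ARN, ACCESS_POINT_ARN), indent=2))
```

Passing read_only=True drops the write action, which corresponds to the read-only mount option mentioned in the list.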

The Lambda function execution environment and the EFS mount point use industry standard Transport Layer Security (TLS) 1.2 to encrypt data in transit. You can provision Amazon EFS to encrypt data at rest. Data encrypted at rest is transparently encrypted while being written, and transparently decrypted while being read, so you don’t have to modify your applications. Encryption keys are managed by the AWS Key Management Service (KMS), eliminating the need to build and maintain a secure key management infrastructure.

Available Now
This new feature is offered in all regions where AWS Lambda and Amazon EFS are available, with the exception of the regions in China, where we are working to make this integration available as soon as possible. For more information on availability, please see the AWS Region table. To learn more, please see the documentation.

EFS for Lambda can be configured using the console, the AWS Command Line Interface (CLI), the AWS SDKs, and the Serverless Application Model. This feature allows you to build data intensive applications that need to process large files. For example, you can now unzip a 1.5 GB file in a few lines of code, or process a 10 GB JSON document. You can also load libraries or packages that are larger than the 250 MB package deployment size limit of AWS Lambda, enabling new machine learning, data modelling, financial analysis, and ETL jobs scenarios.

Amazon EFS for Lambda is supported at launch in AWS Partner Network solutions, including Epsagon, Lumigo, Datadog, HashiCorp Terraform, and Pulumi.

There is no additional charge for using EFS from Lambda functions. You pay the standard price for AWS Lambda and Amazon EFS. Lambda execution environments always connect to the right mount target in an AZ and not across AZs. You can connect to EFS in the same AZ via cross account VPC but there can be data transfer costs for that. We do not support cross region, or cross AZ connectivity between EFS and Lambda.

Danilo

Welcome to the Serverless-First Function Virtual Events

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/serverless-first-function-virtual-events/

When you develop a serverless application, you can focus on the core features you want to build, instead of worrying about managing and operating servers, databases, or storage systems.

To simplify adoption and use of serverless technologies, we launched many new features in the last few months. For example, just to pick up a few:

To help you and your organization get the most out of the cloud, we organized a set of virtual events called the AWS Serverless-First Function:

  • Last Thursday, May 21, we had the first event, Serverless for your Organization. Between an opening by Dr. Werner Vogels, Amazon CTO, and closing statements by Jeff Barr, AWS VP and Chief Evangelist, we had an agenda fully packed with tips to bring the benefits of serverless to your organization, including a customer case study by Gillian McCann, Head of Cloud Engineering and AI at Workgrid Software.
  • On Thursday, May 28, we have the second event, Serverless for your Application, a full day of incremental, hands-on sessions that demonstrate end-to-end best practices for building serverless applications. To start, we’ll discuss how we use serverless at AWS, and the benefits AWS teams get from adopting serverless-first. This will be followed by dive-deep sessions on topics such as security, performance, and observability. You can still register for this event here.

A Few Highlights from May 21
There were too many great moments to list them all here. If you missed the first event (or to review again some of the ideas and resources that were shared) here are my favorites:

  • Dr. Werner Vogels started the event by defining modern applications, and addressing the importance of culture and adaptability when building them. These are two of the key ingredients to a serverless-first approach. He mentioned the Amazon Builders’ Library (a great resource for learning more about Amazon’s own technical journey), including multiple articles by one of the other presenters for the day, Sr. Principal Engineer, David Yanacek. He also discussed how the AWS Well-Architected Framework’s Serverless Lens (now also available in the AWS Well-Architected Tool) can help you measure your architectures against best practices and identify areas for improvement for all of your serverless applications. Among many examples mentioned during the day, you can find the journeys of iRobot and Fender to serverless-first captured in this whitepaper.
  • David Richardson, VP of Serverless at AWS, discussed the importance of taking a serverless-approach when building modern applications, and covered a few of the recent launches I mentioned above that make building on serverless even better. Considerations around the total cost of ownership (TCO) were part of the discussion of many sessions, so we shared links to two whitepapers, this one by IDC and this one by Deloitte, that help companies evaluate the real impact of their technology choices and understand the long-term return on investment (ROI) of choosing a serverless-first approach.
  • Adrian Cockcroft, VP of Cloud Architecture Strategy, highlighted an array of objections to using serverless and the multiple ways in which we’ve solved for each of them. He paired these solutions with insightful sessions delivered at re:Invent last year, including Serverless at scale: Design patterns and optimizations and Moving to event-driven architectures. He also discussed “relics” from the past… There are indeed ways to migrate an IBM mainframe to microservices on AWS Lambda and to refactor a U.S. Department of Defense mainframe to AWS!
  • Gillian McCann, Head of Cloud Engineering and AI at Workgrid Software, delivered a compelling behind-the-scenes story of Workgrid Software’s learnings, challenges, and successes using a serverless-first approach. I enjoy real-life customer stories: they are a great way to learn and, for us at AWS, to get feedback.

Watch (Again) and Join Us on May 28
It was great seeing all the great questions, answers, and content being shared via the live chat. You can watch (or re-watch) the sessions from May 21 on-demand here on Twitch.

You’re still in time to join us for the second event on May 28! You can find the full agenda and register here.

Danilo

New – Enhanced Amazon Macie Now Available with Substantially Reduced Pricing

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-enhanced-amazon-macie-now-available/

Amazon Macie is a fully managed service that helps you discover and protect your sensitive data, using machine learning to automatically spot and classify data for you.

Over time, Macie customers told us what they like, and what they didn’t. The service team has worked hard to address this feedback, and today I am very happy to share that we are making available a new, enhanced version of Amazon Macie!

This new version has simplified the pricing plan: you are now charged based on the number of Amazon Simple Storage Service (S3) buckets that are evaluated, and the amount of data processed for sensitive data discovery jobs. The new tiered pricing plan has reduced the price by 80%. With higher volumes, you can reduce your costs by more than 90%.

At the same time, we have introduced many new features:

  • An expanded sensitive data discovery, including updated machine learning models for personally identifiable information (PII) detection, and customer-defined sensitive data types using regular expressions.
  • Multi-account support with AWS Organizations.
  • Full API coverage for programmatic use of the service with AWS SDKs and AWS Command Line Interface (CLI).
  • Expanded regional availability to 17 Regions.
  • A new, simplified free tier and free trial to help you get started and understand your costs.
  • A completely redesigned console and user experience.

Macie is now tightly integrated with S3 in the backend, providing more advantages:

  • Enabling S3 data events in AWS CloudTrail is no longer a requirement, further reducing overall costs.
  • There is now a continual evaluation of all buckets, issuing security findings for any public bucket, unencrypted buckets, and for buckets shared with (or replicated to) an AWS account outside of your Organization.

The anomaly detection features monitoring S3 data access activity previously available in Macie are now in private beta as part of Amazon GuardDuty, and have been enhanced to include deeper capabilities to protect your data in S3.

Enabling Amazon Macie
In the Macie console, I select Enable Macie. If you use AWS Organizations, you can delegate an AWS account to administer Macie for your Organization.

After it has been enabled, Amazon Macie automatically provides a summary of my S3 buckets in the region, and continually evaluates those buckets to generate actionable security findings for any unencrypted or publicly accessible data, including buckets shared with AWS accounts outside of my Organization.

Below the summary, I see the top findings by type and by S3 bucket. Overall, this page provides a great overview of the status of my S3 buckets.

In the Findings section I have the full list of findings, and I can select them to archive, unarchive, or export them. I can also select one of the findings to see the full information collected by Macie.

Findings can be viewed in the web console and are sent to Amazon CloudWatch Events for easy integration with existing workflow or event management systems, or to be used in combination with AWS Step Functions to take automated remediation actions. This can help meet regulations such as Payment Card Industry Data Security Standard (PCI-DSS), Health Insurance Portability and Accountability Act (HIPAA), General Data Protection Regulation (GDPR), and California Consumer Privacy Act (CCPA).

In the S3 Buckets section, I can search and filter on buckets of interest to create sensitive data discovery jobs across one or multiple buckets to discover sensitive data in objects, and to check encryption status and public accessibility at object level. Jobs can be executed once, or scheduled daily, weekly, or monthly.

For jobs, Amazon Macie automatically tracks changes to the buckets and only evaluates new or modified objects over time. In the additional settings, I can include or exclude objects based on tags, size, file extensions, or last modified date.

To monitor my costs, and the use of the free trial, I look at the Usage section of the console.

Creating Custom Data Identifiers
Amazon Macie natively supports the most common sensitive data types, including personally identifiable information (PII) and credential data. You can extend that list with custom data identifiers to discover proprietary or unique sensitive data for your business.

For example, companies often have a specific syntax for their employee IDs. A possible syntax is a capital letter that defines whether this is a full-time or a part-time employee, followed by a dash, and then eight digits. Possible values in this case are F-12345678 or P-87654321.

To create this custom data identifier, I enter a regular expression (regex) to describe the pattern to match:

[A-Z]-\d{8}

To avoid false positives, I ask that the employee keyword is found near the identifier (by default, less than 50 characters apart). I use the Evaluate box to test that this configuration works with sample text, then I select Submit.
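To experiment with the pattern outside the console, the following sketch applies the same regular expression and a simple keyword proximity check. It only approximates the behavior described above: the way the distance is measured (characters between the keyword and the match, on either side) and the function name are assumptions, not Macie's actual implementation.

```python
import re

EMPLOYEE_ID = re.compile(r"[A-Z]-\d{8}")  # same pattern as in the custom data identifier
KEYWORD = "employee"
MAX_DISTANCE = 50  # assumed maximum number of characters between keyword and match


def find_employee_ids(text: str) -> list:
    """Return pattern matches that have the keyword within MAX_DISTANCE characters."""
    lowered = text.lower()
    keyword_spans = [(m.start(), m.end()) for m in re.finditer(re.escape(KEYWORD), lowered)]
    hits = []
    for match in EMPLOYEE_ID.finditer(text):
        for k_start, k_end in keyword_spans:
            # Gap between the keyword and the match, whichever comes first.
            if k_end <= match.start():
                gap = match.start() - k_end
            else:
                gap = k_start - match.end()
            if 0 <= gap <= MAX_DISTANCE:
                hits.append(match.group())
                break
    return hits


print(find_employee_ids("Employee ID: F-12345678"))
print(find_employee_ids("Order code P-87654321 shipped"))
```

The first sample contains the keyword next to the ID and is reported; the second has no keyword nearby, so the match is discarded as a likely false positive.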

Available Now
For Amazon Macie regional availability, please see the AWS Region Table. You can find more information on the new enhanced Macie in the documentation.

This release of Amazon Macie remains optimized for S3. However, anything you can get into S3, permanently or temporarily, in an object format supported by Macie, can be scanned for sensitive data. This allows you to expand the coverage to data residing outside of S3 by pulling data out of custom applications, databases, and third-party services, temporarily placing it in S3, and using Amazon Macie to identify sensitive data.

For example, we’ve made this even easier with RDS and Aurora now supporting snapshots to S3 in Apache Parquet, which is a format Macie supports. Similarly, in DynamoDB, you can use AWS Glue to export tables to S3 which can then be scanned by Macie. With the new API and SDKs coverage, you can use the new enhanced Amazon Macie as a building block in an automated process exporting data to S3 to discover and protect your sensitive data across multiple sources.

Danilo

New – Building a Continuous Integration Workflow with Step Functions and AWS CodeBuild

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-building-a-continuous-integration-workflow-with-step-functions-and-aws-codebuild/

Automating your software build is an important step to adopt DevOps best practices. To help you with that, we built AWS CodeBuild, a fully managed continuous integration service that compiles source code, runs tests, and produces packages that are ready for deployment.

However, there are so many possible customizations in our customers’ build processes, and we have seen developers spend time in creating their own custom workflows to coordinate the different activities required by their software build. For example, you may want to run, or not, some tests, or skip static analysis of your code when you need to deploy a quick fix. Depending on the results of your unit tests, you may want to take different actions, or be notified via SNS.

To simplify that, we are launching today a new AWS Step Functions service integration with CodeBuild. Now, during the execution of a state machine, you can start or stop a build, get build report summaries, and delete past build execution records.

In this way, you can define your own workflow-driven build process, and trigger it manually or automatically. For example you can:

With this integration, you can use the full capabilities of Step Functions to automate your software builds. For example, you can use a Parallel state to create parallel builds for independent components of the build. Starting from a list of all the branches in your code repository, you can use a Map state to run a set of steps (automating build, unit tests, and integration tests) for each branch. You can also leverage in the same workflow other Step Functions service integrations. For instance, you can send a message to an SQS queue to track your activities, or start a containerized application you just built using Amazon ECS and AWS Fargate.
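As an illustration of the Map state idea, the sketch below generates a state machine definition that starts one build per branch. The input shape (a branches list), the state names, the MaxConcurrency value, and the use of SourceVersion to select the branch are assumptions made for this example; check them against your own project before relying on them.

```python
import json


def per_branch_build_definition(project_name: str) -> dict:
    """Build a definition that runs a CodeBuild build for each branch in the input."""
    return {
        "Comment": "Sketch: run a CodeBuild build for each branch listed in the input.",
        "StartAt": "Build Each Branch",
        "States": {
            "Build Each Branch": {
                "Type": "Map",
                "ItemsPath": "$.branches",   # e.g. {"branches": ["main", "dev"]}
                "MaxConcurrency": 2,         # at most two builds at the same time
                "Parameters": {"branch.$": "$$.Map.Item.Value"},
                "Iterator": {
                    "StartAt": "Build Branch",
                    "States": {
                        "Build Branch": {
                            "Type": "Task",
                            "Resource": "arn:aws:states:::codebuild:startBuild.sync",
                            "Parameters": {
                                "ProjectName": project_name,
                                "SourceVersion.$": "$.branch",
                            },
                            "End": True,
                        }
                    },
                },
                "End": True,
            }
        },
    }


# "binary-converter" is a placeholder project name.
print(json.dumps(per_branch_build_definition("binary-converter"), indent=2))
```

Each iteration uses the synchronous startBuild.sync integration, so the Map state completes only after all the per-branch builds have finished.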

Using Step Functions for a Workflow-Driven Build Process
I am working on a Java web application. To be sure that it works as I add new features, I wrote a few tests using JUnit Jupiter. I want those tests to be run just after the build process, but not always because tests can slow down some quick iterations. When I run tests, I want to store and view the reports of my tests using CodeBuild. At the end, I want to be notified in an SNS topic whether the tests ran, and whether they were successful.

I created a repository in CodeCommit and I included two buildspec files for CodeBuild:

  • buildspec.yml is the default and is using Apache Maven to run the build and the tests, and then is storing test results as reports.
version: 0.2
phases:
  build:
    commands:
      - mvn package
artifacts:
  files:
    - target/binary-converter-1.0-SNAPSHOT.jar
reports:
  SurefireReports:
    files:
      - '**/*'
    base-directory: 'target/surefire-reports'
  • buildspec-notests.yml is doing only the build, and no tests are executed.
version: 0.2
phases:
  build:
    commands:
      - mvn package -DskipTests
artifacts:
  files:
    - target/binary-converter-1.0-SNAPSHOT.jar

To set up the CodeBuild project and the Step Functions state machine to automate the build, I am using AWS CloudFormation with the following template:

AWSTemplateFormatVersion: 2010-09-09
Description: AWS Step Functions sample project for getting notified on AWS CodeBuild test report results
Resources:
  CodeBuildStateMachine:
    Type: AWS::StepFunctions::StateMachine
    Properties:
      RoleArn: !GetAtt [ CodeBuildExecutionRole, Arn ]
      DefinitionString:
        !Sub
          - |-
            {
              "Comment": "An example of using CodeBuild to run (or not run) tests, get test results and send a notification.",
              "StartAt": "Run Tests?",
              "States": {
                "Run Tests?": {
                  "Type": "Choice",
                  "Choices": [
                    {
                      "Variable": "$.tests",
                      "BooleanEquals": false,
                      "Next": "Trigger CodeBuild Build Without Tests"
                    }
                  ],
                  "Default": "Trigger CodeBuild Build With Tests"
                },
                "Trigger CodeBuild Build With Tests": {
                  "Type": "Task",
                  "Resource": "arn:${AWS::Partition}:states:::codebuild:startBuild.sync",
                  "Parameters": {
                    "ProjectName": "${projectName}"
                  },
                  "Next": "Get Test Results"
                },
                "Trigger CodeBuild Build Without Tests": {
                  "Type": "Task",
                  "Resource": "arn:${AWS::Partition}:states:::codebuild:startBuild.sync",
                  "Parameters": {
                    "ProjectName": "${projectName}",
                    "BuildspecOverride": "buildspec-notests.yml"
                  },
                  "Next": "Notify No Tests"
                },
                "Get Test Results": {
                  "Type": "Task",
                  "Resource": "arn:${AWS::Partition}:states:::codebuild:batchGetReports",
                  "Parameters": {
                    "ReportArns.$": "$.Build.ReportArns"
                  },
                  "Next": "All Tests Passed?"
                },
                "All Tests Passed?": {
                  "Type": "Choice",
                  "Choices": [
                    {
                      "Variable": "$.Reports[0].Status",
                      "StringEquals": "SUCCEEDED",
                      "Next": "Notify Success"
                    }
                  ],
                  "Default": "Notify Failure"
                },
                "Notify Success": {
                  "Type": "Task",
                  "Resource": "arn:${AWS::Partition}:states:::sns:publish",
                  "Parameters": {
                    "Message": "CodeBuild build tests succeeded",
                    "TopicArn": "${snsTopicArn}"
                  },
                  "End": true
                },
                "Notify Failure": {
                  "Type": "Task",
                  "Resource": "arn:${AWS::Partition}:states:::sns:publish",
                  "Parameters": {
                    "Message": "CodeBuild build tests failed",
                    "TopicArn": "${snsTopicArn}"
                  },
                  "End": true
                },
                "Notify No Tests": {
                  "Type": "Task",
                  "Resource": "arn:${AWS::Partition}:states:::sns:publish",
                  "Parameters": {
                    "Message": "CodeBuild build without tests",
                    "TopicArn": "${snsTopicArn}"
                  },
                  "End": true
                }
              }
            }
          - {snsTopicArn: !Ref SNSTopic, projectName: !Ref CodeBuildProject}
  SNSTopic:
    Type: AWS::SNS::Topic
  CodeBuildProject:
    Type: AWS::CodeBuild::Project
    Properties:
      ServiceRole: !Ref CodeBuildServiceRole
      Artifacts:
        Type: NO_ARTIFACTS
      Environment:
        Type: LINUX_CONTAINER
        ComputeType: BUILD_GENERAL1_SMALL
        Image: aws/codebuild/standard:2.0
      Source:
        Type: CODECOMMIT
        Location: https://git-codecommit.us-east-1.amazonaws.com/v1/repos/binary-converter
  CodeBuildExecutionRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action: "sts:AssumeRole"
            Principal:
              Service: states.amazonaws.com
      Path: "/"
      Policies:
        - PolicyName: CodeBuildExecutionRolePolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                  - "sns:Publish"
                Resource:
                  - !Ref SNSTopic
              - Effect: Allow
                Action:
                  - "codebuild:StartBuild"
                  - "codebuild:StopBuild"
                  - "codebuild:BatchGetBuilds"
                  - "codebuild:BatchGetReports"
                Resource: "*"
              - Effect: Allow
                Action:
                  - "events:PutTargets"
                  - "events:PutRule"
                  - "events:DescribeRule"
                Resource:
                  - !Sub "arn:${AWS::Partition}:events:${AWS::Region}:${AWS::AccountId}:rule/StepFunctionsGetEventForCodeBuildStartBuildRule"
  CodeBuildServiceRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Action: "sts:AssumeRole"
            Principal:
              Service: codebuild.amazonaws.com
      Path: /
      Policies:
        - PolicyName: CodeBuildServiceRolePolicy
          PolicyDocument:
            Version: "2012-10-17"
            Statement:
              - Effect: Allow
                Action:
                - "logs:CreateLogGroup"
                - "logs:CreateLogStream"
                - "logs:PutLogEvents"
                - "codebuild:CreateReportGroup"
                - "codebuild:CreateReport"
                - "codebuild:UpdateReport"
                - "codebuild:BatchPutTestCases"
                - "codecommit:GitPull"
                Resource: "*"
Outputs:
  StateMachineArn:
    Value: !Ref CodeBuildStateMachine
  ExecutionInput:
    Description: Sample input to StartExecution.
    Value:
      >
        {}

When the CloudFormation stack has been created, there are two CodeBuild tasks in the state machine definition:

  • The first CodeBuild task is using a synchronous integration (startBuild.sync) to automatically wait for the build to terminate before progressing to the next step:
"Trigger CodeBuild Build With Tests": {
  "Type": "Task",
  "Resource": "arn:aws:states:::codebuild:startBuild.sync",
  "Parameters": {
    "ProjectName": "CodeBuildProject-HaVamwTeX8kM"
  },
  "Next": "Get Test Results"
}
  • The second CodeBuild task is using the BuildspecOverride parameter to override the default buildspec file used by the build with the one not running tests:
"Trigger CodeBuild Build Without Tests": {
  "Type": "Task",
  "Resource": "arn:aws:states:::codebuild:startBuild.sync",
  "Parameters": {
    "ProjectName": "CodeBuildProject-HaVamwTeX8kM",
    "BuildspecOverride": "buildspec-notests.yml"
  },
  "Next": "Notify No Tests"
},

The first step is a Choice state that looks at the input of the state machine execution to decide whether to run tests. For example, to run tests, I can provide this input:

{
  "tests": true
}
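
To make the branching concrete, here is a minimal Python sketch of how such a Choice state could route an execution. The Choice definition below is an assumption (the post does not show it): only the two target state names come from the task definitions above, and the helper handles just top-level BooleanEquals rules. Unlike this sketch, a real Choice state can fail when the referenced field is missing from the input.

```python
import json

# Hypothetical Choice state: the rule and the default branch are assumptions;
# the two "Next" targets are the CodeBuild task names shown earlier.
CHOICE_STATE = {
    "Type": "Choice",
    "Choices": [
        {
            "Variable": "$.tests",
            "BooleanEquals": True,
            "Next": "Trigger CodeBuild Build With Tests",
        }
    ],
    "Default": "Trigger CodeBuild Build Without Tests",
}


def next_state(choice_state, execution_input):
    """Return the state this simplified Choice would transition to.

    Only top-level "$.field" variables and BooleanEquals rules are handled.
    """
    for rule in choice_state["Choices"]:
        field = rule["Variable"][2:]  # strip the leading "$."
        if field in execution_input and execution_input[field] is rule["BooleanEquals"]:
            return rule["Next"]
    return choice_state["Default"]


print(next_state(CHOICE_STATE, json.loads('{"tests": true}')))
print(next_state(CHOICE_STATE, json.loads('{"tests": false}')))
```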

This is the visual workflow of the execution running tests. All tests passed.

I change the value of "tests" to false, and start a new execution that goes on a different branch.

This time the buildspec is not executing tests, and I get a notification that no tests were run.

When starting this workflow automatically after an activity on GitHub or CodeCommit, I could look into the last commit message for specific patterns, and customize the build process accordingly. For example, I could skip tests if the [skip tests] string is part of the commit message. Similarly, in a production environment I could skip static code analysis, to have faster integration for urgent changes, if the [skip static analysis] message is included in the commit.
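
As a rough sketch of that idea, the following Python function turns a commit message into an execution input. The two markers come from the text above. The "static_analysis" key and the case-insensitive matching are assumptions for illustration, since the workflow in this post only reads "tests".

```python
import json
import re

# Markers from the text; case-insensitive matching is an assumption.
SKIP_TESTS = re.compile(r"\[skip tests\]", re.IGNORECASE)
SKIP_STATIC_ANALYSIS = re.compile(r"\[skip static analysis\]", re.IGNORECASE)


def execution_input_from_commit(message):
    """Build the state machine input from the last commit message."""
    return {
        "tests": SKIP_TESTS.search(message) is None,
        # Hypothetical key: the workflow above does not define it.
        "static_analysis": SKIP_STATIC_ANALYSIS.search(message) is None,
    }


print(json.dumps(execution_input_from_commit("Fix typo in README [skip tests]")))
```

The resulting JSON could then be passed as the input of a new execution.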

Extending the Workflow for Containerized Applications
A great way to distribute applications to different environments is to package them as Docker images. In this way, I can also add a step to my build workflow and start the containerized application in an Amazon ECS task (running on AWS Fargate) for the Quality Assurance (QA) team.

First, I create an image repository in ECR and add permissions to the service role used by the CodeBuild project to upload to ECR, as described here.

Then, in the code repository, I follow this example to add:

  • A Dockerfile to prepare the Docker container with the software build, and start the application.
  • A buildspec-docker.yml file with the commands to create and upload the Docker image.

The final workflow is automating all these steps:

  1. Building the software from the source code.
  2. Creating the Docker image.
  3. Uploading the Docker image to ECR.
  4. Starting the QA environment on ECS and Fargate.
  5. Sending an SNS notification that the QA environment is ready.

The workflow and its steps can easily be customized based on your requirements. For example, with a few changes, you can adapt the buildspec file to push the image to Docker Hub.

Available Now
The CodeBuild service integration is available in all commercial and GovCloud regions where Step Functions and CodeBuild services are offered. For regional availability, please see the AWS Region Table. For more information, please look at the documentation.

As AWS Serverless Hero Gojko Adzic pointed out on the AWS DevOps Blog, CodeBuild can also be used to execute administrative tasks. The integration with Step Functions opens a whole set of new possibilities.

Let me know what you are going to use this new service integration for!

Danilo

New – Amazon EventBridge Schema Registry is Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-amazon-eventbridge-schema-registry-is-now-generally-available/

Amazon EventBridge is a serverless event bus that makes it easy to connect applications together. It can use data from AWS services, your own applications, and integrations with Software-as-a-Service (SaaS) partners. Last year at re:Invent, we introduced EventBridge schema registry and discovery in preview, a way to store the structure of the events (the schema) in a central location, and simplify using events in your code by generating the code to process them for Java, Python, and TypeScript.

Today, I am happy to announce that the EventBridge schema registry is generally available, and that we added support for resource policies. Resource policies allow you to share a schema registry across different AWS accounts and organizations. In this way, developers on different teams can search for and use any schema that another team has added to the shared registry.

Using EventBridge Schema Registry Resource Policies
It’s common for companies to have different development teams working on different services. To make a more concrete example, let’s take two teams working on services that have to communicate with each other:

  • The CreateAccount development team, working on a frontend API that receives requests from a web/mobile client to create a new customer account for the company.
  • The FraudCheck development team, working on a backend service that checks the data for newly created accounts to estimate the risk that they are fake.

Each team is using their own AWS account to develop their application. Using EventBridge, we can implement the following architecture:

  • The frontend CreateAccount application uses Amazon API Gateway to process the request with an AWS Lambda function written in Python. When a new account is created, the Lambda function publishes the ACCOUNT_CREATED event on a custom event bus.
  • The backend FraudCheck Lambda function is built in Java, and is expecting to receive the ACCOUNT_CREATED event to call Amazon Fraud Detector (a fully managed service we introduced in preview at re:Invent) to estimate the risk of that being a fake account. If the risk is above a certain threshold, the Lambda function takes preemptive actions. For example, it can flag the account as fake on a database, or post a FAKE_ACCOUNT event on the event bus.
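
To illustrate the publishing side, here is a minimal sketch of how the CreateAccount function could prepare the event. The "CreateAccount" source, the bus name, and the account fields are assumptions for this example. The entry follows the shape of the EventBridge PutEvents API, where Detail is a JSON string.

```python
import json


def build_account_created_entry(account, event_bus_name):
    """Return one entry in the shape used by the EventBridge PutEvents API."""
    return {
        "EventBusName": event_bus_name,
        "Source": "CreateAccount",  # assumed source name
        "DetailType": "ACCOUNT_CREATED",
        "Detail": json.dumps(account),  # Detail must be a JSON string
    }


entry = build_account_created_entry(
    {"id": "42", "firstName": "Ada", "surname": "Lovelace", "email": "ada@example.com"},
    "my-custom-bus",  # hypothetical bus name
)
# Inside the Lambda function, the entry would be sent with boto3, for example:
#   boto3.client("events").put_events(Entries=[entry])
print(json.dumps(entry, indent=2))
```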

How can the two teams coordinate their work so that they both know the syntax of the events, and use EventBridge to generate the code to process those events?

First, a custom event bus is created with permissions that allow access from within the company’s organization.

Then, the CreateAccount team uses EventBridge schema discovery to automatically populate the schema for the ACCOUNT_CREATED event that their service is publishing. This event contains all the information of the account that has just been created.

In an event-driven architecture, services can subscribe to specific types of events that they’re interested in. To receive ACCOUNT_CREATED events, a rule is created on the event bus to send those events to the FraudCheck function.
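
A rule selects events with an event pattern. The sketch below shows what such a pattern could look like and how matching works in the simplest case, where each field in the pattern lists the values it accepts. The pattern itself and the "CreateAccount" and "FraudCheck" source names are assumptions, and the matcher only covers exact matches on top-level fields, not the full EventBridge pattern syntax.

```python
# Hypothetical event pattern for the rule on the custom event bus.
EVENT_PATTERN = {
    "source": ["CreateAccount"],
    "detail-type": ["ACCOUNT_CREATED"],
}


def matches(pattern, event):
    """Simplified matcher: every pattern field must list the event's value."""
    return all(event.get(field) in accepted for field, accepted in pattern.items())


account_created = {
    "source": "CreateAccount",
    "detail-type": "ACCOUNT_CREATED",
    "detail": {"id": "42"},
}
fake_account = {
    "source": "FraudCheck",
    "detail-type": "FAKE_ACCOUNT",
    "detail": {"id": "42"},
}

print(matches(EVENT_PATTERN, account_created))
print(matches(EVENT_PATTERN, fake_account))
```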

Using resource policies, the CreateAccount team gives the FraudCheck team’s AWS account read-only access to the discovered schemas. The Principal in this policy is the AWS account getting the permissions. The Resource is the schema registry that is being shared.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "GiveSchemaAccess",
      "Effect": "Allow",
      "Action": [
        "schemas:ListSchemas",
        "schemas:SearchSchemas", 
        "schemas:DescribeSchema",
        "schemas:DescribeCodeBinding",
        "schemas:GetCodeBindingSource",
        "schemas:PutCodeBinding"
      ],
      "Principal": {
        "AWS": "123412341234"
      },
      "Resource": [
        "arn:aws:schemas:us-east-1:432143214321:registry/discovered-schemas",
        "arn:aws:schemas:us-east-1:432143214321:schema/discovered-schemas*"
      ]
    }
  ]
}

Now, the FraudCheck team can search the content of the discovered schema for the ACCOUNT_CREATED event. Resource policies allow you to make a registry available across accounts and organizations, but shared registries do not automatically show up in the console. To access the shared registry, the FraudCheck team needs to use the AWS Command Line Interface (CLI) and specify the full ARN of the registry:

aws schemas search-schemas \
    --registry-name arn:aws:schemas:us-east-1:432143214321:registry/discovered-schemas \
    --keywords ACCOUNT_CREATED

In this way, the FraudCheck team gets the exact name of the schema created by the CreateAccount team.

{
    "Schemas": [
        {
            "RegistryName": "discovered-schemas",
            "SchemaArn": "arn:aws:schemas:us-east-1:432143214321:schema/discovered-schemas/CreateAccount@ACCOUNT_CREATED",
            "SchemaName": "CreateAccount@ACCOUNT_CREATED",
            "SchemaVersions": [
                {
                    "CreatedDate": "2020-04-28T11:10:15+00:00",
                    "SchemaVersion": 1
                }
            ]
        }
    ]
}

With the schema name, the FraudCheck team can describe the content of the schema:

aws schemas describe-schema \
    --registry-name arn:aws:schemas:us-east-1:432143214321:registry/discovered-schemas \
    --schema-name CreateAccount@ACCOUNT_CREATED

The result describes the schema using the OpenAPI specification:

{
    "Content": "{\"openapi\":\"3.0.0\",\"info\":{\"version\":\"1.0.0\",\"title\":\"CREATE_ACCOUNT\"},\"paths\":{},\"components\":{\"schemas\":{\"AWSEvent\":{\"type\":\"object\",\"required\":[\"detail-type\",\"resources\",\"detail\",\"id\",\"source\",\"time\",\"region\",\"version\",\"account\"],\"x-amazon-events-detail-type\":\"CREATE_ACCOUNT\",\"x-amazon-events-source\":\"CreateAccount\",\"properties\":{\"detail\":{\"$ref\":\"#/components/schemas/CREATE_ACCOUNT\"},\"account\":{\"type\":\"string\"},\"detail-type\":{\"type\":\"string\"},\"id\":{\"type\":\"string\"},\"region\":{\"type\":\"string\"},\"resources\":{\"type\":\"array\",\"items\":{\"type\":\"object\"}},\"source\":{\"type\":\"string\"},\"time\":{\"type\":\"string\",\"format\":\"date-time\"},\"version\":{\"type\":\"string\"}}},\"CREATE_ACCOUNT\":{\"type\":\"object\",\"required\":[\"firstName\",\"surname\",\"id\",\"email\"],\"properties\":{\"email\":{\"type\":\"string\"},\"firstName\":{\"type\":\"string\"},\"id\":{\"type\":\"string\"},\"surname\":{\"type\":\"string\"}}}}}}",
    "LastModified": "2020-04-28T11:10:15+00:00",
    "SchemaArn": "arn:aws:schemas:us-east-1:432143214321:schema/discovered-schemas/CreateAccount@ACCOUNT_CREATED",
    "SchemaName": "CreateAccount@ACCOUNT_CREATED",
    "SchemaVersion": "1",
    "Tags": {},
    "Type": "OpenApi3",
    "VersionCreatedDate": "2020-04-28T11:10:15+00:00"
}

Using the AWS Command Line Interface (CLI), the FraudCheck team can create a code binding if it isn’t already created, using the put-code-binding command, and then download the code binding to process that event:

aws schemas get-code-binding-source \
    --registry-name arn:aws:schemas:us-east-1:432143214321:registry/discovered-schemas \
    --schema-name CreateAccount@ACCOUNT_CREATED \
    --language Java8 CreateAccount.zip

Another option for the FraudCheck team is to copy and paste (after unescaping the JSON string) the Content of the discovered schema to create a new custom schema in their AWS account.
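
Unescaping the Content string is a one-line operation in most languages. Here is a minimal Python sketch that uses a shortened stand-in for the describe-schema output (not the full response shown above) and reads the required fields of the event detail:

```python
import json

# Shortened stand-in for the describe-schema output; the real Content is longer.
describe_schema_output = {
    "Content": "{\"openapi\":\"3.0.0\",\"components\":{\"schemas\":{\"CREATE_ACCOUNT\":{\"type\":\"object\",\"required\":[\"firstName\",\"surname\",\"id\",\"email\"]}}}}",
    "Type": "OpenApi3",
}

# Content is a JSON document stored as a string, so it needs a second parse.
schema = json.loads(describe_schema_output["Content"])
required = schema["components"]["schemas"]["CREATE_ACCOUNT"]["required"]

print(json.dumps(schema, indent=2))  # readable form, ready to paste as a custom schema
print(required)
```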

Once the schema is copied to their own account, the FraudCheck team can use the AWS Toolkit IDE plugins to view the schema, download code bindings, and generate serverless applications directly from their IDEs. The EventBridge team is working to add the capability to the AWS Toolkit to use a schema registry in a different account, making this step simpler. Stay tuned!

Often customers have a specific team, with a different AWS account, managing the event bus. For the sake of simplicity, in this post I assumed that the CreateAccount team was the one configuring the EventBridge event bus. With more accounts, you can simplify permissions using IAM to share resources with groups of AWS accounts in AWS Organizations.

Available Now
The EventBridge Schema Registry is available now in all commercial regions except Bahrain, Cape Town, Milan, Osaka, Beijing, and Ningxia. For more information on how to use resource policies for schema registries, please see the documentation.

Using Schema Registry resource policies, it is much easier to coordinate the work of different teams sharing information in an event-driven architecture.

Let me know what you are going to build with this!

Danilo

Now Open – AWS Europe (Milan) Region

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/now-open-aws-europe-milan-region/

Today, I am very happy to announce that, as we anticipated some time ago, a new AWS Region is available in Italy!

The Europe (Milan) Region is our sixth Region in Europe, and is composed of 3 availability zones (AZs) that you can use to reliably spread your applications across multiple data centers, for example configuring the subnets of your Amazon Virtual Private Cloud to use different AZs for your Amazon Elastic Compute Cloud (EC2) instances. Each AZ is a fully isolated partition of our infrastructure that contains one or more data centers.

AZs are located in separate and distinct geographic locations with enough distance to significantly reduce the risk of a single event impacting availability in the Region, but near enough for business continuity applications that require rapid failover and synchronous replication. This gives you the ability to operate production applications that are more highly available, more fault tolerant, and more scalable than would be possible from a single data center. Fully managed services like Amazon Simple Storage Service (S3), AWS Lambda, and Amazon DynamoDB, replicate data and applications across AZs automatically.

The AWS Region in Milan offers low latency for customers seeking to serve end-users in Italy, and also has a latency advantage over other existing AWS regions when serving customers from other countries such as Austria, Greece, and Bulgaria. Results may differ based on the quality, capacity, and distance of the connection in the end user’s last-mile network.

An in-country infrastructure is also critical for Italian customers with data residency requirements and regulations, such as those operating in government, healthcare, and financial services.

AWS in Italy
Currently AWS has five edge locations in Italy (three in Milan, one in Palermo, and one in Rome) and an AWS Direct Connect location in Milan which connects to the Europe (Frankfurt) Region and to the new Region in Milan.

We opened the first AWS office in Italy at the beginning of 2014 in Milan. There is now also an office in Rome, engineering teams in Piedmont and Sardinia, and a broad network of partners. AWS continues to build teams of account managers, solutions architects, business developers, and professional services consultants in Italy to help customers of all sizes build or move their workloads in the cloud. In 2016, AWS acquired Italy-based NICE Software, a leading provider of software and services for high performance and technical computing.

I joined AWS in 2012, and I was immediately blown away by what Italian customers were building on AWS to lower their costs, become more agile, and innovate faster. For example:

  • GEDI Gruppo Editoriale is an Italian multimedia giant that publishes some of the largest circulation newspapers in Italy, including La Repubblica and La Stampa. In March 2018, during the Italian general elections, they experienced over 80 million page views and 18.8 million unique visits, and were able to provide their readers with continuous special election-day coverage with real-time data of election results.
  • Satispay is disrupting the mobile payment landscape, allowing their users to securely send money or pay using a smartphone app that relies on International Bank Account Numbers (IBANs), and directly connects consumers and merchants via their bank accounts. They are all-in on AWS, and benefit from AWS’s compliance accreditations and certifications, many of which are required to operate in the financial services industry. Adopting DevOps and CI/CD best practices, they went from one deployment per week to 16 deployments per day, giving them the freedom and flexibility to develop new features, and innovate faster.
  • Musixmatch is a Bologna-based startup that has quickly become the world’s largest lyrics platform, with more than 50 million users and over 14 million lyrics in 58 languages. Musixmatch is using AWS to be able to scale quickly, and most importantly to innovate constantly. In just three days, Musixmatch started using Amazon SageMaker to train models to analyze the language of songs and identify the mood and emotion of the lyrics. This allowed Musixmatch to build a platform where users can find new music based on emotions and mood of the lyrics of songs. Musixmatch also used SageMaker to train umBERTo, a state-of-the-art Italian language model.
  • Avio Aero is an aerospace business that designs, constructs, and maintains systems and components for military and civil aviation. Among other things, they developed a serverless application for their finance team to manage expense approvals and purchase orders. They are excited by the new Region because they have many applications that contain particularly critical data that needs to be stored in Italy.

Available Now
The new Europe (Milan) Region is ready to support your business. You can look at the Region Table for service availability.

With this launch, AWS now has 76 AZs within 24 geographic Regions around the world, with 3 new Regions coming in Indonesia, Japan, and Spain. To build a secure, high-performing, resilient, and efficient infrastructure for your applications, you can leverage the best practices we shared in the AWS Well-Architected Framework, and review your architecture using the AWS Well-Architected Tool.

For more information on our global infrastructure, and the custom hardware we use, check out this interactive map.

Danilo

New – Serverless Streaming ETL with AWS Glue

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-serverless-streaming-etl-with-aws-glue/

When you have applications in production, you want to understand what is happening, and how the applications are being used. To analyze data, a first approach is a batch processing model: a set of data is collected over a period of time, then run through analytics tools. To be able to react quickly, you can use a streaming model, where data is processed as it arrives, a record at a time or in micro-batches of tens, hundreds, or thousands of records.

Managing continuous ingestion pipelines and processing data on-the-fly is quite complex, because it’s an always-on system that needs to be managed, patched, scaled, and generally taken care of. Today, we are making this easier and more cost-effective to implement by extending AWS Glue jobs, based on Apache Spark, to run continuously and consume data from streaming platforms such as Amazon Kinesis Data Streams and Apache Kafka (including the fully-managed Amazon MSK).

In this way, Glue can provision, manage, and scale the infrastructure needed to ingest data to data lakes on Amazon S3, data warehouses such as Amazon Redshift, or other data stores. For example, you can store streaming data in a DynamoDB table for quick lookups, or in Elasticsearch to look for specific patterns. This procedure is usually referred to as extract, transform, load (ETL).

As you process streaming data in a Glue job, you have access to the full capabilities of Spark Structured Streaming to implement data transformations, such as aggregating, partitioning, and formatting as well as joining with other data sets to enrich or cleanse the data for easier analysis. For example, you can access an external system to identify fraud in real-time, or use machine learning algorithms to classify data, or detect anomalies and outliers.

Processing Streaming Data with AWS Glue
To try this new feature, I want to collect data from IoT sensors and store all data points in an S3 data lake. I am using a Raspberry Pi with a Sense HAT to collect temperature, humidity, barometric pressure, and its position in space in real-time (using the integrated gyroscope, accelerometer, and magnetometer). Here’s an architectural view of what I am building:

First, I register the device with AWS IoT Core, and run the following Python code to send, once per second, a JSON message with sensor data to the streaming-data MQTT topic. I have a single device in this setup. With more devices, I would use a subtopic per device, for example streaming-data/{client_id}.

import sys
import time
import datetime
import json
from sense_hat import SenseHat
from awscrt import io, mqtt, auth, http
from awsiot import mqtt_connection_builder

sense = SenseHat()

topic = "streaming-data"
client_id = "raspberrypi"

# Callback when connection is accidentally lost.
def on_connection_interrupted(connection, error, **kwargs):
    print("Connection interrupted. error: {}".format(error))


# Callback when an interrupted connection is re-established.
def on_connection_resumed(connection, return_code, session_present, **kwargs):
    print("Connection resumed. return_code: {} session_present: {}".format(
        return_code, session_present))

    if return_code == mqtt.ConnectReturnCode.ACCEPTED and not session_present:
        print("Session did not persist. Resubscribing to existing topics...")
        resubscribe_future, _ = connection.resubscribe_existing_topics()

        # Cannot synchronously wait for resubscribe result because we're on the connection's event-loop thread,
        # evaluate result with a callback instead.
        resubscribe_future.add_done_callback(on_resubscribe_complete)


def on_resubscribe_complete(resubscribe_future):
    resubscribe_results = resubscribe_future.result()
    print("Resubscribe results: {}".format(resubscribe_results))

    for topic, qos in resubscribe_results['topics']:
        if qos is None:
            sys.exit("Server rejected resubscribe to topic: {}".format(topic))


# Callback when the subscribed topic receives a message
def on_message_received(topic, payload, **kwargs):
    print("Received message from topic '{}': {}".format(topic, payload))


def collect_and_send_data():
    publish_count = 0
    while True:

        humidity = sense.get_humidity()
        print("Humidity: %s %%rH" % humidity)

        temp = sense.get_temperature()
        print("Temperature: %s C" % temp)

        pressure = sense.get_pressure()
        print("Pressure: %s Millibars" % pressure)

        orientation = sense.get_orientation_degrees()
        print("p: {pitch}, r: {roll}, y: {yaw}".format(**orientation))

        timestamp = datetime.datetime.fromtimestamp(
            time.time()).strftime('%Y-%m-%d %H:%M:%S')

        message = {
            "client_id": client_id,
            "timestamp": timestamp,
            "humidity": humidity,
            "temperature": temp,
            "pressure": pressure,
            "pitch": orientation['pitch'],
            "roll": orientation['roll'],
            "yaw": orientation['yaw'],
            "count": publish_count
        }
        print("Publishing message to topic '{}': {}".format(topic, message))

        mqtt_connection.publish(
            topic=topic,
            payload=json.dumps(message),
            qos=mqtt.QoS.AT_LEAST_ONCE)
        time.sleep(1)
        publish_count += 1


if __name__ == '__main__':
    # Spin up resources
    event_loop_group = io.EventLoopGroup(1)
    host_resolver = io.DefaultHostResolver(event_loop_group)
    client_bootstrap = io.ClientBootstrap(event_loop_group, host_resolver)

    mqtt_connection = mqtt_connection_builder.mtls_from_path(
        endpoint="a1b2c3d4e5f6g7-ats.iot.us-east-1.amazonaws.com",
        cert_filepath="rapberrypi.cert.pem",
        pri_key_filepath="rapberrypi.private.key",
        client_bootstrap=client_bootstrap,
        ca_filepath="root-CA.crt",
        on_connection_interrupted=on_connection_interrupted,
        on_connection_resumed=on_connection_resumed,
        client_id=client_id,
        clean_session=False,
        keep_alive_secs=6)

    connect_future = mqtt_connection.connect()

    # Future.result() waits until a result is available
    connect_future.result()
    print("Connected!")

    # Subscribe
    print("Subscribing to topic '{}'...".format(topic))
    subscribe_future, packet_id = mqtt_connection.subscribe(
        topic=topic,
        qos=mqtt.QoS.AT_LEAST_ONCE,
        callback=on_message_received)

    subscribe_result = subscribe_future.result()
    print("Subscribed with {}".format(str(subscribe_result['qos'])))

    collect_and_send_data()

This is an example of the JSON messages sent by the device:

{
    "client_id": "raspberrypi",
    "timestamp": "2020-04-16 11:33:23",
    "humidity": 39.35261535644531,
    "temperature": 30.10732078552246,
    "pressure": 1020.447509765625,
    "pitch": 4.044007304723748,
    "roll": 7.533848064912158,
    "yaw": 77.01560798660883,
    "count": 104
}

In the Kinesis console, I create the my-data-stream data stream (1 shard is more than enough for my workload). Back in the AWS IoT console, I create an IoT rule to send all data from the MQTT topic to this Kinesis data stream.

Now that all sensor data is sent to Kinesis, I can leverage the new Glue integration to process data as it arrives. In the Glue console, I manually add a table in the Glue Data Catalog. I select Kinesis as the type of source, and enter my stream name and the endpoint of the Kinesis Data Streams service. Note that for Kafka streams, before creating the table, you need to create a Glue connection.

I select JSON as data format, and define the schema for the streaming data. If I don’t specify a column here, it will be ignored when processing the stream.

After that, I confirm the final recap step, and create the my_streaming_data table. We are working to add schema inference to streaming ETL jobs. With that, specifying the full schema up front won’t be necessary. Stay tuned.

To process the streaming data, I create a Glue job. For the IAM role, I create a new one attaching the AWSGlueServiceRole and AmazonKinesisReadOnlyAccess managed policies. Depending on your use case and the setup of your AWS accounts, you may want to use a role providing more fine-grained access.

For the data source, I select the table I just created, receiving data from the Kinesis stream.

To get a script generated by Glue, I select the Change schema transform type. As target, I create a new table in the Glue Data Catalog, using an efficient format like Apache Parquet. The Parquet files generated by this job are going to be stored in an S3 bucket whose name starts with aws-glue- (including the final hyphen). By following the naming convention for resources specified in the AWSGlueServiceRole policy, this job has the required permissions to access those resources.

I leave the default mapping that keeps all the columns of the source stream in the output. In this way, I can ingest all the records using the proposed script, without having to write a single line of code.

I quickly review the proposed script and save. Each record is processed as a DynamicFrame, and I can apply any of the Glue PySpark Transforms or any transforms supported by Spark Structured Streaming. By default with this configuration, only ApplyMapping is used.

I start the job, and after a few minutes I see the Parquet files containing the output of the job appearing in the output S3 bucket. They are partitioned by ingest date (year, month, day, and hour).
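
To show what hourly partitioning by ingest date looks like, here is a small sketch that builds the S3 prefix for a given timestamp. The partition key names (ingest_year and so on), the zero padding, and the bucket name are assumptions for illustration, so check the script generated for your job for the exact layout.

```python
import datetime


def ingest_partition_prefix(base_path, when):
    """Build an hourly partition path from an ingest timestamp."""
    return (
        f"{base_path}/ingest_year={when.year:04d}"
        f"/ingest_month={when.month:02d}"
        f"/ingest_day={when.day:02d}"
        f"/ingest_hour={when.hour:02d}/"
    )


prefix = ingest_partition_prefix(
    "s3://aws-glue-my-output-bucket/output",  # hypothetical bucket and prefix
    datetime.datetime(2020, 4, 16, 11, 33, 23),
)
print(prefix)
```

With a layout like this, a query that filters on the partition columns only needs to read the files under the matching prefixes.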

To populate the Glue Data Catalog with tables based on the content of the S3 bucket, I add and run a crawler. In the crawler configuration, I exclude the checkpoint folder used by Glue to keep track of the data that has been processed. After less than a minute, a new table has been added.

In the Amazon Athena console, I refresh the database and tables, and select to preview the output_my_data table containing data ingested this year. In this way, I see the first ten records in the table, and get a confirmation that my setup is working!

Now, as data is being ingested, I can run more complex queries. For example, I can get the minimum and maximum temperature, collected from the device sensors, and the overall number of records stored in the Parquet files.
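
The aggregation itself is plain SQL. As a sketch that runs locally, the snippet below uses an in-memory SQLite table as a stand-in for the Athena table, with three made-up sample rows. The table and column names follow the ones used in this walkthrough, and a similar SELECT statement could be run in the Athena console.

```python
import sqlite3

QUERY = """
SELECT MIN(temperature) AS min_temp,
       MAX(temperature) AS max_temp,
       COUNT(*) AS records
FROM output_my_data
"""

# SQLite stands in for Athena here; the rows are invented sample values.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE output_my_data (client_id TEXT, temperature REAL)")
conn.executemany(
    "INSERT INTO output_my_data VALUES (?, ?)",
    [("raspberrypi", 29.5), ("raspberrypi", 30.1), ("raspberrypi", 31.0)],
)

min_temp, max_temp, records = conn.execute(QUERY).fetchone()
print(min_temp, max_temp, records)
```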

Looking at the results, I see more than 8,000 records have been processed, with a maximum temperature of 31 degrees Celsius (about 88 degrees Fahrenheit). Actually, it was never really this hot. Temperature is measured by these sensors very close to the device, and rises as the device warms up with usage.

I am using a single device in this setup, but the solution implemented here can easily scale up with the number of data sources.

Available Now
Support for streaming sources is available in all regions where Glue is offered, as described in the AWS Region table. For more information, please have a look at the documentation.

Managing a serverless ETL pipeline with Glue makes it easier and more cost-effective to set up and manage streaming ingestion processes, reducing implementation efforts so you can focus on the business outcomes of analytics. You can set up a whole ingestion pipeline without writing code, as I did in this walkthrough, or customize the proposed script based on your needs.

Let me know what you are going to use this new feature for!

Danilo

New – Amazon Keyspaces (for Apache Cassandra) is Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-amazon-keyspaces-for-apache-cassandra-is-now-generally-available/

We introduced Amazon Managed Apache Cassandra Service (MCS) in preview at re:Invent last year. In the few months that passed, the service introduced many new features, and it is generally available today with a new name: Amazon Keyspaces (for Apache Cassandra).

Amazon Keyspaces is built on Apache Cassandra, and you can use it as a fully managed, serverless database. Your applications can read and write data from Amazon Keyspaces using your existing Cassandra Query Language (CQL) code, with little or no changes. For each table, you can select the best configuration depending on your use case:

  • With on-demand, you pay based on the actual reads and writes you perform. This is the best option for unpredictable workloads.
  • With provisioned capacity, you can reduce your costs for predictable workloads by configuring capacity settings up front. You can also further optimize costs by enabling auto scaling, which updates your provisioned capacity settings automatically as your traffic changes throughout the day.

Using Amazon Keyspaces
One of the first “serious” applications I built as a kid was an archive for my books. I’d like to rebuild it now as a serverless API, using:

With Amazon Keyspaces, your data is stored in keyspaces and tables. A keyspace gives you a way to group related tables together. In the blog post for the preview, I used the console to configure my data model. Now, I can also use AWS CloudFormation to manage my keyspaces and tables as code. For example, I can create a bookstore keyspace and a books table with this CloudFormation template:

AWSTemplateFormatVersion: '2010-09-09'
Description: Amazon Keyspaces for Apache Cassandra example

Resources:

  BookstoreKeyspace:
    Type: AWS::Cassandra::Keyspace
    Properties: 
      KeyspaceName: bookstore

  BooksTable:
    Type: AWS::Cassandra::Table
    Properties: 
      TableName: books
      KeyspaceName: !Ref BookstoreKeyspace
      PartitionKeyColumns: 
        - ColumnName: isbn
          ColumnType: text
      RegularColumns: 
        - ColumnName: title
          ColumnType: text
        - ColumnName: author
          ColumnType: text
        - ColumnName: pages
          ColumnType: int
        - ColumnName: year_of_publication
          ColumnType: int

Outputs:
  BookstoreKeyspaceName:
    Description: "Keyspace name"
    Value: !Ref BookstoreKeyspace # Or !Select [0, !Split ["|", !Ref BooksTable]]
  BooksTableName:
    Description: "Table name"
    Value: !Select [1, !Split ["|", !Ref BooksTable]]

If you don’t specify a name for a keyspace or a table in the template, CloudFormation generates a unique name for you. Note that in this way keyspaces and tables may contain uppercase characters that are outside of the usual Cassandra conventions, and you need to put those names between double quotes when using Cassandra Query Language (CQL).
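
As a sketch of how an application could handle such names, the helper below quotes an identifier only when needed. It relies on the CQL rule that unquoted identifiers are case-insensitive, while names in double quotes keep their case. The generated name in the example is made up, and the helper does not check for reserved CQL keywords, which also need quoting.

```python
import re

# Names made only of lowercase letters, digits, and underscores are safe unquoted.
_UNQUOTED = re.compile(r"[a-z][a-z0-9_]*")


def quote_cql_identifier(name):
    """Wrap a keyspace or table name in double quotes when CQL requires it."""
    if _UNQUOTED.fullmatch(name):
        return name
    # Double quotes inside a quoted identifier are escaped by doubling them.
    return '"' + name.replace('"', '""') + '"'


generated_table = "MyStack-BooksTable-1ABCDEF"  # hypothetical generated name
statement = "SELECT * FROM {}.{}".format(
    quote_cql_identifier("bookstore"), quote_cql_identifier(generated_table)
)
print(statement)
```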

When the creation of the stack is complete, I see the new bookstore keyspace in the console:

Selecting the books table, I have an overview of its configuration, including the partition key, the clustering columns, and all the columns, and the option to change the capacity mode for the table from on-demand to provisioned:

For authentication and authorization, Amazon Keyspaces supports AWS Identity and Access Management (IAM) identity-based policies that you can use with IAM users, groups, and roles. Here’s a list of actions, resources, and conditions that you can use in IAM policies with Amazon Keyspaces. You can now also manage access to resources based on tags.

You can use IAM roles using AWS Signature Version 4 Process (SigV4) with this open source authentication plugin for the DataStax Java driver. In this way you can run your applications inside an Amazon Elastic Compute Cloud (EC2) instance, a container managed by Amazon ECS or Amazon Elastic Kubernetes Service, or a Lambda function, and leverage IAM roles for authentication and authorization to Amazon Keyspaces, without the need to manage credentials. Here’s a sample application that you can test on an EC2 instance with an associated IAM role giving access to Amazon Keyspaces.

Going back to my books API, I create all the resources I need, including a keyspace and a table, with the following AWS Serverless Application Model (SAM) template.

AWSTemplateFormatVersion: '2010-09-09'
Transform: AWS::Serverless-2016-10-31
Description: Sample Books API using Cassandra as database

Globals:
  Function:
    Timeout: 30

Resources:

  BookstoreKeyspace:
    Type: AWS::Cassandra::Keyspace

  BooksTable:
    Type: AWS::Cassandra::Table
    Properties: 
      KeyspaceName: !Ref BookstoreKeyspace
      PartitionKeyColumns: 
        - ColumnName: isbn
          ColumnType: text
      RegularColumns: 
        - ColumnName: title
          ColumnType: text
        - ColumnName: author
          ColumnType: text
        - ColumnName: pages
          ColumnType: int
        - ColumnName: year_of_publication
          ColumnType: int

  BooksFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: BooksFunction
      Handler: books.App::handleRequest
      Runtime: java11
      MemorySize: 2048
      Policies:
        - Statement:
          - Effect: Allow
            Action:
            - cassandra:Select
            Resource:
              - !Sub "arn:aws:cassandra:${AWS::Region}:${AWS::AccountId}:/keyspace/system*"
              - !Join
                - ""
                - - !Sub "arn:aws:cassandra:${AWS::Region}:${AWS::AccountId}:/keyspace/${BookstoreKeyspace}/table/"
                  - !Select [1, !Split ["|", !Ref BooksTable]] # !Ref BooksTable returns "Keyspace|Table"
          - Effect: Allow
            Action:
            - cassandra:Modify
            Resource:
              - !Join
                - ""
                - - !Sub "arn:aws:cassandra:${AWS::Region}:${AWS::AccountId}:/keyspace/${BookstoreKeyspace}/table/"
                  - !Select [1, !Split ["|", !Ref BooksTable]] # !Ref BooksTable returns "Keyspace|Table"
      Environment:
        Variables:
          KEYSPACE_TABLE: !Ref BooksTable # !Ref BooksTable returns "Keyspace|Table"
      Events:
        GetAllBooks:
          Type: HttpApi
          Properties:
            Method: GET
            Path: /books
        GetBookByIsbn:
          Type: HttpApi
          Properties:
            Method: GET
            Path: /books/{isbn}
        PostNewBook:
          Type: HttpApi
          Properties:
            Method: POST
            Path: /books

Outputs:
  BookstoreKeyspaceName:
    Description: "Keyspace name"
    Value: !Ref BookstoreKeyspace # Or !Select [0, !Split ["|", !Ref BooksTable]]
  BooksTableName:
    Description: "Table name"
    Value: !Select [1, !Split ["|", !Ref BooksTable]]
  BooksApi:
    Description: "API Gateway HTTP API endpoint URL"
    Value: !Sub "https://${ServerlessHttpApi}.execute-api.${AWS::Region}.amazonaws.com/"
  BooksFunction:
    Description: "Books Lambda Function ARN"
    Value: !GetAtt BooksFunction.Arn
  BooksFunctionIamRole:
    Description: "Implicit IAM Role created for Books function"
    Value: !GetAtt BooksFunctionRole.Arn

In this template, I don’t specify the keyspace and table names, so CloudFormation generates unique names automatically. The function IAM policy gives access to read (cassandra:Select) and write (cassandra:Modify) only to the books table. I am using the CloudFormation Fn::Select and Fn::Split intrinsic functions to get the table name. The driver also needs read access to the system* keyspaces.
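
To make the Fn::Split, Fn::Select, and Fn::Join combination more concrete, here is a rough Java equivalent of what the template computes for the table ARN. Only the "Keyspace|Table" format and the ARN layout come from the template; the class name, the account ID, and the keyspace and table names are illustrative assumptions.

```java
// Hypothetical sketch: rebuilds the table ARN the same way the template does.
final class TableArn {

    static String fromRef(String region, String accountId, String keyspaceTableRef) {
        // Like Fn::Split with "|" as the delimiter.
        String[] parts = keyspaceTableRef.split("\\|");
        String keyspace = parts[0]; // like Fn::Select with index 0
        String table = parts[1];    // like Fn::Select with index 1
        // Like Fn::Sub for the prefix and Fn::Join with an empty delimiter for the table name.
        return "arn:aws:cassandra:" + region + ":" + accountId
                + ":/keyspace/" + keyspace + "/table/" + table;
    }

    public static void main(String[] args) {
        // Made-up account ID and names, for illustration only.
        System.out.println(fromRef("eu-west-1", "123456789012", "MyKeyspace|MyTable"));
    }
}
```

The template takes the keyspace name from the keyspace resource rather than from the split, but as the comment in the Outputs section notes, element 0 of the split is the same value.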

To use the authentication plugin for the DataStax Java driver that supports IAM roles, I write the Lambda function in Java, using the APIGatewayV2ProxyRequestEvent and APIGatewayV2ProxyResponseEvent classes to communicate with the HTTP API created by the API Gateway.

package books;

import java.net.InetSocketAddress;
import java.security.NoSuchAlgorithmException;
import java.util.Collections;
import java.util.List;
import java.util.HashMap;
import java.util.Map;
import java.util.StringJoiner;
import javax.net.ssl.SSLContext;

import org.json.simple.JSONObject;
import org.json.simple.parser.JSONParser;
import org.json.simple.parser.ParseException;

import com.datastax.oss.driver.api.core.ConsistencyLevel;
import com.datastax.oss.driver.api.core.CqlSession;
import com.datastax.oss.driver.api.core.cql.*;

import software.aws.mcs.auth.SigV4AuthProvider;

import com.amazonaws.services.lambda.runtime.Context;
import com.amazonaws.services.lambda.runtime.RequestHandler;
import com.amazonaws.services.lambda.runtime.LambdaLogger;
import com.amazonaws.services.lambda.runtime.events.APIGatewayV2ProxyRequestEvent;
import com.amazonaws.services.lambda.runtime.events.APIGatewayV2ProxyResponseEvent;

public class App implements RequestHandler<APIGatewayV2ProxyRequestEvent, APIGatewayV2ProxyResponseEvent> {
    
    JSONParser parser = new JSONParser();
    String[] keyspace_table = System.getenv("KEYSPACE_TABLE").split("\\|");
    String keyspace = keyspace_table[0];
    String table = keyspace_table[1];
    CqlSession session = getSession();
    PreparedStatement selectBookByIsbn = session.prepare("select * from \"" + table + "\" where isbn = ?");
    PreparedStatement selectAllBooks = session.prepare("select * from \"" + table + "\"");
    PreparedStatement insertBook = session.prepare("insert into \"" + table + "\" "
    + "(isbn, title, author, pages, year_of_publication)" + "values (?, ?, ?, ?, ?)");
    
    public APIGatewayV2ProxyResponseEvent handleRequest(APIGatewayV2ProxyRequestEvent request, Context context) {
        
        LambdaLogger logger = context.getLogger();
        
        String responseBody;
        int statusCode = 200;
        
        String routeKey = request.getRequestContext().getRouteKey();
        logger.log("routeKey = '" + routeKey + "'");
        
        if (routeKey.equals("GET /books")) {
            ResultSet rs = execute(selectAllBooks.bind());
            StringJoiner jsonString = new StringJoiner(", ", "[ ", " ]");
            for (Row row : rs) {
                String json = row2json(row);
                jsonString.add(json);
            }
            responseBody = jsonString.toString();
        } else if (routeKey.equals("GET /books/{isbn}")) {
            String isbn = request.getPathParameters().get("isbn");
            logger.log("isbn: '" + isbn + "'");
            ResultSet rs = execute(selectBookByIsbn.bind(isbn));
            if (rs.getAvailableWithoutFetching() == 1) {
                responseBody = row2json(rs.one());
            } else {
                statusCode = 404;
                responseBody = "{\"message\": \"not found\"}";
            }
        } else if (routeKey.equals("POST /books")) {
            String body = request.getBody();
            logger.log("Body: '" + body + "'");
            JSONObject requestJsonObject = null;
            if (body != null) {
                try {
                    requestJsonObject = (JSONObject) parser.parse(body);
                } catch (ParseException e) {
                    e.printStackTrace();
                }
                if (requestJsonObject != null) {
                    int i = 0;
                    BoundStatement boundStatement = insertBook.bind()
                    .setString(i++, (String) requestJsonObject.get("isbn"))
                    .setString(i++, (String) requestJsonObject.get("title"))
                    .setString(i++, (String) requestJsonObject.get("author"))
                    .setInt(i++, ((Long) requestJsonObject.get("pages")).intValue())
                    .setInt(i++, ((Long) requestJsonObject.get("year_of_publication")).intValue())
                    .setConsistencyLevel(ConsistencyLevel.LOCAL_QUORUM);
                    ResultSet rs = execute(boundStatement);
                    statusCode = 201;
                    responseBody = body;
                } else {
                    statusCode = 400;
                    responseBody = "{\"message\": \"JSON parse error\"}";
                }
            } else {
                statusCode = 400;
                responseBody = "{\"message\": \"body missing\"}";
            }
        } else {
            statusCode = 405;
            responseBody = "{\"message\": \"not implemented\"}";
        }
        
        Map<String, String> headers = new HashMap<>();
        headers.put("Content-Type", "application/json");
        
        APIGatewayV2ProxyResponseEvent response = new APIGatewayV2ProxyResponseEvent();
        response.setStatusCode(statusCode);
        response.setHeaders(headers);
        response.setBody(responseBody);
        
        return response;
    }
    
    private String getStringColumn(Row row, String columnName) {
        return "\"" + columnName + "\": \"" + row.getString(columnName) + "\"";
    }
    
    private String getIntColumn(Row row, String columnName) {
        return "\"" + columnName + "\": " + row.getInt(columnName);
    }
    
    private String row2json(Row row) {
        StringJoiner jsonString = new StringJoiner(", ", "{ ", " }");
        jsonString.add(getStringColumn(row, "isbn"));
        jsonString.add(getStringColumn(row, "title"));
        jsonString.add(getStringColumn(row, "author"));
        jsonString.add(getIntColumn(row, "pages"));
        jsonString.add(getIntColumn(row, "year_of_publication"));
        return jsonString.toString();
    }
    
    private ResultSet execute(BoundStatement bs) {
        final int MAX_RETRIES = 3;
        ResultSet rs = null;
        int retries = 0;

        do {
            try {
                rs = session.execute(bs);
            } catch (Exception e) {
                e.printStackTrace();
                session = getSession(); // New session
            }
        } while (rs == null && retries++ < MAX_RETRIES);
        return rs;
    }
    
    private CqlSession getSession() {
        
        System.setProperty("javax.net.ssl.trustStore", "./cassandra_truststore.jks");
        System.setProperty("javax.net.ssl.trustStorePassword", "amazon");
        
        String region = System.getenv("AWS_REGION");
        String endpoint = "cassandra." + region + ".amazonaws.com";
        
        System.out.println("region: " + region);
        System.out.println("endpoint: " + endpoint);
        System.out.println("keyspace: " + keyspace);
        System.out.println("table: " + table);
        
        SigV4AuthProvider provider = new SigV4AuthProvider(region);
        List<InetSocketAddress> contactPoints = Collections.singletonList(new InetSocketAddress(endpoint, 9142));
        
        CqlSession session;
                
        try {
            session = CqlSession.builder().addContactPoints(contactPoints).withSslContext(SSLContext.getDefault())
            .withLocalDatacenter(region).withAuthProvider(provider).withKeyspace("\"" + keyspace + "\"")
            .build();
        } catch (NoSuchAlgorithmException e) {
            session = null;
            e.printStackTrace();
        }
        
        return session;
    }
}

To connect to Amazon Keyspaces with TLS/SSL using the Java driver, I need to include a trustStore in the JVM arguments. When using the Cassandra Java Client Driver in a Lambda function, I can’t pass parameters to the JVM, so I pass the same options as system properties, and specify the SSL context when creating the CQL session with the withSslContext(SSLContext.getDefault()) parameter. Note that I also have to configure the pom.xml file, used by Apache Maven, to include the trustStore file as a dependency.

System.setProperty("javax.net.ssl.trustStore", "./cassandra_truststore.jks");
System.setProperty("javax.net.ssl.trustStorePassword", "amazon");

Now, I can use a tool like curl or Postman to test my books API. First, I take the endpoint of the API from the output of the CloudFormation stack. At the beginning there are no books stored in the books table, and if I do an HTTP GET on the resource, I get an empty JSON list. For readability, I am removing all HTTP headers from the output.

$ curl -i https://a1b2c3d4e5.execute-api.eu-west-1.amazonaws.com/books

HTTP/1.1 200 OK
[]

In the code, I am using a PreparedStatement to run a CQL statement to select all rows from the books table. The names of the keyspace and of the table are passed to the Lambda function in an environment variable, as described in the SAM template above.

Let’s use the API to add a book, by doing an HTTP POST on the resource.

$ curl -i -d '{ "isbn": "978-0201896831", "title": "The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)", "author": "Donald E. Knuth", "pages": 672, "year_of_publication": 1997 }' -H "Content-Type: application/json" -X POST https://a1b2c3d4e5.execute-api.eu-west-1.amazonaws.com/books

HTTP/1.1 201 Created
{ "isbn": "978-0201896831", "title": "The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)", "author": "Donald E. Knuth", "pages": 672, "year_of_publication": 1997 }
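
The handler shown earlier casts the fields of the POST body directly, so a request with a missing or mistyped field would fail with an exception rather than a 400 response. As a minimal sketch of one way to check the payload first, the following helper lists the problems in a parsed body. It is an assumption, not part of the original function: the class and method names are hypothetical, and it takes a plain Map (json-simple's JSONObject is a java.util.Map) so that it has no external dependencies.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical validation helper for the book payload used in this post.
final class BookPayload {

    // Returns the problems found in a parsed request body; an empty list means the body can be bound.
    static List<String> problems(Map<String, Object> body) {
        List<String> problems = new ArrayList<>();
        for (String field : new String[] {"isbn", "title", "author"}) {
            if (!(body.get(field) instanceof String)) {
                problems.add(field + " must be a string");
            }
        }
        for (String field : new String[] {"pages", "year_of_publication"}) {
            if (!(body.get(field) instanceof Number)) {
                problems.add(field + " must be a number");
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        // A body with only two of the five fields; the other three are reported.
        System.out.println(problems(Map.of("isbn", "978-0201896831", "title", "TAOCP")));
    }
}
```

A handler could return status code 400 with these messages when the list is not empty, and bind the statement only when it is.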

I can check that the data has been inserted in the table using the CQL Editor in the console, where I select all the rows in the table.

I repeat the previous HTTP GET to get the list of the books, and I see the one I just created.

$ curl -i https://a1b2c3d4e5.execute-api.eu-west-1.amazonaws.com/books

HTTP/1.1 200 OK
[ { "isbn": "978-0201896831", "title": "The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)", "author": "Donald E. Knuth", "pages": 672, "year_of_publication": 1997 } ]

I can get a single book by ISBN, because the isbn column is the primary key of the table and I can use it in the where condition of a select statement.

$ curl -i https://a1b2c3d4e5.execute-api.eu-west-1.amazonaws.com/books/978-0201896831

HTTP/1.1 200 OK
{ "isbn": "978-0201896831", "title": "The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)", "author": "Donald E. Knuth", "pages": 672, "year_of_publication": 1997 }

If there is no book with that ISBN, I return a “not found” message:

$ curl -i https://a1b2c3d4e5.execute-api.eu-west-1.amazonaws.com/books/1234567890

HTTP/1.1 404 Not Found
{"message": "not found"}

It works! We just built a fully serverless API using CQL to read and write data using temporary security credentials, managing the whole infrastructure, including the database table, as code.

Available Now
Amazon Keyspaces (for Apache Cassandra) is ready for your applications. Please see this table for regional availability. You can find more information on how to use Keyspaces in the documentation. In this post, I built a new application, but you can get lots of benefits by migrating your current tables to a fully managed environment. For migrating data, you can now use cqlsh as described in this post.

Let me know what you are going to use it for!

Danilo

Using AWS CodeBuild to execute administrative tasks

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/devops/using-aws-codebuild-to-execute-administrative-tasks/

This article is a guest post from AWS Serverless Hero Gojko Adzic.

At MindMup, we started using AWS CodeBuild to quickly lift and shift support tasks to the cloud. MindMup is a collaborative mind-mapping tool, used by millions of teachers and students to collaborate on assignments, structure ideas, and organize and navigate complex information. Still, the team behind the product consists of just two people, and we’re both responsible for everything from sales and product management to programming and customer support. One of the key reasons why such a tiny team can support a large group of users is that we tend to automate all recurring tasks in order to free up our time for more productive work.

Administrative support tasks often start as ad-hoc command line scripts, with manual intervention to resolve exceptions. As the scripts stabilize, humans can be less involved, so teams look for ways of scheduling and automating job executions. For infrastructure deployed to AWS, this also means moving away from running scripts on on-premises developer or operations computers to running them in the cloud. With utilization-based pricing and on-demand capacity, AWS Lambda and AWS Fargate are the two obvious choices for running such tasks in AWS. There is a third option, often overlooked: CodeBuild. Although CodeBuild is designed for a completely different purpose, it offers some compelling features that make it very easy to set up and run periodic support jobs, especially as a first easy step towards a more systematic solution.

Solution overview

CodeBuild is, as the name suggests, a managed service for executing typical software build jobs. In some ways, such as each job having associated IAM permissions, CodeBuild is similar to Lambda and Fargate. One of the fundamental differences between CodeBuild jobs and Lambda functions or Fargate tasks is the location of the executable definition of the job. The executable definition of a Lambda function is in a ZIP archive deployed to Lambda. For Fargate, the executable definition is in a Docker container image, deployed in a task with Amazon ECS or in a Kubernetes pod with Amazon EKS. Both services require an explicit deployment to update the executable definition of a task. For CodeBuild jobs, the executable definition is not deployed to an AWS service. Instead, it is in a source code control system that you can manage locally or using a service such as GitHub or AWS CodeCommit.

Sitting alongside the rest of the source code, each CodeBuild task has an entry-point configuration file, by convention called a buildspec.yml. The buildspec.yml file lists the programming language runtimes required by the job, and the steps to execute before, during and after the build job. For example, the following buildspec.yml sets up a build environment for JavaScript with Node.js 12, installs dependencies, runs tests, and then produces a deployment package using webpack.

version: 0.2

phases:
  install:
    runtime-versions:
      nodejs: 12
    commands:
      - npm install
  build:
    commands:
      - npm test
      - npm run webpack

Usually, the buildspec.yml file involves some variant of installing dependencies, compiling code and running tests, then packaging and versioning artifacts. But the steps of a buildspec.yml file are actually just shell commands, so CodeBuild doesn’t necessarily need to run tasks related to compiling or packaging. It can execute any sequence of Unix commands, scripts, or binaries. This makes CodeBuild a uniquely compelling choice for the transition from running shell scripts on an operations machine to running a shell script in the cloud.

Comparing CodeBuild and Lambda for administrative tasks

The major advantage of CodeBuild over Lambda functions for support jobs is that the scripts can be significantly more flexible. Moving from shell scripts to Lambda functions usually means rewriting the task in a language such as JavaScript or Python. You can execute a shell script from a Lambda function when using Amazon Linux 1 instances, or even use a Bash custom runtime, but when using CodeBuild, you can execute the same shell script without changes.

Lambda functions usually run only in a single language. Support tasks often perform a chain of actions, and different steps might require utilities written in different languages. Running such varied tasks with a Lambda function would require constructing a custom Lambda runtime, or splitting steps into multiple functions with different runtimes, and then somehow coordinating and passing data between them. AWS Step Functions can be used to coordinate the workflow, but most support tasks are a sequence of steps, to be executed in order if the previous one succeeds. With CodeBuild, you can configure the task to include all required runtimes.

Support tasks often need to transform the output of one tool and pass it into a different tool. For example, a task might select rows containing expired accounts from a database, extract only the user emails, separate them with commas, and send the result to an automated mailer with a template. Tools such as grep, awk, and sed become invaluable for such transformations. However, they aren’t available on new Lambda runtimes.

Lambda runtimes based on Amazon Linux 2 bundle only the absolutely minimal operating system packages. Even the basic command line Linux utilities, such as which, are not packaged with the recent Lambda runtimes. On the other hand, CodeBuild runs tasks in a full-blown Linux environment. Executing support tasks through CodeBuild means that you can pipe results into all the standard Unix tools, without having to use half-baked replacements written in a scripting language.

For applications running in the AWS ecosystem, support tasks often need to communicate with AWS services or resources. Standard CodeBuild environments also come with the aws command line tools, so you can use them without any additional setup. This becomes especially important for moving data from and to Amazon S3, where command line tools have operations for batch uploads or downloads or recursive directory synchronization. Those operations are not directly available through the programming language SDK libraries.

It is, of course, possible to install additional binaries to Lambda functions by building them for the right Linux environment. Because the standard shared system libraries are also not in the recent Lambda runtimes, compiling additional tools is akin to building a Linux distribution from scratch. With CodeBuild, most standard tools are included already, and you can add additional tools to the system by using an operating system package manager (apt-get or yum).

CodeBuild execution environments can also be more flexible in terms of execution time and performance constraints. Lambda tasks are currently limited to fifteen minutes. The only performance setting you can influence is the memory size, which proportionally impacts the CPU power. The highest setting is currently 3GB memory, which assigns two virtual cores. CodeBuild allows you to configure tasks which can run for up to 8 hours. You can also explicitly select a compute type, including using GPU processors and going all the way up to 255 GB memory or 72 virtual CPU cores. This makes CodeBuild an interesting choice for tasks that need to potentially run longer than fifteen minutes, that are very computationally intensive, or that need a lot of working memory.
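
The limits in the paragraph above can be turned into a rough rule of thumb. The following sketch only encodes the figures quoted there (15 minutes and 3 GB for Lambda, 8 hours and 255 GB for CodeBuild), which may have changed since this was written; the class and method names are hypothetical.

```java
// Hypothetical rule of thumb based only on the limits quoted in this article.
final class PlatformFit {

    static final int LAMBDA_MAX_MINUTES = 15;
    static final int LAMBDA_MAX_MEMORY_GB = 3;
    static final int CODEBUILD_MAX_MINUTES = 8 * 60;
    static final int CODEBUILD_MAX_MEMORY_GB = 255;

    // Suggests where a task fits, given its expected duration and working memory.
    static String suggest(int expectedMinutes, int memoryGb) {
        if (expectedMinutes <= LAMBDA_MAX_MINUTES && memoryGb <= LAMBDA_MAX_MEMORY_GB) {
            return "Lambda or CodeBuild";
        }
        if (expectedMinutes <= CODEBUILD_MAX_MINUTES && memoryGb <= CODEBUILD_MAX_MEMORY_GB) {
            return "CodeBuild";
        }
        return "neither";
    }

    public static void main(String[] args) {
        System.out.println(suggest(5, 1));   // Lambda or CodeBuild
        System.out.println(suggest(90, 16)); // CodeBuild
    }
}
```

Duration and memory are only two of the factors discussed here; the tooling available in the environment often matters more for support scripts.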

On the other hand, compared to Lambda functions, CodeBuild jobs start significantly slower and running them in parallel is not as easy or convenient. For example, by default you can only run up to 60 CodeBuild tasks in parallel, but this is a soft limit that you can increase. However, support tasks are mostly batch jobs by nature, so saving a few seconds or being able to execute thousands of such tasks in parallel is not usually important.

Comparing CodeBuild and Amazon EC2/Fargate for administrative tasks

Most of the limitations of Lambda functions for admin jobs could be solved by running a virtual machine through Amazon EC2. In fact, running tasks on Amazon EC2 was the usual way of lifting support tasks from the operations computers and moving them into the cloud until Lambda became available. However, due to how Amazon EC2 instances are billed, teams often bundled all the operations tasks on a single Amazon EC2 instance. That instance needed a superset of all the security privileges required by the various tasks, opening potential security risks. That’s where Fargate can help. Fargate runs container-based tasks on demand, offering utilization-based billing and removing many restrictions of Lambda, such as the 15-minute runtime and reduced operating system environment, also allowing you to choose execution environments more flexibly.

This means that, compared to Fargate tasks, CodeBuild execution is more or less comparable in terms of what you can run and how much power you can assign to your tasks. Both can use a custom Docker container, and both run a full-blown operating system with all the standard binaries. They are also similar in terms of start-up time and parallelization. However, setting up a CodeBuild job and updating it later is much easier than with Fargate using tasks or pods.

With Fargate, you need to provide a custom Docker container with the right entry point. CodeBuild lets you use custom containers or choose standard images provided by AWS, including Ubuntu or Amazon Linux images. Likewise, configuring a Fargate task involves deploying in an Amazon VPC, and if the task needs to access other AWS services, setting up a NAT gateway. CodeBuild tasks have network access by default, and can be deployed in a VPC if required.

Updating support scripts can also be easier with CodeBuild than with Fargate. Deploying a new version of a task into Fargate involves building a new Docker container and uploading it to a container manager such as Amazon ECS or Amazon EKS. Deploying a new version of a CodeBuild job involves committing to the version control system, without the need to set up a CI/CD pipeline. This makes CodeBuild a compelling way of setting up support tasks, especially for larger organizations with strict access rules. Support people can update task definitions by having access to the source code control system, without the need to get access to production resources on AWS.

Fargate environments are transient, similar to Lambda functions. If you want to preserve some files between job runs (for example, compiled task binaries or installed dependencies), you would have to manage that manually with Fargate. CodeBuild supports artifact caching out of the box, so it’s significantly easier to preserve data files or installed dependencies between runs.

Potential downsides

Although taking supporting tasks directly from the source code repository is one of the biggest advantages of CodeBuild over Fargate or Lambda, it can also be a major drawback. Ensuring that the scripts are always in a stable condition requires discipline regarding committing to the trunk. Without such discipline, untested or unstable code might be used for admin tasks by mistake. A potential workaround for teams without good trunk commit discipline would be to use a specific branch for CodeBuild tasks, and then merge code into that branch once it is ready to be released.

Using support scripts directly from a source code repository makes it more complicated to synchronize versions with other deployed software. If you need the support scripts to track the exact version of code that was deployed to other services, it’s probably safer and easier to use Lambda functions or Fargate containers with an explicit deployment step.

Executing support tasks through CodeBuild

CodeBuild jobs take a bit more setup than Lambda functions, but significantly less than Fargate tasks. Below is an example of a CodeBuild job set up through AWS CloudFormation.

Architecture diagram for the CodeBuild being used for administrative tasks

Here are a few things to note:

  • You can add the required IAM permissions for the task into the Policies section of the CodeBuildRole resource.
  • The Environment section of the CodeBuildProject resource is where you can define the container image, choose the virtual hardware or set up environment variables to configure the task.
  • Environment variables are directly available for the shell commands listed in the buildspec.yml file, so this trick allows you to easily parameterize jobs to use resources from the same AWS CloudFormation template.
  • The Location and BuildSpec properties in the Source section define the source code repository, and the path of the buildspec.yml file within the repository.

Resources:
  CodeBuildRole:
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: "codebuild.amazonaws.com"
            Action: "sts:AssumeRole"
      Policies:
        - PolicyName: AllowLogs
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 'logs:*'
                Resource: '*'

  CodeBuildProject:
    Type: AWS::CodeBuild::Project
    Properties:
      Name: !Ref JobName
      ServiceRole: !GetAtt CodeBuildRole.Arn
      Artifacts:
        Type: NO_ARTIFACTS
      LogsConfig:
        CloudWatchLogs:
          Status: ENABLED
      Cache:
        Type: NO_CACHE
      Environment:
        Type: LINUX_CONTAINER
        ComputeType: BUILD_GENERAL1_SMALL
        Image: aws/codebuild/standard:3.0
        EnvironmentVariables:
          - Name: SYSTEM_BUCKET 
            Value: !Ref SystemBucketName
      Source:
        Type: GITHUB
        Location: !Ref GithubRepository 
        GitCloneDepth: 1
        BuildSpec: !Ref BuildSpecPath 
        ReportBuildStatus: False
        InsecureSsl: False
      TimeoutInMinutes: !Ref TimeoutInMinutes

CodeBuild jobs usually run after changes to source code files. Support tasks usually need to run on a periodic schedule. The previous snippet did not define the Triggers property for the CodeBuild job, so it will not track source code changes or run automatically. Instead, you can set up an Amazon CloudWatch Event rule (or optionally use Amazon EventBridge, which provides more sophisticated rules) that will periodically trigger the CodeBuild job. Here is how to do that with AWS CloudFormation:

  RunCodeBuildJobRole:
    Condition: ScheduleRuns
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: "events.amazonaws.com"
            Action: "sts:AssumeRole"
      Policies:
        - PolicyName: StartTask 
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 'codebuild:StartBuild'
                Resource:
                  - !GetAtt CodeBuildProject.Arn

  RunCodeBuildJobRoleRule:
    Condition: ScheduleRuns
    Type: AWS::Events::Rule
    Properties:
      Name: !Sub '${JobName}-scheduler'
      Description: Periodically runs codebuild job to archive defunct accounts
      ScheduleExpression: !Ref ScheduleRate
      State: ENABLED
      Targets:
        - Arn: !GetAtt CodeBuildProject.Arn
          Id: CodeBuildProject
          RoleArn: !GetAtt RunCodeBuildJobRole.Arn

Note the ScheduleExpression property of the RunCodeBuildJobRoleRule resource. You can use any supported CloudWatch schedule expression there to set up when or how frequently your job runs.
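
Schedule expressions come in two forms: rate expressions such as rate(1 day) and cron expressions such as cron(0 12 * * ? *). One detail that is easy to get wrong is that a rate expression uses the singular unit when the value is 1 and the plural otherwise. The following sketch builds a valid rate expression for the ScheduleRate parameter; the class and method names are hypothetical.

```java
// Hypothetical helper that formats a rate expression for a scheduled rule.
final class ScheduleExpressions {

    // The unit is passed in singular form; it is pluralized when the value is not 1.
    static String rate(int value, String singularUnit) {
        if (value < 1) {
            throw new IllegalArgumentException("value must be a positive integer");
        }
        if (!singularUnit.equals("minute") && !singularUnit.equals("hour") && !singularUnit.equals("day")) {
            throw new IllegalArgumentException("unit must be minute, hour, or day");
        }
        return "rate(" + value + " " + singularUnit + (value == 1 ? "" : "s") + ")";
    }

    public static void main(String[] args) {
        System.out.println(rate(1, "day"));     // rate(1 day)
        System.out.println(rate(15, "minute")); // rate(15 minutes)
    }
}
```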

Observability and audit logs

If a support job fails for any reason, people need to know. Luckily, CodeBuild already integrates nicely with CloudWatch to report job statuses, so you can set up another CloudWatch Event rule that tracks failures and alerts someone about it. To make notifications flexible, you can send them to an Amazon SNS topic. You can then subscribe for email notifications or forward those alerts somewhere else easily. The following wires up notifications with an AWS CloudFormation template.

  SnsPublishRole:
    Condition: CreateSNSNotifications
    Type: "AWS::IAM::Role"
    Properties:
      AssumeRolePolicyDocument:
        Version: "2012-10-17"
        Statement:
          - Effect: Allow
            Principal:
              Service: "events.amazonaws.com"
            Action: "sts:AssumeRole"
      Policies:
        - PolicyName: AllowLogs
          PolicyDocument:
            Version: '2012-10-17'
            Statement:
              - Effect: Allow
                Action:
                  - 'SNS:Publish'
                Resource:
                  - !Ref SnsTopicArn

  CodeBuildNotificationRule:
    Condition: CreateSNSNotifications
    Type: AWS::Events::Rule
    Properties:
      Name: !Sub '${JobName}-fail-notification'
      Description: Notify about codebuild project failures
      RoleArn: !GetAtt SnsPublishRole.Arn
      EventPattern:
        source:
          - "aws.codebuild"
        detail-type:
          - "CodeBuild Build State Change"
        detail:
          build-status:
            - "FAILED"
            - "STOPPED"
          project-name:
            - !Ref CodeBuildProject
      State: ENABLED
      Targets:
        - Arn: !Ref SnsTopicArn
          Id: NotificationTopic

Another option to keep the execution of your tasks under control is to generate a report using the test report functionality introduced a few months ago. To do that, specify in the buildspec.yml file the location of the files that store the results you want to include in your report.

Testing administrative tasks

Note the build-status list inside the CodeBuildNotificationRule resource. This defines a list of statuses about which you want to publish alerts. In the previous snippet, the list does not include successful runs. That’s because it’s usually not necessary to take any action when a support job runs successfully. However, during initial testing you may want to add IN_PROGRESS (notify when a task starts) and SUCCEEDED (notify when the job ends without an error).
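One way to switch between the production and testing status lists is to generate the event pattern from a small script. The sketch below mirrors the EventPattern from the template. The function name build_event_pattern, the testing flag, and the project name archive-job are assumptions made for this example.

```python
import json

# Statuses that should always raise an alert.
FAILURE_STATUSES = ["FAILED", "STOPPED"]
# Extra statuses that are only useful while testing the job.
TESTING_STATUSES = ["IN_PROGRESS", "SUCCEEDED"]


def build_event_pattern(project_name: str, testing: bool = False) -> dict:
    """Return an event pattern matching CodeBuild state changes for one project."""
    statuses = list(FAILURE_STATUSES)
    if testing:
        statuses += TESTING_STATUSES
    return {
        "source": ["aws.codebuild"],
        "detail-type": ["CodeBuild Build State Change"],
        "detail": {
            "build-status": statuses,
            "project-name": [project_name],
        },
    }


# "archive-job" is a hypothetical project name.
print(json.dumps(build_event_pattern("archive-job", testing=True), indent=2))
```

Once testing is done, generating the pattern with testing=False brings back the quieter, failure-only list.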

Finally, one of the biggest challenges when moving scripts from an operations machine to running in CodeBuild is to create the right IAM policies. Command-line users on operations machines usually have a wide set of privileges, and identifying the minimum required for a specific job usually involves starting small, then iterating over failed attempts and opening up required operations. Running that process directly through CodeBuild can be quite slow. Instead, I suggest setting up a separate IAM policy for the job, then assigning it both to the role for the CodeBuild task, and to a command-line role or a command-line user. You can then iterate quickly directly on the command line and identify all required IAM operations, then remove the additional command-line user when done.

Conclusion

The next time you need to move a support task to the cloud, and you need a rich execution environment, consider using CodeBuild, at least as the initial step towards a more systematic solution. It will allow you to quickly get a script up and running with all the benefits of IAM isolation, scheduled execution, and reliable notifications.

Gojko is the author of the Running Serverless book and interactive course. He is currently working on Video Puppet, a tool for editing videos as easily as editing text. You can reach out to him on Twitter.

Now Open – Third Availability Zone in the AWS Canada (Central) Region

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/now-open-third-availability-zone-in-the-aws-canada-central-region/

When you start an EC2 instance, or store data in an S3 bucket, it’s easy to underestimate what an AWS Region is. Right now, we have 22 across the world, and while they look like dots on a global map, they are architected to let you run applications and store data with high availability and fault tolerance. In fact, each of our Regions is made up of multiple data centers, which are geographically separated into what we call Availability Zones (AZs).

Today, I am very happy to announce that we added a third AZ to the AWS Canada (Central) Region to support our customer base in Canada.

This third AZ provides customers with additional flexibility to architect scalable, fault-tolerant, and highly available applications, and will support additional AWS services in Canada. We opened the Canada (Central) Region in December 2016, just over 3 years ago, and we’ve more than tripled the number of available services as we bring on this third AZ.

Each AZ is in a separate and distinct geographic location with enough distance to significantly reduce the risk of a single event impacting availability in the Region, yet near enough for business continuity applications that require rapid failover and synchronous replication. For example, our Canada (Central) Region is located in the Montreal area of Quebec, and the upcoming new AZ will be on the mainland more than 45 km (28 miles) away from the next-closest AZ as the crow flies.

Where we place our Regions and AZs is a deliberate and thoughtful process that takes into account not only latency or distance, but also risk profiles. To keep the risk profile low, we look at decades of data related to floods and other environmental factors before we settle on a location. Montreal was heavily impacted in 1998 by a massive ice storm that crippled the power grid and brought down more than 1,000 transmission towers, leaving four million people in neighboring provinces and some areas of New York and Maine without power. In order to ensure that AWS infrastructure can withstand inclement weather such as this, half of the AZs’ interconnections use underground cables and are out of reach of potential ice storms. In this way, every AZ is connected to the other two AZs by at least one 100% underground fiber path.

We’re excited to bring a new AZ to Canada to serve our incredible customers in the region. Here are some examples from different industries, courtesy of my colleagues in Canada:

Healthcare – AlayaCare delivers cloud-based software to home care organizations across Canada and all over the world. As a home healthcare technology company, they need in-country data centers to meet regulatory requirements.

Insurance – Aviva is delivering a world-class digital experience to its insurance clients in Canada and the expansion of the AWS Region is welcome as they continue to move more of their applications to the cloud.

E-Learning – D2L leverages various AWS Regions around the world, including Canada, to deliver a seamless experience for their clients. They have been on AWS for more than four years, and recently completed an all-in migration.

With this launch, AWS now has 70 AZs within 22 geographic Regions around the world, plus 5 new Regions coming. We are continuously looking at expanding our infrastructure footprint globally, driven largely by customer demand.

To see how we use AZs in Amazon, have a look at this article on Static stability using Availability Zones by Becky Weiss and Mike Furr. It’s part of the Amazon Builders’ Library, a place where we share what we’ve learned over the years.

For more information on our global infrastructure, and the custom hardware we use, check out this interactive map.

Danilo


A Third Availability Zone Launches in the AWS Canada (Central) Region

When you launch an EC2 instance, or store your data in Amazon S3, it is easy to underestimate the scope of an AWS cloud Region. Right now, we have 22 Regions around the world. Although they look like small dots on a large map, they are designed to let you run applications and store data with high availability and fault tolerance. In fact, each of our Regions comprises multiple distinct data centers, grouped into what we call Availability Zones.

Today, I am very happy to announce that we have added a third Availability Zone to the AWS Canada (Central) Region to meet the growing demand of our Canadian customers.

This third Availability Zone gives customers additional flexibility to design scalable, fault-tolerant, and highly available applications. It will also support a larger number of AWS services in Canada. We opened the Region in December 2016, a little over three years ago, and we have more than tripled the number of available services as we launch this third zone.

Each AWS Availability Zone is in a separate and distinct geographic location, far enough away to reduce the risk of a single event affecting availability in the Region, yet close enough to support business continuity applications that require rapid failover and synchronous replication. For example, our Canada (Central) Region is located in the Greater Montreal area of Quebec. The new Availability Zone will be more than 45 km, as the crow flies, from the nearest Availability Zone.

Choosing where to place our Regions and Availability Zones is a deliberate and thoughtful process that takes into account not only latency and distance, but also risk profiles. For example, we examine decades of data on floods and other environmental factors before we settle on a location. This lets us keep the risk profile low. In 1998, Montreal was hit hard by the ice storm, which not only crippled the power grid and brought down more than 1,000 transmission towers, but also left four million people without power in neighboring provinces and parts of the states of New York and Maine. To make sure that AWS infrastructure can withstand such severe weather, half of the cabled interconnections between AWS Availability Zones run underground, sheltered from potential ice storms, for example. In this way, every Availability Zone is connected to the other two by at least one fully underground fiber network.

We are delighted to offer our Canadian customers a new Availability Zone in the Region. Here are some customer examples from different industries, courtesy of my Canadian colleagues:

Healthcare – AlayaCare provides cloud-based home care software to home care organizations in Canada and around the world. For a home care technology company, having in-country data centers is essential and allows it to meet regulatory requirements.

Insurance – Aviva offers a world-class digital experience to its insurance customers in Canada. The expansion of the AWS Region is welcome as they continue to migrate a growing number of their applications to the cloud.

E-Learning – D2L relies on various Regions around the world, including the one in Canada, to offer a seamless experience to its customers. They have been on AWS for more than four years and recently completed a full migration.

With this launch, AWS now has 70 Availability Zones in 22 geographic Regions around the world, with five new Regions to come. We are continuously looking for ways to expand our infrastructure globally, driven in part by growing customer demand.

To understand how we use Availability Zones at Amazon, see this article on static stability using Availability Zones by Becky Weiss and Mike Furr. It is part of the Amazon Builders’ Library, a place where we share what we have learned over the years.

For more information on our global infrastructure and the custom hardware we use, check out this interactive map.

Danilo

New – Serverless Lens in AWS Well-Architected Tool

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-serverless-lens-in-aws-well-architected-tool/

When you build and run applications in the cloud, how often are you asking yourself “am I doing this right?”

This is actually a very good question, and to help you get a good answer, we publicly released the AWS Well-Architected Framework in 2015. It is a formal approach to compare your workload against our best practices, and get guidance on how to improve. Today, the Well-Architected Framework gives a consistent way for customers and partners to design and evaluate cloud architectures, and is based on five pillars:

  • Operational Excellence
  • Security
  • Reliability
  • Performance Efficiency
  • Cost Optimization

To provide more workload-specific advice, in 2017 we extended the framework with the concept of “lens” to go beyond a general perspective, and enter specific technology domains. Currently, there are three lenses that you can use:

  • Serverless
  • High Performance Computing (HPC)
  • IoT (Internet of Things)

The first thing to do to improve something is to decide what to measure and how. To let you review your workloads in a more structured way, in 2018 we launched the AWS Well-Architected Tool, a free tool available in the AWS Management Console, where you can define your workload and answer questions regarding the five pillars.

You can use the Well-Architected Tool in different ways. For example:

  • If you’re working on a specific application, you can use the tool to assess risks and find areas for improvement.
  • If you’re responsible for multiple applications, you can use the tool to get visibility on the current status for all of them.

Today, I am happy to announce that we added the ability to apply lenses to the Well-Architected Tool, and the first one to be available is the Serverless Lens!

Using the Serverless Lens in AWS Well-Architected Tool
In the Well-Architected Tool console, I start by defining my workload. I am currently building the backend for a mobile app using the Amplify Framework. It’ll be a simple game, but I am going to use DynamoDB Global Tables to store data for my users, and the application will be running in two AWS Regions. Adding the AWS account IDs is optional, but can be useful to understand the application deployment in a multi-account setup.

Now, I can choose which lenses to apply. The AWS Well-Architected Framework is there by default. I select the Serverless Lens. This is adding a set of additional questions that help me understand how to design, deploy, and architect my serverless app following the framework best practices.

When the workload is defined, I start my review. I jump straight to the Serverless Lens. The new questions are distributed across the five pillars. For example, one of my favorite questions is around performance:

For each question, there are resources on the right side of the console that help me understand the possible answers and the terminology used. I select the activities and the technology choices that are part of my implementation, specifically:

  • I am using data streams (like those provided by Amazon Kinesis, or DynamoDB Streams) and asynchronous function invocations to improve concurrency.
  • I am caching user data in memory to reduce database accesses. I could also use the /tmp of the Lambda functions, or external data stores like Amazon ElastiCache.
  • I am removing functions when a service integration can natively do the job, for example when I need to call Kinesis Data Firehose from the Amazon API Gateway (this is optimizing my costs, too).
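To make the in-memory caching choice more concrete, here is a minimal Python sketch of the idea, not the code of the app described above. It relies on the fact that module-level state in a Lambda function survives between invocations while the same execution environment is reused. The names get_user, load_user, and CACHE_TTL_SECONDS, and the 60-second lifetime, are assumptions made for this example.

```python
import time
from typing import Any, Callable, Dict, Tuple

# Module-level state is kept between invocations while the same
# execution environment is reused, and is lost on a cold start.
_CACHE: Dict[str, Tuple[float, Any]] = {}
CACHE_TTL_SECONDS = 60.0  # hypothetical lifetime for cached entries


def get_user(
    user_id: str,
    load_user: Callable[[str], Any],
    now: Callable[[], float] = time.monotonic,
) -> Any:
    """Return user data from memory, calling load_user only on a miss or expiry."""
    current = now()
    entry = _CACHE.get(user_id)
    if entry is not None and current - entry[0] < CACHE_TTL_SECONDS:
        return entry[1]
    value = load_user(user_id)  # for example, a read from a DynamoDB table
    _CACHE[user_id] = (current, value)
    return value


# A stand-in loader that counts how often the "database" is read.
calls = []


def fake_loader(user_id: str) -> dict:
    calls.append(user_id)
    return {"id": user_id}


get_user("u1", fake_loader)
get_user("u1", fake_loader)
print(len(calls))  # 1: the second call is served from memory
```

The trade-off is staleness: data can be up to CACHE_TTL_SECONDS old, so a short lifetime is usually the safer default for user data that changes often.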

I save and exit, and even if I answered just one question, I already get some feedback from the tool. From the workload overview, I select the Serverless Lens. There, I notice that I have a high risk that I need to mitigate.

Just below, I have a suggestion on how to address the risk, including specific recommendations based on the question raising the risk. For a serverless application, it is important to balance performance and costs, using the right capacity unit that is automatically scaled by the platform.

I click on the first recommendation, and I receive specific action items for my improvement plan. This is covering the different architectural components I can use in my serverless apps, such as Lambda functions, DynamoDB tables, or API Gateway endpoints. In my case, I am going to follow the suggestion to use the Lambda Power Tuning open-source tool to fine-tune the memory/power configuration of my Lambda functions.

Before working on my improvement plan, I go on and answer all questions. I can now see the full report in the AWS console, or download it in PDF format to share it with other stakeholders. In this way, we can work together to plan the necessary improvements and have a successful serverless app.

Once we have made the improvements, I can go back and mark the correct answers to remove the high risk issue. Great architectures come as a result of multiple iterations.

Available Now
The Serverless Lens is available today in all regions where the Well-Architected Tool is offered, as described in the AWS Region Table. It can be applied to existing workloads, or used for new workloads you define in the tool.

There is no cost for using the AWS Well-Architected Tool. You can use it to improve the application you are working on, or to get visibility into multiple workloads used by the department or area you are working in.

As a CIO/CTO, you can use it as a dashboard describing the status of all the applications you are responsible for. To make this easier, you can share a workload with another AWS account, which you can then use to have a single view across multiple applications.

Since the output of the tool is a report with risks and how to address them, you should use the tool during the overall lifecycle of your application, especially during the design and implementation phase, and not just when you are going into production, because it may be too late to implement some of the suggestions you get.

Danilo

New for Amazon EFS – IAM Authorization and Access Points

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-efs-iam-authorization-and-access-points/

When building or migrating applications, we often need to share data across multiple compute nodes. Many applications use file APIs and Amazon Elastic File System (EFS) makes it easy to use those applications on AWS, providing a scalable, fully managed Network File System (NFS) that you can access from other AWS services and on-premises resources.

EFS scales on demand from zero to petabytes with no disruptions, growing and shrinking automatically as you add and remove files, eliminating the need to provision and manage capacity. By using it, you get strong file system consistency across 3 Availability Zones. EFS performance scales with the amount of data stored, with the option to provision the throughput you need.

Last year, the EFS team focused on optimizing costs with the introduction of the EFS Infrequent Access (IA) storage class, with storage prices up to 92% lower compared to EFS Standard. You can quickly start reducing your costs by setting a Lifecycle Management policy to move files that haven’t been accessed for a certain number of days to EFS IA.

Today, we are introducing two new features that simplify managing access, sharing data sets, and protecting your EFS file systems:

  • IAM authentication and authorization for NFS Clients, to identify clients and use IAM policies to manage client-specific permissions.
  • EFS access points, to enforce the use of an operating system user and group, optionally restricting access to a directory in the file system.

Using IAM Authentication and Authorization
In the EFS console, when creating or updating an EFS file system, I can now set up a file system policy. This is an IAM resource policy, similar to bucket policies for Amazon Simple Storage Service (S3), and can be used, for example, to disable root access, enforce read-only access, or enforce in-transit encryption for all clients.

Identity-based policies, such as those used by IAM users, groups, or roles, can override these default permissions. These new features work on top of EFS’s current network-based access using security groups.

I select the option to disable root access by default, click on Set policy, and then select the JSON tab. Here, I can review the policy generated based on my settings, or create a more advanced policy, for example to grant permissions to a different AWS account or a specific IAM role.

The following actions can be used in IAM policies to manage access permissions for NFS clients:

  • ClientMount to give permission to mount a file system with read-only access
  • ClientWrite to be able to write to the file system
  • ClientRootAccess to access files as root

I look at the policy JSON. I see that I can mount and read (ClientMount) the file system, and I can write (ClientWrite) in the file system, but since I selected the option to disable root access, I don’t have ClientRootAccess permissions.

Similarly, I can attach a policy to an IAM user or role to give specific permissions. For example, I create an IAM role to give full access to this file system (including root access) with this policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "elasticfilesystem:ClientMount",
                "elasticfilesystem:ClientWrite",
                "elasticfilesystem:ClientRootAccess"
            ],
            "Resource": "arn:aws:elasticfilesystem:us-east-2:123412341234:file-system/fs-d1188b58"
        }
    ]
}

I start an Amazon Elastic Compute Cloud (EC2) instance in the same Amazon Virtual Private Cloud as the EFS file system, using Amazon Linux 2 and a security group that can connect to the file system. The EC2 instance is using the IAM role I just created.

The open source efs-utils are required to connect a client using IAM authentication, in-transit encryption, or both. Normally, on Amazon Linux 2, I would install efs-utils using yum, but the new version is still rolling out, so I am following the instructions to build the package from source in this repository. I’ll update this blog post when the updated package is available.

To mount the EFS file system, I use the mount command. To leverage in-transit encryption, I add the tls option. I am not using IAM authentication here, so the permissions I specified for the “*” principal in my file system policy apply to this connection.

$ sudo mkdir /mnt/shared
$ sudo mount -t efs -o tls fs-d1188b58 /mnt/shared

My file system policy disables root access by default, so I can’t create a new file as root.

$ sudo touch /mnt/shared/newfile
touch: cannot touch ‘/mnt/shared/newfile’: Permission denied

I now use IAM authentication adding the iam option to the mount command (tls is required for IAM authentication to work).

$ sudo mount -t efs -o iam,tls fs-d1188b58 /mnt/shared

When I use this mount option, the IAM role from my EC2 instance profile is used to connect, along with the permissions attached to that role, including root access:

$ sudo touch /mnt/shared/newfile
$ ls -la /mnt/shared/newfile
-rw-r--r-- 1 root root 0 Jan  8 09:52 /mnt/shared/newfile

Here I used the IAM role to have root access. Other common use cases are to enforce in-transit encryption (using the aws:SecureTransport condition key) or create different roles for clients needing write or read-only access.
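The two other use cases can be sketched as policy documents built in Python. This is an illustration under assumptions, not the policy generated by the console: the function names are hypothetical, and the Deny statement follows a commonly used pattern for the aws:SecureTransport condition key in a file system policy. The action names and the file system ARN are the ones used earlier in this post.

```python
import json
from typing import Dict

# File system ARN taken from the example policy earlier in this post.
FILE_SYSTEM_ARN = (
    "arn:aws:elasticfilesystem:us-east-2:123412341234:file-system/fs-d1188b58"
)


def deny_unencrypted_statement() -> Dict:
    """File system policy statement rejecting clients that do not use TLS."""
    return {
        "Effect": "Deny",
        "Principal": {"AWS": "*"},
        "Action": "*",
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }


def client_role_policy(file_system_arn: str, read_only: bool) -> Dict:
    """Identity-based policy for a client role with read-only or read/write access."""
    actions = ["elasticfilesystem:ClientMount"]  # mount with read-only access
    if not read_only:
        actions.append("elasticfilesystem:ClientWrite")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {"Effect": "Allow", "Action": actions, "Resource": file_system_arn}
        ],
    }


print(json.dumps(deny_unencrypted_statement(), indent=4))
print(json.dumps(client_role_policy(FILE_SYSTEM_ARN, read_only=True), indent=4))
```

Neither policy grants ClientRootAccess, so clients using these roles would not be able to access files as root.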

EFS IAM permission checks are logged by AWS CloudTrail to audit client access to your file system. For example, when a client mounts a file system, a NewClientConnection event is shown in my CloudTrail console.

Using EFS Access Points
EFS access points allow you to easily manage application access to NFS environments, specifying a POSIX user and group to use when accessing the file system, and restricting access to a directory within a file system.

Use cases that can benefit from EFS access points include:

  • Container-based environments, where developers build and deploy their own containers (you can also see this blog post for using EFS for container storage).
  • Data science applications that require read-only access to production data.
  • Sharing a specific directory in your file system with other AWS accounts.

In the EFS console, I create two access points for my file system, each using a different POSIX user and group:

  • /data – where I am sharing some data that must be read and updated by multiple clients.
  • /config – where I share some configuration files that must not be updated by clients using the /data access point.

I used file permissions 755 for both access points. That means that I am giving read and execute access to everyone and write access to the owner of the directory only. Permissions here are used when creating the directory. Within the directory, permissions are under full control of the user.
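To double-check what 755 means, the octal value can be decoded with Python's standard library. The helper describe_mode is a hypothetical name used only for this illustration.

```python
import stat
from typing import Dict


def describe_mode(mode: int) -> Dict[str, Dict[str, bool]]:
    """Split a permission mode into read/write/execute flags per class."""
    result = {}
    for index, name in enumerate(("owner", "group", "others")):
        # Each class uses three bits: owner is the highest, others the lowest.
        bits = (mode >> (6 - 3 * index)) & 0o7
        result[name] = {
            "read": bool(bits & 0o4),
            "write": bool(bits & 0o2),
            "execute": bool(bits & 0o1),
        }
    return result


mode = 0o755
# Render the mode as it would appear for a directory in "ls -l".
print(stat.filemode(stat.S_IFDIR | mode))  # drwxr-xr-x
print(describe_mode(mode)["others"])
```

Only the owner has the write bit set, which matches the description above.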

I mount the /data access point adding the accesspoint option to the mount command:

$ sudo mount -t efs -o tls,accesspoint=fsap-0204ce67a2208742e fs-d1188b58 /mnt/shared

I can now create a file, because I am not doing that as root, but I am automatically using the user and group ID of the access point:

$ sudo touch /mnt/shared/datafile
$ ls -la /mnt/shared/datafile
-rw-r--r-- 1 1001 1001 0 Jan  8 09:58 /mnt/shared/datafile

I mount the file system again, without specifying an access point. I see that datafile was created in the /data directory, as expected considering the access point configuration. When using the access point, I was unable to access any files that were in the root or other directories of my EFS file system.

$ sudo mount -t efs -o tls fs-d1188b58 /mnt/shared/
$ ls -la /mnt/shared/data/datafile 
-rw-r--r-- 1 1001 1001 0 Jan  8 09:58 /mnt/shared/data/datafile

To use IAM authentication with access points, I add the iam option:

$ sudo mount -t efs -o iam,tls,accesspoint=fsap-0204ce67a2208742e fs-d1188b58 /mnt/shared
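If these mounts are scripted, a small helper can assemble the option list and enforce the rule that iam requires tls. The following Python sketch only builds the argument list and does not run anything. The function name efs_mount_command and its parameters are assumptions made for this example, while the mount options themselves are the ones used in the commands above.

```python
from typing import List, Optional


def efs_mount_command(
    file_system_id: str,
    mount_point: str,
    tls: bool = True,
    iam: bool = False,
    access_point_id: Optional[str] = None,
) -> List[str]:
    """Build the argument list for mounting an EFS file system with efs-utils."""
    if iam and not tls:
        # tls is required for IAM authentication to work.
        raise ValueError("the iam option requires tls")
    options = []
    if iam:
        options.append("iam")
    if tls:
        options.append("tls")
    if access_point_id:
        options.append(f"accesspoint={access_point_id}")
    command = ["mount", "-t", "efs"]
    if options:
        command += ["-o", ",".join(options)]
    return command + [file_system_id, mount_point]


print(" ".join(efs_mount_command(
    "fs-d1188b58", "/mnt/shared", iam=True,
    access_point_id="fsap-0204ce67a2208742e",
)))
```

The returned list can be passed to subprocess.run, prefixed with sudo where needed, which avoids building shell strings by hand.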

I can restrict an IAM role to use only a specific access point by adding a Condition on the AccessPointArn to the policy:

"Condition": {
    "StringEquals": {
        "elasticfilesystem:AccessPointArn" : "arn:aws:elasticfilesystem:us-east-2:123412341234:access-point/fsap-0204ce67a2208742e"
    }
}
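To show where this Condition fits, the sketch below places it inside a complete policy statement. The function name access_point_policy is hypothetical, and the choice of actions is just an example. The ARNs are the ones used in this post.

```python
import json
from typing import Dict, List

FILE_SYSTEM_ARN = (
    "arn:aws:elasticfilesystem:us-east-2:123412341234:file-system/fs-d1188b58"
)
ACCESS_POINT_ARN = (
    "arn:aws:elasticfilesystem:us-east-2:123412341234:"
    "access-point/fsap-0204ce67a2208742e"
)


def access_point_policy(
    file_system_arn: str, access_point_arn: str, actions: List[str]
) -> Dict:
    """Allow the given actions only when the client uses one access point."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": actions,
                "Resource": file_system_arn,
                # The Condition sits at the same level as Effect and Action.
                "Condition": {
                    "StringEquals": {
                        "elasticfilesystem:AccessPointArn": access_point_arn
                    }
                },
            }
        ],
    }


policy = access_point_policy(
    FILE_SYSTEM_ARN,
    ACCESS_POINT_ARN,
    ["elasticfilesystem:ClientMount", "elasticfilesystem:ClientWrite"],
)
print(json.dumps(policy, indent=4))
```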

Using IAM authentication and EFS access points together simplifies securely sharing data for container-based architectures and multi-tenant applications, because it ensures that every application automatically gets the right operating system user and group assigned to it, optionally limiting access to a specific directory, enforcing in-transit encryption, or giving read-only access to the file system.

Available Now
IAM authorization for NFS clients and EFS access points are available in all regions where EFS is offered, as described in the AWS Region Table. There is no additional cost for using them. You can learn more about using EFS with IAM and access points in the documentation.

It’s now easier to create scalable architectures sharing data and configurations. Let me know what you are going to use these new features for!

Danilo

New – Amazon Comprehend Medical Adds Ontology Linking

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-amazon-comprehend-medical-adds-ontology-linking/

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights in unstructured text. It is very easy to use, with no machine learning experience required. You can customize Comprehend for your specific use case, for example creating custom document classifiers to organize your documents into your own categories, or custom entity types that analyze text for your specific terms. However, medical terminology can be very complex and specific to the healthcare domain.

For this reason, last year we introduced Amazon Comprehend Medical, a HIPAA eligible natural language processing service that makes it easy to use machine learning to extract relevant medical information from unstructured text. Using Comprehend Medical, you can quickly and accurately gather information, such as medical condition, medication, dosage, strength, and frequency from a variety of sources like doctors’ notes, clinical trial reports, and patient health records.

Today, we are adding the capability of linking the information extracted by Comprehend Medical to medical ontologies.

An ontology provides a declarative model of a domain that defines and represents the concepts existing in that domain, their attributes, and the relationships between them. It is typically represented as a knowledge base, and made available to applications that need to use or share knowledge. Within health informatics, an ontology is a formal description of a health-related domain.

The ontologies supported by Comprehend Medical are:

  • ICD-10-CM, to identify medical conditions as entities and link related information such as diagnosis, severity, and anatomical distinctions as attributes of that entity. This is a diagnosis code set that is very useful for population health analytics, and for getting payments from insurance companies based on medical services rendered.
  • RxNorm, to identify medications as entities and link attributes such as dose, frequency, strength, and route of administration to that entity. Healthcare providers use these concepts to enable use cases like medication reconciliation, which is the process of creating the most accurate list possible of all medications a patient is taking.

For each ontology, Comprehend Medical returns a ranked list of potential matches. You can use confidence scores to decide which matches make sense, or what might need further review. Let’s see how this works with an example.

Using Ontology Linking
In the Comprehend Medical console, I start by providing some unstructured doctor’s notes as input:

At first, I use some functionalities that were already available in Comprehend Medical to detect medical and protected health information (PHI) entities.

Among the recognized entities (see this post for more info) there are some symptoms and medications. Medications are recognized as generics or brands. Let’s see how we can connect some of these entities to more specific concepts.

I use the new features to link those entities to RxNorm concepts for medications.

In the text, only the parts mentioning medications are detected. In the details of the answer, I see more information. For example, let’s look at one of the detected medications:

  • The first occurrence of the term “Clonidine” (in the second line in the input text above) is linked to the generic concept (on the left in the image below) in the RxNorm ontology.
  • The second occurrence of the term “Clonidine” (in the fourth line in the input text above) is followed by an explicit dosage, and is linked to a more prescriptive format that includes dosage (on the right in the image below) in the RxNorm ontology.

To look for medical conditions using ICD-10-CM concepts, I am giving a different input:

The idea again is to link the detected entities, like symptoms and diagnoses, to specific concepts.

As expected, diagnoses and symptoms are recognized as entities. In the detailed results, those entities are linked to medical conditions in the ICD-10-CM ontology. For example, the two main diagnoses described in the input text are the top results, and Comprehend Medical infers specific concepts in the ontology, each with its own score.

In production, you can call Comprehend Medical through its API to integrate these capabilities into your processing workflow. The screenshots above are visual renderings of the structured information that the API returns in JSON format. For example, this is the result of detecting medications (RxNorm concepts):

{
    "Entities": [
        {
            "Id": 0,
            "Text": "Clonidine",
            "Category": "MEDICATION",
            "Type": "GENERIC_NAME",
            "Score": 0.9933062195777893,
            "BeginOffset": 83,
            "EndOffset": 92,
            "Attributes": [],
            "Traits": [],
            "RxNormConcepts": [
                {
                    "Description": "Clonidine",
                    "Code": "2599",
                    "Score": 0.9148101806640625
                },
                {
                    "Description": "168 HR Clonidine 0.00417 MG/HR Transdermal System",
                    "Code": "998671",
                    "Score": 0.8215734958648682
                },
                {
                    "Description": "Clonidine Hydrochloride 0.025 MG Oral Tablet",
                    "Code": "892791",
                    "Score": 0.7519310116767883
                },
                {
                    "Description": "10 ML Clonidine Hydrochloride 0.5 MG/ML Injection",
                    "Code": "884225",
                    "Score": 0.7171697020530701
                },
                {
                    "Description": "Clonidine Hydrochloride 0.2 MG Oral Tablet",
                    "Code": "884185",
                    "Score": 0.6776907444000244
                }
            ]
        },
        {
            "Id": 1,
            "Text": "Vyvanse",
            "Category": "MEDICATION",
            "Type": "BRAND_NAME",
            "Score": 0.9995427131652832,
            "BeginOffset": 148,
            "EndOffset": 155,
            "Attributes": [
                {
                    "Type": "DOSAGE",
                    "Score": 0.9910679459571838,
                    "RelationshipScore": 0.9999822378158569,
                    "Id": 2,
                    "BeginOffset": 156,
                    "EndOffset": 162,
                    "Text": "50 mgs",
                    "Traits": []
                },
                {
                    "Type": "ROUTE_OR_MODE",
                    "Score": 0.9997182488441467,
                    "RelationshipScore": 0.9993833303451538,
                    "Id": 3,
                    "BeginOffset": 163,
                    "EndOffset": 165,
                    "Text": "po",
                    "Traits": []
                },
                {
                    "Type": "FREQUENCY",
                    "Score": 0.983681321144104,
                    "RelationshipScore": 0.9999642372131348,
                    "Id": 4,
                    "BeginOffset": 166,
                    "EndOffset": 184,
                    "Text": "at breakfast daily",
                    "Traits": []
                }
            ],
            "Traits": [],
            "RxNormConcepts": [
                {
                    "Description": "lisdexamfetamine dimesylate 50 MG Oral Capsule [Vyvanse]",
                    "Code": "854852",
                    "Score": 0.8883932828903198
                },
                {
                    "Description": "lisdexamfetamine dimesylate 50 MG Chewable Tablet [Vyvanse]",
                    "Code": "1871469",
                    "Score": 0.7482635378837585
                },
                {
                    "Description": "Vyvanse",
                    "Code": "711043",
                    "Score": 0.7041242122650146
                },
                {
                    "Description": "lisdexamfetamine dimesylate 70 MG Oral Capsule [Vyvanse]",
                    "Code": "854844",
                    "Score": 0.23675969243049622
                },
                {
                    "Description": "lisdexamfetamine dimesylate 60 MG Oral Capsule [Vyvanse]",
                    "Code": "854848",
                    "Score": 0.14077001810073853
                }
            ]
        },
        {
            "Id": 5,
            "Text": "Clonidine",
            "Category": "MEDICATION",
            "Type": "GENERIC_NAME",
            "Score": 0.9982216954231262,
            "BeginOffset": 199,
            "EndOffset": 208,
            "Attributes": [
                {
                    "Type": "STRENGTH",
                    "Score": 0.7696017026901245,
                    "RelationshipScore": 0.9999960660934448,
                    "Id": 6,
                    "BeginOffset": 209,
                    "EndOffset": 216,
                    "Text": "0.2 mgs",
                    "Traits": []
                },
                {
                    "Type": "DOSAGE",
                    "Score": 0.777644693851471,
                    "RelationshipScore": 0.9999927282333374,
                    "Id": 7,
                    "BeginOffset": 220,
                    "EndOffset": 236,
                    "Text": "1 and 1 / 2 tabs",
                    "Traits": []
                },
                {
                    "Type": "ROUTE_OR_MODE",
                    "Score": 0.9981689453125,
                    "RelationshipScore": 0.999950647354126,
                    "Id": 8,
                    "BeginOffset": 237,
                    "EndOffset": 239,
                    "Text": "po",
                    "Traits": []
                },
                {
                    "Type": "FREQUENCY",
                    "Score": 0.99753737449646,
                    "RelationshipScore": 0.9999889135360718,
                    "Id": 9,
                    "BeginOffset": 240,
                    "EndOffset": 243,
                    "Text": "qhs",
                    "Traits": []
                }
            ],
            "Traits": [],
            "RxNormConcepts": [
                {
                    "Description": "Clonidine Hydrochloride 0.2 MG Oral Tablet",
                    "Code": "884185",
                    "Score": 0.9600071907043457
                },
                {
                    "Description": "Clonidine Hydrochloride 0.025 MG Oral Tablet",
                    "Code": "892791",
                    "Score": 0.8955953121185303
                },
                {
                    "Description": "24 HR Clonidine Hydrochloride 0.2 MG Extended Release Oral Tablet",
                    "Code": "885880",
                    "Score": 0.8706559538841248
                },
                {
                    "Description": "12 HR Clonidine Hydrochloride 0.2 MG Extended Release Oral Tablet",
                    "Code": "1013937",
                    "Score": 0.786146879196167
                },
                {
                    "Description": "Chlorthalidone 15 MG / Clonidine Hydrochloride 0.2 MG Oral Tablet",
                    "Code": "884198",
                    "Score": 0.601354718208313
                }
            ]
        }
    ],
    "ModelVersion": "0.0.0"
}
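
A response like this can be requested and reduced to one best match per medication with a few lines of Python. The sketch below assumes the boto3 `comprehendmedical` client and its `infer_rx_norm` operation; the helper name `top_rxnorm_codes` and the shape of its return value are hypothetical, chosen for illustration. The client is passed in as a parameter so the helper can be exercised with a stub, without AWS credentials.

```python
from typing import Any, Dict, List


def top_rxnorm_codes(text: str, client: Any) -> List[Dict[str, Any]]:
    """Return the highest-scoring RxNorm concept for each detected medication.

    `client` is expected to behave like boto3.client("comprehendmedical").
    """
    response = client.infer_rx_norm(Text=text)
    results = []
    for entity in response.get("Entities", []):
        concepts = entity.get("RxNormConcepts", [])
        # Do not rely on the list order; pick the best score explicitly.
        best = max(concepts, key=lambda c: c["Score"], default=None)
        results.append(
            {
                "Text": entity["Text"],
                "Code": best["Code"] if best else None,
                "Description": best["Description"] if best else None,
                "Score": best["Score"] if best else None,
            }
        )
    return results


# Usage against the real service (requires AWS credentials and a region):
#
#   import boto3
#   client = boto3.client("comprehendmedical")
#   for match in top_rxnorm_codes(doctor_notes, client):
#       print(match["Text"], match["Code"], match["Score"])
```

Applied to the output above, this would keep code 2599 for the first mention of Clonidine and code 884185 for the second, which matches what the console shows. Entities without any linked concept are kept with a `None` code so that they can still be routed to manual review.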

Similarly, this is the output when detecting medical conditions (ICD-10-CM concepts):

{
    "Entities": [
        {
            "Id": 0,
            "Text": "coronary artery disease",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9933860898017883,
            "BeginOffset": 90,
            "EndOffset": 113,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "DIAGNOSIS",
                    "Score": 0.9682672023773193
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Atherosclerotic heart disease of native coronary artery without angina pectoris",
                    "Code": "I25.10",
                    "Score": 0.8199513554573059
                },
                {
                    "Description": "Atherosclerotic heart disease of native coronary artery",
                    "Code": "I25.1",
                    "Score": 0.4950370192527771
                },
                {
                    "Description": "Old myocardial infarction",
                    "Code": "I25.2",
                    "Score": 0.18753206729888916
                },
                {
                    "Description": "Atherosclerotic heart disease of native coronary artery with unstable angina pectoris",
                    "Code": "I25.110",
                    "Score": 0.16535982489585876
                },
                {
                    "Description": "Atherosclerotic heart disease of native coronary artery with unspecified angina pectoris",
                    "Code": "I25.119",
                    "Score": 0.15222692489624023
                }
            ]
        },
        {
            "Id": 2,
            "Text": "atrial fibrillation",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9923409223556519,
            "BeginOffset": 116,
            "EndOffset": 135,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "DIAGNOSIS",
                    "Score": 0.9708861708641052
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Unspecified atrial fibrillation",
                    "Code": "I48.91",
                    "Score": 0.7011875510215759
                },
                {
                    "Description": "Chronic atrial fibrillation",
                    "Code": "I48.2",
                    "Score": 0.28612759709358215
                },
                {
                    "Description": "Paroxysmal atrial fibrillation",
                    "Code": "I48.0",
                    "Score": 0.21157972514629364
                },
                {
                    "Description": "Persistent atrial fibrillation",
                    "Code": "I48.1",
                    "Score": 0.16996538639068604
                },
                {
                    "Description": "Atrial premature depolarization",
                    "Code": "I49.1",
                    "Score": 0.16715925931930542
                }
            ]
        },
        {
            "Id": 3,
            "Text": "hypertension",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9993137121200562,
            "BeginOffset": 138,
            "EndOffset": 150,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "DIAGNOSIS",
                    "Score": 0.9734011888504028
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Essential (primary) hypertension",
                    "Code": "I10",
                    "Score": 0.6827990412712097
                },
                {
                    "Description": "Hypertensive heart disease without heart failure",
                    "Code": "I11.9",
                    "Score": 0.09846580773591995
                },
                {
                    "Description": "Hypertensive heart disease with heart failure",
                    "Code": "I11.0",
                    "Score": 0.09182810038328171
                },
                {
                    "Description": "Pulmonary hypertension, unspecified",
                    "Code": "I27.20",
                    "Score": 0.0866364985704422
                },
                {
                    "Description": "Primary pulmonary hypertension",
                    "Code": "I27.0",
                    "Score": 0.07662317156791687
                }
            ]
        },
        {
            "Id": 4,
            "Text": "hyperlipidemia",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9998835325241089,
            "BeginOffset": 153,
            "EndOffset": 167,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "DIAGNOSIS",
                    "Score": 0.9702492356300354
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Hyperlipidemia, unspecified",
                    "Code": "E78.5",
                    "Score": 0.8378056883811951
                },
                {
                    "Description": "Disorders of lipoprotein metabolism and other lipidemias",
                    "Code": "E78",
                    "Score": 0.20186281204223633
                },
                {
                    "Description": "Lipid storage disorder, unspecified",
                    "Code": "E75.6",
                    "Score": 0.18514418601989746
                },
                {
                    "Description": "Pure hyperglyceridemia",
                    "Code": "E78.1",
                    "Score": 0.1438658982515335
                },
                {
                    "Description": "Other hyperlipidemia",
                    "Code": "E78.49",
                    "Score": 0.13983778655529022
                }
            ]
        },
        {
            "Id": 5,
            "Text": "chills",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9989762306213379,
            "BeginOffset": 211,
            "EndOffset": 217,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "SYMPTOM",
                    "Score": 0.9510533213615417
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Chills (without fever)",
                    "Code": "R68.83",
                    "Score": 0.7460958361625671
                },
                {
                    "Description": "Fever, unspecified",
                    "Code": "R50.9",
                    "Score": 0.11848161369562149
                },
                {
                    "Description": "Typhus fever, unspecified",
                    "Code": "A75.9",
                    "Score": 0.07497859001159668
                },
                {
                    "Description": "Neutropenia, unspecified",
                    "Code": "D70.9",
                    "Score": 0.07332006841897964
                },
                {
                    "Description": "Lassa fever",
                    "Code": "A96.2",
                    "Score": 0.0721040666103363
                }
            ]
        },
        {
            "Id": 6,
            "Text": "nausea",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9993392825126648,
            "BeginOffset": 220,
            "EndOffset": 226,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "SYMPTOM",
                    "Score": 0.9175007939338684
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Nausea",
                    "Code": "R11.0",
                    "Score": 0.7333012819290161
                },
                {
                    "Description": "Nausea with vomiting, unspecified",
                    "Code": "R11.2",
                    "Score": 0.20183530449867249
                },
                {
                    "Description": "Hematemesis",
                    "Code": "K92.0",
                    "Score": 0.1203150525689125
                },
                {
                    "Description": "Vomiting, unspecified",
                    "Code": "R11.10",
                    "Score": 0.11658868193626404
                },
                {
                    "Description": "Nausea and vomiting",
                    "Code": "R11",
                    "Score": 0.11535880714654922
                }
            ]
        },
        {
            "Id": 8,
            "Text": "flank pain",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9315784573554993,
            "BeginOffset": 235,
            "EndOffset": 245,
            "Attributes": [
                {
                    "Type": "ACUITY",
                    "Score": 0.9809532761573792,
                    "RelationshipScore": 0.9999837875366211,
                    "Id": 7,
                    "BeginOffset": 229,
                    "EndOffset": 234,
                    "Text": "acute",
                    "Traits": []
                }
            ],
            "Traits": [
                {
                    "Name": "SYMPTOM",
                    "Score": 0.8182812929153442
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Unspecified abdominal pain",
                    "Code": "R10.9",
                    "Score": 0.4959934949874878
                },
                {
                    "Description": "Generalized abdominal pain",
                    "Code": "R10.84",
                    "Score": 0.12332479655742645
                },
                {
                    "Description": "Lower abdominal pain, unspecified",
                    "Code": "R10.30",
                    "Score": 0.08319114148616791
                },
                {
                    "Description": "Upper abdominal pain, unspecified",
                    "Code": "R10.10",
                    "Score": 0.08275411278009415
                },
                {
                    "Description": "Jaw pain",
                    "Code": "R68.84",
                    "Score": 0.07797083258628845
                }
            ]
        },
        {
            "Id": 10,
            "Text": "numbness",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9659366011619568,
            "BeginOffset": 255,
            "EndOffset": 263,
            "Attributes": [
                {
                    "Type": "SYSTEM_ORGAN_SITE",
                    "Score": 0.9976192116737366,
                    "RelationshipScore": 0.9999089241027832,
                    "Id": 11,
                    "BeginOffset": 271,
                    "EndOffset": 274,
                    "Text": "leg",
                    "Traits": []
                }
            ],
            "Traits": [
                {
                    "Name": "SYMPTOM",
                    "Score": 0.7310190796852112
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Anesthesia of skin",
                    "Code": "R20.0",
                    "Score": 0.767346203327179
                },
                {
                    "Description": "Paresthesia of skin",
                    "Code": "R20.2",
                    "Score": 0.13602739572525024
                },
                {
                    "Description": "Other complications of anesthesia",
                    "Code": "T88.59",
                    "Score": 0.09990577399730682
                },
                {
                    "Description": "Hypothermia following anesthesia",
                    "Code": "T88.51",
                    "Score": 0.09953102469444275
                },
                {
                    "Description": "Disorder of the skin and subcutaneous tissue, unspecified",
                    "Code": "L98.9",
                    "Score": 0.08736388385295868
                }
            ]
        }
    ],
    "ModelVersion": "0.0.0"
}
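
To use a response like this in a workflow, one option is to group the detected conditions by trait and keep the top ICD-10-CM code for each. The following is a rough sketch under a few assumptions: the helper name `group_by_trait` is hypothetical, an entity with several traits is filed under its highest-scoring one, and entities with no trait go under an `UNSPECIFIED` label that I made up for this example. The sample input is abbreviated from the output above.

```python
from collections import defaultdict
from typing import Any, Dict, List, Tuple


def group_by_trait(response: Dict[str, Any]) -> Dict[str, List[Tuple[str, str]]]:
    """Map each trait name to a list of (entity text, top ICD-10-CM code)."""
    grouped: Dict[str, List[Tuple[str, str]]] = defaultdict(list)
    for entity in response.get("Entities", []):
        concepts = entity.get("ICD10CMConcepts", [])
        if not concepts:
            continue  # nothing was linked for this entity
        best = max(concepts, key=lambda c: c["Score"])
        traits = entity.get("Traits", [])
        # Simplification: file the entity under its highest-scoring trait.
        if traits:
            trait = max(traits, key=lambda t: t["Score"])["Name"]
        else:
            trait = "UNSPECIFIED"  # hypothetical label, not an API value
        grouped[trait].append((entity["Text"], best["Code"]))
    return dict(grouped)


# Abbreviated from the ICD-10-CM output shown above.
sample_response = {
    "Entities": [
        {
            "Text": "coronary artery disease",
            "Traits": [{"Name": "DIAGNOSIS", "Score": 0.9682672023773193}],
            "ICD10CMConcepts": [
                {"Code": "I25.10", "Score": 0.8199513554573059},
                {"Code": "I25.1", "Score": 0.4950370192527771},
            ],
        },
        {
            "Text": "chills",
            "Traits": [{"Name": "SYMPTOM", "Score": 0.9510533213615417}],
            "ICD10CMConcepts": [
                {"Code": "R68.83", "Score": 0.7460958361625671},
                {"Code": "R50.9", "Score": 0.11848161369562149},
            ],
        },
    ]
}

grouped = group_by_trait(sample_response)
```

Here `grouped` separates the diagnosis (coronary artery disease, I25.10) from the symptom (chills, R68.83). Note that some top matches in the full output have modest scores, such as 0.50 for "flank pain", so combining this grouping with a confidence check is probably a good idea before acting on the codes.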

Available Now
You can use Amazon Comprehend Medical via the console, AWS Command Line Interface (CLI), or AWS SDKs. With Comprehend Medical, you pay only for what you use. You are charged based on the amount of text processed on a monthly basis, depending on the features you use. For more information, please see the Comprehend Medical section in the Comprehend Pricing page. Ontology Linking is available in all regions where Amazon Comprehend Medical is offered, as described in the AWS Regions Table.

The new ontology linking APIs make it easy to detect medications and medical conditions in unstructured clinical text and link them to RxNorm and ICD-10-CM codes respectively. This new feature can help you reduce the cost, time and effort of processing large amounts of unstructured medical text with high accuracy.

Danilo