All posts by Danilo Poccia

New – Serverless Lens in AWS Well-Architected Tool

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-serverless-lens-in-aws-well-architected-tool/

When you build and run applications in the cloud, how often do you ask yourself “am I doing this right?”

This is a very good question. To help you find a good answer, in 2015 we publicly released the AWS Well-Architected Framework, a formal approach to compare your workload against our best practices and get guidance on how to improve. Today, the Well-Architected Framework gives customers and partners a consistent way to design and evaluate cloud architectures, and is based on five pillars:

  • Operational Excellence
  • Security
  • Reliability
  • Performance Efficiency
  • Cost Optimization

To provide more workload-specific advice, in 2017 we extended the framework with the concept of “lens” to go beyond a general perspective, and enter specific technology domains. Currently, there are three lenses that you can use:

  • Serverless
  • High Performance Computing (HPC)
  • IoT (Internet of Things)

The first step in improving something is deciding what to measure and how. To let you review your workloads in a more structured way, in 2018 we launched the AWS Well-Architected Tool, a free tool available in the AWS Management Console, where you can define your workload and answer questions regarding the five pillars.

You can use the Well-Architected Tool in different ways. For example:

  • If you’re working on a specific application, you can use the tool to assess risks and find areas for improvement.
  • If you’re responsible for multiple applications, you can use the tool to get visibility on the current status for all of them.

Today, I am happy to announce that we added the ability to apply lenses to the Well-Architected Tool, and the first one to be available is the Serverless Lens!

Using the Serverless Lens in AWS Well-Architected Tool
In the Well-Architected Tool console, I start by defining my workload. I am currently building the backend for a mobile app using the Amplify Framework. It’ll be a simple game, but I am going to use DynamoDB Global Tables to store data for my users, and the application will be running in two AWS Regions. Adding the AWS account IDs is optional, but can be useful to understand the application deployment in a multi-account setup.

Now, I can choose which lenses to apply. The AWS Well-Architected Framework is there by default. I select the Serverless Lens. This adds a set of additional questions that help me understand how to design, deploy, and architect my serverless app following the framework best practices.

When the workload is defined, I start my review. I jump straight to the Serverless Lens. The new questions are distributed across the five pillars. For example, one of my favorite questions is around performance:

For each question, there are resources on the right side of the console that help me understand the possible answers and the terminology used. I select the activities and the technology choices that are part of my implementation, specifically:

  • I am using data streams (like those provided by Amazon Kinesis, or DynamoDB Streams) and asynchronous function invocations to improve concurrency.
  • I am caching user data in memory to reduce database accesses. I could also use the /tmp of the Lambda functions, or external data stores like Amazon ElastiCache.
  • I am removing functions when a service integration can natively do the job, for example when I need to call Kinesis Data Firehose from the Amazon API Gateway (this is optimizing my costs, too).
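The in-memory caching choice in the list above can be sketched as follows. This is a minimal illustration, not code from my game backend: `fetch_user_from_db`, the `event["user_id"]` input shape, and the 60-second freshness window are all assumptions made for the example. The idea is that module-level state survives between invocations that reuse the same execution environment, so repeated reads for the same user can skip the database.

```python
import time

# Module-level state survives across invocations that reuse the same
# execution environment, so it can serve as a small in-memory cache.
_CACHE = {}
CACHE_TTL_SECONDS = 60  # assumed freshness window, tune for your data


def fetch_user_from_db(user_id):
    """Hypothetical placeholder for a real database read (for example, DynamoDB)."""
    fetch_user_from_db.calls += 1
    return {"user_id": user_id, "score": 0}


fetch_user_from_db.calls = 0  # counter used only to show how often the "database" is hit


def get_user(user_id, now=None):
    """Return the user record, reading the database only on a miss or after expiry."""
    now = time.time() if now is None else now
    entry = _CACHE.get(user_id)
    if entry is not None and now - entry["stored_at"] < CACHE_TTL_SECONDS:
        return entry["value"]
    value = fetch_user_from_db(user_id)
    _CACHE[user_id] = {"value": value, "stored_at": now}
    return value


def handler(event, context):
    """Lambda-style entry point; event["user_id"] is an assumed input shape."""
    return get_user(event["user_id"])
```

A cache like this is local to each execution environment, so it suits data that can be slightly stale. For data shared across environments, an external store such as Amazon ElastiCache is the better fit.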

I save and exit, and even if I answered just one question, I already get some feedback from the tool. From the workload overview, I select the Serverless Lens. There, I notice that I have a high risk that I need to mitigate.

Just below, I have a suggestion on how to address the risk, including specific recommendations based on the question raising the risk. For a serverless application, it is important to balance performance and costs, using the right capacity unit that is automatically scaled by the platform.

I click on the first recommendation, and I receive specific action items for my improvement plan. These cover the different architectural components I can use in my serverless apps, such as Lambda functions, DynamoDB tables, or API Gateway endpoints. In my case, I am going to follow the suggestion to use the Lambda Power Tuning open-source tool to fine-tune the memory/power configuration of my Lambda functions.
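To make the memory/cost trade-off concrete, here is a rough sketch of the arithmetic involved. It is not how the Lambda Power Tuning tool is implemented. It only assumes that an invocation is billed on allocated memory multiplied by duration, and the measurements and the unit price below are illustrative placeholders, not real data or real pricing.

```python
def invocation_cost(memory_mb, duration_ms, price_per_gb_second):
    """Cost of one invocation: allocated GB times billed seconds times unit price."""
    return (memory_mb / 1024) * (duration_ms / 1000) * price_per_gb_second


def cheapest_configuration(measurements, price_per_gb_second):
    """Pick the (memory_mb, duration_ms) pair with the lowest cost per invocation."""
    return min(
        measurements,
        key=lambda m: invocation_cost(m[0], m[1], price_per_gb_second),
    )


# Illustrative numbers only: (memory in MB, average duration in ms).
sample_measurements = [(128, 2000), (512, 400), (1024, 250)]
sample_price = 0.00002  # placeholder unit price per GB-second
best = cheapest_configuration(sample_measurements, sample_price)
print(best)
```

With these placeholder numbers, the middle configuration wins because the extra memory shortens the run enough to offset its higher price per second. More memory is not always more expensive, which is why measuring is worthwhile.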

Before working on my improvement plan, I go on and answer all questions. I can now see the full report in the AWS console, or download it in PDF format to share it with other stakeholders. In this way, we can work together to plan the necessary improvements and have a successful serverless app.

Once we have made the improvements, I can go back and mark the correct answers to remove the high risk issue. Great architectures come as a result of multiple iterations.

Available Now
The Serverless Lens is available today in all regions where the Well-Architected Tool is offered, as described in the AWS Region Table. It can be applied to existing workloads, or used for new workloads you define in the tool.

There is no cost for using the AWS Well-Architected Tool. You can use it to improve the application you are working on, or to get visibility into multiple workloads used by the department or area you are working in.

As a CIO/CTO, you can use it as a dashboard describing the status of all the applications you are responsible for. To make this easier, you can share workloads with another AWS account, and use that account to have a single view across multiple applications.

Since the output of the tool is a report with risks and how to address them, you should use the tool throughout the lifecycle of your application, especially during the design and implementation phases. If you wait until you are going into production, it may be too late to implement some of the suggestions you get.

Danilo

New for Amazon EFS – IAM Authorization and Access Points

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-efs-iam-authorization-and-access-points/

When building or migrating applications, we often need to share data across multiple compute nodes. Many applications use file APIs and Amazon Elastic File System (EFS) makes it easy to use those applications on AWS, providing a scalable, fully managed Network File System (NFS) that you can access from other AWS services and on-premises resources.

EFS scales on demand from zero to petabytes with no disruptions, growing and shrinking automatically as you add and remove files, eliminating the need to provision and manage capacity. By using it, you get strong file system consistency across 3 Availability Zones. EFS performance scales with the amount of data stored, with the option to provision the throughput you need.

Last year, the EFS team focused on optimizing costs with the introduction of the EFS Infrequent Access (IA) storage class, with storage prices up to 92% lower compared to EFS Standard. You can quickly start reducing your costs by setting a Lifecycle Management policy to move to EFS IA the files that haven’t been accessed for a certain number of days.

Today, we are introducing two new features that simplify managing access, sharing data sets, and protecting your EFS file systems:

  • IAM authentication and authorization for NFS Clients, to identify clients and use IAM policies to manage client-specific permissions.
  • EFS access points, to enforce the use of an operating system user and group, optionally restricting access to a directory in the file system.

Using IAM Authentication and Authorization
In the EFS console, when creating or updating an EFS file system, I can now set up a file system policy. This is an IAM resource policy, similar to bucket policies for Amazon Simple Storage Service (S3), and can be used, for example, to disable root access, enforce read-only access, or enforce in-transit encryption for all clients.

Identity-based policies, such as those used by IAM users, groups, or roles, can override these default permissions. These new features work on top of EFS’s current network-based access using security groups.

I select the option to disable root access by default, click on Set policy, and then select the JSON tab. Here, I can review the policy generated based on my settings, or create a more advanced policy, for example to grant permissions to a different AWS account or a specific IAM role.

The following actions can be used in IAM policies to manage access permissions for NFS clients:

  • ClientMount to give permission to mount a file system with read-only access
  • ClientWrite to be able to write to the file system
  • ClientRootAccess to access files as root

I look at the policy JSON. I see that I can mount and read (ClientMount) the file system, and I can write (ClientWrite) in the file system, but since I selected the option to disable root access, I don’t have ClientRootAccess permissions.
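As a rough sketch, a file system policy with these permissions could be assembled like this. The exact JSON generated by the console is not reproduced in this post, so treat the statement shape, in particular the `"*"` principal and the absence of extra fields such as an Id or Sid, as an assumption. The `file_system_policy` helper is a name I made up for the example.

```python
import json


def file_system_policy(allow_root=False):
    """Build a resource policy granting every client mount and write access."""
    actions = ["elasticfilesystem:ClientMount", "elasticfilesystem:ClientWrite"]
    if allow_root:
        # Root access is granted only when explicitly requested.
        actions.append("elasticfilesystem:ClientRootAccess")
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": "*"},  # assumed: applies to all clients
                "Action": actions,
            }
        ],
    }


print(json.dumps(file_system_policy(), indent=4))
```

Leaving `ClientRootAccess` out of the allowed actions is what disables root access by default.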

Similarly, I can attach a policy to an IAM user or role to give specific permissions. For example, I create an IAM role to give full access to this file system (including root access) with this policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "elasticfilesystem:ClientMount",
                "elasticfilesystem:ClientWrite",
                "elasticfilesystem:ClientRootAccess"
            ],
            "Resource": "arn:aws:elasticfilesystem:us-east-2:123412341234:file-system/fs-d1188b58"
        }
    ]
}

I start an Amazon Elastic Compute Cloud (EC2) instance in the same Amazon Virtual Private Cloud as the EFS file system, using Amazon Linux 2 and a security group that can connect to the file system. The EC2 instance is using the IAM role I just created.

The open source efs-utils are required to connect a client using IAM authentication, in-transit encryption, or both. Normally, on Amazon Linux 2, I would install efs-utils using yum, but the new version is still rolling out, so I am following the instructions to build the package from source in this repository. I’ll update this blog post when the updated package is available.

To mount the EFS file system, I use the mount command. To leverage in-transit encryption, I add the tls option. I am not using IAM authentication here, so the permissions I specified for the “*” principal in my file system policy apply to this connection.

$ sudo mkdir /mnt/shared
$ sudo mount -t efs -o tls fs-d1188b58 /mnt/shared

My file system policy disables root access by default, so I can’t create a new file as root.

$ sudo touch /mnt/shared/newfile
touch: cannot touch ‘/mnt/shared/newfile’: Permission denied

I now use IAM authentication by adding the iam option to the mount command (tls is required for IAM authentication to work).

$ sudo mount -t efs -o iam,tls fs-d1188b58 /mnt/shared

When I use this mount option, the IAM role from my EC2 instance profile is used to connect, along with the permissions attached to that role, including root access:

$ sudo touch /mnt/shared/newfile
$ ls -la /mnt/shared/newfile
-rw-r--r-- 1 root root 0 Jan  8 09:52 /mnt/shared/newfile

Here I used the IAM role to have root access. Other common use cases are to enforce in-transit encryption (using the aws:SecureTransport condition key) or create different roles for clients needing write or read-only access.
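For the in-transit encryption use case, one common pattern is a deny statement that matches any request where `aws:SecureTransport` is false. The sketch below appends such a statement to an existing policy document. The helper names are made up for this example, and the statement shape is my assumption of a reasonable policy rather than something copied from the console.

```python
def deny_unencrypted_clients():
    """Statement that rejects any request not sent over TLS."""
    return {
        "Effect": "Deny",
        "Principal": {"AWS": "*"},
        "Action": "*",
        "Condition": {"Bool": {"aws:SecureTransport": "false"}},
    }


def with_encryption_enforced(policy):
    """Return a copy of `policy` with the deny statement appended."""
    statements = list(policy.get("Statement", []))
    statements.append(deny_unencrypted_clients())
    return {**policy, "Statement": statements}


base_policy = {"Version": "2012-10-17", "Statement": []}
enforced_policy = with_encryption_enforced(base_policy)
```

Because an explicit deny takes precedence over any allow, clients that mount without the tls option are rejected regardless of the other statements in the policy.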

EFS IAM permission checks are logged by AWS CloudTrail to audit client access to your file system. For example, when a client mounts a file system, a NewClientConnection event is shown in my CloudTrail console.

Using EFS Access Points
EFS access points allow you to easily manage application access to NFS environments, specifying a POSIX user and group to use when accessing the file system, and restricting access to a directory within a file system.

Use cases that can benefit from EFS access points include:

  • Container-based environments, where developers build and deploy their own containers (you can also see this blog post for using EFS for container storage).
  • Data science applications, that require read-only access to production data.
  • Sharing a specific directory in your file system with other AWS accounts.

In the EFS console, I create two access points for my file system, each using a different POSIX user and group:

  • /data – where I am sharing some data that must be read and updated by multiple clients.
  • /config – where I share some configuration files that must not be updated by clients using the /data access point.

I used file permissions 755 for both access points. That means that I am giving read and execute access to everyone and write access to the owner of the directory only. Permissions here are used when creating the directory. Within the directory, permissions are under full control of the user.
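If the octal notation is unfamiliar, Python's standard `stat` module can render it the way `ls -l` would show a directory. This is only a small illustration of what 755 means. The `describe_directory_mode` helper is a name I chose for the example.

```python
import stat


def describe_directory_mode(mode):
    """Render an octal permission value the way `ls -l` shows a directory."""
    return stat.filemode(stat.S_IFDIR | mode)


print(describe_directory_mode(0o755))  # drwxr-xr-x
```

Reading the result left to right after the leading `d`: the owner gets read, write, and execute, while the group and everyone else get read and execute only.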

I mount the /data access point adding the accesspoint option to the mount command:

$ sudo mount -t efs -o tls,accesspoint=fsap-0204ce67a2208742e fs-d1188b58 /mnt/shared

I can now create a file, because I am not doing that as root, but I am automatically using the user and group ID of the access point:

$ sudo touch /mnt/shared/datafile
$ ls -la /mnt/shared/datafile
-rw-r--r-- 1 1001 1001 0 Jan  8 09:58 /mnt/shared/datafile

I mount the file system again, without specifying an access point. I see that datafile was created in the /data directory, as expected considering the access point configuration. When using the access point, I was unable to access any files that were in the root or other directories of my EFS file system.

$ sudo mount -t efs -o tls fs-d1188b58 /mnt/shared/
$ ls -la /mnt/shared/data/datafile
-rw-r--r-- 1 1001 1001 0 Jan  8 09:58 /mnt/shared/data/datafile

To use IAM authentication with access points, I add the iam option:

$ sudo mount -t efs -o iam,tls,accesspoint=fsap-0204ce67a2208742e fs-d1188b58 /mnt/shared

I can restrict an IAM role to use only a specific access point by adding a Condition on the AccessPointArn to the policy:

"Condition": {
    "StringEquals": {
        "elasticfilesystem:AccessPointArn" : "arn:aws:elasticfilesystem:us-east-2:123412341234:access-point/fsap-0204ce67a2208742e"
    }
}
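To show where that Condition sits in a complete statement, here is a small sketch that merges it into an existing Allow statement. The `restrict_to_access_point` helper is a hypothetical name for this example. The action names and ARNs are the ones used earlier in this post.

```python
import json


def restrict_to_access_point(statement, access_point_arn):
    """Return a copy of an IAM statement limited to one EFS access point."""
    restricted = dict(statement)
    # Copy any existing condition operators so the input is left untouched.
    condition = {op: dict(keys) for op, keys in statement.get("Condition", {}).items()}
    string_equals = condition.setdefault("StringEquals", {})
    string_equals["elasticfilesystem:AccessPointArn"] = access_point_arn
    restricted["Condition"] = condition
    return restricted


allow_statement = {
    "Effect": "Allow",
    "Action": ["elasticfilesystem:ClientMount", "elasticfilesystem:ClientWrite"],
    "Resource": "arn:aws:elasticfilesystem:us-east-2:123412341234:file-system/fs-d1188b58",
}
restricted_statement = restrict_to_access_point(
    allow_statement,
    "arn:aws:elasticfilesystem:us-east-2:123412341234:access-point/fsap-0204ce67a2208742e",
)
print(json.dumps(restricted_statement, indent=4))
```

The Condition is a sibling of Effect, Action, and Resource inside the statement, so the role keeps its permissions but only when connecting through that access point.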

Using IAM authentication and EFS access points together simplifies securely sharing data for container-based architectures and multi-tenant applications, because it ensures that every application automatically gets the right operating system user and group assigned to it, optionally limiting access to a specific directory, enforcing in-transit encryption, or giving read-only access to the file system.

Available Now
IAM authorization for NFS clients and EFS access points are available in all regions where EFS is offered, as described in the AWS Region Table. There is no additional cost for using them. You can learn more about using EFS with IAM and access points in the documentation.

It’s now easier to create scalable architectures sharing data and configurations. Let me know what you are going to use these new features for!

Danilo

New – Amazon Comprehend Medical Adds Ontology Linking

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-amazon-comprehend-medical-adds-ontology-linking/

Amazon Comprehend is a natural language processing (NLP) service that uses machine learning to find insights in unstructured text. It is very easy to use, with no machine learning experience required. You can customize Comprehend for your specific use case, for example creating custom document classifiers to organize your documents into your own categories, or custom entity types that analyze text for your specific terms. However, medical terminology can be very complex and specific to the healthcare domain.

For this reason, last year we introduced Amazon Comprehend Medical, a HIPAA-eligible natural language processing service that makes it easy to use machine learning to extract relevant medical information from unstructured text. Using Comprehend Medical, you can quickly and accurately gather information, such as medical condition, medication, dosage, strength, and frequency from a variety of sources like doctors’ notes, clinical trial reports, and patient health records.

Today, we are adding the capability of linking the information extracted by Comprehend Medical to medical ontologies.

An ontology provides a declarative model of a domain that defines and represents the concepts existing in that domain, their attributes, and the relationships between them. It is typically represented as a knowledge base, and made available to applications that need to use or share knowledge. Within health informatics, an ontology is a formal description of a health-related domain.

The ontologies supported by Comprehend Medical are:

  • ICD-10-CM, to identify medical conditions as entities and link related information such as diagnosis, severity, and anatomical distinctions as attributes of that entity. This is a diagnosis code set that is very useful for population health analytics, and for getting payments from insurance companies based on medical services rendered.
  • RxNorm, to identify medications as entities and link attributes such as dose, frequency, strength, and route of administration to that entity. Healthcare providers use these concepts to enable use cases like medication reconciliation, which is the process of creating the most accurate list possible of all medications a patient is taking.

For each ontology, Comprehend Medical returns a ranked list of potential matches. You can use confidence scores to decide which matches make sense, or what might need further review. Let’s see how this works with an example.
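One way to act on those confidence scores is to split the ranked matches into ones that are accepted automatically and ones that are sent for review. The sketch below assumes each match is a dictionary with `Description`, `Code`, and `Score` fields, as in the API responses shown later in this post. The 0.8 threshold and the `triage_concepts` helper are arbitrary choices for illustration.

```python
def triage_concepts(concepts, accept_threshold=0.8):
    """Split ranked concepts into accepted matches and ones needing review."""
    ranked = sorted(concepts, key=lambda c: c["Score"], reverse=True)
    accepted = [c for c in ranked if c["Score"] >= accept_threshold]
    needs_review = [c for c in ranked if c["Score"] < accept_threshold]
    return accepted, needs_review


# Three of the RxNorm matches returned for "Clonidine" in the sample response.
clonidine_concepts = [
    {"Description": "Clonidine", "Code": "2599", "Score": 0.9148101806640625},
    {
        "Description": "168 HR Clonidine 0.00417 MG/HR Transdermal System",
        "Code": "998671",
        "Score": 0.8215734958648682,
    },
    {
        "Description": "Clonidine Hydrochloride 0.025 MG Oral Tablet",
        "Code": "892791",
        "Score": 0.7519310116767883,
    },
]
accepted, needs_review = triage_concepts(clonidine_concepts)
```

The right threshold depends on the use case. A billing workflow might route anything below a high score to a human coder, while an analytics pipeline could tolerate lower scores.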

Using Ontology Linking
In the Comprehend Medical console, I start by providing some unstructured doctor’s notes as input:

First, I use some features that were already available in Comprehend Medical to detect medical and protected health information (PHI) entities.

Among the recognized entities (see this post for more info) there are some symptoms and medications. Medications are recognized as generics or brands. Let’s see how we can connect some of these entities to more specific concepts.

I use the new features to link those entities to RxNorm concepts for medications.

In the text, only the parts mentioning medications are detected. In the details of the answer, I see more information. For example, let’s look at one of the detected medications:

  • The first occurrence of the term “Clonidine” (in the second line in the input text above) is linked to the generic concept (on the left in the image below) in the RxNorm ontology.
  • The second occurrence of the term “Clonidine” (in the fourth line in the input text above) is followed by an explicit dosage, and is linked to a more prescriptive format that includes dosage (on the right in the image below) in the RxNorm ontology.

To look for medical conditions using ICD-10-CM concepts, I am giving a different input:

The idea again is to link the detected entities, like symptoms and diagnoses, to specific concepts.

As expected, diagnoses and symptoms are recognized as entities. In the detailed results those entities are linked to the medical conditions in the ICD-10-CM ontology. For example, the two main diagnoses described in the input text are the top results, and specific concepts in the ontology are inferred by Comprehend Medical, each with its own score.

In production, you can use Comprehend Medical via API, to integrate these functionalities with your processing workflow. All the screenshots above render visually the structured information returned by the API in JSON format. For example, this is the result of detecting medications (RxNorm concepts):

{
    "Entities": [
        {
            "Id": 0,
            "Text": "Clonidine",
            "Category": "MEDICATION",
            "Type": "GENERIC_NAME",
            "Score": 0.9933062195777893,
            "BeginOffset": 83,
            "EndOffset": 92,
            "Attributes": [],
            "Traits": [],
            "RxNormConcepts": [
                {
                    "Description": "Clonidine",
                    "Code": "2599",
                    "Score": 0.9148101806640625
                },
                {
                    "Description": "168 HR Clonidine 0.00417 MG/HR Transdermal System",
                    "Code": "998671",
                    "Score": 0.8215734958648682
                },
                {
                    "Description": "Clonidine Hydrochloride 0.025 MG Oral Tablet",
                    "Code": "892791",
                    "Score": 0.7519310116767883
                },
                {
                    "Description": "10 ML Clonidine Hydrochloride 0.5 MG/ML Injection",
                    "Code": "884225",
                    "Score": 0.7171697020530701
                },
                {
                    "Description": "Clonidine Hydrochloride 0.2 MG Oral Tablet",
                    "Code": "884185",
                    "Score": 0.6776907444000244
                }
            ]
        },
        {
            "Id": 1,
            "Text": "Vyvanse",
            "Category": "MEDICATION",
            "Type": "BRAND_NAME",
            "Score": 0.9995427131652832,
            "BeginOffset": 148,
            "EndOffset": 155,
            "Attributes": [
                {
                    "Type": "DOSAGE",
                    "Score": 0.9910679459571838,
                    "RelationshipScore": 0.9999822378158569,
                    "Id": 2,
                    "BeginOffset": 156,
                    "EndOffset": 162,
                    "Text": "50 mgs",
                    "Traits": []
                },
                {
                    "Type": "ROUTE_OR_MODE",
                    "Score": 0.9997182488441467,
                    "RelationshipScore": 0.9993833303451538,
                    "Id": 3,
                    "BeginOffset": 163,
                    "EndOffset": 165,
                    "Text": "po",
                    "Traits": []
                },
                {
                    "Type": "FREQUENCY",
                    "Score": 0.983681321144104,
                    "RelationshipScore": 0.9999642372131348,
                    "Id": 4,
                    "BeginOffset": 166,
                    "EndOffset": 184,
                    "Text": "at breakfast daily",
                    "Traits": []
                }
            ],
            "Traits": [],
            "RxNormConcepts": [
                {
                    "Description": "lisdexamfetamine dimesylate 50 MG Oral Capsule [Vyvanse]",
                    "Code": "854852",
                    "Score": 0.8883932828903198
                },
                {
                    "Description": "lisdexamfetamine dimesylate 50 MG Chewable Tablet [Vyvanse]",
                    "Code": "1871469",
                    "Score": 0.7482635378837585
                },
                {
                    "Description": "Vyvanse",
                    "Code": "711043",
                    "Score": 0.7041242122650146
                },
                {
                    "Description": "lisdexamfetamine dimesylate 70 MG Oral Capsule [Vyvanse]",
                    "Code": "854844",
                    "Score": 0.23675969243049622
                },
                {
                    "Description": "lisdexamfetamine dimesylate 60 MG Oral Capsule [Vyvanse]",
                    "Code": "854848",
                    "Score": 0.14077001810073853
                }
            ]
        },
        {
            "Id": 5,
            "Text": "Clonidine",
            "Category": "MEDICATION",
            "Type": "GENERIC_NAME",
            "Score": 0.9982216954231262,
            "BeginOffset": 199,
            "EndOffset": 208,
            "Attributes": [
                {
                    "Type": "STRENGTH",
                    "Score": 0.7696017026901245,
                    "RelationshipScore": 0.9999960660934448,
                    "Id": 6,
                    "BeginOffset": 209,
                    "EndOffset": 216,
                    "Text": "0.2 mgs",
                    "Traits": []
                },
                {
                    "Type": "DOSAGE",
                    "Score": 0.777644693851471,
                    "RelationshipScore": 0.9999927282333374,
                    "Id": 7,
                    "BeginOffset": 220,
                    "EndOffset": 236,
                    "Text": "1 and 1 / 2 tabs",
                    "Traits": []
                },
                {
                    "Type": "ROUTE_OR_MODE",
                    "Score": 0.9981689453125,
                    "RelationshipScore": 0.999950647354126,
                    "Id": 8,
                    "BeginOffset": 237,
                    "EndOffset": 239,
                    "Text": "po",
                    "Traits": []
                },
                {
                    "Type": "FREQUENCY",
                    "Score": 0.99753737449646,
                    "RelationshipScore": 0.9999889135360718,
                    "Id": 9,
                    "BeginOffset": 240,
                    "EndOffset": 243,
                    "Text": "qhs",
                    "Traits": []
                }
            ],
            "Traits": [],
            "RxNormConcepts": [
                {
                    "Description": "Clonidine Hydrochloride 0.2 MG Oral Tablet",
                    "Code": "884185",
                    "Score": 0.9600071907043457
                },
                {
                    "Description": "Clonidine Hydrochloride 0.025 MG Oral Tablet",
                    "Code": "892791",
                    "Score": 0.8955953121185303
                },
                {
                    "Description": "24 HR Clonidine Hydrochloride 0.2 MG Extended Release Oral Tablet",
                    "Code": "885880",
                    "Score": 0.8706559538841248
                },
                {
                    "Description": "12 HR Clonidine Hydrochloride 0.2 MG Extended Release Oral Tablet",
                    "Code": "1013937",
                    "Score": 0.786146879196167
                },
                {
                    "Description": "Chlorthalidone 15 MG / Clonidine Hydrochloride 0.2 MG Oral Tablet",
                    "Code": "884198",
                    "Score": 0.601354718208313
                }
            ]
        }
    ],
    "ModelVersion": "0.0.0"
}
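As a sketch of how an application might consume a response shaped like the one above, the following flattens each medication into one line: the entity text, its attributes in reading order, and the code of the highest-scoring RxNorm concept. The `summarize_medications` helper is a made-up name, and the sample is a trimmed copy of the Vyvanse entity from the response above.

```python
def summarize_medications(response):
    """Return one line per entity: text, attributes in reading order, best concept."""
    lines = []
    for entity in response.get("Entities", []):
        # Attributes are ordered by their position in the original text.
        attributes = sorted(entity.get("Attributes", []), key=lambda a: a["BeginOffset"])
        phrase = " ".join([entity["Text"]] + [a["Text"] for a in attributes])
        concepts = entity.get("RxNormConcepts", [])
        best = max(concepts, key=lambda c: c["Score"]) if concepts else None
        code = best["Code"] if best else "unlinked"
        lines.append(f"{phrase} -> RxNorm {code}")
    return lines


sample_response = {
    "Entities": [
        {
            "Text": "Vyvanse",
            "Attributes": [
                {"Type": "FREQUENCY", "BeginOffset": 166, "Text": "at breakfast daily"},
                {"Type": "DOSAGE", "BeginOffset": 156, "Text": "50 mgs"},
                {"Type": "ROUTE_OR_MODE", "BeginOffset": 163, "Text": "po"},
            ],
            "RxNormConcepts": [
                {"Code": "854852", "Score": 0.8883932828903198},
                {"Code": "1871469", "Score": 0.7482635378837585},
            ],
        }
    ]
}
print(summarize_medications(sample_response))
```

A flattened list like this is a possible starting point for medication reconciliation, although in practice the scores should be kept alongside the codes so that low-confidence links can be reviewed.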

Similarly, this is the output when detecting medical conditions (ICD-10-CM concepts):

{
    "Entities": [
        {
            "Id": 0,
            "Text": "coronary artery disease",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9933860898017883,
            "BeginOffset": 90,
            "EndOffset": 113,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "DIAGNOSIS",
                    "Score": 0.9682672023773193
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Atherosclerotic heart disease of native coronary artery without angina pectoris",
                    "Code": "I25.10",
                    "Score": 0.8199513554573059
                },
                {
                    "Description": "Atherosclerotic heart disease of native coronary artery",
                    "Code": "I25.1",
                    "Score": 0.4950370192527771
                },
                {
                    "Description": "Old myocardial infarction",
                    "Code": "I25.2",
                    "Score": 0.18753206729888916
                },
                {
                    "Description": "Atherosclerotic heart disease of native coronary artery with unstable angina pectoris",
                    "Code": "I25.110",
                    "Score": 0.16535982489585876
                },
                {
                    "Description": "Atherosclerotic heart disease of native coronary artery with unspecified angina pectoris",
                    "Code": "I25.119",
                    "Score": 0.15222692489624023
                }
            ]
        },
        {
            "Id": 2,
            "Text": "atrial fibrillation",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9923409223556519,
            "BeginOffset": 116,
            "EndOffset": 135,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "DIAGNOSIS",
                    "Score": 0.9708861708641052
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Unspecified atrial fibrillation",
                    "Code": "I48.91",
                    "Score": 0.7011875510215759
                },
                {
                    "Description": "Chronic atrial fibrillation",
                    "Code": "I48.2",
                    "Score": 0.28612759709358215
                },
                {
                    "Description": "Paroxysmal atrial fibrillation",
                    "Code": "I48.0",
                    "Score": 0.21157972514629364
                },
                {
                    "Description": "Persistent atrial fibrillation",
                    "Code": "I48.1",
                    "Score": 0.16996538639068604
                },
                {
                    "Description": "Atrial premature depolarization",
                    "Code": "I49.1",
                    "Score": 0.16715925931930542
                }
            ]
        },
        {
            "Id": 3,
            "Text": "hypertension",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9993137121200562,
            "BeginOffset": 138,
            "EndOffset": 150,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "DIAGNOSIS",
                    "Score": 0.9734011888504028
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Essential (primary) hypertension",
                    "Code": "I10",
                    "Score": 0.6827990412712097
                },
                {
                    "Description": "Hypertensive heart disease without heart failure",
                    "Code": "I11.9",
                    "Score": 0.09846580773591995
                },
                {
                    "Description": "Hypertensive heart disease with heart failure",
                    "Code": "I11.0",
                    "Score": 0.09182810038328171
                },
                {
                    "Description": "Pulmonary hypertension, unspecified",
                    "Code": "I27.20",
                    "Score": 0.0866364985704422
                },
                {
                    "Description": "Primary pulmonary hypertension",
                    "Code": "I27.0",
                    "Score": 0.07662317156791687
                }
            ]
        },
        {
            "Id": 4,
            "Text": "hyperlipidemia",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9998835325241089,
            "BeginOffset": 153,
            "EndOffset": 167,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "DIAGNOSIS",
                    "Score": 0.9702492356300354
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Hyperlipidemia, unspecified",
                    "Code": "E78.5",
                    "Score": 0.8378056883811951
                },
                {
                    "Description": "Disorders of lipoprotein metabolism and other lipidemias",
                    "Code": "E78",
                    "Score": 0.20186281204223633
                },
                {
                    "Description": "Lipid storage disorder, unspecified",
                    "Code": "E75.6",
                    "Score": 0.18514418601989746
                },
                {
                    "Description": "Pure hyperglyceridemia",
                    "Code": "E78.1",
                    "Score": 0.1438658982515335
                },
                {
                    "Description": "Other hyperlipidemia",
                    "Code": "E78.49",
                    "Score": 0.13983778655529022
                }
            ]
        },
        {
            "Id": 5,
            "Text": "chills",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9989762306213379,
            "BeginOffset": 211,
            "EndOffset": 217,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "SYMPTOM",
                    "Score": 0.9510533213615417
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Chills (without fever)",
                    "Code": "R68.83",
                    "Score": 0.7460958361625671
                },
                {
                    "Description": "Fever, unspecified",
                    "Code": "R50.9",
                    "Score": 0.11848161369562149
                },
                {
                    "Description": "Typhus fever, unspecified",
                    "Code": "A75.9",
                    "Score": 0.07497859001159668
                },
                {
                    "Description": "Neutropenia, unspecified",
                    "Code": "D70.9",
                    "Score": 0.07332006841897964
                },
                {
                    "Description": "Lassa fever",
                    "Code": "A96.2",
                    "Score": 0.0721040666103363
                }
            ]
        },
        {
            "Id": 6,
            "Text": "nausea",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9993392825126648,
            "BeginOffset": 220,
            "EndOffset": 226,
            "Attributes": [],
            "Traits": [
                {
                    "Name": "SYMPTOM",
                    "Score": 0.9175007939338684
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Nausea",
                    "Code": "R11.0",
                    "Score": 0.7333012819290161
                },
                {
                    "Description": "Nausea with vomiting, unspecified",
                    "Code": "R11.2",
                    "Score": 0.20183530449867249
                },
                {
                    "Description": "Hematemesis",
                    "Code": "K92.0",
                    "Score": 0.1203150525689125
                },
                {
                    "Description": "Vomiting, unspecified",
                    "Code": "R11.10",
                    "Score": 0.11658868193626404
                },
                {
                    "Description": "Nausea and vomiting",
                    "Code": "R11",
                    "Score": 0.11535880714654922
                }
            ]
        },
        {
            "Id": 8,
            "Text": "flank pain",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9315784573554993,
            "BeginOffset": 235,
            "EndOffset": 245,
            "Attributes": [
                {
                    "Type": "ACUITY",
                    "Score": 0.9809532761573792,
                    "RelationshipScore": 0.9999837875366211,
                    "Id": 7,
                    "BeginOffset": 229,
                    "EndOffset": 234,
                    "Text": "acute",
                    "Traits": []
                }
            ],
            "Traits": [
                {
                    "Name": "SYMPTOM",
                    "Score": 0.8182812929153442
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Unspecified abdominal pain",
                    "Code": "R10.9",
                    "Score": 0.4959934949874878
                },
                {
                    "Description": "Generalized abdominal pain",
                    "Code": "R10.84",
                    "Score": 0.12332479655742645
                },
                {
                    "Description": "Lower abdominal pain, unspecified",
                    "Code": "R10.30",
                    "Score": 0.08319114148616791
                },
                {
                    "Description": "Upper abdominal pain, unspecified",
                    "Code": "R10.10",
                    "Score": 0.08275411278009415
                },
                {
                    "Description": "Jaw pain",
                    "Code": "R68.84",
                    "Score": 0.07797083258628845
                }
            ]
        },
        {
            "Id": 10,
            "Text": "numbness",
            "Category": "MEDICAL_CONDITION",
            "Type": "DX_NAME",
            "Score": 0.9659366011619568,
            "BeginOffset": 255,
            "EndOffset": 263,
            "Attributes": [
                {
                    "Type": "SYSTEM_ORGAN_SITE",
                    "Score": 0.9976192116737366,
                    "RelationshipScore": 0.9999089241027832,
                    "Id": 11,
                    "BeginOffset": 271,
                    "EndOffset": 274,
                    "Text": "leg",
                    "Traits": []
                }
            ],
            "Traits": [
                {
                    "Name": "SYMPTOM",
                    "Score": 0.7310190796852112
                }
            ],
            "ICD10CMConcepts": [
                {
                    "Description": "Anesthesia of skin",
                    "Code": "R20.0",
                    "Score": 0.767346203327179
                },
                {
                    "Description": "Paresthesia of skin",
                    "Code": "R20.2",
                    "Score": 0.13602739572525024
                },
                {
                    "Description": "Other complications of anesthesia",
                    "Code": "T88.59",
                    "Score": 0.09990577399730682
                },
                {
                    "Description": "Hypothermia following anesthesia",
                    "Code": "T88.51",
                    "Score": 0.09953102469444275
                },
                {
                    "Description": "Disorder of the skin and subcutaneous tissue, unspecified",
                    "Code": "L98.9",
                    "Score": 0.08736388385295868
                }
            ]
        }
    ],
    "ModelVersion": "0.0.0"
}
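
A response like this is usually post-processed to keep one code per entity. Here is a minimal sketch of that step, using an abbreviated copy of the output above (two entities, two concepts each). It assumes the entity list sits under a top-level "Entities" key, and the best_codes helper and its min_score threshold are illustrative names, not part of the service.

```python
# Abbreviated copy of the response shown above; the "Entities" key is an assumption.
response = {
    "Entities": [
        {
            "Text": "hypertension",
            "ICD10CMConcepts": [
                {"Description": "Essential (primary) hypertension",
                 "Code": "I10", "Score": 0.6827990412712097},
                {"Description": "Hypertensive heart disease without heart failure",
                 "Code": "I11.9", "Score": 0.09846580773591995},
            ],
        },
        {
            "Text": "flank pain",
            "ICD10CMConcepts": [
                {"Description": "Unspecified abdominal pain",
                 "Code": "R10.9", "Score": 0.4959934949874878},
                {"Description": "Generalized abdominal pain",
                 "Code": "R10.84", "Score": 0.12332479655742645},
            ],
        },
    ]
}


def best_codes(resp, min_score=0.0):
    """Return (text, code, score) for the highest-scoring concept of each entity."""
    results = []
    for entity in resp.get("Entities", []):
        concepts = entity.get("ICD10CMConcepts", [])
        if not concepts:
            continue  # entity was detected but not linked to any code
        top = max(concepts, key=lambda c: c["Score"])
        if top["Score"] >= min_score:
            results.append((entity["Text"], top["Code"], round(top["Score"], 3)))
    return results


print(best_codes(response))
print(best_codes(response, min_score=0.6))
```

With a threshold of 0.6, "flank pain" is dropped because its best match (R10.9) scores below 0.5, which is a reasonable signal to route that entity to a human reviewer.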

Available Now
You can use Amazon Comprehend Medical via the console, AWS Command Line Interface (CLI), or AWS SDKs. With Comprehend Medical, you pay only for what you use. You are charged based on the amount of text processed on a monthly basis, depending on the features you use. For more information, please see the Comprehend Medical section in the Comprehend Pricing page. Ontology Linking is available in all regions where Amazon Comprehend Medical is offered, as described in the AWS Regions Table.

The new ontology linking APIs make it easy to detect medications and medical conditions in unstructured clinical text and link them to RxNorm and ICD-10-CM codes respectively. This new feature can help you reduce the cost, time and effort of processing large amounts of unstructured medical text with high accuracy.

Danilo

New – Provisioned Concurrency for Lambda Functions

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-provisioned-concurrency-for-lambda-functions/

It’s really true that time flies, especially when you don’t have to think about servers: AWS Lambda just turned 5 years old and the team is always looking for new ways to help customers build and run applications in an easier way.

As more mission-critical applications move to serverless, customers need more control over the performance of their applications. Today we are launching Provisioned Concurrency, a feature that keeps functions initialized and hyper-ready to respond in double-digit milliseconds. This is ideal for implementing interactive services, such as web and mobile backends, latency-sensitive microservices, or synchronous APIs.

When you invoke a Lambda function, the invocation is routed to an execution environment to process the request. When a function has not been used for some time, when you need to process more concurrent invocations, or when you update a function, new execution environments are created. The creation of an execution environment takes care of installing the function code and starting the runtime. Depending on the size of your deployment package, and the initialization time of the runtime and of your code, this can introduce latency for the invocations that are routed to a new execution environment. This latency is usually referred to as a “cold start”. For most applications this additional latency is not a problem. For some applications, however, this latency may not be acceptable.

When you enable Provisioned Concurrency for a function, the Lambda service will initialize the requested number of execution environments so they can be ready to respond to invocations.

Configuring Provisioned Concurrency
I create two Lambda functions that use the same Java code and can be triggered by Amazon API Gateway. To simulate a production workload, these functions are repeating some mathematical computation 10 million times in the initialization phase and 200,000 times for each invocation. The computation is using java.Math.Random and conditions (if ...) to avoid compiler optimizations (such as “unlooping” the iterations). Each function has 1GB of memory and the size of the code is 1.7MB.

I want to enable Provisioned Concurrency only for one of the two functions, so that I can compare how they react to a similar workload. In the Lambda console, I select one of the functions. In the configuration tab, I see the new Provisioned Concurrency settings.

I select Add configuration. Provisioned Concurrency can be enabled for a specific Lambda function version or alias (you can’t use $LATEST). You can have different settings for each version of a function. Using an alias, it is easier to apply these settings to the correct version of your function. In my case I select the alias live that I keep updated to the latest version using the AWS SAM AutoPublishAlias function preference. For the Provisioned Concurrency, I enter 500 and Save.
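
The same setting can be applied from code. Here is a minimal sketch, assuming the AWS SDK for Python (boto3) and its Lambda put_provisioned_concurrency_config call. The function name my-function is hypothetical, and the helper names are mine. The request builder runs on its own, while the call to AWS needs boto3 and credentials and is therefore not executed in this sketch.

```python
def provisioned_concurrency_request(function_name, qualifier, executions):
    """Build the parameters for Lambda's PutProvisionedConcurrencyConfig call."""
    if qualifier == "$LATEST":
        # Mirrors the rule above: only a published version or an alias is accepted.
        raise ValueError("Provisioned Concurrency needs a published version or an alias")
    if executions < 1:
        raise ValueError("executions must be a positive integer")
    return {
        "FunctionName": function_name,
        "Qualifier": qualifier,
        "ProvisionedConcurrentExecutions": executions,
    }


def apply_with_boto3(params):
    """Send the request to AWS; requires boto3 and credentials (not run here)."""
    import boto3  # imported lazily so the rest of the sketch runs without it
    return boto3.client("lambda").put_provisioned_concurrency_config(**params)


# "my-function" is a hypothetical name; "live" and 500 match the console walkthrough.
params = provisioned_concurrency_request("my-function", "live", 500)
print(params)
```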

Now, the Provisioned Concurrency configuration is in progress. The execution environments are being prepared to serve concurrent incoming requests based on my input. During this time the function remains available and continues to serve traffic.

After a few minutes, the concurrency is ready. With these settings, up to 500 concurrent requests will find an execution environment ready to process them. If I go above that, the usual scaling of Lambda functions still applies.

To generate some load, I use an Amazon Elastic Compute Cloud (EC2) instance in the same region. To keep it simple, I use the ab tool bundled with the Apache HTTP Server to call the two API endpoints 10,000 times with a concurrency of 500. Since these are new functions, I expect that:

  • For the function with Provisioned Concurrency enabled and set to 500, my requests are managed by pre-initialized execution environments.
  • For the other function, which has Provisioned Concurrency disabled, about 500 execution environments need to be provisioned, adding some latency to the same number of invocations, about 5% of the total.
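
The 5% figure is simple arithmetic. As a quick sketch, assuming one cold start per concurrent request slot on a function with no warm execution environments (the helper name is mine):

```python
def cold_start_share(total_requests, concurrency):
    """Fraction of requests expected to hit a new execution environment,
    assuming one cold start per concurrent slot and none afterwards."""
    return min(concurrency, total_requests) / total_requests


share = cold_start_share(total_requests=10_000, concurrency=500)
print(f"{share:.0%} of requests are expected to see a cold start")
print(f"so the extra latency should show up above the {1 - share:.0%} percentile")
```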

One cool feature of the ab tool is that it reports the percentage of the requests served within a certain time. That is a very good way to look at API latency, as described in this post on Serverless Latency by Tim Bray.

Here are the results for the function with Provisioned Concurrency disabled:

Percentage of the requests served within a certain time (ms)
50% 351
66% 359
75% 383
80% 396
90% 435
95% 1357
98% 1619
99% 1657
100% 1923 (longest request)

Looking at these numbers, I see that 50% of the requests are served within 351ms, 66% of the requests within 359ms, and so on. It’s clear that something happens when I look at 95% or more of the requests: the time suddenly increases by about a second.

These are the results for the function with Provisioned Concurrency enabled:

Percentage of the requests served within a certain time (ms)
50% 352
66% 368
75% 382
80% 387
90% 400
95% 415
98% 447
99% 513
100% 593 (longest request)

Let’s compare those numbers in a graph.

As expected for my test workload, I see a big difference in the response time of the slowest 5% of the requests (between 95% and 100%), where the function with Provisioned Concurrency disabled shows the latency added by the creation of new execution environments and the (slow) initialization in my function code.
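
To make the comparison concrete without the graph, here is a small sketch that parses the two ab percentile tables above and prints the difference at each percentile. The numbers are copied from the results. The parse_ab helper is mine and only handles this section of the ab output.

```python
AB_DISABLED = """\
50% 351
66% 359
75% 383
80% 396
90% 435
95% 1357
98% 1619
99% 1657
100% 1923 (longest request)
"""

AB_ENABLED = """\
50% 352
66% 368
75% 382
80% 387
90% 400
95% 415
98% 447
99% 513
100% 593 (longest request)
"""


def parse_ab(text):
    """Map percentile -> milliseconds from ab's 'served within' table."""
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) >= 2 and parts[0].endswith("%"):
            table[int(parts[0].rstrip("%"))] = int(parts[1])
    return table


disabled = parse_ab(AB_DISABLED)
enabled = parse_ab(AB_ENABLED)

for pct in sorted(disabled):
    diff = disabled[pct] - enabled[pct]
    print(f"{pct:>3}%  disabled={disabled[pct]:>4} ms  enabled={enabled[pct]:>3} ms  diff={diff:>5} ms")
```

Up to the 90th percentile the two functions are within a few tens of milliseconds of each other. From the 95th percentile on, the gap is roughly a second.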

In general, the amount of latency added depends on the runtime you use, the size of your code, and the initialization required by your code to be ready for a first invocation. As a result, the added latency can be more, or less, than what I experienced here.

The number of invocations affected by this additional latency depends on how often the Lambda service needs to create new execution environments. Usually that happens when the number of concurrent invocations increases beyond what is already provisioned, or when you deploy a new version of a function.

A small percentage of slow response times (generally referred to as tail latency) really makes a difference in end user experience. Over an extended period of time, most users are affected during some of their interactions. With Provisioned Concurrency enabled, user experience is much more stable.

Provisioned Concurrency is a Lambda feature and works with any trigger. For example, you can use it with WebSockets APIs, GraphQL resolvers, or IoT Rules. This feature gives you more control when building serverless applications that require low latency, such as web and mobile apps, games, or any service that is part of a complex transaction.

Available Now
Provisioned Concurrency can be configured using the console, the AWS Command Line Interface (CLI), or AWS SDKs for new or existing Lambda functions, and is available today in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon), Asia Pacific (Hong Kong), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (London), Europe (Paris), Europe (Stockholm), Middle East (Bahrain), and South America (São Paulo).

You can use the AWS Serverless Application Model (SAM) and SAM CLI to test, deploy and manage serverless applications that use Provisioned Concurrency.

With Application Auto Scaling you can automate configuring the required concurrency for your functions. As policies, Target Tracking and Scheduled Scaling are supported. Using these policies, you can automatically increase the amount of concurrency during times of high demand and decrease it when the demand decreases.

You can also use Provisioned Concurrency today with AWS Partner tools, including configuring Provisioned Concurrency settings with the Serverless Framework and Terraform, or viewing metrics with Datadog, Epsagon, Lumigo, New Relic, SignalFx, SumoLogic, and Thundra.

You only pay for the amount of concurrency that you configure and for the period of time that you configure it. Pricing in US East (N. Virginia) is $0.015 per GB-hour for Provisioned Concurrency and $0.035 per GB-hour for Duration. The number of requests is charged at the same rate as normal functions. You can find more information in the Lambda pricing page.
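
As a rough worked example, here is what the Provisioned Concurrency charge alone would be for the configuration I used in the test (500 execution environments with 1GB of memory each) at the US East (N. Virginia) rate quoted above. The 8-hour window is a hypothetical schedule, and the sketch leaves out the Duration and request charges.

```python
PROVISIONED_RATE_PER_GB_HOUR = 0.015  # US East (N. Virginia), from the pricing above


def provisioned_concurrency_cost(environments, memory_gb, hours,
                                 rate=PROVISIONED_RATE_PER_GB_HOUR):
    """Charge for keeping `environments` initialized for `hours`.
    Covers only the Provisioned Concurrency component, not Duration or requests."""
    return environments * memory_gb * hours * rate


per_hour = provisioned_concurrency_cost(environments=500, memory_gb=1, hours=1)
per_window = provisioned_concurrency_cost(environments=500, memory_gb=1, hours=8)  # hypothetical 8-hour peak

print(f"per hour:      ${per_hour:.2f}")
print(f"8-hour window: ${per_window:.2f}")
```

This is also why scheduled scaling is useful: the charge scales linearly with the hours you keep the concurrency configured.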

This new feature enables developers to use Lambda for a variety of workloads that require highly consistent latency. Let me know what you are going to use it for!

Danilo

New for AWS Transit Gateway – Build Global Networks and Centralize Monitoring Using Network Manager

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-aws-transit-gateway-build-global-networks-and-centralize-monitoring-using-network-manager/

As your company grows and gets the benefits of a cloud-based infrastructure, your on-premises sites like offices and stores increasingly need high performance private connectivity to AWS and to other sites at a reasonable cost. Growing your network is hard, because traditional branch networks based on leased lines are costly, and they suffer from the same lack of elasticity and agility as traditional data centers.

At the same time, it becomes increasingly complex to manage and monitor a global network that is spread across AWS regions and on-premises sites. You need to stitch together data from these diverse locations. This results in an inconsistent operational experience, increased costs and efforts, and missed insights from the lack of visibility across different technologies.

Today, we want to make it easier to build, manage, and monitor global networks with the following new capabilities for AWS Transit Gateway:

  • Transit Gateway Inter-Region Peering
  • Accelerated Site-to-Site VPN
  • AWS Transit Gateway Network Manager

These new networking capabilities enable you to optimize your network using AWS’s global backbone, and to centrally visualize and monitor your global network. More specifically:

  • Inter-Region Peering and Accelerated VPN improve application performance by leveraging the AWS Global Network. In this way, you can reduce the number of leased lines required to operate your network, optimizing your cost and improving agility. Transit Gateway Inter-Region Peering sends inter-region traffic privately over AWS’s global network backbone. Accelerated VPN uses AWS Global Accelerator to route VPN traffic from remote locations through the closest AWS edge location to improve connection performance.
  • Network Manager reduces the operational complexity of managing a global network across AWS and on-premises. With Network Manager, you set up a global view of your private network simply by registering your Transit Gateways and on-premises resources. Your global network can then be visualized and monitored via a centralized operational dashboard.

These features allow you to optimize connectivity from on-premises sites to AWS and also between on-premises sites, by routing traffic through Transit Gateways and the AWS Global Network, and centrally managing through Network Manager.

Visualizing Your Global Network
In the Network Manager console, which you can reach from the Transit Gateways section of the Amazon Virtual Private Cloud console, you have an overview of your global networks. Each global network includes AWS and on-premises resources. Specifically, it provides a central point of management for your AWS Transit Gateways, your physical devices and sites connected to the Transit Gateways via Site-to-Site VPN Connections, and AWS Direct Connect locations attached to the Transit Gateways.

For example, this is the Geographic view of a global network covering North America and Europe with 5 Transit Gateways in 3 AWS Regions, 80 VPCs, 50 VPNs, 1 Direct Connect location, and 16 on-premises sites with 50 devices:

As I zoom in on the map, I get a description of what these nodes represent, for example if they are AWS Regions, Direct Connect locations, or branch offices.

I can select any node in the map to get more information. For example, I select the US West (Oregon) AWS Region to see the details of the two Transit Gateways I am using there, including the state of all VPN connections, VPCs, and VPNs handled by the selected Transit Gateway.

Selecting a site, I get a centralized view with the status of the VPN connections, including site metadata such as address, location, and description. For example, here are the details of the Colorado branch offices.

In the Topology panel, I see the logical relationship of all the resources in my network. On the left here there is the entire topology of my global network, on the right the detail of the European part. Connection status is reported as color in the topology view.

Selecting any node in the topology map displays details specific to the resource type (Transit Gateway, VPC, customer gateway, and so on) including links to the corresponding service in the AWS console to get more information and configure the resource.

Monitoring Your Global Network
Network Manager uses Amazon CloudWatch, which collects raw data and processes it into readable, near real-time metrics for data in/out, packets dropped, and VPN connection status.

These statistics are kept for 15 months, so that you can access historical information and gain a better perspective on how your web application or service is performing. You can also set alarms that watch for certain thresholds, and send notifications or take actions when those thresholds are met.

For example, these are the last 12 hours of Monitoring for the Transit Gateway in Europe (Ireland).

In the global network view, you have a single point of view of all events affecting your network, simplifying root cause analysis in case of issues. Clicking on any of the messages in the console takes you to a more detailed view in the Events tab.

Your global network events are also delivered by CloudWatch Events. Using simple rules that you can quickly set up, you can match events and route them to one or more target functions or streams. To process the same events, you can also use the additional capabilities offered by Amazon EventBridge.

Network Manager sends the following types of events:

  • Topology changes, for example when a VPN connection is created for a transit gateway.
  • Routing updates, such as when a route is deleted in a transit gateway route table.
  • Status updates, for example in case a VPN tunnel’s BGP session goes down.

Configuring Your Global Network
To get your on-premises resources included in the above visualizations and monitoring, you need to input into Network Manager information about your on-premises devices, sites, and links. You also need to associate devices with the customer gateways they host for VPN connections.

Our software-defined wide area network (SD-WAN) partners, such as Cisco, Aruba, Silver Peak, and Aviatrix, have configured their SD-WAN devices to connect with Transit Gateway Network Manager in only a few clicks. Their SD-WANs also define the on-premises devices, sites, and links automatically in Network Manager. SD-WAN integrations enable you to include your on-premises network in the Network Manager global dashboard view without requiring you to input information manually.

Available Now
AWS Transit Gateway Network Manager is a global service available for Transit Gateways in the following regions: US East (N. Virginia), US East (Ohio), US West (N. California), US West (Oregon), Europe (Ireland), Europe (Frankfurt), Europe (London), Europe (Paris), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Seoul), Asia Pacific (Sydney), Asia Pacific (Mumbai), Canada (Central), South America (São Paulo).

There is no additional cost for using Network Manager. You pay for the network resources you use, like Transit Gateways, VPNs, and so on. Here you can find more information on pricing for VPN and Transit Gateway.

You can learn more in the documentation of the Network Manager, Inter-Region Peering, and Accelerated VPN.

With these new features, you can take advantage of the performance of our AWS Global Network, and simplify network management and monitoring across your AWS and on-premises resources.

Danilo

New – Amazon Managed Apache Cassandra Service (MCS)

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-amazon-managed-apache-cassandra-service-mcs/

Managing databases at scale is never easy. One of the options to store, retrieve, and manage large amounts of structured data, including key-value and tabular formats, is Apache Cassandra. With Cassandra, you can use the expressive Cassandra Query Language (CQL) to build applications quickly.

However, managing large Cassandra clusters can be difficult and takes a lot of time. You need specialized expertise to set up, configure, and maintain the underlying infrastructure, and have a deep understanding of the entire application stack, including the Apache Cassandra open source software. You need to add or remove nodes manually, rebalance partitions, and do so while keeping your application available with the required performance. Talking with customers, we found out that they often keep their clusters scaled up for peak load because scaling down is complex. To keep your Cassandra cluster updated, you have to do it node by node. It’s hard to back up and restore a cluster if something goes wrong during an update, and you may end up skipping patches or running an outdated version.

Introducing Amazon Managed Cassandra Service
Today, we are launching in open preview Amazon Managed Apache Cassandra Service (MCS), a scalable, highly available, and managed Apache Cassandra-compatible database service. Amazon MCS is serverless, so you pay for only the resources you use and the service automatically scales tables up and down in response to application traffic. You can build applications that serve thousands of requests per second with virtually unlimited throughput and storage.

With Amazon MCS, you can run your Cassandra workloads on AWS using the same Cassandra application code and developer tools that you use today. Amazon MCS implements the Apache Cassandra version 3.11 CQL API, allowing you to use the code and drivers that you already have in your applications. Updating your application is as easy as changing the endpoint to the one in the Amazon MCS service table.

Amazon MCS provides consistent single-digit-millisecond read and write performance at any scale, so you can build applications with low latency to provide a smooth user experience. You have visibility into how your application is performing using Amazon CloudWatch.

There is no limit on the size of a table or the number of items, and you do not need to provision storage. Data storage is fully managed and highly available. Your table data is replicated automatically three times across multiple AWS Availability Zones for durability.

All customer data is encrypted at rest by default. You can use encryption keys stored in AWS Key Management Service (KMS). Amazon MCS is also integrated with AWS Identity and Access Management (IAM) to help you manage access to your tables and data.

Using Amazon Managed Cassandra Service
You can use Amazon MCS with the console, CQL, or existing Apache 2.0 licensed Cassandra drivers. In the console there is a CQL editor, or you can connect using cqlsh. 

To connect using cqlsh, I need to generate service-specific credentials for an existing IAM user. This is just a command using the AWS Command Line Interface (CLI):

aws iam create-service-specific-credential --user-name USERNAME --service-name mcs.amazonaws.com

{
    "ServiceSpecificCredential": {
        "CreateDate": "2019-11-27T14:36:16Z",
        "ServiceName": "mcs.amazonaws.com",
        "ServiceUserName": "USERNAME-at-123412341234",
        "ServicePassword": "...",
        "ServiceSpecificCredentialId": "...",
        "UserName": "USERNAME",
        "Status": "Active"
    }
}

Amazon MCS only accepts secure connections using TLS.  I download the Amazon root certificate and edit the cqlshrc configuration file to use it. Now, I can connect with:

cqlsh {endpoint} {port} -u {ServiceUserName} -p {ServicePassword} --ssl

First, I create a keyspace. A keyspace contains one or more tables and defines the replication strategy for all the tables it contains. With Amazon MCS the default replication strategy for all keyspaces is the Single-region strategy. It replicates data 3 times across multiple Availability Zones in a single AWS Region.

To create a keyspace I can use the console or CQL. In the Amazon MCS console, I provide the name for the keyspace.

Similarly, I can use CQL to create the bookstore keyspace:

CREATE KEYSPACE IF NOT EXISTS bookstore WITH REPLICATION={'class': 'SingleRegionStrategy'};

Now I create a table. A table is where your data is organized and stored. Again, I can use the console or CQL. From the console, I select the bookstore keyspace and give the table a name.

Below that, I add the columns for my books table. Each row in a table is referenced by a primary key, which can be composed of one or more columns, the values of which determine which partition the data is stored in. In my case the primary key is the ISBN. Optionally, I can add clustering columns, which determine the sort order of records within a partition. I am not using clustering columns for this table.

Alternatively, using CQL, I can create the table with the following commands:

USE bookstore;

CREATE TABLE IF NOT EXISTS books
(isbn text PRIMARY KEY,
title text,
author text,
pages int,
year_of_publication int);

I now use CQL to insert a record in the books table:

INSERT INTO books (isbn, title, author, pages, year_of_publication)
VALUES ('978-0201896831',
'The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)',
'Donald E. Knuth', 672, 1997);

Let’s run a quick query. In the console, I select the books table and then Query table.

In the CQL Editor, I use the default query and select Run command.

By default, I see the result of the query in table view:

If I prefer, I can see the result in JSON format, similar to what an application using the Cassandra API would see:

To insert more records, I use cqlsh again and upload some data from a local CSV file:

COPY books (isbn, title, author, pages, year_of_publication)
FROM './books.csv' WITH delimiter=',' AND header=TRUE;
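
The layout of the CSV file matters here: the first line is a header (because of header=TRUE), and any value containing the delimiter must be quoted. Here is a small sketch that produces such a file body with Python's csv module, using the record I inserted earlier. It assumes cqlsh's default double-quote quoting for COPY, and the to_csv helper name is mine.

```python
import csv
import io

HEADER = ("isbn", "title", "author", "pages", "year_of_publication")

ROWS = [
    ("978-0201896831",
     "The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)",
     "Donald E. Knuth", 672, 1997),
]


def to_csv(header, rows):
    """Render a header line plus rows; fields containing commas get double-quoted."""
    buf = io.StringIO()
    writer = csv.writer(buf, lineterminator="\n")  # default QUOTE_MINIMAL quoting
    writer.writerow(header)
    writer.writerows(rows)
    return buf.getvalue()


text = to_csv(HEADER, ROWS)
print(text)

# Reading it back confirms the title survives as one field despite its comma.
parsed = list(csv.reader(io.StringIO(text)))
print(len(parsed[1]), "fields in the data row")
```

To use it with the COPY command above, write the text to ./books.csv.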

Now I look again at the content of the books table:

SELECT * FROM books;

I can select a row using a primary key, or use filtering for additional conditions. For example:

SELECT title FROM books WHERE isbn='978-1942788713';

SELECT title FROM books WHERE author='Scott Page' ALLOW FILTERING;
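
The difference between the two queries is easier to see with a toy model. The sketch below is only an in-memory analogy in Python, not how Cassandra or Amazon MCS is implemented: a lookup by primary key goes straight to one row, while a condition on a non-key column has to examine every row, which is why CQL asks for an explicit ALLOW FILTERING. The second record is a made-up placeholder.

```python
# Toy table keyed by the primary key (isbn).
books = {
    "978-0201896831": {
        "title": "The Art of Computer Programming, Vol. 1: Fundamental Algorithms (3rd Edition)",
        "author": "Donald E. Knuth",
    },
    # Hypothetical placeholder row, only here so the scan has something to skip.
    "000-0000000000": {"title": "Placeholder Title", "author": "Placeholder Author"},
}


def titles_by_isbn(isbn):
    """Primary-key lookup: one row examined, regardless of table size."""
    row = books.get(isbn)
    return ([row["title"]] if row else []), 1


def titles_by_author(author):
    """Filter on a non-key column: every row is examined."""
    titles = [row["title"] for row in books.values() if row["author"] == author]
    return titles, len(books)


print(titles_by_isbn("978-0201896831"))
print(titles_by_author("Donald E. Knuth"))
```

Both calls return the same title, but the second one reports that it examined the whole table to find it.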

With Amazon MCS you can use existing Apache 2.0 licensed Cassandra drivers and developer tools. Open-source Cassandra drivers are available for Java, Python, Ruby, .NET, Node.js, PHP, C++, Perl, and Go.

You can learn more in the Amazon MCS documentation.

Available in Open Preview
Amazon MCS is available today in open preview in US East (N. Virginia), US East (Ohio), Europe (Stockholm), Asia Pacific (Singapore), and Asia Pacific (Tokyo).

As we work with the Cassandra API libraries, we are contributing bug fixes to the open source Apache Cassandra project. We are also contributing back improvements such as built-in support for AWS authentication (SigV4), which simplifies managing credentials for customers running Cassandra on Amazon Elastic Compute Cloud (EC2), since EC2 and IAM can handle distribution and management of credentials using instance roles automatically. We are also announcing the funding of AWS promotional service credits for testing Cassandra-related open-source projects. To learn more about these contributions, visit the Open Source blog.

During the preview, you can use Amazon MCS with on-demand capacity. At general availability, we will also offer the option to use provisioned throughput for more predictable workloads. With on-demand capacity mode, Amazon MCS charges you based on the amount of data your applications read from and write to your tables. You do not need to specify how much read and write throughput capacity to provision for your tables because Amazon MCS accommodates your workloads instantly as they scale up or down.

As part of the AWS Free Tier, you can get started with Amazon MCS for free. For the first three months, you are offered a monthly free tier of 30 million write request units, 30 million read request units, and 1 GB of storage. Your free tier starts when you create your first Amazon MCS resource.

Next year we are making it easier to migrate your data to Amazon MCS, adding support for AWS Database Migration Service.

Amazon MCS makes it easy to run Cassandra workloads at any scale, providing a simple programming interface to build new applications or migrate existing ones. I can’t wait to see what you are going to use it for!

New for Amazon Redshift – Data Lake Export and Federated Query

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-redshift-data-lake-export-and-federated-queries/

A data warehouse is a database optimized to analyze relational data coming from transactional systems and line of business applications. Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze data using standard SQL and existing Business Intelligence (BI) tools.

To get information from unstructured data that would not fit in a data warehouse, you can build a data lake. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. With a data lake built on Amazon Simple Storage Service (S3), you can easily run big data analytics and use machine learning to gain insights from your semi-structured (such as JSON, XML) and unstructured datasets.

Today, we are launching two new features to help you improve the way you manage your data warehouse and integrate with a data lake:

  • Data Lake Export to unload data from a Redshift cluster to S3 in Apache Parquet format, an efficient open columnar storage format optimized for analytics.
  • Federated Query to query, from a Redshift cluster, across data stored in the cluster, in your S3 data lake, and in one or more Amazon Relational Database Service (RDS) for PostgreSQL and Amazon Aurora PostgreSQL databases.

This architectural diagram gives a quick summary of how these features work and how they can be used together with other AWS services.

Let’s look more closely at the interactions you see in the diagram, starting from how you can use these features and the advantages they provide.

Using Redshift Data Lake Export

You can now unload the result of a Redshift query to your S3 data lake in Apache Parquet format. The Parquet format is up to 2x faster to unload and consumes up to 6x less storage in S3, compared to text formats. This enables you to save data transformation and enrichment you have done in Redshift into your S3 data lake in an open format.

You can then analyze the data in your data lake with Redshift Spectrum, a feature of Redshift that allows you to query data directly from files on S3. Or you can use different tools such as Amazon Athena, Amazon EMR, or Amazon SageMaker.

To try this new feature, I create a new cluster from the Redshift console, and follow this tutorial to load sample data that keeps track of sales of musical events across different venues. I want to correlate this data with social media comments on the events stored in my data lake. To understand their relevance, each event should have a way of comparing its relative sales to other events.

Let’s build a query in Redshift to export the data to S3. My data is stored across multiple tables. I need to create a query that gives me a single view of what is going on with sales. I want to join the content of the sales and date tables, adding information on the gross sales for an event (total_price in the query), and the percentile in terms of all-time gross sales compared to all events.

To export the result of the query to S3 in Parquet format, I use the following SQL command:

UNLOAD ('SELECT sales.*, date.*, total_price, percentile
           FROM sales, date,
                (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) / 10.0 as percentile
                   FROM (SELECT eventid, sum(pricepaid) total_price
                           FROM sales
                       GROUP BY eventid)) as percentile_events
          WHERE sales.dateid = date.dateid
            AND percentile_events.eventid = sales.eventid')
TO 's3://MY-BUCKET/DataLake/Sales/'
FORMAT AS PARQUET
CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole';
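
To make the percentile column in the query more concrete, here is a minimal Python sketch of what ntile(1000) over(order by total_price desc) / 10.0 computes. It assumes the usual SQL NTILE behavior (rows are split into buckets of as equal size as possible, with any extra rows going to the first buckets). The function names and the sample totals are made up for illustration.

```python
def ntile_buckets(row_count, buckets):
    """Return the 1-based bucket number for each row position, NTILE-style."""
    base, extra = divmod(row_count, buckets)
    assigned = []
    for bucket in range(1, buckets + 1):
        # The first `extra` buckets receive one additional row.
        size = base + (1 if bucket <= extra else 0)
        assigned.extend([bucket] * size)
    return assigned


def event_percentiles(totals, buckets=1000):
    """Map each event id to its percentile, ranking by total sales descending."""
    ranked = sorted(totals, key=lambda event: totals[event], reverse=True)
    tiles = ntile_buckets(len(ranked), buckets)
    # With 1000 buckets this is the same as dividing the bucket number by 10.0.
    return {event: tile * 100.0 / buckets for event, tile in zip(ranked, tiles)}


# Hypothetical gross sales per event id.
percentiles = event_percentiles({101: 5000.0, 102: 1200.0, 103: 800.0})
```

Because the ordering is descending, a lower value means higher sales: an event with percentile 0.1 sits in the top 0.1% of events by gross sales.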

To give Redshift write access to my S3 bucket, I am using an AWS Identity and Access Management (IAM) role. I can see the result of the UNLOAD command using the AWS Command Line Interface (CLI). As expected, the output of the query is exported using the Parquet columnar data format:

$ aws s3 ls s3://MY-BUCKET/DataLake/Sales/
2019-11-25 14:26:56 1638550 0000_part_00.parquet
2019-11-25 14:26:56 1635489 0001_part_00.parquet
2019-11-25 14:26:56 1624418 0002_part_00.parquet
2019-11-25 14:26:56 1646179 0003_part_00.parquet

To optimize access to data, I can specify one or more partition columns so that unloaded data is automatically partitioned into folders in my S3 bucket. For example, I can unload sales data partitioned by year, month, and day. This enables my queries to take advantage of partition pruning and skip scanning irrelevant partitions, improving query performance and minimizing cost.

To use partitioning, I need to add to the previous SQL command the PARTITION BY option, followed by the columns I want to use to partition the data in different directories. In my case, I want to partition the output based on the year and the calendar date (caldate in the query) of the sales.

UNLOAD ('SELECT sales.*, date.*, total_price, percentile
           FROM sales, date,
                (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) / 10.0 as percentile
                   FROM (SELECT eventid, sum(pricepaid) total_price
                           FROM sales
                       GROUP BY eventid)) as percentile_events
          WHERE sales.dateid = date.dateid
            AND percentile_events.eventid = sales.eventid')
TO 's3://MY-BUCKET/DataLake/SalesPartitioned/'
FORMAT AS PARQUET
PARTITION BY (year, caldate)
CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole';

This time, the output of the query is stored in multiple partitions. For example, here’s the content of a folder for a specific year and date:

$ aws s3 ls s3://MY-BUCKET/DataLake/SalesPartitioned/year=2008/caldate=2008-07-20/
2019-11-25 14:36:17 11940 0000_part_00.parquet
2019-11-25 14:36:17 11052 0001_part_00.parquet
2019-11-25 14:36:17 11138 0002_part_00.parquet
2019-11-25 14:36:18 12582 0003_part_00.parquet
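
The folder names follow a column=value convention, which is what lets query engines skip partitions that cannot match a filter. The following Python sketch illustrates the idea under simple assumptions. The helper names are hypothetical, the second date is made up, and real engines prune using catalog metadata rather than by scanning key names like this.

```python
def partition_prefix(base, **columns):
    """Build the folder prefix for one partition, e.g. year=2008/caldate=.../"""
    parts = [f"{name}={value}" for name, value in columns.items()]
    return base.rstrip("/") + "/" + "/".join(parts) + "/"


def parse_partitions(key):
    """Extract the column=value pairs encoded in an object key."""
    return dict(seg.split("=", 1) for seg in key.split("/") if "=" in seg)


def prune(keys, **wanted):
    """Keep only the keys whose partition values match every requested filter."""
    return [
        key for key in keys
        if all(parse_partitions(key).get(col) == val for col, val in wanted.items())
    ]


prefix = partition_prefix("s3://MY-BUCKET/DataLake/SalesPartitioned/",
                          year=2008, caldate="2008-07-20")
keys = [
    "DataLake/SalesPartitioned/year=2008/caldate=2008-07-20/0000_part_00.parquet",
    "DataLake/SalesPartitioned/year=2008/caldate=2008-07-21/0000_part_00.parquet",
]
matching = prune(keys, caldate="2008-07-20")
```

A query filtering on caldate only needs to read the files under the matching folder, which is where the performance and cost savings come from.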

Optionally, I can use AWS Glue to set up a Crawler that (on demand or on a schedule) looks for data in my S3 bucket to update the Glue Data Catalog. When the Data Catalog is updated, I can easily query the data using Redshift Spectrum, Athena, or EMR.

The sales data is now ready to be processed together with the unstructured and semi-structured (JSON, XML, Parquet) data in my data lake. For example, I can now use Apache Spark with EMR, or any SageMaker built-in algorithm to access the data and get new insights.

Using Redshift Federated Query
You can now also access data in RDS and Aurora PostgreSQL stores directly from your Redshift data warehouse. In this way, you can access data as soon as it is available. Straight from Redshift, you can now perform queries processing data in your data warehouse, transactional databases, and data lake, without requiring ETL jobs to transfer data to the data warehouse.

Redshift leverages its advanced optimization capabilities to push down and distribute a significant portion of the computation directly into the transactional databases, minimizing the amount of data moving over the network.

Using this syntax, you can add an external schema from an RDS or Aurora PostgreSQL database to a Redshift cluster:

CREATE EXTERNAL SCHEMA IF NOT EXISTS online_system
FROM POSTGRES
DATABASE 'online_sales_db' SCHEMA 'online_system'
URI 'my-hostname' port 5432
IAM_ROLE 'iam-role-arn'
SECRET_ARN 'ssm-secret-arn';

Schema and port are optional here. The schema defaults to public if left unspecified, and the default port for PostgreSQL databases is 5432. Redshift uses AWS Secrets Manager to manage the credentials used to connect to the external databases.

With this command, all tables in the external schema are available and can be used by Redshift for any complex SQL query processing data in the cluster or, using Redshift Spectrum, in your S3 data lake.

Coming back to the sales data example I used before, I can now correlate the trends of my historical data of musical events with real-time sales. In this way, I can understand if an event is performing as expected or not, and calibrate my marketing activities without delays.

For example, after I define the online commerce database as the online_system external schema in my Redshift cluster, I can compare previous sales with what is in the online commerce system with this simple query:

SELECT eventid,
       sum(pricepaid) total_price,
       sum(online_pricepaid) online_total_price
  FROM sales, online_system.current_sales
 WHERE eventid = online_eventid
 GROUP BY eventid;
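
One thing to watch with a join like this: every sales row for an event is paired with every matching current_sales row, so when an event has several rows on either side, each SUM can count values more than once. The sketch below uses an in-memory SQLite database as a stand-in for Redshift and the external schema (an assumption made only so the example runs locally, with made-up rows) and shows one possible variant that aggregates each side before joining.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (eventid INTEGER, pricepaid REAL);
CREATE TABLE current_sales (online_eventid INTEGER, online_pricepaid REAL);
-- Made-up rows: event 1 has two historical and two online sales.
INSERT INTO sales VALUES (1, 10.0), (1, 20.0), (2, 5.0);
INSERT INTO current_sales VALUES (1, 7.0), (1, 1.0), (2, 3.0);
""")

# Aggregate each table per event first, then join the two summaries.
query = """
SELECT h.eventid, h.total_price, o.online_total_price
  FROM (SELECT eventid, SUM(pricepaid) AS total_price
          FROM sales
         GROUP BY eventid) AS h
  JOIN (SELECT online_eventid, SUM(online_pricepaid) AS online_total_price
          FROM current_sales
         GROUP BY online_eventid) AS o
    ON h.eventid = o.online_eventid
 ORDER BY h.eventid
"""
rows = conn.execute(query).fetchall()
```

With this shape, each total reflects its own table only, regardless of how many rows the other side has for the same event.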

Redshift doesn’t import database or schema catalog in its entirety. When a query is run, it localizes the metadata for the Aurora and RDS tables (and views) that are part of the query. This localized metadata is then used for query compilation and plan generation.

Available Now
Amazon Redshift data lake export is a new tool to improve your data processing pipeline and is supported with Redshift release version 1.0.10480 or later. Refer to the AWS Region Table for Redshift availability, and check the version of your clusters.

The new federation capability in Amazon Redshift is released as a public preview and allows you to bring together data stored in Redshift, S3, and one or more RDS and Aurora PostgreSQL databases. When creating a cluster in the Amazon Redshift management console, you can pick one of three tracks for maintenance: Current, Trailing, or Preview. Within the Preview track, choose preview_features to participate in the Federated Query public preview.

These features simplify data processing and analytics, giving you more tools to react quickly, and a single point of view for your data. Let me know what you are going to use them for!

Danilo

New for Amazon Aurora – Use Machine Learning Directly From Your Databases

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-aurora-use-machine-learning-directly-from-your-databases/

Machine Learning allows you to get better insights from your data. But where is most of the structured data stored? In databases! Today, in order to use machine learning with data in a relational database, you need to develop a custom application to read the data from the database and then apply the machine learning model. Developing this application requires a mix of skills to be able to interact with the database and use machine learning. This is a new application, and now you have to manage its performance, availability, and security.

Can we make it easier to apply machine learning to data in a relational database? Even for existing applications?

Starting today, Amazon Aurora is natively integrated with two AWS machine learning services:

  • Amazon SageMaker, a service providing you with the ability to build, train, and deploy custom machine learning models quickly.
  • Amazon Comprehend, a natural language processing (NLP) service that uses machine learning to find insights in text.

Using this new functionality, you can use a SQL function in your queries to apply a machine learning model to the data in your relational database. For example, you can detect the sentiment of a user comment using Comprehend, or apply a custom machine learning model built with SageMaker to estimate the risk of “churn” for your customers. Churn is a word mixing “change” and “turn” and is used to describe customers that stop using your services.

You can store the output of a large query including the additional information from machine learning services in a new table, or use this feature interactively in your application by just changing the SQL code run by the clients, with no machine learning experience required.

Let’s see a couple of examples of what you can do from an Aurora database, first by using Comprehend, then SageMaker.

Configuring Database Permissions
The first step is to give the database permissions to access the services you want to use: Comprehend, SageMaker, or both. In the RDS console, I create a new Aurora MySQL 5.7 database. When it is available, in the Connectivity & security tab of the regional endpoint, I look for the Manage IAM roles section.

There I connect Comprehend and SageMaker to this database cluster. For SageMaker, I need to provide the Amazon Resource Name (ARN) of the endpoint of a deployed machine learning model. If you want to use multiple endpoints, you need to repeat this step. The console takes care of creating the service roles for the Aurora database to access those services in order for the new machine learning integration to work.

Using Comprehend from Amazon Aurora
I connect to the database using a MySQL client. To run my tests, I create a table storing comments for a blogging platform and insert a few sample records:

CREATE TABLE IF NOT EXISTS comments (
       comment_id INT AUTO_INCREMENT PRIMARY KEY,
       comment_text VARCHAR(255) NOT NULL
);

INSERT INTO comments (comment_text)
VALUES ("This is very useful, thank you for writing it!");
INSERT INTO comments (comment_text)
VALUES ("Awesome, I was waiting for this feature.");
INSERT INTO comments (comment_text)
VALUES ("An interesting write up, please add more details.");
INSERT INTO comments (comment_text)
VALUES ("I don’t like how this was implemented.");

To detect the sentiment of the comments in my table, I can use the aws_comprehend_detect_sentiment and aws_comprehend_detect_sentiment_confidence SQL functions:

SELECT comment_text,
       aws_comprehend_detect_sentiment(comment_text, 'en') AS sentiment,
       aws_comprehend_detect_sentiment_confidence(comment_text, 'en') AS confidence
  FROM comments;

The aws_comprehend_detect_sentiment function returns the most probable sentiment for the input text: POSITIVE, NEGATIVE, or NEUTRAL. The aws_comprehend_detect_sentiment_confidence function returns the confidence of the sentiment detection, between 0 (not confident at all) and 1 (fully confident).

Using SageMaker Endpoints from Amazon Aurora
Similarly to what I did with Comprehend, I can access a SageMaker endpoint to enrich the information stored in my database. To see a practical use case, let’s implement the customer churn example mentioned at the beginning of this post.

Mobile phone operators have historical records on which customers ultimately ended up churning and which continued using the service. We can use this historical information to construct a machine learning model. As input for the model, we’re looking at the current subscription plan, how much the customer is speaking on the phone at different times of day, and how often the customer has called customer service.

Here’s the structure of my customers table:

SHOW COLUMNS FROM customers;

To be able to identify customers at risk of churn, I train a model following this sample SageMaker notebook using the XGBoost algorithm. When the model has been created, it’s deployed to a hosted endpoint.

When the SageMaker endpoint is in service, I go back to the Manage IAM roles section of the console to give the Aurora database permissions to access the endpoint ARN.

Now, I create a new will_churn SQL function that passes to the endpoint the parameters required by the model:

CREATE FUNCTION will_churn (
       state varchar(2048), acc_length bigint(20),
       area_code bigint(20), int_plan varchar(2048),
       vmail_plan varchar(2048), vmail_msg bigint(20),
       day_mins double, day_calls bigint(20),
       eve_mins double, eve_calls bigint(20),
       night_mins double, night_calls bigint(20),
       int_mins double, int_calls bigint(20),
       cust_service_calls bigint(20))
RETURNS varchar(2048) CHARSET latin1
       alias aws_sagemaker_invoke_endpoint
       endpoint name 'estimate_customer_churn_endpoint_version_123';

As you can see, the model looks at the customer’s phone subscription details and service usage patterns to identify the risk of churn. Using the will_churn SQL function, I run a query over my customers table to flag customers based on my machine learning model. To store the result of the query, I create a new customers_churn table:

CREATE TABLE customers_churn AS
SELECT *, will_churn(state, acc_length, area_code, int_plan,
       vmail_plan, vmail_msg, day_mins, day_calls,
       eve_mins, eve_calls, night_mins, night_calls,
       int_mins, int_calls, cust_service_calls) will_churn
  FROM customers;

Let’s see a few records from the customers_churn table:

SELECT * FROM customers_churn LIMIT 7;

I am lucky the first 7 customers are apparently not going to churn. But what happens overall? Since I stored the results of the will_churn function, I can run a SELECT GROUP BY statement on the customers_churn table.

SELECT will_churn, COUNT(*) FROM customers_churn GROUP BY will_churn;

Starting from there, I can dive deep to understand what brings my customers to churn.

If I create a new version of my machine learning model, with a new endpoint ARN, I can recreate the will_churn function without changing my SQL statements.

Available Now
The new machine learning integration is available today for Aurora MySQL 5.7, with the SageMaker integration generally available and the Comprehend integration in preview. You can learn more in the documentation. We are working on other engines and versions: Aurora MySQL 5.6 and Aurora PostgreSQL 10 and 11 are coming soon.

The Aurora machine learning integration is available in all regions in which the underlying services are available. For example, if both Aurora MySQL 5.7 and SageMaker are available in a region, then you can use the integration for SageMaker. For a complete list of services availability, please see the AWS Regional Table.

There’s no additional cost for using the integration; you just pay for the underlying services at your normal rates. Pay attention to the size of your queries when using Comprehend. For example, if you do sentiment analysis on user feedback in your customer service web page, to contact those who made particularly positive or negative comments, and people are making 10,000 comments a day, you’d pay $3/day. To optimize your costs, remember to store results.
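
The $3/day figure can be reproduced with a little arithmetic. This sketch assumes Comprehend pricing of $0.0001 per unit of 100 characters with a minimum of 3 units per request, and one sentiment request per comment. Those numbers are my assumptions for illustration, so check the current pricing page before relying on them.

```python
import math
from decimal import Decimal

# Assumed pricing parameters (verify against current Comprehend pricing).
PRICE_PER_UNIT = Decimal("0.0001")   # USD per unit
CHARS_PER_UNIT = 100                 # one unit covers 100 characters
MIN_UNITS_PER_REQUEST = 3            # short texts are billed as 3 units


def daily_cost(comment_lengths):
    """Estimate the cost of one sentiment request per comment."""
    units = sum(
        max(MIN_UNITS_PER_REQUEST, math.ceil(length / CHARS_PER_UNIT))
        for length in comment_lengths
    )
    return units * PRICE_PER_UNIT


# 10,000 comments a day, each fitting the VARCHAR(255) column used earlier.
cost = daily_cost([255] * 10_000)
```

Under these assumptions every comment is billed at the 3-unit minimum, so 10,000 comments come to 30,000 units, or $3. Running the same function over rows you have already scored is what storing results lets you avoid.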

It’s never been easier to apply machine learning models to data stored in your relational databases. Let me know what you are going to build with this!

Danilo

New – AWS IoT Greengrass Adds Container Support and Management of Data Streams at the Edge

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-aws-iot-greengrass-adds-docker-support-and-streams-management-at-the-edge/

AWS IoT Greengrass extends cloud capabilities to edge devices, so that they can respond to local events in near real-time, even with intermittent connectivity.

Today, we are adding two features that make it easier to build IoT solutions:

  • Container support to deploy applications using the Greengrass Docker application deployment connector.
  • Stream Manager for AWS IoT Greengrass to collect, process, and export data streams from edge devices and manage the lifecycle of that data.

Let’s see how these new features work and how to use them.

Deploying a Container-Based Application to a Greengrass Core Device
You can now run AWS Lambda functions and container-based applications in your AWS IoT Greengrass core device. In this way it is easier to migrate applications from on-premises, or build new applications that include dependencies such as libraries, other binaries, and configuration files, using container images. This provides a consistent deployment environment for your applications that enables portability across development environments and edge locations. You can easily deploy legacy and third-party applications by packaging the code or executables into the container images.

To use this feature, I describe my container-based application using a Docker Compose file. I can reference container images in public or private repositories, such as Amazon Elastic Container Registry (ECR) or Docker Hub. To start, I create a simple web app using Python and Flask that counts the number of times it is visited.

from flask import Flask

app = Flask(__name__)

counter = 0

@app.route('/')
def hello():
    global counter
    counter += 1
    return 'Hello World! I have been seen {} times.\n'.format(counter)

My requirements.txt file contains a single dependency, flask.

I build the container image using this Dockerfile and push it to ECR.

FROM python:3.7-alpine
WORKDIR /code
ENV FLASK_APP app.py
ENV FLASK_RUN_HOST 0.0.0.0
COPY requirements.txt requirements.txt
RUN pip install -r requirements.txt
COPY . .
CMD ["flask", "run"]

Here is the docker-compose.yml file referencing the container image in my ECR repository. Docker Compose files can describe applications using multiple containers, but for this example I am using just one.

version: '3'
services:
  web:
    image: "123412341234.dkr.ecr.us-east-1.amazonaws.com/hello-world-counter:latest"
    ports:
      - "80:5000"

I upload the docker-compose.yml file to an Amazon Simple Storage Service (S3) bucket.

Now I create an AWS IoT Greengrass group using an Amazon Elastic Compute Cloud (EC2) instance as core device. Usually your core device is outside of the AWS cloud, but using an EC2 instance can be a good way to set up and automate a dev & test environment for your deployments at the edge.

When the group is ready, I run an “empty” deployment, just to check that everything is working as expected. After a few seconds, my first deployment has completed and I start adding a connector.

In the connector section of the AWS IoT Greengrass group, I select Add a connector and search for “Docker”. I select Docker Application Deployment and hit Next.

Now I configure the parameters for the connector. I select my docker-compose.yml file on S3. The AWS Identity and Access Management (IAM) role used by the AWS IoT Greengrass group needs permissions to get the file from S3 and to get the authorization token and download the image from ECR. If you use a private repository such as Docker Hub, you can leverage the integration with AWS Secrets Manager to make it easy for your connectors and Lambda functions to use local secrets to interact with services and applications.

I deploy my changes, similarly to what I did before. This time, the new container-based application is installed and started on the AWS IoT Greengrass core device.

To test the web app that I deployed, I open access to the HTTP port on the Security Group of the EC2 instance I am using as core device. When I connect with my browser, I see the Flask app starting to count the visits. My container-based application is running on the AWS IoT Greengrass core device!

You can deploy much more complex applications than what I did in this example. Let’s see that as we go through the other feature released today.

Using the Stream Manager for AWS IoT Greengrass
For common use cases like video processing, image recognition, or high-volume data collection from sensors at the edge, you often need to build your own data stream management capabilities. The new Stream Manager simplifies this process by adding a standardized mechanism to the Greengrass Core SDK that you can use to process data streams from IoT devices, manage local data retention policies based on cache size or data age, and automatically transmit data directly into AWS cloud services such as Amazon Kinesis and AWS IoT Analytics.

The Stream Manager also handles disconnected or intermittent connectivity scenarios by adding configurable prioritization, caching policies, bandwidth utilization, and time-outs on a per-stream basis. In situations where connectivity is unpredictable or bandwidth is constrained, this new functionality enables you to define the behavior of your applications’ data management while disconnected, reconnecting, or connected, allowing you to prioritize important data’s path to the cloud and make efficient use of a connection when it is available. Using this feature, you can focus on your specific application use cases rather than building data retention and connection management functionality.

Let’s see now how the Stream Manager works with a practical use case. For example, my AWS IoT Greengrass core device is receiving lots of data from multiple devices. I want to do two things with the data I am collecting:

  • Upload all raw data with low priority to AWS IoT Analytics, where I use Amazon QuickSight to visualize and understand my data.
  • Aggregate data locally based on time and location of the devices, and send the aggregated data with high priority to a Kinesis Data Stream that is processed by a business application for predictive maintenance.

Using the Stream Manager in the Greengrass Core SDK, I create two local data streams:

  • The first local data stream has a configured low-priority export to IoT Analytics and can use up to 256MB of local disk (yes, it’s a constrained device). You can use memory to store the local data stream if you prefer speed to resilience. When local space is filled up, for example because I lost connectivity to the cloud and I continue to cache locally, I can choose to either reject new data or overwrite the oldest data.
  • The second local data stream is exporting data with high priority to a Kinesis Data Stream and can use up to 128MB of local disk (it’s aggregated data, I need less space for the same amount of time).
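
To illustrate the two choices for a full stream (reject new data or overwrite the oldest data), here is a small self-contained Python sketch of a size-bounded local buffer. This is not the Stream Manager API. The class and method names are hypothetical and only model the behavior described above.

```python
from collections import deque


class StreamFullError(Exception):
    """Raised when a message cannot be stored in the local stream."""


class LocalStream:
    """A size-bounded message buffer with a configurable policy when full."""

    def __init__(self, max_bytes, overwrite_oldest):
        self.max_bytes = max_bytes
        self.overwrite_oldest = overwrite_oldest
        self._messages = deque()
        self._used = 0

    def append(self, payload):
        if len(payload) > self.max_bytes:
            raise StreamFullError("message is larger than the whole stream")
        # Make room by dropping the oldest messages, or refuse the new one.
        while self._used + len(payload) > self.max_bytes:
            if not self.overwrite_oldest:
                raise StreamFullError("stream is full, rejecting new data")
            dropped = self._messages.popleft()
            self._used -= len(dropped)
        self._messages.append(payload)
        self._used += len(payload)

    def messages(self):
        return list(self._messages)


# A tiny 10-byte stream that overwrites the oldest data when full.
stream = LocalStream(max_bytes=10, overwrite_oldest=True)
for payload in (b"aaaa", b"bbbb", b"cccc"):
    stream.append(payload)
```

Overwriting keeps the most recent readings during a long disconnection, while rejecting preserves the earliest ones. Which is better depends on whether fresh or complete data matters more for the stream.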

 

Here’s how the data flows in this architecture:

  • Sensor data is collected by a Producer Lambda function that is writing to the first local data stream.
  • A second Aggregator Lambda function is reading from the first local data stream, performing the aggregation, and writing its output to the second local data stream.
  • A Reader container-based app (deployed using the Docker application deployment connector) is rendering the aggregated data in real-time for a display panel.
  • The Stream Manager takes care of the ingestion to the cloud, based on the configuration and the policies of the local data streams, so that developers can focus their efforts on the logic on the device.
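
As a rough idea of what the Aggregator step could do, here is a minimal Python sketch that groups readings by time window and device location and computes an average. The record fields (timestamp, location, value) and the 60-second window are assumptions for illustration. The actual function would read from and write to the local data streams using the Greengrass Core SDK.

```python
from collections import defaultdict


def aggregate(readings, window_seconds=60):
    """Average readings per (time window, location).

    Each reading is a dict with 'timestamp' (seconds), 'location', and 'value'.
    """
    groups = defaultdict(list)
    for reading in readings:
        # Align the timestamp to the start of its window.
        window_start = int(reading["timestamp"] // window_seconds) * window_seconds
        groups[(window_start, reading["location"])].append(reading["value"])
    return [
        {
            "window_start": window_start,
            "location": location,
            "count": len(values),
            "average": sum(values) / len(values),
        }
        for (window_start, location), values in sorted(groups.items())
    ]


# Made-up sensor readings.
summary = aggregate([
    {"timestamp": 0, "location": "a", "value": 1.0},
    {"timestamp": 30, "location": "a", "value": 3.0},
    {"timestamp": 61, "location": "a", "value": 5.0},
    {"timestamp": 10, "location": "b", "value": 2.0},
])
```

Four raw readings become three aggregated records here, which is why the second local data stream can get by with less disk space for the same time span.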

The use of Lambda functions or container-based apps in the previous architecture is just an example. You can mix and match, or standardize to one or the other, depending on your development best practices.

Available Now
The Docker application deployment connector and the Stream Manager are available with Greengrass version 1.10. The Stream Manager is available in the Greengrass Core SDK for Java and Python. We are adding support for other platforms based on customer feedback.

These new features are independent from each other, but can be used together as in my example. They can simplify the way you build and deploy applications on edge devices, making it easier to process data locally and be integrated with streaming and analytics services in the backend. Let me know what you are going to use these features for!

Danilo

New – Using Step Functions to Orchestrate Amazon EMR workloads

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-using-step-functions-to-orchestrate-amazon-emr-workloads/

AWS Step Functions allows you to add serverless workflow automation to your applications. The steps of your workflow can run anywhere, including in AWS Lambda functions, on Amazon Elastic Compute Cloud (EC2), or on-premises. To simplify building workflows, Step Functions is directly integrated with multiple AWS Services: Amazon ECS, AWS Fargate, Amazon DynamoDB, Amazon Simple Notification Service (SNS), Amazon Simple Queue Service (SQS), AWS Batch, AWS Glue, Amazon SageMaker, and (to run nested workflows) with Step Functions itself.

Starting today, Step Functions connects to Amazon EMR, enabling you to create data processing and analysis workflows with minimal code, saving time, and optimizing cluster utilization. For example, building data processing pipelines for machine learning is time consuming and hard. With this new integration, you have a simple way to orchestrate workflow capabilities, including parallel executions and dependencies from the result of a previous step, and handle failures and exceptions when running data processing jobs.

Specifically, a Step Functions state machine can now:

  • Create or terminate an EMR cluster, including the possibility to change the cluster termination protection. In this way, you can reuse an existing EMR cluster for your workflow, or create one on-demand during execution of a workflow.
  • Add or cancel an EMR step for your cluster. Each EMR step is a unit of work that contains instructions to manipulate data for processing by software installed on the cluster, including tools such as Apache Spark, Hive, or Presto.
  • Modify the size of an EMR cluster instance fleet or group, allowing you to manage scaling programmatically depending on the requirements of each step of your workflow. For example, you may increase the size of an instance group before adding a compute-intensive step, and reduce the size just after it has completed.

When you create or terminate a cluster or add an EMR step to a cluster, you can use synchronous integrations to move to the next step of your workflow only when the corresponding activity has completed on the EMR cluster.

Reading the configuration or the state of your EMR clusters is not part of the Step Functions service integration. In case you need that, the EMR List* and Describe* APIs can be accessed using Lambda functions as tasks.
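
For example, a Lambda function used as a task could look roughly like the following sketch, which calls the EMR DescribeCluster API through boto3 and returns the cluster state. The input field name (ClusterId) and the shape of the returned object are my assumptions. The function's execution role would also need permission to call elasticmapreduce:DescribeCluster.

```python
def describe_cluster_state(cluster_id, emr_client=None):
    """Return the current state of an EMR cluster."""
    if emr_client is None:
        # boto3 is available in the AWS Lambda Python runtime.
        import boto3
        emr_client = boto3.client("emr")
    response = emr_client.describe_cluster(ClusterId=cluster_id)
    status = response["Cluster"]["Status"]
    return {"ClusterId": cluster_id, "State": status["State"]}


def lambda_handler(event, context):
    """Entry point: expects the state input to contain a ClusterId field."""
    return describe_cluster_state(event["ClusterId"])
```

The returned State value could then drive a Choice state, for instance to wait while a cluster is still starting.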

Building a Workflow with EMR and Step Functions
On the Step Functions console, I create a new state machine. The console renders it visually, so that it is much easier to understand:

To create the state machine, I use the following definition using the Amazon States Language (ASL):

{
  "StartAt": "Should_Create_Cluster",
  "States": {
    "Should_Create_Cluster": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.CreateCluster",
          "BooleanEquals": true,
          "Next": "Create_A_Cluster"
        },
        {
          "Variable": "$.CreateCluster",
          "BooleanEquals": false,
          "Next": "Enable_Termination_Protection"
        }
      ],
      "Default": "Create_A_Cluster"
    },
    "Create_A_Cluster": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:createCluster.sync",
      "Parameters": {
        "Name": "WorkflowCluster",
        "VisibleToAllUsers": true,
        "ReleaseLabel": "emr-5.28.0",
        "Applications": [{ "Name": "Hive" }],
        "ServiceRole": "EMR_DefaultRole",
        "JobFlowRole": "EMR_EC2_DefaultRole",
        "LogUri": "s3://aws-logs-123412341234-eu-west-1/elasticmapreduce/",
        "Instances": {
          "KeepJobFlowAliveWhenNoSteps": true,
          "InstanceFleets": [
            {
              "InstanceFleetType": "MASTER",
              "TargetOnDemandCapacity": 1,
              "InstanceTypeConfigs": [
                {
                  "InstanceType": "m4.xlarge"
                }
              ]
            },
            {
              "InstanceFleetType": "CORE",
              "TargetOnDemandCapacity": 1,
              "InstanceTypeConfigs": [
                {
                  "InstanceType": "m4.xlarge"
                }
              ]
            }
          ]
        }
      },
      "ResultPath": "$.CreateClusterResult",
      "Next": "Merge_Results"
    },
    "Merge_Results": {
      "Type": "Pass",
      "Parameters": {
        "CreateCluster.$": "$.CreateCluster",
        "TerminateCluster.$": "$.TerminateCluster",
        "ClusterId.$": "$.CreateClusterResult.ClusterId"
      },
      "Next": "Enable_Termination_Protection"
    },
    "Enable_Termination_Protection": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:setClusterTerminationProtection",
      "Parameters": {
        "ClusterId.$": "$.ClusterId",
        "TerminationProtected": true
      },
      "ResultPath": null,
      "Next": "Add_Steps_Parallel"
    },
    "Add_Steps_Parallel": {
      "Type": "Parallel",
      "Branches": [
        {
          "StartAt": "Step_One",
          "States": {
            "Step_One": {
              "Type": "Task",
              "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
              "Parameters": {
                "ClusterId.$": "$.ClusterId",
                "Step": {
                  "Name": "The first step",
                  "ActionOnFailure": "CONTINUE",
                  "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                      "hive-script",
                      "--run-hive-script",
                      "--args",
                      "-f",
                      "s3://eu-west-1.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q",
                      "-d",
                      "INPUT=s3://eu-west-1.elasticmapreduce.samples",
                      "-d",
                      "OUTPUT=s3://MY-BUCKET/MyHiveQueryResults/"
                    ]
                  }
                }
              },
              "End": true
            }
          }
        },
        {
          "StartAt": "Wait_10_Seconds",
          "States": {
            "Wait_10_Seconds": {
              "Type": "Wait",
              "Seconds": 10,
              "Next": "Step_Two (async)"
            },
            "Step_Two (async)": {
              "Type": "Task",
              "Resource": "arn:aws:states:::elasticmapreduce:addStep",
              "Parameters": {
                "ClusterId.$": "$.ClusterId",
                "Step": {
                  "Name": "The second step",
                  "ActionOnFailure": "CONTINUE",
                  "HadoopJarStep": {
                    "Jar": "command-runner.jar",
                    "Args": [
                      "hive-script",
                      "--run-hive-script",
                      "--args",
                      "-f",
                      "s3://eu-west-1.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q",
                      "-d",
                      "INPUT=s3://eu-west-1.elasticmapreduce.samples",
                      "-d",
                      "OUTPUT=s3://MY-BUCKET/MyHiveQueryResults/"
                    ]
                  }
                }
              },
              "ResultPath": "$.AddStepsResult",
              "Next": "Wait_Another_10_Seconds"
            },
            "Wait_Another_10_Seconds": {
              "Type": "Wait",
              "Seconds": 10,
              "Next": "Cancel_Step_Two"
            },
            "Cancel_Step_Two": {
              "Type": "Task",
              "Resource": "arn:aws:states:::elasticmapreduce:cancelStep",
              "Parameters": {
                "ClusterId.$": "$.ClusterId",
                "StepId.$": "$.AddStepsResult.StepId"
              },
              "End": true
            }
          }
        }
      ],
      "ResultPath": null,
      "Next": "Step_Three"
    },
    "Step_Three": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
      "Parameters": {
        "ClusterId.$": "$.ClusterId",
        "Step": {
          "Name": "The third step",
          "ActionOnFailure": "CONTINUE",
          "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
              "hive-script",
              "--run-hive-script",
              "--args",
              "-f",
              "s3://eu-west-1.elasticmapreduce.samples/cloudfront/code/Hive_CloudFront.q",
              "-d",
              "INPUT=s3://eu-west-1.elasticmapreduce.samples",
              "-d",
              "OUTPUT=s3://MY-BUCKET/MyHiveQueryResults/"
            ]
          }
        }
      },
      "ResultPath": null,
      "Next": "Disable_Termination_Protection"
    },
    "Disable_Termination_Protection": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:setClusterTerminationProtection",
      "Parameters": {
        "ClusterId.$": "$.ClusterId",
        "TerminationProtected": false
      },
      "ResultPath": null,
      "Next": "Should_Terminate_Cluster"
    },
    "Should_Terminate_Cluster": {
      "Type": "Choice",
      "Choices": [
        {
          "Variable": "$.TerminateCluster",
          "BooleanEquals": true,
          "Next": "Terminate_Cluster"
        },
        {
          "Variable": "$.TerminateCluster",
          "BooleanEquals": false,
          "Next": "Wrapping_Up"
        }
      ],
      "Default": "Wrapping_Up"
    },
    "Terminate_Cluster": {
      "Type": "Task",
      "Resource": "arn:aws:states:::elasticmapreduce:terminateCluster.sync",
      "Parameters": {
        "ClusterId.$": "$.ClusterId"
      },
      "Next": "Wrapping_Up"
    },
    "Wrapping_Up": {
      "Type": "Pass",
      "End": true
    }
  }
}

I let the Step Functions console create a new AWS Identity and Access Management (IAM) role for the executions of this state machine. The role automatically includes all permissions required to access EMR.

This state machine can either use an existing EMR cluster, or create a new one. I can use the following input to create a new cluster that is terminated at the end of the workflow:

{
  "CreateCluster": true,
  "TerminateCluster": true
}

To use an existing cluster, I need to provide the cluster ID in the input, using this syntax:

{
  "CreateCluster": false,
  "TerminateCluster": false,
  "ClusterId": "j-..."
}
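The two input shapes above can be generated programmatically. The following is a minimal sketch: the helper name build_input and its validation rule are assumptions for illustration, not part of the Step Functions API.

```python
import json


def build_input(create_cluster, terminate_cluster, cluster_id=None):
    """Build the execution input as a JSON string.

    When no cluster is created, the workflow needs the ID of an existing one.
    """
    payload = {
        "CreateCluster": create_cluster,
        "TerminateCluster": terminate_cluster,
    }
    if not create_cluster:
        if not cluster_id:
            raise ValueError("ClusterId is required when CreateCluster is false")
        payload["ClusterId"] = cluster_id
    return json.dumps(payload)
```

The resulting string is the kind of value you would pass as the execution input when starting the state machine from the console, the CLI, or an SDK.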

Let’s see how that works. As the workflow starts, the Should_Create_Cluster Choice state looks into the input to decide if it should enter the Create_A_Cluster state or not. There, I use a synchronous call (elasticmapreduce:createCluster.sync) to wait for the new EMR cluster to reach the WAITING state before progressing to the next workflow state. The AWS Step Functions console shows the resource that is being created with a link to the EMR console:

After that, the Merge_Results Pass state merges the input state with the cluster ID of the newly created cluster to pass it to the next step in the workflow.
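To make the data flow concrete, here is a small local model of the Should_Create_Cluster and Merge_Results states. It is only a simulation of the definition above written as plain functions, under the assumption that the input always contains a CreateCluster field. It does not call Step Functions.

```python
def should_create_cluster(state_input):
    """Mimic the Choice state: return the name of the next state."""
    value = state_input["CreateCluster"]
    if value is True:
        return "Create_A_Cluster"
    if value is False:
        return "Enable_Termination_Protection"
    return "Create_A_Cluster"  # the Default branch, for non-boolean values


def merge_results(state_input):
    """Mimic the Parameters block of the Merge_Results Pass state."""
    return {
        "CreateCluster": state_input["CreateCluster"],
        "TerminateCluster": state_input["TerminateCluster"],
        "ClusterId": state_input["CreateClusterResult"]["ClusterId"],
    }
```

After Merge_Results, the state has the same shape as the input for an existing cluster, so every later state can read $.ClusterId regardless of which path was taken.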

Before starting to process any data, I use the Enable_Termination_Protection state (elasticmapreduce:setClusterTerminationProtection) to help ensure that the EC2 instances in my EMR cluster are not shut down by accident or by mistake.

Now I am ready to do something with the EMR cluster. I have three EMR steps in the workflow. For the sake of simplicity, these steps are all based on this Hive tutorial. For each step, I use Hive’s SQL-like interface to run a query on some sample CloudFront logs and write the results to Amazon Simple Storage Service (S3). In a production use case, you’d probably have a combination of EMR tools processing and analyzing your data in parallel (two or more steps running at the same time) or with some dependencies (the output of one step is required by another step). Let’s try to do something similar.

First I execute Step_One and Step_Two inside a Parallel state:

  • Step_One is running the EMR step synchronously as a job (elasticmapreduce:addStep.sync). That means that the execution waits for the EMR step to be completed (or cancelled) before moving on to the next step in the workflow. You can optionally add a timeout to monitor that the execution of the EMR step happens within an expected time frame.
  • Step_Two is adding an EMR step asynchronously (elasticmapreduce:addStep). In this case, the workflow moves to the next step as soon as EMR replies that the request has been received. After a few seconds, to try another integration, I cancel Step_Two (elasticmapreduce:cancelStep). This integration can be really useful in production use cases. For example, you can cancel an EMR step if you get an error from another step running in parallel that would make it useless to continue with the execution of this step.
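The optional timeout mentioned for Step_One maps to the TimeoutSeconds field of a Task state in the Amazon States Language. The sketch below adds it to a trimmed copy of the state. The helper with_timeout and the 30-minute value are assumptions for illustration, and the Parameters block is omitted for brevity.

```python
import copy
import json


def with_timeout(task_state, seconds):
    """Return a copy of a Task state with TimeoutSeconds set."""
    if seconds <= 0:
        raise ValueError("TimeoutSeconds must be a positive integer")
    state = copy.deepcopy(task_state)
    state["TimeoutSeconds"] = seconds
    return state


# Trimmed version of Step_One; the real state also carries Parameters.
step_one = {
    "Type": "Task",
    "Resource": "arn:aws:states:::elasticmapreduce:addStep.sync",
    "End": True,
}

print(json.dumps(with_timeout(step_one, 1800), indent=2))
```

When the timeout expires, the state fails with the States.Timeout error, which a Retry or Catch field can match.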

After those two steps have both completed and produced their results, I execute Step_Three as a job, similarly to what I did for Step_One. When Step_Three has completed, I enter the Disable_Termination_Protection state, because I am done using the cluster for this workflow.

Depending on the input state, the Should_Terminate_Cluster Choice state is going to enter the Terminate_Cluster state (elasticmapreduce:terminateCluster.sync) and wait for the EMR cluster to terminate, or go straight to the Wrapping_Up state and leave the cluster running.

Finally I have a state for Wrapping_Up. I am not doing much in this final state actually, but you can’t end a workflow from a Choice state.

In the EMR console I see the status of my cluster and of the EMR steps:

Using the AWS Command Line Interface (CLI), I find the results of my query in the S3 bucket configured as output for the EMR steps:

aws s3 ls s3://MY-BUCKET/MyHiveQueryResults/
...

Based on my input, the EMR cluster is still running at the end of this workflow execution. I follow the resource link in the Create_A_Cluster step to go to the EMR console and terminate it. In case you are following along with this demo, be careful to not leave your EMR cluster running if you don’t need it.

Available Now
Step Functions integration with EMR is available in all regions. There is no additional cost for using this feature on top of the usual Step Functions and EMR pricing.

You can now use Step Functions to quickly build complex workflows for executing EMR jobs. A workflow can include parallel executions, dependencies, and exception handling. Step Functions makes it easy to retry failed jobs and terminate workflows after critical errors, because you can specify what happens when something goes wrong. Let me know what you are going to use this feature for!
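As an illustration of retrying failed jobs, a Task state can carry a Retry field with exponential backoff. The policy values below are arbitrary assumptions, and retry_delays is a hypothetical helper that only computes the waits such a policy implies.

```python
def retry_delays(interval_seconds, max_attempts, backoff_rate):
    """Seconds waited before each retry attempt under exponential backoff."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


# One entry of the "Retry" list of a Task state such as Step_Three.
retry_policy = {
    "ErrorEquals": ["States.TaskFailed"],
    "IntervalSeconds": 30,
    "MaxAttempts": 3,
    "BackoffRate": 2.0,
}

delays = retry_delays(
    retry_policy["IntervalSeconds"],
    retry_policy["MaxAttempts"],
    retry_policy["BackoffRate"],
)
```

With these values, the waits before the three retries are 30, 60, and 120 seconds.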

Danilo

New – Insert, Update, Delete Data on S3 with Amazon EMR and Apache Hudi

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-insert-update-delete-data-on-s3-with-amazon-emr-and-apache-hudi/

Storing your data in Amazon S3 provides lots of benefits in terms of scale, reliability, and cost effectiveness. On top of that, you can leverage Amazon EMR to process and analyze your data using open source tools like Apache Spark, Hive, and Presto. As powerful as these tools are, it can still be challenging to deal with use cases where you need to do incremental data processing, and record-level insert, update, and delete.

Talking with customers, we found that there are use cases that need to handle incremental changes to individual records, for example:

  • Complying with data privacy regulations, where their users choose to exercise their right to be forgotten, or change their consent as to how their data can be used.
  • Working with streaming data, when you have to handle specific data insertion and update events.
  • Using change data capture (CDC) architectures to track and ingest database change logs from enterprise data warehouses or operational data stores.
  • Reinstating late arriving data, or analyzing data as of a specific point in time.

Starting today, EMR release 5.28.0 includes Apache Hudi (incubating), so that you no longer need to build custom solutions to perform record-level insert, update, and delete operations. Hudi development started at Uber in 2016 to address inefficiencies across ingest and ETL pipelines. In recent months, the EMR team has worked closely with the Apache Hudi community, contributing patches that include updating Hudi to Spark 2.4.4 (HUDI-12), supporting Spark Avro (HUDI-91), adding support for AWS Glue Data Catalog (HUDI-306), as well as multiple bug fixes.

Using Hudi, you can perform record-level inserts, updates, and deletes on S3. This allows you to comply with data privacy laws, consume real-time streams and change data captures, reinstate late-arriving data, and track history and rollbacks in an open, vendor-neutral format. You create datasets and tables, and Hudi manages the underlying data format. Hudi uses Apache Parquet and Apache Avro for data storage, and includes built-in integrations with Spark, Hive, and Presto, enabling you to query Hudi datasets using the same tools that you use today, with near real-time access to fresh data.

When launching an EMR cluster, the libraries and tools for Hudi are installed and configured automatically any time at least one of the following components is selected: Hive, Spark, or Presto. You can use Spark to create new Hudi datasets, and insert, update, and delete data. Each Hudi dataset is registered in your cluster’s configured metastore (including the AWS Glue Data Catalog), and appears as a table that can be queried using Spark, Hive, and Presto.

Hudi supports two storage types that define how data is written, indexed, and read from S3:

  • Copy on Write – data is stored in columnar format (Parquet) and updates create a new version of the files during writes. This storage type is best used for read-heavy workloads, because the latest version of the dataset is always available in efficient columnar files.
  • Merge on Read – data is stored with a combination of columnar (Parquet) and row-based (Avro) formats; updates are logged to row-based “delta files” and compacted later, creating a new version of the columnar files. This storage type is best used for write-heavy workloads, because new commits are written quickly as delta files, but reading the dataset requires merging the compacted columnar files with the delta files.
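The trade-off between the two storage types can be illustrated with a toy model. This is not how Hudi lays out files; it is only a sketch in which a Python dict stands in for one columnar file and a list stands in for a delta file, so that the amount of data written per update can be compared.

```python
def copy_on_write_update(base_file, key, value):
    """Rewrite the whole file with the change applied.

    Returns the new file version and the number of rows written.
    """
    new_file = dict(base_file)
    new_file[key] = value
    return new_file, len(new_file)


def merge_on_read_update(delta_log, key, value):
    """Append a single change to the delta log.

    Returns the new log and the number of rows written.
    """
    return delta_log + [(key, value)], 1


base = {f"k{i}": i for i in range(1000)}
_, cow_rows = copy_on_write_update(base, "k7", -1)
_, mor_rows = merge_on_read_update([], "k7", -1)
```

In this model a single-row update writes 1000 rows under Copy on Write and one row under Merge on Read, at the price of extra work when the delta log is merged at read time.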

Let’s do a quick overview of how you can set up and use Hudi datasets in an EMR cluster.

Using Apache Hudi with Amazon EMR
I start creating a cluster from the EMR console. In the advanced options I select EMR release 5.28.0 (the first including Hudi) and the following applications: Spark, Hive, and Tez. In the hardware options, I add 3 task nodes to ensure I have enough capacity to run both Spark and Hive.

When the cluster is ready, I use the key pair I selected in the security options to SSH into the master node and access the Spark Shell. I use the following command to start the Spark Shell to use it with Hudi:

$ spark-shell --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
              --conf "spark.sql.hive.convertMetastoreParquet=false" \
              --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar

There, I use the following Scala code to import some sample ELB logs in a Hudi dataset using the Copy on Write storage type:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor

//Set up various input values as variables
val inputDataPath = "s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1/"
val hudiTableName = "elb_logs_hudi_cow"
val hudiTablePath = "s3://MY-BUCKET/PATH/" + hudiTableName

// Set up our Hudi Data Source Options
val hudiOptions = Map[String,String](
    DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "request_ip",
    DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "request_verb", 
    HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
    DataSourceWriteOptions.OPERATION_OPT_KEY ->
        DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
    DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "request_timestamp", 
    DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true", 
    DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> hudiTableName, 
    DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "request_verb", 
    DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY -> "false", 
    DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY ->
        classOf[MultiPartKeysValueExtractor].getName)

// Read data from S3 and create a DataFrame with Partition and Record Key
val inputDF = spark.read.format("parquet").load(inputDataPath)

// Write data into the Hudi dataset
inputDF.write
       .format("org.apache.hudi")
       .options(hudiOptions)
       .mode(SaveMode.Overwrite)
       .save(hudiTablePath)

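In the Spark Shell, I can now read the Hudi dataset back into a new DataFrame and count its records:

scala> val inputDF2 = spark.read.format("org.apache.hudi").load(hudiTablePath + "/*/*")
...
scala> inputDF2.count()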
res1: Long = 10491958

In the options, I used the integration with the Hive metastore configured for the cluster, so that the table is created in the default database. In this way, I can use Hive to query the data in the Hudi dataset:

hive> use default;
hive> select count(*) from elb_logs_hudi_cow;
...
OK
10491958
...

I can now update or delete a single record in the dataset. In the Spark Shell, I prepare some variables to find the record I want to update, and a SQL statement to select the value of the column I want to change:

val requestIpToUpdate = "243.80.62.181"
val sqlStatement = s"SELECT elb_name FROM elb_logs_hudi_cow WHERE request_ip = '$requestIpToUpdate'"

I execute the SQL statement to see the current value of the column:

scala> spark.sql(sqlStatement).show()
+------------+                                                                  
|    elb_name|
+------------+
|elb_demo_003|
+------------+

Then, I select and update the record:

// Create a DataFrame with a single record and update column value
val updateDF = inputDF.filter(col("request_ip") === requestIpToUpdate)
                      .withColumn("elb_name", lit("elb_demo_001"))

Now I update the Hudi dataset with a syntax similar to the one I used to create it. But this time, the DataFrame I am writing contains only one record:

// Write the DataFrame as an update to existing Hudi dataset
updateDF.write
        .format("org.apache.hudi")
        .options(hudiOptions)
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
                DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
        .mode(SaveMode.Append)
        .save(hudiTablePath)
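Conceptually, an upsert like the one above matches incoming records to stored ones by record key and uses the precombine field to pick a winner among duplicates in the same batch. The following is a simplified model of that behavior in plain Python, not Hudi's implementation; the function name and the dict-based table are assumptions for illustration.

```python
def upsert(table, batch, key_field, precombine_field):
    """Apply a batch of records to a table keyed by key_field.

    Duplicates inside the batch are reduced to the record with the largest
    precombine value; the survivor then replaces any stored record with the
    same key, and records with new keys are inserted.
    """
    latest = {}
    for record in batch:
        key = record[key_field]
        current = latest.get(key)
        if current is None or record[precombine_field] >= current[precombine_field]:
            latest[key] = record
    merged = dict(table)
    merged.update(latest)
    return merged
```

In the Hudi options used in this post, request_ip plays the role of key_field and request_timestamp the role of precombine_field.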

In the Spark Shell, I check the result of the update:

scala> spark.sql(sqlStatement).show()
+------------+                                                                  
|    elb_name|
+------------+
|elb_demo_001|
+------------+

Now I want to delete the same record. To delete it, I pass the EmptyHoodieRecordPayload payload in the write options:

// Write the DataFrame with an EmptyHoodieRecordPayload for deleting a record
updateDF.write
        .format("org.apache.hudi")
        .options(hudiOptions)
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
                DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
        .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY,
                "org.apache.hudi.EmptyHoodieRecordPayload")
        .mode(SaveMode.Append)
        .save(hudiTablePath)

In the Spark Shell, I see that the record is no longer available:

scala> spark.sql(sqlStatement).show()
+--------+                                                                      
|elb_name|
+--------+
+--------+

How are all those updates and deletes managed by Hudi? Let’s use the Hudi Command Line Interface (CLI) to connect to the dataset and see how those changes are interpreted as commits:

This dataset is a Copy on Write dataset, which means that each time there is an update to a record, the file that contains that record is rewritten to contain the updated values. You can see how many records have been written for each commit. The bottom line of the table describes the initial creation of the dataset, above it there is the single-record update, and at the top the single-record delete.

With Hudi, you can roll back to each commit. For example, I can roll back the delete operation with:

hudi:elb_logs_hudi_cow->commit rollback --commit 20191104121031

In the Spark Shell, the record is now back to where it was, just after the update:

scala> spark.sql(sqlStatement).show()
+------------+                                                                  
|    elb_name|
+------------+
|elb_demo_001|
+------------+

Copy on Write is the default storage type. I can repeat the steps above to create and update a Merge on Read dataset by adding this entry to the hudiOptions map:

DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "MERGE_ON_READ"

If you update a Merge on Read dataset and look at the commits with the Hudi CLI, you can see how different Merge on Read is compared to Copy on Write. With Merge on Read, you are only writing the updated rows and not whole files as with Copy on Write. This is why Merge on Read is helpful for write-heavy or update/delete-heavy workloads with comparatively few reads. Delta commits are written to disk as Avro records (row-based storage), and compacted data is written as Parquet files (columnar storage). To avoid creating too many delta files, Hudi will automatically compact your dataset so that your reads are as performant as possible.

When a Merge On Read dataset is created, two Hive tables are created:

  • The first table matches the name of the dataset.
  • The second table has the characters _rt appended to its name; the _rt postfix stands for real-time.

When queried, the first table returns the data that has been compacted, and does not show the latest delta commits. Using this table provides the best performance, but omits the freshest data. Querying the real-time table merges the compacted data with the delta commits on read, which is why this storage type is called “Merge on Read”. This makes the freshest data available, but incurs a performance overhead compared with querying only the compacted data. In this way, data engineers and analysts have the flexibility to choose between performance and data freshness.
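The difference between the two tables can be sketched with a small in-memory model. The class below is purely illustrative and does not use any Hudi API: a dict stands in for the compacted Parquet files and a list for the Avro delta files.

```python
class MergeOnReadTable:
    """Toy model of the two views of a Merge on Read dataset."""

    def __init__(self, rows):
        self.compacted = dict(rows)  # stands in for the columnar files
        self.delta_log = []          # stands in for the row-based delta files

    def upsert(self, key, value):
        """Record a change in the delta log without touching compacted data."""
        self.delta_log.append((key, value))

    def read_optimized(self):
        """View of the first table: compacted data only."""
        return dict(self.compacted)

    def real_time(self):
        """View of the _rt table: compacted data merged with delta commits."""
        view = dict(self.compacted)
        for key, value in self.delta_log:
            view[key] = value
        return view

    def compact(self):
        """Fold the delta log into a new version of the compacted data."""
        self.compacted = self.real_time()
        self.delta_log = []
```

Until compaction runs, only the real-time view reflects the latest upserts; afterwards both views return the same data.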

Available Now
This new feature is available now in all regions with EMR 5.28.0. There is no additional cost in using Hudi with EMR. You can learn more about Hudi in the EMR documentation. This new tool can simplify the way you process, update, and delete data in S3. Let me know which use cases you are going to use it for!

Danilo

New – Import Existing Resources into a CloudFormation Stack

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-import-existing-resources-into-a-cloudformation-stack/

With AWS CloudFormation, you can model your entire infrastructure with text files. In this way, you can treat your infrastructure as code and apply software development best practices, such as putting it under version control, or reviewing architectural changes with your team before deployment.

Sometimes AWS resources initially created using the console or the AWS Command Line Interface (CLI) need to be managed using CloudFormation. For example, you (or a different team) may create an IAM role, a Virtual Private Cloud, or an RDS database in the early stages of a migration, and then you have to spend time to include them in the same stack as the final application. In such cases, you often end up recreating the resources from scratch using CloudFormation, and then migrating configuration and data from the original resource.

To make these steps easier for our customers, you can now import existing resources into a CloudFormation stack!

It was already possible to remove resources from a stack without deleting them by setting the DeletionPolicy to Retain. This, together with the new import operation, enables a new range of possibilities. For example, you are now able to:

  • Create a new stack importing existing resources.
  • Import existing resources in an already created stack.
  • Migrate resources across stacks.
  • Remediate a detected drift.
  • Refactor nested stacks by deleting children stacks from one parent and then importing them into another parent stack.

To import existing resources into a CloudFormation stack, you need to provide:

  • A template that describes the entire stack, including both the resources to import and (for existing stacks) the resources that are already part of the stack.
  • A DeletionPolicy attribute in the template for each resource to import. This makes it easy to revert the operation safely.
  • A unique identifier for each target resource, for example the name of the Amazon DynamoDB table or of the Amazon Simple Storage Service (S3) bucket you want to import.
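Before starting an import, the DeletionPolicy requirement can be checked locally. This is a minimal sketch that assumes the template has already been parsed into a Python dict (for example from its JSON form); the helper name is hypothetical.

```python
def missing_deletion_policy(template, logical_ids):
    """Return the logical IDs to import that lack a DeletionPolicy attribute."""
    resources = template["Resources"]
    return [
        logical_id
        for logical_id in logical_ids
        if "DeletionPolicy" not in resources[logical_id]
    ]


template = {
    "Resources": {
        "ImportedTable": {"Type": "AWS::DynamoDB::Table", "DeletionPolicy": "Retain"},
        "ImportedBucket": {"Type": "AWS::S3::Bucket"},
    }
}

problems = missing_deletion_policy(template, ["ImportedTable", "ImportedBucket"])
```

Here problems contains only ImportedBucket, which would need a DeletionPolicy before it can be imported.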

During the resource import operation, CloudFormation checks that:

  • The imported resources do not already belong to another stack in the same region (be careful with global resources such as IAM roles).
  • The target resources exist and you have sufficient permissions to perform the operation.
  • The properties and configuration values are valid against the resource type schema, which defines the required properties, the acceptable properties, and their supported values.

The resource import operation does not check that the template configuration and the actual configuration are the same. Since the import operation supports the same resource types as drift detection, I recommend running drift detection after importing resources in a stack.

Importing Existing Resources into a New Stack
In my AWS account, I have an S3 bucket and a DynamoDB table, both with some data inside, and I’d like to manage them using CloudFormation. In the CloudFormation console, I have two new options:

  • I can create a new stack importing existing resources.
  • I can import resources into an existing stack.

In this case, I want to start from scratch, so I create a new stack. The next step is to provide a template with the resources to import.

I upload the following template with two resources to import: a DynamoDB table and an S3 bucket.

AWSTemplateFormatVersion: "2010-09-09"
Description: Import test
Resources:

  ImportedTable:
    Type: AWS::DynamoDB::Table
    DeletionPolicy: Retain
    Properties: 
      BillingMode: PAY_PER_REQUEST
      AttributeDefinitions: 
        - AttributeName: id
          AttributeType: S
      KeySchema: 
        - AttributeName: id
          KeyType: HASH

  ImportedBucket:
    Type: AWS::S3::Bucket
    DeletionPolicy: Retain

In this template I am setting DeletionPolicy to Retain for both resources. In this way, if I remove them from the stack, they will not be deleted. This is a good option for resources that contain data you don’t want to delete by mistake, or that you may want to move to a different stack in the future. It is mandatory for imported resources to have a deletion policy set, so you can safely and easily revert the operation, and be protected from mistakenly deleting resources that were imported by someone else.

I now have to provide an identifier to map the logical IDs in the template with the existing resources. In this case, I use the DynamoDB table name and the S3 bucket name. For other resource types, there may be multiple ways to identify them and you can select which property to use in the drop-down menus.
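Outside the console, the same mapping is expressed as a list of resources to import on a change set of type IMPORT. The sketch below builds that list for the two resources in the template. The helper names, the change set name, and the DynamoDB table name used in the usage note are assumptions; boto3 is imported inside the function that would actually call AWS.

```python
def resources_to_import(mapping):
    """Build the ResourcesToImport list.

    mapping: logical ID -> (resource type, identifier dict).
    """
    return [
        {
            "ResourceType": resource_type,
            "LogicalResourceId": logical_id,
            "ResourceIdentifier": identifier,
        }
        for logical_id, (resource_type, identifier) in mapping.items()
    ]


def create_import_change_set(stack_name, template_body, mapping):
    """Create an IMPORT change set (requires boto3 and AWS credentials)."""
    import boto3

    cloudformation = boto3.client("cloudformation")
    return cloudformation.create_change_set(
        StackName=stack_name,
        ChangeSetName="ImportChangeSet",  # hypothetical name
        ChangeSetType="IMPORT",
        ResourcesToImport=resources_to_import(mapping),
        TemplateBody=template_body,
    )
```

For the template above, the mapping could look like {"ImportedTable": ("AWS::DynamoDB::Table", {"TableName": "my-table"}), "ImportedBucket": ("AWS::S3::Bucket", {"BucketName": "my-bucket"})}, where both names are placeholders for your own resources.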

In the final recap, I review changes before applying them. Here I check that I’m targeting the right resources to import with the right identifiers. This is actually a CloudFormation Change Set that will be executed when I import the resources.

When importing resources into an existing stack, no changes are allowed to the existing resources of the stack. The import operation will only allow the Change Set action of Import. Changes to parameters are allowed as long as they don’t cause changes to resolved values of properties in existing resources. You can change the template for existing resources to replace hard coded values with a Ref to a resource being imported. For example, you may have a stack with an EC2 instance using an existing IAM role that was created using the console. You can now import the IAM role into the stack and replace in the template the hard coded value used by the EC2 instance with a Ref to the role.

Moving on, each resource has its corresponding import events in the CloudFormation console.

When the import is complete, in the Resources tab, I see that the S3 bucket and the DynamoDB table are now part of the stack.

To be sure the imported resources are in sync with the stack template, I use drift detection.

All stack-level tags, including automatically created tags, are propagated to resources that CloudFormation supports. For example, I can use the AWS CLI to get the tag set associated with the S3 bucket I just imported into my stack. Those tags give me the CloudFormation stack name and ID, and the logical ID of the resource in the stack template:

$ aws s3api get-bucket-tagging --bucket danilop-toimport

{
  "TagSet": [
    {
      "Key": "aws:cloudformation:stack-name",
      "Value": "imported-stack"
    },
    {
      "Key": "aws:cloudformation:stack-id",
      "Value": "arn:aws:cloudformation:eu-west-1:123412341234:stack/imported-stack/..."
    },
    {
      "Key": "aws:cloudformation:logical-id",
      "Value": "ImportedBucket"
    }
  ]
}

Available Now
You can use the new CloudFormation import operation via the console, AWS Command Line Interface (CLI), or AWS SDKs, in the following regions: US East (Ohio), US East (N. Virginia), US West (N. California), US West (Oregon), Canada (Central), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Singapore), Asia Pacific (Sydney), Asia Pacific (Tokyo), EU (Frankfurt), EU (Ireland), EU (London), EU (Paris), and South America (São Paulo).

It is now simpler to manage your infrastructure as code. You can learn more about bringing existing resources into CloudFormation management in the documentation.

Danilo

New – Step Functions Support for Dynamic Parallelism

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-step-functions-support-for-dynamic-parallelism/

Microservices make applications easier to scale and faster to develop, but coordinating the components of a distributed application can be a daunting task. AWS Step Functions is a fully managed service that makes coordinating tasks easier by letting you design and run workflows that are made of steps, each step receiving as input the output of the previous step. For example, Novartis Institutes for Biomedical Research is using Step Functions to empower scientists to run image analysis without depending on cluster experts.

Step Functions added some very interesting capabilities recently, such as callback patterns, to simplify the integration of human activities and third-party services, and nested workflows, to assemble together modular, reusable workflows. Today, we are adding support for dynamic parallelism within a workflow!

How Dynamic Parallelism Works
State machines are defined using the Amazon States Language, a JSON-based, structured language. The Parallel state can be used to execute in parallel a fixed number of branches defined in the state machine. Now, Step Functions supports a new Map state type for dynamic parallelism.

To configure a Map state, you define an Iterator, which is a complete sub-workflow. When a Step Functions execution enters a Map state, it will iterate over a JSON array in the state input. For each item, the Map state will execute one sub-workflow, potentially in parallel. When all sub-workflow executions complete, the Map state will return an array containing the output for each item processed by the Iterator.

You can configure an upper bound on how many concurrent sub-workflows Map executes by adding the MaxConcurrency field. The default value is 0, which places no limit on parallelism, so iterations are invoked as concurrently as possible. A MaxConcurrency value of 1 invokes the Iterator one element at a time, in the order of their appearance in the input state, and does not start an iteration until the previous iteration has completed execution.
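To make the Map semantics more concrete, here is a minimal Python sketch that imitates them locally with a thread pool. It is not how Step Functions is implemented. The names run_map_state and check_item are hypothetical, and the sketch only assumes what is described above: one sub-workflow per array item, an optional concurrency cap, and an output array in the same order as the input.

```python
from concurrent.futures import ThreadPoolExecutor


def run_map_state(items, iterator, max_concurrency=0):
    """Apply `iterator` to every item and return the outputs in input order.

    max_concurrency=0 imitates "no limit" by using one worker per item.
    max_concurrency=1 processes the items one at a time, in order.
    """
    if not items:
        return []
    workers = len(items) if max_concurrency == 0 else max_concurrency
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map yields results in the order of the inputs,
        # regardless of which worker finishes first.
        return list(pool.map(iterator, items))


def check_item(item):
    # Hypothetical sub-workflow: compute the line total for one order item.
    return {"productName": item["productName"],
            "total": item["price"] * item["quantity"]}


order_items = [
    {"productName": "Domain-Driven Design", "price": 32.0, "quantity": 1},
    {"productName": "Ground Coffee, Dark Roast", "price": 8.0, "quantity": 6},
]
results = run_map_state(order_items, check_item, max_concurrency=3)
print(results)
```

As in a Map state, an exception raised for any item surfaces to the caller when the results are collected.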

One way to use the new Map state is to leverage fan-out or scatter-gather messaging patterns in your workflows:

  • Fan-out is applied when delivering a message to multiple destinations, and can be useful in workflows such as order processing or batch data processing. For example, you can retrieve arrays of messages from Amazon SQS and Map will send each message to a separate AWS Lambda function.
  • Scatter-gather broadcasts a single message to multiple destinations (scatter) and then aggregates the responses back for the next steps (gather). This can be useful in file processing and test automation. For example, you can transcode ten 500 MB media files in parallel and then join to create a 5 GB file.

Like Parallel and Task states, Map supports Retry and Catch fields to handle service and custom exceptions. You can also apply Retry and Catch to states inside your Iterator to handle exceptions. If any Iterator execution fails, because of an unhandled error or by transitioning to a Fail state, the entire Map state is considered to have failed and all its iterations are stopped. If the error is not handled by the Map state itself, Step Functions stops the workflow execution with an error.

Using the Map State
Let’s build a workflow to process an order and, by using the Map state, to work on the items in the order in parallel. The tasks executed as part of this workflow are all Lambda functions, but with Step Functions you can use other AWS service integrations and have code running on EC2 instances, containers, or on-premises infrastructure.

Here’s our sample order, expressed as a JSON document, for a few books, plus some coffee to drink while reading them. The order has a detail section, where there is a list of items that are part of the order.

{
  "orderId": "12345678",
  "orderDate": "20190820101213",
  "detail": {
    "customerId": "1234",
    "deliveryAddress": "123, Seattle, WA",
    "deliverySpeed": "1-day",
    "paymentMethod": "aCreditCard",
    "items": [
      {
        "productName": "Agile Software Development",
        "category": "book",
        "price": 60.0,
        "quantity": 1
      },
      {
        "productName": "Domain-Driven Design",
        "category": "book",
        "price": 32.0,
        "quantity": 1
      },
      {
        "productName": "The Mythical Man Month",
        "category": "book",
        "price": 18.0,
        "quantity": 1
      },
      {
        "productName": "The Art of Computer Programming",
        "category": "book",
        "price": 180.0,
        "quantity": 1
      },
      {
        "productName": "Ground Coffee, Dark Roast",
        "category": "grocery",
        "price": 8.0,
        "quantity": 6
      }
    ]
  }
}

To process this order, I am using a state machine defining how the different tasks should be executed. The Step Functions console creates a visual representation of the workflow I am building:

  • First, I validate and check the payment.
  • Then, I process the items in the order, potentially in parallel, to check their availability, prepare for delivery, and start the delivery process.
  • At the end, a summary of the order is sent to the customer.
  • In case the payment check fails, I intercept that, for example to send a notification to the customer.

 

Here is the same state machine definition expressed as a JSON document. The ProcessAllItems state is using Map to process items in the order in parallel. In this case, I limit concurrency to 3 using the MaxConcurrency field. Inside the Iterator, I can put a sub-workflow of arbitrary complexity. In this case, I have three steps, to CheckAvailability, PrepareForDelivery, and StartDelivery of the item. Each of these steps can Retry and Catch errors to make the sub-workflow execution more reliable, for example in case of integrations with external services.

{
  "StartAt": "ValidatePayment",
  "States": {
    "ValidatePayment": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-west-2:123456789012:function:validatePayment",
      "Next": "CheckPayment"
    },
    "CheckPayment": {
      "Type": "Choice",
      "Choices": [
        {
          "Not": {
            "Variable": "$.payment",
            "StringEquals": "Ok"
          },
          "Next": "PaymentFailed"
        }
      ],
      "Default": "ProcessAllItems"
    },
    "PaymentFailed": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-west-2:123456789012:function:paymentFailed",
      "End": true
    },
    "ProcessAllItems": {
      "Type": "Map",
      "InputPath": "$.detail",
      "ItemsPath": "$.items",
      "MaxConcurrency": 3,
      "Iterator": {
        "StartAt": "CheckAvailability",
        "States": {
          "CheckAvailability": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-west-2:123456789012:function:checkAvailability",
            "Retry": [
              {
                "ErrorEquals": [
                  "TimeOut"
                ],
                "IntervalSeconds": 1,
                "BackoffRate": 2,
                "MaxAttempts": 3
              }
            ],
            "Next": "PrepareForDelivery"
          },
          "PrepareForDelivery": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-west-2:123456789012:function:prepareForDelivery",
            "Next": "StartDelivery"
          },
          "StartDelivery": {
            "Type": "Task",
            "Resource": "arn:aws:lambda:us-west-2:123456789012:function:startDelivery",
            "End": true
          }
        }
      },
      "ResultPath": "$.detail.processedItems",
      "Next": "SendOrderSummary"
    },
    "SendOrderSummary": {
      "Type": "Task",
      "InputPath": "$.detail.processedItems",
      "Resource": "arn:aws:lambda:us-west-2:123456789012:function:sendOrderSummary",
      "ResultPath": "$.detail.summary",
      "End": true
    }
  }
}
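The Retry block on CheckAvailability sets IntervalSeconds, BackoffRate, and MaxAttempts. As a rough illustration of how such settings turn into waiting times, here is a small Python sketch. The helpers retry_delays, call_with_retry, and flaky_check are hypothetical, and the sketch assumes the usual exponential backoff reading: the first retry waits IntervalSeconds, each later wait is multiplied by BackoffRate, and at most MaxAttempts retries follow the initial attempt.

```python
import time


def retry_delays(interval_seconds, backoff_rate, max_attempts):
    """Seconds to wait before each retry, growing by backoff_rate every time."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


def call_with_retry(task, retryable, interval_seconds=1, backoff_rate=2,
                    max_attempts=3, sleep=time.sleep):
    """Run task(); on a retryable exception wait, then try again."""
    for delay in retry_delays(interval_seconds, backoff_rate, max_attempts):
        try:
            return task()
        except retryable:
            sleep(delay)
    # Retries are used up: a failure on this last attempt propagates.
    return task()


attempts = {"count": 0}


def flaky_check():
    # Hypothetical task that times out twice before succeeding.
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise TimeoutError("simulated timeout")
    return "available"


waits = []  # record the waits instead of actually sleeping
result = call_with_retry(flaky_check, TimeoutError, sleep=waits.append)
print(result, waits)
```

With the values used in the state machine (1, 2, and 3), the waits before the three possible retries would be 1, 2, and 4 seconds.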

The Lambda functions used by this workflow are not aware of the overall structure of the order JSON document. They just need to know the part of the input state they are going to process. This is a best practice to make those functions easily reusable in multiple workflows. The state machine definition is manipulating the path used for the input and the output of the functions using JsonPath syntax via the InputPath, ItemsPath, ResultPath, and OutputPath fields:

  • InputPath is used to filter the data in the input state, for example to only pass the detail of the order to the Iterator.
  • ItemsPath is specific to the Map state and is used to identify where, in the input, the array field to process is found, for example to process the items inside the detail of the order.
  • ResultPath makes it possible to add the output of a task to the input state, and not overwrite it completely, for example to add a summary to the detail of the order.
  • I am not using OutputPath this time, but it could be useful to filter out unwanted information and pass only the portion of JSON that you care about to the next state. For example, to send as output only the detail of the order.
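The effect of these fields on the ProcessAllItems state can be imitated with plain dictionaries. The following Python sketch is only an approximation: get_path and set_path are hypothetical helpers that understand simple dotted paths rather than full JsonPath, and the Iterator is replaced by a stand-in that returns the product name.

```python
import copy


def get_path(doc, path):
    """Resolve a simple dotted path such as "$.detail.items" ("$" is the root)."""
    node = doc
    for key in path.split(".")[1:]:
        node = node[key]
    return node


def set_path(doc, path, value):
    """Return a copy of `doc` with `value` stored at `path`."""
    keys = path.split(".")[1:]
    if not keys:  # "$" replaces the whole document
        return value
    result = copy.deepcopy(doc)
    node = result
    for key in keys[:-1]:
        node = node.setdefault(key, {})
    node[keys[-1]] = value
    return result


order = {
    "orderId": "12345678",
    "detail": {
        "deliveryAddress": "123, Seattle, WA",
        "items": [
            {"productName": "Domain-Driven Design", "quantity": 1},
            {"productName": "Ground Coffee, Dark Roast", "quantity": 6},
        ],
    },
}

effective_input = get_path(order, "$.detail")        # InputPath
items = get_path(effective_input, "$.items")         # ItemsPath
processed = [item["productName"] for item in items]  # stand-in for the Iterator
# ResultPath adds the result to the original state input instead of replacing it.
output = set_path(order, "$.detail.processedItems", processed)
print(output["detail"]["processedItems"])
```

The output still contains orderId and the original detail, which is why SendOrderSummary can read $.detail.processedItems from it.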

Optionally, the Parameters field may be used to customize the raw input used for each iteration. For example, the deliveryAddress is in the detail of the order, but not in each item. To give the Iterator the index of each item, and access to the deliveryAddress, I can add this to a Map state:

"Parameters": {
  "index.$": "$$.Map.Item.Index",
  "item.$": "$$.Map.Item.Value",
  "deliveryAddress.$": "$.deliveryAddress"
}
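To see what each iteration would receive with this Parameters block, here is a short Python sketch. The function iteration_inputs is a hypothetical stand-in for what the Map state does, and it assumes that the item index starts at zero.

```python
def iteration_inputs(effective_input):
    """Build the input of each iteration, mirroring the Parameters block above."""
    return [
        {
            "index": index,  # $$.Map.Item.Index
            "item": item,    # $$.Map.Item.Value
            "deliveryAddress": effective_input["deliveryAddress"],  # $.deliveryAddress
        }
        for index, item in enumerate(effective_input["items"])
    ]


detail = {
    "deliveryAddress": "123, Seattle, WA",
    "items": [
        {"productName": "Domain-Driven Design"},
        {"productName": "Ground Coffee, Dark Roast"},
    ],
}
inputs = iteration_inputs(detail)
print(inputs[1])
```

Every iteration gets the same deliveryAddress, copied from the detail of the order, next to its own item.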

Available Now
This new feature is available today in all regions where Step Functions is offered. Dynamic parallelism was probably the most requested feature for Step Functions. It unblocks the implementation of new use cases and can help optimize existing ones. Let us know what you are going to use it for!

NoSQL Workbench for Amazon DynamoDB – Available in Preview

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/nosql-workbench-for-amazon-dynamodb-available-in-preview/

I am always impressed by the flexibility of Amazon DynamoDB, providing our customers a fully-managed key-value and document database that can easily scale from a few requests per month to millions of requests per second.

The DynamoDB team released so many great features recently, from on-demand capacity, to support for native ACID transactions. Here’s a great recap of other recent DynamoDB announcements such as global tables, point-in-time recovery, and instant adaptive capacity. DynamoDB now encrypts all customer data at rest by default.

However, switching mindset from a relational database to NoSQL is not that easy. Last year we had two amazing talks at re:Invent that can help you understand how DynamoDB works, and how you can use it for your use cases:

To help you even further, we are introducing today in preview NoSQL Workbench for Amazon DynamoDB, a free, client-side application available for Windows and macOS to help you design and visualize your data model, run queries on your data, and generate the code for your application!

The three main capabilities provided by the NoSQL Workbench are:

  • Data modeler: to build new data models, adding tables and indexes, or to import, modify, and export existing data models.
  • Visualizer: to visualize data models based on their application access patterns, with sample data that you can add manually or import via a SQL query.
  • Operation builder: to define and execute data-plane operations or generate ready-to-use sample code for them.

To see how this new tool can simplify working with DynamoDB, let’s build an application to retrieve information on customers and their orders.

Using the NoSQL Workbench
In the Data modeler, I start by creating a CustomerOrders data model, and I add a table, CustomerAndOrders, to hold my customer data and the information on their orders. You can use this tool to create a simple data model where customers and orders are in two distinct tables, each one with their own primary keys. There would be nothing wrong with that. Here I’d like to show how this tool can also help you use more advanced design patterns. By having the customer and order data in a single table, I can construct queries that return all the data I need with a single interaction with DynamoDB, speeding up the performance of my application.

As partition key, I use the customerId. This choice provides an even distribution of data across multiple partitions. The sort key in my data model will be an overloaded attribute, in the sense that it can hold different data depending on the item:

  • A fixed string, for example customer, for the items containing the customer data.
  • The order date, written using ISO 8601 strings such as 20190823, for the items containing orders.

By overloading the sort key with these two possible values, I am able to run a single query that returns the customer data and the most recent orders. For this reason, I use a generic name for the sort key. In this case, I use sk.

Apart from the partition key and the optional sort key, DynamoDB has a flexible schema, and the other attributes can be different for each item in a table. However, with this tool I have the option to describe in the data model all the possible attributes I am going to use for a table. In this way, I can check later that all the access patterns I need for my application work well with this data model.

For this table, I add the following attributes:

  • customerName and customerAddress, for the items in the table containing customer data.
  • orderId and deliveryAddress, for the items in the table containing order data.

I am not adding an orderDate attribute, because for this data model the value will be stored in the sk sort key. For a real production use case, you would probably have many more attributes to describe your customers and orders, but I am trying to keep things simple enough here to show what you can do, without getting lost in details.

Another access pattern for my application is to be able to get a specific order by ID. For that, I add a global secondary index to my table, with orderId as partition key and no sort key.

I add the table definition to the data model, and move on to the Visualizer. There, I update the table by adding some sample data. I add data manually, but I could import a few rows from a table in a MySQL database, for example to simplify a NoSQL migration from a relational database.

Now, I visualize my data model with the sample data to have a better understanding of what to expect from this table. For example, if I select a customerId, and I query for all the orders greater than a specific date, I also get the customer data at the end, because the string customer, stored in the sk sort key, is always greater than any date written in ISO 8601 syntax.
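The ordering argument is easy to check locally. The following Python sketch imitates such a query on an in-memory list. The sample items and the query function are hypothetical, and plain Python string comparison stands in for the way the sort key is ordered.

```python
# Hypothetical sample data: one customer item plus three order items.
items = [
    {"customerId": "1234", "sk": "customer", "customerName": "Jane"},
    {"customerId": "1234", "sk": "20190801", "orderId": "o-1"},
    {"customerId": "1234", "sk": "20190815", "orderId": "o-2"},
    {"customerId": "1234", "sk": "20190823", "orderId": "o-3"},
]


def query(table, customer_id, sk_greater_than):
    """Return the items of one customer whose sort key follows the given value."""
    matches = [item for item in table
               if item["customerId"] == customer_id
               and item["sk"] > sk_greater_than]
    # Results come back ordered by sort key, ascending.
    return sorted(matches, key=lambda item: item["sk"])


result = query(items, "1234", "20190810")
print([item["sk"] for item in result])
```

Digits sort before lowercase letters, so the customer item always lands after the orders.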

In the Visualizer, I can also see how the global secondary index on the orderId works. Interestingly, items without an orderId are not part of this index, so I get only 4 of the 6 items that are part of my sample data. This happens because DynamoDB writes a corresponding index entry only if the index key value is present in the item. If the index key doesn’t appear in every table item, the index is said to be sparse. Sparse indexes are useful for queries over a subsection of a table.
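A sparse index can be pictured as a lookup structure that simply skips the items lacking the indexed attribute. Here is a minimal Python sketch with hypothetical sample data (two customer items and four order items) and a hypothetical build_index helper:

```python
# Hypothetical sample data: 6 items, 4 of which carry an orderId.
table = [
    {"customerId": "1234", "sk": "customer"},
    {"customerId": "1234", "sk": "20190801", "orderId": "o-1"},
    {"customerId": "1234", "sk": "20190823", "orderId": "o-2"},
    {"customerId": "5678", "sk": "customer"},
    {"customerId": "5678", "sk": "20190805", "orderId": "o-3"},
    {"customerId": "5678", "sk": "20190820", "orderId": "o-4"},
]


def build_index(items, key_attribute):
    """Map each key value to the items holding it; items without the key are skipped."""
    index = {}
    for item in items:
        if key_attribute in item:
            index.setdefault(item[key_attribute], []).append(item)
    return index


order_index = build_index(table, "orderId")
indexed_count = sum(len(entries) for entries in order_index.values())
print(indexed_count, "of", len(table), "items are in the index")
```

Looking up an order by ID then touches only the order items, never the customer items.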

I now commit my data model to DynamoDB. This step creates server-side resources such as tables and global secondary indexes for the selected data model, and loads the sample data. To do so, I need AWS credentials for an AWS account. I have the AWS Command Line Interface (CLI) installed and configured in the environment where I am using this tool, so I can just select one of my named profiles.

I move to the Operation builder, where I see all the tables in the selected AWS Region. I select the newly created CustomerAndOrders table to browse the data and build the code for the operations I need in my application.

In this case, I want to run a query that, for a specific customer, selects all orders more recent than a date I provide. As we saw previously, the overloaded sort key would also return the customer data as the last item. The Operation builder can help you use the full syntax of DynamoDB operations, for example adding conditions and child expressions. In this case, I add the condition to only return orders where the deliveryAddress contains Seattle.

I have the option to execute the operation on the DynamoDB table, but this time I want to use the query in my application. To generate the code, I select between Python, JavaScript (Node.js), or Java.
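The code generated by the tool is not reproduced in this post. As a rough idea of what such a query involves, here is a Python sketch that builds the parameters of a DynamoDB Query request in the low-level API shape. The helper build_query and the sample values are hypothetical, and the sketch assumes that customerId, sk, and deliveryAddress are all stored as strings.

```python
def build_query(table_name, customer_id, after_date, city):
    """Parameters for a DynamoDB Query request (low-level API shape)."""
    return {
        "TableName": table_name,
        # Key condition: one partition key value, sort key after a date.
        "KeyConditionExpression": "customerId = :cid AND sk > :date",
        # Filter applied to the items matched by the key condition.
        "FilterExpression": "contains(deliveryAddress, :city)",
        "ExpressionAttributeValues": {
            ":cid": {"S": customer_id},
            ":date": {"S": after_date},
            ":city": {"S": city},
        },
    }


params = build_query("CustomerAndOrders", "1234", "20190810", "Seattle")
print(params["KeyConditionExpression"])
```

With the AWS SDK for Python installed and credentials configured, these parameters could be passed to boto3.client("dynamodb").query(**params).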

You can use the Operation builder to generate the code for all the access patterns that you plan to use with your application, using all the advanced features that DynamoDB provides, including ACID transactions.

Available Now
You can find how to set up NoSQL Workbench for Amazon DynamoDB (Preview) for Windows and macOS here.

We welcome your suggestions in the DynamoDB discussion forum. Let us know what you build with this new tool and how we can help you more!

Amazon Forecast – Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/amazon-forecast-now-generally-available/

Getting accurate time series forecasts from historical data is not an easy task. Last year at re:Invent we introduced Amazon Forecast, a fully managed service that requires no experience in machine learning to deliver highly accurate forecasts. I’m excited to share that Amazon Forecast is generally available today!

With Amazon Forecast, there are no servers to provision. You only need to provide historical data, plus any additional metadata that you think may have an impact on your forecasts. For example, the demand for a particular product you need or produce may change with the weather, the time of the year, and the location where the product is used.

Amazon Forecast is based on the same technology used at Amazon and packages our years of experience in building and operating scalable, highly accurate forecasting technology in a way that is easy to use. It uses deep learning to learn from multiple datasets and automatically tries different algorithms, so it can be used for lots of different use cases, such as estimating product demand, cloud computing usage, financial planning, and resource planning in a supply chain management system.

Using Amazon Forecast
For this post, I need some sample data. To have an interesting use case, I go for the individual household electric power consumption dataset from the UCI Machine Learning Repository. For simplicity, I am using a version where data is aggregated hourly in a file in CSV format. Here are the first few lines where you can see the timestamp, the energy consumption, and the client ID:

2014-01-01 01:00:00,38.34991708126038,client_12
2014-01-01 02:00:00,33.5820895522388,client_12
2014-01-01 03:00:00,34.41127694859037,client_12
2014-01-01 04:00:00,39.800995024875625,client_12
2014-01-01 05:00:00,41.044776119402975,client_12

Let’s see how easy it is to build a predictor and get forecasts by using the Amazon Forecast console. Another option, for more advanced users, would be to use a Jupyter notebook and the AWS SDK for Python. You can find some sample notebooks in this GitHub repository.

In the Amazon Forecast console, the first step is to create a dataset group. Dataset groups act as containers for datasets that are related.

I can select a forecasting domain for my dataset group. Each domain covers a specific use case, such as retail, inventory planning, or web traffic, and brings its own dataset types based on the type of data used for training. For now, I use a custom domain that covers all use cases that don’t fall in other categories.

Next, I create a dataset. The data I am going to upload is aggregated by the hour, so 1 hour is the frequency of my data. The default data schema depends on the forecasting domain I selected earlier. I am using a custom domain here, and I change the data schema to have a timestamp, a target_value, and an item_id, in that order, as you can see in the few sample lines of data at the beginning of this post.
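Before uploading, it can be useful to check locally that the file matches this schema and frequency. Here is a minimal Python sketch that parses two of the sample lines into the three fields. The parse_rows helper is hypothetical, and the timestamp format string is an assumption derived from the sample data.

```python
import csv
import io
from datetime import datetime

# Field order used in the dataset schema.
SCHEMA = ["timestamp", "target_value", "item_id"]

SAMPLE = """\
2014-01-01 01:00:00,38.34991708126038,client_12
2014-01-01 02:00:00,33.5820895522388,client_12
"""


def parse_rows(text):
    """Parse header-less CSV text into typed rows following SCHEMA."""
    rows = []
    for record in csv.DictReader(io.StringIO(text), fieldnames=SCHEMA):
        rows.append({
            "timestamp": datetime.strptime(record["timestamp"], "%Y-%m-%d %H:%M:%S"),
            "target_value": float(record["target_value"]),
            "item_id": record["item_id"],
        })
    return rows


rows = parse_rows(SAMPLE)
step = rows[1]["timestamp"] - rows[0]["timestamp"]
print(len(rows), "rows,", step, "between consecutive data points")
```

A step of one hour between consecutive rows of the same item confirms the frequency selected for the dataset.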

Now is the time to upload my time series data from Amazon Simple Storage Service (S3) into my dataset. The default timestamp format is exactly what I have in my data, so I don’t need to change it. I need an AWS Identity and Access Management (IAM) role to give Amazon Forecast access to the S3 bucket. I can select one here, or create a new one for this use case. As usual, avoid creating IAM roles that are too permissive and apply a least privilege approach to reduce the amount of permissions to the minimum required for this activity. After I tell Amazon Forecast in which S3 bucket and folder to look for my historical data, I start the import job.

The dataset group dashboard gives an overview of the process. My target time series data is being imported, and I can optionally add:

  • item metadata information on the items I want to predict on; for example, the color of the items in a retail scenario, or the kind of household (is this an apartment or a detached house?) for this electricity-focused use case.
  • related time series data that don’t include the target variable I want to predict, but can help improve my model; for example, price and promotions used by an ecommerce company are probably related to actual sales.

I am not adding more data for this use case. As soon as my dataset is imported, I start to train a predictor that I can then use to generate forecasts. I give the predictor a name, then select the forecast horizon, which in my case is 24 hours, and the frequency at which my forecasts are generated.

To train the predictor, I can select a specific machine learning algorithm of my choice, such as ARIMA or DeepAR+, but I prefer simplicity and use AutoML to let Amazon Forecast evaluate all algorithms and choose the one that performs best for my dataset.

In the case of my dataset, each household is identified by a single variable, the item_id, but you can add more dimensions if required. I can then select the Country for holidays. This is optional, but can improve your results if the data you are using may be affected by people being on holidays or not. I think energy usage is different on holidays, so I select United States, the country my dataset is coming from.

The configuration of the backtest windows is a more advanced topic, and you can skip the next paragraph if you’re not interested in the details of how a machine learning model is evaluated in the case of time series. In this case, I am leaving the default.

When training a machine learning model, you need to split your dataset in two: a training dataset that you use to train the model with the machine learning algorithm, and an evaluation dataset that you use to evaluate the performance of your trained model. With time series, you can’t just create these two subsets of your data randomly, like you would normally do, because the order of your data points is important. The approach we use for Amazon Forecast is to split the time series in one or more parts, each one called a backtest window, preserving the order of the data. When evaluating your model against a backtest window, you should always use an evaluation dataset of the same length, otherwise it would be very difficult to compare different results. The backtest window offset tells how many points ahead of a split point you want to use for evaluation, and this is the same value for all the splits. For example, by leaving 24 (hours) I always use one day of data for evaluating my model against multiple window offsets.
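To illustrate the idea of order-preserving splits, here is a small Python sketch. It shows one plausible way to lay out the windows, with each split point stepping back from the end of the series by the window offset. It is not necessarily how Amazon Forecast positions them internally, and backtest_splits is a hypothetical helper.

```python
def backtest_splits(series, window_offset, num_windows=1):
    """Split an ordered series into (training, evaluation) pairs.

    Every evaluation slice holds `window_offset` points that directly
    follow the split point; earlier points form the training slice.
    """
    splits = []
    for k in range(num_windows, 0, -1):
        split_point = len(series) - k * window_offset
        if split_point <= 0:
            raise ValueError("series too short for the requested windows")
        training = series[:split_point]
        evaluation = series[split_point:split_point + window_offset]
        splits.append((training, evaluation))
    return splits


hourly_values = list(range(72))  # three days of hypothetical hourly data
splits = backtest_splits(hourly_values, window_offset=24, num_windows=2)
for training, evaluation in splits:
    print(len(training), "training points,", len(evaluation), "evaluation points")
```

Every evaluation slice has the same length, which keeps the results of different windows comparable, and no training point ever comes after the points it is evaluated against.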

In the advanced configurations, I have the option to enable hyperparameter optimization (HPO), for the algorithms that support it, and featurizations, to develop additional features computed from the ones in your data. I am not touching those settings now.

After a few minutes, the predictor is active. To understand the quality of a predictor, I look at some of the metrics that are automatically computed.

Quantile loss (QL) calculates how far off the forecast at a certain quantile is from the actual demand. It weights underestimation and overestimation according to a specific quantile. For example, a P90 forecast, if calibrated, means that 90% of the time the true demand is less than the forecast value. Thus, when the demand turns out to be higher than the forecast, the loss would be greater than the other way around.
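The asymmetry described above can be seen in a few lines of Python. This sketch uses the standard quantile (pinball) loss for a single data point. The metric reported in the console may aggregate and normalize the values differently, and the function name and the numbers are only for illustration.

```python
def quantile_loss(actual, forecast, quantile):
    """Pinball loss of one forecast value at the given quantile (0 < quantile < 1)."""
    if actual > forecast:
        # Underestimation is weighted by the quantile itself.
        return quantile * (actual - forecast)
    # Overestimation is weighted by the complement of the quantile.
    return (1 - quantile) * (forecast - actual)


# Same absolute error of 10 units for a P90 forecast, in both directions.
under = quantile_loss(actual=100, forecast=90, quantile=0.9)
over = quantile_loss(actual=100, forecast=110, quantile=0.9)
print(under > over)
```

For a P90 forecast, missing on the low side costs about nine times more than missing on the high side by the same amount.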

When the predictor is ready, and I am satisfied by its metrics, I can use it to create a forecast.

When the forecast is active, I can query it to get predictions. I can export the whole forecast as a CSV file, or query for specific lookups. Let’s do a lookup. In the case of the dataset I am using, I can forecast the energy used by a household for a specific range of time. Dates here are in the past because I used an old dataset. I am pretty sure you’re going to use Amazon Forecast to look into the future.

For each timestamp in the forecast, I get a range of values. The P10, P50, and P90 forecasts have respectively 10%, 50%, and 90% probability of satisfying the actual demand. How you use these three values depends on your use case and how it is impacted by overestimating or underestimating demand. The P50 forecast is the most likely estimate for the demand. The P10 and P90 forecasts give you an 80% confidence interval for what to expect.

Available Now
You can use Amazon Forecast via the console, the AWS Command Line Interface (CLI) and the AWS SDKs. For example, you can use Amazon Forecast within a Jupyter notebook with the AWS SDK for Python to create a new predictor, or use the AWS SDK for JavaScript in the Browser to get predictions from within a web or mobile app, or the AWS SDK for Java or AWS SDK for .NET to add forecast capabilities to an existing enterprise application.

Here’s the overall flow of the Amazon Forecast API, from creating the dataset group to querying and extracting the forecast:

The dataset I used for this walkthrough and other examples are available in this GitHub repository:

Amazon Forecast is now available in US East (N. Virginia), US West (Oregon), US East (Ohio), Europe (Ireland), Asia Pacific (Singapore), and Asia Pacific (Tokyo).

More information on specific features and pricing is one click away at:

I look forward to seeing what you’re going to use it for. Please share your results with me!

Danilo

AWS Lake Formation – Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-lake-formation-now-generally-available/

As soon as companies started to have data in digital format, it was possible for them to build a data warehouse, collecting data from their operational systems, such as Customer relationship management (CRM) and Enterprise resource planning (ERP) systems, and use this information to support their business decisions.

The reduction in costs of storage, together with an even greater reduction in complexity for managing large quantities of data, made possible by services such as Amazon S3, has allowed companies to retain more information, including raw data that is not structured, such as logs, images, video, and scanned documents.

This is the idea of a data lake: to store all your data in one, centralized repository, at any scale. We are seeing this approach with customers like Netflix, Zillow, NASDAQ, Yelp, iRobot, FINRA, and Lyft. They can run their analytics on this larger dataset, from simple aggregations to complex machine learning algorithms, to better discover patterns in their data and understand their business.

Last year at re:Invent we introduced in preview AWS Lake Formation, a service that makes it easy to ingest, clean, catalog, transform, and secure your data and make it available for analytics and machine learning. I am happy to share that Lake Formation is generally available today!

With Lake Formation you have a central console to manage your data lake, for example to configure the jobs that move data from multiple sources, such as databases and logs, to your data lake. Having such a large and diversified amount of data makes configuring the right access permission also critical. You can secure access to metadata in the Glue Data Catalog and data stored in S3 using a single set of granular data access policies defined in Lake Formation. These policies allow you to define table and column-level data access.

One thing I like the most about Lake Formation is that it works with your data already in S3! You can easily register your existing data with Lake Formation, and you don’t need to change existing processes loading your data to S3. Since data remains in your account, you have full control.

You can also use Glue ML Transforms to easily deduplicate your data. Deduplication is important to reduce the amount of storage you need, but also to make analyzing your data more efficient because you have neither the overhead nor the possible confusion of looking at the same data twice. This problem is trivial if duplicate records can be identified by a unique key, but becomes very challenging when you have to do a “fuzzy match”. A similar approach can be used for record linkage, that is when you are looking for similar items in different tables, for example to do a “fuzzy join” of two databases that do not share a unique key.

In this way, implementing a data lake from scratch is much faster, and managing a data lake is much easier, making these technologies available to more customers.

Creating a Data Lake
Let’s build a data lake using the Lake Formation console. First I register the S3 buckets that are going to be part of my data lake. Then I create a database and grant permission to the IAM users and roles that I am going to use to manage my data lake. The database is registered in the Glue Data Catalog and holds the metadata required to analyze the raw data, such as the structure of the tables that are going to be automatically generated during data ingestion.

Managing permissions is one of the most complex tasks for a data lake. Consider for example the huge amount of data that can be part of it, the sensitive, mission-critical nature of some of the data, and the different structured, semi-structured, and unstructured formats in which data can reside. Lake Formation makes it easier with a central location where you can give IAM users, roles, groups, and Active Directory users (via federation) access to databases and tables, optionally allowing or denying access to specific columns within a table.

To simplify data ingestion, I can use blueprints that create the necessary workflows, crawlers and jobs on AWS Glue for common use cases. Workflows enable orchestration of your data loading workloads by building dependencies between Glue entities, such as triggers, crawlers and jobs, and allow you to track visually the status of the different nodes in the workflows on the console, making it easier to monitor progress and troubleshoot issues.

Database blueprints help load data from operational databases. For example, if you have an e-commerce website, you can ingest all your orders in your data lake. You can load a full snapshot from an existing database, or incrementally load new data. In case of an incremental load, you can select a table and one or more of its columns as bookmark keys (for example, a timestamp in your orders) to determine previously imported data.
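The idea behind bookmark keys can be sketched in a few lines of Python. This is not the blueprint's implementation: incremental_load and the sample orders are hypothetical, and the sketch assumes a single bookmark column whose values only grow over time.

```python
def incremental_load(source_rows, bookmark_key, last_bookmark):
    """Return the rows newer than the bookmark, plus the bookmark for the next run."""
    new_rows = [row for row in source_rows
                if last_bookmark is None or row[bookmark_key] > last_bookmark]
    if new_rows:
        last_bookmark = max(row[bookmark_key] for row in new_rows)
    return new_rows, last_bookmark


orders = [
    {"orderId": "o-1", "createdAt": "2019-08-01T10:00:00"},
    {"orderId": "o-2", "createdAt": "2019-08-02T09:30:00"},
]
first_batch, bookmark = incremental_load(orders, "createdAt", None)

# A new order arrives; only that one is picked up by the next run.
orders.append({"orderId": "o-3", "createdAt": "2019-08-03T14:15:00"})
second_batch, bookmark = incremental_load(orders, "createdAt", bookmark)
print([row["orderId"] for row in second_batch])
```

The first run behaves like a full snapshot, and later runs import only the rows that appeared after the stored bookmark.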

Log file blueprints simplify ingesting logging formats used by Application Load Balancers, Elastic Load Balancers, and AWS CloudTrail. Let’s see how that works more in depth.

Security is always a top priority, and I want to be able to have a forensic log of all management operations across my account, so I choose the CloudTrail blueprint. As source, I select a trail collecting my CloudTrail logs from all regions into an S3 bucket. In this way, I’ll be able to query account activity across all my AWS infrastructure. This works similarly for a larger organization having multiple AWS accounts: they just need, when configuring the trail in the CloudTrail console, to apply the trail to their whole organization.

I then select the target database, and the S3 location for the data lake. As data format I use Parquet, a columnar storage format that will make querying the data faster and cheaper. The import frequency can be hourly to monthly, with the option to choose the day of the week and the time. For now, I want to run the workflow on demand. I can do that from the console or programmatically, for example using any AWS SDK or the AWS Command Line Interface (CLI).

Finally, I give the workflow a name, the IAM role to use during execution, and a prefix for the tables that will be automatically created by this workflow.

I start the workflow from the Lake Formation console and select to view the workflow graph. This opens the AWS Glue console, where I can visually see the steps of the workflow and monitor the progress of this run.

When the workflow is completed, a new table is available in my data lake database. The source data remain as logs in the S3 bucket output of CloudTrail, but now I have them consolidated, in Parquet format and partitioned by date, in my data lake S3 location. To optimize costs, I can set up an S3 lifecycle policy that automatically expires data in the source S3 bucket after a safe amount of time has passed.

Securing Access to the Data Lake
Lake Formation provides secure and granular access to data stores in the data lake, via a new grant/revoke permissions model that augments IAM policies. It is simple to set up these permissions, for example using the console:

I simply select the IAM user or role I want to grant access to. Then I select the database and optionally the tables and the columns I want to provide access to. It is also possible to select which type of access to provide. For this demo, simple select permissions are sufficient.

Accessing the Data Lake
Now I can query the data using tools like Amazon Athena or Amazon Redshift. For example, I open the query editor in the Athena console. First, I want to use my new data lake to look into which source IP addresses are most common in my AWS Account activity:

SELECT sourceipaddress, count(*)
FROM my_trail_cloudtrail
GROUP BY  sourceipaddress
ORDER BY  2 DESC;

Looking at the result of the query, you can see which are the AWS API endpoints that I use the most. Then, I’d like to check which user identity types are used. That information is stored in JSON format inside one of the columns. I can use some of the JSON functions available with Amazon Athena to get that information in my SQL statements:

SELECT json_extract_scalar(useridentity, '$.type'), count(*)
FROM "mylake"."my_trail_cloudtrail"
GROUP BY  json_extract_scalar(useridentity, '$.type')
ORDER BY  2 DESC;
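
If you are not familiar with json_extract_scalar, the following Python sketch mimics what the query does for a top-level key: it parses each JSON document, extracts one scalar value, and counts the occurrences. The rows are made-up samples standing in for the useridentity column, not real data from my trail.

```python
import json
from collections import Counter

# Made-up rows standing in for the useridentity column.
rows = [
    '{"type": "AWSService", "invokedby": "example.amazonaws.com"}',
    '{"type": "IAMUser", "username": "danilop"}',
    '{"type": "AWSService", "invokedby": "example.amazonaws.com"}',
    '{"accountid": "123456789012"}',
]


def extract_scalar(document, key):
    """Rough analogue of json_extract_scalar(document, '$.key') for a top-level key."""
    value = json.loads(document).get(key)
    # Objects and arrays are not scalars, so they are treated like a missing value.
    return None if isinstance(value, (dict, list)) else value


# Equivalent of GROUP BY ... ORDER BY count DESC.
counts = Counter(extract_scalar(row, "type") for row in rows)
for identity_type, count in counts.most_common():
    print(identity_type, count)
```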

Most of the time, AWS services are the ones creating activity in my trail. These queries are just an example, but they quickly give me deeper insight into what is happening in my AWS account.

Think of what could be a similar impact for your business! Using database and logs blueprints, you can quickly create workflows to ingest data from multiple sources within your organization, set the right permission at column level of who can have access to any information collected, clean and prepare your data using machine learning transforms, and correlate and visualize the information using tools like Amazon Athena, Amazon Redshift, and Amazon QuickSight.

Customizing Data Access with Column-Level Permissions
To follow data privacy guidelines and compliance requirements, the mission-critical data stored in a data lake often requires custom views for different stakeholders inside the company. Let’s compare the visibility of two IAM users in my AWS account, one that has full permissions on a table, and one that has only select access to a subset of the columns of the same table.

I already have a user, called danilop, with full access to the table containing my CloudTrail data. I create a new limitedview IAM user and I give it access to the Athena console. In the Lake Formation console, I only give this new user select permissions on three of the columns.
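
The same kind of grant can also be expressed programmatically. Here is a sketch that builds the request for the Lake Formation `grant_permissions` call in the AWS SDK for Python. The account ID is a placeholder, and the three column names are a hypothetical choice, because the exact columns I selected in the console are not important here.

```python
def column_select_grant(principal_arn, database, table, columns):
    """Build the keyword arguments for a column-level SELECT grant."""
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        "Permissions": ["SELECT"],
    }


grant = column_select_grant(
    "arn:aws:iam::123456789012:user/limitedview",  # placeholder account ID
    "mylake",
    "my_trail_cloudtrail",
    ["eventtime", "eventname", "sourceipaddress"],  # hypothetical column choice
)

# With real credentials, the request would be sent with:
# boto3.client("lakeformation").grant_permissions(**grant)
print(grant["Resource"]["TableWithColumns"]["ColumnNames"])
```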

To verify the different access to the data in the table, I log in with one user at a time and go to the Athena console. On the left I can explore which tables and columns the logged-in user can see in the Glue Data Catalog. Here’s a comparison for the two users, side-by-side:

The limited user has access only to the three columns that I explicitly configured, and to the four columns used for partitioning the table, whose access is required to see any data. When I query the table in the Athena console with a select * SQL statement, logged in as the limitedview user, I only see data from those seven columns:

Available Now
There is no additional cost for using AWS Lake Formation; you pay for the use of the underlying services such as Amazon S3 and AWS Glue. One of the core benefits of Lake Formation is the security policies it introduces. Previously you had to use separate policies to secure data and metadata access, and these policies only allowed table-level access. Now you can give each user access, from a central location, only to the columns they need to use.

AWS Lake Formation is now available in US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo). Redshift integration with Lake Formation requires Redshift cluster version 1.0.8610 or higher; your clusters should have been automatically updated by the time you read this. Support for Apache Spark with Amazon EMR will follow over the next few months.

I only scratched the surface of what you can do with Lake Formation. Building and managing a data lake for your business is now much easier. Let me know how you are using these new capabilities!

Danilo

New – Local Mocking and Testing with the Amplify CLI

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-local-mocking-and-testing-with-the-amplify-cli/

The open source Amplify Framework provides a set of libraries, user interface (UI) components, and a command line interface (CLI) that make it easier to add sophisticated cloud features to your web or mobile apps by provisioning backend resources using AWS CloudFormation.

A comment I often get when talking with our customers is that when you are adding new features or fixing bugs, it is important to iterate as fast as possible, getting quick feedback on your actions. How can we improve their development experience?

Well, last week the Amplify team launched the new Predictions category, to let you quickly add machine learning capabilities to your web or mobile app. Today, they are doing it again. I am very happy to share that you can now use the Amplify CLI to mock some of the most common cloud services it provides, and test your application 100% locally!

By mocking, I mean that instead of using the actual backend component (an API, in the case of cloud services), you use a local, simplified emulation of that API. This emulation provides the basic functionality that you need for testing during development, but not the full behavior you’d get from the production service.

With this new mocking capability you can test your changes quickly, without the need of provisioning or updating the cloud resources you are using at every step. In this way, you can set up unit and integration tests that can be executed rapidly, without affecting your cloud backend. Depending on the architecture of your app, you can set up automatic testing in your CI/CD pipeline without provisioning backend resources.

This is really useful when editing AWS AppSync resolver mapping templates, written in Apache Velocity Template Language (VTL), which take your requests as input, and output a JSON document containing the instructions for the resolver. You can now have immediate feedback on your edits, and test if your resolvers work as expected without having to wait for a deployment for every update.

For this first release, the Amplify CLI can mock locally:

API Mocking
Let’s do a quick overview of what you can do. For example, let’s create a sample app that helps people store and share the location of those nice places that allow you to refill your reusable water bottle and reduce plastic waste.

To install the Amplify CLI, I need Node.js (version >= 8.11.x) and npm (version >= 5.x):

npm install -g @aws-amplify/cli
amplify configure

Amplify supports lots of different frameworks. For this example I am using React, and I start with a sample app (npx requires npm >= 5.2.x):

npx create-react-app refillapp
cd refillapp

I use the Amplify CLI to initialize the project and add an API. The Amplify CLI is interactive, asking you questions that drive the configuration of your backend. In this case, when asked, I select to add a GraphQL API.

amplify init
amplify add api

During the creation of the API, I edit the GraphQL schema, and define a RefillLocation in this way:

type RefillLocation @model {
  id: ID!
  name: String!
  description: String
  streetAddress: String!
  city: String!
  stateProvinceOrRegion: String
  zipCode: String!
  countryCode: String!
}

The fields that have an exclamation mark ! at the end are mandatory. The other fields are optional, and can be omitted when creating a new object.
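
To illustrate the difference between mandatory and optional fields, here is a small Python sketch that checks an input object against the schema above. It is only an approximation of what a GraphQL server does. I also left `id` out of the required set on the assumption that it is generated automatically when an object is created.

```python
# Field names copied from the RefillLocation type above.
# Assumption: "id" is generated on creation, so it is not required in the input.
REQUIRED_FIELDS = {"name", "streetAddress", "city", "zipCode", "countryCode"}
OPTIONAL_FIELDS = {"description", "stateProvinceOrRegion"}


def missing_required(input_fields):
    """Return the sorted names of the mandatory fields that are absent or null."""
    provided = {key for key, value in input_fields.items() if value is not None}
    return sorted(REQUIRED_FIELDS - provided)


def unknown_fields(input_fields):
    """Return the sorted names of the fields that are not part of the type."""
    return sorted(set(input_fields) - REQUIRED_FIELDS - OPTIONAL_FIELDS - {"id"})


incomplete = {"name": "My Favorite Place", "city": "Seattle"}
print(missing_required(incomplete))
```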

The @model you see in the first line is a directive using GraphQL Transform to define top level object types in your API that are backed by DynamoDB and generate for you all the necessary CRUDL (create, read, update, delete, and list) queries and mutations, and the subscriptions to be notified of such mutations.

Now, I would normally need to run amplify push to configure and provision the backend resources required by the project (AppSync and DynamoDB in this case). But to get quick feedback, I use the new local mocking capability by running this command:

amplify mock

Alternatively, I can use the amplify mock api command to specifically mock just my GraphQL API. It would be the same at this stage, but it can be handy when using more than one mocking capability at a time.

The output of the mock command gives you some information on what it does, and what you can do, including the AppSync Mock endpoint:

GraphQL schema compiled successfully.

Edit your schema at /MyCode/refillapp/amplify/backend/api/refillapp/schema.graphql or place .graphql files in a directory at /MyCode/refillapp/amplify/backend/api/refillapp/schema

Creating table RefillLocationTable locally

Running GraphQL codegen

✔ Generated GraphQL operations successfully and saved at src/graphql

AppSync Mock endpoint is running at http://localhost:20002

I keep the mock command running in a terminal window to get feedback on possible errors in my code. For example, when I edit a VTL template, the Amplify CLI recognizes that immediately, and generates the updated code for the resolver. In case of a mistake, I get an error from the running mock command.

The AppSync Mock endpoint gives you access to:

I can now run GraphQL queries, mutations, and subscriptions locally for my API, using a web interface. For example, to create a new RefillLocation I build the mutation visually, like this:

To get the list of the RefillLocation objects in a city, I build the query using the same web interface, and run it against the local DynamoDB storage:

When I am confident that my data model is correct, I start building the frontend code of my app, editing the App.js file of my React app, and add functionalities that I can immediately test, thanks to local mocking.

To add the Amplify Framework to my app, including the React extensions, I use Yarn:

yarn add aws-amplify
yarn add aws-amplify-react

Now, using the Amplify Framework library, I can write code like this to run a GraphQL operation:

import API, { graphqlOperation } from '@aws-amplify/api';
import { createRefillLocation } from './graphql/mutations';

const refillLocation = {
  name: "My Favorite Place",
  streetAddress: "123 Here or There",
  zipCode: "12345",
  city: "Seattle",
  countryCode: "US"
};

await API.graphql(graphqlOperation(createRefillLocation, { input: refillLocation }));
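
Under the hood, a GraphQL operation like this one is a JSON document with a query and its variables, sent over HTTP. The following Python sketch builds that document for the same mutation. The mutation text is my assumption of what the generated code in src/graphql looks like (including the `CreateRefillLocationInput` type name), and I am not sending the request because I have not verified the exact path and headers that the local mock endpoint expects.

```python
import json

# Assumed shape of the generated mutation; check src/graphql/mutations.js for the real one.
CREATE_REFILL_LOCATION = """
mutation CreateRefillLocation($input: CreateRefillLocationInput!) {
  createRefillLocation(input: $input) {
    id
    name
    city
  }
}
"""


def build_request_body(query, variables):
    """Serialize a GraphQL operation as the JSON body of an HTTP POST."""
    return json.dumps({"query": query, "variables": variables})


refill_location = {
    "name": "My Favorite Place",
    "streetAddress": "123 Here or There",
    "zipCode": "12345",
    "city": "Seattle",
    "countryCode": "US",
}

body = build_request_body(CREATE_REFILL_LOCATION, {"input": refill_location})
print(body)
```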

Storage Mocking
I now want to add a new feature to my app, to let users upload and share pictures of a RefillLocation. To do so, I add the Storage category to the configuration of my project and select “Content” to use S3:

amplify add storage

Using the Amplify Framework library, I can now, straight from the browser, put, get, or remove objects from S3 using the following syntax:

import Storage from '@aws-amplify/storage';

Storage.put(name, file, {
  level: 'public'
})
.then(result => console.log(result))
.catch(err => console.log(err));

Storage.get(file, {
  level: 'public'
})
.then(result => {
  console.log(result);
  this.setState({ imageUrl: result });
  fetch(result);
})
.catch(err => alert(err));

All those interactions with S3 are marked as public, because I want my users to share their pictures with each other publicly, but the Amplify Framework supports different access levels, such as private, protected, and public. You can find more information on this in the File Access Levels section of the Amplify documentation.
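
As far as I understand the default behavior, the access level ends up as a prefix of the object key in the S3 bucket, with protected and private objects also scoped by the identity of the user. Here is a Python sketch of that mapping. Treat the prefixes as an assumption based on the defaults, since they can be customized, and note that the `s3_key` function and the identity ID are hypothetical.

```python
def s3_key(level, name, identity_id=None):
    """Map an access level and an object name to the assumed default S3 key."""
    if level == "public":
        return f"public/{name}"
    if level in ("protected", "private"):
        if identity_id is None:
            raise ValueError(f"{level} objects need the identity ID of the user")
        return f"{level}/{identity_id}/{name}"
    raise ValueError(f"unknown access level: {level}")


print(s3_key("public", "fountain.jpg"))
print(s3_key("private", "fountain.jpg", identity_id="us-east-1:example-identity"))
```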

Since S3 storage is supported by this new mocking capability, I again use amplify mock to test my whole application locally, including the backend used by my GraphQL API (AppSync and DynamoDB) and my content storage (S3).

If I want to test only part of my application locally, I can use amplify mock api or amplify mock storage to have only the GraphQL API, or the S3 storage, mocked locally.

Available Now
There are lots of other features that I didn’t have time to cover in this post. The best way to learn is to be curious and get hands-on! You can start using Amplify by following the get-started tutorial.

Here you can find a great walkthrough of the features, and a description of how we collaborated with the open source community for this release.

Being able to mock and test your application locally can help you build and refine your ideas faster. Let us know what you think in the Amplify CLI GitHub repository.

Danilo

Amplify Framework Update – Quickly Add Machine Learning Capabilities to Your Web and Mobile Apps

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/amplify-framework-update-quickly-add-machine-learning-capabilities-to-your-web-and-mobile-apps/

At AWS, we want to put machine learning in the hands of every developer. For example, we have pre-trained AI services for areas such as computer vision and language that you can use without any expertise in machine learning. Today we are taking another step in that direction with the addition of a new Predictions category to the Amplify Framework. In this way, you can add and configure AI/ML use cases for your web or mobile application with a few lines of code!

AWS Amplify consists of a development framework and developer services that make it super easy to build mobile and web applications on AWS. The open-source Amplify Framework provides an opinionated set of libraries, user interface (UI) components, and a command line interface (CLI) to build a cloud backend and integrate it with your web or mobile apps. Amplify leverages a core set of AWS services organized into categories, including storage, authentication & authorization, APIs (GraphQL and REST), analytics, push notifications, chat bots, and AR/VR.

Using the Amplify Framework CLI, you can interactively initialize your project with amplify init. Then, you can go through your storage (amplify add storage) and user authentication & authorization (amplify add auth) options.

Now, you can also use amplify add predictions to configure your app to:

  • Identify text, entities, and labels in images using Amazon Rekognition, or identify text in scanned documents to get the contents of fields in forms and information stored in tables using Amazon Textract.
  • Convert text into a different language using Amazon Translate, text to speech using Amazon Polly, and speech to text using Amazon Transcribe.
  • Interpret text to find the dominant language, the entities, the key phrases, the sentiment, or the syntax of unstructured text using Amazon Comprehend.

You can select to have each of the above actions available only to authenticated users of your app, or also for guest, unauthenticated users. Based on your inputs, Amplify configures the necessary permissions using AWS Identity and Access Management (IAM) roles and Amazon Cognito.

Let’s see how Predictions works for a web application. For example, to identify text in an image using Amazon Rekognition directly from the browser, you can use the following JavaScript syntax and pass a file object:

Predictions.identify({
  text: {
    source: file,
    format: "PLAIN" // "PLAIN" uses Amazon Rekognition
  }
}).then((result) => { /* use the result here */ })

If the image is stored on Amazon S3, you can change the source to link to the S3 bucket selected when adding storage to this project. You can also change the format to analyze a scanned document using Amazon Textract. Here’s how to extract text from a form in a document stored on S3:

Predictions.identify({
  text: {
    source: { key: "my/image" },
    format: "FORM" // "FORM" or "TABLE" use Amazon Textract
  }
}).then((result) => { /* use the result here */ })

Here’s an example of how to interpret text using all the pre-trained capabilities of Amazon Comprehend:

Predictions.interpret({
  text: {
    source: {
      text: "text to interpret",
    },
    type: "ALL"
  }
}).then((result) => {...})

To convert text to speech using Amazon Polly, using the language and the voice selected when adding the prediction, and play it back in the browser, you can use the following code:

Predictions.convert({
  textToSpeech: {
    source: {
      text: "text to generate speech"
    }
  }
}).then(result => {
  var audio = new Audio();
  audio.src = result.speech.url;
  audio.play();
})

Available Now
You can start building your next web or mobile app using Amplify today by following the get-started tutorial here, and give us your feedback in the Amplify Framework GitHub repository.

There are lots of other options and features available in the Predictions category of the Amplify Framework. Please see this walkthrough on the AWS Mobile Blog for an in-depth example of building a machine-learning powered app.

It has never been easier to add machine learning functionalities to a web or mobile app. Please let me know what you’re going to build next.

Danilo

AWS Cloud Development Kit (CDK) – TypeScript and Python are Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-cloud-development-kit-cdk-typescript-and-python-are-now-generally-available/

Managing your Infrastructure as Code provides great benefits and is often a stepping stone for a successful application of DevOps practices. In this way, instead of relying on manually performed steps, both administrators and developers can automate provisioning of compute, storage, network, and application services required by their applications using configuration files.

For example, defining your Infrastructure as Code makes it possible to:

  • Keep infrastructure and application code in the same repository
  • Make infrastructure changes repeatable and predictable across different environments, AWS accounts, and AWS regions
  • Replicate production in a staging environment to enable continuous testing
  • Replicate production in a performance test environment that you use just for the time required to run a stress test
  • Release infrastructure changes using the same tools as code changes, so that deployments include infrastructure updates
  • Apply software development best practices to infrastructure management, such as code reviews, or deploying small changes frequently

Configuration files used to manage your infrastructure are traditionally implemented as YAML or JSON text files, but in this way you’re missing most of the advantages of modern programming languages. Specifically with YAML, it can be very difficult to detect a file truncated while transferring to another system, or a missing line when copying and pasting from one template to another.

Wouldn’t it be better if you could use the expressive power of your favorite programming language to define your cloud infrastructure? For this reason, we introduced last year in developer preview the AWS Cloud Development Kit (CDK), an extensible open-source software development framework to model and provision your cloud infrastructure using familiar programming languages.

I am super excited to share that the AWS CDK for TypeScript and Python is generally available today!

With the AWS CDK you can design, compose, and share your own custom components that incorporate your unique requirements. For example, you can create a component setting up your own standard VPC, with its associated routing and security configurations. Or a standard CI/CD pipeline for your microservices using tools like AWS CodeBuild and CodePipeline.

Personally I really like that by using the AWS CDK, you can build your application, including the infrastructure, in your IDE, using the same programming language and with the support of autocompletion and parameter suggestion that modern IDEs have built in, without having to do a mental switch between one tool, or technology, and another. The AWS CDK makes it really fun to quickly code up your AWS infrastructure, configure it, and tie it together with your application code!

How the AWS CDK works
Everything in the AWS CDK is a construct. You can think of constructs as cloud components that can represent architectures of any complexity: a single resource, such as an S3 bucket or an SNS topic, a static website, or even a complex, multi-stack application that spans multiple AWS accounts and regions. To foster reusability, constructs can include other constructs. You compose constructs together into stacks, which you can deploy into an AWS environment, and apps, a collection of one or more stacks.
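
To visualize how constructs compose into a tree, here is a toy model in plain Python. It is not the AWS CDK API, just a sketch of the idea that every construct is created inside a scope and can contain other constructs; the `Construct` class and the names used below are made up for this illustration.

```python
class Construct:
    """Toy stand-in for a construct: a named node in a tree of components."""

    def __init__(self, scope, construct_id):
        self.scope = scope
        self.id = construct_id
        self.children = []
        if scope is not None:
            scope.children.append(self)

    @property
    def path(self):
        """Path of IDs from the root of the tree down to this construct."""
        if self.scope is None:
            return self.id
        return f"{self.scope.path}/{self.id}"

    def walk(self):
        """Yield this construct and all of its descendants, depth first."""
        yield self
        for child in self.children:
            yield from child.walk()


# An app contains a stack; the stack contains a higher-level component and a topic.
app = Construct(None, "App")
stack = Construct(app, "MyStack")
website = Construct(stack, "StaticWebsite")
Construct(website, "Bucket")
Construct(stack, "Topic")

paths = [construct.path for construct in app.walk()]
print(paths)
```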

How to use the AWS CDK
We continuously add new features based on the feedback of our customers. That means that when creating an AWS resource, you often have to specify many options and dependencies. For example, if you create a VPC you have to think about how many Availability Zones (AZs) to use and how to configure subnets to give private and public access to the resources that will be deployed in the VPC.

To make it easy to define the state of AWS resources, the AWS Construct Library exposes the full richness of many AWS services with sensible defaults that you can customize as needed. In the case above, the VPC construct creates by default public and private subnets for all the AZs in the VPC, using 3 AZs if not specified.

For creating and managing CDK apps, you can use the AWS CDK Command Line Interface (CLI), a command-line tool that requires Node.js and can be installed quickly with:

npm install -g aws-cdk

After that, you can use the CDK CLI with different commands:

  • cdk init to initialize in the current directory a new CDK project in one of the supported programming languages
  • cdk synth to print the CloudFormation template for this app
  • cdk deploy to deploy the app in your AWS Account
  • cdk diff to compare what is in the project files with what has been deployed

Just run cdk to see more of the available commands and options.

You can easily include the CDK CLI in your deployment automation workflow, for example using Jenkins or AWS CodeBuild.

Let’s use the AWS CDK to build two sample projects, using different programming languages.

An example in TypeScript
For the first project I am using TypeScript to define the infrastructure:

cdk init app --language=typescript

Here’s a simplified view of what I want to build, not entering into the details of the public/private subnets in the VPC. There is an online frontend, writing messages in a queue, and an asynchronous backend, consuming messages from the queue:

Inside the stack, the following TypeScript code defines the resources I need, and their relations:

  • First I define the VPC and an Amazon ECS cluster in that VPC. By using the defaults provided by the AWS Construct Library, I don’t need to specify any parameter here.
  • Then I use an ECS pattern that in a few lines of code sets up an Amazon SQS queue and an ECS service running on AWS Fargate to consume the messages in that queue.
  • The ECS pattern library provides higher-level ECS constructs which follow common architectural patterns, such as load balanced services, queue processing, and scheduled tasks.
  • A Lambda function has the name of the queue, created by the ECS pattern, passed as an environment variable and is granted permissions to send messages to the queue.
  • The code of the Lambda function and the Docker image are passed as assets. Assets allow you to bundle files or directories from your project and use them with Lambda or ECS.
  • Finally, an Amazon API Gateway endpoint provides an HTTPS REST interface to the function.

const myVpc = new ec2.Vpc(this, "MyVPC");

const myCluster = new ecs.Cluster(this, "MyCluster", {
  vpc: myVpc
});

const myQueueProcessingService = new ecs_patterns.QueueProcessingFargateService(
  this, "MyQueueProcessingService", {
    cluster: myCluster,
    memoryLimitMiB: 512,
    image: ecs.ContainerImage.fromAsset("my-queue-consumer")
  });

const myFunction = new lambda.Function(
  this, "MyFrontendFunction", {
    runtime: lambda.Runtime.NODEJS_10_X,
    timeout: Duration.seconds(3),
    handler: "index.handler",
    code: lambda.Code.asset("my-front-end"),
    environment: {
      QUEUE_NAME: myQueueProcessingService.sqsQueue.queueName
    }
  });

myQueueProcessingService.sqsQueue.grantSendMessages(myFunction);

const myApi = new apigateway.LambdaRestApi(
  this, "MyFrontendApi", {
    handler: myFunction
  });

I find this code very readable and easier to maintain than the corresponding JSON or YAML. By the way, cdk synth in this case outputs more than 800 lines of plain CloudFormation YAML.

An example in Python
For the second project I am using Python:

cdk init app --language=python

I want to build a Lambda function that is executed every 10 minutes:

When you initialize a CDK project in Python, a virtualenv is set up for you. You can activate the virtualenv and install your project requirements with:

source .env/bin/activate

pip install -r requirements.txt

Note that Python autocompletion may not work with some editors, like Visual Studio Code, if you don’t start the editor from an active virtualenv.

Inside the stack, here’s the Python code defining the Lambda function and the CloudWatch Event rule to invoke the function periodically as target:

myFunction = aws_lambda.Function(
    self, "MyPeriodicFunction",
    code=aws_lambda.Code.asset("src"),
    handler="index.main",
    timeout=core.Duration.seconds(30),
    runtime=aws_lambda.Runtime.PYTHON_3_7,
)

myRule = aws_events.Rule(
    self, "MyRule",
    schedule=aws_events.Schedule.rate(core.Duration.minutes(10)),
)
myRule.add_target(aws_events_targets.LambdaFunction(myFunction))

Again, this is easy to understand even if you don’t know the details of AWS CDK. For example, durations include the time unit and you don’t have to wonder if they are expressed in seconds, milliseconds, or days. The output of cdk synth in this case is more than 90 lines of plain CloudFormation YAML.
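
The stack above expects the function code in the src directory, with a handler called index.main. I am not showing my actual function here, but a minimal handler could look like the following sketch, saved in a hypothetical src/index.py file. It assumes that the scheduled event includes a time field, and falls back to the current time otherwise.

```python
import json
from datetime import datetime, timezone


def main(event, context):
    """Entry point matching handler="index.main" in the stack definition."""
    now = datetime.now(timezone.utc).isoformat()
    # Use the time carried by the event, or the current time if it is missing.
    triggered_at = event.get("time", now)
    print(json.dumps({"message": "periodic run", "triggered_at": triggered_at}))
    return {"triggered_at": triggered_at}


if __name__ == "__main__":
    # Local smoke test with a made-up event; Lambda passes a real context object.
    main({"time": "2019-07-11T12:00:00Z"}, None)
```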

Available Now
There is no charge for using the AWS CDK; you pay for the AWS resources that are deployed by the tool.

To quickly get hands-on with the CDK, start with this awesome step-by-step online tutorial!

More examples of CDK projects, using different programming languages, are available in this repository:

https://github.com/aws-samples/aws-cdk-examples

You can find more information on writing your own constructs here.

The AWS CDK is open source and we welcome your contribution to make it an even better tool:

https://github.com/awslabs/aws-cdk

Check out our source code on GitHub, start building your infrastructure today using TypeScript or Python, or try different languages in developer preview, such as C# and Java, and give us your feedback!

Amazon Aurora PostgreSQL Serverless – Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/amazon-aurora-postgresql-serverless-now-generally-available/

The database is usually the most critical part of a software architecture and managing databases, especially relational ones, has never been easy. For this reason, we created Amazon Aurora Serverless, an auto-scaling version of Amazon Aurora that automatically starts up, shuts down and scales up or down based on your application workload.

The MySQL-compatible edition of Aurora Serverless has been available for some time now. I am pleased to announce that the PostgreSQL-compatible edition of Aurora Serverless is generally available today.

Before moving on with details, I take the opportunity to congratulate the Amazon Aurora development team that has just won the 2019 Association for Computing Machinery’s (ACM) Special Interest Group on Management of Data (SIGMOD) Systems Award!

When you create a database with Aurora Serverless, you set the minimum and maximum capacity. Your client applications transparently connect to a proxy fleet that routes the workload to a pool of resources that are automatically scaled. Scaling is very fast because resources are “warm” and ready to be added to serve your requests.

 

There is no change with Aurora Serverless on how storage is managed by Aurora. The storage layer is independent from the compute resources used by the database. There is no need to provision storage in advance. The minimum storage is 10GB and, based on the database usage, the Amazon Aurora storage will automatically grow, up to 64 TB, in 10GB increments with no impact to database performance.

Creating an Aurora Serverless PostgreSQL Database
Let’s start an Aurora Serverless PostgreSQL database and see the automatic scalability at work. From the Amazon RDS console, I select to create a database using Amazon Aurora as engine. Currently, Aurora Serverless is compatible with PostgreSQL version 10.5. Selecting that version, the serverless option becomes available.

I give the new DB cluster an identifier, choose my master username, and let Amazon RDS generate a password for me. I will be able to retrieve my credentials during database creation.

I can now select the minimum and maximum capacity for my database, in terms of Aurora Capacity Units (ACUs), and in the additional scaling configuration I choose to pause compute capacity after 5 minutes of inactivity. Based on my settings, Aurora Serverless automatically creates scaling rules for thresholds for CPU utilization, connections, and available memory.
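
The same settings can be expressed as parameters for the `create_db_cluster` call of the AWS SDK for Python. This is a sketch under some assumptions: the capacity values, identifier, and credentials are placeholders, the engine version string is the one named in this post and may need adjusting, and unlike in the console you have to provide a password yourself.

```python
def serverless_cluster_args(cluster_id, username, password,
                            min_acu, max_acu, pause_after_seconds):
    """Build the keyword arguments for an Aurora Serverless PostgreSQL cluster."""
    return {
        "DBClusterIdentifier": cluster_id,
        "Engine": "aurora-postgresql",
        "EngineVersion": "10.5",  # version named in this post; may need adjusting
        "EngineMode": "serverless",
        "MasterUsername": username,
        "MasterUserPassword": password,
        "ScalingConfiguration": {
            "MinCapacity": min_acu,
            "MaxCapacity": max_acu,
            "AutoPause": True,
            "SecondsUntilAutoPause": pause_after_seconds,
        },
    }


# Placeholder values; 5 minutes of inactivity is expressed as 300 seconds.
args = serverless_cluster_args(
    "my-serverless-cluster", "dbadmin", "change-me", 2, 8, 5 * 60)

# With real credentials: boto3.client("rds").create_db_cluster(**args)
print(args["ScalingConfiguration"])
```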

Testing Some Load on the Database
To generate some load on the database I am using sysbench on an EC2 instance. There are a couple of Lua scripts bundled with sysbench that can help generate an online transaction processing (OLTP) workload:

  • The first script, parallel_prepare.lua, generates 100,000 rows per table for 24 tables.
  • The second script, oltp.lua, generates workload against those data using 64 worker threads.

By using those scripts, I start generating load on my database cluster. As you can see from this graph, taken from the RDS console monitoring tab, the serverless database capacity grows and shrinks to follow my requirements. The metric shown on this graph is the number of ACUs used by the database cluster. First it scales up to accommodate the sysbench workload. When I stop the load generator, it scales down and then pauses.

Available Now
Aurora Serverless PostgreSQL is available now in US East (N. Virginia), US East (Ohio), US West (Oregon), EU (Ireland), and Asia Pacific (Tokyo). With Aurora Serverless, you pay on a per-second basis for the database capacity you use when the database is active, plus the usual Aurora storage costs.
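
To get a feeling for how per-second billing works, here is a small Python sketch that turns a usage profile into a compute cost. The price per ACU-hour is a hypothetical number that I use only to make the arithmetic visible, so check the pricing page for the real one. The sketch also ignores storage costs and any minimum billing period that may apply.

```python
def compute_cost(usage, price_per_acu_hour):
    """Compute cost for a list of (ACUs, active seconds) intervals."""
    acu_seconds = sum(acus * seconds for acus, seconds in usage)
    return acu_seconds * price_per_acu_hour / 3600


# Hypothetical profile: 10 minutes at 2 ACUs, 30 minutes at 8 ACUs,
# then 5 minutes at 2 ACUs before the database pauses (no compute cost while paused).
usage = [(2, 600), (8, 1800), (2, 300)]

# 1,200 + 14,400 + 600 = 16,200 ACU-seconds, which is 4.5 ACU-hours.
cost = compute_cost(usage, price_per_acu_hour=0.10)  # hypothetical price
print(round(cost, 2))
```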

For more information on Amazon Aurora, I recommend this great post explaining why and how it was created:

Amazon Aurora ascendant: How we designed a cloud-native relational database

It’s never been so easy to use a relational database in production. I am so excited to see what you are going to use it for!