Earlier this year, James Conger built a chartplotter for his boat using a Raspberry Pi. Here he is with a detailed explanation of how everything works:
Provides an overview of the hardware and software needed to put together a home-made Chartplotter with its own GPS and AIS receiver. Cost for this project was about $350 US in 2019.
The entire build cost approximately $350. It incorporates a Raspberry Pi 3 Model B+, dAISy AIS receiver HAT, USB GPS module, and touchscreen display, all hooked up to his boat.
Perfect for navigating the often foggy San Francisco Bay, the chartplotter allows James to track the position, speed, and direction of major vessels in the area, superimposed over high-quality NOAA nautical charts.
Carputers! Fabrice Aneche is documenting his ongoing build, which equips an older (2011) car with some of the features a 2018 model might have: thus far, a reversing camera (bought off the shelf, with a modified GUI to show the date and the camera’s output built with Qt and Golang), GPS and offline route guidance.
We’re not sure how the car got through that little door there.
It was back in 2013, when the Raspberry Pi had been on the market for about a year, that we started to see carputer projects emerge. They tended to be focussed in two directions: in-car entertainment, and on-board diagnostics (OBD). We ended up hiring the wonderful Martin O’Hanlon, who wrote up the first OBD project we came across, just this year. Being featured on this blog can change your life, I tell you.
In the last five years, the Pi’s evolved: you’re now working with a lot more processing power, there’s onboard WiFi, and far more peripherals which can be useful in a…vehicular context are available. Consequently, the flavour of the car projects we’re seeing has changed somewhat, with navigation systems and cameras much more visible. Fabrice’s is one of the best examples we’ve found.
Night-view navigation system
GPS is all very well, but you, the human person driver, will want directions at every turn. So Fabrice wrote a user interface to serve up live maps and directions, mostly in Qt5 and QML (he’s got some interesting discussion on his website about why he stopped using X11, which turned out to be too slow for his needs). All the non-QML work is done in Go. It’s all open-source, and on GitHub, if you’d like to contribute or roll your own project. He’s also worked over the Linux GPS daemons, found them lacking, and has produced his own:
…the Linux gps daemons are using obscure and over complicated protocols so I’ve decided to write my own gps daemon in Go using a gRPC stream interface. You can find it here.
I’m also not satisfied with the map matching of OSRM for real time display, I may rewrite one using mbmatch.
We’ll be keeping an eye on this project; given how much clever has gone into it already, we’re pretty sure that Fabrice will be adding new features. Thanks Fabrice!
Last year, we released Amazon Connect, a cloud-based contact center service that enables any business to deliver better customer service at low cost. This service is built on the same technology that empowers Amazon customer service associates. Using this system, associates have millions of conversations with customers when they inquire about their shipping or order information. Because we made it available as an AWS service, you can now enable your contact center agents to make or receive calls in a matter of minutes. You can do this without having to provision any kind of hardware.
There are several advantages of building your contact center in the AWS Cloud, as described in our documentation. In addition, customers can extend Amazon Connect capabilities by using AWS products and the breadth of AWS services. In this blog post, we focus on how to get analytics out of the rich set of data published by Amazon Connect. We make use of an Amazon Connect data stream and create an end-to-end workflow to offer an analytical solution that can be customized based on need.
Solution overview
The following diagram illustrates the solution.
In this solution, Amazon Connect exports its contact trace records (CTRs) using Amazon Kinesis. CTRs are data streams in JSON format, and each has information about individual contacts. For example, this information might include the start and end time of a call, which agent handled the call, which queue the user chose, queue wait times, number of holds, and so on. You can enable this feature by reviewing our documentation.
In this architecture, we use Kinesis Firehose to capture Amazon Connect CTRs as raw data in an Amazon S3 bucket. We don’t use the recent feature added by Kinesis Firehose to save the data in S3 in Apache Parquet format. Instead, we use AWS Glue functionality to automatically detect the schema on the fly from an Amazon Connect data stream.
The primary reason for this approach is that it allows us to use attributes and enables an Amazon Connect administrator to dynamically add more fields as needed. Also, by converting data to Parquet in batches (every couple of hours), compression can be higher. However, if your requirement is to ingest the data in Parquet format in real time, we recommend using the recently launched Kinesis Firehose feature. You can review this blog post for further information.
By default, Firehose puts these records in time-series format. To make it easy for AWS Glue crawlers to capture information from new records, we use AWS Lambda to move all new records to a single S3 prefix called flatfiles. Our Lambda function is configured using S3 event notification. To comply with AWS Glue and Athena best practices, the Lambda function also converts all column names to lowercase. Finally, we also use the Lambda function to start AWS Glue crawlers. AWS Glue crawlers identify the data schema and update the AWS Glue Data Catalog, which is used by extract, transform, load (ETL) jobs in AWS Glue in the latter half of the workflow.
You can see our approach in the Lambda code following.
import json
import re
import urllib.parse

import boto3

s3 = boto3.resource('s3')
client = boto3.client('s3')


def convertColumntoLowwerCaps(obj):
    # Lowercase every key and strip non-word characters, recursing into nested objects.
    # Iterate over a copy of the keys because the dict is modified inside the loop.
    for key in list(obj.keys()):
        new_key = re.sub(r'[\W]+', '', key.lower())
        v = obj[key]
        if isinstance(v, dict):
            if len(v) > 0:
                convertColumntoLowwerCaps(v)
        if new_key != key:
            obj[new_key] = obj[key]
            del obj[key]
    return obj


def lambda_handler(event, context):
    # The S3 event notification carries the bucket and the URL-encoded object key.
    bucket = event['Records'][0]['s3']['bucket']['name']
    key = urllib.parse.unquote_plus(event['Records'][0]['s3']['object']['key'])
    try:
        client.download_file(bucket, key, '/tmp/file.json')
        with open('/tmp/out.json', 'w') as output, open('/tmp/file.json', 'r', encoding='utf-8') as file:
            i = 0
            for line in file:
                # Firehose writes records back to back, so split them at each "}{" boundary.
                for fragment in line.replace("}{", "}\n{").split("\n"):
                    if not fragment.strip():
                        continue  # skip the empty piece left by a trailing newline
                    record = json.loads(fragment, object_hook=convertColumntoLowwerCaps)
                    if i != 0:
                        output.write("\n")
                    output.write(json.dumps(record))
                    i += 1
        # Flatten the time-series prefix into a single file name under flatfiles/.
        newkey = 'flatfiles/' + key.replace("/", "")
        client.upload_file('/tmp/out.json', bucket, newkey)
        s3.Object(bucket, key).delete()
        return "success"
    except Exception as e:
        print(e)
        print('Error copying object {} from bucket {}'.format(key, bucket))
        raise e
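To see what the key normalization and record splitting do without deploying anything, the following minimal sketch runs the same idea locally. It is not the Lambda function itself: the names `normalize_keys` and `split_records` and the sample payload are hypothetical. It also assumes a different splitting technique, `json.JSONDecoder.raw_decode`, instead of replacing `}{`, so that a string value containing those characters would not be split by mistake.

```python
import json
import re


def normalize_keys(obj):
    """object_hook: lowercase each key and drop non-word characters."""
    return {re.sub(r"\W+", "", key.lower()): value for key, value in obj.items()}


def split_records(blob):
    """Yield every JSON object found in a string of back-to-back objects."""
    decoder = json.JSONDecoder(object_hook=normalize_keys)
    pos = 0
    while pos < len(blob):
        if blob[pos].isspace():
            pos += 1  # raw_decode does not skip leading whitespace itself
            continue
        record, pos = decoder.raw_decode(blob, pos)
        yield record


# Made-up payload: two records written back to back with no delimiter.
sample = (
    '{"ContactId": "c-1", "Agent": {"Username": "jdoe"}, "Queue": {"ARN": "q-1"}}'
    '{"ContactId": "c-2", "Agent": null, "Queue": {"ARN": "q-2"}}'
)

records = list(split_records(sample))
for record in records:
    print(json.dumps(record))
```

One thing to keep in mind with any normalization like this: two keys that differ only in case or punctuation collapse into one, and the later value wins.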
We trigger AWS Glue crawlers based on events because this approach lets us capture any new fields in the data as soon as they appear. CTR attributes are designed to offer multiple custom options based on a particular call flow. Attributes are essentially key-value pairs in nested JSON format. With the help of event-based AWS Glue crawlers, you can easily identify newer attributes automatically.
We recommend setting up an S3 lifecycle policy on the flatfiles folder that keeps records only for 24 hours. Doing this optimizes AWS Glue ETL jobs to process a subset of files rather than the entire set of records.
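As a rough illustration, such a lifecycle rule could be expressed as the dictionary below. This is a sketch, not part of the provided template: the function name and rule ID are hypothetical, and it assumes the records live under the `flatfiles/` prefix. S3 expresses expiration in whole days, so one day is the closest match to a 24-hour retention.

```python
import json


def flatfiles_lifecycle(prefix="flatfiles/", days=1):
    """Build an S3 lifecycle configuration that expires objects under a prefix."""
    return {
        "Rules": [
            {
                "ID": "expire-flatfiles",  # hypothetical rule name
                "Filter": {"Prefix": prefix},
                "Status": "Enabled",
                "Expiration": {"Days": days},  # 1 day, the nearest unit to 24 hours
            }
        ]
    }


config = flatfiles_lifecycle()
print(json.dumps(config, indent=2))
```

With boto3, a dictionary of this shape can be passed as the `LifecycleConfiguration` argument of `put_bucket_lifecycle_configuration`.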
After we have data in the flatfiles folder, we use AWS Glue to catalog the data and transform it into Parquet format inside a folder called parquet/ctr/. The AWS Glue job performs the ETL that transforms the data from JSON to Parquet format. We use AWS Glue crawlers to capture any new fields inside the JSON data, so that the schema stays dynamic. What this means is that when you add new attributes to an Amazon Connect instance, the solution automatically recognizes them and incorporates them in the schema of the results.
After AWS Glue stores the results in Parquet format, you can perform analytics using Amazon Redshift Spectrum, Amazon Athena, or any third-party data warehouse platform. To keep this solution simple, we have used Amazon Athena for analytics. Amazon Athena allows us to query data without having to set up and manage any servers or data warehouse platforms. Additionally, we only pay for the queries that are executed.
Try it out!
You can get started with our sample AWS CloudFormation template. This template creates the components starting from the Kinesis stream and finishes up with S3 buckets, the AWS Glue job, and crawlers. To deploy the template, open the AWS Management Console by clicking the following link.
In the console, specify the following parameters:
BucketName: The name for the bucket to store all the solution files. This name must be unique; if it’s not, template creation fails.
etlJobSchedule: The schedule in cron format indicating how often the AWS Glue job runs. The default value is every hour.
KinesisStreamName: The name of the Kinesis stream to receive data from Amazon Connect. This name must be different from any other Kinesis stream created in your AWS account.
s3interval: The interval in seconds for Kinesis Firehose to save data inside the flatfiles folder on S3. The value must be between 60 and 900 seconds.
sampledata: When this parameter is set to true, sample CTR records are used. Doing this lets you try this solution without setting up an Amazon Connect instance. All examples in this walkthrough use this sample data.
Select the “I acknowledge that AWS CloudFormation might create IAM resources.” check box, and then choose Create. After the template finishes creating resources, you can see the stream name on the stack Outputs tab.
If you haven’t created your Amazon Connect instance, you can do so by following the Getting Started Guide. When you are done creating, choose your Amazon Connect instance in the console, which takes you to instance settings. Choose Data streaming to enable streaming for CTR records. Here, you can choose the Kinesis stream (defined in the KinesisStreamName parameter) that was created by the CloudFormation template.
Now it’s time to generate the data by making or receiving calls by using Amazon Connect. You can go to Amazon Connect Cloud Control Panel (CCP) to make or receive calls using a software phone or desktop phone. After a few minutes, we should see data inside the flatfiles folder. To make it easier to try this solution, we provide sample data that you can enable by setting the sampledata parameter to true in your CloudFormation template.
In the AWS Glue console, choose Jobs in the left navigation pane. We can select our job here. In my case, the job created by CloudFormation is called glueJob-i3TULzVtP1W0; yours should be similar. You run the job by choosing Run job for Action.
After that, we wait for the AWS Glue job to run and to finish successfully. We can track the status of the job by checking the History tab.
When the job finishes running, we can check the Database section. There should be a new table created called ctr in Parquet format.
To query the data with Athena, we can select the ctr table, and for Action choose View data.
Doing this takes us to the Athena console. If you run a query, Athena shows a preview of the data.
When we can query the data using Athena, we can visualize it using Amazon QuickSight. Before connecting Amazon QuickSight to Athena, we must make sure to grant Amazon QuickSight access to Athena and the associated S3 buckets in the account. For more information on doing this, see Managing Amazon QuickSight Permissions to AWS Resources in the Amazon QuickSight User Guide. We can then create a new data set in Amazon QuickSight based on the Athena table that was created.
After setting up permissions, we can create a new analysis in Amazon QuickSight by choosing New analysis.
Then we add a new data set.
We choose Athena as the source and give the data source a name (in this case, I named it connectctr).
Choose the name of the database and the table referencing the Parquet results.
Then choose Visualize.
After that, we should see the following screen.
Now we can create some visualizations. First, search for the agent.username column, and drag it to the AutoGraph section.
We can see the agents and the number of calls for each, so we can easily see which agents have taken the largest number of calls. If we want to see from what queues the calls came for each agent, we can add the queue.arn column to the visual.
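The aggregation behind this visual is a count of records grouped by agent, and then by agent and queue. A minimal sketch in plain Python, using made-up rows shaped like the agent.username and queue.arn columns (the values are hypothetical):

```python
from collections import Counter

# Made-up rows in the shape of the ctr table's agent.username and queue.arn columns.
calls = [
    {"agent": {"username": "jdoe"}, "queue": {"arn": "queue-a"}},
    {"agent": {"username": "jdoe"}, "queue": {"arn": "queue-b"}},
    {"agent": {"username": "asmith"}, "queue": {"arn": "queue-a"}},
    {"agent": {"username": "jdoe"}, "queue": {"arn": "queue-a"}},
]

# Calls per agent: what the first visual shows.
calls_per_agent = Counter(call["agent"]["username"] for call in calls)

# Calls per agent and queue: what adding queue.arn to the visual shows.
calls_per_agent_queue = Counter(
    (call["agent"]["username"], call["queue"]["arn"]) for call in calls
)

print(calls_per_agent.most_common())
print(calls_per_agent_queue.most_common())
```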
After following all these steps, you can use Amazon QuickSight to add different columns from the call records and perform different types of visualizations. You can build dashboards that continuously monitor your Amazon Connect instance. You can share those dashboards with others in your organization who might need to see this data.
Conclusion
In this post, you see how you can use services like AWS Lambda, AWS Glue, and Amazon Athena to process Amazon Connect call records. The post also demonstrates how to use AWS Lambda to preprocess files in Amazon S3 and transform them into a format that is recognized by AWS Glue crawlers. Finally, the post shows how to use Amazon QuickSight to perform visualizations.
You can use the provided template to analyze your own contact center instance. Or you can take the CloudFormation template and modify it to process other data streams that can be ingested using Amazon Kinesis or stored on Amazon S3.
Luis Caro is a Big Data Consultant for AWS Professional Services. He works with our customers to provide guidance and technical assistance on big data projects, helping them improve the value of their solutions when using AWS.
Peter Dalbhanjan is a Solutions Architect for AWS based in Herndon, VA. Peter has a keen interest in evangelizing AWS solutions and has written multiple blog posts that focus on simplifying complex use cases. At AWS, Peter helps with designing and architecting a variety of customer workloads.
Thanks to Susan Ferrell, Senior Technical Writer, for a great blog post on how to use CodeCommit branch-level permissions.
AWS CodeCommit users have been asking for a way to restrict commits to some repository branches to just a few people. In this blog post, we’re going to show you how to do that by creating and applying a conditional policy, an AWS Identity and Access Management (IAM) policy that contains a context key.
Why would I do this?
When you create a branch in an AWS CodeCommit repository, the branch is available, by default, to all repository users. Here are some scenarios in which refining access might help you:
You maintain a branch in a repository for production-ready code, and you don’t want to allow changes to this branch except from a select group of people.
You want to limit the number of people who can make changes to the default branch in a repository.
You want to ensure that pull requests cannot be merged to a branch except by an approved group of developers.
We’ll show you how to create a policy in IAM that prevents users from pushing commits to and merging pull requests to a branch named master. You’ll attach that policy to one group or role in IAM, and then test how users in that group are affected when that policy is applied. We’ll explain how it works, so you can create custom policies for your repositories.
What you need to get started
You’ll need to sign in to AWS with sufficient permissions to:
Create and apply policies in IAM.
Create groups in IAM.
Add users to those groups.
Apply policies to those groups.
You can use existing IAM groups, but because you’re going to be changing permissions, you might want to first test this out on groups and users you’ve created specifically for this purpose.
You’ll need a repository in AWS CodeCommit with at least two branches: master and test-branch. For information about how to create repositories, see Create a Repository. For information about how to create branches, see Create a Branch. In this blog post, we’ve named the repository MyDemoRepo. You can use an existing repository with branches of another name, if you prefer.
Let’s get started!
Create two groups in IAM
We’re going to set up two groups in IAM: Developers and Senior_Developers. To start, both groups will have the same managed policy, AWSCodeCommitPowerUser, applied. Users in each group will have exactly the same permissions to perform actions in IAM.
Figure 1: Two example groups in IAM, with distinct users but the same managed policy applied to each group
In the navigation pane, choose Groups, and then choose Create New Group.
In the Group Name box, type Developers, and then choose Next Step.
In the list of policies, select the check box for AWSCodeCommitPowerUser, then choose Next Step.
Choose Create Group.
Now, repeat these steps to create the Senior_Developers group and attach the AWSCodeCommitPowerUser managed policy. You now have two empty groups with the same policy attached.
Create users in IAM
Next, add at least one unique user to each group. You can use existing IAM users, but because you’ll be affecting their access to AWS CodeCommit, you might want to create two users just for testing purposes. Let’s go ahead and create Arnav and Mary.
In the navigation pane, choose Users, and then choose Add user.
For the new user, type Arnav_Desai.
Choose Add another user, and then type Mary_Major.
Select the type of access (programmatic access, access to the AWS Management Console, or both). In this blog post, we’ll be testing everything from the console, but if you want to test AWS CodeCommit using the AWS CLI, make sure you include programmatic access and console access.
For Console password type, choose Custom password. Each user is assigned the password that you type in the box. Write these down so you don’t forget them. You’ll need to sign in to the console using each of these accounts.
Choose Next: Permissions.
On the Set permissions page, choose Add user to group. Add Arnav to the Developers group. Add Mary to the Senior_Developers group.
Choose Next: Review to see all of the choices you made up to this point. When you are ready to proceed, choose Create user.
Sign in as Arnav, and then follow these steps to go to the master branch and add a file. Then sign in as Mary and follow the same steps.
On the Dashboard page, from the list of repositories, choose MyDemoRepo.
In the Code view, choose the branch named master.
Choose Add file, and then choose Create file. Type some text or code in the editor.
Provide information to other users about who added this file to the repository and why.
In Author name, type the name of the user (Arnav or Mary).
In Email address, type an email address so that other repository users can contact you about this change.
In Commit message, type a brief description to help you remember why you added this file or any other details you might find helpful.
Type a name for the file.
Choose Commit file.
Now follow the same steps to add a file in a different branch. (In our example repository, that’s the branch named test-branch.) You should be able to add a file to both branches regardless of whether you’re signed in as Arnav or Mary.
Let’s change that.
Create a conditional policy in IAM
You’re going to create a policy in IAM that will deny API actions if certain conditions are met. We want to prevent users with this policy applied from updating a branch named master, but we don’t want to prevent them from viewing the branch, cloning the repository, or creating pull requests that will merge to that branch. For this reason, we want to pick and choose our APIs carefully. Looking at the Permissions Reference, the logical permissions for this are:
GitPush
PutFile
MergePullRequestByFastForward
Now’s the time to think about what else you might want this policy to do. For example, because we don’t want users with this policy to make changes to this branch, we probably don’t want them to be able to delete it either, right? So let’s add one more permission:
DeleteBranch
The branch in which we want to deny these actions is master. The repository in which the branch resides is MyDemoRepo. We’re going to need more than just the repository name, though. We need the repository ARN. Fortunately, that’s easy to find. Just go to the AWS CodeCommit console, choose the repository, and choose Settings. The repository ARN is displayed on the General tab.
Now we’re ready to create a policy.
1. Open the IAM console at https://console.aws.amazon.com/iam/. Make sure you’re signed in with the account that has sufficient permissions to create policies, and not as Arnav or Mary.
2. In the navigation pane, choose Policies, and then choose Create policy.
3. Choose JSON, and then paste in the following:
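A sketch of a policy that matches this description is given here as a short Python script that prints the JSON to paste. It denies the four permissions listed earlier on one repository when the request references refs/heads/master, and includes the null check discussed in this post. The `codecommit:References` condition key and the `StringEqualsIfExists` and `Null` operators follow the pattern in the AWS CodeCommit User Guide; the Region and account ID in the ARN are placeholders, and the variable names are only for illustration.

```python
import json

# Placeholder ARN: substitute the Region, account ID, and name of your own repository.
REPOSITORY_ARN = "arn:aws:codecommit:us-east-2:111111111111:MyDemoRepo"

policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Deny",
            "Action": [
                "codecommit:GitPush",
                "codecommit:DeleteBranch",
                "codecommit:PutFile",
                "codecommit:MergePullRequestByFastForward",
            ],
            "Resource": REPOSITORY_ARN,
            "Condition": {
                # First part: the request refers to the master branch.
                "StringEqualsIfExists": {
                    "codecommit:References": ["refs/heads/master"]
                },
                # Second part: the request actually carries a reference.
                "Null": {"codecommit:References": "false"},
            },
        }
    ],
}

print(json.dumps(policy, indent=4))
```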
You’ll notice a few things here. First, change the repository ARN to the ARN for your repository and include the repository name. Second, if you want to restrict access to a branch with a name different from our example, master, change that reference too.
Now let’s talk about this policy and what it does. You might be wondering why we’re using a Git reference (refs/heads) value instead of just the branch name. The answer lies in how Git references things, and how AWS CodeCommit, as a Git-based repository service, implements its APIs. A branch in Git is a simple pointer (reference) to the SHA-1 value of the head commit for that branch.
You might also be wondering about the second part of the condition, the nullification language. This is necessary because of the way git push and git-receive-pack work. Without going into too many technical details, when you attempt to push a change from a local repo to AWS CodeCommit, an initial reference call is made to AWS CodeCommit without any branch information. AWS CodeCommit evaluates that initial call to ensure that:
a) You’re authorized to make calls.
b) A repository exists with the name specified in the initial call.
If you left that null out of the policy, users with that policy would be unable to complete any pushes from their local repos to the AWS CodeCommit remote repository at all, regardless of which branch they were trying to push their commits to.
Could you write a policy in such a way that the null is not required? Of course. IAM policy language is flexible. There’s an example of how to do this in the AWS CodeCommit User Guide, if you’re curious. But for the purposes of this blog post, let’s continue with this policy as written.
So what have we essentially said in this policy? We’ve asked IAM to deny the relevant CodeCommit permissions if the request is made to the resource MyDemoRepo and it meets the following condition: the reference is to refs/heads/master. Otherwise, the deny does not apply.
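The decision described here can be modeled in a few lines of Python. This is a deliberately simplified sketch, not IAM's evaluation engine: it assumes one reference per request, and the function names are hypothetical. It shows why a request that carries no branch information, such as the initial call of a push, is not denied.

```python
DENIED_ACTIONS = {"GitPush", "DeleteBranch", "PutFile", "MergePullRequestByFastForward"}
PROTECTED_REF = "refs/heads/master"


def branch_to_ref(branch):
    """Return the full Git reference for a branch name."""
    return "refs/heads/" + branch


def is_denied(action, reference=None):
    """Return True when the deny statement would match this request."""
    if action not in DENIED_ACTIONS:
        return False  # viewing, cloning, creating pull requests, and so on
    if reference is None:
        return False  # no reference in the request, so the null check fails
    return reference == PROTECTED_REF


print(is_denied("GitPush"))                                # initial call of a push
print(is_denied("GitPush", branch_to_ref("master")))       # push to master
print(is_denied("GitPush", branch_to_ref("test-branch")))  # push to another branch
```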
I’m sure you’re wondering if this policy has to be constrained to a specific repository resource like MyDemoRepo. After all, it would be awfully convenient if a single policy could apply to all branches in any repository in an AWS account, particularly since the default branch in any repository is initially the master branch. Good news! Simply replace the ARN with an *, and your policy will affect ALL branches named master in every AWS CodeCommit repository in your AWS account. Make sure that this is really what you want, though. We suggest you start by limiting the scope to just one repository, and then changing things when you’ve tested it and are happy with how it works.
When you’re sure you’ve modified the policy for your environment, choose Review policy to validate it. Give this policy a name, such as DenyChangesToMaster, provide a description of its purpose, and then choose Create policy.
Now that you have a policy, it’s time to apply and test it.
Apply the policy to a group
In theory, you could apply the policy you just created directly to any IAM user, but that really doesn’t scale well. You should apply this policy to a group, if you use IAM groups to manage users, or to a role, if your users assume a role when interacting with AWS resources.
In the IAM console, choose Groups, and then choose Developers.
On the Permissions tab, choose Attach Policy.
Choose DenyChangesToMaster, and then choose Attach policy.
Your groups now have a critical difference: users in the Developers group have an additional policy applied that restricts their actions in the master branch. In other words, Mary can continue to add files, push commits, and merge pull requests in the master branch, but Arnav cannot.
Figure 2: Two example groups in IAM, one with an additional policy applied that will prevent users in this group from making changes to the master branch
Test it out. Sign in as Arnav, and do the following:
On the Dashboard page, from the list of repositories, choose MyDemoRepo.
In the Code view, choose the branch named master.
Choose Add file, and then choose Create file, just as you did before. Provide some text, and then add the file name and your user information.
Choose Commit file.
This time you’ll see an error after choosing Commit file. It’s not a pretty message, but at the very end, you’ll see a telling phrase: “explicit deny”. That’s the policy in action. You, as Arnav, are explicitly denied PutFile, which prevents you from adding a file to the master branch. You’ll see similar results if you try other actions denied by that policy, such as deleting the master branch.
Stay signed in as Arnav, but this time add a file to test-branch. You should be able to add a file without seeing any errors. You can create a branch based on the master branch, add a file to it, and create a pull request that will merge to the master branch, all just as before. However, you cannot perform denied actions on that master branch.
Sign out as Arnav and sign in as Mary. You’ll see that as that IAM user, you can add and edit files in the master branch, merge pull requests to it, and even, although we don’t recommend this, delete it.
Conclusion
You can use conditional statements in policies in IAM to refine how users interact with your AWS CodeCommit repositories. This blog post showed how to use such a policy to prevent users from making changes to a branch named master. There are many other options. We hope this blog post will encourage you to experiment with AWS CodeCommit, IAM policies, and permissions. If you have any questions or suggestions, we’d love to hear from you.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easier to prepare and load your data for analytics. You can create and run an ETL job with a few clicks on the AWS Management Console. Just point AWS Glue to your data store. AWS Glue discovers your data and stores the associated metadata (for example, a table definition and schema) in the AWS Glue Data Catalog.
AWS Glue has native connectors to data sources using JDBC drivers, either on AWS or elsewhere, as long as there is IP connectivity. In this post, we demonstrate how to connect to data sources that are not natively supported in AWS Glue today. We walk through connecting to and running ETL jobs against two such data sources, IBM DB2 and SAP Sybase. However, you can use the same process with any other JDBC-accessible database.
AWS Glue data sources
AWS Glue natively supports the following data stores by using the JDBC protocol:
One of the fastest growing architectures deployed on AWS is the data lake. The ETL processes that are used to ingest, clean, transform, and structure data are critically important for this architecture. Having the flexibility to interoperate with a broader range of database engines allows for a quicker adoption of the data lake architecture.
For data sources that AWS Glue doesn’t natively support, such as IBM DB2, Pivotal Greenplum, SAP Sybase, or any other relational database management system (RDBMS), you can import custom database connectors from Amazon S3 into AWS Glue jobs. In this case, the connection to the data source must be made from the AWS Glue script to extract the data, rather than using AWS Glue connections. To learn more, see Providing Your Own Custom Scripts in the AWS Glue Developer Guide.
Setting up an ETL job for an IBM DB2 data source
The first example demonstrates how to connect the AWS Glue ETL job to an IBM DB2 instance, transform the data from the source, and store it in Apache Parquet format in Amazon S3. To successfully create the ETL job using an external JDBC driver, you must define the following:
The S3 location of the job script
The S3 location of the temporary directory
The S3 location of the JDBC driver
The S3 location of the Parquet data (output)
The IAM role for the job
By default, AWS Glue suggests bucket names for the scripts and the temporary directory using the following format:
Keep in mind that having the AWS Glue job and S3 buckets in the same AWS Region helps save on cross-Region data transfer fees. For this post, we will work in the US East (Ohio) Region (us-east-2).
Creating the IAM role
The next step is to set up the IAM role that the ETL job will use:
Sign in to the AWS Management Console, and search for IAM.
On the IAM console, choose Roles in the left navigation pane.
Choose Create role. The role type of trusted entity must be an AWS service, specifically AWS Glue.
Choose Next: Permissions.
Search for the AWSGlueServiceRole policy, and select it.
Search again, now for the SecretsManagerReadWrite policy. This policy allows the AWS Glue job to access database credentials that are stored in AWS Secrets Manager.
CAUTION: This policy is open and is being used for testing purposes only. You should create a custom policy to narrow the access just to the secrets that you want to use in the ETL job.
Select this policy, and choose Next: Review.
Give your role a name, for example, GluePermissions, and confirm that both policies were selected.
Choose Create role.
Now that you have created the IAM role, it’s time to upload the JDBC driver to the defined location in Amazon S3. For this example, we will use the DB2 driver, which is available on the IBM Support site.
Storing database credentials
It is a best practice to store database credentials in a safe store. In this case, we use AWS Secrets Manager to securely store credentials. Follow these steps to create those credentials:
Open the console, and search for Secrets Manager.
In the AWS Secrets Manager console, choose Store a new secret.
Under Select a secret type, choose Other type of secrets.
In the Secret key/value, set one row for each of the following parameters:
db_username
db_password
db_url (for example, jdbc:db2://10.10.12.12:50000/SAMPLE)
db_table
driver_name (com.ibm.db2.jcc.DB2Driver)
output_bucket (for example, aws-glue-data-output-1234567890-us-east-2/User)
Choose Next.
For Secret name, use DB2_Database_Connection_Info.
Choose Next.
Keep the Disable automatic rotation check box selected.
Choose Next.
Choose Store.
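Secrets Manager returns a key/value secret like this one as a JSON string. As a minimal sketch of how a job script might consume it, the hypothetical helper below maps the keys defined above onto the option names that Spark's JDBC reader uses (`url`, `dbtable`, `user`, `password`, `driver`). The table name and credentials in the sample are made up, and the retrieval call itself is left out so that the example runs without AWS access.

```python
import json


def jdbc_options_from_secret(secret_string):
    """Turn the secret's JSON string into JDBC reader options and an output path."""
    secret = json.loads(secret_string)
    options = {
        "url": secret["db_url"],
        "dbtable": secret["db_table"],
        "user": secret["db_username"],
        "password": secret["db_password"],
        "driver": secret["driver_name"],
    }
    output_path = "s3://" + secret["output_bucket"]
    return options, output_path


# Made-up secret contents using the keys created in the steps above.
sample_secret = json.dumps({
    "db_username": "db2user",
    "db_password": "example-password",
    "db_url": "jdbc:db2://10.10.12.12:50000/SAMPLE",
    "db_table": "EMPLOYEE",
    "driver_name": "com.ibm.db2.jcc.DB2Driver",
    "output_bucket": "aws-glue-data-output-1234567890-us-east-2/User",
})

options, output_path = jdbc_options_from_secret(sample_secret)
print(options["url"], options["dbtable"], output_path)
```

Keeping the mapping in one place means that switching to another database only requires pointing the job at a different secret.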
Adding a job in AWS Glue
The next step is to author the AWS Glue job, following these steps:
In the AWS Management Console, search for AWS Glue.
In the navigation pane on the left, choose Jobs under ETL.
Choose Add job.
Fill in the basic Job properties:
Give the job a name (for example, db2-job).
Choose the IAM role that you created previously (GluePermissions).
For This job runs, choose A new script to be authored by you.
For ETL language, choose Python.
In the Script libraries and job parameters section, choose the location of your JDBC driver for Dependent jars path.
Choose Next.
On the Connections page, choose Next.
On the summary page, choose Save job and edit script. This creates the job and opens the script editor.
In the editor, replace the existing code with the following script. Important: Line 47 of the script corresponds to the mapping of the fields in the source table to the destination, dropping of the null fields to save space in the Parquet destination, and finally writing to Amazon S3 in Parquet format.
Choose the black X on the right side of the screen to close the editor.
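As a rough illustration of the pattern such a job relies on, the following Python sketch shows how the key/value rows stored in AWS Secrets Manager could be mapped onto Spark JDBC reader options. The helper names load_secret and jdbc_read_options are hypothetical, the boto3 call assumes that the job role is allowed to read the secret, and this is a sketch under those assumptions rather than the script from this post.

```python
import json


def load_secret(secret_name, region_name):
    """Fetch and parse the secret (assumes boto3 is available and the role can read it)."""
    import boto3  # imported lazily so the pure helper below works without AWS access

    client = boto3.client("secretsmanager", region_name=region_name)
    response = client.get_secret_value(SecretId=secret_name)
    return json.loads(response["SecretString"])


def jdbc_read_options(secret):
    """Map the secret's rows onto the option names used by the Spark JDBC reader."""
    required = ("db_url", "db_table", "db_username", "db_password", "driver_name")
    missing = [key for key in required if key not in secret]
    if missing:
        raise KeyError(f"secret is missing keys: {missing}")
    return {
        "url": secret["db_url"],
        "dbtable": secret["db_table"],
        "user": secret["db_username"],
        "password": secret["db_password"],
        "driver": secret["driver_name"],
    }
```

Inside a job, the resulting dictionary could be passed to the reader with spark.read.format("jdbc").options(**jdbc_read_options(secret)).load(), which keeps credentials out of the script itself.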
Running the ETL job
Now that you have created the job, the next step is to execute it as follows:
On the Jobs page, select your new job. On the Action menu, choose Run job, and confirm that you want to run the job. Wait a few moments while the job finishes running.
After the job shows as Succeeded, choose Logs to read the output of the job.
In the output of the job, you will find the schema printed by df.printSchema() and the message containing the row count from df.count().
Also, if you go to your output bucket in S3, you will find the Parquet result of the ETL job.
Using AWS Glue, you have created an ETL job that connects to an existing database using an external JDBC driver. It enables you to execute any transformation that you need.
Setting up an ETL job for an SAP Sybase data source
In this section, we describe how to create an AWS Glue ETL job against an SAP Sybase data source. The process mentioned in the previous section works for a Sybase data source with a few changes required in the job:
While creating the job, choose the correct jar for the JDBC dependency.
In the script, change the reference to the secret to be used from AWS Secrets Manager:
After you successfully execute the new ETL job, the output contains the same type of information that was generated with the DB2 data source.
Note that each of these JDBC drivers has its own nuances and different licensing terms that you should be aware of before using them.
Maximizing JDBC read parallelism
When you work with large data sources, keep memory consumption in mind. In some cases, “Out of Memory” errors are generated when all the data is read into a single executor. One way to avoid this is to rely on the read parallelism that you can implement with Apache Spark and AWS Glue. To learn more, see the Apache Spark SQL module.
You can use the following options:
partitionColumn: The name of an integer column that is used for partitioning.
lowerBound: The minimum value of partitionColumn that is used to decide partition stride.
upperBound: The maximum value of partitionColumn that is used to decide partition stride.
numPartitions: The number of partitions. This, along with lowerBound (inclusive) and upperBound (exclusive), forms partition strides for the generated WHERE clause expressions used to split the partitionColumn. When unset, this defaults to SparkContext.defaultParallelism.
Those options specify the parallelism of the table read. lowerBound and upperBound decide the partition stride, but they don’t filter the rows in the table. Therefore, Spark partitions and returns all rows in the table. For example:
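To make the stride idea concrete, here is a simplified Python approximation of how the bounds can be turned into one WHERE clause per partition. The function name partition_predicates is hypothetical, and the integer stride calculation is an assumption that mirrors the general approach rather than the exact logic of any particular Spark release.

```python
def partition_predicates(column, lower_bound, upper_bound, num_partitions):
    """Return one WHERE-clause string per partition (None means: read everything)."""
    if num_partitions <= 1 or lower_bound >= upper_bound:
        return [None]
    # Never create more partitions than there are values in the range.
    num_partitions = min(num_partitions, upper_bound - lower_bound)
    stride = upper_bound // num_partitions - lower_bound // num_partitions

    predicates = []
    current = lower_bound
    for index in range(num_partitions):
        # The first partition has no lower limit and the last has no upper limit,
        # so rows outside [lower_bound, upper_bound) are still read.
        lower = f"{column} >= {current}" if index != 0 else None
        current += stride
        upper = f"{column} < {current}" if index != num_partitions - 1 else None
        if lower and upper:
            predicates.append(f"{lower} AND {upper}")
        elif upper:
            predicates.append(f"{upper} OR {column} IS NULL")
        else:
            predicates.append(lower)
    return predicates
```

For example, partition_predicates("id", 0, 100, 4) yields four clauses: id < 25 OR id IS NULL, id >= 25 AND id < 50, id >= 50 AND id < 75, and id >= 75. The first and last clauses are open-ended, which is why the bounds shape the strides without filtering any rows.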
It’s important to be careful with the number of partitions because too many partitions could also result in Spark crashing your external database systems.
Conclusion
Using the process described in this post, you can connect to and run AWS Glue ETL jobs against any data source that can be reached using a JDBC driver. This includes new generations of common analytical databases like Greenplum and others.
You can improve the query efficiency of these datasets by using partitioning and pushdown predicates. For more information, see Managing Partitions for ETL Output in AWS Glue. This technique opens the door to moving data and feeding data lakes in hybrid environments.
Kapil Shardha is a Technical Account Manager and supports enterprise customers with their AWS adoption. He has a background in infrastructure automation and DevOps.
William Torrealba is an AWS Solutions Architect supporting customers with their AWS adoption. He has a background in Application Development, Highly Available Distributed Systems, Automation, and DevOps.
AWS Glue is an increasingly popular way to develop serverless ETL (extract, transform, and load) applications for big data and data lake workloads. Organizations that transform their ETL applications to cloud-based, serverless ETL architectures need a seamless, end-to-end continuous integration and continuous delivery (CI/CD) pipeline: from source code, to build, to deployment, to product delivery. Having a good CI/CD pipeline can help your organization discover bugs before they reach production and deliver updates more frequently. It can also help developers write quality code and automate the ETL job release management process, mitigate risk, and more.
AWS Glue is a fully managed data catalog and ETL service. It simplifies and automates the difficult and time-consuming tasks of data discovery, conversion, and job scheduling. AWS Glue crawls your data sources and constructs a data catalog using pre-built classifiers for popular data formats and data types, including CSV, Apache Parquet, JSON, and more.
When you are developing ETL applications using AWS Glue, you might come across some of the following CI/CD challenges:
Iterative development with unit tests
Continuous integration and build
Pushing the ETL pipeline to a test environment
Pushing the ETL pipeline to a production environment
Testing ETL applications using real data (live test)
The following diagram shows the pipeline workflow:
This solution uses AWS CodePipeline, which lets you orchestrate and automate the test and deploy stages for ETL application source code. The solution consists of a pipeline that contains the following stages:
1.) Source Control: In this stage, the AWS Glue ETL job source code and the AWS CloudFormation template file for deploying the ETL jobs are both committed to version control. I chose to use AWS CodeCommit for version control.
2.) LiveTest: In this stage, all resources—including AWS Glue crawlers, jobs, S3 buckets, roles, and other resources that are required for the solution—are provisioned, deployed, live tested, and cleaned up.
The LiveTest stage includes the following actions:
Deploy: In this action, all the resources that are required for this solution (crawlers, jobs, buckets, roles, and so on) are provisioned and deployed using an AWS CloudFormation template.
AutomatedLiveTest: In this action, all the AWS Glue crawlers and jobs are executed and data exploration and validation tests are performed. These validation tests include, but are not limited to, record counts in both raw tables and transformed tables in the data lake and any other business validations. I used AWS CodeBuild for this action.
LiveTestApproval: This action is included for the cases in which a pipeline administrator approval is required to deploy/promote the ETL applications to the next stage. The pipeline pauses in this action until an administrator manually approves the release.
LiveTestCleanup: In this action, all the LiveTest stage resources, including test crawlers, jobs, roles, and so on, are deleted using the AWS CloudFormation template. This action helps minimize cost by ensuring that the test resources exist only for the duration of the AutomatedLiveTest and LiveTestApproval actions.
3.) DeployToProduction: In this stage, all the resources are deployed using the AWS CloudFormation template to the production environment.
Try it out
This pipeline takes approximately 20 minutes to complete the LiveTest stage (up to the LiveTestApproval action, in which manual approval is required).
To get started with this solution, choose Launch Stack:
This creates the CI/CD pipeline with all of its stages, as described earlier. It performs an initial commit of the sample AWS Glue ETL job source code to trigger the first release change.
In the AWS CloudFormation console, choose Create. After the template finishes creating resources, you see the pipeline name on the stack Outputs tab.
After that, open the CodePipeline console and select the newly created pipeline. Initially, your pipeline’s CodeCommit stage shows that the source action failed.
Allow a few minutes for your new pipeline to detect the initial commit applied by the CloudFormation stack creation. As soon as the commit is detected, your pipeline starts. You will see the successful stage completion status as soon as the CodeCommit source stage runs.
In the CodeCommit console, choose Code in the navigation pane to view the solution files.
Next, you can watch how the pipeline goes through the LiveTest stage of the deploy and AutomatedLiveTest actions, until it finally reaches the LiveTestApproval action.
At this point, if you check the AWS CloudFormation console, you can see that a new template has been deployed as part of the LiveTest deploy action.
At this point, make sure that the AWS Glue crawlers and the AWS Glue job ran successfully. Also check whether the corresponding databases and external tables have been created in the AWS Glue Data Catalog. Then verify the data using Amazon Athena, as follows.
Open the AWS Glue console, and choose Databases in the navigation pane. You will see the following databases in the Data Catalog:
Open the Amazon Athena console, and run the following queries. Verify that the record counts match.
SELECT count(*) FROM "nycitytaxi_gluedemocicdtest"."data";
SELECT count(*) FROM "nytaxiparquet_gluedemocicdtest"."datalake";
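The same record-count comparison can be automated. The following Python sketch is one possible shape for such a check: count_query and counts_match are hypothetical helper names, and run_count stands for any callable you supply that executes a query (for example, through Amazon Athena) and returns the count as an integer.

```python
def count_query(database, table):
    """Build the count query for one table in the Data Catalog."""
    return f'SELECT count(*) FROM "{database}"."{table}";'


def counts_match(run_count, raw, transformed):
    """Compare record counts of the raw and transformed tables.

    raw and transformed are (database, table) pairs; run_count is a
    caller-supplied function that runs a SQL string and returns an int.
    """
    raw_count = run_count(count_query(*raw))
    transformed_count = run_count(count_query(*transformed))
    return raw_count == transformed_count
```

Keeping query execution behind a callable makes the comparison easy to exercise with a stub before pointing it at a live test environment.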
The following shows the raw data:
The following shows the transformed data:
The pipeline pauses at this action until the release is approved. After validating the data, manually approve the revision in the LiveTestApproval action on the CodePipeline console.
Add comments as needed, and choose Approve.
The LiveTestApproval stage now appears as Approved on the console.
After the revision is approved, the pipeline proceeds to use the AWS CloudFormation template to destroy the resources that were deployed in the LiveTest deploy action. This helps reduce cost and ensures a clean test environment on every deployment.
Production deployment is the final stage. In this stage, all the resources—AWS Glue crawlers, AWS Glue jobs, Amazon S3 buckets, roles, and so on—are provisioned and deployed to the production environment using the AWS CloudFormation template.
After successfully running the whole pipeline, feel free to experiment with it by changing the source code stored on AWS CodeCommit. For example, if you modify the AWS Glue ETL job to generate an error, it should make the AutomatedLiveTest action fail. Or if you change the AWS CloudFormation template to make its creation fail, it should affect the LiveTest deploy action. The objective of the pipeline is to ensure that all changes deployed to production work as expected.
Conclusion
In this post, you learned how easy it is to implement CI/CD for serverless AWS Glue ETL solutions with AWS developer tools like AWS CodePipeline and AWS CodeBuild at scale. Implementing such solutions can help you accelerate ETL development and testing at your organization.
If you have questions or suggestions, please comment below.
Prasad Alle is a Senior Big Data Consultant with AWS Professional Services. He spends his time leading and building scalable, reliable Big data, Machine learning, Artificial Intelligence and IoT solutions for AWS Enterprise and Strategic customers. His interests extend to various technologies such as Advanced Edge Computing, Machine learning at Edge. In his spare time, he enjoys spending time with his family.
Luis Caro is a Big Data Consultant for AWS Professional Services. He works with our customers to provide guidance and technical assistance on big data projects, helping them improve the value of their solutions when using AWS.
Thanks to Raja Mani, AWS Solutions Architect, for this great blog.
—
In this blog post, I’ll walk you through the steps for setting up continuous replication of an AWS CodeCommit repository from one AWS region to another AWS region using a serverless architecture. CodeCommit is a fully-managed, highly scalable source control service that stores anything from source code to binaries. It works seamlessly with your existing Git tools and eliminates the need to operate your own source control system. Replicating an AWS CodeCommit repository from one AWS region to another AWS region enables you to achieve lower latency pulls for global developers. This same approach can also be used to automatically back up repositories currently hosted on other services (for example, GitHub or BitBucket) to AWS CodeCommit.
This solution uses AWS Lambda and AWS Fargate for continuous replication. Benefits of this approach include:
The replication process can be easily set up to trigger based on events, such as commits made to the repository.
Setting up a serverless architecture means you don’t need to provision, maintain, or administer servers.
Note: AWS Fargate has a storage limit of 10 GB and is available in the US East (N. Virginia) region. A similar solution that uses Amazon EC2 instances to replicate the repositories on a schedule was published in a previous blog and can be used if your repository does not meet these conditions.
Replication using Fargate
As you follow this blog post, you’ll set up an architecture that looks like this:
Any change in the AWS CodeCommit repository will trigger a Lambda function. The Lambda function will call the Fargate task that replicates the repository using a Git command line tool.
Let us assume a user wants to replicate a repository (Source) from US East (N. Virginia/us-east-1) region to a repository (Destination) in US West (Oregon/us-west-2) region. I’ll walk you through the steps for it:
Prerequisites
Create an AWS Service IAM role for Amazon EC2 that has permission for both source and destination repositories, IAM CreateRole, AttachRolePolicy and Amazon ECR privileges. Here is the EC2 role policy I used:
You need a Docker environment to build this solution. You can launch an EC2 instance and install Docker, or you can use AWS Cloud9, which comes with Docker and Git preinstalled. I used an EC2 instance and installed Docker on it. Use the IAM role created in the previous step when creating the EC2 instance. I refer to this environment as the “Docker Environment” in the following steps.
You need to install the AWS CLI on the Docker environment. For installation instructions, refer to the AWS CLI documentation.
You need to install Git, including a Git command line on the Docker environment.
Step 1: Create the Docker image
To create the Docker image, you first need a Dockerfile. A Dockerfile is a manifest that describes the base image to use for your Docker image and what you want installed and running on it. For more information about Dockerfiles, go to the Dockerfile Reference.
1. Choose a directory in the Docker environment and perform the following steps in that directory. I used /home/ec2-user directory to perform the following steps.
2. Clone the AWS CodeCommit repository in the Docker environment. Open the terminal to the Docker environment and run the following commands to clone your source AWS CodeCommit repository (I ran the commands from /home/ec2-user directory):
Note: Change the URL marked in red to your source and destination repository URL.
3. Create a file called Dockerfile (case sensitive) with the following content (I created it in /home/ec2-user directory):
# Pull the Amazon Linux latest base image
FROM amazonlinux:latest
#Install aws-cli and git command line tools
RUN yum -y install unzip aws-cli
RUN yum -y install git
WORKDIR /home/ec2-user
RUN mkdir LocalRepository
WORKDIR /home/ec2-user/LocalRepository
#Copy Cloned CodeCommit repository to Docker container
COPY ./LocalRepository /home/ec2-user/LocalRepository
#Copy shell script that does the replication
COPY ./repl_repository.bash /home/ec2-user/LocalRepository
RUN chmod ugo+rwx /home/ec2-user/LocalRepository/repl_repository.bash
WORKDIR /home/ec2-user/LocalRepository
#Call this script when Docker starts the container
ENTRYPOINT ["/home/ec2-user/LocalRepository/repl_repository.bash"]
4. Copy the following shell script into a file called repl_repository.bash in the Dockerfile directory in the Docker environment (I created it in the /home/ec2-user directory):
6. Verify that the replication is working by running the repl_repository.bash script from the LocalRepository directory. Go to the LocalRepository directory and run the following command. If it is successful, the last line of the output is “Everything up-to-date”, like this:
$ . ../repl_repository.bash
Everything up-to-date
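As a rough illustration of what a replication step of this kind does, the following Python sketch lists Git commands that pull the latest changes from the source repository and push all branches to the destination. The function names and the exact command sequence are assumptions made for illustration; this is not the repl_repository.bash script from this post. The credential helper settings are the ones commonly used to authenticate Git over HTTPS with AWS CodeCommit.

```python
import subprocess


def replication_commands(destination_url):
    """Return the Git commands for one replication pass (illustrative sequence)."""
    return [
        # Let Git obtain CodeCommit credentials from the AWS CLI.
        ["git", "config", "--global", "credential.helper",
         "!aws codecommit credential-helper $@"],
        ["git", "config", "--global", "credential.UseHttpPath", "true"],
        # Bring the local clone up to date with the source repository.
        ["git", "pull"],
        # Push every branch to the destination repository.
        ["git", "push", destination_url, "--all"],
    ]


def replicate(destination_url, repo_dir):
    """Run the commands inside the local clone, stopping at the first failure."""
    for command in replication_commands(destination_url):
        subprocess.run(command, cwd=repo_dir, check=True)
```

When there is nothing new to push, the final git push prints “Everything up-to-date”, which matches the output shown above.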
Step 2: Build the Docker Image
1. Build the Docker image by running this command from the directory where you created the DockerFile in the Docker environment in the previous step (I ran it from /home/ec2-user directory):
$ docker build . -t ccrepl
Output: It installs various packages and sets environment variables as part of steps 1 to 3 of the Dockerfile. Steps 4 to 11 of the Dockerfile should produce output similar to the following:
2. Run the following command to verify that the image was created successfully. It will display “Everything up-to-date” at the end if it is successful.
$ docker run ccrepl
Everything up-to-date
Step 3: Push the Docker Image to Amazon Elastic Container Registry (ECR)
Perform the following steps in the Docker Environment.
1. Run the AWS CLI configure command and set default region as your source repository region (I used us-east-1).
$ aws configure set default.region <Source Repository Region>
2. Create an Amazon ECR repository using this command to store your ccrepl image (Note the repositoryUri in the output):
2. Create a role called AccessRoleForCCfromFG using the following command in the DockerEnvironment:
$ aws iam create-role --role-name AccessRoleForCCfromFG --assume-role-policy-document file://trustpolicyforecs.json
3. Assign CodeCommit service full access to the above role using the following command in the DockerEnvironment:
$ aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AWSCodeCommitFullAccess --role-name AccessRoleForCCfromFG
4. In the Amazon ECS Console, choose Repositories and select the ccrepl repository that was created in the previous step. Copy the Repository URI.
5. In the Amazon ECS Console, choose Task Definitions and click Create New Task Definition.
6. Select launch type compatibility as FARGATE and click Next Step.
7. In the create task definition screen, do the following:
In Task Definition Name, type ccrepl
In Task Role, choose AccessRoleForCCfromFG
In Task Memory, choose 2GB
In Task CPU, choose 1 vCPU
Click Add Container under Container Definitions in the same screen. In the Add Container screen, do the following:
Enter Container name as ccreplcont
Enter Image URL copied from step 4
Enter Memory Limits as 128 and click Add.
Note: Select TaskExecutionRole as “ecsTaskExecutionRole” if it already exists. If not, select create new role and it will create “ecsTaskExecutionRole” for you.
8. Click the Create button in the task definition screen to create the task. It will successfully create the task, execution role and AWS CloudWatch Log groups.
9. In the Amazon ECS Console, click Clusters and create cluster. Select template as “Networking only, Powered by AWS Fargate” and click next step.
10. Enter cluster name as ccreplcluster and click create.
Step 5: Create the Lambda Function
In this section, I used Amazon Elastic Container Service (ECS) run task API from Lambda to invoke the Fargate task.
1. In the IAM Console, create a new role called ECSLambdaRole with permissions for AWS CodeCommit and Amazon ECS, as well as the pass-role privileges needed to run the ECS task. Your statement should look similar to the following (replace <your account id>):
2. In AWS management console, select VPC service and click subnets in the left navigation screen. Note down the Subnet IDs that you want to run the Fargate task in.
3. Create a new Lambda Node.js function called FargateTaskExecutionFunc, assign it the role ECSLambdaRole, and use the following content:
Note: Replace subnets values (marked in red color) with the subnet IDs you identified as the subnets you wanted to run the Fargate task on in Step 2 of this section.
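The function in this post is written in Node.js. For readers who prefer Python, here is a hedged sketch of the same idea using the ECS RunTask API through boto3. The helper name build_run_task_request, the placeholder subnet ID, and the choice to assign a public IP are assumptions for illustration; the cluster and task definition names are the ones used in this walkthrough.

```python
def build_run_task_request(cluster, task_definition, subnets):
    """Assemble the arguments for an ECS RunTask call on Fargate."""
    return {
        "cluster": cluster,
        "taskDefinition": task_definition,
        "launchType": "FARGATE",
        "count": 1,
        "networkConfiguration": {
            "awsvpcConfiguration": {
                "subnets": list(subnets),
                # Assumption: the task needs a public IP to reach ECR and CodeCommit.
                "assignPublicIp": "ENABLED",
            }
        },
    }


def lambda_handler(event, context):
    """Start the replication task whenever the CodeCommit trigger fires."""
    import boto3  # available in the Lambda Python runtime

    ecs = boto3.client("ecs")
    request = build_run_task_request(
        "ccreplcluster", "ccrepl", ["subnet-EXAMPLE"]  # replace with your subnet IDs
    )
    response = ecs.run_task(**request)
    return {"started": len(response.get("tasks", []))}
```

Separating the request builder from the handler keeps the network settings in one place and lets you check them without calling AWS.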
1. In the Lambda Console, click FargateTaskExecutionFunc under functions.
2. Under Add triggers in the Designer, select CodeCommit
3. In the Configure triggers screen, do the following:
Enter Repository name as Source (your source repository name)
Enter trigger name as LambdaTrigger
Leave the Events as “All repository events”
Leave the Branch names as “All branches”
Click Add button
Click Save button to save the changes
Step 6: Verification
To test the application, make a commit and push the changes to the source repository in AWS CodeCommit. That should automatically trigger the Lambda function and replicate the changes in the destination repository. You can verify this by checking CloudWatch Logs for Lambda and ECS, or simply going to the destination repository and verifying the change appears.
Conclusion
Congratulations! You have successfully configured repository replication of an AWS CodeCommit repository using AWS Lambda and AWS Fargate. You can use this technique in a deployment pipeline. You can also tweak the trigger configuration in AWS CodeCommit to call the Lambda function in response to any supported trigger event in AWS CodeCommit.
In this post, I’ll show you how to create a sample dataset for Amazon Macie, and how you can use Amazon Macie to implement data-centric compliance and security analytics in your Amazon S3 environment. I’ll also dive into the different kinds of credentials, document types, and PII detections supported by Macie. First, I’ll walk through creating a “getting started” sample set of artificial, generated data that you can use to test Macie capabilities and start building your own policies and alerts.
Create a realistic data sample set in S3
I’ll use amazon-macie-activity-generator, which we call “AMG” for short, a sample application developed by AWS that generates realistic content and accesses your test account to create the data. AMG uses AWS CloudFormation, AWS Lambda, and Python’s excellent Faker library to create a data set with artificial—but realistic—data classifications and access patterns to help test some of the features and capabilities of Macie. AMG is released under Amazon Software License 1.0, and we’ll accept pull requests on our GitHub repository and monitor any issues that are opened so we can try to fix bugs and consider new feature requests.
The following diagram shows a high level architecture overview of the components that will be created in your AWS account for AMG. For additional detail about these components and their relationships, review the CloudFormation setup script.
Depending on the data types specified in your JSON configuration template (details below), AMG will periodically generate artificial documents for the specified S3 target with a PutObject action. By default, the CloudFormation stack uses a configuration file that instructs AMG to create a new, private S3 bucket that can only be accessed by authorized AWS users/roles in the same account as the bucket. All the S3 objects with fake data in this bucket have a private ACL and inherit the bucket’s access control configuration. All generated objects feature the header in the example below, and AMG supports all fake data providers offered by https://faker.readthedocs.io/en/latest/index.html, as well as a few of AMG’s own custom fake data providers requested by our customers: aws_creds, slack_creds, github_creds, facebook_creds, linux_shadow, rsa, linux_passwd, dsa, ec, pgp, cert, itin, swift_code, and cve.
# Sample Report - No identification of actual persons or places is
# intended or should be inferred
74323 Julie Field Lake Joshuamouth, OR 30055-3905 1-196-191-4438x974 53001 Paul Union New John, HI 94740 Mastercard Amanda Wells 5135725008183484 09/26 CVV: 550
354-70-6172 242 George Plaza East Lawrencefurt, VA 37287-7620 GB73WAUS0628038988364 587 Silva Village Pearsonburgh, NM 11616-7231 LDNM1948227117807 American Express Brett Garza 347965534580275 05/20 CID: 4758
599.335.2742 JCB 15 digit Michael Arias 210069190253121 03/27 CVC: 861
Create your amazon-macie-activity-generator CloudFormation stack
You can deploy AMG in your AWS account by following these steps:
Log in to the AWS Console in a region supported by Amazon Macie, which currently includes US East (N. Virginia) and US West (Oregon).
Select the One-click CloudFormation launch stack, or launch CloudFormation using the template above.
Read our terms, select the Acknowledgement box, and then select Create.
Creating the data takes a few minutes, and you can periodically refresh CloudWatch to track progress.
Add the new sample data to Macie
Now, I’ll log into the Macie console and add the newly created sample data buckets for analysis by Macie.
Note: If you don’t explicitly specify a bucket for S3 targets in CloudFormation, AMG will use the S3 bucket that’s created by default for the stack, which will be printed out in the CloudFormation stack’s output.
To add buckets for data classification, follow these steps:
Log in to Amazon Macie.
Select Integrations, and then select Services.
Select your account, and then select Details from the Amazon S3 card.
Select your newly created buckets for Full classification, including existing data.
For additional details on configuring Macie, refer to our getting started documentation.
Macie classifies all historical and newly created data in the buckets created by AMG, and the data will be available in the Macie console as it’s classified. Typically, you can expect the data in the sample set to be classified within 60 minutes of the time it was selected for analysis.
Classifying objects with Macie
To see the objects in your test sample set, in Macie, open the Research tab, and then select the S3 Objects index. We’ll use the regular expression search capability in Macie to find any objects written to buckets that start with “amazon-macie-activity-generator-defaults3bucket”. To search for this, type the following text into the Macie search box and select the magnifying glass icon.
From here, you can see a nice breakdown of the kinds of objects that have been classified by Macie, as well as the object-specific details. Create an advanced search using Lucene Query Syntax, and save it as an alert to be matched against any newly created data.
Analyzing accesses to your test data
In addition to classifying data, Macie tracks all control plane and data plane accesses to your content using CloudTrail. To see accesses to your generated environment (created periodically by AMG to mimic user activity), on the Macie navigation bar, select Research, select the CloudTrail data index, and then use the following search to identify our generated role activity:
From this search, you can dive into the user activity (IAM users, assumed roles, federated users, and so on), which is summarized in 5-minute aggregations (user sessions). For example, in the screen shot you can see that one of our AMG-generated users listed objects one time (ListObjects) and wrote 56 objects to S3 (PutObject) during a 5-minute period.
Macie alerts
Macie features both predictive (machine learning-based) and basic (rule-based) alerts, including alerts on unencrypted credentials being uploaded to S3 (because this activity might not follow compliance best practices), risky activity such as data exfiltration, and user-defined alerts that are based on saved searches. To see alerts that have been generated based on AMG’s activity, on the Macie navigation bar, select Alerts.
AMG will continue to run, periodically uploading content to the specified S3 buckets. To stop AMG, delete the AMG CloudFormation stack and its associated resources.
What are the costs?
Macie has a free tier enabling up to 1GB of content to be analyzed per month at no cost to you. By default, AMG will write approximately 10MB of objects to Amazon S3 per day, and you will incur charges for data classification after crossing the 1GB monthly free tier. Running continuously, AMG will generate about 310MB of content per month (10MB/day x 31 days), which will stay below the free tier. Any data use above 1GB will be billed at the Macie public price of $5/GB. For more detail, see the Macie pricing documentation.
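The arithmetic above can be captured in a few lines of Python, which is handy if you change how much data AMG writes. The function name monthly_macie_charge is hypothetical, the defaults come from the figures quoted in this post, and treating 1 GB as 1,024 MB is an assumption.

```python
def monthly_macie_charge(mb_per_day, days=31, free_tier_gb=1.0, price_per_gb=5.0):
    """Return (total MB generated in the month, classification charge in USD)."""
    total_mb = mb_per_day * days
    # Only the volume above the free tier is billed.
    billable_gb = max(0.0, total_mb / 1024 - free_tier_gb)
    return total_mb, billable_gb * price_per_gb
```

With the default 10 MB per day, the function returns 310 MB and a charge of 0.0, because the monthly volume stays under the free tier.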
If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the Amazon Macie forum or contact AWS Support.
This blog was contributed by Rucha Nene, Sr. Product Manager for Amazon EBS
AWS customers use tags to track ownership of resources, implement compliance protocols, control access to resources via IAM policies, and drive their cost accounting processes. Last year, we made tagging for Amazon EC2 instances and Amazon EBS volumes easier by adding the ability to tag these resources upon creation. We are now extending this capability to EBS snapshots.
Previously, you could tag your EBS snapshots only after the resource had been created, and you sometimes ended up with EBS snapshots in an untagged state if tagging failed. You also could not control the actions that users and groups could take over specific snapshots, or enforce tighter security policies.
To address these issues, we are making tagging for EBS snapshots more flexible and giving customers more control over EBS snapshots by introducing two new capabilities:
Tag on creation for EBS snapshots – You can now specify tags for EBS snapshots as part of the API call that creates the resource or via the Amazon EC2 Console when creating an EBS snapshot.
Resource-level permission and enforced tag usage – The CreateSnapshot, DeleteSnapshot, and ModifySnapshotAttribute API actions now support IAM resource-level permissions. You can now write IAM policies that mandate the use of specific tags when taking actions on EBS snapshots.
Tag on creation
You can now specify tags for EBS snapshots as part of the API call that creates the resources. The resource creation and the tagging are performed atomically; both must succeed in order for the CreateSnapshot operation to succeed. You no longer need to build tagging scripts that run after EBS snapshots have been created.
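For API users, a minimal Python sketch of tagging on creation with boto3 might look like the following. The helper names snapshot_request and create_tagged_snapshot are hypothetical, and the sketch assumes that your credentials are allowed to call CreateSnapshot on the volume.

```python
def snapshot_request(volume_id, tags):
    """Build CreateSnapshot arguments that tag the snapshot in the same call."""
    return {
        "VolumeId": volume_id,
        "TagSpecifications": [
            {
                "ResourceType": "snapshot",
                "Tags": [{"Key": key, "Value": value} for key, value in tags.items()],
            }
        ],
    }


def create_tagged_snapshot(volume_id, tags):
    """Create the snapshot; the call fails as a whole if tagging is not permitted."""
    import boto3  # imported lazily so snapshot_request can be used without AWS access

    ec2 = boto3.client("ec2")
    return ec2.create_snapshot(**snapshot_request(volume_id, tags))
```

Because the tags travel with the creation request, no follow-up tagging call is needed.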
Here’s how you specify tags when you create an EBS snapshot, using the console:
CreateSnapshot, DeleteSnapshot, and ModifySnapshotAttribute now support resource-level permissions, which allow you to exercise more control over EBS snapshots. You can write IAM policies that give you precise control over access to resources and let you specify which users are able to create snapshots for a given set of volumes. You can also enforce the use of specific tags to help track resources and achieve more accurate cost allocation reporting.
For example, here’s a statement that requires that the costcenter tag (with a value of “115”) be present on the volume from which snapshots are being created. It requires that this tag be applied to all newly created snapshots. In addition, it requires that the created snapshots are tagged with User:username for the customer.
To implement stronger compliance and security policies, you could also restrict access to DeleteSnapshot, if the resource is not tagged with the user’s name. Here’s a statement that allows the deletion of a snapshot only if the snapshot is tagged with User:username for the customer.
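As an illustration of the deletion rule just described, the following Python sketch builds a policy document in which DeleteSnapshot is allowed only when the snapshot carries a User tag equal to the caller's IAM user name. The function name and the default region are assumptions, and the statement is a sketch of the idea rather than the exact policy from this post.

```python
import json


def delete_own_snapshot_policy(region="us-east-1"):
    """Return an IAM policy allowing deletion of snapshots tagged with the caller's name."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": "ec2:DeleteSnapshot",
                "Resource": f"arn:aws:ec2:{region}::snapshot/*",
                "Condition": {
                    # ${aws:username} is resolved by IAM to the requesting user's name.
                    "StringEquals": {"ec2:ResourceTag/User": "${aws:username}"}
                },
            }
        ],
    }


def policy_json(region="us-east-1"):
    """Serialize the policy for use in the console or CLI."""
    return json.dumps(delete_own_snapshot_policy(region), indent=2)
```

Generating the document from code makes it straightforward to reuse the same condition across regions.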
In this blog post, I will show how you can perform unit testing as a part of your AWS CodeStar project. AWS CodeStar helps you quickly develop, build, and deploy applications on AWS. With AWS CodeStar, you can set up your continuous delivery (CD) toolchain and manage your software development from one place.
Because unit testing tests individual units of application code, it is helpful for quickly identifying and isolating issues. As a part of an automated CI/CD process, it can also be used to prevent bad code from being deployed into production.
Many of the AWS CodeStar project templates come preconfigured with a unit testing framework so that you can start deploying your code with more confidence. The unit testing is configured to run in the provided build stage so that, if the unit tests do not pass, the code is not deployed. For a list of AWS CodeStar project templates that include unit testing, see AWS CodeStar Project Templates in the AWS CodeStar User Guide.
The scenario
As a big fan of superhero movies, I decided to list my favorites and ask my friends to vote on theirs by using a WebService endpoint I created. The example I use is a Python web service running on AWS Lambda with AWS CodeCommit as the code repository. CodeCommit is a fully managed source control system that hosts Git repositories and works with all Git-based tools.
Here’s how you can create the WebService endpoint:
Sign in to the AWS CodeStar console. Choose Start a project, which will take you to the list of project templates.
For code edits I will choose AWS Cloud9, which is a cloud-based integrated development environment (IDE) that you use to write, run, and debug code.
Here are the other tasks required by my scenario:
Create a database table where the votes can be stored and retrieved as needed.
Update the logic in the Lambda function that was created for posting and getting the votes.
Update the unit tests (of course!) to verify that the logic works as expected.
For a database table, I’ve chosen Amazon DynamoDB, which offers a fast and flexible NoSQL database.
Getting set up on AWS Cloud9
From the AWS CodeStar console, go to the AWS Cloud9 console, which should take you to your project code. I will open up a terminal at the top-level folder under which I will set up my environment and required libraries.
Use the following command to set the PYTHONPATH environment variable on the terminal.
You should now be able to use the following command to execute the unit tests in your project.
python -m unittest discover vote-your-movie/tests
Start coding
Now that you have set up your local environment and have a copy of your code, add a DynamoDB table to the project by defining it through a template file. Open template.yml, which is the Serverless Application Model (SAM) template file. This template extends AWS CloudFormation to provide a simplified way of defining the Amazon API Gateway APIs, AWS Lambda functions, and Amazon DynamoDB tables required by your serverless application.
AWSTemplateFormatVersion: 2010-09-09
Transform:
- AWS::Serverless-2016-10-31
- AWS::CodeStar

Parameters:
  ProjectId:
    Type: String
    Description: CodeStar projectId used to associate new resources to team members

Resources:
  # The DB table to store the votes.
  MovieVoteTable:
    Type: AWS::Serverless::SimpleTable
    Properties:
      PrimaryKey:
        # Name of the "Candidate" is the partition key of the table.
        Name: Candidate
        Type: String
  # Creating a new lambda function for retrieving and storing votes.
  MovieVoteLambda:
    Type: AWS::Serverless::Function
    Properties:
      Handler: index.handler
      Runtime: python3.6
      Environment:
        # Setting environment variables for your lambda function.
        Variables:
          TABLE_NAME: !Ref "MovieVoteTable"
          TABLE_REGION: !Ref "AWS::Region"
      Role:
        Fn::ImportValue:
          !Join ['-', [!Ref 'ProjectId', !Ref 'AWS::Region', 'LambdaTrustRole']]
      Events:
        GetEvent:
          Type: Api
          Properties:
            Path: /
            Method: get
        PostEvent:
          Type: Api
          Properties:
            Path: /
            Method: post
We’ll use Python’s boto3 library to connect to AWS services. And we’ll use Python’s mock library to mock AWS service calls for our unit tests. Use the following command to install these libraries:
pip install --upgrade boto3 mock -t .
Add these libraries to the buildspec.yml, which is the YAML file that is required for CodeBuild to execute.
version: 0.2

phases:
  install:
    commands:
      # Upgrade AWS CLI to the latest version
      - pip install --upgrade awscli boto3 mock
  pre_build:
    commands:
      # Discover and run unit tests in the 'tests' directory. For more information, see <https://docs.python.org/3/library/unittest.html#test-discovery>
      - python -m unittest discover tests
  build:
    commands:
      # Use AWS SAM to package the application by using AWS CloudFormation
      - aws cloudformation package --template template.yml --s3-bucket $S3_BUCKET --output-template template-export.yml

artifacts:
  type: zip
  files:
    - template-export.yml
Open the index.py where we can write the simple voting logic for our Lambda function.
import json
import os

import boto3

# The table name and Region are injected through the Lambda environment (see template.yml).
table_name = os.environ['TABLE_NAME']
table_region = os.environ['TABLE_REGION']

VOTES_TABLE = boto3.resource('dynamodb', region_name=table_region).Table(table_name)
CANDIDATES = {"A": "Black Panther", "B": "Captain America: Civil War", "C": "Guardians of the Galaxy", "D": "Thor: Ragnarok"}


def handler(event, context):
    if event['httpMethod'] == 'GET':
        # Return the current vote count for every candidate in the table.
        resp = VOTES_TABLE.scan()
        return {'statusCode': 200,
                'body': json.dumps({item['Candidate']: int(item['Votes']) for item in resp['Items']}),
                'headers': {'Content-Type': 'application/json'}}

    elif event['httpMethod'] == 'POST':
        try:
            body = json.loads(event['body'])
        except (TypeError, ValueError):
            # The body is missing or is not valid JSON.
            return {'statusCode': 400,
                    'body': 'Invalid input! Expecting a JSON.',
                    'headers': {'Content-Type': 'application/json'}}
        if 'candidate' not in body:
            return {'statusCode': 400,
                    'body': 'Missing "candidate" in request.',
                    'headers': {'Content-Type': 'application/json'}}
        if body['candidate'] not in CANDIDATES:
            return {'statusCode': 400,
                    'body': 'You must vote for one of the following candidates - {}.'.format(get_allowed_candidates()),
                    'headers': {'Content-Type': 'application/json'}}

        # Atomically increment the vote counter for the chosen candidate.
        resp = VOTES_TABLE.update_item(
            Key={'Candidate': CANDIDATES.get(body['candidate'])},
            UpdateExpression='ADD Votes :incr',
            ExpressionAttributeValues={':incr': 1},
            ReturnValues='ALL_NEW'
        )
        return {'statusCode': 200,
                'body': "{} now has {} votes".format(CANDIDATES.get(body['candidate']), resp['Attributes']['Votes']),
                'headers': {'Content-Type': 'application/json'}}


def get_allowed_candidates():
    l = []
    for key in CANDIDATES:
        l.append("'{}' for '{}'".format(key, CANDIDATES.get(key)))
    return ", ".join(l)
Our code takes in the HTTPS request as an event. If it is an HTTP GET request, it reads the vote counts from the table. If it is an HTTP POST request, it records a vote for the candidate of choice. We also validate the input of each POST request to reject requests that are malformed or name an unknown candidate. That way, only valid votes are stored in the table.
In the example code provided, we use a CANDIDATES variable to store our candidates, but you can store the candidates in a JSON file and use Python’s json library instead.
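As a minimal sketch of that alternative, the following loads the candidates from a JSON file. The file name candidates.json and the helper load_candidates are hypothetical names chosen for this illustration; the demo writes the file to a temporary directory so that the snippet runs on its own.

```python
import json
import os
import tempfile


def load_candidates(path):
    """Load the candidate mapping (ID -> movie title) from a JSON file."""
    with open(path) as f:
        candidates = json.load(f)
    # Fail early if the file does not hold a usable mapping.
    if not isinstance(candidates, dict) or not candidates:
        raise ValueError("The candidates file must contain a non-empty JSON object.")
    return candidates


# Demo: write a sample file, then read it back.
sample = {"A": "Black Panther", "B": "Captain America: Civil War",
          "C": "Guardians of the Galaxy", "D": "Thor: Ragnarok"}
demo_path = os.path.join(tempfile.mkdtemp(), "candidates.json")
with open(demo_path, "w") as f:
    json.dump(sample, f)

CANDIDATES = load_candidates(demo_path)
print(sorted(CANDIDATES))
```

In the Lambda function, the file would be packaged next to index.py and loaded once at module import time, so that each invocation reuses the same dictionary.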
Let’s update the tests now. Under the tests folder, open the test_handler.py and modify it to verify the logic.
import os

# Some mock environment variables that would be used by the mock for DynamoDB
os.environ['TABLE_NAME'] = "MockHelloWorldTable"
os.environ['TABLE_REGION'] = "us-east-1"

# The library containing our logic.
import index

# Boto3's core library
import botocore
# For handling JSON.
import json
# Unit test library
import unittest

## Getting StringIO based on your setup.
try:
    from StringIO import StringIO
except ImportError:
    from io import StringIO

## Python mock library
from mock import patch, call
from decimal import Decimal


@patch('botocore.client.BaseClient._make_api_call')
class TestCandidateVotes(unittest.TestCase):

    ## Test the HTTP GET request flow.
    ## We expect to get back a successful response with results of votes from the table (mocked).
    def test_get_votes(self, boto_mock):
        # Input event to our method to test.
        expected_event = {'httpMethod': 'GET'}
        # The mocked values in our DynamoDB table.
        items_in_db = [{'Candidate': 'Black Panther', 'Votes': Decimal('3')},
                       {'Candidate': 'Captain America: Civil War', 'Votes': Decimal('8')},
                       {'Candidate': 'Guardians of the Galaxy', 'Votes': Decimal('8')},
                       {'Candidate': "Thor: Ragnarok", 'Votes': Decimal('1')}
                       ]
        # The mocked DynamoDB response.
        expected_ddb_response = {'Items': items_in_db}
        # The mocked response we expect back by calling DynamoDB through boto.
        response_body = botocore.response.StreamingBody(StringIO(str(expected_ddb_response)),
                                                        len(str(expected_ddb_response)))
        # Setting the expected value in the mock.
        boto_mock.side_effect = [expected_ddb_response]
        # Expecting that there would be a call to DynamoDB Scan function during execution with these parameters.
        expected_calls = [call('Scan', {'TableName': os.environ['TABLE_NAME']})]

        # Call the function to test.
        result = index.handler(expected_event, {})

        # Run unit test assertions to verify the expected calls to mock have occurred and verify the response.
        assert result.get('headers').get('Content-Type') == 'application/json'
        assert result.get('statusCode') == 200

        result_body = json.loads(result.get('body'))
        # Verifying that the results match to that from the table.
        assert len(result_body) == len(items_in_db)
        for i in range(len(result_body)):
            assert result_body.get(items_in_db[i].get("Candidate")) == int(items_in_db[i].get("Votes"))

        assert boto_mock.call_count == 1
        boto_mock.assert_has_calls(expected_calls)

    ## Test the HTTP POST request flow that places a vote for a selected candidate.
    ## We expect to get back a successful response with a confirmation message.
    def test_place_valid_candidate_vote(self, boto_mock):
        # Input event to our method to test.
        expected_event = {'httpMethod': 'POST', 'body': "{\"candidate\": \"D\"}"}
        # The mocked response in our DynamoDB table.
        expected_ddb_response = {'Attributes': {'Candidate': "Thor: Ragnarok", 'Votes': Decimal('2')}}
        # The mocked response we expect back by calling DynamoDB through boto.
        response_body = botocore.response.StreamingBody(StringIO(str(expected_ddb_response)),
                                                        len(str(expected_ddb_response)))
        # Setting the expected value in the mock.
        boto_mock.side_effect = [expected_ddb_response]
        # Expecting that there would be a call to DynamoDB UpdateItem function during execution with these parameters.
        expected_calls = [call('UpdateItem', {
            'TableName': os.environ['TABLE_NAME'],
            'Key': {'Candidate': 'Thor: Ragnarok'},
            'UpdateExpression': 'ADD Votes :incr',
            'ExpressionAttributeValues': {':incr': 1},
            'ReturnValues': 'ALL_NEW'
        })]

        # Call the function to test.
        result = index.handler(expected_event, {})

        # Run unit test assertions to verify the expected calls to mock have occurred and verify the response.
        assert result.get('headers').get('Content-Type') == 'application/json'
        assert result.get('statusCode') == 200
        assert result.get('body') == "{} now has {} votes".format(
            expected_ddb_response['Attributes']['Candidate'],
            expected_ddb_response['Attributes']['Votes'])

        assert boto_mock.call_count == 1
        boto_mock.assert_has_calls(expected_calls)

    ## Test the HTTP POST request flow that places a vote for a nonexistent candidate.
    ## We expect to get back a failed (400) response with an appropriate error message.
    def test_place_invalid_candidate_vote(self, boto_mock):
        # Input event to our method to test.
        # The valid IDs for the candidates are A, B, C, and D
        expected_event = {'httpMethod': 'POST', 'body': "{\"candidate\": \"E\"}"}

        # Call the function to test.
        result = index.handler(expected_event, {})

        # Run unit test assertions to verify the expected calls to mock have occurred and verify the response.
        assert result.get('headers').get('Content-Type') == 'application/json'
        assert result.get('statusCode') == 400
        assert result.get('body') == 'You must vote for one of the following candidates - {}.'.format(index.get_allowed_candidates())

    ## Test the HTTP POST request flow that places a vote for a selected candidate but associated with an invalid key in the POST body.
    ## We expect to get back a failed (400) response with an appropriate error message.
    def test_place_invalid_data_vote(self, boto_mock):
        # Input event to our method to test.
        # "name" is not the expected input key.
        expected_event = {'httpMethod': 'POST', 'body': "{\"name\": \"D\"}"}

        # Call the function to test.
        result = index.handler(expected_event, {})

        # Run unit test assertions to verify the expected calls to mock have occurred and verify the response.
        assert result.get('headers').get('Content-Type') == 'application/json'
        assert result.get('statusCode') == 400
        assert result.get('body') == 'Missing "candidate" in request.'

    ## Test the HTTP POST request flow that places a vote for a selected candidate but not as a JSON string which the body of the request expects.
    ## We expect to get back a failed (400) response with an appropriate error message.
    def test_place_malformed_json_vote(self, boto_mock):
        # Input event to our method to test.
        # "body" receives a string rather than a JSON string.
        expected_event = {'httpMethod': 'POST', 'body': "Thor: Ragnarok"}

        # Call the function to test.
        result = index.handler(expected_event, {})

        # Run unit test assertions to verify the expected calls to mock have occurred and verify the response.
        assert result.get('headers').get('Content-Type') == 'application/json'
        assert result.get('statusCode') == 400
        assert result.get('body') == 'Invalid input! Expecting a JSON.'


if __name__ == '__main__':
    unittest.main()
I am keeping the code samples well commented so that it’s clear what each unit test accomplishes. It tests the success conditions and the failure paths that are handled in the logic.
In my unit tests I use the patch decorator (@patch) in the mock library. @patch helps mock the function you want to call (in this case, the botocore library’s _make_api_call function in the BaseClient class). Before we commit our changes, let’s run the tests locally. On the terminal, run the tests again. If all the unit tests pass, you should expect to see a result like this:
You:~/environment $ python -m unittest discover vote-your-movie/tests
.....
----------------------------------------------------------------------
Ran 5 tests in 0.003s
OK
You:~/environment $
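To see the @patch pattern in isolation, here is a small self-contained sketch that does not need boto3. It uses unittest.mock from the Python 3 standard library rather than the separate mock package used in the tests above, and PriceClient with its _fetch method is a made-up class standing in for the AWS client.

```python
from unittest.mock import patch, call


class PriceClient(object):
    """Hypothetical client whose _fetch method would normally go over the network."""

    def _fetch(self, operation, params):
        raise RuntimeError("Network access is not available in unit tests.")

    def get_price(self, item):
        return self._fetch('GetPrice', {'Item': item})['Price']


def demo():
    # Replace _fetch on the class for the duration of the with block.
    with patch.object(PriceClient, '_fetch') as fetch_mock:
        # Each call to the mock returns the next value from side_effect.
        fetch_mock.side_effect = [{'Price': 42}]
        result = PriceClient().get_price('widget')
        # Verify that the mock was called with the expected arguments.
        fetch_mock.assert_has_calls([call('GetPrice', {'Item': 'widget'})])
        return result, fetch_mock.call_count


print(demo())
```

The shape is the same as in test_handler.py: set side_effect to the canned response, run the code under test, and then assert on both the return value and the recorded calls.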
Upload to AWS
Now that the tests have passed, it’s time to commit and push the code to the source repository!
Add your changes
From the terminal, go to the project’s folder and use the following command to verify the changes you are about to push.
git status
To add the modified files only, use the following command:
git add -u
Commit your changes
To commit the changes (with a message), use the following command:
git commit -m "Logic and tests for the voting webservice."
Push your changes to AWS CodeCommit
To push your committed changes to CodeCommit, use the following command:
git push
In the AWS CodeStar console, you can see your changes flowing through the pipeline and being deployed. There are also links in the AWS CodeStar console that take you to this project’s build runs so you can see your tests running on AWS CodeBuild. The latest link under the Build Runs table takes you to the logs.
After the deployment is complete, AWS CodeStar should now display the AWS Lambda function and DynamoDB table created and synced with this project. The Project link in the AWS CodeStar project’s navigation bar displays the AWS resources linked to this project.
Because this is a new database table, there should be no data in it. So, let’s put in some votes. You can download Postman to test your application endpoint for POST and GET calls. The endpoint you want to test is the URL displayed under Application endpoints in the AWS CodeStar console.
Now let’s open Postman and look at the results. Let’s create some votes through POST requests. Based on this example, a valid vote has a value of A, B, C, or D. Here’s what a successful POST request looks like:
Here’s what it looks like if I use some value other than A, B, C, or D:
Now I am going to use a GET request to fetch the results of the votes from the database.
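If you would rather script these calls than use Postman, the following sketch builds the same POST and GET requests with Python's standard library. The ENDPOINT value is a placeholder, not a real URL; replace it with the URL shown under Application endpoints. The helper names are made up for this example.

```python
import json
from urllib.error import HTTPError
from urllib.request import Request, urlopen

# Placeholder endpoint (assumption); use the URL from the AWS CodeStar console.
ENDPOINT = "https://example.execute-api.us-east-1.amazonaws.com/Prod/"


def build_vote_request(candidate_id, endpoint=ENDPOINT):
    """Build the POST request that places a vote for candidate A, B, C, or D."""
    body = json.dumps({"candidate": candidate_id}).encode("utf-8")
    return Request(endpoint, data=body, method="POST",
                   headers={"Content-Type": "application/json"})


def build_results_request(endpoint=ENDPOINT):
    """Build the GET request that fetches the current vote counts."""
    return Request(endpoint, method="GET")


def send(request):
    """Send a request and return (status code, response text)."""
    try:
        with urlopen(request) as resp:
            return resp.status, resp.read().decode("utf-8")
    except HTTPError as err:
        # 4xx responses, such as a vote for an unknown candidate, arrive here.
        return err.code, err.read().decode("utf-8")


if __name__ == "__main__":
    print(send(build_vote_request("D")))
    print(send(build_results_request()))
```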
And that’s it! You have now created a simple voting web service using AWS Lambda, Amazon API Gateway, and DynamoDB and used unit tests to verify your logic so that you ship good code. Happy coding!
Now, your applications and federated users can complete longer running workloads in a single session by increasing the maximum session duration up to 12 hours for an IAM role. Users and applications still retrieve temporary credentials by assuming roles using AWS Security Token Service (AWS STS), but these credentials can now be valid for up to 12 hours when using the AWS SDK or CLI. This change allows your users and applications to perform longer running workloads, such as a batch upload to S3 or a CloudFormation template, using a single session. You can extend the maximum session duration using the IAM console or CLI. Once you increase the maximum session duration, users and applications assuming the IAM role can request temporary credentials that expire when the IAM role session expires.
In this post, I show you how to configure the maximum session duration for an existing IAM role to 4 hours (maximum allowed duration is 12 hours) using the IAM console. I’ll use 4 hours because AWS recommends configuring the session duration for a role to the shortest duration that your federated users would require to access your AWS resources. I’ll then show how existing federated users can use the AWS SDK or CLI to request temporary security credentials that are valid until the role session expires.
Configure the maximum session duration for an existing IAM role to 4 hours
Let’s assume you have an existing IAM role called ADFS-Production that allows your federated users to upload objects to an S3 bucket in your AWS account. You want to extend the maximum session duration for this role to 4 hours. By default, IAM roles in your AWS accounts have a maximum session duration of one hour. To extend a role’s maximum session duration to 4 hours, follow the steps below:
In the left navigation pane, select Roles and then select the role for which you want to increase the maximum session duration. For this example, I select ADFS-Production and verify the maximum session duration for this role. This value is set to 1 hour (3,600 seconds) by default.
Select Edit, and then define the maximum session duration.
Select one of the predefined durations or provide a custom duration. For this example, I set the maximum session duration to be 4 hours.
Select Save changes.
Alternatively, you can use the latest AWS CLI and call update-role to set the maximum session duration for the role ADFS-Production. Here’s an example to set the maximum session duration to 14,400 seconds (4 hours).
$ aws iam update-role --role-name ADFS-Production --max-session-duration 14400
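The same change can be made from Python. The sketch below assumes boto3 is installed and credentials are configured; the helper names are invented for this example, while update_role with its RoleName and MaxSessionDuration parameters is the boto3 IAM call.

```python
def build_update_role_args(role_name, hours):
    """Return the keyword arguments for the IAM UpdateRole call."""
    seconds = int(hours * 3600)
    # IAM accepts a maximum session duration between 1 hour and 12 hours.
    if not 3600 <= seconds <= 43200:
        raise ValueError("The maximum session duration must be between 1 and 12 hours.")
    return {"RoleName": role_name, "MaxSessionDuration": seconds}


def update_max_session_duration(role_name, hours):
    """Apply the new maximum session duration to the role."""
    import boto3  # Imported here so that the helper above has no dependencies.
    iam = boto3.client("iam")
    return iam.update_role(**build_update_role_args(role_name, hours))


print(build_update_role_args("ADFS-Production", 4))
```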
Now that you’ve successfully extended the maximum session for your IAM role, ADFS-Production, your federated users can use AWS STS to retrieve temporary credentials that are valid for 4 hours to access your S3 buckets.
Access AWS resources with temporary security credentials using AWS CLI/SDK
To enable federated SDK and CLI access for your users who use temporary security credentials, you might have implemented the solution described in the blog post on How to Implement Federated API and CLI Access Using SAML 2.0 and AD FS. That blog post demonstrates how to use the AWS Python SDK and some additional client-side integration code provided in the post to implement federated SDK and CLI access for your users. To enable your users to request longer temporary security credentials, you can make the following changes suggested in this blog to the solution provided in that post.
When calling AssumeRoleWithSAML API to request AWS temporary security credentials, you need to include the DurationSeconds parameter. The value of this parameter is the duration the user requests and, therefore, the duration their temporary security credentials are valid. In this example, I am using boto to request the maximum length of 14,400 seconds (4 hours) using code from the How to Implement Federated API and CLI Access Using SAML 2.0 and AD FS post that I have updated:
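# Use the assertion to get an AWS STS token using Assume Role with SAML
conn = boto.sts.connect_to_region(region)
# Pass the duration by keyword so that it is not mistaken for another optional argument
token = conn.assume_role_with_saml(role_arn, principal_arn, assertion, duration_seconds=14400)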
By adding a value for the DurationSeconds parameter in the AssumeRoleWithSAML call, your federated user can retrieve temporary security credentials that are valid for up to 14,400 seconds (4 hours). If you don’t provide this value, the default session duration is 1 hour. If you provide a value of 5 hours for your temporary security credentials, AWS STS will throw an error since this is longer than the role session duration of 4 hours.
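For readers on boto3 rather than the older boto library, a comparable sketch follows. The helper names are hypothetical, and the ARNs and assertion are supplied by the caller; assume_role_with_saml and its RoleArn, PrincipalArn, SAMLAssertion, and DurationSeconds parameters are the boto3 STS call.

```python
def build_saml_request(role_arn, principal_arn, assertion, duration_seconds=14400):
    """Return the keyword arguments for AssumeRoleWithSAML."""
    # STS accepts 900 seconds to 12 hours; the role's own maximum may be lower.
    if not 900 <= duration_seconds <= 43200:
        raise ValueError("DurationSeconds must be between 900 and 43200.")
    return {
        "RoleArn": role_arn,
        "PrincipalArn": principal_arn,
        "SAMLAssertion": assertion,
        "DurationSeconds": duration_seconds,
    }


def request_credentials(role_arn, principal_arn, assertion, duration_seconds=14400):
    """Call AWS STS and return the temporary credentials."""
    import boto3  # Imported here so that the helper above has no dependencies.
    sts = boto3.client("sts")
    response = sts.assume_role_with_saml(
        **build_saml_request(role_arn, principal_arn, assertion, duration_seconds))
    return response["Credentials"]
```

The local check only catches values outside the range that STS accepts at all. A request for more than the role's configured maximum (4 hours in this example) is still rejected by AWS STS.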
Conclusion
I demonstrated how you can configure the maximum session duration for a role from 1 hour (default) up to 12 hours. Then, I showed you how your federated users can use the AWS CLI/SDK to retrieve temporary security credentials that remain valid for up to that configured maximum.
Similarly, you can also increase the maximum role session duration for your applications and users who use Web Identity or OpenID Connect Federation or Cross-Account Access with Assume Role. If you have comments about this blog, submit them in the Comments section below. If you have questions or suggestions, please start a new thread on the IAM forum.
Amazon EMR empowers many customers to build big data processing applications quickly and cost-effectively, using popular distributed frameworks such as Apache Spark, Apache HBase, Presto, and Apache Flink. For organizations that are crafting their analytical applications on Amazon EMR, there is a growing need to keep their data assets organized in an automated fashion. Because datasets tend to grow exponentially, using cataloging tools is essential to automating data discovery and organizing data assets.
AWS Glue Data Catalog provides this essential capability, allowing you to automatically discover and catalog metadata about your data stores in a central repository. Since Amazon EMR 5.8.0, customers have been using the AWS Glue Data Catalog as a metadata store for Apache Hive and Spark SQL applications that are running on Amazon EMR. Starting with Amazon EMR 5.10.0, you can catalog datasets using AWS Glue and run queries using Presto on Amazon EMR from the Hue (Hadoop User Experience) and Apache Zeppelin UIs.
You might wonder what scenarios warrant using Presto running on Amazon EMR and when to choose Amazon Athena (which uses Presto as the query engine under the hood). It is important to note that both are excellent tools for querying massive amounts of data, but they address different needs and use cases.
Amazon Athena provides the easiest way to run interactive queries for data in Amazon S3 without needing to set up or manage any servers. Presto running on Amazon EMR gives you much more flexibility in how you configure and run your queries, providing the ability to federate to other data sources if needed. For example, you might have a use case that requires LDAP authentication for clients such as the Presto CLI or JDBC/ODBC drivers. Or you might have a workflow where you need to join data between different systems like MySQL/Amazon Redshift/Apache Cassandra and Hive. In these examples, Presto running on Amazon EMR is the right tool to use because it can be configured to enable LDAP authentication in addition to the desired database connectors at cluster launch.
Now, let’s look at how metadata management for Presto works with AWS Glue.
Using an AWS Glue crawler to discover datasets
The AWS Glue Data Catalog is a reference to the location, schema, and runtime metrics of your datasets. To create this reference metadata, AWS Glue needs to crawl your datasets. In this exercise, we use an AWS Glue crawler to populate tables in the Data Catalog for the NYC taxi rides dataset.
The following are the steps for adding a crawler:
Sign in to the AWS Management Console, and open the AWS Glue console. In the navigation pane, choose Crawlers. Then choose Add crawler.
On the Add a data store page, specify the location of the NYC taxi rides dataset.
In the next step, choose an existing IAM role if one is available, or create a new role. Then choose Next.
On the scheduling page, for Frequency, choose Run on demand.
On the Configure the crawler’s output page, choose Add database. Specify blog-db as the database name. (You can specify a name of your choice, but be sure to choose the correct database name when running queries.)
Follow the remaining steps using the default values to create a crawler.
When the crawler displays the Ready state, navigate to the Databases page. (Choose blog-db from the list of databases, or search for it by specifying it as a filter, as shown in the following screenshot.) Then choose Tables. You should see the three tables created by the crawler, as follows.
(Optional) The discovered data is classified as CSV files. You can optionally convert this data into Parquet format for better response times on your queries.
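The console steps above can also be scripted. The following is a sketch using the boto3 Glue client, assuming boto3 is installed and credentials are configured. The crawler name, role name, and S3 path in the demo are placeholders, because the dataset location is not given in this text; omitting a schedule leaves the crawler to run on demand.

```python
def build_crawler_args(name, role, database, s3_path):
    """Return the keyword arguments for the Glue CreateCrawler call."""
    return {
        "Name": name,
        "Role": role,
        "DatabaseName": database,
        # A single Amazon S3 data store; no Schedule means run on demand.
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def create_and_start_crawler(args):
    """Create the crawler and start one run."""
    import boto3  # Imported here so that the helper above has no dependencies.
    glue = boto3.client("glue")
    glue.create_crawler(**args)
    glue.start_crawler(Name=args["Name"])


# Placeholder values for illustration only.
demo_args = build_crawler_args(
    name="taxi-crawler",
    role="MyGlueCrawlerRole",
    database="blog-db",
    s3_path="s3://your-bucket/path-to-taxi-data/",
)
print(demo_args)
```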
Launching an Amazon EMR cluster
With the dataset discovered and organized, we can now walk through different options for launching Presto on an Amazon EMR cluster to use the AWS Glue Data Catalog.
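One option is to launch the cluster programmatically. The sketch below builds part of the request for the boto3 EMR run_job_flow call. It assumes that the Glue Data Catalog is enabled for Presto through the presto-connector-hive configuration classification; the cluster name is a placeholder, and the instance and role settings that a real launch also requires are deliberately left out.

```python
def build_presto_glue_cluster_args(name="presto-glue-demo", release_label="emr-5.10.0"):
    """Return the part of a run_job_flow request that relates to Presto and Glue.

    A real launch also needs Instances, JobFlowRole, and ServiceRole.
    """
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "Applications": [{"Name": "Presto"}, {"Name": "Hue"}],
        "Configurations": [
            {
                # Assumption: this classification points Presto's Hive
                # connector at the AWS Glue Data Catalog.
                "Classification": "presto-connector-hive",
                "Properties": {"hive.metastore.glue.datacatalog.enabled": "true"},
            }
        ],
    }


print(build_presto_glue_cluster_args())
```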
After you’ve set up the Amazon EMR cluster with Presto, the AWS Glue Data Catalog is available through a default “hive” catalog. To change between the Hive and Glue metastores, you have to manually update hive.properties and restart the Presto server. Connect to the master node on your EMR cluster using SSH, and run the Presto CLI to start running queries interactively.
$ presto-cli --catalog hive
Begin with a simple query to sample a few rows:
presto> SELECT * FROM "blog-db".taxi limit 10;
The query shows a few sample rows as follows:
Query the average fare for trips at each hour of the day on the Parquet version of the taxi dataset.
presto> SELECT EXTRACT (HOUR FROM pickup_datetime) AS hour, avg(fare_amount) AS average_fare FROM "blog-db".taxi_parquet GROUP BY 1 ORDER BY 1;
The following image shows the results:
More interestingly, you can compute the number of trips that gave tips in the 10 percent, 15 percent, or higher percentage range:
presto> -- Tip Percent Category
SELECT TipPrctCtgry
     , COUNT (DISTINCT TripID) TripCt
FROM
  (SELECT TripID
        , (CASE
             WHEN fare_prct < 0.7 THEN 'FL70'
             WHEN fare_prct < 0.8 THEN 'FL80'
             WHEN fare_prct < 0.9 THEN 'FL90'
             ELSE 'FL100'
           END) FarePrctCtgry
        , (CASE
             WHEN tip_prct < 0.1 THEN 'TL10'
             WHEN tip_prct < 0.15 THEN 'TL15'
             WHEN tip_prct < 0.2 THEN 'TL20'
             ELSE 'TG20'
           END) TipPrctCtgry
   FROM
     (SELECT TripID
           , (fare_amount / total_amount) as fare_prct
           , (extra / total_amount) as extra_prct
           , (mta_tax / total_amount) as mta_taxprct
           , (tolls_amount / total_amount) as tolls_prct
           , (tip_amount / total_amount) as tip_prct
           , (improvement_surcharge / total_amount) as imprv_suchrgprct
           , total_amount
      FROM
        (SELECT *
              , (cast(pickup_longitude AS VARCHAR(100)) || '_' || cast(pickup_latitude AS VARCHAR(100))) as TripID
         FROM "blog-db".taxi_parquet
         WHERE total_amount > 0
        ) as t
     ) as t
  ) ct
GROUP BY TipPrctCtgry;
The results are as follows:
While the preceding query is running, navigate to the web interface for Presto on Amazon EMR at http://master-public-dns-name:8889/. Here you can look into the query metrics, such as active worker nodes, number of rows read per second, reserved memory, and parallelism.
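To make the tip bucketing in that query concrete, here is a small Python sketch of the intended logic on a handful of made-up trips, where the tip share is the tip amount divided by the total amount. It counts rows rather than distinct trip IDs, so it illustrates the CASE expression only and is not a replacement for the query.

```python
from collections import Counter


def tip_category(tip_amount, total_amount):
    """Mirror the tip CASE expression: bucket the tip share of the total."""
    tip_prct = tip_amount / total_amount
    if tip_prct < 0.1:
        return 'TL10'
    if tip_prct < 0.15:
        return 'TL15'
    if tip_prct < 0.2:
        return 'TL20'
    return 'TG20'


def count_by_category(trips):
    """Count trips per tip bucket, skipping rows with a non-positive total."""
    return Counter(tip_category(tip, total) for tip, total in trips if total > 0)


# Made-up (tip_amount, total_amount) pairs for illustration.
sample_trips = [(0.0, 10.0), (1.2, 10.0), (1.8, 10.0), (3.0, 10.0), (0.5, 10.0), (1.0, 0.0)]
print(count_by_category(sample_trips))
```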
Running queries in the Presto Editor on Hue
If you installed Hue with your Amazon EMR launch, you can also run queries on Hue’s Presto Editor. On the Amazon EMR Cluster console, choose Enable Web Connection, and follow the instructions to access the web interfaces for Hue and Zeppelin.
After the web connection is enabled, choose the Hue link to open the web interface. At the login screen, if you are the administrator logging in for the first time, type a user name and password to create your Hue superuser account. Then choose Create account. Otherwise, type your user name and password and choose Create account, or type the credentials provided by your administrator.
Choose the Presto Editor from the menu. You can run Presto queries against your tables in the AWS Glue Data Catalog.
Conclusion
Having a shared data catalog for applications on Amazon EMR alleviates a myriad of data-related challenges that organizations face today—including discovery, governance, auditability, and collaboration. In this post, we explored how the AWS Glue Data Catalog addresses discoverability and manageability for table metadata for Presto on Amazon EMR. Go ahead, give this a try, and share your experience with us!
Radhika Ravirala is a Solutions Architect at Amazon Web Services where she helps customers craft distributed big data applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley. She holds an M.S. in computer science from San Jose State University.
With the explosion in virtual reality (VR) technologies over the past few years, we’ve had an increasing number of customers ask us for advice and best practices around deploying their VR-based products and service offerings on the AWS Cloud. It soon became apparent that while the VR ecosystem is large in both scope and depth of types of workloads (gaming, e-medicine, security analytics, live streaming events, etc.), many of the workloads followed repeatable patterns, with storage and delivery of live and on-demand immersive video at the top of the list.
Looking at consumer trends, the desire for live and on-demand immersive video is fairly self-explanatory. VR has ushered in convenient and low-cost access for consumers and businesses to a wide variety of options for consuming content, ranging from browser playback of live and on-demand 360º video, all the way up to positional tracking systems with a high degree of immersion. All of these scenarios contain one lowest common denominator: video.
Which brings us to the topic of this post. We set out to build a solution that could support both live and on-demand events, bring with it a high degree of scalability, be flexible enough to support transformation of video if required, run at a low cost, and use open-source software to every extent possible.
In this post, we describe the reference architecture we created to solve this challenge, using Amazon EC2 Spot Instances, Amazon S3, Elastic Load Balancing, Amazon CloudFront, AWS CloudFormation, and Amazon CloudWatch, with open-source software such as NGINX, FFMPEG, and JavaScript-based client-side playback technologies. We step you through deployment of the solution and how the components work, as well as the capture, processing, and playback of the underlying live and on-demand immersive media streams.
This GitHub repository includes the source code necessary to follow along. We’ve also provided a self-paced workshop from AWS re:Invent 2017 that breaks down this architecture even further. If you experience any issues or would like to suggest an enhancement, please use the GitHub issue tracker.
Prerequisites
As a side note, you’ll also need a few additional components to take best advantage of the infrastructure:
A camera/capture device capable of encoding and streaming RTMP video
A browser to consume the content.
You’re going to generate HTML5-compatible video (Apple HLS to be exact), but there are many other native iOS and Android options for consuming the media that you create. It’s also worth noting that your playback device should support projection of your input stream. We’ll talk more about that in the next section.
How does immersive media work?
At its core, any flavor of media, be that audio or video, can be viewed with some level of immersion. The ability to interact passively or actively with the content brings with it a further level of immersion. When you look at VR devices with rotational and positional tracking, you naturally need more than an ability to interact with a flat plane of video. The challenge for any creative thus becomes a tradeoff between immersion features (degrees of freedom, monoscopic 2D or stereoscopic 3D, resolution, framerate) and overall complexity.
Where can you start, simply and effectively, in a way that lets you build out a fairly modular solution and test it? There are a few areas where we chose to be prescriptive with our solution.
Source capture from the Ricoh Theta S
First, monoscopic 360-degree video is currently one of the most commonly consumed formats on consumer devices. We explicitly chose to focus on this format, although the infrastructure is not limited to it. More on this later.
Second, most consumer-level cameras that provide live streaming ability, and even many professional rigs, have at least two lenses or cameras. The figure above illustrates a single capture from a Ricoh Theta S in monoscopic 2D. The left image captures 180 degrees of the field of view, and the right image captures the other 180 degrees.
For this post, we chose a typical midlevel camera (the Ricoh Theta S), and used a laptop with open-source software (Open Broadcaster Software) to encode and stream the content. Again, the solution infrastructure is not limited to this particular brand of camera. Any camera or encoder that outputs 360º video and encodes to H264+AAC with an RTMP transport will work.
Third, capturing and streaming multiple camera feeds brings additional requirements around stream synchronization and cost of infrastructure. There is also a requirement to stitch media in real time, which can be CPU and GPU-intensive. Many devices and platforms do this either on the device, or via outboard processing that sits close to the camera location. If you stitch and deliver a single stream, you can save the costs of infrastructure and bitrate/connectivity requirements. We chose to keep these aspects on the encoder side to save on cost and reduce infrastructure complexity.
Last, the most common delivery format that requires little to no processing on the infrastructure side is equirectangular projection, as per the above figure. By stitching and unwrapping the spherical coordinates into a flat plane, you can easily deliver the video exactly as you would with any other live or on-demand stream. The only caveat is that resolution and bit rate are of utmost importance. The higher you can push these (high bit rate @ 4K resolution), the more immersive the experience is for viewers. This is due to the increase in sharpness and reduction of compression artifacts.
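To make the "unwrapping" concrete, here is a minimal sketch of the standard equirectangular mapping from a viewing direction to a pixel in the flat frame. The function name and the 3840x1920 frame size in the usage note are illustrative assumptions, not part of the reference architecture.

```python
def equirect_pixel(yaw_deg, pitch_deg, width, height):
    """Map a viewing direction to (x, y) in an equirectangular frame.

    yaw_deg:   longitude in degrees, -180 (left edge) to 180 (right edge)
    pitch_deg: latitude in degrees, 90 (top edge) to -90 (bottom edge)
    """
    # Longitude maps linearly to the horizontal axis.
    u = (yaw_deg + 180.0) / 360.0
    # Latitude maps linearly to the vertical axis, with "up" at row 0.
    v = (90.0 - pitch_deg) / 180.0
    # Clamp so that the far edges stay inside the frame.
    x = min(int(u * width), width - 1)
    y = min(int(v * height), height - 1)
    return x, y
```

For an assumed 3840x1920 frame, looking straight ahead (`equirect_pixel(0, 0, 3840, 1920)`) lands in the center of the image. Because every direction on the sphere shares the same fixed pixel budget, higher source resolution translates directly into sharper detail in whatever part of the sphere the viewer is looking at.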
Knowing that we would potentially be encoding at 4K on the source camera, but in a format that could be transmuxed without an encoding penalty on the origin servers, we implemented a pass-through for the highest bit rate and elected to transcode only the lower bit rates. This requires some level of configuration on the source encoder, but saves on cost and infrastructure. Because you can conform the source stream, you may as well take advantage of that!
For this post, we chose not to focus on ways to optimize projection. However, the reference architecture does support this with additional open source components compiled into the FFMPEG toolchain. A number of options are available to this end, such as open source equirectangular to cubic transformation filters. There is a tradeoff, however, in that reprojection implies that all streams must be transcoded.
Processing and origination stack
To get started, we’ve provided a CloudFormation template that you can launch directly into your own AWS account. We quickly review how it works, the solution’s components, key features, processing steps, and examine the main configuration files. Following this, you launch the stack, and then proceed with camera and encoder setup.
Immersive streaming reference architecture
The event encoder publishes the RTMP source to multiple origin elastic IP addresses for packaging into the HLS adaptive bitrate.
The client requests the live stream through the CloudFront CDN.
The origin responds with the appropriate HLS stream.
The edge fleet caches media requests from clients and elastically scales across both Availability Zones to meet peak demand.
CloudFront caches media at local edge PoPs to improve performance for users and reduce the origin load.
When the live event is finished, the VOD asset is published to S3. An S3 event is then published to SQS.
The encoding fleet reads messages from the SQS queue, processes the VOD clips, and stores them in the S3 bucket.
How it works
A camera captures content, and with the help of a contribution encoder, publishes a live stream in equirectangular format. The stream is encoded at a high bit rate (at least 2.5 Mbps, but typically 16+ Mbps for 4K) using H264 video and AAC audio compression codecs, and delivered to a primary origin via the RTMP protocol. Streams may transit over the internet or dedicated links to the origins. Typically, for live events in the field, internet or bonded cellular are the most widely used.
The encoder is typically configured to push the live stream to a primary URI, with the ability (depending on the source encoding software/hardware) to roll over to a backup publishing point origin if the primary fails. Because you run across multiple Availability Zones, this architecture could handle an entire zone outage with minor disruption to live events. The primary and backup origins handle the ingestion of the live stream as well as transcoding to H264+AAC-based adaptive bit rate sets. After transcode, they package the streams into HLS for delivery and create a master-level manifest that references all adaptive bit rates.
The edge cache fleet pulls segments and manifests from the active origin on demand, and supports failover from primary to backup if the primary origin fails. By adding this caching tier, you effectively separate the encoding backend tier from the cache tier that responds to client or CDN requests. In addition to origin protection, this separation allows you to independently monitor, configure, and scale these components.
Viewers can use the sample HTML5 player (or a compatible desktop, iOS, or Android application) to view the streams. Navigation in the 360-degree view is handled either natively via a device-based gyroscope, positionally via more advanced devices such as a head-mounted display, or via mouse drag on the desktop. Adaptive bit rate is key here, as this allows you to target multiple device types, giving the player on each device the option of selecting an optimum stream based on network conditions or device profile.
Solution components
When you deploy the CloudFormation template, all the architecture services referenced above are created and launched. This includes:
The compute tier running on Spot Instances for the corresponding components:
the primary and backup ingest origins
the edge cache fleet
the transcoding fleet
the test source
The CloudFront distribution
S3 buckets for storage of on-demand VOD assets
An Application Load Balancer for load balancing the service
An Amazon ECS cluster and container for the test source
The template also provisions the underlying dependencies:
A VPC
Security groups
IAM policies and roles
Elastic network interfaces
Elastic IP addresses
The edge cache fleet instances need some way to discover the primary and backup origin locations. You use elastic network interfaces and elastic IP addresses for this purpose.
As each component of the infrastructure is provisioned, software required to transcode and process the streams across the Spot Instances is automatically deployed. This includes NGiNX-RTMP for ingest of live streams, FFMPEG for transcoding, NGINX for serving, and helper scripts to handle various tasks (potential Spot Instance interruptions, queueing, moving content to S3). Metrics and logs are available through CloudWatch and you can manage the deployment using the CloudFormation console or AWS CLI.
Key features include:
Live and video-on-demand recording
You’re supporting both live and on-demand. On-demand content is created automatically when the encoder stops publishing to the origin.
Cost-optimization and operating at scale using Spot Instances
Spot Instances are used exclusively for infrastructure to optimize cost and scale throughput.
Midtier caching
To protect the origin servers, the midtier cache fleet pulls, caches, and delivers to downstream CDNs.
Distribution via CloudFront or multi-CDN
The Application Load Balancer endpoint allows CloudFront or any third-party CDN to source content from the edge fleet and, indirectly, the origin.
FFMPEG + NGINX + NGiNX-RTMP
These three components form the core of the stream ingest, transcode, packaging, and delivery infrastructure, as well as the VOD-processing component for creating transcoded VOD content on-demand.
Simple deployment using a CloudFormation template
All infrastructure can be easily created and modified using CloudFormation.
Prototype player page
To provide an end-to-end experience right away, we’ve included a test player page hosted as a static site on S3. This page uses A-Frame, a cross-platform, open-source framework for building VR experiences in the browser. Though A-Frame provides many features, it’s used here to render a sphere that acts as a 3D canvas for your live stream.
Spot Instance considerations
At this stage, and before we discuss processing, it is important to understand how the architecture operates with Spot Instances.
Spot Instances are spare compute capacity in the AWS Cloud available to you at steep discounts compared to On-Demand prices. Spot Instances enable you to optimize your costs on the AWS Cloud and scale your application’s throughput up to 10X for the same budget. By selecting Spot Instances, you can save up to 90% on On-Demand prices. This allows you to greatly reduce the cost of running the solution because, outside of S3 for storage and CloudFront for delivery, this solution is almost entirely dependent on Spot Instances for infrastructure requirements.
We also know that customers running events look to deploy streaming infrastructure at the lowest price point, so it makes sense to take advantage of it wherever possible. A potential challenge when using Spot Instances for live streaming and on-demand processing is that you need to proactively deal with potential Spot Instance interruptions. How can you best deal with this?
First, the origin is deployed in a primary/backup deployment. If a Spot Instance interruption happens on the primary origin, you can fail over to the backup with a brief interruption. Should a potential interruption not be acceptable, then either Reserved Instances or On-Demand options (or a combination) can be used at this tier.
Second, the edge cache fleet runs a job (started automatically at system boot) that periodically queries the local instance metadata to detect if an interruption is scheduled to occur. Spot Instance Interruption Notices provide a two-minute warning of a pending interruption. If you poll every 5 seconds, you have almost 2 full minutes to detach from the Load Balancer and drain or stop any traffic directed to your instance.
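As a rough illustration of that polling job, here is a minimal Python sketch. It assumes the instance permits unauthenticated (IMDSv1-style) metadata requests and uses the `spot/termination-time` metadata path, which returns an error until an interruption is scheduled. The function names are hypothetical and this is not the helper script that ships with the solution.

```python
import time
import urllib.error
import urllib.request

# Metadata path that only returns a value once an interruption is scheduled.
TERMINATION_URL = "http://169.254.169.254/latest/meta-data/spot/termination-time"


def fetch_termination_time(url=TERMINATION_URL, timeout=2):
    """Return the scheduled termination time as text, or None if none is set."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.read().decode().strip() or None
    except (urllib.error.URLError, OSError):
        # A 404 (or no metadata service at all) means no notice yet.
        return None


def wait_for_interruption(fetch=fetch_termination_time, interval=5,
                          max_polls=None, sleep=time.sleep):
    """Poll every `interval` seconds until a notice appears.

    Returns the notice text, or None if `max_polls` is exhausted first.
    """
    polls = 0
    while max_polls is None or polls < max_polls:
        notice = fetch()
        if notice:
            return notice
        polls += 1
        sleep(interval)
    return None
```

With a 5-second interval, the notice is seen at most 5 seconds after it is issued, which leaves roughly 115 of the 120 seconds for deregistering from the load balancer and draining connections.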
Lastly, use an SQS queue when transcoding. If a transcode on a Spot Instance is interrupted, the unfinished item becomes visible in the SQS queue again and is eventually re-surfaced into the processing pipeline. Only remove items from the queue after the transcoded files have been successfully moved to the destination S3 bucket.
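The delete-only-after-success rule can be sketched as follows. The `sqs` argument is assumed to be a boto3 SQS client (or anything exposing the same `receive_message` and `delete_message` calls), and `transcode_and_upload` is a hypothetical callable standing in for the FFMPEG transcode plus the copy to S3.

```python
def process_one_job(sqs, queue_url, transcode_and_upload):
    """Receive one message, run the job, and delete the message only on success.

    Returns True if a job completed, False if the queue was empty or the job failed.
    """
    resp = sqs.receive_message(
        QueueUrl=queue_url,
        MaxNumberOfMessages=1,
        WaitTimeSeconds=20,  # long polling keeps empty receives cheap
    )
    messages = resp.get("Messages", [])
    if not messages:
        return False

    msg = messages[0]
    try:
        # Hypothetical job: transcode the clip and move the output to S3.
        transcode_and_upload(msg["Body"])
    except Exception:
        # Do not delete: the message becomes visible again after the
        # visibility timeout, so another worker can pick it up.
        return False

    sqs.delete_message(QueueUrl=queue_url, ReceiptHandle=msg["ReceiptHandle"])
    return True
```

If the instance is reclaimed in the middle of a transcode, the process never reaches `delete_message`, so the behavior is the same as in the failure branch: the job is retried elsewhere.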
Processing
As discussed in the previous sections, the highest bit rate is passed through untouched. This avoids having to increase the instance size to transcode 4K or similarly high bit rate or high resolution content.
We’ve selected a handful of bitrates for the adaptive bit rate stack. You can customize any of these to suit the requirements for your event. The default ABR stack includes:
2160p (4K)
1080p
540p
480p
These can be modified by editing the /etc/nginx/rtmp.d/rtmp.conf NGINX configuration file on the origin or the CloudFormation template.
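To show how such a stack surfaces to players, here is a small sketch that builds an HLS master playlist referencing one variant per rendition. The bit rates, frame sizes, and file names below are illustrative assumptions only; the actual values come from the NGINX and FFMPEG configuration in the template.

```python
def master_playlist(variants):
    """Build an HLS master playlist with one entry per rendition."""
    lines = ["#EXTM3U"]
    for v in variants:
        lines.append(
            "#EXT-X-STREAM-INF:BANDWIDTH={bandwidth},"
            "RESOLUTION={width}x{height}".format(**v)
        )
        lines.append(v["uri"])
    return "\n".join(lines) + "\n"


# Hypothetical ladder matching the four default renditions (values assumed).
ABR_STACK = [
    {"bandwidth": 16000000, "width": 3840, "height": 2160, "uri": "foo_2160p.m3u8"},
    {"bandwidth": 5000000, "width": 1920, "height": 1080, "uri": "foo_1080p.m3u8"},
    {"bandwidth": 1500000, "width": 960, "height": 540, "uri": "foo_540p.m3u8"},
    {"bandwidth": 800000, "width": 854, "height": 480, "uri": "foo_480p.m3u8"},
]

if __name__ == "__main__":
    print(master_playlist(ABR_STACK), end="")
```

A player reads this playlist first and then switches between the listed variants as network conditions change, which is why adding or removing a rendition only requires changing the ladder.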
It’s important to understand where and how streams are transcoded. When the source high bit rate stream enters the primary or backup origin at the /live RTMP application entry point, recording begins when publishing starts and ends when publishing stops. On completion, the recording is moved to S3 by a cleanup script, and a message is placed in your SQS queue for workers to use. These workers transcode the media and push it to a playout location bucket.
This solution uses Spot Fleet with automatic scaling to drive the fleet size. You can customize it based on CloudWatch metrics, such as simple utilization metrics to drive the size of the fleet. Why use Spot Instances for the transcode option instead of Amazon Elastic Transcoder? This allows you to implement reprojection of the input stream via FFMPEG filters in the future.
The origins handle all the heavy live streaming work. Edges only store and forward the segments and manifests, and provide scaling plus reduction of burden on the origin. This lets you customize the origin to the right compute capacity without having to rely on a ‘high watermark’ for compute sizing, thus saving additional costs.
Loopback is an important concept for the live origins. The incoming stream entering /live is transcoded by FFMPEG to multiple bit rates, which are streamed back to the same host via RTMP, on a secondary publishing point /show. The secondary publishing point is transparent to the user and encoder, but handles HLS segment generation and cleanup, and keeps a sliding window of live segments and constantly updating manifests.
Configuration
Our solution provides two key points of configuration that can be used to customize the solution to accommodate ingest, recording, transcoding, and delivery, all controlled via origin and edge configuration files, which are described later. In addition, a number of job scripts run on the instances to provide hooks into Spot Instance interruption events and the VOD SQS-based processing queue.
Origin instances
The rtmp.conf excerpt below also shows additional parameters that can be customized, such as maximum recording file size in Kbytes, HLS Fragment length, and Playlist sizes. We’ve created these in accordance with general industry best practices to ensure the reliable streaming and delivery of your content.
rtmp {
    server {
        listen 1935;
        chunk_size 4000;
        application live {
            live on;
            record all;
            record_path /var/lib/nginx/rec;
            record_max_size 128000K;
            exec_record_done /usr/local/bin/record-postprocess.sh $path $basename;
            exec /usr/local/bin/ffmpeg <…parameters…>;
        }
        application show {
            live on;
            hls on;
            ...
            hls_type live;
            hls_fragment 10s;
            hls_playlist_length 60s;
            ...
        }
    }
}
This exposes a few URL endpoints for debugging and general status. In production, you would most likely turn these off:
/stat provides a statistics endpoint accessible via any standard web browser.
/control enables control of RTMP streams and publishing points.
You also control the TTLs, as previously discussed. It’s important to note here that you are setting TTLs explicitly at the origin, instead of in CloudFront’s distribution configuration. While both are valid, this approach allows you to reconfigure and restart the service on the fly without having to push changes through CloudFront. This is useful for debugging any caching or playback issues.
record-postprocess.sh – Ensures that recorded files on the origin are well-formed, and transfers them to S3 for processing.
ffmpeg.sh – Transcodes content on the encoding fleet, pulling source media from your S3 ingress bucket, based on SQS queue entries, and pushing transcoded adaptive bit rate segments and manifests to your VOD playout egress bucket.
For more details, see the Delivery and Playback section later in this post.
Camera source
With the processing and origination infrastructure running, you need to configure your camera and encoder.
As discussed, we chose to use a Ricoh Theta S camera and Open Broadcaster Software (OBS) to stitch and deliver a stream into the infrastructure. Ricoh provides a free ‘blender’ driver, which allows you to transform, stitch, encode, and deliver both transformed equirectangular (used for this post) video as well as spherical (two camera) video. The Theta provides an easy way to get capturing for under $300, and OBS is a free and open-source software application for capturing and live streaming on a budget. It is quick, cheap, and enjoys wide use by the gaming community. OBS lowers the barrier to getting started with immersive streaming.
While the resolution and bit rate of the Theta may not be 4K, it still provides us with a way to test the functionality of the entire pipeline end to end, without having to invest in a more expensive camera rig. One could also use this type of model to target smaller events, which may involve mobile devices with smaller display profiles, such as phones and potentially smaller sized tablets.
Looking for a more professional solution? Nokia, GoPro, Samsung, and many others have options ranging from $500 to $50,000. This solution is based around the Theta S capabilities, but we’d encourage you to extend it to meet your specific needs.
If your device can support equirectangular RTMP, then it can deliver media through the reference architecture (dependent on instance sizing for higher bit rate sources, of course). If additional features are required such as camera stitching, mixing, or device bonding, we’d recommend exploring a commercial solution such as Teradek Sphere.
Teradek Rig (Teradek)
Ricoh Theta (CNET)
All cameras have varied PC connectivity support. We chose the Ricoh Theta S due to the real-time video connectivity that it provides through software drivers on macOS and PC. If you plan to purchase a camera to use with a PC, confirm that it supports real-time capabilities as a peripheral device.
Encoding and publishing
Now that you have a camera, encoder, and AWS stack running, you can finally publish a live stream.
To start streaming with OBS, configure the source camera and set a publishing point. Use the RTMP application name /live on port 1935 to ingest into the primary origin’s Elastic IP address provided as the CloudFormation output: primaryOriginElasticIp.
You also need to choose a stream name or stream key in OBS. You can use any stream name, but keep the naming short and lowercase, and use only alphanumeric characters. This avoids any parsing issues on client-side player frameworks. There’s no publish point protection in your deployment, so any stream key works with the default NGiNX-RTMP configuration. For more information about stream keys, publishing point security, and extending the NGiNX-RTMP module, see the NGiNX-RTMP Wiki.
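The naming advice is easy to check before going live. This is a minimal sketch. The 32-character cap is an arbitrary assumption standing in for "short", and the function name is hypothetical.

```python
import re

# Lowercase letters and digits only; the length limit is an assumed cutoff.
STREAM_NAME = re.compile(r"[a-z0-9]{1,32}")


def is_safe_stream_name(name):
    """Return True if the name is short, lowercase, and alphanumeric."""
    return STREAM_NAME.fullmatch(name) is not None
```

For example, `is_safe_stream_name("foo")` passes, while names such as `My-Stream` or `live event` are rejected because of the uppercase letters, hyphen, and space.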
You should end up with a configuration similar to the following:
OBS Stream Settings
The Output settings dialog allows us to rescale the Video canvas and encode it for delivery to our AWS infrastructure. In the dialog below, we’ve set the Theta to encode at 5 Mbps in CBR mode using a preset optimized for low CPU utilization. We chose these settings in accordance with best practices for the stream pass-through at the origin for the initial incoming bit rate. You may notice that they largely match the FFMPEG encoding settings we use on the origin – namely constant bit rate, a single audio track, and x264 encoding with the ‘veryfast’ encoding profile.
OBS Output Settings
Live to On-Demand
As you may have noticed, an on-demand component is included in the solution architecture. When talking to customers, one frequent request that we see is that they would like to record the incoming stream with as little effort as possible.
NGINX-RTMP’s recording directives provide an easy way to accomplish this. We record any newly published stream on stream start at the primary or backup origins, using the incoming source stream, which also happens to be the highest bit rate. When the encoder stops broadcasting, NGINX-RTMP executes an exec_record_done script – record-postprocess.sh (described in the Configuration section earlier), which ensures that the content is well-formed, and then moves it to an S3 ingest bucket for processing.
Transcoding content to make it ready for VOD as adaptive bit rate is a multi-step pipeline. First, Spot Instances in the transcoding cluster periodically poll the SQS queue for new jobs. Next, items on the queue are pulled off on demand by processing instances and transcoded via FFMPEG into adaptive bit rate HLS. This also allows you to extend FFMPEG using filters for cubic and other bitrate-optimizing 360-specific transforms. Finally, transcoded content is moved from the ingest bucket to an egress bucket, making it ready for playback via your CloudFront distribution.
Separate ingest and egress by bucket to provide hard security boundaries between source recordings (which are highest quality and unencrypted), and destination derivatives (which may be lower quality and potentially require encryption). Bucket separation also allows you to order and archive input and output content using different taxonomies, which is common when moving content from an asset management and archival pipeline (the ingest bucket) to a consumer-facing playback pipeline (the egress bucket, and any other attached infrastructure or services, such as CMS, Mobile applications, and so forth).
Because streams are pushed over the internet, there is always the chance that an interruption could occur in the network path, or even at the origin side of the equation (primary to backup roll-over). Both of these scenarios could result in malformed or partial recordings being created. For the best level of reliability, encoding should always be recorded locally on-site as a precaution to deal with potential stream interruptions.
Delivery and playback
With the camera turned on and OBS streaming to AWS, the final step is to play the live stream. We’ve primarily tested the prototype player on the latest Chrome and Firefox browsers on macOS, so your mileage may vary on different browsers or operating systems. For those looking to try the livestream on Google Cardboard, or similar headsets, native apps for iOS (VRPlayer) and Android exist that can play back HLS streams.
The prototype player is hosted in an S3 bucket and can be found from the CloudFormation output clientWebsiteUrl. It requires a stream URL provided as a query parameter ?url=<stream_url> to begin playback. This stream URL is determined by the RTMP stream configuration in OBS. For example, if OBS is publishing to rtmp://x.x.x.x:1935/live/foo, the resulting playback URL would be:
https://<cloudFrontDistribution>/hls/foo.m3u8
Combining the player URL with the playback URL gives a path of the form <clientWebsiteUrl>?url=https://<cloudFrontDistribution>/hls/foo.m3u8.
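The mapping from the OBS publishing point to the playback and player URLs can be sketched in a few lines of Python. The function names are hypothetical, and percent-encoding the query parameter is an assumption about what the prototype player accepts; the host names in the usage note are placeholders.

```python
from urllib.parse import quote, urlsplit


def playback_url(rtmp_url, cloudfront_host):
    """Derive the HLS playback URL from the RTMP publishing URL."""
    path = urlsplit(rtmp_url).path  # for example "/live/foo"
    stream = path.rstrip("/").rsplit("/", 1)[-1]  # stream name, "foo"
    return "https://{}/hls/{}.m3u8".format(cloudfront_host, stream)


def player_url(client_website_url, stream_url):
    """Append the stream URL to the player page as the ?url= parameter."""
    # Assumption: the stream URL is percent-encoded in the query string.
    return "{}?url={}".format(client_website_url, quote(stream_url, safe=""))
```

Calling `playback_url("rtmp://203.0.113.10:1935/live/foo", "example.cloudfront.net")` yields `https://example.cloudfront.net/hls/foo.m3u8`, which can then be passed to `player_url` together with the clientWebsiteUrl output value.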
To assist in setup/debugging, we’ve provided a test source as part of the CloudFormation template. A color bar pattern with timecode and audio is being generated by FFmpeg running as an ECS task. Much like OBS, FFmpeg is streaming the test pattern to the primary origin over the RTMP protocol. The prototype player and test HLS stream can be accessed by opening the clientTestPatternUrl CloudFormation output link.
Test Stream Playback
What’s next?
In this post, we walked you through the design and implementation of a full end-to-end immersive streaming solution architecture. As you may have noticed, there are a number of areas this could expand into, and we intend to do this in follow-up posts around the topic of virtual reality media workloads in the cloud. We’ve identified a number of topics such as load testing, content protection, client-side metrics and analytics, and CI/CD infrastructure for 24/7 live streams. If you have any requests, please drop us a line.
We would like to extend extra-special thanks to Scott Malkie and Chad Neal for their help and contributions to this post and reference architecture.
Most malware tries to compromise your systems by using a known vulnerability that the operating system maker has already patched. As a best practice to help prevent malware from affecting your systems, you should apply all operating system patches and actively monitor your systems for missing patches.
Launch an Amazon EC2 instance for use with Systems Manager.
Configure Systems Manager to patch your Amazon EC2 Linux instances.
In two previous blog posts (Part 1 and Part 2), I showed how to use the AWS Management Console to perform the necessary steps to patch, inspect, and protect Microsoft Windows workloads. You can implement those same processes for your Linux instances running in AWS by changing the instance tags and types shown in the previous blog posts.
Because most Linux system administrators are more familiar with using a command line, I show how to patch Linux workloads by using the AWS CLI in this blog post. The steps to use the Amazon EBS Snapshot Scheduler and Amazon Inspector are identical for both Microsoft Windows and Linux.
What you should know first
To follow along with the solution in this post, you need one or more Amazon EC2 instances. You may use existing instances or create new instances. For this post, I assume you are using an Amazon EC2 instance running Amazon Linux, launched from an Amazon Machine Image (AMI).
Systems Manager is a collection of capabilities that helps you automate management tasks for AWS-hosted instances on Amazon EC2 and your on-premises servers. In this post, I use Systems Manager for two purposes: to run remote commands and apply operating system patches. To learn about the full capabilities of Systems Manager, see What Is AWS Systems Manager?
If you are not familiar with how to launch an Amazon EC2 instance, see Launching an Instance. I also assume you launched or will launch your instance in a private subnet. You must make sure that the Amazon EC2 instance can connect to the internet using a network address translation (NAT) instance or NAT gateway to communicate with Systems Manager. The following diagram shows how you should structure your VPC.
Later in this post, you will assign tasks to a maintenance window to patch your instances with Systems Manager. To do this, the IAM user you are using for this post must have the iam:PassRole permission. This permission allows the IAM user assigning tasks to pass their own IAM permissions to the AWS service. In this example, when you assign a task to a maintenance window, IAM passes your credentials to Systems Manager. You also should authorize your IAM user to use Amazon EC2 and Systems Manager. As mentioned before, you will be using the AWS CLI for most of the steps in this blog post. Our documentation shows you how to get started with the AWS CLI. Make sure you have the AWS CLI installed and configured with an AWS access key and secret access key that belong to an IAM user with the following AWS managed policies attached: AmazonEC2FullAccess and AmazonSSMFullAccess.
Step 1: Launch an Amazon EC2 Linux instance
In this section, I show you how to launch an Amazon EC2 instance so that you can use Systems Manager with the instance. This step requires you to do three things:
Create an IAM role for Systems Manager before launching your Amazon EC2 instance.
Launch your Amazon EC2 instance with Amazon EBS and the IAM role for Systems Manager.
Add tags to the instances so that you can add your instances to a Systems Manager maintenance window based on tags.
A. Create an IAM role for Systems Manager
Before launching an Amazon EC2 instance, I recommend that you first create an IAM role for Systems Manager, which you will use to update the Amazon EC2 instance. AWS already provides a preconfigured policy that you can use for the new role and it is called AmazonEC2RoleforSSM.
Create a JSON file named trustpolicy-ec2ssm.json that contains the following trust policy. This policy describes which principal (an entity that can take action on an AWS resource) is allowed to assume the role we are going to create. In this example, the principal is the Amazon EC2 service.
Use the following command to create a role named EC2SSM that uses this trust policy. The AWS managed policy AmazonEC2RoleforSSM is attached in the next step. If the command is successful, it generates JSON-based output that describes the role and its parameters.
$ aws iam create-role --role-name EC2SSM --assume-role-policy-document file://trustpolicy-ec2ssm.json
Use the following command to attach the AWS managed IAM policy (AmazonEC2RoleforSSM) to your newly created role.
$ aws iam attach-role-policy --role-name EC2SSM --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2RoleforSSM
Use the following commands to create the IAM instance profile and add the role to the instance profile. The instance profile is needed to attach the role we created earlier to your Amazon EC2 instance.
$ aws iam create-instance-profile --instance-profile-name EC2SSM-IP
$ aws iam add-role-to-instance-profile --instance-profile-name EC2SSM-IP --role-name EC2SSM
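The body of the trust policy file is not reproduced in this copy of the post. As a hedged sketch, a trust policy that lets a service principal assume a role generally has the shape generated below; the helper names are hypothetical, and the output should be compared against the IAM documentation before use.

```python
import json


def trust_policy(*services):
    """Build a trust policy allowing the given service principals to assume a role."""
    if not services:
        raise ValueError("at least one service principal is required")
    # IAM accepts either a single string or a list of service principals.
    principal = services[0] if len(services) == 1 else list(services)
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": principal},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def write_policy(path, policy):
    """Write the policy as JSON so it can be passed with file:// to the AWS CLI."""
    with open(path, "w") as handle:
        json.dump(policy, handle, indent=2)


if __name__ == "__main__":
    write_policy("trustpolicy-ec2ssm.json", trust_policy("ec2.amazonaws.com"))
```

The same helper covers a role that more than one service must assume, for example `trust_policy("ec2.amazonaws.com", "ssm.amazonaws.com")`.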
B. Launch your Amazon EC2 instance
To follow along, you need an Amazon EC2 instance that is running Amazon Linux. You can use any existing instance you may have or create a new instance.
When launching a new Amazon EC2 instance, be sure that:
Use the following command to launch a new Amazon EC2 instance using an Amazon Linux AMI available in the US East (N. Virginia) Region (also known as us-east-1). Replace YourKeyPair and YourSubnetId with your information. For more information about creating a key pair, see the create-key-pair documentation. Write down the InstanceId that is in the output because you will need it later in this post.
If you are using an existing Amazon EC2 instance, you can use the following command to attach the instance profile you created earlier to your instance.
C. Add tags
The final step of configuring your Amazon EC2 instances is to add tags. You will use these tags to configure Systems Manager in Step 2 of this post. For this example, I add a tag named Patch Group and set the value to Linux Servers. I could have other groups of Amazon EC2 instances that I treat differently by having the same tag name but a different tag value. For example, I might have a collection of other servers with the tag name Patch Group with a value of Web Servers.
Use the following command to add the Patch Group tag to your Amazon EC2 instance.
Note: You must wait a few minutes until the Amazon EC2 instance is available before you can proceed to the next section. To make sure your Amazon EC2 instance is online and ready, you can use the following AWS CLI command:
At this point, you now have at least one Amazon EC2 instance you can use to configure Systems Manager.
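The tagging and readiness commands are not reproduced in this copy of the post. As a rough Python equivalent, the sketch below assumes `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`) and uses its `create_tags` and `describe_instance_status` calls. The function names are hypothetical, and this is not the CLI command the post refers to.

```python
def tag_patch_group(ec2, instance_id, group="Linux Servers"):
    """Add the Patch Group tag that Systems Manager uses to select instances."""
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": "Patch Group", "Value": group}],
    )


def instance_ready(ec2, instance_id):
    """Return True once both the instance and system status checks report ok."""
    resp = ec2.describe_instance_status(InstanceIds=[instance_id])
    statuses = resp.get("InstanceStatuses", [])
    if not statuses:
        # Nothing is reported until the instance is running.
        return False
    status = statuses[0]
    return (
        status["InstanceStatus"]["Status"] == "ok"
        and status["SystemStatus"]["Status"] == "ok"
    )
```

A caller would typically invoke `instance_ready` in a loop with a short sleep between attempts, and only continue to the Systems Manager configuration once it returns True.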
Step 2: Configure Systems Manager
In this section, I show you how to configure and use Systems Manager to apply operating system patches to your Amazon EC2 instances, and how to manage patch compliance.
To start, I provide some background information about Systems Manager. Then, I cover how to:
Create the Systems Manager IAM role so that Systems Manager is able to perform patch operations.
Create a Systems Manager patch baseline and associate it with your instance to define which patches Systems Manager should apply.
Define a maintenance window to make sure Systems Manager patches your instance when you tell it to.
Monitor patch compliance to verify the patch state of your instances.
You must meet two prerequisites to use Systems Manager to apply operating system patches. First, you must attach the IAM role you created in the previous section, EC2SSM, to your Amazon EC2 instance. Second, you must install the Systems Manager agent on your Amazon EC2 instance. If you have used a recent Amazon Linux AMI, Amazon has already installed the Systems Manager agent on your Amazon EC2 instance. You can confirm this by logging in to an Amazon EC2 instance and checking the Systems Manager agent log files that are located at /var/log/amazon/ssm/.
A. Create the Systems Manager IAM role
For a maintenance window to be able to run any tasks, you must create a new role for Systems Manager. This role is a different kind of role than the one you created earlier: this role will be used by Systems Manager instead of Amazon EC2. Earlier, you created the role, EC2SSM, with the policy, AmazonEC2RoleforSSM, which allowed the Systems Manager agent on your instance to communicate with Systems Manager. In this section, you need a new role with the policy, AmazonSSMMaintenanceWindowRole, so that the Systems Manager service can execute commands on your instance.
To create the new IAM role for Systems Manager:
Create a JSON file named trustpolicy-maintenancewindowrole.json that contains the following trust policy. This policy describes which principal is allowed to assume the role you are going to create. This trust policy allows not only Amazon EC2 to assume this role, but also Systems Manager.
Use the following command to create a role named MaintenanceWindowRole that uses this trust policy. The AWS managed policy AmazonSSMMaintenanceWindowRole is attached in the next step. If the command is successful, it generates JSON-based output that describes the role and its parameters.
$ aws iam create-role --role-name MaintenanceWindowRole --assume-role-policy-document file://trustpolicy-maintenancewindowrole.json
Use the following command to attach the AWS managed IAM policy (AmazonSSMMaintenanceWindowRole) to your newly created role.
$ aws iam attach-role-policy --role-name MaintenanceWindowRole --policy-arn arn:aws:iam::aws:policy/service-role/AmazonSSMMaintenanceWindowRole
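The trust policy described in the first step can be sketched in Python as follows. This is a minimal sketch, assuming the two service principals are ec2.amazonaws.com and ssm.amazonaws.com, which matches the description that both Amazon EC2 and Systems Manager may assume the role. The helper names build_trust_policy and write_trust_policy are hypothetical.

```python
import json


def build_trust_policy(services=("ec2.amazonaws.com", "ssm.amazonaws.com")):
    """Return a trust policy that lets the given service principals assume a role.

    The default principals are an assumption based on the text: Amazon EC2
    and Systems Manager.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": list(services)},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def write_trust_policy(path="trustpolicy-maintenancewindowrole.json"):
    """Write the trust policy to the file name used by the create-role command."""
    with open(path, "w") as policy_file:
        json.dump(build_trust_policy(), policy_file, indent=2)
    return path
```

Calling write_trust_policy() produces the file that the create-role command reads through the file:// prefix.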
B. Create a Systems Manager patch baseline and associate it with your instance
Next, you will create a Systems Manager patch baseline and associate it with your Amazon EC2 instance. A patch baseline defines which patches Systems Manager should apply to your instance. Before you can associate the patch baseline with your instance, though, you must determine if Systems Manager recognizes your Amazon EC2 instance. Use the following command to list all instances managed by Systems Manager. The --filters option ensures you look only for your newly created Amazon EC2 instance.
If your instance is missing from the list, verify that:
Your instance is running.
You attached the Systems Manager IAM role, EC2SSM.
You deployed a NAT gateway in your public subnet to ensure your VPC reflects the diagram shown earlier in this post so that the Systems Manager agent can connect to the Systems Manager internet endpoint.
Now that you have checked that Systems Manager can manage your Amazon EC2 instance, it is time to create a patch baseline. With a patch baseline, you define which patches are approved to be installed on all Amazon EC2 instances associated with the patch baseline. The Patch Group resource tag you defined earlier will determine to which patch group an instance belongs. If you do not specifically define a patch baseline, the default AWS-managed patch baseline is used.
To create a patch baseline:
Use the following command to create a patch baseline named AmazonLinuxServers. With approval rules, you can determine the approved patches that will be included in your patch baseline. In this example, you add all Critical severity patches to the patch baseline as soon as they are released, by setting the Auto approval delay to 0 days. By setting the Auto approval delay to 2 days, you add to this patch baseline the Important, Medium, and Low severity patches two days after they are released.
Use the following command to register the patch baseline you created with your instance. To do so, you use the Patch Group tag that you added to your Amazon EC2 instance.
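The approval rules described above can be sketched as a request payload in Python. This is an illustration rather than the exact command from this post: it assumes the parameter shape accepted by the Systems Manager CreatePatchBaseline API, filters on severity only, and uses hypothetical helper names.

```python
def build_approval_rules():
    """Approve Critical patches immediately and all other severities after two days."""

    def rule(severities, approve_after_days):
        # One patch rule: a severity filter plus an auto-approval delay in days.
        return {
            "PatchFilterGroup": {
                "PatchFilters": [
                    {"Key": "SEVERITY", "Values": list(severities)},
                ]
            },
            "ApproveAfterDays": approve_after_days,
        }

    return {
        "PatchRules": [
            rule(["Critical"], 0),
            rule(["Important", "Medium", "Low"], 2),
        ]
    }


def build_baseline_request(name="AmazonLinuxServers"):
    """Assemble keyword arguments for a create_patch_baseline call."""
    return {
        "Name": name,
        "OperatingSystem": "AMAZON_LINUX",
        "ApprovalRules": build_approval_rules(),
    }


# With boto3 installed and credentials configured, the request could be sent as:
#   baseline = boto3.client("ssm").create_patch_baseline(**build_baseline_request())
# The returned BaselineId could then be registered for your Patch Group value with
#   register_patch_baseline_for_patch_group(BaselineId=..., PatchGroup=...)
```

Keeping the rules in a plain dictionary makes it easy to review the approval delays before anything is sent to Systems Manager.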
Now that you have successfully set up a role, created a patch baseline, and registered your Amazon EC2 instance with your patch baseline, you will define a maintenance window so that you can control when your Amazon EC2 instances will receive patches. By creating multiple maintenance windows and assigning them to different patch groups, you can make sure your Amazon EC2 instances do not all reboot at the same time.
To define a maintenance window:
Use the following command to define a maintenance window. In this example command, the maintenance window will start every Saturday at 10:00 P.M. UTC. It will have a duration of 4 hours and will not start any new tasks 1 hour before the end of the maintenance window.
After defining the maintenance window, you must register the Amazon EC2 instance with the maintenance window so that Systems Manager knows which Amazon EC2 instance it should patch in this maintenance window. You can register the instance by using the same Patch Group tag you used to associate the Amazon EC2 instance with the patch baseline, as shown in the following command.
Assign a task to the maintenance window that will install the operating system patches on your Amazon EC2 instance. The following command includes the following options.
name is the name of your task and is optional. I named mine Patching.
task-arn is the name of the task document you want to run.
max-concurrency allows you to specify how many of your Amazon EC2 instances Systems Manager should patch at the same time. max-errors determines when Systems Manager should abort the task. For patching, this number should not be too low, because you do not want your entire patch task to stop on all instances if one instance fails. You can set this, for example, to 20%.
service-role-arn is the Amazon Resource Name (ARN) of the MaintenanceWindowRole role, with the AmazonSSMMaintenanceWindowRole policy attached, that you created earlier in this blog post.
task-invocation-parameters defines the parameters that are specific to the AWS-RunPatchBaseline task document and tells Systems Manager that you want to install patches with a timeout of 600 seconds (10 minutes).
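To make the timing of the maintenance window concrete, the following sketch works out when the example window ends and when it stops starting new tasks. The date is an arbitrary Saturday chosen for illustration, the function names are hypothetical, and treating the cutoff as exclusive is an assumption.

```python
from datetime import datetime, timedelta, timezone


def window_bounds(start, duration_hours=4, cutoff_hours=1):
    """Return (window_end, last_task_start) for a window that opens at start."""
    end = start + timedelta(hours=duration_hours)
    last_task_start = end - timedelta(hours=cutoff_hours)
    return end, last_task_start


def may_start_task(now, start, duration_hours=4, cutoff_hours=1):
    """Return True if a new task may still be started at time now."""
    _, last_task_start = window_bounds(start, duration_hours, cutoff_hours)
    return start <= now < last_task_start


# Saturday, January 5, 2019 at 10:00 P.M. UTC (an arbitrary example date)
window_start = datetime(2019, 1, 5, 22, 0, tzinfo=timezone.utc)
window_end, last_task_start = window_bounds(window_start)
```

With a duration of 4 hours and a cutoff of 1 hour, the window that opens Saturday at 10:00 P.M. UTC ends Sunday at 2:00 A.M. UTC and starts no new tasks from 1:00 A.M. UTC onward.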
Now, you must wait for the maintenance window to run at least once according to the schedule you defined earlier. If your maintenance window has expired, you can check the status of any maintenance tasks Systems Manager has performed by using the following command.
You also can see the overall patch compliance of all Amazon EC2 instances using the following command in the AWS CLI.
$ aws ssm list-compliance-summaries
This command returns, in JSON format, the number of instances that are compliant with each category and the number of instances that are not.
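If you want to post-process that JSON, a short Python sketch such as the following could reduce it to one pair of counts per compliance type. The sample response is hand-written for illustration, not real output, and it assumes the field names of the ListComplianceSummaries response (ComplianceSummaryItems, CompliantSummary, NonCompliantSummary).

```python
def summarize_compliance(response):
    """Map each compliance type to a (compliant, noncompliant) pair of counts."""
    summary = {}
    for item in response.get("ComplianceSummaryItems", []):
        compliant = item.get("CompliantSummary", {}).get("CompliantCount", 0)
        noncompliant = item.get("NonCompliantSummary", {}).get("NonCompliantCount", 0)
        summary[item["ComplianceType"]] = (compliant, noncompliant)
    return summary


# Hand-written sample in the assumed response shape (illustrative values only)
sample_response = {
    "ComplianceSummaryItems": [
        {
            "ComplianceType": "Patch",
            "CompliantSummary": {"CompliantCount": 3},
            "NonCompliantSummary": {"NonCompliantCount": 1},
        },
        {
            "ComplianceType": "Association",
            "CompliantSummary": {"CompliantCount": 4},
            "NonCompliantSummary": {"NonCompliantCount": 0},
        },
    ]
}

counts = summarize_compliance(sample_response)
```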
You also can see overall patch compliance by choosing Compliance under Insights in the navigation pane of the Systems Manager console. You will see a visual representation of how many Amazon EC2 instances are up to date, how many Amazon EC2 instances are noncompliant, and how many Amazon EC2 instances are compliant in relation to the earlier defined patch baseline.
In this section, you have set everything up for patch management on your instance. Now you know how to patch your Amazon EC2 instance in a controlled manner and how to check if your Amazon EC2 instance is compliant with the patch baseline you have defined. Of course, I recommend that you apply these steps to all Amazon EC2 instances you manage.
Summary
In this blog post, I showed how to use Systems Manager to create a patch baseline and maintenance window to keep your Amazon EC2 Linux instances up to date with the latest security patches. Remember that by creating multiple maintenance windows and assigning them to different patch groups, you can make sure your Amazon EC2 instances do not all reboot at the same time.
If you have comments about this post, submit them in the “Comments” section below. If you have questions about or issues implementing any part of this solution, start a new thread on the Amazon EC2 forum or contact AWS Support.
This post courtesy of Roberto Iturralde, Sr. Application Developer, AWS Professional Services
Application architects are faced with key decisions throughout the process of designing and implementing their systems. One decision common to nearly all solutions is how to manage the storage and access rights of application configuration. Shared configuration should be stored centrally and securely with each system component having access only to the properties that it needs for functioning.
With AWS Systems Manager Parameter Store, developers have access to central, secure, durable, and highly available storage for application configuration and secrets. Parameter Store also integrates with AWS Identity and Access Management (IAM), allowing fine-grained access control to individual parameters or branches of a hierarchical tree.
This post demonstrates how to create and access shared configurations in Parameter Store from AWS Lambda. Both encrypted and plaintext parameter values are stored with only the Lambda function having permissions to decrypt the secrets. You also use AWS X-Ray to profile the function.
Solution overview
This example is made up of the following components:
An unencrypted Parameter Store parameter that the Lambda function loads
A KMS key that only the Lambda function can access. You use this key to create an encrypted parameter later.
Lambda function code in Python 3.6 that demonstrates how to load values from Parameter Store at function initialization for reuse across invocations.
Launch the AWS SAM template
To create the resources shown in this post, you can download the SAM template or choose the button to launch the stack. The template requires one parameter, an IAM user name, which is the name of the IAM user to be the admin of the KMS key that you create. To perform the steps listed in this post, this IAM user needs permissions to execute Lambda functions, create Parameter Store parameters, administer keys in KMS, and view the X-Ray console. If you have these privileges in your IAM user account, you can use your own account to complete the walkthrough. You cannot use the root user to administer the KMS keys.
SAM template resources
The following sections show the code for the resources defined in the template.
Lambda function
In this YAML code, you define a Lambda function named ParameterStoreBlogFunctionDev using the SAM AWS::Serverless::Function type. The environment variables for this function include the ENV (dev) and the APP_CONFIG_PATH where you find the configuration for this app in Parameter Store. X-Ray tracing is also enabled for profiling later.
The IAM role for this function extends the AWSLambdaBasicExecutionRole by adding IAM policies that grant the function permissions to write to X-Ray and get parameters from Parameter Store, limited to paths under /dev/parameterStoreBlog*.
Parameter Store parameter
SimpleParameter:
  Type: AWS::SSM::Parameter
  Properties:
    Name: '/dev/parameterStoreBlog/appConfig'
    Description: 'Sample dev config values for my app'
    Type: String
    Value: '{"key1": "value1","key2": "value2","key3": "value3"}'
This YAML code creates a plaintext string parameter in Parameter Store in a path that your Lambda function can access.
KMS encryption key
ParameterStoreBlogDevEncryptionKeyAlias:
  Type: AWS::KMS::Alias
  Properties:
    AliasName: 'alias/ParameterStoreBlogKeyDev'
    TargetKeyId: !Ref ParameterStoreBlogDevEncryptionKey
ParameterStoreBlogDevEncryptionKey:
  Type: AWS::KMS::Key
  Properties:
    Description: 'Encryption key for secret config values for the Parameter Store blog post'
    Enabled: True
    EnableKeyRotation: False
    KeyPolicy:
      Version: '2012-10-17'
      Id: 'key-default-1'
      Statement:
        -
          Sid: 'Allow administration of the key & encryption of new values'
          Effect: Allow
          Principal:
            AWS:
              - !Sub 'arn:aws:iam::${AWS::AccountId}:user/${IAMUsername}'
          Action:
            - 'kms:Create*'
            - 'kms:Encrypt'
            - 'kms:Describe*'
            - 'kms:Enable*'
            - 'kms:List*'
            - 'kms:Put*'
            - 'kms:Update*'
            - 'kms:Revoke*'
            - 'kms:Disable*'
            - 'kms:Get*'
            - 'kms:Delete*'
            - 'kms:ScheduleKeyDeletion'
            - 'kms:CancelKeyDeletion'
          Resource: '*'
        -
          Sid: 'Allow use of the key'
          Effect: Allow
          Principal:
            AWS: !GetAtt ParameterStoreBlogFunctionRoleDev.Arn
          Action:
            - 'kms:Encrypt'
            - 'kms:Decrypt'
            - 'kms:ReEncrypt*'
            - 'kms:GenerateDataKey*'
            - 'kms:DescribeKey'
          Resource: '*'
This YAML code creates an encryption key with a key policy with two statements.
The first statement allows a given user (${IAMUsername}) to administer the key. Importantly, this includes the ability to encrypt values using this key and disable or delete this key, but does not allow the administrator to decrypt values that were encrypted with this key.
The second statement grants your Lambda function permission to encrypt and decrypt values using this key. The alias for this key in KMS is ParameterStoreBlogKeyDev, which is how you reference it later.
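To see why the key administrator can encrypt but not decrypt, you can check each statement's action patterns against a requested action. The following is a deliberately simplified sketch: it only matches wildcard patterns case-insensitively and ignores the rest of IAM policy evaluation (explicit denies, conditions, identity-based policies). The function name allows is hypothetical.

```python
from fnmatch import fnmatchcase

# Action lists copied from the two key policy statements above
ADMIN_ACTIONS = [
    "kms:Create*", "kms:Encrypt", "kms:Describe*", "kms:Enable*",
    "kms:List*", "kms:Put*", "kms:Update*", "kms:Revoke*",
    "kms:Disable*", "kms:Get*", "kms:Delete*",
    "kms:ScheduleKeyDeletion", "kms:CancelKeyDeletion",
]
FUNCTION_ACTIONS = [
    "kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*",
    "kms:GenerateDataKey*", "kms:DescribeKey",
]


def allows(action_patterns, action):
    """Return True if any pattern in the statement matches the requested action."""
    requested = action.lower()
    return any(fnmatchcase(requested, pattern.lower()) for pattern in action_patterns)


admin_can_encrypt = allows(ADMIN_ACTIONS, "kms:Encrypt")
admin_can_decrypt = allows(ADMIN_ACTIONS, "kms:Decrypt")
function_can_decrypt = allows(FUNCTION_ACTIONS, "kms:Decrypt")
```

None of the administrator's patterns match kms:Decrypt, so only the Lambda function role can read values encrypted with this key.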
Lambda function
Here I walk you through the Lambda function code.
import os, traceback, json, configparser, boto3
from aws_xray_sdk.core import patch_all
patch_all()

# Initialize boto3 client at global scope for connection reuse
client = boto3.client('ssm')
env = os.environ['ENV']
app_config_path = os.environ['APP_CONFIG_PATH']
full_config_path = '/' + env + '/' + app_config_path
# Initialize app at global scope for reuse across invocations
app = None

class MyApp:
    def __init__(self, config):
        """
        Construct new MyApp with configuration
        :param config: application configuration
        """
        self.config = config

    def get_config(self):
        return self.config

def load_config(ssm_parameter_path):
    """
    Load configparser from config stored in SSM Parameter Store
    :param ssm_parameter_path: Path to app config in SSM Parameter Store
    :return: ConfigParser holding loaded config
    """
    configuration = configparser.ConfigParser()
    try:
        # Get all parameters for this app
        param_details = client.get_parameters_by_path(
            Path=ssm_parameter_path,
            Recursive=False,
            WithDecryption=True
        )
        # Loop through the returned parameters and populate the ConfigParser
        if 'Parameters' in param_details and len(param_details.get('Parameters')) > 0:
            for param in param_details.get('Parameters'):
                param_path_array = param.get('Name').split("/")
                section_position = len(param_path_array) - 1
                section_name = param_path_array[section_position]
                config_values = json.loads(param.get('Value'))
                config_dict = {section_name: config_values}
                print("Found configuration: " + str(config_dict))
                configuration.read_dict(config_dict)
    except:
        print("Encountered an error loading config from SSM.")
        traceback.print_exc()
    finally:
        return configuration

def lambda_handler(event, context):
    global app
    # Initialize app if it doesn't yet exist
    if app is None:
        print("Loading config and creating new MyApp...")
        config = load_config(full_config_path)
        app = MyApp(config)
    return "MyApp config is " + str(app.get_config()._sections)
Beneath the import statements, you import the patch_all function from the AWS X-Ray library, which you use to patch boto3 to create X-Ray segments for all your boto3 operations.
Next, you create a boto3 SSM client at the global scope for reuse across function invocations, following Lambda best practices. Using the function environment variables, you assemble the path where you expect to find your configuration in Parameter Store. The class MyApp is meant to serve as an example of an application that would need its configuration injected at construction. In this example, you create an instance of ConfigParser, a class in Python’s standard library for handling basic configurations, to give to MyApp.
The load_config function loads all the parameters from Parameter Store at the level immediately beneath the path provided in the Lambda function environment variables. Each parameter found is put into a new section in ConfigParser. The name of the section is the name of the parameter, less the base path. In this example, the full parameter name is /dev/parameterStoreBlog/appConfig, which is put in a section named appConfig.
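The mapping from parameter name to section can be tried without calling Parameter Store. The following sketch reuses the parsing logic from load_config on a hand-written parameter list in the shape that get_parameters_by_path returns. The helper names section_name_for and to_config are hypothetical.

```python
import configparser
import json


def section_name_for(parameter_name):
    """Use the last path component of the parameter name as the section name."""
    return parameter_name.split("/")[-1]


def to_config(parameters):
    """Build a ConfigParser with one section per parameter."""
    configuration = configparser.ConfigParser()
    for param in parameters:
        section = section_name_for(param["Name"])
        configuration.read_dict({section: json.loads(param["Value"])})
    return configuration


# Hand-written stand-in for the 'Parameters' list in the API response
sample_parameters = [
    {
        "Name": "/dev/parameterStoreBlog/appConfig",
        "Value": '{"key1": "value1","key2": "value2","key3": "value3"}',
    }
]

config = to_config(sample_parameters)
```

Here config.sections() contains only appConfig, and config["appConfig"]["key1"] holds value1.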
Finally, the lambda_handler function initializes an instance of MyApp if it doesn’t already exist, constructing it with the loaded configuration from Parameter Store. Then it simply returns the currently loaded configuration in MyApp. The impact of this design is that the configuration is only loaded from Parameter Store the first time that the Lambda function execution environment is initialized. Subsequent invocations reuse the existing instance of MyApp, resulting in improved performance. You see this in the X-Ray traces later in this post. For more advanced use cases where configuration changes need to be received immediately, you could implement an expiry policy for your configuration entries or push notifications to your function.
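One way to approach the expiry policy mentioned above is a small wrapper that reloads the configuration once it is older than a time-to-live. This is a sketch of the idea, not part of the function in this post: the class name ExpiringConfig and the 300-second default are assumptions, and the loader argument stands in for a call such as load_config(full_config_path).

```python
import time


class ExpiringConfig:
    """Cache a loaded configuration and reload it after ttl_seconds."""

    def __init__(self, loader, ttl_seconds=300, clock=time.monotonic):
        self._loader = loader      # zero-argument callable that loads the config
        self._ttl = ttl_seconds
        self._clock = clock        # injectable so the expiry can be tested
        self._value = None
        self._loaded_at = None

    def get(self):
        """Return the cached config, reloading it first if it has expired."""
        now = self._clock()
        if self._loaded_at is None or now - self._loaded_at >= self._ttl:
            self._value = self._loader()
            self._loaded_at = now
        return self._value
```

In the handler, you would create one ExpiringConfig at global scope and call get() on each invocation, so warm invocations still skip Parameter Store until the entry expires.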
To confirm that everything was created successfully, test the function in the Lambda console.
In the Functions pane, filter to ParameterStoreBlogFunctionDev to find the function created by the SAM template earlier. Open the function name to view its details.
On the top right of the function detail page, choose Test. You may need to create a new test event. The input JSON doesn’t matter as this function ignores the input.
After running the test, you should see output similar to the following. This demonstrates that the function successfully fetched the unencrypted configuration from Parameter Store.
Create an encrypted parameter
You currently have a simple, unencrypted parameter and a Lambda function that can access it.
Next, you create an encrypted parameter that only your Lambda function has permission to use for decryption. This limits read access for this parameter to only this Lambda function.
To follow along with this section, deploy the SAM template for this post in your account and make your IAM user name the KMS key admin mentioned earlier.
For Name, enter /dev/parameterStoreBlog/appSecrets.
For Type, select Secure String.
For KMS Key ID, choose alias/ParameterStoreBlogKeyDev, which is the key that your SAM template created.
For Value, enter {"secretKey": "secretValue"}.
Choose Create Parameter.
If you now try to view the value of this parameter by choosing the name of the parameter in the parameters list and then choosing Show next to the Value field, you won’t see the value appear. This is because, even though you have permission to encrypt values using this KMS key, you do not have permissions to decrypt values.
In the Lambda console, run another test of your function. You now also see the secret parameter that you created and its decrypted value.
If you do not see the new parameter in the Lambda output, this may be because the Lambda execution environment is still warm from the previous test. Because the parameters are loaded at Lambda startup, you need a fresh execution environment to refresh the values.
Adjust the function timeout to a different value in the Advanced Settings at the bottom of the Lambda Configuration tab. Choose Save and test to trigger the creation of a new Lambda execution environment.
Profiling the impact of querying Parameter Store using AWS X-Ray
By using the AWS X-Ray SDK to patch boto3 in your Lambda function code, each invocation of the function creates traces in X-Ray. In this example, you can use these traces to validate the performance impact of your design decision to only load configuration from Parameter Store on the first invocation of the function in a new execution environment.
From the Lambda function details page where you tested the function earlier, under the function name, choose Monitoring. Choose View traces in X-Ray.
This opens the X-Ray console in a new window filtered to your function. Be aware of the time range field next to the search bar if you don’t see any search results. In this screenshot, I’ve invoked the Lambda function twice, one time 10.3 minutes ago with a response time of 1.1 seconds and again 9.8 minutes ago with a response time of 8 milliseconds.
Looking at the details of the longer running trace by clicking the trace ID, you can see that the Lambda function spent the first ~350 ms of the full 1.1 sec routing the request through Lambda and creating a new execution environment for this function, as this was the first invocation with this code. This is the portion of time before the initialization subsegment.
Next, it took 725 ms to initialize the function, which includes executing the code at the global scope (including creating the boto3 client). This is also a one-time cost for a fresh execution environment.
Finally, the function executed for 65 ms, of which 63.5 ms was the GetParametersByPath call to Parameter Store.
Looking at the trace for the second, much faster function invocation, you see that the majority of the 8 ms execution time was Lambda routing the request to the function and returning the response. Only 1 ms of the overall execution time was attributed to the execution of the function, which makes sense given that after the first invocation you’re simply returning the config stored in MyApp.
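The effect of paying the startup cost only once can be estimated with simple arithmetic on the numbers from these two traces. The following sketch assumes exactly one cold invocation followed by warm invocations at about 8 ms each, which is an idealization. The function name is hypothetical.

```python
# Durations in milliseconds, taken from the two traces described above
ROUTING_AND_ENVIRONMENT_MS = 350   # before the initialization subsegment
INITIALIZATION_MS = 725            # global-scope code, including the boto3 client
FIRST_EXECUTION_MS = 65            # handler run, mostly GetParametersByPath
WARM_INVOCATION_MS = 8

COLD_INVOCATION_MS = ROUTING_AND_ENVIRONMENT_MS + INITIALIZATION_MS + FIRST_EXECUTION_MS


def average_latency_ms(invocations):
    """Average latency when the first call is cold and all others are warm."""
    if invocations < 1:
        raise ValueError("invocations must be at least 1")
    total = COLD_INVOCATION_MS + (invocations - 1) * WARM_INVOCATION_MS
    return total / invocations
```

The three cold-start components add up to 1140 ms, which matches the roughly 1.1 second response time of the first trace. Under the stated assumption, the average over 10 invocations would be about 121 ms and over 100 invocations about 19 ms.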
While the Traces screen allows you to view the details of individual traces, the X-Ray Service Map screen allows you to view aggregate performance data for all traced services over a period of time.
In the X-Ray console navigation pane, choose Service map. Selecting a service node shows the metrics for node-specific requests. Selecting an edge between two nodes shows the metrics for requests that traveled that connection. Again, be aware of the time range field next to the search bar if you don’t see any search results.
After invoking your Lambda function several more times by testing it from the Lambda console, you can view some aggregate performance metrics. Look at the following:
From the client perspective, requests to the Lambda service for the function are taking an average of 50 ms to respond. The function is generating ~1 trace per minute.
The function itself is responding in an average of 3 ms. In the following screenshot, I’ve clicked on this node, which reveals a latency histogram of the traced requests showing that over 95% of requests return in under 5 ms.
Parameter Store is responding to requests in an average of 64 ms, but note the much lower trace rate in the node. This is because you only fetch data from Parameter Store on the initialization of the Lambda execution environment.
Conclusion
Deduplication, encryption, and restricted access to shared configuration and secrets are key components of any mature architecture. Serverless architectures designed using event-driven, on-demand compute services like Lambda are no different.
In this post, I walked you through a sample application accessing unencrypted and encrypted values in Parameter Store. These values were created in a hierarchy by application environment and component name, with the permissions to decrypt secret values restricted to only the function needing access. The techniques used here can become the foundation of secure, robust configuration management in your enterprise serverless applications.
Amazon EMR enables data analysts and scientists to deploy a cluster running popular frameworks such as Spark, HBase, Presto, and Flink of any size in minutes. When you launch a cluster, Amazon EMR automatically configures the underlying Amazon EC2 instances with the frameworks and applications that you choose for your cluster. This can include popular web interfaces such as Hue workbench, Zeppelin notebook, and Ganglia monitoring dashboards and tools.
These web interfaces are hosted on the EMR master node and must be accessed using the public DNS name of the master node (master public DNS value). The master public DNS value is dynamically created, is not very user friendly, and is hard to remember. It looks something like ip-###-###-###-###.us-west-2.compute.internal. Not having a friendly URL to connect to the popular workbench or notebook interfaces can slow your workflow and erode the agility you gained.
Some customers have addressed this challenge through custom bootstrap actions, steps, or external scripts that periodically check for new clusters and register a friendlier name in DNS. These approaches either put additional burden on the data practitioners or require additional resources to execute the scripts. In addition, there is typically some lag time associated with such scripts. They often don’t do a great job cleaning up the DNS records after the cluster has terminated, potentially resulting in a security risk.
The solution in this post provides an automated, serverless approach to registering a friendly master node name for easy access to the web interfaces.
Before I dive deeper, I review these key services and how they are part of this solution.
CloudWatch Events
CloudWatch Events delivers a near real-time stream of system events that describe changes in AWS resources. Using simple rules, you can match events and route them to one or more target functions or streams. An event can be generated in several ways, including:
From an AWS service when resources change state
From API calls that are delivered via AWS CloudTrail
From your own code that can generate application-level events
In this solution, I cover the first type of event, which is automatically emitted by EMR when the cluster state changes. Based on the state in this event, the solution either creates or updates the DNS record in Route 53 when the cluster state changes to STARTING, or deletes the DNS record when the cluster is no longer needed and the state changes to TERMINATED. For more information about all EMR event details, see Monitor CloudWatch Events.
Route 53 private hosted zones
A private hosted zone is a container that holds information about how to route traffic for a domain and its subdomains within one or more VPCs. Private hosted zones enable you to use custom DNS names for your internal resources without exposing the names or IP addresses to the internet.
Route 53 supports resource record sets with a wide range of record types. In this solution, you use a CNAME record that is used to specify a domain name as an alias for another domain (the ‘canonical’ domain). You use a friendly name of the cluster as the CNAME for the EMR master public DNS value.
You are using private hosted zones because an EMR cluster is typically deployed within a private subnet and is accessed either from within the VPC or from on-premises resources over VPN or AWS Direct Connect. To resolve domain names in private hosted zones from your on-premises network, configure a DNS forwarder, as described in How can I resolve Route 53 private hosted zones from an on-premises network via an Ubuntu instance?.
Lambda
Lambda is a compute service that lets you run code without provisioning or managing servers. Lambda executes your code only when needed and scales automatically to thousands of requests per second. Lambda takes care of high availability, and server and OS maintenance and patching. You pay only for the consumed compute time. There is no charge when your code is not running.
Lambda provides the ability to invoke your code in response to events, such as when an object is put to an Amazon S3 bucket or as in this case, when a CloudWatch event is emitted. As part of this solution, you deploy a Lambda function as a target that is invoked by CloudWatch Events when the event matches your rule. You also configure the necessary permissions based on the Lambda permissions model, including a Lambda function policy and Lambda execution role.
Putting it all together
Now that you have all of the pieces, you can put together a complete solution. The following diagram illustrates how the solution works:
Start with a user activity such as launching or terminating an EMR cluster.
EMR automatically sends events to the CloudWatch Events stream.
A CloudWatch Events rule matches the specified event and routes it to a target, which in this case is a Lambda function. In this case, you are using the EMR Cluster State Change event.
The Lambda function performs the following key steps:
Get the clusterId value from the event detail and use it to call the EMR DescribeCluster API to retrieve the following data points:
MasterPublicDnsName – public DNS name of the master node
Locate the tag containing the friendly name to use as the CNAME for the cluster. The key name containing the friendly name should be The value should be specified as host.domain.com, where domain is the private hosted zone in which to update the DNS record.
Update DNS based on the state in the event detail.
If the state is STARTING, the function calls the Route 53 API to create or update a resource record set in the private hosted zone specified by the domain tag. This is a CNAME record mapped to MasterPublicDnsName.
Conversely, if the state is TERMINATED, the function calls the Route 53 API to delete the associated resource record set from the private hosted zone.
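The key steps above can be sketched in Python as pure functions that prepare the Route 53 request. This is an illustration under stated assumptions, not the function deployed by this solution: the helper names, the sample cluster ID, and the TTL of 300 seconds are hypothetical, and the change batch follows the shape accepted by the Route 53 ChangeResourceRecordSets API.

```python
def parse_event(event):
    """Extract the cluster ID and state from an EMR cluster state change event."""
    detail = event["detail"]
    return detail["clusterId"], detail["state"]


def split_friendly_name(friendly_name):
    """Split 'host.domain.com' into the host label and the hosted zone domain."""
    host, _, domain = friendly_name.partition(".")
    if not host or not domain:
        raise ValueError("expected a name of the form host.domain.com")
    return host, domain


def build_change_batch(state, friendly_name, master_public_dns, ttl=300):
    """Return a Route 53 change batch for the state, or None if nothing to do."""
    action = {"STARTING": "UPSERT", "TERMINATED": "DELETE"}.get(state)
    if action is None:
        return None
    return {
        "Changes": [
            {
                "Action": action,
                "ResourceRecordSet": {
                    "Name": friendly_name,
                    "Type": "CNAME",
                    "TTL": ttl,  # assumed value; a DELETE must match the existing record
                    "ResourceRecords": [{"Value": master_public_dns}],
                },
            }
        ]
    }


# Hand-written sample event (illustrative values only)
sample_event = {
    "source": "aws.emr",
    "detail-type": "EMR Cluster State Change",
    "detail": {"clusterId": "j-EXAMPLE", "state": "STARTING"},
}
cluster_id, state = parse_event(sample_event)
```

The resulting dictionary could be passed as the ChangeBatch argument of the Route 53 change_resource_record_sets call, together with the ID of the private hosted zone that matches the domain part of the friendly name.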
Deploying the solution
Because all of the components of this solution are serverless, use the AWS Serverless Application Model (AWS SAM) template to deploy the solution. AWS SAM is natively supported by AWS CloudFormation and provides a simplified syntax for expressing serverless resources, resulting in fewer lines of code.
Overview of the SAM template
For this solution, the SAM template has 76 lines of text as compared to 142 lines without SAM resources (writing the template in YAML would make it slightly shorter still). The solution can be deployed using the AWS Management Console, AWS Command Line Interface (AWS CLI), or AWS SAM Local.
CloudFormation transforms help simplify template authoring by condensing a multiple-line resource declaration into a single line in your template. To inform CloudFormation that your template defines a serverless application, add a line under the template format version as follows:
Before SAM, you would use the AWS::Lambda::Function resource type to define your Lambda function. You would then need a resource to define the permissions for the function (AWS::Lambda::Permission), another resource to define a Lambda execution role (AWS::IAM::Role), and finally a CloudWatch Events resource (AWS::Events::Rule) that triggers this function.
With SAM, you need to define just a single resource for your function, AWS::Serverless::Function. Using this single resource type, you can define everything that you need, including function properties such as function handler, runtime, and code URI, as well as the required IAM policies and the CloudWatch event.
A few additional things to note in the code example:
CodeUri – Before you can deploy a SAM template, first upload your Lambda function code zip to S3. You can do this manually or use the aws cloudformation package CLI command to automate the task of uploading local artifacts to an S3 bucket, as shown later.
Lambda execution role and permissions – You are not specifying a Lambda execution role in the template. Rather, you are providing the required permissions as IAM policy documents. When the template is submitted, CloudFormation expands the AWS::Serverless::Function resource, declaring a Lambda function and an execution role. The created role has two attached policies: a default AWSLambdaBasicExecutionRole and the inline policy specified in the template.
CloudWatch Events rule – Instead of specifying a CloudWatch Events resource type, you are defining an event source object as a property of the function itself. When the template is submitted, CloudFormation expands this into a CloudWatch Events rule resource and automatically creates the Lambda resource-based permissions to allow the CloudWatch Events rule to trigger the function.
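The shape of such a template can be sketched by building it as a Python dictionary and serializing it to JSON. This is not the template from this post: the logical ID, handler, runtime, policy actions, and event pattern are assumptions chosen to illustrate the points above. Only the Transform value and the AWS::Serverless::Function resource type come from the SAM specification.

```python
import json


def build_sam_template(code_uri):
    """Return a minimal SAM template with one function and one event source."""
    return {
        "AWSTemplateFormatVersion": "2010-09-09",
        # This line tells CloudFormation that the template is a SAM template
        "Transform": "AWS::Serverless-2016-10-31",
        "Resources": {
            "EmrDnsSetterFunction": {  # hypothetical logical ID
                "Type": "AWS::Serverless::Function",
                "Properties": {
                    "Handler": "lambda_function.lambda_handler",  # assumed
                    "Runtime": "python3.12",                       # assumed
                    "CodeUri": code_uri,
                    # Inline policy document instead of an explicit execution role
                    "Policies": [
                        {
                            "Version": "2012-10-17",
                            "Statement": [
                                {
                                    "Effect": "Allow",
                                    "Action": [
                                        "elasticmapreduce:DescribeCluster",
                                        "route53:ChangeResourceRecordSets",
                                    ],
                                    "Resource": "*",
                                }
                            ],
                        }
                    ],
                    # Event source declared as a property of the function itself
                    "Events": {
                        "ClusterStateChange": {
                            "Type": "CloudWatchEvent",
                            "Properties": {
                                "Pattern": {
                                    "source": ["aws.emr"],
                                    "detail-type": ["EMR Cluster State Change"],
                                    "detail": {"state": ["STARTING", "TERMINATED"]},
                                }
                            },
                        }
                    },
                },
            }
        },
    }


template_json = json.dumps(build_sam_template("s3://<bucket>/<key>"), indent=2)
```

The single function resource carries the code location, the permissions, and the event source, which is what lets CloudFormation expand it into the separate function, role, rule, and permission resources.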
NOTE: If you are trying this solution outside of us-east-1, then you should download the necessary files, upload them to the buckets in your region, edit the script as appropriate and then run it or use the CLI deployment method below.
3.) Choose Next.
4.) On the Specify Details page, keep or modify the stack name and choose Next.
5.) On the Options page, choose Next.
6.) On the Review page, take the following steps:
Acknowledge the two Transform access capabilities. This allows the CloudFormation transform to create the required IAM resources with custom names.
Under Transforms, choose Create Change Set.
Wait a few seconds for the change set to be created before proceeding. The change set should look as follows:
7.) Choose Execute to deploy the template.
After the template is deployed, you should see four resources created:
After the package is successfully uploaded, the output should look as follows:
Uploading to 0f6d12c7872b50b37dbfd5a60385b854 1872 / 1872.0 (100.00%)
Successfully packaged artifacts and wrote output template to file serverless-output.template.
The CodeUri property in serverless-output.template is now referencing the packaged artifacts in the S3 bucket that you specified:
s3://<bucket>/0f6d12c7872b50b37dbfd5a60385b854
Use the aws cloudformation deploy CLI command to deploy the stack:
You should see the following output after the stack has been successfully created:
Waiting for changeset to be created...
Waiting for stack create/update to complete
Successfully created/updated stack – EmrDnsSetterCli
Validating results
To test the solution, launch an EMR cluster. The Lambda function looks for the cluster_name tag associated with the EMR cluster. Make sure to specify the friendly name of your cluster as host.domain.com where the domain is the private hosted zone in which to create the CNAME record.
Here is a sample CLI command to launch a cluster within a specific subnet in a VPC with the required tag cluster_name.
After the cluster is launched, log in to the Route 53 console. In the left navigation pane, choose Hosted Zones to view the list of private and public zones currently configured in Route 53. Select the hosted zone that you specified in the ZONE tag when you launched the cluster. Verify that the resource records were created.
You can also monitor the CloudWatch Events metrics that are published to CloudWatch every minute, such as the number of TriggeredRules and Invocations.
Now that you’ve verified that the Lambda function successfully updated the Route 53 resource records in the zone file, terminate the EMR cluster and verify that the records are removed by the same function.
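The create-then-remove behavior verified above could be driven by the state carried in each EMR cluster state change event. The following sketch is an assumption about how such a decision might look, not the code of the deployed function: the helper name route53_action_for_state, the grouping of states, and the sample cluster ID are illustrative. UPSERT and DELETE are Route 53 change actions.

```python
from typing import Optional

# Assumption: which EMR cluster states should create or remove the record.
ACTIVE_STATES = {"RUNNING", "WAITING"}
ENDED_STATES = {"TERMINATED", "TERMINATED_WITH_ERRORS"}


def route53_action_for_state(state: str) -> Optional[str]:
    """Map an EMR cluster state to a Route 53 change action, or None to do nothing."""
    if state in ACTIVE_STATES:
        return "UPSERT"
    if state in ENDED_STATES:
        return "DELETE"
    return None


# Trimmed-down, hypothetical event in the general shape of an EMR state change event.
sample_event = {
    "source": "aws.emr",
    "detail-type": "EMR Cluster State Change",
    "detail": {"clusterId": "j-EXAMPLE", "state": "TERMINATED"},
}
print(route53_action_for_state(sample_event["detail"]["state"]))
```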
Conclusion
This solution provides a serverless approach to automatically assigning a friendly name for your EMR cluster for easy access to popular notebooks and other web interfaces. CloudWatch Events also supports cross-account event delivery, so if you are running EMR clusters in multiple AWS accounts, all cluster state events across accounts can be consolidated into a single account.
I hope that this solution provides a small glimpse into the power of CloudWatch Events and Lambda and how they can be leveraged with EMR and other AWS big data services. For example, by using the EMR step state change event, you can chain various pieces of your analytics pipeline. You may have a transient cluster perform data ingest and, when the task successfully completes, spin up an ETL cluster for transformation and upload to Amazon Redshift. The possibilities are truly endless.
When managing your AWS resources, you often need to grant one AWS service access to another to accomplish tasks. For example, you could use an AWS Lambda function to resize, watermark, and postprocess images, for which you would need to store the associated metadata in Amazon DynamoDB. You also could use Lambda, Amazon S3, and Amazon CloudFront to build a serverless website that uses a DynamoDB table as a session store, with Lambda updating the information in the table. In both these examples, you need to grant Lambda functions permissions to write to DynamoDB.
In this post, I demonstrate how to create an AWS Identity and Access Management (IAM) policy that will be attached to an IAM role. The role is then used to grant a Lambda function access to a DynamoDB table. By using an IAM policy and role to control access, I don’t need to embed credentials in code and can tightly control which services the Lambda function can access. The policy also includes permissions to allow the Lambda function to write log files to Amazon CloudWatch Logs. This allows me to view utilization statistics for my Lambda functions and to access additional logging for troubleshooting issues.
Solution overview
The following architecture diagram presents an overview of the solution in this post.
The architecture of this post’s solution uses a Lambda function (1 in the preceding diagram) to make read API calls such as GET or SCAN and write API calls such as PUT or UPDATE to a DynamoDB table (2). The Lambda function also writes log files to CloudWatch Logs (3). The Lambda function uses an IAM role (4) that has an IAM policy attached (5) that grants access to DynamoDB and CloudWatch.
Overview of the AWS services used in this post
I use the following AWS services in this post’s solution:
IAM – For securely controlling access to AWS services. With IAM, you can centrally manage users, security credentials such as access keys, and permissions that control which AWS resources users and applications can access.
DynamoDB – A fast and flexible NoSQL database service for all applications that need consistent, single-digit-millisecond latency at any scale.
Lambda – Run code without provisioning or managing servers. You pay only for the compute time you consume—there is no charge when your code is not running.
CloudWatch Logs – For monitoring, storing, and accessing log files generated by AWS resources, including Lambda.
IAM access policies
I have authored an IAM access policy with JSON to grant the required permissions to the DynamoDB table and CloudWatch Logs. I will attach this policy to a role, and this role will then be attached to a Lambda function, which will assume the required access to DynamoDB and CloudWatch Logs.
I will walk through this policy, and explain its elements and how to create the policy in the IAM console.
The following policy grants a Lambda function read and write access to a DynamoDB table and permission to write log files to CloudWatch Logs. This policy is called MyLambdaPolicy. The following is the full JSON document of this policy (the AWS account ID is a placeholder value that you would replace with your own account ID).
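As a sketch of what such a policy document could contain, the following builds an equivalent structure in Python and prints it as JSON. The two Resource ARNs and the 2012-10-17 version are the values discussed in this post. The exact lists of DynamoDB and CloudWatch Logs actions are an assumption chosen to match "read and write access" and "write log files", so adjust them to the operations your function actually performs.

```python
import json

ACCOUNT_ID = "123456789012"  # placeholder; replace with your own account ID
REGION = "eu-west-1"

my_lambda_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            # Read and write access to one DynamoDB table (action list is an assumption).
            "Effect": "Allow",
            "Action": [
                "dynamodb:GetItem",
                "dynamodb:Query",
                "dynamodb:Scan",
                "dynamodb:PutItem",
                "dynamodb:UpdateItem",
            ],
            "Resource": f"arn:aws:dynamodb:{REGION}:{ACCOUNT_ID}:table/SampleTable",
        },
        {
            # Permission to write log files to CloudWatch Logs (action list is an assumption).
            "Effect": "Allow",
            "Action": [
                "logs:CreateLogGroup",
                "logs:CreateLogStream",
                "logs:PutLogEvents",
            ],
            "Resource": f"arn:aws:logs:{REGION}:{ACCOUNT_ID}:*",
        },
    ],
}

# The printed JSON is what you would paste into the JSON tab of the policy editor.
print(json.dumps(my_lambda_policy, indent=2))
```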
The first element in this policy is the Version, which defines the version of the IAM policy language (not a version of JSON itself). At the time of this post’s publication, the most recent policy language version is 2012-10-17.
The next element in this first policy is a Statement. This is the main section of the policy and includes multiple elements. This first statement is to Allow access to DynamoDB, and in this example, the elements I use are:
An Effect element – Specifies whether the statement results in an Allow or an explicit Deny. By default, access to resources is implicitly denied. In this example, I have used Allow because I want to allow the actions.
An Action element – Describes the specific actions for this statement. Each AWS service has its own set of actions that describe tasks that you can perform with that service. I have used the DynamoDB actions that I want to allow. For the definitions of all available actions for DynamoDB, see the DynamoDB API Reference.
A Resource element – Specifies the object or objects for this statement using Amazon Resource Names (ARNs). You use an ARN to uniquely identify an AWS resource. All Resource elements start with arn:aws and then define the object or objects for the statement. I use this to specify the DynamoDB table to which I want to allow access. To build the Resource element for DynamoDB, I have to specify:
The AWS service (dynamodb)
The AWS Region (eu-west-1)
The AWS account ID (123456789012)
The table (table/SampleTable)
The complete Resource element of the first statement is: arn:aws:dynamodb:eu-west-1:123456789012:table/SampleTable
In this policy, I created a second statement to allow access to CloudWatch Logs so that the Lambda function can write log files for troubleshooting and analysis. I have used the same elements as for the DynamoDB statement, but have changed the following values:
For the Action element, I used the CloudWatch Logs actions that I want to allow. Definitions of all the available actions for CloudWatch Logs are provided in the CloudWatch Logs API Reference.
For the Resource element, I specified the AWS account to which I want to allow my Lambda function to write its log files. As in the preceding example for DynamoDB, you have to use the ARN for CloudWatch Logs to specify where access should be granted. To build the Resource element for CloudWatch Logs, I have to specify:
The AWS service (logs)
The AWS Region (eu-west-1)
The AWS account ID (123456789012)
All log groups in this account (*)
The complete Resource element of the second statement is: arn:aws:logs:eu-west-1:123456789012:*
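Both Resource elements follow the same colon-separated layout, which the following sketch makes explicit. The helper name build_arn is hypothetical; the partition (aws), services, Region, account ID, and resource values are the ones used in this post.

```python
def build_arn(service: str, region: str, account_id: str, resource: str,
              partition: str = "aws") -> str:
    """Join the ARN components in order: arn:partition:service:region:account:resource."""
    return ":".join(["arn", partition, service, region, account_id, resource])


dynamodb_arn = build_arn("dynamodb", "eu-west-1", "123456789012", "table/SampleTable")
logs_arn = build_arn("logs", "eu-west-1", "123456789012", "*")

print(dynamodb_arn)
print(logs_arn)
```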
Create the IAM policy in your account
Before you can apply MyLambdaPolicy to a Lambda function, you have to create the policy in your own account and then apply it to an IAM role.
To create an IAM policy:
Navigate to the IAM console and choose Policies in the navigation pane. Choose Create policy.
Because I have already written the policy in JSON, you don’t need to use the Visual Editor, so you can choose the JSON tab and paste the content of the JSON policy document shown earlier in this post (remember to replace the placeholder account IDs with your own account ID). Choose Review policy.
Name the policy MyLambdaPolicy and give it a description that will help you remember the policy’s purpose. You also can view a summary of the policy’s permissions. Choose Create policy.
You have created the IAM policy that you will apply to the Lambda function.
Attach the IAM policy to an IAM role
To apply MyLambdaPolicy to a Lambda function, you first have to attach the policy to an IAM role.
To create a new role and attach MyLambdaPolicy to the role:
Navigate to the IAM console and choose Roles in the navigation pane. Choose Create role.
Choose AWS service and then choose Lambda. Choose Next: Permissions.
On the Attach permissions policies page, type MyLambdaPolicy in the Search box. Choose MyLambdaPolicy from the list of returned search results, and then choose Next: Review.
On the Review page, type MyLambdaRole in the Role name box and an appropriate description, and then choose Create role.
You have attached the policy you created earlier to a new IAM role, which in turn can be used by a Lambda function.
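Choosing Lambda as the trusted AWS service in the steps above means the role carries a trust policy in addition to MyLambdaPolicy. The console generates it for you; the sketch below only illustrates its typical shape, in which the Lambda service principal (lambda.amazonaws.com) is allowed to call sts:AssumeRole. The variable and helper names are illustrative.

```python
import json

# Typical trust policy for a role that Lambda functions can assume.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "lambda.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}


def trusted_services(policy: dict) -> list:
    """Return the service principals that are allowed to assume the role."""
    services = []
    for statement in policy["Statement"]:
        if statement["Effect"] == "Allow" and statement["Action"] == "sts:AssumeRole":
            principal = statement["Principal"].get("Service", [])
            # The Service value may be a single string or a list of strings.
            services.extend([principal] if isinstance(principal, str) else principal)
    return services


print(json.dumps(trust_policy, indent=2))
print(trusted_services(trust_policy))
```

The permissions policy says what the role may do; the trust policy says who may use the role. A Lambda function needs both to reach DynamoDB and CloudWatch Logs.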
Apply the IAM role to a Lambda function
You have created an IAM role that has an attached IAM policy that grants both read and write access to DynamoDB and write access to CloudWatch Logs. The next step is to apply the IAM role to a Lambda function.
To apply the IAM role to a Lambda function:
Navigate to the Lambda console and choose Create function.
On the Create function page under Author from scratch, name the function MyLambdaFunction, and choose the runtime you want to use based on your application requirements. Lambda currently supports Node.js, Java, Python, Go, and .NET Core. From the Role dropdown, choose Choose an existing role, and from the Existing role dropdown, choose MyLambdaRole. Then choose Create function.
MyLambdaFunction now has access to CloudWatch Logs and DynamoDB. You can choose either of these services to see the details of which permissions the function has.
If you have any comments about this blog post, submit them in the “Comments” section below. If you have any questions about the services used, start a new thread in the applicable AWS forum: IAM, Lambda, DynamoDB, or CloudWatch.
Previously, applications running inside a VPC required internet access to connect to AWS KMS. This meant managing internet connectivity through internet gateways, Network Address Translation (NAT) devices, or firewall proxies. With support for Amazon VPC endpoints, you can now keep all traffic between your VPC and AWS KMS within the AWS network and avoid management of internet connectivity. In this blog post, I show you how to create and use an Amazon VPC endpoint for AWS KMS, audit the use of AWS KMS keys through the Amazon VPC endpoint, and build stricter access controls using key policies.
Create and use an Amazon VPC endpoint with AWS KMS
To get started, I will show you how to use the Amazon VPC console to create an endpoint in the US East (N. Virginia) Region, also known as us-east-1.
To create an endpoint in the US East (N. Virginia) Region:
Navigate to the Amazon VPC console. In the navigation pane, choose Endpoints, and then choose Create Endpoint.
Choose AWS services for Service category.
Choose the AWS KMS endpoint service, com.amazonaws.us-east-1.kms, from the Service Name list, as shown in the following screenshot.
Your VPC endpoint can span multiple Availability Zones, providing isolation and fault tolerance. Choose a subnet from each Availability Zone from which you want to connect. An elastic network interface for the VPC endpoint is created in each subnet that you choose, each with its own DNS hostname and private IP address.
If your VPC has DNS hostnames and DNS support enabled, choose Enable for this endpoint under Enable Private DNS Name to have applications use the VPC endpoint by default.
You use security groups to control access to your endpoint. Choose a security group from the list, or create a new one.
To finish creating the endpoint, choose Create endpoint. The console returns a VPC Endpoint ID. In our example, the VPC Endpoint ID is vpce-0c0052e3fbffdb450.
To connect to this endpoint, you need a DNS hostname that is generated for this endpoint. You can view these DNS hostnames by choosing the VPC Endpoint ID and then choosing the Details tab of the endpoint in the Amazon VPC console. One of the DNS hostnames for the endpoint that I created in the previous step is vpce-0c0052e3fbffdb450-afmosqu8.kms.us-east-1.vpce.amazonaws.com.
You can connect to AWS KMS through the VPC endpoint by using the AWS CLI or an AWS SDK. In this example, I use the following AWS CLI command to list all AWS KMS keys in the account in us-east-1.
If your VPC has DNS hostnames and DNS support enabled and you enabled private DNS names in the preceding steps, you can connect to your VPC endpoint by using the standard AWS KMS DNS hostname (https://kms.<region>.amazonaws.com), instead of manually configuring the endpoints in the AWS CLI or AWS SDKs. The AWS CLI and SDKs use this hostname by default to connect to KMS, so there’s nothing to change in your application to begin using the VPC endpoint.
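The choice between the two hostnames can be summarized in a few lines. This is a sketch under the assumptions stated in the text; the helper name kms_endpoint_url is hypothetical. The resulting URL is what you would pass to the AWS CLI's --endpoint-url option or to the endpoint setting of an SDK client when private DNS is not enabled.

```python
from typing import Optional


def kms_endpoint_url(region: str, private_dns_enabled: bool,
                     vpce_dns_name: Optional[str] = None) -> str:
    """Pick the URL a client inside the VPC should use to reach AWS KMS."""
    if private_dns_enabled:
        # The standard hostname resolves to the VPC endpoint, so no override is needed.
        return f"https://kms.{region}.amazonaws.com"
    if not vpce_dns_name:
        raise ValueError("an endpoint-specific DNS hostname is required "
                         "when private DNS is not enabled")
    return f"https://{vpce_dns_name}"


# Hostname taken from the example endpoint created earlier in this post.
print(kms_endpoint_url(
    "us-east-1",
    private_dns_enabled=False,
    vpce_dns_name="vpce-0c0052e3fbffdb450-afmosqu8.kms.us-east-1.vpce.amazonaws.com",
))
print(kms_endpoint_url("us-east-1", private_dns_enabled=True))
```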
You can monitor and audit AWS KMS usage through your VPC endpoint. Every request made to AWS KMS is logged by AWS CloudTrail. Now, when you use a VPC endpoint to make requests to AWS KMS, the endpoint ID appears in the CloudTrail log entries.
Restrict access using key policies
A good security practice to follow is least privilege: granting the fewest permissions required to complete a task. You can control access to your AWS KMS keys from a specific VPC endpoint by using AWS KMS key policies and AWS Identity and Access Management (IAM) policies. The aws:sourceVpce condition key lets you grant or restrict access to AWS KMS keys based on the VPC endpoint used. For example, the following key policy allows a user to perform encryption operations with a key only when the request comes through the specified VPC endpoint (replace the placeholder AWS account ID with your own account ID, and the placeholder VPC endpoint ID with your own endpoint ID).
This policy works by including a Deny statement with a StringNotEquals condition. When a user makes a request to AWS KMS through a VPC endpoint, the endpoint’s ID is compared to the aws:sourceVpce value specified in the policy. If the two values are not the same, the request is denied. You can modify AWS KMS key policies in the AWS KMS console. For more information, see Modifying a Key Policy.
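The comparison described above can be sketched as follows. The statement is an illustration of a Deny with a StringNotEquals condition on aws:sourceVpce, using the endpoint ID from earlier in this post; the Sid, the action list, and the helper is_denied are assumptions, and the helper is a toy model of the comparison rather than the real policy evaluation logic.

```python
VPCE_ID = "vpce-0c0052e3fbffdb450"  # endpoint ID from the example in this post

deny_outside_endpoint = {
    "Sid": "DenyUseOutsideMyVpcEndpoint",  # hypothetical statement ID
    "Effect": "Deny",
    "Principal": "*",
    # Assumed set of cryptographic actions to restrict.
    "Action": ["kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*", "kms:GenerateDataKey*"],
    "Resource": "*",
    "Condition": {"StringNotEquals": {"aws:sourceVpce": VPCE_ID}},
}


def is_denied(statement: dict, request_vpce_id) -> bool:
    """Toy check: the Deny applies when the request's endpoint ID differs.

    A request that does not arrive through any VPC endpoint carries no
    aws:sourceVpce value; it is modeled here as None and is also denied.
    """
    expected = statement["Condition"]["StringNotEquals"]["aws:sourceVpce"]
    return statement["Effect"] == "Deny" and request_vpce_id != expected


print(is_denied(deny_outside_endpoint, VPCE_ID))
print(is_denied(deny_outside_endpoint, "vpce-11111111111111111"))
print(is_denied(deny_outside_endpoint, None))
```

Because an explicit Deny overrides any Allow, a statement like this narrows access without having to rewrite the statements that grant it.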
You also can control access to your AWS KMS keys from any endpoint running in one or more VPCs by using the aws:sourceVpc policy condition key. Suppose you have an application that is running in one VPC, but uses a second VPC for resource management functions. In the following example policy, AWS KMS key administrative actions can only be made from VPC vpc-12345678, and the key can only be used for cryptographic operations from VPC vpc-2b2b2b2b.
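A two-VPC policy of this kind might be structured as in the sketch below, which pairs each group of actions with a StringEquals condition on aws:sourceVpc. The VPC IDs are the ones named above; the principal ARN uses a placeholder account ID, and the Sid values, the action lists, and the helper allow_from_vpc are assumptions made for illustration.

```python
import json

ADMIN_VPC = "vpc-12345678"   # VPC allowed to administer the key
CRYPTO_VPC = "vpc-2b2b2b2b"  # VPC allowed to use the key for cryptographic operations
PRINCIPAL = "arn:aws:iam::123456789012:root"  # placeholder account ID


def allow_from_vpc(sid: str, actions: list, vpc_id: str) -> dict:
    """Build an Allow statement that only applies to requests from one VPC."""
    return {
        "Sid": sid,
        "Effect": "Allow",
        "Principal": {"AWS": PRINCIPAL},
        "Action": actions,
        "Resource": "*",
        "Condition": {"StringEquals": {"aws:sourceVpc": vpc_id}},
    }


key_policy = {
    "Version": "2012-10-17",
    "Statement": [
        allow_from_vpc(
            "AllowAdministrationFromAdminVpc",
            ["kms:Create*", "kms:Describe*", "kms:Enable*", "kms:Put*",
             "kms:Update*", "kms:Disable*"],
            ADMIN_VPC,
        ),
        allow_from_vpc(
            "AllowCryptoFromApplicationVpc",
            ["kms:Encrypt", "kms:Decrypt", "kms:ReEncrypt*", "kms:GenerateDataKey*"],
            CRYPTO_VPC,
        ),
    ],
}

print(json.dumps(key_policy, indent=2))
```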
The previous examples show how you can limit access to AWS KMS API actions that are attached to a key policy. If you want to limit access to AWS KMS API actions that are not attached to a specific key, you have to use these VPC-related conditions in an IAM policy that refers to the desired AWS KMS API actions.
Summary
In this post, I have demonstrated how to create and use a VPC endpoint for AWS KMS, and how to use the aws:sourceVpc and aws:sourceVpce policy conditions to scope permissions to call various AWS KMS APIs. AWS KMS VPC endpoints provide you with more control over how your applications connect to AWS KMS and can save you from managing internet connectivity from your VPC.
To learn more about connecting to AWS KMS through a VPC endpoint, see the AWS KMS Developer Guide. For helpful guidance about your overall VPC network structure, see Practical VPC Design.
If you have questions about this feature or anything else related to AWS KMS, start a new thread in the AWS KMS forum.
We are excited to announce AWS Glue support for running ETL (extract, transform, and load) scripts in Scala. Scala lovers can rejoice because they now have one more powerful tool in their arsenal. Scala is the native language for Apache Spark, the underlying engine that AWS Glue offers for performing data transformations.
Beyond its elegant language features, writing Scala scripts for AWS Glue has two main advantages over writing scripts in Python. First, Scala is faster for custom transformations that do a lot of heavy lifting because there is no need to shovel data between Python and Apache Spark’s Scala runtime (that is, the Java virtual machine, or JVM). You can build your own transformations or invoke functions in third-party libraries. Second, it’s simpler to call functions in external Java class libraries from Scala because Scala is designed to be Java-compatible. It compiles to the same bytecode, and its data structures don’t need to be converted.
To illustrate these benefits, we walk through an example that analyzes a recent sample of the GitHub public timeline available from the GitHub archive. This site is an archive of public requests to the GitHub service, recording more than 35 event types ranging from commits and forks to issues and comments.
This post shows how to build an example Scala script that identifies highly negative issues in the timeline. It pulls out issue events in the timeline sample, analyzes their titles using the sentiment prediction functions from the Stanford CoreNLP libraries, and surfaces the most negative issues.
Getting started
Before we start writing scripts, we use AWS Glue crawlers to get a sense of the data—its structure and characteristics. We also set up a development endpoint and attach an Apache Zeppelin notebook, so we can interactively explore the data and author the script.
Crawl the data
The dataset used in this example was downloaded from the GitHub archive website into our sample dataset bucket in Amazon S3, and copied to the following locations:
Choose the best folder by replacing <region> with the region that you’re working in, for example, us-east-1. Crawl this folder, and put the results into a database named githubarchive in the AWS Glue Data Catalog, as described in the AWS Glue Developer Guide. This folder contains 12 hours of the timeline from January 22, 2017, and is organized hierarchically (that is, partitioned) by year, month, and day.
When finished, use the AWS Glue console to navigate to the table named data in the githubarchive database. Notice that this data has eight top-level columns, which are common to each event type, and three partition columns that correspond to year, month, and day.
Choose the payload column, and you will notice that it has a complex schema—one that reflects the union of the payloads of event types that appear in the crawled data. Also note that the schema that crawlers generate is a subset of the true schema because they sample only a subset of the data.
Set up the library, development endpoint, and notebook
Next, you need to download and set up the libraries that estimate the sentiment in a snippet of text. The Stanford CoreNLP libraries contain a number of human language processing tools, including sentiment prediction.
Download the Stanford CoreNLP libraries. Unzip the .zip file, and you’ll see a directory full of jar files. For this example, the following jars are required:
stanford-corenlp-3.8.0.jar
stanford-corenlp-3.8.0-models.jar
ejml-0.23.jar
Upload these files to an Amazon S3 path that is accessible to AWS Glue so that it can load these libraries when needed. For this example, they are in s3://glue-sample-other/corenlp/.
Development endpoints are static Spark-based environments that can serve as the backend for data exploration. You can attach notebooks to these endpoints to interactively send commands and explore and analyze your data. These endpoints have the same configuration as that of AWS Glue’s job execution system. So, commands and scripts that work there also work the same when registered and run as jobs in AWS Glue.
To set up an endpoint and a Zeppelin notebook to work with that endpoint, follow the instructions in the AWS Glue Developer Guide. When you are creating an endpoint, be sure to specify the locations of the previously mentioned jars in the Dependent jars path as a comma-separated list. Otherwise, the libraries will not be loaded.
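The value for the Dependent jars path is easy to mistype, so it can help to generate it. The following is a small Python sketch, used here only as a convenient way to build the string; it assumes the three jars sit directly under the example prefix used in this post.

```python
# S3 prefix and jar names are the ones listed earlier in this post.
JAR_PREFIX = "s3://glue-sample-other/corenlp/"
JARS = [
    "stanford-corenlp-3.8.0.jar",
    "stanford-corenlp-3.8.0-models.jar",
    "ejml-0.23.jar",
]

# Comma-separated, with no spaces between entries.
dependent_jars_path = ",".join(JAR_PREFIX + jar for jar in JARS)
print(dependent_jars_path)
```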
After you set up the notebook server, go to the Zeppelin notebook by choosing Dev Endpoints in the left navigation pane on the AWS Glue console. Choose the endpoint that you created. Next, choose the Notebook Server URL, which takes you to the Zeppelin server. Log in using the notebook user name and password that you specified when creating the notebook. Finally, create a new note to try out this example.
Each notebook is a collection of paragraphs, and each paragraph contains a sequence of commands and the output of those commands. Moreover, each notebook includes a number of interpreters. If you set up the Zeppelin server using the console, the (Python-based) pyspark and (Scala-based) spark interpreters are already connected to your new development endpoint, with pyspark as the default. Therefore, throughout this example, you need to prepend %spark at the top of your paragraphs. In this example, we omit these for brevity.
Working with the data
In this section, we use AWS Glue extensions to Spark to work with the dataset. We look at the actual schema of the data and filter out the interesting event types for our analysis.
Start with some boilerplate code to import libraries that you need:
Then, create the Spark and AWS Glue contexts needed for working with the data:
@transient val spark: SparkContext = SparkContext.getOrCreate()
val glueContext: GlueContext = new GlueContext(spark)
You need the @transient annotation on the SparkContext when working in Zeppelin; otherwise, you will run into a serialization error when executing commands.
Dynamic frames
This section shows how to create a dynamic frame that contains the GitHub records in the table that you crawled earlier. A dynamic frame is the basic data structure in AWS Glue scripts. It is like an Apache Spark data frame, except that it is designed and optimized for data cleaning and transformation workloads. A dynamic frame is well-suited for representing semi-structured datasets like the GitHub timeline.
A dynamic frame is a collection of dynamic records. In Spark lingo, it is an RDD (resilient distributed dataset) of DynamicRecords. A dynamic record is a self-describing record. Each record encodes its columns and types, so every record can have a schema that differs from all others in the dynamic frame. This is convenient and often more efficient for datasets like the GitHub timeline, where payloads can vary drastically from one event type to another.
The following creates a dynamic frame, github_events, from your table:
The getCatalogSource() method returns a DataSource, which represents a particular table in the Data Catalog. The getDynamicFrame() method returns a dynamic frame from the source.
Recall that the crawler created a schema from only a sample of the data. You can scan the entire dataset, count the rows, and print the complete schema as follows:
github_events.count
github_events.printSchema()
The result looks like the following:
The data has 414,826 records. As before, notice that there are eight top-level columns, and three partition columns. If you scroll down, you’ll also notice that the payload is the most complex column.
Run functions and filter records
This section describes how you can create your own functions and invoke them seamlessly to filter records. Unlike filtering with Python lambdas, Scala scripts do not need to convert records from one language representation to another, thereby reducing overhead and running much faster.
Let’s create a function that picks only the IssuesEvents from the GitHub timeline. These events are generated whenever someone posts an issue for a particular repository. Each GitHub event record has a field, “type”, that indicates the kind of event it is. The issueFilter() function returns true for records that are IssuesEvents.
Note that the getField() method returns an Option[Any] type, so you first need to check that it exists before checking the type.
You pass this function to the filter transformation, which applies the function on each record and returns a dynamic frame of those records that pass.
val issue_events = github_events.filter(issueFilter)
Now, let’s look at the size and schema of issue_events.
issue_events.count
issue_events.printSchema()
It’s much smaller (14,063 records), and the payload schema is less complex, reflecting only the schema for issues. Keep a few essential columns for your analysis, and drop the rest using the applyMapping() transform:
The applyMapping() transform is quite handy for renaming columns, casting types, and restructuring records. The preceding code snippet tells the transform to select the fields (or columns) that are enumerated in the left half of the tuples and map them to the fields and types in the right half.
Estimating sentiment using Stanford CoreNLP
To focus on the most pressing issues, you might want to isolate the records with the most negative sentiments. The Stanford CoreNLP libraries are Java-based and offer sentiment-prediction functions. Accessing these functions through Python is possible, but quite cumbersome. It requires creating Python surrogate classes and objects for those found on the Java side. Instead, with Scala support, you can use those classes and objects directly and invoke their methods. Let’s see how.
First, import the libraries needed for the analysis:
The Stanford CoreNLP libraries have a main driver that orchestrates all of their analysis. The driver setup is heavyweight, setting up threads and data structures that are shared across analyses. Apache Spark runs on a cluster with a main driver process and a collection of backend executor processes that do most of the heavy sifting of the data.
The Stanford CoreNLP shared objects are not serializable, so they cannot be distributed easily across a cluster. Instead, you need to initialize them once for every backend executor process that might need them. Here is how to accomplish that:
val props = new Properties()
props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
props.setProperty("parse.maxlen", "70")
object myNLP {
  lazy val coreNLP = new StanfordCoreNLP(props)
}
The properties tell the libraries which annotators to execute and how many words to process. The preceding code creates an object, myNLP, with a field coreNLP that is lazily evaluated. This field is initialized only when it is needed, and only once. So, when the backend executors start processing the records, each executor initializes the driver for the Stanford CoreNLP libraries only one time.
Next is a function that estimates the sentiment of a text string. It first calls Stanford CoreNLP to annotate the text. Then, it pulls out the sentences and takes the average sentiment across all the sentences. The sentiment is a double, from 0.0 as the most negative to 4.0 as the most positive.
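The averaging step is simple arithmetic, illustrated below in Python as a language-neutral sketch. It does not call Stanford CoreNLP: the per-sentence classes are hypothetical inputs, whereas the Scala function estimatedSentiment in the complete script later in this post obtains them from the annotated sentences. The helper name average_sentiment is an assumption, and the 1.5 cutoff is the threshold this post uses when selecting the most negative titles.

```python
import math
from typing import Sequence


def average_sentiment(sentence_classes: Sequence[int]) -> float:
    """Average per-sentence sentiment classes (0 = most negative, 4 = most positive)."""
    if not sentence_classes:
        # No sentences to score; mirror the NaN returned for empty text.
        return math.nan
    return sum(sentence_classes) / len(sentence_classes)


NEGATIVE_THRESHOLD = 1.5

# Hypothetical predictions for a three-sentence title.
title_classes = [1, 0, 2]
score = average_sentiment(title_classes)
print(score, score < NEGATIVE_THRESHOLD)
```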
Now, let’s estimate the sentiment of the issue titles and add that computed field as part of the records. You can accomplish this with the map() method on dynamic frames:
val issue_sentiments = issue_titles.map((rec: DynamicRecord) => {
  val mbody = rec.getField("title")
  mbody match {
    case Some(mval: String) => {
      rec.addField("sentiment", ScalarNode(estimatedSentiment(mval)))
      rec }
    case _ => rec
  }
})
The map() method applies the user-provided function on every record. The function takes a DynamicRecord as an argument and returns a DynamicRecord. The code above computes the sentiment, adds it in a top-level field, sentiment, to the record, and returns the record.
Count the records with sentiment and show the schema. This takes a few minutes because Spark must initialize the library and run the sentiment analysis, which can be involved.
Notice that all records were processed (14,063), and the sentiment value was added to the schema.
Finally, let’s pick out the titles that have the lowest sentiment (less than 1.5). Count them and print out a sample to see what some of the titles look like.
val pressing_issues = issue_sentiments.filter(_.getField("sentiment").exists(_.asInstanceOf[Double] < 1.5))
pressing_issues.count
pressing_issues.show(10)
Next, write them all to a file so that you can handle them later. (You’ll need to replace the output path with your own.)
Take a look in the output path, and you can see the output files.
Putting it all together
Now, let’s create a job from the preceding interactive session. The following script combines all the commands from earlier. It processes the GitHub archive files and writes out the highly negative issues:
import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.GlueArgParser
import com.amazonaws.services.glue.util.Job
import com.amazonaws.services.glue.util.JsonOptions
import com.amazonaws.services.glue.types._
import org.apache.spark.SparkContext
import java.util.Properties
import edu.stanford.nlp.ling.CoreAnnotations
import edu.stanford.nlp.neural.rnn.RNNCoreAnnotations
import edu.stanford.nlp.pipeline.{Annotation, StanfordCoreNLP}
import edu.stanford.nlp.sentiment.SentimentCoreAnnotations
import scala.collection.convert.wrapAll._
object GlueApp {
  // Holds the Stanford CoreNLP driver. The lazy val is initialized on first use,
  // so each backend executor builds the heavyweight driver only once.
  object myNLP {
    val props = new Properties()
    props.setProperty("annotators", "tokenize, ssplit, parse, sentiment")
    props.setProperty("parse.maxlen", "70")
    lazy val coreNLP = new StanfordCoreNLP(props)
  }
  // Average sentiment across all sentences: 0.0 is most negative, 4.0 is most positive.
  def estimatedSentiment(text: String): Double = {
    if ((text == null) || text.isEmpty) { return Double.NaN }
    val annotations = myNLP.coreNLP.process(text)
    val sentences = annotations.get(classOf[CoreAnnotations.SentencesAnnotation])
    sentences.foldLeft(0.0)( (csum, x) => {
      csum + RNNCoreAnnotations.getPredictedClass(x.get(classOf[SentimentCoreAnnotations.SentimentAnnotatedTree]))
    }) / sentences.length
  }
  def main(sysArgs: Array[String]): Unit = {
    val spark: SparkContext = SparkContext.getOrCreate()
    val glueContext: GlueContext = new GlueContext(spark)
    val dbname = "githubarchive"
    val tblname = "data"
    val outpath = "s3://<bucket>/out/path/"
    // Read the crawled table from the Data Catalog as a dynamic frame.
    val github_events = glueContext
      .getCatalogSource(database = dbname, tableName = tblname)
      .getDynamicFrame()
    // Keep only the IssuesEvent records.
    val issue_events = github_events.filter((rec: DynamicRecord) => {
      rec.getField("type").exists(_ == "IssuesEvent")
    })
    // Keep a few essential columns and flatten the nested fields.
    val issue_titles = issue_events.applyMapping(Seq(("id", "string", "id", "string"),
      ("actor.login", "string", "actor", "string"),
      ("repo.name", "string", "repo", "string"),
      ("payload.action", "string", "action", "string"),
      ("payload.issue.title", "string", "title", "string")))
    // Add the estimated sentiment of each title as a top-level field.
    val issue_sentiments = issue_titles.map((rec: DynamicRecord) => {
      val mbody = rec.getField("title")
      mbody match {
        case Some(mval: String) => {
          rec.addField("sentiment", ScalarNode(estimatedSentiment(mval)))
          rec }
        case _ => rec
      }
    })
    // Select the most negative titles and write them out as JSON.
    val pressing_issues = issue_sentiments.filter(_.getField("sentiment").exists(_.asInstanceOf[Double] < 1.5))
    glueContext.getSinkWithFormat(connectionType = "s3",
      options = JsonOptions(s"""{"path": "$outpath"}"""),
      format = "json")
      .writeDynamicFrame(pressing_issues)
  }
}
Notice that the script is enclosed in a top-level object called GlueApp, which serves as the script’s entry point for the job. (You’ll need to replace the output path with your own.) Upload the script to an Amazon S3 location so that AWS Glue can load it when needed.
To create the job, open the AWS Glue console. Choose Jobs in the left navigation pane, and then choose Add job. Create a name for the job, and specify a role with permissions to access the data. Choose An existing script that you provide, and choose Scala as the language.
For the Scala class name, type GlueApp to indicate the script’s entry point. Specify the Amazon S3 location of the script.
Choose Script libraries and job parameters. In the Dependent jars path field, enter the Amazon S3 locations of the Stanford CoreNLP libraries from earlier as a comma-separated list (without spaces). Then choose Next.
No connections are needed for this job, so choose Next again. Review the job properties, and choose Finish. Finally, choose Run job to execute the job.
To run this job on any other GitHub timeline dataset that you have, simply edit the script’s input table and output path.
Conclusion
In this post, we showed how to write AWS Glue ETL scripts in Scala via notebooks and how to run them as jobs. Scala has the advantage that it is the native language for the Spark runtime. With Scala, it is easier to call Scala or Java functions and third-party libraries for analyses. Moreover, data processing is faster in Scala because there’s no need to convert records from one language runtime to another.
You can find more examples of Scala scripts in our GitHub examples repository: https://github.com/awslabs/aws-glue-samples. We encourage you to experiment with Scala scripts and let us know about any interesting ETL flows that you want to share.
Mehul Shah is a senior software manager for AWS Glue. His passion is leveraging the cloud to build smarter, more efficient, and easier to use data systems. He has three girls, and, therefore, he has no spare time.
Ben Sowell is a software development engineer at AWS Glue.
Vinay Vivili is a software development engineer for AWS Glue.
As companies mature in their cloud journey, they implement layered security capabilities and practices in their cloud architectures. One such practice is to continually assess golden Amazon Machine Images (AMIs) for security vulnerabilities. AMIs provide the information required to launch an Amazon EC2 instance, which is a virtual server in the AWS Cloud. A golden AMI is an AMI that contains the latest security patches, software, configuration, and software agents that you need to install for logging, security maintenance, and performance monitoring. You can build and deploy golden AMIs in your environment, but the AMIs quickly become dated as new vulnerabilities are discovered.
A security best practice is to perform routine vulnerability assessments of your golden AMIs to identify if newly found vulnerabilities apply to them. If you identify a vulnerability, you can update your golden AMIs with the appropriate security patches, test the AMIs, and deploy the patched AMIs in your environment. In this blog post, I demonstrate how to use Amazon Inspector to set up such continuous vulnerability assessments to scan your golden AMIs routinely.
Solution overview
Amazon Inspector performs security assessments of Amazon EC2 instances by using AWS managed rules packages such as the Common Vulnerabilities and Exposures (CVEs) package. The solution in this post creates EC2 instances from golden AMIs and then runs an Amazon Inspector security assessment on the created instances. When the assessment results are available, the solution consolidates the findings and advises you about next steps. Furthermore, the solution schedules an Amazon CloudWatch Events rule to run the golden AMI vulnerability assessments on a regular basis.
The following solution diagram illustrates how this solution works.
Here’s how this solution works, as illustrated in the preceding diagram:
A scheduled CloudWatch Events event triggers the StartContinuousAssessment AWS Lambda function, which starts the security assessment of your golden AMIs.
The StartContinuousAssessment Lambda function performs the following actions:
It reads a JSON parameter stored in the AWS Systems Manager (Systems Manager) Parameter Store. This JSON parameter contains the following metadata for each golden AMI:
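Ami-Id – The AMI ID of the golden AMI.
InstanceType – A valid instance type for launching an EC2 instance of the golden AMI.
UserData – The user-data script that installs the Amazon Inspector agent when the instance starts.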
Later in this blog post, I provide instructions for creating this JSON parameter.
For each AMI specified in the JSON parameter, the Lambda function creates an EC2 instance. When each instance starts, it installs the Amazon Inspector agent by using the user-data script provided in the JSON. The Lambda function then copies each golden AMI’s tags (custom metadata that you assign to each golden AMI when you set up the solution) to the corresponding EC2 instance. These tags are later also associated with the security findings for that instance, so that you can identify the golden AMI behind each finding. The function also adds a tag with the key continuous-assessment-instance and the value true. This tag identifies EC2 instances that require regular security assessments. After you analyze the security findings, you can patch your golden AMIs.
The first time the StartContinuousAssessment function runs, it creates:
An Amazon Inspector assessment template: The template contains a reference to the Amazon Inspector assessment target created in the preceding step and the following AWS managed rules packages to evaluate:
For subsequent assessments, the StartContinuousAssessment function reuses the target and the template created during its first run.
Note: Amazon Inspector can start an assessment only after it finds at least one running Amazon Inspector agent. To allow EC2 instances to boot and the Amazon Inspector agent to start, the Lambda function waits four minutes. Because the assessment runs for approximately one hour and EC2 instances typically boot within a few minutes, all Amazon Inspector agents start before the assessment ends.
The Lambda function then runs the assessment. The Amazon Inspector agents collect behavior and configuration data, and pass it to Amazon Inspector. Amazon Inspector analyzes the data and generates Amazon Inspector findings, which are potential security issues you may need to address.
After the Lambda function completes the assessment, Amazon Inspector publishes an assessment-completion notification message to an Amazon SNS topic called ContinuousAssessmentCompleteTopic. SNS uses topics, which are communication channels for sending messages and subscribing to notifications.
The notification message published to SNS triggers the AnalyzeInspectorFindings Lambda function, which performs the following actions:
Associates the tags of each EC2 instance with security findings found for that EC2 instance. This enables you to identify the security findings using the app-name tag you specified for your golden AMIs. You can use the information provided in the findings to patch your golden AMIs.
Terminates all instances associated with the continuous-assessment-instance=true tag.
Aggregates the number of findings found for each EC2 instance by severity and then publishes a consolidated result to an SNS topic called ContinuousAssessmentResultsTopic.
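The aggregation described in the last step can be pictured with a short Python sketch. This is not the solution's actual Lambda code: the finding records are simplified, hypothetical dictionaries holding only an instance ID and a severity, and the function name is made up for illustration.

```python
from collections import Counter, defaultdict


def aggregate_by_severity(findings):
    """Count findings per instance and severity.

    `findings` is an iterable of dicts with the hypothetical keys
    "instance_id" and "severity"; real Amazon Inspector findings
    carry many more fields.
    """
    counts = defaultdict(Counter)
    for finding in findings:
        counts[finding["instance_id"]][finding["severity"]] += 1
    # Convert to plain dicts so the result is easy to serialize as JSON.
    return {instance: dict(by_sev) for instance, by_sev in counts.items()}


# Made-up sample data for two assessment instances.
sample_findings = [
    {"instance_id": "i-aaa", "severity": "High"},
    {"instance_id": "i-aaa", "severity": "High"},
    {"instance_id": "i-aaa", "severity": "Low"},
    {"instance_id": "i-bbb", "severity": "Medium"},
]
summary = aggregate_by_severity(sample_findings)
print(summary)
```

A consolidated result of this shape is small enough to publish as a single notification message.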
How to deploy the solution
To deploy this solution, you must set it up in the AWS Region where you build your golden AMIs. If that AWS Region does not support Amazon Inspector, at the end of your continuous integration pipeline, you can copy your AMIs to an AWS Region where Amazon Inspector assessments are supported. To learn more about continuous integration pipelines, see What is Continuous Integration?
To deploy continuous golden AMI vulnerability assessments in your AWS account, follow these steps:
Tag your golden AMIs – Tagging your golden AMIs lets you search assessment result findings based on tags after Amazon Inspector completes an assessment.
Store your golden AMI metadata in the Systems Manager Parameter Store – Prepare and store the golden AMI metadata in the Systems Manager Parameter Store. The StartContinuousAssessment Lambda function reads golden AMI metadata and starts assessing for vulnerabilities.
Run the supplied AWS CloudFormation template and subscribe to an SNS topic to receive assessment results – Set up the infrastructure required to run vulnerability assessments and subscribe to an SNS topic to receive assessment results via email.
Test golden AMI vulnerability assessments – Ensure you have successfully set up the required resources to run vulnerability assessments.
Set up a CloudWatch Events rule for triggering continuous golden AMI vulnerability assessments – Schedule the execution of vulnerability assessments on a regular basis.
1. Tag your golden AMIs
You can search assessment findings based on golden AMI tags after Amazon Inspector completes an assessment.
To tag a golden AMI by using the AWS Management Console:
Choose your AMI from the list, and then choose Actions > Add/Edit Tags.
Choose Create Tag. In the Key column, type app-name. In the Value column, type your application name. Following the same steps, create the app-version and app-environment tags. Choose Save.
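To make the tagging scheme concrete, here is a minimal Python sketch of how the three AMI tags could be carried over to an assessment instance together with the continuous-assessment-instance marker tag. The tag values and the function name are hypothetical; only the tag keys come from this post. The list of Key/Value pairs is the shape the EC2 API uses for tags.

```python
def build_instance_tags(ami_tags):
    """Return EC2-style tags for an assessment instance.

    Copies the golden AMI's tags and appends the marker tag
    continuous-assessment-instance = true described in this post.
    """
    tags = [{"Key": key, "Value": value} for key, value in sorted(ami_tags.items())]
    tags.append({"Key": "continuous-assessment-instance", "Value": "true"})
    return tags


# Hypothetical tag values for one golden AMI.
golden_ami_tags = {
    "app-name": "billing",
    "app-version": "1.4",
    "app-environment": "prod",
}
instance_tags = build_instance_tags(golden_ami_tags)
for tag in instance_tags:
    print(tag["Key"], "=", tag["Value"])
```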
Now that you have tagged your golden AMIs, you need to create golden AMI metadata, which will be read by the StartContinuousAssessment function to initiate vulnerability assessments. You will store the golden AMI metadata in the Systems Manager Parameter Store.
2. Store your golden AMI metadata in the Systems Manager Parameter Store
This solution reads golden AMI metadata from a parameter stored in the Systems Manager Parameter Store. The metadata must be in JSON format and must contain the following information for each golden AMI:
Ami-Id
InstanceType
UserData
Step A: Find the AMI ID of your golden AMI.
An AMI ID uniquely identifies an AMI in an AWS Region and is a required parameter for launching an EC2 instance from a golden AMI. To find the AMI ID of your golden AMI:
Choose your AMI from the list and then note the corresponding value in the AMI ID column.
Step B: Find a compatible InstanceType for your golden AMI.
Each AMI has a list of compatible InstanceTypes. The InstanceType is a required parameter for launching an EC2 instance from a golden AMI. To find a compatible InstanceType for your golden AMI:
Choose Launch Instance. On the Choose an Amazon Machine Image (AMI) page, choose My AMIs.
Type the AMI ID that you noted in Step A in the Search my AMIs box, and then press Enter.
The search result will contain your golden AMI. To choose it, choose Select.
Locate any available Instance Type and then note the corresponding value in the Type column.
Choose Cancel.
Note: The StartContinuousAssessment function will launch an instance of the chosen InstanceType every time the vulnerability assessment runs.
Step C: Create the user-data script to install and start the Amazon Inspector agent.
The user-data script automates the installation of software packages when an EC2 instance launches for the first time. In this step, you create an operating system specific, JSON-compatible user-data script that installs and starts the Amazon Inspector agent.
Identify the command that installs the Amazon Inspector agent
Based on Installing Amazon Inspector Agents, the following shell command installs the Amazon Inspector agent on an Amazon Linux-based EC2 instance.
Based on Running Commands on Your Linux Instance at Launch, you make a Linux shell script user-data compatible by prefixing it with #!/bin/bash. In this step, you add the #!/bin/bash prefix to the script from the preceding step. The following is the user-data compatible version of the script from the preceding step.
The user-data script provided in the JSON metadata must be JSON-compatible, which you will do next.
Make the user-data script JSON compatible
To make the user-data script JSON compatible, you must replace all new-line characters with a \r\n\r\n sequence. The following is the JSON-compatible user-data script that you specify for your Amazon Linux-based golden AMI in Step D.
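The newline replacement is mechanical and could be scripted. A minimal Python sketch, under the assumption that the script uses plain \n line endings; the script body here is a placeholder and not the real install command, and the function name is hypothetical:

```python
import json


def to_json_compatible(script):
    """Replace each newline with the \\r\\n\\r\\n sequence used in this post."""
    return script.replace("\n", "\r\n\r\n")


# Placeholder script: the real install command comes from the
# Amazon Inspector documentation and is not reproduced here.
user_data = "#!/bin/bash\necho 'install the Amazon Inspector agent here'"

converted = to_json_compatible(user_data)
# json.dumps escapes the control characters, producing the text to paste
# into the JSON document (including the surrounding double quotes).
json_text = json.dumps(converted)
print(json_text)
```

Parsing the printed text with a JSON parser gives back the converted script, which is a quick way to check that the value is valid JSON before storing it.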
Repeat Steps A, B, and C to find the Ami-Id, InstanceType, and UserData for each of your golden AMIs. When you have this metadata, you can create the JSON document of metadata for all your golden AMIs. The StartContinuousAssessment Lambda function reads this JSON to start golden AMI vulnerability assessments.
Step D: Create a JSON document of metadata of all your golden AMIs.
Use the following template to create a JSON document:
Replace all placeholder values with values corresponding to your first golden AMI. If your golden AMI is Amazon Linux-based, you can specify the userData as the JSON-compatible-user-data-for-Amazon-Linux-AMI from Step C.5. Next, replace the placeholder values for your second golden AMI. You can add more entries to your JSON document, if you have more than two golden AMIs.
Note: The JSON document must not exceed 4,096 characters, and the number of golden AMIs must be fewer than 500. You must also verify that your account’s service limits allow you to run one On-Demand EC2 instance for each of your golden AMIs. For information about how to verify service limits, see Amazon EC2 Service Limits.
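The limits in the preceding note are easy to check before you store the document. The following Python sketch serializes a list of golden AMI entries and enforces both limits. The field names follow the list at the start of this section, but the top-level layout (a plain JSON array) is an assumption made for illustration, so follow the template for the real structure. The AMI ID and the function name are hypothetical.

```python
import json

MAX_CHARS = 4096  # character limit stated in the note above
MAX_AMIS = 499    # "fewer than 500" golden AMIs
REQUIRED_FIELDS = {"Ami-Id", "InstanceType", "UserData"}


def build_metadata(entries):
    """Serialize golden AMI entries and enforce the documented limits.

    The layout (a JSON array of objects) is an assumption, not the
    template from the original post.
    """
    if len(entries) > MAX_AMIS:
        raise ValueError("too many golden AMIs")
    for entry in entries:
        missing = REQUIRED_FIELDS - set(entry)
        if missing:
            raise ValueError(f"missing fields: {sorted(missing)}")
    document = json.dumps(entries)
    if len(document) > MAX_CHARS:
        raise ValueError("JSON document exceeds 4,096 characters")
    return document


# Hypothetical entry; replace every value with your own.
document = build_metadata([
    {
        "Ami-Id": "ami-0123456789abcdef0",
        "InstanceType": "t2.micro",
        "UserData": "#!/bin/bash\r\n\r\necho placeholder",
    }
])
print(len(document), "characters")
```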
Now that you have created the JSON document of your golden AMIs, you will store the JSON document in a Systems Manager parameter. The StartContinuousAssessment Lambda function will read the metadata from this parameter.
Step E: Store the JSON in a Systems Manager parameter.
Expand Systems Manager Shared Resources in the navigation pane, and then choose Parameter Store.
Choose Create Parameter.
For Name, type ContinuousAssessmentInput.
In the Description field, type Continuous golden AMI vulnerability assessment process metadata.
For Type, choose String.
Paste the JSON that you created in Step D in the Value field.
Choose Create Parameter. After the system creates the parameter, choose Close.
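If you prefer to script this step rather than use the console, the same parameter could be created through the Systems Manager API. The sketch below assumes that boto3 is installed and that AWS credentials and a Region are configured; the helper names are hypothetical. The name, description, and type mirror the console values above.

```python
def parameter_request(json_document, name="ContinuousAssessmentInput"):
    """Build the keyword arguments for the SSM PutParameter call."""
    return {
        "Name": name,
        "Description": "Continuous golden AMI vulnerability assessment process metadata",
        "Type": "String",
        "Value": json_document,
    }


def store_parameter(json_document):
    """Create the parameter; requires boto3 and AWS credentials (assumed)."""
    import boto3  # imported lazily so parameter_request works without it

    ssm = boto3.client("ssm")
    return ssm.put_parameter(**parameter_request(json_document))


# Inspect the request without calling AWS; the value is a placeholder.
request = parameter_request('[{"Ami-Id": "ami-0123456789abcdef0"}]')
print(request["Name"], request["Type"])
```

Calling store_parameter(document) with the JSON from Step D would then perform the actual write.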
To set up the remaining components required to run assessments, you will run a CloudFormation template and perform the configuration explained in the next section.
3. Run the CloudFormation template and subscribe to an SNS topic to receive assessment results
Next, create a CloudFormation stack using the provided CloudFormation template. Before you start, download the CloudFormation template to your computer.
On the Stacks page, choose AmazonInspectorAssessment.
In the Detail pane, choose Outputs to view the output of your stack.
After CloudFormation successfully creates a stack, the Outputs tab displays the following results:
StartContinuousAssessmentLambdaFunction – The Value box displays the name of the StartContinuousAssessment function. You will run this function to trigger the entire workflow.
ContinuousAssessmentResultsTopic – The Value box displays the ContinuousAssessmentResultsTopic topic’s Amazon Resource Name (ARN), which you will use later.
To receive consolidated vulnerability assessment results in email, you must subscribe to ContinuousAssessmentResultsTopic.
Choose Create subscription. In the Topic ARN field, paste the ARN of ContinuousAssessmentResultsTopic that you noted in the previous section.
In the Protocol drop-down, choose Email.
In the Endpoint box, type the email address where you will receive notifications.
Choose Create subscription.
Navigate to your email application and open the message from AWS Notifications. Click the link to confirm your subscription to the SNS topic.
4. Test golden AMI vulnerability assessments
Before you schedule vulnerability assessments, you should test the process by running the StartContinuousAssessment function. In this test, you trigger a security assessment and monitor it. You then receive an email after the assessment has completed, which shows that vulnerability assessments have been successfully set up.
On Dashboard under Recent Assessment Runs, you will see an entry with the status, Collecting Data. This status indicates that Amazon Inspector agents are collecting data from instances running your golden AMIs. The agents collect data for an hour and then Amazon Inspector analyzes the collected data.
After Amazon Inspector completes the assessment, the status in the console changes to Analysis complete. Amazon Inspector then publishes an SNS message that triggers the AnalyzeInspectionReports Lambda function. When AnalyzeInspectionReports publishes results, you will receive an email containing consolidated assessment results. You also will be able to see the findings.
To see the findings in Amazon Inspector’s Findings section:
In the navigation pane, choose Assessment Runs. In the table on the Amazon Inspector – Assessment Runs page, choose the findings of the latest assessment run.
Choose the settings icon and choose the appropriate tags to see the details of findings, as shown in the following screenshot. The findings also contain information about how you can address each underlying vulnerability.
Having verified that you have successfully set up all components of golden AMI vulnerability assessments, you now will schedule the vulnerability assessments to run on a regular basis to give you continual insight into the health of instances created from your golden AMIs.
5. Set up a CloudWatch Events rule for triggering continuous golden AMI vulnerability assessments
The last step is to create a CloudWatch Events rule to schedule the execution of the vulnerability assessments on a daily or weekly basis.
In the navigation pane, choose Rules > Create rule.
On the Event Source page, choose Schedule. Choose Fixed rate of and specify the interval (for example, 1 day).
For Targets, choose Add target and then choose Lambda function.
For Function, choose the StartContinuousAssessment function.
Choose Configure Input.
Choose Constant (JSON text).
In the box, paste the following JSON code.
{
    "AMIsParamName": "ContinuousAssessmentInput"
}
Choose Configure details.
For Rule definition, type ContinuousGoldenAMIAssessmentTrigger for the name, and for the description, type This rule triggers the continuous golden AMI vulnerability assessment process.
Choose Create rule.
The vulnerability assessments are executed on the first occurrence of the schedule you chose while setting up the CloudWatch Events rule. After the vulnerability assessment is executed, you will receive an email to indicate that your continuous golden AMI vulnerability assessments are set up.
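The two values that matter in this rule are the schedule and the constant JSON input. As a rough illustration, the Python sketch below builds both: a CloudWatch Events rate expression, where the unit is singular for a value of 1 and plural otherwise, and the input text that names the Parameter Store parameter. The helper names are hypothetical.

```python
import json


def rate_expression(value, unit):
    """Return a CloudWatch Events rate expression such as rate(1 day).

    `unit` is given in singular form: "minute", "hour", or "day".
    """
    if unit not in ("minute", "hour", "day"):
        raise ValueError("unit must be minute, hour, or day")
    if value < 1:
        raise ValueError("value must be a positive integer")
    # The unit is singular for a value of 1 and plural otherwise.
    suffix = "" if value == 1 else "s"
    return f"rate({value} {unit}{suffix})"


def rule_input(param_name="ContinuousAssessmentInput"):
    """Return the constant JSON text passed to the Lambda target."""
    return json.dumps({"AMIsParamName": param_name})


daily = rate_expression(1, "day")
weekly = rate_expression(7, "day")
constant_input = rule_input()
print(daily, weekly, constant_input)
```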
Summary
To get visibility into the security of your EC2 instances created from your golden AMIs, it is important that you perform security assessments of your golden AMIs on a regular basis. In this blog post, I have demonstrated how to set up vulnerability assessments, and the results of these continuous golden AMI vulnerability assessments can help you keep your environment up to date with security patches. To learn how to patch your golden AMIs, see Streamline AMI Maintenance and Patching Using Amazon EC2 Systems Manager.
If you have comments about this blog post, submit them in the “Comments” section below. If you have questions about implementing the solution in this post, start a new thread on the Amazon Inspector forum or contact AWS Support.
– Kanchan and David