Today, we’re excited to announce Workers KV is entering general availability and is ready for production use!
What is Workers KV?
Workers KV is a highly distributed, eventually consistent, key-value store that spans Cloudflare’s global edge. It allows you to store billions of key-value pairs and read them with ultra-low latency anywhere in the world. Now you can build entire applications with the performance of a CDN static cache.
Why did we build it?
Workers is a platform that lets you run JavaScript on Cloudflare’s global edge of 175+ data centers. With only a few lines of code, you can route HTTP requests, modify responses, or even create new responses without an origin server.
// A Worker that handles a single redirect,
// such a humble beginning...
addEventListener("fetch", event => {
event.respondWith(handleOneRedirect(event.request))
})
async function handleOneRedirect(request) {
let url = new URL(request.url)
let device = request.headers.get("CF-Device-Type")
// If the device is mobile, add a prefix to the hostname.
// (eg. example.com becomes mobile.example.com)
if (device === "mobile") {
url.hostname = "mobile." + url.hostname
return Response.redirect(url, 302)
}
// Otherwise, send request to the original hostname.
return await fetch(request)
}
Customers quickly came to us with use cases that required a way to store persistent data. Following our example above, it’s easy to handle a single redirect, but what if you want to handle billions of them? You would have to hard-code them into your Workers script, fit it all in under 1 MB, and re-deploy it every time you wanted to make a change — yikes! That’s why we built Workers KV.
// A Worker that can handle billions of redirects,
// now that's more like it!
addEventListener("fetch", event => {
event.respondWith(handleBillionsOfRedirects(event.request))
})
async function handleBillionsOfRedirects(request) {
let prefix = "/redirect"
let url = new URL(request.url)
// Check if the URL is a special redirect.
// (eg. example.com/redirect/<random-hash>)
if (url.pathname.startsWith(prefix)) {
// REDIRECTS is a custom variable that you define,
// it binds to a Workers KV "namespace." (aka. a storage bucket)
let redirect = await REDIRECTS.get(url.pathname.replace(prefix, ""))
if (redirect) {
url.pathname = redirect
return Response.redirect(url, 302)
}
}
// Otherwise, send request to the original path.
return await fetch(request)
}
With only a few changes from our previous example, we scaled from one redirect to billions − that’s just a taste of what you can build with Workers KV.
How does it work?
Distributed data stores are often modeled using the CAP Theorem, which states that a distributed system can guarantee only 2 of the following 3 properties:
Consistency – is my data the same everywhere?
Availability – is my data accessible all the time?
Partition tolerance – is my data stored in multiple locations?
Diagram of the choices and tradeoffs of the CAP Theorem.
Workers KV chooses to guarantee Availability and Partition tolerance. This combination is known as eventual consistency, which presents Workers KV with two unique competitive advantages:
Reads are ultra fast (median of 12 ms) since it's powered by our caching technology.
Data is available across 175+ edge data centers and resilient to regional outages.
There are tradeoffs to eventual consistency, however. If two clients write different values to the same key at the same time, the last client to write eventually "wins" and its value becomes globally consistent. This also means that if a client writes to a key and then reads that same key, the values may be inconsistent for a short amount of time.
To help visualize this scenario, here’s a real-life example amongst three friends:
Suppose Matthew, Michelle, and Lee are planning their weekly lunch.
Matthew decides they’re going out for sushi.
Matthew tells Michelle their sushi plans, Michelle agrees.
Lee, not knowing the plans, tells Michelle they’re actually having pizza.
An hour later, Michelle and Lee are waiting at the pizza parlor while Matthew is sitting alone at the sushi restaurant — what went wrong? We can chalk this up to eventual consistency: after waiting a few minutes, Matthew checks his updated calendar and eventually finds the new truth, that they're going out for pizza instead.
While it may take minutes in real life, Workers KV is much faster: it can achieve global consistency in less than 60 seconds. Additionally, when a Worker writes to a key and then immediately reads that same key, it can expect the values to be consistent if both operations came from the same location.
When should I use it?
Now that you understand the benefits and tradeoffs of using eventual consistency, how do you determine if it’s the right storage solution for your application? Simply put, if you want global availability with ultra-fast reads, Workers KV is right for you.
However, if your application is frequently writing to the same key, there is an additional consideration. We call it “the Matthew question”: Are you okay with the Matthews of the world occasionally going to the wrong restaurant?
You can imagine use cases (like our redirect Worker example) where this doesn’t make any material difference. But if you decide to keep track of a user’s bank account balance, you would not want the possibility of two balances existing at once, since they could purchase something with money they’ve already spent.
What can I build with it?
Here are a few examples of applications that have been built with KV:
Mass redirects – handle billions of HTTP redirects.
User authentication – validate user requests to your API.
Translation keys – dynamically localize your web pages.
Configuration data – manage who can access your origin.
Step functions – sync state data between multiple API functions.
Edge file store – host large amounts of small files.
We’ve highlighted several of those use cases in our previous blog post. We also have some more in-depth code walkthroughs, including a recently published blog post on how to build an online To-do list with Workers KV.
What’s new since beta?
By far, our most common request was to make it easier to write data to Workers KV. That’s why we’re releasing three new ways to make that experience even better:
1. Bulk Writes
If you want to import your existing data into Workers KV, you don’t want to go through the hassle of sending an HTTP request for every key-value pair. That’s why we added a bulk endpoint to the Cloudflare API. Now you can upload up to 10,000 pairs (up to 100 MB of data) in a single PUT request.
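To get a feel for the endpoint, here is a minimal sketch of a bulk upload from the command line; the account ID, namespace ID, credentials, and keys are placeholders for your own values:
# Upload several key-value pairs in one PUT request to the bulk endpoint.
curl "https://api.cloudflare.com/client/v4/accounts/$CF_ACCOUNT_ID/storage/kv/namespaces/$CF_NAMESPACE_ID/bulk" \
-X PUT \
-H "Content-Type: application/json" \
-H "X-Auth-Key: $CF_AUTH_KEY" \
-H "X-Auth-Email: $CF_EMAIL" \
-d '[{"key": "redirect-abc123", "value": "/new/path"}, {"key": "redirect-def456", "value": "/other/path"}]'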
Let’s walk through an example use case: you want to off-load your website translation to Workers. Since you’re reading translation keys frequently and only occasionally updating them, this application works well with the eventual consistency model of Workers KV.
In this example, we hook into Crowdin, a popular platform to manage translation data. This Worker responds to a /translate endpoint, downloads all your translation keys, and bulk writes them to Workers KV so you can read it later on our edge:
addEventListener("fetch", event => {
if (event.request.url.pathname === "/translate") {
event.respondWith(uploadTranslations())
}
})
async function uploadTranslations() {
// Ask crowdin for all of our translations.
var response = await fetch(
"https://api.crowdin.com/api/project" +
"/:ci_project_id/download/all.zip?key=:ci_secret_key")
// If crowdin is responding, parse the response into
// a single json with all of our translations.
if (response.ok) {
var translations = await zipToJson(response)
return await bulkWrite(translations)
}
// Return the errored response from crowdin.
return response
}
async function bulkWrite(keyValuePairs) {
return fetch(
"https://api.cloudflare.com/client/v4/accounts" +
"/:cf_account_id/storage/kv/namespaces/:cf_namespace_id/bulk",
{
method: "PUT",
headers: {
"Content-Type": "application/json",
"X-Auth-Key": ":cf_auth_key",
"X-Auth-Email": ":cf_email"
},
body: JSON.stringify(keyValuePairs)
}
)
}
async function zipToJson(response) {
// ... omitted for brevity ...
// (eg. https://stuk.github.io/jszip)
return [
{key: "hello.EN", value: "Hello World"},
{key: "hello.ES", value: "Hola Mundo"}
]
}
Now, when you want to translate a page, all you have to do is read from Workers KV:
async function translate(keys, lang) {
// You bind your translations namespace to the TRANSLATIONS variable.
return Promise.all(keys.map(key => TRANSLATIONS.get(key + "." + lang)))
}
2. Expiring Keys
By default, key-value pairs stored in Workers KV last forever. However, sometimes you want your data to auto-delete after a certain amount of time. That’s why we’re introducing the expiration and expirationTtl options for write operations.
// Key expires 60 seconds from now.
NAMESPACE.put("myKey", "myValue", {expirationTtl: 60})
// Key expires at the given UNIX epoch timestamp (in seconds).
NAMESPACE.put("myKey", "myValue", {expiration: 1247788800})
# You can also set keys to expire from the Cloudflare API.
curl "https://api.cloudflare.com/client/v4/accounts/$ACCOUNT_ID/storage/kv/namespaces/$NAMESPACE_ID/values/$KEY?expiration_ttl=$EXPIRATION_IN_SECONDS" \
-X PUT \
-H "X-Auth-Key: $CLOUDFLARE_AUTH_KEY" \
-H "X-Auth-Email: $CLOUDFLARE_AUTH_EMAIL" \
-d "$VALUE"
Let’s say you want to block users that have been flagged as inappropriate from your website, but only for a week. With an expiring key, you can set the expire time and not have to worry about deleting it later.
In this example, we assume users and IP addresses are one and the same. If your application has authentication, you could use access tokens as the key identifier.
addEventListener("fetch", event => {
var url = new URL(event.request.url)
// An internal API that blocks a new user IP.
// (eg. example.com/block/1.2.3.4)
if (url.pathname.startsWith("/block")) {
var ip = url.pathname.split("/").pop()
event.respondWith(blockIp(ip))
} else {
// Other requests check if the IP is blocked.
event.respondWith(handleRequest(event.request))
}
})
async function blockIp(ip) {
// Values are allowed to be empty in KV,
// we don't need to store any extra information anyway.
await BLOCKED.put(ip, "", {expirationTtl: 60*60*24*7})
return new Response("ok")
}
async function handleRequest(request) {
var ip = request.headers.get("CF-Connecting-IP")
if (ip) {
var blocked = await BLOCKED.get(ip)
// If we detect an IP and it's blocked, respond with a 403 error.
if (blocked) {
return new Response("You are blocked!", {status: 403})
}
}
}
// Otherwise, passthrough the original request.
return fetch(request)
}
3. Larger Values
We’ve increased our size limit on values from 64 kB to 2 MB. This is quite useful if you need to store buffer-based or file data in Workers KV.
Consider this scenario: you want to let your users upload their favorite GIF to their profile without having to store these GIFs as binaries in your database or managing another cloud storage bucket.
Workers KV is a great fit for this use case! You can create a Workers KV namespace for your users’ GIFs that is fast and reliable wherever your customers are located.
In this example, users upload a link to their favorite GIF, then a Worker downloads it and stores it to Workers KV.
addEventListener("fetch", event => {
var url = event.request.url
var arg = request.url.split("/").pop()
// User sends a URI encoded link to the GIF they wish to upload.
// (eg. example.com/api/upload_gif/<encoded-uri>)
if (url.pathname.startsWith("/api/upload_gif")) {
event.respondWith(uploadGif(arg))
// Profile contains link to view the GIF.
// (eg. example.com/api/view_gif/<username>)
} else if (url.pathname.startsWith("/api/view_gif")) {
event.respondWith(getGif(arg))
}
})
async function uploadGif(url) {
// Fetch the GIF from the Internet.
var gif = await fetch(decodeURIComponent(url))
var buffer = await gif.arrayBuffer()
// Upload the GIF as a buffer to Workers KV.
// (user.name is assumed to come from your own authentication logic.)
await GIFS.put(user.name, buffer)
return gif
}
async function getGif(username) {
var gif = await GIFS.get(username, "arrayBuffer")
// If the user has set one, respond with the GIF.
if (gif) {
return new Response(gif, {headers: {"Content-Type": "image/gif"}})
} else {
return new Response("User has no GIF!", {status: 404})
}
}
Lastly, we want to thank all of our beta customers. It was your valuable feedback that led us to develop these changes to Workers KV. Make sure to stay in touch with us, we’re always looking ahead for what’s next and we love hearing from you!
Pricing
We’re also ready to announce our GA pricing. If you’re one of our Enterprise customers, your pricing obviously remains unchanged.
$0.50 / GB of data stored, 1 GB included
$0.50 / million reads, 10 million included
$5 / million write, list, and delete operations, 1 million included
During the beta period, we learned customers don’t just want to read values at our edge; they want to write values from our edge too. Since these edge operations are in high demand and more costly, we have started charging for non-read operations per month.
Limits
As mentioned earlier, we increased our value size limit from 64 kB to 2 MB. We’ve also removed our cap on the number of keys per namespace — it’s now unlimited. Here are our GA limits:
Up to 20 namespaces per account, each with unlimited keys
Keys of up to 512 bytes and values of up to 2 MB
Unlimited writes per second for different keys
One write per second for the same key
Unlimited reads per second per key
Try it out now!
Now open to all customers, you can start using Workers KV today from your Cloudflare dashboard under the Workers tab. You can also look at our updated documentation.
We’re really excited to see what you all can build with Workers KV!
We all know that we should not commit any passwords or keys to the repo with our code (whether it’s public or private). Yet thousands of production passwords can be found on GitHub (and probably thousands more in internal company repositories). Some have tried to fix that by removing the passwords (once they learned it’s not a good idea to store them publicly), but the passwords remain in the git history.
Knowing what not to do is the first and very important step. But how do we store production credentials? Database credentials, system secrets (e.g. for HMACs), access keys for third-party services like payment providers or social networks? There doesn’t seem to be an agreed-upon solution.
I’ve previously argued against the 12-factor app recommendation to use environment variables – if you have a few, that might be okay, but when the number of variables grows (as in any real application), it becomes impractical. You can set environment variables via a bash script, but you’d have to store that script somewhere. And in fact, even separate environment variables have to be stored somewhere.
This somewhere could be a local directory (risky), shared storage such as an FTP server or an S3 bucket with limited access, or a separate git repository. I prefer the git repository, as it allows versioning (note: S3 does too, but in a provider-specific way). So you can store all your environment-specific properties files, with all their credentials and environment-specific configuration, in a git repo with limited access (only Ops people). And that’s not bad, as long as it’s not the same repo as the source code.
Since many companies use GitHub or BitBucket for their repositories, storing production credentials on a public provider may still be risky. That’s why it’s a good idea to encrypt the files in the repository. A good way to do that is with git-crypt. Its encryption is “transparent”: it supports diff and performs encryption and decryption on the fly, so once you set it up, you continue working with the repo as if it weren’t encrypted. There’s even a fork that works on Windows.
You simply run git-crypt init (after you’ve put the git-crypt binary on your OS path), which generates a key. Then you specify your .gitattributes, e.g. like this:
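As a rough sketch (the file patterns are placeholders; list whichever files hold secrets), the .gitattributes might contain:
secretfile filter=git-crypt diff=git-crypt
*.key filter=git-crypt diff=git-crypt
config/credentials/* filter=git-crypt diff=git-crypt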
And you’re done. Well, almost. If this is a fresh repo, everything is good. If it is an existing repo, you’d have to clean up your history which contains the unencrypted files. Following these steps will get you there, with one addition – before calling git commit, you should call git-crypt status -f so that the existing files are actually encrypted.
You’re almost done. We should somehow share and back up the keys. For the sharing part, it’s not a big issue for a team of 2-3 Ops people to share the same key, but you could also use the GPG option of git-crypt (as documented in the README). What’s left is to back up your secret key (generated in the .git/git-crypt directory). You can store it (password-protected) in some other storage, be it a company shared folder, Dropbox/Google Drive, or even your email. Just make sure your computer is not the only place where it’s present and that it’s protected. I don’t think key rotation is necessary, but you can devise some rotation procedure.
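For reference, the relevant commands look roughly like this (the key path and GPG identity are placeholders):
git-crypt export-key /path/to/backup/repo-secret.key   # export the symmetric key for backup or sharing
git-crypt unlock /path/to/backup/repo-secret.key       # a teammate unlocks a fresh clone with that key
git-crypt add-gpg-user ops@example.com                 # alternatively, grant access to a GPG key instead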
git-crypt’s author claims it shines when encrypting just a few files in an otherwise public repo, and recommends looking at git-remote-gcrypt otherwise. But as there are often non-sensitive parts of environment-specific configurations, you may not want to encrypt everything, and I think it’s perfectly fine to use git-crypt even in a separate-repo scenario. And even though encryption is an okay approach to protect credentials in your source code repo, it’s still not necessarily a good idea to have the environment configurations in the same repo. Especially given that different people/teams manage these credentials. Even in small companies, maybe not all members have production access.
The outstanding question in this case is: how do you sync the properties with code changes? Sometimes the code adds new properties that should be reflected in the environment configurations. There are two scenarios here – first, properties that can vary across environments but have sensible default values (e.g. scheduled job periods), and second, properties that require explicit configuration (e.g. database credentials). The former can have the default values bundled in the code repo and therefore in the release artifact, allowing external files to override them. The latter should be announced to the people who do the deployment so that they can set the proper values.
The whole process of having versioned, environment-specific configurations is actually quite simple and logical, even with the encryption added to the picture. And I think it’s a good security practice we should try to follow.
Security updates have been issued by Arch Linux (bind, libofx, and thunderbird), Debian (thunderbird, xdg-utils, and xen), Fedora (procps-ng), Mageia (gnupg2, mbedtls, pdns, and pdns-recursor), openSUSE (bash, GraphicsMagick, icu, and kernel), Oracle (thunderbird), Red Hat (java-1.7.1-ibm, java-1.8.0-ibm, and thunderbird), Scientific Linux (thunderbird), and Ubuntu (curl).
Security updates have been issued by Debian (imagemagick), Fedora (curl, glibc, kernel, and thunderbird-enigmail), openSUSE (enigmail, knot, and python), Oracle (procps-ng), Red Hat (librelp, procps-ng, redhat-virtualization-host, rhev-hypervisor7, and unboundid-ldapsdk), Scientific Linux (procps-ng), SUSE (bash, ceph, icu, kvm, and qemu), and Ubuntu (procps and spice, spice-protocol).
Bad software is everywhere. One can even claim that every software is bad. Cool companies, tech giants, established companies, all produce bad software. And no, yours is not an exception.
Who’s to blame for bad software? It’s all complicated and many factors are intertwined – there’s business requirements, there’s organizational context, there’s lack of sufficient skilled developers, there’s the inherent complexity of software development, there’s leaky abstractions, reliance on 3rd party software, consequences of wrong business and purchase decisions, time limitations, flawed business analysis, etc. So yes, despite the catchy title, I’m aware it’s actually complicated.
But in every “it’s complicated” scenario, there are always one or two factors that are decisive. All of them contribute somehow, but the major drivers are usually a handful of things. And in the case of bad software, I think it’s the fault of technical people: developers, architects, ops.
We don’t seem to care about best practices. And I’ll do some nasty generalizations here, but bear with me. We can spend hours arguing about tabs vs spaces, curly bracket on new line, git merge vs rebase, which IDE is better, which framework is better and other largely irrelevant stuff. But we tend to ignore the important aspects that span beyond the code itself. The context in which the code lives, the non-functional requirements – robustness, security, resilience, etc.
We don’t seem to get security. Even trivial stuff such as user authentication is almost always implemented wrong. Recently, Twitter and GitHub realized they had been logging plain-text passwords, for example, but that’s just the tip of the iceberg. Too often we ignore the security implications.
“But the business didn’t request the security features”, one may say. The business never requested 2-factor authentication, encryption at rest, PKI, secure (or any) audit trail, log masking, crypto shredding, etc., etc. Because the business doesn’t know these things – we do and we have to put them on the backlog and fight for them to be implemented. Each organization has its specifics and tech people can influence the backlog in different ways, but almost everywhere we can put things there and prioritize them.
The other aspect is testing. We should all be well aware by now that automated testing is mandatory. We have all the tools in the world for unit, functional, integration, performance and whatnot testing, and yet many software projects lack the test coverage necessary to change things without accidentally breaking them. “But testing takes time, we don’t have it.” We are perfectly aware that testing saves time, as we’ve all had those “not again!” recurring bugs. And yet we come up with all sorts of excuses – “let the QAs test it”, “we have to ship this now, we’ll test it later”, “this is too trivial to be tested”, etc.
And you may say it’s not our job. We don’t define what has to be done, we just do it. We don’t define the budget, the scope, the features. We just write whatever has been decided. And that’s plain wrong. It’s not our job to make money out of our code, and it’s not our job to define what customers need, but apart from that everything is our job. The way the software is structured, the security aspects and security features, the stability of the code base, the way the software behaves in different environments. The non-functional requirements are our job, and putting them on the backlog is our job.
You’ve probably heard that every software becomes “legacy” after 6 months. And that’s because of us, our sloppiness, our inability to mitigate external factors and constraints. Too often we create a mess through “just doing our job”.
And of course that’s a generalization. I happen to know a lot of great professionals who don’t make these mistakes, who strive for excellence and implement things the right way. But our industry as a whole doesn’t. Our industry as a whole produces bad software. And it’s our fault, as developers – as the only people who know why a certain piece of software is bad.
In a talk of his, Bob Martin warns us of the risks of our sloppiness. We have been building websites so far, but we are more and more building stuff that interacts with the real world, directly and indirectly. Ultimately, lives may depend on our software (like the recent unfortunate death caused by a self-driving car). And I’ll agree with Uncle Bob that it’s high time we self-regulate as an industry, before some technically incompetent politician decides to do that.
How, I don’t know. We’ll have to think more about it. But I’m pretty sure it’s our fault that software is bad, and no amount of blaming the management, the budget, the timing, the tools or the process can eliminate our responsibility.
Why do I insist on bashing my fellow software engineers? Because if we start looking at software development with more responsibility, accepting that if it fails it’s our fault, then we’re more likely to get out of our current bug-ridden, security-flawed, fragile software hole and really become the experts of the future.
Many companies across the globe use Amazon DynamoDB to store and query historical user-interaction data. DynamoDB is a fast NoSQL database used by applications that need consistent, single-digit millisecond latency.
Often, customers want to turn their valuable data in DynamoDB into insights by analyzing a copy of their table stored in Amazon S3. Doing this separates their analytical queries from their low-latency critical paths. This data can be the primary source for understanding customers’ past behavior, predicting future behavior, and generating downstream business value. Customers often turn to DynamoDB because of its great scalability and high availability. After a successful launch, many customers want to use the data in DynamoDB to predict future behaviors or provide personalized recommendations.
DynamoDB is a good fit for low-latency reads and writes, but it’s not practical to scan all data in a DynamoDB table to train a model. In this post, I demonstrate how you can use DynamoDB table data copied to Amazon S3 by AWS Data Pipeline to predict customer behavior. I also demonstrate how you can use this data to provide personalized recommendations for customers using Amazon SageMaker. You can also run ad hoc queries using Amazon Athena against the data. DynamoDB recently released on-demand backups to create full table backups with no performance impact. However, that feature isn’t suitable for our purposes in this post, so I chose AWS Data Pipeline instead to create managed backups that are accessible from other services.
To do this, I describe how to read the DynamoDB backup file format in Data Pipeline. I also describe how to convert the objects in S3 to a CSV format that Amazon SageMaker can read. In addition, I show how to schedule regular exports and transformations using Data Pipeline. The sample data used in this post is from Bank Marketing Data Set of UCI.
The solution that I describe provides the following benefits:
Separates analytical queries from production traffic on your DynamoDB table, preserving your DynamoDB read capacity units (RCUs) for important production requests
Automatically updates your model to get real-time predictions
Optimizes for performance (so it doesn’t compete with DynamoDB RCUs after the export) and for cost (using data you already have)
Makes it easier for developers of all skill levels to use Amazon SageMaker
All code and data set in this post are available in this .zip file.
Solution architecture
The following diagram shows the overall architecture of the solution.
The steps that data follows through the architecture are as follows:
Data Pipeline regularly copies the full contents of a DynamoDB table as JSON into an S3 bucket.
Exported JSON files are converted to comma-separated value (CSV) format to use as a data source for Amazon SageMaker.
Amazon SageMaker renews the model artifact and updates the endpoint.
The converted CSV is available for ad hoc queries with Amazon Athena.
Data Pipeline controls this flow and repeats the cycle based on the schedule defined by customer requirements.
Building the auto-updating model
This section discusses details about how to read the DynamoDB exported data in Data Pipeline and build automated workflows for real-time prediction with a regularly updated model.
Find the automation_script.sh file and edit it for your environment. For example, you need to replace 's3://<your bucket>/<datasource path>/' with your own S3 path to the data source for Amazon SageMaker. In the script, text enclosed in angle brackets (< and >) should be replaced with your own values.
Upload the json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar file to your S3 path so that the ADD jar command in Apache Hive can refer to it.
For this solution, the banking.csv should be imported into a DynamoDB table.
Export a DynamoDB table
To export the DynamoDB table to S3, open the Data Pipeline console and choose the Export DynamoDB table to S3 template. In this template, Data Pipeline creates an Amazon EMR cluster and performs an export in the EMRActivity activity. Set proper intervals for backups according to your business requirements.
One core node (m3.xlarge) provides the default capacity for the EMR cluster and should be suitable for the solution in this post. Leave the option to resize the cluster before running enabled in the TableBackupActivity activity so that Data Pipeline can scale the cluster to match the table size. The process of converting to CSV format and renewing the model happens in this EMR cluster.
For a more in-depth look at how to export data from DynamoDB, see Export Data from DynamoDB in the Data Pipeline documentation.
Add the script to an existing pipeline
After you export your DynamoDB table, you add an additional EMR step to EMRActivity by following these steps:
Open the Data Pipeline console and choose the ID for the pipeline that you want to add the script to.
For Actions, choose Edit.
In the editing console, choose the Activities category and add an EMR step using the custom script downloaded in the previous section, as shown below.
Paste the following command into the new step after the data upload step:
The element #{output.directoryPath} references the S3 path where the data pipeline exports DynamoDB data as JSON. The path should be passed to the script as an argument.
The bash script has two goals, converting data formats and renewing the Amazon SageMaker model. Subsequent sections discuss the contents of the automation script.
Automation script: Convert JSON data to CSV with Hive
We use Apache Hive to transform the data into a new format. The Hive QL script to create an external table and transform the data is included in the custom script that you added to the Data Pipeline definition.
When you run the Hive scripts, do so with the -e option. Also, define the Hive table with the 'org.openx.data.jsonserde.JsonSerDe' row format to parse and read JSON format. The SQL creates a Hive EXTERNAL table, and it reads the DynamoDB backup data on the S3 path passed to it by Data Pipeline.
Note: You should create the table with the “EXTERNAL” keyword to avoid the backup data being accidentally deleted from S3 if you drop the table.
The full automation script for converting follows. Add your own bucket name and data source path in the highlighted areas.
After creating the external table, you read the data and use the INSERT OVERWRITE DIRECTORY ... SELECT command to write CSV data to the S3 path that you designated as the data source for Amazon SageMaker.
Depending on your requirements, you can drop or transform columns in the SELECT clause in this step to optimize the data for analysis. For example, you might remove columns that have unpredictable correlations with the target value, because keeping the wrong columns can expose your model to “overfitting” during training. In this post, the customer_id column is removed. Overfitting can weaken your predictions. More information about overfitting can be found in the topic Model Fit: Underfitting vs. Overfitting in the Amazon ML documentation.
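Putting those pieces together, a condensed sketch of the conversion step could look like the following. This is not the full automation script: the table name, the short column list, and the 's'/'n' type keys (from the DynamoDB JSON export format) are assumptions to adapt to your own data.
# The export path is passed in by Data Pipeline as #{output.directoryPath}.
EXPORT_PATH=$1

hive -e "
ADD JAR s3://<your bucket name>/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar;

CREATE EXTERNAL TABLE IF NOT EXISTS ddb_backup (
age map<string,string>, job map<string,string>, duration map<string,string>, y map<string,string>
)
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe'
LOCATION '${EXPORT_PATH}';

INSERT OVERWRITE DIRECTORY 's3://<your bucket name>/<datasource path>/'
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
SELECT y['n'], age['n'], job['s'], duration['n'] FROM ddb_backup;
"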
Automation script: Renew the Amazon SageMaker model
After the CSV data is replaced and ready to use, create a new model artifact for Amazon SageMaker with the updated dataset on S3. To renew the model artifact, you must create a new training job. Training jobs can be run using the AWS SDK (for example, boto3), the Amazon SageMaker Python SDK (installable with the “pip install sagemaker” command), or the AWS CLI for Amazon SageMaker, which is what this post uses.
In addition, consider how to renew your existing model smoothly and without service impact, because your model is called by applications in real time. To do this, create a new endpoint configuration first, and then update the current endpoint with the newly created endpoint configuration.
#!/bin/bash
## Define variable
REGION=$2
DTTIME=`date +%Y-%m-%d-%H-%M-%S`
ROLE="<your AmazonSageMaker-ExecutionRole>"
# Select containers image based on region.
case "$REGION" in
"us-west-2" )
IMAGE="174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:latest"
;;
"us-east-1" )
IMAGE="382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest"
;;
"us-east-2" )
IMAGE="404615174143.dkr.ecr.us-east-2.amazonaws.com/linear-learner:latest"
;;
"eu-west-1" )
IMAGE="438346466558.dkr.ecr.eu-west-1.amazonaws.com/linear-learner:latest"
;;
*)
echo "Invalid Region Name"
exit 1 ;
esac
# Start training job and creating model artifact
TRAINING_JOB_NAME=TRAIN-${DTTIME}
S3OUTPUT="s3://<your bucket name>/model/"
INSTANCETYPE="ml.m4.xlarge"
INSTANCECOUNT=1
VOLUMESIZE=5
aws sagemaker create-training-job --training-job-name ${TRAINING_JOB_NAME} --region ${REGION} --algorithm-specification TrainingImage=${IMAGE},TrainingInputMode=File --role-arn ${ROLE} --input-data-config '[{ "ChannelName": "train", "DataSource": { "S3DataSource": { "S3DataType": "S3Prefix", "S3Uri": "s3://<your bucket name>/<datasource path>/", "S3DataDistributionType": "FullyReplicated" } }, "ContentType": "text/csv", "CompressionType": "None", "RecordWrapperType": "None" }]' --output-data-config S3OutputPath=${S3OUTPUT} --resource-config InstanceType=${INSTANCETYPE},InstanceCount=${INSTANCECOUNT},VolumeSizeInGB=${VOLUMESIZE} --stopping-condition MaxRuntimeInSeconds=120 --hyper-parameters feature_dim=20,predictor_type=binary_classifier
# Wait until job completed
aws sagemaker wait training-job-completed-or-stopped --training-job-name ${TRAINING_JOB_NAME} --region ${REGION}
# Get newly created model artifact and create model
MODELARTIFACT=`aws sagemaker describe-training-job --training-job-name ${TRAINING_JOB_NAME} --region ${REGION} --query 'ModelArtifacts.S3ModelArtifacts' --output text`
MODELNAME=MODEL-${DTTIME}
aws sagemaker create-model --region ${REGION} --model-name ${MODELNAME} --primary-container Image=${IMAGE},ModelDataUrl=${MODELARTIFACT} --execution-role-arn ${ROLE}
# create a new endpoint configuration
CONFIGNAME=CONFIG-${DTTIME}
aws sagemaker create-endpoint-config --region ${REGION} --endpoint-config-name ${CONFIGNAME} --production-variants VariantName=Users,ModelName=${MODELNAME},InitialInstanceCount=1,InstanceType=ml.m4.xlarge
# create or update the endpoint
STATUS=`aws sagemaker describe-endpoint --endpoint-name ServiceEndpoint --query 'EndpointStatus' --output text --region ${REGION}`
if [[ "$STATUS" != "InService" ]]
then
aws sagemaker create-endpoint --endpoint-name ServiceEndpoint --endpoint-config-name ${CONFIGNAME} --region ${REGION}
else
aws sagemaker update-endpoint --endpoint-name ServiceEndpoint --endpoint-config-name ${CONFIGNAME} --region ${REGION}
fi
Grant permission
Before you execute the script, you must grant proper permission to Data Pipeline. Data Pipeline uses the DataPipelineDefaultResourceRole role by default. I added the following policy to DataPipelineDefaultResourceRole to allow Data Pipeline to create, delete, and update the Amazon SageMaker model and data source in the script.
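A minimal sketch of granting those permissions from the CLI, with a deliberately broad action list that you should scope down for production:
aws iam put-role-policy \
--role-name DataPipelineDefaultResourceRole \
--policy-name AllowSageMakerModelRenewal \
--policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": ["sagemaker:*"], "Resource": "*" },
    { "Effect": "Allow", "Action": ["iam:PassRole"], "Resource": "*" }
  ]
}'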
After you deploy a model into production using Amazon SageMaker hosting services, your client applications use this API to get inferences from the model hosted at the specified endpoint. This approach is useful for interactive web, mobile, or desktop applications.
Following, I provide a simple Python code example that queries the Amazon SageMaker endpoint by its name (“ServiceEndpoint”) and uses the response for real-time prediction.
Data Pipeline exports DynamoDB table data into S3. The original JSON data should be kept to recover the table in the rare event that this is needed. Data Pipeline then converts JSON to CSV so that Amazon SageMaker can read the data. Note: You should select only meaningful attributes when you convert to CSV. For example, if you judge that the “campaign” attribute is not correlated, you can eliminate this attribute from the CSV.
Train the Amazon SageMaker model with the new data source.
When a new customer comes to your site, you can judge how likely it is for this customer to subscribe to your new product based on “predictedScores” provided by Amazon SageMaker.
If the new user subscribes to your new product, your application must update the attribute “y” to the value 1 (for yes). This updated data is provided for the next model renewal as a new data source, and it serves to improve the accuracy of your predictions. With each new entry, your application becomes smarter and delivers better predictions.
Running ad hoc queries using Amazon Athena
Amazon Athena is a serverless query service that makes it easy to analyze large amounts of data stored in Amazon S3 using standard SQL. Athena is useful for examining data and collecting statistics or informative summaries about data. You can also use the powerful analytic functions of Presto, as described in the topic Aggregate Functions of Presto in the Presto documentation.
With the Data Pipeline scheduled activity, recent CSV data is always located in S3 so that you can run ad hoc queries against the data using Amazon Athena. I show this with example SQL statements following. For an in-depth description of this process, see the post Interactive SQL Queries for Data in Amazon S3 on the AWS News Blog.
Creating an Amazon Athena table and running it
You can simply create an EXTERNAL table for the CSV data on S3 in the Amazon Athena console.
=== Table Creation ===
CREATE EXTERNAL TABLE datasource (
age int,
job string,
marital string ,
education string,
default string,
housing string,
loan string,
contact string,
month string,
day_of_week string,
duration int,
campaign int,
pdays int ,
previous int ,
poutcome string,
emp_var_rate double,
cons_price_idx double,
cons_conf_idx double,
euribor3m double,
nr_employed double,
y int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n'
LOCATION 's3://<your bucket name>/<datasource path>/';
The following query calculates the correlation coefficient between the target attribute and other attributes using Amazon Athena.
=== Sample Query ===
SELECT corr(age,y) AS correlation_age_and_target,
corr(duration,y) AS correlation_duration_and_target,
corr(campaign,y) AS correlation_campaign_and_target,
corr(contact,y) AS correlation_contact_and_target
FROM ( SELECT age , duration , campaign , y ,
CASE WHEN contact = 'telephone' THEN 1 ELSE 0 END AS contact
FROM datasource
) datasource ;
Conclusion
In this post, I introduce an example of how to analyze data in DynamoDB by using table data in Amazon S3 to optimize DynamoDB table read capacity. You can then use the analyzed data as a new data source to train an Amazon SageMaker model for accurate real-time prediction. In addition, you can run ad hoc queries against the data on S3 using Amazon Athena. I also present how to automate these procedures by using Data Pipeline.
You can adapt this example to your specific use case at hand, and hopefully this post helps you accelerate your development. You can find more examples and use cases for Amazon SageMaker in the video AWS 2017: Introducing Amazon SageMaker on the AWS website.
Yong Seong Lee is a Cloud Support Engineer for AWS Big Data Services. He is interested in every technology related to data/databases and helping customers who have difficulties in using AWS services. His motto is “Enjoy life, be curious and have maximum experience.”
Thanks to Raja Mani, AWS Solutions Architect, for this great blog.
—
In this blog post, I’ll walk you through the steps for setting up continuous replication of an AWS CodeCommit repository from one AWS region to another AWS region using a serverless architecture. CodeCommit is a fully-managed, highly scalable source control service that stores anything from source code to binaries. It works seamlessly with your existing Git tools and eliminates the need to operate your own source control system. Replicating an AWS CodeCommit repository from one AWS region to another AWS region enables you to achieve lower latency pulls for global developers. This same approach can also be used to automatically back up repositories currently hosted on other services (for example, GitHub or BitBucket) to AWS CodeCommit.
This solution uses AWS Lambda and AWS Fargate for continuous replication. Benefits of this approach include:
The replication process can be easily set up to trigger based on events, such as commits made to the repository.
Setting up a serverless architecture means you don’t need to provision, maintain, or administer servers.
Note: AWS Fargate has a limitation of 10 GB for storage and is available in the US East (N. Virginia) region. A similar solution that uses Amazon EC2 instances to replicate the repositories on a schedule was published in a previous blog post and can be used if your repository does not meet these conditions.
Replication using Fargate
As you follow this blog post, you’ll set up an architecture that looks like this:
Any change in the AWS CodeCommit repository will trigger a Lambda function. The Lambda function will call the Fargate task that replicates the repository using a Git command line tool.
Let us assume a user wants to replicate a repository (Source) from US East (N. Virginia/us-east-1) region to a repository (Destination) in US West (Oregon/us-west-2) region. I’ll walk you through the steps for it:
Prerequisites
Create an AWS service IAM role for Amazon EC2 that has permissions for both the source and destination repositories, for the IAM CreateRole and AttachRolePolicy actions, and for Amazon ECR. Here is the EC2 role policy I used:
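A rough equivalent as an inline policy, with intentionally broad actions that you should scope down for real use (the role name is a placeholder):
aws iam put-role-policy \
--role-name <your EC2 role name> \
--policy-name ReplicationSetupPermissions \
--policy-document '{
  "Version": "2012-10-17",
  "Statement": [
    { "Effect": "Allow", "Action": ["codecommit:*", "ecr:*"], "Resource": "*" },
    { "Effect": "Allow", "Action": ["iam:CreateRole", "iam:AttachRolePolicy"], "Resource": "*" }
  ]
}'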
You need a Docker environment to build this solution. You can launch an EC2 instance and install Docker, or you can use AWS Cloud9, which comes with Docker and Git preinstalled. I used an EC2 instance and installed Docker on it. Use the IAM role created in the previous step when creating the EC2 instance. I’ll refer to this environment as the “Docker environment” in the following steps.
You need to install the AWS CLI on the Docker environment. For AWS CLI installation, refer to this page.
You need to install Git, including the Git command line tools, on the Docker environment.
Step 1: Create the Docker image
To create the Docker image, first it needs a Dockerfile. A Dockerfile is a manifest that describes the base image to use for your Docker image and what you want installed and running on it. For more information about Dockerfiles, go to the Dockerfile Reference.
1. Choose a directory in the Docker environment and perform the following steps in that directory. I used /home/ec2-user directory to perform the following steps.
2. Clone the AWS CodeCommit repository in the Docker environment. Open the terminal to the Docker environment and run the following commands to clone your source AWS CodeCommit repository (I ran the commands from /home/ec2-user directory):
Note: Replace the URLs with your own source and destination repository URLs.
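A sketch of the clone step, assuming the source repository lives in us-east-1 and is cloned into a directory named LocalRepository (which the Dockerfile below copies into the image); the destination URL comes into play in the replication script in a later step:
git clone https://git-codecommit.us-east-1.amazonaws.com/v1/repos/Source LocalRepository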
3. Create a file called Dockerfile (case sensitive) with the following content (I created it in /home/ec2-user directory):
# Pull the Amazon Linux latest base image
FROM amazonlinux:latest
#Install aws-cli and git command line tools
RUN yum -y install unzip aws-cli
RUN yum -y install git
WORKDIR /home/ec2-user
RUN mkdir LocalRepository
WORKDIR /home/ec2-user/LocalRepository
#Copy Cloned CodeCommit repository to Docker container
COPY ./LocalRepository /home/ec2-user/LocalRepository
#Copy shell script that does the replication
COPY ./repl_repository.bash /home/ec2-user/LocalRepository
RUN chmod ugo+rwx /home/ec2-user/LocalRepository/repl_repository.bash
WORKDIR /home/ec2-user/LocalRepository
#Call this script when Docker starts the container
ENTRYPOINT ["/home/ec2-user/LocalRepository/repl_repository.bash"]
4. Copy the following shell script into a file called repl_repository.bash in the Dockerfile directory in the Docker environment (I created it in the /home/ec2-user directory).
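At a minimum, the script needs to authenticate to CodeCommit, pull the latest changes from the source repository, and push them to the destination. A sketch along those lines, assuming HTTPS remotes, the AWS CLI credential helper, and a master branch (the destination URL is a placeholder):
#!/bin/bash
# Let git authenticate to CodeCommit over HTTPS via the AWS CLI credential helper.
git config --global credential.helper '!aws codecommit credential-helper $@'
git config --global credential.UseHttpPath true

cd /home/ec2-user/LocalRepository

# Bring the local clone up to date with the source repository...
git pull origin master

# ...and push everything to the destination repository in the other region.
git push https://git-codecommit.us-west-2.amazonaws.com/v1/repos/Destination --all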
6. Verify whether the replication is working by running the repl_repository.bash script from the LocalRepository directory. Go to LocalRepository directory and run this command: . ../repl_repository.bash If it is successful, you will get the “Everything up-to-date” at the last line of the result like this:
$ . ../repl_repository.bash
Everything up-to-date
Step 2: Build the Docker Image
1. Build the Docker image by running this command from the directory where you created the DockerFile in the Docker environment in the previous step (I ran it from /home/ec2-user directory):
$ docker build . -t ccrepl
Output: It installs various packages and sets environment variables as part of steps 1 to 3 of the Dockerfile. Steps 4 to 11 of the Dockerfile should produce an output similar to the following:
2. Run the following command to verify that the image was created successfully. It will display “Everything up-to-date” at the end if it is successful.
$ docker run ccrepl
Everything up-to-date
Step 3: Push the Docker Image to Amazon Elastic Container Registry (ECR)
Perform the following steps in the Docker Environment.
1. Run the AWS CLI configure command and set default region as your source repository region (I used us-east-1).
$ aws configure set default.region <Source Repository Region>
2. Create an Amazon ECR repository using this command to store your ccrepl image (Note the repositoryUri in the output):
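A sketch of that step plus the subsequent login, tag, and push (the account ID is a placeholder; older AWS CLI versions use aws ecr get-login instead of get-login-password):
aws ecr create-repository --repository-name ccrepl

# Authenticate Docker to the registry, then tag and push the image.
aws ecr get-login-password | docker login --username AWS --password-stdin <your account id>.dkr.ecr.us-east-1.amazonaws.com
docker tag ccrepl:latest <your account id>.dkr.ecr.us-east-1.amazonaws.com/ccrepl:latest
docker push <your account id>.dkr.ecr.us-east-1.amazonaws.com/ccrepl:latest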
2. Create a role called AccessRoleForCCfromFG using the following command in the DockerEnvironment:
$ aws iam create-role --role-name AccessRoleForCCfromFG --assume-role-policy-document file://trustpolicyforecs.json
3. Assign CodeCommit service full access to the above role using the following command in the DockerEnvironment:
$ aws iam attach-role-policy --policy-arn arn:aws:iam::aws:policy/AWSCodeCommitFullAccess --role-name AccessRoleForCCfromFG
4. In the Amazon ECS Console, choose Repositories and select the ccrepl repository that was created in the previous step. Copy the Repository URI.
5. In the Amazon ECS Console, choose Task Definitions and click Create New Task Definition.
6. Select launch type compatibility as FARGATE and click Next Step.
7. In the create task definition screen, do the following:
In Task Definition Name, type ccrepl
In Task Role, choose AccessRoleForCCfromFG
In Task Memory, choose 2GB
In Task CPU, choose 1 vCPU
Click Add Container under Container Definitions in the same screen. In the Add Container screen, do the following:
Enter Container name as ccreplcont
Enter Image URL copied from step 4
Enter Memory Limits as 128 and click Add.
Note: Select TaskExecutionRole as “ecsTaskExecutionRole” if it already exists. If not, select create new role and it will create “ecsTaskExecutionRole” for you.
8. Click the Create button in the task definition screen to create the task. It will successfully create the task, execution role and AWS CloudWatch Log groups.
9. In the Amazon ECS Console, click Clusters and create cluster. Select template as “Networking only, Powered by AWS Fargate” and click next step.
10. Enter cluster name as ccreplcluster and click create.
Step 5: Create the Lambda Function
In this section, I use the Amazon Elastic Container Service (ECS) RunTask API from Lambda to invoke the Fargate task.
1. In the IAM Console, create a new role called ECSLambdaRole with the permissions to AWS CodeCommit, Amazon ECS as well as pass roles privileges needed to run the ECS task. Your statement should look similar to the following (replace <your account id>):
2. In AWS management console, select VPC service and click subnets in the left navigation screen. Note down the Subnet IDs that you want to run the Fargate task in.
3. Create a new Lambda Node.js function called FargateTaskExecutionFunc and assign the role ECSLambdaRole with the following content:
Note: Replace the subnets values with the subnet IDs you noted in Step 2 of this section as the subnets you want to run the Fargate task in.
1. In the Lambda Console, click FargateTaskExecutionFunc under functions.
2. Under Add triggers in the Designer, select CodeCommit
3. In the Configure triggers screen, do the following:
Enter Repository name as Source (your source repository name)
Enter trigger name as LambdaTrigger
Leave the Events as “All repository events”
Leave the Branch names as “All branches”
Click Add button
Click Save button to save the changes
Step 6: Verification
To test the application, make a commit and push the changes to the source repository in AWS CodeCommit. That should automatically trigger the Lambda function and replicate the changes in the destination repository. You can verify this by checking CloudWatch Logs for Lambda and ECS, or simply going to the destination repository and verifying the change appears.
Conclusion
Congratulations! You have successfully configured repository replication of an AWS CodeCommit repository using AWS Lambda and AWS Fargate. You can use this technique in a deployment pipeline. You can also tweak the trigger configuration in AWS CodeCommit to call the Lambda function in response to any supported trigger event in AWS CodeCommit.
games: I launched Strawberry Jam 2, a month-long February game jam about making horny games. I will probably be making a horny game for it.
idchoppers: I took another crack at dilation. Some meager progress, maybe. I think I’m now porting bad academic C++ to Rust to get the algorithm I want, and I can’t help but wonder if I could just make up something of my own faster than this.
fox flux: I did a bunch of brainstorming and consolidated a bunch of notes from like four different places, which feels like work but also feels like it doesn’t actually move the project forward.
anise!!: Ah, yes, this fell a bit by the wayside. Some map work, some attempts at a 3D effect for a particular thing without much luck (though I found a workaround in the last couple days).
computers: I relieved myself of some 200 browser tabs, which feels fantastic, though I’ve since opened like 80 more. Alas. I also tried to put together a firejail profile for running mystery games from the internet, and I got like 90% of the way there, but it turns out there’s basically no way to stop an X application from reading all keyboard input.
(Yes, I know about that, and I tried it. Yes, that too.)
I’ve got a small pile of little projects that are vaguely urgent, so as much as I’d love to bash my head against idchoppers for a solid week, I’m gonna try to focus on getting a couple half-done things full-done. And maybe try to find time for art regularly so I don’t fall out of practice? Huff puff.
As companies mature in their cloud journey, they implement layered security capabilities and practices in their cloud architectures. One such practice is to continually assess golden Amazon Machine Images (AMIs) for security vulnerabilities. AMIs provide the information required to launch an Amazon EC2 instance, which is a virtual server in the AWS Cloud. A golden AMI is an AMI that contains the latest security patches, software, configuration, and software agents that you need to install for logging, security maintenance, and performance monitoring. You can build and deploy golden AMIs in your environment, but the AMIs quickly become dated as new vulnerabilities are discovered.
A security best practice is to perform routine vulnerability assessments of your golden AMIs to identify if newly found vulnerabilities apply to them. If you identify a vulnerability, you can update your golden AMIs with the appropriate security patches, test the AMIs, and deploy the patched AMIs in your environment. In this blog post, I demonstrate how to use Amazon Inspector to set up such continuous vulnerability assessments to scan your golden AMIs routinely.
Solution overview
Amazon Inspector performs security assessments of Amazon EC2 instances by using AWS managed rules packages such as the Common Vulnerabilities and Exposures (CVEs) package. The solution in this post creates EC2 instances from golden AMIs and then runs an Amazon Inspector security assessment on the created instances. When the assessment results are available, the solution consolidates the findings and advises you about next steps. Furthermore, the solution schedules an Amazon CloudWatch Events rule to run the golden AMI vulnerability assessments on a regular basis.
The following solution diagram illustrates how this solution works.
Here’s how this solution works, as illustrated in the preceding diagram:
A scheduled CloudWatch Events event triggers the StartContinuousAssessment AWS Lambda function, which starts the security assessment of your golden AMIs.
The StartContinuousAssessment Lambda function performs the following actions:
It reads a JSON parameter stored in the AWS Systems Manager (Systems Manager) Parameter Store. This JSON parameter contains the following metadata for each golden AMI:
Ami-Id – The ID of the golden AMI to assess.
InstanceType – A valid instance type for launching an EC2 instance of the golden AMI.
UserData – The user-data script that installs and starts the Amazon Inspector agent when the instance launches.
Later in this blog post, I provide instructions for creating this JSON parameter.
For each AMI specified in the JSON parameter, the Lambda function creates an EC2 instance. When each instance starts, it installs the Amazon Inspector agent by using the user-data script provided in the JSON. The Lambda function then copies each golden AMI’s tags (you will assign custom metadata in the form of tags to each golden AMI when you set up the solution) to the corresponding EC2 instance. The function also adds a tag with the key of continuous-assessment-instance and value as true. This tag identifies EC2 instances that require regular security assessments. The Lambda function copies the AMI’s tags to the instance (and later, to the security findings found for the instance) to help you identify the golden AMIs for each security finding. After you analyze security findings, you can patch your golden AMIs.
The first time the StartContinuousAssessment function runs, it creates:
An Amazon Inspector assessment target: the target uses the continuous-assessment-instance=true tag to identify the EC2 instances to assess.
An Amazon Inspector assessment template: the template contains a reference to the assessment target created in the preceding step and the AWS managed rules packages to evaluate, such as the Common Vulnerabilities and Exposures (CVEs) package.
For subsequent assessments, the StartContinuousAssessment function reuses the target and the template created during its first run.
Note: Amazon Inspector can start an assessment only after it finds at least one running Amazon Inspector agent. To allow EC2 instances to boot and the Amazon Inspector agent to start, the Lambda function waits four minutes. Because the assessment runs for approximately one hour and boot time for EC2 instances typically takes a few minutes, all Amazon Inspector agents start before the assessment ends.
The Lambda function then runs the assessment. The Amazon Inspector agents collect behavior and configuration data, and pass it to Amazon Inspector. Amazon Inspector analyzes the data and generates Amazon Inspector findings, which are possible security findings you may need to address.
When the assessment completes, Amazon Inspector publishes an assessment-completion notification message to an Amazon SNS topic called ContinuousAssessmentCompleteTopic. SNS uses topics, which are communication channels for sending messages and subscribing to notifications.
The notification message published to SNS triggers the AnalyzeInspectorFindings Lambda function, which performs the following actions:
Associates the tags of each EC2 instance with security findings found for that EC2 instance. This enables you to identify the security findings using the app-name tag you specified for your golden AMIs. You can use the information provided in the findings to patch your golden AMIs.
Terminates all instances associated with the continuous-assessment-instance=true tag.
Aggregates the number of findings found for each EC2 instance by severity and then publishes a consolidated result to an SNS topic called ContinuousAssessmentResultsTopic.
How to deploy the solution
To deploy this solution, you must set it up in the AWS Region where you build your golden AMIs. If that AWS Region does not support Amazon Inspector, at the end of your continuous integration pipeline, you can copy your AMIs to an AWS Region where Amazon Inspector assessments are supported. To learn more about continuous integration pipelines, see What is Continuous Integration?
To deploy continuous golden AMI vulnerability assessments in your AWS account, follow these steps:
Tag your golden AMIs – Tagging your golden AMIs lets you search assessment result findings based on tags after Amazon Inspector completes an assessment.
Store your golden AMI metadata in the Systems Manager Parameter Store – Prepare and store the golden AMI metadata in the Systems Manager Parameter Store. The StartContinuousAssessment Lambda function reads this golden AMI metadata and starts the vulnerability assessments.
Run the supplied AWS CloudFormation template and subscribe to an SNS topic to receive assessment results – Set up the infrastructure required to run vulnerability assessments and subscribe to an SNS topic to receive assessment results via email.
Test golden AMI vulnerability assessments – Ensure you have successfully set up the required resources to run vulnerability assessments.
Set up a CloudWatch Events rule for triggering continuous golden AMI vulnerability assessments – Schedule the execution of vulnerability assessments on a regular basis.
1. Tag your golden AMIs
You can search assessment findings based on golden AMI tags after Amazon Inspector completes an assessment.
To tag a golden AMI by using the AWS Management Console:
Choose your AMI from the list, and then choose Actions > Add/Edit Tags.
Choose Create Tag. In the Key column, type app-name. In the Value column, type your application name. Following the same steps, create the app-version and app-environment tags. Choose Save.
Now that you have tagged your golden AMIs, you need to create golden AMI metadata, which will be read by the StartContinuousAssessment function to initiate vulnerability assessments. You will store the golden AMI metadata in the Systems Manager Parameter Store.
2. Store your golden AMI metadata in the Systems Manager Parameter Store
This solution reads golden AMI metadata from a parameter stored in the Systems Manager Parameter Store. The metadata must be in JSON format and must contain the following information for each golden AMI:
Ami-Id
InstanceType
UserData
Step A: Find the AMI ID of your golden AMI.
An AMI ID uniquely identifies an AMI in an AWS Region and is a required parameter for launching an EC2 instance from a golden AMI. To find the AMI ID of your golden AMI:
Choose your AMI from the list and then note the corresponding value in the AMI ID column.
Step B: Find a compatible InstanceType for your golden AMI.
Each AMI has a list of compatible InstanceTypes. The InstanceType is a required parameter for launching an EC2 instance from a golden AMI. To find a compatible InstanceType for your golden AMI:
Choose Launch Instance. On the Choose an Amazon Machine Image (AMI) page, choose My AMIs.
Type the AMI ID that you noted in Step A in the Search my AMIs box, and then press Enter.
The search result will contain your golden AMI. To choose it, choose Select.
Locate any available Instance Type and then note the corresponding value in the Type column.
Choose Cancel.
Note: The solution launches an EC2 instance of the chosen InstanceType every time the vulnerability assessment runs.
Step C: Create the user-data script to install and start the Amazon Inspector agent.
The user-data script automates the installation of software packages when an EC2 instance launches for the first time. In this step, you create an operating system specific, JSON-compatible user-data script that installs and starts the Amazon Inspector agent.
Identify the command that installs the Amazon Inspector agent
Based on Installing Amazon Inspector Agents, the following shell command installs the Amazon Inspector agent on an Amazon Linux-based EC2 instance.
Based on Running Commands on Your Linux Instance at Launch, you make a Linux shell script user-data compatible by prefixing it with #!/bin/bash. In this step, you add the #!/bin/bash prefix to the script from the preceding step; the following is the resulting user-data compatible version.
The user-data script provided in the JSON metadata must be JSON-compatible, which you will do next.
Make the user-data script JSON compatible
To make the user-data script JSON compatible, you must replace all new-line characters with a \r\n\r\n sequence. The following is the JSON-compatible user-data script that you specify for your Amazon Linux-based golden AMI in Step D.
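If you would rather produce that JSON-compatible string programmatically than edit it by hand, a small Python helper (not part of the solution) can perform the replacement. The Amazon Linux install commands shown are an assumption based on the Amazon Inspector agent documentation, so verify them against the current docs.
# Build the user-data script and make it JSON compatible.
user_data = "\n".join([
    "#!/bin/bash",
    "wget https://inspector-agent.amazonaws.com/linux/latest/install",  # assumed agent installer URL; check the Inspector docs
    "sudo bash install",
])
# Replace each new-line character with the \r\n\r\n sequence expected in the JSON document.
json_compatible_user_data = user_data.replace("\n", "\\r\\n\\r\\n")
print(json_compatible_user_data)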
Repeat Steps A, B, and C to find the Ami Id, InstanceType, and UserData for each of your golden AMIs. When you have this metadata, you can create the JSON document of metadata for all your golden AMIs. The StartContinuousAssessment Lambda function reads this JSON to start golden AMI vulnerability assessments.
Step D: Create a JSON document of metadata of all your golden AMIs.
Use the following template to create a JSON document:
Replace all placeholder values with values corresponding to your first golden AMI. If your golden AMI is Amazon Linux-based, you can specify the userData as the JSON-compatible-user-data-for-Amazon-Linux-AMI from Step C. Next, replace the placeholder values for your second golden AMI. You can add more entries to your JSON document if you have more than two golden AMIs.
Note: The total number of characters in the JSON document must not exceed 4,096, and the number of golden AMIs must be fewer than 500. You must also verify that your account’s limits allow you to run one On-Demand EC2 instance for each of your golden AMIs. For information about how to verify service limits, see Amazon EC2 Service Limits.
Now that you have created the JSON document of your golden AMIs, you will store the JSON document in a Systems Manager parameter. The StartContinuousAssessment Lambda function will read the metadata from this parameter.
Step E: Store the JSON in a Systems Manager parameter.
Expand Systems Manager Shared Resources in the navigation pane, and then choose Parameter Store.
Choose Create Parameter.
For Name, type ContinuousAssessmentInput.
In the Description field, type Continuous golden AMI vulnerability assessment process metadata.
For Type, choose String.
Paste the JSON that you created in Step D in the Value field.
Choose Create Parameter. After the system creates the parameter, choose Close.
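If you prefer to script this step instead of using the console, the procedure above maps to a single boto3 call. The file name used to load the JSON document from Step D is hypothetical.
import boto3
ssm = boto3.client("ssm")
golden_amis_json = open("golden-amis.json").read()  # hypothetical file holding the JSON document from Step D
ssm.put_parameter(
    Name="ContinuousAssessmentInput",
    Description="Continuous golden AMI vulnerability assessment process metadata",
    Type="String",
    Value=golden_amis_json,
)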
To set up the remaining components required to run assessments, you will run a CloudFormation template and perform the configuration explained in the next section.
3. Run the CloudFormation template and subscribe to an SNS topic to receive assessment results
Next, create a CloudFormation stack using the provided CloudFormation template. Before you start, download the CloudFormation template to your computer.
On the Stacks page, choose AmazonInspectorAssessment.
In the Detail pane, choose Outputs to view the output of your stack.
After CloudFormation successfully creates the stack, the Outputs tab displays the following results:
StartContinuousAssessmentLambdaFunction – The Value box displays the name of the StartContinuousAssessment function. You will run this function to trigger the entire workflow.
ContinuousAssessmentResultsTopic – The Value box displays the ContinuousAssessmentResultsTopic topic’s Amazon Resource Name (ARN), which you will use later.
To receive consolidated vulnerability assessment results in email, you must subscribe to ContinuousAssessmentResultsTopic.
Choose Create subscription. In the Topic ARN field, paste the ARN of ContinuousAssessmentResultsTopic that you noted in the previous section.
In the Protocol drop-down, choose Email.
In the Endpoint box, type the email address where you will receive notifications.
Choose Create subscription.
Navigate to your email application and open the message from AWS Notifications. Click the link to confirm your subscription to the SNS topic.
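The same subscription can be created with boto3 if you are scripting the setup; the topic ARN below is a placeholder for the value you noted from the CloudFormation outputs, and the email address is likewise a placeholder.
import boto3
sns = boto3.client("sns")
results_topic_arn = "arn:aws:sns:region:account-id:ContinuousAssessmentResultsTopic"  # placeholder; use the ARN from the stack outputs
sns.subscribe(
    TopicArn=results_topic_arn,
    Protocol="email",
    Endpoint="security-team@example.com",  # placeholder; use the address that should receive results
)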
4. Test golden AMI vulnerability assessments
Before you schedule vulnerability assessments, you should test the process by running the StartContinuousAssessment function. In this test, you trigger a security assessment and monitor it. You then receive an email after the assessment has completed, which shows that vulnerability assessments have been successfully set up.
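One way to run this test from a script is to invoke the function directly with the parameter name as its input. This boto3 sketch assumes the function name from the CloudFormation outputs; the value shown is a placeholder.
import boto3
import json
lambda_client = boto3.client("lambda")
function_name = "StartContinuousAssessment-EXAMPLE"  # placeholder; use the name from the stack outputs
response = lambda_client.invoke(
    FunctionName=function_name,
    Payload=json.dumps({"AMIsParamName": "ContinuousAssessmentInput"}),
)
print(response["StatusCode"])  # 200 indicates the invocation was accepted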
On Dashboard under Recent Assessment Runs, you will see an entry with the status, Collecting Data. This status indicates that Amazon Inspector agents are collecting data from instances running your golden AMIs. The agents collect data for an hour and then Amazon Inspector analyzes the collected data.
After Amazon Inspector completes the assessment, the status in the console changes to Analysis complete. Amazon Inspector then publishes an SNS message that triggers the AnalyzeInspectorFindings Lambda function. When AnalyzeInspectorFindings publishes results, you will receive an email containing consolidated assessment results. You also will be able to see the findings.
To see the findings in Amazon Inspector’s Findings section:
In the navigation pane, choose Assessment Runs. In the table on the Amazon Inspector – Assessment Runs page, choose the findings of the latest assessment run.
Choose the settings icon and choose the appropriate tags to see the details of findings, as shown in the following screenshot. The findings also contain information about how you can address each underlying vulnerability.
Having verified that you have successfully set up all components of golden AMI vulnerability assessments, you now will schedule the vulnerability assessments to run on a regular basis to give you continual insight into the health of instances created from your golden AMIs.
5. Set up a CloudWatch Events rule for triggering continuous golden AMI vulnerability assessments
The last step is to create a CloudWatch Events rule to schedule the execution of the vulnerability assessments on a daily or weekly basis. A scripted equivalent of the following console steps is sketched after them.
In the navigation pane, choose Rules > Create rule.
On the Event Source page, choose Schedule. Choose Fixed rate of and specify the interval (for example, 1 day).
For Targets, choose Add target and then choose Lambda function.
For Function, choose the StartContinuousAssessment function.
Choose Configure Input.
Choose Constant (JSON text).
In the box, paste the following JSON code.
{
"AMIsParamName": "ContinuousAssessmentInput"
}
Choose Configure details.
For Rule definition, type ContinuousGoldenAMIAssessmentTrigger for the name, and for the description, type This rule triggers the continuous golden AMI vulnerability assessment process.
Choose Create rule.
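For completeness, here is a boto3 sketch of the same schedule created outside the console. The function name and ARN are placeholder assumptions taken from the CloudFormation outputs; this is not part of the published solution.
import boto3
import json
events = boto3.client("events")
lambda_client = boto3.client("lambda")
start_assessment_function_name = "StartContinuousAssessment-EXAMPLE"  # placeholder; from the stack outputs
start_assessment_function_arn = "arn:aws:lambda:region:account-id:function:StartContinuousAssessment-EXAMPLE"  # placeholder
# Create (or update) a daily schedule rule.
rule_arn = events.put_rule(
    Name="ContinuousGoldenAMIAssessmentTrigger",
    ScheduleExpression="rate(1 day)",
    State="ENABLED",
    Description="This rule triggers the continuous golden AMI vulnerability assessment process.",
)["RuleArn"]
# Allow CloudWatch Events to invoke the StartContinuousAssessment function.
lambda_client.add_permission(
    FunctionName=start_assessment_function_name,
    StatementId="AllowCloudWatchEventsInvoke",
    Action="lambda:InvokeFunction",
    Principal="events.amazonaws.com",
    SourceArn=rule_arn,
)
# Point the rule at the function, passing the constant JSON input shown above.
events.put_targets(
    Rule="ContinuousGoldenAMIAssessmentTrigger",
    Targets=[{
        "Id": "StartContinuousAssessment",
        "Arn": start_assessment_function_arn,
        "Input": json.dumps({"AMIsParamName": "ContinuousAssessmentInput"}),
    }],
)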
The vulnerability assessments are executed on the first occurrence of the schedule you chose while setting up the CloudWatch Events rule. After the vulnerability assessment is executed, you will receive an email to indicate that your continuous golden AMI vulnerability assessments are set up.
Summary
To get visibility into the security of your EC2 instances created from your golden AMIs, it is important that you perform security assessments of your golden AMIs on a regular basis. In this blog post, I demonstrated how to set up continuous golden AMI vulnerability assessments; the results can help you keep your environment up to date with security patches. To learn how to patch your golden AMIs, see Streamline AMI Maintenance and Patching Using Amazon EC2 Systems Manager.
If you have comments about this blog post, submit them in the “Comments” section below. If you have questions about implementing the solution in this post, start a new thread on the Amazon Inspector forum or contact AWS Support.
We find that AWS customers often require that every query submitted to Presto running on Amazon EMR is logged. They want to track what query was submitted, when it was submitted and who submitted it.
In this blog post, we will demonstrate how to implement and install a Presto event listener for purposes of custom logging, debugging and performance analysis for queries executed on an EMR cluster. An event listener is a plugin to Presto that is invoked when an event such as query creation, query completion, or split completion occurs.
Presto also provides a system connector to access metrics and details about a running cluster. In particular, the system connector gets information about currently running and recently run queries by using the system.runtime.queries table. From the Presto command line interface (CLI), you can retrieve this data with queries such as SELECT * FROM system.runtime.queries; and SELECT * FROM system.runtime.tasks;. This connector is available out of the box and exposes information and metrics through SQL.
In addition to providing custom logging and debugging, the Presto event listener allows you to enhance the information provided by the system connector by providing a mechanism to capture detailed query context, query statistics and failure information during query creation or completion, or a split completion.
We will begin by providing a detailed walkthrough of the implementation of the Presto event listener in Java followed by its deployment on the EMR cluster.
Implementation
We use the Eclipse IDE to create a Maven Project, as shown below:
Once you have created the Maven Project, modify the pom.xml file to add the dependency for Presto, as shown following:
After you add the Presto dependency to your pom.xml file, create a Java package under the src/main/java folder. In our project, we have named the package com.amazonaws.QueryEventListener. You can choose the naming convention that best fits your organization. Within this package, create three Java files for the EventListener, the EventListenerFactory, and the EventListenerPlugin.
As the Presto website says: “EventListenerFactory is responsible for creating an EventListener instance. It also defines an EventListener name, which is used by the administrator in a Presto configuration. Implementations of EventListener implement methods for the event types they are interested in handling. The implementation of EventListener and EventListenerFactory must be wrapped as a plugin and installed on the Presto cluster.”
In our project, we have named these Java files QueryEventListener, QueryEventListenerFactory, and QueryEventListenerPlugin:
Now we write our code for the Java files.
QueryEventListener – QueryEventListener implements the Presto EventListener interface. It has a constructor that creates five rotating log files of 524 MB each. After creating QueryEventListener, we implement the query creation, query completion, and split completion methods and log the events relevant to us. You can choose to include more events based on your needs.
QueryEventListenerFactory – The QueryEventListenerFactory class implements the Presto EventListenerFactory interface. We implement the method getName, which provides a registered EventListenerFactory name to Presto. We also implement the create method, which creates an EventListener instance.
You can find the code for the QueryEventListenerFactory class in this Amazon repository.
QueryEventListenerPlugin – The QueryEventListenerPlugin class implements the Presto EventListenerPlugin interface. This class has a getEventListenerFactories method that returns an immutable list containing the EventListenerFactory. Basically, in this class we are wrapping QueryEventListener and QueryEventListenerFactory.
Finally, in our project we add the META-INF folder and a services subfolder within the META-INF folder. In the services subfolder, you create a file called com.facebook.presto.spi.Plugin:
As the Presto documentation describes: “Each plugin identifies an entry point: an implementation of the plugin interface. This class name is provided to Presto via the standard Java ServiceLoader interface: the classpath contains a resource file named com.facebook.presto.spi.Plugin in the META-INF/services directory”.
We add the name of our plugin class com.amazonaws.QueryEventListener.QueryEventListenerPlugin to the com.facebook.presto.spi.Plugin file, as shown below:
Next we will show you how to deploy the Presto plugin we created to Amazon EMR for custom logging.
Presto logging overview on Amazon EMR
By default, Presto produces three log files that capture the configuration properties and the overall operational events of the components that make up Presto, as well as end-user access to the Presto UI.
On Amazon EMR, these log files are written to /var/log/presto, and the log files in this directory are pushed to Amazon S3. The custom log files created by the plugin in this post are written to the same directory and pushed to the same S3 location.
Steps to deploy Presto on Amazon EMR with custom logging
To deploy Presto on EMR with custom logging, use a bootstrap action. The bootstrap action script is available in this Amazon repository.
#!/bin/bash
# Determine whether this node is the EMR master by checking instance.json.
IS_MASTER=true
if [ -f /mnt/var/lib/info/instance.json ]
then
  if grep isMaster /mnt/var/lib/info/instance.json | grep true;
  then
    IS_MASTER=true
  else
    IS_MASTER=false
  fi
fi
# Copy the custom EventListener plugin JAR into Presto's plugin directory on every node.
sudo mkdir -p /usr/lib/presto/plugin/queryeventlistener
sudo /usr/bin/aws s3 cp s3://replace-with-your-bucket/QueryEventListener.jar /tmp
sudo cp /tmp/QueryEventListener.jar /usr/lib/presto/plugin/queryeventlistener/
# On the master node only, register the event listener with the Presto coordinator.
if [ "$IS_MASTER" = true ]; then
  sudo mkdir -p /usr/lib/presto/etc
  sudo bash -c 'cat <<EOT >> /usr/lib/presto/etc/event-listener.properties
event-listener.name=event-listener
EOT'
fi
First, upload the JAR file created in the last section, and update the S3 location in the bootstrap action, s3://replace-with-your-bucket/QueryEventListener.jar, with the bucket name where the JAR was placed.
After updating the bootstrap with the S3 location for your JAR, upload that bootstrap to your own bucket.
The bootstrap action copies the JAR file with the custom EventListener implementation to all machines in the cluster. Moreover, the bootstrap action creates a file named event-listener.properties on the Amazon EMR master node. This file configures the coordinator to enable the custom logging plugin via the event-listener.name property, which is set to event-listener in the event-listener.properties file. As described in the Presto documentation, this property is used by Presto to find a registered EventListenerFactory based on the name returned by EventListenerFactory.getName().
Now that the bootstrap action is ready, you can create a new EMR cluster that uses it with the AWS CLI command aws emr create-cluster, passing the script through its --bootstrap-actions option.
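As a rough illustration of the settings involved, the equivalent boto3 call is sketched below. The cluster name, release label, instance configuration, key pair, and bootstrap script path are all placeholder assumptions; adjust them for your environment.
import boto3
emr = boto3.client("emr")
emr.run_job_flow(
    Name="presto-custom-logging",  # placeholder name
    ReleaseLabel="emr-5.9.0",      # assumed EMR release with Presto available
    Applications=[{"Name": "Presto"}, {"Name": "Hive"}],
    Instances={
        "MasterInstanceType": "m4.xlarge",
        "SlaveInstanceType": "m4.xlarge",
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": True,
        "Ec2KeyName": "your-key-pair",  # placeholder
    },
    BootstrapActions=[{
        "Name": "InstallQueryEventListener",
        "ScriptBootstrapAction": {"Path": "s3://replace-with-your-bucket/presto_bootstrap.sh"},  # placeholder script path
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
    VisibleToAllUsers=True,
)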
After we have implemented the Presto custom logging plugin and deployed it, we can run a test and see the output of the log files.
We first create a Hive table with the DDL below from the Hive command line (simply type hive in the SSH terminal to the EMR master node to access the Hive command line):
CREATE EXTERNAL TABLE wikistats (
language STRING,
page_title STRING,
hits BIGINT,
retrived_size BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
LOCATION 's3://support.elasticmapreduce/training/datasets/wikistats/';
Then, we access the presto command line by typing the following command on the terminal:
presto-cli --server localhost:8889 --catalog hive --schema default
Finally, we run a simple query:
SELECT * FROM wikistats LIMIT 10;
We then go to the /var/log/presto directory and look at the contents of the log file queries-YYYY-MM-DDTHH\:MM\:SS.0.log. As depicted in the screenshot below, our QueryEventListener plugin captures the fields shown for the Query Created and Query Completed events. Moreover, if there are splits, the plugin will also capture split events.
Note: If you want to include the query text executed by the user for auditing and debugging purposes, add the field appropriately in the QueryEventListener class methods, as shown below:
Because this is custom logging, you can capture as many fields as are available for the particular events. To find out the fields available for each of the events, see the Java Classes provided by Presto at this GitHub location.
Summary
In this post, you learned how to add custom logging to Presto on EMR to enhance your organization’s auditing capabilities and provide insights into performance.
If you have questions or suggestions, please leave a comment below.
Zafar Kapadia is a Cloud Application Architect for AWS. He works on Application Development and Optimization projects. He is also an avid cricketer and plays in various local leagues.
Francisco Oliveira is a Big Data Engineer with AWS Professional Services. He focuses on building big data solutions with open source technology and AWS. In his free time, he likes to try new sports, travel and explore national parks.
Genomics analysis has taken off in recent years as organizations continue to adopt the cloud for its elasticity, durability, and cost. With the AWS Cloud, customers have a number of performant options to choose from. These options include AWS Batch in conjunction with AWS Lambda and AWS Step Functions; AWS Glue, a serverless extract, transform, and load (ETL) service; and of course, the AWS big data and machine learning workhorse Amazon EMR.
For this task, we use Hail, an open source framework for exploring and analyzing genomic data that uses the Apache Spark framework. In this post, we use Amazon EMR to run Hail. We walk through the setup, configuration, and data processing. Finally, we generate an Apache Parquet–formatted variant dataset and explore it using Amazon Athena.
Compiling Hail
Because Hail is still under active development, you must compile it before you can start using it. To help simplify the process, you can launch the following AWS CloudFormation template that creates an EMR cluster, compiles Hail, and installs a Jupyter Notebook so that you’re ready to go with Hail.
There are a few things to note about the AWS CloudFormation template. You must provide a password for the Jupyter Notebook. Also, you must provide a virtual private cloud (VPC) to launch Amazon EMR in, and make sure that you select a subnet from within that VPC. Next, update the cluster resources to fit your needs. Lastly, the HailBuildOutputS3Path parameter should be an Amazon S3 bucket/prefix, where you should save the compiled Hail binaries for later use. Leave the Hail and Spark versions as is, unless you’re comfortable experimenting with more recent versions.
When you’ve completed these steps, the following files are saved locally on the cluster to be used when running the Apache Spark Python API (PySpark) shell.
The files are also copied to the Amazon S3 location defined by the AWS CloudFormation template so that you can include them when running jobs using the Amazon EMR Step API.
Collecting genome data
To get started with Hail, use the 1000 Genome Project dataset available on AWS. The data that you will use is located at s3://1000genomes/release/20130502/.
For Hail to process these files in an efficient manner, they need to be block compressed. In many cases, files that use gzip compression are compressed in blocks, so you don’t need to recompress—you can just rename the file extension from “.gz” to “.bgz”. Hail can process .gz files, but it’s much slower and not recommended. The simple way to accomplish this is to copy the data files from the public S3 bucket to your own and rename them.
The following is the Bash command line to copy the first five genome Variant Call Format (VCF) files and rename them appropriately using the AWS CLI.
for i in $(seq 5); do aws s3 cp s3://1000genomes/release/20130502/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz s3://your_bucket/prefix/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz; done
Now that you have some data files containing variants in the Variant Call Format, you need to get the sample annotations that go along with them. These annotations provide more information about each sample, such as the population they are a part of.
In this section, you use the data collected in the previous section to explore genome variations interactively using a Jupyter Notebook. You then create a simple ETL job to convert these variations into Parquet format. Finally, you query it using Amazon Athena.
Let’s open the Jupyter notebook. To start, sign in to the AWS Management Console, and open the AWS CloudFormation console. Choose the stack that you created, and then choose the Output tab. There you see the JupyterURL. Open this URL in your browser.
Go ahead and download the Jupyter Notebook that is provided to your local machine. Log in to Jupyter with the password that you provided during stack creation. Choose Upload on the right side, and then choose the notebook from your local machine.
After the notebook is uploaded, choose it from the list on the left to open it.
Select the first cell, update the S3 bucket location to point to the bucket where you saved the compiled Hail libraries, and then choose Run. This code imports the Hail modules that you compiled at the beginning. When the cell is executing, you will see In [*]. When the process is complete, the asterisk (*) is replaced by a number, for example, In [1].
Next, run the subsequent two cells, which imports the Hail module into PySpark and initiates the Hail context.
The next cell imports a single VCF file from the bucket where you saved your data in the previous section. If you change the Amazon S3 path to not include a file name, it imports all the VCF files in that directory. Depending on your cluster size, it might take a few minutes.
Remember that in the previous section, you also copied an annotation file. Now you use it to annotate the VCF files that you’ve loaded with Hail. Execute the next cell—as a shortcut, you can select the cell and press Shift+Enter.
The import_table API takes a path to the annotation file in TSV (tab-separated values) format and a parameter named impute that attempts to infer the schema of the file, as shown in the output below the cell.
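As a rough sketch of the notebook cells described above, the import calls look something like the following. The annotation file path is a placeholder, hc is the HailContext created in the earlier cells, and the exact method signatures depend on the Hail version that the CloudFormation template compiled.
# Import one block-compressed VCF produced in the previous section.
vds = hc.import_vcf("s3://your_bucket/prefix/ALL.chr1.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz")
# Import the sample annotation file (TSV), letting Hail infer the column types.
sample_annotations = hc.import_table(
    "s3://your_bucket/prefix/sample_annotations.tsv",  # placeholder path to the annotation file
    impute=True,
)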
At this point, you can interactively explore the data. For example, you can count the number of samples you have and group them by population.
You can also calculate the standard quality control (QC) metrics on your variants and samples.
What if you want to query this data outside of Hail and Spark, for example, using Amazon Athena? To start, you need to change the column names to lowercase because Athena currently supports only lowercase names. To do that, use the two functions provided in the notebook and call them on your variant dataset (VDS), as shown in the following image. Note that you’re only changing the case of the variants and samples schemas. If you’ve further augmented your VDS, you might need to modify the lowercase functions to do the same for those schemas.
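The notebook’s helper functions operate on the VDS schemas themselves, but the underlying idea is simply lowercasing names. Applied to a plain Spark DataFrame, the equivalent helper looks like this (illustrative only, not the notebook’s exact code):
def lowercase_columns(df):
    # Rename every column of a Spark DataFrame to lowercase so Athena can query it.
    return df.toDF(*[c.lower() for c in df.columns])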
In the current version of Hail, the sample annotations are not stored in the exported Parquet VDS, so you need to save them separately. As noted by the Hail maintainers, in future versions, the data represented by the VDS Parquet output will change, and it is recommended that you also export the variant annotations. So let’s do that.
Note that both of these lines are similar in that they export a table representation of the sample and variant annotations, convert them to a Spark DataFrame, and finally save them to Amazon S3 in Parquet file format.
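Assuming sample_df and variant_df are the Spark DataFrames produced by those export cells, writing them out is plain PySpark; the S3 paths below match the bucket layout used later in this post and are otherwise placeholders.
sample_df.write.mode("overwrite").parquet("s3://your_output_bucket/hail_data/sample_annotations/")
variant_df.write.mode("overwrite").parquet("s3://your_output_bucket/hail_data/variant_annotations/")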
Finally, it is beneficial to save the VDS file back to Amazon S3 so that next time you need to analyze your data, you can load it without having to start from the raw VCF. Note that when Hail saves your data, it requires a path and a file name.
After you run these cells, expect it to take some time as it writes out the data.
Discovering table metadata
Before you can query your data, you need to tell Athena the schema of your data. You have a couple of options. The first is to use AWS Glue to crawl the S3 bucket, infer the schema from the data, and create the appropriate table. Before proceeding, you might need to migrate your Athena database to use the AWS Glue Data Catalog.
Creating tables in AWS Glue
To use the AWS Glue crawler, open the AWS Glue console and choose Crawlers in the left navigation pane.
Then choose Add crawler to create a new crawler.
Next, give your crawler a name and assign the appropriate IAM role. Leave Amazon S3 as the data source, and select the S3 bucket where you saved the data and the sample annotations. When you set the crawler’s Include path, be sure to include the entire path, for example: s3://output_bucket/hail_data/sample_annotations/
Under the Exclusion Paths, type _SUCCESS, so that you don’t crawl that particular file.
Continue forward with the default settings until you are asked if you want to add another source. Choose Yes, and add the Amazon S3 path to the variant annotation bucket s3://your_output_bucket/hail_data/variant_annotations/ so that it can build your variant annotation table. Give it an existing database name, or create a new one.
Provide a table prefix and choose Next. Then choose Finish. At this point, assuming that the data is finished writing, you can go ahead and run the crawler. When it finishes, you have two new tables in the database you created that should look something like the following:
You can explore the schema of these tables by choosing their name and then choosing Edit Schema on the right side of the table view; for example:
Creating tables in Amazon Athena
If you cannot or do not want to use AWS Glue crawlers, you can add the tables via the Athena console by typing the following statements:
In the Amazon Athena console, choose the database in which your tables were created. In this case, it looks something like the following:
To verify that you have data, choose the three dots on the right, and then choose Preview table.
Indeed, you can see some data.
You can further explore the sample and variant annotations along with the calculated QC metrics that you calculated previously using Hail.
Summary
To summarize, this post demonstrated the ease with which you can install, configure, and use Hail, an open source, highly scalable framework for exploring and analyzing genomics data on Amazon EMR. We demonstrated setting up a Jupyter Notebook to make our exploration easy. We also used the power of Hail to calculate quality control metrics for variants and samples. We exported them to Amazon S3 and allowed a broader range of users and analysts to explore them on demand in a serverless environment using Amazon Athena.
Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics, and business intelligence needs. Roy is a big Manchester United fan and enjoys cheering his team on and hanging out with his family.
We can’t believe that there are just a few days left before re:Invent 2017. If you are attending this year, you’ll want to check out our Big Data sessions! The Big Data and Machine Learning categories are bigger than ever. As in previous years, you can find these sessions in various tracks, including Analytics & Big Data, Deep Learning Summit, Artificial Intelligence & Machine Learning, Architecture, and Databases.
We have great sessions from organizations and companies like Vanguard, Cox Automotive, Pinterest, Netflix, FINRA, Amtrak, AmazonFresh, Sysco Foods, Twilio, American Heart Association, Expedia, Esri, Nextdoor, and many more. All sessions are recorded and made available on YouTube. In addition, all slide decks from the sessions will be available on SlideShare.net after the conference.
This post highlights the sessions that will be presented as part of the Analytics & Big Data track, as well as relevant sessions from other tracks like Architecture, Artificial Intelligence & Machine Learning, and IoT. If you’re interested in Machine Learning sessions, don’t forget to check out our Guide to Machine Learning at re:Invent 2017.
This year’s session catalog contains the following breakout sessions.
Raju Gulabani, VP of Database, Analytics, and AI at AWS, will discuss the evolution of database and analytics services in AWS, the new database and analytics services and features we launched this year, and our vision for continued innovation in this space. We are witnessing an unprecedented growth in the amount of data collected, in many different forms. Storage, management, and analysis of this data require database services that scale and perform in ways not possible before. AWS offers a collection of database and other data services—including Amazon Aurora, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon ElastiCache, Amazon Kinesis, and Amazon EMR—to process, store, manage, and analyze data. In this session, we provide an overview of AWS database and analytics services and discuss how customers are using these services today.
Deep dive customer use cases
ABD401 – How Netflix Monitors Applications in Near Real-Time with Amazon Kinesis Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this session, we first discuss why Netflix chose Kinesis Streams to address these challenges at scale. We then dive deep into how Netflix uses Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this session, you will learn how to build a real-time application monitoring system using network traffic logs and get real-time, actionable insights.
In this session, learn how Nextdoor replaced their home-grown data pipeline based on a topology of Flume nodes with a completely serverless architecture based on Kinesis and Lambda. By making these changes, they improved both the reliability of their data and the delivery times of billions of records of data to their Amazon S3–based data lake and Amazon Redshift cluster. Nextdoor is a private social networking service for neighborhoods.
ABD205 – Taking a Page Out of Ivy Tech’s Book: Using Data for Student Success Data speaks. Discover how Ivy Tech, the nation’s largest singly accredited community college, uses AWS to gather, analyze, and take action on student behavioral data for the betterment of over 3,100 students. This session outlines the process from inception to implementation across the state of Indiana and highlights how Ivy Tech’s model can be applied to your own complex business problems.
ABD207 – Leveraging AWS to Fight Financial Crime and Protect National Security Banks aren’t known to share data and collaborate with one another. But that is exactly what the Mid-Sized Bank Coalition of America (MBCA) is doing to fight digital financial crime—and protect national security. Using the AWS Cloud, the MBCA developed a shared data analytics utility that processes terabytes of non-competitive customer account, transaction, and government risk data. The intelligence produced from the data helps banks increase the efficiency of their operations, cut labor and operating costs, and reduce false positive volumes. The collective intelligence also allows greater enforcement of Anti-Money Laundering (AML) regulations by helping members detect internal risks—and identify the challenges to detecting these risks in the first place. This session demonstrates how the AWS Cloud supports the MBCA to deliver advanced data analytics, provide consistent operating models across financial institutions, reduce costs, and strengthen national security.
ABD208 – Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores New Innovation with Amazon Kinesis Firehose In this session, learn how Cox Automotive is using Splunk Cloud for real time visibility into its AWS and hybrid environments to achieve near instantaneous MTTI, reduce auction incidents by 90%, and proactively predict outages. We also introduce a highly anticipated capability that allows you to ingest, transform, and analyze data in real time using Splunk and Amazon Kinesis Firehose to gain valuable insights from your cloud resources. It’s now quicker and easier than ever to gain access to analytics-driven infrastructure monitoring using Splunk Enterprise & Splunk Cloud.
ABD209 – Accelerating the Speed of Innovation with a Data Sciences Data & Analytics Hub at Takeda Historically, silos of data, analytics, and processes across functions, stages of development, and geography created a barrier to R&D efficiency. Gathering the right data necessary for decision-making was challenging due to issues of accessibility, trust, and timeliness. In this session, learn how Takeda is undergoing a transformation in R&D to increase the speed-to-market of high-impact therapies to improve patient lives. The Data and Analytics Hub was built, with Deloitte, to address these issues and support the efficient generation of data insights for functions such as clinical operations, clinical development, medical affairs, portfolio management, and R&D finance. In the AWS hosted data lake, this data is processed, integrated, and made available to business end users through data visualization interfaces, and to data scientists through direct connectivity. Learn how Takeda has achieved significant time reductions—from weeks to minutes—to gather and provision data that has the potential to reduce cycle times in drug development. The hub also enables more efficient operations and alignment to achieve product goals through cross functional team accountability and collaboration due to the ability to access the same cross domain data.
ABD210 – Modernizing Amtrak: Serverless Solution for Real-Time Data Capabilities As the nation’s only high-speed intercity passenger rail provider, Amtrak needs to know critical information to run their business such as: Who’s onboard any train at any time? How are booking and revenue trending? Amtrak was faced with unpredictable and often slow response times from existing databases, ranging from seconds to hours; existing booking and revenue dashboards were spreadsheet-based and manual; multiple copies of data were stored in different repositories, lacking integration and consistency; and operations and maintenance (O&M) costs were relatively high. Join us as we demonstrate how Deloitte and Amtrak successfully went live with a cloud-native operational database and analytical datamart for near-real-time reporting in under six months. We highlight the specific challenges and the modernization of architecture on an AWS native Platform as a Service (PaaS) solution. The solution includes cloud-native components such as AWS Lambda for microservices, Amazon Kinesis and AWS Data Pipeline for moving data, Amazon S3 for storage, Amazon DynamoDB for a managed NoSQL database service, and Amazon Redshift for near-real time reports and dashboards. Deloitte’s solution enabled “at scale” processing of 1 million transactions/day and up to 2K transactions/minute. It provided flexibility and scalability, largely eliminated the need for system management, and dramatically reduced operating costs. Moreover, it laid the groundwork for decommissioning legacy systems, anticipated to save at least $1M over 3 years.
ABD211 – Sysco Foods: A Journey from Too Much Data to Curated Insights In this session, we detail Sysco’s journey from a company focused on hindsight-based reporting to one focused on insights and foresight. For this shift, Sysco moved from multiple data warehouses to an AWS ecosystem, including Amazon Redshift, Amazon EMR, AWS Data Pipeline, and more. As the team at Sysco worked with Tableau, they gained agile insight across their business. Learn how Sysco decided to use AWS, how they scaled, and how they became more strategic with the AWS ecosystem and Tableau.
ABD217 – From Batch to Streaming: How Amazon Flex Uses Real-time Analytics to Deliver Packages on Time Reducing the time to get actionable insights from data is important to all businesses, and customers who employ batch data analytics tools are exploring the benefits of streaming analytics. Learn best practices to extend your architecture from data warehouses and databases to real-time solutions. Learn how to use Amazon Kinesis to get real-time data insights and integrate them with Amazon Aurora, Amazon RDS, Amazon Redshift, and Amazon S3. The Amazon Flex team describes how they used streaming analytics in their Amazon Flex mobile app, used by Amazon delivery drivers to deliver millions of packages each month on time. They discuss the architecture that enabled the move from a batch processing system to a real-time system, overcoming the challenges of migrating existing batch data to streaming data, and how to benefit from real-time analytics.
ABD218 – How EuroLeague Basketball Uses IoT Analytics to Engage Fans IoT and big data have made their way out of industrial applications, general automation, and consumer goods, and are now a valuable tool for improving consumer engagement across a number of industries, including media, entertainment, and sports. The low cost and ease of implementation of AWS analytics services and AWS IoT have allowed AGT, a leader in IoT, to develop their IoTA analytics platform. Using IoTA, AGT brought a tailored solution to EuroLeague Basketball for real-time content production and fan engagement during the 2017-18 season. In this session, we take a deep dive into how this solution is architected for secure, scalable, and highly performant data collection from athletes, coaches, and fans. We also talk about how the data is transformed into insights and integrated into a content generation pipeline. Lastly, we demonstrate how this solution can be easily adapted for other industries and applications.
ABD222 – How to Confidently Unleash Data to Meet the Needs of Your Entire Organization Where are you on the spectrum of IT leaders? Are you confident that you’re providing the technology and solutions that consistently meet or exceed the needs of your internal customers? Do your peers at the executive table see you as an innovative technology leader? Innovative IT leaders understand the value of getting data and analytics directly into the hands of decision makers, and into their own. In this session, Daren Thayne, Domo’s Chief Technology Officer, shares how innovative IT leaders are helping drive a culture change at their organizations. See how transformative it can be to have real-time access to all of the data that is relevant to YOUR job (including a complete view of your entire AWS environment), as well as understand how it can help you lead the way in applying that same pattern throughout your entire company.
ABD303 – Developing an Insights Platform – Sysco’s Journey from Disparate Systems to Data Lake and Beyond Sysco has nearly 200 operating companies across its multiple lines of business throughout the United States, Canada, Central/South America, and Europe. As the global leader in food services, Sysco identified the need to streamline the collection, transformation, and presentation of data produced by the distributed units and systems, into a central data ecosystem. Sysco’s Business Intelligence and Analytics team addressed these requirements by creating a data lake with scalable analytics and query engines leveraging AWS services. In this session, Sysco will outline their journey from a hindsight reporting focused company to an insights driven organization. They will cover solution architecture, challenges, and lessons learned from deploying a self-service insights platform. They will also walk through the design patterns they used and how they designed the solution to provide predictive analytics using Amazon Redshift Spectrum, Amazon S3, Amazon EMR, AWS Glue, Amazon Elasticsearch Service and other AWS services.
ABD309 – How Twilio Scaled Its Data-Driven Culture As a leading cloud communications platform, Twilio has always been strongly data-driven. But as headcount and data volumes grew—and grew quickly—they faced many new challenges. One-off, static reports work when you’re a small startup, but how do you support a growth stage company to a successful IPO and beyond? Today, Twilio’s data team relies on AWS and Looker to provide data access to 700 colleagues. Departments have the data they need to make decisions, and cloud-based scale means they get answers fast. Data delivers real-business value at Twilio, providing a 360-degree view of their customer, product, and business. In this session, you hear firsthand stories directly from the Twilio data team and learn real-world tips for fostering a truly data-driven culture at scale.
ABD310 – How FINRA Secures Its Big Data and Data Science Platform on AWS FINRA uses big data and data science technologies to detect fraud, market manipulation, and insider trading across US capital markets. As a financial regulator, FINRA analyzes highly sensitive data, so information security is critical. Learn how FINRA secures its Amazon S3 Data Lake and its data science platform on Amazon EMR and Amazon Redshift, while empowering data scientists with tools they need to be effective. In addition, FINRA shares AWS security best practices, covering topics such as AMI updates, micro segmentation, encryption, key management, logging, identity and access management, and compliance.
ABD331 – Log Analytics at Expedia Using Amazon Elasticsearch Service Expedia uses Amazon Elasticsearch Service (Amazon ES) for a variety of mission-critical use cases, ranging from log aggregation to application monitoring and pricing optimization. In this session, the Expedia team reviews how they use Amazon ES and Kibana to analyze and visualize Docker startup logs, AWS CloudTrail data, and application metrics. They share best practices for architecting a scalable, secure log analytics solution using Amazon ES, so you can add new data sources almost effortlessly and get insights quickly
ABD316 – American Heart Association: Finding Cures to Heart Disease Through the Power of Technology Combining disparate datasets and making them accessible to data scientists and researchers is a prevalent challenge for many organizations, not just in healthcare research. American Heart Association (AHA) has built a data science platform using Amazon EMR, Amazon Elasticsearch Service, and other AWS services, that corrals multiple datasets and enables advanced research on phenotype and genotype datasets, aimed at curing heart diseases. In this session, we present how AHA built this platform and the key challenges they addressed with the solution. We also provide a demo of the platform, and leave you with suggestions and next steps so you can build similar solutions for your use cases
ABD319 – Tooling Up for Efficiency: DIY Solutions @ Netflix At Netflix, we have traditionally approached cloud efficiency from a human standpoint, whether it be in-person meetings with the largest service teams or manually flipping reservations. Over time, we realized that these manual processes are not scalable as the business continues to grow. Therefore, in the past year, we have focused on building out tools that allow us to make more insightful, data-driven decisions around capacity and efficiency. In this session, we discuss the DIY applications, dashboards, and processes we built to help with capacity and efficiency. We start at the ten thousand foot view to understand the unique business and cloud problems that drove us to create these products, and discuss implementation details, including the challenges encountered along the way. Tools discussed include Picsou, the successor to our AWS billing file cost analyzer; Libra, an easy-to-use reservation conversion application; and cost and efficiency dashboards that relay useful financial context to 50+ engineering teams and managers.
ABD312 – Deep Dive: Migrating Big Data Workloads to AWS Customers are migrating their analytics, data processing (ETL), and data science workloads running on Apache Hadoop, Spark, and data warehouse appliances from on-premise deployments to AWS in order to save costs, increase availability, and improve performance. AWS offers a broad set of analytics services, including solutions for batch processing, stream processing, machine learning, data workflow orchestration, and data warehousing. This session will focus on identifying the components and workflows in your current environment; and providing the best practices to migrate these workloads to the right AWS data analytics product. We will cover services such as Amazon EMR, Amazon Athena, Amazon Redshift, Amazon Kinesis, and more. We will also feature Vanguard, an American investment management company based in Malvern, Pennsylvania with over $4.4 trillion in assets under management. Ritesh Shah, Sr. Program Manager for Cloud Analytics Program at Vanguard, will describe how they orchestrated their migration to AWS analytics services, including Hadoop and Spark workloads to Amazon EMR. Ritesh will highlight the technical challenges they faced and overcame along the way, as well as share common recommendations and tuning tips to accelerate the time to production.
ABD402 – How Esri Optimizes Massive Image Archives for Analytics in the Cloud Petabyte-scale archives of satellite, plane, and drone imagery continue to grow exponentially. They mostly exist as semi-structured data, but they are only valuable when accessed and processed by a wide range of products for both visualization and analysis. This session provides an overview of how ArcGIS indexes and structures data so that any part of it can be quickly accessed, processed, and analyzed by reading only the minimum amount of data needed for the task. In this session, we share best practices for structuring and compressing massive datasets in Amazon S3, so they can be analyzed efficiently. We also review a number of different image formats, including GeoTIFF (used for the Public Datasets on AWS program, Landsat on AWS), cloud optimized GeoTIFF, MRF, and CRF as well as different compression approaches to show the effect on processing performance. Finally, we provide examples of how this technology has been used to help image processing and analysis for the response to Hurricane Harvey.
ABD329 – A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Massive Scale Amazon’s consumer business continues to grow, and so does the volume of data and the number and complexity of the analytics done in support of the business. In this session, we talk about how Amazon.com uses AWS technologies to build a scalable environment for data and analytics. We look at how Amazon is evolving the world of data warehousing with a combination of a data lake and parallel, scalable compute engines such as Amazon EMR and Amazon Redshift.
ABD327 – Migrating Your Traditional Data Warehouse to a Modern Data Lake In this session, we discuss the latest features of Amazon Redshift and Redshift Spectrum, and take a deep dive into its architecture and inner workings. We share many of the recent availability, performance, and management enhancements and how they improve your end user experience. You also hear from 21st Century Fox, who presents a case study of their fast migration from an on-premises data warehouse to Amazon Redshift. Learn how they are expanding their data warehouse to a data lake that encompasses multiple data sources and data formats. This architecture helps them tie together siloed business units and get actionable 360-degree insights across their consumer base.
MCL202 – Ally Bank & Cognizant: Transforming Customer Experience Using Amazon Alexa Given the increasing popularity of natural language interfaces such as Voice as User technology or conversational artificial intelligence (AI), Ally® Bank was looking to interact with customers by enabling direct transactions through conversation or voice. They also needed to develop a capability that allows third parties to connect to the bank securely for information sharing and exchange, using oAuth, an authentication protocol seen as the future of secure banking technology. Cognizant’s Architecture team partnered with Ally Bank’s Enterprise Architecture group and identified the right product for oAuth integration with Amazon Alexa and third-party technologies. In this session, we discuss how building products with conversational AI helps Ally Bank offer an innovative customer experience; increase retention through improved data-driven personalization; increase the efficiency and convenience of customer service; and gain deep insights into customer needs through data analysis and predictive analytics to offer new products and services.
MCL317 – Orchestrating Machine Learning Training for Netflix Recommendations At Netflix, we use machine learning (ML) algorithms extensively to recommend relevant titles to our 100+ million members based on their tastes. Everything on the member home page is an evidence-driven, A/B-tested experience that we roll out backed by ML models. These models are trained using Meson, our workflow orchestration system. Meson distinguishes itself from other workflow engines by handling more sophisticated execution graphs, such as loops and parameterized fan-outs. Meson can schedule Spark jobs, Docker containers, bash scripts, gists of Scala code, and more. Meson also provides a rich visual interface for monitoring active workflows and inspecting execution logs. It has a powerful Scala DSL for authoring workflows as well as the REST API. In this session, we focus on how Meson trains recommendation ML models in production, and how we have re-architected it to scale up for a growing need of broad ETL applications within Netflix. As a driver for this change, we have had to evolve the persistence layer for Meson. We talk about how we migrated from Cassandra to Amazon RDS backed by Amazon Aurora
MCL350 – Humans vs. the Machines: How Pinterest Uses Amazon Mechanical Turk’s Worker Community to Improve Machine Learning Ever since the term “crowdsourcing” was coined in 2006, it’s been a buzzword for technology companies and social institutions. In the technology sector, crowdsourcing is instrumental for verifying machine learning algorithms, which, in turn, improves the user’s experience. In this session, we explore how Pinterest adapted to an increased reliance on human evaluation to improve their product, with a focus on how they’ve integrated with Mechanical Turk’s platform. This presentation is aimed at engineers, analysts, program managers, and product managers who are interested in how companies rely on Mechanical Turk’s human evaluation platform to better understand content and improve machine learning algorithms. The discussion focuses on the analysis and product decisions related to building a high quality crowdsourcing system that takes advantage of Mechanical Turk’s powerful worker community.
ABD201 – Big Data Architectural Patterns and Best Practices on AWS In this session, we simplify big data processing as a data bus comprising various stages: collect, store, process, analyze, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost
ABD202 – Best Practices for Building Serverless Big Data Applications Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. In this session, we show you how to incorporate serverless concepts into your big data architectures. We explore the concepts behind and benefits of serverless architectures for big data, looking at design patterns to ingest, store, process, and visualize your data. Along the way, we explain when and how you can use serverless technologies to streamline data processing, minimize infrastructure management, and improve agility and robustness. We also share a reference architecture using a combination of cloud and open source technologies to solve your big data problems. Topics include: use cases and best practices for serverless big data applications; leveraging AWS technologies such as Amazon DynamoDB, Amazon S3, Amazon Kinesis, AWS Lambda, Amazon Athena, and Amazon EMR; and serverless ETL, event processing, ad hoc analysis, and real-time analytics.
ABD206 – Building Visualizations and Dashboards with Amazon QuickSight Just as a picture is worth a thousand words, a visual is worth a thousand data points. A key aspect of our ability to gain insights from our data is to look for patterns, and these patterns are often not evident when we simply look at data in tables. The right visualization will help you gain a deeper understanding in a much quicker timeframe. In this session, we will show you how to quickly and easily visualize your data using Amazon QuickSight. We will show you how you can connect to data sources, generate custom metrics and calculations, create comprehensive business dashboards with various chart types, and setup filters and drill downs to slice and dice the data.
ABD203 – Real-Time Streaming Applications on AWS: Use Cases and Patterns To win in the marketplace and provide differentiated customer experiences, businesses need to be able to use live data in real time to facilitate fast decision making. In this session, you learn common streaming data processing use cases and architectures. First, we give an overview of streaming data and AWS streaming data capabilities. Next, we look at a few customer examples and their real-time streaming applications. Finally, we walk through common architectures and design patterns of top streaming data use cases.
ABD213 – How to Build a Data Lake with AWS Glue Data Catalog As data volumes grow and customers store more data on AWS, they often have valuable data that is not easily discoverable and available for analytics. The AWS Glue Data Catalog provides a central view of your data lake, making data readily available for analytics. We introduce key features of the AWS Glue Data Catalog and its use cases. Learn how crawlers can automatically discover your data, extract relevant metadata, and add it as table definitions to the AWS Glue Data Catalog. We will also explore the integration between AWS Glue Data Catalog and Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
ABD214 – Real-time User Insights for Mobile and Web Applications with Amazon Pinpoint With customers demanding relevant and real-time experiences across a range of devices, digital businesses are looking to gather user data at scale, understand this data, and respond to customer needs instantly. This requires tools that can record large volumes of user data in a structured fashion, and then instantly make this data available to generate insights. In this session, we demonstrate how you can use Amazon Pinpoint to capture user data in a structured yet flexible manner. Further, we demonstrate how this data can be set up for instant consumption using services like Amazon Kinesis Firehose and Amazon Redshift. We walk through example data based on real world scenarios, to illustrate how Amazon Pinpoint lets you easily organize millions of events, record them in real-time, and store them for further analysis.
ABD223 – IT Innovators: New Technology for Leveraging Data to Enable Agility, Innovation, and Business Optimization Companies of all sizes are looking for technology to efficiently leverage data and their existing IT investments to stay competitive and understand where to find new growth. Regardless of where companies are in their data-driven journey, they face greater demands for information from customers, prospects, partners, vendors, and employees. All stakeholders inside and outside the organization want information on demand or in "real time", available anywhere on any device. They want to use it to optimize business outcomes without having to rely on complex software tools or human gatekeepers for access to relevant information. Learn how IT innovators at companies such as MasterCard, Jefferson Health, and TELUS are using Domo's Business Cloud to help their organizations more effectively leverage data at scale.
ABD301 – Analyzing Streaming Data in Real Time with Amazon Kinesis Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. In this session, we present an end-to-end streaming data solution using Kinesis Streams for data ingestion, Kinesis Analytics for real-time processing, and Kinesis Firehose for persistence. We review in detail how to write SQL queries using streaming data and discuss best practices to optimize and monitor your Kinesis Analytics applications. Lastly, we discuss how to estimate the cost of the entire system.
ABD302 – Real-Time Data Exploration and Analytics with Amazon Elasticsearch Service and Kibana In this session, we use Apache web logs as an example and show you how to build an end-to-end analytics solution. First, we cover how to configure an Amazon ES cluster and ingest data using Amazon Kinesis Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data. Then we demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we review approaches for generating custom, ad-hoc reports.
ABD304 – Best Practices for Data Warehousing with Amazon Redshift & Redshift Spectrum Most companies are overrun with data, yet they lack critical insights to make timely and accurate business decisions. They are missing the opportunity to combine large amounts of new, unstructured big data that resides outside their data warehouse with trusted, structured data inside their data warehouse. In this session, we take an in-depth look at how modern data warehousing blends and analyzes all your data, inside and outside your data warehouse without moving the data, to give you deeper insights to run your business. We will cover best practices on how to design optimal schemas, load data efficiently, and optimize your queries to deliver high throughput and performance.
ABD305 – Design Patterns and Best Practices for Data Analytics with Amazon EMR Amazon EMR is one of the largest Hadoop operators in the world, enabling customers to run ETL, machine learning, real-time processing, data science, and low-latency SQL at petabyte scale. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about lowering cost with Auto Scaling and Spot Instances, and security best practices for encryption and fine-grained access control. Finally, we dive into some of our recent launches to keep you current on our latest features.
ABD307 – Deep Analytics for Global AWS Marketing Organization To meet the needs of the global marketing organization, the AWS marketing analytics team built a scalable platform that allows the data science team to deliver custom econometric and machine learning models for end user self-service. To meet data security standards, we use end-to-end data encryption and different AWS services such as Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR with Apache Spark and Auto Scaling. In this session, you see real examples of how we have scaled and automated critical analysis, such as calculating the impact of marketing programs like re:Invent and prioritizing leads for our sales teams.
ABD311 – Deploying Business Analytics at Enterprise Scale with Amazon QuickSight One of the biggest tradeoffs customers usually make when deploying BI solutions at scale is agility versus governance. Large-scale BI implementations with the right governance structure can take months to design and deploy. In this session, learn how you can avoid making this tradeoff using Amazon QuickSight. Learn how to easily deploy Amazon QuickSight to thousands of users using Active Directory and Federated SSO, while securely accessing your data sources in Amazon VPCs or on-premises. We also cover how to control access to your datasets, implement row-level security, create scheduled email reports, and audit access to your data.
ABD315 – Building Serverless ETL Pipelines with AWS Glue Organizations need to gain insight and knowledge from a growing number of Internet of Things (IoT), API, clickstream, unstructured, and log data sources. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. In this session, we introduce key ETL features of AWS Glue and cover common use cases ranging from scheduled nightly data warehouse loads to near real-time, event-driven ETL flows for your data lake. We discuss how to build scalable, efficient, and serverless ETL pipelines using AWS Glue. Additionally, Merck will share how they built an end-to-end ETL pipeline for their application release management system, and launched it in production in less than a week using AWS Glue.
ABD318 – Architecting a data lake with Amazon S3, Amazon Kinesis, and Amazon Athena Learn how to architect a data lake where different teams within your organization can publish and consume data in a self-service manner. As organizations aim to become more data-driven, data engineering teams have to build architectures that can cater to the needs of diverse users – from developers, to business analysts, to data scientists. Each of these user groups employs different tools, has different data needs, and accesses data in different ways. In this talk, we will dive deep into assembling a data lake using Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue. The session will feature Mohit Rao, Architect and Integration lead at Atlassian, the maker of products such as JIRA, Confluence, and Stride. First, we will look at a couple of common architectures for building a data lake. Then we will show how Atlassian built a self-service data lake, where any team within the company can publish a dataset to be consumed by a broad set of users.
Companies have valuable data that they may not be analyzing due to the complexity, scalability, and performance issues of loading the data into their data warehouse. However, with the right tools, you can extend your analytics to query data in your data lake—with no loading required. Amazon Redshift Spectrum extends the analytic power of Amazon Redshift beyond data stored in your data warehouse to run SQL queries directly against vast amounts of unstructured data in your Amazon S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for analytics when you need it. Join a discussion with AWS solution architects to ask questions.
ABD330 – Combining Batch and Stream Processing to Get the Best of Both Worlds Today, many architects and developers are looking to build solutions that integrate batch and real-time data processing, and deliver the best of both approaches. Lambda architecture (not to be confused with the AWS Lambda service) is a design pattern that leverages both batch and real-time processing within a single solution to meet the latency, accuracy, and throughput requirements of big data use cases. Come join us for a discussion on how to implement Lambda architecture (batch, speed, and serving layers) and best practices for data processing, loading, and performance tuning.
ABD335 – Real-Time Anomaly Detection Using Amazon Kinesis Amazon Kinesis Analytics offers a built-in machine learning algorithm that you can use to easily detect anomalies in your VPC network traffic and improve security monitoring. Join us for an interactive discussion on how to stream your VPC Flow Logs to Amazon Kinesis Streams and identify anomalies using Kinesis Analytics.
ABD339 – Deep Dive and Best Practices for Amazon Athena Amazon Athena is an interactive query service that enables you to process data directly from Amazon S3 without the need for infrastructure. Since its launch at re:Invent 2016, several organizations have adopted Athena as the central tool to process all their data. In this talk, we dive deep into the most common use cases, including working with other AWS services. We review the best practices for creating tables and partitions and performance optimizations. We also dive into how Athena handles security, authorization, and authentication. Lastly, we hear from a customer who has reduced costs and improved time to market by deploying Athena across their organization.
We look forward to meeting you at re:Invent 2017!
About the Author
Roy Ben-Alta is a solution architect and principal business development manager at Amazon Web Services in New York. He focuses on Data Analytics and ML Technologies, working with AWS customers to build innovative data-driven products.
Laurent Pinchart ran a session at the 2017 Embedded Linux Conference Europe entitled “Bash the kernel maintainers”; the idea was to get feedback from developers on their experience working with the kernel community. A few days later, the Maintainers Summit held a free-flowing discussion on the issues that were brought up in that session. Some changes may result from this discussion, but it also showed how hard it can be to change how kernel subsystem maintainers work.
This time I need it to run in a cluster on AWS, so I went on to set up such a cluster. Googling how to do it gives several interesting results, like this, this, and this, but they are either incomplete, or outdated, or have too many irrelevant details, so they are only of moderate help.
My goal is to use CloudFormation (or Terraform potentially) to launch a stack which has a Cassandra auto-scaling group (in a single region) that can grow as easily as increasing the number of nodes in the group.
Also, in order to have the web application connect to Cassandra without hardcoding the node IPs, I wanted to have a load balancer in front of all Cassandra nodes that does the round-robin for me. The alternative would be client-side round-robin, but that would mean some extra complexity on the client, which seems avoidable with a load balancer in front of the Cassandra auto-scaling group.
Sets up 3 private subnets (1 per availability zone in the eu-west region)
Creates a security group which opens the ports Cassandra needs: 9042 for clients to connect and 7000/7001 for the nodes to gossip with each other (a rough Python sketch of these rules follows below). Note that the ports are only accessible from within the VPC, no external connection is allowed. SSH goes only through a bastion host.
Defines a TCP load balancer for port 9042 where all clients will connect. The load balancer requires a so-called “Target group” which is defined as well.
Configures an auto-scaling group with a pre-configured number of nodes. The auto-scaling group has a reference to the "target group", so that the load balancer always sees all nodes in the auto-scaling group.
Each node in the auto-scaling group is identical based on a launch configuration. The launch configuration runs a few scripts on initialization. These scripts will be run for every node – either initially, or in case a node dies and another one is spawned in its place, or when the cluster has to grow. The scripts are fetched from S3, where you can publish them (and version them) either manually, or with an automated process.
Note: this does not configure specific EBS volumes, and in reality you may need to configure and attach them if the instance storage is insufficient. Don't worry about nodes dying, though, as data is safely replicated.
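The actual resources are declared in the CloudFormation template, but to make the security group rules above concrete, here is a rough, hypothetical Python (boto3) equivalent – the group id and VPC CIDR are placeholders, not values from the real stack:
import boto3
ec2 = boto3.client("ec2", region_name="eu-west-1")
# Placeholders - in the real setup these come from the CloudFormation stack
SECURITY_GROUP_ID = "sg-0123456789abcdef0"
VPC_CIDR = "10.0.0.0/16"
ec2.authorize_security_group_ingress(
    GroupId=SECURITY_GROUP_ID,
    IpPermissions=[
        # CQL client connections (9042), allowed only from within the VPC
        {"IpProtocol": "tcp", "FromPort": 9042, "ToPort": 9042,
         "IpRanges": [{"CidrIp": VPC_CIDR}]},
        # Inter-node gossip and storage traffic (7000 plain, 7001 SSL)
        {"IpProtocol": "tcp", "FromPort": 7000, "ToPort": 7001,
         "IpRanges": [{"CidrIp": VPC_CIDR}]},
    ],
)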
That was the easy part – a bunch of AWS resources and port configurations. The Cassandra-specific setup is a bit harder, as it requires an understanding of how Cassandra functions.
The two scripts are setup-cassandra.sh and update-cassandra-cluster-config.py, so bash and Python. Bash for setting up the machine, and Python for the Cassandra-specific stuff. Instead of the bash script one could use a pre-built AMI (image), e.g. with packer, but since only 2 pieces of software are installed, I thought it was a bit of an overhead to support AMIs.
The bash script can be seen here; it simply installs Java 8 and the latest Cassandra, runs the Python script, starts the Cassandra service, and creates (if needed) a keyspace with a proper replication configuration. A few notes here – the cassandra.yaml.template could be supplied via the CloudFormation script instead of being fetched via bash (and having to pass the bucket name); you could also have it fetched in the Python script itself – it's a matter of preference. Running Cassandra without SSL is generally a bad idea, but the SSL configuration is out of scope of this basic setup. Finally, the script waits for the Cassandra process to come up (using a while/sleep loop) and then creates the keyspace if needed. The keyspace (=database) has to be created with a NetworkTopologyStrategy, and the number of replicas for the particular datacenter (=AWS region) has to be configured. The value is 3, for the 3 availability zones where we'll have nodes. That means there's a copy in each AZ (which Cassandra sees as a "rack").
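For illustration, the keyspace statement the script ends up issuing looks roughly like the following – shown here through the Python cassandra-driver rather than cqlsh, with my_keyspace as a placeholder name:
from cassandra.cluster import Cluster
# Connect to the local node that the setup script has just started
cluster = Cluster(["127.0.0.1"])
session = cluster.connect()
# With Ec2Snitch the datacenter name is derived from the region
# (eu-west-1 becomes "eu-west"); 3 replicas mean one copy per availability zone
session.execute("""
    CREATE KEYSPACE IF NOT EXISTS my_keyspace
    WITH replication = {'class': 'NetworkTopologyStrategy', 'eu-west': 3}
""")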
The Python script does some very important configuration – without it the cluster won't work. (I don't work with Python normally, so feel free to criticize my Python code.) The script does the following (sketched roughly in the example after this list):
Gets the current autoscaling group details (using AWS EC2 APIs)
Sorts the instances by launch time
Fetches the first instance in the group in order to assign it as seed node
Sets the seed node in the configuration file (by replacing a placeholder)
Sets the listen_address (and therefore rpc_address) to the private IP of the node in order to allow Cassandra to listen for incoming connections
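The real script is linked above; purely as an illustration of those steps, a stripped-down boto3 sketch might look like this (the group name is a placeholder):
import boto3
ASG_NAME = "cassandra-asg"  # placeholder - the real name comes from the stack
autoscaling = boto3.client("autoscaling", region_name="eu-west-1")
ec2 = boto3.client("ec2", region_name="eu-west-1")
# Current members of the auto-scaling group
group = autoscaling.describe_auto_scaling_groups(
    AutoScalingGroupNames=[ASG_NAME])["AutoScalingGroups"][0]
instance_ids = [i["InstanceId"] for i in group["Instances"]]
# Look the instances up and sort them by launch time, oldest first
reservations = ec2.describe_instances(InstanceIds=instance_ids)["Reservations"]
instances = [i for r in reservations for i in r["Instances"]]
instances.sort(key=lambda i: i["LaunchTime"])
# The oldest instance is designated as the seed node
seeds = instances[0]["PrivateIpAddress"]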
Designating the seed node is important, as all cluster nodes have to join the cluster by specifying at least one seed. You can get the first two nodes instead of just one, but it shouldn’t matter. Note that the seed node is not always fixed – it’s just the oldest node in the cluster. If at some point the oldest node is terminated, each new node will use the second oldest as seed.
What I haven’t shown is the cassandra.yaml.template file. It is basically a copy of the cassandra.yaml file from a standard Cassandra installation, with a few changes:
cluster_name is modified to match your application name. This is just for human-readable purposes; it doesn't matter what you set it to.
allocate_tokens_for_keyspace: your_keyspace is uncommented and the keyspace is set to match your main keyspace. This enables the new token distribution algorithm in Cassandra 3.0. It allows for evenly distributing the data across nodes.
endpoint_snitch: Ec2Snitch is set instead of the SimpleSnitch to make use of AWS metadata APIs. Note that this setup is in a single region. For multi-region there's another snitch and some additional complications of exposing ports and changing the broadcast address.
As mentioned above, the ${private_ip} and ${seeds} placeholders are placed in the appropriate places (listen_address and rpc_address for the IP) in order to allow substitution; a minimal substitution sketch follows.
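Since the placeholders use the ${name} syntax, the substitution itself can be done with Python's built-in string.Template; a minimal sketch, with made-up values and an assumed install path:
from string import Template
# Made-up values - the real script gets these from the EC2 metadata service
# and from the auto-scaling group lookup shown earlier
private_ip = "10.0.1.23"
seeds = "10.0.1.10"
with open("cassandra.yaml.template") as f:
    template = Template(f.read())
# safe_substitute leaves any other '$' in the YAML untouched
config = template.safe_substitute(private_ip=private_ip, seeds=seeds)
with open("/etc/cassandra/cassandra.yaml", "w") as f:  # assumed package path
    f.write(config)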
This lets you run a Cassandra cluster as part of your AWS stack, which is auto-scalable and doesn't require any manual intervention – neither on setup, nor on scaling up. Well, allegedly – there may be issues that have to be resolved once you hit real-world use cases. And for clients to connect to the cluster, simply use the load balancer DNS name (you can write it to a config file on each application node).
Twitter has raised the limit to 280 characters for a select number of people. However, they left open a hole allowing anybody to make long tweets with a little bit of hacking. The skills needed are basic, so I thought I'd write them up in a blog post.
Specifically, the skills you will exercise are:
basic command-line shell
basic HTTP requests
basic browser DOM editing
The short instructions
The basic instructions were found in tweets like the following:
Click 'Tweet' in the web ui
F12
Remove 'disable' on the tweet button
Click it, and go to 'network', right click on the request and copy as cURL
Then, add &weighted_character_count=true as a param to the end of the url
Then, resubmit the tweet with curl.
Enjoy your 280 characters.
These instructions are clear to the average hacker, but of course, a bit difficult for those learning hacking, hence this post.
The command-line
The basics of most hacking start with knowledge of the command-line. This is the “Terminal” app under macOS or cmd.exe under Windows. Almost always when you see hacking dramatized in the movies, they are using the command-line.
In the beginning, the command-line was all computers had. To do anything on a computer, you had to type a "command" telling it what to do. What we see as the modern graphical screen is a layer on top of the command-line, one that translates clicks of the mouse into the raw commands.
On most systems, the command-line is known as “bash”. This is what you’ll find on Linux and macOS. Windows historically has had a different command-line that uses slightly different syntax, though in the last couple years, they’ve also supported “bash”. You’ll have to install it first, such as by following these instructions.
You'll see me use commands that may not yet be installed on your "bash" command-line, like nc and curl. You'll need to run a command to install them, such as:
sudo apt-get install netcat curl
The thing to remember about the command-line is that the mouse doesn’t work. You can’t click to move the cursor as you normally do in applications. That’s because the command-line predates the mouse by decades. Instead, you have to use arrow keys.
I'm not going to spend much effort discussing the command-line, as a complete explanation is beyond the scope of this document. Instead, I'm assuming the reader either already knows it, or will learn from example as we go along.
Web requests
The basics of how the web works are really simple. A request to a web server is just a small packet of text, such as the following, which does a search on Google for the search-term “penguin” (presumably, you are interested in knowing more about penguins):
GET /search?q=penguin HTTP/1.0
Host: www.google.com
User-Agent: human
The command we are sending to the server is GET, meaning get a page. We are accessing the URL /search, which, on Google's website, is how you do a search. We are then sending the parameter q with the value penguin. We also declare that we are using version 1.0 of HTTP (the hyper-text transfer protocol).
Following the first line there are a number of additional headers. In one header, we declare the Host name that we are accessing. Web servers can contain many different websites, with different names, so this header is usually important.
We also add the User-Agent header. The “user-agent” means the “browser” that you use, like Edge, Chrome, Firefox, or Safari. It allows servers to send content optimized for different browsers. Since we are sending web requests without a browser here, we are joking around saying human.
Here's what happens when we use the nc program to send this to a Google web server:
The first part is us typing, until we hit the [enter] key to create a blank line. After that point is the response from the Google server. We get back a result code (OK), followed by more headers from the server, and finally the contents of the webpage, which goes on from many screens. (We’ll talk about what web pages look like below).
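If you don't have nc handy, the same raw request can be sent from a few lines of Python using only the standard library – a minimal sketch:
import socket
# Open a TCP connection to the web server on port 80
sock = socket.create_connection(("www.google.com", 80))
# The same raw request as above; HTTP lines end with \r\n and a blank
# line marks the end of the headers
request = (
    "GET /search?q=penguin HTTP/1.0\r\n"
    "Host: www.google.com\r\n"
    "User-Agent: human\r\n"
    "\r\n"
)
sock.sendall(request.encode())
# Read the response until the server closes the connection
response = b""
while True:
    chunk = sock.recv(4096)
    if not chunk:
        break
    response += chunk
sock.close()
# Print the status line and headers (the HTML follows after a blank line)
print(response.decode(errors="replace").split("\r\n\r\n")[0])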
Note that a lot of HTTP headers are optional and really have little influence on what's going on. They are just junk added to web requests. For example, we see Google report a P3P header, which is some relic of 2002 that nobody uses anymore, as far as I can tell. Indeed, if you follow the URL in the P3P header, Google pretty much says exactly that.
I point this out because the request I show above is a simplified one. In practice, most requests contain a lot more headers, especially Cookie headers. We’ll see that later when making requests.
Using cURL instead
Sending the raw HTTP request to the server, and getting raw HTTP/HTML back, is annoying. The better way of doing this is with the tool known as cURL, or plainly, just curl. You may be familiar with the older command-line tool wget. cURL is similar, but more flexible.
To use curl for the experiment above, we’d do something like the following. We are saving the web page to “penguin.html” instead of just spewing it on the screen.
Underneath, cURL builds an HTTP request just like the one we showed above, sends it to the server, and gets the response back.
Web-pages
Now let’s talk about web pages. When you look at the web page we got back from Google while searching for “penguin”, you’ll see that it’s intimidatingly complex. I mean, it intimidates me. But it all starts from some basic principles, so we’ll look at some simpler examples.
The following is text of a simple web page:
<html>
<body>
<h1>Test</h1>
<p>This is a simple web page</p>
</body>
</html>
This is HTML, "hyper-text markup language". As its name implies, we "mark up" text, such as declaring the first text as a level-1 header (H1), and the following text as a paragraph (P).
In a web browser, this gets rendered as something that looks like the following. Notice how a header is formatted differently from a paragraph. Also notice that web browsers can use local files as well as make remote requests to web servers:
You can right-mouse click on the page and do a “View Source”. This will show the raw source behind the web page:
Web pages don’t just contain marked-up text. They contain two other important features, style information that dictates how things appear, and script that does all the live things that web pages do, from which we build web apps.
So let’s add a little bit of style and scripting to our web page. First, let’s view the source we’ll be adding:
In our header (H1) element, we've added an attribute to the markup giving it an id of mytitle. In the style section above, we give that element a color of blue and tell it to align to the center.
Then, in our script section, we’ve told it that when somebody clicks on the element “mytitle”, it should send an “alert” message of “hello”.
This is what our web page now looks like, with the center blue title:
When we click on the title, we get a popup alert:
Thus, we see an example of the three components of a webpage: markup, style, and scripting.
Chrome developer tools
Now we go off the deep end. Right-mouse click on “Test” (not normal click, but right-button click, to pull up a menu). Select “Inspect”.
You should now get a window that looks something like the following. Chrome splits the screen in half, showing the web page on the left and its debug tools on the right.
This looks similar to what “View Source” shows, but it isn’t. Instead, it’s showing how Chrome interpreted the source HTML. For example, our style/script tags should’ve been marked up with a head (header) tag. We forgot it, but Chrome adds it in anyway.
What Chrome is showing us here is called the DOM, or document object model. It shows us all the objects that make up a web page, and how they fit together.
For example, it shows us how the style information for #mytitle is created. It first starts with the default style information for an h1 tag, and then how we’ve changed it with our style specifications.
We can edit the DOM manually. Just double-click on things you want to change. For example, in this screenshot, I've changed the style spec from blue to red, and I've changed the header and paragraph text. The original file on disk hasn't changed, but I've changed the DOM in memory.
This is a classic hacking technique. If you don’t like things like paywalls, for example, just right-click on the element blocking your view of the text, “Inspect” it, then delete it. (This works for some paywalls).
This edits the markup and style info, but changing the scripting stuff is a bit more complicated. To do that, click on the [Console] tab. This is the scripting console, and allows you to run code directly as part of the webpage. We are going to run code that resets what happens when we click on the title. In this case, we are simply going to change the message to “goodbye”.
Now when we click on the title, we indeed get the message:
Again, a common way to get around paywalls is to run some code like that to change which functions get called.
Putting it all together
Now let’s put this all together in order to hack Twitter to allow us (the non-chosen) to tweet 280 characters. Review Dildog’s instructions above.
The first step is to get to Chrome Developer Tools. Dildog suggests F12. I suggest right-clicking on the Tweet button (or Reply button, as I use in my example) and doing “Inspect”, as I describe above.
You’ll now see your screen split in half, with the DOM toward the right, similar to how I describe above. However, Twitter’s app is really complex. Well, not really complex, it’s all basic stuff when you come right down to it. It’s just so much stuff — it’s a large web app with lots of parts. So we have to dive in without understanding everything that’s going on.
The Tweet/Reply button we are inspecting is going to look like this in the DOM:
The Tweet/Reply button is currently greyed out because it has the "disabled" attribute. You need to double-click on it and remove that attribute. The class attribute also contains a "disabled" part; double-click on that and remove just the word disabled, without disturbing the stuff around it. This should change the button from disabled to enabled. It won't be greyed out, and it'll respond when you click on it.
Now click on it. You’ll get an error message, as shown below:
What we’ve done here is bypass what’s known as client-side validation. The script in the web page prevented sending Tweets longer than 140 characters. Our editing of the DOM changed that, allowing us to send a bad request to the server. Bypassing client-side validation this way is the source of a lot of hacking.
But Twitter still does server-side validation as well. They know any client-side validation can be bypassed, and are in on the joke. They tell us hackers “You’ll have to be more clever”. So let’s be more clever.
In order to make longer 280-character tweets work for select customers, they had to change something on the server side. The thing they added was a "weighted_character_count=true" parameter in the HTTP request. We just need to repeat the request we generated above, adding this parameter.
In theory, we can do this by fiddling with the scripting. The way Dildog describes is different: he copies the request out of the browser, edits it, then sends it via the command-line using curl.
We’ve used the [Elements] and [Console] tabs in Chrome’s DevTools. Now we are going to use the [Network] tab. This lists all the requests the web page has made to the server. The twitter app is constantly making requests to refresh the content of the web page. The request we made trying to do a long tweet is called “create”, and is red, because it failed.
Google Chrome gives us a number of ways to duplicate the request. The most useful is that it copies it as a full cURL command we can just paste onto the command-line. We don’t even need to know cURL, it takes care of everything for us. On Windows, since you have two command-lines, it gives you a choice to use the older Windows cmd.exe, or the newer bash.exe. I use the bash version, since I don’t know where to get the Windows command-line version of cURL.exe.
There's a lot going on here. The first thing to notice is the long xxxxxx strings. Those aren't actually in the original screenshot; I edited the picture, because these are session cookies. If you inserted them into your browser, you'd hijack my Twitter session and be able to tweet as me (such as making Carlos Danger style tweets). Therefore, I have to remove them from the example.
At the top of the screen is the URL that we are accessing, which is https://twitter.com/i/tweet/create. Much of the rest of the screen uses the cURL -H option to add a header. These are all the HTTP headers that I describe above. Finally, at the bottom is the --data section, which contains the data bits related to the tweet, especially the tweet itself.
We need to either edit the URL above to read https://twitter.com/i/tweet/create?weighted_character_count=true, or add &weighted_character_count=true to the --data section at the bottom (either works). Remember: the mouse doesn't work on the command-line, so you have to use the cursor keys to navigate backwards in the line. Also, since the line is larger than the screen, it wraps onto several visual lines, even though it's all a single line as far as the command-line is concerned.
Now just hit [return] on your keyboard, and the tweet will be sent to the server, which at the moment, works. Presto!
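The post does this by editing the copied cURL command; purely as an alternative sketch, the same replay could be done with Python's requests library. Every value below is a placeholder – the real cookies, headers, and form fields have to be copied out of your own "Copy as cURL" capture:
import requests
# Placeholders - paste the cookie, header, and form values from your own capture
cookies = {"auth_token": "xxxxxx"}          # your Twitter session cookies
headers = {"user-agent": "Mozilla/5.0"}     # the headers DevTools copied
data = {"status": "A tweet that is longer than 140 characters..."}  # form fields
resp = requests.post(
    "https://twitter.com/i/tweet/create",
    params={"weighted_character_count": "true"},  # the extra parameter
    cookies=cookies,
    headers=headers,
    data=data,
)
print(resp.status_code)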
Twitter will either enable or disable the feature for everyone in a few weeks, at which point, this post won’t work. But the reason I’m writing this is to demonstrate the basic hacking skills. We manipulate the web pages we receive from servers, and we manipulate what’s sent back from our browser back to the server.
Easier: hack the scripting
Instead of messing with the DOM and editing the HTTP request, the better solution would be to change the scripting that does both DOM client-side validation and HTTP request generation. The only reason Dildog above didn’t do that is that it’s a lot more work trying to find where all this happens.
Others have, though. @Zemnmez did just that, though his technique works for the alternate TweetDeck client (https://tweetdeck.twitter.com) instead of the default client. Go copy his code from here, then paste it into the DevTools scripting [Console]. It'll go in and replace some scripting functions, much like my simpler example above.
The console will show a stream of error messages because TweetDeck has bugs; ignore those.
Now you can effortlessly do long tweets as normal, without all the messing around I’ve spent so much text in this blog post describing.
Now, as I've mentioned before, you are only editing what's going on in the current web page. If you refresh this page, or close it, everything will be lost. You'll have to re-open the DevTools scripting console and re-paste the code. The easier way of doing this is to use the [Sources] tab instead of [Console] and use the "Snippets" feature to save this bit of code in your browser, to make it easier next time.
The even easier way is to use browser extensions like Tampermonkey and Greasemonkey that'll take care of this for you. They'll save the script and automatically run it when they see you open the TweetDeck webpage again.
An even easier way is to use one of the several Chrome extensions written in the past day specifically designed to bypass the 140 character limit. Since the purpose of this blog post is to show you how to tamper with your browser yourself, rather than help you with Twitter, I won’t list them.
Conclusion
Tampering with the web-page the server gives you, and the data you send back, is a basic hacker skill. In truth, there is a lot to this. You have to get comfortable with the command-line, using tools like cURL. You have to learn how HTTP requests work. You have to understand how web pages are built from markup, style, and scripting. You have to be comfortable using Chrome’s DevTools for messing around with web page elements, network requests, scripting console, and scripting sources.
So it’s rather a lot, actually.
My hope with this page is to show you a practical application of all this, without getting too bogged down in fully explaining how every bit works.
Security updates have been issued by Debian (freerdp, mbedtls, tiff, and tiff3), Fedora (chromium, krb5, libstaroffice, mbedtls, mingw-libidn2, mingw-openjpeg2, openjpeg2, and rubygems), Mageia (bzr, libarchive, libgcrypt, and tcpdump), openSUSE (gdk-pixbuf, libidn2, mpg123, postgresql94, postgresql96, and xen), Slackware (bash, mariadb, and tcpdump), and SUSE (evince and kernel).
Security updates have been issued by Mageia (atril, mpg123, perl-SOAP-Lite, and virtualbox), openSUSE (kernel and libzypp, zypper), Oracle (authconfig, bash, curl, gdm and gnome-session, ghostscript, git, glibc, gnutls, gtk-vnc, kernel, libreoffice, libtasn1, mariadb, openldap, openssh, pidgin, postgresql, python, qemu-kvm, samba, tcpdump, tigervnc and fltk, and tomcat), Red Hat (kernel, kernel-rt, openstack-neutron, and qemu-kvm), and SUSE (puppet and tcmu-runner).
Security updates have been issued by Debian (varnish), Fedora (gcc, gcc-python-plugin, libtool, mingw-c-ares, and php-PHPMailer), Red Hat (bash, curl, evince, freeradius, gdm and gnome-session, ghostscript, git, glibc, golang, GStreamer, gtk-vnc, kernel, kernel-rt, libtasn1, mariadb, openldap, openssh, pidgin, postgresql, python, qemu-kvm, qemu-kvm-rhev, samba, tigervnc and fltk, tomcat, and X.org X11 libraries), Slackware (gnupg), and Ubuntu (apache2, lxc, and webkit2gtk).