Israel is using emergency surveillance powers to track people who may have COVID-19, joining China and Iran in using mass surveillance in this way. I believe pressure will increase to leverage existing corporate surveillance infrastructure for these purposes in the US and other countries. With that in mind, the EFF has some good thinking on how to balance public safety with civil liberties:
Thus, any data collection and digital monitoring of potential carriers of COVID-19 should take into consideration and commit to these principles:
Privacy intrusions must be necessary and proportionate. A program that collects, en masse, identifiable information about people must be scientifically justified and deemed necessary by public health experts for the purpose of containment. And that data processing must be proportionate to the need. For example, maintenance of 10 years of travel history of all people would not be proportionate to the need to contain a disease like COVID-19, which has a two-week incubation period.
Data collection based on science, not bias. Given the global scope of communicable diseases, there is historical precedent for improper government containment efforts driven by bias based on nationality, ethnicity, religion, and race — rather than facts about a particular individual’s actual likelihood of contracting the virus, such as their travel history or contact with potentially infected people. Today, we must ensure that any automated data systems used to contain COVID-19 do not erroneously identify members of specific demographic groups as particularly susceptible to infection.
Expiration. As in other major emergencies in the past, there is a hazard that the data surveillance infrastructure we build to contain COVID-19 may long outlive the crisis it was intended to address. The government and its corporate cooperators must roll back any invasive programs created in the name of public health after the crisis has been contained.
Transparency. Any government use of “big data” to track virus spread must be clearly and quickly explained to the public. This includes publication of detailed information about the information being gathered, the retention period for the information, the tools used to process that information, the ways these tools guide public health decisions, and whether these tools have had any positive or negative outcomes.
Due Process. If the government seeks to limit a person’s rights based on this “big data” surveillance (for example, to quarantine them based on the system’s conclusions about their relationships or travel), then the person must have the opportunity to timely and fairly challenge these conclusions and limits.
EFF has published a comprehensible and very readable “deep dive” into the technologies of corporate surveillance, both on the Internet and off. Well worth reading and sharing.
There’s a serious debate on reforming Section 230 of the Communications Decency Act. I am in the process of figuring out what I believe, and this is more a place to put resources and listen to people’s comments.
The EFF has written extensively on why Section 230 is so important, and why dismantling it would be catastrophic for the Internet. Danielle Citron disagrees. (There’s also this law journal article by Citron and Ben Wittes.) Sarah Jeong’s op-ed. Another op-ed. Another paper.
Reading all of this, I am reminded of this decade-old quote by Dan Geer. He’s addressing Internet service providers:
Hello, Uncle Sam here.
You can charge whatever you like based on the contents of what you are carrying, but you are responsible for that content if it is illegal; inspecting brings with it a responsibility for what you learn.
-or-
You can enjoy common carrier protections at all times, but you can neither inspect nor act on the contents of what you are carrying and can only charge for carriage itself. Bits are bits.
Choose wisely. No refunds or exchanges at this window.
We can revise this choice for the social-media age:
Hi Facebook/Twitter/YouTube/everyone else:
You can build a communications service based on inspecting user content and presenting it as you want, but that business model also conveys responsibility for that content.
-or-
You can be a communications service and enjoy the protections of CDA 230, in which case you cannot inspect or control the content you deliver.
Facebook would be an example of the former. WhatsApp would be an example of the latter.
I am honestly undecided about all of this. I want CDA230 to protect things like the commenting section of this blog. But I don’t think it should protect dating apps when they are used as a conduit for abuse. And I really don’t want society to pay the cost for all the externalities inherent in Facebook’s business model.
Earlier this month we launched the C5 Instances with Local NVMe Storage and I told you that we would be doing the same for additional instance types in the near future!
Today we are introducing M5 instances equipped with local NVMe storage. Available for immediate use in 5 regions, these instances are a great fit for workloads that require a balance of compute and memory resources. Here are the specs:
Instance Name | vCPUs | RAM | Local Storage | EBS-Optimized Bandwidth | Network Bandwidth
m5d.large | 2 | 8 GiB | 1 x 75 GB NVMe SSD | Up to 2.120 Gbps | Up to 10 Gbps
m5d.xlarge | 4 | 16 GiB | 1 x 150 GB NVMe SSD | Up to 2.120 Gbps | Up to 10 Gbps
m5d.2xlarge | 8 | 32 GiB | 1 x 300 GB NVMe SSD | Up to 2.120 Gbps | Up to 10 Gbps
m5d.4xlarge | 16 | 64 GiB | 1 x 600 GB NVMe SSD | 2.210 Gbps | Up to 10 Gbps
m5d.12xlarge | 48 | 192 GiB | 2 x 900 GB NVMe SSD | 5.0 Gbps | 10 Gbps
m5d.24xlarge | 96 | 384 GiB | 4 x 900 GB NVMe SSD | 10.0 Gbps | 25 Gbps
The M5d instances are powered by Custom Intel® Xeon® Platinum 8175M series processors running at 2.5 GHz, including support for AVX-512.
You can use any AMI that includes drivers for the Elastic Network Adapter (ENA) and NVMe; this includes the latest Amazon Linux, Microsoft Windows (Server 2008 R2, Server 2012, Server 2012 R2 and Server 2016), Ubuntu, RHEL, SUSE, and CentOS AMIs.
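Here is a minimal sketch, not from the original post, of launching one of these instances with boto3. The AMI ID is a placeholder; substitute any ENA- and NVMe-capable AMI from the list above:
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
response = ec2.run_instances(
    ImageId="ami-0123456789abcdef0",  # placeholder: any AMI with ENA and NVMe drivers
    InstanceType="m5d.large",
    MinCount=1,
    MaxCount=1,
)
print(response["Instances"][0]["InstanceId"])
Once the instance boots, the local NVMe volume should appear automatically, as described in the Naming note below; no block device mapping is required.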
Here are a couple of things to keep in mind about the local NVMe storage on the M5d instances:
Naming – You don’t have to specify a block device mapping in your AMI or during the instance launch; the local storage will show up as one or more devices (/dev/nvme*1 on Linux) after the guest operating system has booted.
Encryption – Each local NVMe device is hardware encrypted using the XTS-AES-256 block cipher and a unique key. Each key is destroyed when the instance is stopped or terminated.
Lifetime – Local NVMe devices have the same lifetime as the instance they are attached to, and do not stick around after the instance has been stopped or terminated.
Available Now
M5d instances are available in On-Demand, Reserved Instance, and Spot form in the US East (N. Virginia), US West (Oregon), EU (Ireland), US East (Ohio), and Canada (Central) Regions. Prices vary by Region, and are just a bit higher than for the equivalent M5 instances.
Join us this month to learn about AWS services and solutions. New this month, we have a fireside chat with the GM of Amazon WorkSpaces and our 2nd episode of the “How to re:Invent” series. We’ll also cover best practices, deep dives, use cases and more! Join us and register today!
AWS re:Invent
June 13, 2018 | 05:00 PM – 05:30 PM PT – Episode 2: AWS re:Invent Breakout Content Secret Sauce – Hear from one of our own AWS content experts as we dive deep into the re:Invent content strategy and how we maintain a high bar.
Containers
June 25, 2018 | 09:00 AM – 09:45 AM PT – Running Kubernetes on AWS – Learn about the basics of running Kubernetes on AWS, including how to set up masters, networking, and security, and how to add auto-scaling to your cluster.
June 19, 2018 | 11:00 AM – 11:45 AM PT – Launch AWS Faster using Automated Landing Zones – Learn how the AWS Landing Zone can automate the set up of best practice baselines when setting up new AWS environments.
June 21, 2018 | 01:00 PM – 01:45 PM PT – Enabling New Retail Customer Experiences with Big Data – Learn how AWS can help retailers realize actual value from their big data and deliver on differentiated retail customer experiences.
June 28, 2018 | 01:00 PM – 01:45 PM PT – Fireside Chat: End User Collaboration on AWS – Learn how End User Compute services can help you deliver access to desktops and applications anywhere, anytime, using any device.
IoT
June 27, 2018 | 11:00 AM – 11:45 AM PT – AWS IoT in the Connected Home – Learn how to use AWS IoT to build innovative Connected Home products.
Mobile
June 25, 2018 | 11:00 AM – 11:45 AM PT – Drive User Engagement with Amazon Pinpoint – Learn how Amazon Pinpoint simplifies and streamlines effective user engagement.
June 26, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive: Hybrid Cloud Storage with AWS Storage Gateway – Learn how you can reduce your on-premises infrastructure by using the AWS Storage Gateway to connect your applications to the scalable and reliable AWS storage services.
June 27, 2018 | 01:00 PM – 01:45 PM PT – Changing the Game: Extending Compute Capabilities to the Edge – Discover how to change the game for IIoT and edge analytics applications with AWS Snowball Edge plus enhanced Compute instances.
June 28, 2018 | 11:00 AM – 11:45 AM PT – Big Data and Analytics Workloads on Amazon EFS – Get best practices and deployment advice for running big data and analytics workloads on Amazon EFS.
Last week, researchers disclosed vulnerabilities in a large number of encrypted e-mail clients: specifically, those that use OpenPGP and S/MIME, including Thunderbird and Apple Mail. These are serious vulnerabilities: An attacker who can alter mail sent to a vulnerable client can trick that client into sending a copy of the plaintext to a web server controlled by that attacker. The story of these vulnerabilities and the tale of how they were disclosed illustrate some important lessons about security vulnerabilities in general and e-mail security in particular.
But first, if you use PGP or S/MIME to encrypt e-mail, you need to check the list on this page and see if you are vulnerable. If you are, check with the vendor to see if they’ve fixed the vulnerability. (Note that some early patches turned out not to fix the vulnerability.) If not, stop using the encrypted e-mail program entirely until it’s fixed. Or, if you know how to do it, turn off your e-mail client’s ability to process HTML e-mail or — even better — stop decrypting e-mails from within the client. There’s even more complex advice for more sophisticated users, but if you’re one of those, you don’t need me to explain this to you.
Consider your encrypted e-mail insecure until this is fixed.
All software contains security vulnerabilities, and one of the primary ways we all improve our security is by researchers discovering those vulnerabilities and vendors patching them. It’s a weird system: Corporate researchers are motivated by publicity, academic researchers by publication credentials, and just about everyone by individual fame and the small bug-bounties paid by some vendors.
Software vendors, on the other hand, are motivated to fix vulnerabilities by the threat of public disclosure. Without the threat of eventual publication, vendors are likely to ignore researchers and delay patching. This happened a lot in the 1990s, and even today, vendors often use legal tactics to try to block publication. It makes sense; they look bad when their products are pronounced insecure.
Over the past few years, researchers have started to choreograph vulnerability announcements to make a big press splash. Clever names — the e-mail vulnerability is called “Efail” — websites, and cute logos are now common. Key reporters are given advance information about the vulnerabilities. Sometimes advance teasers are released. Vendors are now part of this process, trying to announce their patches at the same time the vulnerabilities are announced.
This simultaneous announcement is best for security. While it’s always possible that some organization — either government or criminal — has independently discovered and is using the vulnerability before the researchers go public, use of the vulnerability is essentially guaranteed after the announcement. The time period between announcement and patching is the most dangerous, and everyone except would-be attackers wants to minimize it.
Things get much more complicated when multiple vendors are involved. In this case, Efail isn’t a vulnerability in a particular product; it’s a vulnerability in a standard that is used in dozens of different products. As such, the researchers had to ensure both that everyone knew about the vulnerability in time to fix it and that no one leaked the vulnerability to the public during that time. As you can imagine, that’s close to impossible.
Efail was discovered sometime last year, and the researchers alerted dozens of different companies between last October and March. Some companies took the news more seriously than others. Most patched. Amazingly, news about the vulnerability didn’t leak until the day before the scheduled announcement date. Two days before the scheduled release, the researchers unveiled a teaser — honestly, a really bad idea — which resulted in details leaking.
After the leak, the Electronic Frontier Foundation posted a notice about the vulnerability without details. The organization has been criticized for its announcement, but I am hard-pressed to find fault with its advice. (Note: I am a board member at EFF.) Then, the researchers published — and lots of press followed.
All of this speaks to the difficulty of coordinating vulnerability disclosure when it involves a large number of companies or — even more problematic — communities without clear ownership. And that’s what we have with OpenPGP. It’s even worse when the bug involves the interaction between different parts of a system. In this case, there’s nothing wrong with PGP or S/MIME in and of themselves. Rather, the vulnerability occurs because of the way many e-mail programs handle encrypted e-mail. GnuPG, an implementation of OpenPGP, decided that the bug wasn’t its fault and did nothing about it. This is arguably true, but irrelevant. They should fix it.
Expect more of these kinds of problems in the future. The Internet is shifting from a set of systems we deliberately use — our phones and computers — to a fully immersive Internet-of-things world that we live in 24/7. And like this e-mail vulnerability, vulnerabilities will emerge through the interactions of different systems. Sometimes it will be obvious who should fix the problem. Sometimes it won’t be. Sometimes it’ll be two secure systems that, when they interact in a particular way, cause an insecurity. In April, I wrote about a vulnerability that arose because Google and Netflix make different assumptions about e-mail addresses. I don’t even know who to blame for that one.
It gets even worse. Our system of disclosure and patching assumes that vendors have the expertise and ability to patch their systems, but that simply isn’t true for many of the embedded and low-cost Internet of things software packages. They’re designed at a much lower cost, often by offshore teams that come together, create the software, and then disband; as a result, there simply isn’t anyone left around to receive vulnerability alerts from researchers and write patches. Even worse, many of these devices aren’t patchable at all. Right now, if you own a digital video recorder that’s vulnerable to being recruited for a botnet — remember Mirai from 2016? — the only way to patch it is to throw it away and buy a new one.
Patching is starting to fail, which means that we’re losing the best mechanism we have for improving software security at exactly the same time that software is gaining autonomy and physical agency. Many researchers and organizations, including myself, have proposed government regulations enforcing minimal security standards for Internet-of-things devices, including standards around vulnerability disclosure and patching. This would be expensive, but it’s hard to see any other viable alternative.
Getting back to e-mail, the truth is that it’s incredibly difficult to secure well. Not because the cryptography is hard, but because we expect e-mail to do so many things. We use it for correspondence, for conversations, for scheduling, and for record-keeping. I regularly search my 20-year e-mail archive. The PGP and S/MIME security protocols are outdated, needlessly complicated, and have been difficult to use properly the whole time. If we could start again, we would design something better and more user-friendly, but the huge number of legacy applications that use the existing standards means that we can’t. I tell people that if they want to communicate securely with someone, they should use one of the secure messaging systems: Signal, Off-the-Record, or — if having one of those two on your system is itself suspicious — WhatsApp. Of course they’re not perfect, as last week’s announcement of a vulnerability (patched within hours) in Signal illustrates. And they’re not as flexible as e-mail, but that makes them easier to secure.
One of the most common enquiries I receive at Pi Towers is “How can I get my hands on a Raspberry Pi Oracle Weather Station?” Now the answer is: “Why not build your own version using our guide?”
Tadaaaa! The BYO weather station fully assembled.
Our Oracle Weather Station
In 2016 we sent out nearly 1000 Raspberry Pi Oracle Weather Station kits to schools around the world that had applied to be part of our weather station programme. The original kit included a special HAT that allows the Pi to collect weather data with a set of sensors.
The original Raspberry Pi Oracle Weather Station HAT
We designed the HAT to enable students to create their own weather stations and mount them at their schools. As part of the programme, we also provide an ever-growing range of supporting resources. We’ve seen Oracle Weather Stations in great locations with huge differences in climate, and they’ve even recorded the effects of a solar eclipse.
Our new BYO weather station guide
We only had a single batch of HATs made, and unfortunately we’ve given nearly* all the Weather Station kits away. Not only are the kits really popular, we also receive lots of questions about how to add extra sensors or how to take more precise measurements of a particular weather phenomenon. So today, to satisfy your demand for a hackable weather station, we’re launching our Build your own weather station guide!
Fun with meteorological experiments!
Our guide suggests the use of many of the sensors from the Oracle Weather Station kit, so you can build a station that’s as close as possible to the original. As you know, the Raspberry Pi is incredibly versatile, and we’ve made it easy to hack the design in case you want to use different sensors.
Many other tutorials for Pi-powered weather stations don’t explain how the various sensors work or how to store your data. Ours goes into more detail. It shows you how to put together a breadboard prototype, it describes how to write Python code to take readings in different ways, and it guides you through recording these readings in a database.
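As a taste of the pattern the guide teaches, here is a minimal sketch: take a reading and record it with a timestamp in a local SQLite database. The read_temperature() function is a hypothetical placeholder for your actual sensor code, and the guide itself may well use a different database:
import sqlite3
from datetime import datetime

def read_temperature():
    # Hypothetical placeholder: replace with real sensor code (e.g. a BME280 or DS18B20 reading).
    return 21.5

conn = sqlite3.connect("weather.db")
conn.execute("CREATE TABLE IF NOT EXISTS readings (taken_at TEXT, temperature REAL)")
conn.execute("INSERT INTO readings VALUES (?, ?)",
             (datetime.utcnow().isoformat(), read_temperature()))
conn.commit()
conn.close()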
There’s also a section on how to make your station weatherproof. And in case you want to move past the breadboard stage, we also help you with that. The guide shows you how to solder together all the components, similar to the original Oracle Weather Station HAT.
Who should try this build
We think this is a great project to tackle at home, at a STEM club, Scout group, or CoderDojo, and we’re sure that many of you will be chomping at the bit to get started. Before you do, please note that we’ve designed the build to be as straightforward as possible, but it’s still fairly advanced both in terms of electronics and programming. You should read through the whole guide before purchasing any components.
The sensors and components we’re suggesting balance cost, accuracy, and ease of use. Depending on what you want to use your station for, you may wish to use different components. Similarly, the final soldered design in the guide may not be the most elegant, but we think it is achievable for someone with modest soldering experience and basic equipment.
With our guide, you can build a functioning weather station without soldering, but the build will be more durable if you do solder it. If you’ve never tried soldering before, that’s OK: we have a Getting started with soldering resource plus video tutorial that will walk you through how it works step by step.
For those of you who are more experienced makers, there are plenty of different ways to put the final build together. We always like to hear about alternative builds, so please post your designs in the Weather Station forum.
Our plans for the guide
Our next step is publishing supplementary guides for adding extra functionality to your weather station. We’d love to hear which enhancements you would most like to see! Our current ideas under development include adding a webcam, making a tweeting weather station, adding a light/UV meter, and incorporating a lightning sensor. Let us know which of these is your favourite, or suggest your own amazing ideas in the comments!
*We do have a very small number of kits reserved for interesting projects or locations: a particularly cool experiment, a novel idea for how the Oracle Weather Station could be used, or places with specific weather phenomena. If you have such a project in mind, please send a brief outline to [email protected], and we’ll consider how we might be able to help you.
The German charity Save Nemo works to protect coral reefs, and they are developing Nemo-Pi, an underwater “weather station” that monitors ocean conditions. Right now, you can vote for Save Nemo in the Google.org Impact Challenge.
Save Nemo
The organisation says there are two major threats to coral reefs: divers and climate change. To make diving safer for reefs, Save Nemo installs buoy anchor points where diving tour boats can anchor without damaging corals in the process.
In addition, they provide dos and don’ts for how to behave on a reef dive.
The Nemo-Pi
To monitor the effects of climate change, and to help divers decide whether conditions are right at a reef while they’re still on shore, Save Nemo is also in the process of perfecting Nemo-Pi.
This Raspberry Pi-powered device is made up of a buoy, a solar panel, a GPS device, a Pi, and an array of sensors. Nemo-Pi measures water conditions such as current, visibility, temperature, carbon dioxide and nitrogen oxide concentrations, and pH. It also uploads its readings live to a public webserver.
The Save Nemo team is currently doing long-term tests of Nemo-Pi off the coast of Thailand and Indonesia. They are also working on improving the device’s power consumption and durability, and testing prototypes with the Raspberry Pi Zero W.
The web dashboard showing live Nemo-Pi data
Long-term goals
Save Nemo aims to install a network of Nemo-Pis at shallow reefs (up to 60 metres deep) in South East Asia. Then diving tour companies can check the live data online and decide day-to-day whether tours are feasible. This will lower the impact of humans on reefs and help the local flora and fauna survive.
A healthy coral reef
Nemo-Pi data may also be useful for groups lobbying for reef conservation, and for scientists and activists who want to shine a spotlight on the awful effects of climate change on sea life, such as coral bleaching caused by rising water temperatures.
A bleached coral reef
Vote now for Save Nemo
If you want to help Save Nemo in their mission today, vote for them to win the Google.org Impact Challenge:
Click “Abstimmen” in the footer of the page to vote
Click “JA” in the footer to confirm
Voting is open until 6 June. You can also follow Save Nemo on Facebook or Twitter. We think this organisation is doing valuable work, and that their projects could be expanded to reefs across the globe. It’s fantastic to see the Raspberry Pi being used to help protect ocean life.
Amazon QuickSight is a fully managed cloud business intelligence system that gives you Fast & Easy to Use Business Analytics for Big Data. QuickSight makes business analytics available to organizations of all shapes and sizes, with the ability to access data that is stored in your Amazon Redshift data warehouse, your Amazon Relational Database Service (RDS) relational databases, flat files in S3, and (via connectors) data stored in on-premises MySQL, PostgreSQL, and SQL Server databases. QuickSight scales to accommodate tens, hundreds, or thousands of users per organization.
Today we are launching a new, session-based pricing option for QuickSight, along with additional region support and other important new features. Let’s take a look at each one:
Pay-per-Session Pricing
Our customers are making great use of QuickSight and take full advantage of the power it gives them to connect to data sources, create reports, and explore visualizations.
However, not everyone in an organization needs or wants such powerful authoring capabilities. Having access to curated data in dashboards and being able to interact with the data by drilling down, filtering, or slicing-and-dicing is more than adequate for their needs. Subscribing them to a monthly or annual plan can be seen as an unwarranted expense, so a lot of such casual users end up not having access to interactive data or BI.
In order to allow customers to provide all of their users with interactive dashboards and reports, the Enterprise Edition of Amazon QuickSight now allows Reader access to dashboards on a Pay-per-Session basis. QuickSight users are now classified as Admins, Authors, or Readers, with distinct capabilities and prices:
Authors have access to the full power of QuickSight; they can establish database connections, upload new data, create ad hoc visualizations, and publish dashboards, all for $9 per month (Standard Edition) or $18 per month (Enterprise Edition).
Readers can view dashboards, slice and dice data using drill downs, filters and on-screen controls, and download data in CSV format, all within the secure QuickSight environment. Readers pay $0.30 for 30 minutes of access, with a monthly maximum of $5 per reader (a worked example follows this list).
Admins have all authoring capabilities, and can manage users and purchase SPICE capacity in the account. The QuickSight admin now has the ability to set the desired option (Author or Reader) when they invite members of their organization to use QuickSight. They can extend Reader invites to their entire user base without incurring any up-front or monthly costs, paying only for the actual usage.
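To make the Reader pricing concrete: a Reader who opens dashboards in 20 separate 30-minute sessions during a month would accrue 20 × $0.30 = $6.00 of usage but is billed only the $5 monthly maximum, while a Reader with 10 sessions pays 10 × $0.30 = $3.00.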
A New Region
QuickSight is now available in the Asia Pacific (Tokyo) Region.
The UI is in English, with a localized version in the works.
Hourly Data Refresh
Enterprise Edition SPICE data sets can now be set to refresh as frequently as every hour. In the past, each data set could be refreshed up to 5 times a day. To learn more, read Refreshing Imported Data.
Access to Data in Private VPCs
This feature was launched in preview form late last year, and is now available in production form to users of the Enterprise Edition. As I noted at the time, you can use it to implement secure, private communication with data sources that do not have public connectivity, including on-premises data in Teradata or SQL Server, accessed over an AWS Direct Connect link. To learn more, read Working with AWS VPC.
Parameters with On-Screen Controls
QuickSight dashboards can now include parameters that are set using on-screen dropdown, text box, numeric slider or date picker controls. The default value for each parameter can be set based on the user name (QuickSight calls this a dynamic default). You could, for example, set an appropriate default based on each user’s office location, department, or sales territory.
URL Actions for Linked Dashboards
You can now connect your QuickSight dashboards to external applications by defining URL actions on visuals. The actions can include parameters, and become available in the Details menu for the visual.
You can use this feature to link QuickSight dashboards to third party applications (e.g. Salesforce) or to your own internal applications. Read Custom URL Actions to learn how to use this feature.
Dashboard Sharing
You can now share QuickSight dashboards with every user in an account.
Larger SPICE Tables
The per-data set limit for SPICE tables has been raised from 10 GB to 25 GB.
Upgrade to Enterprise Edition
The QuickSight administrator can now upgrade an account from Standard Edition to Enterprise Edition with a click. This enables provisioning of Readers with pay-per-session pricing, private VPC access, row-level security for dashboards and data sets, and hourly refresh of data sets. Enterprise Edition pricing applies after the upgrade.
Available Now
Everything I listed above is available now and you can start using it today!
Previously, I showed you how to rotate Amazon RDS database credentials automatically with AWS Secrets Manager. In addition to database credentials, AWS Secrets Manager makes it easier to rotate, manage, and retrieve API keys, OAuth tokens, and other secrets throughout their lifecycle. You can configure Secrets Manager to rotate these secrets automatically, which can help you meet your compliance needs. You can also use Secrets Manager to rotate secrets on demand, which can help you respond quickly to security events. In this post, I show you how to store an API key in Secrets Manager and use a custom Lambda function to rotate the key automatically. I’ll use a Twitter API key and bearer token as an example; you can reference this example to rotate other types of API keys.
The instructions are divided into four main phases:
Store a Twitter API key and bearer token in Secrets Manager.
Create a custom Lambda function to rotate the bearer token.
Configure your application to retrieve the bearer token from Secrets Manager.
Configure Secrets Manager to use the custom Lambda function to rotate the bearer token automatically.
For the purpose of this post, I use the placeholder Demo/Twitter_Api_Key to denote the API key, the placeholder Demo/Twitter_bearer_token to denote the bearer token, and the placeholder Lambda_Rotate_Bearer_Token to denote the custom Lambda function. Be sure to replace these placeholders with the resource names from your account.
Phase 1: Store a Twitter API key and bearer token in Secrets Manager
Twitter enables developers to register their applications and retrieve an API key, which includes a consumer_key and consumer_secret. Developers use these to generate a bearer token that applications can then use to authenticate and retrieve information from Twitter. At any given point in time, you can use an API key to create only one valid bearer token.
Start by storing the API key in Secrets Manager. Here’s how:
From the AWS Secrets Manager console, select Store a new secret.
Figure 1: The “Store a new secret” button in the AWS Secrets Manager console
Select Other type of secrets (because you’re storing an API key).
Input the consumer_key and consumer_secret, and then select Next.
Figure 2: Select the consumer_key and the consumer_secret
Specify values for Secret Name and Description, then select Next. For this example, I use Demo/Twitter_API_Key.
Figure 3: Set values for “Secret Name” and “Description”
On the next screen, keep the default setting, Disable automatic rotation, because you’ll use the same API key to rotate bearer tokens programmatically and automatically. Applications and employees will not retrieve this API key. Select Next.
Figure 4: Keep the default “Disable automatic rotation” setting
Review the information on the next screen and, if everything looks correct, select Store. You’ve now successfully stored a Twitter API key in Secrets Manager.
Next, store the bearer token in Secrets Manager. Here’s how:
From the Secrets Manager console, select Store a new secret, select Other type of secrets, input details (access_token, token_type, and ARN of the API key) about the bearer token, and then select Next.
Figure 5: Add details about the bearer token
Specify values for Secret Name and Description, and then select Next. For this example, I use Demo/Twitter_bearer_token.
Figure 6: Again set values for “Secret Name” and “Description”
Keep the default rotation setting, Disable automatic rotation, and then select Next. You’ll enable rotation after you’ve updated the application to use Secrets Manager APIs to retrieve secrets.
Review the information and select Store. You’ve now completed storing the bearer token in Secrets Manager. I take note of the sample code provided on the review page. I’ll use this code to update my application to retrieve the bearer token using Secrets Manager APIs.
Figure 7: The sample code you can use in your app
Phase 2: Create a custom Lambda function to rotate the bearer token
While Secrets Manager supports rotating credentials for databases hosted on Amazon RDS natively, it also enables you to meet your unique rotation-related use cases by authoring custom Lambda functions. Now that you’ve stored the API key and bearer token, you’ll create a Lambda function to rotate the bearer token. For this example, I’ll create my Lambda function using Python 3.6.
Open the AWS Lambda console and select Create function.
Figure 8: In the Lambda console, select “Create function”
Select Author from scratch. For this example, I use the name Lambda_Rotate_Bearer_Token for my Lambda function. I also set the Runtime environment as Python 3.6.
Figure 9: Create a new function from scratch
This Lambda function requires permissions to call AWS resources on your behalf. To grant these permissions, select Create a custom role. This opens a console tab.
Select Create a new IAM Role and specify the value for Role Name. For this example, I use Role_Lambda_Rotate_Twitter_Bearer_Token.
Figure 10: For “IAM Role,” select “Create a new IAM role”
Next, to define the IAM permissions, copy and paste the following IAM policy in the View Policy Document text-entry field. Be sure to replace the placeholder ARN-OF-Demo/Twitter_API_Key with the ARN of your secret.
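(The policy itself appears in Figure 11 below. Since its text isn't reproduced in this capture, the following is only a rough sketch of the permissions such a rotation function needs, inferred from the Secrets Manager calls made later in the rotation code; treat the action list and resource ARNs as assumptions and use the policy shown in the figure as the authoritative version.)
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "RotateBearerTokenSecret",
            "Effect": "Allow",
            "Action": [
                "secretsmanager:DescribeSecret",
                "secretsmanager:GetSecretValue",
                "secretsmanager:PutSecretValue",
                "secretsmanager:UpdateSecretVersionStage"
            ],
            "Resource": "<ARN of the secret Demo/Twitter_bearer_token>"
        },
        {
            "Sid": "ReadTwitterApiKey",
            "Effect": "Allow",
            "Action": "secretsmanager:GetSecretValue",
            "Resource": "ARN-OF-Demo/Twitter_API_Key"
        }
    ]
}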
Figure 11: The IAM policy pasted in the “View Policy Document” text-entry field
Now, select Allow. This brings me back to the Lambda console with the appropriate Role selected.
Select Create function.
Figure 12: Select the “Create function” button in the lower-right corner
Copy the following Python code and paste it in the Function code section.
import base64
import json
import logging
import os
import boto3
from botocore.vendored import requests
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def lambda_handler(event, context):
"""Secrets Manager Twitter Bearer Token Handler
This handler uses the master-user rotation scheme to rotate a bearer token of a Twitter app.
The Secret PlaintextString is expected to be a JSON string with the following format:
{
'access_token': <access_token>,
'token_type': <token_type>,
'masterarn': <masterarn>
}
Args:
event (dict): Lambda dictionary of event parameters. These keys must include the following:
- SecretId: The secret ARN or identifier
- ClientRequestToken: The ClientRequestToken of the secret version
- Step: The rotation step (one of createSecret, setSecret, testSecret, or finishSecret)
context (LambdaContext): The Lambda runtime information
Raises:
ResourceNotFoundException: If the secret with the specified arn and stage does not exist
ValueError: If the secret is not properly configured for rotation
KeyError: If the secret json does not contain the expected keys
"""
arn = event['SecretId']
token = event['ClientRequestToken']
step = event['Step']
# Setup the client and environment variables
service_client = boto3.client('secretsmanager', endpoint_url=os.environ['SECRETS_MANAGER_ENDPOINT'])
oauth2_token_url = os.environ['TWITTER_OAUTH2_TOKEN_URL']
oauth2_invalid_token_url = os.environ['TWITTER_OAUTH2_INVALID_TOKEN_URL']
tweet_search_url = os.environ['TWITTER_SEARCH_URL']
# Make sure the version is staged correctly
metadata = service_client.describe_secret(SecretId=arn)
if not metadata['RotationEnabled']:
logger.error("Secret %s is not enabled for rotation" % arn)
raise ValueError("Secret %s is not enabled for rotation" % arn)
versions = metadata['VersionIdsToStages']
if token not in versions:
logger.error("Secret version %s has no stage for rotation of secret %s." % (token, arn))
raise ValueError("Secret version %s has no stage for rotation of secret %s." % (token, arn))
if "AWSCURRENT" in versions[token]:
logger.info("Secret version %s already set as AWSCURRENT for secret %s." % (token, arn))
return
elif "AWSPENDING" not in versions[token]:
logger.error("Secret version %s not set as AWSPENDING for rotation of secret %s." % (token, arn))
raise ValueError("Secret version %s not set as AWSPENDING for rotation of secret %s." % (token, arn))
# Call the appropriate step
if step == "createSecret":
create_secret(service_client, arn, token, oauth2_token_url, oauth2_invalid_token_url)
elif step == "setSecret":
set_secret(service_client, arn, token, oauth2_token_url)
elif step == "testSecret":
test_secret(service_client, arn, token, tweet_search_url)
elif step == "finishSecret":
finish_secret(service_client, arn, token)
else:
logger.error("lambda_handler: Invalid step parameter %s for secret %s" % (step, arn))
raise ValueError("Invalid step parameter %s for secret %s" % (step, arn))
def create_secret(service_client, arn, token, oauth2_token_url, oauth2_invalid_token_url):
"""Get a new bearer token from Twitter
This method invalidates existing bearer token for the Twitter app and retrieves a new one from Twitter.
If a secret version with AWSPENDING stage exists, updates it with the newly retrieved bearer token and if
the AWSPENDING stage does not exist, creates a new version of the secret with that stage label.
Args:
service_client (client): The secrets manager service client
arn (string): The secret ARN or other identifier
token (string): The ClientRequestToken associated with the secret version
oauth2_token_url (string): The Twitter API endpoint to request a bearer token
oauth2_invalid_token_url (string): The Twitter API endpoint to invalidate a bearer token
Raises:
ValueError: If the current secret is not valid JSON
KeyError: If the secret json does not contain the expected keys
ResourceNotFoundException: If the current secret is not found
"""
# Make sure the current secret exists and try to get the master arn from the secret
try:
current_secret_dict = get_secret_dict(service_client, arn, "AWSCURRENT")
master_arn = current_secret_dict['masterarn']
logger.info("createSecret: Successfully retrieved secret for %s." % arn)
except service_client.exceptions.ResourceNotFoundException:
return
# create bearer token credentials to be passed as authorization string to Twitter
bearer_token_credentials = encode_credentials(service_client, master_arn, "AWSCURRENT")
# get the bearer token from Twitter
bearer_token_from_twitter = get_bearer_token(bearer_token_credentials,oauth2_token_url)
# invalidate the current bearer token
invalidate_bearer_token(oauth2_invalid_token_url,bearer_token_credentials,bearer_token_from_twitter)
# get a new bearer token from Twitter
new_bearer_token = get_bearer_token(bearer_token_credentials, oauth2_token_url)
    # if a secret version with AWSPENDING stage exists, update it with the latest bearer token
# if the AWSPENDING stage does not exist, then create the version with AWSPENDING stage
try:
pending_secret_dict = get_secret_dict(service_client, arn, "AWSPENDING", token)
pending_secret_dict['access_token'] = new_bearer_token
service_client.put_secret_value(SecretId=arn, ClientRequestToken=token, SecretString=json.dumps(pending_secret_dict), VersionStages=['AWSPENDING'])
logger.info("createSecret: Successfully invalidated the bearer token of the secret %s and updated the pending version" % arn)
except service_client.exceptions.ResourceNotFoundException:
current_secret_dict['access_token'] = new_bearer_token
service_client.put_secret_value(SecretId=arn, ClientRequestToken=token, SecretString=json.dumps(current_secret_dict), VersionStages=['AWSPENDING'])
logger.info("createSecret: Successfully invalidated the bearer token of the secret %s and and created the pending version." % arn)
def set_secret(service_client, arn, token, oauth2_token_url):
"""Validate the pending secret with that in Twitter
    This method checks whether the bearer token in Twitter is the same as the one in the version with AWSPENDING stage.
Args:
service_client (client): The secrets manager service client
arn (string): The secret ARN or other identifier
token (string): The ClientRequestToken associated with the secret version
        oauth2_token_url (string): The Twitter API endpoint to get a bearer token
Raises:
ResourceNotFoundException: If the secret with the specified arn and stage does not exist
ValueError: If the secret is not valid JSON or master credentials could not be used to login to DB
KeyError: If the secret json does not contain the expected keys
"""
# First get the pending version of the bearer token and compare it with that in Twitter
pending_secret_dict = get_secret_dict(service_client, arn, "AWSPENDING")
master_arn = pending_secret_dict['masterarn']
# create bearer token credentials to be passed as authorization string to Twitter
bearer_token_credentials = encode_credentials(service_client, master_arn, "AWSCURRENT")
# get the bearer token from Twitter
bearer_token_from_twitter = get_bearer_token(bearer_token_credentials, oauth2_token_url)
# if the bearer tokens are same, invalidate the bearer token in Twitter
# if not, raise an exception that bearer token in Twitter was changed outside Secrets Manager
if pending_secret_dict['access_token'] == bearer_token_from_twitter:
logger.info("createSecret: Successfully verified the bearer token of arn %s" % arn)
else:
raise ValueError("The bearer token of the Twitter app was changed outside Secrets Manager. Please check.")
def test_secret(service_client, arn, token, tweet_search_url):
"""Test the pending secret by calling a Twitter API
This method tries to use the bearer token in the secret version with AWSPENDING stage and search for tweets
with 'aws secrets manager' string.
Args:
service_client (client): The secrets manager service client
arn (string): The secret ARN or other identifier
token (string): The ClientRequestToken associated with the secret version
Raises:
ResourceNotFoundException: If the secret with the specified arn and stage does not exist
ValueError: If the secret is not valid JSON or pending credentials could not be used to login to the database
KeyError: If the secret json does not contain the expected keys
"""
# First get the pending version of the bearer token and compare it with that in Twitter
pending_secret_dict = get_secret_dict(service_client, arn, "AWSPENDING", token)
# Now verify you can search for tweets using the bearer token
if verify_bearer_token(pending_secret_dict['access_token'], tweet_search_url):
logger.info("testSecret: Successfully authorized with the pending secret in %s." % arn)
return
else:
logger.error("testSecret: Unable to authorize with the pending secret of secret ARN %s" % arn)
raise ValueError("Unable to connect to Twitter with pending secret of secret ARN %s" % arn)
def finish_secret(service_client, arn, token):
"""Finish the rotation by marking the pending secret as current
This method moves the secret from the AWSPENDING stage to the AWSCURRENT stage.
Args:
service_client (client): The secrets manager service client
arn (string): The secret ARN or other identifier
token (string): The ClientRequestToken associated with the secret version
Raises:
ResourceNotFoundException: If the secret with the specified arn and stage does not exist
"""
# First describe the secret to get the current version
metadata = service_client.describe_secret(SecretId=arn)
current_version = None
for version in metadata["VersionIdsToStages"]:
if "AWSCURRENT" in metadata["VersionIdsToStages"][version]:
if version == token:
# The correct version is already marked as current, return
logger.info("finishSecret: Version %s already marked as AWSCURRENT for %s" % (version, arn))
return
current_version = version
break
# Finalize by staging the secret version current
service_client.update_secret_version_stage(SecretId=arn, VersionStage="AWSCURRENT", MoveToVersionId=token, RemoveFromVersionId=current_version)
logger.info("finishSecret: Successfully set AWSCURRENT stage to version %s for secret %s." % (version, arn))
def encode_credentials(service_client, arn, stage):
"""Encodes the Twitter credentials
This helper function encodes the Twitter credentials (consumer_key and consumer_secret)
Args:
service_client (client):The secrets manager service client
arn (string): The secret ARN or other identifier
stage (stage): The stage identifying the secret version
Returns:
encoded_credentials (string): base64 encoded authorization string for Twitter
Raises:
KeyError: If the secret json does not contain the expected keys
"""
required_fields = ['consumer_key','consumer_secret']
master_secret_dict = get_secret_dict(service_client, arn, stage)
for field in required_fields:
if field not in master_secret_dict:
raise KeyError("%s key is missing from the secret JSON" % field)
encoded_credentials = base64.urlsafe_b64encode(
'{}:{}'.format(master_secret_dict['consumer_key'], master_secret_dict['consumer_secret']).encode('ascii')).decode('ascii')
return encoded_credentials
def get_bearer_token(encoded_credentials, oauth2_token_url):
"""Gets a bearer token from Twitter
This helper function retrieves the current bearer token from Twitter, given a set of credentials.
Args:
encoded_credentials (string): Twitter credentials for authentication
oauth2_token_url (string): REST API endpoint to request a bearer token from Twitter
Raises:
KeyError: If the secret json does not contain the expected keys
"""
headers = {
'Authorization': 'Basic {}'.format(encoded_credentials),
'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
}
data = 'grant_type=client_credentials'
response = requests.post(oauth2_token_url, headers=headers, data=data)
response_data = response.json()
if response_data['token_type'] == 'bearer':
bearer_token = response_data['access_token']
return bearer_token
else:
raise RuntimeError('unexpected token type: {}'.format(response_data['token_type']))
def invalidate_bearer_token(oauth2_invalid_token_url, bearer_token_credentials, bearer_token):
"""Invalidates a Bearer Token of a Twitter App
This helper function invalidates a bearer token of a Twitter app.
If successful, it returns the invalidated bearer token, else None
Args:
oauth2_invalid_token_url (string): The Twitter API endpoint to invalidate a bearer token
bearer_token_credentials (string): encoded consumer key and consumer secret to authenticate with Twitter
bearer_token (string): The bearer token to be invalidated
Returns:
invalidated_bearer_token: The invalidated bearer token
Raises:
ResourceNotFoundException: If the secret with the specified arn and stage does not exist
ValueError: If the secret is not valid JSON
KeyError: If the secret json does not contain the expected keys
"""
headers = {
'Authorization': 'Basic {}'.format(bearer_token_credentials),
'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
}
data = 'access_token=' + bearer_token
invalidate_response = requests.post(oauth2_invalid_token_url, headers=headers, data=data)
invalidate_response_data = invalidate_response.json()
if invalidate_response_data:
return
else:
raise RuntimeError('Invalidate bearer token request failed')
def verify_bearer_token(bearer_token, tweet_search_url):
"""Verifies access to Twitter APIs using a bearer token
This helper function verifies that the bearer token is valid by calling Twitter's search/tweets API endpoint
Args:
bearer_token (string): The current bearer token for the application
Returns:
True or False
Raises:
KeyError: If the response of search tweets API call fails
"""
headers = {
'Authorization' : 'Bearer {}'.format(bearer_token),
'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
}
search_results = requests.get(tweet_search_url, headers=headers)
try:
search_results.json()['statuses']
return True
except:
return False
def get_secret_dict(service_client, arn, stage, token=None):
"""Gets the secret dictionary corresponding for the secret arn, stage, and token
This helper function gets credentials for the arn and stage passed in and returns the dictionary by parsing the JSON string
Args:
service_client (client): The secrets manager service client
arn (string): The secret ARN or other identifier
token (string): The ClientRequestToken associated with the secret version, or None if no validation is desired
stage (string): The stage identifying the secret version
Returns:
SecretDictionary: Secret dictionary
Raises:
ResourceNotFoundException: If the secret with the specified arn and stage does not exist
ValueError: If the secret is not valid JSON
"""
# Only do VersionId validation against the stage if a token is passed in
if token:
secret = service_client.get_secret_value(SecretId=arn, VersionId=token, VersionStage=stage)
else:
secret = service_client.get_secret_value(SecretId=arn, VersionStage=stage)
plaintext = secret['SecretString']
# Parse and return the secret JSON string
return json.loads(plaintext)
Here’s what it will look like:
Figure 13: The Python code pasted in the “Function code” section
On the same page, provide the following environment variables:
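(The variables themselves were shown in a screenshot in the original post. Based on the os.environ lookups in the handler above, the configuration looks roughly like the following; the Twitter endpoint URLs are the standard OAuth2 and search/tweets endpoints and should be verified against Twitter's current documentation.)
SECRETS_MANAGER_ENDPOINT = https://secretsmanager.us-east-2.amazonaws.com
TWITTER_OAUTH2_TOKEN_URL = https://api.twitter.com/oauth2/token
TWITTER_OAUTH2_INVALID_TOKEN_URL = https://api.twitter.com/oauth2/invalidate_token
TWITTER_SEARCH_URL = https://api.twitter.com/1.1/search/tweets.json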
Note: Resources used in this example are in US East (Ohio) region. If you intend to use another AWS Region, change the SECRETS_MANAGER_ENDPOINT set in the Environment variables to the appropriate region.
You’ve now created a Lambda function that can rotate the bearer token:
Figure 15: The new Lambda function
Before you can configure Secrets Manager to use this Lambda function, you need to update the function policy of the Lambda function. A function policy permits AWS services, such as Secrets Manager, to invoke a Lambda function on behalf of your application. You can attach a Lambda function policy from the AWS Command Line Interface (AWS CLI) or SDK. To attach a function policy, call the add-permission Lambda API from the AWS CLI.
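As a rough sketch of that CLI call (the statement ID is an arbitrary label of your choosing; check the Lambda documentation for the exact syntax and any additional options you need):
aws lambda add-permission \
    --function-name Lambda_Rotate_Bearer_Token \
    --statement-id SecretsManagerAccess \
    --principal secretsmanager.amazonaws.com \
    --action lambda:InvokeFunction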
Phase 3: Configure your application to retrieve the bearer token from Secrets Manager
Now that you’ve stored the bearer token in Secrets Manager, update the application to retrieve the bearer token from Secrets Manager instead of hard-coding this information in a configuration file or source code. For this example, I show you how to configure a Python application to retrieve this secret from Secrets Manager.
import config
def no_secrets_manager_sample():
    # Get the bearer token from a config file.
    bearer_token = config.bearer_token
    # Use the bearer token to authenticate requests to Twitter
Use the sample code from the section titled Phase 1 and update the application to retrieve the bearer token from Secrets Manager. The following code sets up the client and retrieves and decrypts the secret Demo/Twitter_bearer_token.
# Use this code snippet in your app.
import boto3
from botocore.exceptions import ClientError
def get_secret():
secret_name = "Demo/Twitter_bearer_token"
endpoint_url = "https://secretsmanager.us-east-2.amazonaws.com"
region_name = "us-east-2"
session = boto3.session.Session()
client = session.client(
service_name='secretsmanager',
region_name=region_name,
endpoint_url=endpoint_url
)
try:
get_secret_value_response = client.get_secret_value(
SecretId=secret_name
)
except ClientError as e:
if e.response['Error']['Code'] == 'ResourceNotFoundException':
print("The requested secret " + secret_name + " was not found")
elif e.response['Error']['Code'] == 'InvalidRequestException':
print("The request was invalid due to:", e)
elif e.response['Error']['Code'] == 'InvalidParameterException':
print("The request had invalid params:", e)
else:
# Decrypted secret using the associated KMS CMK
# Depending on whether the secret was a string or binary, one of these fields will be populated
if 'SecretString' in get_secret_value_response:
secret = get_secret_value_response['SecretString']
else:
binary_secret_data = get_secret_value_response['SecretBinary']
# Your code goes here.
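For example (a sketch that assumes the bearer token was stored as the JSON structure from Phase 1, and that import json has been added at the top of the file), the application could parse the returned secret string and extract the bearer token where the "Your code goes here" comment sits:
# Parse the secret JSON and pull out the bearer token for Twitter API calls (illustrative only).
secret_dict = json.loads(secret)
bearer_token = secret_dict['access_token']
headers = {'Authorization': 'Bearer {}'.format(bearer_token)}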
Applications require permissions to access Secrets Manager. My application runs on Amazon EC2 and uses an IAM role to get access to AWS services. I’ll attach the following policy to my IAM role, and you should take a similar action with your IAM role. This policy uses the GetSecretValue action to grant my application permissions to read secrets from Secrets Manager. This policy also uses the resource element to limit my application to read only the Demo/Twitter_bearer_token secret from Secrets Manager. Read the AWS Secrets Manager documentation to understand the minimum IAM permissions required to retrieve a secret.
{
"Version": "2012-10-17",
"Statement": {
"Sid": "RetrieveBearerToken",
"Effect": "Allow",
"Action": "secretsmanager:GetSecretValue",
"Resource": Input ARN of the secret Demo/Twitter_bearer_token here
}
}
Note: To improve the resiliency of your applications, associate your application with two API keys/bearer tokens. This is a higher availability option because you can continue to use one bearer token while Secrets Manager rotates the other token. Read the AWS documentation to learn how AWS Secrets Manager rotates your secrets.
Phase 4: Enable and verify rotation
Now that you’ve stored the secret in Secrets Manager and created a Lambda function to rotate this secret, configure Secrets Manager to rotate the secret Demo/Twitter_bearer_token.
From the Secrets Manager console, go to the list of secrets and choose the secret you created in the first step (in my example, this is named Demo/Twitter_bearer_token).
Scroll to Rotation configuration, and then select Edit rotation.
Figure 16: Select the “Edit rotation” button
To enable rotation, select Enable automatic rotation, and then choose how frequently you want Secrets Manager to rotate this secret. For this example, I set the rotation interval to 30 days. I also choose the rotation Lambda function, Lambda_Rotate_Bearer_Token, from the drop-down list.
Figure 17: “Edit rotation configuration” options
The banner on the next screen confirms that I have successfully configured rotation and the first rotation is in progress, which enables you to verify that rotation is functioning as expected. Secrets Manager will rotate this credential automatically every 30 days.
Figure 18: Confirmation notice
Summary
In this post, I showed you how to configure Secrets Manager to manage and rotate an API key and bearer token used by applications to authenticate and retrieve information from Twitter. You can use the steps described in this blog to manage and rotate other API keys, as well.
Secrets Manager helps you protect access to your applications, services, and IT resources without the upfront investment and on-going maintenance costs of operating your own secrets management infrastructure. To get started, open the Secrets Manager console. To learn more, read the Secrets Manager documentation.
If you have comments about this post, submit them in the Comments section below. If you have questions about anything in this post, start a new thread on the Secrets Manager forum or contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
Hey folks, Rob here! It’s the last Thursday of the month, and that means it’s time for a brand-new The MagPi. Issue 70 is all about home automation using your favourite microcomputer, the Raspberry Pi.
Home automation in this month’s The MagPi!
Raspberry Pi home automation
We think home automation is an excellent use of the Raspberry Pi, hiding it around your house and letting it power your lights and doorbells and…fish tanks? We show you how to do all of that, and give you some excellent tips on how to add even more automation to your home in our ten-page cover feature.
Upcycle your life
Our other big feature this issue covers upcycling, the hot trend of taking old electronics and making them better than new with some custom code and a tactically placed Raspberry Pi. For this feature, we had a chat with Martin Mander, upcycler extraordinaire, to find out his top tips for hacking your old hardware.
Upcycling is a lot of fun
But wait, there’s more!
If for some reason you want even more content, you’re in luck! We have some fun tutorials for you to try, like creating a theremin and turning a Babbage into an IoT nanny cam. We also continue our quest to make a video game in C++. Our project showcase is headlined by the Teslonda on page 28, a Honda/Tesla car hybrid that is just wonderful.
We review PiBorg’s latest robot
All this comes with our definitive reviews and the community section where we celebrate you, our amazing community! You’re all good beans
An amazing, and practical, Raspberry Pi project
Get The MagPi 70
Issue 70 is available today from WHSmith, Tesco, Sainsbury’s, and Asda. If you live in the US, head over to your local Barnes & Noble or Micro Center in the next few days for a print copy. You can also get the new issue online from our store, or digitally via our Android and iOS apps. And don’t forget, there’s always the free PDF as well.
New subscription offer!
Want to support the Raspberry Pi Foundation and the magazine? We’ve launched a new way to subscribe to the print version of The MagPi: you can now take out a monthly £4 subscription to the magazine, effectively creating a rolling pre-order system that saves you money on each issue.
You can also take out a twelve-month print subscription and get a Pi Zero W plus case and adapter cables absolutely free! This offer does not currently have an end date.
Backblaze is hiring a Director of Sales. This is a critical role for Backblaze as we continue to grow the team. We need a strong leader who has experience in scaling a sales team and who has an excellent track record for exceeding goals by selling Software as a Service (SaaS) solutions. In addition, this leader will need to be highly motivated, as well as able to create and develop a highly motivated, success-oriented sales team that has fun and enjoys what they do.
The History of Backblaze from our CEO
In 2007, after a friend’s computer crash caused her some suffering, we realized that with every photo, video, song, and document going digital, everyone would eventually lose all of their information. Five of us quit our jobs to start a company with the goal of making it easy for people to back up their data.
Like many startups, for a while we worked out of a co-founder’s one-bedroom apartment. Unlike most startups, we made an explicit agreement not to raise funding during the first year. We would then touch base every six months and decide whether to raise or not. We wanted to focus on building the company and the product, not on pitching and slide decks. And critically, we wanted to build a culture that understood money comes from customers, not the magical VC giving tree. Over the course of 5 years we built a profitable, multi-million dollar revenue business — and only then did we raise a VC round.
Fast forward 10 years later and our world looks quite different. You’ll have some fantastic assets to work with:
A brand millions recognize for openness, ease-of-use, and affordability.
A computer backup service that stores over 500 petabytes of data and has recovered over 30 billion files for hundreds of thousands of paying customers — most of whom self-identify as the people who find and recommend technology products to their friends.
Our B2 service, which provides the lowest-cost cloud storage on the planet at one-quarter the price that Amazon, Google, or Microsoft charge. Although it is a newer product on the market, it already has over 100,000 IT professionals and developers signed up, as well as an ecosystem building up around it.
A growing, profitable and cash-flow positive company.
And last, but most definitely not least: a great sales team.
You might be saying, “sounds like you’ve got this under control — why do you need me?” Don’t be misled. We need you. Here’s why:
We have a great team, but we are in the process of expanding and we need to develop a structure that will easily scale and provide the most success to drive revenue.
We just launched our outbound sales efforts and we need someone to help develop that into a fully successful program that’s building a strong pipeline and closing business.
We need someone to work with the marketing department and figure out how to generate more inbound opportunities that the sales team can follow up on and close.
We need someone who will work closely with our current sales team to develop their skills and build a path for career growth and advancement.
We want someone to manage our Customer Success program.
So that’s a bit about us. What are we looking for in you?
Experience: As a sales leader, you will strategically build and drive the territory’s sales pipeline by assembling and leading a skilled team of sales professionals. This leader should be familiar with generating, developing, and closing software subscription (SaaS) opportunities. We are looking for a self-starter who can manage a team and make an immediate impact selling our Backup and Cloud Storage solutions. In this role, the sales leader will work closely with the VP of Sales, marketing staff, and service staff to develop and implement specific strategic plans to achieve and exceed revenue targets, including new business acquisition, as well as to build out our customer success program.
Leadership: We have an experienced team who’s brought us to where we are today. You need to have the people and management skills to get them excited about working with you. You need to be a strong leader who is compassionate about developing and supporting your team.
Data driven and creative: The data has to show something makes sense before we scale it up. However, without creativity, it’s easy to say “the data shows it’s impossible” or to find a local maximum. Whether it’s deciding how to scale the team, figuring out what our outbound sales efforts should look like or putting a plan in place to develop the team for career growth, we’ve seen a bit of creativity get us places a few extra dollars couldn’t.
Jive with our culture: Strong leaders affect culture, and the person we hire for this role may well shape, not only fit into, ours. But to shape the culture you have to be accepted by the organism, which means a certain set of shared values. We default to openness with our team, our customers, and everyone if possible. We love initiative, without arrogance or dictatorship. We work to create a place people enjoy showing up to. That doesn’t mean ping pong tables and foosball (though we do try to have perks and fun); it means people are friendly, non-political, and working to build both a good service and a good place to work.
Do the work: Ideas and strategy are critical, but good execution makes them happen. We’re looking for someone who can help the team execute both from the perspective of being capable of guiding and organizing, but also someone who is hands-on themselves.
Additional Responsibilities needed for this role:
Recruit, coach, mentor, manage and lead a team of sales professionals to achieve yearly sales targets. This includes closing new business and expanding upon existing clientele.
Expand the customer success program to provide the best customer experience possible resulting in upsell opportunities and a high retention rate.
Develop effective sales strategies and deliver compelling product demonstrations and sales pitches.
Acquire and develop the appropriate sales tools to make the team efficient in their daily workflow.
Apply a thorough understanding of the marketplace, industry trends, funding developments, and products to all management activities and strategic sales decisions.
Ensure that sales department operations function smoothly, with the goal of facilitating sales and/or closings; operational responsibilities include accurate pipeline reporting and sales forecasts.
This position will report directly to the VP of Sales and will be staffed in our headquarters in San Mateo, CA.
Requirements:
7 – 10+ years of successful sales leadership experience as measured by sales performance against goals. Experience in developing skill sets and providing career growth opportunities through advancement of team members.
Background in selling SaaS technologies with a strong track record of success.
Strong presentation and communication skills.
Must be able to travel occasionally nationwide.
BA/BS degree required.
Think you want to join us on this adventure? Send an email to jobscontact@backblaze.com with the subject “Director of Sales.” (Recruiters and agencies, please don’t email us.) Include a resume and answer these two questions:
How would you approach evaluating the current sales team and what is your process for developing a growth strategy to scale the team?
What are the goals you would set for yourself in the 3-month and 1-year timeframes?
Thank you for taking the time to read this and I hope that this sounds like the opportunity for which you’ve been waiting.
Amazon Neptune is now Generally Available in US East (N. Virginia), US East (Ohio), US West (Oregon), and EU (Ireland). Amazon Neptune is a fast, reliable, fully-managed graph database service that makes it easy to build and run applications that work with highly connected datasets. At the core of Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latencies. Neptune supports two popular graph models, Property Graph and RDF, through Apache TinkerPop Gremlin and SPARQL, allowing you to easily build queries that efficiently navigate highly connected datasets. Neptune can be used to power everything from recommendation engines and knowledge graphs to drug discovery and network security. Neptune is fully-managed with automatic minor version upgrades, backups, encryption, and fail-over. I wrote about Neptune in detail for AWS re:Invent last year and customers have been using the preview and providing great feedback that the team has used to prepare the service for GA.
Now that Amazon Neptune is generally available there are a few changes from the preview:
A large number of performance enhancements and updates
Launching a Neptune cluster is as easy as navigating to the AWS Management Console and clicking create cluster. Of course you can also launch with CloudFormation, the CLI, or the SDKs.
You can monitor your cluster health and the health of individual instances through Amazon CloudWatch and the console.
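Once your cluster is available, running a first Gremlin traversal against it is a good way to confirm everything is wired up. Here's a minimal sketch using the gremlinpython client; the endpoint below is a placeholder, and your own cluster endpoint comes from the console.
from gremlin_python.structure.graph import Graph
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

# Placeholder endpoint; Neptune listens on port 8182 by default.
connection = DriverRemoteConnection("wss://<your-neptune-endpoint>:8182/gremlin", "g")
g = Graph().traversal().withRemote(connection)

# Return a handful of vertices to confirm connectivity.
print(g.V().limit(5).toList())

connection.close()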
Additional Resources
We’ve created two repos with some additional tools and examples here. You can expect continuous development on these repos as we add additional tools and examples.
Amazon Neptune Tools Repo – This repo has a useful tool for converting GraphML files into Neptune compatible CSVs for bulk loading from S3.
Amazon Neptune Samples Repo – This repo has a really cool example of building a collaborative filtering recommendation engine for video game preferences.
Purpose Built Databases
There’s an industry trend toward purpose-built databases. Developers and businesses want to access their data in the format that makes the most sense for their applications. As cloud resources make transforming large datasets easier with tools like AWS Glue, we have a lot more options than we used to for accessing our data. With tools like Amazon Redshift, Amazon Athena, Amazon Aurora, Amazon DynamoDB, and more, we get to choose the best database for the job or even enable entirely new use cases. Amazon Neptune is perfect for workloads where the data is highly connected across data-rich edges.
I’m really excited about graph databases and I see a huge number of applications. Looking for ideas of cool things to build? I’d love to build a web crawler in AWS Lambda that uses Neptune as the backing store. You could further enrich it by running Amazon Comprehend or Amazon Rekognition on the text and images found and creating a search engine on top of Neptune.
As always, feel free to reach out in the comments or on Twitter to provide any feedback!
Today I’m excited to announce built-in authentication support in Application Load Balancers (ALB). ALB can now securely authenticate users as they access applications, letting developers eliminate the code they have to write to support authentication and offload the responsibility of authentication from the backend. The team built a great live example where you can try out the authentication functionality.
Identity-based security is a crucial component of modern applications and as customers continue to move mission critical applications into the cloud, developers are asked to write the same authentication code again and again. Enterprises want to use their on-premises identities with their cloud applications. Web developers want to use federated identities from social networks to allow their users to sign-in. ALB’s new authentication action provides authentication through social Identity Providers (IdP) like Google, Facebook, and Amazon through Amazon Cognito. It also natively integrates with any OpenID Connect protocol compliant IdP, providing secure authentication and a single sign-on experience across your applications.
How Does ALB Authentication Work?
Authentication is a complicated topic and our readers may have differing levels of expertise with it. I want to cover a few key concepts to make sure we’re all on the same page. If you’re already an authentication expert and you just want to see how ALB authentication works feel free to skip to the next section!
Authentication verifies identity.
Authorization verifies permissions, the things an identity is allowed to do.
OpenID Connect (OIDC) is a simple identity, or authentication, layer built on top of the OAuth 2.0 protocol. The OIDC specification document is pretty well written and worth a casual read.
Identity Providers (IdPs) manage identity information and provide authentication services. ALB supports any OIDC compliant IdP and you can use a service like Amazon Cognito or Auth0 to aggregate different identities from various IdPs like Active Directory, LDAP, Google, Facebook, Amazon, or others deployed in AWS or on premises.
When we get away from the terminology for a bit, all of this boils down to figuring out who a user is and what they’re allowed to do. Doing this securely and efficiently is hard. Traditionally, enterprises have used a protocol called SAML with their IdPs to provide a single sign-on (SSO) experience for their internal users. SAML is XML-heavy, and modern applications have started using OIDC, which uses JSON to share claims. Developers can use SAML in ALB with Amazon Cognito’s SAML support. Web app or mobile developers typically use federated identities via social IdPs like Facebook, Amazon, or Google which, conveniently, are also supported by Amazon Cognito.
ALB Authentication works by defining an authentication action in a listener rule. The ALB’s authentication action will check if a session cookie exists on incoming requests, then check that it’s valid. If the session cookie is set and valid, the ALB routes the request to the target group with X-AMZN-OIDC-* headers set. The headers contain identity information in JSON Web Token (JWT) format that a backend can use to identify a user. If the session cookie is not set or is invalid, ALB follows the OIDC protocol and issues an HTTP 302 redirect to the identity provider. The protocol is a lot to unpack and is covered more thoroughly in the documentation for those curious.
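To give you a feel for the shape of that configuration, here is a rough sketch of the actions array you might pass to aws elbv2 create-rule to pair an authenticate-cognito action with a forward action. The ARNs, client ID, and domain below are placeholders, not values from this walkthrough.
[
  {
    "Type": "authenticate-cognito",
    "Order": 1,
    "AuthenticateCognitoConfig": {
      "UserPoolArn": "arn:aws:cognito-idp:us-east-1:123456789012:userpool/<pool-id>",
      "UserPoolClientId": "<client-id>",
      "UserPoolDomain": "<domain-prefix>",
      "OnUnauthenticatedRequest": "authenticate",
      "SessionTimeout": 604800,
      "Scope": "openid"
    }
  },
  {
    "Type": "forward",
    "Order": 2,
    "TargetGroupArn": "arn:aws:elasticloadbalancing:us-east-1:123456789012:targetgroup/<target-group>"
  }
]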
ALB Authentication Walkthrough
I have a simple Python flask app in an Amazon ECS cluster running in some AWS Fargate containers. The containers are in a target group routed to by an ALB. I want to make sure users of my application are logged in before accessing the authenticated portions of my application. First, I’ll navigate to the ALB in the console and edit the rules.
I want to make sure all access to /account* endpoints is authenticated, so I’ll add a new rule with a condition to match those endpoints.
Now, I’ll add a new rule and create an Authenticate action in that rule.
I’ll have ALB create a new Amazon Cognito user pool for me by providing some configuration details.
After creating the Amazon Cognito pool, I can make some additional configuration in the advanced settings.
I can change the default cookie name, adjust the timeout, adjust the scope, and choose the action for unauthenticated requests.
I can pick Deny to serve a 401 for all unauthenticated requests or I can pick Allow which will pass through to the application if unauthenticated. This is useful for Single Page Apps (SPAs). For now, I’ll choose Authenticate, which will prompt the IdP, in this case Amazon Cognito, to authenticate the user and reload the existing page.
Now I’ll add a forwarding action for my target group and save the rule.
Over on the Facebook side I just need to add my Amazon Cognito User Pool Domain to the whitelisted OAuth redirect URLs.
I would follow similar steps for other authentication providers.
Now, when I navigate to an authenticated page, my Fargate containers receive the originating request with the X-Amzn-Oidc-* headers set by ALB. Using the information in those headers (claims-data, identity, access-token), my application can implement authorization.
All of this was possible without having to write a single line of code to deal with each of the IdPs. However, it’s still important for the implementing applications to verify the signature on the JWT header to ensure the request hasn’t been tampered with.
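If you're curious what that verification could look like, here's a minimal sketch in Python. It assumes a Flask app like mine, the PyJWT and requests libraries, and the regional ELB public-key endpoint; the region, route, and claim name are placeholders for this sketch.
import jwt  # PyJWT, with the cryptography package installed for ES256
import requests
from flask import Flask, request

app = Flask(__name__)
REGION = "us-east-1"  # placeholder region for this sketch

@app.route("/account")
def account():
    encoded = request.headers["X-Amzn-Oidc-Data"]
    # The JWT header carries the key id ("kid") of the key ALB used to sign it.
    kid = jwt.get_unverified_header(encoded)["kid"]
    # Fetch the matching public key from the regional ELB key endpoint.
    pub_key = requests.get(
        "https://public-keys.auth.elb.{}.amazonaws.com/{}".format(REGION, kid)
    ).text
    # Verify the signature before trusting any claims.
    claims = jwt.decode(encoded, pub_key, algorithms=["ES256"])
    return "Hello, {}".format(claims.get("email", "authenticated user"))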
Additional Resources
Of course everything we’ve seen today is also available in the API and AWS Command Line Interface (CLI). You can find additional information on the feature in the documentation. This feature is provided at no additional charge.
With authentication built in to ALB, developers can focus on building their applications instead of rebuilding authentication for every application, all the while maintaining the scale, availability, and reliability of ALB. I think this feature is a pretty big deal and I can’t wait to see what customers build with it. Let us know what you think of this feature in the comments or on Twitter!
What do I do with a Mac that still has personal data on it? Do I take out the disk drive and smash it? Do I sweep it with a really strong magnet? Is there a difference in how I handle a hard drive (HDD) versus a solid-state drive (SSD)? Well, taking a sledgehammer or projectile weapon to your old machine is certainly one way to make the data irretrievable, and it can be enormously cathartic as long as you follow appropriate safety and disposal protocols. But there are far less destructive ways to make sure your data is gone for good. Let me introduce you to secure erasing.
Which Type of Drive Do You Have?
Before we start, you need to know whether you have an HDD or an SSD. To find out, or at least to make sure, click on the Apple menu and select “About This Mac.” Once there, select the “Storage” tab to see which type of drive is in your system.
The first example, below, shows a SATA Disk (HDD) in the system.
In the next case, we see we have a Solid State SATA Drive (SSD), plus a Mac SuperDrive.
The third screenshot shows an SSD, as well. In this case, it’s called “Flash Storage.”
Make Sure You Have a Backup
Before you get started, you’ll want to make sure that any important data on your hard drive has moved somewhere else. OS X’s built-in Time Machine backup software is a good start, especially when paired with Backblaze. You can learn more about using Time Machine in our Mac Backup Guide.
With a local backup copy in hand and secure cloud storage, you know your data is always safe no matter what happens.
Once you’ve verified your data is backed up, roll up your sleeves and get to work. The key is OS X Recovery — a special part of the Mac operating system since OS X 10.7 “Lion.”
How to Wipe a Mac Hard Disk Drive (HDD)
NOTE: If you’re interested in wiping an SSD, see below.
Make sure your Mac is turned off.
Press the power button.
Immediately hold down the command and R keys.
Wait until the Apple logo appears.
Select “Disk Utility” from the OS X Utilities list. Click Continue.
Select the disk you’d like to erase by clicking on it in the sidebar.
Click the Erase button.
Click the Security Options button.
The Security Options window includes a slider that enables you to determine how thoroughly you want to erase your hard drive.
There are four notches to that Security Options slider. “Fastest” is quick but insecure — data could potentially be rebuilt using a file recovery app. Moving that slider to the right introduces progressively more secure erasing. Disk Utility’s most secure level erases the information used to access the files on your disk, then writes zeroes across the disk surface seven times to help remove any trace of what was there. This setting conforms to the DoD 5220.22-M specification.
Once you’ve selected the level of secure erasing you’re comfortable with, click the OK button.
Click the Erase button to begin. Bear in mind that the more secure method you select, the longer it will take. The most secure methods can add hours to the process.
Once it’s done, the Mac’s hard drive will be clean as a whistle and ready for its next adventure: a fresh installation of OS X, being donated to a relative or a local charity, or just sent to an e-waste facility. Of course you can still drill a hole in your disk or smash it with a sledgehammer if it makes you happy, but now you know how to wipe the data from your old computer with much less ruckus.
The above instructions apply to older Macintoshes with HDDs. What do you do if you have an SSD?
Securely Erasing SSDs, and Why Not To
Most new Macs ship with solid state drives (SSDs). Only the iMac and Mac mini ship with regular hard drives anymore, and even those are available in pure SSD variants if you want.
If your Mac comes equipped with an SSD, Apple’s Disk Utility software won’t actually let you zero the hard drive.
Wait, what?
In a tech note posted to Apple’s own online knowledgebase, Apple explains that you don’t need to securely erase your Mac’s SSD:
With an SSD drive, Secure Erase and Erasing Free Space are not available in Disk Utility. These options are not needed for an SSD drive because a standard erase makes it difficult to recover data from an SSD.
In fact, some folks will tell you not to zero out the data on an SSD, since it can cause wear and tear on the memory cells that, over time, can affect its reliability. I don’t think that’s nearly as big an issue as it used to be — SSD reliability and longevity have improved.
If “Standard Erase” doesn’t quite make you feel comfortable that your data can’t be recovered, there are a couple of options.
FileVault Keeps Your Data Safe
One way to make sure that your SSD’s data remains secure is to use FileVault. FileVault is whole-disk encryption for the Mac. With FileVault engaged, you need a password to access the information on your hard drive. Without that password, the data stays encrypted and unreadable.
There’s one potential downside of FileVault — if you lose your password or the encryption key, you’re screwed: You’re not getting your data back any time soon. Based on my experience working at a Mac repair shop, losing a FileVault key happens more frequently than it should.
When you first set up a new Mac, you’re given the option of turning FileVault on. If you don’t do it then, you can turn on FileVault at any time by clicking on your Mac’s System Preferences, clicking on Security & Privacy, and clicking on the FileVault tab. Be warned, however, that the initial encryption process can take hours, as will decryption if you ever need to turn FileVault off.
With FileVault turned on, you can restart your Mac into its Recovery System (by restarting the Mac while holding down the command and R keys) and erase the hard drive using Disk Utility, once you’ve unlocked it (by selecting the disk, clicking the File menu, and clicking Unlock). That deletes the FileVault key, which means any data on the drive is useless.
FileVault doesn’t impact the performance of most modern Macs, though I’d suggest only using it if your Mac has an SSD, not a conventional hard disk drive.
Securely Erasing Free Space on Your SSD
If you don’t want to take Apple’s word for it, if you’re not using FileVault, or if you just want to, there is a way to securely erase free space on your SSD. It’s a little more involved but it works.
Before we get into the nitty-gritty, let me state for the record that this really isn’t necessary to do, which is why Apple’s made it so hard to do. But if you’re set on it, you’ll need to use Apple’s Terminal app. Terminal provides you with command line interface access to the OS X operating system. Terminal lives in the Utilities folder, but you can access Terminal from the Mac’s Recovery System, as well. Once your Mac has booted into the Recovery partition, click the Utilities menu and select Terminal to launch it.
From a Terminal command line, type:
diskutil secureErase freespace VALUE /Volumes/DRIVE
That tells your Mac to securely erase the free space on your SSD. You’ll need to change VALUE to a number between 0 and 4. 0 is a single-pass run of zeroes; 1 is a single-pass run of random numbers; 2 is a 7-pass erase; 3 is a 35-pass erase; and 4 is a 3-pass erase. DRIVE should be changed to the name of your hard drive. To run a 7-pass erase of your SSD drive in “JohnB-Macbook”, you would enter the following:
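diskutil secureErase freespace 2 /Volumes/JohnB-Macbook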
And remember, if you used a space in the name of your Mac’s hard drive, you need to insert a leading backslash before the space. For example, to run a 35-pass erase on a hard drive called “Macintosh HD” you enter the following:
diskutil secureErase freespace 3 /Volumes/Macintosh\ HD
Something to remember is that the more extensive the erase procedure, the longer it will take.
When Erasing is Not Enough — How to Destroy a Drive
If you absolutely, positively need to be sure that all the data on a drive is irretrievable, see this Scientific American article (with contributions by Gleb Budman, Backblaze CEO), How to Destroy a Hard Drive — Permanently.
The adoption of Apache Spark has increased significantly over the past few years, and running Spark-based application pipelines is the new normal. Spark jobs that are in an ETL (extract, transform, and load) pipeline have different requirements—you must handle dependencies in the jobs, maintain order during executions, and run multiple jobs in parallel. In most of these cases, you can use workflow scheduler tools like Apache Oozie, Apache Airflow, and even Cron to fulfill these requirements.
Apache Oozie is a widely used workflow scheduler system for Hadoop-based jobs. However, its limited UI capabilities, lack of integration with other services, and heavy XML dependency might not be suitable for some users. On the other hand, Apache Airflow comes with a lot of neat features, along with powerful UI and monitoring capabilities and integration with several AWS and third-party services. However, with Airflow, you do need to provision and manage the Airflow server. The Cron utility is a powerful job scheduler. But it doesn’t give you much visibility into the job details, and creating a workflow using Cron jobs can be challenging.
What if you have a simple use case, in which you want to run a few Spark jobs in a specific order, but you don’t want to spend time orchestrating those jobs or maintaining a separate application? You can do that today in a serverless fashion using AWS Step Functions. You can create the entire workflow in AWS Step Functions and interact with Spark on Amazon EMR through Apache Livy.
In this post, I walk you through a list of steps to orchestrate a serverless Spark-based ETL pipeline using AWS Step Functions and Apache Livy.
Input data
For the source data for this post, I use the New York City Taxi and Limousine Commission (TLC) trip record data. For a description of the data, see this detailed dictionary of the taxi data. In this example, we’ll work mainly with the following three columns for the Spark jobs.
RateCodeID – Represents the rate code in effect at the end of the trip (for example, 1 for standard rate, 2 for JFK airport, 3 for Newark airport, and so on).
FareAmount – Represents the time-and-distance fare calculated by the meter.
TripDistance – Represents the elapsed trip distance in miles reported by the taxi meter.
The trip data is in comma-separated values (CSV) format with the first row as a header. To shorten the Spark execution time, I trimmed the large input data to only 20,000 rows. During the deployment phase, the input file tripdata.csv is stored in Amazon S3 in the <<your-bucket>>/emr-step-functions/input/ folder.
The following image shows a sample of the trip data:
Solution overview
The next few sections describe how Spark jobs are created for this solution, how you can interact with Spark using Apache Livy, and how you can use AWS Step Functions to create orchestrations for these Spark applications.
At a high level, the solution includes the following steps:
Trigger the AWS Step Functions state machine by passing the input file path.
The first stage in the state machine triggers an AWS Lambda function.
The Lambda function interacts with Apache Spark running on Amazon EMR using Apache Livy, and submits a Spark job.
The state machine waits a few seconds before checking the Spark job status.
Based on the job status, the state machine moves to the success or failure state.
Subsequent Spark jobs are submitted using the same approach.
The state machine waits a few seconds for the job to finish.
The job finishes, and the state machine updates with its final status.
Let’s take a look at the Spark application that is used for this solution.
Spark jobs
For this example, I built a Spark jar named spark-taxi.jar. It has two different Spark applications:
MilesPerRateCode – The first job that runs on the Amazon EMR cluster. This job reads the trip data from an input source and computes the total trip distance for each rate code. The output of this job consists of two columns and is stored in Apache Parquet format in the output path.
The following are the expected output columns:
rate_code – Represents the rate code for the trip.
total_distance – Represents the total trip distance for that rate code (for example, sum(trip_distance)).
RateCodeStatus – The second job that runs on the EMR cluster, but only if the first job finishes successfully. This job depends on two different input sets:
tripdata.csv – The same trip data that is used for the first Spark job.
miles-per-rate – The output of the first job.
This job first reads the tripdata.csv file and aggregates the fare_amount by the rate_code. After this point, you have two different datasets, both aggregated by rate_code. Finally, the job uses the rate_code field to join two datasets and output the entire rate code status in a single CSV file.
The output columns are as follows:
rate_code_id – Represents the rate code type.
total_distance – Derived from first Spark job and represents the total trip distance.
total_fare_amount – A new field that is generated during the second Spark application, representing the total fare amount by the rate code type.
Note that in this case, you don’t need to run two different Spark jobs to generate that output. The goal of setting up the jobs in this way is just to create a dependency between the two jobs and use them within AWS Step Functions.
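The actual jobs are packaged in spark-taxi.jar, but to make the logic concrete, here is a rough PySpark sketch of the equivalent aggregations. The column names and paths follow the descriptions above; this is not the code inside the jar, and the rootPath value is a placeholder.
from pyspark.sql import SparkSession
import pyspark.sql.functions as F

spark = SparkSession.builder.appName("TaxiRateCodeSketch").getOrCreate()
root_path = "s3://<your-bucket>"  # placeholder rootPath

trips = spark.read.csv(root_path + "/emr-step-functions/input/tripdata.csv",
                       header=True, inferSchema=True)

# Job 1 (MilesPerRateCode): total trip distance per rate code, written as Parquet.
miles_per_rate = trips.groupBy("RateCodeID").agg(
    F.sum("TripDistance").alias("total_distance"))
miles_per_rate.write.parquet(root_path + "/emr-step-functions/miles-per-rate")

# Job 2 (RateCodeStatus): total fare per rate code, joined with job 1's output.
fares = trips.groupBy("RateCodeID").agg(
    F.sum("FareAmount").alias("total_fare_amount"))
fares.join(miles_per_rate, "RateCodeID").write.csv(
    root_path + "/emr-step-functions/rate-code-status", header=True)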
Both Spark applications take one input argument called rootPath. It’s the S3 location where the Spark job is stored along with input and output data. Here is a sample of the final output:
The next section discusses how you can use Apache Livy to interact with Spark applications that are running on Amazon EMR.
Using Apache Livy to interact with Apache Spark
Apache Livy provides a REST interface to interact with Spark running on an EMR cluster. Livy is included in Amazon EMR release version 5.9.0 and later. In this post, I use Livy to submit Spark jobs and retrieve job status. When Amazon EMR is launched with Livy installed, the EMR master node becomes the endpoint for Livy, and it starts listening on port 8998 by default. Livy provides APIs to interact with Spark.
Let’s look at a couple of examples of how you can interact with Spark running on Amazon EMR using Livy.
To list active running jobs, you can execute the following from the EMR master node:
curl localhost:8998/sessions
If you want to do the same from a remote instance, just change localhost to the EMR hostname, as in the following (port 8998 must be open to that remote instance through the security group):
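curl http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:8998/sessions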
Through Spark submit, you can pass multiple arguments for the Spark job and Spark configuration settings. You can also do that using Livy, by passing the S3 path through the args parameter, as shown following:
curl -X POST --data '{"file": "s3://<<bucket-location>>/spark.jar", "className": "com.example.SparkApp", "args": ["s3://bucket-path"]}' -H "Content-Type: application/json" http://ec2-xx-xx-xx-xx.compute-1.amazonaws.com:8998/batches
All Apache Livy REST calls return a response as JSON, as shown in the following image:
If you want to pretty-print that JSON response, you can pipe the command’s output to Python’s JSON tool as follows:
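curl localhost:8998/sessions | python -m json.tool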
For a detailed list of Livy APIs, see the Apache Livy REST API page. This post uses GET /batches and POST /batches.
In the next section, you create a state machine and orchestrate Spark applications using AWS Step Functions.
Using AWS Step Functions to create a Spark job workflow
AWS Step Functions automatically triggers and tracks each step and retries when it encounters errors, so your application executes in order and as expected every time. To create a Spark job workflow using AWS Step Functions, you first create a state machine using different types of states to build the entire workflow.
First, you use the Task state—a simple state in AWS Step Functions that performs a single unit of work. You also use the Wait state to delay the state machine from continuing for a specified time. Later, you use the Choice state to add branching logic to a state machine.
The following is a quick summary of how to use different states in the state machine to create the Spark ETL pipeline:
Task state – Invokes a Lambda function. The first Task state submits the Spark job on Amazon EMR, and the next Task state is used to retrieve the previous Spark job status.
Wait state – Pauses the state machine until a job completes execution.
Choice state – Each Spark job execution can return a failure, an error, or a success state. So, in the state machine, you use the Choice state to create a rule that specifies the next action or step based on the success or failure of the previous step.
Here is one of my Task states, MilesPerRateCode, which simply submits a Spark job:
"MilesPerRate Job": {
"Type": "Task",
"Resource":"arn:aws:lambda:us-east-1:xxxxxx:function:blog-miles-per-rate-job-submit-function",
"ResultPath": "$.jobId",
"Next": "Wait for MilesPerRate job to complete"
}
This Task state configuration specifies the Lambda function to execute. Inside the Lambda function, it submits a Spark job through Livy using Livy’s POST API. Using ResultPath, it tells the state machine where to place the result of the executing task. As discussed in the previous section, Spark submit returns the session ID, which is captured with $.jobId and used in a later state.
The following code section shows the Lambda function, which is used to submit the MilesPerRateCode job. It uses the Python requests library to submit a POST request against the Livy endpoint hosted on Amazon EMR and passes the required parameters in JSON format through the payload. It then parses the response, grabs the id from the response, and returns it. The Next field tells the state machine which state to go to next.
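As a rough sketch of what that function does: the Livy endpoint, jar path, and class name below are placeholders based on the spark-taxi.jar application described earlier, not the exact values from the CloudFormation deployment.
import json
import requests

LIVY_BATCHES_URL = "http://<emr-master-dns>:8998/batches"  # placeholder endpoint

def lambda_handler(event, context):
    # Build the Livy batch request from the state machine input.
    payload = {
        "file": event["rootPath"] + "/emr-step-functions/spark-taxi.jar",
        "className": "com.example.MilesPerRateCode",  # placeholder class name
        "args": [event["rootPath"]]
    }
    headers = {"Content-Type": "application/json"}
    response = requests.post(LIVY_BATCHES_URL, data=json.dumps(payload), headers=headers)
    # Livy returns the new batch session as JSON; its id is what later states
    # use (via ResultPath "$.jobId") to poll the job status.
    return response.json()["id"]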
Just like in the MilesPerRate job, another state submits the RateCodeStatus job, but it executes only when all previous jobs have completed successfully.
Here is the Task state in the state machine that checks the Spark job status:
Just like other states, the preceding Task executes a Lambda function, captures the result (represented by jobStatus), and passes it to the next state. The following is the Lambda function that checks the Spark job status based on a given session ID:
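Again, as a rough sketch of that status-check function: the endpoint is a placeholder, and I’m assuming the batch id arrives in the event as jobId, per the ResultPath shown above.
import requests

LIVY_BATCHES_URL = "http://<emr-master-dns>:8998/batches"  # placeholder endpoint

def lambda_handler(event, context):
    # The submit state stored the Livy batch id at $.jobId.
    job_id = event["jobId"]
    response = requests.get("{}/{}".format(LIVY_BATCHES_URL, job_id))
    # Livy reports states such as "running", "success", or "dead"; the Choice
    # state compares this value to decide the next transition.
    return response.json()["state"]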
In the Choice state, it checks the Spark job status value, compares it with a predefined state status, and transitions the state based on the result. For example, if the status is success, move to the next state (RateCodeJobStatus job), and if it is dead, move to the MilesPerRate job failed state.
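In Amazon States Language, that Choice state could look roughly like the following sketch; the state names mirror the flow described above and may differ slightly from the deployed state machine.
"MilesPerRate Job Status": {
  "Type": "Choice",
  "Choices": [
    {
      "Variable": "$.jobStatus",
      "StringEquals": "success",
      "Next": "RateCodeStatus Job"
    },
    {
      "Variable": "$.jobStatus",
      "StringEquals": "dead",
      "Next": "MilesPerRate job failed"
    }
  ],
  "Default": "Wait for MilesPerRate job to complete"
}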
To set up this entire solution, you need to create a few AWS resources. To make it easier, I have created an AWS CloudFormation template. This template creates all the required AWS resources and configures all the resources that are needed to create a Spark-based ETL pipeline on AWS Step Functions.
This CloudFormation template requires you to pass the following four parameters during initiation.
ClusterSubnetID – The subnet where the Amazon EMR cluster is deployed; Lambda is configured to talk to this subnet.
KeyName – The name of the existing EC2 key pair to access the Amazon EMR cluster.
VPCID – The ID of the virtual private cloud (VPC) where the EMR cluster is deployed; Lambda is configured to talk to this VPC.
S3RootPath – The Amazon S3 path where all required files (input file, Spark job, and so on) are stored and the resulting data is written.
IMPORTANT: These templates are designed only to show how you can create a Spark-based ETL pipeline on AWS Step Functions using Apache Livy. They are not intended for production use without modification. And if you try this solution outside of the us-east-1 Region, download the necessary files from s3://aws-data-analytics-blog/emr-step-functions, upload the files to the buckets in your Region, edit the script as appropriate, and then run it.
To launch the CloudFormation stack, choose Launch Stack:
Launching this stack creates the following list of AWS resources.
StepFunctionsStateExecutionRole (IAM role) – IAM role to execute the state machine and have a trust relationship with the states service.
SparkETLStateMachine (AWS Step Functions state machine) – State machine in AWS Step Functions for the Spark ETL workflow.
LambdaSecurityGroup (Amazon EC2 security group) – Security group that is used for the Lambda function to call the Livy API.
RateCodeStatusJobSubmitFunction (AWS Lambda function) – Lambda function to submit the RateCodeStatus job.
MilesPerRateJobSubmitFunction (AWS Lambda function) – Lambda function to submit the MilesPerRate job.
SparkJobStatusFunction (AWS Lambda function) – Lambda function to check the Spark job status.
LambdaStateMachineRole (IAM role) – IAM role for all Lambda functions to use the lambda trust relationship.
EMRCluster (Amazon EMR cluster) – EMR cluster where Livy is running and where the job is placed.
During the AWS CloudFormation deployment phase, the template sets up the S3 paths for input and output. Input files are stored in the <<s3-root-path>>/emr-step-functions/input/ path, whereas spark-taxi.jar is copied under <<s3-root-path>>/emr-step-functions/.
The following screenshot shows how the S3 paths are configured after deployment. In this example, I passed a bucket that I created in the AWS account s3://tm-app-demos for the S3 root path.
If the CloudFormation template completed successfully, you will see Spark-ETL-State-Machine in the AWS Step Functions dashboard, as follows:
Choose the Spark-ETL-State-Machine state machine to take a look at this implementation. The AWS CloudFormation template built the entire state machine along with its dependent Lambda functions, which are now ready to be executed.
On the dashboard, choose the newly created state machine, and then choose New execution to initiate the state machine. It asks you to pass input in JSON format. This input goes to the first state MilesPerRate Job, which eventually executes the Lambda function blog-miles-per-rate-job-submit-function.
Pass the S3 root path as input:
{
  "rootPath": "s3://tm-app-demos"
}
Then choose Start Execution:
The rootPath value is the same value that was passed when creating the CloudFormation stack. It can be an S3 bucket location or a bucket with prefixes, but it should be the same value that is used for AWS CloudFormation. This value tells the state machine where it can find the Spark jar and input file, and where it will write output files. After the state machine starts, each state/task is executed based on its definition in the state machine.
At a high level, the following represents the flow of events:
Execute the first Spark job, MilesPerRate.
The Spark job reads the input file from the location <<rootPath>>/emr-step-functions/input/tripdata.csv. If the job finishes successfully, it writes the output data to <<rootPath>>/emr-step-functions/miles-per-rate.
If the Spark job fails, it transitions to the error state MilesPerRate job failed, and the state machine stops. If the Spark job finishes successfully, it transitions to the RateCodeStatus Job state, and the second Spark job is executed.
If the second Spark job fails, it transitions to the error state RateCodeStatus job failed, and the state machine stops with the Failed status.
If this Spark job completes successfully, it writes the final output data to the <<rootPath>>/emr-step-functions/rate-code-status/ path. It also transitions to the RateCodeStatus job finished state, and the state machine ends its execution with the Success status.
This following screenshot shows a successfully completed Spark ETL state machine:
The right side of the state machine diagram shows the details of individual states with their input and output.
When you execute the state machine for the second time, it fails because the S3 path already exists. The state machine turns red and stops at MilesPerRate job failed. The following image represents that failed execution of the state machine:
You can also check your Spark application status and logs by going to the Amazon EMR console and viewing the Application history tab:
I hope this walkthrough paints a picture of how you can create a serverless solution for orchestrating Spark jobs on Amazon EMR using AWS Step Functions and Apache Livy. In the next section, I share some ideas for making this solution even more elegant.
Next steps
The goal of this post is to show a simple example that uses AWS Step Functions to create an orchestration for Spark-based jobs in a serverless fashion. To make this solution robust and production ready, you can explore the following options:
In this example, I manually initiated the state machine by passing the rootPath as input. You can instead trigger the state machine automatically. To run the ETL pipeline as soon as the files arrive in your S3 bucket, you can pass the new file path to the state machine. Because CloudWatch Events supports AWS Step Functions as a target, you can create a CloudWatch rule for an S3 event. You can then set AWS Step Functions as a target and pass the new file path to your state machine. You’re all set!
You can also improve this solution by adding an alerting mechanism in case of failures. To do this, create a Lambda function that sends an alert email and assign that Lambda function to a Fail state. That way, when any part of your state machine fails, it triggers an email and notifies the user.
If you want to submit multiple Spark jobs in parallel, you can use the Parallel state type in AWS Step Functions. The Parallel state is used to create parallel branches of execution in your state machine.
With Lambda and AWS Step Functions, you can create a very robust serverless orchestration for your big data workload.
Cleaning up
When you’ve finished testing this solution, remember to clean up all those AWS resources that you created using AWS CloudFormation. Use the AWS CloudFormation console or AWS CLI to delete the stack named Blog-Spark-ETL-Step-Functions.
Summary
In this post, I showed you how to use AWS Step Functions to orchestrate your Spark jobs that are running on Amazon EMR. You used Apache Livy to submit jobs to Spark from a Lambda function and created a workflow for your Spark jobs, maintaining a specific order for job execution and triggering different AWS events based on your job’s outcome. Go ahead—give this solution a try, and share your experience with us!
Tanzir Musabbir is an EMR Specialist Solutions Architect with AWS. He is an early adopter of open source Big Data technologies. At AWS, he works with our customers to provide them architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena & AWS Glue. Tanzir is a big Real Madrid fan and he loves to travel in his free time.
Thanks to Greg Eppel, Sr. Solutions Architect, Microsoft Platform for this great blog that describes how to create a custom CodeBuild build environment for the .NET Framework. — AWS CodeBuild is a fully managed build service that compiles source code, runs tests, and produces software packages that are ready to deploy. CodeBuild provides curated build environments for programming languages and runtimes such as Android, Go, Java, Node.js, PHP, Python, Ruby, and Docker. CodeBuild now supports builds for the Microsoft Windows Server platform, including a prepackaged build environment for .NET Core on Windows. If your application uses the .NET Framework, you will need to use a custom Docker image to create a custom build environment that includes the Microsoft proprietary Framework Class Libraries. For information about why this step is required, see our FAQs. In this post, I’ll show you how to create a custom build environment for .NET Framework applications and walk you through the steps to configure CodeBuild to use this environment.
Build environments are Docker images that include a complete file system with everything required to build and test your project. To use a custom build environment in a CodeBuild project, you build a container image for your platform that contains your build tools, push it to a Docker container registry such as Amazon Elastic Container Registry (Amazon ECR), and reference it in the project configuration. When it builds your application, CodeBuild retrieves the Docker image from the container registry specified in the project configuration and uses the environment to compile your source code, run your tests, and package your application.
Step 1: Launch EC2 Windows Server 2016 with Containers
In the Amazon EC2 console, in your region, launch an Amazon EC2 instance from a Microsoft Windows Server 2016 Base with Containers AMI.
Increase disk space on the boot volume to at least 50 GB to account for the larger size of containers required to install and run Visual Studio Build Tools.
Run the following command in that directory. This process can take a while; it depends on the size of the EC2 instance you launched. In my tests, a t2.2xlarge takes less than 30 minutes to build the image and produces an image of approximately 15 GB.
docker build -t buildtools2017:latest -m 2GB .
Run the following command to test the container and start a command shell with all the developer environment variables:
docker run -it buildtools2017
Create a repository in the Amazon ECS console. For the repository name, type buildtools2017. Choose Next step and then complete the remaining steps.
Execute the following command to generate authentication details that the local Docker engine can use with our registry. Make sure you have permissions to the Amazon ECR registry before you execute the command.
aws ecr get-login
In the same command prompt window, copy and paste the following commands:
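# After running the docker login command that aws ecr get-login printed, tag and
# push the image; replace <aws-account-id> and <region> with your own values.
docker tag buildtools2017:latest <aws-account-id>.dkr.ecr.<region>.amazonaws.com/buildtools2017:latest
docker push <aws-account-id>.dkr.ecr.<region>.amazonaws.com/buildtools2017:latest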
In the CodeCommit console, create a repository named DotNetFrameworkSampleApp. On the Configure email notifications page, choose Skip.
Clone a .NET Framework Docker sample application from GitHub. The repository includes a sample ASP.NET Framework application that we’ll use to demonstrate our custom build environment. On the EC2 instance, open a command prompt and execute the following commands:
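# A rough sketch of the flow; <github-sample-repo-url> and <region> are placeholders.
git clone <github-sample-repo-url> DotNetFrameworkSampleApp
cd DotNetFrameworkSampleApp
git remote add codecommit https://git-codecommit.<region>.amazonaws.com/v1/repos/DotNetFrameworkSampleApp
git push codecommit master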
Navigate to the CodeCommit repository and confirm that the files you just pushed are there.
Step 4: Configure build spec
To build your .NET Framework application with CodeBuild, you use a build spec, which is a collection of build commands and related settings, in YAML format, that AWS CodeBuild uses to run a build. You can include a build spec as part of the source code, or you can define a build spec when you create a build project. In this example, I include a build spec as part of the source code.
In the root directory of your source directory, create a YAML file named buildspec.yml.
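The exact commands depend on your solution. As a rough sketch, assuming the Visual Studio Build Tools image built earlier, NuGet available on the path, and a placeholder solution file name, the buildspec might look like this:
version: 0.2
phases:
  build:
    commands:
      - nuget restore
      - msbuild DotNetFrameworkSampleApp.sln /p:Configuration=Release  # placeholder solution name
artifacts:
  files:
    - '**/*'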
At this point, we have a Docker image with Visual Studio Build Tools installed and stored in the Amazon ECR registry. We also have a sample ASP.NET Framework application in a CodeCommit repository. Now we are going to set up CodeBuild to build the ASP.NET Framework application.
In the Amazon ECR console, choose the repository that was pushed earlier with the docker push command. On the Permissions tab, choose Add.
For Source Provider, choose AWS CodeCommit, and then choose the DotNetFrameworkSampleApp repository that you created earlier.
For Environment Image, choose Specify a Docker image.
For Environment type, choose Windows.
For Custom image type, choose Amazon ECR.
For Amazon ECR repository, choose the Docker image with the Visual Studio Build Tools installed, buildtools2017. Your configuration should look like the image below:
Choose Continue and then Save and Build to create your CodeBuild project and start your first build. You can monitor the status of the build in the console. You can also configure notifications that will notify subscribers whenever builds succeed, fail, go from one phase to another, or any combination of these events.
Summary
CodeBuild supports a number of platforms and languages out of the box. By using custom build environments, it can be extended to other runtimes. In this post, I showed you how to build a .NET Framework environment on a Windows container and demonstrated how to use it to build .NET Framework applications in CodeBuild.
We’re excited to see how customers extend and use CodeBuild to enable continuous integration and continuous delivery for their Windows applications. Feel free to share what you’ve learned extending CodeBuild for your own projects. Just leave questions or suggestions in the comments.