Enabling two-factor authentication (2FA) to boost security for your important accounts is becoming a lot more common these days. However you might be surprised to learn that you can do the same with your Raspberry Pi. You can enable 2FA on Raspberry Pi, and afterwards you’ll be challenged for a verification code when you access it remotely via Secure Shell (SSH).
Accessing your Raspberry Pi via SSH
A lot of people use a Raspberry Pi at home as a file, or media, server. This is has become rather common with the launch of Raspberry Pi 4, which has both USB 3 and Gigabit Ethernet. However, when you’re setting up this sort of server you often want to run it “headless”; without a monitor, keyboard, or mouse. This is especially true if you intend tuck your Raspberry Pi away behind your television, or somewhere else out of the way. In any case, it means that you are going to need to enable Secure Shell (SSH) for remote access.
However, it’s also pretty common to set up your server so that you can access your files when you’re away from home, making your Raspberry Pi accessible from the Internet.
Most of us aren’t going to be out of the house much for a while yet, but if you’re taking the time right now to build a file server, you might want to think about adding some extra security. Especially if you intend to make the server accessible from the Internet, you probably want to enable two-factor authentication (2FA) using Time-based One-Time Password (TOTP).
What is two-factor authentication?
Two-factor authentication is an extra layer of protection. As well as a password, “something you know,” you’ll need another piece of information to log in. This second factor will be based either on “something you have,” like a smart phone, or on “something you are,” like biometric information.
We’re going to go ahead and set up “something you have,” and use your smart phone as the second factor to protect your Raspberry Pi.
Updating the operating system
The first thing you should do is make sure your Raspberry Pi is up to date with the latest version of Raspbian. If you’re running a relatively recent version of the operating system you can do that from the command line:
$ sudo apt-get update
$ sudo apt-get full-upgrade
If you’re pulling your Raspberry Pi out of a drawer for the first time in a while, though, you might want to go as far as to install a new copy of Raspbian using the new Raspberry Pi Imager, so you know you’re working from a good image.
Enabling Secure Shell
The Raspbian operating system has the SSH server disabled on boot. However, since we’re intending to run the board without a monitor or keyboard, we need to enable it if we want to be able to SSH into our Raspberry Pi.
The easiest way to enable SSH is from the desktop. Go to the Raspbian menu and select “Preferences > Raspberry Pi Configuration”. Next, select the “Interfaces” tab and click on the radio button to enable SSH, then hit “OK.”
You can also enable it from the command line using systemctl:
Alternatively, you can enable SSH using raspi-config, or, if you’re installing the operating system for the first time, you can enable SSH as you burn your SD Card.
Enabling challenge-response
Next, we need to tell the SSH daemon to enable “challenge-response” passwords. Go ahead and open the SSH config file:
$ sudo nano /etc/ssh/sshd_config
Enable challenge response by changing ChallengeResponseAuthentication from the default no to yes.
Editing /etc/ssh/ssd_config.
Then restart the SSH daemon:
$ sudo systemctl restart ssh
It’s good idea to open up a terminal on your laptop and make sure you can still SSH into your Raspberry Pi at this point — although you won’t be prompted for a 2FA code quite yet. It’s sensible to check that everything still works at this stage.
Installing two-factor authentication
The first thing you need to do is download an app to your phone that will generate the TOTP. One of the most commonly used is Google Authenticator. It’s available for Android, iOS, and Blackberry, and there is even an open source version of the app available on GitHub.
Google Authenticator in the App Store.
So go ahead and install Google Authenticator, or another 2FA app like Authy, on your phone. Afterwards, install the Google Authenticator PAM module on your Raspberry Pi:
$ sudo apt install libpam-google-authenticator
Now we have 2FA installed on both our phone, and our Raspberry Pi, we’re ready to get things configured.
Configuring two-factor authentication
You should now run Google Authenticator from the command line — without using sudo — on your Raspberry Pi in order to generate a QR code:
$ google-authenticator
Afterwards you’re probably going to have to resize the Terminal window so that the QR code is rendered correctly. Unfortunately, it’s just slightly wider than the standard 80 characters across.
The QR code generated by google-authenticator. Don’t worry, this isn’t the QR code for my key; I generated one just for this post that I didn’t use.
Don’t move forward quite yet! Before you do anything else you should copy the emergency codes and put them somewhere safe.
These codes will let you access your Raspberry Pi — and turn off 2FA — if you lose your phone. Without them, you won’t be able to SSH into your Raspberry Pi if you lose or break the device you’re using to authenticate.
Next, before we continue with Google Authenticator on the Raspberry Pi, open the Google Authenticator app on your phone and tap the plus sign (+) at the top right, then tap on “Scan barcode.”
Your phone will ask you whether you want to allow the app access to your camera; you should say “Yes.” The camera view will open. Position the barcode squarely in the green box on the screen.
Scanning the QR code with the Google Authenticator app.
As soon as your phone app recognises the QR code it will add your new account, and it will start generating TOTP codes automatically.
The TOTP in Google Authenticator app.
Your phone will generate a new one-time password every thirty seconds. However, this code isn’t going to be all that useful until we finish what we were doing on your Raspberry Pi. Switch back to your terminal window and answer “Y” when asked whether Google Authenticator should update your .google_authenticator file.
Then answer “Y” to disallow multiple uses of the same authentication token, “N” to increasing the time skew window, and “Y” to rate limiting in order to protect against brute-force attacks.
You’re done here. Now all we have to do is enable 2FA.
Enabling two-factor authentication
We’re going to use Linux Pluggable Authentication Modules (PAM), which provides dynamic authentication support for applications and services, to add 2FA to SSH on Raspberry Pi.
Now we need to configure PAM to add 2FA:
$ sudo nano /etc/pam.d/sshd
Add auth required pam_google_authenticator.so to the top of the file. You can do this either above or below the line that says @include common-auth.
Editing /etc/pam.d/sshd.
As I prefer to be prompted for my verification code after entering my password, I’ve added this line after the @include line. If you want to be prompted for the code before entering your password you should add it before the @include line.
Now restart the SSH daemon:
$ sudo systemctl restart ssh
Next, open up a terminal window on your laptop and try and SSH into your Raspberry Pi.
Wrapping things up
If everything has gone to plan, when you SSH into the Raspberry Pi, you should be prompted for a TOTP after being prompted for your password.
SSH’ing into my Raspberry Pi.
You should go ahead and open Google Authenticator on your phone, and enter the six-digit code when prompted. Then you should be logged into your Raspberry Pi as normal.
You’ll now need your phone, and a TOTP, every time you ssh into, or scp to and from, your Raspberry Pi. But because of that, you’ve just given a huge boost to the security of your device.
Now you have the Google Authenticator app on your phone, you should probably start enabling 2FA for your important services and sites — like Google, Twitter, Amazon, and others — since most bigger sites, and many smaller ones, now support two-factor authentication.
Almost exactly two years ago, we launched Cloudflare Spectrum for our Enterprise customers. Today, we’re thrilled to extend DDoS protection and traffic acceleration with Spectrum for SSH, RDP, and Minecraft to our Pro and Business plan customers.
When we think of Cloudflare, a lot of the time we think about protecting and improving the performance of websites. But the Internet is so much more, ranging from gaming, to managing servers, to cryptocurrencies. How do we make sure these applications are secure and performant?
With Spectrum, you can put Cloudflare in front of your SSH, RDP and Minecraft services, protecting them from DDoS attacks and improving network performance. This allows you to protect the management of your servers, not just your website. Better yet, by leveraging the Cloudflare network you also get increased reliability and increased performance: lower latency!
Remote access to servers
While access to websites from home is incredibly important, being able to remotely manage your servers can be equally critical. Losing access to your infrastructure can be disastrous: people need to know their infrastructure is safe and connectivity is good and performant. Usually, server management is done through SSH (Linux or Unix based servers) and RDP (Windows based servers). With these protocols, performance and reliability are key: you need to know you can always reliably manage your servers and that the bad guys are kept out. What’s more, low latency is really important. Every time you type a key in an SSH terminal or click a button in a remote desktop session, that key press or button click has to traverse the Internet to your origin before the server can process the input and send feedback. While increasing bandwidth can help, lowering latency can help even more in getting your sessions to feel like you’re working on a local machine and not one half-way across the globe.
All work and no play makes Jack Steve a dull boy
While we stay at home, many of us are also looking to play and not only work. Video games in particular have seen a huge increase in popularity. As personal interaction becomes more difficult to come by, Minecraft has become a popular social outlet. Many of us at Cloudflare are using it to stay in touch and have fun with friends and family in the current age of quarantine. And it’s not just employees at Cloudflare that feel this way, we’ve seen a big increase in Minecraft traffic flowing through our network. Traffic per week had remained steady for a while but has more than tripled since many countries have put their citizens in lockdown:
Minecraft is a particularly popular target for DDoS attacks: it’s not uncommon for people to develop feuds whilst playing the game. When they do, some of the more tech-savvy players of this game opt to take matters into their own hands and launch a (D)DoS attack, rendering it unusable for the duration of the attacks. Our friends at Hypixel and Nodecraft have known this for many years, which is why they’ve chosen to protect their servers using Spectrum.
While we love recommending their services, we realize some of you prefer to run your own Minecraft server on a VPS (virtual private server like a DigitalOcean droplet) that you maintain. To help you protect your Minecraft server, we’re providing Spectrum for Minecraft as well, available on Pro and Business plans. You’ll be able to use the entire Cloudflare network to protect your server and increase network performance.
How does it work?
Configuring Spectrum is easy, just log into your dashboard and head on over to the Spectrum tab. From there you can choose a protocol and configure the IP of your server:
After that all you have to do is use the subdomain you configured to connect instead of your IP. Traffic will be proxied using Spectrum on the Cloudflare network, keeping the bad guys out and your services safe.
So how much does this cost? We’re happy to announce that all paid plans will get access to Spectrum for free, with a generous free data allowance. Pro plans will be able to use SSH and Minecraft, up to 5 gigabytes for free each month. Biz plans can go up to 10 gigabytes for free and also get access to RDP. After the free cap you will be billed on a per gigabyte basis.
Spectrum is complementary to Access: it offers DDoS protection and improved network performance as a ‘drop-in’ product, no configuration necessary on your origins. If you want more control over who has access to which services, we highly recommend taking a look at Cloudflare for Teams.
We’re very excited to extend Cloudflare’s services to not just HTTP traffic, allowing you to protect your core management services and Minecraft gaming servers. In the future, we’ll add support for more protocols. If you have a suggestion, let us know! In the meantime, if you have a Pro or Business account, head on over to the dashboard and enable Spectrum today!
Have you ever considered attaching your Raspberry Pi 4 to an Apple iPad Pro? How would you do it, and why would you want to? Here’s YouTuber Tech Craft to explain why Raspberry Pi 4 is their favourite iPad Pro accessory, and why you may want to consider using yours in the same way.
We’ve set the video to start at Tech Craft’s explanation.
The Raspberry Pi 4 is my favourite accessory to use with the iPad Pro. In this video, learn more about what the Pi can do, what gear you need to get running with one, how to connect it to your iPad and what you’ll find it useful for.
Having installed Raspbian on Raspberry Pi and configured the computer to use USB-C as an Ethernet connection (read Ben Hardill’s guide to find out how to do this), Tech Craft could select it as an Ethernet device in the iPad’s Settings menu.
So why would you want to connect your Raspberry Pi 4 to your iPad? For starters, using your iPad instead of a conventional HDMI monitor will free up desk space, and also allow you to edit your code on the move. And when you’ve connected the two devices like this, you don’t need a separate power lead for Raspberry Pi, because the iPad powers the computer. So this setup is perfect for train or plane journeys, or for that moment when your robot stops working at a Raspberry Jam, or for maker conventions.
You can also use Raspberry Pi as a bridge between your iPad and portable hard drive, for disk management.
Tech Craft uses the SSH client Blink to easily connect to their Raspberry Pi via its fixed IP address, and with Juno Connect, they connect to a running Jupyter instance on their Raspberry Pi to do data science work.
For more information on using Raspberry Pi with an iPad, make sure you watch the whole video. And, because you’re a lovely person, be sure to subscribe to Tech Craft for more videos, such as this one on how to connect wirelessly to your Raspberry Pi from any computer or tablet:
Following on my from earlier video about pairing the Raspberry Pi 4 with the iPad Pro over USB-C, this video show how to pair any iPad (or iPhone, or Android tablet) with a Pi4 or a Pi3 over WiFi.
AWS Secrets Manager provides full lifecycle management for secrets within your environment. In this post, Maitreya and I will show you how to use Secrets Manager to store, deliver, and rotate SSH keypairs used for communication within compute clusters. Rotation of these keypairs is a security best practice, and sometimes a regulatory requirement. Traditionally, these keypairs have been associated with a number of tough challenges. For example, synchronizing key rotation across all compute nodes, enable detailed logging and auditing, and manage access to users in order to modify secrets.
However, rotating the keypair on all compute clusters’ nodes must be done in a tightly coordinated fashion, and failures generally result in availability risks. Moreover, the keypairs themselves are highly sensitive security credentials which must be carefully controlled with fine-grain access controls, detailed monitoring, and audit logging. These are precisely the types of tough challenges that AWS Secrets Manger solves for you.
In this post, we’ll show you how to secure, rotate, and use SSH keypairs for inter-cluster communication. You’ll use an AWS CloudFormation template to launch a cluster and configure Secrets Manager. Then we’ll show you how to use Secrets Manager to deliver the keypair to the cluster and use it for management operations, such as securely copying a file between nodes. Finally, we’ll use Secrets Manager to seamlessly rotate the keypair used by the cluster without any changes or outages. In this post, we’ve highlighted compute clusters, but you can use Secrets Manager to apply this solution directly to any SSH based use-case.
Solution overview
The following architecture diagram presents an overview of the solution:
Figure 1: Solution architecture
The sample architecture created by CloudFormation includes one master node, three worker nodes, AWS Secret Manager—which utilizes a rotation AWS Lambda function—and AWS Systems Manager. Setting up the cluster is out of scope for this post; in our walkthrough, we’ll focus on the keypair rotation architecture.
Secrets Manager uses staging labels to identify different versions of a secret during rotation. A staging label is a text string. For example, by default, AWSCURRENT is attached to the current version of the secret, while AWSPENDING will be attached to new versions of the secret before they have been verified and deployed to corresponding resources.
As shown in the diagram:
A secret is created in AWS Secrets Manager. The secret holds the SSH keypair that the master node will use to connect to the other nodes in the cluster. Upon keypair rotation, Secrets Manager will invoke a Lambda function (labeled 1.a in the diagram). The Lambda function will perform four steps:
1.b: createSecret – create a new SSH keypair and store the private key as a new version of the secret.
1.c: setSecret – label the newly created secret version with the label AWSPENDING and copy the public key to the worker nodes with AWS Systems Manager Run Command.
The Lambda function will also perform two steps not shown in the diagram:
testSecret – verify that the new SSH keypair has been successfully deployed by invoking a test SSH connection.
finishSecret – set the staging label AWSCURRENT to the new secret version and remove the old keys from the worker nodes. This will also set the staging label AWSPREVIOUS to the old secret, allowing your administrator to have the ‘last known password’ if something goes wrong.
An overview of the rotation Lambda function is available in the AWS Secrets Manager user guide. You have full control over the rotation function so that you can customize it to your needs. Note that no key is installed on the master node. Instead, the function will retrieve the private key from Secrets Manager only when it needs to securely communicate with the worker nodes. That private key is not saved on the master node’s filesystem but rather in volatile memory (per best practice, the private key variable is overwritten after successful authentication and deleted before the script exits); details about keeping secret data in volatile memory will follow later in this post.
When the master node needs to communicate with any worker node, it will use an AWS SDK (Python Boto3) to read the SSH private key from Secrets Manager (2.a) and use the private key to establish an SSH tunnel with the worker nodes (2.b). The master node is authorized to read the private key from Secrets Manager because an AWS Identity and Access Management (IAM) role with a policy that allows it to access the secret is attached to the master node. The corresponding public key was deployed to each of the worker nodes during the rotation process in step one above.
The secrets in Secrets Managers are encrypted with AWS Key Management System (KMS), and every version of the secret is encrypted with a unique data encryption key. The SSH key pair in the cluster will periodically rotate based on a configurable rotation interval, which you’ll configure from the Secrets Manager console later in this post. Each rotation repeats the process described in steps 1-2, resulting in a new version of the secret. Each new version will be encrypted using a new KMS data key, which provides an extra layer of security.
The AWS Systems Manager Run Command will use the Amazon Elastic Compute Cloud (EC2) tag RotateSSHKeys with a value of True to identify the cluster’s worker node instances. Note that if you rely on tags as a security control, you must have clear governance and control over which users are able to change the tags and tag values on your EC2 instances.
Solution cost
Today, this solution will cost $0.48 an hour for the four T2.micro EC2 instances that comprise the sample cluster. Secrets manager has a 30-day trial period, after which one secret will cost $0.40 per month and $0.05 per 10,000 API calls. There is no additional charge for AWS Systems Manager.
Deploying the sample solution
In this section, you’ll deploy a test stack that demonstrates the entire solution. After deployment, you’ll log in to the master node and securely copy a file to one of the worker nodes. Finally, you’ll use Secrets Manager to rotate and deploy a new SSH keypair. The CloudFormation templates and secret rotation code are available in the AWS GitHub repository.
Set up the sample deployment by selecting the AWS CloudFormation Launch Stack button bellow; by default, the stack will be deployed in the us-east-1 (N. Virginia) Region.
Select your EC2 SSH key pair and input your IP range as stack parameters. In the YourIPRange field, enter the CIDR of your machine or network only, as this ensures only hosts from your network can access the master server. You may leave all other parameters as default. This CloudFormation template launches four t2.micro instances in a new VPC. One instance will be tagged as MasterServer and the rest will be tagged WorkerServer1-3.
Note: The SSH keypair referenced here will be used to connect from your local computer to the master node. It is distinct from the SSH keypair used by the master node to connect to the worker nodes.
Figure 2: Enter the CIDR of your machine or network
Important: For simplicity, the master node you’ll create in this walkthrough will be in a public subnet, making it accessible from the CIDR you provided in Step 2. However, this is not the most secure approach possible. Follow the guidance in the Amazon EC2 VPC documentation to securely configure your cluster in a private subnet following the “defense in depth” principal.
Monitor the status of the stack. When the status is CREATE_COMPLETE, the deployment is ready. Select the Outputs tab to find information about the newly created resources, and write down the master node’s public DNS and a worker node IP address. You’ll need both later in this post.
Select the Launch Stack button to launch the AWS CloudFormation template that will deploy the Lambda function used by Secrets Manager, Accept the default values for the parameters. This template is designed for reusability; it can be applied to any SSH rotation use-case.
Next, create and configure a new secret from the Secrets Manager console to store the cluster communication SSH keypair.
Configuring a secret in AWS Secrets Manager
The CloudFormation template did not deploy a secret, so follow these steps to create a secret from the console and rotation function configuration. To create a new secret:
Open the AWS Secrets Manager console and select Store New Secret.
Select Other type of secrets, then select the Plaintext tab.
As shown in Figure 3, enter {} to create an empty JSON value with no properties. This value will be initially populated with a keypair by the rotation Lambda function.
Figure 3: Create an empty JSON value with no properties
Keep the default encryption key and select Next. We’re keeping the default encryption key for the sake of simplicity in this example, but security best practices suggest using a Customer Master Key (CMK) that you’ve created.
In Step 2: Name and description, name the secret /dev/ssh. The path of a secret can be used in the secret’s IAM policy to restrict users and roles to a secret or hierarchy of secrets. For example, the IAM policy could include /dev/* or /prod/* to control access to secrets in development or production, respectively.
Add a description, then select Next.
Figure 4: Add a description
In Step 3: Configure rotation, choose Enable automatic rotation and enable a rotation interval of your choice, which you can configure using the rotation interval dropdown list.
Select the Choose an AWS Lambda function drop-down and choose RotateSSH. This is the Lambda function that was deployed by the CloudFormation template.
Select Next, then review your configuration and select Store. When the new secrets configuration is stored, the rotation Lambda function is immediately invoked, populating the value of the secret.
Figure 5: Configure the rotation
Testing the sample solution
With the secret configuration completed and the instances up and running, you’re now going to securely copy a file from the master node to one of the worker nodes, using the SSH key stored in Secrets Manager to test the solution.
Log in to the master node via SSH, using the EC2 key that you specified in the CloudFormation template.
Once connected, securely copy a file from the master node to the worker node using SCP (secure copy protocol) by entering the command below. Replace <private-ip-of-worker> with the worker node IP you copied down in step 3:
Figure 6 shows ssh login to master node, and the copy_file.py command to worker node.
Figure 6: The ssh login to master node, and the copy_file.py command
During execution, the python script will use the Secrets Manager get_secret_value API to retrieve the secret, which includes the private key. It will then use this key to establish a secure SSH connection with the worker nodes, without saving the private key on any master node storage.
You can review the copy_file.py on the master node or on GitHub. In the get_private_key() function, you can read the secret value, which includes the private key:
In the copy_file function, create a secured SSH tunnel to copy a file using the private key from memory, using Paramiko, a Python implementation of SSHv2.
private_key_str = io.StringIO()
# Write private key to a memory file
private_key_str.write(private_key)
# Create key object
key = paramiko.RSAKey.from_private_key(private_key_str)
# Open a channel and authenticate
trans = paramiko.Transport(ip, 22)
trans.start_client()
trans.auth_publickey(user, key)
del key
To demonstrate the rotation of the SSH keypair, you’ll now manually invoke the rotation function:
Return to the Secrets Manager console, select your /dev/ssh secret, and choose Retrieve Secret Value to see the key pair.
Select Rotate secret immediately. In the pop-up window, confirm your choice by selecting Rotate.
Figure 7: Set the “Secret value” and “Rotation configuration”
Choose Rotate again to complete the rotation.
Figure 8: Select “Rotate”
Select the Close button to refresh the view, and then choose Retrieve Secret Value again.
Once the rotation has completed, you can inspect the new keypair via the Secrets Manager console. Go back to the terminal and run the same python script to copy a file using SCP. Replace <private-ip-of-worker> with your own worker node ID:
The file has now been transferred successfully using a new key pair, with no updates required.
Auditing and monitoring
You can monitor and audit all APIs used to create and rotate your keys in Secrets Manager via AWS CloudTrail. To view CloudTrail events, follow these steps:
Open the CloudTrail console and select Event history.
From the Filter dropdown field, select Event source, enter secret in the filter field, then select secretsmanager.amazonaws.com from the dropdown menu.
From here, you can review Secrets Manager’s events, such as GetSecretValue, PutSecretValue, UpdateSecretVersionStage (which modifies the staging labels attached to a version of a secret), and RotationSucceeded, in the CloudTrail event history. These event logs help to audit secrets configuration, rotation, and access.
Figure 9: The “Event history” window
Additionally, Secrets Manager can work with CloudWatch Events to trigger alerts when administrator-specified operations occur in an organization (for example, to notify you of a secret deletion attempt).
Cleaning up the CloudFormation Stack
To delete the entire CloudFormation stack:
Select the stack named RotateSSH from the CloudFormation console.
Select Actions, and then Delete Stack. This will delete all AWS resources created by the stack.
Repeat the steps above to delete the stack named MasterWorkers.
From the AWS Secrets Manager console, delete the secret /dev/ssh. Read more about Deleting and Restoring a Secret in the AWS Secrets Manager User Guide.
Conclusion
In this post, we demonstrate how you can use AWS Secrets Manager to store, rotate, and deliver SSH keypairs in order to secure communication within a compute cluster. Keys are securely encrypted and stored in AWS Secret Manager, which will also rotate the keys and install public keys on all nodes for you. By using this method, you won’t have to manually deploy SSH Keys on the various EC2 instances or manually rotate them. APIs associated with secrets management and rotation are logged in CloudTrail for auditing and monitoring. This key rotation solution is serverless. It does not require any servers to maintain and can scale rapidly.
If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the AWS Secrets Manager forum.
Want more AWS Security news? Follow us on Twitter.
We often mention SSH (Secure Shell) when we talk about headless Raspberry Pi projects — projects that involve accessing a Pi remotely. If you’re a coding creative who doesn’t know what SSH involves, we’ve got you covered with our comprehensive online guide to using SSH with your Raspberry Pi.
You know who’s also got you covered? YouTube favourite Tinkernut, with his great beginners’ guide to SSH, what it is, why we use it, and how you can use it with your device:
Me: “I have a question about controlling another computer over the internet” You: “SSH” Me: “Don’t tell me to ‘shhh’, I’m asking you a question”. Ok, enough with the play on words. If you’ve ever wanted to securely control another computer over the internet, then you’ve probably heard of SSH.
SSHhhhhhhhhh
Between our guide and Tinkernut’s video, I don’t think I need to add anything else on the subject.
So here, have this GIF, and have yourself a lovely weekend!
Previously, I showed you how to rotate Amazon RDS database credentials automatically with AWS Secrets Manager. In addition to database credentials, AWS Secrets Manager makes it easier to rotate, manage, and retrieve API keys, OAuth tokens, and other secrets throughout their lifecycle. You can configure Secrets Manager to rotate these secrets automatically, which can help you meet your compliance needs. You can also use Secrets Manager to rotate secrets on demand, which can help you respond quickly to security events. In this post, I show you how to store an API key in Secrets Manager and use a custom Lambda function to rotate the key automatically. I’ll use a Twitter API key and bearer token as an example; you can reference this example to rotate other types of API keys.
The instructions are divided into four main phases:
Store a Twitter API key and bearer token in Secrets Manager.
Create a custom Lambda function to rotate the bearer token.
Configure your application to retrieve the bearer token from Secrets Manager.
Configure Secrets Manager to use the custom Lambda function to rotate the bearer token automatically.
For the purpose of this post, I use the placeholder Demo/Twitter_Api_Key to denote the API key, the placeholder Demo/Twitter_bearer_token to denote the bearer token, and placeholder Lambda_Rotate_Bearer_Token to denote the custom Lambda function. Be sure to replace these placeholders with the resource names from your account.
Phase 1: Store a Twitter API key and bearer token in Secrets Manager
Twitter enables developers to register their applications and retrieve an API key, which includes a consumer_key and consumer_secret. Developers use these to generate a bearer token that applications can then use to authenticate and retrieve information from Twitter. At any given point of time, you can use an API key to create only one valid bearer token.
Start by storing the API key in Secrets Manager. Here’s how:
Figure 1: The “Store a new secret” button in the AWS Secrets Manager console
Select Other type of secrets (because you’re storing an API key).
Input the consumer_key and consumer_secret, and then select Next.
Figure 2: Select the consumer_key and the consumer_secret
Specify values for Secret Name and Description, then select Next. For this example, I use Demo/Twitter_API_Key.
Figure 3: Set values for “Secret Name” and “Description”
On the next screen, keep the default setting, Disable automatic rotation, because you’ll use the same API key to rotate bearer tokens programmatically and automatically. Applications and employees will not retrieve this API key. Select Next.
Figure 4: Keep the default “Disable automatic rotation” setting
Review the information on the next screen and, if everything looks correct, select Store. You’ve now successfully stored a Twitter API key in Secrets Manager.
Next, store the bearer token in Secrets Manager. Here’s how:
From the Secrets Manager console, select Store a new secret, select Other type of secrets, input details (access_token, token_type, and ARN of the API key) about the bearer token, and then select Next.
Figure 5: Add details about the bearer token
Specify values for Secret Name and Description, and then select Next. For this example, I use Demo/Twitter_bearer_token.
Figure 6: Again set values for “Secret Name” and “Description”
Keep the default rotation setting, Disable automatic rotation, and then select Next. You’ll enable rotation after you’ve updated the application to use Secrets Manager APIs to retrieve secrets.
Review the information and select Store. You’ve now completed storing the bearer token in Secrets Manager. I take note of the sample code provided on the review page. I’ll use this code to update my application to retrieve the bearer token using Secrets Manager APIs.
Figure 7: The sample code you can use in your app
Phase 2: Create a custom Lambda function to rotate the bearer token
While Secrets Manager supports rotating credentials for databases hosted on Amazon RDS natively, it also enables you to meet your unique rotation-related use cases by authoring custom Lambda functions. Now that you’ve stored the API key and bearer token, you’ll create a Lambda function to rotate the bearer token. For this example, I’ll create my Lambda function using Python 3.6.
Figure 8: In the Lambda console, select “Create function”
Select Author from scratch. For this example, I use the name Lambda_Rotate_Bearer_Token for my Lambda function. I also set the Runtime environment as Python 3.6.
Figure 9: Create a new function from scratch
This Lambda function requires permissions to call AWS resources on your behalf. To grant these permissions, select Create a custom role. This opens a console tab.
Select Create a new IAM Role and specify the value for Role Name. For this example, I use Role_Lambda_Rotate_Twitter_Bearer_Token.
Figure 10: For “IAM Role,” select “Create a new IAM role”
Next, to define the IAM permissions, copy and paste the following IAM policy in the View Policy Document text-entry field. Be sure to replace the placeholder ARN-OF-Demo/Twitter_API_Key with the ARN of your secret.
Figure 11: The IAM policy pasted in the “View Policy Document” text-entry field
Now, select Allow. This brings me back to the Lambda console with the appropriate Role selected.
Select Create function.
Figure 12: Select the “Create function” button in the lower-right corner
Copy the following Python code and paste it in the Function code section.
import base64
import json
import logging
import os
import boto3
from botocore.vendored import requests
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def lambda_handler(event, context):
"""Secrets Manager Twitter Bearer Token Handler
This handler uses the master-user rotation scheme to rotate a bearer token of a Twitter app.
The Secret PlaintextString is expected to be a JSON string with the following format:
{
'access_token': ,
'token_type': ,
'masterarn':
}
Args:
event (dict): Lambda dictionary of event parameters. These keys must include the following:
- SecretId: The secret ARN or identifier
- ClientRequestToken: The ClientRequestToken of the secret version
- Step: The rotation step (one of createSecret, setSecret, testSecret, or finishSecret)
context (LambdaContext): The Lambda runtime information
Raises:
ResourceNotFoundException: If the secret with the specified arn and stage does not exist
ValueError: If the secret is not properly configured for rotation
KeyError: If the secret json does not contain the expected keys
"""
arn = event['SecretId']
token = event['ClientRequestToken']
step = event['Step']
# Setup the client and environment variables
service_client = boto3.client('secretsmanager', endpoint_url=os.environ['SECRETS_MANAGER_ENDPOINT'])
oauth2_token_url = os.environ['TWITTER_OAUTH2_TOKEN_URL']
oauth2_invalid_token_url = os.environ['TWITTER_OAUTH2_INVALID_TOKEN_URL']
tweet_search_url = os.environ['TWITTER_SEARCH_URL']
# Make sure the version is staged correctly
metadata = service_client.describe_secret(SecretId=arn)
if not metadata['RotationEnabled']:
logger.error("Secret %s is not enabled for rotation" % arn)
raise ValueError("Secret %s is not enabled for rotation" % arn)
versions = metadata['VersionIdsToStages']
if token not in versions:
logger.error("Secret version %s has no stage for rotation of secret %s." % (token, arn))
raise ValueError("Secret version %s has no stage for rotation of secret %s." % (token, arn))
if "AWSCURRENT" in versions[token]:
logger.info("Secret version %s already set as AWSCURRENT for secret %s." % (token, arn))
return
elif "AWSPENDING" not in versions[token]:
logger.error("Secret version %s not set as AWSPENDING for rotation of secret %s." % (token, arn))
raise ValueError("Secret version %s not set as AWSPENDING for rotation of secret %s." % (token, arn))
# Call the appropriate step
if step == "createSecret":
create_secret(service_client, arn, token, oauth2_token_url, oauth2_invalid_token_url)
elif step == "setSecret":
set_secret(service_client, arn, token, oauth2_token_url)
elif step == "testSecret":
test_secret(service_client, arn, token, tweet_search_url)
elif step == "finishSecret":
finish_secret(service_client, arn, token)
else:
logger.error("lambda_handler: Invalid step parameter %s for secret %s" % (step, arn))
raise ValueError("Invalid step parameter %s for secret %s" % (step, arn))
def create_secret(service_client, arn, token, oauth2_token_url, oauth2_invalid_token_url):
"""Get a new bearer token from Twitter
This method invalidates existing bearer token for the Twitter app and retrieves a new one from Twitter.
If a secret version with AWSPENDING stage exists, updates it with the newly retrieved bearer token and if
the AWSPENDING stage does not exist, creates a new version of the secret with that stage label.
Args:
service_client (client): The secrets manager service client
arn (string): The secret ARN or other identifier
token (string): The ClientRequestToken associated with the secret version
oauth2_token_url (string): The Twitter API endpoint to request a bearer token
oauth2_invalid_token_url (string): The Twitter API endpoint to invalidate a bearer token
Raises:
ValueError: If the current secret is not valid JSON
KeyError: If the secret json does not contain the expected keys
ResourceNotFoundException: If the current secret is not found
"""
# Make sure the current secret exists and try to get the master arn from the secret
try:
current_secret_dict = get_secret_dict(service_client, arn, "AWSCURRENT")
master_arn = current_secret_dict['masterarn']
logger.info("createSecret: Successfully retrieved secret for %s." % arn)
except service_client.exceptions.ResourceNotFoundException:
return
# create bearer token credentials to be passed as authorization string to Twitter
bearer_token_credentials = encode_credentials(service_client, master_arn, "AWSCURRENT")
# get the bearer token from Twitter
bearer_token_from_twitter = get_bearer_token(bearer_token_credentials,oauth2_token_url)
# invalidate the current bearer token
invalidate_bearer_token(oauth2_invalid_token_url,bearer_token_credentials,bearer_token_from_twitter)
# get a new bearer token from Twitter
new_bearer_token = get_bearer_token(bearer_token_credentials, oauth2_token_url)
# if a secret version with AWSPENDING stage exists, update it with the lastest bearer token
# if the AWSPENDING stage does not exist, then create the version with AWSPENDING stage
try:
pending_secret_dict = get_secret_dict(service_client, arn, "AWSPENDING", token)
pending_secret_dict['access_token'] = new_bearer_token
service_client.put_secret_value(SecretId=arn, ClientRequestToken=token, SecretString=json.dumps(pending_secret_dict), VersionStages=['AWSPENDING'])
logger.info("createSecret: Successfully invalidated the bearer token of the secret %s and updated the pending version" % arn)
except service_client.exceptions.ResourceNotFoundException:
current_secret_dict['access_token'] = new_bearer_token
service_client.put_secret_value(SecretId=arn, ClientRequestToken=token, SecretString=json.dumps(current_secret_dict), VersionStages=['AWSPENDING'])
logger.info("createSecret: Successfully invalidated the bearer token of the secret %s and and created the pending version." % arn)
def set_secret(service_client, arn, token, oauth2_token_url):
"""Validate the pending secret with that in Twitter
This method checks wether the bearer token in Twitter is the same as the one in the version with AWSPENDING stage.
Args:
service_client (client): The secrets manager service client
arn (string): The secret ARN or other identifier
token (string): The ClientRequestToken associated with the secret version
oauth2_token_url (string): The Twitter API endopoint to get a bearer token
Raises:
ResourceNotFoundException: If the secret with the specified arn and stage does not exist
ValueError: If the secret is not valid JSON or master credentials could not be used to login to DB
KeyError: If the secret json does not contain the expected keys
"""
# First get the pending version of the bearer token and compare it with that in Twitter
pending_secret_dict = get_secret_dict(service_client, arn, "AWSPENDING")
master_arn = pending_secret_dict['masterarn']
# create bearer token credentials to be passed as authorization string to Twitter
bearer_token_credentials = encode_credentials(service_client, master_arn, "AWSCURRENT")
# get the bearer token from Twitter
bearer_token_from_twitter = get_bearer_token(bearer_token_credentials, oauth2_token_url)
# if the bearer tokens are same, invalidate the bearer token in Twitter
# if not, raise an exception that bearer token in Twitter was changed outside Secrets Manager
if pending_secret_dict['access_token'] == bearer_token_from_twitter:
logger.info("createSecret: Successfully verified the bearer token of arn %s" % arn)
else:
raise ValueError("The bearer token of the Twitter app was changed outside Secrets Manager. Please check.")
def test_secret(service_client, arn, token, tweet_search_url):
"""Test the pending secret by calling a Twitter API
This method tries to use the bearer token in the secret version with AWSPENDING stage and search for tweets
with 'aws secrets manager' string.
Args:
service_client (client): The secrets manager service client
arn (string): The secret ARN or other identifier
token (string): The ClientRequestToken associated with the secret version
Raises:
ResourceNotFoundException: If the secret with the specified arn and stage does not exist
ValueError: If the secret is not valid JSON or pending credentials could not be used to login to the database
KeyError: If the secret json does not contain the expected keys
"""
# First get the pending version of the bearer token and compare it with that in Twitter
pending_secret_dict = get_secret_dict(service_client, arn, "AWSPENDING", token)
# Now verify you can search for tweets using the bearer token
if verify_bearer_token(pending_secret_dict['access_token'], tweet_search_url):
logger.info("testSecret: Successfully authorized with the pending secret in %s." % arn)
return
else:
logger.error("testSecret: Unable to authorize with the pending secret of secret ARN %s" % arn)
raise ValueError("Unable to connect to Twitter with pending secret of secret ARN %s" % arn)
def finish_secret(service_client, arn, token):
"""Finish the rotation by marking the pending secret as current
This method moves the secret from the AWSPENDING stage to the AWSCURRENT stage.
Args:
service_client (client): The secrets manager service client
arn (string): The secret ARN or other identifier
token (string): The ClientRequestToken associated with the secret version
Raises:
ResourceNotFoundException: If the secret with the specified arn and stage does not exist
"""
# First describe the secret to get the current version
metadata = service_client.describe_secret(SecretId=arn)
current_version = None
for version in metadata["VersionIdsToStages"]:
if "AWSCURRENT" in metadata["VersionIdsToStages"][version]:
if version == token:
# The correct version is already marked as current, return
logger.info("finishSecret: Version %s already marked as AWSCURRENT for %s" % (version, arn))
return
current_version = version
break
# Finalize by staging the secret version current
service_client.update_secret_version_stage(SecretId=arn, VersionStage="AWSCURRENT", MoveToVersionId=token, RemoveFromVersionId=current_version)
logger.info("finishSecret: Successfully set AWSCURRENT stage to version %s for secret %s." % (version, arn))
def encode_credentials(service_client, arn, stage):
"""Encodes the Twitter credentials
This helper function encodes the Twitter credentials (consumer_key and consumer_secret)
Args:
service_client (client):The secrets manager service client
arn (string): The secret ARN or other identifier
stage (stage): The stage identifying the secret version
Returns:
encoded_credentials (string): base64 encoded authorization string for Twitter
Raises:
KeyError: If the secret json does not contain the expected keys
"""
required_fields = ['consumer_key','consumer_secret']
master_secret_dict = get_secret_dict(service_client, arn, stage)
for field in required_fields:
if field not in master_secret_dict:
raise KeyError("%s key is missing from the secret JSON" % field)
encoded_credentials = base64.urlsafe_b64encode(
'{}:{}'.format(master_secret_dict['consumer_key'], master_secret_dict['consumer_secret']).encode('ascii')).decode('ascii')
return encoded_credentials
def get_bearer_token(encoded_credentials, oauth2_token_url):
"""Gets a bearer token from Twitter
This helper function retrieves the current bearer token from Twitter, given a set of credentials.
Args:
encoded_credentials (string): Twitter credentials for authentication
oauth2_token_url (string): REST API endpoint to request a bearer token from Twitter
Raises:
KeyError: If the secret json does not contain the expected keys
"""
headers = {
'Authorization': 'Basic {}'.format(encoded_credentials),
'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
}
data = 'grant_type=client_credentials'
response = requests.post(oauth2_token_url, headers=headers, data=data)
response_data = response.json()
if response_data['token_type'] == 'bearer':
bearer_token = response_data['access_token']
return bearer_token
else:
raise RuntimeError('unexpected token type: {}'.format(response_data['token_type']))
def invalidate_bearer_token(oauth2_invalid_token_url, bearer_token_credentials, bearer_token):
"""Invalidates a Bearer Token of a Twitter App
This helper function invalidates a bearer token of a Twitter app.
If successful, it returns the invalidated bearer token, else None
Args:
oauth2_invalid_token_url (string): The Twitter API endpoint to invalidate a bearer token
bearer_token_credentials (string): encoded consumer key and consumer secret to authenticate with Twitter
bearer_token (string): The bearer token to be invalidated
Returns:
invalidated_bearer_token: The invalidated bearer token
Raises:
ResourceNotFoundException: If the secret with the specified arn and stage does not exist
ValueError: If the secret is not valid JSON
KeyError: If the secret json does not contain the expected keys
"""
headers = {
'Authorization': 'Basic {}'.format(bearer_token_credentials),
'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
}
data = 'access_token=' + bearer_token
invalidate_response = requests.post(oauth2_invalid_token_url, headers=headers, data=data)
invalidate_response_data = invalidate_response.json()
if invalidate_response_data:
return
else:
raise RuntimeError('Invalidate bearer token request failed')
def verify_bearer_token(bearer_token, tweet_search_url):
"""Verifies access to Twitter APIs using a bearer token
This helper function verifies that the bearer token is valid by calling Twitter's search/tweets API endpoint
Args:
bearer_token (string): The current bearer token for the application
Returns:
True or False
Raises:
KeyError: If the response of search tweets API call fails
"""
headers = {
'Authorization' : 'Bearer {}'.format(bearer_token),
'Content-Type': 'application/x-www-form-urlencoded;charset=UTF-8',
}
search_results = requests.get(tweet_search_url, headers=headers)
try:
search_results.json()['statuses']
return True
except:
return False
def get_secret_dict(service_client, arn, stage, token=None):
"""Gets the secret dictionary corresponding for the secret arn, stage, and token
This helper function gets credentials for the arn and stage passed in and returns the dictionary by parsing the JSON string
Args:
service_client (client): The secrets manager service client
arn (string): The secret ARN or other identifier
token (string): The ClientRequestToken associated with the secret version, or None if no validation is desired
stage (string): The stage identifying the secret version
Returns:
SecretDictionary: Secret dictionary
Raises:
ResourceNotFoundException: If the secret with the specified arn and stage does not exist
ValueError: If the secret is not valid JSON
"""
# Only do VersionId validation against the stage if a token is passed in
if token:
secret = service_client.get_secret_value(SecretId=arn, VersionId=token, VersionStage=stage)
else:
secret = service_client.get_secret_value(SecretId=arn, VersionStage=stage)
plaintext = secret['SecretString']
# Parse and return the secret JSON string
return json.loads(plaintext)
Here’s what it will look like:
Figure 13: The Python code pasted in the “Function code” section
On the same page, provide the following environment variables:
Note: Resources used in this example are in US East (Ohio) region. If you intend to use another AWS Region, change the SECRETS_MANAGER_ENDPOINT set in the Environment variables to the appropriate region.
You’ve now created a Lambda function that can rotate the bearer token:
Figure 15: The new Lambda function
Before you can configure Secrets Manager to use this Lambda function, you need to update the function policy of the Lambda function. A function policy permits AWS services, such as Secrets Manager, to invoke a Lambda function on behalf of your application. You can attach a Lambda function policy from the AWS Command Line Interface (AWS CLI) or SDK. To attach a function policy, call the add-permission Lambda API from the AWS CLI.
Phase 3: Configure your application to retrieve the bearer token from Secrets Manager
Now that you’ve stored the bearer token in Secrets Manager, update the application to retrieve the bearer token from Secrets Manager instead of hard-coding this information in a configuration file or source code. For this example, I show you how to configure a Python application to retrieve this secret from Secrets Manager.
import config
def no_secrets_manager_sample()
# Get the bearer token from a config file.
Bearer_token = config.bearer_token
# Use the bearer token to authenticate requests to Twitter
Use the sample code from section titled Phase 1 and update the application to retrieve the bearer token from Secrets Manager. The following code sets up the client and retrieves and decrypts the secret Demo/Twitter_bearer_token.
# Use this code snippet in your app.
import boto3
from botocore.exceptions import ClientError
def get_secret():
secret_name = "Demo/Twitter_bearer_token"
endpoint_url = "https://secretsmanager.us-east-2.amazonaws.com"
region_name = "us-east-2"
session = boto3.session.Session()
client = session.client(
service_name='secretsmanager',
region_name=region_name,
endpoint_url=endpoint_url
)
try:
get_secret_value_response = client.get_secret_value(
SecretId=secret_name
)
except ClientError as e:
if e.response['Error']['Code'] == 'ResourceNotFoundException':
print("The requested secret " + secret_name + " was not found")
elif e.response['Error']['Code'] == 'InvalidRequestException':
print("The request was invalid due to:", e)
elif e.response['Error']['Code'] == 'InvalidParameterException':
print("The request had invalid params:", e)
else:
# Decrypted secret using the associated KMS CMK
# Depending on whether the secret was a string or binary, one of these fields will be populated
if 'SecretString' in get_secret_value_response:
secret = get_secret_value_response['SecretString']
else:
binary_secret_data = get_secret_value_response['SecretBinary']
# Your code goes here.
Applications require permissions to access Secrets Manager. My application runs on Amazon EC2 and uses an IAM role to get access to AWS services. I’ll attach the following policy to my IAM role, and you should take a similar action with your IAM role. This policy uses the GetSecretValue action to grant my application permissions to read secrets from Secrets Manager. This policy also uses the resource element to limit my application to read only the Demo/Twitter_bearer_token secret from Secrets Manager. Read the AWS Secrets Manager documentation to understand the minimum IAM permissions required to retrieve a secret.
{
"Version": "2012-10-17",
"Statement": {
"Sid": "RetrieveBearerToken",
"Effect": "Allow",
"Action": "secretsmanager:GetSecretValue",
"Resource": Input ARN of the secret Demo/Twitter_bearer_token here
}
}
Note: To improve the resiliency of your applications, associate your application with two API keys/bearer tokens. This is a higher availability option because you can continue to use one bearer token while Secrets Manager rotates the other token. Read the AWS documentation to learn how AWS Secrets Manager rotates your secrets.
Phase 4: Enable and verify rotation
Now that you’ve stored the secret in Secrets Manager and created a Lambda function to rotate this secret, configure Secrets Manager to rotate the secret Demo/Twitter_bearer_token.
From the Secrets Manager console, go to the list of secrets and choose the secret you created in the first step (in my example, this is named Demo/Twitter_bearer_token).
Scroll to Rotation configuration, and then select Edit rotation.
Figure 16: Select the “Edit rotation” button
To enable rotation, select Enable automatic rotation, and then choose how frequently you want Secrets Manager to rotate this secret. For this example, I set the rotation interval to 30 days. I also choose the rotation Lambda function, Lambda_Rotate_Bearer_Token, from the drop-down list.
Figure 17: “Edit rotation configuration” options
The banner on the next screen confirms that I have successfully configured rotation and the first rotation is in progress, which enables you to verify that rotation is functioning as expected. Secrets Manager will rotate this credential automatically every 30 days.
Figure 18: Confirmation notice
Summary
In this post, I showed you how to configure Secrets Manager to manage and rotate an API key and bearer token used by applications to authenticate and retrieve information from Twitter. You can use the steps described in this blog to manage and rotate other API keys, as well.
Secrets Manager helps you protect access to your applications, services, and IT resources without the upfront investment and on-going maintenance costs of operating your own secrets management infrastructure. To get started, open the Secrets Manager console. To learn more, read the Secrets Manager documentation.
If you have comments about this post, submit them in the Comments section below. If you have questions about anything in this post, start a new thread on the Secrets Manager forum or contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
This post is courtesy of Alan Protasio, Software Development Engineer, Amazon Web Services
Just like compute and storage, messaging is a fundamental building block of enterprise applications. Message brokers (aka “message-oriented middleware”) enable different software systems, often written in different languages, on different platforms, running in different locations, to communicate and exchange information. Mission-critical applications, such as CRM and ERP, rely on message brokers to work.
A common performance consideration for customers deploying a message broker in a production environment is the throughput of the system, measured as messages per second. This is important to know so that application environments (hosts, threads, memory, etc.) can be configured correctly.
In this post, we demonstrate how to measure the throughput for Amazon MQ, a new managed message broker service for ActiveMQ, using JMS Benchmark. It should take between 15–20 minutes to set up the environment and an hour to run the benchmark. We also provide some tips on how to configure Amazon MQ for optimal throughput.
Benchmarking throughput for Amazon MQ
ActiveMQ can be used for a number of use cases. These use cases can range from simple fire and forget tasks (that is, asynchronous processing), low-latency request-reply patterns, to buffering requests before they are persisted to a database.
The throughput of Amazon MQ is largely dependent on the use case. For example, if you have non-critical workloads such as gathering click events for a non-business-critical portal, you can use ActiveMQ in a non-persistent mode and get extremely high throughput with Amazon MQ.
On the flip side, if you have a critical workload where durability is extremely important (meaning that you can’t lose a message), then you are bound by the I/O capacity of your underlying persistence store. We recommend using mq.m4.large for the best results. The mq.t2.micro instance type is intended for product evaluation. Performance is limited, due to the lower memory and burstable CPU performance.
Tip: To improve your throughput with Amazon MQ, make sure that you have consumers processing messaging as fast as (or faster than) your producers are pushing messages.
Because it’s impossible to talk about how the broker (ActiveMQ) behaves for each and every use case, we walk through how to set up your own benchmark for Amazon MQ using our favorite open-source benchmarking tool: JMS Benchmark. We are fans of the JMS Benchmark suite because it’s easy to set up and deploy, and comes with a built-in visualizer of the results.
Non-Persistent Scenarios – Queue latency as you scale producer throughput
Getting started
At the time of publication, you can create an mq.m4.large single-instance broker for testing for $0.30 per hour (US pricing).
Step 2 – Create an EC2 instance to run your benchmark Launch the EC2 instance using Step 1: Launch an Instance. We recommend choosing the m5.large instance type.
Step 3 – Configure the security groups Make sure that all the security groups are correctly configured to let the traffic flow between the EC2 instance and your broker.
From the broker list, choose the name of your broker (for example, MyBroker)
In the Details section, under Security and network, choose the name of your security group or choose the expand icon ( ).
From the security group list, choose your security group.
At the bottom of the page, choose Inbound, Edit.
In the Edit inbound rules dialog box, add a role to allow traffic between your instance and the broker: • Choose Add Rule. • For Type, choose Custom TCP. • For Port Range, type the ActiveMQ SSL port (61617). • For Source, leave Custom selected and then type the security group of your EC2 instance. • Choose Save.
Your broker can now accept the connection from your EC2 instance.
Step 4 – Run the benchmark Connect to your EC2 instance using SSH and run the following commands:
After the benchmark finishes, you can find the results in the ~/reports directory. As you may notice, the performance of ActiveMQ varies based on the number of consumers, producers, destinations, and message size.
Amazon MQ architecture
The last bit that’s important to know so that you can better understand the results of the benchmark is how Amazon MQ is architected.
Amazon MQ is architected to be highly available (HA) and durable. For HA, we recommend using the multi-AZ option. After a message is sent to Amazon MQ in persistent mode, the message is written to the highly durable message store that replicates the data across multiple nodes in multiple Availability Zones. Because of this replication, for some use cases you may see a reduction in throughput as you migrate to Amazon MQ. Customers have told us they appreciate the benefits of message replication as it helps protect durability even in the face of the loss of an Availability Zone.
Conclusion
We hope this gives you an idea of how Amazon MQ performs. We encourage you to run tests to simulate your own use cases.
To learn more, see the Amazon MQ website. You can try Amazon MQ for free with the AWS Free Tier, which includes up to 750 hours of a single-instance mq.t2.micro broker and up to 1 GB of storage per month for one year.
This post courtesy of Jeff Levine Solutions Architect for Amazon Web Services
Amazon Linux 2 is the next generation of Amazon Linux, a Linux server operating system from Amazon Web Services (AWS). Amazon Linux 2 offers a high-performance Linux environment suitable for organizations of all sizes. It supports applications ranging from small websites to enterprise-class, mission-critical platforms.
Amazon Linux 2 includes support for the LAMP (Linux/Apache/MariaDB/PHP) stack, one of the most popular platforms for deploying websites. To secure the transmission of data-in-transit to such websites and prevent eavesdropping, organizations commonly leverage Secure Sockets Layer/Transport Layer Security (SSL/TLS) services which leverage certificates to provide encryption. The LAMP stack provided by Amazon Linux 2 includes a self-signed SSL/TLS certificate. Such certificates may be fine for internal usage but are not acceptable when attestation by a certificate authority is required.
In this post, I discuss how to extend the capabilities of Amazon Linux 2 by installing Let’s Encrypt, a certificate authority provided by the Internet Security Research Group. Let’s Encrypt offers basic SSL/TLS certificates for DNS hosts at no charge that you can use to add encryption-in-transit to a single web server. For commercial or multi-server configurations, you should consider AWS Certificate Manager and Elastic Load Balancing.
Let’s Encrypt also requires the certbot package, which you install from EPEL, the Extra Packaged for Enterprise Linux collection. Although EPEL is not included with Amazon Linux 2, I show how you can install it from the Fedora Project.
Walkthrough
At a high level, you perform the following tasks for this walkthrough:
Provision a VPC, Amazon Linux 2 instance, and LAMP stack.
Install and enable the EPEL repository.
Install and configure Let’s Encrypt.
Validate the installation.
Clean up.
Prerequisites and costs
To follow along with this walkthrough, you need the following:
Accept all other default values including with regard to storage.
Create a new security group and accept the default rule that allows TCP port 22 (SSH) from everywhere (0.0.0.0/0 in IPv4). For the purposes of this walkthrough, permitting access from all IP addresses is reasonable. In a production environment, you may restrict access to different addresses.
Allocate and associate an Elastic IP address to the server when it enters the running state.
Respond “Y” to all requests for approval to install the software.
Step 3: Install and configure Let’s Encrypt
If you are no longer connected to the Amazon Linux 2 instance, connect to it at the Elastic IP address that you just created.
Install certbot, the Let’s Encrypt client to be used to obtain an SSL/TLS certificate and install it into Apache.
sudo yum install python2-certbot-apache.noarch
Respond “Y” to all requests for approval to install the software. If you see a message appear about SELinux, you can safely ignore it. This is a known issue with the latest version of certbot.
Create a DNS “A record” that maps a host name to the Elastic IP address. For this post, assume that the name of the host is lamp.example.com. If you are hosting your DNS in Amazon Route 53, do this by creating the appropriate record set.
After the “A record” has propagated, browse to lamp.example.com. The Apache test page should appear. If the page does not appear, use a tool such as nslookup on your workstation to confirm that the DNS record has been properly configured.
You are now ready to install Let’s Encrypt. Let’s Encrypt does the following:
Confirms that you have control over the DNS domain being used, by having you create a DNS TXT record using the value that it provides.
Obtains an SSL/TLS certificate.
Modifies the Apache-related scripts to use the SSL/TLS certificate and redirects users browsing the site in HTTP mode to HTTPS mode.
Use the following command to install certbot:
sudo certbot -i apache -a manual \
--preferred-challenges dns -d lamp.example.com
The options have the following meanings:
-i apache Use the Apache installer.
-a manual Authenticate domain ownership manually.
--preferred-challenges dns Use DNS TXT records for authentication challenge.
-d lamp.example.com Specify the domain for the SSL/TLS certificate.
You are prompted for the following information: E-mail address for renewals? Enter an email address for certificate renewals. Accept the terms of services? Respond as appropriate. Send your e-mail address to the EFF? Respond as appropriate. Log your current IP address? Respond as appropriate.
You are prompted to deploy a DNS TXT record with the name “_acme-challenge.lamp.example.com” with the supplied value, as shown below.
After you enter the record, wait until the TXT record propagates. To look up the TXT record to confirm the deployment, use the nslookup command in a separate command window, as shown below. Remember to use the set ty=txt command before entering the TXT record. You are prompted to select a virtual host. There is only one, so choose 1. The final prompt asks whether to redirect HTTP traffic to HTTPS. To perform the redirection, choose 2. That completes the configuration of Let’s Encrypt.
Browse to the http:// lamp.example.com site. You are redirected to the SSL/TLS page https://lamp.example.com.
To look at the encryption information, use the appropriate actions within your browser. For example, in Firefox, you can open the padlock and traverse the menus. In the encryption technical details, you can see from the “Connection Encrypted” line that traffic to the website is now encrypted using TLS 1.2.
Security note: As of the time of publication, this website also supports TLS 1.0. I recommend that you disable this protocol because of some known vulnerabilities associated with it. To do this:
Edit the file /etc/letsencrypt/options-ssl-apache.conf.
Look for the line beginning with SSLProtocol and change it to the following:
SSLProtocol all -SSLv2 -SSLv3 -TLSv1
Save the file. After you make changes to this file, Let’s Encrypt no longer automatically updates it. Periodically check your log files for recommended updates to this file.
Restart the httpd server with the following command:
sudo service httpd restart
Step 5: Cleanup
Use the following steps to avoid incurring any further costs.
Terminate the Amazon Linux 2 instance that you created.
Release the Elastic IP address that you allocated.
Revert any DNS changes that you made, including the A and TXT records.
Conclusion
Amazon Linux 2 is an excellent option for hosting websites through the LAMP stack provided by the Amazon-Linux-Extras feature. You can then enhance the security of the Apache web server by installing EPEL and Let’s Encrypt. Let’s Encrypt provisions an SSL/TLS certificate, optionally installs it for you on the Apache server, and enables data-in-transit encryption. You can get started with Amazon Linux 2 in just a few clicks.
Some gaming consoles make it easy to stream to Twitch, some gaming consoles don’t (come on, Nintendo). So for those that don’t, I’ve made this beta version of the “Twitch-O-Matic”. No it doesn’t chop onions or fold your laundry, but what it DOES do is stream anything with HDMI output to your Twitch channel with the simple push of a button!
eSports and online game streaming
Interest in eSports has skyrocketed over the last few years, with viewership numbers in the hundreds of millions, sponsorship deals increasing in value and prestige, and tournament prize funds reaching millions of dollars. So it’s no wonder that more and more gamers are starting to stream live to online platforms in order to boost their fanbase and try to cash in on this growing industry.
Streaming to Twitch
Launched in 2011, Twitch.tv is an online live-streaming platform with a primary focus on video gaming. Users can create accounts to contribute their comments and content to the site, as well as watching live-streamed gaming competitions and broadcasts. With a staggering fifteen million daily users, Twitch is accessible via smartphone and gaming console apps, smart TVs, computers, and tablets. But if you want to stream to Twitch, you may find yourself using third-party software in order to do so. And with more buttons to click and more wires to plug in for older, app-less consoles, streaming can get confusing.
Enter Tinkernut.
Side note: we Tinkernut
We’ve featured Tinkernut a few times on the Raspberry Pi blog – his tutorials are clear, his projects are interesting and useful, and his live-streamed comment videos for every build are a nice touch to sharing homebrew builds on the internet.
So, yes, we love him. [This is true. Alex never shuts up about him. – Ed.] And since he has over 500K subscribers on YouTube, we’re obviously not the only ones. We wave our Tinkernut flags with pride.
Twitch-O-Matic
With a Raspberry Pi Zero W, an HDMI to CSI adapter, and a case to fit it all in, Tinkernut’s Twitch-O-Matic allows easy connection to the Twitch streaming service. You’ll also need a button – the bigger, the better in our opinion, though Tinkernut has opted for the Adafruit 16mm Illuminated Pushbutton for his build, and not the 100mm Massive Arcade Button that, sadly, we still haven’t found a reason to use yet.
“I’m sorry, Dave…”
For added frills and pizzazz, Tinketnut has also incorporated Adafruit’s White LED Backlight Module into the case, though you don’t have to do so unless you’re feeling super fancy.
The setup
The Raspberry Pi Zero W is connected to the HDMI to CSI adapter via the camera connector, in the same way you’d attach the camera ribbon. Tinkernut uses a standard Raspbian image on an 8GB SD card, with SSH enabled for remote access from his laptop. He uses the simple command Raspivid to test the HDMI connection by recording ten seconds of video footage from his console.
One lead is all you need
Once you have the Pi receiving video from your console, you can connect to Twitch using your Twitch stream key, which you can find by logging in to your account at Twitch.tv. Tinkernut’s tutorial gives you all the commands you need to stream from your Pi.
The frills
To up the aesthetic impact of your project, adding buttons and backlights is fairly straightforward.
Pretty LED frills
To run the stream command, Tinketnut uses a button: press once to start the stream, press again to stop. Pressing the button also turns on the LED backlight, so it’s obvious when streaming is in progress.
The tutorial
For the full code and 3D-printable case STL file, head to Tinketnut’s hackster.io project page. And if you’re already using a Raspberry Pi for Twitch streaming, share your build setup with us. Cheers!
AWS Glue provides enhanced support for working with datasets that are organized into Hive-style partitions. AWS Glue crawlers automatically identify partitions in your Amazon S3 data. The AWS Glue ETL (extract, transform, and load) library natively supports partitions when you work with DynamicFrames. DynamicFrames represent a distributed collection of data without requiring you to specify a schema. You can now push down predicates when creating DynamicFrames to filter out partitions and avoid costly calls to S3. We have also added support for writing DynamicFrames directly into partitioned directories without converting them to Apache Spark DataFrames.
Partitioning has emerged as an important technique for organizing datasets so that they can be queried efficiently by a variety of big data systems. Data is organized in a hierarchical directory structure based on the distinct values of one or more columns. For example, you might decide to partition your application logs in Amazon S3 by date—broken down by year, month, and day. Files corresponding to a single day’s worth of data would then be placed under a prefix such as s3://my_bucket/logs/year=2018/month=01/day=23/.
Systems like Amazon Athena, Amazon Redshift Spectrum, and now AWS Glue can use these partitions to filter data by value without making unnecessary calls to Amazon S3. This can significantly improve the performance of applications that need to read only a few partitions.
In this post, we show you how to efficiently process partitioned datasets using AWS Glue. First, we cover how to set up a crawler to automatically scan your partitioned dataset and create a table and partitions in the AWS Glue Data Catalog. Then, we introduce some features of the AWS Glue ETL library for working with partitioned data. You can now filter partitions using SQL expressions or user-defined functions to avoid listing and reading unnecessary data from Amazon S3. We’ve also added support in the ETL library for writing AWS Glue DynamicFrames directly into partitions without relying on Spark SQL DataFrames.
Let’s get started!
Crawling partitioned data
In this example, we use the same GitHub archive dataset that we introduced in a previous post about Scala support in AWS Glue. This data, which is publicly available from the GitHub archive, contains a JSON record for every API request made to the GitHub service. A sample dataset containing one month of activity from January 2017 is available at the following location:
Here you can replace <region> with the AWS Region in which you are working, for example, us-east-1. This dataset is partitioned by year, month, and day, so an actual file will be at a path like the following:
To crawl this data, you can either follow the instructions in the AWS Glue Developer Guide or use the provided AWS CloudFormation template. This template creates a stack that contains the following:
An IAM role with permissions to access AWS Glue resources
A database in the AWS Glue Data Catalog named githubarchive_month
A crawler set up to crawl the GitHub dataset
An AWS Glue development endpoint (which is used in the next section to transform the data)
To run this template, you must provide an S3 bucket and prefix where you can write output data in the next section. The role that this template creates will have permission to write to this bucket only. You also need to provide a public SSH key for connecting to the development endpoint. For more information about creating an SSH key, see our Development Endpoint tutorial. After you create the AWS CloudFormation stack, you can run the crawler from the AWS Glue console.
In addition to inferring file types and schemas, crawlers automatically identify the partition structure of your dataset and populate the AWS Glue Data Catalog. This ensures that your data is correctly grouped into logical tables and makes the partition columns available for querying in AWS Glue ETL jobs or query engines like Amazon Athena.
After you crawl the table, you can view the partitions by navigating to the table in the AWS Glue console and choosing View partitions. The partitions should look like the following:
For partitioned paths in Hive-style of the form key=val, crawlers automatically populate the column name. In this case, because the GitHub data is stored in directories of the form 2017/01/01, the crawlers use default names like partition_0, partition_1, and so on. You can easily change these names on the AWS Glue console: Navigate to the table, choose Edit schema, and rename partition_0 to year, partition_1 to month, and partition_2 to day:
Now that you’ve crawled the dataset and named your partitions appropriately, let’s see how to work with partitioned data in an AWS Glue ETL job.
Transforming and filtering the data
To get started with the AWS Glue ETL libraries, you can use an AWS Glue development endpoint and an Apache Zeppelin notebook. AWS Glue development endpoints provide an interactive environment to build and run scripts using Apache Spark and the AWS Glue ETL library. They are great for debugging and exploratory analysis, and can be used to develop and test scripts before migrating them to a recurring job.
If you ran the AWS CloudFormation template in the previous section, then you already have a development endpoint named partition-endpoint in your account. Otherwise, you can follow the instructions in this development endpoint tutorial. In either case, you need to set up an Apache Zeppelin notebook, either locally, or on an EC2 instance. You can find more information about development endpoints and notebooks in the AWS Glue Developer Guide.
The following examples are all written in the Scala programming language, but they can all be implemented in Python with minimal changes.
Reading a partitioned dataset
To get started, let’s read the dataset and see how the partitions are reflected in the schema. First, you import some classes that you will need for this example and set up a GlueContext, which is the main class that you will use to read and write data.
Execute the following in a Zeppelin paragraph, which is a unit of executable code:
%spark
import com.amazonaws.services.glue.DynamicFrame import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions import org.apache.spark.SparkContext
import java.util.Calendar
import java.util.GregorianCalendar
import scala.collection.JavaConversions._
@transient val spark: SparkContext = SparkContext.getOrCreate()
val glueContext: GlueContext = new GlueContext(spark)
This is straightforward with two caveats: First, each paragraph must start with the line %spark to indicate that the paragraph is Scala. Second, the spark variable must be marked @transient to avoid serialization issues. This is only necessary when running in a Zeppelin notebook.
Next, read the GitHub data into a DynamicFrame, which is the primary data structure that is used in AWS Glue scripts to represent a distributed collection of data. A DynamicFrame is similar to a Spark DataFrame, except that it has additional enhancements for ETL transformations. DynamicFrames are discussed further in the post AWS Glue Now Supports Scala Scripts, and in the AWS Glue API documentation.
The following snippet creates a DynamicFrame by referencing the Data Catalog table that you just crawled and then prints the schema:
%spark
val githubEvents: DynamicFrame = glueContext.getCatalogSource(
database = "githubarchive_month",
tableName = "data"
).getDynamicFrame()
githubEvents.schema.asFieldList.foreach { field =>
println(s"${field.getName}: ${field.getType.getType.getName}")
}
You could also print the full schema using githubEvents.printSchema(). But in this case, the full schema is quite large, so I’ve printed only the top-level columns. This paragraph takes about 5 minutes to run on a standard size AWS Glue development endpoint. After it runs, you should see the following output:
Note that the partition columns year, month, and day were automatically added to each record.
Filtering by partition columns
One of the primary reasons for partitioning data is to make it easier to operate on a subset of the partitions, so now let’s see how to filter data by the partition columns. In particular, let’s find out what people are building in their free time by looking at GitHub activity on the weekends. One way to accomplish this is to use the filter transformation on the githubEvents DynamicFrame that you created earlier to select the appropriate events:
%spark
def filterWeekend(rec: DynamicRecord): Boolean = {
def getAsInt(field: String): Int = {
rec.getField(field) match {
case Some(strVal: String) => strVal.toInt
// The filter transformation will catch exceptions and mark the record as an error.
case _ => throw new IllegalArgumentException(s"Unable to extract field $field")
}
}
val (year, month, day) = (getAsInt("year"), getAsInt("month"), getAsInt("day"))
val cal = new GregorianCalendar(year, month - 1, day) // Calendar months start at 0.
val dayOfWeek = cal.get(Calendar.DAY_OF_WEEK)
dayOfWeek == Calendar.SATURDAY || dayOfWeek == Calendar.SUNDAY
}
val filteredEvents = githubEvents.filter(filterWeekend)
filteredEvents.count
This snippet defines the filterWeekend function that uses the Java Calendar class to identify those records where the partition columns (year, month, and day) fall on a weekend. If you run this code, you see that there were 6,303,480 GitHub events falling on the weekend in January 2017, out of a total of 29,160,561 events. This seems reasonable—about 22 percent of the events fell on the weekend, and about 29 percent of the days that month fell on the weekend (9 out of 31). So people are using GitHub slightly less on the weekends, but there is still a lot of activity!
Predicate pushdowns for partition columns
The main downside to using the filter transformation in this way is that you have to list and read all files in the entire dataset from Amazon S3 even though you need only a small fraction of them. This is manageable when dealing with a single month’s worth of data. But as you try to process more data, you will spend an increasing amount of time reading records only to immediately discard them.
To address this issue, we recently released support for pushing down predicates on partition columns that are specified in the AWS Glue Data Catalog. Instead of reading the data and filtering the DynamicFrame at executors in the cluster, you apply the filter directly on the partition metadata available from the catalog. Then you list and read only the partitions from S3 that you need to process.
To accomplish this, you can specify a Spark SQL predicate as an additional parameter to the getCatalogSource method. This predicate can be any SQL expression or user-defined function as long as it uses only the partition columns for filtering. Remember that you are applying this to the metadata stored in the catalog, so you don’t have access to other fields in the schema.
The following snippet shows how to use this functionality to read only those partitions occurring on a weekend:
%spark
val partitionPredicate =
"date_format(to_date(concat(year, '-', month, '-', day)), 'E') in ('Sat', 'Sun')"
val pushdownEvents = glueContext.getCatalogSource(
database = "githubarchive_month",
tableName = "data",
pushDownPredicate = partitionPredicate).getDynamicFrame()
Here you use the SparkSQL string concat function to construct a date string. You use the to_date function to convert it to a date object, and the date_format function with the ‘E’ pattern to convert the date to a three-character day of the week (for example, Mon, Tue, and so on). For more information about these functions, Spark SQL expressions, and user-defined functions in general, see the Spark SQL documentation and list of functions.
Note that the pushdownPredicate parameter is also available in Python. The corresponding call in Python is as follows:
You can observe the performance impact of pushing down predicates by looking at the execution time reported for each Zeppelin paragraph. The initial approach using a Scala filter function took 2.5 minutes:
Because the version using a pushdown lists and reads much less data, it takes only 24 seconds to complete, a 5X improvement!
Of course, the exact benefit that you see depends on the selectivity of your filter. The more partitions that you exclude, the more improvement you will see.
In addition to Hive-style partitioning for Amazon S3 paths, Parquet and ORC file formats further partition each file into blocks of data that represent column values. Each block also stores statistics for the records that it contains, such as min/max for column values. AWS Glue supports pushdown predicates for both Hive-style partitions and block partitions in these formats. While reading data, it prunes unnecessary S3 partitions and also skips the blocks that are determined unnecessary to be read by column statistics in Parquet and ORC formats.
Additional transformations
Now that you’ve read and filtered your dataset, you can apply any additional transformations to clean or modify the data. For example, you could augment it with sentiment analysis as described in the previous AWS Glue post.
To keep things simple, you can just pick out some columns from the dataset using the ApplyMapping transformation:
ApplyMapping is a flexible transformation for performing projection and type-casting. In this example, we use it to unnest several fields, such as actor.login, which we map to the top-level actor field. We also cast the id column to a long and the partition columns to integers.
Writing out partitioned data
The final step is to write out your transformed dataset to Amazon S3 so that you can process it with other systems like Amazon Athena. By default, when you write out a DynamicFrame, it is not partitioned—all the output files are written at the top level under the specified output path. Until recently, the only way to write a DynamicFrame into partitions was to convert it into a Spark SQL DataFrame before writing. We are excited to share that DynamicFrames now support native partitioning by a sequence of keys.
You can accomplish this by passing the additional partitionKeys option when creating a sink. For example, the following code writes out the dataset that you created earlier in Parquet format to S3 in directories partitioned by the type field.
Here, $outpath is a placeholder for the base output path in S3. The partitionKeys parameter can also be specified in Python in the connection_options dict:
When you execute this write, the type field is removed from the individual records and is encoded in the directory structure. To demonstrate this, you can list the output path using the aws s3 ls command from the AWS CLI:
PRE type=CommitCommentEvent/
PRE type=CreateEvent/
PRE type=DeleteEvent/
PRE type=ForkEvent/
PRE type=GollumEvent/
PRE type=IssueCommentEvent/
PRE type=IssuesEvent/
PRE type=MemberEvent/
PRE type=PublicEvent/
PRE type=PullRequestEvent/
PRE type=PullRequestReviewCommentEvent/
PRE type=PushEvent/
PRE type=ReleaseEvent/
PRE type=WatchEvent/
As expected, there is a partition for each distinct event type. In this example, we partitioned by a single value, but this is by no means required. For example, if you want to preserve the original partitioning by year, month, and day, you could simply set the partitionKeys option to be Seq(“year”, “month”, “day”).
Conclusion
In this post, we showed you how to work with partitioned data in AWS Glue. Partitioning is a crucial technique for getting the most out of your large datasets. Many tools in the AWS big data ecosystem, including Amazon Athena and Amazon Redshift Spectrum, take advantage of partitions to accelerate query processing. AWS Glue provides mechanisms to crawl, filter, and write partitioned data so that you can structure your data in Amazon S3 however you want, to get the best performance out of your big data applications.
Ben Sowell is a senior software development engineer at AWS Glue. He has worked for more than 5 years on ETL systems to help users unlock the potential of their data. In his free time, he enjoys reading and exploring the Bay Area.
Mohit Saxena is a senior software development engineer at AWS Glue. His passion is building scalable distributed systems for efficiently managing data on cloud. He also enjoys watching movies and reading about the latest technology.
Security updates have been issued by Debian (freeplane and jruby), Fedora (kernel and python-bleach), Gentoo (evince, gdk-pixbuf, and ncurses), openSUSE (kernel), Oracle (gcc, glibc, kernel, krb5, ntp, openssh, openssl, policycoreutils, qemu-kvm, and xdg-user-dirs), Red Hat (corosync, glusterfs, kernel, and kernel-rt), SUSE (openssl), and Ubuntu (openssl and perl).
Security updates have been issued by CentOS (libvorbis and thunderbird), Debian (pjproject), Fedora (compat-openssl10, java-1.8.0-openjdk-aarch32, libid3tag, python-pip, python3, and python3-docs), Gentoo (ZendFramework), Oracle (thunderbird), Red Hat (ansible, gcc, glibc, golang, kernel, kernel-alt, kernel-rt, krb5, kubernetes, libvncserver, libvorbis, ntp, openssh, openssl, pcs, policycoreutils, qemu-kvm, and xdg-user-dirs), SUSE (openssl and openssl1), and Ubuntu (python-crypto, ubuntu-release-upgrader, and wayland).
Jennifer Fox is back, this time with a Raspberry Pi Zero–controlled impact force monitor that will notify you if your collision is a worth a trip to the doctor.
Check out my latest Hacker in Residence project for SparkFun Electronics: the Helmet Guardian! It’s a Pi Zero powered impact force monitor that turns on an LED if your head/body experiences a potentially dangerous impact. Install in your sports helmets, bicycle, or car to keep track of impact and inform you when it’s time to visit the doctor.
Concussion
We’ve all knocked our heads at least once in our lives, maybe due to tripping over a loose paving slab, or to falling off a bike, or to walking into the corner of the overhead cupboard door for the third time this week — will I ever learn?! More often than not, even when we’re seeing stars, we brush off the accident and continue with our day, oblivious to the long-term damage we may be doing.
Force of impact
After some thorough research, Jennifer Fox, founder of FoxBot Industries, concluded that forces of 4 to 6 G sustained for more than a few seconds are dangerous to the human body. With this in mind, she decided to use a Raspberry Pi Zero W and an accelerometer to create helmet with an impact force monitor that notifies its wearer if this level of G-force has been met.
Obviously, if you do have a serious fall, you should always seek medical advice. This project is an example of how affordable technology can be used to create medical and citizen science builds, and not a replacement for professional medical services.
Setting up the impact monitor
Jennifer’s monitor requires only a few pieces of tech: a Zero W, an accelerometer and breakout board, a rechargeable USB battery, and an LED, plus the standard wires and resistors for these components.
After installing Raspbian, Jennifer enabled SSH and I2C on the Zero W to make it run headlessly, and then accessed it from a laptop. This allows her to control the Pi without physically connecting to it, and it makes for a wireless finished project.
Jen wired the Pi to the accelerometer breakout board and LED as shown in the schematic below.
The LED acts as a signal of significant impacts, turning on when the G-force threshold is reached, and not turning off again until the program is reset.
Recently, we launched AWS Secrets Manager, a service that makes it easier to rotate, manage, and retrieve database credentials, API keys, and other secrets throughout their lifecycle. You can configure Secrets Manager to rotate secrets automatically, which can help you meet your security and compliance needs. Secrets Manager offers built-in integrations for MySQL, PostgreSQL, and Amazon Aurora on Amazon RDS, and can rotate credentials for these databases natively. You can control access to your secrets by using fine-grained AWS Identity and Access Management (IAM) policies. To retrieve secrets, employees replace plaintext secrets with a call to Secrets Manager APIs, eliminating the need to hard-code secrets in source code or update configuration files and redeploy code when secrets are rotated.
In this post, I introduce the key features of Secrets Manager. I then show you how to store a database credential for a MySQL database hosted on Amazon RDS and how your applications can access this secret. Finally, I show you how to configure Secrets Manager to rotate this secret automatically.
Key features of Secrets Manager
These features include the ability to:
Rotate secrets safely. You can configure Secrets Manager to rotate secrets automatically without disrupting your applications. Secrets Manager offers built-in integrations for rotating credentials for Amazon RDS databases for MySQL, PostgreSQL, and Amazon Aurora. You can extend Secrets Manager to meet your custom rotation requirements by creating an AWS Lambda function to rotate other types of secrets. For example, you can create an AWS Lambda function to rotate OAuth tokens used in a mobile application. Users and applications retrieve the secret from Secrets Manager, eliminating the need to email secrets to developers or update and redeploy applications after AWS Secrets Manager rotates a secret.
Secure and manage secrets centrally. You can store, view, and manage all your secrets. By default, Secrets Manager encrypts these secrets with encryption keys that you own and control. Using fine-grained IAM policies, you can control access to secrets. For example, you can require developers to provide a second factor of authentication when they attempt to retrieve a production database credential. You can also tag secrets to help you discover, organize, and control access to secrets used throughout your organization.
Monitor and audit easily. Secrets Manager integrates with AWS logging and monitoring services to enable you to meet your security and compliance requirements. For example, you can audit AWS CloudTrail logs to see when Secrets Manager rotated a secret or configure AWS CloudWatch Events to alert you when an administrator deletes a secret.
Pay as you go. Pay for the secrets you store in Secrets Manager and for the use of these secrets; there are no long-term contracts or licensing fees.
Get started with Secrets Manager
Now that you’re familiar with the key features, I’ll show you how to store the credential for a MySQL database hosted on Amazon RDS. To demonstrate how to retrieve and use the secret, I use a python application running on Amazon EC2 that requires this database credential to access the MySQL instance. Finally, I show how to configure Secrets Manager to rotate this database credential automatically. Let’s get started.
I select Credentials for RDS database because I’m storing credentials for a MySQL database hosted on Amazon RDS. For this example, I store the credentials for the database superuser. I start by securing the superuser because it’s the most powerful database credential and has full access over the database.
Next, I review the encryption setting and choose to use the default encryption settings. Secrets Manager will encrypt this secret using the Secrets Manager DefaultEncryptionKeyDefaultEncryptionKey in this account. Alternatively, I can choose to encrypt using a customer master key (CMK) that I have stored in AWS KMS.
Next, I view the list of Amazon RDS instances in my account and select the database this credential accesses. For this example, I select the DB instance mysql-rds-database, and then I select Next.
In this step, I specify values for Secret Name and Description. For this example, I use Applications/MyApp/MySQL-RDS-Database as the name and enter a description of this secret, and then select Next.
For the next step, I keep the default setting Disable automatic rotation because my secret is used by my application running on Amazon EC2. I’ll enable rotation after I’ve updated my application (see Phase 2 below) to use Secrets Manager APIs to retrieve secrets. I then select Next.
Review the information on the next screen and, if everything looks correct, select Store. We’ve now successfully stored a secret in Secrets Manager.
Next, I select See sample code.
Take note of the code samples provided. I will use this code to update my application to retrieve the secret using Secrets Manager APIs.
Phase 2: Update an application to retrieve secret from Secrets Manager
Now that I have stored the secret in Secrets Manager, I update my application to retrieve the database credential from Secrets Manager instead of hard coding this information in a configuration file or source code. For this example, I show how to configure a python application to retrieve this secret from Secrets Manager.
Previously, I configured my application to retrieve the database user name and password from the configuration file. Below is the source code for my application. import MySQLdb import config
def no_secrets_manager_sample()
# Get the user name, password, and database connection information from a config file. database = config.database user_name = config.user_name password = config.password
# Use the user name, password, and database connection information to connect to the database db = MySQLdb.connect(database.endpoint, user_name, password, database.db_name, database.port)
I use the sample code from Phase 1 above and update my application to retrieve the user name and password from Secrets Manager. This code sets up the client and retrieves and decrypts the secret Applications/MyApp/MySQL-RDS-Database. I’ve added comments to the code to make the code easier to understand. # Use the code snippet provided by Secrets Manager. import boto3 from botocore.exceptions import ClientError
def get_secret(): #Define the secret you want to retrieve secret_name = "Applications/MyApp/MySQL-RDS-Database" #Define the Secrets mManager end-point your code should use. endpoint_url = "https://secretsmanager.us-east-1.amazonaws.com" region_name = "us-east-1"
#Use the client to retrieve the secret try: get_secret_value_response = client.get_secret_value( SecretId=secret_name ) #Error handling to make it easier for your code to tolerate faults except ClientError as e: if e.response['Error']['Code'] == 'ResourceNotFoundException': print("The requested secret " + secret_name + " was not found") elif e.response['Error']['Code'] == 'InvalidRequestException': print("The request was invalid due to:", e) elif e.response['Error']['Code'] == 'InvalidParameterException': print("The request had invalid params:", e) else: # Decrypted secret using the associated KMS CMK # Depending on whether the secret was a string or binary, one of these fields will be populated if 'SecretString' in get_secret_value_response: secret = get_secret_value_response['SecretString'] else: binary_secret_data = get_secret_value_response['SecretBinary']
# Your code goes here.
Applications require permissions to access Secrets Manager. My application runs on Amazon EC2 and uses an IAM role to obtain access to AWS services. I will attach the following policy to my IAM role. This policy uses the GetSecretValue action to grant my application permissions to read secret from Secrets Manager. This policy also uses the resource element to limit my application to read only the Applications/MyApp/MySQL-RDS-Database secret from Secrets Manager. You can visit the AWS Secrets Manager Documentation to understand the minimum IAM permissions required to retrieve a secret. { "Version": "2012-10-17", "Statement": { "Sid": "RetrieveDbCredentialFromSecretsManager", "Effect": "Allow", "Action": "secretsmanager:GetSecretValue", "Resource": "arn:aws:secretsmanager:::secret:Applications/MyApp/MySQL-RDS-Database" } }
Phase 3: Enable Rotation for Your Secret
Rotating secrets periodically is a security best practice because it reduces the risk of misuse of secrets. Secrets Manager makes it easy to follow this security best practice and offers built-in integrations for rotating credentials for MySQL, PostgreSQL, and Amazon Aurora databases hosted on Amazon RDS. When you enable rotation, Secrets Manager creates a Lambda function and attaches an IAM role to this function to execute rotations on a schedule you define.
Note: Configuring rotation is a privileged action that requires several IAM permissions and you should only grant this access to trusted individuals. To grant these permissions, you can use the AWS IAMFullAccess managed policy.
Next, I show you how to configure Secrets Manager to rotate the secret Applications/MyApp/MySQL-RDS-Database automatically.
From the Secrets Manager console, I go to the list of secrets and choose the secret I created in the first step Applications/MyApp/MySQL-RDS-Database.
I scroll to Rotation configuration, and then select Edit rotation.
To enable rotation, I select Enable automatic rotation. I then choose how frequently I want Secrets Manager to rotate this secret. For this example, I set the rotation interval to 60 days.
Next, Secrets Manager requires permissions to rotate this secret on your behalf. Because I’m storing the superuser database credential, Secrets Manager can use this credential to perform rotations. Therefore, I select Use the secret that I provided in step 1, and then select Next.
The banner on the next screen confirms that I have successfully configured rotation and the first rotation is in progress, which enables you to verify that rotation is functioning as expected. Secrets Manager will rotate this credential automatically every 60 days.
Summary
I introduced AWS Secrets Manager, explained the key benefits, and showed you how to help meet your compliance requirements by configuring AWS Secrets Manager to rotate database credentials automatically on your behalf. Secrets Manager helps you protect access to your applications, services, and IT resources without the upfront investment and on-going maintenance costs of operating your own secrets management infrastructure. To get started, visit the Secrets Manager console. To learn more, visit Secrets Manager documentation.
If you have comments about this post, submit them in the Comments section below. If you have questions about anything in this post, start a new thread on the Secrets Manager forum.
Want more AWS Security news? Follow us on Twitter.
SPECIAL NOTE*** THE FULL TUTORIAL WILL BE AVAILABLE NEXT WEEK April Fools! What a terrible day. So many pranks. You can’t believe anything you read. People invading your space. The mental and physical anguish of enduring the day. It’s time to fight back! Let’s catch the perps in action by making a device that always watches.
Keeping tabs
A Raspberry Pi Zero W, a small camera, and a rechargeable Lithium Polymer (LiPo) battery constitute the bulk of this project’s tech. A pair of 3D-printed parts, and gelatine-solidified Coke Zero make up the fake fizzy body.
“So let’s make this video as short as possible and just buy a cheap pre-made spy cam off of Amazon. Just kidding,” Tinkernut jokes in the tutorial video for the project, before going through the step-by-step process of using the Raspberry Pi to “DIY this the right way”.
After accessing the Zero W from his laptop via SSH, Tinkernut opted for using the rpi_camera_surveillance_system Python script written by GitHub user RuiSantosdotme to control the spy cam. Luckily, this meant no additional library setup, and basically no lag on the video feed.
What we want to do is create a script that activates the camera and serves it to a web page so that we can access it from any web browser. There are plenty of different ways to do this (Motion, Raspivid, etc), but I found a simple Python script that does everything I need it to do and doesn’t require any extra software or libraries to install. The best thing about it is that the lag time is practically unnoticeable.
With the code in place, every boot-up of the Raspberry Pi automatically launches both the script and a web page of the live video, allowing for constant monitoring of potential sneaks and thieves.
The projects is powered by a 1500mAh LiPo battery and the Adafruit LiPo charger. It also includes a simple on/off switch, which Tinkernut wired to the charger and the Pi’s PP1 and PP6 connector pads.
Tinkernut decided to use a Coke Zero bottle for the build, incorporating 3D-printed parts to house the Pi, and a mix of Coke and gelatine to create a realistic-looking filling for the bottle. However, the setup can be transferred to pretty much any hollow item in your home, say, a cookie jar or a cracker box. So get creative and get spying!
A complete spy cam how-to
If you’d like to make your own secret spy cam, you can find a tutorial for Tinkernut’s build at hackster.io, or follow along with his video below. Also make sure to subscribe his YouTube channel to be updated on all his newest builds — they’re rather splendid.
Learn how to take a regular Coke Zero bottle, cram a Raspberry Pi and webcam inside of it, and have it still look like a regular Coke Zero bottle. Why would you want to do this? To spy on those irritating April Fooligans!!!
Hadoop User Experience (Hue) is an open-source, web-based, graphical user interface for use with Amazon EMR and Apache Hadoop. The Hue database stores things like users, groups, authorization permissions, Apache Hive queries, Apache Oozie workflows, and so on.
There might come a time when you want to migrate your Hue database to a new EMR cluster. For example, you might want to upgrade from an older version of the Amazon EMR AMI (Amazon Machine Image), but your Hue application and its database have had a lot of customization.You can avoid re-creating these user entities and retain query/workflow histories in Hue by migrating the existing Hue database, or remote database in Amazon RDS, to a new cluster.
By default, Hue user information and query histories are stored in a local MySQL database on the EMR cluster’s master node. However, you can create one or more Hue-enabled clusters using a configuration stored in Amazon S3 and a remote MySQL database in Amazon RDS. This allows you to preserve user information and query history that Hue creates without keeping your Amazon EMR cluster running.
This post describes the step-by-step process for migrating the Hue database from an existing EMR cluster.
Note: Amazon EMR supports different Hue versions across different AMI releases. Keep in mind the compatibility of Hue versions between the old and new clusters in this migration activity. Currently, Hue 3.x.x versions are not compatible with Hue 4.x.x versions, and therefore a migration between these two Hue versions might create issues. In addition, Hue 3.10.0 is not backward compatible with its previous 3.x.x versions.
Before you begin
First, let’s create a new testUser in Hue on an existing EMR cluster, as shown following:
You will use these credentials later to log in to Hue on the new EMR cluster and validate whether you have successfully migrated the Hue database.
Let’s get started!
Migration how-to
Follow these steps to migrate your database to a new EMR cluster and then validate the migration process.
1.) Make a backup of the existing Hue database.
Use SSH to connect to the master node of the old cluster, as shown following (if you are using Linux/Unix/macOS), and dump the Hue database to a JSON file.
Edit the hue-mysql.json output file by removing all JSON objects that have useradmin.userprofile in the model field, and save the file. For example, remove the objects as shown following:
2.) Store the hue-mysql.json file on persistent storage like Amazon S3.
You can copy the file from the old EMR cluster to Amazon S3 using the AWS CLI or Secure Copy (SCP) client. For example, the following uses the AWS CLI:
b.) Connect to the Hue database—either the local MySQL database or the remote database in Amazon RDS for your cluster as shown following, using the mysql client.
$ mysql -h HOST –u USER –pPASSWORD
For a local MySQL database, you can find the hostname, user name, and password for connecting to the database in the /etc/hue/conf/hue.ini file on the master node.
[[database]]
engine = mysql
name = huedb
case_insensitive_collation = utf8_unicode_ci
test_charset = utf8
test_collation = utf8_bin
host = ip-172-31-37-133.us-west-2.compute.internal
user = hue
test_name = test_huedb
password = QdWbL3Ai6GcBqk26
port = 3306
Based on the preceding example configuration, the sample command is as follows. (Replace the host, user, and password details based on your EMR cluster settings.)
$ mysql -h ip-172-31-37-133.us-west-2.compute.internal -u hue -pQdWbL3Ai6GcBqk26
c.) Drop the existing Hue database with the name huedb from the MySQL server.
mysql> DROP DATABASE IF EXISTS huedb;
d.) Create a new empty database with the same name huedb.
mysql> CREATE DATABASE huedb DEFAULT CHARACTER SET utf8 DEFAULT COLLATE=utf8_bin;
i.) In MySQL, add the foreign key content_type_id back to the auth_permission
mysql> use huedb;
mysql> ALTER TABLE huedb.auth_permission ADD FOREIGN KEY (`content_type_id`) REFERENCES `django_content_type` (`id`);
j.) Start the Hue service again.
$ sudo start hue
hue start/running, process XXXX
That’s it! Now, verify whether you can successfully access the Hue UI, and sign in using your existing testUser credentials.
After a successful sign in to Hue on the new EMR cluster, you should see a similar Hue homepage as shown following with testUser as the user signed in:
Conclusion
You have now learned how to migrate an existing Hue database to a new Amazon EMR cluster and validate the migration process. If you have any similar Amazon EMR administration topics that you want to see covered in a future post, please let us know in the comments below.
Anvesh Ragi is a Big Data Support Engineer with Amazon Web Services. He works closely with AWS customers to provide them architectural and engineering assistance for their data processing workflows. In his free time, he enjoys traveling and going for hikes.
Data that describe processes in a spatial context are everywhere in our day-to-day lives and they dominate big data problems. Map data, for instance, whether describing networks of roads or remote sensing data from satellites, get us where we need to go. Atmospheric data from simulations and sensors underlie our weather forecasts and climate models. Devices and sensors with GPS can provide a spatial context to nearly all mobile data.
In this post, we introduce the WIND toolkit, a huge (500 TB), open weather model dataset that’s available to the world on Amazon’s cloud services. We walk through how to access this data and some of the open-source software developed to make it easily accessible. Our solution considers a subset of geospatial data that exist on a grid (raster) and explores ways to provide access to large-scale raster data from weather models. The solution uses foundational AWS services and the Hierarchical Data Format (HDF), a well adopted format for scientific data.
The approach developed here can be extended to any data that fit in an HDF5 file, which can describe sparse and dense vectors and matrices of arbitrary dimensions. This format is already popular within the physical sciences for both experimental and simulation data. We discuss solutions to gridded data storage for a massive dataset of public weather model outputs called the Wind Integration National Dataset (WIND) toolkit. We also highlight strategies that are general to other large geospatial data management problems.
Wind Integration National Dataset
As variable renewable power penetration levels increase in power systems worldwide, the importance of renewable integration studies to ensure continued economic and reliable operation of the power grid is also increasing. The WIND toolkit is the largest freely available grid integration dataset to date.
The WIND toolkit was developed by 3TIER by Vaisala. They were under a subcontract to the National Renewable Energy Laboratory (NREL) to support studies on integration of wind energy into the existing US grid. NREL is a part of a network of national laboratories for the US Department of Energy and has a mission to advance the science and engineering of energy efficiency, sustainable transportation, and renewable power technologies.
The toolkit has been used by consultants, research groups, and universities worldwide to support grid integration studies. Less traditional uses also include resource assessments for wind plants (such as those powering Amazon data centers), and studying the effects of weather on California condor migrations in the Baja peninsula.
The diversity of applications highlights the value of accessible, open public data. Yet, there’s a catch: the dataset is huge. The WIND toolkit provides simulated atmospheric (weather) data at a two-km spatial resolution and five-minute temporal resolution at multiple heights for seven years. The entire dataset is half a petabyte (500 TB) in size and is stored in the NREL High Performance Computing data center in Golden, Colorado. Making this dataset publicly available easily and in a cost-effective manner is a major challenge.
As other laboratories and public institutions work to release their data to the world, they may face similar challenges to those that we experienced. Some prior, well-intentioned efforts to release huge datasets as-is have resulted in data resources that are technically available but fundamentally unusable. They may be stored in an unintuitive format or indexed and organized to support only a subset of potential uses. Downloading hundreds of terabytes of data is often impractical. Most users don’t have access to a big data cluster (or super computer) to slice and dice the data as they need after it’s downloaded.
We aim to provide a large amount of data (50 terabytes) to the public in a way that is efficient, scalable, and easy to use. In many cases, researchers can access these huge cloud-located datasets using the same software and algorithms they have developed for smaller datasets stored locally. Only the pieces of data they need for their individual analysis must be downloaded. To make this work in practice, we worked with the HDF Group and have built upon their forthcoming Highly Scalable Data Service.
In the rest of this post, we discuss how the HSDS software was developed to use Amazon EC2 and Amazon S3 resources to provide convenient and scalable access to these huge geospatial datasets. We describe how the HSDS service has been put to work for the WIND Toolkit dataset and demonstrate how to access it using the h5pyd Python library and the REST API. We conclude with information about our ongoing work to release more ‘open’ datasets to the public using AWS services, and ways to improve and extend the HSDS with newer Amazon services like Amazon ECS and AWS Lambda.
Developing a scalable service for big geospatial data
The HDF5 file format and API have been used for many years and is an effective means of storing large scientific datasets. For example, NASA’s Earth Observing System (EOS) satellites collect more than 16 TBs of data per day using HDF5.
With the rise of the cloud, there are new challenges and opportunities to rethink how HDF5 can be enhanced to work effectively as a component in a cloud-native architecture. For the HDF Group, working with NREL has been a great opportunity to put ideas into practice with a production-size dataset.
An HDF5 file consists of a directed graph of group and dataset objects. Datasets can be thought of as a multidimensional array with support for user-defined metadata tags and compression. Typical operations on datasets would be reading or writing data to a regular subregion (a hyperslab) or reading and writing individual elements (a point selection). Also, group and dataset objects may each contain an arbitrary number of the user-defined metadata elements known as attributes.
Many people have used the HDF library in applications developed or ported to run on EC2 instances, but there are a number of constraints that often prove problematic:
The HDF5 library can’t read directly from HDF5 files stored as S3 objects. The entire file (often many GB in size) would need to be copied to local storage before the first byte can be read. Also, the instance must be configured with the appropriately sized EBS volume)
The HDF library only has access to the computational resources of the instance itself (as opposed to a cluster of instances), so many operations are bottlenecked by the library.
Any modifications to the HDF5 file would somehow have to be synchronized with changes that other instances have made to same file before writing back to S3.
Using a pattern common to many offerings from AWS, the solution to these constraints is to develop a service framework around the HDF data model. Using this model, the HDF Group has created the Highly Scalable Data Service (HSDS) that provides all the functionality that traditionally was provided by the HDF5 library. By using the service, you don’t need to manage your own file volumes, but can just read and write whatever data that you need.
Because the service manages the actual data persistence to a durable medium (S3, in this case), you don’t need to worry about disk management. Simply stream the data you need from the service as you need it. Secondly, putting the functionality behind a service allows some tricks to increase performance (described in more detail later). And lastly, HSDS allows any number of clients to access the data at the same time, enabling HDF5 to be used as a coordination mechanism for multiple readers and writers.
In designing the HSDS architecture, we gave much thought to how to achieve scalability of the HSDS service. For accessing HDF5 data, there are two different types of scaling to consider:
Multiple clients making many requests to the service
Single requests that require a significant amount of data processing
To deal with the first scaling challenge, as with most services, we considered how the service responds as the request rate increases. AWS provides some great tools that help in this regard:
Auto Scaling groups
Elastic Load Balancing load balancers
The ability of S3 to handle large aggregate throughput rates
By using a cluster of EC2 instances behind a load balancer, you can handle different client loads in a cost-effective manner.
The second scaling challenge concerns single requests that would take significant processing time with just one compute node. One example of this from the WIND toolkit would be extracting all the values in the seven-year time span for a given geographic point and dataset.
In HDF5, large datasets are typically stored as “chunks”; that is, a regular partition of the array. In HSDS, each chunk is stored as a binary object in S3. The sequential approach to retrieving the time series values would be for the service to read each chunk needed from S3, extract the needed elements, and go on to the next chunk. In this case, that would involve processing 2557 chunks, and would be quite slow.
Fortunately, with HSDS, you can speed this up quite a bit by exploiting the compute and I/O capabilities of the cluster. Upon receiving the request, the receiving node can use other nodes in the cluster to read different portions of the selection. With multiple nodes reading from S3 in parallel, performance improves as the cluster size increases.
The diagram below illustrates how this works in simplified case of four chunks and four nodes.
This architecture has worked in well in practice. In testing with the WIND toolkit and time series extraction, we observed a request latency of ~60 seconds using four nodes vs. ~5 seconds with 40 nodes. Performance roughly scales with the size of the cluster.
A planned enhancement to this is to use AWS Lambda for the worker processing. This enables 1000-way parallel reads at a reasonable cost, as you only pay for the milliseconds of CPU time used with AWS Lambda.
Public access to atmospheric data using HSDS and AWS
An early challenge in releasing the WIND toolkit data was in deciding how to subset the data for different use cases. In general, few researchers need access to the entire 0.5 PB of data and a great deal of efficiency and cost reduction can be gained by making directed constituent datasets.
NREL grid integration researchers initially extracted a 2-TB subset by selecting 120,000 points where the wind resource seemed appropriate for development. They also chose only those data important for wind applications (100-m wind speed, converted to power), the most interesting locations for those performing grid studies. To support the remaining users who needed more data resolution, we down-sampled the data to a 60-minute temporal resolution, keeping all the other variables and spatial resolution intact. This reduced dataset is 50 TB of data describing 30+ atmospheric variables of data for 7 years at a 60-minute temporal resolution.
The WindViz browser-based Gridded Wind Toolkit Visualizer was created as an example implementation of the HSDS REST API in JavaScript. The visualizer is written in the style of ECMAScript 2016 using a modern development toolchain that includes webpack and Babel. The source code is available through our GitHub repository. The demo page is hosted via GitHub pages, and we use a cross-origin AJAX request to fetch data from the HSDS service running on the EC2 infrastructure. The visualizer can be used to explore the gridded wind toolkit data on a map. Achieve full spatial resolution by zooming in to a specific region.
Programmatic access is possible using the h5pyd Python library, a distributed analog to the widely used h5py library. Users interact with the datasets (variables) and slice the data from its (time x longitude x latitude) cube form as they see fit.
Examples and use cases are described in a set of Jupyter notebooks and available on GitHub:
Now you have a Jupyter notebook server running on your EC2 server.
From your laptop, create an SSH tunnel:
$ ssh –L 8888:localhost:8888 (IP address of the EC2 server)
Now, you can browse to localhost:8888 using the correct token, and interact with the notebooks as if they were local. Within the directory, there are examples for accessing the HSDS API and plotting wind and weather data using matplotlib.
Controlling access and defraying costs
A final concern is rate limiting and access control. Although the HSDS service is scalable and relatively robust, we had a few practical concerns:
How can we protect from malicious or accidental use that may lead to high egress fees (for example, someone who attempts to repeatedly download the entire dataset from S3)?
How can we keep track of who is using the data both to document the value of the data resource and to justify the costs?
If costs become too high, can we charge for some or all API use to help cover the costs?
To approach these problems, we investigated using Amazon API Gateway and its simplified integration with the AWS Marketplace for SaaS monetization as well as third-party API proxies.
In the end, we chose to use API Umbrella due to its close involvement with http://data.gov. While AWS Marketplace is a compelling option for future datasets, the decision was made to keep this dataset entirely open, at least for now. As community use and associated costs grow, we’ll likely revisit Marketplace. Meanwhile, API Umbrella provides controls for rate limiting and API key registration out of the box and was simple to implement as a front-end proxy to HSDS. Those applications that may want to charge for API use can accomplish a similar strategy using Amazon API Gateway and AWS Marketplace.
Ongoing work and other resources
As NREL and other government research labs, municipalities, and organizations try to share data with the public, we expect many of you will face similar challenges to those we have tried to approach with the architecture described in this post. Providing large datasets is one challenge. Doing so in a way that is affordable and convenient for users is an entirely more difficult goal. Using AWS cloud-native services and the existing foundation of the HDF file format has allowed us to tackle that challenge in a meaningful way.
Dr. Caleb Phillips is a senior scientist with the Data Analysis and Visualization Group within the Computational Sciences Center at the National Renewable Energy Laboratory. Caleb comes from a background in computer science systems, applied statistics, computational modeling, and optimization. His work at NREL spans the breadth of renewable energy technologies and focuses on applying modern data science techniques to data problems at scale.
Dr. Caroline Draxl is a senior scientist at NREL. She supports the research and modeling activities of the US Department of Energy from mesoscale to wind plant scale. Caroline uses mesoscale models to research wind resources in various countries, and participates in on- and offshore boundary layer research and in the coupling of the mesoscale flow features (kilometer scale) to the microscale (tens of meters). She holds a M.S. degree in Meteorology and Geophysics from the University of Innsbruck, Austria, and a PhD in Meteorology from the Technical University of Denmark.
John Readey has been a Senior Architect at The HDF Group since he joined in June 2014. His interests include web services related to HDF, applications that support the use of HDF and data visualization.Before joining The HDF Group, John worked at Amazon.com from 2006–2014 where he developed service-based systems for eCommerce and AWS.
Jordan Perr-Sauer is an RPP intern with the Data Analysis and Visualization Group within the Computational Sciences Center at the National Renewable Energy Laboratory. Jordan hopes to use his professional background in software engineering and his academic training in applied mathematics to solve the challenging problems facing America and the world.
Amazon EMR empowers many customers to build big data processing applications quickly and cost-effectively, using popular distributed frameworks such as Apache Spark, Apache HBase, Presto, and Apache Flink. For organizations that are crafting their analytical applications on Amazon EMR, there is a growing need to keep their data assets organized in an automated fashion. Because datasets tend to grow exponentially, using cataloging tools is essential to automating data discovery and organizing data assets.
AWS Glue Data Catalog provides this essential capability, allowing you to automatically discover and catalog metadata about your data stores in a central repository. Since Amazon EMR 5.8.0, customers have been using the AWS Glue Data Catalog as a metadata store for Apache Hive and Spark SQL applications that are running on Amazon EMR. Starting with Amazon EMR 5.10.0, you can catalog datasets using AWS Glue and run queries using Presto on Amazon EMR from the Hue (Hadoop User Experience) and Apache Zeppelin UIs.
You might wonder what scenarios warrant using Presto running on Amazon EMR and when to choose Amazon Athena (which uses Presto as the query engine under the hood). It is important to note that both are excellent tools for querying massive amounts of data and addressing different needs and use cases.
Amazon Athena provides the easiest way to run interactive queries for data in Amazon S3 without needing to set up or manage any servers. Presto running on Amazon EMR gives you much more flexibility in how you configure and run your queries, providing the ability to federate to other data sources if needed. For example, you might have a use case that requires LDAP authentication for clients such as the Presto CLI or JDBC/ODBC drivers. Or you might have a workflow where you need to join data between different systems like MySQL/Amazon Redshift/Apache Cassandra and Hive. In these examples, Presto running on Amazon EMR is the right tool to use because it can be configured to enable LDAP authentication in addition to the desired database connectors at cluster launch.
Now, let’s look at how metadata management for Presto works with AWS Glue.
Using an AWS Glue crawler to discover datasets
The AWS Glue Data Catalog is a reference to the location, schema, and runtime metrics of your datasets. To create this reference metadata, AWS Glue needs to crawl your datasets. In this exercise, we use an AWS Glue crawler to populate tables in the Data Catalog for the NYC taxi rides dataset.
The following are the steps for adding a crawler:
Sign in to the AWS Management Console, and open the AWS Glue console. In the navigation pane, choose Crawlers. Then choose Add crawler.
On the Add a data store page, specify the location of the NYC taxi rides dataset.
In the next step, choose an existing IAM role if one is available, or create a new role. Then choose Next.
On the scheduling page, for Frequency, choose Run on demand.
On the Configure the crawler’s output page, choose Add database. Specify blog-db as the database name. (You can specify a name of your choice, but be sure to choose the correct database name when running queries.)
Follow the remaining steps using the default values to create a crawler.
When the crawler displays the Ready state, navigate to the Databases (Choose blog-db from the list of databases, or search for it by specifying it as a filter, as shown in the following screenshot.) Then choose Tables. You should see the three tables created by the crawler, as follows.
(Optional) The discovered data is classified as CSV files. You can optionally convert this data into Parquet format for better response times on your queries.
Launching an Amazon EMR cluster
With the dataset discovered and organized, we can now walk through different options for launching Presto on an Amazon EMR cluster to use the AWS Glue Data Catalog.
After you’ve set up the Amazon EMR cluster with Presto, the AWS Glue Data Catalog is available through a default “hive” catalog. To change between the Hive and Glue metastores, you have to manually update hive.properties and restart the Presto server. Connect to the master node on your EMR cluster using SSH, and run the Presto CLI to start running queries interactively.
$ presto-cli --catalog hive
Begin with a simple query to sample a few rows:
presto> SELECT * FROM “blog-db”.taxi limit 10;
The query shows a few sample rows as follows:
Query the average fare for trips at each hour of the day and for each day of the month on the Parquet version of the taxi dataset.
presto> SELECT EXTRACT (HOUR FROM pickup_datetime) AS hour, avg(fare_amount) AS average_fare FROM “blog-db”.taxi_parquet GROUP BY 1 ORDER BY 1;
The following image shows the results:
More interestingly, you can compute the number of trips that gave tips in the 10 percent, 15 percent, or higher percentage range:
presto> -- Tip Percent Category
SELECT TipPrctCtgry
, COUNT (DISTINCT TripID) TripCt
FROM
(SELECT TripID
, (CASE
WHEN fare_prct < 0.7 THEN 'FL70'
WHEN fare_prct < 0.8 THEN 'FL80'
WHEN fare_prct < 0.9 THEN 'FL90'
ELSE 'FL100'
END) FarePrctCtgry
, (CASE
WHEN tip_prct < 0.1 THEN 'TL10'
WHEN tip_prct < 0.15 THEN 'TL15'
WHEN tip_prct < 0.2 THEN 'TL20'
ELSE 'TG20'
END) TipPrctCtgry
FROM
(SELECT TripID
, (fare_amount / total_amount) as fare_prct
, (extra / total_amount) as extra_prct
, (mta_tax / total_amount) as tip_prct
, (tolls_amount / total_amount) as mta_taxprct
, (tip_amount / total_amount) as tolls_prct
, (improvement_surcharge / total_amount) as imprv_suchrgprct
, total_amount
FROM
(SELECT *
, (cast(pickup_longitude AS VARCHAR(100)) || '_' || cast(pickup_latitude AS VARCHAR(100))) as TripID
from "blog-db”.taxi_parquet
WHERE total_amount > 0
) as t
) as t
) ct
GROUP BY TipPrctCtgry;
The results are as follows:
While the preceding query is running, navigate to the web interface for Presto on Amazon EMR at <http://master-public-dns-name:8889/. Here you can look into the query metrics, such as active worker nodes, number of rows read per second, reserved memory, and parallelism.
Running queries in the Presto Editor on Hue
If you installed Hue with your Amazon EMR launch, you can also run queries on Hue’s Presto Editor. On the Amazon EMR Cluster console, choose Enable Web Connection, and follow the instructions to access the web interfaces for Hue and Zeppelin.
After the web connection is enabled, choose the Hue link to open the web interface. At the login screen, if you are the administrator logging in for the first time, type a user name and password to create your Hue superuser account. Then choose Create account. Otherwise, type your user name and password and choose Create account, or type the credentials provided by your administrator.
Choose the Presto Editor from the menu. You can run Presto queries against your tables in the AWS Glue Data Catalog.
Conclusion
Having a shared data catalog for applications on Amazon EMR alleviates a myriad of data-related challenges that organizations face today—including discovery, governance, auditability, and collaboration. In this post, we explored how the AWS Glue Data Catalog addresses discoverability and manageability for table metadata for Presto on Amazon EMR. Go ahead, give this a try, and share your experience with us!
Radhika Ravirala is a Solutions Architect at Amazon Web Services where she helps customers craft distributed big data applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley. She holds a M.S in computer science from San Jose State University.
***This blog is authored by Mike Okner of Monsanto, an AWS customer. It originally appeared on the Monsanto company blog. Minor edits were made to the original post.***
Recently, I was looking to create a status page app to monitor a few important internal services. I wanted this app to be as lightweight, reliable, and hassle-free as possible, so using a “serverless” architecture that doesn’t require any patching or other maintenance was quite appealing.
I also don’t deploy anything in a production AWS environment outside of some sort of template (usually CloudFormation) as a rule. I don’t want to have to come back to something I created ad hoc in the console after 6 months and try to recall exactly how I architected all of the resources. I’ll inevitably forget something and create more problems before solving the original one. So building the status page in a template was a requirement.
The Design I settled on a design using two Lambda functions, both written in Python 3.6.
The first Lambda function makes requests out to a list of important services and writes their current status to a DynamoDB table. This function is executed once per minute via CloudWatch Event Rule.
The second Lambda function reads each service’s status & uptime information from DynamoDB and renders a Jinja template. This function is behind an API Gateway that has been configured to return text/html instead of its default application/json Content-Type.
The CloudFormation Template AWS provides a Serverless Application Model template transformer to streamline the templating of Lambda + API Gateway designs, but it assumes (like everything else about the API Gateway) that you’re actually serving an API that returns JSON content. So, unfortunately, it won’t work for this use-case because we want to return HTML content. Instead, we’ll have to enumerate every resource like usual.
The Skeleton We’ll be using YAML for the template in this example. I find it easier to read than JSON, but you can easily convert between the two with a converter if you disagree.
The CheckerLambda is the actual Lambda function. The Code section is a local path to a ZIP file containing the code and its dependencies. I’m using CloudFormation’s packaging feature to automatically push the deployable to S3.
The CheckerLambdaRole is the IAM role the Lambda will assume which grants it access to DynamoDB in addition to the usual Lambda logging permissions.
The CheckerLambdaTimer is the CloudWatch Events Rule that triggers the checker to run once per minute.
The CheckerLambdaTimerPermission grants CloudWatch the ability to invoke the checker Lambda function on its interval.
The Web Page Gateway The API Gateway handles incoming requests for the web page, invokes the Lambda, and then returns the Lambda’s results as HTML content. Its template looks like:
There’s a lot going on here, but the real meat is in the PageGatewayMethod section. There are a couple properties that deviate from the default which is why we couldn’t use the SAM transformer.
First, we’re passing request headers through to the Lambda in theRequestTemplates section. I’m doing this so I can validate incoming auth headers. The API Gateway can do some types of auth, but I found it easier to check auth myself in the Lambda function since the Gateway is designed to handle API calls and not browser requests.
Next, note that in the IntegrationResponses section we’re defining the Content-Type header to be ‘text/html’ (with single-quotes) and defining the ResponseTemplate to be $input.path(‘$’). This is what makes the request render as a HTML page in your browser instead of just raw text.
Due to the StageName and PathPart values in the other sections, your actual page will be accessible at https://someId.execute-api.region.amazonaws.com/Prod/page. I have the page behind an existing reverse-proxy and give it a saner URL for end-users. The reverse proxy also attaches the auth header I mentioned above. If that header isn’t present, the Lambda will render an error page instead so the proxy can’t be bypassed.
The Web Page Rendering Lambda This Lambda is invoked by calls to the API Gateway and looks like:
The WebRenderLambda and WebRenderLambdaRole should look familiar.
The WebRenderLambdaGatewayPermission is similar to the Status Checker’s CloudWatch permission, only this time it allows the API Gateway to invoke this Lambda.
The DynamoDB Table This one is straightforward.
# DynamoDB table
DynamoTable:
Type: AWS::DynamoDB::Table
Properties:
AttributeDefinitions:
- AttributeName: name
AttributeType: S
ProvisionedThroughput:
WriteCapacityUnits: 1
ReadCapacityUnits: 1
TableName: status-page-checker-results
KeySchema:
- KeyType: HASH
AttributeName: name
The Deployment We’ve made it this far defining every resource in a template that we can check in to version control, so we might as well script the deployment as well rather than manually manage the CloudFormation Stack via the AWS web console.
Since I’m using the packaging feature, I first run:
$ aws cloudformation package \
--template-file template.yaml \
--s3-bucket <some-bucket-name> \
--output-template-file template-packaged.yaml
Uploading to 34cd6e82c5e8205f9b35e71afd9e1548 1922559 / 1922559.0 (100.00%) Successfully packaged artifacts and wrote output template to file template-packaged.yaml.
Then to deploy the template (whether new or modified), I run:
$ aws cloudformation deploy \
--region '<aws-region>' \
--template-file template-packaged.yaml \
--stack-name '<some-name>' \
--capabilities CAPABILITY_IAM
Waiting for changeset to be created.. Waiting for stack create/update to complete Successfully created/updated stack - <some-name>
And that’s it! You’ve just created a dynamic web page that will never require you to SSH anywhere, patch a server, recover from a disaster after Amazon terminates your unhealthy EC2, or any other number of pitfalls that are now the problem of some ops person at AWS. And you can reproduce deployments and make changes with confidence because everything is defined in the template and can be tracked in version control.
The collective thoughts of the interwebz
By continuing to use the site, you agree to the use of cookies. more information
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.