Michael Portera‘s trading card scanner uses LEGO, servo motors, and a Raspberry Pi and Camera Module to scan Magic: The Gathering cards and look up their prices online. This is a neat and easy-to-recreate project that you can adapt for whatever your, or your younger self’s, favourite trading cards are.
For those of you who aren’t this nerdy [Janina is 100% this nerdy – Ed.], Magic: The Gathering (or MTG for short) is a trading card game first launched in 1993. It’s based on a sprawling fantasy multiverse storyline, and is very heavy on mechanics — the current comprehensive rules fill 228 pages! You can imagine it as being a bit like Dungeons and Dragons, with less role-playing and more of a chess vibe. Unlike in chess, however, you can beat your MTG opponent in one turn with just the right combination of cards. If that’s your style of play, that is.
Scanning trading cards
So far, there are around 20000 official MTG cards, and, as with other types of trading cards, some of them are worth a lot of money.
Michael is one of the many people who were keen MTG players in their youth. Here’s how he came up with his project idea:
I was really into trading cards as a kid. I recently came across a lot of Magic: The Gathering cards in a box and thought to myself — I wonder how many cards I have and how much they’re worth?! Logging and looking these up manually would take a while, so I decided to see if I could automate some of the process. Somehow, the process led to building a platform out of Lego and leveraging AWS S3 and Rekognition.
LEGO, servos and camera
To build the housing of the scanner, Michael used LEGO, stating “I’m not good at wood working, and I thought that it might be rough on the cards.” While he doesn’t provide a build plan for the housing, Michael only used bricks from in the LEGO Medium Creative Brick Box he bought for the project. In addition, his tutorial includes a lot of pictures to guide you.
Servo motors spin plastic wheels to move single cards from a stack set into the scanner. Michael positioned a Raspberry Pi Camera Module so that it can take a picture of the title of each card as it is set before the lens. The length of the camera’s ribbon cable gave Michael a little difficulty, so he recommends getting an extension for it if you’re planning to recreate the build.
Optical character recognition and MTG card price API
On the software side, Michael wrote three scripts. One is a Python script to control the servos and take pictures. This, he says, “[records] about 20–25 cards a minute.”
Another script identifies the cards and looks up their prices automatically. Michael tried out OpenCV and Tesseract for optical character recognition (OCR) first, before settling on AWS S3 and Rekognition for storing and processing images, respectively. You’ll need an AWS account to do this — Michael used the free tier, which he says allows him to process 5000 pictures per month.
A sizeable collection
Finally, the data that Rekognition sends back gets processed by another Python script that looks up the identified cards on the TCGplayer API to find their price.
Michael says he’s very satisfied with the accuracy of the project’s OCR. He found out that the 920 Magic: The Gathering cards he scanned are worth about $275 in total. He provides a full write-up plus code over on hackster.io.
And for my next trick…
You might be thinking what I’m thinking: the logical next step for this project is to turn it into a card sorter. Then you could input a list of the card deck you want to put together, and presto! The device picks out the right cards from your collection. Building a Commander deck just became a little easier!
What trading cards would you use this project with, and how would you extend it? Also, what’s your favourite commander? Let me know in the comments!
The Internet of Things (IoT) has precipitated to an influx of connected devices and data that can be mined to gain useful business insights. If you own an IoT device, you might want the data to be uploaded seamlessly from your connected devices to the cloud so that you can make use of cloud storage and the processing power to perform sophisticated analysis of data. To upload the data to the AWS Cloud, devices must pass authentication and authorization checks performed by the respective AWS services. The standard way of authenticating AWS requests is the Signature Version 4 algorithm that requires the caller to have an access key ID and secret access key. Consequently, you need to hardcode the access key ID and the secret access key on your devices. Alternatively, you can use the built-in X.509 certificate as the unique device identity to authenticate AWS requests.
AWS IoT has introduced the credentials provider feature that allows a caller to authenticate AWS requests by having an X.509 certificate. The credentials provider authenticates a caller using an X.509 certificate, and vends a temporary, limited-privilege security token. The token can be used to sign and authenticate any AWS request. Thus, the credentials provider relieves you from having to manage and periodically refresh the access key ID and secret access key remotely on your devices.
In the process of retrieving a security token, you use AWS IoT to create a thing (a representation of a specific device or logical entity), register a certificate, and create AWS IoT policies. You also configure an AWS Identity and Access Management (IAM) role and attach appropriate IAM policies to the role so that the credentials provider can assume the role on your behalf. You also make an HTTP-over-Transport Layer Security (TLS) mutual authentication request to the credentials provider that uses your preconfigured thing, certificate, policies, and IAM role to authenticate and authorize the request, and obtain a security token on your behalf. You can then use the token to sign any AWS request using Signature Version 4.
In this blog post, I explain the AWS IoT credentials provider design and then demonstrate the end-to-end process of retrieving a security token from AWS IoT and using the token to write a temperature and humidity record to a specific Amazon DynamoDB table.
Note: This post assumes you are familiar with AWS IoT and IAM to perform steps using the AWS CLI and OpenSSL. Make sure you are running the latest version of the AWS CLI.
Overview of the credentials provider workflow
The following numbered diagram illustrates the credentials provider workflow. The diagram is followed by explanations of the steps.
To explain the steps of the workflow as illustrated in the preceding diagram:
The AWS IoT device uses the AWS SDK or custom client to make an HTTPS request to the credentials provider for a security token. The request includes the device X.509 certificate for authentication.
The credentials provider forwards the request to the AWS IoT authentication and authorization module to verify the certificate and the permission to request the security token.
If the certificate is valid and has permission to request a security token, the AWS IoT authentication and authorization module returns success. Otherwise, it returns failure, which goes back to the device with the appropriate exception.
If assuming the role succeeds, AWS STS returns a temporary, limited-privilege security token to the credentials provider.
The credentials provider returns the security token to the device.
The AWS SDK on the device uses the security token to sign an AWS request with AWS Signature Version 4.
The requested service invokes IAM to validate the signature and authorize the request against access policies attached to the preconfigured IAM role.
If IAM validates the signature successfully and authorizes the request, the request goes through.
In another solution, you could configure an AWS Lambda rule that ingests your device data and sends it to another AWS service. However, in applications that require the uploading of large files such as videos or aggregated telemetry to the AWS Cloud, you may want your devices to be able to authenticate and send data directly to the AWS service of your choice. The credentials provider enables you to do that.
Outline of the steps to retrieve and use security token
Perform the following steps as part of this solution:
Create an AWS IoT thing: Start by creating a thing that corresponds to your home thermostat in the AWS IoT thing registry database. This allows you to authenticate the request as a thing and use thing attributes as policy variables in AWS IoT and IAM policies.
Register a certificate: Create and register a certificate with AWS IoT, and attach it to the thing for successful device authentication.
Create and configure an IAM role: Create an IAM role to be assumed by the service on behalf of your device. I illustrate how to configure a trust policy and an access policy so that AWS IoT has permission to assume the role, and the token has necessary permission to make requests to DynamoDB.
Create a role alias: Create a role alias in AWS IoT. A role alias is an alternate data model pointing to an IAM role. The credentials provider request must include a role alias name to indicate which IAM role to assume for obtaining a security token from AWS STS. You may update the role alias on the server to point to a different IAM role and thus make your device obtain a security token with different permissions.
Attach a policy: Create an authorization policy with AWS IoT and attach it to the certificate to control which device can assume which role aliases.
Request a security token: Make an HTTPS request to the credentials provider and retrieve a security token and use it to sign a DynamoDB request with Signature Version 4.
Use the security token to sign a request: Use the retrieved token to sign a request to DynamoDB and successfully write a temperature and humidity record from your home thermostat in a specific table. Thus, starting with an X.509 certificate on your home thermostat, you can successfully upload your thermostat record to DynamoDB and use it for further analysis. Before the availability of the credentials provider, you could not do this.
Deploy the solution
1. Create an AWS IoT thing
Register your home thermostat in the AWS IoT thing registry database by creating a thing type and a thing. You can use the AWS CLI with the following command to create a thing type. The thing type allows you to store description and configuration information that is common to a set of things.
Now, you need to have a Certificate Authority (CA) certificate, sign a device certificate using the CA certificate, and register both certificates with AWS IoT before your device can authenticate to AWS IoT. If you do not already have a CA certificate, you can use OpenSSL to create a CA certificate, as described in Use Your Own Certificate. To register your CA certificate with AWS IoT, follow the steps on Registering Your CA Certificate.
You then have to create a device certificate signed by the CA certificate and register it with AWS IoT, which you can do by following the steps on Creating a Device Certificate Using Your CA Certificate. Save the certificate and the corresponding key pair; you will use them when you request a security token later. Also, remember the password you provide when you create the certificate.
Run the following command in the AWS CLI to attach the device certificate to your thing so that you can use thing attributes in policy variables.
If the attach-thing-principal command succeeds, the output is empty.
3. Configure an IAM role
Next, configure an IAM role in your AWS account that will be assumed by the credentials provider on behalf of your device. You are required to associate two policies with the role: a trust policy that controls who can assume the role, and an access policy that controls which actions can be performed on which resources by assuming the role.
The following trust policy grants the credentials provider permission to assume the role. Put it in a text document and save the document with the name, trustpolicyforiot.json.
The following access policy allows DynamoDB operations on the table that has the same name as the thing name that you created in Step 1, MyHomeThermostat, by using credentials-iot:ThingName as a policy variable. I explain after Step 5 about using thing attributes as policy variables. Put the following policy in a text document and save the document with the name, accesspolicyfordynamodb.json.
Finally, run the following command in the AWS CLI to attach the access policy to your role.
aws iam attach-role-policy --role-name dynamodb-access-role --policy-arn arn:aws:iam::<your_aws_account_id>:policy/accesspolicyfordynamodb
If the attach-role-policy command succeeds, the output is empty.
Configure the PassRole permissions
The IAM role that you have created must be passed to AWS IoT to create a role alias, as described in Step 4. The user who performs the operation requires iam:PassRole permission to authorize this action. You also should add permission for the iam:GetRole action to allow the user to retrieve information about the specified role. Create the following policy to grant iam:PassRole and iam:GetRole permissions. Name this policy, passrolepermission.json.
Now, run the following command to attach the policy to the user.
aws iam attach-user-policy --policy-arn arn:aws:iam::<your_aws_account_id>:policy/passrolepermission --user-name <user_name>
If the attach-user-policy command succeeds, the output is empty.
4. Create a role alias
Now that you have configured the IAM role, you will create a role alias with AWS IoT. You must provide the following pieces of information when creating a role alias:
RoleAlias: This is the primary key of the role alias data model and hence a mandatory attribute. It is a string; the minimum length is 1 character, and the maximum length is 128 characters.
RoleArn: This is the Amazon Resource Name (ARN) of the IAM role you have created. This is also a mandatory attribute.
CredentialDurationSeconds: This is an optional attribute specifying the validity (in seconds) of the security token. The minimum value is 900 seconds (15 minutes), and the maximum value is 3,600 seconds (60 minutes); the default value is 3,600 seconds, if not specified.
Run the following command in the AWS CLI to create a role alias. Use the credentials of the user to whom you have given the iam:PassRole permission.
You created and registered a certificate with AWS IoT earlier for successful authentication of your device. Now, you need to create and attach a policy to the certificate to authorize the request for the security token.
Let’s say you want to allow a thing to get credentials for the role alias, Thermostat-dynamodb-access-role-alias, with thing owner Alice, thing type thermostat, and the thing attached to a principal. The following policy, with thing attributes as policy variables, achieves these requirements. After this step, I explain more about using thing attributes as policy variables. Put the policy in a text document, and save it with the name, alicethermostatpolicy.json.
If the attach-policy command succeeds, the output is empty.
You have completed all the necessary steps to request an AWS security token from the credentials provider!
Using thing attributes as policy variables
Before I show how to request a security token, I want to explain more about how to use thing attributes as policy variables and the advantage of using them. As a prerequisite, a device must provide a thing name in the credentials provider request.
Thing substitution variables in AWS IoT policies
AWS IoT Simplified Permission Management allows you to associate a connection with a specific thing, and allow the thing name, thing type, and other thing attributes to be available as substitution variables in AWS IoT policies. You can write a generic AWS IoT policy as in alicethermostatpolicy.json in Step 5, attach it to multiple certificates, and authorize the connection as a thing. For example, you could attach alicethermostatpolicy.json to certificates corresponding to each of the thermostats you have that you want to assume the role alias, Thermostat-dynamodb-access-role-alias, and allow operations only on the table with the name that matches the thing name. For more information, see the full list of thing policy variables.
Thing substitution variables in IAM policies
You also can use the following three substitution variables in the IAM role’s access policy (I used credentials-iot:ThingName in accesspolicyfordynamodb.json in Step 3):
credentials-iot:ThingName
credentials-iot:ThingTypeName
credentials-iot:AwsCertificateId
When the device provides the thing name in the request, the credentials provider fetches these three variables from the database and adds them as context variables to the security token. When the device uses the token to access DynamoDB, the variables in the role’s access policy are replaced with the corresponding values in the security token. Note that you also can use credentials-iot:AwsCertificateId as a policy variable; AWS IoT returns certificateId during registration.
6. Request a security token
Make an HTTPS request to the credentials provider to fetch a security token. You have to supply the following information:
Certificate and key pair: Because this is an HTTP request over TLS mutual authentication, you have to provide the certificate and the corresponding key pair to your client while making the request. Use the same certificate and key pair that you used during certificate registration with AWS IoT.
RoleAlias: Provide the role alias (in this example, Thermostat-dynamodb-access-role-alias) to be assumed in the request.
ThingName: Provide the thing name that you created earlier in the AWS IoT thing registry database. This is passed as a header with the name, x-amzn-iot-thingname. Note that the thing name is mandatory only if you have thing attributes as policy variables in AWS IoT or IAM policies.
Run the following command in the AWS CLI to obtain your AWS account-specific endpoint for the credentials provider. See the DescribeEndpoint API documentation for further details.
Note that if you are on Mac OS X, you need to export your certificate to a .pfx or .p12 file before you can pass it in the https request. Use OpenSSL with the following command to convert the device certificate from .pem to .pfx format. Remember the password because you will need it subsequently in a curl command.
Now, make an HTTPS request to the credentials provider to fetch a security token. You may use your preferred HTTP client for the request. I use curl in the following examples.
This command returns a security token object that has an accessKeyId, a secretAccessKey, a sessionToken, and an expiration. The following is sample output of the curl command.
Create a DynamoDB table called MyHomeThermostat in your AWS account. You will have to choose the hash (partition key) and the range (sort key) while creating the table to uniquely identify a record. Make the hash the serial_number of the thermostat and the range the timestamp of the record. Create a text file with the following JSON to put a temperature and humidity record in the table. Name the file, item.json.
You can use the accessKeyId, secretAccessKey, and sessionToken retrieved from the output of the curl command to sign a request that writes the temperature and humidity record to the DynamoDB table. Use the following commands to accomplish this.
In this blog post, I demonstrated how to retrieve a security token by using an X.509 certificate and then writing an item to a DynamoDB table by using the security token. Similarly, you could run applications on surveillance cameras or sensor devices that exchange the X.509 certificate for an AWS security token and use the token to upload video streams to Amazon Kinesis or telemetry data to Amazon CloudWatch.
If you have comments about this blog post, submit them in the “Comments” section below. If you have questions about or issues implementing this solution, start a new thread on the AWS IoT forum.
Discover new sounds and explore the role of machine learning in music production and sound research with the NSynth Super, an ongoing project from Google’s Magenta research team that you can build at home.
Part of the ongoing Magenta research project within Google, NSynth Super explores the ways in which machine learning tools help artists and musicians be creative.
“Technology has always played a role in creating new types of sounds that inspire musicians — from the sounds of distortion to the electronic sounds of synths,” explains the team behind the NSynth Super. “Today, advances in machine learning and neural networks have opened up new possibilities for sound generation.”
Using TensorFlow, the Magenta team builds tools and interfaces that let artists and musicians use machine learning in their work. The NSynth Super AI algorithm uses deep neural networking to investigate the character of sounds. It then builds new sounds based on these characteristics instead of simply mixing sounds together.
Using an autoencoder, it extracts 16 defining temporal features from each input. These features are then interpolated linearly to create new embeddings (mathematical representations of each sound). These new embeddings are then decoded into new sounds, which have the acoustic qualities of both inputs.
The team publishes all hardware designs and software that are part of their ongoing research under open-source licences, allowing you to build your own synth.
Build your own NSynth Super
Using these open-source tools, Andrew Black has produced his own NSynth Super, demoed in the video above. Andrew’s list of build materials includes a Raspberry Pi 3, potentiometers, rotary encoders, and the Adafruit 1.3″ OLED display. Magenta also provides Gerber files for you to fabricate your own PCB.
Once fabricated, the PCB includes a table of contents for adding components.
The build isn’t easy — it requires soldering skills or access to someone who can assemble PCBs. Take a look at Andrew’s blog post and the official NSynth GitHub repo to see whether you’re up to the challenge.
Music and Raspberry Pi
The Raspberry Pi has been widely used for music production and music builds. Be it retrofitting a boombox, distributing music atop Table Mountain, or coding tracks with Sonic Pi, the Pi offers endless opportunities for musicians and music lovers to expand their repertoire of builds and instruments.
If you’d like to try more music-based projects using the Raspberry Pi, you can check out our free resources. And if you’ve used a Raspberry Pi in your own musical project, please share it with us in the comments or via our social network accounts.
American Public Television was like many organizations that have been around for a while. They were entrenched using an older technology — in their case, tape storage and distribution — that once met their needs but was limiting their productivity and preventing them from effectively collaborating with their many media partners. APT’s VP of Technology knew that he needed to move into the future and embrace cloud storage to keep APT ahead of the game.
Since 1961, American Public Television (APT) has been a leading distributor of groundbreaking, high-quality, top-rated programming to the nation’s public television stations. Gerry Field is the Vice President of Technology at APT and is responsible for delivering their extensive program catalog to 350+ public television stations nationwide.
In the time since Gerry joined APT in 2007, the industry has been in digital overdrive. During that time APT has continued to acquire and distribute the best in public television programming to their technically diverse subscribers.
This created two challenges for Gerry. First, new technology and format proliferation were driving dramatic increases in digital storage. Second, many of APT’s subscribers struggled to keep up with the rapidly changing industry. While some subscribers had state-of-the-art satellite systems to receive programming, others had to wait for the post office to drop off programs recorded on tape weeks earlier. With no slowdown on the horizon of innovation in the industry, Gerry knew that his storage and distribution systems would reach a crossroads in no time at all.
Living the tape paradigm
The digital media industry is only a few years removed from its film, and later videotape, roots. Tape was the input and the output of the industry for many years. As a consequence, the tools and workflows used by the industry were built and designed to work with tape. Over time, the “file” slowly replaced the tape as the object to be captured, edited, stored and distributed. Trouble was, many of the systems and more importantly workflows were based on processing tape, and these have proven to be hard to change.
At APT, Gerry realized the limits of the tape paradigm and began looking for technologies and solutions that enabled workflows based on file and object based storage and distribution.
Thinking file based storage and distribution
For data (digital media) storage, APT, like everyone else, started by installing onsite storage servers. As the amount of digital data grew, more storage was added. In addition, APT was expanding its distribution footprint by creating or partnering with distribution channels such as CreateTV and APT Worldwide. This dramatically increased the number of programming formats and the amount of data that had to be stored. As a consequence, updating, maintaining, and managing the APT storage systems was becoming a major challenge and a major resource hog.
Knowing that his in-house storage system was only going to cost more time and money, Gerry decided it was time to look at cloud storage. But that wasn’t the only reason he looked at the cloud. While most people consider cloud storage as just a place to back up and archive files, Gerry was envisioning how the ubiquity of the cloud could help solve his distribution challenges. The trouble was the price of cloud storage from vendors like Amazon S3 and Microsoft Azure was a non-starter, especially for a non-profit. Then Gerry came across Backblaze. B2 Cloud Storage service met all of his performance requirements, and at $0.005/GB/month for storage and $0.01/GB for downloads it was nearly 75% less than S3 or Azure.
Gerry did the math and found that he could economically incorporate B2 Cloud Storage into his IT portfolio, using it for both program submission and for active storage and archiving of the APT programs. In addition, B2 now gives him the foundation necessary to receive and distribute programming content over the Internet. This is especially useful for organizations that can’t conveniently access satellite distribution systems. Not to mention downloading from the cloud is much faster than sending a tape through the mail.
Adding B2 Cloud Storage to their infrastructure has helped American Public Television address two key challenges. First, they now have “unlimited” storage in the cloud without having to add any hardware. In addition, with B2, they only pay for the storage they use. That means they don’t have to buy storage upfront trying to match the maximum amount of storage they’ll ever need. Second, by using B2 as a distribution source for their programming APT subscribers, especially the smaller and remote ones, can get content faster and more reliably without having to perform costly upgrades to their infrastructure.
The road ahead
As APT gets used to their file based infrastructure and workflow, there are a number of cost saving and income generating ideas they are pondering which are now worth considering. Here are a few:
Program Submissions — New content can be uploaded from anywhere using a web browser, an Internet connection, and a login. For example, a producer in Cambodia can upload their film to B2. From there the film is downloaded to an in-house system where it is processed and transcoded using compute. The finished film is added to the APT catalog and added to B2. Once there, the program is instantly available for subscribers to order and download.
“The affordability and performance of Backblaze B2 is what allowed us to make the B2 cloud part of the APT data storage and distribution strategy into the future.” — Gerry Field
Easier Previews — At any time, work in process or finished programs can be made available for download from the B2 cloud. One place this could be useful is where a subscriber needs to review a program to comply with local policies and practices before airing. In the old system, each “one-off” was a time consuming manual process.
Instant Subscriptions — There are many organizations such as schools and businesses that want to use just one episode of a desired show. With an e-commerce based website, current or even archived programming kept in B2 could be available to download or stream for a minimal charge.
At APT there were multiple technologies needed to make their file-based infrastructure work, but as Gerry notes, having an affordable, trustworthy, cloud storage service like B2 is one of the critical building blocks needed to make everything work together.
In this post, I’ll show you how to create a sample dataset for Amazon Macie, and how you can use Amazon Macie to implement data-centric compliance and security analytics in your Amazon S3 environment. I’ll also dive into the different kinds of credentials, document types, and PII detections supported by Macie. First, I’ll walk through creating a “getting started” sample set of artificial, generated data that you can use to test Macie capabilities and start building your own policies and alerts.
Create a realistic data sample set in S3
I’ll use amazon-macie-activity-generator, which we call “AMG” for short, a sample application developed by AWS that generates realistic content and accesses your test account to create the data. AMG uses AWS CloudFormation, AWS Lambda, and Python’s excellent Faker library to create a data set with artificial—but realistic—data classifications and access patterns to help test some of the features and capabilities of Macie. AMG is released under Amazon Software License 1.0, and we’ll accept pull requests on our GitHub repository and monitor any issues that are opened so we can try to fix bugs and consider new feature requests.
The following diagram shows a high level architecture overview of the components that will be created in your AWS account for AMG. For additional detail about these components and their relationships, review the CloudFormation setup script.
Depending on the data types specified in your JSON configuration template (details below), AMG will periodically generate artificial documents for the specified S3 target with a PutObject action. By default, the CloudFormation stack uses a configuration file that instructs AMG to create a new, private S3 bucket that can only be accessed by authorized AWS users/roles in the same account as the bucket. All the S3 objects with fake data in this bucket have a private ACL and inherit the bucket’s access control configuration. All generated objects feature the header in the example below, and AMG supports all fake data providers offered by https://faker.readthedocs.io/en/latest/index.html, as well as a few of AMG‘s own custom fake data providers requested by our customers: aws_creds, slack_creds, github_creds, facebook_creds, linux_shadow, rsa, linux_passwd, dsa, ec, pgp, cert, itin, swift_code, and cve.
# Sample Report - No identification of actual persons or places is # intended or should be inferred
74323 Julie Field Lake Joshuamouth, OR 30055-3905 1-196-191-4438x974 53001 Paul Union New John, HI 94740 Mastercard Amanda Wells 5135725008183484 09/26 CVV: 550
354-70-6172 242 George Plaza East Lawrencefurt, VA 37287-7620 GB73WAUS0628038988364 587 Silva Village Pearsonburgh, NM 11616-7231 LDNM1948227117807 American Express Brett Garza 347965534580275 05/20 CID: 4758
599.335.2742 JCB 15 digit Michael Arias 210069190253121 03/27 CVC: 861
Create your amazon-macie-activity-generator CloudFormation stack
You can deploy AMG in your AWS account by using either these methods:
Log in to the AWS Console in a region supported by Amazon Macie, which currently includes US East (N. Virginia), US West (Oregon).
Select the One-click CloudFormation launch stack, or launch CloudFormation using the template above.
Read our terms, select the Acknowledgement box, and then select Create.
Creating the data takes a few minutes, and you can periodically refresh CloudWatch to track progress.
Add the new sample data to Macie
Now, I’ll log into the Macie console and add the newly created sample data buckets for analysis by Macie.
Note: If you don’t explicitly specify a bucket for S3 targets in CloudFormation, AMG will use the S3 bucket that’s created by default for the stack, which will be printed out in the CloudFormation stack’s output.
To add buckets for data classification, follow these steps:
Log in to Amazon Macie.
Select Integrations, and then select Services.
Select your account, and then select Details from the Amazon S3 card.
Select your newly created buckets for Full classification, including existing data.
For additional details on configuring Macie, refer to our getting started documentation.
Macie classifies all historical and newly created data in the buckets created by AMG, and the data will be available in the Macie console as it’s classified. Typically, you can expect the data in the sample set to be classified within 60 minutes of the time it was selected for analysis.
Classifying objects with Macie
To see the objects in your test sample set, in Macie, open the Research tab, and then select the S3 Objects index. We’ll use the regular expression search capability in Macie to find any objects written to buckets that start with “amazon-macie-activity-generator-defaults3bucket”. To search for this, type the following text into the Macie search box and select the magnifying glass icon.
From here, you can see a nice breakdown of the kinds of objects that have been classified by Macie, as well as the object-specific details. Create an advanced search using Lucene Query Syntax, and save it as an alert to be matched against any newly created data.
Analyzing accesses to your test data
In addition to classifying data, Macie tracks all control plane and data plane accesses to your content using CloudTrail. To see accesses to your generated environment (created periodically by AMG to mimic user activity), on the Macie navigation bar, select Research, select the CloudTrail data index, and then use the following search to identify our generated role activity:
From this search, you can dive into the user activity (IAM users, assumed roles, federated users, and so on), which is summarized in 5-minute aggregations (user sessions). For example, in the screen shot you can see that one of our AMG-generated users listed objects one time (ListObjects) and wrote 56 objects to S3 (PutObject) during a 5-minute period.
Macie alerts
Macie features both predictive (machine learning-based) and basic (rule-based) alerts, including alerts on unencrypted credentials being uploaded to S3 (because this activity might not follow compliance best practices), risky activity such as data exfiltration, and user-defined alerts that are based on saved searches. To see alerts that have been generated based on AMG‘s activity, on the Macie navigation bar, select Alerts.
AMG will continue to run, periodically uploading content to the specified S3 buckets. To stop AMG, delete the AMG CloudFormation stack and associated resources here.
What are the costs?
Macie has a free tier enabling up to 1GB of content to be analyzed per month at no cost to you. By default, AMG will write approximately 10MB of objects to Amazon S3 per day, and you will incur charges for data classification after crossing the 1GB monthly free tier. Running continuously, AMG will generate about 310MB of content per month (10MB/day x 31 days), which will stay below the free tier. Any data use above 1GB will be billed at the Macie public price of $5/GB. For more detail, see the Macie pricing documentation.
If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the Amazon Macie forum or contact AWS Support.
Backblaze’s rapid ingest service, Fireball, graduates out of public beta. Our device holds 70 terabytes of customer data and is perfect for migrating large data sets to B2 Cloud Storage.
At Backblaze, we like to put ourselves in the customer’s shoes. Specifically, we ask questions like “how can we make cloud storage more useful?” There is a long list of things we can do to help — over the last few weeks, we’ve addressed some of them when we lowered the cost of downloading data to $0.01 / GB. Today, we are pleased to publicly release our rapid ingest service, Fireball.
What is the Backblaze B2 Fireball?
The Fireball is a hardware device, specifically a NAS device. Any Backblaze B2 customer can order it from inside their account. The Fireball device can hold up to 70 terabytes of data. Upon ordering, it ships from a Backblaze data center to you. When you receive it, you can transfer your data onto the Fireball using your internal network. Once your data transfer is complete, you send it back to a Backblaze data center. Finally, inside our secure data center, your data is uploaded from the Fireball to your account. Your data remains encrypted throughout the process. Step by step instructions can be found here.
Why Use the Fireball?
“We would not have been able to get this project off the ground without the B2 Fireball.” — James Cole, KLRU (Austin City Limits)
For most customers, transferring large quantities of data isn’t always simple. The need can arise as you migrate off of legacy systems (e.g. replacing LTO) or simply on a project basis (e.g. transferring video shot in the field to the cloud). An common approach is to upload your data via the internet to the cloud storage vendor of your choosing. While cloud storage vendors don’t charge for uploads, you have to pay your network provider for bandwidth. That’s assuming you are in a place where the bandwidth can be secured.
Your data is stored in megabytes (“MB”) but your bandwidth is measured in megabits per second (“Mbps”). The difference? An 80 Mbps upload connection will transfer no more than 10 MB per second. That means, in your best case scenario, you might be able to upload 50 terabytes in 50 days, assuming you use nearly all of your upload bandwidth for the upload.
If you’re looking to migrate old backups from LTO or even a large project, a 3 month lag time is not operationally viable. That’s why multiple cloud storage providers have introduced rapid ingest devices.
How It Compares: Backblaze B2 Fireball vs AWS Snowball vs GCS Transfer Appliance
“We found the B2 Fireball much simpler and easier to use than Amazon’s Snowball. WunderVu had been looking for a cloud solution for security and simplicity, and B2 hit every check box.” — Aaron Rhodes, Executive Producer, WunderVu
Every vendor that offers a rapid ingest service only lets you upload to that vendor’s cloud. For example, you can’t use an Amazon Snowball to upload to Google Cloud Storage. This means that when considering a rapid ingest service, you are also making a decision on what cloud storage vendor to use. As such, one should consider not only the cost of the rapid ingest service, but also how much that vendor is going to charge you to store and download your data.
Device Capacity
Service Fee
Shipping
Cloud Storage $/GB/Month
Download $/GB
Backblaze B2
70 TB
$550 (30 day rental)
$75
$0.005
$0.01
Amazon S3
50 TB
$200 (10 day rental)
$? *
$0.021 +320%
$0.05+ +500%
Google Cloud
100 TB
$300 (10 day rental)
$500
$0.020 +300%
$0.08+ +800%
*AWS does not estimate shipping fees at the time of the Snowball order.
To make the comparison easier, let’s create a hypothetical case and compare the costs incurred in the first year. Assume you have 100 TB as an initial upload. But that’s just the initial upload. Over the course of the year, let’s consider a usage pattern where every month you add 5 TB, delete 2 TBs, and download 10 TBs.
Transfer Cost
Cloud Storage Fees
Total Transfer + Cloud Storage Fees
Backblaze B2
$1,250 (2 Fireballs)
$9,570
$10,820
Amazon S3
$400 (2 Snowballs)
$36,114
$36,514 +337%
Google Cloud
$800 (1 transit)
$39,684
$40,484 +374%
Just looking at the first year, Amazon is 337% more expensive than Backblaze and Google is 374% more expensive than Backblaze.
Put simply, Backblaze offers the lowest cost, high performance cloud storage on the planet. During our public beta of the Fireball program we’ve had extremely positive feedback around how the Fireball enables customers to get their projects started in a time efficient and cost effective way. We hope you’ll give it a try!
We love Mugsy, the Raspberry Pi coffee robot that has smashed its crowdfunding goal within days! Our latest YouTube video shows our catch-up with Mugsy and its creator Matthew Oswald at Maker Faire New York last year.
Labelled ‘the world’s first hackable, customisable, dead simple, robotic coffee maker’, Mugsy allows you to take control of every aspect of the coffee-making process: from grind size and water temperature, to brew and bloom time. Feeling lazy instead? Read in your beans’ barcode via an onboard scanner, and it will automatically use the best settings for your brew.
Looking to start your day with your favourite coffee straight out of bed? Send the robot a text, email, or tweet, and it will notify you when your coffee is ready!
Learning through product development
“Initially, I used [Mugsy] as a way to teach myself hardware design,” explained Matthew at his Editor’s Choice–winning Maker Faire stand. “I really wanted to hold something tangible in my hands. By using the Raspberry Pi and just being curious, anytime I wanted to use a new technology, I would try to pull back [and ask] ‘How can I integrate this into Mugsy?’”
By exploring his passions and using Mugsy as his guinea pig, Matthew created a project that not only solves a problem — how to make amazing coffee at home — but also brings him one step closer to ‘making things’ for a living. “I used to dream about this stuff when I was a kid, and I used to say ‘I’m never going to be able to do something like that.’” he admitted. But now, with open-source devices like the Raspberry Pi so readily available, he “can see the end of the road”: making his passion his livelihood.
Back Mugsy
With only a few days left on the Kickstarter campaign, Mugsy has reached its goal and then some. It’s available for backing from $150 if you provide your own Raspberry Pi 3, or from $175 with a Pi included — check it out today!
Want to work at a company that helps customers in 156 countries around the world protect the memories they hold dear? A company that stores over 500 petabytes of customers’ photos, music, documents and work files in a purpose-built cloud storage system?
Well here’s your chance. Backblaze is looking for a Vault Storage Engineer!
Company Description:
Founded in 2007, Backblaze started with a mission to make backup software elegant and provide complete peace of mind. Over the course of almost a decade, we have become a pioneer in robust, scalable low cost cloud backup. Recently, we launched B2 — robust and reliable object storage at just $0.005/gb/mo. Part of our differentiation is being able to offer the lowest price of any of the big players while still being profitable.
We’ve managed to nurture a team oriented culture with amazingly low turnover. We value our people and their families. Don’t forget to check out our “About Us” page to learn more about the people and some of our perks.
We have built a profitable, high growth business. While we love our investors, we have maintained control over the business. That means our corporate goals are simple – grow sustainably and profitably.
Some Backblaze Perks:
Competitive healthcare plans
Competitive compensation and 401k
All employees receive Option grants
Unlimited vacation days
Strong coffee
Fully stocked Micro kitchen
Catered breakfast and lunches
Awesome people who work on awesome projects
New Parent Childcare bonus
Normal work hours
Get to bring your pets into the office
San Mateo Office – located near Caltrain and Highways 101 & 280.
Want to know what you’ll be doing?
You will work on the core of the Backblaze: the vault cloud storage system (https://www.backblaze.com/blog/vault-cloud-storage-architecture/). The system accepts files uploaded from customers, stores them durably by distributing them across the data center, automatically handles drive failures, rebuilds data when drives are replaced, and maintains high availability for customers to download their files. There are significant enhancements in the works, and you’ll be a part of making them happen.
Must have a strong background in:
Computer Science
Multi-threaded programming
Distributed Systems
Java
Math (such as matrix algebra and statistics)
Building reliable, testable systems
Bonus points for:
Java
JavaScript
Python
Cassandra
SQL
Looking for an attitude of:
Passionate about building reliable clean interfaces and systems.
Likes to work closely with other engineers, support, and sales to help customers.
Customer Focused (!!) — always focus on the customer’s point of view and how to solve their problem!
Required for all Backblaze Employees:
Good attitude and willingness to do whatever it takes to get the job done
Strong desire to work for a small fast-paced company
Desire to learn and adapt to rapidly changing technologies and work environment
Rigorous adherence to best practices
Relentless attention to detail
Excellent interpersonal skills and good oral/written communication
Excellent troubleshooting and problem solving skills
This position is located in San Mateo, California but will also consider remote work as long as you’re no more than three time zones away and can come to San Mateo now and then.
Backblaze is an Equal Opportunity Employer.
Contact Us: If this sounds like you, follow these steps:
Germany-based Andreas Rottach’s multi-purpose LED table is an impressive build within a gorgeous-looking body. Play games, view (heavily pixelated) images, and become hypnotised by flashy lights, once you’ve built your own using his newly released tutorial.
This is a short presentation of my LED-Matrix Table. The table is controlled by a raspberry pi computer that executes a control engine, written in c++. It supports input from keyboards or custom made game controllers. A full list of all features as well as the source code is available on GitHub (https://github.com/rottaca/LEDTableEngine).
Much excitement
Andreas uploaded a video of his LED Matrix Table to YouTube back in February, with the promise of publishing a complete write-up within the coming weeks. And so the members of Pi Towers sat, eagerly waiting and watching. Now the write-up has arrived, to our cheers of acclaim for this beautful, shiny, flashy, LED-based wonderment.
Andreas created the table’s impressive light matrix using a strip of 300 LEDs, chained together and connected to the Raspberry Pi via an LED controller.
The LEDs are set out in zigzags
For the code, he used several open-source tools, such as SDL for image and audio support, and CMake for building the project software.
Anyone planning to recreate Andreas’ table can compile its engine by downloading the project repository from GitHub. Again, find full instructions for this on his GitHub.
Features
The table boasts multiple cool features, including games and visualisation tools. Using the controllers, you can play simplified versions of Flappy Bird and Minesweeper, or go on a nostalgia trip with Tetris, Pong, and Snake.
There’s also a version of Conway’s Game of Life. Andreas explains: “The lifespan of each cell is color-coded. If the game field gets static, the animation is automatically reset to a new random cell population.”
The table can also display downsampled Bitmap images, or show clear static images such as a chess board, atop of which you can place physical game pieces.
Find all the 3D-printable aspects of the LED table on Thingiverse here and here, and the full GitHub tutorial and repository here. If you build your own, or have already dabbled in LED tables and displays, be sure to share your project with us, either in the comments below or via our social media accounts. What other functions would you integrate into this awesome build?
Amazon S3 provides comprehensive security and compliance capabilities that meet even the most stringent regulatory requirements. It gives you flexibility in the way you manage data for cost optimization, access control, and compliance. However, because the service is flexible, a user could accidentally configure buckets in a manner that is not secure. For example, let’s say you uploaded files to an Amazon S3 bucket with public read permissions, even though you intended only to share this file with a colleague or a partner. Although this might have accomplished your task to share the file internally, the file is now available to anyone on the internet, even without authentication.
In this blog post, we show you how to prevent your Amazon S3 buckets and objects from allowing public access. We discuss how to secure data in Amazon S3 with a defense-in-depth approach, where multiple security controls are put in place to help prevent data leakage. This approach helps prevent you from allowing public access to confidential information, such as personally identifiable information (PII) or protected health information (PHI).
Preventing your Amazon S3 buckets and objects from allowing public access
Every call to an Amazon S3 service becomes a REST API request. When your request is transformed via a REST call, the permissions are converted into parameters included in the HTTP header or as URL parameters. The Amazon S3 bucket policy allows or denies access to the Amazon S3 bucket or Amazon S3 objects based on policy statements, and then evaluates conditions based on those parameters. To learn more, see Using Bucket Policies and User Policies.
With this in mind, let’s say multiple AWS Identity and Access Management (IAM) users at Example Corp. have access to an Amazon S3 bucket and the objects in the bucket. Example Corp. wants to share the objects among its IAM users, while at the same time preventing the objects from being made available publicly.
To demonstrate how to do this, we start by creating an Amazon S3 bucket named examplebucket. After creating this bucket, we must apply the following bucket policy. This policy denies any uploaded object (PutObject) with the attribute x-amz-acl having the values public-read, public-read-write, or authenticated-read. This means authenticated users cannot upload objects to the bucket if the objects have public permissions.
“Deny any Amazon S3 request to PutObject or PutObjectAcl in the bucket examplebucket when the request includes one of the following access control lists (ACLs): public-read, public-read-write, or authenticated-read.”
Remember that IAM policies are evaluated not in a first-match-and-exit model. Instead, IAM evaluates first if there is an explicit Deny. If there is not, IAM continues to evaluate if you have an explicit Allow and then you have an implicit Deny.
The above policy creates an explicit Deny. Even when any authenticated user tries to upload (PutObject) an object with public read or write permissions, such as public-read or public-read-write or authenticated-read, the action will be denied. To understand how S3 Access Permissions work, you must understand what Access Control Lists (ACL) and Grants are. You can find the documentation here.
Now let’s continue our bucket policy explanation by examining the next statement.
This statement is very similar to the first statement, except that instead of checking the ACLs, we are checking specific user groups’ grants that represent the following groups:
AuthenticatedUsers group. Represented by http://acs.amazonaws.com/groups/global/AuthenticatedUsers, this group represents all AWS accounts. Access permissions to this group allow any AWS account to access the resource. However, all requests must be signed (authenticated).
AllUsers group. Represented by http://acs.amazonaws.com/groups/global/AllUsers, access permissions to this group allow anyone on the internet access to the resource. The requests can be signed (authenticated) or unsigned (anonymous). Unsigned requests omit the Authentication header in the request.
Now that you know how to deny object uploads with permissions that would make the object public, you just have two statement policies that prevent users from changing the bucket permissions (Denying s3:PutBucketACL from ACL and Denying s3:PutBucketACL from Grants).
Below is how we’re preventing users from changing the bucket permisssions.
As you can see above, the statement is very similar to the Object statements, except that now we use s3:PutBucketAcl instead of s3:PutObjectAcl, the Resource is just the bucket ARN, and the objects have the “/*” in the end of the ARN.
In this section, we showed how to prevent IAM users from accidently uploading Amazon S3 objects with public permissions to buckets. In the next section, we show you how to enforce multiple layers of security controls, such as encryption of data at rest and in transit while serving traffic from Amazon S3.
Securing data on Amazon S3 with defense-in-depth
Let’s say that Example Corp. wants to serve files securely from Amazon S3 to its users with the following requirements:
The data must be encrypted at rest and during transit.
The data must be accessible only by a limited set of public IP addresses.
All requests for data should be handled only by Amazon CloudFront (which is a content delivery network) instead of being directly available from an Amazon S3 URL. If you’re using an Amazon S3 bucket as the origin for a CloudFront distribution, you can grant public permission to read the objects in your bucket. This allows anyone to access your objects either through CloudFront or the Amazon S3 URL. CloudFront doesn’t expose Amazon S3 URLs, but your users still might have access to those URLs if your application serves any objects directly from Amazon S3, or if anyone gives out direct links to specific objects in Amazon S3.
A domain name is required to consume the content. Custom SSL certificate support lets you deliver content over HTTPS by using your own domain name and your own SSL certificate. This gives visitors to your website the security benefits of CloudFront over an SSL connection that uses your own domain name, in addition to lower latency and higher reliability.
To represent defense-in-depth visually, the following diagram contains several Amazon S3 objects (A) in a single Amazon S3 bucket (B). You can encrypt these objects on the server side or the client side. You also can configure the bucket policy such that objects are accessible only through CloudFront, which you can accomplish through an origin access identity (C). You then can configure CloudFront to deliver content only over HTTPS in addition to using your own domain name (D).
Defense-in-depth requirement 1: Data must be encrypted at rest and during transit
Let’s start with the objects themselves. Amazon S3 objects—files in this case—can range from zero bytes to multiple terabytes in size (see service limits for the latest information). Each Amazon S3 bucket includes a collection of objects, and the objects can be uploaded via the Amazon S3 console, AWS CLI, or AWS API.
If you choose to use server-side encryption, Amazon S3 encrypts your objects before saving them on disks in AWS data centers. To encrypt an object at the time of upload, you need to add the x-amz-server-side-encryption header to the request to tell Amazon S3 to encrypt the object using Amazon S3 managed keys (SSE-S3), AWS KMS managed keys (SSE-KMS), or customer-provided keys (SSE-C). There are two possible values for the x-amz-server-side-encryption header: AES256, which tells Amazon S3 to use Amazon S3 managed keys, and aws:kms, which tells Amazon S3 to use AWS KMS managed keys.
The following code example shows a Put request using SSE-S3.
PUT /example-object HTTP/1.1
Host: myBucket.s3.amazonaws.com
Date: Wed, 8 Jun 2016 17:50:00 GMT
Authorization: authorization string
Content-Type: text/plain
Content-Length: 11434
x-amz-meta-author: Janet
Expect: 100-continue
x-amz-server-side-encryption: AES256
[11434 bytes of object data]
If you choose to use client-side encryption, you can encrypt data on the client side and upload the encrypted data to Amazon S3. In this case, you manage the encryption process, the encryption keys, and related tools. You encrypt data on the client side by using AWS KMS managed keys or a customer-supplied, client-side master key.
Defense-in-depth requirement 2: Data must be accessible only by a limited set of public IP addresses
At the Amazon S3 bucket level, you can configure permissions through a bucket policy. For example, you can limit access to the objects in a bucket by IP address range or specific IP addresses. Alternatively, you can make the objects accessible only through HTTPS.
The following bucket policy allows access to Amazon S3 objects only through HTTPS (the policy was generated with the AWS Policy Generator). Here the bucket policy explicitly denies ("Effect": "Deny") all read access ("Action": "s3:GetObject") from anybody who browses ("Principal": "*") to Amazon S3 objects within an Amazon S3 bucket if they are not accessed through HTTPS ("aws:SecureTransport": "false").
Defense-in-depth requirement 3: Data must not be publicly accessible directly from an Amazon S3 URL
Next, configure Amazon CloudFront to serve traffic from within the bucket. The use of CloudFront serves several purposes:
CloudFront is a content delivery network that acts as a cache to serve static files quickly to clients.
Depending on the number of requests, the cost of delivery is less than if objects were served directly via Amazon S3.
Objects served through CloudFront can be limited to specific countries.
Access to these Amazon S3 objects is available only through CloudFront. We do this by creating an origin access identity (OAI) for CloudFront and granting access to objects in the respective Amazon S3 bucket only to that OAI. As a result, access to Amazon S3 objects from the internet is possible only through CloudFront; all other means of accessing the objects—such as through an Amazon S3 URL—are denied. CloudFront acts not only as a content distribution network, but also as a host that denies access based on geographic restrictions. You apply these restrictions by updating your CloudFront web distribution and adding a whitelist that contains only a specific country’s name (let’s say Liechtenstein). Alternatively, you could add a blacklist that contains every country except that country. Learn more about how to use CloudFront geographic restriction to whitelist or blacklist a country to restrict or allow users in specific locations from accessing web content in the AWS Support Knowledge Center.
Defense-in-depth requirement 4: A domain name is required to consume the content
To serve content from CloudFront, you must use a domain name in the URLs for objects on your webpages or in your web application. The domain name can be either of the following:
The domain name that CloudFront automatically assigns when you create a distribution, such as d111111abcdef8.cloudfront.net
Your own domain name, such as example.com
For example, you might use one of the following URLs to return the file image.jpg:
You use the same URL format whether you store the content in Amazon S3 buckets or at a custom origin, like one of your own web servers.
Instead of using the default domain name that CloudFront assigns for you when you create a distribution, you can add an alternate domain name that’s easier to work with, like example.com. By setting up your own domain name with CloudFront, you can use a URL like this for objects in your distribution: http://example.com/images/image.jpg.
Let’s say that you already have a domain name hosted on Amazon Route 53. You would like to serve traffic from the domain name, request an SSL certificate, and add this to your CloudFront web distribution. The SSL offloading occurs in CloudFront by serving traffic securely from each CloudFront location. You also can configure CloudFront to deliver your content over HTTPS by using your custom domain name and your own SSL certificate. Serving web content through CloudFront reduces response from the origin as requests are redirected to the nearest edge location. This results in faster download times than if the visitor had requested the content from a data center that is located farther away.
Summary
In this post, we demonstrated how you can apply policies to Amazon S3 buckets so that only users with appropriate permissions are allowed to access the buckets. Anonymous users (with public-read/public-read-write permissions) and authenticated users without the appropriate permissions are prevented from accessing the buckets.
We also examined how to secure access to objects in Amazon S3 buckets. The objects in Amazon S3 buckets can be encrypted at rest and during transit. Doing so helps provide end-to-end security from the source (in this case, Amazon S3) to your users.
If you have feedback about this blog post, submit comments in the “Comments” section below. If you have questions about this blog post, start a new thread on the Amazon S3 forum or contact AWS Support.
Some things don’t get the credit they deserve. For one of our engineers, Billy, the Locate My Computer feature is near and dear to his heart. It took him a while to build it, and it requires some regular updates, even after all these years. Billy loves the Locate My Computer feature, but really loves knowing how it’s helped customers over the years. One recent story made us decide to write a bit of a greatest hits post as an ode to one of our favorite features — Locate My Computer.
What is it?
Locate My Computer, as you’ll read in the stories below, came about because some of our users had their computers stolen and were trying to find a way to retrieve their devices. They realized that while some of their programs and services like Find My Mac were wiped, in some cases, Backblaze was still running in the background. That created the ability to use our software to figure out where the computer was contacting us from. After manually helping some of the individuals that wrote in, we decided to build it in as a feature. Little did we know the incredible stories it would lead to. We’ll get into that, but first, a little background on why the whole thing came about.
Identifying the Customer Need
“My friend’s laptop was stolen. He tracked the thief via @Backblaze for weeks & finally identified him on Facebook & Twitter. Digital 007.”
Mat — In December 2010, we saw a tweet from @DigitalRoyalty which read: “My friend’s laptop was stolen. He tracked the thief via @Backblaze for weeks & finally identified him on Facebook & Twitter. Digital 007.” Our CEO was manning Twitter at the time and reached out for the whole story. It turns out that Mat Miller had his laptop stolen, and while he was creating some restores a few days later, he noticed a new user was created on his computer and was backing up data. He restored some of those files, saw some information that could help identify the thief, and filed a police report. Read the whole story: Digital 007 — Outwitting The Thief.
Mark — Following Mat Miller’s story we heard from Mark Bao, an 18-year old entrepreneur and student at Bentley University who had his laptop stolen. The laptop was stolen out of Mark’s dorm room and the thief started using it in a variety of ways, including audition practice for Dancing with the Stars. Once Mark logged in to Backblaze and saw that there were new files being uploaded, including a dance practice video, he was able to reach out to campus police and got his laptop back. You can read more about the story on: 18 Year Old Catches Thief Using Backblaze.
After Mat and Mark’s story we thought we were onto something. In addition to those stories that had garnered some media attention, we would occasionally get requests from users that said something along the lines of, “Hey, my laptop was stolen, but I had Backblaze installed. Could you please let me know if it’s still running, and if so, what the IP address is so that I can go to the authorities?” We would help them where we could, but knew that there was probably a much more efficient method of helping individuals and businesses keep track of their computers.
Some of the Greatest Hits, and the Mafia Story
In May of 2011, we launched “Locate My Computer.” This was our way of adding a feature to our already-popular backup client that would allow users to see a rough representation of where their computer was located, and the IP address associated with its last known transmission. After speaking to law enforcement, we learned that those two things were usually enough for the authorities to subpoena an ISP and get the physical address of the last known place the computer phoned home from. From there, they could investigate and, if the device was still there, return it to its rightful owner.
Bridgette — Once the feature went live the stories got even more interesting. Almost immediately after we launched Locate My Computer, we were contacted by Bridgette, who told us of a break-in at her house. Luckily no one was home at the time, but the thief was able to get away with her iMac, DSLR, and a few other prized possessions. As soon as she reported the robbery to the police, they were able to use the Locate My Computer feature to find the thief’s location and recover her missing items. We even made a case study out of Bridgette’s experience. You can read it at: Backblaze And The Stolen iMac.
“Joe” — The crazy recovery stories didn’t end there. Shortly after Bridgette’s story, we received an email from a user (“Joe” — to protect the innocent) who was traveling to Argentina from the United States and had his laptop stolen. After he contacted the police department in Buenos Aires, and explained to them that he was using Backblaze (which the authorities thought was a computer tracking service, and in this case, we were), they were able to get the location of the computer from an ISP in Argentina. When they went to investigate, they realized that the perpetrators were foreign nationals connected to the mafia, and that in addition to a handful of stolen laptops, they were also in the possession of over $1,000,000 in counterfeit currency! Read the whole story about “Joe” and how: Backblaze Found $1 Million in Counterfeit Cash!
The Maker — After “Joe,” we thought that our part in high-profile “busts was over, but we were wrong. About a year later we received word from a “maker” who told us that he was able to act as an “internet super-sleuth” and worked hard to find his stolen computer. After a Maker Faire in Detroit, the maker’s car was broken into while they were getting BBQ following a successful show. While some of the computers were locked and encrypted, others were in hibernation mode and wide open to prying eyes. After the police report was filed, the maker went to Backblaze to retrieve his lost files and remembered seeing the little Locate My Computer button. That’s when the story gets really interesting. The victim used a combination of ingenuity, Craigslist, Backblaze, and the local police department to get his computer back, and make a drug bust along the way. Head over to Makezine.com to read about how:How Tracking Down My Stolen Computer Triggered a Drug Bust.
Una — While we kept hearing praise and thanks from our customers who were able to recover their data and find their computers, a little while passed before we would hear a story that was as incredible as the ones above. In July of 2016, we received an email from Una who told us one of the most amazing stories of perseverance that we’d ever heard. With the help of Backblaze and a sympathetic constable in Australia, Una tracked her stolen computer’s journey across 6 countries. She got her computer back and we wrote up the whole story: How Una Found Her Stolen Laptop.
And the Hits Keep on Coming
The most recent story came from “J,” and we’ll share the whole thing with you because it has a really nice conclusion:
Back in September of 2017, I brought my laptop to work to finish up some administrative work before I took off for a vacation. I work in a mall where traffic [is] plenty and more specifically I work at a kiosk in the middle of the mall. This allows for a high amount of traffic passing by every few seconds. I turned my back for about a minute to put away some paperwork. At the time I didn’t notice my laptop missing. About an hour later when I was gathering my belongings for the day I noticed it was gone. I was devastated. This was a high end MacBook Pro that I just purchased. So we are not talking about a little bit of money here. This was a major investment.
Time [went] on. When I got back from my vacation I reached out to my LP (Loss Prevention) team to get images from our security to submit to the police with some thread of hope that they would find whomever stole it. December approached and I did not hear anything. I gave up hope and assumed that the laptop was scrapped. I put an iCloud lock on it and my Find My Mac feature was saying that laptop was “offline.” I just assumed that they opened it, saw it was locked, and tried to scrap it for parts.
Towards the end of January I got an email from Backblaze saying that the computer was successfully backed up. This came as a shock to me as I thought it was wiped. But I guess however they wiped it didn’t remove Backblaze from the SSD. None the less, I was very happy. I sifted through the backup and found the person’s name via the search history. Then, using the Locate my Computer feature I saw where it came online. I reached out on social media to the person in question and updated the police. I finally got ahold of the person who stated she bought it online a few weeks backs. We made arrangements and I’m happy to say that I am typing this email on my computer right now.
J finished by writing: “Not only did I want to share this story with you but also wanted to say thanks! Apple’s find my computer system failed. The police failed to find it. But Backblaze saved the day. This has been the best $5 a month I have ever spent. Not only that but I got all my stuff back. Which made the deal even better! It was like it was never gone.”
Have a Story of Your Own?
We’re more than thrilled to have helped all of these people restore their lost data using Backblaze. Recovering the actual machine using Locate My Computer though, that’s the icing on the cake. We’re proud of what we’ve been able to build here at Backblaze, and we really enjoy hearing stories from people who have used our service to successfully get back up and running, whether that meant restoring their data or recovering their actual computer.
If you have any interesting data recovery or computer recovery stories that you’d like to share with us, please email press@backblaze.com and we’ll share it with Billy and the rest of the Backblaze team. We love hearing them!
Right now, 400km above the Earth aboard the International Space Station, are two very special Raspberry Pi computers. They were launched into space on 6 December 2015 and are, most assuredly, the farthest-travelled Raspberry Pi computers in existence. Each year they run experiments that school students create in the European Astro Pi Challenge.
Left: Astro Pi Vis (Ed); right: Astro Pi IR (Izzy). Image credit: ESA.
The European Columbus module
Today marks the tenth anniversary of the launch of the European Columbus module. The Columbus module is the European Space Agency’s largest single contribution to the ISS, and it supports research in many scientific disciplines, from astrobiology and solar science to metallurgy and psychology. More than 225 experiments have been carried out inside it during the past decade. It’s also home to our Astro Pi computers.
Here’s a video from 7 February 2008, when Space Shuttle Atlantis went skywards carrying the Columbus module in its cargo bay.
From February 7th, 2008 NASA-TV Coverage of The 121st Space Shuttle Launch Launched At:2:45:30 P.M E.T – Coverage begins exactly one hour till launch STS-122 Crew:
Today, coincidentally, is also the deadline for the European Astro Pi Challenge: Mission Space Lab. Participating teams have until midnight tonight to submit their experiments.
Anniversary celebrations
At 16:30 GMT today there will be a live event on NASA TV for the Columbus module anniversary with NASA flight engineers Joe Acaba and Mark Vande Hei.
Our Astro Pi computers will be joining in the celebrations by displaying a digital birthday candle that the crew can blow out. It works by detecting an increase in humidity when someone blows on it. The video below demonstrates the concept.
The exact Astro Pi code that will run on the ISS today is available for you to download and run on your own Raspberry Pi and Sense HAT. You’ll notice that the program includes code to make it stop automatically when the date changes to 8 February. This is just to save time for the ground control team.
If you have a Raspberry Pi and a Sense HAT, you can use the terminal commands below to download and run the code yourself:
When you see a blank blue screen with the brightness increasing, the Sense HAT is measuring the baseline humidity. It does this every 15 minutes so it can recalibrate to take account of natural changes in background humidity. A humidity increase of 2% is needed to blow out the candle, so if the background humidity changes by more than 2% in 15 minutes, it’s possible to get a false positive. Press Ctrl + C to quit.
Please tweet pictures of your candles to @astro_pi – we might share yours! And if we’re lucky, we might catch a glimpse of the candle on the ISS during the NASA TV event at 16:30 GMT today.
Amazon EMR enables data analysts and scientists to deploy a cluster running popular frameworks such as Spark, HBase, Presto, and Flink of any size in minutes. When you launch a cluster, Amazon EMR automatically configures the underlying Amazon EC2 instances with the frameworks and applications that you choose for your cluster. This can include popular web interfaces such as Hue workbench, Zeppelin notebook, and Ganglia monitoring dashboards and tools.
These web interfaces are hosted on the EMR master node and must be accessed using the public DNS name of the master node (master public DNS value). The master public DNS value is dynamically created, not very user friendly and is hard to remember— it looks something like ip-###-###-###-###.us-west-2.compute.internal. Not having a friendly URL to connect to the popular workbench or notebook interfaces may impact the workflow and hinder your gained agility.
Some customers have addressed this challenge through custom bootstrap actions, steps, or external scripts that periodically check for new clusters and register a friendlier name in DNS. These approaches either put additional burden on the data practitioners or require additional resources to execute the scripts. In addition, there is typically some lag time associated with such scripts. They often don’t do a great job cleaning up the DNS records after the cluster has terminated, potentially resulting in a security risk.
The solution in this post provides an automated, serverless approach to registering a friendly master node name for easy access to the web interfaces.
Before I dive deeper, I review these key services and how they are part of this solution.
CloudWatch Events
CloudWatch Events delivers a near real-time stream of system events that describe changes in AWS resources. Using simple rules, you can match events and route them to one or more target functions or streams. An event can be generated in one of four ways:
From an AWS service when resources change state
From API calls that are delivered via AWS CloudTrail
From your own code that can generate application-level events
In this solution, I cover the first type of event, which is automatically emitted by EMR when the cluster state changes. Based on the state of this event, either create or update the DNS record in Route 53 when the cluster state changes to STARTING, or delete the DNS record when the cluster is no longer needed and the state changes to TERMINATED. For more information about all EMR event details, see Monitor CloudWatch Events.
Route 53 private hosted zones
A private hosted zone is a container that holds information about how to route traffic for a domain and its subdomains within one or more VPCs. Private hosted zones enable you to use custom DNS names for your internal resources without exposing the names or IP addresses to the internet.
Route 53 supports resource record sets with a wide range of record types. In this solution, you use a CNAME record that is used to specify a domain name as an alias for another domain (the ‘canonical’ domain). You use a friendly name of the cluster as the CNAME for the EMR master public DNS value.
You are using private hosted zones because an EMR cluster is typically deployed within a private subnet and is accessed either from within the VPC or from on-premises resources over VPN or AWS Direct Connect. To resolve domain names in private hosted zones from your on-premises network, configure a DNS forwarder, as described in How can I resolve Route 53 private hosted zones from an on-premises network via an Ubuntu instance?.
Lambda
Lambda is a compute service that lets you run code without provisioning or managing servers. Lambda executes your code only when needed and scales automatically to thousands of requests per second. Lambda takes care of high availability, and server and OS maintenance and patching. You pay only for the consumed compute time. There is no charge when your code is not running.
Lambda provides the ability to invoke your code in response to events, such as when an object is put to an Amazon S3 bucket or as in this case, when a CloudWatch event is emitted. As part of this solution, you deploy a Lambda function as a target that is invoked by CloudWatch Events when the event matches your rule. You also configure the necessary permissions based on the Lambda permissions model, including a Lambda function policy and Lambda execution role.
Putting it all together
Now that you have all of the pieces, you can put together a complete solution. The following diagram illustrates how the solution works:
Start with a user activity such as launching or terminating an EMR cluster.
EMR automatically sends events to the CloudWatch Events stream.
A CloudWatch Events rule matches the specified event, and routes it to a target, which in this case is a Lambda function. In this case, you are using the EMR Cluster State Change
The Lambda function performs the following key steps:
Get the clusterId value from the event detail and use it to call EMR. DescribeCluster API to retrieve the following data points:
MasterPublicDnsName – public DNS name of the master node
Locate the tag containing the friendly name to use as the CNAME for the cluster. The key name containing the friendly name should be The value should be specified as host.domain.com, where domain is the private hosted zone in which to update the DNS record.
Update DNS based on the state in the event detail.
If the state is STARTING, the function calls the Route 53 API to create or update a resource record set in the private hosted zone specified by the domain tag. This is a CNAME record mapped to MasterPublicDnsName.
Conversely, if the state is TERMINATED, the function calls the Route 53 API to delete the associated resource record set from the private hosted zone.
Deploying the solution
Because all of the components of this solution are serverless, use the AWS Serverless Application Model (AWS SAM) template to deploy the solution. AWS SAM is natively supported by AWS CloudFormation and provides a simplified syntax for expressing serverless resources, resulting in fewer lines of code.
Overview of the SAM template
For this solution, the SAM template has 76 lines of text as compared to 142 lines without SAM resources (and writing the template in YAML would be even slightly smaller). The solution can be deployed using the AWS Management Console, AWS Command Line Interface (AWS CLI), or AWS SAM Local.
CloudFormation transforms help simplify template authoring by condensing a multiple-line resource declaration into a single line in your template. To inform CloudFormation that your template defines a serverless application, add a line under the template format version as follows:
Before SAM, you would use the AWS::Lambda::Function resource type to define your Lambda function. You would then need a resource to define the permissions for the function (AWS::Lambda::Permission), another resource to define a Lambda execution role (AWS::IAM::Role), and finally a CloudWatch Events resource (Events::Rule) that triggers this function.
With SAM, you need to define just a single resource for your function, AWS::Serverless::Function. Using this single resource type, you can define everything that you need, including function properties such as function handler, runtime, and code URI, as well as the required IAM policies and the CloudWatch event.
A few additional things to note in the code example:
CodeUri – Before you can deploy a SAM template, first upload your Lambda function code zip to S3. You can do this manually or use the aws cloudformation package CLI command to automate the task of uploading local artifacts to a S3 bucket, as shown later.
Lambda execution role and permissions – You are not specifying a Lambda execution role in the template. Rather, you are providing the required permissions as IAM policy documents. When the template is submitted, CloudFormation expands the AWS::Serverless::Function resource, declaring a Lambda function and an execution role. The created role has two attached policies: a default AWSLambdaBasicExecutionRole and the inline policy specified in the template.
CloudWatch Events rule – Instead of specifying a CloudWatch Events resource type, you are defining an event source object as a property of the function itself. When the template is submitted, CloudFormation expands this into a CloudWatch Events rule resource and automatically creates the Lambda resource-based permissions to allow the CloudWatch Events rule to trigger the function.
NOTE: If you are trying this solution outside of us-east-1, then you should download the necessary files, upload them to the buckets in your region, edit the script as appropriate and then run it or use the CLI deployment method below.
3.) Choose Next.
4.) On the Specify Details page, keep or modify the stack name and choose Next.
5.) On the Options page, choose Next.
6.) On the Review page, take the following steps:
Acknowledge the two Transform access capabilities. This allows the CloudFormation transform to create the required IAM resources with custom names.
Under Transforms, choose Create Change Set.
Wait a few seconds for the change set to be created before proceeding. The change set should look as follows:
7.) Choose Execute to deploy the template.
After the template is deployed, you should see four resources created:
After the package is successfully uploaded, the output should look as follows:
Uploading to 0f6d12c7872b50b37dbfd5a60385b854 1872 / 1872.0 (100.00%)
Successfully packaged artifacts and wrote output template to file serverless-output.template.
The CodeUri property in serverless-output.template is now referencing the packaged artifacts in the S3 bucket that you specified:
s3://<bucket>/0f6d12c7872b50b37dbfd5a60385b854
Use the aws cloudformation deploy CLI command to deploy the stack:
You should see the following output after the stack has been successfully created:
Waiting for changeset to be created...
Waiting for stack create/update to complete
Successfully created/updated stack – EmrDnsSetterCli
Validating results
To test the solution, launch an EMR cluster. The Lambda function looks for the cluster_name tag associated with the EMR cluster. Make sure to specify the friendly name of your cluster as host.domain.com where the domain is the private hosted zone in which to create the CNAME record.
Here is a sample CLI command to launch a cluster within a specific subnet in a VPC with the required tag cluster_name.
After the cluster is launched, log in to the Route 53 console. In the left navigation pane, choose Hosted Zones to view the list of private and public zones currently configured in Route 53. Select the hosted zone that you specified in the ZONE tag when you launched the cluster. Verify that the resource records were created.
You can also monitor the CloudWatch Events metrics that are published to CloudWatch every minute, such as the number of TriggeredRules and Invocations.
Now that you’ve verified that the Lambda function successfully updated the Route 53 resource records in the zone file, terminate the EMR cluster and verify that the records are removed by the same function.
Conclusion
This solution provides a serverless approach to automatically assigning a friendly name for your EMR cluster for easy access to popular notebooks and other web interfaces. CloudWatch Events also supports cross-account event delivery, so if you are running EMR clusters in multiple AWS accounts, all cluster state events across accounts can be consolidated into a single account.
I hope that this solution provides a small glimpse into the power of CloudWatch Events and Lambda and how they can be leveraged with EMR and other AWS big data services. For example, by using the EMR step state change event, you can chain various pieces of your analytics pipeline. You may have a transient cluster perform data ingest and, when the task successfully completes, spin up an ETL cluster for transformation and upload to Amazon Redshift. The possibilities are truly endless.
As we head into 2018 and start looking forward to longer days in the Northern hemisphere, I thought I’d take a look back at last year’s weather using data from Raspberry Pi Oracle Weather Stations. One of the great things about the kit is that as well as uploading all its readings to the shared online Oracle database, it stores them locally on the Pi in a MySQL or MariaDB database. This means you can use the power of SQL queries coupled with Python code to do automatic data analysis.
Soggy Surrey
My Weather Station has only been installed since May, so I didn’t have a full 52 weeks of my own data to investigate. Still, my station recorded more than 70000 measurements. Living in England, the first thing I wanted to know was: which was the wettest month? Unsurprisingly, both in terms of average daily rainfall and total rainfall, the start of the summer period — exactly when I went on a staycation — was the soggiest:
What about the global Weather Station community?
Even soggier Bavaria
Here things get slightly trickier. Although we have a shiny Oracle database full of all participating schools’ sensor readings, some of the data needs careful interpretation. Many kits are used as part of the school curriculum and do not always record genuine outdoor conditions. Nevertheless, it appears that Adalbert Stifter Gymnasium in Bavaria, Germany, had an even wetter 2017 than my home did:
The records Robert-Dannemann Schule in Westerstede, Germany, is a good example of data which was most likely collected while testing and investigating the weather station sensors, rather than in genuine external conditions. Unless this school’s Weather Station was transported to a planet which suffers from extreme hurricanes, it wasn’t actually subjected to wind speeds above 1000km/h in November. Dismissing these and all similarly suspect records, I decided to award the ‘Windiest location of the year’ prize to CEIP Noalla-Telleiro, Spain.
This school is right on the coast, and is subject to some strong and squally weather systems.
Weather Station at CEIP Noalla-Telleiro
They’ve mounted their wind vane and anemometer nice and high, so I can see how they were able to record such high wind velocities.
A couple of Weather Stations have recently been commissioned in equally exposed places — it will be interesting to see whether they will record even higher speeds during 2018.
Highs and lows
After careful analysis and a few disqualifications (a couple of Weather Stations in contention for this category were housed indoors), the ‘Hottest location’ award went to High School of Chalastra in Thessaloniki, Greece. There were a couple of Weather Stations (the one at The Marwadi Education Foundation in India, for example) that reported higher average temperatures than Chalastra’s 24.54 ºC. However, they had uploaded far fewer readings and their data coverage of 2017 was only partial.
At the other end of the thermometer, the location with the coldest average temperature is École de la Rose Sauvage in Calgary, Canada, with a very chilly 9.9 ºC.
Weather Station at École de la Rose Sauvage
I suspect this school has a good chance of retaining the title: their lowest 2017 temperature of -24 ºC is likely to be beaten in 2018 due to extreme weather currently bringing a freezing start to the year in that part of the world.
If you have an Oracle Raspberry Pi Weather Station and would like to perform an annual review of your local data, you can use this Python script as a starting point. It will display a monthly summary of the temperature and rainfall for 2017, and you should be able to customise the code to focus on other sensor data or on a particular time of year. We’d love to see your results, so please share your findings with [email protected], and we’ll send you some limited-edition Weather Station stickers.
Thank you to my colleague Harvey Bendana for this blog on how to do shallow cloning on AWS CodeBuild using GitHub Enterprise as a source.
Today we are announcing support for using GitHub Enterprise as a source type for CodeBuild. You can now initiate build tasks from changes in source code hosted on your own implementation of GitHub Enterprise.
We are also announcing support for shallow cloning of a repo when you use CodeCommit, BitBucket, GitHub, or GitHub Enterprise as a source type. Shallow cloning allows you to truncate history of a repo in order to save space and speed up cloning times.
In this post, I’ll walk you through how to configure GitHub Enterprise as a source type with a defined clone depth for an AWS CodeBuild project. I’ll also show you all the moving parts associated with a successful implementation.
AWS CodeBuild is a fully managed build service. There are no servers to provision and scale, or software to install, configure, and operate. You just specify the location of your source code, choose your build settings, and CodeBuild runs build scripts for compiling, testing, and packaging your code.
GitHub Enterprise is the on-premises version of GitHub.com. It makes collaborative coding possible and enjoyable for large-scale enterprise software development teams.
Many enterprises choose GitHub Enterprise as their preferred source code/version control repository because it can be hosted in their own trusted network, whether that is an on-premises data center or their own Amazon VPCs.
Requirements
You’ll need an AWS account.
You’ll need a GitHub Enterprise implementation with a repo. If you’d like to deploy one inside your own Amazon VPC, check out our Quick Start Guide.
You’ll need an S3 bucket to store your GitHub Enterprise self-signed SSL certificate.
Download your GitHub Enterprise SSL certificate:
Note: The following steps are required only for self-signed certificates. You can forego installation of a certificate if you are using self-signed certificates and default to HTTP communication with your repo. For this post, I am using a self-signed certificate and the Firefox browser. These steps may vary, depending on your browser of choice.
Navigate to your GitHub Enterprise environment and sign in with your credentials.
2. Choose the lock icon in the upper-left corner to view and export your GitHub Enterprise certificate.
3. When you export and download your certificate, make sure you select the format type, which includes the entire certificate chain.
2. If there are no existing build projects, a welcome page is displayed. Choose Get started. If you already have build projects, then choose Create project.
3. On the Configure your project page, type a name for your build project. Build project names must be unique across each AWS account.
6. Enter your personal access token in your CodeBuild project and choose Save Token.
7. Enter the repository URL and choose a Git clone depth value that makes sense for you. Allowed values are 1, 5, 25, or Full. For this post, I am using a depth of 1.
8. Select the Webhook check box.
9. Continue with the rest of the configuration for your project, choosing options that best suit your build needs. For this post, I am using an AWS CodeBuild managed image running the Ubuntu OS with the base runtime configuration. Enter your build specifications or build commands. I am using a simple build command of git log . so that it can be easily found in the CloudWatch logs of the CodeBuild project. It will also be used to demonstrate the shallow clone feature.
10. Next, select Install certificate from your S3 to install your GitHub Enterprise self-signed certificate from S3. For Bucket of certificate, I’ve entered the S3 bucket where I uploaded the certificate. For Object key of certificate, I’ve entered the name of the certificate.
11. Lastly, configure artifacts, caching, IAM roles, and VPC configurations. For this post, I chose not to generate any artifacts from this build. From the following screenshot, you’ll see I’ve opted out of cache, requested a new IAM role with the required permissions, and have not defined VPC access. Choose Continue to validate and complete the creation of the CodeBuild project.
Note: If your GitHub Enterprise environment is in an Amazon VPC, configure VPC access for your project. Define the VPC ID, subnet ID, and security group so that your project has access to the EC2 instances hosting your GitHub Enterprise environment.
12. After the project is created, a dialog box displays a CodeBuild payload URL and secret. They are used to create a webhook for the repo in the GitHub Enterprise environment.
Create a webhook in your GitHub Enterprise repo:
1. In your GitHub Enterprise repo, navigate to Settings, choose Hooks & services, and then choose Add webhook.
2. Paste the payload URL and secret into their respective fields. Under Which events would you like to trigger this webhook? choose an option. For this post, I am using Let me select individual events. I then chose Pull request and Push as the two event triggers.
3. Make sure Active is selected and then choose Add webhook.
4. A webhook has now been created in the GitHub Enterprise repo.
Now I will show you how to test this.
Trigger your AWS CodeBuild project by pushing a change to the GitHub Enterprise repo
1. Clone the repo to the local file system. For information, see Clone the Repository Using the Command Line on the GitHub Help website. Now create a feature branch, push a change, and generate a pull request for review, and, ultimately, merge to master. Here is the state of the GitHub Enterprise repo and AWS CodeBuild project before pushing a change:
4. After the changes have been saved, push them to the feature branch.
5. There is now notification of a new branch in the GitHub Enterprise environment.
6. Generate a pull request from the feature branch in preparation of review and merge to master.
7. The reviewer(s) will then review and merge the pull request, pushing all changes to the master branch.
8. Here is the updated repo:
9. After the change has been pushed successfully, a new build is initiated.
10. In the following screenshot, you’ll see the initiator is Github-Hookshot/eb0c46 and the source version is 03169095b8f16ac077388471035becb2070aa12c.
11. In the Recent Deliveries section, under the configuration of the GitHub Enterprise repo webhook, the CodeBuild project initiator is defined as User-Agent. The source version is denoted in the Payload output. They match!
12. A successfully completed build should appear under the CodeBuild project.
13. The entire log output of the CodeBuild project can be viewed in CloudWatch logs. In the following screenshot, the source was downloaded successfully from the GitHub Enterprise repo and the build command of git log . was run successfully. Only the most recent commit appears in the git history output. This is because I defined a clone depth of 1.
14. If I query the git history of the repo on the local repository, the output has the full commit history. This is expected because I am doing a full clone locally.
Conclusion
In this blog post, I showed you how to configure GitHub Enterprise as a source type for your AWS CodeBuild project with a clone depth of 1. These new features expand the capabilities of AWS CodeBuild and the suite of AWS Developer Tools for CI/CD and DevOps processes.
I hope you found this post useful. Feel free to leave your feedback or suggestions in the comments.
We came to see Toby and Brian Bahn, physics teacher for Dover High School and leader of the Dover Robotics club, so they could tell us about the inner workings of the Tic-Tac-Toe Robot project, and how the Raspberry Pi fit within it. Check out our video for Toby’s explanation of the build and the software controlling it.
Toby’s original robotic arm prototype used a weight to direct the pen on and off the paper. He later replaced this with a servo motor.
Toby documented the prototyping process for the robot on the Dover Robotics blog. Head over there to hear more about the highs and lows of building a robotic arm from scratch, and about how Toby learned to integrate a Raspberry Pi for both software and hardware control.
The finished build is a tic-tac-toe beast, besting everyone who dares to challenge it to a game.
And in case you’re wondering: no, none of the Raspberry Pi team were able to beat the Tic-Tac-Toe Robot when we played against it.
Your turn
We always love seeing Raspberry Pis being used in schools to teach coding and digital making, whether in the classroom or during after-school activities such as the Dover Robotics club and our own Code Clubs and CoderDojos. If you are part of a coding or robotics club, we’d love to hear your story! So make sure to share your experiences and projects in the comments below, or via our social media accounts.
Recently, Amazon announced some new Amazon S3 encryption and security features. The AWS Blog post showed how to use the Amazon S3 console to take advantage of these new features. However, if you have a large number of Amazon S3 buckets, using the console to implement these features could take hours, if not days. As an alternative, I created documentation topics in the AWS SDK for Ruby Developer Guide that include code examples showing you how to use the new Amazon S3 encryption features using the AWS SDK for Ruby.
What are my encryption options?
You can encrypt Amazon S3 bucket objects on a server or on a client:
When you encrypt objects on a server, you request that Amazon S3 encrypt the objects before saving them to disk in data centers and decrypt the objects when you download them. The main advantage of this approach is that Amazon S3 manages the entire encryption process.
When you encrypt objects on a client, you encrypt the objects before you upload them to Amazon S3. In this case, you manage the encryption process, the encryption keys, and related tools. Use this option when:
Company policy and standards require it.
You already have a development process in place that meets your needs.
Encrypting on the client has always been available, but you should know the following points:
You must be diligent about protecting your encryption keys, which is analogous to having a burglar-proof lock on your front door. If you leave a key under the mat, your security is compromised.
If you lose your encryption keys, you won’t be able to decrypt your data.
If you encrypt objects on the client, we strongly recommend that you use an AWS Key Management Service (AWS KMS) managed customer master key (CMK)
How to use encryption on a server
You can specify that Amazon S3 automatically encrypts objects as you upload them to a bucket or require that objects uploaded to an Amazon S3 bucket include encryption on a server before they are uploaded to an Amazon S3 bucket.
The advantage of these settings is that when you specify them, you ensure that objects uploaded to Amazon S3 are encrypted. Alternatively, you can have Amazon S3 encrypt individual objects on the server as you upload them to a bucket or encrypt them on the server with your own key as you upload them to a bucket.
The AWS SDK for Ruby Developer Guide now contains the following topics that explain your encryption options on a server:
You can encrypt objects on a client before you upload them to a bucket and decrypt them after you download them from a bucket by using the Amazon S3 encryption client.
The AWS SDK for Ruby Developer Guide now contains the following topics that explain your encryption options on the client:
Note: The Amazon S3 encryption client in the AWS SDK for Ruby is compatible with other Amazon S3 encryption clients, but it is not compatible with other AWS client-side encryption libraries, including the AWS Encryption SDK and the Amazon DynamoDB encryption client for Java. Each library returns a different ciphertext (“encrypted message”) format, so you can’t use one library to encrypt objects and a different library to decrypt them. For more information, see Protecting Data Using Client-Side Encryption.
If you have comments about this blog post, submit them in the “Comments” section below. If you have questions about encrypting objects on servers and clients, start a new thread on the Amazon S3 forum or contact AWS Support.
Amazon CloudFront is a web service that speeds up distribution of your static and dynamic web content to end users through a worldwide network of edge locations. CloudFront provides a number of benefits and capabilities that can help you secure your applications and content while meeting compliance requirements. For example, you can configure CloudFront to help enforce secure, end-to-end connections using HTTPS SSL/TLS encryption. You also can take advantage of CloudFront integration with AWS Shield for DDoS protection and with AWS WAF (a web application firewall) for protection against application-layer attacks, such as SQL injection and cross-site scripting.
Now, CloudFront field-level encryption helps secure sensitive data such as a customer phone numbers by adding another security layer to CloudFront HTTPS. Using this functionality, you can help ensure that sensitive information in a POST request is encrypted at CloudFront edge locations. This information remains encrypted as it flows to and beyond your origin servers that terminate HTTPS connections with CloudFront and throughout the application environment. In this blog post, we demonstrate how you can enhance the security of sensitive data by using CloudFront field-level encryption.
Note: This post assumes that you understand concepts and services such as content delivery networks, HTTP forms, public-key cryptography, CloudFront, AWS Lambda, and the AWS CLI. If necessary, you should familiarize yourself with these concepts and review the solution overview in the next section before proceeding with the deployment of this post’s solution.
How field-level encryption works
Many web applications collect and store data from users as those users interact with the applications. For example, a travel-booking website may ask for your passport number and less sensitive data such as your food preferences. This data is transmitted to web servers and also might travel among a number of services to perform tasks. However, this also means that your sensitive information may need to be accessed by only a small subset of these services (most other services do not need to access your data).
User data is often stored in a database for retrieval at a later time. One approach to protecting stored sensitive data is to configure and code each service to protect that sensitive data. For example, you can develop safeguards in logging functionality to ensure sensitive data is masked or removed. However, this can add complexity to your code base and limit performance.
Field-level encryption addresses this problem by ensuring sensitive data is encrypted at CloudFront edge locations. Sensitive data fields in HTTPS form POSTs are automatically encrypted with a user-provided public RSA key. After the data is encrypted, other systems in your architecture see only ciphertext. If this ciphertext unintentionally becomes externally available, the data is cryptographically protected and only designated systems with access to the private RSA key can decrypt the sensitive data.
It is critical to secure private RSA key material to prevent unauthorized access to the protected data. Management of cryptographic key material is a larger topic that is out of scope for this blog post, but should be carefully considered when implementing encryption in your applications. For example, in this blog post we store private key material as a secure string in the Amazon EC2 Systems Manager Parameter Store. The Parameter Store provides a centralized location for managing your configuration data such as plaintext data (such as database strings) or secrets (such as passwords) that are encrypted using AWS Key Management Service (AWS KMS). You may have an existing key management system in place that you can use, or you can use AWS CloudHSM. CloudHSM is a cloud-based hardware security module (HSM) that enables you to easily generate and use your own encryption keys in the AWS Cloud.
To illustrate field-level encryption, let’s look at a simple form submission where Name and Phone values are sent to a web server using an HTTP POST. A typical form POST would contain data such as the following.
POST / HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded
Content-Length:60
Name=Jane+Doe&Phone=404-555-0150
Instead of taking this typical approach, field-level encryption converts this data similar to the following.
POST / HTTP/1.1
Host: example.com
Content-Type: application/x-www-form-urlencoded
Content-Length: 1713
Name=Jane+Doe&Phone=AYABeHxZ0ZqWyysqxrB5pEBSYw4AAA...
To further demonstrate field-level encryption in action, this blog post includes a sample serverless application that you can deploy by using a CloudFormation template, which creates an application environment using CloudFront, Amazon API Gateway, and Lambda. The sample application is only intended to demonstrate field-level encryption functionality and is not intended for production use. The following diagram depicts the architecture and data flow of this sample application.
Sample application architecture and data flow
Here is how the sample solution works:
An application user submits an HTML form page with sensitive data, generating an HTTPS POST to CloudFront.
Field-level encryption intercepts the form POST and encrypts sensitive data with the public RSA key and replaces fields in the form post with encrypted ciphertext. The form POST ciphertext is then sent to origin servers.
The serverless application accepts the form post data containing ciphertext where sensitive data would normally be. If a malicious user were able to compromise your application and gain access to your data, such as the contents of a form, that user would see encrypted data.
Lambda stores data in a DynamoDB table, leaving sensitive data to remain safely encrypted at rest.
An administrator uses the AWS Management Console and a Lambda function to view the sensitive data.
During the session, the administrator retrieves ciphertext from the DynamoDB table.
Decrypted sensitive data is transmitted over SSL/TLS via the AWS Management Console to the administrator for review.
Deployment walkthrough
The high-level steps to deploy this solution are as follows:
Stage the required artifacts When deployment packages are used with Lambda, the zipped artifacts have to be placed in an S3 bucket in the target AWS Region for deployment. This step is not required if you are deploying in the US East (N. Virginia) Region because the package has already been staged there.
Generate an RSA key pair Create a public/private key pair that will be used to perform the encrypt/decrypt functionality.
Upload the public key to CloudFront and associate it with the field-level encryption configuration After you create the key pair, the public key is uploaded to CloudFront so that it can be used by field-level encryption.
Launch the CloudFormation stack Deploy the sample application for demonstrating field-level encryption by using AWS CloudFormation.
Add the field-level encryption configuration to the CloudFront distribution After you have provisioned the application, this step associates the field-level encryption configuration with the CloudFront distribution.
Store the RSA private key in the Parameter Store Store the private key in the Parameter Store as a SecureString data type, which uses AWS KMS to encrypt the parameter value.
Deploy the solution
1. Stage the required artifacts
(If you are deploying in the US East [N. Virginia] Region, skip to Step 2, “Generate an RSA key pair.”)
Stage the Lambda function deployment package in an Amazon S3 bucket located in the AWS Region you are using for this solution. To do this, download the zipped deployment package and upload it to your in-region bucket. For additional information about uploading objects to S3, see Uploading Object into Amazon S3.
2. Generate an RSA key pair
In this section, you will generate an RSA key pair by using OpenSSL:
Confirm access to OpenSSL.
$ openssl version
You should see version information similar to the following.
OpenSSL <version> <date>
Create a private key using the following command.
$ openssl genrsa -out private_key.pem 2048
The command results should look similar to the following.
Restrict access to the private key.$ chmod 600 private_key.pem Note: You will use the public and private key material in Steps 3 and 6 to configure the sample application.
3. Upload the public key to CloudFront and associate it with the field-level encryption configuration
Now that you have created the RSA key pair, you will use the AWS Management Console to upload the public key to CloudFront for use by field-level encryption. Complete the following steps to upload and configure the public key.
Note: Do not include spaces or special characters when providing the configuration values in this section.
From the AWS Management Console, choose Services > CloudFront.
In the navigation pane, choose Public Key and choose Add Public Key.
Complete the Add Public Key configuration boxes:
Key Name: Type a name such as DemoPublicKey.
Encoded Key: Paste the contents of the public_key.pem file you created in Step 2c. Copy and paste the encoded key value for your public key, including the -----BEGIN PUBLIC KEY----- and -----END PUBLIC KEY----- lines.
Comment: Optionally add a comment.
Choose Create.
After adding at least one public key to CloudFront, the next step is to create a profile to tell CloudFront which fields of input you want to be encrypted. While still on the CloudFront console, choose Field-level encryption in the navigation pane.
Under Profiles, choose Create profile.
Complete the Create profile configuration boxes:
Name: Type a name such as FLEDemo.
Comment: Optionally add a comment.
Public key: Select the public key you configured in Step 4.b.
Provider name: Type a provider name such as FLEDemo. This information will be used when the form data is encrypted, and must be provided to applications that need to decrypt the data, along with the appropriate private key.
Pattern to match: Type phone. This configures field-level encryption to match based on the phone.
Choose Save profile.
Configurations include options for whether to block or forward a query to your origin in scenarios where CloudFront can’t encrypt the data. Under Encryption Configurations, choose Create configuration.
Complete the Create configuration boxes:
Comment: Optionally add a comment.
Content type: Enter application/x-www-form-urlencoded. This is a common media type for encoding form data.
Default profile ID: Select the profile you added in Step 3e.
Choose Save configuration
4. Launch the CloudFormation stack
Launch the sample application by using a CloudFormation template that automates the provisioning process.
Input parameter
Input parameter description
ProviderID
Enter the Provider name you assigned in Step 3e. The ProviderID is used in field-level encryption configuration in CloudFront (letters and numbers only, no special characters)
PublicKeyName
Enter the Key Name you assigned in Step 3b. This name is assigned to the public key in field-level encryption configuration in CloudFront (letters and numbers only, no special characters).
PrivateKeySSMPath
Leave as the default: /cloudfront/field-encryption-sample/private-key
The path in the S3 bucket containing artifact files. Leave as default if deploying in us-east-1.
To finish creating the CloudFormation stack:
Choose Next on the Select Template page, enter the input parameters and choose Next. Note: The Artifacts configuration needs to be updated only if you are deploying outside of us-east-1 (US East [N. Virginia]). See Step 1 for artifact staging instructions.
On the Options page, accept the defaults and choose Next.
On the Review page, confirm the details, choose the I acknowledge that AWS CloudFormation might create IAM resources check box, and then choose Create. (The stack will be created in approximately 15 minutes.)
5. Add the field-level encryption configuration to the CloudFront distribution
While still on the CloudFront console, choose Distributions in the navigation pane, and then:
In the Outputs section of the FLE-Sample-App stack, look for CloudFrontDistribution and click the URL to open the CloudFront console.
Choose Behaviors, choose the Default (*) behavior, and then choose Edit.
For Field-level Encryption Config, choose the configuration you created in Step 3g.
Choose Yes, Edit.
While still in the CloudFront distribution configuration, choose the General Choose Edit, scroll down to Distribution State, and change it to Enabled.
Choose Yes, Edit.
6. Store the RSA private key in the Parameter Store
In this step, you store the private key in the EC2 Systems Manager Parameter Store as a SecureString data type, which uses AWS KMS to encrypt the parameter value. For more information about AWS KMS, see the AWS Key Management Service Developer Guide. You will need a working installation of the AWS CLI to complete this step.
Store the private key in the Parameter Store with the AWS CLI by running the following command. You will find the <KMSKeyID> in the KMSKeyID in the CloudFormation stack Outputs. Substitute it for the placeholderin the following command.
Verify the parameter. Your private key material should be accessible through the ssm get-parameter in the following command in the Value The key material has been truncated in the following output.
Notice we use the —with decryption argument in this command. This returns the private key as cleartext.
This completes the sample application deployment. Next, we show you how to see field-level encryption in action.
Delete the private key from local storage. On Linux for example, using the shred command, securely delete the private key material from your workstation as shown below. You may also wish to store the private key material within an AWS CloudHSM or other protected location suitable for your security requirements. For production implementations, you also should implement key rotation policies.
Use the following steps to test the sample application with field-level encryption:
Open sample application in your web browser by clicking the ApplicationURL link in the CloudFormation stack Outputs. (for example, https:d199xe5izz82ea.cloudfront.net/prod/). Note that it may take several minutes for the CloudFront distribution to reach the Deployed Status from the previous step, during which time you may not be able to access the sample application.
Fill out and submit the HTML form on the page:
Complete the three form fields: Full Name, Email Address, and Phone Number.
Choose Submit. Notice that the application response includes the form values. The phone number returns the following ciphertext encryption using your public key. This ciphertext has been stored in DynamoDB.
Execute the Lambda decryption function to download ciphertext from DynamoDB and decrypt the phone number using the private key:
In the CloudFormation stack Outputs, locate DecryptFunction and click the URL to open the Lambda console.
Configure a test event using the “Hello World” template.
Choose the Test button.
View the encrypted and decrypted phone number data.
Summary
In this blog post, we showed you how to use CloudFront field-level encryption to encrypt sensitive data at edge locations and help prevent access from unauthorized systems. The source code for this solution is available on GitHub. For additional information about field-level encryption, see the documentation.
If you have comments about this post, submit them in the “Comments” section below. If you have questions about or issues implementing this solution, please start a new thread on the CloudFront forum.
I need your help. This is a call out for those between 11- and 16-years-old in the UK and Republic of Ireland. Something has gone very, very wrong and only you can save us. I’ve collected together as much information for you as I can. You’ll find it at http://www.raspberrypi.org/pioneers.
The challenge
In August we intercepted an emergency communication from a lonesome survivor. She seemed to be in quite a bit of trouble, and asked all you young people aged 11 to 16 to come up with something to help tackle the oncoming crisis, using whatever technology you had to hand. You had ten weeks to work in teams of two to five with an adult mentor to fulfil your mission.
The judges
We received your world-saving ideas, and our savvy survivor pulled together a ragtag bunch of apocalyptic experts to help us judge which ones would be the winning entries.
Dr Shini Somara is an advocate for STEM education and a mechanical engineer. She was host of The Health Show and has appeared in documentaries for the BBC, PBS Digital, and Sky. You can check out her work hosting Crash Course Physics on YouTube.
Prof Lewis Dartnell is an astrobiologist and author of the book The Knowledge: How to Rebuild Our World From Scratch.
Emma Stephenson has a background in aeronautical engineering and currently works in the Shell Foundation’s Access to Energy and Sustainable Mobility portfolio.
151 Likes, 3 Comments – Shini Somara (@drshinisomara) on Instagram: “Currently sifting through the entries with the other judges of #makeyourideas with…”
The winners
Our survivor is currently putting your entries to good use repairing, rebuilding, and defending her base. Our judges chose the following projects as outstanding examples of world-saving digital making.
This is our entry to the pioneers ‘Only you can save us’ competition. Our team name is Computatrum. Hope you enjoy!
Are you facing an unknown enemy whose only weakness is Nerf bullets? Then this is the robot for you! We loved the especially apocalyptic feel of the Computatron’s cleverly hacked and repurposed elements. The team even used an old floppy disc mechanism to help fire their bullets!
Thousands of lines of code… Many sheets of acrylic… A camera, touchscreen and fingerprint scanner… This is our entry into the Raspberry Pi Pioneers2017 ‘Only YOU can Save Us’ theme. When zombies or other survivors break into your base, you want a secure way of storing your crackers.
The Robot Apocalypse Committee is back, and this time they’ve brought cheese! The crew designed a cheese- and cracker-dispensing machine complete with face and fingerprint recognition to ensure those rations last until the next supply drop.
Hi! We are PiChasers and we entered the Raspberry Pi Pionners challenge last time when the theme was “Make it Outdoors!” but now we’ve been faced with another theme “Apocolypse”. We spent a while thinking of an original thing that would help in an apocolypse and decided upon a ‘text-only phone’ which uses local radio communication rather than cellular.
This text-based communication device encased in a tupperware container could be a lifesaver in a crisis! And luckily, the Pi Chasers produced an excellent video and amazing GitHub repo, ensuring that any and all survivors will be able to build their own in the safety of their base.
Pioneers Entry Team Name: The Three Musketeers Team Participants: James, Zach and Tom
We all know that zombies are terrible at geometry, and the Three Musketeers used this fact to their advantage when building their zombie security system. We were impressed to see the team working together to overcome the roadblocks they faced along the way.
We appreciate what you’re trying to do: Zombie Trolls
Playing piggy in the middle with zombies sure is a unique way of saving humankind from total extinction! We loved this project idea, and although the Zombie Trolls had a little trouble with their motors, we’re sure with a little more tinkering this zombie-fooling contraption could save us all.
Most awesome
Our judges also wanted to give a special commendation to the following teams for their equally awesome apocalypse-averting ideas:
PiRates, for their multifaceted zombie-proofing defence system and the high production value of their video
Byte them Pis, for their beautiful zombie-detecting doormat
Unatecxon, for their impressive bunker security system
Team Crompton, for their pressure-activated door system
All our winning teams have secured exclusive digital maker boxes. These are jam-packed with tantalising tech to satisfy all tinkering needs, including:
Genomics analysis has taken off in recent years as organizations continue to adopt the cloud for its elasticity, durability, and cost. With the AWS Cloud, customers have a number of performant options to choose from. These options include AWS Batch in conjunction with AWS Lambda and AWS Step Functions; AWS Glue, a serverless extract, transform, and load (ETL) service; and of course, the AWS big data and machine learning workhorse Amazon EMR.
For this task, we use Hail, an open source framework for exploring and analyzing genomic data that uses the Apache Spark framework. In this post, we use Amazon EMR to run Hail. We walk through the setup, configuration, and data processing. Finally, we generate an Apache Parquet–formatted variant dataset and explore it using Amazon Athena.
Compiling Hail
Because Hail is still under active development, you must compile it before you can start using it. To help simplify the process, you can launch the following AWS CloudFormation template that creates an EMR cluster, compiles Hail, and installs a Jupyter Notebook so that you’re ready to go with Hail.
There are a few things to note about the AWS CloudFormation template. You must provide a password for the Jupyter Notebook. Also, you must provide a virtual private cloud (VPC) to launch Amazon EMR in, and make sure that you select a subnet from within that VPC. Next, update the cluster resources to fit your needs. Lastly, the HailBuildOutputS3Path parameter should be an Amazon S3 bucket/prefix, where you should save the compiled Hail binaries for later use. Leave the Hail and Spark versions as is, unless you’re comfortable experimenting with more recent versions.
When you’ve completed these steps, the following files are saved locally on the cluster to be used when running the Apache Spark Python API (PySpark) shell.
The files are also copied to the Amazon S3 location defined by the AWS CloudFormation template so that you can include them when running jobs using the Amazon EMR Step API.
Collecting genome data
To get started with Hail, use the 1000 Genome Project dataset available on AWS. The data that you will use is located at s3://1000genomes/release/20130502/.
For Hail to process these files in an efficient manner, they need to be block compressed. In many cases, files that use gzip compression are compressed in blocks, so you don’t need to recompress—you can just rename the file extension to “.bgz” from “.gz” . Hail can process .gz files, but it’s much slower and not recommended. The simple way to accomplish this is to copy the data files from the public S3 bucket to your own and rename them.
The following is the Bash command line to copy the first five genome Variant Call Format (VCF) files and rename them appropriately using the AWS CLI.
for i in $(seq 5); do aws s3 cp s3://1000genomes/release/20130502/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz s3://your_bucket/prefix/ALL.chr$i.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.bgz; done
Now that you have some data files containing variants in the Variant Call Format, you need to get the sample annotations that go along with them. These annotations provide more information about each sample, such as the population they are a part of.
In this section, you use the data collected in the previous section to explore genome variations interactively using a Jupyter Notebook. You then create a simple ETL job to convert these variations into Parquet format. Finally, you query it using Amazon Athena.
Let’s open the Jupyter notebook. To start, sign in to the AWS Management Console, and open the AWS CloudFormation console. Choose the stack that you created, and then choose the Output tab. There you see the JupyterURL. Open this URL in your browser.
Go ahead and download the Jupyter Notebook that is provided to your local machine. Log in to Jupyter with the password that you provided during stack creation. Choose Upload on the right side, and then choose the notebook from your local machine.
After the notebook is uploaded, choose it from the list on the left to open it.
Select the first cell, update the S3 bucket location to point to the bucket where you saved the compiled Hail libraries, and then choose Run. This code imports the Hail modules that you compiled at the beginning. When the cell is executing, you will see In [*]. When the process is complete, the asterisk (*) is replaced by a number, for example, In [1].
Next, run the subsequent two cells, which imports the Hail module into PySpark and initiates the Hail context.
The next cell imports a single VCF file from the bucket where you saved your data in the previous section. If you change the Amazon S3 path to not include a file name, it imports all the VCF files in that directory. Depending on your cluster size, it might take a few minutes.
Remember that in the previous section, you also copied an annotation file. Now you use it to annotate the VCF files that you’ve loaded with Hail. Execute the next cell—as a shortcut, you can select the cell and press Shift+Enter.
The import_table API takes a path to the annotation file in TSV (tab-separated values) format and a parameter named impute that attempts to infer the schema of the file, as shown in the output below the cell.
At this point, you can interactively explore the data. For example, you can count the number of samples you have and group them by population.
You can also calculate the standard quality control (QC) metrics on your variants and samples.
What if you want to query this data outside of Hail and Spark, for example, using Amazon Athena? To start, you need to change the column names to lowercase because Athena currently supports only lowercase names. To do that, use the two functions provided in the notebook and call them on your virtual dedicated server (VDS), as shown in the following image. Note that you’re only changing the case of the variants and samples schemas. If you’ve further augmented your VDS, you might need to modify the lowercase functions to do the same for those schemas.
In the current version of Hail, the sample annotations are not stored in the exported Parquet VDS, so you need to save them separately. As noted by the Hail maintainers, in future versions, the data represented by the VDS Parquet output will change, and it is recommended that you also export the variant annotations. So let’s do that.
Note that both of these lines are similar in that they export a table representation of the sample and variant annotations, convert them to a Spark DataFrame, and finally save them to Amazon S3 in Parquet file format.
Finally, it is beneficial to save the VDS file back to Amazon S3 so that next time you need to analyze your data, you can load it without having to start from the raw VCF. Note that when Hail saves your data, it requires a path and a file name.
After you run these cells, expect it to take some time as it writes out the data.
Discovering table metadata
Before you can query your data, you need to tell Athena the schema of your data. You have a couple of options. The first is to use AWS Glue to crawl the S3 bucket, infer the schema from the data, and create the appropriate table. Before proceeding, you might need to migrate your Athena database to use the AWS Glue Data Catalog.
Creating tables in AWS Glue
To use the AWS Glue crawler, open the AWS Glue console and choose Crawlers in the left navigation pane.
Then choose Add crawler to create a new crawler.
Next, give your crawler a name and assign the appropriate IAM role. Leave Amazon S3 as the data source, and select the S3 bucket where you saved the data and the sample annotations. When you set the crawler’s Include path, be sure to include the entire path, for example: s3://output_bucket/hail_data/sample_annotations/
Under the Exclusion Paths, type _SUCCESS, so that you don’t crawl that particular file.
Continue forward with the default settings until you are asked if you want to add another source. Choose Yes, and add the Amazon S3 path to the variant annotation bucket s3://your_output_bucket/hail_data/sample_annotations/ so that it can build your variant annotation table. Give it an existing database name, or create a new one.
Provide a table prefix and choose Next. Then choose Finish. At this point, assuming that the data is finished writing, you can go ahead and run the crawler. When it finishes, you have two new tables in the database you created that should look something like the following:
You can explore the schema of these tables by choosing their name and then choosing Edit Schema on the right side of the table view; for example:
Creating tables in Amazon Athena
If you cannot or do not want to use AWS Glue crawlers, you can add the tables via the Athena console by typing the following statements:
In the Amazon Athena console, choose the database in which your tables were created. In this case, it looks something like the following:
To verify that you have data, choose the three dots on the right, and then choose Preview table.
Indeed, you can see some data.
You can further explore the sample and variant annotations along with the calculated QC metrics that you calculated previously using Hail.
Summary
To summarize, this post demonstrated the ease in which you can install, configure, and use Hail, an open source highly scalable framework for exploring and analyzing genomics data on Amazon EMR. We demonstrated setting up a Jupyter Notebook to make our exploration easy. We also used the power of Hail to calculate quality control metrics for variants and samples. We exported them to Amazon S3 and allowed a broader range of users and analysts to explore them on-demand in a serverless environment using Amazon Athena.
Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics and business intelligence needs. Roy is big Manchester United fan cheering his team on and hanging out with his family.
The collective thoughts of the interwebz
By continuing to use the site, you agree to the use of cookies. more information
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.