We all know that we should not commit any passwords or keys to the repo with our code (no matter if public or private). Yet, thousands of production passwords can be found on GitHub (and probably thousands more in internal company repositories). Some have tried to fix that by removing the passwords (once they learned it’s not a good idea to store them publicly), but passwords have remained in the git history.
Knowing what not to do is the first and very important step. But how do we store production credentials. Database credentials, system secrets (e.g. for HMACs), access keys for 3rd party services like payment providers or social networks. There doesn’t seem to be an agreed upon solution.
I’ve previously argued with the 12-factor app recommendation to use environment variables – if you have a few that might be okay, but when the number of variables grow (as in any real application), it becomes impractical. And you can set environment variables via a bash script, but you’d have to store it somewhere. And in fact, even separate environment variables should be stored somewhere.
This somewhere could be a local directory (risky), a shared storage, e.g. FTP or S3 bucket with limited access, or a separate git repository. I think I prefer the git repository as it allows versioning (Note: S3 also does, but is provider-specific). So you can store all your environment-specific properties files with all their credentials and environment-specific configurations in a git repo with limited access (only Ops people). And that’s not bad, as long as it’s not the same repo as the source code.
Since many companies are using GitHub or BitBucket for their repositories, storing production credentials on a public provider may still be risky. That’s why it’s a good idea to encrypt the files in the repository. A good way to do it is via git-crypt. It is “transparent” encryption because it supports diff and encryption and decryption on the fly. Once you set it up, you continue working with the repo as if it’s not encrypted. There’s even a fork that works on Windows.
You simply run git-crypt init (after you’ve put the git-crypt binary on your OS Path), which generates a key. Then you specify your .gitattributes, e.g. like that:
And you’re done. Well, almost. If this is a fresh repo, everything is good. If it is an existing repo, you’d have to clean up your history which contains the unencrypted files. Following these steps will get you there, with one addition – before calling git commit, you should call git-crypt status -f so that the existing files are actually encrypted.
You’re almost done. We should somehow share and backup the keys. For the sharing part, it’s not a big issue to have a team of 2-3 Ops people share the same key, but you could also use the GPG option of git-crypt (as documented in the README). What’s left is to backup your secret key (that’s generated in the .git/git-crypt directory). You can store it (password-protected) in some other storage, be it a company shared folder, Dropbox/Google Drive, or even your email. Just make sure your computer is not the only place where it’s present and that it’s protected. I don’t think key rotation is necessary, but you can devise some rotation procedure.
git-crypt authors claim to shine when it comes to encrypting just a few files in an otherwise public repo. And recommend looking at git-remote-gcrypt. But as often there are non-sensitive parts of environment-specific configurations, you may not want to encrypt everything. And I think it’s perfectly fine to use git-crypt even in a separate repo scenario. And even though encryption is an okay approach to protect credentials in your source code repo, it’s still not necessarily a good idea to have the environment configurations in the same repo. Especially given that different people/teams manage these credentials. Even in small companies, maybe not all members have production access.
The outstanding questions in this case is – how do you sync the properties with code changes. Sometimes the code adds new properties that should be reflected in the environment configurations. There are two scenarios here – first, properties that could vary across environments, but can have default values (e.g. scheduled job periods), and second, properties that require explicit configuration (e.g. database credentials). The former can have the default values bundled in the code repo and therefore in the release artifact, allowing external files to override them. The latter should be announced to the people who do the deployment so that they can set the proper values.
The whole process of having versioned environment-speific configurations is actually quite simple and logical, even with the encryption added to the picture. And I think it’s a good security practice we should try to follow.
Last year, we released Amazon Connect, a cloud-based contact center service that enables any business to deliver better customer service at low cost. This service is built based on the same technology that empowers Amazon customer service associates. Using this system, associates have millions of conversations with customers when they inquire about their shipping or order information. Because we made it available as an AWS service, you can now enable your contact center agents to make or receive calls in a matter of minutes. You can do this without having to provision any kind of hardware. 2
There are several advantages of building your contact center in the AWS Cloud, as described in our documentation. In addition, customers can extend Amazon Connect capabilities by using AWS products and the breadth of AWS services. In this blog post, we focus on how to get analytics out of the rich set of data published by Amazon Connect. We make use of an Amazon Connect data stream and create an end-to-end workflow to offer an analytical solution that can be customized based on need.
Solution overview
The following diagram illustrates the solution.
In this solution, Amazon Connect exports its contact trace records (CTRs) using Amazon Kinesis. CTRs are data streams in JSON format, and each has information about individual contacts. For example, this information might include the start and end time of a call, which agent handled the call, which queue the user chose, queue wait times, number of holds, and so on. You can enable this feature by reviewing our documentation.
In this architecture, we use Kinesis Firehose to capture Amazon Connect CTRs as raw data in an Amazon S3 bucket. We don’t use the recent feature added by Kinesis Firehose to save the data in S3 as Apache Parquet format. We use AWS Glue functionality to automatically detect the schema on the fly from an Amazon Connect data stream.
The primary reason for this approach is that it allows us to use attributes and enables an Amazon Connect administrator to dynamically add more fields as needed. Also by converting data to parquet in batch (every couple of hours) compression can be higher. However, if your requirement is to ingest the data in Parquet format on realtime, we recoment using Kinesis Firehose recently launched feature. You can review this blog post for further information.
By default, Firehose puts these records in time-series format. To make it easy for AWS Glue crawlers to capture information from new records, we use AWS Lambda to move all new records to a single S3 prefix called flatfiles. Our Lambda function is configured using S3 event notification. To comply with AWS Glue and Athena best practices, the Lambda function also converts all column names to lowercase. Finally, we also use the Lambda function to start AWS Glue crawlers. AWS Glue crawlers identify the data schema and update the AWS Glue Data Catalog, which is used by extract, transform, load (ETL) jobs in AWS Glue in the latter half of the workflow.
You can see our approach in the Lambda code following.
from __future__ import print_function
import json
import urllib
import boto3
import os
import re
s3 = boto3.resource('s3')
client = boto3.client('s3')
def convertColumntoLowwerCaps(obj):
for key in obj.keys():
new_key = re.sub(r'[\W]+', '', key.lower())
v = obj[key]
if isinstance(v, dict):
if len(v) > 0:
convertColumntoLowwerCaps(v)
if new_key != key:
obj[new_key] = obj[key]
del obj[key]
return obj
def lambda_handler(event, context):
bucket = event['Records'][0]['s3']['bucket']['name']
key = urllib.unquote_plus(event['Records'][0]['s3']['object']['key'].encode('utf8'))
try:
client.download_file(bucket, key, '/tmp/file.json')
with open('/tmp/out.json', 'w') as output, open('/tmp/file.json', 'rb') as file:
i = 0
for line in file:
for object in line.replace("}{","}\n{").split("\n"):
record = json.loads(object,object_hook=convertColumntoLowwerCaps)
if i != 0:
output.write("\n")
output.write(json.dumps(record))
i += 1
newkey = 'flatfiles/' + key.replace("/", "")
client.upload_file('/tmp/out.json', bucket,newkey)
s3.Object(bucket,key).delete()
return "success"
except Exception as e:
print(e)
print('Error coping object {} from bucket {}'.format(key, bucket))
raise e
We trigger AWS Glue crawlers based on events because this approach lets us capture any new data frame that we want to be dynamic in nature. CTR attributes are designed to offer multiple custom options based on a particular call flow. Attributes are essentially key-value pairs in nested JSON format. With the help of event-based AWS Glue crawlers, you can easily identify newer attributes automatically.
We recommend setting up an S3 lifecycle policy on the flatfiles folder that keeps records only for 24 hours. Doing this optimizes AWS Glue ETL jobs to process a subset of files rather than the entire set of records.
After we have data in the flatfiles folder, we use AWS Glue to catalog the data and transform it into Parquet format inside a folder called parquet/ctr/. The AWS Glue job performs the ETL that transforms the data from JSON to Parquet format. We use AWS Glue crawlers to capture any new data frame inside the JSON code that we want to be dynamic in nature. What this means is that when you add new attributes to an Amazon Connect instance, the solution automatically recognizes them and incorporates them in the schema of the results.
After AWS Glue stores the results in Parquet format, you can perform analytics using Amazon Redshift Spectrum, Amazon Athena, or any third-party data warehouse platform. To keep this solution simple, we have used Amazon Athena for analytics. Amazon Athena allows us to query data without having to set up and manage any servers or data warehouse platforms. Additionally, we only pay for the queries that are executed.
Try it out!
You can get started with our sample AWS CloudFormation template. This template creates the components starting from the Kinesis stream and finishes up with S3 buckets, the AWS Glue job, and crawlers. To deploy the template, open the AWS Management Console by clicking the following link.
In the console, specify the following parameters:
BucketName: The name for the bucket to store all the solution files. This name must be unique; if it’s not, template creation fails.
etlJobSchedule: The schedule in cron format indicating how often the AWS Glue job runs. The default value is every hour.
KinesisStreamName: The name of the Kinesis stream to receive data from Amazon Connect. This name must be different from any other Kinesis stream created in your AWS account.
s3interval: The interval in seconds for Kinesis Firehose to save data inside the flatfiles folder on S3. The value must between 60 and 900 seconds.
sampledata: When this parameter is set to true, sample CTR records are used. Doing this lets you try this solution without setting up an Amazon Connect instance. All examples in this walkthrough use this sample data.
Select the “I acknowledge that AWS CloudFormation might create IAM resources.” check box, and then choose Create. After the template finishes creating resources, you can see the stream name on the stack Outputs tab.
If you haven’t created your Amazon Connect instance, you can do so by following the Getting Started Guide. When you are done creating, choose your Amazon Connect instance in the console, which takes you to instance settings. Choose Data streaming to enable streaming for CTR records. Here, you can choose the Kinesis stream (defined in the KinesisStreamName parameter) that was created by the CloudFormation template.
Now it’s time to generate the data by making or receiving calls by using Amazon Connect. You can go to Amazon Connect Cloud Control Panel (CCP) to make or receive calls using a software phone or desktop phone. After a few minutes, we should see data inside the flatfiles folder. To make it easier to try this solution, we provide sample data that you can enable by setting the sampledata parameter to true in your CloudFormation template.
You can navigate to the AWS Glue console by choosing Jobs on the left navigation pane of the console. We can select our job here. In my case, the job created by CloudFormation is called glueJob-i3TULzVtP1W0; yours should be similar. You run the job by choosing Run job for Action.
After that, we wait for the AWS Glue job to run and to finish successfully. We can track the status of the job by checking the History tab.
When the job finishes running, we can check the Database section. There should be a new table created called ctr in Parquet format.
To query the data with Athena, we can select the ctr table, and for Action choose View data.
Doing this takes us to the Athena console. If you run a query, Athena shows a preview of the data.
When we can query the data using Athena, we can visualize it using Amazon QuickSight. Before connecting Amazon QuickSight to Athena, we must make sure to grant Amazon QuickSight access to Athena and the associated S3 buckets in the account. For more information on doing this, see Managing Amazon QuickSight Permissions to AWS Resources in the Amazon QuickSight User Guide. We can then create a new data set in Amazon QuickSight based on the Athena table that was created.
After setting up permissions, we can create a new analysis in Amazon QuickSight by choosing New analysis.
Then we add a new data set.
We choose Athena as the source and give the data source a name (in this case, I named it connectctr).
Choose the name of the database and the table referencing the Parquet results.
Then choose Visualize.
After that, we should see the following screen.
Now we can create some visualizations. First, search for the agent.username column, and drag it to the AutoGraph section.
We can see the agents and the number of calls for each, so we can easily see which agents have taken the largest amount of calls. If we want to see from what queues the calls came for each agent, we can add the queue.arn column to the visual.
After following all these steps, you can use Amazon QuickSight to add different columns from the call records and perform different types of visualizations. You can build dashboards that continuously monitor your connect instance. You can share those dashboards with others in your organization who might need to see this data.
Conclusion
In this post, you see how you can use services like AWS Lambda, AWS Glue, and Amazon Athena to process Amazon Connect call records. The post also demonstrates how to use AWS Lambda to preprocess files in Amazon S3 and transform them into a format that recognized by AWS Glue crawlers. Finally, the post shows how to used Amazon QuickSight to perform visualizations.
You can use the provided template to analyze your own contact center instance. Or you can take the CloudFormation template and modify it to process other data streams that can be ingested using Amazon Kinesis or stored on Amazon S3.
Luis Caro is a Big Data Consultant for AWS Professional Services. He works with our customers to provide guidance and technical assistance on big data projects, helping them improving the value of their solutions when using AWS.
Peter Dalbhanjan is a Solutions Architect for AWS based in Herndon, VA. Peter has a keen interest in evangelizing AWS solutions and has written multiple blog posts that focus on simplifying complex use cases. At AWS, Peter helps with designing and architecting variety of customer workloads.
Amazon QuickSight is a fully managed cloud business intelligence system that gives you Fast & Easy to Use Business Analytics for Big Data. QuickSight makes business analytics available to organizations of all shapes and sizes, with the ability to access data that is stored in your Amazon Redshift data warehouse, your Amazon Relational Database Service (RDS) relational databases, flat files in S3, and (via connectors) data stored in on-premises MySQL, PostgreSQL, and SQL Server databases. QuickSight scales to accommodate tens, hundreds, or thousands of users per organization.
Today we are launching a new, session-based pricing option for QuickSight, along with additional region support and other important new features. Let’s take a look at each one:
Pay-per-Session Pricing Our customers are making great use of QuickSight and take full advantage of the power it gives them to connect to data sources, create reports, and and explore visualizations.
However, not everyone in an organization needs or wants such powerful authoring capabilities. Having access to curated data in dashboards and being able to interact with the data by drilling down, filtering, or slicing-and-dicing is more than adequate for their needs. Subscribing them to a monthly or annual plan can be seen as an unwarranted expense, so a lot of such casual users end up not having access to interactive data or BI.
In order to allow customers to provide all of their users with interactive dashboards and reports, the Enterprise Edition of Amazon QuickSight now allows Reader access to dashboards on a Pay-per-Session basis. QuickSight users are now classified as Admins, Authors, or Readers, with distinct capabilities and prices:
Authors have access to the full power of QuickSight; they can establish database connections, upload new data, create ad hoc visualizations, and publish dashboards, all for $9 per month (Standard Edition) or $18 per month (Enterprise Edition).
Readers can view dashboards, slice and dice data using drill downs, filters and on-screen controls, and download data in CSV format, all within the secure QuickSight environment. Readers pay $0.30 for 30 minutes of access, with a monthly maximum of $5 per reader.
Admins have all authoring capabilities, and can manage users and purchase SPICE capacity in the account. The QuickSight admin now has the ability to set the desired option (Author or Reader) when they invite members of their organization to use QuickSight. They can extend Reader invites to their entire user base without incurring any up-front or monthly costs, paying only for the actual usage.
A New Region QuickSight is now available in the Asia Pacific (Tokyo) Region:
The UI is in English, with a localized version in the works.
Hourly Data Refresh Enterprise Edition SPICE data sets can now be set to refresh as frequently as every hour. In the past, each data set could be refreshed up to 5 times a day. To learn more, read Refreshing Imported Data.
Access to Data in Private VPCs This feature was launched in preview form late last year, and is now available in production form to users of the Enterprise Edition. As I noted at the time, you can use it to implement secure, private communication with data sources that do not have public connectivity, including on-premises data in Teradata or SQL Server, accessed over an AWS Direct Connect link. To learn more, read Working with AWS VPC.
Parameters with On-Screen Controls QuickSight dashboards can now include parameters that are set using on-screen dropdown, text box, numeric slider or date picker controls. The default value for each parameter can be set based on the user name (QuickSight calls this a dynamic default). You could, for example, set an appropriate default based on each user’s office location, department, or sales territory. Here’s an example:
URL Actions for Linked Dashboards You can now connect your QuickSight dashboards to external applications by defining URL actions on visuals. The actions can include parameters, and become available in the Details menu for the visual. URL actions are defined like this:
You can use this feature to link QuickSight dashboards to third party applications (e.g. Salesforce) or to your own internal applications. Read Custom URL Actions to learn how to use this feature.
Dashboard Sharing You can now share QuickSight dashboards across every user in an account.
Larger SPICE Tables The per-data set limit for SPICE tables has been raised from 10 GB to 25 GB.
Upgrade to Enterprise Edition The QuickSight administrator can now upgrade an account from Standard Edition to Enterprise Edition with a click. This enables provisioning of Readers with pay-per-session pricing, private VPC access, row-level security for dashboards and data sets, and hourly refresh of data sets. Enterprise Edition pricing applies after the upgrade.
Available Now Everything I listed above is available now and you can start using it today!
The adoption of Apache Spark has increased significantly over the past few years, and running Spark-based application pipelines is the new normal. Spark jobs that are in an ETL (extract, transform, and load) pipeline have different requirements—you must handle dependencies in the jobs, maintain order during executions, and run multiple jobs in parallel. In most of these cases, you can use workflow scheduler tools like Apache Oozie, Apache Airflow, and even Cron to fulfill these requirements.
Apache Oozie is a widely used workflow scheduler system for Hadoop-based jobs. However, its limited UI capabilities, lack of integration with other services, and heavy XML dependency might not be suitable for some users. On the other hand, Apache Airflow comes with a lot of neat features, along with powerful UI and monitoring capabilities and integration with several AWS and third-party services. However, with Airflow, you do need to provision and manage the Airflow server. The Cron utility is a powerful job scheduler. But it doesn’t give you much visibility into the job details, and creating a workflow using Cron jobs can be challenging.
What if you have a simple use case, in which you want to run a few Spark jobs in a specific order, but you don’t want to spend time orchestrating those jobs or maintaining a separate application? You can do that today in a serverless fashion using AWS Step Functions. You can create the entire workflow in AWS Step Functions and interact with Spark on Amazon EMR through Apache Livy.
In this post, I walk you through a list of steps to orchestrate a serverless Spark-based ETL pipeline using AWS Step Functions and Apache Livy.
Input data
For the source data for this post, I use the New York City Taxi and Limousine Commission (TLC) trip record data. For a description of the data, see this detailed dictionary of the taxi data. In this example, we’ll work mainly with the following three columns for the Spark jobs.
Column name
Column description
RateCodeID
Represents the rate code in effect at the end of the trip (for example, 1 for standard rate, 2 for JFK airport, 3 for Newark airport, and so on).
FareAmount
Represents the time-and-distance fare calculated by the meter.
TripDistance
Represents the elapsed trip distance in miles reported by the taxi meter.
The trip data is in comma-separated values (CSV) format with the first row as a header. To shorten the Spark execution time, I trimmed the large input data to only 20,000 rows. During the deployment phase, the input file tripdata.csv is stored in Amazon S3 in the <<your-bucket>>/emr-step-functions/input/ folder.
The following image shows a sample of the trip data:
Solution overview
The next few sections describe how Spark jobs are created for this solution, how you can interact with Spark using Apache Livy, and how you can use AWS Step Functions to create orchestrations for these Spark applications.
At a high level, the solution includes the following steps:
Trigger the AWS Step Function state machine by passing the input file path.
The first stage in the state machine triggers an AWS Lambda
The Lambda function interacts with Apache Spark running on Amazon EMR using Apache Livy, and submits a Spark job.
The state machine waits a few seconds before checking the Spark job status.
Based on the job status, the state machine moves to the success or failure state.
Subsequent Spark jobs are submitted using the same approach.
The state machine waits a few seconds for the job to finish.
The job finishes, and the state machine updates with its final status.
Let’s take a look at the Spark application that is used for this solution.
Spark jobs
For this example, I built a Spark jar named spark-taxi.jar. It has two different Spark applications:
MilesPerRateCode – The first job that runs on the Amazon EMR cluster. This job reads the trip data from an input source and computes the total trip distance for each rate code. The output of this job consists of two columns and is stored in Apache Parquet format in the output path.
The following are the expected output columns:
rate_code – Represents the rate code for the trip.
total_distance – Represents the total trip distance for that rate code (for example, sum(trip_distance)).
RateCodeStatus – The second job that runs on the EMR cluster, but only if the first job finishes successfully. This job depends on two different input sets:
csv – The same trip data that is used for the first Spark job.
miles-per-rate – The output of the first job.
This job first reads the tripdata.csv file and aggregates the fare_amount by the rate_code. After this point, you have two different datasets, both aggregated by rate_code. Finally, the job uses the rate_code field to join two datasets and output the entire rate code status in a single CSV file.
The output columns are as follows:
rate_code_id – Represents the rate code type.
total_distance – Derived from first Spark job and represents the total trip distance.
total_fare_amount – A new field that is generated during the second Spark application, representing the total fare amount by the rate code type.
Note that in this case, you don’t need to run two different Spark jobs to generate that output. The goal of setting up the jobs in this way is just to create a dependency between the two jobs and use them within AWS Step Functions.
Both Spark applications take one input argument called rootPath. It’s the S3 location where the Spark job is stored along with input and output data. Here is a sample of the final output:
The next section discusses how you can use Apache Livy to interact with Spark applications that are running on Amazon EMR.
Using Apache Livy to interact with Apache Spark
Apache Livy provides a REST interface to interact with Spark running on an EMR cluster. Livy is included in Amazon EMR release version 5.9.0 and later. In this post, I use Livy to submit Spark jobs and retrieve job status. When Amazon EMR is launched with Livy installed, the EMR master node becomes the endpoint for Livy, and it starts listening on port 8998 by default. Livy provides APIs to interact with Spark.
Let’s look at a couple of examples how you can interact with Spark running on Amazon EMR using Livy.
To list active running jobs, you can execute the following from the EMR master node:
curl localhost:8998/sessions
If you want to do the same from a remote instance, just change localhost to the EMR hostname, as in the following (port 8998 must be open to that remote instance through the security group):
Through Spark submit, you can pass multiple arguments for the Spark job and Spark configuration settings. You can also do that using Livy, by passing the S3 path through the args parameter, as shown following:
For a detailed list of Livy APIs, see the Apache Livy REST API page. This post uses GET /batches and POST /batches.
In the next section, you create a state machine and orchestrate Spark applications using AWS Step Functions.
Using AWS Step Functions to create a Spark job workflow
AWS Step Functions automatically triggers and tracks each step and retries when it encounters errors. So your application executes in order and as expected every time. To create a Spark job workflow using AWS Step Functions, you first create a Lambda state machine using different types of states to create the entire workflow.
First, you use the Task state—a simple state in AWS Step Functions that performs a single unit of work. You also use the Wait state to delay the state machine from continuing for a specified time. Later, you use the Choice state to add branching logic to a state machine.
The following is a quick summary of how to use different states in the state machine to create the Spark ETL pipeline:
Task state – Invokes a Lambda function. The first Task state submits the Spark job on Amazon EMR, and the next Task state is used to retrieve the previous Spark job status.
Wait state – Pauses the state machine until a job completes execution.
Choice state – Each Spark job execution can return a failure, an error, or a success state So, in the state machine, you use the Choice state to create a rule that specifies the next action or step based on the success or failure of the previous step.
Here is one of my Task states, MilesPerRateCode, which simply submits a Spark job:
"MilesPerRate Job": {
"Type": "Task",
"Resource":"arn:aws:lambda:us-east-1:xxxxxx:function:blog-miles-per-rate-job-submit-function",
"ResultPath": "$.jobId",
"Next": "Wait for MilesPerRate job to complete"
}
This Task state configuration specifies the Lambda function to execute. Inside the Lambda function, it submits a Spark job through Livy using Livy’s POST API. Using ResultPath, it tells the state machine where to place the result of the executing task. As discussed in the previous section, Spark submit returns the session ID, which is captured with $.jobId and used in a later state.
The following code section shows the Lambda function, which is used to submit the MilesPerRateCode job. It uses the Python request library to submit a POST against the Livy endpoint hosted on Amazon EMR and passes the required parameters in JSON format through payload. It then parses the response, grabs id from the response, and returns it. The Next field tells the state machine which state to go to next.
Just like in the MilesPerRate job, another state submits the RateCodeStatus job, but it executes only when all previous jobs have completed successfully.
Here is the Task state in the state machine that checks the Spark job status:
Just like other states, the preceding Task executes a Lambda function, captures the result (represented by jobStatus), and passes it to the next state. The following is the Lambda function that checks the Spark job status based on a given session ID:
In the Choice state, it checks the Spark job status value, compares it with a predefined state status, and transitions the state based on the result. For example, if the status is success, move to the next state (RateCodeJobStatus job), and if it is dead, move to the MilesPerRate job failed state.
To set up this entire solution, you need to create a few AWS resources. To make it easier, I have created an AWS CloudFormation template. This template creates all the required AWS resources and configures all the resources that are needed to create a Spark-based ETL pipeline on AWS Step Functions.
This CloudFormation template requires you to pass the following four parameters during initiation.
Parameter
Description
ClusterSubnetID
The subnet where the Amazon EMR cluster is deployed and Lambda is configured to talk to this subnet.
KeyName
The name of the existing EC2 key pair to access the Amazon EMR cluster.
VPCID
The ID of the virtual private cloud (VPC) where the EMR cluster is deployed and Lambda is configured to talk to this VPC.
S3RootPath
The Amazon S3 path where all required files (input file, Spark job, and so on) are stored and the resulting data is written.
IMPORTANT: These templates are designed only to show how you can create a Spark-based ETL pipeline on AWS Step Functions using Apache Livy. They are not intended for production use without modification. And if you try this solution outside of the us-east-1 Region, download the necessary files from s3://aws-data-analytics-blog/emr-step-functions, upload the files to the buckets in your Region, edit the script as appropriate, and then run it.
To launch the CloudFormation stack, choose Launch Stack:
Launching this stack creates the following list of AWS resources.
Logical ID
Resource Type
Description
StepFunctionsStateExecutionRole
IAM role
IAM role to execute the state machine and have a trust relationship with the states service.
SparkETLStateMachine
AWS Step Functions state machine
State machine in AWS Step Functions for the Spark ETL workflow.
LambdaSecurityGroup
Amazon EC2 security group
Security group that is used for the Lambda function to call the Livy API.
RateCodeStatusJobSubmitFunction
AWS Lambda function
Lambda function to submit the RateCodeStatus job.
MilesPerRateJobSubmitFunction
AWS Lambda function
Lambda function to submit the MilesPerRate job.
SparkJobStatusFunction
AWS Lambda function
Lambda function to check the Spark job status.
LambdaStateMachineRole
IAM role
IAM role for all Lambda functions to use the lambda trust relationship.
EMRCluster
Amazon EMR cluster
EMR cluster where Livy is running and where the job is placed.
During the AWS CloudFormation deployment phase, it sets up S3 paths for input and output. Input files are stored in the <<s3-root-path>>/emr-step-functions/input/ path, whereas spark-taxi.jar is copied under <<s3-root-path>>/emr-step-functions/.
The following screenshot shows how the S3 paths are configured after deployment. In this example, I passed a bucket that I created in the AWS account s3://tm-app-demos for the S3 root path.
If the CloudFormation template completed successfully, you will see Spark-ETL-State-Machine in the AWS Step Functions dashboard, as follows:
Choose the Spark-ETL-State-Machine state machine to take a look at this implementation. The AWS CloudFormation template built the entire state machine along with its dependent Lambda functions, which are now ready to be executed.
On the dashboard, choose the newly created state machine, and then choose New execution to initiate the state machine. It asks you to pass input in JSON format. This input goes to the first state MilesPerRate Job, which eventually executes the Lambda function blog-miles-per-rate-job-submit-function.
Pass the S3 root path as input:
{
“rootPath”: “s3://tm-app-demos”
}
Then choose Start Execution:
The rootPath value is the same value that was passed when creating the CloudFormation stack. It can be an S3 bucket location or a bucket with prefixes, but it should be the same value that is used for AWS CloudFormation. This value tells the state machine where it can find the Spark jar and input file, and where it will write output files. After the state machine starts, each state/task is executed based on its definition in the state machine.
At a high level, the following represents the flow of events:
Execute the first Spark job, MilesPerRate.
The Spark job reads the input file from the location <<rootPath>>/emr-step-functions/input/tripdata.csv. If the job finishes successfully, it writes the output data to <<rootPath>>/emr-step-functions/miles-per-rate.
If the Spark job fails, it transitions to the error state MilesPerRate job failed, and the state machine stops. If the Spark job finishes successfully, it transitions to the RateCodeStatus Job state, and the second Spark job is executed.
If the second Spark job fails, it transitions to the error state RateCodeStatus job failed, and the state machine stops with the Failed status.
If this Spark job completes successfully, it writes the final output data to the <<rootPath>>/emr-step-functions/rate-code-status/ It also transitions the RateCodeStatus job finished state, and the state machine ends its execution with the Success status.
This following screenshot shows a successfully completed Spark ETL state machine:
The right side of the state machine diagram shows the details of individual states with their input and output.
When you execute the state machine for the second time, it fails because the S3 path already exists. The state machine turns red and stops at MilePerRate job failed. The following image represents that failed execution of the state machine:
You can also check your Spark application status and logs by going to the Amazon EMR console and viewing the Application history tab:
I hope this walkthrough paints a picture of how you can create a serverless solution for orchestrating Spark jobs on Amazon EMR using AWS Step Functions and Apache Livy. In the next section, I share some ideas for making this solution even more elegant.
Next steps
The goal of this post is to show a simple example that uses AWS Step Functions to create an orchestration for Spark-based jobs in a serverless fashion. To make this solution robust and production ready, you can explore the following options:
In this example, I manually initiated the state machine by passing the rootPath as input. You can instead trigger the state machine automatically. To run the ETL pipeline as soon as the files arrive in your S3 bucket, you can pass the new file path to the state machine. Because CloudWatch Events supports AWS Step Functions as a target, you can create a CloudWatch rule for an S3 event. You can then set AWS Step Functions as a target and pass the new file path to your state machine. You’re all set!
You can also improve this solution by adding an alerting mechanism in case of failures. To do this, create a Lambda function that sends an alert email and assigns that Lambda function to a Fail That way, when any part of your state fails, it triggers an email and notifies the user.
If you want to submit multiple Spark jobs in parallel, you can use the Parallel state type in AWS Step Functions. The Parallel state is used to create parallel branches of execution in your state machine.
With Lambda and AWS Step Functions, you can create a very robust serverless orchestration for your big data workload.
Cleaning up
When you’ve finished testing this solution, remember to clean up all those AWS resources that you created using AWS CloudFormation. Use the AWS CloudFormation console or AWS CLI to delete the stack named Blog-Spark-ETL-Step-Functions.
Summary
In this post, I showed you how to use AWS Step Functions to orchestrate your Spark jobs that are running on Amazon EMR. You used Apache Livy to submit jobs to Spark from a Lambda function and created a workflow for your Spark jobs, maintaining a specific order for job execution and triggering different AWS events based on your job’s outcome. Go ahead—give this solution a try, and share your experience with us!
Tanzir Musabbir is an EMR Specialist Solutions Architect with AWS. He is an early adopter of open source Big Data technologies. At AWS, he works with our customers to provide them architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena & AWS Glue. Tanzir is a big Real Madrid fan and he loves to travel in his free time.
Classic Bond villain, Elon Musk, has a new plan to create a website dedicated to measuring the credibility and adherence to “core truth” of journalists. He is, without any sense of irony, going to call this “Pravda”. This is not simply wrong but evil.
Musk has a point. Journalists do suck, and many suck consistently. I see this in my own industry, cybersecurity, and I frequently criticize them for their suckage.
But what he’s doing here is not correcting them when they make mistakes (or what Musk sees as mistakes), but questioning their legitimacy. This legitimacy isn’t measured by whether they follow established journalism ethics, but whether their “core truths” agree with Musk’s “core truths”.
An example of the problem is how the press fixates on Tesla car crashes due to its “autopilot” feature. Pretty much every autopilot crash makes national headlines, while the press ignores the other 40,000 car crashes that happen in the United States each year. Musk spies on Tesla drivers (hello, classic Bond villain everyone) so he can see the dip in autopilot usage every time such a news story breaks. He’s got good reason to be concerned about this.
He argues that autopilot is safer than humans driving, and he’s got the statistics and government studies to back this up. Therefore, the press’s fixation on Tesla crashes is illegitimate “fake news”, titillating the audience with distorted truth.
But here’s the thing: that’s still only Musk’s version of the truth. Yes, on a mile-per-mile basis, autopilot is safer, but there’s nuance here. Autopilot is used primarily on freeways, which already have a low mile-per-mile accident rate. People choose autopilot only when conditions are incredibly safe and drivers are unlikely to have an accident anyway. Musk is therefore being intentionally deceptive comparing apples to oranges. Autopilot may still be safer, it’s just that the numbers Musk uses don’t demonstrate this.
And then there is the truth calling it “autopilot” to begin with, because it isn’t. The public is overrating the capabilities of the feature. It’s little different than “lane keeping” and “adaptive cruise control” you can now find in other cars. In many ways, the technology is behind — my Tesla doesn’t beep at me when a pedestrian walks behind my car while backing up, but virtually every new car on the market does.
Yes, the press unduly covers Tesla autopilot crashes, but Musk has only himself to blame by unduly exaggerating his car’s capabilities by calling it “autopilot”.
What’s “core truth” is thus rather difficult to obtain. What the press satisfies itself with instead is smaller truths, what they can document. The facts are in such cases that the accident happened, and they try to get Tesla or Musk to comment on it.
What you can criticize a journalist for is therefore not “core truth” but whether they did journalism correctly. When such stories criticize “autopilot”, but don’t do their diligence in getting Tesla’s side of the story, then that’s a violation of journalistic practice. When I criticize journalists for their poor handling of stories in my industry, I try to focus on which journalistic principles they get wrong. For example, the NYTimes reporters do a lot of stories quoting anonymous government sources in clear violation of journalistic principles.
If “credibility” is the concern, then it’s the classic Bond villain here that’s the problem: Musk himself. His track record on business statements is abysmal. For example, when he announced the Model 3 he claimed production targets that every Wall Street analyst claimed were absurd. He didn’t make those targets, he didn’t come close. Model 3 production is still lagging behind Musk’s twice adjusted targets.
So who has a credibility gap here, the press, or Musk himself?
Not only is Musk’s credibility problem ironic, so is the name he chose, “Pravada”, the Russian word for truth that was the name of the Soviet Union Communist Party’s official newspaper. This is so absurd this has to be a joke, yet Musk claims to be serious about all this.
Yes, the press has a lot of problems, and if Musk were some journalism professor concerned about journalists meeting the objective standards of their industry (e.g. abusing anonymous sources), then this would be a fine thing. But it’s not. It’s Musk who is upset the press’s version of “core truth” does not agree with his version — a version that he’s proven time and time again differs from “real truth”.
Just in case Musk is serious, I’ve already registered “www.antipravda.com” to start measuring the credibility of statements by billionaire playboy CEOs. Let’s see who blinks first.
I stole the title, with permission, from this tweet:
Today, we’re happy to announce that the AWS GDPR Data Processing Addendum (GDPR DPA) is now part of our online Service Terms. This means all AWS customers globally can rely on the terms of the AWS GDPR DPA which will apply automatically from May 25, 2018, whenever they use AWS services to process personal data under the GDPR. The AWS GDPR DPA also includes EU Model Clauses, which were approved by the European Union (EU) data protection authorities, known as the Article 29 Working Party. This means that AWS customers wishing to transfer personal data from the European Economic Area (EEA) to other countries can do so with the knowledge that their personal data on AWS will be given the same high level of protection it receives in the EEA.
As we approach the GDPR enforcement date this week, this announcement is an important GDPR compliance component for us, our customers, and our partners. All customers which that are using cloud services to process personal data will need to have a data processing agreement in place between them and their cloud services provider if they are to comply with GDPR. As early as April 2017, AWS announced that AWS had a GDPR-ready DPA available for its customers. In this way, we started offering our GDPR DPA to customers over a year before the May 25, 2018 enforcement date. Now, with the DPA terms included in our online service terms, there is no extra engagement needed by our customers and partners to be compliant with the GDPR requirement for data processing terms.
The AWS GDPR DPA also provides our customers with a number of other important assurances, such as the following:
AWS will process customer data only in accordance with customer instructions.
AWS has implemented and will maintain robust technical and organizational measures for the AWS network.
AWS will notify its customers of a security incident without undue delay after becoming aware of the security incident.
Customers who have already signed an offline version of the AWS GDPR DPA can continue to rely on that GDPR DPA. By incorporating our GDPR DPA into the AWS Service Terms, we are simply extending the terms of our GDPR DPA to all customers globally who will require it under GDPR.
AWS GDPR DPA is only part of the story, however. We are continuing to work alongside our customers and partners to help them on their journey towards GDPR compliance.
If you have any questions about the GDPR or the AWS GDPR DPA, please contact your account representative, or visit the AWS GDPR Center at: https://aws.amazon.com/compliance/gdpr-center/
-Chad
Interested in AWS Security news? Follow the AWS Security Blog on Twitter.
Earlier this year on 3 and 4 March, communities around the world held Raspberry Jam events to celebrate Raspberry Pi’s sixth birthday. We sent out special birthday kits to participating Jams — it was amazing to know the kits would end up in the hands of people in parts of the world very far from Raspberry Pi HQ in Cambridge, UK.
The Raspberry Jam Camer team: Damien Doumer, Eyong Etta, Loïc Dessap and Lionel Sichom, aka Lionel Tellem
Preparing for the #PiParty
One birthday kit went to Yaoundé, the capital of Cameroon. There, a team of four students in their twenties — Lionel Sichom (aka Lionel Tellem), Eyong Etta, Loïc Dessap, and Damien Doumer — were organising Yaoundé’s first Jam, called Raspberry Jam Camer, as part of the Raspberry Jam Big Birthday Weekend. The team knew one another through their shared interests and skills in electronics, robotics, and programming. Damien explains in his blog post about the Jam that they planned ahead for several activities for the Jam based on their own projects, so they could be confident of having a few things that would definitely be successful for attendees to do and see.
Show-and-tell at Raspberry Jam Cameroon
Loïc presented a Raspberry Pi–based, Android app–controlled robot arm that he had built, and Lionel coded a small video game using Scratch on Raspberry Pi while the audience watched. Damien demonstrated the possibilities of Windows 10 IoT Core on Raspberry Pi, showing how to install it, how to use it remotely, and what you can do with it, including building a simple application.
Loïc showcases the prototype robot arm he built
There was lots more too, with others discussing their own Pi projects and talking about the possibilities Raspberry Pi offers, including a Pi-controlled drone and car. Cake was a prevailing theme of the Raspberry Jam Big Birthday Weekend around the world, and Raspberry Jam Camer made sure they didn’t miss out.
Yay, birthday cake!!
A big success
Most visitors to the Jam were secondary school students, while others were university students and graduates. The majority were unfamiliar with Raspberry Pi, but all wanted to learn about Raspberry Pi and what they could do with it. Damien comments that the fact most people were new to Raspberry Pi made the event more interactive rather than creating any challenges, because the visitors were all interested in finding out about the little computer. The Jam was an all-round success, and the team was pleased with how it went:
What I liked the most was that we sensitized several people about the Raspberry Pi and what one can be capable of with such a small but powerful device. — Damien Doumer
The Jam team rounded off the event by announcing that this was the start of a Raspberry Pi community in Yaoundé. They hope that they and others will be able to organise more Jams and similar events in the area to spread the word about what people can do with Raspberry Pi, and to help them realise their ideas.
Raspberry Jam Camer gets the thumbs-up
The Raspberry Pi community in Cameroon
In a French-language interview about their Jam, the team behind Raspberry Jam Camer said they’d like programming to become the third official language of Cameroon, after French and English; their aim is to to popularise programming and digital making across Cameroonian society. Neither of these fields is very familiar to most people in Cameroon, but both are very well aligned with the country’s ambitions for development. The team is conscious of the difficulties around the emergence of information and communication technologies in the Cameroonian context; in response, they are seizing the opportunities Raspberry Pi offers to give children and young people access to modern and constantly evolving technology at low cost.
Thanks to Lionel, Eyong, Damien, and Loïc, and to everyone who helped put on a Jam for the Big Birthday Weekend! Remember, anyone can start a Jam at any time — and we provide plenty of resources to get you started. Check out the Guidebook, the Jam branding pack, our specially-made Jam activities online (in multiplelanguages), printable worksheets, and more.
The EU’s General Data Protection Regulation (GDPR) describes data processor and data controller roles, and some customers and AWS Partner Network (APN) partners are asking how this affects the long-established AWS Shared Responsibility Model. I wanted to take some time to help folks understand shared responsibilities for us and for our customers in context of the GDPR.
How does the AWS Shared Responsibility Model change under GDPR? The short answer – it doesn’t. AWS is responsible for securing the underlying infrastructure that supports the cloud and the services provided; while customers and APN partners, acting either as data controllers or data processors, are responsible for any personal data they put in the cloud. The shared responsibility model illustrates the various responsibilities of AWS and our customers and APN partners, and the same separation of responsibility applies under the GDPR.
AWS responsibilities as a data processor
The GDPR does introduce specific regulation and responsibilities regarding data controllers and processors. When any AWS customer uses our services to process personal data, the controller is usually the AWS customer (and sometimes it is the AWS customer’s customer). However, in all of these cases, AWS is always the data processor in relation to this activity. This is because the customer is directing the processing of data through its interaction with the AWS service controls, and AWS is only executing customer directions. As a data processor, AWS is responsible for protecting the global infrastructure that runs all of our services. Controllers using AWS maintain control over data hosted on this infrastructure, including the security configuration controls for handling end-user content and personal data. Protecting this infrastructure, is our number one priority, and we invest heavily in third-party auditors to test our security controls and make any issues they find available to our customer base through AWS Artifact. Our ISO 27018 report is a good example, as it tests security controls that focus on protection of personal data in particular.
AWS has an increased responsibility for our managed services. Examples of managed services include Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon Elastic MapReduce, and Amazon WorkSpaces. These services provide the scalability and flexibility of cloud-based resources with less operational overhead because we handle basic security tasks like guest operating system (OS) and database patching, firewall configuration, and disaster recovery. For most managed services, you only configure logical access controls and protect account credentials, while maintaining control and responsibility of any personal data.
Customer and APN partner responsibilities as data controllers — and how AWS Services can help
Our customers can act as data controllers or data processors within their AWS environment. As a data controller, the services you use may determine how you configure those services to help meet your GDPR compliance needs. For example, AWS Services that are classified as Infrastructure as a Service (IaaS), such as Amazon EC2, Amazon VPC, and Amazon S3, are under your control and require you to perform all routine security configuration and management that would be necessary no matter where the servers were located. With Amazon EC2 instances, you are responsible for managing: guest OS (including updates and security patches), application software or utilities installed on the instances, and the configuration of the AWS-provided firewall (called a security group).
To help you realize data protection by design principles under the GDPR when using our infrastructure, we recommend you protect AWS account credentials and set up individual user accounts with Amazon Identity and Access Management (IAM) so that each user is only given the permissions necessary to fulfill their job duties. We also recommend using multi-factor authentication (MFA) with each account, requiring the use of SSL/TLS to communicate with AWS resources, setting up API/user activity logging with AWS CloudTrail, and using AWS encryption solutions, along with all default security controls within AWS Services. You can also use advanced managed security services, such as Amazon Macie, which assists in discovering and securing personal data stored in Amazon S3.
For more information, you can download the AWS Security Best Practices whitepaper or visit the AWS Security Resources or GDPR Center webpages. In addition to our solutions and services, AWS APN partners can provide hundreds of tools and features to help you meet your security objectives, ranging from network security and configuration management to access control and data encryption.
Last month, Wired published a long article about Ray Ozzie and his supposed new scheme for adding a backdoor in encrypted devices. It’s a weird article. It paints Ozzie’s proposal as something that “attains the impossible” and “satisfies both law enforcement and privacy purists,” when (1) it’s barely a proposal, and (2) it’s essentially the same key escrow scheme we’ve been hearing about for decades.
Basically, each device has a unique public/private key pair and a secure processor. The public key goes into the processor and the device, and is used to encrypt whatever user key encrypts the data. The private key is stored in a secure database, available to law enforcement on demand. The only other trick is that for law enforcement to use that key, they have to put the device in some sort of irreversible recovery mode, which means it can never be used again. That’s basically it.
I have no idea why anyone is talking as if this were anything new. Severalcryptographershavealreadyexplained why this key escrow scheme is no better than any other key escrow scheme. The short answer is (1) we won’t be able to secure that database of backdoor keys, (2) we don’t know how to build the secure coprocessor the scheme requires, and (3) it solves none of the policy problems around the whole system. This is the typical mistake non-cryptographers make when they approach this problem: they think that the hard part is the cryptography to create the backdoor. That’s actually the easy part. The hard part is ensuring that it’s only used by the good guys, and there’s nothing in Ozzie’s proposal that addresses any of that.
I worry that this kind of thing is damaging in the long run. There should be some rule that any backdoor or key escrow proposal be a fully specified proposal, not just some cryptography and hand-waving notions about how it will be used in practice. And before it is analyzed and debated, it should have to satisfy some sort of basic security analysis. Otherwise, we’ll be swatting pseudo-proposals like this one, while those on the other side of this debate become increasingly convinced that it’s possible to design one of these things securely.
Already people are using the National Academies report on backdoors for law enforcement as evidence that engineers are developing workable and secure backdoors. Writing in Lawfare, Alan Z. Rozenshtein claims that the report — and a related New York Timesstory — “undermine the argument that secure third-party access systems are so implausible that it’s not even worth trying to develop them.” Susan Landau effectively corrects this misconception, but the damage is done.
Here’s the thing: it’s not hard to design and build a backdoor. What’s hard is building the systems — both technical and procedural — around them. Here’s Rob Graham:
He’s only solving the part we already know how to solve. He’s deliberately ignoring the stuff we don’t know how to solve. We know how to make backdoors, we just don’t know how to secure them.
A bunch of us cryptographers have already explained why we don’t think this sort of thing will work in the foreseeable future. We write:
Exceptional access would force Internet system developers to reverse “forward secrecy” design practices that seek to minimize the impact on user privacy when systems are breached. The complexity of today’s Internet environment, with millions of apps and globally connected services, means that new law enforcement requirements are likely to introduce unanticipated, hard to detect security flaws. Beyond these and other technical vulnerabilities, the prospect of globally deployed exceptional access systems raises difficult problems about how such an environment would be governed and how to ensure that such systems would respect human rights and the rule of law.
The reason so few of us are willing to bet on massive-scale key escrow systems is that we’ve thought about it and we don’t think it will work. We’ve looked at the threat model, the usage model, and the quality of hardware and software that exists today. Our informed opinion is that there’s no detection system for key theft, there’s no renewability system, HSMs are terrifically vulnerable (and the companies largely staffed with ex-intelligence employees), and insiders can be suborned. We’re not going to put the data of a few billion people on the line an environment where we believe with high probability that the system will fail.
Bad software is everywhere. One can even claim that every software is bad. Cool companies, tech giants, established companies, all produce bad software. And no, yours is not an exception.
Who’s to blame for bad software? It’s all complicated and many factors are intertwined – there’s business requirements, there’s organizational context, there’s lack of sufficient skilled developers, there’s the inherent complexity of software development, there’s leaky abstractions, reliance on 3rd party software, consequences of wrong business and purchase decisions, time limitations, flawed business analysis, etc. So yes, despite the catchy title, I’m aware it’s actually complicated.
But in every “it’s complicated” scenario, there’s always one or two factors that are decisive. All of them contribute somehow, but the major drivers are usually a handful of things. And in the case of base software, I think it’s the fault of technical people. Developers, architects, ops.
We don’t seem to care about best practices. And I’ll do some nasty generalizations here, but bear with me. We can spend hours arguing about tabs vs spaces, curly bracket on new line, git merge vs rebase, which IDE is better, which framework is better and other largely irrelevant stuff. But we tend to ignore the important aspects that span beyond the code itself. The context in which the code lives, the non-functional requirements – robustness, security, resilience, etc.
We don’t seem to get security. Even trivial stuff such as user authentication is almost always implemented wrong. These days Twitter and GitHub realized they have been logging plain-text passwords, for example, but that’s just the tip of the iceberg. Too often we ignore the security implications.
“But the business didn’t request the security features”, one may say. The business never requested 2-factor authentication, encryption at rest, PKI, secure (or any) audit trail, log masking, crypto shredding, etc., etc. Because the business doesn’t know these things – we do and we have to put them on the backlog and fight for them to be implemented. Each organization has its specifics and tech people can influence the backlog in different ways, but almost everywhere we can put things there and prioritize them.
The other aspect is testing. We should all be well aware by now that automated testing is mandatory. We have all the tools in the world for unit, functional, integration, performance and whatnot testing, and yet many software projects lack the necessary test coverage to be able to change stuff without accidentally breaking things. “But testing takes time, we don’t have it”. We are perfectly aware that testing saves time, as we’ve all had those “not again!” recurring bugs. And yet we think of all sorts of excuses – “let the QAs test it”, we have to ship that now, we’ll test it later”, “this is too trivial to be tested”, etc.
And you may say it’s not our job. We don’t define what has do be done, we just do it. We don’t define the budget, the scope, the features. We just write whatever has been decided. And that’s plain wrong. It’s not our job to make money out of our code, and it’s not our job to define what customers need, but apart from that everything is our job. The way the software is structured, the security aspects and security features, the stability of the code base, the way the software behaves in different environments. The non-functional requirements are our job, and putting them on the backlog is our job.
You’ve probably heard that every software becomes “legacy” after 6 months. And that’s because of us, our sloppiness, our inability to mitigate external factors and constraints. Too often we create a mess through “just doing our job”.
And of course that’s a generalization. I happen to know a lot of great professionals who don’t make these mistakes, who strive for excellence and implement things the right way. But our industry as a whole doesn’t. Our industry as a whole produces bad software. And it’s our fault, as developers – as the only people who know why a certain piece of software is bad.
In a talk of his, Bob Martin warns us of the risks of our sloppiness. We have been building websites so far, but we are more and more building stuff that interacts with the real world, directly and indirectly. Ultimately, lives may depend on our software (like the recent unfortunate death caused by a self-driving car). And I’ll agree with Uncle Bob that it’s high time we self-regulate as an industry, before some technically incompetent politician decides to do that.
How, I don’t know. We’ll have to think more about it. But I’m pretty sure it’s our fault that software is bad, and no amount of blaming the management, the budget, the timing, the tools or the process can eliminate our responsibility.
Why do I insist on bashing my fellow software engineers? Because if we start looking at software development with more responsibility; with the fact that if it fails, it’s our fault, then we’re more likely to get out of our current bug-ridden, security-flawed, fragile software hole and really become the experts of the future.
Students taking Design of Mechatronics at the Technical University of Denmark have created some seriously elegant and striking Raspberry Pi speakers. Their builds are part of a project asking them to “explore, design and build a 3D printed speaker, around readily available electronics and components”.
The students have been uploading their designs, incorporating Raspberry Pis and HiFiBerry HATs, to Thingiverse throughout April. The task is a collaboration with luxury brand Bang & Olufsen’s Create initiative, and the results wouldn’t look out of place in a high-end showroom; I’d happily take any of these home.
Søren Qvist’s wall-mounted kitchen sphere uses 3D-printed and laser-cut parts, along with the HiFiBerry HAT and B&O speakers to create a sleek-looking design.
Otto Ømann’s group have designed the Hex One – a work-in-progress wireless 360° speaker. A particular objective for their project is to create a speaker using as many 3D-printed parts as possible.
“The design is supposed to resemble that of a B&O speaker, and from a handful of categories we chose to create a portable and wearable speaker,” explain Gustav Larsen and his team.
Oliver Repholtz Behrens and team have housed a Raspberry Pi and HiFiBerry HAT inside this this stylish airplay speaker. You can follow their design progress on their team blog.
Tue Thomsen’s six-person team Mechatastic have produced the B&O TILE. “The speaker consists of four 3D-printed cabinet and top parts, where the top should be covered by fabric,” they explain. “The speaker insides consists of laser-cut wood to hold the tweeter and driver and encase the Raspberry Pi.”
The team aimed to design a speaker that would be at home in a kitchen. With a removable upper casing allowing for a choice of colour, the TILE can be customised to fit particular tastes and colour schemes.
Build your own speakers with Raspberry Pis
Raspberry Pi’s onboard audio jack, along with third-party HATs such as the HiFiBerry and Pimoroni Speaker pHAT, make speaker design and fabrication with the Pi an interesting alternative to pre-made tech. These builds don’t tend to be technically complex, and they provide some lovely examples of tech-based projects that reflect makers’ own particular aesthetic style.
If you have access to a 3D printer or a laser cutter, perhaps at a nearby maker space, then those can be excellent resources, but fancy kit isn’t a requirement. Basic joinery and crafting with card or paper are just a couple of ways you can build things that are all your own, using familiar tools and materials. We think more people would enjoy getting hands-on with this sort of thing if they gave it a whirl, and we publish a free magazine to help.
Looking for a new project to build around the Raspberry Pi Zero, I came across the pHAT DAC from Pimoroni. This little add-on board adds audio playback capabilities to the Pi Zero. Because the pHAT uses the GPIO pins, the USB OTG port remains available for a wifi dongle.
This video by Frederick Vandenbosch is a great example of building AirPlay speakers using a Pi and HAT, and a quick search will find you lots more relevant tutorials and ideas.
Have you built your own? Share your speaker-based Pi builds with us in the comments.
If you store sensitive or confidential data in Amazon DynamoDB, you might want to encrypt that data as close as possible to its origin so your data is protected throughout its lifecycle.
You can use the DynamoDB Encryption Client to protect your table data before you send it to DynamoDB. Encrypting your sensitive data in transit and at rest helps assure that your plaintext data isn’t available to any third party, including AWS.
You don’t need to be a cryptography expert to use the DynamoDB Encryption Client. The encryption and signing elements are designed to work with your existing DynamoDB applications. After you create and configure the required components, the DynamoDB Encryption Client transparently encrypts and signs your table items when you call PutItem and verifies and decrypts them when you call GetItem.
You can create your own custom components, or use the basic implementations that are included in the library. We’ve made sure that the classes that we provide implement strong and secure cryptography.
You can use the DynamoDB Encryption Client with AWS Key Management Service (AWS KMS) or AWS CloudHSM, but the library doesn’t require AWS or any AWS service.
The DynamoDB Encryption Client is now available in Python, as well as Java. All supported language implementations are interoperable. For example, you can encrypt table data with the Python library and decrypt it with the Java library.
The DynamoDB Encryption Client is an open-source project. We hope that you will join us in developing the libraries and writing great documentation.
How it works
The DynamoDB Encryption Client processes one table item at a time. First, it encrypts the values (but not the names) of attributes that you specify. Then, it calculates a signature over the attributes that you specify, so you can detect unauthorized changes to the item as a whole, including adding or deleting attributes, or substituting one encrypted value for another.
However, attribute names, and the names and values in the primary key (the partition key and sort key, if one is provided) must remain in plaintext to make the item discoverable. They’re included in the signature by default.
Important: Do not put any sensitive data in the table name, attribute names, the names and values of the primary key attributes, or any attribute values that you tell the client not to encrypt.
How to use it
I’ll demonstrate how to use the DynamoDB Encryption Client in Python with a simple example. I’ll encrypt and sign one table item, and then add it to an existing table. This example uses a test item with arbitrary data, but you can use a similar procedure to protect a table item that contains highly sensitive data, such as a customer’s personal information.
I’ll start by creating a DynamoDB table resource that represents an existing table. If you use the code, be sure to supply a valid table name.
# Create a DynamoDB table
table = boto3.resource('dynamodb').Table(table_name)
Step 2: Create a cryptographic materials provider
Next, create an instance of a cryptographic materials provider (CMP). The CMP is the component that gathers the encryption and signing keys that are used to encrypt and sign your table items. The CMP also determines the encryption algorithms that are used and whether you create unique keys for every item or reuse them.
The DynamoDB Encryption Client includes several CMPs and you can create your own. And, if you’re in doubt, we help you to choose a CMP that fits your application and its security requirements.
In this example, I’ll use the Direct KMS Provider, which gets its cryptographic material from the AWS Key Management Service (AWS KMS). The encryption and signing keys that you use are protected by a customer master key in your AWS account that never leaves AWS KMS unencrypted.
To create a Direct KMS Provider, you specify an AWS KMS customer master key. Be sure to replace the fictitious customer master key ID (the value of aws-cmk-id) in this example with a valid one.
# Create a Direct KMS provider. Pass in a valid KMS customer master key.
aws_cmk_id = '1234abcd-12ab-34cd-56ef-1234567890ab'
aws_kms_cmp = AwsKmsCryptographicMaterialsProvider(key_id=aws_cmk_id)
Step 3: Create an attribute actions object
An attribute actions object tells the DynamoDB Encryption Client which item attribute values to encrypt and which attributes to include in the signature. The options are: ENCRYPT_AND_SIGN, SIGN_ONLY, and DO_NOTHING.
This sample attribute action encrypts and signs all attributes values except for the value of the test attribute; that attribute is neither encrypted nor included in the signature.
# Tell the encrypted table to encrypt and sign all attributes except one.
actions = AttributeActions(
default_action=CryptoAction.ENCRYPT_AND_SIGN,
attribute_actions={
'test': CryptoAction.DO_NOTHING
}
)
If you’re using a helper class, such as the EncryptedTable class that I use in the next step, you can’t specify an attribute action for the primary key. The helper classes make sure that the primary key is signed, but never encrypted (SIGN_ONLY).
Step 4: Create an encrypted table
Now I can use the original table object, along with the materials provider and attribute actions, to create an encrypted table.
# Use these objects to create an encrypted table resource.
encrypted_table = EncryptedTable(
table=table,
materials_provider=aws_kms_cmp,
attribute_actions=actions
)
In this example, I’m using the EncryptedTable helper class, which adds encryption features to the DynamoDB Table class in the AWS SDK for Python (Boto 3). The DynamoDB Encryption Client in Python also includes EncryptedClient and EncryptedResource helper classes.
The DynamoDB Encryption Client helper classes call the DescribeTable operation to find the primary key. The application that runs the code must have permission to call the operation.
We’re done configuring the client. Now, we can encrypt, sign, verify, and decrypt table items.
When we call the PutItem operation, the item is transparently encrypted and signed, except for the primary key, which is signed, but not encrypted, and the test attribute, which is ignored.
encrypted_table.put_item(Item=plaintext_item)
And, when we call the GetItem operation, the item is transparently verified and decrypted.
To view the encrypted item, call the GetItem operation on the original table object, instead of the encrypted_table object. It gets the item from the DynamoDB table without verifying and decrypting it.
Here’s an excerpt of the output that displays the encrypted item:
Figure 1: Output that displays the encrypted item
Client-side or server-side encryption?
The DynamoDB Encryption Client is designed for client-side encryption, where you encrypt your data before you send it to DynamoDB.
But, you have other options. DynamoDB supports encryption at rest, a server-side encryption option that transparently encrypts the data in your table whenever DynamoDB saves the table to disk. You can even use both the DynamoDB Encryption Client and encryption at rest together. The encrypted and signed items that the client generates are standard table items that have binary data in their attribute values. Your choice depends on the sensitivity of your data and the security requirements of your application.
Although the Java and Python versions of the DynamoDB Encryption Client are fully compatible, the DynamoDB Encryption Client isn’t compatible with other client-side encryption libraries, such as the AWS Encryption SDK or the S3 Encryption Client. You can’t encrypt data with one library and decrypt it with another. For data that you store in DynamoDB, we recommend the DynamoDB Encryption Client.
Encryption is crucial
Using tools like the DynamoDB Encryption Client helps you to protect your table data and comply with the security requirements for your application. We hope that you use the client and join us in developing it on GitHub.
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Key Management Service forum or contact AWS Support.
Want more AWS Security news? Follow us on Twitter.
Researchers at Princeton University have released IoT Inspector, a tool that analyzes the security and privacy of IoT devices by examining the data they send across the Internet. They’ve already used the tool to study a bunch of different IoT devices. From their blog post:
Finding #3: Many IoT Devices Contact a Large and Diverse Set of Third Parties
In many cases, consumers expect that their devices contact manufacturers’ servers, but communication with other third-party destinations may not be a behavior that consumers expect.
We have found that many IoT devices communicate with third-party services, of which consumers are typically unaware. We have found many instances of third-party communications in our analyses of IoT device network traffic. Some examples include:
Samsung Smart TV. During the first minute after power-on, the TV talks to Google Play, Double Click, Netflix, FandangoNOW, Spotify, CBS, MSNBC, NFL, Deezer, and Facebookeven though we did not sign in or create accounts with any of them.
Amcrest WiFi Security Camera. The camera actively communicates with cellphonepush.quickddns.com using HTTPS. QuickDDNS is a Dynamic DNS service provider operated by Dahua. Dahua is also a security camera manufacturer, although Amcrest’s website makes no references to Dahua. Amcrest customer service informed us that Dahua was the original equipment manufacturer.
Halo Smoke Detector. The smart smoke detector communicates with broker.xively.com. Xively offers an MQTT service, which allows manufacturers to communicate with their devices.
Geeni Light Bulb. The Geeni smart bulb communicates with gw.tuyaus.com, which is operated by TuYa, a China-based company that also offers an MQTT service.
We also looked at a number of other devices, such as Samsung Smart Camera and TP-Link Smart Plug, and found communications with third parties ranging from NTP pools (time servers) to video storage services.
Their first two findings are that “Many IoT devices lack basic encryption and authentication” and that “User behavior can be inferred from encrypted IoT device traffic.” No surprises there.
Some gaming consoles make it easy to stream to Twitch, some gaming consoles don’t (come on, Nintendo). So for those that don’t, I’ve made this beta version of the “Twitch-O-Matic”. No it doesn’t chop onions or fold your laundry, but what it DOES do is stream anything with HDMI output to your Twitch channel with the simple push of a button!
eSports and online game streaming
Interest in eSports has skyrocketed over the last few years, with viewership numbers in the hundreds of millions, sponsorship deals increasing in value and prestige, and tournament prize funds reaching millions of dollars. So it’s no wonder that more and more gamers are starting to stream live to online platforms in order to boost their fanbase and try to cash in on this growing industry.
Streaming to Twitch
Launched in 2011, Twitch.tv is an online live-streaming platform with a primary focus on video gaming. Users can create accounts to contribute their comments and content to the site, as well as watching live-streamed gaming competitions and broadcasts. With a staggering fifteen million daily users, Twitch is accessible via smartphone and gaming console apps, smart TVs, computers, and tablets. But if you want to stream to Twitch, you may find yourself using third-party software in order to do so. And with more buttons to click and more wires to plug in for older, app-less consoles, streaming can get confusing.
Enter Tinkernut.
Side note: we Tinkernut
We’ve featured Tinkernut a few times on the Raspberry Pi blog – his tutorials are clear, his projects are interesting and useful, and his live-streamed comment videos for every build are a nice touch to sharing homebrew builds on the internet.
So, yes, we love him. [This is true. Alex never shuts up about him. – Ed.] And since he has over 500K subscribers on YouTube, we’re obviously not the only ones. We wave our Tinkernut flags with pride.
Twitch-O-Matic
With a Raspberry Pi Zero W, an HDMI to CSI adapter, and a case to fit it all in, Tinkernut’s Twitch-O-Matic allows easy connection to the Twitch streaming service. You’ll also need a button – the bigger, the better in our opinion, though Tinkernut has opted for the Adafruit 16mm Illuminated Pushbutton for his build, and not the 100mm Massive Arcade Button that, sadly, we still haven’t found a reason to use yet.
“I’m sorry, Dave…”
For added frills and pizzazz, Tinketnut has also incorporated Adafruit’s White LED Backlight Module into the case, though you don’t have to do so unless you’re feeling super fancy.
The setup
The Raspberry Pi Zero W is connected to the HDMI to CSI adapter via the camera connector, in the same way you’d attach the camera ribbon. Tinkernut uses a standard Raspbian image on an 8GB SD card, with SSH enabled for remote access from his laptop. He uses the simple command Raspivid to test the HDMI connection by recording ten seconds of video footage from his console.
One lead is all you need
Once you have the Pi receiving video from your console, you can connect to Twitch using your Twitch stream key, which you can find by logging in to your account at Twitch.tv. Tinkernut’s tutorial gives you all the commands you need to stream from your Pi.
The frills
To up the aesthetic impact of your project, adding buttons and backlights is fairly straightforward.
Pretty LED frills
To run the stream command, Tinketnut uses a button: press once to start the stream, press again to stop. Pressing the button also turns on the LED backlight, so it’s obvious when streaming is in progress.
The tutorial
For the full code and 3D-printable case STL file, head to Tinketnut’s hackster.io project page. And if you’re already using a Raspberry Pi for Twitch streaming, share your build setup with us. Cheers!
Most of the time this blog is dedicated to cloud storage and computer backup topics, but we also want our readers to understand the culture and people at Backblaze who all contribute to keeping our company running and making it an enjoyable place to work. We invited our HR Coordinator, Michele, to talk about how she spends her day searching for great candidates to fill employment positions at Backblaze.
What’s a Typical Day for Michele at Backblaze?
After I’ve had a yummy cup of coffee — maybe with a honey and splash of half and half, I’ll generally start my day reviewing resumes and contacting potential candidates to set up an initial phone screen.
When I start the process of filling a position, I’ll spend a lot of time on the phone speaking with potential candidates. During a phone screen call we’ll chat about their experience, background and what they are ideally looking for in their next position. I also ask about what they like to do outside of work, and most importantly, how they feel about office dogs. A candidate may not always look great on paper, but could turn out to be a great cultural fit after speaking with them about their previous experience and what they’re passionate about.
Next, I push strong candidates to the subsequent steps with the hiring managers, which range from setting up a second phone screen, to setting up a Google hangout for completing coding tasks, to scheduling in-person interviews with the team.
At the end of the day after an in-person interview, I’ll check in with all the interviewers to debrief and decide how to proceed with the candidate. Everyone that interviewed the candidate will get together to give feedback. Is there a good cultural fit? Are they someone we’d like to work with? Keeping in contact with the candidates throughout the process and making sure they are organized and informed is a big part of my job. No one likes to wait around and wonder where they are in the process.
In between all the madness, I’ll put together offer letters, send out onboarding paperwork and links, and get all the necessary signatures to move forward.
On the candidate’s first day, I’ll go over benefits and the handbook and make sure everything is going smoothly in their overall orientation as they transition into their new role here at Backblaze!
What Makes Your Job Exciting?
I get to speak with many different types of people and see what makes them tick and if they’d be a good fit at Backblaze
The fast pace of the job
Being constantly kept busy with different tasks including supporting the FUN committee by researching venues and ideas for family day and the holiday party
I work on enjoyable projects like creating a people wall for new hires so we are able to put a face to the name
Getting to take a mini road trip up to Sacramento each month to check in with the data center employees
Constantly learning more and more about the job, the people, and the company
Oh! We also offer competitive salaries, stock options, and amazing benefits.
Which Job Openings are You Currently Trying to Fill?
We are currently looking for the following positions. If you’re interested, please review the job description on our jobs page and then contact me at jobscontact@backblaze.com.
Last week, we shared the first half of our Q&A with Raspberry Pi Trading CEO and Raspberry Pi creator Eben Upton. Today we follow up with all your other questions, including your expectations for a Raspberry Pi 4, Eben’s dream add-ons, and whether we really could go smaller than the Zero.
Get your questions to us now using #AskRaspberryPi on Twitter
With internet security becoming more necessary, will there be automated versions of VPN on an SD card?
There are already third-party tools which turn your Raspberry Pi into a VPN endpoint. Would we do it ourselves? Like the power button, it’s one of those cases where there are a million things we could do and so it’s more efficient to let the community get on with it.
Just to give a counterexample, while we don’t generally invest in optimising for particular use cases, we did invest a bunch of money into optimising Kodi to run well on Raspberry Pi, because we found that very large numbers of people were using it. So, if we find that we get half a million people a year using a Raspberry Pi as a VPN endpoint, then we’ll probably invest money into optimising it and feature it on the website as we’ve done with Kodi. But I don’t think we’re there today.
Have you ever seen any Pis running and doing important jobs in the wild, and if so, how does it feel?
It’s amazing how often you see them driving displays, for example in radio and TV studios. Of course, it feels great. There’s something wonderful about the geographic spread as well. The Raspberry Pi desktop is quite distinctive, both in its previous incarnation with the grey background and logo, and the current one where we have Greg Annandale’s road picture.
And so it’s funny when you see it in places. Somebody sent me a video of them teaching in a classroom in rural Pakistan and in the background was Greg’s picture.
Raspberry Pi 4!?!
There will be a Raspberry Pi 4, obviously. We get asked about it a lot. I’m sticking to the guidance that I gave people that they shouldn’t expect to see a Raspberry Pi 4 this year. To some extent, the opportunity to do the 3B+ was a surprise: we were surprised that we’ve been able to get 200MHz more clock speed, triple the wireless and wired throughput, and better thermals, and still stick to the $35 price point.
We’re up against the wall from a silicon perspective; we’re at the end of what you can do with the 40nm process. It’s not that you couldn’t clock the processor faster, or put a larger processor which can execute more instructions per clock in there, it’s simply about the energy consumption and the fact that you can’t dissipate the heat. So we’ve got to go to a smaller process node and that’s an order of magnitude more challenging from an engineering perspective. There’s more effort, more risk, more cost, and all of those things are challenging.
With 3B+ out of the way, we’re going to start looking at this now. For the first six months or so we’re going to be figuring out exactly what people want from a Raspberry Pi 4. We’re listening to people’s comments about what they’d like to see in a new Raspberry Pi, and I’m hoping by early autumn we should have an idea of what we want to put in it and a strategy for how we might achieve that.
Could you go smaller than the Zero?
The challenge with Zero as that we’re periphery-limited. If you run your hand around the unit, there is no edge of that board that doesn’t have something there. So the question is: “If you want to go smaller than Zero, what feature are you willing to throw out?”
It’s a single-sided board, so you could certainly halve the PCB area if you fold the circuitry and use both sides, though you’d have to lose something. You could give up some GPIO and go back to 26 pins like the first Raspberry Pi. You could give up the camera connector, you could go to micro HDMI from mini HDMI. You could remove the SD card and just do USB boot. I’m inventing a product live on air! But really, you could get down to two thirds and lose a bunch of GPIO – it’s hard to imagine you could get to half the size.
What’s the one feature that you wish you could outfit on the Raspberry Pi that isn’t cost effective at this time? Your dream feature.
Well, more memory. There are obviously technical reasons why we don’t have more memory on there, but there are also market reasons. People ask “why doesn’t the Raspberry Pi have more memory?”, and my response is typically “go and Google ‘DRAM price’”. We’re used to the price of memory going down. And currently, we’re going through a phase where this has turned around and memory is getting more expensive again.
Machine learning would be interesting. There are machine learning accelerators which would be interesting to put on a piece of hardware. But again, they are not going to be used by everyone, so according to our method of pricing what we might add to a board, machine learning gets treated like a $50 chip. But that would be lovely to do.
Which citizen science projects using the Pi have most caught your attention?
I like the wildlife camera projects. We live out in the countryside in a little village, and we’re conscious of being surrounded by nature but we don’t see a lot of it on a day-to-day basis. So I like the nature cam projects, though, to my everlasting shame, I haven’t set one up yet. There’s a range of them, from very professional products to people taking a Raspberry Pi and a camera and putting them in a plastic box. So those are good fun.
How does it feel to go to bed every day knowing you’ve changed the world for the better in such a massive way?
What feels really good is that when we started this in 2006 nobody else was talking about it, but now we’re part of a very broad movement.
We were in a really bad way: we’d seen a collapse in the number of applicants applying to study Computer Science at Cambridge and elsewhere. In our view, this reflected a move away from seeing technology as ‘a thing you do’ to seeing it as a ‘thing that you have done to you’. It is problematic from the point of view of the economy, industry, and academia, but most importantly it damages the life prospects of individual children, particularly those from disadvantaged backgrounds. The great thing about STEM subjects is that you can’t fake being good at them. There are a lot of industries where your Dad can get you a job based on who he knows and then you can kind of muddle along. But if your dad gets you a job building bridges and you suck at it, after the first or second bridge falls down, then you probably aren’t going to be building bridges anymore. So access to STEM education can be a great driver of social mobility.
By the time we were launching the Raspberry Pi in 2012, there was this wonderful movement going on. Code Club, for example, and CoderDojo came along. Lots of different ways of trying to solve the same problem. What feels really, really good is that we’ve been able to do this as part of an enormous community. And some parts of that community became part of the Raspberry Pi Foundation – we merged with Code Club, we merged with CoderDojo, and we continue to work alongside a lot of these other organisations. So in the two seconds it takes me to fall asleep after my face hits the pillow, that’s what I think about.
We’re currently advertising a Programme Manager role in New Delhi, India. Did you ever think that Raspberry Pi would be advertising a role like this when you were bringing together the Foundation?
No, I didn’t.
But if you told me we were going to be hiring somewhere, India probably would have been top of my list because there’s a massive IT industry in India. When we think about our interaction with emerging markets, India, in a lot of ways, is the poster child for how we would like it to work. There have already been some wonderful deployments of Raspberry Pi, for example in Kerala, without our direct involvement. And we think we’ve got something that’s useful for the Indian market. We have a product, we have clubs, we have teacher training. And we have a body of experience in how to teach people, so we have a physical commercial product as well as a charitable offering that we think are a good fit.
It’s going to be massive.
What is your favourite BBC type-in listing?
There was a game called Codename: Druid. There is a famous game called Codename: Droid which was the sequel to Stryker’s Run, which was an awesome, awesome game. And there was a type-in game called Codename: Druid, which was at the bottom end of what you would consider a commercial game.
And I remember typing that in. And what was really cool about it was that the next month, the guy who wrote it did another article that talks about the memory map and which operating system functions used which bits of memory. So if you weren’t going to do disc access, which bits of memory could you trample on and know the operating system would survive.
I still like type-in listings. The Raspberry Pi 2018 Annual has a type-in listing that I wrote for a Babbage versus Bugs game. I will say that’s not the last type-in listing you will see from me in the next twelve months. And if you download the PDF, you could probably copy and paste it into your favourite text editor to save yourself some time.
Elections serve two purposes. The first, and obvious, purpose is to accurately choose the winner. But the second is equally important: to convince the loser. To the extent that an election system is not transparently and auditably accurate, it fails in that second purpose. Our election systems are failing, and we need to fix them.
Today, we conduct our elections on computers. Our registration lists are in computer databases. We vote on computerized voting machines. And our tabulation and reporting is done on computers. We do this for a lot of good reasons, but a side effect is that elections now have all the insecurities inherent in computers. The only way to reliably protect elections from both malice and accident is to use something that is not hackable or unreliable at scale; the best way to do that is to back up as much of the system as possible with paper.
Recently, there have been two graphic demonstrations of how bad our computerized voting system is. In 2007, the states of California and Ohio conducted audits of their electronic voting machines. Expert review teams found exploitable vulnerabilities in almost every component they examined. The researchers were able to undetectably alter vote tallies, erase audit logs, and load malware on to the systems. Some of their attacks could be implemented by a single individual with no greater access than a normal poll worker; others could be done remotely.
Last year, the Defcon hackers’ conference sponsored a Voting Village. Organizers collected 25 pieces of voting equipment, including voting machines and electronic poll books. By the end of the weekend, conference attendees had found ways to compromise every piece of test equipment: to load malicious software, compromise vote tallies and audit logs, or cause equipment to fail.
It’s important to understand that these were not well-funded nation-state attackers. These were not even academics who had been studying the problem for weeks. These were bored hackers, with no experience with voting machines, playing around between parties one weekend.
It shouldn’t be any surprise that voting equipment, including voting machines, voter registration databases, and vote tabulation systems, are that hackable. They’re computers — often ancient computers running operating systems no longer supported by the manufacturers — and they don’t have any magical security technology that the rest of the industry isn’t privy to. If anything, they’re less secure than the computers we generally use, because their manufacturers hide any flaws behind the proprietary nature of their equipment.
We’re not just worried about altering the vote. Sometimes causing widespread failures, or even just sowing mistrust in the system, is enough. And an election whose results are not trusted or believed is a failed election.
Voting systems have another requirement that makes security even harder to achieve: the requirement for a secret ballot. Because we have to securely separate the election-roll system that determines who can vote from the system that collects and tabulates the votes, we can’t use the security systems available to banking and other high-value applications.
We can securely bank online, but can’t securely vote online. If we could do away with anonymity — if everyone could check that their vote was counted correctly — then it would be easy to secure the vote. But that would lead to other problems. Before the US had the secret ballot, voter coercion and vote-buying were widespread.
We can’t, so we need to accept that our voting systems are insecure. We need an election system that is resilient to the threats. And for many parts of the system, that means paper.
Let’s start with the voter rolls. We know they’ve already been targeted. In 2016, someone changed the party affiliation of hundreds of voters before the Republican primary. That’s just one possibility. A well-executed attack that deletes, for example, one in five voters at random — or changes their addresses — would cause chaos on election day.
Yes, we need to shore up the security of these systems. We need better computer, network, and database security for the various state voter organizations. We also need to better secure the voterregistration websites, with better design and better internet security. We need better security for the companies that build and sell all this equipment.
Multiple, unchangeable backups are essential. A record of every addition, deletion, and change needs to be stored on a separate system, on write-only media like a DVD. Copies of that DVD, or — even better — a paper printout of the voter rolls, should be available at every polling place on election day. We need to be ready for anything.
Next, the voting machines themselves. Security researchers agree that the gold standard is a voter-verified paper ballot. The easiest (and cheapest) way to achieve this is through optical-scan voting. Voters mark paper ballots by hand; they are fed into a machine and counted automatically. That paper ballot is saved, and serves as a final true record in a recount in case of problems. Touch-screen machines that print a paper ballot to drop in a ballot box can also work for voters with disabilities, as long as the ballot can be easily read and verified by the voter.
Finally, the tabulation and reporting systems. Here again we need more security in the process, but we must always use those paper ballots as checks on the computers. A manual, post-election, risk-limiting audit varies the number of ballots examined according to the margin of victory. Conducting this audit after every election, before the results are certified, gives us confidence that the election outcome is correct, even if the voting machines and tabulation computers have been tampered with. Additionally, we need better coordination and communications when incidents occur.
It’s vital to agree on these procedures and policies before an election. Before the fact, when anyone can win and no one knows whose votes might be changed, it’s easy to agree on strong security. But after the vote, someone is the presumptive winner — and then everything changes. Half of the country wants the result to stand, and half wants it reversed. At that point, it’s too late to agree on anything.
The politicians running in the election shouldn’t have to argue their challenges in court. Getting elections right is in the interest of all citizens. Many countries have independent election commissions that are charged with conducting elections and ensuring their security. We don’t do that in the US.
Instead, we have representatives from each of our two parties in the room, keeping an eye on each other. That provided acceptable security against 20th-century threats, but is totally inadequate to secure our elections in the 21st century. And the belief that the diversity of voting systems in the US provides a measure of security is a dangerous myth, because few districts can be decisive and there are so few voting-machine vendors.
We candobetter. In 2017, the Department of Homeland Security declared elections to be critical infrastructure, allowing the department to focus on securing them. On 23 March, Congress allocated $380m to states to upgrade election security.
These are good starts, but don’t go nearly far enough. The constitution delegates elections to the states but allows Congress to “make or alter such Regulations”. In 1845, Congress set a nationwide election day. Today, we need it to set uniform and strict election standards.
Do you have a clear understanding of the hybrid cloud? If you don’t, it’s not surprising.
Hybrid cloud has been applied to a greater and more varied number of IT solutions than almost any other recent data management term. About the only thing that’s clear about the hybrid cloud is that the term hybrid cloud wasn’t invented by customers, but by vendors who wanted to hawk whatever solution du jour they happened to be pushing.
Let’s be honest. We’re in an industry that loves hype. We can’t resist grafting hyper, multi, ultra, and super and other prefixes onto the beginnings of words to entice customers with something new and shiny. The alphabet soup of cloud-related terms can include various options for where the cloud is located (on-premises, off-premises), whether the resources are private or shared in some degree (private, community, public), what type of services are offered (storage, computing), and what type of orchestrating software is used to manage the workflow and the resources. With so many moving parts, it’s no wonder potential users are confused.
Let’s take a step back, try to clear up the misconceptions, and come up with a basic understanding of what the hybrid cloud is. To be clear, this is our viewpoint. Others are free to do what they like, so bear that in mind.
So, What is the Hybrid Cloud?
The hybrid cloud refers to a cloud environment made up of a mixture of on-premises private cloud resources combined with third-party public cloud resources that use some kind of orchestration between them.
To get beyond the hype, let’s start with Forrester Research‘s idea of the hybrid cloud: “One or more public clouds connected to something in my data center. That thing could be a private cloud; that thing could just be traditional data center infrastructure.”
To put it simply, a hybrid cloud is a mash-up of on-premises and off-premises IT resources.
To expand on that a bit, we can say that the hybrid cloud refers to a cloud environment made up of a mixture of on-premises private cloud[1] resources combined with third-party public cloud resources that use some kind of orchestration[2] between them. The advantage of the hybrid cloud model is that it allows workloads and data to move between private and public clouds in a flexible way as demands, needs, and costs change, giving businesses greater flexibility and more options for data deployment and use.
In other words, if you have some IT resources in-house that you are replicating or augmenting with an external vendor, congrats, you have a hybrid cloud!
Private Cloud vs. Public Cloud
The cloud is really just a collection of purpose built servers. In a private cloud, the servers are dedicated to a single tenant or a group of related tenants. In a public cloud, the servers are shared between multiple unrelated tenants (customers). A public cloud is off-site, while a private cloud can be on-site or off-site — or on-prem or off-prem.
As an example, let’s look at a hybrid cloud meant for data storage, a hybrid data cloud. A company might set up a rule that says all accounting files that have not been touched in the last year are automatically moved off-prem to cloud storage to save cost and reduce the amount of storage needed on-site. The files are still available; they are just no longer stored on your local systems. The rules can be defined to fit an organization’s workflow and data retention policies.
The hybrid cloud concept also contains cloud computing. For example, at the end of the quarter, order processing application instances can be spun up off-premises in a hybrid computing cloud as needed to add to on-premises capacity.
Hybrid Cloud Benefits
If we accept that the hybrid cloud combines the best elements of private and public clouds, then the benefits of hybrid cloud solutions are clear, and we can identify the primary two benefits that result from the blending of private and public clouds.
Benefit 1: Flexibility and Scalability
Undoubtedly, the primary advantage of the hybrid cloud is its flexibility. It takes time and money to manage in-house IT infrastructure and adding capacity requires advance planning.
The cloud is ready and able to provide IT resources whenever needed on short notice. The term cloud bursting refers to the on-demand and temporary use of the public cloud when demand exceeds resources available in the private cloud. For example, some businesses experience seasonal spikes that can put an extra burden on private clouds. These spikes can be taken up by a public cloud. Demand also can vary with geographic location, events, or other variables. The public cloud provides the elasticity to deal with these and other anticipated and unanticipated IT loads. The alternative would be fixed cost investments in on-premises IT resources that might not be efficiently utilized.
For a data storage user, the on-premises private cloud storage provides, among other benefits, the highest speed access. For data that is not frequently accessed, or needed with the absolute lowest levels of latency, it makes sense for the organization to move it to a location that is secure, but less expensive. The data is still readily available, and the public cloud provides a better platform for sharing the data with specific clients, users, or with the general public.
Benefit 2: Cost Savings
The public cloud component of the hybrid cloud provides cost-effective IT resources without incurring capital expenses and labor costs. IT professionals can determine the best configuration, service provider, and location for each service, thereby cutting costs by matching the resource with the task best suited to it. Services can be easily scaled, redeployed, or reduced when necessary, saving costs through increased efficiency and avoiding unnecessary expenses.
Comparing Private vs Hybrid Cloud Storage Costs
To get an idea of the difference in storage costs between a purely on-premises solutions and one that uses a hybrid of private and public storage, we’ll present two scenarios. For each scenario we’ll use data storage amounts of 100 terabytes, 1 petabyte, and 2 petabytes. Each table is the same format, all we’ve done is change how the data is distributed: private (on-premises) cloud or public (off-premises) cloud. We are using the costs for our own B2 Cloud Storage in this example. The math can be adapted for any set of numbers you wish to use.
Scenario 1100% of data on-premises storage
Data Stored
Data stored On-Premises: 100%
100 TB
1,000 TB
2,000 TB
On-premises cost range
Monthly Cost
Low — $12/TB/Month
$1,200
$12,000
$24,000
High — $20/TB/Month
$2,000
$20,000
$40,000
Scenario 220% of data on-premises with 80% public cloud storage (B2)
Data Stored
Data stored On-Premises: 20%
20 TB
200 TB
400 TB
Data stored in Cloud: 80%
80 TB
800 TB
1,600 TB
On-premises cost range
Monthly Cost
Low — $12/TB/Month
$240
$2,400
$4,800
High — $20/TB/Month
$400
$4,000
$8,000
Public cloud cost range
Monthly Cost
Low — $5/TB/Month (B2)
$400
$4,000
$8,000
High — $20/TB/Month
$1,600
$16,000
$32,000
On-premises + public cloud cost range
Monthly Cost
Low
$640
$6,400
$12,800
High
$2,000
$20,000
$40,000
As can be seen in the numbers above, using a hybrid cloud solution and storing 80% of the data in the cloud with a provider such as Backblaze B2 can result in significant savings over storing only on-premises. For other cost scenarios, see the B2 Cost Calculator.
When Hybrid Might Not Always Be the Right Fit
There are circumstances where the hybrid cloud might not be the best solution. Smaller organizations operating on a tight IT budget might best be served by a purely public cloud solution. The cost of setting up and running private servers is substantial.
An application that requires the highest possible speed might not be suitable for hybrid, depending on the specific cloud implementation. While latency does play a factor in data storage for some users, it is less of a factor for uploading and downloading data than it is for organizations using the hybrid cloud for computing. Because Backblaze recognized the importance of speed and low-latency for customers wishing to use computing on data stored in B2, we directly connected our data centers with those of our computing partners, ensuring that latency would not be an issue even for a hybrid cloud computing solution.
It is essential to have a good understanding of workloads and their essential characteristics in order to make the hybrid cloud work well for you. Each application needs to be examined for the right mix of private cloud, public cloud, and traditional IT resources that fit the particular workload in order to benefit most from a hybrid cloud architecture.
The Hybrid Cloud Can Be a Win-Win Solution
From the high altitude perspective, any solution that enables an organization to respond in a flexible manner to IT demands is a win. Avoiding big upfront capital expenses for in-house IT infrastructure will appeal to the CFO. Being able to quickly spin up IT resources as they’re needed will appeal to the CTO and VP of Operations.
Should You Go Hybrid?
We’ve arrived at the bottom line and the question is, should you or your organization embrace hybrid cloud infrastructures?
According to 451 Research, by 2019, 69% of companies will operate in hybrid cloud environments, and 60% of workloads will be running in some form of hosted cloud service (up from 45% in 2017). That indicates that the benefits of the hybrid cloud appeal to a broad range of companies.
Clearly, depending on an organization’s needs, there are advantages to a hybrid solution. While it might have been possible to dismiss the hybrid cloud in the early days of the cloud as nothing more than a buzzword, that’s no longer true. The hybrid cloud has evolved beyond the marketing hype to offer real solutions for an increasingly complex and challenging IT environment.
If an organization approaches the hybrid cloud with sufficient planning and a structured approach, a hybrid cloud can deliver on-demand flexibility, empower legacy systems and applications with new capabilities, and become a catalyst for digital transformation. The result can be an elastic and responsive infrastructure that has the ability to quickly respond to changing demands of the business.
As data management professionals increasingly recognize the advantages of the hybrid cloud, we can expect more and more of them to embrace it as an essential part of their IT strategy.
Tell Us What You’re Doing with the Hybrid Cloud
Are you currently embracing the hybrid cloud, or are you still uncertain or hanging back because you’re satisfied with how things are currently? Maybe you’ve gone totally hybrid. We’d love to hear your comments below on how you’re dealing with the hybrid cloud.
[1] Private cloud can be on-premises or a dedicated off-premises facility.
[2] Hybrid cloud orchestration solutions are often proprietary, vertical, and task dependent.
As part of my current project (secure audit trail) I decided to make a survey about the use of audit trail “in the wild”.
I haven’t written in details about this project of mine (unlike with someotherprojects). Mostly because it’s commercial and I don’t want to use my blog as a direct promotion channel (though I am doing that at the moment, ironically). But the aim of this post is to shed some light on how audit trail is used.
The survey can be found here. The questions are basically: does your current project have audit trail functionality, and if yes, is it protected from tampering. If not – do you think you should have such functionality.
The results are interesting (although with only around 50 respondents)
So more than half of the systems (on which respondents are working) don’t have audit trail. While audit trail is recommended by information security and related standards, it may not find place in the “busy schedule” of a software project, even though it’s fairly easy to provide a trivial implementation (e.g. I’ve written how to quickly setup one with Hibernate and Spring)
A trivial implementation might do in many cases but if the audit log is critical (e.g. access to sensitive data, performing financial operations etc.), then relying on a trivial implementation might not be enough. In other words – if the sysadmin can access the database and delete or modify the audit trail, then it doesn’t serve much purpose. Hence the next question – how is the audit trail protected from tampering:
And apparently, from the less than 50% of projects with audit trail, around 50% don’t have technical guarantees that the audit trail can’t be tampered with. My guess is it’s more, because people have different understanding of what technical measures are sufficient. E.g. someone may think that digitally signing your log files (or log records) is sufficient, but in fact it isn’t, as whole files (or records) can be deleted (or fully replaced) without a way to detect that. Timestamping can help (and a good audit trail solution should have that), but it doesn’t guarantee the order of events or prevent a malicious actor from deleting or inserting fake ones. And if timestamping is done on a log file level, then any not-yet-timestamped log file is vulnerable to manipulation.
I’ve written about event logs before and their two flavours – event sourcing and audit trail. An event log can effectively be considered audit trail, but you’d need additional security to avoid the problems mentioned above.
So, let’s see what would various levels of security and usefulness of audit logs look like. There are many papers on the topic (e.g. this and this), and they often go into the intricate details of how logging should be implemented. I’ll try to give an overview of the approaches:
Regular logs – rely on regular INFO log statements in the production logs to look for hints of what has happened. This may be okay, but is harder to look for evidence (as there is non-auditable data in those log files as well), and it’s not very secure – usually logs are collected (e.g. with graylog) and whoever has access to the log collector’s database (or search engine in the case of Graylog), can manipulate the data and not be caught
Designated audit trail – whether it’s stored in the database or in logs files. It has the proper business-event level granularity, but again doesn’t prevent or detect tampering. With lower risk systems that may is perfectly okay.
Timestamped logs – whether it’s log files or (harder to implement) database records. Timestamping is good, but if it’s not an external service, a malicious actor can get access to the local timestamping service and issue fake timestamps to either re-timestamp tampered files. Even if the timestamping is not compromised, whole entries can be deleted. The fact that they are missing can sometimes be deduced based on other factors (e.g. hour of rotation), but regularly verifying that is extra effort and may not always be feasible.
Hash chaining – each entry (or sequence of log files) could be chained (just as blockchain transactions) – the next one having the hash of the previous one. This is a good solution (whether it’s local, external or 3rd party), but it has the risk of someone modifying or deleting a record, getting your entire chain and re-hashing it. All the checks will pass, but the data will not be correct
Hash chaining with anchoring – the head of the chain (the hash of the last entry/block) could be “anchored” to an external service that is outside the capabilities of a malicious actor. Ideally, a public blockchain, alternatively – paper, a public service (twitter), email, etc. That way a malicious actor can’t just rehash the whole chain, because any check against the external service would fail.
WORM storage (write once, ready many). You could send your audit logs almost directly to WORM storage, where it’s impossible to replace data. However, that is not ideal, as WORM storage can be slow and expensive. For example AWS Glacier has rather big retrieval times and searching through recent data makes it impractical. It’s actually cheaper than S3, for example, and you can have expiration policies. But having to support your own WORM storage is expensive. It is a good idea to eventually send the logs to WORM storage, but “fresh” audit trail should probably not be “archived” so that it’s searchable and some actionable insight can be gained from it.
All-in-one – applying all of the above “just in case” may be unnecessary for every project out there, but that’s what I decided to do at LogSentinel. Business-event granularity with timestamping, hash chaining, anchoring, and eventually putting to WORM storage – I think that provides both security guarantees and flexibility.
I hope the overview is useful and the results from the survey shed some light on how this aspect of information security is underestimated.
Abstract: We present a scalable dynamic analysis framework that allows for the automatic evaluation of the privacy behaviors of Android apps. We use our system to analyze mobile apps’ compliance with the Children’s Online Privacy Protection Act (COPPA), one of the few stringent privacy laws in the U.S. Based on our automated analysis of 5,855 of the most popular free children’s apps, we found that a majority are potentially in violation of COPPA, mainly due to their use of third-party SDKs. While many of these SDKs offer configuration options to respect COPPA by disabling tracking and behavioral advertising, our data suggest that a majority of apps either do not make use of these options or incorrectly propagate them across mediation SDKs. Worse, we observed that 19% of children’s apps collect identifiers or other personally identifiable information (PII) via SDKs whose terms of service outright prohibit their use in child-directed apps. Finally, we show that efforts by Google to limit tracking through the use of a resettable advertising ID have had little success: of the 3,454 apps that share the resettable ID with advertisers, 66% transmit other, non-resettable, persistent identifiers as well, negating any intended privacy-preserving properties of the advertising ID.
The collective thoughts of the interwebz
By continuing to use the site, you agree to the use of cookies. more information
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.