Earlier this month we launched the C5 Instances with Local NVMe Storage and I told you that we would be doing the same for additional instance types in the near future!
Today we are introducing M5 instances equipped with local NVMe storage. Available for immediate use in 5 regions, these instances are a great fit for workloads that require a balance of compute and memory resources. Here are the specs:
Instance Name | vCPUs | RAM | Local Storage | EBS-Optimized Bandwidth | Network Bandwidth
--- | --- | --- | --- | --- | ---
m5d.large | 2 | 8 GiB | 1 x 75 GB NVMe SSD | Up to 2.120 Gbps | Up to 10 Gbps
m5d.xlarge | 4 | 16 GiB | 1 x 150 GB NVMe SSD | Up to 2.120 Gbps | Up to 10 Gbps
m5d.2xlarge | 8 | 32 GiB | 1 x 300 GB NVMe SSD | Up to 2.120 Gbps | Up to 10 Gbps
m5d.4xlarge | 16 | 64 GiB | 1 x 600 GB NVMe SSD | 2.210 Gbps | Up to 10 Gbps
m5d.12xlarge | 48 | 192 GiB | 2 x 900 GB NVMe SSD | 5.0 Gbps | 10 Gbps
m5d.24xlarge | 96 | 384 GiB | 4 x 900 GB NVMe SSD | 10.0 Gbps | 25 Gbps
The M5d instances are powered by Custom Intel® Xeon® Platinum 8175M series processors running at 2.5 GHz, including support for AVX-512.
You can use any AMI that includes drivers for the Elastic Network Adapter (ENA) and NVMe; this includes the latest Amazon Linux, Microsoft Windows (Server 2008 R2, Server 2012, Server 2012 R2 and Server 2016), Ubuntu, RHEL, SUSE, and CentOS AMIs.
Here are a couple of things to keep in mind about the local NVMe storage on the M5d instances:
Naming – You don’t have to specify a block device mapping in your AMI or during the instance launch; the local storage will show up as one or more devices (/dev/nvme*1 on Linux) after the guest operating system has booted.
Encryption – Each local NVMe device is hardware encrypted using the XTS-AES-256 block cipher and a unique key. Each key is destroyed when the instance is stopped or terminated.
Lifetime – Local NVMe devices have the same lifetime as the instance they are attached to, and do not stick around after the instance has been stopped or terminated.
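Building on the naming note above, putting the local volume to use on Linux looks roughly like the following sketch; the device name, filesystem, and mount point are assumptions (the instance-store device is often /dev/nvme1n1 while the EBS root volume is /dev/nvme0n1), so confirm with lsblk first:

lsblk                                   # identify the instance-store NVMe device
sudo mkfs -t ext4 /dev/nvme1n1          # hypothetical device name; confirm with lsblk before formatting
sudo mkdir -p /scratch
sudo mount /dev/nvme1n1 /scratch        # remember: contents do not survive a stop or terminate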
Available Now
M5d instances are available in On-Demand, Reserved Instance, and Spot form in the US East (N. Virginia), US West (Oregon), EU (Ireland), US East (Ohio), and Canada (Central) Regions. Prices vary by Region, and are just a bit higher than for the equivalent M5 instances.
We have two new resources to help customers address their data protection requirements in Argentina. These resources specifically address the needs outlined under the Personal Data Protection Law No. 25.326, as supplemented by Regulatory Decree No. 1558/2001 (“PDPL”), including Disposition No. 11/2006. For context, the PDPL is an Argentine federal law that applies to the protection of personal data, including during transfer and processing.
A new webpage focused on data privacy in Argentina features FAQs, helpful links, and whitepapers that provide an overview of PDPL considerations, as well as our security assurance frameworks and international certifications, including ISO 27001, ISO 27017, and ISO 27018. You’ll also find details about our Information Request Report and the high bar of security at AWS data centers.
Additionally, we’ve released a new workbook that offers a detailed mapping as to how customers can operate securely under the Shared Responsibility Model while also aligning with Disposition No. 11/2006. The AWS Disposition 11/2006 Workbook can be downloaded from the Argentina Data Privacy page or directly from this link. Both resources are also available in Spanish from the Privacidad de los datos en Argentina page.
Want more AWS Security news? Follow us on Twitter.
Linus has released the 4.17 kernel, which will indeed be called “4.17”. “No, I didn’t call it 5.0, even though all the git object count numerology was in place for that. It will happen in the not _too_ distant future, and I’m told all the release scripts on kernel.org are ready for it, but I didn’t feel there was any real reason for it.”
Headline features in this release include improved load estimation in the CPU scheduler, raw BPF tracepoints, lazytime support in the XFS filesystem, full in-kernel TLS protocol support, histogram triggers for tracing, mitigations for the latest Spectre variants, and, of course, the removal of support for eight unloved processor architectures.
There has been a lot of work in recent years to use BPF to push policy decisions into the kernel. But sometimes, it seems, what is really wanted is a way for a BPF program to punt a decision back to user space. That is the objective behind this patch set giving the secure computing (seccomp) mechanism a way to pass complex decisions to a user-space helper program.
In February, the bpfilter mechanism was first posted to the mailing lists. Bpfilter is meant to be a replacement for the current in-kernel firewall/packet-filtering code. It provides little functionality itself; instead, it creates a set of hooks that can run BPF programs to make the packet-filtering decisions. A version of that patch set has been merged into the net-next tree for 4.18. It will not be replacing any existing packet filters in its current form, but it does feature a significant change to one of its more controversial features: the new user-mode helper mechanism.
SHB is a small invitational gathering of people studying various aspects of the human side of security, organized each year by Alessandro Acquisti, Ross Anderson, and myself. The 50 or so people in the room include psychologists, economists, computer security researchers, sociologists, political scientists, neuroscientists, designers, lawyers, philosophers, anthropologists, business school professors, and a smattering of others. It’s not just an interdisciplinary event; most of the people here are individually interdisciplinary.
The goal is to maximize discussion and interaction. We do that by putting everyone on panels, and limiting talks to 7-10 minutes. The rest of the time is left to open discussion. Four hour-and-a-half panels per day over two days equals eight panels; six people per panel means that 48 people get to speak. We also have lunches, dinners, and receptions — all designed so people from different disciplines talk to each other.
I invariably find this to be the most intellectually stimulating conference of my year. It influences my thinking in many different, and sometimes surprising, ways.
Here are my posts on the first, second, third, fourth, fifth, sixth, seventh, eighth, ninth, and tenth SHB workshops. Follow those links to find summaries, papers, and occasionally audio recordings of the various workshops.
When I talk with customers and partners, I find that they are in different stages in the adoption of DevOps methodologies. They are automating the creation of application artifacts and the deployment of their applications to different infrastructure environments. In many cases, they are creating and supporting multiple applications using a variety of coding languages and artifacts.
The management of these processes and artifacts can be challenging, but using the right tools and methodologies can simplify the process.
In this post, I will show you how you can automate the creation and storage of application artifacts through the implementation of a pipeline and custom deploy action in AWS CodePipeline. The example includes a Node.js code base stored in an AWS CodeCommit repository. A Node Package Manager (npm) artifact is built from the code base, and the build artifact is published to a JFrog Artifactory npm repository.
I frequently recommend AWS CodePipeline, the AWS continuous integration and continuous delivery tool. You can use it to quickly innovate through integration and deployment of new features and bug fixes by building a workflow that automates the build, test, and deployment of new versions of your application. And, because AWS CodePipeline is extensible, it allows you to create a custom action that performs customized, automated actions on your behalf.
JFrog’s Artifactory is a universal binary repository manager where you can manage multiple applications, their dependencies, and versions in one place. Artifactory also enables you to standardize the way you manage your package types across all applications developed in your company, no matter the code base or artifact type.
If you already have a Node.js CodeCommit repository and a JFrog Artifactory host, and would like to automate the creation of the pipeline, including the custom action and CodeBuild project, you can use this AWS CloudFormation template to create your AWS CloudFormation stack.
This figure shows the path defined in the pipeline for this project. It starts with a change to Node.js source code committed to a private code repository in AWS CodeCommit. With this change, CodePipeline triggers AWS CodeBuild to create the npm package from the node.js source code. After the build, CodePipeline triggers the custom action job worker to commit the build artifact to the designated artifact repository in Artifactory.
This blog post assumes you have already:
· Created a CodeCommit repository that contains a Node.js project.
· Configured a two-stage pipeline in AWS CodePipeline.
The Source stage of the pipeline is configured to poll the Node.js CodeCommit repository. The Build stage is configured to use a CodeBuild project to build the npm package using a buildspec.yml file located in the code repository.
If you do not have a Node.js repository, you can create a CodeCommit repository that contains this simple ‘Hello World’ project. This project also includes a buildspec.yml file that is used when you define your CodeBuild project. It defines the steps to be taken by CodeBuild to create the npm artifact.
If you do not already have a pipeline set up in CodePipeline, you can use this template to create a pipeline with a CodeCommit source action and a CodeBuild build action through the AWS Command Line Interface (AWS CLI). If you do not want to install the AWS CLI on your local machine, you can use AWS Cloud9, our managed integrated development environment (IDE), to interact with AWS APIs.
In your development environment, open your favorite editor and fill out the template with values appropriate to your project. For information, see the readme in the GitHub repository.
Use this CLI command to create the pipeline from the template:
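The exact command isn't reproduced in this excerpt, but it is presumably the standard create-pipeline call; a minimal sketch, assuming the filled-out template is saved as pipeline.json (a file name chosen for illustration):

aws codepipeline create-pipeline --cli-input-json file://pipeline.json

You can confirm the result afterwards with aws codepipeline get-pipeline --name <pipeline-name>.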
It creates a pipeline that has a CodeCommit source action and a CodeBuild build action.
Integrating JFrog Artifactory
JFrog Artifactory provides default repositories for your project needs. For my NPM package repository, I am using the default virtual npm repository (named npm) that is available in Artifactory Pro. You might want to consider creating a repository per project, but for the example used in this post, using the default lets me get started without having to configure a new repository.
I can use the steps in the Set Me Up -> npm section on the landing page to configure my worker to interact with the default NPM repository.
The custom action definition describes the required values needed to run the custom action. I will define my custom action in the ‘Deploy’ category, identify the provider as ‘Artifactory’, of version ‘1’, and specify a variety of configurationProperties whose values will be defined when this stage is added to my pipeline.
The custom job worker polls CodePipeline for a job, scanning for its action-definition properties. After a job has been found, the job worker does the work required to publish the npm artifact to the Artifactory repository.
{
"category": "Deploy",
"configurationProperties": [{
"name": "TypeOfArtifact",
"required": true,
"key": true,
"secret": false,
"description": "Package type, ex. npm for node packages",
"type": "String"
},
{ "name": "RepoKey",
"required": true,
"key": true,
"secret": false,
"type": "String",
"description": "Name of the repository in which this artifact should be stored"
},
{ "name": "UserName",
"required": true,
"key": true,
"secret": false,
"type": "String",
"description": "Username for authenticating with the repository"
},
{ "name": "Password",
"required": true,
"key": true,
"secret": true,
"type": "String",
"description": "Password for authenticating with the repository"
},
{ "name": "EmailAddress",
"required": true,
"key": true,
"secret": false,
"type": "String",
"description": "Email address used to authenticate with the repository"
},
{ "name": "ArtifactoryHost",
"required": true,
"key": true,
"secret": false,
"type": "String",
"description": "Public address of Artifactory host, ex: https://myexamplehost.com or http://myexamplehost.com:8080"
}],
"provider": "Artifactory",
"version": "1",
"settings": {
"entityUrlTemplate": "{Config:ArtifactoryHost}/artifactory/webapp/#/artifacts/browse/tree/General/{Config:RepoKey}"
},
"inputArtifactDetails": {
"maximumCount": 5,
"minimumCount": 1
},
"outputArtifactDetails": {
"maximumCount": 5,
"minimumCount": 0
}
}
There are seven sections to the custom action definition:
category: This is the stage in which you will be creating this action. It can be Source, Build, Deploy, Test, Invoke, or Approval. Except for source actions, the category section simply allows us to organize our actions. I am setting the category for my action as ‘Deploy’ because I’m using it to publish my node artifact to my Artifactory instance.
configurationProperties: These are the parameters or variables required for your project to authenticate and commit your artifact. In the case of my custom worker, I need:
TypeOfArtifact: In this case, npm, because it’s for the Node Package Manager.
RepoKey: The name of the repository. In this case, it’s the default npm.
UserName and Password for the user to authenticate with the Artifactory repository.
EmailAddress used to authenticate with the repository.
ArtifactoryHost: The name or IP address of the Artifactory host.
provider: The name you define for your custom action stage. I have named the provider Artifactory.
version: Version number for the custom action. Because this is the first version, I set the version number to 1.
entityUrlTemplate: This URL is presented to your users for the deploy stage along with the title you define in your provider. The link takes the user to their artifact repository page in the Artifactory host.
inputArtifactDetails: The number of artifacts to expect from the previous stage in the pipeline.
outputArtifactDetails: The number of artifacts that should be the result from the custom action stage. Later in this blog post, I define 0 for my output artifacts because I am publishing the artifact to the Artifactory repository as the final action.
After I define the custom action in a JSON file, I use the AWS CLI to create the custom action type in CodePipeline:
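Assuming the definition above is saved as artifactory_custom_action.json (a file name chosen here for illustration), the call looks like this:

aws codepipeline create-custom-action-type --cli-input-json file://artifactory_custom_action.json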
After I create the custom action type in the same region as my pipeline, I edit the pipeline to add a Deploy stage and configure it to use the custom action I created for Artifactory:
I have created a custom worker for the actions required to commit the npm artifact to the Artifactory repository. The worker is in Python and it runs in a loop on an Amazon EC2 instance. My custom worker polls for a deploy job and publishes the NPM artifact to the Artifactory repository.
The EC2 instance is running Amazon Linux and has an IAM instance role attached that gives the worker permission to access CodePipeline. The worker process is as follows:
Take the configuration properties from the custom worker and poll CodePipeline for a custom action job.
After there is a job in the job queue with the appropriate category, provider, and version, acknowledge the job.
Download the zipped artifact created in the previous Build stage from the provided S3 buckets with the provided temporary credentials.
Unzip the artifact into a temporary directory.
Use the user-defined Artifactory user name and password to obtain a temporary API key from Artifactory.
To avoid having to write the password to a file, use that temporary API key and user name to authenticate with the NPM repository.
Publish the Node.js package to the specified repository.
Because I am running my custom worker on an Amazon Linux EC2 instance, I installed npm with the following command:
sudo yum install nodejs npm --enablerepo=epel
For my custom worker, I used pip to install the required Python libraries:
pip install boto3 requests
For a full Python package list, see requirements.txt in the GitHub repository.
Let’s take a look at some of the code snippets from the worker.
First, the worker polls for jobs:
# Snippets from npm_job_worker.py. They assume the usual imports (boto3, botocore,
# time, os, tempfile, zipfile, shutil, subprocess, requests) and a CodePipeline
# client created at module level, e.g. codepipeline = boto3.client('codepipeline').
def action_type():
    ActionType = {
        'category': 'Deploy',
        'owner': 'Custom',
        'provider': 'Artifactory',
        'version': '1'}
    return(ActionType)

def poll_for_jobs():
    try:
        artifactory_action_type = action_type()
        print(artifactory_action_type)
        jobs = codepipeline.poll_for_jobs(actionTypeId=artifactory_action_type)
        while not jobs['jobs']:
            time.sleep(10)
            jobs = codepipeline.poll_for_jobs(actionTypeId=artifactory_action_type)
        if jobs['jobs']:
            print('Job found')
            return jobs['jobs'][0]
    except ClientError as e:
        print("Received an error: %s" % str(e))
        raise
When there is a job in the queue, the poller returns a number of values from the queue such as jobId, the input and output S3 buckets for artifacts, temporary credentials to access the S3 buckets, and other configuration details from the stage in the pipeline.
After successfully receiving the job details, the worker sends an acknowledgement to CodePipeline to ensure that the work on the job is not duplicated by other workers watching for the same job:
def job_acknowledge(jobId, nonce):
    try:
        print('Acknowledging job')
        result = codepipeline.acknowledge_job(jobId=jobId, nonce=nonce)
        return result
    except Exception as e:
        print("Received an error when trying to acknowledge the job: %s" % str(e))
        raise
With the job now acknowledged, the worker publishes the source code artifact into the desired repository. The worker gets the value of the artifact S3 bucket and objectKey from the inputArtifacts in the response from the poll_for_jobs API request. Next, the worker creates a new directory in /tmp and downloads the S3 object into this directory:
def get_bucket_location(bucketName, init_client):
    region = init_client.get_bucket_location(Bucket=bucketName)['LocationConstraint']
    if not region:
        region = 'us-east-1'
    return region

def get_s3_artifact(bucketName, objectKey, ak, sk, st):
    init_s3 = boto3.client('s3')
    region = get_bucket_location(bucketName, init_s3)
    session = Session(aws_access_key_id=ak,
                      aws_secret_access_key=sk,
                      aws_session_token=st)
    s3 = session.resource('s3',
                          region_name=region,
                          config=botocore.client.Config(signature_version='s3v4'))
    try:
        tempdirname = tempfile.mkdtemp()
    except OSError as e:
        print('Could not write temp directory %s' % tempdirname)
        raise
    bucket = s3.Bucket(bucketName)
    obj = bucket.Object(objectKey)
    filename = tempdirname + '/' + objectKey
    try:
        if os.path.dirname(objectKey):
            directory = os.path.dirname(filename)
            os.makedirs(directory)
        print('Downloading the %s object and writing it to disk in %s location' % (objectKey, tempdirname))
        with open(filename, 'wb') as data:
            obj.download_fileobj(data)
    except ClientError as e:
        print('Downloading the object and writing the file to disk raised this error: ' + str(e))
        raise
    return(filename, tempdirname)
Because the downloaded artifact from S3 is a zip file, the worker must unzip it first. To have a clean area in which to work, I extract the downloaded zip archive into a new directory:
def unzip_codepipeline_artifact(artifact, origtmpdir):
    # Create a new temp directory
    # Unzip artifact into new directory
    try:
        newtempdir = tempfile.mkdtemp()
        print('Extracting artifact %s into temporary directory %s' % (artifact, newtempdir))
        zip_ref = zipfile.ZipFile(artifact, 'r')
        zip_ref.extractall(newtempdir)
        zip_ref.close()
        shutil.rmtree(origtmpdir)
        return(os.listdir(newtempdir), newtempdir)
    except OSError as e:
        if e.errno != errno.EEXIST:
            shutil.rmtree(newtempdir)
            raise
The worker now has the npm package that I want to store in my Artifactory NPM repository.
To authenticate with the NPM repository, the worker requests a temporary token from the Artifactory host. After receiving this temporary token, it creates a .npmrc file in the worker user’s home directory that includes a hash of the user name and temporary token. After it has authenticated, the worker runs npm config set registry <URL OF REPOSITORY> to configure the npm registry value to be the Artifactory host. Next, the worker runs npm publish --registry <URL OF REPOSITORY>, which publishes the node package to the NPM repository in the Artifactory host.
def push_to_npm(configuration, artifact_list, temp_dir, jobId):
    reponame = configuration['RepoKey']
    art_type = configuration['TypeOfArtifact']
    print("Putting artifact into NPM repository " + reponame)
    token, hostname, username = gen_artifactory_auth_token(configuration)
    npmconfigfile = create_npmconfig_file(configuration, username, token)
    url = hostname + '/artifactory/api/' + art_type + '/' + reponame
    print("Changing directory to " + str(temp_dir))
    os.chdir(temp_dir)
    try:
        print("Publishing following files to the repository: %s " % os.listdir(temp_dir))
        print("Sending artifact to Artifactory NPM registry URL: " + url)
        subprocess.call(["npm", "config", "set", "registry", url])
        req = subprocess.call(["npm", "publish", "--registry", url])
        print("Return code from npm publish: " + str(req))
        if req != 0:
            err_msg = "npm ERR! Received non-OK response while sending response to Artifactory. Return code from npm publish: " + str(req)
            signal_failure(jobId, err_msg)
        else:
            signal_success(jobId)
    except requests.exceptions.RequestException as e:
        print("Received an error when trying to commit artifact %s to repository %s: %s" % (str(art_type), str(configuration['RepoKey']), str(e)))
        raise
    return(req, npmconfigfile)
If the return value from publishing to the repository is not 0, the worker signals a failure to CodePipeline. If the value is 0, the worker signals success to CodePipeline to indicate that the stage of the pipeline has been completed successfully.
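The signal_success and signal_failure helpers aren't shown in this excerpt; they boil down to CodePipeline's PutJobSuccessResult and PutJobFailureResult APIs. For reference, the equivalent AWS CLI calls look roughly like this (the job ID and failure message are placeholders):

aws codepipeline put-job-success-result --job-id <job-id>
aws codepipeline put-job-failure-result --job-id <job-id> --failure-details type=JobFailed,message="npm publish returned a non-zero exit code"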
For the custom worker code, see npm_job_worker.py in the GitHub repository.
I run my custom worker on an EC2 instance using the command python npm_job_worker.py, with an optional --version flag that can be used to specify worker versions other than 1. Then I trigger a release change in my pipeline:
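The release can be triggered from the console or, as a sketch, from the CLI (the pipeline name is a placeholder):

aws codepipeline start-pipeline-execution --name <my-pipeline>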
From my custom worker output logs, I have just committed a package named node_example at version 1.0.3:
On artifact: index.js
Committing to the repo: https://artifactory.myexamplehost.com/artifactory/api/npm/npm
Sending artifact to Artifactory URL: https://artifactoryhost.myexamplehost.com/artifactory/api/npm/npm
npm config: 0
npm http PUT https://artifactory.myexamplehost.com/artifactory/api/npm/npm/node_example
npm http 201 https://artifactory.myexamplehost.com/artifactory/api/npm/npm/node_example
+ [email protected]
Return code from npm publish: 0
Signaling success to CodePipeline
After that has been built successfully, I can find my artifact in my Artifactory repository:
To help you automate this process, I have created this AWS CloudFormation template that automates the creation of the CodeBuild project, the custom action, and the CodePipeline pipeline. It also launches the Amazon EC2-based custom job worker in an AWS Auto Scaling group. This template requires you to have a VPC and CodeCommit repository for your Node.js project. If you do not currently have a VPC in which you want to run your custom worker EC2 instances, you can use this AWS QuickStart to create one. If you do not have an existing Node.js project, I’ve provided a sample project in the GitHub repository.
Conclusion
I’ve shown you the steps to integrate your JFrog Artifactory repository with your CodePipeline workflow: how to create a custom action in CodePipeline and how to create a custom worker that runs in your CI/CD pipeline. To dig deeper into custom actions and see how you can integrate your Artifactory repositories into your AWS CodePipeline projects, check out the full code base on GitHub.
If you have any questions or feedback, feel free to reach out to us through the AWS CodePipeline forum.
Erin McGill is a Solutions Architect in the AWS Partner Program with a focus on DevOps and automation tooling.
The CFP will close on July 30th. Notification of acceptance and non-acceptance will go out within 7 days of the closing of the CFP.
All topics relevant to foundational open-source Linux technologies are welcome. In particular, however, we are looking for proposals including, but not limited to, the following topics:
Low-level container executors and infrastructure
IoT and embedded OS infrastructure
BPF and eBPF filtering
OS, container, IoT image delivery and updating
Building Linux devices and applications
Low-level desktop technologies
Networking
System and service management
Tracing and performance measuring
IPC and RPC systems
Security and Sandboxing
While our focus is definitely more on the user-space side of things, talks about kernel projects are welcome, as long as they have a clear and direct relevance for user-space.
As you can see from my EC2 Instance History post, we add new instance types on a regular and frequent basis. Driven by increasingly powerful processors and designed to address an ever-widening set of use cases, the size and diversity of this list reflects the equally diverse group of EC2 customers!
Near the bottom of that list you will find the new compute-intensive C5 instances. With a 25% to 50% improvement in price-performance over the C4 instances, the C5 instances are designed for applications like batch and log processing, distributed and/or real-time analytics, high-performance computing (HPC), ad serving, highly scalable multiplayer gaming, and video encoding. Some of these applications can benefit from access to high-speed, ultra-low latency local storage. For example, video encoding, image manipulation, and other forms of media processing often necessitate large amounts of I/O to temporary storage. While the input and output files are valuable assets and are typically stored as Amazon Simple Storage Service (S3) objects, the intermediate files are expendable. Similarly, batch and log processing runs in a race-to-idle model, flushing volatile data to disk as fast as possible in order to make full use of compute resources.
New C5d Instances with Local Storage
In order to meet this need, we are introducing C5 instances equipped with local NVMe storage. Available for immediate use in 5 regions, these instances are a great fit for the applications that I described above, as well as others that you will undoubtedly dream up! Here are the specs:
Instance Name | vCPUs | RAM | Local Storage | EBS Bandwidth | Network Bandwidth
--- | --- | --- | --- | --- | ---
c5d.large | 2 | 4 GiB | 1 x 50 GB NVMe SSD | Up to 2.25 Gbps | Up to 10 Gbps
c5d.xlarge | 4 | 8 GiB | 1 x 100 GB NVMe SSD | Up to 2.25 Gbps | Up to 10 Gbps
c5d.2xlarge | 8 | 16 GiB | 1 x 225 GB NVMe SSD | Up to 2.25 Gbps | Up to 10 Gbps
c5d.4xlarge | 16 | 32 GiB | 1 x 450 GB NVMe SSD | 2.25 Gbps | Up to 10 Gbps
c5d.9xlarge | 36 | 72 GiB | 1 x 900 GB NVMe SSD | 4.5 Gbps | 10 Gbps
c5d.18xlarge | 72 | 144 GiB | 2 x 900 GB NVMe SSD | 9 Gbps | 25 Gbps
Other than the addition of local storage, the C5 and C5d share the same specs. Both are powered by 3.0 GHz Intel Xeon Platinum 8000-series processors, optimized for EC2 and with full control over C-states on the two largest sizes, giving you the ability to run two cores at up to 3.5 GHz using Intel Turbo Boost Technology.
You can use any AMI that includes drivers for the Elastic Network Adapter (ENA) and NVMe; this includes the latest Amazon Linux, Microsoft Windows (Server 2008 R2, Server 2012, Server 2012 R2 and Server 2016), Ubuntu, RHEL, SUSE, and CentOS AMIs.
Here are a couple of things to keep in mind about the local NVMe storage:
Naming – You don’t have to specify a block device mapping in your AMI or during the instance launch; the local storage will show up as one or more devices (/dev/nvme*1 on Linux) after the guest operating system has booted.
Encryption – Each local NVMe device is hardware encrypted using the XTS-AES-256 block cipher and a unique key. Each key is destroyed when the instance is stopped or terminated.
Lifetime – Local NVMe devices have the same lifetime as the instance they are attached to, and do not stick around after the instance has been stopped or terminated.
Available Now
C5d instances are available in On-Demand, Reserved Instance, and Spot form in the US East (N. Virginia), US West (Oregon), EU (Ireland), US East (Ohio), and Canada (Central) Regions. Prices vary by Region, and are just a bit higher than for the equivalent C5 instances.
The EU’s General Data Protection Regulation (GDPR) describes data processor and data controller roles, and some customers and AWS Partner Network (APN) partners are asking how this affects the long-established AWS Shared Responsibility Model. I wanted to take some time to help folks understand shared responsibilities for us and for our customers in the context of the GDPR.
How does the AWS Shared Responsibility Model change under GDPR? The short answer – it doesn’t. AWS is responsible for securing the underlying infrastructure that supports the cloud and the services provided, while customers and APN partners, acting either as data controllers or data processors, are responsible for any personal data they put in the cloud. The shared responsibility model illustrates the various responsibilities of AWS and our customers and APN partners, and the same separation of responsibility applies under the GDPR.
AWS responsibilities as a data processor
The GDPR does introduce specific regulation and responsibilities regarding data controllers and processors. When any AWS customer uses our services to process personal data, the controller is usually the AWS customer (and sometimes it is the AWS customer’s customer). However, in all of these cases, AWS is always the data processor in relation to this activity. This is because the customer is directing the processing of data through its interaction with the AWS service controls, and AWS is only executing customer directions. As a data processor, AWS is responsible for protecting the global infrastructure that runs all of our services. Controllers using AWS maintain control over data hosted on this infrastructure, including the security configuration controls for handling end-user content and personal data. Protecting this infrastructure is our number one priority, and we invest heavily in third-party auditors to test our security controls and make any issues they find available to our customer base through AWS Artifact. Our ISO 27018 report is a good example, as it tests security controls that focus on protection of personal data in particular.
AWS has an increased responsibility for our managed services. Examples of managed services include Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon Elastic MapReduce, and Amazon WorkSpaces. These services provide the scalability and flexibility of cloud-based resources with less operational overhead because we handle basic security tasks like guest operating system (OS) and database patching, firewall configuration, and disaster recovery. For most managed services, you only configure logical access controls and protect account credentials, while maintaining control and responsibility of any personal data.
Customer and APN partner responsibilities as data controllers — and how AWS Services can help
Our customers can act as data controllers or data processors within their AWS environment. As a data controller, the services you use may determine how you configure those services to help meet your GDPR compliance needs. For example, AWS Services that are classified as Infrastructure as a Service (IaaS), such as Amazon EC2, Amazon VPC, and Amazon S3, are under your control and require you to perform all routine security configuration and management that would be necessary no matter where the servers were located. With Amazon EC2 instances, you are responsible for managing: guest OS (including updates and security patches), application software or utilities installed on the instances, and the configuration of the AWS-provided firewall (called a security group).
To help you realize data protection by design principles under the GDPR when using our infrastructure, we recommend you protect AWS account credentials and set up individual user accounts with AWS Identity and Access Management (IAM) so that each user is only given the permissions necessary to fulfill their job duties. We also recommend using multi-factor authentication (MFA) with each account, requiring the use of SSL/TLS to communicate with AWS resources, setting up API/user activity logging with AWS CloudTrail, and using AWS encryption solutions, along with all default security controls within AWS Services. You can also use advanced managed security services, such as Amazon Macie, which assists in discovering and securing personal data stored in Amazon S3.
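As a rough, illustrative sketch of two of those recommendations (not an AWS-prescribed procedure), enabling API activity logging with CloudTrail and attaching an MFA device to an IAM user could look like the following; the trail name, bucket, user, and codes are placeholders:

aws cloudtrail create-trail --name gdpr-audit-trail --s3-bucket-name my-cloudtrail-bucket
aws cloudtrail start-logging --name gdpr-audit-trail
aws iam enable-mfa-device --user-name alice --serial-number arn:aws:iam::111122223333:mfa/alice --authentication-code1 123456 --authentication-code2 654321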
For more information, you can download the AWS Security Best Practices whitepaper or visit the AWS Security Resources or GDPR Center webpages. In addition to our solutions and services, AWS APN partners can provide hundreds of tools and features to help you meet your security objectives, ranging from network security and configuration management to access control and data encryption.
I’ve been busy trying to replicate the “eFail” PGP/SMIME bug. I thought I’d write up some notes.
PGP and S/MIME encrypt emails, so that eavesdroppers can’t read them. The bugs potentially allow eavesdroppers to take the encrypted emails they’ve captured and resend them to you, reformatted in a way that allows them to decrypt the messages.
Disable remote/external content in email
The most important defense is to disable “external” or “remote” content from being automatically loaded. This is when HTML-formatted emails attempt to load images from remote websites. Senders do this legitimately when they want to display images without bloating the email itself, but most of the time it is illegitimate: they hide images in the message in order to track you with unique IDs and cookies. For example, this is the code at the end of an email from politician Bernie Sanders to his supporters. Notice the long random number assigned to track me, and that the width/height of this image is set to one pixel, so you don’t even see it:
Such trackers are so pernicious they are disabled by default in most email clients. This is an example of the settings in Thunderbird:
The problem is that as you read email messages, you often get frustrated by the error messages and missing content, so you keep adding exceptions:
The correct defense against this eFail bug is to make sure such remote content is disabled and that you have no exceptions, or at least, no HTTP exceptions. HTTPS exceptions (those using SSL) are okay as long as they aren’t to a website the attacker controls. Unencrypted exceptions, though, the hacker can eavesdrop on, so it doesn’t matter if they control the website the requests go to. If the attacker can eavesdrop on your emails, they can probably eavesdrop on your HTTP sessions as well.
Some have recommended disabling PGP and S/MIME completely. That’s probably overkill. As long as the attacker can’t use the “remote content” in emails, you are fine. Likewise, some have recommended disabling HTML completely. That’s not even an option in any email client I’ve used — you can disable sending HTML emails, but not receiving them. It’s sufficient to just disable grabbing remote content, not the rest of HTML email rendering.
I couldn’t replicate the direct exfiltration
There are two related bugs. One allows direct exfiltration, which appends the decrypted PGP email onto the end of an IMG tag (like one of those tracking tags), allowing the entire message to be decrypted.
An example of this is the following email. This is a standard HTML email message consisting of multiple parts. The trick is that the IMG tag in the first part starts the URL (blog.robertgraham.com/…) but doesn’t end it. It has the starting quotes in front of the URL but no ending quotes. The ending quote will come in the next chunk.
The next chunk isn’t HTML, though, it’s PGP. The PGP extension (in my case, Enigmail) will detect this and automatically decrypt it. In this case, it’s some previous email message I’ve received, which the attacker captured by eavesdropping and then pasted into this email message in order to get it decrypted.
What should happen at this point is that Thunderbird will generate a request (if “remote content” is enabled) to the blog.robertgraham.com server with the decrypted contents of the PGP email appended to it. But that’s not what happens. Instead, I get this:
I am indeed getting weird stuff in the URL (the bit after the GET /), but it’s not the PGP decrypted message. Instead, what’s going on is that when Thunderbird puts together a “multipart/mixed” message, it adds its own HTML tags consisting of lines between each part. In the email client it looks like this:
The HTML code it adds looks like:
That’s what you see in the above URL, all this code up to the first quotes. Those quotes terminate the quotes in the URL from the first multipart section, causing the rest of the content to be ignored (as far as being sent as part of the URL).
So at least for the latest version of Thunderbird, you are accidentally safe, even if you have “remote content” enabled. Though this is only according to my tests; there may be a workaround that hackers could exploit.
STARTTLS
In the old days, email was sent plaintext over the wire so that it could be passively eavesdropped on. Nowadays, most providers send it via “STARTTLS”, which sorta encrypts it. Attackers can still intercept such email, but they have to do so actively, using man-in-the-middle. Such active techniques can be detected if you are careful and look for them.
Some organizations don’t care. Apparently, some nation states are just blocking all STARTTLS and forcing email to be sent unencrypted. Others do care. The NSA will passively sniff all the email they can in nations like Iraq, but they won’t actively intercept STARTTLS messages, for fear of getting caught.
The consequence is that it’s much less likely that somebody has been eavesdropping on you, passively grabbing all your PGP/SMIME emails. If you fear they have been, you should look (e.g. send emails from GMail and see if they are intercepted by sniffing the wire).
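One quick way to check whether a given mail server even offers STARTTLS (and to see the certificate it presents) is OpenSSL's test client; the host name below is a placeholder:

openssl s_client -starttls smtp -connect mail.example.com:25 -crlf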
You’ll know if you are getting hacked
If somebody attacks you using eFail, you’ll know. You’ll get an email message formatted this way, with multipart/mixed components, some with corrupt HTML, some encrypted via PGP. This means that for the most part, your risk is that you’ll be attacked only once — the hacker will only be able to get one message through and decrypt it before you notice that something is amiss. Though to be fair, they can probably include all the emails they want decrypted as attachments to the single email they sent you, so the risk isn’t necessarily that you’ll only get one decrypted.
As mentioned above, a lot of attackers (e.g. the NSA) won’t attack you if it’s so easy to get caught. Other attackers, though, like anonymous hackers, don’t care.
Somebody ought to write a plugin to Thunderbird to detect this.
Summary
It only works if attackers have already captured your emails (though, that’s why you use PGP/SMIME in the first place, to guard against that).
It only works if you’ve enabled your email client to automatically grab external/remote content.
It seems to not be easily reproducible in all cases.
Instead of disabling PGP/SMIME, you should make sure your email client has remote/external content disabled — that’s a huge privacy violation even without this bug.
Notes: The default email client on the Mac enables remote content by default, which is bad:
Spencer Ackerman has this interesting story about a guy assigned to crack down on unauthorized White House leaks. It’s necessarily light on technical details, so I thought I’d write up some guesses, either as a guide for future reporters asking questions, or for people who want to better understand the risks when leaking information.
It should come as no surprise that your work email and phone are already monitored. They can get every email you’ve sent or received, even if you’ve deleted it. They can get every text message you’ve sent or received, the metadata of every phone call sent or received, and so forth.
To a lesser extent, this also applies to your well-known personal phone and email accounts. Law enforcement can get the metadata (which includes text messages) for these things without a warrant. In the above story, the person doing the investigation wasn’t law enforcement, but I’m not sure that’s a significant barrier if they can pass things onto the Secret Service or something.
The danger here isn’t that you used these things to leak, it’s that you’ve used these things to converse with the reporter before you made the decision to leak. That’s what happened in the Reality Winner case: she communicated with The Intercept before she allegedly leaked a printed document to them via postal mail. While it wasn’t conclusive enough to convict her, the innocent emails certainly put the investigators on her trail.
The path to leaking often starts this way: innocent actions before the decision to leak was made that will come back to haunt the person afterwards. That includes emails. That also includes Google searches. That includes websites you visit (like this one). I’m not sure how to solve this, except that if you’ve been in contact with The Intercept, and then you decide to leak, send it to anybody but The Intercept.
By the way, the other thing that caught Reality Winner is the records they had of her accessing files and printing them on a printer. Depending where you work, they may have a record of every file you’ve accessed, every intranet page you visited. Because of the way printers put secret dots on documents, investigators know precisely which printer and time the document leaked to The Intercept was printed.
Photographs suffer the same problem: your camera and phone tag the photographs with GPS coordinates and time the photograph was taken, as well as information about the camera. This accidentally exposed John McAfee’s hiding location when Vice took pictures of him a few years ago. Some people leak by taking pictures of the screen — use a camera without GPS for this (meaning, a really old camera you bought from a pawnshop).
These examples should impress upon you the dangers of not understanding technology. As soon as you do something to evade surveillance you know about, you may get caught by surveillance you don’t know about.
If you nonetheless want to continue forward, the next step may be to get a “burner phone”. You can get an adequate Android “prepaid” phone for cash at the local Walmart, electronics store, or phone store.
There are some problems with such phones, though. They can often be tracked back to the store that sold them, and the store will have security cameras that record you making the purchase. License plate readers and GPS tracking on your existing phone may also place you at that Walmart.
I don’t know how to resolve these problems. Perhaps the best approach is to grow a beard and, on the last day of your vacation, color your hair, take a long bike/metro ride (without your existing phone) to a store many miles away and pick up a phone, then shave and change your hair color back again. I don’t know — there’s a good chance any lame attempt you or I might think of has already been seen by law enforcement, so they are likely ahead of you. Maybe ask your local drug dealer where they get their burner phones, and if they can sell you one. Of course, that just means when they get caught for drug dealing, they can reduce their sentence by giving up the middle class person who bought a phone from them.
Lastly, they may age out old security videos, so simply waiting six months before using the phone might work. That means prepaying for an entire year.
Note that I’m not going to link to examples of cheap burner phones on this page. Web browsers will sometimes prefetch some information from links in a webpage, so simply including links in this page can condemn you as having interest in burner phones. You are already in enough trouble for having visited this web page.
Burner phones have GPS. Newer technology, like the latest Android LTE phones, has pretty accurate GPS that the police can query (without a warrant). If you take the phone home and turn it on, they’ll then be able to trace the phone back to your home. Carrying the phone around with you has the same problem, with the phone’s location correlating with your existing phone (which presumably you also carry) or credit card receipts. Rumors are that Petraeus was partly brought down by tracking locations where he used his credit card, namely, matching the hotel he was in with Internet address information.
Older phones that support 3G or even 2G have poorer GPS capabilities. They’ll still locate you to the nearest cell tower, but not pin down your exact location.
A better strategy than a burner phone would be a burner laptop computer used with WiFi. You can get a cheap one for $200 at Amazon.com. My favorite are the 11 inch ones with a full sized keyboard and Windows 10. Better yet, get an older laptop for cash from a pawn shop.
You can install chat apps on this like “Signal Desktop”, “Wire Desktop”, or “WhatsApp” that will allow you to securely communicate. Or use “Discord”, which isn’t really encrypted, but it’s popular among gamers so therefore less likely to stand out. You can sit in a bar with free WiFi and a USB headset and talk to reporters without having a phone. If the reporter you want to leak to doesn’t have those apps (either on their own laptop or phone) then you don’t want to talk to them.
Needless to say, don’t cross the streams. Don’t log onto your normal accounts like Facebook. If you create fake Facebook accounts, don’t follow the same things. Better yet, configure your browser to discard all information (especially “cookies”) every time you log off, so you can’t be tracked. Install ad blockers, or use the “Brave” web browser, to remove even more trackers. A common trick among hackers is to change the “theme” to a red background, as a constant subliminal reminder that you are using your dangerous computer, and never to do anything that identifies the real you.
Put tape over the camera. I’m not sure it’s a really big danger, but put tape over the camera. If they infect you enough to get your picture, they’ve also infected you enough to record any audio on your computer. Remember that proper encryption is end-to-end (they can’t eavesdrop in transit), but if they hack the ends (your laptop, or the reporter’s) they can still record the audio.
Note that when your burner laptop is in “sleep” mode, it can still be talking to the local wifi. Before taking it home, make sure it’s off. Go into the settings and configure it so that when the lid is closed, the computer is turned completely off.
It goes without saying: don’t use that burner laptop from home. Luckily, free wifi is everywhere, so the local cafe, bar, or library can be used.
The next step is to also use a VPN or Tor to mask your Internet address. If there’s an active investigation into the reporter, they’ll get the metadata, the Internet address of the bar/cafe you are coming from. A good VPN provider or especially Tor will stop this. Remember that these providers increase latency, making phone calls a bit harder, but they are a lot safer.
Remember that Ross Ulbricht (owner of dark website market Silk Road) was caught in a library. They’d traced back his Internet address and grabbed his laptop out of his hands. Having it turn off (off off, not sleep off) when the lid is closed is one way to reduce this risk. Configuring your web browser to flush all cookies and passwords on restart is another. If they catch you in mid conversation with your secret contact, though, they’ll at least be able to hear your side of the conversation, and know who you are talking to.
The best measure, though it takes some learning, is “Tails live”. It’s a Linux distribution preconfigured with Tor and various secure chat apps that’ll boot from the USB or SD card. When you turn off the computer, nothing will be saved, so there will be no evidence saved to the disk for investigators to retrieve later.
While we are talking about Tor, it should be noted that many news organizations (NYTimes, Washington Post, The Intercept, etc.) support “SecureDrop”, accessed only through Tor, for receiving anonymous tips. A burner laptop used from a bar, booted into Tails, is likely your most secure way of doing things.
Summary
The point of this post was not to provide a howto guide, but to discuss many of the technological issues involved. In a story about White House people investigating leaks, I’d like to see something in this technological direction. I’d like to know exactly how they were investigating leaks. Certainly, they were investigating all work computers, accounts, and phones. Were they also able to get to non-work computers, accounts, and phones? Did they have law enforcement powers? What could they do about burner phones and laptops?
In any case, if you do want a howto guide, the discussion above should put some fear into you about how easily you can inadvertently make a mistake.
Much has been written on LWN about dynamically instrumenting kernel code. These features are also available to user-space code with a special kind of probe known as a User Statically-Defined Tracing (USDT) probe. These probes provide a low-overhead way of instrumenting user-space code and provide a convenient way to debug applications running in production. In this final article of the BPF and BCC series we’ll look at where USDT probes come from and how you can use them to understand the behavior of your own applications.
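As a quick taste of what that looks like in practice, assuming a binary that actually ships USDT probes (for example, a Python or MySQL build with DTrace/SystemTap support) and that BCC's tools are installed under /usr/share/bcc/tools, you can list a binary's probes and attach to one like this:

readelf -n /usr/bin/python3 | grep -A2 stapsdt                          # USDT probes are recorded in the .note.stapsdt ELF section
sudo /usr/share/bcc/tools/tplist -l /usr/bin/python3                    # list USDT probes with BCC's tplist
sudo /usr/share/bcc/tools/trace 'u:/usr/bin/python3:function__entry'    # print a line each time this (assumed) probe fires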
I ran into an amusing problem with Python and multiprocessing, and at the moment I still can't tell whose problem it is (in the end it will turn out to be mine). It took me a decent amount of time to catch it, and it's worth telling the story.
A bit of background: we use InfluxDB, into which we stuff a lot of per-second data that we later chew down into per-minute data. InfluxDB has continuous queries that do this job – at some interval they pick up the new data and roll it up. These queries had a few problems: they can't backfill old data; they run rarely, so the per-minute data lags behind; and they run more or less in a single thread, which makes the per-minute data lag even more (in our case, before we replaced them, by about 12 hours).
That got on my nerves, so I wrote a simple little daemon in Python that gathers, for each database, information about which data can be rolled up, and backfills the data in parallel. Roughly, it works like this: it takes the list of databases; it uses multiprocessing to collect, for each database, which queries need to be run, based on which measurements it has and how far along its per-minute and per-second data are; it uses multiprocessing to run the queries collected in the previous pass; and so on until the end of the world (or until it hangs).
After catching up over a few hours, it managed to keep the per-minute data within a few minutes of the latest per-second data, which was a serious improvement of the situation. The only problem was that from time to time it stopped processing and hung.
Today I found the time to take a careful look at what was happening to it. The process consists of one parent and 5 fork()ed children, as follows:
Parent: sleeping in futex 0x22555a0
Child 18455: in futex 0x7fdbfa366000
Child 18546: in read
Child 18457: in futex 0x7fdbfa366000
Child 18461: in futex 0x7fdbfa366000
Child 18462: in futex 0x7fdbfa366000
Child 18465: in futex 0x7fdbf908c2c0
That wasn't particularly useful, and it turned out that the standard Python debugger (pdb) can't attach to existing processes, but gdb with suitable debug symbols can, and it can give quite a bit of useful information. That is how I found that the parent was waiting for one child to finish its work:
#11 PyEval_EvalFrameEx (f=f@entry=Frame 0x235fb80, for file /usr/lib64/python2.7/multiprocessing/pool.py, line 543, in wait (self=<ApplyResult(... time >= 1525137960000000000 AND time < 1525138107000000000 GROUP BY time(1m), * fill(linear)' in a read only context, please use a POST request instead', u'level': u'warning'}], u'statement_id': 0}]}, None], _callback=None, _chunksize=1, _number_left=1, _ready=False, _success=True, _cond=<_Condition(_Verbose__verbose=False, _Condition__lock=<...>, acquire=<...>, _Condition__waiters=[], release=<...>) at remote 0x7fdbe0015310>, _job=45499, _cache={45499: <...>}) a...(truncated), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3040
In pool.py, around line 543, there is the following:
class ApplyResult(object):
    ...
    def wait(self, timeout=None):
        self._cond.acquire()
        try:
            if not self._ready:
                self._cond.wait(timeout)
        finally:
            self._cond.release()
At first I thought that 18546 was waiting to read something from the wrong place, but it turned out that this is the child that won the race to execute the next task and is waiting to be handed it (which apparently gets handed out through futex 0x7fdbfa366000). One of the children, however, is waiting on a different lock:
(gdb) bt
#0  __lll_lock_wait () at ../nptl/sysdeps/unix/sysv/linux/x86_64/lowlevellock.S:135
#1  0x00007fdbf9b68dcb in _L_lock_812 () from /lib64/libpthread.so.0
#2  0x00007fdbf9b68c98 in __GI___pthread_mutex_lock (mutex=mutex@entry=0x7fdbf908c2c0 <lock>) at ../nptl/pthread_mutex_lock.c:79
#3  0x00007fdbf8e846ea in _nss_files_gethostbyname4_r (name=name@entry=0x233fa44 "localhost", pat=pat@entry=0x7fdbecfcb8e0, buffer=buffer@entry=0x7fdbecfcb340 "hZ \372\333\177", buflen=buflen@entry=1064, errnop=errnop@entry=0x7fdbecfcb8b0, herrnop=herrnop@entry=0x7fdbecfcb910, ttlp=ttlp@entry=0x0) at nss_files/files-hosts.c:381
#4  0x00007fdbf9170ed8 in gaih_inet (name=<optimized out>, name@entry=0x233fa44 "localhost", service=<optimized out>, req=req@entry=0x7fdbecfcbb90, pai=pai@entry=0x7fdbecfcb9f0, naddrs=naddrs@entry=0x7fdbecfcb9e0) at ../sysdeps/posix/getaddrinfo.c:877
#5  0x00007fdbf91745cd in __GI_getaddrinfo (name=name@entry=0x233fa44 "localhost", service=service@entry=0x7fdbecfcbbc0 "8086", hints=hints@entry=0x7fdbecfcbb90, pai=pai@entry=0x7fdbecfcbb78) at ../sysdeps/posix/getaddrinfo.c:2431
#6  0x00007fdbeed8760d in socket_getaddrinfo (self=<optimized out>, args=<optimized out>) at /usr/src/debug/Python-2.7.5/Modules/socketmodule.c:4193
#7  0x00007fdbf9e5fbb0 in call_function (oparg=<optimized out>, pp_stack=0x7fdbecfcbd10) at /usr/src/debug/Python-2.7.5/Python/ceval.c:4408
#8  PyEval_EvalFrameEx (f=f@entry=Frame 0x7fdbe8013350, for file /usr/lib/python2.7/site-packages/urllib3/util/connection.py, line 64, in create_connection (address=('localhost', 8086), timeout=3000, source_address=None, socket_options=[(6, 1, 1)], host='localhost', port=8086, err=None), throwflag=throwflag@entry=0) at /usr/src/debug/Python-2.7.5/Python/ceval.c:3040
(gdb) frame 3
#3  0x00007fdbf8e846ea in _nss_files_gethostbyname4_r (name=name@entry=0x233fa44 "localhost", pat=pat@entry=0x7fdbecfcb8e0, buffer=buffer@entry=0x7fdbecfcb340 "hZ \372\333\177", buflen=buflen@entry=1064, errnop=errnop@entry=0x7fdbecfcb8b0, herrnop=herrnop@entry=0x7fdbecfcb910, ttlp=ttlp@entry=0x0) at nss_files/files-hosts.c:381
381       __libc_lock_lock (lock);
(gdb) list
376     enum nss_status
377     _nss_files_gethostbyname4_r (const char *name, struct gaih_addrtuple **pat,
378                                  char *buffer, size_t buflen, int *errnop,
379                                  int *herrnop, int32_t *ttlp)
380     {
381       __libc_lock_lock (lock);
382
383       /* Reset file pointer to beginning or open file.  */
384       enum nss_status status = internal_setent (keep_stream);
385
Or, in translation: we are trying to take the standard lock that libc uses to protect its reentrant functions, and somebody is holding it. Who could it be?
… which is entirely expected, given that these are two processes and that memory is not shared.
So apparently what happened is that while the parent was calling fork(), somebody was holding this lock, and the child effectively cannot touch anything related to it (which means no reentrant functions in glibc, which everybody uses (and should use)). The thing is, in principle this should not be possible, because there is nothing around fork() that takes this lock, and glibc should release the lock when leaving its functions.
My initial idiotic guess was that the multiprocessing module creates the new children in the SIGCHLD signal handler, so that while something else holds the lock a signal arrives, a new process is created, and it "inherits" the lock in its locked state. That was too stupid to be true, and it turned out it wasn't…
Searching around the lock question led me to two things: issue 127 in gperftools and Debian bug 657835. The first said that my problem might come from another lock that somebody else holds before the fork (which made me look more carefully at which locks were being held), and the second that, in general, if you fork a threaded application, the only thing you can safely do afterwards is execve(), because it is unclear how well anything else will work.
In the end it turned out that when the multiprocessing module is used, it starts several threads in the main process that handle monitoring and spawning the worker children. So here is what actually happens:
– some child completes its allotted number of operations and exits;
– the parent receives SIGCHLD and notes that it should check what is going on;
– the parent's main thread starts collecting the list of databases and at some point calls _nss_files_gethostbyname4_r, which takes the lock;
– meanwhile the other thread says "hm, I don't have enough children, fork()";
– profit. (A minimal Python sketch of this race follows.)
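The race is easy to reproduce outside of the original program. Here is a minimal, self-contained sketch; the names and timings are illustrative, a threading.Lock stands in for glibc's internal NSS lock, and the script hangs by design because the forked child deadlocks:

import os
import threading
import time

lock = threading.Lock()   # stands in for glibc's internal NSS lock

def resolver():
    with lock:            # e.g. _nss_files_gethostbyname4_r taking the lock
        time.sleep(1)

threading.Thread(target=resolver).start()
time.sleep(0.1)           # another thread forks while the lock is held
pid = os.fork()
if pid == 0:
    lock.acquire()        # the child's copy of the lock is already "held";
    os._exit(0)           # nobody in the child will ever release it: deadlock
os.waitpid(pid, 0)        # the parent waits forever for the hung child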
My current dumb solution is not to do anything in the main thread that might take this lock, and to hope there is no other lock like it. My future solution is either to rewrite it in python3 with some other module for the job, or in go (which I will have to learn).
The Internet of Things (IoT) has precipitated an influx of connected devices and data that can be mined for useful business insights. If you own an IoT device, you might want the data to be uploaded seamlessly from your connected devices to the cloud so that you can use cloud storage and processing power to perform sophisticated analysis of the data. To upload the data to the AWS Cloud, devices must pass authentication and authorization checks performed by the respective AWS services. The standard way of authenticating AWS requests is the Signature Version 4 algorithm, which requires the caller to have an access key ID and secret access key; consequently, you would need to hardcode the access key ID and the secret access key on your devices. Alternatively, you can use the built-in X.509 certificate as the unique device identity to authenticate AWS requests.
AWS IoT has introduced the credentials provider feature that allows a caller to authenticate AWS requests by having an X.509 certificate. The credentials provider authenticates a caller using an X.509 certificate, and vends a temporary, limited-privilege security token. The token can be used to sign and authenticate any AWS request. Thus, the credentials provider relieves you from having to manage and periodically refresh the access key ID and secret access key remotely on your devices.
In the process of retrieving a security token, you use AWS IoT to create a thing (a representation of a specific device or logical entity), register a certificate, and create AWS IoT policies. You also configure an AWS Identity and Access Management (IAM) role and attach appropriate IAM policies to the role so that the credentials provider can assume the role on your behalf. Finally, you make an HTTP-over-Transport Layer Security (TLS) mutual authentication request to the credentials provider, which uses your preconfigured thing, certificate, policies, and IAM role to authenticate and authorize the request and obtain a security token on your behalf. You can then use the token to sign any AWS request using Signature Version 4.
In this blog post, I explain the AWS IoT credentials provider design and then demonstrate the end-to-end process of retrieving a security token from AWS IoT and using the token to write a temperature and humidity record to a specific Amazon DynamoDB table.
Note: This post assumes you are familiar with AWS IoT and IAM to perform steps using the AWS CLI and OpenSSL. Make sure you are running the latest version of the AWS CLI.
Overview of the credentials provider workflow
The following numbered diagram illustrates the credentials provider workflow. The diagram is followed by explanations of the steps.
The steps of the workflow, as illustrated in the preceding diagram, are:
1. The AWS IoT device uses the AWS SDK or custom client to make an HTTPS request to the credentials provider for a security token. The request includes the device X.509 certificate for authentication.
2. The credentials provider forwards the request to the AWS IoT authentication and authorization module to verify the certificate and the permission to request the security token.
3. If the certificate is valid and has permission to request a security token, the AWS IoT authentication and authorization module returns success. Otherwise, it returns failure, which goes back to the device with the appropriate exception.
4. On success, the credentials provider invokes AWS STS to assume the preconfigured IAM role on your behalf.
5. If assuming the role succeeds, AWS STS returns a temporary, limited-privilege security token to the credentials provider.
6. The credentials provider returns the security token to the device.
7. The AWS SDK on the device uses the security token to sign an AWS request with AWS Signature Version 4.
8. The requested service invokes IAM to validate the signature and authorize the request against access policies attached to the preconfigured IAM role.
9. If IAM validates the signature successfully and authorizes the request, the request goes through.
In another solution, you could configure an AWS IoT rule that ingests your device data and sends it to another AWS service, for example by invoking an AWS Lambda function. However, in applications that require the uploading of large files such as videos or aggregated telemetry to the AWS Cloud, you may want your devices to be able to authenticate and send data directly to the AWS service of your choice. The credentials provider enables you to do that.
Outline of the steps to retrieve and use security token
Perform the following steps as part of this solution:
Create an AWS IoT thing: Start by creating a thing that corresponds to your home thermostat in the AWS IoT thing registry database. This allows you to authenticate the request as a thing and use thing attributes as policy variables in AWS IoT and IAM policies.
Register a certificate: Create and register a certificate with AWS IoT, and attach it to the thing for successful device authentication.
Create and configure an IAM role: Create an IAM role to be assumed by the service on behalf of your device. I illustrate how to configure a trust policy and an access policy so that AWS IoT has permission to assume the role, and the token has necessary permission to make requests to DynamoDB.
Create a role alias: Create a role alias in AWS IoT. A role alias is an alternate data model pointing to an IAM role. The credentials provider request must include a role alias name to indicate which IAM role to assume for obtaining a security token from AWS STS. You may update the role alias on the server to point to a different IAM role and thus make your device obtain a security token with different permissions.
Attach a policy: Create an authorization policy with AWS IoT and attach it to the certificate to control which device can assume which role aliases.
Request a security token: Make an HTTPS request to the credentials provider and retrieve a security token and use it to sign a DynamoDB request with Signature Version 4.
Use the security token to sign a request: Use the retrieved token to sign a request to DynamoDB and successfully write a temperature and humidity record from your home thermostat in a specific table. Thus, starting with an X.509 certificate on your home thermostat, you can successfully upload your thermostat record to DynamoDB and use it for further analysis. Before the availability of the credentials provider, you could not do this.
Deploy the solution
1. Create an AWS IoT thing
Register your home thermostat in the AWS IoT thing registry database by creating a thing type and a thing. You can use the AWS CLI with the following command to create a thing type. The thing type allows you to store description and configuration information that is common to a set of things.
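As a rough boto3 sketch of the equivalent API calls (the thing type name and the Owner attribute are illustrative; the thing name matches the one used later in this post):

import boto3

iot = boto3.client("iot")

# Create a thing type to hold description/configuration common to a set of things.
iot.create_thing_type(
    thingTypeName="thermostat",
    thingTypeProperties={"thingTypeDescription": "Home thermostats"},
)

# Create the thing itself, with an attribute we can later use in policies.
iot.create_thing(
    thingName="MyHomeThermostat",
    thingTypeName="thermostat",
    attributePayload={"attributes": {"Owner": "Alice"}},
)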
2. Register a certificate
Now, you need to have a Certificate Authority (CA) certificate, sign a device certificate using the CA certificate, and register both certificates with AWS IoT before your device can authenticate to AWS IoT. If you do not already have a CA certificate, you can use OpenSSL to create a CA certificate, as described in Use Your Own Certificate. To register your CA certificate with AWS IoT, follow the steps on Registering Your CA Certificate.
You then have to create a device certificate signed by the CA certificate and register it with AWS IoT, which you can do by following the steps on Creating a Device Certificate Using Your CA Certificate. Save the certificate and the corresponding key pair; you will use them when you request a security token later. Also, remember the password you provide when you create the certificate.
Run the following command in the AWS CLI to attach the device certificate to your thing so that you can use thing attributes in policy variables.
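A boto3 sketch of the same attach-thing-principal call; the certificate ARN placeholder must be replaced with the ARN returned when you registered the device certificate:

import boto3

iot = boto3.client("iot")
iot.attach_thing_principal(
    thingName="MyHomeThermostat",
    principal="arn:aws:iot:<region>:<account-id>:cert/<certificate-id>",
)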
If the attach-thing-principal command succeeds, the output is empty.
3. Configure an IAM role
Next, configure an IAM role in your AWS account that will be assumed by the credentials provider on behalf of your device. You are required to associate two policies with the role: a trust policy that controls who can assume the role, and an access policy that controls which actions can be performed on which resources by assuming the role.
The following trust policy grants the credentials provider permission to assume the role. Put it in a text document and save the document with the name, trustpolicyforiot.json.
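A sketch of what trustpolicyforiot.json contains, written out from Python for convenience; the statement allows the AWS IoT credentials provider service principal to assume the role:

import json

trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "credentials.iot.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}
with open("trustpolicyforiot.json", "w") as f:
    json.dump(trust_policy, f, indent=2)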
The following access policy allows DynamoDB operations on the table that has the same name as the thing name that you created in Step 1, MyHomeThermostat, by using credentials-iot:ThingName as a policy variable. I explain after Step 5 about using thing attributes as policy variables. Put the following policy in a text document and save the document with the name, accesspolicyfordynamodb.json.
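A sketch of accesspolicyfordynamodb.json; the ${credentials-iot:ThingName} policy variable scopes writes to the table whose name matches the thing name. The region and account placeholders are assumptions, and the original policy may allow additional DynamoDB actions; PutItem is the one this walkthrough needs:

import json

access_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["dynamodb:PutItem"],
        "Resource": "arn:aws:dynamodb:<region>:<account-id>:table/${credentials-iot:ThingName}",
    }],
}
with open("accesspolicyfordynamodb.json", "w") as f:
    json.dump(access_policy, f, indent=2)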
Finally, run the following command in the AWS CLI to attach the access policy to your role.
aws iam attach-role-policy --role-name dynamodb-access-role --policy-arn arn:aws:iam::<your_aws_account_id>:policy/accesspolicyfordynamodb
If the attach-role-policy command succeeds, the output is empty.
Configure the PassRole permissions
The IAM role that you have created must be passed to AWS IoT to create a role alias, as described in Step 4. The user who performs the operation requires iam:PassRole permission to authorize this action. You also should add permission for the iam:GetRole action to allow the user to retrieve information about the specified role. Create the following policy to grant iam:PassRole and iam:GetRole permissions. Name this policy, passrolepermission.json.
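A sketch of passrolepermission.json; the role name matches the attach-role-policy command shown earlier:

import json

passrole_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": ["iam:PassRole", "iam:GetRole"],
        "Resource": "arn:aws:iam::<your_aws_account_id>:role/dynamodb-access-role",
    }],
}
with open("passrolepermission.json", "w") as f:
    json.dump(passrole_policy, f, indent=2)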
Now, run the following command to attach the policy to the user.
aws iam attach-user-policy --policy-arn arn:aws:iam::<your_aws_account_id>:policy/passrolepermission --user-name <user_name>
If the attach-user-policy command succeeds, the output is empty.
4. Create a role alias
Now that you have configured the IAM role, you will create a role alias with AWS IoT. You must provide the following pieces of information when creating a role alias:
RoleAlias: This is the primary key of the role alias data model and hence a mandatory attribute. It is a string; the minimum length is 1 character, and the maximum length is 128 characters.
RoleArn: This is the Amazon Resource Name (ARN) of the IAM role you have created. This is also a mandatory attribute.
CredentialDurationSeconds: This is an optional attribute specifying the validity (in seconds) of the security token. The minimum value is 900 seconds (15 minutes), and the maximum value is 3,600 seconds (60 minutes); the default value is 3,600 seconds, if not specified.
Run the following command in the AWS CLI to create a role alias. Use the credentials of the user to whom you have given the iam:PassRole permission.
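A boto3 sketch of the create-role-alias call; the alias name is the one used throughout this post, and the role ARN placeholder must point at the role from Step 3:

import boto3

iot = boto3.client("iot")
iot.create_role_alias(
    roleAlias="Thermostat-dynamodb-access-role-alias",
    roleArn="arn:aws:iam::<your_aws_account_id>:role/dynamodb-access-role",
    credentialDurationSeconds=3600,   # optional; defaults to 3600 seconds
)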
5. Attach a policy
You created and registered a certificate with AWS IoT earlier for successful authentication of your device. Now, you need to create and attach a policy to the certificate to authorize the request for the security token.
Let’s say you want to allow a thing to get credentials for the role alias, Thermostat-dynamodb-access-role-alias, with thing owner Alice, thing type thermostat, and the thing attached to a principal. The following policy, with thing attributes as policy variables, achieves these requirements. After this step, I explain more about using thing attributes as policy variables. Put the policy in a text document, and save it with the name, alicethermostatpolicy.json.
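A simplified sketch of alicethermostatpolicy.json and of attaching it to the certificate. The original policy constrains the request further with thing policy variables (owner, thing type, thing attached to a principal); those conditions are omitted here, so treat this as an illustration rather than the exact policy:

import json
import boto3

iot_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "iot:AssumeRoleWithCertificate",
        "Resource": "arn:aws:iot:<region>:<account-id>:rolealias/Thermostat-dynamodb-access-role-alias",
    }],
}

iot = boto3.client("iot")
iot.create_policy(policyName="alicethermostatpolicy",
                  policyDocument=json.dumps(iot_policy))
iot.attach_policy(policyName="alicethermostatpolicy",
                  target="arn:aws:iot:<region>:<account-id>:cert/<certificate-id>")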
If the attach-policy command succeeds, the output is empty.
You have completed all the necessary steps to request an AWS security token from the credentials provider!
Using thing attributes as policy variables
Before I show how to request a security token, I want to explain more about how to use thing attributes as policy variables and the advantage of using them. As a prerequisite, a device must provide a thing name in the credentials provider request.
Thing substitution variables in AWS IoT policies
AWS IoT Simplified Permission Management allows you to associate a connection with a specific thing, and allow the thing name, thing type, and other thing attributes to be available as substitution variables in AWS IoT policies. You can write a generic AWS IoT policy as in alicethermostatpolicy.json in Step 5, attach it to multiple certificates, and authorize the connection as a thing. For example, you could attach alicethermostatpolicy.json to certificates corresponding to each of the thermostats you have that you want to assume the role alias, Thermostat-dynamodb-access-role-alias, and allow operations only on the table with the name that matches the thing name. For more information, see the full list of thing policy variables.
Thing substitution variables in IAM policies
You also can use the following three substitution variables in the IAM role’s access policy (I used credentials-iot:ThingName in accesspolicyfordynamodb.json in Step 3):
credentials-iot:ThingName
credentials-iot:ThingTypeName
credentials-iot:AwsCertificateId
When the device provides the thing name in the request, the credentials provider fetches these three variables from the database and adds them as context variables to the security token. When the device uses the token to access DynamoDB, the variables in the role’s access policy are replaced with the corresponding values in the security token. Note that you also can use credentials-iot:AwsCertificateId as a policy variable; AWS IoT returns certificateId during registration.
6. Request a security token
Make an HTTPS request to the credentials provider to fetch a security token. You have to supply the following information:
Certificate and key pair: Because this is an HTTP request over TLS mutual authentication, you have to provide the certificate and the corresponding key pair to your client while making the request. Use the same certificate and key pair that you used during certificate registration with AWS IoT.
RoleAlias: Provide the role alias (in this example, Thermostat-dynamodb-access-role-alias) to be assumed in the request.
ThingName: Provide the thing name that you created earlier in the AWS IoT thing registry database. This is passed as a header with the name, x-amzn-iot-thingname. Note that the thing name is mandatory only if you have thing attributes as policy variables in AWS IoT or IAM policies.
Run the following command in the AWS CLI to obtain your AWS account-specific endpoint for the credentials provider. See the DescribeEndpoint API documentation for further details.
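A boto3 sketch of the describe-endpoint call:

import boto3

iot = boto3.client("iot")
resp = iot.describe_endpoint(endpointType="iot:CredentialProvider")
print(resp["endpointAddress"])   # <prefix>.credentials.iot.<region>.amazonaws.com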
Note that if you are on Mac OS X, you need to export your certificate to a .pfx or .p12 file before you can pass it in the HTTPS request. Use OpenSSL with the following command to convert the device certificate from .pem to .pfx format. Remember the password because you will need it subsequently in a curl command.
Now, make an HTTPS request to the credentials provider to fetch a security token. You may use your preferred HTTP client for the request. I use curl in the following examples.
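If you prefer Python to curl, a rough equivalent of the request looks like the following sketch; the endpoint comes from the describe-endpoint call above, and the certificate and key file names are placeholders for the files you saved when creating the device certificate:

import requests  # third-party HTTP client

endpoint = "<prefix>.credentials.iot.<region>.amazonaws.com"
role_alias = "Thermostat-dynamodb-access-role-alias"
url = f"https://{endpoint}/role-aliases/{role_alias}/credentials"

resp = requests.get(
    url,
    cert=("device-certificate.pem", "device-private-key.pem"),   # TLS mutual authentication
    headers={"x-amzn-iot-thingname": "MyHomeThermostat"},
)
creds = resp.json()["credentials"]   # accessKeyId, secretAccessKey, sessionToken, expiration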
This command returns a security token object that has an accessKeyId, a secretAccessKey, a sessionToken, and an expiration. The following is sample output of the curl command.
7. Use the security token to sign a request
Create a DynamoDB table called MyHomeThermostat in your AWS account. You will have to choose the hash (partition key) and the range (sort key) while creating the table to uniquely identify a record. Make the hash the serial_number of the thermostat and the range the timestamp of the record. Create a text file with the following JSON to put a temperature and humidity record in the table. Name the file, item.json.
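A sketch of item.json; the attribute names and sample values are assumptions consistent with the table layout just described (serial_number as the hash key, timestamp as the range key):

import json

item = {
    "serial_number": {"S": "ABC123456"},
    "timestamp": {"S": "2018-05-01T10:00:00Z"},
    "temperature": {"N": "72"},
    "humidity": {"N": "45"},
}
with open("item.json", "w") as f:
    json.dump(item, f, indent=2)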
You can use the accessKeyId, secretAccessKey, and sessionToken retrieved from the output of the curl command to sign a request that writes the temperature and humidity record to the DynamoDB table. Use the following commands to accomplish this.
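A boto3 sketch of that signed write (boto3 applies Signature Version 4 automatically); paste in the values returned by the credentials provider:

import json
import boto3

creds = {  # values returned by the credentials provider request above
    "accessKeyId": "<accessKeyId>",
    "secretAccessKey": "<secretAccessKey>",
    "sessionToken": "<sessionToken>",
}

session = boto3.Session(
    aws_access_key_id=creds["accessKeyId"],
    aws_secret_access_key=creds["secretAccessKey"],
    aws_session_token=creds["sessionToken"],
)
dynamodb = session.client("dynamodb", region_name="<region>")

with open("item.json") as f:
    item = json.load(f)
dynamodb.put_item(TableName="MyHomeThermostat", Item=item)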
In this blog post, I demonstrated how to retrieve a security token by using an X.509 certificate and then write an item to a DynamoDB table by using the security token. Similarly, you could run applications on surveillance cameras or sensor devices that exchange the X.509 certificate for an AWS security token and use the token to upload video streams to Amazon Kinesis or telemetry data to Amazon CloudWatch.
If you have comments about this blog post, submit them in the “Comments” section below. If you have questions about or issues implementing this solution, start a new thread on the AWS IoT forum.
Backblaze’s rapid ingest service, Fireball, graduates out of public beta. Our device holds 70 terabytes of customer data and is perfect for migrating large data sets to B2 Cloud Storage.
At Backblaze, we like to put ourselves in the customer’s shoes. Specifically, we ask questions like “how can we make cloud storage more useful?” There is a long list of things we can do to help — over the last few weeks, we’ve addressed some of them when we lowered the cost of downloading data to $0.01 / GB. Today, we are pleased to publicly release our rapid ingest service, Fireball.
What is the Backblaze B2 Fireball?
The Fireball is a hardware device, specifically a NAS device. Any Backblaze B2 customer can order it from inside their account. The Fireball device can hold up to 70 terabytes of data. Upon ordering, it ships from a Backblaze data center to you. When you receive it, you can transfer your data onto the Fireball using your internal network. Once your data transfer is complete, you send it back to a Backblaze data center. Finally, inside our secure data center, your data is uploaded from the Fireball to your account. Your data remains encrypted throughout the process. Step by step instructions can be found here.
Why Use the Fireball?
“We would not have been able to get this project off the ground without the B2 Fireball.” — James Cole, KLRU (Austin City Limits)
For most customers, transferring large quantities of data isn’t always simple. The need can arise as you migrate off of legacy systems (e.g. replacing LTO) or simply on a project basis (e.g. transferring video shot in the field to the cloud). A common approach is to upload your data via the internet to the cloud storage vendor of your choosing. While cloud storage vendors don’t charge for uploads, you have to pay your network provider for bandwidth. That’s assuming you are in a place where the bandwidth can be secured.
Your data is stored in megabytes (“MB”) but your bandwidth is measured in megabits per second (“Mbps”). The difference? An 80 Mbps upload connection will transfer no more than 10 MB per second. That means, in the best case, uploading 50 terabytes takes roughly 58 days, and that assumes you dedicate nearly all of your upload bandwidth to the transfer.
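The back-of-the-envelope math, as a quick sanity check:

link_mbps = 80                          # upload bandwidth in megabits per second
bytes_per_second = link_mbps / 8 * 1e6  # = 10 MB per second
total_bytes = 50e12                     # 50 terabytes (decimal)
days = total_bytes / bytes_per_second / 86400
print(round(days))                      # ~58 days at full line rate, before any overhead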
If you’re looking to migrate old backups from LTO or even a large project, a 3 month lag time is not operationally viable. That’s why multiple cloud storage providers have introduced rapid ingest devices.
How It Compares: Backblaze B2 Fireball vs AWS Snowball vs GCS Transfer Appliance
“We found the B2 Fireball much simpler and easier to use than Amazon’s Snowball. WunderVu had been looking for a cloud solution for security and simplicity, and B2 hit every check box.” — Aaron Rhodes, Executive Producer, WunderVu
Every vendor that offers a rapid ingest service only lets you upload to that vendor’s cloud. For example, you can’t use an Amazon Snowball to upload to Google Cloud Storage. This means that when considering a rapid ingest service, you are also making a decision on what cloud storage vendor to use. As such, one should consider not only the cost of the rapid ingest service, but also how much that vendor is going to charge you to store and download your data.
Backblaze B2: 70 TB device capacity; $550 service fee (30 day rental); $75 shipping; $0.005/GB/month cloud storage; $0.01/GB download
Amazon S3 (Snowball): 50 TB device capacity; $200 service fee (10 day rental); shipping not estimated*; $0.021/GB/month cloud storage (+320%); $0.05+/GB download (+500%)
Google Cloud (Transfer Appliance): 100 TB device capacity; $300 service fee (10 day rental); $500 shipping; $0.020/GB/month cloud storage (+300%); $0.08+/GB download (+800%)
*AWS does not estimate shipping fees at the time of the Snowball order.
To make the comparison easier, let’s create a hypothetical case and compare the costs incurred in the first year. Assume you have 100 TB as an initial upload. But that’s just the initial upload. Over the course of the year, let’s consider a usage pattern where every month you add 5 TB, delete 2 TB, and download 10 TB.
Backblaze B2: $1,250 transfer cost (2 Fireballs); $9,570 cloud storage fees; $10,820 total transfer + cloud storage
Amazon S3: $400 transfer cost (2 Snowballs); $36,114 cloud storage fees; $36,514 total transfer + cloud storage (+337%)
Google Cloud: $800 transfer cost (1 transit); $39,684 cloud storage fees; $40,484 total transfer + cloud storage (+374%)
Just looking at the first year, Amazon is 337% more expensive than Backblaze and Google is 374% more expensive than Backblaze.
Put simply, Backblaze offers the lowest cost, high performance cloud storage on the planet. During our public beta of the Fireball program we’ve had extremely positive feedback around how the Fireball enables customers to get their projects started in a time efficient and cost effective way. We hope you’ll give it a try!
Mid-March in the USA means millions of people watching, and betting on, college basketball (I live here but I didn’t make the rules). As the NCAA college championship continues, I wanted to briefly highlight the work of Wesley Pasfield, one of our Professional Services Machine Learning Specialists. Wesley was able to take data from kenpom.com and College Basketball Reference to build a model predicting the outcome of March Madness using the Amazon SageMaker built-in XGBoost algorithm.
Wesley walks us through grabbing the data, performing an exploratory data analysis (EDA in the data science lingo), reshaping the data for the xgboost algorithm, using the SageMaker SDK to create a training job for two different models, and finally creating a SageMaker inference endpoint for serving predictions at https://cbbpredictions.com/. Check out part one of the post and part two.
Pretty cool right? Why not open the notebook and give the xgboost algorithm a try? Just know that there are a few caveats to the predictions so don’t go making your champion prediction just yet!
Data that describe processes in a spatial context are everywhere in our day-to-day lives and they dominate big data problems. Map data, for instance, whether describing networks of roads or remote sensing data from satellites, get us where we need to go. Atmospheric data from simulations and sensors underlie our weather forecasts and climate models. Devices and sensors with GPS can provide a spatial context to nearly all mobile data.
In this post, we introduce the WIND toolkit, a huge (500 TB), open weather model dataset that’s available to the world on Amazon’s cloud services. We walk through how to access this data and some of the open-source software developed to make it easily accessible. Our solution considers a subset of geospatial data that exist on a grid (raster) and explores ways to provide access to large-scale raster data from weather models. The solution uses foundational AWS services and the Hierarchical Data Format (HDF), a well adopted format for scientific data.
The approach developed here can be extended to any data that fit in an HDF5 file, which can describe sparse and dense vectors and matrices of arbitrary dimensions. This format is already popular within the physical sciences for both experimental and simulation data. We discuss solutions to gridded data storage for a massive dataset of public weather model outputs called the Wind Integration National Dataset (WIND) toolkit. We also highlight strategies that are general to other large geospatial data management problems.
Wind Integration National Dataset
As variable renewable power penetration levels increase in power systems worldwide, the importance of renewable integration studies to ensure continued economic and reliable operation of the power grid is also increasing. The WIND toolkit is the largest freely available grid integration dataset to date.
The WIND toolkit was developed by 3TIER by Vaisala. They were under a subcontract to the National Renewable Energy Laboratory (NREL) to support studies on integration of wind energy into the existing US grid. NREL is a part of a network of national laboratories for the US Department of Energy and has a mission to advance the science and engineering of energy efficiency, sustainable transportation, and renewable power technologies.
The toolkit has been used by consultants, research groups, and universities worldwide to support grid integration studies. Less traditional uses also include resource assessments for wind plants (such as those powering Amazon data centers), and studying the effects of weather on California condor migrations in the Baja peninsula.
The diversity of applications highlights the value of accessible, open public data. Yet, there’s a catch: the dataset is huge. The WIND toolkit provides simulated atmospheric (weather) data at a two-km spatial resolution and five-minute temporal resolution at multiple heights for seven years. The entire dataset is half a petabyte (500 TB) in size and is stored in the NREL High Performance Computing data center in Golden, Colorado. Making this dataset publicly available easily and in a cost-effective manner is a major challenge.
As other laboratories and public institutions work to release their data to the world, they may face similar challenges to those that we experienced. Some prior, well-intentioned efforts to release huge datasets as-is have resulted in data resources that are technically available but fundamentally unusable. They may be stored in an unintuitive format or indexed and organized to support only a subset of potential uses. Downloading hundreds of terabytes of data is often impractical. Most users don’t have access to a big data cluster (or super computer) to slice and dice the data as they need after it’s downloaded.
We aim to provide a large amount of data (50 terabytes) to the public in a way that is efficient, scalable, and easy to use. In many cases, researchers can access these huge cloud-located datasets using the same software and algorithms they have developed for smaller datasets stored locally. Only the pieces of data they need for their individual analysis must be downloaded. To make this work in practice, we worked with the HDF Group and have built upon their forthcoming Highly Scalable Data Service.
In the rest of this post, we discuss how the HSDS software was developed to use Amazon EC2 and Amazon S3 resources to provide convenient and scalable access to these huge geospatial datasets. We describe how the HSDS service has been put to work for the WIND Toolkit dataset and demonstrate how to access it using the h5pyd Python library and the REST API. We conclude with information about our ongoing work to release more ‘open’ datasets to the public using AWS services, and ways to improve and extend the HSDS with newer Amazon services like Amazon ECS and AWS Lambda.
Developing a scalable service for big geospatial data
The HDF5 file format and API have been used for many years and are an effective means of storing large scientific datasets. For example, NASA’s Earth Observing System (EOS) satellites collect more than 16 TB of data per day using HDF5.
With the rise of the cloud, there are new challenges and opportunities to rethink how HDF5 can be enhanced to work effectively as a component in a cloud-native architecture. For the HDF Group, working with NREL has been a great opportunity to put ideas into practice with a production-size dataset.
An HDF5 file consists of a directed graph of group and dataset objects. Datasets can be thought of as a multidimensional array with support for user-defined metadata tags and compression. Typical operations on datasets would be reading or writing data to a regular subregion (a hyperslab) or reading and writing individual elements (a point selection). Also, group and dataset objects may each contain an arbitrary number of the user-defined metadata elements known as attributes.
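A small h5py example of that data model (the file name, shapes, and attribute values are illustrative):

import h5py
import numpy as np

with h5py.File("example.h5", "w") as f:
    grp = f.create_group("wind")                              # group object
    dset = grp.create_dataset("speed", shape=(24, 100, 100),  # 3-D dataset
                              dtype="f4", chunks=(24, 10, 10),
                              compression="gzip")
    dset.attrs["units"] = "m s-1"                             # user-defined attribute
    dset[0, :, :] = np.random.rand(100, 100)                  # write a hyperslab (regular subregion)
    value = dset[0, 0, 0]                                     # point selection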
Many people have used the HDF library in applications developed or ported to run on EC2 instances, but there are a number of constraints that often prove problematic:
The HDF5 library can’t read directly from HDF5 files stored as S3 objects. The entire file (often many GB in size) would need to be copied to local storage before the first byte can be read, and the instance must be configured with an appropriately sized EBS volume.
The HDF library only has access to the computational resources of the instance itself (as opposed to a cluster of instances), so many operations are bottlenecked by the library.
Any modifications to the HDF5 file would somehow have to be synchronized with changes that other instances have made to the same file before writing back to S3.
Using a pattern common to many offerings from AWS, the solution to these constraints is to develop a service framework around the HDF data model. Using this model, the HDF Group has created the Highly Scalable Data Service (HSDS) that provides all the functionality that traditionally was provided by the HDF5 library. By using the service, you don’t need to manage your own file volumes, but can just read and write whatever data that you need.
Because the service manages the actual data persistence to a durable medium (S3, in this case), you don’t need to worry about disk management. Simply stream the data you need from the service as you need it. Secondly, putting the functionality behind a service allows some tricks to increase performance (described in more detail later). And lastly, HSDS allows any number of clients to access the data at the same time, enabling HDF5 to be used as a coordination mechanism for multiple readers and writers.
In designing the HSDS architecture, we gave much thought to how to achieve scalability of the HSDS service. For accessing HDF5 data, there are two different types of scaling to consider:
Multiple clients making many requests to the service
Single requests that require a significant amount of data processing
To deal with the first scaling challenge, as with most services, we considered how the service responds as the request rate increases. AWS provides some great tools that help in this regard:
Auto Scaling groups
Elastic Load Balancing load balancers
The ability of S3 to handle large aggregate throughput rates
By using a cluster of EC2 instances behind a load balancer, you can handle different client loads in a cost-effective manner.
The second scaling challenge concerns single requests that would take significant processing time with just one compute node. One example of this from the WIND toolkit would be extracting all the values in the seven-year time span for a given geographic point and dataset.
In HDF5, large datasets are typically stored as “chunks”; that is, a regular partition of the array. In HSDS, each chunk is stored as a binary object in S3. The sequential approach to retrieving the time series values would be for the service to read each chunk needed from S3, extract the needed elements, and go on to the next chunk. In this case, that would involve processing 2557 chunks, and would be quite slow.
Fortunately, with HSDS, you can speed this up quite a bit by exploiting the compute and I/O capabilities of the cluster. Upon receiving the request, the receiving node can use other nodes in the cluster to read different portions of the selection. With multiple nodes reading from S3 in parallel, performance improves as the cluster size increases.
The diagram below illustrates how this works in a simplified case of four chunks and four nodes.
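Conceptually, the parallel read looks something like the following sketch; this is not the HSDS implementation, and the bucket and key layout are made up for illustration:

from concurrent.futures import ThreadPoolExecutor
import boto3

s3 = boto3.client("s3")
chunk_keys = [f"dataset-chunks/chunk_{i:04d}.bin" for i in range(2557)]  # hypothetical chunk objects

def fetch(key):
    # Each worker reads one chunk object from S3 and returns its bytes.
    return s3.get_object(Bucket="example-bucket", Key=key)["Body"].read()

with ThreadPoolExecutor(max_workers=40) as pool:
    chunks = list(pool.map(fetch, chunk_keys))  # reads against S3 proceed in parallel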
This architecture has worked well in practice. In testing with the WIND toolkit and time series extraction, we observed a request latency of ~60 seconds using four nodes vs. ~5 seconds with 40 nodes. Performance roughly scales with the size of the cluster.
A planned enhancement to this is to use AWS Lambda for the worker processing. This enables 1000-way parallel reads at a reasonable cost, as you only pay for the milliseconds of CPU time used with AWS Lambda.
Public access to atmospheric data using HSDS and AWS
An early challenge in releasing the WIND toolkit data was in deciding how to subset the data for different use cases. In general, few researchers need access to the entire 0.5 PB of data and a great deal of efficiency and cost reduction can be gained by making directed constituent datasets.
NREL grid integration researchers initially extracted a 2-TB subset by selecting 120,000 points where the wind resource seemed appropriate for development. They also chose only those data important for wind applications (100-m wind speed, converted to power), the most interesting locations for those performing grid studies. To support the remaining users who needed more data resolution, we down-sampled the data to a 60-minute temporal resolution, keeping all the other variables and spatial resolution intact. This reduced dataset is 50 TB of data describing 30+ atmospheric variables of data for 7 years at a 60-minute temporal resolution.
The WindViz browser-based Gridded Wind Toolkit Visualizer was created as an example implementation of the HSDS REST API in JavaScript. The visualizer is written in the style of ECMAScript 2016 using a modern development toolchain that includes webpack and Babel. The source code is available through our GitHub repository. The demo page is hosted via GitHub pages, and we use a cross-origin AJAX request to fetch data from the HSDS service running on the EC2 infrastructure. The visualizer can be used to explore the gridded wind toolkit data on a map. Achieve full spatial resolution by zooming in to a specific region.
Programmatic access is possible using the h5pyd Python library, a distributed analog to the widely used h5py library. Users interact with the datasets (variables) and slice the data from its (time x longitude x latitude) cube form as they see fit.
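A short h5pyd sketch; the domain path, endpoint, and dataset name below are assumptions based on NREL’s published examples, and an API key may be required:

import h5pyd

f = h5pyd.File("/nrel/wtk-us.h5", "r",
               endpoint="https://developer.nrel.gov/api/hsds")
windspeed = f["windspeed_100m"]          # a (time x longitude x latitude) cube
point_series = windspeed[:, 500, 500]    # full time series at one grid point
one_week = windspeed[0:2016, 500, 500]   # one week of 5-minute data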
Examples and use cases are described in a set of Jupyter notebooks available on GitHub.
With a Jupyter notebook server running on your EC2 server, you can work with those notebooks remotely.
From your laptop, create an SSH tunnel:
$ ssh -L 8888:localhost:8888 (IP address of the EC2 server)
Now, you can browse to localhost:8888 using the correct token, and interact with the notebooks as if they were local. Within the directory, there are examples for accessing the HSDS API and plotting wind and weather data using matplotlib.
Controlling access and defraying costs
A final concern is rate limiting and access control. Although the HSDS service is scalable and relatively robust, we had a few practical concerns:
How can we protect from malicious or accidental use that may lead to high egress fees (for example, someone who attempts to repeatedly download the entire dataset from S3)?
How can we keep track of who is using the data both to document the value of the data resource and to justify the costs?
If costs become too high, can we charge for some or all API use to help cover the costs?
To approach these problems, we investigated using Amazon API Gateway and its simplified integration with the AWS Marketplace for SaaS monetization as well as third-party API proxies.
In the end, we chose to use API Umbrella due to its close involvement with http://data.gov. While AWS Marketplace is a compelling option for future datasets, the decision was made to keep this dataset entirely open, at least for now. As community use and associated costs grow, we’ll likely revisit Marketplace. Meanwhile, API Umbrella provides controls for rate limiting and API key registration out of the box and was simple to implement as a front-end proxy to HSDS. Those applications that may want to charge for API use can accomplish a similar strategy using Amazon API Gateway and AWS Marketplace.
Ongoing work and other resources
As NREL and other government research labs, municipalities, and organizations try to share data with the public, we expect many of you will face similar challenges to those we have tried to approach with the architecture described in this post. Providing large datasets is one challenge. Doing so in a way that is affordable and convenient for users is an entirely more difficult goal. Using AWS cloud-native services and the existing foundation of the HDF file format has allowed us to tackle that challenge in a meaningful way.
Dr. Caleb Phillips is a senior scientist with the Data Analysis and Visualization Group within the Computational Sciences Center at the National Renewable Energy Laboratory. Caleb comes from a background in computer science systems, applied statistics, computational modeling, and optimization. His work at NREL spans the breadth of renewable energy technologies and focuses on applying modern data science techniques to data problems at scale.
Dr. Caroline Draxl is a senior scientist at NREL. She supports the research and modeling activities of the US Department of Energy from mesoscale to wind plant scale. Caroline uses mesoscale models to research wind resources in various countries, and participates in on- and offshore boundary layer research and in the coupling of the mesoscale flow features (kilometer scale) to the microscale (tens of meters). She holds a M.S. degree in Meteorology and Geophysics from the University of Innsbruck, Austria, and a PhD in Meteorology from the Technical University of Denmark.
John Readey has been a Senior Architect at The HDF Group since he joined in June 2014. His interests include web services related to HDF, applications that support the use of HDF and data visualization.Before joining The HDF Group, John worked at Amazon.com from 2006–2014 where he developed service-based systems for eCommerce and AWS.
Jordan Perr-Sauer is an RPP intern with the Data Analysis and Visualization Group within the Computational Sciences Center at the National Renewable Energy Laboratory. Jordan hopes to use his professional background in software engineering and his academic training in applied mathematics to solve the challenging problems facing America and the world.
Last week, the Israeli security company CTS Labs published a series of exploits against AMD chips. The publication came with the flashy website, detailed whitepaper, cool vulnerability names — RYZENFALL, MASTERKEY, FALLOUT, and CHIMERA — and logos we’ve come to expect from these sorts of things. What’s new is that the company only gave AMD a day’s notice, which breaks with every norm about responsible disclosure. CTS Labs didn’t release details of the exploits, only high-level descriptions of the vulnerabilities, but it is probably still enough for others to reproduce their results. This is incredibly irresponsible of the company.
Moreover, the vulnerabilities are kind of meh. Nicholas Weaver explains:
In order to use any of the four vulnerabilities, an attacker must already have almost complete control over the machine. For most purposes, if the attacker already has this access, we would generally say they’ve already won. But these days, modern computers at least attempt to protect against a rogue operating system by having separate secure subprocessors. CTS Labs discovered the vulnerabilities when they looked at AMD’s implementation of the secure subprocessor to see if an attacker, having already taken control of the host operating system, could bypass these last lines of defense.
The vulnerabilities described in amdflaws.com could give an attacker that has already gained initial foothold into one or more computers in the enterprise a significant advantage against IT and security teams.
The only thing the attacker would need after the initial local compromise is local admin privileges and an affected machine. To clarify misunderstandings — there is no need for physical access, no digital signatures, no additional vulnerability to reflash an unsigned BIOS. Buy a computer from the store, run the exploits as admin — and they will work (on the affected models as described on the site).
The weirdest thing about this story is that CTS Labs describes one of the vulnerabilities, Chimera, as a backdoor. Although it doesn’t come out and say that this was deliberately planted by someone, it does make the point that the chips were designed in Taiwan. This is an incredible accusation, and honestly needs more evidence before we can evaluate it.
The upshot of all of this is that CTS Labs played this for maximum publicity: over-hyping its results and minimizing AMD’s ability to respond. And it may have an ulterior motive:
But CTS’s website touting AMD’s flaws also contained a disclaimer that threw some shadows on the company’s motives: “Although we have a good faith belief in our analysis and believe it to be objective and unbiased, you are advised that we may have, either directly or indirectly, an economic interest in the performance of the securities of the companies whose products are the subject of our reports,” reads one line. WIRED asked in a follow-up email to CTS whether the company holds any financial positions designed to profit from the release of its AMD research specifically. CTS didn’t respond.
We all need to demand better behavior from security researchers. I know that any publicity is good publicity, but I am pleased to see the stories critical of CTS Labs outnumbering the stories praising it.
AMD’s response today agrees that all four bug families are real and are found in the various components identified by CTS. The company says that it is developing firmware updates for the three PSP flaws. These fixes, to be made available in “coming weeks,” will be installed through system firmware updates. The firmware updates will also mitigate, in some unspecified way, the Chimera issue, with AMD saying that it’s working with ASMedia, the third-party hardware company that developed Promontory for AMD, to develop suitable protections. In its report, CTS wrote that, while one CTS attack vector was a firmware bug (and hence in principle correctable), the other was a hardware flaw. If true, there may be no effective way of solving it.