The advent of user namespaces and container technology has made it possible to extend more root-like powers to unprivileged users in a (we hope) safe way. One remaining sticking point is the mounting of filesystems, which has long been fraught with security problems. Work has been proceeding to allow such mounts for years, and it has gotten a little closer with the posting of a patch series intended for the 4.18 kernel. But, as an unrelated discussion has made clear, truly safe unprivileged filesystem mounting is still a rather distant prospect — at least, if one wants to do it in the kernel.
This post is courtesy of Alan Protasio, Software Development Engineer, Amazon Web Services
Just like compute and storage, messaging is a fundamental building block of enterprise applications. Message brokers (aka “message-oriented middleware”) enable different software systems, often written in different languages, on different platforms, running in different locations, to communicate and exchange information. Mission-critical applications, such as CRM and ERP, rely on message brokers to work.
A common performance consideration for customers deploying a message broker in a production environment is the throughput of the system, measured as messages per second. This is important to know so that application environments (hosts, threads, memory, etc.) can be configured correctly.
In this post, we demonstrate how to measure the throughput for Amazon MQ, a new managed message broker service for ActiveMQ, using JMS Benchmark. It should take between 15–20 minutes to set up the environment and an hour to run the benchmark. We also provide some tips on how to configure Amazon MQ for optimal throughput.
Benchmarking throughput for Amazon MQ
ActiveMQ can be used for a number of use cases. These use cases can range from simple fire and forget tasks (that is, asynchronous processing), low-latency request-reply patterns, to buffering requests before they are persisted to a database.
The throughput of Amazon MQ is largely dependent on the use case. For example, if you have non-critical workloads such as gathering click events for a non-business-critical portal, you can use ActiveMQ in a non-persistent mode and get extremely high throughput with Amazon MQ.
On the flip side, if you have a critical workload where durability is extremely important (meaning that you can’t lose a message), then you are bound by the I/O capacity of your underlying persistence store. We recommend using mq.m4.large for the best results. The mq.t2.micro instance type is intended for product evaluation. Performance is limited, due to the lower memory and burstable CPU performance.
Tip: To improve your throughput with Amazon MQ, make sure that you have consumers processing messaging as fast as (or faster than) your producers are pushing messages.
Because it’s impossible to talk about how the broker (ActiveMQ) behaves for each and every use case, we walk through how to set up your own benchmark for Amazon MQ using our favorite open-source benchmarking tool: JMS Benchmark. We are fans of the JMS Benchmark suite because it’s easy to set up and deploy, and comes with a built-in visualizer of the results.
Non-Persistent Scenarios – Queue latency as you scale producer throughput
Getting started
At the time of publication, you can create an mq.m4.large single-instance broker for testing for $0.30 per hour (US pricing).
Step 2 – Create an EC2 instance to run your benchmark Launch the EC2 instance using Step 1: Launch an Instance. We recommend choosing the m5.large instance type.
Step 3 – Configure the security groups Make sure that all the security groups are correctly configured to let the traffic flow between the EC2 instance and your broker.
From the broker list, choose the name of your broker (for example, MyBroker)
In the Details section, under Security and network, choose the name of your security group or choose the expand icon ( ).
From the security group list, choose your security group.
At the bottom of the page, choose Inbound, Edit.
In the Edit inbound rules dialog box, add a role to allow traffic between your instance and the broker: • Choose Add Rule. • For Type, choose Custom TCP. • For Port Range, type the ActiveMQ SSL port (61617). • For Source, leave Custom selected and then type the security group of your EC2 instance. • Choose Save.
Your broker can now accept the connection from your EC2 instance.
Step 4 – Run the benchmark Connect to your EC2 instance using SSH and run the following commands:
After the benchmark finishes, you can find the results in the ~/reports directory. As you may notice, the performance of ActiveMQ varies based on the number of consumers, producers, destinations, and message size.
Amazon MQ architecture
The last bit that’s important to know so that you can better understand the results of the benchmark is how Amazon MQ is architected.
Amazon MQ is architected to be highly available (HA) and durable. For HA, we recommend using the multi-AZ option. After a message is sent to Amazon MQ in persistent mode, the message is written to the highly durable message store that replicates the data across multiple nodes in multiple Availability Zones. Because of this replication, for some use cases you may see a reduction in throughput as you migrate to Amazon MQ. Customers have told us they appreciate the benefits of message replication as it helps protect durability even in the face of the loss of an Availability Zone.
Conclusion
We hope this gives you an idea of how Amazon MQ performs. We encourage you to run tests to simulate your own use cases.
To learn more, see the Amazon MQ website. You can try Amazon MQ for free with the AWS Free Tier, which includes up to 750 hours of a single-instance mq.t2.micro broker and up to 1 GB of storage per month for one year.
Thanks to Greg Eppel, Sr. Solutions Architect, Microsoft Platform for this great blog that describes how to create a custom CodeBuild build environment for the .NET Framework. — AWS CodeBuild is a fully managed build service that compiles source code, runs tests, and produces software packages that are ready to deploy. CodeBuild provides curated build environments for programming languages and runtimes such as Android, Go, Java, Node.js, PHP, Python, Ruby, and Docker. CodeBuild now supports builds for the Microsoft Windows Server platform, including a prepackaged build environment for .NET Core on Windows. If your application uses the .NET Framework, you will need to use a custom Docker image to create a custom build environment that includes the Microsoft proprietary Framework Class Libraries. For information about why this step is required, see our FAQs. In this post, I’ll show you how to create a custom build environment for .NET Framework applications and walk you through the steps to configure CodeBuild to use this environment.
Build environments are Docker images that include a complete file system with everything required to build and test your project. To use a custom build environment in a CodeBuild project, you build a container image for your platform that contains your build tools, push it to a Docker container registry such as Amazon Elastic Container Registry (Amazon ECR), and reference it in the project configuration. When it builds your application, CodeBuild retrieves the Docker image from the container registry specified in the project configuration and uses the environment to compile your source code, run your tests, and package your application.
Step 1: Launch EC2 Windows Server 2016 with Containers
In the Amazon EC2 console, in your region, launch an Amazon EC2 instance from a Microsoft Windows Server 2016 Base with Containers AMI.
Increase disk space on the boot volume to at least 50 GB to account for the larger size of containers required to install and run Visual Studio Build Tools.
Run the following command in that directory. This process can take a while. It depends on the size of EC2 instance you launched. In my tests, a t2.2xlarge takes less than 30 minutes to build the image and produces an approximately 15 GB image.
docker build -t buildtools2017:latest -m 2GB .
Run the following command to test the container and start a command shell with all the developer environment variables:
docker run -it buildtools2017
Create a repository in the Amazon ECS console. For the repository name, type buildtools2017. Choose Next step and then complete the remaining steps.
Execute the following command to generate authentication details for our registry to the local Docker engine. Make sure you have permissions to the Amazon ECR registry before you execute the command.
aws ecr get-login
In the same command prompt window, copy and paste the following commands:
In the CodeCommit console, create a repository named DotNetFrameworkSampleApp. On the Configure email notifications page, choose Skip.
Clone a .NET Framework Docker sample application from GitHub. The repository includes a sample ASP.NET Framework that we’ll use to demonstrate our custom build environment.On the EC2 instance, open a command prompt and execute the following commands:
Navigate to the CodeCommit repository and confirm that the files you just pushed are there.
Step 4: Configure build spec
To build your .NET Framework application with CodeBuild you use a build spec, which is a collection of build commands and related settings, in YAML format, that AWS CodeBuild can use to run a build. You can include a build spec as part of the source code or you can define a build spec when you create a build project. In this example, I include a build spec as part of the source code.
In the root directory of your source directory, create a YAML file named buildspec.yml.
At this point, we have a Docker image with Visual Studio Build Tools installed and stored in the Amazon ECR registry. We also have a sample ASP.NET Framework application in a CodeCommit repository. Now we are going to set up CodeBuild to build the ASP.NET Framework application.
In the Amazon ECR console, choose the repository that was pushed earlier with the docker push command. On the Permissions tab, choose Add.
For Source Provider, choose AWS CodeCommit and then choose the called DotNetFrameworkSampleApp repository.
For Environment Image, choose Specify a Docker image.
For Environment type, choose Windows.
For Custom image type, choose Amazon ECR.
For Amazon ECR repository, choose the Docker image with the Visual Studio Build Tools installed, buildtools2017. Your configuration should look like the image below:
Choose Continue and then Save and Build to create your CodeBuild project and start your first build. You can monitor the status of the build in the console. You can also configure notifications that will notify subscribers whenever builds succeed, fail, go from one phase to another, or any combination of these events.
Summary
CodeBuild supports a number of platforms and languages out of the box. By using custom build environments, it can be extended to other runtimes. In this post, I showed you how to build a .NET Framework environment on a Windows container and demonstrated how to use it to build .NET Framework applications in CodeBuild.
We’re excited to see how customers extend and use CodeBuild to enable continuous integration and continuous delivery for their Windows applications. Feel free to share what you’ve learned extending CodeBuild for your own projects. Just leave questions or suggestions in the comments.
When I talk with customers and partners, I find that they are in different stages in the adoption of DevOps methodologies. They are automating the creation of application artifacts and the deployment of their applications to different infrastructure environments. In many cases, they are creating and supporting multiple applications using a variety of coding languages and artifacts.
The management of these processes and artifacts can be challenging, but using the right tools and methodologies can simplify the process.
In this post, I will show you how you can automate the creation and storage of application artifacts through the implementation of a pipeline and custom deploy action in AWS CodePipeline. The example includes a Node.js code base stored in an AWS CodeCommit repository. A Node Package Manager (npm) artifact is built from the code base, and the build artifact is published to a JFrogArtifactory npm repository.
I frequently recommend AWS CodePipeline, the AWS continuous integration and continuous delivery tool. You can use it to quickly innovate through integration and deployment of new features and bug fixes by building a workflow that automates the build, test, and deployment of new versions of your application. And, because AWS CodePipeline is extensible, it allows you to create a custom action that performs customized, automated actions on your behalf.
JFrog’s Artifactory is a universal binary repository manager where you can manage multiple applications, their dependencies, and versions in one place. Artifactory also enables you to standardize the way you manage your package types across all applications developed in your company, no matter the code base or artifact type.
If you already have a Node.js CodeCommit repository, a JFrog Artifactory host, and would like to automate the creation of the pipeline, including the custom action and CodeBuild project, you can use this AWS CloudFormationtemplate to create your AWS CloudFormation stack.
This figure shows the path defined in the pipeline for this project. It starts with a change to Node.js source code committed to a private code repository in AWS CodeCommit. With this change, CodePipeline triggers AWS CodeBuild to create the npm package from the node.js source code. After the build, CodePipeline triggers the custom action job worker to commit the build artifact to the designated artifact repository in Artifactory.
This blog post assumes you have already:
· Created a CodeCommit repository that contains a Node.js project.
· Configured a two-stage pipeline in AWS CodePipeline.
The Source stage of the pipeline is configured to poll the Node.js CodeCommit repository. The Build stage is configured to use a CodeBuild project to build the npm package using a buildspec.yml file located in the code repository.
If you do not have a Node.js repository, you can create a CodeCommit repository that contains this simple ‘Hello World’ project. This project also includes a buildspec.yml file that is used when you define your CodeBuild project. It defines the steps to be taken by CodeBuild to create the npm artifact.
If you do not already have a pipeline set up in CodePipeline, you can use this template to create a pipeline with a CodeCommit source action and a CodeBuild build action through the AWS Command Line Interface (AWS CLI). If you do not want to install the AWS CLI on your local machine, you can use AWS Cloud9, our managed integrated development environment (IDE), to interact with AWS APIs.
In your development environment, open your favorite editor and fill out the template with values appropriate to your project. For information, see the readme in the GitHub repository.
Use this CLI command to create the pipeline from the template:
It creates a pipeline that has a CodeCommit source action and a CodeBuild build action.
Integrating JFrog Artifactory
JFrog Artifactory provides default repositories for your project needs. For my NPM package repository, I am using the default virtual npm repository (named npm) that is available in Artifactory Pro. You might want to consider creating a repository per project but for the example used in this post, using the default lets me get started without having to configure a new repository.
I can use the steps in the Set Me Up -> npm section on the landing page to configure my worker to interact with the default NPM repository.
Describes the required values to run the custom action. I will define my custom action in the ‘Deploy’ category, identify the provider as ‘Artifactory’, of version ‘1’, and specify a variety of configurationProperties whose values will be defined when this stage is added to my pipeline.
Polls CodePipeline for a job, scanning for its action-definition properties. In this blog post, after a job has been found, the job worker does the work required to publish the npm artifact to the Artifactory repository.
{
"category": "Deploy",
"configurationProperties": [{
"name": "TypeOfArtifact",
"required": true,
"key": true,
"secret": false,
"description": "Package type, ex. npm for node packages",
"type": "String"
},
{ "name": "RepoKey",
"required": true,
"key": true,
"secret": false,
"type": "String",
"description": "Name of the repository in which this artifact should be stored"
},
{ "name": "UserName",
"required": true,
"key": true,
"secret": false,
"type": "String",
"description": "Username for authenticating with the repository"
},
{ "name": "Password",
"required": true,
"key": true,
"secret": true,
"type": "String",
"description": "Password for authenticating with the repository"
},
{ "name": "EmailAddress",
"required": true,
"key": true,
"secret": false,
"type": "String",
"description": "Email address used to authenticate with the repository"
},
{ "name": "ArtifactoryHost",
"required": true,
"key": true,
"secret": false,
"type": "String",
"description": "Public address of Artifactory host, ex: https://myexamplehost.com or http://myexamplehost.com:8080"
}],
"provider": "Artifactory",
"version": "1",
"settings": {
"entityUrlTemplate": "{Config:ArtifactoryHost}/artifactory/webapp/#/artifacts/browse/tree/General/{Config:RepoKey}"
},
"inputArtifactDetails": {
"maximumCount": 5,
"minimumCount": 1
},
"outputArtifactDetails": {
"maximumCount": 5,
"minimumCount": 0
}
}
There are seven sections to the custom action definition:
category: This is the stage in which you will be creating this action. It can be Source, Build, Deploy, Test, Invoke, Approval. Except for source actions, the category section simply allows us to organize our actions. I am setting the category for my action as ‘Deploy’ because I’m using it to publish my node artifact to my Artifactory instance.
configurationProperties: These are the parameters or variables required for your project to authenticate and commit your artifact. In the case of my custom worker, I need:
TypeOfArtifact: In this case, npm, because it’s for the Node Package Manager.
RepoKey: The name of the repository. In this case, it’s the default npm.
UserName and Password for the user to authenticate with the Artifactory repository.
EmailAddress used to authenticate with the repository.
Artifactory host name or IP address.
provider: The name you define for your custom action stage. I have named the provider Artifactory.
version: Version number for the custom action. Because this is the first version, I set the version number to 1.
entityUrlTemplate: This URL is presented to your users for the deploy stage along with the title you define in your provider. The link takes the user to their artifact repository page in the Artifactory host.
inputArtifactDetails: The number of artifacts to expect from the previous stage in the pipeline.
outputArtifactDetails: The number of artifacts that should be the result from the custom action stage. Later in this blog post, I define 0 for my output artifacts because I am publishing the artifact to the Artifactory repository as the final action.
After I define the custom action in a JSON file, I use the AWS CLI to create the custom action type in CodePipeline:
After I create the custom action type in the same region as my pipeline, I edit the pipeline to add a Deploy stage and configure it to use the custom action I created for Artifactory:
I have created a custom worker for the actions required to commit the npm artifact to the Artifactory repository. The worker is in Python and it runs in a loop on an Amazon EC2 instance. My custom worker polls for a deploy job and publishes the NPM artifact to the Artifactory repository.
The EC2 instance is running Amazon Linux and has an IAM instance role attached that gives the worker permission to access CodePipeline. The worker process is as follows:
Take the configuration properties from the custom worker and poll CodePipeline for a custom action job.
After there is a job in the job queue with the appropriate category, provider, and version, acknowledge the job.
Download the zipped artifact created in the previous Build stage from the provided S3 buckets with the provided temporary credentials.
Unzip the artifact into a temporary directory.
A user-defined Artifactory user name and password is used to receive a temporary API key from Artifactory.
To avoid having to write the password to a file, use that temporary API key and user name to authenticate with the NPM repository.
Publish the Node.js package to the specified repository.
Because I am running my custom worker on an Amazon Linux EC2 instance, I installed npm with the following command:
sudo yum install nodejs npm --enablerepo=epel
For my custom worker, I used pip to install the required Python libraries:
pip install boto3 requests
For a full Python package list, see requirements.txt in the GitHub repository.
Let’s take a look at some of the code snippets from the worker.
First, the worker polls for jobs:
def action_type():
ActionType = {
'category': 'Deploy',
'owner': 'Custom',
'provider': 'Artifactory',
'version': '1' }
return(ActionType)
def poll_for_jobs():
try:
artifactory_action_type = action_type()
print(artifactory_action_type)
jobs = codepipeline.poll_for_jobs(actionTypeId=artifactory_action_type)
while not jobs['jobs']:
time.sleep(10)
jobs = codepipeline.poll_for_jobs(actionTypeId=artifactory_action_type)
if jobs['jobs']:
print('Job found')
return jobs['jobs'][0]
except ClientError as e:
print("Received an error: %s" % str(e))
raise
When there is a job in the queue, the poller returns a number of values from the queue such as jobId, the input and output S3 buckets for artifacts, temporary credentials to access the S3 buckets, and other configuration details from the stage in the pipeline.
After successfully receiving the job details, the worker sends an acknowledgement to CodePipeline to ensure that the work on the job is not duplicated by other workers watching for the same job:
def job_acknowledge(jobId, nonce):
try:
print('Acknowledging job')
result = codepipeline.acknowledge_job(jobId=jobId, nonce=nonce)
return result
except Exception as e:
print("Received an error when trying to acknowledge the job: %s" % str(e))
raise
With the job now acknowledged, the worker publishes the source code artifact into the desired repository. The worker gets the value of the artifact S3 bucket and objectKey from the inputArtifacts in the response from the poll_for_jobs API request. Next, the worker creates a new directory in /tmp and downloads the S3 object into this directory:
def get_bucket_location(bucketName, init_client):
region = init_client.get_bucket_location(Bucket=bucketName)['LocationConstraint']
if not region:
region = 'us-east-1'
return region
def get_s3_artifact(bucketName, objectKey, ak, sk, st):
init_s3 = boto3.client('s3')
region = get_bucket_location(bucketName, init_s3)
session = Session(aws_access_key_id=ak,
aws_secret_access_key=sk,
aws_session_token=st)
s3 = session.resource('s3',
region_name=region,
config=botocore.client.Config(signature_version='s3v4'))
try:
tempdirname = tempfile.mkdtemp()
except OSError as e:
print('Could not write temp directory %s' % tempdirname)
raise
bucket = s3.Bucket(bucketName)
obj = bucket.Object(objectKey)
filename = tempdirname + '/' + objectKey
try:
if os.path.dirname(objectKey):
directory = os.path.dirname(filename)
os.makedirs(directory)
print('Downloading the %s object and writing it to disk in %s location' % (objectKey, tempdirname))
with open(filename, 'wb') as data:
obj.download_fileobj(data)
except ClientError as e:
print('Downloading the object and writing the file to disk raised this error: ' + str(e))
raise
return(filename, tempdirname)
Because the downloaded artifact from S3 is a zip file, the worker must unzip it first. To have a clean area in which to work, I extract the downloaded zip archive into a new directory:
def unzip_codepipeline_artifact(artifact, origtmpdir):
# create a new temp directory
# Unzip artifact into new directory
try:
newtempdir = tempfile.mkdtemp()
print('Extracting artifact %s into temporary directory %s' % (artifact, newtempdir))
zip_ref = zipfile.ZipFile(artifact, 'r')
zip_ref.extractall(newtempdir)
zip_ref.close()
shutil.rmtree(origtmpdir)
return(os.listdir(newtempdir), newtempdir)
except OSError as e:
if e.errno != errno.EEXIST:
shutil.rmtree(newtempdir)
raise
The worker now has the npm package that I want to store in my Artifactory NPM repository.
To authenticate with the NPM repository, the worker requests a temporary token from the Artifactory host. After receiving this temporary token, it creates a .npmrc file in the worker user’s home directory that includes a hash of the user name and temporary token. After it has authenticated, the worker runs npm config set registry <URL OF REPOSITORY> to configure the npm registry value to be the Artifactory host. Next, the worker runs npm publish –registry <URL OF REPOSITORY>, which publishes the node package to the NPM repository in the Artifactory host.
def push_to_npm(configuration, artifact_list, temp_dir, jobId):
reponame = configuration['RepoKey']
art_type = configuration['TypeOfArtifact']
print("Putting artifact into NPM repository " + reponame)
token, hostname, username = gen_artifactory_auth_token(configuration)
npmconfigfile = create_npmconfig_file(configuration, username, token)
url = hostname + '/artifactory/api/' + art_type + '/' + reponame
print("Changing directory to " + str(temp_dir))
os.chdir(temp_dir)
try:
print("Publishing following files to the repository: %s " % os.listdir(temp_dir))
print("Sending artifact to Artifactory NPM registry URL: " + url)
subprocess.call(["npm", "config", "set", "registry", url])
req = subprocess.call(["npm", "publish", "--registry", url])
print("Return code from npm publish: " + str(req))
if req != 0:
err_msg = "npm ERR! Recieved non OK response while sending response to Artifactory. Return code from npm publish: " + str(req)
signal_failure(jobId, err_msg)
else:
signal_success(jobId)
except requests.exceptions.RequestException as e:
print("Received an error when trying to commit artifact %s to repository %s: " % (str(art_type), str(configuration['RepoKey']), str(e)))
raise
return(req, npmconfigfile)
If the return value from publishing to the repository is not 0, the worker signals a failure to CodePipeline. If the value is 0, the worker signals success to CodePipeline to indicate that the stage of the pipeline has been completed successfully.
For the custom worker code, see npm_job_worker.py in the GitHub repository.
I run my custom worker on an EC2 instance using the command python npm_job_worker.py, with an optional --version flag that can be used to specify worker versions other than 1. Then I trigger a release change in my pipeline:
From my custom worker output logs, I have just committed a package named node_example at version 1.0.3:
On artifact: index.js
Committing to the repo: https://artifactory.myexamplehost.com/artifactory/api/npm/npm
Sending artifact to Artifactory URL: https:// artifactoryhost.myexamplehost.com/artifactory/api/npm/npm
npm config: 0
npm http PUT https://artifactory.myexamplehost.com/artifactory/api/npm/npm/node_example
npm http 201 https://artifactory.myexamplehost.com/artifactory/api/npm/npm/node_example
+ [email protected]
Return code from npm publish: 0
Signaling success to CodePipeline
After that has been built successfully, I can find my artifact in my Artifactory repository:
To help you automate this process, I have created this AWS CloudFormation template that automates the creation of the CodeBuild project, the custom action, and the CodePipeline pipeline. It also launches the Amazon EC2-based custom job worker in an AWS Auto Scaling group. This template requires you to have a VPC and CodeCommit repository for your Node.js project. If you do not currently have a VPC in which you want to run your custom worker EC2 instances, you can use this AWS QuickStart to create one. If you do not have an existing Node.js project, I’ve provided a sample project in the GitHub repository.
Conclusion
I‘ve shown you the steps to integrate your JFrog Artifactory repository with your CodePipeline workflow. I’ve shown you how to create a custom action in CodePipeline and how to create a custom worker that works in your CI/CD pipeline. To dig deeper into custom actions and see how you can integrate your Artifactory repositories into your AWS CodePipeline projects, check out the full code base on GitHub.
If you have any questions or feedback, feel free to reach out to us through the AWS CodePipeline forum.
Erin McGill is a Solutions Architect in the AWS Partner Program with a focus on DevOps and automation tooling.
Side projects are the things you do at home, after work, for your own “entertainment”, or to satisfy your desire to learn new stuff, in case your workplace doesn’t give you that opportunity (or at least not enough of it). Side projects are also a way to build stuff that you think is valuable but not necessarily “commercialisable”. Many side projects are open-sourced sooner or later and some of them contribute to the pool of tools at other people’s disposal.
I’ve outlined one recommendation about side projects before – do them with technologies that are new to you, so that you learn important things that will keep you better positioned in the software world.
But there are more benefits than that – serendipitous benefits, for example. And I’d like to tell some personal stories about that. I’ll focus on a few examples from my list of side projects to show how, through a sort-of butterfly effect, they helped shape my career.
The computoser project, no matter how cool algorithmic music composition, didn’t manage to have much of a long term impact. But it did teach me something apart from niche musical theory – how to read a bulk of scientific papers (mostly computer science) and understand them without being formally trained in the particular field. We’ll see how that was useful later.
Then there was the “State alerts” project – a website that scraped content from public institutions in my country (legislation, legislation proposals, decisions by regulators, new tenders, etc.), made them searchable, and “subscribable” – so that you get notified when a keyword of interest is mentioned in newly proposed legislation, for example. (I obviously subscribed for “information technologies” and “electronic”).
And that project turned out to have a significant impact on the following years. First, I chose a new technology to write it with – Scala. Which turned out to be of great use when I started working at TomTom, and on the 3rd day I was transferred to a Scala project, which was way cooler and much more complex than the original one I was hired for. It was a bit ironic, as my colleagues had just read that “I don’t like Scala” a few weeks earlier, but nevertheless, that was one of the most interesting projects I’ve worked on, and it went on for two years. Had I not known Scala, I’d probably be gone from TomTom much earlier (as the other project was restructured a few times), and I would not have learned many of the scalability, architecture and AWS lessons that I did learn there.
But the very same project had an even more important follow-up. Because if its “civic hacking” flavour, I was invited to join an informal group of developers (later officiated as an NGO) who create tools that are useful for society (something like MySociety.org). That group gathered regularly, discussed both tools and policies, and at some point we put up a list of policy priorities that we wanted to lobby policy makers. One of them was open source for the government, the other one was open data. As a result of our interaction with an interim government, we donated the official open data portal of my country, functioning to this day.
As a result of that, a few months later we got a proposal from the deputy prime minister’s office to “elect” one of the group for an advisor to the cabinet. And we decided that could be me. So I went for it and became advisor to the deputy prime minister. The job has nothing to do with anything one could imagine, and it was challenging and fascinating. We managed to pass legislation, including one that requires open source for custom projects, eID and open data. And all of that would not have been possible without my little side project.
As for my latest side project, LogSentinel – it became my current startup company. And not without help from the previous two mentioned above – the computer science paper reading was of great use when I was navigating the crypto papers landscape, and from the government job I not only gained invaluable legal knowledge, but I also “got” a co-founder.
Some other side projects died without much fanfare, and that’s fine. But the ones above shaped my “story” in a way that would not have been possible otherwise.
And I agree that such serendipitous chain of events could have happened without side projects – I could’ve gotten these opportunities by meeting someone at a bar (unlikely, but who knows). But we, as software engineers, are capable of tilting chance towards us by utilizing our skills. Side projects are our “extracurricular activities”, and they often lead to unpredictable, but rather positive chains of events. They would rarely be the only factor, but they are certainly great at unlocking potential.
The survey of nearly forty Republican and Democratic campaign operatives, administered through November and December 2017, revealed that American political campaign staff — primarily working at the state and congressional levels — are not only unprepared for possible cyber attacks, but remain generally unconcerned about the threat. The survey sample was relatively small, but nevertheless the survey provides a first look at how campaign managers and staff are responding to the threat.
The overwhelming majority of those surveyed do not want to devote campaign resources to cybersecurity or to hire personnel to address cybersecurity issues. Even though campaign managers recognize there is a high probability that campaign and personal emails are at risk of being hacked, they are more concerned about fundraising and press coverage than they are about cybersecurity. Less than half of those surveyed said they had taken steps to make their data secure and most were unsure if they wanted to spend any money on this protection.
Security is never something we actually want. Security is something we need in order to avoid what we don’t want. It’s also more abstract, concerned with hypothetical future possibilities. Of course it’s lower on the priorities list than fundraising and press coverage. They’re more tangible, and they’re more immediate.
Many companies across the globe use Amazon DynamoDB to store and query historical user-interaction data. DynamoDB is a fast NoSQL database used by applications that need consistent, single-digit millisecond latency.
Often, customers want to turn their valuable data in DynamoDB into insights by analyzing a copy of their table stored in Amazon S3. Doing this separates their analytical queries from their low-latency critical paths. This data can be the primary source for understanding customers’ past behavior, predicting future behavior, and generating downstream business value. Customers often turn to DynamoDB because of its great scalability and high availability. After a successful launch, many customers want to use the data in DynamoDB to predict future behaviors or provide personalized recommendations.
DynamoDB is a good fit for low-latency reads and writes, but it’s not practical to scan all data in a DynamoDB database to train a model. In this post, I demonstrate how you can use DynamoDB table data copied to Amazon S3 by AWS Data Pipeline to predict customer behavior. I also demonstrate how you can use this data to provide personalized recommendations for customers using Amazon SageMaker. You can also run ad hoc queries using Amazon Athena against the data. DynamoDB recently released on-demand backups to create full table backups with no performance impact. However, it’s not suitable for our purposes in this post, so I chose AWS Data Pipeline instead to create managed backups are accessible from other services.
To do this, I describe how to read the DynamoDB backup file format in Data Pipeline. I also describe how to convert the objects in S3 to a CSV format that Amazon SageMaker can read. In addition, I show how to schedule regular exports and transformations using Data Pipeline. The sample data used in this post is from Bank Marketing Data Set of UCI.
The solution that I describe provides the following benefits:
Separates analytical queries from production traffic on your DynamoDB table, preserving your DynamoDB read capacity units (RCUs) for important production requests
Automatically updates your model to get real-time predictions
Optimizes for performance (so it doesn’t compete with DynamoDB RCUs after the export) and for cost (using data you already have)
Makes it easier for developers of all skill levels to use Amazon SageMaker
All code and data set in this post are available in this .zip file.
Solution architecture
The following diagram shows the overall architecture of the solution.
The steps that data follows through the architecture are as follows:
Data Pipeline regularly copies the full contents of a DynamoDB table as JSON into an S3
Exported JSON files are converted to comma-separated value (CSV) format to use as a data source for Amazon SageMaker.
Amazon SageMaker renews the model artifact and update the endpoint.
The converted CSV is available for ad hoc queries with Amazon Athena.
Data Pipeline controls this flow and repeats the cycle based on the schedule defined by customer requirements.
Building the auto-updating model
This section discusses details about how to read the DynamoDB exported data in Data Pipeline and build automated workflows for real-time prediction with a regularly updated model.
Find the automation_script.sh file and edit it for your environment. For example, you need to replace 's3://<your bucket>/<datasource path>/' with your own S3 path to the data source for Amazon ML. In the script, the text enclosed by angle brackets—< and >—should be replaced with your own path.
Upload the json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar file to your S3 path so that the ADD jar command in Apache Hive can refer to it.
For this solution, the banking.csv should be imported into a DynamoDB table.
Export a DynamoDB table
To export the DynamoDB table to S3, open the Data Pipeline console and choose the Export DynamoDB table to S3 template. In this template, Data Pipeline creates an Amazon EMR cluster and performs an export in the EMRActivity activity. Set proper intervals for backups according to your business requirements.
One core node(m3.xlarge) provides the default capacity for the EMR cluster and should be suitable for the solution in this post. Leave the option to resize the cluster before running enabled in the TableBackupActivity activity to let Data Pipeline scale the cluster to match the table size. The process of converting to CSV format and renewing models happens in this EMR cluster.
For a more in-depth look at how to export data from DynamoDB, see Export Data from DynamoDB in the Data Pipeline documentation.
Add the script to an existing pipeline
After you export your DynamoDB table, you add an additional EMR step to EMRActivity by following these steps:
Open the Data Pipeline console and choose the ID for the pipeline that you want to add the script to.
For Actions, choose Edit.
In the editing console, choose the Activities category and add an EMR step using the custom script downloaded in the previous section, as shown below.
Paste the following command into the new step after the data upload step:
The element #{output.directoryPath} references the S3 path where the data pipeline exports DynamoDB data as JSON. The path should be passed to the script as an argument.
The bash script has two goals, converting data formats and renewing the Amazon SageMaker model. Subsequent sections discuss the contents of the automation script.
Automation script: Convert JSON data to CSV with Hive
We use Apache Hive to transform the data into a new format. The Hive QL script to create an external table and transform the data is included in the custom script that you added to the Data Pipeline definition.
When you run the Hive scripts, do so with the -e option. Also, define the Hive table with the 'org.openx.data.jsonserde.JsonSerDe' row format to parse and read JSON format. The SQL creates a Hive EXTERNAL table, and it reads the DynamoDB backup data on the S3 path passed to it by Data Pipeline.
Note: You should create the table with the “EXTERNAL” keyword to avoid the backup data being accidentally deleted from S3 if you drop the table.
The full automation script for converting follows. Add your own bucket name and data source path in the highlighted areas.
After creating an external table, you need to read data. You then use the INSERT OVERWRITE DIRECTORY ~ SELECT command to write CSV data to the S3 path that you designated as the data source for Amazon SageMaker.
Depending on your requirements, you can eliminate or process the columns in the SELECT clause in this step to optimize data analysis. For example, you might remove some columns that have unpredictable correlations with the target value because keeping the wrong columns might expose your model to “overfitting” during the training. In this post, customer_id columns is removed. Overfitting can make your prediction weak. More information about overfitting can be found in the topic Model Fit: Underfitting vs. Overfitting in the Amazon ML documentation.
Automation script: Renew the Amazon SageMaker model
After the CSV data is replaced and ready to use, create a new model artifact for Amazon SageMaker with the updated dataset on S3. For renewing model artifact, you must create a new training job. Training jobs can be run using the AWS SDK ( for example, Amazon SageMaker boto3 ) or the Amazon SageMaker Python SDK that can be installed with “pip install sagemaker” command as well as the AWS CLI for Amazon SageMaker described in this post.
In addition, consider how to smoothly renew your existing model without service impact, because your model is called by applications in real time. To do this, you need to create a new endpoint configuration first and update a current endpoint with the endpoint configuration that is just created.
#!/bin/bash
## Define variable
REGION=$2
DTTIME=`date +%Y-%m-%d-%H-%M-%S`
ROLE="<your AmazonSageMaker-ExecutionRole>"
# Select containers image based on region.
case "$REGION" in
"us-west-2" )
IMAGE="174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:latest"
;;
"us-east-1" )
IMAGE="382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest"
;;
"us-east-2" )
IMAGE="404615174143.dkr.ecr.us-east-2.amazonaws.com/linear-learner:latest"
;;
"eu-west-1" )
IMAGE="438346466558.dkr.ecr.eu-west-1.amazonaws.com/linear-learner:latest"
;;
*)
echo "Invalid Region Name"
exit 1 ;
esac
# Start training job and creating model artifact
TRAINING_JOB_NAME=TRAIN-${DTTIME}
S3OUTPUT="s3://<your bucket name>/model/"
INSTANCETYPE="ml.m4.xlarge"
INSTANCECOUNT=1
VOLUMESIZE=5
aws sagemaker create-training-job --training-job-name ${TRAINING_JOB_NAME} --region ${REGION} --algorithm-specification TrainingImage=${IMAGE},TrainingInputMode=File --role-arn ${ROLE} --input-data-config '[{ "ChannelName": "train", "DataSource": { "S3DataSource": { "S3DataType": "S3Prefix", "S3Uri": "s3://<your bucket name>/<datasource path>/", "S3DataDistributionType": "FullyReplicated" } }, "ContentType": "text/csv", "CompressionType": "None" , "RecordWrapperType": "None" }]' --output-data-config S3OutputPath=${S3OUTPUT} --resource-config InstanceType=${INSTANCETYPE},InstanceCount=${INSTANCECOUNT},VolumeSizeInGB=${VOLUMESIZE} --stopping-condition MaxRuntimeInSeconds=120 --hyper-parameters feature_dim=20,predictor_type=binary_classifier
# Wait until job completed
aws sagemaker wait training-job-completed-or-stopped --training-job-name ${TRAINING_JOB_NAME} --region ${REGION}
# Get newly created model artifact and create model
MODELARTIFACT=`aws sagemaker describe-training-job --training-job-name ${TRAINING_JOB_NAME} --region ${REGION} --query 'ModelArtifacts.S3ModelArtifacts' --output text `
MODELNAME=MODEL-${DTTIME}
aws sagemaker create-model --region ${REGION} --model-name ${MODELNAME} --primary-container Image=${IMAGE},ModelDataUrl=${MODELARTIFACT} --execution-role-arn ${ROLE}
# create a new endpoint configuration
CONFIGNAME=CONFIG-${DTTIME}
aws sagemaker create-endpoint-config --region ${REGION} --endpoint-config-name ${CONFIGNAME} --production-variants VariantName=Users,ModelName=${MODELNAME},InitialInstanceCount=1,InstanceType=ml.m4.xlarge
# create or update the endpoint
STATUS=`aws sagemaker describe-endpoint --endpoint-name ServiceEndpoint --query 'EndpointStatus' --output text --region ${REGION} `
if [[ $STATUS -ne "InService" ]] ;
then
aws sagemaker create-endpoint --endpoint-name ServiceEndpoint --endpoint-config-name ${CONFIGNAME} --region ${REGION}
else
aws sagemaker update-endpoint --endpoint-name ServiceEndpoint --endpoint-config-name ${CONFIGNAME} --region ${REGION}
fi
Grant permission
Before you execute the script, you must grant proper permission to Data Pipeline. Data Pipeline uses the DataPipelineDefaultResourceRole role by default. I added the following policy to DataPipelineDefaultResourceRole to allow Data Pipeline to create, delete, and update the Amazon SageMaker model and data source in the script.
After you deploy a model into production using Amazon SageMaker hosting services, your client applications use this API to get inferences from the model hosted at the specified endpoint. This approach is useful for interactive web, mobile, or desktop applications.
Following, I provide a simple Python code example that queries against Amazon SageMaker endpoint URL with its name (“ServiceEndpoint”) and then uses them for real-time prediction.
Data Pipeline exports DynamoDB table data into S3. The original JSON data should be kept to recover the table in the rare event that this is needed. Data Pipeline then converts JSON to CSV so that Amazon SageMaker can read the data.Note: You should select only meaningful attributes when you convert CSV. For example, if you judge that the “campaign” attribute is not correlated, you can eliminate this attribute from the CSV.
Train the Amazon SageMaker model with the new data source.
When a new customer comes to your site, you can judge how likely it is for this customer to subscribe to your new product based on “predictedScores” provided by Amazon SageMaker.
If the new user subscribes your new product, your application must update the attribute “y” to the value 1 (for yes). This updated data is provided for the next model renewal as a new data source. It serves to improve the accuracy of your prediction. With each new entry, your application can become smarter and deliver better predictions.
Running ad hoc queries using Amazon Athena
Amazon Athena is a serverless query service that makes it easy to analyze large amounts of data stored in Amazon S3 using standard SQL. Athena is useful for examining data and collecting statistics or informative summaries about data. You can also use the powerful analytic functions of Presto, as described in the topic Aggregate Functions of Presto in the Presto documentation.
With the Data Pipeline scheduled activity, recent CSV data is always located in S3 so that you can run ad hoc queries against the data using Amazon Athena. I show this with example SQL statements following. For an in-depth description of this process, see the post Interactive SQL Queries for Data in Amazon S3 on the AWS News Blog.
Creating an Amazon Athena table and running it
Simply, you can create an EXTERNAL table for the CSV data on S3 in Amazon Athena Management Console.
=== Table Creation ===
CREATE EXTERNAL TABLE datasource (
age int,
job string,
marital string ,
education string,
default string,
housing string,
loan string,
contact string,
month string,
day_of_week string,
duration int,
campaign int,
pdays int ,
previous int ,
poutcome string,
emp_var_rate double,
cons_price_idx double,
cons_conf_idx double,
euribor3m double,
nr_employed double,
y int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n'
LOCATION 's3://<your bucket name>/<datasource path>/';
The following query calculates the correlation coefficient between the target attribute and other attributes using Amazon Athena.
=== Sample Query ===
SELECT corr(age,y) AS correlation_age_and_target,
corr(duration,y) AS correlation_duration_and_target,
corr(campaign,y) AS correlation_campaign_and_target,
corr(contact,y) AS correlation_contact_and_target
FROM ( SELECT age , duration , campaign , y ,
CASE WHEN contact = 'telephone' THEN 1 ELSE 0 END AS contact
FROM datasource
) datasource ;
Conclusion
In this post, I introduce an example of how to analyze data in DynamoDB by using table data in Amazon S3 to optimize DynamoDB table read capacity. You can then use the analyzed data as a new data source to train an Amazon SageMaker model for accurate real-time prediction. In addition, you can run ad hoc queries against the data on S3 using Amazon Athena. I also present how to automate these procedures by using Data Pipeline.
You can adapt this example to your specific use case at hand, and hopefully this post helps you accelerate your development. You can find more examples and use cases for Amazon SageMaker in the video AWS 2017: Introducing Amazon SageMaker on the AWS website.
Yong Seong Lee is a Cloud Support Engineer for AWS Big Data Services. He is interested in every technology related to data/databases and helping customers who have difficulties in using AWS services. His motto is “Enjoy life, be curious and have maximum experience.”
Google has announced the open-sourcing of gVisor, a sandboxed container runtime. “gVisor is more lightweight than a VM while maintaining a similar level of isolation. The core of gVisor is a kernel that runs as a normal, unprivileged process that supports most Linux system calls. This kernel is written in Go, which was chosen for its memory- and type-safety. Just like within a VM, an application running in a gVisor sandbox gets its own kernel and set of virtualized devices, distinct from the host and other sandboxes.”
AWS Glue provides enhanced support for working with datasets that are organized into Hive-style partitions. AWS Glue crawlers automatically identify partitions in your Amazon S3 data. The AWS Glue ETL (extract, transform, and load) library natively supports partitions when you work with DynamicFrames. DynamicFrames represent a distributed collection of data without requiring you to specify a schema. You can now push down predicates when creating DynamicFrames to filter out partitions and avoid costly calls to S3. We have also added support for writing DynamicFrames directly into partitioned directories without converting them to Apache Spark DataFrames.
Partitioning has emerged as an important technique for organizing datasets so that they can be queried efficiently by a variety of big data systems. Data is organized in a hierarchical directory structure based on the distinct values of one or more columns. For example, you might decide to partition your application logs in Amazon S3 by date—broken down by year, month, and day. Files corresponding to a single day’s worth of data would then be placed under a prefix such as s3://my_bucket/logs/year=2018/month=01/day=23/.
Systems like Amazon Athena, Amazon Redshift Spectrum, and now AWS Glue can use these partitions to filter data by value without making unnecessary calls to Amazon S3. This can significantly improve the performance of applications that need to read only a few partitions.
In this post, we show you how to efficiently process partitioned datasets using AWS Glue. First, we cover how to set up a crawler to automatically scan your partitioned dataset and create a table and partitions in the AWS Glue Data Catalog. Then, we introduce some features of the AWS Glue ETL library for working with partitioned data. You can now filter partitions using SQL expressions or user-defined functions to avoid listing and reading unnecessary data from Amazon S3. We’ve also added support in the ETL library for writing AWS Glue DynamicFrames directly into partitions without relying on Spark SQL DataFrames.
Let’s get started!
Crawling partitioned data
In this example, we use the same GitHub archive dataset that we introduced in a previous post about Scala support in AWS Glue. This data, which is publicly available from the GitHub archive, contains a JSON record for every API request made to the GitHub service. A sample dataset containing one month of activity from January 2017 is available at the following location:
Here you can replace <region> with the AWS Region in which you are working, for example, us-east-1. This dataset is partitioned by year, month, and day, so an actual file will be at a path like the following:
To crawl this data, you can either follow the instructions in the AWS Glue Developer Guide or use the provided AWS CloudFormation template. This template creates a stack that contains the following:
An IAM role with permissions to access AWS Glue resources
A database in the AWS Glue Data Catalog named githubarchive_month
A crawler set up to crawl the GitHub dataset
An AWS Glue development endpoint (which is used in the next section to transform the data)
To run this template, you must provide an S3 bucket and prefix where you can write output data in the next section. The role that this template creates will have permission to write to this bucket only. You also need to provide a public SSH key for connecting to the development endpoint. For more information about creating an SSH key, see our Development Endpoint tutorial. After you create the AWS CloudFormation stack, you can run the crawler from the AWS Glue console.
In addition to inferring file types and schemas, crawlers automatically identify the partition structure of your dataset and populate the AWS Glue Data Catalog. This ensures that your data is correctly grouped into logical tables and makes the partition columns available for querying in AWS Glue ETL jobs or query engines like Amazon Athena.
After you crawl the table, you can view the partitions by navigating to the table in the AWS Glue console and choosing View partitions. The partitions should look like the following:
For partitioned paths in Hive-style of the form key=val, crawlers automatically populate the column name. In this case, because the GitHub data is stored in directories of the form 2017/01/01, the crawlers use default names like partition_0, partition_1, and so on. You can easily change these names on the AWS Glue console: Navigate to the table, choose Edit schema, and rename partition_0 to year, partition_1 to month, and partition_2 to day:
Now that you’ve crawled the dataset and named your partitions appropriately, let’s see how to work with partitioned data in an AWS Glue ETL job.
Transforming and filtering the data
To get started with the AWS Glue ETL libraries, you can use an AWS Glue development endpoint and an Apache Zeppelin notebook. AWS Glue development endpoints provide an interactive environment to build and run scripts using Apache Spark and the AWS Glue ETL library. They are great for debugging and exploratory analysis, and can be used to develop and test scripts before migrating them to a recurring job.
If you ran the AWS CloudFormation template in the previous section, then you already have a development endpoint named partition-endpoint in your account. Otherwise, you can follow the instructions in this development endpoint tutorial. In either case, you need to set up an Apache Zeppelin notebook, either locally, or on an EC2 instance. You can find more information about development endpoints and notebooks in the AWS Glue Developer Guide.
The following examples are all written in the Scala programming language, but they can all be implemented in Python with minimal changes.
Reading a partitioned dataset
To get started, let’s read the dataset and see how the partitions are reflected in the schema. First, you import some classes that you will need for this example and set up a GlueContext, which is the main class that you will use to read and write data.
Execute the following in a Zeppelin paragraph, which is a unit of executable code:
%spark
import com.amazonaws.services.glue.DynamicFrame import com.amazonaws.services.glue.DynamicRecord
import com.amazonaws.services.glue.GlueContext
import com.amazonaws.services.glue.util.JsonOptions import org.apache.spark.SparkContext
import java.util.Calendar
import java.util.GregorianCalendar
import scala.collection.JavaConversions._
@transient val spark: SparkContext = SparkContext.getOrCreate()
val glueContext: GlueContext = new GlueContext(spark)
This is straightforward with two caveats: First, each paragraph must start with the line %spark to indicate that the paragraph is Scala. Second, the spark variable must be marked @transient to avoid serialization issues. This is only necessary when running in a Zeppelin notebook.
Next, read the GitHub data into a DynamicFrame, which is the primary data structure that is used in AWS Glue scripts to represent a distributed collection of data. A DynamicFrame is similar to a Spark DataFrame, except that it has additional enhancements for ETL transformations. DynamicFrames are discussed further in the post AWS Glue Now Supports Scala Scripts, and in the AWS Glue API documentation.
The following snippet creates a DynamicFrame by referencing the Data Catalog table that you just crawled and then prints the schema:
%spark
val githubEvents: DynamicFrame = glueContext.getCatalogSource(
database = "githubarchive_month",
tableName = "data"
).getDynamicFrame()
githubEvents.schema.asFieldList.foreach { field =>
println(s"${field.getName}: ${field.getType.getType.getName}")
}
You could also print the full schema using githubEvents.printSchema(). But in this case, the full schema is quite large, so I’ve printed only the top-level columns. This paragraph takes about 5 minutes to run on a standard size AWS Glue development endpoint. After it runs, you should see the following output:
Note that the partition columns year, month, and day were automatically added to each record.
Filtering by partition columns
One of the primary reasons for partitioning data is to make it easier to operate on a subset of the partitions, so now let’s see how to filter data by the partition columns. In particular, let’s find out what people are building in their free time by looking at GitHub activity on the weekends. One way to accomplish this is to use the filter transformation on the githubEvents DynamicFrame that you created earlier to select the appropriate events:
%spark
def filterWeekend(rec: DynamicRecord): Boolean = {
def getAsInt(field: String): Int = {
rec.getField(field) match {
case Some(strVal: String) => strVal.toInt
// The filter transformation will catch exceptions and mark the record as an error.
case _ => throw new IllegalArgumentException(s"Unable to extract field $field")
}
}
val (year, month, day) = (getAsInt("year"), getAsInt("month"), getAsInt("day"))
val cal = new GregorianCalendar(year, month - 1, day) // Calendar months start at 0.
val dayOfWeek = cal.get(Calendar.DAY_OF_WEEK)
dayOfWeek == Calendar.SATURDAY || dayOfWeek == Calendar.SUNDAY
}
val filteredEvents = githubEvents.filter(filterWeekend)
filteredEvents.count
This snippet defines the filterWeekend function that uses the Java Calendar class to identify those records where the partition columns (year, month, and day) fall on a weekend. If you run this code, you see that there were 6,303,480 GitHub events falling on the weekend in January 2017, out of a total of 29,160,561 events. This seems reasonable—about 22 percent of the events fell on the weekend, and about 29 percent of the days that month fell on the weekend (9 out of 31). So people are using GitHub slightly less on the weekends, but there is still a lot of activity!
Predicate pushdowns for partition columns
The main downside to using the filter transformation in this way is that you have to list and read all files in the entire dataset from Amazon S3 even though you need only a small fraction of them. This is manageable when dealing with a single month’s worth of data. But as you try to process more data, you will spend an increasing amount of time reading records only to immediately discard them.
To address this issue, we recently released support for pushing down predicates on partition columns that are specified in the AWS Glue Data Catalog. Instead of reading the data and filtering the DynamicFrame at executors in the cluster, you apply the filter directly on the partition metadata available from the catalog. Then you list and read only the partitions from S3 that you need to process.
To accomplish this, you can specify a Spark SQL predicate as an additional parameter to the getCatalogSource method. This predicate can be any SQL expression or user-defined function as long as it uses only the partition columns for filtering. Remember that you are applying this to the metadata stored in the catalog, so you don’t have access to other fields in the schema.
The following snippet shows how to use this functionality to read only those partitions occurring on a weekend:
%spark
val partitionPredicate =
"date_format(to_date(concat(year, '-', month, '-', day)), 'E') in ('Sat', 'Sun')"
val pushdownEvents = glueContext.getCatalogSource(
database = "githubarchive_month",
tableName = "data",
pushDownPredicate = partitionPredicate).getDynamicFrame()
Here you use the SparkSQL string concat function to construct a date string. You use the to_date function to convert it to a date object, and the date_format function with the ‘E’ pattern to convert the date to a three-character day of the week (for example, Mon, Tue, and so on). For more information about these functions, Spark SQL expressions, and user-defined functions in general, see the Spark SQL documentation and list of functions.
Note that the pushdownPredicate parameter is also available in Python. The corresponding call in Python is as follows:
You can observe the performance impact of pushing down predicates by looking at the execution time reported for each Zeppelin paragraph. The initial approach using a Scala filter function took 2.5 minutes:
Because the version using a pushdown lists and reads much less data, it takes only 24 seconds to complete, a 5X improvement!
Of course, the exact benefit that you see depends on the selectivity of your filter. The more partitions that you exclude, the more improvement you will see.
In addition to Hive-style partitioning for Amazon S3 paths, Parquet and ORC file formats further partition each file into blocks of data that represent column values. Each block also stores statistics for the records that it contains, such as min/max for column values. AWS Glue supports pushdown predicates for both Hive-style partitions and block partitions in these formats. While reading data, it prunes unnecessary S3 partitions and also skips the blocks that are determined unnecessary to be read by column statistics in Parquet and ORC formats.
Additional transformations
Now that you’ve read and filtered your dataset, you can apply any additional transformations to clean or modify the data. For example, you could augment it with sentiment analysis as described in the previous AWS Glue post.
To keep things simple, you can just pick out some columns from the dataset using the ApplyMapping transformation:
ApplyMapping is a flexible transformation for performing projection and type-casting. In this example, we use it to unnest several fields, such as actor.login, which we map to the top-level actor field. We also cast the id column to a long and the partition columns to integers.
Writing out partitioned data
The final step is to write out your transformed dataset to Amazon S3 so that you can process it with other systems like Amazon Athena. By default, when you write out a DynamicFrame, it is not partitioned—all the output files are written at the top level under the specified output path. Until recently, the only way to write a DynamicFrame into partitions was to convert it into a Spark SQL DataFrame before writing. We are excited to share that DynamicFrames now support native partitioning by a sequence of keys.
You can accomplish this by passing the additional partitionKeys option when creating a sink. For example, the following code writes out the dataset that you created earlier in Parquet format to S3 in directories partitioned by the type field.
Here, $outpath is a placeholder for the base output path in S3. The partitionKeys parameter can also be specified in Python in the connection_options dict:
When you execute this write, the type field is removed from the individual records and is encoded in the directory structure. To demonstrate this, you can list the output path using the aws s3 ls command from the AWS CLI:
PRE type=CommitCommentEvent/
PRE type=CreateEvent/
PRE type=DeleteEvent/
PRE type=ForkEvent/
PRE type=GollumEvent/
PRE type=IssueCommentEvent/
PRE type=IssuesEvent/
PRE type=MemberEvent/
PRE type=PublicEvent/
PRE type=PullRequestEvent/
PRE type=PullRequestReviewCommentEvent/
PRE type=PushEvent/
PRE type=ReleaseEvent/
PRE type=WatchEvent/
As expected, there is a partition for each distinct event type. In this example, we partitioned by a single value, but this is by no means required. For example, if you want to preserve the original partitioning by year, month, and day, you could simply set the partitionKeys option to be Seq(“year”, “month”, “day”).
Conclusion
In this post, we showed you how to work with partitioned data in AWS Glue. Partitioning is a crucial technique for getting the most out of your large datasets. Many tools in the AWS big data ecosystem, including Amazon Athena and Amazon Redshift Spectrum, take advantage of partitions to accelerate query processing. AWS Glue provides mechanisms to crawl, filter, and write partitioned data so that you can structure your data in Amazon S3 however you want, to get the best performance out of your big data applications.
Ben Sowell is a senior software development engineer at AWS Glue. He has worked for more than 5 years on ETL systems to help users unlock the potential of their data. In his free time, he enjoys reading and exploring the Bay Area.
Mohit Saxena is a senior software development engineer at AWS Glue. His passion is building scalable distributed systems for efficiently managing data on cloud. He also enjoys watching movies and reading about the latest technology.
User authentication is the functionality that every web application shared. We should have perfected that a long time ago, having implemented it so many times. And yet there are so many mistakes made all the time.
Part of the reason for that is that the list of things that can go wrong is long. You can store passwords incorrectly, you can have a vulnerably password reset functionality, you can expose your session to a CSRF attack, your session can be hijacked, etc. So I’ll try to compile a list of best practices regarding user authentication. OWASP top 10 is always something you should read, every year. But that might not be enough.
So, let’s start. I’ll try to be concise, but I’ll include as much of the related pitfalls as I can cover – e.g. what could go wrong with the user session after they login:
Store passwords with bcrypt/scrypt/PBKDF2. No MD5 or SHA, as they are not good for password storing. Long salt (per user) is mandatory (the aforementioned algorithms have it built in). If you don’t and someone gets hold of your database, they’ll be able to extract the passwords of all your users. And then try these passwords on other websites.
Use HTTPS. Period. (Otherwise user credentials can leak through unprotected networks). Force HTTPS if user opens a plain-text version.
Mark cookies as secure. Makes cookie theft harder.
Use CSRF protection (e.g. CSRF one-time tokens that are verified with each request). Frameworks have such functionality built-in.
Disallow framing (X-Frame-Options: DENY). Otherwise your website may be included in another website in a hidden iframe and “abused” through javascript.
Logout – let your users logout by deleting all cookies and invalidating the session. This makes usage of shared computers safer (yes, users should ideally use private browsing sessions, but not all of them are that savvy)
Session expiry – don’t have forever-lasting sessions. If the user closes your website, their session should expire after a while. “A while” may still be a big number depending on the service provided. For ajax-heavy website you can have regular ajax-polling that keeps the session alive while the page stays open.
Remember me – implementing “remember me” (on this machine) functionality is actually hard due to the risks of a stolen persistent cookie. Spring-security uses this approach, which I think should be followed if you wish to implement more persistent logins.
Forgotten password flow – the forgotten password flow should rely on sending a one-time (or expiring) link to the user and asking for a new password when it’s opened. 0Auth explain it in this post and Postmark gives some best pracitces. How the link is formed is a separate discussion and there are several approaches. Store a password-reset token in the user profile table and then send it as parameter in the link. Or do not store anything in the database, but send a few params: userId:expiresTimestamp:hmac(userId+expiresTimestamp). That way you have expiring links (rather than one-time links). The HMAC relies on a secret key, so the links can’t be spoofed. It seems there’s no consensus, as the OWASP guide has a bit different approach
One-time login links – this is an option used by Slack, which sends one-time login links instead of asking users for passwords. It relies on the fact that your email is well guarded and you have access to it all the time. If your service is not accessed to often, you can have that approach instead of (rather than in addition to) passwords.
Limit login attempts – brute-force through a web UI should not be possible; therefore you should block login attempts if they become too many. One approach is to just block them based on IP. The other one is to block them based on account attempted. (Spring example here). Which one is better – I don’t know. Both can actually be combined. Instead of fully blocking the attempts, you may add a captcha after, say, the 5th attempt. But don’t add the captcha for the first attempt – it is bad user experience.
Don’t leak information through error messages – you shouldn’t allow attackers to figure out if an email is registered or not. If an email is not found, upon login report just “Incorrect credentials”. On passwords reset, it may be something like “If your email is registered, you should have received a password reset email”. This is often at odds with usability – people don’t often remember the email they used to register, and the ability to check a number of them before getting in might be important. So this rule is not absolute, though it’s desirable, especially for more critical systems.
Consider using a 3rd party authentication – OpenID Connect, OAuth by Google/Facebook/Twitter (but be careful with OAuth flaws as well). There’s an associated risk with relying on a 3rd party identity provider, and you still have to manage cookies, logout, etc., but some of the authentication aspects are simplified.
For high-risk or sensitive applications use 2-factor authentication. There’s a caveat with Google Authenticator though – if you lose your phone, you lose your accounts (unless there’s a manual process to restore it). That’s why Authy seems like a good solution for storing 2FA keys.
I’m sure I’m missing something. And you see it’s complicated. Sadly we’re still at the point where the most common functionality – authenticating users – is so tricky and cumbersome, that you almost always get at least some of it wrong.
What happens when you combine the Internet of Things, Machine Learning, and Edge Computing? Before I tell you, let’s review each one and discuss what AWS has to offer.
Internet of Things (IoT) – Devices that connect the physical world and the digital one. The devices, often equipped with one or more types of sensors, can be found in factories, vehicles, mines, fields, homes, and so forth. Important AWS services include AWS IoT Core, AWS IoT Analytics, AWS IoT Device Management, and Amazon FreeRTOS, along with others that you can find on the AWS IoT page.
Machine Learning (ML) – Systems that can be trained using an at-scale dataset and statistical algorithms, and used to make inferences from fresh data. At Amazon we use machine learning to drive the recommendations that you see when you shop, to optimize the paths in our fulfillment centers, fly drones, and much more. We support leading open source machine learning frameworks such as TensorFlow and MXNet, and make ML accessible and easy to use through Amazon SageMaker. We also provide Amazon Rekognition for images and for video, Amazon Lex for chatbots, and a wide array of language services for text analysis, translation, speech recognition, and text to speech.
Edge Computing – The power to have compute resources and decision-making capabilities in disparate locations, often with intermittent or no connectivity to the cloud. AWS Greengrass builds on AWS IoT, giving you the ability to run Lambda functions and keep device state in sync even when not connected to the Internet.
ML Inference at the Edge Today I would like to toss all three of these important new technologies into a blender! You can now perform Machine Learning inference at the edge using AWS Greengrass. This allows you to use the power of the AWS cloud (including fast, powerful instances equipped with GPUs) to build, train, and test your ML models before deploying them to small, low-powered, intermittently-connected IoT devices running in those factories, vehicles, mines, fields, and homes that I mentioned.
Here are a few of the many ways that you can put Greengrass ML Inference to use:
Precision Farming – With an ever-growing world population and unpredictable weather that can affect crop yields, the opportunity to use technology to increase yields is immense. Intelligent devices that are literally in the field can process images of soil, plants, pests, and crops, taking local corrective action and sending status reports to the cloud.
Physical Security – Smart devices (including the AWS DeepLens) can process images and scenes locally, looking for objects, watching for changes, and even detecting faces. When something of interest or concern arises, the device can pass the image or the video to the cloud and use Amazon Rekognition to take a closer look.
Industrial Maintenance – Smart, local monitoring can increase operational efficiency and reduce unplanned downtime. The monitors can run inference operations on power consumption, noise levels, and vibration to flag anomalies, predict failures, detect faulty equipment.
Greengrass ML Inference Overview There are several different aspects to this new AWS feature. Let’s take a look at each one:
Machine Learning Models – Precompiled TensorFlow and MXNet libraries, optimized for production use on the NVIDIA Jetson TX2 and Intel Atom devices, and development use on 32-bit Raspberry Pi devices. The optimized libraries can take advantage of GPU and FPGA hardware accelerators at the edge in order to provide fast, local inferences.
Model Building and Training – The ability to use Amazon SageMaker and other cloud-based ML tools to build, train, and test your models before deploying them to your IoT devices. To learn more about SageMaker, read Amazon SageMaker – Accelerated Machine Learning.
Model Deployment – SageMaker models can (if you give them the proper IAM permissions) be referenced directly from your Greengrass groups. You can also make use of models stored in S3 buckets. You can add a new machine learning resource to a group with a couple of clicks:
This post courtesy of Michael Edge, Sr. Cloud Architect – AWS Professional Services
Applications built using a microservices architecture typically result in a number of independent, loosely coupled microservices communicating with each other, synchronously via their APIs and asynchronously via events. These microservices are often owned by different product teams, and these teams may segregate their resources into different AWS accounts for reasons that include security, billing, and resource isolation. This can sometimes result in the following challenges:
Cross-account deployment: A single pipeline must deploy a microservice into multiple accounts; for example, a microservice must be deployed to environments such as DEV, QA, and PROD, all in separate accounts.
Cross-account lookup: During deployment, a resource deployed into one AWS account may need to refer to a resource deployed in another AWS account.
Cross-account communication: Microservices executing in one AWS account may need to communicate with microservices executing in another AWS account.
In this post, I look at ways to address these challenges using a sample application composed of a web application supported by two serverless microservices. The microservices are owned by different product teams and deployed into different accounts using AWS CodePipeline, AWS CloudFormation, and the Serverless Application Model (SAM). At runtime, the microservices communicate using an event-driven architecture that requires asynchronous, cross-account communication via an Amazon Simple Notification Service (Amazon SNS) topic.
Sample application
First, look at the sample application I use to demonstrate these concepts. In the following overview diagram, you can see the following:
The entire application consists of three main services:
A Booking microservice, owned by the Booking account.
An Airmiles microservice, owned by the Airmiles account.
A web application that uses the services exposed by both microservices, owned by the Web Channel account.
The Booking microservice creates flight bookings and publishes booking events to an SNS topic.
The Airmiles microservice consumes booking events from the SNS topic and uses the booking event to calculate the airmiles associated with the flight booking. It also supports querying airmiles for a specific flight booking.
The web application allows an end user to make flight bookings, view flights bookings, and view the airmiles associated with a flight booking.
In the sample application, the Booking and Airmiles microservices are implemented using AWS Lambda. Together with Amazon API Gateway, Amazon DynamoDB, and SNS, the sample application is completely serverless.
Sample Serverless microservices application
The typical booking flow would be triggered by an end user making a flight booking using the web application, which invokes the Booking microservice via its REST API. The Booking microservice persists the flight booking and publishes the booking event to an SNS topic to enable sharing of the booking with other interested consumers. In this sample application, the Airmiles microservice subscribes to the SNS topic and consumes the booking event, using the booking information to calculate the airmiles. In line with microservices best practices, both the Booking and Airmiles microservices store their information in their own DynamoDB tables, and expose an API (via API Gateway) that is used by the web application.
Setup
Before you delve into the details of the sample application, get the source code and deploy it.
Cross-account deployment of Lambda functions using CodePipeline has been previously discussed by my colleague Anuj Sharma in his post, Building a Secure Cross-Account Continuous Delivery Pipeline. This sample application builds upon the solution proposed by Anuj, using some of the same scripts and a similar account structure. To make it feasible for you to deploy the sample application, I’ve reduced the number of accounts needed down to three accounts by consolidating some of the services. In the following diagram, you can see the services used by the sample application require three accounts:
Tools: A central location for the continuous delivery/deployment services such as CodePipeline and AWS CodeBuild. To reduce the number of accounts required by the sample application, also deploy the AWS CodeCommit repositories here, though typically they may belong in a separate Dev account.
Booking: Account for the Booking microservice.
Airmiles: Account for the Airmiles microservice.
Without consolidation, the sample application may require up to 10 accounts: one for Tools, and three accounts each for Booking, Airmiles and Web Application (to support the DEV, QA, and PROD environments).
Account structure for sample application
To follow the rest of this post, clone the repository in step 1 below. To deploy the application on AWS, follow steps 2 and 3:
Follow the instructions in the repository README to build the CodePipeline pipelines and deploy the microservices and web application.
Challenge 1: Cross-account deployment using CodePipeline
Though the Booking pipeline executes in the Tools account, it deploys the Booking Lambda functions into the Booking account.
In the sample application code repository, open the ToolsAcct/code-pipeline.yaml CloudFormation template.
Scroll down to the Pipeline resource and look for the DeployToTest pipeline stage (shown below). There are two AWS Identity and Access Management (IAM) service roles used in this stage that allow cross-account activity. Both of these roles exist in the Booking account:
Under Actions.RoleArn, find the service role assumed by CodePipeline to execute this pipeline stage in the Booking account. The role referred to by the parameter NonProdCodePipelineActionServiceRole allows access to the CodePipeline artifacts in the S3 bucket in the Tools account, and also access to the AWS KMS key needed to encrypt/decrypt the artifacts.
Under Actions.Configuration.RoleArn, find the service role assumed by CloudFormation when it carries out the CHANGE_SET_REPLACE action in the Booking account.
These roles are created in the CloudFormation template NonProdAccount/toolsacct-codepipeline-cloudformation-deployer.yaml.
Challenge 2: Cross-account stack lookup using custom resources
Asynchronous communication is a fairly common pattern in microservices architectures. A publishing microservice publishes an event that consumers may be interested in, without any concern for who those consumers may be.
In the case of the sample application, the publisher is the Booking microservice, publishing a flight booking event onto an SNS topic that exists in the Booking account. The consumer is the Airmiles microservice in the Airmiles account. To enable the two microservices to communicate, the Airmiles microservice must look up the ARN of the Booking SNS topic at deployment time in order to set up a subscription to it.
To enable CloudFormation templates to be reused, you are not hardcoding resource names in the templates. Because you allow CloudFormation to generate a resource name for the Booking SNS topic, the Airmiles microservice CloudFormation template must look up the SNS topic name at stack creation time. It’s not possible to use cross-stack references, as these can’t be used across different accounts.
However, you can use a CloudFormation custom resource to achieve the same outcome. This is discussed in the next section. Using a custom resource to look up stack exports in another stack does not create a dependency between the two stacks, unlike cross-stack references. A stack dependency would prevent one stack being deleted if another stack depended upon it. For more information, see Fn::ImportValue.
Using a Lambda function as a custom resource to look up the booking SNS topic
The sample application uses a Lambda function as a custom resource. This approach is discussed in AWS Lambda-backed Custom Resources. Walk through the sample application and see how the custom resource is used to list the stack export variables in another account and return these values to the calling AWS CloudFormation stack. Examine each of the following aspects of the custom resource:
Deploying the custom resource.
Custom resource IAM role.
A calling CloudFormation stack uses the custom resource.
The custom resource assumes a role in another account.
The custom resource obtains the stack exports from all stacks in the other account.
The custom resource returns the stack exports to the calling CloudFormation stack.
The custom resource handles all the event types sent by the calling CloudFormation stack.
Step1: Deploying the custom resource
The custom Lambda function is deployed using SAM, as are the Lambda functions for the Booking and Airmiles microservices. See the CustomLookupExports resource in Custom/custom-lookup-exports.yml.
Step2: Custom resource IAM role
The CustomLookupExports resource in Custom/custom-lookup-exports.yml executes using the CustomLookupLambdaRole IAM role. This role allows the custom Lambda function to assume a cross account role that is created along with the other cross account roles in the NonProdAccount/toolsacct-codepipeline-cloudformation-deployer.yml. See the resource CustomCrossAccountServiceRole.
Step3: The CloudFormation stack uses the custom resource
The Airmiles microservice is created by the Airmiles CloudFormation template Airmiles/sam-airmile.yml, a SAM template that uses the custom Lambda resource to look up the ARN of the Booking SNS topic. The custom resource is specified by the CUSTOMLOOKUP resource, and the PostAirmileFunction resource uses the custom resource to look up the ARN of the SNS topic and create a subscription to it.
Because the Airmiles Lambda function is going to subscribe to an SNS topic in another account, it must grant the SNS topic in the Booking account permissions to invoke the Airmiles Lambda function in the Airmiles account whenever a new event is published. Permissions are granted by the LambdaResourcePolicy resource.
Step4: The custom resource assumes a role in another account
When the Lambda custom resource is invoked by the Airmiles CloudFormation template, it must assume a role in the Booking account (see Step 2) in order to query the stack exports in that account.
This can be seen in Custom/custom-lookup-exports.py, where the AWS Simple Token Service (AWS STS) is used to obtain a temporary access key to allow access to resources in the account referred to by the environment variable: ‘CUSTOM_CROSS_ACCOUNT_ROLE_ARN’. This environment variable is defined in the Custom/custom-lookup-exports.yml CloudFormation template, and refers to the role created in Step 2.
Step5: The custom resource obtains the stack exports from all stacks in the other account
The function get_exports() in Custom/custom-lookup-exports.py uses the AWS SDK to list all the stack exports in the Booking account.
Step 6: The custom resource returns the stack exports to the calling CloudFormation stack
The calling CloudFormation template, Airmiles/sam-airmile.yml, uses the custom resource to look up a stack export with the name of booking-lambda-BookingTopicArn. This is exported by the CloudFormation template Booking/sam-booking.yml.
Step7: The custom resource handles all the event types sent by the calling CloudFormation stack
Custom resources used in a stack are called whenever the stack is created, updated, or deleted. They must therefore handle CREATE, UPDATE, and DELETE stack events. If your custom resource fails to send a SUCCESS or FAILED notification for each of these stack events, your stack may become stuck in a state such as CREATE_IN_PROGRESS or UPDATE_ROLLBACK_IN_PROGRESS. To handle this cleanly, use the crhelper custom resource helper. This strongly encourages you to handle the CREATE, UPDATE, and DELETE CloudFormation stack events.
Challenge 3: Cross-account SNS subscription
The SNS topic is specified as a resource in the Booking/sam-booking.yml CloudFormation template. To allow an event published to this topic to trigger the Airmiles Lambda function, permissions must be granted on both the Booking SNS topic and the Airmiles Lambda function. This is discussed in the next section.
Permissions on booking SNS topic
SNS topics support resource-based policies, which allow a policy to be attached directly to a resource specifying who can access the resource. This policy can specify which accounts can access a resource, and what actions those accounts are allowed to perform. This is the same approach as used by a small number of AWS services, such as Amazon S3, KMS, Amazon SQS, and Lambda. In Booking/sam-booking.yml, the SNS topic policy allows resources in the Airmiles account (referenced by the parameter NonProdAccount in the following snippet) to subscribe to the Booking SNS topic:
The Airmiles microservice is created by the Airmiles/sam-airmile.yml CloudFormation template. The template uses SAM to specify the Lambda functions together with their associated API Gateway configurations. SAM hides much of the complexity of deploying Lambda functions from you.
For instance, by adding an ‘Events’ resource to the PostAirmileFunction in the CloudFormation template, the Airmiles Lambda function is triggered by an event published to the Booking SNS topic. SAM creates the SNS subscription, as well as the permissions necessary for the SNS subscription to trigger the Lambda function.
However, the Lambda permissions generated automatically by SAM are not sufficient for cross-account SNS subscription, which means you must specify an additional Lambda permissions resource in the CloudFormation template. LambdaResourcePolicy, in the following snippet, specifies that the Booking SNS topic is allowed to invoke the Lambda function PostAirmileFunction in the Airmiles account.
In this post, I showed you how product teams can use CodePipeline to deploy microservices into different AWS accounts and different environments such as DEV, QA, and PROD. I also showed how, at runtime, microservices in different accounts can communicate securely with each other using an asynchronous, event-driven architecture. This allows product teams to maintain their own AWS accounts for billing and security purposes, while still supporting communication with microservices in other accounts owned by other product teams.
Acknowledgments
Thanks are due to the following people for their help in preparing this post:
Kevin Yung for the great web application used in the sample application
Jay McConnell for crhelper, the custom resource helper used with the CloudFormation custom resources
Новото правителство на Австрия предприема стъпки за засилване на позициите си в медиите.
Обществената телевизия с най-голямата аудитория в Австрия – до 4 милиона зрители при население 8,7 милиона души – се финансира главно чрез данък, който правителството иска да отмени. Различни министри правят изявления, че не одобряват модела на финансиране на ORF. Заместник-канцлерът е най-директен, като нарича ORF място, където лъжите стават новини. Понятия като фалшиви новини и lügenpresse (лъжепреса) се използват за критичните публикации по подобие на употребата на термините от управляващите в САЩ.
Представители на ORF оценяват атаките като част от опитите на правителството да получи по-голямо политическо влияние чрез медийния сектор.В същото време медийният министър Блумел няколко пъти обявява публично, че правителството възнамерява да укрепи частните радио- и телевизионни медии.
Amazon EC2 Spot Instances are spare compute capacity in the AWS Cloud available to you at steep discounts compared to On-Demand prices. The only difference between On-Demand Instances and Spot Instances is that Spot Instances can be interrupted by Amazon EC2 with two minutes of notification when EC2 needs the capacity back.
Customers have been taking advantage of Spot Instance interruption notices available via the instance metadata service since January 2015 to orchestrate their workloads seamlessly around any potential interruptions. Examples include saving the state of a job, detaching from a load balancer, or draining containers. Needless to say, the two-minute Spot Instance interruption notice is a powerful tool when using Spot Instances.
In January 2018, the Spot Instance interruption notice also became available as an event in Amazon CloudWatch Events. This allows targets such as AWS Lambda functions or Amazon SNS topics to process Spot Instance interruption notices by creating a CloudWatch Events rule to monitor for the notice.
In this post, I walk through an example use case for taking advantage of Spot Instance interruption notices in CloudWatch Events to automatically deregister Spot Instances from an Elastic Load Balancing Application Load Balancer.
When any of the Spot Instances receives an interruption notice, Spot Fleet sends the event to CloudWatch Events. The CloudWatch Events rule then notifies both targets, the Lambda function and SNS topic. The Lambda function detaches the Spot Instance from the Application Load Balancer target group, taking advantage of nearly a full two minutes of connection draining before the instance is interrupted. The SNS topic also receives a message, and is provided as an example for the reader to use as an exercise.
To complete this walkthrough, have the AWS CLI installed and configured, as well as the ability to launch CloudFormation stacks.
Launch the stack
Go ahead and launch the CloudFormation stack. You can check it out from GitHub, or grab the template directly. In this post, I use the stack name “spot-spin-cwe“, but feel free to use any name you like. Just remember to change it in the instructions.
Here are the details of the architecture being launched by the stack.
IAM permissions
Give permissions to a few components in the architecture:
The Lambda function
The CloudWatch Events rule
The Spot Fleet
The Lambda function needs basic Lambda function execution permissions so that it can write logs to CloudWatch Logs. You can use the AWS managed policy for this. It also needs to describe EC2 tags as well as deregister targets within Elastic Load Balancing. You can create a custom policy for these.
Finally, Spot Fleet needs permissions to request Spot Instances, tag, and register targets in Elastic Load Balancing. You can tap into an AWS managed policy for this.
Because you are taking advantage of the two-minute Spot Instance notice, you can tune the Elastic Load Balancing target group deregistration timeout delay to match. When a target is deregistered from the target group, it is put into connection draining mode for the length of the timeout delay: 120 seconds to equal the two-minute notice.
To capture the Spot Instance interruption notice being published to CloudWatch Events, create a rule with two targets: the Lambda function and the SNS topic.
The Lambda function does the heavy lifting for you. The details of the CloudWatch event are published to the Lambda function, which then uses boto3 to make a couple of AWS API calls. The first call is to describe the EC2 tags for the Spot Instance, filtering on a key of “TargetGroupArn”. If this tag is found, the instance is then deregistered from the target group ARN stored as the value of the tag.
import boto3
def handler(event, context):
instanceId = event['detail']['instance-id']
instanceAction = event['detail']['instance-action']
try:
ec2client = boto3.client('ec2')
describeTags = ec2client.describe_tags(Filters=[{'Name': 'resource-id','Values':[instanceId],'Name':'key','Values':['loadBalancerTargetGroup']}])
except:
print("No action being taken. Unable to describe tags for instance id:", instanceId)
return
try:
elbv2client = boto3.client('elbv2')
deregisterTargets = elbv2client.deregister_targets(TargetGroupArn=describeTags['Tags'][0]['Value'],Targets=[{'Id':instanceId}])
except:
print("No action being taken. Unable to deregister targets for instance id:", instanceId)
return
print("Detaching instance from target:")
print(instanceId, describeTags['Tags'][0]['Value'], deregisterTargets, sep=",")
return
SNS topic
Finally, you’ve created an SNS topic as an example target. For example, you could subscribe an email address to the SNS topic in order to receive email notifications when a Spot Instance interruption notice is received.
To proceed to creating your Spot Fleet request, use some of the resources that the CloudFormation stack created, to populate the Spot Fleet request launch configuration. You can find the values in the outputs values of the CloudFormation stack:
You can confirm that the Spot Fleet request was fulfilled by checking that ActivityStatus is “fulfilled”, or by checking that FulfilledCapacity is greater than or equal to TargetCapacity, while describing the request:
In order to test, you can take advantage of the fact that any interruption action that Spot Fleet takes on a Spot Instance results in a Spot Instance interruption notice being provided. Therefore, you can simply decrease the target size of your Spot Fleet from 2 to 1. The instance that is interrupted receives the interruption notice:
As soon as the interruption notice is published to CloudWatch Events, the Lambda function triggers and detaches the instance from the target group, effectively putting the instance in a draining state.
In conclusion, Amazon EC2 Spot Instance interruption notices are an extremely powerful tool when taking advantage of Amazon EC2 Spot Instances in your workloads, for tasks such as saving state, draining connections, and much more. I’d love to hear how you are using them in your own environment!
Chad Schmutzer Solutions Architect
Chad Schmutzer is a Solutions Architect at Amazon Web Services based in Pasadena, CA. As an extension of the Amazon EC2 Spot Instances team, Chad helps customers significantly reduce the cost of running their applications, growing their compute capacity and throughput without increasing budget, and enabling new types of cloud computing applications.
As a follow up to our initial region availability on November 20, 2017, I’m happy to announce that we have expanded the number of accredited services available in the AWS Secret Region by an additional 11 services. We continue to be the only cloud service provider with accredited regions to address the full range of U.S. Department of Defense (DoD) data classifications, including Unclassified, Sensitive (CUI), Secret, and Top Secret.
When the region launched last November, we achieved a Provisional Authorization (PA) for Impact Level 6 (IL6) workloads from the U.S. Defense Information Systems Agency (DISA), the IT combat support organization of the DoD. The PA was recently extended, allowing for continued access to the region for IL6 workloads.
The AWS Secret Region was designed and built to meet the specific security requirements of secret classified workloads for the DoD and the intelligence community. A service catalog for the region is available through your AWS Account Executive.
Amazon GameLift is a scalable, cloud-based runtime environment for session-based multiplayer games. You simply upload a build of your game, tell Amazon GameLift which type of EC2 instances you’d like to host it on, and sit back while Amazon GameLift takes care of setting up sessions and maintaining a suitably-sized fleet of EC2 instances. This automatic scaling allows you to accommodate demand that varies over time without having to keep compute resources in reserve during quiet periods.
Use Spot Instances Last week we added a new feature to further decrease your per-player, per-hour costs when you host your game on Amazon GameLift. Before that launch, Amazon GameLift instances were always launched in On-Demand form. Instances of this type are always billed at fixed prices, as detailed on the Amazon GameLift Pricing page.
You can now make use Amazon GameLift Spot Instances in your GameLift fleets. These instances represent unused capacity and have prices that rise and fall over time. While your results will vary, you may see savings of up to 90% when compared to On-Demand Instances.
While you can use Spot Instances as a simple money-saving tool, there are other interesting use cases as well. Every game has a life cycle, along with a cadre of loyal players who want to keep on playing until you finally unplug and decommission the servers. You could create an Amazon GameLift fleet comprised of low-cost Spot Instances and keep that beloved game up and running as long as possible without breaking the bank. Behind the scenes, an Amazon GameLift Queue will make use of both Spot and On-Demand Instances, balancing price and availability in an attempt to give you the best possible service at the lowest price.
As I mentioned earlier, Spot Instances represent capacity that is not in use by On-Demand Instances. When this capacity decreases, existing Spot Instances could be interrupted with two minutes of notification and then terminated. Fortunately, there’s a lot of capacity and terminations are, statistically speaking, quite rare. To reduce the frequency even further, Amazon GameLift Queues now include a new feature that we call FleetIQ.
FleetIQ is powered by historical pricing and termination data for Spot Instances. This data, in combination with a very conservative strategy for choosing instance types, further reduces the odds that any particular game will be notified and then interrupted. The onProcessTerminate callback in your game’s server process will be activated if the underlying Spot Instance is about to be interrupted. At that point you have two minutes to close out the game, save any logs, free up any resources, and otherwise wrap things up. While you are doing this, you can call GetTerminationTime to see how much time remains.
Creating a Fleet To take advantage of Spot Instances and FleetIQ, you can use the Amazon GameLift console or API to set up Queues with multiple fleets of Spot and On-Demand Instances. By adding more fleets into each Queue, you give FleetIQ more options to improve latency, interruption rate, and cost. To start a new game session on an instance, FleetIQ first selects the region with the lowest latency for each player, then chooses the fleet with the lowest interruption rate and cost.
Let’s walk through the process. I’ll create a fleet of On-Demand Instances and a fleet of Spot Instances, in that order:
And:
I take a quick break while the fleets are validated and activated:
Then I create a queue for my game. I select the fleets as the destinations for the queue:
If I am building a game that will have a global user base, I can create fleets in additional AWS Regions and use a player latency policy so that game sessions will be created in a suitable region:
Now Available You can use Amazon GameLift Spot Instance fleets to host your session-based games now! Take a look, give it a try, and let me know what you think.
If you are planning to attend GDC this year, be sure to swing by booth 1001. Check out our GDC 2018 site for more information on our dev day talks, classroom sessions, and in-booth demos.
In an unprecedented move, I did one thing for an entire calendar week. I say “unprecedented” but I guess the same thing happened with fox flux. And NEONPHASE. Hmm. Sensing a pattern. See if you can guess what the one thing was!
anise!!: Wow! It’s Anise! The game has come so far that I can’t even believe that any of this was a recent change. I made monster AI vastly more sensible, added a boatload of mechanics, fleshed out more than half the map (and sketched out the rest), and drew and implemented most of a menu with a number of excellent goodies. Also, FINALLY (after a full year of daydreaming about it), eliminated the terrible “clock” structure I invented for collision detection, as well as cut down on a huge source of completely pointless allocations, which sped physics up in general by at least 10% and cut GC churn significantly. Hooray! And I’ve done even more just in the last day and a half. Still a good bit of work left, but this game is gonna be fantastic.
art: Oh right I tried drawing a picture but I didn’t like it so I stopped.
I have some writing to catch up on — I have several things 80% written, but had to stop because I was just starting to get a cold and couldn’t even tell if my own writing was sensible any more. And then I had to work on a video game about my cat. Sorry. Actually, not sorry, video games about my cat are always top priority. You knew what you were signing up for.
Earlier this year, in February, I wrote a blog post encouraging people to donate to where I work, Software Freedom Conservancy. I’ve not otherwise blogged too much this year. It’s been a rough year for many reasons, and while I personally and Conservancy in general have accomplished some very important work this year, I’m reminded as always that more resources do make things easier.
I understand the urge, given how bad the larger political crises have gotten, to want to give to charities other than those related to software freedom. There are important causes out there that have become more urgent this year. Here’s three issues which have become shockingly more acute this year:
making sure the USA keeps it commitment to immigrants to allow them make a new life here just like my own ancestors did,
assuring that the great national nature reserves are maintained and left pristine for generations to come,
assuring that we have zero tolerance abusive behavior — particularly by those in power against people who come to them for help and job opportunities.
These are just three of the many issues this year that I’ve seen get worse, not better. I am glad that I know and support people who work on these issues, and I urge everyone to work on these issues, too.
Nevertheless, as I plan my primary donations this year, I’m again, as I always do, giving to the FSF and my own employer, Software Freedom Conservancy. The reason is simple: software freedom is still an essential cause and it is frankly one that most people don’t understand (yet). I wrote almost two years ago about the phenomenon I dubbed Kuhn’s Paradox. Simply put: it keeps getting more and more difficult to avoid proprietary software in a normal day’s tasks, even while the number of lines of code licensed freely gets larger every day.
As long as that paradox remains true, I see software freedom as urgent. I know that we’re losing ground on so many other causes, too. But those of you who read my blog are some of the few people in the world that understand that software freedom is under threat and needs the urgent work that the very few software-freedom-related organizations, like the FSF and Software Freedom Conservancy are doing. I hope you’ll donate now to both of them. For my part, I gave $120 myself to FSF as part of the monthly Associate Membership program, and in a few minutes, I’m going to give $400 to Conservancy. I’ll be frank: if you work in technology in an industrialized country, I’m quite sure you can afford that level of money, and I suspect those amounts are less than most of you spent on technology equipment and/or network connectivity charges this year. Make a difference for us and give to the cause of software freedom at least as much a you’re giving to large technology companies.
Finally, a good reason to give to smaller charities like FSF and Conservancy is that your donation makes a bigger difference. I do think bigger organizations, such as (to pick an example of an organization I used to give to) my local NPR station does important work. However, I was listening this week to my local NPR station, and they said their goal for that day was to raise $50,000. For Conservancy, that’s closer to a goal we have for entire fundraising season, which for this year was $75,000. The thing is: NPR is an important part of USA society, but it’s one that nearly everyone understands. So few people understand the threats looming from proprietary software, and they may not understand at all until it’s too late — when all their devices are locked down, DRM is fully ubiquitous, and no one is allowed to tinker with the software on their devices and learn the wonderful art of computer programming. We are at real risk of reaching that distopia before 90% of the world’s population understands the threat!
Thus, giving to organizations in the area of software freedom is just going to have a bigger and more immediate impact than more general causes that more easily connect with people. You’re giving to prevent a future that not everyone understands yet, and making an impact on our work to help explain the dangers to the larger population.
Jann Horn has reported eight bugs in the eBPF verifier, one for the 4.9 kernel and seven introduced in 4.14, to the oss-security mailing list. Some of these bugs result in eBPF programs being able to read and write arbitrary kernel memory, thus can be used for a variety of ill effects, including privilege escalation. As Ben Hutchings notes, one mitigation would be to disable unprivileged access to BPF using the following sysctl: kernel.unprivileged_bpf_disabled=1. More information can also be found in this Project Zero bug entry. The fixes are not yet in the mainline tree, but are in the netdev tree. Hutchings goes on to say: “There is a public exploit that uses several of these bugs to get root privileges. It doesn’t work as-is on stretch [Debian 9] with the Linux 4.9 kernel, but is easy to adapt. I recommend applying the above mitigation as soon as possible to all systems running Linux 4.4 or later.”
We find that AWS customers often require that every query submitted to Presto running on Amazon EMR is logged. They want to track what query was submitted, when it was submitted and who submitted it.
In this blog post, we will demonstrate how to implement and install a Presto event listener for purposes of custom logging, debugging and performance analysis for queries executed on an EMR cluster. An event listener is a plugin to Presto that is invoked when an event such as query creation, query completion, or split completion occurs.
Presto also provides a system connector to access metrics and details about a running cluster. In particular, the system connector gets information about currently running and recently run queries by using the system.runtime.queries table. From the Presto command line interface (CLI), you get this data with the entries Select * from system.runtime.queries; and Select * from system.runtime.tasks;. This connector is available out of the box and exposes information and metrics through SQL.
In addition to providing custom logging and debugging, the Presto event listener allows you to enhance the information provided by the system connector by providing a mechanism to capture detailed query context, query statistics and failure information during query creation or completion, or a split completion.
We will begin by providing a detailed walkthrough of the implementation of the Presto event listener in Java followed by its deployment on the EMR cluster.
Implementation
We use the Eclipse IDE to create a Maven Project, as shown below:
Once you have created the Maven Project, modify the pom.xml file to add the dependency for Presto, as shown following:
After you add the Presto dependency to our pom.xml file, create a Java package under the src/main/java folder. In our project, we have named the package com.amazonaws.QueryEventListener. You can choose the naming convention that best fits your organization. Within this package, create three Java files for the EventListener, the EventListenerFactory, and the EventListenerPlugin.
As the Presto website says: “EventListenerFactory is responsible for creating an EventListener instance. It also defines an EventListener name, which is used by the administrator in a Presto configuration. Implementations of EventListener implement methods for the event types they are interested in handling. The implementation of EventListener and EventListenerFactory must be wrapped as a plugin and installed on the Presto cluster.”
In our project, we have named these Java files QueryEventListener, QueryEventListenerFactory, and QueryEventListenerPlugin:
Now we write our code for the Java files.
QueryEventListener – QueryEventListener implements the Presto EventListener interface. It has a constructor that creates five rotating log files of 524 MB each. After creating QueryEventListener, we implement the query creation, query completion, and split completion methods and log the events relevant to us. You can choose to include more events based on your needs.
QueryEventListenerFactory – The QueryEventListenerFactory class implements the Presto EventListenerFactory interface. We implement the method getName, which provides a registered EventListenerFactory name to Presto. We also implement the create method, which creates an EventListener instance.
You can find the code for the QueryEventListenerFactory class in this Amazon repository.
QueryEventListenerPlugin – The QueryEventListenerPlugin class implements the Presto EventListenerPlugin interface. This class has a getEventListenerFactories method that returns an immutable list containing the EventListenerFactory. Basically, in this class we are wrapping QueryEventListener and QueryEventListenerFactory.
Finally, in our project we add the META-INF folder and a services subfolder within the META-INF folder. In the services subfolder, you create a file called com.facebook.presto.spi.Plugin:
As the Presto documentation describes: “Each plugin identifies an entry point: an implementation of the plugin interface. This class name is provided to Presto via the standard Java ServiceLoader interface: the classpath contains a resource file named com.facebook.presto.spi.Plugin in the META-INF/services directory”.
We add the name of our plugin class com.amazonaws.QueryEventListener.QueryEventListenerPlugin to the com.facebook.presto.spi.Plugin file, as shown below:
Next we will show you how to deploy the Presto plugin we created to Amazon EMR for custom logging.
Presto logging overview on Amazon EMR
Presto by default will produce three log files that capture the configurations properties and the overall operational events of the components that make up Presto, plus log end user access to the Presto UI.
On Amazon EMR, these log files are written into /var/log/presto. The log files in this directory are pushed into Amazon S3. This S3 location is the location of the new log file.
Steps to deploy Presto on Amazon EMR with custom logging
To deploy Presto on EMR with custom logging a bootstrap action will be used. The bootstrap is available in this Amazon Repository.
#!/bin/bash
IS_MASTER=true
if [ -f /mnt/var/lib/info/instance.json ]
then
if grep isMaster /mnt/var/lib/info/instance.json | grep true;
then
IS_MASTER=true
else
IS_MASTER=false
fi
fi
sudo mkdir -p /usr/lib/presto/plugin/queryeventlistener
sudo /usr/bin/aws s3 cp s3://replace-with-your-bucket/QueryEventListener.jar /tmp
sudo cp /tmp/QueryEventListener.jar /usr/lib/presto/plugin/queryeventlistener/
if [ "$IS_MASTER" = true ]; then
sudo mkdir -p /usr/lib/presto/etc
sudo bash -c 'cat <<EOT >> /usr/lib/presto/etc/event-listener.properties
event-listener.name=event-listener
EOT'
fi
First, upload the JAR file created on the last section and update the s3 location in the bootstrap, s3://replace-with-your-bucket/QueryEventListener.jar, with the bucket name where the jar was placed.
After updating the bootstrap with the S3 location for your JAR, upload that bootstrap to your own bucket.
The bootstrap action will copy the jar file with the custom EventListener implementation into all machines of the cluster. Moreover, the bootstrap action will create a file named event-listener.properties on the Amazon EMR Master node. This file will configure the coordinator to enable the custom logging plugin via property event-listener.name. The event-listener.name property is set to event-listener in the event-listener.properties file. As per Presto documentation, this property is used by Presto to find a registered EventListenerFactory based on the name returned by EventListenerFactory.getName().
Now that the bootstrap is ready, the following AWS CLI command can be used to create a new EMR cluster with the bootstrap:
After we have implemented the Presto custom logging plugin and deployed it, we can run a test and see the output of the log files.
We first create a Hive table with the DDL below from the Hive Command Line (simply type Hive on the SSH terminal to the EMR Master to access the Hive Command Line):
CREATE EXTERNAL TABLE wikistats (
language STRING,
page_title STRING,
hits BIGINT,
retrived_size BIGINT
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ' '
LINES TERMINATED BY '\n'
LOCATION 's3://support.elasticmapreduce/training/datasets/wikistats/';
Then, we access the presto command line by typing the following command on the terminal:
We then go to the /var/log/presto directory and look at the contents of the log file queries-YYYY-MM-DDTHH\:MM\:SS.0.log. As depicted in the screenshot below, our QueryEventListener plugin captures the fields shown for the Query Created and Query Completed events. Moreover, if there are splits, the plugin will also capture split events.
Note: If you want to include the query text executed by the user for auditing and debugging purposes, add the field appropriately in the QueryEventListener class methods, as shown below:
Because this is custom logging, you can capture as many fields as are available for the particular events. To find out the fields available for each of the events, see the Java Classes provided by Presto at this GitHub location.
Summary
In this post, you learned how to add custom logging to Presto on EMR to enhance your organization’s auditing capabilities and provide insights into performance.
If you have questions or suggestions, please leave a comment below.
Zafar Kapadia is a Cloud Application Architect for AWS. He works on Application Development and Optimization projects. He is also an avid cricketer and plays in various local leagues.
Francisco Oliveira is a Big Data Engineer with AWS Professional Services. He focuses on building big data solutions with open source technology and AWS. In his free time, he likes to try new sports, travel and explore national parks.
The collective thoughts of the interwebz
By continuing to use the site, you agree to the use of cookies. more information
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.