Tag Archives: ALA

[$] A filesystem “change journal” and other topics

Post Syndicated from jake original https://lwn.net/Articles/755277/rss

At the 2017 Linux Storage, Filesystem, and Memory-Management Summit
(LSFMM), Amir Goldstein presented his work
on adding a superblock watch mechanism to provide a scalable way to notify
applications
of changes in a filesystem. At the 2018 edition of LSFMM, he was back to
discuss adding NTFS-like change
journals
to the kernel in support of backup solutions of various
sorts. As a second topic for the session, he also wanted to discuss doing
more performance-regression testing
for filesystems.

EC2 Instance Update – M5 Instances with Local NVMe Storage (M5d)

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/ec2-instance-update-m5-instances-with-local-nvme-storage-m5d/

Earlier this month we launched the C5 Instances with Local NVMe Storage and I told you that we would be doing the same for additional instance types in the near future!

Today we are introducing M5 instances equipped with local NVMe storage. Available for immediate use in 5 regions, these instances are a great fit for workloads that require a balance of compute and memory resources. Here are the specs:

Instance NamevCPUsRAMLocal StorageEBS-Optimized BandwidthNetwork Bandwidth
m5d.large28 GiB1 x 75 GB NVMe SSDUp to 2.120 GbpsUp to 10 Gbps
m5d.xlarge416 GiB1 x 150 GB NVMe SSDUp to 2.120 GbpsUp to 10 Gbps
m5d.2xlarge832 GiB1 x 300 GB NVMe SSDUp to 2.120 GbpsUp to 10 Gbps
m5d.4xlarge1664 GiB1 x 600 GB NVMe SSD2.210 GbpsUp to 10 Gbps
m5d.12xlarge48192 GiB2 x 900 GB NVMe SSD5.0 Gbps10 Gbps
m5d.24xlarge96384 GiB4 x 900 GB NVMe SSD10.0 Gbps25 Gbps

The M5d instances are powered by Custom Intel® Xeon® Platinum 8175M series processors running at 2.5 GHz, including support for AVX-512.

You can use any AMI that includes drivers for the Elastic Network Adapter (ENA) and NVMe; this includes the latest Amazon Linux, Microsoft Windows (Server 2008 R2, Server 2012, Server 2012 R2 and Server 2016), Ubuntu, RHEL, SUSE, and CentOS AMIs.

Here are a couple of things to keep in mind about the local NVMe storage on the M5d instances:

Naming – You don’t have to specify a block device mapping in your AMI or during the instance launch; the local storage will show up as one or more devices (/dev/nvme*1 on Linux) after the guest operating system has booted.

Encryption – Each local NVMe device is hardware encrypted using the XTS-AES-256 block cipher and a unique key. Each key is destroyed when the instance is stopped or terminated.

Lifetime – Local NVMe devices have the same lifetime as the instance they are attached to, and do not stick around after the instance has been stopped or terminated.

Available Now
M5d instances are available in On-Demand, Reserved Instance, and Spot form in the US East (N. Virginia), US West (Oregon), EU (Ireland), US East (Ohio), and Canada (Central) Regions. Prices vary by Region, and are just a bit higher than for the equivalent M5 instances.

Jeff;

 

AWS Online Tech Talks – June 2018

Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-june-2018/

AWS Online Tech Talks – June 2018

Join us this month to learn about AWS services and solutions. New this month, we have a fireside chat with the GM of Amazon WorkSpaces and our 2nd episode of the “How to re:Invent” series. We’ll also cover best practices, deep dives, use cases and more! Join us and register today!

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

 

Analytics & Big Data

June 18, 2018 | 11:00 AM – 11:45 AM PTGet Started with Real-Time Streaming Data in Under 5 Minutes – Learn how to use Amazon Kinesis to capture, store, and analyze streaming data in real-time including IoT device data, VPC flow logs, and clickstream data.
June 20, 2018 | 11:00 AM – 11:45 AM PT – Insights For Everyone – Deploying Data across your Organization – Learn how to deploy data at scale using AWS Analytics and QuickSight’s new reader role and usage based pricing.

 

AWS re:Invent
June 13, 2018 | 05:00 PM – 05:30 PM PTEpisode 2: AWS re:Invent Breakout Content Secret Sauce – Hear from one of our own AWS content experts as we dive deep into the re:Invent content strategy and how we maintain a high bar.
Compute

June 25, 2018 | 01:00 PM – 01:45 PM PTAccelerating Containerized Workloads with Amazon EC2 Spot Instances – Learn how to efficiently deploy containerized workloads and easily manage clusters at any scale at a fraction of the cost with Spot Instances.

June 26, 2018 | 01:00 PM – 01:45 PM PTEnsuring Your Windows Server Workloads Are Well-Architected – Get the benefits, best practices and tools on running your Microsoft Workloads on AWS leveraging a well-architected approach.

 

Containers
June 25, 2018 | 09:00 AM – 09:45 AM PTRunning Kubernetes on AWS – Learn about the basics of running Kubernetes on AWS including how setup masters, networking, security, and add auto-scaling to your cluster.

 

Databases

June 18, 2018 | 01:00 PM – 01:45 PM PTOracle to Amazon Aurora Migration, Step by Step – Learn how to migrate your Oracle database to Amazon Aurora.
DevOps

June 20, 2018 | 09:00 AM – 09:45 AM PTSet Up a CI/CD Pipeline for Deploying Containers Using the AWS Developer Tools – Learn how to set up a CI/CD pipeline for deploying containers using the AWS Developer Tools.

 

Enterprise & Hybrid
June 18, 2018 | 09:00 AM – 09:45 AM PTDe-risking Enterprise Migration with AWS Managed Services – Learn how enterprise customers are de-risking cloud adoption with AWS Managed Services.

June 19, 2018 | 11:00 AM – 11:45 AM PTLaunch AWS Faster using Automated Landing Zones – Learn how the AWS Landing Zone can automate the set up of best practice baselines when setting up new

 

AWS Environments

June 21, 2018 | 11:00 AM – 11:45 AM PTLeading Your Team Through a Cloud Transformation – Learn how you can help lead your organization through a cloud transformation.

June 21, 2018 | 01:00 PM – 01:45 PM PTEnabling New Retail Customer Experiences with Big Data – Learn how AWS can help retailers realize actual value from their big data and deliver on differentiated retail customer experiences.

June 28, 2018 | 01:00 PM – 01:45 PM PTFireside Chat: End User Collaboration on AWS – Learn how End User Compute services can help you deliver access to desktops and applications anywhere, anytime, using any device.
IoT

June 27, 2018 | 11:00 AM – 11:45 AM PTAWS IoT in the Connected Home – Learn how to use AWS IoT to build innovative Connected Home products.

 

Machine Learning

June 19, 2018 | 09:00 AM – 09:45 AM PTIntegrating Amazon SageMaker into your Enterprise – Learn how to integrate Amazon SageMaker and other AWS Services within an Enterprise environment.

June 21, 2018 | 09:00 AM – 09:45 AM PTBuilding Text Analytics Applications on AWS using Amazon Comprehend – Learn how you can unlock the value of your unstructured data with NLP-based text analytics.

 

Management Tools

June 20, 2018 | 01:00 PM – 01:45 PM PTOptimizing Application Performance and Costs with Auto Scaling – Learn how selecting the right scaling option can help optimize application performance and costs.

 

Mobile
June 25, 2018 | 11:00 AM – 11:45 AM PTDrive User Engagement with Amazon Pinpoint – Learn how Amazon Pinpoint simplifies and streamlines effective user engagement.

 

Security, Identity & Compliance

June 26, 2018 | 09:00 AM – 09:45 AM PTUnderstanding AWS Secrets Manager – Learn how AWS Secrets Manager helps you rotate and manage access to secrets centrally.
June 28, 2018 | 09:00 AM – 09:45 AM PTUsing Amazon Inspector to Discover Potential Security Issues – See how Amazon Inspector can be used to discover security issues of your instances.

 

Serverless

June 19, 2018 | 01:00 PM – 01:45 PM PTProductionize Serverless Application Building and Deployments with AWS SAM – Learn expert tips and techniques for building and deploying serverless applications at scale with AWS SAM.

 

Storage

June 26, 2018 | 11:00 AM – 11:45 AM PTDeep Dive: Hybrid Cloud Storage with AWS Storage Gateway – Learn how you can reduce your on-premises infrastructure by using the AWS Storage Gateway to connecting your applications to the scalable and reliable AWS storage services.
June 27, 2018 | 01:00 PM – 01:45 PM PTChanging the Game: Extending Compute Capabilities to the Edge – Discover how to change the game for IIoT and edge analytics applications with AWS Snowball Edge plus enhanced Compute instances.
June 28, 2018 | 11:00 AM – 11:45 AM PTBig Data and Analytics Workloads on Amazon EFS – Get best practices and deployment advice for running big data and analytics workloads on Amazon EFS.

Build your own weather station with our new guide!

Post Syndicated from Richard Hayler original https://www.raspberrypi.org/blog/build-your-own-weather-station/

One of the most common enquiries I receive at Pi Towers is “How can I get my hands on a Raspberry Pi Oracle Weather Station?” Now the answer is: “Why not build your own version using our guide?”

Build Your Own weather station kit assembled

Tadaaaa! The BYO weather station fully assembled.

Our Oracle Weather Station

In 2016 we sent out nearly 1000 Raspberry Pi Oracle Weather Station kits to schools from around the world who had applied to be part of our weather station programme. In the original kit was a special HAT that allows the Pi to collect weather data with a set of sensors.

The original Raspberry Pi Oracle Weather Station HAT – Build Your Own Raspberry Pi weather station

The original Raspberry Pi Oracle Weather Station HAT

We designed the HAT to enable students to create their own weather stations and mount them at their schools. As part of the programme, we also provide an ever-growing range of supporting resources. We’ve seen Oracle Weather Stations in great locations with a huge differences in climate, and they’ve even recorded the effects of a solar eclipse.

Our new BYO weather station guide

We only had a single batch of HATs made, and unfortunately we’ve given nearly* all the Weather Station kits away. Not only are the kits really popular, we also receive lots of questions about how to add extra sensors or how to take more precise measurements of a particular weather phenomenon. So today, to satisfy your demand for a hackable weather station, we’re launching our Build your own weather station guide!

Build Your Own Raspberry Pi weather station

Fun with meteorological experiments!

Our guide suggests the use of many of the sensors from the Oracle Weather Station kit, so can build a station that’s as close as possible to the original. As you know, the Raspberry Pi is incredibly versatile, and we’ve made it easy to hack the design in case you want to use different sensors.

Many other tutorials for Pi-powered weather stations don’t explain how the various sensors work or how to store your data. Ours goes into more detail. It shows you how to put together a breadboard prototype, it describes how to write Python code to take readings in different ways, and it guides you through recording these readings in a database.

Build Your Own Raspberry Pi weather station on a breadboard

There’s also a section on how to make your station weatherproof. And in case you want to move past the breadboard stage, we also help you with that. The guide shows you how to solder together all the components, similar to the original Oracle Weather Station HAT.

Who should try this build

We think this is a great project to tackle at home, at a STEM club, Scout group, or CoderDojo, and we’re sure that many of you will be chomping at the bit to get started. Before you do, please note that we’ve designed the build to be as straight-forward as possible, but it’s still fairly advanced both in terms of electronics and programming. You should read through the whole guide before purchasing any components.

Build Your Own Raspberry Pi weather station – components

The sensors and components we’re suggesting balance cost, accuracy, and easy of use. Depending on what you want to use your station for, you may wish to use different components. Similarly, the final soldered design in the guide may not be the most elegant, but we think it is achievable for someone with modest soldering experience and basic equipment.

You can build a functioning weather station without soldering with our guide, but the build will be more durable if you do solder it. If you’ve never tried soldering before, that’s OK: we have a Getting started with soldering resource plus video tutorial that will walk you through how it works step by step.

Prototyping HAT for Raspberry Pi weather station sensors

For those of you who are more experienced makers, there are plenty of different ways to put the final build together. We always like to hear about alternative builds, so please post your designs in the Weather Station forum.

Our plans for the guide

Our next step is publishing supplementary guides for adding extra functionality to your weather station. We’d love to hear which enhancements you would most like to see! Our current ideas under development include adding a webcam, making a tweeting weather station, adding a light/UV meter, and incorporating a lightning sensor. Let us know which of these is your favourite, or suggest your own amazing ideas in the comments!

*We do have a very small number of kits reserved for interesting projects or locations: a particularly cool experiment, a novel idea for how the Oracle Weather Station could be used, or places with specific weather phenomena. If have such a project in mind, please send a brief outline to [email protected], and we’ll consider how we might be able to help you.

The post Build your own weather station with our new guide! appeared first on Raspberry Pi.

Measuring the throughput for Amazon MQ using the JMS Benchmark

Post Syndicated from Rachel Richardson original https://aws.amazon.com/blogs/compute/measuring-the-throughput-for-amazon-mq-using-the-jms-benchmark/

This post is courtesy of Alan Protasio, Software Development Engineer, Amazon Web Services

Just like compute and storage, messaging is a fundamental building block of enterprise applications. Message brokers (aka “message-oriented middleware”) enable different software systems, often written in different languages, on different platforms, running in different locations, to communicate and exchange information. Mission-critical applications, such as CRM and ERP, rely on message brokers to work.

A common performance consideration for customers deploying a message broker in a production environment is the throughput of the system, measured as messages per second. This is important to know so that application environments (hosts, threads, memory, etc.) can be configured correctly.

In this post, we demonstrate how to measure the throughput for Amazon MQ, a new managed message broker service for ActiveMQ, using JMS Benchmark. It should take between 15–20 minutes to set up the environment and an hour to run the benchmark. We also provide some tips on how to configure Amazon MQ for optimal throughput.

Benchmarking throughput for Amazon MQ

ActiveMQ can be used for a number of use cases. These use cases can range from simple fire and forget tasks (that is, asynchronous processing), low-latency request-reply patterns, to buffering requests before they are persisted to a database.

The throughput of Amazon MQ is largely dependent on the use case. For example, if you have non-critical workloads such as gathering click events for a non-business-critical portal, you can use ActiveMQ in a non-persistent mode and get extremely high throughput with Amazon MQ.

On the flip side, if you have a critical workload where durability is extremely important (meaning that you can’t lose a message), then you are bound by the I/O capacity of your underlying persistence store. We recommend using mq.m4.large for the best results. The mq.t2.micro instance type is intended for product evaluation. Performance is limited, due to the lower memory and burstable CPU performance.

Tip: To improve your throughput with Amazon MQ, make sure that you have consumers processing messaging as fast as (or faster than) your producers are pushing messages.

Because it’s impossible to talk about how the broker (ActiveMQ) behaves for each and every use case, we walk through how to set up your own benchmark for Amazon MQ using our favorite open-source benchmarking tool: JMS Benchmark. We are fans of the JMS Benchmark suite because it’s easy to set up and deploy, and comes with a built-in visualizer of the results.

Non-Persistent Scenarios – Queue latency as you scale producer throughput

JMS Benchmark nonpersistent scenarios

Getting started

At the time of publication, you can create an mq.m4.large single-instance broker for testing for $0.30 per hour (US pricing).

This walkthrough covers the following tasks:

  1.  Create and configure the broker.
  2. Create an EC2 instance to run your benchmark
  3. Configure the security groups
  4.  Run the benchmark.

Step 1 – Create and configure the broker
Create and configure the broker using Tutorial: Creating and Configuring an Amazon MQ Broker.

Step 2 – Create an EC2 instance to run your benchmark
Launch the EC2 instance using Step 1: Launch an Instance. We recommend choosing the m5.large instance type.

Step 3 – Configure the security groups
Make sure that all the security groups are correctly configured to let the traffic flow between the EC2 instance and your broker.

  1. Sign in to the Amazon MQ console.
  2. From the broker list, choose the name of your broker (for example, MyBroker)
  3. In the Details section, under Security and network, choose the name of your security group or choose the expand icon ( ).
  4. From the security group list, choose your security group.
  5. At the bottom of the page, choose Inbound, Edit.
  6. In the Edit inbound rules dialog box, add a role to allow traffic between your instance and the broker:
    • Choose Add Rule.
    • For Type, choose Custom TCP.
    • For Port Range, type the ActiveMQ SSL port (61617).
    • For Source, leave Custom selected and then type the security group of your EC2 instance.
    • Choose Save.

Your broker can now accept the connection from your EC2 instance.

Step 4 – Run the benchmark
Connect to your EC2 instance using SSH and run the following commands:

$ cd ~
$ curl -L https://github.com/alanprot/jms-benchmark/archive/master.zip -o master.zip
$ unzip master.zip
$ cd jms-benchmark-master
$ chmod a+x bin/*
$ env \
  SERVER_SETUP=false \
  SERVER_ADDRESS={activemq-endpoint} \
  ACTIVEMQ_TRANSPORT=ssl\
  ACTIVEMQ_PORT=61617 \
  ACTIVEMQ_USERNAME={activemq-user} \
  ACTIVEMQ_PASSWORD={activemq-password} \
  ./bin/benchmark-activemq

After the benchmark finishes, you can find the results in the ~/reports directory. As you may notice, the performance of ActiveMQ varies based on the number of consumers, producers, destinations, and message size.

Amazon MQ architecture

The last bit that’s important to know so that you can better understand the results of the benchmark is how Amazon MQ is architected.

Amazon MQ is architected to be highly available (HA) and durable. For HA, we recommend using the multi-AZ option. After a message is sent to Amazon MQ in persistent mode, the message is written to the highly durable message store that replicates the data across multiple nodes in multiple Availability Zones. Because of this replication, for some use cases you may see a reduction in throughput as you migrate to Amazon MQ. Customers have told us they appreciate the benefits of message replication as it helps protect durability even in the face of the loss of an Availability Zone.

Conclusion

We hope this gives you an idea of how Amazon MQ performs. We encourage you to run tests to simulate your own use cases.

To learn more, see the Amazon MQ website. You can try Amazon MQ for free with the AWS Free Tier, which includes up to 750 hours of a single-instance mq.t2.micro broker and up to 1 GB of storage per month for one year.

Protecting your API using Amazon API Gateway and AWS WAF — Part I

Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/protecting-your-api-using-amazon-api-gateway-and-aws-waf-part-i/

This post courtesy of Thiago Morais, AWS Solutions Architect

When you build web applications or expose any data externally, you probably look for a platform where you can build highly scalable, secure, and robust REST APIs. As APIs are publicly exposed, there are a number of best practices for providing a secure mechanism to consumers using your API.

Amazon API Gateway handles all the tasks involved in accepting and processing up to hundreds of thousands of concurrent API calls, including traffic management, authorization and access control, monitoring, and API version management.

In this post, I show you how to take advantage of the regional API endpoint feature in API Gateway, so that you can create your own Amazon CloudFront distribution and secure your API using AWS WAF.

AWS WAF is a web application firewall that helps protect your web applications from common web exploits that could affect application availability, compromise security, or consume excessive resources.

As you make your APIs publicly available, you are exposed to attackers trying to exploit your services in several ways. The AWS security team published a whitepaper solution using AWS WAF, How to Mitigate OWASP’s Top 10 Web Application Vulnerabilities.

Regional API endpoints

Edge-optimized APIs are endpoints that are accessed through a CloudFront distribution created and managed by API Gateway. Before the launch of regional API endpoints, this was the default option when creating APIs using API Gateway. It primarily helped to reduce latency for API consumers that were located in different geographical locations than your API.

When API requests predominantly originate from an Amazon EC2 instance or other services within the same AWS Region as the API is deployed, a regional API endpoint typically lowers the latency of connections. It is recommended for such scenarios.

For better control around caching strategies, customers can use their own CloudFront distribution for regional APIs. They also have the ability to use AWS WAF protection, as I describe in this post.

Edge-optimized API endpoint

The following diagram is an illustrated example of the edge-optimized API endpoint where your API clients access your API through a CloudFront distribution created and managed by API Gateway.

Regional API endpoint

For the regional API endpoint, your customers access your API from the same Region in which your REST API is deployed. This helps you to reduce request latency and particularly allows you to add your own content delivery network, as needed.

Walkthrough

In this section, you implement the following steps:

  • Create a regional API using the PetStore sample API.
  • Create a CloudFront distribution for the API.
  • Test the CloudFront distribution.
  • Set up AWS WAF and create a web ACL.
  • Attach the web ACL to the CloudFront distribution.
  • Test AWS WAF protection.

Create the regional API

For this walkthrough, use an existing PetStore API. All new APIs launch by default as the regional endpoint type. To change the endpoint type for your existing API, choose the cog icon on the top right corner:

After you have created the PetStore API on your account, deploy a stage called “prod” for the PetStore API.

On the API Gateway console, select the PetStore API and choose Actions, Deploy API.

For Stage name, type prod and add a stage description.

Choose Deploy and the new API stage is created.

Use the following AWS CLI command to update your API from edge-optimized to regional:

aws apigateway update-rest-api \
--rest-api-id {rest-api-id} \
--patch-operations op=replace,path=/endpointConfiguration/types/EDGE,value=REGIONAL

A successful response looks like the following:

{
    "description": "Your first API with Amazon API Gateway. This is a sample API that integrates via HTTP with your demo Pet Store endpoints", 
    "createdDate": 1511525626, 
    "endpointConfiguration": {
        "types": [
            "REGIONAL"
        ]
    }, 
    "id": "{api-id}", 
    "name": "PetStore"
}

After you change your API endpoint to regional, you can now assign your own CloudFront distribution to this API.

Create a CloudFront distribution

To make things easier, I have provided an AWS CloudFormation template to deploy a CloudFront distribution pointing to the API that you just created. Click the button to deploy the template in the us-east-1 Region.

For Stack name, enter RegionalAPI. For APIGWEndpoint, enter your API FQDN in the following format:

{api-id}.execute-api.us-east-1.amazonaws.com

After you fill out the parameters, choose Next to continue the stack deployment. It takes a couple of minutes to finish the deployment. After it finishes, the Output tab lists the following items:

  • A CloudFront domain URL
  • An S3 bucket for CloudFront access logs
Output from CloudFormation

Output from CloudFormation

Test the CloudFront distribution

To see if the CloudFront distribution was configured correctly, use a web browser and enter the URL from your distribution, with the following parameters:

https://{your-distribution-url}.cloudfront.net/{api-stage}/pets

You should get the following output:

[
  {
    "id": 1,
    "type": "dog",
    "price": 249.99
  },
  {
    "id": 2,
    "type": "cat",
    "price": 124.99
  },
  {
    "id": 3,
    "type": "fish",
    "price": 0.99
  }
]

Set up AWS WAF and create a web ACL

With the new CloudFront distribution in place, you can now start setting up AWS WAF to protect your API.

For this demo, you deploy the AWS WAF Security Automations solution, which provides fine-grained control over the requests attempting to access your API.

For more information about deployment, see Automated Deployment. If you prefer, you can launch the solution directly into your account using the following button.

For CloudFront Access Log Bucket Name, add the name of the bucket created during the deployment of the CloudFormation stack for your CloudFront distribution.

The solution allows you to adjust thresholds and also choose which automations to enable to protect your API. After you finish configuring these settings, choose Next.

To start the deployment process in your account, follow the creation wizard and choose Create. It takes a few minutes do finish the deployment. You can follow the creation process through the CloudFormation console.

After the deployment finishes, you can see the new web ACL deployed on the AWS WAF console, AWSWAFSecurityAutomations.

Attach the AWS WAF web ACL to the CloudFront distribution

With the solution deployed, you can now attach the AWS WAF web ACL to the CloudFront distribution that you created earlier.

To assign the newly created AWS WAF web ACL, go back to your CloudFront distribution. After you open your distribution for editing, choose General, Edit.

Select the new AWS WAF web ACL that you created earlier, AWSWAFSecurityAutomations.

Save the changes to your CloudFront distribution and wait for the deployment to finish.

Test AWS WAF protection

To validate the AWS WAF Web ACL setup, use Artillery to load test your API and see AWS WAF in action.

To install Artillery on your machine, run the following command:

$ npm install -g artillery

After the installation completes, you can check if Artillery installed successfully by running the following command:

$ artillery -V
$ 1.6.0-12

As the time of publication, Artillery is on version 1.6.0-12.

One of the WAF web ACL rules that you have set up is a rate-based rule. By default, it is set up to block any requesters that exceed 2000 requests under 5 minutes. Try this out.

First, use cURL to query your distribution and see the API output:

$ curl -s https://{distribution-name}.cloudfront.net/prod/pets
[
  {
    "id": 1,
    "type": "dog",
    "price": 249.99
  },
  {
    "id": 2,
    "type": "cat",
    "price": 124.99
  },
  {
    "id": 3,
    "type": "fish",
    "price": 0.99
  }
]

Based on the test above, the result looks good. But what if you max out the 2000 requests in under 5 minutes?

Run the following Artillery command:

artillery quick -n 2000 --count 10  https://{distribution-name}.cloudfront.net/prod/pets

What you are doing is firing 2000 requests to your API from 10 concurrent users. For brevity, I am not posting the Artillery output here.

After Artillery finishes its execution, try to run the cURL request again and see what happens:

 

$ curl -s https://{distribution-name}.cloudfront.net/prod/pets

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<HTML><HEAD><META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=iso-8859-1">
<TITLE>ERROR: The request could not be satisfied</TITLE>
</HEAD><BODY>
<H1>ERROR</H1>
<H2>The request could not be satisfied.</H2>
<HR noshade size="1px">
Request blocked.
<BR clear="all">
<HR noshade size="1px">
<PRE>
Generated by cloudfront (CloudFront)
Request ID: [removed]
</PRE>
<ADDRESS>
</ADDRESS>
</BODY></HTML>

As you can see from the output above, the request was blocked by AWS WAF. Your IP address is removed from the blocked list after it falls below the request limit rate.

Conclusion

In this first part, you saw how to use the new API Gateway regional API endpoint together with Amazon CloudFront and AWS WAF to secure your API from a series of attacks.

In the second part, I will demonstrate some other techniques to protect your API using API keys and Amazon CloudFront custom headers.

The Benefits of Side Projects

Post Syndicated from Bozho original https://techblog.bozho.net/the-benefits-of-side-projects/

Side projects are the things you do at home, after work, for your own “entertainment”, or to satisfy your desire to learn new stuff, in case your workplace doesn’t give you that opportunity (or at least not enough of it). Side projects are also a way to build stuff that you think is valuable but not necessarily “commercialisable”. Many side projects are open-sourced sooner or later and some of them contribute to the pool of tools at other people’s disposal.

I’ve outlined one recommendation about side projects before – do them with technologies that are new to you, so that you learn important things that will keep you better positioned in the software world.

But there are more benefits than that – serendipitous benefits, for example. And I’d like to tell some personal stories about that. I’ll focus on a few examples from my list of side projects to show how, through a sort-of butterfly effect, they helped shape my career.

The computoser project, no matter how cool algorithmic music composition, didn’t manage to have much of a long term impact. But it did teach me something apart from niche musical theory – how to read a bulk of scientific papers (mostly computer science) and understand them without being formally trained in the particular field. We’ll see how that was useful later.

Then there was the “State alerts” project – a website that scraped content from public institutions in my country (legislation, legislation proposals, decisions by regulators, new tenders, etc.), made them searchable, and “subscribable” – so that you get notified when a keyword of interest is mentioned in newly proposed legislation, for example. (I obviously subscribed for “information technologies” and “electronic”).

And that project turned out to have a significant impact on the following years. First, I chose a new technology to write it with – Scala. Which turned out to be of great use when I started working at TomTom, and on the 3rd day I was transferred to a Scala project, which was way cooler and much more complex than the original one I was hired for. It was a bit ironic, as my colleagues had just read that “I don’t like Scala” a few weeks earlier, but nevertheless, that was one of the most interesting projects I’ve worked on, and it went on for two years. Had I not known Scala, I’d probably be gone from TomTom much earlier (as the other project was restructured a few times), and I would not have learned many of the scalability, architecture and AWS lessons that I did learn there.

But the very same project had an even more important follow-up. Because if its “civic hacking” flavour, I was invited to join an informal group of developers (later officiated as an NGO) who create tools that are useful for society (something like MySociety.org). That group gathered regularly, discussed both tools and policies, and at some point we put up a list of policy priorities that we wanted to lobby policy makers. One of them was open source for the government, the other one was open data. As a result of our interaction with an interim government, we donated the official open data portal of my country, functioning to this day.

As a result of that, a few months later we got a proposal from the deputy prime minister’s office to “elect” one of the group for an advisor to the cabinet. And we decided that could be me. So I went for it and became advisor to the deputy prime minister. The job has nothing to do with anything one could imagine, and it was challenging and fascinating. We managed to pass legislation, including one that requires open source for custom projects, eID and open data. And all of that would not have been possible without my little side project.

As for my latest side project, LogSentinel – it became my current startup company. And not without help from the previous two mentioned above – the computer science paper reading was of great use when I was navigating the crypto papers landscape, and from the government job I not only gained invaluable legal knowledge, but I also “got” a co-founder.

Some other side projects died without much fanfare, and that’s fine. But the ones above shaped my “story” in a way that would not have been possible otherwise.

And I agree that such serendipitous chain of events could have happened without side projects – I could’ve gotten these opportunities by meeting someone at a bar (unlikely, but who knows). But we, as software engineers, are capable of tilting chance towards us by utilizing our skills. Side projects are our “extracurricular activities”, and they often lead to unpredictable, but rather positive chains of events. They would rarely be the only factor, but they are certainly great at unlocking potential.

The post The Benefits of Side Projects appeared first on Bozho's tech blog.

Naturebytes’ weatherproof Pi and camera case

Post Syndicated from Helen Lynn original https://www.raspberrypi.org/blog/naturebytes-weatherproof-pi-and-camera-case/

Naturebytes are making their weatherproof Wildlife Cam Case available as a standalone product for the first time, a welcome addition to the Raspberry Pi ecosystem that should take some of the hassle out of your outdoor builds.

A robin on a bird feeder in a garden with a Naturebytes Wildlife Cam mounted beside it

Weatherproofing digital making projects

People often use Raspberry Pis and Camera Modules for outdoor projects, but weatherproofing your set-up can be tricky. You need to keep water — and tiny creatures — out, but you might well need access for wires and cables, whether for power or sensors; if you’re using a camera, it’ll need something clear and cleanable in front of the lens. You can use sealant, but if you need to adjust anything that you’ve applied it to, you’ll have to remove it and redo it. While we’ve seen a few reasonable options available to buy, the choice has never been what you’d call extensive.

The Naturebytes case

For all these reasons, I was pleased to learn that Naturebytes, the wildlife camera people, are releasing their Wildlife Cam Case as a standalone product for the first time.

Naturebytes case open

The Wildlife Cam Case is ideal for nature camera projects, of course, but it’ll also be useful for anyone who wants to take their Pi outdoors. It has weatherproof lenses that are transparent to visible and IR light, for all your nature observation projects. Its opening is hinged to allow easy access to your hardware, and the case has waterproof access for cables. Inside, there’s a mount for fixing any model of Raspberry Pi and camera, as well as many other components. On top of all that, the case comes with a sturdy nylon strap to make it easy to attach it to a post or a tree.

Naturebytes case additional components

Order yours now!

At the moment, Naturebytes are producing a limited run of the cases. The first batch of 50 are due to be dispatched next week to arrive just in time for the Bank Holiday weekend in the UK, so get them while they’re hot. It’s the perfect thing for recording a timelapse of exactly how quickly the slugs obliterate your vegetable seedlings, and of lots more heartening things that must surely happen in gardens other than mine.

The post Naturebytes’ weatherproof Pi and camera case appeared first on Raspberry Pi.

EC2 Instance Update – C5 Instances with Local NVMe Storage (C5d)

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/ec2-instance-update-c5-instances-with-local-nvme-storage-c5d/

As you can see from my EC2 Instance History post, we add new instance types on a regular and frequent basis. Driven by increasingly powerful processors and designed to address an ever-widening set of use cases, the size and diversity of this list reflects the equally diverse group of EC2 customers!

Near the bottom of that list you will find the new compute-intensive C5 instances. With a 25% to 50% improvement in price-performance over the C4 instances, the C5 instances are designed for applications like batch and log processing, distributed and or real-time analytics, high-performance computing (HPC), ad serving, highly scalable multiplayer gaming, and video encoding. Some of these applications can benefit from access to high-speed, ultra-low latency local storage. For example, video encoding, image manipulation, and other forms of media processing often necessitates large amounts of I/O to temporary storage. While the input and output files are valuable assets and are typically stored as Amazon Simple Storage Service (S3) objects, the intermediate files are expendable. Similarly, batch and log processing runs in a race-to-idle model, flushing volatile data to disk as fast as possible in order to make full use of compute resources.

New C5d Instances with Local Storage
In order to meet this need, we are introducing C5 instances equipped with local NVMe storage. Available for immediate use in 5 regions, these instances are a great fit for the applications that I described above, as well as others that you will undoubtedly dream up! Here are the specs:

Instance NamevCPUsRAMLocal StorageEBS BandwidthNetwork Bandwidth
c5d.large24 GiB1 x 50 GB NVMe SSDUp to 2.25 GbpsUp to 10 Gbps
c5d.xlarge48 GiB1 x 100 GB NVMe SSDUp to 2.25 GbpsUp to 10 Gbps
c5d.2xlarge816 GiB1 x 225 GB NVMe SSDUp to 2.25 GbpsUp to 10 Gbps
c5d.4xlarge1632 GiB1 x 450 GB NVMe SSD2.25 GbpsUp to 10 Gbps
c5d.9xlarge3672 GiB1 x 900 GB NVMe SSD4.5 Gbps10 Gbps
c5d.18xlarge72144 GiB2 x 900 GB NVMe SSD9 Gbps25 Gbps

Other than the addition of local storage, the C5 and C5d share the same specs. Both are powered by 3.0 GHz Intel Xeon Platinum 8000-series processors, optimized for EC2 and with full control over C-states on the two largest sizes, giving you the ability to run two cores at up to 3.5 GHz using Intel Turbo Boost Technology.

You can use any AMI that includes drivers for the Elastic Network Adapter (ENA) and NVMe; this includes the latest Amazon Linux, Microsoft Windows (Server 2008 R2, Server 2012, Server 2012 R2 and Server 2016), Ubuntu, RHEL, SUSE, and CentOS AMIs.

Here are a couple of things to keep in mind about the local NVMe storage:

Naming – You don’t have to specify a block device mapping in your AMI or during the instance launch; the local storage will show up as one or more devices (/dev/nvme*1 on Linux) after the guest operating system has booted.

Encryption – Each local NVMe device is hardware encrypted using the XTS-AES-256 block cipher and a unique key. Each key is destroyed when the instance is stopped or terminated.

Lifetime – Local NVMe devices have the same lifetime as the instance they are attached to, and do not stick around after the instance has been stopped or terminated.

Available Now
C5d instances are available in On-Demand, Reserved Instance, and Spot form in the US East (N. Virginia), US West (Oregon), EU (Ireland), US East (Ohio), and Canada (Central) Regions. Prices vary by Region, and are just a bit higher than for the equivalent C5 instances.

Jeff;

PS – We will be adding local NVMe storage to other EC2 instance types in the months to come, so stay tuned!

Welcome Josh — Data Center Technician

Post Syndicated from Yev original https://www.backblaze.com/blog/welcome-josh-datacenter-technician/

The Backblaze production team is growing and that means the data center is increasingly gaining some new faces. One of the newest to join the team is Josh! Lets learn a bit more about Josh shall we?

What is your Backblaze Title?
I’m a Data Center Technician in the Sacramento area.

Where are you originally from?
I lived all over the California central valley growing up.

What attracted you to Backblaze?
Backblaze is the best of a few worlds — cool startup meets professional DIYers meets transparent tech company (a rare thing).

What do you expect to learn while being at Backblaze?
I expect to learn about Data Center operations, and continue to develop the Linux skills that landed me here.

Favorite hobby?
Building and playing with new and useful toys.

Star Trek or Star Wars?
Darmok and Jalad at Tanagra.

Coke or Pepsi?
Good Beer.

Favorite food?
Tacos. No, burgers. No, it’s sushi. No, gyros. I can’t choose.

Why do you like certain things?
I like things that I can take apart and rebuild and turn every knob and adjust every piece. It means there’s a lot to learn, and I definitely like that.

Darmok and Jalad on the ocean! Welcome aboard Josh 😀

The post Welcome Josh — Data Center Technician appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Analyze Apache Parquet optimized data using Amazon Kinesis Data Firehose, Amazon Athena, and Amazon Redshift

Post Syndicated from Roy Hasson original https://aws.amazon.com/blogs/big-data/analyzing-apache-parquet-optimized-data-using-amazon-kinesis-data-firehose-amazon-athena-and-amazon-redshift/

Amazon Kinesis Data Firehose is the easiest way to capture and stream data into a data lake built on Amazon S3. This data can be anything—from AWS service logs like AWS CloudTrail log files, Amazon VPC Flow Logs, Application Load Balancer logs, and others. It can also be IoT events, game events, and much more. To efficiently query this data, a time-consuming ETL (extract, transform, and load) process is required to massage and convert the data to an optimal file format, which increases the time to insight. This situation is less than ideal, especially for real-time data that loses its value over time.

To solve this common challenge, Kinesis Data Firehose can now save data to Amazon S3 in Apache Parquet or Apache ORC format. These are optimized columnar formats that are highly recommended for best performance and cost-savings when querying data in S3. This feature directly benefits you if you use Amazon Athena, Amazon Redshift, AWS Glue, Amazon EMR, or any other big data tools that are available from the AWS Partner Network and through the open-source community.

Amazon Connect is a simple-to-use, cloud-based contact center service that makes it easy for any business to provide a great customer experience at a lower cost than common alternatives. Its open platform design enables easy integration with other systems. One of those systems is Amazon Kinesis—in particular, Kinesis Data Streams and Kinesis Data Firehose.

What’s really exciting is that you can now save events from Amazon Connect to S3 in Apache Parquet format. You can then perform analytics using Amazon Athena and Amazon Redshift Spectrum in real time, taking advantage of this key performance and cost optimization. Of course, Amazon Connect is only one example. This new capability opens the door for a great deal of opportunity, especially as organizations continue to build their data lakes.

Amazon Connect includes an array of analytics views in the Administrator dashboard. But you might want to run other types of analysis. In this post, I describe how to set up a data stream from Amazon Connect through Kinesis Data Streams and Kinesis Data Firehose and out to S3, and then perform analytics using Athena and Amazon Redshift Spectrum. I focus primarily on the Kinesis Data Firehose support for Parquet and its integration with the AWS Glue Data Catalog, Amazon Athena, and Amazon Redshift.

Solution overview

Here is how the solution is laid out:

 

 

The following sections walk you through each of these steps to set up the pipeline.

1. Define the schema

When Kinesis Data Firehose processes incoming events and converts the data to Parquet, it needs to know which schema to apply. The reason is that many times, incoming events contain all or some of the expected fields based on which values the producers are advertising. A typical process is to normalize the schema during a batch ETL job so that you end up with a consistent schema that can easily be understood and queried. Doing this introduces latency due to the nature of the batch process. To overcome this issue, Kinesis Data Firehose requires the schema to be defined in advance.

To see the available columns and structures, see Amazon Connect Agent Event Streams. For the purpose of simplicity, I opted to make all the columns of type String rather than create the nested structures. But you can definitely do that if you want.

The simplest way to define the schema is to create a table in the Amazon Athena console. Open the Athena console, and paste the following create table statement, substituting your own S3 bucket and prefix for where your event data will be stored. A Data Catalog database is a logical container that holds the different tables that you can create. The default database name shown here should already exist. If it doesn’t, you can create it or use another database that you’ve already created.

CREATE EXTERNAL TABLE default.kfhconnectblog (
  awsaccountid string,
  agentarn string,
  currentagentsnapshot string,
  eventid string,
  eventtimestamp string,
  eventtype string,
  instancearn string,
  previousagentsnapshot string,
  version string
)
STORED AS parquet
LOCATION 's3://your_bucket/kfhconnectblog/'
TBLPROPERTIES ("parquet.compression"="SNAPPY")

That’s all you have to do to prepare the schema for Kinesis Data Firehose.

2. Define the data streams

Next, you need to define the Kinesis data streams that will be used to stream the Amazon Connect events.  Open the Kinesis Data Streams console and create two streams.  You can configure them with only one shard each because you don’t have a lot of data right now.

3. Define the Kinesis Data Firehose delivery stream for Parquet

Let’s configure the Data Firehose delivery stream using the data stream as the source and Amazon S3 as the output. Start by opening the Kinesis Data Firehose console and creating a new data delivery stream. Give it a name, and associate it with the Kinesis data stream that you created in Step 2.

As shown in the following screenshot, enable Record format conversion (1) and choose Apache Parquet (2). As you can see, Apache ORC is also supported. Scroll down and provide the AWS Glue Data Catalog database name (3) and table names (4) that you created in Step 1. Choose Next.

To make things easier, the output S3 bucket and prefix fields are automatically populated using the values that you defined in the LOCATION parameter of the create table statement from Step 1. Pretty cool. Additionally, you have the option to save the raw events into another location as defined in the Source record S3 backup section. Don’t forget to add a trailing forward slash “ / “ so that Data Firehose creates the date partitions inside that prefix.

On the next page, in the S3 buffer conditions section, there is a note about configuring a large buffer size. The Parquet file format is highly efficient in how it stores and compresses data. Increasing the buffer size allows you to pack more rows into each output file, which is preferred and gives you the most benefit from Parquet.

Compression using Snappy is automatically enabled for both Parquet and ORC. You can modify the compression algorithm by using the Kinesis Data Firehose API and update the OutputFormatConfiguration.

Be sure to also enable Amazon CloudWatch Logs so that you can debug any issues that you might run into.

Lastly, finalize the creation of the Firehose delivery stream, and continue on to the next section.

4. Set up the Amazon Connect contact center

After setting up the Kinesis pipeline, you now need to set up a simple contact center in Amazon Connect. The Getting Started page provides clear instructions on how to set up your environment, acquire a phone number, and create an agent to accept calls.

After setting up the contact center, in the Amazon Connect console, choose your Instance Alias, and then choose Data Streaming. Under Agent Event, choose the Kinesis data stream that you created in Step 2, and then choose Save.

At this point, your pipeline is complete.  Agent events from Amazon Connect are generated as agents go about their day. Events are sent via Kinesis Data Streams to Kinesis Data Firehose, which converts the event data from JSON to Parquet and stores it in S3. Athena and Amazon Redshift Spectrum can simply query the data without any additional work.

So let’s generate some data. Go back into the Administrator console for your Amazon Connect contact center, and create an agent to handle incoming calls. In this example, I creatively named mine Agent One. After it is created, Agent One can get to work and log into their console and set their availability to Available so that they are ready to receive calls.

To make the data a bit more interesting, I also created a second agent, Agent Two. I then made some incoming and outgoing calls and caused some failures to occur, so I now have enough data available to analyze.

5. Analyze the data with Athena

Let’s open the Athena console and run some queries. One thing you’ll notice is that when we created the schema for the dataset, we defined some of the fields as Strings even though in the documentation they were complex structures.  The reason for doing that was simply to show some of the flexibility of Athena to be able to parse JSON data. However, you can define nested structures in your table schema so that Kinesis Data Firehose applies the appropriate schema to the Parquet file.

Let’s run the first query to see which agents have logged into the system.

The query might look complex, but it’s fairly straightforward:

WITH dataset AS (
  SELECT 
    from_iso8601_timestamp(eventtimestamp) AS event_ts,
    eventtype,
    -- CURRENT STATE
    json_extract_scalar(
      currentagentsnapshot,
      '$.agentstatus.name') AS current_status,
    from_iso8601_timestamp(
      json_extract_scalar(
        currentagentsnapshot,
        '$.agentstatus.starttimestamp')) AS current_starttimestamp,
    json_extract_scalar(
      currentagentsnapshot, 
      '$.configuration.firstname') AS current_firstname,
    json_extract_scalar(
      currentagentsnapshot,
      '$.configuration.lastname') AS current_lastname,
    json_extract_scalar(
      currentagentsnapshot, 
      '$.configuration.username') AS current_username,
    json_extract_scalar(
      currentagentsnapshot, 
      '$.configuration.routingprofile.defaultoutboundqueue.name') AS               current_outboundqueue,
    json_extract_scalar(
      currentagentsnapshot, 
      '$.configuration.routingprofile.inboundqueues[0].name') as current_inboundqueue,
    -- PREVIOUS STATE
    json_extract_scalar(
      previousagentsnapshot, 
      '$.agentstatus.name') as prev_status,
    from_iso8601_timestamp(
      json_extract_scalar(
        previousagentsnapshot, 
       '$.agentstatus.starttimestamp')) as prev_starttimestamp,
    json_extract_scalar(
      previousagentsnapshot, 
      '$.configuration.firstname') as prev_firstname,
    json_extract_scalar(
      previousagentsnapshot, 
      '$.configuration.lastname') as prev_lastname,
    json_extract_scalar(
      previousagentsnapshot, 
      '$.configuration.username') as prev_username,
    json_extract_scalar(
      previousagentsnapshot, 
      '$.configuration.routingprofile.defaultoutboundqueue.name') as current_outboundqueue,
    json_extract_scalar(
      previousagentsnapshot, 
      '$.configuration.routingprofile.inboundqueues[0].name') as prev_inboundqueue
  from kfhconnectblog
  where eventtype <> 'HEART_BEAT'
)
SELECT
  current_status as status,
  current_username as username,
  event_ts
FROM dataset
WHERE eventtype = 'LOGIN' AND current_username <> ''
ORDER BY event_ts DESC

The query output looks something like this:

Here is another query that shows the sessions each of the agents engaged with. It tells us where they were incoming or outgoing, if they were completed, and where there were missed or failed calls.

WITH src AS (
  SELECT
     eventid,
     json_extract_scalar(currentagentsnapshot, '$.configuration.username') as username,
     cast(json_extract(currentagentsnapshot, '$.contacts') AS ARRAY(JSON)) as c,
     cast(json_extract(previousagentsnapshot, '$.contacts') AS ARRAY(JSON)) as p
  from kfhconnectblog
),
src2 AS (
  SELECT *
  FROM src CROSS JOIN UNNEST (c, p) AS contacts(c_item, p_item)
),
dataset AS (
SELECT 
  eventid,
  username,
  json_extract_scalar(c_item, '$.contactid') as c_contactid,
  json_extract_scalar(c_item, '$.channel') as c_channel,
  json_extract_scalar(c_item, '$.initiationmethod') as c_direction,
  json_extract_scalar(c_item, '$.queue.name') as c_queue,
  json_extract_scalar(c_item, '$.state') as c_state,
  from_iso8601_timestamp(json_extract_scalar(c_item, '$.statestarttimestamp')) as c_ts,
  
  json_extract_scalar(p_item, '$.contactid') as p_contactid,
  json_extract_scalar(p_item, '$.channel') as p_channel,
  json_extract_scalar(p_item, '$.initiationmethod') as p_direction,
  json_extract_scalar(p_item, '$.queue.name') as p_queue,
  json_extract_scalar(p_item, '$.state') as p_state,
  from_iso8601_timestamp(json_extract_scalar(p_item, '$.statestarttimestamp')) as p_ts
FROM src2
)
SELECT 
  username,
  c_channel as channel,
  c_direction as direction,
  p_state as prev_state,
  c_state as current_state,
  c_ts as current_ts,
  c_contactid as id
FROM dataset
WHERE c_contactid = p_contactid
ORDER BY id DESC, current_ts ASC

The query output looks similar to the following:

6. Analyze the data with Amazon Redshift Spectrum

With Amazon Redshift Spectrum, you can query data directly in S3 using your existing Amazon Redshift data warehouse cluster. Because the data is already in Parquet format, Redshift Spectrum gets the same great benefits that Athena does.

Here is a simple query to show querying the same data from Amazon Redshift. Note that to do this, you need to first create an external schema in Amazon Redshift that points to the AWS Glue Data Catalog.

SELECT 
  eventtype,
  json_extract_path_text(currentagentsnapshot,'agentstatus','name') AS current_status,
  json_extract_path_text(currentagentsnapshot, 'configuration','firstname') AS current_firstname,
  json_extract_path_text(currentagentsnapshot, 'configuration','lastname') AS current_lastname,
  json_extract_path_text(
    currentagentsnapshot,
    'configuration','routingprofile','defaultoutboundqueue','name') AS current_outboundqueue,
FROM default_schema.kfhconnectblog

The following shows the query output:

Summary

In this post, I showed you how to use Kinesis Data Firehose to ingest and convert data to columnar file format, enabling real-time analysis using Athena and Amazon Redshift. This great feature enables a level of optimization in both cost and performance that you need when storing and analyzing large amounts of data. This feature is equally important if you are investing in building data lakes on AWS.

 


Additional Reading

If you found this post useful, be sure to check out Analyzing VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight and Work with partitioned data in AWS Glue.


About the Author

Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics and business intelligence needs. Roy is big Manchester United fan cheering his team on and hanging out with his family.

 

 

 

Supply-Chain Security

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/05/supply-chain_se.html

Earlier this month, the Pentagon stopped selling phones made by the Chinese companies ZTE and Huawei on military bases because they might be used to spy on their users.

It’s a legitimate fear, and perhaps a prudent action. But it’s just one instance of the much larger issue of securing our supply chains.

All of our computerized systems are deeply international, and we have no choice but to trust the companies and governments that touch those systems. And while we can ban a few specific products, services or companies, no country can isolate itself from potential foreign interference.

In this specific case, the Pentagon is concerned that the Chinese government demanded that ZTE and Huawei add “backdoors” to their phones that could be surreptitiously turned on by government spies or cause them to fail during some future political conflict. This tampering is possible because the software in these phones is incredibly complex. It’s relatively easy for programmers to hide these capabilities, and correspondingly difficult to detect them.

This isn’t the first time the United States has taken action against foreign software suspected to contain hidden features that can be used against us. Last December, President Trump signed into law a bill banning software from the Russian company Kaspersky from being used within the US government. In 2012, the focus was on Chinese-made Internet routers. Then, the House Intelligence Committee concluded: “Based on available classified and unclassified information, Huawei and ZTE cannot be trusted to be free of foreign state influence and thus pose a security threat to the United States and to our systems.”

Nor is the United States the only country worried about these threats. In 2014, China reportedly banned antivirus products from both Kaspersky and the US company Symantec, based on similar fears. In 2017, the Indian government identified 42 smartphone apps that China subverted. Back in 1997, the Israeli company Check Point was dogged by rumors that its government added backdoors into its products; other of that country’s tech companies have been suspected of the same thing. Even al-Qaeda was concerned; ten years ago, a sympathizer released the encryption software Mujahedeen Secrets, claimed to be free of Western influence and backdoors. If a country doesn’t trust another country, then it can’t trust that country’s computer products.

But this trust isn’t limited to the country where the company is based. We have to trust the country where the software is written — and the countries where all the components are manufactured. In 2016, researchers discovered that many different models of cheap Android phones were sending information back to China. The phones might be American-made, but the software was from China. In 2016, researchers demonstrated an even more devious technique, where a backdoor could be added at the computer chip level in the factory that made the chips ­ without the knowledge of, and undetectable by, the engineers who designed the chips in the first place. Pretty much every US technology company manufactures its hardware in countries such as Malaysia, Indonesia, China and Taiwan.

We also have to trust the programmers. Today’s large software programs are written by teams of hundreds of programmers scattered around the globe. Backdoors, put there by we-have-no-idea-who, have been discovered in Juniper firewalls and D-Link routers, both of which are US companies. In 2003, someone almost slipped a very clever backdoor into Linux. Think of how many countries’ citizens are writing software for Apple or Microsoft or Google.

We can go even farther down the rabbit hole. We have to trust the distribution systems for our hardware and software. Documents disclosed by Edward Snowden showed the National Security Agency installing backdoors into Cisco routers being shipped to the Syrian telephone company. There are fake apps in the Google Play store that eavesdrop on you. Russian hackers subverted the update mechanism of a popular brand of Ukrainian accounting software to spread the NotPetya malware.

In 2017, researchers demonstrated that a smartphone can be subverted by installing a malicious replacement screen.

I could go on. Supply-chain security is an incredibly complex problem. US-only design and manufacturing isn’t an option; the tech world is far too internationally interdependent for that. We can’t trust anyone, yet we have no choice but to trust everyone. Our phones, computers, software and cloud systems are touched by citizens of dozens of different countries, any one of whom could subvert them at the demand of their government. And just as Russia is penetrating the US power grid so they have that capability in the event of hostilities, many countries are almost certainly doing the same thing at the consumer level.

We don’t know whether the risk of Huawei and ZTE equipment is great enough to warrant the ban. We don’t know what classified intelligence the United States has, and what it implies. But we do know that this is just a minor fix for a much larger problem. It’s doubtful that this ban will have any real effect. Members of the military, and everyone else, can still buy the phones. They just can’t buy them on US military bases. And while the US might block the occasional merger or acquisition, or ban the occasional hardware or software product, we’re largely ignoring that larger issue. Solving it borders on somewhere between incredibly expensive and realistically impossible.

Perhaps someday, global norms and international treaties will render this sort of device-level tampering off-limits. But until then, all we can do is hope that this particular arms race doesn’t get too far out of control.

This essay previously appeared in the Washington Post.

AWS Online Tech Talks – May and Early June 2018

Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-may-and-early-june-2018/

AWS Online Tech Talks – May and Early June 2018  

Join us this month to learn about some of the exciting new services and solution best practices at AWS. We also have our first re:Invent 2018 webinar series, “How to re:Invent”. Sign up now to learn more, we look forward to seeing you.

Note – All sessions are free and in Pacific Time.

Tech talks featured this month:

Analytics & Big Data

May 21, 2018 | 11:00 AM – 11:45 AM PT Integrating Amazon Elasticsearch with your DevOps Tooling – Learn how you can easily integrate Amazon Elasticsearch Service into your DevOps tooling and gain valuable insight from your log data.

May 23, 2018 | 11:00 AM – 11:45 AM PTData Warehousing and Data Lake Analytics, Together – Learn how to query data across your data warehouse and data lake without moving data.

May 24, 2018 | 11:00 AM – 11:45 AM PTData Transformation Patterns in AWS – Discover how to perform common data transformations on the AWS Data Lake.

Compute

May 29, 2018 | 01:00 PM – 01:45 PM PT – Creating and Managing a WordPress Website with Amazon Lightsail – Learn about Amazon Lightsail and how you can create, run and manage your WordPress websites with Amazon’s simple compute platform.

May 30, 2018 | 01:00 PM – 01:45 PM PTAccelerating Life Sciences with HPC on AWS – Learn how you can accelerate your Life Sciences research workloads by harnessing the power of high performance computing on AWS.

Containers

May 24, 2018 | 01:00 PM – 01:45 PM PT – Building Microservices with the 12 Factor App Pattern on AWS – Learn best practices for building containerized microservices on AWS, and how traditional software design patterns evolve in the context of containers.

Databases

May 21, 2018 | 01:00 PM – 01:45 PM PTHow to Migrate from Cassandra to Amazon DynamoDB – Get the benefits, best practices and guides on how to migrate your Cassandra databases to Amazon DynamoDB.

May 23, 2018 | 01:00 PM – 01:45 PM PT5 Hacks for Optimizing MySQL in the Cloud – Learn how to optimize your MySQL databases for high availability, performance, and disaster resilience using RDS.

DevOps

May 23, 2018 | 09:00 AM – 09:45 AM PT.NET Serverless Development on AWS – Learn how to build a modern serverless application in .NET Core 2.0.

Enterprise & Hybrid

May 22, 2018 | 11:00 AM – 11:45 AM PTHybrid Cloud Customer Use Cases on AWS – Learn how customers are leveraging AWS hybrid cloud capabilities to easily extend their datacenter capacity, deliver new services and applications, and ensure business continuity and disaster recovery.

IoT

May 31, 2018 | 11:00 AM – 11:45 AM PTUsing AWS IoT for Industrial Applications – Discover how you can quickly onboard your fleet of connected devices, keep them secure, and build predictive analytics with AWS IoT.

Machine Learning

May 22, 2018 | 09:00 AM – 09:45 AM PTUsing Apache Spark with Amazon SageMaker – Discover how to use Apache Spark with Amazon SageMaker for training jobs and application integration.

May 24, 2018 | 09:00 AM – 09:45 AM PTIntroducing AWS DeepLens – Learn how AWS DeepLens provides a new way for developers to learn machine learning by pairing the physical device with a broad set of tutorials, examples, source code, and integration with familiar AWS services.

Management Tools

May 21, 2018 | 09:00 AM – 09:45 AM PTGaining Better Observability of Your VMs with Amazon CloudWatch – Learn how CloudWatch Agent makes it easy for customers like Rackspace to monitor their VMs.

Mobile

May 29, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive on Amazon Pinpoint Segmentation and Endpoint Management – See how segmentation and endpoint management with Amazon Pinpoint can help you target the right audience.

Networking

May 31, 2018 | 09:00 AM – 09:45 AM PTMaking Private Connectivity the New Norm via AWS PrivateLink – See how PrivateLink enables service owners to offer private endpoints to customers outside their company.

Security, Identity, & Compliance

May 30, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS Certificate Manager Private Certificate Authority (CA) – Learn how AWS Certificate Manager (ACM) Private Certificate Authority (CA), a managed private CA service, helps you easily and securely manage the lifecycle of your private certificates.

June 1, 2018 | 09:00 AM – 09:45 AM PTIntroducing AWS Firewall Manager – Centrally configure and manage AWS WAF rules across your accounts and applications.

Serverless

May 22, 2018 | 01:00 PM – 01:45 PM PTBuilding API-Driven Microservices with Amazon API Gateway – Learn how to build a secure, scalable API for your application in our tech talk about API-driven microservices.

Storage

May 30, 2018 | 11:00 AM – 11:45 AM PTAccelerate Productivity by Computing at the Edge – Learn how AWS Snowball Edge support for compute instances helps accelerate data transfers, execute custom applications, and reduce overall storage costs.

June 1, 2018 | 11:00 AM – 11:45 AM PTLearn to Build a Cloud-Scale Website Powered by Amazon EFS – Technical deep dive where you’ll learn tips and tricks for integrating WordPress, Drupal and Magento with Amazon EFS.

 

 

 

 

Cryptocurrency Security Challenges

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/cryptocurrency-security-challenges/

Physical coins representing cyrptocurrencies

Most likely you’ve read the tantalizing stories of big gains from investing in cryptocurrencies. Someone who invested $1,000 into bitcoins five years ago would have over $85,000 in value now. Alternatively, someone who invested in bitcoins three months ago would have seen their investment lose 20% in value. Beyond the big price fluctuations, currency holders are possibly exposed to fraud, bad business practices, and even risk losing their holdings altogether if they are careless in keeping track of the all-important currency keys.

It’s certain that beyond the rewards and risks, cryptocurrencies are here to stay. We can’t ignore how they are changing the game for how money is handled between people and businesses.

Some Advantages of Cryptocurrency

  • Cryptocurrency is accessible to anyone.
  • Decentralization means the network operates on a user-to-user (or peer-to-peer) basis.
  • Transactions can completed for a fraction of the expense and time required to complete traditional asset transfers.
  • Transactions are digital and cannot be counterfeited or reversed arbitrarily by the sender, as with credit card charge-backs.
  • There aren’t usually transaction fees for cryptocurrency exchanges.
  • Cryptocurrency allows the cryptocurrency holder to send exactly what information is needed and no more to the merchant or recipient, even permitting anonymous transactions (for good or bad).
  • Cryptocurrency operates at the universal level and hence makes transactions easier internationally.
  • There is no other electronic cash system in which your account isn’t owned by someone else.

On top of all that, blockchain, the underlying technology behind cryptocurrencies, is already being applied to a variety of business needs and itself becoming a hot sector of the tech economy. Blockchain is bringing traceability and cost-effectiveness to supply-chain management — which also improves quality assurance in areas such as food, reducing errors and improving accounting accuracy, smart contracts that can be automatically validated, signed and enforced through a blockchain construct, the possibility of secure, online voting, and many others.

Like any new, booming marketing there are risks involved in these new currencies. Anyone venturing into this domain needs to have their eyes wide open. While the opportunities for making money are real, there are even more ways to lose money.

We’re going to cover two primary approaches to staying safe and avoiding fraud and loss when dealing with cryptocurrencies. The first is to thoroughly vet any person or company you’re dealing with to judge whether they are ethical and likely to succeed in their business segment. The second is keeping your critical cryptocurrency keys safe, which we’ll deal with in this and a subsequent post.

Caveat Emptor — Buyer Beware

The short history of cryptocurrency has already seen the demise of a number of companies that claimed to manage, mine, trade, or otherwise help their customers profit from cryptocurrency. Mt. Gox, GAW Miners, and OneCoin are just three of the many companies that disappeared with their users’ money. This is the traditional equivalent of your bank going out of business and zeroing out your checking account in the process.

That doesn’t happen with banks because of regulatory oversight. But with cryptocurrency, you need to take the time to investigate any company you use to manage or trade your currencies. How long have they been around? Who are their investors? Are they affiliated with any reputable financial institutions? What is the record of their founders and executive management? These are all important questions to consider when evaluating a company in this new space.

Would you give the keys to your house to a service or person you didn’t thoroughly know and trust? Some companies that enable you to buy and sell currencies online will routinely hold your currency keys, which gives them the ability to do anything they want with your holdings, including selling them and pocketing the proceeds if they wish.

That doesn’t mean you shouldn’t ever allow a company to keep your currency keys in escrow. It simply means that you better know with whom you’re doing business and if they’re trustworthy enough to be given that responsibility.

Keys To the Cryptocurrency Kingdom — Public and Private

If you’re an owner of cryptocurrency, you know how this all works. If you’re not, bear with me for a minute while I bring everyone up to speed.

Cryptocurrency has no physical manifestation, such as bills or coins. It exists purely as a computer record. And unlike currencies maintained by governments, such as the U.S. dollar, there is no central authority regulating its distribution and value. Cryptocurrencies use a technology called blockchain, which is a decentralized way of keeping track of transactions. There are many copies of a given blockchain, so no single central authority is needed to validate its authenticity or accuracy.

The validity of each cryptocurrency is determined by a blockchain. A blockchain is a continuously growing list of records, called “blocks”, which are linked and secured using cryptography. Blockchains by design are inherently resistant to modification of the data. They perform as an open, distributed ledger that can record transactions between two parties efficiently and in a verifiable, permanent way. A blockchain is typically managed by a peer-to-peer network collectively adhering to a protocol for validating new blocks. Once recorded, the data in any given block cannot be altered retroactively without the alteration of all subsequent blocks, which requires collusion of the network majority. On a scaled network, this level of collusion is impossible — making blockchain networks effectively immutable and trustworthy.

Blockchain process

The other element common to all cryptocurrencies is their use of public and private keys, which are stored in the currency’s wallet. A cryptocurrency wallet stores the public and private “keys” or “addresses” that can be used to receive or spend the cryptocurrency. With the private key, it is possible to write in the public ledger (blockchain), effectively spending the associated cryptocurrency. With the public key, it is possible for others to send currency to the wallet.

What is a cryptocurrency address?

Cryptocurrency “coins” can be lost if the owner loses the private keys needed to spend the currency they own. It’s as if the owner had lost a bank account number and had no way to verify their identity to the bank, or if they lost the U.S. dollars they had in their wallet. The assets are gone and unusable.

The Cryptocurrency Wallet

Given the importance of these keys, and lack of recourse if they are lost, it’s obviously very important to keep track of your keys.

If you’re being careful in choosing reputable exchanges, app developers, and other services with whom to trust your cryptocurrency, you’ve made a good start in keeping your investment secure. But if you’re careless in managing the keys to your bitcoins, ether, Litecoin, or other cryptocurrency, you might as well leave your money on a cafe tabletop and walk away.

What Are the Differences Between Hot and Cold Wallets?

Just like other numbers you might wish to keep track of — credit cards, account numbers, phone numbers, passphrases — cryptocurrency keys can be stored in a variety of ways. Those who use their currencies for day-to-day purchases most likely will want them handy in a smartphone app, hardware key, or debit card that can be used for purchases. These are called “hot” wallets. Some experts advise keeping the balances in these devices and apps to a minimal amount to avoid hacking or data loss. We typically don’t walk around with thousands of dollars in U.S. currency in our old-style wallets, so this is really a continuation of the same approach to managing spending money.

Bread mobile app screenshot

A “hot” wallet, the Bread mobile app

Some investors with large balances keep their keys in “cold” wallets, or “cold storage,” i.e. a device or location that is not connected online. If funds are needed for purchases, they can be transferred to a more easily used payment medium. Cold wallets can be hardware devices, USB drives, or even paper copies of your keys.

Trezor hardware wallet

A “cold” wallet, the Trezor hardware wallet

Ledger Nano S hardware wallet

A “cold” wallet, the Ledger Nano S

Bitcoin paper wallet

A “cold” Bitcoin paper wallet

Wallets are suited to holding one or more specific cryptocurrencies, and some people have multiple wallets for different currencies and different purposes.

A paper wallet is nothing other than a printed record of your public and private keys. Some prefer their records to be completely disconnected from the internet, and a piece of paper serves that need. Just like writing down an account password on paper, however, it’s essential to keep the paper secure to avoid giving someone the ability to freely access your funds.

How to Keep your Keys, and Cryptocurrency Secure

In a post this coming Thursday, Securing Your Cryptocurrency, we’ll discuss the best strategies for backing up your cryptocurrency so that your currencies don’t become part of the millions that have been lost. We’ll cover the common (and uncommon) approaches to backing up hot wallets, cold wallets, and using paper and metal solutions to keeping your keys safe.

In the meantime, please tell us of your experiences with cryptocurrencies — good and bad — and how you’ve dealt with the issue of cryptocurrency security.

The post Cryptocurrency Security Challenges appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Ray Ozzie’s Encryption Backdoor

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/05/ray_ozzies_encr.html

Last month, Wired published a long article about Ray Ozzie and his supposed new scheme for adding a backdoor in encrypted devices. It’s a weird article. It paints Ozzie’s proposal as something that “attains the impossible” and “satisfies both law enforcement and privacy purists,” when (1) it’s barely a proposal, and (2) it’s essentially the same key escrow scheme we’ve been hearing about for decades.

Basically, each device has a unique public/private key pair and a secure processor. The public key goes into the processor and the device, and is used to encrypt whatever user key encrypts the data. The private key is stored in a secure database, available to law enforcement on demand. The only other trick is that for law enforcement to use that key, they have to put the device in some sort of irreversible recovery mode, which means it can never be used again. That’s basically it.

I have no idea why anyone is talking as if this were anything new. Several cryptographers have already explained why this key escrow scheme is no better than any other key escrow scheme. The short answer is (1) we won’t be able to secure that database of backdoor keys, (2) we don’t know how to build the secure coprocessor the scheme requires, and (3) it solves none of the policy problems around the whole system. This is the typical mistake non-cryptographers make when they approach this problem: they think that the hard part is the cryptography to create the backdoor. That’s actually the easy part. The hard part is ensuring that it’s only used by the good guys, and there’s nothing in Ozzie’s proposal that addresses any of that.

I worry that this kind of thing is damaging in the long run. There should be some rule that any backdoor or key escrow proposal be a fully specified proposal, not just some cryptography and hand-waving notions about how it will be used in practice. And before it is analyzed and debated, it should have to satisfy some sort of basic security analysis. Otherwise, we’ll be swatting pseudo-proposals like this one, while those on the other side of this debate become increasingly convinced that it’s possible to design one of these things securely.

Already people are using the National Academies report on backdoors for law enforcement as evidence that engineers are developing workable and secure backdoors. Writing in Lawfare, Alan Z. Rozenshtein claims that the report — and a related New York Times story — “undermine the argument that secure third-party access systems are so implausible that it’s not even worth trying to develop them.” Susan Landau effectively corrects this misconception, but the damage is done.

Here’s the thing: it’s not hard to design and build a backdoor. What’s hard is building the systems — both technical and procedural — around them. Here’s Rob Graham:

He’s only solving the part we already know how to solve. He’s deliberately ignoring the stuff we don’t know how to solve. We know how to make backdoors, we just don’t know how to secure them.

A bunch of us cryptographers have already explained why we don’t think this sort of thing will work in the foreseeable future. We write:

Exceptional access would force Internet system developers to reverse “forward secrecy” design practices that seek to minimize the impact on user privacy when systems are breached. The complexity of today’s Internet environment, with millions of apps and globally connected services, means that new law enforcement requirements are likely to introduce unanticipated, hard to detect security flaws. Beyond these and other technical vulnerabilities, the prospect of globally deployed exceptional access systems raises difficult problems about how such an environment would be governed and how to ensure that such systems would respect human rights and the rule of law.

Finally, Matthew Green:

The reason so few of us are willing to bet on massive-scale key escrow systems is that we’ve thought about it and we don’t think it will work. We’ve looked at the threat model, the usage model, and the quality of hardware and software that exists today. Our informed opinion is that there’s no detection system for key theft, there’s no renewability system, HSMs are terrifically vulnerable (and the companies largely staffed with ex-intelligence employees), and insiders can be suborned. We’re not going to put the data of a few billion people on the line an environment where we believe with high probability that the system will fail.

EDITED TO ADD (5/14): An analysis of the proposal.

Analyze data in Amazon DynamoDB using Amazon SageMaker for real-time prediction

Post Syndicated from YongSeong Lee original https://aws.amazon.com/blogs/big-data/analyze-data-in-amazon-dynamodb-using-amazon-sagemaker-for-real-time-prediction/

Many companies across the globe use Amazon DynamoDB to store and query historical user-interaction data. DynamoDB is a fast NoSQL database used by applications that need consistent, single-digit millisecond latency.

Often, customers want to turn their valuable data in DynamoDB into insights by analyzing a copy of their table stored in Amazon S3. Doing this separates their analytical queries from their low-latency critical paths. This data can be the primary source for understanding customers’ past behavior, predicting future behavior, and generating downstream business value. Customers often turn to DynamoDB because of its great scalability and high availability. After a successful launch, many customers want to use the data in DynamoDB to predict future behaviors or provide personalized recommendations.

DynamoDB is a good fit for low-latency reads and writes, but it’s not practical to scan all data in a DynamoDB database to train a model. In this post, I demonstrate how you can use DynamoDB table data copied to Amazon S3 by AWS Data Pipeline to predict customer behavior. I also demonstrate how you can use this data to provide personalized recommendations for customers using Amazon SageMaker. You can also run ad hoc queries using Amazon Athena against the data. DynamoDB recently released on-demand backups to create full table backups with no performance impact. However, it’s not suitable for our purposes in this post, so I chose AWS Data Pipeline instead to create managed backups are accessible from other services.

To do this, I describe how to read the DynamoDB backup file format in Data Pipeline. I also describe how to convert the objects in S3 to a CSV format that Amazon SageMaker can read. In addition, I show how to schedule regular exports and transformations using Data Pipeline. The sample data used in this post is from Bank Marketing Data Set of UCI.

The solution that I describe provides the following benefits:

  • Separates analytical queries from production traffic on your DynamoDB table, preserving your DynamoDB read capacity units (RCUs) for important production requests
  • Automatically updates your model to get real-time predictions
  • Optimizes for performance (so it doesn’t compete with DynamoDB RCUs after the export) and for cost (using data you already have)
  • Makes it easier for developers of all skill levels to use Amazon SageMaker

All code and data set in this post are available in this .zip file.

Solution architecture

The following diagram shows the overall architecture of the solution.

The steps that data follows through the architecture are as follows:

  1. Data Pipeline regularly copies the full contents of a DynamoDB table as JSON into an S3
  2. Exported JSON files are converted to comma-separated value (CSV) format to use as a data source for Amazon SageMaker.
  3. Amazon SageMaker renews the model artifact and update the endpoint.
  4. The converted CSV is available for ad hoc queries with Amazon Athena.
  5. Data Pipeline controls this flow and repeats the cycle based on the schedule defined by customer requirements.

Building the auto-updating model

This section discusses details about how to read the DynamoDB exported data in Data Pipeline and build automated workflows for real-time prediction with a regularly updated model.

Download sample scripts and data

Before you begin, take the following steps:

  1. Download sample scripts in this .zip file.
  2. Unzip the src.zip file.
  3. Find the automation_script.sh file and edit it for your environment. For example, you need to replace 's3://<your bucket>/<datasource path>/' with your own S3 path to the data source for Amazon ML. In the script, the text enclosed by angle brackets—< and >—should be replaced with your own path.
  4. Upload the json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar file to your S3 path so that the ADD jar command in Apache Hive can refer to it.

For this solution, the banking.csv  should be imported into a DynamoDB table.

Export a DynamoDB table

To export the DynamoDB table to S3, open the Data Pipeline console and choose the Export DynamoDB table to S3 template. In this template, Data Pipeline creates an Amazon EMR cluster and performs an export in the EMRActivity activity. Set proper intervals for backups according to your business requirements.

One core node(m3.xlarge) provides the default capacity for the EMR cluster and should be suitable for the solution in this post. Leave the option to resize the cluster before running enabled in the TableBackupActivity activity to let Data Pipeline scale the cluster to match the table size. The process of converting to CSV format and renewing models happens in this EMR cluster.

For a more in-depth look at how to export data from DynamoDB, see Export Data from DynamoDB in the Data Pipeline documentation.

Add the script to an existing pipeline

After you export your DynamoDB table, you add an additional EMR step to EMRActivity by following these steps:

  1. Open the Data Pipeline console and choose the ID for the pipeline that you want to add the script to.
  2. For Actions, choose Edit.
  3. In the editing console, choose the Activities category and add an EMR step using the custom script downloaded in the previous section, as shown below.

Paste the following command into the new step after the data ­­upload step:

s3://#{myDDBRegion}.elasticmapreduce/libs/script-runner/script-runner.jar,s3://<your bucket name>/automation_script.sh,#{output.directoryPath},#{myDDBRegion}

The element #{output.directoryPath} references the S3 path where the data pipeline exports DynamoDB data as JSON. The path should be passed to the script as an argument.

The bash script has two goals, converting data formats and renewing the Amazon SageMaker model. Subsequent sections discuss the contents of the automation script.

Automation script: Convert JSON data to CSV with Hive

We use Apache Hive to transform the data into a new format. The Hive QL script to create an external table and transform the data is included in the custom script that you added to the Data Pipeline definition.

When you run the Hive scripts, do so with the -e option. Also, define the Hive table with the 'org.openx.data.jsonserde.JsonSerDe' row format to parse and read JSON format. The SQL creates a Hive EXTERNAL table, and it reads the DynamoDB backup data on the S3 path passed to it by Data Pipeline.

Note: You should create the table with the “EXTERNAL” keyword to avoid the backup data being accidentally deleted from S3 if you drop the table.

The full automation script for converting follows. Add your own bucket name and data source path in the highlighted areas.

#!/bin/bash
hive -e "
ADD jar s3://<your bucket name>/json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar ; 
DROP TABLE IF EXISTS blog_backup_data ;
CREATE EXTERNAL TABLE blog_backup_data (
 customer_id map<string,string>,
 age map<string,string>, job map<string,string>, 
 marital map<string,string>,education map<string,string>, 
 default map<string,string>, housing map<string,string>,
 loan map<string,string>, contact map<string,string>, 
 month map<string,string>, day_of_week map<string,string>, 
 duration map<string,string>, campaign map<string,string>,
 pdays map<string,string>, previous map<string,string>, 
 poutcome map<string,string>, emp_var_rate map<string,string>, 
 cons_price_idx map<string,string>, cons_conf_idx map<string,string>,
 euribor3m map<string,string>, nr_employed map<string,string>, 
 y map<string,string> ) 
ROW FORMAT SERDE 'org.openx.data.jsonserde.JsonSerDe' 
LOCATION '$1/';

INSERT OVERWRITE DIRECTORY 's3://<your bucket name>/<datasource path>/' 
SELECT concat( customer_id['s'],',', 
 age['n'],',', job['s'],',', 
 marital['s'],',', education['s'],',', default['s'],',', 
 housing['s'],',', loan['s'],',', contact['s'],',', 
 month['s'],',', day_of_week['s'],',', duration['n'],',', 
 campaign['n'],',',pdays['n'],',',previous['n'],',', 
 poutcome['s'],',', emp_var_rate['n'],',', cons_price_idx['n'],',',
 cons_conf_idx['n'],',', euribor3m['n'],',', nr_employed['n'],',', y['n'] ) 
FROM blog_backup_data
WHERE customer_id['s'] > 0 ; 

After creating an external table, you need to read data. You then use the INSERT OVERWRITE DIRECTORY ~ SELECT command to write CSV data to the S3 path that you designated as the data source for Amazon SageMaker.

Depending on your requirements, you can eliminate or process the columns in the SELECT clause in this step to optimize data analysis. For example, you might remove some columns that have unpredictable correlations with the target value because keeping the wrong columns might expose your model to “overfitting” during the training. In this post, customer_id  columns is removed. Overfitting can make your prediction weak. More information about overfitting can be found in the topic Model Fit: Underfitting vs. Overfitting in the Amazon ML documentation.

Automation script: Renew the Amazon SageMaker model

After the CSV data is replaced and ready to use, create a new model artifact for Amazon SageMaker with the updated dataset on S3.  For renewing model artifact, you must create a new training job.  Training jobs can be run using the AWS SDK ( for example, Amazon SageMaker boto3 ) or the Amazon SageMaker Python SDK that can be installed with “pip install sagemaker” command as well as the AWS CLI for Amazon SageMaker described in this post.

In addition, consider how to smoothly renew your existing model without service impact, because your model is called by applications in real time. To do this, you need to create a new endpoint configuration first and update a current endpoint with the endpoint configuration that is just created.

#!/bin/bash
## Define variable 
REGION=$2
DTTIME=`date +%Y-%m-%d-%H-%M-%S`
ROLE="<your AmazonSageMaker-ExecutionRole>" 


# Select containers image based on region.  
case "$REGION" in
"us-west-2" )
    IMAGE="174872318107.dkr.ecr.us-west-2.amazonaws.com/linear-learner:latest"
    ;;
"us-east-1" )
    IMAGE="382416733822.dkr.ecr.us-east-1.amazonaws.com/linear-learner:latest" 
    ;;
"us-east-2" )
    IMAGE="404615174143.dkr.ecr.us-east-2.amazonaws.com/linear-learner:latest" 
    ;;
"eu-west-1" )
    IMAGE="438346466558.dkr.ecr.eu-west-1.amazonaws.com/linear-learner:latest" 
    ;;
 *)
    echo "Invalid Region Name"
    exit 1 ;  
esac

# Start training job and creating model artifact 
TRAINING_JOB_NAME=TRAIN-${DTTIME} 
S3OUTPUT="s3://<your bucket name>/model/" 
INSTANCETYPE="ml.m4.xlarge"
INSTANCECOUNT=1
VOLUMESIZE=5 
aws sagemaker create-training-job --training-job-name ${TRAINING_JOB_NAME} --region ${REGION}  --algorithm-specification TrainingImage=${IMAGE},TrainingInputMode=File --role-arn ${ROLE}  --input-data-config '[{ "ChannelName": "train", "DataSource": { "S3DataSource": { "S3DataType": "S3Prefix", "S3Uri": "s3://<your bucket name>/<datasource path>/", "S3DataDistributionType": "FullyReplicated" } }, "ContentType": "text/csv", "CompressionType": "None" , "RecordWrapperType": "None"  }]'  --output-data-config S3OutputPath=${S3OUTPUT} --resource-config  InstanceType=${INSTANCETYPE},InstanceCount=${INSTANCECOUNT},VolumeSizeInGB=${VOLUMESIZE} --stopping-condition MaxRuntimeInSeconds=120 --hyper-parameters feature_dim=20,predictor_type=binary_classifier  

# Wait until job completed 
aws sagemaker wait training-job-completed-or-stopped --training-job-name ${TRAINING_JOB_NAME}  --region ${REGION}

# Get newly created model artifact and create model
MODELARTIFACT=`aws sagemaker describe-training-job --training-job-name ${TRAINING_JOB_NAME} --region ${REGION}  --query 'ModelArtifacts.S3ModelArtifacts' --output text `
MODELNAME=MODEL-${DTTIME}
aws sagemaker create-model --region ${REGION} --model-name ${MODELNAME}  --primary-container Image=${IMAGE},ModelDataUrl=${MODELARTIFACT}  --execution-role-arn ${ROLE}

# create a new endpoint configuration 
CONFIGNAME=CONFIG-${DTTIME}
aws sagemaker  create-endpoint-config --region ${REGION} --endpoint-config-name ${CONFIGNAME}  --production-variants  VariantName=Users,ModelName=${MODELNAME},InitialInstanceCount=1,InstanceType=ml.m4.xlarge

# create or update the endpoint
STATUS=`aws sagemaker describe-endpoint --endpoint-name  ServiceEndpoint --query 'EndpointStatus' --output text --region ${REGION} `
if [[ $STATUS -ne "InService" ]] ;
then
    aws sagemaker  create-endpoint --endpoint-name  ServiceEndpoint  --endpoint-config-name ${CONFIGNAME} --region ${REGION}    
else
    aws sagemaker  update-endpoint --endpoint-name  ServiceEndpoint  --endpoint-config-name ${CONFIGNAME} --region ${REGION}
fi

Grant permission

Before you execute the script, you must grant proper permission to Data Pipeline. Data Pipeline uses the DataPipelineDefaultResourceRole role by default. I added the following policy to DataPipelineDefaultResourceRole to allow Data Pipeline to create, delete, and update the Amazon SageMaker model and data source in the script.

{
 "Version": "2012-10-17",
 "Statement": [
 {
 "Effect": "Allow",
 "Action": [
 "sagemaker:CreateTrainingJob",
 "sagemaker:DescribeTrainingJob",
 "sagemaker:CreateModel",
 "sagemaker:CreateEndpointConfig",
 "sagemaker:DescribeEndpoint",
 "sagemaker:CreateEndpoint",
 "sagemaker:UpdateEndpoint",
 "iam:PassRole"
 ],
 "Resource": "*"
 }
 ]
}

Use real-time prediction

After you deploy a model into production using Amazon SageMaker hosting services, your client applications use this API to get inferences from the model hosted at the specified endpoint. This approach is useful for interactive web, mobile, or desktop applications.

Following, I provide a simple Python code example that queries against Amazon SageMaker endpoint URL with its name (“ServiceEndpoint”) and then uses them for real-time prediction.

=== Python sample for real-time prediction ===

#!/usr/bin/env python
import boto3
import json 

client = boto3.client('sagemaker-runtime', region_name ='<your region>' )
new_customer_info = '34,10,2,4,1,2,1,1,6,3,190,1,3,4,3,-1.7,94.055,-39.8,0.715,4991.6'
response = client.invoke_endpoint(
    EndpointName='ServiceEndpoint',
    Body=new_customer_info, 
    ContentType='text/csv'
)
result = json.loads(response['Body'].read().decode())
print(result)
--- output(response) ---
{u'predictions': [{u'score': 0.7528127431869507, u'predicted_label': 1.0}]}

Solution summary

The solution takes the following steps:

  1. Data Pipeline exports DynamoDB table data into S3. The original JSON data should be kept to recover the table in the rare event that this is needed. Data Pipeline then converts JSON to CSV so that Amazon SageMaker can read the data.Note: You should select only meaningful attributes when you convert CSV. For example, if you judge that the “campaign” attribute is not correlated, you can eliminate this attribute from the CSV.
  2. Train the Amazon SageMaker model with the new data source.
  3. When a new customer comes to your site, you can judge how likely it is for this customer to subscribe to your new product based on “predictedScores” provided by Amazon SageMaker.
  4. If the new user subscribes your new product, your application must update the attribute “y” to the value 1 (for yes). This updated data is provided for the next model renewal as a new data source. It serves to improve the accuracy of your prediction. With each new entry, your application can become smarter and deliver better predictions.

Running ad hoc queries using Amazon Athena

Amazon Athena is a serverless query service that makes it easy to analyze large amounts of data stored in Amazon S3 using standard SQL. Athena is useful for examining data and collecting statistics or informative summaries about data. You can also use the powerful analytic functions of Presto, as described in the topic Aggregate Functions of Presto in the Presto documentation.

With the Data Pipeline scheduled activity, recent CSV data is always located in S3 so that you can run ad hoc queries against the data using Amazon Athena. I show this with example SQL statements following. For an in-depth description of this process, see the post Interactive SQL Queries for Data in Amazon S3 on the AWS News Blog. 

Creating an Amazon Athena table and running it

Simply, you can create an EXTERNAL table for the CSV data on S3 in Amazon Athena Management Console.

=== Table Creation ===
CREATE EXTERNAL TABLE datasource (
 age int, 
 job string, 
 marital string , 
 education string, 
 default string, 
 housing string, 
 loan string, 
 contact string, 
 month string, 
 day_of_week string, 
 duration int, 
 campaign int, 
 pdays int , 
 previous int , 
 poutcome string, 
 emp_var_rate double, 
 cons_price_idx double,
 cons_conf_idx double, 
 euribor3m double, 
 nr_employed double, 
 y int 
)
ROW FORMAT DELIMITED 
FIELDS TERMINATED BY ',' ESCAPED BY '\\' LINES TERMINATED BY '\n' 
LOCATION 's3://<your bucket name>/<datasource path>/';

The following query calculates the correlation coefficient between the target attribute and other attributes using Amazon Athena.

=== Sample Query ===

SELECT corr(age,y) AS correlation_age_and_target, 
 corr(duration,y) AS correlation_duration_and_target, 
 corr(campaign,y) AS correlation_campaign_and_target,
 corr(contact,y) AS correlation_contact_and_target
FROM ( SELECT age , duration , campaign , y , 
 CASE WHEN contact = 'telephone' THEN 1 ELSE 0 END AS contact 
 FROM datasource 
 ) datasource ;

Conclusion

In this post, I introduce an example of how to analyze data in DynamoDB by using table data in Amazon S3 to optimize DynamoDB table read capacity. You can then use the analyzed data as a new data source to train an Amazon SageMaker model for accurate real-time prediction. In addition, you can run ad hoc queries against the data on S3 using Amazon Athena. I also present how to automate these procedures by using Data Pipeline.

You can adapt this example to your specific use case at hand, and hopefully this post helps you accelerate your development. You can find more examples and use cases for Amazon SageMaker in the video AWS 2017: Introducing Amazon SageMaker on the AWS website.

 


Additional Reading

If you found this post useful, be sure to check out Serving Real-Time Machine Learning Predictions on Amazon EMR and Analyzing Data in S3 using Amazon Athena.

 


About the Author

Yong Seong Lee is a Cloud Support Engineer for AWS Big Data Services. He is interested in every technology related to data/databases and helping customers who have difficulties in using AWS services. His motto is “Enjoy life, be curious and have maximum experience.”