Tag Archives: benchmark

How to Automatically Revert and Receive Notifications About Changes to Your Amazon VPC Security Groups

Post Syndicated from Rob Barnes original https://aws.amazon.com/blogs/security/how-to-automatically-revert-and-receive-notifications-about-changes-to-your-amazon-vpc-security-groups/

In a previous AWS Security Blog post, Jeff Levine showed how you can monitor changes to your Amazon EC2 security groups. The methods he describes in that post are examples of detective controls, which can help you determine when changes are made to security controls on your AWS resources.

In this post, I take that approach a step further by introducing an example of a responsive control, which you can use to automatically respond to a detected security event by applying a chosen security mitigation. I demonstrate a solution that continuously monitors changes made to an Amazon VPC security group, and if a new ingress rule (also called an inbound rule) is added to that security group, the solution removes the rule and then sends you a notification after the change has been automatically reverted.

The scenario

Let’s say you want to reduce your infrastructure complexity by replacing your Secure Shell (SSH) bastion hosts with Amazon EC2 Systems Manager (SSM). SSM allows you to run commands on your hosts remotely, removing the need to manage bastion hosts or rely on SSH to execute commands. To support this objective, you must prevent your staff members from opening SSH ports in your web server’s Amazon VPC security group. If one of your staff members does modify the VPC security group to allow SSH access, you want the change to be reverted automatically and to receive a notification that this happened. If you are not yet familiar with security groups, see Security Groups for Your VPC before reading the rest of this post.

Solution overview

This solution begins with a directive control to mandate that no web server should be accessible using SSH. The directive control is enforced using a preventive control: a security group that does not allow ingress on port 22 (the port typically used for SSH). The detective control is a “listener” that identifies any changes made to your security group. Finally, the responsive control reverts changes made to the security group and then sends a notification of this security mitigation.

The detective control, in this case, is an Amazon CloudWatch Events rule that detects changes to your security group and triggers the responsive control, which in this case is an AWS Lambda function. I use AWS CloudFormation to simplify the deployment.
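The CloudFormation template you will launch later creates this rule for you, but to make the mechanics concrete, here is a rough sketch, using Python and boto3, of how such a rule might be wired to a Lambda target. The rule name, target ID, and function ARN below are hypothetical placeholders, not the names the stack actually uses.

# Illustrative sketch only: the CloudFormation stack creates the actual rule.
# The rule name, target ID, and Lambda ARN are hypothetical placeholders.
import json
import boto3

events = boto3.client("events")

event_pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["ec2.amazonaws.com"],
        "eventName": ["AuthorizeSecurityGroupIngress"],
    },
}

# Create a rule that matches ingress changes recorded by CloudTrail ...
events.put_rule(
    Name="MonitorSecurityGroupIngress",
    EventPattern=json.dumps(event_pattern),
    State="ENABLED",
)

# ... and point it at the Lambda function that reverts the change.
events.put_targets(
    Rule="MonitorSecurityGroupIngress",
    Targets=[{
        "Id": "RevertIngressFunction",
        "Arn": "arn:aws:lambda:us-east-1:111122223333:function:RevertIngress",
    }],
)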

The following diagram shows the architecture of this solution.

Solution architecture diagram

Here is how the process works:

  1. Someone on your staff adds a new ingress rule to your security group.
  2. A CloudWatch event that continually monitors changes to your security groups detects the new ingress rule and invokes a designated Lambda function (with Lambda, you can run code without provisioning or managing servers).
  3. The Lambda function evaluates the event to determine whether the change affects a monitored security group and, if so, reverts the new ingress rule (a simplified sketch of such a function follows this list).
  4. Finally, the Lambda function sends you an email to let you know what the change was, who made it, and that the change was reverted.
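To make steps 3 and 4 of this process more concrete, the following is a simplified, hypothetical sketch of such a Lambda function written with boto3. The actual function deployed by the CloudFormation stack differs in its details; the GROUP_ID and SNS_TOPIC_ARN environment variables are assumptions made for illustration, and the event layout assumed here is the CloudTrail record for AuthorizeSecurityGroupIngress that CloudWatch Events delivers under the detail key.

# Simplified, hypothetical Lambda handler (Python/boto3). The function deployed
# by the CloudFormation stack differs in its details. GROUP_ID and SNS_TOPIC_ARN
# are assumed environment variables.
import os
import boto3

ec2 = boto3.client("ec2")
sns = boto3.client("sns")

MONITORED_GROUP_ID = os.environ["GROUP_ID"]
SNS_TOPIC_ARN = os.environ["SNS_TOPIC_ARN"]


def handler(event, context):
    # CloudWatch Events delivers the CloudTrail record under the "detail" key.
    detail = event["detail"]
    request = detail.get("requestParameters", {})
    group_id = request.get("groupId")

    # Ignore changes to security groups that we are not monitoring.
    if group_id != MONITORED_GROUP_ID:
        return

    # Rebuild the permissions that were just added, then revoke exactly those.
    revoke = []
    for item in request["ipPermissions"]["items"]:
        revoke.append({
            "IpProtocol": item["ipProtocol"],
            "FromPort": item["fromPort"],
            "ToPort": item["toPort"],
            "IpRanges": [{"CidrIp": r["cidrIp"]} for r in item["ipRanges"]["items"]],
        })
    ec2.revoke_security_group_ingress(GroupId=group_id, IpPermissions=revoke)

    # Let subscribers know what changed, who changed it, and that it was reverted.
    user_arn = detail.get("userIdentity", {}).get("arn", "unknown")
    sns.publish(
        TopicArn=SNS_TOPIC_ARN,
        Subject="Security group ingress rule reverted",
        Message="Reverted ingress change to {} made by {}".format(group_id, user_arn),
    )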

Deploy the solution by using CloudFormation

In this section, you will click the Launch Stack button shown below to launch the CloudFormation stack and deploy the solution.

Prerequisites

  • You must have AWS CloudTrail already enabled in the AWS Region where you will be deploying the solution. CloudTrail lets you log, continuously monitor, and retain events related to API calls across your AWS infrastructure. See Getting Started with CloudTrail for more information.
  • You must have a default VPC in the region in which you will be deploying the solution. AWS accounts have one default VPC per AWS Region. If you’ve deleted your VPC, see Creating a Default VPC to recreate it.

Resources that this solution creates

When you launch the CloudFormation stack, it creates the following resources:

  • A sample VPC security group in your default VPC, which is used as the target for reverting ingress rule changes.
  • A CloudWatch event rule that monitors changes to your AWS infrastructure.
  • A Lambda function that reverts changes to the security group and sends you email notifications.
  • A permission that allows CloudWatch to invoke your Lambda function.
  • An AWS Identity and Access Management (IAM) role with limited privileges that the Lambda function assumes when it is executed.
  • An Amazon SNS topic to which the Lambda function publishes notifications.

Launch the CloudFormation stack

The link in this section uses the us-east-1 Region (the US East [N. Virginia] Region). Change the region if you want to use this solution in a different region. See Selecting a Region for more information about changing the region.

To deploy the solution, click the following Launch Stack button to launch the stack. After you click the button, you must sign in to the AWS Management Console if you have not already done so.

Click this "Launch Stack" button

Then:

  1. Choose Next to proceed to the Specify Details page.
  2. On the Specify Details page, type your email address in the Send notifications to box. This is the email address to which change notifications will be sent. (After the stack is launched, you will receive a confirmation email that you must accept before you can receive notifications.)
  3. Choose Next until you get to the Review page, and then choose the I acknowledge that AWS CloudFormation might create IAM resources check box. This confirms that you are aware that the CloudFormation template includes an IAM resource.
  4. Choose Create. CloudFormation displays the stack status, CREATE_COMPLETE, when the stack has launched completely, which should take less than two minutes.
    Screenshot showing that the stack has launched completely

Testing the solution

  1. Check your email for the SNS confirmation email. You must confirm this subscription to receive future notification emails. If you don’t confirm the subscription, your security group ingress rules still will be automatically reverted, but you will not receive notification emails.
  2. Navigate to the EC2 console and choose Security Groups in the navigation pane.
  3. Choose the security group created by CloudFormation. Its name is Web Server Security Group.
  4. Choose the Inbound tab in the bottom pane of the page. Note that only one rule allows HTTPS ingress on port 443 from 0.0.0.0/0 (from anywhere).
    Screenshot showing the "Inbound" tab in the bottom pane of the page
  5. Choose Edit to display the Edit inbound rules dialog box (again, an inbound rule and an ingress rule are the same thing).
  6. Choose Add Rule.
  7. Choose SSH from the Type drop-down list.
  8. Choose My IP from the Source drop-down list. Your IP address is populated for you. By adding this rule, you are simulating one of your staff members violating your organization’s policy (in this blog post’s hypothetical example) against allowing SSH access to your EC2 servers. You are testing the solution created when you launched the CloudFormation stack in the previous section. The solution should remove this newly created SSH rule automatically.
    Screenshot of editing inbound rules
  9. Choose Save.

Adding this rule creates an EC2 AuthorizeSecurityGroupIngress service event, which triggers the Lambda function created in the CloudFormation stack. After a few moments, choose the refresh button to see that the new SSH ingress rule that you just created has been removed by the solution you deployed earlier with the CloudFormation stack. If the rule is still there, wait a few more moments and choose the refresh button again.

Screenshot of refreshing the page to see that the SSH ingress rule has been removed

You should also receive an email to notify you that the ingress rule was added and subsequently reverted.

Screenshot of the notification email

Cleaning up

If you want to remove the resources created by this CloudFormation stack, you can delete the CloudFormation stack:

  1. Navigate to the CloudFormation console.
  2. Choose the stack that you created earlier.
  3. Choose the Actions drop-down list.
  4. Choose Delete Stack, and then choose Yes, Delete.
  5. CloudFormation will display a status of DELETE_IN_PROGRESS while it deletes the resources created with the stack. After a few moments, the stack should no longer appear in the list of completed stacks.
    Screenshot of stack "DELETE_IN_PROGRESS"

Other applications of this solution

I have shown one way to use multiple AWS services to help continuously ensure that your security controls haven’t deviated from your security baseline. However, you also could use the CIS Amazon Web Services Foundations Benchmarks, for example, to establish a governance baseline across your AWS accounts and then use the principles in this blog post to automatically mitigate changes to that baseline.

To scale this solution, you can create a framework that uses resource tags to identify particular resources for monitoring. You also can use a consolidated monitoring approach by using cross-account event delivery. See Sending and Receiving Events Between AWS Accounts for more information. You also can extend the principle of automatic mitigation to detect and revert changes to other resources such as IAM policies and Amazon S3 bucket policies.

Summary

In this blog post, I demonstrated how you can automatically revert changes to a VPC security group and have a notification sent about the changes. You can use this solution in your own AWS accounts to enforce your security requirements continuously.

If you have comments about this blog post or other ideas for ways to use this solution, submit a comment in the “Comments” section below. If you have implementation questions, start a new thread in the EC2 forum or contact AWS Support.

– Rob

Roku Is Building Its Own Anti-Piracy Team

Post Syndicated from Ernesto original https://torrentfreak.com/roku-building-anti-piracy-team/

Online streaming piracy is on the rise and many people use dedicated media players to watch unauthorized content through their regular TV.

Although the media players themselves can be used for perfectly legal means, third-party add-ons turn them into pirate machines, providing access to movies, TV-shows and more.

The entertainment industry isn’t happy with this development and is trying to halt further growth wherever possible.

Just a few months ago, Roku was harshly confronted with this new reality when a Mexican court ordered local retailers to take its media player off the shelves. This legal battle is still ongoing, but it’s clear that Roku itself is now taking a more proactive role.

While Roku never permitted any infringing content, the company is taking steps to better deal with the problem. The company has already begun warning users of copyright-infringing third-party channels, but that was only the beginning.

Two new job applications posted by Roku a few days ago reveal that the company is putting together an in-house anti-piracy team to keep the problem under control.

One of the new positions is that of Director, Anti-Piracy and Content Security. Roku stresses that this is a brand new position, which involves shaping the company’s anti-piracy strategy.

“The Director, Anti-Piracy and Content Security is responsible for defining the technology roadmap and overseeing implementation of anti-piracy and content security initiatives at Roku,” the application reads.

“This role requires ability to benchmark Roku against best practices (i.e. MPAA, Studio & Customer) but also requires an emphasis on maintaining deep insight into the evolving threat landscape and technical challenges of combating piracy.”

The job posting

The second job listed by Roku is that of an anti-piracy software engineer. One of the main tasks of this position is to write software for the Roku platform to monitor and prevent piracy.

“In this role, you will be responsible for implementing anti-piracy and content protection technology as it pertains to Roku OS,” the application explains.

“This entails developing software features, conducting forensic investigations and mining Roku’s big data platform and other threat intelligence sources for copyright infringement activities on and off platform.”

While a two-person team is relatively small, it may well grow in the future, assuming there aren’t already people working in similar roles. What’s clear, however, is that Roku takes piracy very seriously.

With Hollywood closely eyeing the streaming box landscape, the company is doing its best to keep copyright holders onside.


Backblaze’s Upgrade Guide for macOS High Sierra

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/macos-high-sierra-upgrade-guide/

High Sierra

Apple introduced macOS 10.13 “High Sierra” at its 2017 Worldwide Developers Conference in June. On Tuesday, we learned we don’t have long to wait — the new OS will be available on September 25. It’s a free upgrade, and millions of Mac users around the world will rush to install it.

We understand. A new OS from Apple is exciting. But please, before you upgrade, we want to remind you to back up your Mac. You want your data to be safe from unexpected problems that could happen during the upgrade. We do, too. To make that easier, Backblaze offers this macOS High Sierra upgrade guide.

Why Upgrade to macOS 10.13 High Sierra?

High Sierra, as the name suggests, is a follow-on to the previous macOS, Sierra. Its major focus is on improving the base OS, with significant changes to the file system, video, graphics, and virtual/augmented reality that will support new capabilities in the future.

But don’t despair; there also are outward improvements that will be readily apparent to everyone when they boot the OS for the first time. We’ll cover both the inner and outer improvements coming in this new OS.

Under the Hood of High Sierra

APFS (Apple File System)

Apple has been rolling out its first file system upgrade for a while now. It’s already in iOS: now High Sierra brings APFS to the Mac. Apple touts APFS as a new file system optimized for Flash/SSD storage and featuring strong encryption, better and faster file handling, safer copying and moving of files, and other improved file system fundamentals.

We went into detail about the enhancements and improvements that APFS has over the previous file system, HFS+, in an earlier post. Many of these improvements, including enhanced performance, security and reliability of data, will provide immediate benefits to users, while others provide a foundation for future storage innovations and will require work by Apple and third parties to support in their products and services.

Most of us won’t notice these improvements, but we’ll benefit from better, faster, and safer file handling, which I think all of us can appreciate.

Video

High Sierra includes High Efficiency Video Encoding (HEVC, aka H.265), which preserves better detail and color while also introducing improved compression over H.264 (MPEG-4 AVC). Even existing Macs will benefit from the HEVC software encoding in High Sierra, but newer Mac models include HEVC hardware acceleration for even better performance.

MacBook Pro

Metal 2

macOS High Sierra introduces Metal 2, the next generation of Apple’s Metal graphics API that was launched three years ago. Apple claims that Metal 2 provides up to 10x better performance in key areas. It provides near-direct access to the graphics processor (GPU), enabling the GPU to take control over key aspects of the rendering pipeline. Metal 2 will enhance the Mac’s capability for machine learning, and is the technology driving the new virtual reality platform on Macs.

audio video editor screenshot

Virtual Reality

We’re about to see an explosion of virtual reality experiences on both the Mac and iOS thanks to High Sierra and iOS 11. Content creators will be able to use apps like Final Cut Pro X, Epic Unreal 4 Editor, and Unity Editor to create fully immersive worlds that will revolutionize entertainment and education and have many professional uses, as well.

Users will want the new iMac with Retina 5K display or the upcoming iMac Pro to enjoy them, or any supported Mac paired with the latest external GPU and VR headset.

iMac and HTC virtual reality player

Outward Improvements

Siri

Siri logo

Expect a more natural voice from Siri in High Sierra. She or he will be less robotic, with greater expression and use of intonation in speech. Siri will also learn more about your preferences in things like music, helping you choose music that fits your taste and putting together playlists expressly for you. Expect Siri to be able to answer your questions about music-related trivia, as well.

Siri:  what does “scaramouche” refer to in the song Bohemian Rhapsody?

Photos

HD MacBook Pro screenshot

Photos has been redesigned with a new layout and new tools. A redesigned Edit view includes new tools for fine-tuning color and contrast and making adjustments within a defined color range. Some fun elements for creating special effects and memories also have been added. Photos now works with external apps such as Photoshop and Pixelmator. Compatibility with third-party extensions adds printing and publishing services to help get your photos out into the world.

Safari

Safari logo

Apple claims that Safari in High Sierra is the world’s fastest desktop browser, outperforming Chrome and other browsers in a range of benchmark tests. They’ve also added autoplay blocking for those pesky videos that play without your permission and tracking blocking to help protect your privacy.

Can My Mac Run macOS High Sierra 10.13?

All Macs introduced in mid 2010 or later are compatible. MacBook and iMac computers introduced in late 2009 are also compatible. You’ll need OS X 10.7.5 “Lion” or later installed, along with at least 2 GB RAM and 8.8 GB of available storage to manage the upgrade.
Some features of High Sierra require an internet connection or an Apple ID. You can check to see if your Mac is compatible with High Sierra on Apple’s website.

Conquering High Sierra — What Do I Do Before I Upgrade?

Back Up That Mac!

It’s always smart to back up before you upgrade the operating system or make any other crucial changes to your computer. Upgrading your OS is a major change to your computer, and if anything goes wrong…well, you don’t want that to happen.

iMac backup screenshot

We recommend the 3-2-1 Backup Strategy to make sure your data is safe. What does that mean? Have three copies of your data. There’s the “live” version on your Mac, a local backup (Time Machine, another copy on a local drive or other computer), and an offsite backup like Backblaze. No matter what happens to your computer, you’ll have a way to restore the files if anything goes wrong. Need help understanding how to back up your Mac? We have you covered with a handy Mac backup guide.

Check for App and Driver Updates

This is when it helps to do your homework. Check with app developers or device manufacturers to find if their apps and devices have updates to work with High Sierra. Visit their websites or use the Check for Updates feature built into most apps (often found in the File or Help menus).

If you’ve downloaded apps through the Mac App Store, make sure to open them and click on the Updates button to download the latest updates.

Updating can be hit or miss when you’ve installed apps that didn’t come from the Mac App Store. To make it easier, visit the MacUpdate website. MacUpdate tracks changes to thousands of Mac apps.


Will Backblaze work with macOS High Sierra?

Yes. We’ve taken care to ensure that Backblaze works with High Sierra. We’ve already enhanced our Macintosh client to report the space available on an APFS container, and we plan to add further support for APFS capabilities in the future.

Of course, we’ll watch Apple’s release carefully for any last minute surprises. We’ll officially offer support for High Sierra once we’ve had a chance to thoroughly test the release version.


Set Aside Time for the Upgrade

Depending on the speed of your Internet connection and your computer, upgrading to High Sierra will take some time. You’ll be able to use your Mac straightaway after answering a few questions at the end of the upgrade process.

If you’re going to install High Sierra on multiple Macs, a time-and-bandwidth-saving tip came from a Backblaze customer who suggested copying the installer from your Mac’s Applications folder to a USB Flash drive (or an external drive) before you run it. The installer routinely deletes itself once the upgrade process is completed, but if you grab it before that happens you can use it on other computers.

Where Do I get High Sierra?

Apple says that High Sierra will be available on September 25. Like other Mac operating system releases, Apple offers macOS 10.13 High Sierra for download from the Mac App Store, which is included on the Mac. As long as your Mac is supported and running OS X 10.7.5 “Lion” (released in 2012) or later, you can download and run the installer. It’s free. Thank you, Apple.

Better to be Safe than Sorry

Back up your Mac before doing anything to it, and make Backblaze part of your 3-2-1 backup strategy. That way your data is secure. Even if you have to roll back after an upgrade, or if you run into other problems, your data will be safe and sound in your backup.

Tell us How it Went

Are you getting ready to install High Sierra? Still have questions? Let us know in the comments. Tell us how your update went and what you like about the new release of macOS.

And While You’re Waiting for High Sierra…

While you’re waiting for Apple to release High Sierra on September 25, you might want to check out these other posts about using your Mac and Backblaze.


From Data Lake to Data Warehouse: Enhancing Customer 360 with Amazon Redshift Spectrum

Post Syndicated from Dylan Tong original https://aws.amazon.com/blogs/big-data/from-data-lake-to-data-warehouse-enhancing-customer-360-with-amazon-redshift-spectrum/

Achieving a 360° view of your customer has become increasingly challenging as companies embrace omni-channel strategies, engaging customers across websites, mobile, call centers, social media, physical sites, and beyond. The promise of a web where online and physical worlds blend makes understanding your customers more challenging, but also more important. Businesses that are successful in this medium have a significant competitive advantage.

The big data challenge requires managing data at high velocity and volume. Many customers have identified Amazon S3 as a great data lake solution that removes the complexities of managing a highly durable, fault-tolerant data lake infrastructure at scale, economically.

AWS data services substantially lessen the heavy lifting of adopting technologies, allowing you to spend more time on what matters most—gaining a better understanding of customers to elevate your business. In this post, I show how a recent Amazon Redshift innovation, Redshift Spectrum, can enhance a customer 360 initiative.

Customer 360 solution

A successful customer 360 view benefits from using a variety of technologies to deliver different forms of insights. These could range from real-time analysis of streaming data from wearable devices and mobile interactions to historical analysis that requires interactive, on demand queries on billions of transactions. In some cases, insights can only be inferred through AI via deep learning. Finally, the value of your customer data and insights can’t be fully realized until it is operationalized at scale—readily accessible by fleets of applications. Companies are leveraging AWS for the breadth of services that cover these domains, to drive their data strategy.

A number of AWS customers stream data from various sources into an S3 data lake through Amazon Kinesis. They use Kinesis and technologies in the Hadoop ecosystem, like Spark running on Amazon EMR, to enrich this data. High-value data is loaded into an Amazon Redshift data warehouse, which allows users to analyze and interact with data through a choice of client tools. Redshift Spectrum expands on this analytics platform by enabling Amazon Redshift to blend and analyze data beyond the data warehouse and across a data lake.

The following diagram illustrates the workflow for such a solution.

This solution delivers value by:

  • Reducing complexity and time to value for deeper insights. For instance, an existing data model in Amazon Redshift may provide insights across dimensions such as customer, geography, time, and product on metrics from sales and financial systems. Down the road, you may gain access to streaming data sources like customer-care call logs and website activity that you want to blend in with the sales data on the same dimensions to understand how web and call center experiences may be correlated with sales performance. Redshift Spectrum can join these dimensions in Amazon Redshift with data in S3 to allow you to quickly gain new insights, and avoid the slow and more expensive alternative of fully integrating these sources with your data warehouse.
  • Providing an additional avenue for optimizing costs and performance. In cases like call logs and clickstream data where volumes could be many TBs to PBs, storing the data exclusively in S3 yields significant cost savings. Interactive analysis on massive datasets may now be economically viable in cases where data was previously analyzed periodically through static reports generated by inexpensive batch processes. In some cases, you can improve the user experience while simultaneously lowering costs. Spectrum is powered by a large-scale infrastructure external to your Amazon Redshift cluster, and excels at scanning and aggregating large volumes of data. For instance, your analysts may be performing data discovery on customer interactions across millions of consumers over years of data across various channels. On this large dataset, certain queries could be slow if you didn’t have a large Amazon Redshift cluster. Alternatively, you could use Redshift Spectrum to achieve a better user experience with a smaller cluster.

Proof of concept walkthrough

To make evaluation easier for you, I’ve conducted a Redshift Spectrum proof-of-concept (PoC) for the customer 360 use case. For those who want to replicate the PoC, the instructions, AWS CloudFormation templates, and public data sets are available in the GitHub repository.

The remainder of this post is a journey through the project, observing best practices in action, and learning how you can achieve business value. The walkthrough involves:

  • An analysis of performance data from the PoC environment involving queries that demonstrate blending and analysis of data across Amazon Redshift and S3. Observe that great results are achievable at scale.
  • Guidance by example on query tuning, design, and data preparation to illustrate the optimization process. This includes tuning a query that combines clickstream data in S3 with customer and time dimensions in Amazon Redshift, and aggregates ~1.9 B out of 3.7 B+ records in under 10 seconds with a small cluster!
  • Guidance and measurements to help you decide between two options: accessing and analyzing data exclusively in Amazon Redshift, or using Redshift Spectrum to access data left in S3.

Stream ingestion and enrichment

The focus of this post isn’t stream ingestion and enrichment on Kinesis and EMR, but be mindful of performance best practices on S3 to ensure good streaming and query performance:

  • Use random object keys: The data files provided for this project are prefixed with SHA-256 hashes to prevent hot partitions. This is important to ensure optimal request rates to support PUT requests from the incoming stream, in addition to certain queries from large Amazon Redshift clusters that could send a large number of parallel GET requests.
  • Micro-batch your data stream: S3 isn’t optimized for small random write workloads. Your datasets should be micro-batched into large files. For instance, the “parquet-1” dataset provided batches >7 million records per file. The optimal file size for Redshift Spectrum is usually in the 100 MB to 1 GB range. A small sketch of both practices follows this list.
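As a rough illustration of both practices, the following Python sketch writes one micro-batched object per customer and month, with a SHA-256-derived component in the object name. The bucket name, key layout, and record format are assumptions for illustration and don’t match the PoC data files exactly.

# Hypothetical sketch: write each micro-batch as one large object whose name
# includes a SHA-256-derived component. Bucket name, key layout, and record
# format are illustrative assumptions, not the exact layout of the PoC files.
import hashlib
import json
import boto3

s3 = boto3.client("s3")
BUCKET = "my-clickstream-lake"  # placeholder bucket name


def flush_batch(records, customer, year_month):
    # Serialize the accumulated micro-batch as a single large object rather
    # than many small ones (S3 is not optimized for small random writes).
    body = "\n".join(json.dumps(r) for r in records).encode("utf-8")

    # Derive a hash to randomize the object name and avoid hot key prefixes.
    digest = hashlib.sha256(body).hexdigest()[:16]
    key = "clickstream/customer={}/visitYearMonth={}/{}.json".format(
        customer, year_month, digest)

    s3.put_object(Bucket=BUCKET, Key=key, Body=body)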

If you have an edge case that may pose scalability challenges, AWS would love to hear about it. For further guidance, talk to your solutions architect.

Environment

The project consists of the following environment:

  • Amazon Redshift cluster: 4 X dc1.large
  • Data:
    • Time and customer dimension tables are stored on all Amazon Redshift nodes (ALL distribution style):
      • The data originates from the DWDATE and CUSTOMER tables in the Star Schema Benchmark
      • The customer table contains attributes for 3 million customers.
      • The time data is at the day-level granularity, and spans 7 years, from the start of 1992 to the end of 1998.
    • The clickstream data is stored in an S3 bucket, and serves as a fact table.
      • Various copies of this dataset in CSV and Parquet format have been provided, for reasons to be discussed later.
      • The data is a modified version of the uservisits dataset from AMPLab’s Big Data Benchmark, which was generated by Intel’s Hadoop benchmark tools.
      • Changes were minimal, so that existing test harnesses for this test can be adapted:
        • Increased the 751,754,869-row dataset 5X to 3,758,774,345 rows.
        • Added surrogate keys to support joins with customer and time dimensions. These keys were distributed evenly across the entire dataset to represent user visits from six customers over seven years.
        • Values for the visitDate column were replaced to align with the 7-year timeframe, and the added time surrogate key.

Queries across the data lake and data warehouse 

Imagine a scenario where a business analyst plans to analyze clickstream metrics like ad revenue over time and by customer, market segment, and more. The query described below achieves this effect.

The query retrieves clickstream data from S3 and joins it with the time and customer dimension tables in Amazon Redshift. It returns the total ad revenue for three customers over the last three months, along with information on their respective market segments.

Unfortunately, this query takes around three minutes to run, and doesn’t enable the interactive experience that you want. However, there are a number of performance optimizations that you can implement to achieve the desired performance.

Performance analysis

Two key utilities provide visibility into Redshift Spectrum:

  • EXPLAIN
    Provides the query execution plan, which includes info around what processing is pushed down to Redshift Spectrum. Steps in the plan that include the prefix S3 are executed on Redshift Spectrum. For instance, the plan for the previous query has the step “S3 Seq Scan clickstream.uservisits_csv10”, indicating that Redshift Spectrum performs a scan on S3 as part of the query execution.
  • SVL_S3QUERY_SUMMARY
    Statistics for Redshift Spectrum queries are stored in this table. While the execution plan presents cost estimates, this table stores actual statistics for past query runs.

You can get the statistics of your last query by inspecting the SVL_S3QUERY_SUMMARY table with the condition (query = pg_last_query_id()). Inspecting the previous query reveals that the entire dataset of nearly 3.8 billion rows was scanned to retrieve less than 66.3 million rows. Improving scan selectivity in your query could yield substantial performance improvements.
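For example, here is a minimal sketch of pulling these statistics from a client, assuming a psycopg2 connection to the cluster; the connection details and the sample Spectrum query are placeholders, not values from the PoC environment.

# Minimal sketch (placeholder connection details): run a Spectrum query, then
# read its statistics from SVL_S3QUERY_SUMMARY in the same session.
import psycopg2

conn = psycopg2.connect(
    host="my-cluster.abc123.us-east-1.redshift.amazonaws.com",  # placeholder
    port=5439,
    dbname="dev",
    user="awsuser",
    password="...",  # placeholder
)

with conn.cursor() as cur:
    # Any Redshift Spectrum query; this one is illustrative only.
    cur.execute("SELECT COUNT(*) FROM clickstream.uservisits_csv10 WHERE customer <= 3")
    cur.fetchall()

    # Statistics for the query that just ran in this session.
    cur.execute("""
        SELECT s3_scanned_rows, s3_scanned_bytes, s3query_returned_rows
        FROM svl_s3query_summary
        WHERE query = pg_last_query_id()
    """)
    for scanned_rows, scanned_bytes, returned_rows in cur.fetchall():
        print(scanned_rows, scanned_bytes, returned_rows)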

Partitioning

Partitioning is a key means of improving scan efficiency. In your environment, the data and tables have already been organized and configured to support partitions. For more information, see the PoC project setup instructions. The clickstream table was defined as:

CREATE EXTERNAL TABLE clickstream.uservisits_csv10
…
PARTITIONED BY(customer int4, visitYearMonth int4)

The entire 3.8 billion-row dataset is organized as a collection of large files where each file contains data exclusive to a particular customer and month in a year. This allows you to partition your data into logical subsets by customer and year/month. With partitions, the query engine can target a subset of files:

  • Only for specific customers
  • Only data for specific months
  • A combination of specific customers and year/months

You can use partitions in your queries. Instead of joining your customer data on the surrogate customer key (that is, c.c_custkey = uv.custKey), the partition key “customer” should be used instead:

SELECT c.c_name, c.c_mktsegment, t.prettyMonthYear, SUM(uv.adRevenue)
…
ON c.c_custkey = uv.customer
…
ORDER BY c.c_name, c.c_mktsegment, uv.yearMonthKey  ASC

This query should run approximately twice as fast as the previous query. If you look at the statistics for this query in SVL_S3QUERY_SUMMARY, you see that only half the dataset was scanned. This is expected because your query is on three out of six customers on an evenly distributed dataset. However, the scan is still inefficient, and you can benefit from using your year/month partition key as well:

SELECT c.c_name, c.c_mktsegment, t.prettyMonthYear, SUM(uv.adRevenue)
…
ON c.c_custkey = uv.customer
…
ON uv.visitYearMonth = t.d_yearmonthnum
…
ORDER BY c.c_name, c.c_mktsegment, uv.visitYearMonth ASC

All joins between the tables are now using partitions. Upon reviewing the statistics for this query, you should observe that Redshift Spectrum scans and returns the exact number of rows, 66,270,117. If you run this query a few times, you should see execution time in the range of 8 seconds, which is a 22.5X improvement on your original query!

Predicate pushdown and storage optimizations 

Previously, I mentioned that Redshift Spectrum performs processing through large-scale infrastructure external to your Amazon Redshift cluster. It is optimized for performing large scans and aggregations on S3. In fact, Redshift Spectrum may even outperform a medium-sized Amazon Redshift cluster on these types of workloads with the proper optimizations. There are two important variables to consider for optimizing large scans and aggregations:

  • File size and count. As a general rule, use files 100 MB-1 GB in size, as Redshift Spectrum and S3 are optimized for reading objects of this size. However, the number of files a query operates on is directly correlated with the parallelism achievable by that query. There is an inverse relationship between file size and count: the bigger the files, the fewer files there are for the same dataset. Consequently, there is a trade-off between optimizing for object read performance and the amount of parallelism achievable on a particular query. Large files are best for large scans, as such queries likely operate on a sufficiently large number of files. For queries that are more selective and that read fewer files, you may find that smaller files allow for more parallelism.
  • Data format. Redshift Spectrum supports various data formats. Columnar formats like Parquet can sometimes lead to substantial performance benefits by providing compression and more efficient I/O for certain workloads. Generally, format types like Parquet should be used for query workloads involving large scans, and high attribute selectivity. Again, there are trade-offs as formats like Parquet require more compute power to process than plaintext. For queries on smaller subsets of data, the I/O efficiency benefit of Parquet is diminished. At some point, Parquet may perform the same or slower than plaintext. Latency, compression rates, and the trade-off between user experience and cost should drive your decision.

To help illustrate how Redshift Spectrum performs on these large aggregation workloads, run a basic query that aggregates the entire ~3.7 billion record dataset on Redshift Spectrum, and compare that with running the query exclusively on Amazon Redshift:

SELECT uv.custKey, COUNT(uv.custKey)
FROM <your clickstream table> as uv
GROUP BY uv.custKey
ORDER BY uv.custKey ASC

For the Amazon Redshift test case, the clickstream data is loaded and distributed evenly across all nodes (even distribution style), with optimal column compression encodings as prescribed by Amazon Redshift’s ANALYZE COMPRESSION command.

The Redshift Spectrum test case uses a Parquet data format, with each file containing all the data for a particular customer in a month. This results in files mostly in the range of 220-280 MB, which is, in effect, the largest file size for this partitioning scheme. If you run tests with the other datasets provided, you will see that this data format and size is optimal and outperforms the others by ~60X.

Performance differences will vary depending on the scenario. The important takeaway is to understand the testing strategy and the workload characteristics where Redshift Spectrum is likely to yield performance benefits. 

The following chart compares the query execution time for the two scenarios. The results indicate that you would have to pay for 12 X DC1.Large nodes to get performance comparable to using a small Amazon Redshift cluster that leverages Redshift Spectrum. 

Chart showing simple aggregation on ~3.7 billion records

So you’ve validated that Spectrum excels at performing large aggregations. Could you benefit by pushing more work down to Redshift Spectrum in your original query? It turns out that you can, by making the following modification:

The clickstream data is stored at a day-level granularity for each customer, while your query rolls up the data to the month level per customer. In the earlier query that uses the year/month partition key, you optimized the query so that it only scans and retrieves the data required, but the day-level data is still sent back to your Amazon Redshift cluster for joining and aggregation. The query shown here pushes aggregation work down to Redshift Spectrum, as indicated by its query plan.

In this query, Redshift Spectrum aggregates the clickstream data to the month level before it is returned to the Amazon Redshift cluster and joined with the dimension tables. This query should complete in about 4 seconds, which is roughly twice as fast as only using the partition key. The speed increase is evident upon reviewing the SVL_S3QUERY_SUMMARY table:

  • Bytes scanned is 21.6X less because of the Parquet data format.
  • Only 90 records are returned back to the Amazon Redshift cluster as a result of the push-down, instead of ~66.2 million, leading to substantially less join overhead, and about 530 MB less data sent back to your cluster.
  • No adverse change in average parallelism.

Assessing the value of Amazon Redshift vs. Redshift Spectrum

At this point, you might be asking yourself, why would I ever not use Redshift Spectrum? Well, you still get additional value for your money by loading data into Amazon Redshift, and querying in Amazon Redshift vs. querying S3.

In fact, it turns out that the last version of our query runs even faster when executed exclusively in native Amazon Redshift, as shown in the following chart:

Chart comparing Amazon Redshift vs. Redshift Spectrum with pushdown aggregation over 3 months of data

As a general rule, queries that aren’t dominated by I/O and that involve multiple joins are better optimized in native Amazon Redshift. For instance, the performance difference between running the partition key query entirely in Amazon Redshift versus with Redshift Spectrum is twice as large as that of the pushdown aggregation query, partly because the former case benefits more from better join performance.

Furthermore, the variability in latency in native Amazon Redshift is lower. For use cases where you have tight performance SLAs on queries, you may want to consider using Amazon Redshift exclusively to support those queries.

On the other hand, when you perform large scans, you could benefit from the best of both worlds: higher performance at lower cost. For instance, imagine that you wanted to enable your business analysts to interactively discover insights across a vast amount of historical data. In the example below, the pushdown aggregation query is modified to analyze seven years of data instead of three months:

SELECT c.c_name, c.c_mktsegment, t.prettyMonthYear, uv.totalRevenue
…
WHERE customer <= 3 and visitYearMonth >= 199201
… 
FROM dwdate WHERE d_yearmonthnum >= 199201) as t
…
ORDER BY c.c_name, c.c_mktsegment, uv.visitYearMonth ASC

This query requires scanning and aggregating nearly 1.9 billion records. As shown in the chart below, Redshift Spectrum substantially speeds up this query. A large Amazon Redshift cluster would have to be provisioned to support this use case. With the aid of Redshift Spectrum, you could use an existing small cluster, keep a single copy of your data in S3, and benefit from economical, durable storage while only paying for what you use via the pay per query pricing model.

Chart comparing Amazon Redshift vs. Redshift Spectrum with pushdown aggregation over 7 years of data

Summary

Redshift Spectrum lowers the time to value for deeper insights on customer data queries spanning the data lake and data warehouse. It can enable interactive analysis on datasets in cases that weren’t economically practical or technically feasible before.

There are cases where you can get the best of both worlds from Redshift Spectrum: higher performance at lower cost. However, there are still latency-sensitive use cases where you may want native Amazon Redshift performance. For more best practice tips, see the 10 Best Practices for Amazon Redshift post.

Please visit the Amazon Redshift Spectrum PoC Environment GitHub page. If you have questions or suggestions, please comment below.

 


Additional Reading

Learn more about how Amazon Redshift Spectrum extends data warehousing out to exabytes – no loading required.


About the Author

Dylan Tong is an Enterprise Solutions Architect at AWS. He works with customers to help drive their success on the AWS platform through thought leadership and guidance on designing well architected solutions. He has spent most of his career building on his expertise in data management and analytics by working for leaders and innovators in the space.

 

 

AWS Hot Startups – June 2017

Post Syndicated from Tina Barr original https://aws.amazon.com/blogs/aws/aws-hot-startups-june-2017/

Thanks for stopping by for another round of AWS Hot Startups! This month we are featuring:

  • CloudRanger – helping companies understand the cloud with visual representation.
  • quintly – providing social media analytics for brands on a single dashboard.
  • Tango Card – reinventing rewards programs for businesses and their customers worldwide.

Don’t forget to check out May’s Hot Startups in case you missed them.

CloudRanger (Letterkenny, Ireland)   

The idea for CloudRanger started where most great ideas do – at a bar in Las Vegas. During a late-night conversation with his friends at re:Invent 2014, Dave Gildea (Founder and CEO) used cocktail napkins and drink coasters to visually illustrate servers and backups, and the light on his phone to represent scheduling. By the end of the night, the idea for automated visual server management was born. With CloudRanger, companies can easily create backup and retention policies, visual scheduling, and simple restoration of snapshots and AMIs. The team behind CloudRanger believes that when servers and cloud resources are represented visually, they are easier to manage and understand. Users are able to see their servers, which turns them into a tangible and important piece of business inventory.

CloudRanger is an excellent platform for MSPs who manage many different AWS accounts, and need a quick method to display many servers and audit certain attributes. The company’s goal is to give anyone the ability to create backup policies in multiple regions, apply them using a tag-based methodology, and manage backups. Servers can be scheduled from one simple dashboard, and restoration is easy and step-by-step. With CloudRanger’s visual representation of resources, customers are encouraged to fully understand their backup policies, schedules, and servers.

As an AWS Partner, CloudRanger has built a globally redundant system after going all-in with AWS. They use more than 25 AWS services for everything from enterprise-level security to automation and 24/7 runtimes, with an emphasis on machine learning for efficiency in the sales process. CloudRanger continues to rely more on AWS as new services and features are released, and is replacing current services with AWS CodePipeline and AWS CodeBuild. CloudRanger was also named Startup Company of the Year at a recent Irish tech event!

To learn more about CloudRanger, visit their website.

quintly (Cologne, Germany)

In 2010, brothers Alexander Peiniger and Frederik Peiniger started a journey to help companies track their social media profiles and improve their strategies against competitors. The startup began under the name “Social.Media.Tracking” and then “AllFacebook Stats” before officially becoming quintly in 2013. With quintly, brands and agencies can analyze, benchmark, and optimize their social media activities on a global scale. The innovative dashboarding system gives clients an overview across all social media profiles on the most important networks (Facebook, Twitter, YouTube, Google+, LinkedIn, Instagram, etc.) and then derives an optimal social media strategy from those profiles. Today, quintly has users in over 180 countries and paying clients in over 65 countries including major agency networks and Fortune 500 companies.

Getting an overview of a brand’s social media activities can be time-consuming, and turning insights into actions is a challenge that not all brands master. Quintly offers a variety of features designed to help clients improve their social media reach. With their web-based SaaS product, brands and agencies can compare their social media performance against competitors and their best practices. Not only can clients learn from their own historic performance, but they can leverage data from any other brand around the world.

Since the company’s founding, quintly has built and operated its SaaS offering on top of AWS services, leveraging Amazon EC2, Amazon ECS, Elastic Load Balancing, and Amazon Route 53 to host their Docker-based environment. Large amounts of data are stored in Amazon DynamoDB and Amazon RDS, and they use Amazon CloudWatch to monitor and seamlessly scale to their current needs. In addition, quintly is using Amazon Machine Learning to add additional attributes to the data and to drive better decisions for their clients. With the help of AWS, quintly has been able to focus on their core business while having a scalable and well-performing solution to solve their technical needs.

For more on quintly, check out their Social Media Analytics blog.

Tango Card (Seattle, Washington)

Based in the heart of West Seattle, Tango Card is revolutionizing rewards programs for companies around the world. Too often customers redeem points in a loyalty or rebate program only to wait weeks for their prize to arrive. Companies generously give their employees appreciation gifts, but the gifts can be generic and impersonal. With Tango Card, companies can choose from a variety of rewards that fit the needs of their specific program, event, or business incentive. The extensive Rewards Catalog includes options for e-gift cards that are sure to excite any recipient. There are plenty of options for everyone from traditional e-gift cards to nonprofit donations to cash equivalent rewards.

Tango Card uses a combination of desired rewards, modern technology, and expert service to change the rewards and incentive experience. The Reward Delivery Platform offers solutions including Blast Rewards, Reward Link, and Rewards as a Service API (RaaS). Blast Rewards enables companies to purchase and send e-gift cards in bulk in just one business day. Reward Link lets recipients choose from an assortment of e-gift cards, prepaid cards, digital checks, and donations and is delivered instantly. Finally, Rewards as a Service is a robust digital gift card API that is built to support apps and platforms. With RaaS, Tango Card can send out e-gift cards on company-branded email templates or deliver them directly within a user interface.

The entire Tango Card Reward Delivery Platform leverages many AWS services. They use Amazon EC2 Container Service (ECS) for rapid deployment of containerized microservices, and Amazon Relational Database Service (RDS) for low-overhead managed databases. Tango Card is also leveraging Amazon Virtual Private Cloud (VPC), AWS Key Management Service (KMS), and AWS Identity and Access Management (IAM).

To learn more about Tango Card, check out their blog!

I would also like to thank Alexander Moss-Bolanos for helping with the Hot Startups posts this year.

Thanks for reading and we’ll see you next month!

-Tina Barr

How to Create an AMI Builder with AWS CodeBuild and HashiCorp Packer – Part 2

Post Syndicated from Heitor Lessa original https://aws.amazon.com/blogs/devops/how-to-create-an-ami-builder-with-aws-codebuild-and-hashicorp-packer-part-2/

Written by AWS Solutions Architects Jason Barto and Heitor Lessa

 
In Part 1 of this post, we described how AWS CodeBuild, AWS CodeCommit, and HashiCorp Packer can be used to build an Amazon Machine Image (AMI) from the latest version of Amazon Linux. In this post, we show how to use AWS CodePipeline, AWS CloudFormation, and Amazon CloudWatch Events to continuously ship new AMIs. We use Ansible by Red Hat to harden the OS on the AMIs through a well-known set of security controls outlined by the Center for Internet Security in its CIS Amazon Linux Benchmark.

You’ll find the source code for this post in our GitHub repo.

At the end of this post, we will have the following architecture:

Requirements

 
To follow along, you will need Git and a text editor. Make sure Git is configured to work with AWS CodeCommit, as described in Part 1.

Technologies

 
In addition to the services and products used in Part 1 of this post, we also use these AWS services and third-party software:

AWS CloudFormation gives developers and systems administrators an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion.

Amazon CloudWatch Events enables you to react selectively to events in the cloud and in your applications. Specifically, you can create CloudWatch Events rules that match event patterns, and take actions in response to those patterns.

AWS CodePipeline is a continuous integration and continuous delivery service for fast and reliable application and infrastructure updates. AWS CodePipeline builds, tests, and deploys your code every time there is a code change, based on release process models you define.

Amazon SNS is a fast, flexible, fully managed push notification service that lets you send individual messages or fan out messages to large numbers of recipients. Amazon SNS makes it simple and cost-effective to send push notifications to mobile device users or email recipients. The service can even send messages to other distributed services.

Ansible is a simple IT automation system that handles configuration management, application deployment, cloud provisioning, ad-hoc task-execution, and multinode orchestration.

Getting Started

 
We use CloudFormation to bootstrap the following infrastructure:

  • AWS CodeCommit repository: Git repository where the AMI builder code is stored.
  • S3 bucket: Build artifact repository used by AWS CodePipeline and AWS CodeBuild.
  • AWS CodeBuild project: Executes the AWS CodeBuild instructions contained in the build specification file.
  • AWS CodePipeline pipeline: Orchestrates the AMI build process, triggered by new changes in the AWS CodeCommit repository.
  • SNS topic: Notifies subscribed email addresses when an AMI build is complete.
  • CloudWatch Events rule: Defines how the AMI builder should send a custom event to notify an SNS topic.

AMI Builder launch templates are available for the N. Virginia (us-east-1) and Ireland (eu-west-1) Regions.

After launching the CloudFormation template for your Region, we will have a pipeline in the AWS CodePipeline console. (A Failed status at this stage simply means we don’t have any data in our newly created AWS CodeCommit Git repository yet.)

Next, we will clone the newly created AWS CodeCommit repository.

If this is your first time connecting to an AWS CodeCommit repository, please see the instructions in our documentation on Setup steps for HTTPS Connections to AWS CodeCommit Repositories.

To clone the AWS CodeCommit repository (console)

  1. From the AWS Management Console, open the AWS CloudFormation console.
  2. Choose the AMI-Builder-Blogpost stack, and then choose Output.
  3. Make a note of the Git repository URL.
  4. Use git to clone the repository.

For example: git clone https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/AMI-Builder_repo

To clone the AWS CodeCommit repository (CLI)

# Retrieve CodeCommit repo URL
git_repo=$(aws cloudformation describe-stacks --query 'Stacks[0].Outputs[?OutputKey==`GitRepository`].OutputValue' --output text --stack-name "AMI-Builder-Blogpost")

# Clone repository locally
git clone ${git_repo}

Bootstrap the Repo with the AMI Builder Structure

 
Now that our infrastructure is ready, download all the files and templates required to build the AMI.

Your local Git repo should have the following structure:

.
├── ami_builder_event.json
├── ansible
├── buildspec.yml
├── cloudformation
├── packer_cis.json

Next, push these changes to AWS CodeCommit, and then let AWS CodePipeline orchestrate the creation of the AMI:

git add .
git commit -m "My first AMI"
git push origin master

AWS CodeBuild Implementation Details

 
While we wait for the AMI to be created, let’s see what’s changed in our AWS CodeBuild buildspec.yml file:

...
phases:
  ...
  build:
    commands:
      ...
      - ./packer build -color=false packer_cis.json | tee build.log
  post_build:
    commands:
      - egrep "${AWS_REGION}\:\sami\-" build.log | cut -d' ' -f2 > ami_id.txt
      # Packer doesn't return non-zero status; we must do that if Packer build failed
      - test -s ami_id.txt || exit 1
      - sed -i.bak "s/<<AMI-ID>>/$(cat ami_id.txt)/g" ami_builder_event.json
      - aws events put-events --entries file://ami_builder_event.json
      ...
artifacts:
  files:
    - ami_builder_event.json
    - build.log
  discard-paths: yes

In the build phase, we capture Packer output into a file named build.log. In the post_build phase, we take the following actions:

  1. Look up the AMI ID created by Packer and save its findings to a temporary file (ami_id.txt).
  2. Force AWS CodeBuild to fail if the AMI ID (ami_id.txt) is not found. This is required because Packer doesn’t fail if something goes wrong during the AMI creation process. We have to tell AWS CodeBuild to stop by informing it that an error occurred.
  3. If an AMI ID is found, we update the ami_builder_event.json file and then notify CloudWatch Events that the AMI creation process is complete.
  4. CloudWatch Events publishes a message to an SNS topic. Anyone subscribed to the topic will be notified in email that an AMI has been created.

Lastly, the new artifacts phase instructs AWS CodeBuild to upload files built during the build process (ami_builder_event.json and build.log) to the S3 bucket specified in the Outputs section of the CloudFormation template. These artifacts can then be used as an input artifact in any later stage in AWS CodePipeline.

For information about customizing the artifacts sequence of the buildspec.yml, see the Build Specification Reference for AWS CodeBuild.

CloudWatch Events Implementation Details

 
CloudWatch Events allows you to extend the AMI builder so that it not only sends email after the AMI has been created, but also hooks up any of the supported targets to react to the AMI builder event. This event publication means you can decouple the actions you might take after AMI completion from Packer, and plug in other actions as you see fit.
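As one hypothetical example of plugging in another action, you could attach a Lambda target to the same rule and have it copy the newly built AMI to a second Region. The destination Region and AMI name prefix below are assumptions for illustration, not something the stack in this post sets up.

# Hypothetical additional target: a Lambda function that copies the new AMI to
# a second Region. The destination Region and name prefix are assumptions.
import boto3

def handler(event, context):
    # The AMI builder event carries the new AMI ID in its "resources" list.
    ami_id = event["resources"][0].strip()

    dr_region = boto3.client("ec2", region_name="us-west-2")  # placeholder Region
    dr_region.copy_image(
        SourceRegion=event["region"],
        SourceImageId=ami_id,
        Name="ami-builder-copy-{}".format(ami_id),
    )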

For more information about targets in CloudWatch Events, see the CloudWatch Events API Reference.
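For example, here is a minimal sketch (not part of the original solution) that attaches an additional Lambda function as a target of the event rule. The rule name (AMIBuilderCustomEvent) and function name (my-post-ami-function) are assumptions; replace them with the names used in your stack.

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Assumed names; replace with the rule created by your CloudFormation stack
# and an existing Lambda function of your own.
function_arn = lambda_client.get_function(
    FunctionName='my-post-ami-function')['Configuration']['FunctionArn']

# Add the Lambda function as an additional target of the existing rule
events.put_targets(
    Rule='AMIBuilderCustomEvent',
    Targets=[{'Id': 'post-ami-action', 'Arn': function_arn}]
)

# Allow CloudWatch Events to invoke the function
lambda_client.add_permission(
    FunctionName='my-post-ami-function',
    StatementId='AllowAmiBuilderRule',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com'
)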

In this case, CloudWatch Events should receive the following event, match it with a rule we created through CloudFormation, and publish a message to SNS so that you can receive an email.

Example CloudWatch custom event

[
        {
            "Source": "com.ami.builder",
            "DetailType": "AmiBuilder",
            "Detail": "{ \"AmiStatus\": \"Created\"}",
            "Resources": [ "ami-12cd5guf" ]
        }
]

CloudWatch Events rule

{
  "detail-type": [
    "AmiBuilder"
  ],
  "source": [
    "com.ami.builder"
  ],
  "detail": {
    "AmiStatus": [
      "Created"
    ]
  }
}

Example SNS message sent in email

{
    "version": "0",
    "id": "f8bdede0-b9d7...",
    "detail-type": "AmiBuilder",
    "source": "com.ami.builder",
    "account": "<<aws_account_number>>",
    "time": "2017-04-28T17:56:40Z",
    "region": "eu-west-1",
    "resources": ["ami-112cd5guf "],
    "detail": {
        "AmiStatus": "Created"
    }
}

Packer Implementation Details

 
In addition to the build specification file, there are differences between the current version of the HashiCorp Packer template (packer_cis.json) and the one used in Part 1.

Variables

  "variables": {
    "vpc": "{{env `BUILD_VPC_ID`}}",
    "subnet": "{{env `BUILD_SUBNET_ID`}}",
         “ami_name”: “Prod-CIS-Latest-AMZN-{{isotime \”02-Jan-06 03_04_05\”}}”
  },
  • ami_name: The AMI name; Packer also uses it to tag resources during the Builders sequence.
  • vpc and subnet: Read from environment variables that are set from the CloudFormation stack parameters.

We no longer assume a default VPC is present and instead use the VPC and subnet specified in the CloudFormation parameters. CloudFormation configures the AWS CodeBuild project to use these values as environment variables. They are made available throughout the build process.

That allows for more flexibility should you need to change which VPC and subnet will be used by Packer to launch temporary resources.

Builders

  "builders": [{
    ...
    "ami_name": “{{user `ami_name`| clean_ami_name}}”,
    "tags": {
      "Name": “{{user `ami_name`}}”,
    },
    "run_tags": {
      "Name": “{{user `ami_name`}}",
    },
    "run_volume_tags": {
      "Name": “{{user `ami_name`}}",
    },
    "snapshot_tags": {
      "Name": “{{user `ami_name`}}",
    },
    ...
    "vpc_id": "{{user `vpc` }}",
    "subnet_id": "{{user `subnet` }}"
  }],

We now have new tag properties (*_tags) and a new function (clean_ami_name), and we launch temporary resources in the VPC and subnet specified by the environment variables. AMI names can only contain a certain set of ASCII characters. If the input in the project deviates from the expected characters (for example, it includes whitespace or slashes), Packer's clean_ami_name function will fix it.

For more information, see functions on the HashiCorp Packer website.

Provisioners

  "provisioners": [
    {
        "type": "shell",
        "inline": [
            "sudo pip install ansible"
        ]
    }, 
    {
        "type": "ansible-local",
        "playbook_file": "ansible/playbook.yaml",
        "role_paths": [
            "ansible/roles/common"
        ],
        "playbook_dir": "ansible",
        "galaxy_file": "ansible/requirements.yaml"
    },
    {
      "type": "shell",
      "inline": [
        "rm .ssh/authorized_keys ; sudo rm /root/.ssh/authorized_keys"
      ]
    }
  ],

We used the shell provisioner to apply OS patches in Part 1. Now, we use it to install Ansible on the target machine, and we use the ansible-local provisioner to import, install, and execute Ansible roles that make our target machine conform to our standards.

Finally, Packer uses the shell provisioner again to remove temporary keys before it creates an AMI from the target (temporary) EC2 instance.

Ansible Implementation Details

 
Ansible provides OS patching through a custom Common role that can be easily customized for other tasks.

The CIS Benchmark and CloudWatch Logs are implemented through two third-party Ansible roles that are defined in ansible/requirements.yaml, as referenced in the Packer template.

The Ansible provisioner uses Ansible Galaxy to download these roles onto the target machine and execute them as instructed by ansible/playbook.yaml.

For information about how these components are organized, see the Playbook Roles and Include Statements in the Ansible documentation.

The following Ansible playbook (ansible/playbook.yaml) controls the execution order and custom properties:

---
- hosts: localhost
  connection: local
  gather_facts: true    # gather OS info that is made available for tasks/roles
  become: yes           # majority of CIS tasks require root
  vars:
    # CIS Controls whitepaper:  http://bit.ly/2mGAmUc
    # AWS CIS Whitepaper:       http://bit.ly/2m2Ovrh
    cis_level_1_exclusions:
    # 3.4.2 and 3.4.3 effectively blocks access to all ports to the machine
    ## This can break automation; ignoring it as there are stronger mechanisms than that
      - 3.4.2 
      - 3.4.3
    # CloudWatch Logs will be used instead of Rsyslog/Syslog-ng
    ## Same would be true if any other software doesn't support Rsyslog/Syslog-ng mechanisms
      - 4.2.1.4
      - 4.2.2.4
      - 4.2.2.5
    # Autofs is not installed in newer versions, let's ignore
      - 1.1.19
    # Cloudwatch Logs role configuration
    logs:
      - file: /var/log/messages
        group_name: "system_logs"
  roles:
    - common
    - anthcourtney.cis-amazon-linux
    - dharrisio.aws-cloudwatch-logs-agent

Both third-party Ansible roles can be easily configured through variables (vars). We use Ansible playbook variables to exclude CIS controls that don’t apply to our case and to instruct the CloudWatch Logs agent to stream the /var/log/messages log file to CloudWatch Logs.

If you need to add more OS or application logs, you can easily duplicate the playbook and make changes. The CloudWatch Logs agent will ship configured log messages to CloudWatch Logs.

For more information about parameters you can use to further customize the third-party roles, download the Ansible roles for the CloudWatch Logs Agent and CIS Amazon Linux from the Galaxy website.

Committing Changes

 
Now that Ansible and CloudWatch Events are configured as part of the build process, committing any changes to the AWS CodeCommit Git repository will trigger a new AMI build process that can be followed through the AWS CodePipeline console.

When the build is complete, an email will be sent to the email address you provided as a part of the CloudFormation stack deployment. The email serves as notification that an AMI has been built and is ready for use.

Summary

 
We used AWS CodeCommit, AWS CodePipeline, AWS CodeBuild, Packer, and Ansible to build a pipeline that continuously builds new, hardened CIS AMIs. We used Amazon SNS so that email addresses subscribed to an SNS topic are notified upon completion of the AMI build.

By treating our AMI creation process as code, we can iterate and track changes over time. In this way, it’s no different from a software development workflow. With that in mind, software patches, OS configuration, and logs that need to be shipped to a central location are only a git commit away.

Next Steps

 
Here are some ideas to extend this AMI builder:

  • Hook up a Lambda function in CloudWatch Events to update the EC2 Auto Scaling configuration upon completion of the AMI build (a minimal sketch appears at the end of this section).
  • Use AWS CodePipeline parallel steps to build multiple Packer images.
  • Add a commit ID as a tag for the AMI you created.
  • Create a scheduled Lambda function through CloudWatch Events to clean up old AMIs based on timestamp (name or additional tag).
  • Implement Windows support for the AMI builder.
  • Create a cross-account or cross-region AMI build.

CloudWatch Events allows the AMI builder to decouple AMI creation from the actions that follow it, so you can easily add your own logic using targets (AWS Lambda, Amazon SQS, Amazon SNS) to react to the event or recycle EC2 instances with the new AMI.
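As a sketch of the first idea above, the following Lambda handler reacts to the AmiBuilder event and rolls the new AMI into an Auto Scaling group by creating a launch configuration. The group name (web-asg), instance type, and naming scheme are assumptions, not part of the original solution.

import boto3

autoscaling = boto3.client('autoscaling')

def lambda_handler(event, context):
    # The AMI builder event carries the new AMI ID in the resources list
    ami_id = event['resources'][0]
    lc_name = 'web-asg-' + ami_id

    # Create a launch configuration that points at the new AMI
    autoscaling.create_launch_configuration(
        LaunchConfigurationName=lc_name,
        ImageId=ami_id,
        InstanceType='t2.micro'
    )

    # Point the Auto Scaling group at the new launch configuration;
    # instances launched from now on will use the hardened AMI
    autoscaling.update_auto_scaling_group(
        AutoScalingGroupName='web-asg',
        LaunchConfigurationName=lc_name
    )
    return {'ami': ami_id, 'launch_configuration': lc_name}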

If you have questions or other feedback, feel free to leave it in the comments or contribute to the AMI Builder repo on GitHub.

BPI Breaks Record After Sending 310 Million Google Takedowns

Post Syndicated from Andy original https://torrentfreak.com/bpi-breaks-record-after-sending-310-million-google-takedowns-170619/

A little over a year ago during March 2016, music industry group BPI reached an important milestone. After years of sending takedown notices to Google, the group burst through the 200 million URL barrier.

The fact that it took BPI several years to reach its 200 million milestone made the surpassing of the quarter billion milestone a few months later even more remarkable. In October 2016, the group sent its 250 millionth takedown to Google, a figure that nearly doubled when accounting for notices sent to Microsoft’s Bing.

But despite the volumes, the battle hadn’t been won, let alone the war. The BPI’s takedown machine continued to run at a remarkable rate, churning out millions more notices per week.

As a result, yet another new milestone was reached this month when the BPI smashed through the 300 million URL barrier. Then, days later, a further 10 million were added, with the latter couple of million added during the time it took to put this piece together.

BPI takedown notices, as reported by Google

While demanding that Google places greater emphasis on its de-ranking of ‘pirate’ sites, the BPI has called again and again for a “notice and stay down” regime, to ensure that content taken down by the search engine doesn’t simply reappear under a new URL. It’s a position BPI maintains today.

“The battle would be a whole lot easier if intermediaries played fair,” a BPI spokesperson informs TF.

“They need to take more proactive responsibility to reduce infringing content that appears on their platform, and, where we expressly notify infringing content to them, to ensure that they do not only take it down, but also keep it down.”

The long-standing suggestion is that the volume of takedown notices sent would reduce if a “take down, stay down” regime was implemented. The BPI says it’s difficult to present a precise figure but infringing content has a tendency to reappear, both in search engines and on hosting sites.

“Google rejects repeat notices for the same URL. But illegal content reappears as it is re-indexed by Google. As to the sites that actually host the content, the vast majority of notices sent to them could be avoided if they implemented take-down & stay-down,” BPI says.

The fact that the BPI has added 60 million more takedowns since the quarter billion milestone a few months ago is quite remarkable, particularly since there appears to be little slowdown from month to month. However, the numbers have grown so huge that 310 million now feels a lot like 250 million, with just a few added on top for good measure.

That an extra 60 million takedowns can almost be dismissed as a handful is an indication of just how massive the issue is online. While pirates always welcome an abundance of links to juicy content, it’s no surprise that groups like the BPI are seeking more comprehensive and sustainable solutions.

Previously, it was hoped that the Digital Economy Bill would provide some relief, hopefully via government intervention and the imposition of a search engine Code of Practice. In the event, however, all pressure on search engines was removed from the legislation after a separate voluntary agreement was reached.

All parties agreed that the voluntary code should come into effect two weeks ago on June 1 so it seems likely that some effects should be noticeable in the near future. But the BPI says it’s still early days and there’s more work to be done.

“BPI has been working productively with search engines since the voluntary code was agreed to understand how search engines approach the problem, but also what changes can and have been made and how results can be improved,” the group explains.

“The first stage is to benchmark where we are and to assess the impact of the changes search engines have made so far. This will hopefully be completed soon, then we will have better information of the current picture and from that we hope to work together to continue to improve search for rights owners and consumers.”

With more takedown notices in the pipeline not yet publicly reported by Google, the BPI informs TF that it has now notified the search giant of 315 million links to illegal content.

“That’s an astonishing number. More than 1 in 10 of the entire world’s notices to Google come from BPI. This year alone, one in every three notices sent to Google from BPI is for independent record label repertoire,” BPI concludes.

While groups like BPI have clearly developed systems to cope with the huge numbers of takedown notices required in today's environment, few rightsholders are happy with the status quo. With that in mind, the fight will continue until search engines are forced into compromise; considering the implications, that may sit on a very distant horizon.


[$] Making Python faster

Post Syndicated from jake original https://lwn.net/Articles/725114/rss

The Python core developers, and Victor Stinner in particular, have been focusing on improving the performance of Python 3 over the last few years. At PyCon 2017, Stinner gave a talk on some of the optimizations that have been added recently and the effect they have had on various benchmarks. Along the way, he took a detour into some improvements that have been made for benchmarking Python.

[$] A survey of scheduler benchmarks

Post Syndicated from corbet original https://lwn.net/Articles/725238/rss

Many benchmarks have been used by kernel developers over the years to test the performance of the scheduler. But recent kernel commit messages have shown a particular pattern of tools being used (some relatively new), all of which were created specifically for developing scheduler patches. While each benchmark is different, having its own unique genesis story and intended testing scenario, there is a unifying attribute: they were all written to scratch a developer's itch.

Data Compression Improvements in Amazon Redshift Bring Compression Ratios Up to 4x

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/data-compression-improvements-in-amazon-redshift/

Maor Kleider, Senior Product Manager with Amazon Redshift, wrote today’s guest post.

-Ana


Amazon Redshift is a fast, fully managed, petabyte-scale data warehousing service that makes it simple and cost-effective to analyze all of your data. Many of our customers, including Scholastic, King.com, Electronic Arts, TripAdvisor, and Yelp, migrated to Amazon Redshift and achieved agility and faster time to insight, while dramatically reducing costs.

Columnar compression is an important technology in Amazon Redshift. It both helps reduce customer costs by increasing the effective storage capacity of our nodes and improves performance by reducing I/O needed to process SQL requests. Improving I/O efficiency is very important for data warehousing. Last year, our I/O enhancements doubled query throughput. Let’s talk about some of the new compression improvements we’ve recently added to Amazon Redshift.

First, in build 1.0.1172 we added support for the Zstandard compression algorithm, which offers a good balance between a high compression ratio and speed. When applied to raw data in the standard 3 TB TPC-DS benchmark, Zstandard achieves a 65% reduction in disk space. Zstandard is broadly applicable. You can apply it to any of the following data types: SMALLINT, INTEGER, BIGINT, DECIMAL, REAL, DOUBLE PRECISION, BOOLEAN, CHAR, VARCHAR, DATE, TIMESTAMP, and TIMESTAMPTZ.

Second, we’ve improved the automation of compression on tables created by the CREATE TABLE AS, CREATE TABLE or ALTER TABLE ADD COLUMN commands. Starting with Build 1.0.1161, Amazon Redshift automatically chooses a default compression for the columns created by those commands. Automated compression happens when we estimate that we can reduce disk space without degrading query performance. Our customers have seen up to 40% reduction in disk space.

Third, we’ve been optimizing our internal on-disk data structures. Our preview customers averaged a 7% reduction in disk space usage with this improvement. This feature is delivered starting with Build 1.0.1271.

Finally, we have enhanced the ANALYZE COMPRESSION command to estimate disk space reduction. You can now easily identify opportunities to further compress data and improve performance. Behind the scenes, we sample your data and suggest the most effective compression. You can then specify the recommended encodings or your preferred encodings based on your own evaluation.
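As a quick illustration, here is a minimal sketch that runs the command from Python and prints the recommendations; the connection details and the table name (sales) are placeholders, not values from this post.

import psycopg2

# Placeholder connection details; point these at your own cluster
conn = psycopg2.connect(
    host='my-cluster.abc123.eu-west-1.redshift.amazonaws.com',
    port=5439, dbname='mydb', user='admin', password='...')

with conn.cursor() as cur:
    # ANALYZE COMPRESSION samples the table and reports a recommended
    # encoding for each column
    cur.execute('ANALYZE COMPRESSION sales;')
    for row in cur.fetchall():
        print(row)

conn.close()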

“Before all the recent compression features, our largest table was over 7 TB. It’s now only 4.85 TB, which is an additional 30.7% reduction in disk space. This allows us to reduce our disk space by 4X in total and our effective cost to less than $250/TB/Year on an uncompressed data basis. We’re now able to analyze more data with Amazon Redshift, and our query performance has gotten even better.” Chuong Do, Director of Analytics, Coursera

Of course, the actual benefits you see on your clusters will depend upon your workload and your data. In combination, these improvements may reduce your data sets by up to 4x vs. the 3x most of our customers saw before.

You may have heard us talk about how an Amazon Redshift data warehouse can cost as little as $1,000 per terabyte per year. It is important to realize that we’re talking about compressed data in this number. After all, that’s what we store. Not all vendors do this – many compress your data under the covers but describe per-terabyte costs in terms of uncompressed data. That’s unfortunate – the difference between talking in terms of uncompressed data and compressed data can be a significant overstatement.

-Maor Kleider

In Case You Missed These: AWS Security Blog Posts from January, February, and March

Post Syndicated from Craig Liebendorfer original https://aws.amazon.com/blogs/security/in-case-you-missed-these-aws-security-blog-posts-from-january-february-and-march/

Image of lock and key

In case you missed any AWS Security Blog posts published so far in 2017, they are summarized and linked to below. The posts are shown in reverse chronological order (most recent first), and the subject matter ranges from protecting dynamic web applications against DDoS attacks to monitoring AWS account configuration changes and API calls to Amazon EC2 security groups.

March

March 22: How to Help Protect Dynamic Web Applications Against DDoS Attacks by Using Amazon CloudFront and Amazon Route 53
Using a content delivery network (CDN) such as Amazon CloudFront to cache and serve static text and images or downloadable objects such as media files and documents is a common strategy to improve webpage load times, reduce network bandwidth costs, lessen the load on web servers, and mitigate distributed denial of service (DDoS) attacks. AWS WAF is a web application firewall that can be deployed on CloudFront to help protect your application against DDoS attacks by giving you control over which traffic to allow or block by defining security rules. When users access your application, the Domain Name System (DNS) translates human-readable domain names (for example, www.example.com) to machine-readable IP addresses (for example, 192.0.2.44). A DNS service, such as Amazon Route 53, can effectively connect users’ requests to a CloudFront distribution that proxies requests for dynamic content to the infrastructure hosting your application’s endpoints. In this blog post, I show you how to deploy CloudFront with AWS WAF and Route 53 to help protect dynamic web applications (with dynamic content such as a response to user input) against DDoS attacks. The steps shown in this post are key to implementing the overall approach described in AWS Best Practices for DDoS Resiliency and enable the built-in, managed DDoS protection service, AWS Shield.

March 21: New AWS Encryption SDK for Python Simplifies Multiple Master Key Encryption
The AWS Cryptography team is happy to announce a Python implementation of the AWS Encryption SDK. This new SDK helps manage data keys for you, and it simplifies the process of encrypting data under multiple master keys. As a result, this new SDK allows you to focus on the code that drives your business forward. It also provides a framework you can easily extend to ensure that you have a cryptographic library that is configured to match and enforce your standards. The SDK also includes ready-to-use examples. If you are a Java developer, you can refer to this blog post to see specific Java examples for the SDK. In this blog post, I show you how you can use the AWS Encryption SDK to simplify the process of encrypting data and how to protect your encryption keys in ways that help improve application availability by not tying you to a single region or key management solution.

March 21: Updated CJIS Workbook Now Available by Request
The need for guidance when implementing Criminal Justice Information Services (CJIS)–compliant solutions has become of paramount importance as more law enforcement customers and technology partners move to store and process criminal justice data in the cloud. AWS services allow these customers to easily and securely architect a CJIS-compliant solution when handling criminal justice data, creating a durable, cost-effective, and secure IT infrastructure that better supports local, state, and federal law enforcement in carrying out their public safety missions. AWS has created several documents (collectively referred to as the CJIS Workbook) to assist you in aligning with the FBI’s CJIS Security Policy. You can use the workbook as a framework for developing CJIS-compliant architecture in the AWS Cloud. The workbook helps you define and test the controls you operate, and document the dependence on the controls that AWS operates (compute, storage, database, networking, regions, Availability Zones, and edge locations).

March 9: New Cloud Directory API Makes It Easier to Query Data Along Multiple Dimensions
Today, we made available a new Cloud Directory API, ListObjectParentPaths, that enables you to retrieve all available parent paths for any directory object across multiple hierarchies. Use this API when you want to fetch all parent objects for a specific child object. The order of the paths and objects returned is consistent across iterative calls to the API, unless objects are moved or deleted. In case an object has multiple parents, the API allows you to control the number of paths returned by using a paginated call pattern. In this blog post, I use an example directory to demonstrate how this new API enables you to retrieve data across multiple dimensions to implement powerful applications quickly.

March 8: How to Access the AWS Management Console Using AWS Microsoft AD and Your On-Premises Credentials
AWS Directory Service for Microsoft Active Directory, also known as AWS Microsoft AD, is a managed Microsoft Active Directory (AD) hosted in the AWS Cloud. Now, AWS Microsoft AD makes it easy for you to give your users permission to manage AWS resources by using on-premises AD administrative tools. With AWS Microsoft AD, you can grant your on-premises users permissions to resources such as the AWS Management Console instead of adding AWS Identity and Access Management (IAM) user accounts or configuring AD Federation Services (AD FS) with Security Assertion Markup Language (SAML). In this blog post, I show how to use AWS Microsoft AD to enable your on-premises AD users to sign in to the AWS Management Console with their on-premises AD user credentials to access and manage AWS resources through IAM roles.

March 7: How to Protect Your Web Application Against DDoS Attacks by Using Amazon Route 53 and an External Content Delivery Network
Distributed Denial of Service (DDoS) attacks are attempts by a malicious actor to flood a network, system, or application with more traffic, connections, or requests than it is able to handle. To protect your web application against DDoS attacks, you can use AWS Shield, a DDoS protection service that AWS provides automatically to all AWS customers at no additional charge. You can use AWS Shield in conjunction with DDoS-resilient web services such as Amazon CloudFront and Amazon Route 53 to improve your ability to defend against DDoS attacks. Learn more about architecting for DDoS resiliency by reading the AWS Best Practices for DDoS Resiliency whitepaper. You also have the option of using Route 53 with an externally hosted content delivery network (CDN). In this blog post, I show how you can help protect the zone apex (also known as the root domain) of your web application by using Route 53 to perform a secure redirect to prevent discovery of your application origin.

Image of lock and key

February

February 27: Now Generally Available – AWS Organizations: Policy-Based Management for Multiple AWS Accounts
Today, AWS Organizations moves from Preview to General Availability. You can use Organizations to centrally manage multiple AWS accounts, with the ability to create a hierarchy of organizational units (OUs). You can assign each account to an OU, define policies, and then apply those policies to an entire hierarchy, specific OUs, or specific accounts. You can invite existing AWS accounts to join your organization, and you can also create new accounts. All of these functions are available from the AWS Management Console, the AWS Command Line Interface (CLI), and through the AWS Organizations API. To read the full AWS Blog post about today's launch, see AWS Organizations – Policy-Based Management for Multiple AWS Accounts.

February 23: s2n Is Now Handling 100 Percent of SSL Traffic for Amazon S3
Today, we’ve achieved another important milestone for securing customer data: we have replaced OpenSSL with s2n for all internal and external SSL traffic in Amazon Simple Storage Service (Amazon S3) commercial regions. This was implemented with minimal impact to customers, and multiple means of error checking were used to ensure a smooth transition, including client integration tests, catching potential interoperability conflicts, and identifying memory leaks through fuzz testing.

February 22: Easily Replace or Attach an IAM Role to an Existing EC2 Instance by Using the EC2 Console
AWS Identity and Access Management (IAM) roles enable your applications running on Amazon EC2 to use temporary security credentials. IAM roles for EC2 make it easier for your applications to make API requests securely from an instance because they do not require you to manage AWS security credentials that the applications use. Recently, we enabled you to use temporary security credentials for your applications by attaching an IAM role to an existing EC2 instance by using the AWS CLI and SDK. To learn more, see New! Attach an AWS IAM Role to an Existing Amazon EC2 Instance by Using the AWS CLI. Starting today, you can attach an IAM role to an existing EC2 instance from the EC2 console. You can also use the EC2 console to replace an IAM role attached to an existing instance. In this blog post, I will show how to attach an IAM role to an existing EC2 instance from the EC2 console.

February 22: How to Audit Your AWS Resources for Security Compliance by Using Custom AWS Config Rules
AWS Config Rules enables you to implement security policies as code for your organization and evaluate configuration changes to AWS resources against these policies. You can use Config rules to audit your use of AWS resources for compliance with external compliance frameworks such as CIS AWS Foundations Benchmark and with your internal security policies related to the US Health Insurance Portability and Accountability Act (HIPAA), the Federal Risk and Authorization Management Program (FedRAMP), and other regimes. AWS provides some predefined, managed Config rules. You also can create custom Config rules based on criteria you define within an AWS Lambda function. In this post, I show how to create a custom rule that audits AWS resources for security compliance by enabling VPC Flow Logs for an Amazon Virtual Private Cloud (VPC). The custom rule meets requirement 4.3 of the CIS AWS Foundations Benchmark: “Ensure VPC flow logging is enabled in all VPCs.”

February 13: AWS Announces CISPE Membership and Compliance with First-Ever Code of Conduct for Data Protection in the Cloud
I have two exciting announcements today, both showing AWS’s continued commitment to ensuring that customers can comply with EU Data Protection requirements when using our services.

February 13: How to Enable Multi-Factor Authentication for AWS Services by Using AWS Microsoft AD and On-Premises Credentials
You can now enable multi-factor authentication (MFA) for users of AWS services such as Amazon WorkSpaces and Amazon QuickSight and their on-premises credentials by using your AWS Directory Service for Microsoft Active Directory (Enterprise Edition) directory, also known as AWS Microsoft AD. MFA adds an extra layer of protection to a user name and password (the first “factor”) by requiring users to enter an authentication code (the second factor), which has been provided by your virtual or hardware MFA solution. These factors together provide additional security by preventing access to AWS services, unless users supply a valid MFA code.

February 13: How to Create an Organizational Chart with Separate Hierarchies by Using Amazon Cloud Directory
Amazon Cloud Directory enables you to create directories for a variety of use cases, such as organizational charts, course catalogs, and device registries. Cloud Directory offers you the flexibility to create directories with hierarchies that span multiple dimensions. For example, you can create an organizational chart that you can navigate through separate hierarchies for reporting structure, location, and cost center. In this blog post, I show how to use Cloud Directory APIs to create an organizational chart with two separate hierarchies in a single directory. I also show how to navigate the hierarchies and retrieve data. I use the Java SDK for all the sample code in this post, but you can use other language SDKs or the AWS CLI.

February 10: How to Easily Log On to AWS Services by Using Your On-Premises Active Directory
AWS Directory Service for Microsoft Active Directory (Enterprise Edition), also known as Microsoft AD, now enables your users to log on with just their on-premises Active Directory (AD) user name—no domain name is required. This new domainless logon feature makes it easier to set up connections to your on-premises AD for use with applications such as Amazon WorkSpaces and Amazon QuickSight, and it keeps the user logon experience free from network naming. This new interforest trusts capability is now available when using Microsoft AD with Amazon WorkSpaces and Amazon QuickSight Enterprise Edition. In this blog post, I explain how Microsoft AD domainless logon works with AD interforest trusts, and I show an example of setting up Amazon WorkSpaces to use this capability.

February 9: New! Attach an AWS IAM Role to an Existing Amazon EC2 Instance by Using the AWS CLI
AWS Identity and Access Management (IAM) roles enable your applications running on Amazon EC2 to use temporary security credentials that AWS creates, distributes, and rotates automatically. Using temporary credentials is an IAM best practice because you do not need to maintain long-term keys on your instance. Using IAM roles for EC2 also eliminates the need to use long-term AWS access keys that you have to manage manually or programmatically. Starting today, you can enable your applications to use temporary security credentials provided by AWS by attaching an IAM role to an existing EC2 instance. You can also replace the IAM role attached to an existing EC2 instance. In this blog post, I show how you can attach an IAM role to an existing EC2 instance by using the AWS CLI.

February 8: How to Remediate Amazon Inspector Security Findings Automatically
The Amazon Inspector security assessment service can evaluate the operating environments and applications you have deployed on AWS for common and emerging security vulnerabilities automatically. As an AWS-built service, Amazon Inspector is designed to exchange data and interact with other core AWS services not only to identify potential security findings but also to automate addressing those findings. Previous related blog posts showed how you can deliver Amazon Inspector security findings automatically to third-party ticketing systems and automate the installation of the Amazon Inspector agent on new Amazon EC2 instances. In this post, I show how you can automatically remediate findings generated by Amazon Inspector. To get started, you must first run an assessment and publish any security findings to an Amazon Simple Notification Service (SNS) topic. Then, you create an AWS Lambda function that is triggered by those notifications. Finally, the Lambda function examines the findings and then implements the appropriate remediation based on the type of issue.

February 6: How to Simplify Security Assessment Setup Using Amazon EC2 Systems Manager and Amazon Inspector
In a July 2016 AWS Blog post, I discussed how to integrate Amazon Inspector with third-party ticketing systems by using Amazon Simple Notification Service (SNS) and AWS Lambda. This AWS Security Blog post continues in the same vein, describing how to use Amazon Inspector to automate various aspects of security management. In this post, I show you how to install the Amazon Inspector agent automatically through Amazon EC2 Systems Manager when a new Amazon EC2 instance is launched. In a subsequent post, I will show you how to automatically update EC2 instances that run Linux when Amazon Inspector discovers a missing security patch.

Image of lock and key

January

January 30: How to Protect Data at Rest with Amazon EC2 Instance Store Encryption
Encrypting data at rest is vital for regulatory compliance to ensure that sensitive data saved on disks is not readable by any user or application without a valid key. Some compliance regulations such as PCI DSS and HIPAA require that data at rest be encrypted throughout the data lifecycle. To this end, AWS provides data-at-rest options and key management to support the encryption process. For example, you can encrypt Amazon EBS volumes and configure Amazon S3 buckets for server-side encryption (SSE) using AES-256 encryption. Additionally, Amazon RDS supports Transparent Data Encryption (TDE). Instance storage provides temporary block-level storage for Amazon EC2 instances. This storage is located on disks attached physically to a host computer. Instance storage is ideal for temporary storage of information that frequently changes, such as buffers, caches, and scratch data. By default, files stored on these disks are not encrypted. In this blog post, I show a method for encrypting data on Linux EC2 instance stores by using Linux built-in libraries. This method encrypts files transparently, which protects confidential data. As a result, applications that process the data are unaware of the disk-level encryption.

January 27: How to Detect and Automatically Remediate Unintended Permissions in Amazon S3 Object ACLs with CloudWatch Events
Amazon S3 Access Control Lists (ACLs) enable you to specify permissions that grant access to S3 buckets and objects. When S3 receives a request for an object, it verifies whether the requester has the necessary access permissions in the associated ACL. For example, you could set up an ACL for an object so that only the users in your account can access it, or you could make an object public so that it can be accessed by anyone. If the number of objects and users in your AWS account is large, ensuring that you have attached correctly configured ACLs to your objects can be a challenge. For example, what if a user were to call the PutObjectAcl API call on an object that is supposed to be private and make it public? Or, what if a user were to call the PutObject with the optional Acl parameter set to public-read, therefore uploading a confidential file as publicly readable? In this blog post, I show a solution that uses Amazon CloudWatch Events to detect PutObject and PutObjectAcl API calls in near-real time and helps ensure that the objects remain private by making automatic PutObjectAcl calls, when necessary.

January 26: Now Available: Amazon Cloud Directory—A Cloud-Native Directory for Hierarchical Data
Today we are launching Amazon Cloud Directory. This service is purpose-built for storing large amounts of strongly typed hierarchical data. With the ability to scale to hundreds of millions of objects while remaining cost-effective, Cloud Directory is a great fit for all sorts of cloud and mobile applications.

January 24: New SOC 2 Report Available: Confidentiality
As with everything at Amazon, the success of our security and compliance program is primarily measured by one thing: our customers’ success. Our customers drive our portfolio of compliance reports, attestations, and certifications that support their efforts in running a secure and compliant cloud environment. As a result of our engagement with key customers across the globe, we are happy to announce the publication of our new SOC 2 Confidentiality report. This report is available now through AWS Artifact in the AWS Management Console.

January 18: Compliance in the Cloud for New Financial Services Cybersecurity Regulations
Financial regulatory agencies are focused more than ever on ensuring responsible innovation. Consequently, if you want to achieve compliance with financial services regulations, you must be increasingly agile and employ dynamic security capabilities. AWS enables you to achieve this by providing you with the tools you need to scale your security and compliance capabilities on AWS. The following breakdown of the most recent cybersecurity regulations, NY DFS Rule 23 NYCRR 500, demonstrates how AWS continues to focus on your regulatory needs in the financial services sector.

January 9: New Amazon GameDev Blog Post: Protect Multiplayer Game Servers from DDoS Attacks by Using Amazon GameLift
In online gaming, distributed denial of service (DDoS) attacks target a game’s network layer, flooding servers with requests until performance degrades considerably. These attacks can limit a game’s availability to players and limit the player experience for those who can connect. Today’s new Amazon GameDev Blog post uses a typical game server architecture to highlight DDoS attack vulnerabilities and discusses how to stay protected by using built-in AWS Cloud security, AWS security best practices, and the security features of Amazon GameLift. Read the post to learn more.

January 6: The Top 10 Most Downloaded AWS Security and Compliance Documents in 2016
The following list includes the 10 most downloaded AWS security and compliance documents in 2016. Using this list, you can learn about what other people found most interesting about security and compliance last year.

January 6: FedRAMP Compliance Update: AWS GovCloud (US) Region Receives a JAB-Issued FedRAMP High Baseline P-ATO for Three New Services
Three new services in the AWS GovCloud (US) region have received a Provisional Authority to Operate (P-ATO) from the Joint Authorization Board (JAB) under the Federal Risk and Authorization Management Program (FedRAMP). JAB issued the authorization at the High baseline, which gives US government agencies and their service providers the capability to use these services to process the government's most sensitive unclassified data, including Personal Identifiable Information (PII), Protected Health Information (PHI), Controlled Unclassified Information (CUI), criminal justice information (CJI), and financial data.

January 4: The Top 20 Most Viewed AWS IAM Documentation Pages in 2016
The following 20 pages were the most viewed AWS Identity and Access Management (IAM) documentation pages in 2016. I have included a brief description with each link to give you a clearer idea of what each page covers. Use this list to see what other people have been viewing and perhaps to pique your own interest about a topic you’ve been meaning to research.

January 3: The Most Viewed AWS Security Blog Posts in 2016
The following 10 posts were the most viewed AWS Security Blog posts that we published during 2016. You can use this list as a guide to catch up on your blog reading or even read a post again that you found particularly useful.

January 3: How to Monitor AWS Account Configuration Changes and API Calls to Amazon EC2 Security Groups
You can use AWS security controls to detect and mitigate risks to your AWS resources. The purpose of each security control is defined by its control objective. For example, the control objective of an Amazon VPC security group is to permit only designated traffic to enter or leave a network interface. Let’s say you have an Internet-facing e-commerce website, and your security administrator has determined that only HTTP (TCP port 80) and HTTPS (TCP 443) traffic should be allowed access to the public subnet. As a result, your administrator configures a security group to meet this control objective. What if, though, someone were to inadvertently change this security group’s rules and enable FTP or other protocols to access the public subnet from any location on the Internet? That expanded access could weaken the security posture of your assets. Consequently, your administrator might need to monitor the integrity of your company’s security controls so that the controls maintain their desired effectiveness. In this blog post, I explore two methods for detecting unintended changes to VPC security groups. The two methods address not only control objectives but also control failures.

If you have questions about or issues with implementing the solutions in any of these posts, please start a new thread on the forum identified near the end of each post.

– Craig

How to Audit Your AWS Resources for Security Compliance by Using Custom AWS Config Rules

Post Syndicated from Myles Hosford original https://aws.amazon.com/blogs/security/how-to-audit-your-aws-resources-for-security-compliance-by-using-custom-aws-config-rules/

AWS Config Rules enables you to implement security policies as code for your organization and evaluate configuration changes to AWS resources against these policies. You can use Config rules to audit your use of AWS resources for compliance with external compliance frameworks such as CIS AWS Foundations Benchmark and with your internal security policies related to the US Health Insurance Portability and Accountability Act (HIPAA), the Federal Risk and Authorization Management Program (FedRAMP), and other regimes.

AWS provides a number of predefined, managed Config rules. You also can create custom Config rules based on criteria you define within an AWS Lambda function. In this post, I will show how to create a custom rule that audits AWS resources for security compliance by enabling VPC Flow Logs for an Amazon Virtual Private Cloud (VPC). The custom rule meets requirement 4.3 of the CIS AWS Foundations Benchmark: “Ensure VPC flow logging is enabled in all VPCs.”

Solution overview

In this post, I walk through the process required to create a custom Config rule by following these steps:

  1. Create a Lambda function containing the logic to determine if a resource is compliant or noncompliant.
  2. Create a custom Config rule that uses the Lambda function created in Step 1 as the source.
  3. Create a Lambda function that polls Config to detect noncompliant resources on a daily basis and sends notifications via Amazon SNS.

Prerequisite

You must set up Config before you start creating custom rules. Follow the steps on Set Up AWS Config Using the Console or Set Up AWS Config Using the AWS CLI to enable Config and send the configuration changes to Amazon S3 for storage.

Custom rule – Blueprint

The first step is to create a Lambda function that contains the logic to determine whether the Amazon VPC has VPC Flow Logs enabled (in other words, whether it is compliant or noncompliant with requirement 4.3 of the CIS AWS Foundations Benchmark). First, let's take a look at the components that make up a custom rule, which I will call the blueprint.

#
# Custom AWS Config Rule - Blueprint Code
#

import boto3, json

def evaluate_compliance(config_item, r_id):
    # Placeholder logic; the real check for VPC Flow Logs is added later in this post
    return 'NON_COMPLIANT'

def lambda_handler(event, context):
    
    # Create AWS SDK clients & initialize custom rule parameters
    config = boto3.client('config')
    invoking_event = json.loads(event['invokingEvent'])
    compliance_value = 'NOT_APPLICABLE'
    resource_id = invoking_event['configurationItem']['resourceId']
                    
    compliance_value = evaluate_compliance(invoking_event['configurationItem'], resource_id)
              
    response = config.put_evaluations(
       Evaluations=[
            {
                'ComplianceResourceType': invoking_event['configurationItem']['resourceType'],
                'ComplianceResourceId': resource_id,
                'ComplianceType': compliance_value,
                'Annotation': 'Insert text here to detail why control passed/failed',
                'OrderingTimestamp': invoking_event['notificationCreationTime']
            },
       ],
       ResultToken=event['resultToken'])

The key components in the preceding blueprint are:

  1. The lambda_handler function is executed when AWS Lambda invokes the function in response to a Config evaluation event. I create the necessary SDK clients and set up some initial variables for the rule to use.
  2. The evaluate_compliance function contains my custom rule logic. This is the function that I will tailor later in the post to create the custom rule to detect whether the Amazon VPC has VPC Flow Logs enabled. The result (compliant or noncompliant) is assigned to the compliance_value.
  3. The Config API’s put_evaluations function is called to deliver an evaluation result to Config. You can then view the result of the evaluation in the Config console (more about that later in this post). The annotation parameter is used to provide supplementary information about how the custom evaluation determined the compliance.

Custom rule – Flow logs enabled

The example we use for the custom rule is requirement 4.3 from the CIS AWS Foundations Benchmark: “Ensure VPC flow logging is enabled in all VPCs.” I update the blueprint rule that I just showed to do the following:

  1. Create an AWS Identity and Access Management (IAM) role that allows the Lambda function to perform the custom rule logic and publish the result to Config. The Lambda function will assume this role.
  2. Specify the resource type of the configuration item as EC2 VPC. This ensures that the rule is triggered when there is a change to any Amazon VPC resources.
  3. Add custom rule logic to the Lambda function to determine whether VPC Flow Logs are enabled for a given VPC.

Create an IAM role for Lambda

To create the IAM role, I go to the IAM console, choose Roles in the navigation pane, click Create New Role, and follow the wizard. In Step 2, I select the service role AWS Lambda, as shown in the following screenshot.

In Step 4 of the wizard, I attach the following managed policies:

  • AmazonEC2ReadOnlyAccess
  • AWSLambdaExecute
  • AWSConfigRulesExecutionRole

Finally, I name the new IAM role vpcflowlogs-role. This allows the Lambda function to call APIs such as EC2 DescribeFlowLogs to obtain the result for my compliance check. I assign this role to the Lambda function in the next step.
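If you prefer to script this step, here is a minimal sketch of the same role creation using boto3. The trust policy and the AWS-managed policy ARNs shown are assumptions based on the standard policy names above; verify them in your account before use.

import boto3, json

iam = boto3.client('iam')

# Allow AWS Lambda to assume the role
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "lambda.amazonaws.com"},
        "Action": "sts:AssumeRole"
    }]
}

iam.create_role(
    RoleName='vpcflowlogs-role',
    AssumeRolePolicyDocument=json.dumps(trust_policy)
)

# Attach the three managed policies listed above (ARNs assumed)
for policy_arn in [
        'arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess',
        'arn:aws:iam::aws:policy/AWSLambdaExecute',
        'arn:aws:iam::aws:policy/service-role/AWSConfigRulesExecutionRole']:
    iam.attach_role_policy(RoleName='vpcflowlogs-role', PolicyArn=policy_arn)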

Create the Lambda function for the custom rule

To create the Lambda function that contains logic for my custom rule, I go to the Lambda console, click Create a Lambda Function, and then choose Blank Function.

When I configure the function, I name it vpcflowlogs-function and provide a brief description of the rule: “A custom rule to detect whether VPC Flow Logs is enabled.”

For the Lambda function code, I use the blueprint code shown earlier in this post and add the additional logic to determine whether VPC Flow Logs is enabled (specifically within the evaluate_compliance and is_flow_logs_enabled functions).

#
# Custom AWS Config Rule - VPC Flow Logs
#

import boto3, json

def evaluate_compliance(config_item, r_id):
    if (config_item['resourceType'] != 'AWS::EC2::VPC'):
        return 'NOT_APPLICABLE'

    elif is_flow_logs_enabled(r_id):
        return 'COMPLIANT'
    else:
        return 'NON_COMPLIANT'

def is_flow_logs_enabled(vpc_id):
    ec2 = boto3.client('ec2')
    # Look up any flow logs attached to this VPC
    response = ec2.describe_flow_logs(
        Filter=[
            {
                'Name': 'resource-id',
                'Values': [
                    vpc_id,
                ]
            },
        ],
    )
    if len(response[u'FlowLogs']) != 0: return True
    return False

def lambda_handler(event, context):
    
    # Create AWS SDK clients & initialize custom rule parameters
    config = boto3.client('config')
    invoking_event = json.loads(event['invokingEvent'])
    compliance_value = 'NOT_APPLICABLE'
    resource_id = invoking_event['configurationItem']['resourceId']
                    
    compliance_value = evaluate_compliance(invoking_event['configurationItem'], resource_id)
            
    response = config.put_evaluations(
       Evaluations=[
            {
                'ComplianceResourceType': invoking_event['configurationItem']['resourceType'],
                'ComplianceResourceId': resource_id,
                'ComplianceType': compliance_value,
                'Annotation': 'CIS 4.3 VPC Flow Logs',
                'OrderingTimestamp': invoking_event['notificationCreationTime']
            },
       ],
       ResultToken=event['resultToken'])

Below the Lambda function code, I configure the handler and role. As shown in the following screenshot, I select the IAM role I just created (vpcflowlogs-role) and create my Lambda function.

When the Lambda function is created, I make a note of the Lambda Amazon Resource Name (ARN), which is the unique identifier used in the next step to specify this function as my Config rule source. (Be sure to replace the placeholder value with your own value.)

Example ARN: arn:aws:lambda:ap-southeast-1:<your-account-id>:function:vpcflowlogs-function

Create a custom Config rule

The last step is to create a custom Config rule and use the Lambda function as the source. To do this, I go to the Config console, choose Add Rule, and choose Add Custom Rule. I give the rule a name, vpcflowlogs-configrule, and description, and I paste the Lambda ARN from the previous section.

Because this rule is specific to VPC resources, I set the Trigger type to Configuration changes and Resources to EC2: VPC, as shown in the following screenshot.

I click Save to create the rule, and it is now live. Any VPC resources that are created or modified will now be checked against my VPC Flow Logs rule for compliance with the CIS Benchmark requirement.
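If you prefer to create the rule programmatically, here is a minimal sketch of the same step using boto3. The function name and ARN are the ones used earlier in this post (replace the account ID placeholder); note that Config must be granted permission to invoke the Lambda function.

import boto3

lambda_client = boto3.client('lambda')
config = boto3.client('config')

lambda_arn = 'arn:aws:lambda:ap-southeast-1:<your-account-id>:function:vpcflowlogs-function'

# Allow AWS Config to invoke the custom rule's Lambda function
lambda_client.add_permission(
    FunctionName='vpcflowlogs-function',
    StatementId='AllowConfigInvocation',
    Action='lambda:InvokeFunction',
    Principal='config.amazonaws.com'
)

# Create the custom rule, triggered by configuration changes to VPC resources
config.put_config_rule(
    ConfigRule={
        'ConfigRuleName': 'vpcflowlogs-configrule',
        'Description': 'A custom rule to detect whether VPC Flow Logs is enabled',
        'Scope': {'ComplianceResourceTypes': ['AWS::EC2::VPC']},
        'Source': {
            'Owner': 'CUSTOM_LAMBDA',
            'SourceIdentifier': lambda_arn,
            'SourceDetails': [{
                'EventSource': 'aws.config',
                'MessageType': 'ConfigurationItemChangeNotification'
            }]
        }
    }
)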

From the Config console, I can now see if any resources do not comply with the control requirement, as shown in the following screenshot.

When I choose the rule, I see additional detail about the noncompliant resources (see the following screenshot). This allows me to view the Config timeline to determine when the resources became noncompliant, identify the resources' owners (if the resources follow tagging best practices), and initiate a remediation effort.

Screenshot of the results of resources evaluated

Daily compliance assessment

Having created the custom rule, I now create a Lambda function to poll Config periodically to detect noncompliant resources. My Lambda function will run daily to assess for noncompliance with my custom rule. When noncompliant resources are detected, I send a notification by publishing a message to SNS.

Before creating the Lambda function, I create an SNS topic and subscribe to the topic for the email addresses that I want to receive noncompliance notifications. My SNS topic is called config-rules-compliance.
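The topic and subscription can also be created with a couple of boto3 calls; a minimal sketch follows (the email address is a placeholder, and each subscriber must confirm the subscription from the confirmation email).

import boto3

sns = boto3.client('sns')

# Create the topic and subscribe an email address to it
topic_arn = sns.create_topic(Name='config-rules-compliance')['TopicArn']
sns.subscribe(TopicArn=topic_arn, Protocol='email', Endpoint='you@example.com')

print(topic_arn)  # used as SNS_TOPIC in the Lambda function below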

Note: The Lambda function will require permission to query Config and publish a message to SNS. For the purpose of this blog post, I created the following policy that allows publishing of messages to my SNS topic (config-rules-compliance), and I attached it to the vpcflowlogs-role role that my custom Config rule uses.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1485832788000",
            "Effect": "Allow",
            "Action": [
                "sns:Publish"
            ],
            "Resource": [
                "arn:aws:sns:ap-southeast-1:111111111111:config-rules-compliance"
            ]
        }
    ]
}

To create the Lambda function that performs the periodic compliance assessment, I go to the Lambda console, choose Create a Lambda Function and then choose Blank Function.

When configuring the Lambda trigger, I select CloudWatch Events – Schedule, which allows the function to be executed periodically on a schedule I define. I then select rate(1 day) to get daily compliance assessments. For more information about scheduling events with Amazon CloudWatch, see Schedule Expressions for Rules.

Screenshot of the scheduled Lambda trigger configuration
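The same schedule can be wired up programmatically; here is a minimal sketch using boto3. The rule name and the Lambda function name are assumptions (the post creates the function through the console), so replace them with your own.

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Assumed function name for the daily assessment Lambda function
function_arn = lambda_client.get_function(
    FunctionName='daily-compliance-check')['Configuration']['FunctionArn']

# Create a rule that fires once a day
rule_arn = events.put_rule(
    Name='daily-compliance-schedule',
    ScheduleExpression='rate(1 day)'
)['RuleArn']

# Point the rule at the Lambda function
events.put_targets(
    Rule='daily-compliance-schedule',
    Targets=[{'Id': 'daily-compliance-check', 'Arn': function_arn}]
)

# Allow CloudWatch Events to invoke the function
lambda_client.add_permission(
    FunctionName='daily-compliance-check',
    StatementId='AllowDailySchedule',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule_arn
)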

My Lambda function (see the following code) uses the vpcflowlogs-role IAM role that allows publishing of messages to my SNS topic.

'''
Lambda function to poll Config for noncompliant resources
'''

from __future__ import print_function

import boto3

# AWS Config settings
CONFIG_CLIENT = boto3.client('config')
MY_RULE = "vpcflowlogs-configrule"

# AWS SNS Settings
SNS_CLIENT = boto3.client('sns')
SNS_TOPIC = 'arn:aws:sns:ap-southeast-1:111111111111:config-rules-compliance'
SNS_SUBJECT = 'Compliance Update'

def lambda_handler(event, context):
    # Get compliance details for the custom rule
    non_compliant_detail = CONFIG_CLIENT.get_compliance_details_by_config_rule(
        ConfigRuleName=MY_RULE, ComplianceTypes=['NON_COMPLIANT'])

    if len(non_compliant_detail['EvaluationResults']) > 0:
        print('The following resource(s) are not compliant with AWS Config rule: ' + MY_RULE)

        # Build a newline-separated list of noncompliant resource IDs
        non_compliant_resources = ''
        for result in non_compliant_detail['EvaluationResults']:
            resource_id = result['EvaluationResultIdentifier']['EvaluationResultQualifier']['ResourceId']
            print(resource_id)
            non_compliant_resources = non_compliant_resources + resource_id + '\n'

        sns_message = 'AWS Config Compliance Update\n\n Rule: ' \
            + MY_RULE + '\n\n' \
            + 'The following resource(s) are not compliant:\n' \
            + non_compliant_resources

        SNS_CLIENT.publish(TopicArn=SNS_TOPIC, Message=sns_message, Subject=SNS_SUBJECT)

    else:
        print('No noncompliant resources detected.')

My Lambda function performs two key activities. First, it queries the Config API to determine which resources are noncompliant with my custom rule. This is done by executing the get_compliance_details_by_config_rule API call.

non_compliant_detail = CONFIG_CLIENT.get_compliance_details_by_config_rule(ConfigRuleName=MY_RULE, ComplianceTypes=['NON_COMPLIANT'])

Second, my Lambda function publishes a message to my SNS topic to notify me that resources are noncompliant, if they failed my custom rule evaluation. This is done using the SNS publish API call.

SNS_CLIENT.publish(TopicArn=SNS_TOPIC, Message=sns_message, Subject=SNS_SUBJECT)

This function provides an example of how to integrate Config and the results of the Config rules compliance evaluation into your operations and processes. You can extend this solution by integrating the results directly with your internal governance, risk, and compliance tools and IT service management frameworks.

Summary

In this post, I showed how to create a custom AWS Config rule to detect noncompliance with security and compliance policies. I also showed how you can create a Lambda function to check daily for noncompliance by polling Config via API calls. Using custom rules allows you to codify your internal or external security and compliance requirements and have a more effective view of your organization's risks at a given time.

For more information about Config rules and examples of rules created for the CIS Benchmark, go to the aws-security-benchmark GitHub repository. If you have questions about the solution in this post, start a new thread on the AWS Config forum.

– Myles

Note: The content and opinions in this blog post are those of the author. This blog post is intended for informational purposes and not for the purpose of providing legal advice.

2017-02-22 FizzBuzz 2

Post Syndicated from Vasil Kolev original https://vasil.ludost.net/blog/?p=3343

Since the idea has been rattling around in my head for a month or two, and the final optimization came to me last night, here is the follow-up to the fizzbuzz post:

/* Requires GCC/Clang: uses the computed-goto ("labels as values") extension. */
#include <string.h>
#include <unistd.h>

int main(void)
{
	int i=0,p;
	static void *pos[4]= {&&digit, &&fizz, &&buzz, &&fizzbuzz};
	static void *loop[2] = { &&loopst, &&loopend};
	int s3[3]={1,0,0},s5[5]={2,0,0,0,0};
	char buff[2048];
	char dgts[16]={'0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f'};
	int buffpos=0;

loopst:
	i++;
	p= s3[i%3] | s5[i%5];
	goto *pos[p];

fizz:
	memcpy(&buff[buffpos],"Fizz", 4);
	buffpos+=4;
	goto end;
buzz:
	memcpy(&buff[buffpos],"Buzz", 4);
	buffpos+=4;
	goto end;
fizzbuzz:
	memcpy(&buff[buffpos],"FizzBuzz", 8);
	buffpos+=8;
	goto end;
digit:
	/* the number is emitted as two hex digits */
	buff[buffpos++]=dgts[i/16];
	buff[buffpos++]=dgts[i%16];
end:
	buff[buffpos++]='\n';
	goto *loop[i/100];
loopend:
	write(1, buff, buffpos);
	return 0;
}

For a while I wondered how the whole thing could be done without any branch at all, i.e. without even the check for the end of the loop. My initial idea was to do it in assembly and use a NOP sled, like in exploits, something along these lines (forgive the sloppy assembly):

	JMP loopst
	JMP loopend
loopst:
	NOP
	NOP
...
	NOP
	; fizzbuzz implementation
	; i is in RAX
...
	MOV RBX, 0
	SUB RBX, RAX
	SUB RBX, $LENGTH
	SUB EIP, RBX
loopend:

In short, the larger i gets, the further back I jump with the relative JMP (which I've written as subtracting something from EIP, which most likely isn't valid at all), until I hit the JMP that throws me out of the loop. As an optimization I figured I could shift the value by 4, so the sled would only need 25 entries.

At some point it occurred to me that I could do without the sled altogether by doing a division (which is a horrible operation, but saves a bucket of NOPs). That's how the C variant above came about, which isn't quite C, just some strange assembly-like goo.
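
Purely as an illustration (not part of the original post), the dispatch idea, where p = s3[i%3] | s5[i%5] picks one of four handlers from a table, looks like this in a high-level Python sketch; the Python loop itself of course still branches, the point is only the table lookup:

s3 = [1, 0, 0]           # bit 0 set when i % 3 == 0
s5 = [2, 0, 0, 0, 0]     # bit 1 set when i % 5 == 0

handlers = [
    lambda i: format(i, 'x'),   # 0: neither, print the number (hex, as in the C code)
    lambda i: "Fizz",           # 1: divisible by 3
    lambda i: "Buzz",           # 2: divisible by 5
    lambda i: "FizzBuzz",       # 3: divisible by both
]

lines = []
for i in range(1, 101):
    p = s3[i % 3] | s5[i % 5]
    lines.append(handlers[p](i))
print("\n".join(lines))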

Otherwise, it's important to note that on any modern processor the code above is far less efficient than the simple solution with ifs, mostly because branch prediction and all the other goodies handle ordinary ifs very well, while it's much harder to figure out where these value-based table jumps will land, which is what speculative execution needs. I haven't bothered to benchmark it (although I'd like to), but in general the code above only stands a chance of doing better on things like the 8086 and friends.

And as an idea for the next such misery: it could perhaps be genuinely optimized by using one of the vector/wide-value extensions and unrolling the loop, for example processing it in steps of 4 with some instruction that computes divisors (who knows what strange things have already been crammed into the x86 instruction set).

Go 1.8 released

Post Syndicated from ris original https://lwn.net/Articles/714783/rss

The Go team has announced the release of Go 1.8. “The compiler back end introduced in Go 1.7 for 64-bit x86 is now used on all architectures, and those architectures should see significant performance improvements. For instance, the CPU time required by our benchmark programs was reduced by 20-30% on 32-bit ARM systems. There are also some modest performance improvements in this release for 64-bit x86 systems. The compiler and linker have been made faster. Compile times should be improved by about 15% over Go 1.7. There is still more work to be done in this area: expect faster compilation speeds in future releases.” See the release notes for more details.

Open Sourcing TensorFlowOnSpark: Distributed Deep Learning on Big-Data Clusters

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/157196488076


By Lee Yang, Jun Shi, Bobbie Chern, and Andy Feng (@afeng76), Yahoo Big ML team

Introduction

Today, we are pleased to offer TensorFlowOnSpark to the community, our latest open source framework for distributed deep learning on big-data clusters.

Deep learning (DL) has evolved significantly in recent years. At Yahoo, we’ve found that in order to gain insight from massive amounts of data, we need to deploy distributed deep learning. Existing DL frameworks often require us to set up separate clusters for deep learning, forcing us to create multiple programs for a machine learning pipeline (see Figure 1 below). Having separate clusters requires us to transfer large datasets between them, introducing unwanted system complexity and end-to-end learning latency.

[Figure 1]

Last year we addressed scaleout issues by developing and publishing CaffeOnSpark, our open source framework that allows distributed deep learning and big-data processing on identical Spark and Hadoop clusters. We use CaffeOnSpark at Yahoo to improve our NSFW image detection, to automatically identify eSports game highlights from live-streamed videos, and more. With the community’s valuable feedback and contributions, CaffeOnSpark has been upgraded with LSTM support, a new data layer, training and test interleaving, a Python API, and deployment on Docker containers. This has been great for our Caffe users, but what about those who use the deep learning framework TensorFlow? We’re taking a page from our own playbook and doing for TensorFlow what we did for Caffe.

After TensorFlow’s initial publication, Google released an enhanced TensorFlow with distributed deep learning capabilities in April 2016. In October 2016, TensorFlow introduced HDFS support. Outside of the Google cloud, however, users still needed a dedicated cluster for TensorFlow applications. TensorFlow programs could not be deployed on existing big-data clusters, thus increasing the cost and latency for those who wanted to take advantage of this technology at scale.

To address this limitation, several community projects wired TensorFlow onto Spark clusters. SparkNet added the ability to launch TensorFlow networks in Spark executors. DataBricks proposed TensorFrame to manipulate Apache Spark’s DataFrames with TensorFlow programs. While these approaches are a step in the right direction, after examining their code, we learned we would be unable to get the TensorFlow processes to communicate with each other directly, we would not be able to implement asynchronous distributed learning, and we would have to expend significant effort to migrate existing TensorFlow programs.

TensorFlowOnSpark

[Figure 2]

Our new framework, TensorFlowOnSpark (TFoS), enables distributed TensorFlow execution on Spark and Hadoop clusters. As illustrated in Figure 2 above, TensorFlowOnSpark is designed to work along with SparkSQL, MLlib, and other Spark libraries in a single pipeline or program (e.g. a Python notebook).

TensorFlowOnSpark supports all types of TensorFlow programs, enabling both asynchronous and synchronous training and inferencing. It supports model parallelism and data parallelism, as well as TensorFlow tools such as TensorBoard on Spark clusters.

Any TensorFlow program can be easily modified to work with TensorFlowOnSpark. Typically, fewer than 10 lines of Python code need to change. Many developers at Yahoo who use TensorFlow have easily migrated their TensorFlow programs for execution with TensorFlowOnSpark.

TensorFlowOnSpark supports direct tensor communication among TensorFlow processes (workers and parameter servers). Process-to-process direct communication enables TensorFlowOnSpark programs to scale easily by adding machines. As illustrated in Figure 3, TensorFlowOnSpark doesn’t involve Spark drivers in tensor communication, and thus achieves similar scalability as stand-alone TensorFlow clusters.

[Figure 3]

TensorFlowOnSpark provides two different modes to ingest data for training and inference:

  1. TensorFlow QueueRunners: TensorFlowOnSpark leverages TensorFlow’s file readers and QueueRunners to read data directly from HDFS files. Spark is not involved in accessing data.
  2. Spark Feeding: Spark RDD data is fed to each Spark executor, which subsequently feeds the data into the TensorFlow graph via feed_dict.

Figure 4 illustrates how the synchronous distributed training of the Inception image classification network scales in TFoS using QueueRunners with a simple setting: 1 GPU, 1 reader, and batch size 32 for each worker. Four TFoS jobs were launched to train 100,000 steps. When these jobs completed after 2+ days, the top-5 accuracies of these jobs were 0.730, 0.814, 0.854, and 0.879. Reaching a top-5 accuracy of 0.730 takes 46 hours for a 1-worker job, 22.5 hours for a 2-worker job, 13 hours for a 4-worker job, and 7.5 hours for an 8-worker job. TFoS thus achieves near linear scalability for Inception model training. This is very encouraging, although TFoS scalability will vary for different models and hyperparameters.

[Figure 4]

RDMA for Distributed TensorFlow

In Yahoo’s Hadoop clusters, GPU nodes are connected by both Ethernet and Infiniband. Infiniband provides faster connectivity and supports direct access to other servers’ memories over RDMA. Current TensorFlow releases, however, only support distributed learning using gRPC over Ethernet. To speed up distributed learning, we have enhanced the TensorFlow C++ layer to enable RDMA over Infiniband.

In conjunction with our TFoS release, we are introducing a new protocol for TensorFlow servers in addition to the default "grpc" protocol. Any distributed TensorFlow program can leverage our enhancement by specifying protocol="grpc_rdma" in tf.train.ServerDef() or tf.train.Server().
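
For example, a worker task might be started like this (the cluster addresses below are placeholders, and the "grpc_rdma" protocol is only available in Yahoo's enhanced TensorFlow build):

import tensorflow as tf

# Placeholder cluster definition; only the protocol argument matters here.
cluster = tf.train.ClusterSpec({
    "ps":     ["ps0.example.com:2222"],
    "worker": ["worker0.example.com:2222", "worker1.example.com:2222"],
})

# Select the RDMA transport instead of the default "grpc" protocol.
server = tf.train.Server(cluster,
                         job_name="worker",
                         task_index=0,
                         protocol="grpc_rdma")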

With this new protocol, an RDMA rendezvous manager is created to ensure tensors are written directly into the memory of remote servers. We minimize tensor buffer creation: tensor buffers are allocated once at the beginning, and then reused across all training steps of a TensorFlow job. From our early experimentation with large models like the VGG-19 network, our RDMA implementation has demonstrated a significant speedup on training time compared with the existing gRPC implementation.

Since RDMA support is a highly requested capability (see TensorFlow issue #2916), we decided to make our current implementation available as an alpha release to the TensorFlow community. In the coming weeks, we will polish our RDMA implementation further, and share detailed benchmark results.

Simple CLI and API

TFoS programs are launched by the standard Apache Spark command, spark-submit. As illustrated below, users can specify the number of Spark executors, the number of GPUs per executor, and the number of parameter servers in the CLI. A user can also state whether they want to use TensorBoard (--tensorboard) and/or RDMA (--rdma).

      spark-submit --master ${MASTER} \
      ${TFoS_HOME}/examples/slim/train_image_classifier.py \
      --model_name inception_v3 \
      --train_dir hdfs://default/slim_train \
      --dataset_dir hdfs://default/data/imagenet \
      --dataset_name imagenet \
      --dataset_split_name train \
      --cluster_size ${NUM_EXEC} \
      --num_gpus ${NUM_GPU} \
      --num_ps_tasks ${NUM_PS} \
      --sync_replicas \
      --replicas_to_aggregate ${NUM_WORKERS} \
      --tensorboard \
      --rdma

TFoS provides a high-level Python API (illustrated in our sample Python notebook, and sketched after the list below):

  • TFCluster.reserve() … construct a TensorFlow cluster from Spark executors
  • TFCluster.start() … launch TensorFlow program on the executors
  • TFCluster.train() or TFCluster.inference() … feed RDD data to TensorFlow processes
  • TFCluster.shutdown() … shut down TensorFlow execution on executors
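
Pulling those calls together, a minimal sketch might look like the following; the argument names and the map_fun signature here are assumptions for illustration, not the exact TFoS API, so check the sample notebook for the real signatures:

from pyspark import SparkContext
from tensorflowonspark import TFCluster   # module name assumed from the TFoS repo

sc = SparkContext(appName="tfos-sketch")

def map_fun(args, ctx):
    # Runs on each Spark executor as a TensorFlow worker or parameter server.
    # The actual TensorFlow graph construction and training loop would go here.
    pass

num_executors = 4   # must match the executors requested from Spark
num_ps = 1          # number of parameter servers

cluster = TFCluster.reserve(sc, num_executors, num_ps, tensorboard=False)
cluster.start(map_fun, args=None)
cluster.train(images_rdd)   # or cluster.inference(images_rdd); images_rdd assumed to exist
cluster.shutdown()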

Open Source

Yahoo is happy to release TensorFlowOnSpark at github.com/yahoo/TensorFlowOnSpark and an RDMA enhancement of TensorFlow at github.com/yahoo/tensorflow/tree/yahoo. Multiple example programs (including mnist, cifar10, inception, and VGG) are provided to illustrate the simple process of converting TensorFlow programs to TensorFlowOnSpark and leveraging RDMA. An Amazon Machine Image is also available for running TensorFlowOnSpark on AWS EC2.

Going forward, we will advance TensorFlowOnSpark as we continue to do with CaffeOnSpark. We welcome the community’s continued feedback and contributions to CaffeOnSpark, and are interested in thoughts on ways TensorFlowOnSpark can be enhanced.

Backblaze Hard Drive Stats for 2016

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/hard-drive-benchmark-stats-2016/

Backblaze drive stats
Backblaze has recorded and saved daily hard drive statistics from the drives in our data centers since April 2013. At the end of 2016 we had 73,653 spinning hard drives. Of that number, there were 1,553 boot drives and 72,100 data drives. This post looks at the hard drive statistics of the data drives we monitor. We’ll first look at the stats for Q4 2016, then present the data for all of 2016, and finish with the lifetime statistics for all of the drives Backblaze has used in our cloud storage data centers since we started keeping track. Along the way we’ll share observations and insights on the data presented. As always you can download our Hard Drive Test Data to examine and use.

Hard Drive Reliability Statistics for Q4 2016

At the end of Q4 2016 Backblaze was monitoring 72,100 data drives. For our evaluation we remove from consideration those drives which were used for testing purposes and those drive models for which we did not have at least 45 drives. This leaves us with 71,939 production hard drives. The table below is for the period of Q4 2016.

Hard Drive Annualized Failure Rates for Q4 2016
Notes:

  1. The failure rate listed is for just Q4 2016. If a drive model has a failure rate of 0%, it means there were no drive failures of that model during that quarter.
  2. 90 drives (2 storage pods) were used for testing purposes during the period. They contained Seagate 1.5TB and 1.0 TB WDC drives. These are not included in the results above.
  3. The most common reason we have less than 45 drives of one model is that we needed to replace a failed drive, but that drive model is no longer available. We use 45 drives as the minimum number to report quarterly and yearly statistics.

8 TB Hard Drive Performance

In Q4 2016 we introduced a third 8 TB drive model, the Seagate ST8000NM0055. This is an enterprise class drive. One 60-drive Storage Pod was deployed mid-Q4 and the initial results look promising as there have been no failures to date. Given our past disdain for overpaying for enterprise drives, it will be interesting to see how these drives perform.

We added 3,540 Seagate 8 TB drives, model ST8000DM002, giving us 8,660 of these drives. That’s 69 petabytes of raw storage, before formatting and encoding, or about 22% of our current data storage capacity. The failure rate for the quarter of these 8 TB drives was a very respectable 1.65%. That’s lower than the Q4 failure rate of 1.94% for all of the hard drives in the table above.

During the next couple of calendar quarters we’ll monitor how the new enterprise 8 TB drives compare to the consumer 8 TB drives. We’re interested to know which models deliver the best value and we bet you are too. We’ll let you know what we find.

2016 Hard Drive Performance Statistics

Looking back over 2016, we added 15,646 hard drives, and migrated 110 Storage Pods (4,950 drives) from 1-, 1.5-, and 2 TB drives to 4-, 6- and 8 TB drives. Below are the hard drive failure stats for 2016. As with the quarterly results, we have removed any non-production drives and any models that had less than 45 drives.

2016 Hard Drive Annualized Failure Rates
No Time For Failure

In 2016, three drive models ended the year with zero failures, albeit with a small number of drives. Both the 4 TB Toshiba and the 8 TB HGST models went the entire year without a drive failure. The 8 TB Seagate (ST8000NM0055) drives, which were deployed in November 2016, also recorded no failures.

The total number of failed drives was 1,225 for the year. That’s 3.36 drive failures per day or about 5 drives per workday, a very manageable workload. Of course, that’s easy for me to say, since I am not the one swapping out drives.

The overall hard drive failure rate for 2016 was 1.95%. That’s down from 2.47% in 2015 and well below the 6.39% failure rate for 2014.

Big Drives Rule

We increased storage density by moving to higher-capacity drives. That helped us end 2016 with 3 TB drives being the smallest density drives in our data centers. During 2017, we will begin migrating from the 3.0 TB drives to larger-sized drives. Here’s the distribution of our hard drives in our data centers by size for 2016.
2016 Distribution of Hard Drives by Size
Digging in a little further, below are the failure rates by drive size and vendor for 2016.

Hard Drive Failure Rates by Drive Size
Hard Drive Failure Rates by Manufacturer

Computing the Failure Rate

Failure Rate, in the context we use it, is more accurately described as the Annualized Failure Rate. It is computed based on Drive Days and Drive Failures, not on the Drive Count. This may seem odd given we are looking at a one year period, 2016 in this case, so let’s take a look.

We start by dividing the Drive Failures by the Drive Count. For example, if we use the statistics for 4 TB drives, we get a “failure rate” of 1.92%, but the annualized failure rate shown on the chart for 4 TB drives is 2.06%. The trouble with just dividing Drive Failures by Drive Count is that the Drive Count constantly changes over the course of the year. By using the Drive Count from a given day, you assume that each drive contributed the same amount of time over the year, but that’s not the case. Drives enter and leave the system all the time. By counting the number of days each drive is active as Drive Days, we can account for all the ins and outs over a given period of time.
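
In other words, the annualized rate treats every 365 drive days as one drive year. A small sketch of the calculation, using made-up numbers rather than Backblaze's actual figures:

def annualized_failure_rate(drive_failures, drive_days):
    """Failures per drive-year, expressed as a percentage."""
    drive_years = drive_days / 365.0
    return 100.0 * drive_failures / drive_years

# Illustrative only: 300 failures over 5,300,000 drive days is roughly 2.07%.
print(annualized_failure_rate(300, 5300000))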

Hard Drive Benchmark Statistics

As we noted earlier, we’ve been collecting and storing drive stats data since April 2013. In that time we have used 55 different hard drive models in our data center for data storage. We’ve omitted models from the table below that we didn’t have enough of to populate an entire storage pod (45 or fewer). That excludes 25 of those 55 models.

Annualized Hard Drive Failure Rates

Fun with Numbers

Since April 2013, there have been 5,380 hard drives failures. That works out to about 5 per day or about 7 per workday (200 workdays per year). As a point of reference, Backblaze only had 4,500 total hard drives in June 2010 when we racked our 100th Storage Pod to support our cloud backup service.

The 58,375,646 Drive Days translate to a little over 1.4 billion Drive Hours. Going the other way, that works out to a mere 159,933 years of spinning hard drives.

You’ll also notice that we have used a total of 85,467 hard drives. But at the end of 2016 we had 71,939 hard drives. Are we missing 13,528 hard drives? Not really. While some drives failed, the remaining drives were removed from service due primarily to migrations from smaller to larger drives. The stats from the “migrated” drives, like Drive Hours, still count in establishing a failure rate, but they did not fail, they just stopped reporting data.

Failure Rates Over Time

The chart below shows the annualized failure rates of hard drives by drive size over time. The data points are the rates as of the end of each year shown. The “stars” mark the average annualized failure rate for all of the hard drives for each year.

Annualized Hard Drive Failures by Drive Size

Notes:

  1. The “8.0TB” failure rate of 4.9% for 2015 is comprised of 45 drives of which there were 2 failures during that year. In 2016 the number of 8 TB drives rose to 8,765 with 48 failures and an annualized failure rate of 1.6%.
  2. The “1.0TB” drives were 5+ years old on average when they were retired.
  3. There are only 45 of the “5.0TB” drives in operation.

Can’t Get Enough Hard Drive Stats?

We’ll be presenting the webinar “Backblaze Hard Drive Stats for 2016” on Thursday February 2, 2017 at 10:00 Pacific time. The webinar will be recorded so you can watch it over and over again. The webinar will dig deeper into the quarterly, yearly, and lifetime hard drive stats and include the annual and lifetime stats by drive size and manufacturer. You will need to subscribe to the Backblaze BrightTALK channel to view the webinar. Sign up for the webinar today.

As a reminder, the complete data set used to create the information in this review is available on our Hard Drive Test Data page. You can download and use this data for free for your own purposes; all we ask is three things: 1) cite Backblaze as the source if you use the data, 2) accept that you are solely responsible for how you use the data, and 3) do not sell this data to anyone, as it is free. If you just want the summarized data used to create the tables and charts in this blog post, you can download the ZIP file containing the MS Excel spreadsheet.

Good luck and let us know if you find anything interesting.

The post Backblaze Hard Drive Stats for 2016 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Is ‘aqenbpuu’ a bad password?

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/01/is-aqenbpuu-bad-password.html

Press secretary Sean Spicer has twice tweeted a random string, leading people to suspect he’s accidentally tweeted his Twitter password. One of these was ‘aqenbpuu’, which some have described as a “shitty password“. Is it actually bad?

No. It’s adequate. Not the best, perhaps, but not “shitty”.

It depends upon your threat model. The common threats are password reuse and phishing, where the strength doesn’t matter. When the strength does matter is when Twitter gets hacked and the password hashes stolen.

Twitter uses the bcrypt password hashing technique, which is designed to be slow. A typical desktop with a GPU can only crack bcrypt passwords at a rate of around 321 hashes per second. Doing the math (26 to the power of 8 combinations, divided by 321 hashes per second, converted to years), it would take this desktop about 20 years to crack the password.
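
A quick back-of-the-envelope version of that arithmetic:

keyspace = 26 ** 8                    # eight lower-case letters
rate = 321                            # bcrypt hashes per second on one desktop GPU
seconds = keyspace / float(rate)
years = seconds / (60 * 60 * 24 * 365)
print(round(years, 1))                # about 20.6 years to exhaust the keyspace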

That’s not a good password. A botnet with thousands of desktops, or somebody willing to spend thousands of dollars on a supercomputer or a cluster like Amazon’s, can crack that password in a few days.

But it’s not a bad password, either. A hack of a Twitter account like this would be a minor event, not worth somebody spending that many resources on. Security is a tradeoff: you protect a ton of gold with Ft. Knox-like protections, but you wouldn’t invest the same amount protecting a ton of wood. The same is true with passwords: as long as you don’t reuse your passwords or fall victim to phishing, eight lower-case characters is adequate.

This is especially true if using two-factor authentication, in which case, such a password is more than adequate.

I point this out because the Trump administration is bad, and Sean Spicer is a liar. Our criticism needs to be limited to things we can support, such as the DC metro ridership numbers (which Spicer has still not corrected). Every time we weakly criticize the administration on things we cannot support, like “shitty passwords”, we lessen our credibility. We look more like people who will hate the administration no matter what they do, rather than people who are standing up for principles like “honesty”.


The numbers above aren’t approximations. I actually generated a bcrypt hash and attempted to crack it in order to benchmark how long this would take. I’ll describe the process here.

First of all, I installed the “PHP command-line”. While older versions of PHP used MD5 for hashing, the newer versions use Bcrypt.

# apt-get install php5-cli

I then created a PHP program that will hash the password:

I actually use it three ways. The first way is to hash a small password “ax”, one short enough that the password cracker will actually succeed in cracking it. The second is to hash the password with PHP defaults, which is what I assume Twitter is using. The third is to increase the difficulty level, in case Twitter has increased the default difficulty level in order to protect weak passwords.
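
(The PHP script itself appeared as a screenshot in the original post. Purely as a stand-in, and as my illustration rather than the author's code, an equivalent sketch in Python with the bcrypt package would be:)

import bcrypt

# 1) a short password ("ax"), short enough that the cracker will actually find it
print(bcrypt.hashpw(b"ax", bcrypt.gensalt(rounds=10)))

# 2) the tweeted password at the default cost factor (2^10 iterations)
print(bcrypt.hashpw(b"aqenbpuu", bcrypt.gensalt(rounds=10)))

# 3) the same password at a higher cost factor (2^15 iterations)
print(bcrypt.hashpw(b"aqenbpuu", bcrypt.gensalt(rounds=15)))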

I then ran the PHP script, producing these hashes:
$ php spicer.php
$2y$10$1BfTonhKWDN23cGWKpX3YuBSj5Us3eeLzeUsfylemU0PK4JFr4moa
$2y$10$DKe2N/shCIU.jSfSr5iWi.jH0AxduXdlMjWDRvNxIgDU3Cyjizgzu
$2y$15$HKcSh42d8amewd2QLvRWn.wttbz/B8Sm656Xh6ZWQ4a0u6LZaxyH6

I ran the first one through the password cracker known as “Hashcat”. Within seconds, I get the correct password “ax”. Thus, I know Hashcat is correctly cracking the password. I actually had a little doubt, because the documentation doesn’t make it clear that the Bcrypt algorithm the program supports is the same as the one produced by PHP5.

I run the second one, and get the following speed statistics:

As you can see, I’m brute-forcing an eight-character password that’s all lower case (-a 3 ?l?l?l?l?l?l?l?l). Checking the speed as it runs, it seems pretty consistently slightly above 300 hashes/second. It’s not perfect — it keeps warning that it’s slow. This is because the GPU acceleration works best if trying many password hashes at a time.

I tried the same sort of setup using John-the-Ripper in incremental mode. Whereas Hashcat uses the GPU, John uses the CPU. I have a 6-core Broadwell CPU, so I ran John-the-Ripper with 12 threads.

Curiously, it’s slightly faster, at 347 hashes-per-second on the CPU rather than 321 on the GPU.

Using the stronger work factor (the third hash I produced above), I get about 10 hashes/second on John, and 10 on Hashcat as well. It takes over a second to even generate the hash, meaning it’s probably too aggressive for a web server like Twitter to have to do that much work every time somebody logs in, so I suspect they aren’t that aggressive.

Would a massive IoT botnet help? Well, I tried out John on the Raspberry Pi 3. With the same settings cracking the password (at default difficulty), I got the following:

In other words, the RPi is 35 times slower than my desktop computer at this task.

The RPi v3 has four cores and about twice the clock speed of IoT devices. So the typical IoT device would be 250 times slower than a desktop computer. This gives a good approximation of the difference in power.


So there’s this comment:

Yes, you can. So I did, described below.

Okay, so first you need to use the “Node Package Manager” to install bcrypt. The name isn’t “bcrypt”, which refers to a module that I can’t get installed on any platform. Instead, you want “bcrypt-nodejs”.

# npm install bcrypt-nodejs
[email protected] node_modules\bcrypt-nodejs

On all my platforms (Windows, Ubuntu, Raspbian) this is already installed. So now you just create a script spicer.js:

var bcrypt = require("bcrypt-nodejs");
console.log(bcrypt.hashSync("aqenbpuu"));

This produces the following hash, which has the same work factor as the one generated by the PHP script above:

$2a$10$Ulbm//hEqMoco8FLRI6k8uVIrGqipeNbyU53xF2aYx3LuQ.xUEmz6

Hashcat and John are then the same speed cracking this one as the other one. The first characters $2a$ define the hash type (bcrypt). Apparently, there’s a slight difference between that and $2y$, but that doesn’t change the analysis.


The first comment below questions the speed I get, because running Hashcat in benchmark mode, it gets much higher numbers for bcrypt.

This is actually normal, due to different iteration counts.

A bcrypt hash includes an iteration count (or more precisely, the logarithm of an iteration count). It repeats the hash that number of times. That’s what the $10$ means in the hash:

$2y$10$………

The Hashcat benchmark uses the number 5 (actually, 2^5, or 32 iterations) as its count. But the default iteration count produced by PHP and NodeJS is 10 (2^10, or 1024 iterations). Thus, you’d expect these hashes to run 32 times slower.
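
The cost field can be read straight out of the hash, and the expected slowdown is just the ratio of iteration counts. A small sketch:

# The field after the second "$" is the base-2 log of the iteration count.
h = "$2y$10$DKe2N/shCIU.jSfSr5iWi.jH0AxduXdlMjWDRvNxIgDU3Cyjizgzu"
cost = int(h.split("$")[2])               # 10, i.e. 2**10 = 1024 iterations

benchmark_cost = 5                        # hashcat's benchmark uses cost 5
print(2 ** cost // 2 ** benchmark_cost)   # 32: expected speed ratio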

And indeed, that’s what I get. Running the benchmark on my machine, I get the following output:

Hashtype: bcrypt, Blowfish(OpenBSD)
Speed.Dev.#1…..:    10052 H/s (82.28ms)

Doing the math, dividing 10052 hashes/sec by 321 hashes/sec, I get 31.3. This is close enough to 32, the expected answer, given the variability of thermal throttling, background activity on the machine, and so on.

Googling, this appears to be a common complaint. It’d be nice if it said something like ‘bcrypt[speed=5]’ to make this clear.


Spicer tweeted another password, “n9y25ah7”. Instead of all lower-case, this is lower-case plus digits, or 36 to the power of 8 combinations, making it about 13 times harder ((36/26)^8 ≈ 13), which is roughly in the same range.


BTW, there are three notes I want to make.

The first is that a good practical exercise first tries to falsify the theory. In this case, I deliberately tested whether Hashcat and John were actually cracking the right password. They were. They both successfully cracked the two character password “ax”. I also tested GPU vs. CPU. Everyone knows that GPUs are faster for password cracking, but also everyone knows that Bcrypt is designed to be hard to run on GPUs.

The second note is that everything here is already discussed in my study guide on command-lines [*]. I mention that you can run PHP on the command-line, and that you can use Hashcat and John to crack passwords. My guide isn’t complete enough to be an explanation for everything, but it’s a good discussion of where you need to end up.

The third note is that I’m not a master of any of these tools. I know enough about these tools to Google the answers, not to pull them off the top of my head. Mastery is impossible; don’t even try it. For example, bcrypt is one of many hashing algorithms, and has all by itself a lot of complexity, such as the difference between $2a$ and $2y$, or the logarithmic iteration count. I ignored the issue of salting bcrypt altogether. So what I’m saying is that the level of proficiency you want is to be able to Google the steps in solving a problem like this, not actually knowing all this off the top of your head.

Seamlessly Scale Predictions with AWS Lambda and MXNet

Post Syndicated from Bryan Liston original https://aws.amazon.com/blogs/compute/seamlessly-scale-predictions-with-aws-lambda-and-mxnet/


Sunil Mallya, Solutions Architect

Building AI solutions at scale can be challenging. In this post, we’ll look at how to leverage AWS Lambda and MXNet to build a scalable prediction pipeline.

Companies that leverage machine and deep learning invest in much more than just training models. They have sophisticated pipelines that include the following stages:

  • Data storage
  • Pre-processing
  • Feature extraction
  • Model generation
  • Model Analysis
  • Feature engineering
  • Evaluation and feedback

[Figure: machine learning pipeline stages]

Each stage of the pipeline requires:

  • Elasticity to adapt to changing workload demands
  • Scalability to adapt well to the overall size of the workload
  • Cost effectiveness to optimize total cost of ownership (TCO)

Amazon S3 meets all of the requirements for data and model storage. But the unpredictability of user demand and location can make scaling up for batch predictions (or the results of the model analysis) challenging, and can affect the overall user experience. In this post, we show how to use MXNet and AWS Lambda to deploy models at scale for predictions.

What is MXNet?

MXNet is a full-featured, flexibly programmable, and highly scalable deep learning framework that supports state-of-the-art deep models, including convolutional neural networks (CNNs) and long short-term memory networks (LSTMs). It is the result of collaboration between researchers at several top universities, including its founding institutions, the University of Washington and Carnegie Mellon University.

As discussed in Werner Vogels’ MXNet – Deep Learning Framework of Choice at AWS post, not only is MXNet scalable for multi-instance training, it also scales down to a variety of devices and small memory footprints, even when serving predictions on very large models. MXNet is available through open source under the Apache Version 2 license.

Challenges with the prediction pipeline

As previously mentioned, ML model training and validation is just a small part of the story. After the model is built, the real work begins. To service millions of customers seamlessly, every application must scale. However, scaling the prediction portion of the pipeline can be challenging and expensive, especially when end users are geographically dispersed.

Delivering model updates, deploying globally, and maintaining high availability can be difficult. Lambda is a very good deployment option for the prediction pipeline; it’s an excellent solution for serverless web applications, real-time batch processing, and MapReduce tasks.

How can you leverage Lambda?

“Code, test, deploy, and let the service do the heavy lifting” is the Lambda customer motto. Regardless of your traffic patterns—high concurrency or bursty workloads—Lambda scales to service your needs.

We compiled and built the MXNet libraries to demonstrate how Lambda scales the prediction pipeline to provide this ease and flexibility for machine learning or deep learning model prediction. We built a sample application that predicts image labels using an 18-layer deep residual network. The model architecture is based on the winning model in the ImageNet competition called ResidualNet. The application produces state-of-the-art results for problems like image classification.

Putting it all together

Lambda has a deployment package limit of 50 MB. This limit means that you might not always be able to package your models along with the code. To accommodate this limitation, you can use S3 to store the model, and download the model when you service the request.

For optimal performance, you need to download the model outside of the lambda_handler function so that the downloaded file persists in memory across requests. For subsequent Lambda invocations, MXNet uses the downloaded model that’s already in memory. For more information about this optimization, see AWS Lambda: How It Works.

The following reference Lambda function for prediction is quite simple. It loads the model, downloads an image from the specified URL, transforms the image into an NDArray, and uses the model to make a prediction that outputs labels with associated confidence percentages. The implementation code is provided below.

import os
import boto3
import json
import tempfile
import urllib2 
import mxnet as mx
import numpy as np
import cv2
from collections import namedtuple

Batch = namedtuple('Batch', ['data'])
f_params = 'resnet-18-0000.params'
f_symbol = 'resnet-18-symbol.json'

bucket = 'my-model-bucket'
s3 = boto3.resource('s3')
s3_client = boto3.client('s3')

#params

f_params_file = tempfile.NamedTemporaryFile()
s3_client.download_file(bucket, f_params, f_params_file.name)
f_params_file.flush()

#symbol

f_symbol_file = tempfile.NamedTemporaryFile()
s3_client.download_file(bucket, f_symbol, f_symbol_file.name)
f_symbol_file.flush()

def load_model(s_fname, p_fname):
    symbol = mx.symbol.load(s_fname)
    save_dict = mx.nd.load(p_fname)
    arg_params = {}
    aux_params = {}
    for k, v in save_dict.items():
        tp, name = k.split(':', 1)
        if tp == 'arg':
            arg_params[name] = v
        if tp == 'aux':
            aux_params[name] = v
    return symbol, arg_params, aux_params

def predict(url, mod, synsets):
    req = urllib2.urlopen(url)
    arr = np.asarray(bytearray(req.read()), dtype=np.uint8)
    cv2_img = cv2.imdecode(arr, -1)
    # Bail out before color conversion if the image could not be decoded
    if cv2_img is None:
        return None
    img = cv2.cvtColor(cv2_img, cv2.COLOR_BGR2RGB)
    img = cv2.resize(img, (224, 224))
    img = np.swapaxes(img, 0, 2)
    img = np.swapaxes(img, 1, 2) 
    img = img[np.newaxis, :] 

    mod.forward(Batch([mx.nd.array(img)]))
    prob = mod.get_outputs()[0].asnumpy()
    prob = np.squeeze(prob)
    a = np.argsort(prob)[::-1]
    out = '' 

    # Emit the top-5 predictions, one per line
    for i in a[0:5]:
        out += 'probability=%f, class=%s\n' % (prob[i], synsets[i])

    return out

with open('synset.txt', 'r') as f:
    synsets = [l.rstrip() for l in f]

def lambda_handler(event, context):
    url = ''
    if event['httpMethod'] == 'GET':
        url = event['queryStringParameters']['url']
    elif event['httpMethod'] == 'POST':
        data = json.loads(event['body'])
        url = data['url']

    print "image url: ", url
    sym, arg_params, aux_params = load_model(f_symbol_file.name, f_params_file.name)
    mod = mx.mod.Module(symbol=sym)
    mod.bind(for_training=False, data_shapes=[('data', (1,3,224,224))])
    mod.set_params(arg_params, aux_params)
    labels = predict(url, mod, synsets)
    out = {
            "headers": {
                "content-type": "application/json",
                "Access-Control-Allow-Origin": "*"
                },
            "body": '{"labels": "%s"}' % labels,  
            "statusCode": 200
          }
    return out

What can you expect in production?

Because the code must download the model from S3, download an image from the web, and run the image through the model, we benchmarked it to evaluate how it scales. We ran a local benchmark on a laptop in San Francisco using wrk against the endpoint deployed in the US West (Oregon) Region (us-west-2). We observed an average latency of 1.18 seconds at a sustained rate of 75 requests per second, as shown in the following output.

[Screenshot: wrk benchmark output]

To benchmark global prediction latencies, we used Goad, a Lambda-based distributed load tester. We observed average latencies ranging from 1.2 seconds to 1.5 seconds. The following figure shows the observed latencies from various regions with the endpoint hosted in the US West (Oregon) Region (us-west-2).

[Figure: observed prediction latencies from various regions]

Conclusion

For a state-of-the-art image label prediction model, the numbers are impressive. The average global latency of 1.5 seconds makes it worth exploring integrating Lambda into your machine learning or deep learning pipeline for batch predictions. The libraries and code samples for deployment are available in the mxnet-lambda GitHub repo.

If you have questions or suggestions, please comment below.