Tag Archives: benchmark

How to Create an AMI Builder with AWS CodeBuild and HashiCorp Packer – Part 2

Post Syndicated from Heitor Lessa original https://aws.amazon.com/blogs/devops/how-to-create-an-ami-builder-with-aws-codebuild-and-hashicorp-packer-part-2/

Written by AWS Solutions Architects Jason Barto and Heitor Lessa

 
In Part 1 of this post, we described how AWS CodeBuild, AWS CodeCommit, and HashiCorp Packer can be used to build an Amazon Machine Image (AMI) from the latest version of Amazon Linux. In this post, we show how to use AWS CodePipeline, AWS CloudFormation, and Amazon CloudWatch Events to continuously ship new AMIs. We use Ansible by Red Hat to harden the OS on the AMIs through a well-known set of security controls outlined by the Center for Internet Security in its CIS Amazon Linux Benchmark.

You’ll find the source code for this post in our GitHub repo.

At the end of this post, we will have the following architecture:

Requirements

 
To follow along, you will need Git and a text editor. Make sure Git is configured to work with AWS CodeCommit, as described in Part 1.

Technologies

 
In addition to the services and products used in Part 1 of this post, we also use these AWS services and third-party software:

AWS CloudFormation gives developers and systems administrators an easy way to create and manage a collection of related AWS resources, provisioning and updating them in an orderly and predictable fashion.

Amazon CloudWatch Events enables you to react selectively to events in the cloud and in your applications. Specifically, you can create CloudWatch Events rules that match event patterns, and take actions in response to those patterns.

AWS CodePipeline is a continuous integration and continuous delivery service for fast and reliable application and infrastructure updates. AWS CodePipeline builds, tests, and deploys your code every time there is a code change, based on release process models you define.

Amazon SNS is a fast, flexible, fully managed push notification service that lets you send individual messages or fan out messages to large numbers of recipients. Amazon SNS makes it simple and cost-effective to send push notifications to mobile device users or email recipients. The service can even send messages to other distributed services.

Ansible is a simple IT automation system that handles configuration management, application deployment, cloud provisioning, ad-hoc task-execution, and multinode orchestration.

Getting Started

 
We use CloudFormation to bootstrap the following infrastructure:

  • AWS CodeCommit repository: Git repository where the AMI builder code is stored.
  • S3 bucket: Build artifact repository used by AWS CodePipeline and AWS CodeBuild.
  • AWS CodeBuild project: Executes the AWS CodeBuild instructions contained in the build specification file.
  • AWS CodePipeline pipeline: Orchestrates the AMI build process, triggered by new changes in the AWS CodeCommit repository.
  • SNS topic: Notifies subscribed email addresses when an AMI build is complete.
  • CloudWatch Events rule: Defines how the AMI builder should send a custom event to notify an SNS topic.

The AMI Builder launch template is available in the following regions:

  • N. Virginia (us-east-1)
  • Ireland (eu-west-1)

After launching the CloudFormation template linked here, we will have a pipeline in the AWS CodePipeline console. (A Failed state at this stage simply means we don’t have any data in our newly created AWS CodeCommit Git repository yet.)

Next, we will clone the newly created AWS CodeCommit repository.

If this is your first time connecting to an AWS CodeCommit repository, please see the instructions in our documentation on Setup steps for HTTPS Connections to AWS CodeCommit Repositories.

To clone the AWS CodeCommit repository (console)

  1. From the AWS Management Console, open the AWS CloudFormation console.
  2. Choose the AMI-Builder-Blogpost stack, and then choose Outputs.
  3. Make a note of the Git repository URL.
  4. Use git to clone the repository.

For example: git clone https://git-codecommit.eu-west-1.amazonaws.com/v1/repos/AMI-Builder_repo

To clone the AWS CodeCommit repository (CLI)

# Retrieve CodeCommit repo URL
git_repo=$(aws cloudformation describe-stacks --query 'Stacks[0].Outputs[?OutputKey==`GitRepository`].OutputValue' --output text --stack-name "AMI-Builder-Blogpost")

# Clone repository locally
git clone ${git_repo}

Bootstrap the Repo with the AMI Builder Structure

 
Now that our infrastructure is ready, download all the files and templates required to build the AMI.

Your local Git repo should have the following structure:

.
├── ami_builder_event.json
├── ansible
├── buildspec.yml
├── cloudformation
├── packer_cis.json

Next, push these changes to AWS CodeCommit, and then let AWS CodePipeline orchestrate the creation of the AMI:

git add .
git commit -m "My first AMI"
git push origin master

AWS CodeBuild Implementation Details

 
While we wait for the AMI to be created, let’s see what’s changed in our AWS CodeBuild buildspec.yml file:

...
phases:
  ...
  build:
    commands:
      ...
      - ./packer build -color=false packer_cis.json | tee build.log
  post_build:
    commands:
      - egrep "${AWS_REGION}\:\sami\-" build.log | cut -d' ' -f2 > ami_id.txt
      # Packer doesn't return non-zero status; we must do that if Packer build failed
      - test -s ami_id.txt || exit 1
      - sed -i.bak "s/<<AMI-ID>>/$(cat ami_id.txt)/g" ami_builder_event.json
      - aws events put-events --entries file://ami_builder_event.json
      ...
artifacts:
  files:
    - ami_builder_event.json
    - build.log
  discard-paths: yes

In the build phase, we capture Packer output into a file named build.log. In the post_build phase, we take the following actions:

  1. Look up the AMI ID created by Packer and save it to a temporary file (ami_id.txt).
  2. Force AWS CodeBuild to fail if the AMI ID (ami_id.txt) is not found. This is required because Packer doesn’t return a non-zero exit status if something goes wrong during the AMI creation process, so we have to tell AWS CodeBuild to stop by informing it that an error occurred.
  3. If an AMI ID is found, we update the ami_builder_event.json file and then notify CloudWatch Events that the AMI creation process is complete.
  4. CloudWatch Events publishes a message to an SNS topic. Anyone subscribed to the topic will be notified in email that an AMI has been created.

Lastly, the new artifacts sequence instructs AWS CodeBuild to upload files produced during the build (ami_builder_event.json and build.log) to the S3 bucket specified in the Outputs section of the CloudFormation template. These artifacts can then be used as an input artifact in any later stage in AWS CodePipeline.

For information about customizing the artifacts sequence of the buildspec.yml, see the Build Specification Reference for AWS CodeBuild.

CloudWatch Events Implementation Details

 
CloudWatch Events allows you to extend the AMI builder so that it not only sends email after the AMI has been created, but can also invoke any of the supported targets in reaction to the AMI builder event. Publishing an event means that any actions you take after AMI completion are decoupled from Packer, so you can plug in other actions as you see fit.

For more information about targets in CloudWatch Events, see the CloudWatch Events API Reference.

In this case, CloudWatch Events should receive the following event, match it with a rule we created through CloudFormation, and publish a message to SNS so that you can receive an email.

Example CloudWatch custom event

[
        {
            "Source": "com.ami.builder",
            "DetailType": "AmiBuilder",
            "Detail": "{ \"AmiStatus\": \"Created\"}",
            "Resources": [ "ami-12cd5guf" ]
        }
]

CloudWatch Events rule

{
  "detail-type": [
    "AmiBuilder"
  ],
  "source": [
    "com.ami.builder"
  ],
  "detail": {
    "AmiStatus": [
      "Created"
    ]
  }
}

Example SNS message sent in email

{
    "version": "0",
    "id": "f8bdede0-b9d7...",
    "detail-type": "AmiBuilder",
    "source": "com.ami.builder",
    "account": "<<aws_account_number>>",
    "time": "2017-04-28T17:56:40Z",
    "region": "eu-west-1",
    "resources": ["ami-112cd5guf "],
    "detail": {
        "AmiStatus": "Created"
    }
}
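The AMI builder emits this event from the buildspec with the aws events put-events CLI command. If you want to publish the same event from your own tooling, a minimal boto3 sketch (using the placeholder AMI ID from the example above) might look like this:

import json

import boto3

events = boto3.client('events')

# Publish the same custom event the AMI builder sends after a successful Packer build.
response = events.put_events(
    Entries=[
        {
            'Source': 'com.ami.builder',
            'DetailType': 'AmiBuilder',
            'Detail': json.dumps({'AmiStatus': 'Created'}),
            'Resources': ['ami-12cd5guf'],  # placeholder AMI ID from the example event
        }
    ]
)

# Any entries that could not be published are reported here.
print(response['FailedEntryCount'])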

Packer Implementation Details

 
In addition to the build specification file, there are differences between the current version of the HashiCorp Packer template (packer_cis.json) and the one used in Part 1.

Variables

  "variables": {
    "vpc": "{{env `BUILD_VPC_ID`}}",
    "subnet": "{{env `BUILD_SUBNET_ID`}}",
    "ami_name": "Prod-CIS-Latest-AMZN-{{isotime \"02-Jan-06 03_04_05\"}}"
  },
  • ami_name: Prefixes a name used by Packer to tag resources during the Builders sequence.
  • vpc and subnet: Environment variables defined by the CloudFormation stack parameters.

We no longer assume a default VPC is present and instead use the VPC and subnet specified in the CloudFormation parameters. CloudFormation configures the AWS CodeBuild project to use these values as environment variables. They are made available throughout the build process.

That allows for more flexibility should you need to change which VPC and subnet will be used by Packer to launch temporary resources.

Builders

  "builders": [{
    ...
    "ami_name": “{{user `ami_name`| clean_ami_name}}”,
    "tags": {
      "Name": “{{user `ami_name`}}”,
    },
    "run_tags": {
      "Name": “{{user `ami_name`}}",
    },
    "run_volume_tags": {
      "Name": “{{user `ami_name`}}",
    },
    "snapshot_tags": {
      "Name": “{{user `ami_name`}}",
    },
    ...
    "vpc_id": "{{user `vpc` }}",
    "subnet_id": "{{user `subnet` }}"
  }],

We now have new tag properties (tags, run_tags, run_volume_tags, and snapshot_tags) and a new function (clean_ami_name), and we launch temporary resources in the VPC and subnet specified in the environment variables. AMI names can only contain a certain set of ASCII characters. If the input in this project deviates from the expected characters (for example, it includes whitespace or slashes), Packer’s clean_ami_name function will fix it.

For more information, see functions on the HashiCorp Packer website.

Provisioners

  "provisioners": [
    {
        "type": "shell",
        "inline": [
            "sudo pip install ansible"
        ]
    }, 
    {
        "type": "ansible-local",
        "playbook_file": "ansible/playbook.yaml",
        "role_paths": [
            "ansible/roles/common"
        ],
        "playbook_dir": "ansible",
        "galaxy_file": "ansible/requirements.yaml"
    },
    {
      "type": "shell",
      "inline": [
        "rm .ssh/authorized_keys ; sudo rm /root/.ssh/authorized_keys"
      ]
    }
  ]

We used the shell provisioner to apply OS patches in Part 1. Now, we use it to install Ansible on the target machine, and we use ansible-local to import, install, and execute Ansible roles to make our target machine conform to our standards.

Finally, Packer uses the shell provisioner again to remove temporary keys before it creates an AMI from the temporary target EC2 instance.

Ansible Implementation Details

 
Ansible provides OS patching through a custom Common role that can be easily customized for other tasks.

The CIS Benchmark and CloudWatch Logs are implemented through two third-party Ansible roles that are defined in ansible/requirements.yaml, as referenced in the Packer template.

The Ansible provisioner uses Ansible Galaxy to download these roles onto the target machine and execute them as instructed by ansible/playbook.yaml.

For information about how these components are organized, see the Playbook Roles and Include Statements in the Ansible documentation.

The following Ansible playbook (ansible/playbook.yaml) controls the execution order and custom properties:

---
- hosts: localhost
  connection: local
  gather_facts: true    # gather OS info that is made available for tasks/roles
  become: yes           # majority of CIS tasks require root
  vars:
    # CIS Controls whitepaper:  http://bit.ly/2mGAmUc
    # AWS CIS Whitepaper:       http://bit.ly/2m2Ovrh
    cis_level_1_exclusions:
    # 3.4.2 and 3.4.3 effectively block access to all ports on the machine
    ## This can break automation; ignoring them as there are stronger mechanisms in place
      - 3.4.2 
      - 3.4.3
    # CloudWatch Logs will be used instead of Rsyslog/Syslog-ng
    ## Same would be true if any other software doesn't support Rsyslog/Syslog-ng mechanisms
      - 4.2.1.4
      - 4.2.2.4
      - 4.2.2.5
    # Autofs is not installed in newer versions, let's ignore
      - 1.1.19
    # Cloudwatch Logs role configuration
    logs:
      - file: /var/log/messages
        group_name: "system_logs"
  roles:
    - common
    - anthcourtney.cis-amazon-linux
    - dharrisio.aws-cloudwatch-logs-agent

Both third-party Ansible roles can be easily configured through variables (vars). We use Ansible playbook variables to exclude CIS controls that don’t apply to our case and to instruct the CloudWatch Logs agent to stream the /var/log/messages log file to CloudWatch Logs.

If you need to add more OS or application logs, you can easily duplicate the playbook and make changes. The CloudWatch Logs agent will ship configured log messages to CloudWatch Logs.

For more information about parameters you can use to further customize third-party roles, download the Ansible roles for the CloudWatch Logs Agent and CIS Amazon Linux from the Galaxy website.

Committing Changes

 
Now that Ansible and CloudWatch Events are configured as a part of the build process, committing any changes to the AWS CodeCommit Git repository will trigger a new AMI build process that can be followed through the AWS CodePipeline console.

When the build is complete, an email will be sent to the email address you provided as a part of the CloudFormation stack deployment. The email serves as notification that an AMI has been built and is ready for use.

Summary

 
We used AWS CodeCommit, AWS CodePipeline, AWS CodeBuild, Packer, and Ansible to build a pipeline that continuously builds new, hardened CIS AMIs. We used Amazon SNS so that email addresses subscribed to an SNS topic are notified upon completion of the AMI build.

By treating our AMI creation process as code, we can iterate and track changes over time. In this way, it’s no different from a software development workflow. With that in mind, software patches, OS configuration, and logs that need to be shipped to a central location are only a git commit away.

Next Steps

 
Here are some ideas to extend this AMI builder:

  • Hook up a Lambda function in CloudWatch Events to update EC2 Auto Scaling configuration upon completion of the AMI build.
  • Use AWS CodePipeline parallel steps to build multiple Packer images.
  • Add a commit ID as a tag for the AMI you created.
  • Create a scheduled Lambda function through CloudWatch Events to clean up old AMIs based on timestamp (name or additional tag).
  • Implement Windows support for the AMI builder.
  • Create a cross-account or cross-region AMI build.

CloudWatch Events allows the AMI builder to decouple AMI configuration and creation, so you can easily add your own logic using targets (AWS Lambda, Amazon SQS, Amazon SNS) to act on the event or recycle EC2 instances with the new AMI.

If you have questions or other feedback, feel free to leave it in the comments or contribute to the AMI Builder repo on GitHub.

BPI Breaks Record After Sending 310 Million Google Takedowns

Post Syndicated from Andy original https://torrentfreak.com/bpi-breaks-record-after-sending-310-million-google-takedowns-170619/

A little over a year ago during March 2016, music industry group BPI reached an important milestone. After years of sending takedown notices to Google, the group burst through the 200 million URL barrier.

The fact that it took BPI several years to reach its 200 million milestone made the surpassing of the quarter billion milestone a few months later even more remarkable. In October 2016, the group sent its 250 millionth takedown to Google, a figure that nearly doubled when accounting for notices sent to Microsoft’s Bing.

But despite the volumes, the battle hadn’t been won, let alone the war. The BPI’s takedown machine continued to run at a remarkable rate, churning out millions more notices per week.

As a result, yet another new milestone was reached this month when the BPI smashed through the 300 million URL barrier. Then, days later, a further 10 million were added, with the latter couple of million added during the time it took to put this piece together.

BPI takedown notices, as reported by Google

While demanding that Google places greater emphasis on its de-ranking of ‘pirate’ sites, the BPI has called again and again for a “notice and stay down” regime, to ensure that content taken down by the search engine doesn’t simply reappear under a new URL. It’s a position BPI maintains today.

“The battle would be a whole lot easier if intermediaries played fair,” a BPI spokesperson informs TF.

“They need to take more proactive responsibility to reduce infringing content that appears on their platform, and, where we expressly notify infringing content to them, to ensure that they do not only take it down, but also keep it down.”

The long-standing suggestion is that the volume of takedown notices sent would reduce if a “take down, stay down” regime was implemented. The BPI says it’s difficult to present a precise figure but infringing content has a tendency to reappear, both in search engines and on hosting sites.

“Google rejects repeat notices for the same URL. But illegal content reappears as it is re-indexed by Google. As to the sites that actually host the content, the vast majority of notices sent to them could be avoided if they implemented take-down & stay-down,” BPI says.

The fact that the BPI has added 60 million more takedowns since the quarter billion milestone a few months ago is quite remarkable, particularly since there appears to be little slowdown from month to month. However, the numbers have grown so huge that 310 million now feels a lot like 250 million, with just a few added on top for good measure.

That an extra 60 million takedowns can almost be dismissed as a handful is an indication of just how massive the issue is online. While pirates always welcome an abundance of links to juicy content, it’s no surprise that groups like the BPI are seeking more comprehensive and sustainable solutions.

Previously, it was hoped that the Digital Economy Bill would provide some relief, hopefully via government intervention and the imposition of a search engine Code of Practice. In the event, however, all pressure on search engines was removed from the legislation after a separate voluntary agreement was reached.

All parties agreed that the voluntary code should come into effect two weeks ago on June 1 so it seems likely that some effects should be noticeable in the near future. But the BPI says it’s still early days and there’s more work to be done.

“BPI has been working productively with search engines since the voluntary code was agreed to understand how search engines approach the problem, but also what changes can and have been made and how results can be improved,” the group explains.

“The first stage is to benchmark where we are and to assess the impact of the changes search engines have made so far. This will hopefully be completed soon, then we will have better information of the current picture and from that we hope to work together to continue to improve search for rights owners and consumers.”

With more takedown notices in the pipeline not yet publicly reported by Google, the BPI informs TF that it has now notified the search giant of 315 million links to illegal content.

“That’s an astonishing number. More than 1 in 10 of the entire world’s notices to Google come from BPI. This year alone, one in every three notices sent to Google from BPI is for independent record label repertoire,” BPI concludes.

While it’s clear that groups like BPI have developed systems to cope with the huge numbers of takedown notices required in today’s environment, few rightsholders are happy with the status quo. With that in mind, the fight will continue until search engines are forced into compromise. Considering the implications, that could only appear on a very distant horizon.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

[$] Making Python faster

Post Syndicated from jake original https://lwn.net/Articles/725114/rss

The Python core developers, and Victor Stinner in particular, have been focusing on improving the performance of Python 3 over the last few years. At PyCon 2017, Stinner gave a talk on some of the optimizations that have been added recently and the effect they have had on various benchmarks. Along the way, he took a detour into some improvements that have been made for benchmarking Python.

[$] A survey of scheduler benchmarks

Post Syndicated from corbet original https://lwn.net/Articles/725238/rss

Many benchmarks have been used by kernel developers over the years to test the performance of the scheduler. But recent kernel commit messages have shown a particular pattern of tools being used (some relatively new), all of which were created specifically for developing scheduler patches. While each benchmark is different, having its own unique genesis story and intended testing scenario, there is a unifying attribute; they were all written to scratch a developer’s itch.

Data Compression Improvements in Amazon Redshift Bring Compression Ratios Up to 4x

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/data-compression-improvements-in-amazon-redshift/

Maor Kleider, Senior Product Manager with Amazon Redshift, wrote today’s guest post.

-Ana


Amazon Redshift is a fast, fully managed, petabyte-scale data warehousing service that makes it simple and cost-effective to analyze all of your data. Many of our customers, including Scholastic, King.com, Electronic Arts, TripAdvisor and Yelp, migrated to Amazon Redshift and achieved agility and faster time to insight, while dramatically reducing costs.

Columnar compression is an important technology in Amazon Redshift. It both helps reduce customer costs by increasing the effective storage capacity of our nodes and improves performance by reducing I/O needed to process SQL requests. Improving I/O efficiency is very important for data warehousing. Last year, our I/O enhancements doubled query throughput. Let’s talk about some of the new compression improvements we’ve recently added to Amazon Redshift.

First, in build 1.0.1172 we added support for the Zstandard compression algorithm, which offers a good balance between a high compression ratio and speed. When applied to raw data in the standard 3 TB TPC-DS benchmark, Zstandard achieves a 65% reduction in disk space. Zstandard is broadly applicable. You can apply it to any of the following data types: SMALLINT, INTEGER, BIGINT, DECIMAL, REAL, DOUBLE PRECISION, BOOLEAN, CHAR, VARCHAR, DATE, TIMESTAMP and TIMESTAMPTZ.

Second, we’ve improved the automation of compression on tables created by the CREATE TABLE AS, CREATE TABLE or ALTER TABLE ADD COLUMN commands. Starting with Build 1.0.1161, Amazon Redshift automatically chooses a default compression for the columns created by those commands. Automated compression happens when we estimate that we can reduce disk space without degrading query performance. Our customers have seen up to 40% reduction in disk space.

Third, we’ve been optimizing our internal on-disk data structures. Our preview customers averaged a 7% reduction in disk space usage with this improvement. This feature is delivered starting with Build 1.0.1271.

Finally, we have enhanced the ANALYZE COMPRESSION command to estimate disk space reduction. You can now easily identify opportunities to further compress data and improve performance. Behind the scenes, we sample your data and suggest the most effective compression. You can then specify the recommended encodings or your preferred encodings based on your own evaluation.

“Before all the recent compression features, our largest table was over 7 TB. It’s now only 4.85 TB, which is an additional 30.7% reduction in disk space. This allows us to reduce our disk space by 4X in total and our effective cost to less than $250/TB/Year on an uncompressed data basis. We’re now able to analyze more data with Amazon Redshift, and our query performance has gotten even better.” Chuong Do, Director of Analytics, Coursera

Of course, the actual benefits you see on your clusters will depend upon your workload and your data. In combination, these improvements may reduce your data sets by up to 4x vs. the 3x most of our customers saw before.

You may have heard us talk about how an Amazon Redshift data warehouse can cost as little as $1,000 per terabyte per year. It is important to realize that we’re talking about compressed data in this number. After all, that’s what we store. Not all vendors do this – many compress your data under the covers but describe per-terabyte costs in terms of uncompressed data. That’s unfortunate – the difference between talking in terms of uncompressed data and compressed data can be a significant overstatement.

-Maor Kleider

In Case You Missed These: AWS Security Blog Posts from January, February, and March

Post Syndicated from Craig Liebendorfer original https://aws.amazon.com/blogs/security/in-case-you-missed-these-aws-security-blog-posts-from-january-february-and-march/


In case you missed any AWS Security Blog posts published so far in 2017, they are summarized and linked to below. The posts are shown in reverse chronological order (most recent first), and the subject matter ranges from protecting dynamic web applications against DDoS attacks to monitoring AWS account configuration changes and API calls to Amazon EC2 security groups.

March

March 22: How to Help Protect Dynamic Web Applications Against DDoS Attacks by Using Amazon CloudFront and Amazon Route 53
Using a content delivery network (CDN) such as Amazon CloudFront to cache and serve static text and images or downloadable objects such as media files and documents is a common strategy to improve webpage load times, reduce network bandwidth costs, lessen the load on web servers, and mitigate distributed denial of service (DDoS) attacks. AWS WAF is a web application firewall that can be deployed on CloudFront to help protect your application against DDoS attacks by giving you control over which traffic to allow or block by defining security rules. When users access your application, the Domain Name System (DNS) translates human-readable domain names (for example, www.example.com) to machine-readable IP addresses (for example, 192.0.2.44). A DNS service, such as Amazon Route 53, can effectively connect users’ requests to a CloudFront distribution that proxies requests for dynamic content to the infrastructure hosting your application’s endpoints. In this blog post, I show you how to deploy CloudFront with AWS WAF and Route 53 to help protect dynamic web applications (with dynamic content such as a response to user input) against DDoS attacks. The steps shown in this post are key to implementing the overall approach described in AWS Best Practices for DDoS Resiliency and enable the built-in, managed DDoS protection service, AWS Shield.

March 21: New AWS Encryption SDK for Python Simplifies Multiple Master Key Encryption
The AWS Cryptography team is happy to announce a Python implementation of the AWS Encryption SDK. This new SDK helps manage data keys for you, and it simplifies the process of encrypting data under multiple master keys. As a result, this new SDK allows you to focus on the code that drives your business forward. It also provides a framework you can easily extend to ensure that you have a cryptographic library that is configured to match and enforce your standards. The SDK also includes ready-to-use examples. If you are a Java developer, you can refer to this blog post to see specific Java examples for the SDK. In this blog post, I show you how you can use the AWS Encryption SDK to simplify the process of encrypting data and how to protect your encryption keys in ways that help improve application availability by not tying you to a single region or key management solution.

March 21: Updated CJIS Workbook Now Available by Request
The need for guidance when implementing Criminal Justice Information Services (CJIS)–compliant solutions has become of paramount importance as more law enforcement customers and technology partners move to store and process criminal justice data in the cloud. AWS services allow these customers to easily and securely architect a CJIS-compliant solution when handling criminal justice data, creating a durable, cost-effective, and secure IT infrastructure that better supports local, state, and federal law enforcement in carrying out their public safety missions. AWS has created several documents (collectively referred to as the CJIS Workbook) to assist you in aligning with the FBI’s CJIS Security Policy. You can use the workbook as a framework for developing CJIS-compliant architecture in the AWS Cloud. The workbook helps you define and test the controls you operate, and document the dependence on the controls that AWS operates (compute, storage, database, networking, regions, Availability Zones, and edge locations).

March 9: New Cloud Directory API Makes It Easier to Query Data Along Multiple Dimensions
Today, we made available a new Cloud Directory API, ListObjectParentPaths, that enables you to retrieve all available parent paths for any directory object across multiple hierarchies. Use this API when you want to fetch all parent objects for a specific child object. The order of the paths and objects returned is consistent across iterative calls to the API, unless objects are moved or deleted. In case an object has multiple parents, the API allows you to control the number of paths returned by using a paginated call pattern. In this blog post, I use an example directory to demonstrate how this new API enables you to retrieve data across multiple dimensions to implement powerful applications quickly.

March 8: How to Access the AWS Management Console Using AWS Microsoft AD and Your On-Premises Credentials
AWS Directory Service for Microsoft Active Directory, also known as AWS Microsoft AD, is a managed Microsoft Active Directory (AD) hosted in the AWS Cloud. Now, AWS Microsoft AD makes it easy for you to give your users permission to manage AWS resources by using on-premises AD administrative tools. With AWS Microsoft AD, you can grant your on-premises users permissions to resources such as the AWS Management Console instead of adding AWS Identity and Access Management (IAM) user accounts or configuring AD Federation Services (AD FS) with Security Assertion Markup Language (SAML). In this blog post, I show how to use AWS Microsoft AD to enable your on-premises AD users to sign in to the AWS Management Console with their on-premises AD user credentials to access and manage AWS resources through IAM roles.

March 7: How to Protect Your Web Application Against DDoS Attacks by Using Amazon Route 53 and an External Content Delivery Network
Distributed Denial of Service (DDoS) attacks are attempts by a malicious actor to flood a network, system, or application with more traffic, connections, or requests than it is able to handle. To protect your web application against DDoS attacks, you can use AWS Shield, a DDoS protection service that AWS provides automatically to all AWS customers at no additional charge. You can use AWS Shield in conjunction with DDoS-resilient web services such as Amazon CloudFront and Amazon Route 53 to improve your ability to defend against DDoS attacks. Learn more about architecting for DDoS resiliency by reading the AWS Best Practices for DDoS Resiliency whitepaper. You also have the option of using Route 53 with an externally hosted content delivery network (CDN). In this blog post, I show how you can help protect the zone apex (also known as the root domain) of your web application by using Route 53 to perform a secure redirect to prevent discovery of your application origin.


February

February 27: Now Generally Available – AWS Organizations: Policy-Based Management for Multiple AWS Accounts
Today, AWS Organizations moves from Preview to General Availability. You can use Organizations to centrally manage multiple AWS accounts, with the ability to create a hierarchy of organizational units (OUs). You can assign each account to an OU, define policies, and then apply those policies to an entire hierarchy, specific OUs, or specific accounts. You can invite existing AWS accounts to join your organization, and you can also create new accounts. All of these functions are available from the AWS Management Console, the AWS Command Line Interface (CLI), and through the AWS Organizations API. To read the full AWS Blog post about today’s launch, see AWS Organizations – Policy-Based Management for Multiple AWS Accounts.

February 23: s2n Is Now Handling 100 Percent of SSL Traffic for Amazon S3
Today, we’ve achieved another important milestone for securing customer data: we have replaced OpenSSL with s2n for all internal and external SSL traffic in Amazon Simple Storage Service (Amazon S3) commercial regions. This was implemented with minimal impact to customers, and multiple means of error checking were used to ensure a smooth transition, including client integration tests, catching potential interoperability conflicts, and identifying memory leaks through fuzz testing.

February 22: Easily Replace or Attach an IAM Role to an Existing EC2 Instance by Using the EC2 Console
AWS Identity and Access Management (IAM) roles enable your applications running on Amazon EC2 to use temporary security credentials. IAM roles for EC2 make it easier for your applications to make API requests securely from an instance because they do not require you to manage AWS security credentials that the applications use. Recently, we enabled you to use temporary security credentials for your applications by attaching an IAM role to an existing EC2 instance by using the AWS CLI and SDK. To learn more, see New! Attach an AWS IAM Role to an Existing Amazon EC2 Instance by Using the AWS CLI. Starting today, you can attach an IAM role to an existing EC2 instance from the EC2 console. You can also use the EC2 console to replace an IAM role attached to an existing instance. In this blog post, I will show how to attach an IAM role to an existing EC2 instance from the EC2 console.

February 22: How to Audit Your AWS Resources for Security Compliance by Using Custom AWS Config Rules
AWS Config Rules enables you to implement security policies as code for your organization and evaluate configuration changes to AWS resources against these policies. You can use Config rules to audit your use of AWS resources for compliance with external compliance frameworks such as CIS AWS Foundations Benchmark and with your internal security policies related to the US Health Insurance Portability and Accountability Act (HIPAA), the Federal Risk and Authorization Management Program (FedRAMP), and other regimes. AWS provides some predefined, managed Config rules. You also can create custom Config rules based on criteria you define within an AWS Lambda function. In this post, I show how to create a custom rule that audits AWS resources for security compliance by enabling VPC Flow Logs for an Amazon Virtual Private Cloud (VPC). The custom rule meets requirement 4.3 of the CIS AWS Foundations Benchmark: “Ensure VPC flow logging is enabled in all VPCs.”

February 13: AWS Announces CISPE Membership and Compliance with First-Ever Code of Conduct for Data Protection in the Cloud
I have two exciting announcements today, both showing AWS’s continued commitment to ensuring that customers can comply with EU Data Protection requirements when using our services.

February 13: How to Enable Multi-Factor Authentication for AWS Services by Using AWS Microsoft AD and On-Premises Credentials
You can now enable multi-factor authentication (MFA) for users of AWS services such as Amazon WorkSpaces and Amazon QuickSight and their on-premises credentials by using your AWS Directory Service for Microsoft Active Directory (Enterprise Edition) directory, also known as AWS Microsoft AD. MFA adds an extra layer of protection to a user name and password (the first “factor”) by requiring users to enter an authentication code (the second factor), which has been provided by your virtual or hardware MFA solution. These factors together provide additional security by preventing access to AWS services, unless users supply a valid MFA code.

February 13: How to Create an Organizational Chart with Separate Hierarchies by Using Amazon Cloud Directory
Amazon Cloud Directory enables you to create directories for a variety of use cases, such as organizational charts, course catalogs, and device registries. Cloud Directory offers you the flexibility to create directories with hierarchies that span multiple dimensions. For example, you can create an organizational chart that you can navigate through separate hierarchies for reporting structure, location, and cost center. In this blog post, I show how to use Cloud Directory APIs to create an organizational chart with two separate hierarchies in a single directory. I also show how to navigate the hierarchies and retrieve data. I use the Java SDK for all the sample code in this post, but you can use other language SDKs or the AWS CLI.

February 10: How to Easily Log On to AWS Services by Using Your On-Premises Active Directory
AWS Directory Service for Microsoft Active Directory (Enterprise Edition), also known as Microsoft AD, now enables your users to log on with just their on-premises Active Directory (AD) user name—no domain name is required. This new domainless logon feature makes it easier to set up connections to your on-premises AD for use with applications such as Amazon WorkSpaces and Amazon QuickSight, and it keeps the user logon experience free from network naming. This new interforest trusts capability is now available when using Microsoft AD with Amazon WorkSpaces and Amazon QuickSight Enterprise Edition. In this blog post, I explain how Microsoft AD domainless logon works with AD interforest trusts, and I show an example of setting up Amazon WorkSpaces to use this capability.

February 9: New! Attach an AWS IAM Role to an Existing Amazon EC2 Instance by Using the AWS CLI
AWS Identity and Access Management (IAM) roles enable your applications running on Amazon EC2 to use temporary security credentials that AWS creates, distributes, and rotates automatically. Using temporary credentials is an IAM best practice because you do not need to maintain long-term keys on your instance. Using IAM roles for EC2 also eliminates the need to use long-term AWS access keys that you have to manage manually or programmatically. Starting today, you can enable your applications to use temporary security credentials provided by AWS by attaching an IAM role to an existing EC2 instance. You can also replace the IAM role attached to an existing EC2 instance. In this blog post, I show how you can attach an IAM role to an existing EC2 instance by using the AWS CLI.

February 8: How to Remediate Amazon Inspector Security Findings Automatically
The Amazon Inspector security assessment service can evaluate the operating environments and applications you have deployed on AWS for common and emerging security vulnerabilities automatically. As an AWS-built service, Amazon Inspector is designed to exchange data and interact with other core AWS services not only to identify potential security findings but also to automate addressing those findings. Previous related blog posts showed how you can deliver Amazon Inspector security findings automatically to third-party ticketing systems and automate the installation of the Amazon Inspector agent on new Amazon EC2 instances. In this post, I show how you can automatically remediate findings generated by Amazon Inspector. To get started, you must first run an assessment and publish any security findings to an Amazon Simple Notification Service (SNS) topic. Then, you create an AWS Lambda function that is triggered by those notifications. Finally, the Lambda function examines the findings and then implements the appropriate remediation based on the type of issue.

February 6: How to Simplify Security Assessment Setup Using Amazon EC2 Systems Manager and Amazon Inspector
In a July 2016 AWS Blog post, I discussed how to integrate Amazon Inspector with third-party ticketing systems by using Amazon Simple Notification Service (SNS) and AWS Lambda. This AWS Security Blog post continues in the same vein, describing how to use Amazon Inspector to automate various aspects of security management. In this post, I show you how to install the Amazon Inspector agent automatically through the Amazon EC2 Systems Manager when a new Amazon EC2 instance is launched. In a subsequent post, I will show you how to update EC2 instances automatically that run Linux when Amazon Inspector discovers a missing security patch.


January

January 30: How to Protect Data at Rest with Amazon EC2 Instance Store Encryption
Encrypting data at rest is vital for regulatory compliance to ensure that sensitive data saved on disks is not readable by any user or application without a valid key. Some compliance regulations such as PCI DSS and HIPAA require that data at rest be encrypted throughout the data lifecycle. To this end, AWS provides data-at-rest options and key management to support the encryption process. For example, you can encrypt Amazon EBS volumes and configure Amazon S3 buckets for server-side encryption (SSE) using AES-256 encryption. Additionally, Amazon RDS supports Transparent Data Encryption (TDE). Instance storage provides temporary block-level storage for Amazon EC2 instances. This storage is located on disks attached physically to a host computer. Instance storage is ideal for temporary storage of information that frequently changes, such as buffers, caches, and scratch data. By default, files stored on these disks are not encrypted. In this blog post, I show a method for encrypting data on Linux EC2 instance stores by using Linux built-in libraries. This method encrypts files transparently, which protects confidential data. As a result, applications that process the data are unaware of the disk-level encryption.

January 27: How to Detect and Automatically Remediate Unintended Permissions in Amazon S3 Object ACLs with CloudWatch Events
Amazon S3 Access Control Lists (ACLs) enable you to specify permissions that grant access to S3 buckets and objects. When S3 receives a request for an object, it verifies whether the requester has the necessary access permissions in the associated ACL. For example, you could set up an ACL for an object so that only the users in your account can access it, or you could make an object public so that it can be accessed by anyone. If the number of objects and users in your AWS account is large, ensuring that you have attached correctly configured ACLs to your objects can be a challenge. For example, what if a user were to call the PutObjectAcl API call on an object that is supposed to be private and make it public? Or, what if a user were to call the PutObject with the optional Acl parameter set to public-read, therefore uploading a confidential file as publicly readable? In this blog post, I show a solution that uses Amazon CloudWatch Events to detect PutObject and PutObjectAcl API calls in near-real time and helps ensure that the objects remain private by making automatic PutObjectAcl calls, when necessary.

January 26: Now Available: Amazon Cloud Directory—A Cloud-Native Directory for Hierarchical Data
Today we are launching Amazon Cloud Directory. This service is purpose-built for storing large amounts of strongly typed hierarchical data. With the ability to scale to hundreds of millions of objects while remaining cost-effective, Cloud Directory is a great fit for all sorts of cloud and mobile applications.

January 24: New SOC 2 Report Available: Confidentiality
As with everything at Amazon, the success of our security and compliance program is primarily measured by one thing: our customers’ success. Our customers drive our portfolio of compliance reports, attestations, and certifications that support their efforts in running a secure and compliant cloud environment. As a result of our engagement with key customers across the globe, we are happy to announce the publication of our new SOC 2 Confidentiality report. This report is available now through AWS Artifact in the AWS Management Console.

January 18: Compliance in the Cloud for New Financial Services Cybersecurity Regulations
Financial regulatory agencies are focused more than ever on ensuring responsible innovation. Consequently, if you want to achieve compliance with financial services regulations, you must be increasingly agile and employ dynamic security capabilities. AWS enables you to achieve this by providing you with the tools you need to scale your security and compliance capabilities on AWS. The following breakdown of the most recent cybersecurity regulations, NY DFS Rule 23 NYCRR 500, demonstrates how AWS continues to focus on your regulatory needs in the financial services sector.

January 9: New Amazon GameDev Blog Post: Protect Multiplayer Game Servers from DDoS Attacks by Using Amazon GameLift
In online gaming, distributed denial of service (DDoS) attacks target a game’s network layer, flooding servers with requests until performance degrades considerably. These attacks can limit a game’s availability to players and limit the player experience for those who can connect. Today’s new Amazon GameDev Blog post uses a typical game server architecture to highlight DDoS attack vulnerabilities and discusses how to stay protected by using built-in AWS Cloud security, AWS security best practices, and the security features of Amazon GameLift. Read the post to learn more.

January 6: The Top 10 Most Downloaded AWS Security and Compliance Documents in 2016
The following list includes the 10 most downloaded AWS security and compliance documents in 2016. Using this list, you can learn about what other people found most interesting about security and compliance last year.

January 6: FedRAMP Compliance Update: AWS GovCloud (US) Region Receives a JAB-Issued FedRAMP High Baseline P-ATO for Three New Services
Three new services in the AWS GovCloud (US) region have received a Provisional Authority to Operate (P-ATO) from the Joint Authorization Board (JAB) under the Federal Risk and Authorization Management Program (FedRAMP). JAB issued the authorization at the High baseline, which enables US government agencies and their service providers the capability to use these services to process the government’s most sensitive unclassified data, including Personal Identifiable Information (PII), Protected Health Information (PHI), Controlled Unclassified Information (CUI), criminal justice information (CJI), and financial data.

January 4: The Top 20 Most Viewed AWS IAM Documentation Pages in 2016
The following 20 pages were the most viewed AWS Identity and Access Management (IAM) documentation pages in 2016. I have included a brief description with each link to give you a clearer idea of what each page covers. Use this list to see what other people have been viewing and perhaps to pique your own interest about a topic you’ve been meaning to research.

January 3: The Most Viewed AWS Security Blog Posts in 2016
The following 10 posts were the most viewed AWS Security Blog posts that we published during 2016. You can use this list as a guide to catch up on your blog reading or even read a post again that you found particularly useful.

January 3: How to Monitor AWS Account Configuration Changes and API Calls to Amazon EC2 Security Groups
You can use AWS security controls to detect and mitigate risks to your AWS resources. The purpose of each security control is defined by its control objective. For example, the control objective of an Amazon VPC security group is to permit only designated traffic to enter or leave a network interface. Let’s say you have an Internet-facing e-commerce website, and your security administrator has determined that only HTTP (TCP port 80) and HTTPS (TCP 443) traffic should be allowed access to the public subnet. As a result, your administrator configures a security group to meet this control objective. What if, though, someone were to inadvertently change this security group’s rules and enable FTP or other protocols to access the public subnet from any location on the Internet? That expanded access could weaken the security posture of your assets. Consequently, your administrator might need to monitor the integrity of your company’s security controls so that the controls maintain their desired effectiveness. In this blog post, I explore two methods for detecting unintended changes to VPC security groups. The two methods address not only control objectives but also control failures.

If you have questions about or issues with implementing the solutions in any of these posts, please start a new thread on the forum identified near the end of each post.

– Craig

How to Audit Your AWS Resources for Security Compliance by Using Custom AWS Config Rules

Post Syndicated from Myles Hosford original https://aws.amazon.com/blogs/security/how-to-audit-your-aws-resources-for-security-compliance-by-using-custom-aws-config-rules/

AWS Config Rules enables you to implement security policies as code for your organization and evaluate configuration changes to AWS resources against these policies. You can use Config rules to audit your use of AWS resources for compliance with external compliance frameworks such as CIS AWS Foundations Benchmark and with your internal security policies related to the US Health Insurance Portability and Accountability Act (HIPAA), the Federal Risk and Authorization Management Program (FedRAMP), and other regimes.

AWS provides a number of predefined, managed Config rules. You also can create custom Config rules based on criteria you define within an AWS Lambda function. In this post, I will show how to create a custom rule that audits AWS resources for security compliance by enabling VPC Flow Logs for an Amazon Virtual Private Cloud (VPC). The custom rule meets requirement 4.3 of the CIS AWS Foundations Benchmark: “Ensure VPC flow logging is enabled in all VPCs.”

Solution overview

In this post, I walk through the process required to create a custom Config rule by following these steps:

  1. Create a Lambda function containing the logic to determine if a resource is compliant or noncompliant.
  2. Create a custom Config rule that uses the Lambda function created in Step 1 as the source.
  3. Create a Lambda function that polls Config to detect noncompliant resources on a daily basis and send notifications via Amazon SNS.

Prerequisite

You must set up Config before you start creating custom rules. Follow the steps on Set Up AWS Config Using the Console or Set Up AWS Config Using the AWS CLI to enable Config and send the configuration changes to Amazon S3 for storage.
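If you prefer to script the setup rather than use the console, the following boto3 sketch mirrors those steps. The bucket name and role ARN are hypothetical placeholders that you would replace with your own values; the role must be one that AWS Config can assume, as described in the setup documentation.

import boto3

config = boto3.client('config')

# Hypothetical placeholders: an S3 bucket for configuration history/snapshots and an
# IAM role that AWS Config can assume.
bucket_name = 'my-config-bucket'
config_role_arn = 'arn:aws:iam::111111111111:role/my-config-role'

# Record configuration changes for all supported resource types.
config.put_configuration_recorder(
    ConfigurationRecorder={
        'name': 'default',
        'roleARN': config_role_arn,
        'recordingGroup': {
            'allSupported': True,
            'includeGlobalResourceTypes': True,
        },
    }
)

# Deliver configuration history and snapshots to Amazon S3.
config.put_delivery_channel(
    DeliveryChannel={
        'name': 'default',
        's3BucketName': bucket_name,
    }
)

# Start recording configuration changes.
config.start_configuration_recorder(ConfigurationRecorderName='default')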

Custom rule – Blueprint

The first step is to create a Lambda function that contains the logic to determine if the Amazon VPC has VPC Flow Logs enabled (in other words, it is compliant or noncompliant with requirement 4.3 of the CIS AWS Foundation Benchmark). First, let’s take a look at the components that make up a custom rule, which I will call the blueprint.

#
# Custom AWS Config Rule - Blueprint Code
#

import boto3, json

def evaluate_compliance(config_item, r_id):
    return 'NONCOMPLIANT'

def lambda_handler(event, context):
    
    # Create AWS SDK clients & initialize custom rule parameters
    config = boto3.client('config')
    invoking_event = json.loads(event['invokingEvent'])
    compliance_value = 'NOT_APPLICABLE'
    resource_id = invoking_event['configurationItem']['resourceId']
                    
    compliance_value = evaluate_compliance(invoking_event['configurationItem'], resource_id)
              
    response = config.put_evaluations(
       Evaluations=[
            {
                'ComplianceResourceType': invoking_event['configurationItem']['resourceType'],
                'ComplianceResourceId': resource_id,
                'ComplianceType': compliance_value,
                'Annotation': 'Insert text here to detail why control passed/failed',
                'OrderingTimestamp': invoking_event['notificationCreationTime']
            },
       ],
       ResultToken=event['resultToken'])

The key components in the preceding blueprint are:

  1. The lambda_handler function is the entry point that is executed when AWS Config invokes the Lambda function. I create the necessary SDK clients and set up some initial variables for the rule to use. (A sample of the event shape that Config passes to the function is shown after this list.)
  2. The evaluate_compliance function contains my custom rule logic. This is the function that I will tailor later in the post to create the custom rule to detect whether the Amazon VPC has VPC Flow Logs enabled. The result (compliant or noncompliant) is assigned to the compliance_value.
  3. The Config API’s put_evaluations function is called to deliver an evaluation result to Config. You can then view the result of the evaluation in the Config console (more about that later in this post). The annotation parameter is used to provide supplementary information about how the custom evaluation determined the compliance.
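For reference, the event that Config delivers to the function looks roughly like the following; note that invokingEvent arrives as a JSON-encoded string, which is why the blueprint calls json.loads on it. Only the fields the blueprint reads are shown, and the resource ID is a hypothetical example.

import json

# Approximate shape of the event AWS Config passes to the Lambda function
# (fields not used by the blueprint are omitted).
sample_event = {
    'invokingEvent': json.dumps({
        'configurationItem': {
            'resourceType': 'AWS::EC2::VPC',
            'resourceId': 'vpc-11111111',  # hypothetical resource ID
        },
        'notificationCreationTime': '2017-04-28T17:56:40.000Z',
    }),
    'resultToken': '<token supplied by AWS Config>',
}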

Custom rule – Flow logs enabled

The example we use for the custom rule is requirement 4.3 from the CIS AWS Foundations Benchmark: “Ensure VPC flow logging is enabled in all VPCs.” I update the blueprint rule that I just showed to do the following:

  1. Create an AWS Identity and Access Management (IAM) role that allows the Lambda function to perform the custom rule logic and publish the result to Config. The Lambda function will assume this role.
  2. Specify the resource type of the configuration item as EC2 VPC. This ensures that the rule is triggered when there is a change to any Amazon VPC resources.
  3. Add custom rule logic to the Lambda function to determine whether VPC Flow Logs are enabled for a given VPC.

Create an IAM role for Lambda

To create the IAM role, I go to the IAM console, choose Roles in the navigation pane, click Create New Role, and follow the wizard. In Step 2, I select the service role AWS Lambda, as shown in the following screenshot.

In Step 4 of the wizard, I attach the following managed policies:

  • AmazonEC2ReadOnlyAccess
  • AWSLambdaExecute
  • AWSConfigRulesExecutionRole

Finally, I name the new IAM role vpcflowlogs-role. This allows the Lambda function to call APIs such as EC2 describe flow logs to obtain the result for my compliance check. I assign this role to the Lambda function in the next step.
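If you would rather script this step than use the console wizard, a rough boto3 equivalent is shown below. The trust policy simply allows Lambda to assume the role, and the three policy ARNs are the AWS managed policies named above (double-check the exact ARN paths in the IAM console before relying on them).

import json

import boto3

iam = boto3.client('iam')

# Trust policy that lets AWS Lambda assume the role.
assume_role_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'lambda.amazonaws.com'},
        'Action': 'sts:AssumeRole',
    }],
}

iam.create_role(
    RoleName='vpcflowlogs-role',
    AssumeRolePolicyDocument=json.dumps(assume_role_policy),
)

# Attach the three managed policies used in this walkthrough.
for policy_arn in [
    'arn:aws:iam::aws:policy/AmazonEC2ReadOnlyAccess',
    'arn:aws:iam::aws:policy/AWSLambdaExecute',
    'arn:aws:iam::aws:policy/service-role/AWSConfigRulesExecutionRole',
]:
    iam.attach_role_policy(RoleName='vpcflowlogs-role', PolicyArn=policy_arn)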

Create the Lambda function for the custom rule

To create the Lambda function that contains logic for my custom rule, I go to the Lambda console, click Create a Lambda Function, and then choose Blank Function.

When I configure the function, I name it vpcflowlogs-function and provide a brief description of the rule: “A custom rule to detect whether VPC Flow Logs is enabled.”

For the Lambda function code, I use the blueprint code shown earlier in this post and add the additional logic to determine whether VPC Flow Logs is enabled (specifically within the evaluate_compliance and is_flow_logs_enabled functions).

#
# Custom AWS Config Rule - VPC Flow Logs
#

import boto3, json

def evaluate_compliance(config_item, r_id):
    if (config_item['resourceType'] != 'AWS::EC2::VPC'):
        return 'NOT_APPLICABLE'

    elif is_flow_logs_enabled(r_id):
        return 'COMPLIANT'
    else:
        return 'NON_COMPLIANT'

def is_flow_logs_enabled(vpc_id):
    # Query EC2 for flow logs attached to this VPC; a non-empty result means flow logging is enabled.
    ec2 = boto3.client('ec2')
    response = ec2.describe_flow_logs(
        Filter=[
            {
                'Name': 'resource-id',
                'Values': [
                    vpc_id,
                ]
            },
        ],
    )
    if len(response[u'FlowLogs']) != 0: return True

def lambda_handler(event, context):
    
    # Create AWS SDK clients & initialize custom rule parameters
    config = boto3.client('config')
    invoking_event = json.loads(event['invokingEvent'])
    compliance_value = 'NOT_APPLICABLE'
    resource_id = invoking_event['configurationItem']['resourceId']
                    
    compliance_value = evaluate_compliance(invoking_event['configurationItem'], resource_id)
            
    response = config.put_evaluations(
       Evaluations=[
            {
                'ComplianceResourceType': invoking_event['configurationItem']['resourceType'],
                'ComplianceResourceId': resource_id,
                'ComplianceType': compliance_value,
                'Annotation': 'CIS 4.3 VPC Flow Logs',
                'OrderingTimestamp': invoking_event['notificationCreationTime']
            },
       ],
       ResultToken=event['resultToken'])

Below the Lambda function code, I configure the handler and role. As shown in the following screenshot, I select the IAM role I just created (vpcflowlogs-role) and create my Lambda function.

When the Lambda function is created, I make a note of the Lambda Amazon Resource Name (ARN), which is the unique identifier used in the next step to specify this function as my Config rule source. (Be sure to replace the placeholder value with your own.)

Example ARN: arn:aws:lambda:ap-southeast-1:<your-account-id>:function:vpcflowlogs-function

Create a custom Config rule

The last step is to create a custom Config rule and use the Lambda function as the source. To do this, I go to the Config console, choose Add Rule, and choose Add Custom Rule. I give the rule a name, vpcflowlogs-configrule, and description, and I paste the Lambda ARN from the previous section.

Because this rule is specific to VPC resources, I set the Trigger type to Configuration changes and Resources to EC2: VPC, as shown in the following screenshot.

I click Save to create the rule, and it is now live. Any VPC resources that are created or modified will now be checked against my VPC Flow Logs rule for compliance with the CIS Benchmark requirement.
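
If you prefer to script this step too, the rule can be created with the AWS SDK instead of the console. The following is a minimal boto3 sketch; put_config_rule and add_permission are standard AWS APIs, but the Lambda ARN is the placeholder from the previous section, so substitute your own values.

import boto3

config = boto3.client('config')
lambda_client = boto3.client('lambda')

# Placeholder ARN - replace with the ARN of your vpcflowlogs-function
LAMBDA_ARN = 'arn:aws:lambda:ap-southeast-1:<your-account-id>:function:vpcflowlogs-function'

# Allow AWS Config to invoke the Lambda function
lambda_client.add_permission(
    FunctionName='vpcflowlogs-function',
    StatementId='AllowConfigInvocation',
    Action='lambda:InvokeFunction',
    Principal='config.amazonaws.com')

# Create the custom rule, triggered by configuration changes to VPC resources
config.put_config_rule(
    ConfigRule={
        'ConfigRuleName': 'vpcflowlogs-configrule',
        'Description': 'A custom rule to detect whether VPC Flow Logs is enabled.',
        'Scope': {'ComplianceResourceTypes': ['AWS::EC2::VPC']},
        'Source': {
            'Owner': 'CUSTOM_LAMBDA',
            'SourceIdentifier': LAMBDA_ARN,
            'SourceDetails': [{
                'EventSource': 'aws.config',
                'MessageType': 'ConfigurationItemChangeNotification'
            }]
        }
    })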

From the Config console, I can now see if any resources do not comply with the control requirement, as shown in the following screenshot.

When I choose the rule, I see additional detail about the noncompliant resources (see the following screenshot). This allows me to view the Config timeline to determine when the resources became noncompliant, identify the resources’ owners (if the resources follow tagging best practices), and initiate a remediation effort.

Screenshot of the results of resources evaluated

Daily compliance assessment

Having created the custom rule, I now create a Lambda function to poll Config periodically to detect noncompliant resources. My Lambda function will run daily to assess for noncompliance with my custom rule. When noncompliant resources are detected, I send a notification by publishing a message to SNS.

Before creating the Lambda function, I create an SNS topic and subscribe to the topic for the email addresses that I want to receive noncompliance notifications. My SNS topic is called config-rules-compliance.
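
Creating the topic and the email subscription can also be scripted. Here is a minimal boto3 sketch that assumes the topic name used in this post; the email address is a placeholder you would replace with your own.

import boto3

sns = boto3.client('sns')

# Create the topic (idempotent - returns the existing topic if it already exists)
topic = sns.create_topic(Name='config-rules-compliance')

# Subscribe an email address; the recipient must confirm the subscription
sns.subscribe(
    TopicArn=topic['TopicArn'],
    Protocol='email',
    Endpoint='security-team@example.com')   # placeholder address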

Note: The Lambda function will require permission to query Config and publish a message to SNS. For the purpose of this blog post, I created the following policy that allows publishing of messages to my SNS topic (config-rules-compliance), and I attached it to the vpcflowlogs-role role that my custom Config rule uses.

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "Stmt1485832788000",
            "Effect": "Allow",
            "Action": [
                "sns:Publish"
            ],
            "Resource": [
                "arn:aws:sns:ap-southeast-1:111111111111:config-rules-compliance"
            ]
        }
    ]
}

To create the Lambda function that performs the periodic compliance assessment, I go to the Lambda console, choose Create a Lambda Function and then choose Blank Function.

When configuring the Lambda trigger, I select CloudWatch Events – Schedule, which allows the function to be executed periodically on a schedule I define. I then select rate(1 day) to get daily compliance assessments. For more information about scheduling events with Amazon CloudWatch, see Schedule Expressions for Rules.

scheduled-lambda-withborder
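
The console creates the schedule for you when you configure the trigger, but the same setup can be expressed with the CloudWatch Events API. The following is a minimal boto3 sketch; the rule name and the function name/ARN are hypothetical placeholders for the daily-assessment function created below.

import boto3

events = boto3.client('events')
lambda_client = boto3.client('lambda')

# Hypothetical name and ARN for the daily assessment function created below
FUNCTION_NAME = 'vpcflowlogs-compliance-function'
FUNCTION_ARN = 'arn:aws:lambda:ap-southeast-1:<your-account-id>:function:' + FUNCTION_NAME

# Rule that fires once a day
rule = events.put_rule(
    Name='daily-config-compliance-check',
    ScheduleExpression='rate(1 day)')

# Let CloudWatch Events invoke the function
lambda_client.add_permission(
    FunctionName=FUNCTION_NAME,
    StatementId='AllowCloudWatchEventsInvocation',
    Action='lambda:InvokeFunction',
    Principal='events.amazonaws.com',
    SourceArn=rule['RuleArn'])

# Point the rule at the function
events.put_targets(
    Rule='daily-config-compliance-check',
    Targets=[{'Id': 'daily-assessment', 'Arn': FUNCTION_ARN}])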

My Lambda function (see the following code) uses the vpcflowlogs-role IAM role that allows publishing of messages to my SNS topic.

'''
Lambda function to poll Config for noncompliant resources
'''

from __future__ import print_function

import boto3

# AWS Config settings
CONFIG_CLIENT = boto3.client('config')
MY_RULE = "vpcflowlogs-configrule"

# AWS SNS settings
SNS_CLIENT = boto3.client('sns')
SNS_TOPIC = 'arn:aws:sns:ap-southeast-1:111111111111:config-rules-compliance'
SNS_SUBJECT = 'Compliance Update'

def lambda_handler(event, context):
    # Get compliance details for the rule, limited to noncompliant resources
    non_compliant_detail = CONFIG_CLIENT.get_compliance_details_by_config_rule(
        ConfigRuleName=MY_RULE, ComplianceTypes=['NON_COMPLIANT'])

    if len(non_compliant_detail['EvaluationResults']) > 0:
        print('The following resource(s) are not compliant with AWS Config rule: ' + MY_RULE)
        non_compliant_resources = ''
        for result in non_compliant_detail['EvaluationResults']:
            resource_id = result['EvaluationResultIdentifier']['EvaluationResultQualifier']['ResourceId']
            print(resource_id)
            non_compliant_resources += resource_id + '\n'

        sns_message = 'AWS Config Compliance Update\n\n Rule: ' \
                      + MY_RULE + '\n\n' \
                      + 'The following resource(s) are not compliant:\n' \
                      + non_compliant_resources

        SNS_CLIENT.publish(TopicArn=SNS_TOPIC, Message=sns_message, Subject=SNS_SUBJECT)

    else:
        print('No noncompliant resources detected.')

My Lambda function performs two key activities. First, it queries the Config API to determine which resources are noncompliant with my custom rule. This is done by executing the get_compliance_details_by_config_rule API call.

non_compliant_detail = CONFIG_CLIENT.get_compliance_details_by_config_rule(ConfigRuleName=MY_RULE, ComplianceTypes=['NON_COMPLIANT'])

Second, my Lambda function publishes a message to my SNS topic to notify me that resources are noncompliant, if they failed my custom rule evaluation. This is done using the SNS publish API call.

SNS_CLIENT.publish(TopicArn=SNS_TOPIC, Message=sns_message, Subject=SNS_SUBJECT)

This function provides an example of how to integrate Config and the results of the Config rules compliance evaluation into your operations and processes. You can extend this solution by integrating the results directly with your internal governance, risk, and compliance tools and IT service management frameworks.

Summary

In this post, I showed how to create a custom AWS Config rule to detect noncompliance with security and compliance policies. I also showed how you can create a Lambda function that checks daily for noncompliance by polling Config via API calls. Using custom rules allows you to codify your internal or external security and compliance requirements and gain a more effective view of your organization’s risks at a given time.

For more information about Config rules and examples of rules created for the CIS Benchmark, go to the aws-security-benchmark GitHub repository. If you have questions about the solution in this post, start a new thread on the AWS Config forum.

– Myles

Note: The content and opinions in this blog post are those of the author. This blog post is intended for informational purposes and not for the purpose of providing legal advice.

2017-02-22 FizzBuzz 2

Post Syndicated from Vasil Kolev original https://vasil.ludost.net/blog/?p=3343

Since the idea has been rattling around in my head for a month or two, and tonight the final optimization came to me, here is the follow-up to the fizzbuzz post:

int i=0,p;
static void *pos[4]= {&&digit, &&fizz, &&buzz, &&fizzbuzz};
static void *loop[2] = { &&loopst, &&loopend};
int s3[3]={1,0,0},s5[5]={2,0,0,0,0};
char buff[2048];
char dgts[16]={'0','1','2','3','4','5','6','7','8','9','a','b','c','d','e','f'};
int buffpos=0;

loopst:
	i++;
	p= s3[i%3] | s5[i%5]; 
	goto *pos[p];

fizz:
	memcpy(&buff[buffpos],"Fizz", 4);
	buffpos+=4;
	goto end;
buzz:
	memcpy(&buff[buffpos],"Buzz", 4);
	buffpos+=4;
	goto end;
fizzbuzz:
	memcpy(&buff[buffpos],"FizzBuzz", 8);
	buffpos+=8;
	goto end;
digit:
	buff[buffpos++]=dgts[i/16];
	buff[buffpos++]=dgts[i%16];
end:
	buff[buffpos++]='\n';
	goto *loop[i/100];
loopend:
write(1, buff, buffpos);

For a while I wondered how the whole thing could be done without any branch at all, i.e. without even the check for the end of the loop. My initial idea was to do it in assembler and use a NOP sled like in exploits, something along these lines (forgive my shoddy assembler):

	JMP loopst
	JMP loopend
loopst:
	NOP
	NOP
...
	NOP
	; fizzbuzz implementation
	; i is in RAX
...
	MOV RBX, 0
	SUB RBX, RAX
	SUB RBX, $LENGTH
	SUB EIP, RBX
loopend:

Or, in short, the more i grows, the further back I jump with the relative JMP (which I have written as subtracting something from EIP, which most likely is not even valid), until I hit the JMP that throws me out of the loop. As an optimization, I had figured I could shift the value by 4, so that the sled would be only 25 entries long.

At some point it occurred to me that I could do without the sled altogether by doing a division (which is a disgusting operation, but saves a bucketful of NOPs). That is how the C variant above came about, which is not really C but just some strange assembler-like filth.

Otherwise, it is important to note that on any modern processor the code above is far less efficient than the simple solution with ifs, mostly because branch prediction and all the other goodies handle ordinary ifs very well, while it is much harder for them to guess where these value-based table jumps will land, which is what speculative execution needs. I have not bothered to benchmark it (although I would like to), but overall the code above only stands a chance of doing better on something like the 8086 and friends.

And as an idea for the next piece of misery like this: it could perhaps be genuinely optimized by using one of the vector/wide-value extensions and unrolling the loop, for example processing it in steps of 4 with some instruction that computes divisors (who knows what strange things have already been crammed into the x86 instruction set).

Go 1.8 released

Post Syndicated from ris original https://lwn.net/Articles/714783/rss

The Go team has announced the release of Go 1.8. “The compiler back end introduced in Go 1.7 for 64-bit x86 is now used on all architectures, and those architectures should see significant performance improvements. For instance, the CPU time required by our benchmark programs was reduced by 20-30% on 32-bit ARM systems. There are also some modest performance improvements in this release for 64-bit x86 systems. The compiler and linker have been made faster. Compile times should be improved by about 15% over Go 1.7. There is still more work to be done in this area: expect faster compilation speeds in future releases.” See the release notes for more details.

Open Sourcing TensorFlowOnSpark: Distributed Deep Learning on Big-Data Clusters

Post Syndicated from mikesefanov original https://yahooeng.tumblr.com/post/157196488076

yahoohadoop:

By Lee Yang, Jun Shi, Bobbie Chern, and Andy Feng (@afeng76), Yahoo Big ML team

Introduction

Today, we are pleased to offer TensorFlowOnSpark to the community, our latest open source framework for distributed deep learning on big-data clusters.

Deep learning (DL) has evolved significantly in recent years. At Yahoo, we’ve found that in order to gain insight from massive amounts of data, we need to deploy distributed deep learning. Existing DL frameworks often require us to set up separate clusters for deep learning, forcing us to create multiple programs for a machine learning pipeline (see Figure 1 below). Having separate clusters requires us to transfer large datasets between them, introducing unwanted system complexity and end-to-end learning latency.

image

Last year we addressed scaleout issues by developing and publishing CaffeOnSpark, our open source framework that allows distributed deep learning and big-data processing on identical Spark and Hadoop clusters. We use CaffeOnSpark at Yahoo to improve our NSFW image detection, to automatically identify eSports game highlights from live-streamed videos, and more. With the community’s valuable feedback and contributions, CaffeOnSpark has been upgraded with LSTM support, a new data layer, training and test interleaving, a Python API, and deployment on docker containers. This has been great for our Caffe users, but what about those who use the deep learning framework TensorFlow? We’re taking a page from our own playbook and doing for TensorFlow what we did for Caffe.

After TensorFlow’s initial publication, Google released an enhanced TensorFlow with distributed deep learning capabilities in April 2016. In October 2016, TensorFlow introduced HDFS support. Outside of the Google cloud, however, users still needed a dedicated cluster for TensorFlow applications. TensorFlow programs could not be deployed on existing big-data clusters, thus increasing the cost and latency for those who wanted to take advantage of this technology at scale.

To address this limitation, several community projects wired TensorFlow onto Spark clusters. SparkNet added the ability to launch TensorFlow networks in Spark executors. DataBricks proposed TensorFrame to manipulate Apache Spark’s DataFrames with TensorFlow programs. While these approaches are a step in the right direction, after examining their code, we learned we would be unable to get the TensorFlow processes to communicate with each other directly, we would not be able to implement asynchronous distributed learning, and we would have to expend significant effort to migrate existing TensorFlow programs.

TensorFlowOnSpark

image

Our new framework, TensorFlowOnSpark (TFoS), enables distributed
TensorFlow execution on Spark and Hadoop clusters. As illustrated in
Figure 2 above, TensorFlowOnSpark is designed to work along with
SparkSQL, MLlib, and other Spark libraries in a single pipeline or
program (e.g. Python notebook).

TensorFlowOnSpark supports all types of TensorFlow programs, enabling both asynchronous and synchronous training and inferencing. It supports model parallelism and data parallelism, as well as TensorFlow tools such as TensorBoard on Spark clusters.

Any TensorFlow program can be easily modified to work with TensorFlowOnSpark. Typically, changes to fewer than 10 lines of Python code are needed. Many developers at Yahoo who use TensorFlow have easily migrated their TensorFlow programs for execution with TensorFlowOnSpark.

TensorFlowOnSpark supports direct tensor communication among TensorFlow processes (workers and parameter servers). Process-to-process direct communication enables TensorFlowOnSpark programs to scale easily by adding machines. As illustrated in Figure 3, TensorFlowOnSpark doesn’t involve Spark drivers in tensor communication, and thus achieves similar scalability as stand-alone TensorFlow clusters.

image

TensorFlowOnSpark provides two different modes to ingest data for training and inference:

  1. TensorFlow QueueRunners: TensorFlowOnSpark leverages TensorFlow’s file readers and QueueRunners to read data directly from HDFS files. Spark is not involved in accessing data.
  2. Spark Feeding: Spark RDD data is fed to each Spark executor, which subsequently feeds the data into the TensorFlow graph via feed_dict (see the sketch after this list).
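
The feed_dict mechanism in the second mode is plain TensorFlow. Conceptually, each executor does something like the following with its partition of RDD records; this is a simplified, self-contained sketch, not actual TFoS code, and the data values are made up.

import tensorflow as tf

# Stand-in for one partition of RDD records delivered to a Spark executor
partition = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

# A trivial graph; a real job would feed the records into its training ops
x = tf.placeholder(tf.float32, shape=[None, 2], name='x')
doubled = x * 2.0

with tf.Session() as sess:
    # Each batch of Spark-fed records is pushed into the graph via feed_dict
    print(sess.run(doubled, feed_dict={x: partition}))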

Figure 4 illustrates how the synchronous distributed training of the Inception image classification network scales in TFoS using QueueRunners with a simple setting: 1 GPU, 1 reader, and batch size 32 for each worker. Four TFoS jobs were launched to train 100,000 steps. When these jobs completed after 2+ days, the top-5 accuracies of these jobs were 0.730, 0.814, 0.854, and 0.879. Reaching a top-5 accuracy of 0.730 takes 46 hours for a 1-worker job, 22.5 hours for a 2-worker job, 13 hours for a 4-worker job, and 7.5 hours for an 8-worker job. TFoS thus achieves near linear scalability for Inception model training. This is very encouraging, although TFoS scalability will vary for different models and hyperparameters.

image

RDMA for Distributed TensorFlow

In Yahoo’s Hadoop clusters, GPU nodes are connected by both Ethernet and Infiniband. Infiniband provides faster connectivity and supports direct access to other servers’ memories over RDMA. Current TensorFlow releases, however, only support distributed learning using gRPC over Ethernet. To speed up distributed learning, we have enhanced the TensorFlow C++ layer to enable RDMA over Infiniband.

In conjunction with our TFoS release, we are introducing a new protocol for TensorFlow servers in addition to the default “grpc” protocol. Any distributed TensorFlow program can leverage our enhancement by specifying protocol="grpc_rdma" in tf.train.ServerDef() or tf.train.Server().
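
For example, a server in a distributed TensorFlow program could opt in to the new transport roughly as follows. This is a minimal sketch: the host names and task index are placeholders, and the "grpc_rdma" protocol is only available in the enhanced build described here.

import tensorflow as tf

# Placeholder cluster definition - replace the hosts with your own
cluster = tf.train.ClusterSpec({
    'ps': ['ps0.example.com:2222'],
    'worker': ['worker0.example.com:2222', 'worker1.example.com:2222']
})

# Request the RDMA-backed transport instead of the default "grpc"
server = tf.train.Server(cluster, job_name='worker', task_index=0,
                         protocol='grpc_rdma')
server.join()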

With this new protocol, a RDMA rendezvous manager is created to ensure tensors are written directly into the memory of remote servers. We minimize the tensor buffer creation: Tensor buffers are allocated once at the beginning, and then reused across all training steps of a TensorFlow job. From our early experimentation with large models like the VGG-19 network, our RDMA implementation has demonstrated a significant speedup on training time compared with the existing gRPC implementation.

Since RDMA support is a highly requested capability (see TensorFlow issue #2916), we decided to make our current implementation available as an alpha release to the TensorFlow community. In the coming weeks, we will polish our RDMA implementation further, and share detailed benchmark results.

Simple CLI and API

TFoS programs are launched by the standard Apache Spark command, spark-submit. As illustrated below, users can specify the number of Spark executors, the number of GPUs per executor, and the number of parameter servers in the CLI. A user can also state whether they want to use TensorBoard (--tensorboard) and/or RDMA (--rdma).

      spark-submit --master ${MASTER} \
      ${TFoS_HOME}/examples/slim/train_image_classifier.py \
      --model_name inception_v3 \
      --train_dir hdfs://default/slim_train \
      --dataset_dir hdfs://default/data/imagenet \
      --dataset_name imagenet \
      --dataset_split_name train \
      --cluster_size ${NUM_EXEC} \
      --num_gpus ${NUM_GPU} \
      --num_ps_tasks ${NUM_PS} \
      --sync_replicas \
      --replicas_to_aggregate ${NUM_WORKERS} \
      --tensorboard \
      --rdma

TFoS provides a high-level Python API (illustrated in our sample Python notebook):

  • TFCluster.reserve() … construct a TensorFlow cluster from Spark executors
  • TFCluster.start() … launch the TensorFlow program on the executors
  • TFCluster.train() or TFCluster.inference() … feed RDD data to the TensorFlow processes
  • TFCluster.shutdown() … shut down TensorFlow execution on the executors

Open Source

Yahoo is happy to release TensorFlowOnSpark at github.com/yahoo/TensorFlowOnSpark and a RDMA enhancement of TensorFlow at github.com/yahoo/tensorflow/tree/yahoo. Multiple example programs (including mnist, cifar10, inception, and VGG) are provided to illustrate the simple conversion process of TensorFlow programs to TensorFlowOnSpark, and leverage RDMA. An Amazon Machine Image is also available for applying TensorFlowOnSpark on AWS EC2.

Going forward, we will advance TensorFlowOnSpark as we continue to do with CaffeOnSpark. We welcome the community’s continued feedback and contributions to CaffeOnSpark, and are interested in thoughts on ways TensorFlowOnSpark can be enhanced.

Backblaze Hard Drive Stats for 2016

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/hard-drive-benchmark-stats-2016/

Backblaze drive stats
Backblaze has recorded and saved daily hard drive statistics from the drives in our data centers since April 2013. At the end of 2016 we had 73,653 spinning hard drives. Of that number, there were 1,553 boot drives and 72,100 data drives. This post looks at the hard drive statistics of the data drives we monitor. We’ll first look at the stats for Q4 2016, then present the data for all of 2016, and finish with the lifetime statistics for all of the drives Backblaze has used in our cloud storage data centers since we started keeping track. Along the way we’ll share observations and insights on the data presented. As always you can download our Hard Drive Test Data to examine and use.

Hard Drive Reliability Statistics for Q4 2016

At the end of Q4 2016 Backblaze was monitoring 72,100 data drives. For our evaluation we remove from consideration those drives which were used for testing purposes and those drive models for which we did not have at least 45 drives. This leaves us with 71,939 production hard drives. The table below is for the period of Q4 2016.

Hard Drive Annualized Failure Rates for Q4 2016
Notes:

  1. The failure rate listed is for just Q4 2016. If a drive model has a failure rate of 0%, it means there were no drive failures of that model during that quarter.
  2. 90 drives (2 storage pods) were used for testing purposes during the period. They contained Seagate 1.5TB and 1.0 TB WDC drives. These are not included in the results above.
  3. The most common reason we have less than 45 drives of one model is that we needed to replace a failed drive, but that drive model is no longer available. We use 45 drives as the minimum number to report quarterly and yearly statistics.

8 TB Hard Drive Performance

In Q4 2016 we introduced a third 8 TB drive model, the Seagate ST8000NM0055. This is an enterprise class drive. One 60-drive Storage Pod was deployed mid-Q4 and the initial results look promising as there have been no failures to date. Given our past disdain for overpaying for enterprise drives, it will be interesting to see how these drives perform.

We added 3,540 Seagate 8 TB drives, model ST8000DM002, giving us 8,660 of these drives. That’s 69 petabytes of raw storage, before formatting and encoding, or about 22% of our current data storage capacity. The failure rate for the quarter of these 8 TB drives was a very respectable 1.65%. That’s lower than the Q4 failure rate of 1.94% for all of the hard drives in the table above.

During the next couple of calendar quarters we’ll monitor how the new enterprise 8 TB drives compare to the consumer 8 TB drives. We’re interested to know which models deliver the best value and we bet you are too. We’ll let you know what we find.

2016 Hard Drive Performance Statistics

Looking back over 2016, we added 15,646 hard drives, and migrated 110 Storage Pods (4,950 drives) from 1-, 1.5-, and 2 TB drives to 4-, 6- and 8 TB drives. Below are the hard drive failure stats for 2016. As with the quarterly results, we have removed any non-production drives and any models that had less than 45 drives.

2016 Hard Drive Annualized Failure Rates
No Time For Failure

In 2016, three drive models ended the year with zero failures, albeit with a small number of drives. Both the 4 TB Toshiba and the 8 TB HGST models went the entire year without a drive failure. The 8 TB Seagate (ST8000NM0055) drives, which were deployed in November 2016, also recorded no failures.

The total number of failed drives was 1,225 for the year. That’s 3.36 drive failures per day or about 5 drives per workday, a very manageable workload. Of course, that’s easy for me to say, since I am not the one swapping out drives.

The overall hard drive failure rate for 2016 was 1.95%. That’s down from 2.47% in 2015 and well below the 6.39% failure rate for 2014.

Big Drives Rule

We increased storage density by moving to higher-capacity drives. That helped us end 2016 with 3 TB drives being the smallest density drives in our data centers. During 2017, we will begin migrating from the 3.0 TB drives to larger-sized drives. Here’s the distribution of our hard drives in our data centers by size for 2016.
2016 Distribution of Hard Drives by Size
Digging in a little further, below are the failure rates by drive size and vendor for 2016.

Hard Drive Failure Rates by Drive SizeHard Drive Failure Rates by Manufacturer

Computing the Failure Rate

Failure Rate, in the context we use it, is more accurately described as the Annualized Failure Rate. It is computed based on Drive Days and Drive Failures, not on the Drive Count. This may seem odd given we are looking at a one year period, 2016 in this case, so let’s take a look.

We start by dividing the Drive Failures by the Drive Count. For example, if we use the statistics for 4 TB drives, we get a “failure rate” of 1.92%, but the annualized failure rate shown on the chart for 4 TB drives is 2.06%. The trouble with just dividing Drive Failures by Drive Count is that the Drive Count constantly changes over the course of the year. By using Drive Count from a given day, you assume that each drive contributed the same amount of time over the year, but that’s not the case. Drives enter and leave the system all the time. By counting the number of days each drive is active as Drive Days, we can account for all the ins and outs over a given period of time.
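
As a concrete sketch of that calculation (the drive-day and failure counts below are made-up illustrative numbers, not Backblaze figures):

# Annualized Failure Rate from Drive Days and Drive Failures
drive_days = 10000000   # illustrative: total days all drives of a model were in service
failures = 560          # illustrative: failures of that model over the same period

annualized_failure_rate = failures / (drive_days / 365.0) * 100
print('AFR: %.2f%%' % annualized_failure_rate)   # 2.04% for these made-up numbers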

Hard Drive Benchmark Statistics

As we noted earlier, we’ve been collecting and storing drive stats data since April 2013. In that time we have used 55 different hard drive models in our data center for data storage. We’ve omitted models from the table below that we didn’t have enough of to populate an entire storage pod (45 or fewer). That excludes 25 of those 55 models.

Annualized Hard Drive Failure Rates

Fun with Numbers

Since April 2013, there have been 5,380 hard drives failures. That works out to about 5 per day or about 7 per workday (200 workdays per year). As a point of reference, Backblaze only had 4,500 total hard drives in June 2010 when we racked our 100th Storage Pod to support our cloud backup service.

The 58,375,646 Drive Days translates to a little over 1.4 Billion Drive Hours. Going the other way we are measuring a mere 159,933 years of spinning hard drives.

You’ll also notice that we have used a total of 85,467 hard drives. But at the end of 2016 we had 71,939 hard drives. Are we missing 13,528 hard drives? Not really. While some drives failed, the remaining drives were removed from service due primarily to migrations from smaller to larger drives. The stats from the “migrated” drives, like Drive Hours, still count in establishing a failure rate, but they did not fail, they just stopped reporting data.

Failure Rates Over Time

The chart below shows the annualized failure rates of hard drives by drive size over time. The data points are the rates as of the end of each year shown. The “stars” mark the average annualized failure rate for all of the hard drives for each year.

Annualized Hard Drive Failures by Drive Size

Notes:

  1. The “8.0TB” failure rate of 4.9% for 2015 is comprised of 45 drives of which there were 2 failures during that year. In 2016 the number of 8 TB drives rose to 8,765 with 48 failures and an annualized failure rate of 1.6%.
  2. The “1.0TB” drives were 5+ years old on average when they were retired.
  3. There are only 45 of the “5.0TB” drives in operation.

Can’t Get Enough Hard Drive Stats?

We’ll be presenting the webinar “Backblaze Hard Drive Stats for 2016” on Thursday February 2, 2017 at 10:00 Pacific time. The webinar will be recorded so you can watch it over and over again. The webinar will dig deeper into the quarterly, yearly, and lifetime hard drive stats and include the annual and lifetime stats by drive size and manufacturer. You will need to subscribe to the Backblaze BrightTALK channel to view the webinar. Sign up for the webinar today.

As a reminder, the complete data set used to create the information in this review is available on our Hard Drive Test Data page. You can download and use this data for free for your own purposes; all we ask is three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you do not sell this data to anyone, it is free. If you just want the summarized data used to create the tables and charts in this blog post, you can download the ZIP file containing the MS Excel spreadsheet.

Good luck and let us know if you find anything interesting.

The post Backblaze Hard Drive Stats for 2016 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Is ‘aqenbpuu’ a bad password?

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/01/is-aqenbpuu-bad-password.html

Press secretary Sean Spicer has twice tweeted a random string, leading people to suspect he’s accidentally tweeted his Twitter password. One of these was ‘aqenbpuu’, which some have described as a “shitty password”. Is it actually bad?

No. It’s adequate. Not the best, perhaps, but not “shitty”.

It depends upon your threat model. The common threats are password reuse and phishing, where the strength doesn’t matter. When the strength does matter is when Twitter gets hacked and the password hashes stolen.

Twitter uses the bcrypt password hashing technique, which is designed to be slow. A typical desktop with a GPU can only crack bcrypt passwords at a rate of around 321 hashes-per-second. Doing the math (26 to the power of 8, divided by 321 hashes per second, divided by 86,400 seconds in a day), it will take around 20 years for this desktop to crack the password.
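
That arithmetic is easy to check with a quick sketch (321 hashes per second is the rate measured later in this post):

# Time to exhaust all 8-character lowercase passwords at 321 bcrypt hashes/second
keyspace = 26 ** 8              # 208,827,064,576 candidates
seconds = keyspace / 321.0      # roughly 650 million seconds
years = seconds / 86400 / 365   # roughly 20.6 years
print(round(years, 1))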

That’s not a good password. A botnet with thousands of desktops, or a somebody willing to invest thousands of dollars on a supercomputer or cluster like Amazon’s, can crack that password in a few days.

But, it’s not a bad password, either. A hack of a Twitter account like this would be a minor event. It’s not worth somebody spending that much resources hacking. Security is a tradeoff — you protect a ton of gold with Ft. Knox like protections, but you wouldn’t invest the same amount protecting a ton of wood. The same is true with passwords — as long as you don’t reuse your passwords, or fall victim to phishing, eight lower case characters is adequate.

This is especially true if using two-factor authentication, in which case, such a password is more than adequate.

I point this out because the Trump administration is bad, and Sean Spicer is a liar. Our criticism needs to be limited to things we can support, such as the DC metro ridership numbers (which Spicer has still not corrected). Every time we weakly criticize the administration on things we cannot support, like “shitty passwords”, we lessen our credibility. We look more like people who will hate the administration no matter what they do, rather than people who are standing up for principles like “honesty”.


The numbers above aren’t approximations. I actually generated a bcrypt hash and attempted to crack it in order to benchmark how long this would take. I’ll describe the process here.

First of all, I installed the “PHP command-line”. While older versions of PHP used MD5 for hashing, the newer versions use Bcrypt.

# apt-get install php5-cli

I then created a PHP program that will hash the password:

I actually use it three ways. The first way is to hash a small password “ax”, one short enough that the password cracker will actually succeed in cracking it. The second is to hash the password with PHP defaults, which is what I assume Twitter is using. The third is to increase the difficulty level, in case Twitter has increased the default difficulty level at all in order to protect weak passwords.

I then ran the PHP script, producing these hashes:
$ php spicer.php
$2y$10$1BfTonhKWDN23cGWKpX3YuBSj5Us3eeLzeUsfylemU0PK4JFr4moa
$2y$10$DKe2N/shCIU.jSfSr5iWi.jH0AxduXdlMjWDRvNxIgDU3Cyjizgzu
$2y$15$HKcSh42d8amewd2QLvRWn.wttbz/B8Sm656Xh6ZWQ4a0u6LZaxyH6

I ran the first one through the password cracking tool known as “Hashcat”. Within seconds, I get the correct password “ax”. Thus, I know Hashcat is correctly cracking the password. I actually had a little doubt, because the documentation doesn’t make it clear that the Bcrypt algorithm the program supports is the same as the one produced by PHP5.

I run the second one, and get the following speed statistics:

As you can see, I’m brute-forcing an eight character password that’s all lower case (-a 3 ?l?l?l?l?l?l?l?l). Checking the speed as it runs, it seems pretty consistently slightly above 300 hashes/second. It’s not perfect — it keeps warning that it’s slow. This is because the GPU acceleration works best if trying many password hashes at a time.

I tried the same sort of setup using John-the-Ripper in incremental mode. Whereas Hashcat uses the GPU, John uses the CPU. I have a 6-core Broadwell CPU, so I ran John-the-Ripper with 12 threads.

Curiously, it’s slightly faster, at 347 hashes-per-second on the CPU rather than 321 on the GPU.

Using the stronger work factor (the third hash I produced above), I get about 10 hashes/second on John, and 10 on Hashcat as well. It takes over a second to even generate the hash, meaning it’s probably too aggressive for a web server like Twitter to have to do that much work every time somebody logs in, so I suspect they aren’t that aggressive.

Would a massive IoT botnet help? Well, I tried out John on the Raspberry Pi 3. With the same settings cracking the password (at default difficulty), I got the following:

In other words, the RPi is 35 times slower than my desktop computer at this task.

The RPi v3 has four cores and about twice the clock speed of IoT devices. So the typical IoT device would be 250 times slower than a desktop computer. This gives a good approximation of the difference in power.


So there’s this comment:

Yes, you can. So I did, described below.

Okay, so first you need to use the “Node Package Manager” to install bcrypt. The name isn’t “bcrypt”, which refers to a module that I can’t get installed on any platform. Instead, you want “bcrypt-nodejs”.

# npm install bcrypt-nodejs
[email protected] node_modules\bcrypt-nodejs

On all my platforms (Windows, Ubuntu, Raspbian) this is already installed. So now you just create a script spicer.js:

var bcrypt = require("bcrypt-nodejs");
console.log(bcrypt.hashSync("aqenbpuu"));

This produces the following hash, which has the same work factor as the one generated by the PHP script above:

$2a$10$Ulbm//hEqMoco8FLRI6k8uVIrGqipeNbyU53xF2aYx3LuQ.xUEmz6

Hashcat and John then are the same speed cracking this one as the other one. The first characters $2a$ define the hash type (bcrypt). Apparently, there’s a slight difference between that and $2y$, but that doesn’t change the analysis.


The first comment below questions the speed I get, because running Hashcat in benchmark mode, it gets much higher numbers for bcrypt.

This is actually normal, due to different iteration counts.

A bcrypt hash includes an iteration count (or more precisely, the logarithm of an iteration count). It repeats the hash that number of times. That’s what the $10$ means in the hash:

$2y$10$………

The Hashcat benchmark uses the number 5 (actually, 2^5, or 32 times) as its count. But the default iteration count produced by PHP and NodeJS is 10 (2^10, or 1024 times). Thus, you’d expect these hashes to run at a speed 32 times slower.

And indeed, that’s what I get. Running the benchmark on my machine, I get the following output:

Hashtype: bcrypt, Blowfish(OpenBSD)
Speed.Dev.#1…..:    10052 H/s (82.28ms)

Doing the math, dividing 10,052 hashes/sec by 321 hashes/sec, I get 31.3. This is close enough to 32, the expected answer, given the variability of thermal throttling, background activity on the machine, and so on.
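
A quick sketch of that arithmetic:

# Benchmark mode uses a bcrypt cost of 5; the PHP/NodeJS default is 10
benchmark_cost, default_cost = 5, 10
print(2 ** (default_cost - benchmark_cost))   # 32 times more iterations
print(round(10052 / 321.0, 1))                # 31.3, the observed ratio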

Googling, this appears to be a common complaint. It’d be nice if it said something like ‘bcrypt[speed=5]’ or something to make this clear.


Spicer tweeted another password, “n9y25ah7”. Instead of all lower-case, this is lower-case plus digits, or 36 to the power of 8 combinations, so it’s about 13 times harder (36/26)^8, which is roughly in the same range.


BTW, there are a few notes I want to make.

The first is that a good practical exercise first tries to falsify the theory. In this case, I deliberately tested whether Hashcat and John were actually cracking the right password. They were. They both successfully cracked the two character password “ax”. I also tested GPU vs. CPU. Everyone knows that GPUs are faster for password cracking, but also everyone knows that Bcrypt is designed to be hard to run on GPUs.

The second note is that everything here is already discussed in my study guide on command-lines [*]. I mention that you can run PHP on the command-line, and that you can use Hashcat and John to crack passwords. My guide isn’t complete enough to be an explanation for everything, but it’s a good discussion of where you need to end up.

The third note is that I’m not a master of any of these tools. I know enough about these tools to Google the answers, not to pull them off the top of my head. Mastery is impossible, don’t even try it. For example, bcrypt is one of many hashing algorithms, and has all by itself a lot of complexity, such as the difference between $2a$ and $2y$, or the logarithmic iteration count. I ignored the issue of salting bcrypt altogether. So what I’m saying is that the level of proficiency you want is to be able to google the steps in solving a problem like this, not actually knowing all this off the top of your head.

Seamlessly Scale Predictions with AWS Lambda and MXNet

Post Syndicated from Bryan Liston original https://aws.amazon.com/blogs/compute/seamlessly-scale-predictions-with-aws-lambda-and-mxnet/


Sunil Mallya, Solutions Architect

Building AI solutions at scale can be challenging. In this post, we’ll look at how to leverage AWS Lambda and MXNet to build a scalable prediction pipeline.

Companies that leverage machine and deep learning invest in much more than just training models. They have sophisticated pipelines that include the following stages:

  • Data storage
  • Pre-processing
  • Feature extraction
  • Model generation
  • Model Analysis
  • Feature engineering
  • Evaluation and feedback

mxnet_1.png

Each stage of the pipeline requires:

  • Elasticity to adapt to changing workload demands
  • Scalability to adapt well to the overall size of the workload
  • Cost effectiveness to optimize total cost of ownership (TCO)

Amazon S3 meets all of the requirements for data and model storage. But the unpredictability of user demand and location can make scaling up for batch predictions (or the results of the model analysis) challenging, and can affect the overall user experience. In this post, we show how to use MXNet and AWS Lambda to deploy models at scale for predictions.

What is MXNet?

MXNet is a full-featured, flexibly programmable, and highly scalable deep learning framework that supports state-of-the-art deep models, including convolutional neural networks (CNNs) and long short-term memory networks (LSTMs). It is the result of collaboration between researchers at several top universities, including its founding institutions, the University of Washington and Carnegie Mellon University.

As discussed in Werner Vogel’s MXNet – Deep Learning Framework of Choice at AWS post, not only is MXNet scalable for multi-instance training, it also scales down to a variety of devices and small memory footprints, even when serving predictions on very large models. MXNet is available through open source under the Apache Version 2 license.

Challenges with the prediction pipeline

As previously mentioned, ML model training and validation is just a small part of the story. After the model is built, the real work begins. To service millions of customers seamlessly, every application must scale. However scaling the prediction portion of the pipeline can be challenging and expensive, especially when end users are geographically dispersed.

Delivering model updates, deploying globally, and maintaining high availability can be difficult. Lambda is a very good deployment option for the prediction pipeline, an excellent solution for serverless web applications, real-time batch processing, and map reduce tasks.

How can you leverage Lambda?

“Code, test, deploy, and let the service do the heavy lifting” is the Lambda customer motto. Regardless of your traffic patterns—high concurrency or bursty workloads—Lambda scales to service your needs.

We compiled and built the MXNet libraries to demonstrate how Lambda scales the prediction pipeline to provide this ease and flexibility for machine learning or deep learning model prediction. We built a sample application that predicts image labels using an 18-layer deep residual network. The model architecture is based on ResidualNet (ResNet), the winning model in the ImageNet competition. The application produces state-of-the-art results for problems like image classification.

Putting it all together

Lambda has a deployment package limit of 50 MB. This limit means that you might not always be able to package your models along with the code. To accommodate this limitation, you can use S3 to store the model, and download the model when you service the request.

For optimal performance, you need to download the model outside of the lambda_handler function so that the downloaded file persists in memory across requests. For subsequent Lambda invocations, MXNet uses the downloaded model that’s already in memory. For more information about this optimization, see AWS Lambda: How It Works.

The following reference Lambda function for prediction is quite simple. It loads the model, downloads image from the specified URL, transforms the image into an NDArray, and uses the model to make a prediction that outputs labels, with associated confidence percentages. The implementation code is provided below.

import os
import boto3
import json
import tempfile
import urllib2 
import mxnet as mx
import numpy as np
import cv2
from collections import namedtuple

Batch = namedtuple('Batch', ['data'])
f_params = 'resnet-18-0000.params'
f_symbol = 'resnet-18-symbol.json'

bucket = 'my-model-bucket'
s3 = boto3.resource('s3')
s3_client = boto3.client('s3')

#params

f_params_file = tempfile.NamedTemporaryFile()
s3_client.download_file(bucket, f_params, f_params_file.name)
f_params_file.flush()

#symbol

f_symbol_file = tempfile.NamedTemporaryFile()
s3_client.download_file(bucket, f_symbol, f_symbol_file.name)
f_symbol_file.flush()

def load_model(s_fname, p_fname):
    symbol = mx.symbol.load(s_fname)
    save_dict = mx.nd.load(p_fname)
    arg_params = {}
    aux_params = {}
    for k, v in save_dict.items():
        tp, name = k.split(':', 1)
        if tp == 'arg':
            arg_params[name] = v
        if tp == 'aux':
            aux_params[name] = v
    return symbol, arg_params, aux_params

def predict(url, mod, synsets):
    req = urllib2.urlopen(url)
    arr = np.asarray(bytearray(req.read()), dtype=np.uint8)
    cv2_img = cv2.imdecode(arr, -1)
    img = cv2.cvtColor(cv2_img, cv2.COLOR_BGR2RGB)
    if img is None:
        return None
    img = cv2.resize(img, (224, 224))
    img = np.swapaxes(img, 0, 2)
    img = np.swapaxes(img, 1, 2) 
    img = img[np.newaxis, :] 

    mod.forward(Batch([mx.nd.array(img)]))
    prob = mod.get_outputs()[0].asnumpy()
    prob = np.squeeze(prob)
    a = np.argsort(prob)[::-1]
    out = '' 

    for i in a[0:5]:
        out += 'probability=%f, class=%s' %(prob[i], synsets[i])
    out += "\n"

    return out

with open('synset.txt', 'r') as f:
    synsets = [l.rstrip() for l in f]

def lambda_handler(event, context):
    url = ''
    if event['httpMethod'] == 'GET':
        url = event['queryStringParameters']['url']
    elif event['httpMethod'] == 'POST':
        data = json.loads(event['body'])
        url = data['url']

    print "image url: ", url
    sym, arg_params, aux_params = load_model(f_symbol_file.name, f_params_file.name)
    mod = mx.mod.Module(symbol=sym)
    mod.bind(for_training=False, data_shapes=[('data', (1,3,224,224))])
    mod.set_params(arg_params, aux_params)
    labels = predict(url, mod, synsets)
    out = {
            "headers": {
                "content-type": "application/json",
                "Access-Control-Allow-Origin": "*"
                },
            "body": '{"labels": "%s"}' % labels,  
            "statusCode": 200
          }
    return out

What can you expect in production?

Because the code must download the model from S3, download an image from the web, and run the image through the model, we benchmarked it to evaluate how it scales. We ran a local benchmark on a laptop in San Francisco using wrk against the endpoint deployed in the US West (Oregon) Region (us-west-2). We observed an average latency of 1.18 seconds at a sustained rate of 75 requests per second, as shown in the following output.

mxnet_3.png

To benchmark global prediction latencies, we used Goad, a Lambda-based distributed load tester. We observed average latencies ranging from 1.2 seconds to 1.5 seconds. The following figure shows the observed latencies from various regions with the endpoint hosted in the US West (Oregon) Region (us-west-2).

mxnet_2.png

Conclusion

For a state-of-the-art image label prediction model, the numbers are impressive. The average global latency of 1.5 seconds makes it worth exploring integrating Lambda into your machine learning or deep learning pipeline for batch predictions. The libraries and code samples for deployment are available in the mxnet-lambda GitHub repo.

If you have questions or suggestions, please comment below.

UK ‘Piracy Warnings’ Are Coming This Month; Here’s How it Works

Post Syndicated from Ernesto original https://torrentfreak.com/uk-piracy-warnings-coming-month-heres-works-170111/

uk-flagIn an effort to curb online piracy, the movie and music industries reached an agreement with the UK’s leading ISPs to send “educational alerts” to alleged copyright infringers.

The piracy alerts program is part of the larger Creative Content UK (CCUK) initiative which already introduced several anti-piracy PR campaigns, targeted at the general public as well as the classroom.

The plan to send out email alerts was first announced several years ago and is about to kick off. According to ISPReview the first providers will start sending out emails later this month.

The four ISPs who are confirmed to be participating are BT, Sky, TalkTalk and Virgin Media, but other providers could join in at a later stage. Thus far CCUK hasn’t announced a lot of detail or specifics on how the program will operate exactly, but here’s what TorrentFreak has learned so far.

What will be monitored?

The “alerts” system will only apply to P2P file-sharing. In theory, this means that the focus will be almost exclusively on BitTorrent (including apps such as Popcorn Time), as other P2P networks have relatively low user bases.

Consequently, those who use Usenet providers, streaming services (such as 123movies), or file-hosters such as Zippyshare and 4Shared, are not at risk. In other words, the program only covers a part of all online piracy.

A spokesperson from CCUK’s “Get it Right” campaign stressed that the alerts represent only one part of the broader program, which also aims to reach other infringers through its other initiatives.

How many people will be targeted?

The system will apply to everyone whose Internet account has been used to share copyrighted material via P2P networks.

That said, copyright holders and ISPs have agreed to cap the warnings at 2.5 million over three years. This means that only a fraction of all UK pirates will receive a notice.

Some people may also receive multiple notices if their account is repeatedly used to share copyrighted material.

“This ensures that people who might have missed an earlier email receive another one – but also allows time for account holders to take steps to address the issue,” a Get It Right spokesperson informed us.

What’s in the notices?

While the exact language might differ between ISPs, the notices are primarily meant to inform subscribers that their accounts have been used to share infringing material, while pointing them to legal alternatives.

“The purpose is to educate UK consumers about the many sources of legal content available, highlight the value of the UK’s creative industries and reduce online copyright infringement,” we were told.

Who will be monitoring these copyright infringements?

While ISPs take part in the scheme, they will not monitor subscribers’ file-sharing activities. The tracking will be done by third-party company MarkMonitor, who are also the technology partner for the U.S. Copyright Alert System.

This tracking company collects IP-addresses from BitTorrent swarms and sends its findings directly to the Internet providers. The lists with infringing IP-addresses are not shared with any of the rightsholders.

Each ISP will keep a database of the alleged infringers and send them appropriate warnings. In compliance with local laws and the best practices of the Information Commissioner’s Office, recorded infringements will be stored for a limited time.

Will any Internet accounts be disconnected?

There are no disconnections or mitigation measures for repeat infringers under the UK copyright alerts program. Early reports suggested that alleged file-sharers will get up to four warnings after which all subsequent offenses will be ignored.

This is in line with the overall goal of the campaign which is not targeted at the most hardcore file-sharers. The program is mostly focused on educating casual infringers about the legal alternatives to piracy.

Can the monitoring be circumvented?

The answer to the previous questions already shows that users have plenty of options to bypass the program. They can simply switch to other means of downloading, but there are more alternatives.

BitTorrent users could hide their IP-addresses through proxy services and VPNs for example. After the U.S. Copyright Alert Program launched in the U.S. there was a huge increase in demand for this kind of anonymity services.

So how scary are the alerts?

CCUK’s “Get it Right” stresses that the main purpose of the system is to inform casual infringers about their inappropriate behavior and point them to legal alternatives.

The focus lies on education, although the warnings also serve as a deterrent by pointing out that people are not anonymous. For some, this may be enough to cause them to switch to legal alternatives.

All in all the proposed measures are fairly reasonable, especially when compared to other countries where fines and internet connections are on the table. Whether it will be successful is an entirely different question of course.

The Creative Content UK team is confident that they can drive some significant change. Several benchmark measurements were taken prior to the campaign, so its effectiveness can be properly measured once the first results come in.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

Running Jupyter Notebook and JupyterHub on Amazon EMR

Post Syndicated from Tom Zeng original https://aws.amazon.com/blogs/big-data/running-jupyter-notebook-and-jupyterhub-on-amazon-emr/

Tom Zeng is a Solutions Architect for Amazon EMR

Jupyter Notebook (formerly IPython) is one of the most popular user interfaces for running Python, R, Julia, Scala, and other languages to process and visualize data, perform statistical analysis, and train and run machine learning models. Jupyter notebooks are self-contained documents that can include live code, charts, narrative text, and more. The notebooks can be easily converted to HTML, PDF, and other formats for sharing.

Amazon EMR is a popular hosted big data processing service that allows users to easily run Hadoop, Spark, Presto, and other Hadoop ecosystem applications, such as Hive and Pig.

Python, Scala, and R provide support for Spark and Hadoop, and running them in Jupyter on Amazon EMR makes it easy to take advantage of:

  • the big-data processing capabilities of Hadoop applications.
  • the large selection of Python and R packages for analytics and visualization.

JupyterHub is a multiple-user environment for Jupyter. You can use the following bootstrap action (BA) to install Jupyter and JupyterHub on Amazon EMR:

s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh

These are the supported Jupyter kernels:

  • Python
  • R
  • Scala
  • Apache Toree (which provides the Spark, PySpark, SparkR, and SparkSQL kernels)
  • Julia
  • Ruby
  • JavaScript
  • CoffeeScript
  • Torch

The BA will install Jupyter, JupyterHub, and sample notebooks on the master node.

Commonly used Python and R data science and machine learning packages can be optionally installed on all nodes.

The following arguments can be passed to the BA:

--r Install the IRKernel for R.
--toree Install the Apache Toree kernel that supports Scala, PySpark, SQL, SparkR for Apache Spark.
--julia Install the IJulia kernel for Julia.
--torch Install the iTorch kernel for Torch (machine learning and visualization).
--ruby Install the iRuby kernel for Ruby.
--ds-packages Install the Python data science-related packages (scikit-learn pandas statsmodels).
--ml-packages Install the Python machine learning-related packages (theano keras tensorflow).
--python-packages Install specific Python packages (for example, ggplot and nilearn).
--port Set the port for Jupyter notebook. The default is 8888.
--password Set the password for the Jupyter notebook.
--localhost-only Restrict Jupyter to listen on localhost only. The default is to listen on all IP addresses.
--jupyterhub Install JupyterHub.
--jupyterhub-port Set the port for JupyterHub. The default is 8000.
--notebook-dir Specify the notebook folder. This could be a local directory or an S3 bucket.
--cached-install Use some cached dependency artifacts on S3 to speed up installation.
--ssl Enable SSL. For production, make sure to use your own certificate and key files.
--copy-samples Copy sample notebooks to the notebook folder.
--spark-opts User-supplied Spark options to override the default values.
--s3fs Use s3fs instead of s3nb (the default) for storing notebooks on Amazon S3. This argument can cause slowness if the S3 bucket has lots of files. The Upload file and Create folder menu options do not work with s3nb.

By default (with no --password and --port arguments), Jupyter will run on port 8888 with no password protection; JupyterHub will run on port 8000.  The --port and --jupyterhub-port arguments can be used to override the default ports to avoid conflicts with other applications.

The --r option installs the IRKernel for R. It also installs SparkR and sparklyr for R, so make sure Spark is one of the selected EMR applications to be installed. You’ll need the Spark application if you use the --toree argument.

If you used --jupyterhub, use Linux users to sign in to JupyterHub. (Be sure to create passwords for the Linux users first.) hadoop, the default admin user for JupyterHub, can be used to set up other users. The --password option sets the password for Jupyter and for the hadoop user for JupyterHub.

Jupyter on EMR allows users to save their work on Amazon S3 rather than on local storage on the EMR cluster (master node).

To store notebooks on S3, use:

--notebook-dir <s3://your-bucket/folder/>

To store notebooks in a directory different from the user’s home directory, use:

--notebook-dir <local directory>

The following example CLI command is used to launch a five-node (c3.4xlarge) EMR 5.2.0 cluster with the bootstrap action. The BA will install all the available kernels. It will also install the ggplot and nilearn Python packages and set:

  • the Jupyter port to 8880
  • the password to jupyter
  • the JupyterHub port to 8001
aws emr create-cluster --release-label emr-5.2.0 \
  --name 'emr-5.2.0 sparklyr + jupyter cli example' \
  --applications Name=Hadoop Name=Hive Name=Spark Name=Pig Name=Tez Name=Ganglia Name=Presto \
  --ec2-attributes KeyName=<your-ec2-key>,InstanceProfile=EMR_EC2_DefaultRole \
  --service-role EMR_DefaultRole \
  --instance-groups \
    InstanceGroupType=MASTER,InstanceCount=1,InstanceType=c3.4xlarge \
    InstanceGroupType=CORE,InstanceCount=4,InstanceType=c3.4xlarge \
  --region us-east-1 \
  --log-uri s3://<your-s3-bucket>/emr-logs/ \
  --bootstrap-actions \
    Name='Install Jupyter notebook',Path="s3://aws-bigdata-blog/artifacts/aws-blog-emr-jupyter/install-jupyter-emr5.sh",Args=[--r,--julia,--toree,--torch,--ruby,--ds-packages,--ml-packages,--python-packages,'ggplot nilearn',--port,8880,--password,jupyter,--jupyterhub,--jupyterhub-port,8001,--cached-install,--notebook-dir,s3://<your-s3-bucket>/notebooks/,--copy-samples]

Replace <your-ec2-key> with the name of your EC2 key pair and <your-s3-bucket> with the S3 bucket where you store notebooks. You can also change the instance types to suit your needs and budget.

If you are using the EMR console to launch a cluster, you can specify the bootstrap action as follows:

[Screenshot: specifying the bootstrap action in the EMR console]

When the cluster is available, set up the SSH tunnel and web proxy. The Jupyter notebook should be available at localhost:8880 (as specified in the example CLI command).
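One way to set up the tunnel is plain SSH local port forwarding. The sketch below assumes the default hadoop user and uses placeholders for your key file and the master node's public DNS name:

# Forward the Jupyter (8880) and JupyterHub (8001) ports from the example CLI command.
# <your-ec2-key>.pem and <master-public-dns> are placeholders.
ssh -i <your-ec2-key>.pem -N \
  -L 8880:localhost:8880 \
  -L 8001:localhost:8001 \
  hadoop@<master-public-dns>

With the tunnel in place, Jupyter is reachable at http://localhost:8880 and JupyterHub at http://localhost:8001.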

[Screenshot: Jupyter sign-in page]

After you have signed in, you will see the home page, which displays the notebook files:

[Screenshot: Jupyter home page listing notebook files]

If JupyterHub is installed, the Sign in page should be available at port 8001 (as specified in the CLI example):

[Screenshot: JupyterHub sign-in page]

After you are signed in, you’ll see the JupyterHub and Jupyter home pages are the same. The JupyterHub URL, however, is /user/<username>/tree instead of /tree.

[Screenshot: JupyterHub home page]

The JupyterHub Admin page is used for managing users:

[Screenshot: JupyterHub Admin page]

You can install Jupyter extensions from the Nbextensions tab:

[Screenshot: Nbextensions tab]

If you specified the --copy-samples option in the BA, you should see sample notebooks on the home page. To try the samples, first open and run the CopySampleDataToHDFS.ipynb notebook to copy some sample data files to HDFS. In the CLI example, --python-packages,'ggplot nilearn' is used to install the ggplot and nilearn packages. You can verify those packages were installed by running the Py-ggplot and PyNilearn notebooks.

The CreateUser.ipynb notebook contains examples for setting up JupyterHub users.

The PySpark.ipynb and ScalaSpark.ipynb notebooks contain the Python and Scala versions of some machine learning examples from the Spark distribution (Logistic Regression, Neural Networks, Random Forest, and Support Vector Machines):

[Screenshot: PySpark and ScalaSpark machine learning sample notebooks]

PyHivePrestoHDFS.ipynb shows how to access Hive, Presto, and HDFS in Python. (Be sure to run the CreateHivePrestoS3Tables.ipynb notebook first to create the tables.) The %%time and %%timeit cell magics can be used to benchmark Hive and Presto queries (and other executable code):

[Screenshot: benchmarking Hive and Presto queries with %%time]

Here are some other sample notebooks for you to try:

  • SparkSQLParquetJSON.ipynb
  • plot_separating_hyperplane.ipynb
  • R-SVMLinearNonLinear.ipynb
  • plot_iris.ipynb
  • Julia-IrisPlot.ipynb
  • PyIrisPlot.ipynb
  • R-RandomForestVisualization.ipynb
  • GrangerCausality.ipynb
  • SQLite.ipynb
  • GraphvizDot.ipynb

Conclusion

Data scientists who run Jupyter and JupyterHub on Amazon EMR can use Python, R, Julia, and Scala to process, analyze, and visualize big data stored in Amazon S3. Jupyter notebooks can be saved to S3 automatically, so users can shut down and launch new EMR clusters, as needed. EMR makes it easy to spin up clusters with different sizes and CPU/memory configurations to suit different workloads and budgets. This can greatly reduce the cost of data-science investigations.

If you have questions about using Jupyter and JupyterHub on EMR or would like to share your use cases, please leave a comment below.


Related

Running sparklyr – RStudio’s R Interface to Spark on Amazon EMR


US Government Publishes New Plan to Target Pirate Sites

Post Syndicated from Andy original https://torrentfreak.com/us-government-publishes-new-plan-to-target-pirate-sites-161213/

The Office of the Intellectual Property Enforcement Coordinator (IPEC) has just released its Joint Strategic Plan (JSP) for Intellectual Property Enforcement, titled Supporting Innovation, Creativity & Enterprise: Charting a Path Ahead.

“The Plan – which incorporates views from a variety of individual stakeholders across government, industry, educational institutions, trade organizations and public interest groups — offers a blueprint for the work to be carried out over the next three years by the Federal Government in support of a healthy and robust intellectual property enforcement policy environment,” a White House statement reads.

The plan has four stated goals:

– Enhance National understanding of the economic and social impacts flowing from misappropriation of trade secrets and the infringement of intellectual property rights

– Promote a safe and secure Internet by minimizing counterfeiting and IP-infringing activity online

– Secure and facilitate lawful trade

– Enhance domestic strategies and global collaboration in support of effective IP enforcement.

The 163-page report leaves few stones unturned, with Section 2 homing in on Internet piracy.

Follow The Money

While shutting down websites is often seen as the ultimate anti-piracy tool, more commonly authorities are targeting what they believe fuels online piracy – money. The report says that while original content is expensive to create, copies cost almost nothing, leading to large profits for pirates.

“An effective enforcement strategy against commercial-scale piracy and counterfeiting therefore, must target and dry up the illicit revenue flow of the actors engaged in commercial piracy online. That requires an examination of the revenue sources for commercial-scale pirates,” the report says.

“The operators of direct illicit download and streaming sites enjoy revenue through membership subscriptions serviced by way of credit card and similar payment-based transactions, as is the case with the sale and purchase of counterfeit goods, while the operators of torrent sites may rely more heavily on advertising revenue as the primary source of income.”

To cut off this revenue, the government foresees voluntary collaboration between payment processor networks, online advertisers, and the banking sector.

Payment processors

“All legitimate payment processors prohibit the use of their services and platforms for unlawful conduct, including IP-infringing activities. They do so by way of policy and contract through terms of use and other agreements applicable to their users,” the JSP says.

“Yet, notwithstanding these prohibitions, payment processor platforms continue to be exploited by illicit merchants of counterfeit products and infringing content.”

The government says that pirates and counterfeiters use a number of techniques to exploit payment processors and have deployed systems that can thwart “test” transactions conducted by rightsholders and other investigators. Furthermore, the fact that some credit card companies do not have direct contractual relationships with merchants enables websites to continue doing business after payment processing rights have been terminated.

The JSP calls for more coordination between companies in the ecosystem, increased transparency, greater geographic scope, and bi-lateral engagements with other governments.

“IPEC and USPTO, with private sector input, will facilitate benchmarking studies of current voluntary initiatives designed to combat revenue flow to rogue sites to determine whether existing voluntary initiatives are functioning effectively, and thereby promote a robust, data-driven voluntary initiative environment,” the report adds.

Advertising

The JSP begins with the comment that “Ad revenue is the oxygen that content theft needs to breathe” and it’s clear that the government wants to asphyxiate pirate sites. It believes that up to 86% of download and streaming platforms rely on advertising for revenue and that the sector needs to be cleaned up.

In common with payment processors, the report notes that legitimate ad networks also have policies in place to stop their services appearing on pirate sites. However, “sophisticated entities” dedicated to infringement can exploit loopholes, with some doing so to display “high-risk” ads that include malware, pop-unders and pixel stuffing.

Collaboration is already underway among industry players but the government wants to see more integration and cooperation, to stay ahead of the tactics allegedly employed by sites such as the defunct KickassTorrents, which is highlighted in the report.


“IPEC and the IPR Center (with its constituent law enforcement partners), along with other relevant Federal agencies, will convene the advertising industry to hear further about their voluntary efforts. The U.S. Interagency Strategic Planning Committees on IP Enforcement will assess opportunities to support efforts to combat the flow of ad revenue to criminals,” the JSP reads.

“As part of best practices and initiatives, advertising networks are encouraged to make appropriately generalized and anonymized data publicly available to permit study and analysis of illicit activity intercepted on their platforms and networks. Such data will allow study by public and private actors alike to identify patterns of behavior or tactics associated with illicit actors who seek to profit from ad revenue from content theft websites.”

Domain hopping

When pirate sites come under pressure from copyright holders, their domain names are often at risk of suspension or even seizure. This triggers a phenomenon known as domain hopping, a tactic most visibly employed by The Pirate Bay when it skipped all around the world with domains registered in several different countries.


“To evade law enforcement, bad actors will register the same or different domain name with different registrars. They then attempt to evade law enforcement by moving from one registrar to another, thus prolonging the so-called ‘whack-a-mole’ pursuit. The result of this behavior is to drive up costs of time and resources spent on protecting intellectual property right,” the JSP notes.

The report adds that pirate sites are more likely to use ccTLDs (country code Top-Level Domains) than gTLDs (Generic Top-Level Domains) due to the way the former are administrated.

“The relationship between any given ccTLD administrator and its government will differ from case to case and may depend on complex and sensitive arrangements particular to the local political climate. Different ccTLD policies will reflect different approaches with respect to process for the suspension, transfer, or cancellation of a domain name registration,” it reads.

“Based on the most recent Notorious Markets lists available prior to issuance of this plan, ccTLDs comprise roughly half of all named ‘notorious’ top-level domains. Considering that ccTLDs are outnumbered by gTLDs in the domain name base by more than a 2-to-1 ratio, the frequency of bad faith ccTLD sites appear to be disproportionate in nature and worthy of further research and analysis.”

Once again, the US government calls for more cooperation alongside an investigation to assess the scope of “abusive domain name registration tactics and trend.”

Policies to improve DMCA takedown processes

As widely documented, rightsholders are generally very unhappy with the current DMCA regime as they are forced to send millions of notices every week to contain the flow of pirate content. Equally, service providers are also being placed under significant stress due to the processing of those same notices.

In its report, the government acknowledges the problems faced by both sides but indicates that the right discussions are already underway to address the issues.

“The continued development of private sector best practices, led through a multistakeholder process, may ease the burdens involved with the DMCA process for rights holders, Internet intermediaries, and users while decreasing infringing activity,” the report says.

“These best practices may focus on enhanced methods for identifying actionable infringement, preventing abuse of the system, establishing efficient takedown procedures, preventing the reappearance of previously removed infringing content, and providing opportunity for creators to assert their fair use rights.”

In summary, the government champions the Copyright Office’s current evaluation of Section 512 of the DMCA while calling for cooperation between stakeholders.

Social Media

The Joint Strategic Plan highlights the growing part social media has to play in the dissemination of infringing content, from driving traffic to websites selling illegal products, to unlawful exploitation of third-party content, to suspect product reviews. Again, the solution can be found in collaboration, including with the public.

“[The government will] encourage the development of industry standards and best practices, through a multistakeholder process, to curb abuses of social media channels for illicit purposes, while protecting the rights of users to use those channels for non-infringing and other lawful activities,” it notes.

“One underutilized resource may be the users themselves, who may be in a position to report suspicious product offerings or other illicit activity, if provided a streamlined opportunity to do so, as some social media companies are beginning to explore.”

And finally – education

The government believes that greater knowledge among the public of where it can obtain content legally will assist in reducing instances of online piracy.

“The U.S. Interagency Strategic Planning Committees on IP Enforcement, and other relevant Federal agencies, as appropriate, will assess opportunities to support public-private collaborative efforts aimed at increasing awareness of legal sources of copyrighted material online and educating users about the harmful impacts of digital piracy,” it concludes.

The full report is available here (163 pages, PDF)

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

20 Billion Files Restored

Post Syndicated from Yev original https://www.backblaze.com/blog/20-billion-files-restored/

On September 1st we asked you to predict when Backblaze would reach our 20 Billion Files Restored mark. We had a ton of entrants!! Over two thousand people hazarded a guess. It is our pleasure to announce that we’ve finally hit that milestone, and along with it, we’re announcing the winners of our contest.

Before we reveal the date and time that we crossed that threshold, we first want to point out that the closest guess was only off by 23 minutes. Second closest? 57 minutes. That’s kind of remarkable. The 10 winners all pinpointed the exact date, and the furthest winner was only four hours and six minutes off the mark. Very impressive!

Congratulations to our winners:

  • Lance – Bismarck, North Dakota
  • Bartosz – Warsaw, Poland
  • Jeremy – Evergreen, Colorado
  • Justin – Los Angeles, California
  • Andy – London, UK
  • Jeffrey – Round Rock, Texas
  • Jose – Merida, Mexico
  • Maria – New York, New York
  • Rizwan – Surrey, UK
  • Max – Howard Beach, New York

11/20/2016 – 10:57 AM

That was the exact time when we restored the 20,000,000,000th file. That’s lots of memories, documents, and projects saved. The number of files per Backblaze restore varies due to our various restore methods. The most common use case for our restores is when folks lose one or two files and do a small restore of just a folder or two.

Restore Fun Facts For a Typical Month:

Where restores are created:

  • 96.3% of restores are done on the web
  • 3.7% are done via our mobile apps

Of all restores:

  • 97.8% are ZIP restores
  • 1.7% are USB HD restores
  • 0.5% are USB Flash Drive restores

The average size of restores:

  • ZIP restores (web & mobile): 25 GB
  • USB HD: 1.1 TB (1,100 GB)
  • USB Flash: 63 GB

Based on the amount of data in ZIP file restores:

Range in GB      % of Restores
< 1              54.5%
1 – 10           17.5%
10 – 25           9.5%
25 – 50           7.2%
50 – 75           2.6%
75 – 100          1.7%
100 – 200         3.4%
200 – 300         1.5%
300 – 400         1.0%
400 – 500         0.8%
> 500             0.3%

Backblaze was started with the goal of preventing data loss. Even though we’re nearing 300 Petabytes of data stored, we consider our 20 Billion Files Restored the benchmark that we’re most proud of because it validates our ability to help our customers out when they need us most. Thank you for being loyal Backblaze fans, and while we always hope that folks won’t need to create restores, we’ll be here for you if you do!

The post 20 Billion Files Restored appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.