Tag Archives: downloads

Stretch for PCs and Macs, and a Raspbian update

Post Syndicated from Simon Long original https://www.raspberrypi.org/blog/stretch-pcs-macs-raspbian-update/

Today, we are launching the first Debian Stretch release of the Raspberry Pi Desktop for PCs and Macs, and we’re also releasing the latest version of Raspbian Stretch for your Pi.

Raspberry Pi Desktop Stretch splash screen

For PCs and Macs

When we released our custom desktop environment on Debian for PCs and Macs last year, we were slightly taken aback by how popular it turned out to be. We really only created it as a result of one of those “Wouldn’t it be cool if…” conversations we sometimes have in the office, so we were delighted by the Pi community’s reaction.

Seeing how keen people were on the x86 version, we decided that we were going to try to keep releasing it alongside Raspbian, with the ultimate aim being to make simultaneous releases of both. This proved to be tricky, particularly with the move from the Jessie version of Debian to the Stretch version this year. However, we have now finished the job of porting all the custom code in Raspbian Stretch to Debian, and so the first Debian Stretch release of the Raspberry Pi Desktop for your PC or Mac is available from today.

The new Stretch releases

As with the Jessie release, you can either run this as a live image from a DVD, USB stick, or SD card or install it as the native operating system on the hard drive of an old laptop or desktop computer. Please note that installing this software will erase anything else on the hard drive — do not install this over a machine running Windows or macOS that you still need to use for its original purpose! It is, however, safe to boot a live image on such a machine, since your hard drive will not be touched by this.

We’re also pleased to announce that we are releasing the latest version of Raspbian Stretch for your Pi today. The Pi and PC versions are largely identical: as before, there are a few applications (such as Mathematica) which are exclusive to the Pi, but the user interface, desktop, and most applications will be exactly the same.

For Raspbian, this new release is mostly bug fixes and tweaks over the previous Stretch release, but there are one or two changes you might notice.

File manager

The file manager included as part of the LXDE desktop (on which our desktop is based) is a program called PCManFM, and it’s very feature-rich; there’s not much you can’t do in it. However, having used it for a few years, we felt that it was perhaps more complex than it needed to be — the sheer number of menu options and choices made some common operations more awkward than they needed to be. So to try to make file management easier, we have implemented a cut-down mode for the file manager.

Raspberry Pi Desktop Stretch - file manager

Most of the changes are to do with the menus. We’ve removed a lot of options that most people are unlikely to change, and moved some other options into the Preferences screen rather than the menus. The two most common settings people tend to change — how icons are displayed and sorted — are now options on the toolbar and in a top-level menu rather than hidden away in submenus.

The sidebar now only shows a single hierarchical view of the file system, and we’ve tidied the toolbar and updated the icons to make them match our house style. We’ve removed the option for a tabbed interface, and we’ve stomped a few bugs as well.

One final change was to make it possible to rename a file just by clicking on its icon to highlight it, and then clicking on its name. This is the way renaming works on both Windows and macOS, and it’s always seemed slightly awkward that Unix desktop environments tend not to support it.

As with most of the other changes we’ve made to the desktop over the last few years, the intention is to make it simpler to use, and to ease the transition from non-Unix environments. But if you really don’t like what we’ve done and long for the old file manager, just untick the box for Display simplified user interface and menus in the Layout page of Preferences, and everything will be back the way it was!

Raspberry Pi Desktop Stretch - preferences GUI

Battery indicator for laptops

One important feature missing from the previous release was an indication of the amount of battery life. Eben runs our desktop on his Mac, and he was becoming slightly irritated by having to keep rebooting into macOS just to check whether his battery was about to die — so fixing this was a priority!

We’ve added a battery status icon to the taskbar; this shows current percentage charge, along with whether the battery is charging, discharging, or connected to the mains. When you hover over the icon with the mouse pointer, a tooltip with more details appears, including the time remaining if the battery can provide this information.

Raspberry Pi Desktop Stretch - battery indicator

While this battery monitor is mainly intended for the PC version, it also supports the first-generation pi-top — to see it, you’ll only need to make sure that I2C is enabled in Configuration. A future release will support the new second-generation pi-top.

New PC applications

We have included a couple of new applications in the PC version. One is called PiServer — this allows you to set up an operating system, such as Raspbian, on the PC which can then be shared by a number of Pi clients networked to it. It is intended to make it easy for classrooms to have multiple Pis all running exactly the same software, and for the teacher to have control over how the software is installed and used. PiServer is quite a clever piece of software, and it’ll be covered in more detail in another blog post in December.

We’ve also added an application which allows you to easily use the GPIO pins of a Pi Zero connected via USB to a PC in applications using Scratch or Python. This makes it possible to run the same physical computing projects on the PC as you do on a Pi! Again, we’ll tell you more in a separate blog post this month.

Both of these applications are included as standard on the PC image, but not on the Raspbian image. You can run them on a Pi if you want — both can be installed from apt.

How to get the new versions

New images for both Raspbian and Debian versions are available from the Downloads page.

It is possible to update existing installations of both Raspbian and Debian versions. For Raspbian, this is easy: just open a terminal window and enter

sudo apt-get update
sudo apt-get dist-upgrade

Updating Raspbian on your Raspberry Pi

How to update to the latest version of Raspbian on your Raspberry Pi. Download Raspbian here: More information on the latest version of Raspbian: Buy a Raspberry Pi:

It is slightly more complex for the PC version, as the previous release was based around Debian Jessie. You will need to edit the files /etc/apt/sources.list and /etc/apt/sources.list.d/raspi.list, using sudo to do so. In both files, change every occurrence of the word “jessie” to “stretch”. When that’s done, do the following:

sudo apt-get update 
sudo dpkg --force-depends -r libwebkitgtk-3.0-common
sudo apt-get -f install
sudo apt-get dist-upgrade
sudo apt-get install python3-thonny
sudo apt-get install sonic-pi=2.10.0~repack-rpt1+2
sudo apt-get install piserver
sudo apt-get install usbbootgui

At several points during the upgrade process, you will be asked if you want to keep the current version of a configuration file or to install the package maintainer’s version. In every case, keep the existing version, which is the default option. The update may take an hour or so, depending on your network connection.

As with all software updates, there is the possibility that something may go wrong during the process, which could lead to your operating system becoming corrupted. Therefore, we always recommend making a backup first.

Enjoy the new versions, and do let us know any feedback you have in the comments or on the forums!

The post Stretch for PCs and Macs, and a Raspbian update appeared first on Raspberry Pi.

UI Testing at Scale with AWS Lambda

Post Syndicated from Stas Neyman original https://aws.amazon.com/blogs/devops/ui-testing-at-scale-with-aws-lambda/

This is a guest blog post by Wes Couch and Kurt Waechter from the Blackboard Internal Product Development team about their experience using AWS Lambda.

One year ago, one of our UI test suites took hours to run. Last month, it took 16 minutes. Today, it takes 39 seconds. Here’s how we did it.

The backstory:

Blackboard is a global leader in delivering robust and innovative education software and services to clients in higher education, government, K12, and corporate training. We have a large product development team working across the globe in at least 10 different time zones, with an internal tools team providing support for quality and workflows. We have been using Selenium Webdriver to perform automated cross-browser UI testing since 2007. Because we are now practicing continuous delivery, the automated UI testing challenge has grown due to the faster release schedule. On top of that, every commit made to each branch triggers an execution of our automated UI test suite. If you have ever implemented an automated UI testing infrastructure, you know that it can be very challenging to scale and maintain. Although there are services that are useful for testing different browser/OS combinations, they don’t meet our scale needs.

It used to take three hours to synchronously run our functional UI suite, which revealed the obvious need for parallel execution. Previously, we used Mesos to orchestrate a Selenium Grid Docker container for each test run. This way, we were able to run eight concurrent threads for test execution, which took an average of 16 minutes. Although this setup is fine for a single workflow, the cracks started to show when we reached the scale required for Blackboard’s mature product lines. Going beyond eight concurrent sessions on a single container introduced performance problems that impact the reliability of tests (for example, issues in Webdriver or the browser popping up frequently). We tried Mesos and considered Kubernetes for Selenium Grid orchestration, but the answer to scaling a Selenium Grid was to think smaller, not larger. This led to our breakthrough with AWS Lambda.

The solution:

We started using AWS Lambda for UI testing because it doesn’t require costly infrastructure or countless man hours to maintain. The steps we outline in this blog post took one work day, from inception to implementation. By simply packaging the UI test suite into a Lambda function, we can execute these tests in parallel on a massive scale. We use a custom JUnit test runner that invokes the Lambda function with a request to run each test from the suite. The runner then aggregates the results returned from each Lambda test execution.

Selenium is the industry standard for testing UI at scale. Although there are other options to achieve the same thing in Lambda, we chose this mature suite of tools. Selenium is backed by Google, Firefox, and others to help the industry drive their browsers with code. This makes Lambda and Selenium a compelling stack for achieving UI testing at scale.

Making Chrome Run in Lambda

Currently, Chrome for Linux will not run in Lambda due to an absent mount point. By rebuilding Chrome with a slight modification, as Marco Lüthy originally demonstrated, you can run it inside Lambda anyway! It took about two hours to build the current master branch of Chromium to build on a c4.4xlarge. Unfortunately, the current version of ChromeDriver, 2.33, does not support any version of Chrome above 62, so we’ll be using Marco’s modified version of version 60 for the near future.

Required System Libraries

The Lambda runtime environment comes with a subset of common shared libraries. This means we need to include some extra libraries to get Chrome and ChromeDriver to work. Anything that exists in the java resources folder during compile time is included in the base directory of the compiled jar file. When this jar file is deployed to Lambda, it is placed in the /var/task/ directory. This allows us to simply place the libraries in the java resources folder under a folder named lib/ so they are right where they need to be when the Lambda function is invoked.

To get these libraries, create an EC2 instance and choose the Amazon Linux AMI.

Next, use ssh to connect to the server. After you connect to the new instance, search for the libraries to find their locations.

sudo find / -name libgconf-2.so.4
sudo find / -name libORBit-2.so.0

Now that you have the locations of the libraries, copy these files from the EC2 instance and place them in the java resources folder under lib/.

Packaging the Tests

To deploy the test suite to Lambda, we used a simple Gradle tool called ShadowJar, which is similar to the Maven Shade Plugin. It packages the libraries and dependencies inside the jar that is built. Usually test dependencies and sources aren’t included in a jar, but for this instance we want to include them. To include the test dependencies, add this section to the build.gradle file.

shadowJar {
   from sourceSets.test.output
   configurations = [project.configurations.testRuntime]
}

Deploying the Test Suite

Now that our tests are packaged with the dependencies in a jar, we need to get them into a running Lambda function. We use  simple SAM  templates to upload the packaged jar into S3, and then deploy it to Lambda with our settings.

{
   "AWSTemplateFormatVersion": "2010-09-09",
   "Transform": "AWS::Serverless-2016-10-31",
   "Resources": {
       "LambdaTestHandler": {
           "Type": "AWS::Serverless::Function",
           "Properties": {
               "CodeUri": "./build/libs/your-test-jar-all.jar",
               "Runtime": "java8",
               "Handler": "com.example.LambdaTestHandler::handleRequest",
               "Role": "<YourLambdaRoleArn>",
               "Timeout": 300,
               "MemorySize": 1536
           }
       }
   }
}

We use the maximum timeout available to ensure our tests have plenty of time to run. We also use the maximum memory size because this ensures our Lambda function can support Chrome and other resources required to run a UI test.

Specifying the handler is important because this class executes the desired test. The test handler should be able to receive a test class and method. With this information it will then execute the test and respond with the results.

public LambdaTestResult handleRequest(TestRequest testRequest, Context context) {
   LoggerContainer.LOGGER = new Logger(context.getLogger());
  
   BlockJUnit4ClassRunner runner = getRunnerForSingleTest(testRequest);
  
   Result result = new JUnitCore().run(runner);

   return new LambdaTestResult(result);
}

Creating a Lambda-Compatible ChromeDriver

We provide developers with an easily accessible ChromeDriver for local test writing and debugging. When we are running tests on AWS, we have configured ChromeDriver to run them in Lambda.

To configure ChromeDriver, we first need to tell ChromeDriver where to find the Chrome binary. Because we know that ChromeDriver is going to be unzipped into the root task directory, we should point the ChromeDriver configuration at that location.

The settings for getting ChromeDriver running are mostly related to Chrome, which must have its working directories pointed at the tmp/ folder.

Start with the default DesiredCapabilities for ChromeDriver, and then add the following settings to enable your ChromeDriver to start in Lambda.

public ChromeDriver createLambdaChromeDriver() {
   ChromeOptions options = new ChromeOptions();

   // Set the location of the chrome binary from the resources folder
   options.setBinary("/var/task/chrome");

   // Include these settings to allow Chrome to run in Lambda
   options.addArguments("--disable-gpu");
   options.addArguments("--headless");
   options.addArguments("--window-size=1366,768");
   options.addArguments("--single-process");
   options.addArguments("--no-sandbox");
   options.addArguments("--user-data-dir=/tmp/user-data");
   options.addArguments("--data-path=/tmp/data-path");
   options.addArguments("--homedir=/tmp");
   options.addArguments("--disk-cache-dir=/tmp/cache-dir");
  
   DesiredCapabilities desiredCapabilities = DesiredCapabilities.chrome();
   desiredCapabilities.setCapability(ChromeOptions.CAPABILITY, options);
  
   return new ChromeDriver(desiredCapabilities);
}

Executing Tests in Parallel

You can approach parallel test execution in Lambda in many different ways. Your approach depends on the structure and design of your test suite. For our solution, we implemented a custom test runner that uses reflection and JUnit libraries to create a list of test cases we want run. When we have the list, we create a TestRequest object to pass into the Lambda function that we have deployed. In this TestRequest, we place the class name, test method, and the test run identifier. When the Lambda function receives this TestRequest, our LambdaTestHandler generates and runs the JUnit test. After the test is complete, the test result is sent to the test runner. The test runner compiles a result after all of the tests are complete. By executing the same Lambda function multiple times with different test requests, we can effectively run the entire test suite in parallel.

To get screenshots and other test data, we pipe those files during test execution to an S3 bucket under the test run identifier prefix. When the tests are complete, we link the files to each test execution in the report generated from the test run. This lets us easily investigate test executions.

Pro Tip: Dynamically Loading Binaries

AWS Lambda has a limit of 250 MB of uncompressed space for packaged Lambda functions. Because we have libraries and other dependencies to our test suite, we hit this limit when we tried to upload a function that contained Chrome and ChromeDriver (~140 MB). This test suite was not originally intended to be used with Lambda. Otherwise, we would have scrutinized some of the included libraries. To get around this limit, we used the Lambda functions temporary directory, which allows up to 500 MB of space at runtime. Downloading these binaries at runtime moves some of that space requirement into the temporary directory. This allows more room for libraries and dependencies. You can do this by grabbing Chrome and ChromeDriver from an S3 bucket and marking them as executable using built-in Java libraries. If you take this route, be sure to point to the new location for these executables in order to create a ChromeDriver.

private static void downloadS3ObjectToExecutableFile(String key) throws IOException {
   File file = new File("/tmp/" + key);

   GetObjectRequest request = new GetObjectRequest("s3-bucket-name", key);

   FileUtils.copyInputStreamToFile(s3client.getObject(request).getObjectContent(), file);
   file.setExecutable(true);
}

Lambda-Selenium Project Source

We have compiled an open source example that you can grab from the Blackboard Github repository. Grab the code and try it out!

https://blackboard.github.io/lambda-selenium/

Conclusion

One year ago, one of our UI test suites took hours to run. Last month, it took 16 minutes. Today, it takes 39 seconds. Thanks to AWS Lambda, we can reduce our build times and perform automated UI testing at scale!

How to Patch, Inspect, and Protect Microsoft Windows Workloads on AWS—Part 2

Post Syndicated from Koen van Blijderveen original https://aws.amazon.com/blogs/security/how-to-patch-inspect-and-protect-microsoft-windows-workloads-on-aws-part-2/

Yesterday in Part 1 of this blog post, I showed you how to:

  1. Launch an Amazon EC2 instance with an AWS Identity and Access Management (IAM) role, an Amazon Elastic Block Store (Amazon EBS) volume, and tags that Amazon EC2 Systems Manager (Systems Manager) and Amazon Inspector use.
  2. Configure Systems Manager to install the Amazon Inspector agent and patch your EC2 instances.

Today in Steps 3 and 4, I show you how to:

  1. Take Amazon EBS snapshots using Amazon EBS Snapshot Scheduler to automate snapshots based on instance tags.
  2. Use Amazon Inspector to check if your EC2 instances running Microsoft Windows contain any common vulnerabilities and exposures (CVEs).

To catch up on Steps 1 and 2, see yesterday’s blog post.

Step 3: Take EBS snapshots using EBS Snapshot Scheduler

In this section, I show you how to use EBS Snapshot Scheduler to take snapshots of your instances at specific intervals. To do this, I will show you how to:

  • Determine the schedule for EBS Snapshot Scheduler by providing you with best practices.
  • Deploy EBS Snapshot Scheduler by using AWS CloudFormation.
  • Tag your EC2 instances so that EBS Snapshot Scheduler backs up your instances when you want them backed up.

In addition to making sure your EC2 instances have all the available operating system patches applied on a regular schedule, you should take snapshots of the EBS storage volumes attached to your EC2 instances. Taking regular snapshots allows you to restore your data to a previous state quickly and cost effectively. With Amazon EBS snapshots, you pay only for the actual data you store, and snapshots save only the data that has changed since the previous snapshot, which minimizes your cost. You will use EBS Snapshot Scheduler to make regular snapshots of your EC2 instance. EBS Snapshot Scheduler takes advantage of other AWS services including CloudFormation, Amazon DynamoDB, and AWS Lambda to make backing up your EBS volumes simple.

Determine the schedule

As a best practice, you should back up your data frequently during the hours when your data changes the most. This reduces the amount of data you lose if you have to restore from a snapshot. For the purposes of this blog post, the data for my instances changes the most between the business hours of 9:00 A.M. to 5:00 P.M. Pacific Time. During these hours, I will make snapshots hourly to minimize data loss.

In addition to backing up frequently, another best practice is to establish a strategy for retention. This will vary based on how you need to use the snapshots. If you have compliance requirements to be able to restore for auditing, your needs may be different than if you are able to detect data corruption within three hours and simply need to restore to something that limits data loss to five hours. EBS Snapshot Scheduler enables you to specify the retention period for your snapshots. For this post, I only need to keep snapshots for recent business days. To account for weekends, I will set my retention period to three days, which is down from the default of 15 days when deploying EBS Snapshot Scheduler.

Deploy EBS Snapshot Scheduler

In Step 1 of Part 1 of this post, I showed how to configure an EC2 for Windows Server 2012 R2 instance with an EBS volume. You will use EBS Snapshot Scheduler to take eight snapshots each weekday of your EC2 instance’s EBS volumes:

  1. Navigate to the EBS Snapshot Scheduler deployment page and choose Launch Solution. This takes you to the CloudFormation console in your account. The Specify an Amazon S3 template URL option is already selected and prefilled. Choose Next on the Select Template page.
  2. On the Specify Details page, retain all default parameters except for AutoSnapshotDeletion. Set AutoSnapshotDeletion to Yes to ensure that old snapshots are periodically deleted. The default retention period is 15 days (you will specify a shorter value on your instance in the next subsection).
  3. Choose Next twice to move to the Review step, and start deployment by choosing the I acknowledge that AWS CloudFormation might create IAM resources check box and then choosing Create.

Tag your EC2 instances

EBS Snapshot Scheduler takes a few minutes to deploy. While waiting for its deployment, you can start to tag your instance to define its schedule. EBS Snapshot Scheduler reads tag values and looks for four possible custom parameters in the following order:

  • <snapshot time> – Time in 24-hour format with no colon.
  • <retention days> – The number of days (a positive integer) to retain the snapshot before deletion, if set to automatically delete snapshots.
  • <time zone> – The time zone of the times specified in <snapshot time>.
  • <active day(s)>all, weekdays, or mon, tue, wed, thu, fri, sat, and/or sun.

Because you want hourly backups on weekdays between 9:00 A.M. and 5:00 P.M. Pacific Time, you need to configure eight tags—one for each hour of the day. You will add the eight tags shown in the following table to your EC2 instance.

TagValue
scheduler:ebs-snapshot:09000900;3;utc;weekdays
scheduler:ebs-snapshot:10001000;3;utc;weekdays
scheduler:ebs-snapshot:11001100;3;utc;weekdays
scheduler:ebs-snapshot:12001200;3;utc;weekdays
scheduler:ebs-snapshot:13001300;3;utc;weekdays
scheduler:ebs-snapshot:14001400;3;utc;weekdays
scheduler:ebs-snapshot:15001500;3;utc;weekdays
scheduler:ebs-snapshot:16001600;3;utc;weekdays

Next, you will add these tags to your instance. If you want to tag multiple instances at once, you can use Tag Editor instead. To add the tags in the preceding table to your EC2 instance:

  1. Navigate to your EC2 instance in the EC2 console and choose Tags in the navigation pane.
  2. Choose Add/Edit Tags and then choose Create Tag to add all the tags specified in the preceding table.
  3. Confirm you have added the tags by choosing Save. After adding these tags, navigate to your EC2 instance in the EC2 console. Your EC2 instance should look similar to the following screenshot.
    Screenshot of how your EC2 instance should look in the console
  4. After waiting a couple of hours, you can see snapshots beginning to populate on the Snapshots page of the EC2 console.Screenshot of snapshots beginning to populate on the Snapshots page of the EC2 console
  5. To check if EBS Snapshot Scheduler is active, you can check the CloudWatch rule that runs the Lambda function. If the clock icon shown in the following screenshot is green, the scheduler is active. If the clock icon is gray, the rule is disabled and does not run. You can enable or disable the rule by selecting it, choosing Actions, and choosing Enable or Disable. This also allows you to temporarily disable EBS Snapshot Scheduler.Screenshot of checking to see if EBS Snapshot Scheduler is active
  1. You can also monitor when EBS Snapshot Scheduler has run by choosing the name of the CloudWatch rule as shown in the previous screenshot and choosing Show metrics for the rule.Screenshot of monitoring when EBS Snapshot Scheduler has run by choosing the name of the CloudWatch rule

If you want to restore and attach an EBS volume, see Restoring an Amazon EBS Volume from a Snapshot and Attaching an Amazon EBS Volume to an Instance.

Step 4: Use Amazon Inspector

In this section, I show you how to you use Amazon Inspector to scan your EC2 instance for common vulnerabilities and exposures (CVEs) and set up Amazon SNS notifications. To do this I will show you how to:

  • Install the Amazon Inspector agent by using EC2 Run Command.
  • Set up notifications using Amazon SNS to notify you of any findings.
  • Define an Amazon Inspector target and template to define what assessment to perform on your EC2 instance.
  • Schedule Amazon Inspector assessment runs to assess your EC2 instance on a regular interval.

Amazon Inspector can help you scan your EC2 instance using prebuilt rules packages, which are built and maintained by AWS. These prebuilt rules packages tell Amazon Inspector what to scan for on the EC2 instances you select. Amazon Inspector provides the following prebuilt packages for Microsoft Windows Server 2012 R2:

  • Common Vulnerabilities and Exposures
  • Center for Internet Security Benchmarks
  • Runtime Behavior Analysis

In this post, I’m focused on how to make sure you keep your EC2 instances patched, backed up, and inspected for common vulnerabilities and exposures (CVEs). As a result, I will focus on how to use the CVE rules package and use your instance tags to identify the instances on which to run the CVE rules. If your EC2 instance is fully patched using Systems Manager, as described earlier, you should not have any findings with the CVE rules package. Regardless, as a best practice I recommend that you use Amazon Inspector as an additional layer for identifying any unexpected failures. This involves using Amazon CloudWatch to set up weekly Amazon Inspector scans, and configuring Amazon Inspector to notify you of any findings through SNS topics. By acting on the notifications you receive, you can respond quickly to any CVEs on any of your EC2 instances to help ensure that malware using known CVEs does not affect your EC2 instances. In a previous blog post, Eric Fitzgerald showed how to remediate Amazon Inspector security findings automatically.

Install the Amazon Inspector agent

To install the Amazon Inspector agent, you will use EC2 Run Command, which allows you to run any command on any of your EC2 instances that have the Systems Manager agent with an attached IAM role that allows access to Systems Manager.

  1. Choose Run Command under Systems Manager Services in the navigation pane of the EC2 console. Then choose Run a command.
    Screenshot of choosing "Run a command"
  2. To install the Amazon Inspector agent, you will use an AWS managed and provided command document that downloads and installs the agent for you on the selected EC2 instance. Choose AmazonInspector-ManageAWSAgent. To choose the target EC2 instance where this command will be run, use the tag you previously assigned to your EC2 instance, Patch Group, with a value of Windows Servers. For this example, set the concurrent installations to 1 and tell Systems Manager to stop after 5 errors.
    Screenshot of installing the Amazon Inspector agent
  3. Retain the default values for all other settings on the Run a command page and choose Run. Back on the Run Command page, you can see if the command that installed the Amazon Inspector agent executed successfully on all selected EC2 instances.
    Screenshot showing that the command that installed the Amazon Inspector agent executed successfully on all selected EC2 instances

Set up notifications using Amazon SNS

Now that you have installed the Amazon Inspector agent, you will set up an SNS topic that will notify you of any findings after an Amazon Inspector run.

To set up an SNS topic:

  1. In the AWS Management Console, choose Simple Notification Service under Messaging in the Services menu.
  2. Choose Create topic, name your topic (only alphanumeric characters, hyphens, and underscores are allowed) and give it a display name to ensure you know what this topic does (I’ve named mine Inspector). Choose Create topic.
    "Create new topic" page
  3. To allow Amazon Inspector to publish messages to your new topic, choose Other topic actions and choose Edit topic policy.
  4. For Allow these users to publish messages to this topic and Allow these users to subscribe to this topic, choose Only these AWS users. Type the following ARN for the US East (N. Virginia) Region in which you are deploying the solution in this post: arn:aws:iam::316112463485:root. This is the ARN of Amazon Inspector itself. For the ARNs of Amazon Inspector in other AWS Regions, see Setting Up an SNS Topic for Amazon Inspector Notifications (Console). Amazon Resource Names (ARNs) uniquely identify AWS resources across all of AWS.
    Screenshot of editing the topic policy
  5. To receive notifications from Amazon Inspector, subscribe to your new topic by choosing Create subscription and adding your email address. After confirming your subscription by clicking the link in the email, the topic should display your email address as a subscriber. Later, you will configure the Amazon Inspector template to publish to this topic.
    Screenshot of subscribing to the new topic

Define an Amazon Inspector target and template

Now that you have set up the notification topic by which Amazon Inspector can notify you of findings, you can create an Amazon Inspector target and template. A target defines which EC2 instances are in scope for Amazon Inspector. A template defines which packages to run, for how long, and on which target.

To create an Amazon Inspector target:

  1. Navigate to the Amazon Inspector console and choose Get started. At the time of writing this blog post, Amazon Inspector is available in the US East (N. Virginia), US West (N. California), US West (Oregon), EU (Ireland), Asia Pacific (Mumbai), Asia Pacific (Seoul), Asia Pacific (Sydney), and Asia Pacific (Tokyo) Regions.
  2. For Amazon Inspector to be able to collect the necessary data from your EC2 instance, you must create an IAM service role for Amazon Inspector. Amazon Inspector can create this role for you if you choose Choose or create role and confirm the role creation by choosing Allow.
    Screenshot of creating an IAM service role for Amazon Inspector
  3. Amazon Inspector also asks you to tag your EC2 instance and install the Amazon Inspector agent. You already performed these steps in Part 1 of this post, so you can proceed by choosing Next. To define the Amazon Inspector target, choose the previously used Patch Group tag with a Value of Windows Servers. This is the same tag that you used to define the targets for patching. Then choose Next.
    Screenshot of defining the Amazon Inspector target
  4. Now, define your Amazon Inspector template, and choose a name and the package you want to run. For this post, use the Common Vulnerabilities and Exposures package and choose the default duration of 1 hour. As you can see, the package has a version number, so always select the latest version of the rules package if multiple versions are available.
    Screenshot of defining an assessment template
  5. Configure Amazon Inspector to publish to your SNS topic when findings are reported. You can also choose to receive a notification of a started run, a finished run, or changes in the state of a run. For this blog post, you want to receive notifications if there are any findings. To start, choose Assessment Templates from the Amazon Inspector console and choose your newly created Amazon Inspector assessment template. Choose the icon below SNS topics (see the following screenshot).
    Screenshot of choosing an assessment template
  6. A pop-up appears in which you can choose the previously created topic and the events about which you want SNS to notify you (choose Finding reported).
    Screenshot of choosing the previously created topic and the events about which you want SNS to notify you

Schedule Amazon Inspector assessment runs

The last step in using Amazon Inspector to assess for CVEs is to schedule the Amazon Inspector template to run using Amazon CloudWatch Events. This will make sure that Amazon Inspector assesses your EC2 instance on a regular basis. To do this, you need the Amazon Inspector template ARN, which you can find under Assessment templates in the Amazon Inspector console. CloudWatch Events can run your Amazon Inspector assessment at an interval you define using a Cron-based schedule. Cron is a well-known scheduling agent that is widely used on UNIX-like operating systems and uses the following syntax for CloudWatch Events.

Image of Cron schedule

All scheduled events use a UTC time zone, and the minimum precision for schedules is one minute. For more information about scheduling CloudWatch Events, see Schedule Expressions for Rules.

To create the CloudWatch Events rule:

  1. Navigate to the CloudWatch console, choose Events, and choose Create rule.
    Screenshot of starting to create a rule in the CloudWatch Events console
  2. On the next page, specify if you want to invoke your rule based on an event pattern or a schedule. For this blog post, you will select a schedule based on a Cron expression.
  3. You can schedule the Amazon Inspector assessment any time you want using the Cron expression, or you can use the Cron expression I used in the following screenshot, which will run the Amazon Inspector assessment every Sunday at 10:00 P.M. GMT.
    Screenshot of scheduling an Amazon Inspector assessment with a Cron expression
  4. Choose Add target and choose Inspector assessment template from the drop-down menu. Paste the ARN of the Amazon Inspector template you previously created in the Amazon Inspector console in the Assessment template box and choose Create a new role for this specific resource. This new role is necessary so that CloudWatch Events has the necessary permissions to start the Amazon Inspector assessment. CloudWatch Events will automatically create the new role and grant the minimum set of permissions needed to run the Amazon Inspector assessment. To proceed, choose Configure details.
    Screenshot of adding a target
  5. Next, give your rule a name and a description. I suggest using a name that describes what the rule does, as shown in the following screenshot.
  6. Finish the wizard by choosing Create rule. The rule should appear in the Events – Rules section of the CloudWatch console.
    Screenshot of completing the creation of the rule
  7. To confirm your CloudWatch Events rule works, wait for the next time your CloudWatch Events rule is scheduled to run. For testing purposes, you can choose your CloudWatch Events rule and choose Edit to change the schedule to run it sooner than scheduled.
    Screenshot of confirming the CloudWatch Events rule works
  8. Now navigate to the Amazon Inspector console to confirm the launch of your first assessment run. The Start time column shows you the time each assessment started and the Status column the status of your assessment. In the following screenshot, you can see Amazon Inspector is busy Collecting data from the selected assessment targets.
    Screenshot of confirming the launch of the first assessment run

You have concluded the last step of this blog post by setting up a regular scan of your EC2 instance with Amazon Inspector and a notification that will let you know if your EC2 instance is vulnerable to any known CVEs. In a previous Security Blog post, Eric Fitzgerald explained How to Remediate Amazon Inspector Security Findings Automatically. Although that blog post is for Linux-based EC2 instances, the post shows that you can learn about Amazon Inspector findings in other ways than email alerts.

Conclusion

In this two-part blog post, I showed how to make sure you keep your EC2 instances up to date with patching, how to back up your instances with snapshots, and how to monitor your instances for CVEs. Collectively these measures help to protect your instances against common attack vectors that attempt to exploit known vulnerabilities. In Part 1, I showed how to configure your EC2 instances to make it easy to use Systems Manager, EBS Snapshot Scheduler, and Amazon Inspector. I also showed how to use Systems Manager to schedule automatic patches to keep your instances current in a timely fashion. In Part 2, I showed you how to take regular snapshots of your data by using EBS Snapshot Scheduler and how to use Amazon Inspector to check if your EC2 instances running Microsoft Windows contain any common vulnerabilities and exposures (CVEs).

If you have comments about today’s or yesterday’s post, submit them in the “Comments” section below. If you have questions about or issues implementing any part of this solution, start a new thread on the Amazon EC2 forum or the Amazon Inspector forum, or contact AWS Support.

– Koen

How to Enable Caching for AWS CodeBuild

Post Syndicated from Karthik Thirugnanasambandam original https://aws.amazon.com/blogs/devops/how-to-enable-caching-for-aws-codebuild/

AWS CodeBuild is a fully managed build service. There are no servers to provision and scale, or software to install, configure, and operate. You just specify the location of your source code, choose your build settings, and CodeBuild runs build scripts for compiling, testing, and packaging your code.

A typical application build process includes phases like preparing the environment, updating the configuration, downloading dependencies, running unit tests, and finally, packaging the built artifact.

Downloading dependencies is a critical phase in the build process. These dependent files can range in size from a few KBs to multiple MBs. Because most of the dependent files do not change frequently between builds, you can noticeably reduce your build time by caching dependencies.

In this post, I will show you how to enable caching for AWS CodeBuild.

Requirements

  • Create an Amazon S3 bucket for storing cache archives (You can use existing s3 bucket as well).
  • Create a GitHub account (if you don’t have one).

Create a sample build project:

1. Open the AWS CodeBuild console at https://console.aws.amazon.com/codebuild/.

2. If a welcome page is displayed, choose Get started.

If a welcome page is not displayed, on the navigation pane, choose Build projects, and then choose Create project.

3. On the Configure your project page, for Project name, type a name for this build project. Build project names must be unique across each AWS account.

4. In Source: What to build, for Source provider, choose GitHub.

5. In Environment: How to build, for Environment image, select Use an image managed by AWS CodeBuild.

  • For Operating system, choose Ubuntu.
  • For Runtime, choose Java.
  • For Version,  choose aws/codebuild/java:openjdk-8.
  • For Build specification, select Insert build commands.

Note: The build specification file (buildspec.yml) can be configured in two ways. You can package it along with your source root directory, or you can override it by using a project environment configuration. In this example, I will use the override option and will use the console editor to specify the build specification.

6. Under Build commands, click Switch to editor to enter the build specification.

Copy the following text.

version: 0.2

phases:
  build:
    commands:
      - mvn install
      
cache:
  paths:
    - '/root/.m2/**/*'

Note: The cache section in the build specification instructs AWS CodeBuild about the paths to be cached. Like the artifacts section, the cache paths are relative to $CODEBUILD_SRC_DIR and specify the directories to be cached. In this example, Maven stores the downloaded dependencies to the /root/.m2/ folder, but other tools use different folders. For example, pip uses the /root/.cache/pip folder, and Gradle uses the /root/.gradle/caches folder. You might need to configure the cache paths based on your language platform.

7. In Artifacts: Where to put the artifacts from this build project:

  • For Type, choose No artifacts.

8. In Cache:

  • For Type, choose Amazon S3.
  • For Bucket, choose your S3 bucket.
  • For Path prefix, type cache/archives/

9. In Service role, the Create a service role in your account option will display a default role name.  You can accept the default name or type your own.

If you already have an AWS CodeBuild service role, choose Choose an existing service role from your account.

10. Choose Continue.

11. On the Review page, to run a build, choose Save and build.

Review build and cache behavior:

Let us review our first build for the project.

In the first run, where no cache exists, overall build time would look something like below (notice the time for DOWNLOAD_SOURCE, BUILD and POST_BUILD):

If you check the build logs, you will see log entries for dependency downloads. The dependencies are downloaded directly from configured external repositories. At the end of the log, you will see an entry for the cache uploaded to your S3 bucket.

Let’s review the S3 bucket for the cached archive. You’ll see the cache from our first successful build is uploaded to the configured S3 path.

Let’s try another build with the same CodeBuild project. This time the build should pick up the dependencies from the cache.

In the second run, there was a cache hit (cache was generated from the first run):

You’ll notice a few things:

  1. DOWNLOAD_SOURCE took slightly longer. Because, in addition to the source code, this time the build also downloaded the cache from user’s s3 bucket.
  2. BUILD time was faster. As the dependencies didn’t need to get downloaded, but were reused from cache.
  3. POST_BUILD took slightly longer, but was relatively the same.

Overall, build duration was improved with cache.

Best practices for cache

  • By default, the cache archive is encrypted on the server side with the customer’s artifact KMS key.
  • You can expire the cache by manually removing the cache archive from S3. Alternatively, you can expire the cache by using an S3 lifecycle policy.
  • You can override cache behavior by updating the project. You can use the AWS CodeBuild the AWS CodeBuild console, AWS CLI, or AWS SDKs to update the project. You can also invalidate cache setting by using the new InvalidateProjectCache API. This API forces a new InvalidationKey to be generated, ensuring that future builds receive an empty cache. This API does not remove the existing cache, because this could cause inconsistencies with builds currently in flight.
  • The cache can be enabled for any folders in the build environment, but we recommend you only cache dependencies/files that will not change frequently between builds. Also, to avoid unexpected application behavior, don’t cache configuration and sensitive information.

Conclusion

In this blog post, I showed you how to enable and configure cache setting for AWS CodeBuild. As you see, this can save considerable build time. It also improves resiliency by avoiding external network connections to an artifact repository.

I hope you found this post useful. Feel free to leave your feedback or suggestions in the comments.

B2 Cloud Storage Roundup

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/b2-cloud-storage-roundup/

B2 Integrations
Over the past several months, B2 Cloud Storage has continued to grow like we planted magic beans. During that time we have added a B2 Java SDK, and certified integrations with GoodSync, Arq, Panic, UpdraftPlus, Morro Data, QNAP, Archiware, Restic, and more. In addition, B2 customers like Panna Cooking, Sermon Audio, and Fellowship Church are happy they chose B2 as their cloud storage provider. If any of that sounds interesting, read on.

The B2 Java SDK

While the Backblaze B2 API is well documented and straight-forward to implement, we were asked by a few of our Integration Partners if we had an SDK they could use. So we developed one as an open-course project on GitHub, where we hope interested parties will not only use our Java SDK, but make it better for everyone else.

There are different reasons one might use the Java SDK, but a couple of areas where the SDK can simplify the coding process are:

Expiring Authorization — B2 requires an application key for a given account be reissued once a day when using the API. If the application key expires while you are in the middle of transferring files or some other B2 activity (bucket list, etc.), the SDK can be used to detect and then update the application key on the fly. Your B2 related activities will continue without incident and without having to capture and code your own exception case.

Error Handling — There are different types of error codes B2 will return, from expired application keys to detecting malformed requests to command time-outs. The SDK can dramatically simplify the coding needed to capture and account for the various things that can happen.

While Backblaze has created the Java SDK, developers in the GitHub community have also created other SDKs for B2, for example, for PHP (https://github.com/cwhite92/b2-sdk-php,) and Go (https://github.com/kurin/blazer.) Let us know in the comments about other SDKs you’d like to see or perhaps start your own GitHub project. We will publish any updates in our next B2 roundup.

What You Can Do with Affordable and Available Cloud Storage

You’re probably aware that B2 is up to 75% less expensive than other similar cloud storage services like Amazon S3 and Microsoft Azure. Businesses and organizations are finding that projects that previously weren’t economically feasible with other Cloud Storage services are now not only possible, but a reality with B2. Here are a few recent examples:

SermonAudio logoSermonAudio wanted their media files to be readily available, but didn’t want to build and manage their own internal storage farm. Until B2, cloud storage was just too expensive to use. Now they use B2 to store their audio and video files, and also as the primary source of downloads and streaming requests from their subscribers.
Fellowship Church logoFellowship Church wanted to escape from the ever increasing amount of time they were spending saving their data to their LTO-based system. Using B2 saved countless hours of personnel time versus LTO, fit easily into their video processing workflow, and provided instant access at any time to their media library.
Panna logoPanna Cooking replaced their closet full of archive hard drives with a cost-efficient hybrid-storage solution combining 45Drives and Backblaze B2 Cloud Storage. Archived media files that used to take hours to locate are now readily available regardless of whether they reside in local storage or in the B2 Cloud.

B2 Integrations

Leading companies in backup, archive, and sync continue to add B2 Cloud Storage as a storage destination for their customers. These companies realize that by offering B2 as an option, they can dramatically lower the total cost of ownership for their customers — and that’s always a good thing.

If your favorite application is not integrated to B2, you can do something about it. One integration partner told us they received over 200 customer requests for a B2 integration. The partner got the message and the integration is currently in beta test.

Below are some of the partner integrations completed in the past few months. You can check the B2 Partner Integrations page for a complete list.

Archiware — Both P5 Archive and P5 Backup can now store data in the B2 Cloud making your offsite media files readily available while keeping your off-site storage costs predictable and affordable.

Arq — Combine Arq and B2 for amazingly affordable backup of external drives, network drives, NAS devices, Windows PCs, Windows Servers, and Macs to the cloud.

GoodSync — Automatically synchronize and back up all your photos, music, email, and other important files between all your desktops, laptops, servers, external drives, and sync, or back up to B2 Cloud Storage for off-site storage.

QNAP — QNAP Hybrid Backup Sync consolidates backup, restoration, and synchronization functions into a single QTS application to easily transfer your data to local, remote, and cloud storage.

Morro Data — Their CloudNAS solution stores files in the cloud, caches them locally as needed, and syncs files globally among other CloudNAS systems in an organization.

Restic – Restic is a fast, secure, multi-platform command line backup program. Files are uploaded to a B2 bucket as de-duplicated, encrypted chunks. Each backup is a snapshot of only the data that has changed, making restores of a specific date or time easy.

Transmit 5 by Panic — Transmit 5, the gold standard for macOS file transfer apps, now supports B2. Upload, download, and manage files on tons of servers with an easy, familiar, and powerful UI.

UpdraftPlus — WordPress developers and admins can now use the UpdraftPlus Premium WordPress plugin to affordably back up their data to the B2 Cloud.

Getting Started with B2 Cloud Storage

If you’re using B2 today, thank you. If you’d like to try B2, but don’t know where to start, here’s a guide to getting started with the B2 Web Interface — no programming or scripting is required. You get 10 gigabytes of free storage and 1 gigabyte a day in free downloads. Give it a try.

The post B2 Cloud Storage Roundup appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}  
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}

load_sdk()
## NULL

Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),
                                   aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field Description
yearYear that song was released
songtitleTitle of the song
artistnameName of the song artist
songidUnique identifier for the song
artistidUnique identifier for the song artist
timesignatureVariable estimating the time signature of the song
timesignature_confidenceConfidence in the estimate for the timesignature
loudnessContinuous variable indicating the average amplitude of the audio in decibels
tempoVariable indicating the estimated beats per minute of the song
tempo_confidenceConfidence in the estimate for tempo
keyVariable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidenceConfidence in the estimate for key
energyVariable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitchContinuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_minVariables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_maxVariables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(h2o.performance(modelh1,test.h2o)) 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o)) 
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

CreateModel 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
## 
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## 
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
## 
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
## 
##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
## 
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
                                      distribution="multinomial"
  )
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

MetricDescription
SensitivityMeasures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
SpecificityMeasures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
ThresholdCutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
PrecisionThe fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC

Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.90 – 1 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

.6 – 0.7 = poor (D)

0.5 – 0.5 = fail (F)

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric Model 3GBM ModelDeep Learning Model

Accuracy

(max)

0.882038

(t=0.435479)

0.876676

(t=0.442757)

0.865952

(t=0.999999)

Precision

(max)

1.0

(t=0.821606)

1.0

(t=0802184)

1.0

(t=1.0)

Recall

(max)

1.01.0

1.0

(t=0)

Specificity

(max)

1.01.0

1.0

(t=1)

Sensitivity

 

0.20338980.1355932

0.3898305

(t=0.5)

AUC0.84923890.86305730.756882

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to build and explore a simple geospita simple GEOINT application using SparkR.


About the Authors

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

 

 

Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.

 

 

Microcell through a mobile hotspot

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/10/microcell-through-mobile-hotspot.html

I accidentally acquired a tree farm 20 minutes outside of town. For utilities, it gets electricity and basic phone. It doesn’t get water, sewer, cable, or DSL (i.e. no Internet). Also, it doesn’t really get cell phone service. While you can get SMS messages up there, you usually can’t get a call connected, or hold a conversation if it does.

We have found a solution — an evil solution. We connect an AT&T “Microcell“, which provides home cell phone service through your Internet connection, to an AT&T Mobile Hotspot, which provides an Internet connection through your cell phone service.

Now, you may be laughing at this, because it’s a circular connection. It’s like trying to make a sailboat go by blowing on the sails, or lifting up a barrel to lighten the load in the boat.

But it actually works.

Since we get some, but not enough, cellular signal, we setup a mast 20 feet high with a directional antenna pointed to the cell tower 7.5 miles to the southwest, connected to a signal amplifier. It’s still an imperfect solution, as we are still getting terrain distortions in the signal, but it provides a good enough signal-to-noise ratio to get a solid connection.

We then connect that directional antenna directly to a high-end Mobile Hotspot. This gives us a solid 2mbps connection with a latency under 30milliseconds. This is far lower than the 50mbps you can get right next to a 4G/LTE tower, but it’s still pretty good for our purposes.

We then connect the AT&T Microcell to the Mobile Hotspot, via WiFi.

To avoid the circular connection, we lock the frequencies for the Mobile Hotspot to 4G/LTE, and to 3G for the Microcell. This prevents the Mobile Hotspot locking onto the strong 3G signal from the Microcell. It also prevents the two from causing noise to the other.

This works really great. We now get a strong cell signal on our phones even 400 feet from the house through some trees. We can be all over the property, out in the lake, down by the garden, and so on, and have our phones work as normal. It’s only AT&T, but that’s what the whole family uses.

You might be asking why we didn’t just use a normal signal amplifier, like they use on corporate campus. It boosts all the analog frequencies, making any cell phone service works.

We’ve tried this, and it works a bit, allowing cell phones to work inside the house pretty well. But they don’t work outside the house, which is where we spend a lot of time. In addition, while our newer phones work, my sister’s iPhone 5 doesn’t. We have no idea what’s going on. Presumably, we could hire professional installers and stuff to get everything working, but nobody would quote us a price lower than $25,000 to even come look at the property.

Another possible solution is satellite Internet. There are two satellites in orbit that cover the United States with small “spot beams” delivering high-speed service (25mbps downloads). However, the latency is 500milliseconds, which makes it impractical for low-latency applications like phone calls.

While I know a lot about the technology in theory, I find myself hopelessly clueless in practice. I’ve been playing with SDR (“software defined radio”) to try to figure out exactly where to locate and point the directional antenna, but I’m not sure I’ve come up with anything useful. In casual tests, it seems rotating the antenna from vertical to horizontal increases the signal-to-noise ratio a bit, which seems counter intuitive, and should not happen. So I’m completely lost.

Anyway, I thought I’d write this up as a blogpost, in case anybody has better suggestion. Or, instead of signals, suggestions to get wired connectivity. Properties a half mile away get DSL, I wish I knew who to talk to at the local phone company to pay them money to extend Internet to our property.

Phone works in all this area now

Raspbian Stretch has arrived for Raspberry Pi

Post Syndicated from Simon Long original https://www.raspberrypi.org/blog/raspbian-stretch/

It’s now just under two years since we released the Jessie version of Raspbian. Those of you who know that Debian run their releases on a two-year cycle will therefore have been wondering when we might be releasing the next version, codenamed Stretch. Well, wonder no longer – Raspbian Stretch is available for download today!

Disney Pixar Toy Story Raspbian Stretch Raspberry Pi

Debian releases are named after characters from Disney Pixar’s Toy Story trilogy. In case, like me, you were wondering: Stretch is a purple octopus from Toy Story 3. Hi, Stretch!

The differences between Jessie and Stretch are mostly under-the-hood optimisations, and you really shouldn’t notice any differences in day-to-day use of the desktop and applications. (If you’re really interested, the technical details are in the Debian release notes here.)

However, we’ve made a few small changes to our image that are worth mentioning.

New versions of applications

Version 3.0.1 of Sonic Pi is included – this includes a lot of new functionality in terms of input/output. See the Sonic Pi release notes for more details of exactly what has changed.

Raspbian Stretch Raspberry Pi

The Chromium web browser has been updated to version 60, the most recent stable release. This offers improved memory usage and more efficient code, so you may notice it running slightly faster than before. The visual appearance has also been changed very slightly.

Raspbian Stretch Raspberry Pi

Bluetooth audio

In Jessie, we used PulseAudio to provide support for audio over Bluetooth, but integrating this with the ALSA architecture used for other audio sources was clumsy. For Stretch, we are using the bluez-alsa package to make Bluetooth audio work with ALSA itself. PulseAudio is therefore no longer installed by default, and the volume plugin on the taskbar will no longer start and stop PulseAudio. From a user point of view, everything should still work exactly as before – the only change is that if you still wish to use PulseAudio for some other reason, you will need to install it yourself.

Better handling of other usernames

The default user account in Raspbian has always been called ‘pi’, and a lot of the desktop applications assume that this is the current user. This has been changed for Stretch, so now applications like Raspberry Pi Configuration no longer assume this to be the case. This means, for example, that the option to automatically log in as the ‘pi’ user will now automatically log in with the name of the current user instead.

One other change is how sudo is handled. By default, the ‘pi’ user is set up with passwordless sudo access. We are no longer assuming this to be the case, so now desktop applications which require sudo access will prompt for the password rather than simply failing to work if a user without passwordless sudo uses them.

Scratch 2 SenseHAT extension

In the last Jessie release, we added the offline version of Scratch 2. While Scratch 2 itself hasn’t changed for this release, we have added a new extension to allow the SenseHAT to be used with Scratch 2. Look under ‘More Blocks’ and choose ‘Add an Extension’ to load the extension.

This works with either a physical SenseHAT or with the SenseHAT emulator. If a SenseHAT is connected, the extension will control that in preference to the emulator.

Raspbian Stretch Raspberry Pi

Fix for Broadpwn exploit

A couple of months ago, a vulnerability was discovered in the firmware of the BCM43xx wireless chipset which is used on Pi 3 and Pi Zero W; this potentially allows an attacker to take over the chip and execute code on it. The Stretch release includes a patch that addresses this vulnerability.

There is also the usual set of minor bug fixes and UI improvements – I’ll leave you to spot those!

How to get Raspbian Stretch

As this is a major version upgrade, we recommend using a clean image; these are available from the Downloads page on our site as usual.

Upgrading an existing Jessie image is possible, but is not guaranteed to work in every circumstance. If you wish to try upgrading a Jessie image to Stretch, we strongly recommend taking a backup first – we can accept no responsibility for loss of data from a failed update.

To upgrade, first modify the files /etc/apt/sources.list and /etc/apt/sources.list.d/raspi.list. In both files, change every occurrence of the word ‘jessie’ to ‘stretch’. (Both files will require sudo to edit.)

Then open a terminal window and execute

sudo apt-get update
sudo apt-get -y dist-upgrade

Answer ‘yes’ to any prompts. There may also be a point at which the install pauses while a page of information is shown on the screen – hold the ‘space’ key to scroll through all of this and then hit ‘q’ to continue.

Finally, if you are not using PulseAudio for anything other than Bluetooth audio, remove it from the image by entering

sudo apt-get -y purge pulseaudio*

The post Raspbian Stretch has arrived for Raspberry Pi appeared first on Raspberry Pi.

Backblaze Cloud Backup 5.0: The Rapid Access Release

Post Syndicated from Yev original https://www.backblaze.com/blog/cloud-backup-5-0-rapid-access/

Announcing Backblaze Cloud Backup 5.0: the Rapid Access Release. We’ve been at the backup game for a long time now, and we continue to focus on providing the best unlimited backup service on the planet. A lot of the features in this release have come from listening to our customers about how they want to use their data. “Rapid Access” quickly became the theme because, well, we’re all acquiring more and more data and want to access it in a myriad of ways.

This release brings a lot of new functionality to Backblaze Computer Backup: faster backups, accelerated file browsing, image preview, individual file download (without creating a “restore”), and file sharing. To top it all off, we’ve refreshed the user interface on our client app. We hope you like it!

Speeding Things Up

New code + new hardware + elbow grease = things are going to move much faster.

Faster Backups

We’ve doubled the number of threads available for backup on both Mac and PC . This gives our service the ability to intelligently detect the right settings for you (based on your computer, capacity, and bandwidth). As always, you can manually set the number of threads — keep in mind that if you have a slow internet connection, adding threads might have the opposite effect and slow you down. On its default settings, our client app will now automatically evaluate what’s best given your environment. We’ve internally tested our service backing up at over 100 Mbps, which means if you have a fast-enough internet connection, you could back up 50 GB in just one hour.

Faster Browsing

We’ve introduced a number of enhancements that increase file browsing speed by 3x. Hidden files are no longer displayed by default, but you can still show them with one click on the restore page. This gives the restore interface a cleaner look, and helps you navigate backup history if you need to roll back time.

Faster Restore Preparation

We take pride in providing a variety of ways for consumers to get their data back. When something has happened to your computer, getting your files back quickly is critical. Both web download restores and Restore by Mail will now be much faster. In some cases up to 10x faster!

Preview — Access — Share

Our system has received a number of enhancements — all intended to give you more access to your data.

Image Preview

If you have a lot of photos, this one’s for you. When you go to the restore page you’ll now be able to click on each individual file that we have backed up, and if it’s an image you’ll see a preview of that file. We hope this helps people figure out which pictures they want to download (this especially helps people with a lot of photos named something along the lines of: 2017-04-20-9783-41241.jpg). Now you can just click on the picture to preview it.

Access

Once you’ve clicked on a file (30MB and smaller), you’ll be able to individually download that file directly in your browser. You’ll no longer need to wait for a single-file restore to be built and zipped up; you’ll be able to download it quickly and easily. This was a highly requested feature and we’re stoked to get it implemented.

Share

We’re leveraging Backblaze B2 Cloud Storage and giving folks the ability to publicly share their files. In order to use this feature, you’ll need to enable Backblaze B2 on your account (if you haven’t already, there’s a simple wizard that will pop up the first time you try to share a file). Files can be shared anywhere in the world via URL. All B2 accounts have 10GB/month of storage and 1GB/day of downloads (equivalent to sharing an iPhone photo 1,000 times per month) for free. You can increase those limits in your B2 Settings. Keep in mind that any file you share will be accessible to anybody with the link. Learn more about File Sharing.

For now, we’ve limited the Preview/Access/Share functionality to files 30MB and smaller, but larger files will be supported in the coming weeks!

Other Goodies

In addition to adding 2FV via ToTP, we’ve also been hard at work on the client. In version 5.0 we’ve touched up the user interface to make it a bit more lively, and we’ve also made the client IPv6 compatible.

Backblaze 5.0 Available: August 10, 2017

We will slowly be auto-updating all users in the coming weeks. To update now:

This version is now the default download on www.backblaze.com.

We hope you enjoy Backblaze Cloud Backup v5.0!

The post Backblaze Cloud Backup 5.0: The Rapid Access Release appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Turbocharge your Apache Hive queries on Amazon EMR using LLAP

Post Syndicated from Jigar Mistry original https://aws.amazon.com/blogs/big-data/turbocharge-your-apache-hive-queries-on-amazon-emr-using-llap/

Apache Hive is one of the most popular tools for analyzing large datasets stored in a Hadoop cluster using SQL. Data analysts and scientists use Hive to query, summarize, explore, and analyze big data.

With the introduction of Hive LLAP (Low Latency Analytical Processing), the notion of Hive being just a batch processing tool has changed. LLAP uses long-lived daemons with intelligent in-memory caching to circumvent batch-oriented latency and provide sub-second query response times.

This post provides an overview of Hive LLAP, including its architecture and common use cases for boosting query performance. You will learn how to install and configure Hive LLAP on an Amazon EMR cluster and run queries on LLAP daemons.

What is Hive LLAP?

Hive LLAP was introduced in Apache Hive 2.0, which provides very fast processing of queries. It uses persistent daemons that are deployed on a Hadoop YARN cluster using Apache Slider. These daemons are long-running and provide functionality such as I/O with DataNode, in-memory caching, query processing, and fine-grained access control. And since the daemons are always running in the cluster, it saves substantial overhead of launching new YARN containers for every new Hive session, thereby avoiding long startup times.

When Hive is configured in hybrid execution mode, small and short queries execute directly on LLAP daemons. Heavy lifting (like large shuffles in the reduce stage) is performed in YARN containers that belong to the application. Resources (CPU, memory, etc.) are obtained in a traditional fashion using YARN. After the resources are obtained, the execution engine can decide which resources are to be allocated to LLAP, or it can launch Apache Tez processors in separate YARN containers. You can also configure Hive to run all the processing workloads on LLAP daemons for querying small datasets at lightning fast speeds.

LLAP daemons are launched under YARN management to ensure that the nodes don’t get overloaded with the compute resources of these daemons. You can use scheduling queues to make sure that there is enough compute capacity for other YARN applications to run.

Why use Hive LLAP?

With many options available in the market (Presto, Spark SQL, etc.) for doing interactive SQL  over data that is stored in Amazon S3 and HDFS, there are several reasons why using Hive and LLAP might be a good choice:

  • For those who are heavily invested in the Hive ecosystem and have external BI tools that connect to Hive over JDBC/ODBC connections, LLAP plugs in to their existing architecture without a steep learning curve.
  • It’s compatible with existing Hive SQL and other Hive tools, like HiveServer2, and JDBC drivers for Hive.
  • It has native support for security features with authentication and authorization (SQL standards-based authorization) using HiveServer2.
  • LLAP daemons are aware about of the columns and records that are being processed which enables you to enforce fine-grained access control.
  • It can use Hive’s vectorization capabilities to speed up queries, and Hive has better support for Parquet file format when vectorization is enabled.
  • It can take advantage of a number of Hive optimizations like merging multiple small files for query results, automatically determining the number of reducers for joins and groupbys, etc.
  • It’s optional and modular so it can be turned on or off depending on the compute and resource requirements of the cluster. This lets you to run other YARN applications concurrently without reserving a cluster specifically for LLAP.

How do you install Hive LLAP in Amazon EMR?

To install and configure LLAP on an EMR cluster, use the following bootstrap action (BA):

s3://aws-bigdata-blog/artifacts/Turbocharge_Apache_Hive_on_EMR/configure-Hive-LLAP.sh

This BA downloads and installs Apache Slider on the cluster and configures LLAP so that it works with EMR Hive. For LLAP to work, the EMR cluster must have Hive, Tez, and Apache Zookeeper installed.

You can pass the following arguments to the BA.

ArgumentDefinitionDefault value
--instancesNumber of instances of LLAP daemonNumber of core/task nodes of the cluster
--cacheCache size per instance20% of physical memory of the node
--executorsNumber of executors per instanceNumber of CPU cores of the node
--iothreadsNumber of IO threads per instanceNumber of CPU cores of the node
--sizeContainer size per instance50% of physical memory of the node
--xmxWorking memory size50% of container size
--log-levelLog levels for the LLAP instanceINFO

LLAP example

This section describes how you can try the faster Hive queries with LLAP using the TPC-DS testbench for Hive on Amazon EMR.

Use the following AWS command line interface (AWS CLI) command to launch a 1+3 nodes m4.xlarge EMR 5.6.0 cluster with the bootstrap action to install LLAP:

aws emr create-cluster --release-label emr-5.6.0 \
--applications Name=Hadoop Name=Hive Name=Hue Name=ZooKeeper Name=Tez \
--bootstrap-actions '[{"Path":"s3://aws-bigdata-blog/artifacts/Turbocharge_Apache_Hive_on_EMR/configure-Hive-LLAP.sh","Name":"Custom action"}]' \ 
--ec2-attributes '{"KeyName":"<YOUR-KEY-PAIR>","InstanceProfile":"EMR_EC2_DefaultRole","SubnetId":"subnet-xxxxxxxx","EmrManagedSlaveSecurityGroup":"sg-xxxxxxxx","EmrManagedMasterSecurityGroup":"sg-xxxxxxxx"}' 
--service-role EMR_DefaultRole \
--enable-debugging \
--log-uri 's3n://<YOUR-BUCKET/' --name 'test-hive-llap' \
--instance-groups '[{"InstanceCount":1,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}],"EbsOptimized":true},"InstanceGroupType":"MASTER","InstanceType":"m4.xlarge","Name":"Master - 1"},{"InstanceCount":3,"EbsConfiguration":{"EbsBlockDeviceConfigs":[{"VolumeSpecification":{"SizeInGB":32,"VolumeType":"gp2"},"VolumesPerInstance":1}],"EbsOptimized":true},"InstanceGroupType":"CORE","InstanceType":"m4.xlarge","Name":"Core - 2"}]' 
--region us-east-1

After the cluster is launched, log in to the master node using SSH, and do the following:

  1. Open the hive-tpcds folder:
    cd /home/hadoop/hive-tpcds/
  2. Start Hive CLI using the testbench configuration, create the required tables, and run the sample query:

    hive –i testbench.settings
    hive> source create_tables.sql;
    hive> source query55.sql;

    This sample query runs on a 40 GB dataset that is stored on Amazon S3. The dataset is generated using the data generation tool in the TPC-DS testbench for Hive.It results in output like the following:
  3. This screenshot shows that the query finished in about 47 seconds for LLAP mode. Now, to compare this to the execution time without LLAP, you can run the same workload using only Tez containers:
    hive> set hive.llap.execution.mode=none;
    hive> source query55.sql;


    This query finished in about 80 seconds.

The difference in query execution time is almost 1.7 times when using just YARN containers in contrast to running the query on LLAP daemons. And with every rerun of the query, you notice that the execution time substantially decreases by the virtue of in-memory caching by LLAP daemons.

Conclusion

In this post, I introduced Hive LLAP as a way to boost Hive query performance. I discussed its architecture and described several use cases for the component. I showed how you can install and configure Hive LLAP on an Amazon EMR cluster and how you can run queries on LLAP daemons.

If you have questions about using Hive LLAP on Amazon EMR or would like to share your use cases, please leave a comment below.


Additional Reading

Learn how to to automatically partition Hive external tables with AWS.


About the Author

Jigar Mistry is a Hadoop Systems Engineer with Amazon Web Services. He works with customers to provide them architectural guidance and technical support for processing large datasets in the cloud using open-source applications. In his spare time, he enjoys going for camping and exploring different restaurants in the Seattle area.

 

 

 

 

Transparency in Cloud Storage Costs

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/transparency-in-cloud-storage-costs/

cloud storage cost calculator

Backblaze’s mission is to make cloud storage that’s affordable and astonishingly easy to use. Backblaze B2 embodies that mission for those looking for an object storage solution.

Another Backblaze core value is being transparent, from releasing our Storage Pod designs to detailing our cloud storage cost of goods sold. We are an open book in the Cloud Storage industry. So it makes sense that opaque pricing policies that require mind numbing calculations are a no-no for us. Our approach to pricing is to be transparent, straight-forward, and predictable.

For Backblaze B2, this means that no matter how much data you have, the cost for B2 is $0.005/GB per month for data storage and $0.02/GB to download data. There are no costs to upload. We also throw in 10GB of storage and 1GB of downloads for free every month.

Cloud Storage Price Comparison

The storage industry does not share our view of making pricing transparent, or affordable. In an effort to help everyone, we’ve made a Cloud Storage Pricing Calculator, where anyone can enter in their specific use case and get pricing back for B2, S3, Azure, and GCS. We’ve also included the calculator below for those interested in trying it out.

B2 Cost Calculator

Backblaze provides this calculator as an estimate.

Initial Upload:

GB

Data over time

Monthly Upload:

GB

Monthly Delete:

GB

Monthly Download:

GB


Period of Time:

Months

Storage Costs

Storage Cost for Initial Month:
x

Data Added Each Month:
x

Data Deleted Each Month:
x

Net Data:
x

Download Costs

Monthly Download Cost:
x

Total

Total Cost for x Months
x

Amazon S3
Microsoft Azure
Google Cloud

x
x
x
x
x
x
* Figures are not exact and do not include the following: Free first 10 GB of storage, free 1 GB of daily downloads, or $.004/10,000 class B Transactions and $.004/1,000 Class C Transactions.

Sample storage scenarios:

Scenario 1

You have data you wish to archive, and will be adding more each month, but you don’t expect that you will be downloading or deleting any data.

Initial upload:10,000GB
Monthly upload:1,000GB

For twelve months, your costs would be:

Backblaze B2$990.00
Amazon S3$4,158.00+420%
Microsoft Azure$4,356.00+440%
Google Cloud$5,148.00+520%

 

Scenario 2

You wish to store data, and will be actively changing that data with uploads, downloads, and deletions.

Initial upload:10,000GB
Monthly upload:2,000GB
Monthly deletion:1,000GB
Monthly download:500GB

Your costs for 12 months would be:

Backblaze B2$1,100.00
Amazon S3$3.458/00+402%
Microsoft Azure$4,656.00+519%
Google Cloud$5,628.00+507%

We invite you to compare our cost estimates against the competition. Here are the links to our competitors’ pricing calculators.

B2 Cloud Storage Pricing Summary

Provider
Storage
($/GB/Month)

Download
($/GB)
$0.005$0.02
$0.021
+420%
$0.05+
+250%
$0.022+
+440%
$0.05+
+250%
$0.026
+520%
$0.08+
+400%

The Details


STORAGE
$0.005/GB/Month
How much data you have stored with Backblaze. This is calculated once a day based on the average storage of the previous 24 hours.
The first 10 GB of storage is free.

DOWNLOAD
$0.02/GB
Charged when you download files and charged when you create a Snapshot. Charged for any portion of a GB. The first 1 GB of data downloaded each day is free.

TRANSACTIONS
Class “A” transactions – Free
Class “B” transactions – $0.004 per 10,000 with 2,500 free per day.
Class “C” transactions – $0.004 per 1,000 with 2,500 free per day.
View Transactions by API Call

DATA BY MAIL
Mail us your data on a B2 Fireball – $550
Backblaze will mail your data to you by FedEx:
• USB Flash Drive – up to 110 GB – $89
• USB Hard Drive – up to 3.5TB of data – $189

PRODUCT SUPPORT
All B2 active account owners can contact Backblaze support at help.backblaze.com where they will also find a free-to- use knowledge base of B2 advice, guides, and more. In addition, a B2 user can pay to upgrade their support plan to include phone service, 24×7 support and more.

EVERYTHING ELSE
Free
Unlike other services, you won’t be nickeled and dimed with upload fees, file deletion charges, minimum files size requirements, and more. Everything you can possibly pay Backblaze is listed above.

 

Visit our B2 Cloud Storage Pricing web page for more details.


Amazon S3
Storage Costs
Initial upload cost:
x
Data added each month:
x

Data del. each month:
x

Net data:
x

Download Costs

Monthly Download Cost:
x

Total

Total Cost for x Months
x

Microsoft
Storage Costs
Initial upload cost:
x
Data added each month:
x

Data del. each month:
x

Net data:
x

Download Costs

Monthly Download Cost:
x

Total

Total Cost for x Months
x

Google
Storage Costs
Initial upload cost:
x
Data added each month:
x

Data del. each month:
x

Net data:
x

Download Costs

Monthly Download Cost:
x

Total

Total Cost for x Months
x

The post Transparency in Cloud Storage Costs appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

A new twist on data backup: CloudNAS

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/cloudnas-backup/

Morro CacheDrive

There are many ways for SMBs, professionals, and advanced users to back up their data. The process can be as simple as copying files to a flash drive or an external drive, or as sophisticated as using a Synology or QNAP NAS device as your primary storage device and syncing the files to a cloud storage service such as Backblaze B2.

A recent entry into the backup arena is Morro Data and their CloudNAS solution, where files are stored in the cloud, cached locally as needed, and synced globally among the other CloudNAS systems in a given organization. There are three components to the solution:

  • A Morro CacheDrive — This resides on your internal network like a NAS device and stores from 1- to 8 TB of data depending on the model
  • The CloudNAS service — This software runs on the Morro CacheDrive to keep track of and manage the data
  • Backblaze B2 Cloud Storage — Where the data is stored in the cloud

The Morro CacheDrive is installed on your local network and looks like a network share. On Windows, the share can be mounted as a letter device, M:, for example. On the Mac, the device is mounted as a Shared device (Databank in the example below).

CloudNAS software dashboard

In either case, the device works like a folder/directory, typically on your desktop. You then either drag-and-drop or save a file to the folder/directory. This places the file on the CacheDrive. Once there, the file is automatically backed up to the cloud. In the case of CloudNAS solution, that cloud is Backblaze B2.

All that sounds pretty straight-forward, but what makes the CloudNAS solution unique is the solution allows you to have unlimited storage space. For example, you can access 5 TB of data from a 1 TB CacheDrive. Confused? Let me explain. All 5 TB of the data is stored in B2, having been uploaded to B2 each time you stored data on the CacheDrive. The 1 TB CacheDrive keeps (caches) the most recent or most often used files on the CacheDrive. When you need a file not currently stored on the CacheDrive, the CloudNAS software automatically downloads the file from the B2 cloud to the CacheDrive and makes it available to use as desired.

Things to know about the CloudNAS solution

  • Sharing Systems: Multiple users can mount the same CacheDrive with each being able to update and share the files.
  • Synced Systems: If you have two or more CloudNAS systems on your network, they will keep the B2 directory of files synced between all of the systems. Everyone on the network sees the same file list.
  • Unlimited Data: Regardless of the size of the CacheDrive device you purchase, you will not run out of space as Backblaze B2 will contain all of your data. That said, you should choose the size of your CacheDrive that fits your operational environment.
  • Network Speed: Files are initially stored on the CacheDrive, then copied to B2. Local network connections are typically much faster than internet network speeds. This means your files are uploaded to the CacheDrive fast then transferred to B2 as time allows at the speed of your internet connection, all without slowing you down. This should be interesting to those of you who have slower internet connections.
  • Access: The files stored using the Cloud NAS solution can be accessed through the shared folder/directory on your desktop as well as through a web-based Team Portal.

Getting Started

To start, you purchase a Morro CacheDrive. The price starts at $499.00 for a unit with 1 TB of cache storage. Next you choose a CloudNAS subscription. This starts at $10/month for the Standard plan, and lets you manage up to 10 TB of data. Finally, you connect Backblaze B2 to the Morro system to finish the set-up process. You pay Backblaze each month for the data you store in and download from B2 while using the Morro solution.

The CloudNAS solution is certainly a different approach to storing your data. You get the ability to store a nearly unlimited amount of data without having to upgrade your hardware as you go, and all of your data is readily available with just a few clicks. For users who need to store terabytes of data that needs be available anytime, the CloudNAS solution is worth a look.

The post A new twist on data backup: CloudNAS appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

AWS Hot Startups – July 2017

Post Syndicated from Tina Barr original https://aws.amazon.com/blogs/aws/aws-hot-startups-july-2017/

Welcome back to another month of Hot Startups! Every day, startups are creating innovative and exciting businesses, applications, and products around the world. Each month we feature a handful of startups doing cool things using AWS.

July is all about learning! These companies are focused on providing access to tools and resources to expand knowledge and skills in different ways.

This month’s startups:

  • CodeHS – provides fun and accessible computer science curriculum for middle and high schools.
  • Insight – offers intensive fellowships to grow technical talent in Data Science.
  • iTranslate – enables people to read, write, and speak in over 90 languages, anywhere in the world.

CodeHS (San Francisco, CA)

In 2012, Stanford students Zach Galant and Jeremy Keeshin were computer science majors and TAs for introductory classes when they noticed a trend among their peers. Many wished that they had been exposed to computer science earlier in life. In their senior year, Zach and Jeremy launched CodeHS to give middle and high schools the opportunity to provide a fun, accessible computer science education to students everywhere. CodeHS is a web-based curriculum pathway complete with teacher resources, lesson plans, and professional development opportunities. The curriculum is supplemented with time-saving teacher tools to help with lesson planning, grading and reviewing student code, and managing their classroom.

CodeHS aspires to empower all students to meaningfully impact the future, and believe that coding is becoming a new foundational skill, along with reading and writing, that allows students to further explore any interest or area of study. At the time CodeHS was founded in 2012, only 10% of high schools in America offered a computer science course. Zach and Jeremy set out to change that by providing a solution that made it easy for schools and districts to get started. With CodeHS, thousands of teachers have been trained and are teaching hundreds of thousands of students all over the world. To use CodeHS, all that’s needed is the internet and a web browser. Students can write and run their code online, and teachers can immediately see what the students are working on and how they are doing.

Amazon EC2, Amazon RDS, Amazon ElastiCache, Amazon CloudFront, and Amazon S3 make it possible for CodeHS to scale their site to meet the needs of schools all over the world. CodeHS also relies on AWS to compile and run student code in the browser, which is extremely important when teaching server-side languages like Java that powers the AP course. Since usage rises and falls based on school schedules, Amazon CloudWatch and ELBs are used to easily scale up when students are running code so they have a seamless experience.

Be sure to visit the CodeHS website, and to learn more about bringing computer science to your school, click here!

Insight (Palo Alto, CA)

Insight was founded in 2012 to create a new educational model, optimize hiring for data teams, and facilitate successful career transitions among data professionals. Over the last 5 years, Insight has kept ahead of market trends and launched a series of professional training fellowships including Data Science, Health Data Science, Data Engineering, and Artificial Intelligence. Finding individuals with the right skill set, background, and culture fit is a challenge for big companies and startups alike, and Insight is focused on developing top talent through intensive 7-week fellowships. To date, Insight has over 1,000 alumni at over 350 companies including Amazon, Google, Netflix, Twitter, and The New York Times.

The Data Engineering team at Insight is well-versed in the current ecosystem of open source tools and technologies and provides mentorship on the best practices in this space. The technical teams are continually working with external groups in a variety of data advisory and mentorship capacities, but the majority of Insight partners participate in professional sessions. Companies visit the Insight office to speak with fellows in an informal setting and provide details on the type of work they are doing and how their teams are growing. These sessions have proved invaluable as fellows experience a significantly better interview process and companies yield engaged and enthusiastic new team members.

An important aspect of Insight’s fellowships is the opportunity for hands-on work, focusing on everything from building big-data pipelines to contributing novel features to industry-standard open source efforts. Insight provides free AWS resources for all fellows to use, in addition to mentorships from the Data Engineering team. Fellows regularly utilize Amazon S3, Amazon EC2, Amazon Kinesis, Amazon EMR, AWS Lambda, Amazon Redshift, Amazon RDS, among other services. The experience with AWS gives fellows a solid skill set as they transition into the industry. Fellowships are currently being offered in Boston, New York, Seattle, and the Bay Area.

Check out the Insight blog for more information on trends in data infrastructure, artificial intelligence, and cutting-edge data products.

 

iTranslate (Austria)

When the App Store was introduced in 2008, the founders of iTranslate saw an opportunity to be part of something big. The group of four fully believed that the iPhone and apps were going to change the world, and together they brainstormed ideas for their own app. The combination of translation and mobile devices seemed a natural fit, and by 2009 iTranslate was born. iTranslate’s mission is to enable travelers, students, business professionals, employers, and medical staff to read, write, and speak in all languages, anywhere in the world. The app allows users to translate text, voice, websites and more into nearly 100 languages on various platforms. Today, iTranslate is the leading player for conversational translation and dictionary apps, with more than 60 million downloads and 6 million monthly active users.

iTranslate is breaking language barriers through disruptive technology and innovation, enabling people to translate in real time. The app has a variety of features designed to optimize productivity including offline translation, website and voice translation, and language auto detection. iTranslate also recently launched the world’s first ear translation device in collaboration with Bragi, a company focused on smart earphones. The Dash Pro allows people to communicate freely, while having a personal translator right in their ear.

iTranslate started using Amazon Polly soon after it was announced. CEO Alexander Marktl said, “As the leading translation and dictionary app, it is our mission at iTranslate to provide our users with the best possible tools to read, write, and speak in all languages across the globe. Amazon Polly provides us with the ability to efficiently produce and use high quality, natural sounding synthesized speech.” The stable and simple-to-use API, low latency, and free caching allow iTranslate to scale as they continue adding features to their app. Customers also enjoy the option to change speech rate and change between male and female voices. To assure quality, speed, and reliability of their products, iTranslate also uses Amazon EC2, Amazon S3, and Amazon Route 53.

To get started with iTranslate, visit their website here.

—–

Thanks for reading!

-Tina

Run Common Data Science Packages on Anaconda and Oozie with Amazon EMR

Post Syndicated from John Ohle original https://aws.amazon.com/blogs/big-data/run-common-data-science-packages-on-anaconda-and-oozie-with-amazon-emr/

In the world of data science, users must often sacrifice cluster set-up time to allow for complex usability scenarios. Amazon EMR allows data scientists to spin up complex cluster configurations easily, and to be up and running with complex queries in a matter of minutes.

Data scientists often use scheduling applications such as Oozie to run jobs overnight. However, Oozie can be difficult to configure when you are trying to use popular Python packages (such as “pandas,” “numpy,” and “statsmodels”), which are not included by default.

One such popular platform that contains these types of packages (and more) is Anaconda. This post focuses on setting up an Anaconda platform on EMR, with an intent to use its packages with Oozie. I describe how to run jobs using a popular open source scheduler like Oozie.

Walkthrough

For this post, you walk through the following tasks:

  • Create an EMR cluster.
  • Download Anaconda on your master node.
  • Configure Oozie.
  • Test the steps.

Create an EMR cluster

Spin up an Amazon EMR cluster using the console or the AWS CLI. Use the latest release, and include Apache Hadoop, Apache Spark, Apache Hive, and Oozie.

To create a three-node cluster in the us-east-1 region, issue an AWS CLI command such as the following. This command must be typed as one line, as shown below. It is shown here separated for readability purposes only.

aws emr create-cluster \ 
--release-label emr-5.7.0 \ 
 --name '<YOUR-CLUSTER-NAME>' \
 --applications Name=Hadoop Name=Oozie Name=Spark Name=Hive \ 
 --ec2-attributes '{"KeyName":"<YOUR-KEY-PAIR>","SubnetId":"<YOUR-SUBNET-ID>","EmrManagedSlaveSecurityGroup":"<YOUR-EMR-SLAVE-SECURITY-GROUP>","EmrManagedMasterSecurityGroup":"<YOUR-EMR-MASTER-SECURITY-GROUP>"}' \ 
 --use-default-roles \ 
 --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"<YOUR-INSTANCE-TYPE>","Name":"Master - 1"},{"InstanceCount":<YOUR-CORE-INSTANCE-COUNT>,"InstanceGroupType":"CORE","InstanceType":"<YOUR-INSTANCE-TYPE>","Name":"Core - 2"}]'

One-line version for reference:

aws emr create-cluster --release-label emr-5.7.0 --name '<YOUR-CLUSTER-NAME>' --applications Name=Hadoop Name=Oozie Name=Spark Name=Hive --ec2-attributes '{"KeyName":"<YOUR-KEY-PAIR>","SubnetId":"<YOUR-SUBNET-ID>","EmrManagedSlaveSecurityGroup":"<YOUR-EMR-SLAVE-SECURITY-GROUP>","EmrManagedMasterSecurityGroup":"<YOUR-EMR-MASTER-SECURITY-GROUP>"}' --use-default-roles --instance-groups '[{"InstanceCount":1,"InstanceGroupType":"MASTER","InstanceType":"<YOUR-INSTANCE-TYPE>","Name":"Master - 1"},{"InstanceCount":<YOUR-CORE-INSTANCE-COUNT>,"InstanceGroupType":"CORE","InstanceType":"<YOUR-INSTANCE-TYPE>","Name":"Core - 2"}]'

Download Anaconda

SSH into your EMR master node instance and download the official Anaconda installer:

wget https://repo.continuum.io/archive/Anaconda2-4.4.0-Linux-x86_64.sh

At the time of publication, Anaconda 4.4 is the most current version available. For the download link location for the latest Python 2.7 version (Python 3.6 may encounter issues), see https://www.continuum.io/downloads.  Open the context (right-click) menu for the Python 2.7 download link, choose Copy Link Location, and use this value in the previous wget command.

This post used the Anaconda 4.4 installation. If you have a later version, it is shown in the downloaded filename:  “anaconda2-<version number>-Linux-x86_64.sh”.

Run this downloaded script and follow the on-screen installer prompts.

chmod u+x Anaconda2-4.4.0-Linux-x86_64.sh
./Anaconda2-4.4.0-Linux-x86_64.sh

For an installation directory, select somewhere with enough space on your cluster, such as “/mnt/anaconda/”.

The process should take approximately 1–2 minutes to install. When prompted if you “wish the installer to prepend the Anaconda2 install location”, select the default option of [no].

After you are done, export the PATH to include this new Anaconda installation:

export PATH=/mnt/anaconda/bin:$PATH

Zip up the Anaconda installation:

cd /mnt/anaconda/
zip -r anaconda.zip .

The zip process may take 4–5 minutes to complete.

(Optional) Upload this anaconda.zip file to your S3 bucket for easier inclusion into future EMR clusters. This removes the need to repeat the previous steps for future EMR clusters.

Configure Oozie

Next, you configure Oozie to use Pyspark and the Anaconda platform.

Get the location of your Oozie sharelibupdate folder. Issue the following command and take note of the “sharelibDirNew” value:

oozie admin -sharelibupdate

For this post, this value is “hdfs://ip-192-168-4-200.us-east-1.compute.internal:8020/user/oozie/share/lib/lib_20170616133136”.

Pass in the required Pyspark files into Oozies sharelibupdate location. The following files are required for Oozie to be able to run Pyspark commands:

  • pyspark.zip
  • py4j-0.10.4-src.zip

These are located on the EMR master instance in the location “/usr/lib/spark/python/lib/”, and must be put into the Oozie sharelib spark directory. This location is the value of the sharelibDirNew parameter value (shown above) with “/spark/” appended, that is, “hdfs://ip-192-168-4-200.us-east-1.compute.internal:8020/user/oozie/share/lib/lib_20170616133136/spark/”.

To do this, issue the following commands:

hdfs dfs -put /usr/lib/spark/python/lib/py4j-0.10.4-src.zip hdfs://ip-192-168-4-200.us-east-1.compute.internal:8020/user/oozie/share/lib/lib_20170616133136/spark/
hdfs dfs -put /usr/lib/spark/python/lib/pyspark.zip hdfs://ip-192-168-4-200.us-east-1.compute.internal:8020/user/oozie/share/lib/lib_20170616133136/spark/

After you’re done, Oozie can use Pyspark in its processes.

Pass the anaconda.zip file into HDFS as follows:

hdfs dfs -put /mnt/anaconda/anaconda.zip /tmp/myLocation/anaconda.zip

(Optional) Verify that it was transferred successfully with the following command:

hdfs dfs -ls /tmp/myLocation/

On your master node, execute the following command:

export PYSPARK_PYTHON=/mnt/anaconda/bin/python

Set the PYSPARK_PYTHON environment variable on the executor nodes. Put the following configurations in your “spark-opts” values in your Oozie workflow.xml file:

–conf spark.executorEnv.PYSPARK_PYTHON=./anaconda_remote/bin/python
–conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./anaconda_remote/bin/python

This is referenced from the Oozie job in the following line in your workflow.xml file, also included as part of your “spark-opts”:

--archives hdfs:///tmp/myLocation/anaconda.zip#anaconda_remote

Your Oozie workflow.xml file should now look something like the following:

<workflow-app name="spark-wf" xmlns="uri:oozie:workflow:0.5">
<start to="start_spark" />
<action name="start_spark">
    <spark xmlns="uri:oozie:spark-action:0.1">
        <job-tracker>${jobTracker}</job-tracker>
        <name-node>${nameNode}</name-node>
        <prepare>
            <delete path="/tmp/test/spark_oozie_test_out3"/>
        </prepare>
        <master>yarn-cluster</master>
        <mode>cluster</mode>
        <name>SparkJob</name>
        <class>clear</class>
        <jar>hdfs:///user/oozie/apps/myPysparkProgram.py</jar>
        <spark-opts>--queue default
            --conf spark.ui.view.acls=*
            --executor-memory 2G --num-executors 2 --executor-cores 2 --driver-memory 3g
            --conf spark.executorEnv.PYSPARK_PYTHON=./anaconda_remote/bin/python
            --conf spark.yarn.appMasterEnv.PYSPARK_PYTHON=./anaconda_remote/bin/python
            --archives hdfs:///tmp/myLocation/anaconda.zip#anaconda_remote
        </spark-opts>
    </spark>
    <ok to="end"/>
    <error to="kill"/>
</action>
        <kill name="kill">
                <message>Action failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
        </kill>
        <end name="end"/>
</workflow-app>

Test steps

To test this out, you can use the following job.properties and myPysparkProgram.py file, along with the following steps:

job.properties

masterNode ip-xxx-xxx-xxx-xxx.us-east-1.compute.internal
nameNode hdfs://${masterNode}:8020
jobTracker ${masterNode}:8032
master yarn
mode cluster
queueName default
oozie.libpath ${nameNode}/user/oozie/share/lib
oozie.use.system.libpath true
oozie.wf.application.path ${nameNode}/user/oozie/apps/

Note: You can get your master node IP address (denoted as “ip-xxx-xxx-xxx-xxx” here) from the value for the sharelibDirNew parameter noted earlier.

myPysparkProgram.py

from pyspark import SparkContext, SparkConf
import numpy
import sys

conf = SparkConf().setAppName('myPysparkProgram')
sc = SparkContext(conf=conf)

rdd = sc.textFile("/user/hadoop/input.txt")

x = numpy.sum([3,4,5]) #total = 12

rdd = rdd.map(lambda line: line + str(x))
rdd.saveAsTextFile("/user/hadoop/output")

Put the “myPysparkProgram.py” into the location mentioned between the “<jar>xxxxx</jar>” tags in your workflow.xml. In this example, the location is “hdfs:///user/oozie/apps/”. Use the following command to move the “myPysparkProgram.py” file to the correct location:

hdfs dfs -put myPysparkProgram.py /user/oozie/apps/

Put the above workflow.xml file into the “/user/oozie/apps/” location in hdfs:

hdfs dfs –put workflow.xml /user/oozie/apps/

Note: The job.properties file is run locally from the EMR master node.

Create a sample input.txt file with some data in it. For example:

input.txt

This is a sentence.
So is this. 
This is also a sentence.

Put this file into hdfs:

hdfs dfs -put input.txt /user/hadoop/

Execute the job in Oozie with the following command. This creates an Oozie job ID.

oozie job -config job.properties -run

You can check the Oozie job state with the command:

oozie job -info <Oozie job ID>

  1. When the job is successfully finished, the results are located at:
/user/hadoop/output/part-00000
/user/hadoop/output/part-00001

  1. Run the following commands to view the output:
hdfs dfs -cat /user/hadoop/output/part-00000
hdfs dfs -cat /user/hadoop/output/part-00001

The output will be:

This is a sentence. 12
So is this 12
This is also a sentence 12

Summary

The myPysparkProgram.py has successfully imported the numpy library from the Anaconda platform and has produced some output with it. If you tried to run this using standard Python, you’d encounter the following error:

Now when your Python job runs in Oozie, any imported packages that are implicitly imported by your Pyspark script are imported into your job within Oozie directly from the Anaconda platform. Simple!

If you have questions or suggestions, please leave a comment below.


Additional Reading

Learn how to use Apache Oozie workflows to automate Apache Spark jobs on Amazon EMR.

 


About the Author

John Ohle is an AWS BigData Cloud Support Engineer II for the BigData team in Dublin. He works to provide advice and solutions to our customers on their Big Data projects and workflows on AWS. In his spare time, he likes to play music, learn, develop tools and write documentation to further help others – both colleagues and customers alike.

 

 

 

Defending anti-netneutrality arguments

Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/07/defending-anti-netneutrality-arguments.html

Last week, activists proclaimed a “NetNeutrality Day”, trying to convince the FCC to regulate NetNeutrality. As a libertarian, I tweeted many reasons why NetNeutrality is stupid. NetNeutrality is exactly the sort of government regulation Libertarians hate most. Somebody tweeted the following challenge, which I thought I’d address here.

The links point to two separate cases.

  • the Comcast BitTorrent throttling case
  • a lawsuit against Time Warning for poor service
The tone of the tweet suggests that my anti-NetNeutrality stance cannot be defended in light of these cases. But of course this is wrong. The short answers are:

  • the Comcast BitTorrent throttling benefits customers
  • poor service has nothing to do with NetNeutrality

The long answers are below.

The Comcast BitTorrent Throttling

The presumption is that any sort of packet-filtering is automatically evil, and against the customer’s interests. That’s not true.
Take GoGoInflight’s internet service for airplanes. They block access to video sites like NetFlix. That’s because they often have as little as 1-mbps for the entire plane, which is enough to support many people checking email and browsing Facebook, but a single person trying to watch video will overload the internet connection for everyone. Therefore, their Internet service won’t work unless they filter video sites.
GoGoInflight breaks a lot of other NetNeutrality rules, such as providing free access to Amazon.com or promotion deals where users of a particular phone get free Internet access that everyone else pays for. And all this is allowed by FCC, allowing GoGoInflight to break NetNeutrality rules because it’s clearly in the customer interest.
Comcast’s throttling of BitTorrent is likewise clearly in the customer interest. Until the FCC stopped them, BitTorrent users were allowed unlimited downloads. Afterwards, Comcast imposed a 300-gigabyte/month bandwidth cap.
Internet access is a series of tradeoffs. BitTorrent causes congestion during prime time (6pm to 10pm). Comcast has to solve it somehow — not solving it wasn’t an option. Their options were:
  • Charge all customers more, so that the 99% not using BitTorrent subsidizes the 1% who do.
  • Impose a bandwidth cap, preventing heavy BitTorrent usage.
  • Throttle BitTorrent packets during prime-time hours when the network is congested.
Option 3 is clearly the best. BitTorrent downloads take hours, days, and sometimes weeks. BitTorrent users don’t mind throttling during prime-time congested hours. That’s preferable to the other option, bandwidth caps.
I’m a BitTorrent user, and a heavy downloader (I scan the Internet on a regular basis from cloud machines, then download the results to home, which can often be 100-gigabytes in size for a single scan). I want prime-time BitTorrent throttling rather than bandwidth caps. The EFF/FCC’s action that prevented BitTorrent throttling forced me to move to Comcast Business Class which doesn’t have bandwidth caps, charging me $100 more a month. It’s why I don’t contribute the EFF — if they had not agitated for this, taking such choices away from customers, I’d have $1200 more per year to donate to worthy causes.
Ask any user of BitTorrent which they prefer: 300gig monthly bandwidth cap or BitTorrent throttling during prime-time congested hours (6pm to 10pm). The FCC’s action did not help Comcast’s customers, it hurt them. Packet-filtering would’ve been a good thing, not a bad thing.

The Time-Warner Case
First of all, no matter how you define the case, it has nothing to do with NetNeutrality. NetNeutrality is about filtering packets, giving some priority over others. This case is about providing slow service for everyone.
Secondly, it’s not true. Time Warner provided the same access speeds as everyone else. Just because they promise 10mbps download speeds doesn’t mean you get 10mbps to NetFlix. That’s not how the Internet works — that’s not how any of this works.
To prove this, look at NetFlix’s connection speed graphis. It shows Time Warner Cable is average for the industry. It had the same congestion problems most ISPs had in 2014, and it has the same inability to provide more than 3mbps during prime-time (6pm-10pm) that all ISPs have today.

The YouTube video quality diagnostic pages show Time Warner Cable to similar to other providers around the country. It also shows the prime-time bump between 6pm and 10pm.
Congestion is an essential part of the Internet design. When an ISP like Time Warner promises you 10mbps bandwidth, that’s only “best effort”. There’s no way they can promise 10mbps stream to everybody on the Internet, especially not to a site like NetFlix that gets overloaded during prime-time.
Indeed, it’s the defining feature of the Internet compared to the old “telecommunications” network. The old phone system guaranteed you a steady 64-kbps stream between any time points in the phone network, but it cost a lot of money. Today’s Internet provide a free multi-megabit stream for free video calls (Skype, Facetime) around the world — but with the occasional dropped packets because of congestion.
Whatever lawsuit money-hungry lawyers come up with isn’t about how an ISP like Time Warner works. It’s only about how they describe the technology. They work no different than every ISP — no different than how anything is possible.
Conclusion

The short answer to the above questions is this: Comcast’s BitTorrent throttling benefits customers, and the Time Warner issue has nothing to do with NetNeutrality at all.

The tweet demonstrates that NetNeutrality really means. It has nothing to do with the facts of any case, especially the frequency that people point to ISP ills that have nothing actually to do with NetNeutrality. Instead, what NetNeutrality really about is socialism. People are convinced corporations are evil and want the government to run the Internet. The Comcast/BitTorrent case is a prime example of why this is a bad idea: government definitions of what customers want is actually far different than what customers actually want.

mkosi — A Tool for Generating OS Images

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/mkosi-a-tool-for-generating-os-images.html

Introducing mkosi

After blogging about
casync
I realized I never blogged about the
mkosi tool that combines nicely
with it. mkosi has been around for a while already, and its time to
make it a bit better known. mkosi stands for Make Operating System
Image
, and is a tool for precisely that: generating an OS tree or
image that can be booted.

Yes, there are many tools like mkosi, and a number of them are quite
well known and popular. But mkosi has a number of features that I
think make it interesting for a variety of use-cases that other tools
don’t cover that well.

What is mkosi?

What are those use-cases, and what does mkosi precisely set apart?
mkosi is definitely a tool with a focus on developer’s needs for
building OS images, for testing and debugging, but also for generating
production images with cryptographic protection. A typical use-case
would be to add a mkosi.default file to an existing project (for
example, one written in C or Python), and thus making it easy to
generate an OS image for it. mkosi will put together the image with
development headers and tools, compile your code in it, run your test
suite, then throw away the image again, and build a new one, this time
without development headers and tools, and install your build
artifacts in it. This final image is then “production-ready”, and only
contains your built program and the minimal set of packages you
configured otherwise. Such an image could then be deployed with
casync (or any other tool of course) to be delivered to your set of
servers, or IoT devices or whatever you are building.

mkosi is supposed to be legacy-free: the focus is clearly on
today’s technology, not yesteryear’s. Specifically this means that
we’ll generate GPT partition tables, not MBR/DOS ones. When you tell
mkosi to generate a bootable image for you, it will make it bootable
on EFI, not on legacy BIOS. The GPT images generated follow
specifications such as the Discoverable Partitions
Specification
,
so that /etc/fstab can remain unpopulated and tools such as
systemd-nspawn can automatically dissect the image and boot from
them.

So, let’s have a look on the specific images it can generate:

  1. Raw GPT disk image, with ext4 as root
  2. Raw GPT disk image, with btrfs as root
  3. Raw GPT disk image, with a read-only squashfs as root
  4. A plain directory on disk containing the OS tree directly (this is useful for creating generic container images)
  5. A btrfs subvolume on disk, similar to the plain directory
  6. A tarball of a plain directory

When any of the GPT choices above are selected, a couple of additional
options are available:

  1. A swap partition may be added in
  2. The system may be made bootable on EFI systems
  3. Separate partitions for /home and /srv may be added in
  4. The root, /home and /srv partitions may be optionally encrypted with LUKS
  5. The root partition may be protected using dm-verity, thus making offline attacks on the generated system hard
  6. If the image is made bootable, the dm-verity root hash is automatically added to the kernel command line, and the kernel together with its initial RAM disk and the kernel command line is optionally cryptographically signed for UEFI SecureBoot

Note that mkosi is distribution-agnostic. It currently can build
images based on the following Linux distributions:

  1. Fedora
  2. Debian
  3. Ubuntu
  4. ArchLinux
  5. openSUSE

Note though that not all distributions are supported at the same
feature level currently. Also, as mkosi is based on dnf
--installroot
, debootstrap, pacstrap and zypper, and those
packages are not packaged universally on all distributions, you might
not be able to build images for all those distributions on arbitrary
host distributions. For example, Fedora doesn’t package zypper,
hence you cannot build an openSUSE image easily on Fedora, but you can
still build Fedora (obviously…), Debian, Ubuntu and ArchLinux images
on it just fine.

The GPT images are put together in a way that they aren’t just
compatible with UEFI systems, but also with VM and container managers
(that is, at least the smart ones, i.e. VM managers that know UEFI,
and container managers that grok GPT disk images) to a large
degree. In fact, the idea is that you can use mkosi to build a
single GPT image that may be used to:

  1. Boot on bare-metal boxes
  2. Boot in a VM
  3. Boot in a systemd-nspawn container
  4. Directly run a systemd service off, using systemd’s RootImage= unit file setting

Note that in all four cases the dm-verity data is automatically used
if available to ensure the image is not tempered with (yes, you read
that right, systemd-nspawn and systemd’s RootImage= setting
automatically do dm-verity these days if the image has it.)

Mode of Operation

The simplest usage of mkosi is by simply invoking it without
parameters (as root):

# mkosi

Without any configuration this will create a GPT disk image for you,
will call it image.raw and drop it in the current directory. The
distribution used will be the same one as your host runs.

Of course in most cases you want more control about how the image is
put together, i.e. select package sets, select the distribution, size
partitions and so on. Most of that you can actually specify on the
command line, but it is recommended to instead create a couple of
mkosi.$SOMETHING files and directories in some directory. Then,
simply change to that directory and run mkosi without any further
arguments. The tool will then look in the current working directory
for these files and directories and make use of them (similar to how
make looks for a Makefile…). Every single file/directory is
optional, but if they exist they are honored. Here’s a list of the
files/directories mkosi currently looks for:

  1. mkosi.default — This is the main configuration file, here you
    can configure what kind of image you want, which distribution, which
    packages and so on.

  2. mkosi.extra/ — If this directory exists, then mkosi will copy
    everything inside it into the images built. You can place arbitrary
    directory hierarchies in here, and they’ll be copied over whatever is
    already in the image, after it was put together by the distribution’s
    package manager. This is the best way to drop additional static files
    into the image, or override distribution-supplied ones.

  3. mkosi.build — This executable file is supposed to be a build
    script. When it exists, mkosi will build two images, one after the
    other in the mode already mentioned above: the first version is the
    build image, and may include various build-time dependencies such as
    a compiler or development headers. The build script is also copied
    into it, and then run inside it. The script should then build
    whatever shall be built and place the result in $DESTDIR (don’t
    worry, popular build tools such as Automake or Meson all honor
    $DESTDIR anyway, so there’s not much to do here explicitly). It may
    also run a test suite, or anything else you like. After the script
    finished, the build image is removed again, and a second image (the
    final image) is built. This time, no development packages are
    included, and the build script is not copied into the image again —
    however, the build artifacts from the first run (i.e. those placed in
    $DESTDIR) are copied into the image.

  4. mkosi.postinst — If this executable script exists, it is invoked
    inside the image (inside a systemd-nspawn invocation) and can
    adjust the image as it likes at a very late point in the image
    preparation. If mkosi.build exists, i.e. the dual-phased
    development build process used, then this script will be invoked
    twice: once inside the build image and once inside the final
    image. The first parameter passed to the script clarifies which phase
    it is run in.

  5. mkosi.nspawn — If this file exists, it should contain a
    container configuration file for systemd-nspawn (see
    systemd.nspawn(5)
    for details), which shall be shipped along with the final image and
    shall be included in the check-sum calculations (see below).

  6. mkosi.cache/ — If this directory exists, it is used as package
    cache directory for the builds. This directory is effectively bind
    mounted into the image at build time, in order to speed up building
    images. The package installers of the various distributions will
    place their package files here, so that subsequent runs can reuse
    them.

  7. mkosi.passphrase — If this file exists, it should contain a
    pass-phrase to use for the LUKS encryption (if that’s enabled for the
    image built). This file should not be readable to other users.

  8. mkosi.secure-boot.crt and mkosi.secure-boot.key should be an
    X.509 key pair to use for signing the kernel and initrd for UEFI
    SecureBoot, if that’s enabled.

How to use it

So, let’s come back to our most trivial example, without any of the
mkosi.$SOMETHING files around:

# mkosi

As mentioned, this will create a build file image.raw in the current
directory. How do we use it? Of course, we could dd it onto some USB
stick and boot it on a bare-metal device. However, it’s much simpler
to first run it in a container for testing:

# systemd-nspawn -bi image.raw

And there you go: the image should boot up, and just work for you.

Now, let’s make things more interesting. Let’s still not use any of
the mkosi.$SOMETHING files around:

# mkosi -t raw_btrfs --bootable -o foobar.raw
# systemd-nspawn -bi foobar.raw

This is similar as the above, but we made three changes: it’s no
longer GPT + ext4, but GPT + btrfs. Moreover, the system is made
bootable on UEFI systems, and finally, the output is now called
foobar.raw.

Because this system is bootable on UEFI systems, we can run it in KVM:

qemu-kvm -m 512 -smp 2 -bios /usr/share/edk2/ovmf/OVMF_CODE.fd -drive format=raw,file=foobar.raw

This will look very similar to the systemd-nspawn invocation, except
that this uses full VM virtualization rather than container
virtualization. (Note that the way to run a UEFI qemu/kvm instance
appears to change all the time and is different on the various
distributions. It’s quite annoying, and I can’t really tell you what
the right qemu command line is to make this work on your system.)

Of course, it’s not all raw GPT disk images with mkosi. Let’s try
a plain directory image:

# mkosi -d fedora -t directory -o quux
# systemd-nspawn -bD quux

Of course, if you generate the image as plain directory you can’t boot
it on bare-metal just like that, nor run it in a VM.

A more complex command line is the following:

# mkosi -d fedora -t raw_squashfs --checksum --xz --package=openssh-clients --package=emacs

In this mode we explicitly pick Fedora as the distribution to use, ask
mkosi to generate a compressed GPT image with a root squashfs,
compress the result with xz, and generate a SHA256SUMS file with
the hashes of the generated artifacts. The package will contain the
SSH client as well as everybody’s favorite editor.

Now, let’s make use of the various mkosi.$SOMETHING files. Let’s
say we are working on some Automake-based project and want to make it
easy to generate a disk image off the development tree with the
version you are hacking on. Create a configuration file:

# cat > mkosi.default <<EOF
[Distribution]
Distribution=fedora
Release=24

[Output]
Format=raw_btrfs
Bootable=yes

[Packages]
# The packages to appear in both the build and the final image
Packages=openssh-clients httpd
# The packages to appear in the build image, but absent from the final image
BuildPackages=make gcc libcurl-devel
EOF

And let’s add a build script:

# cat > mkosi.build <<EOF
#!/bin/sh
cd $SRCDIR
./autogen.sh
./configure --prefix=/usr
make -j `nproc`
make install
EOF
# chmod +x mkosi.build

And with all that in place we can now build our project into a disk image, simply by typing:

# mkosi

Let’s try it out:

# systemd-nspawn -bi image.raw

Of course, if you do this you’ll notice that building an image like
this can be quite slow. And slow build times are actively hurtful to
your productivity as a developer. Hence let’s make things a bit
faster. First, let’s make use of a package cache shared between runs:

# mkdir mkosi.chache

Building images now should already be substantially faster (and
generate less network traffic) as the packages will now be downloaded
only once and reused. However, you’ll notice that unpacking all those
packages and the rest of the work is still quite slow. But mkosi can
help you with that. Simply use mkosi‘s incremental build feature. In
this mode mkosi will make a copy of the build and final images
immediately before dropping in your build sources or artifacts, so
that building an image becomes a lot quicker: instead of always
starting totally from scratch a build will now reuse everything it can
reuse from a previous run, and immediately begin with building your
sources rather than the build image to build your sources in. To
enable the incremental build feature use -i:

# mkosi -i

Note that if you use this option, the package list is not updated
anymore from your distribution’s servers, as the cached copy is made
after all packages are installed, and hence until you actually delete
the cached copy the distribution’s network servers aren’t contacted
again and no RPMs or DEBs are downloaded. This means the distribution
you use becomes “frozen in time” this way. (Which might be a bad
thing, but also a good thing, as it makes things kinda reproducible.)

Of course, if you run mkosi a couple of times you’ll notice that it
won’t overwrite the generated image when it already exists. You can
either delete the file yourself first (rm image.raw) or let mkosi
do it for you right before building a new image, with mkosi -f. You
can also tell mkosi to not only remove any such pre-existing images,
but also remove any cached copies of the incremental feature, by using
-f twice.

I wrote mkosi originally in order to test systemd, and quickly
generate a disk image of various distributions with the most current
systemd version from git, without all that affecting my host system. I
regularly use mkosi for that today, in incremental mode. The two
commands I use most in that context are:

# mkosi -if && systemd-nspawn -bi image.raw

And sometimes:

# mkosi -iff && systemd-nspawn -bi image.raw

The latter I use only if I want to regenerate everything based on the
very newest set of RPMs provided by Fedora, instead of a cached
snapshot of it.

BTW, the mkosi files for systemd are included in the systemd git
tree:
mkosi.default
and
mkosi.build. This
way, any developer who wants to quickly test something with current
systemd git, or wants to prepare a patch based on it and test it can
check out the systemd repository and simply run mkosi in it and a
few minutes later he has a bootable image he can test in
systemd-nspawn or KVM. casync has similar files:
mkosi.default,
mkosi.build.

Random Interesting Features

  1. As mentioned already, mkosi will generate dm-verity enabled
    disk images if you ask for it. For that use the --verity switch on
    the command line or Verity= setting in mkosi.default. Of course,
    dm-verity implies that the root volume is read-only. In this mode
    the top-level dm-verity hash will be placed along-side the output
    disk image in a file named the same way, but with the .roothash
    suffix. If the image is to be created bootable, the root hash is also
    included on the kernel command line in the roothash= parameter,
    which current systemd versions can use to both find and activate the
    root partition in a dm-verity protected way. BTW: it’s a good idea
    to combine this dm-verity mode with the raw_squashfs image mode,
    to generate a genuinely protected, compressed image suitable for
    running in your IoT device.

  2. As indicated above, mkosi can automatically create a check-sum
    file SHA256SUMS for you (--checksum) covering all the files it
    outputs (which could be the image file itself, a matching .nspawn
    file using the mkosi.nspawn file mentioned above, as well as the
    .roothash file for the dm-verity root hash.) It can then
    optionally sign this with gpg (--sign). Note that systemd‘s
    machinectl pull-tar and machinectl pull-raw command can download
    these files and the SHA256SUMS file automatically and verify things
    on download. With other words: what mkosi outputs is perfectly
    ready for downloads using these two systemd commands.

  3. As mentioned, mkosi is big on supporting UEFI SecureBoot. To
    make use of that, place your X.509 key pair in two files
    mkosi.secureboot.crt and mkosi.secureboot.key, and set
    SecureBoot= or --secure-boot. If so, mkosi will sign the
    kernel/initrd/kernel command line combination during the build. Of
    course, if you use this mode, you should also use
    Verity=/--verity=, otherwise the setup makes only partial
    sense. Note that mkosi will not help you with actually enrolling
    the keys you use in your UEFI BIOS.

  4. mkosi has minimal support for GIT checkouts: when it recognizes
    it is run in a git checkout and you use the mkosi.build script
    stuff, the source tree will be copied into the build image, but will
    all files excluded by .gitignore removed.

  5. There’s support for encryption in place. Use --encrypt= or
    Encrypt=. Note that the UEFI ESP is never encrypted though, and the
    root partition only if explicitly requested. The /home and /srv
    partitions are unconditionally encrypted if that’s enabled.

  6. Images may be built with all documentation removed.

  7. The password for the root user and additional kernel command line
    arguments may be configured for the image to generate.

Minimum Requirements

Current mkosi requires Python 3.5, and has a number of dependencies,
listed in the
README. Most
notably you need a somewhat recent systemd version to make use of its
full feature set: systemd 233. Older versions are already packaged for
various distributions, but much of what I describe above is only
available in the most recent release mkosi 3.

The UEFI SecureBoot support requires sbsign which currently isn’t
available in Fedora, but there’s a
COPR
.

Future

It is my intention to continue turning mkosi into a tool suitable
for:

  1. Testing and debugging projects
  2. Building images for secure devices
  3. Building portable service images
  4. Building images for secure VMs and containers

One of the biggest goals I have for the future is to teach mkosi and
systemd/sd-boot native support for A/B IoT style partition
setups. The idea is that the combination of systemd, casync and
mkosi provides generic building blocks for building secure,
auto-updating devices in a generic way from, even though all pieces
may be used individually, too.

FAQ

  1. Why are you reinventing the wheel again? This is exactly like
    $SOMEOTHERPROJECT!
    — Well, to my knowledge there’s no tool that
    integrates this nicely with your project’s development tree, and can
    do dm-verity and UEFI SecureBoot and all that stuff for you. So
    nope, I don’t think this exactly like $SOMEOTHERPROJECT, thank you
    very much.

  2. What about creating MBR/DOS partition images? — That’s really
    out of focus to me. This is an exercise in figuring out how generic
    OSes and devices in the future should be built and an attempt to
    commoditize OS image building. And no, the future doesn’t speak MBR,
    sorry. That said, I’d be quite interested in adding support for
    booting on Raspberry Pi, possibly using a hybrid approach, i.e. using
    a GPT disk label, but arranging things in a way that the Raspberry Pi
    boot protocol (which is built around DOS partition tables), can still
    work.

  3. Is this portable? — Well, depends what you mean by
    portable. No, this tool runs on Linux only, and as it uses
    systemd-nspawn during the build process it doesn’t run on
    non-systemd systems either. But then again, you should be able to
    create images for any architecture you like with it, but of course if
    you want the image bootable on bare-metal systems only systems doing
    UEFI are supported (but systemd-nspawn should still work fine on
    them).

  4. Where can I get this stuff? — Try
    GitHub. And some distributions
    carry packaged versions, but I think none of them the current v3
    yet.

  5. Is this a systemd project? — Yes, it’s hosted under the
    systemd GitHub umbrella. And yes,
    during run-time systemd-nspawn in a current version is required. But
    no, the code-bases are separate otherwise, already because systemd
    is a C project, and mkosi Python.

  6. Requiring systemd 233 is a pretty steep requirement, no?
    Yes, but the feature we need kind of matters (systemd-nspawn‘s
    --overlay= switch), and again, this isn’t supposed to be a tool for
    legacy systems.

  7. Can I run the resulting images in LXC or Docker? — Humm, I am
    not an LXC nor Docker guy. If you select directory or subvolume
    as image type, LXC should be able to boot the generated images just
    fine, but I didn’t try. Last time I looked, Docker doesn’t permit
    running proper init systems as PID 1 inside the container, as they
    define their own run-time without intention to emulate a proper
    system. Hence, no I don’t think it will work, at least not with an
    unpatched Docker version. That said, again, don’t ask me questions
    about Docker, it’s not precisely my area of expertise, and quite
    frankly I am not a fan. To my knowledge neither LXC nor Docker are
    able to run containers directly off GPT disk images, hence the
    various raw_xyz image types are definitely not compatible with
    either. That means if you want to generate a single raw disk image
    that can be booted unmodified both in a container and on bare-metal,
    then systemd-nspawn is the container manager to go for
    (specifically, its -i/--image= switch).

Should you care? Is this a tool for you?

Well, that’s up to you really.

If you hack on some complex project and need a quick way to compile
and run your project on a specific current Linux distribution, then
mkosi is an excellent way to do that. Simply drop the mkosi.default
and mkosi.build files in your git tree and everything will be
easy. (And of course, as indicated above: if the project you are
hacking on happens to be called systemd or casync be aware that
those files are already part of the git tree — you can just use them.)

If you hack on some embedded or IoT device, then mkosi is a great
choice too, as it will make it reasonably easy to generate secure
images that are protected against offline modification, by using
dm-verity and UEFI SecureBoot.

If you are an administrator and need a nice way to build images for a
VM or systemd-nspawn container, or a portable service then mkosi
is an excellent choice too.

If you care about legacy computers, old distributions, non-systemd
init systems, old VM managers, Docker, … then no, mkosi is not for
you, but there are plenty of well-established alternatives around that
cover that nicely.

And never forget: mkosi is an Open Source project. We are happy to
accept your patches and other contributions.

Oh, and one unrelated last thing: don’t forget to submit your talk
proposal

and/or buy a ticket for
All Systems Go! 2017 in Berlin — the
conference where things like systemd, casync and mkosi are
discussed, along with a variety of other Linux userspace projects used
for building systems.

mkosi — A Tool for Generating OS Images

Post Syndicated from Lennart Poettering original http://0pointer.net/blog/mkosi-a-tool-for-generating-os-images.html

Introducing mkosi

After blogging about
casync
I realized I never blogged about the
mkosi tool that combines nicely
with it. mkosi has been around for a while already, and its time to
make it a bit better known. mkosi stands for Make Operating System
Image
, and is a tool for precisely that: generating an OS tree or
image that can be booted.

Yes, there are many tools like mkosi, and a number of them are quite
well known and popular. But mkosi has a number of features that I
think make it interesting for a variety of use-cases that other tools
don’t cover that well.

What is mkosi?

What are those use-cases, and what does mkosi precisely set apart?
mkosi is definitely a tool with a focus on developer’s needs for
building OS images, for testing and debugging, but also for generating
production images with cryptographic protection. A typical use-case
would be to add a mkosi.default file to an existing project (for
example, one written in C or Python), and thus making it easy to
generate an OS image for it. mkosi will put together the image with
development headers and tools, compile your code in it, run your test
suite, then throw away the image again, and build a new one, this time
without development headers and tools, and install your build
artifacts in it. This final image is then “production-ready”, and only
contains your built program and the minimal set of packages you
configured otherwise. Such an image could then be deployed with
casync (or any other tool of course) to be delivered to your set of
servers, or IoT devices or whatever you are building.

mkosi is supposed to be legacy-free: the focus is clearly on
today’s technology, not yesteryear’s. Specifically this means that
we’ll generate GPT partition tables, not MBR/DOS ones. When you tell
mkosi to generate a bootable image for you, it will make it bootable
on EFI, not on legacy BIOS. The GPT images generated follow
specifications such as the Discoverable Partitions
Specification
,
so that /etc/fstab can remain unpopulated and tools such as
systemd-nspawn can automatically dissect the image and boot from
them.

So, let’s have a look on the specific images it can generate:

  1. Raw GPT disk image, with ext4 as root
  2. Raw GPT disk image, with btrfs as root
  3. Raw GPT disk image, with a read-only squashfs as root
  4. A plain directory on disk containing the OS tree directly (this is useful for creating generic container images)
  5. A btrfs subvolume on disk, similar to the plain directory
  6. A tarball of a plain directory

When any of the GPT choices above are selected, a couple of additional
options are available:

  1. A swap partition may be added in
  2. The system may be made bootable on EFI systems
  3. Separate partitions for /home and /srv may be added in
  4. The root, /home and /srv partitions may be optionally encrypted with LUKS
  5. The root partition may be protected using dm-verity, thus making offline attacks on the generated system hard
  6. If the image is made bootable, the dm-verity root hash is automatically added to the kernel command line, and the kernel together with its initial RAM disk and the kernel command line is optionally cryptographically signed for UEFI SecureBoot

Note that mkosi is distribution-agnostic. It currently can build
images based on the following Linux distributions:

  1. Fedora
  2. Debian
  3. Ubuntu
  4. ArchLinux
  5. openSUSE

Note though that not all distributions are supported at the same
feature level currently. Also, as mkosi is based on dnf
--installroot
, debootstrap, pacstrap and zypper, and those
packages are not packaged universally on all distributions, you might
not be able to build images for all those distributions on arbitrary
host distributions.

The GPT images are put together in a way that they aren’t just
compatible with UEFI systems, but also with VM and container managers
(that is, at least the smart ones, i.e. VM managers that know UEFI,
and container managers that grok GPT disk images) to a large
degree. In fact, the idea is that you can use mkosi to build a
single GPT image that may be used to:

  1. Boot on bare-metal boxes
  2. Boot in a VM
  3. Boot in a systemd-nspawn container
  4. Directly run a systemd service off, using systemd’s RootImage= unit file setting

Note that in all four cases the dm-verity data is automatically used
if available to ensure the image is not tampered with (yes, you read
that right, systemd-nspawn and systemd’s RootImage= setting
automatically do dm-verity these days if the image has it.)

Mode of Operation

The simplest usage of mkosi is by simply invoking it without
parameters (as root):

# mkosi

Without any configuration this will create a GPT disk image for you,
will call it image.raw and drop it in the current directory. The
distribution used will be the same one as your host runs.

Of course in most cases you want more control about how the image is
put together, i.e. select package sets, select the distribution, size
partitions and so on. Most of that you can actually specify on the
command line, but it is recommended to instead create a couple of
mkosi.$SOMETHING files and directories in some directory. Then,
simply change to that directory and run mkosi without any further
arguments. The tool will then look in the current working directory
for these files and directories and make use of them (similar to how
make looks for a Makefile…). Every single file/directory is
optional, but if they exist they are honored. Here’s a list of the
files/directories mkosi currently looks for:

  1. mkosi.default — This is the main configuration file, here you
    can configure what kind of image you want, which distribution, which
    packages and so on.

  2. mkosi.extra/ — If this directory exists, then mkosi will copy
    everything inside it into the images built. You can place arbitrary
    directory hierarchies in here, and they’ll be copied over whatever is
    already in the image, after it was put together by the distribution’s
    package manager. This is the best way to drop additional static files
    into the image, or override distribution-supplied ones.

  3. mkosi.build — This executable file is supposed to be a build
    script. When it exists, mkosi will build two images, one after the
    other in the mode already mentioned above: the first version is the
    build image, and may include various build-time dependencies such as
    a compiler or development headers. The build script is also copied
    into it, and then run inside it. The script should then build
    whatever shall be built and place the result in $DESTDIR (don’t
    worry, popular build tools such as Automake or Meson all honor
    $DESTDIR anyway, so there’s not much to do here explicitly). It may
    also run a test suite, or anything else you like. After the script
    finished, the build image is removed again, and a second image (the
    final image) is built. This time, no development packages are
    included, and the build script is not copied into the image again —
    however, the build artifacts from the first run (i.e. those placed in
    $DESTDIR) are copied into the image.

  4. mkosi.postinst — If this executable script exists, it is invoked
    inside the image (inside a systemd-nspawn invocation) and can
    adjust the image as it likes at a very late point in the image
    preparation. If mkosi.build exists, i.e. the dual-phased
    development build process used, then this script will be invoked
    twice: once inside the build image and once inside the final
    image. The first parameter passed to the script clarifies which phase
    it is run in.

  5. mkosi.nspawn — If this file exists, it should contain a
    container configuration file for systemd-nspawn (see
    systemd.nspawn(5)
    for details), which shall be shipped along with the final image and
    shall be included in the check-sum calculations (see below).

  6. mkosi.cache/ — If this directory exists, it is used as package
    cache directory for the builds. This directory is effectively bind
    mounted into the image at build time, in order to speed up building
    images. The package installers of the various distributions will
    place their package files here, so that subsequent runs can reuse
    them.

  7. mkosi.passphrase — If this file exists, it should contain a
    pass-phrase to use for the LUKS encryption (if that’s enabled for the
    image built). This file should not be readable to other users.

  8. mkosi.secure-boot.crt and mkosi.secure-boot.key should be an
    X.509 key pair to use for signing the kernel and initrd for UEFI
    SecureBoot, if that’s enabled.

How to use it

So, let’s come back to our most trivial example, without any of the
mkosi.$SOMETHING files around:

# mkosi

As mentioned, this will create a build file image.raw in the current
directory. How do we use it? Of course, we could dd it onto some USB
stick and boot it on a bare-metal device. However, it’s much simpler
to first run it in a container for testing:

# systemd-nspawn -bi image.raw

And there you go: the image should boot up, and just work for you.

Now, let’s make things more interesting. Let’s still not use any of
the mkosi.$SOMETHING files around:

# mkosi -t raw_btrfs --bootable -o foobar.raw
# systemd-nspawn -bi foobar.raw

This is similar as the above, but we made three changes: it’s no
longer GPT + ext4, but GPT + btrfs. Moreover, the system is made
bootable on UEFI systems, and finally, the output is now called
foobar.raw.

Because this system is bootable on UEFI systems, we can run it in KVM:

qemu-kvm -m 512 -smp 2 -bios /usr/share/edk2/ovmf/OVMF_CODE.fd -drive format=raw,file=foobar.raw

This will look very similar to the systemd-nspawn invocation, except
that this uses full VM virtualization rather than container
virtualization. (Note that the way to run a UEFI qemu/kvm instance
appears to change all the time and is different on the various
distributions. It’s quite annoying, and I can’t really tell you what
the right qemu command line is to make this work on your system.)

Of course, it’s not all raw GPT disk images with mkosi. Let’s try
a plain directory image:

# mkosi -d fedora -t directory -o quux
# systemd-nspawn -bD quux

Of course, if you generate the image as plain directory you can’t boot
it on bare-metal just like that, nor run it in a VM.

A more complex command line is the following:

# mkosi -d fedora -t raw_squashfs --checksum --xz --package=openssh-clients --package=emacs

In this mode we explicitly pick Fedora as the distribution to use, ask
mkosi to generate a compressed GPT image with a root squashfs,
compress the result with xz, and generate a SHA256SUMS file with
the hashes of the generated artifacts. The package will contain the
SSH client as well as everybody’s favorite editor.

Now, let’s make use of the various mkosi.$SOMETHING files. Let’s
say we are working on some Automake-based project and want to make it
easy to generate a disk image off the development tree with the
version you are hacking on. Create a configuration file:

# cat > mkosi.default <<EOF
[Distribution]
Distribution=fedora
Release=24

[Output]
Format=raw_btrfs
Bootable=yes

[Packages]
# The packages to appear in both the build and the final image
Packages=openssh-clients httpd
# The packages to appear in the build image, but absent from the final image
BuildPackages=make gcc libcurl-devel
EOF

And let’s add a build script:

# cat > mkosi.build <<EOF
#!/bin/sh
./autogen.sh
./configure --prefix=/usr
make -j `nproc`
make install
EOF
# chmod +x mkosi.build

And with all that in place we can now build our project into a disk image, simply by typing:

# mkosi

Let’s try it out:

# systemd-nspawn -bi image.raw

Of course, if you do this you’ll notice that building an image like
this can be quite slow. And slow build times are actively hurtful to
your productivity as a developer. Hence let’s make things a bit
faster. First, let’s make use of a package cache shared between runs:

# mkdir mkosi.cache

Building images now should already be substantially faster (and
generate less network traffic) as the packages will now be downloaded
only once and reused. However, you’ll notice that unpacking all those
packages and the rest of the work is still quite slow. But mkosi can
help you with that. Simply use mkosi‘s incremental build feature. In
this mode mkosi will make a copy of the build and final images
immediately before dropping in your build sources or artifacts, so
that building an image becomes a lot quicker: instead of always
starting totally from scratch a build will now reuse everything it can
reuse from a previous run, and immediately begin with building your
sources rather than the build image to build your sources in. To
enable the incremental build feature use -i:

# mkosi -i

Note that if you use this option, the package list is not updated
anymore from your distribution’s servers, as the cached copy is made
after all packages are installed, and hence until you actually delete
the cached copy the distribution’s network servers aren’t contacted
again and no RPMs or DEBs are downloaded. This means the distribution
you use becomes “frozen in time” this way. (Which might be a bad
thing, but also a good thing, as it makes things kinda reproducible.)

Of course, if you run mkosi a couple of times you’ll notice that it
won’t overwrite the generated image when it already exists. You can
either delete the file yourself first (rm image.raw) or let mkosi
do it for you right before building a new image, with mkosi -f. You
can also tell mkosi to not only remove any such pre-existing images,
but also remove any cached copies of the incremental feature, by using
-f twice.

I wrote mkosi originally in order to test systemd, and quickly
generate a disk image of various distributions with the most current
systemd version from git, without all that affecting my host system. I
regularly use mkosi for that today, in incremental mode. The two
commands I use most in that context are:

# mkosi -if && systemd-nspawn -bi image.raw

And sometimes:

# mkosi -iff && systemd-nspawn -bi image.raw

The latter I use only if I want to regenerate everything based on the
very newest set of RPMs provided by Fedora, instead of a cached
snapshot of it.

BTW, the mkosi files for systemd are included in the systemd git
tree:
mkosi.default
and
mkosi.build. This
way, any developer who wants to quickly test something with current
systemd git, or wants to prepare a patch based on it and test it can
check out the systemd repository and simply run mkosi in it and a
few minutes later he has a bootable image he can test in
systemd-nspawn or KVM. casync has similar files:
mkosi.default,
mkosi.build.

Random Interesting Features

  1. As mentioned already, mkosi will generate dm-verity enabled
    disk images if you ask for it. For that use the --verity switch on
    the command line or Verity= setting in mkosi.default. Of course,
    dm-verity implies that the root volume is read-only. In this mode
    the top-level dm-verity hash will be placed along-side the output
    disk image in a file named the same way, but with the .roothash
    suffix. If the image is to be created bootable, the root hash is also
    included on the kernel command line in the roothash= parameter,
    which current systemd versions can use to both find and activate the
    root partition in a dm-verity protected way. BTW: it’s a good idea
    to combine this dm-verity mode with the raw_squashfs image mode,
    to generate a genuinely protected, compressed image suitable for
    running in your IoT device.

  2. As indicated above, mkosi can automatically create a check-sum
    file SHA256SUMS for you (--checksum) covering all the files it
    outputs (which could be the image file itself, a matching .nspawn
    file using the mkosi.nspawn file mentioned above, as well as the
    .roothash file for the dm-verity root hash.) It can then
    optionally sign this with gpg (--sign). Note that systemd‘s
    machinectl pull-tar and machinectl pull-raw command can download
    these files and the SHA256SUMS file automatically and verify things
    on download. With other words: what mkosi outputs is perfectly
    ready for downloads using these two systemd commands.

  3. As mentioned, mkosi is big on supporting UEFI SecureBoot. To
    make use of that, place your X.509 key pair in two files
    mkosi.secureboot.crt and mkosi.secureboot.key, and set
    SecureBoot= or --secure-boot. If so, mkosi will sign the
    kernel/initrd/kernel command line combination during the build. Of
    course, if you use this mode, you should also use
    Verity=/--verity=, otherwise the setup makes only partial
    sense. Note that mkosi will not help you with actually enrolling
    the keys you use in your UEFI BIOS.

  4. mkosi has minimal support for GIT checkouts: when it recognizes
    it is run in a git checkout and you use the mkosi.build script
    stuff, the source tree will be copied into the build image, but will
    all files excluded by .gitignore removed.

  5. There’s support for encryption in place. Use --encrypt= or
    Encrypt=. Note that the UEFI ESP is never encrypted though, and the
    root partition only if explicitly requested. The /home and /srv
    partitions are unconditionally encrypted if that’s enabled.

  6. Images may be built with all documentation removed.

  7. The password for the root user and additional kernel command line
    arguments may be configured for the image to generate.

Minimum Requirements

Current mkosi requires Python 3.5, and has a number of dependencies,
listed in the
README. Most
notably you need a somewhat recent systemd version to make use of its
full feature set: systemd 233. Older versions are already packaged for
various distributions, but much of what I describe above is only
available in the most recent release mkosi 3.

The UEFI SecureBoot support requires sbsign which currently isn’t
available in Fedora, but there’s a
COPR
.

Future

It is my intention to continue turning mkosi into a tool suitable
for:

  1. Testing and debugging projects
  2. Building images for secure devices
  3. Building portable service images
  4. Building images for secure VMs and containers

One of the biggest goals I have for the future is to teach mkosi and
systemd/sd-boot native support for A/B IoT style partition
setups. The idea is that the combination of systemd, casync and
mkosi provides generic building blocks for building secure,
auto-updating devices in a generic way from, even though all pieces
may be used individually, too.

FAQ

  1. Why are you reinventing the wheel again? This is exactly like
    $SOMEOTHERPROJECT!
    — Well, to my knowledge there’s no tool that
    integrates this nicely with your project’s development tree, and can
    do dm-verity and UEFI SecureBoot and all that stuff for you. So
    nope, I don’t think this exactly like $SOMEOTHERPROJECT, thank you
    very much.

  2. What about creating MBR/DOS partition images? — That’s really
    out of focus to me. This is an exercise in figuring out how generic
    OSes and devices in the future should be built and an attempt to
    commoditize OS image building. And no, the future doesn’t speak MBR,
    sorry. That said, I’d be quite interested in adding support for
    booting on Raspberry Pi, possibly using a hybrid approach, i.e. using
    a GPT disk label, but arranging things in a way that the Raspberry Pi
    boot protocol (which is built around DOS partition tables), can still
    work.

  3. Is this portable? — Well, depends what you mean by
    portable. No, this tool runs on Linux only, and as it uses
    systemd-nspawn during the build process it doesn’t run on
    non-systemd systems either. But then again, you should be able to
    create images for any architecture you like with it, but of course if
    you want the image bootable on bare-metal systems only systems doing
    UEFI are supported (but systemd-nspawn should still work fine on
    them).

  4. Where can I get this stuff? — Try
    GitHub. And some distributions
    carry packaged versions, but I think none of them the current v3
    yet.

  5. Is this a systemd project? — Yes, it’s hosted under the
    systemd GitHub umbrella. And yes,
    during run-time systemd-nspawn in a current version is required. But
    no, the code-bases are separate otherwise, already because systemd
    is a C project, and mkosi Python.

  6. Requiring systemd 233 is a pretty steep requirement, no?
    Yes, but the feature we need kind of matters (systemd-nspawn‘s
    --overlay= switch), and again, this isn’t supposed to be a tool for
    legacy systems.

  7. Can I run the resulting images in LXC or Docker? — Humm, I am
    not an LXC nor Docker guy. If you select directory or subvolume
    as image type, LXC should be able to boot the generated images just
    fine, but I didn’t try. Last time I looked, Docker doesn’t permit
    running proper init systems as PID 1 inside the container, as they
    define their own run-time without intention to emulate a proper
    system. Hence, no I don’t think it will work, at least not with an
    unpatched Docker version. That said, again, don’t ask me questions
    about Docker, it’s not precisely my area of expertise, and quite
    frankly I am not a fan. To my knowledge neither LXC nor Docker are
    able to run containers directly off GPT disk images, hence the
    various raw_xyz image types are definitely not compatible with
    either. That means if you want to generate a single raw disk image
    that can be booted unmodified both in a container and on bare-metal,
    then systemd-nspawn is the container manager to go for
    (specifically, its -i/--image= switch).

Should you care? Is this a tool for you?

Well, that’s up to you really.

If you hack on some complex project and need a quick way to compile
and run your project on a specific current Linux distribution, then
mkosi is an excellent way to do that. Simply drop the mkosi.default
and mkosi.build files in your git tree and everything will be
easy. (And of course, as indicated above: if the project you are
hacking on happens to be called systemd or casync be aware that
those files are already part of the git tree — you can just use them.)

If you hack on some embedded or IoT device, then mkosi is a great
choice too, as it will make it reasonably easy to generate secure
images that are protected against offline modification, by using
dm-verity and UEFI SecureBoot.

If you are an administrator and need a nice way to build images for a
VM or systemd-nspawn container, or a portable service then mkosi
is an excellent choice too.

If you care about legacy computers, old distributions, non-systemd
init systems, old VM managers, Docker, … then no, mkosi is not for
you, but there are plenty of well-established alternatives around that
cover that nicely.

And never forget: mkosi is an Open Source project. We are happy to
accept your patches and other contributions.

Oh, and one unrelated last thing: don’t forget to submit your talk
proposal

and/or buy a ticket for
All Systems Go! 2017 in Berlin — the
conference where things like systemd, casync and mkosi are
discussed, along with a variety of other Linux userspace projects used
for building systems.

Backblaze B2, Cloud Storage on a Budget: One Year Later

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-b2-cloud-storage-on-a-budget-one-year-later/

B2 Cloud Storage Review

A year ago, Backblaze B2 Cloud Storage came out of beta and became available for everyone to use. We were pretty excited, even though it seemed like everyone and their brother had a cloud storage offering. Now that we are a year down the road let’s see how B2 has fared in the real world of tight budgets, maxed-out engineering schedules, insanely funded competition, and more. Spoiler alert: We’re still pretty excited…

Cloud Storage on a Budget

There are dozens of companies offering cloud storage and the landscape is cluttered with incomprehensible pricing models, cleverly disguised transfer and download charges, and differing levels of service that seem to be driven more by marketing departments than customer needs.

Backblaze B2 keeps things simple: A single performant level of service, a single affordable price for storage ($0.005/GB/month), a single affordable price for downloads ($0.02/GB), and a single list of transaction charges – all on a single pricing page.

Who’s Using B2?

By making cloud storage affordable, companies and organizations now have a way to store their data in the cloud and still be able to access and restore it as quickly as needed. You don’t have to choose between price and performance. Here are a few examples:

  • Media & Entertainment: KLRU-TV, Austin PBS, is using B2 to preserve their video catalog of the world renown musical anthology series, Austin City Limits.
  • LTO Migration: The Girl Scouts San Diego, were able to move their daily incremental backups from LTO tape to the cloud, saving money and time, while helping automate their entire backup process.
  • Cloud Migration: Vintage Aerial found it cost effective to discard their internal data server and store their unique hi-resolution images in B2 Cloud Storage.
  • Backup: Ahuja and Clark, a boutique accounting firm, was able to save over 80% on the cost to backup all their corporate and client data.

How is B2 Being Used?

B2 Cloud Storage can be accessed in four ways: using the Web GUI, using the CLI, using the API library, and using a product or service integrated with B2. While many customers are using the Web GUI, CLI and API to store and retrieve data, the most prolific use of B2 occurs via our integration partners. Each integration partner has certified they have met our best practices for integrating to B2 and we’ve tested each of the integrations submitted to us. Here are a few of the highlights.

  • NAS Devices – Synology and QNAP have integrations which allow their NAS devices to sync their data to/from B2.
  • Backup and Sync – CloudBerry, GoodSync, and Retrospect are just a few of the services that can backup and/or sync data to/from B2.
  • Hybrid Cloud – 45 Drives and OpenIO are solutions that allow you to setup and operate a hybrid data storage cloud environment.
  • Desktop Apps – CyberDuck, MountainDuck, Dropshare, and more allow users an easy way to store and use data in B2 right from your desktop.
  • Digital Asset Management – Cantemo, Cubix, CatDV, and axle Video, let you catalog your digital assets and then store them in B2 for fast retrieval when they are needed.

If you have an application or service that stores data in the cloud and it isn’t integrated with Backblaze B2, then your customers are probably paying too much for cloud storage.

What’s New in B2?

B2 Fireball – our rapid data ingest service. We send you a storage device, and you load it up with up to 40 TB of data and send it back, then we load the data into your B2 account. The cost is $550 per trip plus shipping. Save your network bandwidth with the B2 Fireball.

Lowered the download price – When we introduced B2, we set the price to download a gigabyte of data to be $0.05/GB – the same as most competitors. A year in, we reevaluated the price based on usage and decided to lower the price to $0.02/GB.

B2 User Groups – Backblaze Groups functionality is now available in B2. An administrator can invite users to a B2 centric Group to centralize the storage location for that group of users. For example, multiple members of a department working on a project will be able to archive their work-in-process activities into a single B2 bucket.

Time Machine backup – You may know that you can use your Synology NAS as the destination for your Time Machine backup. With B2 you can also sync your Synology NAS to B2 for a true 3-2-1 backup solution. If your system crashes or is lost, you can restore your Time Machine image directly from B2 to your new machine.

Life Cycle Rules – Create rules that allow you to manage the length of time deleted files will remain in your B2 bucket before they are deleted. A great option for managing the cleanup of outdated file versions to save on storage costs.

Large Files – In the B2 Web GUI you can upload files as large as 500 MB using either the upload or drag-and-drop functionality. The B2 CLI and API support the ability to upload/download files as large as 10 TB.

5 MB file part size – When working with large files, the minimum file part size can now be set as low as 5 MB versus the previous low setting of 100 MB. Now the range of a file part when working with large files can be from 5 MB to 5GB. This increases the throughput of your data uploads and downloads.

SHA-1 at the end – This feature allows you to compute the SHA-1 checksum and append it to the end of the request body versus doing the computation before the file is sent. This is especially useful for those applications which stream data to/from B2.

Cache-Control – When data is downloaded from B2 into a browser, the length of time the file remains in the browser cache can be set at the bucket level using the b2_create_bucket and b2_update_bucket API calls. Setting this policy is optional.

Customized delimiters – Used in the API, this allows you to specify a delimiter to use for a given purpose. A common use is to set a delimiter in the file name string. Then use that delimiter to detect a folder name within the string.

Looking Ahead

Over the past year we added nearly 30,000 new B2 customers to the fold and are welcoming more and more each day as B2 continues to grow. We have plans to expand our storage footprint by adding more data centers as we look forward to moving towards a multi-region environment.

For those of you who are B2 customers – thank you for helping build B2. If you have an interesting way you are using B2, tell us in the comments below.

The post Backblaze B2, Cloud Storage on a Budget: One Year Later appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.