Tag Archives: training

CoderDojo: 2000 Dojos ever

Post Syndicated from Giustina Mizzoni original https://www.raspberrypi.org/blog/2000-dojos-ever/

Every day of the week, we verify new Dojos all around the world, and each Dojo is championed by passionate volunteers. Last week, a huge milestone for the CoderDojo community went by relatively unnoticed: in the history of the movement, more than 2000 Dojos have now been verified!

CoderDojo banner — 2000 Dojos

2000 Dojos

This is a phenomenal achievement for a movement that’s just six years old and powered by volunteers. Presently, there are more than 1650 active Dojos running weekly, fortnightly, or monthly, and all of them are free for participants — for example, the Dojos run by Joel Bayubasire in Kampala, Uganda:

Joel Bayubasire with Ninjas at his Ugandan Dojo — 2000 Dojos

Empowering refugee children

This week, Joel set up his second Dojo and verified it on our global map. Joel is a Congolese refugee living in Kampala, Uganda, where he is currently completing his PhD in Economics at Madison International Institute and Business School.

Joel understands first-hand the challenges faced by refugees who were forced to leave their country due to war or conflict. Uganda is currently hosting more than 1.2 million refugees, 60% of which are children (World Bank, 2017). As refugees, children are only allowed to attend local schools until the age of 12. This results in lower educational attainment, which will likely affect their future employment prospects.

Two girls at a laptop. Joel Bayubasire CoderDojo — 2000 Dojos

Joel has the motivation to overcome these challenges, because he understands the power of education. Therefore, he initiated a number of community-based activities to provide educational opportunities for refugee children. As part of this, he founded his first Dojo earlier in the year, with the aim of giving these children a chance to compete in today’s global knowledge-based economy.

Two boys at a laptop. Joel Bayubasire CoderDojo — 2000 Dojos

Aware that securing volunteer mentors would be a challenge, Joel trained eight young people from the community to become youth mentors to their peers. He explains:

I believe that the mastery of computer coding allows talented young people to thrive professionally and enables them to not only be consumers but creators of the interconnected world of today!

Based on the success of Joel’s first Dojo, he has now expanded the CoderDojo initiative in his community; his plan is to provide computer science training for more than 300 refugee youths in Kampala by 2019. If you’d like to learn more about Joel’s efforts, head to this website.

Join the movement

If you are interested in creating opportunities for the young people in your community, then join the growing CoderDojo movement — you can volunteer to start a Dojo or to support an existing one today!

The post CoderDojo: 2000 Dojos ever appeared first on Raspberry Pi.

AWS Contributes to Milestone 1.0 Release and Adds Model Serving Capability for Apache MXNet

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/aws-contributes-to-milestone-1-0-release-and-adds-model-serving-capability-for-apache-mxnet/

Post by Dr. Matt Wood

Today AWS announced contributions to the milestone 1.0 release of the Apache MXNet deep learning engine including the introduction of a new model-serving capability for MXNet. The new capabilities in MXNet provide the following benefits to users:

1) MXNet is easier to use: The model server for MXNet is a new capability introduced by AWS, and it packages, runs, and serves deep learning models in seconds with just a few lines of code, making them accessible over the internet via an API endpoint and thus easy to integrate into applications. The 1.0 release also includes an advanced indexing capability that enables users to perform matrix operations in a more intuitive manner.

  • Model Serving enables set up of an API endpoint for prediction: It saves developers time and effort by condensing the task of setting up an API endpoint for running and integrating prediction functionality into an application to just a few lines of code. It bridges the barrier between Python-based deep learning frameworks and production systems through a Docker container-based deployment model.
  • Advanced indexing for array operations in MXNet: It is now more intuitive for developers to leverage the powerful array operations in MXNet. They can use the advanced indexing capability by leveraging existing knowledge of NumPy/SciPy arrays. For example, it supports MXNet NDArray and Numpy ndarray as index, e.g. (a[mx.nd.array([1,2], dtype = ‘int32’]).

2) MXNet is faster: The 1.0 release includes implementation of cutting-edge features that optimize the performance of training and inference. Gradient compression enables users to train models up to five times faster by reducing communication bandwidth between compute nodes without loss in convergence rate or accuracy. For speech recognition acoustic modeling like the Alexa voice, this feature can reduce network bandwidth by up to three orders of magnitude during training. With the support of NVIDIA Collective Communication Library (NCCL), users can train a model 20% faster on multi-GPU systems.

  • Optimize network bandwidth with gradient compression: In distributed training, each machine must communicate frequently with others to update the weight-vectors and thereby collectively build a single model, leading to high network traffic. Gradient compression algorithm enables users to train models up to five times faster by compressing the model changes communicated by each instance.
  • Optimize the training performance by taking advantage of NCCL: NCCL implements multi-GPU and multi-node collective communication primitives that are performance optimized for NVIDIA GPUs. NCCL provides communication routines that are optimized to achieve high bandwidth over interconnection between multi-GPUs. MXNet supports NCCL to train models about 20% faster on multi-GPU systems.

3) MXNet provides easy interoperability: MXNet now includes a tool for converting neural network code written with the Caffe framework to MXNet code, making it easier for users to take advantage of MXNet’s scalability and performance.

  • Migrate Caffe models to MXNet: It is now possible to easily migrate Caffe code to MXNet, using the new source code translation tool for converting Caffe code to MXNet code.

MXNet has helped developers and researchers make progress with everything from language translation to autonomous vehicles and behavioral biometric security. We are excited to see the broad base of users that are building production artificial intelligence applications powered by neural network models developed and trained with MXNet. For example, the autonomous driving company TuSimple recently piloted a self-driving truck on a 200-mile journey from Yuma, Arizona to San Diego, California using MXNet. This release also includes a full-featured and performance optimized version of the Gluon programming interface. The ease-of-use associated with it combined with the extensive set of tutorials has led significant adoption among developers new to deep learning. The flexibility of the interface has driven interest within the research community, especially in the natural language processing domain.

Getting started with MXNet
Getting started with MXNet is simple. To learn more about the Gluon interface and deep learning, you can reference this comprehensive set of tutorials, which covers everything from an introduction to deep learning to how to implement cutting-edge neural network models. If you’re a contributor to a machine learning framework, check out the interface specs on GitHub.

To get started with the Model Server for Apache MXNet, install the library with the following command:

$ pip install mxnet-model-server

The Model Server library has a Model Zoo with 10 pre-trained deep learning models, including the SqueezeNet 1.1 object classification model. You can start serving the SqueezeNet model with just the following command:

$ mxnet-model-server \
  --models squeezenet=https://s3.amazonaws.com/model-server/models/squeezenet_v1.1/squeezenet_v1.1.model \
  --service dms/model_service/mxnet_vision_service.py

Learn more about the Model Server and view the source code, reference examples, and tutorials here: https://github.com/awslabs/mxnet-model-server/

-Dr. Matt Wood

Glenn’s Take on re:Invent 2017 Part 1

Post Syndicated from Glenn Gore original https://aws.amazon.com/blogs/architecture/glenns-take-on-reinvent-2017-part-1/

GREETINGS FROM LAS VEGAS

Glenn Gore here, Chief Architect for AWS. I’m in Las Vegas this week — with 43K others — for re:Invent 2017. We have a lot of exciting announcements this week. I’m going to post to the AWS Architecture blog each day with my take on what’s interesting about some of the announcements from a cloud architectural perspective.

Why not start at the beginning? At the Midnight Madness launch on Sunday night, we announced Amazon Sumerian, our platform for VR, AR, and mixed reality. The hype around VR/AR has existed for many years, though for me, it is a perfect example of how a working end-to-end solution often requires innovation from multiple sources. For AR/VR to be successful, we need many components to come together in a coherent manner to provide a great experience.

First, we need lightweight, high-definition goggles with motion tracking that are comfortable to wear. Second, we need to track movement of our body and hands in a 3-D space so that we can interact with virtual objects in the virtual world. Third, we need to build the virtual world itself and populate it with assets and define how the interactions will work and connect with various other systems.

There has been rapid development of the physical devices for AR/VR, ranging from iOS devices to Oculus Rift and HTC Vive, which provide excellent capabilities for the first and second components defined above. With the launch of Amazon Sumerian we are solving for the third area, which will help developers easily build their own virtual worlds and start experimenting and innovating with how to apply AR/VR in new ways.

Already, within 48 hours of Amazon Sumerian being announced, I have had multiple discussions with customers and partners around some cool use cases where VR can help in training simulations, remote-operator controls, or with new ideas around interacting with complex visual data sets, which starts bringing concepts straight out of sci-fi movies into the real (virtual) world. I am really excited to see how Sumerian will unlock the creative potential of developers and where this will lead.

Amazon MQ
I am a huge fan of distributed architectures where asynchronous messaging is the backbone of connecting the discrete components together. Amazon Simple Queue Service (Amazon SQS) is one of my favorite services due to its simplicity, scalability, performance, and the incredible flexibility of how you can use Amazon SQS in so many different ways to solve complex queuing scenarios.

While Amazon SQS is easy to use when building cloud-native applications on AWS, many of our customers running existing applications on-premises required support for different messaging protocols such as: Java Message Service (JMS), .Net Messaging Service (NMS), Advanced Message Queuing Protocol (AMQP), MQ Telemetry Transport (MQTT), Simple (or Streaming) Text Orientated Messaging Protocol (STOMP), and WebSockets. One of the most popular applications for on-premise message brokers is Apache ActiveMQ. With the release of Amazon MQ, you can now run Apache ActiveMQ on AWS as a managed service similar to what we did with Amazon ElastiCache back in 2012. For me, there are two compelling, major benefits that Amazon MQ provides:

  • Integrate existing applications with cloud-native applications without having to change a line of application code if using one of the supported messaging protocols. This removes one of the biggest blockers for integration between the old and the new.
  • Remove the complexity of configuring Multi-AZ resilient message broker services as Amazon MQ provides out-of-the-box redundancy by always storing messages redundantly across Availability Zones. Protection is provided against failure of a broker through to complete failure of an Availability Zone.

I believe that Amazon MQ is a major component in the tools required to help you migrate your existing applications to AWS. Having set up cross-data center Apache ActiveMQ clusters in the past myself and then testing to ensure they work as expected during critical failure scenarios, technical staff working on migrations to AWS benefit from the ease of deploying a fully redundant, managed Apache ActiveMQ cluster within minutes.

Who would have thought I would have been so excited to revisit Apache ActiveMQ in 2017 after using SQS for many, many years? Choice is a wonderful thing.

Amazon GuardDuty
Maintaining application and information security in the modern world is increasingly complex and is constantly evolving and changing as new threats emerge. This is due to the scale, variety, and distribution of services required in a competitive online world.

At Amazon, security is our number one priority. Thus, we are always looking at how we can increase security detection and protection while simplifying the implementation of advanced security practices for our customers. As a result, we released Amazon GuardDuty, which provides intelligent threat detection by using a combination of multiple information sources, transactional telemetry, and the application of machine learning models developed by AWS. One of the biggest benefits of Amazon GuardDuty that I appreciate is that enabling this service requires zero software, agents, sensors, or network choke points. which can all impact performance or reliability of the service you are trying to protect. Amazon GuardDuty works by monitoring your VPC flow logs, AWS CloudTrail events, DNS logs, as well as combing other sources of security threats that AWS is aggregating from our own internal and external sources.

The use of machine learning in Amazon GuardDuty allows it to identify changes in behavior, which could be suspicious and require additional investigation. Amazon GuardDuty works across all of your AWS accounts allowing for an aggregated analysis and ensuring centralized management of detected threats across accounts. This is important for our larger customers who can be running many hundreds of AWS accounts across their organization, as providing a single common threat detection of their organizational use of AWS is critical to ensuring they are protecting themselves.

Detection, though, is only the beginning of what Amazon GuardDuty enables. When a threat is identified in Amazon GuardDuty, you can configure remediation scripts or trigger Lambda functions where you have custom responses that enable you to start building automated responses to a variety of different common threats. Speed of response is required when a security incident may be taking place. For example, Amazon GuardDuty detects that an Amazon Elastic Compute Cloud (Amazon EC2) instance might be compromised due to traffic from a known set of malicious IP addresses. Upon detection of a compromised EC2 instance, we could apply an access control entry restricting outbound traffic for that instance, which stops loss of data until a security engineer can assess what has occurred.

Whether you are a customer running a single service in a single account, or a global customer with hundreds of accounts with thousands of applications, or a startup with hundreds of micro-services with hourly release cycle in a devops world, I recommend enabling Amazon GuardDuty. We have a 30-day free trial available for all new customers of this service. As it is a monitor of events, there is no change required to your architecture within AWS.

Stay tuned for tomorrow’s post on AWS Media Services and Amazon Neptune.

 

Glenn during the Tour du Mont Blanc

UI Testing at Scale with AWS Lambda

Post Syndicated from Stas Neyman original https://aws.amazon.com/blogs/devops/ui-testing-at-scale-with-aws-lambda/

This is a guest blog post by Wes Couch and Kurt Waechter from the Blackboard Internal Product Development team about their experience using AWS Lambda.

One year ago, one of our UI test suites took hours to run. Last month, it took 16 minutes. Today, it takes 39 seconds. Here’s how we did it.

The backstory:

Blackboard is a global leader in delivering robust and innovative education software and services to clients in higher education, government, K12, and corporate training. We have a large product development team working across the globe in at least 10 different time zones, with an internal tools team providing support for quality and workflows. We have been using Selenium Webdriver to perform automated cross-browser UI testing since 2007. Because we are now practicing continuous delivery, the automated UI testing challenge has grown due to the faster release schedule. On top of that, every commit made to each branch triggers an execution of our automated UI test suite. If you have ever implemented an automated UI testing infrastructure, you know that it can be very challenging to scale and maintain. Although there are services that are useful for testing different browser/OS combinations, they don’t meet our scale needs.

It used to take three hours to synchronously run our functional UI suite, which revealed the obvious need for parallel execution. Previously, we used Mesos to orchestrate a Selenium Grid Docker container for each test run. This way, we were able to run eight concurrent threads for test execution, which took an average of 16 minutes. Although this setup is fine for a single workflow, the cracks started to show when we reached the scale required for Blackboard’s mature product lines. Going beyond eight concurrent sessions on a single container introduced performance problems that impact the reliability of tests (for example, issues in Webdriver or the browser popping up frequently). We tried Mesos and considered Kubernetes for Selenium Grid orchestration, but the answer to scaling a Selenium Grid was to think smaller, not larger. This led to our breakthrough with AWS Lambda.

The solution:

We started using AWS Lambda for UI testing because it doesn’t require costly infrastructure or countless man hours to maintain. The steps we outline in this blog post took one work day, from inception to implementation. By simply packaging the UI test suite into a Lambda function, we can execute these tests in parallel on a massive scale. We use a custom JUnit test runner that invokes the Lambda function with a request to run each test from the suite. The runner then aggregates the results returned from each Lambda test execution.

Selenium is the industry standard for testing UI at scale. Although there are other options to achieve the same thing in Lambda, we chose this mature suite of tools. Selenium is backed by Google, Firefox, and others to help the industry drive their browsers with code. This makes Lambda and Selenium a compelling stack for achieving UI testing at scale.

Making Chrome Run in Lambda

Currently, Chrome for Linux will not run in Lambda due to an absent mount point. By rebuilding Chrome with a slight modification, as Marco Lüthy originally demonstrated, you can run it inside Lambda anyway! It took about two hours to build the current master branch of Chromium to build on a c4.4xlarge. Unfortunately, the current version of ChromeDriver, 2.33, does not support any version of Chrome above 62, so we’ll be using Marco’s modified version of version 60 for the near future.

Required System Libraries

The Lambda runtime environment comes with a subset of common shared libraries. This means we need to include some extra libraries to get Chrome and ChromeDriver to work. Anything that exists in the java resources folder during compile time is included in the base directory of the compiled jar file. When this jar file is deployed to Lambda, it is placed in the /var/task/ directory. This allows us to simply place the libraries in the java resources folder under a folder named lib/ so they are right where they need to be when the Lambda function is invoked.

To get these libraries, create an EC2 instance and choose the Amazon Linux AMI.

Next, use ssh to connect to the server. After you connect to the new instance, search for the libraries to find their locations.

sudo find / -name libgconf-2.so.4
sudo find / -name libORBit-2.so.0

Now that you have the locations of the libraries, copy these files from the EC2 instance and place them in the java resources folder under lib/.

Packaging the Tests

To deploy the test suite to Lambda, we used a simple Gradle tool called ShadowJar, which is similar to the Maven Shade Plugin. It packages the libraries and dependencies inside the jar that is built. Usually test dependencies and sources aren’t included in a jar, but for this instance we want to include them. To include the test dependencies, add this section to the build.gradle file.

shadowJar {
   from sourceSets.test.output
   configurations = [project.configurations.testRuntime]
}

Deploying the Test Suite

Now that our tests are packaged with the dependencies in a jar, we need to get them into a running Lambda function. We use  simple SAM  templates to upload the packaged jar into S3, and then deploy it to Lambda with our settings.

{
   "AWSTemplateFormatVersion": "2010-09-09",
   "Transform": "AWS::Serverless-2016-10-31",
   "Resources": {
       "LambdaTestHandler": {
           "Type": "AWS::Serverless::Function",
           "Properties": {
               "CodeUri": "./build/libs/your-test-jar-all.jar",
               "Runtime": "java8",
               "Handler": "com.example.LambdaTestHandler::handleRequest",
               "Role": "<YourLambdaRoleArn>",
               "Timeout": 300,
               "MemorySize": 1536
           }
       }
   }
}

We use the maximum timeout available to ensure our tests have plenty of time to run. We also use the maximum memory size because this ensures our Lambda function can support Chrome and other resources required to run a UI test.

Specifying the handler is important because this class executes the desired test. The test handler should be able to receive a test class and method. With this information it will then execute the test and respond with the results.

public LambdaTestResult handleRequest(TestRequest testRequest, Context context) {
   LoggerContainer.LOGGER = new Logger(context.getLogger());
  
   BlockJUnit4ClassRunner runner = getRunnerForSingleTest(testRequest);
  
   Result result = new JUnitCore().run(runner);

   return new LambdaTestResult(result);
}

Creating a Lambda-Compatible ChromeDriver

We provide developers with an easily accessible ChromeDriver for local test writing and debugging. When we are running tests on AWS, we have configured ChromeDriver to run them in Lambda.

To configure ChromeDriver, we first need to tell ChromeDriver where to find the Chrome binary. Because we know that ChromeDriver is going to be unzipped into the root task directory, we should point the ChromeDriver configuration at that location.

The settings for getting ChromeDriver running are mostly related to Chrome, which must have its working directories pointed at the tmp/ folder.

Start with the default DesiredCapabilities for ChromeDriver, and then add the following settings to enable your ChromeDriver to start in Lambda.

public ChromeDriver createLambdaChromeDriver() {
   ChromeOptions options = new ChromeOptions();

   // Set the location of the chrome binary from the resources folder
   options.setBinary("/var/task/chrome");

   // Include these settings to allow Chrome to run in Lambda
   options.addArguments("--disable-gpu");
   options.addArguments("--headless");
   options.addArguments("--window-size=1366,768");
   options.addArguments("--single-process");
   options.addArguments("--no-sandbox");
   options.addArguments("--user-data-dir=/tmp/user-data");
   options.addArguments("--data-path=/tmp/data-path");
   options.addArguments("--homedir=/tmp");
   options.addArguments("--disk-cache-dir=/tmp/cache-dir");
  
   DesiredCapabilities desiredCapabilities = DesiredCapabilities.chrome();
   desiredCapabilities.setCapability(ChromeOptions.CAPABILITY, options);
  
   return new ChromeDriver(desiredCapabilities);
}

Executing Tests in Parallel

You can approach parallel test execution in Lambda in many different ways. Your approach depends on the structure and design of your test suite. For our solution, we implemented a custom test runner that uses reflection and JUnit libraries to create a list of test cases we want run. When we have the list, we create a TestRequest object to pass into the Lambda function that we have deployed. In this TestRequest, we place the class name, test method, and the test run identifier. When the Lambda function receives this TestRequest, our LambdaTestHandler generates and runs the JUnit test. After the test is complete, the test result is sent to the test runner. The test runner compiles a result after all of the tests are complete. By executing the same Lambda function multiple times with different test requests, we can effectively run the entire test suite in parallel.

To get screenshots and other test data, we pipe those files during test execution to an S3 bucket under the test run identifier prefix. When the tests are complete, we link the files to each test execution in the report generated from the test run. This lets us easily investigate test executions.

Pro Tip: Dynamically Loading Binaries

AWS Lambda has a limit of 250 MB of uncompressed space for packaged Lambda functions. Because we have libraries and other dependencies to our test suite, we hit this limit when we tried to upload a function that contained Chrome and ChromeDriver (~140 MB). This test suite was not originally intended to be used with Lambda. Otherwise, we would have scrutinized some of the included libraries. To get around this limit, we used the Lambda functions temporary directory, which allows up to 500 MB of space at runtime. Downloading these binaries at runtime moves some of that space requirement into the temporary directory. This allows more room for libraries and dependencies. You can do this by grabbing Chrome and ChromeDriver from an S3 bucket and marking them as executable using built-in Java libraries. If you take this route, be sure to point to the new location for these executables in order to create a ChromeDriver.

private static void downloadS3ObjectToExecutableFile(String key) throws IOException {
   File file = new File("/tmp/" + key);

   GetObjectRequest request = new GetObjectRequest("s3-bucket-name", key);

   FileUtils.copyInputStreamToFile(s3client.getObject(request).getObjectContent(), file);
   file.setExecutable(true);
}

Lambda-Selenium Project Source

We have compiled an open source example that you can grab from the Blackboard Github repository. Grab the code and try it out!

https://blackboard.github.io/lambda-selenium/

Conclusion

One year ago, one of our UI test suites took hours to run. Last month, it took 16 minutes. Today, it takes 39 seconds. Thanks to AWS Lambda, we can reduce our build times and perform automated UI testing at scale!

On Architecture and the State of the Art

Post Syndicated from Philip Fitzsimons original https://aws.amazon.com/blogs/architecture/on-architecture-and-the-state-of-the-art/

On the AWS Solutions Architecture team we know we’re following in the footsteps of other technical experts who pulled together the best practices of their eras. Around 22 BC the Roman Architect Vitruvius Pollio wrote On architecture (published as The Ten Books on Architecture), which became a seminal work on architectural theory. Vitruvius captured the best practices of his contemporaries and those who went before him (especially the Greek architects).

Closer to our time, in 1910, another technical expert, Henry Harrison Suplee, wrote Gas Turbine: progress in the design and construction of turbines operated by gases of combustion, from which we believe the phrase “state of the art” originates:

It has therefore been thought desirable to gather under one cover the most important papers which have appeared upon the subject of the gas turbine in England, France, Germany, and Switzerland, together with some account of the work in America, and to add to this such information upon actual experimental machines as can be secured.

In the present state of the art this is all that can be done, but it is believed that this will aid materially in the conduct of subsequent work, and place in the hands of the gas-power engineer a collection of material not generally accessible or available in convenient form.  

Source

Both authors wrote books that captured the current knowledge on design principles and best practices (in architecture and engineering) to improve awareness and adoption. Like these authors, we at AWS believe that capturing and sharing best practices leads to better outcomes. This follows a pattern we established internally, in our Principal Engineering community. In 2012 we started an initiative called “Well-Architected” to help share the best practices for architecting in the cloud with our customers.

Every year AWS Solution Architects dedicate hundreds of thousands of hours to helping customers build architectures that are cloud native. Through customer feedback, and real world experience we see what strategies, patterns, and approaches work for you.

“After our well-architected review and subsequent migration to the cloud, we saw the tremendous cost-savings potential of Amazon Web Services. By using the industry-standard service, we can invest the majority of our time and energy into enhancing our solutions. Thanks to (consulting partner) 1Strategy’s deep, technical AWS expertise and flexibility during our migration, we were able to leverage the strengths of AWS quickly.”

Paul Cooley, Chief Technology Officer for Imprev

This year we have again refreshed the AWS Well-Architected Framework, with a particular focus on Operational Excellence. Last year we announced the addition of Operational Excellence as a new pillar to AWS Well-Architected Framework. Having carried out thousands of reviews since then, we reexamined the pillar and are pleased to announce some significant changes. First, the pillar dives more deeply into people and process because this is an area where we see the most opportunities for teams to improve. Second, we’ve pivoted heavily to focusing on whether your team and your workload are ready for runtime operations. Key to this is ensuring that in the early phases of design that you think about how your architecture will be operated. Reflecting on this we realized that Operational Excellence should be the first pillar to support the “Architect for run-time operations” approach.

We’ve also added detail on how Amazon approaches technology architecture, covering topics such as our Principal Engineering Community and two-way doors and mechanisms. We refreshed the other pillars to reflect the evolution of AWS, and the best practices we are seeing in the field. We have also added detail on the review process, in the surprisingly named “The Review Process” section.

As part of refreshing the pillars are have also released a new Operational Excellence Pillar whitepaper, and have updated the whitepapers for all of the other pillars of the Framework. For example we have significantly updated the Reliability Pillar whitepaper to provide guidance on application design for high availability. New sections cover techniques for high availability including and beyond infrastructure implications, and considerations across the application lifecycle. This updated whitepaper also provides examples that show how to achieve availability goals in single and multi-region architectures.

You can find free training and all of the ”state of the art” whitepapers on the AWS Well-Architected homepage:

Philip Fitzsimons, Leader, AWS Well-Architected Team

Prepare to run a Code Club on FutureLearn

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/code-club-futurelearn/

Prepare to run a Code Club with our newest free online course, available now on FutureLearn!

FutureLearn: Prepare to Run a Code Club

Ready to launch! Our free FutureLearn course ‘Prepare to Run a Code Club’ starts next week and you can sign up now: https://www.futurelearn.com/courses/code-club

Code Club

As of today, more than 10000 Code Clubs run in 130 countries, delivering free coding opportunities to approximately 150000 children across the globe.

A child absorbed in a task at a Code Club

As an organisation, Code Club provides free learning resources and training materials to supports the ever-growing and truly inspiring community of volunteers and educators who set up and run Code Clubs.

FutureLearn

Today we’re launching our latest free online course on FutureLearn, dedicated to training and supporting new Code Club volunteers. It will give you practical guidance on all things Code Club, as well as a taste of beginner programming!

Split over three weeks and running for 3–4 hours in total, the course provides hands-on advice and tips on everything you need to know to run a successful, fun, and educational club.

“Week 1 kicks off with advice on how to prepare to start a Code Club, for example which hardware and software are needed. Week 2 focusses on how to deliver Code Club sessions, with practical tips on helping young people learn and an easy taster coding project to try out. In the final week, the course looks at interesting ideas to enrich and extend club sessions.”
— Sarah Sherman-Chase, Code Club Participation Manager

The course is available wherever you live, and it is completely free — sign up now!

If you’re already a volunteer, the course will be a great refresher, and a chance to share your insights with newcomers. Moreover, it is also useful for parents and guardians who wish to learn more about Code Club.

Your next step

Interested in learning more? You can start the course today by visiting FutureLearn. And to find out more about Code Clubs in your country, visit Code Club UK or Code Club International.

Code Club partners from across the globe gathered together for a group photo at the International Meetup

We love hearing your Code Club stories! If you’re a volunteer, are in the process of setting up a club, or are inspired to learn more, share your story in the comments below or via social media, making sure to tag @CodeClub and @CodeClubWorld.

You might also be interested in our other free courses on the FutureLearn platform, including Teaching Physical Computing with Raspberry Pi and Python and Teaching Programming in Primary Schools.

 

The post Prepare to run a Code Club on FutureLearn appeared first on Raspberry Pi.

Computing in schools: the report card

Post Syndicated from Philip Colligan original https://www.raspberrypi.org/blog/after-the-reboot/

Today the Royal Society published After the Reboot, a report card on the state of computing education in UK schools. It’s a serious piece of work, published with lots of accompanying research and data, and well worth a read if you care about these issues (which, if you’re reading this blog, I guess you do).

The headline message is that, while a lot has been achieved, there’s a long way to go before we can say that young people are consistently getting the computing education they need and deserve in UK schools.

If this were a school report card, it would probably say: “good progress when he applies himself, but would benefit from more focus and effort in class” (which is eerily reminiscent of my own school reports).

A child coding in Scratch on a laptop - Royal Society After the Reboot

Good progress

After the Reboot comes five and a half years after the Royal Society’s first review of computing education, Shut down or restart, a report that was published just a few days before the Education Secretary announced in January 2012 that he was scrapping the widely discredited ICT programme of study.

There’s no doubt that a lot has been achieved since 2012, and the Royal Society has done a good job of documenting those successes in this latest report. Computing is now part of the curriculum for all schools. There’s a Computer Science GCSE that is studied by thousands of young people. Organisations like Computing At School have built a grassroots movement of educators who are leading fantastic work in schools up and down the country. Those are big wins.

The Raspberry Pi Foundation has been playing its part. With the support of partners like Google, we’ve trained over a thousand UK educators through our Picademy programme. Those educators have gone on to work with hundreds of thousands of students, and many have become leaders in the field. Many thousands more have taken our free online training courses, and through our partnership with BT, CAS and the BCS on the Barefoot programme, we’re supporting thousands of primary school teachers to deliver the computing curriculum. Earlier this year we launched a free magazine for computing educators, Hello World, which has over 14,000 subscribers after just three editions.

A group of people learning about digital making - Royal Society After the Reboot

More to do

Despite all the progress, the Royal Society study has confirmed what many of us have been saying for some time: we need to do much more to support teachers to develop the skills and confidence to deliver the computing curriculum. More than anything, we need to give them the time to invest in their own professional development. The UK led the way on putting computing in the curriculum. Now we need to follow through on that promise by investing in a huge effort to support professional development across the school system.

This isn’t a problem that any one organisation or sector can solve on its own. It will require a grand coalition of government, industry, non-profits, and educators if we are going to make change at the pace that our young people need and deserve. Over the coming weeks and months, we’ll be working with our partners to figure out how we make that happen.

A boy learning about computing from a woman - Royal Society After the Reboot

The other 75%

While the Royal Society report rightly focuses on what happens in classrooms during the school day, we need to remember that children spend only 25% of their waking hours there. What about the other 75%?

Ask any computer scientist, engineer, or maker, and they’ll tell stories about how much they learned in those precious discretionary hours.

Ask an engineer of a certain age (ahem), and they will tell you about the local computing club where they got hands-on with new technologies, picked up new ideas, and were given help by peers and mentors. They might also tell you how they would spend dozens of hours typing in hundreds of line of code from a magazine to create their own game, and dozens more debugging when it didn’t work.

One of our goals at the Raspberry Pi Foundation is to lead the revival in that culture of informal learning.

The revival of computing clubs

There are now more than 6,000 active Code Clubs in the UK, engaging over 90,000 young people each week. 41% of the kids at Code Club are girls. More than 150 UK CoderDojos take place in universities, science centres, and corporate offices, providing a safe space for over 4,000 young people to learn programming and digital making.

So far this year, there have been 164 Raspberry Jams in the UK, volunteer-led meetups attended by over 10,000 people, who come to learn from volunteers and share their digital making projects.

It’s a movement, and it’s growing fast. One of the most striking facts is that whenever a new Code Club, CoderDojo, or Raspberry Jam is set up, it is immediately oversubscribed.

So while we work on fixing the education system, there’s a tangible way that we can all make a huge difference right now. You can help set up a Code Club, get involved with CoderDojo, or join the Raspberry Jam movement.

The post Computing in schools: the report card appeared first on Raspberry Pi.

Say Hello To Our Newest AWS Community Heroes (Fall 2017 Edition)

Post Syndicated from Sara Rodas original https://aws.amazon.com/blogs/aws/say-hello-to-our-newest-aws-community-heroes-fall-2017-edition/

The AWS Community Heroes program helps shine a spotlight on some of the innovative work being done by rockstar AWS developers around the globe. Marrying cloud expertise with a passion for community building and education, these heroes share their time and knowledge across social media and through in-person events. Heroes also actively help drive community-led tracks at conferences. At this year’s re:Invent, many Heroes will be speaking during the Monday Community Day track.

This November, we are thrilled to have four Heroes joining our network of cloud innovators. Without further ado, meet to our newest AWS Community Heroes!

 

Anh Ho Viet

Anh Ho Viet is the founder of AWS Vietnam User Group, Co-founder & CEO of OSAM, an AWS Consulting Partner in Vietnam, an AWS Certified Solutions Architect, and a cloud lover.

At OSAM, Anh and his enthusiastic team have helped many companies, from SMBs to Enterprises, move to the cloud with AWS. They offer a wide range of services, including migration, consultation, architecture, and solution design on AWS. Anh’s vision for OSAM is beyond a cloud service provider; the company will take part in building a complete AWS ecosystem in Vietnam, where other companies are encouraged to become AWS partners through training and collaboration activities.

In 2016, Anh founded the AWS Vietnam User Group as a channel to share knowledge and hands-on experience among cloud practitioners. Since then, the community has reached more than 4,800 members and is still expanding. The group holds monthly meetups, connects many SMEs to AWS experts, and provides real-time, free-of-charge consultancy to startups. In August 2017, Anh joined as lead content creator of a program called “Cloud Computing Lectures for Universities” which includes translating AWS documentation & news into Vietnamese, providing students with fundamental, up-to-date knowledge of AWS cloud computing, and supporting students’ career paths.

 

Thorsten Höger

Thorsten Höger is CEO and Cloud consultant at Taimos, where he is advising customers on how to use AWS. Being a developer, he focuses on improving development processes and automating everything to build efficient deployment pipelines for customers of all sizes.

Before being self-employed, Thorsten worked as a developer and CTO of Germany’s first private bank running on AWS. With his colleagues, he migrated the core banking system to the AWS platform in 2013. Since then he organizes the AWS user group in Stuttgart and is a frequent speaker at Meetups, BarCamps, and other community events.

As a supporter of open source software, Thorsten is maintaining or contributing to several projects on Github, like test frameworks for AWS Lambda, Amazon Alexa, or developer tools for CloudFormation. He is also the maintainer of the Jenkins AWS Pipeline plugin.

In his spare time, he enjoys indoor climbing and cooking.

 

Becky Zhang

Yu Zhang (Becky Zhang) is COO of BootDev, which focuses on Big Data solutions on AWS and high concurrency web architecture. Before she helped run BootDev, she was working at Yubis IT Solutions as an operations manager.

Becky plays a key role in the AWS User Group Shanghai (AWSUGSH), regularly organizing AWS UG events including AWS Tech Meetups and happy hours, gathering AWS talent together to communicate the latest technology and AWS services. As a female in technology industry, Becky is keen on promoting Women in Tech and encourages more woman to get involved in the community.

Becky also connects the China AWS User Group with user groups in other regions, including Korea, Japan, and Thailand. She was invited as a panelist at AWS re:Invent 2016 and spoke at the Seoul AWS Summit this April to introduce AWS User Group Shanghai and communicate with other AWS User Groups around the world.

Besides events, Becky also promotes the Shanghai AWS User Group by posting AWS-related tech articles, event forecasts, and event reports to Weibo, Twitter, Meetup.com, and WeChat (which now has over 2000 official account followers).

 

Nilesh Vaghela

Nilesh Vaghela is the founder of ElectroMech Corporation, an AWS Cloud and open source focused company (the company started as an open source motto). Nilesh has been very active in the Linux community since 1998. He started working with AWS Cloud technologies in 2013 and in 2014 he trained a dedicated cloud team and started full support of AWS cloud services as an AWS Standard Consulting Partner. He always works to establish and encourage cloud and open source communities.

He started the AWS Meetup community in Ahmedabad in 2014 and as of now 12 Meetups have been conducted, focusing on various AWS technologies. The Meetup has quickly grown to include over 2000 members. Nilesh also created a Facebook group for AWS enthusiasts in Ahmedabad, with over 1500 members.

Apart from the AWS Meetup, Nilesh has delivered a number of seminars, workshops, and talks around AWS introduction and awareness, at various organizations, as well as at colleges and universities. He has also been active in working with startups, presenting AWS services overviews and discussing how startups can benefit the most from using AWS services.

Nilesh is Red Hat Linux Technologies and AWS Cloud Technologies trainer as well.

 

To learn more about the AWS Community Heroes Program and how to get involved with your local AWS community, click here.

New – Amazon EC2 Instances with Up to 8 NVIDIA Tesla V100 GPUs (P3)

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-amazon-ec2-instances-with-up-to-8-nvidia-tesla-v100-gpus-p3/

Driven by customer demand and made possible by on-going advances in the state-of-the-art, we’ve come a long way since the original m1.small instance that we launched in 2006, with instances that are emphasize compute power, burstable performance, memory size, local storage, and accelerated computing.

The New P3
Today we are making the next generation of GPU-powered EC2 instances available in four AWS regions. Powered by up to eight NVIDIA Tesla V100 GPUs, the P3 instances are designed to handle compute-intensive machine learning, deep learning, computational fluid dynamics, computational finance, seismic analysis, molecular modeling, and genomics workloads.

P3 instances use customized Intel Xeon E5-2686v4 processors running at up to 2.7 GHz. They are available in three sizes (all VPC-only and EBS-only):

Model NVIDIA Tesla V100 GPUs GPU Memory NVIDIA NVLink vCPUs Main Memory Network Bandwidth EBS Bandwidth
p3.2xlarge 1 16 GiB n/a 8 61 GiB Up to 10 Gbps 1.5 Gbps
p3.8xlarge 4 64 GiB 200 GBps 32 244 GiB 10 Gbps 7 Gbps
p3.16xlarge 8 128 GiB 300 GBps 64 488 GiB 25 Gbps 14 Gbps

Each of the NVIDIA GPUs is packed with 5,120 CUDA cores and another 640 Tensor cores and can deliver up to 125 TFLOPS of mixed-precision floating point, 15.7 TFLOPS of single-precision floating point, and 7.8 TFLOPS of double-precision floating point. On the two larger sizes, the GPUs are connected together via NVIDIA NVLink 2.0 running at a total data rate of up to 300 GBps. This allows the GPUs to exchange intermediate results and other data at high speed, without having to move it through the CPU or the PCI-Express fabric.

What’s a Tensor Core?
I had not heard the term Tensor core before starting to write this post. According to this very helpful post on the NVIDIA Blog, Tensor cores are designed to speed up the training and inference of large, deep neural networks. Each core is able to quickly and efficiently multiply a pair of 4×4 half-precision (also known as FP16) matrices together, add the resulting 4×4 matrix to another half or single-precision (FP32) matrix, and store the resulting 4×4 matrix in either half or single-precision form. Here’s a diagram from NVIDIA’s blog post:

This operation is in the innermost loop of the training process for a deep neural network, and is an excellent example of how today’s NVIDIA GPU hardware is purpose-built to address a very specific market need. By the way, the mixed-precision qualifier on the Tensor core performance means that it is flexible enough to work with with a combination of 16-bit and 32-bit floating point values.

Performance in Perspective
I always like to put raw performance numbers into a real-world perspective so that they are easier to relate to and (hopefully) more meaningful. This turned out to be surprisingly difficult, given that the eight NVIDIA Tesla V100 GPUs on a single p3.16xlarge can do 125 trillion single-precision floating point multiplications per second.

Let’s go back to the dawn of the microprocessor era, and consider the Intel 8080A chip that powered the MITS Altair that I bought in the summer of 1977. With a 2 MHz clock, it was able to do about 832 multiplications per second (I used this data and corrected it for the faster clock speed). The p3.16xlarge is roughly 150 billion times faster. However, just 1.2 billion seconds have gone by since that summer. In other words, I can do 100x more calculations today in one second than my Altair could have done in the last 40 years!

What about the innovative 8087 math coprocessor that was an optional accessory for the IBM PC that was announced in the summer of 1981? With a 5 MHz clock and purpose-built hardware, it was able to do about 52,632 multiplications per second. 1.14 billion seconds have elapsed since then, p3.16xlarge is 2.37 billion times faster, so the poor little PC would be barely halfway through a calculation that would run for 1 second today.

Ok, how about a Cray-1? First delivered in 1976, this supercomputer was able to perform vector operations at 160 MFLOPS, making the p3.x16xlarge 781,000 times faster. It could have iterated on some interesting problem 1500 times over the years since it was introduced.

Comparisons between the P3 and today’s scale-out supercomputers are harder to make, given that you can think of the P3 as a step-and-repeat component of a supercomputer that you can launch on as as-needed basis.

Run One Today
In order to take full advantage of the NVIDIA Tesla V100 GPUs and the Tensor cores, you will need to use CUDA 9 and cuDNN7. These drivers and libraries have already been added to the newest versions of the Windows AMIs and will be included in an updated Amazon Linux AMI that is scheduled for release on November 7th. New packages are already available in our repos if you want to to install them on your existing Amazon Linux AMI.

The newest AWS Deep Learning AMIs come preinstalled with the latest releases of Apache MxNet, Caffe2, and Tensorflow (each with support for the NVIDIA Tesla V100 GPUs), and will be updated to support P3 instances with other machine learning frameworks such as Microsoft Cognitive Toolkit and PyTorch as soon as these frameworks release support for the NVIDIA Tesla V100 GPUs. You can also use the NVIDIA Volta Deep Learning AMI for NGC.

P3 instances are available in the US East (Northern Virginia), US West (Oregon), EU (Ireland), and Asia Pacific (Tokyo) Regions in On-Demand, Spot, Reserved Instance, and Dedicated Host form.

Jeff;

 

Getting Ready for AWS re:Invent 2017

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/getting-ready-for-aws-reinvent-2017/

With just 40 days remaining before AWS re:Invent begins, my colleagues and I want to share some tips that will help you to make the most of your time in Las Vegas. As always, our focus is on training and education, mixed in with some after-hours fun and recreation for balance.

Locations, Locations, Locations
The re:Invent Campus will span the length of the Las Vegas strip, with events taking place at the MGM Grand, Aria, Mirage, Venetian, Palazzo, the Sands Expo Hall, the Linq Lot, and the Encore. Each venue will host tracks devoted to specific topics:

MGM Grand – Business Apps, Enterprise, Security, Compliance, Identity, Windows.

Aria – Analytics & Big Data, Alexa, Container, IoT, AI & Machine Learning, and Serverless.

Mirage – Bootcamps, Certifications & Certification Exams.

Venetian / Palazzo / Sands Expo Hall – Architecture, AWS Marketplace & Service Catalog, Compute, Content Delivery, Database, DevOps, Mobile, Networking, and Storage.

Linq Lot – Alexa Hackathons, Gameday, Jam Sessions, re:Play Party, Speaker Meet & Greets.

EncoreBookable meeting space.

If your interests span more than one topic, plan to take advantage of the re:Invent shuttles that will be making the rounds between the venues.

Lots of Content
The re:Invent Session Catalog is now live and you should start to choose the sessions of interest to you now.

With more than 1100 sessions on the agenda, planning is essential! Some of the most popular “deep dive” sessions will be run more than once and others will be streamed to overflow rooms at other venues. We’ve analyzed a lot of data, run some simulations, and are doing our best to provide you with multiple opportunities to build an action-packed schedule.

We’re just about ready to let you reserve seats for your sessions (follow me and/or @awscloud on Twitter for a heads-up). Based on feedback from earlier years, we have fine-tuned our seat reservation model. This year, 75% of the seats for each session will be reserved and the other 25% are for walk-up attendees. We’ll start to admit walk-in attendees 10 minutes before the start of the session.

Las Vegas never sleeps and neither should you! This year we have a host of late-night sessions, workshops, chalk talks, and hands-on labs to keep you busy after dark.

To learn more about our plans for sessions and content, watch the Get Ready for re:Invent 2017 Content Overview video.

Have Fun
After you’ve had enough training and learning for the day, plan to attend the Pub Crawl, the re:Play party, the Tatonka Challenge (two locations this year), our Hands-On LEGO Activities, and the Harley Ride. Stay fit with our 4K Run, Spinning Challenge, Fitness Bootcamps, and Broomball (a longstanding Amazon tradition).

See You in Vegas
As always, I am looking forward to meeting as many AWS users and blog readers as possible. Never hesitate to stop me and to say hello!

Jeff;

 

 

Predict Billboard Top 10 Hits Using RStudio, H2O and Amazon Athena

Post Syndicated from Gopal Wunnava original https://aws.amazon.com/blogs/big-data/predict-billboard-top-10-hits-using-rstudio-h2o-and-amazon-athena/

Success in the popular music industry is typically measured in terms of the number of Top 10 hits artists have to their credit. The music industry is a highly competitive multi-billion dollar business, and record labels incur various costs in exchange for a percentage of the profits from sales and concert tickets.

Predicting the success of an artist’s release in the popular music industry can be difficult. One release may be extremely popular, resulting in widespread play on TV, radio and social media, while another single may turn out quite unpopular, and therefore unprofitable. Record labels need to be selective in their decision making, and predictive analytics can help them with decision making around the type of songs and artists they need to promote.

In this walkthrough, you leverage H2O.ai, Amazon Athena, and RStudio to make predictions on whether a song might make it to the Top 10 Billboard charts. You explore the GLM, GBM, and deep learning modeling techniques using H2O’s rapid, distributed and easy-to-use open source parallel processing engine. RStudio is a popular IDE, licensed either commercially or under AGPLv3, for working with R. This is ideal if you don’t want to connect to a server via SSH and use code editors such as vi to do analytics. RStudio is available in a desktop version, or a server version that allows you to access R via a web browser. RStudio’s Notebooks feature is used to demonstrate the execution of code and output. In addition, this post showcases how you can leverage Athena for query and interactive analysis during the modeling phase. A working knowledge of statistics and machine learning would be helpful to interpret the analysis being performed in this post.

Walkthrough

Your goal is to predict whether a song will make it to the Top 10 Billboard charts. For this purpose, you will be using multiple modeling techniques―namely GLM, GBM and deep learning―and choose the model that is the best fit.

This solution involves the following steps:

  • Install and configure RStudio with Athena
  • Log in to RStudio
  • Install R packages
  • Connect to Athena
  • Create a dataset
  • Create models

Install and configure RStudio with Athena

Use the following AWS CloudFormation stack to install, configure, and connect RStudio on an Amazon EC2 instance with Athena.

Launching this stack creates all required resources and prerequisites:

  • Amazon EC2 instance with Amazon Linux (minimum size of t2.large is recommended)
  • Provisioning of the EC2 instance in an existing VPC and public subnet
  • Installation of Java 8
  • Assignment of an IAM role to the EC2 instance with the required permissions for accessing Athena and Amazon S3
  • Security group allowing access to the RStudio and SSH ports from the internet (I recommend restricting access to these ports)
  • S3 staging bucket required for Athena (referenced within RStudio as ATHENABUCKET)
  • RStudio username and password
  • Setup logs in Amazon CloudWatch Logs (if needed for additional troubleshooting)
  • Amazon EC2 Systems Manager agent, which makes it easy to manage and patch

All AWS resources are created in the US-East-1 Region. To avoid cross-region data transfer fees, launch the CloudFormation stack in the same region. To check the availability of Athena in other regions, see Region Table.

Log in to RStudio

The instance security group has been automatically configured to allow incoming connections on the RStudio port 8787 from any source internet address. You can edit the security group to restrict source IP access. If you have trouble connecting, ensure that port 8787 isn’t blocked by subnet network ACLS or by your outgoing proxy/firewall.

  1. In the CloudFormation stack, choose Outputs, Value, and then open the RStudio URL. You might need to wait for a few minutes until the instance has been launched.
  2. Log in to RStudio with the and password you provided during setup.

Install R packages

Next, install the required R packages from the RStudio console. You can download the R notebook file containing just the code.

#install pacman – a handy package manager for managing installs
if("pacman" %in% rownames(installed.packages()) == FALSE)
{install.packages("pacman")}  
library(pacman)
p_load(h2o,rJava,RJDBC,awsjavasdk)
h2o.init(nthreads = -1)
##  Connection successful!
## 
## R is connected to the H2O cluster: 
##     H2O cluster uptime:         2 hours 42 minutes 
##     H2O cluster version:        3.10.4.6 
##     H2O cluster version age:    4 months and 4 days !!! 
##     H2O cluster name:           H2O_started_from_R_rstudio_hjx881 
##     H2O cluster total nodes:    1 
##     H2O cluster total memory:   3.30 GB 
##     H2O cluster total cores:    4 
##     H2O cluster allowed cores:  4 
##     H2O cluster healthy:        TRUE 
##     H2O Connection ip:          localhost 
##     H2O Connection port:        54321 
##     H2O Connection proxy:       NA 
##     H2O Internal Security:      FALSE 
##     R Version:                  R version 3.3.3 (2017-03-06)
## Warning in h2o.clusterInfo(): 
## Your H2O cluster version is too old (4 months and 4 days)!
## Please download and install the latest version from http://h2o.ai/download/
#install aws sdk if not present (pre-requisite for using Athena with an IAM role)
if (!aws_sdk_present()) {
  install_aws_sdk()
}

load_sdk()
## NULL

Connect to Athena

Next, establish a connection to Athena from RStudio, using an IAM role associated with your EC2 instance. Use ATHENABUCKET to specify the S3 staging directory.

URL <- 'https://s3.amazonaws.com/athena-downloads/drivers/AthenaJDBC41-1.0.1.jar'
fil <- basename(URL)
#download the file into current working directory
if (!file.exists(fil)) download.file(URL, fil)
#verify that the file has been downloaded successfully
list.files()
## [1] "AthenaJDBC41-1.0.1.jar"
drv <- JDBC(driverClass="com.amazonaws.athena.jdbc.AthenaDriver", fil, identifier.quote="'")

con <- jdbcConnection <- dbConnect(drv, 'jdbc:awsathena://athena.us-east-1.amazonaws.com:443/',
                                   s3_staging_dir=Sys.getenv("ATHENABUCKET"),
                                   aws_credentials_provider_class="com.amazonaws.auth.DefaultAWSCredentialsProviderChain")

Verify the connection. The results returned depend on your specific Athena setup.

con
## <JDBCConnection>
dbListTables(con)
##  [1] "gdelt"               "wikistats"           "elb_logs_raw_native"
##  [4] "twitter"             "twitter2"            "usermovieratings"   
##  [7] "eventcodes"          "events"              "billboard"          
## [10] "billboardtop10"      "elb_logs"            "gdelthist"          
## [13] "gdeltmaster"         "twitter"             "twitter3"

Create a dataset

For this analysis, you use a sample dataset combining information from Billboard and Wikipedia with Echo Nest data in the Million Songs Dataset. Upload this dataset into your own S3 bucket. The table below provides a description of the fields used in this dataset.

Field Description
year Year that song was released
songtitle Title of the song
artistname Name of the song artist
songid Unique identifier for the song
artistid Unique identifier for the song artist
timesignature Variable estimating the time signature of the song
timesignature_confidence Confidence in the estimate for the timesignature
loudness Continuous variable indicating the average amplitude of the audio in decibels
tempo Variable indicating the estimated beats per minute of the song
tempo_confidence Confidence in the estimate for tempo
key Variable with twelve levels indicating the estimated key of the song (C, C#, B)
key_confidence Confidence in the estimate for key
energy Variable that represents the overall acoustic energy of the song, using a mix of features such as loudness
pitch Continuous variable that indicates the pitch of the song
timbre_0_min thru timbre_11_min Variables that indicate the minimum values over all segments for each of the twelve values in the timbre vector
timbre_0_max thru timbre_11_max Variables that indicate the maximum values over all segments for each of the twelve values in the timbre vector
top10 Indicator for whether or not the song made it to the Top 10 of the Billboard charts (1 if it was in the top 10, and 0 if not)

Create an Athena table based on the dataset

In the Athena console, select the default database, sampled, or create a new database.

Run the following create table statement.

create external table if not exists billboard
(
year int,
songtitle string,
artistname string,
songID string,
artistID string,
timesignature int,
timesignature_confidence double,
loudness double,
tempo double,
tempo_confidence double,
key int,
key_confidence double,
energy double,
pitch double,
timbre_0_min double,
timbre_0_max double,
timbre_1_min double,
timbre_1_max double,
timbre_2_min double,
timbre_2_max double,
timbre_3_min double,
timbre_3_max double,
timbre_4_min double,
timbre_4_max double,
timbre_5_min double,
timbre_5_max double,
timbre_6_min double,
timbre_6_max double,
timbre_7_min double,
timbre_7_max double,
timbre_8_min double,
timbre_8_max double,
timbre_9_min double,
timbre_9_max double,
timbre_10_min double,
timbre_10_max double,
timbre_11_min double,
timbre_11_max double,
Top10 int
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://aws-bigdata-blog/artifacts/predict-billboard/data'
;

Inspect the table definition for the ‘billboard’ table that you have created. If you chose a database other than sampledb, replace that value with your choice.

dbGetQuery(con, "show create table sampledb.billboard")
##                                      createtab_stmt
## 1       CREATE EXTERNAL TABLE `sampledb.billboard`(
## 2                                       `year` int,
## 3                               `songtitle` string,
## 4                              `artistname` string,
## 5                                  `songid` string,
## 6                                `artistid` string,
## 7                              `timesignature` int,
## 8                `timesignature_confidence` double,
## 9                                `loudness` double,
## 10                                  `tempo` double,
## 11                       `tempo_confidence` double,
## 12                                       `key` int,
## 13                         `key_confidence` double,
## 14                                 `energy` double,
## 15                                  `pitch` double,
## 16                           `timbre_0_min` double,
## 17                           `timbre_0_max` double,
## 18                           `timbre_1_min` double,
## 19                           `timbre_1_max` double,
## 20                           `timbre_2_min` double,
## 21                           `timbre_2_max` double,
## 22                           `timbre_3_min` double,
## 23                           `timbre_3_max` double,
## 24                           `timbre_4_min` double,
## 25                           `timbre_4_max` double,
## 26                           `timbre_5_min` double,
## 27                           `timbre_5_max` double,
## 28                           `timbre_6_min` double,
## 29                           `timbre_6_max` double,
## 30                           `timbre_7_min` double,
## 31                           `timbre_7_max` double,
## 32                           `timbre_8_min` double,
## 33                           `timbre_8_max` double,
## 34                           `timbre_9_min` double,
## 35                           `timbre_9_max` double,
## 36                          `timbre_10_min` double,
## 37                          `timbre_10_max` double,
## 38                          `timbre_11_min` double,
## 39                          `timbre_11_max` double,
## 40                                     `top10` int)
## 41                             ROW FORMAT DELIMITED 
## 42                         FIELDS TERMINATED BY ',' 
## 43                            STORED AS INPUTFORMAT 
## 44       'org.apache.hadoop.mapred.TextInputFormat' 
## 45                                     OUTPUTFORMAT 
## 46  'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
## 47                                        LOCATION
## 48    's3://aws-bigdata-blog/artifacts/predict-billboard/data'
## 49                                  TBLPROPERTIES (
## 50            'transient_lastDdlTime'='1505484133')

Run a sample query

Next, run a sample query to obtain a list of all songs from Janet Jackson that made it to the Billboard Top 10 charts.

dbGetQuery(con, " SELECT songtitle,artistname,top10   FROM sampledb.billboard WHERE lower(artistname) =     'janet jackson' AND top10 = 1")
##                       songtitle    artistname top10
## 1                       Runaway Janet Jackson     1
## 2               Because Of Love Janet Jackson     1
## 3                         Again Janet Jackson     1
## 4                            If Janet Jackson     1
## 5  Love Will Never Do (Without You) Janet Jackson 1
## 6                     Black Cat Janet Jackson     1
## 7               Come Back To Me Janet Jackson     1
## 8                       Alright Janet Jackson     1
## 9                      Escapade Janet Jackson     1
## 10                Rhythm Nation Janet Jackson     1

Determine how many songs in this dataset are specifically from the year 2010.

dbGetQuery(con, " SELECT count(*)   FROM sampledb.billboard WHERE year = 2010")
##   _col0
## 1   373

The sample dataset provides certain song properties of interest that can be analyzed to gauge the impact to the song’s overall popularity. Look at one such property, timesignature, and determine the value that is the most frequent among songs in the database. Timesignature is a measure of the number of beats and the type of note involved.

Running the query directly may result in an error, as shown in the commented lines below. This error is a result of trying to retrieve a large result set over a JDBC connection, which can cause out-of-memory issues at the client level. To address this, reduce the fetch size and run again.

#t<-dbGetQuery(con, " SELECT timesignature FROM sampledb.billboard")
#Note:  Running the preceding query results in the following error: 
#Error in .jcall(rp, "I", "fetch", stride, block): java.sql.SQLException: The requested #fetchSize is more than the allowed value in Athena. Please reduce the fetchSize and try #again. Refer to the Athena documentation for valid fetchSize values.
# Use the dbSendQuery function, reduce the fetch size, and run again
r <- dbSendQuery(con, " SELECT timesignature     FROM sampledb.billboard")
dftimesignature<- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
table(dftimesignature)
## dftimesignature
##    0    1    3    4    5    7 
##   10  143  503 6787  112   19
nrow(dftimesignature)
## [1] 7574

From the results, observe that 6787 songs have a timesignature of 4.

Next, determine the song with the highest tempo.

dbGetQuery(con, " SELECT songtitle,artistname,tempo   FROM sampledb.billboard WHERE tempo = (SELECT max(tempo) FROM sampledb.billboard) ")
##                   songtitle      artistname   tempo
## 1 Wanna Be Startin' Somethin' Michael Jackson 244.307

Create the training dataset

Your model needs to be trained such that it can learn and make accurate predictions. Split the data into training and test datasets, and create the training dataset first.  This dataset contains all observations from the year 2009 and earlier. You may face the same JDBC connection issue pointed out earlier, so this query uses a fetch size.

#BillboardTrain <- dbGetQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
#Running the preceding query results in the following error:-
#Error in .verify.JDBC.result(r, "Unable to retrieve JDBC result set for ", : Unable to retrieve #JDBC result set for SELECT * FROM sampledb.billboard WHERE year <= 2009 (Internal error)
#Follow the same approach as before to address this issue.

r <- dbSendQuery(con, "SELECT * FROM sampledb.billboard WHERE year <= 2009")
BillboardTrain <- fetch(r, n=-1, block=100)
dbClearResult(r)
## [1] TRUE
BillboardTrain[1:2,c(1:3,6:10)]
##   year           songtitle artistname timesignature
## 1 2009 The Awkward Goodbye    Athlete             3
## 2 2009        Rubik's Cube    Athlete             3
##   timesignature_confidence loudness   tempo tempo_confidence
## 1                    0.732   -6.320  89.614   0.652
## 2                    0.906   -9.541 117.742   0.542
nrow(BillboardTrain)
## [1] 7201

Create the test dataset

BillboardTest <- dbGetQuery(con, "SELECT * FROM sampledb.billboard where year = 2010")
BillboardTest[1:2,c(1:3,11:15)]
##   year              songtitle        artistname key
## 1 2010 This Is the House That Doubt Built A Day to Remember  11
## 2 2010        Sticks & Bricks A Day to Remember  10
##   key_confidence    energy pitch timbre_0_min
## 1          0.453 0.9666556 0.024        0.002
## 2          0.469 0.9847095 0.025        0.000
nrow(BillboardTest)
## [1] 373

Convert the training and test datasets into H2O dataframes

train.h2o <- as.h2o(BillboardTrain)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
test.h2o <- as.h2o(BillboardTest)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%

Inspect the column names in your H2O dataframes.

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"

Create models

You need to designate the independent and dependent variables prior to applying your modeling algorithms. Because you’re trying to predict the ‘top10’ field, this would be your dependent variable and everything else would be independent.

Create your first model using GLM. Because GLM works best with numeric data, you create your model by dropping non-numeric variables. You only use the variables in the dataset that describe the numerical attributes of the song in the logistic regression model. You won’t use these variables:  “year”, “songtitle”, “artistname”, “songid”, or “artistid”.

y.dep <- 39
x.indep <- c(6:38)
x.indep
##  [1]  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
## [24] 29 30 31 32 33 34 35 36 37 38

Create Model 1: All numeric variables

Create Model 1 with the training dataset, using GLM as the modeling algorithm and H2O’s built-in h2o.glm function.

modelh1 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 1, using H2O’s built-in performance function.

h2o.performance(model=modelh1,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09924684
## RMSE:  0.3150347
## LogLoss:  0.3220267
## Mean Per-Class Error:  0.2380168
## AUC:  0.8431394
## Gini:  0.6862787
## R^2:  0.254663
## Null Deviance:  326.0801
## Residual Deviance:  240.2319
## AIC:  308.2319
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0   1    Error     Rate
## 0      255  59 0.187898  =59/314
## 1       17  42 0.288136   =17/59
## Totals 272 101 0.203753  =76/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.192772 0.525000 100
## 2                       max f2  0.124912 0.650510 155
## 3                 max f0point5  0.416258 0.612903  23
## 4                 max accuracy  0.416258 0.879357  23
## 5                max precision  0.813396 1.000000   0
## 6                   max recall  0.037579 1.000000 282
## 7              max specificity  0.813396 1.000000   0
## 8             max absolute_mcc  0.416258 0.455251  23
## 9   max min_per_class_accuracy  0.161402 0.738854 125
## 10 max mean_per_class_accuracy  0.124912 0.765006 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or ` 
h2o.auc(h2o.performance(modelh1,test.h2o)) 
## [1] 0.8431394

The AUC metric provides insight into how well the classifier is able to separate the two classes. In this case, the value of 0.8431394 indicates that the classification is good. (A value of 0.5 indicates a worthless test, while a value of 1.0 indicates a perfect test.)

Next, inspect the coefficients of the variables in the dataset.

dfmodelh1 <- as.data.frame(h2o.varimp(modelh1))
dfmodelh1
##                       names coefficients sign
## 1              timbre_0_max  1.290938663  NEG
## 2                  loudness  1.262941934  POS
## 3                     pitch  0.616995941  NEG
## 4              timbre_1_min  0.422323735  POS
## 5              timbre_6_min  0.349016024  NEG
## 6                    energy  0.348092062  NEG
## 7             timbre_11_min  0.307331997  NEG
## 8              timbre_3_max  0.302225619  NEG
## 9             timbre_11_max  0.243632060  POS
## 10             timbre_4_min  0.224233951  POS
## 11             timbre_4_max  0.204134342  POS
## 12             timbre_5_min  0.199149324  NEG
## 13             timbre_0_min  0.195147119  POS
## 14 timesignature_confidence  0.179973904  POS
## 15         tempo_confidence  0.144242598  POS
## 16            timbre_10_max  0.137644568  POS
## 17             timbre_7_min  0.126995955  NEG
## 18            timbre_10_min  0.123851179  POS
## 19             timbre_7_max  0.100031481  NEG
## 20             timbre_2_min  0.096127636  NEG
## 21           key_confidence  0.083115820  POS
## 22             timbre_6_max  0.073712419  POS
## 23            timesignature  0.067241917  POS
## 24             timbre_8_min  0.061301881  POS
## 25             timbre_8_max  0.060041698  POS
## 26                      key  0.056158445  POS
## 27             timbre_3_min  0.050825116  POS
## 28             timbre_9_max  0.033733561  POS
## 29             timbre_2_max  0.030939072  POS
## 30             timbre_9_min  0.020708113  POS
## 31             timbre_1_max  0.014228818  NEG
## 32                    tempo  0.008199861  POS
## 33             timbre_5_max  0.004837870  POS
## 34                                    NA <NA>

Typically, songs with heavier instrumentation tend to be louder (have higher values in the variable “loudness”) and more energetic (have higher values in the variable “energy”). This knowledge is helpful for interpreting the modeling results.

You can make the following observations from the results:

  • The coefficient estimates for the confidence values associated with the time signature, key, and tempo variables are positive. This suggests that higher confidence leads to a higher predicted probability of a Top 10 hit.
  • The coefficient estimate for loudness is positive, meaning that mainstream listeners prefer louder songs with heavier instrumentation.
  • The coefficient estimate for energy is negative, meaning that mainstream listeners prefer songs that are less energetic, which are those songs with light instrumentation.

These coefficients lead to contradictory conclusions for Model 1. This could be due to multicollinearity issues. Inspect the correlation between the variables “loudness” and “energy” in the training set.

cor(train.h2o$loudness,train.h2o$energy)
## [1] 0.7399067

This number indicates that these two variables are highly correlated, and Model 1 does indeed suffer from multicollinearity. Typically, you associate a value of -1.0 to -0.5 or 1.0 to 0.5 to indicate strong correlation, and a value of 0.1 to 0.1 to indicate weak correlation. To avoid this correlation issue, omit one of these two variables and re-create the models.

You build two variations of the original model:

  • Model 2, in which you keep “energy” and omit “loudness”
  • Model 3, in which you keep “loudness” and omit “energy”

You compare these two models and choose the model with a better fit for this use case.

Create Model 2: Keep energy and omit loudness

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:7,9:38)
x.indep
##  [1]  6  7  9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh2 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |=================================================================| 100%

Measure the performance of Model 2.

h2o.performance(model=modelh2,newdata=test.h2o)
## H2OBinomialMetrics: glm
## 
## MSE:  0.09922606
## RMSE:  0.3150017
## LogLoss:  0.3228213
## Mean Per-Class Error:  0.2490554
## AUC:  0.8431933
## Gini:  0.6863867
## R^2:  0.2548191
## Null Deviance:  326.0801
## Residual Deviance:  240.8247
## AIC:  306.8247
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      280 34 0.108280  =34/314
## 1       23 36 0.389831   =23/59
## Totals 303 70 0.152815  =57/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.254391 0.558140  69
## 2                       max f2  0.113031 0.647208 157
## 3                 max f0point5  0.413999 0.596026  22
## 4                 max accuracy  0.446250 0.876676  18
## 5                max precision  0.811739 1.000000   0
## 6                   max recall  0.037682 1.000000 283
## 7              max specificity  0.811739 1.000000   0
## 8             max absolute_mcc  0.254391 0.469060  69
## 9   max min_per_class_accuracy  0.141051 0.716561 131
## 10 max mean_per_class_accuracy  0.113031 0.761821 157
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh2 <- as.data.frame(h2o.varimp(modelh2))
dfmodelh2
##                       names coefficients sign
## 1                     pitch  0.700331511  NEG
## 2              timbre_1_min  0.510270513  POS
## 3              timbre_0_max  0.402059546  NEG
## 4              timbre_6_min  0.333316236  NEG
## 5             timbre_11_min  0.331647383  NEG
## 6              timbre_3_max  0.252425901  NEG
## 7             timbre_11_max  0.227500308  POS
## 8              timbre_4_max  0.210663865  POS
## 9              timbre_0_min  0.208516163  POS
## 10             timbre_5_min  0.202748055  NEG
## 11             timbre_4_min  0.197246582  POS
## 12            timbre_10_max  0.172729619  POS
## 13         tempo_confidence  0.167523934  POS
## 14 timesignature_confidence  0.167398830  POS
## 15             timbre_7_min  0.142450727  NEG
## 16             timbre_8_max  0.093377516  POS
## 17            timbre_10_min  0.090333426  POS
## 18            timesignature  0.085851625  POS
## 19             timbre_7_max  0.083948442  NEG
## 20           key_confidence  0.079657073  POS
## 21             timbre_6_max  0.076426046  POS
## 22             timbre_2_min  0.071957831  NEG
## 23             timbre_9_max  0.071393189  POS
## 24             timbre_8_min  0.070225578  POS
## 25                      key  0.061394702  POS
## 26             timbre_3_min  0.048384697  POS
## 27             timbre_1_max  0.044721121  NEG
## 28                   energy  0.039698433  POS
## 29             timbre_5_max  0.039469064  POS
## 30             timbre_2_max  0.018461133  POS
## 31                    tempo  0.013279926  POS
## 32             timbre_9_min  0.005282143  NEG
## 33                                    NA <NA>

h2o.auc(h2o.performance(modelh2,test.h2o)) 
## [1] 0.8431933

You can make the following observations:

  • The AUC metric is 0.8431933.
  • Inspecting the coefficient of the variable energy, Model 2 suggests that songs with high energy levels tend to be more popular. This is as per expectation.
  • As H2O orders variables by significance, the variable energy is not significant in this model.

You can conclude that Model 2 is not ideal for this use , as energy is not significant.

CreateModel 3: Keep loudness but omit energy

colnames(train.h2o)
##  [1] "year"                     "songtitle"               
##  [3] "artistname"               "songid"                  
##  [5] "artistid"                 "timesignature"           
##  [7] "timesignature_confidence" "loudness"                
##  [9] "tempo"                    "tempo_confidence"        
## [11] "key"                      "key_confidence"          
## [13] "energy"                   "pitch"                   
## [15] "timbre_0_min"             "timbre_0_max"            
## [17] "timbre_1_min"             "timbre_1_max"            
## [19] "timbre_2_min"             "timbre_2_max"            
## [21] "timbre_3_min"             "timbre_3_max"            
## [23] "timbre_4_min"             "timbre_4_max"            
## [25] "timbre_5_min"             "timbre_5_max"            
## [27] "timbre_6_min"             "timbre_6_max"            
## [29] "timbre_7_min"             "timbre_7_max"            
## [31] "timbre_8_min"             "timbre_8_max"            
## [33] "timbre_9_min"             "timbre_9_max"            
## [35] "timbre_10_min"            "timbre_10_max"           
## [37] "timbre_11_min"            "timbre_11_max"           
## [39] "top10"
y.dep <- 39
x.indep <- c(6:12,14:38)
x.indep
##  [1]  6  7  8  9 10 11 12 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29
## [24] 30 31 32 33 34 35 36 37 38
modelh3 <- h2o.glm( y = y.dep, x = x.indep, training_frame = train.h2o, family = "binomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |=================================================================| 100%
perfh3<-h2o.performance(model=modelh3,newdata=test.h2o)
perfh3
## H2OBinomialMetrics: glm
## 
## MSE:  0.0978859
## RMSE:  0.3128672
## LogLoss:  0.3178367
## Mean Per-Class Error:  0.264925
## AUC:  0.8492389
## Gini:  0.6984778
## R^2:  0.2648836
## Null Deviance:  326.0801
## Residual Deviance:  237.1062
## AIC:  303.1062
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      286 28 0.089172  =28/314
## 1       26 33 0.440678   =26/59
## Totals 312 61 0.144772  =54/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                         metric threshold    value idx
## 1                       max f1  0.273799 0.550000  60
## 2                       max f2  0.125503 0.663265 155
## 3                 max f0point5  0.435479 0.628931  24
## 4                 max accuracy  0.435479 0.882038  24
## 5                max precision  0.821606 1.000000   0
## 6                   max recall  0.038328 1.000000 280
## 7              max specificity  0.821606 1.000000   0
## 8             max absolute_mcc  0.435479 0.471426  24
## 9   max min_per_class_accuracy  0.173693 0.745763 120
## 10 max mean_per_class_accuracy  0.125503 0.775073 155
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
dfmodelh3 <- as.data.frame(h2o.varimp(modelh3))
dfmodelh3
##                       names coefficients sign
## 1              timbre_0_max 1.216621e+00  NEG
## 2                  loudness 9.780973e-01  POS
## 3                     pitch 7.249788e-01  NEG
## 4              timbre_1_min 3.891197e-01  POS
## 5              timbre_6_min 3.689193e-01  NEG
## 6             timbre_11_min 3.086673e-01  NEG
## 7              timbre_3_max 3.025593e-01  NEG
## 8             timbre_11_max 2.459081e-01  POS
## 9              timbre_4_min 2.379749e-01  POS
## 10             timbre_4_max 2.157627e-01  POS
## 11             timbre_0_min 1.859531e-01  POS
## 12             timbre_5_min 1.846128e-01  NEG
## 13 timesignature_confidence 1.729658e-01  POS
## 14             timbre_7_min 1.431871e-01  NEG
## 15            timbre_10_max 1.366703e-01  POS
## 16            timbre_10_min 1.215954e-01  POS
## 17         tempo_confidence 1.183698e-01  POS
## 18             timbre_2_min 1.019149e-01  NEG
## 19           key_confidence 9.109701e-02  POS
## 20             timbre_7_max 8.987908e-02  NEG
## 21             timbre_6_max 6.935132e-02  POS
## 22             timbre_8_max 6.878241e-02  POS
## 23            timesignature 6.120105e-02  POS
## 24                      key 5.814805e-02  POS
## 25             timbre_8_min 5.759228e-02  POS
## 26             timbre_1_max 2.930285e-02  NEG
## 27             timbre_9_max 2.843755e-02  POS
## 28             timbre_3_min 2.380245e-02  POS
## 29             timbre_2_max 1.917035e-02  POS
## 30             timbre_5_max 1.715813e-02  POS
## 31                    tempo 1.364418e-02  NEG
## 32             timbre_9_min 8.463143e-05  NEG
## 33                                    NA <NA>
h2o.sensitivity(perfh3,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501855569251422. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.2033898
h2o.auc(perfh3)
## [1] 0.8492389

You can make the following observations:

  • The AUC metric is 0.8492389.
  • From the confusion matrix, the model correctly predicts that 33 songs will be top 10 hits (true positives). However, it has 26 false positives (songs that the model predicted would be Top 10 hits, but ended up not being Top 10 hits).
  • Loudness has a positive coefficient estimate, meaning that this model predicts that songs with heavier instrumentation tend to be more popular. This is the same conclusion from Model 2.
  • Loudness is significant in this model.

Overall, Model 3 predicts a higher number of top 10 hits with an accuracy rate that is acceptable. To choose the best fit for production runs, record labels should consider the following factors:

  • Desired model accuracy at a given threshold
  • Number of correct predictions for top10 hits
  • Tolerable number of false positives or false negatives

Next, make predictions using Model 3 on the test dataset.

predict.regh <- h2o.predict(modelh3, test.h2o)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |=================================================================| 100%
print(predict.regh)
##   predict        p0          p1
## 1       0 0.9654739 0.034526052
## 2       0 0.9654748 0.034525236
## 3       0 0.9635547 0.036445318
## 4       0 0.9343579 0.065642149
## 5       0 0.9978334 0.002166601
## 6       0 0.9779949 0.022005078
## 
## [373 rows x 3 columns]
predict.regh$predict
##   predict
## 1       0
## 2       0
## 3       0
## 4       0
## 5       0
## 6       0
## 
## [373 rows x 1 column]
dpr<-as.data.frame(predict.regh)
#Rename the predicted column 
colnames(dpr)[colnames(dpr) == 'predict'] <- 'predict_top10'
table(dpr$predict_top10)
## 
##   0   1 
## 312  61

The first set of output results specifies the probabilities associated with each predicted observation.  For example, observation 1 is 96.54739% likely to not be a Top 10 hit, and 3.4526052% likely to be a Top 10 hit (predict=1 indicates Top 10 hit and predict=0 indicates not a Top 10 hit).  The second set of results list the actual predictions made.  From the third set of results, this model predicts that 61 songs will be top 10 hits.

Compute the baseline accuracy, by assuming that the baseline predicts the most frequent outcome, which is that most songs are not Top 10 hits.

table(BillboardTest$top10)
## 
##   0   1 
## 314  59

Now observe that the baseline model would get 314 observations correct, and 59 wrong, for an accuracy of 314/(314+59) = 0.8418231.

It seems that Model 3, with an accuracy of 0.8552, provides you with a small improvement over the baseline model. But is this model useful for record labels?

View the two models from an investment perspective:

  • A production company is interested in investing in songs that are more likely to make it to the Top 10. The company’s objective is to minimize the risk of financial losses attributed to investing in songs that end up unpopular.
  • How many songs does Model 3 correctly predict as a Top 10 hit in 2010? Looking at the confusion matrix, you see that it predicts 33 top 10 hits correctly at an optimal threshold, which is more than half the number
  • It will be more useful to the record label if you can provide the production company with a list of songs that are highly likely to end up in the Top 10.
  • The baseline model is not useful, as it simply does not label any song as a hit.

Considering the three models built so far, you can conclude that Model 3 proves to be the best investment choice for the record label.

GBM model

H2O provides you with the ability to explore other learning models, such as GBM and deep learning. Explore building a model using the GBM technique, using the built-in h2o.gbm function.

Before you do this, you need to convert the target variable to a factor for multinomial classification techniques.

train.h2o$top10=as.factor(train.h2o$top10)
gbm.modelh <- h2o.gbm(y=y.dep, x=x.indep, training_frame = train.h2o, ntrees = 500, max_depth = 4, learn_rate = 0.01, seed = 1122,distribution="multinomial")
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   5%
  |                                                                       
  |=====                                                            |   7%
  |                                                                       
  |======                                                           |   9%
  |                                                                       
  |=======                                                          |  10%
  |                                                                       
  |======================                                           |  33%
  |                                                                       
  |=====================================                            |  56%
  |                                                                       
  |====================================================             |  79%
  |                                                                       
  |================================================================ |  98%
  |                                                                       
  |=================================================================| 100%
perf.gbmh<-h2o.performance(gbm.modelh,test.h2o)
perf.gbmh
## H2OBinomialMetrics: gbm
## 
## MSE:  0.09860778
## RMSE:  0.3140188
## LogLoss:  0.3206876
## Mean Per-Class Error:  0.2120263
## AUC:  0.8630573
## Gini:  0.7261146
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      266 48 0.152866  =48/314
## 1       16 43 0.271186   =16/59
## Totals 282 91 0.171582  =64/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.189757 0.573333  90
## 2                     max f2  0.130895 0.693717 145
## 3               max f0point5  0.327346 0.598802  26
## 4               max accuracy  0.442757 0.876676  14
## 5              max precision  0.802184 1.000000   0
## 6                 max recall  0.049990 1.000000 284
## 7            max specificity  0.802184 1.000000   0
## 8           max absolute_mcc  0.169135 0.496486 104
## 9 max min_per_class_accuracy  0.169135 0.796610 104
## 10 max mean_per_class_accuracy  0.169135 0.805948 104
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `
h2o.sensitivity(perf.gbmh,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.501205344484314. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.1355932
h2o.auc(perf.gbmh)
## [1] 0.8630573

This model correctly predicts 43 top 10 hits, which is 10 more than the number predicted by Model 3. Moreover, the AUC metric is higher than the one obtained from Model 3.

As seen above, H2O’s API provides the ability to obtain key statistical measures required to analyze the models easily, using several built-in functions. The record label can experiment with different parameters to arrive at the model that predicts the maximum number of Top 10 hits at the desired level of accuracy and threshold.

H2O also allows you to experiment with deep learning models. Deep learning models have the ability to learn features implicitly, but can be more expensive computationally.

Now, create a deep learning model with the h2o.deeplearning function, using the same training and test datasets created before. The time taken to run this model depends on the type of EC2 instance chosen for this purpose.  For models that require more computation, consider using accelerated computing instances such as the P2 instance type.

system.time(
  dlearning.modelh <- h2o.deeplearning(y = y.dep,
                                      x = x.indep,
                                      training_frame = train.h2o,
                                      epoch = 250,
                                      hidden = c(250,250),
                                      activation = "Rectifier",
                                      seed = 1122,
                                      distribution="multinomial"
  )
)
## 
  |                                                                       
  |                                                                 |   0%
  |                                                                       
  |===                                                              |   4%
  |                                                                       
  |=====                                                            |   8%
  |                                                                       
  |========                                                         |  12%
  |                                                                       
  |==========                                                       |  16%
  |                                                                       
  |=============                                                    |  20%
  |                                                                       
  |================                                                 |  24%
  |                                                                       
  |==================                                               |  28%
  |                                                                       
  |=====================                                            |  32%
  |                                                                       
  |=======================                                          |  36%
  |                                                                       
  |==========================                                       |  40%
  |                                                                       
  |=============================                                    |  44%
  |                                                                       
  |===============================                                  |  48%
  |                                                                       
  |==================================                               |  52%
  |                                                                       
  |====================================                             |  56%
  |                                                                       
  |=======================================                          |  60%
  |                                                                       
  |==========================================                       |  64%
  |                                                                       
  |============================================                     |  68%
  |                                                                       
  |===============================================                  |  72%
  |                                                                       
  |=================================================                |  76%
  |                                                                       
  |====================================================             |  80%
  |                                                                       
  |=======================================================          |  84%
  |                                                                       
  |=========================================================        |  88%
  |                                                                       
  |============================================================     |  92%
  |                                                                       
  |==============================================================   |  96%
  |                                                                       
  |=================================================================| 100%
##    user  system elapsed 
##   1.216   0.020 166.508
perf.dl<-h2o.performance(model=dlearning.modelh,newdata=test.h2o)
perf.dl
## H2OBinomialMetrics: deeplearning
## 
## MSE:  0.1678359
## RMSE:  0.4096778
## LogLoss:  1.86509
## Mean Per-Class Error:  0.3433013
## AUC:  0.7568822
## Gini:  0.5137644
## 
## Confusion Matrix (vertical: actual; across: predicted) for F1-optimal threshold:
##          0  1    Error     Rate
## 0      290 24 0.076433  =24/314
## 1       36 23 0.610169   =36/59
## Totals 326 47 0.160858  =60/373
## 
## Maximum Metrics: Maximum metrics at their respective thresholds
##                       metric threshold    value idx
## 1                     max f1  0.826267 0.433962  46
## 2                     max f2  0.000000 0.588235 239
## 3               max f0point5  0.999929 0.511811  16
## 4               max accuracy  0.999999 0.865952  10
## 5              max precision  1.000000 1.000000   0
## 6                 max recall  0.000000 1.000000 326
## 7            max specificity  1.000000 1.000000   0
## 8           max absolute_mcc  0.999929 0.363219  16
## 9 max min_per_class_accuracy  0.000004 0.662420 145
## 10 max mean_per_class_accuracy  0.000000 0.685334 224
## 
## Gains/Lift Table: Extract with `h2o.gainsLift(<model>, <data>)` or `h2o.gainsLift(<model>, valid=<T/F>, xval=<T/F>)`
h2o.sensitivity(perf.dl,0.5)
## Warning in h2o.find_row_by_threshold(object, t): Could not find exact
## threshold: 0.5 for this set of metrics; using closest threshold found:
## 0.496293348880151. Run `h2o.predict` and apply your desired threshold on a
## probability column.
## [[1]]
## [1] 0.3898305
h2o.auc(perf.dl)
## [1] 0.7568822

The AUC metric for this model is 0.7568822, which is less than what you got from the earlier models. I recommend further experimentation using different hyper parameters, such as the learning rate, epoch or the number of hidden layers.

H2O’s built-in functions provide many key statistical measures that can help measure model performance. Here are some of these key terms.

Metric Description
Sensitivity Measures the proportion of positives that have been correctly identified. It is also called the true positive rate, or recall.
Specificity Measures the proportion of negatives that have been correctly identified. It is also called the true negative rate.
Threshold Cutoff point that maximizes specificity and sensitivity. While the model may not provide the highest prediction at this point, it would not be biased towards positives or negatives.
Precision The fraction of the documents retrieved that are relevant to the information needed, for example, how many of the positively classified are relevant
AUC

Provides insight into how well the classifier is able to separate the two classes. The implicit goal is to deal with situations where the sample distribution is highly skewed, with a tendency to overfit to a single class.

0.90 – 1 = excellent (A)

0.8 – 0.9 = good (B)

0.7 – 0.8 = fair (C)

.6 – 0.7 = poor (D)

0.5 – 0.5 = fail (F)

Here’s a summary of the metrics generated from H2O’s built-in functions for the three models that produced useful results.

Metric Model 3 GBM Model Deep Learning Model

Accuracy

(max)

0.882038

(t=0.435479)

0.876676

(t=0.442757)

0.865952

(t=0.999999)

Precision

(max)

1.0

(t=0.821606)

1.0

(t=0802184)

1.0

(t=1.0)

Recall

(max)

1.0 1.0

1.0

(t=0)

Specificity

(max)

1.0 1.0

1.0

(t=1)

Sensitivity

 

0.2033898 0.1355932

0.3898305

(t=0.5)

AUC 0.8492389 0.8630573 0.756882

Note: ‘t’ denotes threshold.

Your options at this point could be narrowed down to Model 3 and the GBM model, based on the AUC and accuracy metrics observed earlier.  If the slightly lower accuracy of the GBM model is deemed acceptable, the record label can choose to go to production with the GBM model, as it can predict a higher number of Top 10 hits.  The AUC metric for the GBM model is also higher than that of Model 3.

Record labels can experiment with different learning techniques and parameters before arriving at a model that proves to be the best fit for their business. Because deep learning models can be computationally expensive, record labels can choose more powerful EC2 instances on AWS to run their experiments faster.

Conclusion

In this post, I showed how the popular music industry can use analytics to predict the type of songs that make the Top 10 Billboard charts. By running H2O’s scalable machine learning platform on AWS, data scientists can easily experiment with multiple modeling techniques and interactively query the data using Amazon Athena, without having to manage the underlying infrastructure. This helps record labels make critical decisions on the type of artists and songs to promote in a timely fashion, thereby increasing sales and revenue.

If you have questions or suggestions, please comment below.


Additional Reading

Learn how to build and explore a simple geospita simple GEOINT application using SparkR.


About the Authors

gopalGopal Wunnava is a Partner Solution Architect with the AWS GSI Team. He works with partners and customers on big data engagements, and is passionate about building analytical solutions that drive business capabilities and decision making. In his spare time, he loves all things sports and movies related and is fond of old classics like Asterix, Obelix comics and Hitchcock movies.

 

 

Bob Strahan, a Senior Consultant with AWS Professional Services, contributed to this post.

 

 

Introducing Gluon: a new library for machine learning from AWS and Microsoft

Post Syndicated from Ana Visneski original https://aws.amazon.com/blogs/aws/introducing-gluon-a-new-library-for-machine-learning-from-aws-and-microsoft/

Post by Dr. Matt Wood

Today, AWS and Microsoft announced Gluon, a new open source deep learning interface which allows developers to more easily and quickly build machine learning models, without compromising performance.

Gluon Logo

Gluon provides a clear, concise API for defining machine learning models using a collection of pre-built, optimized neural network components. Developers who are new to machine learning will find this interface more familiar to traditional code, since machine learning models can be defined and manipulated just like any other data structure. More seasoned data scientists and researchers will value the ability to build prototypes quickly and utilize dynamic neural network graphs for entirely new model architectures, all without sacrificing training speed.

Gluon is available in Apache MXNet today, a forthcoming Microsoft Cognitive Toolkit release, and in more frameworks over time.

Neural Networks vs Developers
Machine learning with neural networks (including ‘deep learning’) has three main components: data for training; a neural network model, and an algorithm which trains the neural network. You can think of the neural network in a similar way to a directed graph; it has a series of inputs (which represent the data), which connect to a series of outputs (the prediction), through a series of connected layers and weights. During training, the algorithm adjusts the weights in the network based on the error in the network output. This is the process by which the network learns; it is a memory and compute intensive process which can take days.

Deep learning frameworks such as Caffe2, Cognitive Toolkit, TensorFlow, and Apache MXNet are, in part, an answer to the question ‘how can we speed this process up? Just like query optimizers in databases, the more a training engine knows about the network and the algorithm, the more optimizations it can make to the training process (for example, it can infer what needs to be re-computed on the graph based on what else has changed, and skip the unaffected weights to speed things up). These frameworks also provide parallelization to distribute the computation process, and reduce the overall training time.

However, in order to achieve these optimizations, most frameworks require the developer to do some extra work: specifically, by providing a formal definition of the network graph, up-front, and then ‘freezing’ the graph, and just adjusting the weights.

The network definition, which can be large and complex with millions of connections, usually has to be constructed by hand. Not only are deep learning networks unwieldy, but they can be difficult to debug and it’s hard to re-use the code between projects.

The result of this complexity can be difficult for beginners and is a time-consuming task for more experienced researchers. At AWS, we’ve been experimenting with some ideas in MXNet around new, flexible, more approachable ways to define and train neural networks. Microsoft is also a contributor to the open source MXNet project, and were interested in some of these same ideas. Based on this, we got talking, and found we had a similar vision: to use these techniques to reduce the complexity of machine learning, making it accessible to more developers.

Enter Gluon: dynamic graphs, rapid iteration, scalable training
Gluon introduces four key innovations.

  1. Friendly API: Gluon networks can be defined using a simple, clear, concise code – this is easier for developers to learn, and much easier to understand than some of the more arcane and formal ways of defining networks and their associated weighted scoring functions.
  2. Dynamic networks: the network definition in Gluon is dynamic: it can bend and flex just like any other data structure. This is in contrast to the more common, formal, symbolic definition of a network which the deep learning framework has to effectively carve into stone in order to be able to effectively optimizing computation during training. Dynamic networks are easier to manage, and with Gluon, developers can easily ‘hybridize’ between these fast symbolic representations and the more friendly, dynamic ‘imperative’ definitions of the network and algorithms.
  3. The algorithm can define the network: the model and the training algorithm are brought much closer together. Instead of separate definitions, the algorithm can adjust the network dynamically during definition and training. Not only does this mean that developers can use standard programming loops, and conditionals to create these networks, but researchers can now define even more sophisticated algorithms and models which were not possible before. They are all easier to create, change, and debug.
  4. High performance operators for training: which makes it possible to have a friendly, concise API and dynamic graphs, without sacrificing training speed. This is a huge step forward in machine learning. Some frameworks bring a friendly API or dynamic graphs to deep learning, but these previous methods all incur a cost in terms of training speed. As with other areas of software, abstraction can slow down computation since it needs to be negotiated and interpreted at run time. Gluon can efficiently blend together a concise API with the formal definition under the hood, without the developer having to know about the specific details or to accommodate the compiler optimizations manually.

The team here at AWS, and our collaborators at Microsoft, couldn’t be more excited to bring these improvements to developers through Gluon. We’re already seeing quite a bit of excitement from developers and researchers alike.

Getting started with Gluon
Gluon is available today in Apache MXNet, with support coming for the Microsoft Cognitive Toolkit in a future release. We’re also publishing the front-end interface and the low-level API specifications so it can be included in other frameworks in the fullness of time.

You can get started with Gluon today. Fire up the AWS Deep Learning AMI with a single click and jump into one of 50 fully worked, notebook examples. If you’re a contributor to a machine learning framework, check out the interface specs on GitHub.

-Dr. Matt Wood

Backing Up WordPress

Post Syndicated from Roderick Bauer original https://www.backblaze.com/blog/backing-up-wordpress/

WordPress cloud backup
WordPress logo

WordPress is the most popular CMS (Content Management System) for websites, with almost 30% of all websites in the world using WordPress. That’s a lot of sites — over 350 million!

In this post we’ll talk about the different approaches to keeping the data on your WordPress website safe.


Stop the Presses! (Or the Internet!)

As we were getting ready to publish this post, we received news from UpdraftPlus, one of the biggest WordPress plugin developers, that they are supporting Backblaze B2 as a storage solution for their backup plugin. They shipped the update (1.13.9) this week. This is great news for Backblaze customers! UpdraftPlus is also offering a 20% discount to Backblaze customers wishing to purchase or upgrade to UpdraftPlus Premium. The complete information is below.

UpdraftPlus joins backup plugin developer XCloner — Backup and Restore in supporting Backblaze B2. A third developer, BlogVault, also announced their intent to support Backblaze B2. Contact your favorite WordPress backup plugin developer and urge them to support Backblaze B2, as well.

Now, back to our post…


Your WordPress website data is on a web server that’s most likely located in a large data center. You might wonder why it is necessary to have a backup of your website if it’s in a data center. Website data can be lost in a number of ways, including mistakes by the website owner (been there), hacking, or even domain ownership dispute (I’ve seen it happen more than once). A website backup also can provide a history of changes you’ve made to the website, which can be useful. As an overall strategy, it’s best to have a backup of any data that you can’t afford to lose for personal or business reasons.

Your web hosting company might provide backup services as part of your hosting plan. If you are using their service, you should know where and how often your data is being backed up. You don’t want to find out too late that your backup plan was not adequate.

Sites on WordPress.com are automatically backed up by VaultPress (Automattic), which also is available for self-hosted WordPress installations. If you don’t want the work or decisions involved in managing the hosting for your WordPress site, WordPress.com will handle it for you. You do, however, give up some customization abilities, such as the option to add plugins of your own choice.

Very large and active websites might consider WordPress VIP by Automattic, or another premium WordPress hosting service such as Pagely.com.

This post is about backing up self-hosted WordPress sites, so we’ll focus on those options.

WordPress Backup

Backup strategies for WordPress can be divided into broad categories depending on 1) what you back up, 2) when you back up, and 3) where the data is backed up.

With server data, such as with a WordPress installation, you should plan to have three copies of the data (the 3-2-1 backup strategy). The first is the active data on the WordPress web server, the second is a backup stored on the web server or downloaded to your local computer, and the third should be in another location, such as the cloud.

We’ll talk about the different approaches to backing up WordPress, but we recommend using a WordPress plugin to handle your backups. A backup plugin can automate the task, optimize your backup storage space, and alert you of problems with your backups or WordPress itself. We’ll cover plugins in more detail, below.

What to Back Up?

The main components of your WordPress installation are:

You should decide which of these elements you wish to back up. The database is the top priority, as it contains all your website posts and pages (exclusive of media). Your current theme is important, as it likely contains customizations you’ve made. Following those in priority are any other files you’ve customized or made changes to.

You can choose to back up the WordPress core installation and plugins, if you wish, but these files can be downloaded again if necessary from the source, so you might not wish to include them. You likely have all the media files you use on your website on your local computer (which should be backed up), so it is your choice whether to back these up from the server as well.

If you wish to be able to recreate your entire website easily in case of data loss or disaster, you might choose to back up everything, though on a large website this could be a lot of data.

Generally, you should 1) prioritize any file that you’ve customized that you can’t afford to lose, and 2) decide whether you need a copy of everything in order to get your site back up quickly. These choices will determine your backup method and the amount of storage you need.

A good backup plugin for WordPress enables you to specify which files you wish to back up, and even to create separate backups and schedules for different backup contents. That’s another good reason to use a plugin for backing up WordPress.

When to Back Up?

You can back up manually at any time by using the Export tool in WordPress. This is handy if you wish to do a quick backup of your site or parts of it. Since it is manual, however, it is not a part of a dependable backup plan that should be done regularly. If you wish to use this tool, go to Tools, Export, and select what you wish to back up. The output will be an XML file that uses the WordPress Extended RSS format, also known as WXR. You can create a WXR file that contains all of the information on your site or just portions of the site, such as posts or pages by selecting: All content, Posts, Pages, or Media.
Note: You can use WordPress’s Export tool for sites hosted on WordPress.com, as well.

Export instruction for WordPress

Many of the backup plugins we’ll be discussing later also let you do a manual backup on demand in addition to regularly scheduled or continuous backups.

Note:  Another use of the WordPress Export tool and the WXR file is to transfer or clone your website to another server. Once you have exported the WXR file from the website you wish to transfer from, you can import the WXR file from the Tools, Import menu on the new WordPress destination site. Be aware that there are file size limits depending on the settings on your web server. See the WordPress Codex entry for more information. To make this job easier, you may wish to use one of a number of WordPress plugins designed specifically for this task.

You also can manually back up the WordPress MySQL database using a number of tools or a plugin. The WordPress Codex has good information on this. All WordPress plugins will handle this for you and do it automatically. They also typically include tools for optimizing the database tables, which is just good housekeeping.

A dependable backup strategy doesn’t rely on manual backups, which means you should consider using one of the many backup plugins available either free or for purchase. We’ll talk more about them below.

Which Format To Back Up In?

In addition to the WordPress WXR format, plugins and server tools will use various file formats and compression algorithms to store and compress your backup. You may get to choose between zip, tar, tar.gz, tar.gz2, and others. See The Most Common Archive File Formats for more information on these formats.

Select a format that you know you can access and unarchive should you need access to your backup. All of these formats are standard and supported across operating systems, though you might need to download a utility to access the file.

Where To Back Up?

Once you have your data in a suitable format for backup, where do you back it up to?

We want to have multiple copies of our active website data, so we’ll choose more than one destination for our backup data. The backup plugins we’ll discuss below enable you to specify one or more possible destinations for your backup. The possible destinations for your backup include:

A backup folder on your web server
A backup folder on your web server is an OK solution if you also have a copy elsewhere. Depending on your hosting plan, the size of your site, and what you include in the backup, you may or may not have sufficient disk space on the web server. Some backup plugins allow you to configure the plugin to keep only a certain number of recent backups and delete older ones, saving you disk space on the server.
Email to you
Because email servers have size limitations, the email option is not the best one to use unless you use it to specifically back up just the database or your main theme files.
FTP, SFTP, SCP, WebDAV
FTP, SFTP, SCP, and WebDAV are all widely-supported protocols for transferring files over the internet and can be used if you have access credentials to another server or supported storage device that is suitable for storing a backup.
Sync service (Dropbox, SugarSync, Google Drive, OneDrive)
A sync service is another possible server storage location though it can be a pricier choice depending on the plan you have and how much you wish to store.
Cloud storage (Backblaze B2, Amazon S3, Google Cloud, Microsoft Azure, Rackspace)
A cloud storage service can be an inexpensive and flexible option with pay-as-you go pricing for storing backups and other data.

A good website backup strategy would be to have multiple backups of your website data: one in a backup folder on your web hosting server, one downloaded to your local computer, and one in the cloud, such as with Backblaze B2.

If I had to choose just one of these, I would choose backing up to the cloud because it is geographically separated from both your local computer and your web host, it uses fault-tolerant and redundant data storage technologies to protect your data, and it is available from anywhere if you need to restore your site.

Backup Plugins for WordPress

Probably the easiest and most common way to implement a solid backup strategy for WordPress is to use one of the many backup plugins available for WordPress. Fortunately, there are a number of good ones and are available free or in “freemium” plans in which you can use the free version and pay for more features and capabilities only if you need them. The premium options can give you more flexibility in configuring backups or have additional options for where you can store the backups.

How to Choose a WordPress Backup Plugin

screenshot of WordPress plugins search

When considering which plugin to use, you should take into account a number of factors in making your choice.

Is the plugin actively maintained and up-to-date? You can determine this from the listing in the WordPress Plugin Repository. You also can look at reviews and support comments to get an idea of user satisfaction and how well issues are resolved.

Does the plugin work with your web hosting provider? Generally, well-supported plugins do, but you might want to check to make sure there are no issues with your hosting provider.

Does it support the cloud service or protocol you wish to use? This can be determined from looking at the listing in the WordPress Plugin Repository or on the developer’s website. Developers often will add support for cloud services or other backup destinations based on user demand, so let the developer know if there is a feature or backup destination you’d like them to add to their plugin.

Other features and options to consider in choosing a backup plugin are:

  • Whether encryption of your backup data is available
  • What are the options for automatically deleting backups from the storage destination?
  • Can you globally exclude files, folders, and specific types of files from the backup?
  • Do the options for scheduling automatic backups meet your needs for frequency?
  • Can you exclude/include specific database tables (a good way to save space in your backup)?

WordPress Backup Plugins Review

Let’s review a few of the top choices for WordPress backup plugins.

UpdraftPlus

UpdraftPlus is one of the most popular backup plugins for WordPress with over one million active installations. It is available in both free and Premium versions.

UpdraftPlus just released support for Backblaze B2 Cloud Storage in their 1.13.9 update on September 25. According to the developer, support for Backblaze B2 was the most frequent request for a new storage option for their plugin. B2 support is available in their Premium plugin and as a stand-alone update to their standard product.

Note: The developers of UpdraftPlus are offering a special 20% discount to Backblaze customers on the purchase of UpdraftPlus Premium by using the coupon code backblaze20. The discount is valid until the end of Friday, October 6th, 2017.

screenshot of Backblaze B2 cloud backup for WordPress in UpdraftPlus

XCloner — Backup and Restore

XCloner — Backup and Restore is a useful open-source plugin with many options for backing up WordPress.

XCloner supports B2 Cloud Storage in their free plugin.

screenshot of XCloner WordPress Backblaze B2 backup settings

BlogVault

BlogVault describes themselves as a “complete WordPress backup solution.” They offer a free trial of their paid WordPress backup subscription service that features real-time backups of changes to your WordPress site, as well as many other features.

BlogVault has announced their intent to support Backblaze B2 Cloud Storage in a future update.

screenshot of BlogValut WordPress Backup settings

BackWPup

BackWPup is a popular and free option for backing up WordPress. It supports a number of options for storing your backup, including the cloud, FTP, email, or on your local computer.

screenshot of BackWPup WordPress backup settings

WPBackItUp

WPBackItUp has been around since 2012 and is highly rated. It has both free and paid versions.

screenshot of WPBackItUp WordPress backup settings

VaultPress

VaultPress is part of Automattic’s well-known WordPress product, JetPack. You will need a JetPack subscription plan to use VaultPress. There are different pricing plans with different sets of features.

screenshot of VaultPress backup settings

Backup by Supsystic

Backup by Supsystic supports a number of options for backup destinations, encryption, and scheduling.

screenshot of Backup by Supsystic backup settings

BackupWordPress

BackUpWordPress is an open-source project on Github that has a popular and active following and many positive reviews.

screenshot of BackupWordPress WordPress backup settings

BackupBuddy

BackupBuddy, from iThemes, is the old-timer of backup plugins, having been around since 2010. iThemes knows a lot about WordPress, as they develop plugins, themes, utilities, and provide training in WordPress.

BackupBuddy’s backup includes all WordPress files, all files in the WordPress Media library, WordPress themes, and plugins. BackupBuddy generates a downloadable zip file of the entire WordPress website. Remote storage destinations also are supported.

screenshot of BackupBuddy settings

WordPress and the Cloud

Do you use WordPress and back up to the cloud? We’d like to hear about it. We’d also like to hear whether you are interested in using B2 Cloud Storage for storing media files served by WordPress. If you are, we’ll write about it in a future post.

In the meantime, keep your eye out for new plugins supporting Backblaze B2, or better yet, urge them to support B2 if they’re not already.

The Best Backup Strategy is the One You Use

There are other approaches and tools for backing up WordPress that you might use. If you have an approach that works for you, we’d love to hear about it in the comments.

The post Backing Up WordPress appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Dialekt-o-maten vending machine

Post Syndicated from Janina Ander original https://www.raspberrypi.org/blog/dialekt-o-maten-vending-machine/

At some point, many of you will have become exasperated with your AI personal assistant for not understanding you due to your accent – or worse, your fantastic regional dialect! A vending machine from Coca-Cola Sweden turns this issue inside out: the Dialekt-o-maten rewards users with a free soft drink for speaking in a Swedish regional dialect.

The world’s first vending machine where you pay with a dialect!

Thirsty fans along with journalists were invited to try Dialekt-o-maten at Stureplan in central Stockholm. Depending on how well they could pronounce the different phrases in assorted Swedish dialects – they were rewarded an ice cold Coke with that destination on the label.

The Dialekt-o-maten

The machine, which uses a Raspberry Pi, was set up in Stureplan Square in Stockholm. A person presses one of six buttons to choose the regional dialect they want to try out. They then hit ‘record’, and speak into the microphone. The recording is compared to a library of dialect samples, and, if it matches closely enough, voila! — the Dialekt-o-maten dispenses a soft drink for free.

Dialekt-o-maten on the highstreet in Stockholm

Code for the Dialekt-o-maten

The team of developers used the dejavu Python library, as well as custom-written code which responded to new recordings. Carl-Anders Svedberg, one of the developers, said:

Testing the voices and fine-tuning the right level of difficulty for the users was quite tricky. And we really should have had more voice samples. Filtering out noise from the surroundings, like cars and music, was also a small hurdle.

While they wrote the initial software on macOS, the team transferred it to a Raspberry Pi so they could install the hardware inside the Dialekt-o-maten.

Regional dialects

Even though Sweden has only ten million inhabitants, there are more than 100 Swedish dialects. In some areas of Sweden, the local language even still resembles Old Norse. The Dialekt-o-maten recorded how well people spoke the six dialects it used. Apparently, the hardest one to imitate is spoken in Vadstena, and the easiest is spoken in Smögen.

Dialekt-o-maten on Stockholm highstreet

Speech recognition with the Pi

Because of its audio input capabilities, the Raspberry Pi is very useful for building devices that use speech recognition software. One of our favourite projects in this vein is of course Allen Pan’s Real-Life Wizard Duel. We also think this pronunciation training machine by Japanese makers HomeMadeGarbage is really neat. Ideas from these projects and the Dialekt-o-maten could potentially be combined to make a fully fledged language-learning tool!

How about you? Have you used a Raspberry Pi to help you become multilingual? If so, do share your project with us in the comments or via social media.

The post Dialekt-o-maten vending machine appeared first on Raspberry Pi.

Google Signs Agreement to Tackle YouTube Piracy

Post Syndicated from Andy original https://torrentfreak.com/google-signs-unprecedented-agreement-to-tackle-youtube-piracy-170921/

Once upon a time, people complaining about piracy would point to the hundreds of piracy sites around the Internet. These days, criticism is just as likely to be leveled at Google-owned services.

YouTube, in particular, has come in for intense criticism, with the music industry complaining of exploitation of the DMCA in order to obtain unfair streaming rates from record labels. Along with streaming-ripping, this so-called Value Gap is one of the industry’s hottest topics.

With rightsholders seemingly at war with Google to varying degrees, news from France suggests that progress can be made if people sit down and negotiate.

According to local reports, Google and local anti-piracy outfit ALPA (l’Association de Lutte Contre la Piraterie Audiovisuelle) under the auspices of the CNC have signed an agreement to grant rightsholders direct access to content takedown mechanisms on YouTube.

YouTube has granted access to its Content ID systems to companies elsewhere for years but the new deal will see the system utilized by French content owners for the first time. It’s hoped that the access will result in infringing content being taken down or monetized more quickly than before.

“We do not want fraudsters to use our platforms to the detriment of creators,” said Carlo D’Asaro Biondo, Google’s President of Strategic Relationships in Europe, the Middle East and Africa.

The agreement, overseen by the Ministry of Culture, will see Google provide ALPA with financial support and rightsholders with essential training.

ALPA president Nicolas Seydoux welcomed the deal, noting that it symbolizes the “collapse of the wall of incomprehension” that previously existed between France’s rightsholders and the Internet search giant.

The deal forms part of the French government’s “Plan of Action Against Piracy”, in which it hopes to crack down on infringement in various ways, including tackling the threat of pirate sites, better promotion of services offering legitimate content, and educating children “from an early age” on the need to respect copyright.

“The fight against piracy is the great challenge of the new century in the cultural sphere,” said France’s Minister of Culture, Françoise Nyssen.

“I hope this is just the beginning of a process. It will require other agreements with rights holders and other platforms, as well as at the European level.”

According to NextInpact, the Google agreement will eventually encompass the downgrading of infringing content in search results as part of the Trusted Copyright Removal Program. A similar system is already in place in the UK.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and ANONYMOUS VPN services.

New Techniques in Fake Reviews

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2017/09/new_techniques_.html

Research paper: “Automated Crowdturfing Attacks and Defenses in Online Review Systems.”

Abstract: Malicious crowdsourcing forums are gaining traction as sources of spreading misinformation online, but are limited by the costs of hiring and managing human workers. In this paper, we identify a new class of attacks that leverage deep learning language models (Recurrent Neural Networks or RNNs) to automate the generation of fake online reviews for products and services. Not only are these attacks cheap and therefore more scalable, but they can control rate of content output to eliminate the signature burstiness that makes crowdsourced campaigns easy to detect.

Using Yelp reviews as an example platform, we show how a two phased review generation and customization attack can produce reviews that are indistinguishable by state-of-the-art statistical detectors. We conduct a survey-based user study to show these reviews not only evade human detection, but also score high on “usefulness” metrics by users. Finally, we develop novel automated defenses against these attacks, by leveraging the lossy transformation introduced by the RNN training and generation cycle. We consider countermeasures against our mechanisms, show that they produce unattractive cost-benefit tradeoffs for attackers, and that they can be further curtailed by simple constraints imposed by online service providers.

Create a text-based adventure game with FutureLearn

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/text-based-futurelearn/

Learning with Raspberry Pi has never been so easy! We’re adding a new course to FutureLearn today, and you can take part anywhere in the world.

FutureLearn: the story so far…

In February 2017, we were delighted to launch two free online CPD training courses on the FutureLearn platform, available anywhere in the world. Since the launch, more than 30,000 educators have joined these courses: Teaching Programming in Primary Schools, and Teaching Physical Computing with Raspberry Pi and Python.

Futurelearn Raspberry Pi

Thousands of educators have been building their skills – completing tasks such as writing a program in Python to make an LED blink, or building a voting app in Scratch. The two courses are scaffolded to build skills, week by week. Learners are supported by videos, screencasts, and articles, and they have the chance to apply what they have learned in as many different practical projects as possible.

We have had some excellent feedback from learners on the courses, such as Kyle Wilke who commented: “Fantastic course. Nice integration of text-based and video instruction. Was very impressed how much support was provided by fellow students, kudos to us. Can’t wait to share this with fellow educators.”

Brand new course

We are launching a new course this autumn. You can join lead educator Laura Sachs to learn object-oriented programming principles by creating your own text-based adventure game in Python. The course is aimed at educators who have programming experience, but have never programmed in the object-oriented style.

Future Learn: Object-oriented Programming in Python trailer

Our newest FutureLearn course in now live. You can join lead educator Laura Sachs to learn object-oriented programming principles by creating your own text-based adventure game in Python. The course is aimed at educators who have programming experience, but have never programmed in the object-oriented style.

The course will introduce you to the principles of object-oriented programming in Python, showing you how to create objects, functions, methods, and classes. You’ll use what you learn to create your own text-based adventure game. You will have the chance to share your code with other learners, and to see theirs. If you’re an educator, you’ll also be able to develop ideas for using object-oriented programming in your classroom.

Take part

Sign up now to join us on the course, starting today, September 4. Our courses are free to join online – so you can learn wherever you are, and whenever you want.

The post Create a text-based adventure game with FutureLearn appeared first on Raspberry Pi.

VMware Cloud on AWS – Now Available

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/vmware-cloud-on-aws-now-available/

Last year I told you about the work that we are doing with our friends at VMware to build the VMware Cloud on AWS. As I shared at the time, this is a native, fully-managed offering that runs the VMware SDDC stack directly on bare-metal AWS infrastructure that maintains the elasticity and security customers have come to expect. This allows you to benefit from the scalability and resiliency of AWS, along with the networking and system-level hardware features that are fundamental parts of our security-first architecture.

VMware Cloud on AWS allows you take advantage of what you already know and own. Your existing skills, your investment in training, your operational practices, and your investment in software licenses remain relevant and applicable when you move to the public cloud. As part of that move you can forget about building & running data centers, modernizing hardware, and scaling to meet transient or short-term demand. You can also take advantage of a long list of AWS compute, database, analytics, IoT, AI, security, mobile, deployment and application services.

Initial Availability
After incorporating feedback from many customers and partners in our Early Access beta program, today at VMworld, VMware and Amazon announced the initial availability of VMware Cloud on AWS. This service is initially available in the US West (Oregon) region through VMware and members of the VMware Partner Network. It is designed to support popular use cases such as data center extension, application development & testing, and application migration.

This offering is sold, delivered, supported, and billed by VMware. It supports custom-sized VMs, runs any OS that is supported by VMware, and makes use of single-tenant bare-metal AWS infrastructure so that you can bring your Windows Server licenses to the cloud. Each SDDC (Software-Defined Data Center) consists of 4 to 16 instances, each with 36 cores, 512 GB of memory, and 15.2 TB of NVMe storage. Clusters currently run in a single AWS Availability Zone (AZ) with support in the works for clusters that span AZs. You can spin up an entire VMware SDDC in a couple of hours, and scale host capacity up and down in minutes.

The NSX networking platform (powered by the AWS Elastic Networking Adapter running at up to 25 Gbps) supports multicast traffic, separate networks for management and compute, and IPSec VPN tunnels to on-premises firewalls, routers, and so forth.

Here’s an overview to show you how all of the parts fit together:

The VMware and third-party management tools (vCenter Server, PowerCLI, the vRealize Suite, and code that calls the vSphere API) that you use today will work just fine when you build a hybrid VMware environment that combines your existing on-premises resources and those that you launch in AWS. This hybrid environment will use a new VMware Hybrid Linked Mode to create a single, unified view of your on-premises and cloud resources. You can use familiar VMware tools to manage your applications, without having to purchase any new or custom hardware, rewrite applications, or modify your operating model.

Your applications and your code can access the full range of AWS services (the database, analytical, and AI services are a good place to start). Use for these services is billed separately and you’ll need to create an AWS account.

Learn More at VMworld
If you are attending VMworld in Las Vegas, please be sure to check out some of the 90+ AWS sessions:

Also, be sure to stop by booth #300 and say hello to my colleagues from the AWS team.

In the Works
Our teams have come a long way since last year, but things are just getting revved up!

VMware and AWS are continuing to invest to enable support for new capabilities and use cases, such as application migration, data center expansion, and application test and development. Work is under way to add additional AWS regions, support more use cases such as disaster recovery and data center consolidation, add certifications, and enable even deeper integration with AWS services.

Jeff;