Tag Archives: data analysis

New AWS Certification Specialty Exam for Big Data

Post Syndicated from Sara Snedeker original https://aws.amazon.com/blogs/big-data/new-aws-certification-specialty-exam-for-big-data/

AWS Certifications validate technical knowledge with an industry-recognized credential. Today, the AWS Certification team released the AWS Certified Big Data – Specialty exam. This new exam validates technical skills and experience in designing and implementing AWS services to derive value from data. The exam requires a current Associate AWS Certification and is intended for individuals who perform complex big data analyses.

Individuals who are interested in sitting for this exam should know how to do the following:

  • Implement core AWS big data services according to basic architectural best practices
  • Design and maintain big data
  • Leverage tools to automate data analysis

To prepare for the exam, we recommend the Big Data on AWS course, plus AWS whitepapers and documentation that are focused on big data.

This credential can help you stand out from the crowd, get recognized, and provide more evidence of your unique technical skills.

The AWS Certification team also released an AWS Certified Advanced Networking – Specialty exam and new AWS Certification Benefits. You can read more about these new releases on the AWS Blog.

Have more questions about AWS Certification? See our AWS Certification FAQ.

Amazon QuickSight Now Supports Federated Single Sign-On Using SAML 2.0

Post Syndicated from Jose Kunnackal original https://aws.amazon.com/blogs/big-data/amazon-quicksight-now-supports-federated-single-sign-on-using-saml-2-0/

Since launch, Amazon QuickSight has enabled business users to quickly and easily analyze data from a wide variety of data sources, with superfast visualization capabilities enabled by SPICE (Superfast, Parallel, In-memory Calculation Engine). When setting up Amazon QuickSight access for business users, administrators have a choice of authentication mechanisms. These include Amazon QuickSight–specific credentials, AWS credentials, or, in the case of Amazon QuickSight Enterprise Edition, existing Microsoft Active Directory credentials. Although each of these mechanisms provides a reliable, secure authentication process, they all require end users to enter their credentials every time they log in to Amazon QuickSight. In addition, the current invitation model for user onboarding requires administrators to add users to Amazon QuickSight accounts either via email invitations or via AD-group membership, which can contribute to delays in user provisioning.

Today, we are happy to announce two new features that make user authentication and provisioning simpler: Federated Single Sign-On (SSO) and just-in-time (JIT) user creation.

Federated Single Sign-On

Federated SSO authentication to web applications (including the AWS Management Console) and Software-as-a-Service products has become increasingly popular because it lets organizations consolidate end-user authentication across external applications.

Traditionally, SSO involves the use of a centralized identity store (such as Active Directory or LDAP) to authenticate the user against applications within a corporate network. The growing popularity of SaaS and web applications created the need to authenticate users outside corporate networks. Federated SSO makes this scenario possible. It provides a mechanism for external applications to direct authentication requests to the centralized identity store and to receive an authentication response that includes a token and its validity period. SAML is the most common protocol used as a basis for Federated SSO capabilities today.

With Federated SSO in place, business users sign in to their Identity Provider portals with existing credentials and access QuickSight with a single click, without having to enter any QuickSight-specific passwords or account names. This makes it simple for users to access Amazon QuickSight for data analysis needs.

Federated SSO also enables administrators to impose additional security requirements for Amazon QuickSight access (through the identity provider portal) depending on details such as where the user is accessing from or what device is used for access. This access control lets administrators comply with corporate policies regarding data access and also enforce additional security for sensitive data handling in Amazon QuickSight.

Setting up federated authentication in Amazon QuickSight is straightforward. You follow the same sequence of steps you would to set up federated access for the AWS Management Console, and then set up redirection to ensure that users land directly on Amazon QuickSight.

Let’s take a look at how this works. The following diagram illustrates the authentication flow between Amazon QuickSight and a third-party identity provider with Federated SSO in place with SAML 2.0.

  1. The Amazon QuickSight user browses to the organization’s identity provider portal, and authenticates using existing credentials.
  2. The federation service requests user authentication from the organization’s identity store, based on credentials provided.
  3. The identity store authenticates the user, and returns the authentication response to the federation service.
  4. The federation service posts the SAML assertion to the user’s browser.
  5. The user’s browser posts the SAML assertion to the AWS Sign-In SAML endpoint. AWS Sign-In processes the SAML request, authenticates the user, and forwards the authentication token to Amazon QuickSight.
  6. Amazon QuickSight uses the authentication token from AWS Sign-In, and authorizes user access.

Federated SSO using SAML 2.0 is now available for Amazon QuickSight Standard Edition, with support for Enterprise Edition coming shortly. You can enable federated access by using any identity provider compliant with SAML 2.0. These identity providers include Microsoft Active Directory Federation Services, Okta, Ping Identity, and Shibboleth. To set up your Amazon QuickSight account for Federated SSO, follow the guidance here.
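If you script the IAM side of this setup, a minimal sketch using the AWS SDK for JavaScript might look like the following. This is an illustration under assumptions rather than the official setup procedure: the provider name, role name, and metadata file name are placeholders, and the metadata XML comes from your identity provider.

// Sketch: register a SAML identity provider and a role it can assume.
// Names, file paths, and the trust policy below are illustrative placeholders.
var AWS = require('aws-sdk');
var fs = require('fs');
var iam = new AWS.IAM();

var metadata = fs.readFileSync('idp-metadata.xml', 'utf8'); // exported from your IdP

iam.createSAMLProvider({
    Name: 'MyCorpIdP',
    SAMLMetadataDocument: metadata
}, function (err, data) {
    if (err) return console.log('Error: ' + err);
    console.log('SAML provider ARN: ' + data.SAMLProviderArn);

    // Trust policy that lets users federated through the provider assume the role
    var trustPolicy = {
        Version: '2012-10-17',
        Statement: [{
            Effect: 'Allow',
            Principal: { Federated: data.SAMLProviderArn },
            Action: 'sts:AssumeRoleWithSAML',
            Condition: { StringEquals: { 'SAML:aud': 'https://signin.aws.amazon.com/saml' } }
        }]
    };

    iam.createRole({
        RoleName: 'QuickSightFederatedUsers',
        AssumeRolePolicyDocument: JSON.stringify(trustPolicy)
    }, function (roleErr, role) {
        if (roleErr) return console.log('Error: ' + roleErr);
        console.log('Role ARN: ' + role.Role.Arn);
    });
});

After the role exists, you attach a permissions policy to it and configure your identity provider to send the role ARN and provider ARN in the SAML assertion, following the setup guidance linked above.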

Just-in-time user creation

With this release, we are also launching a new permissions-based user provisioning model in Amazon QuickSight. Administrators can use existing AWS permissions management mechanisms to enable Amazon QuickSight permissions for their users. Once the required permissions are in place, users can onboard themselves to QuickSight without any additional administrator intervention. This approach simplifies user provisioning and enables onboarding of thousands of users by simply granting the right permissions.

Administrators can assign either of the permissions below, which lets the user sign up for QuickSight as a user or as an administrator, respectively.

quicksight:CreateUser
quicksight:CreateAdmin

If you have an AWS account that is already signed up for QuickSight, and you would like to add yourself as a new user, add one of the permissions above and access https://quicksight.aws.amazon.com.

You will see a screen that requests your email address. Once you provide this, you will be added to the QuickSight account as a user or administrator, as specified by your permissions!
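As an illustration of the permissions side of this step (a sketch only; the user name and policy name are placeholders), an administrator could attach an inline policy with the AWS SDK for JavaScript:

// Sketch: grant an IAM user permission to self-provision in QuickSight.
// Swap quicksight:CreateUser for quicksight:CreateAdmin to allow sign-up
// as an administrator instead of a user.
var AWS = require('aws-sdk');
var iam = new AWS.IAM();

var policy = {
    Version: '2012-10-17',
    Statement: [{
        Effect: 'Allow',
        Action: ['quicksight:CreateUser'],
        Resource: '*'
    }]
};

iam.putUserPolicy({
    UserName: 'jane.doe',                      // placeholder
    PolicyName: 'QuickSightSelfProvisioning',  // placeholder
    PolicyDocument: JSON.stringify(policy)
}, function (err) {
    if (err) console.log('Error: ' + err);
    else console.log('Policy attached');
});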

Switch to a Federated SSO user: If you are already an Amazon QuickSight Standard Edition user using authentication based on user name and password, and you want to switch to using Federated SSO, follow these steps:

  1. Sign in to the AWS Management Console using the Federated SSO option, as you do today. Ensure that you have the permissions for QuickSight user/admin creation assigned to you.
  2. Access https://quicksight.aws.amazon.com.
  3. Provide your email address, and sign up for Amazon QuickSight as an Amazon QuickSight user or admin.
  4. Delete the existing Amazon QuickSight user that you no longer want to use.
  5. Assign resources and data to the new role-based user from step 1. (Amazon QuickSight will prompt you to do this when you delete a user. For more information, see Deleting a User Account.)
  6. Continue as the new, role-based user.

Learn more

To learn more about these capabilities and start using them with your identity provider, see [Managing-SSO-user-guide-topic] in the Amazon QuickSight User Guide.

Stay engaged

If you have questions and suggestions, you can post them on the Amazon QuickSight Discussion Forum.

Not an Amazon QuickSight user?

See the Amazon QuickSight page to get started for free.

 

 

Amazon Elasticsearch Service support for Elasticsearch 5.1

Post Syndicated from Tara Walker original https://aws.amazon.com/blogs/aws/amazon-elasticsearch-service-support-for-es-5-1/

The Amazon Elasticsearch Service is a fully managed service that makes it easier to deploy, operate, and scale the Elasticsearch open-source search and analytics engine. We are excited to announce that Amazon Elasticsearch Service now supports Elasticsearch 5.1 and Kibana 5.1.

Elasticsearch 5 comes with a ton of new features and enhancements that customers can now take advantage of in Amazon Elasticsearch Service. Highlights of the Elasticsearch 5 release are as follows:

  • Indexing performance: Improved indexing throughput with an updated lock implementation and asynchronous translog fsyncing
  • Ingestion Pipelines: Incoming data can be sent to a pipeline that applies a series of ingestion processors, allowing transformation to the exact data you want to have in your search index. There are twenty processors included, from simple appending to complex regex applications
  • Painless scripting: Amazon Elasticsearch Service supports Painless, a new secure and performant scripting language for Elasticsearch 5. You can use scripting to change the precedence of search results, delete index fields by query, modify search results to return specific fields, and more.
  • New data structures: Lucene 6 data structures, new data types such as half_float, text, and keyword, and more complete support for dots in field names
  • Search and Aggregations: Refactored search API, BM25 relevance calculations, Instant Aggregations, improvements to histogram aggregations & terms aggregations, and rewritten percolator & completion suggester
  • User experience: Strict settings and body & query string parameter validation, index management improvement, default deprecation logging, new shard allocation API, and new indices efficiency pattern for rollover & shrink APIs
  • Java REST client: a simple HTTP/REST Java client that works with Java 7 and handles retry on node failure, as well as round-robin, sniffing, and logging of requests
  • Other improvements: Lazy unicast hosts DNS lookup, automatic parallel tasking of reindex, update-by-query, delete-by-query, and search cancellation by task management API

The compelling new enhancements of Elasticsearch 5 are meant to make the service faster and easier to use while providing better security. Amazon Elasticsearch Service is a managed service designed to aid customers in building, developing and deploying solutions with Elasticsearch by providing the following capabilities:

  • Multiple configurations of instance types
  • Amazon EBS volumes for data storage
  • Cluster stability improvement with dedicated master nodes
  • Zone awareness – Cluster node allocation across two Availability Zones in the region
  • Access Control & Security with AWS Identity and Access Management (IAM)
  • Various geographical locations/regions for resources
  • Amazon Elasticsearch domain snapshots for replication, backup and restore
  • Integration with Amazon CloudWatch for monitoring Amazon Elasticsearch domain metrics
  • Integration with AWS CloudTrail for configuration auditing
  • Integration with other AWS services like Kinesis Firehose and DynamoDB for loading real-time streaming data into Amazon Elasticsearch Service

Amazon Elasticsearch Service allows dynamic changes with zero downtime. You can add instances, remove instances, change instance sizes, change storage configuration, and make other changes dynamically.
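For example, here is a minimal sketch (using the AWS SDK for JavaScript) of scaling out an existing domain; the domain name, instance type, and count are placeholder values:

// Sketch: change an existing domain's instance count without downtime.
var AWS = require('aws-sdk');
var es = new AWS.ES({ region: 'us-east-1' });

es.updateElasticsearchDomainConfig({
    DomainName: 'my-domain',
    ElasticsearchClusterConfig: {
        InstanceType: 'm3.medium.elasticsearch',
        InstanceCount: 4                        // scale out from 2 to 4 data nodes
    }
}, function (err, data) {
    if (err) console.log('Error: ' + err);
    else console.log('New cluster config: ' +
        JSON.stringify(data.DomainConfig.ElasticsearchClusterConfig));
});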

The best way to highlight some of the aforementioned capabilities is with an example.

During a presentation at the IT/Dev conference, I demonstrated how to build a serverless employee onboarding system using Express.js, AWS Lambda, Amazon DynamoDB, and Amazon S3. In the demo, the system collected personnel data about an employee going through a fictional onboarding process and stored it in DynamoDB. Imagine if the collected employee data could be searched, queried, and analyzed as needed by the company’s HR department. We can easily augment the onboarding system to add these capabilities by enabling the employee table to use DynamoDB Streams to trigger Lambda and store the desired employee attributes in Amazon Elasticsearch Service.

The result is the following solution architecture:

We will focus solely on how to dynamically store and index employee data in Amazon Elasticsearch Service each time an employee record is entered and subsequently stored in the database.
To add this enhancement to the existing aforementioned onboarding solution, we will implement the solution as noted by the detailed cloud architecture diagram below:

Let’s look at how to implement the employee load process to the Amazon Elasticsearch Service, which is the first process flow shown in the diagram above.

Amazon Elasticsearch Service: Domain Creation

Let’s now visit the AWS Console to check out Amazon Elasticsearch Service with Elasticsearch 5 in action. As you probably guessed, from the AWS Console home, we select Elasticsearch Service under the Analytics group.

The first step in creating an Elasticsearch solution is to create a domain. You will notice that when creating an Amazon Elasticsearch Service domain, you now have the option to choose the Elasticsearch 5.1 version. Since we are discussing the launch of support for Elasticsearch 5, we will, of course, choose the 5.1 Elasticsearch engine version when creating our domain in the Amazon Elasticsearch Service.


After clicking Next, we will set up our Elasticsearch domain by configuring our instance and storage settings. The instance type and the number of instances for your cluster should be determined based upon your application’s availability, network volume, and data needs. A recommended best practice is to choose two or more instances in order to avoid possible data inconsistencies or split-brain failure conditions with Elasticsearch. Therefore, I will choose two instances/data nodes for my cluster and set up EBS as my storage device.

To understand how many instances you will need for your specific application, please review the blog post, Get Started with Amazon Elasticsearch Service: How Many Data Instances Do I Need, on the AWS Database blog.

All that is left for me is to set up the access policy and deploy the service. Once I create my service, the domain will be initialized and deployed.
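If you prefer to script the same setup instead of using the console, a sketch with the AWS SDK for JavaScript might look like the following. The domain name, instance type, volume size, account ID, and access policy here are placeholders for illustration, not the values used in this demo:

// Sketch: create an Elasticsearch 5.1 domain with two data nodes and EBS storage.
var AWS = require('aws-sdk');
var es = new AWS.ES({ region: 'us-east-1' });

var accessPolicy = {
    Version: '2012-10-17',
    Statement: [{
        Effect: 'Allow',
        Principal: { AWS: 'arn:aws:iam::123456789012:role/ESOnboardingSystem' }, // placeholder
        Action: 'es:*',
        Resource: 'arn:aws:es:us-east-1:123456789012:domain/onboarding-es/*'     // placeholder
    }]
};

es.createElasticsearchDomain({
    DomainName: 'onboarding-es',
    ElasticsearchVersion: '5.1',
    ElasticsearchClusterConfig: {
        InstanceType: 'm3.medium.elasticsearch',
        InstanceCount: 2
    },
    EBSOptions: { EBSEnabled: true, VolumeType: 'gp2', VolumeSize: 10 },
    AccessPolicies: JSON.stringify(accessPolicy)
}, function (err, data) {
    if (err) console.log('Error: ' + err);
    else console.log('Domain created: ' + data.DomainStatus.DomainName +
        ' (endpoint available once initialization completes)');
});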

Now that I have my Elasticsearch service running, I need a mechanism to populate it with data. I will implement a dynamic data load process of the employee data to Amazon Elasticsearch Service using DynamoDB Streams.

Amazon DynamoDB: Table and Streams

Before I head to the DynamoDB console, I will quickly cover the basics.

Amazon DynamoDB is a scalable, distributed NoSQL database service. DynamoDB Streams provide an ordered, time-based sequence of item-level modifications (creates, updates, and deletes) in a DynamoDB table. Each stream record has information about the primary attribute modification for an individual item in the table. Streams operate asynchronously and can write stream records in practically real time. Additionally, a stream can be enabled when a table is created, or it can be enabled and modified on an existing table. You can learn more about DynamoDB Streams in the DynamoDB developer guide.

Now we will head to the DynamoDB console and view the OnboardingEmployeeData table.

This table has a primary partition key, UserID, and a primary sort key, Username, both of string data type. We will use the UserID as the document ID in Elasticsearch. You will also notice that on this table, streams are enabled and the stream view type is New image. A stream set to the New image view type produces stream records that contain the entire item as it appears after it has been updated. You also have the option to have the stream present the item as it appeared before modification, only the item’s key attributes, or both the old and new item images. If you opt to use the AWS CLI to create your DynamoDB table, the key information to capture is the Latest Stream ARN shown underneath the Stream Details section. A DynamoDB stream has a unique ARN identifier that is separate from the ARN of the DynamoDB table itself. The stream ARN will be needed to create the IAM policy for access permissions between the stream and the Lambda function.
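If the table already exists without a stream, a small sketch like the following (AWS SDK for JavaScript; the region is a placeholder) enables the NEW_IMAGE stream and prints the Latest Stream ARN referenced above:

// Sketch: enable a NEW_IMAGE stream on the existing table and capture the
// stream ARN needed for the IAM policy and the Lambda trigger.
var AWS = require('aws-sdk');
var dynamodb = new AWS.DynamoDB({ region: 'us-east-1' });

dynamodb.updateTable({
    TableName: 'OnboardingEmployeeData',
    StreamSpecification: {
        StreamEnabled: true,
        StreamViewType: 'NEW_IMAGE'   // full item image after each modification
    }
}, function (err, data) {
    if (err) console.log('Error: ' + err);
    else console.log('Latest stream ARN: ' + data.TableDescription.LatestStreamArn);
});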

IAM Policy

The first thing that is essential for any service implementation is getting the correct permissions in place. Therefore, I will first go to the IAM console to create a role and a policy for my Lambda function that will provide permissions for DynamoDB and Elasticsearch.

First, I will create a policy based upon an existing managed policy for Lambda execution with DynamoDB Streams.

This will take us to the Review Policy screen, which will have the selected managed policy details. I’ll name this policy Onboarding-LambdaDynamoDB-toElasticsearch and then customize it for my solution. The first thing you should notice is that the current policy allows access to all streams; however, the best practice is to have this policy access only the specific DynamoDB stream, by adding the Latest Stream ARN. Hence, I will alter the policy, add the ARN for the DynamoDB table, OnboardingEmployeeData, and validate the policy. The altered policy is shown below.

The only thing left is to add the Amazon Elasticsearch Service permissions in the policy. The core policy for Amazon Elasticsearch Service access permissions is as shown below:

 

I will use this policy and add the specific Elasticsearch domain ARN as the Resource for the policy. This ensures that the policy enforces the least-privilege security best practice. With the Amazon Elasticsearch Service domain added as shown, I can validate and save the policy.

The best way to create a custom policy is to use the IAM Policy Simulator or to view examples of the AWS service permissions in the service documentation. You can also find examples of policies for a subset of AWS services here. Remember, you should only add the ES permissions that are needed, following the least-privilege security best practice; the policy shown above is only an example.
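Because the exact policy appears only in the screenshots above, here is an illustrative sketch of what a least-privilege policy along these lines could look like, created with the AWS SDK for JavaScript. The account IDs, ARNs, and the specific actions chosen are assumptions for illustration, not the policy from this walkthrough:

// Sketch: an illustrative least-privilege policy for the Lambda role, scoped
// to the specific stream and the specific Elasticsearch domain.
var AWS = require('aws-sdk');
var iam = new AWS.IAM();

var policyDocument = {
    Version: '2012-10-17',
    Statement: [
        {   // read from the specific DynamoDB stream only
            Effect: 'Allow',
            Action: [
                'dynamodb:DescribeStream',
                'dynamodb:GetRecords',
                'dynamodb:GetShardIterator',
                'dynamodb:ListStreams'
            ],
            Resource: 'arn:aws:dynamodb:us-east-1:123456789012:table/OnboardingEmployeeData/stream/*'
        },
        {   // write documents to the specific Elasticsearch domain only
            Effect: 'Allow',
            Action: ['es:ESHttpPost'],
            Resource: 'arn:aws:es:us-east-1:123456789012:domain/onboarding-es/*'
        },
        {   // standard Lambda logging permissions
            Effect: 'Allow',
            Action: ['logs:CreateLogGroup', 'logs:CreateLogStream', 'logs:PutLogEvents'],
            Resource: 'arn:aws:logs:*:*:*'
        }
    ]
};

iam.createPolicy({
    PolicyName: 'Onboarding-LambdaDynamoDB-toElasticsearch',
    PolicyDocument: JSON.stringify(policyDocument)
}, function (err, data) {
    if (err) console.log('Error: ' + err);
    else console.log('Policy ARN: ' + data.Policy.Arn);
});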

We will create the role for our Lambda function to use to grant access and attach the aforementioned policy to the role.

AWS Lambda: DynamoDB triggered Lambda function

AWS Lambda is the core of the Amazon Web Services serverless computing offering. With Lambda, you can write and run code using supported languages for almost any type of application or backend service. Lambda triggers your code in response to events from AWS services or from HTTP requests. Lambda scales dynamically based upon workload, and you pay only for your code’s execution time.

We will have DynamoDB Streams trigger a Lambda function that will create an index and send data to Elasticsearch. Another option for this is to use the Logstash plugin for DynamoDB. However, since several of the Logstash processors are now included in the Elasticsearch 5.1 core, and given the improved performance optimizations, I will opt to use Lambda to process my DynamoDB stream and load data to Amazon Elasticsearch Service.
Now let us head over to the AWS Lambda console and create the Lambda function for loading employee data to Amazon Elasticsearch Service.

Once in the console, I will create a new Lambda function by selecting the Blank Function blueprint, which will take me to the Configure Trigger page. Once on the trigger page, I will select DynamoDB as the AWS service that will trigger Lambda, and I provide the following trigger-related options:

  • Table: OnboardingEmployeeData
  • Batch size: 100 (default)
  • Starting position: Trim Horizon

I click the Next button, and I am on the Configure Function screen. The name of my function will be ESEmployeeLoad, and I will write this function in Node.js 4.3.

The Lambda function code is as follows:

var AWS = require('aws-sdk');
var path = require('path');

//Object for all the ElasticSearch Domain Info
var esDomain = {
    region: process.env.RegionForES,
    endpoint: process.env.EndpointForES,
    index: process.env.IndexForES,
    doctype: 'onboardingrecords'
};
//AWS Endpoint from created ES Domain Endpoint
var endpoint = new AWS.Endpoint(esDomain.endpoint);
//The AWS credentials are picked up from the environment.
var creds = new AWS.EnvironmentCredentials('AWS');

console.log('Loading function');
exports.handler = (event, context, callback) => {
    //console.log('Received event:', JSON.stringify(event, null, 2));
    console.log(JSON.stringify(esDomain));
    
    event.Records.forEach((record) => {
        console.log(record.eventID);
        console.log(record.eventName);
        console.log('DynamoDB Record: %j', record.dynamodb);
       
        var dbRecord = JSON.stringify(record.dynamodb);
        postToES(dbRecord, context, callback);
    });
};

function postToES(doc, context, lambdaCallback) {
    var req = new AWS.HttpRequest(endpoint);

    req.method = 'POST';
    req.path = path.join('/', esDomain.index, esDomain.doctype);
    req.region = esDomain.region;
    req.headers['presigned-expires'] = false;
    req.headers['Host'] = endpoint.host;
    req.body = doc;

    var signer = new AWS.Signers.V4(req , 'es');  // es: service code
    signer.addAuthorization(creds, new Date());

    var send = new AWS.NodeHttpClient();
    send.handleRequest(req, null, function(httpResp) {
        var respBody = '';
        httpResp.on('data', function (chunk) {
            respBody += chunk;
        });
        httpResp.on('end', function (chunk) {
            console.log('Response: ' + respBody);
            lambdaCallback(null,'Lambda added document ' + doc);
        });
    }, function(err) {
        console.log('Error: ' + err);
        lambdaCallback('Lambda failed with error ' + err);
    });
}

The Lambda function environment variables are RegionForES, EndpointForES, and IndexForES, which populate the esDomain object in the code above.

I will select an Existing role option and choose the ESOnboardingSystem IAM role I created earlier.

Upon completing the IAM role permissions for the Lambda function, I can review the Lambda function details and complete the creation of the ESEmployeeLoad function.

I have completed the process of building my Lambda function to talk to Elasticsearch, and now I test my function by simulating data changes to my database.

Now my function, ESEmployeeLoad, will execute upon changes to the data in my database from my onboarding system. Additionally, I can verify the Lambda function’s processing to Elasticsearch by reviewing the CloudWatch logs.

Now I can alter my Lambda function to take advantage of the new features, or go directly to Elasticsearch and use the new ingest node capability. An example of this would be to implement an ingest pipeline for my employee record documents, as sketched below.
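As a rough sketch of that idea (reusing the esDomain, endpoint, and creds objects from the Lambda function above; the pipeline name and the processor are made up for illustration), a pipeline could be created with the same signed-request pattern as postToES:

// Sketch: create a simple ingest pipeline on the domain. The processor here
// just stamps a loaded_by field on every document; real pipelines can chain
// any of the built-in processors.
function createPipeline(callback) {
    var pipeline = {
        description: 'Tag onboarding records at index time',
        processors: [
            { set: { field: 'loaded_by', value: 'ESEmployeeLoad' } }
        ]
    };

    var req = new AWS.HttpRequest(endpoint);
    req.method = 'PUT';
    req.path = '/_ingest/pipeline/onboarding-pipeline';
    req.region = esDomain.region;
    req.headers['Host'] = endpoint.host;
    req.headers['Content-Type'] = 'application/json';
    req.body = JSON.stringify(pipeline);

    var signer = new AWS.Signers.V4(req, 'es');
    signer.addAuthorization(creds, new Date());

    new AWS.NodeHttpClient().handleRequest(req, null, function (httpResp) {
        var respBody = '';
        httpResp.on('data', function (chunk) { respBody += chunk; });
        httpResp.on('end', function () { callback(null, respBody); });
    }, function (err) { callback(err); });
}

// Documents indexed with '?pipeline=onboarding-pipeline' appended to the
// request path in postToES would then pass through this pipeline.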

I can replicate this function for handling badge updates to the employee record, and/or leverage other preprocessors against the employee data. For instance, if I wanted to search the data based upon a particular parameter in the Elasticsearch document, I could use the Search API and get records from the dataset.

The possibilities are endless, and you can get as creative as your data needs dictate while maintaining great performance.

Amazon Elasticsearch Service: Kibana 5.1

All Amazon Elasticsearch Service domains using Elasticsearch 5.1 are bundled with Kibana 5.1, the latest version of the open-source visualization tool.

The companion visualization and analytics platform, Kibana, has also been enhanced in the Kibana 5.1 release. Kibana is used to view, search, and interact with Elasticsearch data through a myriad of charts, tables, and maps. In addition, Kibana performs advanced data analysis on large volumes of data. Key enhancements of the Kibana release are as follows:

  • Visualization tool redesign: Updated color scheme and maximization of screen real estate
  • Timelion: visualization tool with a time-based query DSL
  • Console: formerly known as Sense, now part of the core, using the same configuration for free-form requests to Elasticsearch
  • Scripted field language: ability to use the new Painless scripting language in the Elasticsearch cluster
  • Tag cloud visualization: 5.1 adds a word-based graphical view of data, sized by importance
  • More charts: return of previously removed charts and addition of an advanced view for X-Pack
  • Profiler UI: provides an enhancement to the profile API with a tree view
  • Rendering performance improvements: Discover performance fixes and decreased CPU load

Summary

As you can see, this release is expansive, with many enhancements to assist customers in building Elasticsearch solutions. Amazon Elasticsearch Service now supports 15 new Elasticsearch APIs and 6 new plugins for Elasticsearch 5.1.

You can read more about the supported operations for Elasticsearch in the Amazon Elasticsearch Service Developer Guide, and you can get started by visiting the Amazon Elasticsearch Service website or by signing in to the AWS Management Console.

Tara

 

The state of Jupyter (O’Reilly)

Post Syndicated from corbet original http://lwn.net/Articles/712677/rss

Here’s an O’Reilly article describing the Jupyter project and what it has accomplished. “Project Jupyter aims to create an ecosystem of open source tools for interactive computation and data analysis, where the direct participation of humans in the computational loop—executing code to understand a problem and iteratively refine their approach—is the primary consideration.”

New – GPU-Powered Amazon Graphics WorkSpaces

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-gpu-powered-amazon-graphics-workspaces/

As you can probably tell from my I Love My Amazon WorkSpace post, I am kind of a fan-boy!

Since writing that post I have found out that I am not alone, and that there are many other WorkSpaces fan-boys and fan-girls out there. Many AWS customers are enjoying their fully managed, secure desktop computing environments almost as much as I am. From their perspective as users, they like to be able to access their WorkSpace from a multitude of supported devices including Windows and Mac computers, PCoIP Zero Clients, Chromebooks, iPads, Fire tablets, and Android tablets. As administrators, they appreciate the ability to deploy high-quality cloud desktops for any number of users. And, finally, as business leaders they like the ability to pay hourly or monthly for the WorkSpaces that they launch.

New Graphics Bundle
These fans already have access to several different hardware choices: the Value, Standard, and Performance bundles. With 1 or 2 vCPUs (virtual CPUs) and 2 to 7.5 GiB of memory, these bundles are a good fit for many office productivity use cases.

Today we are expanding the WorkSpaces family by adding a new GPU-powered Graphics bundle. This bundle offers a high-end virtual desktop that is a great fit for 3D application developers, 3D modelers, and engineers who use CAD, CAM, or CAE tools at the office. Here are the specs:

  • Display – NVIDIA GPU with 1,536 CUDA cores and 4 GiB of graphics memory.
  • Processing – 8 vCPUs.
  • Memory – 15 GiB.
  • System volume – 100 GB.
  • User volume – 100 GB.

This new bundle is available in all regions where WorkSpaces currently operates, and can be used with any of the devices that I mentioned above. You can run the license-included operating system (Windows Server 2008 with Windows 7 Desktop Experience), or you can bring your own licenses for Windows 7 or 10. Applications that make use of OpenGL 4.x, DirectX, CUDA, OpenCL, and the NVIDIA GRID SDK will be able to take advantage of the GPU.

As you start to think about your petabyte-scale data analysis and visualization, keep in mind that these instances are located just light-feet away from EC2, RDS, Amazon Redshift, S3, and Kinesis. You can do your compute-intensive analysis server-side, and then render it in a visually compelling way on an adjacent WorkSpace. I am highly confident that you can use this combination of AWS services to create compelling applications that would simply not be cost-effective or achievable in any other way.

There is one important difference between the Graphics Bundle and the other bundles. Due to the way that the underlying hardware operates, WorkSpaces that run this bundle do not save the local state (running applications and open documents) when used in conjunction with the AutoStop running mode that I described in my Amazon WorkSpaces Update – Hourly Usage and Expanded Root Volume post. We recommend saving open documents and closing applications before disconnecting from your WorkSpace or stepping away from it for an extended period of time.

Demo
I don’t build 3D applications or use CAD, CAM, or CAE tools. However, I do like to design and build cool things with LEGO® bricks! I fired up the latest version of LEGO Digital Designer (LDD) and spent some time enhancing a design. Although I was not equipped to do any benchmarks, the GPU-enhanced version definitely ran more quickly and produced a higher quality finished product. Here’s a little design study I’ve been working on:

With my design all set up it was time to start building. Instead of trying to re-position my monitor so that it would be visible from my building table, I simply logged in to my Graphics WorkSpace from my Fire tablet. I was able to scale and rotate my design very quickly, even though I had very modest local computing power. Here’s what I saw on my Fire:

As you can see, the two screens (desktop and Fire) look identical! I stepped over to my building table and was able to set things up so that I could see my design and find my bricks:

Pricing
Graphics WorkSpaces are available with an hourly billing option. You pay a small, fixed monthly fee to cover infrastructure costs and storage, and an hourly rate for each hour that the WorkSpace is used during the month. Prices start at $22/month + $1.75 per hour in the US East (Northern Virginia) Region; see the WorkSpaces Pricing page for more information.
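As a rough, back-of-the-envelope illustration using those numbers (my own arithmetic, not an official quote): a Graphics WorkSpace used for 80 hours during a month in US East (Northern Virginia) would come to about $22 + (80 × $1.75) = $162. Check the pricing page for current rates in your region.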

Jeff

 

Readmission Prediction Through Patient Risk Stratification Using Amazon Machine Learning

Post Syndicated from Ujjwal Ratan original https://blogs.aws.amazon.com/bigdata/post/Tx1Z7AR9QTXIWA1/Readmission-Prediction-Through-Patient-Risk-Stratification-Using-Amazon-Machine

Ujjwal Ratan is a Solutions Architect with Amazon Web Services

The Hospital Readmission Reduction Program (HRRP) was included as part of the Affordable Care Act to improve quality of care and lower healthcare spending. A patient visit to a hospital may be counted as a readmission if the patient in question is admitted to a hospital within 30 days after being discharged from an earlier hospital stay. This should be easy to measure, right? Wrong.

Unfortunately, it gets more complicated than this. Not all readmissions can be prevented, as some of them are part of an overall care plan for the patient. There are also factors beyond the hospital’s control that may cause a readmission. The Centers for Medicare & Medicaid Services (CMS) recognized the complexities of measuring readmission rates and came up with a set of measures to evaluate providers.

There is still a long way to go for hospitals to be effective in preventing unplanned readmissions. Recognizing the factors affecting readmissions is an important first step, but it is also important to draw out patterns in readmission data by aggregating information from multiple clinical and non-clinical hospital systems.

Moreover, most analysis algorithms rely on financial data, which omits the clinical nuances applicable to a readmission pattern. The data sets also contain a lot of redundant information, such as patient demographics and historical data. All this creates a massive data analysis challenge that may take months to solve using conventional means.

In this post, I show how to apply advanced analytics concepts like pattern analysis and machine learning to do risk stratification for patient cohorts.

The role of Amazon ML

There have been multiple global scientific studies on scalable models for predicting readmissions with high accuracy. Some of them, like comparison of models for predicting early hospital readmissions and predicting hospital readmissions in the Medicare population, are great examples.

Readmission records demonstrate patterns in the data that can be used in a prediction algorithm. These patterns can be separated out as outliers that are used to identify patient cohorts with high risk. Attribute correlation helps to identify the significant features that affect readmission risk in a patient. This risk stratification of patients is enabled by categorizing patient attributes into numerical, categorical, and text attributes and applying statistical methods like standard deviation, median analysis, and the chi-squared test. These data sets are used to build statistical models to identify patients demonstrating certain characteristics consistent with readmissions, so that necessary steps can be taken to prevent readmission.

Amazon Machine Learning (Amazon ML) provides visual tools and wizards that guide users in creating complex ML models in minutes. You can also interact with it using the AWS CLI and API to integrate the power of ML with other applications. Based on the chosen target attribute in Amazon ML, you can build ML models like a binary classification model that predicts between states of 0 or 1 or a numeric regression model that predicts numerical values based on certain correlated attributes.

Creating an ML model for readmission prediction

The following diagram represents a reference architecture for building a scalable ML platform on AWS.

  1. The first step is to get the data into Amazon S3, the object storage service from AWS.
  2. Amazon Redshift acts as the database for the huge amounts of structured clinical data. The data is loaded into Amazon Redshift tables and is massaged to make it more meaningful as a data source for an ML model.
  3. A binary classification ML model is created using Amazon ML, with Amazon Redshift as the data source. A real-time endpoint is also created to allow real-time querying for the ML model.
  4. Amazon Cognito is used for secure federated access to the Amazon ML real-time endpoint.
  5. A static website is created on S3. This website hosts the end-user-facing application, through which one can query the Amazon ML endpoint in real time.

The architecture above is just one of the ways in which you can use AWS for building machine learning applications. You can vary this architecture and add services such as Amazon EMR (Elastic MapReduce) if your use case involves large volumes of unstructured data sets, or build a business intelligence (BI) reporting interface for analysis of predicted metrics. AWS provides a range of services that act as building blocks for the use case you want to build.

 

Prerequisite: Start with a data set

The first step in creating an accurate model is to choose the right data set to build and train the model. For the purposes of this post, I am using a publicly available diabetes data set from the University of California, Irvine (UCI). The data set consists of 101,766 rows and represents 10 years of clinical care records from 130 US hospitals and integrated delivery networks. It includes over 50 features (attributes) representing patient and hospital outcomes. The data set can be downloaded from the UCI website. The hosted zip file consists of two CSV files. The first file, diabetic_data.csv, is the actual data set, and the second file, IDs_mapping.csv, is the master data for admission_type_id, discharge_disposition_id, and admission_source_id.

Amazon ML automatically splits source data sets into two parts. The first part is used to train the ML model and the second part is used to evaluate the ML model’s accuracy. In this case, seventy percent of the source data is used to train the ML model and thirty percent is used to evaluate it. This is represented in the data rearrangement attribute as shown below:

ML model training data set:

{
  "splitting": {
    "percentBegin": 0,
    "percentEnd": 70,
    "strategy": "random",
    "complement": false,
    "strategyParams": {
      "randomSeed": ""
    }
  }
}

ML model evaluation data set:

{
  "splitting": {
    "percentBegin": 70,
    "percentEnd": 100,
    "strategy": "random",
    "complement": false,
    "strategyParams": {
      "randomSeed": ""
    }
  }
}

ML models become more accurate when more data is used to train them. The data set I’m using in this post is very limited for building a comprehensive ML model, but this methodology can be replicated with larger data sets.

 

Prepare the data and move it into Amazon S3

For an ML model to be effective, you should prepare the data so that it provides the right patterns to the model. The data set should have good coverage for relevant features, be low in unwanted “noise” or variance, and be as complete as possible with correct labels.

Use the Amazon Redshift database to prepare the data set. To begin, copy the data into an S3 bucket named diabetesdata. The bucket consists of four CSV files.

You can LIST the bucket contents by running the following command in the AWS CLI:

aws s3 ls s3://diabetesdata

Following this, create the necessary tables in Amazon Redshift to process the data in the CSV files by creating three master tables and one transaction table.

The transaction table consists of lookup IDs which act as foreign keys (FK) from the above master tables. It also has a primary key “encounter_id” and multiple columns that act as features for the ML model. The createredshifttables.sql script is executed to create the above tables.         

After the necessary tables are created, start loading them with data. You can make use of the Amazon Redshift COPY command to copy the data from the files on S3 into the respective Amazon Redshift tables. The following script template details the format of the copy command used:

COPY diabetes_data from 's3://<S3 file path>' credentials 'aws_access_key_id=<AWS Access Key ID>;aws_secret_access_key=<AWS Secret Access Key>' delimiter ',' IGNOREHEADER 1;

The loaddata.sql script is executed for the data loading step.

 

Modify the data set in Amazon Redshift

The next step is to make some changes to the data set to make it less noisy and suitable for the ML model that you create later. There are various things you can do as part of this clean up, such as updating incomplete values and grouping attributes into categories. For example, age can be grouped into young, adult or old based on age ranges.

For the target attribute for your ML model, create a custom attribute called readmission_result, with a value of “Yes” or “No” based on conditions in the readmitted attribute. To see all the changes made to the data, see the ModifyData.sql script.

Finally, the complete modified data set is dumped into a new table, diabetes_data_modified, which acts as a source for the ML model. Notice the new custom column readmission_result, which is your target attribute for the ML model.

 

Create a data source for Amazon ML and build the ML model

Next, create an Amazon ML data source, choosing Amazon Redshift as the source. This can be easily done through the console or through the CreateDataSourceFromRedshift API operation, by specifying Redshift parameters such as the cluster name, database name, user name, password, IAM role, and SQL query. The IAM role for Amazon Redshift as a data source is easily populated, as shown in the screenshot below.

You need the entire data set for the ML model, so use the following query for the data source:

SELECT * FROM diabetes_data_modified

This can be modified with column names and WHERE clauses to build different data sets for training the ML model.
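If you drive this step through the API instead of the console, a sketch with the AWS SDK for JavaScript might look like the following; the cluster identifier, database, credentials, S3 staging location, and role ARN are placeholders, not values from this walkthrough:

// Sketch: create the Amazon ML data source from Amazon Redshift via the API.
var AWS = require('aws-sdk');
var machinelearning = new AWS.MachineLearning({ region: 'us-east-1' });

machinelearning.createDataSourceFromRedshift({
    DataSourceId: 'ds-readmission-training',              // placeholder
    DataSourceName: 'Diabetes readmission training data',
    ComputeStatistics: true,
    DataSpec: {
        DatabaseInformation: {
            ClusterIdentifier: 'diabetes-cluster',         // placeholder
            DatabaseName: 'diabetesdb'                     // placeholder
        },
        DatabaseCredentials: {
            Username: 'masteruser',                        // placeholder
            Password: '********'
        },
        SelectSqlQuery: 'SELECT * FROM diabetes_data_modified',
        S3StagingLocation: 's3://diabetesdata/staging/',   // placeholder
        DataRearrangement: JSON.stringify({
            splitting: { percentBegin: 0, percentEnd: 70, strategy: 'random' }
        })
    },
    RoleARN: 'arn:aws:iam::123456789012:role/AmazonML-Redshift-Role'  // placeholder
}, function (err, data) {
    if (err) console.log('Error: ' + err);
    else console.log('Created data source: ' + data.DataSourceId);
});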

The steps to create a binary classification ML model are covered in detail in the Building a Binary Classification Model with Amazon Machine Learning and Amazon Redshift blog post.

Amazon ML provides two types of predictions that you can try. The first one is a batch prediction, which can be generated through the console or the CreateBatchPrediction API operation. The result of the batch prediction is stored in an Amazon S3 bucket and can be used to build reports for end users (like a monthly actual value vs. predicted value report).

You can also use the ML model to generate a real-time prediction. To enable real-time predictions, create an endpoint for the ML model either through the console or using the CreateRealtimeEndpoint API operation.
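A minimal sketch of that call with the AWS SDK for JavaScript (the ML model ID is a placeholder):

// Sketch: create the real-time endpoint for the ML model through the API.
var AWS = require('aws-sdk');
var machinelearning = new AWS.MachineLearning({ region: 'us-east-1' });

machinelearning.createRealtimeEndpoint({
    MLModelId: 'ml-ReadmissionModel'   // placeholder
}, function (err, data) {
    if (err) console.log('Error: ' + err);
    else console.log('Endpoint: ' + JSON.stringify(data.RealtimeEndpointInfo));
});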

After it’s created, you can query this endpoint in real time to get a response from Amazon ML, as shown in the following CLI screenshot.


 

 

Build the end user application

The Amazon ML endpoint created earlier can be invoked using an API call. This is very handy for building an application for end users who can interact with the ML model in real time.

Create a similar application and host it as a static website on Amazon S3. This feature of S3 allows you to host websites without any web servers and takes away the complexities of scaling hardware based on traffic routed to your application. The following is a screenshot from the application:

The application allows end users to select certain patient parameters and then makes a call to the predict API. The results are displayed in real time in the results pane.

I made use of the AWS SDK for JavaScript to build this application. The SDK can be added to your script using the following code:

<script src="https://sdk.amazonaws.com/js/aws-sdk-2.3.3.min.js"></script>

 

Use Amazon Cognito for secure access

To authenticate the Amazon ML API request, you can make use of Amazon Cognito, which allows secure access to the Amazon ML endpoint without distributing long-term AWS security credentials. To enable this, create an identity pool in Amazon Cognito.

Amazon Cognito creates a new role in IAM. You need to allow this new IAM role to interact with Amazon ML by attaching the AmazonMachineLearningRealTimePredictionOnlyAccess policy to the role. This IAM policy allows the application to query the Amazon ML endpoint.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "machinelearning:Predict"
      ],
      "Resource": "*"
    }
  ]
}

Next, initialize credential objects, as shown in the code below:

var parameters = {
    AccountId: "AWS Account ID",
    RoleArn: "ARN for the role created by Amazon Cognito",
    IdentityPoolId: "The identity pool ID created in Amazon Cognito"
};
// set the Amazon Cognito region
AWS.config.region = 'us-east-1';
// initialize the Credentials object with the parameters
AWS.config.credentials = new AWS.CognitoIdentityCredentials(parameters);

 

Call the Amazon ML Endpoint using the API

Create the function callApi() to make a call to the Amazon ML endpoint. The steps in the callApi() function involve building the object that forms part of the parameters sent to the Amazon ML endpoint, as shown in the code below:

var machinelearning = new AWS.MachineLearning({apiVersion: '2014-12-12'});
var params = {
    MLModelId: '<ML model ID>',
    PredictEndpoint: '<ML model real-time endpoint>',
    Record: {} // populated with the patient attributes selected in the UI
};
var request = machinelearning.predict(params);

The API call returns a JSON object that includes, among other things, the predictedLabel and predictedScores parameters, as shown in the code below:

{
    "Prediction": {
        "details": {
            "Algorithm": "SGD",
            "PredictiveModelType": "BINARY"
        },
        "predictedLabel": "1",
        "predictedScores": {
            "1": 0.5548262000083923
        }
    }
}

The predictedScores parameter generates a score between 0 and 1 which you can convert into a percentage:

if (!isNaN(predictedScore)) {
    finalScore = Math.round(predictedScore * 100);
    resultMessage = finalScore + "%";
}

The complete code for this sample application is uploaded to the PredictReadmission_AML GitHub repo for reference and can be used to create more sophisticated machine learning applications using Amazon ML.

 

Conclusion

The power of machine learning opens new avenues for advanced analytics in healthcare. With new means of gathering data that range from sensors mounted on medical devices to medical images and everything in between, the complexities demonstrated by these varied data sets are pushing the boundaries of conventional analysis techniques.

The advent of cloud computing has made it possible for researchers to take up the challenging task of synthesizing these data sets and draw insights that are providing us with information that we never knew existed.

We are still at the beginning of this journey and there are, of course, challenges that we have to overcome. The limited availability of quality data sets, which are the starting point of any good analysis, is still a major hurdle. Regulations like the Health Insurance Portability and Accountability Act of 1996 (HIPAA) make it difficult to obtain medical records containing Protected Health Information (PHI). The good news is that this is changing with initiatives like AWS Public Data Sets, which hosts a variety of public data sets that anyone can use.

At the end of the day, all this analysis and research is for one cause: To improve the quality of human lives. I hope this is, and will continue to be, the greatest motivation to overcome any challenge.

If you have any questions or suggestions, please comment below.
_ _ _ _ _

Do you want to be part of the conversation? Join AWS developers, enthusiasts, and healthcare professionals as we discuss building smart healthcare applications on AWS in Seattle on August 31.

Seattle AWS Big Data Meetup (Wednesday, August 31, 2016)


 

Related

Building a Multi-Class ML Model with Amazon Machine Learning
 

 

Month in Review: July 2016

Post Syndicated from Derek Young original https://blogs.aws.amazon.com/bigdata/post/Tx3PZZPH7CK6QOB/Month-in-Review-July-2016

July was a busy month of big data solutions on the Big Data Blog. The month started with our most popular story yet, Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE. It was a great post to start a spectacular month. Take a look at our summaries below. Learn, comment, and share. Thank you for reading the AWS Big Data Blog!

Installing and Running JobServer for Apache Spark on Amazon EMR
In this blog post, learn how to install JobServer on EMR using a bootstrap action (BA) derived from the JobServer GitHub repository. Then, run JobServer using a sample dataset.

Process Large DynamoDB Streams Using Multiple Amazon Kinesis Client Library (KCL) Workers
A previous post described how you can use the Amazon Kinesis Client Library (KCL) and DynamoDB Streams Kinesis Adapter to efficiently process DynamoDB streams. This post focuses on the KCL configurations that are likely to have an impact on the performance of your application when processing a large DynamoDB stream.

Simplify Management of Amazon Redshift Snapshots using AWS Lambda
In this blog post, learn about the new Amazon Redshift Utils module that helps you manage the Snapshots that your cluster creates. You supply a simple configuration, and then AWS Lambda ensures that you have cluster snapshots as frequently as required to meet your RPO.

How SmartNews Built a Lambda Architecture on AWS to Analyze Customer Behavior and Recommend Content
In this post, SmartNews shows you how they built their data platform on AWS. Their current system generates tens of GBs of data from multiple data sources, and runs daily aggregation queries or machine learning algorithms on datasets with hundreds of GBs. Some outputs of the machine learning algorithms are joined with data streams to gather user feedback in near real time (e.g., over the last 5 minutes). This lets them adapt their product for users with minimal latency.

Generating Recommendations at Amazon Scale with Apache Spark and Amazon DSSTNE
Managing a hybrid cluster of both CPU and GPU instances poses challenges because cluster managers such as YARN and Mesos do not natively support GPUs. Even if they did have native GPU support, the open source deep learning libraries would have to be rewritten to work with the cluster manager API. This post discusses an alternate solution: namely, running separate CPU and GPU clusters, and driving the end-to-end modeling process from Apache Spark.

FROM THE ARCHIVE

Will Spark Power the Data behind Precision Medicine? (March 2016)
Spark is already known for being a major player in big data analysis, but it is also uniquely capable of advancing genomics algorithms, given the complex nature of genomics research. This post introduces gene analysis using Spark on EMR and ADAM, for those new to precision medicine.

———————————————–

Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming data educational pages.

Leave a comment below to let us know what big data topics you’d like to see next on the AWS Big Data Blog.

 

Plan Bee

Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/plan-bee/

Bees are important. I find myself saying this a lot and, slowly but surely, the media seems to be coming to this realisation too. The plight of the bee is finally being brought to our attention with increasing urgency.

A colony of bees make honey

Welcome to the house of buzz.

In the UK, bee colonies are suffering mass losses. Due to the use of bee-killing fertilisers and pesticides within the farming industry, the decline of pollen-rich plants, the destruction of hives by mites, and Colony Collapse Disorder (CCD), bees are in decline at a worrying pace.

Bee Collision

When you find the perfect GIF…

One hint of a silver lining is that increasing awareness of the crisis has led to a rise in the number of beekeeping hobbyists. As getting your hands on some bees is now as simple as ordering a box from the internet, keeping bees in your garden is a much less daunting venture than it once was. 

Taking this one step further, beekeepers are now using tech to monitor the conditions of their bees, improving conditions for their buzzy workforce while also recording data which can then feed into studies attempting to lessen the decline of the bee.

WDLabs recently donated a PiDrive to the Honey Bee Gardens Project in order to help beekeeper David Ammons and computer programmer Graham Total create The Hive Project, an electric beehive colony that monitors real-time bee data.

Electric Bee Hive

The setup records colony size, honey production, and bee health to help combat CCD.

Colony Collapse Disorder (CCD) is decidedly mysterious. Colonies hit by the disease seem to simply disappear. The hive itself often remains completely intact, full of honey at the perfect temperature, but… no bees. Dead or alive, the bees are nowhere to be found.

To try to combat this phenomenon, the electric hive offers 24/7 video coverage of the inner hive, while tracking the conditions of the hive population.

Bee bringing pollen into the hive

This is from the first live day of our instrumented beehive. This was the only bee we spotted all day that brought any pollen into the hive.

Ultimately, the team aim for the data to be crowdsourced, enabling researchers and keepers to gain the valuable information needed to fight CCD via a network of electric hives. While many people blame the aforementioned pollen decline and chemical influence for the rise of CCD, without the empirical information gathered from builds such as The Hive Project, the source of the problem, and therefore the solution, can’t be found.

Bee making honey

It has been brought to our attention that the picture here previously was of a wasp doing bee things. We have swapped it out for a bee.

 

 

Ammons and Total researched existing projects around the use of digital tech within beekeeping, and they soon understood that a broad analysis of bee conditions didn’t exist. While many were tracking hive weight, temperature, or honey population, there was no system in place for integrating such data collection into one place. This realisation spurred them on further.

“We couldn’t find any one project that took a broad overview of the whole area. Even if we don’t end up being the people who implement it, we intend to create a plan for a networked system of low-cost monitors that will assist both research and commercial beekeeping.”

With their mission statement firmly in place, the duo looked toward the Raspberry Pi as the brain of their colony. Finding the device small enough to fit within the hive without disruption, the power of the Pi allowed them to monitor multiple factors while also using the Pi Camera Module to record all video to the 314GB storage of the Western Digital PiDrive.

Data recorded by The Hive Project is vital to the survival of the bee, the growth of colony population, and an understanding of the conditions of the hive in changing climates. These are issues which affect us all. The honey bee is responsible for approximately 80% of pollination in the UK, and is essential to biodiversity. Here, I should hand over to a ‘real’ bee to explain more about the importance of bee-ing…

Bee Movie – Devastating Consequences – HD

Barry doesn’t understand why all the bees aren’t happy. Then, Vanessa shows Barry the devastating consequences of the bees being triumphant in their lawsuit against the human race.

 

The post Plan Bee appeared first on Raspberry Pi.