Security updates have been issued by CentOS (augeas, samba, and samba4), Debian (apache2, bluez, emacs23, and newsbeuter), Fedora (kernel and mingw-LibRaw), openSUSE (apache2 and libzip), Oracle (kernel), SUSE (kernel, spice, and xen), and Ubuntu (emacs24, emacs25, and samba).
Security updates have been issued by CentOS (emacs), Debian (apache2, gdk-pixbuf, and pyjwt), Fedora (autotrace, converseen, dmtx-utils, drawtiming, emacs, gtatool, imageinfo, ImageMagick, inkscape, jasper, k3d, kxstitch, libwpd, mingw-libzip, perl-Image-SubImageFind, pfstools, php-pecl-imagick, psiconv, q, rawtherapee, ripright, rss-glx, rubygem-rmagick, synfig, synfigstudio, techne, vdr-scraper2vdr, vips, and WindowMaker), Oracle (emacs and kernel), Red Hat (emacs and kernel), Scientific Linux (emacs), SUSE (emacs), and Ubuntu (apache2).
Security updates have been issued by Arch Linux (apache and ettercap), Debian (gdk-pixbuf and newsbeuter), Red Hat (kernel), Slackware (httpd, libgcrypt, and ruby), SUSE (kernel), and Ubuntu (bind9, kernel, libidn2-0, libxml2, linux, linux-aws, linux-gke, linux-kvm, linux-raspi2, linux-snapdragon, linux, linux-raspi2, linux-hwe, linux-lts-trusty, and linux-lts-xenial).
Post Syndicated from Robert Graham original http://blog.erratasec.com/2017/09/people-cant-read-equifax-edition.html
One of these days I’m going to write a guide for journalists reporting on the cyber. One of the items I’d stress is that they often fail to read the text of what is being said, but instead read some sort of subtext that wasn’t explicitly said. This is valid sometimes — as the subtext is what the writer intended all along, even if they didn’t explicitly write it. Other times, though, the imagined subtext is not what the writer intended at all.
A good example is the recent Equifax breach. The original statement says:
Equifax Inc. (NYSE: EFX) today announced a cybersecurity incident potentially impacting approximately 143 million U.S. consumers.
The word consumers was widely translated to customers, as in this Bloomberg story:
Equifax Inc. said its systems were struck by a cyberattack that may have affected about 143 million U.S. customers of the credit reporting agency
But these aren’t the same thing. Equifax is a credit reporting agency, keeping data on people who are not its own customers. It’s an important difference.
Another good example is yesterday’s quote “confirming” that the “Apache Struts” vulnerability was to blame:
Equifax has been intensely investigating the scope of the intrusion with the assistance of a leading, independent cybersecurity firm to determine what information was accessed and who has been impacted. We know that criminals exploited a U.S. website application vulnerability. The vulnerability was Apache Struts CVE-2017-5638.
But it doesn’t confirm Struts was responsible. Blaming Struts is certainly the subtext of this paragraph, but it’s not the text. It mentions that criminals exploited the Struts vulnerability, but doesn’t actually connect the dots to the breach we are all talking about.
There are probably reasons for this. While it’s easy for forensics to find evidence of Struts exploitation in logfiles, it’s much harder to connect this to the breach. While they suspect Struts, they may not actually be able to confirm it. Or maybe they are trying to cover things up, feeling that failing to patch is a lesser crime than what they really did.
It’s at this point journalists should earn their pay. Instead of rewriting what they read on the Internet, they could do legwork and call up Equifax PR and ask.
The purpose of this post isn’t to discuss Equifax, but the tendency of people to “read between the lines”, to read some subtext that wasn’t actually expressed in the text. Sometimes the subtext is legitimately there, such as how Equifax clearly intends people to blame Struts though they don’t say it outright. Sometimes the subtext isn’t there, such as how Equifax doesn’t mean its own customers, only “U.S. consumers”. Journalists need to be careful about making assumptions about the subtext.
Update: The Equifax CSO has a degree in music. Some people have criticized this. Most people have defended this, pointing out that almost nobody has an “infosec” degree in our industry, and many of the top people have no degree at all. Among others, @thegrugq has pointed out that infosec degrees are only a few years old — they weren’t around 20 years ago when today’s corporate officers were getting their degrees.
Again, we have the text/subtext problem, where people interpret infosec degrees as being the same as computer-science degrees, the latter of which have existed for decades. Some, as in this case, consider them to be wildly different. Others consider them to be nearly the same.
The Equifax data breach is huge: 143 million records were leaked from the hack in the US alone, with an unknown number more in Canada and the UK.
For those who weren’t up to date with it, the original statement about the breach, which came out September 7th (4 months after the breach happened), is as follows.
Equifax Inc. (NYSE: EFX) today announced a cybersecurity incident potentially impacting approximately 143 million U.S. consumers.
The Apache Struts project has put out a statement on the possible role played by a Struts vulnerability in the massive Equifax data breach. “Regarding the assertion that especially CVE-2017-9805 is a nine year old security flaw, one has to understand that there is a huge difference between detecting a flaw after nine years and knowing about a flaw for several years. If the latter was the case, the team would have had a hard time to provide a good answer why they did not fix this earlier. But this was actually not the case here – we were notified just recently on how a certain piece of code can be misused, and we fixed this ASAP. What we saw here is common software engineering business – people write code for achieving a desired function, but may not be aware of undesired side-effects. Once this awareness is reached, we as well as hopefully all other library and framework maintainers put high efforts into removing the side-effects as soon as possible. It’s probably fair to say that we met this goal pretty well in case of CVE-2017-9805.”
Post Syndicated from Sara Rodas original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-september-2017/
As a school supply aficionado, the month of September has always held a special place in my heart. Nothing sets the tone for success like getting a killer deal on pens and a crisp college ruled notebook. Even if back to school shopping trips have secured a seat in your distant memory, this is still a perfect time of year to stock up on office supplies and set aside some time for flexing those learning muscles. A great way to get started: scan through our September Tech Talks and check out the ones that pique your interest. This month we are covering re:Invent, AI, and much more.
September 2017 – Schedule
Noted below are the upcoming scheduled live, online technical sessions being held during the month of September. Make sure to register ahead of time so you won’t miss out on these free talks conducted by AWS subject matter experts.
Webinars featured this month are:
Monday, September 11
9:00 – 9:40 AM PDT: What’s New with Amazon DynamoDB
10:30 – 11:10 AM PDT: Local Testing and Deployment Best Practices for Serverless Applications
12:00 – 12:40 PM PDT: Managing Secrets for Containers with Amazon ECS
Tuesday, September 12
9:00 – 9:40 AM PDT: Get Ready for re:Invent 2017 Content Overview
10:30 – 11:10 AM PDT: Deep Dive on User Sign-up and Sign-in with Amazon Cognito
12:00 – 12:40 PM PDT: Using CloudTrail to Enhance Compliance and Governance of S3
Wednesday, September 13
9:00 – 9:40 AM PDT: Best Practices for Processing Managed Hadoop Workloads
10:30 – 11:10 AM PDT: Migrating Your Oracle Database to PostgreSQL
12:00 – 12:40 PM PDT: Configuration Management in the Cloud
Thursday, September 14
9:00 – 9:40 AM PDT: Tackle Your Dark Data Challenge with AWS Glue
10:30 – 11:10 AM PDT: Deep Dive on MySQL Databases on AWS
12:00 – 12:40 PM PDT: Using AWS Batch and AWS Step Functions to Design and Run High-Throughput Workflows
Tuesday, September 26
9:00 – 9:40 AM PDT: An Overview of AI on the AWS Platform
10:30 – 11:10 AM PDT: Introduction to Generative Adversarial Networks (GAN) with Apache MXNet
12:00 – 12:40 PM PDT: Revolutionizing Backup & Recovery Using Amazon S3
2:00 – 2:40 PM PDT: Securing Your Desktops with Amazon WorkSpaces
Wednesday, September 27
Security & Identity
9:00 – 9:40 AM PDT: Advanced DNS Traffic Management using Amazon Route 53
10:30 – 11:10 AM PDT: Deep Dive on Amazon EFS (with Encryption)
Hands on Lab
12:30 – 2:00 PM PDT: Hands on Lab: Windows Workloads
Thursday, September 28
Security & Identity
9:00 – 9:40 AM PDT: How to use AWS WAF to Mitigate OWASP Top 10 attacks
10:30 – 11:10 AM PDT: AWS Greengrass Technical Deep Dive with Demo
Hands on Lab
1:00 – 1:40 PM PDT: Design, Deploy, and Optimize SQL Server on AWS
The AWS Online Tech Talks series covers a broad range of topics at varying technical levels. These sessions feature live demonstrations & customer examples led by AWS engineers and Solution Architects. Check out the AWS YouTube channel for more on-demand webinars on AWS technologies.
Security updates have been issued by Debian (enigmail, gnupg, libgd2, libidn, libidn2-0, mercurial, and strongswan), Fedora (gd, libidn2, mbedtls, mingw-openjpeg2, openjpeg2, and xen), Mageia (apache-commons-email, botan, iceape, poppler, rt/perl-Encode, samba, and wireshark), and openSUSE (expat, freerdp, git, libzypp, and php7).
When you develop Apache Spark–based applications, you might face some additional challenges when dealing with continuous integration and deployment pipelines, such as the following common issues:
- Applications must be tested on real clusters using automation tools (live test).
- Any user or developer must be able to easily deploy and use different versions of both the application and infrastructure in order to debug, experiment on, and test different functionality.
- Infrastructure needs to be evaluated and tested along with the application that uses it.
In this post, we walk you through a solution that implements a continuous integration and deployment pipeline supported by AWS services. The pipeline offers the following workflow:
- Deploy the application to a QA stage after a commit is performed to the source code.
- Perform a unit test using Spark local mode.
- Deploy to a dynamically provisioned Amazon EMR cluster and test the Spark application on it.
- Update the application as an AWS Service Catalog product version, allowing a user to deploy any version (commit) of the application on demand.
The following diagram shows the pipeline workflow.
The solution uses AWS CodePipeline, which allows users to orchestrate and automate the build, test, and deploy stages for application source code. The solution consists of a pipeline that contains the following stages:
- Source: Both the Spark application source code and the AWS CloudFormation template file for deploying the application are committed to version control. In this example, we use AWS CodeCommit. For an example of the application source code, see the sample zip archive.
- Build: In this stage, you use Apache Maven both to generate the application .jar binaries and to execute all of the application unit tests that end with *Spec.scala. In this example, we use AWS CodeBuild, which runs the unit tests given that they are designed to use Spark local mode.
- QADeploy: In this stage, the .jar file built previously is deployed using the CloudFormation template included with the application source code. All the resources are created in this stage, such as networks, EMR clusters, and so on.
- LiveTest: In this stage, you use Apache Maven to execute all the application tests that end with *SpecLive.scala. The tests submit EMR steps to the cluster created as part of the QADeploy step. The tests verify that the steps ran successfully and their results.
- LiveTestApproval: This stage is included in case a pipeline administrator approval is required to deploy the application to the next stages. The pipeline pauses in this stage until an administrator manually approves the release.
- QACleanup: In this stage, you use an AWS Lambda function to delete the CloudFormation template deployed as part of the QADeploy stage. The function does not affect any resources other than those deployed as part of the QADeploy stage.
- DeployProduct: In this stage, you use a Lambda function that creates or updates an AWS Service Catalog product and portfolio. Every time the pipeline releases a change to the application, the AWS Service Catalog product gets a new version, with the commit of the change as the version description.
Try it out!
Use the provided sample template to get started using this solution. This template creates the pipeline described earlier with all of its stages. It performs an initial commit of the sample Spark application in order to trigger the first release change. To deploy the template, use the following AWS CLI command:
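A minimal sketch of that deployment call, written here with boto3 rather than the CLI one-liner (the stack name and template file name are placeholders, not the originals from this post):

```python
# Hypothetical deployment of the pipeline template. Names are placeholders;
# CAPABILITY_NAMED_IAM is needed because the template creates IAM roles.
import boto3

cfn = boto3.client("cloudformation")
with open("pipeline-template.yaml") as f:
    cfn.create_stack(
        StackName="spark-cicd-pipeline",
        TemplateBody=f.read(),
        Capabilities=["CAPABILITY_NAMED_IAM"],
    )
cfn.get_waiter("stack_create_complete").wait(StackName="spark-cicd-pipeline")
```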
After the template finishes creating resources, you see the pipeline name on the stack Outputs tab. After that, open the AWS CodePipeline console and select the newly created pipeline.
After a couple of minutes, AWS CodePipeline detects the initial commit applied by the CloudFormation stack and starts the first release.
You can watch how the pipeline goes through the Build, QADeploy, and LiveTest stages until it finally reaches the LiveTestApproval stage.
At this point, you can check the results of the test in the log files of the Build and LiveTest stage jobs on AWS CodeBuild. If you check the CloudFormation console, you see that a new template has been deployed as part of the QADeploy stage.
You can also visit the EMR console and view how the LiveTest stage submitted steps to the EMR cluster.
After performing the review, manually approve the revision on the LiveTestApproval stage by using the AWS CodePipeline console.
After the revision is approved, the pipeline proceeds to use a Lambda function that destroys the resources deployed in the QADeploy stage. Finally, it creates or updates a product and portfolio in AWS Service Catalog. After the final stage of the pipeline is complete, you can check that the product is created successfully on the AWS Service Catalog console.
You can check the product versions and notice that the first version is the initial commit performed by the CloudFormation template.
You can proceed to share the created portfolio with any users in your AWS account and allow them to deploy any version of the Spark application. You can also perform a commit on the AWS CodeCommit repository. The pipeline is triggered automatically and repeats the pipeline execution to deploy a new version of the product.
To destroy all of the resources created by the stack, make sure all the stacks deployed using AWS Service Catalog or the QADeploy stage are destroyed. Then, destroy the pipeline template using the following AWS CLI command:
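Again as a hedged boto3 equivalent, using the same placeholder stack name as above:

```python
# Hypothetical teardown of the pipeline stack.
import boto3

cfn = boto3.client("cloudformation")
cfn.delete_stack(StackName="spark-cicd-pipeline")
cfn.get_waiter("stack_delete_complete").wait(StackName="spark-cicd-pipeline")
```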
You can use the sample template and Spark application shared in this post and adapt them for the specific needs of your own application. The pipeline can have as many stages as needed and it can be used to automatically deploy to AWS Service Catalog or a production environment using CloudFormation.
If you have questions or suggestions, please comment below.
About the Authors
Luis Caro is a Big Data Consultant for AWS Professional Services. He works with our customers to provide guidance and technical assistance on big data projects, helping them improve the value of their solutions when using AWS.
Samuel Schmidt is a Big Data Consultant for AWS Professional Services. He works with our customers to provide guidance and technical assistance on big data projects, helping them improve the value of their solutions when using AWS.
Security updates have been issued by Arch Linux (salt and thunderbird), Debian (aodh), Fedora (kernel and nginx), Mageia (apache, graphicsmagick, kernel-tmb, and openjpeg2), Red Hat (bind and thunderbird), Scientific Linux (thunderbird), and Ubuntu (python-pysaml2).
Post Syndicated from Dillon Morrison original https://aws.amazon.com/blogs/big-data/analyzing-aws-cost-and-usage-reports-with-looker-and-amazon-athena/
This is a guest post by Dillon Morrison at Looker. Looker is, in their own words, “a new kind of analytics platform – letting everyone in your business make better decisions by getting reliable answers from a tool they can use.”
As the breadth of AWS products and services continues to grow, customers are able to more easily move their technology stack and core infrastructure to AWS. One of the attractive benefits of AWS is the cost savings. Rather than paying upfront capital expenses for large on-premises systems, customers can instead pay variable expenses for on-demand services. To further reduce expenses, AWS users can reserve resources for specific periods of time and automatically scale resources as needed.
The AWS Cost Explorer is great for aggregated reporting. However, conducting analysis on the raw data using the flexibility and power of SQL allows for much richer detail and insight, and can be the better choice for the long term. Thankfully, with the introduction of Amazon Athena, monitoring and managing these costs is now easier than ever.
In this post, I walk through setting up the data pipeline for cost and usage reports, Amazon S3, and Athena, and discuss some of the most common levers for cost savings. I surface tables through Looker, which comes with a host of pre-built data models and dashboards to make analysis of your cost and usage data simple and intuitive.
Analysis with Athena
With Athena, there’s no need to create hundreds of Excel reports, move data around, or deploy clusters to house and process data. Athena uses Apache Hive’s DDL to create tables, and the Presto querying engine to process queries. Analysis can be performed directly on raw data in S3. Conveniently, AWS exports raw cost and usage data directly into a user-specified S3 bucket, making it simple to start querying with Athena quickly. This makes continuous monitoring of costs virtually seamless, since there is no infrastructure to manage. Instead, users can leverage the power of the Athena SQL engine to easily perform ad-hoc analysis and data discovery without needing to set up a data warehouse.
After the data pipeline is established, cost and usage data (the recommended billing data, per AWS documentation) provides a plethora of comprehensive information around usage of AWS services and the associated costs. Whether you need the report segmented by product type, user identity, or region, this report can be cut-and-sliced any number of ways to properly allocate costs for any of your business needs. You can then drill into any specific line item to see even further detail, such as the selected operating system, tenancy, purchase option (on-demand, spot, or reserved), and so on.
By default, the Cost and Usage report exports CSV files, which you can compress using gzip (recommended for performance). There are some additional configuration options for tuning performance further, which are discussed below.
If you want to follow along, you need the following resources:
- AWS account
- New or existing S3 bucket
- Appropriate permissions for Athena to the S3 bucket
Enable the cost and usage reports
First, enable the Cost and Usage report. For Time unit, select Hourly. For Include, select Resource IDs. All options are prompted in the report-creation window.
The Cost and Usage report dumps CSV files into the specified S3 bucket. Please note that it can take up to 24 hours for the first file to be delivered after enabling the report.
Configure the S3 bucket and files for Athena querying
In addition to the CSV file, AWS also creates a JSON manifest file for each cost and usage report. Athena requires that all of the files in the S3 bucket are in the same format, so we need to get rid of all these manifest files. If you’re looking to get started with Athena quickly, you can simply go into your S3 bucket and delete the manifest file manually, skip the automation described below, and move on to the next section.
To automate the process of removing the manifest file each time a new report is dumped into S3, which I recommend as you scale, there are a few additional steps. The folks at Concurrency Labs wrote a great overview and set of scripts for this, which you can find in their GitHub repo.
These scripts take the data from an input bucket, remove anything unnecessary, and dump it into a new output bucket. We can utilize AWS Lambda to trigger this process whenever new data is dropped into S3, or on a nightly basis, or whatever makes most sense for your use-case, depending on how often you’re querying the data. Please note that enabling the “hourly” report means that data is reported at the hour-level of granularity, not that a new file is generated every hour.
Following these scripts, you’ll notice that we’re adding a date partition field, which isn’t necessary but improves query performance. In addition, converting data from CSV to a columnar format like ORC or Parquet also improves performance. We can automate this process using Lambda whenever new data is dropped in our S3 bucket. Amazon Web Services discusses columnar conversion at length, and provides walkthrough examples, in their documentation.
As a long-term solution, best practice is to use compression, partitioning, and conversion. However, for purposes of this walkthrough, we’re not going to worry about them so we can get up-and-running quicker.
Set up the Athena query engine
In your AWS console, navigate to the Athena service, and click “Get Started”. Follow the tutorial and set up a new database (we’ve called ours “AWS Optimizer” in this example). Don’t worry about configuring your initial table, per the tutorial instructions. We’ll be creating a new table for cost and usage analysis. Once you’ve walked through the tutorial steps, you’ll be able to access the Athena interface, and can begin running Hive DDL statements to create new tables.
One thing that’s important to note is that the Cost and Usage CSVs also contain the column headers in their first row, meaning that the column headers would be included in the dataset and any queries. For testing and quick setup, you can remove this line manually from your first few CSV files. Long-term, you’ll want to use a script to programmatically remove this row each time a new file is dropped in S3 (every few hours typically). We’ve drafted up a sample script for ease of reference, which we run on Lambda. We utilize Lambda’s native ability to invoke the script whenever a new object is dropped in S3.
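As a rough illustration of what such a script does (the bucket names and function wiring are assumptions, and the input is assumed to be uncompressed CSV):

```python
# Hypothetical Lambda handler: on each new CSV object, drop the first
# (header) line and write the cleaned file to an output bucket.
import boto3

s3 = boto3.client("s3")
OUTPUT_BUCKET = "my-cur-cleaned"  # placeholder

def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        _, _, rest = body.partition(b"\n")  # drop everything through the first newline
        s3.put_object(Bucket=OUTPUT_BUCKET, Key=key, Body=rest)
```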
For cost and usage, we recommend using the DDL statement below. Since our data is in CSV format, we don’t need to use a SerDe; we can simply specify the separatorChar, quoteChar, and escapeChar, and the structure of the files (TEXTFILE). Note that AWS does have an OpenCSV SerDe as well, if you prefer to use that.
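A trimmed sketch of such a statement, run here through the Athena API (only a handful of the report’s many columns are shown, and the database name and S3 locations are placeholders):

```python
# Hypothetical, abbreviated cost-and-usage DDL. The real report has many
# more columns; keep the column order identical to the CSV files.
import boto3

DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS aws_optimizer.cost_and_usage (
  lineitem_usagestartdate string,
  lineitem_productcode string,
  lineitem_lineitemtype string,
  lineitem_usageamount double,
  lineitem_unblendedcost double,
  product_usagetype string,
  product_transfertype string,
  resourcetags_customersegment string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION 's3://my-cur-bucket/reports/'
"""

boto3.client("athena").start_query_execution(
    QueryString=DDL,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```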
Once you’ve successfully executed the command, you should see a new table named “cost_and_usage” with the below properties. Now we’re ready to start executing queries and running analysis!
Start with Looker and connect to Athena
Setting up Looker is a quick process, and you can try it out for free here (or download from Amazon Marketplace). It takes just a few seconds to connect Looker to your Athena database, and Looker comes with a host of pre-built data models and dashboards to make analysis of your cost and usage data simple and intuitive. After you’re connected, you can use the Looker UI to run whatever analysis you’d like. Looker translates this UI to optimized SQL, so any user can execute and visualize queries for true self-service analytics.
Major cost saving levers
Now that the data pipeline is configured, you can dive into the most popular use cases for cost savings. In this post, I focus on:
- Purchasing Reserved Instances vs. On-Demand Instances
- Data transfer costs
- Allocating costs over users or other attributes (denoted with resource tags)
On-Demand, Spot, and Reserved Instances
Purchasing Reserved Instances vs On-Demand Instances is arguably going to be the biggest cost lever for heavy AWS users (Reserved Instances run up to 75% cheaper!). AWS offers three options for purchasing instances:
- On-Demand—Pay as you use.
- Spot (variable cost)—Bid on spare Amazon EC2 computing capacity.
- Reserved Instances—Pay for an instance for a specific, allotted period of time.
When purchasing a Reserved Instance, you can also choose to pay all-upfront, partial-upfront, or monthly. The more you pay upfront, the greater the discount.
If your company has been using AWS for some time now, you should have a good sense of your overall instance usage on a per-month or per-day basis. Rather than paying for these instances On-Demand, you should try to forecast the number of instances you’ll need, and reserve them with upfront payments.
The total amount of usage with Reserved Instances versus overall usage with all instances is called your coverage ratio. It’s important not to confuse your coverage ratio with your Reserved Instance utilization. Utilization represents the amount of reserved hours that were actually used. Don’t worry about exceeding capacity: you can still set up Auto Scaling preferences so that more instances get added whenever your coverage or utilization crosses a certain threshold (we often see a target of 80% for both coverage and utilization among savvy customers).
Calculating the reserved costs and coverage can be a bit tricky with the level of granularity provided by the cost and usage report. The following query shows your total cost over the last 6 months, broken out by Reserved Instance vs other instance usage. You can substitute the cost field for usage if you’d prefer. Please note that you should only have data for the time period after the cost and usage report has been enabled (though you can opt for up to 3 months of historical data by contacting your AWS Account Executive). If you’re just getting started, this query will only show a few days.
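A sketch of such a query, using the abbreviated schema above (in the cost and usage report, Reserved Instance usage shows up with a line item type of DiscountedUsage):

```python
# Monthly unblended cost, split into Reserved Instance usage vs. the rest.
QUERY = """
SELECT
  date_trunc('month',
    date_parse(lineitem_usagestartdate, '%Y-%m-%dT%H:%i:%sZ')) AS month,
  CASE WHEN lineitem_lineitemtype = 'DiscountedUsage'
       THEN 'Reserved Instance' ELSE 'Other' END AS purchase_type,
  sum(lineitem_unblendedcost) AS total_cost
FROM aws_optimizer.cost_and_usage
GROUP BY 1, 2
ORDER BY 1, 2
"""
# Run via the Athena console, or with start_query_execution as shown earlier.
```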
The resulting table should look something like the image below (I’m surfacing tables through Looker, though the same table would result from querying via command line or any other interface).
With a BI tool, you can create dashboards for easy reference and monitoring. New data is dumped into S3 every few hours, so your dashboards can update several times per day.
It’s an iterative process to understand the appropriate number of Reserved Instances needed to meet your business needs. After you’ve properly integrated Reserved Instances into your purchasing patterns, the savings can be significant. If your coverage is consistently below 70%, you should seriously consider adjusting your purchase types and opting for more Reserved Instances.
Data transfer costs
One of the great things about AWS data storage is that it’s incredibly cheap. Most charges often come from moving and processing that data. There are several different prices for transferring data, broken out largely by transfers between regions and availability zones. Transfers between regions are the most costly, followed by transfers between Availability Zones. Transfers within the same region and same availability zone are free unless using elastic or public IP addresses, in which case there is a cost. You can find more detailed information in the AWS Pricing Docs. With this in mind, there are several simple strategies for helping reduce costs.
First, since costs increase when transferring data between regions, it’s wise to ensure that as many services as possible reside within the same region. The more you can localize services to one specific region, the lower your costs will be.
Second, you should maximize the data you’re routing directly within AWS services and IP addresses. Transfers out to the open internet are the most costly and least performant mechanisms of data transfers, so it’s best to keep transfers within AWS services.
Lastly, data transfers between private IP addresses are cheaper than between elastic or public IP addresses, so utilizing private IP addresses as much as possible is the most cost-effective strategy.
The following query provides a table depicting the total costs for each AWS product, broken out by transfer cost type. Substitute the “lineitem_productcode” field in the query to segment the costs by any other attribute. If you notice any unusually high spikes in cost, you’ll need to dig deeper to understand what’s driving that spike: location, volume, and so on. Drill down into specific costs by including “product_usagetype” and “product_transfertype” in your query to identify the types of transfer costs that are driving up your bill.
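A sketch, again against the placeholder schema above (in the raw CSV, empty strings stand in for NULL):

```python
# Total cost per product, broken out by transfer type.
QUERY = """
SELECT
  lineitem_productcode,
  product_transfertype,
  sum(lineitem_unblendedcost) AS total_cost
FROM aws_optimizer.cost_and_usage
WHERE product_transfertype <> ''
GROUP BY 1, 2
ORDER BY total_cost DESC
"""
```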
When moving between regions or over the open web, many data transfer costs also include the origin and destination location of the data movement. Using a BI tool with mapping capabilities, you can get a nice visual of data flows. The point at the center of the map is used to represent external data flows over the open internet.
Analysis by tags
AWS provides the option to apply custom tags to individual resources, so you can allocate costs over whatever customized segment makes the most sense for your business. For a SaaS company that hosts software for customers on AWS, maybe you’d want to tag the size of each customer. The following query uses custom tags to display the reserved, data transfer, and total cost for each AWS service, broken out by tag categories, over the last 6 months. You’ll want to substitute the cost_and_usage.resourcetags_customersegment and cost_and_usage.customer_segment with the name of your customer field.
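A sketch of that query (the tag column name is exactly the placeholder this paragraph tells you to substitute):

```python
# Reserved, transfer, and total cost per service, split by customer tag.
QUERY = """
SELECT
  resourcetags_customersegment AS customer_segment,
  lineitem_productcode,
  sum(CASE WHEN lineitem_lineitemtype = 'DiscountedUsage'
           THEN lineitem_unblendedcost ELSE 0 END) AS reserved_cost,
  sum(CASE WHEN product_transfertype <> ''
           THEN lineitem_unblendedcost ELSE 0 END) AS transfer_cost,
  sum(lineitem_unblendedcost) AS total_cost
FROM aws_optimizer.cost_and_usage
GROUP BY 1, 2
ORDER BY total_cost DESC
"""
```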
The resulting table in this example looks like the results below. In this example, you can tell that we’re making poor use of Reserved Instances because they represent such a small portion of our overall costs.
Again, using a BI tool to visualize these costs and trends over time makes the analysis much easier to consume and take action on.
Saving costs on your AWS spend is always an iterative, ongoing process. Hopefully with these queries alone, you can start to understand your spending patterns and identify opportunities for savings. However, this is just a peek into the many opportunities available through analysis of the Cost and Usage report. Each company is different, with unique needs and usage patterns. To achieve maximum cost savings, we encourage you to set up an analytics environment that enables your team to explore all potential cuts and slices of your usage data, whenever it’s necessary. Exploring different trends and spikes across regions, services, user types, etc. helps you gain comprehensive understanding of your major cost levers and consistently implement new cost reduction strategies.
Note that all of the queries and analysis provided in this post were generated using the Looker data platform. If you’re already a Looker customer, you can get all of this analysis, additional pre-configured dashboards, and much more using Looker Blocks for AWS.
About the Author
Dillon Morrison leads the Platform Ecosystem at Looker. He enjoys exploring new technologies and architecting the most efficient data solutions for the business needs of his company and their customers. In his spare time, you’ll find Dillon rock climbing in the Bay Area or nose deep in the docs of the latest AWS product release at his favorite cafe (“Arlequin in SF is unbeatable!”).
Have you ever been faced with many different data sources in different formats that need to be analyzed together to drive value and insights? You need to be able to query, analyze, process, and visualize all your data as one canonical dataset, regardless of the data source or original format.
In this post, I walk through using AWS Glue to create a query optimized, canonical dataset on Amazon S3 from three different datasets and formats. Then, I use Amazon Athena and Amazon QuickSight to query against that data quickly and easily.
AWS Glue overview
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easy for customers to prepare and load their data for query and analytics. You can create and run an ETL job with a few clicks in the AWS Management Console. Point AWS Glue to your data stored on AWS, and a crawler discovers your data, classifies it, and stores the associated metadata (such as table definitions) in the AWS Glue Data Catalog. After it’s cataloged, your data is immediately searchable, queryable, and available for ETL. AWS Glue generates the ETL code for data transformation, and loads the transformed data into a target data store for analytics.
The AWS Glue ETL engine generates Python code that is entirely customizable, reusable, and portable. You can edit the code using your favorite IDE or notebook and share it with others using GitHub. After your ETL job is ready, you can schedule it to run on the AWS Glue fully managed, scale-out Spark environment, using its flexible scheduler with dependency resolution, job monitoring, and alerting.
AWS Glue is serverless. It automatically provisions the environment needed to complete the job, and customers pay only for the compute resources consumed while running ETL jobs. With AWS Glue, data can be available for analytics in minutes.
During this post, I step through an example using the New York City Taxi Records dataset. I focus on one month of data, but you could easily do this for the entire eight years of data. At the time of this post, AWS Glue is available in US-East-1 (N. Virginia).
As you crawl the unknown dataset, you discover that the data is in different formats, depending on the type of taxi. You then convert the data to a canonical form, start to analyze it, and build a set of visualizations… all without launching a single server.
Discover the data
To analyze all the taxi rides for January 2016, you start with a set of data in S3. First, create a database for this project within AWS Glue. A database is a set of associated table definitions, organized into a logical group. In Athena, database names are all lowercase, no matter what you type.
Next, add a new crawler and have it infer the schemas, structure, and other properties of the data.
- Under Crawlers, choose Add crawler.
- For Name, type “NYCityTaxiCrawler”.
- Choose an IAM role that AWS Glue can use for the crawler. For more information, see the AWS Glue Developer Guide.
- For Data store, type S3.
- For Crawl data in, choose Specified path in another account.
- For Include path, type “s3://serverless-analytics/glue-blog”.
- For Add Another data store, choose No.
- For Frequency, choose On demand. This allows you to customize when the crawler runs.
- Configure the crawler output database and prefix:
- For Database, select the database created earlier, “nycitytaxianalysis”.
- For Prefix added to tables (optional), type “blog_”.
- Choose Finish and Run it now.
The crawler runs and indicates that it found three tables.
If you look under Tables, you can see the three new tables that were created under the database “nycitytaxianalysis”.
The crawler used the built-in classifiers and identified the tables as CSV, inferred the columns/data types, and collected a set of properties for each table. If you look in each of those table definitions, you see the number of rows for each dataset found and that the columns don’t match between tables. As an example, bringing up the blog_yellow table, you can see the yellow dataset for January 2016 with 8.7 million rows, the location on S3, and the various columns found.
Optimize queries and get data into a canonical form
Create an ETL job to move this data into a query-optimized form. In a future post, I’ll cover how to partition the query-optimized data. You convert the data into a column format, changing the storage type to Parquet, and writing the data to a bucket that you own.
- Create a new ETL job and name it NYCityTaxiYellow.
- Select an IAM role that has permissions to write to a new S3 location in a bucket that you own.
- Specify a location for the script and tmp space on S3.
- Choose the blog_yellow table as a data source.
- Specify a new location (a new prefix location without any existing objects) to store the results.
- In the transform, rename the pickup and dropoff dates to be generic fields. The raw data actually lists these fields differently and you should use a common name. To do this, choose the target column name and retype the new field name.
Also, change the data types to be time stamps rather than strings. The following table outlines the updates.
| Old Name | Target Name | Target Data Type |
| --- | --- | --- |
| tpep_pickup_datetime | pickup_date | timestamp |
| tpep_dropoff_datetime | dropoff_date | timestamp |
- Choose Next and Finish.
In this case, AWS Glue creates a script. You could also start with a blank script or provide your own.
Because AWS Glue uses Apache Spark behind the scenes, you can easily switch from an AWS Glue DynamicFrame to a Spark DataFrame and do advanced operations within Apache Spark. Just as easily, you can switch back and continue to use the transforms and tables from the catalog.
To show this, convert to a DataFrame and add a new field column indicating the type of taxi. Make the following custom modifications in PySpark.
Add the following header:
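Judging by how the snippet is used below (lit() for the new column, DynamicFrame.fromDF() for the conversion), the header is presumably these two imports:

```python
from pyspark.sql.functions import lit
from awsglue.dynamicframe import DynamicFrame
```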
Find the last call before the line that starts with the datasink. This is the dynamic frame that is being used to write out the data.
Let’s now convert that to a DataFrame. Please replace the <DYNAMIC_FRAME_NAME> with the name generated in the script.
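A sketch of the conversion (glueContext is the context object the generated script already defines):

```python
# Tag every row with the taxi type, then convert back to a DynamicFrame
# so the generated datasink can write it out. Replace <DYNAMIC_FRAME_NAME>
# with the frame name Glue generated.
customDF = <DYNAMIC_FRAME_NAME>.toDF().withColumn("type", lit("yellow"))
customDynamicFrame = DynamicFrame.fromDF(customDF, glueContext, "customDynamicFrame")
```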
In the last datasink call, change the dynamic frame to point to the new custom dynamic frame created from the Spark DataFrame:
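In the generated script this is a one-argument change; the connection options below are placeholders for whatever Glue generated for your job:

```python
# The original call referenced <DYNAMIC_FRAME_NAME>; point it at the
# custom frame instead.
datasink = glueContext.write_dynamic_frame.from_options(
    frame = customDynamicFrame,
    connection_type = "s3",
    connection_options = {"path": "s3://<YOUR_BUCKET>/canonical/"},
    format = "parquet")
```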
Save and run the ETL.
To use AWS Glue with Athena, you must upgrade your Athena data catalog to the AWS Glue Data Catalog. Without the upgrade, tables and partitions created by AWS Glue cannot be queried with Athena. As of August 14, 2017, the AWS Glue Data Catalog is only available in US-East-1 (N. Virginia). Start the upgrade in the Athena console. For more information, see Integration with AWS Glue in the Amazon Athena User Guide.
Next, we’ll go into Athena and create a table there for the new data format. This shows how you can interact with the data catalog either through AWS Glue or Athena.
- Go into the Athena Web Console
- Select the nycitytaxianalysis database
- Run the following create statement:
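The statement boils down to an external Parquet table over the canonical location. A trimmed sketch, runnable through the console or the API (the column list is abbreviated and the S3 path is a placeholder):

```python
# Hypothetical, abbreviated DDL for the canonical table.
import boto3

DDL = """
CREATE EXTERNAL TABLE IF NOT EXISTS nycitytaxianalysis.canonicaldata (
  pickup_date timestamp,
  dropoff_date timestamp,
  type string,
  trip_distance double,
  total_amount double
)
STORED AS PARQUET
LOCATION 's3://<YOUR_BUCKET>/canonical/'
"""

boto3.client("athena").start_query_execution(
    QueryString=DDL,
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/"},
)
```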
If you do a query on the count per type of taxi, you notice that there is only “yellow” taxi data in the canonical location. Convert “green” and “fhv” data:
In the AWS Glue console, create a new ETL job named “NYCityTaxiGreen”. Choose blog_green as the source and the new canonical location as the target. Choose the new pickup_date and dropoff_date fields as the target field types for the date variables.
In the script, insert the following two lines at the end of the top imports:
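Presumably the same two imports as in the yellow job:

```python
from pyspark.sql.functions import lit
from awsglue.dynamicframe import DynamicFrame
```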
As in the yellow job, find the last call before the line that starts with the datasink (this is the dynamic frame being used to write out the data), convert that frame to a DataFrame, and point the last datasink call at the new custom dynamic frame created from the Spark DataFrame:
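A hedged sketch, mirroring the yellow job:

```python
# Same pattern, with the type column set to "green"; replace
# <DYNAMIC_FRAME_NAME> with the name Glue generated, then pass
# frame = customDynamicFrame in the final datasink call.
customDF = <DYNAMIC_FRAME_NAME>.toDF().withColumn("type", lit("green"))
customDynamicFrame = DynamicFrame.fromDF(customDF, glueContext, "customDynamicFrame")
```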
Notice that it matches the last line in the commands that you pasted earlier. After the ETL job finishes, you can query the count per type through Athena and see the sizes:
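For instance (table name as sketched above):

```python
QUERY = """
SELECT type, count(*) AS ride_count
FROM nycitytaxianalysis.canonicaldata
GROUP BY type
"""
# Run via the Athena console, or with start_query_execution as shown earlier.
```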
What you have done is taken two datasets that contained taxi data in different formats and converted them into a single, query-optimized format. That format can easily be consumed by various analytical tools, including Athena, EMR, and Redshift Spectrum. You’ve done all of this without launching a single server.
You have one more dataset to convert, so create a new ETL job for that. Name the job “NYCityTaxiFHV”. This dataset is limited as it only has three fields, one of which you are converting. Choose “–” (dash) for the dispatcher and location, and map the pickup date to the canonical field.
In the script, insert the same two lines at the end of the top imports as in the previous two jobs.
As before, type the conversion line before the last sink call and point the last datasink call at the new custom dynamic frame created from the Spark DataFrame:
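A hedged sketch for the FHV job:

```python
# Only the pickup date maps to the canonical schema for FHV records.
customDF = <DYNAMIC_FRAME_NAME>.toDF().withColumn("type", lit("fhv"))
customDynamicFrame = DynamicFrame.fromDF(customDF, glueContext, "customDynamicFrame")
```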
After the third ETL job completes, you can query the single dataset, now in a query-optimized form on S3.
Query across the data with Athena
Here’s a query:
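One plausible form, looking at the ranges of trip distance and total amount per taxi type (column names follow the sketch above):

```python
QUERY = """
SELECT type,
       min(trip_distance) AS min_distance,
       max(trip_distance) AS max_distance,
       min(total_amount) AS min_total,
       max(total_amount) AS max_total
FROM nycitytaxianalysis.canonicaldata
GROUP BY type
"""
```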
Within a second or two, you can quickly see the ranges for these values.
You can see some outliers where the total amount and distance equal 0. In the next query, don’t include those when calculating things like average cost per mile:
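A sketch with the zero rows filtered out:

```python
QUERY = """
SELECT type,
       avg(total_amount / trip_distance) AS avg_cost_per_mile
FROM nycitytaxianalysis.canonicaldata
WHERE total_amount > 0
  AND trip_distance > 0
GROUP BY type
"""
```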
Interesting note: you aren’t seeing FHV broken out yet, because that dataset didn’t have a mapping to these fields. Here’s a query that’s a little more complicated, to calculate a 99th percentile:
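A sketch using Presto’s approx_percentile, which Athena supports:

```python
QUERY = """
SELECT type,
       approx_percentile(total_amount, 0.99) AS p99_total_amount,
       approx_percentile(trip_distance, 0.99) AS p99_trip_distance
FROM nycitytaxianalysis.canonicaldata
WHERE total_amount > 0
  AND trip_distance > 0
GROUP BY type
"""
```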
Visualizing the data in QuickSight
Amazon QuickSight is a data visualization service you can use to analyze the data that you just combined. For more detailed instructions, see the Amazon QuickSight User Guide.
- Create a new Athena data source, if you haven’t created one yet.
- Create a new dataset for NY taxi data, and point it to the canonical taxi data.
- Choose Edit/Preview data.
- Add a new calculated field named HourOfDay.
- Choose Save and Visualize.
You can now visualize and query, through Athena, the canonical dataset on S3 that has been converted by AWS Glue. Selecting HourOfDay allows you to see how many taxi rides there are based on the hour of the day. This is a calculated field derived from the time stamp in the raw dataset.
You can also deselect that and choose type to see the total counts again by type.
Now choose the pickup_date field in addition to the type field and notice that the plot type automatically changed. Because time is being used, it’s showing a time chart defaulting to the year. Choose a day interval instead.
Expand the x axis to zoom out. You can clearly see a large drop in all rides on January 23, 2016.
In this post, you went from data investigation to analyzing and visualizing a canonical dataset, without starting a single server. You started by crawling a dataset you didn’t know anything about, and the crawler told you its structure, columns, and record counts.
From there, you saw that all three datasets were in different formats, but represented the same thing: NY City Taxi rides. You then converted them into a canonical (or normalized) form that is easily queried through Athena and QuickSight, in addition to a wide number of different tools not covered in this post.
If you have questions or suggestions, please comment below.
About the Author
Ben Snively is a Public Sector Specialist Solutions Architect. He works with government, non-profit and education customers on big data and analytical projects, helping them build solutions using AWS. In his spare time he adds IoT sensors throughout his house and runs analytics on it.
Post Syndicated from Yev original https://www.backblaze.com/blog/wanted-front-end-developer/
Want to work at a company that helps customers in over 150 countries around the world protect the memories they hold dear? Do you want to challenge yourself with a business that serves consumers, SMBs, Enterprise, and developers? If all that sounds interesting, you might be interested to know that Backblaze is looking for a Front End Developer!
Backblaze is a 10 year old company. Providing great customer experiences is the “secret sauce” that enables us to successfully compete against some of technology’s giants. We’ll finish the year at ~$20MM ARR and are a profitable business. This is an opportunity to have your work shine at scale in one of the fastest growing verticals in tech – Cloud Storage.
You will utilize HTML, ReactJS, CSS and jQuery to develop intuitive, elegant user experiences. As a member of our Front End Dev team, you will work closely with our web development, software design, and marketing teams.
On a day-to-day basis, you must be able to convert image mockups to HTML or ReactJS – there’s some production work that needs to get done. But you will also be responsible for helping build out new features, rethink old processes, and enable third-party systems to empower our marketing, sales, and support teams.
Our Front End Developer must be proficient in:
- HTML, ReactJS
- UTF-8, Java Properties, and Localized HTML (Backblaze runs in 11 languages!)
- jQuery, Bootstrap
- JSON, XML
- Understanding of cross-browser compatibility issues and ways to work around them
- Basic SEO principles and ensuring that applications will adhere to them
- Learning about third party marketing and sales tools through reading documentation. Our systems include Google Tag Manager, Google Analytics, Salesforce, and Hubspot
Struts, Java, JSP, Servlet and Apache Tomcat are a plus, but not required.
We’re looking for someone that is:
- Passionate about building friendly, easy to use Interfaces and APIs.
- Likes to work closely with other engineers, support, and marketing to help customers.
- Is comfortable working independently on a mutually agreed upon prioritization queue (we don’t micromanage, we do make sure tasks are reasonably defined and scoped).
- Diligent with quality control. Backblaze prides itself on giving our team autonomy to get work done, do the right thing for our customers, and keep a pace that is sustainable over the long run. As such, we expect everyone to check in code that is stable. We also have a small QA team that operates as a secondary check when needed.
Backblaze Employees Have:
- Good attitude and willingness to do whatever it takes to get the job done
- Strong desire to work for a small, fast-paced company
- Desire to learn and adapt to rapidly changing technologies and work environment
- Comfort with well behaved pets in the office
This position is located in San Mateo, California. Regular attendance in the office is expected. Backblaze is an Equal Opportunity Employer and we offer competitive salary and benefits, including our no-policy vacation policy.
If this sounds like you
Send an email to [email protected] with:
- Front End Dev in the subject line
- Your resume attached
- An overview of your relevant experience
Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-summit-new-york-summary-of-announcements/
Whew – what a week! Tara, Randall, Ana, and I have been working around the clock to create blog posts for the announcements that we made at the AWS Summit in New York. Here’s a summary to help you to get started:
Amazon Macie – This new service helps you to discover, classify, and secure content at scale. Powered by machine learning and making use of Natural Language Processing (NLP), Macie looks for patterns and alerts you to suspicious behavior, and can help you with governance, compliance, and auditing. You can read Tara’s post to see how to put Macie to work; you select the buckets of interest, customize the classification settings, and review the results in the Macie Dashboard.
AWS Glue – Randall’s post (with deluxe animated GIFs) introduces you to this new extract, transform, and load (ETL) service. Glue is serverless and fully managed. As you can see from the post, Glue crawls your data, infers schemas, and generates ETL scripts in Python. You define jobs that move data from place to place, with a wide selection of transforms, each expressed as code and stored in human-readable form. Glue uses Development Endpoints and notebooks to provide you with a testing environment for the scripts you build. We also announced that Amazon Athena now integrates with AWS Glue, as do Apache Spark and Hive on Amazon EMR.
AWS Migration Hub – This new service will help you to migrate your application portfolio to AWS. My post outlines the major steps and shows you how the Migration Hub accelerates, tracks, and simplifies your migration effort. You can begin with a discovery step, or you can jump right in and migrate directly. Migration Hub integrates with tools from our migration partners and builds upon the Server Migration Service and the Database Migration Service.
CloudHSM Update – We made a major upgrade to AWS CloudHSM, making the benefits of hardware-based key management available to a wider audience. The service is offered on a pay-as-you-go basis, and is fully managed. It is open and standards compliant, with support for multiple APIs, programming languages, and cryptography extensions. CloudHSM is an integral part of AWS and can be accessed from the AWS Management Console, AWS Command Line Interface (CLI), and through API calls. Read my post to learn more and to see how to set up a CloudHSM cluster.
Managed Rules to Secure S3 Buckets – We added two new rules to AWS Config that will help you to secure your S3 buckets. The s3-bucket-public-write-prohibited rule identifies buckets that have public write access and the s3-bucket-public-read-prohibited rule identifies buckets that have global read access. As I noted in my post, you can run these rules in response to configuration changes or on a schedule. The rules make use of some leading-edge constraint solving techniques, as part of a larger effort to use automated formal reasoning about AWS.
CloudTrail for All Customers – Tara’s post revealed that AWS CloudTrail is now available and enabled by default for all AWS customers. As a bonus, Tara reviewed the principal benefits of CloudTrail and showed you how to review your event history and to deep-dive on a single event. She also showed you how to create a second trail, for use with CloudWatch Events.
Encryption of Data at Rest for EFS – When you create a new file system, you now have the option to select a key that will be used to encrypt the contents of the files on the file system. The encryption is done using an industry-standard AES-256 algorithm. My post shows you how to select a key and to verify that it is being used.
Watch the Keynote
My colleagues Adrian Cockcroft and Matt Wood talked about these services and others on the stage, and also invited some AWS customers to share their stories. Here’s the video:
Post Syndicated from Jeremy Winters original https://aws.amazon.com/blogs/big-data/upsert-into-amazon-redshift-using-aws-glue-and-sneaql/
This is a guest post by Jeremy Winters and Ritu Mishra, Solution Architects at Full 360. In their own words, “Full 360 is a cloud first, cloud native integrator, and true believers in the cloud. Since inception in 2007, our focus has been on helping customers with their journey into the cloud. Our practice areas – Big Data and Warehousing, Application Modernization, and Cloud Ops/Strategy – represent deep, but focused expertise.”
AWS Glue is a fully managed ETL (extract, transform, and load) service that makes it simple and cost-effective to categorize your data, clean it, enrich it, and move it reliably between various data stores. As a company that has been building data warehouse solutions in the cloud for 10 years, we at Full 360 were interested to see how we could leverage AWS Glue in customer solutions. This post details our experience and lessons learned from using AWS Glue for an Amazon Redshift data integration use case.
UI-based ETL Tools
We have been anticipating the release of AWS Glue since it was announced at re:Invent 2016. Many of our customers are looking for easy-to-use, UI-based tooling to manage their data transformation pipeline. However, in our experience, the complexity of any production pipeline tends to be difficult to unwind, regardless of the technology used to create it. At Full 360, we build cloud-native, script-based applications deployed in containers to handle data integration. We think script-based transformation provides the balance of robustness and flexibility necessary to handle any data problem that comes our way.
AWS Glue caters both to developers who prefer writing scripts and to those who want UI-based interactions. It is possible to initially create jobs using the UI, by selecting a data source and target. Under the hood, AWS Glue auto-generates the Python code for you, which can be edited if needed, though this isn’t necessary for the majority of use cases.
Of course, you don’t have to rely on the UI at all. You can simply write your own Python scripts, store them in Amazon S3 with any dependent libraries, and import them into AWS Glue. AWS Glue also supports IDE integration with tools such as PyCharm, and interestingly enough, Zeppelin notebooks! These integrations are targeted toward developers who prefer writing Python themselves and want a cleaner dev/test cycle.
Developers who are already in the business of scripting ETL will be excited by the ability to easily deploy Python scripts with AWS Glue, using existing workflows for source control and CI/CD, and have them deployed and executed in a fully managed manner by AWS. The UI experience of AWS Glue works well, but it is good to know that the tool accommodates those who prefer traditional coding. You can also dig into complex edge cases for data handling where the UI doesn’t cut it.
When AWS Lambda was released, we were excited about its potential to host our ETL processes. With Lambda, however, you are limited to a five-minute maximum for function execution, so we resorted to running Docker containers on Amazon ECS to orchestrate many of our customers’ long-running tasks. With this approach, we are still required to manage the underlying infrastructure.
After a closer look at AWS Glue, we realize that it is a full serverless PySpark runtime, accompanied by an Apache Hive metastore compatible catalog-as-a-service. This means that you are not just running a script on a single core, but instead you have the full power of a multi-worker Spark environment available. If you’re not familiar with the Spark framework, it introduces a new paradigm that allows for the processing of distributed, in-memory datasets. Spark has many uses, from data transformation to machine learning.
In AWS Glue, you use PySpark dataframes to transform data before reaching your database. Dataframes manage data in a way similar to relational databases, so the methods are likely to be familiar to most SQL users. Additionally, you can use SQL in your PySpark code to manipulate the data.
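For example (a generic sketch, not code from an actual Glue job), registering a dataframe as a temporary view lets you query it with plain SQL:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sql-on-dataframes").getOrCreate()

# A tiny in-memory dataframe standing in for real pipeline data.
df = spark.createDataFrame(
    [("2017-09-01", 12.50), ("2017-09-02", 7.25)], ["day", "cost"])

df.createOrReplaceTempView("daily_costs")
spark.sql("SELECT day, cost FROM daily_costs WHERE cost > 10").show()
```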
AWS Glue also simplifies the management of the runtime environment by providing you with a DPU setting, which allows you to dial the amount of compute resources used to run your job up or down. One DPU is equivalent to 4 vCPUs and 16 GB of memory.
Common use case
We can see that most customers would leverage AWS Glue to load one or many files from S3 into Amazon Redshift. To accelerate this process, you can use the crawler, an AWS console-based utility, to discover the schema of your data and store it in the AWS Glue Data Catalog, whether your data sits in a file or a database. We were able to discover the schemas of our source file and target table, then have AWS Glue construct and execute the ETL job. It worked! We successfully loaded the beer drinkers’ dataset, JSON files used in our advanced tuning class, into Amazon Redshift!
Our use case
For our use case, we integrated AWS Glue and Amazon Redshift Spectrum with an open-source project that we initiated called SneaQL. SneaQL is an open source, containerized framework for sneaking interactivity into static SQL. SneaQL provides an extension to ANSI SQL with command tags to provide functionality such as loops, variables, conditional execution, and error handling to your scripts. It allows you to manage your scripts in an AWS CodeCommit Git repo, which then gets deployed and executed in a container.
We use SneaQL for complex data integrations, usually with an ELT model, where the data is loaded into the database, then transformed into fact tables using parameterized SQL. SneaQL enables advanced use cases like partial upsert aggregation of data, where multiple data sources can merge into the same fact table.
We think AWS Glue, Redshift Spectrum, and SneaQL offer a compelling way to build a data lake in S3, with all of your metadata accessible through a variety of tools such as Hive, Presto, Spark, and Redshift Spectrum. Build your aggregation table in Amazon Redshift to drive your dashboards or other high-performance analytics.
In the video below, you see a demonstration of using AWS Glue to convert JSON documents into Parquet for partitioned storage in an S3 data lake. We then access the data from S3 into Amazon Redshift by way of Redshift Spectrum. Nearing the end of the AWS Glue job, we then call AWS boto3 to trigger an Amazon ECS SneaQL task to perform an upsert of the data into our fact table. All the sample artifacts needed for this demonstration are available in the Full360/Sneaql Github repository.
In this scenario, AWS Glue manipulates data outside of a data warehouse and loads it to Amazon Redshift, and SneaQL manipulates data inside Amazon Redshift.
We think they are a good complement to each other when doing complex integrations.
Our pipeline works as follows:
- New files arrive in S3 in JSON format.
- AWS Glue converts the JSON files to Parquet format, stored in another S3 bucket.
- At the end of the AWS Glue script, the AWS SDK for Python (boto3) is used to trigger the Amazon ECS task that runs SneaQL (see the sketch after this list).
- The SneaQL container pulls its secrets file from S3, then decrypts it with biscuit or AWS KMS.
- SneaQL clones the appropriate branch from an AWS CodeCommit Git repo.
- Data in Amazon Redshift is transformed by SneaQL statements stored in the repo.
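A minimal sketch of the trigger call from step 3, using boto3 with hypothetical cluster, task definition, and container names:

```python
import boto3

ecs = boto3.client("ecs")

# Kick off the SneaQL container as a one-shot ECS task.
ecs.run_task(
    cluster="sneaql-cluster",            # hypothetical ECS cluster
    taskDefinition="sneaql-upsert:1",    # hypothetical task definition
    count=1,
    overrides={
        "containerOverrides": [{
            "name": "sneaql",            # hypothetical container name
            "environment": [
                {"name": "SNEAQL_BRANCH", "value": "production"},
            ],
        }]
    },
)
```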
The AWS Glue team also supports conditional triggers that would allow us to split the jobs and make one dependent upon the other.
Although, as a general rule, we believe in managing schema definitions in source control, the crawler feature has some attractive perks. Besides being handy for ad hoc jobs, it is also partition-aware: you can perform data movements without worrying about which data has already been processed, which is one less thing for you to manage. We’re interested to see the design patterns that emerge around the use of the crawler.
The freedom of PySpark also lets us implement advanced use cases. The boto3 library is available to all AWS Glue jobs, and you can also include additional Python libraries and JAR files along with your script.
We think many customers will find AWS Glue valuable. Those who prefer to code their data processing pipeline will be quick to realize how powerful it is, especially for solving complex data cleansing problems. The learning curve for PySpark is real, so we suggest looking into dataframes as a good entry point. In the long run, the fact that AWS Glue is serverless makes the effort more than worth it.
If you have questions or suggestions, please comment below.
Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/launch-aws-glue-now-generally-available/
Today we’re excited to announce the general availability of AWS Glue. Glue is a fully managed, serverless, and cloud-optimized extract, transform and load (ETL) service. Glue is different from other ETL services and platforms in a few very important ways.
First, Glue is “serverless”: you don’t need to provision or manage any resources, and you pay only for resources that Glue consumes while actively running. Second, Glue provides crawlers that can automatically detect and infer schemas from many data sources and data types, across various types of partitions. It stores these generated schemas in a centralized Data Catalog for editing, versioning, querying, and analysis. Third, Glue can automatically generate ETL scripts (in Python!) to translate your data from source formats to target formats. Finally, Glue lets you create development endpoints so that your developers can use their favorite toolchains to construct their ETL scripts. Ok, let’s dive deep with an example.
In my job as a Developer Evangelist I spend a lot of time traveling, and I thought it would be cool to play with some flight data. The Bureau of Transportation Statistics is kind enough to share all of this data for anyone to use here. We can easily download this data and put it in an Amazon Simple Storage Service (S3) bucket. This data will be the basis of our work today.
First, we need to create a crawler for our flights data in S3. We’ll select Crawlers in the Glue console and follow the on-screen prompts from there. I’ll specify s3://crawler-public-us-east-1/flight/2016/csv/ as my first data source (we can add more later if needed). Next, we’ll create a database called flights and give our tables a prefix of flights as well. The equivalent API calls are sketched below.
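For reference, the same crawler can be created through the API. A sketch using boto3, assuming a suitably permissioned IAM role (the role name here is hypothetical):

```python
import boto3

glue = boto3.client("glue")

# Mirror of the console steps: source path, target database, table prefix.
glue.create_crawler(
    Name="flights-crawler",
    Role="MyGlueServiceRole",  # hypothetical IAM role
    DatabaseName="flights",
    TablePrefix="flights",
    Targets={"S3Targets": [
        {"Path": "s3://crawler-public-us-east-1/flight/2016/csv/"},
    ]},
)
glue.start_crawler(Name="flights-crawler")
```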
The crawler will go over our dataset, detect partitions through the various folders (in this case, months of the year), detect the schema, and build a table. We could add additional data sources and jobs to our crawler, or create separate crawlers that push data into the same database, but for now let’s look at the autogenerated schema.
I’m going to make a quick schema change to year, moving it from BIGINT to INT. Then I can compare the two versions of the schema if needed.
Now that we know how to correctly parse this data let’s go ahead and do some transforms.
Now we’ll navigate to the Jobs subconsole and click Add Job. We’ll follow the prompts from there, giving our job a name, selecting a data source, and choosing an S3 location for temporary files. Next, we add our target by specifying “Create tables in your data target,” and we specify an S3 location in Parquet format as our target.
After clicking Next, we’re at a screen showing the various mappings proposed by Glue. Now we can make manual column adjustments as needed; in this case, we just use the X button to remove a few columns that we don’t need.
This brings us to my favorite part: the thing I absolutely love about Glue.
Glue generated a PySpark script to transform our data based on the information we’ve given it so far. On the left-hand side, we can see a diagram documenting the flow of the ETL job. On the top right, we see a series of buttons that we can use to add annotated data sources and targets, transforms, spigots, and other features. This is the interface I get if I click on transform.
If we add any of these transforms or additional data sources, Glue updates the diagram on the left, giving us a useful visualization of the flow of our data. We can also just write our own code into the console and have it run. We can add triggers to this job that fire on completion of another job, on a schedule, or on demand. That way, if we add more flight data, we can reload this same data back into S3 in the format we need.
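To give a feel for the output, here is a trimmed sketch of the kind of PySpark script Glue generates. The table name and column mappings are illustrative, not the exact generated output:

```python
import sys
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from awsglue.job import Job

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glueContext = GlueContext(SparkContext.getOrCreate())
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Read the crawled table from the Data Catalog (table name is hypothetical)
flights = glueContext.create_dynamic_frame.from_catalog(
    database="flights", table_name="flightscsv")

# Keep only the columns mapped in the console (mappings are illustrative)
mapped = ApplyMapping.apply(
    frame=flights,
    mappings=[("year", "int", "year", "int"),
              ("origin", "string", "origin", "string"),
              ("dest", "string", "dest", "string")])

# Write the result to S3 as Parquet
glueContext.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-bucket/flights-parquet/"},
    format="parquet")

job.commit()
```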
I could spend all day writing about the power and versatility of the jobs console but Glue still has more features I want to cover. So, while I might love the script editing console, I know many people prefer their own development environments, tools, and IDEs. Let’s figure out how we can use those with Glue.
Development Endpoints and Notebooks
A development endpoint is an environment used to develop and test our Glue scripts. If we navigate to “Dev endpoints” in the Glue console, we can click “Add endpoint” in the top right to get started. Next, we select a VPC and a security group that references itself, and then we wait for the endpoint to provision.
Once it’s provisioned, we can create an Apache Zeppelin notebook server by going to Actions and clicking Create notebook server. We give our instance an IAM role and make sure it has permissions to talk to our data sources. Then we can either SSH into the server or connect to the notebook to develop our script interactively.
Pricing and Documentation
You can see detailed pricing information here. Glue crawlers, ETL jobs, and development endpoints are all billed in Data Processing Unit-Hours (DPU-Hours), metered by the minute. Each DPU-Hour costs $0.44 in us-east-1. A single DPU provides 4 vCPU and 16 GB of memory.
We’ve covered only about half of the features that Glue has, so I want to encourage everyone who made it this far into the post to go read the documentation and service FAQs. Glue also has a rich and powerful API that lets you do anything the console can do, and more.
I hope you find that using Glue reduces the time it takes to start doing things with your data. Look for another post from me on AWS Glue soon because I can’t stop playing with this new service.
Post Syndicated from Aaron Friedman original https://aws.amazon.com/blogs/big-data/building-a-real-world-evidence-platform-on-aws/
Deriving insights from large datasets is central to nearly every industry, and life sciences is no exception. To combat the rising cost of bringing drugs to market, pharmaceutical companies are looking for ways to optimize their drug development processes. They are turning to big data analytics to better quantify the effect that their drug compounds have on different populations and to look for new clinical indications for existing drugs.
A real world evidence (RWE) platform is loosely defined as an integrated set of services and products that life sciences companies use to securely acquire, store, and analyze large, often disparate datasets to gain insight into the functions of a specific drug or intervention. The petabyte-scale data generated from wearables, medical devices, genomics, clinical imaging, and claims (to name a few) allows pharmaceutical and other life sciences companies to build big data platforms to analyze these datasets. And many of them are doing it on AWS.
At the center of nearly every RWE platform is a data lake that houses different data types. It also stores the related metadata to identify where each piece of data came from, who owned it, etc. Analytics engines integrate the relevant streaming (e.g., wearables), structured (e.g., claims data), and unstructured (e.g., notes in electronic health records) data.
In this post, I highlight common architectural patterns that customers are using to maximize the value of real world evidence on AWS. The architecture presented here can be reproduced in multiple regions, so you can respect local data sovereignty requirements, when applicable, while conducting global studies. This post doesn’t cover all considerations for real world evidence (such as security and authentication), but instead focuses on the areas that are related to your data flow.
A data lake is your source of truth
The ability to integrate disparate data types is critical to maximizing the utility of RWE. Life sciences companies have to be able to store, search, and retrieve data of different types and sizes, including (but not limited to) the following:
- Streaming data from wearables and medical devices
- Structured data from genomics and claims data
- Unstructured data from notes in electronic health records
A common solution to integrate these data types is a data lake. Data lakes allow organizations to store all their data, regardless of data type, in a centralized repository. Because data can be stored as-is, there’s no need to convert it to a predefined schema. And you no longer need to know what questions you want to ask of your data beforehand. You can use data lakes for ad hoc analyses, so you can quickly explore and discover new insights without needing to structure the data first, as you would with a traditional data warehouse.
Although there are many uses for storing real world evidence data in a data lake, here are a few examples of the processes it can facilitate for pharma and biotech companies:
- Quickly access all data for a given subject, from clinical images to genomics to claims data.
- Associate incoming RWE data with existing data in your RWE data lake, such as by subject or study.
- Select specific windows in longitudinal studies (microbiome, metabolomics, etc.).
The following diagram shows an architecture of the features within a data lake and several examples of how data enters a data lake:
This architecture, which is entirely serverless and backed by Amazon S3, lets you scale your data lake to easily accommodate any data size for your RWE platform. Additionally, the components presented in this diagram can be secured via IAM controls and service policies. This enables you to secure and protect the sensitive information that often resides within an RWE platform. Amazon S3 is the canonical source for objects in your RWE data lake. You track and search metadata that is associated with these objects in a data catalog built on Amazon Elasticsearch Service and Amazon DynamoDB.
Let’s dive deeper into what this architecture does and how data makes it into the data lake:
- Streaming (e.g., wearables), structured, and unstructured data is acquired from myriad devices and sources. Depending on the data size, you might use AWS IoT (streaming), AWS Storage Gateway (mid-size/continual batch), or AWS Snowball (large legacy datasets, such as imaging). AWS IoT writes to Amazon Kinesis Firehose, which transforms the telemetry data in flight and lands both transformed and raw data in Amazon S3.
- When data lands in Amazon S3 buckets, an AWS Lambda function is invoked (either by trigger or manually).
- This Lambda function writes to a data catalog that is fronted by Amazon API Gateway. The data catalog contains metadata about all the object data in Amazon S3, as well as data that resides in databases, such as Amazon Redshift.
- An AWS Lambda function on the other side of API Gateway writes the appropriate metadata about the objects, such as the study that the data was generated from, into Amazon Elasticsearch Service and/or Amazon DynamoDB, which I refer to as the data catalog. A sketch of such a function follows this list.
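As a minimal sketch of the catalog-writing function in step 4, here is a Lambda handler that records object metadata in a DynamoDB table, triggered directly by S3 events. The table name and key attributes are hypothetical; in the full architecture the write would pass through API Gateway instead:

```python
import urllib.parse
import boto3

dynamodb = boto3.resource("dynamodb")
catalog = dynamodb.Table("rwe-data-catalog")  # hypothetical table name

def handler(event, context):
    """Record basic metadata for each object that lands in S3."""
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        catalog.put_item(Item={
            "object_key": key,
            "bucket": bucket,
            "size_bytes": record["s3"]["object"]["size"],
            "event_time": record["eventTime"],
            # Study/subject attributes would be parsed from the key or tags.
        })
    return {"status": "ok"}
```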
The data catalog mentioned in steps 3 and 4 is central to your data management. In most analyses, the first step is to build a manifest of where the data of interest lies, which is discussed in later sections.
Normalizing data for real world evidence
As I previously mentioned, data can enter your real world evidence data lake in many different formats. In many cases, you might need to normalize this data into a specific format to ease downstream analysis. For example, you might have to integrate many different electronic health records that are stored in different formats. You also might have to transform genomics data, often represented as variant call format (VCF) files, into a columnar format such as Apache Parquet that big data engines can query more efficiently. While you can certainly analyze data in the format in which it arrived, it is often better to transform it into a different format, or move it into a different data store, before querying.
In the architecture shown in the following diagram, AWS Step Functions orchestrates the data normalization (extract, transform, load—or ETL) process. Step Functions is a serverless workflow service that ensures that your long-running ETL jobs execute in order and complete successfully. At a high level, you first query the data catalog to get a manifest of the data to be normalized. Normalization occurs on Amazon EC2 instances, and the results are then stored back in Amazon S3 or in databases such as Amazon Redshift for future analysis. Locations of these results data are also logged in the data catalog for future querying.
Here is an example Step Functions state machine for this process:
The following is the accompanying JSON that produces it:
And here is the overall architecture:
Here are the steps in more detail:
- Either a user or an application submits a POST request through Amazon API Gateway to process or transform a set of data. This request contains the query parameters that correspond to generating the data manifest. It is passed into an AWS Step Functions state machine, which orchestrates the ETL process.
- A Lambda function queries the data catalog and builds a manifest that contains the location of the data of interest.
- The manifest is passed to downstream Lambda functions in the state machine that orchestrate the batch workflow to process the data that’s specified in the manifest.
- These Lambda functions submit jobs to AWS Batch to execute batch jobs on Amazon EC2 (a sketch of such a submission follows this list).
- Amazon EC2 processes the data (e.g., data normalization) and stores results back in Amazon S3 and/or Amazon Redshift.
- Results data is then logged in the data catalog, using the process shown in the data acquisition diagram.
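A sketch of the job-submission Lambda from step 4. The manifest shape, queue name, and job definition are all hypothetical:

```python
import boto3

batch = boto3.client("batch")

def handler(event, context):
    """Submit one normalization job per manifest entry."""
    job_ids = []
    for item in event["manifest"]:  # hypothetical manifest structure
        resp = batch.submit_job(
            jobName="normalize-" + item["subject_id"],
            jobQueue="rwe-etl-queue",          # hypothetical queue
            jobDefinition="vcf-to-parquet:3",  # hypothetical job definition
            containerOverrides={
                "environment": [
                    {"name": "INPUT_S3_URI", "value": item["s3_uri"]},
                    {"name": "OUTPUT_PREFIX",
                     "value": "s3://rwe-results/parquet/"},
                ]
            },
        )
        job_ids.append(resp["jobId"])
    return {"jobIds": job_ids}
```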
Analyzing and visualizing data in real world evidence
Ultimately, the value of a real world evidence platform lies in the insights you derive from the wide variety of data that feeds into it. Big data analytics, such as Spark on EMR, machine learning with the Deep Learning AMIs, and healthcare data warehouses, are all fundamental to maximizing the value of RWE.
Given the wide array of questions you can answer, you want your architecture to be as flexible as possible. Your organization also might use business intelligence (BI) tools. Again, as with the data normalization tier, you can use Step Functions to orchestrate your workflow. You first build the data manifest and then submit each portion of your desired analysis to the relevant data location and compute options, such as Amazon EMR or Amazon Redshift. Your organization’s BI tool of choice, such as Amazon QuickSight or one offered by our AWS Big Data Competency partners, can extract and visualize the results.
Here is the workflow in more detail:
- A user initiates a query on a dataset. In the context of RWE, this is largely analyzing a cohort in the context of a specific indication (drug response, etc.).
- AWS Step Functions invokes a Lambda function to query the data catalog and build a manifest of data.
- The manifest is passed to subsequent Lambda functions, which orchestrate the data analysis through different AWS services.
- These analyses can include Amazon EMR for population-scale genomics, Amazon EC2 for HPC and machine learning, and Amazon Redshift for your healthcare data warehouse.
- Results from each analysis are staged back in Amazon S3 and logged in the data catalog.
- BI tools like Amazon QuickSight or Tableau, or data analysis notebooks like Jupyter, can query the results in Amazon S3, for example through Amazon Athena (see the sketch after this list).
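For instance, the Athena query in step 6 might be kicked off as follows. The database, table, and output location are hypothetical:

```python
import boto3

athena = boto3.client("athena")

# Run an ad hoc query over results staged in S3; the BI tool or notebook
# then reads the result set from the output location.
resp = athena.start_query_execution(
    QueryString="""
        SELECT variant_id, COUNT(*) AS responders
        FROM rwe_catalog.genomic_variants
        WHERE response = 'positive'
        GROUP BY variant_id
        ORDER BY responders DESC
        LIMIT 100
    """,
    QueryExecutionContext={"Database": "rwe_catalog"},
    ResultConfiguration={"OutputLocation": "s3://rwe-results/athena/"},
)
print(resp["QueryExecutionId"])
```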
A practical example
Imagine that you are at a pharmaceutical company that is developing a drug to address a neurological condition. You have longitudinal data in the form of brain scans and have noticed that the drug compound under development seems to benefit a specific subpopulation of individuals. As a result, you decide to sequence the genomes of all the responders and non-responders to look for specific biomarkers that could indicate response levels. Success would result in a companion diagnostic that you could use alongside your drug to increase its overall efficacy by delivering it only to the responding population.
Let’s briefly look at how this study might play out in the steps of data acquisition, normalization, and analysis described earlier.
Data from longitudinal brain imaging studies can be moved to AWS using AWS Snowball. Genomics data coming off your genome sequencers would land in Amazon S3 via AWS Storage Gateway and/or AWS Direct Connect. References to these datasets are stored in your data catalog, where you track items such as date generated, anonymized subject ID, and so on.
Genomes are transformed from raw reads (e.g., FASTQ) to a human-readable format (VCF) that identifies variations in a genome. These VCF files are subsequently extracted, transformed, and loaded into a format amenable to big data analytics, such as Parquet, as in the sketch below.
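The VCF-to-Parquet step might look like the following simplified PySpark sketch. It assumes a single-sample VCF (exactly ten columns) and hypothetical S3 paths, and it skips the ‘#’-prefixed metadata and header lines:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("vcf-to-parquet").getOrCreate()

# VCF is tab-separated; lines starting with '#' are metadata or the header.
vcf = (spark.read
       .option("sep", "\t")
       .option("comment", "#")
       .csv("s3://rwe-data-lake/vcf/cohort1/*.vcf")  # hypothetical path
       .toDF("chrom", "pos", "id", "ref", "alt",
             "qual", "filter", "info", "format", "sample"))

# Partition by chromosome so downstream queries can prune efficiently.
(vcf.write
    .mode("overwrite")
    .partitionBy("chrom")
    .parquet("s3://rwe-data-lake/parquet/cohort1/"))
```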
You build a data manifest of your brain scans in Amazon S3. You use the deep learning AMI and P2 instance family to build machine learning models to identify images that represent different stages in your disorder-of-interest. You then run association analyses that combine your genomics data with your brain imaging models and drug response data to identify what genomic variants associate with improved treatment outcomes. You can manage these analyses via Jupyter notebooks, or you can connect with your BI tool of choice to visualize your results.
It will always be day one for RWE
By implementing real world evidence platforms on AWS, you can quickly integrate and interrogate disparate healthcare data to advance human health. By designing your RWE platform to be flexible, you can readily incorporate new big data technologies as they become relevant to your needs. This enables you to innovate and discover at a quicker rate.
To read more about real world evidence on AWS, and how AWS Partner Network (APN) Premier Consulting Partner Deloitte has architected their ConvergeHEALTH platform on AWS, check out this guest post on the APN Blog!
Please leave any questions and comments below.
The healthcare ecosystem has chosen a variety of tools and techniques for working with big data, but one tool that comes up again and again in many of the architectures we design and review is Spark on Amazon EMR. Will Spark power the data behind precision medicine?
About the Authors
Dr. Aaron Friedman is a Healthcare and Life Sciences Partner Solutions Architect at Amazon Web Services. He works with ISVs and SIs to architect healthcare solutions on AWS, and bring the best possible experience to their customers. His passion is working at the intersection of science, big data, and software. In his spare time, he’s exploring the outdoors, learning a new thing to cook, or spending time with his wife and his dog, Macaroon.
Post Syndicated from Jigar Mistry original https://aws.amazon.com/blogs/big-data/turbocharge-your-apache-hive-queries-on-amazon-emr-using-llap/
Apache Hive is one of the most popular tools for analyzing large datasets stored in a Hadoop cluster using SQL. Data analysts and scientists use Hive to query, summarize, explore, and analyze big data.
With the introduction of Hive LLAP (Low Latency Analytical Processing), the notion of Hive being just a batch processing tool has changed. LLAP uses long-lived daemons with intelligent in-memory caching to circumvent batch-oriented latency and provide sub-second query response times.
This post provides an overview of Hive LLAP, including its architecture and common use cases for boosting query performance. You will learn how to install and configure Hive LLAP on an Amazon EMR cluster and run queries on LLAP daemons.
What is Hive LLAP?
Hive LLAP was introduced in Apache Hive 2.0 and provides very fast query processing. It uses persistent daemons that are deployed on a Hadoop YARN cluster using Apache Slider. These daemons are long-running and provide functionality such as I/O with DataNodes, in-memory caching, query processing, and fine-grained access control. And because the daemons are always running in the cluster, they save the substantial overhead of launching new YARN containers for every new Hive session, thereby avoiding long startup times.
When Hive is configured in hybrid execution mode, small and short queries execute directly on LLAP daemons. Heavy lifting (like large shuffles in the reduce stage) is performed in YARN containers that belong to the application. Resources (CPU, memory, etc.) are obtained in a traditional fashion using YARN. After the resources are obtained, the execution engine can decide which resources are to be allocated to LLAP, or it can launch Apache Tez processors in separate YARN containers. You can also configure Hive to run all the processing workloads on LLAP daemons for querying small datasets at lightning fast speeds.
LLAP daemons are launched under YARN management to ensure that the nodes don’t get overloaded with the compute resources of these daemons. You can use scheduling queues to make sure that there is enough compute capacity for other YARN applications to run.
Why use Hive LLAP?
With many options available in the market (Presto, Spark SQL, etc.) for doing interactive SQL over data that is stored in Amazon S3 and HDFS, there are several reasons why using Hive and LLAP might be a good choice:
- For those who are heavily invested in the Hive ecosystem and have external BI tools that connect to Hive over JDBC/ODBC connections, LLAP plugs into their existing architecture without a steep learning curve.
- It’s compatible with existing Hive SQL and other Hive tools, like HiveServer2, and JDBC drivers for Hive.
- It has native support for security features with authentication and authorization (SQL standards-based authorization) using HiveServer2.
- LLAP daemons are aware of the columns and records that are being processed, which enables you to enforce fine-grained access control.
- It can use Hive’s vectorization capabilities to speed up queries, and Hive has better support for Parquet file format when vectorization is enabled.
- It can take advantage of a number of Hive optimizations, like merging multiple small files for query results and automatically determining the number of reducers for joins and group-bys.
- It’s optional and modular, so it can be turned on or off depending on the compute and resource requirements of the cluster. This lets you run other YARN applications concurrently without reserving a cluster specifically for LLAP.
How do you install Hive LLAP in Amazon EMR?
To install and configure LLAP on an EMR cluster, use the following bootstrap action (BA):
This BA downloads and installs Apache Slider on the cluster and configures LLAP so that it works with EMR Hive. For LLAP to work, the EMR cluster must have Hive, Tez, and Apache Zookeeper installed.
You can pass the following arguments to the BA.
| Argument | Description | Default |
| --- | --- | --- |
| --instances | Number of LLAP daemon instances | Number of core/task nodes in the cluster |
| --cache | Cache size per instance | 20% of the node’s physical memory |
| --executors | Number of executors per instance | Number of CPU cores on the node |
| --iothreads | Number of I/O threads per instance | Number of CPU cores on the node |
| --size | Container size per instance | 50% of the node’s physical memory |
| --xmx | Working memory size | 50% of container size |
| --log-level | Log level for the LLAP instance | INFO |
This section describes how you can try the faster Hive queries with LLAP using the TPC-DS testbench for Hive on Amazon EMR.
Use the following AWS Command Line Interface (AWS CLI) command to launch a 1+3-node (one master, three core) m4.xlarge EMR 5.6.0 cluster with the bootstrap action that installs LLAP:
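The original command isn’t reproduced here. As a sketch, an equivalent launch through the AWS SDK for Python might look like the following, where the bootstrap script location is a placeholder you would replace with the script path published with the post:

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

resp = emr.run_job_flow(
    Name="hive-llap-demo",
    ReleaseLabel="emr-5.6.0",
    # LLAP requires Hive, Tez, and ZooKeeper on the cluster.
    Applications=[{"Name": "Hive"}, {"Name": "Tez"}, {"Name": "ZooKeeper"}],
    Instances={
        "MasterInstanceType": "m4.xlarge",
        "SlaveInstanceType": "m4.xlarge",
        "InstanceCount": 4,  # 1 master + 3 core
        "KeepJobFlowAliveWhenNoSteps": True,
    },
    BootstrapActions=[{
        "Name": "install-llap",
        # Placeholder path; substitute the published LLAP bootstrap script.
        "ScriptBootstrapAction": {"Path": "s3://YOUR-BUCKET/configure-llap.sh"},
    }],
    JobFlowRole="EMR_EC2_DefaultRole",
    ServiceRole="EMR_DefaultRole",
)
print(resp["JobFlowId"])
```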
After the cluster is launched, log in to the master node using SSH, and do the following:
- Open the hive-tpcds folder:
- Start Hive CLI using the testbench configuration, create the required tables, and run the sample query:
hive -i testbench.settings
hive> source create_tables.sql;
hive> source query55.sql;
This sample query runs on a 40 GB dataset that is stored on Amazon S3. The dataset is generated using the data generation tool in the TPC-DS testbench for Hive. It produces output like the following:
- This screenshot shows that the query finished in about 47 seconds in LLAP mode. To compare this to the execution time without LLAP, you can run the same workload using only Tez containers:
hive> set hive.llap.execution.mode=none;
hive> source query55.sql;
This query finished in about 80 seconds.
The difference in query execution time is almost 1.7x when using plain YARN containers in contrast to running the query on LLAP daemons. And with every rerun of the query, you will notice that the execution time decreases substantially, by virtue of the LLAP daemons’ in-memory caching.
In this post, I introduced Hive LLAP as a way to boost Hive query performance. I discussed its architecture and described several use cases for the component. I showed how you can install and configure Hive LLAP on an Amazon EMR cluster and how you can run queries on LLAP daemons.
If you have questions about using Hive LLAP on Amazon EMR or would like to share your use cases, please leave a comment below.
Learn how to automatically partition Hive external tables with AWS.
About the Author
Jigar Mistry is a Hadoop Systems Engineer with Amazon Web Services. He works with customers to provide them architectural guidance and technical support for processing large datasets in the cloud using open-source applications. In his spare time, he enjoys camping and exploring different restaurants in the Seattle area.
Post Syndicated from Darknet original http://feedproxy.google.com/~r/darknethackers/~3/uMXdj1EvNhM/
The post Jack – Drag…
Read the full post at darknet.org.uk
Post Syndicated from Nathan Peck original https://aws.amazon.com/blogs/compute/nginx-reverse-proxy-sidecar-container-on-amazon-ecs/
Reverse proxies are a powerful software architecture primitive for fetching resources from a server on behalf of a client. They serve a number of purposes, from protecting servers from unwanted traffic to offloading some of the heavy lifting of HTTP traffic processing.
This post explains the benefits of a reverse proxy and shows how to use NGINX and Amazon EC2 Container Service (Amazon ECS) to easily implement and deploy a reverse proxy for your containerized application.
NGINX is a high-performance HTTP server that has achieved significant adoption because of its asynchronous, event-driven architecture. It can serve thousands of concurrent requests with a low memory footprint. This efficiency also makes it ideal as a reverse proxy.
Amazon ECS is a highly scalable, high performance container management service that supports Docker containers. It allows you to run applications easily on a managed cluster of Amazon EC2 instances. Amazon ECS helps you get your application components running on instances according to a specified configuration. It also helps scale out these components across an entire fleet of instances.
Sidecar containers are a common software pattern that has been embraced by engineering organizations. They are a way to keep server-side architecture easier to understand by building with smaller, modular containers that each serve a simple purpose. Just as an application can be powered by multiple microservices, each microservice can also be powered by multiple containers that work together. A sidecar container is simply a way to move part of the core responsibility of a service out into a containerized module that is deployed alongside a core application container.
The following diagram shows how an NGINX reverse proxy sidecar container operates alongside an application server container:
In this architecture, Amazon ECS has deployed two copies of an application stack, each made up of an NGINX reverse proxy sidecar container and an application container. Web traffic from the public goes to an Application Load Balancer, which distributes the traffic to one of the NGINX reverse proxy sidecars. The NGINX reverse proxy then forwards the request to the application server and returns its response to the client via the load balancer.
Reverse proxy for security
Security is one reason for using a reverse proxy in front of an application container. Any web server that serves resources to the public can expect to receive lots of unwanted traffic every day. Some of this traffic is relatively benign scans by researchers and tools, such as Shodan or nmap: