Tag Archives: Apache
Build a social media follower counter
Post Syndicated from Alex Bate original https://www.raspberrypi.org/blog/build-social-media-follower-counter/
In this tutorial from HackSpace magazine issue 9, Paul Freeman-Powell shows you how to keep track of your social media followers, and encourage subscribers, by building a live follower counter. Get your copy of HackSpace magazine in stores now, or download it as a free PDF here.
Issue 10 of HackSpace magazine is available online and in stores from tomorrow!

The finished build with all components connected, working, and installed in the frame ready for hanging on the wall
If you run a local business like an electronics shop or a café, or if you just want to grow your online following and influence, this project is a fun way to help you keep track of your progress. A counter could also help contribute to growing your following if you hang it on the wall and actively ask your customers to like/follow you to see the numbers go up!
You’ve probably seen those social media follower counters that feature mechanical split-flap displays. In this project we’ll build a counter powered by RGB LEDs that scrolls through four social profiles, using APIs to pull the number of followers for each account. I’m using YouTube, Twitter, Facebook, and Instagram; you can, of course, tailor the project to your needs.
This project involves a bit of electronics, a bit of software coding, and a bit of woodwork, as well as some fairly advanced display work as we transfer a small portion of the Raspberry Pi’s HDMI output onto the LED matrices.
Let’s get social
First, you need to get your Raspberry Pi all set up and talking to the social networks that you’re going to display. Usually, it’s advisable to install Raspbian without any graphical user interface (GUI) for most electronics projects, but in this case you’ll be actively using that GUI, so make sure you start with a fresh and up-to-date installation of full-fat Raspbian.

phpMyAdmin gives you an easy web interface to allow you to access and edit the device’s settings – for example, speed and direction of scrolling, API credentials, and the social network accounts to monitor
You start by turning your humble little Raspberry Pi into your very own mini web server, which will gather your credentials, talk to the social networks, and display the follower counts. To do this, you need to install a LAMP (Linux, Apache, MySQL, PHP) stack. Start by installing the Apache web server by opening a Terminal and typing:
sudo apt-get install apache2 -y
Then, open the web browser on your Pi and type http://localhost — you will see a default page telling you that Apache is working. The page on our little ‘website’ will use code written in the PHP language, so install that by returning to your Terminal and typing:
sudo apt-get install php -y
Once that’s complete, restart Apache:
sudo service apache2 restart
Next, you’ll install the database to store your credentials, settings, and the handles of the social accounts to track. This is done with the following command in your Terminal:
sudo apt-get install mysql-server php-mysql -y
To set a root password for your database, type the following command and follow the on-screen instructions:
sudo mysql_secure_installation
Restart Apache again. Then, for easier management of the database, I recommend installing phpMyAdmin:
sudo apt-get install phpmyadmin -y
At this point, it’s a good idea to connect your Pi to a WiFi network, unless you’re going to be running a network cable to it. Either way, it’s useful to have SSH enabled and to know its IP address so we can access it remotely. Type the following to access Pi settings and enable SSH:
sudo raspi-config
To determine your Pi’s IP address (which will likely be something like 192.168.0.xxx), type either of the following two commands:
ifconfig # this gives you lots of extra info
hostname -I # this gives you less info, but all we need in this case
Now that SSH is enabled and you know the LAN IP address of the Pi, you can use PuTTY to connect to it from another computer for the rest of your work. The keyboard, mouse, and monitor can now be unplugged from the Raspberry Pi.
Social media monitor
To set up the database, type http://XXX/phpmyadmin (where XXX is your Pi’s IP address) and log in as root with the password you set previously. Head to the User Accounts section and create a new user called socialCounter.
You can now download the first bit of code for this project by running this in your Terminal window:
cd /var/www/html
sudo apt-get update
sudo apt-get install git -y
sudo git clone https://github.com/paulfp/social-media-counter.git
Next, open up the db.php script and edit it to include the password you set when creating the socialCounter user:
cd ./social-media-counter
sudo nano db.php
The database, including tables and settings, is contained in the socialCounter.sql file; import this either via the Terminal or via phpMyAdmin, then open up the credentials table. To retrieve the subscriber count, YouTube requires a Google API key, so go to console.cloud.google.com and create a new project (call it anything you like). From the left-hand menu, select ‘APIs & Services’, followed by ‘Library’, then search for the YouTube Data API and enable it. Then go to the ‘Credentials’ tab and create an API key that you can paste into the ‘googleApiKey’ database field.
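Independently of the PHP code in the repository, you can sanity-check your new key with a direct call to the YouTube Data API. Here’s a minimal Python sketch; the channel ID is a placeholder, and the subscriber count comes back in the channel’s statistics:
import requests

API_KEY = 'your-google-api-key'          # the key created in the Google Cloud console
CHANNEL_ID = 'UCxxxxxxxxxxxxxxxxxxxxxx'  # placeholder: the channel you want to track

# Ask the YouTube Data API v3 for the channel statistics, which include subscriberCount
response = requests.get(
    'https://www.googleapis.com/youtube/v3/channels',
    params={'part': 'statistics', 'id': CHANNEL_ID, 'key': API_KEY}
)
print(response.json()['items'][0]['statistics']['subscriberCount'])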
Facebook requires you to create an app at developers.facebook.com, after which you can paste the details into the facebookAppId and facebookSecret fields. Unfortunately, due to recent scandals surrounding clandestine misuse of personal data on Facebook, you’ll need to submit your app for review and approval before it will work.
The ‘social_accounts’ table is where you enter the user names for the social networks you want to monitor, so replace those with your own, then open a new tab and navigate to http://XXX/social-media-counter. You should now see a black page with a tiny carousel showing the social media icons plus follower counts next to each one. The reason it’s so small is that it’s a 64×16 pixel portion of the screen that we’ll be displaying on our 64×16 LED boards.
GPIO pins to LED display
Now that you have your social network follower counts being grabbed and displayed, it’s time to get that to display on our screens. The first step is to wire them up with the DuPont jumper cables from the Raspberry Pi’s GPIO pins to the connection on the first board. This is quite fiddly, but there’s an excellent guide and diagram on GitHub within Henner Zeller’s library that we’ll be using later, so head to hsmag.cc/PLyRcK and refer to wiring.md.

The Raspberry Pi connects to the RGB LED screens with 14 jumper cables, and the screens are daisy-chained together with a ribbon cable
The second screen is daisy-chained to the first one with the ribbon cable, and the power connector that comes with the screens will plug into both panels. Once you’re done, your setup should look just like the picture on this page.
To display the Pi’s HDMI output on the LED screens, install Adafruit’s rpi-fb-matrix library (which in turn uses Henner Zeller’s library to address the panels) by typing the following commands:
sudo apt-get install -y build-essential libconfig++-dev
cd ~
git clone --recursive https://github.com/adafruit/rpi-fb-matrix.git
cd rpi-fb-matrix
Next, you must define your wiring type as ‘regular’. Type the following to edit the Makefile:
nano Makefile
Look for the HARDWARE_DESC= property and ensure the line looks like this: export HARDWARE_DESC=regular before saving and exiting nano. You’re now ready to compile the code, so type this and then sit back and watch the output:
make clean all
Once that’s done, there are a few more settings to change in the matrix configuration file, so open that up:
nano matrix.cfg
You need to make several changes in here, depending on your setup:
- Change display_width to 64 and display_height to 16
- Set panel_width to 32 and panel_height to 16
- Set chain_length to 2
- Set parallel_count to 1
The panel array should look like this:
panels = ( ( { order = 1; rotate = 0; }, { order = 0; rotate = 0; } ) )
Uncomment the crop_origin = (0, 0) line to tell the tool that we don’t want to squish the entire display onto our screens, just an equivalent portion starting right in the top left of the display. Press CTRL+X, then Y, then ENTER to save and exit.
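Putting those changes together, the edited portion of matrix.cfg should end up looking roughly like this (a sketch based on the values above; your file may order or punctuate the settings slightly differently):
display_width = 64;
display_height = 16;
panel_width = 32;
panel_height = 16;
chain_length = 2;
parallel_count = 1;
panels = ( ( { order = 1; rotate = 0; }, { order = 0; rotate = 0; } ) )
crop_origin = (0, 0)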

It ain’t pretty…but it’s out of sight. The Raspberry Pi plus the power supply for the screens fit nice and neatly behind the screens. I left each end open to allow airflow
Finally, before you can test the output, there are some other important settings you need to change first, so open up the Raspberry Pi’s boot configuration as follows:
sudo nano /boot/config.txt
First, disable the on-board sound (as it uses hardware that the screens rely on) by looking for the line that says dtparam=audio=on and changing it to off. Also, uncomment the line that says hdmi_force_hotplug=1, to ensure that an HDMI signal is still generated even with no HDMI monitor plugged in. Save and then reboot your Raspberry Pi.
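After those edits, the two relevant lines in /boot/config.txt should read:
dtparam=audio=off
hdmi_force_hotplug=1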
Now run the program using the config you just set:
cd ~/rpi-fb-matrix
sudo ./rpi-fb-matrix matrix.cfg
You should now see the top 64×16 pixels of your Pi’s display represented on your RGB LED panels! This probably consists of the Raspberry Pi icon and the rest of the top portion of the display bar.
No screensaver!
At this point it’s important to ensure that there’s no screensaver or screen blanking enabled on the Pi, as you want this to display all the time. To disable screen blanking, first install the xscreensaver tool:
sudo apt-get install xscreensaver
That will add a screensaver option to the Pi’s GUI menus, allowing you to disable it completely. Finally, you need to tell the Raspberry Pi to do two things each time it loads:
- Run the rpi-fb-matrix program (like we did manually just now)
- Open the web browser in fullscreen (‘kiosk’ mode), pointed to the Social Counter web page
To do so, edit the Pi’s autostart configuration file:
sudo nano ~/.config/lxsession/LXDE-pi/autostart
Insert the following two lines at the end:
@sudo /home/pi/rpi-fb-matrix/rpi-fb-matrix /home/pi/rpi-fb-matrix/matrix.cfg
@chromium-browser --kiosk http://localhost/social-media-counter
Et voilà!
Disconnect any keyboard, monitor, or mouse from the Pi and reboot it one more time. Once it’s started up again, you should have a fully working display cycling through each enabled social network, showing up-to-date follower counts for each.
It’s now time to make a surround to hold all the components together and allow you to wall-mount your display. The styling you go for is up to you — you could even go all out and design and 3D print a custom package.

The finished product, in pride of place on the wall of our office. Now I just need some more subscribers…!
For my surround, I went for the more rustic and homemade look, and used some spare bits of wood from an internal door frame lining. This worked really well due to the pre-cut recess. With a plywood back, you can screw everything together so that the wood holds the screens tightly enough to not require any extra fitting or gluing, making for easier future maintenance. To improve the look and readability of the display (as well as soften the light and reduce the brightness), you can use a reflective diffuser from an old broken LED TV if you can lay your hands on one from eBay or a TV repair shop, or just any other bit of translucent material. I found that two layers stapled on worked and looked great. Add some hooks to the back and — Bob’s your uncle — a finished, wall-mounted display!
Phew — that was quite an advanced build, but you now have a sophisticated display that can be used for any number of things and should delight your customers whilst helping to build your social following as well. Don’t forget to tweet us a picture of yours!
The post Build a social media follower counter appeared first on Raspberry Pi.
Analyzing Amazon Connect records with Amazon Athena, AWS Glue, and Amazon QuickSight
Post Syndicated from Luis Caro Perez original https://aws.amazon.com/blogs/big-data/analyzing-amazon-connect-records-with-amazon-athena-aws-glue-and-amazon-quicksight/
Last year, we released Amazon Connect, a cloud-based contact center service that enables any business to deliver better customer service at low cost. This service is built based on the same technology that empowers Amazon customer service associates. Using this system, associates have millions of conversations with customers when they inquire about their shipping or order information. Because we made it available as an AWS service, you can now enable your contact center agents to make or receive calls in a matter of minutes. You can do this without having to provision any kind of hardware.
There are several advantages of building your contact center in the AWS Cloud, as described in our documentation. In addition, customers can extend Amazon Connect capabilities by using AWS products and the breadth of AWS services. In this blog post, we focus on how to get analytics out of the rich set of data published by Amazon Connect. We make use of an Amazon Connect data stream and create an end-to-end workflow to offer an analytical solution that can be customized based on need.
Solution overview
The following diagram illustrates the solution.
In this solution, Amazon Connect exports its contact trace records (CTRs) using Amazon Kinesis. CTRs are data streams in JSON format, and each has information about individual contacts. For example, this information might include the start and end time of a call, which agent handled the call, which queue the user chose, queue wait times, number of holds, and so on. You can enable this feature by reviewing our documentation.
In this architecture, we use Kinesis Firehose to capture Amazon Connect CTRs as raw data in an Amazon S3 bucket. We don’t use the recent feature added by Kinesis Firehose to save the data in S3 as Apache Parquet format. We use AWS Glue functionality to automatically detect the schema on the fly from an Amazon Connect data stream.
The primary reason for this approach is that it allows us to use attributes and enables an Amazon Connect administrator to dynamically add more fields as needed. Also, by converting the data to Parquet in batch (every couple of hours), compression can be higher. However, if your requirement is to ingest the data in Parquet format in real time, we recommend using the recently launched Kinesis Firehose feature. You can review this blog post for further information.
By default, Firehose puts these records in time-series format. To make it easy for AWS Glue crawlers to capture information from new records, we use AWS Lambda to move all new records to a single S3 prefix called flatfiles. Our Lambda function is configured using S3 event notification. To comply with AWS Glue and Athena best practices, the Lambda function also converts all column names to lowercase. Finally, we also use the Lambda function to start AWS Glue crawlers. AWS Glue crawlers identify the data schema and update the AWS Glue Data Catalog, which is used by extract, transform, load (ETL) jobs in AWS Glue in the latter half of the workflow.
You can see our approach in the Lambda code following.
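The Lambda source itself appears as an image in the original post; the following minimal Python sketch illustrates the same approach. The bucket layout, crawler name, and the assumption that each object contains newline-delimited JSON records are illustrative only:
import json
import urllib.parse
import boto3

s3 = boto3.client('s3')
glue = boto3.client('glue')

CRAWLER_NAME = 'ctr-flatfiles-crawler'  # hypothetical crawler name

def lambda_handler(event, context):
    for record in event['Records']:
        bucket = record['s3']['bucket']['name']
        key = urllib.parse.unquote_plus(record['s3']['object']['key'])
        # Read the raw object written by Kinesis Data Firehose
        body = s3.get_object(Bucket=bucket, Key=key)['Body'].read().decode('utf-8')
        # Lowercase every column name to follow AWS Glue and Athena best practices
        flattened = '\n'.join(
            json.dumps({k.lower(): v for k, v in json.loads(line).items()})
            for line in body.splitlines() if line.strip()
        )
        # Copy the result under the single flatfiles/ prefix
        s3.put_object(Bucket=bucket, Key='flatfiles/' + key.split('/')[-1], Body=flattened)
    # Start the crawler so the Data Catalog picks up any new attributes
    glue.start_crawler(Name=CRAWLER_NAME)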
We trigger AWS Glue crawlers based on events because this approach lets us capture any new data frame that we want to be dynamic in nature. CTR attributes are designed to offer multiple custom options based on a particular call flow. Attributes are essentially key-value pairs in nested JSON format. With the help of event-based AWS Glue crawlers, you can easily identify newer attributes automatically.
We recommend setting up an S3 lifecycle policy on the flatfiles folder that keeps records only for 24 hours. Doing this optimizes AWS Glue ETL jobs to process a subset of files rather than the entire set of records.
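A lifecycle rule like this can be applied from the S3 console or with a quick boto3 call; a sketch follows, with the bucket name as a placeholder:
import boto3

s3 = boto3.client('s3')

# Expire objects under the flatfiles/ prefix after one day
s3.put_bucket_lifecycle_configuration(
    Bucket='my-connect-ctr-bucket',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-flatfiles',
            'Filter': {'Prefix': 'flatfiles/'},
            'Status': 'Enabled',
            'Expiration': {'Days': 1}
        }]
    }
)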
After we have data in the flatfiles folder, we use AWS Glue to catalog the data and transform it into Parquet format inside a folder called parquet/ctr/. The AWS Glue job performs the ETL that transforms the data from JSON to Parquet format. We use AWS Glue crawlers to capture any new data frame inside the JSON code that we want to be dynamic in nature. What this means is that when you add new attributes to an Amazon Connect instance, the solution automatically recognizes them and incorporates them in the schema of the results.
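The Glue job created by the template is not reproduced here, but the core of such a JSON-to-Parquet job can be sketched in a few lines of PySpark with the Glue libraries; the catalog database, table name, and output path below are assumptions:
from pyspark.context import SparkContext
from awsglue.context import GlueContext

sc = SparkContext()
glue_context = GlueContext(sc)

# Read the crawled flatfiles table from the AWS Glue Data Catalog
ctr_dyf = glue_context.create_dynamic_frame.from_catalog(
    database='connectctr', table_name='flatfiles')

# Write the records back out as Parquet under parquet/ctr/
glue_context.write_dynamic_frame.from_options(
    frame=ctr_dyf,
    connection_type='s3',
    connection_options={'path': 's3://my-connect-ctr-bucket/parquet/ctr/'},
    format='parquet')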
After AWS Glue stores the results in Parquet format, you can perform analytics using Amazon Redshift Spectrum, Amazon Athena, or any third-party data warehouse platform. To keep this solution simple, we have used Amazon Athena for analytics. Amazon Athena allows us to query data without having to set up and manage any servers or data warehouse platforms. Additionally, we only pay for the queries that are executed.
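For example, once the ctr table exists, a query such as the following counts calls per agent; it can be run from the Athena console or through boto3 as sketched here (database name and output location are placeholders):
import boto3

athena = boto3.client('athena')

response = athena.start_query_execution(
    QueryString='SELECT agent.username, COUNT(*) AS calls '
                'FROM ctr GROUP BY agent.username ORDER BY calls DESC',
    QueryExecutionContext={'Database': 'connectctr'},
    ResultConfiguration={'OutputLocation': 's3://my-connect-ctr-bucket/athena-results/'}
)
print(response['QueryExecutionId'])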
Try it out!
You can get started with our sample AWS CloudFormation template. This template creates the components starting from the Kinesis stream and finishes up with S3 buckets, the AWS Glue job, and crawlers. To deploy the template, open the AWS Management Console by clicking the following link.
In the console, specify the following parameters:
- BucketName: The name for the bucket to store all the solution files. This name must be unique; if it’s not, template creation fails.
- etlJobSchedule: The schedule in cron format indicating how often the AWS Glue job runs. The default value is every hour.
- KinesisStreamName: The name of the Kinesis stream to receive data from Amazon Connect. This name must be different from any other Kinesis stream created in your AWS account.
- s3interval: The interval in seconds for Kinesis Firehose to save data inside the flatfiles folder on S3. The value must be between 60 and 900 seconds.
- sampledata: When this parameter is set to true, sample CTR records are used. Doing this lets you try this solution without setting up an Amazon Connect instance. All examples in this walkthrough use this sample data.
Select the “I acknowledge that AWS CloudFormation might create IAM resources.” check box, and then choose Create. After the template finishes creating resources, you can see the stream name on the stack Outputs tab.
If you haven’t created your Amazon Connect instance, you can do so by following the Getting Started Guide. When you are done, choose your Amazon Connect instance in the console, which takes you to the instance settings. Choose Data streaming to enable streaming for CTR records. Here, you can choose the Kinesis stream (defined in the KinesisStreamName parameter) that was created by the CloudFormation template.
Now it’s time to generate the data by making or receiving calls by using Amazon Connect. You can go to the Amazon Connect Contact Control Panel (CCP) to make or receive calls using a software phone or desktop phone. After a few minutes, we should see data inside the flatfiles folder. To make it easier to try this solution, we provide sample data that you can enable by setting the sampledata parameter to true in your CloudFormation template.
You can navigate to the AWS Glue console by choosing Jobs on the left navigation pane of the console. We can select our job here. In my case, the job created by CloudFormation is called glueJob-i3TULzVtP1W0; yours should be similar. You run the job by choosing Run job for Action.
After that, we wait for the AWS Glue job to run and to finish successfully. We can track the status of the job by checking the History tab.
When the job finishes running, we can check the Database section. There should be a new table created called ctr in Parquet format.
To query the data with Athena, we can select the ctr table, and for Action choose View data.
Doing this takes us to the Athena console. If you run a query, Athena shows a preview of the data.
When we can query the data using Athena, we can visualize it using Amazon QuickSight. Before connecting Amazon QuickSight to Athena, we must make sure to grant Amazon QuickSight access to Athena and the associated S3 buckets in the account. For more information on doing this, see Managing Amazon QuickSight Permissions to AWS Resources in the Amazon QuickSight User Guide. We can then create a new data set in Amazon QuickSight based on the Athena table that was created.
After setting up permissions, we can create a new analysis in Amazon QuickSight by choosing New analysis.
Then we add a new data set.
We choose Athena as the source and give the data source a name (in this case, I named it connectctr).
Choose the name of the database and the table referencing the Parquet results.
Then choose Visualize.
After that, we should see the following screen.
Now we can create some visualizations. First, search for the agent.username column, and drag it to the AutoGraph section.
We can see the agents and the number of calls for each, so we can easily see which agents have taken the largest amount of calls. If we want to see from what queues the calls came for each agent, we can add the queue.arn column to the visual.
After following all these steps, you can use Amazon QuickSight to add different columns from the call records and perform different types of visualizations. You can build dashboards that continuously monitor your connect instance. You can share those dashboards with others in your organization who might need to see this data.
Conclusion
In this post, you see how you can use services like AWS Lambda, AWS Glue, and Amazon Athena to process Amazon Connect call records. The post also demonstrates how to use AWS Lambda to preprocess files in Amazon S3 and transform them into a format that is recognized by AWS Glue crawlers. Finally, the post shows how to use Amazon QuickSight to perform visualizations.
You can use the provided template to analyze your own contact center instance. Or you can take the CloudFormation template and modify it to process other data streams that can be ingested using Amazon Kinesis or stored on Amazon S3.
Additional Reading
If you found this post useful, be sure to check out Analyze Apache Parquet optimized data using Amazon Kinesis Data Firehose, Amazon Athena, and Amazon Redshift and Visualize AWS Cloudtrail Logs Using AWS Glue and Amazon QuickSight.
About the Authors
Luis Caro is a Big Data Consultant for AWS Professional Services. He works with our customers to provide guidance and technical assistance on big data projects, helping them improve the value of their solutions when using AWS.
Peter Dalbhanjan is a Solutions Architect for AWS based in Herndon, VA. Peter has a keen interest in evangelizing AWS solutions and has written multiple blog posts that focus on simplifying complex use cases. At AWS, Peter helps with designing and architecting a variety of customer workloads.
Amazon SageMaker Updates – Tokyo Region, CloudFormation, Chainer, and GreenGrass ML
Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/sagemaker-tokyo-summit-2018/
Today, at the AWS Summit in Tokyo, we announced a number of updates and new features for Amazon SageMaker. Starting today, SageMaker is available in Asia Pacific (Tokyo)! SageMaker also now supports CloudFormation. A new machine learning framework, Chainer, is now available in the SageMaker Python SDK, in addition to MXNet and TensorFlow. Finally, support for running Chainer models on several devices was added to AWS Greengrass Machine Learning.
Amazon SageMaker Chainer Estimator
Chainer is a popular, flexible, and intuitive deep learning framework. Chainer networks work on a “Define-by-Run” scheme, where the network topology is defined dynamically via forward computation. This is in contrast to many other frameworks, which work on a “Define-and-Run” scheme where the topology of the network is defined separately from the data. A lot of developers enjoy the Chainer scheme since it allows them to write their networks with native Python constructs and tools.
Luckily, using Chainer with SageMaker is just as easy as using a TensorFlow or MXNet estimator. In fact, it might even be a bit easier, since it’s likely you can take your existing scripts and use them to train on SageMaker with very few modifications. With TensorFlow or MXNet, users have to implement a train function with a particular signature. With Chainer, your scripts can be a little bit more portable, as you can simply read from a few environment variables like SM_MODEL_DIR, SM_NUM_GPUS, and others. We can wrap our existing script in an if __name__ == '__main__': guard and invoke it locally or on SageMaker.
import argparse
import os

if __name__ == '__main__':
    parser = argparse.ArgumentParser()

    # hyperparameters sent by the client are passed as command-line arguments to the script.
    parser.add_argument('--epochs', type=int, default=10)
    parser.add_argument('--batch-size', type=int, default=64)
    parser.add_argument('--learning-rate', type=float, default=0.05)

    # Data, model, and output directories
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--test', type=str, default=os.environ['SM_CHANNEL_TEST'])

    args, _ = parser.parse_known_args()

    # ... load from args.train and args.test, train a model, write model to args.model_dir.
Then, we can run that script locally or use the SageMaker Python SDK to launch it on some GPU instances in SageMaker. The hyperparameters will get passed in to the script as command-line arguments, and the environment variables above will be autopopulated. When we call fit, the input channels we pass will be populated in the SM_CHANNEL_* environment variables.
from sagemaker.chainer.estimator import Chainer

# Create my estimator
chainer_estimator = Chainer(
    entry_point='example.py',
    train_instance_count=1,
    train_instance_type='ml.p3.2xlarge',
    hyperparameters={'epochs': 10, 'batch-size': 64}
)

# Train my estimator
chainer_estimator.fit({'train': train_input, 'test': test_input})

# Deploy my estimator to a SageMaker Endpoint and get a Predictor
predictor = chainer_estimator.deploy(
    instance_type="ml.m4.xlarge",
    initial_instance_count=1
)
Now, instead of bringing your own Docker container for training and hosting with Chainer, you can just maintain your script. You can see the full sagemaker-chainer-containers repository on GitHub. One of my favorite features of the new container is built-in chainermn for easy multi-node distribution of your Chainer training jobs.
There’s a lot more documentation and information available in both the README and the example notebooks.
AWS GreenGrass ML with Chainer
AWS GreenGrass ML now includes a pre-built Chainer package for all devices powered by Intel Atom, NVIDIA Jetson TX2, and Raspberry Pi. So, now GreenGrass ML provides pre-built packages for TensorFlow, Apache MXNet, and Chainer! You can train your models on SageMaker and then easily deploy them to any GreenGrass-enabled device using GreenGrass ML.
JAWS UG
I want to give a quick shout out to all of our wonderful and inspirational friends in the JAWS UG who attended the AWS Summit in Tokyo today. I’ve very much enjoyed seeing your pictures of the summit. Thanks for making Japan an amazing place for AWS developers! I can’t wait to visit again and meet with all of you.
– Randall
Amazon Neptune Generally Available
Post Syndicated from Randall Hunt original https://aws.amazon.com/blogs/aws/amazon-neptune-generally-available/
Amazon Neptune is now Generally Available in US East (N. Virginia), US East (Ohio), US West (Oregon), and EU (Ireland). Amazon Neptune is a fast, reliable, fully-managed graph database service that makes it easy to build and run applications that work with highly connected datasets. At the core of Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latencies. Neptune supports two popular graph models, Property Graph and RDF, through Apache TinkerPop Gremlin and SPARQL, allowing you to easily build queries that efficiently navigate highly connected datasets. Neptune can be used to power everything from recommendation engines and knowledge graphs to drug discovery and network security. Neptune is fully-managed with automatic minor version upgrades, backups, encryption, and fail-over. I wrote about Neptune in detail for AWS re:Invent last year and customers have been using the preview and providing great feedback that the team has used to prepare the service for GA.
Now that Amazon Neptune is generally available there are a few changes from the preview:
- A large number of performance enhancements and updates
- AWS CloudFormation support
- AWS Command Line Interface (CLI)/SDK support
- Updated to Apache TinkerPop 3.3.2
- Support for IAM roles with bulk loading from Amazon Simple Storage Service (S3)
Launching an Amazon Neptune Cluster
Launching a Neptune cluster is as easy as navigating to the AWS Management Console and clicking create cluster. Of course you can also launch with CloudFormation, the CLI, or the SDKs.
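For example, a minimal boto3 sketch that creates a cluster and adds a single instance might look like this; the identifiers and instance class are placeholders:
import boto3

neptune = boto3.client('neptune')

# Create the cluster, then add a primary instance to it
neptune.create_db_cluster(DBClusterIdentifier='my-neptune-cluster', Engine='neptune')
neptune.create_db_instance(
    DBInstanceIdentifier='my-neptune-instance',
    DBInstanceClass='db.r4.large',
    Engine='neptune',
    DBClusterIdentifier='my-neptune-cluster'
)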
You can monitor your cluster health and the health of individual instances through Amazon CloudWatch and the console.
Additional Resources
We’ve created two repos with some additional tools and examples here. You can expect continuous development on these repos as we add additional tools and examples.
- Amazon Neptune Tools Repo – This repo has a useful tool for converting GraphML files into Neptune-compatible CSVs for bulk loading from S3.
- Amazon Neptune Samples Repo – This repo has a really cool example of building a collaborative filtering recommendation engine for video game preferences.
Purpose Built Databases
There’s an industry trend where we’re moving more and more onto purpose-built databases. Developers and businesses want to access their data in the format that makes the most sense for their applications. As cloud resources make transforming large datasets easier with tools like AWS Glue, we have a lot more options than we used to for accessing our data. With tools like Amazon Redshift, Amazon Athena, Amazon Aurora, Amazon DynamoDB, and more we get to choose the best database for the job or even enable entirely new use-cases. Amazon Neptune is perfect for workloads where the data is highly connected across data rich edges.
I’m really excited about graph databases and I see a huge number of applications. Looking for ideas of cool things to build? I’d love to build a web crawler in AWS Lambda that uses Neptune as the backing store. You could further enrich it by running Amazon Comprehend or Amazon Rekognition on the text and images found and creating a search engine on top of Neptune.
As always, feel free to reach out in the comments or on twitter to provide any feedback!
– Randall
Security updates for Wednesday
Post Syndicated from ris original https://lwn.net/Articles/756020/rss
Security updates have been issued by Arch Linux (strongswan, wireshark-cli, wireshark-common, wireshark-gtk, and wireshark-qt), CentOS (libvirt, procps-ng, and thunderbird), Debian (apache2, git, and qemu), Gentoo (beep, git, and procps), Mageia (mariadb, microcode, python, virtualbox, and webkit2), openSUSE (ceph, pdns, and perl-DBD-mysql), Red Hat (kernel), SUSE (HA kernel modules, libmikmod, ntp, and tiff), and Ubuntu (nvidia-graphics-drivers-384).
Orchestrate Apache Spark applications using AWS Step Functions and Apache Livy
Post Syndicated from Tanzir Musabbir original https://aws.amazon.com/blogs/big-data/orchestrate-apache-spark-applications-using-aws-step-functions-and-apache-livy/
The adoption of Apache Spark has increased significantly over the past few years, and running Spark-based application pipelines is the new normal. Spark jobs that are in an ETL (extract, transform, and load) pipeline have different requirements—you must handle dependencies in the jobs, maintain order during executions, and run multiple jobs in parallel. In most of these cases, you can use workflow scheduler tools like Apache Oozie, Apache Airflow, and even Cron to fulfill these requirements.
Apache Oozie is a widely used workflow scheduler system for Hadoop-based jobs. However, its limited UI capabilities, lack of integration with other services, and heavy XML dependency might not be suitable for some users. On the other hand, Apache Airflow comes with a lot of neat features, along with powerful UI and monitoring capabilities and integration with several AWS and third-party services. However, with Airflow, you do need to provision and manage the Airflow server. The Cron utility is a powerful job scheduler. But it doesn’t give you much visibility into the job details, and creating a workflow using Cron jobs can be challenging.
What if you have a simple use case, in which you want to run a few Spark jobs in a specific order, but you don’t want to spend time orchestrating those jobs or maintaining a separate application? You can do that today in a serverless fashion using AWS Step Functions. You can create the entire workflow in AWS Step Functions and interact with Spark on Amazon EMR through Apache Livy.
In this post, I walk you through a list of steps to orchestrate a serverless Spark-based ETL pipeline using AWS Step Functions and Apache Livy.
Input data
For the source data for this post, I use the New York City Taxi and Limousine Commission (TLC) trip record data. For a description of the data, see this detailed dictionary of the taxi data. In this example, we’ll work mainly with the following three columns for the Spark jobs.
Column name | Column description |
RateCodeID | Represents the rate code in effect at the end of the trip (for example, 1 for standard rate, 2 for JFK airport, 3 for Newark airport, and so on). |
FareAmount | Represents the time-and-distance fare calculated by the meter. |
TripDistance | Represents the elapsed trip distance in miles reported by the taxi meter. |
The trip data is in comma-separated values (CSV) format with the first row as a header. To shorten the Spark execution time, I trimmed the large input data to only 20,000 rows. During the deployment phase, the input file tripdata.csv is stored in Amazon S3 in the <<your-bucket>>/emr-step-functions/input/ folder.
The following image shows a sample of the trip data:
Solution overview
The next few sections describe how Spark jobs are created for this solution, how you can interact with Spark using Apache Livy, and how you can use AWS Step Functions to create orchestrations for these Spark applications.
At a high level, the solution includes the following steps:
- Trigger the AWS Step Functions state machine by passing the input file path.
- The first stage in the state machine triggers an AWS Lambda function.
- The Lambda function interacts with Apache Spark running on Amazon EMR using Apache Livy, and submits a Spark job.
- The state machine waits a few seconds before checking the Spark job status.
- Based on the job status, the state machine moves to the success or failure state.
- Subsequent Spark jobs are submitted using the same approach.
- The state machine waits a few seconds for the job to finish.
- The job finishes, and the state machine updates with its final status.
Let’s take a look at the Spark application that is used for this solution.
Spark jobs
For this example, I built a Spark jar named spark-taxi.jar. It has two different Spark applications:
- MilesPerRateCode – The first job that runs on the Amazon EMR cluster. This job reads the trip data from an input source and computes the total trip distance for each rate code. The output of this job consists of two columns and is stored in Apache Parquet format in the output path.
The following are the expected output columns:
- rate_code – Represents the rate code for the trip.
- total_distance – Represents the total trip distance for that rate code (for example, sum(trip_distance)).
- RateCodeStatus – The second job that runs on the EMR cluster, but only if the first job finishes successfully. This job depends on two different input sets:
- tripdata.csv – The same trip data that is used for the first Spark job.
- miles-per-rate – The output of the first job.
This job first reads the tripdata.csv file and aggregates the fare_amount by the rate_code. After this point, you have two different datasets, both aggregated by rate_code. Finally, the job uses the rate_code field to join two datasets and output the entire rate code status in a single CSV file.
The output columns are as follows:
- rate_code_id – Represents the rate code type.
- total_distance – Derived from first Spark job and represents the total trip distance.
- total_fare_amount – A new field that is generated during the second Spark application, representing the total fare amount by the rate code type.
Note that in this case, you don’t need to run two different Spark jobs to generate that output. The goal of setting up the jobs in this way is just to create a dependency between the two jobs and use them within AWS Step Functions.
Both Spark applications take one input argument called rootPath. It’s the S3 location where the Spark job is stored along with input and output data. Here is a sample of the final output:
The next section discusses how you can use Apache Livy to interact with Spark applications that are running on Amazon EMR.
Using Apache Livy to interact with Apache Spark
Apache Livy provides a REST interface to interact with Spark running on an EMR cluster. Livy is included in Amazon EMR release version 5.9.0 and later. In this post, I use Livy to submit Spark jobs and retrieve job status. When Amazon EMR is launched with Livy installed, the EMR master node becomes the endpoint for Livy, and it starts listening on port 8998 by default. Livy provides APIs to interact with Spark.
Let’s look at a couple of examples of how you can interact with Spark running on Amazon EMR using Livy.
To list active running jobs, you can execute the following from the EMR master node:
If you want to do the same from a remote instance, just change localhost to the EMR hostname, as in the following (port 8998 must be open to that remote instance through the security group):
To submit a Spark job through Livy pointing to the application jar in S3, you can do the following:
Spark submit through Livy returns a session ID that starts from 0. Using that session ID, you can retrieve the status of the job as shown following:
Through Spark submit, you can pass multiple arguments for the Spark job and Spark configuration settings. You can also do that using Livy, by passing the S3 path through the args parameter, as shown following:
All Apache Livy REST calls return a response as JSON, as shown in the following image:
If you want to pretty-print that JSON response, you can pipe command with Python’s JSON tool as follows:
For a detailed list of Livy APIs, see the Apache Livy REST API page. This post uses GET /batches and POST /batches.
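The exact curl commands appear as images in the original post. As an illustration, the same GET /batches and POST /batches calls can be made with Python’s requests library; the endpoint host, jar path, and class name below are placeholders:
import json
import requests

livy_url = 'http://<emr-master-dns>:8998/batches'  # placeholder Livy endpoint

# List batch sessions (GET /batches)
print(requests.get(livy_url).json())

# Submit a Spark job from a jar stored in S3 (POST /batches)
payload = {
    'file': 's3://<your-bucket>/emr-step-functions/spark-taxi.jar',
    'className': 'MilesPerRateCode',
    'args': ['s3://<your-bucket>/emr-step-functions']
}
session = requests.post(livy_url, data=json.dumps(payload),
                        headers={'Content-Type': 'application/json'}).json()

# Check the status of that batch session (GET /batches/{id}/state)
print(requests.get('{}/{}/state'.format(livy_url, session['id'])).json())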
In the next section, you create a state machine and orchestrate Spark applications using AWS Step Functions.
Using AWS Step Functions to create a Spark job workflow
AWS Step Functions automatically triggers and tracks each step and retries when it encounters errors. So your application executes in order and as expected every time. To create a Spark job workflow using AWS Step Functions, you first create a Lambda state machine using different types of states to create the entire workflow.
First, you use the Task state—a simple state in AWS Step Functions that performs a single unit of work. You also use the Wait state to delay the state machine from continuing for a specified time. Later, you use the Choice state to add branching logic to a state machine.
The following is a quick summary of how to use different states in the state machine to create the Spark ETL pipeline:
- Task state – Invokes a Lambda function. The first Task state submits the Spark job on Amazon EMR, and the next Task state is used to retrieve the previous Spark job status.
- Wait state – Pauses the state machine until a job completes execution.
- Choice state – Each Spark job execution can return a failure, an error, or a success state. So, in the state machine, you use the Choice state to create a rule that specifies the next action or step based on the success or failure of the previous step.
Here is one of my Task states, MilesPerRateCode, which simply submits a Spark job:
This Task state configuration specifies the Lambda function to execute. Inside the Lambda function, it submits a Spark job through Livy using Livy’s POST API. Using ResultPath, it tells the state machine where to place the result of the executing task. As discussed in the previous section, Spark submit returns the session ID, which is captured with $.jobId and used in a later state.
The following code section shows the Lambda function, which is used to submit the MilesPerRateCode job. It uses the Python requests library to submit a POST against the Livy endpoint hosted on Amazon EMR and passes the required parameters in JSON format through payload. It then parses the response, grabs id from the response, and returns it. The Next field tells the state machine which state to go to next.
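The function itself is shown as an image in the original post; a simplified sketch of such a job-submission Lambda might look like this (the Livy endpoint, jar path, and class name are placeholders):
import json
import requests  # assumed to be bundled with the deployment package

LIVY_URL = 'http://<emr-master-dns>:8998/batches'  # placeholder Livy endpoint

def lambda_handler(event, context):
    # Build the POST /batches payload for the MilesPerRateCode application
    payload = {
        'file': event['rootPath'] + '/emr-step-functions/spark-taxi.jar',
        'className': 'MilesPerRateCode',
        'args': [event['rootPath']]
    }
    response = requests.post(LIVY_URL, data=json.dumps(payload),
                             headers={'Content-Type': 'application/json'})
    # Livy returns a batch session id; the state machine stores it via ResultPath
    return response.json()['id']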
Just like in the MilesPerRate job, another state submits the RateCodeStatus job, but it executes only when all previous jobs have completed successfully.
Here is the Task state in the state machine that checks the Spark job status:
Just like other states, the preceding Task executes a Lambda function, captures the result (represented by jobStatus), and passes it to the next state. The following is the Lambda function that checks the Spark job status based on a given session ID:
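That status-check function is also shown as an image; here is a minimal sketch under the same assumptions (placeholder endpoint, bundled requests library):
import requests  # assumed to be bundled with the deployment package

LIVY_URL = 'http://<emr-master-dns>:8998/batches'  # placeholder Livy endpoint

def lambda_handler(event, context):
    # event['jobId'] carries the Livy batch session id captured earlier
    response = requests.get('{}/{}/state'.format(LIVY_URL, event['jobId']))
    # Livy reports states such as 'running', 'success', or 'dead'
    return response.json()['state']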
In the Choice state, it checks the Spark job status value, compares it with a predefined state status, and transitions the state based on the result. For example, if the status is success, move to the next state (RateCodeJobStatus job), and if it is dead, move to the MilesPerRate job failed state.
Walkthrough using AWS CloudFormation
To set up this entire solution, you need to create a few AWS resources. To make it easier, I have created an AWS CloudFormation template. This template creates all the required AWS resources and configures all the resources that are needed to create a Spark-based ETL pipeline on AWS Step Functions.
This CloudFormation template requires you to pass the following four parameters during initiation.
Parameter | Description |
ClusterSubnetID | The subnet where the Amazon EMR cluster is deployed and Lambda is configured to talk to this subnet. |
KeyName | The name of the existing EC2 key pair to access the Amazon EMR cluster. |
VPCID | The ID of the virtual private cloud (VPC) where the EMR cluster is deployed and Lambda is configured to talk to this VPC. |
S3RootPath | The Amazon S3 path where all required files (input file, Spark job, and so on) are stored and the resulting data is written. |
IMPORTANT: These templates are designed only to show how you can create a Spark-based ETL pipeline on AWS Step Functions using Apache Livy. They are not intended for production use without modification. And if you try this solution outside of the us-east-1 Region, download the necessary files from s3://aws-data-analytics-blog/emr-step-functions, upload the files to the buckets in your Region, edit the script as appropriate, and then run it.
To launch the CloudFormation stack, choose Launch Stack:
Launching this stack creates the following list of AWS resources.
Logical ID | Resource Type | Description |
StepFunctionsStateExecutionRole | IAM role | IAM role to execute the state machine and have a trust relationship with the states service. |
SparkETLStateMachine | AWS Step Functions state machine | State machine in AWS Step Functions for the Spark ETL workflow. |
LambdaSecurityGroup | Amazon EC2 security group | Security group that is used for the Lambda function to call the Livy API. |
RateCodeStatusJobSubmitFunction | AWS Lambda function | Lambda function to submit the RateCodeStatus job. |
MilesPerRateJobSubmitFunction | AWS Lambda function | Lambda function to submit the MilesPerRate job. |
SparkJobStatusFunction | AWS Lambda function | Lambda function to check the Spark job status. |
LambdaStateMachineRole | IAM role | IAM role for all Lambda functions to use the lambda trust relationship. |
EMRCluster | Amazon EMR cluster | EMR cluster where Livy is running and where the job is placed. |
During the AWS CloudFormation deployment phase, it sets up S3 paths for input and output. Input files are stored in the <<s3-root-path>>/emr-step-functions/input/ path, whereas spark-taxi.jar is copied under <<s3-root-path>>/emr-step-functions/.
The following screenshot shows how the S3 paths are configured after deployment. In this example, I passed a bucket that I created in the AWS account s3://tm-app-demos for the S3 root path.
If the CloudFormation template completed successfully, you will see Spark-ETL-State-Machine in the AWS Step Functions dashboard, as follows:
Choose the Spark-ETL-State-Machine state machine to take a look at this implementation. The AWS CloudFormation template built the entire state machine along with its dependent Lambda functions, which are now ready to be executed.
On the dashboard, choose the newly created state machine, and then choose New execution to initiate the state machine. It asks you to pass input in JSON format. This input goes to the first state MilesPerRate Job, which eventually executes the Lambda function blog-miles-per-rate-job-submit-function.
Pass the S3 root path as input:
Then choose Start Execution:
The rootPath value is the same value that was passed when creating the CloudFormation stack. It can be an S3 bucket location or a bucket with prefixes, but it should be the same value that is used for AWS CloudFormation. This value tells the state machine where it can find the Spark jar and input file, and where it will write output files. After the state machine starts, each state/task is executed based on its definition in the state machine.
At a high level, the following represents the flow of events:
- Execute the first Spark job, MilesPerRate.
- The Spark job reads the input file from the location <<rootPath>>/emr-step-functions/input/tripdata.csv. If the job finishes successfully, it writes the output data to <<rootPath>>/emr-step-functions/miles-per-rate.
- If the Spark job fails, it transitions to the error state MilesPerRate job failed, and the state machine stops. If the Spark job finishes successfully, it transitions to the RateCodeStatus Job state, and the second Spark job is executed.
- If the second Spark job fails, it transitions to the error state RateCodeStatus job failed, and the state machine stops with the Failed status.
- If this Spark job completes successfully, it writes the final output data to the <<rootPath>>/emr-step-functions/rate-code-status/ folder. It also transitions to the RateCodeStatus job finished state, and the state machine ends its execution with the Success status.
This following screenshot shows a successfully completed Spark ETL state machine:
The right side of the state machine diagram shows the details of individual states with their input and output.
When you execute the state machine for the second time, it fails because the S3 path already exists. The state machine turns red and stops at MilesPerRate job failed. The following image represents that failed execution of the state machine:
You can also check your Spark application status and logs by going to the Amazon EMR console and viewing the Application history tab:
I hope this walkthrough paints a picture of how you can create a serverless solution for orchestrating Spark jobs on Amazon EMR using AWS Step Functions and Apache Livy. In the next section, I share some ideas for making this solution even more elegant.
Next steps
The goal of this post is to show a simple example that uses AWS Step Functions to create an orchestration for Spark-based jobs in a serverless fashion. To make this solution robust and production ready, you can explore the following options:
- In this example, I manually initiated the state machine by passing the rootPath as input. You can instead trigger the state machine automatically. To run the ETL pipeline as soon as the files arrive in your S3 bucket, you can pass the new file path to the state machine. Because CloudWatch Events supports AWS Step Functions as a target, you can create a CloudWatch rule for an S3 event. You can then set AWS Step Functions as a target and pass the new file path to your state machine. You’re all set!
- You can also improve this solution by adding an alerting mechanism in case of failures. To do this, create a Lambda function that sends an alert email and assign that Lambda function to a Fail state. That way, when any part of your state machine fails, it triggers an email and notifies the user.
- If you want to submit multiple Spark jobs in parallel, you can use the Parallel state type in AWS Step Functions. The Parallel state is used to create parallel branches of execution in your state machine.
With Lambda and AWS Step Functions, you can create a very robust serverless orchestration for your big data workload.
Cleaning up
When you’ve finished testing this solution, remember to clean up all those AWS resources that you created using AWS CloudFormation. Use the AWS CloudFormation console or AWS CLI to delete the stack named Blog-Spark-ETL-Step-Functions.
Summary
In this post, I showed you how to use AWS Step Functions to orchestrate your Spark jobs that are running on Amazon EMR. You used Apache Livy to submit jobs to Spark from a Lambda function and created a workflow for your Spark jobs, maintaining a specific order for job execution and triggering different AWS events based on your job’s outcome.
Go ahead—give this solution a try, and share your experience with us!
Additional Reading
If you found this post useful, be sure to check out Building a Real World Evidence Platform on AWS and Building High-Throughput Genomics Batch Workflows on AWS: Workflow Layer (Part 4 of 4).
About the Author
Tanzir Musabbir is an EMR Specialist Solutions Architect with AWS. He is an early adopter of open source Big Data technologies. At AWS, he works with our customers to provide them architectural guidance for running analytics solutions on Amazon EMR, Amazon Athena & AWS Glue. Tanzir is a big Real Madrid fan and he loves to travel in his free time.
Extending Amazon Linux 2 with EPEL and Let’s Encrypt
Post Syndicated from Chris Munns original https://aws.amazon.com/blogs/compute/extending-amazon-linux-2-with-epel-and-lets-encrypt/
This post is courtesy of Jeff Levine, Solutions Architect for Amazon Web Services.
Amazon Linux 2 is the next generation of Amazon Linux, a Linux server operating system from Amazon Web Services (AWS). Amazon Linux 2 offers a high-performance Linux environment suitable for organizations of all sizes. It supports applications ranging from small websites to enterprise-class, mission-critical platforms.
Amazon Linux 2 includes support for the LAMP (Linux/Apache/MariaDB/PHP) stack, one of the most popular platforms for deploying websites. To secure the transmission of data-in-transit to such websites and prevent eavesdropping, organizations commonly leverage Secure Sockets Layer/Transport Layer Security (SSL/TLS) services which leverage certificates to provide encryption. The LAMP stack provided by Amazon Linux 2 includes a self-signed SSL/TLS certificate. Such certificates may be fine for internal usage but are not acceptable when attestation by a certificate authority is required.
In this post, I discuss how to extend the capabilities of Amazon Linux 2 by installing Let’s Encrypt, a certificate authority provided by the Internet Security Research Group. Let’s Encrypt offers basic SSL/TLS certificates for DNS hosts at no charge that you can use to add encryption-in-transit to a single web server. For commercial or multi-server configurations, you should consider AWS Certificate Manager and Elastic Load Balancing.
Let’s Encrypt also requires the certbot package, which you install from EPEL, the Extra Packages for Enterprise Linux collection. Although EPEL is not included with Amazon Linux 2, I show how you can install it from the Fedora Project.
Walkthrough
At a high level, you perform the following tasks for this walkthrough:
- Provision a VPC, Amazon Linux 2 instance, and LAMP stack.
- Install and enable the EPEL repository.
- Install and configure Let’s Encrypt.
- Validate the installation.
- Clean up.
Prerequisites and costs
To follow along with this walkthrough, you need the following:
- An AWS account that provides access to Amazon EC2 and Amazon VPC.
- An Amazon EC2 key pair.
- A program such as PuTTY that allows you to connect to the Amazon Linux 2 instance using the SSH protocol.
- Working knowledge of Amazon EC2 and Amazon VPC.
- The ability to configure DNS entries for a host domain.
You may incur charges for the resources you use including, but not limited to, the Amazon EC2 instance and the associated network charges.
Step 1: Provision a VPC, Amazon Linux 2 instance, and LAMP stack
- Create a VPC with a single public subnet, a routing table, and an internet gateway.
- Launch an Amazon Linux 2 instance in the VPC that you just created. Make sure that you do the following:
- Select the Amazon Linux 2 AMI.
- Choose t2.micro for the instance type.
- Accept all other default values including with regard to storage.
- Create a new security group and accept the default rule that allows TCP port 22 (SSH) from everywhere (0.0.0.0/0 in IPv4). For the purposes of this walkthrough, permitting access from all IP addresses is reasonable. In a production environment, you may restrict access to different addresses.
- Allocate and associate an Elastic IP address to the server when it enters the running state.
- Install a LAMP stack.
- Browse to the Elastic IP address that you just created and confirm that you can see the Apache test page, as illustrated below.
Step 2: Install and enable EPEL
- Connect to your Amazon Linux 2 instance at the Elastic IP address that you just created.
- Download and install the EPEL repository using the following commands:
cd /tmp
wget -O epel.rpm -nv \
https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install -y ./epel.rpm
Respond “Y” to all requests for approval to install the software.
Step 3: Install and configure Let’s Encrypt
- If you are no longer connected to the Amazon Linux 2 instance, connect to it at the Elastic IP address that you just created.
- Install certbot, the Let’s Encrypt client to be used to obtain an SSL/TLS certificate and install it into Apache.
sudo yum install python2-certbot-apache.noarch
Respond “Y” to all requests for approval to install the software.
If you see a message appear about SELinux, you can safely ignore it. This is a known issue with the latest version of certbot.
- Create a DNS “A record” that maps a host name to the Elastic IP address. For this post, assume that the name of the host is lamp.example.com. If you are hosting your DNS in Amazon Route 53, do this by creating the appropriate record set.
- After the “A record” has propagated, browse to lamp.example.com. The Apache test page should appear. If the page does not appear, use a tool such as nslookup on your workstation to confirm that the DNS record has been properly configured.
- You are now ready to install Let’s Encrypt. Let’s Encrypt does the following:
- Confirms that you have control over the DNS domain being used, by having you create a DNS TXT record using the value that it provides.
- Obtains an SSL/TLS certificate.
- Modifies the Apache-related scripts to use the SSL/TLS certificate and redirects users browsing the site in HTTP mode to HTTPS mode.
- Use the following command to install certbot:
sudo certbot -i apache -a manual \
--preferred-challenges dns -d lamp.example.com
The options have the following meanings:
- -i apache – Use the Apache installer.
- -a manual – Authenticate domain ownership manually.
- --preferred-challenges dns – Use DNS TXT records for the authentication challenge.
- -d lamp.example.com – Specify the domain for the SSL/TLS certificate.
- You are prompted for the following information:
E-mail address for renewals? Enter an email address for certificate renewals.
Accept the terms of services? Respond as appropriate.
Send your e-mail address to the EFF? Respond as appropriate.
Log your current IP address? Respond as appropriate.
- You are prompted to deploy a DNS TXT record with the name “_acme-challenge.lamp.example.com” with the supplied value, as shown below.
- After you enter the record, wait until the TXT record propagates. To look up the TXT record to confirm the deployment, use the nslookup command in a separate command window, as shown below. Remember to use the set type=txt command before entering the TXT record.
You are prompted to select a virtual host. There is only one, so choose 1. The final prompt asks whether to redirect HTTP traffic to HTTPS. To perform the redirection, choose 2. That completes the configuration of Let’s Encrypt.
- To enable HTTPS (TCP port 443) traffic, add a rule to the security group for your Amazon Linux 2 instance.
Step 4: Validate the installation
- Browse to the http://lamp.example.com site. You are redirected to the SSL/TLS page at https://lamp.example.com.
- To look at the encryption information, use the appropriate actions within your browser. For example, in Firefox, you can open the padlock and traverse the menus.
In the encryption technical details, you can see from the “Connection Encrypted” line that traffic to the website is now encrypted using TLS 1.2.
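If you want to confirm the negotiated protocol from a script rather than the browser, here is a minimal Python sketch using only the standard library; the host name lamp.example.com is the example domain from this walkthrough, so substitute your own.

import socket
import ssl

# The example domain from this walkthrough; replace with your own host name.
HOST = "lamp.example.com"

context = ssl.create_default_context()
with socket.create_connection((HOST, 443)) as sock:
    with context.wrap_socket(sock, server_hostname=HOST) as tls:
        # Prints the negotiated protocol, for example "TLSv1.2"
        print(tls.version())

A result of TLSv1.2 (or later) confirms that the Let’s Encrypt certificate and protocol settings are in effect.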
Security note: As of the time of publication, this website also supports TLS 1.0. I recommend that you disable this protocol because of some known vulnerabilities associated with it. To do this:
- Edit the file /etc/letsencrypt/options-ssl-apache.conf.
- Look for the line beginning with SSLProtocol and change it to the following:
SSLProtocol all -SSLv2 -SSLv3 -TLSv1
- Save the file. After you make changes to this file, Let’s Encrypt no longer automatically updates it. Periodically check your log files for recommended updates to this file.
- Restart the httpd server with the following command:
sudo service httpd restart
Step 5: Cleanup
Use the following steps to avoid incurring any further costs.
- Terminate the Amazon Linux 2 instance that you created.
- Release the Elastic IP address that you allocated.
- Revert any DNS changes that you made, including the A and TXT records.
Conclusion
Amazon Linux 2 is an excellent option for hosting websites through the LAMP stack provided by the Amazon-Linux-Extras feature. You can then enhance the security of the Apache web server by installing EPEL and Let’s Encrypt. Let’s Encrypt provisions an SSL/TLS certificate, optionally installs it for you on the Apache server, and enables data-in-transit encryption.
You can get started with Amazon Linux 2 in just a few clicks.
From Framework to Function: Deploying AWS Lambda Functions for Java 8 using Apache Maven Archetype
Post Syndicated from Ryosuke Iwanaga original https://aws.amazon.com/blogs/compute/from-framework-to-function-deploying-aws-lambda-functions-for-java-8-using-apache-maven-archetype/
As a serverless computing platform that supports Java 8 runtime, AWS Lambda makes it easy to run any type of Java function simply by uploading a JAR file. To help define not only a Lambda serverless application but also Amazon API Gateway, Amazon DynamoDB, and other related services, the AWS Serverless Application Model (SAM) allows developers to use a simple AWS CloudFormation template.
AWS provides the AWS Toolkit for Eclipse that supports both Lambda and SAM. AWS also gives customers an easy way to create Lambda functions and SAM applications in Java using the AWS Command Line Interface (AWS CLI). After you build a JAR file, all you have to do is type the following commands:
aws cloudformation package
aws cloudformation deploy
To consolidate these steps, customers can use Archetype by Apache Maven. Archetype uses a predefined package template that makes it exceptionally simple to get started developing a function.
In this post, I introduce a Maven archetype that allows you to create a skeleton of AWS SAM for a Java function. Using this archetype, you can generate a sample Java code example and an accompanying SAM template to deploy it on AWS Lambda with a single Maven action.
Prerequisites
Make sure that the following software is installed on your workstation:
- Java
- Maven
- AWS CLI
- (Optional) AWS SAM CLI
Install Archetype
After you’ve set up those packages, install Archetype with the following commands:
git clone https://github.com/awslabs/aws-serverless-java-archetype
cd aws-serverless-java-archetype
mvn install
These are one-time operations, so you don’t run them for every new package. If you’d like, you can add Archetype to your company’s Maven repository so that other developers can use it later.
With those packages installed, you’re ready to develop your new Lambda function.
Start a project
Now that you have the archetype, customize it and run the code:
cd /path/to/project_home
mvn archetype:generate \
  -DarchetypeGroupId=com.amazonaws.serverless.archetypes \
  -DarchetypeArtifactId=aws-serverless-java-archetype \
  -DarchetypeVersion=1.0.0 \
  -DarchetypeRepository=local \ # Forcing to use local maven repository
  -DinteractiveMode=false \ # For batch mode
  # You can also specify properties below interactively if you omit the line for batch mode
  -DgroupId=YOUR_GROUP_ID \
  -DartifactId=YOUR_ARTIFACT_ID \
  -Dversion=YOUR_VERSION \
  -DclassName=YOUR_CLASSNAME
You should have a directory called YOUR_ARTIFACT_ID that contains the files and folders shown below:
├── event.json
├── pom.xml
├── src
│   └── main
│       ├── java
│       │   └── Package
│       │       └── Example.java
│       └── resources
│           └── log4j2.xml
└── template.yaml
The sample code is a working example. If you have installed the SAM CLI, you can invoke the function locally with the commands below:
cd YOUR_ARTIFACT_ID
mvn -P invoke verify
[INFO] Scanning for projects...
[INFO]
[INFO] ---------------------------< com.riywo:foo >----------------------------
[INFO] Building foo 1.0
[INFO] --------------------------------[ jar ]---------------------------------
...
[INFO] --- maven-jar-plugin:3.0.2:jar (default-jar) @ foo ---
[INFO] Building jar: /private/tmp/foo/target/foo-1.0.jar
[INFO]
[INFO] --- maven-shade-plugin:3.1.0:shade (shade) @ foo ---
[INFO] Including com.amazonaws:aws-lambda-java-core:jar:1.2.0 in the shaded jar.
[INFO] Replacing /private/tmp/foo/target/lambda.jar with /private/tmp/foo/target/foo-1.0-shaded.jar
[INFO]
[INFO] --- exec-maven-plugin:1.6.0:exec (sam-local-invoke) @ foo ---
2018/04/06 16:34:35 Successfully parsed template.yaml
2018/04/06 16:34:35 Connected to Docker 1.37
2018/04/06 16:34:35 Fetching lambci/lambda:java8 image for java8 runtime...
java8: Pulling from lambci/lambda
Digest: sha256:14df0a5914d000e15753d739612a506ddb8fa89eaa28dcceff5497d9df2cf7aa
Status: Image is up to date for lambci/lambda:java8
2018/04/06 16:34:37 Invoking Package.Example::handleRequest (java8)
2018/04/06 16:34:37 Decompressing /tmp/foo/target/lambda.jar
2018/04/06 16:34:37 Mounting /private/var/folders/x5/ldp7c38545v9x5dg_zmkr5kxmpdprx/T/aws-sam-local-1523000077594231063 as /var/task:ro inside runtime container
START RequestId: a6ae19fe-b1b0-41e2-80bc-68a40d094d74 Version: $LATEST
Log output: Greeting is 'Hello Tim Wagner.'
END RequestId: a6ae19fe-b1b0-41e2-80bc-68a40d094d74
REPORT RequestId: a6ae19fe-b1b0-41e2-80bc-68a40d094d74 Duration: 96.60 ms Billed Duration: 100 ms Memory Size: 128 MB Max Memory Used: 7 MB
{"greetings":"Hello Tim Wagner."}
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10.452 s
[INFO] Finished at: 2018-04-06T16:34:40+09:00
[INFO] ------------------------------------------------------------------------
This Maven goal invokes sam local invoke -e event.json, so you can see the sample output to greet Tim Wagner.
To deploy this application to AWS, you need an Amazon S3 bucket to upload your package. You can use the following command to create a bucket if you want:
aws s3 mb s3://YOUR_BUCKET --region YOUR_REGION
Now, you can deploy your application by just one command!
mvn deploy \
  -DawsRegion=YOUR_REGION \
  -Ds3Bucket=YOUR_BUCKET \
  -DstackName=YOUR_STACK
[INFO] Scanning for projects...
[INFO]
[INFO] ---------------------------< com.riywo:foo >----------------------------
[INFO] Building foo 1.0
[INFO] --------------------------------[ jar ]---------------------------------
...
[INFO] --- exec-maven-plugin:1.6.0:exec (sam-package) @ foo ---
Uploading to aws-serverless-java/com.riywo:foo:1.0/924732f1f8e4705c87e26ef77b080b47  11657 / 11657.0  (100.00%)
Successfully packaged artifacts and wrote output template to file target/sam.yaml.
Execute the following command to deploy the packaged template
aws cloudformation deploy --template-file /private/tmp/foo/target/sam.yaml --stack-name <YOUR STACK NAME>
[INFO]
[INFO] --- maven-deploy-plugin:2.8.2:deploy (default-deploy) @ foo ---
[INFO] Skipping artifact deployment
[INFO]
[INFO] --- exec-maven-plugin:1.6.0:exec (sam-deploy) @ foo ---
Waiting for changeset to be created..
Waiting for stack create/update to complete
Successfully created/updated stack - archetype
[INFO] ------------------------------------------------------------------------
[INFO] BUILD SUCCESS
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 37.176 s
[INFO] Finished at: 2018-04-06T16:41:02+09:00
[INFO] ------------------------------------------------------------------------
Maven automatically creates a shaded JAR file, uploads it to your S3 bucket, replaces template.yaml, and creates and updates the CloudFormation stack.
To customize the process, modify the pom.xml file. For example, to avoid typing values for awsRegion, s3Bucket, or stackName, write them inside pom.xml and check it in to your VCS. Afterward, you and the rest of your team can deploy the function by typing just the following command:
mvn deploy
Options
The Lambda Java 8 runtime supports several types of handlers: POJO, simple type, and stream. The default option of this archetype is the POJO style, which requires you to create request and response classes; these are generated by the archetype by default. If you want to use another type of handler, you can use the handlerType property as shown below:
## POJO type (default)
mvn archetype:generate \
  ...
  -DhandlerType=pojo

## Simple type - String
mvn archetype:generate \
  ...
  -DhandlerType=simple

## Stream type
mvn archetype:generate \
  ...
  -DhandlerType=stream
See the documentation for more details about handlers.
Also, the Lambda Java 8 runtime supports two types of logging classes: Log4j 2 and LambdaLogger. This archetype creates a LambdaLogger implementation by default, but you can use Log4j 2 if you want:
## LambdaLogger (default)
mvn archetype:generate \
  ...
  -Dlogger=lambda

## Log4j 2
mvn archetype:generate \
  ...
  -Dlogger=log4j2
If you use LambdaLogger, you can delete ./src/main/resources/log4j2.xml. See the documentation for more details.
Conclusion
So, what’s next? Develop your Lambda function locally and type the following command: mvn deploy!
With this archetype code example, available in the GitHub repo, you should be able to deploy Lambda functions for Java 8 in a snap. If you have any questions or comments, please submit them below or leave them on GitHub.
Analyze Apache Parquet optimized data using Amazon Kinesis Data Firehose, Amazon Athena, and Amazon Redshift
Post Syndicated from Roy Hasson original https://aws.amazon.com/blogs/big-data/analyzing-apache-parquet-optimized-data-using-amazon-kinesis-data-firehose-amazon-athena-and-amazon-redshift/
Amazon Kinesis Data Firehose is the easiest way to capture and stream data into a data lake built on Amazon S3. This data can be anything—from AWS service logs like AWS CloudTrail log files, Amazon VPC Flow Logs, Application Load Balancer logs, and others. It can also be IoT events, game events, and much more. To efficiently query this data, a time-consuming ETL (extract, transform, and load) process is required to massage and convert the data to an optimal file format, which increases the time to insight. This situation is less than ideal, especially for real-time data that loses its value over time.
To solve this common challenge, Kinesis Data Firehose can now save data to Amazon S3 in Apache Parquet or Apache ORC format. These are optimized columnar formats that are highly recommended for best performance and cost-savings when querying data in S3. This feature directly benefits you if you use Amazon Athena, Amazon Redshift, AWS Glue, Amazon EMR, or any other big data tools that are available from the AWS Partner Network and through the open-source community.
Amazon Connect is a simple-to-use, cloud-based contact center service that makes it easy for any business to provide a great customer experience at a lower cost than common alternatives. Its open platform design enables easy integration with other systems. One of those systems is Amazon Kinesis—in particular, Kinesis Data Streams and Kinesis Data Firehose.
What’s really exciting is that you can now save events from Amazon Connect to S3 in Apache Parquet format. You can then perform analytics using Amazon Athena and Amazon Redshift Spectrum in real time, taking advantage of this key performance and cost optimization. Of course, Amazon Connect is only one example. This new capability opens the door for a great deal of opportunity, especially as organizations continue to build their data lakes.
Amazon Connect includes an array of analytics views in the Administrator dashboard. But you might want to run other types of analysis. In this post, I describe how to set up a data stream from Amazon Connect through Kinesis Data Streams and Kinesis Data Firehose and out to S3, and then perform analytics using Athena and Amazon Redshift Spectrum. I focus primarily on the Kinesis Data Firehose support for Parquet and its integration with the AWS Glue Data Catalog, Amazon Athena, and Amazon Redshift.
Solution overview
Here is how the solution is laid out:
The following sections walk you through each of these steps to set up the pipeline.
1. Define the schema
When Kinesis Data Firehose processes incoming events and converts the data to Parquet, it needs to know which schema to apply. The reason is that incoming events often contain all or only some of the expected fields, depending on which values the producers send. A typical process is to normalize the schema during a batch ETL job so that you end up with a consistent schema that can easily be understood and queried. Doing this introduces latency due to the nature of the batch process. To overcome this issue, Kinesis Data Firehose requires the schema to be defined in advance.
To see the available columns and structures, see Amazon Connect Agent Event Streams. For the purpose of simplicity, I opted to make all the columns of type String rather than create the nested structures. But you can definitely do that if you want.
The simplest way to define the schema is to create a table in the Amazon Athena console. Open the Athena console, and paste the following create table statement, substituting your own S3 bucket and prefix for where your event data will be stored. A Data Catalog database is a logical container that holds the different tables that you can create. The default database name shown here should already exist. If it doesn’t, you can create it or use another database that you’ve already created.
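The full create table statement from the original post is not reproduced in this archive. Purely as a hedged sketch of the idea, the following Python snippet uses boto3 to submit a minimal version of such a DDL statement to Athena; the column names are a small illustrative subset of the Amazon Connect agent event fields, and the database, bucket, and prefix values are placeholders to replace with your own.

import boto3

athena = boto3.client("athena")

# Placeholder bucket, prefix, and results location; the columns are an illustrative
# subset of the Amazon Connect agent event fields, all typed as string.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS default.connect_agent_events (
  awsaccountid string,
  eventid string,
  eventtype string,
  eventtimestamp string,
  instancearn string
)
STORED AS PARQUET
LOCATION 's3://your-bucket/your-prefix/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-query-results/"},
)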
That’s all you have to do to prepare the schema for Kinesis Data Firehose.
2. Define the data streams
Next, you need to define the Kinesis data streams that will be used to stream the Amazon Connect events. Open the Kinesis Data Streams console and create two streams. You can configure them with only one shard each because you don’t have a lot of data right now.
3. Define the Kinesis Data Firehose delivery stream for Parquet
Let’s configure the Data Firehose delivery stream using the data stream as the source and Amazon S3 as the output. Start by opening the Kinesis Data Firehose console and creating a new data delivery stream. Give it a name, and associate it with the Kinesis data stream that you created in Step 2.
As shown in the following screenshot, enable Record format conversion (1) and choose Apache Parquet (2). As you can see, Apache ORC is also supported. Scroll down and provide the AWS Glue Data Catalog database name (3) and table names (4) that you created in Step 1. Choose Next.
To make things easier, the output S3 bucket and prefix fields are automatically populated using the values that you defined in the LOCATION parameter of the create table statement from Step 1. Pretty cool. Additionally, you have the option to save the raw events into another location as defined in the Source record S3 backup section. Don’t forget to add a trailing forward slash (“/”) so that Data Firehose creates the date partitions inside that prefix.
On the next page, in the S3 buffer conditions section, there is a note about configuring a large buffer size. The Parquet file format is highly efficient in how it stores and compresses data. Increasing the buffer size allows you to pack more rows into each output file, which is preferred and gives you the most benefit from Parquet.
Compression using Snappy is automatically enabled for both Parquet and ORC. You can modify the compression algorithm by using the Kinesis Data Firehose API and update the OutputFormatConfiguration.
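As a hedged sketch only (verify the exact parameter nesting against the current Firehose API reference before relying on it), switching the Parquet compression from the default Snappy to GZIP with boto3 could look like the following; the delivery stream name is a placeholder.

import boto3

firehose = boto3.client("firehose")
stream_name = "your-delivery-stream"  # placeholder

# update_destination needs the current version and destination IDs.
desc = firehose.describe_delivery_stream(DeliveryStreamName=stream_name)["DeliveryStreamDescription"]

firehose.update_destination(
    DeliveryStreamName=stream_name,
    CurrentDeliveryStreamVersionId=desc["VersionId"],
    DestinationId=desc["Destinations"][0]["DestinationId"],
    ExtendedS3DestinationUpdate={
        "DataFormatConversionConfiguration": {
            "OutputFormatConfiguration": {
                # Assumed nesting; Snappy is the default, GZIP trades CPU for smaller files.
                "Serializer": {"ParquetSerDe": {"Compression": "GZIP"}}
            }
        }
    },
)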
Be sure to also enable Amazon CloudWatch Logs so that you can debug any issues that you might run into.
Lastly, finalize the creation of the Firehose delivery stream, and continue on to the next section.
4. Set up the Amazon Connect contact center
After setting up the Kinesis pipeline, you now need to set up a simple contact center in Amazon Connect. The Getting Started page provides clear instructions on how to set up your environment, acquire a phone number, and create an agent to accept calls.
After setting up the contact center, in the Amazon Connect console, choose your Instance Alias, and then choose Data Streaming. Under Agent Event, choose the Kinesis data stream that you created in Step 2, and then choose Save.
At this point, your pipeline is complete. Agent events from Amazon Connect are generated as agents go about their day. Events are sent via Kinesis Data Streams to Kinesis Data Firehose, which converts the event data from JSON to Parquet and stores it in S3. Athena and Amazon Redshift Spectrum can simply query the data without any additional work.
So let’s generate some data. Go back into the Administrator console for your Amazon Connect contact center, and create an agent to handle incoming calls. In this example, I creatively named mine Agent One. After it is created, Agent One can get to work and log into their console and set their availability to Available so that they are ready to receive calls.
To make the data a bit more interesting, I also created a second agent, Agent Two. I then made some incoming and outgoing calls and caused some failures to occur, so I now have enough data available to analyze.
5. Analyze the data with Athena
Let’s open the Athena console and run some queries. One thing you’ll notice is that when we created the schema for the dataset, we defined some of the fields as Strings even though in the documentation they were complex structures. The reason for doing that was simply to show the flexibility of Athena in parsing JSON data. However, you can define nested structures in your table schema so that Kinesis Data Firehose applies the appropriate schema to the Parquet file.
Let’s run the first query to see which agents have logged into the system.
The query might look complex, but it’s fairly straightforward:
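The original SQL is not reproduced in this archive. Purely as a hedged illustration of the idea, the following snippet submits a simplified login report from Python; the table and column names follow the sketch shown earlier and are assumptions, not the post's exact query.

import boto3

athena = boto3.client("athena")

# Assumed table and column names (see the schema sketch above); adjust to your own.
sql = """
SELECT awsaccountid,
       eventtype,
       eventtimestamp
FROM default.connect_agent_events
WHERE eventtype = 'LOGIN'
ORDER BY eventtimestamp
"""

response = athena.start_query_execution(
    QueryString=sql,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-query-results/"},
)
print(response["QueryExecutionId"])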
The query output looks something like this:
Here is another query that shows the sessions each of the agents engaged with. It tells us whether they were incoming or outgoing, whether they were completed, and whether there were missed or failed calls.
The query output looks similar to the following:
6. Analyze the data with Amazon Redshift Spectrum
With Amazon Redshift Spectrum, you can query data directly in S3 using your existing Amazon Redshift data warehouse cluster. Because the data is already in Parquet format, Redshift Spectrum gets the same great benefits that Athena does.
Here is a simple query to show querying the same data from Amazon Redshift. Note that to do this, you need to first create an external schema in Amazon Redshift that points to the AWS Glue Data Catalog.
The following shows the query output:
Summary
In this post, I showed you how to use Kinesis Data Firehose to ingest and convert data to columnar file format, enabling real-time analysis using Athena and Amazon Redshift. This great feature enables a level of optimization in both cost and performance that you need when storing and analyzing large amounts of data. This feature is equally important if you are investing in building data lakes on AWS.
Additional Reading
If you found this post useful, be sure to check out Analyzing VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight and Work with partitioned data in AWS Glue.
About the Author
Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics and business intelligence needs. Roy is big Manchester United fan cheering his team on and hanging out with his family.
Security updates for Thursday
Post Syndicated from ris original https://lwn.net/Articles/754145/rss
Security updates have been issued by Arch Linux (freetype2, libraw, and powerdns), CentOS (389-ds-base and kernel), Debian (php5, prosody, and wavpack), Fedora (ckeditor, fftw, flac, knot-resolver, patch, perl, and perl-Dancer2), Mageia (cups, flac, graphicsmagick, libcdio, libid3tag, and nextcloud), openSUSE (apache2), Oracle (389-ds-base and kernel), Red Hat (389-ds-base and flash-plugin), Scientific Linux (389-ds-base), Slackware (firefox and wget), SUSE (xen), and Ubuntu (wget).
Use AWS Glue to run ETL jobs against non-native JDBC data sources
Post Syndicated from Kapil Shardha original https://aws.amazon.com/blogs/big-data/use-aws-glue-to-run-etl-jobs-against-non-native-jdbc-data-sources/
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easier to prepare and load your data for analytics. You can create and run an ETL job with a few clicks on the AWS Management Console. Just point AWS Glue to your data store. AWS Glue discovers your data and stores the associated metadata (for example, a table definition and schema) in the AWS Glue Data Catalog.
AWS Glue has native connectors to data sources using JDBC drivers, either on AWS or elsewhere, as long as there is IP connectivity. In this post, we demonstrate how to connect to data sources that are not natively supported in AWS Glue today. We walk through connecting to and running ETL jobs against two such data sources, IBM DB2 and SAP Sybase. However, you can use the same process with any other JDBC-accessible database.
AWS Glue data sources
AWS Glue natively supports the following data stores by using the JDBC protocol:
- Publicly accessible databases
- Amazon Aurora
- MariaDB
- Microsoft SQL Server
- MySQL
- Oracle
- PostgreSQL
For more information, see Adding a Connection to Your Data Store in the AWS Glue Developer Guide.
One of the fastest growing architectures deployed on AWS is the data lake. The ETL processes that are used to ingest, clean, transform, and structure data are critically important for this architecture. Having the flexibility to interoperate with a broader range of database engines allows for a quicker adoption of the data lake architecture.
For data sources that AWS Glue doesn’t natively support, such as IBM DB2, Pivotal Greenplum, SAP Sybase, or any other relational database management system (RDBMS), you can import custom database connectors from Amazon S3 into AWS Glue jobs. In this case, the connection to the data source must be made from the AWS Glue script to extract the data, rather than using AWS Glue connections. To learn more, see Providing Your Own Custom Scripts in the AWS Glue Developer Guide.
Setting up an ETL job for an IBM DB2 data source
The first example demonstrates how to connect the AWS Glue ETL job to an IBM DB2 instance, transform the data from the source, and store it in Apache Parquet format in Amazon S3. To successfully create the ETL job using an external JDBC driver, you must define the following:
- The S3 location of the job script
- The S3 location of the temporary directory
- The S3 location of the JDBC driver
- The S3 location of the Parquet data (output)
- The IAM role for the job
By default, AWS Glue suggests bucket names for the scripts and the temporary directory using the following format:
For JDBC drivers, you can create a similar location:
Also for the Parquet data (output), you can create a similar location:
Keep in mind that having the AWS Glue job and S3 buckets in the same AWS Region helps save on cross-Region data transfer fees. For this post, we will work in the US East (Ohio) Region (us-east-2).
Creating the IAM role
The next step is to set up the IAM role that the ETL job will use:
- Sign in to the AWS Management Console, and search for IAM:
- On the IAM console, choose Roles in the left navigation pane.
- Choose Create role. The role type of trusted entity must be an AWS service, specifically AWS Glue.
- Choose Next: Permissions.
- Search for the AWSGlueServiceRole policy, and select it.
- Search again, now for the SecretsManagerReadWrite policy, and select it. This policy allows the AWS Glue job to access database credentials that are stored in AWS Secrets Manager.
CAUTION: This policy is open and is being used for testing purposes only. You should create a custom policy to narrow the access just to the secrets that you want to use in the ETL job.
- Select this policy, and choose Next: Review.
- Give your role a name, for example, GluePermissions, and confirm that both policies were selected.
- Choose Create role.
Now that you have created the IAM role, it’s time to upload the JDBC driver to the defined location in Amazon S3. For this example, we will use the DB2 driver, which is available on the IBM Support site.
Storing database credentials
It is a best practice to store database credentials in a safe store. In this case, we use AWS Secrets Manager to securely store credentials. Follow these steps to create those credentials:
- Open the console, and search for Secrets Manager.
- In the AWS Secrets Manager console, choose Store a new secret.
- Under Select a secret type, choose Other type of secrets.
- In the Secret key/value, set one row for each of the following parameters:
- db_username
- db_password
- db_url (for example, jdbc:db2://10.10.12.12:50000/SAMPLE)
- db_table
- driver_name (com.ibm.db2.jcc.DB2Driver)
- output_bucket: (for example, aws-glue-data-output-1234567890-us-east-2/User)
- Choose Next.
- For Secret name, use DB2_Database_Connection_Info.
- Choose Next.
- Keep the Disable automatic rotation check box selected.
- Choose Next.
- Choose Store.
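If you prefer to script these steps instead of using the console, a hedged boto3 equivalent looks like the following; all credential values are placeholders, and the table name is illustrative only.

import json
import boto3

secretsmanager = boto3.client("secretsmanager")

# All values are placeholders; db_table is an illustrative table name.
secretsmanager.create_secret(
    Name="DB2_Database_Connection_Info",
    SecretString=json.dumps(
        {
            "db_username": "db2admin",
            "db_password": "REPLACE_ME",
            "db_url": "jdbc:db2://10.10.12.12:50000/SAMPLE",
            "db_table": "EMPLOYEE",
            "driver_name": "com.ibm.db2.jcc.DB2Driver",
            "output_bucket": "aws-glue-data-output-1234567890-us-east-2/User",
        }
    ),
)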
Adding a job in AWS Glue
The next step is to author the AWS Glue job, following these steps:
- In the AWS Management Console, search for AWS Glue.
- In the navigation pane on the left, choose Jobs under the ETL section.
- Choose Add job.
- Fill in the basic Job properties:
- Give the job a name (for example, db2-job).
- Choose the IAM role that you created previously (GluePermissions).
- For This job runs, choose A new script to be authored by you.
- For ETL language, choose Python.
- In the Script libraries and job parameters section, choose the location of your JDBC driver for Dependent jars path.
- Choose Next.
- On the Connections page, choose Next.
- On the summary page, choose Save job and edit script. This creates the job and opens the script editor.
In the editor, replace the existing code with the following script. Important: Line 47 of the script corresponds to the mapping of the fields in the source table to the destination, dropping of the null fields to save space in the Parquet destination, and finally writing to Amazon S3 in Parquet format.
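The full script from the original post is not reproduced in this archive. The following is a minimal, hedged sketch of what such a Glue job can look like, assuming the secret name and keys defined earlier; treat it as an illustration of the approach rather than the exact script, so the line 47 reference above applies to the original, not to this sketch.

import json
import sys

import boto3
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.transforms import DropNullFields
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# Read the connection details stored earlier in AWS Secrets Manager.
secret = boto3.client("secretsmanager").get_secret_value(
    SecretId="DB2_Database_Connection_Info"
)
conf = json.loads(secret["SecretString"])

# Read the source table over JDBC, using the external driver supplied as a dependent JAR.
df = (
    spark.read.format("jdbc")
    .option("url", conf["db_url"])
    .option("dbtable", conf["db_table"])
    .option("user", conf["db_username"])
    .option("password", conf["db_password"])
    .option("driver", conf["driver_name"])
    .load()
)

df.printSchema()
print("Record count: {}".format(df.count()))

# Drop null fields to save space and write the result to S3 in Parquet format.
dyf = DropNullFields.apply(frame=DynamicFrame.fromDF(df, glue_context, "source"))
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type="s3",
    connection_options={"path": "s3://{}/".format(conf["output_bucket"])},
    format="parquet",
)

job.commit()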
- Choose Save.
- Choose the black X on the right side of the screen to close the editor.
Running the ETL job
Now that you have created the job, the next step is to execute it as follows:
- On the Jobs page, select your new job. On the Action menu, choose Run job, and confirm that you want to run the job. Wait a few moments as it finishes the execution.
- After the job shows as Succeeded, choose Logs to read the output of the job.
- In the output of the job, you will find the result of executing df.printSchema() and the message with the df.count() value.
Also, if you go to your output bucket in S3, you will find the Parquet result of the ETL job.
Using AWS Glue, you have created an ETL job that connects to an existing database using an external JDBC driver. It enables you to execute any transformation that you need.
Setting up an ETL job for an SAP Sybase data source
In this section, we describe how to create an AWS Glue ETL job against an SAP Sybase data source. The process mentioned in the previous section works for a Sybase data source with a few changes required in the job:
- While creating the job, choose the correct JAR for the JDBC dependency.
- In the script, change the reference to the secret to be used from AWS Secrets Manager:
And change the mapping of the fields in line 47, as follows:
After you successfully execute the new ETL job, the output contains the same type of information that was generated with the DB2 data source.
Note that each of these JDBC drivers has its own nuances and different licensing terms that you should be aware of before using them.
Maximizing JDBC read parallelism
Something to keep in mind while working with big data sources is memory consumption. In some cases, “Out of Memory” errors are generated when all the data is read into a single executor. One approach to optimizing this is to rely on the read parallelism that you can implement with Apache Spark and AWS Glue. To learn more, see the Apache Spark SQL module.
You can use the following options:
- partitionColumn: The name of an integer column that is used for partitioning.
- lowerBound: The minimum value of partitionColumn that is used to decide partition stride.
- upperBound: The maximum value of partitionColumn that is used to decide partition stride.
- numPartitions: The number of partitions. This, along with lowerBound (inclusive) and upperBound (exclusive), forms the partition strides for the generated WHERE clause expressions used to split the partitionColumn. When unset, this defaults to SparkContext.defaultParallelism.
- Those options specify the parallelism of the table read. lowerBound and upperBound decide the partition stride, but they don’t filter the rows in the table. Therefore, Spark partitions and returns all rows in the table. For example:
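Here is a hedged PySpark sketch of such a partitioned read, reusing the spark session and the conf dictionary from the job sketch shown earlier; the partition column and bounds are illustrative and should match an integer column in your own table.

# Reuses the spark session and the conf dictionary from the job sketch above.
df = (
    spark.read.format("jdbc")
    .option("url", conf["db_url"])
    .option("dbtable", conf["db_table"])
    .option("user", conf["db_username"])
    .option("password", conf["db_password"])
    .option("driver", conf["driver_name"])
    .option("partitionColumn", "employee_id")  # illustrative integer column
    .option("lowerBound", "1")
    .option("upperBound", "1000000")
    .option("numPartitions", "10")
    .load()
)

Each of the ten partitions issues its own bounded query against the source database, so keep numPartitions well below what the database can serve comfortably.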
It’s important to be careful with the number of partitions because too many partitions could also result in Spark crashing your external database systems.
Conclusion
Using the process described in this post, you can connect to and run AWS Glue ETL jobs against any data source that can be reached using a JDBC driver. This includes new generations of common analytical databases like Greenplum and others.
You can improve the query efficiency of these datasets by using partitioning and pushdown predicates. For more information, see Managing Partitions for ETL Output in AWS Glue. This technique opens the door to moving data and feeding data lakes in hybrid environments.
Additional Reading
If you found this post useful, be sure to check out Work with partitioned data in AWS Glue.
About the Authors
Kapil Shardha is a Technical Account Manager and supports enterprise customers with their AWS adoption. He has a background in infrastructure automation and DevOps.
William Torrealba is an AWS Solutions Architect supporting customers with their AWS adoption. He has a background in application development, highly available distributed systems, automation, and DevOps.
AWS Online Tech Talks – May and Early June 2018
Post Syndicated from Devin Watson original https://aws.amazon.com/blogs/aws/aws-online-tech-talks-may-and-early-june-2018/
AWS Online Tech Talks – May and Early June 2018
Join us this month to learn about some of the exciting new services and solution best practices at AWS. We also have our first re:Invent 2018 webinar series, “How to re:Invent”. Sign up now to learn more; we look forward to seeing you.
Note – All sessions are free and in Pacific Time.
Tech talks featured this month:
Analytics & Big Data
May 21, 2018 | 11:00 AM – 11:45 AM PT – Integrating Amazon Elasticsearch with your DevOps Tooling – Learn how you can easily integrate Amazon Elasticsearch Service into your DevOps tooling and gain valuable insight from your log data.
May 23, 2018 | 11:00 AM – 11:45 AM PT – Data Warehousing and Data Lake Analytics, Together – Learn how to query data across your data warehouse and data lake without moving data.
May 24, 2018 | 11:00 AM – 11:45 AM PT – Data Transformation Patterns in AWS – Discover how to perform common data transformations on the AWS Data Lake.
Compute
May 29, 2018 | 01:00 PM – 01:45 PM PT – Creating and Managing a WordPress Website with Amazon Lightsail – Learn about Amazon Lightsail and how you can create, run and manage your WordPress websites with Amazon’s simple compute platform.
May 30, 2018 | 01:00 PM – 01:45 PM PT – Accelerating Life Sciences with HPC on AWS – Learn how you can accelerate your Life Sciences research workloads by harnessing the power of high performance computing on AWS.
Containers
May 24, 2018 | 01:00 PM – 01:45 PM PT – Building Microservices with the 12 Factor App Pattern on AWS – Learn best practices for building containerized microservices on AWS, and how traditional software design patterns evolve in the context of containers.
Databases
May 21, 2018 | 01:00 PM – 01:45 PM PT – How to Migrate from Cassandra to Amazon DynamoDB – Get the benefits, best practices and guides on how to migrate your Cassandra databases to Amazon DynamoDB.
May 23, 2018 | 01:00 PM – 01:45 PM PT – 5 Hacks for Optimizing MySQL in the Cloud – Learn how to optimize your MySQL databases for high availability, performance, and disaster resilience using RDS.
DevOps
May 23, 2018 | 09:00 AM – 09:45 AM PT – .NET Serverless Development on AWS – Learn how to build a modern serverless application in .NET Core 2.0.
Enterprise & Hybrid
May 22, 2018 | 11:00 AM – 11:45 AM PT – Hybrid Cloud Customer Use Cases on AWS – Learn how customers are leveraging AWS hybrid cloud capabilities to easily extend their datacenter capacity, deliver new services and applications, and ensure business continuity and disaster recovery.
IoT
May 31, 2018 | 11:00 AM – 11:45 AM PT – Using AWS IoT for Industrial Applications – Discover how you can quickly onboard your fleet of connected devices, keep them secure, and build predictive analytics with AWS IoT.
Machine Learning
May 22, 2018 | 09:00 AM – 09:45 AM PT – Using Apache Spark with Amazon SageMaker – Discover how to use Apache Spark with Amazon SageMaker for training jobs and application integration.
May 24, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS DeepLens – Learn how AWS DeepLens provides a new way for developers to learn machine learning by pairing the physical device with a broad set of tutorials, examples, source code, and integration with familiar AWS services.
Management Tools
May 21, 2018 | 09:00 AM – 09:45 AM PT – Gaining Better Observability of Your VMs with Amazon CloudWatch – Learn how CloudWatch Agent makes it easy for customers like Rackspace to monitor their VMs.
Mobile
May 29, 2018 | 11:00 AM – 11:45 AM PT – Deep Dive on Amazon Pinpoint Segmentation and Endpoint Management – See how segmentation and endpoint management with Amazon Pinpoint can help you target the right audience.
Networking
May 31, 2018 | 09:00 AM – 09:45 AM PT – Making Private Connectivity the New Norm via AWS PrivateLink – See how PrivateLink enables service owners to offer private endpoints to customers outside their company.
Security, Identity, & Compliance
May 30, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS Certificate Manager Private Certificate Authority (CA) – Learn how AWS Certificate Manager (ACM) Private Certificate Authority (CA), a managed private CA service, helps you easily and securely manage the lifecycle of your private certificates.
June 1, 2018 | 09:00 AM – 09:45 AM PT – Introducing AWS Firewall Manager – Centrally configure and manage AWS WAF rules across your accounts and applications.
Serverless
May 22, 2018 | 01:00 PM – 01:45 PM PT – Building API-Driven Microservices with Amazon API Gateway – Learn how to build a secure, scalable API for your application in our tech talk about API-driven microservices.
Storage
May 30, 2018 | 11:00 AM – 11:45 AM PT – Accelerate Productivity by Computing at the Edge – Learn how AWS Snowball Edge support for compute instances helps accelerate data transfers, execute custom applications, and reduce overall storage costs.
June 1, 2018 | 11:00 AM – 11:45 AM PT – Learn to Build a Cloud-Scale Website Powered by Amazon EFS – Technical deep dive where you’ll learn tips and tricks for integrating WordPress, Drupal and Magento with Amazon EFS.
Analyze data in Amazon DynamoDB using Amazon SageMaker for real-time prediction
Post Syndicated from YongSeong Lee original https://aws.amazon.com/blogs/big-data/analyze-data-in-amazon-dynamodb-using-amazon-sagemaker-for-real-time-prediction/
Many companies across the globe use Amazon DynamoDB to store and query historical user-interaction data. DynamoDB is a fast NoSQL database used by applications that need consistent, single-digit millisecond latency.
Often, customers want to turn their valuable data in DynamoDB into insights by analyzing a copy of their table stored in Amazon S3. Doing this separates their analytical queries from their low-latency critical paths. This data can be the primary source for understanding customers’ past behavior, predicting future behavior, and generating downstream business value. Customers often turn to DynamoDB because of its great scalability and high availability. After a successful launch, many customers want to use the data in DynamoDB to predict future behaviors or provide personalized recommendations.
DynamoDB is a good fit for low-latency reads and writes, but it’s not practical to scan all data in a DynamoDB database to train a model. In this post, I demonstrate how you can use DynamoDB table data copied to Amazon S3 by AWS Data Pipeline to predict customer behavior. I also demonstrate how you can use this data to provide personalized recommendations for customers using Amazon SageMaker. You can also run ad hoc queries using Amazon Athena against the data. DynamoDB recently released on-demand backups to create full table backups with no performance impact. However, it’s not suitable for our purposes in this post, so I chose AWS Data Pipeline instead to create managed backups that are accessible from other services.
To do this, I describe how to read the DynamoDB backup file format in Data Pipeline. I also describe how to convert the objects in S3 to a CSV format that Amazon SageMaker can read. In addition, I show how to schedule regular exports and transformations using Data Pipeline. The sample data used in this post is from Bank Marketing Data Set of UCI.
The solution that I describe provides the following benefits:
- Separates analytical queries from production traffic on your DynamoDB table, preserving your DynamoDB read capacity units (RCUs) for important production requests
- Automatically updates your model to get real-time predictions
- Optimizes for performance (so it doesn’t compete with DynamoDB RCUs after the export) and for cost (using data you already have)
- Makes it easier for developers of all skill levels to use Amazon SageMaker
All code and data set in this post are available in this .zip file.
Solution architecture
The following diagram shows the overall architecture of the solution.
The steps that data follows through the architecture are as follows:
- Data Pipeline regularly copies the full contents of a DynamoDB table as JSON into an S3 bucket.
- Exported JSON files are converted to comma-separated value (CSV) format to use as a data source for Amazon SageMaker.
- Amazon SageMaker renews the model artifact and updates the endpoint.
- The converted CSV is available for ad hoc queries with Amazon Athena.
- Data Pipeline controls this flow and repeats the cycle based on the schedule defined by customer requirements.
Building the auto-updating model
This section discusses details about how to read the DynamoDB exported data in Data Pipeline and build automated workflows for real-time prediction with a regularly updated model.
Download sample scripts and data
Before you begin, take the following steps:
- Download sample scripts in this .zip file.
- Unzip the src.zip file.
- Find the automation_script.sh file and edit it for your environment. For example, you need to replace 's3://<your bucket>/<datasource path>/' with your own S3 path to the data source for Amazon SageMaker. In the script, the text enclosed by angle brackets (< and >) should be replaced with your own path.
- Upload the json-serde-1.3.6-SNAPSHOT-jar-with-dependencies.jar file to your S3 path so that the ADD jar command in Apache Hive can refer to it.
For this solution, the banking.csv should be imported into a DynamoDB table.
Export a DynamoDB table
To export the DynamoDB table to S3, open the Data Pipeline console and choose the Export DynamoDB table to S3 template. In this template, Data Pipeline creates an Amazon EMR cluster and performs an export in the EMRActivity activity. Set proper intervals for backups according to your business requirements.
One core node (m3.xlarge) provides the default capacity for the EMR cluster and should be suitable for the solution in this post. Leave the option to resize the cluster before running enabled in the TableBackupActivity activity so that Data Pipeline scales the cluster to match the table size. The process of converting to CSV format and renewing models happens in this EMR cluster.
For a more in-depth look at how to export data from DynamoDB, see Export Data from DynamoDB in the Data Pipeline documentation.
Add the script to an existing pipeline
After you export your DynamoDB table, you add an additional EMR step to EMRActivity by following these steps:
- Open the Data Pipeline console and choose the ID for the pipeline that you want to add the script to.
- For Actions, choose Edit.
- In the editing console, choose the Activities category and add an EMR step using the custom script downloaded in the previous section, as shown below.
Paste the following command into the new step after the data upload step:
The element #{output.directoryPath} references the S3 path where the data pipeline exports DynamoDB data as JSON. The path should be passed to the script as an argument.
The bash script has two goals, converting data formats and renewing the Amazon SageMaker model. Subsequent sections discuss the contents of the automation script.
Automation script: Convert JSON data to CSV with Hive
We use Apache Hive to transform the data into a new format. The Hive QL script to create an external table and transform the data is included in the custom script that you added to the Data Pipeline definition.
When you run the Hive scripts, do so with the -e option. Also, define the Hive table with the 'org.openx.data.jsonserde.JsonSerDe' row format to parse and read JSON format. The SQL creates a Hive EXTERNAL table, and it reads the DynamoDB backup data on the S3 path passed to it by Data Pipeline.
Note: You should create the table with the “EXTERNAL” keyword to avoid the backup data being accidentally deleted from S3 if you drop the table.
The full automation script for converting follows. Add your own bucket name and data source path in the highlighted areas.
After creating an external table, you need to read data. You then use the INSERT OVERWRITE DIRECTORY ~ SELECT command to write CSV data to the S3 path that you designated as the data source for Amazon SageMaker.
Depending on your requirements, you can eliminate or process the columns in the SELECT clause in this step to optimize data analysis. For example, you might remove some columns that have unpredictable correlations with the target value, because keeping the wrong columns might expose your model to “overfitting” during training. In this post, the customer_id column is removed. Overfitting can make your prediction weak. More information about overfitting can be found in the topic Model Fit: Underfitting vs. Overfitting in the Amazon ML documentation.
Automation script: Renew the Amazon SageMaker model
After the CSV data is replaced and ready to use, create a new model artifact for Amazon SageMaker with the updated dataset on S3. To renew the model artifact, you must create a new training job. Training jobs can be run using the AWS SDK (for example, Amazon SageMaker boto3), the Amazon SageMaker Python SDK (which can be installed with the “pip install sagemaker” command), or the AWS CLI for Amazon SageMaker, as described in this post.
In addition, consider how to smoothly renew your existing model without service impact, because your model is called by applications in real time. To do this, you need to create a new endpoint configuration first and update a current endpoint with the endpoint configuration that is just created.
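As a hedged sketch of that renewal step with boto3 (the model, endpoint, and instance values are placeholders, and the training job that produces the new model is omitted), the two calls look like this:

import time

import boto3

sm = boto3.client("sagemaker")

# Placeholder names; in the automation script these come from variables.
model_name = "bank-marketing-model-20180501"
endpoint_name = "ServiceEndpoint"
new_config_name = "ServiceEndpoint-config-{}".format(int(time.time()))

# 1. Create a new endpoint configuration that points at the newly trained model.
sm.create_endpoint_config(
    EndpointConfigName=new_config_name,
    ProductionVariants=[
        {
            "VariantName": "AllTraffic",
            "ModelName": model_name,
            "InitialInstanceCount": 1,
            "InstanceType": "ml.m4.xlarge",
        }
    ],
)

# 2. Update the running endpoint in place; callers keep using the same endpoint name.
sm.update_endpoint(EndpointName=endpoint_name, EndpointConfigName=new_config_name)

Because update_endpoint swaps the configuration behind the same endpoint name, client applications continue calling ServiceEndpoint without interruption.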
Grant permission
Before you execute the script, you must grant proper permission to Data Pipeline. Data Pipeline uses the DataPipelineDefaultResourceRole role by default. I added the following policy to DataPipelineDefaultResourceRole to allow Data Pipeline to create, delete, and update the Amazon SageMaker model and data source in the script.
Use real-time prediction
After you deploy a model into production using Amazon SageMaker hosting services, your client applications use this API to get inferences from the model hosted at the specified endpoint. This approach is useful for interactive web, mobile, or desktop applications.
Following, I provide a simple Python code example that queries the Amazon SageMaker endpoint by its name (“ServiceEndpoint”) and uses it for real-time prediction.
=== Python sample for real-time prediction ===
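The original sample is not included in this archive; the following is a minimal, hedged stand-in using boto3. The endpoint name “ServiceEndpoint” comes from the text above, while the CSV feature values in the request body are illustrative only and must match the column order used for training.

import boto3

runtime = boto3.client("sagemaker-runtime")

# One unlabeled record in the same CSV column order used for training
# (illustrative values taken from the Bank Marketing Data Set).
payload = "56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191"

response = runtime.invoke_endpoint(
    EndpointName="ServiceEndpoint",
    ContentType="text/csv",
    Body=payload,
)

print(response["Body"].read().decode("utf-8"))

The response body contains the model’s prediction for the supplied record.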
Solution summary
The solution takes the following steps:
- Data Pipeline exports DynamoDB table data into S3. The original JSON data should be kept to recover the table in the rare event that this is needed. Data Pipeline then converts JSON to CSV so that Amazon SageMaker can read the data. Note: You should select only meaningful attributes when you convert to CSV. For example, if you judge that the “campaign” attribute is not correlated, you can eliminate this attribute from the CSV.
- Train the Amazon SageMaker model with the new data source.
- When a new customer comes to your site, you can judge how likely it is for this customer to subscribe to your new product based on “predictedScores” provided by Amazon SageMaker.
- If the new user subscribes to your new product, your application must update the attribute “y” to the value 1 (for yes). This updated data is provided for the next model renewal as a new data source. It serves to improve the accuracy of your prediction. With each new entry, your application can become smarter and deliver better predictions.
Running ad hoc queries using Amazon Athena
Amazon Athena is a serverless query service that makes it easy to analyze large amounts of data stored in Amazon S3 using standard SQL. Athena is useful for examining data and collecting statistics or informative summaries about data. You can also use the powerful analytic functions of Presto, as described in the topic Aggregate Functions of Presto in the Presto documentation.
With the Data Pipeline scheduled activity, recent CSV data is always located in S3 so that you can run ad hoc queries against the data using Amazon Athena. I show this with example SQL statements following. For an in-depth description of this process, see the post Interactive SQL Queries for Data in Amazon S3 on the AWS News Blog.
Creating an Amazon Athena table and running it
You can simply create an EXTERNAL table for the CSV data on S3 in the Amazon Athena console.
The following query calculates the correlation coefficient between the target attribute and other attributes using Amazon Athena.
=== Sample Query ===
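The original SQL is not reproduced in this archive. As a hedged sketch of the idea (the table name is a placeholder, the columns come from the Bank Marketing Data Set and are assumed to be numeric, and the target y is assumed to be stored as 'yes'/'no'), such a correlation query can be submitted from Python as follows:

import boto3

athena = boto3.client("athena")

# corr() is a Presto aggregate; the CASE expression turns the yes/no target into 1/0.
# Table and column names are placeholders based on the Bank Marketing Data Set.
sql = """
SELECT corr(age,      CASE WHEN y = 'yes' THEN 1 ELSE 0 END) AS corr_age,
       corr(duration, CASE WHEN y = 'yes' THEN 1 ELSE 0 END) AS corr_duration,
       corr(campaign, CASE WHEN y = 'yes' THEN 1 ELSE 0 END) AS corr_campaign
FROM default.bank_marketing_csv
"""

athena.start_query_execution(
    QueryString=sql,
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://your-bucket/athena-query-results/"},
)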
Conclusion
In this post, I introduce an example of how to analyze data in DynamoDB by using table data in Amazon S3 to optimize DynamoDB table read capacity. You can then use the analyzed data as a new data source to train an Amazon SageMaker model for accurate real-time prediction. In addition, you can run ad hoc queries against the data on S3 using Amazon Athena. I also present how to automate these procedures by using Data Pipeline.
You can adapt this example to your specific use case at hand, and hopefully this post helps you accelerate your development. You can find more examples and use cases for Amazon SageMaker in the video AWS 2017: Introducing Amazon SageMaker on the AWS website.
Additional Reading
If you found this post useful, be sure to check out Serving Real-Time Machine Learning Predictions on Amazon EMR and Analyzing Data in S3 using Amazon Athena.
About the Author
Yong Seong Lee is a Cloud Support Engineer for AWS Big Data Services. He is interested in every technology related to data/databases and helping customers who have difficulties in using AWS services. His motto is “Enjoy life, be curious and have maximum experience.”
Security updates for Tuesday
Post Syndicated from ris original https://lwn.net/Articles/753257/rss
Security updates have been issued by Fedora (cups-filters, ghostscript, glusterfs, PackageKit, qpdf, and xen), Mageia (anki, libofx, ming, sox, webkit2, and xdg-user-dirs), Oracle (corosync, java-1.7.0-openjdk, and pcs), Red Hat (java-1.7.0-openjdk), Scientific Linux (corosync, firefox, gcc, glibc, golang, java-1.7.0-openjdk, java-1.8.0-openjdk, kernel, krb5, librelp, libvncserver, libvorbis, ntp, openssh, openssl, PackageKit, patch, pcs, policycoreutils, qemu-kvm, and xdg-user-dirs), Slackware (libwmf and mozilla), and Ubuntu (apache2, ghostscript, mysql-5.7, wavpack, and webkit2gtk).
Secure Build with AWS CodeBuild and LayeredInsight
Post Syndicated from Asif Khan original https://aws.amazon.com/blogs/devops/secure-build-with-aws-codebuild-and-layeredinsight/
This post is written by Asif Awan, Chief Technology Officer of Layered Insight, Subin Mathew – Software Development Manager for AWS CodeBuild, and Asif Khan – Solutions Architect
Enterprises adopt containers because they recognize the benefits: speed, agility, portability, and high compute density. They understand how accelerating application delivery and deployment pipelines makes it possible to rapidly slipstream new features to customers. Although the benefits are indisputable, this acceleration raises concerns about security and corporate compliance with software governance. In this blog post, I provide a solution that shows how Layered Insight, the pioneer and global leader in container-native application protection, can be used with seamless application build and delivery pipelines like those available in AWS CodeBuild to address these concerns.
Layered Insight solutions
Layered Insight enables organizations to unify DevOps and SecOps by providing complete visibility and control of containerized applications. Using the industry’s first embedded security approach, Layered Insight solves the challenges of container performance and protection by providing accurate insight into container images, adaptive analysis of running containers, and automated enforcement of container behavior.
AWS CodeBuild
AWS CodeBuild is a fully managed build service that compiles source code, runs tests, and produces software packages that are ready to deploy. With CodeBuild, you don’t need to provision, manage, and scale your own build servers. CodeBuild scales continuously and processes multiple builds concurrently, so your builds are not left waiting in a queue. You can get started quickly by using prepackaged build environments, or you can create custom build environments that use your own build tools.
Problem Definition
Security and compliance concerns span the lifecycle of application containers. Common concerns include:
Visibility into the container images. You need to verify the software composition information of the container image to determine whether known vulnerabilities associated with any of the software packages and libraries are included in the container image.
Governance of container images is critical because only certain open source packages/libraries, of specific versions, should be included in the container images. You need support for mechanisms for blacklisting all container images that include a certain version of a software package/library, or only allowing open source software that come with a specific type of license (such as Apache, MIT, GPL, and so on). You need to be able to address challenges such as:
· Defining the process for image compliance policies at the enterprise, department, and group levels.
· Preventing the images that fail the compliance checks from being deployed in critical environments, such as staging, pre-prod, and production.
Visibility into running container instances is critical, including:
· CPU and memory utilization.
· Security of the build environment.
· All activities (system, network, storage, and application layer) of the application code running in each container instance.
Protection of running container instances that is:
· Zero-touch to the developers (not an SDK-based approach).
· Zero touch to the DevOps team and doesn’t limit the portability of the containerized application.
· This protection must retain the option to switch to a different container stack or orchestration layer, or even to a different Container as a Service (CaaS).
· And it must be a fully automated solution to SecOps, so that the SecOps team doesn’t have to manually analyze and define detailed blacklist and whitelist policies.
Solution Details
In AWS CodeCommit, we have three projects:
● “Democode” is a simple Java application, with one buildspec to build the app into a Docker container (run by the build-demo-image CodeBuild project), and another to instrument said container (the instrument-image CodeBuild project). The resulting container is stored in the ECR repo javatest as javatest:20180415-layered. This instrumented container is running in the AWS Fargate cluster demo-java-app and can be seen in the Layered Insight runtime console as the javatest application in us-east-1.
● aws-codebuild-docker-images is a clone of the official aws-codebuild-docker-images repo on GitHub. This CodeCommit project is used by the build-python-builder CodeBuild project to build the Python 3.3.6 CodeBuild image, which is stored in the codebuild-python ECR repo. We then manually instructed the Layered Insight console to instrument the image.
● scan-java-image contains just a buildspec.yml file. This file is used by the scan-java-image CodeBuild project to instruct Layered Assessment to perform a vulnerability scan of the javatest container image built previously, and then run the scan results through a compliance policy that states there should be no medium vulnerabilities. This build fails, but in this case that is a success: the scan completes successfully, but compliance fails because there are medium-level issues found in the scan.
This build is performed using the instrumented version of the Python 3.3.6 CodeBuild image, so the activity of the processes running within the build are recorded each time within the LI console.
Build container image
Create or use a CodeCommit project with your application. To build this image and store it in Amazon Elastic Container Registry (Amazon ECR), add a buildspec file to the project and build a container image and create a CodeBuild project.
Scan container image
Once the image is built, create a new buildspec in the same project or a new one that looks similar to below (update ECR URL as necessary):
version: 0.2
phases:
  pre_build:
    commands:
      - echo Pulling down LI Scan API client scripts
      - git clone https://github.com/LayeredInsight/scan-api-example-python.git
      - echo Setting up LI Scan API client
      - cd scan-api-example-python
      - pip install layint_scan_api
      - pip install -r requirements.txt
  build:
    commands:
      - echo Scanning container started on `date`
      - IMAGEID=$(./li_add_image --name <aws-region>.amazonaws.com/javatest:20180415)
      - ./li_wait_for_scan -v --imageid $IMAGEID
      - ./li_run_image_compliance -v --imageid $IMAGEID --policyid PB15260f1acb6b2aa5b597e9d22feffb538256a01fbb4e5a95
Add the buildspec file to the git repo, push it, and then build a CodeBuild project using the instrumented Python 3.3.6 CodeBuild image at <aws-region>.amazonaws.com/codebuild-python:3.3.6-layered. Set the following environment variables in the CodeBuild project:
● LI_APPLICATIONNAME – name of the build to display
● LI_LOCATION – location of the build project to display
● LI_API_KEY – ApiKey:<key-name>:<api-key>
● LI_API_HOST – location of the Layered Insight API service
Instrument container image
Next, to instrument the new container image:
- In the Layered Insight runtime console, ensure that the ECR registry and credentials are defined (click the Setup icon and the ‘+’ sign on the top right of the screen to add a new container registry). Note the name given to the registry in the console, as this needs to be referenced in the li_add_image command in the script, below.
- Next, add a new buildspec (with a new name) to the CodeCommit project, such as the one shown below. This code will download the Layered Insight runtime client, and use it to instruct the Layered Insight service to instrument the image that was just built:
version: 0.2
phases:
  pre_build:
    commands:
      - echo Pulling down LI API Runtime client scripts
      - git clone https://github.com/LayeredInsight/runtime-api-example-python
      - echo Setting up LI API client
      - cd runtime-api-example-python
      - pip install layint-runtime-api
      - pip install -r requirements.txt
  build:
    commands:
      - echo Instrumentation started on `date`
      - ./li_add_image --registry "Javatest ECR" --name IMAGE_NAME:TAG --description "IMAGE DESCRIPTION" --policy "Default Policy" --instrument --wait --verbose
- Commit and push the new buildspec file.
- Going back to CodeBuild, create a new project, with the same CodeCommit repo, but this time select the new buildspec file. Use a Python 3.3.6 builder – either the AWS or LI Instrumented version.
- Click Continue.
- Click Save.
- Run the build, again on the master branch.
- If everything runs successfully, a new image should appear in the ECR registry with a -layered suffix. This is the instrumented image.
Run instrumented container image
When the instrumented container is run (in ECS, Fargate, or elsewhere), it logs data back to the Layered Insight runtime console. Its appearance in the console can be modified by setting the LI_APPLICATIONNAME and LI_LOCATION environment variables when running the container.
Conclusion
In this post, we walked through the steps needed to embed governance and runtime security into build pipelines running on AWS CodeBuild using Layered Insight.
Implement continuous integration and delivery of serverless AWS Glue ETL applications using AWS Developer Tools
Post Syndicated from Prasad Alle original https://aws.amazon.com/blogs/big-data/implement-continuous-integration-and-delivery-of-serverless-aws-glue-etl-applications-using-aws-developer-tools/
AWS Glue is an increasingly popular way to develop serverless ETL (extract, transform, and load) applications for big data and data lake workloads. Organizations that transform their ETL applications to cloud-based, serverless ETL architectures need a seamless, end-to-end continuous integration and continuous delivery (CI/CD) pipeline: from source code, to build, to deployment, to product delivery. Having a good CI/CD pipeline can help your organization discover bugs before they reach production and deliver updates more frequently. It can also help developers write quality code and automate the ETL job release management process, mitigate risk, and more.
AWS Glue is a fully managed data catalog and ETL service. It simplifies and automates the difficult and time-consuming tasks of data discovery, conversion, and job scheduling. AWS Glue crawls your data sources and constructs a data catalog using pre-built classifiers for popular data formats and data types, including CSV, Apache Parquet, JSON, and more.
When you are developing ETL applications using AWS Glue, you might come across some of the following CI/CD challenges:
- Iterative development with unit tests
- Continuous integration and build
- Pushing the ETL pipeline to a test environment
- Pushing the ETL pipeline to a production environment
- Testing ETL applications using real data (live test)
- Exploring and validating data
In this post, I walk you through a solution that implements a CI/CD pipeline for serverless AWS Glue ETL applications supported by AWS Developer Tools (including AWS CodePipeline, AWS CodeCommit, and AWS CodeBuild) and AWS CloudFormation.
Solution overview
The following diagram shows the pipeline workflow:
This solution uses AWS CodePipeline, which lets you orchestrate and automate the test and deploy stages for ETL application source code. The solution consists of a pipeline that contains the following stages:
1.) Source Control: In this stage, the AWS Glue ETL job source code and the AWS CloudFormation template file for deploying the ETL jobs are both committed to version control. I chose to use AWS CodeCommit for version control.
To get the ETL job source code and AWS CloudFormation template, download the gluedemoetl.zip file. This solution builds on a previous post, Build a Data Lake Foundation with AWS Glue and Amazon S3.
2.) LiveTest: In this stage, all resources—including AWS Glue crawlers, jobs, S3 buckets, roles, and other resources that are required for the solution—are provisioned, deployed, live tested, and cleaned up.
The LiveTest stage includes the following actions:
- Deploy: In this action, all the resources that are required for this solution (crawlers, jobs, buckets, roles, and so on) are provisioned and deployed using an AWS CloudFormation template.
- AutomatedLiveTest: In this action, all the AWS Glue crawlers and jobs are executed and data exploration and validation tests are performed. These validation tests include, but are not limited to, record counts in both raw tables and transformed tables in the data lake and any other business validations. I used AWS CodeBuild for this action.
- LiveTestApproval: This action is included for the cases in which a pipeline administrator approval is required to deploy/promote the ETL applications to the next stage. The pipeline pauses in this action until an administrator manually approves the release.
- LiveTestCleanup: In this action, all the LiveTest stage resources, including test crawlers, jobs, roles, and so on, are deleted using the AWS CloudFormation template. This action helps minimize cost by ensuring that the test resources exist only for the duration of the AutomatedLiveTest and LiveTestApproval actions.
3.) DeployToProduction: In this stage, all the resources are deployed using the AWS CloudFormation template to the production environment.
Try it out
This pipeline takes approximately 20 minutes to complete the LiveTest stage (up to the LiveTestApproval action, in which manual approval is required).
To get started with this solution, choose Launch Stack:
This creates the CI/CD pipeline with all of its stages, as described earlier. It performs an initial commit of the sample AWS Glue ETL job source code to trigger the first release change.
In the AWS CloudFormation console, choose Create. After the template finishes creating resources, you see the pipeline name on the stack Outputs tab.
After that, open the CodePipeline console and select the newly created pipeline. Initially, your pipeline’s CodeCommit stage shows that the source action failed.
Allow a few minutes for your new pipeline to detect the initial commit applied by the CloudFormation stack creation. As soon as the commit is detected, your pipeline starts. You will see the successful stage completion status as soon as the CodeCommit source stage runs.
In the CodeCommit console, choose Code in the navigation pane to view the solution files.
Next, you can watch how the pipeline goes through the Deploy and AutomatedLiveTest actions of the LiveTest stage, until it finally reaches the LiveTestApproval action.
At this point, if you check the AWS CloudFormation console, you can see that a new template has been deployed as part of the LiveTest deploy action.
At this point, make sure that the AWS Glue crawlers and the AWS Glue job ran successfully. Also check whether the corresponding databases and external tables have been created in the AWS Glue Data Catalog. Then verify that the data is validated using Amazon Athena, as shown following.
Open the AWS Glue console, and choose Databases in the navigation pane. You will see the following databases in the Data Catalog:
Open the Amazon Athena console, and run the following queries. Verify that the record counts are matching.
The following shows the raw data:
The following shows the transformed data:
The pipeline pauses the action until the release is approved. After validating the data, manually approve the revision on the LiveTestApproval action on the CodePipeline console.
Add comments as needed, and choose Approve.
The LiveTestApproval stage now appears as Approved on the console.
After the revision is approved, the pipeline proceeds to use the AWS CloudFormation template to destroy the resources that were deployed in the LiveTest deploy action. This helps reduce cost and ensures a clean test environment on every deployment.
Production deployment is the final stage. In this stage, all the resources—AWS Glue crawlers, AWS Glue jobs, Amazon S3 buckets, roles, and so on—are provisioned and deployed to the production environment using the AWS CloudFormation template.
After successfully running the whole pipeline, feel free to experiment with it by changing the source code stored on AWS CodeCommit. For example, if you modify the AWS Glue ETL job to generate an error, it should make the AutomatedLiveTest action fail. Or if you change the AWS CloudFormation template to make its creation fail, it should affect the LiveTest deploy action. The objective of the pipeline is to ensure that all changes deployed to production work as expected.
Conclusion
In this post, you learned how easy it is to implement CI/CD for serverless AWS Glue ETL solutions with AWS developer tools like AWS CodePipeline and AWS CodeBuild at scale. Implementing such solutions can help you accelerate ETL development and testing at your organization.
If you have questions or suggestions, please comment below.
Additional Reading
If you found this post useful, be sure to check out Implement Continuous Integration and Delivery of Apache Spark Applications using AWS and Build a Data Lake Foundation with AWS Glue and Amazon S3.
About the Authors
Prasad Alle is a Senior Big Data Consultant with AWS Professional Services. He spends his time leading and building scalable, reliable big data, machine learning, artificial intelligence, and IoT solutions for AWS Enterprise and Strategic customers. His interests extend to technologies such as advanced edge computing and machine learning at the edge. In his spare time, he enjoys spending time with his family.
Luis Caro is a Big Data Consultant for AWS Professional Services. He works with our customers to provide guidance and technical assistance on big data projects, helping them improve the value of their solutions when using AWS.
Work with partitioned data in AWS Glue
Post Syndicated from Ben Sowell original https://aws.amazon.com/blogs/big-data/work-with-partitioned-data-in-aws-glue/
AWS Glue provides enhanced support for working with datasets that are organized into Hive-style partitions. AWS Glue crawlers automatically identify partitions in your Amazon S3 data. The AWS Glue ETL (extract, transform, and load) library natively supports partitions when you work with DynamicFrames. DynamicFrames represent a distributed collection of data without requiring you to specify a schema. You can now push down predicates when creating DynamicFrames to filter out partitions and avoid costly calls to S3. We have also added support for writing DynamicFrames directly into partitioned directories without converting them to Apache Spark DataFrames.
Partitioning has emerged as an important technique for organizing datasets so that they can be queried efficiently by a variety of big data systems. Data is organized in a hierarchical directory structure based on the distinct values of one or more columns. For example, you might decide to partition your application logs in Amazon S3 by date—broken down by year, month, and day. Files corresponding to a single day’s worth of data would then be placed under a prefix such as s3://my_bucket/logs/year=2018/month=01/day=23/.
Systems like Amazon Athena, Amazon Redshift Spectrum, and now AWS Glue can use these partitions to filter data by value without making unnecessary calls to Amazon S3. This can significantly improve the performance of applications that need to read only a few partitions.
In this post, we show you how to efficiently process partitioned datasets using AWS Glue. First, we cover how to set up a crawler to automatically scan your partitioned dataset and create a table and partitions in the AWS Glue Data Catalog. Then, we introduce some features of the AWS Glue ETL library for working with partitioned data. You can now filter partitions using SQL expressions or user-defined functions to avoid listing and reading unnecessary data from Amazon S3. We’ve also added support in the ETL library for writing AWS Glue DynamicFrames directly into partitions without relying on Spark SQL DataFrames.
Let’s get started!
Crawling partitioned data
In this example, we use the same GitHub archive dataset that we introduced in a previous post about Scala support in AWS Glue. This data, which is publicly available from the GitHub archive, contains a JSON record for every API request made to the GitHub service. A sample dataset containing one month of activity from January 2017 is available at the following location:
Here you can replace <region> with the AWS Region in which you are working, for example, us-east-1. This dataset is partitioned by year, month, and day, so an actual file will be at a path like the following:
To crawl this data, you can either follow the instructions in the AWS Glue Developer Guide or use the provided AWS CloudFormation template. This template creates a stack that contains the following:
- An IAM role with permissions to access AWS Glue resources
- A database in the AWS Glue Data Catalog named githubarchive_month
- A crawler set up to crawl the GitHub dataset
- An AWS Glue development endpoint (which is used in the next section to transform the data)
To run this template, you must provide an S3 bucket and prefix where you can write output data in the next section. The role that this template creates will have permission to write to this bucket only. You also need to provide a public SSH key for connecting to the development endpoint. For more information about creating an SSH key, see our Development Endpoint tutorial. After you create the AWS CloudFormation stack, you can run the crawler from the AWS Glue console.
In addition to inferring file types and schemas, crawlers automatically identify the partition structure of your dataset and populate the AWS Glue Data Catalog. This ensures that your data is correctly grouped into logical tables and makes the partition columns available for querying in AWS Glue ETL jobs or query engines like Amazon Athena.
After you crawl the table, you can view the partitions by navigating to the table in the AWS Glue console and choosing View partitions. The partitions should look like the following:
For Hive-style partitioned paths of the form key=val, crawlers automatically populate the column name. In this case, because the GitHub data is stored in directories of the form 2017/01/01, the crawlers use default names like partition_0, partition_1, and so on. You can easily change these names on the AWS Glue console: Navigate to the table, choose Edit schema, and rename partition_0 to year, partition_1 to month, and partition_2 to day:
Now that you’ve crawled the dataset and named your partitions appropriately, let’s see how to work with partitioned data in an AWS Glue ETL job.
Transforming and filtering the data
To get started with the AWS Glue ETL libraries, you can use an AWS Glue development endpoint and an Apache Zeppelin notebook. AWS Glue development endpoints provide an interactive environment to build and run scripts using Apache Spark and the AWS Glue ETL library. They are great for debugging and exploratory analysis, and can be used to develop and test scripts before migrating them to a recurring job.
If you ran the AWS CloudFormation template in the previous section, then you already have a development endpoint named partition-endpoint in your account. Otherwise, you can follow the instructions in this development endpoint tutorial. In either case, you need to set up an Apache Zeppelin notebook, either locally or on an EC2 instance. You can find more information about development endpoints and notebooks in the AWS Glue Developer Guide.
The following examples are all written in the Scala programming language, but they can all be implemented in Python with minimal changes.
Reading a partitioned dataset
To get started, let’s read the dataset and see how the partitions are reflected in the schema. First, you import some classes that you will need for this example and set up a GlueContext, which is the main class that you will use to read and write data.
Execute the following in a Zeppelin paragraph, which is a unit of executable code:
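A minimal sketch of such a setup paragraph is shown below. The exact import list is an assumption, and the snippets later in this post assume that glueContext has been created this way:

    %spark

    import com.amazonaws.services.glue.DynamicFrame
    import com.amazonaws.services.glue.GlueContext
    import com.amazonaws.services.glue.util.JsonOptions
    import org.apache.spark.SparkContext

    // Mark the spark variable @transient to avoid serialization issues in Zeppelin.
    @transient val spark: SparkContext = SparkContext.getOrCreate()
    val glueContext: GlueContext = new GlueContext(spark)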
This is straightforward with two caveats: First, each paragraph must start with the line %spark to indicate that the paragraph is Scala. Second, the spark variable must be marked @transient to avoid serialization issues. This is only necessary when running in a Zeppelin notebook.
Next, read the GitHub data into a DynamicFrame, which is the primary data structure that is used in AWS Glue scripts to represent a distributed collection of data. A DynamicFrame is similar to a Spark DataFrame, except that it has additional enhancements for ETL transformations. DynamicFrames are discussed further in the post AWS Glue Now Supports Scala Scripts, and in the AWS Glue API documentation.
The following snippet creates a DynamicFrame by referencing the Data Catalog table that you just crawled and then prints the schema:
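The snippet below is a sketch of that step. It assumes the crawler registered the dataset as a table named data in the githubarchive_month database; the table name is an assumption, so check the Data Catalog for the name the crawler actually created:

    val githubEvents: DynamicFrame = glueContext.getCatalogSource(
      database = "githubarchive_month",
      tableName = "data"
    ).getDynamicFrame()

    // Print only the top-level column names and types.
    githubEvents.toDF().schema.fields.foreach { field =>
      println(s"${field.name}: ${field.dataType.typeName}")
    }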
You could also print the full schema using githubEvents.printSchema(). But in this case, the full schema is quite large, so I’ve printed only the top-level columns. This paragraph takes about 5 minutes to run on a standard size AWS Glue development endpoint. After it runs, you should see the following output:
id: string
type: string
actor: struct
repo: struct
payload: struct
public: boolean
created_at: string
year: string
month: string
day: string
org: struct
Note that the partition columns year, month, and day were automatically added to each record.
Filtering by partition columns
One of the primary reasons for partitioning data is to make it easier to operate on a subset of the partitions, so now let’s see how to filter data by the partition columns. In particular, let’s find out what people are building in their free time by looking at GitHub activity on the weekends. One way to accomplish this is to use the filter transformation on the githubEvents DynamicFrame that you created earlier to select the appropriate events:
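A sketch of that filter follows; the helper function and the numeric conversion of the string partition columns are illustrative rather than the only way to write it:

    import java.util.{Calendar, GregorianCalendar}
    import com.amazonaws.services.glue.DynamicRecord

    def filterWeekend(rec: DynamicRecord): Boolean = {
      // The crawler typed the partition columns as strings, so convert them to Int first.
      def getAsInt(field: String): Int = rec.getField(field) match {
        case Some(s: String) => s.toInt
        case _               => 0
      }
      // GregorianCalendar months are 0-based.
      val cal = new GregorianCalendar(getAsInt("year"), getAsInt("month") - 1, getAsInt("day"))
      val dayOfWeek = cal.get(Calendar.DAY_OF_WEEK)
      dayOfWeek == Calendar.SATURDAY || dayOfWeek == Calendar.SUNDAY
    }

    val weekendEvents = githubEvents.filter(filterWeekend)
    println(weekendEvents.count)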
This snippet defines the filterWeekend function that uses the Java Calendar class to identify those records where the partition columns (year, month, and day) fall on a weekend. If you run this code, you see that there were 6,303,480 GitHub events falling on the weekend in January 2017, out of a total of 29,160,561 events. This seems reasonable—about 22 percent of the events fell on the weekend, and about 29 percent of the days that month fell on the weekend (9 out of 31). So people are using GitHub slightly less on the weekends, but there is still a lot of activity!
Predicate pushdowns for partition columns
The main downside to using the filter transformation in this way is that you have to list and read all files in the entire dataset from Amazon S3 even though you need only a small fraction of them. This is manageable when dealing with a single month’s worth of data. But as you try to process more data, you will spend an increasing amount of time reading records only to immediately discard them.
To address this issue, we recently released support for pushing down predicates on partition columns that are specified in the AWS Glue Data Catalog. Instead of reading the data and filtering the DynamicFrame at executors in the cluster, you apply the filter directly on the partition metadata available from the catalog. Then you list and read only the partitions from S3 that you need to process.
To accomplish this, you can specify a Spark SQL predicate as an additional parameter to the getCatalogSource method. This predicate can be any SQL expression or user-defined function as long as it uses only the partition columns for filtering. Remember that you are applying this to the metadata stored in the catalog, so you don’t have access to other fields in the schema.
The following snippet shows how to use this functionality to read only those partitions occurring on a weekend:
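A sketch of the pushdown version is below, reusing the database and (assumed) table name from earlier; the Scala parameter name pushDownPredicate is my reading of the GlueContext API, so verify it against the documentation:

    // Build a Spark SQL predicate over the partition columns only.
    val partitionPredicate =
      "date_format(to_date(concat(year, '-', month, '-', day)), 'E') in ('Sat', 'Sun')"

    val pushdownEvents = glueContext.getCatalogSource(
      database = "githubarchive_month",
      tableName = "data",
      pushDownPredicate = partitionPredicate
    ).getDynamicFrame()

    println(pushdownEvents.count)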
Here you use the SparkSQL string concat function to construct a date string. You use the to_date function to convert it to a date object, and the date_format function with the ‘E’ pattern to convert the date to a three-character day of the week (for example, Mon, Tue, and so on). For more information about these functions, Spark SQL expressions, and user-defined functions in general, see the Spark SQL documentation and list of functions.
Note that the pushdownPredicate parameter is also available in Python. The corresponding call in Python is as follows:
You can observe the performance impact of pushing down predicates by looking at the execution time reported for each Zeppelin paragraph. The initial approach using a Scala filter function took 2.5 minutes:
Because the version using a pushdown lists and reads much less data, it takes only 24 seconds to complete, a 5X improvement!
Of course, the exact benefit that you see depends on the selectivity of your filter. The more partitions that you exclude, the more improvement you will see.
In addition to Hive-style partitioning for Amazon S3 paths, Parquet and ORC file formats further partition each file into blocks of data that represent column values. Each block also stores statistics for the records that it contains, such as min/max for column values. AWS Glue supports pushdown predicates for both Hive-style partitions and block partitions in these formats. While reading data, it prunes unnecessary S3 partitions and, for Parquet and ORC, uses the column statistics to skip blocks that don’t need to be read.
Additional transformations
Now that you’ve read and filtered your dataset, you can apply any additional transformations to clean or modify the data. For example, you could augment it with sentiment analysis as described in the previous AWS Glue post.
To keep things simple, you can just pick out some columns from the dataset using the ApplyMapping transformation:
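A sketch of that projection, applied to the pushdownEvents frame from the previous step, is shown below; the exact list of mapped fields is an assumption based on the description that follows:

    val projectedEvents = pushdownEvents.applyMapping(Seq(
      ("id", "string", "id", "long"),                // cast id to a long
      ("type", "string", "type", "string"),
      ("actor.login", "string", "actor", "string"),  // unnest actor.login to a top-level field
      ("repo.name", "string", "repo", "string"),
      ("payload.action", "string", "action", "string"),
      ("org.login", "string", "org", "string"),
      ("year", "string", "year", "int"),             // cast the partition columns to integers
      ("month", "string", "month", "int"),
      ("day", "string", "day", "int")
    ))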
ApplyMapping is a flexible transformation for performing projection and type-casting. In this example, we use it to unnest several fields, such as actor.login, which we map to the top-level actor field. We also cast the id column to a long and the partition columns to integers.
Writing out partitioned data
The final step is to write out your transformed dataset to Amazon S3 so that you can process it with other systems like Amazon Athena. By default, when you write out a DynamicFrame, it is not partitioned—all the output files are written at the top level under the specified output path. Until recently, the only way to write a DynamicFrame into partitions was to convert it into a Spark SQL DataFrame before writing. We are excited to share that DynamicFrames now support native partitioning by a sequence of keys.
You can accomplish this by passing the additional partitionKeys option when creating a sink. For example, the following code writes out the dataset that you created earlier in Parquet format to S3 in directories partitioned by the type field.
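A sketch of that write is below; the Map form of JsonOptions and the option names are my assumptions about the sink API, so double-check them against the AWS Glue Scala documentation:

    glueContext.getSinkWithFormat(
      connectionType = "s3",
      // Replace $outpath with your base S3 output path; partitionKeys controls the directory layout.
      options = JsonOptions(Map("path" -> "$outpath", "partitionKeys" -> Seq("type"))),
      format = "parquet"
    ).writeDynamicFrame(projectedEvents)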
Here, $outpath is a placeholder for the base output path in S3. The partitionKeys parameter can also be specified in Python in the connection_options dict:
When you execute this write, the type field is removed from the individual records and is encoded in the directory structure. To demonstrate this, you can list the output path using the aws s3 ls command from the AWS CLI:
As expected, there is a partition for each distinct event type. In this example, we partitioned by a single value, but this is by no means required. For example, if you want to preserve the original partitioning by year, month, and day, you could simply set the partitionKeys option to be Seq("year", "month", "day").
Conclusion
In this post, we showed you how to work with partitioned data in AWS Glue. Partitioning is a crucial technique for getting the most out of your large datasets. Many tools in the AWS big data ecosystem, including Amazon Athena and Amazon Redshift Spectrum, take advantage of partitions to accelerate query processing. AWS Glue provides mechanisms to crawl, filter, and write partitioned data so that you can structure your data in Amazon S3 however you want, to get the best performance out of your big data applications.
We hope you try it out!
Additional Reading
If you found this post useful, be sure to check out AWS Glue Now Supports Scala Scripts and Simplify Querying Nested JSON with the AWS Glue Relationalize Transform.
Ben Sowell is a senior software development engineer at AWS Glue. He has worked for more than 5 years on ETL systems to help users unlock the potential of their data. In his free time, he enjoys reading and exploring the Bay Area.
Mohit Saxena is a senior software development engineer at AWS Glue. His passion is building scalable distributed systems for efficiently managing data on cloud. He also enjoys watching movies and reading about the latest technology.
Security updates for Thursday
Post Syndicated from ris original https://lwn.net/Articles/752324/rss
Security updates have been issued by Debian (opencv and wireshark), Fedora (corosync and pcs), Oracle (firefox, kernel, libvncserver, and libvorbis), Slackware (gd), SUSE (kernel), and Ubuntu (apache2).