The choice to run a Lambda function on AL1 or AL2 is based upon the runtime. With the addition of the java8.al2 and provided.al2 runtimes, it is now possible to run Java 8 (Corretto), Go, and custom runtimes on AL2. It also means that the latest version of all supported runtimes can now run on AL2.
Amazon Corretto 8 is a production-ready distribution of the Open Java Development Kit (OpenJDK) 8 and comes with long-term support (LTS). AWS runs Corretto internally on thousands of production services. Patches and improvements in Corretto allow AWS to address real-world service concerns and meet heavy performance and scalability demands.
Developers can now take advantage of these improvements as they develop Lambda functions by using the new Java 8 (Corretto) runtime of java8.al2. You can see the new Java runtimes supported in the Lambda console:
Console: choosing the Java 8 (Corretto) runtime
The custom runtime for Lambda feature was announced at re:Invent 2018. Since then, developers have created custom runtimes for PHP, Erlang/Elixir, Swift, COBOL, Rust, and many others. Until today, custom runtimes have only used the AL1 environment. Now, developers can choose to run custom runtimes in the AL2 execution environment. To do this, select the provided.al2 runtime value in the console when creating or updating your Lambda function:
With the addition of the provided.al2 runtime option, Go developers can now run Lambda functions in AL2. As one of the later supported runtimes for Lambda, Go is implemented differently than other native runtimes. Under the hood, Go is treated as a custom runtime and runs accordingly. A Go developer can take advantage of this by choosing the provided.al2 runtime and providing the required bootstrap file.
Using SAM Build to build AL2 functions
With the new sam build options, building functions for AL2 is easily accomplished with the following steps:
Update the AWS Serverless Application Model template to the new provided.al2 runtime. Add the Metadata parameter to set the BuildMethod to makefile.
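As a rough illustration (the logical resource name, handler, and code path below are placeholders, not values from the original template), the relevant part of such a template might look like this:

Resources:
  MyFunction:                      # placeholder logical ID
    Type: AWS::Serverless::Function
    Properties:
      Runtime: provided.al2        # run the custom runtime on Amazon Linux 2
      Handler: bootstrap           # custom runtimes start from a bootstrap file
      CodeUri: my-function/        # placeholder path to the function source
    Metadata:
      BuildMethod: makefile        # tells sam build to drive the build with a Makefile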
With the latest runtimes now available on AL2, we are encouraging developers to begin migrating Lambda functions from AL1-based runtimes to AL2-based runtimes. By starting this process now, Lambda functions are running on the latest long-term supported environment.
With AL1, that long-term support is coming to an end. The latest version of AL1, 2018.03, had an original end-of-life date of June 30, 2020. However, AWS extended this date until December 30, 2020. On this date, AL1 will transition from long-term support (LTS) to a maintenance support period lasting until June 30, 2023. During the maintenance support period, AL1 receives critical and important security updates for a reduced set of packages.
Support timeline
AL2, in contrast, is scheduled for LTS until June 30, 2023, and provides the following support:
Security updates and bug fixes for all packages in core.
User-space application binary interface (ABI) compatibility for core packages.
Legacy runtime end-of-life schedules
As shown in the preceding chart, some runtimes are still mapped to AL1 host operating systems. AWS Lambda is committed to supporting runtimes through their long-term support (LTS) window, as specified by the language publisher. During this period, Lambda provides base operating system maintenance and patching for these runtimes. After this period, runtimes are deprecated.
Phase 1: you can no longer create functions that use the deprecated runtime. For at least 30 days, you can continue to update existing functions that use the deprecated runtime.
Phase 2: both function creation and updates are disabled permanently. However, the function continues to be available to process invocation events.
Based on this timeline and our commitment to supporting runtimes through their LTS, the following schedule applies to the deprecation of AL1-based runtimes:
AWS is committed to helping developers build their Lambda functions with the latest tools and technology. This post covers two new Lambda runtimes that expand the range of available runtimes on AL2. I discuss the end-of-life schedule of AL1 and why developers should start migrating to AL2 now. Finally, I discuss the remaining runtimes and the plan for supporting them until deprecation, according to the AWS runtime support policy.
This post is courtesy of Shivansh Singh, Partner Solutions Architect – AWS
We are excited to announce that you can now develop your AWS Lambda functions using the Python 3.7 runtime. Start using this new version today by specifying a runtime parameter value of "python3.7" when creating or updating functions. AWS continues to support creating new Lambda functions on Python 3.6 and Python 2.7.
Here’s a quick primer on some of the major new features in Python 3.7:
Data classes
Customization of access to module attributes
Typing enhancements
Time functions with nanosecond resolution
Data classes
In object-oriented programming, a typical class definition looks like the following in Python 3.6:
class Employee:
    def __init__(self, name: str, dept: int) -> None:
        self.name = name
        self.dept = dept

    def is_fulltime(self) -> bool:
        """Return True if employee is a Full-time employee, else False"""
        return self.dept > 25
The __init__ method receives multiple arguments to initialize the instance. These arguments are set as instance attributes.
With Python 3.7, you have dataclasses, which make class declarations easier and more readable. You can use the @dataclass decorator on the class declaration and self-assignment is taken care of automatically. It generates __init__, __repr__, __eq__, __hash__, and other special methods. In Python 3.7, the Employee class defined earlier looks like the following:
from dataclasses import dataclass

@dataclass
class Employee:
    name: str
    dept: int

    def is_fulltime(self) -> bool:
        """Return True if employee is a Full-time employee, else False"""
        return self.dept > 25
Customization of access to module attributes
Attributes are widely used in Python. Most commonly, they are used in classes. However, attributes can be put on functions and modules as well. Attributes are retrieved using the dot notation: something.attribute. You can also get attributes that are named at runtime by using the getattr() function.
For classes, something.attr first looks for attr defined on something. If it's not found, then the special method something.__getattr__("attr") is called. The __getattr__() method can be used to customize access to attributes on objects. Until Python 3.7, this customization was not easily available for module attributes. PEP 562 adds __getattr__() on modules, along with a corresponding __dir__() function.
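As a minimal sketch of what PEP 562 enables (the module and attribute names here are made up purely for illustration):

# mymodule.py -- module-level __getattr__ and __dir__ (Python 3.7, PEP 562)
import warnings

def _new_function():
    return "use me instead"

def __getattr__(name):
    # Called only when normal attribute lookup on the module fails.
    if name == "old_function":
        warnings.warn("old_function is deprecated", DeprecationWarning)
        return _new_function
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")

def __dir__():
    # Make the deprecated name show up in dir(mymodule) as well.
    return sorted(list(globals()) + ["old_function"])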
Typing enhancements
Type annotations are commonly used for code hints. However, there were two common issues with using type hints extensively in the code:
Annotations could only use names that were already available in the current scope. In other words, they didn’t support forward references.
Annotating source code had adverse effects on the startup time of Python programs.
Both of these issues are fixed in Python 3.7, by postponing the evaluation of annotations. Instead of compiling code that executes expressions in annotations at their definition time, the compiler stores the annotation in a string form equivalent to the AST of the expression in question.
For example, the following code fails, as spouse cannot be defined as type Employee, given that Employee is not defined yet.
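Here is a minimal sketch of such a definition; in Python 3.6 it raises a NameError because the annotation is evaluated while the class body is still being defined:

class Employee:
    def __init__(self, name: str, spouse: Employee) -> None:
        # NameError: name 'Employee' is not defined -- the annotation is
        # evaluated at definition time, before the class exists.
        pass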
In Python 3.7, the evaluation of annotations is postponed. Each annotation is stored as a string and optionally evaluated as needed. You do need the from __future__ import annotations statement, which breaks backward compatibility with versions of Python earlier than 3.7, where that import does not exist.
from __future__ import annotations

class Employee:
    def __init__(self, name: str, spouse: Employee) -> None:
        pass
Time functions with nanosecond resolution
The time module gets some new functions in Python 3.7. The clock resolution can exceed the limited precision of the floating point number returned by the time.time() function and its variants. The following new functions are added:
clock_gettime_ns()
clock_settime_ns()
monotonic_ns()
perf_counter_ns()
process_time_ns()
time_ns()
These functions are similar to already existing functions without the _ns suffix. The difference is that the above functions return a number of nanoseconds as an int instead of a number of seconds as a float.
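For example, a quick sketch comparing the two forms (the printed values depend on your system clock):

import time

seconds = time.time()         # float seconds since the epoch, limited precision
nanoseconds = time.time_ns()  # int nanoseconds since the epoch, no float rounding

print(seconds)
print(nanoseconds)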
One of the most common enquiries I receive at Pi Towers is “How can I get my hands on a Raspberry Pi Oracle Weather Station?” Now the answer is: “Why not build your own version using our guide?”
Tadaaaa! The BYO weather station fully assembled.
Our Oracle Weather Station
In 2016 we sent out nearly 1000 Raspberry Pi Oracle Weather Station kits to schools around the world that had applied to be part of our weather station programme. The original kit included a special HAT that allows the Pi to collect weather data with a set of sensors.
The original Raspberry Pi Oracle Weather Station HAT
We designed the HAT to enable students to create their own weather stations and mount them at their schools. As part of the programme, we also provide an ever-growing range of supporting resources. We’ve seen Oracle Weather Stations in great locations with huge differences in climate, and they’ve even recorded the effects of a solar eclipse.
Our new BYO weather station guide
We only had a single batch of HATs made, and unfortunately we’ve given nearly* all the Weather Station kits away. Not only are the kits really popular, but we also receive lots of questions about how to add extra sensors or how to take more precise measurements of a particular weather phenomenon. So today, to satisfy your demand for a hackable weather station, we’re launching our Build your own weather station guide!
Fun with meteorological experiments!
Our guide suggests the use of many of the sensors from the Oracle Weather Station kit, so you can build a station that’s as close as possible to the original. As you know, the Raspberry Pi is incredibly versatile, and we’ve made it easy to hack the design in case you want to use different sensors.
Many other tutorials for Pi-powered weather stations don’t explain how the various sensors work or how to store your data. Ours goes into more detail. It shows you how to put together a breadboard prototype, it describes how to write Python code to take readings in different ways, and it guides you through recording these readings in a database.
There’s also a section on how to make your station weatherproof. And in case you want to move past the breadboard stage, we also help you with that. The guide shows you how to solder together all the components, similar to the original Oracle Weather Station HAT.
Who should try this build
We think this is a great project to tackle at home, at a STEM club, Scout group, or CoderDojo, and we’re sure that many of you will be chomping at the bit to get started. Before you do, please note that we’ve designed the build to be as straightforward as possible, but it’s still fairly advanced both in terms of electronics and programming. You should read through the whole guide before purchasing any components.
The sensors and components we’re suggesting balance cost, accuracy, and ease of use. Depending on what you want to use your station for, you may wish to use different components. Similarly, the final soldered design in the guide may not be the most elegant, but we think it is achievable for someone with modest soldering experience and basic equipment.
Using our guide, you can build a functioning weather station without soldering, but the build will be more durable if you do solder it. If you’ve never tried soldering before, that’s OK: we have a Getting started with soldering resource plus video tutorial that will walk you through how it works step by step.
For those of you who are more experienced makers, there are plenty of different ways to put the final build together. We always like to hear about alternative builds, so please post your designs in the Weather Station forum.
Our plans for the guide
Our next step is publishing supplementary guides for adding extra functionality to your weather station. We’d love to hear which enhancements you would most like to see! Our current ideas under development include adding a webcam, making a tweeting weather station, adding a light/UV meter, and incorporating a lightning sensor. Let us know which of these is your favourite, or suggest your own amazing ideas in the comments!
*We do have a very small number of kits reserved for interesting projects or locations: a particularly cool experiment, a novel idea for how the Oracle Weather Station could be used, or places with specific weather phenomena. If you have such a project in mind, please send a brief outline to [email protected], and we’ll consider how we might be able to help you.
The 4.17 kernel appears to be on track for a June 3 release, barring an unlikely last-minute surprise. So the time has come for the usual look at some development statistics for this cycle. While 4.17 is a normal cycle for the most part, it does have one characteristic of note: it is the third kernel release ever to be smaller (in terms of lines of code) than its predecessor.
Backblaze is hiring a Director of Sales. This is a critical role for Backblaze as we continue to grow the team. We need a strong leader who has experience in scaling a sales team and who has an excellent track record for exceeding goals by selling Software as a Service (SaaS) solutions. In addition, this leader will need to be highly motivated, as well as able to create and develop a highly motivated, success-oriented sales team that has fun and enjoys what they do.
The History of Backblaze, from our CEO
In 2007, after a friend’s computer crash caused her some suffering, we realized that with every photo, video, song, and document going digital, everyone would eventually lose all of their information. Five of us quit our jobs to start a company with the goal of making it easy for people to back up their data.
Like many startups, for a while we worked out of a co-founder’s one-bedroom apartment. Unlike most startups, we made an explicit agreement not to raise funding during the first year. We would then touch base every six months and decide whether to raise or not. We wanted to focus on building the company and the product, not on pitching and slide decks. And critically, we wanted to build a culture that understood money comes from customers, not the magical VC giving tree. Over the course of 5 years we built a profitable, multi-million dollar revenue business — and only then did we raise a VC round.
Fast forward 10 years later and our world looks quite different. You’ll have some fantastic assets to work with:
A brand millions recognize for openness, ease-of-use, and affordability.
A computer backup service that stores over 500 petabytes of data, has recovered over 30 billion files for hundreds of thousands of paying customers — most of whom self-identify as being the people that find and recommend technology products to their friends.
Our B2 service that provides the lowest cost cloud storage on the planet at one-fourth the price Amazon, Google, or Microsoft charges. While a newer product on the market, it already has over 100,000 IT professionals and developers signed up, as well as an ecosystem building up around it.
A growing, profitable and cash-flow positive company.
And last, but most definitely not least: a great sales team.
You might be saying, “sounds like you’ve got this under control — why do you need me?” Don’t be misled. We need you. Here’s why:
We have a great team, but we are in the process of expanding and we need to develop a structure that will easily scale and provide the most success to drive revenue.
We just launched our outbound sales efforts and we need someone to help develop that into a fully successful program that’s building a strong pipeline and closing business.
We need someone to work with the marketing department and figure out how to generate more inbound opportunities that the sales team can follow up on and close.
We need someone who will work closely in developing the skills of our current sales team and build a path for career growth and advancement.
We want someone to manage our Customer Success program.
So that’s a bit about us. What are we looking for in you?
Experience: As a sales leader, you will strategically build and drive the territory’s sales pipeline by assembling and leading a skilled team of sales professionals. This leader should be familiar with generating, developing and closing software subscription (SaaS) opportunities. We are looking for a self-starter who can manage a team and make an immediate impact of selling our Backup and Cloud Storage solutions. In this role, the sales leader will work closely with the VP of Sales, marketing staff, and service staff to develop and implement specific strategic plans to achieve and exceed revenue targets, including new business acquisition as well as build out our customer success program.
Leadership: We have an experienced team who’s brought us to where we are today. You need to have the people and management skills to get them excited about working with you. You need to be a strong leader and compassionate about developing and supporting your team.
Data driven and creative: The data has to show something makes sense before we scale it up. However, without creativity, it’s easy to say “the data shows it’s impossible” or to find a local maximum. Whether it’s deciding how to scale the team, figuring out what our outbound sales efforts should look like or putting a plan in place to develop the team for career growth, we’ve seen a bit of creativity get us places a few extra dollars couldn’t.
Jive with our culture: Strong leaders affect culture and the person we hire for this role may well shape, not only fit into, ours. But to shape the culture you have to be accepted by the organism, which means a certain set of shared values. We default to openness with our team, our customers, and everyone if possible. We love initiative — without arrogance or dictatorship. We work to create a place people enjoy showing up to work. That doesn’t mean ping pong tables and foosball (though we do try to have perks & fun), but it means people are friendly, non-political, working to build a good service but also a good place to work.
Do the work: Ideas and strategy are critical, but good execution makes them happen. We’re looking for someone who can help the team execute both from the perspective of being capable of guiding and organizing, but also someone who is hands-on themselves.
Additional Responsibilities needed for this role:
Recruit, coach, mentor, manage and lead a team of sales professionals to achieve yearly sales targets. This includes closing new business and expanding upon existing clientele.
Expand the customer success program to provide the best customer experience possible resulting in upsell opportunities and a high retention rate.
Develop effective sales strategies and deliver compelling product demonstrations and sales pitches.
Acquire and develop the appropriate sales tools to make the team efficient in their daily workflow.
Apply a thorough understanding of the marketplace, industry trends, funding developments, and products to all management activities and strategic sales decisions.
Ensure that sales department operations function smoothly, with the goal of facilitating sales and/or closings; operational responsibilities include accurate pipeline reporting and sales forecasts.
This position will report directly to the VP of Sales and will be staffed in our headquarters in San Mateo, CA.
Requirements:
7 – 10+ years of successful sales leadership experience as measured by sales performance against goals. Experience in developing skill sets and providing career growth and opportunities through advancement of team members.
Background in selling SaaS technologies with a strong track record of success.
Strong presentation and communication skills.
Must be able to travel occasionally nationwide.
BA/BS degree required.
Think you want to join us on this adventure? Send an email to jobscontact@backblaze.com with the subject “Director of Sales.” (Recruiters and agencies, please don’t email us.) Include a resume and answer these two questions:
How would you approach evaluating the current sales team and what is your process for developing a growth strategy to scale the team?
What are the goals you would set for yourself in the 3-month and 1-year timeframes?
Thank you for taking the time to read this and I hope that this sounds like the opportunity for which you’ve been waiting.
Amazon Neptune is now Generally Available in US East (N. Virginia), US East (Ohio), US West (Oregon), and EU (Ireland). Amazon Neptune is a fast, reliable, fully-managed graph database service that makes it easy to build and run applications that work with highly connected datasets. At the core of Neptune is a purpose-built, high-performance graph database engine optimized for storing billions of relationships and querying the graph with millisecond latencies. Neptune supports two popular graph models, Property Graph and RDF, through Apache TinkerPop Gremlin and SPARQL, allowing you to easily build queries that efficiently navigate highly connected datasets. Neptune can be used to power everything from recommendation engines and knowledge graphs to drug discovery and network security. Neptune is fully-managed with automatic minor version upgrades, backups, encryption, and fail-over. I wrote about Neptune in detail for AWS re:Invent last year and customers have been using the preview and providing great feedback that the team has used to prepare the service for GA.
Now that Amazon Neptune is generally available there are a few changes from the preview:
A large number of performance enhancements and updates
Launching a Neptune cluster is as easy as navigating to the AWS Management Console and clicking create cluster. Of course you can also launch with CloudFormation, the CLI, or the SDKs.
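For example, here is a minimal, illustrative boto3 sketch of the SDK route; the cluster and instance identifiers and the instance class below are placeholder values:

import boto3

neptune = boto3.client('neptune')

# Create the cluster, then add a primary instance to it.
neptune.create_db_cluster(
    DBClusterIdentifier='my-neptune-cluster',    # placeholder name
    Engine='neptune'
)
neptune.create_db_instance(
    DBInstanceIdentifier='my-neptune-instance',  # placeholder name
    DBInstanceClass='db.r4.large',               # pick a class that fits your workload
    Engine='neptune',
    DBClusterIdentifier='my-neptune-cluster'
)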
You can monitor your cluster health and the health of individual instances through Amazon CloudWatch and the console.
Additional Resources
We’ve created two repos with some additional tools and examples here. You can expect continuous development on these repos as we add additional tools and examples.
Amazon Neptune Tools Repo: This repo has a useful tool for converting GraphML files into Neptune-compatible CSVs for bulk loading from S3.
Amazon Neptune Samples Repo: This repo has a really cool example of building a collaborative filtering recommendation engine for video game preferences.
Purpose Built Databases
There’s an industry trend where we’re moving more and more onto purpose-built databases. Developers and businesses want to access their data in the format that makes the most sense for their applications. As cloud resources make transforming large datasets easier with tools like AWS Glue, we have a lot more options than we used to for accessing our data. With tools like Amazon Redshift, Amazon Athena, Amazon Aurora, Amazon DynamoDB, and more we get to choose the best database for the job or even enable entirely new use-cases. Amazon Neptune is perfect for workloads where the data is highly connected across data rich edges.
I’m really excited about graph databases and I see a huge number of applications. Looking for ideas of cool things to build? I’d love to build a web crawler in AWS Lambda that uses Neptune as the backing store. You could further enrich it by running Amazon Comprehend or Amazon Rekognition on the text and images found and creating a search engine on top of Neptune.
As always, feel free to reach out in the comments or on Twitter to provide any feedback!
This post is courtesy of Otavio Ferreira, Manager, Amazon SNS, AWS Messaging.
Amazon SNS message filtering provides a set of string and numeric matching operators that allow each subscription to receive only the messages of interest. Hence, SNS message filtering can simplify your pub/sub messaging architecture by offloading the message filtering logic from your subscriber systems, as well as the message routing logic from your publisher systems.
After you set the subscription attribute that defines a filter policy, the subscribing endpoint receives only the messages that carry attributes matching this filter policy. Other messages published to the topic are filtered out for this subscription. In addition, the native integration between SNS and Amazon CloudWatch provides visibility into the number of messages delivered, as well as the number of messages filtered out.
CloudWatch metrics are captured automatically for you. To get started with SNS message filtering, see Filtering Messages with Amazon SNS.
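For illustration, a filter policy is simply a subscription attribute, so a minimal boto3 sketch (the subscription ARN and the attribute names inside the policy are placeholders) looks like this:

import json
import boto3

sns = boto3.client('sns')

# Only deliver messages whose 'event_type' attribute is 'order_placed'
# and whose 'price_usd' attribute is 100 or more.
filter_policy = {
    "event_type": ["order_placed"],
    "price_usd": [{"numeric": [">=", 100]}]
}

sns.set_subscription_attributes(
    SubscriptionArn='arn:aws:sns:us-east-1:123456789012:my-topic:subscription-id',  # placeholder
    AttributeName='FilterPolicy',
    AttributeValue=json.dumps(filter_policy)
)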
Message Filtering Metrics
The following six CloudWatch metrics are relevant to understanding your SNS message filtering activity:
NumberOfMessagesPublished – Inbound traffic to SNS. This metric tracks all the messages that have been published to the topic.
NumberOfNotificationsDelivered – Outbound traffic from SNS. This metric tracks all the messages that have been successfully delivered to endpoints subscribed to the topic. A delivery takes place either when the incoming message attributes match a subscription filter policy, or when the subscription has no filter policy at all, which results in a catch-all behavior.
NumberOfNotificationsFilteredOut – This metric tracks all the messages that were filtered out because they carried attributes that didn’t match the subscription filter policy.
NumberOfNotificationsFilteredOut-NoMessageAttributes – This metric tracks all the messages that were filtered out because they didn’t carry any attributes at all and, consequently, didn’t match the subscription filter policy.
NumberOfNotificationsFilteredOut-InvalidAttributes – This metric keeps track of messages that were filtered out because they carried invalid or malformed attributes and, thus, didn’t match the subscription filter policy.
NumberOfNotificationsFailed – This last metric tracks all the messages that failed to be delivered to subscribing endpoints, regardless of whether a filter policy had been set for the endpoint. This metric is emitted after the message delivery retry policy is exhausted, and SNS stops attempting to deliver the message. At that moment, the subscribing endpoint is likely no longer reachable. For example, the subscribing SQS queue or Lambda function has been deleted by its owner. You may want to closely monitor this metric to address message delivery issues quickly.
Message filtering graphs
Through the AWS Management Console, you can compose graphs to display your SNS message filtering activity. The graph shows the number of messages published, delivered, and filtered out within the timeframe you specify (1h, 3h, 12h, 1d, 3d, 1w, or custom).
To compose an SNS message filtering graph with CloudWatch:
Open the CloudWatch console.
Choose Metrics, SNS, All Metrics, and Topic Metrics.
Select all metrics to add to the graph, such as:
NumberOfMessagesPublished
NumberOfNotificationsDelivered
NumberOfNotificationsFilteredOut
Choose Graphed metrics.
In the Statistic column, switch from Average to Sum.
Title your graph with a descriptive name, such as “SNS Message Filtering”
After you have your graph set up, you may want to copy the graph link for bookmarking, emailing, or sharing with co-workers. You may also want to add your graph to a CloudWatch dashboard for easy access in the future. Both actions are available to you on the Actions menu, which is found above the graph.
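If you prefer to pull the same numbers programmatically instead of through the console, a minimal boto3 sketch (the topic name is a placeholder) might look like this:

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client('cloudwatch')

response = cloudwatch.get_metric_statistics(
    Namespace='AWS/SNS',
    MetricName='NumberOfNotificationsFilteredOut',
    Dimensions=[{'Name': 'TopicName', 'Value': 'my-topic'}],  # placeholder topic
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    Period=3600,           # one datapoint per hour
    Statistics=['Sum']     # match the Sum statistic used in the graph
)
for point in sorted(response['Datapoints'], key=lambda p: p['Timestamp']):
    print(point['Timestamp'], point['Sum'])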
Summary
SNS message filtering defines how SNS topics behave in terms of message delivery. By using CloudWatch metrics, you gain visibility into the number of messages published, delivered, and filtered out. This enables you to validate the operation of filter policies and more easily troubleshoot during development phases.
SNS message filtering can be implemented easily with existing AWS SDKs by applying message and subscription attributes across all SNS supported protocols (Amazon SQS, AWS Lambda, HTTP, SMS, email, and mobile push). CloudWatch metrics for SNS message filtering are available now, in all AWS Regions.
This post is courtesy of Alan Protasio, Software Development Engineer, Amazon Web Services
Just like compute and storage, messaging is a fundamental building block of enterprise applications. Message brokers (aka “message-oriented middleware”) enable different software systems, often written in different languages, on different platforms, running in different locations, to communicate and exchange information. Mission-critical applications, such as CRM and ERP, rely on message brokers to work.
A common performance consideration for customers deploying a message broker in a production environment is the throughput of the system, measured as messages per second. This is important to know so that application environments (hosts, threads, memory, etc.) can be configured correctly.
In this post, we demonstrate how to measure the throughput for Amazon MQ, a new managed message broker service for ActiveMQ, using JMS Benchmark. It should take 15–20 minutes to set up the environment and an hour to run the benchmark. We also provide some tips on how to configure Amazon MQ for optimal throughput.
Benchmarking throughput for Amazon MQ
ActiveMQ can be used for a number of use cases. These range from simple fire-and-forget tasks (that is, asynchronous processing) and low-latency request-reply patterns to buffering requests before they are persisted to a database.
The throughput of Amazon MQ is largely dependent on the use case. For example, if you have non-critical workloads such as gathering click events for a non-business-critical portal, you can use ActiveMQ in a non-persistent mode and get extremely high throughput with Amazon MQ.
On the flip side, if you have a critical workload where durability is extremely important (meaning that you can’t lose a message), then you are bound by the I/O capacity of your underlying persistence store. We recommend using mq.m4.large for the best results. The mq.t2.micro instance type is intended for product evaluation. Performance is limited, due to the lower memory and burstable CPU performance.
Tip: To improve your throughput with Amazon MQ, make sure that you have consumers processing messaging as fast as (or faster than) your producers are pushing messages.
Because it’s impossible to talk about how the broker (ActiveMQ) behaves for each and every use case, we walk through how to set up your own benchmark for Amazon MQ using our favorite open-source benchmarking tool: JMS Benchmark. We are fans of the JMS Benchmark suite because it’s easy to set up and deploy, and comes with a built-in visualizer of the results.
Non-Persistent Scenarios – Queue latency as you scale producer throughput
Getting started
At the time of publication, you can create an mq.m4.large single-instance broker for testing for $0.30 per hour (US pricing).
Step 2 – Create an EC2 instance to run your benchmark
Launch the EC2 instance using Step 1: Launch an Instance. We recommend choosing the m5.large instance type.
Step 3 – Configure the security groups
Make sure that all the security groups are correctly configured to let the traffic flow between the EC2 instance and your broker.
From the broker list, choose the name of your broker (for example, MyBroker)
In the Details section, under Security and network, choose the name of your security group or choose the expand icon ( ).
From the security group list, choose your security group.
At the bottom of the page, choose Inbound, Edit.
In the Edit inbound rules dialog box, add a rule to allow traffic between your instance and the broker:
Choose Add Rule.
For Type, choose Custom TCP.
For Port Range, type the ActiveMQ SSL port (61617).
For Source, leave Custom selected and then type the security group of your EC2 instance.
Choose Save.
Your broker can now accept the connection from your EC2 instance.
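If you would rather script this step, a rough boto3 sketch of the equivalent rule (both security group IDs below are placeholders) might look like this:

import boto3

ec2 = boto3.client('ec2')

# Allow the EC2 instance's security group to reach the broker's
# security group on the ActiveMQ SSL port (61617).
ec2.authorize_security_group_ingress(
    GroupId='sg-broker-placeholder',
    IpPermissions=[{
        'IpProtocol': 'tcp',
        'FromPort': 61617,
        'ToPort': 61617,
        'UserIdGroupPairs': [{'GroupId': 'sg-ec2-instance-placeholder'}]
    }]
)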
Step 4 – Run the benchmark
Connect to your EC2 instance using SSH and run the following commands:
After the benchmark finishes, you can find the results in the ~/reports directory. As you may notice, the performance of ActiveMQ varies based on the number of consumers, producers, destinations, and message size.
Amazon MQ architecture
The last bit that’s important to know so that you can better understand the results of the benchmark is how Amazon MQ is architected.
Amazon MQ is architected to be highly available (HA) and durable. For HA, we recommend using the multi-AZ option. After a message is sent to Amazon MQ in persistent mode, the message is written to the highly durable message store that replicates the data across multiple nodes in multiple Availability Zones. Because of this replication, for some use cases you may see a reduction in throughput as you migrate to Amazon MQ. Customers have told us they appreciate the benefits of message replication as it helps protect durability even in the face of the loss of an Availability Zone.
Conclusion
We hope this gives you an idea of how Amazon MQ performs. We encourage you to run tests to simulate your own use cases.
To learn more, see the Amazon MQ website. You can try Amazon MQ for free with the AWS Free Tier, which includes up to 750 hours of a single-instance mq.t2.micro broker and up to 1 GB of storage per month for one year.
The 4.17-rc7 kernel prepatch is out; it’s likely the last one for this development cycle. “So this week wasn’t as calm as the previous weeks have been, but despite that I suspect this is the last rc.”
When I talk with customers and partners, I find that they are in different stages in the adoption of DevOps methodologies. They are automating the creation of application artifacts and the deployment of their applications to different infrastructure environments. In many cases, they are creating and supporting multiple applications using a variety of coding languages and artifacts.
The management of these processes and artifacts can be challenging, but using the right tools and methodologies can simplify the process.
In this post, I will show you how you can automate the creation and storage of application artifacts through the implementation of a pipeline and custom deploy action in AWS CodePipeline. The example includes a Node.js code base stored in an AWS CodeCommit repository. A Node Package Manager (npm) artifact is built from the code base, and the build artifact is published to a JFrog Artifactory npm repository.
I frequently recommend AWS CodePipeline, the AWS continuous integration and continuous delivery tool. You can use it to quickly innovate through integration and deployment of new features and bug fixes by building a workflow that automates the build, test, and deployment of new versions of your application. And, because AWS CodePipeline is extensible, it allows you to create a custom action that performs customized, automated actions on your behalf.
JFrog’s Artifactory is a universal binary repository manager where you can manage multiple applications, their dependencies, and versions in one place. Artifactory also enables you to standardize the way you manage your package types across all applications developed in your company, no matter the code base or artifact type.
If you already have a Node.js CodeCommit repository and a JFrog Artifactory host, and would like to automate the creation of the pipeline, including the custom action and CodeBuild project, you can use this AWS CloudFormation template to create your AWS CloudFormation stack.
This figure shows the path defined in the pipeline for this project. It starts with a change to Node.js source code committed to a private code repository in AWS CodeCommit. With this change, CodePipeline triggers AWS CodeBuild to create the npm package from the Node.js source code. After the build, CodePipeline triggers the custom action job worker to commit the build artifact to the designated artifact repository in Artifactory.
This blog post assumes you have already:
Created a CodeCommit repository that contains a Node.js project.
Configured a two-stage pipeline in AWS CodePipeline.
The Source stage of the pipeline is configured to poll the Node.js CodeCommit repository. The Build stage is configured to use a CodeBuild project to build the npm package using a buildspec.yml file located in the code repository.
If you do not have a Node.js repository, you can create a CodeCommit repository that contains this simple ‘Hello World’ project. This project also includes a buildspec.yml file that is used when you define your CodeBuild project. It defines the steps to be taken by CodeBuild to create the npm artifact.
If you do not already have a pipeline set up in CodePipeline, you can use this template to create a pipeline with a CodeCommit source action and a CodeBuild build action through the AWS Command Line Interface (AWS CLI). If you do not want to install the AWS CLI on your local machine, you can use AWS Cloud9, our managed integrated development environment (IDE), to interact with AWS APIs.
In your development environment, open your favorite editor and fill out the template with values appropriate to your project. For information, see the readme in the GitHub repository.
Use this CLI command to create the pipeline from the template:
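Assuming the filled-out template is a CodePipeline pipeline definition in JSON, the command would be along these lines (the file name is a placeholder):

aws codepipeline create-pipeline --cli-input-json file://pipeline-template.json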
It creates a pipeline that has a CodeCommit source action and a CodeBuild build action.
Integrating JFrog Artifactory
JFrog Artifactory provides default repositories for your project needs. For my npm package repository, I am using the default virtual npm repository (named npm) that is available in Artifactory Pro. You might want to consider creating a repository per project, but for the example used in this post, using the default lets me get started without having to configure a new repository.
I can use the steps in the Set Me Up -> npm section on the landing page to configure my worker to interact with the default NPM repository.
The custom action definition describes the required values to run the custom action. I will define my custom action in the ‘Deploy’ category, identify the provider as ‘Artifactory’, with version ‘1’, and specify a variety of configurationProperties whose values will be defined when this stage is added to my pipeline.
The job worker polls CodePipeline for a job, scanning for its action-definition properties. In this blog post, after a job has been found, the job worker does the work required to publish the npm artifact to the Artifactory repository.
{
"category": "Deploy",
"configurationProperties": [{
"name": "TypeOfArtifact",
"required": true,
"key": true,
"secret": false,
"description": "Package type, ex. npm for node packages",
"type": "String"
},
{ "name": "RepoKey",
"required": true,
"key": true,
"secret": false,
"type": "String",
"description": "Name of the repository in which this artifact should be stored"
},
{ "name": "UserName",
"required": true,
"key": true,
"secret": false,
"type": "String",
"description": "Username for authenticating with the repository"
},
{ "name": "Password",
"required": true,
"key": true,
"secret": true,
"type": "String",
"description": "Password for authenticating with the repository"
},
{ "name": "EmailAddress",
"required": true,
"key": true,
"secret": false,
"type": "String",
"description": "Email address used to authenticate with the repository"
},
{ "name": "ArtifactoryHost",
"required": true,
"key": true,
"secret": false,
"type": "String",
"description": "Public address of Artifactory host, ex: https://myexamplehost.com or http://myexamplehost.com:8080"
}],
"provider": "Artifactory",
"version": "1",
"settings": {
"entityUrlTemplate": "{Config:ArtifactoryHost}/artifactory/webapp/#/artifacts/browse/tree/General/{Config:RepoKey}"
},
"inputArtifactDetails": {
"maximumCount": 5,
"minimumCount": 1
},
"outputArtifactDetails": {
"maximumCount": 5,
"minimumCount": 0
}
}
There are seven sections to the custom action definition:
category: This is the stage in which you will be creating this action. It can be Source, Build, Deploy, Test, Invoke, Approval. Except for source actions, the category section simply allows us to organize our actions. I am setting the category for my action as ‘Deploy’ because I’m using it to publish my node artifact to my Artifactory instance.
configurationProperties: These are the parameters or variables required for your project to authenticate and commit your artifact. In the case of my custom worker, I need:
TypeOfArtifact: In this case, npm, because it’s for the Node Package Manager.
RepoKey: The name of the repository. In this case, it’s the default npm.
UserName and Password for the user to authenticate with the Artifactory repository.
EmailAddress used to authenticate with the repository.
Artifactory host name or IP address.
provider: The name you define for your custom action stage. I have named the provider Artifactory.
version: Version number for the custom action. Because this is the first version, I set the version number to 1.
entityUrlTemplate: This URL is presented to your users for the deploy stage along with the title you define in your provider. The link takes the user to their artifact repository page in the Artifactory host.
inputArtifactDetails: The number of artifacts to expect from the previous stage in the pipeline.
outputArtifactDetails: The number of artifacts that should be the result from the custom action stage. Later in this blog post, I define 0 for my output artifacts because I am publishing the artifact to the Artifactory repository as the final action.
After I define the custom action in a JSON file, I use the AWS CLI to create the custom action type in CodePipeline:
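Assuming the definition above is saved to a local JSON file, the command is along these lines (the file name is a placeholder):

aws codepipeline create-custom-action-type --cli-input-json file://artifactory_custom_action.json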
After I create the custom action type in the same region as my pipeline, I edit the pipeline to add a Deploy stage and configure it to use the custom action I created for Artifactory:
I have created a custom worker for the actions required to commit the npm artifact to the Artifactory repository. The worker is written in Python and runs in a loop on an Amazon EC2 instance. My custom worker polls for a deploy job and publishes the npm artifact to the Artifactory repository.
The EC2 instance is running Amazon Linux and has an IAM instance role attached that gives the worker permission to access CodePipeline. The worker process is as follows:
Take the configuration properties from the custom worker and poll CodePipeline for a custom action job.
After there is a job in the job queue with the appropriate category, provider, and version, acknowledge the job.
Download the zipped artifact created in the previous Build stage from the provided S3 buckets with the provided temporary credentials.
Unzip the artifact into a temporary directory.
A user-defined Artifactory user name and password are used to obtain a temporary API key from Artifactory.
To avoid having to write the password to a file, use that temporary API key and user name to authenticate with the NPM repository.
Publish the Node.js package to the specified repository.
Because I am running my custom worker on an Amazon Linux EC2 instance, I installed npm with the following command:
sudo yum install nodejs npm --enablerepo=epel
For my custom worker, I used pip to install the required Python libraries:
pip install boto3 requests
For a full Python package list, see requirements.txt in the GitHub repository.
Let’s take a look at some of the code snippets from the worker.
First, the worker polls for jobs:
def action_type():
    # The action type ID that this worker handles (must match the custom action definition).
    ActionType = {
        'category': 'Deploy',
        'owner': 'Custom',
        'provider': 'Artifactory',
        'version': '1' }
    return(ActionType)

def poll_for_jobs():
    try:
        artifactory_action_type = action_type()
        print(artifactory_action_type)
        # 'codepipeline' is the boto3 CodePipeline client created elsewhere in the worker.
        jobs = codepipeline.poll_for_jobs(actionTypeId=artifactory_action_type)
        while not jobs['jobs']:
            time.sleep(10)
            jobs = codepipeline.poll_for_jobs(actionTypeId=artifactory_action_type)
        if jobs['jobs']:
            print('Job found')
        return jobs['jobs'][0]
    except ClientError as e:
        print("Received an error: %s" % str(e))
        raise
When there is a job in the queue, the poller returns a number of values from the queue such as jobId, the input and output S3 buckets for artifacts, temporary credentials to access the S3 buckets, and other configuration details from the stage in the pipeline.
After successfully receiving the job details, the worker sends an acknowledgement to CodePipeline to ensure that the work on the job is not duplicated by other workers watching for the same job:
def job_acknowledge(jobId, nonce):
    try:
        print('Acknowledging job')
        result = codepipeline.acknowledge_job(jobId=jobId, nonce=nonce)
        return result
    except Exception as e:
        print("Received an error when trying to acknowledge the job: %s" % str(e))
        raise
With the job now acknowledged, the worker publishes the source code artifact into the desired repository. The worker gets the value of the artifact S3 bucket and objectKey from the inputArtifacts in the response from the poll_for_jobs API request. Next, the worker creates a new directory in /tmp and downloads the S3 object into this directory:
def get_bucket_location(bucketName, init_client):
    region = init_client.get_bucket_location(Bucket=bucketName)['LocationConstraint']
    if not region:
        region = 'us-east-1'
    return region

def get_s3_artifact(bucketName, objectKey, ak, sk, st):
    init_s3 = boto3.client('s3')
    region = get_bucket_location(bucketName, init_s3)
    session = Session(aws_access_key_id=ak,
                      aws_secret_access_key=sk,
                      aws_session_token=st)
    s3 = session.resource('s3',
                          region_name=region,
                          config=botocore.client.Config(signature_version='s3v4'))
    try:
        tempdirname = tempfile.mkdtemp()
    except OSError as e:
        print('Could not write temp directory %s' % tempdirname)
        raise
    bucket = s3.Bucket(bucketName)
    obj = bucket.Object(objectKey)
    filename = tempdirname + '/' + objectKey
    try:
        if os.path.dirname(objectKey):
            directory = os.path.dirname(filename)
            os.makedirs(directory)
        print('Downloading the %s object and writing it to disk in %s location' % (objectKey, tempdirname))
        with open(filename, 'wb') as data:
            obj.download_fileobj(data)
    except ClientError as e:
        print('Downloading the object and writing the file to disk raised this error: ' + str(e))
        raise
    return(filename, tempdirname)
Because the downloaded artifact from S3 is a zip file, the worker must unzip it first. To have a clean area in which to work, I extract the downloaded zip archive into a new directory:
def unzip_codepipeline_artifact(artifact, origtmpdir):
    # Create a new temp directory and unzip the artifact into it.
    try:
        newtempdir = tempfile.mkdtemp()
        print('Extracting artifact %s into temporary directory %s' % (artifact, newtempdir))
        zip_ref = zipfile.ZipFile(artifact, 'r')
        zip_ref.extractall(newtempdir)
        zip_ref.close()
        shutil.rmtree(origtmpdir)
        return(os.listdir(newtempdir), newtempdir)
    except OSError as e:
        if e.errno != errno.EEXIST:
            shutil.rmtree(newtempdir)
            raise
The worker now has the npm package that I want to store in my Artifactory NPM repository.
To authenticate with the NPM repository, the worker requests a temporary token from the Artifactory host. After receiving this temporary token, it creates a .npmrc file in the worker user’s home directory that includes a hash of the user name and temporary token. After it has authenticated, the worker runs npm config set registry <URL OF REPOSITORY> to configure the npm registry value to be the Artifactory host. Next, the worker runs npm publish --registry <URL OF REPOSITORY>, which publishes the node package to the NPM repository in the Artifactory host.
def push_to_npm(configuration, artifact_list, temp_dir, jobId):
    reponame = configuration['RepoKey']
    art_type = configuration['TypeOfArtifact']
    print("Putting artifact into NPM repository " + reponame)
    token, hostname, username = gen_artifactory_auth_token(configuration)
    npmconfigfile = create_npmconfig_file(configuration, username, token)
    url = hostname + '/artifactory/api/' + art_type + '/' + reponame
    print("Changing directory to " + str(temp_dir))
    os.chdir(temp_dir)
    try:
        print("Publishing following files to the repository: %s " % os.listdir(temp_dir))
        print("Sending artifact to Artifactory NPM registry URL: " + url)
        subprocess.call(["npm", "config", "set", "registry", url])
        req = subprocess.call(["npm", "publish", "--registry", url])
        print("Return code from npm publish: " + str(req))
        if req != 0:
            err_msg = "npm ERR! Received non-OK response while sending response to Artifactory. Return code from npm publish: " + str(req)
            signal_failure(jobId, err_msg)
        else:
            signal_success(jobId)
    except requests.exceptions.RequestException as e:
        print("Received an error when trying to commit artifact %s to repository %s: %s" % (str(art_type), str(configuration['RepoKey']), str(e)))
        raise
    return(req, npmconfigfile)
If the return value from publishing to the repository is not 0, the worker signals a failure to CodePipeline. If the value is 0, the worker signals success to CodePipeline to indicate that the stage of the pipeline has been completed successfully.
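The signal_success and signal_failure helpers referenced above map to the CodePipeline job result APIs; a minimal sketch (illustrative, not necessarily the exact worker code) looks like this:

def signal_success(jobId):
    # Tell CodePipeline the custom action finished successfully.
    codepipeline.put_job_success_result(jobId=jobId)

def signal_failure(jobId, message):
    # Tell CodePipeline the custom action failed, with a short reason.
    codepipeline.put_job_failure_result(
        jobId=jobId,
        failureDetails={'type': 'JobFailed', 'message': message}
    )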
For the custom worker code, see npm_job_worker.py in the GitHub repository.
I run my custom worker on an EC2 instance using the command python npm_job_worker.py, with an optional --version flag that can be used to specify worker versions other than 1. Then I trigger a release change in my pipeline:
From my custom worker output logs, I have just committed a package named node_example at version 1.0.3:
On artifact: index.js
Committing to the repo: https://artifactory.myexamplehost.com/artifactory/api/npm/npm
Sending artifact to Artifactory URL: https://artifactoryhost.myexamplehost.com/artifactory/api/npm/npm
npm config: 0
npm http PUT https://artifactory.myexamplehost.com/artifactory/api/npm/npm/node_example
npm http 201 https://artifactory.myexamplehost.com/artifactory/api/npm/npm/node_example
+ node_example@1.0.3
Return code from npm publish: 0
Signaling success to CodePipeline
After that has been built successfully, I can find my artifact in my Artifactory repository:
To help you automate this process, I have created this AWS CloudFormation template that automates the creation of the CodeBuild project, the custom action, and the CodePipeline pipeline. It also launches the Amazon EC2-based custom job worker in an AWS Auto Scaling group. This template requires you to have a VPC and CodeCommit repository for your Node.js project. If you do not currently have a VPC in which you want to run your custom worker EC2 instances, you can use this AWS QuickStart to create one. If you do not have an existing Node.js project, I’ve provided a sample project in the GitHub repository.
Conclusion
I’ve shown you the steps to integrate your JFrog Artifactory repository with your CodePipeline workflow: how to create a custom action in CodePipeline and how to create a custom worker that works in your CI/CD pipeline. To dig deeper into custom actions and see how you can integrate your Artifactory repositories into your AWS CodePipeline projects, check out the full code base on GitHub.
If you have any questions or feedback, feel free to reach out to us through the AWS CodePipeline forum.
Erin McGill is a Solutions Architect in the AWS Partner Program with a focus on DevOps and automation tooling.
Slack is widely used by DevOps and development teams to communicate status. Typically, when a build has been tested and is ready to be promoted to a staging environment, a QA engineer or DevOps engineer kicks off the deployment. Using Slack in a ChatOps collaboration model, the promotion can be done in a single click from a Slack channel. And because the promotion happens through a Slack channel, the whole development team knows what’s happening without checking email.
In this blog post, I will show you how to integrate AWS services with a Slack application. I use an interactive message button and incoming webhook to promote a stage with a single click.
To follow along with the steps in this post, you’ll need a pipeline in AWS CodePipeline. If you don’t have a pipeline, the fastest way to create one for this use case is to use AWS CodeStar. Go to the AWS CodeStar console and select the Static Website template (shown in the screenshot). AWS CodeStar will create a pipeline with an AWS CodeCommit repository and an AWS CodeDeploy deployment for you. After the pipeline is created, you will need to add a manual approval stage.
You’ll also need to build a Slack app with webhooks and interactive components, write two Lambda functions, and create an API Gateway API and a SNS topic.
As you’ll see in the following diagram, when I make a change and merge a new feature into the master branch in AWS CodeCommit, the check-in kicks off my CI/CD pipeline in AWS CodePipeline. When CodePipeline reaches the approval stage, it sends a notification to Amazon SNS, which triggers an AWS Lambda function (ApprovalRequester).
The Slack channel receives a prompt that looks like the following screenshot. When I click Yes to approve the build promotion, the approval result is sent to CodePipeline through API Gateway and Lambda (ApprovalHandler). The pipeline continues on to deploy the build to the next environment.
Create a Slack app
For App Name, type a name for your app. For Development Slack Workspace, choose the name of your workspace. You’ll see in the following screenshot that my workspace is AWS ChatOps.
After the Slack application has been created, you will see the Basic Information page, where you can create incoming webhooks and enable interactive components.
To add incoming webhooks:
Under Add features and functionality, choose Incoming Webhooks. Turn the feature on by selecting Off, as shown in the following screenshot.
Now that the feature is turned on, choose Add New Webhook to Workspace. In the process of creating the webhook, Slack lets you choose the channel where messages will be posted.
After the webhook has been created, you’ll see its URL. You will use this URL when you create the Lambda function.
If you followed the steps in the post, the pipeline should look like the following.
Write the Lambda function for approval requests
This Lambda function is invoked by the SNS notification. It sends a request that consists of an interactive message button to the incoming webhook you created earlier. The following sample code sends the request to the incoming webhook. SLACK_WEBHOOK_URL and SLACK_CHANNEL are the environment variables that hold the URL of the webhook that you created and the Slack channel where you want the interactive message button to appear.
# This function is invoked via SNS when the CodePipeline manual approval action starts.
# It will take the details from this approval notification and send an interactive message to Slack that allows users to approve or cancel the deployment.
import os
import json
import logging
import urllib.parse
from base64 import b64decode
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError
# This is passed as a plain-text environment variable for ease of demonstration.
# Consider encrypting the value with KMS or use an encrypted parameter in Parameter Store for production deployments.
SLACK_WEBHOOK_URL = os.environ['SLACK_WEBHOOK_URL']
SLACK_CHANNEL = os.environ['SLACK_CHANNEL']
logger = logging.getLogger()
logger.setLevel(logging.INFO)
def lambda_handler(event, context):
    print("Received event: " + json.dumps(event, indent=2))
    message = event["Records"][0]["Sns"]["Message"]

    data = json.loads(message)
    token = data["approval"]["token"]
    codepipeline_name = data["approval"]["pipelineName"]

    slack_message = {
        "channel": SLACK_CHANNEL,
        "text": "Would you like to promote the build to production?",
        "attachments": [
            {
                "text": "Yes to deploy your build to production",
                "fallback": "You are unable to promote a build",
                "callback_id": "wopr_game",
                "color": "#3AA3E3",
                "attachment_type": "default",
                "actions": [
                    {
                        "name": "deployment",
                        "text": "Yes",
                        "style": "danger",
                        "type": "button",
                        "value": json.dumps({"approve": True, "codePipelineToken": token, "codePipelineName": codepipeline_name}),
                        "confirm": {
                            "title": "Are you sure?",
                            "text": "This will deploy the build to production",
                            "ok_text": "Yes",
                            "dismiss_text": "No"
                        }
                    },
                    {
                        "name": "deployment",
                        "text": "No",
                        "type": "button",
                        "value": json.dumps({"approve": False, "codePipelineToken": token, "codePipelineName": codepipeline_name})
                    }
                ]
            }
        ]
    }

    req = Request(SLACK_WEBHOOK_URL, json.dumps(slack_message).encode('utf-8'))

    response = urlopen(req)
    response.read()

    return None
Create a SNS topic
Create a topic and then create a subscription that invokes the ApprovalRequester Lambda function. You can configure the manual approval action in the pipeline to send a message to this SNS topic when an approval action is required. When the pipeline reaches the approval stage, it sends a notification to this SNS topic. SNS publishes a notification to all of the subscribed endpoints. In this case, the Lambda function is the endpoint. Therefore, it invokes and executes the Lambda function. For information about how to create a SNS topic, see Create a Topic in the Amazon SNS Developer Guide.
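For reference, a rough boto3 sketch of that wiring (the topic name, function name, and ARNs are placeholders; the console steps in the guide accomplish the same thing):

import boto3

sns = boto3.client('sns')
lambda_client = boto3.client('lambda')

# Create the topic that the manual approval action will publish to.
topic = sns.create_topic(Name='codepipeline-approvals')  # placeholder name

# Subscribe the ApprovalRequester Lambda function to the topic.
sns.subscribe(
    TopicArn=topic['TopicArn'],
    Protocol='lambda',
    Endpoint='arn:aws:lambda:us-east-1:123456789012:function:ApprovalRequester'  # placeholder ARN
)

# Allow SNS to invoke the function.
lambda_client.add_permission(
    FunctionName='ApprovalRequester',
    StatementId='AllowSNSInvoke',
    Action='lambda:InvokeFunction',
    Principal='sns.amazonaws.com',
    SourceArn=topic['TopicArn']
)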
Write the Lambda function for handling the interactive message button
This Lambda function is invoked by API Gateway. It receives the result of the interactive message button whether or not the build promotion was approved. If approved, an API call is made to CodePipeline to promote the build to the next environment. If not approved, the pipeline stops and does not move to the next stage.
The Lambda function code might look like the following. SLACK_VERIFICATION_TOKEN is the environment variable that contains your Slack verification token. You can find the verification token on your Slack app's management page: under Basic Information, scroll down to App Credentials, and the verification token is listed there.
# This function is triggered via API Gateway when a user acts on the Slack interactive message sent by approval_requester.py.
from urllib.parse import parse_qs
import json
import os
import boto3

SLACK_VERIFICATION_TOKEN = os.environ['SLACK_VERIFICATION_TOKEN']

# Triggered by API Gateway.
# It submits the user's choice to the CodePipeline manual approval action.
def lambda_handler(event, context):
    # print("Received event: " + json.dumps(event, indent=2))
    body = parse_qs(event['body'])
    payload = json.loads(body['payload'][0])

    # Validate Slack token
    if SLACK_VERIFICATION_TOKEN == payload['token']:
        send_slack_message(json.loads(payload['actions'][0]['value']))

        # This will replace the interactive message with a simple text response.
        # You can implement a more complex message update if you would like.
        return {
            "isBase64Encoded": "false",
            "statusCode": 200,
            "body": "{\"text\": \"The approval has been processed\"}"
        }
    else:
        return {
            "isBase64Encoded": "false",
            "statusCode": 403,
            "body": "{\"error\": \"This request does not include a valid verification token.\"}"
        }

def send_slack_message(action_details):
    # Report the approval result back to the CodePipeline manual approval action.
    codepipeline_status = "Approved" if action_details["approve"] else "Rejected"
    codepipeline_name = action_details["codePipelineName"]
    token = action_details["codePipelineToken"]

    client = boto3.client('codepipeline')
    response_approval = client.put_approval_result(
        pipelineName=codepipeline_name,
        stageName='Approval',
        actionName='ApprovalOrDeny',
        result={'summary': '', 'status': codepipeline_status},
        token=token)
    print(response_approval)
Create the API Gateway API
In the Amazon API Gateway console, create a resource called InteractiveMessageHandler.
Create a POST method.
For Integration type, choose Lambda Function.
Select Use Lambda Proxy integration.
From Lambda Region, choose a region.
In Lambda Function, type a name for your function.
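If you prefer to script these console steps, a minimal boto3 sketch might look like the following; the API name, stage, region, and function ARN are placeholders.

import boto3

region = 'us-east-1'  # placeholder region
lambda_arn = 'arn:aws:lambda:us-east-1:123456789012:function:InteractiveMessageHandler'  # placeholder

apigw = boto3.client('apigateway')

# Create the REST API and find its root resource.
api = apigw.create_rest_api(name='SlackApprovalHandlerApi')
root_id = next(item['id'] for item in apigw.get_resources(restApiId=api['id'])['items']
               if item['path'] == '/')

# Create the InteractiveMessageHandler resource with a POST method that proxies to Lambda.
resource = apigw.create_resource(restApiId=api['id'], parentId=root_id,
                                 pathPart='InteractiveMessageHandler')
apigw.put_method(restApiId=api['id'], resourceId=resource['id'],
                 httpMethod='POST', authorizationType='NONE')
apigw.put_integration(
    restApiId=api['id'], resourceId=resource['id'], httpMethod='POST',
    type='AWS_PROXY', integrationHttpMethod='POST',
    uri=('arn:aws:apigateway:%s:lambda:path/2015-03-31/functions/%s/invocations'
         % (region, lambda_arn)))

# Deploy the API and allow it to invoke the Lambda function.
apigw.create_deployment(restApiId=api['id'], stageName='prod')
boto3.client('lambda').add_permission(
    FunctionName='InteractiveMessageHandler',
    StatementId='AllowAPIGatewayInvoke',
    Action='lambda:InvokeFunction',
    Principal='apigateway.amazonaws.com')

The resulting invoke URL (https://<api id>.execute-api.<region>.amazonaws.com/prod/InteractiveMessageHandler) is the request URL that you give to Slack in the next step.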
Now go back to your Slack application and enable interactive components.
To enable interactive components for the interactive message (Yes) button:
Under Features, choose Interactive Components.
Choose Enable Interactive Components.
Type a request URL in the text box. Use the invoke URL in Amazon API Gateway that will be called when the approval button is clicked.
Now that all the pieces have been created, run the solution by checking in a code change to your CodeCommit repo. That releases the change through CodePipeline. When the pipeline reaches the approval stage, it sends a prompt to your Slack channel asking whether you want to promote the build to your staging or production environment. Choose Yes and then check that your change was deployed to the environment.
Conclusion
That is it! You have now created a Slack ChatOps solution using AWS CodeCommit, AWS CodePipeline, AWS Lambda, Amazon API Gateway, and Amazon Simple Notification Service.
Now that you know how to do this Slack and CodePipeline integration, you can use the same method to interact with other AWS services using API Gateway and Lambda. You can also use Slack’s slash command to initiate an action from a Slack channel, rather than responding in the way demonstrated in this post.
The bcachefs filesystem has been under development for a number of years now; according to lead developer Kent Overstreet, it is time to start talking about getting the code upstream. He came to the 2018 Linux Storage, Filesystem, and Memory-Management Summit (LSFMM) to discuss that in a combined filesystem and storage session. Bcachefs grew out of bcache, which is a block layer cache that was merged into Linux 3.10 in mid-2013.
PyCon UK 2018 will take place on Saturday 15 September to Wednesday 19 September in the splendid Cardiff City Hall, just a few miles from the Sony Technology Centre where the vast majority of Raspberry Pis are made. We’re pleased to announce that we’re curating this year’s Education Summit at the conference, where we’ll offer opportunities for young people to learn programming skills, and for educators to undertake professional development!
PyCon UK 2018 is your chance to be welcomed into the wonderful Python community. At the Education Summit, we’ll put on a young coders’ day on the Saturday, and an educators’ day on the Sunday.
Saturday — young coders’ day
On Saturday we’ll be running a CoderDojo full of workshops on Raspberry Pi and micro:bits for young people aged 7 to 17. If they wish, participants will get to make a project and present it to the conference on the main stage, and everyone will be given a free micro:bit to take home!
PyCon UK has been bringing developers and educators together ever since it first started its education track in 2011. This year’s Sunday will be a day of professional development: we’ll give teachers, educators, parents, and coding club leaders the chance to learn from us and from each other to build their programming, computing, and digital making skills.
Professional development for educators
Educators get a special entrance rate for the conference, starting at £48 — get your tickets now. Financial assistance is also available.
Call for proposals
We invite you to send in your proposal for a talk and workshop at the Education Summit! We’re looking for:
25-minute talks for the educators’ day
50-minute workshops for either the young coders’ or the educators’ day
If you have something you’d like to share, such as a professional development session for educators, advice on best practice for teaching programming, a workshop for up-skilling in Python, or a fun physical computing activity for the CoderDojo, then we’d love to hear about it! Please submit your proposal by 15 June.
After the Education Summit, the conference will continue for two days of talks and a final day of development sprints. Feel free to submit your education-related talk to the main conference too if you want to share it with a wider audience! Check out the PyCon UK 2018 website for more information.
Earlier this year on 3 and 4 March, communities around the world held Raspberry Jam events to celebrate Raspberry Pi’s sixth birthday. We sent out special birthday kits to participating Jams — it was amazing to know the kits would end up in the hands of people in parts of the world very far from Raspberry Pi HQ in Cambridge, UK.
The Raspberry Jam Camer team: Damien Doumer, Eyong Etta, Loïc Dessap and Lionel Sichom, aka Lionel Tellem
Preparing for the #PiParty
One birthday kit went to Yaoundé, the capital of Cameroon. There, a team of four students in their twenties — Lionel Sichom (aka Lionel Tellem), Eyong Etta, Loïc Dessap, and Damien Doumer — were organising Yaoundé’s first Jam, called Raspberry Jam Camer, as part of the Raspberry Jam Big Birthday Weekend. The team knew one another through their shared interests and skills in electronics, robotics, and programming. Damien explains in his blog post about the Jam that they planned ahead for several activities for the Jam based on their own projects, so they could be confident of having a few things that would definitely be successful for attendees to do and see.
Show-and-tell at Raspberry Jam Cameroon
Loïc presented a Raspberry Pi–based, Android app–controlled robot arm that he had built, and Lionel coded a small video game using Scratch on Raspberry Pi while the audience watched. Damien demonstrated the possibilities of Windows 10 IoT Core on Raspberry Pi, showing how to install it, how to use it remotely, and what you can do with it, including building a simple application.
Loïc showcases the prototype robot arm he built
There was lots more too, with others discussing their own Pi projects and talking about the possibilities Raspberry Pi offers, including a Pi-controlled drone and car. Cake was a prevailing theme of the Raspberry Jam Big Birthday Weekend around the world, and Raspberry Jam Camer made sure they didn’t miss out.
Yay, birthday cake!!
A big success
Most visitors to the Jam were secondary school students, while others were university students and graduates. The majority were unfamiliar with Raspberry Pi, but all wanted to learn about Raspberry Pi and what they could do with it. Damien comments that the fact most people were new to Raspberry Pi made the event more interactive rather than creating any challenges, because the visitors were all interested in finding out about the little computer. The Jam was an all-round success, and the team was pleased with how it went:
What I liked the most was that we sensitized several people about the Raspberry Pi and what one can be capable of with such a small but powerful device. — Damien Doumer
The Jam team rounded off the event by announcing that this was the start of a Raspberry Pi community in Yaoundé. They hope that they and others will be able to organise more Jams and similar events in the area to spread the word about what people can do with Raspberry Pi, and to help them realise their ideas.
Raspberry Jam Camer gets the thumbs-up
The Raspberry Pi community in Cameroon
In a French-language interview about their Jam, the team behind Raspberry Jam Camer said they’d like programming to become the third official language of Cameroon, after French and English; their aim is to popularise programming and digital making across Cameroonian society. Neither of these fields is very familiar to most people in Cameroon, but both are very well aligned with the country’s ambitions for development. The team is conscious of the difficulties around the emergence of information and communication technologies in the Cameroonian context; in response, they are seizing the opportunities Raspberry Pi offers to give children and young people access to modern and constantly evolving technology at low cost.
Thanks to Lionel, Eyong, Damien, and Loïc, and to everyone who helped put on a Jam for the Big Birthday Weekend! Remember, anyone can start a Jam at any time — and we provide plenty of resources to get you started. Check out the Guidebook, the Jam branding pack, our specially-made Jam activities online (in multiple languages), printable worksheets, and more.
Regardless of your career path, there’s no denying that attending industry events can provide helpful career development opportunities, not only for improving and expanding your skill sets, but also for networking. According to this article from PayScale.com, experts estimate that somewhere between 70% and 85% of new positions are landed through networking.
If you want to network with cloud computing professionals who are tackling some of today’s most innovative and exciting big data solutions, the big data-focused sessions at an AWS Global Summit are a great place to start.
AWS Global Summits are free events that bring the cloud computing community together to connect, collaborate, and learn about AWS. As the name suggests, these summits are held in major cities around the world, and attract technologists from all industries and skill levels who’re interested in hearing from AWS leaders, experts, partners, and customers.
In addition to networking opportunities with top cloud technology providers, consultants and your peers in our Partner and Solutions Expo, you’ll also hone your AWS skills by attending and participating in a multitude of education and training opportunities.
Here’s a brief sampling of some of the upcoming sessions relevant to big data professionals:
Be sure to check out the main page for AWS Global Summits, where you can see which cities have AWS Summits planned for 2018, register to attend an upcoming event, or provide your information to be notified when registration opens for a future event.
In April, LWN looked at the new API for zero-copy reception of TCP data that had been merged into the net-next tree for the 4.18 development cycle. After that article was written, a couple of issues came to the fore that required some changes to the API for this feature. Those changes have been made and merged; read on for the details.
At KubeCon + CloudNativeCon Europe 2018, several talks explored the topic of container isolation and security. The last year saw the release of Kata Containers which, combined with the CRI-O project, provided strong isolation guarantees for containers using a hypervisor. During the conference, Google released its own hypervisor called gVisor, adding yet another possible solution for this problem. Those new developments prompted the community to work on integrating the concept of “secure containers” (or “sandboxed containers”) deeper into Kubernetes. This work is now coming to fruition; it prompts us to look again at how Kubernetes tries to keep the bad guys from wreaking havoc once they break into a container.
Earlier this spring, an excited group of STEM educators came together to participate in the first ever Raspberry Pi and Arduino workshop in Puerto Rico.
Their three-day digital making adventure was led by MakerTechPR’s José Rullán and Raspberry Pi Certified Educator Alex Martínez. They ran the event as part of the Robot Makers challenge organized by Yees! and sponsored by Puerto Rico’s Department of Economic Development and Trade to promote entrepreneurial skills within Puerto Rico’s education system.
Over 30 educators attended the workshop, which covered the use of the Raspberry Pi 3 as a computer and digital making resource. The educators received a kit consisting of a Raspberry Pi 3 with an Explorer HAT Pro and an Arduino Uno. At the end of the workshop, the educators were able to keep the kit as a demonstration unit for their classrooms. They were enthusiastic to learn new concepts and immerse themselves in the world of physical computing.
In their first session, the educators were introduced to the Raspberry Pi as an affordable technology for robotic clubs. In their second session, they explored physical computing and the coding languages needed to control the Explorer HAT Pro. They started off coding with Scratch, with which some educators had experience, and ended with controlling the GPIO pins with Python. In the final session, they learned how to develop applications using the powerful combination of Arduino and Raspberry Pi for robotics projects. This gave them a better understanding of how they could engage their students in physical computing.
“The Raspberry Pi ecosystem is the perfect solution in the classroom because to us it is very resourceful and accessible.” – Alex Martínez
Computer science and robotics courses are important for many schools and teachers in Puerto Rico. The simple idea of programming a microcontroller from a $35 computer increases the chances of more students having access to more technology to create things.
Puerto Rico’s education system has faced enormous challenges after Hurricane Maria, including economic collapse and the government’s closure of many schools due to the exodus of families from the island. By attending training like this workshop, educators in Puerto Rico are becoming more experienced in fields like robotics in particular, which are key for 21st-century skills and learning. This, in turn, can lead to more educational opportunities, and hopefully the reopening of more schools on the island.
“We find it imperative that our children be taught STEM disciplines and skills. Our goal is to continue this work of spreading digital making and computer science using the Raspberry Pi around Puerto Rico. We want our children to have the best education possible.” – Alex Martínez
After attending Picademy in 2016, Alex has integrated the Raspberry Pi Foundation’s online resources into his classroom. He has also taught small workshops around the island and in the local Puerto Rican makerspace community. José is an electrical engineer, entrepreneur, educator and hobbyist who enjoys learning to use technology and sharing his knowledge through projects and challenges.
Amazon Kinesis Data Firehose is the easiest way to capture and stream data into a data lake built on Amazon S3. This data can be anything—from AWS service logs like AWS CloudTrail log files, Amazon VPC Flow Logs, Application Load Balancer logs, and others. It can also be IoT events, game events, and much more. To efficiently query this data, a time-consuming ETL (extract, transform, and load) process is required to massage and convert the data to an optimal file format, which increases the time to insight. This situation is less than ideal, especially for real-time data that loses its value over time.
To solve this common challenge, Kinesis Data Firehose can now save data to Amazon S3 in Apache Parquet or Apache ORC format. These are optimized columnar formats that are highly recommended for best performance and cost-savings when querying data in S3. This feature directly benefits you if you use Amazon Athena, Amazon Redshift, AWS Glue, Amazon EMR, or any other big data tools that are available from the AWS Partner Network and through the open-source community.
Amazon Connect is a simple-to-use, cloud-based contact center service that makes it easy for any business to provide a great customer experience at a lower cost than common alternatives. Its open platform design enables easy integration with other systems. One of those systems is Amazon Kinesis—in particular, Kinesis Data Streams and Kinesis Data Firehose.
What’s really exciting is that you can now save events from Amazon Connect to S3 in Apache Parquet format. You can then perform analytics using Amazon Athena and Amazon Redshift Spectrum in real time, taking advantage of this key performance and cost optimization. Of course, Amazon Connect is only one example. This new capability opens the door for a great deal of opportunity, especially as organizations continue to build their data lakes.
Amazon Connect includes an array of analytics views in the Administrator dashboard. But you might want to run other types of analysis. In this post, I describe how to set up a data stream from Amazon Connect through Kinesis Data Streams and Kinesis Data Firehose and out to S3, and then perform analytics using Athena and Amazon Redshift Spectrum. I focus primarily on the Kinesis Data Firehose support for Parquet and its integration with the AWS Glue Data Catalog, Amazon Athena, and Amazon Redshift.
Solution overview
Here is how the solution is laid out:
1. Define the schema.
2. Define the Kinesis data streams.
3. Define the Kinesis Data Firehose delivery stream for Parquet.
4. Set up the Amazon Connect contact center.
5. Analyze the data with Athena.
6. Analyze the data with Amazon Redshift Spectrum.
The following sections walk you through each of these steps to set up the pipeline.
1. Define the schema
When Kinesis Data Firehose processes incoming events and converts the data to Parquet, it needs to know which schema to apply. The reason is that incoming events often contain only some of the expected fields, depending on which values the producers emit. A typical process is to normalize the schema during a batch ETL job so that you end up with a consistent schema that can easily be understood and queried. Doing this introduces latency due to the nature of the batch process. To overcome this issue, Kinesis Data Firehose requires the schema to be defined in advance.
To see the available columns and structures, see Amazon Connect Agent Event Streams. For the purpose of simplicity, I opted to make all the columns of type String rather than create the nested structures. But you can definitely do that if you want.
The simplest way to define the schema is to create a table in the Amazon Athena console. Open the Athena console, and paste the following create table statement, substituting your own S3 bucket and prefix for where your event data will be stored. A Data Catalog database is a logical container that holds the different tables that you can create. The default database name shown here should already exist. If it doesn’t, you can create it or use another database that you’ve already created.
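A minimal sketch of that DDL, assuming a table named kfhconnectblog with string columns for the fields queried later in this post and placeholder S3 locations, might look like this. Here it is submitted through the Athena API, but you can equally paste just the CREATE TABLE statement into the console query editor.

import boto3

athena = boto3.client('athena')

# All columns are kept as strings for simplicity; add any other agent event
# fields you want to keep. Replace the bucket names with your own.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS default.kfhconnectblog (
  eventid string,
  eventtimestamp string,
  eventtype string,
  currentagentsnapshot string,
  previousagentsnapshot string
)
STORED AS PARQUET
LOCATION 's3://your-bucket/kfhconnectblog/'
"""

athena.start_query_execution(
    QueryString=ddl,
    QueryExecutionContext={'Database': 'default'},
    ResultConfiguration={'OutputLocation': 's3://your-bucket/athena-results/'},
)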
That’s all you have to do to prepare the schema for Kinesis Data Firehose.
2. Define the data streams
Next, you need to define the Kinesis data streams that will be used to stream the Amazon Connect events. Open the Kinesis Data Streams console and create two streams. You can configure them with only one shard each because you don’t have a lot of data right now.
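If you would rather script this, the two streams can be created with a couple of boto3 calls; the stream names below are placeholders.

import boto3

kinesis = boto3.client('kinesis')

# One stream for contact trace records and one for agent events; one shard each is enough for this walkthrough.
for name in ('connect-ctr-stream', 'connect-agent-event-stream'):
    kinesis.create_stream(StreamName=name, ShardCount=1)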
3. Define the Kinesis Data Firehose delivery stream for Parquet
Let’s configure the Data Firehose delivery stream using the data stream as the source and Amazon S3 as the output. Start by opening the Kinesis Data Firehose console and creating a new data delivery stream. Give it a name, and associate it with the Kinesis data stream that you created in Step 2.
As shown in the following screenshot, enable Record format conversion (1) and choose Apache Parquet (2). As you can see, Apache ORC is also supported. Scroll down and provide the AWS Glue Data Catalog database name (3) and table names (4) that you created in Step 1. Choose Next.
To make things easier, the output S3 bucket and prefix fields are automatically populated using the values that you defined in the LOCATION parameter of the create table statement from Step 1. Pretty cool. Additionally, you have the option to save the raw events into another location as defined in the Source record S3 backup section. Don’t forget to add a trailing forward slash (“/”) so that Data Firehose creates the date partitions inside that prefix.
On the next page, in the S3 buffer conditions section, there is a note about configuring a large buffer size. The Parquet file format is highly efficient in how it stores and compresses data. Increasing the buffer size allows you to pack more rows into each output file, which is preferred and gives you the most benefit from Parquet.
Compression using Snappy is automatically enabled for both Parquet and ORC. You can modify the compression algorithm by using the Kinesis Data Firehose API and update the OutputFormatConfiguration.
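For example, switching the Parquet compression from Snappy to GZIP might look like the following sketch; the delivery stream name is a placeholder, and the version and destination IDs come from describe_delivery_stream.

import boto3

firehose = boto3.client('firehose')

stream_name = 'connect-agent-events-to-s3'  # placeholder delivery stream name
description = firehose.describe_delivery_stream(
    DeliveryStreamName=stream_name)['DeliveryStreamDescription']

# Update only the output serializer so that Parquet files are written with GZIP compression.
firehose.update_destination(
    DeliveryStreamName=stream_name,
    CurrentDeliveryStreamVersionId=description['VersionId'],
    DestinationId=description['Destinations'][0]['DestinationId'],
    ExtendedS3DestinationUpdate={
        'DataFormatConversionConfiguration': {
            'OutputFormatConfiguration': {
                'Serializer': {
                    'ParquetSerDe': {'Compression': 'GZIP'}
                }
            }
        }
    },
)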
Be sure to also enable Amazon CloudWatch Logs so that you can debug any issues that you might run into.
Lastly, finalize the creation of the Firehose delivery stream, and continue on to the next section.
4. Set up the Amazon Connect contact center
After setting up the Kinesis pipeline, you now need to set up a simple contact center in Amazon Connect. The Getting Started page provides clear instructions on how to set up your environment, acquire a phone number, and create an agent to accept calls.
After setting up the contact center, in the Amazon Connect console, choose your Instance Alias, and then choose Data Streaming. Under Agent Event, choose the Kinesis data stream that you created in Step 2, and then choose Save.
At this point, your pipeline is complete. Agent events from Amazon Connect are generated as agents go about their day. Events are sent via Kinesis Data Streams to Kinesis Data Firehose, which converts the event data from JSON to Parquet and stores it in S3. Athena and Amazon Redshift Spectrum can simply query the data without any additional work.
So let’s generate some data. Go back into the Administrator console for your Amazon Connect contact center, and create an agent to handle incoming calls. In this example, I creatively named mine Agent One. After the agent is created, Agent One can get to work, log in to their console, and set their availability to Available so that they are ready to receive calls.
To make the data a bit more interesting, I also created a second agent, Agent Two. I then made some incoming and outgoing calls and caused some failures to occur, so I now have enough data available to analyze.
5. Analyze the data with Athena
Let’s open the Athena console and run some queries. One thing you’ll notice is that when we created the schema for the dataset, we defined some of the fields as Strings even though in the documentation they were complex structures. The reason for doing that was simply to show some of Athena’s flexibility in parsing JSON data. However, you can define nested structures in your table schema so that Kinesis Data Firehose applies the appropriate schema to the Parquet file.
Let’s run the first query to see which agents have logged into the system.
The query might look complex, but it’s fairly straightforward:
WITH dataset AS (
SELECT
from_iso8601_timestamp(eventtimestamp) AS event_ts,
eventtype,
-- CURRENT STATE
json_extract_scalar(
currentagentsnapshot,
'$.agentstatus.name') AS current_status,
from_iso8601_timestamp(
json_extract_scalar(
currentagentsnapshot,
'$.agentstatus.starttimestamp')) AS current_starttimestamp,
json_extract_scalar(
currentagentsnapshot,
'$.configuration.firstname') AS current_firstname,
json_extract_scalar(
currentagentsnapshot,
'$.configuration.lastname') AS current_lastname,
json_extract_scalar(
currentagentsnapshot,
'$.configuration.username') AS current_username,
json_extract_scalar(
currentagentsnapshot,
'$.configuration.routingprofile.defaultoutboundqueue.name') AS current_outboundqueue,
json_extract_scalar(
currentagentsnapshot,
'$.configuration.routingprofile.inboundqueues[0].name') as current_inboundqueue,
-- PREVIOUS STATE
json_extract_scalar(
previousagentsnapshot,
'$.agentstatus.name') as prev_status,
from_iso8601_timestamp(
json_extract_scalar(
previousagentsnapshot,
'$.agentstatus.starttimestamp')) as prev_starttimestamp,
json_extract_scalar(
previousagentsnapshot,
'$.configuration.firstname') as prev_firstname,
json_extract_scalar(
previousagentsnapshot,
'$.configuration.lastname') as prev_lastname,
json_extract_scalar(
previousagentsnapshot,
'$.configuration.username') as prev_username,
json_extract_scalar(
previousagentsnapshot,
'$.configuration.routingprofile.defaultoutboundqueue.name') as prev_outboundqueue,
json_extract_scalar(
previousagentsnapshot,
'$.configuration.routingprofile.inboundqueues[0].name') as prev_inboundqueue
from kfhconnectblog
where eventtype <> 'HEART_BEAT'
)
SELECT
current_status as status,
current_username as username,
event_ts
FROM dataset
WHERE eventtype = 'LOGIN' AND current_username <> ''
ORDER BY event_ts DESC
The query output looks something like this:
Here is another query that shows the sessions each of the agents engaged with. It tells us whether they were incoming or outgoing, whether they were completed, and where there were missed or failed calls.
WITH src AS (
SELECT
eventid,
json_extract_scalar(currentagentsnapshot, '$.configuration.username') as username,
cast(json_extract(currentagentsnapshot, '$.contacts') AS ARRAY(JSON)) as c,
cast(json_extract(previousagentsnapshot, '$.contacts') AS ARRAY(JSON)) as p
from kfhconnectblog
),
src2 AS (
SELECT *
FROM src CROSS JOIN UNNEST (c, p) AS contacts(c_item, p_item)
),
dataset AS (
SELECT
eventid,
username,
json_extract_scalar(c_item, '$.contactid') as c_contactid,
json_extract_scalar(c_item, '$.channel') as c_channel,
json_extract_scalar(c_item, '$.initiationmethod') as c_direction,
json_extract_scalar(c_item, '$.queue.name') as c_queue,
json_extract_scalar(c_item, '$.state') as c_state,
from_iso8601_timestamp(json_extract_scalar(c_item, '$.statestarttimestamp')) as c_ts,
json_extract_scalar(p_item, '$.contactid') as p_contactid,
json_extract_scalar(p_item, '$.channel') as p_channel,
json_extract_scalar(p_item, '$.initiationmethod') as p_direction,
json_extract_scalar(p_item, '$.queue.name') as p_queue,
json_extract_scalar(p_item, '$.state') as p_state,
from_iso8601_timestamp(json_extract_scalar(p_item, '$.statestarttimestamp')) as p_ts
FROM src2
)
SELECT
username,
c_channel as channel,
c_direction as direction,
p_state as prev_state,
c_state as current_state,
c_ts as current_ts,
c_contactid as id
FROM dataset
WHERE c_contactid = p_contactid
ORDER BY id DESC, current_ts ASC
The query output looks similar to the following:
6. Analyze the data with Amazon Redshift Spectrum
With Amazon Redshift Spectrum, you can query data directly in S3 using your existing Amazon Redshift data warehouse cluster. Because the data is already in Parquet format, Redshift Spectrum gets the same great benefits that Athena does.
Here is a simple query to show querying the same data from Amazon Redshift. Note that to do this, you need to first create an external schema in Amazon Redshift that points to the AWS Glue Data Catalog.
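That prerequisite step only needs to be done once. A minimal sketch, run here through the Amazon Redshift Data API with placeholder cluster, database, user, and IAM role values, might look like this:

import boto3

redshift_data = boto3.client('redshift-data')

# Create an external schema that maps to the 'default' database in the AWS Glue Data Catalog.
ddl = """
CREATE EXTERNAL SCHEMA IF NOT EXISTS default_schema
FROM DATA CATALOG
DATABASE 'default'
IAM_ROLE 'arn:aws:iam::123456789012:role/RedshiftSpectrumRole'
"""

redshift_data.execute_statement(
    ClusterIdentifier='my-redshift-cluster',  # placeholder cluster name
    Database='dev',
    DbUser='awsuser',
    Sql=ddl,
)

With the external schema in place, the query below runs directly against the Parquet data in S3.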
SELECT
eventtype,
json_extract_path_text(currentagentsnapshot,'agentstatus','name') AS current_status,
json_extract_path_text(currentagentsnapshot, 'configuration','firstname') AS current_firstname,
json_extract_path_text(currentagentsnapshot, 'configuration','lastname') AS current_lastname,
json_extract_path_text(
currentagentsnapshot,
'configuration','routingprofile','defaultoutboundqueue','name') AS current_outboundqueue
FROM default_schema.kfhconnectblog
The following shows the query output:
Summary
In this post, I showed you how to use Kinesis Data Firehose to ingest and convert data to columnar file format, enabling real-time analysis using Athena and Amazon Redshift. This great feature enables a level of optimization in both cost and performance that you need when storing and analyzing large amounts of data. This feature is equally important if you are investing in building data lakes on AWS.
Roy Hasson is a Global Business Development Manager for AWS Analytics. He works with customers around the globe to design solutions to meet their data processing, analytics and business intelligence needs. Roy is a big Manchester United fan who enjoys cheering his team on and hanging out with his family.
AWS Glue is a fully managed extract, transform, and load (ETL) service that makes it easier to prepare and load your data for analytics. You can create and run an ETL job with a few clicks on the AWS Management Console. Just point AWS Glue to your data store. AWS Glue discovers your data and stores the associated metadata (for example, a table definition and schema) in the AWS Glue Data Catalog.
AWS Glue has native connectors to data sources using JDBC drivers, either on AWS or elsewhere, as long as there is IP connectivity. In this post, we demonstrate how to connect to data sources that are not natively supported in AWS Glue today. We walk through connecting to and running ETL jobs against two such data sources, IBM DB2 and SAP Sybase. However, you can use the same process with any other JDBC-accessible database.
AWS Glue data sources
AWS Glue natively supports the following data stores by using the JDBC protocol:
One of the fastest growing architectures deployed on AWS is the data lake. The ETL processes that are used to ingest, clean, transform, and structure data are critically important for this architecture. Having the flexibility to interoperate with a broader range of database engines allows for a quicker adoption of the data lake architecture.
For data sources that AWS Glue doesn’t natively support, such as IBM DB2, Pivotal Greenplum, SAP Sybase, or any other relational database management system (RDBMS), you can import custom database connectors from Amazon S3 into AWS Glue jobs. In this case, the connection to the data source must be made from the AWS Glue script to extract the data, rather than using AWS Glue connections. To learn more, see Providing Your Own Custom Scripts in the AWS Glue Developer Guide.
Setting up an ETL job for an IBM DB2 data source
The first example demonstrates how to connect the AWS Glue ETL job to an IBM DB2 instance, transform the data from the source, and store it in Apache Parquet format in Amazon S3. To successfully create the ETL job using an external JDBC driver, you must define the following:
The S3 location of the job script
The S3 location of the temporary directory
The S3 location of the JDBC driver
The S3 location of the Parquet data (output)
The IAM role for the job
By default, AWS Glue suggests bucket names for the scripts and the temporary directory using the following format:
Keep in mind that having the AWS Glue job and S3 buckets in the same AWS Region helps save on cross-Region data transfer fees. For this post, we will work in the US East (Ohio) Region (us-east-2).
Creating the IAM role
The next step is to set up the IAM role that the ETL job will use:
Sign in to the AWS Management Console, and search for IAM:
On the IAM console, choose Roles in the left navigation pane.
Choose Create role. The role type of trusted entity must be an AWS service, specifically AWS Glue.
Choose Next: Permissions.
Search for the AWSGlueServiceRole policy, and select it.
Search again, now for the SecretsManagerReadWrite policy. This policy allows the AWS Glue job to access database credentials that are stored in AWS Secrets Manager.
CAUTION: This policy is open and is being used for testing purposes only. You should create a custom policy to narrow the access just to the secrets that you want to use in the ETL job.
Select this policy, and choose Next: Review.
Give your role a name, for example, GluePermissions, and confirm that both policies were selected.
Choose Create role.
Now that you have created the IAM role, it’s time to upload the JDBC driver to the defined location in Amazon S3. For this example, we will use the DB2 driver, which is available on the IBM Support site.
Storing database credentials
It is a best practice to store database credentials in a safe store. In this case, we use AWS Secrets Manager to securely store credentials. Follow these steps to create those credentials (a scripted equivalent is sketched after the steps):
Open the console, and search for Secrets Manager.
In the AWS Secrets Manager console, choose Store a new secret.
Under Select a secret type, choose Other type of secrets.
In the Secret key/value, set one row for each of the following parameters:
db_username
db_password
db_url (for example, jdbc:db2://10.10.12.12:50000/SAMPLE)
db_table
driver_name (com.ibm.db2.jcc.DB2Driver)
output_bucket (for example, aws-glue-data-output-1234567890-us-east-2/User)
Choose Next.
For Secret name, use DB2_Database_Connection_Info.
Choose Next.
Keep the Disable automatic rotation check box selected.
Choose Next.
Choose Store.
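The scripted equivalent of those steps, with placeholder values for the connection details, might look like the following:

import boto3
import json

secrets = boto3.client('secretsmanager')

# Store all connection details as a single JSON secret.
secrets.create_secret(
    Name='DB2_Database_Connection_Info',
    SecretString=json.dumps({
        'db_username': 'db2admin',                        # placeholder
        'db_password': 'REPLACE_ME',                      # placeholder
        'db_url': 'jdbc:db2://10.10.12.12:50000/SAMPLE',
        'db_table': 'EMPLOYEE',                           # placeholder table name
        'driver_name': 'com.ibm.db2.jcc.DB2Driver',
        'output_bucket': 'aws-glue-data-output-1234567890-us-east-2/User',
    }),
)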
Adding a job in AWS Glue
The next step is to author the AWS Glue job, following these steps:
In the AWS Management Console, search for AWS Glue.
In the navigation pane on the left, choose Jobs under the ETL section.
Choose Add job.
Fill in the basic Job properties:
Give the job a name (for example, db2-job).
Choose the IAM role that you created previously (GluePermissions).
For This job runs, choose A new script to be authored by you.
For ETL language, choose Python.
In the Script libraries and job parameters section, choose the location of your JDBC driver for Dependent jars path.
Choose Next.
On the Connections page, choose Next.
On the summary page, choose Save job and edit script. This creates the job and opens the script editor.
In the editor, replace the existing code with the ETL script. The key part of the script maps the fields in the source table to the destination, drops the null fields to save space in the Parquet destination, and finally writes to Amazon S3 in Parquet format.
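A minimal sketch of such a script, assuming the DB2_Database_Connection_Info secret created earlier holds the JDBC URL, table, credentials, driver class, and output bucket, might look like the following:

import sys
import json
import boto3
from awsglue.context import GlueContext
from awsglue.dynamicframe import DynamicFrame
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

args = getResolvedOptions(sys.argv, ['JOB_NAME'])

sc = SparkContext()
glue_context = GlueContext(sc)
spark = glue_context.spark_session
job = Job(glue_context)
job.init(args['JOB_NAME'], args)

# Read the connection details stored in AWS Secrets Manager.
secrets = boto3.client('secretsmanager')
secret = json.loads(
    secrets.get_secret_value(SecretId='DB2_Database_Connection_Info')['SecretString'])

# Load the source table over JDBC using the driver supplied as a dependent jar.
df = spark.read.format('jdbc') \
    .option('url', secret['db_url']) \
    .option('dbtable', secret['db_table']) \
    .option('user', secret['db_username']) \
    .option('password', secret['db_password']) \
    .option('driver', secret['driver_name']) \
    .load()

df.printSchema()
print(df.count())

# Convert to a DynamicFrame; null-valued fields are omitted per record, which saves space in Parquet.
# An ApplyMapping transform could be inserted here to rename or retype fields for the destination.
dyf = DynamicFrame.fromDF(df, glue_context, 'source_data')

# Write the result to Amazon S3 in Parquet format.
glue_context.write_dynamic_frame.from_options(
    frame=dyf,
    connection_type='s3',
    connection_options={'path': 's3://' + secret['output_bucket'] + '/db2-parquet/'},
    format='parquet')

job.commit()

The df.printSchema() and df.count() calls produce the output that you check in the job logs after the run.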
Choose the black X on the right side of the screen to close the editor.
Running the ETL job
Now that you have created the job, the next step is to execute it as follows:
On the Jobs page, select your new job. On the Action menu, choose Run job, and confirm that you want to run the job. Wait a few moments as it finishes the execution.
After the job shows as Succeeded, choose Logs to read the output of the job.
In the output of the job, you will find the schema printed by df.printSchema() and the row count returned by df.count().
Also, if you go to your output bucket in S3, you will find the Parquet result of the ETL job.
Using AWS Glue, you have created an ETL job that connects to an existing database using an external JDBC driver. It enables you to execute any transformation that you need.
Setting up an ETL job for an SAP Sybase data source
In this section, we describe how to create an AWS Glue ETL job against an SAP Sybase data source. The process mentioned in the previous section works for a Sybase data source with a few changes required in the job:
While creating the job, choose the correct jar for the JDBC dependency.
In the script, change the reference to the secret to be used from AWS Secrets Manager:
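For example, if the Sybase credentials were stored under a hypothetical secret name such as Sybase_Database_Connection_Info, the lookup in the script would become:

# Read the Sybase connection details instead of the DB2 ones.
secret = json.loads(
    secrets.get_secret_value(SecretId='Sybase_Database_Connection_Info')['SecretString'])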
After you successfully execute the new ETL job, the output contains the same type of information that was generated with the DB2 data source.
Note that each of these JDBC drivers has its own nuances and different licensing terms that you should be aware of before using them.
Maximizing JDBC read parallelism
Something to keep in mind while working with big data sources is the memory consumption. In some cases, “Out of Memory” errors are generated when all the data is read into a single executor. One approach to optimize this is to rely on the parallelism on read that you can implement with Apache Spark and AWS Glue. To learn more, see the Apache Spark SQL module.
You can use the following options:
partitionColumn: The name of an integer column that is used for partitioning.
lowerBound: The minimum value of partitionColumn that is used to decide partition stride.
upperBound: The maximum value of partitionColumn that is used to decide partition stride.
numPartitions: The number of partitions. This, along with lowerBound (inclusive) and upperBound (exclusive), forms partition strides for generated WHERE clause expressions that are used to split the partitionColumn. When unset, this defaults to SparkContext.defaultParallelism.
Those options specify the parallelism of the table read. lowerBound and upperBound decide the partition stride, but they don’t filter the rows in the table. Therefore, Spark partitions and returns all rows in the table. For example:
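A minimal example, assuming a hypothetical integer key column named ID and illustrative bounds and partition counts, might pass the options like this:

# Partitioned JDBC read: Spark issues one query per partition over the ID range.
df = spark.read.format('jdbc') \
    .option('url', secret['db_url']) \
    .option('dbtable', secret['db_table']) \
    .option('user', secret['db_username']) \
    .option('password', secret['db_password']) \
    .option('driver', secret['driver_name']) \
    .option('partitionColumn', 'ID') \
    .option('lowerBound', '1') \
    .option('upperBound', '100000') \
    .option('numPartitions', '10') \
    .load()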
It’s important to be careful with the number of partitions because too many partitions could also result in Spark crashing your external database systems.
Conclusion
Using the process described in this post, you can connect to and run AWS Glue ETL jobs against any data source that can be reached using a JDBC driver. This includes new generations of common analytical databases like Greenplum and others.
You can improve the query efficiency of these datasets by using partitioning and pushdown predicates. For more information, see Managing Partitions for ETL Output in AWS Glue. This technique opens the door to moving data and feeding data lakes in hybrid environments.
Kapil Shardha is a Technical Account Manager and supports enterprise customers with their AWS adoption. He has a background in infrastructure automation and DevOps.
William Torrealba is an AWS Solutions Architect supporting customers with their AWS adoption. He has a background in Application Development, Highly Available Distributed Systems, Automation, and DevOps.