Sending Push Notifications to iOS 13 Devices with Amazon SNS

Post Syndicated from Brent Meyer original

Note: This post was written by Alan Chuang, a Senior Product Manager on the AWS Messaging team.

On September 19, 2019, Apple released iOS 13. This update introduced changes to the Apple Push Notification service (APNs) that can impact your existing workloads. Amazon SNS has made some changes that ensure uninterrupted delivery of push notifications to iOS 13 devices.

iOS 13 introduced a new and required header called apns-push-type. The value of this header informs APNs of the contents of your notification’s payload so that APNs can respond to the message appropriately. The possible values for this header are:

  • alert
  • background
  • voip
  • complication
  • fileprovider
  • mdm

Apple’s documentation indicates that the value of this header “must accurately reflect the contents of your notification’s payload. If there is a mismatch, or if the header is missing on required systems, APNs may return an error, delay the delivery of the notification, or drop it altogether.”

We’ve made some changes to the Amazon SNS API that make it easier for developers to handle this breaking change. When you send push notifications of the alert or background type, Amazon SNS automatically sets the apns-push-type header to the appropriate value for your message. For more information about creating an alert type and a background type notification, see Generating a Remote Notification and Pushing Background Updates to Your App on the Apple Developer website.

Because you might not have enough time to react to this breaking change, Amazon SNS provides two options:

  • If you want to set the apns-push-type header value yourself, or the contents of your notification’s payload require the header value to be set to voip, complication, fileprovider, or mdm, Amazon SNS lets you set the header value as a message attribute using the Amazon SNS Publish API action. For more information, see Specifying Custom APNs Header Values and Reserved Message Attributes for Mobile Push Notifications in the Amazon SNS Developer Guide.
  • If you send push notifications as alert or background type, and if the contents of your notification’s payload follow the format described in the Apple Developer documentation, then Amazon SNS automatically sets the correct header value. To send a background notification, the format requires the aps dictionary to have only the content-available field set to 1. For more information about creating an alert type and a background type notification, see Generating a Remote Notification and Pushing Background Updates to Your App on the Apple Developer website.
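For the first option, the publish call that sets the header explicitly can be sketched as follows. The helper name, endpoint ARN, and payload are placeholders; `AWS.SNS.MOBILE.APNS.PUSH_TYPE` is the reserved message attribute described in the SNS Developer Guide pages referenced above, and the payload shows the background format (an `aps` dictionary with only `content-available` set to 1):

```python
import json

# Hypothetical helper: build the arguments for an SNS Publish call that
# sends a background (silent) APNs notification. Per the post, a background
# payload's "aps" dictionary must contain only "content-available": 1.
def build_apns_publish_args(endpoint_arn, push_type="background"):
    apns_payload = json.dumps({"aps": {"content-available": 1}})
    return {
        "TargetArn": endpoint_arn,
        "MessageStructure": "json",
        "Message": json.dumps({"default": "fallback", "APNS": apns_payload}),
        # Only needed when you set the header value yourself (e.g. voip,
        # complication, fileprovider, mdm); for alert and background types,
        # SNS now sets apns-push-type automatically.
        "MessageAttributes": {
            "AWS.SNS.MOBILE.APNS.PUSH_TYPE": {
                "DataType": "String",
                "StringValue": push_type,
            }
        },
    }

args = build_apns_publish_args(
    "arn:aws:sns:us-east-1:123456789012:endpoint/APNS/MyApp/example-id"
)
# boto3.client("sns").publish(**args)  # actual send; requires AWS credentials
```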

We hope these changes make it easier for you to send messages to your customers who use iOS 13. If you have questions about these changes, please open a ticket with our Customer Support team through the AWS Management Console.

Cloudflare Refutes MPA and RIAA’s Piracy Concerns

Post Syndicated from Ernesto original

Earlier this month several copyright holder groups sent their annual “Notorious Markets” complaints to the U.S. Trade Representative (USTR).

The recommendations are meant to call out well-known piracy sites, apps, and services, but Cloudflare is frequently mentioned as well.

The American CDN provider can’t be officially listed since it’s not a foreign company. However, rightsholders have seized the opportunity to point out that the CDN service helps pirate sites with their infringing activities.

The MPA and RIAA, for example, wrote that Cloudflare frustrates enforcement efforts by helping pirate sites to “hide” their hosting locations. In addition, the Hollywood-affiliated Digital Citizens Alliance (DCA) pointed out that the company helps pirate sites to deliver malware.

This week Cloudflare responded to these allegations. In a rebuttal sent to the USTR’s Director for Innovation and Intellectual Property, General Counsel Doug Kramer writes that these reports are not an accurate representation of how the company operates.

“My colleagues and I were frustrated to find continued misrepresentations of our business and efforts to malign our services,” Kramer writes.

“We again feel called on to clarify that Cloudflare does not host the referenced websites, cannot block websites, and is not in the business of hiding companies that host illegal content–all facts well known to the industry groups based on our ongoing work with them.”

Kramer points out that the copyright holder groups “rehash” previous complaints, which Cloudflare previously rebutted. In fact, some parts of the CDN provider’s own reply are rehashed too, but there are several new highlights as well.

For example, the USTR’s latest review specifically focuses on malware issues. According to Cloudflare, its services are specifically aimed at mitigating such threats.

“Our system uses the collective intelligence from all the properties on our network to support and immediately update our web application firewall, which can block malware at the edge and prevent it from reaching a site’s origin server. This protects the many content creators who use our services for their websites as well as the users of their websites, from malware,” Kramer writes.

The DCA’s submission, which included a 2016 report from the group, is out of date and inaccurate, Cloudflare says. Several of the mentioned domains are no longer Cloudflare customers, for example. In addition, the DCA never sent any malware complaints to the CDN service.

Cloudflare did previously reach out to the DCA following its malware report, but this effort proved fruitless, the company writes.

“Despite our repeated attempts to get additional information by either phone or email, DCA cancelled at least three scheduled calls and declined to provide any specific information that would have allowed us to verify the existence of the malware and protect users from malicious activity online,” Kramer notes.

Malware aside, the allegations that Cloudflare helps pirate sites to ‘hide’ their hosting locations are not entirely true either.

Kramer points out that the company has a “Trusted Reporter” program which complainants, including the RIAA, use frequently. This program helps rightsholders to easily obtain the actual hosting locations of Cloudflare customers that engage in widespread copyright infringement.

Although Cloudflare admits that it can’t stop all bad actors online, it will continue to work with the RIAA, MPA, and others to provide them with all the information they need for their enforcement efforts.

None of this is new though. Year after year the same complaints come in and Cloudflare suggests that copyright holders are actually looking for something else. They would like the company to terminate accounts of suspected pirate sites. However, the CDN provider has no intention to do so.

“Their submissions to the Notorious Markets process seem intended to pressure Cloudflare to take over efforts to identify and close down infringing websites for them, but that is something that we are not obligated to do,” Kramer says.

While it would be technically possible, it would require the company to allocate considerable resources to the task. These resources are currently needed to pursue its primary goal, which is to keep the Internet secure and protect users from malware and other risks.

It’s clear that Cloudflare doesn’t want to take any action against customers without a court order. While it has occasionally deviated from this stance by kicking out Daily Stormer and 8Chan, pirate sites are on a different level.

A copy of the letter Cloudflare’s General Counsel Doug Kramer sent to the USTR’s Director for Innovation and Intellectual Property, Jacob Ewerdt, is available here (pdf).

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more.

How ironSource built a multi-purpose data lake with Upsolver, Amazon S3, and Amazon Athena

Post Syndicated from Seva Feldman original

ironSource, in their own words, is the leading in-app monetization and video advertising platform, making free-to-play and free-to-use possible for over 1.5B people around the world. ironSource helps app developers take their apps to the next level, including the industry’s largest in-app video network. Over 80,000 apps use ironSource technologies to grow their businesses.

The massive scale in which ironSource operates across its various monetization platforms—including apps, video, and mediation—leads to millions of end-devices generating massive amounts of streaming data. They need to collect, store, and prepare data to support multiple use cases while minimizing infrastructure and engineering overheads.

This post discusses the following:

  • Why ironSource opted for a data lake architecture based on Amazon S3.
  • How ironSource built the data lake using Upsolver.
  • How to create outputs to analytic services such as Amazon Athena, Amazon ES, and Tableau.
  • The benefits of this solution.

Advantages of a data lake architecture

After working for several years in a database-focused approach, the rapid growth in ironSource’s data made their previous system unviable from a cost and maintenance perspective. Instead, they adopted a data lake architecture, storing raw event data on object storage, and creating customized output streams that power multiple applications and analytic flows.

Why ironSource chose an AWS data lake

A data lake was the right solution for ironSource for the following reasons:

  • Scale – ironSource processes 500K events per second and over 20 billion events daily. The ability to store near-infinite amounts of data in S3 without preprocessing the data is crucial.
  • Flexibility – ironSource uses data to support multiple business processes. Because they need to feed the same data into multiple services to provide for different use cases, the company needed to bypass the rigidity and schema limitations entailed by a database approach. Instead, they store all the original data on S3 and create ad-hoc outputs and transformations as needed.
  • Resilience – Because all historical data is on S3, recovery from failure is easier, and errors further down the pipeline are less likely to affect production environments.

Why ironSource chose Upsolver

Upsolver’s streaming data platform automates the coding-intensive processes associated with building and managing a cloud data lake. Upsolver enables ironSource to support a broad range of data consumers and minimize the time DevOps engineers spend on data plumbing by providing a GUI-based, self-service tool for ingesting data, preparing it for analysis, and outputting structured tables to various query services.

Key benefits include the following:

  • Self-sufficiency for data consumers – As a self-service platform, Upsolver allows BI developers, Ops, and software teams to transform data streams into tabular data without writing code.
  • Improved performance – Because Upsolver stores files in optimized Parquet storage on S3, ironSource benefits from high query performance without manual performance tuning.
  • Elastic scaling – ironSource is in hyper-growth, so it needs elastic scaling to handle increases in inbound data volume and peaks throughout the week, reprocessing of events from S3, and isolation between different groups that use the data.
  • Data privacy – Because Upsolver is deployed inside ironSource’s VPC with no access from outside, there is no risk to sensitive data.

This post shows how ironSource uses Upsolver to build, manage, and orchestrate its data lake with minimal coding and maintenance.

Solution Architecture

The following diagram shows the architecture ironSource uses:

Architecture showing Apache Kafka with an arrow pointing left to Upsolver. Upsolver contains stream ingestion, schemaless data management, and stateful data processing; it has two arrows coming out the bottom, each going to S3, one for raw data, the other for Parquet files. The Upsolver box has an arrow pointing right to a Query Engines box, which contains Athena, Redshift, and Elastic. This box has an arrow pointing right to Use cases, which contains product analytics, campaign performance, and customer dashboards.

Streaming data from Kafka to Upsolver and storing on S3

Apache Kafka streams data from ironSource’s mobile SDK at a rate of up to 500K events per second. Upsolver pulls data from Kafka and stores it in S3 within a data lake architecture. It keeps a copy of the raw event data, making sure each event is written exactly once, and also stores the same data as Parquet files that are optimized for consumption.
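Exactly-once behavior of this kind is commonly achieved by deriving the S3 object key deterministically from the Kafka coordinates, so a replayed record overwrites rather than duplicates. A minimal sketch of the idea (hypothetical, not Upsolver's actual implementation; a dict stands in for S3):

```python
# Idempotent sink: derive a deterministic object key from the Kafka
# coordinates so re-consuming the same record overwrites, never duplicates.
def object_key(topic, partition, offset):
    return f"raw/{topic}/partition={partition}/offset={offset:012d}.json"

store = {}  # stands in for an S3 bucket (key -> object body)

def write_exactly_once(topic, partition, offset, payload):
    store[object_key(topic, partition, offset)] = payload

# Replaying the same record twice leaves exactly one copy.
write_exactly_once("events", 0, 42, b'{"e":1}')
write_exactly_once("events", 0, 42, b'{"e":1}')
```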

Building the input stream in Upsolver:

Using the Upsolver GUI, ironSource connects directly to the relevant Kafka topics and writes them to S3 precisely one time. See the following screenshot.

Image of the Upsolver UI showing the "Data Sources" tab is open to the "Create a Kafka Data Source" page with "Mobile SDK Cluster" highlighted under the "Compute Cluster" section.

After the data is stored in S3, ironSource can proceed to operationalize the data using a wide variety of databases and analytic tools. The next steps cover the most prominent tools.

Output to Athena

To understand production issues, developers and product teams need access to data. These teams can work with the data directly and answer their own questions by using Upsolver and Athena.

Upsolver simplifies and automates the process of preparing data for consumption in Athena, including compaction, compression, partitioning, and creating and managing tables in the AWS Glue Data Catalog. ironSource’s DevOps teams save hundreds of hours on pipeline engineering. Upsolver’s GUI creates each table one time, and from that point onwards, data consumers are entirely self-sufficient. To ensure queries in Athena run fast and at minimal cost, Upsolver also enforces performance-tuning best practices as data is ingested and stored on S3. For more information, see Top 10 Performance Tuning Tips for Amazon Athena.

Athena’s serverless architecture further complements this independence: there is no infrastructure to manage, and analysts don’t need DevOps to set up Amazon Redshift or query clusters for each new question. Instead, analysts can indicate the data they need and get answers.

Sending tables to Athena in Upsolver

In Upsolver, you can declare tables with associated schema using SQL or the built-in GUI. You can expose these tables to Athena through the AWS Glue Data Catalog. Upsolver stores Parquet files in S3 and creates the appropriate table and partition information in the AWS Glue Data Catalog by using Create and Alter DDL statements. You can also edit these tables with Upsolver Output to add, remove, or change columns. Upsolver automates the process of recreating table data on S3 and altering the metadata in the AWS Glue Data Catalog.
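The table and partition metadata Upsolver manages corresponds to DDL along these lines (the table, column, and bucket names below are made up for illustration; Upsolver issues equivalent statements against the Glue Data Catalog for you):

```python
# Illustrative DDL that exposes Parquet files on S3 as an Athena table.
# All identifiers are hypothetical examples.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS mobile_sdk_data (
  event_id   string,
  event_time timestamp,
  app_id     string
)
PARTITIONED BY (dt string)
STORED AS PARQUET
LOCATION 's3://example-bucket/upsolver-output/mobile_sdk_data/'
""".strip()
```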

Creating the table

Image of the Upsolver UI showing the "Outputs" tab is open to the "Mobile SDK Data" page.

Sending the table to Amazon Athena

Image of the Upsolver UI showing the "Run Parameters" dialog box is open, having arrived there from the "Mobile SDK Data" page noted in the previous image.

Editing the table option for Outputs

Image on the "Mobile SDK Data" page showing the drop down menu from the 3 dots in the upper left with "Edit" highlighted.

Modifying an existing table in the Upsolver Output

Image showing "Alter Existing Table" with a radio button selected, along with a blurb that states "The changes will affect the existing table from the time specified. Any data already written after that time will be deleted. The previous output will stop once it finishes processing all the data up to the specified time." Below that is a box showing an example date and time. The other option, with its radio button not selected, is "Create New Table" with the blurb "A new table will be created. The existing table and output will not be affected in any way by this operation." The buttons at the bottom are "Next" and "Cancel," with "Next" selected.

Output to BI platforms

ironSource’s BI analysts use Tableau to query and visualize data using SQL. However, performing this type of analysis on streaming data may require extensive ETL and data preparation, which can limit the scope of analysis and create reporting bottlenecks.

ironSource’s cloud data lake architecture enables BI teams to work with big data in Tableau. They use Upsolver to enrich and filter data and write it to Redshift to build reporting dashboards, or send tables to Athena for ad-hoc analytic queries. Tableau connects natively to both Redshift and Athena, so analysts can query the data using regular SQL and familiar tools, rather than relying on manual ETL processes.

Creating a reduced stream for Amazon ES

Engineering teams at ironSource use Amazon ES to monitor and analyze application logs. However, as with any database, storing raw data in Amazon ES is expensive and can lead to production issues.

Because a large part of these logs are duplicates, Upsolver deduplicates the data. This reduces Amazon ES costs and improves performance. Upsolver cuts down the size of the data stored in Amazon ES by 70% by aggregating identical records. This makes it viable and cost-effective despite generating a high volume of logs.

To do this, Upsolver adds a calculated field to the event stream, which indicates whether a particular log is a duplicate. If so, it filters the log out of the stream that it sends to Amazon ES.
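The calculated-field-plus-filter pattern can be sketched in a few lines (the field logic here is hypothetical; Upsolver configures this through its GUI rather than code):

```python
import hashlib

def dedup_key(log_line):
    # Calculated field: a stable hash of the log body marks duplicates.
    return hashlib.sha256(log_line.encode()).hexdigest()

seen = set()

def is_duplicate(log_line):
    key = dedup_key(log_line)
    if key in seen:
        return True
    seen.add(key)
    return False

logs = ["err timeout svc=a", "err timeout svc=a", "warn retry svc=b"]
# Filter: only non-duplicate records continue downstream to Amazon ES.
forwarded = [line for line in logs if not is_duplicate(line)]
```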

Creating the calculated field

Image showing the Upsolver UI with the "Outputs" tab selected, showing the "Create Calculated Field" page.

Filtering using the calculated field

Upsolver UI showing the "Outputs" tab selected, on the "Create Filter" page.


Self-sufficiency is a big part of ironSource’s development ethos. In revamping its data infrastructure, the company sought to create a self-service environment for dev and BI teams to work with data, without becoming overly reliant on DevOps and data engineering. Data engineers can now focus on features rather than building and maintaining code-driven ETL flows.

ironSource successfully built an agile and versatile architecture with Upsolver and AWS data lake tools. This solution enables data consumers to work independently with data, while significantly improving data freshness, which helps power both the company’s internal decision-making and external reporting.

Some of the results in numbers include:

  • Thousands of engineering hours saved – ironSource’s DevOps and data engineers save thousands of hours that they would otherwise spend on infrastructure by replacing manual, coding-intensive processes with self-service tools and managed infrastructure.
  • Cost reduction – Factoring in infrastructure, workforce, and licensing costs, Upsolver significantly reduces ironSource’s total infrastructure costs.
  • 15-minute latency from Kafka to end-user – Data consumers can respond and take action with near real-time data.
  • 9X increase in scale – Currently at 0.5M incoming events/sec and 3.5M outgoing events/sec.

“It’s important for every engineering project to generate tangible value for the business,” says Seva Feldman, Vice President of Research and Development at ironSource Mobile. “We want to minimize the time our engineering teams, including DevOps, spend on infrastructure and maximize the time spent developing features. Upsolver has saved thousands of engineering hours and significantly reduced total cost of ownership, which enables us to invest these resources in continuing our hypergrowth rather than data pipelines.”

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.


About the Authors

Seva Feldman is Vice President of R&D at ironSource Mobile.
With over two decades of experience in senior architecture, DevOps, and engineering roles, Seva is an expert in turning operational challenges into opportunities for improvement.


Eran Levy is the Director of Marketing at Upsolver.





Roy Hasson is the Global Business Development Lead of Analytics and Data Lakes at AWS. He works with customers around the globe to design solutions to meet their data processing, analytics, and business intelligence needs. Roy is a big Manchester United fan who enjoys cheering his team on and hanging out with his family.




ML Platform Meetup: Infra for Contextual Bandits and Reinforcement Learning

Post Syndicated from Netflix Technology Blog original

Faisal Siddiqi

Infrastructure for Contextual Bandits and Reinforcement Learning was the theme of the ML Platform meetup hosted at Netflix, Los Gatos on Sep 12, 2019.

Contextual and Multi-armed Bandits offer faster, adaptive alternatives to traditional A/B Testing. They enable rapid learning and better decision-making for product rollouts. Broadly speaking, these approaches can be seen as a stepping stone to full-on Reinforcement Learning (RL) with closed-loop, on-policy evaluation and model objectives tied to reward functions. At Netflix, we are running several such experiments. For example, one set of experiments is focused on personalizing our artwork assets to quickly select and leverage the “winning” images for a title we recommend to our members.

As with other traditional machine learning and deep learning paths, a lot of what the core algorithms can do depends upon the support they get from the surrounding infrastructure and the tooling that the ML platform provides. Given the infrastructure space for RL approaches is still relatively nascent, we wanted to understand what others in the community are doing in this space.

This was the motivation for the meetup’s theme. It featured three relevant talks from LinkedIn, Netflix and Facebook, and a platform architecture overview talk from first time participant Dropbox.



After a brief introduction on the theme and the motivation for its choice, the talks were kicked off by Kinjal Basu from LinkedIn, who talked about Online Parameter Selection for Web-Based Ranking via Bayesian Optimization. In this talk, Kinjal used the example of the LinkedIn Feed to demonstrate how they use bandit algorithms to solve the optimal parameter selection problem efficiently.

He started by laying out some of the challenges around inefficiencies of engineering time when manually optimizing for weights/parameters in their business objective functions. The key insight was that by assuming a latent Gaussian Process (GP) prior on the key business metric actions like viral engagement, job applications, etc., they were able to reframe the problem as a straightforward black-box optimization problem. This allowed them to use BayesOpt techniques to solve this problem.

The algorithm used to solve this reformulated optimization problem is a popular E/E technique known as Thompson Sampling. He talked about the infrastructure used to implement this. They have built an offline BayesOpt library, a parameter store to retrieve the right set of parameters, and an online serving layer to score the objective at serving time given the parameter distribution for a particular member.
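As a rough illustration of the explore/exploit mechanism behind Thompson Sampling (LinkedIn samples from a GP posterior over continuous parameters; this sketch uses a much simpler Beta-Bernoulli model over three discrete candidates):

```python
import random

# Thompson Sampling over discrete candidates with Bernoulli rewards:
# sample a plausible reward rate for each arm from its Beta posterior,
# then play the arm whose sample is highest.
def thompson_select(successes, failures):
    samples = [random.betavariate(s + 1, f + 1)
               for s, f in zip(successes, failures)]
    return max(range(len(samples)), key=samples.__getitem__)

# Three candidate parameter settings; arm 1 has the highest true rate.
true_rates = [0.2, 0.6, 0.4]
s, f = [0, 0, 0], [0, 0, 0]
random.seed(0)
for _ in range(2000):
    arm = thompson_select(s, f)
    if random.random() < true_rates[arm]:
        s[arm] += 1
    else:
        f[arm] += 1
# After enough rounds, most pulls concentrate on the best arm.
```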

He also described some practical considerations, like member-parameter stickiness, to avoid per session variance in a member’s experience. Their offline parameter distribution is recomputed hourly, so the member experience remains consistent within the hour. Some simulation results and some online A/B test results were shared, demonstrating substantial lifts in the primary business metrics, while keeping the secondary metrics above preset guardrails.

He concluded by stressing the efficiency their teams had achieved by doing online parameter exploration instead of the much slower human-in-the-loop manual explorations. In the future, they plan to explore adding new algorithms like UCB, considering formulating the problem as a grey-box optimization problem, and switching between the various business metrics to identify which is the optimal metric to optimize.



The second talk was by Netflix on our Bandit Infrastructure built for personalization use cases. Fernando Amat and Elliot Chow jointly gave this talk.

Fernando started the first part of the talk and described the core recommendation problem of identifying the top few titles in a large catalog that will maximize the probability of play. Using the example of evidence personalization — images, text, trailers, synopsis, all assets that come together to add meaning to a title — he described how the problem is essentially a slate recommendation task and is well suited to be solved using a Bandit framework.

If such a framework is to be generic, it must support different contexts, attributions, and reward functions. He described a simple Policy API that models the slate tasks. This API supports the selection of a slate given a list of options using the appropriate algorithm, and a way to quantify the propensities so the data can be de-biased. Fernando ended his part by highlighting some of the Bandit Metrics they implemented for offline policy evaluation, like Inverse Propensity Scoring (IPS), Doubly Robust (DR), and Direct Method (DM).
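IPS in its basic form is a propensity-weighted average of logged rewards. A tiny sketch with synthetic logged data (contexts, actions, and the target policy here are invented for illustration):

```python
# Inverse Propensity Scoring: estimate a new policy's value offline from
# (context, action, propensity, reward) tuples logged under another policy.
def ips_estimate(logged, target_policy):
    total = 0.0
    for context, action, propensity, reward in logged:
        # Importance weight: target probability over logging propensity.
        weight = target_policy(context, action) / propensity
        total += weight * reward
    return total / len(logged)

# Logging policy showed each of two images uniformly (propensity 0.5);
# the target policy always shows image "A".
logged = [
    ("ctx", "A", 0.5, 1.0),
    ("ctx", "B", 0.5, 0.0),
    ("ctx", "A", 0.5, 1.0),
    ("ctx", "B", 0.5, 0.0),
]
target = lambda ctx, a: 1.0 if a == "A" else 0.0
value = ips_estimate(logged, target)
```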

For Bandits, where attribution is a critical part of the equation, it’s imperative to have a flexible and robust data infrastructure. Elliot started the second part of the talk by describing the real-time framework they have built to bring together all signals in one place making them accessible through a queryable API. These signals include member activity data (login, search, playback), intent-to-treat (what title/assets the system wants to impress to the member) and the treatment (impressions of images, trailers) that actually made it to the member’s device.

Elliot talked about what is involved in “Closing the loop”. First, the intent-to-treat needs to be joined with the treatment logging along the way, the policies in effect, the features used and the various propensities. Next, the reward function needs to be updated, in near real time, on every logged action (like a playback) for both short-term and long-term rewards. And finally each new observation needs to update the policy, compute offline policy evaluation metrics and then push the policy back to production so it can generate new intents to treat.
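The first join in that loop can be caricatured with in-memory data (the real system runs this over streams with Flink and Spark; the request ids and fields below are invented):

```python
# Toy "closing the loop" join: match each intent-to-treat with its
# treatment impression and any downstream reward event by request id.
intents = {"r1": {"policy": "p7", "propensity": 0.2, "asset": "img_a"}}
impressions = [("r1", "img_a")]
plays = [("r1",)]  # playback observed for request r1

observations = []
for req_id, asset in impressions:
    intent = intents.get(req_id)
    if intent is None or intent["asset"] != asset:
        continue  # misattribution guard: impression must match the intent
    reward = 1.0 if any(p[0] == req_id for p in plays) else 0.0
    observations.append({**intent, "reward": reward})
# Each observation now carries policy, propensity, and reward, ready to
# update the policy and compute offline evaluation metrics.
```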

To be able to support this, the team had to standardize on several infrastructure components. Elliot talked about the three key components — a) Standardized Logging from the treatment services, b) Real-time stream processing over Apache Flink for member activity joins, and c) an Apache Spark client for attribution and reward computation. The team has also developed a few common attribution datasets as “out-of-the-box” entities to be used by the consuming teams.

Finally, Elliot ended by talking about some of the challenges in building this Bandit framework. In particular, he talked about the potential for misattribution in a complex microservice architecture where intermediary results are often cached. He also talked about common pitfalls of stream-processed data, like out-of-order processing.

This framework has been in production for almost a year now and has been used to support several A/B tests across different recommendation use cases at Netflix.



After a short break, the second session started with a talk from Facebook focused on practical solutions to exploration problems. Sam Daulton described how the infrastructure and product use cases came along. He described how the adaptive experimentation efforts aim to enable fast experimentation, with degrees of automation ranging from experts using the platform in an ad hoc fashion all the way to no-human-in-the-loop efforts.

He dived into a policy search problem they tried to solve: How many posts to load for a user depending upon their device’s connection quality. They modeled the problem as an infinite-arm bandit problem and used Gaussian Process (GP) regression. They used Bayesian Optimization to perform multi-metric optimization — e.g., jointly optimizing decrease in CPU utilization along with increase in user engagement. One of the challenges he described was how to efficiently choose a decision point, when the joint optimization search presented a Pareto frontier in the possible solution space. They used constraints on individual metrics in the face of noisy experiments to allow business decision makers to arrive at an optimal decision point.
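One common way to pick a decision point on such a Pareto frontier is to impose a floor on one metric and optimize the other over the feasible set. A toy sketch with synthetic numbers (not Facebook's actual procedure):

```python
# Candidates on a Pareto frontier trading CPU utilization against
# relative user engagement (synthetic numbers for illustration).
candidates = [
    {"cpu": 0.80, "engagement": 1.05},
    {"cpu": 0.70, "engagement": 1.02},
    {"cpu": 0.60, "engagement": 0.97},  # violates the engagement floor
]

# Constraint: engagement must not drop below baseline (1.0); among the
# feasible candidates, pick the one with the lowest CPU utilization.
feasible = [c for c in candidates if c["engagement"] >= 1.0]
choice = min(feasible, key=lambda c: c["cpu"])
```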

Not all spaces can be efficiently explored online, so several research teams at Facebook use Simulations offline. For example, a ranking team would ingest live user traffic and subject it to a number of ranking configurations and simulate the event outcomes using predictive models running on canary rankers. The simulations were often biased and needed de-biasing (using multi-task GP regression) for them to be used alongside online results. They observed that by combining their online results with de-biased simulation results they were able to substantially improve their model fit.

To support these efforts, they developed and open sourced some tools along the way. Sam described Ax and BoTorch — Ax is a library for managing adaptive experiments and BoTorch is a library for Bayesian Optimization research. There are many applications already in production for these tools from both basic hyperparameter exploration to more involved AutoML use cases.

The final section of Sam’s talk focused on Constrained Bayesian Contextual Bandits. He described the problem of video uploads to Facebook, where the goal is to maximize the quality of the video without a decrease in reliability of the upload. They modeled it as a Thompson Sampling optimization problem using a Bayesian linear model. To enforce the constraints, they used a modified algorithm, Constrained Thompson Sampling, to ensure a non-negative change in reliability. The reward function similarly needed some shaping to align with the constrained objective. With this reward shaping optimization, Sam shared some results that showed how the Constrained Thompson Sampling algorithm surfaced many actions that satisfied the reliability constraints, where vanilla Thompson Sampling had failed.



The last talk of the event was a system architecture introduction by Dropbox’s Tsahi Glik. As a first time participant, their talk was more of an architecture overview of the ML Infra in place at Dropbox.

Tsahi started off by giving some ML usage examples at Dropbox like Smart Sync which predicts which file you will use on a particular device, so it’s preloaded. Some of the challenges he called out were the diversity and size of the disparate data sources that Dropbox has to manage. Data privacy is increasingly important and presents its own set of challenges. From an ML practice perspective, they also have to deal with a wide variety of development processes and ML frameworks, custom work for new use cases and challenges with reproducibility of training.

He shared a high level overview of their ML platform showing the various common stages of developing and deploying a model categorized by the online and offline components. He then dived into some individual components of the platform.

The first component he talked about was a user activity service to collect the input signals for the models. This service, Antenna, provides a way to query user activity events and summarizes the activity with various aggregations. The next component he dived deeper into was a content ingestion pipeline for OCR (optical character recognition). As an example, he explained how the image of a receipt is converted into contextual text. The pipeline takes the image through multiple models for various subtasks: the first classifies whether the image has some detectable text, the second does corner detection, and the third does word-box detection, followed by a deep LSTM neural network that performs the core sequence-based OCR. The final stage performs some lexicographical post-processing.

He talked about the practical considerations of ingesting user content — they need to prevent malicious content from impacting the service. To enable this they have adopted a plugin based architecture and each task plugin runs in a sandbox jail environment.

Their offline data preparation ETLs run on Spark and they use Airflow as the orchestration layer. Their training infrastructure relies on a hybrid cloud approach. They have built a layer and command line tool called dxblearn that abstracts the training paths, allowing the researchers to train either locally or leverage AWS. dxblearn also allows them to fire off training jobs for hyperparameter tuning.

Published models are sent to a model store in S3 which are then picked up by their central model prediction service that does online inferencing for all use cases. Using a central inferencing service allows them to partition compute resources appropriately and having a standard API makes it easy to share and also run inferencing in the cloud.

They have also built a common “suggest backend”: a generic prediction application that can be used by the various edge and production-facing services, standardizing the data fetching, prediction, and experiment configuration needed for a product prediction use case. This allows them to do live experimentation more easily.

The last part of Tsahi’s talk described a product use case leveraging their ML platform: a promotion campaign ranker (e.g., “Try Dropbox Business”) for upselling. This is modeled as a multi-armed bandit problem, an example well in line with the meetup theme.

The biggest value of such meetups lies in the high-bandwidth exchange of ideas among like-minded practitioners. In addition to some great questions after the talks, the 150+ attendees stayed well past two hours at the reception, exchanging stories and lessons learnt solving similar problems at scale.

In the Personalization org at Netflix, we are always interested in exchanging ideas about this rapidly evolving ML space in general and the bandits and reinforcement learning space in particular. We are committed to sharing our learnings with the community and hope to discuss progress here, especially our work on Policy Evaluation and Bandit Metrics in future meetups. If you are interested in working on this exciting space, there are many open opportunities on both engineering and research endeavors.

ML Platform Meetup: Infra for Contextual Bandits and Reinforcement Learning was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

LTTng 2.11.0 “Lafontaine” released

Post Syndicated from jake original

After more than two years of development, the Linux trace toolkit next generation (LTTng)
project has released version 2.11.0 of the kernel and user-space tracing
tool. The release covers the LTTng tools, LTTng user-space tracer, and
LTTng kernel modules. It includes a number of new features that are
described in the announcement including session rotation, dynamic user-space tracing,
call-stack capturing for the kernel and user space, improved networking
performance, NUMA awareness for user-space tracing buffer allocation, and
more. “The biggest feature of this release is the long-awaited session
rotation support. Session rotations now allow you to rotate an
ongoing tracing session much in the same way as you would rotate logs.

The ‘lttng rotate’ command rotates the current trace chunk of
the current tracing session. Once a rotation is completed, LTTng does
not manage the trace chunk archive anymore: you can read it, modify it,
move it, or remove it.

Because a rotation causes the tracing session’s current sub-buffers
to be flushed, trace chunk archives are never redundant, that is, they
do not overlap over time, unlike snapshots.

Once a rotation is complete, offline analyses can be performed on
the resulting trace, much like in ‘normal’ mode. However, the big
advantage is that this can be done without interrupting tracing, and
without being limited to tools which implement the ‘live’ protocol.”

Reporters Without Borders: It has never been worse, but with Geshev it might yet get there

Post Syndicated from Николай Марченко original

Friday, October 18, 2019

The leadership of the world-renowned press freedom organization Reporters Without Borders (RSF) met with the President of the Republic of Bulgaria, Rumen Radev.

On October 17, RSF Secretary-General Christophe Deloire and Pauline Adès-Mével, the organization’s spokesperson and head of its Eastern Europe and Balkans desk, were received at the presidency at 2 Dondukov Street.

The two described the meeting to the handful of print and online journalists who came to the presidency (not a single television station responded to the invitation).

“We are following the Bivol case”

“The situation is absolutely appalling. It has probably never been as bad as it is now,” said Pauline Adès-Mével. According to her, the organization shared with the president its concerns about the “media climate” in Bulgaria following the start of the campaign to elect a new prosecutor general.

“Did you mention the case of the Bivol website?” asked the reporter from the investigative journalism site.

“We did not discuss the Bivol case specifically during this meeting, but we did mention the existence of court summonses against journalists, and not only against Bivol, in our written inquiries,” said Pauline Adès-Mével.

“Of course, the president is aware and understands that one of the reasons the country ranks so low (111th in the RSF World Press Freedom Index – ed.) is the judicial harassment that journalists have been facing lately, and in particular over the last 10 months,” she said.

The RSF spokesperson also gave assurances that she is following the site’s case and is in contact with Bivol editor-in-chief Atanas Chobanov, after the prosecution announced the possible issuance of a European warrant against him in order to take his testimony in the #NAPleaks case.

“Your media have never been worse off”

“The situation of your media has never been this bad since Bulgaria became a democracy,” said Christophe Deloire.

“We expressed our concerns about the possible appointment of the next prosecutor general, considering that the candidate who would be appointed has expressed views on the media that have nothing to do with professional impartiality and dignity,” the RSF secretary-general said of the meeting with Rumen Radev.

“This is a risk not only for journalism but also for democracy,” he believes. Deloire explained that RSF had called on the president for a new media law. “We recommended that the law guarantee editorial independence and the authority of journalism,” he said. According to Christophe Deloire, simply “reacting is not enough”; “systemic reform is needed.”

RSF asked the head of state for concrete measures in that direction. “One of the first measures must be a guarantee that there will be no verbal and physical attacks against journalists, or threats like those we saw in recent weeks,” said Adès-Mével.

According to Christophe Deloire, Bulgaria has “a climate of media civil war”: “But that is not pluralism.” “Instead, there must be protection of editorial independence and freedom, and of the trustworthiness of information,” he urged.

According to the head of the non-governmental organization, the situation in Bulgaria “is very worrying because of the heavy pressure on journalists, the attacks – sometimes physical, sometimes verbal – and the instrumentalization of the media against opponents,” Deloire said.

In his view, there is also an “abuse of the laws in order to threaten journalism”: “It was important for us to express this to the president.”

Three years waiting for an answer from the prime minister

The president’s press office was laconic in its statement about the meeting. Bulgaria’s 111th place in the RSF World Press Freedom Index is “alarming above all because of the serious problems identified in the media sphere, which is called upon to be a pillar of democratic processes,” Rumen Radev said.

“In the president’s words, funding is a key issue for guaranteeing the freedom, independence, and pluralism of the media, and clear criteria are needed for access to European funding and for its fair distribution,” the presidency added.

Speaking to journalists, RSF backed the Bulgarian head of state’s criticism of the government’s media policy.
“We find it brave of him, as president of Bulgaria, to get involved and express a clear view on media freedom; instead of denying everything, he prefers to describe the real situation, and that is the only way it can be improved,” said Christophe Deloire.

The journalists pointed out that Bulgaria is a parliamentary republic, and that it would be more appropriate to demand concrete measures from Prime Minister Boyko Borissov rather than from President Rumen Radev.

It turned out that an answer from the prime minister’s staff at the Council of Ministers has been awaited for three whole years. It is hardly surprising that the reply came only on the day RSF’s representatives were received by the head of state, in a building barely 30 meters away.

“In fact, our Eastern Europe and Balkans bureau has been trying to reach the prime minister for three years,” said Christophe Deloire.

“Today, for the first time, we received an answer – and even though it was negative, it was the first time we got an answer at all. We were told that the prime minister is on a working visit to Brussels, so it is not possible to meet today,” said Christophe Deloire.

“Still, for us this answer is a first step toward starting a dialogue,” Pauline Adès-Mével summed up in her statement to the media.

Photo: Frognews

Letting Birds scooters fly free

Post Syndicated from Matthew Garrett original

(Note: These issues were disclosed to Bird, and they tell me that fixes have rolled out. I haven’t independently verified)

Bird produce a range of rental scooters that are available in multiple markets. With the exception of the Bird Zero[1], all their scooters share a common control board described in FCC filings. The board contains three primary components – a Nordic NRF52 Bluetooth controller, an STM32 SoC and a Quectel EC21-V modem. The Bluetooth and modem are both attached to the STM32 over serial and have no direct control over the rest of the scooter. The STM32 is tied to the scooter’s engine control unit and lights, and also receives input from the throttle (and, on some scooters, the brakes).

The pads labeled TP7-TP11 near the underside of the STM32 and the pads labeled TP1-TP5 near the underside of the NRF52 provide Serial Wire Debug, although confusingly the data and clock pins are the opposite way around between the STM and the NRF. Hooking this up via an STLink and using OpenOCD allows dumping of the firmware from both chips, which is where the fun begins. Running strings over the firmware from the STM32 revealed “Set mode to Free Drive Mode”. Challenge accepted.

Working back from the code that printed that, it was clear that commands could be delivered to the STM from the Bluetooth controller. The Nordic NRF52 parts are an interesting design – like the STM, they have an ARM Cortex-M microcontroller core. Their firmware is split into two halves, one the low level Bluetooth code and the other application code. They provide an SDK for writing the application code, and working through Ghidra made it clear that the majority of the application firmware on this chip was just SDK code. That made it easier to find the actual functionality, which was just listening for writes to a specific BLE attribute and then hitting a switch statement depending on what was sent. Most of these commands just got passed over the wire to the STM, so it seemed simple enough to just send the “Free drive mode” command to the Bluetooth controller, have it pass that on to the STM and win. Obviously, though, things weren’t so easy.

It turned out that passing most of the interesting commands on to the STM was conditional on a variable being set, and the code path that hit that variable had some impressively complicated looking code. Fortunately, I got lucky – the code referenced a bunch of data, and searching for some of the values in that data revealed that they were the AES S-box values. Enabling the full set of commands required you to send an encrypted command to the scooter, which would then decrypt it and verify that the cleartext contained a specific value. Implementing this would be straightforward as long as I knew the key.

Most AES keys are 128 bits, or 16 bytes. Digging through the code revealed 8 bytes worth of key fairly quickly, but the other 8 bytes were less obvious. I finally figured out that 4 more bytes were the value of another Bluetooth variable which could be simply read out by a client. The final 4 bytes were more confusing, because all the evidence made no sense. It looked like it came from passing the scooter serial number to atoi(), which converts an ASCII representation of a number to an integer. But this seemed wrong, because atoi() stops at the first non-numeric value and the scooter serial numbers all started with a letter[2]. It turned out that I was overthinking it and for the vast majority of scooters in the fleet, this section of the key was always “0”.
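
The key-assembly logic described above can be sketched as follows. The structure (8 firmware bytes + a 4-byte readable BLE attribute + the atoi of the serial number) comes from the write-up; the byte values, the endianness, and the helper names are my own assumptions.

```python
def c_atoi(s):
    """C-style atoi: parse an optional sign and leading decimal digits,
    stopping at the first other character. A letter-prefixed serial
    number therefore parses as 0."""
    digits = ""
    for ch in s.lstrip():
        if ch.isdigit() or (ch in "+-" and not digits):
            digits += ch
        else:
            break
    try:
        return int(digits)
    except ValueError:
        return 0

def assemble_key(firmware_bytes, ble_attr_bytes, serial):
    """Concatenate the three key sources into a 16-byte AES-128 key.
    Little-endian packing of the serial part is an assumption."""
    assert len(firmware_bytes) == 8 and len(ble_attr_bytes) == 4
    return firmware_bytes + ble_attr_bytes + c_atoi(serial).to_bytes(4, "little")
```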

At that point I had everything I needed to write a simple app to unlock the scooters, and it worked! For about 2 minutes, at which point the network would notice that the scooter was unlocked when it should be locked and send a lock command to force disable the scooter again. Ah well.

So, what else could I do? The next thing I tried was just modifying some STM firmware and flashing it onto a board. It still booted, indicating that there was no sort of verified boot process. Remember what I mentioned about the throttle being hooked through the STM32’s analogue to digital converters[3]? A bit of hacking later and I had a board that would appear to work normally, but about a minute after starting the ride would cut the throttle. Alternative options are left as an exercise for the reader.

Finally, there was the component I hadn’t really looked at yet. The Quectel modem actually contains its own application processor that runs Linux, making it significantly more powerful than any of the chips actually running the scooter application[4]. The STM communicates with the modem over serial, sending it an AT command asking it to make an SSL connection to a remote endpoint. It then uses further AT commands to send data over this SSL connection, allowing it to talk to the internet without having any sort of IP stack. Figuring out just what was going over this connection was made slightly difficult by virtue of all the debug functionality having been ripped out of the STM’s firmware, so in the end I took a more brute force approach – I identified the address of the function that sends data to the modem, hooked up OpenOCD to the SWD pins on the STM, ran OpenOCD’s gdb stub, attached gdb, set a breakpoint for that function and then dumped the arguments being passed to that function. A couple of minutes later and I had a full transaction between the scooter and the remote.

The scooter authenticates against the remote endpoint by sending its serial number and IMEI. You need to send both, but the IMEI didn’t seem to need to be associated with the serial number at all. New connections seemed to take precedence over existing connections, so it would be simple to just pretend to be every scooter and hijack all the connections, resulting in scooter unlock commands being sent to you rather than to the scooter, or allowing someone to send fake GPS data and make it impossible for users to find scooters.

In summary: Secrets that are stored on hardware that attackers can run arbitrary code on probably aren’t secret, not having verified boot on safety critical components isn’t ideal, devices should have meaningful cryptographic identity when authenticating against a remote endpoint.

Bird responded quickly to my reports, accepted my 90 day disclosure period and didn’t threaten to sue me at any point in the process, so good work Bird.

(Hey scooter companies I will absolutely accept gifts of interesting hardware in return for a cursory security audit)

[1] And some very early M365 scooters
[2] The M365 scooters that Bird originally deployed did have numeric serial numbers, but they were 6 characters of type code followed by a / followed by the actual serial number – the number of type codes was very constrained and atoi() would terminate at the / so this was still not a large keyspace
[3] Interestingly, Lime made a different design choice here and plumb the controls directly through to the engine control unit without the application processor having any involvement
[4] Lime run their entire software stack on the modem’s application processor, but because of [3] they don’t have any realtime requirements so this is more straightforward


Adding a Hardware Backdoor to a Networked Computer

Post Syndicated from Bruce Schneier original

Interesting proof of concept:

At the CS3sthlm security conference later this month, security researcher Monta Elkins will show how he created a proof-of-concept version of that hardware hack in his basement. He intends to demonstrate just how easily spies, criminals, or saboteurs with even minimal skills, working on a shoestring budget, can plant a chip in enterprise IT equipment to offer themselves stealthy backdoor access…. With only a $150 hot-air soldering tool, a $40 microscope, and some $2 chips ordered online, Elkins was able to alter a Cisco firewall in a way that he says most IT admins likely wouldn’t notice, yet would give a remote attacker deep control.

Mega Overturns Brazilian ISP Copyright Block

Post Syndicated from Andy original

The inevitable situation facing any site that hosts user-uploaded files is that some users will attempt to store copyright-infringing content.

The bigger the site, the bigger the problem, as YouTube’s copyright department knows only too well. But while few rightsholders would attempt to take on YouTube by filing for an ISP blocking order, plenty of other sites are considered fair game, Mega for example.

After a standing start in 2013, Mega is now a major player in the file-hosting market. Due to its early connections with Kim Dotcom, the site was under huge scrutiny from the very beginning and as such, has always insisted that it is fully compliant when it comes to copyright issues.

Nevertheless, earlier this month it was discovered that users in Brazil could no longer access the service. ISPs in the country had begun blocking the site following a copyright complaint initiated by the Brazilian Association of Subscription Television (ABTA).

Following a September decision, the São Paulo Court of Justice ordered four Internet service providers – Claro Brasil, Vivo-Telefonica, Oi and Algar Telecom – to prevent their subscribers from accessing several domains on copyright grounds, Mega’s domain included.

“With respect to the block in Brazil, we respectfully believe that the order is wrong and that the Court has been misled. MEGA has excellent compliance. We are working on a solution,” the company told its customers.

The nature of that solution wasn’t specified at the time but Mega Executive Chairman Stephen Hall says that the company mounted a legal challenge to a process that had actually begun months earlier and didn’t initially include Mega.

“The case started in January 2019 with various sites but not Mega,” Hall informs TorrentFreak.

“The case has been held in secret, apparently because the ABTA submitted that various sites included could change settings in order to evade the block.”

Hall says that Mega was added to the case in September 2019 based on the allegation that a single URL on the site led to infringing content. However, that URL had never been reported to the company as posing a problem.

“We submitted to the Appeal Court details of our rigorous compliance activity such as fast response to copyright takedown requests, suspension of accounts with repeat allegations of copyright infringement etc, as reported in our Transparency Report,” Hall says.

Mega’s Executive Chairman notes that Brazilian law only allows courts to suspend access to a service if it fails to respond to legal requests so Mega eventually came out on top.

“The Appeal Court ordered the block to be reversed. I believe the lower Court will now reconsider its inclusion of Mega. We are confident that access won’t be blocked again,” Hall concludes.

Reports posted by Mega users to Twitter suggest that at least some previously-blocked users are now able to access the site once again but the company is urging that any still experiencing difficulties should contact their providers.

“Contact your ISP if you still cannot access,” the company says.

According to SimilarWeb stats, there are more visitors to Mega from Brazil than any other country, together making up almost 10% of Mega’s traffic and making it the country’s 108th most popular site.

A report by Mega in January revealed the massive scale of its global operations since its launch six years ago.

“To date, more than 130 million registered MEGA users have uploaded over 53 billion files, utilizing the user-controlled end-to-end encryption we provide,” the company said.

Source: TF, for the latest info on copyright, file-sharing, torrent sites and more. We also have VPN reviews, discounts, offers and coupons.

Ubuntu 19.10 (Eoan Ermine) released

Post Syndicated from jake original

Ubuntu has announced the release of 19.10 “Eoan Ermine” in desktop and server editions as well as all of the different flavors: Ubuntu Budgie, Kubuntu, Lubuntu, Ubuntu Kylin, Ubuntu MATE,
Ubuntu Studio, and Xubuntu. “The Ubuntu kernel has been updated to the 5.3 based Linux kernel, and
our default toolchain has moved to gcc 9.2 with glibc 2.30. Additionally,
the Raspberry Pi images now support the new Pi 4 as well as 2 and 3.

Ubuntu Desktop 19.10 introduces GNOME 3.34, the fastest release yet, with
significant performance improvements delivering a more responsive
experience. App organisation is easier with the ability to drag and drop
icons into categorised folders and users can select light or dark Yaru
theme variants. The Ubuntu Desktop installer also introduces installing
to ZFS as a root filesystem as an experimental feature.” More information can also be found in the release notes.

Best practices to scale Apache Spark jobs and partition data with AWS Glue

Post Syndicated from Mohit Saxena original

AWS Glue provides a serverless environment to prepare (extract and transform) and load large amounts of datasets from a variety of sources for analytics and data processing with Apache Spark ETL jobs. This series of posts discusses best practices to help developers of Apache Spark applications and Glue ETL jobs, big data architects, data engineers, and business analysts scale their data processing jobs running on AWS Glue automatically.

The first post of this series discusses two key AWS Glue capabilities to manage the scaling of data processing jobs. The first allows you to horizontally scale out Apache Spark applications for large splittable datasets. The second allows you to vertically scale up memory-intensive Apache Spark applications with the help of new AWS Glue worker types. The post also shows how to use AWS Glue to scale Apache Spark applications with a large number of small files commonly ingested from streaming applications using Amazon Kinesis Data Firehose. Finally, the post shows how AWS Glue jobs can use the partitioning structure of large datasets in Amazon S3 to provide faster execution times for Apache Spark applications.

Understanding AWS Glue worker types

AWS Glue comes with three worker types to help customers select the configuration that meets their job latency and cost requirements. These workers, also known as Data Processing Units (DPUs), come in Standard, G.1X, and G.2X configurations.

The worker configurations are as follows:

  • Standard: 5 GB of Spark driver and executor memory, 512 MB for spark.yarn.executor.memoryOverhead, and 50 GB of attached EBS storage.
  • G.1X: 10 GB of driver and executor memory, 2 GB of memoryOverhead, and 64 GB of attached EBS storage.
  • G.2X: 20 GB of driver and executor memory, 4 GB of memoryOverhead, and 128 GB of attached EBS storage.

The compute parallelism (Apache Spark tasks per DPU) available for horizontal scaling is the same regardless of the worker type. For example, both Standard and G.1X workers map to 1 DPU, each of which can run eight concurrent tasks. A G.2X worker maps to 2 DPUs, which can run 16 concurrent tasks. As a result, compute-intensive AWS Glue jobs that possess a high degree of data parallelism can benefit from horizontal scaling (more Standard or G.1X workers). AWS Glue jobs that need high memory or ample disk space to store intermediate shuffle output can benefit from vertical scaling (upgrading to G.1X or G.2X workers).
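
The DPU-to-task arithmetic above can be captured in a small helper. This is just a back-of-the-envelope sketch using the numbers quoted in this post (8 Spark tasks per DPU), not a Glue API.

```python
# DPUs per worker, per the worker-type descriptions in this post.
WORKER_DPUS = {
    "Standard": 1,
    "G.1X": 1,
    "G.2X": 2,
}

TASKS_PER_DPU = 8  # compute parallelism is the same per DPU for all types

def max_concurrent_tasks(worker_type, num_workers):
    """Upper bound on concurrent Spark tasks for a given fleet of workers."""
    return WORKER_DPUS[worker_type] * TASKS_PER_DPU * num_workers
```

For example, ten G.2X workers give 160 task slots, the same horizontal parallelism as twenty Standard workers, but with the larger per-executor memory and disk.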

Horizontal scaling for splittable datasets

AWS Glue automatically supports file splitting when reading common native formats (such as CSV and JSON) and modern file formats (such as Parquet and ORC) from S3 using AWS Glue DynamicFrames. For more information about DynamicFrames, see Work with partitioned data in AWS Glue.

A file split is a portion of a file that a Spark task can read and process independently on an AWS Glue worker. By default, file splitting is enabled for line-delimited native formats, which allows Apache Spark jobs running on AWS Glue to parallelize computation across multiple executors. AWS Glue jobs that process large splittable datasets with medium (hundreds of megabytes) or large (several gigabytes) file sizes can benefit from horizontal scaling and run faster by adding more AWS Glue workers.

File splitting also benefits block-based compression formats such as bzip2. You can read each compression block on a file split boundary and process them independently. Unsplittable compression formats such as gzip do not benefit from file splitting. To horizontally scale jobs that read unsplittable files or compression formats, prepare the input datasets with multiple medium-sized files.
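
The splittability rules above can be summarized in a small lookup (illustrative only; format names are the common ones discussed in this section):

```python
# Whether Spark can split a single input file of this format across tasks.
SPLITTABLE = {
    "csv": True, "json": True,      # line-delimited native formats
    "parquet": True, "orc": True,   # modern columnar formats
    "bzip2": True,                  # block-based compression
    "gzip": False,                  # stream compression: unsplittable
}

def benefits_from_more_workers(fmt):
    """Horizontal scaling over a single large file only helps if Spark
    can split that file; otherwise, pre-split the data into many files."""
    return SPLITTABLE.get(fmt, False)
```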


Each file split (the blue square in the figure) is read from S3, deserialized into an AWS Glue DynamicFrame partition, and then processed by an Apache Spark task (the gear icon in the figure). Deserialized partition sizes can be significantly larger than the on-disk 64 MB file split size, especially for highly compressed splittable file formats such as Parquet or large files using unsplittable compression formats such as gzip. Typically, a deserialized partition is not cached in memory, and only constructed when needed due to Apache Spark’s lazy evaluation of transformations, thus not causing any memory pressure on AWS Glue workers. For more information on lazy evaluation, see the RDD Programming Guide on the Apache Spark website.

However, explicitly caching a partition in memory or spilling it out to local disk in an AWS Glue ETL script or Apache Spark application can result in out-of-memory (OOM) or out-of-disk exceptions. AWS Glue can support such use cases by using larger AWS Glue worker types with vertically scaled-up DPU instances for AWS Glue ETL jobs.

Vertical scaling for Apache Spark jobs using larger worker types

A variety of AWS Glue ETL jobs, Apache Spark applications, and new machine learning (ML) Glue transformations supported with AWS Lake Formation have high memory and disk requirements. Running these workloads may put significant memory pressure on the execution engine. This memory pressure can result in job failures because of OOM or out-of-disk space exceptions. You may see exceptions from Yarn about memory and disk space.

Exceeding Yarn memory overhead

Apache Yarn is responsible for allocating cluster resources needed to run your Spark application. An application includes a Spark driver and multiple executor JVMs. In addition to the memory allocation required to run a job for each executor, Yarn also allocates an extra overhead memory to accommodate for JVM overhead, interned strings, and other metadata that the JVM needs. The configuration parameter spark.yarn.executor.memoryOverhead defaults to 10% of the total executor memory. Memory-intensive operations such as joining large tables or processing datasets with a skew in the distribution of specific column values may exceed the memory threshold, and result in the following error message:

18/06/13 16:54:29 ERROR YarnClusterScheduler: Lost executor 1 on ip-xxx:
Container killed by YARN for exceeding memory limits. 5.5 GB of 5.5 GB physical memory used.
Consider boosting spark.yarn.executor.memoryOverhead.
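
The "5.5 GB of 5.5 GB" figure in the log above is the executor heap plus the overhead. A minimal sketch of the container-limit arithmetic (the 10% default and 384 MB floor are Spark 2.x behavior; a Glue standard worker sets the overhead to 512 MB, which matches 10% of its 5 GB heap):

```python
def yarn_container_limit_mb(executor_memory_mb,
                            overhead_fraction=0.10,
                            overhead_min_mb=384):
    """Yarn container limit = executor heap + memoryOverhead, where the
    overhead defaults to a fraction of the heap with a fixed floor."""
    overhead = max(int(executor_memory_mb * overhead_fraction), overhead_min_mb)
    return executor_memory_mb + overhead
```

A 5120 MB heap thus yields a 5632 MB (5.5 GB) container limit; exceed it and Yarn kills the container as shown above.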

Disk space

Apache Spark uses local disk on Glue workers to spill data from memory that exceeds the heap space defined by the spark.memory.fraction configuration parameter. During the sort or shuffle stages of a job, Spark writes intermediate data to local disk before it can exchange that data between the different workers. Jobs may fail due to the following exception when no disk space remains:

No space left on device
UnsafeExternalSorter: Thread 20 spilling sort data of 141.0 MB to disk (90 times so far)

AWS Glue job metrics

Most commonly, running out of disk space is the result of a significant skew in the dataset that the job is processing. You can also identify the skew by monitoring the execution timeline of different Apache Spark executors using AWS Glue job metrics. For more information, see Debugging Demanding Stages and Straggler Tasks.

The following AWS Glue job metrics graph shows the execution timeline and memory profile of different executors in an AWS Glue ETL job. One of the executors (the red line) is straggling due to processing of a large partition, and actively consumes memory for the majority of the job’s duration.

With AWS Glue’s Vertical Scaling feature, memory-intensive Apache Spark jobs can use AWS Glue workers with higher memory and larger disk space to help overcome these two common failures. Using AWS Glue job metrics, you can also debug OOM and determine the ideal worker type for your job by inspecting the memory usage of the driver and executors for a running job. For more information, see Debugging OOM Exceptions and Job Abnormalities.

In general, jobs that run memory-intensive operations can benefit from the G.1X worker type, and those that use AWS Glue’s ML transforms or similar ML workloads can benefit from the G.2X worker type.

Apache Spark UI for AWS Glue jobs

You can also use AWS Glue’s support for the Spark UI to inspect and scale your AWS Glue ETL job by visualizing the Directed Acyclic Graph (DAG) of Spark’s execution, monitoring demanding stages and large shuffles, and inspecting Spark SQL query plans. For more information, see Monitoring Jobs Using the Apache Spark Web UI.

The following Spark SQL query plan on the Spark UI shows the DAG for an ETL job that reads two tables from S3, performs an outer-join that results in a Spark shuffle, and writes the result to S3 in Parquet format.

As seen from the plan, the Spark shuffle and subsequent sort operation for the join transformation takes the majority of the job execution time. With AWS Glue vertical scaling, each AWS Glue worker co-locates more Spark tasks, thereby saving on the number of data exchanges over the network.

Scaling to handle large numbers of small files

An AWS Glue ETL job might read thousands or millions of files from S3. This is typical for Kinesis Data Firehose or streaming applications writing data into S3. The Apache Spark driver may run out of memory when attempting to read a large number of files. When this happens, you see the following error message:

# java.lang.OutOfMemoryError: Java heap space
# -XX:OnOutOfMemoryError="kill -9 %p"
# Executing /bin/sh -c "kill -9 12039"...

Apache Spark v2.2 can manage approximately 650,000 files on the standard AWS Glue worker type. To handle more files, AWS Glue provides the option to read input files in larger groups per Spark task for each AWS Glue worker. For more information, see Reading Input Files in Larger Groups.

AWS Glue file grouping reduces this excessive parallelism by avoiding the launch of one Apache Spark task per file, which in turn reduces the chance of an OOM exception on the Spark driver. To configure file grouping, set the groupFiles and groupSize parameters. The following code example uses the AWS Glue DynamicFrame API in an ETL script with these parameters:

dyf = glueContext.create_dynamic_frame_from_options("s3",
    {'paths': ["s3://input-s3-path/"],
    'groupFiles': 'inPartition',
    'groupSize': '1048576'})

You can set groupFiles to group files within a Hive-style S3 partition (inPartition) or across S3 partitions (acrossPartition). In most scenarios, grouping within a partition is sufficient to reduce the number of concurrent Spark tasks and the memory footprint of the Spark driver. In benchmarks, AWS Glue ETL jobs configured with the inPartition grouping option were approximately seven times faster than native Apache Spark v2.2 when processing 320,000 small JSON files distributed across 160 different S3 partitions. A large fraction of the time in Apache Spark is spent building an in-memory index while listing S3 files and scheduling a large number of short-running tasks to process each file. With AWS Glue grouping enabled, the benchmark AWS Glue ETL job could process more than 1 million files using the standard AWS Glue worker type.

groupSize is an optional field that allows you to configure the amount of data each Spark task reads and processes as a single AWS Glue DynamicFrame partition. Users can set groupSize if they know the distribution of file sizes before running the job. The groupSize parameter allows you to control the number of AWS Glue DynamicFrame partitions, which also translates into the number of output files. However, a groupSize that is too small results in excessive task parallelism, while one that is too large results in under-utilization of the cluster.
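As a rough sketch (an approximation for intuition, not the exact algorithm AWS Glue uses), grouping packs small files into chunks of about groupSize bytes, so the resulting partition count is on the order of the total input size divided by groupSize:

```python
import math

def estimate_grouped_partitions(file_sizes_bytes, group_size_bytes):
    """Rough estimate: grouping packs small files into ~groupSize-byte chunks,
    so the partition count is about total_size / groupSize (at least 1)."""
    total = sum(file_sizes_bytes)
    return max(1, math.ceil(total / group_size_bytes))

# 100,000 small files of 64 KB each, grouped into 1 MB (1048576-byte) chunks:
sizes = [64 * 1024] * 100_000
print(estimate_grouped_partitions(sizes, 1048576))  # 6250 partitions instead of 100,000 tasks
```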

By default, AWS Glue automatically enables grouping without any manual configuration when the number of input files or task parallelism exceeds a threshold of 50,000. The default value of the groupFiles parameter is inPartition, so that each Spark task only reads files within the same S3 partition. AWS Glue computes the groupSize parameter automatically and configures it to reduce excessive parallelism while still making use of the cluster’s compute resources with sufficient Spark tasks running in parallel.

Partitioning data and pushdown predicates

Partitioning has emerged as an important technique for organizing datasets so that a variety of big data systems can query them efficiently. A hierarchical directory structure organizes the data, based on the distinct values of one or more columns. For example, you can partition your application logs in S3 by date, broken down by year, month, and day. Files corresponding to a single day’s worth of data receive a prefix such as the following (an illustrative path; your bucket and prefix names will differ):

s3://my_bucket/logs/year=2019/month=01/day=23/

Predicate pushdowns for partition columns

AWS Glue supports pushing down predicates, which define filter criteria for partition columns populated for a table in the AWS Glue Data Catalog. Instead of reading all the data and filtering results at execution time, you can supply a SQL predicate in the form of a WHERE clause on the partition column. For example, assume the table is partitioned by the year column and that you run SELECT * FROM table WHERE year = 2019. Here, year represents the partition column and 2019 represents the filter criteria.

AWS Glue lists and reads only the files from S3 partitions that satisfy the predicate and are necessary for processing.

To accomplish this, specify a predicate using the Spark SQL expression language as an additional parameter to the AWS Glue DynamicFrame getCatalogSource method. This predicate can be any SQL expression or user-defined function that evaluates to a Boolean, as long as it uses only the partition columns for filtering.

This example demonstrates this functionality with a dataset of GitHub events partitioned by year, month, and day. The following code example reads only those S3 partitions related to events that occurred on weekends:


val partitionPredicate = "date_format(to_date(concat(year, '-', month, '-', day)), 'E') in ('Sat', 'Sun')"

val pushdownEvents = glueContext.getCatalogSource(
   database = "githubarchive_month",
   tableName = "data",
   pushDownPredicate = partitionPredicate).getDynamicFrame()

Here you can use the SparkSQL string concat function to construct a date string. The to_date function converts it to a date object, and the date_format function with the ‘E’ pattern converts the date to a three-character day of the week (for example, Mon or Tue). For more information about these functions, Spark SQL expressions, and user-defined functions in general, see the Spark SQL, DataFrames and Datasets Guide and list of functions on the Apache Spark website.
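To illustrate which partitions such a predicate selects, the following plain-Python sketch (not the Glue or Spark API) applies the same weekend test to a list of (year, month, day) partition values:

```python
from datetime import date

def is_weekend(year, month, day):
    """Mirror of the Spark SQL predicate above: keep partitions whose date
    falls on a Saturday or Sunday (weekday() 5 or 6 in Python)."""
    return date(int(year), int(month), int(day)).weekday() >= 5

partitions = [("2019", "01", "04"), ("2019", "01", "05"), ("2019", "01", "06")]
weekend_partitions = [p for p in partitions if is_weekend(*p)]
print(weekend_partitions)  # January 5 and 6, 2019 were a Saturday and Sunday
```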

Pruning AWS Glue Data Catalog partitions provides a significant performance boost for AWS Glue ETL jobs by reducing the time the Spark query engine spends listing files in S3 and reading and processing data at runtime. You can achieve further improvements by excluding additional partitions with more selective predicates.

Partitioning data before and during writes to S3

By default, data is not partitioned when writing out the results from an AWS Glue DynamicFrame—all output files are written at the top level under the specified output path. AWS Glue enables partitioning of DynamicFrame results by passing the partitionKeys option when creating a sink. For example, the following code example writes out the dataset in Parquet format to S3 partitioned by the type column:


glueContext.getSinkWithFormat(
    connectionType = "s3",
    options = JsonOptions(Map("path" -> "$outpath", "partitionKeys" -> Seq("type"))),
    format = "parquet").writeDynamicFrame(projectedEvents)

In this example, $outpath is a placeholder for the base output path in S3. The partitionKeys parameter corresponds to the names of the columns used to partition the output in S3. When you execute the write operation, it removes the type column from the individual records and encodes it in the directory structure. To demonstrate this, you can list the output path using the following aws s3 ls command from the AWS CLI:

PRE type=CommitCommentEvent/
PRE type=CreateEvent/
PRE type=DeleteEvent/
PRE type=ForkEvent/
PRE type=GollumEvent/
PRE type=IssueCommentEvent/
PRE type=IssuesEvent/
PRE type=MemberEvent/
PRE type=PublicEvent/
PRE type=PullRequestEvent/
PRE type=PullRequestReviewCommentEvent/
PRE type=PushEvent/
PRE type=ReleaseEvent/
PRE type=WatchEvent/

For more information, see aws s3 ls in the AWS CLI Command Reference.
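The write operation’s removal of the type column from each record and its encoding into the directory structure can be sketched as follows. This is a simplified model of the behavior, not the actual Glue writer, and the record fields are made up:

```python
def hive_partition_path(record, base_path, partition_keys):
    """Split a record into the S3 prefix built from its partition key values
    and the remaining payload that is written inside that prefix."""
    payload = dict(record)
    prefix = "/".join(f"{key}={payload.pop(key)}" for key in partition_keys)
    return f"{base_path}/{prefix}/", payload

path, payload = hive_partition_path(
    {"type": "PushEvent", "actor": "octocat", "repo": "hello-world"},
    "s3://output-s3-path",
    ["type"],
)
print(path)     # s3://output-s3-path/type=PushEvent/
print(payload)  # {'actor': 'octocat', 'repo': 'hello-world'}
```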

In general, you should select columns for partitionKeys that are of lower cardinality and are most commonly used to filter or group query results. For example, when analyzing AWS CloudTrail logs, it is common to look for events that happened between a range of dates. Therefore, partitioning the CloudTrail data by year, month, and day would improve query performance and reduce the amount of data that you need to scan to return the answer.

The benefit of output partitioning is two-fold. First, it improves execution time for end-user queries. Second, having an appropriate partitioning scheme helps avoid costly Spark shuffle operations in downstream AWS Glue ETL jobs when combining multiple jobs into a data pipeline. For more information, see Working with partitioned data in AWS Glue.

S3 or Hive-style partitions are different from Spark RDD or DynamicFrame partitions. Spark partitioning is related to how Spark or AWS Glue breaks up a large dataset into smaller and more manageable chunks to read and apply transformations in parallel. AWS Glue workers manage this type of partitioning in memory. You can control Spark partitions further by using the repartition or coalesce functions on DynamicFrames at any point during a job’s execution and before data is written to S3. You can set the number of partitions using the repartition function either by explicitly specifying the total number of partitions or by selecting the columns to partition the data.

Repartitioning a dataset by using the repartition or coalesce functions often results in AWS Glue workers exchanging (shuffling) data, which can impact job runtime and increase memory pressure. In contrast, writing data to S3 with Hive-style partitioning does not require any data shuffle and only sorts it locally on each of the worker nodes. The number of output files in S3 without Hive-style partitioning roughly corresponds to the number of Spark partitions. In contrast, the number of output files in S3 with Hive-style partitioning can vary based on the distribution of partition keys on each AWS Glue worker.
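Conceptually, repartitioning by a column assigns each record to a partition by hashing the column value, which is why it forces a data exchange between workers. A minimal sketch under that assumption:

```python
from collections import defaultdict

def hash_repartition(records, key, num_partitions):
    """Assign each record to one of num_partitions buckets by hashing its
    key column -- a Spark-style shuffle in miniature."""
    buckets = defaultdict(list)
    for rec in records:
        buckets[hash(rec[key]) % num_partitions].append(rec)
    return buckets

records = [{"type": "PushEvent"}, {"type": "ForkEvent"}, {"type": "PushEvent"}]
buckets = hash_repartition(records, "type", 4)
# Records with the same key always land in the same partition, so downstream
# per-key operations (joins, aggregations) can run without further shuffles.
```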

Conclusion

This post showed how to scale your ETL jobs and Apache Spark applications on AWS Glue for both compute and memory-intensive jobs. AWS Glue enables faster job execution times and efficient memory management by using the parallelism of the dataset and different types of AWS Glue workers. It also helps you overcome the challenges of processing many small files by automatically adjusting the parallelism of the workload and cluster. AWS Glue ETL jobs use the AWS Glue Data Catalog and enable seamless partition pruning using predicate pushdowns. It also allows for efficient partitioning of datasets in S3 for faster queries by downstream Apache Spark applications and other analytics engines such as Amazon Athena and Amazon Redshift. We hope you try out these best practices for your Apache Spark applications on AWS Glue.

The second post in this series will show how to use AWS Glue features to batch process large historical datasets and incrementally process deltas in S3 data lakes. It also demonstrates how to use a custom AWS Glue Parquet writer for faster job execution.


About the Author

Mohit Saxena is a technical lead at AWS Glue. His passion is building scalable distributed systems for efficiently managing data in the cloud. He also enjoys watching movies and reading about the latest technology.



