Forwood Safety uses Amazon QuickSight Q to extend life-saving safety analytics to larger audiences

Post Syndicated from Faye Crompton original

This is a guest post by Faye Crompton from Forwood Safety. Forwood provides fatality prevention solutions to organizations across the globe.

At Forwood Safety, we have a laser focus on saving lives. Our solutions, which provide full content and proven methodology via verification tools and analytical capabilities, have one purpose: eliminating fatalities in the workplace. We recently realized an ambition to provide interactive, dynamic data visualization tools that enable our end users to access safety data in the field, regardless of their experience with analytics and data reporting.

In this post, I’ll talk about how Amazon QuickSight Q solved these challenges by giving users fast data insights through natural language querying capabilities.

Driving data insights with QuickSight

Forwood’s Critical Risk Management (CRM) solution provides organizations with globally benchmarked and comprehensive critical control checklists and verification controls that are proven to prevent fatalities in the workplace. CRM protects frontline workers from serious harm by helping change the culture of risk management for companies. In addition, our Forwood Analytical Self-Service Tool (FAST) enables our customers to use self-service reporting to get updated dashboards that display key safety and fatality prevention metrics.

For several years, we used Amazon QuickSight to provide data visualization for our CRM and FAST reporting products, with great success. Most of our technology stack was already based on AWS, so we knew QuickSight would be easy to integrate. QuickSight is agnostic in terms of data sources and types, and it’s a very flexible tool. It’s also an open data technology, so it can accept most of the data sources that we throw at it. Most importantly, it ties in seamlessly with our own architecture and data pipelines in a way that our previous BI tools couldn’t. After we implemented QuickSight, we started using it to power both CRM and FAST, visualizing risk data and serving it back to our customers.

Using QuickSight Q to help site supervisors get answers quickly

Furthering our focus on innovation and usability, we identified a common challenge that we believed QuickSight could solve through our FAST application on behalf of our clients: we needed to make risk data more accessible to those of our clients who aren’t data analysts. We also have mining industry customers who don’t frequently access our applications via desktop. For example, mining site supervisors and operators working deep underground typically have access only via their mobile devices. For these users, it’s easier to ask the questions relevant to their specific use cases at the point of use than to filter and search through a dashboard to find the answers ahead of time.

QuickSight Q was the perfect solution to this challenge. QuickSight Q is a feature within QuickSight that uses machine learning to understand questions and how they relate to business data. The feature provides data insights and visualizations within seconds. With this capability, users can simply type in questions in natural language to access data insights about risk and compliance. Mining site workers, for example, can ask if the site is safe or if the right verification processes are in place. Health and safety teams and mining site supervisors can ask questions such as “Which sites should I verify today?” or “Which risk will be highest next week?” and receive a chart with the relevant data.

Making data more accessible to everyone

QuickSight Q gives our on-site customers near-real-time risk and compliance data from their mobile devices in a way they couldn’t access before. With QuickSight Q, we can give our FAST users the opportunity to quickly visualize any fatality risks at their sites based on updated fatality prevention data. All users, not just analysts, can identify worksites that have a higher fatality risk because the data can show trends in non-compliance with safety standards. Our clients no longer have to hunt through a dashboard for the answers to their questions; they can go beyond the dashboard and ask deeper ones.

QuickSight Q solved one of our main BI challenges: how to make risk data more accessible to more people without extensive user training and technical understanding. Soon, we hope to use QuickSight Q as part of a multidimensional predictive dataset using deep learning models to deliver even more insights to our customers.

We look forward to extending our use of QuickSight. When we first started using it, it was strictly for analytics on our existing data. More recently, we started using API deployments for QuickSight. We have many different clients, and we use the API feature to maintain master versions of all 30+ standard reports, and then deploy those dashboards to as many clients as we need to via code. Previously, we saw QuickSight as a function of our analytics products; now we see it as a powerful and flexible toolkit of analytics features that our developers can build with.

Additionally, we look forward to relying on QuickSight Q to bring life-saving safety analytics to more people. QuickSight Q bridges the gap between the data a company has and the decisions that company needs to make, and that’s very powerful for our clients. Forwood Safety is driven to eradicate workplace fatalities, and by getting data to more people and making it easy to access, we can make our solutions more effective, saving more lives.

About the author

Faye Crompton is Head of Analytics, Safety Applications and Computer Vision at Forwood Safety. She leads work on analytics and safety products that reduce fatality risk in mining and other high-risk industries.

How to prepare your application to scale reliably with Amazon EC2

Post Syndicated from Sheila Busser original

This blog post is written by Gabriele Postorino, Senior Technical Account Manager, and Giorgio Bonfiglio, Principal Technical Account Manager.

In this post, we’ll discuss how you can prepare for planned and unplanned scaling events with Amazon EC2.

Most of the challenges related to horizontal scaling can be mitigated by optimizing the architectural implementation and applying improvements in operational processes.

In the following sections, we’ll explore this in depth. Recommendations can be applied partially or fully – they come with different complexities, and each one will help you reduce the risk of facing insufficient capacity errors or scaling delays, as well as deliver enhancements in areas such as fault tolerance, elasticity, and cost optimization.

Architectural best practices

Instance capacity can be regarded as being divided into “pools” defined by AZ (such as us-east-1a), instance type (for example m5.xlarge), and tenancy. Combining the following two guidelines will widen the capacity pools available to scale out your fleets of instances. This will help you reduce costs, transparently recover from failures, and increase your application scalability.
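The pool model can be made concrete with a quick sketch (the AZ names and instance types below are only illustrative):

```python
from itertools import product

def capacity_pools(azs, instance_types, tenancies=("default",)):
    """Enumerate the distinct capacity pools a fleet can draw from:
    one pool per (AZ, instance type, tenancy) combination."""
    return list(product(azs, instance_types, tenancies))

# Restricting a fleet to one AZ and one instance type leaves a single pool...
narrow = capacity_pools(["us-east-1a"], ["m5.xlarge"])

# ...while three AZs and three instance types yield nine pools to scale from.
wide = capacity_pools(
    ["us-east-1a", "us-east-1b", "us-east-1c"],
    ["m5.xlarge", "m5a.xlarge", "m6i.xlarge"],
)
```

Every additional AZ or instance type multiplies the number of pools, which is why the two guidelines below compound each other.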

Instance flexibility

Whether you’re migrating a new workload to the cloud, or tuning an existing workload, you’ll likely evaluate which compute configuration options are available and determine the right configuration for your application.

If your workload is already running on EC2 instances, you might already be aware of the instance type that it runs best on. Let’s say that your application is RAM intensive, and you found that r6i.4xlarge instances are best suited for it.

However, relying on a single instance type might result in artificially limiting your ability to scale compute resources for your workload when needed. It’s always a good idea to explore how your workload behaves when running on other instance types: you might find that your application can serve double the number of requests served by one r6i.4xlarge instance when using one r6i.8xlarge instance or four r6i.2xlarge instances.

Furthermore, there’s no reason to limit your options to a single instance family, generation, or processor type. For example, m6a.8xlarge instances offer the same amount of RAM as r6i.4xlarge instances and might be used to run your application if needed.

Amazon EC2 Auto Scaling helps you make sure that you have the right number of EC2 instances available to handle the load for your application.

Auto Scaling groups can be configured to respond to scaling events by selecting the type of instance to launch among a list of instance types. You can statically populate the list in advance, as in the following screenshot,

The Instance type requirements section of the Auto Scaling Wizard instance launch options step is shown with the option “Manually add instance types” selected.

or dynamically define it by a set of instance attributes as shown in the subsequent screenshot:

The Instance type requirements section of the Auto Scaling Wizard instance launch options step is shown with the option “Specify instance attribute” selected.

For example, by setting the requirements to a minimum of 8 vCPUs, 64 GiB of memory, and a RAM/vCPU ratio of 8 (just like r6i.2xlarge instances), up to 73 instance types can be included in the list of suitable instances. They will be selected for launch starting from the lowest-priced instance type. If the request can’t be fulfilled entirely by the lowest-priced instance type, then additional instances will be launched from the second-lowest-priced instance type pool, and so on.
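A lowest-price allocation over several pools works roughly like the following sketch (the prices and pool depths here are made up for illustration, not live pricing):

```python
def allocate_lowest_price(request, pools):
    """Fill `request` instances starting from the cheapest pool and
    spilling into the next-cheapest pool when one runs dry.

    `pools` maps instance type -> (hourly_price, available_count).
    Returns {instance_type: launched_count}.
    """
    launched = {}
    remaining = request
    for itype, (price, available) in sorted(pools.items(),
                                            key=lambda kv: kv[1][0]):
        if remaining == 0:
            break
        take = min(remaining, available)
        if take:
            launched[itype] = take
            remaining -= take
    return launched

# Hypothetical pools: the cheapest type can supply only 6 of the 10 requested,
# so the remainder comes from the second-lowest-priced pool.
pools = {
    "r6i.2xlarge": (0.504, 6),
    "r5.2xlarge": (0.529, 20),
    "m6a.4xlarge": (0.691, 20),
}
print(allocate_lowest_price(10, pools))
```

The real service also weighs interruption rates and allocation strategies; this only shows the spill-over behavior described above.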

Instance distribution

Each AWS Region consists of multiple, isolated Availability Zones (AZs), interconnected with high-bandwidth, low-latency networking. Spreading a workload across AZs is a well-established resiliency best practice. It makes sure that your end users aren’t impacted by a single AZ, data center, or rack failure, as each AZ has its own distinct instance capacity pools that you can leverage to scale your application fleets.

EC2 Auto Scaling can manage the optimal distribution of EC2 instances in a group across all AZs in a Region automatically, as well as deal with temporary failures transparently. To do so, it must be configured to use at least one subnet in each AZ. Then, it will attempt to distribute instances evenly across AZs and automatically cycle through AZs in case of temporary launch failures.
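The balancing behavior can be approximated in a few lines; the AZ names are placeholders and the real service logic is considerably more involved:

```python
def pick_az(counts, failed=()):
    """Choose the AZ with the fewest running instances, skipping AZs
    that recently failed a launch (mimicking how Auto Scaling cycles
    through AZs on temporary launch failures)."""
    candidates = {az: n for az, n in counts.items() if az not in failed}
    if not candidates:
        raise RuntimeError("no AZ currently able to launch")
    return min(candidates, key=candidates.get)

counts = {"us-east-1a": 4, "us-east-1b": 2, "us-east-1c": 4}
print(pick_az(counts))                            # least-loaded AZ: us-east-1b
print(pick_az(counts, failed={"us-east-1b"}))     # falls back to another AZ
```

Each new launch goes to the least-loaded healthy AZ, which keeps the group evenly spread over time.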

Diagram showing a VPC with subnets in 2 Availability Zones and an Autoscaling group managing groups of instances of different types

Operational best practices

The way that your workload is operated also impacts your ability to scale it when needed. Failure management and appropriate scaling techniques will help you maximize the availability of your environment.

Failure management

On-Demand capacity isn’t guaranteed to always be available. There might be short windows of time when AWS doesn’t have enough available On-Demand capacity to fulfill your specific request: as the availability of On-Demand capacity changes frequently, it’s important that your launch processes implement retry mechanisms.

Retries and fallbacks are managed automatically by EC2 Auto Scaling. But if you have a custom workflow to launch instances, it should be able to work with server error codes, in particular InsufficientInstanceCapacity or InternalError, by retrying the launch request. For a complete list of error codes for the EC2 API, please refer to our documentation.
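A custom launch workflow can wrap its RunInstances call in a retry loop along these lines (the backoff values are assumptions to tune, and the error-code plumbing will depend on your SDK):

```python
import time

# EC2 server error codes worth retrying, per the discussion above.
RETRIABLE = {"InsufficientInstanceCapacity", "InternalError"}

def launch_with_retries(launch, max_attempts=5, base_delay=1.0):
    """Call `launch()` and retry retriable EC2 server errors with
    exponential backoff. `launch` should raise an exception carrying
    the EC2 API error code in a `code` attribute."""
    for attempt in range(max_attempts):
        try:
            return launch()
        except Exception as err:
            code = getattr(err, "code", None)
            if code not in RETRIABLE or attempt == max_attempts - 1:
                raise                      # non-retriable, or out of attempts
            time.sleep(base_delay * 2 ** attempt)
```

With boto3, for example, the code would come from `err.response["Error"]["Code"]`; the loop structure stays the same.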

EC2 offers another option: EC2 Fleet, a feature that helps implement instance flexibility best practices. Instead of calling RunInstances with one instance type and retrying, EC2 Fleet in Instant mode considers all provided instance types, using either a list of instance types or attribute-based instance selection, and provisions capacity from the pools configured in the EC2 Fleet call where capacity is available.

Scaling technique

Launching EC2 instances as soon as you have an initial indication of increased load, in smaller batches and over a longer time span, helps increase your application performance and reliability while reducing costs and minimizing disruptions.

In the graph above, two different scaling techniques that follow the increase in load are depicted. Scaling approach #1 adds a large number of instances less frequently, while approach #2 launches a smaller batch of instances more frequently. Adopting the first approach risks your application not being able to sustain the increase in load in a timely manner, potentially impacting end users and leaving the operations team little time to respond.

Effective capacity planning

On-Demand Instances are best suited for applications with irregular, uninterruptible workloads. Interruptible workloads can use Spot Instances, which draw from spare EC2 capacity. Spot Instances cost less than On-Demand Instances but can be interrupted with a two-minute warning.

If your workload has a stable baseline utilization that hardly changes over time, then you can reserve capacity for your baseline usage of EC2 instances using open On-Demand Capacity Reservations and cover them with Savings Plans to get discounted rates with a one-year or three-year commitment, with the latter offering the bigger discounts.

Open On-Demand Capacity Reservations and Savings Plans aren’t tightly bound to the specific EC2 instances that they cover at a certain point in time. Rather, they shift to other usage matching all of the parameters of the respective On-Demand Capacity Reservation or Savings Plan (e.g., instance type, operating system, AZ, tenancy) in your account, or across accounts for which you have sharing enabled. This lets you stay dynamic even with your stable baseline. For example, during a rolling update or a blue/green deployment, On-Demand Capacity Reservations and Savings Plans will automatically cover any instances that match the respective criteria.

ODCR Fleets

There are times when you can’t apply all of the recommended mitigating actions in anticipation of a planned event. In those cases, you might want to use On-Demand Capacity Reservation Fleets to reserve capacity in advance for additional peace of mind. Capacity reservation fleets let you define capacity requests across multiple instance types, up to a target capacity that you specify. They can be created and managed using the AWS Command Line Interface (AWS CLI) and the AWS APIs.

Key concepts of Capacity Reservation Fleets are the total target capacity and the instance type weight. The instance type weight expresses the number of capacity units that each instance of a specific instance type counts toward the total target capacity.

Let’s say your workload is memory-bound, you expect to need 1.6 TiB of RAM, and you want to use r6i instances. You can create a Capacity Reservation Fleet for r6i instances, defining weights for each instance type in the family based on the relative amount of memory it has, in an instance type specification JSON file.

[
    {
        "InstanceType": "r6i.2xlarge",
        "Weight": 1,
        "EbsOptimized": true,
        "Priority": 3
    },
    {
        "InstanceType": "r6i.4xlarge",
        "Weight": 2,
        "EbsOptimized": true,
        "Priority": 2
    },
    {
        "InstanceType": "r6i.8xlarge",
        "Weight": 4,
        "EbsOptimized": true,
        "Priority": 1
    }
]

Then, you can use this specification to create a Capacity Reservation Fleet that takes care of the underlying Capacity Reservations needed to fulfill your request:

$ aws ec2 create-capacity-reservation-fleet \
--total-target-capacity 25 \
--allocation-strategy prioritized \
--instance-match-criteria open \
--tenancy default \
--end-date 2022-05-31T00:00:00.000Z \
--instance-type-specifications file://instanceTypeSpecification.json

In this example, I set the target capacity to 25, which is the number of r6i.2xlarge instances needed to get 1.6 TiB of total memory across the fleet. As you might have noticed, Capacity Reservation Fleets can be created with an end date. When the end date is reached, they automatically cancel themselves and the Capacity Reservations that they created, so that you don’t have to.
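The arithmetic behind the weights and the target capacity of 25 can be checked in a few lines (the memory figures are the published sizes for these r6i instance types):

```python
from math import ceil

# Published memory sizes (GiB) for the r6i types used in the fleet.
memory_gib = {"r6i.2xlarge": 64, "r6i.4xlarge": 128, "r6i.8xlarge": 256}
base = memory_gib["r6i.2xlarge"]          # one capacity unit = 64 GiB

# Weight of each type = its memory relative to the base unit,
# matching the 1 / 2 / 4 weights in the specification file.
weights = {t: m // base for t, m in memory_gib.items()}

# ~1.6 TiB (1600 GiB) of RAM expressed in 64 GiB capacity units.
total_target_capacity = ceil(1600 / base)
print(weights, total_target_capacity)
```

Whichever mix of types the fleet reserves, the weighted units always add up to the same total memory.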

AWS Infrastructure Event Management

Last but not least, our teams can offer the AWS Infrastructure Event Management (IEM) program. Part of select AWS Support offerings, the IEM program is designed to help you plan and execute events that impact your infrastructure on AWS. By requesting an IEM engagement, you will be supported by AWS experts during all of the phases of your event.

Flow chart showing the steps an IEM usually consists of: 1. Event is planned 2. IEM is initiated 6-8 weeks in advance of the event 3. Infrastructure readiness is assessed and mitigations are applied 4. The event 5. Post-event review

Starting from your business outcomes and success criteria, we’ll assess your infrastructure readiness for the event, evaluate risks, and recommend specific actions to mitigate them. The AWS experts will focus on your application architecture as a whole and dive deep into each of its components with your respective teams. They might also engage with other AWS teams to notify them of the upcoming event, and get specific prescriptive guidance when needed. During the event, AWS experts will have the context needed to help you resolve any issue that might arise as quickly as possible. The program is included in the Enterprise and Enterprise On-Ramp Support plans and is available to Business Support customers for an additional fee.


Whether you’re planning for a big future event, or you want to make sure that your application can withstand unexpected increases in traffic, it’s important that you consider what we discussed in this article:

  • Use as many instance types as you can; don’t limit your workload to a single instance type when it could run on many more
  • Distribute your EC2 instances across all AZs in the Region
  • Expect failures: manage retries and fallback options
  • Make use of EC2 Auto Scaling and EC2 Fleet whenever possible
  • Avoid scaling spikes: start scaling earlier, in smaller chunks, and more frequently
  • Reserve capacity only when you really need to

For further study, we recommend the Well-Architected Framework Reliability and Operational Excellence pillars as starting points. Moreover, if you have an event coming up, talk to your Technical Account Manager, your Account Team, or contact us to find out how we can help!

[$] An io_uring-based user-space block driver

Post Syndicated from corbet original

The addition of the ublk driver during the 6.0 merge window would have been
easy to miss; it was buried deeply within an io_uring pull request and is
entirely devoid of any sort of documentation that might indicate why it
merits a closer look. Ublk is intended to facilitate the implementation of
high-performance block drivers in user space; to that end, it uses io_uring
for its communication with the kernel. This driver is considered
experimental for now; if it is successful, it might just be a harbinger of
more significant changes to come to the kernel in the future.

Coordinating large messages across accounts and Regions with Amazon SNS and SQS

Post Syndicated from Mrudhula Balasubramanyan original

Many organizations have applications distributed across various business units. Teams in these business units may develop their applications independent of each other to serve their individual business needs. Applications can reside in a single Amazon Web Services (AWS) account or be distributed across multiple accounts. Applications may be deployed to a single AWS Region or span multiple Regions.

Irrespective of how the applications are owned and operated, these applications need to communicate with each other. Within an organization, applications tend to be part of a larger system; therefore, communication and coordination among these individual applications is critical to overall operation.

There are a number of ways to enable coordination among component applications. It can be done either synchronously or asynchronously:

  • Synchronous communication uses a traditional request-response model, in which the applications exchange information in a tightly coupled fashion, introducing multiple points of potential failure.
  • Asynchronous communication uses an event-driven model, in which the applications exchange messages as events or state changes and are loosely coupled. Loose coupling allows applications to evolve independently of each other, increasing scalability and fault-tolerance in the overall system.

Event-driven architectures use a publisher-subscriber model, in which events are emitted by the publisher and consumed by one or more subscribers.

A key consideration when implementing an event-driven architecture is the size of the messages or events that are exchanged. How can you implement an event-driven architecture for messages larger than the default maximum message size of these services? How can you architect messaging and automation of applications across AWS accounts and Regions?

This blog presents architectures for enhancing event-driven models to exchange large messages. These architectures depict how to coordinate applications across AWS accounts and Regions.

Challenge with application coordination

A challenge with application coordination is exchanging large messages. For the purposes of this post, a large message is defined as an event payload between 256 KB and 2 GB. This stems from the fact that Amazon Simple Notification Service (Amazon SNS) and Amazon Simple Queue Service (Amazon SQS) currently have a maximum event payload size of 256 KB. To exchange messages larger than 256 KB, an intermediate data store must be used.

To exchange messages across AWS accounts and Regions, set up the publisher access policy to allow subscriber applications in other accounts and Regions. In the case of large messages, also set up a central data repository and provide access to subscribers.

Figure 1 depicts a basic schematic of applications distributed across accounts communicating asynchronously as part of a larger enterprise application.

Asynchronous communication across applications

Figure 1. Asynchronous communication across applications

Architecture overview

The overview covers two scenarios:

  1. Coordination of applications distributed across AWS accounts and deployed in the same Region
  2. Coordination of applications distributed across AWS accounts and deployed to different Regions

Coordination across accounts and single AWS Region

Figure 2 represents an event-driven architecture, in which applications are distributed across AWS Accounts A, B, and C. The applications are all deployed to the same AWS Region, us-east-1. A single Region simplifies the architecture, so you can focus on application coordination across AWS accounts.

Application coordination across accounts and single AWS Region

Figure 2. Application coordination across accounts and single AWS Region

The application in Account A (Application A) is implemented as an AWS Lambda function. This application communicates with the applications in Accounts B and C. The application in Account B is launched with AWS Step Functions (Application B), and the application in Account C runs on Amazon Elastic Container Service (Application C).

In this scenario, Applications B and C need information from upstream Application A. Application A publishes this information as an event, and Applications B and C subscribe to an SNS topic to receive the events. However, since they are in other accounts, you must define an access policy to control who can access the SNS topic. You can use sample Amazon SNS access policies to craft your own.

If the event payload is in the 256 KB to 2 GB range, you can use Amazon Simple Storage Service (Amazon S3) as the intermediate data store for your payload. Application A uses the Amazon SNS Extended Client Library for Java to upload the payload to an S3 bucket and publish a message to an SNS topic, with a reference to the stored S3 object. The message containing the metadata must be within the SNS maximum message limit of 256 KB. Amazon EventBridge is used for routing events and handling authentication.

The subscriber Applications B and C need to de-reference and retrieve the payloads from Amazon S3. The SQS queue in Account B and Lambda function in Account C subscribe to the SNS topic in Account A. In Account B, a Lambda function is used to poll the SQS queue and read the message with the metadata. The Lambda function uses the Amazon SQS Extended Client Library for Java to retrieve the S3 object referenced in the message.

The Lambda function in Account C uses the Payload Offloading Java Common Library for AWS to get the referenced S3 object.

Once the S3 object is retrieved, the Lambda functions in Accounts B and C process the data and pass on the information to downstream applications.
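Stripped of the AWS clients, the offloading pattern that the extended libraries implement looks like the following sketch. The dict stands in for the intermediate S3 bucket, and the 256 KB threshold matches the SNS/SQS limit; the field names are invented for illustration:

```python
import uuid

LIMIT = 256 * 1024   # SNS/SQS maximum payload size in bytes
bucket = {}          # stand-in for the intermediate S3 bucket

def publish(payload: bytes):
    """Return the message to send: the payload itself if it fits,
    otherwise a small reference to the offloaded object."""
    if len(payload) <= LIMIT:
        return {"inline": payload}
    key = str(uuid.uuid4())
    bucket[key] = payload                       # extended client: S3 put
    return {"s3_ref": key, "size": len(payload)}

def receive(message):
    """De-reference the message: fetch the stored object if offloaded."""
    if "s3_ref" in message:
        return bucket[message["s3_ref"]]        # extended client: S3 get
    return message["inline"]
```

Small messages travel through SNS/SQS unchanged; anything over the limit is replaced by a reference that always fits within the 256 KB envelope.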

This architecture uses Amazon SQS and Lambda as subscribers because they provide libraries that support offloading large payloads to Amazon S3. However, you can use any Java-enabled endpoint, such as an HTTPS endpoint that uses Payload Offloading Java Common Library for AWS to de-reference the message content.

Coordination across accounts and multiple AWS Regions

Sometimes applications are spread across AWS Regions, leading to increased latency in coordination. For existing applications, it could take substantial effort to consolidate into a single Region. Hence, asynchronous coordination is a good fit for this scenario. Figure 3 expands on the architecture presented earlier to include multiple AWS Regions.

Application coordination across accounts and multiple AWS Regions

Figure 3. Application coordination across accounts and multiple AWS Regions

The Lambda function in Account C is in the same Region as the upstream application in Account A, but the Lambda function in Account B is in a different Region. These functions must retrieve the payload from the S3 bucket in Account A.

To provide access, configure the AWS Lambda execution role with the appropriate permissions. Make sure that the S3 bucket policy allows access to the Lambda functions from Accounts B and C.


For variable message sizes, you can specify that payloads always be stored in Amazon S3 regardless of their size, which can help simplify the design.

If the application that publishes or subscribes to large messages is implemented with the AWS SDK for Java, it must use Java 8 or higher. Service-specific client libraries are also available in Python, C#, and Node.js.

An Amazon S3 Multi-Region Access Point can be an alternative to a centralized bucket for the payloads. It is not explored in this post due to the asynchronous nature of cross-Region replication.

In general, retrieval of data across Regions is slower than in the same Region. For faster retrieval, workloads should be run in the same AWS Region.


This post demonstrates how to use event-driven architectures for coordinating applications that need to exchange large messages across AWS accounts and Regions. The messaging and automation are enabled by the Payload Offloading Java Common Library for AWS and use Amazon S3 as the intermediate data store. These components can simplify the solution implementation and improve scalability, fault-tolerance, and performance of your applications.

Ready to get started? Explore SQS Large Message Handling.

Security updates for Monday

Post Syndicated from jake original

Security updates have been issued by Debian (chromium, libtirpc, and xorg-server), Fedora (giflib, mingw-giflib, and teeworlds), Mageia (chromium-browser-stable, kernel, kernel-linus, mingw-giflib, osmo, python-m2crypto, and sqlite3), Oracle (httpd, php, vim, virt:ol and virt-devel:ol, and xorg-x11-server), SUSE (caddy, crash, dpkg, fwupd, python-M2Crypto, and trivy), and Ubuntu (gdk-pixbuf, libjpeg-turbo, and phpliteadmin).

No Damsels in Distress: How Media and Entertainment Companies Can Secure Data and Content

Post Syndicated from Ryan Blanchard original

No Damsels in Distress: How Media and Entertainment Companies Can Secure Data and Content

Streaming is king in the media and entertainment industry. According to the Motion Picture Association’s Theatrical and Home Entertainment Market Environment Report, the global number of streaming subscribers grew to 1.3 billion in 2021. Consumer demand for immediate digital delivery is skyrocketing. Producing high-quality content at scale is a challenge media companies must step up to on a daily basis. One thing is for sure: Meeting these expectations would be unmanageable left to human hands alone.

Fortunately, cloud adoption has enabled entertainment companies to meet mounting customer and business needs more efficiently. With the high-speed workflow and delivery processes that the cloud enables, distributing direct-to-consumer is now the industry standard.

As media and entertainment companies grow their cloud footprints, they’re also opening themselves up to vulnerabilities threat actors can exploit — and the potential consequences can be financially devastating.

Balancing cloud security with production speed

In 2021, a Twitch data breach showed the impact cyberattacks can have on intellectual property at media and entertainment companies. Attackers stole 128 gigabytes of data from the popular streaming site and posted the collection on 4chan. The released torrent file contained:

  • The history of Twitch’s source code
  • Programs Twitch used to test its own vulnerabilities
  • Proprietary software development kits
  • An unreleased online games store intended to compete with Steam
  • Amazon Game Studios’ next title

Ouch. In mere moments, the attackers stole a ton of sensitive IP and a key security strategy. How did attackers manage this? By exploiting a single misconfigured server.

Before you think, “Well, that couldn’t happen to us,” consider that cloud misconfigurations are the most common source of data breaches.

Yet, media and entertainment businesses can’t afford to slow down their adoption and usage of public cloud infrastructure if they hope to remain relevant. Consumers demand timely content, whether it’s the latest midnight album drop from Taylor Swift or breaking news on the war in Ukraine.

Media and entertainment organizations must mature their cloud security postures alongside their content delivery and production processes to maintain momentum while protecting their most valuable resources: intellectual property, content, and customer data.

We’ve outlined three key cloud security strategies media and entertainment companies can follow to secure their data in the cloud.

1. Expand and consolidate visibility

You can’t protect what you can’t see. There are myriad production, technical, and creative teams working on a host of projects at a media and entertainment company – and they all interact with cloud and container environments throughout their workflow. This opens the door for potential misconfigurations (and then breaches) if these environments aren’t carefully tracked, secured, or even known about.

Here are some key considerations to make:

  • Do you know exactly what platforms are being used across your organization?
  • Do you know how they are being used and whether they are secure?

Most enterprises lack visibility into all the cloud and container environments their teams use throughout each step of their digital supply chain. Implementing a system to continuously monitor all cloud and container services gives you better insight into associated risks. Putting these processes into place will enable you to tightly monitor – and therefore protect – your growing cloud footprint.

How to get started: Improve visibility by introducing a plan for cloud workload protection.

2. Shift left to prevent risk earlier

Cloud, container, and other infrastructure misconfigurations are a major area of concern for most security teams. More than 30% of the data breaches studied in our 2022 Cloud Misconfigurations Report were caused by relaxed security settings and misconfigurations in the cloud. These misconfigurations are alarmingly common across industries and can cause critical exposures, as evidenced in the following example:

In 2021, a server misconfiguration at a UK-based media company exposed access credentials to a production-level database, along with the IP addresses of development endpoints. This meant that anyone with those leaked credentials or addresses could easily access a mountain of proprietary data from the Comcast subsidiary.

One way to avoid these types of breaches is to prevent misconfigurations in your Infrastructure as Code (IaC) templates. Scanning IaC templates, such as Terraform, reduces the likelihood of cloud misconfigurations by ensuring that any templates that are built and deployed are already vetted against the same security and compliance checks as your production cloud infrastructure and services.
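As a toy illustration of what IaC scanning does (the resource shapes and rule names below are invented, not any real scanner's schema), a scanner evaluates each parsed resource against a set of policy checks before anything is deployed:

```python
# Sketch of the kind of checks an IaC scanner applies to a parsed template.
# Rule names and resource properties are illustrative.

RULES = {
    "no_public_acl": lambda r: r.get("acl") != "public-read",
    "encryption_on": lambda r: r.get("encrypted", False),
}

def scan(resources):
    """Return (resource_name, failed_rule) pairs for every violation."""
    findings = []
    for name, props in resources.items():
        for rule, check in RULES.items():
            if not check(props):
                findings.append((name, rule))
    return findings

template = {
    "assets_bucket": {"acl": "public-read", "encrypted": True},
    "logs_bucket": {"acl": "private", "encrypted": False},
}
print(scan(template))  # violations to fix before deployment
```

Running checks like these in the CI/CD pipeline is what lets a misconfigured template be rejected once, before it ever becomes a running (and exploitable) resource.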

By leveraging IaC scanning that provides fast, context-rich results to resource owners, media and entertainment organizations can build a stronger security foundation while reducing friction across DevOps and security teams and cutting down on the number of 11th-hour fixes. Solving problems in the CI/CD pipeline improves efficiency by correcting issues once rather than fixing them over and over again at runtime.

How to get started: Learn about the first step of shifting left with Infrastructure as Code in the CI/CD pipeline.

3. Create a culture of security

As the saying goes, a chain is only as strong as its weakest link. Cloud security practices won’t be as effective or efficient if an organization’s workforce doesn’t understand and value secure processes. A culture of collaboration between DevOps and security is a good start, but the entire organization must understand and uphold security best practices.

Fostering a culture that prioritizes the protection of digital content empowers all parts of (and people in) your supply chain to work with secure practices front-of-mind.

What’s the tell-tale sign that you’ve created a culture of security? When all employees, no matter their department or role, see it as simply another part of their job. This is obviously not to say that you need to turn all employees, or even developers, into security experts, but they should understand how security influences their role and the negative consequences to the business if security recommendations are avoided or ignored.

How to get started: Share this curated set of resources on cloud security for media and entertainment companies with your team.

Achieving continuous content security

Media and entertainment companies can’t afford to slow down if they hope to meet consumer demands. They can’t afford to neglect security, either, if they want to maintain consumer trust.

Remember, the ultimate offense is a strong defense. Building security into your cloud infrastructure processes from the beginning dramatically decreases the odds that an attacker will find a chink in your armor. Moreover, identifying and remediating security issues sooner plays a critical role in protecting consumer data and your intellectual property and other media investments.

Want to learn more about how media and entertainment companies can strengthen their cloud security postures?

Read our eBook: Protecting IP and Consumer Data in the Streaming Age: A Guide to Cloud Security for Digital Media & Entertainment.


Photo report from Montana: Family and friends bid farewell to Stoyan Nikolov – Torlaka

Post Syndicated from Nikolay Marchenko original

Monday, August 8, 2022

Hundreds of relatives, friends, and colleagues said farewell to the journalist and writer Stoyan Nikolov – Torlaka (1979 – 2022), who was also a long-time columnist for the investigative journalism site „Биволъ“. The funeral service…

NIST’s Post-Quantum Cryptography Standards

Post Syndicated from Webmaster original

Quantum computing is a completely new paradigm for computers. A quantum computer uses quantum properties such as superposition, which allows a qubit (a quantum bit) to be neither 0 nor 1, but something much more complicated. In theory, such a computer can solve problems too complex for conventional computers.

Current quantum computers are still toy prototypes, and the engineering advances required to build a functionally useful quantum computer are somewhere between a few years away and impossible. Even so, we already know that such a computer could potentially factor large numbers and compute discrete logs, and break the RSA and Diffie-Hellman public-key algorithms in all of the useful key sizes.

Cryptographers hate being rushed into things, which is why NIST began a competition to create a post-quantum cryptographic standard in 2016. The idea is to standardize on both a public-key encryption and digital signature algorithm that is resistant to quantum computing, well before anyone builds a useful quantum computer.

NIST is an old hand at this competitive process, having previously done this with symmetric algorithms (AES in 2001) and hash functions (SHA-3 in 2015). I participated in both of those competitions, and have likened them to demolition derbies. The idea is that participants put their algorithms into the ring, and then we all spend a few years beating on each other’s submissions. Then, with input from the cryptographic community, NIST crowns a winner. It’s a good process, mostly because NIST is both trusted and trustworthy.

In 2017, NIST received eighty-two post-quantum algorithm submissions from all over the world. Sixty-nine were considered complete enough to be Round 1 candidates. Twenty-six advanced to Round 2 in 2019, and seven (plus another eight alternates) were announced as Round 3 finalists in 2020. NIST was poised to make final algorithm selections in 2022, with a plan to have a draft standard available for public comment in 2023.

Cryptanalysis over the competition was brutal. Twenty-five of the Round 1 algorithms were attacked badly enough to remove them from the competition. Another eight were similarly attacked in Round 2. But here’s the real surprise: there were newly published cryptanalysis results against at least four of the Round 3 finalists just months ago—moments before NIST was to make its final decision.

One of the most popular algorithms, Rainbow, was found to be completely broken. Not that it could theoretically be broken with a quantum computer, but that it can be broken today—with an off-the-shelf laptop in just over two days. Three other finalists, Kyber, Saber, and Dilithium, were weakened with new techniques that will probably work against some of the other algorithms as well. (Fun fact: Those three algorithms were broken by the Center of Encryption and Information Security, part of the Israeli Defense Force. This represents the first time a national intelligence organization has published a cryptanalysis result in the open literature. And they had a lot of trouble publishing, as the authors wanted to remain anonymous.)

That was a close call, but it demonstrated that the process is working properly. Remember, this is a demolition derby. The goal is to surface these cryptanalytic results before standardization, which is exactly what happened. At this writing, NIST has chosen a single algorithm for general encryption and three digital-signature algorithms. It has not chosen a public-key encryption algorithm, and there are still four finalists. Check NIST’s webpage on the project for the latest information.

Ian Cassels, British mathematician and World War II cryptanalyst, once said that “cryptography is a mixture of mathematics and muddle, and without the muddle the mathematics can be used against you.” This mixture is particularly difficult to achieve with public-key algorithms, which rely on the mathematics for their security in a way that symmetric algorithms do not. We got lucky with RSA and related algorithms: their mathematics hinge on the problem of factoring, which turned out to be robustly difficult. Post-quantum algorithms rely on other mathematical disciplines and problems—code-based cryptography, hash-based cryptography, lattice-based cryptography, multivariate cryptography, and so on—whose mathematics are both more complicated and less well-understood. We’re seeing these breaks because those core mathematical problems aren’t nearly as well-studied as factoring is.

The moral is the need for cryptographic agility. It’s not enough to implement a single standard; it’s vital that our systems be able to easily swap in new algorithms when required. We’ve learned the hard way how algorithms can get so entrenched in systems that it can take many years to update them: in the transition from DES to AES, and the transition from MD4 and MD5 to SHA, SHA-1, and then SHA-3.
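To make "agility" concrete, here is a minimal sketch (using standard hash functions as stand-ins, since no post-quantum library is assumed here): callers reference algorithms through a registry, so a broken primitive can be retired by changing one entry rather than touching every call site:

```python
# Sketch of cryptographic agility: callers name algorithms indirectly,
# so a broken primitive can be swapped by editing one registry entry.
import hashlib

REGISTRY = {
    "sha2": lambda data: hashlib.sha256(data).hexdigest(),
    "sha3": lambda data: hashlib.sha3_256(data).hexdigest(),
}
DEFAULT = "sha2"  # flip to "sha3" to migrate every caller at once

def digest(data, alg=None):
    """Hash data with the named algorithm, or the current default."""
    return REGISTRY[alg or DEFAULT](data)

print(digest(b"hello"))
```

Real systems need the same indirection for key formats, certificate fields, and wire protocols, which is where the multi-year migration pain actually lives.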

We need to do better. In the coming years we’ll be facing a double uncertainty. The first is quantum computing. When and if quantum computing becomes a practical reality, we will learn a lot about its strengths and limitations. It took a couple of decades to fully understand von Neumann computer architecture; expect the same learning curve with quantum computing. Our current understanding of quantum computing architecture will change, and that could easily result in new cryptanalytic techniques.

The second uncertainty is in the algorithms themselves. As the new cryptanalytic results demonstrate, we’re still learning a lot about how to turn hard mathematical problems into public-key cryptosystems. We have too much math and an inability to add more muddle, and that results in algorithms that are vulnerable to advances in mathematics. More cryptanalytic results are coming, and more algorithms are going to be broken.

We can’t stop the development of quantum computing. Maybe the engineering challenges will turn out to be impossible, but it’s not the way to bet. In the face of all that uncertainty, agility is the only way to maintain security.

This essay originally appeared in IEEE Security & Privacy.

EDITED TO ADD: One of the four public-key encryption algorithms selected for further research, SIKE, was just broken.

Meta, countering disinformation, and the 2022 midterm elections

Post Syndicated from nellyo original

Meta has blocked Russian trolls, the media reported this week; numbers were even cited: more than a thousand. No source was named.


The source is Meta's regular quarterly report on threats and adversarial behavior. The introduction notes that its public threat reporting began five years ago.

“To provide a more comprehensive view into the risks we tackle, we have expanded our regular threat reports to include cyber espionage, inauthentic behavior, and other emerging threats. The report is not intended to reflect all of our security measures, but to share notable trends and analysis that help in understanding evolving security threats.”

It is there, in the section on Russia, that the closed accounts are mentioned (45 on Facebook and 1,037 on Instagram).


A second document from the week related to countering disinformation takes its cue from the approaching midterm elections in the US. It is titled “Meta's Approach to the 2022 US Midterm Elections”.

The company reports that it has created a dedicated team focused on the 2022 midterm elections. 40,000 people are employed on safety and security, with $5 billion in funding in 2021 alone. “We work with partners in the federal government, including the FBI and the Cybersecurity and Infrastructure Security Agency, as well as with local and state election officials. We are updating security tools and additional protections for candidates, their campaigns, and election administration officials. We identify foreign interference and domestic influence operations, and have removed more than 150 networks since 2017. We take a firm stance against fake accounts and block millions every day, most of them at the time of their creation,” Meta's document states.

Free speech or speech without responsibility: the trial against Alex Jones and Infowars

Post Syndicated from nellyo original

Alex Jones, owner of the InfoWars site, became notorious for spreading disinformation and conspiracy theories; the peak of his popularity is linked to the harassment that he, through his site, directed at the parents of the children killed in the 2012 shooting at Sandy Hook school in Newtown, Connecticut. Alex Jones spread the claim that there had been no shooting and that the parents were making it up. They, in turn, filed lawsuits against him.

Infowars was expelled from the platforms and app stores. That was years ago; Trump was still on Twitter, and the conversation was just beginning about whether private companies can set the boundaries of free speech online. And whether the First Amendment applies to them. And if it does, whether Alex Jones may say whatever he pleases.

Ten years later, Jones tells the parents that he never intended to hurt them. That there was a shooting, even though he had repeatedly said, year after year, that the shooting was staged. Yet he claims he is exercising his right to free speech. To this day he finds ways to spread conspiracy theories and attract millions of viewers every month, some of whom have also threatened the parents, even sending them death threats.

The families of the 10 victims are pursuing four separate lawsuits, and the first verdict has now been announced. The current trial in Austin, where Jones's far-right website Infowars and its publishing company are based, stems from a 2018 lawsuit filed by Neil Heslin and Scarlett Lewis, whose 6-year-old son was killed in the 2012 attack along with 19 other first-graders and six educators. The court awarded the family a total of $49.2 million. The other lawsuits have been postponed.


“Save the First” reads the tape Alex Jones is wearing. The photo is by AP: Briana Sanchez/Austin American-Statesman via AP.

This is about the First Amendment. At trial, Jones's defense was expected to show that his speech was protected under the First Amendment to the Constitution, but that did not happen: the necessary evidence was not presented, so the anticipated debate never really took place, although Alex Jones once again managed to tell the media that, should he be sanctioned, “this is not America.”

OpenSUSE considers dropping reiserfs

Post Syndicated from corbet original

As Jeff Mahoney notes in this message to the openSUSE factory list, the reiserfs filesystem has been unmaintained for years and lacks many of the features that users have come to expect. He has thus proposed removing reiserfs from openSUSE Tumbleweed:

I recognize that there may be people out there with disks containing reiserfs file systems. If these are in active use, I would seriously encourage migrating to something actively maintained.

1.1.1.1 + WARP: More features, still private

Post Syndicated from Mari Galicer original

It’s a Saturday night. You open your browser, looking for nearby pizza spots that are open. If the search goes as intended, your browser will show you results that are within a few miles, often based on the assumed location of your IP address. At Cloudflare, we affectionately call this type of geolocation accuracy the “pizza test”. When you use a Cloudflare product that sits between you and the Internet (for example, WARP), it’s one of the ways we work to balance user experience and privacy. Too inaccurate and you’re getting pizza places from a neighboring country; too accurate and you’re reducing the privacy benefits of obscuring your location.

With that in mind, we’re excited to announce two major improvements to our 1.1.1.1 + WARP apps: first, an improvement to how we ensure search results and other geographically-aware Internet activity work without compromising your privacy, and second, a larger network with more locations available to WARP+ subscribers, powering even speedier connections to our global network.

A better Internet browsing experience for every WARP user

When we originally built the WARP mobile app, we wanted to create a consumer-friendly way to connect to our network and privacy-respecting DNS resolver.

What we discovered over time is that the topology of the Internet dictates a different type of experience for users in different locations. Why? Sometimes, traffic congestion or technical issues route your traffic to a less congested part of the network. Other times, Internet Service Providers may not peer with Cloudflare, or may engage in traffic engineering to optimize their networks as they see fit, which could result in user traffic connecting to a location that doesn’t quite map to the user’s locale or language.

Regardless of the cause, the impact is that your search results become less relevant, if not outright confusing. For example, in a region dense with country borders, like Europe, your traffic in Berlin could get routed to Amsterdam because your mobile operator chooses not to peer in-country, giving you results in Dutch instead of German. This can also be disruptive if you’re trying to stream content subject to licensing restrictions, such as a person in the UK trying to watch BBC iPlayer or a person in Brazil trying to watch the World Cup.

So we fixed this. We just rolled out a major update to the service that powers WARP that will give you a geographically accurate browsing experience without revealing your IP address to the websites you’re visiting. Instead, websites you visit will see a Cloudflare IP address instead, making it harder for them to track you directly.

How it works

Traditionally, consumer VPNs deliberately route your traffic through a server in another country, making your connection slow, and often getting blocked because of their ability to flout location-based content restrictions. We took a different approach when we first launched WARP in 2019, giving you the best possible performance by routing your traffic through the Cloudflare data center closest to you. However, because not every Internet Service Provider (ISP) peers with Cloudflare, users sometimes end up exiting the Cloudflare network from a more “random” data center – one that does not accurately represent their locale.

Websites and third-party services often infer geolocation from your IP address, and now, 1.1.1.1 + WARP replaces your original IP address with one that consistently and accurately represents your approximate location.

Here’s how we did it:

  1. We ran an analysis on a subset of our network traffic to find a rough approximation of how many users we have per city.
  2. We divided that amongst our egress IPs, using an anycast architecture to be efficient with the number of additional IPs we had to allocate and advertise per metro area.
  3. We then submitted geolocation information of those IPs to various geolocation database providers, ensuring third party services associate those Cloudflare egress IPs with an accurate approximate location.

It was important to us to provide the benefits of this location accuracy without compromising user privacy, so the app doesn’t ask for specific location permissions or log your IP address.

An even bigger network for WARP+ users

We also recently announced that we’ve expanded our network to over 275 cities in over 100 countries. This gave us an opportunity to revisit where we offer WARP and to expand the set of locations users can connect through (in other words: an opportunity to make things faster).

From today, all WARP+ subscribers will benefit from a larger network with 20+ new cities, with no change in subscription pricing. A closer Cloudflare data center means less latency between your device and Cloudflare, which directly improves your download speed, thanks to what’s called the Bandwidth-Delay Product (put simply: lower latency, higher throughput!).
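The Bandwidth-Delay Product intuition can be sketched with a quick calculation: with a fixed window of unacknowledged data in flight, achievable throughput is capped at roughly the window size divided by the round-trip time (the 64 KiB window below is just a classic illustrative value):

```python
# Why lower latency means higher throughput: with a fixed window,
# throughput is capped at window / RTT.

def max_throughput_mbps(window_bytes, rtt_seconds):
    """Upper bound on throughput in megabits per second."""
    return window_bytes * 8 / rtt_seconds / 1e6

WINDOW = 65536  # 64 KiB of unacknowledged data in flight

print(max_throughput_mbps(WINDOW, 0.050))  # ~10.5 Mbps at 50 ms RTT
print(max_throughput_mbps(WINDOW, 0.010))  # ~52.4 Mbps at 10 ms RTT
```

Halving the round trip to a nearer data center therefore doubles this ceiling, even before any protocol-level improvements.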

As a result, sites load faster, both for those on the Cloudflare network and those that aren’t. As we continue to expand our network, we’ll be revising this on a regular basis to ensure that all WARP and WARP+ subscribers continue to get great performance.

Speed, privacy, and relevance

Beyond being able to find pizza on a Saturday night, we believe everyone should be able to browse the Internet freely – and not have to sacrifice the speed, privacy, or relevance of their search results in order to do so.

In the near future, we’ll be investing in features to bring even more of the benefits of Cloudflare infrastructure to every 1.1.1.1 + WARP user. Stay tuned!

Design patterns to manage Amazon EMR on EKS workloads for Apache Spark

Post Syndicated from Jamal Arif original

Amazon EMR on Amazon EKS enables you to submit Apache Spark jobs on demand on Amazon Elastic Kubernetes Service (Amazon EKS) without provisioning clusters. With EMR on EKS, you can consolidate analytical workloads with your other Kubernetes-based applications on the same Amazon EKS cluster to improve resource utilization and simplify infrastructure management. Kubernetes uses namespaces to provide isolation between groups of resources within a single Kubernetes cluster. Amazon EMR creates a virtual cluster by registering Amazon EMR with a namespace on an EKS cluster. Amazon EMR can then run analytics workloads on that namespace.

In EMR on EKS, you can submit your Spark jobs to Amazon EMR virtual clusters using the AWS Command Line Interface (AWS CLI), SDK, or Amazon EMR Studio. Amazon EMR requests the Kubernetes scheduler on Amazon EKS to schedule pods. For every job you run, EMR on EKS creates a container with an Amazon Linux 2 base image, Apache Spark, and associated dependencies. Each Spark job runs in a pod on Amazon EKS worker nodes. If your Amazon EKS cluster has worker nodes in different Availability Zones, the Spark application driver and executor pods can spread across multiple Availability Zones. In this case, data transfer charges apply for cross-AZ communication, and data processing latency increases. If you want to reduce data processing latency and avoid cross-AZ data transfer costs, you should configure Spark applications to run only within a single Availability Zone.

In this post, we share four design patterns to manage EMR on EKS workloads for Apache Spark. We then show how to use a pod template to schedule a job with EMR on EKS, and use Karpenter as our autoscaling tool.

Pattern 1: Manage Spark jobs by pod template

Customers often consolidate multiple applications on a shared Amazon EKS cluster to improve utilization and save costs. However, each application may have different requirements. For example, you may want to run performance-intensive workloads such as machine learning model training jobs on SSD-backed instances for better performance, or fault-tolerant and flexible applications on Amazon Elastic Compute Cloud (Amazon EC2) Spot Instances for lower cost. In EMR on EKS, there are a few ways to configure how your Spark job runs on Amazon EKS worker nodes. You can utilize the Spark configurations on Kubernetes with the EMR on EKS StartJobRun API, or you can use Spark’s pod template feature. Pod templates are specifications that determine how to run each pod on your EKS clusters. With pod templates, you have more flexibility and can use pod template files to define Kubernetes pod configurations that Spark doesn’t support.

You can use pod templates to achieve the following benefits:

  • Reduce costs – You can schedule Spark executor pods to run on EC2 Spot Instances while scheduling Spark driver pods to run on EC2 On-Demand Instances.
  • Improve monitoring – You can enhance your Spark workload’s observability. For example, you can deploy a sidecar container via a pod template to your Spark job that can forward logs to your centralized logging application.
  • Improve resource utilization – You can support multiple teams running their Spark workloads on the same shared Amazon EKS cluster.

You can implement these patterns using pod templates and Kubernetes labels and selectors. Kubernetes labels are key-value pairs that are attached to objects, such as Kubernetes worker nodes, to identify attributes that are meaningful and relevant to users. You can then choose where Kubernetes schedules pods using nodeSelector or Kubernetes affinity and anti-affinity so that it can only run on specific worker nodes. nodeSelector is the simplest way to constrain pods to nodes with specific labels. Affinity and anti-affinity expand the types of constraints you can define.
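For example, a Spark executor pod template that pins executors to Spot capacity in a single Availability Zone might look like the following sketch (the zone value is illustrative; `topology.kubernetes.io/zone` is a standard Kubernetes node label and `eks.amazonaws.com/capacityType` is the label EKS applies to managed node group nodes):

```yaml
# executor-pod-template.yaml: run Spark executors on Spot capacity
# in a single Availability Zone (zone value is illustrative).
apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    eks.amazonaws.com/capacityType: SPOT
    topology.kubernetes.io/zone: us-west-2b
```

You would pass such a file to Spark via the `spark.kubernetes.executor.podTemplateFile` configuration (and similarly `spark.kubernetes.driver.podTemplateFile` for the driver) when submitting the job.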

Autoscaling in Spark workload

Autoscaling is a function that automatically scales your compute resources up or down in response to changes in demand. For Kubernetes auto scaling, Amazon EKS supports two auto scaling products: the Kubernetes Cluster Autoscaler and the Karpenter open-source auto scaling project. Kubernetes autoscaling ensures your cluster has enough nodes to schedule your pods without wasting resources. If some pods fail to schedule on current worker nodes due to insufficient resources, it increases the size of the cluster and adds additional nodes. It also attempts to remove underutilized nodes when their pods can run elsewhere.

Pattern 2: Turn on Dynamic Resource Allocation (DRA) in Spark

Spark provides a mechanism called Dynamic Resource Allocation (DRA), which dynamically adjusts the resources your application occupies based on the workload. With DRA, the Spark driver spawns the initial number of executors and then scales up the number until the specified maximum number of executors is met to process the pending tasks. Idle executors are deleted when there are no pending tasks. It’s particularly useful if you’re not certain how many executors are needed for your job processing.

You can implement it in EMR on EKS by following the Dynamic Resource Allocation workshop.
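As a sketch, the core Spark properties behind DRA look like the following (the executor counts are illustrative; shuffle tracking is required on Kubernetes because there is no external shuffle service):

```properties
spark.dynamicAllocation.enabled=true
spark.dynamicAllocation.shuffleTracking.enabled=true
spark.dynamicAllocation.minExecutors=1
spark.dynamicAllocation.maxExecutors=20
spark.dynamicAllocation.initialExecutors=2
```

With these set, the driver starts with two executors, grows toward the maximum while tasks are pending, and releases idle executors back to the cluster.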

Pattern 3: Fully control cluster autoscaling by Cluster Autoscaler

Cluster Autoscaler uses the concept of node groups as the unit of capacity control and scaling. In AWS, node groups are implemented by EC2 Auto Scaling groups, and Cluster Autoscaler scales a node group by adjusting the desired capacity of its Auto Scaling group.

To save costs and improve resource utilization, you can use Cluster Autoscaler in your Amazon EKS cluster to automatically scale your Spark pods. The following are recommendations for autoscaling Spark jobs with Amazon EMR on EKS using Cluster Autoscaler:

  • Create Availability Zone bounded auto scaling groups to make sure Cluster Autoscaler only adds worker nodes in the same Availability Zone to avoid cross-AZ data transfer charges and data processing latency.
  • Create separate node groups for EC2 On-Demand and Spot Instances. By doing this, you can scale driver pods and executor pods independently.
  • In Cluster Autoscaler, each node in a node group needs to have identical scheduling properties. That includes EC2 instance types, which should be of similar vCPU to memory ratio to avoid inconsistency and wastage of resources. To learn more about Cluster Autoscaler node groups best practices, refer to Configuring your Node Groups.
  • Adhere to Spot Instance best practices and maximize diversification to take advantage of multiple Spot pools. Create multiple node groups for Spark executor pods with different vCPU to memory ratios. This greatly increases the stability and resiliency of your application.
  • When you have multiple node groups, use pod templates and Kubernetes labels and selectors to manage Spark pod deployment to specific Availability Zones and EC2 instance types.

The following diagram illustrates Availability Zone bounded auto scaling groups.

AZ bounded CA ASG
As multiple node groups are created, Cluster Autoscaler has the concept of expanders, which provide different strategies for selecting which node group to scale. As of this writing, the following strategies are supported: random, most-pods, least-waste, and priority. With multiple node groups of EC2 On-Demand and Spot Instances, you can use the priority expander, which allows Cluster Autoscaler to select the node group that has the highest priority assigned by the user. For configuration details, refer to Priority based expander for Cluster Autoscaler.
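As a sketch, the priority expander is configured through a ConfigMap named cluster-autoscaler-priority-expander in Cluster Autoscaler's namespace; higher numbers win, and the node group name patterns below are illustrative:

```yaml
# ConfigMap read by Cluster Autoscaler's priority expander.
# Spot node groups (priority 50) are preferred over On-Demand (priority 10);
# the name patterns are illustrative.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |-
    10:
      - .*on-demand.*
    50:
      - .*spot.*
```

With this in place, Cluster Autoscaler tries Spot-backed node groups first and falls back to On-Demand groups only when the higher-priority groups cannot scale.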

Pattern 4: Group-less autoscaling with Karpenter

Karpenter is an open-source, flexible, high-performance Kubernetes cluster auto scaler built with AWS. The overall goal is the same: auto scale the Amazon EKS cluster so that un-schedulable pods can run. However, Karpenter takes a different approach than Cluster Autoscaler, known as group-less provisioning. It observes the aggregate resource requests of unscheduled pods and makes decisions to launch minimal compute resources to fit the un-schedulable pods, for efficient bin-packing and reduced scheduling latency. It can also delete nodes to reduce infrastructure costs. Karpenter works directly with the Amazon EC2 Fleet API.

To configure Karpenter, you create provisioners that define how Karpenter manages un-schedulable pods and expired nodes. You should utilize the concept of layered constraints to manage scheduling constraints. To reduce EMR on EKS costs and improve Amazon EKS cluster utilization, you can use Karpenter with similar constraints of Single-AZ, On-Demand Instances for Spark driver pods, and Spot Instances for executor pods without creating multiple types of node groups. With its group-less approach, Karpenter allows you to be more flexible and diversify better.

The following are recommendations for auto scaling EMR on EKS with Karpenter:

  • Configure Karpenter provisioners to launch nodes in a single Availability Zone to avoid cross-AZ data transfer costs and reduce data processing latency.
  • Create a provisioner for EC2 Spot Instances and EC2 On-Demand Instances. You can reduce costs by scheduling Spark driver pods to run on EC2 On-Demand Instances and schedule Spark executor pods to run on EC2 Spot Instances.
  • Limit the instance types by providing a list of EC2 instances or let Karpenter choose from all the Spot pools available to it. This follows the Spot best practices of diversifying across multiple Spot pools.
  • Use pod templates and Kubernetes labels and selectors to allow Karpenter to spin up right-sized nodes required for un-schedulable pods.
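Putting those recommendations together, a provisioner for Spark executor capacity might look like this sketch (using the v1alpha5 Provisioner API; all names and values are illustrative):

```yaml
# Karpenter Provisioner (v1alpha5 API) for Spark executor nodes:
# Spot capacity only, pinned to a single AZ. Values are illustrative.
apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: spark-executors
spec:
  requirements:
    - key: karpenter.sh/capacity-type
      operator: In
      values: ["spot"]
    - key: topology.kubernetes.io/zone
      operator: In
      values: ["us-west-2b"]
  labels:
    type: spark-executor   # target this via a nodeSelector in a pod template
  ttlSecondsAfterEmpty: 30  # reclaim empty nodes quickly to cut costs
```

A second provisioner with capacity-type "on-demand" can serve driver pods, giving the same driver/executor split as separate node groups, without managing the groups themselves.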

The following diagram illustrates how Karpenter works.

Karpenter How it Works

To summarize the design patterns we discussed:

  1. Pod templates help tailor your Spark workloads. You can configure Spark pods in a single Availability Zone and utilize EC2 Spot Instances for Spark executor pods, resulting in better price-performance.
  2. EMR on EKS supports the DRA feature in Spark. It is useful if you’re not certain how many Spark executors are needed for your job; DRA dynamically adjusts the resources your application needs.
  3. Utilizing Cluster Autoscaler enables you to fully control how to autoscale your Amazon EMR on EKS workloads. It improves your Spark application availability and cluster efficiency by rapidly launching right-sized compute resources.
  4. Karpenter simplifies autoscaling with its group-less provisioning of compute resources. The benefits include reduced scheduling latency, and efficient bin-packing to reduce infrastructure costs.

Walkthrough overview

In our example walkthrough, we show how to use a pod template to schedule a job with EMR on EKS, using Karpenter as our autoscaling tool.

We complete the following steps to implement the solution:

  1. Create an Amazon EKS cluster.
  2. Prepare the cluster for EMR on EKS.
  3. Register the cluster with Amazon EMR.
  4. Set up Karpenter auto scaling in Amazon EKS.
  5. Submit a sample Spark job using pod templates to run in single Availability Zone and utilize Spot for Spark executor pods.

The following diagram illustrates this architecture.

Walkthrough Overview


To follow along with the walkthrough, ensure that you have the following prerequisite resources:

Create an Amazon EKS cluster

There are two ways to create an EKS cluster: you can use the AWS Management Console and AWS CLI, or you can install all the required resources for Amazon EKS using eksctl, a simple command line utility for creating and managing Kubernetes clusters on EKS. For this post, we use eksctl to create our cluster.

Let’s start with installing the tools to set up and manage your Kubernetes cluster.

  1. Install the AWS CLI with the following command (Linux OS) and confirm it works:
    curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
    unzip awscliv2.zip
    sudo ./aws/install
    aws --version

    For other operating systems, see Installing, updating, and uninstalling the AWS CLI version.

  2. Install eksctl, the command line utility for creating and managing Kubernetes clusters on Amazon EKS:
    curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
    sudo mv -v /tmp/eksctl /usr/local/bin
    eksctl version

    eksctl is a tool jointly developed by AWS and Weaveworks that automates much of the experience of creating EKS clusters.

  3. Install the Kubernetes command-line tool, kubectl, which allows you to run commands against Kubernetes clusters:
    curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
    chmod +x ./kubectl
    sudo mv ./kubectl /usr/local/bin

  4. Create a new file called eks-create-cluster.yaml with the following:
    apiVersion: eksctl.io/v1alpha5
    kind: ClusterConfig
    metadata:
      name: emr-on-eks-blog-cluster
      region: us-west-2
    availabilityZones: ["us-west-2b", "us-west-2c", "us-west-2d"]
    managedNodeGroups: # On-Demand node group for Spark jobs
    - name: singleaz-ng-ondemand
      instanceType: m5.xlarge
      desiredCapacity: 1
      availabilityZones: ["us-west-2b"]

  5. Create an Amazon EKS cluster using the eks-create-cluster.yaml file:
    eksctl create cluster -f eks-create-cluster.yaml

    In this Amazon EKS cluster, we create a single managed node group with a general-purpose m5.xlarge EC2 instance. Launching the Amazon EKS cluster, its managed node groups, and all dependencies typically takes 10–15 minutes.

  6. After you create the cluster, you can run the following to confirm all node groups were created:
    eksctl get nodegroups --cluster emr-on-eks-blog-cluster

    You can now use kubectl to interact with the created Amazon EKS cluster.

  7. After you create your Amazon EKS cluster, you must configure your kubeconfig file for your cluster using the AWS CLI:
    aws eks --region us-west-2 update-kubeconfig --name emr-on-eks-blog-cluster
    kubectl cluster-info

You can now use kubectl to connect to your Kubernetes cluster.

Prepare your Amazon EKS cluster for EMR on EKS

Now we prepare our Amazon EKS cluster to integrate it with EMR on EKS.

  1. Let’s create the namespace emr-on-eks-blog in our Amazon EKS cluster:
    kubectl create namespace emr-on-eks-blog

  2. We use the automation powered by eksctl to create role-based access control permissions and to add the EMR on EKS service-linked role into the aws-auth configmap:
    eksctl create iamidentitymapping --cluster emr-on-eks-blog-cluster --namespace emr-on-eks-blog --service-name "emr-containers"

  3. The Amazon EKS cluster already has an OpenID Connect provider URL. You enable IAM roles for service accounts by associating IAM with the Amazon EKS cluster OIDC:
    eksctl utils associate-iam-oidc-provider --cluster emr-on-eks-blog-cluster --approve

    Now let’s create the IAM role that Amazon EMR uses to run Spark jobs.

  4. Create the file blog-emr-trust-policy.json:

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "elasticmapreduce.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    Set up the IAM role:

    aws iam create-role --role-name blog-emrJobExecutionRole --assume-role-policy-document file://blog-emr-trust-policy.json

    This IAM role contains all permissions that the Spark job needs—for instance, we provide access to S3 buckets and Amazon CloudWatch to access necessary files (pod templates) and share logs.

    Next, we need to attach the required IAM policies to the role so it can write logs to Amazon S3 and CloudWatch.

  5. Create the file blog-emr-policy-document.json with the required IAM policies. Replace the bucket name with your S3 bucket ARN.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Action": [
            "s3:PutObject",
            "s3:GetObject",
            "s3:ListBucket"
          ],
          "Resource": ["arn:aws:s3:::<bucket-name>", "arn:aws:s3:::<bucket-name>/*"]
        },
        {
          "Effect": "Allow",
          "Action": [
            "logs:PutLogEvents",
            "logs:CreateLogStream",
            "logs:DescribeLogGroups",
            "logs:DescribeLogStreams"
          ],
          "Resource": ["arn:aws:logs:*:*:*"]
        }
      ]
    }

    Attach it to the IAM role created in the previous step:

    aws iam put-role-policy --role-name blog-emrJobExecutionRole --policy-name blog-EMR-JobExecution-policy --policy-document file://blog-emr-policy-document.json

  6. Now we update the trust relationship between the IAM role we just created and the Amazon EMR service identity. The namespace provided in the trust policy must be the same one used when registering the virtual cluster in the next step:
    aws emr-containers update-role-trust-policy --cluster-name emr-on-eks-blog-cluster --namespace emr-on-eks-blog --role-name blog-emrJobExecutionRole --region us-west-2

Register the Amazon EKS cluster with Amazon EMR

Registering your Amazon EKS cluster is the final step to set up EMR on EKS to run workloads.

We create a virtual cluster and map it to the Kubernetes namespace created earlier:

aws emr-containers create-virtual-cluster \
    --region us-west-2 \
    --name emr-on-eks-blog-cluster \
    --container-provider '{
       "id": "emr-on-eks-blog-cluster",
       "type": "EKS",
       "info": {
          "eksInfo": {
              "namespace": "emr-on-eks-blog"
          }
       }
    }'
After you register, you should get confirmation that your EMR virtual cluster is created:

{
    "arn": "arn:aws:emr-containers:us-west-2:142939128734:/virtualclusters/lwpylp3kqj061ud7fvh6sjuyk",
    "id": "lwpylp3kqj061ud7fvh6sjuyk",
    "name": "emr-on-eks-blog-cluster"
}

A virtual cluster is an Amazon EMR concept: it means that Amazon EMR is registered to a Kubernetes namespace and can run jobs in that namespace. If you navigate to the Amazon EMR console, you can see the virtual cluster listed.

Set up Karpenter in Amazon EKS

To get started with Karpenter, ensure there is some compute capacity available, and install it using the Helm charts provided in the public repository. Karpenter also requires permissions to provision compute resources. For more information, refer to Getting Started.

Karpenter’s single responsibility is to provision compute for your Kubernetes clusters, which is configured by a custom resource called a provisioner. Once installed in your cluster, Karpenter observes incoming Kubernetes pods that can’t be scheduled due to insufficient compute resources in the cluster, and automatically launches new resources to meet their scheduling and resource requirements.
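As a sketch of this group-less idea, an autoscaler can size a new node directly from the pending pods' requests instead of scaling a fixed node group. This is an illustration, not Karpenter's actual code, and the instance specs below are simplified hypothetical values:

```python
# (vCPU, memory GiB) for a few hypothetical instance choices
INSTANCE_TYPES = {
    "m5.large": (2, 8),
    "m5.xlarge": (4, 16),
    "m5.2xlarge": (8, 32),
}

def pick_instance(pending_pods):
    """Return the smallest instance type that fits the batch of
    pending pods, or None if none fits."""
    need_cpu = sum(p["cpu"] for p in pending_pods)
    need_mem = sum(p["mem"] for p in pending_pods)
    # Try instance types from smallest to largest (first fit)
    for itype, (cpu, mem) in sorted(INSTANCE_TYPES.items(), key=lambda kv: kv[1]):
        if cpu >= need_cpu and mem >= need_mem:
            return itype
    return None

pods = [{"cpu": 1, "mem": 2}, {"cpu": 1, "mem": 2}, {"cpu": 1, "mem": 4}]
print(pick_instance(pods))  # → m5.xlarge (smallest type with >=3 vCPU, >=8 GiB)
```

Because the node is chosen per batch of pending pods rather than per node group, unused headroom is minimized, which is where the cost savings come from.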

For our use case, we provision two provisioners.

The first is a Karpenter provisioner for Spark driver pods to run on EC2 On-Demand Instances:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: ondemand
spec:
  # Expire nodes after 30 days (in seconds)
  ttlSecondsUntilExpired: 2592000

  # Scale down empty nodes after 30 seconds
  ttlSecondsAfterEmpty: 30

  labels:
    type: on-demand

  requirements:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["us-west-2b"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["arm64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["on-demand"]

  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi

  provider:
    subnetSelector:
      alpha.eksctl.io/cluster-name: emr-on-eks-blog-cluster
    securityGroupSelector:
      alpha.eksctl.io/cluster-name: emr-on-eks-blog-cluster

The second is a Karpenter provisioner for Spark executor pods to run on EC2 Spot Instances:

apiVersion: karpenter.sh/v1alpha5
kind: Provisioner
metadata:
  name: default
spec:
  # Expire nodes after 30 days (in seconds)
  ttlSecondsUntilExpired: 2592000

  # Scale down empty nodes after 30 seconds
  ttlSecondsAfterEmpty: 30

  labels:
    type: spot

  requirements:
    - key: "topology.kubernetes.io/zone"
      operator: In
      values: ["us-west-2b"]
    - key: "kubernetes.io/arch"
      operator: In
      values: ["arm64"]
    - key: "karpenter.sh/capacity-type"
      operator: In
      values: ["spot"]

  limits:
    resources:
      cpu: "1000"
      memory: 1000Gi

  provider:
    subnetSelector:
      alpha.eksctl.io/cluster-name: emr-on-eks-blog-cluster
    securityGroupSelector:
      alpha.eksctl.io/cluster-name: emr-on-eks-blog-cluster

Note the requirements section of the provisioner config. There we use the well-known labels for Amazon EKS and Karpenter to add constraints on how Karpenter launches nodes. With these constraints, if a pod asks for the spot label, this provisioner launches an EC2 Spot Instance only in Availability Zone us-west-2b; the on-demand label follows the same constraint. We can also be more granular and provide specific EC2 instance types in the provisioner, with different vCPU and memory ratios, which gives you more flexibility and adds resiliency to your application. Karpenter launches nodes only when both the provisioner's and the pod's requirements are met. To learn more about the Karpenter provisioner API, refer to Provisioner API.
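The matching rule described above can be sketched in a few lines. This is a simplification, not Karpenter's implementation; the label keys are the real well-known Kubernetes/Karpenter labels, and the values mirror the provisioners in this post:

```python
def satisfies(provisioner_reqs, pod_node_selector):
    """Check whether a provisioner can launch a node for a pod.
    provisioner_reqs: {label_key: set of allowed values} ("In" operator)
    pod_node_selector: {label_key: required value}"""
    for key, value in pod_node_selector.items():
        # The pod asks for a value this provisioner will never launch
        if key in provisioner_reqs and value not in provisioner_reqs[key]:
            return False
    return True

spot_provisioner = {
    "karpenter.sh/capacity-type": {"spot"},
    "topology.kubernetes.io/zone": {"us-west-2b"},
}

executor_pod = {"karpenter.sh/capacity-type": "spot",
                "topology.kubernetes.io/zone": "us-west-2b"}
driver_pod = {"karpenter.sh/capacity-type": "on-demand"}

print(satisfies(spot_provisioner, executor_pod))  # → True
print(satisfies(spot_provisioner, driver_pod))    # → False
```

In other words, a node is launched only from a provisioner whose allowed values intersect every constraint the pod declares.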

In the next step, we define pod requirements and align them with what we have defined in Karpenter’s provisioner.

Submit Spark job using Pod template

In Kubernetes, labels are key-value pairs attached to objects, such as pods. Labels are intended to specify identifying attributes of objects that are meaningful and relevant to users. You can constrain a pod so that it can only run on a particular set of nodes. There are several ways to do this, and the recommended approaches all use label selectors to facilitate the selection.

Beginning with Amazon EMR versions 5.33.0 and 6.3.0, EMR on EKS supports Spark's pod template feature. We use pod templates to add specific labels that control where Spark driver and executor pods are launched.

Create a pod template file for the Spark driver pod and save it in your S3 bucket:

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    type: on-demand
  containers:
  - name: spark-kubernetes-driver # This will be interpreted as the Spark driver container

Create a pod template file for the Spark executor pod and save it in your S3 bucket:

apiVersion: v1
kind: Pod
spec:
  nodeSelector:
    type: spot
  containers:
  - name: spark-kubernetes-executor # This will be interpreted as the Spark executor container

Pod templates provide different fields to manage job scheduling. For additional details, refer to Pod template fields. Note the nodeSelector for the Spark driver pods and Spark executor pods, which match the labels we defined with the Karpenter provisioner.

For a sample Spark job, we use the following code, which creates multiple parallel threads and waits for a few seconds:

cat << EOF > threadsleep.py
import sys
from time import sleep
from pyspark.sql import SparkSession
spark = SparkSession.builder.appName("threadsleep").getOrCreate()
sc = spark.sparkContext
def sleep_for_x_seconds(x): sleep(x*20)
sc.parallelize(range(1,6), 5).foreach(sleep_for_x_seconds)
spark.stop()
EOF

Copy the sample Spark job into your S3 bucket:

aws s3 mb s3://<YourS3Bucket>
aws s3 cp threadsleep.py s3://<YourS3Bucket>/

Before we submit the Spark job, let’s get the required values of the EMR virtual cluster and Amazon EMR job execution role ARN:

export S3blogbucket=s3://<YourS3Bucket>
export VIRTUAL_CLUSTER_ID=$(aws emr-containers list-virtual-clusters --query "virtualClusters[?state=='RUNNING'].id" --region us-west-2 --output text)

export EMR_ROLE_ARN=$(aws iam get-role --role-name blog-emrJobExecutionRole --query Role.Arn --region us-west-2 --output text)

To enable the pod template feature with EMR on EKS, you can use configuration-overrides to specify the Amazon S3 path to the pod template:

aws emr-containers start-job-run \
--virtual-cluster-id $VIRTUAL_CLUSTER_ID \
--name spark-threadsleep-single-az \
--execution-role-arn $EMR_ROLE_ARN \
--release-label emr-5.33.0-latest \
--region us-west-2 \
--job-driver '{
    "sparkSubmitJobDriver": {
        "entryPoint": "'${S3blogbucket}'/threadsleep.py",
        "sparkSubmitParameters": "--conf spark.executor.instances=6 --conf spark.executor.memory=1G --conf spark.executor.cores=1 --conf spark.driver.cores=2"
    }
}' \
--configuration-overrides '{
    "applicationConfiguration": [
      {
        "classification": "spark-defaults",
        "properties": {
          "spark.kubernetes.driver.podTemplateFile": "'${S3blogbucket}'/spark_driver_podtemplate.yaml",
          "spark.kubernetes.executor.podTemplateFile": "'${S3blogbucket}'/spark_executor_podtemplate.yaml"
        }
      }
    ],
    "monitoringConfiguration": {
      "cloudWatchMonitoringConfiguration": {
        "logGroupName": "/emr-on-eks/emreksblog",
        "logStreamNamePrefix": "threadsleep"
      },
      "s3MonitoringConfiguration": {
        "logUri": "'"$S3blogbucket"'/logs/"
      }
    }
}'
In the Spark job, we’re requesting two cores for the Spark driver and one core for each Spark executor pod. Because we only had a single EC2 instance in our managed node group, Karpenter looks at the unschedulable Spark driver pods and uses the on-demand provisioner to launch EC2 On-Demand Instances for Spark driver pods in us-west-2b. Similarly, when the Spark executor pods are in a pending state because there are no Spot Instances, Karpenter launches Spot Instances in us-west-2b.

This way, Karpenter optimizes your costs by starting from zero Spot and On-Demand Instances and only creates them dynamically when required. Additionally, Karpenter batches pending pods and then binpacks them based on the CPU, memory, and GPUs required, taking into account node overhead, VPC CNI resources required, and daemon sets that will be packed when bringing up a new node. This makes sure you’re efficiently utilizing your resources with minimal waste.
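The batching and bin-packing behavior just described can be sketched as a toy first-fit-decreasing model. This is not Karpenter's actual algorithm, and the node capacity and overhead numbers are made-up examples:

```python
NODE_CPU, NODE_MEM = 4.0, 16.0         # hypothetical node capacity (vCPU, GiB)
OVERHEAD_CPU, OVERHEAD_MEM = 0.5, 1.0  # reserved for kubelet and daemon sets

def binpack(pods):
    """First-fit-decreasing: place pods onto as few nodes as possible.
    Returns the list of nodes, each tracking its remaining free capacity."""
    nodes = []  # each node: {"cpu": free_cpu, "mem": free_mem, "pods": [...]}
    # Place the largest pods first to reduce fragmentation
    for pod in sorted(pods, key=lambda p: (p["cpu"], p["mem"]), reverse=True):
        for node in nodes:
            if node["cpu"] >= pod["cpu"] and node["mem"] >= pod["mem"]:
                node["cpu"] -= pod["cpu"]; node["mem"] -= pod["mem"]
                node["pods"].append(pod)
                break
        else:  # no existing node fits: "launch" a new one, minus overhead
            nodes.append({"cpu": NODE_CPU - OVERHEAD_CPU - pod["cpu"],
                          "mem": NODE_MEM - OVERHEAD_MEM - pod["mem"],
                          "pods": [pod]})
    return nodes

# Six 1-vCPU/1-GiB executors need two 4-vCPU nodes once overhead is reserved
executors = [{"cpu": 1.0, "mem": 1.0} for _ in range(6)]
print(len(binpack(executors)))  # → 2
```

Accounting for the per-node overhead up front is what prevents the scheduler from assuming capacity that daemon sets will actually consume.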

Clean up

Don’t forget to clean up the resources you created to avoid any unnecessary charges.

  1. Delete all the virtual clusters that you created:
    # List all the virtual cluster IDs
    aws emr-containers list-virtual-clusters
    # Delete a virtual cluster by passing its ID
    aws emr-containers delete-virtual-cluster --id <virtual-cluster-id>

  2. Delete the Amazon EKS cluster:
    eksctl delete cluster emr-on-eks-blog-cluster

  3. Delete the blog-emrJobExecutionRole role and its policies.

Conclusion

In this post, we saw how to create an Amazon EKS cluster, configure Amazon EKS managed node groups, create an EMR virtual cluster on Amazon EKS, and submit Spark jobs. Using pod templates, we saw how to ensure Spark workloads are scheduled in the same Availability Zone and utilize Spot with Karpenter auto scaling to reduce costs and optimize your Spark workloads.

To get started, try out the EMR on EKS workshop. For more resources, refer to the following:

About the author

Jamal Arif is a Solutions Architect at AWS and a containers specialist. He helps AWS customers in their modernization journey to build innovative, resilient, and cost-effective solutions. In his spare time, Jamal enjoys spending time outdoors with his family hiking and mountain biking.
