[$] A plan for the kernel Bugzilla

Post Syndicated from original https://lwn.net/Articles/910740/

The kernel’s Bugzilla
instance
is largely unloved and ignored, at least as a bug-reporting
tool for the bulk of the upstream kernel. At the recent Maintainers Summit,
Bugzilla was discussed during the regression-handling session led by Thorsten
Leemhuis. In a followup to that discussion, Leemhuis posted
some ideas for improving the state of bugzilla.kernel.org to the
ksummit-discuss mailing list recently; the resulting discussion helped
clarify a number of problem areas for it—and for the Bugzilla tool itself.

Разследване на Валя Ахчиева Пороят: Магистралите в гората

Post Syndicated from Биволъ original https://bivol.bg/%D0%BF%D0%BE%D1%80%D0%BE%D1%8F%D1%82-%D0%BC%D0%B0%D0%B3%D0%B8%D1%81%D1%82%D1%80%D0%B0%D0%BB%D0%B8%D1%82%D0%B5-%D0%B2-%D0%B3%D0%BE%D1%80%D0%B0%D1%82%D0%B0.html

вторник 11 октомври 2022


“Никой няма да забрави картините с пороя, удавил на втори септември трите карловски села – Каравелово, Богдан и Слатина. И трите са в подножието на планината – Средна гора. Истински…

Improve federated queries with predicate pushdown in Amazon Athena

Post Syndicated from Rohit Bansal original https://aws.amazon.com/blogs/big-data/improve-federated-queries-with-predicate-pushdown-in-amazon-athena/

In modern data architectures, it’s common to store data in multiple data sources. However, organizations embracing this approach still need insights from their data and require technologies that help them break down data silos. Amazon Athena is an interactive query service that makes it easy to analyze structured, unstructured, and semi-structured data stored in Amazon Simple Storage Service (Amazon S3) in addition to relational, non-relation, object, and custom data sources through its query federation capabilities. Athena is serverless, so there’s no infrastructure to manage, and you only pay for the queries that you run.

Organizations building a modern data architecture want to query data in-place from purpose-built data stores without building complex extract, transform, and load (ETL) pipelines. Athena’s federated query feature allows organizations to achieve this and makes it easy to:

  • Create reports and dashboards from data stored in relational, non-relational, object, and custom data sources
  • Run on-demand analysis on data spread across multiple systems of record using a single tool and single SQL dialect
  • Join multiple data sources together to produce new input features for machine learning model training workflows

However, when querying and joining huge amounts of data from different data stores, it’s important for queries to run quickly, at low cost, and without impacting source systems. Predicate pushdown is supported by many query engines and is a technique that can drastically reduce query processing time by filtering data at the source early in the processing workflow. In this post, you’ll learn how predicate pushdown improves query performance and how you can validate when Athena applies predicate pushdown to federated queries.

Benefits of predicate pushdown

The key benefits of predicate pushdown are as follows:

  • Improved query runtime
  • Reduced network traffic between Athena and the data source
  • Reduced load on the remote data source
  • Reduced cost resulting from reduced data scans

Let’s explore a real-world scenario to understand when predicate pushdown is applied to federated queries in Athena.

Solution overview

Imagine a hypothetical ecommerce company with data stored in

Record counts for these tables are as follows.

Data Store Table Name Number of Records Description
Amazon Redshift Catalog_Sales 4.3 billion Current and historical Sales data fact Table
Amazon Redshift Date_dim 73,000 Date Dimension table
DynamoDB Part 20,000 Realtime Parts and Inventory data
DynamoDB Partsupp 80,000 Realtime Parts and supplier data
Aurora MySQL Supplier 1,000 Latest Supplier transactions
Aurora MySQL Customer 15,000 Latest Customer transactions

Our requirement is to query these sources individually and join the data to track pricing and supplier information and compare recent data with historical data using SQL queries with various filters applied. We’ll use Athena federated queries to query and join data from these sources to meet this requirement.

The following diagram depicts how Athena federated queries use data source connectors run as Lambda functions to query data stored in sources other than Amazon S3.

When a federated query is submitted against a data source, Athena invokes the data source connector to determine how to read the requested table and identify filter predicates in the WHERE clause of the query that can be pushed down to the source. Applicable filters are automatically pushed down by Athena and have the effect of omitting unnecessary rows early in the query processing workflow and improving overall query execution time.

Let’s explore three use cases to demonstrate predicate pushdown for our ecommerce company using each of these services.

Prerequisites

As a prerequisite, review Using Amazon Athena Federated Query to know more about Athena federated queries and how to deploy these data source connectors.

Use case 1: Amazon Redshift

In our first scenario, we run an Athena federated query on Amazon Redshift by joining its Catalog_sales and Date_dim tables. We do this to show the number of sales orders grouped by order date. The following query gets the information required and takes approximately 14 seconds scanning approximately 43 MB of data:

SELECT "d_date" AS Order_date,
     count(1) AS Total_Orders
 FROM "lambda:redshift"."order_schema"."catalog_sales" l,
     "lambda:redshift"."order_schema"."date_dim" d
 WHERE l.cs_sold_date_sk = d_date_sk
     and cs_sold_date_sk between 2450815 and 2450822 --Date keys for first week of Jan 1998
 GROUP BY "d_date"
 order by "d_date" 

Athena pushes the following filters to the source for processing:

  • cs_sold_date_sk between 2450815 and 2450822 for the Catalog_Sales table in Amazon Redshift.
  • d_date_sk between 2450815 and 2450822; because of the join l.cs_sold_date_sk=d_date_sk in the query, the Date_dim table is also filtered at the source, and only filtered data is moved from Amazon Redshift to Athena.

Let’s analyze the query plan by using recently released visual explain tool to confirm the filter predicates are pushed to the data source:

As shown above (only displaying the relevant part of the visual explain plan), because of the predicate pushdown, the Catalog_sales and Date_dim tables have filters applied at the source. Athena processes only the resulting filtered data.

Using the Athena console, we can see query processing details using the recently released query stats to interactively explore processing details with predicate pushdown at the query stage:

Displaying only the relevant query processing stages, Catalog_sales table has approximately 4.3 billion records, and Date_dim has approximately 73,000 records in Amazon Redshift. Only 11 million records from the Catalog_sales (Stage 4) and 8 records from the Date_dim (Stage 5) are passed from source to Athena, because the predicate pushdown pushes query filter conditions to the data sources. This filters out unneeded records at the source, and only brings the required rows to Athena.

Using predicate pushdown resulted in scanning 99.75% less data from Catalog_sales and 99.99% less data from Date_dim. This results in a faster query runtime and lower cost.

Use case 2: Amazon Redshift and Aurora MySQL

In our second use case, we run an Athena federated query on Aurora MySQL and Amazon Redshift data stores. This query joins the Catalog_sales and Date_dim tables in Amazon Redshift with the Customer table in the Aurora MySQL database to get the total number of orders with the total amount spent by each customer for the first week in January 1998 for the market segment of AUTOMOBILE. The following query gets the information required and takes approximately 35 seconds scanning approximately 337 MB of data:

SELECT  cs_bill_customer_sk Customer_id ,"d_date" Order_Date 
 ,count("cs_order_number") Total_Orders ,sum(l.cs_net_paid_inc_ship_tax) AS Total_Amount
 FROM "lambda:mysql".sales.customer c,"lambda:redshift"."order_schema"."catalog_sales" l
 ,"lambda:redshift"."order_schema"."date_dim" d
 WHERE c_mktsegment = 'AUTOMOBILE'
 AND c_custkey = cs_bill_customer_sk
 AND l.cs_sold_date_sk=d_date_sk 
 AND cs_sold_date_sk between 2450815 and 2450822 --Date keys for first week of Jan 1998
 GROUP BY cs_bill_customer_sk,"d_date"  
 ORDER BY cs_bill_customer_sk,"d_date"

Athena pushes the following filters to the data sources for processing:

  • cs_sold_date_sk between 2450815 and 2450822 for the Catalog_Sales table in Amazon Redshift.
  • d_date_sk between 2450815 and 2450822; because of the join l.cs_sold_date_sk=d_date_sk in the query, the Date_dim table is also filtered at the source (Amazon Redshift) and only filtered data is moved from Amazon Redshift to Athena.
  • c_mktsegment = 'AUTOMOBILE' for the Customer table in the Aurora MySQL database.

Now let’s consult the visual explain plan for this query to show the predicate pushdown to the source for processing:

As shown above (only displaying the relevant part of the visual explain plan), because of the predicate pushdown, Catalog_sales and Date_dim have the query filter applied at the source (Amazon Redshift), and the customer table has the market segment AUTOMOBILE filter applied at the source (Aurora MySQL). This brings only the filtered data to Athena.

As before, we can see query processing details using the recently released query stats to interactively explore processing details with predicate pushdown at the query stage:

Displaying only the relevant query processing stages, Catalog_sales has 4.3 billion records, Date_Dim has 73,000 records in Amazon Redshift, and Customer has 15,000 records in Aurora MySQL. Only 11 million records from Catalog_sales (Stage 6), 8 records from Date_dim (Stage 7), and 3,000 records from Customer (Stage 5) are passed from the respective sources to Athena because the predicate pushdown pushes query filter conditions to the data sources. This filters out unneeded records at the source and only brings the required rows to Athena.

Here, predicate pushdown resulted in scanning 99.75% less data from Catalog_sales, 99.99% less data from Date_dim, and 79.91% from Customer. Furthermore, this results in a faster query runtime and reduced cost.

Use case 3: Amazon Redshift, Aurora MySQL, and DynamoDB

For our third use case, we run an Athena federated query on Aurora MySQL, Amazon Redshift, and DynamoDB data stores. This query joins the Part and Partsupp tables in DynamoDB, the Catalog_sales and Date_dim tables in Amazon Redshift, and the Supplier and Customer tables in Aurora MySQL to get the quantities available at each supplier for orders with the highest revenue during the first week of January 1998 for the market segment of AUTOMOBILE and parts manufactured by Manufacturer#1.

The following query gets the information required and takes approximately 33 seconds scanning approximately 428 MB of data in Athena:

SELECT "d_date" Order_Date 
     ,c_mktsegment
     ,"cs_order_number"
     ,l.cs_item_sk Part_Key
     ,p.p_name Part_Name
     ,s.s_name Supplier_Name
     ,ps.ps_availqty Supplier_Avail_Qty
     ,l.cs_quantity Order_Qty
     ,l.cs_net_paid_inc_ship_tax Order_Total
 FROM "lambda:dynamo".default.part p, 
     "lambda:mysql".sales.supplier s, 
     "lambda:redshift"."order_schema"."catalog_sales" l, 
     "lambda:dynamo".default.partsupp ps, 
     "lambda:mysql".sales.customer c,
     "lambda:redshift"."order_schema"."date_dim" d
 WHERE 
     c_custkey = cs_bill_customer_sk
     AND l.cs_sold_date_sk=d_date_sk 
     AND c.c_mktsegment = 'AUTOMOBILE'
     AND cs_sold_date_sk between 2450815 and 2450822 --Date keys for first week of Jan 1998
     AND p.p_partkey=ps.ps_partkey
     AND s.s_suppkey=ps.ps_suppkey
     AND p.p_partkey=l.cs_item_sk
     AND p.p_mfgr='Manufacturer#1'

Athena pushes the following filters to the data sources for processing:

  • cs_sold_date_sk between 2450815 and 2450822 for the Catalog_Sales table in Amazon Redshift.
  • d_date_sk between 2450815 and 2450822; because of the join l.cs_sold_date_sk=d_date_sk in the query, the Date_dim table is also filtered at the source and only filtered data is moved from Amazon Redshift to Athena.
  • c_mktsegment = 'AUTOMOBILE' for the Customer table in the Aurora MySQL database.
  • p.p_mfgr='Manufacturer#1' for the Part table in DynamoDB.

Now let’s run the explain plan for this query to confirm predicates are pushed down to the source for processing:

As shown above (displaying only the relevant part of the plan), because of the predicate pushdown, Catalog_sales and Date_dim have the query filter applied at the source (Amazon Redshift), the Customer table has the market segment AUTOMOBILE filter applied at the source (Aurora MySQL), and the Part table has the part manufactured by Manufacturer#1 filter applied at the source (DynamoDB).

We can analyze query processing details using the recently released query stats to interactively explore processing details with predicate pushdown at the query stage:

Displaying only the relevant processing stages, Catalog_sales has 4.3 billion records, Date_Dim has 73,000 records in Amazon Redshift, Customer has 15,000 records in Aurora MySQL, and Part has 20,000 records in DynamoDB. Only 11 million records from Catalog_sales (Stage 5), 8 records from Date_dim (Stage 9), 3,000 records from Customer (Stage 8), and 4,000 records from Part (Stage 4) are passed from their respective sources to Athena, because the predicate pushdown pushes query filter conditions to the data sources. This filters out unneeded records at the source, and only brings the required rows from the sources to Athena.

Considerations for predicate pushdown

When using Athena to query your data sources, consider the following:

  • Depending on the data source, data source connector, and query complexity, Athena can push filter predicates to the source for processing. The following are some of the sources Athena supports predicate pushdown with:
  • Athena also performs predicate pushdown on data stored in an S3 data lake. And, with predicate pushdown for supported sources, you can join all your data sources in one query and achieve fast query performance.
  • You can use the recently released query stats as well as EXPLAIN and EXPLAIN ANALYZE on your queries to confirm predicates are pushed down to the source.
  • Queries may not have predicates pushed to the source if the query’s WHERE clause uses Athena-specific functions (for example, WHERE log2(col)<10).

Conclusion

In this post, we demonstrated three federated query scenarios on Aurora MySQL, Amazon Redshift, and DynamoDB to show how predicate pushdown improves federated query performance and reduces cost and how you can validate when predicate pushdown occurs. If the federated data source supports parallel scans, then predicate pushdown makes it possible to achieve performance that is close to the performance of Athena queries on data stored in Amazon S3. You can utilize the patterns and recommendations outlined in this post when querying supported data sources to improve overall query performance and minimize data scanned.


About the authors

Rohit Bansal is an Analytics Specialist Solutions Architect at AWS. He has nearly two decades of experience helping customers modernize their data platforms. He is passionate about helping customers build scalable, cost-effective data and analytics solutions in the cloud. In his spare time, he enjoys spending time with his family, travel, and road cycling.

Ruchir Tripathi is a Senior Analytics Solutions Architect aligned to Global Financial Services at AWS. He is passionate about helping enterprises build scalable, performant, and cost-effective solutions in the cloud. Prior to joining AWS, Ruchir worked with major financial institutions and is based out of New York Office.

Land data from databases to a data lake at scale using AWS Glue blueprints

Post Syndicated from Moustafa Mahmoud original https://aws.amazon.com/blogs/big-data/land-data-from-databases-to-a-data-lake-at-scale-using-aws-glue-blueprints/

To build a data lake on AWS, a common data ingestion pattern is to use AWS Glue jobs to perform extract, transform, and load (ETL) data from relational databases to Amazon Simple Storage Service (Amazon S3). A project often involves extracting hundreds of tables from source databases to the data lake raw layer. And for each source table, it’s recommended to have a separate AWS Glue job to simplify operations, state management, and error handling. This approach works perfectly with a small number of tables. However, with hundreds of tables, this results in hundreds of ETL jobs, and managing AWS Glue jobs at this scale may pose an operational challenge if you’re not yet ready to deploy using a CI/CD pipeline. Instead, we tackle this issue by decoupling the following:

  • ETL job logic – We use an AWS Glue blueprint, which allows you to reuse one blueprint for all jobs with the same logic
  • Job definition – We use a JSON file, so you can define jobs programmatically without learning a new language
  • Job deployment – With AWS Step Functions, you can copy workflows to manage different data processing use cases on AWS Glue

In this post, you will learn how to handle data lake landing jobs deployment in a standardized way—by maintaining a JSON file with table names and a few parameters (for example, a workflow catalog). AWS Glue workflows are created and updated after manually running the resources deployment flow in Step Functions. You can further customize the AWS Glue blueprints to make your own multi-step data pipelines to move data to downstream layers and purpose-built analytics services (example use cases include partitioning or importing to an Amazon DynamoDB table).

Overview of solution

The following diagram illustrates the solution architecture, which contains two major areas:

  • Resource deployment (components 1–2) – An AWS Step Functions workflow is run manually on demand to update or deploy the required AWS Glue resources. These AWS Glue resources will be used for landing data into the data lake
  • ETL job runs (components 3–6) – The AWS Glue workflows (one per source table) run on the defined schedule, and extract and land data to the data lake raw layer

The solution workflow contains the following steps:

  1. An S3 bucket stores an AWS Glue blueprint (ZIP) and the workflow catalog (JSON file).
  2. A Step Functions workflow orchestrates the AWS Glue resources creation.
  3. We use Amazon Aurora as the data source with our sample data, but any PostgreSQL database works with the provided script, or other JDBC sources with customization.
  4. AWS Secrets Manager stores the secrets of the source databases.
  5. On the predefined schedule, AWS Glue triggers relevant AWS Glue jobs to perform ETL.
  6. Extracted data is loaded into an S3 bucket that serves as the data lake raw layer.

Prerequisites

To follow along with this post, complete the following prerequisite steps.

  • If you want to use your existing database either in AWS or on premises as a data source, you need network connectivity (a subnet and security group) for the AWS Glue jobs that can access the source database, Amazon S3, and Secrets Manager.

Provision resources with AWS CloudFormation

In this step, we provision our solution resources with AWS CloudFormation.

Database with sample data (optional)

This CloudFormation stack works only in AWS Regions where Amazon Aurora Serverless v1 is supported. Complete the following steps to create a database with sample data:

  1. Choose Launch Stack.
  2. On the Create stack page, choose Next.
  3. For Stack name, enter demo-database.
  4. For DBSecurityGroup, choose select the security group for the database (for example, default).
  5. For DBSubnet, choose two or more private subnets to host the database.
  6. For ETLAZ, choose the Availability Zone for ETL jobs. It must match with ETLSubnet.
  7. For ETLSubnet, choose the subnet for the jobs. This must match with ETLAZ.

To find the subnet and corresponding Availability Zone, go to the Amazon Virtual Private Cloud (Amazon VPC) console and look at the columns Subnet ID and Availability Zone.

  1. Choose Next.
  2. On the Configure stack options page, skip the inputs and choose Next.
  3. On the Review page, choose Create stack.
  4. When the stack is complete, go to the Outputs tab and note the value for SecretsARN.

This CloudFormation stack creates the following resources:

  • An Amazon Aurora PostgreSQL-Compatible Edition (Serverless v1, engine version 11.13) database
  • A Secrets Manager secret (DemoDBSecret) storing the connection details to the source database
  • An AWS Glue network connection (rds_vpc) that can communicate with the source database and Amazon S3

Now you can populate the database with sample data. The data is generated by referencing to the sample HR schema.

  1. Open the Amazon RDS Query Editor.
  2. In the Connect to database section, provide the following information:
    • For Database instance, enter demo-<123456789012>.
    • For Database username, connect with a Secrets Manager ARN.
    • For Secrets Manager ARN, enter the ARN from the outputs of the CloudFormation stack.
    • For Database name, enter hr.
  3. Choose Connect to database.
  4. Enter the contents of the SQL file into the editor, then choose Run.

Main stack (required)

This CloudFormation stack works in all AWS Regions.

  1. Choose Launch Stack.
  2. On the Create stack page, choose Next.
  3. For Stack name, enter data-lake-landing.
  4. For BlueprintName, enter a name for your blueprint (default: data-lake-landing).
  5. For S3BucketNamePrefix, enter a prefix (default: data-lake-raw-layer).
  6. Choose Next.
  7. On the Configure stack options page, skip the inputs and choose Next.
  8. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  9. Choose Create stack.
  10. When the stack is complete, go to the Outputs tab and note the names of the S3 bucket (for example, data-lake-raw-layer-123456789012-region) and Step Functions workflow (for example, data-lake-landing).

The CloudFormation stack creates the following resources:

  • An S3 bucket as the data lake raw layer
  • A Step Functions workflow (see the definition on the GitHub repo)
  • AWS Identity and Access Management (IAM) roles and policies for the Step Functions workflow to provision AWS Glue resources and AWS Glue job executions.

The GlueExecutionRole is limited to the DemoDBSecret in Secrets Manager. If you need to connect to other databases which has a different endpoint/address or credentials, don’t forget to create new secrets and grant additional permissions to the IAM role or secrets so your AWS Glue jobs can authenticate with the source databases.

Prepare database connections

If you want to use this solution to perform ETL against your existing databases, follow this section. Otherwise, if you have deployed the CloudFormation stack for the database with sample data, jump to the section “Edit the workflow catalog”.

You need to have a running PostgreSQL database ready. To connect to other database engines, you need to customize this solution, particularly the jdbcUrl in the supplied PySpark script.

Create the database secret

To create your Secrets Manager secret, complete the following steps:

  1. On the Secrets Manager console, choose Store a new secret.
  2. For Secret type, choose Credentials for Amazon RDS database or Credentials for other database.
  3. For Credentials, enter the user name and password to your database.
  4. For Encryption key, keep the default AWS Key Management Service (AWS KMS) managed key aws/secretsmanager.
  5. For Database, choose the database instance, or manually input the engine, server address, database name, and port.
  6. Choose Next.
  7. For Secret name, enter a name for your secret (for example, rds-secrets).
  8. Choose Next.
  9. Skip the Configure rotation – optional page and choose Next.
  10. Review the summary and choose Store.

Create the AWS Glue connection

To create your AWS Glue connection, complete the following steps:

  1. On the AWS Glue Studio console, choose Connectors.
  2. Under Connections, choose Create connection.
  3. For Name, enter a name (for example, rds-vpc).
  4. For Connection type, choose Network.
  5. For the VPC, subnet, and security groups (prepared in the prerequisite steps), enter where the ETL jobs run and are able to connect to the source database, Amazon S3, and Secrets Manager.
  6. Choose Create connection.

You’re now ready to configure the rest of the solution.

Edit the workflow catalog

To download the workflow catalog, complete the following steps:

  1. Download and edit the sample file.
  2. If you are using the provided sample database, you must change the values of GlueExecutionRole and DestinationBucketName. If you are using your own databases, you must change all vaules except WorkflowName, JobScheduleType, and ScheduleCronPattern.
{
  "WorkflowList": [
    {
      "WorkflowName": "rds_to_s3_public_regions",
      "GlueExecutionRole": "arn:aws:iam::123456789012:role/AWSGlueServiceRole-data-lake-landing",
      "NetworkConnectionName": "rds-vpc",
      "SecretName": "DemoDBSecret",
      "SourceTable": "public.regions",
      "DestinationBucketName": "data-lake-raw-layer-123456789012-eu-west-1",
      "TransactionTimestampColumnName": "updated_at",
      "JobScheduleType": "Cron",
      "ScheduleCronPattern": "10 21 * * ? *"
    }
  ]
}
  1. Rename the file your_blueprint_name.json and upload it to your S3 bucket (for example, s3://data-lake-raw-layer-123456789012-eu-west-1/data-lake-landing.json).

The example workflow has the JobScheduleType set to Cron. See Time-based schedules for jobs and crawlers for examples setting cron patterns. Alternatively set JobScheduleType to OnDemand.

See blueprint.cfg for the full list of parameters.

The provided workflow catalog JSON file contains job definitions of seven tables: public.regions, public.countries, public.locations, public.departments, public.jobs, public.employees, and public.job_history.

Review the PySpark script (optional)

The sample script performs the following:

  1. Read the updated records from the source table:
jdbc_df = (spark.read.format("jdbc")
           .option("url", jdbcUrl)
           .option("user", secret["username"])
           .option("password", secret["password"])
           .option("query", sql_query)
           .load()
)
  1. Add the date and timestamp columns:
df_withdate = jdbc_df.withColumn("ingestion_timestamp", lit(current_timestamp()))
  1. Write the DataFrame to Amazon S3 as Parquet files.

Prepare the AWS Glue blueprint

Prepare your AWS Glue blueprint with the following steps:

  1. Download the sample file and unzip it in your local computer.
  2. Make any necessary changes to the PySpark script to include your own logic, and compress the three files (blueprint.cfg, jdbc_to_s3.py, layout.py; exclude any folders) as your_blueprint_name.zip (for example, data-lake-landing.zip):
zip data-lake-landing.zip blueprint.cfg jdbc_to_s3.py layout.py
  1. Upload to the S3 bucket (for example, s3://data-lake-raw-layer-123456789012-region/data-lake-landing.zip).

Now you should have two files uploaded to your S3 bucket.

Run the Step Functions workflow to deploy AWS Glue resources

To run the Step Functions workflow, complete the following steps:

  1. On the Step Functions console, select your state machine (data-lake-landing) and choose View details.
  2. Choose Start execution.
  3. Keep the default values in the pop-up.
  4. Choose Start execution.
  5. Wait until the Success step at the bottom turns green.

It’s normal to have some intermediate steps with the status “Caught error.”

When the workflow catalog contains a large number of ETL job entries, you can expect some delays. In our test environment, creating 100 jobs from a clean state can take around 22 minutes; the second run (deleting existing AWS Glue resources and creating 100 jobs) can take around 27 minutes.

Verify the workflow in AWS Glue

To check the workflow, complete the following steps:

  1. On the AWS Glue console, choose Workflows.
  2. Verify that all AWS Glue workflows defined in workflow_config.json are listed.
  3. Select one of the workflows, and on the Action menu, choose Run.
  4. Wait for about 3 minutes (or longer if not using the provided database with sample data), and verify on the Amazon S3 console that new Parquet files are created in your data lake (for example, s3://data-lake-raw-layer-123456789012-region/database/table/ingestion_date=yyyy-mm-dd/).

Step Functions workflow overview

This section describes the major steps in the Step Functions workflow.

Register the AWS Glue blueprint

A blueprint allows you to parameterize a workflow (defining jobs and crawlers), and subsequently generate multiple AWS Glue workflows reusing the same code logic to handle similar data ETL activities. The following diagram illustrates the AWS Glue blueprint registration part of the Step Functions workflow.

The step Glue: CreateBlueprint takes the ZIP archive in Amazon S3 (sample) and registers it for later use.

To understand how to develop a blueprint, see Developing blueprints in AWS Glue.

Parse the workflow catalog and clean up resources

The step S3: ParseGlueWorkflowsConfig triggers the following Map state, and runs a set of steps for each element of an input array.

We set the maximum concurrency to five parallel iterations to lower the chance of exceeding the maximum allowed API request rate (per account per Region). For each ETL job definition, the Step Functions workflow cleans up relevant AWS Glue resources (if they exist), including the workflow, job, and trigger.

For more information on the Map state, refer to Map.

Run the AWS Glue blueprint

Within the Map state, the step Glue: CreateWorkflowFromBlueprint starts an asynchronous process to create the AWS Glue workflow (for each job definition), and the jobs and triggers that the workflow encapsulates.

In this solution, all AWS Glue workflows share the same logic, beginning with a trigger to handle the schedule, followed by a job to run the ETL logic.

As indicated by the step CreateWorkflowFailed, any AWS Glue blueprint creation failure stops the whole Step Functions workflow and marks it with a failed status. Note that no rollback will happen. Fix the errors and rerun the Step Functions workflow. This will not result in duplicated AWS Glue resources and existing ones will be cleaned up in the process.

Limitations

Note the following limitations of this solution:

  • Each run of the Step Functions workflow deletes all relevant AWS Glue jobs defined in the workflow catalog, and creates new jobs with a different (random) suffix. As a result, you will lose the job run history in AWS Glue. The underlying metrics and logs are retained in Amazon CloudWatch.

Clean up

To avoid incurring future charges, perform the following steps:

  1. Disable the schedules of the deployed AWS Glue jobs:
    1. Open the workload configuration file in your S3 bucket (s3://data-lake-raw-layer-123456789012-eu-west-1/data-lake-landing.json) and replace the value of JobScheduleType to OnDemand for all workflow definitions.
    2. Run the Step Functions workflow (data-lake-landing).
    3. Observe that all AWS Glue triggers ending with _starting_trigger have the trigger type On-demand instead of Schedule.
  2. Empty the S3 bucket and delete the CloudFormation stack.
  3. Delete the deployed AWS Glue resources:
    1. All AWS Glue triggers ending with _starting_trigger.
    2. All AWS Glue jobs starting with the WorkflowName defined in the workflow catalog.
    3. All AWS Glue workflows with the WorkflowName defined in the workflow catalog.
    4. AWS Glue blueprints.

Conclusion

AWS Glue blueprints allow data engineers to build and maintain AWS Glue jobs landing data from RDBMS to your data lake at scale.By adopting this standardized and reusable approach, instead of maintaining hundreds of AWS Glue jobs, you now keep track the workflow catalog. When you have new tables to land to your data lake, simply add the entries to your workflow catalog and rerun the Step Functions workflow to deploy resources.

We highly encourage you to customize the blueprints for your multi-step data pipeline (for example, detect and mask sensitive data) and make them available to your organization and the AWS Glue community. To get started, see the Performing complex ETL activities using blueprints and workflows in AWS Glue and the sample blueprints on GitHub. If you have any questions, please leave a comment.


About the Authors

Moustafa Mahmoud is a Solutions Architect of AWS Data Lab with a passion for data integration, data analysis, machine learning, and BI. Moustafa helps customers convert their ideas to a production-ready data product on AWS. He has over 10 years of experience as a data engineer, machine learning practitioner, and software developer. In his spare time, Moustafa loves exploring nature, reading, and spending time with friends and family.

Corvus Lee is a Solutions Architect of AWS Data Lab. He enjoys all kinds of data-related discussions, and helps customers build MVPs using AWS Databases, Analytics, and Machine Learning services.

Patch Tuesday – October 2022

Post Syndicated from Greg Wiseman original https://blog.rapid7.com/2022/10/11/patch-tuesday-october-2022/

Patch Tuesday - October 2022

The October batch of CVEs published by Microsoft includes 96 vulnerabilities, including 12 fixed earlier this month that affect the Chromium project used by their Edge browser.

Top of mind for many this month is whether Microsoft would patch the two Exchange Server zero-day vulnerabilities (CVE-2022-41040 and CVE-2022-41082) disclosed at the end of September. While Microsoft was relatively quick to acknowledge the vulnerabilities and provide mitigation steps, their guidance has continually changed as the recommended rules to block attack traffic get bypassed. This whack-a-mole approach seems likely to continue until a proper patch addressing the root causes is available; unfortunately, it doesn’t look like that will be happening today. Thankfully, the impact should be more limited than 2021’s ProxyShell and ProxyLogon vulnerabilities due to attackers needing to be authenticated to the server for successful exploitation. Reports are also surfacing about an additional zero-day distinct from these being used in ransomware attacks; however, these have not yet been substantiated.

Microsoft did address two other zero-day vulnerabilities with today’s patches. CVE-2022-41033, an Elevation of Privilege vulnerability affecting the COM+ Event System Service in all supported versions of Windows, has been seen exploited in the wild. CVE-2022-41043 is an Information Disclosure vulnerability affecting Office for Mac that was publicly disclosed but not (yet) seen exploited in the wild.

Nine CVEs categorized as Remote Code Execution (RCE) with Critical severity were also patched today – seven of them affect the Point-to-Point Tunneling Protocol, and like those fixed last month, require an attacker to win a race condition to exploit them. CVE-2022-38048 affects all supported versions of Office, and CVE-2022-41038 could allow an attacker authenticated to SharePoint to execute arbitrary code on the server, provided the account has “Manage List” permissions.

Maxing out the CVSS base score with a 10.0 this month is CVE-2022-37968, an Elevation of Privilege vulnerability in the Azure Arc-enabled Kubernetes cluster Connect component. It’s unclear why Microsoft has assigned such a high score, given that an attacker would need to know the randomly generated external DNS endpoint for an Azure Arc-enabled Kubernetes cluster (arguably making the Attack Complexity “High”). That said, if this condition is met then an unauthenticated user could become a cluster admin and potentially gain control over the Kubernetes cluster. Users of Azure Arc and Azure Stack Edge should check whether auto-updates are turned on, and if not, upgrade manually as soon as possible.

Summary charts

Patch Tuesday - October 2022
Patch Tuesday - October 2022
Patch Tuesday - October 2022
Patch Tuesday - October 2022

Summary tables

Azure vulnerabilities

CVE Title Exploited? Publicly disclosed? CVSSv3 base score Has FAQ?
CVE-2022-37968 Azure Arc-enabled Kubernetes cluster Connect Elevation of Privilege Vulnerability No No 10 Yes
CVE-2022-38017 StorSimple 8000 Series Elevation of Privilege Vulnerability No No 6.8 Yes
CVE-2022-35829 Service Fabric Explorer Spoofing Vulnerability No No 6.2 Yes

Browser vulnerabilities

CVE Title Exploited? Publicly disclosed? CVSSv3 base score Has FAQ?
CVE-2022-41035 Microsoft Edge (Chromium-based) Spoofing Vulnerability No No 8.3 Yes
CVE-2022-3373 Chromium: CVE-2022-3373 Out of bounds write in V8 No No N/A Yes
CVE-2022-3370 Chromium: CVE-2022-3370 Use after free in Custom Elements No No N/A Yes
CVE-2022-3317 Chromium: CVE-2022-3317 Insufficient validation of untrusted input in Intents No No N/A Yes
CVE-2022-3316 Chromium: CVE-2022-3316 Insufficient validation of untrusted input in Safe Browsing No No N/A Yes
CVE-2022-3315 Chromium: CVE-2022-3315 Type confusion in Blink No No N/A Yes
CVE-2022-3313 Chromium: CVE-2022-3313 Incorrect security UI in Full Screen No No N/A Yes
CVE-2022-3311 Chromium: CVE-2022-3311 Use after free in Import No No N/A Yes
CVE-2022-3310 Chromium: CVE-2022-3310 Insufficient policy enforcement in Custom Tabs No No N/A Yes
CVE-2022-3308 Chromium: CVE-2022-3308 Insufficient policy enforcement in Developer Tools No No N/A Yes
CVE-2022-3307 Chromium: CVE-2022-3307 Use after free in Media No No N/A Yes
CVE-2022-3304 Chromium: CVE-2022-3304 Use after free in CSS No No N/A Yes

Developer Tools vulnerabilities

CVE Title Exploited? Publicly disclosed? CVSSv3 base score Has FAQ?
CVE-2022-41034 Visual Studio Code Remote Code Execution Vulnerability No No 7.8 Yes
CVE-2022-41083 Visual Studio Code Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-41032 NuGet Client Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-41042 Visual Studio Code Information Disclosure Vulnerability No No 7.4 Yes

Microsoft Office vulnerabilities

CVE Title Exploited? Publicly disclosed? CVSSv3 base score Has FAQ?
CVE-2022-41038 Microsoft SharePoint Server Remote Code Execution Vulnerability No No 8.8 Yes
CVE-2022-41036 Microsoft SharePoint Server Remote Code Execution Vulnerability No No 8.8 Yes
CVE-2022-41037 Microsoft SharePoint Server Remote Code Execution Vulnerability No No 8.8 Yes
CVE-2022-38053 Microsoft SharePoint Server Remote Code Execution Vulnerability No No 8.8 Yes
CVE-2022-41031 Microsoft Word Remote Code Execution Vulnerability No No 7.8 Yes
CVE-2022-38048 Microsoft Office Remote Code Execution Vulnerability No No 7.8 Yes
CVE-2022-38049 Microsoft Office Graphics Remote Code Execution Vulnerability No No 7.8 Yes
CVE-2022-38001 Microsoft Office Spoofing Vulnerability No No 6.5 Yes
CVE-2022-41043 Microsoft Office Information Disclosure Vulnerability No Yes 3.3 Yes

System Center vulnerabilities

CVE Title Exploited? Publicly disclosed? CVSSv3 base score Has FAQ?
CVE-2022-37971 Microsoft Windows Defender Elevation of Privilege Vulnerability No No 7.1 Yes

Windows vulnerabilities

CVE Title Exploited? Publicly disclosed? CVSSv3 base score Has FAQ?
CVE-2022-38016 Windows Local Security Authority (LSA) Elevation of Privilege Vulnerability No No 8.8 Yes
CVE-2022-38045 Server Service Remote Protocol Elevation of Privilege Vulnerability No No 8.8 Yes
CVE-2022-37984 Windows WLAN Service Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-38003 Windows Resilient File System Elevation of Privilege No No 7.8 Yes
CVE-2022-38028 Windows Print Spooler Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-38039 Windows Kernel Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-37995 Windows Kernel Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-37979 Windows Hyper-V Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-37970 Windows DWM Core Library Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-37980 Windows DHCP Client Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-38050 Win32k Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-37983 Microsoft DWM Core Library Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-37998 Windows Local Session Manager (LSM) Denial of Service Vulnerability No No 7.7 Yes
CVE-2022-37973 Windows Local Session Manager (LSM) Denial of Service Vulnerability No No 7.7 Yes
CVE-2022-38036 Internet Key Exchange (IKE) Protocol Denial of Service Vulnerability No No 7.5 No
CVE-2022-38027 Windows Storage Elevation of Privilege Vulnerability No No 7 Yes
CVE-2022-38021 Connected User Experiences and Telemetry Elevation of Privilege Vulnerability No No 7 Yes
CVE-2022-37974 Windows Mixed Reality Developer Tools Information Disclosure Vulnerability No No 6.5 Yes
CVE-2022-38046 Web Account Manager Information Disclosure Vulnerability No No 6.2 Yes
CVE-2022-37965 Windows Point-to-Point Tunneling Protocol Denial of Service Vulnerability No No 5.9 Yes
CVE-2022-37996 Windows Kernel Memory Information Disclosure Vulnerability No No 5.5 Yes
CVE-2022-38025 Windows Distributed File System (DFS) Information Disclosure Vulnerability No No 5.5 Yes
CVE-2022-38030 Windows USB Serial Driver Information Disclosure Vulnerability No No 4.3 Yes

Windows ESU vulnerabilities

CVE Title Exploited? Publicly disclosed? CVSSv3 base score Has FAQ?
CVE-2022-37982 Microsoft WDAC OLE DB provider for SQL Server Remote Code Execution Vulnerability No No 8.8 Yes
CVE-2022-38031 Microsoft WDAC OLE DB provider for SQL Server Remote Code Execution Vulnerability No No 8.8 Yes
CVE-2022-38040 Microsoft ODBC Driver Remote Code Execution Vulnerability No No 8.8 Yes
CVE-2022-37976 Active Directory Certificate Services Elevation of Privilege Vulnerability No No 8.8 Yes
CVE-2022-30198 Windows Point-to-Point Tunneling Protocol Remote Code Execution Vulnerability No No 8.1 Yes
CVE-2022-22035 Windows Point-to-Point Tunneling Protocol Remote Code Execution Vulnerability No No 8.1 Yes
CVE-2022-24504 Windows Point-to-Point Tunneling Protocol Remote Code Execution Vulnerability No No 8.1 Yes
CVE-2022-33634 Windows Point-to-Point Tunneling Protocol Remote Code Execution Vulnerability No No 8.1 Yes
CVE-2022-38047 Windows Point-to-Point Tunneling Protocol Remote Code Execution Vulnerability No No 8.1 Yes
CVE-2022-38000 Windows Point-to-Point Tunneling Protocol Remote Code Execution Vulnerability No No 8.1 Yes
CVE-2022-41081 Windows Point-to-Point Tunneling Protocol Remote Code Execution Vulnerability No No 8.1 Yes
CVE-2022-37986 Windows Win32k Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-37988 Windows Kernel Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-38037 Windows Kernel Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-38038 Windows Kernel Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-37990 Windows Kernel Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-37991 Windows Kernel Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-37999 Windows Group Policy Preference Client Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-37993 Windows Group Policy Preference Client Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-37994 Windows Group Policy Preference Client Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-37975 Windows Group Policy Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-38051 Windows Graphics Component Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-37997 Windows Graphics Component Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-33635 Windows GDI+ Remote Code Execution Vulnerability No No 7.8 Yes
CVE-2022-37987 Windows Client Server Run-time Subsystem (CSRSS) Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-37989 Windows Client Server Run-time Subsystem (CSRSS) Elevation of Privilege Vulnerability No No 7.8 Yes
CVE-2022-41033 Windows COM+ Event System Service Elevation of Privilege Vulnerability Yes No 7.8 Yes
CVE-2022-38044 Windows CD-ROM File System Driver Remote Code Execution Vulnerability No No 7.8 Yes
CVE-2022-33645 Windows TCP/IP Driver Denial of Service Vulnerability No No 7.5 No
CVE-2022-38041 Windows Secure Channel Denial of Service Vulnerability No No 7.5 No
CVE-2022-34689 Windows CryptoAPI Spoofing Vulnerability No No 7.5 Yes
CVE-2022-37978 Windows Active Directory Certificate Services Security Feature Bypass No No 7.5 Yes
CVE-2022-38042 Active Directory Domain Services Elevation of Privilege Vulnerability No No 7.1 Yes
CVE-2022-38029 Windows ALPC Elevation of Privilege Vulnerability No No 7 Yes
CVE-2022-38033 Windows Server Remotely Accessible Registry Keys Information Disclosure Vulnerability No No 6.5 Yes
CVE-2022-35770 Windows NTLM Spoofing Vulnerability No No 6.5 Yes
CVE-2022-37977 Local Security Authority Subsystem Service (LSASS) Denial of Service Vulnerability No No 6.5 No
CVE-2022-38032 Windows Portable Device Enumerator Service Security Feature Bypass Vulnerability No No 5.9 Yes
CVE-2022-38043 Windows Security Support Provider Interface Information Disclosure Vulnerability No No 5.5 Yes
CVE-2022-37985 Windows Graphics Component Information Disclosure Vulnerability No No 5.5 Yes
CVE-2022-38026 Windows DHCP Client Information Disclosure Vulnerability No No 5.5 Yes
CVE-2022-38034 Windows Workstation Service Elevation of Privilege Vulnerability No No 4.3 Yes
CVE-2022-37981 Windows Event Logging Service Denial of Service Vulnerability No No 4.3 Yes
CVE-2022-38022 Windows Kernel Elevation of Privilege Vulnerability No No 2.5 Yes

Patent clouds for the Opus codec

Post Syndicated from original https://lwn.net/Articles/910848/

The Opus codec is an audio codec that
was designed from the beginning to avoid existing patents in the field and
be royalty-free for all users. It was standardized by the IETF in 2012 as
RFC 6716.
Now a company called Vectis (“a premier
full-suite IP licensing and consultancy boutique
“) is collecting
patents that are claimed to read on Opus
as a way of demanding
royalties on its use.
The planned Opus program will focus on hardware devices and will not be
directed towards open-source software, applications, services, or
content
“. (Thanks to Paul Wise).

Media Workflowing in The Big Apple: NAB Show New York Preview

Post Syndicated from Jeremy Milk original https://www.backblaze.com/blog/media-workflowing-in-the-big-apple-nab-show-new-york-preview/


You can send media in milliseconds to just about every corner of the earth with an origin store at your favorite cloud storage company and a snappy CDN. Sadly, delivering people across continents is a touch more complicated and time intensive. Nevertheless, the Backblaze team is saddling up planes, trains, and automobiles to bring the latest on media workflows to the attendees of NAB Show New York. Whether you’re there in person or virtually, we’ll be discussing and demo-ing all the newest Backblaze B2 Cloud Storage solutions that will ensure your data can travel with ease—no mass transit needed—everywhere you need it to be.

Learn More LIVE in NYC

If you’re attending the NAB Show New York, join us in booth 1239 to learn about integrating B2 Cloud Storage into your workflow. Stop by anytime or you can schedule a meeting here. We’d love to see you.

NAB Show New York Preview: What’s New for Backblaze B2 Media Workflow Solutions

Our booth will have all the goodness you’d expect of us: partners, friendly faces, spots to take a load off and talk about making your data work harder, and, of course, some next-level SWAG. Let’s get into what you can expect.

New Pricing Models and Migration Tools

Our team is on hand to talk you through two new offerings that have been generating a lot of excitement among teams across media organizations:

  • Backblaze B2 Reserve: You can now purchase the Backblaze service many know and love in capacity-based bundles through resellers. If your team seeks 100% budget predictability with transaction fees and premium support included, you should check out this new offering. Check it out here.
  • Universal Data Migration: Recently an International Broadcasting Convention (IBC) 2022 Best of Show nominee, the service makes it easy and FREE to move data into Backblaze from legacy cloud, on-premises, and LTO/tape origins. If your current data storage is holding your team or your budget back, we’ll pay to free your media and move it to B2 Cloud Storage. Learn more here.

Six Flavors of Media Workflow Deep Dives

We’ve gathered materials and expertise to discuss or demo our six most asked about workflow improvements. We’re happy to talk about many other tools and improvements, but here are the six areas we expect to talk about the most:

  1. Moving more (or all) media production to the cloud. Ensuring everyone—clients, collaborators, employers, everyone—has easy real-time access to content is essential for the inevitable geographical distribution of modern media workflows.
  2. Reducing costs. Cloud workflows don’t need to come with costly gotchas, minimum retention penalties, and/or high costs when you actually want to use your content. We’ll explain how the right partners will unlock your budget so you can save on cloud services and spend more on creative projects.
  3. Streamlining delivery. Pairing cloud storage with the right CDN is essential to making sure your media is consumable and monetizable at the edge. From streaming services to ecommerce outlets to legacy media outlets, we’ve helped every type of media organization do more with their content.
  4. Freeing storage. Empty your expensive on-prem storage and stop adding HDs and tapes to the pile by moving finished projects to always-hot cloud storage. This doesn’t just free up space and money: Instantly accessible archives means you can work with and monetize older content with little friction in your creative process.
  5. Safeguarding content. All those tapes or HDs on a shelf, in the closet, or wherever you keep them are hard to manage and harder to access and use. Parking everything safely and securely in the cloud means all that data is centrally accessible, protected, and available for more use.
  6. Backing up (better!). Yes, we’ve got roots in backup going back >15 years—so when it comes to making sure your precious media is protected with easy access for speedy recovery, we’ve got a few thoughts (and solutions).

Partners, Partners, and More Partners…

“The more we get together, the happier we’ll be,” might as well be the theme lyric of cloud workflows. Combining best of breed platforms unlocks better value and functionality, and offers you the ability to build your cloud stack exactly how you need it for your business. We’ve got a large ecosystem of Alliance Partners, and we’re happy to get deep into your needs and demo how you can combine Backblaze B2 Cloud Storage with one or more partners including iconik, LucidLink, Synology (who will also be right next to us in the Javits Center!), and Fastly to best achieve your objectives.

Hoping to visit NAB Show New York but not yet registered? All good. You can register free on the NAB site with promo code NY4429.

Hoping We Can Help You Soon

Whether it’s in person at NAB Show New York or virtually when it works for you, we’d love to walk you through any of the solutions we can serve for hardworking media teams. If you will be in Manhattan, schedule a meeting to ensure you’ll get the right expert on our team, then stick around for the swag and good times. This invitation applies to you too, Channel Partners and Resellers—whether you have active projects or just want to learn more, let’s meet up and chat about ways to deliver more value together. If you’re not making the trip, not a problem. Just contact us here so we can arrange to help virtually.

The post Media Workflowing in The Big Apple: NAB Show New York Preview appeared first on Backblaze Blog | Cloud Storage & Cloud Backup.

Security updates for Tuesday

Post Syndicated from original https://lwn.net/Articles/910828/

Security updates have been issued by Debian (connman, dbus, git, isc-dhcp, strongswan, and wordpress), Fedora (rubygem-pdfkit and seamonkey), Red Hat (gnutls, nettle, rh-ruby27-ruby, and rh-ruby30-ruby), SUSE (libgsasl, python, and snakeyaml), and Ubuntu (graphite2, linux, linux-aws, linux-aws-5.15, linux-azure, linux-azure-5.15,
linux-gcp, linux-gke, linux-gkeop, linux-hwe-5.15, linux-kvm,
linux-lowlatency, linux-lowlatency-hwe-5.15, linux-oracle, linux-raspi, linux, linux-aws, linux-bluefield, linux-gke, linux-gkeop, linux-hwe-5.4,
linux-ibm, linux-ibm-5.4, linux-kvm, linux-oracle, linux, linux-dell300x, linux-kvm, linux-oracle, linux-raspi2,
linux-snapdragon, linux-hwe, linux-oracle, openssh, and pcre3).

Inserting a Backdoor into a Machine-Learning System

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2022/10/inserting-a-backdoor-into-a-machine-learning-system.html

Interesting research: “ImpNet: Imperceptible and blackbox-undetectable backdoors in compiled neural networks, by Tim Clifford, Ilia Shumailov, Yiren Zhao, Ross Anderson, and Robert Mullins:

Abstract: Early backdoor attacks against machine learning set off an arms race in attack and defence development. Defences have since appeared demonstrating some ability to detect backdoors in models or even remove them. These defences work by inspecting the training data, the model, or the integrity of the training procedure. In this work, we show that backdoors can be added during compilation, circumventing any safeguards in the data preparation and model training stages. As an illustration, the attacker can insert weight-based backdoors during the hardware compilation step that will not be detected by any training or data-preparation process. Next, we demonstrate that some backdoors, such as ImpNet, can only be reliably detected at the stage where they are inserted and removing them anywhere else presents a significant challenge. We conclude that machine-learning model security requires assurance of provenance along the entire technical pipeline, including the data, model architecture, compiler, and hardware specification.

Ross Anderson explains the significance:

The trick is for the compiler to recognise what sort of model it’s compiling—whether it’s processing images or text, for example—and then devising trigger mechanisms for such models that are sufficiently covert and general. The takeaway message is that for a machine-learning model to be trustworthy, you need to assure the provenance of the whole chain: the model itself, the software tools used to compile it, the training data, the order in which the data are batched and presented—in short, everything.

Amazon EC2 Trn1 Instances for High-Performance Model Training are Now Available

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/amazon-ec2-trn1-instances-for-high-performance-model-training-are-now-available/

Deep learning (DL) models have been increasing in size and complexity over the last few years, pushing the time to train from days to weeks. Training large language models the size of GPT-3 can take months, leading to an exponential growth in training cost. To reduce model training times and enable machine learning (ML) practitioners to iterate fast, AWS has been innovating across chips, servers, and data center connectivity.

At AWS re:Invent 2021, we announced the preview of Amazon EC2 Trn1 instances powered by AWS Trainium chips. AWS Trainium is optimized for high-performance deep learning training and is the second-generation ML chip built by AWS, following AWS Inferentia.

Today, I’m excited to announce that Amazon EC2 Trn1 instances are now generally available! These instances are well-suited for large-scale distributed training of complex DL models across a broad set of applications, such as natural language processing, image recognition, and more.

Compared to Amazon EC2 P4d instances, Trn1 instances deliver 1.4x the teraFLOPS for BF16 data types, 2.5x more teraFLOPS for TF32 data types, 5x the teraFLOPS for FP32 data types, 4x inter-node network bandwidth, and up to 50 percent cost-to-train savings. Trn1 instances can be deployed in EC2 UltraClusters that serve as powerful supercomputers to rapidly train complex deep learning models. I’ll share more details on EC2 UltraClusters later in this blog post.

New Trn1 Instance Highlights
Trn1 instances are available today in two sizes and are powered by up to 16 AWS Trainium chips with 128 vCPUs. They provide high-performance networking and storage to support efficient data and model parallelism, popular strategies for distributed training.

Trn1 instances offer up to 512 GB of high-bandwidth memory, deliver up to 3.4 petaFLOPS of TF32/FP16/BF16 compute power, and feature an ultra-high-speed NeuronLink interconnect between chips. NeuronLink helps avoid communication bottlenecks when scaling workloads across multiple Trainium chips.

Trn1 instances are also the first EC2 instances to enable up to 800 Gbps of Elastic Fabric Adapter (EFA) network bandwidth for high-throughput network communication. This second generation EFA delivers lower latency and up to 2x more network bandwidth compared to the previous generation. Trn1 instances also come with up to 8 TB of local NVMe SSD storage for ultra-fast access to large datasets.

The following table lists the sizes and specs of Trn1 instances in detail.

Instance Name
vCPUs AWS Trainium Chips Accelerator Memory NeuronLink Instance Memory Instance Networking Local Instance Storage
trn1.2xlarge 8 1 32 GB N/A 32 GB Up to 12.5 Gbps 1x 500 GB NVMe
trn1.32xlarge 128 16 512 GB Supported 512 GB 800 Gbps 4x 2 TB NVMe

Trn1 EC2 UltraClusters
For large-scale model training, Trn1 instances integrate with Amazon FSx for Lustre high-performance storage and are deployed in EC2 UltraClusters. EC2 UltraClusters are hyperscale clusters interconnected with a non-blocking petabit-scale network. This gives you on-demand access to a supercomputer to cut model training time for large and complex models from months to weeks or even days.

Amazon EC2 Trn1 UltraCluster

AWS Trainium Innovation
AWS Trainium chips include specific scalar, vector, and tensor engines that are purpose-built for deep learning algorithms. This ensures higher chip utilization as compared to other architectures, resulting in higher performance.

Here is a short summary of additional hardware innovations:

  • Data Types: AWS Trainium supports a wide range of data types, including FP32, TF32, BF16, FP16, and UINT8, so you can choose the most suitable data type for your workloads. It also supports a new, configurable FP8 (cFP8) data type, which is especially relevant for large models because it reduces the memory footprint and I/O requirements of the model.
  • Hardware-Optimized Stochastic Rounding: Stochastic rounding achieves close to FP32-level accuracy with faster BF16-level performance when you enable auto-casting from FP32 to BF16 data types. Stochastic rounding is a different way of rounding floating-point numbers, which is more suitable for machine learning workloads versus the commonly used Round Nearest Even rounding. By setting the environment variable NEURON_RT_STOCHASTIC_ROUNDING_EN=1 to use stochastic rounding, you can train a model up to 30 percent faster.
  • Custom Operators, Dynamic Tensor Shapes: AWS Trainium also supports custom operators written in C++ and dynamic tensor shapes. Dynamic tensor shapes are key for models with unknown input tensor sizes, such as models processing text.

AWS Trainium shares the same AWS Neuron SDK as AWS Inferentia, making it easy for everyone who is already using AWS Inferentia to get started with AWS Trainium.

For model training, the Neuron SDK consists of a compiler, framework extensions, a runtime library, and developer tools. The Neuron plugin natively integrates with popular ML frameworks, such as PyTorch and TensorFlow.

The AWS Neuron SDK supports just-in-time (JIT) compilation, in addition to ahead-of-time (AOT) compilation, to speed up model compilation, and Eager Debug Mode, for a step-by-step execution.

To compile and run your model on AWS Trainium, you need to change only a few lines of code in your training script. You don’t need to tweak your model or think about data type conversion.

Get Started with Trn1 Instances
In this example, I train a PyTorch model on an EC2 Trn1 instance using the available PyTorch Neuron packages. PyTorch Neuron is based on the PyTorch XLA software package and enables conversion of PyTorch operations to AWS Trainium instructions.

Each AWS Trainium chip includes two NeuronCore accelerators, which are the main neural network compute units. With only a few changes to your training code, you can train your PyTorch model on AWS Trainium NeuronCores.

SSH into the Trn1 instance and activate a Python virtual environment that includes the PyTorch Neuron packages. If you’re using a Neuron-provided AMI, you can activate the preinstalled environment by running the following command:

source aws_neuron_venv_pytorch_p36/bin/activate

Before you can run your training script, you need to make a few modifications. On Trn1 instances, the default XLA device should be mapped to a NeuronCore.

Let’s start by adding the PyTorch XLA imports to your training script:

import torch, torch_xla
import torch_xla.core.xla_model as xm

Then, place your model and tensors onto an XLA device:

model.to(xm.xla_device())
tensor.to(xm.xla_device())

When the model is moved to the XLA device (NeuronCore), subsequent operations on the model are recorded for later execution. This is XLA’s lazy execution which is different from PyTorch’s eager execution. Within the training loop, you have to mark the graph to be optimized and run on the XLA device using xm.mark_step(). Without this mark, XLA cannot determine where the graph ends.

...
for data, target in train_loader:
	output = model(data)
	loss = loss_fn(output, target)
	loss.backward()
	optimizer.step()
	xm.mark_step()
...

You can now run your training script using torchrun <my_training_script>.py.

When running the training script, you can configure the number of NeuronCores to use for training by using torchrun –nproc_per_node.

For example, to run a multi-worker data parallel model training on all 32 NeuronCores in one trn1.32xlarge instance, run torchrun --nproc_per_node=32 <my_training_script>.py.

Data parallel is a strategy for distributed training that allows you to replicate your script across multiple workers, with each worker processing a portion of the training dataset. The workers then share their result with each other.

For more details on supported ML frameworks, model types, and how to prepare your model training script for large-scale distributed training across trn1.32xlarge instances, have a look at the AWS Neuron SDK documentation.

Profiling Tools
Let’s have a quick look at useful tools to keep track of your ML experiments and profile Trn1 instance resource consumption. Neuron integrates with TensorBoard to track and visualize your model training metrics.

AWS Neuron SDK TensorBoard integration

On the Trn1 instance, you can use the neuron-ls command to describe the number of Neuron devices present in the system, along with the associated NeuronCore count, memory, connectivity/topology, PCI device information, and the Python process that currently has ownership of the NeuronCores:

AWS Neuron SDK neuron-ls command

Similarly, you can use the neuron-top command to see a high-level view of the Neuron environment. This shows the utilization of each of the NeuronCores, any models that are currently loaded onto one or more NeuronCores, process IDs for any processes that are using the Neuron runtime, and basic system statistics relating to vCPU and memory usage.

AWS Neuron SDK neuron-top command

Available Now
You can launch Trn1 instances today in the AWS US East (N. Virginia) and US West (Oregon) Regions as On-Demand, Reserved, and Spot Instances or as part of a Savings Plan. As usual with Amazon EC2, you pay only for what you use. For more information, see Amazon EC2 pricing.

Trn1 instances can be deployed using AWS Deep Learning AMIs, and container images are available via managed services such as Amazon SageMaker, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS ParallelCluster.

To learn more, visit our Amazon EC2 Trn1 instances page, and please send feedback to AWS re:Post for EC2 or through your usual AWS Support contacts.

— Antje

AWS Week in Review – October 10, 2022

Post Syndicated from Marcia Villalba original https://aws.amazon.com/blogs/aws/aws-week-in-review-october-10-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

I had an amazing start to the week last week as I was speaking at the AWS Community Day NL. This event had 500 attendees and over 70 speakers, and Dr. Werner Vogels, Amazon CTO, delivered the keynote. AWS Community Days are community-led conferences organized by local communities, with a variety of workshops and sessions. I recommend checking your region for any of these events.

Community Day NL

Last Week’s Launches
Here are some launches that got my attention during the previous week.

Amazon S3 Object Lambda now supports using your own code to change the results of HEAD and LIST requests, besides GET (which we launched last year). This feature now enables more capabilities for what you can do with S3 Object Lambda. Danilo made a Twitter thread with lots of use cases for this new launch.

Amazon SageMaker Clarify now can provide near real-time explanations for ML predictions. SageMaker Clarify is a service that provides explainability by ML models individual predictions. These explanations are important for developers to get visibility into their training data and models to identify potential bias.

AWS Storage Gateway now supports 15 TiB tapes. It increased the maximum supported virtual tape size on Tape Gateway from 5 TiB to 15 TiB, so you can store more data on a single virtual tape, and you can reduce the number of tapes you need to manage.

Amazon Aurora Serverless v2 now supports AWS CloudFormation. Early this year, we announced the general availability of Aurora Serverless v2, and now you can use AWS CloudFormation Templates to deploy and change the database along with the rest of your infrastructure.

AWS Config now supports 15 new resource types, including AWS DataSync, Amazon GuardDuty, Amazon Simple Email Service (Amazon SES), AWS AppSync, AWS Cloud Map, Amazon EC2, and AWS AppConfig. With this launch, you can use AWS Config to monitor configuration data for the supported resource types in your AWS account, and you can see how the configuration changes.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other updates and news that you may have missed:

This week an article about how AWS is leading a pilot project to turn the Greek island of Naxos into a smart island caught my attention. The project introduces smart solutions for mobility, primary healthcare, and the transport of goods. The solution has been built based on four pillars that were important for the island: sustainability, telehealth, leisure, and digital skills. Check out the whole article to learn what they are doing.

Podcast Charlas Técnicas de AWS – If you understand Spanish, this podcast is for you. Podcast Charlas Técnicas is one of the official AWS podcasts in Spanish, and every other week there is a new episode. The podcast is meant for builders, and it shares stories about how customers implemented and learned AWS services, how to architect applications, and how to use new services. You can listen to all the episodes directly from your favorite podcast app or at AWS Podcasts en español.

AWS open-source news and updates – This is a newsletter curated by my colleague Ricardo to bring you the latest open-source projects, posts, events, and more.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS re:Invent reserved seating opens on October 11. If you are planning to attend, book a spot in advance for your favorite sessions. AWS re:Invent is our biggest conference of the year, it happens in Las Vegas from November 28 to December 2, and registrations are open. Many writers of this blog have sessions at re:Invent, and you can search the event agenda using our names.

I started the post talking about AWS Community Days, and there is one in Warsaw, Poland, on October 14. If you are around Warsaw during this week, you can first check out the AWS Pop-up Hub in Warsaw that runs October 10-14 and then join for the Community Day.

On October 20, there is a virtual event for modernizing .NET workloads with Windows containers on AWS, You can register for free.

That’s all for this week. Check back next Monday for another Week in Review!

— Marcia

[$] A deeper look into the GCC Rust front-end

Post Syndicated from original https://lwn.net/Articles/909887/

Philip Herron and Arthur Cohen presented an
update
on the “gccrs” GCC front end for the Rust language at the
2022 Kangrejos conference. Less than
two weeks later — and joined by David Faust — they did it again at the 2022 GNU Tools Cauldron.
This time, though, they were talking to GCC developers and refocused their
presentation accordingly; the result was an interesting look into the
challenges of implementing a compiler for Rust.

Security updates for Monday

Post Syndicated from original https://lwn.net/Articles/910724/

Security updates have been issued by Debian (knot-resolver and libpgjava), Fedora (booth, dotnet3.1, expat, nheko, php-twig, php-twig2, php-twig3, poppler, python-joblib, and seamonkey), Mageia (colord, dbus, enlightenment, kitty, libvncserver, php, python3, and unbound), Slackware (libksba), SUSE (cyrus-sasl, ImageMagick, and xmlgraphics-commons), and Ubuntu (nginx and thunderbird).

Assembly within! BPF tail calls on x86 and ARM

Post Syndicated from Jakub Sitnicki original https://blog.cloudflare.com/assembly-within-bpf-tail-calls-on-x86-and-arm/

Assembly within! BPF tail calls on x86 and ARM

Assembly within! BPF tail calls on x86 and ARM

Early on when we learn to program, we get introduced to the concept of recursion. And that it is handy for computing, among other things, sequences defined in terms of recurrences. Such as the famous Fibonnaci numbersFn = Fn-1 + Fn-2.

Assembly within! BPF tail calls on x86 and ARM

Later on, perhaps when diving into multithreaded programming, we come to terms with the fact that the stack space for call frames is finite. And that there is an “okay” way and a “cool” way to calculate the Fibonacci numbers using recursion:

// fib_okay.c

#include <stdint.h>

uint64_t fib(uint64_t n)
{
        if (n == 0 || n == 1)
                return 1;

        return fib(n - 1) + fib(n - 2);
}

Listing 1. An okay Fibonacci number generator implementation

// fib_cool.c

#include <stdint.h>

static uint64_t fib_tail(uint64_t n, uint64_t a, uint64_t b)
{
    if (n == 0)
        return a;
    if (n == 1)
        return b;

    return fib_tail(n - 1, b, a + b);
}

uint64_t fib(uint64_t n)
{
    return fib_tail(n, 1, 1);
}

Listing 2. A better version of the same

If we take a look at the machine code the compiler produces, the “cool” variant translates to a nice and tight sequence of instructions:

⚠ DISCLAIMER: This blog post is assembly-heavy. We will be looking at assembly code for x86-64, arm64 and BPF architectures. If you need an introduction or a refresher, I can recommend “Low-Level Programming” by Igor Zhirkov for x86-64, and “Programming with 64-Bit ARM Assembly Language” by Stephen Smith for arm64. For BPF, see the Linux kernel documentation.

Assembly within! BPF tail calls on x86 and ARM

Listing 3. fib_cool.c compiled for x86-64 and arm64

The “okay” variant, disappointingly, leads to more instructions than a listing can fit. It is a spaghetti of basic blocks.

Assembly within! BPF tail calls on x86 and ARM

But more importantly, it is not free of x86 call instructions.

$ objdump -d fib_okay.o | grep call
 10c:   e8 00 00 00 00          call   111 <fib+0x111>
$ objdump -d fib_cool.o | grep call
$

This has an important consequence – as fib recursively calls itself, the stacks keep growing. We can observe it with a bit of help from the debugger.

$ gdb --quiet --batch --command=trace_rsp.gdb --args ./fib_okay 6
Breakpoint 1 at 0x401188: file fib_okay.c, line 3.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
n = 6, %rsp = 0xffffd920
n = 5, %rsp = 0xffffd900
n = 4, %rsp = 0xffffd8e0
n = 3, %rsp = 0xffffd8c0
n = 2, %rsp = 0xffffd8a0
n = 1, %rsp = 0xffffd880
n = 1, %rsp = 0xffffd8c0
n = 2, %rsp = 0xffffd8e0
n = 1, %rsp = 0xffffd8c0
n = 3, %rsp = 0xffffd900
n = 2, %rsp = 0xffffd8e0
n = 1, %rsp = 0xffffd8c0
n = 1, %rsp = 0xffffd900
13
[Inferior 1 (process 50904) exited normally]
$

While the “cool” variant makes no use of the stack.

$ gdb --quiet --batch --command=trace_rsp.gdb --args ./fib_cool 6
Breakpoint 1 at 0x40118a: file fib_cool.c, line 13.
[Thread debugging using libthread_db enabled]
Using host libthread_db library "/lib64/libthread_db.so.1".
n = 6, %rsp = 0xffffd938
13
[Inferior 1 (process 50949) exited normally]
$

Where did the calls go?

The smart compiler turned the last function call in the body into a regular jump. Why was it allowed to do that?

It is the last instruction in the function body we are talking about. The caller stack frame is going to be destroyed right after we return anyway. So why keep it around when we can reuse it for the callee’s stack frame?

This optimization, known as tail call elimination, leaves us with no function calls in the “cool” variant of our fib implementation. There was only one call to eliminate – right at the end.

Once applied, the call becomes a jump (loop). If assembly is not your second language, decompiling the fib_cool.o object file with Ghidra helps see the transformation:

long fib(ulong param_1)

{
  long lVar1;
  long lVar2;
  long lVar3;
  
  if (param_1 < 2) {
    lVar3 = 1;
  }
  else {
    lVar3 = 1;
    lVar2 = 1;
    do {
      lVar1 = lVar3;
      param_1 = param_1 - 1;
      lVar3 = lVar2 + lVar1;
      lVar2 = lVar1;
    } while (param_1 != 1);
  }
  return lVar3;
}

Listing 4. fib_cool.o decompiled by Ghidra

This is very much desired. Not only is the generated machine code much shorter. It is also way faster due to lack of calls, which pop up on the profile for fib_okay.

But I am no performance ninja and this blog post is not about compiler optimizations. So why am I telling you about it?

Assembly within! BPF tail calls on x86 and ARM
Alex Dunkel (Maky), CC BY-SA 3.0, via Wikimedia Commons

Tail calls in BPF

The concept of tail call elimination made its way into the BPF world. Although not in the way you might expect. Yes, the LLVM compiler does get rid of the trailing function calls when building for -target bpf. The transformation happens at the intermediate representation level, so it is backend agnostic. This can save you some BPF-to-BPF function calls, which you can spot by looking for call -N instructions in the BPF assembly.

However, when we talk about tail calls in the BPF context, we usually have something else in mind. And that is a mechanism, built into the BPF JIT compiler, for chaining BPF programs.

We first adopted BPF tail calls when building our XDP-based packet processing pipeline. Thanks to it, we were able to divide the processing logic into several XDP programs. Each responsible for doing one thing.

Assembly within! BPF tail calls on x86 and ARM
Slide from “XDP based DDoS Mitigation” talk by Arthur Fabre

BPF tail calls have served us well since then. But they do have their caveats. Until recently it was impossible to have both BPF tails calls and BPF-to-BPF function calls in the same XDP program on arm64, which is one of the supported architectures for us.

Why? Before we get to that, we have to clarify what a BPF tail call actually does.

A tail call is a tail call is a tail call

BPF exposes the tail call mechanism through the bpf_tail_call helper, which we can invoke from our BPF code. We don’t directly point out which BPF program we would like to call. Instead, we pass it a BPF map (a container) capable of holding references to BPF programs (BPF_MAP_TYPE_PROG_ARRAY), and an index into the map.

long bpf_tail_call(void *ctx, struct bpf_map *prog_array_map, u32 index)

       Description
              This  special  helper is used to trigger a "tail call", or
              in other words, to jump into  another  eBPF  program.  The
              same  stack frame is used (but values on stack and in reg‐
              isters for the caller are not accessible to  the  callee).
              This  mechanism  allows  for  program chaining, either for
              raising the maximum number of available eBPF instructions,
              or  to  execute  given programs in conditional blocks. For
              security reasons, there is an upper limit to the number of
              successive tail calls that can be performed.

bpf-helpers(7) man page

At first glance, this looks somewhat similar to the execve(2) syscall. It is easy to mistake it for a way to execute a new program from the current program context. To quote the excellent BPF and XDP Reference Guide from the Cilium project documentation:

Tail calls can be seen as a mechanism that allows one BPF program to call another, without returning to the old program. Such a call has minimal overhead as unlike function calls, it is implemented as a long jump, reusing the same stack frame.

But once we add BPF function calls into the mix, it becomes clear that the BPF tail call mechanism is indeed an implementation of tail call elimination, rather than a way to replace one program with another:

Tail calls, before the actual jump to the target program, will unwind only its current stack frame. As we can see in the example above, if a tail call occurs from within the sub-function, the function’s (func1) stack frame will be present on the stack when a program execution is at func2. Once the final function (func3) function terminates, all the previous stack frames will be unwinded and control will get back to the caller of BPF program caller.

Alas, one with sometimes slightly surprising semantics. Consider the code like below, where a BPF function calls the bpf_tail_call() helper:

struct {
    __uint(type, BPF_MAP_TYPE_PROG_ARRAY);
    __uint(max_entries, 1);
    __uint(key_size, sizeof(__u32));
    __uint(value_size, sizeof(__u32));
} bar SEC(".maps");

SEC("tc")
int serve_drink(struct __sk_buff *skb __unused)
{
    return 0xcafe;
}

static __noinline
int bring_order(struct __sk_buff *skb)
{
    bpf_tail_call(skb, &bar, 0);
    return 0xf00d;
}

SEC("tc")
int server1(struct __sk_buff *skb)
{
    return bring_order(skb);    
}

SEC("tc")
int server2(struct __sk_buff *skb)
{
    __attribute__((musttail)) return bring_order(skb);  
}

We have two seemingly not so different BPF programs – server1() and server2(). They both call the same BPF function bring_order(). The function tail calls into the serve_drink() program, if the bar[0] map entry points to it (let’s assume that).

Do both server1 and server2 return the same value? Turns out that – no, they don’t. We get a hex 🍔 from server1, and a ☕ from server2. How so?

First thing to notice is that a BPF tail call unwinds just the current function stack frame. Code past the bpf_tail_call() invocation in the function body never executes, providing the tail call is successful (the map entry was set, and the tail call limit has not been reached).

When the tail call finishes, control returns to the caller of the function which made the tail call. Applying this to our example, the control flow is serverX() --> bring_order() --> bpf_tail_call() --> serve_drink() -return-> serverX() for both programs.

The second thing to keep in mind is that the compiler does not know that the bpf_tail_call() helper changes the control flow. Hence, the unsuspecting compiler optimizes the code as if the execution would continue past the BPF tail call.

Assembly within! BPF tail calls on x86 and ARM
The call graph for server1() and server2() is the same, but the return value differs due to build time optimizations.

In our case, the compiler thinks it is okay to propagate the constant which bring_order() returns to server1(). Possibly catching us by surprise, if we didn’t check the generated BPF assembly.

We can prevent it by forcing the compiler to make a tail call to bring_order(). This way we ensure that whatever bring_order() returns will be used as the server2() program result.

🛈 General rule – for least surprising results, use musttail attribute when calling a function that contain a BPF tail call.

How does the bpf_tail_call() work underneath then? And why the BPF verifier wouldn’t let us mix the function calls with tail calls on arm64? Time to dig deeper.

Assembly within! BPF tail calls on x86 and ARM
Public Domain image

BPF tail call on x86-64

What does a bpf_tail_call() helper call translate to after BPF JIT for x86-64 has compiled it? How does the implementation guarantee that we don’t end up in a tail call loop forever?

To find out we will need to piece together a few things.

First, there is the BPF JIT compiler source code, which lives in arch/x86/net/bpf_jit_comp.c. Its code is annotated with helpful comments. We will focus our attention on the following call chain within the JIT:

do_jit() 🔗
  emit_prologue() 🔗
  push_callee_regs() 🔗
  for (i = 1; i <= insn_cnt; i++, insn++) {
    switch (insn->code) {
    case BPF_JMP | BPF_CALL:
      /* emit function call */ 🔗
    case BPF_JMP | BPF_TAIL_CALL:
      emit_bpf_tail_call_direct() 🔗
    case BPF_JMP | BPF_EXIT:
      /* emit epilogue */ 🔗
    }
  }

It is sometimes hard to visualize the generated instruction stream just from reading the compiler code. Hence, we will also want to inspect the input – BPF instructions – and the output – x86-64 instructions – of the JIT compiler.

To inspect BPF and x86-64 instructions of a loaded BPF program, we can use bpftool prog dump. However, first we must populate the BPF map used as the tail call jump table. Otherwise, we might not be able to see the tail call jump!

This is due to optimizations that use instruction patching when the index into the program array is known at load time.

# bpftool prog loadall ./tail_call_ex1.o /sys/fs/bpf pinmaps /sys/fs/bpf
# bpftool map update pinned /sys/fs/bpf/jmp_table key 0 0 0 0 value pinned /sys/fs/bpf/target_prog
# bpftool prog dump xlated pinned /sys/fs/bpf/entry_prog
int entry_prog(struct __sk_buff * skb):
; bpf_tail_call(skb, &jmp_table, 0);
   0: (18) r2 = map[id:24]
   2: (b7) r3 = 0
   3: (85) call bpf_tail_call#12
; return 0xf00d;
   4: (b7) r0 = 61453
   5: (95) exit
# bpftool prog dump jited pinned /sys/fs/bpf/entry_prog
int entry_prog(struct __sk_buff * skb):
bpf_prog_4f697d723aa87765_entry_prog:
; bpf_tail_call(skb, &jmp_table, 0);
   0:   nopl   0x0(%rax,%rax,1)
   5:   xor    %eax,%eax
   7:   push   %rbp
   8:   mov    %rsp,%rbp
   b:   push   %rax
   c:   movabs $0xffff888102764800,%rsi
  16:   xor    %edx,%edx
  18:   mov    -0x4(%rbp),%eax
  1e:   cmp    $0x21,%eax
  21:   jae    0x0000000000000037
  23:   add    $0x1,%eax
  26:   mov    %eax,-0x4(%rbp)
  2c:   nopl   0x0(%rax,%rax,1)
  31:   pop    %rax
  32:   jmp    0xffffffffffffffe3   // bug? 🤔
; return 0xf00d;
  37:   mov    $0xf00d,%eax
  3c:   leave
  3d:   ret

There is a caveat. The target addresses for tail call jumps in bpftool prog dump jited output will not make any sense. To discover the real jump targets, we have to peek into the kernel memory. That can be done with gdb after we find the address of our JIT’ed BPF programs in /proc/kallsyms:

# tail -2 /proc/kallsyms
ffffffffa0000720 t bpf_prog_f85b2547b00cbbe9_target_prog        [bpf]
ffffffffa0000748 t bpf_prog_4f697d723aa87765_entry_prog [bpf]
# gdb -q -c /proc/kcore -ex 'x/18i 0xffffffffa0000748' -ex 'quit'
[New process 1]
Core was generated by `earlyprintk=serial,ttyS0,115200 console=ttyS0 psmouse.proto=exps "virtme_stty_c'.
#0  0x0000000000000000 in ?? ()
   0xffffffffa0000748:  nopl   0x0(%rax,%rax,1)
   0xffffffffa000074d:  xor    %eax,%eax
   0xffffffffa000074f:  push   %rbp
   0xffffffffa0000750:  mov    %rsp,%rbp
   0xffffffffa0000753:  push   %rax
   0xffffffffa0000754:  movabs $0xffff888102764800,%rsi
   0xffffffffa000075e:  xor    %edx,%edx
   0xffffffffa0000760:  mov    -0x4(%rbp),%eax
   0xffffffffa0000766:  cmp    $0x21,%eax
   0xffffffffa0000769:  jae    0xffffffffa000077f
   0xffffffffa000076b:  add    $0x1,%eax
   0xffffffffa000076e:  mov    %eax,-0x4(%rbp)
   0xffffffffa0000774:  nopl   0x0(%rax,%rax,1)
   0xffffffffa0000779:  pop    %rax
   0xffffffffa000077a:  jmp    0xffffffffa000072b
   0xffffffffa000077f:  mov    $0xf00d,%eax
   0xffffffffa0000784:  leave
   0xffffffffa0000785:  ret
# gdb -q -c /proc/kcore -ex 'x/7i 0xffffffffa0000720' -ex 'quit'
[New process 1]
Core was generated by `earlyprintk=serial,ttyS0,115200 console=ttyS0 psmouse.proto=exps "virtme_stty_c'.
#0  0x0000000000000000 in ?? ()
   0xffffffffa0000720:  nopl   0x0(%rax,%rax,1)
   0xffffffffa0000725:  xchg   %ax,%ax
   0xffffffffa0000727:  push   %rbp
   0xffffffffa0000728:  mov    %rsp,%rbp
   0xffffffffa000072b:  mov    $0xcafe,%eax
   0xffffffffa0000730:  leave
   0xffffffffa0000731:  ret
#

Lastly, it will be handy to have a cheat sheet of mapping between BPF registers (r0, r1, …) to hardware registers (rax, rdi, …) that the JIT compiler uses.

BPF x86-64
r0 rax
r1 rdi
r2 rsi
r3 rdx
r4 rcx
r5 r8
r6 rbx
r7 r13
r8 r14
r9 r15
r10 rbp
internal r9-r12

Now we are prepared to work out what happens when we use a BPF tail call.

Assembly within! BPF tail calls on x86 and ARM

In essence, bpf_tail_call() emits a jump into another function, reusing the current stack frame. It is just like a regular optimized tail call, but with a twist.

Because of the BPF security guarantees – execution terminates, no stack overflows – there is a limit on the number of tail calls we can have (MAX_TAIL_CALL_CNT = 33).

Counting the tail calls across BPF programs is not something we can do at load-time. The jump table (BPF program array) contents can change after the program has been verified. Our only option is to keep track of tail calls at run-time. That is why the JIT’ed code for the bpf_tail_call() helper checks and updates the tail_call_cnt counter.

The updated count is then passed from one BPF program to another, and from one BPF function to another, as we will see, through the rax register (r0 in BPF).

Luckily for us, the x86-64 calling convention dictates that the rax register does not partake in passing function arguments, but rather holds the function return value. The JIT can repurpose it to pass an additional – hidden – argument.

The function body is, however, free to make use of the r0/rax register in any way it pleases. This explains why we want to save the tail_call_cnt passed via rax onto stack right after we jump to another program. bpf_tail_call() can later load the value from a known location on the stack.

This way, the code emitted for each bpf_tail_call() invocation, and the BPF function prologue work in tandem, keeping track of tail call count across BPF program boundaries.

But what if our BPF program is split up into several BPF functions, each with its own stack frame? What if these functions perform BPF tail calls? How is the tail call count tracked then?

Mixing BPF function calls with BPF tail calls

BPF has its own terminology when it comes to functions and calling them, which is influenced by the internal implementation. Function calls are referred to as BPF to BPF calls. Also, the main/entry function in your BPF code is called “the program”, while all other functions are known as “subprograms”.

Each call to subprogram allocates a stack frame for local state, which persists until the function returns. Naturally, BPF subprogram calls can be nested creating a call chain. Just like nested function calls in user-space.

BPF subprograms are also allowed to make BPF tail calls. This, effectively, is a mechanism for extending the call chain to another BPF program and its subprograms.

If we cannot track how long the call chain can be, and how much stack space each function uses, we put ourselves at risk of overflowing the stack. We cannot let this happen, so BPF enforces limitations on when and how many BPF tail calls can be done:

static int check_max_stack_depth(struct bpf_verifier_env *env)
{
        …
        /* protect against potential stack overflow that might happen when
         * bpf2bpf calls get combined with tailcalls. Limit the caller's stack
         * depth for such case down to 256 so that the worst case scenario
         * would result in 8k stack size (32 which is tailcall limit * 256 =
         * 8k).
         *
         * To get the idea what might happen, see an example:
         * func1 -> sub rsp, 128
         *  subfunc1 -> sub rsp, 256
         *  tailcall1 -> add rsp, 256
         *   func2 -> sub rsp, 192 (total stack size = 128 + 192 = 320)
         *   subfunc2 -> sub rsp, 64
         *   subfunc22 -> sub rsp, 128
         *   tailcall2 -> add rsp, 128
         *    func3 -> sub rsp, 32 (total stack size 128 + 192 + 64 + 32 = 416)
         *
         * tailcall will unwind the current stack frame but it will not get rid
         * of caller's stack as shown on the example above.
         */
        if (idx && subprog[idx].has_tail_call && depth >= 256) {
                verbose(env,
                        "tail_calls are not allowed when call stack of previous frames is %d bytes. Too large\n",
                        depth);
                return -EACCES;
        }
        …
}

While the stack depth can be calculated by the BPF verifier at load-time, we still need to keep count of tail call jumps at run-time. Even when subprograms are involved.

This means that we have to pass the tail call count from one BPF subprogram to another, just like we did when making a BPF tail call, so we yet again turn to value passing through the rax register.

Assembly within! BPF tail calls on x86 and ARM
Control flow in a BPF program with a function call followed by a tail call.

🛈 To keep things simple, BPF code in our examples does not allocate anything on stack. I encourage you to check how the JIT’ed code changes when you add some local variables. Just make sure the compiler does not optimize them out.

To make it work, we need to:

① load the tail call count saved on stack into rax before call’ing the subprogram,
② adjust the subprogram prologue, so that it does not reset the rax like the main program does,
③ save the passed tail call count on subprogram’s stack for the bpf_tail_call() helper to consume it.

A bpf_tail_call() within our suprogram will then:

④ load the tail call count from stack,
⑤ unwind the BPF stack, but keep the current subprogram’s stack frame in tact, and
⑥ jump to the target BPF program.

Now we have seen how all the pieces of the puzzle fit together to make BPF tail work on x86-64 safely. The only open question is does it work the same way on other platforms like arm64? Time to shift gears and dive into a completely different BPF JIT implementation.

Assembly within! BPF tail calls on x86 and ARM
Based on an image by Wutthichai Charoenburi, CC BY 2.0

Tail calls on arm64

If you try loading a BPF program that uses both BPF function calls (aka BPF to BPF calls) and BPF tail calls on an arm64 machine running the latest 5.15 LTS kernel, or even the latest 5.19 stable kernel, the BPF verifier will kindly ask you to reconsider your choice:

# uname -rm
5.19.12 aarch64
# bpftool prog loadall tail_call_ex2.o /sys/fs/bpf
libbpf: prog 'entry_prog': BPF program load failed: Invalid argument
libbpf: prog 'entry_prog': -- BEGIN PROG LOAD LOG --
0: R1=ctx(off=0,imm=0) R10=fp0
; __attribute__((musttail)) return sub_func(skb);
0: (85) call pc+1
caller:
 R10=fp0
callee:
 frame1: R1=ctx(off=0,imm=0) R10=fp0
; bpf_tail_call(skb, &jmp_table, 0);
2: (18) r2 = 0xffffff80c38c7200       ; frame1: R2_w=map_ptr(off=0,ks=4,vs=4,imm=0)
4: (b7) r3 = 0                        ; frame1: R3_w=P0
5: (85) call bpf_tail_call#12
tail_calls are not allowed in non-JITed programs with bpf-to-bpf calls
processed 4 insns (limit 1000000) max_states_per_insn 0 total_states 0 peak_states 0 mark_read 0
-- END PROG LOAD LOG --
…
#

That is a pity! We have been looking forward to reaping the benefits of code sharing with BPF to BPF calls in our lengthy machine generated BPF programs. So we asked – how hard could it be to make it work?

After all, BPF JIT for arm64 already can handle BPF tail calls and BPF to BPF calls, when used in isolation.

It is “just” a matter of understanding the existing JIT implementation, which lives in arch/arm64/net/bpf_jit_comp.c, and identifying the missing pieces.

To understand how BPF JIT for arm64 works, we will use the same method as before – look at its code together with sample input (BPF instructions) and output (arm64 instructions).

We don’t have to read the whole source code. It is enough to zero in on a few particular code paths:

bpf_int_jit_compile() 🔗
  build_prologue() 🔗
  build_body() 🔗
    for (i = 0; i < prog->len; i++) {
       build_insn() 🔗
         switch (code) {
         case BPF_JMP | BPF_CALL:
           /* emit function call */ 🔗
         case BPF_JMP | BPF_TAIL_CALL:
           emit_bpf_tail_call() 🔗
         }
    }
  build_epilogue() 🔗

One thing that the arm64 architecture, and RISC architectures in general, are known for is that it has a plethora of general purpose registers (x0-x30). This is a good thing. We have more registers to allocate to JIT internal state, like the tail call count. A cheat sheet of what roles the hardware registers play in the BPF JIT will be helpful:

BPF arm64
r0 x7
r1 x0
r2 x1
r3 x2
r4 x3
r5 x4
r6 x19
r7 x20
r8 x21
r9 x22
r10 x25
internal x9-x12, x26 (tail_call_cnt), x27

Now let’s try to understand the state of things by looking at the JIT’s input and output for two particular scenarios: (1) a BPF tail call, and (2) a BPF to BPF call.

It is hard to read assembly code selectively. We will have to go through all instructions one by one, and understand what each one is doing.

⚠ Brace yourself. Time to decipher a bit of ARM64 assembly. If this will be your first time reading ARM64 assembly, you might want to at least skim through this Guide to ARM64 / AArch64 Assembly on Linux before diving in.

Scenario #1: A single BPF tail call – tail_call_ex1.bpf.c

Input: BPF assembly (bpftool prog dump xlated)

   0: (18) r2 = map[id:4]           // jmp_table map
   2: (b7) r3 = 0
   3: (85) call bpf_tail_call#12
   4: (b7) r0 = 61453               // 0xf00d
   5: (95) exit

Output: ARM64 assembly (bpftool prog dump jited)

 0:   paciasp                            // Sign LR (ROP protection) ①
 4:   stp     x29, x30, [sp, #-16]!      // Save FP and LR registers ②
 8:   mov     x29, sp                    // Set up Frame Pointer
 c:   stp     x19, x20, [sp, #-16]!      // Save callee-saved registers ③
10:   stp     x21, x22, [sp, #-16]!      // ⋮ 
14:   stp     x25, x26, [sp, #-16]!      // ⋮ 
18:   stp     x27, x28, [sp, #-16]!      // ⋮ 
1c:   mov     x25, sp                    // Set up BPF stack base register (r10)
20:   mov     x26, #0x0                  // Initialize tail_call_cnt ④
24:   sub     x27, x25, #0x0             // Calculate FP bottom ⑤
28:   sub     sp, sp, #0x200             // Set up BPF program stack ⑥
2c:   mov     x1, #0xffffff80ffffffff    // r2 = map[id:4] ⑦
30:   movk    x1, #0xc38c, lsl #16       // ⋮ 
34:   movk    x1, #0x7200                // ⋮
38:   mov     x2, #0x0                   // r3 = 0
3c:   mov     w10, #0x24                 // = offsetof(struct bpf_array, map.max_entries) ⑧
40:   ldr     w10, [x1, x10]             // Load array->map.max_entries
44:   add     w2, w2, #0x0               // = index (0)
48:   cmp     w2, w10                    // if (index >= array->map.max_entries)
4c:   b.cs    0x0000000000000088         //     goto out;
50:   mov     w10, #0x21                 // = MAX_TAIL_CALL_CNT (33)
54:   cmp     x26, x10                   // if (tail_call_cnt >= MAX_TAIL_CALL_CNT)
58:   b.cs    0x0000000000000088         //     goto out;
5c:   add     x26, x26, #0x1             // tail_call_cnt++;
60:   mov     w10, #0x110                // = offsetof(struct bpf_array, ptrs)
64:   add     x10, x1, x10               // = &array->ptrs
68:   lsl     x11, x2, #3                // = index * sizeof(array->ptrs[0])
6c:   ldr     x11, [x10, x11]            // prog = array->ptrs[index];
70:   cbz     x11, 0x0000000000000088    // if (prog == NULL) goto out;
74:   mov     w10, #0x30                 // = offsetof(struct bpf_prog, bpf_func)
78:   ldr     x10, [x11, x10]            // Load prog->bpf_func
7c:   add     x10, x10, #0x24            // += PROLOGUE_OFFSET * AARCH64_INSN_SIZE (4)
80:   add     sp, sp, #0x200             // Unwind BPF stack
84:   br      x10                        // goto *(prog->bpf_func + prologue_offset)
88:   mov     x7, #0xf00d                // r0 = 0xf00d
8c:   add     sp, sp, #0x200             // Unwind BPF stack ⑨
90:   ldp     x27, x28, [sp], #16        // Restore used callee-saved registers
94:   ldp     x25, x26, [sp], #16        // ⋮
98:   ldp     x21, x22, [sp], #16        // ⋮
9c:   ldp     x19, x20, [sp], #16        // ⋮
a0:   ldp     x29, x30, [sp], #16        // ⋮
a4:   add     x0, x7, #0x0               // Set return value
a8:   autiasp                            // Authenticate LR
ac:   ret                                // Return to caller

① BPF program prologue starts with Pointer Authentication Code (PAC), which protects against Return Oriented Programming attacks. PAC instructions are emitted by JIT only if CONFIG_ARM64_PTR_AUTH_KERNEL is enabled.

Arm 64 Architecture Procedure Call Standard mandates that the Frame Pointer (register X29) and the Link Register (register X30), aka the return address, of the caller should be recorded onto the stack.

③ Registers X19 to X28, and X29 (FP) plus X30 (LR), are callee saved. ARM64 BPF JIT does not use registers X23 and X24 currently, so they are not saved.

④ We track the tail call depth in X26. No need to save it onto stack since we use a register dedicated just for this purpose.

⑤ FP bottom is an optimization that allows store/loads to BPF stack with a single instruction and an immediate offset value.

⑥ Reserve space for the BPF program stack. The stack layout is now as shown in a diagram in build_prologue() source code.

⑦ The BPF function body starts here.

bpf_tail_call() instructions start here.

⑨ The epilogue starts here.

Whew! That was a handful 😅.

Notice that the BPF tail call implementation on arm64 is not as optimized as on x86-64. There is no code patching to make direct jumps when the target program index is known at the JIT-compilation time. Instead, the target address is always loaded from the BPF program array.

Ready for the second scenario? I promise it will be shorter. Function prologue and epilogue instructions will look familiar, so we are going to keep annotations down to a minimum.

Scenario #2: A BPF to BPF call – sub_call_ex1.bpf.c

Input: BPF assembly (bpftool prog dump xlated)

int entry_prog(struct __sk_buff * skb):
   0: (85) call pc+1#bpf_prog_a84919ecd878b8f3_sub_func
   1: (95) exit
int sub_func(struct __sk_buff * skb):
   2: (b7) r0 = 61453                   // 0xf00d
   3: (95) exit

Output: ARM64 assembly

int entry_prog(struct __sk_buff * skb):
bpf_prog_163e74e7188910f2_entry_prog:
   0:   paciasp                                 // Begin prologue
   4:   stp     x29, x30, [sp, #-16]!           // ⋮
   8:   mov     x29, sp                         // ⋮
   c:   stp     x19, x20, [sp, #-16]!           // ⋮
  10:   stp     x21, x22, [sp, #-16]!           // ⋮
  14:   stp     x25, x26, [sp, #-16]!           // ⋮
  18:   stp     x27, x28, [sp, #-16]!           // ⋮
  1c:   mov     x25, sp                         // ⋮
  20:   mov     x26, #0x0                       // ⋮
  24:   sub     x27, x25, #0x0                  // ⋮
  28:   sub     sp, sp, #0x0                    // End prologue
  2c:   mov     x10, #0xffffffffffff5420        // Build sub_func()+0x0 address
  30:   movk    x10, #0x8ff, lsl #16            // ⋮
  34:   movk    x10, #0xffc0, lsl #32           // ⋮
  38:   blr     x10 ------------------.         // Call sub_func()+0x0 
  3c:   add     x7, x0, #0x0 <----------.       // r0 = sub_func()
  40:   mov     sp, sp                | |       // Begin epilogue
  44:   ldp     x27, x28, [sp], #16   | |       // ⋮
  48:   ldp     x25, x26, [sp], #16   | |       // ⋮
  4c:   ldp     x21, x22, [sp], #16   | |       // ⋮
  50:   ldp     x19, x20, [sp], #16   | |       // ⋮
  54:   ldp     x29, x30, [sp], #16   | |       // ⋮
  58:   add     x0, x7, #0x0          | |       // ⋮
  5c:   autiasp                       | |       // ⋮
  60:   ret                           | |       // End epilogue
                                      | |
int sub_func(struct __sk_buff * skb): | |
bpf_prog_a84919ecd878b8f3_sub_func:   | |
   0:   paciasp <---------------------' |       // Begin prologue
   4:   stp     x29, x30, [sp, #-16]!   |       // ⋮
   8:   mov     x29, sp                 |       // ⋮
   c:   stp     x19, x20, [sp, #-16]!   |       // ⋮
  10:   stp     x21, x22, [sp, #-16]!   |       // ⋮
  14:   stp     x25, x26, [sp, #-16]!   |       // ⋮
  18:   stp     x27, x28, [sp, #-16]!   |       // ⋮
  1c:   mov     x25, sp                 |       // ⋮
  20:   mov     x26, #0x0               |       // ⋮
  24:   sub     x27, x25, #0x0          |       // ⋮
  28:   sub     sp, sp, #0x0            |       // End prologue
  2c:   mov     x7, #0xf00d             |       // r0 = 0xf00d
  30:   mov     sp, sp                  |       // Begin epilogue
  34:   ldp     x27, x28, [sp], #16     |       // ⋮
  38:   ldp     x25, x26, [sp], #16     |       // ⋮
  3c:   ldp     x21, x22, [sp], #16     |       // ⋮
  40:   ldp     x19, x20, [sp], #16     |       // ⋮
  44:   ldp     x29, x30, [sp], #16     |       // ⋮
  48:   add     x0, x7, #0x0            |       // ⋮
  4c:   autiasp                         |       // ⋮
  50:   ret ----------------------------'       // End epilogue

We have now seen what a BPF tail call and a BPF function/subprogram call compiles down to. Can you already spot what would go wrong if mixing the two was allowed?

That’s right! Every time we enter a BPF subprogram, we reset the X26 register, which holds the tail call count, to zero (mov x26, #0x0). This is bad. It would let users create program chains longer than the MAX_TAIL_CALL_CNT limit.

How about we just skip this step when emitting the prologue for BPF subprograms?

@@ -246,6 +246,7 @@ static bool is_lsi_offset(int offset, int scale)
 static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
 {
        const struct bpf_prog *prog = ctx->prog;
+       const bool is_main_prog = prog->aux->func_idx == 0;
        const u8 r6 = bpf2a64[BPF_REG_6];
        const u8 r7 = bpf2a64[BPF_REG_7];
        const u8 r8 = bpf2a64[BPF_REG_8];
@@ -299,7 +300,7 @@ static int build_prologue(struct jit_ctx *ctx, bool ebpf_from_cbpf)
        /* Set up BPF prog stack base register */
        emit(A64_MOV(1, fp, A64_SP), ctx);

-       if (!ebpf_from_cbpf) {
+       if (!ebpf_from_cbpf && is_main_prog) {
                /* Initialize tail_call_cnt */
                emit(A64_MOVZ(1, tcc, 0, 0), ctx);

Believe it or not. This is everything that was missing to get BPF tail calls working with function calls on arm64. The feature will be enabled in the upcoming Linux 6.0 release.

Outro

From recursion to tweaking the BPF JIT. How did we get here? Not important. It’s all about the journey.

Along the way we have unveiled a few secrets behind BPF tails calls, and hopefully quenched your thirst for low-level programming. At least for today.

All that is left is to sit back and watch the fruits of our work. With GDB hooked up to a VM, we can observe how a BPF program calls into a BPF function, and from there tail calls to another BPF program:

https://demo-gdb-step-thru-bpf.pages.dev/

Until next time 🖖.

The collective thoughts of the interwebz