Tag Archives: open source

Journey to adopt Cloud-Native DevOps platform Series #1: OfferUp modernized DevOps platform with Amazon EKS and Flagger to accelerate time to market

Post Syndicated from Purna Sanyal original https://aws.amazon.com/blogs/devops/journey-to-adopt-cloud-native-devops-platform-series-1-offerup-modernized-devops-platform-with-amazon-eks-and-flagger-to-accelerate-time-to-market/

In this two-part series, we discuss the challenges that OfferUp, a digital native customer, faced in keeping up with business growth and time-to-market demands. Their journey involved modernizing their existing DevOps platform, moving from a traditional monolithic, virtual machine (VM) based architecture to a modern containerized architecture that runs cloud-native applications with secure progressive delivery to accelerate time to market. This series provides strategies, architecture patterns, and technical steps you can adopt to become more agile and innovative, as OfferUp has.

OfferUp engineers were using the homegrown DevOps platform to build and release new services on the marketplace platform. In this first post, we discuss the key challenges encountered by OfferUp engineers with the existing DevOps platform, as well as how OfferUp modernized its DevOps platform with Amazon Elastic Kubernetes Service (Amazon EKS) and Flagger, automating production releases with progressive delivery techniques for faster time-to-market with new products and services. Amazon EKS is a managed container service to run and scale Kubernetes applications in the cloud or on-premises.

Previous DevOps architecture

OfferUp is a leading online and mobile customer to customer (C2C) marketplace where users can both buy and sell goods on the platform. Users can browse and purchase products from a broad range of categories, including furniture, clothing, sports equipment, toys, and many more. As a mobile-first company, OfferUp puts a great deal of emphasis on in-person communication between buyers and sellers.

OfferUp built a homegrown, self-managed DevOps platform. This platform used a set of manual processes and third-party applications that allowed both developers and operations engineers to build and deploy code to a production environment. The DevOps pipeline covered areas such as source code control, continuous integration/continuous delivery (CI/CD), microservices, and development and test methodologies. The following diagram depicts the previous architecture of OfferUp’s DevOps platform, which was self-managed on Amazon Elastic Compute Cloud (Amazon EC2).

Figure 1: Previous DevOps architecture of OfferUp

OfferUp used GitHub for code repositories. Once the source code was committed to a repository, Jenkins pulled it on a scheduled or on-demand basis and built Amazon Machine Images (AMIs). The built image was deployed to production by a custom-built deployment tool, Vanaheim, which supported one-box canary deployment and full rollout deployment strategies. The DevOps engineers manually created a deployment job in the Vanaheim portal and then manually monitored the test success rate and service metrics to detect any impact from the deployment. Once the target success rate was reached, a full production rollout was performed from the Vanaheim portal.

Key challenges with previous DevOps pipeline

In 2020, OfferUp experienced significant transaction volume growth on its Marketplace platform with the increase of its user base. With OfferUp’s acquisition of LetGo in 2020, there was a need to build a scalable DevOps platform to support future integration and organic growth. The previous DevOps platform, designed and deployed over seven years ago, had reached the limits of its scalability, and could no longer keep up with the platform’s growth. The previous architecture was expensive to run and had a complex infrastructure that made it difficult to upgrade and add new features.

The following key factors drove the push for modernization:

  • Manual verification was required to check that the code was correctly deployed on one production server before it was rolled out to the other production servers. Full rollout to production wasn’t automated because frequent failures required manual rollbacks.
  • The previous platform required a longer deployment time (1–2 hours) due to the authoritative batch process, which sometimes caused delays in releasing and testing of new features.
  • The self-managed nature of the Jenkins and Vanaheim clusters was consuming far too much engineering time. Most of the institutional knowledge of this legacy platform was lost over the years and it didn’t align with OfferUp’s philosophy of small DevOps engineering teams. Innovation had stalled partly due to the difficulty of simultaneously upgrading the DevOps platform and releasing new features.

DevOps platform automation with Flagger and Gloo Ingress Controller on Amazon EKS

A key requirement for the next-generation system was that the new architecture reduce the operational burden on engineering teams, shorten the deployment lifecycle, and lower the total cost of ownership. OfferUp evaluated multiple managed container orchestration platforms for its DevOps platform. It ultimately selected Amazon EKS for its high availability, for reducing the average time to deploy a change to the stack from hours to just a few minutes, and for reducing the complexity of managing and upgrading the Kubernetes cluster. On the Amazon EKS platform, OfferUp uses Flagger, a progressive delivery tool that automates the release process for applications running on Kubernetes. Flagger implements several deployment strategies (Canary releases, A/B testing, and Blue/Green mirroring) using the Gloo Edge ingress controller for traffic routing. Datadog is used as the observability service for monitoring the health of deployments and managing the progression from canary to full rollout. For release analysis, Flagger runs a query on Datadog logs and uses Slack for alerting and notifications. The cloud-native technology components of the architecture are described as follows:

Kubernetes and Amazon EKS – Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications. Kubernetes is a graduated project in the CNCF. Amazon EKS is a fully managed, certified Kubernetes-conformant service that simplifies the process of building, securing, operating, and maintaining Kubernetes clusters on AWS. Amazon EKS integrates with core AWS services, such as Amazon CloudWatch, Auto Scaling groups, and AWS Identity and Access Management (IAM), to provide a seamless experience for monitoring, scaling, and load balancing your containerized applications.

Helm – Helm manages Kubernetes applications. Helm Charts define, install, and upgrade even the most complex Kubernetes application. Charts are easy to create, version, share, and publish. If Kubernetes were an operating system, then Helm would be the package manager. Helm is a graduated project in the CNCF and is maintained by the Helm community.

Flagger – Flagger is a progressive delivery tool that automates the release process for applications running on Kubernetes. Flagger implements a control loop that gradually shifts traffic to the canary while measuring key performance indicators such as HTTP requests success rate, requests average duration, and pods health. Based on the set thresholds, a canary is either promoted or aborted and its analysis is pushed to a Slack channel. Flagger became a CNCF project – part of the Flux family of GitOps tools.

Gloo Edge – Gloo Edge is a feature-rich, Kubernetes-native ingress controller. Gloo Edge is exceptional in its function-level routing; its support for legacy apps, microservices, and serverless; its discovery capabilities; and its tight integration with leading open-source projects. Gloo Edge is uniquely designed to support hybrid applications, in which multiple technologies, architectures, protocols, and clouds can coexist.

Observability platform – Datadog’s integrations with Kubernetes, Docker, and AWS let you track the full range of Amazon EKS metrics, as well as logs and performance data from your cluster and applications. Datadog gives you comprehensive coverage of your dynamic infrastructure and applications with features like auto discovery to track services across containers, sophisticated graphing, and alerting options.

Modernized DevOps architecture

In the new architecture, OfferUp uses GitHub as its version control tool and GitHub Actions as its CI/CD tool. On every pull request, tests are run, artifacts are built and stored in JFrog Artifactory, and Docker images are stored in Amazon Elastic Container Registry (Amazon ECR). Separate deployment pipelines are triggered based on the target environment (dev, staging, or production). Flagger detects any change in the version of the application and gradually shifts production traffic to the canary. It measures the request success rate and average response duration metrics from Datadog to decide whether to proceed to a full rollout in production. For an application deployment, a canary promotion can be defined using Flagger’s custom resource. Flagger rolls back the deployment when the success rate falls below the defined threshold.
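
The promotion and rollback decisions described above can be sketched as a small simulation. This is a minimal illustration under stated assumptions, not Flagger's actual implementation: the function name `run_canary`, the parameters `step_weight`, `max_weight`, `success_threshold`, and `max_failed_checks`, and the sample success rates are all hypothetical, chosen only to resemble the kind of settings a canary analysis defines.

```python
from typing import Iterable, List, Tuple


def run_canary(
    success_rates: Iterable[float],
    step_weight: int = 10,
    max_weight: int = 50,
    success_threshold: float = 99.0,
    max_failed_checks: int = 2,
) -> Tuple[str, List[int]]:
    """Simulate a canary analysis loop.

    Each value in success_rates is one observed request success rate (in
    percent), standing in for a metric query against the observability
    service. Returns the outcome and the canary traffic weight after
    each check.
    """
    weight = 0
    failed_checks = 0
    history: List[int] = []
    for rate in success_rates:
        if rate < success_threshold:
            failed_checks += 1
            if failed_checks >= max_failed_checks:
                # Too many failed checks: send all traffic back to the primary.
                history.append(0)
                return "rolled_back", history
            # A single failed check holds the canary at its current weight.
            history.append(weight)
            continue
        # Healthy check: shift one more step of traffic to the canary.
        weight = min(weight + step_weight, max_weight)
        history.append(weight)
        if weight >= max_weight:
            return "promoted", history
    return "in_progress", history


if __name__ == "__main__":
    print(run_canary([99.5, 99.9, 99.7, 99.8, 99.9]))
    print(run_canary([99.5, 95.0, 94.0]))
```

In this sketch, five healthy checks walk the canary from 10% to 50% of traffic and end in promotion, while two checks below the threshold end in a rollback.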

Figure 2: Modernized DevOps architecture of OfferUp

With the modernized DevOps platform, OfferUp moved from a monolithic to a microservice architecture in which front-end applications and GraphQL run on the Amazon EKS cluster. The production cluster runs 110 services and 650+ pods on 60 nodes. The cluster scales up to 100 nodes with an Amazon EC2 Auto Scaling group based on the traffic pattern. On the networking front, the cluster has a private endpoint and uses both the VPC CNI plugin and the CoreDNS add-on. There are four Amazon EKS clusters, one each for the production, test, utility, and staging environments. OfferUp plans to explore the Karpenter open-source autoscaling project and will move new applications to the Amazon EKS cluster, allowing the total node count to scale up to 200.

Benefits of modernized architecture

The new architecture helped OfferUp make automated decisions to deploy new releases and improve the time to market while reducing unplanned production downtime:

  • Faster deployments and quicker rollbacks – The new architecture reduces the service deployment time from one hour to five minutes, and replaces a manual rollback that took one hour with an automated rollback that takes five minutes.
  • Automated deployment of new releases – The lack of automated canary deployment processes in the previous architecture required OfferUp engineers to manually intervene to validate the deployment status, which led to administrative overhead and production outages. The canary deployments now take care of traffic shifting by automatically measuring the request success rate and latency metrics from Datadog and subsequently releasing the service to production. Deployments are automatically rolled back when the success rate falls below the defined threshold.
  • Simplified configuration – Configuration has been simplified drastically and integrated within the CI/CD pipeline in the new architecture, thereby reducing configuration complexity, eliminating manual processes, and saving developers’ time.
  • More time to focus on innovation – With fully automated progressive delivery, the developers no longer need to spend time testing and releasing source code in production. Similarly, migrating from a self-managed DevOps platform to the managed Amazon EKS service lowered the DevOps platform’s infrastructure management burden on the engineering team. This helps developers spend more time focusing on building and testing new features and innovations.
  • Cost reduction – Moving from a self-managed Amazon EC2-based architecture to the Amazon EKS cluster reduced the cost of operations through shared nodes and improved pod density. The previous architecture used 200 Amazon EC2 nodes. The same workload was moved to a 50-node Amazon EKS cluster. Furthermore, custom applications (Vanaheim and Jenkins) were retired, further reducing costs.
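
To put the figures in the list above in relative terms, here is a short worked calculation. It uses only the numbers quoted in the bullets (one hour versus five minutes, 200 nodes versus 50 nodes); the helper names `percent_reduction` and `speedup` are made up for this illustration, and node count is only a rough proxy for cost.

```python
def percent_reduction(before: float, after: float) -> float:
    """Percentage decrease when going from `before` to `after`."""
    return (before - after) / before * 100


def speedup(before: float, after: float) -> float:
    """How many times faster `after` is compared with `before`."""
    return before / after


# Deployment (and rollback) time: 60 minutes before, 5 minutes after.
deploy_speedup = speedup(60, 5)            # 12.0, i.e. 12x faster

# Node count for the same workload: 200 nodes before, 50 nodes after.
node_reduction = percent_reduction(200, 50)  # 75.0, i.e. 75% fewer nodes

if __name__ == "__main__":
    print(f"Deployments are {deploy_speedup:.0f}x faster")
    print(f"Node count dropped by {node_reduction:.0f}%")
```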

Conclusion

In this post, you saw how OfferUp embarked on the journey to modernize its DevOps platform to support its growth and developers’ velocity. The key factors that drove the modernization decisions were the ability to scale the platform to support the automated testing of features in production, the faster release of new features, cost reduction, and the ability to facilitate future innovation. The modernized DevOps platform on Amazon EKS also decreased the ongoing operational support burden for engineers, and the scalability of the design opens up a lot of headroom for growth.

We encourage you to look into modernizing your existing CI/CD pipeline on Amazon EKS with the Flagger progressive delivery mechanism. Amazon EKS removes the undifferentiated heavy lifting of managing and updating the Kubernetes cluster. Managed node groups automate the provisioning and lifecycle management of worker nodes in an Amazon EKS cluster, which greatly simplifies operational activities, such as new Kubernetes version deployments.

In the next part of the series, you’ll discover how to implement Flagger and Gloo Edge Ingress Controller on Amazon EKS to automate the release process for applications running on Kubernetes.

Further Reading

Journey to adopt Cloud-Native DevOps platform Series #2: Progressive delivery on Amazon EKS with Flagger and Gloo Edge Ingress Controller

About the authors:

Purna Sanyal

Purna Sanyal is a technology enthusiast and an architect at AWS, helping digital native customers solve their business problems with successful adoption of cloud native architecture. He provides technical thought leadership, architecture guidance, and conducts PoCs to enable customers’ digital transformation. He is also passionate about building innovative solutions around Kubernetes, database, analytics, and machine learning.

Alan Liu

Alan Liu is Sr. Director of Engineering at OfferUp. He is a technology enthusiast who has worked across a wide variety of industries. He is a highly effective, adaptable, and experienced leader with a proven record.

New – Trusted Language Extensions for PostgreSQL on Amazon Aurora and Amazon RDS

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-trusted-language-extensions-for-postgresql-on-amazon-aurora-and-amazon-rds/

PostgreSQL has become the preferred open-source relational database for many enterprises and start-ups with its extensible design for developers. One of the reasons developers use PostgreSQL is it allows them to add database functionality by building extensions with their preferred programming languages.

You can already install and use PostgreSQL extensions in Amazon Aurora PostgreSQL-Compatible Edition and Amazon Relational Database Service for PostgreSQL. We support more than 85 PostgreSQL extensions in Amazon Aurora and Amazon RDS, such as the pgAudit extension for logging your database activity. While many workloads use these extensions, we heard our customers asking for flexibility to build and run the extensions of their choosing for their PostgreSQL database instances.

Today, we are announcing the general availability of Trusted Language Extensions for PostgreSQL (pg_tle), a new open-source development kit for building PostgreSQL extensions. With Trusted Language Extensions for PostgreSQL, developers can build high-performance extensions that run safely on PostgreSQL.

Trusted Language Extensions for PostgreSQL provides database administrators control over who can install extensions and a permissions model for running them, letting application developers deliver new functionality as soon as they determine an extension meets their needs.

To start building with Trusted Language Extensions, you can use trusted languages such as JavaScript, Perl, and PL/pgSQL. These trusted languages have safety attributes, including restricting direct access to the file system and preventing unwanted privilege escalations. You can easily install extensions written in a trusted language on Amazon Aurora PostgreSQL-Compatible Edition 14.5 and Amazon RDS for PostgreSQL 14.5 or a newer version.

Trusted Language Extensions for PostgreSQL is an open-source project licensed under Apache License 2.0 on GitHub. You can comment or suggest items on the Trusted Language Extensions for PostgreSQL roadmap and help us support this project across multiple programming languages, and more. Doing this as a community will help us make it easier for developers to use the best parts of PostgreSQL to build extensions.

Let’s explore how we can use Trusted Language Extensions for PostgreSQL to build a new PostgreSQL extension for Amazon Aurora and Amazon RDS.

Setting up Trusted Language Extensions for PostgreSQL
To use pg_tle with Amazon Aurora or Amazon RDS for PostgreSQL, you need to set up a parameter group that loads pg_tle in the PostgreSQL shared_preload_libraries setting. Choose Parameter groups in the left navigation pane in the Amazon RDS console and Create parameter group to make a new parameter group.

Choose Create after you select postgres14 with Amazon RDS for PostgreSQL in the Parameter group family and enter pgtle in the Group Name. You can select aurora-postgresql14 for an Amazon Aurora PostgreSQL-Compatible cluster.

Choose the pgtle parameter group you created and choose Edit in the Parameter group actions dropdown menu. You can search for shared_preload_libraries in the search box and choose Edit parameter. You can add your preferred values, including pg_tle, and choose Save changes.

You can also do the same job in the AWS Command Line Interface (AWS CLI).

$ aws rds create-db-parameter-group \
  --region us-east-1 \
  --db-parameter-group-name pgtle \
  --db-parameter-group-family aurora-postgresql14 \
  --description "pgtle group"

$ aws rds modify-db-parameter-group \
  --region us-east-1 \
  --db-parameter-group-name pgtle \
  --parameters "ParameterName=shared_preload_libraries,ParameterValue=pg_tle,ApplyMethod=pending-reboot"

Now, you can add the pgtle parameter group to your Amazon Aurora or Amazon RDS for PostgreSQL database. If you have a database instance called testing-pgtle, you can add the pgtle parameter group to the database instance using the command below. Please note that this will cause an active instance to reboot.

$ aws rds modify-db-instance \
  --region us-east-1 \
  --db-instance-identifier testing-pgtle \
  --db-parameter-group-name pgtle \
  --apply-immediately
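
The same three steps can also be scripted. The following is a minimal sketch, assuming the `rds` argument is a boto3 RDS client (for example, one created with `boto3.client("rds")`); the helper names `preload_parameter` and `configure_pg_tle` are hypothetical, and the group, family, and instance names are simply the ones used in the CLI commands above.

```python
from typing import Any, Dict, List


def preload_parameter(libraries: List[str]) -> Dict[str, str]:
    """Build the parameter entry that sets shared_preload_libraries."""
    return {
        "ParameterName": "shared_preload_libraries",
        # Multiple libraries are passed as a comma-separated list.
        "ParameterValue": ",".join(libraries),
        # The new value takes effect only after the instance reboots.
        "ApplyMethod": "pending-reboot",
    }


def configure_pg_tle(rds: Any, group: str, family: str, instance: str) -> None:
    """Create a parameter group that loads pg_tle and attach it to an instance.

    `rds` is assumed to be a boto3 RDS client.
    """
    rds.create_db_parameter_group(
        DBParameterGroupName=group,
        DBParameterGroupFamily=family,
        Description="pgtle group",
    )
    rds.modify_db_parameter_group(
        DBParameterGroupName=group,
        Parameters=[preload_parameter(["pg_tle"])],
    )
    rds.modify_db_instance(
        DBInstanceIdentifier=instance,
        DBParameterGroupName=group,
        ApplyImmediately=True,
    )


# Example call (requires AWS credentials and the boto3 package):
#   import boto3
#   configure_pg_tle(boto3.client("rds", region_name="us-east-1"),
#                    "pgtle", "aurora-postgresql14", "testing-pgtle")
```

Passing the client in as an argument keeps the helper easy to exercise with a stand-in object before pointing it at a real account.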

Verify that the pg_tle library is available on your Amazon Aurora or Amazon RDS for PostgreSQL instance. Run the following command on your PostgreSQL instance:

SHOW shared_preload_libraries;

pg_tle should appear in the output.

Now, create the pg_tle extension in your current database by running the following command:

 CREATE EXTENSION pg_tle;

You can now create and install Trusted Language Extensions for PostgreSQL in your current database. To create a new extension, you should grant the pgtle_admin role to your primary user (e.g., postgres) with the following command:

GRANT pgtle_admin TO postgres;

Let’s now see how to create our first pg_tle extension!

Building a Trusted Language Extension for PostgreSQL
For this example, we are going to build a pg_tle extension to validate that a user is not setting a password that’s found in a common password dictionary. Many teams have rules around the complexity of passwords, particularly for database users. PostgreSQL allows developers to help enforce password complexity using the check_password_hook.

In this example, you will build a password check hook using PL/pgSQL. In the hook, you can check to see if the user-supplied password is in a dictionary of 10 of the most common password values:

SELECT pgtle.install_extension (
  'my_password_check_rules',
  '1.0',
  'Do not let users use the 10 most commonly used passwords',
$_pgtle_$
  CREATE SCHEMA password_check;
  REVOKE ALL ON SCHEMA password_check FROM PUBLIC;
  GRANT USAGE ON SCHEMA password_check TO PUBLIC;

  CREATE TABLE password_check.bad_passwords (plaintext) AS
  VALUES
    ('123456'),
    ('password'),
    ('12345678'),
    ('qwerty'),
    ('123456789'),
    ('12345'),
    ('1234'),
    ('111111'),
    ('1234567'),
    ('dragon');
  CREATE UNIQUE INDEX ON password_check.bad_passwords (plaintext);

  CREATE FUNCTION password_check.passcheck_hook(username text, password text, password_type pgtle.password_types, valid_until timestamptz, valid_null boolean)
  RETURNS void AS $$
    DECLARE
      invalid bool := false;
    BEGIN
      IF password_type = 'PASSWORD_TYPE_MD5' THEN
        SELECT EXISTS(
          SELECT 1
          FROM password_check.bad_passwords bp
          WHERE ('md5' || md5(bp.plaintext || username)) = password
        ) INTO invalid;
        IF invalid THEN
          RAISE EXCEPTION 'password must not be found on a common password dictionary';
        END IF;
      ELSIF password_type = 'PASSWORD_TYPE_PLAINTEXT' THEN
        SELECT EXISTS(
          SELECT 1
          FROM password_check.bad_passwords bp
          WHERE bp.plaintext = password
        ) INTO invalid;
        IF invalid THEN
          RAISE EXCEPTION 'password must not be found on a common password dictionary';
        END IF;
      END IF;
    END
  $$ LANGUAGE plpgsql SECURITY DEFINER;

  GRANT EXECUTE ON FUNCTION password_check.passcheck_hook TO PUBLIC;

  SELECT pgtle.register_feature('password_check.passcheck_hook', 'passcheck');
$_pgtle_$
);
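
To see why the MD5 branch of the hook concatenates each dictionary entry with the username, it helps to reproduce the comparison outside the database. PostgreSQL's MD5 password format is the string 'md5' followed by the MD5 hex digest of the password concatenated with the username. The sketch below mirrors the hook's two branches in Python; the function names `pg_md5_password` and `is_bad_password` and the sample usernames are hypothetical and exist only for this illustration.

```python
import hashlib

# The same ten dictionary entries used by the extension above.
BAD_PASSWORDS = {
    "123456", "password", "12345678", "qwerty", "123456789",
    "12345", "1234", "111111", "1234567", "dragon",
}


def pg_md5_password(password: str, username: str) -> str:
    """Return the MD5 form of a password: 'md5' + md5(password + username)."""
    digest = hashlib.md5((password + username).encode("utf-8")).hexdigest()
    return "md5" + digest


def is_bad_password(supplied: str, username: str, password_type: str) -> bool:
    """Mirror the hook: check plaintext directly, or compare MD5 forms."""
    if password_type == "PASSWORD_TYPE_MD5":
        # The plaintext is not available, so hash every dictionary entry
        # with this username and compare against the supplied hash.
        return any(pg_md5_password(p, username) == supplied for p in BAD_PASSWORDS)
    if password_type == "PASSWORD_TYPE_PLAINTEXT":
        return supplied in BAD_PASSWORDS
    # Like the hook, any other password type is not checked.
    return False


if __name__ == "__main__":
    hashed = pg_md5_password("123456", "alice")
    print(is_bad_password(hashed, "alice", "PASSWORD_TYPE_MD5"))
    print(is_bad_password("dragon", "alice", "PASSWORD_TYPE_PLAINTEXT"))
```

Because the username acts as a salt, the same bad password produces a different hash for each user, which is why the lookup cannot be a simple comparison against precomputed hashes.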

You need to enable the hook through the pgtle.enable_password_check configuration parameter. On Amazon Aurora and Amazon RDS for PostgreSQL, you can do so with the following command:

$ aws rds modify-db-parameter-group \
    --region us-east-1 \
    --db-parameter-group-name pgtle \
    --parameters "ParameterName=pgtle.enable_password_check,ParameterValue=on,ApplyMethod=immediate"

It may take several minutes for these changes to propagate. You can check that the value is set using the SHOW command:

SHOW pgtle.enable_password_check;

If the value is on, you will see the following output:

 pgtle.enable_password_check
-----------------------------
 on

Now you can create this extension in your current database and try setting your password to one of the dictionary passwords and observe how the hook rejects it:

CREATE EXTENSION my_password_check_rules;

CREATE ROLE test_role PASSWORD '123456';
ERROR:  password must not be found on a common password dictionary

CREATE ROLE test_role;
SET SESSION AUTHORIZATION test_role;
SET password_encryption TO 'md5';
\password
-- set to "password"
ERROR:  password must not be found on a common password dictionary

To disable the hook, set the value of pgtle.enable_password_check to off:

$ aws rds modify-db-parameter-group \
    --region us-east-1 \
    --db-parameter-group-name pgtle \
    --parameters "ParameterName=pgtle.enable_password_check,ParameterValue=off,ApplyMethod=immediate"

You can uninstall this pg_tle extension from your database and prevent anyone else from running CREATE EXTENSION on my_password_check_rules with the following command:

DROP EXTENSION my_password_check_rules;
SELECT pgtle.uninstall_extension('my_password_check_rules');

You can find more sample extensions and give them a try. To build and test your Trusted Language Extensions in your local PostgreSQL database, you can build from our source code after cloning the repository.

Join Our Community!
The Trusted Language Extensions for PostgreSQL community is open to everyone. Give it a try, and give us feedback on what you would like to see in future releases. We welcome any contributions, such as new features, example extensions, additional documentation, or any bug reports in GitHub.

To learn more about using Trusted Language Extensions for PostgreSQL in the AWS Cloud, see the Amazon Aurora PostgreSQL-Compatible Edition and Amazon RDS for PostgreSQL documentation.

Give it a try, and please send feedback to AWS re:Post for PostgreSQL or through your usual AWS support contacts.

Channy

New – Amazon Redshift Integration with Apache Spark

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-amazon-redshift-integration-with-apache-spark/

Apache Spark is an open-source, distributed processing system commonly used for big data workloads. Spark application developers working in Amazon EMR, Amazon SageMaker, and AWS Glue often use third-party Apache Spark connectors that allow them to read and write the data with Amazon Redshift. These third-party connectors are not regularly maintained, supported, or tested with various versions of Spark for production.

Today we are announcing the general availability of Amazon Redshift integration for Apache Spark, which makes it easy to build and run Spark applications on Amazon Redshift and Redshift Serverless, enabling customers to open up the data warehouse for a broader set of AWS analytics and machine learning (ML) solutions.

With Amazon Redshift integration for Apache Spark, you can get started in seconds and effortlessly build Apache Spark applications in a variety of languages, such as Java, Scala, and Python.

Your applications can read from and write to your Amazon Redshift data warehouse without compromising on the performance of the applications or the transactional consistency of the data, and they benefit from performance improvements through pushdown optimizations.

Amazon Redshift integration for Apache Spark builds on an existing open source connector project and enhances it for performance and security, helping customers gain up to 10x faster application performance. We thank the original contributors on the project who collaborated with us to make this happen. As we make further enhancements we will continue to contribute back into the open source project.

Getting Started with Spark Connector for Amazon Redshift
To get started, you can go to AWS analytics and ML services, use data frame or Spark SQL code in a Spark job or Notebook to connect to the Amazon Redshift data warehouse, and start running queries in seconds.

In this launch, Amazon EMR 6.9, EMR Serverless, and AWS Glue 4.0 come with the pre-packaged connector and JDBC driver, and you can just start writing code. EMR 6.9 provides a sample notebook, and EMR Serverless provides a sample Spark Job too.

First, you should set AWS Identity and Access Management (AWS IAM) authentication between Redshift and Spark, between Amazon Simple Storage Service (Amazon S3) and Spark, and between Redshift and Amazon S3. The following diagram describes the authentication between Amazon S3, Redshift, the Spark driver, and Spark executors.

For more information, see Identity and access management in Amazon Redshift in the AWS documentation.

Amazon EMR
If you already have an Amazon Redshift data warehouse and the data available, you can create the database user and provide the right level of grants to the database user. To use this with Amazon EMR, you need to upgrade to Amazon EMR 6.9, which includes the packaged spark-redshift connector. Select the emr-6.9.0 release when you create an EMR cluster on Amazon EC2.

You can use EMR Serverless to create your Spark application using the emr-6.9.0 release to run your workload.

EMR Studio also provides an example Jupyter Notebook configured to connect to an Amazon Redshift Serverless endpoint leveraging sample data that you can use to get started quickly.

Here is a Scala example that builds an application with both Spark DataFrame and Spark SQL. Use IAM-based credentials for connecting to Redshift, and use an IAM role for unloading and loading data from S3.

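import org.apache.spark.sql.functions.col

// Create the JDBC connection URL and define the Redshift context.
// tempS3Dir (an S3 path) and roleARN (an IAM role ARN) are assumed to be defined earlier.
val jdbcURL = "jdbc:redshift:iam://<RedshiftEndpoint>:<Port>/<Database>?DbUser=<RsUser>"
val rsOptions = Map(
  "url" -> jdbcURL,
  "tempdir" -> tempS3Dir,
  "aws_iam_role" -> roleARN
)
// Reference the sales table from Redshift
val sales_df = spark
  .read
  .format("io.github.spark_redshift_community.spark.redshift")
  .options(rsOptions)
  .option("dbtable", "sales")
  .load()
sales_df.createOrReplaceTempView("sales")
// Reference the date table from Redshift
val date_df = spark
  .read
  .format("io.github.spark_redshift_community.spark.redshift")
  .options(rsOptions)
  .option("dbtable", "date")
  .load()
// Join the two tables with the DataFrame API and show the total quantity sold on a given date
sales_df.join(date_df, sales_df("dateid") === date_df("dateid"))
  .where(col("caldate") === "2008-01-05")
  .groupBy().sum("qtysold")
  .select(col("sum(qtysold)"))
  .show()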
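
For readers working in PySpark rather than Scala, the same connection settings can be assembled as follows. This is a sketch under stated assumptions: the helper names `redshift_jdbc_url`, `redshift_options`, and `read_table` are hypothetical, the endpoint, bucket, and role values in the example are placeholders, and `read_table` assumes an active SparkSession on a cluster where the connector is installed.

```python
from typing import Dict


def redshift_jdbc_url(endpoint: str, port: int, database: str, db_user: str) -> str:
    """Build an IAM-based JDBC URL in the same shape as the example above."""
    return f"jdbc:redshift:iam://{endpoint}:{port}/{database}?DbUser={db_user}"


def redshift_options(url: str, temp_s3_dir: str, role_arn: str) -> Dict[str, str]:
    """Collect the connector options shared by every read and write."""
    return {"url": url, "tempdir": temp_s3_dir, "aws_iam_role": role_arn}


def read_table(spark, options: Dict[str, str], table: str):
    """Load one Redshift table as a DataFrame (needs a live SparkSession)."""
    return (
        spark.read
        .format("io.github.spark_redshift_community.spark.redshift")
        .options(**options)
        .option("dbtable", table)
        .load()
    )


if __name__ == "__main__":
    # Placeholder values for illustration only.
    url = redshift_jdbc_url("example-endpoint", 5439, "dev", "awsuser")
    print(redshift_options(url, "s3://example-bucket/tmp/", "example-role-arn"))
```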

If Amazon Redshift and Amazon EMR are in different VPCs, you have to configure VPC peering or enable cross-VPC access. Assuming both Amazon Redshift and Amazon EMR are in the same virtual private cloud (VPC), you can create a Spark job or Notebook and connect to the Amazon Redshift data warehouse and write Spark code to use the Amazon Redshift connector.

To learn more, see Use Spark on Amazon Redshift with a connector in the AWS documentation.

AWS Glue
When you use AWS Glue 4.0, the spark-redshift connector is available both as a source and target. In Glue Studio, you can use a visual ETL job to read or write to a Redshift data warehouse simply by selecting a Redshift connection to use within a built-in Redshift source or target node.

The Redshift connection contains Redshift connection details along with the credentials needed to access Redshift with the proper permissions.

To get started, choose Jobs in the left menu of the Glue Studio console. Using either of the Visual modes, you can easily add and edit a source or target node and define a range of transformations on the data without writing any code.

Choose Create and you can easily add and edit a source, target node, and the transform node in the job diagram. At this time, you will choose Amazon Redshift as Source and Target.

Once completed, the Glue job can be executed on Glue for the Apache Spark engine, which will automatically use the latest spark-redshift connector.

The following Python script shows an example job that reads from and writes to Redshift with a DynamicFrame using the spark-redshift connector.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

print("================ DynamicFrame Read ===============")
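url = "jdbc:redshift://<RedshiftEndpoint>:<Port>/dev"
# Placeholders for values the job needs; replace them with your own
dbtable = "<TableName>"
redshiftTmpDir = "s3://<Bucket>/<Prefix>/"
aws_iam_role = "<IamRoleArn>"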
read_options = {
    "url": url,
    "dbtable": dbtable,
    "redshiftTmpDir": redshiftTmpDir,
    "tempdir": redshiftTmpDir,
    "aws_iam_role": aws_iam_role,
    "autopushdown": "true",
    "include_column_list": "false"
}

redshift_read = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options=read_options
) 

print("================ DynamicFrame Write ===============")

write_options = {
    "url": url,
    "dbtable": dbtable,
    "user": "awsuser",
    "password": "Password1",
    "redshiftTmpDir": redshiftTmpDir,
    "tempdir": redshiftTmpDir,
    "aws_iam_role": aws_iam_role,
    "autopushdown": "true",
    "DbUser": "awsuser"
}

print("================ dyf write result: check redshift table ===============")
redshift_write = glueContext.write_dynamic_frame.from_options(
    frame=redshift_read,
    connection_type="redshift",
    connection_options=write_options
)

When you set up your job details, you must select the Glue 4.0 – Supports spark 3.3 Python 3 version, because this integration is available only with that version.

To learn more, see Creating ETL jobs with AWS Glue Studio and Using connectors and connections with AWS Glue Studio in the AWS documentation.

Gaining the Best Performance
In the Amazon Redshift integration for Apache Spark, the Spark connector automatically applies predicate and query pushdown to optimize for performance. You can gain a further performance improvement by unloading data in Parquet, the default unload format for the connector in this integration.

As the following sample code shows, the Spark connector will turn the supported function into a SQL query and run the query in Amazon Redshift.

import sqlContext.implicits._
val sample = sqlContext.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)
  .option("tempdir", tempS3Dir)
  .option("unload_s3_format", "PARQUET")
  .option("dbtable", "event")
  .load()

// Create temporary views for data frames created earlier so they can be accessed via Spark SQL
sales_df.createOrReplaceTempView("sales")
date_df.createOrReplaceTempView("date")
// Show the total sales on a given date using Spark SQL API
spark.sql(
"""SELECT sum(qtysold)
| FROM sales, date
| WHERE sales.dateid = date.dateid
| AND caldate = '2008-01-05'""".stripMargin).show()

Amazon Redshift integration for Apache Spark adds pushdown capabilities for operations such as sort, aggregate, limit, join, and scalar functions so that only the relevant data is moved from the Redshift data warehouse to the consuming Spark application, thereby improving performance.

Available Now
The Amazon Redshift integration for Apache Spark is now available in all Regions that support Amazon EMR 6.9, AWS Glue 4.0, and Amazon Redshift. You can start using the feature directly from EMR 6.9 and Glue Studio 4.0 with the new Spark 3.3.0 version.

Give it a try, and please send us feedback either in the AWS re:Post for Amazon Redshift or through your usual AWS support contacts.

Channy

AWS Week in Review – November 21, 2022

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-november-21-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

A new week starts, and the News Blog team is getting ready for AWS re:Invent! Many of us will be there next week and it would be great to meet in person. If you’re coming, do you know about PeerTalk? It’s an onsite networking program for re:Invent attendees available through the AWS Events mobile app (which you can get on Google Play or Apple App Store) to help facilitate connections among the re:Invent community.

If you’re not coming to re:Invent, no worries, you can get a free online pass to watch keynotes and leadership sessions.

Last Week’s Launches
It was a busy week for our service teams! Here are the launches that got my attention:

AWS Region in Spain – The AWS Region in Aragón, Spain, is now open. The official name is Europe (Spain), and the API name is eu-south-2.

Amazon Athena – You can now apply AWS Lake Formation fine-grained access control policies with all table and file formats supported by Amazon Athena to centrally manage permissions and access data catalog resources in your Amazon Simple Storage Service (Amazon S3) data lake. With fine-grained access control, you can restrict access to data in query results using data filters to achieve column-level, row-level, and cell-level security.

Amazon EventBridge – With additional filtering capabilities, you can now filter events by suffix, ignore case, and match if at least one condition is true. This makes it easier to write complex rules when building event-driven applications.
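To make the three new filters concrete, here is a minimal sketch of an event pattern that combines them. The `suffix`, `equals-ignore-case`, and `$or` keys are EventBridge pattern operators; the event fields (`object.key`, `requester`) and the small `matches` helper are assumptions made up for illustration, and the helper is only a simplified local approximation, not how EventBridge itself evaluates rules.

```python
import json

# Hypothetical rule: match events whose object key ends in ".png",
# OR whose requester equals "alice" regardless of letter case.
pattern = {
    "source": ["aws.s3"],
    "detail": {
        "$or": [
            {"object": {"key": [{"suffix": ".png"}]}},
            {"requester": [{"equals-ignore-case": "alice"}]},
        ]
    },
}


def _value_matches(cond, value):
    """Check one allowed value or operator object against an event value."""
    if isinstance(cond, dict):
        if "suffix" in cond:
            return isinstance(value, str) and value.endswith(cond["suffix"])
        if "equals-ignore-case" in cond:
            return (isinstance(value, str)
                    and value.lower() == cond["equals-ignore-case"].lower())
        return False  # operators outside this sketch are not handled
    return cond == value


def matches(pattern, event):
    """Simplified local matcher for the three operators above (illustrative only)."""
    for key, cond in pattern.items():
        if key == "$or":
            # At least one alternative must match at this level.
            if not any(matches(alt, event) for alt in cond):
                return False
        elif isinstance(cond, dict):
            # Nested object: descend into the same field of the event.
            child = event.get(key)
            if not isinstance(child, dict) or not matches(cond, child):
                return False
        else:
            # A list of allowed values or operator objects; any one may match.
            if not any(_value_matches(c, event.get(key)) for c in cond):
                return False
    return True


# JSON string in the shape a rule's event pattern takes.
event_pattern_json = json.dumps(pattern, indent=2)
```

Calling `matches(pattern, event)` with a sample event dictionary is a quick way to sanity-check a pattern before creating a rule, as long as you treat the result as a rough guide rather than a guarantee.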

AWS Controllers for Kubernetes (ACK) – The ACK for Amazon Elastic Compute Cloud (Amazon EC2) is now generally available and lets you provision and manage EC2 networking resources, such as VPCs, security groups and internet gateways using the Kubernetes API. Also, the ACK for Amazon EMR on EKS is now generally available to allow you to declaratively define and manage EMR on EKS resources such as virtual clusters and job runs as Kubernetes custom resources. Learn more about ACK for Amazon EMR on EKS in this blog post.

Amazon HealthLake – New analytics capabilities make it easier to query, visualize, and build machine learning (ML) models. Now HealthLake transforms customer data into an analytics-ready format in near real-time so that you can query the resulting data and use it to build visualizations or ML models. Also new is Amazon HealthLake Imaging (preview), a new HIPAA-eligible capability that enables you to easily store, access, and analyze medical images at any scale. More on HealthLake Imaging can be found in this blog post.

Amazon RDS – You can now transfer files between Amazon Relational Database Service (RDS) for Oracle and an Amazon Elastic File System (Amazon EFS) file system. You can use this integration to stage files like Oracle Data Pump export files when you import them. You can also use EFS to share a file system between an application and one or more RDS Oracle DB instances to address specific application needs.

Amazon ECS and Amazon EKS – We added centralized logging support for Windows containers to help you easily process and forward container logs to various AWS and third-party destinations such as Amazon CloudWatch, S3, Amazon Kinesis Data Firehose, Datadog, and Splunk. See these blog posts for how to use this new capability with ECS and with EKS.

AWS SAM CLI – You can now use the Serverless Application Model CLI to locally test and debug an AWS Lambda function defined in a Terraform application. You can see a walkthrough in this blog post.

AWS Lambda – Now supports Node.js 18 as both a managed runtime and a container base image, which you can learn more about in this blog post. Also check out this interesting article on why and how you should use AWS SDK for JavaScript V3 with Node.js 18. And last but not least, there is new tooling support to build and deploy native AOT compiled .NET 7 applications to AWS Lambda. With this tooling, you can enable faster application starts and benefit from reduced costs through the faster initialization times and lower memory consumption of native AOT applications. Learn more in this blog post.

AWS Step Functions – Now supports cross-account access for more than 220 AWS services to process data, automate IT and business processes, and build applications across multiple accounts. Learn more in this blog post.

AWS Fargate – Adds the ability to monitor the utilization of the ephemeral storage attached to an Amazon ECS task. You can track the storage utilization with Amazon CloudWatch Container Insights and ECS Task Metadata endpoint.

AWS Proton – Now has a centralized dashboard for all resources deployed and managed by AWS Proton, which you can learn more about in this blog post. You can now also specify custom commands to provision infrastructure from templates. In this way, you can manage templates defined using the AWS Cloud Development Kit (AWS CDK) and other templating and provisioning tools. More on CDK support and AWS CodeBuild provisioning can be found in this blog post.

AWS IAM – You can now use more than one multi-factor authentication (MFA) device for root account users and IAM users in your AWS accounts. More information is available in this post.

Amazon ElastiCache – You can now use IAM authentication to access Redis clusters. With this new capability, IAM users and roles can be associated with ElastiCache for Redis users to manage their cluster access.

Amazon WorkSpaces – You can now use version 2.0 of the WorkSpaces Streaming Protocol (WSP) host agent that offers significant streaming quality and performance improvements, and you can learn more in this blog post. Also, with Amazon WorkSpaces Multi-Region Resilience, you can implement business continuity solutions that keep users online and productive with less than 30-minute recovery time objective (RTO) in another AWS Region during disruptive events. More on multi-region resilience is available in this post.

Amazon CloudWatch RUM – You can now send custom events (in addition to predefined events) for better troubleshooting and application specific monitoring. In this way, you can monitor specific functions of your application and troubleshoot end user impacting issues unique to the application components.

AWS AppSync – You can now define GraphQL API resolvers using JavaScript. You can also mix functions written in JavaScript and Velocity Template Language (VTL) inside a single pipeline resolver. To simplify local development of resolvers, AppSync released two new NPM libraries and a new API command. More info can be found in this blog post.

AWS SDK for SAP ABAP – This new SDK makes it easier for ABAP developers to modernize and transform SAP-based business processes and connect to AWS services natively using the SAP ABAP language. Learn more in this blog post.

AWS CloudFormation – CloudFormation can now send event notifications via Amazon EventBridge when you create, update, or delete a stack set.

AWS Console – With the new Applications widget on the Console home, you have one-click access to applications in AWS Systems Manager Application Manager and their resources, code, and related data. From Application Manager, you can view the resources that power your application and your costs using AWS Cost Explorer.

AWS Amplify – Expands Flutter support (developer preview) to Web and Desktop for the API, Analytics, and Storage use cases. You can now build cross-platform Flutter apps with Amplify that target iOS, Android, Web, and Desktop (macOS, Windows, Linux) using a single codebase. Learn more on Flutter Web and Desktop support for AWS Amplify in this post. Amplify Hosting now supports fully managed CI/CD deployments and hosting for server-side rendered (SSR) apps built using Next.js 12 and 13. Learn more in this blog post and see how to deploy a NextJS 13 app with the AWS CDK here.

Amazon SQS – With attribute-based access control (ABAC), you can define permissions based on tags attached to users and AWS resources. With this release, you can now use tags to configure access permissions and policies for SQS queues. More details can be found in this blog.

AWS Well-Architected Framework – The latest version of the Data Analytics Lens is now available. The Data Analytics Lens is a collection of design principles, best practices, and prescriptive guidance to help you run analytics on AWS.

AWS Organizations – You can now manage accounts, organizational units (OUs), and policies within your organization using CloudFormation templates.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
A few more things you might have missed:

Introducing our final AWS Heroes of the year – As the end of 2022 approaches, we are recognizing individuals whose enthusiasm for knowledge-sharing has a real impact with the AWS community. Please meet them here!

The Distributed Computing Manifesto – Werner Vogels, VP & CTO at Amazon.com, shared the Distributed Computing Manifesto, a canonical document from the early days of Amazon that transformed the way we built architectures and highlights the challenges faced at the end of the 20th century.

AWS re:Post – To make this community more accessible globally, we expanded the user experience to support five additional languages. You can now also interact with AWS re:Post using Traditional Chinese, Simplified Chinese, French, Japanese, and Korean.

For AWS open-source news and updates, here’s the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more.

Upcoming AWS Events
As usual, there are many opportunities to meet:

AWS re:Invent – Our yearly event is next week from November 28 to December 2. If you can’t be there in person, get your free online pass to watch the keynotes and the leadership sessions live.

AWS Community Days – AWS Community Day events are community-led conferences to share and learn together. Join us in Sri Lanka (on December 6-7), Dubai, UAE (December 10), Pune, India (December 10), and Ahmedabad, India (December 17).

That’s all from me for this week. Next week we’ll focus on re:Invent, and then we’ll take a short break. We’ll be back with the next Week in Review on December 12!

Danilo

AWS Week in Review – November 14, 2022

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/aws-week-in-review-november-14-2022/

It’s now just two weeks to AWS re:Invent in Las Vegas, and the pace is picking up, both here on the News Blog and throughout AWS, as everyone gets ready for the big event! I hope you get the chance to join us; I have shared links and other information at the bottom of this post. First, though, let’s dive straight into this week’s review of news and announcements from AWS.

Last Week’s Launches
As usual, let’s start with a summary of some launches from the last week that I want to remind you of:

New Switzerland Region – First and foremost, AWS has opened a new Region, this time in Switzerland. Check out Seb’s post here on the News Blog announcing the launch.

New AWS Resource Explorer – If you’ve ever spent time searching for specific resources in your AWS account, especially across Regions, be sure to take a look at the new AWS Resource Explorer, described in this post by Danilo. Once enabled, indexes of the resources in your account are built and maintained (you have control over which resources are indexed). Once the indexes are built, you can issue queries to more quickly arrive at the required resource without jumping between different Regions and service dashboards in the Management Console.

Amazon Lightsail domain registration and DNS autoconfiguration – Amazon Lightsail users can now take advantage of new support for registering domain names with automatic configuration of DNS records. Within the Lightsail console, you’re now able to create and register an Amazon Route 53 domain with just a few clicks.

New models for Amazon SageMaker JumpStart – Two new state-of-the-art models have been released for Amazon SageMaker JumpStart. SageMaker JumpStart provides pretrained, open-source models covering a wide variety of problem types that help you get started with machine learning. The first new model, Bloom, can be used to complete sentences or generate long paragraphs of text in 46 different languages. The second model, Stable Diffusion, generates realistic images from given text. Find out more about the new models in this What’s New post.

Mac instances and macOS Ventura – Amazon Elastic Compute Cloud (Amazon EC2) now has support for running the latest version of macOS, Ventura (13.0), for both EC2 x86 Mac and EC2 M1 Mac instances. These instances enable you to provision and run macOS environments in the AWS Cloud, for developers creating apps for iPhone, iPad, Mac, Apple Watch, Apple TV, and Safari.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other news items you may want to explore:

AWS Open Source News and Updates – This blog is published each week, and Installment 135 is now available, highlighting new open-source projects, tools, and demos from the AWS community.

Upcoming AWS Events
AWS re:Invent 2022 – As I noted at the top of this post, we’re now just two weeks away from the event! Join us live in Las Vegas November 28–December 2 for keynotes, opportunities for training and certification, and over 1,500 technical sessions. If you are joining us, be sure to check out the re:Invent 2022 Attendee Guides, each curated by an AWS Hero, AWS industry team, or AWS partner.

If you can’t join us live in Las Vegas, be sure to join us online to watch the keynotes and leadership sessions. My cohosts and I on the AWS on Air show will also be livestreaming daily from the event, chatting with service teams and special guests about all the launches and other announcements. You can find us on Twitch.tv (we’ll be on the front page throughout the event), the AWS channel on LinkedIn Live, Twitter.com/awsonair, and YouTube Live.

And one final update for the event – if you’re a .NET developer, be sure to check out the XNT track in the session catalog to find details on the seven breakouts, three chalk talks, and the workshop we have available for you at the conference!

Check back next Monday for our last week in review before the start of re:Invent!

— Steve

AWS Week in Review – November 7, 2022

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-week-in-review-november-7-2022/

With three weeks to go until AWS re:Invent opens in Las Vegas, the AWS News Blog Team is hard at work creating blog posts to share the latest launches and previews with you. As usual, we have a strong mix of new services, new features, and a surprise or two.

Last Week’s Launches
Here are some launches that caught my eye last week:

Amazon SNS Data Protection and Masking – After a quick public preview, this cool feature is now generally available. It uses pattern matching, machine learning models, and content policies to help protect data at scale. You can find many different kinds of personally identifiable information (PII) and protected health information (PHI) in message bodies and either block message delivery or mask (de-identify) the sensitive data, all in real-time and on a per-topic basis. To learn more, read the blog post or the message data protection documentation.

Amazon Textract Updates – This service extracts text, handwriting, and data from any document or image. This past week we updated the AnalyzeID function so that it can now extract the machine readable zone (MRZ) on passports issued by the United States, and we added the entire OCR output to the API response. We also updated the machine learning models that power the AnalyzeDocument function, with a focus on single-character boxed forms commonly found on tax and immigration documents. Finally, we updated the AnalyzeExpense function with support for new fields and higher accuracy for existing fields, bringing the total field count to more than 40.

Another Amazon Braket Processor – Our quantum computing service now supports Aquila, a new 256-qubit quantum computer from QuEra that is based on a programmable array of neutral Rubidium atoms. According to the What’s New, Aquila supports the Analog Hamiltonian Simulation (AHS) paradigm, allowing it to solve for the static and dynamic properties of quantum systems composed of many interacting particles.

Amazon S3 on Outposts – This service now lets you use additional S3 Lifecycle rules to optimize capacity management. You can expire objects as they age or are replaced with newer versions, with control at the bucket level, or for subsets defined by prefixes, object tags, or object sizes. There’s more info in the What’s New and in the S3 documentation.

AWS CloudFormation – There were two big updates last week: support for Amazon RDS Multi-AZ deployments with two readable standbys, and better access to detailed information on failed stack instances for operations on CloudFormation StackSets.

Amazon MemoryDB for Redis – You can now use data tiering as a lower cost way to scale your clusters up to hundreds of terabytes of capacity. This new option uses a combination of instance memory and SSD storage in each cluster node, with all data stored durably in a multi-AZ transaction log. There’s more information in the What’s New and the blog post.

Amazon EC2 – You can now remove launch permissions for Amazon Machine Images (AMIs) that are directly shared with your AWS account.

X in Y – We launched existing AWS services and instance types in additional Regions:

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some additional news items that you may find interesting:

AWS Open Source News and Updates – My colleague Ricardo Sueiras highlights new open source projects, tools, and demos from the AWS Community. Read Installment 134 to see what’s going on!

New Case Study – A new AWS case study describes how Taggle (a company focused on smart water solutions in Australia) created an IoT platform that runs on AWS and uses Amazon Kinesis Data Streams to store & ingest data in real time. Using AWS allowed them to scale to accommodate 80,000 additional sensors that will roll out in 2022.

Upcoming AWS Events
re:Invent 2022 – AWS re:Invent is just three weeks away! Join us live from November 28th to December 2nd for keynotes, training and certification opportunities, and over 1,500 technical sessions. If you cannot make it to Las Vegas you can also join us online to watch the keynotes and leadership sessions live. Be sure to check out the re:Invent 2022 Attendee Guides, each curated by an AWS Hero, AWS industry team, or AWS partner.

PeerTalk – If you will be attending re:Invent in person and are interested in meeting with me or any of our featured experts, be sure to check out PeerTalk, our new onsite networking program.

That’s all for this week!

Jeff;

AWS Week in Review – October 31, 2022

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/aws-week-in-review-october-31-2022/

No tricks, just treats in this weekly roundup of news and announcements. Let’s switch our AWS Management Console into dark mode and dive right into it.

Last Week’s Launches
Here are some launches that got my attention during the previous week:

AWS Local Zones in Hamburg and Warsaw now generally available – AWS Local Zones help you run latency-sensitive applications closer to end users. The AWS Local Zones in Hamburg, Germany, and Warsaw, Poland, are the first Local Zones in Europe. AWS Local Zones are now generally available in 20 metro areas globally, with announced plans to launch 33 additional Local Zones in metro areas around the world. See the full list of available and announced AWS Local Zones, and learn how to get started.

Amazon SageMaker multi-model endpoint (MME) now supports GPU instances – MME is a managed capability of SageMaker Inference that lets you deploy thousands of models on a single endpoint. MMEs can now run multiple models on a GPU core, share GPU instances behind an endpoint across multiple models, and dynamically load and unload models based on the incoming traffic. This can help you reduce costs and achieve better price performance. Learn how to run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints.

Amazon EC2 now lets you replace the root Amazon EBS volume for a running instance – You can now use the Replace Root Volume for patching feature in Amazon EC2 to replace your instance root volume using an updated AMI without needing to stop the instance. This makes patching of the guest operating system and applications easier, while retaining the instance store data, networking, and IAM configuration. Check out the documentation to learn more.

AWS Fault Injection Simulator now supports network connectivity disruption – AWS Fault Injection Simulator (FIS) is a managed service for running controlled fault injection experiments on AWS. AWS FIS now has a new action type to disrupt network connectivity and validate that your applications are resilient to a total or partial loss of connectivity. To learn more, visit Network Actions in the AWS FIS user guide.

Amazon SageMaker Automatic Model Tuning now supports Grid Search – SageMaker Automatic Model Tuning helps you find the hyperparameter values that result in the best-performing model for a chosen metric. Until now, you could choose between random, Bayesian, and hyperband search strategies. Grid search now lets you cover every combination of the specified hyperparameter values for use cases in which you need reproducible tuning results. Learn how Amazon SageMaker Automatic Model Tuning now supports grid search.
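As a rough illustration of what "every combination" means, here is a small sketch of the grid search idea itself. It does not call SageMaker; the hyperparameter names and values are hypothetical, and the `grid` helper is only a plain-Python stand-in for the enumeration a grid strategy performs.

```python
from itertools import product

# Hypothetical hyperparameter values to explore (assumptions for illustration).
search_space = {
    "learning_rate": [0.01, 0.1],
    "max_depth": [4, 6, 8],
}


def grid(space):
    """Return every combination of the listed values, in a fixed order."""
    names = list(space)
    return [
        dict(zip(names, combo))
        for combo in product(*(space[name] for name in names))
    ]


candidates = grid(search_space)

# 2 learning rates x 3 depths = 6 training configurations.
for candidate in candidates:
    print(candidate)
```

Because the enumeration is exhaustive and ordered, running it twice produces the same list of configurations, which is the property that makes grid search attractive when you need reproducible tuning results. The trade-off is that the number of configurations grows multiplicatively with each added hyperparameter.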

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some additional news items that you may find interesting:

Celebrating over 20 years of AI/ML innovation – On October 25, we hosted the AWS AI/ML Innovation Day. Bratin Saha and other leaders in the field shared the great strides we have made in the past and discussed what’s next in the world of ML. You can watch the recording here.

AWS open-source news and updates – My colleague Ricardo Sueiras writes this weekly open-source newsletter in which he highlights new open-source projects, tools, and demos from the AWS Community. Read edition #133 here.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS re:Invent is only 4 weeks away! Join us live in Las Vegas from November 28–December 2 for keynote announcements, training and certification opportunities, access to 1,500+ technical sessions, and much more. Seats are still available to reserve, and walk-ups are available onsite. You can also join us online to watch live keynotes and leadership sessions.

If you are into machine learning like me, check out the ML attendee guide. AWS Machine Learning Hero Vinicius Caridá put together recommended sessions and tips and tricks for building your agenda. We also have attendee guides on additional topics and industries.

On November 2, there is a virtual event for building modern .NET applications on AWS. You can register for free.

On November 11–12, AWS User Groups in India are hosting the AWS Community Day India 2022, with success stories, use cases, and much more from industry leaders. Sign up for free to join this virtual event.

That’s all for this week. Check back next Monday for another Week in Review!

— Antje

AWS Week in Review – October 24, 2022

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/aws-week-in-review-october-24-2022/

Last week, we announced plans to launch the AWS Asia Pacific (Bangkok) Region, which will become our third AWS Region in Southeast Asia. This Region will have three Availability Zones and will give AWS customers in Thailand the ability to run workloads and store data that must remain in-country.

In the Works – AWS Region in Thailand
With this big news, AWS announced a 190 billion baht (US$5 billion) investment to drive Thailand’s digital future over the next 15 years. It includes capital expenditures on the construction of data centers, operational expenses related to ongoing utilities and facility costs, and the purchase of goods and services from Regional businesses.

Since we first opened an office in Bangkok in 2015, AWS has launched 10 edge locations in Bangkok for Amazon CloudFront, a highly secure and programmable content delivery network (CDN). In 2020, we launched AWS Outposts in Thailand, a family of fully managed solutions delivering AWS infrastructure and services to virtually any on-premises or edge location for a truly consistent hybrid experience. This year, we also plan the upcoming launch of an AWS Local Zone in Bangkok, which will enable customers to deliver applications that require single-digit millisecond latency to end users in Thailand.

Photo courtesy of Conor McNamara, Managing Director, ASEAN at AWS

The new AWS Region in Thailand is also part of our broader, multifaceted investment in the country, covering our local team, partners, skills, and the localization of services, including Amazon Transcribe, Amazon Translate, and Amazon Connect.

Many customers in Thailand have chosen AWS to run their workloads to accelerate innovation, increase agility, and drive cost savings, including 2C2P, CP All Plc., Digital Economy Promotion Agency, Energy Response Co. Ltd. (ENRES), PTT Global Public Company Limited (PTT), Siam Cement Group (SCG), Sukhothai Thammathirat Open University, The Stock Exchange of Thailand, Papyrus Studio, and more.

For example, Dr. Werner Vogels, CTO of Amazon.com, introduced the story of Papyrus Studio, a large film studio and one of the first customers in Thailand.

“Customer stories like Papyrus Studio inspire us at AWS. The cloud can allow a small company to rapidly scale and compete globally. It also provides new opportunities to create, innovate, and identify business opportunities that just aren’t possible with conventional infrastructure.”

For more information on how to enable AWS and get support in Thailand, contact our AWS Thailand team.

Last Week’s Launches
My favorite news of last week was the launch of dark mode as a beta feature in the AWS Management Console. In Unified Settings, you can choose between three settings for visual mode: Browser default, Light, and Dark. Browser default applies the default dark or light setting of the browser, Dark applies the new built-in dark mode, and Light maintains the current look and feel of the AWS console. Choose your favorite!

Here are some launches that caught my eye for web, mobile, and IoT application developers:

New AWS Amplify Library for Swift – We announced the general availability of Amplify Library for Swift (previously Amplify iOS). Developers can use Amplify Library for Swift via the Swift Package Manager to build apps for iOS and macOS (currently in beta) platforms with Auth, Storage, Geo, and more features.

The Amplify Library for Swift is open source on GitHub, and we deeply appreciate the feedback we have gotten from the community. To learn more, see Introducing the AWS Amplify Library for Swift in the AWS Front-End Web & Mobile Blog or Amplify Library for Swift documentation.

New Amazon IVS Chat SDKs – Amazon Interactive Video Service (Amazon IVS) now provides SDKs for stream chat with support for web, Android, and iOS. The Amazon IVS stream chat SDKs support common functions for chat room resource management, sending and receiving messages, and managing chat room participants.

Amazon IVS is a managed live-video streaming service that you can stream to using the broadcast SDKs or standard streaming software such as Open Broadcaster Software (OBS). The service provides the cross-platform player SDKs you need to play back Amazon IVS streams and make low-latency live video available to any viewer around the world. It also offers a Chat Client Messaging SDK. For more information, see Getting Started with Amazon IVS Chat in the AWS documentation.

New AWS Parameters and Secrets Lambda Extension – This is a new extension for AWS Lambda developers to retrieve parameters from AWS Systems Manager Parameter Store and secrets from AWS Secrets Manager. Lambda function developers can leverage this extension to improve their application performance as it decreases the latency and the cost of retrieving parameters and secrets.

Previously, you had to initialize either the core library of a service or the entire service SDK inside a Lambda function for retrieving secrets and parameters. Now you can simply use the extension. To learn more, see AWS Systems Manager Parameter Store documentation and AWS Secrets Manager documentation.
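Here is a minimal sketch of what "simply use the extension" can look like from function code, assuming the extension's documented local HTTP interface: a listener on localhost (port 2773 unless `PARAMETERS_SECRETS_EXTENSION_HTTP_PORT` overrides it) that expects the function's `AWS_SESSION_TOKEN` in the `X-Aws-Parameters-Secrets-Token` header. The helper name and the parameter name `/my/app/db-url` are hypothetical; please check the current documentation before relying on the details.

```python
import os
import urllib.parse
import urllib.request


def build_parameter_request(name, with_decryption=False):
    """Build (but do not send) a request to the extension's local endpoint."""
    # Port the extension listens on; 2773 is assumed to be the default.
    port = os.environ.get("PARAMETERS_SECRETS_EXTENSION_HTTP_PORT", "2773")
    # The parameter name must be URL-encoded, since it usually contains "/".
    query = urllib.parse.urlencode(
        {"name": name, "withDecryption": str(with_decryption).lower()}
    )
    url = f"http://localhost:{port}/systemsmanager/parameters/get?{query}"
    # The extension authenticates the caller with the function's session token.
    headers = {
        "X-Aws-Parameters-Secrets-Token": os.environ.get("AWS_SESSION_TOKEN", "")
    }
    return urllib.request.Request(url, headers=headers)


# Hypothetical parameter name, used only to show the call shape.
request = build_parameter_request("/my/app/db-url")
```

Inside a Lambda function with the extension layer attached, you would pass `request` to `urllib.request.urlopen` and parse the JSON body. The request only succeeds in that environment, which is why the sketch stops at building it.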

New FreeRTOS Long Term Support Version – We announced the second release of FreeRTOS Long Term Support (LTS) – FreeRTOS 202210.00 LTS. FreeRTOS LTS offers a more stable foundation than standard releases as manufacturers deploy and later update devices in the field. This release includes new and upgraded libraries such as AWS IoT Fleet Provisioning, Cellular LTE-M Interface, coreMQTT, and FreeRTOS-Plus-TCP.

All libraries included in this FreeRTOS LTS version will receive security and critical bug fixes until October 2024. With an LTS release, you can continue to maintain your existing FreeRTOS code base and avoid any potential disruptions resulting from FreeRTOS version upgrades. To learn more, see the FreeRTOS announcement.

Here is some news on performance improvement and increasing capacity:

Up to 10X Improving Amazon Aurora Snapshot Exporting Speed – Amazon Aurora MySQL-Compatible Edition for MySQL 5.7 and 8.0 now exports snapshots to Amazon S3 up to 10x faster. The performance improvement is automatically applied to all types of database snapshot exports, including manual snapshots, automated system snapshots, and snapshots created by the AWS Backup service. For more information, see Exporting DB cluster snapshot data to Amazon S3 in the Amazon Aurora documentation.

3X Increasing Amazon RDS Read Capacity – Amazon Relational Database Service (RDS) for MySQL, MariaDB, and PostgreSQL now supports 15 read replicas per instance, including up to 5 cross-Region read replicas, delivering up to 3x the previous read capacity. For more information, see Working with read replicas in the Amazon RDS documentation.

2X Increasing AWS Snowball Edge Compute Capacity – The AWS Snowball Edge Compute Optimized device now offers double the compute capacity (up to 104 vCPUs) and double the memory capacity (up to 416GB RAM), and is now fully SSD with 28TB NVMe storage. The updated device is ideal when you need dense compute resources to run complex workloads such as machine learning inference or video analytics at the rugged, mobile edge, such as on trucks, aircraft, or ships. You can get started by ordering a Snowball Edge device on the AWS Snow Family console.

2X Increasing Amazon SQS FIFO Default Quota – Amazon Simple Queue Service (SQS) has increased the default quota for high throughput mode for FIFO (first in, first out) queues to up to 6,000 transactions per second per API action, double the previous quota of 3,000, in all AWS Regions where Amazon SQS FIFO queues are available. For a detailed breakdown of default throughput quotas per Region, see Quotas related to messages in the Amazon SQS documentation.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some other news items that you may find interesting:

22 New or Updated Open Datasets on AWS – We released 22 new or updated datasets, including Amazonia-1 imagery, Bitcoin and Ethereum data, and elevation data over the Arctic and Antarctica. The full list of publicly available datasets is on the Registry of Open Data on AWS and is now also discoverable on AWS Data Exchange.

Sustainability with AWS Partners (ft. AWS On Air) – This episode covers a broad discipline of environmental, social, and governance (ESG) issues across all regions, organization types, and industries. AWS Sustainability & Climate Tech provides a comprehensive portfolio of AWS Partner solutions built on AWS that address climate change events and the United Nations’ Sustainable Development Goals (SDG).

AWS Open Source News and Updates #131 – This newsletter covers the latest open-source projects, such as Amazon EMR Toolkit for VS Code, a VS Code extension that makes it easier to develop Spark jobs on EMR, and AWS CDK For Discourse, sample code that demonstrates how to create a full environment for Discourse. Remember to check out Open source at AWS, and keep up to date with all our activity in open source by following us on @AWSOpen.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS re:Invent 2022 Attendee Guide – Browse re:Invent 2022 attendee guides, curated by AWS Heroes, AWS industry teams, and AWS Partners. Each guide contains recommended sessions, tips and tricks for building your agenda, and other useful resources. Also, seat reservations for all sessions are now open for all re:Invent attendees. You can still register for AWS re:Invent either offline or online.

AWS AI/ML Innovation Day on October 25 – Join us for this year’s AWS AI/ML Innovation Day, where you’ll hear from Bratin Saha and other leaders in the field about the great strides AI/ML has made in the past and the promises awaiting us in the future.

AWS Container Day at Kubecon 2022 on October 25–28 – Come join us at KubeCon + CloudNativeCon North America 2022, where we’ll be hosting AWS Container Day Featuring Kubernetes on October 25 and educational sessions at our booth on October 26–28. Throughout the event, our sessions focus on security, cost optimization, GitOps/multi-cluster management, hybrid and edge compute, and more.

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Week in Review!

— Channy

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

The Story of Scalar

Post Syndicated from Derrick Stolee original https://github.blog/2022-10-13-the-story-of-scalar/

When you install Git v2.38, you’ll find a new executable tool available called scalar. At its core, Scalar enables the latest and greatest Git features for working with large repositories. By simply switching from git clone to scalar clone, you will have all of Git’s most impactful performance features, such as partial clone, sparse-checkout, background maintenance, and advanced config options neatly configured for your repository. Have you already cloned your repository? Run scalar register in it to get the same features.
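
To make the difference concrete, here is a small sketch that assembles the Scalar commands mentioned above next to a rough manual approximation using plain Git. The approximation is an assumption on our part: it covers only partial clone, sparse-checkout, and background maintenance, and omits the additional config that Scalar applies. The function names are hypothetical, and any URL or directory you pass in is your own.

```python
import subprocess


def scalar_commands(url=None):
    """Return the Scalar command for a fresh clone, or for an existing repo."""
    if url is not None:
        return ["scalar", "clone", url]
    # Run this one from inside a repository that was already cloned.
    return ["scalar", "register"]


def rough_manual_equivalent(url, directory):
    """Approximate a subset of what `scalar clone` sets up, using plain Git.

    Assumption: this is NOT what Scalar literally runs; it only enables
    a blobless partial clone, sparse-checkout, and background maintenance.
    """
    return [
        ["git", "clone", "--filter=blob:none", "--sparse", url, directory],
        ["git", "-C", directory, "maintenance", "start"],
    ]


def run_all(commands):
    """Execute each command in order, stopping at the first failure."""
    for command in commands:
        subprocess.run(command, check=True)
```

Calling run_all([scalar_commands("https://example.com/big-repo.git")]) with your own URL performs the one-step clone; the manual list exists only to show how much the single command replaces.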

Scalar and Git, together at last

Although Scalar is only now making its formal Git debut, this release represents the culmination of a multi-year journey. Today, we will share the story of how Scalar got to this point. We’ll start from what inspired its creation, how it evolved from a prototype carved out of the VFS for Git codebase, and finally how it landed in upstream Git. Each step of the way was guided by a set of development principles that helped us with each challenge and opportunity.

Special thanks to @chrisd8088, @dscho, @jeffhostetler, @jrbriggs, @kyle-rader, @mjcheetham, @ldennington, @prplr, @wilbaker, and all of the other contributors who helped make this happen!

Our development principles

Before we get into specifics about how Scalar was built and eventually rewritten and contributed upstream, we need to first establish some context. We entered the project with certain values that we used to guide our decisions. Here are a few that are particularly important to this story.

Rapid prototyping

Code speaks volumes. We could design an architecture all we want on paper, but when solving problems at scale, we need to have actual code running before we can make a final decision.

Before committing to a decision, we would quickly build a prototype and measure its performance. During this prototyping phase, we would take shortcuts to get to that point of measurement. Then, we’d throw everything we could at the prototype to make sure it was correct and fast.

Based on the prototype, we would commit to doing the careful engineering of building the feature again but with a test strategy, thoughtful architecture, and a plan for delivering it to users.

Incremental changes over complete rewrites

Looking at where we started to where we ended, it might seem like we are proponents of rewriting things from scratch. We intend to demonstrate exactly the opposite: Scalar moved with small incremental changes that solved an immediate need. While making those changes, we also optimized for reducing our technical debt and creating a better architecture, and that resulted in code moving from .NET to C and then from our fork to upstream Git, but each individual movement was relatively small compared to the entire system.

The biggest reason we focused on incremental changes was because of our next value.

Tests are an asset

Making any kind of software change adds risk to a project. That risk is mitigated when we have a large set of battle-hardened tests. With a robust test suite available, we were able to make significant changes to our architecture with confidence.

Work in the open

Other than the earliest prototypes, all changes were reviewed and merged completely in public, either in the microsoft/scalar repository or the microsoft/git repository. Scalar was an open source project from day one, and was never intended to be a project only for internal use. By contrast, VFS for Git was built as a tool for Microsoft’s internal use first, and open sourcing it was a bonus after it reached enough adoption. Not only did we value that transparency during Scalar’s development, but now we have a history of public code changes to talk about here.

Now that we’ve established these values, let’s begin the story of Scalar.

A catalyst forces a pivot

The Virtual FileSystem for Git project (VFS for Git for short, previously “GVFS”) was built specifically to transition the Microsoft Windows OS monorepo to Git. VFS for Git utilizes a virtual filesystem to lazily load files only when a filesystem read occurs. This greatly reduced the amount of work Git needed to do, but required installing the microsoft/git fork as well as the .NET VFS for Git software, and using Azure Repos to host the repository.

Initially, the Microsoft Office monorepo was going to onboard to Git using VFS for Git, but they needed cross-platform support, specifically for macOS development. After we had gotten pretty far into a macOS port, Apple deprecated the kernel features that provided the filesystem virtualization that was required for that flow.

We were in luck, however, because we had come to understand a key quality of the Office monorepo: Office has a rigorous dependency system that clearly identifies which files are necessary for a local build. This means that a developer could specify the files they need to Git’s sparse-checkout feature instead of dynamically populating the worktree using a virtual filesystem. This also significantly simplifies the software needed to manage their monorepo!

However, there was a problem. The sparse-checkout feature had previously been abandoned as a direction for VFS for Git due to its performance. Git would use a list of patterns to match which paths should be in the worktree and which should be ignored. This pattern matching had an ordering strategy that required iterating through the entire pattern list for every possible path, requiring quadratic time! For one of the larger sparse-checkout definition examples we had, Git would take 40 minutes to evaluate the sparse-checkout patterns.

Sparse-checkout definitions are extremely generic. They include matching on file prefix, but also file suffix, or path substring, and any combination. For our target monorepo, we only needed directory matches. With that limited type of pattern in mind, we added a new mode to Git’s sparse-checkout feature: “cone mode” sparse-checkout. A quick prototype of cone mode sparse-checkout demonstrated that Git could reach performance similar to that of VFS for Git, especially when paired with the filesystem monitor hook. Our critical performance measurement was the git status command, and we were seeing it complete within three or four seconds, which was close to the typical case in VFS for Git.
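
The contrast between the two matching strategies can be sketched as follows. This is a simplified illustration, not Git's actual matcher: the naive version uses Python's fnmatch (whose wildcard also crosses directory separators) with a last-match-wins rule, and the cone version assumes that directory-only patterns let us replace the pattern scan with hash-set lookups on a path's ancestors. All function names are hypothetical.

```python
import posixpath
from fnmatch import fnmatch


def naive_included(path, patterns):
    """Scan the whole pattern list for every path; the last match wins.

    With P patterns and N paths this costs O(N * P) comparisons, which is
    quadratic when the pattern list grows with the repository.
    """
    included = False
    for pattern in patterns:
        negate = pattern.startswith("!")
        glob = pattern[1:] if negate else pattern
        if fnmatch(path, glob):
            included = not negate
    return included


def build_cone(directories):
    """Split directory-only patterns into two hash sets.

    recursive: directories whose entire contents are wanted.
    parents: their ancestors, where only immediate files are wanted.
    """
    recursive = set(directories)
    parents = set()
    for directory in directories:
        ancestor = posixpath.dirname(directory)
        while ancestor:
            parents.add(ancestor)
            ancestor = posixpath.dirname(ancestor)
    return recursive, parents


def in_cone(path, recursive, parents):
    """Decide membership by walking up the path's own ancestors only."""
    parent = posixpath.dirname(path)
    # Files at the root, or directly inside an ancestor of a cone directory.
    if parent == "" or parent in parents:
        return True
    while parent:
        if parent in recursive:
            return True
        parent = posixpath.dirname(parent)
    return False
```

The cone check costs time proportional to the depth of the path rather than to the number of patterns, which is the property that makes directory-only definitions attractive at monorepo scale.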

This was promising enough to move forward with a full prototype. We decided to make this a separate project from VFS for Git, so it needed its own name: Scalar.

Throw the first one away

Once we had a handle on Git command performance using Git’s sparse-checkout feature, we needed to adapt all of the code that allowed fast clones and fetches to work within that environment. For most Git hosting services, Git’s partial clone feature is the best way to solve for fast clones and fetches. However, Azure Repos has an earlier version that was built for VFS for Git called the GVFS protocol. We needed a way to speak the GVFS protocol to bootstrap clones and to dynamically fetch missing objects during Git commands.

This was our first point of asking, “Should we rewrite, or refactor?” The VFS for Git codebase already had all of the client-side code for speaking the GVFS protocol. Not only that, it also had a large set of end-to-end tests that constructed a complete clone from Azure Repos and then ran thousands of Git commands in that environment to make sure they operated exactly the same as a normal Git clone. Since those tests were a significant asset, we set out to construct the first version of this new project starting with the VFS for Git code.

In this initial prototype, we just wanted to get things working for the end-to-end tests to pass. This process included disabling the virtual filesystem code, but leaving all of the hooks that enabled the GVFS Protocol. We also needed to set up sparse-checkout at clone time before initializing the HEAD reference. This prototype was so rough it still didn’t have the Scalar name: it still operated as if it was the gvfs command-line interface.

Diagram showing that the pre-Scalar prototype mostly deleted code from the GVFS protocol.
The rapid prototyping phase mostly deleted code

The end result wasn’t pretty. We couldn’t hope to ship it since it would break compatibility with previous VFS for Git versions. The tests were cobbled together to make things work, but we had disabled sparse-checkout in the tests since the previous tests assumed that every path could be populated dynamically with the virtual filesystem. However, we got to a point where we could reliably create this new repository setup and measure its success. Since the clones were doing the exact same thing as in VFS for Git, the performance matched exactly. Now, we needed to rebuild it, and do it the right way.

Get to Minimum Viable Product (MVP)

From the success of our initial prototype, we moved on to creating an MVP that we could demo to internal users. Here is where we created the Scalar name, the microsoft/scalar repository, and started doing thorough reviews of all changes.

As a team, we decided it would be best to create a new repository rather than to build the project within the VFS for Git codebase. We did not want to be locked into the architecture of VFS for Git as we moved forward, and we also wanted to take advantage of the commit history for the code in the repository. The first task in creating the new project was renaming all references to the old project.

Diagram detailing that, between the pre-Scalar prototype and the version pushed to microsoft/git, many pieces were renamed.
Cleaning up the prototype and renaming things

Updating tests

The next step we had to do was to make sure that we were sufficiently testing the sparse-checkout environment. Recall that we used the full worktree to get tests passing in the prototype, but now we needed to actually be sure that our sparse-checkout environment would work properly.

For this, we found a minimal set of patterns that would include all of the concrete paths used by the test suite.

Then, we made sure that there were interesting changes happening outside of those patterns that would exercise Git features like git merge or git cherry-pick in interesting ways outside of the sparse-checkout definition.

Finally, we added specific tests that involved changing the sparse-checkout definition to make sure that Git would properly fill in the missing files. In this way, we were able to keep all of the existing tests while also adding new tests that were specific to our environment.

Evaluating the MVP

After completing the product changes and test updates, it was time to evaluate the solution. We ran performance numbers to ensure they matched what we saw in our prototype phase. We created local clones to use in daily work to try and catch any lingering bugs.

But it all came down to evaluating the solution with internal users. We demoed Scalar directly with the Office engineering system team and asked pointed questions about whether this would work for them.

In particular, we were worried about the performance of git checkout. In VFS for Git, git checkout is extremely fast because it doesn’t actually do much work. It clears the filesystem of concrete files and replaces them with virtualized files. The cost of populating the filesystem comes later when those files are read by an IDE or a build process. With Scalar, the filesystem is populated within the git checkout process, so that work is now upfront and clear to the user.

By working directly with the engineering system team, we learned that this git checkout performance was not an issue. Since git checkout changes source files, it invalidates the local build. Build times can take hours in this monorepo after taking new changes, so users typically do not use git checkout until the end of the day when they are ready to trigger a long build overnight. For this reason, git checkout was not a critical path for their developers. In fact, there was great interest in being able to know that they could disconnect from the network and still poke around the code without risk of finding a virtual file.

We were good to go with our plan for Scalar. However, the monorepo team needed to build something of their own. They needed a connection between their build system and sparse-checkout. While they built that, we had time to polish Scalar and make it easier to install and use.

Update architecture under stable conditions

With the benefit of a stable test suite and a few months of runway, we were able to take our MVP and rethink the architecture. In particular, we shed some architectural decisions that were critical to how VFS for Git works, but were no longer needed in Scalar.

VFS for Git requires a process running that can handle requests from the filesystem to populate virtualized content on-demand. The existence of this process creates the concept of a “mounted” repository, and even included the commands gvfs mount and gvfs unmount to toggle this state.

Because this process needed to exist, a lot of other things were placed in that process that could be relocated elsewhere in Scalar. We set out to remove the need for this process.

Since we had already removed the virtual filesystem code, there were two remaining pieces that were in the mount process: performing background maintenance and downloading objects via the GVFS protocol.

For background maintenance, we took the fastest approach and moved the scheduled tasks out of the mount process and into the Scalar.Service global singleton process. We had versions of this service for Windows and macOS to handle things like startup operations. Moving the maintenance tasks to this service was quick and easy.

For the object downloads, it was a bigger job. The existing architecture included a read-object hook custom to microsoft/git that was installed by the scalar clone command, and that hook communicated to the mount process which actually communicated with the server and placed the objects in the repository.

For this, we created a tool within microsoft/git to do these missing object queries via the GVFS protocol directly within the Git codebase. This tool lives underneath the code that fills in objects for Git’s partial clone feature. By connecting this tool to partial clone, we could work to improve partial clone while also helping Scalar users at the same time. One major benefit to working within the partial clone framework is that some missing object requests can be batched together into a single request, while the old read-object hook could only ask for one missing object at a time.
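
The benefit of batching can be illustrated with a toy model that only counts round trips. Everything here is hypothetical: send_request stands in for a network call, the batch size of 100 is an arbitrary assumption, and none of it reflects the real GVFS protocol or the interface of the helper tool.

```python
def fetch_one_at_a_time(missing_oids, send_request):
    """Old hook style: one round trip per missing object."""
    for oid in missing_oids:
        send_request([oid])


def fetch_batched(missing_oids, send_request, batch_size=100):
    """Partial-clone style: group missing objects into fewer round trips."""
    for start in range(0, len(missing_oids), batch_size):
        send_request(missing_oids[start:start + batch_size])


class CountingServer:
    """Stand-in for a server that records how it was called."""

    def __init__(self):
        self.requests = 0
        self.objects = 0

    def __call__(self, oids):
        self.requests += 1
        self.objects += len(oids)


if __name__ == "__main__":
    oids = [f"{n:040x}" for n in range(250)]  # fake 40-hex object IDs
    single, batched = CountingServer(), CountingServer()
    fetch_one_at_a_time(oids, single)
    fetch_batched(oids, batched)
    print(single.requests, batched.requests)
```

Both strategies deliver the same objects; the only thing that changes is how many times the client pays the per-request latency.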

Finally, there was nothing important remaining in the mount process, so we deleted it. In addition, we were able to delete the old Git hook.

At this point, we had simplified the architecture to have fewer moving parts and were ready to ship internally.

Diagram showing that removing the mount process simplified Scalar's architecture.
Removing the mount process with the git-gvfs-helper

Upon success, look for low-hanging fruit

Shortly after announcing Scalar to the world, we realized that Scalar could have a larger benefit to the Git ecosystem than just very large monorepos using Azure Repos.

We extended scalar clone to use Git’s partial clone if the remote did not speak the GVFS protocol. In this way, scalar clone became something a user could run against any Git remote.

This was an inflection point in our lifecycle: we had accomplished what we set out to do, but wanted to put these tools in front of more people and find a wider audience. We started to shift our focus from making updates in the .NET project and instead contributing features to the upstream Git project.

Rethink architecture as conditions change

Up until this point, we were using the existing hook approach that speaks to a third-party filesystem monitor. This meant that we needed to install that third-party tool next to Scalar, but also scalar clone would install the hook in addition to all of its other operations. We realized that we could solve our installation complexities, reduce the complexity of scalar clone, and get faster performance if the filesystem monitor was built into Git. With that context, we began building Git’s builtin filesystem monitor. We took early versions into microsoft/git while it was reviewed carefully by the Git community.

Diagram showing early adoption of the builtin FS Monitor.
Early adoption of builtin FS Monitor

An important Scalar feature was background maintenance, which was accomplished by a service running in the background and launching Git commands at certain intervals to keep data fresh and well-organized. This service existed from the VFS for Git days, so it was easy to keep using it on Windows and macOS. However, when the Office team told us that they needed Linux clients to support some of their web developers, we focused on porting Scalar to Linux. This service was one platform-specific part that would be difficult to implement in .NET.

We decided that instead of creating a new service in Scalar, it would be better to implement background maintenance in Git. Once Git had its own cross-platform way of doing maintenance, Scalar could stop using its custom logic and instead rely on git maintenance run.
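
A minimal sketch of the idea of running tasks at intervals follows. The task-to-frequency table mirrors Git's "incremental" maintenance strategy as we understand it from the Git documentation, and the rule that a less frequent run also covers the more frequent tasks is likewise our reading of that documentation. Treat both as assumptions; the helper names are hypothetical.

```python
# Assumption: this table follows Git's documented "incremental" strategy.
SCHEDULE = {
    "prefetch": "hourly",
    "commit-graph": "hourly",
    "loose-objects": "daily",
    "incremental-repack": "daily",
    "pack-refs": "weekly",
}

# Lower rank means the task runs more often.
RANK = {"hourly": 0, "daily": 1, "weekly": 2}


def tasks_for(frequency):
    """Tasks expected to run when the scheduler fires at this frequency."""
    return sorted(
        task for task, freq in SCHEDULE.items() if RANK[freq] <= RANK[frequency]
    )


def maintenance_command(frequency):
    """The Git command a platform scheduler would launch at each interval."""
    return ["git", "maintenance", "run", f"--schedule={frequency}"]
```

With this arrangement the platform scheduler only needs to know when to launch the command, while the decision about which tasks are due stays inside Git.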

We then removed the service from Scalar.

Diagram showing that removing background maintenance from Scalar left only the CLI and tests.
Background maintenance leaves us with only the CLI and tests

After making this change, we took another look at our architecture and realized something. Suddenly, Scalar was only a command-line interface on top of Git. Why have it be in C#, separate from the Git source code?

The overhead of maintaining Scalar as a .NET tool added to the cost of creating releases and shipping them to users. If Office developers require both the microsoft/git fork of Git and another tool, then things get tricky when we want to release a new version.

We had replaced so many features in the Scalar codebase with Git functionality that starting from a clean slate could allow us to build a more manageable architecture than that of the existing code. Also, by inserting the Scalar CLI into the Git codebase, we could take advantage of internal functions such as using Git config APIs instead of running git config processes to set recommended config values.

With these goals in mind, we ported the Scalar CLI to C in microsoft/git using less than 3,000 lines of code!

This endeavor to recreate the Scalar CLI in the microsoft/git codebase can best be appreciated by seeing that, when we removed all product code, we deleted over 10 times as much code from microsoft/scalar as we added to microsoft/git. We kept the microsoft/scalar repository around as a collection of tests, allowing us to be confident in the new code.

Diagram showing that once the CLI was ported to microsoft/git, only the tests were left behind.
Porting the CLI to microsoft/git leaves only the tests

This was our biggest step in the journey because it involved the largest rewrite of Scalar code. However, the requirements of the Scalar CLI at this point were well-defined and greatly simplified from earlier. We were able to immediately celebrate by no longer shipping the .NET Scalar application to our internal customers and instead rely on just shipping the microsoft/git fork.

There was one downside to this change, though. Before, you could install the .NET Scalar solution on top of any Git version and still get all the benefits of scalar clone. Now, users needed to replace their Git client with microsoft/git in order to get the latest Scalar version. We wanted to make Scalar useful to everyone, not just those that were willing to install our fork.

The journey into core Git

Porting Scalar to C not only enabled hosting the tool in microsoft/git, it opened up the possibility of making Scalar part of the upstream Git project. Although it wouldn’t be the first feature originating in microsoft/git that was contributed upstream, there was no clear precedent for something like Scalar: a standalone executable whose name didn’t start with git in the Git project. That might sound like nothing more than an implementation detail, but it represented a philosophical departure from the existing tools in Git. This divergence would drive us to define what Scalar meant for Git.

contrib/-uting to Git

From the outset, we knew there was a contingent of Git users that would benefit from Scalar beyond microsoft/git’s typical user base. Features like the filesystem monitor, background maintenance, cone mode sparse-checkout, etc. had all become popular among developers in large repositories. Scalar exposed those and a multitude of other features more readily to users. Still, it wasn’t clear that Scalar as a standalone executable was the best, or Git-friendliest, way to present those features.

To gradually introduce the tool to the Git community, Scalar’s journey upstream began in Git’s contrib/ directory. From the contrib/ README:

Although these pieces are available as part of the official git
source tree, they are in somewhat different status.  The
intention is to keep interesting tools around git here, maybe
even experimental ones, to give users an easier access to them,
and to give tools wider exposure, so that they can be improved
faster.

Despite the loose requirements of contrib/, the submitted version of Scalar still required some changes from what was in microsoft/git. First was removing the GVFS protocol-supported clones. As we mentioned earlier, blobless clones were introduced into Scalar as a fallback for clones using the GVFS protocol, so the upstream version defaulted to using blobless partial clones instead. Additionally, to preserve the separation between contrib/ and the main Git repository, the GitHub Actions workflow was also stripped of references to Scalar, including execution of the microsoft/scalar test suite.

However, being in contrib/ did have some drawbacks. In order to build and install Scalar, a user needed to not only build Git from source, but also know to navigate into contrib/scalar/ and build that as well. The separate build and test process also left it prone to changes in the rest of Git unintentionally breaking it. Even with these challenges, this arrangement was exactly what Scalar needed while its features were built out and a long-term plan was developed. As we drew closer to finishing those features, we needed to finally answer the question: what should we do with Scalar?

Home sweet home

As soon as the possibility of upstreaming Scalar materialized, there were lots of ideas about what its final form would look like. One popular idea—which can be found in the original RFC—was to dissolve Scalar into a collection of new git commands and options to existing commands. Another was to have scalar reside in the Git tree in a dedicated subdirectory, like gitk. Another was to reimagine it as a Git built-in command: something like git scalar. Along with these implementation decisions came overarching questions of maintenance and relevance to Git.

As the tool was nearing feature completion upstream and the downsides of contrib/ isolation were weighing on the project, we took a step back and revisited the questions of Scalar’s identity. The result was a proposal to update Scalar’s documentation and outline a three-part approach to making the tool generally available in Git:

  1. Add any remaining large repo performance features to Scalar.
  2. Extract the parts of Scalar that are generally applicable to all Git users into built-in commands and/or options.
  3. Move Scalar into the root tree of Git, built and installed as a standalone executable alongside git.

The crux of this approach was a new framing of Scalar within the Git project. Scalar began, like VFS for Git before it, as a tool with its own features and opinions: how to configure a repository, what workflows to use, etc. As it evolved, those features and opinions were folded into Git or adjusted to align better with the upstream project, leaving Scalar with only the parts that fit the very specific role of configuring large repositories. In essence, Git had a user experience niche left by its myriad of large repo-focused performance features. Scalar filled that niche.

The roadmap to Scalar’s completion emerged from this philosophy. First, a few more particularly impactful features would be added to it (namely, the built-in FSMonitor). Then, because Scalar’s purpose is to configure features for large repositories that aren’t set up by default in Git, the parts that serve all Git users (such as repository diagnostics in scalar diagnose) would be extracted into new or existing Git commands. Finally, Scalar would be moved out of contrib/ and into the main build of the repository, intended to continue existing as a dedicated tool for managing large Git repositories.

The best laid plans often go awry but, fortunately, this one didn’t. Over the course of three upstream patch series, Scalar was streamlined inside of contrib/, then moved into its new home as part of core Git. And just in time for the v2.38.0 release!

Diagram showing that the Scalar project was contributed to git/git.
Scalar now lives in the core git/git project

The past, present, and future of Scalar

We’ve shared the story of Scalar not only to publicize a new and exciting feature in Git (seriously, go try it!), but also to illustrate one of the many paths an open source project can take to reach its users. Planning and re-planning, designing and redesigning, and no shortage of engineering lessons were all necessary steps to make Scalar the powerful tool it is today.

It is now a fully-integrated part of Git, but Scalar’s journey is far from over. Scalability and performance in Git is a hot topic—our own engineering blog is a testament to that—and consistent improvement in that area will undoubtedly be part of Scalar’s future. Today, though, Scalar’s eventful history is what has shaped it into the best way to unlock Git’s full potential on your largest repositories.

AWS Week in Review – October 3, 2022

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-october-3-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

A new week and a new month just started. Curious about the most significant AWS news from the previous seven days? I’ve got you covered with this post.

Last Week’s Launches
Here are the launches that got my attention last week:

Amazon File Cache – A high performance cache on AWS that accelerates and simplifies demanding cloud bursting and hybrid workflows by giving access to files using a fast and familiar POSIX interface, whether the original files live on premises on any file system that can be accessed through NFS v3 or in Amazon S3.

Amazon Data Lifecycle Manager – You can now automatically archive Amazon EBS snapshots to save up to 75 percent on storage costs for those EBS snapshots that you intend to retain for more than 90 days and rarely access.

AWS App Runner – You can now build and run web applications and APIs from source code using the new Node.js 16 managed runtime.

AWS Copilot – The CLI for containerized apps adds IAM permission boundaries, support for FIFO SNS/SQS for the Copilot worker-service pattern, and the use of Amazon CloudFront for low-latency content delivery and fast TLS-termination for public load-balanced web services.

Bottlerocket – The Linux-based operating system purpose-built to run container workloads is now supported by Amazon Inspector. Amazon Inspector can now recommend an update of Bottlerocket if it finds a vulnerability.

Amazon SageMaker Canvas – Now supports mathematical functions and operators for richer data exploration and for understanding the relationships between variables in your data.

AWS Compute Optimizer – Now provides cost and performance optimization recommendations for 37 new EC2 instance types, including bare metal instances (m6g.metal) and compute optimized instances (c7g.2xlarge, hpc6a.48xlarge), and new memory metrics for Windows instances.

AWS Budgets – Use a simplified 1-click workflow for common budgeting scenarios with step-by-step tutorials on how to use each template.

Amazon Connect – Now provides an updated flow designer UI that makes it easier and faster to build personalized and automated end-customer experiences, as well as a queue dashboard to view and compare real-time queue performance through time series graphs.

Amazon WorkSpaces – You can now provision Ubuntu desktops and use virtual desktops for new categories of workloads, such as for your developers, engineers, and data scientists.

Amazon WorkSpaces Core – A fully managed infrastructure-only solution for third-party Virtual Desktop Infrastructure (VDI) management software that simplifies VDI migration and combines your current VDI software with the security and reliability of AWS. Read more about it in this Desktop and Application Streaming blog post.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
A few more blog posts you might have missed:

Introducing new language extensions in AWS CloudFormation – In this Cloud Operations & Migrations blog post, we introduce the new language transform that enhances CloudFormation core language with intrinsic functions that simplify handling JSON strings (Fn::ToJsonString), array lengths (Fn::Length), and update and deletion policies.

Building a GraphQL API with Java and AWS Lambda – This blog shows different options for resolving GraphQL queries using serverless technologies on AWS.

For AWS open-source news and updates, here’s the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more.

Upcoming AWS Events
As usual, there are many opportunities to meet:

AWS Summits – Connect, collaborate, and learn about AWS at these free in-person events: Bogotá (October 4) and Singapore (October 6).

AWS Community Days – AWS Community Day events are community-led conferences to share and learn together. Join us in Amersfoort, Netherlands (on October 3, today), Warsaw, Poland (October 14), and Dresden, Germany (October 19).

That’s all from me for this week. Come back next Monday for another Week in Review!

Danilo

Highlights from Git 2.38

Post Syndicated from Taylor Blau original https://github.blog/2022-10-03-highlights-from-git-2-38/

The open source Git project just released Git 2.38, with features and bug fixes from over 92 contributors, 24 of them new. We last caught up with you on the latest in Git back when 2.37 was released.

To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.

A repository management tool for large repositories

We talk a lot about performance in Git, especially in the context of large repositories. Returning readers of these blog posts will no doubt be familiar with the dozens of performance optimizations that have landed in Git over the years.

But with so many features to keep track of, it can be easy to miss out on some every now and then (along with their corresponding performance gains).

Git’s new built-in repository management tool, Scalar, attempts to solve that problem by curating and configuring a uniform set of features with the biggest impact on large repositories. To start using it, you can either clone a new repository with scalar clone:

$ scalar clone /path/to/repo

Or, you can use the --full-clone option if you don’t want to start out with a sparse checkout. To apply Scalar’s recommended configuration to a clone you already have, you can instead run:

$ cd /path/to/repo
$ scalar register

At the time of writing, Scalar’s default configured features include:

Scalar’s configuration is updated as new (even experimental!) features are introduced to Git. To make sure you’re always using the latest and greatest, be sure to run scalar reconfigure /path/to/repo after a new release to update your repository’s config (or scalar reconfigure -a to update all of your Scalar-registered repositories at once).

Git 2.38 is the first time Scalar has been included in the release, but it has actually existed for much longer. Check back soon for a blog post on how Scalar came to be—from its early days as a standalone .NET application to its journey into core Git!

[source]

Rebase dependent branches with --update-refs

When working on a large feature, it’s often helpful to break up the work across multiple branches that build on each other.

But these branches can become cumbersome to manage when you need to rewrite history in an earlier branch. Since each branch depends on the previous ones, rewriting commits in one branch will leave the subsequent branches disconnected from history after rewriting.

In case that didn’t quite make sense, let’s walk through an example.

Suppose that you are working on a feature (my-feature), but want to break it down into a few distinct parts (maybe for ease of review, or to ensure you’re deploying it safely, etc.). Before you share your work with your colleagues, you build the entire feature up front to make sure that the end result is feasible, like so:

$ git log --oneline origin/main..HEAD
741a3174683 (HEAD -> my-feature/part-three) Part 3: all done!
1ff073007eb Part 3: step two
880c07e326f Part 3: step one
40529bd11dc (my-feature/part-two) Part 2: step two
0a92cc3acd8 Part 2: step one
eed018043ba (my-feature/part-one) Part 1: step three
646c870d69e Part 1: step two
9147f6d2eb4 Part 1: step one

In the example above, the my-feature/part-three branch resembles what you imagine the final state will look like. But the intermediate check-points (my-feature/part-one, and so on) represent the chunks you intend to submit for code review.

After you submit everything, what happens if you want to make a change to one of the patches in part one?

You might create a fixup! commit on top, but squashing that patch into the one you wanted to change from part one will cause parts two and three to become disconnected:

Creating a fixup commit that causes parts two and three to become disconnected

Notice that after we squashed our fix into “Part 1: step one,” the subsequent branches vanished from history. That’s because they didn’t get updated to depend on the updated tip of my-feature/part-one after rebasing.

You could go through and manually check out each branch, resetting each to the right commit. But this can get cumbersome quickly if you have a lot of branches, are making frequent changes, or both.

Git 2.38 ships with a new option to git rebase called --update-refs that knows how to perform these updates for you. Let’s try that same example again with the new version of Git.

Rebasing with the new version of Git, which updates each branch for you.

Because we used --update-refs, git rebase knew to update our dependent branches, so our history remains intact without having to manually update each individual branch.

If you want to use this option every time you rebase, you can run git config --global rebase.updateRefs true to have Git act as if the --update-refs option is always given.

[source]

Tidbits

This release coincides with the Git project’s participation in the annual Google Summer of Code program. This year, the Git project mentored two students, Shaoxuan Yuan and Abhradeep Chakraborty, working on sparse index integration and various improvements to reachability bitmaps, respectively.

  • Shaoxuan’s first contribution was integrating the git rm command with the sparse index. The sparse index is a relatively new Git feature that enables Git to shrink the size of its index data structure to only track the contents of your sparse checkout, instead of the entire repository. Long-time readers will remember that Git commands have been converted to be compatible with the sparse-index one-by-one. Commands that aren’t compatible with the sparse index need to temporarily expand the index to cover the entire repository, leading to slow-downs when working in a large repository.

    Shaoxuan’s work made the git rm command compatible with the sparse index, causing it to only expand the index when necessary, bringing Git closer to having all commands be compatible with the sparse index by default.

    [source]

  • Shaoxuan also worked on improving git mv’s behavior when moving a path from within the sparse checkout definition (sometimes called a “cone”) to outside of the sparse checkout. There were a number of corner cases that required careful reasoning, and curious readers can learn more about exactly how this was implemented in the patches linked below.

    [source]

  • Abhradeep worked on adding a new “lookup table” extension to Git’s reachability bitmap index. For those unfamiliar, this index (stored in a .bitmap file) associates a set of commits to a set of bitmaps, where each bit position corresponds to an object. A 1 bit indicates that a commit can reach the object specified by that bit position, and a 0 indicates that it cannot.

    But .bitmap files do not list their selected commits in a single location. Instead, they prefix each bitmap with the object ID of the commit it corresponds to. That means that in order to know what set of commits are covered by a .bitmap, Git must read the entire contents of the file to discover the set of bitmapped commits.

    Abhradeep addressed this shortcoming by adding an optional “lookup table” at the end of the .bitmap format, which provides a concise list of selected commits, as well as the offset of their corresponding bitmaps within the file. This provided some speed-ups across a handful of benchmarks, making bitmaps faster to load and use, especially for large repositories.

    [source]

  • Abhradeep also worked on sprucing up the technical documentation for the .bitmap format. So if you have ever been curious about or want to hack on Git’s bitmap internals, now is the time!

    [source]

For more about these projects, you can check out each contributor’s final blog posts here and here. Thank you, Shaoxuan, and Abhradeep!

Now that we’ve covered a handful of changes contributed by Google Summer of Code students, let’s take a look at some changes in this release of Git from other Git contributors.

  • You may not be familiar with Git’s merge-tree command, which historically was used to compute trivial three-way merges using Git’s recursive merge strategy. In Git 2.38, this command now knows how to integrate with the new ort merge strategy, allowing it to compute non-trivial merges without touching the index or working copy.

    The existing mode is still available behind a (deprecated) --trivial-merge option. When the new --write-tree mode is used, merge-tree takes two branches to merge, and computes the result using the ort strategy, all without touching the working copy or index. It outputs the resulting tree’s object ID, along with some information about any conflicts it encountered.

    As an aside, we at GitHub recently started using merge-ort to compute merges on GitHub.com more than an order of magnitude faster than before. We had previously used the implementation in libgit2 in order to compute merges without requiring a worktree, since GitHub stores repositories as bare, meaning we do not have a worktree to rely on. These changes will make their way to GitHub Enterprise beginning with version 3.7.

    [source]

  • Bare Git repositories can be stored in and distributed with other Git repositories. This is often convenient, for example, as an easy mechanism to distribute Git repositories for use as test fixtures.

    When using repositories from less-than-trustworthy sources, this can also present a security risk. Git repositories often execute user-defined programs specified via the $GIT_DIR/config file. For example, core.pager defines which pager program Git uses, and core.editor defines which editor Git opens when you want to write a commit message (among other things).

    There are other examples, but an often-discussed one is the core.fsmonitor configuration, which can be used to specify a path to a filesystem monitoring hook. Because Git often needs to query the state of the filesystem, this hook (when configured) is invoked many times, including from git status, which people commonly script around in their shell prompt.

    This means that it’s possible to convince a victim to run arbitrary code by convincing them to clone a repository with a malicious bare repository embedded inside of it. If they change their working directory into the malicious repository within (since you cannot embed a bare repository at the top-level directory of a repository) and run some Git command, then they are likely to execute the script specified by core.fsmonitor (or any other configuration that specifies a command to execute).

    For this reason, the new safe.bareRepository configuration was introduced. When set to “explicit,” Git will only work with bare repositories specified by the top-level --git-dir argument. Otherwise, when set to “all” (which is the default), Git will continue to work with all bare repositories, embedded or not.

    It is worth noting that setting safe.bareRepository to “explicit” is only required if you worry that you may be cloning malicious repositories and executing Git commands in them.

    [source]

  • git grep learned a new -m option (short for --max-count), which behaves like GNU grep’s option of the same name. This new option limits the number of matches shown per file. This can be especially useful when combined with other options, like -C or -p (which show code context, or the name of the function which contains each match).

    You could, for example, combine all three of these options to show a summary of how some function is called by many different files in your project. Git has a handful of functions that contain the substring oid_object_info. If you want to look at how callers across different files are structured without seeing more than one example from the same file, you can now run:

    $ git grep -C3 -p -m1 oid_object_info

    [source]

  • If you’ve ever scripted around the directory contents of your Git repository, there’s no doubt that you’ve encountered the git ls-files command. Unlike ls-tree (which lists the contents of a tree object), ls-files lists the contents of the index, the working directory, or both.

    There are already lots of options which can further specify what does or doesn’t get printed in ls-files’s output. But its output was not easily customizable without additional scripting.

    In Git 2.38, that is no longer the case, with ls-files’s new --format option. You can now customize how each entry is printed, with fields to print an object’s name and mode, as well as more esoteric options, like its stage in the index, or end-of-line (EOL) behavior.

    [source]

  • git cat-file also learned a new option to respect the mailmap when printing the contents of objects with identifiers in them. This feature was contributed by another Google Summer of Code student, this time working on behalf of GitLab!

    For the uninitiated, the mailmap is a feature which allows mapping name and email pairs to their canonical values, which can be useful if you change your name or email and want to retain authorship over historical commits without rewriting history.

    git show and many other tools already understand how to remap identities under the mailmap (for example, git show’s %aN and %aE format placeholders print the mailmapped author name and email, respectively, as opposed to %an and %ae, which don’t respect the mailmap). But git cat-file, a low-level command that prints the contents of objects, did not know how to perform this conversion.

    That meant that if you wanted to print a stream of objects, but transform any author, committer, or tagger identities according to the mailmap, you would have to pipe their contents through git show or similar. This is no longer the case, since git cat-file now understands the --[no-]use-mailmap option, meaning this transformation can be done before printing out object contents.

    [source]

  • Finally, Git’s developer documentation got an improvement in this most recent release, by adding a codified version of the Git community’s guidelines for code review. This document is a helpful resource for new and existing contributors to learn about the cultural norms around reviewing patches on the Git mailing list.

    If you’ve ever had the itch to contribute to the Git project, I highly encourage you to read the new reviewing guidelines (as well as the coding guidelines, and the “My First Contribution” document) and get started!

    [source]

The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.38, or any previous version in the Git repository.

Introducing workerd: the Open Source Workers runtime

Post Syndicated from Kenton Varda original https://blog.cloudflare.com/workerd-open-source-workers-runtime/

Today I’m proud to introduce the first beta release of workerd, the JavaScript/Wasm runtime based on the same code that powers Cloudflare Workers. workerd is Open Source under the Apache License version 2.0.

workerd shares most of its code with the runtime that powers Cloudflare Workers, but with some changes designed to make it more portable to other environments. The name “workerd” (pronounced “worker dee”) comes from the Unix tradition of naming servers with a “-d” suffix standing for “daemon”. The name is not capitalized because program names are traditionally lower-case in Unix-like environments.

What it’s for

Self-hosting Workers

workerd can be used to self-host applications that you’d otherwise run on Cloudflare Workers. It is intended to be a production-ready web server for this purpose. workerd has been designed to be unopinionated about hosting environments, so that it should fit nicely into whatever server/VM/container hosting and orchestration system you prefer. It’s just a web server.

Workers has always been based on standardized APIs, so that code is not locked into Cloudflare, and we work closely with other runtimes to promote compatibility. workerd provides another option to ensure that applications built on Workers can run anywhere, by leveraging the same underlying code to get exact, “bug-for-bug” compatibility.

Local development and testing

workerd is also designed to facilitate realistic local testing of Workers. Up until now, this has been achieved using Miniflare, which simulated the Workers API within a Node.js environment. Miniflare has worked well, but in a number of cases its behavior did not exactly match Workers running on Cloudflare. With the release of workerd, Miniflare and the Wrangler CLI tool will now be able to provide a more accurate simulation by leveraging the same runtime code we use in production.

Programmable proxies

workerd can act as an application host, a proxy, or both. It supports both forward and reverse proxy modes. In all cases, JavaScript code can be used to intercept and process requests and responses before forwarding them on. Traditional web servers and proxies have used bespoke configuration languages with quirks that are hard to master. Programming proxies in JavaScript instead provides more power while making the configuration easier to write and understand.

What it is

workerd is not just another way to run JavaScript and Wasm. Our runtime is uniquely designed in a number of ways.

Server-first

Many non-browser JavaScript and Wasm runtimes are designed to be general-purpose: you can use them to build command-line apps, local GUI apps, servers, or anything in between. workerd is not. It specifically focuses on servers, in particular (for now, at least) HTTP servers.

This means in particular that workerd-based applications are event-driven at the top level. Applications do not open listen sockets and accept connections from them; instead, the runtime pushes events to the application. It may seem like a minor difference, but this basic change in perspective directly enables many of the features below.

Web standard APIs

Wherever possible, Workers (and workerd in particular) offers the same standard APIs found in web browsers, such as Fetch, URL, WebCrypto, and others. This means that code built on workerd is more likely to be portable to browsers as well as to other standards-based runtimes. When Workers launched five years ago, it was unusual for a non-browser to offer web APIs, but we are pleased to see that the broader JavaScript ecosystem is now converging on them.

Nanoservices

workerd is a nanoservice runtime. What does that mean?

Microservices have become popular over the last decade as a way to split monolithic servers into smaller components that could be maintained and deployed independently. For example, a company that offers several web applications with a common user authentication flow might have a separate team that maintains the authentication logic. In a monolithic model, the authentication logic might have been offered to the application teams as a library. However, this could be frustrating for the maintainers of that logic, as making any change might require waiting for every application team to deploy an update to their respective server. By splitting the authentication logic into a separate server that all the others talk to, the authentication team is able to deploy changes on their own schedule.

However, microservices have a cost. What was previously a fast library call instead now requires communicating over a network. In addition to added overhead, this communication requires configuration and administration to ensure security and reliability. These costs become greater as the codebase is split into more and more services. Eventually, the costs outweigh the benefits.

Nanoservices are a new model that achieves the benefits of independent deployment with overhead closer to that of library calls. With workerd, many Workers can be configured to run in the same process. Each Worker runs in a separate “isolate”, which gives the appearance of running independently of the others: each isolate loads separate code and has its own global scope. However, when one Worker explicitly sends a request to another Worker, the destination Worker actually runs in the same thread with zero latency. So, it performs more like a function call.

With nanoservices, teams can now break their code into many more independently-deployed pieces without worrying about the overhead.

(Some in the industry prefer to call nanoservices “functions”, implying that each individual function making up an application could be its own service. I feel, however, that this puts too much emphasis on syntax rather than logical functionality. That said, it is the same concept.)

To really make nanoservices work well, we had to minimize the baseline overhead of each service. This required designing workerd very differently from most other runtimes, so that common resources could be shared between services as much as possible. First, as mentioned, we run many nanoservices within a single process, to share basic process overhead and minimize context switching costs. A second big architectural difference between workerd and other runtimes is how it handles built-in APIs. Many runtimes implement significant portions of their built-in APIs in JavaScript, which must then be loaded separately into each isolate. workerd does not; all the APIs are implemented in native code, so that all isolates may share the same copy of that code. These design choices would be difficult to retrofit into another runtime, and indeed these needs are exactly why we chose to build a custom runtime for Workers from the start.

Homogeneous deployment

In a typical microservices model, you might deploy different microservices to containers running across a cluster of machines, connected over a local network. You might manually choose how many containers to dedicate to each service, or you might configure some form of auto-scaling based on resource usage.

workerd offers an alternative model: Every machine runs every service.

workerd’s nanoservices are much lighter-weight than typical containers. As a result, it’s entirely reasonable to run a very large number of them – hundreds, maybe thousands – on a single server. This in turn means that you can simply deploy every service to every machine in your fleet.

Homogeneous deployment means that you don’t have to worry about scaling individual services. Instead, you can simply load balance requests across the entire cluster, and scale the cluster as needed. Overall, this can greatly reduce the amount of administration work needed.

Cloudflare itself has used the homogeneous model on our network since the beginning. Every one of Cloudflare’s edge servers runs our entire software stack, so any server can answer any kind of request on its own. We’ve found it works incredibly well. This is why services on Cloudflare – including ones that use Workers – are able to go from no traffic at all to millions of requests per second instantly without trouble.

Capability bindings: cleaner configuration and SSRF safety

workerd takes a different approach from most runtimes – indeed, from most software development platforms – in how an application accesses external resources.

Most development platforms start from assuming that the application can talk to the whole world. It is up to the application to figure out exactly what it wants to talk to, and name it in some global namespace, such as using a URL. So, an application server that wants to talk to the authentication microservice might use code like this:

// Traditional approach without capability bindings.
fetch("https://auth-service.internal-network.example.com/api", {
  method: "POST",
  body: JSON.stringify(authRequest),
  headers: { "Authorization": env.AUTH_SERVICE_TOKEN }
});

In workerd, we do things differently. An application starts out with no ability to talk to the rest of the world, and must be configured with specific capability bindings that provide it access to specific external resources. So, an application which needs to be able to talk to the authentication service would be configured with a binding called AUTH_SERVICE, and the code would look something like this:

// Capability-based approach. Hostname doesn't matter; all
// requests to AUTH_SERVICE.fetch() go to the auth service.
env.AUTH_SERVICE.fetch("https://auth/api", {
  method: "POST",
  body: JSON.stringify(authRequest),
});

This may at first appear to be a trivial difference. In both cases, we have to use configuration to control access to external services. In the traditional approach, we’d provide access tokens (and probably the service’s hostname) as environment variables. In the new approach, the environment goes a bit further to provide a full-fledged object. Is this just syntax sugar?

It turns out, this slight change has huge advantages:

First, we can now restrict the global fetch() function to accept only publicly-routable URLs. This makes applications totally immune to SSRF attacks! You cannot trick an application into accessing an internal service unintentionally if the code to access internal services is explicitly different. (In fact, the global fetch() is itself backed by a binding, which can be configured. workerd defaults to connecting it to the public internet, but you can also override it to permit private addresses if you want, or to route to a specific proxy service, or to be blocked entirely.)

With that done, we now have an interesting property: All internal services which an application uses must be configurable. This means:

  • You can easily see a complete list of the internal services an application talks to, without reading all the code.
  • You can always replace these services with mocks for testing purposes.
  • You can always configure an application to authenticate itself differently (e.g. client certificates) or use a different back end, without changing code.

The receiving end of a binding benefits, too. Take the authentication service example, above. The auth service may be another Worker running in workerd as a nanoservice. In this case, the auth service does not need to be bound to any actual network address. Instead, it may be made available strictly to other Workers through their bindings. In this case, the authentication service doesn’t necessarily need to verify that a request received came from an allowed client – because only allowed clients are able to send requests to it in the first place.

Overall, capability bindings allow simpler code that is secure by default, more composable, easier to test, and easier to understand and maintain.

Always backwards compatible

Cloudflare Workers has a hard rule against ever breaking a live Worker running in production. This same dedication to backwards compatibility extends to workerd.

workerd shares Workers’ compatibility date system to manage breaking changes. Every Worker must be configured with a “compatibility date”. The runtime then ensures that the API behaves exactly as it did on that date. At your leisure, you may check the documentation to see if new breaking changes are introduced at a future date, and update your code for them. Most such changes are minor and most code won’t require any changes. However, you are never obliged to update. Old dates will continue to be supported by newer versions of workerd. It is always safe to update workerd itself without updating your code.

What it’s not

To avoid misleading or disappointing anyone, I need to take a moment to call out what workerd is not.

workerd is not a Secure Sandbox

It’s important to note that workerd is not, on its own, a secure way to run possibly-malicious code. If you wish to run code you don’t trust using workerd, you must enclose it in an additional sandboxing layer, such as a virtual machine configured for sandboxing.

workerd itself is designed such that a Worker should not be able to access any external resources to which it hasn’t been granted a capability. However, a complete sandbox solution not only must be designed to restrict access, but also must account for the possibility of bugs – both in software and in hardware. workerd on its own is not sufficient to protect against hardware bugs like Spectre, nor can it adequately defend against the possibility of vulnerabilities in V8 or in workerd’s own code.

The Cloudflare Workers service uses the same code found in workerd, but adds many additional layers of security on top to harden against such bugs. I described some of these in a past blog post. However, these measures are closely tied to our particular environment. For example, we rely on build automation to push V8 patches to production immediately upon becoming available; we separate customers according to risk profile; we rely on non-portable kernel features and assumptions about the host system to enforce security and resource limits. All of this is very specific to our environment, and cannot be packaged up in a reusable way.

workerd is not an independent project

workerd is the core of Cloudflare Workers, a fast-moving project developed by a dedicated team at Cloudflare. We are not throwing code over the wall to forget about, nor are we expecting volunteers to do our jobs for us. workerd’s GitHub repository will be the canonical source used by Cloudflare Workers and our team will be doing much of their work directly in this repository. Just like V8 is developed primarily by the Chrome team for use in Chrome, workerd will be developed primarily by the Cloudflare Workers team for use in Cloudflare Workers.

This means we cannot promise that external contributions will sit on a level playing field with internal ones. Code reviews take time, and work that is needed for Cloudflare Workers will take priority. We also cannot promise we will accept every feature contribution. Even if the code is already written, reviews and maintenance have a cost. Within Cloudflare, we have a product management team who carefully evaluates what features we should and shouldn’t offer, and plenty of ideas generated internally ultimately don’t make the cut.

If you want to contribute a big new feature to workerd, your best bet is to talk to us before you write code, by raising an issue on GitHub early to get input. That way, you can find out if we’re likely to accept a PR before you write it. We also might be able to give hints on how best to implement.

It’s also important to note that while workerd’s internal interfaces may sometimes appear clean and reusable, we cannot make any guarantee that those interfaces won’t completely change on a whim. If you are trying to build on top of workerd internals, you will need to be prepared either to accept a fair amount of churn, or pin to a specific version.

workerd is not an off-the-shelf edge compute platform

As hinted above, the full Cloudflare Workers service involves a lot of technology beyond workerd itself, including additional security, deployment mechanisms, orchestration, and so much more. workerd itself is a portion of our runtime codebase, which is itself a small (albeit critical) piece of the overall Cloudflare Workers service.

We are pleased, though, that this means it is possible for us to release this code under a permissive Open Source license.

Try the Beta

As of this blog post, workerd is in beta. If you want to try it out,

Choose the k-NN algorithm for your billion-scale use case with OpenSearch

Post Syndicated from Jack Mazanec original https://aws.amazon.com/blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/

When organizations set out to build machine learning (ML) applications such as natural language processing (NLP) systems, recommendation engines, or search-based systems, k-Nearest Neighbor (k-NN) search is often used at some point in the workflow. As the number of data points reaches the hundreds of millions or even billions, scaling a k-NN search system can be a major challenge. Applying Approximate Nearest Neighbor (ANN) search is a great way to overcome this challenge.

The k-NN problem is relatively simple compared to other ML techniques: given a set of points and a query, find the k nearest points in the set to the query. The naive solution is equally understandable: for each point in the set, compute its distance from the query and keep track of the top k along the way.

K-NN concept

The problem with this naive approach is that it doesn’t scale particularly well. The runtime search complexity is O(N log k), where N is the number of vectors and k is the number of nearest neighbors. This cost may go unnoticed when the set contains thousands of points, but it becomes significant when the size gets into the millions. Although some exact k-NN algorithms can speed search up, they tend to perform similarly to the naive approach in higher dimensions.
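To make the naive approach concrete, here is a minimal Python sketch. It assumes Euclidean distance and vectors stored as tuples; the function name naive_knn is illustrative and not part of OpenSearch. A bounded heap of size k is what gives the O(N log k) behavior.

```python
import heapq
import math


def naive_knn(points, query, k):
    """Return the k points closest to query (Euclidean), nearest first."""
    # Bounded max-heap of size k, simulated with negated distances in a min-heap.
    heap = []
    for idx, point in enumerate(points):
        dist = math.dist(point, query)
        if len(heap) < k:
            heapq.heappush(heap, (-dist, idx))
        elif dist < -heap[0][0]:
            # Closer than the current worst of the top k: replace it.
            heapq.heapreplace(heap, (-dist, idx))
    # Largest negated distance first means smallest distance first.
    return [points[idx] for _, idx in sorted(heap, reverse=True)]


points = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0), (2.0, 2.0)]
nearest = naive_knn(points, (0.9, 0.9), k=2)
```

Every query touches all N points, which is exactly the cost that ANN algorithms try to avoid.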

Enter ANN search. We can reduce the runtime search latency if we loosen a few constraints on the k-NN problem:

  • Allow indexing to take longer
  • Allow more space to be used at query time
  • Allow the search to return an approximation of the k-NN in the set

Several different algorithms have been discovered to do just that.

OpenSearch is a community-driven, Apache 2.0-licensed, open-source search and analytics suite that makes it easy to ingest, search, visualize, and analyze data. The OpenSearch k-NN plugin provides the ability to use some of these algorithms within an OpenSearch cluster. In this post, we discuss the different algorithms that are supported and run experiments to see some of the trade-offs between them.

Hierarchical Navigable Small Worlds algorithm

The Hierarchical Navigable Small Worlds algorithm (HNSW) is one of the most popular algorithms out there for ANN search. It was the first algorithm that the k-NN plugin supported, using a very efficient implementation from the nmslib similarity search library. It has one of the best query latency vs. recall trade-offs and doesn’t require any training. The core idea of the algorithm is to build a graph with edges connecting index vectors that are close to each other. Then, on search, this graph is partially traversed to find the approximate nearest neighbors to the query vector. To steer the traversal towards the query’s nearest neighbors, the algorithm always visits the closest candidate to the query vector next.

But which vector should the traversal start from? It could just pick a random vector, but for a large index, this might be very far from the query’s actual nearest neighbors, leading to poor results. To pick a vector that is generally close to the query vector to start from, the algorithm builds not just one graph, but a hierarchy of graphs. All vectors are added to the bottom layer, and then a random subset of those are added to the layer above, and then a subset of those are added to the layer above that, and so on.

During search, we start from a random vector in the top layer, partially traverse the graph to find (approximately) the nearest point to the query vector in that layer, and then use this vector as the starting point for our traversal of the layer below. We repeat this until we get to the bottom layer. At the bottom layer, we perform the traversal, but this time, instead of just searching for the nearest neighbor, we keep track of the k-nearest neighbors that are visited along the way.

The following figure illustrates this process (inspired by the image in the original paper, Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs).

You can tune three parameters for HNSW:

  • m – The maximum number of edges a vector will get in a graph. The higher this number is, the more memory the graph will consume, but the better the search approximation may be.
  • ef_search – The size of the queue of the candidate nodes to visit during traversal. When a node is visited, its neighbors are added to the queue to be visited in the future. When this queue is empty, the traversal will end. A larger value will increase search latency, but may provide better search approximation.
  • ef_construction – Very similar to ef_search. When a node is to be inserted into the graph, the algorithm will find its m edges by querying the graph with the new node as the query vector. This parameter controls the candidate queue size for this traversal. A larger value will increase index latency, but may provide a better search approximation.
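The traversal of a single graph layer, and the role of ef_search as the size of the candidate pool, can be sketched as follows. This is a simplified illustration, not nmslib's implementation: the graph is assumed to be a plain adjacency dictionary, and the names greedy_graph_search, vectors, and neighbors are hypothetical.

```python
import heapq
import math


def greedy_graph_search(vectors, neighbors, entry, query, k, ef_search):
    """Best-first traversal of one graph layer, keeping the ef_search best nodes."""

    def dist(node):
        return math.dist(vectors[node], query)

    visited = {entry}
    candidates = [(dist(entry), entry)]  # min-heap: closest candidate first
    best = [(-dist(entry), entry)]       # max-heap (negated) of kept results
    while candidates:
        d, node = heapq.heappop(candidates)
        # Stop when the closest candidate is farther than the worst kept result.
        if len(best) >= ef_search and d > -best[0][0]:
            break
        for nxt in neighbors[node]:
            if nxt in visited:
                continue
            visited.add(nxt)
            dn = dist(nxt)
            if len(best) < ef_search or dn < -best[0][0]:
                heapq.heappush(candidates, (dn, nxt))
                heapq.heappush(best, (-dn, nxt))
                if len(best) > ef_search:
                    heapq.heappop(best)  # drop the farthest kept node
    return [node for _, node in sorted(best, reverse=True)][:k]


# Toy layer: five 1-dimensional vectors connected in a chain.
vectors = {0: (0.0,), 1: (1.0,), 2: (2.0,), 3: (3.0,), 4: (4.0,)}
neighbors = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3]}
result = greedy_graph_search(vectors, neighbors, entry=0, query=(3.2,), k=2, ef_search=3)
```

In the full algorithm, the same routine would run on each upper layer with a pool of size 1 to find the entry point for the layer below.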

For more information on HNSW, you can read through the paper Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.

Memory consumption

Although HNSW provides very good approximate nearest neighbor search at low latencies, it can consume a large amount of memory. Each HNSW graph uses roughly 1.1 * (4 * d + 8 * m) * num_vectors bytes of memory:

  • d is the dimension of the vectors
  • m is the algorithm parameter that controls the number of connections each node will have in a layer
  • num_vectors is the number of vectors in the index

To ensure durability and availability, especially when running production workloads, OpenSearch indexes are recommended to have at least one replica shard. Therefore, the memory requirement is multiplied by (1 + number of replicas). For use cases where the data size is 1 billion vectors of 128 dimensions each and m is set to the default value of 16, the estimated amount of memory required would be:

1.1 * (4 * 128 + 8 * 16) * 1,000,000,000 * 2 = 1,408 GB.
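The same estimate can be reproduced with a small Python helper, which makes it easier to try other values of d and m. The function name and the replicas argument are illustrative; the formula is the one given above.

```python
def hnsw_memory_bytes(d, m, num_vectors, replicas=1):
    """Rough HNSW memory estimate: 1.1 * (4 * d + 8 * m) bytes per vector copy."""
    per_vector = 4 * d + 8 * m
    return 1.1 * per_vector * num_vectors * (1 + replicas)


estimate = hnsw_memory_bytes(d=128, m=16, num_vectors=1_000_000_000)
print(f"{estimate / 1e9:,.0f} GB")  # 1,408 GB
```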

If we increase the dimension of the vectors to 512, for example, and m to 100, which is recommended for vectors with high intrinsic dimensionality, some use cases can require a total memory of approximately 4 TB.

With OpenSearch, we can always horizontally scale the cluster to handle this memory requirement. However, this comes at the expense of raising infrastructure costs. For cases where scaling doesn’t make sense, options to reduce the memory footprint of the k-NN system need to be explored. Fortunately, there are algorithms that we can use to do this.

Inverted File System algorithm

Consider a different approach for approximating a nearest neighbor search: separate your index vectors into a set of buckets, then, to reduce your search time, only search through a subset of these buckets. From a high level, this is what the Inverted File System (IVF) ANN algorithm does. In OpenSearch 1.2, the k-NN plugin introduced support for the implementation of IVF by Faiss. Faiss is an open-sourced library from Meta for efficient similarity search and clustering of dense vectors.

However, if we just randomly split up our vectors into different buckets, and only search a subset of them, this will be a poor approximation. The IVF algorithm uses a more elegant approach. First, before indexing begins, it assigns each bucket a representative vector. When a vector is indexed, it gets added to the bucket that has the closest representative vector. This way, vectors that are closer to each other are placed roughly in the same or nearby buckets.

To determine what the representative vectors for the buckets are, the IVF algorithm requires a training step. In this step, k-Means clustering is run on a set of training data, and the centroids it produces become the representative vectors. The following diagram illustrates this process.

Inverted file system indexing concept

IVF has two parameters:

  • nlist – The number of buckets to create. More buckets will result in longer training times, but may improve the granularity of the search.
  • nprobes – The number of buckets to search. This parameter is fairly straightforward. The more buckets that are searched, the longer the search will take, but the better the approximation.
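The bucket assignment and the effect of nprobes can be sketched in a few lines of Python. This is an illustration only, not Faiss's implementation: it assumes the centroids have already been produced by the training step, and the function names build_ivf and ivf_search are hypothetical.

```python
import math


def build_ivf(vectors, centroids):
    """Put each vector in the bucket whose representative vector is closest."""
    buckets = {i: [] for i in range(len(centroids))}
    for vec in vectors:
        nearest = min(range(len(centroids)), key=lambda i: math.dist(vec, centroids[i]))
        buckets[nearest].append(vec)
    return buckets


def ivf_search(buckets, centroids, query, k, nprobes):
    """Scan only the nprobes buckets whose centroids are closest to the query."""
    order = sorted(range(len(centroids)), key=lambda i: math.dist(query, centroids[i]))
    candidates = [vec for i in order[:nprobes] for vec in buckets[i]]
    return sorted(candidates, key=lambda vec: math.dist(vec, query))[:k]


centroids = [(0.0, 0.0), (10.0, 10.0)]  # assumed output of k-Means training
vectors = [(1.0, 1.0), (0.0, 2.0), (9.0, 9.0), (11.0, 10.0)]
buckets = build_ivf(vectors, centroids)
top = ivf_search(buckets, centroids, query=(8.0, 8.0), k=1, nprobes=1)
```

With nprobes equal to the number of buckets, the search degenerates into the exact brute-force scan; smaller values trade recall for speed.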

Memory consumption

In general, IVF requires less memory than HNSW because IVF doesn’t need to store a set of edges for each indexed vector.

We estimate that IVF will roughly require the following amount of memory:

1.1 * (((4 * dimension) * num_vectors) + (4 * nlist * dimension)) bytes

For the case explored for HNSW where there are 1,000,000,000 128-dimensional vectors with one layer of replication, an IVF algorithm with an nlist of 4096 would take roughly 1.1 * (((4 * 128) * 2,000,000,000) + (4 * 4096 * 128)) bytes = 1126 GB.

This savings does come at a cost, however, because HNSW offers a better query latency versus approximation accuracy tradeoff.

Product quantization vector compression

Although you can use HNSW and IVF to speed up nearest neighbor search, they can consume a considerable amount of memory. When we get into the billion-vector scale, we start to require thousands of GBs of memory to support their index structures. As we scale up the number of vectors or the dimension of vectors, this requirement continues to grow. Is there a way to use noticeably less space for our k-NN index?

The answer is yes! In fact, there are a lot of different ways to reduce the amount of memory vectors require. You can change your embedding model to produce smaller vectors, or you can apply techniques like Principal Component Analysis (PCA) to reduce the vector’s dimensionality. Another approach is to use quantization. The general idea of vector quantization is to map a large vector space with continuous values into a smaller space with discrete values. When a vector is mapped into a smaller space, it requires fewer bits to represent. However, this comes at a cost: when mapping to a smaller space, some information about the vector is lost.

Product quantization (PQ) is a very popular quantization technique in the field of nearest neighbor search. It can be used together with ANN algorithms for nearest neighbor search. Along with IVF, the k-NN plugin added support for Faiss’s PQ implementation in OpenSearch 1.2.

The main idea of PQ is to break up a vector into several sub-vectors and encode the sub-vectors independently with a fixed number of bits. The number of sub-vectors that the original vector is broken up into is controlled by a parameter, m, and the number of bits to encode each sub-vector with is controlled by a parameter, code_size. After encoding finishes, a vector is compressed into roughly m * code_size bits. So, assume we have a set of 100,000 1024-dimensional vectors. With m = 8 and code_size = 8, PQ breaks each vector into eight 128-dimensional sub-vectors and encodes each sub-vector with 8 bits.

The values used for encoding are produced during a training step. During training, tables are created with 2^code_size entries for each sub-vector partition. Next, k-Means clustering, with a k value of 2^code_size, is run on the corresponding partition of sub-vectors from the training data. The centroids produced here are added as the entries to the partition’s table.

After all the tables are created, we encode a vector by replacing each sub-vector with the ID of the closest vector in the partition’s table. In the example where code_size = 8, we only need 8 bits to store an ID because there are 2^8 elements in the table. So, with dimension = 1024 and m = 8, the total size of one vector (assuming it uses a 32-bit floating point data type) is reduced from 4,096 bytes to roughly 8 bytes!

Product quantization encoding step

When we want to decode a vector, we can reconstruct an approximated version of it by using the stored IDs to retrieve the vectors from each partition’s table. The distance from the query vector to the reconstructed vector can then be computed and used in a nearest neighbor search. (It’s worth noting that, in practice, further optimization techniques like ADC are used to speed up this process for k-NN search).
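The encoding and decoding steps can be sketched in Python as follows. This is a toy illustration rather than Faiss's implementation: the partition tables (called codebooks here, a hypothetical name) are assumed to have been trained already, and the example uses m = 2 with only two entries per table so that the values are easy to follow.

```python
import math


def pq_encode(vector, codebooks):
    """Replace each sub-vector with the ID of the closest entry in its table."""
    m = len(codebooks)
    sub_dim = len(vector) // m
    codes = []
    for j, table in enumerate(codebooks):
        sub = vector[j * sub_dim:(j + 1) * sub_dim]
        codes.append(min(range(len(table)), key=lambda i: math.dist(sub, table[i])))
    return codes


def pq_decode(codes, codebooks):
    """Rebuild an approximate vector by concatenating the referenced entries."""
    approx = []
    for code, table in zip(codes, codebooks):
        approx.extend(table[code])
    return tuple(approx)


# Assumed, pre-trained tables: one per partition, each with 2 entries.
codebooks = [
    [(0.0, 0.0), (1.0, 1.0)],
    [(0.0, 0.0), (5.0, 5.0)],
]
codes = pq_encode((0.9, 1.2, 4.0, 6.0), codebooks)
approx = pq_decode(codes, codebooks)
```

The decoded vector is only an approximation of the original, which is the information loss that quantization trades for a smaller memory footprint.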

Product quantization decoding step

Memory consumption

As we mentioned earlier, PQ will encode each vector into roughly m * code_size bits plus some overhead for each vector.

When combining it with IVF, we can estimate the index size as follows:

1.1 * ((((code_size / 8) * m + overhead_per_vector) * num_vectors) + (4 * nlist * dimension) + (2^code_size * 4 * dimension)) bytes

Using 1 billion vectors, dimension = 128, m = 8, code_size = 8, and nlist = 4096, we get an estimated total memory consumption of 70GB: 1.1 * ((((8 / 8) * 8 + 24) * 1,000,000,000) + (4 * 4096 * 128) + (2^8 * 4 * 128)) * 2 = 70 GB.
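As a quick check of this arithmetic, the estimate can be written as a small Python helper. The function name is illustrative, and the default overhead_per_vector of 24 bytes is taken from the worked example above rather than from any documented constant.

```python
def ivfpq_memory_bytes(num_vectors, dimension, m, code_size, nlist,
                       overhead_per_vector=24, replicas=1):
    """Rough IVFPQ memory estimate following the formula in the text."""
    codes = ((code_size / 8) * m + overhead_per_vector) * num_vectors
    centroids = 4 * nlist * dimension            # IVF representative vectors
    pq_tables = (2 ** code_size) * 4 * dimension  # PQ partition tables
    return 1.1 * (codes + centroids + pq_tables) * (1 + replicas)


estimate = ivfpq_memory_bytes(
    num_vectors=1_000_000_000, dimension=128, m=8, code_size=8, nlist=4096
)
print(f"{estimate / 1e9:.1f} GB")  # 70.4 GB
```

The per-vector term dominates: the centroid and table terms together add only a few megabytes.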

Running k-NN with OpenSearch

First make sure you have an OpenSearch cluster up and running. For instructions, refer to Cluster formation. For a more managed solution, you can use Amazon OpenSearch Service.

Before getting into the experiments, let’s go over how to run k-NN workloads in OpenSearch. First, we need to create an index. An index stores a set of documents in a way that they can be easily searched. For k-NN, the index’s mapping tells OpenSearch what algorithms to use and what parameters to use with them. We start by creating an index that uses HNSW as its search algorithm:

PUT my-hnsw-index
{
  "settings": {
    "index": {
      "knn": true,
      "number_of_shards": 10,
      "number_of_replicas": 1
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 4,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
}

In the settings, we need to enable knn so that the index can be searched with the knn query type (more on this later). We also set the number of shards, and the number of replicas each shard will have. An index is made up of a collection of shards. Sharding is how OpenSearch distributes an index across multiple nodes in a cluster. For more information about shards, refer to Sizing Amazon OpenSearch Service domains.

In the mappings, we configure a field called my_vector of type knn_vector to store the vector data. We also pass nmslib as the engine to let OpenSearch know it should use nmslib’s implementation of HNSW. Additionally, we pass l2 as the space_type. The space_type determines the function used to compute the distance between two vectors. l2 refers to the Euclidean distance. OpenSearch also supports cosine similarity and the inner product distance functions.

After the index is created, we can ingest some fake data:

POST _bulk
{ "index": { "_index": "my-hnsw-index", "_id": "1" } }
{ "my_vector": [1.5, 2.5], "price": 12.2 }
{ "index": { "_index": "my-hnsw-index", "_id": "2" } }
{ "my_vector": [2.5, 3.5], "price": 7.1 }
{ "index": { "_index": "my-hnsw-index", "_id": "3" } }
{ "my_vector": [3.5, 4.5], "price": 12.9 }
{ "index": { "_index": "my-hnsw-index", "_id": "4" } }
{ "my_vector": [5.5, 6.5], "price": 1.2 }
{ "index": { "_index": "my-hnsw-index", "_id": "5" } }
{ "my_vector": [4.5, 5.5], "price": 3.7 }
{ "index": { "_index": "my-hnsw-index", "_id": "6" } }
{ "my_vector": [1.5, 5.5, 4.5, 6.4], "price": 10.3 }
{ "index": { "_index": "my-hnsw-index", "_id": "7" } }
{ "my_vector": [2.5, 3.5, 5.6, 6.7], "price": 5.5 }
{ "index": { "_index": "my-hnsw-index", "_id": "8" } }
{ "my_vector": [4.5, 5.5, 6.7, 3.7], "price": 4.4 }
{ "index": { "_index": "my-hnsw-index", "_id": "9" } }
{ "my_vector": [1.5, 5.5, 4.5, 6.4], "price": 8.9 }

After adding some documents to the index, we can search it:

GET my-hnsw-index/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [2, 3, 5, 6],
        "k": 2
      }
    }
  }
}

Creating an index that uses IVF or PQ is a little bit different because these algorithms require training. Before creating the index, we need to create a model using the training API:

POST /_plugins/_knn/models/my_ivfpq_model/_train
{
  "training_index": "train-index",
  "training_field": "train-field",
  "dimension": 128,
  "description": "My model description",
  "method": {
      "name":"ivf",
      "engine":"faiss",
      "parameters":{
        "encoder":{
            "name":"pq",
            "parameters":{
                "code_size": 8,
                "m": 8
            }
        }
      }
  }
}

The training_index and training_field specify where the training data is stored. The only requirement for the training data index is that it has a knn_vector field that has the same dimension as you want your model to have. The method defines the algorithm that should be used for search.

After the training request is submitted, it will run in the background. To check if the training is complete, you can use the GET model API:

GET /_plugins/_knn/models/my_ivfpq_model?filter_path=model_id,state
{
  "model_id" : "my_ivfpq_model",
  "state" : "created"
}

After the model is created, you can create an index that uses this model:

PUT /my-hnsw-index
{
  "settings" : {
    "index.knn": true,
    "number_of_shards" : 10,
    "number_of_replicas" : 1
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "model_id": "my_ivfpq_model"
      }
    }
  }
}

After the index is created, we can add documents to it and search it just like we did for HNSW.

Experiments

Let’s run a few experiments to see how these algorithms perform in practice and what tradeoffs are made. We look at an HNSW versus an IVF index using PQ. For these experiments, we’re interested in search accuracy, query latency, and memory consumption. Because these trade-offs are mainly observed at scale, we use the BIGANN dataset containing 1 billion vectors of 128 dimensions. The dataset also contains 10,000 queries of test data mapping a query to the ground truth closest 100 vectors based on the Euclidean distance.

Specifically, we compute the following search metrics:

  • Latency p99 (ms), Latency p90 (ms), Latency p50 (ms) – Query latency at various quantiles in milliseconds
  • recall@10 – The fraction of the top 10 ground truth neighbors found in the 10 results returned by the plugin
  • Native memory consumption (GB) – The amount of memory used by the plugin during querying
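The recall metric in the list above can be computed as in the following Python sketch. The function name and argument names are illustrative; the benchmark code itself may compute it differently.

```python
def recall_at_k(returned_ids, ground_truth_ids, k=10):
    """Fraction of the top-k ground truth neighbors present in the top-k results."""
    truth = set(ground_truth_ids[:k])
    hits = sum(1 for doc_id in returned_ids[:k] if doc_id in truth)
    return hits / k


# Toy example with k = 4: two of the four true neighbors were returned.
score = recall_at_k([1, 2, 3, 4], [1, 3, 5, 7], k=4)
```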

One thing to note is that the BIGANN dataset uses an unsigned integer as the data type. Because the knn_vector field doesn’t support unsigned integers, the data is automatically converted to floats.

To run the experiments, we complete the following steps:

  1. Ingest the dataset into the cluster using the OpenSearch Benchmarks framework (the code can be found on GitHub).
  2. When ingestion is complete, use the warmup API to prepare the cluster for the search workload.
  3. Run the 10,000 test queries against the cluster 10 times and collect the aggregated results.

The queries return the document ID only, and not the vector, to improve performance (code for this can be found on GitHub).

Parameter selection

One tricky aspect of running experiments is selecting the parameters. There are too many different combinations of parameters to test them all. That being said, we decided on three optimization strategies for HNSW and IVFPQ:

  • Optimize for search latency and memory
  • Optimize for recall
  • Fall somewhere in the middle

For each optimization strategy, we chose two configurations.

For HNSW, we can tune the m, ef_construction, and ef_search parameters to achieve our desired trade-off:

  • m – Controls the maximum number of edges a node in a graph can have. Because each node has to store all of its edges, increasing this value will increase the memory footprint, but also increase the connectivity of the graph, which will improve recall.
  • ef_construction – Controls the size of the candidate queue for edges when adding a node to the graph. Increasing this value will increase the number of candidates to consider, which will increase the index latency. However, because more candidates will be considered, the quality of the graph will be better, leading to better recall during search.
  • ef_search – Similar to ef_construction, it controls the size of the candidate queue for graph traversal during search. Increasing this value will increase the search latency, but will also improve the recall.

In general, we chose configurations that gradually increased the parameters, as detailed in the following table.

Config ID | Optimization Strategy | m | ef_construction | ef_search
hnsw1 | Optimize for memory and search latency | 8 | 32 | 32
hnsw2 | Optimize for memory and search latency | 16 | 32 | 32
hnsw3 | Balance between latency, memory, and recall | 16 | 128 | 128
hnsw4 | Balance between latency, memory, and recall | 32 | 256 | 256
hnsw5 | Optimize for recall | 32 | 512 | 512
hnsw6 | Optimize for recall | 64 | 512 | 512

For IVF, we can tune two parameters:

  • nlist – Controls the granularity of the partitioning. The recommended value for this parameter is a function of the number of vectors in the index. One thing to keep in mind is that there are Faiss indexes that map to Lucene segments. There are several Lucene segments per shard and several shards per OpenSearch index. For our estimates, we assumed that there would be 100 segments per shard and 24 shards, so about 420,000 vectors per Faiss index. With this value, we estimated a good value to be 4096 and kept this constant for the experiments.
  • nprobes – Controls the number of nlist buckets we search. Higher values generally lead to improved recalls at the expense of increased search latencies.

For PQ, we can tune two parameters:

  • m – Controls the number of partitions to break the vector into. The larger this value is, the better the encoding will approximate the original, at the expense of raising memory consumption.
  • code_size – Controls the number of bits to encode a sub-vector with. The larger this value is, the better the encoding approximates the original, at the expense of raising memory consumption. The max value is 8, so we kept it constant at 8 for all experiments.

The following table summarizes our strategies.

Config ID | Optimization Strategy | nprobes | m (num_sub_vectors)
ivfpq1 | Optimize for memory and search latency | 8 | 8
ivfpq2 | Optimize for memory and search latency | 16 | 8
ivfpq3 | Balance between latency, memory, and recall | 32 | 16
ivfpq4 | Balance between latency, memory, and recall | 64 | 32
ivfpq5 | Optimize for recall | 128 | 16
ivfpq6 | Optimize for recall | 128 | 32

Additionally, we need to figure out how much training data to use for IVFPQ. In general, Faiss recommends between 30 * k and 256 * k training vectors for components involving k-Means training, where k is the number of centroids. For our configurations, the maximum k for k-Means is 4096 from the nlist parameter. With this formula, the recommended training set size is between 122,880 and 1,048,576 vectors, so we settled on 1 million vectors. The training data comes from the index vector dataset.
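The range quoted above follows from multiplying the number of centroids by the lower and upper per-centroid bounds of 30 and 256. A short Python check, with a hypothetical helper name:

```python
def training_set_range(k, low_per_centroid=30, high_per_centroid=256):
    """Recommended number of training vectors for k-Means with k centroids."""
    return low_per_centroid * k, high_per_centroid * k


low, high = training_set_range(4096)  # k comes from nlist = 4096
print(low, high)  # 122880 1048576
```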

Lastly, for the index configurations, we need to select the shard count. It is recommended to keep the shard size between 10 and 50 GB for OpenSearch. Experimentally, we determined that a good number would be 64 shards for HNSW and 42 shards for IVFPQ. Both index configurations were configured with one replica.

Cluster configuration

To run these experiments, we created the clusters on Amazon OpenSearch Service running OpenSearch version 1.3. We decided to use the r5 instance family, which provides a good trade-off between memory size and cost.

The number of nodes will depend on the amount of memory that can be used for the algorithm per node and the total amount of memory required by the algorithm. Having more nodes and more memory will generally improve performance, but for these experiments, we want to minimize cost. The amount of memory available per node is computed as memory_available = (node_memory - jvm_size) * circuit_breaker_limit, with the following parameters:

  • node_memory – The total memory of the instance.
  • jvm_size – The OpenSearch JVM heap size. Set to 32 GB.
  • circuit_breaker_limit – The native memory usage threshold for the circuit breaker. Set to 0.5.

Because HNSW and IVFPQ have different memory requirements, we estimate how much memory is needed for each algorithm and determine the required number of nodes accordingly.

For HNSW, with m = 64, the total memory required using the formula from the previous sections is approximately 2,252 GB. Therefore, with r5.12xlarge (384 GB of memory), memory_available is 176 GB and the total number of nodes required is about 13, which we round up to 16 for stability purposes.

Because the IVFPQ algorithm requires less memory, we can use a smaller instance type, the r5.4xlarge instance, which has 128 GB of memory. Therefore, the memory_available for the algorithm is 48 GB. The estimated algorithm memory consumption where m = 64 is a total of 193 GB and the total number of nodes required is four, which we round up to six for stability purposes.
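The sizing arithmetic for both clusters can be reproduced with the following Python sketch. The helper names are illustrative; the JVM heap size, the circuit breaker limit, and the memory estimates are the values stated above. The helper returns the raw ratio so that the rounding decision stays visible.

```python
def memory_available_gb(node_memory_gb, jvm_size_gb=32, circuit_breaker_limit=0.5):
    """memory_available = (node_memory - jvm_size) * circuit_breaker_limit."""
    return (node_memory_gb - jvm_size_gb) * circuit_breaker_limit


def nodes_needed(required_gb, node_memory_gb):
    """Ratio of required algorithm memory to usable memory per node (unrounded)."""
    return required_gb / memory_available_gb(node_memory_gb)


hnsw_nodes = nodes_needed(2252, node_memory_gb=384)   # r5.12xlarge
ivfpq_nodes = nodes_needed(193, node_memory_gb=128)   # r5.4xlarge
print(f"{hnsw_nodes:.1f} {ivfpq_nodes:.1f}")  # 12.8 4.0
```

These ratios are lower bounds; the experiments used 16 and 6 nodes respectively to leave headroom.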

For both clusters, we use c5.2xlarge instance types as dedicated leader nodes. This will provide more stability for the cluster.

According to the AWS Pricing Calculator, for this particular use case, the HNSW cluster costs around $75 an hour, and the IVFPQ cluster costs around $11 an hour. This is important to remember when comparing the results.

Also, keep in mind that these benchmarks can be run on your own infrastructure, using Amazon Elastic Compute Cloud (Amazon EC2), as long as the instance types and their memory sizes are equivalent.

Results

The following tables summarize the results from the experiments.

Test ID | p50 Query latency (ms) | p90 Query latency (ms) | p99 Query latency (ms) | recall@10 | Native memory consumption (GB)
hnsw1 | 9.1 | 11 | 16.9 | 0.84 | 1182
hnsw2 | 11 | 12.1 | 17.8 | 0.93 | 1305
hnsw3 | 23.1 | 27.1 | 32.2 | 0.99 | 1306
hnsw4 | 54.1 | 68.3 | 80.2 | 0.99 | 1555
hnsw5 | 83.4 | 100.6 | 114.7 | 0.99 | 1555
hnsw6 | 103.7 | 131.8 | 151.7 | 0.99 | 2055

Test ID | p50 Query latency (ms) | p90 Query latency (ms) | p99 Query latency (ms) | recall@10 | Native memory consumption (GB)
ivfpq1 | 74.9 | 100.5 | 106.4 | 0.17 | 68
ivfpq2 | 78.5 | 104.6 | 110.2 | 0.18 | 68
ivfpq3 | 87.8 | 107 | 122 | 0.39 | 83
ivfpq4 | 117.2 | 131.1 | 151.8 | 0.61 | 114
ivfpq5 | 128.3 | 174.1 | 195.7 | 0.40 | 83
ivfpq6 | 163 | 196.5 | 228.9 | 0.61 | 114

As you might expect, given how many more resources it uses, the HNSW cluster has lower query latencies and better recall. However, the IVFPQ indexes use significantly less memory.

For HNSW, increasing the parameters does in fact lead to better recall at the expense of latency. For IVFPQ, increasing m has the most significant impact on improving recall. Increasing nprobes improves the recall marginally, but at the expense of significant increases in latencies.

Conclusion

In this post, we covered different algorithms and techniques used to perform approximate k-NN search at scale (over 1 billion data points) within OpenSearch. As we saw in the benchmark results, there isn’t one algorithm or approach that optimizes for all the metrics at once. HNSW, IVF, and PQ each allow you to optimize for different metrics in your k-NN workload. When choosing the k-NN algorithm to use, first understand the requirements of your use case (How accurate does my approximate nearest neighbor search need to be? How fast should it be? What’s my budget?) and then tailor the algorithm configuration to meet them.

You can take a look at the benchmarking code base we used on GitHub. You can also get started with approximate k-NN search today following the instructions in Approximate k-NN search. If you’re looking for a managed solution for your OpenSearch cluster, check out Amazon OpenSearch Service.


About the Authors

Jack Mazanec is a software engineer working on OpenSearch plugins. His primary interests include machine learning and search engines. Outside of work, he enjoys skiing and watching sports.

Othmane Hamzaoui is a Data Scientist working at AWS. He is passionate about solving customer challenges using Machine Learning, with a focus on bridging the gap between research and business to achieve impactful outcomes. In his spare time, he enjoys running and discovering new coffee shops in the beautiful city of Paris.

Scaling Git’s garbage collection

Post Syndicated from Taylor Blau original https://github.blog/2022-09-13-scaling-gits-garbage-collection/

At GitHub, we store a lot of Git data: more than 18.6 petabytes of it, to be precise. That’s more than six times the size of the Library of Congress’s digital collections [1]. Most of that data comes from the contents of your repositories: your READMEs, source files, tests, licenses, and so on.

But some of that data is just junk: some bit of your repository that is no longer important. It could be a file that you force-pushed over, or the contents of a branch you deleted without merging. In general, this slice of repository data is anything that isn’t contained in at least one of your repository’s branches or tags. Normally, we don’t remove any unreachable data from repositories. But occasionally we do, usually to remove sensitive data, like passwords or SSH keys from your repository’s history.

The process for permanently removing unreachable objects from a repository’s history has a history of causing problems within GitHub, especially in busy repositories or ones with lots of objects. In this post, we’ll talk about what those problems were, why we had them, the tools we built to address them, and some interesting ways we’ve built on top of them. All of this work was contributed upstream to the open-source Git project. Let’s dive in.

Object reachability

In this post, we’re going to talk a lot about “reachable” and “unreachable” objects. You may have heard these terms before, but perhaps only casually. Since we’re going to use them a lot, it will help to have more concrete definitions of the two. An object is reachable when there is at least one branch or tag along which you can reach the object in question. An object is “reached” by crawling through history—from commits to their parents, commits to their root trees, and trees to their sub-trees and blobs. An object is unreachable when no such branch or tag exists.

Sample object graph showing commits, with arrows connecting them to their parents. A few commits have boxes that are connected to them, which represent the tips of branches and tags.

Here, we’re looking at a sample object graph. For simplicity, I’m only showing commits (identified here as circles). Arrows point from commits to their parent(s). A few commits have boxes that are connected to them, which represent the tips of branches and tags.

The parts of the graph that are colored blue are reachable, and the red parts are considered unreachable. You’ll find that if you start at any branch or tag, and follow its arrows, that all commits along that path are considered reachable. Note that unreachable commits which have reachable ones as parents (in our diagram above, anytime an arrow points from a red commit to a blue one) are still considered unreachable, since they are not contained within any branch or tag.

Unreachable objects can also appear in clusters that are totally disconnected from the main object graph, as indicated by the two lone red commits towards the right-hand side of the image.
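As a rough illustration of these definitions, here is a minimal Python sketch that computes the reachable set over a toy object graph. The graph, the object names, and the reachable_objects helper are all hypothetical. Real Git walks commits, trees, and blobs stored in its object database rather than a dictionary.

```python
from collections import deque

# Toy object graph (made-up names): each object ID maps to the IDs it points at,
# e.g. commit -> parents and root tree, tree -> sub-trees and blobs.
GRAPH = {
    "commit2": ["commit1", "tree2"],
    "commit1": ["tree1"],
    "tree2": ["blob_a", "blob_b"],
    "tree1": ["blob_a"],
    "blob_a": [],
    "blob_b": [],
    "commit_x": ["commit1", "tree_x"],  # no branch or tag contains this commit
    "tree_x": ["blob_x"],
    "blob_x": [],
}
REFS = {"refs/heads/main": "commit2"}


def reachable_objects(graph, refs):
    """Return every object that can be reached from at least one ref."""
    seen = set()
    queue = deque(refs.values())
    while queue:
        oid = queue.popleft()
        if oid in seen:
            continue
        seen.add(oid)
        queue.extend(graph[oid])
    return seen


reachable = reachable_objects(GRAPH, REFS)
unreachable = set(GRAPH) - reachable
print(sorted(unreachable))  # ['blob_x', 'commit_x', 'tree_x']
```

Note that commit_x stays unreachable even though its parent, commit1, is reachable: the walk only ever follows arrows away from branches and tags.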

Pruning unreachable objects

Normally, unreachable objects stick around in your repository until they are either automatically or manually cleaned up. If you’ve ever seen the message, “Auto packing the repository for optimum performance,” in your terminal, Git is doing this for you in the background. You can also trigger garbage collection manually by running:

$ git gc --prune=<date>

That tells Git to trigger a garbage collection and remove unreachable objects. But observant readers might notice the optional <date> parameter to the --prune flag. What is that? The short answer is that Git allows you to restrict which objects get permanently deleted based on the last time they were written. But to fully explain, we first need to talk a little bit about a race condition that can occur when removing objects from a Git repository.

Object deletion raciness

Normally, deleting an unreachable object from a Git repository should not be a notable event. Since the object is unreachable, it’s not part of any branch or tag, and so deleting it doesn’t change the repository’s reachable state. In other words, removing an unreachable object from a repository should be as simple as:

  1. Repacking the repository to remove any copies of the object in question (and recomputing any deltas that are based on that object).
  2. Removing any loose copies of the object that happen to exist.
  3. Updating any additional indexes (like the multi-pack index, or commit-graph) that depend on the (now stale) packs that were removed.

The racy behavior occurs when a repository receives one or more pushes during this process. The main culprit is that the server advertises its objects at a different point in time from processing the objects that the client sent based on that advertisement.

Consider what happens if Git decides (as part of running a git gc operation) that it wants to delete some unreachable object C. If C becomes reachable by some background reference update (e.g., an incoming push that creates a new branch pointing at C), it will then be advertised to any incoming pushes. If one of these pushes happens before C is actually removed, then the repository can end up in a corrupt state. Since the pusher will assume C is reachable (since it was part of the object advertisement), it is allowed to include objects that either reference or depend on C, without sending C itself. If C is then deleted while other reachable parts of the repository depend on it, then the repository will be left in a corrupt state.

Suppose the server receives that push before proceeding to delete C. Then, any objects from the incoming push that are related to C would be immediately corrupt. Reachable parts of the repository that reference C are no longer closed[2] over reachability since C is missing. And any objects that are stored as a delta against C can no longer be inflated for the same reason.

Figure demonstrating that one side (responsible for garbage collecting the repository) decides that a certain object is unreachable, while another side makes that object reachable and accepts an incoming push based on that object—before the original side ultimately deletes that (now-reachable) object—leaving the repository in a corrupt state.

In case that was confusing, the above figure should help clear things up. The general idea is that one side (responsible for garbage collecting the repository) decides that a certain object is unreachable, while another side makes that object reachable and accepts an incoming push based on that object—before the original side ultimately deletes that (now-reachable) object—leaving the repository in a corrupt state.
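The same sequence of events can be replayed in a few lines of Python. This is only a toy model under stated assumptions: the object names, the edges mapping, and the is_closed helper are hypothetical, and the four numbered steps stand in for two processes that would really be running concurrently.

```python
def is_closed(objects, refs, edges):
    """Return True if everything reachable from a ref is still present."""
    seen, stack = set(), list(refs.values())
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        if oid not in objects:
            return False  # a reachable object is missing: corruption
        seen.add(oid)
        stack.extend(edges[oid])
    return True


# Hypothetical repository: A is reachable, C is unreachable but based on A.
edges = {"A": [], "C": ["A"], "D": ["C"]}
objects = {"A", "C"}
refs = {"main": "A"}

# 1. The garbage collector decides that C is unreachable and plans to delete it.
planned_deletions = {"C"}

# 2. Meanwhile, a reference update makes C reachable, so C gets advertised.
refs["topic"] = "C"

# 3. A pusher relies on that advertisement and sends only D, which builds on C.
objects.add("D")
refs["feature"] = "D"
closed_before = is_closed(objects, refs, edges)

# 4. The garbage collector carries out its original plan and deletes C.
objects -= planned_deletions
closed_after = is_closed(objects, refs, edges)

print(closed_before, closed_after)  # True False
```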

Mitigating object deletion raciness

Git does not completely prevent this race from happening. Instead, it works around the race by gradually expiring unreachable objects based on the last time they were written. This explains the mysterious --prune=<date> option from a few sections ago: when garbage collecting a repository, only unreachable objects which haven’t been written since <date> are removed. Anything else (that is, the set of objects that have been written at least once since <date>) is left around.

The idea is that objects which have been written recently are more likely to become reachable again in the future, and would thus be more likely to be susceptible to the kind of race we talked about above if they were to be pruned. Objects which haven’t been written recently, on the other hand, are proportionally less likely to become reachable again, and so they are safe (or, at least, safer) to remove.

This idea isn’t foolproof, and it is certainly possible to run into the race we talked about earlier. We’ll discuss one such scenario towards the end of this post (along with the way we worked around it). But in practice, this strategy is simple and effective, preventing most instances of potential repository corruption.

Storing loose unreachable objects

But one question remains: how does Git keep track of the age of unreachable objects which haven’t yet aged out of the repository?

The answer, though simple, is at the heart of the problem we’re trying to solve here. Unreachable objects which have been written too recently to be removed from the repository are stored as loose objects, the individual object files stored in .git/objects. Storing these unreachable objects individually means that we can rely on their stat() modification time (hereafter, mtime) to tell us how recently they were written.

But this leads to an unfortunate problem: if a repository has many unreachable objects, and a large number of them were written recently, they must all be stored individually as loose objects. This is undesirable for a number of reasons:

  • Pairs of unreachable objects that share a vast majority of their contents must be stored separately, and can’t benefit from the kind of deduplication offered by packfiles. This can cause your repository to take up much more space than it otherwise would.
  • Having too many files (especially too many in a single directory) can lead to performance problems, including exhausting your system’s available inodes in the extreme case, leaving you unable to create new files, even if there may be space available for them.
  • Any Git operation which has to scan through all loose objects (for example, git repack -d, which creates a new pack containing just your repository’s unpacked objects) will slow down as there are more files to process.

It’s tempting to want to store all of a repository’s unreachable objects into a single pack. But there’s a problem there, too. Since all of the objects in a single pack share the same mtime (the mtime of the *.pack file itself), rewriting any single unreachable object has the effect of updating the mtimes of all of a repository’s unreachable objects. This is because Git optimizes out object writes for packed objects by simply updating the mtime of any pack(s) which contain that object. This makes it nearly impossible to expire any objects out of the repository permanently.

Cruft packs

To solve this problem, we turned to a long-discussed idea on the Git mailing list: cruft packs. The idea is simple: store an auxiliary list of mtime data alongside a pack containing just unreachable objects. To garbage collect a repository, Git places the unreachable objects in a pack. That pack is designated as a “cruft pack” because Git also writes the mtime data corresponding to each object in a separate file alongside that pack. This makes it possible to update the mtime of a single unreachable object without changing the mtimes of any other unreachable object.

To give you a sense of what this looks like in practice, here’s a small example:

a pack of Git objects (represented by rectangles of different colors)

The above figure shows a pack of Git objects (represented by rectangles of different colors), its pack index, and the new .mtimes file. Together, these three files make up what Git calls a “cruft pack,” and it’s what allows Git to store unreachable objects together, without needing a single file for each object.

So, how do they work? Git stores the mtimes of a cruft pack’s objects together as an array in the *.mtimes file. To discover the mtime of an individual object in a pack, Git first does a binary search on the pack’s index to find that object’s position in lexicographic order. Git then uses that position to read a 4-byte, unsigned integer from the .mtimes file. In other words, the *.mtimes file holds a table of integers (one for each object in the associated *.pack file), each encoding an individual object’s mtime as a number of seconds since the Unix epoch.

Crucially, this makes it possible to store all of a repository’s unreachable objects together in a single pack, without having to store them as individual loose objects, bypassing all of the drawbacks we discussed in the last section. Moreover, it allows Git to update the mtime of a single unreachable object, without inadvertently triggering the same update across all unreachable objects.
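To make the lookup concrete, here is a minimal Python sketch of the idea: a sorted list of object IDs stands in for the pack index, and a byte string of 4-byte unsigned integers stands in for the mtime table. The helper names and the short object IDs are hypothetical, big-endian byte order is an assumption, and the sketch treats the table as a bare array, ignoring any header or trailer that the on-disk file may carry.

```python
import bisect
import struct


def build_mtimes_table(mtimes_by_oid):
    """Pack one 4-byte unsigned integer per object, in sorted object-ID order."""
    sorted_oids = sorted(mtimes_by_oid)
    # ">I" = big-endian unsigned 32-bit integer (byte order is an assumption).
    table = b"".join(struct.pack(">I", mtimes_by_oid[oid]) for oid in sorted_oids)
    return sorted_oids, table


def lookup_mtime(sorted_oids, table, oid):
    """Binary search for the object's position, then read its table entry."""
    position = bisect.bisect_left(sorted_oids, oid)
    if position == len(sorted_oids) or sorted_oids[position] != oid:
        raise KeyError(oid)
    (mtime,) = struct.unpack_from(">I", table, position * 4)
    return mtime


# Made-up object IDs and timestamps (seconds since the Unix epoch).
sorted_oids, table = build_mtimes_table(
    {"c8d1": 1660000000, "1a2b": 1650000000, "9f3e": 1655000000}
)
print(len(table))                                # 12
print(lookup_mtime(sorted_oids, table, "9f3e"))  # 1655000000
```

Because every entry has the same width, the position found in the index maps directly to a byte offset in the table, with no per-object file needed.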

Since Git doesn’t portably support updating a file in place, updating an object’s mtime (a process which Git calls “freshening”) takes place by writing a separate copy of that object out as a loose file. Of course, if we had to freshen all objects in a cruft pack, we would end up in a situation no better than before. But such updates tend to be unlikely in practice, and so writing individual copies of a small handful of unreachable objects ends up being a reasonable trade off most of the time.

Generating cruft packs

Now that we have introduced the concept of cruft packs, the question remains: how does Git generate them?

Despite being called git gc (short for “garbage collection”), running git gc does not always result in deleting unreachable objects. If you run git gc --prune=never, then Git will repack all reachable objects and move all unreachable objects to the cruft pack. If, however, you run git gc --prune=1.day.ago, then Git will repack all reachable objects, delete any unreachable objects that are older than one day, and repack the remaining unreachable objects into the cruft pack.

This is because of Git’s treatment of unreachable parts of the repository. While Git only relies on having a reachability closure over reachable objects, Git’s garbage collection routine tries to leave unreachable parts of the repository intact to the extent possible. That means if Git encounters some unreachable cluster of objects in your repository, it will either expire all or none of those objects, but never some subset of them.

We’ll discuss how cruft packs are generated with and without object expiration in the two sections below.

Cruft packs without object expiration

When generating a cruft pack with an object expiration of --prune=never, our only goal is to collect all unreachable objects together into a single cruft pack. Broadly speaking, this occurs in three steps:

  1. Starting at all of the branches and tags, generate a pack containing only reachable objects.
  2. Looking at all other existing packs, enumerate the list of objects which don’t appear in the new pack of reachable objects. Create a new pack containing just these objects, which are unreachable.
  3. Delete the existing packs.

If any of that was confusing, don’t worry: we’ll break it down here step by step. The first step to collecting a repository’s unreachable objects is to figure out the parts of it that are reachable. If you’ve ever run git repack -A, this is exactly how that command works. Git starts a reachability traversal beginning at each of the branches and tags in your repository. Then it traverses back through history by walking from commits to their parents, trees to their sub-trees, and so on, marking every object that it sees along the way as reachable.

Demonstration of how Git walks through a commit graph, from commit to parent

Here, we’re showing the same commit graph from earlier in the post. Git’s goal at this point is simply to mark every reachable object that it sees, and it’s those objects that will become the contents of a new pack containing just reachable objects. Git starts by examining each reference, and walking from a commit to its parents until it either finds a commit with no parents (indicating the beginning of history), or a commit that it has already marked as reachable.

In the above, the commit being walked is highlighted in dark blue, and any commits marked as reachable are marked in green. At each step, the commit currently being visited gets marked as reachable, and its parent(s) are visited in the next step. By repeating this process among all branches and tags, Git will mark all reachable objects in the repository.

We can then use this set of objects to produce a new pack containing all reachable objects in a repository. Next, Git needs to discover the set of objects that it didn’t mark in the previous stage. A reasonable first approach might be to store the IDs of all of a repository’s objects in a set, and then remove them one by one as we mark objects reachable along our walk.

But this approach tends to be impractical, since each object will require a minimum of 20 bytes of memory in order to insert into this set. At the time of writing, the linux.git repository contains nearly nine million objects, which would require nearly 180 MB of memory just to write out all of their object IDs.

Instead, Git looks through all of the objects in all of the existing packs, checking whether or not each is contained in the new pack of reachable objects. Any object found in an existing pack which doesn’t appear in the reachable pack is automatically included in the cruft pack.

Animation demonstrating how Git looks through all of the objects in all of the existing packs, checking whether or not each is contained in the new pack of reachable objects.

Here, we’re going one by one among all of the pre-existing packs (here, labeled as pack-abc.pack, pack-def.pack, and pack-123.pack) and inspecting their objects one at a time. We first start with object c8, looking through the reachable pack (denoted as pack-xyz.pack) to see if any of its objects match c8. Since none do, c8 is marked unreachable (which we represent by filling the object with a red background).

This process is repeated for each object in each existing pack. Once this process is complete, all objects that existed in the repository before starting a garbage collection are marked either green, or red (indicating that they are either reachable, or unreachable, respectively).

Git can then use the set of unreachable objects to generate a new pack, like below:

A set of labeled Git packs

This pack (on the far right of the above image, denoted pack-cruft.pack) contains exactly the set of unreachable objects present in the repository at the beginning of garbage collection. By keeping track of each unreachable object’s mtime while marking existing objects, Git has enough data to write out a *.mtimes file in addition to the new pack, leaving us with a cruft pack containing just the repository’s unreachable objects.

Here, we’re eliding some technical details about keeping track of each object’s mtime along the way, for brevity and simplicity. The routine is straightforward, though: each time we discover an object, we mark its mtime based on how we discovered the object.

  • If an object is found in a packfile, it inherits its mtime from the packfile itself.
  • If an object is found as a loose object, its mtime comes from the loose object file.
  • And if an object is found in an existing cruft pack, its mtime comes from reading the cruft pack’s *.mtimes file at the appropriate index.

If an object is seen more than once (e.g., an unreachable object stored in a cruft pack was freshened, resulting in another loose copy of the object), the mtime which is ultimately recorded in the new cruft pack is the most recent mtime of all of the above.
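The marking pass and the mtime rules above can be sketched in Python as follows. This is a simplified model, not Git’s implementation: collect_cruft and the sample data are hypothetical, and a Python set stands in for the lookup against the new pack of reachable objects.

```python
def collect_cruft(reachable_ids, sightings):
    """Map each unreachable object to the most recent mtime it was seen with.

    `sightings` yields (object_id, mtime) pairs, one per copy of an object
    found in an existing pack, as a loose file, or in an older cruft pack.
    """
    cruft_mtimes = {}
    for oid, mtime in sightings:
        if oid in reachable_ids:
            continue  # already written to the pack of reachable objects
        previous = cruft_mtimes.get(oid)
        if previous is None or mtime > previous:
            cruft_mtimes[oid] = mtime
    return cruft_mtimes


reachable_ids = {"c1", "c2"}
sightings = [
    ("c1", 100),  # found in a regular pack: inherits the pack's mtime
    ("c8", 100),  # same pack, but not reachable
    ("c9", 50),   # found in an older cruft pack: read from its .mtimes table
    ("c8", 250),  # loose copy written when c8 was freshened
]
print(collect_cruft(reachable_ids, sightings))  # {'c8': 250, 'c9': 50}
```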

Cruft packs with object expiration

Generating cruft packs where some objects are going to expire out of the repository follows a similar, but slightly trickier approach than in the non-expiring case.

Doing a garbage collection with a fixed expiration is known as “pruning.” This essentially boils down to asking Git to pack the contents of a repository into two packfiles: one containing reachable objects, and another containing any unreachable objects. But, it also means that for some fixed expiration date, any unreachable objects which have an mtime older than the expiration date are removed from the repository entirely.

The difficulty in this case stems from a fact briefly mentioned earlier in this post, which is that Git attempts to prevent connected clusters of unreachable objects from leaving the repository if some, but not all, of their objects have aged out.

To make things clearer, here’s an example. Suppose that a repository has a handful of blob objects, all connected to some tree object, and all of these objects are unreachable. Assuming that they’re all old enough, then they will all expire together: no big deal. But what if the tree isn’t old enough to be expired? In this case, even though the blobs connected to it could be expired on their own, Git will keep them around since they’re connected to a tree with a sufficiently recent mtime. Git does this to preserve the repository’s reachability closure in case that tree were to become reachable again (in which case, having the tree and its blobs becomes important).

To ensure that Git preserves any unreachable objects which are reachable from recent objects, Git handles this case of cruft pack generation slightly differently. At a high level, it:

  1. Generates a candidate list of cruft objects, using the same process as outlined in the previous section.
  2. Then, to determine the actual list of cruft objects to keep around, it performs a reachability traversal using all of the candidate cruft objects, adding any object it sees along the way to the cruft pack.

To make things a little clearer, here’s an example:

Animation of Git performing a reachability traversal

After determining the set of unreachable objects (represented above as colored red) Git does a reachability traversal from each entry point into the graph of unreachable objects. Above, commits are represented by circles, trees by rectangles, and tree entries as rows within the larger rectangles. The mtimes are written below each commit.

For now, let’s assume our expiration date is d, so any object whose mtime is greater than d must stay (despite being unreachable), and anything older than d can be pruned. Git traverses through each entry and asks, “Is this object old enough to be pruned?” When the answer is “yes,” Git leaves the object alone and moves on to the next entry point. When the answer is “no,” however (i.e., Git is looking at an unreachable object whose mtime is too recent to prune), Git marks that object as “rescued” (indicated by turning it green) and then continues its traversal, marking any objects reachable from it as rescued.

Objects that are rescued during this pass are written to the cruft pack, preserving their existence in the repository, leaving them to either continue to age, or have their mtimes updated before the next garbage collection.

Let’s take a closer look at the example above. Git starts by looking at object C(1,1), and notices that its mtime is d+5, meaning that (since it happens after our expiration time, d) it is too new to expire. That causes Git to start a reachability traversal beginning at C(1,1), rescuing every object it encounters along the way. Since many objects are shared between multiple commits, rescuing an object from a more recent part of the graph often ends up marking older objects as rescued, too.

After finishing the rescuing pass focused on C(1,1), Git moves on to look at C(0,2). But this commit’s mtime is d-10, which is before our expiration cutoff of d, meaning that it is safe to remove. Git can skip looking at any objects reachable from this commit, since none of them will be rescued.

Finally, Git looks at another connected cluster of the unreachable object graph, beginning at C(3,1). Since this object has an mtime of d+10, it is too new to expire, so Git performs another reachability traversal, rescuing it and any objects reachable from it.

Notice that in the final graph state, the main cluster of commits (the one beginning with C(0,2)) is only partially rescued. In fact, only the objects necessary to retain a reachability closure over the rescued objects among that cluster are saved from being pruned. So even though, for example, commit C(2,1) has only part of its tree entries rescued, that is OK since C(2,1) itself will be pruned (hence any non-rescued tree entries connected to it are unimportant and will also be pruned).
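The rescuing pass described in this section can be sketched in Python like so. The graph below is a made-up miniature, not the one from the figure, and rescue_objects is a hypothetical helper. Treating an mtime exactly equal to d as old enough to prune is an assumption of the sketch.

```python
def rescue_objects(unreachable, edges, mtimes, cutoff):
    """Return the unreachable objects that must be kept in the cruft pack."""
    rescued = set()
    for oid in unreachable:
        if mtimes[oid] <= cutoff:
            continue  # old enough to prune, so it does not start a traversal
        # Too recent to prune: rescue it and everything reachable from it.
        stack = [oid]
        while stack:
            current = stack.pop()
            if current in rescued or current not in unreachable:
                continue
            rescued.add(current)
            stack.extend(edges[current])
    return rescued


D = 1000  # hypothetical expiration cutoff
edges = {
    "new_commit": ["shared_tree"],
    "old_commit": ["shared_tree", "old_tree"],
    "shared_tree": ["shared_blob"],
    "old_tree": ["old_blob"],
    "shared_blob": [],
    "old_blob": [],
}
mtimes = {oid: D - 10 for oid in edges}
mtimes["new_commit"] = D + 5  # the only object too recent to expire

unreachable = set(edges)
rescued = rescue_objects(unreachable, edges, mtimes, D)
print(sorted(rescued))                # ['new_commit', 'shared_blob', 'shared_tree']
print(sorted(unreachable - rescued))  # ['old_blob', 'old_commit', 'old_tree']
```

As in the walkthrough, the old tree and blob shared with the recent commit survive, while the old commit and the parts of history only it refers to are pruned.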

Putting it all together

Now that Git can generate a cruft pack and perform garbage collection on a repository with or without pruning objects, it was time to put all of the pieces together and submit the patches to the open-source Git project.

Other Git sub-commands, like repack and gc, needed to learn about cruft packs, and gain command-line flags and configuration knobs in order to opt-in to the new behavior. With all of the pieces in place, you can now trigger a garbage collection by running either:

$ git gc --prune=1.day.ago --cruft

or

$ git repack -d --cruft --cruft-expiration=1.day.ago

to repack your repository into a reachable pack, and a cruft pack containing unreachable objects whose mtimes are within the past day. More details on the new command-line options and configuration can be found here, here, here, and here.

GitHub submitted the entirety of the patches that comprise cruft packs to the open-source Git project, and the results were released in v2.37.0. That means that you can use the same tools we run at GitHub on your own laptop, to run garbage collection on your own repositories.

For those curious about the details, you can read the complete thread on the mailing list archive here.

Cruft packs at GitHub

After a lengthy process of testing to ensure that using cruft packs was safe to carry out across all repositories on GitHub, we deployed and enabled the feature across all repositories. We kept a close eye on repositories with large numbers of unreachable objects, because breaking any deltas between reachable and unreachable objects (the two are now stored in separate packs, and object deltas cannot cross pack boundaries) can cause the initial cruft pack generation to take a long time. A small handful of repositories with many unreachable objects needed more time to generate their very first cruft pack. In those instances, we generated their cruft packs outside of our normal repository maintenance jobs to avoid triggering any timeouts.

Now, every repository on GitHub and in GitHub Enterprise (in version 3.3 and newer) uses cruft packs to store its unreachable objects. This has made garbage collecting repositories (especially busy ones with many unreachable objects) tractable where it often required significant human intervention before. Before cruft packs, many repositories which required clean up were simply out of our reach because of the possibility of creating an explosion of loose objects which could derail performance for all repositories stored on a fileserver. Now, garbage collecting a repository is a simple task, no matter its size or scale.

During our testing, we ran garbage collection on a handful of repositories, and got some exciting results. For repositories that regularly force-push a single commit to their main branch (leaving a majority of their objects unreachable), their on-disk size dropped significantly. In the most extreme example we found during testing, a repository which used to take 186 gigabytes to store shrank to only 2 gigabytes.

On github/github, GitHub’s main codebase, we were able to shrink the repository from around 57 gigabytes to 27 gigabytes. Even though these savings are more modest, the real payoff is in the objects we no longer have to store. Before garbage collecting, each replica of this repository had nearly 60 million objects, including years of test-merges, force-pushes, and all kinds of sources of unreachable objects. Each of these objects contributed to the I/O cost of repacking this repository. After garbage collecting, only 11.8 million objects remained. Since each object in a repository requires around 150 bytes of memory during repacking, we save around 7 gigabytes of RAM during each maintenance routine.
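The memory figure follows directly from the numbers above. A quick back-of-the-envelope check in Python, rounding “nearly 60 million” up to 60 million and treating 150 bytes per object as exact (both are approximations):

```python
objects_before = 60_000_000  # "nearly 60 million", rounded up
objects_after = 11_800_000
bytes_per_object = 150       # approximate per-object memory cost during repacking

saved_bytes = (objects_before - objects_after) * bytes_per_object
saved_gigabytes = saved_bytes / 10**9
print(f"{saved_gigabytes:.2f} GB")  # 7.23 GB
```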

Limbo repositories

Even though we can easily garbage collect a repository of any size, we still have to navigate the inherent raciness that we described at the beginning of this post.

At GitHub, our approach has been to make this situation easy to recover from automatically instead of preventing it entirely (which would require significant surgery to much of Git’s code). To do this, we create a “limbo” repository whenever a pruning garbage collection is done. Any objects which get expired from the main repository are stored in a separate pack in the limbo repository. Then, the process to garbage collect a repository looks something like:

  1. Generate a cruft pack of recent unreachable objects in the main repository.
  2. Generate a second cruft pack of expired unreachable objects, stored outside of the main repository, in the “limbo” repository.
  3. After garbage collection has completed, run a git fsck in the main repository to detect any object corruption.
  4. If any objects are missing, recover them by copying them over from the limbo repository.

The process for generating a cruft pack of expired unreachable objects boils down to creating another cruft pack (using exactly the same process we described earlier in this post), with two caveats:

  • The expiration cutoff is set to “never” since we want to keep around any objects which we did expire in the previous step.
  • The original cruft pack is treated as a pack containing reachable objects since we want to ignore any unreachable objects which were too recent to expire (and, thus, are stored in the cruft pack in the main repository).
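The recovery half of the four-step process above (detect missing objects, then copy them back from limbo) can be modeled with a short Python sketch. Everything here is a hypothetical stand-in: dictionaries play the role of the main and limbo repositories, find_missing is a crude substitute for running git fsck, and the object names are made up.

```python
def find_missing(main, refs, edges):
    """Crude stand-in for `git fsck`: reachable objects absent from `main`."""
    missing, seen, stack = set(), set(), list(refs.values())
    while stack:
        oid = stack.pop()
        if oid in seen:
            continue
        seen.add(oid)
        if oid not in main:
            missing.add(oid)  # cannot look inside an object we do not have
            continue
        stack.extend(edges[oid])
    return missing


def recover_from_limbo(main, limbo, refs, edges):
    """Copy missing objects back from limbo until the repository is whole."""
    recovered = set()
    while True:
        missing = find_missing(main, refs, edges)
        if not missing:
            return recovered
        for oid in missing:
            main[oid] = limbo[oid]  # a KeyError here means the object is truly lost
        recovered |= missing


# A push made D (which builds on C) reachable, but C and T were expired to limbo.
edges = {"A": [], "C": ["A", "T"], "T": [], "D": ["C"]}
main = {"A": b"a", "D": b"d"}
limbo = {"C": b"c", "T": b"t"}
refs = {"main": "A", "feature": "D"}

print(sorted(recover_from_limbo(main, limbo, refs, edges)))  # ['C', 'T']
print(find_missing(main, refs, edges))                       # set()
```

The loop runs more than once because recovering C reveals that it, in turn, depends on T, which had also been expired.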

We have used this idea at GitHub with great success, and now treat garbage collection as a hands-off process from start to finish. The patches to implement this approach are available as a preliminary RFC on the Git mailing list here.

Thank you

This work would not have been possible without generous review and collaboration from engineers from within and outside of GitHub. The Git Systems team at GitHub were great to work with while we developed and deployed cruft packs. Special thanks to Torsten Walter and Michael Haggerty, who played substantial roles in developing limbo repositories.

Outside of GitHub, this work would not have been possible without careful review from the open-source Git community, especially Derrick Stolee, Jeff King, Jonathan Tan, Jonathan Nieder, and Junio C Hamano. In particular, Jeff King contributed significantly to the original development of many of the ideas discussed above.

Notes


  1. It’s true. According to the Library of Congress themselves, their digital collection amounts to more than 3 petabytes in size [source]. The 18.6 petabytes we store at GitHub actually overcounts by a factor of five, since we store a handful of copies of each repository. In reality, it’s hard to provide an exact number, since data is de-duplicated within a fork network, and is stored compressed on disk. Either way you slice it, it’s a lot of data: you get the point. 
  2. Meaning that for any reachable object part of some repository, any objects reachable from it are also contained in that repository. 

Contributing to open source at GitHub

Post Syndicated from Ariel Deitcher original https://github.blog/2022-09-06-contributing-to-open-source-at-github/

Ariel Deitcher (@mntlty) is a Senior Software Engineer at GitHub, working on Pull Requests and Merge Queue (beta). In this post, he shares the challenges he encountered finding his path to contributing to open source, what it was like contributing to open source at GitHub, and some of the lessons he learned.

Getting started with open source can be overwhelming

As a computer science graduate in 2011 and searching for my first tech job, I read that contributing to an open source project could help. It was a great way to build skills, make industry connections, and gain practical experience with a real-world problem. Perfect, I thought, I’ll just pick an open source project on this new website called GitHub, and, well, actually I wasn’t sure how to do that. Finding that “Goldilocks” project (where the size, language(s), domain, and community felt just right) was a lot harder than I thought, and I didn’t feel self-confident enough to make much progress. Overwhelmed, I decided the timing wasn’t right but resolved to try again someday.

It bugged me that the contribution graph on my GitHub profile remained stubbornly empty, as all the code I had committed lived in private repositories. That changed in 2016 when contributions to private repositories could be shown on my profile, but I still had no contributions to open source. Between my family, work, life, and the explosive growth in projects to choose from, making that first contribution to open source felt more daunting than ever.

The opportunity to contribute to open source at GitHub

Fast forward to 2021. I read Working in Public: The Making and Maintenance of Open Source Software by Nadia Eghbal while interviewing at GitHub. I was especially captivated by the Stadium model of open source projects, where a small number of maintainers and occasional contributors are vastly outnumbered by a project’s users. This aligned with my mental model of open source projects, where a few performers on a digital stage would conjure feats of coding wizardry. I could only imagine how vulnerable working in public could be, and hoped it would feel less intimidating working at GitHub.

I joined the GitHub team building Merge Queue (beta), a feature which helps users coordinate their merges to a protected branch, ensures that changes are up to date, and that all required checks pass before automatically merging a pull request. Early on, I shared my long-held goal of contributing code to an open source project with my manager, and discussed the GitHub CLI, an open source tool written in Go which lets users interact with GitHub from the command line, as a possible candidate.

While building Merge Queue, our team carefully integrated it with GitHub’s many APIs and tools, checking each one for compatibility and correctness. Testing different scenarios of merging a pull request with the GitHub CLI, I saw that once a Merge Queue was required, running the CLI command gh pr merge would fail in most cases. The Merge Queue was correctly preventing direct merges to its protected branch, and so I began scoping out what changes the CLI might need to support Merge Queue.

As I didn’t have write access to the CLI repository, I forked it, started a new Codespace, and spent some time getting familiar with the CLI’s contributing guidelines and code. Wanting to minimize my changes, I targeted a few places in the merge command to modify. When I was ready, I pushed a commit to my fork and opened a pull request to share with the CLI maintainers. I expected that I would provide support but defer to them for the final implementation.

In reviewing my pull request with the CLI maintainers, it quickly became clear that my changes were hard to reason about. The merge command had accumulated sufficient technical debt that adding more complexity to it was risky. The team asked if I could refactor the merge command in an initial pull request and follow up with a subsequent pull request for the Merge Queue changes after the first was merged. What I had thought would be a rough guide of changes for the CLI maintainers was, in fact, the opportunity I had been looking for to contribute to open source at GitHub. I confirmed that my manager was on board with this increased commitment, and was ready to get started.

Refactoring the merge command and adding Merge Queue support

I set out to refactor the merge command with a focus on simplicity, readability, and returning early over deeply nested conditionals. The existing test coverage gave me a confidence boost as I began stepping through the code, copying each section into a separate file for later reference and writing comments that I felt captured the intent of the removed section. I then grouped related Git and API operations, consolidated common code into appropriately named functions and variables, trimmed unreachable code paths, created a MergeContext struct to encapsulate state, and leaned into Go’s explicit error returns – all of which gave the code a more linear and consistent structure.

As an example, the mergeRun function, which is the heart of the merge command, went from over 220 lines to just 30:

func mergeRun(opts *MergeOptions) error {
    ctx, err := NewMergeContext(opts)
    if err != nil {
        return err
    }

    // no further action is possible when disabling auto merge
    if opts.AutoMergeDisable {
        return ctx.disableAutoMerge()
    }

    ctx.warnIfDiverged()

    if err := ctx.canMerge(); err != nil {
        return err
    }

    if err := ctx.merge(); err != nil {
        return err
    }

    if err := ctx.deleteLocalBranch(); err != nil {
        return err
    }

    if err := ctx.deleteRemoteBranch(); err != nil {
        return err
    }

    return nil
}

When I was finished, I opened a pull request from my fork to the CLI repository, and was blown away by how supportive the code review process was. After a few rounds of feedback, my code was merged and ready to ship in the next release. I was an open source contributor at GitHub!

Returning to my fork, my original Merge Queue changes were now completely out of date. In fact, much of the code I had on my branch no longer existed on the CLI’s trunk branch. Fortunately, I was now intimately familiar with the merge command and was able to make the Merge Queue changes and tests in a subsequent pull request quickly and with confidence.

Lessons learned

Looking back, I learned that searching for the right open source project on my own, trying to create time outside of work, and context switching from my existing projects were obstacles I could not overcome. Instead, the key for me was to find an open source project that was important to what I was already working on, and that I was accountable for. If this sounds familiar, consider asking your manager if you can devote some time to work on an issue in an open source project that you or your team rely on. It’s much easier to get started with an open source project you know and can align with work you’re already committed to. I recognize how fortunate I was to be in the right place at the right time, and with the right support from my manager, but it wasn’t easy.

Many people I know struggle with impostor syndrome, and working in public made me even more aware of mine. I am learning to accept that I should still contribute, even though my commits aren’t perfect and I’m afraid of being judged for creating bugs like this regression, which will be discoverable forever. Despite these challenges, I enjoy picking up new issues in the CLI labeled “help wanted (contributions welcome)” whenever I can, and hope you will too!

AWS Week in Review – September 5, 2022

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-september-5-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

As a new week begins, let’s quickly look back at the most significant AWS news from the previous seven days.

Last Week’s Launches
Here are the launches that got my attention last week:

AWS announces open-sourced credentials-fetcher to simplify Microsoft AD access from Linux containers. You can find more in the What’s New post.

AWS Step Functions now has 14 new intrinsic functions that help you process data more efficiently and make it easier to perform data processing tasks such as array manipulation, JSON object manipulation, and math functions within your workflows without having to invoke downstream services or add Task states.

AWS SAM CLI esbuild support is now generally available. You can now use esbuild in the SAM CLI build workflow for your JavaScript applications.

Amazon QuickSight launches a new user interface for dataset management that replaces the existing popup dialog modal with a full-page experience, providing a clearer breakdown of dataset management categories.

AWS GameKit adds Unity support. With this release for Unity, you can integrate cloud-based game features into Win64, MacOS, Android, or iOS games from both the Unreal and Unity engines with just a few clicks.

AWS and VMware announce VMware Cloud on AWS integration with Amazon FSx for NetApp ONTAP. Read more in Veliswa‘s blog post.

The AWS Region in the United Arab Emirates (UAE) is now open. More info in Marcia‘s blog post.

View of Abu Dhabi in the United Arab Emirates

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
A few more blog posts you might have missed:

Easy analytics and cost-optimization with Amazon Redshift Serverless – Four different use cases of Redshift Serverless are discussed in this post.

Building cost-effective AWS Step Functions workflows – In this blog post, Ben explains the difference between Standard and Express Workflows, including costs, migrating from Standard to Express, and some interesting ways of using both together.

How to subscribe to the new Security Hub Announcements topic for Amazon SNS – You can now receive updates about new Security Hub services and features, newly supported standards and controls, and other Security Hub changes.

Deploying AWS Lambda functions using AWS Controllers for Kubernetes (ACK) – With the ACK service controller for AWS Lambda, you can provision and manage Lambda functions with kubectl and custom resources.

For AWS open-source news and updates, here’s the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more.

Upcoming AWS Events
Depending on where you are on this planet, there are many opportunities to meet and learn:

AWS Summits – Come together to connect, collaborate, and learn about AWS. Registration is open for the following in-person AWS Summits: Ottawa (September 8), New Delhi (September 9), Mexico City (September 21–22), Bogotá (October 4), and Singapore (October 6).

AWS Community Days – AWS Community Day events are community-led conferences to share and learn with one another. In September, the AWS community in the US will run events in the Bay Area, California (September 9) and Arlington, Virginia (September 30). In Europe, Community Day events will be held in October. Join us in Amersfoort, Netherlands (October 3), Warsaw, Poland (October 14), and Dresden, Germany (October 19).

That’s all from me for this week. Come back next Monday for another Week in Review!

Danilo

Git’s database internals V: scalability

Post Syndicated from Derrick Stolee original https://github.blog/2022-09-02-gits-database-internals-v-scalability/

This week, we are exploring Git’s internals with the following concept in mind:

Git is the distributed database at the core of your engineering system.

When the database at the core of an application approaches scale limits of a single database node, a common strategy is to shard the database. By splitting the database into multiple components, we can scale beyond the limits of a single node.

For Git, large repositories can have a similar feeling. While there exist some extremely large monorepos operating with success, they require careful attention and advanced features. For some, that effort is better spent sharding the repository. Just like sharding an application database, there are many ways to split a Git repository, with various trade-offs.

When sharding an application database, there are a number of factors to consider.

Some application databases include automatic horizontal sharding based on a shard key, which is usually a string literal that can be sorted lexicographically so related values appear in the same shard due to a common prefix in the shard key. There is no immediate way to shard Git’s object store in this way. The object IDs are hashes of the object contents and have essentially random prefixes.
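
To make the “essentially random prefixes” point concrete, here is a minimal Python sketch that computes blob object IDs the way SHA-1 repositories do: by hashing a `blob <size>` header, a NUL byte, and then the file contents. The helper name `git_blob_id` and the sample contents are illustrative assumptions, not part of Git.

```python
import hashlib


def git_blob_id(content: bytes) -> str:
    """Return the SHA-1 object ID that Git assigns to a blob with this content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


if __name__ == "__main__":
    # Two hypothetical blobs whose contents differ by a single character.
    for content in (b"version = 1\n", b"version = 2\n"):
        print(git_blob_id(content)[:8], content)
```

Because the ID depends only on a hash of the contents, two files that sit next to each other in the worktree land at unrelated positions in the object store, so no prefix range corresponds to a logical shard.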

Instead, we think of sharding strategies that split the repository by other structures, including logical components, paths in the worktree, and time periods in the commit history.

Component sharding: multi-repo

One way to shard an application database is to split out entire tables that can be operated independently and managed by independent services. The Git equivalent is to split a repository into multiple smaller repositories with no concrete links between them. This creates a multi-repo sharding strategy.

The common approach to this strategy is to extract functionality out of a monolith into a microservice, but that microservice exists in its own Git repository that is not linked at all to the monolith’s repository. This effort might remove code from the monolith across multiple path prefixes due to the monolith’s architecture.

Multi-repo sharding strategy that extracts functionality out of a monolith into a microservice that exists in its own Git repository.

Using this strategy works best if each microservice is paired with a team that manages that repository, including full responsibility for developing, testing, deploying, and monitoring that service. This is very similar to the application database sharding strategy, where there is typically one application component connected to that database shard. There is no need for other components to be aware of that database since it is hidden by the component interface.

Multi-repo environments work best when there is a similar “human abstraction” where the team is autonomous as long as their service satisfies certain contracts that other teams depend on.

The tricky part of the multi-repo setup is that it requires human overhead to track where these component repositories live and how they link together. The only way to link the connections of the larger service ecosystem is through documentation and siloed experiential knowledge. System-wide efforts, such as security audits, become difficult to track to completion.

Another main downside to the multi-repo organization is that shared dependencies become difficult to manage. Common dependencies must be imported using package managers instead of using source control updates. This can make it difficult to track the consumers of those dependencies, leading to a lack of test coverage when updating those core components.

The next sharding strategy solves some of these multi-repo issues by collecting all of the smaller repositories into one larger super-repository.

Horizontal sharding: submodules

Git submodules allow a repository to include a link to another repository within its worktree. The super repository contains one or more submodules at specific paths in the worktree. The information for each submodule is stored in the .gitmodules file, but the tree entry for that submodule’s path points to a commit in the submodule repository.
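
As a rough illustration of the first half of that description, the sketch below reads a .gitmodules file and maps each submodule path to its URL. It assumes a simple file that Python's `configparser` can read (Git's config syntax is not exactly INI), and the submodule names, paths, and URLs are made up. The pinned commit is not in this file: it lives in the super repository's tree as an entry with mode 160000.

```python
import configparser

# Hypothetical .gitmodules content; the names, paths, and URLs are made up.
GITMODULES = """\
[submodule "libs/parser"]
path = libs/parser
url = https://example.com/org/parser.git
[submodule "services/billing"]
path = services/billing
url = https://example.com/org/billing.git
"""


def submodule_paths(text: str) -> dict[str, str]:
    """Map each submodule's worktree path to its remote URL."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.read_string(text)
    return {
        parser[section]["path"]: parser[section]["url"]
        for section in parser.sections()
        if section.startswith("submodule ")
    }


if __name__ == "__main__":
    for path, url in submodule_paths(GITMODULES).items():
        print(path, "->", url)
```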

Submodules create a way to stitch several smaller repositories into a single larger repository. Each has its own distinct commit history, ref store, and object store. Each has its own set of remotes to synchronize. When cloning the super repository, Git does not recursively clone the submodules by default, allowing the user to opt in to the submodules they want to have locally.

One main benefit of using a super repository is that it becomes the central hub for finding any of the smaller repositories that form a multi-repo setup. This is similar to a horizontally sharded application database that uses a shard coordinator database to actively balance the shards and run queries on the correct shard.

Diagram showing how a Git super repository can become the central hub for finding any of the smaller repositories that form a multi-repo setup.

Further, certain global properties can be guaranteed via continuous integration builds such as cross-submodule source dependencies. In this setup, the super project creates requirements that it cannot advance a submodule unless all builds and tests in the super project pass. This creates some safety that a core component does not break any consumer in the super project.

This global structure has a cost. The submodule repositories become less independent. Since they have their own Git hosting location, users can update them by pushing changes. This can even be done with local builds that make sure that component is self-consistent. However, any update to the submodule repository is incomplete until the super project updates its path pointer to that commit. At the same time, should the submodule repository move forward before that change has been validated within the super repository?

This contention between the independence of the submodule repository and the inter-dependence of submodules in the super repository is a major hurdle. It is up to the architects of this arrangement to create policies and procedures to ensure that all of the components interact well with the entire system.

One common issue developers have in a submodule environment is when there is a source dependency across multiple submodules. If a breaking change is introduced in one submodule repository, the consumer repositories need to be updated to take advantage of those changes. However, this means that all of the submodules need to coordinate when they are updated into the super repository.

There are a lot of tools out there in the wild to help manage submodules, all built on top of the git submodule command. One famous example is Google’s repo tool that coordinates changes across multiple submodules.

If you are interested in submodules and super repository workflows, then you would likely benefit from coming to Git Merge 2022 (or, watching the videos afterward), especially Emily Shaffer’s talk, “An Improved Workflow for Submodules.”

Using a single worktree: Monorepos

The previous two examples focused on reducing the size of Git repositories by breaking them up based on the worktree. By having fewer files in each repository, fewer developers are interacting with each and the repositories grow more slowly. Each approach had its benefits and trade-offs, and one big issue with each was build-time source dependencies between components.

Another common way to avoid source dependencies across multiple repositories is to only have one repository: a monorepo. Here, I’m defining a monorepo as a repository containing all source code required to build and ship a large system. This does not mean that every single file written by an employee of the company must be tracked in “the monorepo.” Instead, monorepos are defined by their strategy for how they choose to include components in the same repository:

If it ships together, it merges together.

One pattern that is increasing in popularity is the service-oriented architecture (SOA) monorepo. In this environment, all of the code for the application is contained in the same repository, but each component deploys as an independent service. In this pattern, each component can be tested against the current version of all of the other services before it is deployed.

The monorepo pattern solves many of the coordination issues discussed in the previous sharding strategies. The biggest downside is that the repository itself grows very quickly. As discussed in the previous parts of this series, Git has many advanced features that improve performance even for large repositories. Monorepos more frequently need to enable those advanced Git features, even for client repositories.

One of the main costs of a monorepo is actually the build system. If every change to the monorepo requires passing builds across the entire system, then the build system needs to take advantage of incremental builds so updates to a single component do not require building the entire monorepo. Most groups using large monorepos have a team dedicated to the developer experience, including improving the build system. Frequently, these build improvements can also lead to being able to use advanced Git features such as sparse-checkout and partial clone, which can greatly reduce the amount of data necessary for client repositories to interact with the monorepo.
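
One way to picture the incremental-build requirement is as a graph walk from the changed components to everything that consumes them. The following is a minimal sketch under two assumptions of mine: each component lives in a top-level directory named after it, and the dependency graph (`DEPENDS_ON`, entirely hypothetical) is known up front. Real build systems track dependencies at a much finer grain.

```python
from collections import deque

# Hypothetical component graph: each component lists what it depends on.
DEPENDS_ON = {
    "web": ["auth", "ui-kit"],
    "billing": ["auth"],
    "auth": ["core"],
    "ui-kit": [],
    "core": [],
}


def affected_components(changed_paths, depends_on):
    """Return the components that need a rebuild for the changed paths."""
    # Invert the graph so we can walk from a component to its consumers.
    consumers = {name: [] for name in depends_on}
    for name, deps in depends_on.items():
        for dep in deps:
            consumers[dep].append(name)

    # Assumption: the first path segment names the component.
    changed = {path.split("/", 1)[0] for path in changed_paths}
    queue = deque(name for name in changed if name in depends_on)
    affected = set(queue)
    while queue:
        for consumer in consumers[queue.popleft()]:
            if consumer not in affected:
                affected.add(consumer)
                queue.append(consumer)
    return affected
```

With this toy graph, a change under `auth/` rebuilds `auth`, `web`, and `billing`, while `ui-kit` and `core` are left alone.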

Even with a carefully designed architecture and the best Git features available, monorepos can still grow incredibly fast. It may be valuable to take a monorepo and find creative ways to split it and reset the size to something smaller.

Time-based sharding

One solution to a fast-growing monorepo is to consider it as if it were a time-series database: the changes over time are important, so what if it shards based on time instead of based on the worktree?

When performing a time-based shard, first determine a point in time where the existing monorepo can be paused and all movement on the trunk branch can be blocked. Pausing work on a monorepo is very unusual, so it should be done with extreme care and preparation.

After pausing the changes to the monorepo’s trunk, create a new repository with the same root tree as the current trunk of the old monorepo, but with a brand new root commit. Be sure to reference the old monorepo and its tip commit somewhere in the message of that new root commit. This commit can be pushed to a new repository.

Diagram representing the time-based sharding strategy.

For a quick refresher on how we represent Git objects, see the key below.

Key to how Git objects are represented. A green circle represents a commit; a blue triangle represents a tree; and a red box represents a blob.

Any ongoing work in the old monorepo must be replayed on top of the new repository. One way to do this is to rebase each topic branch onto the final commit of the trunk branch, then generate patches with git format-patch and then apply those patches in the new repository with git am.

Diagram representing how ongoing work in the old monorepo is replayed on top of the new repository using rebasing and patches.

After the new monorepo shard is created, the old monorepo can be archived as a read-only repository as all new work continues in the new monorepo. There are likely many updates required to ensure that everyone knows the new monorepo location as well as repository secrets to update. If your repository uses infrastructure as code patterns, then almost all of the information for building, testing, and deploying the monorepo will automatically be ready in the new monorepo.

Even with all of these precautions, performing a time-based shard like this is disruptive and requires a timeframe where no new work is merging into the trunk. If you are considering doing this in your engineering system, then I highly recommend doing a few test runs to make sure you minimize the time between locking the old shard and deploying out of the new shard.

The biggest benefit of this approach is that it can be done at any time regardless of the shape of your worktree. The other sharding methods require some amount of architecture changes in order to split into multiple repos or submodules. This approach cuts out the potentially large commit history and all of the old versions of files without changing the repository structure at the tip.

A time-based shard might be particularly beneficial if your commit history includes some anti-patterns for Git repositories, such as large binary files. If you have done the hard work to clean up the worktree at the tip of your repository, you may still want to clear those old files. This sharding approach is similar to rewriting history, except that the new monorepo can have an even smaller size.

However, that commit history from the old monorepo is still important! We just discussed commit history and file history queries in this blog series. It is extremely important to be able to find the reasons why the code is in its current form. In the new monorepo shard, it will look like the entire codebase was created in a single commit!

To satisfy these history queries, Git can combine the two histories in a way that allows a seamless history query, though at some performance cost. The good news is that these history queries across the shard boundary may be common at first, but become less common as time goes on.

The first step to combining the two shards together is to have a local clone of each. In the new shard, add the object store of the old repository as a Git alternate. Add the full path to the .git/objects directory of the old repository into the .git/objects/info/alternates file in the new repository. While this file exists, it allows Git processes in the new repository to see the objects in the old one.
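
The alternates step amounts to writing one line into one file. Here is a small Python sketch of it, assuming both clones are non-bare repositories with a `.git` directory; the function name `add_alternate` and the directory names are hypothetical.

```python
import tempfile
from pathlib import Path


def add_alternate(new_repo: Path, old_repo: Path) -> Path:
    """Append the old repository's object directory to the new repository's
    alternates file and return the path of that file."""
    old_objects = (old_repo / ".git" / "objects").resolve()
    alternates = new_repo / ".git" / "objects" / "info" / "alternates"
    alternates.parent.mkdir(parents=True, exist_ok=True)
    with alternates.open("a", encoding="utf-8") as handle:
        handle.write(f"{old_objects}\n")
    return alternates


if __name__ == "__main__":
    # Demonstrate on throwaway directories rather than real repositories.
    with tempfile.TemporaryDirectory() as tmp:
        old_repo, new_repo = Path(tmp, "old"), Path(tmp, "new")
        (old_repo / ".git" / "objects").mkdir(parents=True)
        print(add_alternate(new_repo, old_repo).read_text(), end="")
```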

The second step is to use git replace to create a reference that tells Git to swap the contents of the new root commit with the tip of the old repository. Since those commits share the same root tree, the only change will be the message and commit parents at that point. This allows walking “through” the link into the previous commit history.
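
The effect of the replace reference can be modeled in a few lines of Python. This is only a toy model with made-up commit IDs, not how Git implements replacements: whenever the new root commit is requested, the walk reads the old tip's contents instead, so the history continues into the old repository.

```python
# Toy commit graph: commit ID -> (message, parent IDs). All IDs are made up.
COMMITS = {
    "old-1": ("initial commit", []),
    "old-2": ("add feature", ["old-1"]),
    "old-tip": ("last commit before the shard", ["old-2"]),
    "new-root": ("new root commit", []),
    "new-1": ("first commit after the shard", ["new-root"]),
}

# Replace reference: read "old-tip" whenever "new-root" is requested.
REPLACEMENTS = {"new-root": "old-tip"}


def first_parent_log(tip, commits, replacements=None):
    """Return first-parent history messages from tip, honoring replacements."""
    replacements = replacements or {}
    messages = []
    current = tip
    while current is not None:
        # The replaced commit keeps its ID but takes the replacement's contents.
        message, parents = commits[replacements.get(current, current)]
        messages.append(message)
        current = parents[0] if parents else None
    return messages
```

Without `REPLACEMENTS`, the walk from `new-1` stops at the new root commit; with it, the walk continues through the old history back to the initial commit.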

Strategy for combining two shards together by having a local clone of each.

It is important to note that operating with replace objects enabled comes at some performance cost. In addition to having the large commit history that existed before the split, some features like the commit-graph file discussed in part II are not compatible with replace objects. For this reason, operating in this combined mode should only be done when it is critical to do history queries across the shard boundary.

One way to guarantee that the combined history is quickly available without affecting normal Git operations is to “hide” the replace references using the GIT_REPLACE_REF_BASE environment variable. This writes the replace reference in a non-standard location, so the replacement is only effective when that environment variable is set to your custom value.

Using replace references to view a combined form of the history can also help transition ongoing work from the old repository to the new one. While in the combined mode, users can use git rebase to move their topics from the old history to the new history. They no longer need to use the git format-patch and git am transformation.

Here is a concrete example of how I created a time-based shard of the Git repository starting at the v2.37.0 tag:

$ git init
$ echo /home/stolee/_git/git/src/.git/objects >.git/objects/info/alternates

$ git commit-tree -m "new root commit" \
                  -m "Sharded from e4a4b31577c7419497ac30cebe30d755b97752c5" \
                  -m "Signed-off-by: Derrick Stolee <[email protected]>" \
                  a4a2aa60ab45e767b52a26fc80a0a576aef2a010
b49d35c8288501462ca1a008b3bb2efb9b4c4a9d

$ GIT_REPLACE_REF_BASE=refs/shard git replace \
                  b49d35c8288501462ca1a008b3bb2efb9b4c4a9d \
                  e4a4b31577c7419497ac30cebe30d755b97752c5

$ git log --oneline
b49d35c828 (HEAD -> master) new root commit

$ GIT_REPLACE_REF_BASE=refs/shard git log --oneline -n 5
b49d35c828 (HEAD -> master, replaced) Git 2.37
49c837424a Merge branch 'jc/revert-show-parent-info'
5dba4d6540 Merge tag 'l10n-2.37.0-rnd1' of https://github.com/git-l10n/git-po
fc0f8bcd64 revert: config documentation fixes
71e3a31e40 l10n: sv.po: Update Swedish translation (5367t0f0u)

You can follow the instructions in the sharded repository to experience cloning the two repositories and using the combined history as needed.

Time-based shards can be successful in reducing several dimensions of scale that cause friction in a large monorepo. However, the hurdle of transitioning work to a new repository location may be too disruptive for your group. There is one final sharding strategy I’ll discuss today, and it keeps the logistical structure of the monorepo in a single location while still improving how client repositories interact with the remote repository.

Data offloading

When a database grows, it may be beneficial to recognize that some data elements are infrequently accessed and to move that data to less expensive, but also lower performance storage mechanisms. It is possible to do this with Git repositories as well!

It is possible to think about partial clone as a way to offload data to secondary storage. A blobless clone (created by git clone --filter=blob:none) downloads the full commit history and all reachable trees from the origin server, but only downloads blob contents when necessary. In this way, the initial clone can be much faster and the amount of local storage greatly reduced. This comes at a cost that when Git needs a blob to satisfy a git checkout or git blame query, Git needs to communicate across the network to get that information. Frequently, that network hop requires going great distances across the internet and not just a local area network.

This idea of offloading data to secondary storage can work even better if there is a full clone of the remote repository available to add as an alternate. Perhaps the repository lives on a network fileshare that is accessible on the local network. Perhaps your IT department sets up new machines with a hard-disk drive containing a static copy of the repository from certain points in time. In either case, a blobless partial clone can add that static repository as an alternate, providing a faster lookup location for the blobs that do not exist in the local object store.
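
The lookup order described here can be sketched as a toy model: check the local object store, then the alternate, and only then pay for a network request. The class and attribute names are hypothetical, and plain dictionaries stand in for the object stores.

```python
class ObjectLookup:
    """Toy model: local store first, then an alternate, then the network."""

    def __init__(self, local, alternate, remote):
        self.local = local          # objects already in the local object store
        self.alternate = alternate  # static full copy reachable as an alternate
        self.remote = remote        # origin server, the slowest option
        self.network_requests = 0

    def read(self, oid):
        if oid in self.local:
            return self.local[oid]
        if oid in self.alternate:
            return self.alternate[oid]
        # Only objects missing from both stores cost a network round trip.
        self.network_requests += 1
        content = self.remote[oid]
        self.local[oid] = content  # keep the fetched blob for next time
        return content
```

In this model, a blob that exists in the static copy never touches the network, and a blob that is newer than the static copy is fetched once and then served locally.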

One major benefit of this kind of setup is that most custom query indexes, such as the commit-graph and changed-path Bloom filters, work automatically in this environment. This can be a great way to bootstrap local clones while minimizing the effect of missing blobs in a partial clone.

However, the current organization only helps at clone time. All fetches and future operations still grow the local repository size at the same rate, without ever reducing the size of the repository.

It is possible to take this idea of data offloading and use it to move data out of your local repository and into secondary storage, trading expensive-but-fast storage for cheap-but-slower storage.

The key idea is again to use Git alternates, and create an alternate that points to some area of secondary storage. The second step is to discover objects in the repository history that are infrequently used, then copy them to that alternate and delete them from the local copy.

To decide what is an “infrequently used” object, we can use the commit history. The commits themselves are cheap and used for many commit history queries, so always keep those in the local storage. Similarly, keep each root tree. Also, objects reachable from recent root trees should be kept locally. (Feel free to be flexible about what you think “recent” means.)

After we know that we care about these objects, there are many ways we can decide what else should be kept. We could have a hard cutoff where we only keep root trees and no other objects older than that cutoff. We could also taper off the object list by first moving the blobs older than the cutoff, then slowly removing trees at certain depths, keeping fewer and fewer trees as the history gets older and older. There are a lot of possibilities to explore in this space.
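
As a starting point for experimenting, the hard-cutoff variant could look like the sketch below. It is only a policy model: the `GitObject` record and its `last_seen` field (the newest commit time whose root tree reaches the object) are assumptions of mine, and computing that field from a real repository is the hard part that this sketch skips.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GitObject:
    oid: str
    kind: str           # "commit", "tree", or "blob"
    is_root_tree: bool  # True for the top-level tree of some commit
    last_seen: int      # newest commit time (Unix seconds) that reaches it


def split_by_recency(objects, cutoff):
    """Partition object IDs into (keep_local, offload) using a hard cutoff."""
    keep_local, offload = [], []
    for obj in objects:
        # Commits and root trees always stay in local storage.
        always_local = obj.kind == "commit" or obj.is_root_tree
        if always_local or obj.last_seen >= cutoff:
            keep_local.append(obj.oid)
        else:
            offload.append(obj.oid)
    return keep_local, offload
```

The tapering variant would replace the single `cutoff` comparison with a rule that also looks at the object kind and tree depth.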

Diagram representing secondary storage offloading based on recency.

I don’t know of any existing tool that does this kind of secondary storage offloading based on recency, but I think it could be really useful for some of these large monorepos. If this is something you think would work for your team, then try building it yourself tailored to your specific needs. Just promise that you’ll tell me if you do, because I want to see it!

Let’s keep the conversation going!

Thank you for reading this blog series! I had a lot of fun writing it and thinking about these advanced Git features and some potential extensions.

This may be the end of my prepared writing, but I will keep thinking of Git like a database for a very long time. If you have additional ideas to share, then please ping me on Twitter so we can keep the conversation going. I’ll also be speaking at Git Merge 2022 covering all five parts of this blog series, so I look forward to seeing you there!

Git’s database internals IV: distributed synchronization

Post Syndicated from Derrick Stolee original https://github.blog/2022-09-01-gits-database-internals-iv-distributed-synchronization/

This week, we are exploring Git’s internals with the following concept in mind:

Git is the distributed database at the core of your engineering system.

Git’s distributed nature comes from its decentralized architecture. Each repository can act independently on its own without needing to connect to a central server. Repository hosting providers, such as GitHub, create a central place where contributors can collaborate on changes, but developers can work on their own and share their code with the “official” copy when they are ready. CI/CD systems like GitHub Actions help build farms get the latest changes then run builds and tests.

Instead of guaranteeing consistency across the entire repository, the git fetch and git push commands provide ways for repository owners to synchronize select portions of their repositories through reference updates and sharing Git objects. All of these operations require sharing just enough of the Git object data. Git uses several mechanisms to efficiently compute a small set of objects to share without requiring a full list of objects on each side of the exchange. Doing so requires taking advantage of the object store’s shape, including commit history, tree walking, and custom data structures.
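
At its simplest, “just enough” means the commits reachable from what the receiver wants, minus the commits reachable from what it already has. The sketch below models that set difference on a made-up commit graph. It is a deliberate simplification: it walks both histories in full, which is exactly the cost Git's real mechanisms are designed to avoid.

```python
# Toy commit graph: commit ID -> parent IDs. All IDs are made up.
PARENTS = {"A": [], "B": ["A"], "C": ["B"], "D": ["C"], "E": ["B"]}


def reachable(tips, parents):
    """Return every commit reachable from the given tips."""
    seen = set()
    stack = list(tips)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def commits_to_send(wants, haves, parents):
    """Commits the receiver is missing: reachable from wants, not from haves."""
    return reachable(wants, parents) - reachable(haves, parents)
```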

Distributed in the most disconnected way

The first thing to consider about a distributed system is the CAP theorem, which states that the system cannot simultaneously be consistent, available, and resilient to partitions (network disconnections). For most distributed systems, network partitions are supposed to be rare and short, even if they are unavoidable.

With Git, partitions are the default state. Each user chooses when to synchronize information across these distributed copies. Even when they do connect, it can be only a partial update, such as when a user pushes one of their local branches to a remote server.

With this idea of being disconnected by default, Git needs to consider its synchronization mechanisms differently than other databases. Each copy can have an incredibly different state and each synchronization has a different goal state.

To start, let’s focus on the case of git fetch run on a client repository that is trying to synchronize with a remote repository. Further, let’s assume that we are trying to get all objects reachable from the remote’s branches in refs/heads/ and we will write copies of those refs in refs/remotes/<remote>/.

The first thing that happens in this process is called the ref advertisement, where the client requests the list of references available on the remote. There are some subtleties about how this works, such as when using Git’s protocol v2. For our purposes, we can assume that the client sends a request and the server sends back a list of every reference in the refs/heads/ and refs/tags/ namespaces along with the current object ID at that reference. The client then filters from that list of references and continues the rest of the communication using object IDs.

You can test the ref advertisement directly using the git ls-remote command, which requests the ref advertisement but does not download any new objects.

$ git ls-remote --heads origin
4af7188bc97f70277d0f10d56d5373022b1fa385        refs/heads/main
00d12607a27e387ad78b5957afa05e89c87e83a5        refs/heads/maint
718a3a8f04800cd0805e8fba0be8862924e20718        refs/heads/next
b8d67d57febde72ace37d40301a429cd64f3593f        refs/heads/seen
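
To make the client-side filtering step concrete, here is a minimal Python sketch (not Git’s own code) that parses output in the shape shown above and picks out the object IDs the client does not have yet. The function name parse_ref_advertisement and the local_objects set are made up for this illustration.

```python
def parse_ref_advertisement(text):
    """Map each ref name to the object ID advertised for it."""
    refs = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # Each line is "<object ID><whitespace><ref name>".
        object_id, ref_name = line.split(None, 1)
        refs[ref_name.strip()] = object_id
    return refs


sample = """\
4af7188bc97f70277d0f10d56d5373022b1fa385        refs/heads/main
00d12607a27e387ad78b5957afa05e89c87e83a5        refs/heads/maint
"""
advertised = parse_ref_advertisement(sample)

# Hypothetical: pretend the client already has the object at 'maint'.
local_objects = {"00d12607a27e387ad78b5957afa05e89c87e83a5"}
missing = [oid for oid in advertised.values() if oid not in local_objects]
print(missing)
```

Only the object IDs survive this step; the rest of the exchange does not need the ref names.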

Quick tip: synchronize more frequently

Since client repositories usually only synchronize with remotes when the user manually runs git fetch or git pull, it can be helpful to reduce the amount of object transfer required by synchronizing more frequently. When there are fewer “new” objects, less work is required for the synchronization.

The simplest way to do that is to use Git’s background maintenance feature. The git maintenance start command configures regularly-scheduled maintenance, including an hourly “prefetch” task that downloads the latest objects from all remotes. The remote refs are copied into the hidden refs/prefetch/ namespace instead of the usual refs/remotes/ namespace. This allows foreground git fetch commands to update the refs/remotes/ namespace only when requested manually.

The idea is simple: it speeds up foreground synchronizations by making sure there is less work to do. Frequently, it can mean that the only work to do is to update the refs in refs/remotes/ since all of the Git objects already exist in the client repository. Those background fetches are made more efficient by running frequently, but let’s discover exactly what happens during a fetch in order to understand how this is possible.

The ultimate question: Which objects are in one copy but not in another?

This synchronization boils down to a new type of query. In its simplest form, this query needs to find a set of objects that is in one repository but not in another. This is a set difference query. If we had the entire repository contents available, then we could list each object in one copy and check if that object exists in the other. Even if we were not working over a network connection, that algorithm takes time on the order of the number of objects in the repository, far more than the number of objects in the result set difference.

We also care about Git’s object graph. We only want objects that are reachable from some set of references and do not care about unreachable objects. Naively iterating over the object store will pick up objects that are not reachable from our chosen refs, adding wasted objects to the set.

Let’s modify our understanding of this query. Instead of being a simple set difference query where we want all objects that are in one repository but not in another, we actually want a reachable set difference query. We are looking for the set of objects that are reachable from a set of objects and not reachable from another set of objects.

Note that I am using objects as the starting point of the reachable set difference query. The Git client is asking to fetch a given set of objects based on the ref advertisement that is already complete. If the server updates a ref in between, the client will not see that change until the next time it fetches and gets a new copy of the ref advertisement.

Git uses the terms wants and haves to define the starting points of this reachable set difference query.

  • A want is an object that is in the serving repository and that the client repository requests. These object IDs are the ones in the server’s ref advertisement that do not exist on the client.
  • A have is an object that the client repository has in its object store. These object IDs come from the client’s references, both in refs/heads/ and in refs/remotes/<remote>/.

At this point, we can define the reachable set difference as the objects reachable from any of the wants but not reachable from any haves. In the most extreme case, the fetch operation done as part of git clone uses no haves and only lists a set of wants.

Given a set of wants and haves, we have an additional wrinkle: the remote might not contain the ‘have’ objects. Using tips of refs/remotes/<remote>/ is a good heuristic for finding objects that might exist on the server, but it is no guarantee.

For this reason, Git uses a fetch negotiation step where the client and server communicate back and forth about sets of wants and haves, reporting whether each object is known to the other side. This allows the server to request that the client look deeper in its history for more ‘have’ objects that might be in common between the client and the server. After a few rounds of this, the two sides can agree that there is enough information to compute a reachable set difference.
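
The back-and-forth can be pictured with a heavily simplified Python sketch. It assumes a linear client history and a server that simply acknowledges any offered commit it knows; Git’s real negotiation is more careful about which commits to offer and when to stop. The name negotiate and the batch_size parameter are invented for this example.

```python
def negotiate(client_history, server_objects, batch_size=2):
    """Return (acknowledged_haves, rounds).

    client_history lists commit IDs from newest to oldest (a linear history).
    server_objects is the set of object IDs the server knows about.
    """
    rounds = 0
    for start in range(0, len(client_history), batch_size):
        rounds += 1
        # The client offers a batch of candidate 'have' commits.
        batch = client_history[start:start + batch_size]
        # The server acknowledges the ones that exist in its object store.
        acknowledged = [c for c in batch if c in server_objects]
        if acknowledged:
            return acknowledged, rounds
        # Nothing in common yet: the client looks deeper in its history.
    return [], rounds


# The client made c4 and c5 locally; the server only knows c1..c3.
print(negotiate(["c5", "c4", "c3", "c2", "c1"], {"c1", "c2", "c3", "s4"}))
```

In this toy run, the first round finds nothing in common and the second round settles on c3 and c2 as the agreed haves.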

Now that the client and server have agreed on a set of haves and wants, let’s dig into the algorithms for computing the object set.

Walking to discover reachable set differences

Let’s start by talking about the simplest way to compute a reachable set difference: use a graph walk to discover the objects reachable from the haves, then use a graph walk to discover the objects reachable from the wants that were not already discovered.

For a quick refresher on how we represent Git objects, see the key below.

Key to how Git objects are represented visually. A green circle represents a commit; a blue triangle represents a tree; and a red box represents a blob.

As discussed in part II, Git’s commit history can be stored in the commit-graph file for fast commit history queries. In this way, Git could walk all of the commits from the haves, then walk to their root trees, then recursively walk trees until finding all trees and blobs reachable from those commits. During this walk, Git can mark each object in-memory with a special flag indicating it is in this reachable set.

To find the reachable set difference, Git can walk from the want objects following each commit parent, root tree, and recursively through the trees. This second walk ignores the objects that were marked in the previous step so each visited object is part of the set difference.

In the figure below, the commit B is a have and the commit A is a want. Two sets are shown. First, everything reachable from B is grouped into a set. The second set is the reachable set difference containing everything reachable from A but not reachable from B.

Figure representing a walk through two commits: a "want" (commit A) and a "have" (commit B).
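
The two walks can be sketched in a few lines of Python. This is a toy model, not Git’s implementation: the object graph is a plain dictionary from an object ID to the IDs it points at (a commit points at its parents and root tree, a tree at its entries, a blob at nothing), and the object names are made up.

```python
def mark_reachable(graph, starts, marked):
    """Walk from starts, skipping marked objects; return the newly visited IDs."""
    newly_visited = set()
    stack = list(starts)
    while stack:
        oid = stack.pop()
        if oid in marked:
            continue
        marked.add(oid)
        newly_visited.add(oid)
        stack.extend(graph[oid])
    return newly_visited


def reachable_set_difference(graph, wants, haves):
    marked = set()
    mark_reachable(graph, haves, marked)         # first walk: flag the haves
    return mark_reachable(graph, wants, marked)  # second walk: collect what is new


graph = {
    "A": ["B", "treeA"],            # commit A: parent B, root tree treeA
    "B": ["treeB"],                 # commit B: root commit with tree treeB
    "treeA": ["readme_v2", "license"],
    "treeB": ["readme_v1", "license"],
    "readme_v1": [], "readme_v2": [], "license": [],
}
print(sorted(reachable_set_difference(graph, wants=["A"], haves=["B"])))
```

Note that the first walk visits every object reachable from B even though the answer contains only three objects.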

While this walking algorithm is a natural one to consider, it has a number of significant performance penalties.

First, we will spend a lot of time parsing trees in order to discover their tree entries. We noted in part III that tree parsing is expensive, and that was in the context of file history, where we only needed to parse trees along a single path. In addition, there are usually many tree entries that point to the same object. For example, an open source license file is usually added once and never modified in a repository. By contrast, almost every commit has a distinct root tree. Thus, each commit introduces a tree with a tree entry pointing to that license file. Git needs to test if the license file is already in the set each time it parses that tree entry. That’s a lot of work. We will revisit how to reduce the time spent parsing trees and following tree entries later, though it will require a new data structure.

The second performance penalty is that this walk requires visiting the entire commit history and likely walking a majority of the Git objects. That cost is paid even if the only new work is one commit that changes the README, resulting in a total of one commit, one tree, and one blob in the set difference. The fact that the cost does not scale with the expected output means that even frequent fetches will not reduce this cost.

Thankfully, there is a way to tweak this algorithm to reduce this second cost without needing any new data structures.

Discovering a frontier

If we think about the reachable set difference problem from the perspective of an arbitrary directed graph, then the full walk algorithm of the previous section is the best we can do. However, the Git object graph has additional structure, including different types of objects. Git uses the structure of the commit history to its advantage here, as well as some assumptions about how Git repositories are typically used.

If we think about Git repositories as storing source code, we can expect that code is mostly changed by creating new code. It is rare that we revert changes and reintroduce the exact copy of a code file that existed in the past. With that in mind, walking the full commit history to find every possible object that ever existed is unlikely to be helpful in determining the set of “new” objects.

Instead of walking every object in the full commit history, Git uses the commit history of the haves and wants to discover a frontier of commits. These are the commits that are reachable from the haves and sit on the boundary between the reachable set difference and the common history. For a commit A to be in the frontier, there must be at least one commit B whose parent is A, where B is reachable from the wants but not reachable from the haves.

This idea of a frontier can be visualized using the git log --boundary query with a commit range parameter. In the example below, we are exploring the commits reachable from d02cc45c7a but not reachable from 3d8e3dc4fc. The commits marked with o are on this boundary.

$ git log --graph --oneline --boundary 3d8e3dc4fc..d02cc45c7a
*   acdb1e1053 Merge branch 'mt/checkout-count-fix'
|\
| * 611c7785e8 checkout: fix two bugs on the final count of updated entries
| * 11d14dee43 checkout: show bug about failed entries being included in final report
| * ed602c3f44 checkout: document bug where delayed checkout counts entries twice
* |   f0f9a033ed Merge branch 'cl/rerere-train-with-no-sign'
|\ \
| * | cc391fc886 contrib/rerere-train: avoid useless gpg sign in training
| o | bbea4dcf42 Git 2.37.1
|  /
o / 3d8e3dc4fc Merge branch 'ds/rebase-update-ref'
 /
o e4a4b31577 Git 2.37

Once Git has determined the commit frontier, it can simplify the object walk somewhat. Starting at the frontier, Git walks those root trees and then recursively all of the reachable trees. These objects are marked as reachable from the haves. Then, the second walk from the wants continues as normal, stopping when it sees objects in this smaller set.

Image representing a simplified object walk starting from the commit frontier. Git walks those root trees and then recursively all of the reachable trees.

With this new algorithm, we see that the cost of the object walk can be much smaller: we expect the algorithm to walk about as many objects as there exist from a few root trees, plus the new objects in the reachable set difference. This could still be a large set, but at least it does not visit every object in the full history. As part of this, we have many fewer repeated tree entries since they are rarely repeated within a walk from a few root trees.

There is an additional cost to this algorithm, though. We might increase the size of the resulting set! If some of the commits in the set difference really are reverts, then they could be “reintroducing” an older object into the resulting set. If this commit reverted the file at a given path, then every commit in the frontier must not have that version of the file at its root tree. This exact revert case is rare enough that these new objects do not account for a significant drawback, but it is worth mentioning.
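
Here is a small Python sketch of the frontier idea, including the revert wrinkle. It is a toy model under stated assumptions: parents, root_tree, and entries are invented dictionaries describing a four-commit history, and for brevity the commit classification walks the whole (cheap) commit history, which Git itself avoids.

```python
def commits_reachable(parents, starts):
    seen, stack = set(), list(starts)
    while stack:
        commit = stack.pop()
        if commit not in seen:
            seen.add(commit)
            stack.extend(parents[commit])
    return seen


def find_frontier(parents, wants, haves):
    """Return (new_commits, frontier) using only the commit history."""
    from_haves = commits_reachable(parents, haves)
    new_commits = commits_reachable(parents, wants) - from_haves
    frontier = {p for c in new_commits for p in parents[c] if p in from_haves}
    return new_commits, frontier


def walk_trees(entries, roots, marked):
    """Walk trees and blobs from roots, skipping anything already marked."""
    new, stack = set(), list(roots)
    while stack:
        oid = stack.pop()
        if oid in marked:
            continue
        marked.add(oid)
        new.add(oid)
        stack.extend(entries.get(oid, []))  # blobs have no entries
    return new


def frontier_difference(parents, root_tree, entries, wants, haves):
    new_commits, frontier = find_frontier(parents, wants, haves)
    marked = set()
    # Only the frontier's root trees are walked, not the full history.
    walk_trees(entries, [root_tree[c] for c in frontier], marked)
    new_objects = walk_trees(entries, [root_tree[c] for c in new_commits], marked)
    return new_commits | new_objects


# C3 reverts file f to the version that C0 introduced.
parents = {"C0": [], "C1": ["C0"], "C2": ["C1"], "C3": ["C2"]}
root_tree = {"C0": "t0", "C1": "t1", "C2": "t2", "C3": "t3"}
entries = {"t0": ["f_v1"], "t1": ["f_v2"], "t2": ["f_v2", "g"], "t3": ["f_v1", "g"]}
print(sorted(frontier_difference(parents, root_tree, entries, ["C3"], ["C2"])))
```

The result contains f_v1 even though it is reachable from C0, which is part of the haves. That is the extra object from the revert case: the frontier commit C2 does not have that version of the file, so the walk cannot know the other side already has it.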

We can still do better! In the case of a monorepo, that cost of walking all of the trees in the frontier can still be significant. Is there a way that we can compute a reachable set difference more quickly? Yes, but it requires new data structures!

Reachability bitmaps

When considering set arithmetic, such as set differences, a natural data structure to use is a bitmap. Bitmaps represent sets by associating every possible object with a position, and then using an array of bits over those positions to indicate if each object is in the set. Bitmaps are frequently used by application databases as a query index. A bitmap can store a precomputed set of rows in a table that satisfy some property, helping to speed up queries that request data in that set.

The figure below shows how the object graph from the previous figures is laid out so that every object is associated with a bit position. The bitmap at the top has a 1 in the positions corresponding to objects reachable from the commit A. The other positions have value 0 showing that A cannot reach that object.

Figure showing how the object graph from the previous figures is laid out so that every object is associated with a bit position.

Computing the set difference between two bitmaps requires iterating over the bit positions and reporting the positions that have a 1 in the first bitmap and a 0 in the second bitmap. This is identical to the logical operation of “A AND NOT B,” but applied to every bit position.

In this way, Git can represent the reachable sets using bitmaps and then perform the set difference. However, computing each bitmap is at least as expensive as walking all of the reachable objects. Further, as currently defined, bitmaps take at least one bit of memory per object in the repository, which can also become too expensive.

The critical thing that Git does to solve this cost of constructing the bitmaps is to precompute the reachability bitmaps and store them on disk. Recall from part I that Git uses compressed object storage files called packfiles to store the object contents. The git repack command takes all of the objects and creates a new packfile along with a pack-index.

The git repack --write-bitmap-index option computes reachability bitmaps at the same time as it repacks the Git object data into a new packfile. Each bit position is associated with an object in the packfile based on the order the objects appear in that packfile. In addition to the .pack and .idx files, a new .bitmap file stores these bitmaps. Git can also store reachability bitmaps across multiple packfiles using a multi-pack-index.

Each reachability bitmap is associated with a single Git commit. The bitmap stores the set of objects reachable from that commit. A .bitmap file can store reachability bitmaps corresponding to one or more commits.

If every commit had a reachability bitmap, then we could compute the reachable set difference from a set of haves and wants using the following process:

  1. Take the bitmap for each ‘have’ commit and merge them together into the union bitmap storing every object reachable from at least one ‘have’ commit.
  2. Take the bitmap for each ‘want’ commit and merge them together into the union bitmap storing every object reachable from at least one ‘want’ commit.
  3. Perform a set difference on the bitmaps created in the previous step.

The figure below shows this third step of performing the set difference on the two reachability bitmaps. The “A – B” bitmap is formed by including a 1 if and only if that position has a 1 in the A bitmap and a 0 in the B bitmap.

Figure showing third step of performing the set difference on the two reachability bitmaps.
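
The three steps map directly onto integer bit operations. The sketch below uses Python integers as uncompressed bitmaps and a made-up object order; Git’s on-disk bitmaps are compressed and tied to packfile order, which this toy ignores.

```python
# Hypothetical object order: each object gets one bit position.
object_order = ["A", "B", "treeA", "treeB", "readme_v2", "readme_v1", "license"]
position = {oid: i for i, oid in enumerate(object_order)}


def to_bitmap(object_ids):
    bits = 0
    for oid in object_ids:
        bits |= 1 << position[oid]
    return bits


def from_bitmap(bits):
    return {oid for oid, i in position.items() if (bits >> i) & 1}


def bitmap_difference(want_bitmaps, have_bitmaps):
    haves = 0
    for bitmap in have_bitmaps:   # step 1: union of the 'have' bitmaps
        haves |= bitmap
    wants = 0
    for bitmap in want_bitmaps:   # step 2: union of the 'want' bitmaps
        wants |= bitmap
    return wants & ~haves         # step 3: "A AND NOT B" on every position


reach_A = to_bitmap(object_order)  # A reaches every object
reach_B = to_bitmap(["B", "treeB", "readme_v1", "license"])
print(sorted(from_bitmap(bitmap_difference([reach_A], [reach_B]))))
```

Once the bitmaps exist, the difference itself needs no graph walk at all.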

Unfortunately, computing and storing a reachability bitmap for every commit in the entire repository is not realistic. First, consider that each bitmap can take up one bit per object in the repository, then multiply that by the number of commits in the repository to get quadratic growth! This isn’t exactly a lower bound on the size of these bitmaps since Git uses a compressed bitmap encoding as well as a form of delta compression between bitmaps. However, the cost of computing and storing each bitmap is still significant.

Even if we were able to store a reachability bitmap for every commit, it is possible that a new commit is pushed to the repository and then is requested by a fetch before a reachability bitmap could be computed for it. Thus, Git needs some way to compute the reachable set difference even when the requested haves and wants do not have pre-computed bitmaps.

Git solves this by using a commit history walk. Starting at the haves, Git walks the commit history until finding a commit that has a precomputed reachability bitmap. That bitmap is used as a starting point, and the commit walk halts when it finds another reachability bitmap or finds a commit that is already contained in the reachable set because its bit is 1 in the bitmap. After the commit set is explored, Git walks the trees starting at the root trees, ignoring any trees that already exist in the reachability bitmap.

In this way, Git can dynamically compute the reachability bitmap containing the full set of objects reachable from the haves. The process is repeated with the wants. Then, the set difference is computed from those two bitmaps.
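
A rough Python sketch of that dynamic computation follows. It is illustrative only: the dictionaries (parents, root_tree, entries, position, precomputed) are hypothetical stand-ins for Git’s commit-graph, object store, and .bitmap file, and bitmaps are plain integers.

```python
def reachable_bitmap(starts, parents, root_tree, entries, position, precomputed):
    """Build the bitmap of everything reachable from starts."""
    bits = 0
    pending_trees = []
    stack = list(starts)
    # Commit walk: stop at precomputed bitmaps and at commits already covered.
    while stack:
        commit = stack.pop()
        if (bits >> position[commit]) & 1:
            continue
        if commit in precomputed:
            bits |= precomputed[commit]  # covers the commit and all it reaches
            continue
        bits |= 1 << position[commit]
        pending_trees.append(root_tree[commit])
        stack.extend(parents[commit])
    # Tree walk: only for commits that had no bitmap, skipping covered objects.
    stack = pending_trees
    while stack:
        oid = stack.pop()
        if (bits >> position[oid]) & 1:
            continue
        bits |= 1 << position[oid]
        stack.extend(entries.get(oid, []))  # blobs have no entries
    return bits


order = ["C1", "C2", "C3", "t1", "t2", "t3", "a_v1", "a_v2", "b"]
position = {oid: i for i, oid in enumerate(order)}
parents = {"C1": [], "C2": ["C1"], "C3": ["C2"]}
root_tree = {"C1": "t1", "C2": "t2", "C3": "t3"}
entries = {"t1": ["a_v1"], "t2": ["a_v2"], "t3": ["a_v2", "b"]}

# Only C2 has a precomputed bitmap; C3 was pushed after the last repack.
precomputed = {"C2": sum(1 << position[o] for o in ["C1", "C2", "t1", "t2", "a_v1", "a_v2"])}
bitmap = reachable_bitmap(["C3"], parents, root_tree, entries, position, precomputed)
print(bin(bitmap))
```

In this toy run, the walk parses only the new commit C3 and its root tree; everything older comes from the stored bitmap for C2.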

If the set of precomputed bitmaps is chosen carefully enough and the object order is selected in such a way that the bitmaps compress efficiently, these operations can be done while walking an incredibly small number of objects and using significantly less memory.

With proper maintenance of the reachability bitmap index, these reachable set difference queries can be much faster than the previous frontier walking strategy while also computing the exact set difference. The extra objects that could appear using the frontier algorithm do not appear using the precomputed bitmaps.

If you want to read more about how commits are chosen for bitmaps or how the bitmaps are compressed, read the original announcement of reachability bitmaps which goes into even greater detail. In particular, that post goes very deep on the fact that the object data is sent over the wire using the same packfile format as the on-disk representation discussed in part I, except that Git allows reference deltas to refer to objects already on the client’s machine. The fact that the on-disk representation and the network transfer format use this common format is one of Git’s strengths.

Pushing to a remote

The previous algorithms were focused on computing the reachable set difference required during a git fetch command. After the client sends the list of haves and wants, the server computes the set difference and uses that to send the objects to the client. The natural opposite of this operation is the git push command where the client sends new objects to the server.

We could use the existing algorithm, but we need to flip around some meanings. The haves and wants become commits that “the server has” and “the client wants the server to have”. One caveat is that, by default, git push doesn’t do a negotiation at the start and instead treats the references in refs/remotes/<remote>/ as the set of haves. You can enable the push.negotiate config option if you find this negotiation to be valuable. This negotiation is important if you have not updated your refs/remotes/<remote>/ references through a git fetch in a while. The negotiation is more useful if you are using background maintenance because you are more likely to have most of the objects the remote will advertise in the negotiation.

Other than reversing the roles of the haves and wants, the goals of git push are exactly the same as git fetch. The command synchronizes objects from one repository to another. However, we can again think about the typical use of the commands to see that there are some asymmetries.

When a client runs git fetch, that command will typically download new objects from several other contributors to that repository, perhaps merged together by pull requests. These changes are likely to include changes to many files across several directories. When a client runs git push, the information that is new to the remote is typically a single topic branch created by a single contributor. The files modified by this effort are likely to be smaller in number than the git fetch case.

Git exploits this asymmetry using a custom reachable set difference algorithm tailored to these expectations.

Sparse reachable set difference

One major asymmetry with git push is that clients rarely find it worth the cost to precompute reachability bitmaps. That maintenance cost is too CPU intensive compared to the number of times git push is run by a typical user. For Git servers, reachability bitmaps are absolutely critical to efficient function, so that extra maintenance is easy to justify.

Without reachability bitmaps, Git falls back to the frontier algorithm when computing the reachable set difference. This works mostly fine for small projects, but when the client repository is very large, the cost of walking every object reachable from even a single root tree becomes too expensive.

This motivated the sparse reachable set difference algorithm. This algorithm is enabled by the pack.useSparse config option, which is now enabled by default. In addition to using the commit history to construct a frontier of commits, the sparse algorithm uses the structure of the trees themselves to compute the reachable set difference.

Just like the frontier algorithm, Git computes the commit frontier as a base of which objects are in common between the haves and wants. Then, instead of walking all the trees reachable from the root trees in the frontier and then walking the root trees from the wants, Git walks these trees in a single walk.

Instead of exploring the object graph directly by walking from tree to tree one at a time, Git changes the walk to do a breadth-first search on the paths available in these trees. Each node of this walk consists of a path and a set of trees. Each tree is marked as uninteresting or interesting, depending on whether it comes from the commit frontier or not, respectively.

The walk is initialized with the empty path and the set of root trees from our commit frontier and the commits reachable from the wants. As Git explores a node, it iterates over each tree in the associated set. For each of those trees, it parses the tree entries and considers the path component from each. If the entry points to a blob, then that blob is marked as interesting or uninteresting to match the tree that contains it. If the entry points to a tree, then the path component leads to a new node and that tree is added to that node’s tree set.

During this walk, uninteresting trees mark their child trees as uninteresting. When visiting a node, Git skips the node if every contained tree is uninteresting.

These “all uninteresting” nodes represent directories where there are no new objects in the reference being pushed relative to the commit frontier. For a large repository and most changes, the vast majority of trees fit in this category. In this way, this sparse algorithm walks only the trees that are necessary to discover these new objects.

Figure representing a sparse algorithm that walks only the trees that are necessary to discover these new objects.
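
The path-based walk can be sketched in Python as follows. This is a simplified model rather than Git’s implementation: trees is a made-up dictionary from a tree ID to its entries, each entry being a (kind, object ID) pair, and the function collects the new objects in the same pass for brevity.

```python
from collections import deque


def sparse_new_objects(trees, frontier_roots, want_roots):
    """Return (new_objects, visited_paths) for a path-based breadth-first walk."""
    uninteresting = set(frontier_roots)
    candidates = set()
    visited_paths = []
    queue = deque([("", set(frontier_roots) | set(want_roots))])
    while queue:
        path, tree_ids = queue.popleft()
        if tree_ids <= uninteresting:
            continue  # "all uninteresting" node: skip this whole directory
        visited_paths.append(path)
        children = {}  # path component -> set of child trees
        for tree_id in sorted(tree_ids):
            from_frontier = tree_id in uninteresting
            if not from_frontier:
                candidates.add(tree_id)
            for name, (kind, oid) in trees[tree_id].items():
                if from_frontier:
                    uninteresting.add(oid)  # children inherit the mark
                else:
                    candidates.add(oid)
                if kind == "tree":
                    children.setdefault(name, set()).add(oid)
        for name, child_ids in children.items():
            queue.append((path + name + "/", child_ids))
    return candidates - uninteresting, visited_paths


trees = {
    "root_old": {"src": ("tree", "src_old"), "docs": ("tree", "docs"),
                 "LICENSE": ("blob", "lic")},
    "root_new": {"src": ("tree", "src_new"), "docs": ("tree", "docs"),
                 "LICENSE": ("blob", "lic")},
    "src_old": {"main.c": ("blob", "main_v1")},
    "src_new": {"main.c": ("blob", "main_v2")},
    "docs": {"guide.md": ("blob", "guide")},
}
new_objects, visited = sparse_new_objects(trees, ["root_old"], ["root_new"])
print(sorted(new_objects), visited)
```

The docs/ directory is never parsed because the only tree at that path is uninteresting, which is exactly the saving the sparse algorithm is after.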

This sparse algorithm is discussed in more detail in the blog post announcing the option when it was available in Git 2.21.0, though the pack.useSparse option was enabled by default starting in Git 2.27.0.

Heuristics and query planning

In this blog series, we are exploring Git’s internals as if they were a database. This goes both directions: we can apply database concepts such as query indexes to frame these advanced Git features, but we can also think about database features that do not have counterparts in Git.

This area of synchronization is absolutely one area where database concepts could apply, but currently do not. The concept I’m talking about is query planning.

When an application database is satisfying a query, it looks at the query and the available query indexes, then constructs a plan for executing the query. Most query languages are declarative in that they define what output they want, but not how to do that operation. This gives the database engine flexibility in how to best use the given information to satisfy the query.

When Git is satisfying a reachable set difference query, it does the most basic level of query planning. It considers its available query indexes and makes a choice on which to use:

  1. If reachability bitmaps exist, then use the bitmap algorithm.
  2. Otherwise, if pack.useSparse is enabled, then use the sparse algorithm.
  3. If neither previous case holds, then use the frontier algorithm.
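
Written out as code, this “query planner” is just a fixed priority list. The Python function below is a paraphrase of the list above, not a quote of Git’s source; the name choose_algorithm and its boolean parameters are invented for illustration.

```python
def choose_algorithm(has_bitmaps: bool, use_sparse: bool) -> str:
    """Pick a reachable set difference strategy from the available indexes."""
    if has_bitmaps:
        return "bitmap"
    if use_sparse:
        return "sparse"
    return "frontier"


print(choose_algorithm(has_bitmaps=False, use_sparse=True))
```

Nothing about the haves, the wants, or the shape of the commit history enters the decision.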

This is a simple, and possibly unsatisfying, way to do query planning. It takes the available indexes into account, but does not check how well those indexes match with the input data.

What if the reachability bitmaps are stale? We might spend more time in the dynamic bitmap computation than we would in the frontier algorithm.

We can walk commits really quickly with the commit-graph. What if there are only a few commits reachable from the wants but not reachable from the frontier? The sparse algorithm might be more efficient than using reachability bitmaps.

This is an area where we could perform some experiments and create a new, dynamic query planning strategy that chooses the best algorithm based on some heuristics and the shape of the commit history.

Already there is some ability to customize this yourself. You can choose to precompute reachability bitmaps or not. You can modify pack.useSparse to opt out of the sparse algorithm.

A change was merged into the Git project that creates a push.useBitmaps config option so you can compute reachability bitmaps locally but also opt out of using them during git push. Reachability bitmaps are integrated with other parts of Git, so it can be helpful to have them around. However, due to the asymmetry of git fetch and git push, the sparse algorithm can still be faster than the bitmap algorithm. Thus, this config will allow you to have the benefits of precomputed reachability bitmaps while also having fast git push commands. Look forward to this config value being available soon in Git 2.38.0!

Come back tomorrow for the final installment!

Now that we’ve explored all of the different ways Git operates on a repository, we have a better grasp on how Git scales its algorithms with the size of the repository. When application databases grow too quickly, many groups resort to sharding their database. In the next (and final!) part of this blog series, we will consider the different ways to scale a Git repository, including by sharding it into smaller components.

I’ll also be speaking at Git Merge 2022 covering all five parts of this blog series, so I look forward to seeing you there!

[Security Nation] Gordon “Fyodor” Lyon on Nmap, the Open-Source Security Scanner

Post Syndicated from Rapid7 original https://blog.rapid7.com/2022/08/31/security-nation-gordon-fyodor-lyon-on-nmap-the-open-source-security-scanner/

In this episode of Security Nation, Jen and Tod chat with Gordon “Fyodor” Lyon, author of the widely used open-source Nmap Security Scanner. On the doorstep of Nmap’s 25th anniversary, Gordon and our hosts talk about the tool’s impact on asset management, as well as the struggles and triumphs of creating and managing the solution. They even cover a few highlights from Hollywood films where Nmap makes a guest appearance.

Stick around for our Rapid Rundown, where Tod and Jen talk about a recent warning from the FBI that decentralized finance (DeFi) – i.e., cryptocurrency – poses some unique risks, which attackers are already exploiting.

Gordon “Fyodor” Lyon

Gordon “Fyodor” Lyon authored the open-source Nmap Security Scanner in 1997 and continues to coordinate its development. His company also develops and sells Npcap, a raw networking library and driver for Windows. Npcap is now used in hundreds of other software projects, including Wireshark and Microsoft Defender for Identity. Gordon is a founding member of the Honeynet Project and served on the technical advisory boards for Qualys and AlienVault, as well as editorial boards for many conferences and journals. He authored or co-authored the books “Nmap Network Scanning,” “Know Your Enemy: Honeynets,” and “Stealing the Network: How to Own a Continent.” He runs the “Full Disclosure” mailing list, along with popular security resource sites such as SecLists.Org, SecTools.Org, and Insecure.Org.

Show notes

Interview links

  • Check out Nmap if, for some reason, you haven’t already.
  • Learn about Npcap, the packet capture library tool that Gordon and his company also offer.
  • Watch Gordon and HD Moore, the creator of Metasploit, chat about the evolution of network scanning on YouTube.

Rapid Rundown links

Like the show? Want to keep Jen and Tod in the podcasting business? Feel free to rate and review with your favorite podcast purveyor, like Apple Podcasts.

Git’s database internals III: file history queries

Post Syndicated from Derrick Stolee original https://github.blog/2022-08-31-gits-database-internals-iii-file-history-queries/

This week, we are exploring Git’s internals with the following concept in mind:

Git is the distributed database at the core of your engineering system.

Before making a change to a large software system, it can be critical to understand the reasons why the code is in its current form. Looking at commit messages alone is insufficient for this discovery, and instead it is important to find the changes that modified a specific file or certain lines in that file. Git’s file history commands help users find these important points in time where changes were introduced.

Today, let’s dig into these different file history commands and consider them as a set of queries. We will learn how Git optimizes these queries based on the typical structure of file history and how merges work most of the time. Some additional history options may be required to discover what happened in certain special cases, such as using cherry-picks across multiple branches or mistakenly resolving merge conflicts incorrectly. Further, we will see some specialized data structures that accelerate these queries as repositories grow.

git log as file history

The primary way to discover which commits recently changed a file is to use git log -- <path>. This shows commits where their parent has a different Git object at <path>, but there are some subtleties as to which commits are shown, exactly.

One thing to keep in mind with file history queries is that the commit graph structure is still important. It is possible for two changes to happen in parallel and then be connected to the trunk through a merge. To help clarify what is happening with these queries, all examples in this section will assume that the --graph and --oneline options are also specified. The --graph option shows the relationships between commits and in particular will show when two commits are parallel to each other in the history. It also avoids interleaving two parallel sequences of commits that happen to have been created at the same time. I personally recommend that you use --graph whenever using these history walks.

The most important thing to remember is that commits are snapshots, not diffs. For a quick refresher on how we represent Git objects, see the key below.

A key to how different git objects are represented. A green circle represents a commit; a blue triangle represents a tree; and a red box represents a blob.

Git needs to dynamically compute the difference between two commits to see if <path> was changed. This means that Git loads the root trees for those two commits, then compares the object IDs of their tree entries for the first directory of <path>. This comparison is done recursively until equal object IDs are found (no difference) or all parts of <path> are walked and we find the two different objects at <path> for the two commits.

Image of two Git root trees, representing how Git dynamically computes the difference between two commits.

If we find equality during this process, we say that the two commits are treesame on this path.
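To make the recursive comparison concrete, here is a minimal Python sketch. It is not Git's implementation: the object store is a plain dictionary, the object IDs are made-up strings, and the function name treesame is only borrowed from the term above. It shows how the walk can stop as soon as two object IDs match.

```python
# Toy object store (hypothetical IDs): a tree is a dict of entry name -> object ID.
OBJECTS = {
    "root-1": {"README": "blob-readme", "src": "src-1"},
    "root-2": {"README": "blob-readme", "src": "src-2"},
    "src-1": {"main.c": "blob-main-1", "util.c": "blob-util"},
    "src-2": {"main.c": "blob-main-2", "util.c": "blob-util"},
}


def treesame(objects, root_a, root_b, path):
    """Return True if the two root trees hold the same object at `path`."""
    id_a, id_b = root_a, root_b
    for name in path.split("/"):
        if id_a == id_b:
            # Equal object IDs: everything below is identical, so stop early.
            return True
        # "Parse" both trees and follow the entry for this path component.
        tree_a = objects.get(id_a)
        tree_b = objects.get(id_b)
        id_a = tree_a.get(name) if isinstance(tree_a, dict) else None
        id_b = tree_b.get(name) if isinstance(tree_b, dict) else None
    return id_a == id_b


print(treesame(OBJECTS, "root-1", "root-2", "README"))      # True
print(treesame(OBJECTS, "root-1", "root-2", "src/util.c"))  # True
print(treesame(OBJECTS, "root-1", "root-2", "src/main.c"))  # False
```

In this toy model a path that is missing from both commits also counts as treesame, which is an assumption of the sketch.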

For a commit with only one parent, we say that commit is interesting if it is not treesame to that parent. This is a natural idea, since this matches the only meaningful diff we could compute for that commit.

Similarly, a merge commit is considered interesting if it is not treesame to any of its parents; that is, it differs from every parent on that path. The figure below shows a number of interesting commits for a given path based on these treesame relationships.

Figure showing a number of interesting commits for a given path based on these treesame relationships.

In the case of an uninteresting merge commit where there is at least one treesame parent, Git makes different decisions based on the history query type.

Simplified history

By default, git log -- <path> shows the simplified history of <path>. This is defined in the git log documentation, but I’ll present an alternative definition here.

When the simplified history mode encounters a merge commit, it compares the merge commit to each parent in order. If Git finds a treesame parent, then it stops computing diffs at the current merge, marks the merge as uninteresting, and moves on to that parent. If no parent is treesame, then Git marks the merge as interesting and adds all parents to the walk.
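The rule above can be sketched in a few lines of Python. This is a toy model under stated assumptions, not Git's code: each commit is a dictionary with a parents list and a value standing in for the object ID at the path, the commit names are hypothetical, a root commit is treated as interesting whenever the path exists, and the output follows this sketch's traversal order instead of Git's ordering.

```python
# Hypothetical history: "value" stands in for the object ID at <path>.
COMMITS = {
    "A": {"parents": [], "value": "v1"},
    "B": {"parents": ["A"], "value": "v2"},       # trunk change to the path
    "T": {"parents": ["A"], "value": "v1"},       # topic that leaves the path alone
    "G": {"parents": ["T", "B"], "value": "v2"},  # merge that keeps B's change
    "M": {"parents": ["T", "B"], "value": "v1"},  # merge that drops B's change
}


def simplified_history(commits, tip):
    """Return the commits a simplified-history walk from `tip` would show."""
    shown, seen, stack = [], set(), [tip]
    while stack:
        commit_id = stack.pop()
        if commit_id in seen:
            continue
        seen.add(commit_id)
        parents = commits[commit_id]["parents"]
        value = commits[commit_id]["value"]
        if not parents:
            # Assumption: a root commit is interesting if the path exists.
            if value is not None:
                shown.append(commit_id)
            continue
        # Check parents in order and stop at the first treesame one.
        same = next((p for p in parents if commits[p]["value"] == value), None)
        if same is not None:
            stack.append(same)  # uninteresting: follow only that parent
        else:
            shown.append(commit_id)  # interesting: walk every parent
            stack.extend(reversed(parents))
    return shown


print(simplified_history(COMMITS, "G"))  # ['B', 'A']
print(simplified_history(COMMITS, "M"))  # ['A']
```

Starting from M, the walk follows the treesame parent T and never visits B, even though B changed the path.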

For a path that is not changed very often, almost every merge commit will be treesame to its first parent. This allows Git to skip checking all of the commits made reachable by merges that did not “introduce” a change to the trunk. When a topic branch is merged into the trunk, the new merge commit rarely has any merge conflicts, so it will be treesame to its second parent for all the files that were changed in that topic. The merge would then not be treesame to its first parent on any of these paths.

Figure representing how a merge commit is compared to each parent in order to determine whether it should be marked as interesting.

In the case that the merge commit is different from all of its parents on the path, then the merge is marked as interesting and all parents are added to the walk. This happens frequently when the path is a directory in which different sets of files change on each side, but can also happen if the same file is modified by parallel changes and conflicts were resolved during the merge.

Here is an example query where two parallel topics both modified files inside the src/ directory:

$ git log --graph --oneline -- src/
*   80423fa Merge pull request #800 from ...
|\
| * 9313670 build(deps): bump Newtonsoft.Json in /src/shared/Core
* | 47ba58f diagnose: don't await Git exit on config list
|/
* 5637aa9 macos build: use runtime instead of osx-x64
* 7a99cc0 Fixes typo in Mac dist script

Note that the merge commits with a treesame parent are marked as uninteresting, even if they are different from their first parent. This means that the merge commit will not appear in the file history, even if it is responsible for introducing that change into the commit history. You can add the --show-pulls option to git log to make it output the merge commits that are different from their first parent. This can be particularly helpful if you are trying to also track which pull request was involved in that change.

Here is the output for the previous example, except that --show-pulls is added. Notice the additional “Merge pull request…” lines:

$ git log --graph --oneline --show-pulls -- src/
*   80423fa Merge pull request #800 from ...
|\
| * 9313670 build(deps): bump Newtonsoft.Json in /src/shared/Core
* | 77f7922 Merge pull request #804 from ...
* | 47ba58f diagnose: don't await Git exit on config list
|/
* b83bf02 Merge pull request #788 from ...
* 5637aa9 macos build: use runtime instead of osx-x64
* cf5a693 Merge pull request #778 from ...
* 7a99cc0 Fixes typo in Mac dist script

While this logic to skip huge chunks of history may seem confusing at first, it is a critical performance feature. It allows skipping commits that did not contribute to the latest version of the file. This works almost all of the time, but it is important to know some of the reasons why commits that might be interesting would be skipped by the simplified history mode.

Reverted Changes. Sometimes a pull request changes a file in its first version, but review feedback finds a different way to solve the problem without changing the file. The author might remove the changes to that file within their branch, but really has at least two commits editing the file. The end result makes no changes to the file since one commit reverts the previous changes. When that topic is merged, the merge commit is treesame to its first parent on that path and the topic branch is skipped.

Cherry-picks. Some bug fixes are critical to apply in multiple places, such as maintenance branches to solve security issues. If a commit is cherry-picked in multiple places, then it can look like “the same change” is happening in several parallel branches. If those branches eventually merge, they might merge automatically without conflict because all of the tips agree on the file contents. Thus, the simplified history walk will choose only one of these branches to walk and will discover one of the cherry-picks but not the others.

The previous two reasons are common but mostly harmless reasons why a commit could be skipped during simplified history. As someone who has worked on Git for several years, I can attest that the most common reason someone asks “what happened to my change?” is because of the more difficult third reason:

Merge conflict resolution. When resolving a merge, it is possible to make any number of mistakes. In particular, a common case is that someone gets confused and takes all of their changes and drops all changes from the other side of the merge. When this happens, simplified history works against us because Git sees a treesame parent and ignores the other side that had meaningful changes that were dropped by the merge conflict resolution.

These kinds of merge resolution issues are confusing at first glance, but we can use other history modes to discover what happened.

Full history

The --full-history mode differs from the simplified history mode by walking every commit in the history, regardless of treesame parents on merge commits. A merge commit is marked as interesting if there is at least one parent that is different at the path.
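The difference between the two definitions of an interesting commit can be written as a small Python sketch. The function name and its arguments are hypothetical, values stand in for the object IDs at the path, and the handling of a root commit is an assumption of this sketch.

```python
def is_interesting(value, parent_values, full_history=False):
    """Decide whether a commit is interesting for a path.

    `value` is the object ID at the path in the commit and `parent_values`
    holds the object ID at the path in each parent.
    """
    if not parent_values:
        # Assumption: a root commit is interesting if the path exists.
        return value is not None
    differs = [parent != value for parent in parent_values]
    if full_history:
        return any(differs)  # at least one parent differs
    return all(differs)      # no parent is treesame


# A merge that matches its first parent but not its second:
print(is_interesting("v1", ["v1", "v2"]))                     # False
print(is_interesting("v1", ["v1", "v2"], full_history=True))  # True
```

For a commit with a single parent, both modes give the same answer.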

When used with --graph, Git performs parent rewriting to connect the parent links to the next interesting commit reachable from that parent. While the --full-history mode is sure to show all of the possible changes to the path, it is overly noisy. Here is the same repository used in the previous examples, but with --full-history we see many more merge commits:

$ git log --graph --oneline --full-history -- src/
*   5d869d9 Merge pull request #806 from ...
|\
* \   80423fa Merge pull request #800 from ...
|\ \
| |/
|/|
| * 9313670 build(deps): bump Newtonsoft.Json in /src/shared/Core
* |   77f7922 Merge pull request #804 from ...
|\ \
| * | 47ba58f diagnose: don't await Git exit on config list
* | | 162d657 Merge pull request #803 from ...
|/ /
* / 10935fb Merge pull request #700 from ...
|/
*   2d79a03 Merge pull request #797 from ...
|\
* | e209b3d Merge pull request #790 from ...
|/
*   b83bf02 Merge pull request #788 from ...
|\
| * 5637aa9 macos build: use runtime instead of osx-x64

Notice that these new merge commits have a second parent that wraps around and links back into the main history line. This is because that merge brought in a topic branch that did not change the src/ directory, but the first parent of the merge had some changes to the src/ directory relative to the base of the topic branch.

In this way, --full-history will show merges that bring in a topic branch whose history goes “around” meaningful changes. In a large repository, there can be so much of this noise that it is nearly impossible to find the important changes you are looking for.

The next history mode was invented to remove this extra noise.

Full history with simplified merges

In addition to --full-history, you can add the --simplify-merges option. This mode performs extra smoothing on the output of the --full-history mode, specifically dropping merge commits unless they actually are important for showing meaningful changes.

Recall from the --full-history example that some merge commits rewrote the second parent to be along the first-parent history. The --simplify-merges option starts by removing those parent connections and instead showing the merge as having a single parent. Then, Git inspects that commit as if it had a single parent from the beginning. If it is treesame to its only parent then that commit is removed. Git then rewrites any connections to that commit as going to its parent instead. This process continues until all simplifications are made, then the resulting history graph is shown.

$ git log --graph --oneline --full-history --simplify-merges -- src/
*   80423fa Merge pull request #800 from ...
|\
| * 9313670 build(deps): bump Newtonsoft.Json in /src/shared/Core
* | 47ba58f diagnose: don't await Git exit on config list
|/
* 5637aa9 macos build: use runtime instead of osx-x64
* 7a99cc0 Fixes typo in Mac dist script

Notice that this history is exactly the same as the simplified history example for this query. That is intentional: these should be the same results unless there really was an interesting change that was skipped.

If these history modes usually have the same output, then why wouldn’t we always use --full-history --simplify-merges? The reason is performance. Not only does simplified history speed up the query by skipping a large portion of commits, it also allows iterative output. The simplified history can output portions of the history without walking the entire history. By contrast, the --simplify-merges algorithm is defined recursively starting at commits with no parents. Git cannot output a single result until it has walked all reachable commits and computed their diffs on the input path. This can be extremely slow for large repositories.

One common complaint I have heard from Git users is “Git lost my change!” This typically takes the form where a developer knows that they merged in a commit that updated a file, but that change is no longer in the tip of that branch and running git log -- <path> does not show the commit they wrote! File history simplification is working as designed when it skips that commit; the unexpected behavior is caused by a faulty merge commit that someone created. If there is any chance that Git is skipping a commit that you know changed a file, then try to use --full-history with --simplify-merges.

To demonstrate, I took the previous example repository and created a branch that improperly resolved a merge to ignore valid changes that already existed in the trunk. Look carefully at the difference between the two history modes:

$ git log --graph --oneline -- src
* 5637aa9 macos build: use runtime instead of osx-x64
* 7a99cc0 Fixes typo in Mac dist script

$ git log --graph --oneline --full-history --simplify-merges -- src
*   7da271b Update with latest trunk
|\
| *   80423fa Merge pull request #800 from ...
| |\
| | * 9313670 build(deps): bump Newtonsoft.Json in /src/shared/Core
* | | 0b408b0 Resolve merge conflicts
|\| |
| |/
|/|
| * 47ba58f diagnose: don't await Git exit on config list
|/
* 5637aa9 macos build: use runtime instead of osx-x64
* 7a99cc0 Fixes typo in Mac dist script

When the actual history is shown, you can see that I created two “bad” merge commits: 7da271b Update with latest trunk and 0b408b0 Resolve merge conflicts. These both set the src directory equal to their first parents instead of allowing the merge to take the changes from both sides.

This history mode is a good tool to have in your arsenal.

Unfortunately, --full-history with --simplify-merges remains an expensive operation and I do not recommend using it by default. There remains no way to perform merge simplification without exploring the entire commit graph, even with the generation numbers discussed in part II. This remains an open problem, so if you have ideas about how to speed up this operation, then please bring those ideas to the Git developer community! I, for one, will be particularly interested!

Other history queries

Now that we’ve gone deep on the query modes for git log -- <path>, let’s consider a few other file history queries that shift the formula slightly in their own ways.

git log -L

The git log -L option allows specifying a portion of a file instead of an entire file. This helps you focus your history query to a specific function or set of lines. There are two main ways to use it:

  1. git log -L<from>,<to>:<path>: In the file at <path> show any changes in the lines between <from> and <to>.
  2. git log -L:<identifier>:<path>: In the file at <path>, find the code associated with <identifier> and show any changes to those lines. Usually, <identifier> is a function name, but it can also refer to a class or struct.

The -L mode modifies the definition of “treesame” to also consider two versions of the file to be the same if they have the same content at these lines. Importantly, Git tracks how the line numbers of the range shift when the content of those lines stays the same but changes elsewhere in the file add or delete lines before the range. After that definition of treesame is updated, the history walk is the same as in the simplified history mode.
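As a rough illustration of that line tracking, the following Python sketch maps a line range in a commit back to its parent. It uses the standard library's difflib as a stand-in for Git's own diff machinery, and the function name, the 0-based half-open ranges, and the sample file contents are all assumptions made for the example.

```python
import difflib


def map_range_to_parent(parent_lines, child_lines, start, end):
    """Map child lines [start, end) to the parent version of the file.

    Returns the matching (start, end) in the parent when the lines are
    unchanged, or None when the range was modified by this commit.
    """
    matcher = difflib.SequenceMatcher(a=parent_lines, b=child_lines, autojunk=False)
    for tag, a1, a2, b1, b2 in matcher.get_opcodes():
        if tag == "equal" and b1 <= start and end <= b2:
            offset = a1 - b1  # how far the lines moved between versions
            return (start + offset, end + offset)
    return None


parent = ["import os", "", "def f():", "    return 1", "", "def g():", "    return 2"]
child = ["import os", "import sys", "", "def f():", "    return 1", "", "def g():", "    return 3"]

# f() moved down one line but its content is the same: treesame for this range.
print(map_range_to_parent(parent, child, 3, 5))  # (2, 4)
# g() was edited, so the commit is not treesame for this range.
print(map_range_to_parent(parent, child, 6, 8))  # None
```

When the function returns a range, the walk can continue to the parent with the shifted line numbers.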

In this way, the -L mode is more expensive because it needs to compute blob content diffs instead of only comparing object IDs. However, that performance difference can be worthwhile, as it reduces your time spent reading changes to the file that are not important to the section of the file you are reading.

git blame and git annotate

While git log will show all the commits that have changed a given file, the git blame and git annotate commands show the commits that most recently changed each line of the file. The only difference between the commands is the output style.

To compute these most-recent changes, Git tracks each line in a similar way as it does for git log -L, but then drops the line from consideration once it has found a commit that changed that line.
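Here is a minimal Python sketch of that idea for a linear history. It is an illustration under several assumptions: difflib stands in for Git's diff algorithm, the history is a hypothetical list of (commit ID, file lines) pairs ordered newest first, and merges are ignored entirely.

```python
import difflib


def blame(history):
    """Return, for each line of the newest version, the commit blamed for it.

    `history` is a list of (commit_id, lines) pairs, newest first.
    """
    result = [None] * len(history[0][1])
    # Maps a line index in the version being examined to its index in the newest version.
    pending = {i: i for i in range(len(result))}
    for (commit_id, lines), parent in zip(history, history[1:] + [None]):
        if not pending:
            break  # every line has been blamed already
        parent_lines = parent[1] if parent else []
        matcher = difflib.SequenceMatcher(a=parent_lines, b=lines, autojunk=False)
        carried = {}
        for tag, a1, a2, b1, b2 in matcher.get_opcodes():
            if tag != "equal":
                continue
            for k in range(b2 - b1):
                if b1 + k in pending:
                    # Unchanged line: keep tracking it in the parent.
                    carried[a1 + k] = pending.pop(b1 + k)
        # Lines still pending were changed by this commit: drop them from the walk.
        for newest_index in pending.values():
            result[newest_index] = commit_id
        pending = carried
    return result


history = [
    ("c3", ["a", "B", "c", "d"]),
    ("c2", ["a", "b", "c", "d"]),
    ("c1", ["a", "b", "c"]),
]
print(blame(history))  # ['c1', 'c3', 'c1', 'c2']
```

The walk ends early once no lines remain pending, which is why blame on a recently rewritten file can finish quickly.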

Speeding up file history queries

The previous sections detailed the types of file history queries available in Git. These queries are similar to the commit history queries from part II in that it helps to walk the commits more quickly. However, file history queries spend a significant amount of time testing treesame relationships by computing diffs.

Recall from part I that we can navigate to the Git object specified by a path at a given commit by following a sequence of object links:

  • First, the commit has a root tree object ID that points to a tree object. The commit-graph file speeds this up slightly by including the root tree inside the commit-graph file instead of needing to parse the commit object directly.
  • Next, for each directory component in the path, Git parses a tree to find the matching tree entry and discovers the object ID of the next tree in the list.
  • Finally, the last tree entry points to the object ID for the object at the path. This could be a tree or a blob object.

The git log -L and git blame queries go an additional step further by computing a content diff of two blobs. We will not focus on this part right now, because this only happens if the blobs are already different.

Structuring repositories for fast history queries

Git spends most of its time parsing trees to satisfy these file history queries. There are a few different dimensions in the structure of the repository that can affect how much time is spent parsing trees:

  1. Tree depth: The more directories there are between the root and the specified path, the more trees need to be parsed before finding the object ID for that path. For example, Java namespaces are tied to the directory structure of the source files, so the tree depth tends to be high in these repositories.
  2. Adjacent changes: When comparing two commits at a given path, Git can walk both sides of the comparison at the same time. If two tree entries point to the same object ID at any point along the list of trees, then Git can stop parsing trees and determine the commits are treesame at the path. This happens less frequently if the path is in a directory full of other files that are changed often. For example, a README file for a subproject might be rarely changed, but lives next to the code for that project that changes frequently.

If you are making choices to structure your repository, you might notice that these two dimensions are competing with each other. If you try to reduce the tree depth by using wider directory structures, then you will create more frequent adjacent changes. In reality, a middle ground is best between the two extremes of a very wide or very deep repository.

The other way your repository structure can change the performance of file history queries is actually in the commit history itself. Some repositories require a linear history through rebases or squash-merges. These repositories do not gain any performance benefits from the commit-skipping feature of simplified file history. On the other hand, a linear history will have the exact same history output for all of the history modes, so there is no need to use the advanced modes.

Luckily, Git has a feature that can speed up these file history queries regardless of the repository shape.

Changed-path Bloom filters

To speed up file history queries, Git has an optional query index that allows it to skip parsing trees in the vast majority of cases.

The changed-path Bloom filter index stores a data structure called a Bloom filter for every commit. This index is stored in the commit-graph file, so you can compute it yourself using the git commit-graph write --reachable --changed-paths command. Once the changed-path Bloom filters are enabled in your commit-graph, all future writes will update them. This includes the commit-graph writes done by background maintenance enabled by git maintenance start.

A commit’s Bloom filter is a probabilistic set. It stores the information for each path changed between the first parent and that commit. Instead of storing those paths as a list, the Bloom filter uses hash algorithms to flip a set of bits that look random, but are predictable for each input path.

This Bloom filter allows us to ask the question: Did a given path change between the first parent and this commit? The answer can be one of two options:

  • Yes, probably different. In this case, we don’t know for sure that the path is different, so we need to parse trees to compute the diff.
  • No, definitely treesame. In this case, we can trust the filter and continue along the first-parent history without parsing any trees.

The parameters of the Bloom filter are configured in such a way that a random treesame path has a 98% likelihood of being reported as definitely treesame by the filter.

While running git log -- <path>, Git is in simplified history mode and checks the first parent of each commit to see if it is treesame. If the changed-path Bloom filter reports that the commit is treesame, then Git ignores the other parents and moves to the next commit without parsing any trees! If <path> is infrequently changed, then almost all commits will be treesame to their first parents for <path> and the Bloom filters can save 98% of the tree-parsing time!

It is reasonable to consider the overhead of checking the Bloom filters. Fortunately, the filters use hash algorithms in such a way that it is possible to hash the input <path> value into a short list of integers once at the start of the query. The remaining effort is to load the filter from the commit-graph file, reduce those integers modulo the size of the filter, then check if the corresponding bits are set in the filter. In this way, a single key is being tested against multiple filters, which is a bit unusual compared to the typical application of Bloom filters.
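The following Python sketch shows the shape of this lookup. It is not Git's on-disk format: SHA-256 stands in for Git's own hashing, and the class name, the number of hashes, and the bits per entry are illustrative assumptions. The point is that the key is hashed once and then tested against filters of different sizes.

```python
import hashlib

NUM_HASHES = 7  # illustrative parameter for this sketch


def hash_key(path):
    """Hash a path once into NUM_HASHES integers using double hashing."""
    digest = hashlib.sha256(path.encode()).digest()
    h1 = int.from_bytes(digest[:4], "big")
    h2 = int.from_bytes(digest[4:8], "big")
    return [(h1 + i * h2) & 0xFFFFFFFF for i in range(NUM_HASHES)]


class ChangedPathFilter:
    """Toy Bloom filter over the paths changed by one commit."""

    def __init__(self, changed_paths, bits_per_entry=10):
        self.num_bits = max(8, bits_per_entry * len(changed_paths))
        self.bits = 0
        for path in changed_paths:
            for h in hash_key(path):
                self.bits |= 1 << (h % self.num_bits)

    def maybe_changed(self, key):
        """`key` comes from hash_key(). False means definitely treesame."""
        return all((self.bits >> (h % self.num_bits)) & 1 for h in key)


filters = {
    "commit-1": ChangedPathFilter(["src", "src/main.c"]),
    "commit-2": ChangedPathFilter([]),  # this commit changed no paths
}
key = hash_key("src/main.c")  # hashed once for the whole query
for commit_id, commit_filter in filters.items():
    print(commit_id, commit_filter.maybe_changed(key))
```

A path that was added to a filter is always reported as possibly changed. A path that was not added is usually reported as definitely treesame, but can occasionally produce a false positive.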

Git also takes advantage of the directory structure of <path>. For example, if the path is given as A/B/C/d.txt, then any commit that changed this path also changed A, A/B, and A/B/C. All of these strings are stored in the changed-path Bloom filter. Thus, we can reduce the number of false positives by testing all of these paths against each filter. If any of these paths is reported as treesame, then the full path must also be treesame.
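A short Python sketch of this prefix check follows. The function names are hypothetical, and a plain Python set stands in for a commit's filter, with one entry added by hand to play the role of a false positive.

```python
def path_prefixes(path):
    """Return every leading directory of `path`, then the path itself."""
    parts = path.split("/")
    return ["/".join(parts[:i]) for i in range(1, len(parts) + 1)]


def maybe_changed(commit_filter, path):
    """Report a possible change only if every prefix passes the filter.

    `commit_filter` is any object supporting `in` that may give false
    positives but never false negatives.
    """
    return all(prefix in commit_filter for prefix in path_prefixes(path))


print(path_prefixes("A/B/C/d.txt"))  # ['A', 'A/B', 'A/B/C', 'A/B/C/d.txt']

# Stand-in filter for a commit that changed A/B/X; the last entry imitates
# a false positive on the full path.
noisy_filter = {"A", "A/B", "A/B/X", "A/B/C/d.txt"}
print(maybe_changed(noisy_filter, "A/B/C/d.txt"))  # False, because A/B/C is absent
```

Testing the full path alone would have forced a tree parse here, while the prefix A/B/C is enough to rule it out.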

To test the performance of these different modes, I found a deep path in the Linux kernel repository that was infrequently changed, but some adjacent files are frequently changed: drivers/gpu/drm/i915/TODO.txt.

Command                                              No commit-graph   No Bloom filters   Bloom filters
git log -- <path>                                    1.03s             0.67s              0.18s
git log --full-history -- <path>                     17.8s             11.0s              3.81s
git log --full-history --simplify-merges -- <path>   19.7s             13.3s              5.39s

For queries such as git log -L and git blame, the changed-path Bloom filters only prevent that initial treesame check. When there is a difference between two commits, the content-based diff algorithm still needs to do the same amount of work. This means the performance improvements are more modest for these queries.

For this example, I used a path that is changed slightly more frequently than the previous one, but in the same directory: drivers/gpu/drm/i915/Makefile.

Command                    No commit-graph   No Bloom filters   Bloom filters
git blame <path>           1.04s             0.82s              0.21s
git log -L100,110:<path>   9.67s             2.64s              1.38s

These performance gains are valuable for a normal user running Git commands in their terminal, but they are extremely important for Git hosting services such as GitHub that use these same Git history queries to power the web history interface. Computing the changed-path Bloom filters in advance can save thousands of CPU hours due to the frequency that users request this data from that centralized source.

Come back tomorrow for more!

Today, we went even deeper into Git’s internals and how its file history modes act as specialized queries into the commit history. Learning these advanced query types is similar to learning advanced language features of SQL such as different JOIN types. The commit-graph file again operated as a query index to accelerate these history queries.

In the next part of this blog series, we will explore how Git acts as a distributed database. Specifically, we will dig into how git fetch and git push help synchronize remote copies of a repository. The structure of the commit graph will be critical, but the cost of parsing trees will be even more immediate. We’ll talk about how reachability bitmaps can speed up some of these operations, but also explore some reasons why bitmaps are not always used.

I’ll also be speaking at Git Merge 2022 covering all five parts of this blog series, so I look forward to seeing you there!