Tag Archives: open source

Journey to adopt Cloud-Native DevOps platform Series #2: Progressive delivery on Amazon EKS with Flagger and Gloo Edge Ingress Controller

Post Syndicated from Purna Sanyal original https://aws.amazon.com/blogs/devops/journey-to-adopt-cloud-native-devops-platform-series-2-progressive-delivery-on-amazon-eks-with-flagger-and-gloo-edge-ingress-controller/

In the last post of this series, OfferUp modernized DevOps platform with Amazon EKS and Flagger to accelerate time to market, we talked about hypergrowth and the technical challenges encountered by OfferUp in its existing DevOps platform. As a reminder, we presented how OfferUp modernized its DevOps platform with Amazon Elastic Kubernetes Service (Amazon EKS) and Flagger to increase developer velocity, automate and speed up deployments, and achieve a lower cost of ownership.

In this post, we discuss the technical steps to build a DevOps platform that enables progressive deployment of microservices on Amazon EKS. Progressive delivery exposes a new version of the software incrementally to ingress traffic and continuously measures metrics such as request success rate before shifting all traffic to the newer version of the software. Flagger is a Cloud Native Computing Foundation (CNCF) project, part of the graduated Flux family of GitOps tools, that enables progressive canary delivery, along with blue/green and A/B testing, while measuring metrics like HTTP/gRPC request success rate and latency. Flagger shifts and routes traffic between app versions using a service mesh or an ingress controller.

We use the Gloo Edge ingress controller for traffic routing; Prometheus, Datadog, and Amazon CloudWatch for application metrics analysis; and Slack to send notifications. Flagger will post messages to Slack when a deployment has been initialized, when a new revision has been detected, and when the canary analysis fails or succeeds.

Prerequisite steps to build the modern DevOps platform

You need an AWS account and an AWS Identity and Access Management (IAM) user to build the DevOps platform. If you don't have an AWS account with Administrator access, then create one now. Create an IAM user and assign it an administrator role. You can build this platform in any AWS Region; however, I will use the us-west-1 Region throughout this post. You can use a laptop (Mac or Windows) or an Amazon Elastic Compute Cloud (Amazon EC2) instance as a client machine to install all of the necessary software to build the GitOps platform. For this post, I launched an Amazon EC2 instance (with an Amazon Linux 2 AMI) as the client and installed all of the prerequisite software. You need the awscli, git, eksctl, kubectl, and helm applications to build the GitOps platform. Here are the prerequisite steps:

  1. Create a named profile (eks-devops) with the config and credentials files:

aws configure --profile eks-devops

AWS Access Key ID [None]: xxxxxxxxxxxxxxxxxxxxxx

AWS Secret Access Key [None]: xxxxxxxxxxxxxxxxxx

Default region name [None]: us-west-1

Default output format [None]:

View and verify your current IAM profile:

export AWS_PROFILE=eks-devops

aws sts get-caller-identity

  2. If the Amazon EC2 instance doesn't have Git preinstalled, then install Git on the instance:

sudo yum update -y

sudo yum install git -y

Check the Git version:

git version

Clone the repository, which contains the scripts that download the prerequisite software, into the home directory:

git clone https://github.com/aws-samples/aws-gloo-flux.git

  3. Run install.sh to download and install all of the prerequisite software, which includes the AWS CLI, eksctl, kubectl, Helm, and Docker:

cd aws-gloo-flux/eks-flagger/

ls -lt

chmod 700 install.sh ecr-setup.sh

. install.sh

Check the version of the software installed:

aws --version

eksctl version

kubectl version -o json

helm version

docker --version

docker info

If docker info shows an error like "permission denied", then reboot the Amazon EC2 instance or log in to the instance again.

  4. Create an Amazon Elastic Container Registry (Amazon ECR) repository and push the application images.

Amazon ECR is a fully managed container registry that makes it easy for developers to share and deploy container images and artifacts. The ecr-setup.sh script creates a new Amazon ECR repository and pushes the podinfo images (6.0.0, 6.0.1, 6.0.2, 6.1.0, 6.1.5, and 6.1.6) to it. Run the ecr-setup.sh script with two parameters, the ECR repository name (e.g., ps-flagger-repository) and the Region (e.g., us-west-1):

./ecr-setup.sh <ps-flagger-repository> <us-west-1>

You’ll see output like the following (truncated).

###########################################################

Successfully created ECR repository and pushed podinfo images to ECR #

Please note down the ECR repository URI          

xxxxxx.dkr.ecr.us-west-1.amazonaws.com/ps-flagger-repository                                                   

Technical steps to build the modern DevOps platform

This post shows you how to use the Gloo Edge ingress controller and Flagger to automate canary releases for progressive deployment on an Amazon EKS cluster. Flagger requires a Kubernetes cluster v1.16 or newer and Gloo Edge ingress 1.6.0 or newer. The following steps walk through installing the Amazon EKS cluster with a managed node group, the Gloo Edge ingress controller, and Flagger for Gloo. Once the cluster, metrics infrastructure, and Flagger are installed, we install the sample application itself: the standard podinfo application used in the Flagger project and the accompanying loadtester tool. The Flagger "podinfo" backend service is called through Gloo's "VirtualService", the root routing object for the Gloo Gateway; a virtual service describes the set of routes to match for a set of domains. We'll automate the canary promotion of a new "podinfo" image from version 6.0.0 to version 6.0.1, and we'll also create a scenario for automated canary rollback by injecting an error while deploying version 6.0.2.

  1. Use myeks-cluster.yaml to create your Amazon EKS cluster with a managed node group. The myeks-cluster.yaml deployment file sets the cluster name to ps-eks-66, the region to us-west-1, availabilityZones to [us-west-1a, us-west-1b], the Kubernetes version to 1.24, and the node group Amazon EC2 instance type to m5.2xlarge. You can change these values if you want to build the cluster in a different Region or Availability Zones.
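The myeks-cluster.yaml file in the cloned repository is the source of truth; conceptually, it is an eksctl ClusterConfig along the following lines (the node group size shown here is an illustrative assumption, not a value taken from the repository):

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
metadata:
  name: ps-eks-66
  region: us-west-1
  version: "1.24"
availabilityZones: ["us-west-1a", "us-west-1b"]
managedNodeGroups:
  - name: managed-ng
    instanceType: m5.2xlarge
    desiredCapacity: 2   # illustrative; size the node group for your workload
    minSize: 2
    maxSize: 4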

eksctl create cluster -f myeks-cluster.yaml

Check the Amazon EKS Cluster details:

kubectl cluster-info

kubectl version -o json

kubectl get nodes -o wide

kubectl get pods -A -o wide

Deploy the Metrics Server:

kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

kubectl get deployment metrics-server -n kube-system

Update the kubeconfig file to interact with your cluster:

# aws eks update-kubeconfig --name <ekscluster-name> --region <AWS_REGION>

kubectl config view

cat $HOME/.kube/config

  2. Create a namespace "gloo-system" and install Gloo with its Helm chart. Gloo Edge is an Envoy-based, Kubernetes-native ingress controller that facilitates and secures application traffic.

helm repo add gloo https://storage.googleapis.com/solo-public-helm

kubectl create ns gloo-system

helm upgrade -i gloo gloo/gloo --namespace gloo-system

  3. Install Flagger and the Prometheus add-on in the same gloo-system namespace. Flagger is a Cloud Native Computing Foundation project and part of the Flux family of GitOps tools.

helm repo add flagger https://flagger.app

helm upgrade -i flagger flagger/flagger \

--namespace gloo-system \

--set prometheus.install=true \

--set meshProvider=gloo

  4. [Optional] If you're using Datadog as a monitoring tool, then deploy the Datadog agents as a DaemonSet using the Datadog Helm chart. Replace RELEASE_NAME and DATADOG_API_KEY accordingly. If you aren't using Datadog, then skip this step. For this post, we use the open-source Prometheus monitoring tool.

helm repo add datadog https://helm.datadoghq.com

helm repo update

helm install <RELEASE_NAME> \

    --set datadog.apiKey=<DATADOG_API_KEY> datadog/datadog

To integrate the Amazon EKS/Kubernetes cluster with the Datadog dashboard, go to the Datadog console and add the Kubernetes integration.

  5. [Optional] If you're using the Slack communication tool and have admin access, then you can configure Flagger to send alerts to the Slack chat platform by integrating Slack's incoming webhook with Flagger. If you don't have admin access in Slack, then skip this step.

helm upgrade -i flagger flagger/flagger \

--set slack.url=https://hooks.slack.com/services/YOUR/SLACK/WEBHOOK \

--set slack.channel=general \

--set slack.user=flagger \

--set clusterName=<my-cluster>

  6. Create a namespace "apps"; the application and the load testing service will be deployed into this namespace.

kubectl create ns apps

Create a deployment and a horizontal pod autoscaler for the custom application or service for which the canary deployment will be done (a sketch of the autoscaler is shown after the commands below).

kubectl -n apps apply -k app

kubectl get deployment -A

kubectl get hpa -n apps
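For reference, the horizontal pod autoscaler that Flagger will later reference is conceptually similar to the sketch below. The resource names and thresholds follow the standard podinfo example and are assumptions, not the exact manifest from the app kustomization:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: podinfo
  namespace: apps
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  minReplicas: 2
  maxReplicas: 4
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 99   # scale out only under sustained CPU load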

Deploy the load testing service to generate traffic during the canary analysis.

kubectl -n apps apply -k tester

kubectl get deployment -A

kubectl get svc -n apps

  7. Use apps-vs.yaml to create a Gloo virtual service definition that references a route table that will be generated by Flagger; a sketch of such a virtual service is shown below.

kubectl apply -f ./apps-vs.yaml

kubectl get vs -n apps

[Optional] If you have your own domain name, then open apps-vs.yaml in an editor and replace podinfo.example.com with your domain name to run the app on that domain.
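For reference, a Gloo virtual service that delegates routing to the Flagger-generated route table generally has the following shape. This is a sketch based on the standard Flagger-with-Gloo pattern; the apps-vs.yaml file in the repository is authoritative:

apiVersion: gateway.solo.io/v1
kind: VirtualService
metadata:
  name: podinfo
  namespace: apps
spec:
  virtualHost:
    domains:
      - 'podinfo.example.com'
    routes:
      - matchers:
          - prefix: /
        # Delegate routing decisions to the route table that Flagger
        # creates and updates during canary analysis.
        delegateAction:
          ref:
            name: podinfo
            namespace: apps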

  8. Use canary.yaml to create a canary custom resource. Review the service, analysis, and metrics sections of the canary.yaml file.
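Before applying it, it helps to know the general shape of a Flagger canary resource for Gloo. The values below are representative only; the canary.yaml file in the cloned repository is the configuration actually used in this walkthrough:

apiVersion: flagger.app/v1beta1
kind: Canary
metadata:
  name: podinfo
  namespace: apps
spec:
  provider: gloo
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: podinfo
  autoscalerRef:
    apiVersion: autoscaling/v2
    kind: HorizontalPodAutoscaler
    name: podinfo
  service:
    port: 9898            # port exposed by the podinfo container
  analysis:
    interval: 30s         # how often Flagger evaluates the metrics
    threshold: 5          # failed checks before automatic rollback
    maxWeight: 50         # maximum traffic weight routed to the canary
    stepWeight: 5         # traffic increase per interval
    metrics:
      - name: request-success-rate
        thresholdRange:
          min: 99          # minimum success rate, in percent
        interval: 1m
      - name: request-duration
        thresholdRange:
          max: 500         # maximum mean latency, in milliseconds
        interval: 30s
    webhooks:
      - name: load-test
        url: http://flagger-loadtester.apps/
        metadata:
          cmd: "hey -z 1m -q 10 -c 2 -host podinfo.example.com http://gateway-proxy.gloo-system"

The analysis section is what drives promotion and rollback: traffic shifts by stepWeight every interval while the metrics stay within their thresholdRange, and the canary is rolled back once the number of failed checks reaches the threshold.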

kubectl apply -f ./canary.yaml

After a couple of seconds, Flagger will create the canary objects. When the bootstrap finishes, Flagger will set the canary status to “Initialized”.

kubectl -n apps get canary podinfo

NAME      STATUS        WEIGHT   LASTTRANSITIONTIME

podinfo   Initialized   0        2023-xx-xxTxx:xx:xxZ

Gloo automatically creates an ELB. Once the load balancer is provisioned and health checks pass, we can find the sample application at the load balancer’s public address. Note down the ELB’s Public address:

kubectl get svc -n gloo-system --field-selector 'metadata.name==gateway-proxy'   -o=jsonpath='{.items[0].status.loadBalancer.ingress[0].hostname}{"\n"}'

Validate that your application is running; you'll see output with version 6.0.0.

curl <load balancer’s public address> -H "Host:podinfo.example.com"

Trigger progressive deployments and monitor the status

You can trigger a canary deployment by updating the application container image from 6.0.0 to 6.0.1.

kubectl -n apps set image deployment/podinfo podinfod=<ECR URI>:6.0.1

Flagger detects that the deployment revision changed and starts a new rollout.

kubectl -n apps describe canary/podinfo

Monitor all canaries; the promoted status condition can have one of the following values: Initialized, Waiting, Progressing, Promoting, Finalizing, Succeeded, or Failed.

watch kubectl get canaries --all-namespaces

curl <load balancer's public address> -H "Host:podinfo.example.com"

Once the canary promotion is complete, validate your application. You can see that the application version has changed from 6.0.0 to 6.0.1:

{

  "hostname": "podinfo-primary-658c9f9695-4pqbl",

  "version": "6.0.1",

  "revision": "",

  "color": "#34577c",

  "logo": "https://raw.githubusercontent.com/stefanprodan/podinfo/gh-pages/cuddle_clap.gif",

  "message": "greetings from podinfo v6.0.1",

}

[Optional] Open the podinfo application in your laptop browser

Find both of the IP addresses associated with the load balancer.

dig <load balancer's public address>

Open the /etc/hosts file on the laptop and add both of the load balancer's IP addresses to the hosts file.

sudo vi /etc/hosts

<Public IP address of LB Target node> podinfo.example.com

e.g.

xx.xx.xxx.xxx podinfo.example.com

xx.xx.xxx.xxx podinfo.example.com

Type "podinfo.example.com" in your browser, and you'll see the application in a form similar to this:

Figure 1: Greetings from podinfo v6.0.1

Automated rollback

During the canary analysis, you'll generate HTTP 500 errors and high latency to check whether Flagger pauses and rolls back the faulted version. Flagger performs an automatic rollback in the case of failure.

Introduce another canary deployment with podinfo image version 6.0.2 and monitor the status of the canary.

kubectl -n apps set image deployment/podinfo podinfod=<ECR URI>:6.0.2

Generate HTTP 500 errors or high latency from a separate terminal window.

Generate HTTP 500 errors:

watch curl -H 'Host:podinfo.example.com' <load balancer’s public address>/status/500

Generate high latency:

watch curl -H 'Host:podinfo.example.com' <load balancer's public address>/delay/2

When the number of failed checks reaches the canary analysis threshold, the traffic is routed back to the primary, the canary is scaled to zero, and the rollout is marked as failed.

kubectl get canaries --all-namespaces

kubectl -n apps describe canary/podinfo

Cleanup

When you’re done experimenting, you can delete all of the resources created during this series to avoid any additional charges. Let’s walk through deleting all of the resources used.

Delete Flagger resources and apps namespace
kubectl delete canary podinfo -n apps

kubectl delete HorizontalPodAutoscaler podinfo -n apps

kubectl delete deployment podinfo -n apps

helm -n gloo-system delete flagger

helm -n gloo-system delete gloo

kubectl delete namespace apps

Delete Amazon EKS Cluster
After you’ve finished with the cluster and nodes that you created for this tutorial, you should clean up by deleting the cluster and nodes with the following command:

eksctl delete cluster --name <cluster name> --region <region code>

Delete Amazon ECR

aws ecr delete-repository --repository-name ps-flagger-repository  --force

Conclusion

This post explained the process for setting up an Amazon EKS cluster and how to use Flagger for progressive deployments along with Prometheus and the Gloo Edge ingress controller. You can enhance the deployments by integrating Flagger with Slack, Datadog, and webhook notifications for progressive deployments. Amazon EKS removes the undifferentiated heavy lifting of managing and updating the Kubernetes cluster. Managed node groups automate the provisioning and lifecycle management of worker nodes in an Amazon EKS cluster, which greatly simplifies operational activities such as new Kubernetes version deployments.

We encourage you to look into modernizing your DevOps platform from monolithic architecture to microservice-based architecture with Amazon EKS, and leverage Flagger with the right Ingress controller for secured and automated service releases.

Further Reading

Journey to adopt Cloud-Native DevOps platform Series #1: OfferUp modernized DevOps platform with Amazon EKS and Flagger to accelerate time to market

About the authors:

Purna Sanyal

Purna Sanyal is a technology enthusiast and an architect at AWS, helping digital native customers solve their business problems with successful adoption of cloud native architecture. He provides technical thought leadership, architecture guidance, and conducts PoCs to enable customers’ digital transformation. He is also passionate about building innovative solutions around Kubernetes, database, analytics, and machine learning.

AWS Week in Review – January 16, 2023

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/aws-week-in-review-january-16-2023/

Today, we celebrate Martin Luther King Jr. Day in the US to honor the late civil rights leader’s life, legacy, and achievements. In this article, Amazon employees share what MLK Day means to them and how diversity makes us stronger.

Coming back to our AWS Week in Review—it’s been a busy week!

Last Week’s Launches
Here are some launches that got my attention during the previous week:

AWS Local Zones in Perth and Santiago now generally available – AWS Local Zones help you run latency-sensitive applications closer to end users. AWS now has a total of 29 Local Zones; 12 outside of the US (Bangkok, Buenos Aires, Copenhagen, Delhi, Hamburg, Helsinki, Kolkata, Muscat, Perth, Santiago, Taipei, and Warsaw) and 17 in the US. See the full list of available and announced AWS Local Zones and learn how to get started.

AWS Local Zones Locations

AWS Clean Rooms now available in preview – During AWS re:Invent this past November, we announced AWS Clean Rooms, a new analytics service that helps companies across industries easily and securely analyze and collaborate on their combined datasets—without sharing or revealing underlying data. You can now start using AWS Clean Rooms (Preview).

Amazon Kendra updates – Amazon Kendra is an intelligent search service powered by machine learning (ML) that helps you search across different content repositories with built-in connectors. With the new Amazon Kendra Intelligent Ranking for self-managed OpenSearch, you can now improve the quality of your OpenSearch search results using Amazon Kendra’s ML-powered semantic ranking technology.

Amazon Kendra also released an Amazon S3 connector with VPC support to index and search documents from Amazon S3 hosted in your VPC, a new Google Drive Connector to index and search documents from Google Drive, a Microsoft Teams Connector to enable Microsoft Teams messaging search, and a Microsoft Exchange Connector to enable email-messaging search.

Amazon Personalize updates – Amazon Personalize helps you improve customer engagement through personalized product and content recommendations. Using the new Trending-Now recipe, you can now generate recommendations for items that are rapidly becoming more popular with your users. Amazon Personalize now also supports tag-based resource authorization. Tags are labels in the form of key-value pairs that can be attached to individual Amazon Personalize resources to manage resources or allocate costs.

Amazon SageMaker Canvas now delivers up to 3x faster ML model training time – SageMaker Canvas is a visual interface that enables business analysts to generate accurate ML predictions on their own—without having to write a single line of code. The accelerated model training times help you prototype and experiment more rapidly, shortening the time to generate predictions and turn data into valuable insights.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some additional news items and blog posts that you may find interesting:

AWS open-source news and updates – My colleague Ricardo writes this weekly open-source newsletter in which he highlights new open-source projects, tools, and demos from the AWS Community. Read edition #141 here.

ML model hosting best practices in Amazon SageMaker – This seven-part blog series discusses best practices for ML model hosting in SageMaker to help you identify which hosting design pattern meets your needs best. The blog series also covers advanced concepts such as multi-model endpoints (MME), multi-container endpoints (MCE), serial inference pipelines, and model ensembles. Read part one here.

I would also like to recommend this really interesting Amazon Science article about differential privacy for end-to-end speech recognition. The data used to train AI models is protected by differential privacy (DP), which adds noise during training. In this article, Amazon researchers show how ensembles of teacher models can meet DP constraints while reducing error by more than 26 percent relative to standard DP methods.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

#BuildOnLive – Build On AWS Live events are a series of technical streams on twitch.tv/aws that focus on technology topics related to challenges hands-on practitioners face today.

  • Join the Build On Live Weekly show about the cloud, the community, the code, and everything in between, hosted by AWS Developer Advocates. The show streams every Thursday at 09:00 US PT on twitch.tv/aws.
  • Join the new The Big Dev Theory show, co-hosted with AWS partners, discussing various topics such as data and AI, AIOps, integration, and security. The show streams every Tuesday at 08:00 US PT on twitch.tv/aws.

Check the AWS Twitch schedule for all shows.

AWS Community Days – AWS Community Day events are community-led conferences that deliver a peer-to-peer learning experience, providing developers with a venue to acquire AWS knowledge in their preferred way: from one another.

AWS Innovate Data and AI/ML edition – AWS Innovate is a free online event to learn the latest from AWS experts and get step-by-step guidance on using AI/ML to drive fast, efficient, and measurable results.

  • AWS Innovate Data and AI/ML edition for Asia Pacific and Japan is taking place on February 22, 2023. Register here.
  • Registrations for AWS Innovate EMEA (March 9, 2023) and the Americas (March 14, 2023) will open soon. Check the AWS Innovate page for updates.

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Week in Review!

— Antje

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Recog Release v3.0.3

Post Syndicated from Matthew Kienow original https://blog.rapid7.com/2023/01/12/recog-release-v3-0-3-2022-10-20/


Recog Release v3.0.3, which is available now, includes updated fingerprints for Zoho ManageEngine PAM360, Password Manager Pro, and Access Manager Plus; Atlassian Bitbucket Server; and Supervisord Supervisor. It also includes new fingerprints and a number of bug fixes, all of which are detailed below.

Recog is an open source recognition framework used to identify products, operating systems, and hardware by matching network probe data against its extensive fingerprint collection. Support for Recog is part of Rapid7's ongoing commitment to open source initiatives.

Zoho ManageEngine PAM360, Password Manager Pro, and Access Manager Plus

Fingerprints for these three Zoho ManageEngine products were added shortly after Cybersecurity and Infrastructure Security Agency (CISA) added CVE-2022-35405 to their Known Exploited Vulnerabilities (KEV) catalog on September 22nd, 2022. Favicon, HTML title, and HTTP server fingerprints were created for both PAM360 and Password Manager Pro, and favicon and HTML title fingerprints were created for Access Manager Plus. PAM360 version 5500 (and older) and Password Manager Pro version 12100 (and older) are both vulnerable to an unauthenticated remote code execution (RCE) vulnerability, and Access Manager Plus version 4302 (and older) is vulnerable to an authenticated remote code execution (RCE) vulnerability. In addition, Grant Willcox contributed the Metasploit Zoho Password Manager Pro XML-RPC Java Deserialization exploit module which is capable of exploiting the unauthenticated vulnerability via the XML-RPC interface in Password Manager Pro and PAM360 and attaining RCE as the NT AUTHORITY\SYSTEM user.

More recently, on January 4th, 2023, Zoho released details of a SQL injection vulnerability (CVE-2022-47523) in PAM360 version 5800 (and older), Password Manager Pro version 12200 (and older) and Access Manager Plus version 4308 (and older). From a quick analysis of internet scan data there appears to be only about 76 Password Manager Pro and 21 PAM360 instances on the internet.


Atlassian Bitbucket Server

Favicon, HTML title and HTTP cookie fingerprints for the Atlassian Bitbucket server were added shortly after our Emergent Threat Response for CVE-2022-36804 was published on September 20th, 2022 in response to the command injection vulnerability in multiple API endpoints of both Bitbucket Server and Data Center. An adversary with access to either a public repository or read permissions on a private repository can perform remote code execution simply through a malicious HTTP request. Shelby Pace contributed the Metasploit Bitbucket Git Command Injection exploit module which is capable of exploiting the unauthenticated command injection. Bitbucket Server and Data Center versions 7.6 prior to 7.6.17, 7.17 prior to 7.17.10, 7.21 prior to 7.21.4, 8.0 prior to 8.0.3, 8.1 prior to 8.1.3, 8.2 prior to 8.2.2 and 8.3 prior to 8.3.1 are vulnerable. From a quick analysis of internet scan data there appears to be just under a thousand of these exposed on the internet.


Supervisord Supervisor

Favicon and HTML title fingerprints were added for anyone interested in locating unsupervised Supervisor instances on their networks. The web interface for the process control system allows users to restart or stop processes under the software’s control, and even tail the standard output and error streams. There might be some interesting information in those streams! From a quick analysis of internet scan data there appears to be only about 165 instances on the internet.


New fingerprints (23)

Bugs fixed (3)

Get the release

You can get the v3.0.3 Recog Ruby gem from RubyGems and the v3.0.3 Recog content archive from the Recog v3.0.3 release page, and you can get more details on the changes since the last release from GitHub.

Highlights from Git 2.39

Post Syndicated from Taylor Blau original https://github.blog/2022-12-12-highlights-from-git-2-39/

The open source Git project just released Git 2.39, with features and bug fixes from over 86 contributors, 31 of them new. We last caught up with you on the latest in Git back when 2.38 was released.

To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.


If you use Git on the command-line, you have almost certainly used git log to peruse your project’s history. But you may not be as familiar with its cousin, git shortlog.

git shortlog is used to summarize the output produced by git log. For example, many projects (including Git[1]) use git shortlog -ns to produce a list of unique contributors in a release, along with the number of commits they authored, like this:

$ git shortlog -ns v2.38.0.. | head -10
   166  Junio C Hamano
   118  Taylor Blau
   115  Ævar Arnfjörð Bjarmason
    43  Jeff King
    26  Phillip Wood
    21  René Scharfe
    15  Derrick Stolee
    11  Johannes Schindelin
     9  Eric Sunshine
     9  Jeff Hostetler
  [...]

We’ve talked about git shortlog in the past, most recently when 2.29 was released to show off its more flexible --group option, which allows you to group commits by fields other than their author or committer. For example, something like:

$ git shortlog -ns --group=author --group=trailer:co-authored-by

would count each commit to its author as well as any individuals in the Co-authored-by trailer.

This release, git shortlog became even more flexible by learning how to aggregate commits based on arbitrary formatting specifiers, like the ones mentioned in the pretty formats section of Git’s documentation.

One neat use is being able to get a view of how many commits were committed each month during a release cycle. Before, you might have written something like this monstrosity:

$ git log v2.38.0.. --date='format:%Y-%m' --format='%cd' | sort | uniq -c

There, --date='format:%Y-%m' tells Git to output each date field like YYYY-MM, and --format='%cd' tells Git to output only the committer date (using the aforementioned format) when printing each commit. Then, we sort the output, and count the number of unique values.

Now, you can ask Git to do all of that for you, by writing:

$ git shortlog v2.38.0.. --date='format:%Y-%m' --group='%cd' -s
     2  2022-08
    47  2022-09
   405  2022-10
   194  2022-11
     5  2022-12

Where -s tells git shortlog to output a summary where the left-hand column is the number of commits attributed to each unique group (in this case, the year and month combo), and the right-hand column is the identity of each group itself.

Since you can pass any format specifier to the --group option, the flexibility here is limited only by the pretty formats available, and your own creativity.

[source]


Returning readers may remember our discussion on Git’s new object pruning mechanism, cruft packs. In case you’re new around here, no problem: here’s a refresher.

When you want to tell Git to remove unreachable objects (those which can’t be found by walking along the history of any branch or tag), you might run something like:

$ git gc --cruft --prune=5.minutes.ago

That instructs Git to divvy your repository's objects into two packs: one containing reachable objects, and another[2] containing unreachable objects modified within the last five minutes. This makes sure that a git gc process doesn't race with incoming reference updates that might leave the repository in a corrupt state. As those objects continue to age, they will be removed from the repository via subsequent git gc invocations. For (many) more details, see our post, Scaling Git's garbage collection.

Even though the --prune=<date> mechanism of adding a grace period before permanently removing objects from the repository is relatively effective at avoiding corruption in practice, it is not completely fool-proof. And when we do encounter repository corruption, it is useful to have the missing objects close by to allow us to recover a corrupted repository.

In Git 2.39, git repack learned a new option to create an external copy of any objects removed from the repository: --expire-to. When combined with --cruft options like so:

$ git repack --cruft --cruft-expiration=5.minutes.ago -d --expire-to=../backup.git

any unreachable objects which haven’t been modified in the last five minutes are collected together and stored in a packfile that is written to ../backup.git. Then, objects you may be missing after garbage collection are readily available in the pack stored in ../backup.git.

These ideas are identical to the ones described in the “limbo repository” section of our Scaling Git’s garbage collection blog post. At the time of writing that post, those patches were still under review. Thanks to careful feedback from the Git community, the same tools that power GitHub’s own garbage collection are now available to you via Git 2.39.

On a related note, careful readers may have noticed that in order to write a cruft pack, you have to explicitly pass --cruft to both git gc and git repack. This is still the case. But in Git 2.39, users who enable the feature.experimental configuration and are running the bleeding edge of Git will now use cruft packs by default when running git gc.

[source, source]


If you’ve been following along with the gradual introduction of sparse index compatibility in Git commands, this one’s for you.

In previous versions of Git, using git grep --cached (to search through the index instead of the blobs in your working copy) you might have noticed that Git first has to expand your index when using the sparse index feature.

In large repositories where the sparse portion of the repository is significantly smaller than the repository as a whole, this adds a substantial delay before git grep --cached outputs any matches.

Thanks to the work of Google Summer of Code student, Shaoxuan Yuan, this is no longer the case. This can lead to some dramatic performance enhancements: when searching in a location within your sparse cone (for example, git grep --cached $pattern -- 'path/in/sparse/cone'), Git 2.39 outperforms the previous version by nearly 70%.

[source]


This one is a little bit technical, but bear with us, since it ends with a nifty performance optimization that may be coming to a Git server near you.

Before receiving a push, a Git server must first tell the pusher about all of the branches and tags it already knows about. This lets the client omit any objects that it knows the server already has, and results in less data being transferred overall.

Once the server has all of the new objects, it ensures that they are “connected” before entering them into the repository. Generally speaking, this “connectivity check” ensures that none of the new objects mention nonexistent objects; in other words, that the push will not corrupt the repository.

One additional factor worth noting is that some Git servers are configured to avoid advertising certain references. But those references are still used as part of the connectivity check. Taking into account the extra work necessary to incorporate those hidden references into the connectivity check, the additional runtime adds up, especially if there are a large number of hidden references.

In Git 2.39, the connectivity check was enhanced to only consider the references that were advertised, in addition to those that were pushed. In a test repository with nearly 7 million references (only ~3% of which are advertised), the resulting speed-up makes Git 2.39 outperform the previous version by roughly a factor of 4.5.

As your server operators upgrade to the latest version of Git, you should notice an improvement in how fast they are able to process incoming pushes.

[source]


Last but not least, let’s round out our recap of some of the highlights from Git 2.39 with a look at a handful of new security measures.

Git added two new “defense-in-depth” changes in the latest release. First, git apply was updated to refuse to apply patches larger than ~1 GiB in size to avoid potential integer overflows in the apply code. Git was also updated to correctly redact sensitive header information with GIT_TRACE_CURL=1 or GIT_CURL_VERBOSE=1 when using HTTP/2.

If you happen to notice a security vulnerability in Git, you can follow Git’s own documentation on how to responsibly report the issue. Most importantly, if you’ve ever been curious about how Git handles coordinating and disclosing embargoed releases, this release cycle saw a significant effort to codify and write down exactly how Git handles these types of issues.

To read more about Git’s disclosure policy (and learn about how to participate yourself!), you can find more in the repository.

[source, source, source]

The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.39, or any previous version in the Git repository.

Notes


  1. It's true. In fact, the list at the bottom of the release announcement is generated by running git shortlog on the git log --no-merges between the last and current release. Calculating the number of new and existing contributors in each release is also powered by git shortlog.
  2. This is a bit of an oversimplification. In addition to storing the object modification times in an adjacent *.mtimes file, the cruft pack also contains unreachable objects that are reachable from anything modified within the last five minutes, regardless of its age. See the “mitigating object deletion raciness” section for more. 

Journey to adopt Cloud-Native DevOps platform Series #1: OfferUp modernized DevOps platform with Amazon EKS and Flagger to accelerate time to market

Post Syndicated from Purna Sanyal original https://aws.amazon.com/blogs/devops/journey-to-adopt-cloud-native-devops-platform-series-1-offerup-modernized-devops-platform-with-amazon-eks-and-flagger-to-accelerate-time-to-market/

In this two-part series, we discuss the challenges that OfferUp, a digital-native customer, faced in meeting business growth and time-to-market goals. Their journey involved modernizing the existing DevOps platform from a traditional, monolithic virtual machine (VM)-based architecture to a modern containerized architecture, and running cloud-native applications for secured progressive delivery to accelerate time to market. This series will provide strategies, architecture patterns, and technical steps you can adopt to become more agile and innovative, like OfferUp has.

OfferUp engineers were using a homegrown DevOps platform to build and release new services on the marketplace platform. In this first post, we discuss the key challenges encountered by OfferUp engineers with the existing DevOps platform, as well as how OfferUp modernized its DevOps platform with Amazon Elastic Kubernetes Service (Amazon EKS) and Flagger, automating production releases with progressive delivery techniques for faster time-to-market with new products and services. Amazon EKS is a managed container service to run and scale Kubernetes applications in the cloud or on-premises.

Previous DevOps architecture

OfferUp is a leading online and mobile customer to customer (C2C) marketplace where users can both buy and sell goods on the platform. Users can browse and purchase products from a broad range of categories, including furniture, clothing, sports equipment, toys, and many more. As a mobile-first company, OfferUp puts a great deal of emphasis on in-person communication between buyers and sellers.

OfferUp built a homegrown, self-managed DevOps platform. This platform used a set of manual processes and third-party applications that allowed both developers and operations engineers to build and deploy code to a production environment. The DevOps pipeline included topic areas such as source code control, continuous integration/continuous delivery (CI/CD), microservices, as well as development and test methodologies. The following diagram depicts the previous architecture of OfferUp's DevOps platform, which was self-managed on Amazon Elastic Compute Cloud (Amazon EC2).

Figure 1: Previous DevOps architecture of OfferUp

OfferUp used GitHub for code repositories. Once the source code was committed to a code repository, Jenkins pulled the source code from the code repositories on a scheduled or on-demand basis and built Amazon Machine Images (AMIs). The built image was deployed in production by a custom-built deployment tool, Vanaheim, which supports one-box canary deployment and full roll-out deployment strategies. The DevOps engineers used to manually create a deployment job in the Vanaheim portal and then manually monitor the test success rate and service metrics to detect any impact from the deployment. Once the success rate was reached, a full production roll-out was performed from the Vanaheim portal.

Key challenges with previous DevOps pipeline

In 2020, OfferUp experienced significant transaction volume growth on its Marketplace platform with the increase of its user base. With OfferUp’s acquisition of LetGo in 2020, there was a need to build a scalable DevOps platform to support future integration and organic growth. The previous DevOps platform, designed and deployed over seven years ago, had reached the limits of its scalability, and could no longer keep up with the platform’s growth. The previous architecture was expensive to run and had a complex infrastructure that made it difficult to upgrade and add new features.

The following key factors drove the push for modernization:

  • Manual verification was required to check whether the code was correctly deployed to one of the production servers; only if the deployment was right on that one server was it rolled out to the other production servers. Full rollout to production wasn't automated, due to frequent failures requiring manual rollbacks.
  • The previous platform required a longer deployment time (1–2 hours) due to the authoritative batch process, which sometimes caused delays in releasing and testing of new features.
  • The self-managed nature of the Jenkins and Vanaheim clusters was consuming far too much engineering time. Most of the institutional knowledge of this legacy platform was lost over the years and it didn’t align with OfferUp’s philosophy of small DevOps engineering teams. Innovation had stalled partly due to the difficulty of simultaneously upgrading the DevOps platform and releasing new features.

DevOps platform automation with Flagger and Gloo Ingress Controller on Amazon EKS

A key requirement for the next-generation system was that the new architecture would reduce the operational burden on engineering teams, shorten the deployment lifecycle, and lower the total cost of ownership. OfferUp evaluated multiple managed container orchestration platforms for its DevOps platform. It finally selected Amazon EKS for high availability, reducing the average time to deploy a change to the stack from hours to just a few minutes, and reducing the complexity of managing and upgrading the Kubernetes cluster. On the Amazon EKS platform, OfferUp uses Flagger, a progressive delivery tool that automates the release process for applications running on Kubernetes. Flagger implements several deployment strategies (canary releases, A/B testing, and blue/green mirroring) using the Gloo Edge ingress controller for traffic routing. Datadog is used as an observability service for monitoring the health of the deployments and effectively managing the canaries for progressive delivery. For release analysis, Flagger runs a query on Datadog logs and uses Slack for alerting and notifications. The cloud native technology components of the architecture are described as follows:

Kubernetes and Amazon EKS – Kubernetes is an open-source system for automating the deployment, scaling, and management of containerized applications. Kubernetes is a graduated project in the CNCF. Amazon EKS is a fully-managed, certified Kubernetes conformant service that simplifies the process of building, securing, operating, and maintaining Kubernetes clusters on AWS. Amazon EKS integrates with core AWS services, such as Amazon CloudWatch, Auto Scaling Groups, and AWS Identity and Access Management (IAM), to provide a seamless experience for monitoring, scaling, and load balancing your containerized applications.

Helm – Helm manages Kubernetes applications. Helm charts define, install, and upgrade even the most complex Kubernetes applications. Charts are easy to create, version, share, and publish. If Kubernetes were an operating system, then Helm would be the package manager. Helm is a graduated project in the CNCF and is maintained by the Helm community.

Flagger – Flagger is a progressive delivery tool that automates the release process for applications running on Kubernetes. Flagger implements a control loop that gradually shifts traffic to the canary while measuring key performance indicators such as HTTP requests success rate, requests average duration, and pods health. Based on the set thresholds, a canary is either promoted or aborted and its analysis is pushed to a Slack channel. Flagger became a CNCF project – part of the Flux family of GitOps tools.

Gloo Edge – Gloo Edge is a feature-rich, Kubernetes-native ingress controller. Gloo Edge is exceptional in its function-level routing; its support for legacy apps, microservices, and serverless; its discovery capabilities; and its tight integration with leading open-source projects. Gloo Edge is uniquely designed to support hybrid applications, in which multiple technologies, architectures, protocols, and clouds can coexist.

Observability platform – Datadog's integrations with Kubernetes, Docker, and AWS let you track the full range of Amazon EKS metrics, as well as logs and performance data from your cluster and applications. Datadog gives you comprehensive coverage of your dynamic infrastructure and applications with features like autodiscovery to track services across containers, sophisticated graphing, and alerting options.

Modernized DevOps architecture

In the new architecture, OfferUp uses GitHub as its version control tool and GitHub Actions as its CI/CD tool. On every pull request, tests are run, artifacts are built and stored in JFrog Artifactory, and Docker images are stored in Amazon Elastic Container Registry (Amazon ECR). Separate deployment pipelines are triggered based on the environment (dev, staging, or production) of choice. Flagger detects any change in the version of the application and gradually shifts production traffic to the canary. It measures the request success rate and average response duration metrics from Datadog to decide on a full rollout in production. For an application deployment, a canary promotion can be defined using Flagger's custom resource. Flagger rolls back the deployment when the success rate falls below the defined desired success rate metrics.

Figure 2: Modernized DevOps architecture of OfferUp
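As an illustration of how Flagger can consult Datadog during analysis, a Flagger metric template can point at a Datadog query, and the canary analysis can then reference that template by name. The sketch below is hypothetical; the metric query, secret name, and namespace are assumptions rather than OfferUp's actual configuration:

apiVersion: flagger.app/v1beta1
kind: MetricTemplate
metadata:
  name: request-success-rate
  namespace: apps
spec:
  provider:
    type: datadog
    address: https://api.datadoghq.com
    secretRef:
      name: datadog    # holds datadog_api_key and datadog_application_key
  query: |
    100 - (
      sum:trace.http.request.errors{service:{{ target }}}.as_count() /
      sum:trace.http.request.hits{service:{{ target }}}.as_count()
    ) * 100

A canary's analysis section can then reference this template through a templateRef, together with a thresholdRange that defines the minimum acceptable success rate before Flagger rolls the deployment back.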

With the modernized DevOps platform, OfferUp moved from a monolithic to a microservice architecture, where front-end applications and GraphQL run on the Amazon EKS cluster. The production cluster runs 110 services and 650+ pods on 60 nodes. The cluster scales up to 100 nodes with an Amazon EC2 Auto Scaling group based on the traffic pattern. On the networking front, the cluster has a private endpoint and uses both the VPC CNI plugin and the CoreDNS add-on. There are four Amazon EKS clusters, one each for the production, test, utility, and staging environments. OfferUp plans to explore the Karpenter open-source autoscaling project, and it will move new applications to the Amazon EKS cluster, allowing the total node count to scale up to 200.

Benefits of modernized architecture

The new architecture helped OfferUp make automated decisions to deploy new releases and improve the time to market while reducing unplanned production downtime:

  • Faster deployments and quicker rollbacks – The new architecture reduces the service deployment time from one hour down to five minutes, and automates rollbacks, which take five minutes instead of the one hour that a manual rollback required.
  • Automated deployment of new releases – The lack of canary deployment processes in the previous architecture required OfferUp engineers to manually intervene to validate the deployment status, which led to administrative overhead and production outages. The canary deployments take care of the traffic shifting by automatically measuring the requests' success rate and latency metrics from Datadog and subsequently releasing the service to production. Deployments are automatically rolled back when the success rate falls below the defined success rate metric thresholds.
  • Simplified configuration – Configuration has been simplified drastically and integrated within the CI/CD pipeline in the new architecture, thereby reducing configuration complexity, eliminating manual processes, and saving developers time.
  • More time to focus on innovation – With fully automated progressive delivery, the developers no longer need to spend time manually testing and releasing source code in production. Similarly, migrating from a self-managed DevOps platform to the managed Amazon EKS service lowered the DevOps platform's infrastructure management burden on the engineering team. This helps developers spend more time focusing on building and testing new features and innovations.
  • Cost reduction – Moving from the self-managed Amazon EC2-based architecture to the Amazon EKS cluster reduced the cost of operations through shared nodes and improved pod density. The previous architecture used 200 Amazon EC2 instances as nodes; the same workload was moved to a 50-node Amazon EKS cluster. Furthermore, custom applications (Vanaheim and Jenkins) were retired, further reducing the costs.

Conclusion

In this post, you saw how OfferUp embarked on the journey to modernize its DevOps platform to support its growth and its developers' velocity. The key factors that drove the modernization decisions were the ability to scale the platform to support automated testing of features in production, faster release of new features, cost reduction, and facilitating future innovation. The modernized DevOps platform on Amazon EKS also decreased the ongoing operational support burden for engineers, and the scalability of the design opens up a lot of headroom for growth.

We encourage you to look into modernizing your existing CI/CD pipeline on Amazon EKS with the Flagger progressive delivery mechanism. Amazon EKS removes the undifferentiated heavy lifting of managing and updating the Kubernetes cluster. Managed node groups automate the provisioning and lifecycle management of worker nodes in an Amazon EKS cluster, which greatly simplifies operational activities, such as new Kubernetes version deployments.

In the next part of the series, you’ll discover how to implement Flagger and Gloo Edge Ingress Controller on Amazon EKS to automate the release process for applications running on Kubernetes.

Further Reading

Journey to adopt Cloud-Native DevOps platform Series #2: Progressive delivery on Amazon EKS with Flagger and Gloo Edge Ingress Controller

About the authors:

Purna Sanyal

Purna Sanyal is a technology enthusiast and an architect at AWS, helping digital native customers solve their business problems with successful adoption of cloud native architecture. He provides technical thought leadership, architecture guidance, and conducts PoCs to enable customers’ digital transformation. He is also passionate about building innovative solutions around Kubernetes, database, analytics, and machine learning.

Alan Liu

Alan Liu is Sr. Director of Engineering at OfferUp. He is a technology enthusiast who has worked across a wide variety of industries. He is a highly effective, adaptable, and experienced leader with a proven track record.

New – Trusted Language Extensions for PostgreSQL on Amazon Aurora and Amazon RDS

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-trusted-language-extensions-for-postgresql-on-amazon-aurora-and-amazon-rds/

PostgreSQL has become the preferred open-source relational database for many enterprises and start-ups with its extensible design for developers. One of the reasons developers use PostgreSQL is that it allows them to add database functionality by building extensions with their preferred programming languages.

You can already install and use PostgreSQL extensions in Amazon Aurora PostgreSQL-Compatible Edition and Amazon Relational Database Service for PostgreSQL. We support more than 85 PostgreSQL extensions in Amazon Aurora and Amazon RDS, such as the pgAudit extension for logging your database activity. While many workloads use these extensions, we heard our customers asking for flexibility to build and run the extensions of their choosing for their PostgreSQL database instances.

Today, we are announcing the general availability of Trusted Language Extensions for PostgreSQL (pg_tle), a new open-source development kit for building PostgreSQL extensions. With Trusted Language Extensions for PostgreSQL, developers can build high-performance extensions that run safely on PostgreSQL.

Trusted Language Extensions for PostgreSQL provides database administrators control over who can install extensions and a permissions model for running them, letting application developers deliver new functionality as soon as they determine an extension meets their needs.

To start building with Trusted Language Extensions, you can use trusted languages such as JavaScript, Perl, and PL/pgSQL. These trusted languages have safety attributes, including restricting direct access to the file system and preventing unwanted privilege escalations. You can easily install extensions written in a trusted language on Amazon Aurora PostgreSQL-Compatible Edition 14.5 and Amazon RDS for PostgreSQL 14.5 or a newer version.

Trusted Language Extensions for PostgreSQL is an open-source project licensed under Apache License 2.0 on GitHub. You can comment or suggest items on the Trusted Language Extensions for PostgreSQL roadmap and help us support this project across multiple programming languages, and more. Doing this as a community will help us make it easier for developers to use the best parts of PostgreSQL to build extensions.

Let’s explore how we can use Trusted Language Extensions for PostgreSQL to build a new PostgreSQL extension for Amazon Aurora and Amazon RDS.

Setting up Trusted Language Extensions for PostgreSQL
To use pg_tle with Amazon Aurora or Amazon RDS for PostgreSQL, you need to set up a parameter group that loads pg_tle in the PostgreSQL shared_preload_libraries setting. Choose Parameter groups in the left navigation pane in the Amazon RDS console and Create parameter group to make a new parameter group.

Choose Create after you select postgres14 with Amazon RDS for PostgreSQL in the Parameter group family and pg_tle in the Group Name. You can select aurora-postgresql14 for an Amazon Aurora PostgreSQL-Compatible cluster.

Choose the created pgtle parameter group and choose Edit in the Parameter group actions dropdown menu. You can search for shared_preload_libraries in the search box and choose Edit parameter. You can add your preferred values, including pg_tle, and choose Save changes.

You can also do the same job in the AWS Command Line Interface (AWS CLI).

$ aws rds create-db-parameter-group \
  --region us-east-1 \
  --db-parameter-group-name pgtle \
  --db-parameter-group-family aurora-postgresql14 \
  --description "pgtle group"

$ aws rds modify-db-parameter-group \
  --region us-east-1 \
  --db-parameter-group-name pgtle \
  --parameters "ParameterName=shared_preload_libraries,ParameterValue=pg_tle,ApplyMethod=pending-reboot"

Now, you can add the pgtle parameter group to your Amazon Aurora or Amazon RDS for PostgreSQL database. If you have a database instance called testing-pgtle, you can add the pgtle parameter group to the database instance using the command below. Please note that this will cause an active instance to reboot.

$ aws rds modify-db-instance \
  --region us-east-1 \
  --db-instance-identifier testing-pgtle \
  --db-parameter-group-name pgtle \
  --apply-immediately

Verify that the pg_tle library is available on your Amazon Aurora or Amazon RDS for PostgreSQL instance. Run the following command on your PostgreSQL instance:

SHOW shared_preload_libraries;

pg_tle should appear in the output.

Now, create the pg_tle extension in your current database by running the following command:

 CREATE EXTENSION pg_tle;

You can now create and install Trusted Language Extensions for PostgreSQL in your current database. If you create a new extension, you should grant the pgtle_admin role to your primary user (e.g., postgres) with the following command:

GRANT pgtle_admin TO postgres;

Let’s now see how to create our first pg_tle extension!

Building a Trusted Language Extension for PostgreSQL
For this example, we are going to build a pg_tle extension to validate that a user is not setting a password that’s found in a common password dictionary. Many teams have rules around the complexity of passwords, particularly for database users. PostgreSQL allows developers to help enforce password complexity using the check_password_hook.

In this example, you will build a password check hook using PL/pgSQL. In the hook, you can check to see if the user-supplied password is in a dictionary of 10 of the most common password values:

SELECT pgtle.install_extension (
  'my_password_check_rules',
  '1.0',
  'Do not let users use the 10 most commonly used passwords',
$_pgtle_$
  CREATE SCHEMA password_check;
  REVOKE ALL ON SCHEMA password_check FROM PUBLIC;
  GRANT USAGE ON SCHEMA password_check TO PUBLIC;

  CREATE TABLE password_check.bad_passwords (plaintext) AS
  VALUES
    ('123456'),
    ('password'),
    ('12345678'),
    ('qwerty'),
    ('123456789'),
    ('12345'),
    ('1234'),
    ('111111'),
    ('1234567'),
    ('dragon');
  CREATE UNIQUE INDEX ON password_check.bad_passwords (plaintext);

  CREATE FUNCTION password_check.passcheck_hook(username text, password text, password_type pgtle.password_types, valid_until timestamptz, valid_null boolean)
  RETURNS void AS $$
    DECLARE
      invalid bool := false;
    BEGIN
      IF password_type = 'PASSWORD_TYPE_MD5' THEN
        SELECT EXISTS(
          SELECT 1
          FROM password_check.bad_passwords bp
          WHERE ('md5' || md5(bp.plaintext || username)) = password
        ) INTO invalid;
        IF invalid THEN
          RAISE EXCEPTION 'password must not be found on a common password dictionary';
        END IF;
      ELSIF password_type = 'PASSWORD_TYPE_PLAINTEXT' THEN
        SELECT EXISTS(
          SELECT 1
          FROM password_check.bad_passwords bp
          WHERE bp.plaintext = password
        ) INTO invalid;
        IF invalid THEN
          RAISE EXCEPTION 'password must not be found on a common password dictionary';
        END IF;
      END IF;
    END
  $$ LANGUAGE plpgsql SECURITY DEFINER;

  GRANT EXECUTE ON FUNCTION password_check.passcheck_hook TO PUBLIC;

  SELECT pgtle.register_feature('password_check.passcheck_hook', 'passcheck');
$_pgtle_$
);

You need to enable the hook through the pgtle.enable_password_check configuration parameter. On Amazon Aurora and Amazon RDS for PostgreSQL, you can do so with the following command:

$ aws rds modify-db-parameter-group \
    --region us-east-1 \
    --db-parameter-group-name pgtle \
    --parameters "ParameterName=pgtle.enable_password_check,ParameterValue=on,ApplyMethod=immediate"

It may take several minutes for these changes to propagate. You can check that the value is set using the SHOW command:

SHOW pgtle.enable_password_check;

If the value is on, you will see the following output:

 pgtle.enable_password_check
-----------------------------
 on

Now you can create this extension in your current database, try setting a password to one of the dictionary values, and observe how the hook rejects it:

CREATE EXTENSION my_password_check_rules;

CREATE ROLE test_role PASSWORD '123456';
ERROR:  password must not be found on a common password dictionary

CREATE ROLE test_role;
SET SESSION AUTHORIZATION test_role;
SET password_encryption TO 'md5';
\password
-- set to "password"
ERROR:  password must not be found on a common password dictionary
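
If you prefer to exercise the hook from application code rather than psql, a minimal Python sketch along these lines should behave the same way. This is illustrative only: the connection parameters and the app_user_test role name are placeholders, and psycopg2 is just one of several PostgreSQL drivers you could use.

# Minimal sketch: trigger the passcheck hook from application code.
# Connection details and the role name below are placeholders, not values from this post.
import psycopg2

conn = psycopg2.connect(
    host="<your-instance-endpoint>",
    dbname="<your-database>",
    user="postgres",
    password="<your-primary-user-password>",
)
conn.autocommit = True

try:
    with conn.cursor() as cur:
        # Expected to fail: '123456' is in the password_check.bad_passwords dictionary.
        cur.execute("CREATE ROLE app_user_test PASSWORD '123456';")
except psycopg2.Error as err:
    print(f"Rejected as expected: {err}")
finally:
    conn.close()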

To disable the hook, set the value of pgtle.enable_password_check to off:

$ aws rds modify-db-parameter-group \
    --region us-east-1 \
    --db-parameter-group-name pgtle \
    --parameters "ParameterName=pgtle.enable_password_check,ParameterValue=off,ApplyMethod=immediate"

You can uninstall this pg_tle extension from your database and prevent anyone else from running CREATE EXTENSION on my_password_check_rules with the following commands:

DROP EXTENSION my_password_check_rules;
SELECT pgtle.uninstall_extension('my_password_check_rules');

You can find more sample extensions and give them a try. To build and test Trusted Language Extensions in your local PostgreSQL database, you can build from source after cloning the repository.

Join Our Community!
The Trusted Language Extensions for PostgreSQL community is open to everyone. Give it a try, and give us feedback on what you would like to see in future releases. We welcome any contributions, such as new features, example extensions, additional documentation, or any bug reports in GitHub.

To learn more about using Trusted Language Extensions for PostgreSQL in the AWS Cloud, see the Amazon Aurora PostgreSQL-Compatible Edition and Amazon RDS for PostgreSQL documentation.

Give it a try, and please send feedback to AWS re:Post for PostgreSQL or through your usual AWS support contacts.

Channy

New – Amazon Redshift Integration with Apache Spark

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/new-amazon-redshift-integration-with-apache-spark/

Apache Spark is an open-source, distributed processing system commonly used for big data workloads. Spark application developers working in Amazon EMR, Amazon SageMaker, and AWS Glue often use third-party Apache Spark connectors that allow them to read and write the data with Amazon Redshift. These third-party connectors are not regularly maintained, supported, or tested with various versions of Spark for production.

Today we are announcing the general availability of Amazon Redshift integration for Apache Spark, which makes it easy to build and run Spark applications on Amazon Redshift and Redshift Serverless, enabling customers to open up the data warehouse for a broader set of AWS analytics and machine learning (ML) solutions.

With Amazon Redshift integration for Apache Spark, you can get started in seconds and effortlessly build Apache Spark applications in a variety of languages, such as Java, Scala, and Python.

Your applications can read from and write to your Amazon Redshift data warehouse without compromising the performance of the applications or the transactional consistency of the data, and they benefit from performance improvements through pushdown optimizations.

Amazon Redshift integration for Apache Spark builds on an existing open source connector project and enhances it for performance and security, helping customers gain up to 10x faster application performance. We thank the original contributors on the project who collaborated with us to make this happen. As we make further enhancements, we will continue to contribute back to the open source project.

Getting Started with Spark Connector for Amazon Redshift
To get started, you can go to AWS analytics and ML services, use data frame or Spark SQL code in a Spark job or Notebook to connect to the Amazon Redshift data warehouse, and start running queries in seconds.

In this launch, Amazon EMR 6.9, EMR Serverless, and AWS Glue 4.0 come with the pre-packaged connector and JDBC driver, and you can just start writing code. EMR 6.9 provides a sample notebook, and EMR Serverless provides a sample Spark Job too.

First, you should set up AWS Identity and Access Management (AWS IAM) authentication between Redshift and Spark, between Amazon Simple Storage Service (Amazon S3) and Spark, and between Redshift and Amazon S3. The following diagram describes the authentication between Amazon S3, Redshift, the Spark driver, and Spark executors.

For more information, see Identity and access management in Amazon Redshift in the AWS documentation.

Amazon EMR
If you already have an Amazon Redshift data warehouse and the data available, you can create the database user and provide the right level of grants to the database user. To use this with Amazon EMR, you need to upgrade to Amazon EMR 6.9, which includes the packaged spark-redshift connector. Select the emr-6.9.0 release when you create an EMR cluster on Amazon EC2.

You can also use EMR Serverless with the emr-6.9.0 release to create and run your Spark application.

EMR Studio also provides an example Jupyter Notebook configured to connect to an Amazon Redshift Serverless endpoint leveraging sample data that you can use to get started quickly.

Here is a Scala example that builds your application with both the Spark DataFrame and Spark SQL APIs. It uses IAM-based credentials for connecting to Redshift and an IAM role for unloading and loading data from S3.

// Create the JDBC connection URL and define the Redshift connector options.
// tempS3Dir and roleARN are placeholders for an S3 location used for temporary
// data and the ARN of an IAM role that Redshift can assume for UNLOAD/COPY.
import org.apache.spark.sql.functions.col

val jdbcURL = "jdbc:redshift:iam://<RedshiftEndpoint>:<Port>/<Database>?DbUser=<RsUser>"
val rsOptions = Map(
  "url" -> jdbcURL,
  "tempdir" -> tempS3Dir,
  "aws_iam_role" -> roleARN
)
// Reference the sales table from Redshift
val sales_df = spark
  .read
  .format("io.github.spark_redshift_community.spark.redshift")
  .options(rsOptions)
  .option("dbtable", "sales")
  .load()
sales_df.createOrReplaceTempView("sales")
// Reference the date table from Redshift as a DataFrame
val date_df = spark
  .read
  .format("io.github.spark_redshift_community.spark.redshift")
  .options(rsOptions)
  .option("dbtable", "date")
  .load()
// Total quantity sold on a given calendar date using the DataFrame API
sales_df.join(date_df, sales_df("dateid") === date_df("dateid"))
  .where(col("caldate") === "2008-01-05")
  .groupBy().sum("qtysold")
  .select(col("sum(qtysold)"))
  .show()

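For teams working in Python rather than Scala, the same flow can be expressed with PySpark. The following is a minimal sketch, not taken from the original post: the endpoint, S3 temporary directory, and IAM role ARN are placeholders, and it assumes a cluster or notebook that already has the spark-redshift connector available (as on EMR 6.9).

# Minimal PySpark sketch of the same read-and-query flow (illustrative only).
# The endpoint, database, S3 temp directory, and IAM role ARN are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("redshift-spark-example").getOrCreate()

jdbc_url = "jdbc:redshift:iam://<RedshiftEndpoint>:<Port>/<Database>?DbUser=<RsUser>"
temp_s3_dir = "s3://<YourBucket>/temp/"
role_arn = "arn:aws:iam::<AccountId>:role/<RedshiftRole>"

def read_table(table_name):
    # Read a Redshift table into a Spark DataFrame via the spark-redshift connector.
    return (spark.read
            .format("io.github.spark_redshift_community.spark.redshift")
            .option("url", jdbc_url)
            .option("tempdir", temp_s3_dir)
            .option("aws_iam_role", role_arn)
            .option("dbtable", table_name)
            .load())

sales_df = read_table("sales")
date_df = read_table("date")

# Total quantity sold on a given calendar date, mirroring the Scala example above.
(sales_df.join(date_df, "dateid")
         .where(col("caldate") == "2008-01-05")
         .agg({"qtysold": "sum"})
         .show())
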
If Amazon Redshift and Amazon EMR are in different VPCs, you have to configure VPC peering or enable cross-VPC access. Assuming both Amazon Redshift and Amazon EMR are in the same virtual private cloud (VPC), you can create a Spark job or Notebook and connect to the Amazon Redshift data warehouse and write Spark code to use the Amazon Redshift connector.

To learn more, see Use Spark on Amazon Redshift with a connector in the AWS documentation.

AWS Glue
When you use AWS Glue 4.0, the spark-redshift connector is available both as a source and target. In Glue Studio, you can use a visual ETL job to read or write to a Redshift data warehouse simply by selecting a Redshift connection to use within a built-in Redshift source or target node.

The Redshift connection contains Redshift connection details along with the credentials needed to access Redshift with the proper permissions.

To get started, choose Jobs in the left menu of the Glue Studio console. Using either of the Visual modes, you can easily add and edit a source or target node and define a range of transformations on the data without writing any code.

Choose Create, and you can add and edit the source, target, and transform nodes in the job diagram. For this integration, choose Amazon Redshift as both the source and the target.

Once completed, the Glue job can be executed on the Glue Apache Spark engine, which automatically uses the latest spark-redshift connector.

The following Python script shows an example job that reads from and writes to Redshift with a DynamicFrame using the spark-redshift connector.

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

## @params: [JOB_NAME]
args = getResolvedOptions(sys.argv, ['JOB_NAME'])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args['JOB_NAME'], args)

# Placeholders: replace with your Redshift endpoint, table name, S3 temporary
# directory, and the ARN of the IAM role that Redshift uses for UNLOAD/COPY.
url = "jdbc:redshift://<RedshiftEndpoint>:<Port>/dev"
dbtable = "<schema.table>"
redshiftTmpDir = "s3://<YourBucket>/temp/"
aws_iam_role = "arn:aws:iam::<AccountId>:role/<RedshiftRole>"

print("================ DynamicFrame Read ===============")
read_options = {
    "url": url,
    "dbtable": dbtable,
    "redshiftTmpDir": redshiftTmpDir,
    "tempdir": redshiftTmpDir,
    "aws_iam_role": aws_iam_role,
    "autopushdown": "true",
    "include_column_list": "false"
}

redshift_read = glueContext.create_dynamic_frame.from_options(
    connection_type="redshift",
    connection_options=read_options
)

print("================ DynamicFrame Write ===============")

write_options = {
    "url": url,
    "dbtable": dbtable,
    "user": "awsuser",          # example database user; prefer a Glue connection or Secrets Manager
    "password": "Password1",    # example only; do not hard-code credentials in real jobs
    "redshiftTmpDir": redshiftTmpDir,
    "tempdir": redshiftTmpDir,
    "aws_iam_role": aws_iam_role,
    "autopushdown": "true",
    "DbUser": "awsuser"
}

print("================ dyf write result: check redshift table ===============")
redshift_write = glueContext.write_dynamic_frame.from_options(
    frame=redshift_read,
    connection_type="redshift",
    connection_options=write_options
)

job.commit()

When you set up your job details, you can only use the Glue 4.0 – Supports Spark 3.3, Python 3 version for this integration.

To learn more, see Creating ETL jobs with AWS Glue Studio and Using connectors and connections with AWS Glue Studio in the AWS documentation.

Gaining the Best Performance
In the Amazon Redshift integration for Apache Spark, the Spark connector automatically applies predicate and query pushdown to optimize for performance. You can also gain a performance improvement by using the default Parquet format for unloading data with this integration.

As the following sample code shows, the Spark connector will turn the supported function into a SQL query and run the query in Amazon Redshift.

import sqlContext.implicits._

val sample = sqlContext.read
  .format("io.github.spark_redshift_community.spark.redshift")
  .option("url", jdbcURL)
  .option("tempdir", tempS3Dir)
  .option("unload_s3_format", "PARQUET")
  .option("dbtable", "event")
  .load()

// Create temporary views for data frames created earlier so they can be accessed via Spark SQL
sales_df.createOrReplaceTempView("sales")
date_df.createOrReplaceTempView("date")
// Show the total sales on a given date using Spark SQL API
spark.sql(
"""SELECT sum(qtysold)
| FROM sales, date
| WHERE sales.dateid = date.dateid
| AND caldate = '2008-01-05'""".stripMargin).show()

Amazon Redshift integration for Apache Spark adds pushdown capabilities for operations such as sort, aggregate, limit, join, and scalar functions so that only the relevant data is moved from the Redshift data warehouse to the consuming Spark application, thereby improving performance.
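
As a rough illustration of what pushdown means in practice, here is a minimal PySpark sketch, not taken from the original post: the connection values are placeholders and the filter value is hypothetical. With autopushdown enabled, the connector can translate the filter and count below into SQL that runs in Redshift, so only a small result travels back to Spark.

# Illustrative pushdown sketch; connection values are placeholders.
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("redshift-pushdown-example").getOrCreate()

events = (spark.read
          .format("io.github.spark_redshift_community.spark.redshift")
          .option("url", "jdbc:redshift:iam://<RedshiftEndpoint>:<Port>/<Database>?DbUser=<RsUser>")
          .option("tempdir", "s3://<YourBucket>/temp/")
          .option("aws_iam_role", "arn:aws:iam::<AccountId>:role/<RedshiftRole>")
          .option("unload_s3_format", "PARQUET")  # unload intermediate data as Parquet
          .option("autopushdown", "true")         # allow the connector to push work down to Redshift
          .option("dbtable", "event")
          .load())

# The filter and the count are pushdown candidates, so Redshift does the heavy
# lifting and only the aggregated result is returned to the Spark application.
matching_events = events.where(col("eventname") == "<SomeEventName>")  # hypothetical filter value
print(matching_events.count())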

Available Now
The Amazon Redshift integration for Apache Spark is now available in all Regions that support Amazon EMR 6.9, AWS Glue 4.0, and Amazon Redshift. You can start using the feature directly from EMR 6.9 and Glue Studio 4.0 with the new Spark 3.3.0 version.

Give it a try, and please send us feedback either in the AWS re:Post for Amazon Redshift or through your usual AWS support contacts.

Channy

AWS Week in Review – November 21, 2022

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-november-21-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

A new week starts, and the News Blog team is getting ready for AWS re:Invent! Many of us will be there next week and it would be great to meet in person. If you’re coming, do you know about PeerTalk? It’s an onsite networking program for re:Invent attendees available through the AWS Events mobile app (which you can get on Google Play or Apple App Store) to help facilitate connections among the re:Invent community.

If you’re not coming to re:Invent, no worries, you can get a free online pass to watch keynotes and leadership sessions.

Last Week’s Launches
It was a busy week for our service teams! Here are the launches that got my attention:

AWS Region in Spain – The AWS Region in Aragón, Spain, is now open. The official name is Europe (Spain), and the API name is eu-south-2.

Amazon Athena – You can now apply AWS Lake Formation fine-grained access control policies with all table and file formats supported by Amazon Athena to centrally manage permissions and access data catalog resources in your Amazon Simple Storage Service (Amazon S3) data lake. With fine-grained access control, you can restrict access to data in query results using data filters to achieve column-level, row-level, and cell-level security.

Amazon EventBridge – With these additional filtering capabilities, you can now filter events by suffix, ignore case, and match if at least one condition is true. This makes it easier to write complex rules when building event-driven applications.

AWS Controllers for Kubernetes (ACK) – The ACK for Amazon Elastic Compute Cloud (Amazon EC2) is now generally available and lets you provision and manage EC2 networking resources, such as VPCs, security groups and internet gateways using the Kubernetes API. Also, the ACK for Amazon EMR on EKS is now generally available to allow you to declaratively define and manage EMR on EKS resources such as virtual clusters and job runs as Kubernetes custom resources. Learn more about ACK for Amazon EMR on EKS in this blog post.

Amazon HealthLake – New analytics capabilities make it easier to query, visualize, and build machine learning (ML) models. HealthLake now transforms customer data into an analytics-ready format in near real-time so that you can query it and use the results to build visualizations or ML models. Also new is Amazon HealthLake Imaging (preview), a new HIPAA-eligible capability that enables you to easily store, access, and analyze medical images at any scale. More on HealthLake Imaging can be found in this blog post.

Amazon RDS – You can now transfer files between Amazon Relational Database Service (RDS) for Oracle and an Amazon Elastic File System (Amazon EFS) file system. You can use this integration to stage files like Oracle Data Pump export files when you import them. You can also use EFS to share a file system between an application and one or more RDS Oracle DB instances to address specific application needs.

Amazon ECS and Amazon EKS – We added centralized logging support for Windows containers to help you easily process and forward container logs to various AWS and third-party destinations such as Amazon CloudWatch, S3, Amazon Kinesis Data Firehose, Datadog, and Splunk. See these blog posts for how to use this new capability with ECS and with EKS.

AWS SAM CLI – You can now use the Serverless Application Model CLI to locally test and debug an AWS Lambda function defined in a Terraform application. You can see a walkthrough in this blog post.

AWS Lambda – Now supports Node.js 18 as both a managed runtime and a container base image, which you can learn more about in this blog post. Also check out this interesting article on why and how you should use AWS SDK for JavaScript V3 with Node.js 18. And last but not least, there is new tooling support to build and deploy native AOT compiled .NET 7 applications to AWS Lambda. With this tooling, you can enable faster application starts and benefit from reduced costs through the faster initialization times and lower memory consumption of native AOT applications. Learn more in this blog post.

AWS Step Functions – Now supports cross-account access for more than 220 AWS services to process data, automate IT and business processes, and build applications across multiple accounts. Learn more in this blog post.

AWS Fargate – Adds the ability to monitor the utilization of the ephemeral storage attached to an Amazon ECS task. You can track the storage utilization with Amazon CloudWatch Container Insights and ECS Task Metadata endpoint.

AWS Proton – Now has a centralized dashboard for all resources deployed and managed by AWS Proton, which you can learn more about in this blog post. You can now also specify custom commands to provision infrastructure from templates. In this way, you can manage templates defined using the AWS Cloud Development Kit (AWS CDK) and other templating and provisioning tools. More on CDK support and AWS CodeBuild provisioning can be found in this blog post.

AWS IAM – You can now use more than one multi-factor authentication (MFA) device for root account users and IAM users in your AWS accounts. More information is available in this post.

Amazon ElastiCache – You can now use IAM authentication to access Redis clusters. With this new capability, IAM users and roles can be associated with ElastiCache for Redis users to manage their cluster access.

Amazon WorkSpaces – You can now use version 2.0 of the WorkSpaces Streaming Protocol (WSP) host agent that offers significant streaming quality and performance improvements, and you can learn more in this blog post. Also, with Amazon WorkSpaces Multi-Region Resilience, you can implement business continuity solutions that keep users online and productive with less than 30-minute recovery time objective (RTO) in another AWS Region during disruptive events. More on multi-region resilience is available in this post.

Amazon CloudWatch RUM – You can now send custom events (in addition to predefined events) for better troubleshooting and application specific monitoring. In this way, you can monitor specific functions of your application and troubleshoot end user impacting issues unique to the application components.

AWS AppSync – You can now define GraphQL API resolvers using JavaScript. You can also mix functions written in JavaScript and Velocity Template Language (VTL) inside a single pipeline resolver. To simplify local development of resolvers, AppSync released two new NPM libraries and a new API command. More info can be found in this blog post.

AWS SDK for SAP ABAP – This new SDK makes it easier for ABAP developers to modernize and transform SAP-based business processes and connect to AWS services natively using the SAP ABAP language. Learn more in this blog post.

AWS CloudFormation – CloudFormation can now send event notifications via Amazon EventBridge when you create, update, or delete a stack set.

AWS Console – With the new Applications widget on the Console home, you have one-click access to applications in AWS Systems Manager Application Manager and their resources, code, and related data. From Application Manager, you can view the resources that power your application and your costs using AWS Cost Explorer.

AWS Amplify – Expands Flutter support (developer preview) to Web and Desktop for the API, Analytics, and Storage use cases. You can now build cross-platform Flutter apps with Amplify that target iOS, Android, Web, and Desktop (macOS, Windows, Linux) using a single codebase. Learn more on Flutter Web and Desktop support for AWS Amplify in this post. Amplify Hosting now supports fully managed CI/CD deployments and hosting for server-side rendered (SSR) apps built using Next.js 12 and 13. Learn more in this blog post and see how to deploy a NextJS 13 app with the AWS CDK here.

Amazon SQS – With attribute-based access control (ABAC), you can define permissions based on tags attached to users and AWS resources. With this release, you can now use tags to configure access permissions and policies for SQS queues. More details can be found in this blog.

AWS Well-Architected Framework – The latest version of the Data Analytics Lens is now available. The Data Analytics Lens is a collection of design principles, best practices, and prescriptive guidance to help you run analytics on AWS.

AWS Organizations – You can now manage accounts, organizational units (OUs), and policies within your organization using CloudFormation templates.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
A few more things you might have missed:

Introducing our final AWS Heroes of the year – As the end of 2022 approaches, we are recognizing individuals whose enthusiasm for knowledge-sharing has a real impact with the AWS community. Please meet them here!

The Distributed Computing Manifesto – Werner Vogels, VP & CTO at Amazon.com, shared the Distributed Computing Manifesto, a canonical document from the early days of Amazon that transformed the way we built architectures and highlights the challenges faced at the end of the 20th century.

AWS re:Post – To make this community more accessible globally, we expanded the user experience to support five additional languages. You can now interact with AWS re:Post also using Traditional Chinese, Simplified Chinese, French, Japanese, and Korean.

For AWS open-source news and updates, here’s the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more.

Upcoming AWS Events
As usual, there are many opportunities to meet:

AWS re:Invent – Our yearly event is next week from November 28 to December 2. If you can’t be there in person, get your free online pass to watch live the keynotes and the leadership sessions.

AWS Community Days – AWS Community Day events are community-led conferences to share and learn together. Join us in Sri Lanka (on December 6-7), Dubai, UAE (December 10), Pune, India (December 10), and Ahmedabad, India (December 17).

That’s all from me for this week. Next week we’ll focus on re:Invent, and then we’ll take a short break. We’ll be back with the next Week in Review on December 12!

Danilo

AWS Week in Review – November 14, 2022

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/aws-week-in-review-november-14-2022/

It’s now just two weeks to AWS re:Invent in Las Vegas, and the pace is picking up, both here on the News Blog and throughout AWS, as everyone gets ready for the big event! I hope you get the chance to join us; I have shared links and other information at the bottom of this post. First, though, let’s dive straight in to this week’s review of news and announcements from AWS.

Last Week’s Launches
As usual, let’s start with a summary of some launches from the last week that I want to remind you of:

New Switzerland Region – First and foremost, AWS has opened a new Region, this time in Switzerland. Check out Seb’s post here on the News Blog announcing the launch.

New AWS Resource Explorer – if you’ve ever spent time searching for specific resources in your AWS account, especially across Regions, be sure to take a look at the new AWS Resource Explorer, described in this post by Danilo. Once enabled, indexes of the resources in your account are built and maintained (you have control over which resources are indexed). Once the indexes are built, you can issue queries to more quickly arrive at the required resource without jumping between different Regions and service dashboards in the Management Console.

Amazon Lightsail domain registration and DNS autoconfiguration – Amazon Lightsail users can now take advantage of new support for registering domain names with automatic configuration of DNS records. Within the Lightsail console, you’re now able to create and register an Amazon Route 53 domain with just a few clicks.

New models for Amazon SageMaker JumpStart – Two new state-of-the-art models have been released for Amazon SageMaker JumpStart. SageMaker JumpStart provides pretrained, open-source models covering a wide variety of problem types that help you get started with machine learning. The first new model, Bloom, can be used to complete sentences or generate long paragraphs of text in 46 different languages. The second model, Stable Diffusion, generates realistic images from given text. Find out more about the new models in this What’s New post.

Mac instances and macOS Ventura – Amazon Elastic Compute Cloud (Amazon EC2) now has support for running the latest version of macOS, Ventura (13.0), for both EC2 x86 Mac and EC2 M1 Mac instances. These instances enable you to provision and run macOS environments in the AWS Cloud, for developers creating apps for iPhone, iPad, Mac, Apple Watch, Apple TV, and Safari.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Some other news items you may want to explore:

AWS Open Source News and Updates – This blog is published each week, and Installment 135 is now available, highlighting new open-source projects, tools, and demos from the AWS community.

Upcoming AWS Events
AWS re:Invent 2022 – As I noted at the top of this post, we’re now just two weeks away from the event! Join us live in Las Vegas November 28–December 2 for keynotes, opportunities for training and certification, and over 1,500 technical sessions. If you are joining us, be sure to check out the re:Invent 2022 Attendee Guides, each curated by an AWS Hero, AWS industry team, or AWS partner.

If you can’t join us live in Las Vegas, be sure to join us online to watch the keynotes and leadership sessions. My cohosts and I on the AWS on Air show will also be livestreaming daily from the event, chatting with service teams and special guests about all the launches and other announcements. You can find us on Twitch.tv (we’ll be on the front page throughout the event), the AWS channel on LinkedIn Live, Twitter.com/awsonair, and YouTube Live.

And one final update for the event – if you’re a .NET developer, be sure to check out the XNT track in the session catalog to find details on the seven breakouts, three chalk talks, and the workshop we have available for you at the conference!

Check back next Monday for our last week in review before the start of re:Invent!

— Steve

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS.

AWS Week in Review – November 7, 2022

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-week-in-review-november-7-2022/

With three weeks to go until AWS re:Invent opens in Las Vegas, the AWS News Blog Team is hard at work creating blog posts to share the latest launches and previews with you. As usual, we have a strong mix of new services, new features, and a surprise or two.

Last Week’s Launches
Here are some launches that caught my eye last week:

Amazon SNS Data Protection and Masking – After a quick public preview, this cool feature is now generally available. It uses pattern matching, machine learning models, and content policies to help protect data at scale. You can find many different kinds of personally identifiable information (PII) and protected health information (PHI) in message bodies and either block message delivery or mask (de-identify) the sensitive data, all in real-time and on a per-topic basis. To learn more, read the blog post or the message data protection documentation.

Amazon Textract Updates – This service extracts text, handwriting, and data from any document or image. This past week we updated the AnalyzeID function so that it can now extract the machine readable zone (MRZ) on passports issued by the United States, and we added the entire OCR output to the API response. We also updated the machine learning models that power the AnalyzeDocument function, with a focus on single-character boxed forms commonly found on tax and immigration documents. Finally, we updated the AnalyzeExpense function with support for new fields and higher accuracy for existing fields, bringing the total field count to more than 40.

Another Amazon Braket Processor – Our quantum computing service now supports Aquila, a new 256-qubit quantum computer from QuEra that is based on a programmable array of neutral Rubidium atoms. According to the What’s New, Aquila supports the Analog Hamiltonian Simulation (AHS) paradigm, allowing it to solve for the static and dynamic properties of quantum systems composed of many interacting particles.

Amazon S3 on Outposts – This service now lets you use additional S3 Lifecycle rules to optimize capacity management. You can expire objects as they age or are replaced with newer versions, with control at the bucket level, or for subsets defined by prefixes, object tags, or object sizes. There’s more info in the What’s New and in the S3 documentation.

AWS CloudFormation – There were two big updates last week: support for Amazon RDS Multi-AZ deployments with two readable standbys, and better access to detailed information on failed stack instances for operations on CloudFormation StackSets.

Amazon MemoryDB for Redis – You can now use data tiering as a lower-cost way to scale your clusters up to hundreds of terabytes of capacity. This new option uses a combination of instance memory and SSD storage in each cluster node, with all data stored durably in a multi-AZ transaction log. There’s more information in the What’s New and the blog post.

Amazon EC2 – You can now remove launch permissions for Amazon Machine Images (AMIs) that are directly shared with your AWS account.

X in Y – We launched existing AWS services and instance types in additional Regions:

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some additional news items that you may find interesting:

AWS Open Source News and Updates – My colleague Ricardo Sueiras highlights new open source projects, tools, and demos from the AWS Community. Read Installment 134 to see what’s going on!

New Case Study – A new AWS case study describes how Taggle (a company focused on smart water solutions in Australia) created an IoT platform that runs on AWS and uses Amazon Kinesis Data Streams to store & ingest data in real time. Using AWS allowed them to scale to accommodate 80,000 additional sensors that will roll out in 2022.

Upcoming AWS Events
re:Invent 2022AWS re:Invent is just three weeks away! Join us live from November 28th to December 2nd for keynotes, training and certification opportunities, and over 1,500 technical sessions. If you cannot make it to Las Vegas you can also join us online to watch the keynotes and leadership sessions live. Be sure to check out the re:Invent 2022 Attendee Guides, each curated by an AWS Hero, AWS industry team, or AWS partner.

PeerTalk – If you will be attending re:Invent in person and are interested in meeting with me or any of our featured experts, be sure to check out PeerTalk, our new onsite networking program.

That’s all for this week!

Jeff;

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS.

AWS Week in Review – October 31, 2022

Post Syndicated from Antje Barth original https://aws.amazon.com/blogs/aws/aws-week-in-review-october-31-2022/

No tricks, just treats in this weekly roundup of news and announcements. Let’s switch our AWS Management Console into dark mode and dive right into it.

Last Week’s Launches
Here are some launches that got my attention during the previous week:

AWS Local Zones in Hamburg and Warsaw now generally available – AWS Local Zones help you run latency-sensitive applications closer to end users. The AWS Local Zones in Hamburg, Germany, and Warsaw, Poland, are the first Local Zones in Europe. AWS Local Zones are now generally available in 20 metro areas globally, with announced plans to launch 33 additional Local Zones in metro areas around the world. See the full list of available and announced AWS Local Zones, and learn how to get started.

Amazon SageMaker multi-model endpoint (MME) now supports GPU instances – MME is a managed capability of SageMaker Inference that lets you deploy thousands of models on a single endpoint. MMEs can now run multiple models on a GPU core, share GPU instances behind an endpoint across multiple models, and dynamically load and unload models based on the incoming traffic. This can help you reduce costs and achieve better price performance. Learn how to run multiple deep learning models on GPU with Amazon SageMaker multi-model endpoints.

Amazon EC2 now lets you replace the root Amazon EBS volume for a running instance – You can now use the Replace Root Volume feature in Amazon EC2 to replace your instance root volume using an updated AMI, without needing to stop the instance. This makes patching the guest operating system and applications easier, while retaining the instance store data, networking, and IAM configuration. Check out the documentation to learn more.

AWS Fault Injection Simulator now supports network connectivity disruption – AWS Fault Injection Simulator (FIS) is a managed service for running controlled fault injection experiments on AWS. AWS FIS now has a new action type to disrupt network connectivity and validate that your applications are resilient to a total or partial loss of connectivity. To learn more, visit Network Actions in the AWS FIS user guide.

Amazon SageMaker Automatic Model Tuning now supports Grid Search – SageMaker Automatic Model Tuning helps you find the hyperparameter values that result in the best-performing model for a chosen metric. Until now, you could choose between random, Bayesian, and hyperband search strategies. Grid search now lets you cover every combination of the specified hyperparameter values for use cases in which you need reproducible tuning results. Learn how Amazon SageMaker Automatic Model Tuning now supports grid search.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some additional news items that you may find interesting:

Celebrating over 20 years of AI/ML innovation – On October 25, we hosted the AWS AI/ML Innovation Day. Bratin Saha and other leaders in the field shared the great strides we have made in the past and discussed what’s next in the world of ML. You can watch the recording here.

AWS open-source news and updates – My colleague Ricardo Sueiras writes this weekly open-source newsletter in which he highlights new open-source projects, tools, and demos from the AWS Community. Read edition #133 here.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS re:Invent is only 4 weeks away! Join us live in Las Vegas from November 28–December 2 for keynote announcements, training and certification opportunities, access to 1,500+ technical sessions, and much more. Seats are still available to reserve, and walk-ups are available onsite. You can also join us online to watch live keynotes and leadership sessions.

If you are into machine learning like me, check out the ML attendee guide. AWS Machine Learning Hero Vinicius Caridá put together recommended sessions and tips and tricks for building your agenda. We also have attendee guides on additional topics and industries.

On November 2, there is a virtual event for building modern .NET applications on AWS. You can register for free.

On November 11–12, AWS User Groups in India are hosting the AWS Community Day India 2022, with success stories, use cases, and much more from industry leaders. Sign up for free to join this virtual event.

That’s all for this week. Check back next Monday for another Week in Review!

— Antje

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

AWS Week in Review – October 24, 2022

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/aws-week-in-review-october-24-2022/

Last week, we announced plans to launch the AWS Asia Pacific (Bangkok) Region, which will become our third AWS Region in Southeast Asia. This Region will have three Availability Zones and will give AWS customers in Thailand the ability to run workloads and store data that must remain in-country.

In the Works – AWS Region in Thailand
With this big news, AWS announced a 190 billion baht (US$5 billion) investment to drive Thailand’s digital future over the next 15 years. It includes capital expenditures on the construction of data centers, operational expenses related to ongoing utilities and facility costs, and the purchase of goods and services from Regional businesses.

Since we first opened an office in Bangkok in 2015, AWS has launched 10 edge locations in Bangkok for Amazon CloudFront, a highly secure and programmable content delivery network (CDN). In 2020, we launched AWS Outposts, a family of fully managed solutions delivering AWS infrastructure and services to virtually any on-premises or edge location, for a truly consistent hybrid experience in Thailand. This year, we also plan the upcoming launch of an AWS Local Zone in Bangkok, which will enable customers to deliver applications that require single-digit millisecond latency to end users in Thailand.

Photo courtesy of Conor McNamara, Managing Director, ASEAN at AWS

The new AWS Region in Thailand is also part of our broader, multifaceted investment in the country, covering our local team, partners, skills, and the localization of services, including Amazon Transcribe, Amazon Translate, and Amazon Connect.

Many Thailand customers have chosen AWS to run their workloads to accelerate innovation, increase agility, and drive cost savings, such as 2C2P, CP All Plc., Digital Economy Promotion Agency, Energy Response Co. Ltd. (ENRES), PTT Global Public Company Limited (PTT), Siam Cement Group (SCG), Sukhothai Thammathirat Open University, The Stock Exchange of Thailand, Papyrus Studio, and more.

For example, Dr. Werner Vogels, CTO of Amazon.com, introduced the story of Papyrus Studio, a large film studio and one of the first customers in Thailand.

“Customer stories like Papyrus Studio inspire us at AWS. The cloud can allow a small company to rapidly scale and compete globally. It also provides new opportunities to create, innovate, and identify business opportunities that just aren’t possible with conventional infrastructure.”

For more information on how to enable AWS and get support in Thailand, contact our AWS Thailand team.

Last Week’s Launches
My favorite news of last week was the launch of dark mode as a beta feature in the AWS Management Console. In Unified Settings, you can choose between three settings for visual mode: Browser default, Light, and Dark. Browser default applies the default dark or light setting of the browser, Dark applies the new built-in dark mode, and Light maintains the current look and feel of the AWS console. Choose your favorite!

Here are some launches that caught my eye for web, mobile, and IoT application developers:

New AWS Amplify Library for Swift – We announce the general availability of Amplify Library for Swift (previously Amplify iOS). Developers can use Amplify Library for Swift via the Swift Package Manager to build apps for iOS and macOS (currently in beta) platforms with Auth, Storage, Geo, and more features.

The Amplify Library for Swift is open source on GitHub, and we deeply appreciate the feedback we have gotten from the community. To learn more, see Introducing the AWS Amplify Library for Swift in the AWS Front-End Web & Mobile Blog or Amplify Library for Swift documentation.

New Amazon IVS Chat SDKs – Amazon Interactive Video Service (Amazon IVS) now provides SDKs for stream chat with support for web, Android, and iOS. The Amazon IVS stream chat SDKs support common functions for chat room resource management, sending and receiving messages, and managing chat room participants.

Amazon IVS is a managed, live-video streaming service that works with the broadcast SDKs or standard streaming software such as Open Broadcaster Software (OBS). The service provides cross-platform player SDKs for playback of Amazon IVS streams, which you need to make low-latency live video available to any viewer around the world. It also offers a Chat Client Messaging SDK. For more information, see Getting Started with Amazon IVS Chat in the AWS documentation.

New AWS Parameters and Secrets Lambda Extension – This is a new extension for AWS Lambda developers to retrieve parameters from AWS Systems Manager Parameter Store and secrets from AWS Secrets Manager. Lambda function developers can leverage this extension to improve their application performance, as it decreases the latency and the cost of retrieving parameters and secrets.

Previously, you had to initialize either the core library of a service or the entire service SDK inside a Lambda function for retrieving secrets and parameters. Now you can simply use the extension. To learn more, see AWS Systems Manager Parameter Store documentation and AWS Secrets Manager documentation.

New FreeRTOS Long Term Support Version – We announce the second release of FreeRTOS Long Term Support (LTS) – FreeRTOS 202210.00 LTS. FreeRTOS LTS offers a more stable foundation than standard releases as manufacturers deploy and later update devices in the field. This release includes new and upgraded libraries such as AWS IoT Fleet Provisioning, Cellular LTE-M Interface, coreMQTT, and FreeRTOS-Plus-TCP.

All libraries included in this FreeRTOS LTS version will receive security and critical bug fixes until October 2024. With an LTS release, you can continue to maintain your existing FreeRTOS code base and avoid any potential disruptions resulting from FreeRTOS version upgrades. To learn more, see the FreeRTOS announcement.

Here is some news on performance improvement and increasing capacity:

Up to 10X Faster Amazon Aurora Snapshot Exports – Amazon Aurora MySQL-Compatible Edition for MySQL 5.7 and 8.0 now delivers snapshot exports to Amazon S3 that are up to 10x faster. The performance improvement is automatically applied to all types of database snapshot exports, including manual snapshots, automated system snapshots, and snapshots created by the AWS Backup service. For more information, see Exporting DB cluster snapshot data to Amazon S3 in the Amazon Aurora documentation.

3X Increase in Amazon RDS Read Capacity – Amazon Relational Database Service (RDS) for MySQL, MariaDB, and PostgreSQL now supports 15 read replicas per instance, including up to 5 cross-Region read replicas, delivering up to 3x the previous read capacity. For more information, see Working with read replicas in the Amazon RDS documentation.

2X Increase in AWS Snowball Edge Compute Capacity – The AWS Snowball Edge Compute Optimized device now offers double the compute capacity (up to 104 vCPUs), double the memory capacity (up to 416 GB RAM), and is now fully SSD with 28 TB of NVMe storage. The updated device is ideal when you need dense compute resources to run complex workloads such as machine learning inference or video analytics at the rugged, mobile edge such as trucks, aircraft, or ships. You can get started by ordering a Snowball Edge device on the AWS Snow Family console.

2X Increase in Amazon SQS FIFO Default Quota – Amazon Simple Queue Service (SQS) has increased the default quota to 6,000 transactions per second per API action, double the previous 3,000 throughput quota for high throughput mode for FIFO (first in, first out) queues, in all AWS Regions where Amazon SQS FIFO queues are available. For a detailed breakdown of default throughput quotas per Region, see Quotas related to messages in the Amazon SQS documentation.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some other news items that you may find interesting:

22 New or Updated Open Datasets on AWS – We released 22 new or updated datasets, including Amazonia-1 imagery, Bitcoin and Ethereum data, and elevation data over the Arctic and Antarctica. The full list of publicly available datasets is on the Registry of Open Data on AWS and is now also discoverable on AWS Data Exchange.

Sustainability with AWS Partners (ft. AWS On Air) – This episode covers a broad discipline of environmental, social, and governance (ESG) issues across all regions, organization types, and industries. AWS Sustainability & Climate Tech provides a comprehensive portfolio of AWS Partner solutions built on AWS that address climate change events and the United Nation’s Sustainable Development Goals (SDG).

AWS Open Source News and Updates #131 – This newsletter covers the latest open-source projects, such as Amazon EMR Toolkit for VS Code (a VS Code extension that makes it easier to develop Spark jobs on EMR) and AWS CDK for Discourse (sample code that demonstrates how to create a full environment for Discourse). Remember to check out Open Source at AWS and keep up to date with all our activity in open source by following us on @AWSOpen.

Upcoming AWS Events
Check your calendars and sign up for these AWS events:

AWS re:Invent 2022 Attendee Guide – Browse re:Invent 2022 attendee guides, curated by AWS Heroes, AWS industry teams, and AWS Partners. Each guide contains recommended sessions, tips and tricks for building your agenda, and other useful resources. Also, seat reservations for all sessions are now open for all re:Invent attendees. You can still register for AWS re:Invent either offline or online.

AWS AI/ML Innovation Day on October 25 – Join us for this year’s AWS AI/ML Innovation Day, where you’ll hear from Bratin Saha and other leaders in the field about the great strides AI/ML has made in the past and the promises awaiting us in the future.

AWS Container Day at Kubecon 2022 on October 25–28 – Come join us at KubeCon + CloudNativeCon North America 2022, where we’ll be hosting AWS Container Day Featuring Kubernetes on October 25 and educational sessions at our booth on October 26–28. Throughout the event, our sessions focus on security, cost optimization, GitOps/multi-cluster management, hybrid and edge compute, and more.

You can browse all upcoming in-person, and virtual events.

That’s all for this week. Check back next Monday for another Week in Review!

— Channy

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

The Story of Scalar

Post Syndicated from Derrick Stolee original https://github.blog/2022-10-13-the-story-of-scalar/

When you install Git v2.38, you’ll find a new executable tool available called scalar. At its core, Scalar enables the latest and greatest Git features for working with large repositories. By simply switching from git clone to scalar clone, you will have all of Git’s most impactful performance features, such as partial clone, sparse-checkout, background maintenance, and advanced config options neatly configured for your repository. Have you already cloned your repository? Run scalar register in it to get the same features.

Scalar and Git, together at last

Although Scalar is only now making its formal Git debut, this release represents the culmination of a multi-year journey. Today, we will share the story of how Scalar got to this point. We’ll start from what inspired its creation, how it evolved from a prototype carved out of the VFS for Git codebase, and finally how it landed in upstream Git. Each step of the way was guided by a set of development principles that helped us with each challenge and opportunity.

Special thanks to @chrisd8088, @dscho, @jeffhostetler, @jrbriggs, @kyle-rader, @mjcheetham, @ldennington, @prplr, @wilbaker, and all of the other contributors who helped make this happen!

Our development principles

Before we get into specifics about how Scalar was built and eventually rewritten and contributed upstream, we need to first establish some context. We entered the project with certain values that we used to guide our decisions. Here are a few that are particularly important to this story.

Rapid prototyping

Code speaks volumes. We could design an architecture all we want on paper, but when solving problems at scale, we need to have actual code running before we can make a final decision.

Before committing to a decision, we would quickly build a prototype and measure its performance. During this prototyping phase, we would take shortcuts to get to that point of measurement. Then, we’d throw everything we could at the prototype to make sure it was correct and fast.

Based on the prototype, we would commit to doing the careful engineering of building the feature again but with a test strategy, thoughtful architecture, and a plan for delivering it to users.

Incremental changes over complete rewrites

Looking at where we started to where we ended, it might seem like we are proponents of rewriting things from scratch. We intend to demonstrate exactly the opposite: Scalar moved with small incremental changes that solved an immediate need. While making those changes, we also optimized for reducing our technical debt and creating a better architecture, and that resulted in code moving from .NET to C and then from our fork to upstream Git, but each individual movement was relatively small compared to the entire system.

The biggest reason we focused on incremental changes was because of our next value.

Tests are an asset

Making any kind of software change adds risk to a project. That risk is mitigated when we have a large set of battle-hardened tests. With a robust test suite available, we were able to make significant changes to our architecture with confidence.

Work in the open

Other than the earliest prototypes, all changes were reviewed and merged completely in public, either in the microsoft/scalar repository or the microsoft/git repository. Scalar was an open source project from day one, and was never intended to be a project only for internal use. By contrast, VFS for Git was built as a tool for Microsoft’s internal use first, and open sourcing it was a bonus after it reached enough adoption. Not only did we value that transparency during Scalar’s development, but now we have a history of public code changes to talk about here.

Now that we’ve established these values, let’s begin the story of Scalar.

A catalyst forces a pivot

The Virtual FileSystem for Git project (VFS for Git for short—previously “GVFS”) was built specifically to transition the Microsoft Windows OS monorepo to Git. VFS for Git utilizes a virtual filesystem to lazily load files only when a filesystem read occurs. This greatly reduced the amount of work Git needed to do, but required installing the microsoft/git fork as well as the .NET VFS for Git software and using Azure Repos to host the repository.

Initially, the Microsoft Office monorepo was going to onboard to Git using VFS for Git, but they needed cross-platform support, specifically for macOS development. After getting pretty far in a macOS port, Apple deprecated the kernel features that provided the filesystem virtualization that was required for that flow.

We were in luck, however, because we had come to understand a key quality of the Office monorepo: Office has a rigorous dependency system that clearly identifies which files are necessary for a local build. This means that a developer could tell Git’s sparse-checkout feature exactly which files they need instead of dynamically populating the worktree using a virtual filesystem. This also significantly simplifies the software needed to manage their monorepo!

However, there was a problem. The sparse-checkout feature had previously been abandoned as a direction for VFS for Git due to its performance. Git would use a list of patterns to match which paths should be in the worktree and which should be ignored. This pattern matching had an ordering strategy that required iterating through the entire pattern list for every possible path, requiring quadratic time! For one of the larger sparse-checkout definition examples we had, Git would take 40 minutes to evaluate the sparse-checkout patterns.

Sparse-checkout definitions are extremely generic. They include matching on file prefix, but also file suffix, or path substring, and any combination. For our target monorepo, we only needed directory matches. With that limited type of pattern in mind, we added a new mode to Git’s sparse-checkout feature: “cone mode” sparse-checkout. A quick prototype of cone mode sparse-checkout demonstrated that Git could reach similar performance as VFS for Git, especially when paired with the filesystem monitor hook. Our critical performance measurement was the git status command, and we were seeing performance within three or four seconds, which was close to the typical case in VFS for Git.
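
To make that difference concrete, here is a tiny illustrative Python sketch, not Git’s actual implementation or pattern syntax: generic sparse-checkout matching scans the entire pattern list for every path, while cone mode restricts patterns to whole directories, so a few set lookups on a path’s leading directories are enough.

# Illustrative sketch only: not Git's implementation or its pattern syntax.
from fnmatch import fnmatch

def generic_match(path, patterns):
    # Generic sparse-checkout style: every path is checked against every
    # pattern, so the total work grows with paths * patterns.
    return any(fnmatch(path, pattern) for pattern in patterns)

def cone_match(path, recursive_dirs):
    # Cone-mode style: only whole directories are included, so a handful of
    # set lookups on the path's leading directories decide the answer.
    prefix = ""
    for part in path.split("/")[:-1]:
        prefix += part + "/"
        if prefix in recursive_dirs:
            return True
    return False

patterns = ["src/office/*", "*.props", "docs/*/build/*"]  # hypothetical generic patterns
recursive_dirs = {"src/office/"}                          # hypothetical cone-mode directories

print(generic_match("src/office/word/app.cs", patterns))       # True, after scanning the list
print(cone_match("src/office/word/app.cs", recursive_dirs))    # True, via prefix lookups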

This was promising enough to move forward with a full prototype. We decided to make this a separate project from VFS for Git, so it needed its own name: Scalar.

Throw the first one away

Once we had a handle on Git command performance using Git’s sparse-checkout feature, we needed to adapt all of the code that allowed fast clones and fetches to work within that environment. For most Git hosting services, Git’s partial clone feature is the best way to solve for fast clones and fetches. However, Azure Repos has an earlier version that was built for VFS for Git called the GVFS protocol. We needed a way to speak the GVFS protocol to bootstrap clones and to dynamically fetch missing objects during Git commands.

This was our first point of asking, “Should we rewrite, or refactor?” The VFS for Git codebase already had all of the client-side code for speaking the GVFS protocol. Not only that, it also had a large set of end-to-end tests that constructed a complete clone from Azure Repos and then ran thousands of Git commands in that environment to make sure they operated exactly the same as a normal Git clone. Since those tests were a significant asset, we set out to construct the first version of this new project starting with the VFS for Git code.

In this initial prototype, we just wanted to get things working for the end-to-end tests to pass. This process included disabling the virtual filesystem code, but leaving all of the hooks that enabled the GVFS Protocol. We also needed to set up sparse-checkout at clone time before initializing the HEAD reference. This prototype was so rough it still didn’t have the Scalar name: it still operated as if it was the gvfs command-line interface.

Diagram showing that the pre-Scalar prototype mostly deleted code from the GVFS protocol.
The rapid prototyping phase mostly deleted code

The end result wasn’t pretty. We couldn’t hope to ship it since it would break compatibility with previous VFS for Git versions. The tests were cobbled together to make things work, but we had disabled sparse-checkout in the tests since the previous tests assumed that every path could be populated dynamically with the virtual filesystem. However, we got to a point where we could reliably create this new repository setup and measure its success. Since the clones were doing the exact same thing as in VFS for Git, the performance matched exactly. Now, we needed to rebuild it, and do it the right way.

Get to Minimum Viable Product (MVP)

From the success of our initial prototype, we moved on to creating an MVP that we could demo to internal users. Here is where we created the Scalar name, the microsoft/scalar repository, and started doing thorough reviews of all changes.

As a team, we decided it would be best to create a new repository rather than to build the project within the VFS for Git codebase. We did not want to be locked into the architecture of VFS for Git as we moved forward, and we also wanted to take advantage of the commit history for the code in the repository. The first task in creating the new project was renaming all references to the old project.

Diagram detailing that, between the pre-Scalar prototype and the version pushed to microsoft/git, many pieces were renamed.
Cleaning up the prototype and renaming things

Updating tests

The next step we had to do was to make sure that we were sufficiently testing the sparse-checkout environment. Recall that we used the full worktree to get tests passing in the prototype, but now we needed to actually be sure that our sparse-checkout environment would work properly.

For this, we found a minimal set of patterns that would include all of the concrete paths used by the test suite.

Then, we made sure that there were interesting changes happening outside of those patterns that would exercise Git features like git merge or git cherry-pick in interesting ways outside of the sparse-checkout definition.

Finally, we added specific tests that involved changing the sparse-checkout definition to make sure that Git would properly fill in the missing files. In this way, we were able to keep all of the existing tests while also adding new tests that were specific to our environment.
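If you haven’t seen cone mode sparse-checkout before, here is a rough sketch of the commands involved (the directory names are hypothetical):

$ git sparse-checkout init --cone
$ git sparse-checkout set src/app tests/integration

With cone mode enabled, Git populates only the listed directories (plus the files at the root of the repository), which is what keeps the sparse worktree so much smaller than a full checkout.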

Evaluating the MVP

After completing the product changes and test updates, it was time to evaluate the solution. We ran performance numbers to ensure they matched what we saw in our prototype phase. We created local clones to use in daily work to try and catch any lingering bugs.

But it all came down to evaluating the solution with internal users. We demoed Scalar directly with the Office engineering system team and asked pointed questions about whether this would work for them.

In particular, we were worried about the performance of git checkout. In VFS for Git, git checkout is extremely fast because it doesn’t actually do much work. It clears the filesystem of concrete files and replaces them with virtualized files. The cost of populating the filesystem comes later when those files are read by an IDE or a build process. With Scalar, the filesystem is populated within the git checkout process, so that work is now upfront and clear to the user.

By working directly with the engineering system team, we learned that this git checkout performance was not an issue. Since git checkout changes source files, it invalidates the local build. Build times can take hours in this monorepo after taking new changes, so users typically do not use git checkout until the end of the day when they are ready to trigger a long build overnight. For this reason, git checkout was not a critical path for their developers. In fact, there was great interest in being able to know that they could disconnect from the network and still poke around the code without risk of finding a virtual file.

We were good to go with our plan for Scalar. However, the monorepo team needed to build something of their own. They needed a connection between their build system and sparse-checkout. While they built that, we had time to polish Scalar and make it easier to install and use.

Update architecture under stable conditions

With the benefit of a stable test suite and a few months of runway, we were able to take our MVP and rethink the architecture. In particular, we shed some architectural decisions that were critical to how VFS for Git works, but were no longer needed in Scalar.

VFS for Git requires a process running that can handle requests from the filesystem to populate virtualized content on-demand. The existence of this process creates the concept of a “mounted” repository, and even included the commands gvfs mount and gvfs unmount to toggle this state.

Because this process needed to exist, a lot of other things were placed in that process that could be relocated elsewhere in Scalar. We set out to remove the need for this process.

Since we had already removed the virtual filesystem code, there were two remaining pieces that were in the mount process: performing background maintenance and downloading objects via the GVFS protocol.

For background maintenance, we took the fastest approach and moved the scheduled tasks out of the mount process and into the Scalar.Service global singleton process. We had versions of this service for Windows and macOS to handle things like startup operations. Moving the maintenance tasks to this service was quick and easy.

For the object downloads, it was a bigger job. The existing architecture included a read-object hook custom to microsoft/git that was installed by the scalar clone command; that hook communicated with the mount process, which in turn talked to the server and placed the objects in the repository.

For this, we created a tool within microsoft/git to do these missing object queries via the GVFS protocol directly within the Git codebase. This tool lives underneath the code that fills in objects for Git’s partial clone feature. By connecting this tool to partial clone, we could work to improve partial clone while also helping Scalar users at the same time. One major benefit to working within the partial clone framework is that some missing object requests can be batched together into a single request, while the old read-object hook could only ask for one missing object at a time.

Finally, there was nothing important remaining in the mount process, so we deleted it. In addition, we were able to delete the old Git hook.

At this point, we had simplified the architecture to have fewer moving parts and were ready to ship internally.

Diagram showing that removing the mount process simplified Scalar's architecture.
Removing the mount process with the git-gvfs-helper

Upon success, look for low-hanging fruit

Shortly after announcing Scalar to the world, we realized that Scalar could have a larger benefit to the Git ecosystem than just very large monorepos using Azure Repos.

We extended scalar clone to use Git’s partial clone if the remote did not speak the GVFS protocol. In this way, scalar clone became something a user could run against any Git remote.
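For remotes that don’t speak the GVFS protocol, the underlying mechanism is a blobless partial clone, which you could also request from plain Git with something like this (the URL is a placeholder):

$ git clone --filter=blob:none https://example.com/big/repo.git

Roughly speaking, scalar clone performs a clone like this and then layers its recommended configuration on top.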

This was an inflection point in our lifecycle: we had accomplished what we set out to do, but wanted to put these tools in front of more people and find a wider audience. We started to shift our focus from making updates in the .NET project and instead contributing features to the upstream Git project.

Rethink architecture as conditions change

Up until this point, we were using the existing hook approach that speaks to a third-party filesystem monitor. This meant that we needed to install that third-party tool next to Scalar, but also scalar clone would install the hook in addition to all of its other operations. We realized that we could solve our installation complexities, reduce the complexity of scalar clone, and get faster performance if the filesystem monitor was built into Git. With that context, we began building Git’s builtin filesystem monitor. We took early versions into microsoft/git while it was reviewed carefully by the Git community.

Diagram showing early adoption of the builtin FS Monitor.
Early adoption of builtin FS Monitor

An important Scalar feature was background maintenance, which was accomplished by a service running in the background and launching Git commands at certain intervals to keep data fresh and well-organized. This service existed from the VFS for Git days, so it was easy to keep using it on Windows and macOS. However, when the Office team told us that they needed Linux clients to support some of their web developers, we focused on porting Scalar to Linux. This service was one platform-specific part that would be difficult to implement in .NET.

We decided that instead of creating a new service in Scalar, it would be better to implement background maintenance in Git. Once Git had its own cross-platform way of doing maintenance, Scalar could stop using its custom logic and instead rely on git maintenance run.
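As a rough sketch, the maintenance that Scalar previously scheduled itself can now be driven entirely from Git:

$ git maintenance start
$ git maintenance run --task=prefetch --task=commit-graph

The first command registers the repository for background maintenance with the platform’s scheduler; the second runs selected tasks on demand.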

We then removed the service from Scalar.

Diagram showing that removing background maintenance from Scalar left only the CLI and tests.
Background maintenance leaves us with only the CLI and tests

After making this change, we took another look at our architecture and realized something. Suddenly, Scalar was only a command-line interface on top of Git. Why have it be in C#, separate from the Git source code?

The overhead of maintaining Scalar as a separate .NET tool was colliding with the cost of creating releases and shipping it to users. If Office developers require the microsoft/git fork of Git and another tool on top of it, then things get tricky when we want to release a new version.

We had replaced so many features in the Scalar codebase with Git functionality that starting from a clean slate could allow us to build a more manageable architecture than that of the existing code. Also, by inserting the Scalar CLI into the Git codebase, we could take advantage of internal functions such as using Git config APIs instead of running git config processes to set recommended config values.

With these goals in mind, we ported the Scalar CLI to C in microsoft/git using less than 3,000 lines of code!

The scale of this endeavor to recreate the Scalar CLI in the microsoft/git codebase is best appreciated by noting that, when we removed all of the product code, we deleted more than 10 times as much code from microsoft/scalar as we added to microsoft/git. We kept the microsoft/scalar repository around as a collection of tests, allowing us to be confident in the new code.

Diagram showing that once the CLI was ported to microsoft/git, only the tests were left behind.
Porting the CLI to microsoft/git leaves only the tests

This was our biggest step in the journey because it involved the largest rewrite of Scalar code. However, the requirements of the Scalar CLI at this point were well-defined and greatly simplified from earlier. We were able to immediately celebrate by no longer shipping the .NET Scalar application to our internal customers and instead rely on just shipping the microsoft/git fork.

There was one downside to this change, though. Before, you could install the .NET Scalar solution on top of any Git version and still get all the benefits of scalar clone. Now, users needed to replace their Git client with microsoft/git in order to get the latest Scalar version. We wanted to make Scalar useful to everyone, not just those that were willing to install our fork.

The journey into core Git

Porting Scalar to C not only enabled hosting the tool in microsoft/git, it opened up the possibility of making Scalar part of the upstream Git project. Although it wouldn’t be the first feature originating in microsoft/git that was contributed upstream, there was no clear precedent for something like Scalar: a standalone executable whose name didn’t start with git in the Git project. That might sound like nothing more than an implementation detail, but it represented a philosophical departure from the existing tools in Git. This divergence would drive us to define what Scalar meant for Git.

contrib/-uting to Git

From the outset, we knew there was a contingent of Git users that would benefit from Scalar beyond microsoft/git‘s typical user base. Features like the filesystem monitor, background maintenance, cone mode sparse-checkout, etc. had all become popular among developers in large repositories. Scalar exposed those and a multitude of other features more readily to users. Still, it wasn’t clear that Scalar as a standalone executable was the best—or Git-friendliest—way to present those features.

To gradually introduce the tool to the Git community, Scalar’s journey upstream began in Git’s contrib/ directory. From the contrib/ README:

Although these pieces are available as part of the official git
source tree, they are in somewhat different status.  The
intention is to keep interesting tools around git here, maybe
even experimental ones, to give users an easier access to them,
and to give tools wider exposure, so that they can be improved
faster.

Despite the loose requirements of contrib/, the submitted version of Scalar still required some changes from what was in microsoft/git. First was removing the GVFS protocol-supported clones. As we mentioned earlier, blobless clones were introduced into Scalar as a fallback for clones using the GVFS protocol, so the upstream version defaulted to using blobless partial clones instead. Additionally, to preserve the separation between contrib/ and the main Git repository, the GitHub Actions workflow was also stripped of references to Scalar, including execution of the microsoft/scalar test suite.

However, being in contrib/ did have some drawbacks. In order to build and install Scalar, a user needed to not only build Git from source, but also know to navigate into contrib/scalar/ and build that as well. The separate build and test process also left it prone to being broken unintentionally by changes in the rest of Git. Even with these challenges, this arrangement was exactly what Scalar needed while its features were built out and its long-term plan was developed. As we drew closer to finishing those features, we needed to finally answer the question: what should we do with Scalar?

Home sweet home

As soon as the possibility of upstreaming Scalar materialized, there were lots of ideas about what its final form would look like. One popular idea—which can be found in the original RFC—was to dissolve Scalar into a collection of new git commands and options to existing commands. Another was to have scalar reside in the Git tree in a dedicated subdirectory, like gitk. Another was to reimagine it as a Git built-in command: something like git scalar. Along with these implementation decisions came overarching questions of maintenance and relevance to Git.

As the tool was nearing feature completion upstream and the downsides of contrib/ isolation were weighing on the project, we took a step back and revisited the questions of Scalar’s identity. The result was a proposal to update Scalar’s documentation and outline a three-part approach to making the tool generally available in Git:

  1. Add any remaining large repo performance features to Scalar.
  2. Extract the parts of Scalar that are generally applicable to all Git users into built-in commands and/or options.
  3. Move Scalar into the root tree of Git, built and installed as a standalone executable alongside git.

The crux of this approach was a new framing of Scalar within the Git project. Scalar began, like VFS for Git before it, as a tool with its own features and opinions: how to configure a repository, what workflows to use, etc. As it evolved, those features and opinions were folded into Git or adjusted to align better with the upstream project, leaving Scalar with only the parts that fit the very specific role of configuring large repositories. In essence, Git had a user experience niche left by its myriad of large repo-focused performance features. Scalar filled that niche.

The roadmap to Scalar’s completion emerged from this philosophy. First, a few more particularly impactful features would be added to it (namely, the built-in FSMonitor). Then, because Scalar’s purpose is to configure features for large repositories that aren’t set up by default in Git, the parts that serve all Git users (such as repository diagnostics in scalar diagnose) would be extracted into new or existing Git commands. Finally, Scalar would be moved out of contrib/ and into the main build of the repository, intended to continue existing as a dedicated tool for managing large Git repositories.

The best laid plans often go awry but, fortunately, this one didn’t. Over the course of three upstream patch series, Scalar was streamlined inside of contrib/, then moved into its new home as part of core Git. And just in time for the v2.38.0 release!

Diagram showing that the Scalar project was contributed to git/git.
Scalar now lives in the core git/git project

The past, present, and future of Scalar

We’ve shared the story of Scalar not only to publicize a new and exciting feature in Git (seriously, go try it!), but also to illustrate one of the many paths an open source project can take to reach its users. Planning and re-planning, designing and redesigning, and no shortage of engineering lessons were all necessary steps to make Scalar the powerful tool it is today.

It is now a fully-integrated part of Git, but Scalar’s journey is far from over. Scalability and performance in Git is a hot topic—our own engineering blog is a testament to that—and consistent improvement in that area will undoubtedly be part of Scalar’s future. Today, though, Scalar’s eventful history is what has shaped it into the best way to unlock Git’s full potential on your largest repositories.

AWS Week in Review – October 3, 2022

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-october-3-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

A new week and a new month just started. Curious about the most significant AWS news from the previous seven days? I’ve got you covered with this post.

Last Week’s Launches
Here are the launches that got my attention last week:

Amazon File Cache – A high-performance cache on AWS that accelerates and simplifies demanding cloud-bursting and hybrid workflows by giving access to files through a fast and familiar POSIX interface, whether the original files live on premises on any file system that can be accessed through NFS v3, or on Amazon S3.

Amazon Data Lifecycle Manager – You can now automatically archive Amazon EBS snapshots to save up to 75 percent on storage costs for those EBS snapshots that you intend to retain for more than 90 days and rarely access.

AWS App Runner – You can now build and run web applications and APIs from source code using the new Node.js 16 managed runtime.

AWS Copilot – The CLI for containerized apps adds IAM permission boundaries, support for FIFO SNS/SQS for the Copilot worker-service pattern, and support for using Amazon CloudFront for low-latency content delivery and fast TLS termination for public load-balanced web services.

Bottlerocket – The Linux-based operating system purpose-built to run container workloads is now supported by Amazon Inspector. Amazon Inspector can now recommend an update of Bottlerocket if it finds a vulnerability.

Amazon SageMaker Canvas – Now supports mathematical functions and operators for richer data exploration and to understand the relationships between variables in your data.

AWS Compute Optimizer – Now provides cost and performance optimization recommendations for 37 new EC2 instance types, including bare metal instances (m6g.metal) and compute optimized instances (c7g.2xlarge, hpc6a.48xlarge), and new memory metrics for Windows instances.

AWS Budgets – Use a simplified 1-click workflow for common budgeting scenarios with step-by-step tutorials on how to use each template.

Amazon Connect – Now provides an updated flow designer UI that makes it easier and faster to build personalized and automated end-customer experiences, as well as a queue dashboard to view and compare real-time queue performance through time series graphs.

Amazon WorkSpaces – You can now provision Ubuntu desktops and use virtual desktops for new categories of workloads, such as for your developers, engineers, and data scientists.

Amazon WorkSpaces Core – A fully managed infrastructure-only solution for third-party Virtual Desktop Infrastructure (VDI) management software that simplifies VDI migration and combines your current VDI software with the security and reliability of AWS. Read more about it in this Desktop and Application Streaming blog post.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
A few more blog posts you might have missed:

Introducing new language extensions in AWS CloudFormation – In this Cloud Operations & Migrations blog post, we introduce the new language transform that enhances CloudFormation core language with intrinsic functions that simplify handling JSON strings (Fn::ToJsonString), array lengths (Fn::Length), and update and deletion policies.

Building a GraphQL API with Java and AWS Lambda – This blog shows different options for resolving GraphQL queries using serverless technologies on AWS.

For AWS open-source news and updates, here’s the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more.

Upcoming AWS Events
As usual, there are many opportunities to meet:

AWS Summits – Connect, collaborate, and learn about AWS at these free in-person events: Bogotá (October 4) and Singapore (October 6).

AWS Community Days – AWS Community Day events are community-led conferences to share and learn together. Join us in Amersfoort, Netherlands (on October 3, today), Warsaw, Poland (October 14), and Dresden, Germany (October 19).

That’s all from me for this week. Come back next Monday for another Week in Review!

Danilo

Highlights from Git 2.38

Post Syndicated from Taylor Blau original https://github.blog/2022-10-03-highlights-from-git-2-38/

The open source Git project just released Git 2.38, with features and bug fixes from over 92 contributors, 24 of them new. We last caught up with you on the latest in Git back when 2.37 was released.

To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.

A repository management tool for large repositories

We talk a lot about performance in Git, especially in the context of large repositories. Returning readers of these blog posts will no doubt be familiar with the dozens of performance optimizations that have landed in Git over the years.

But with so many features to keep track of, it can be easy to miss some of them every now and then (along with their corresponding performance gains).

Git’s new built-in repository management tool, Scalar, attempts to solve that problem by curating and configuring a uniform set of features with the biggest impact on large repositories. To start using it, you can either clone a new repository with scalar clone:

$ scalar clone /path/to/repo

Or, you can use the --full-clone option if you don’t want to start out with a sparse checkout. To apply Scalar’s recommended configuration to a clone you already have, you can instead run:

$ cd /path/to/repo
$ scalar register

At the time of writing, Scalar’s default configured features include optimizations aimed at large repositories, such as the built-in filesystem monitor, background maintenance, cone mode sparse-checkout, and partial clone.

Scalar’s configuration is updated as new (even experimental!) features are introduced to Git. To make sure you’re always using the latest and greatest, be sure to run scalar reconfigure /path/to/repo after a new release to update your repository’s config (or scalar reconfigure -a to update all of your Scalar-registered repositories at once).
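Spelled out as commands (the path is a placeholder, as above):

$ scalar reconfigure /path/to/repo
$ scalar reconfigure -a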

Git 2.38 is the first time Scalar has been included in the release, but it has actually existed for much longer. Check back soon for a blog post on how Scalar came to be—from its early days as a standalone .NET application to its journey into core Git!

[source]

Rebase dependent branches with --update-refs

When working on a large feature, it’s often helpful to break up the work across multiple branches that build on each other.

But these branches can become cumbersome to manage when you need to rewrite history in an earlier branch. Since each branch depends on the previous ones, rewriting commits in one branch leaves the subsequent branches disconnected from the new history.

In case that didn’t quite make sense, let’s walk through an example.

Suppose that you are working on a feature (my-feature), but want to break it down into a few distinct parts (maybe for ease of review, or to ensure you’re deploying it safely, etc.). Before you share your work with your colleagues, you build the entire feature up front to make sure that the end-result is feasible, like so.

$ git log --oneline origin/main..HEAD
741a3174683 (HEAD -> my-feature/part-three) Part 3: all done!
1ff073007eb Part 3: step two
880c07e326f Part 3: step one
40529bd11dc (my-feature/part-two) Part 2: step two
0a92cc3acd8 Part 2: step one
eed018043ba (my-feature/part-one) Part 1: step three
646c870d69e Part 1: step two
9147f6d2eb4 Part 1: step one

In the example above, the my-feature/part-three branch resembles what you imagine the final state will look like. But the intermediate check-points (my-feature/part-one, and so on) represent the chunks you intend to submit for code review.

After you submit everything, what happens if you want to make a change to one of the patches in part one?

You might create a fixup! commit on top, but squashing that patch into the one you wanted to change from part one will cause parts two and three to become disconnected:

Creating a fixup commit that causes parts two and three to become disconnected

Notice that after we squashed our fix into “Part 1: step one,” the subsequent branches vanished from history. That’s because they didn’t get updated to depend on the updated tip of my-feature/part-one after rebasing.

You could go through and manually checkout each branch, resetting each to the right commit. But this can get cumbersome quickly if you have a lot of branches, are making frequent changes, or both.

Git 2.38 ships with a new option to git rebase called --update-refs that knows how to perform these updates for you. Let’s try that same example again with the new version of Git.

Rebasing with the new version of Git, which updates each branch for you.

Because we used --update-refs, git rebase knew to update our dependent branches, so our history remains intact without having to manually update each individual branch.
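Concretely, the rebase in the example above could be run with something like this (reusing the hypothetical branch names from the git log output):

$ git checkout my-feature/part-three
$ git rebase --interactive --update-refs origin/main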

If you want to use this option every time you rebase, you can run git config --global rebase.updateRefs true to have Git act as if the --update-refs option is always given.

[source]

Tidbits

This release coincides with the Git project’s participation in the annual Google Summer of Code program. This year, the Git project mentored two students, Shaoxuan Yuan and Abhradeep Chakraborty, who worked on sparse index integration and various improvements to reachability bitmaps, respectively.

  • Shaoxuan’s first contribution was integrating the git rm command with the sparse index. The sparse index is a relatively new Git feature that enables Git to shrink the size of its index data structure to only track the contents of your sparse checkout, instead of the entire repository. Long-time readers will remember that Git commands have been converted to be compatible with the sparse-index one-by-one. Commands that aren’t compatible with the sparse index need to temporarily expand the index to cover the entire repository, leading to slow-downs when working in a large repository.

    Shaoxuan’s work made the git rm command compatible with the sparse index, causing it to only expand the index when necessary, bringing Git closer to having all commands be compatible with the sparse index by default.

    [source]

  • Shaoxuan also worked on improving git mv‘s behavior when moving a path from within the sparse checkout definition (sometimes called a “cone”) to outside of the sparse checkout. There were a number of corner cases that required careful reasoning, and curious readers can learn more about exactly how this was implemented in the patches linked below.

    [source]

  • Abhradeep worked on adding a new “lookup table” extension to Git’s reachability bitmap index. For those unfamiliar, this index (stored in a .bitmap file) associates a set of commits to a set of bitmaps, where each bit position corresponds to an object. A 1 bit indicates that a commit can reach the object specified by that bit position, and a 0 indicates that it cannot.

    But .bitmap files do not list their selected commits in a single location. Instead, they prefix each bitmap with the object ID of the commit it corresponds to. That means that in order to know what set of commits are covered by a .bitmap, Git must read the entire contents of the file to discover the set of bitmapped commits.

    Abhradeep addressed this shortcoming by adding an optional “lookup table” at the end of the .bitmap format, which provides a concise list of selected commits, as well as the offset of their corresponding bitmaps within the file. This provided some speed-ups across a handful of benchmarks, making bitmaps faster to load and use, especially for large repositories.

    [source]

  • Abhradeep also worked on sprucing up the technical documentation for the .bitmap format. So if you have ever been curious about or want to hack on Git’s bitmap internals, now is the time!

    [source]

For more about these projects, you can check out each contributor’s final blog posts here and here. Thank you, Shaoxuan, and Abhradeep!

Now that we’ve covered a handful of changes contributed by Google Summer of Code students, let’s take a look at some changes in this release of Git from other Git contributors.

  • You may not be familiar with Git’s merge-tree command, which historically was used to compute trivial three-way merges using Git’s recursive merge strategy. In Git 2.38, this command now knows how to integrate with the new ort merge strategy, allowing it to compute non-trivial merges without touching the index or working copy.

    The existing mode is still available behind a (deprecated) --trivial-merge option. When the new --write-tree mode is used, merge-tree takes two branches to merge, and computes the result using the ort strategy, all without touching the working copy or index. It outputs the resulting tree’s object ID, along with some information about any conflicts it encountered.
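    For example, a hedged sketch of the new mode (the branch names are hypothetical):

    $ git merge-tree --write-tree my-feature main

    On a clean merge this prints the object ID of the resulting tree; when there are conflicts, it also reports information about the conflicted paths, all without touching the index or working copy.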

    As an aside, we at GitHub recently started using merge-ort to compute merges on GitHub.com more than an order of magnitude faster than before. We had previously used the implementation in libgit2 in order to compute merges without requiring a worktree, since GitHub stores repositories as bare, meaning we do not have a worktree to rely on. These changes will make their way to GitHub Enterprise beginning with version 3.7.

    [source]

  • Bare Git repositories can be stored in and distributed with other Git repositories. This is often convenient, for example, as an easy mechanism to distribute Git repositories for use as test fixtures.

    When using repositories from less-than-trustworthy sources, this can also present a security risk. Git repositories often execute user-defined programs specified via the $GIT_DIR/config file. For example, core.pager defines which pager program Git uses, and core.editor defines which editor Git opens when you want to write a commit message (among other things).

    There are other examples, but an often-discussed one is the core.fsmonitor configuration, which can be used to specify a path to a filesystem monitoring hook. Because Git often needs to query the state of the filesystem, this hook (when configured) is invoked many times, including from git status, which people commonly script around in their shell prompt.

    This means that it’s possible to convince a victim to run arbitrary code by convincing them to clone a repository with a malicious bare repository embedded inside of it. If they change their working directory into the malicious repository within (since you cannot embed a bare repository at the top-level directory of a repository) and run some Git command, then they are likely to execute the script specified by core.fsmonitor (or any other configuration that specifies a command to execute).

    For this reason, the new safe.bareRepository configuration was introduced. When set to “explicit,” Git will only work with bare repositories specified by the top-level --git-dir argument. Otherwise, when set to “all” (which is the default), Git will continue to work with all bare repositories, embedded or not.

    It is worth noting that setting safe.bareRepository to “explicit” is only required if you worry that you may be cloning malicious repositories and executing Git commands in them.
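    To opt in, you can set the configuration globally:

    $ git config --global safe.bareRepository explicit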

    [source]

  • git grep learned a new -m option (short for --max-count), which behaves like GNU grep‘s options of the same name. This new option limits the number of matches shown per file. This can be especially useful when combined with other options, like -C or -p (which show code context, or the name of the function which contains each match).

    You could, for example, combine all three of these options to show a summary of how some function is called by many different files in your project. Git has a handful of objects that contain the substring oid_object_info. If you want to look at how callers across different files are structured without seeing more than one example from the same file, you can now run:

    $ git grep -C3 -p -m1 oid_object_info

    [source]

  • If you’ve ever scripted around the directory contents of your Git repository, there’s no doubt that you’ve encountered the git ls-files command. Unlike ls-tree (which lists the contents of a tree object), ls-files lists the contents of the index, the working directory, or both.

    There are already lots of options which can further specify what does or doesn’t get printed in ls-files‘s output. But its output was not easily customizable without additional scripting.

    In Git 2.38, that is no longer the case, with ls-files‘s new --format option. You can now customize how each entry is printed, with fields to print an object’s name and mode, as well as more esoteric options, like its stage in the index, or end-of-line (EOL) behavior.
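    As a sketch, something like the following prints the mode, object ID, and path of each tracked file (see the git ls-files documentation for the full set of supported fields):

    $ git ls-files --format='%(objectmode) %(objectname) %(path)'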

    [source]

  • git cat-file also learned a new option to respect the mailmap when printing the contents of objects with identifiers in them. This feature was contributed by another Google Summer of Code student, this time working on behalf of GitLab!

    For the uninitiated, the mailmap is a feature which allows mapping name and email pairs to their canonical values, which can be useful if you change your name or email and want to retain authorship over historical commits without rewriting history.

    git show, and many other tools already understand how to remap identities under the mailmap (for example, git show‘s %aN and %aE format placeholders print the mailmapped author name and email, respectively, as opposed to %an and %ae, which don’t respect the mailmap). But git cat-file, which is a low-level command which prints the contents of objects, did not know how to perform this conversion.

    That meant that if you wanted to print a stream of objects, but transform any author, committer, or tagger identities according to the mailmap, you would have to pipe their contents through git show or similar. This is no longer the case, since git cat-file now understands the --[no-]use-mailmap option, meaning this transformation can be done before printing out object contents.

    [source]

  • Finally, Git’s developer documentation got an improvement in this most recent release, by adding a codified version of the Git community’s guidelines for code review. This document is a helpful resource for new and existing contributors to learn about the cultural norms around reviewing patches on the Git mailing list.

    If you’ve ever had the itch to contribute to the Git project, I highly encourage you to read the new reviewing guidelines (as well as the coding guidelines, and the “My First Contribution” document) and get started!

    [source]

The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.38, or any previous version in the Git repository.

Introducing workerd: the Open Source Workers runtime

Post Syndicated from Kenton Varda original https://blog.cloudflare.com/workerd-open-source-workers-runtime/

Today I’m proud to introduce the first beta release of workerd, the JavaScript/Wasm runtime based on the same code that powers Cloudflare Workers. workerd is Open Source under the Apache License version 2.0.

workerd shares most of its code with the runtime that powers Cloudflare Workers, but with some changes designed to make it more portable to other environments. The name “workerd” (pronounced “worker dee”) comes from the Unix tradition of naming servers with a “-d” suffix standing for “daemon”. The name is not capitalized because it is a program name, and program names are traditionally lower-case in Unix-like environments.

What it’s for

Self-hosting Workers

workerd can be used to self-host applications that you’d otherwise run on Cloudflare Workers. It is intended to be a production-ready web server for this purpose. workerd has been designed to be unopinionated about hosting environments, so that it should fit nicely into whatever server/VM/container hosting and orchestration system you prefer. It’s just a web server.

Workers has always been based on standardized APIs, so that code is not locked into Cloudflare, and we work closely with other runtimes to promote compatibility. workerd provides another option to ensure that applications built on Workers can run anywhere, by leveraging the same underlying code to get exact, “bug-for-bug” compatibility.

Local development and testing

workerd is also designed to facilitate realistic local testing of Workers. Up until now, this has been achieved using Miniflare, which simulated the Workers API within a Node.js environment. Miniflare has worked well, but in a number of cases its behavior did not exactly match Workers running on Cloudflare. With the release of workerd, Miniflare and the Wrangler CLI tool will now be able to provide a more accurate simulation by leveraging the same runtime code we use in production.

Programmable proxies

workerd can act as an application host, a proxy, or both. It supports both forward and reverse proxy modes. In all cases, JavaScript code can be used to intercept and process requests and responses before forwarding them on. Traditional web servers and proxies have used bespoke configuration languages with quirks that are hard to master. Programming proxies in JavaScript instead provides more power while making the configuration easier to write and understand.

What it is

workerd is not just another way to run JavaScript and Wasm. Our runtime is uniquely designed in a number of ways.

Server-first

Many non-browser JavaScript and Wasm runtimes are designed to be general-purpose: you can use them to build command-line apps, local GUI apps, servers, or anything in between. workerd is not. It specifically focuses on servers, in particular (for now, at least) HTTP servers.

This means in particular that workerd-based applications are event-driven at the top level. Applications do not open listen sockets and accept connections from them; instead, the runtime pushes events to the application. It may seem like a minor difference, but this basic change in perspective directly enables many of the features below.
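For example, a minimal Worker for workerd is just an exported event handler written against the standard Workers module syntax; the runtime calls fetch() for each incoming HTTP request, and the code never opens a listen socket (this is a sketch, not a complete workerd configuration):

// The runtime invokes fetch() for every incoming request.
export default {
  async fetch(request) {
    const url = new URL(request.url);
    return new Response(`Hello from workerd! You asked for ${url.pathname}`);
  }
};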

Web standard APIs

Wherever possible, Workers (and workerd in particular) offers the same standard APIs found in web browsers, such as Fetch, URL, WebCrypto, and others. This means that code built on workerd is more likely to be portable to browsers as well as to other standards-based runtimes. When Workers launched five years ago, it was unusual for a non-browser to offer web APIs, but we are pleased to see that the broader JavaScript ecosystem is now converging on them.

Nanoservices

workerd is a nanoservice runtime. What does that mean?

Microservices have become popular over the last decade as a way to split monolithic servers into smaller components that could be maintained and deployed independently. For example, a company that offers several web applications with a common user authentication flow might have a separate team that maintains the authentication logic. In a monolithic model, the authentication logic might have been offered to the application teams as a library. However, this could be frustrating for the maintainers of that logic, as making any change might require waiting for every application team to deploy an update to their respective server. By splitting the authentication logic into a separate server that all the others talk to, the authentication team is able to deploy changes on their own schedule.

However, microservices have a cost. What was previously a fast library call instead now requires communicating over a network. In addition to added overhead, this communication requires configuration and administration to ensure security and reliability. These costs become greater as the codebase is split into more and more services. Eventually, the costs outweigh the benefits.

Nanoservices are a new model that achieve the benefits of independent deployment with overhead closer to that of library calls. With workerd, many Workers can be configured to run in the same process. Each Worker runs in a separate “isolate”, which gives the appearance of running independently of the others: each isolate loads separate code and has its own global scope. However, when one Worker explicitly sends a request to another Worker, the destination Worker actually runs in the same thread with zero latency. So, it performs more like a function call.

With nanoservices, teams can now break their code into many more independently-deployed pieces without worrying about the overhead.

(Some in the industry prefer to call nanoservices “functions”, implying that each individual function making up an application could be its own service. I feel, however, that this puts too much emphasis on syntax rather than logical functionality. That said, it is the same concept.)

To really make nanoservices work well, we had to minimize the baseline overhead of each service. This required designing workerd very differently from most other runtimes, so that common resources could be shared between services as much as possible. First, as mentioned, we run many nanoservices within a single process, to share basic process overhead and minimize context switching costs. A second big architectural difference between workerd and other runtimes is how it handles built-in APIs. Many runtimes implement significant portions of their built-in APIs in JavaScript, which must then be loaded separately into each isolate. workerd does not; all the APIs are implemented in native code, so that all isolates may share the same copy of that code. These design choices would be difficult to retrofit into another runtime, and indeed these needs are exactly why we chose to build a custom runtime for Workers from the start.

Homogeneous deployment

In a typical microservices model, you might deploy different microservices to containers running across a cluster of machines, connected over a local network. You might manually choose how many containers to dedicate to each service, or you might configure some form of auto-scaling based on resource usage.

workerd offers an alternative model: Every machine runs every service.

workerd’s nanoservices are much lighter-weight than typical containers. As a result, it’s entirely reasonable to run a very large number of them – hundreds, maybe thousands – on a single server. This in turn means that you can simply deploy every service to every machine in your fleet.

Homogeneous deployment means that you don’t have to worry about scaling individual services. Instead, you can simply load balance requests across the entire cluster, and scale the cluster as needed. Overall, this can greatly reduce the amount of administration work needed.

Cloudflare itself has used the homogeneous model on our network since the beginning. Every one of Cloudflare’s edge servers runs our entire software stack, so any server can answer any kind of request on its own. We’ve found it works incredibly well. This is why services on Cloudflare – including ones that use Workers – are able to go from no traffic at all to millions of requests per second instantly without trouble.

Capability bindings: cleaner configuration and SSRF safety

workerd takes a different approach to most runtimes – indeed, to most software development platforms – in how an application accesses external resources.

Most development platforms start from assuming that the application can talk to the whole world. It is up to the application to figure out exactly what it wants to talk to, and name it in some global namespace, such as using a URL. So, an application server that wants to talk to the authentication microservice might use code like this:

// Traditional approach without capability bindings.
fetch("https://auth-service.internal-network.example.com/api", {
  method: "POST",
  body: JSON.stringify(authRequest),
  headers: { "Authorization": env.AUTH_SERVICE_TOKEN }
});

In workerd, we do things differently. An application starts out with no ability to talk to the rest of the world, and must be configured with specific capability bindings that provide it access to specific external resources. So, an application which needs to be able to talk to the authentication service would be configured with a binding called authService, and the code would look something like this:

// Capability-based approach. Hostname doesn't matter; all
// requests to AUTH_SERVICE.fetch() go to the auth service.
env.AUTH_SERVICE.fetch("https://auth/api", {
  method: "POST",
  body: JSON.stringify(authRequest),
});

This may at first appear to be a trivial difference. In both cases, we have to use configuration to control access to external services. In the traditional approach, we’d provide access tokens (and probably the service’s hostname) as environment variables. In the new approach, the environment goes a bit further to provide a full-fledged object. Is this just syntax sugar?

It turns out, this slight change has huge advantages:

First, we can now restrict the global fetch() function to accept only publicly-routable URLs. This makes applications totally immune to SSRF attacks! You cannot trick an application into accessing an internal service unintentionally if the code to access internal services is explicitly different. (In fact, the global fetch() is itself backed by a binding, which can be configured. workerd defaults to connecting it to the public internet, but you can also override it to permit private addresses if you want, or to route to a specific proxy service, or to be blocked entirely.)

With that done, we now have an interesting property: All internal services which an application uses must be configurable. This means:

  • You can easily see a complete list of the internal services an application talks to, without reading all the code.
  • You can always replace these services with mocks for testing purposes.
  • You can always configure an application to authenticate itself differently (e.g. client certificates) or use a different back end, without changing code.

The receiving end of a binding benefits, too. Take the authentication service example, above. The auth service may be another Worker running in workerd as a nanoservice. In this case, the auth service does not need to be bound to any actual network address. Instead, it may be made available strictly to other Workers through their bindings. In this case, the authentication service doesn’t necessarily need to verify that a request received came from an allowed client – because only allowed clients are able to send requests to it in the first place.

Overall, capability bindings allow simpler code that is secure by default, more composable, easier to test, and easier to understand and maintain.

Always backwards compatible

Cloudflare Workers has a hard rule against ever breaking a live Worker running in production. This same dedication to backwards compatibility extends to workerd.

workerd shares Workers’ compatibility date system to manage breaking changes. Every Worker must be configured with a “compatibility date”. The runtime then ensures that the API behaves exactly as it did on that date. At your leisure, you may check the documentation to see if new breaking changes are introduced at a future date, and update your code for them. Most such changes are minor and most code won’t require any changes. However, you are never obliged to update. Old dates will continue to be supported by newer versions of workerd. It is always safe to update workerd itself without updating your code.

What it’s not

To avoid misleading or disappointing anyone, I need to take a moment to call out what workerd is not.

workerd is not a Secure Sandbox

It’s important to note that workerd is not, on its own, a secure way to run possibly-malicious code. If you wish to run code you don’t trust using workerd, you must enclose it in an additional sandboxing layer, such as a virtual machine configured for sandboxing.

workerd itself is designed such that a Worker should not be able to access any external resources to which it hasn’t been granted a capability. However, a complete sandbox solution not only must be designed to restrict access, but also must account for the possibility of bugs – both in software and in hardware. workerd on its own is not sufficient to protect against hardware bugs like Spectre, nor can it adequately defend against the possibility of vulnerabilities in V8 or in workerd’s own code.

The Cloudflare Workers service uses the same code found in workerd, but adds many additional layers of security on top to harden against such bugs. I described some of these in a past blog post. However, these measures are closely tied to our particular environment. For example, we rely on build automation to push V8 patches to production immediately upon becoming available; we separate customers according to risk profile; we rely on non-portable kernel features and assumptions about the host system to enforce security and resource limits. All of this is very specific to our environment, and cannot be packaged up in a reusable way.

workerd is not an independent project

workerd is the core of Cloudflare Workers, a fast-moving project developed by a dedicated team at Cloudflare. We are not throwing code over the wall to forget about, nor are we expecting volunteers to do our jobs for us. workerd’s GitHub repository will be the canonical source used by Cloudflare Workers and our team will be doing much of their work directly in this repository. Just like V8 is developed primarily by the Chrome team for use in Chrome, workerd will be developed primarily by the Cloudflare Workers team for use in Cloudflare Workers.

This means we cannot promise that external contributions will sit on a level playing field with internal ones. Code reviews take time, and work that is needed for Cloudflare Workers will take priority. We also cannot promise we will accept every feature contribution. Even if the code is already written, reviews and maintenance have a cost. Within Cloudflare, we have a product management team who carefully evaluates what features we should and shouldn’t offer, and plenty of ideas generated internally ultimately don’t make the cut.

If you want to contribute a big new feature to workerd, your best bet is to talk to us before you write code, by raising an issue on GitHub early to get input. That way, you can find out if we’re likely to accept a PR before you write it. We also might be able to give hints on how best to implement.

It’s also important to note that while workerd’s internal interfaces may sometimes appear clean and reusable, we cannot make any guarantee that those interfaces won’t completely change on a whim. If you are trying to build on top of workerd internals, you will need to be prepared either to accept a fair amount of churn, or pin to a specific version.

workerd is not an off-the-shelf edge compute platform

As hinted above, the full Cloudflare Workers service involves a lot of technology beyond workerd itself, including additional security, deployment mechanisms, orchestration, and so much more. workerd itself is a portion of our runtime codebase, which is itself a small (albeit critical) piece of the overall Cloudflare Workers service.

We are pleased, though, that this means it is possible for us to release this code under a permissive Open Source license.

Try the Beta

As of this blog post, workerd is in beta. If you want to try it out,

Choose the k-NN algorithm for your billion-scale use case with OpenSearch

Post Syndicated from Jack Mazanec original https://aws.amazon.com/blogs/big-data/choose-the-k-nn-algorithm-for-your-billion-scale-use-case-with-opensearch/

When organizations set out to build machine learning (ML) applications such as natural language processing (NLP) systems, recommendation engines, or search-based systems, k-Nearest Neighbor (k-NN) search is often used at some point in the workflow. As the number of data points reaches the hundreds of millions or even billions, scaling a k-NN search system can be a major challenge. Applying Approximate Nearest Neighbor (ANN) search is a great way to overcome this challenge.

The k-NN problem is relatively simple compared to other ML techniques: given a set of points and a query, find the k nearest points in the set to the query. The naive solution is equally understandable: for each point in the set, compute its distance from the query and keep track of the top k along the way.

K-NN concept

The problem with this naive approach is that it doesn’t scale particularly well. The runtime search complexity is O(N log k), where N is the number of vectors and k is the number of nearest neighbors. Although this may not be noticeable when the set contains thousands of points, it becomes noticeable when the size gets into the millions. While some exact k-NN algorithms can speed up the search, they tend to perform similarly to the naive approach in higher dimensions.

Enter ANN search. We can reduce the runtime search latency if we loosen a few constraints on the k-NN problem:

  • Allow indexing to take longer
  • Allow more space to be used at query time
  • Allow the search to return an approximation of the k-NN in the set

Several different algorithms have been discovered to do just that.

OpenSearch is a community-driven, Apache 2.0-licensed, open-source search and analytics suite that makes it easy to ingest, search, visualize, and analyze data. The OpenSearch k-NN plugin provides the ability to use some of these algorithms within an OpenSearch cluster. In this post, we discuss the different algorithms that are supported and run experiments to see some of the trade-offs between them.

Hierarchical Navigable Small Worlds algorithm

The Hierarchical Navigable Small Worlds algorithm (HNSW) is one of the most popular algorithms out there for ANN search. It was the first algorithm that the k-NN plugin supported, using a very efficient implementation from the nmslib similarity search library. It has one of the best query latency vs. recall trade-offs and doesn’t require any training. The core idea of the algorithm is to build a graph with edges connecting index vectors that are close to each other. Then, on search, this graph is partially traversed to find the approximate nearest neighbors to the query vector. To steer the traversal towards the query’s nearest neighbors, the algorithm always visits the closest candidate to the query vector next.

But which vector should the traversal start from? It could just pick a random vector, but for a large index, this might be very far from the query’s actual nearest neighbors, leading to poor results. To pick a vector that is generally close to the query vector to start from, the algorithm builds not just one graph, but a hierarchy of graphs. All vectors are added to the bottom layer, and then a random subset of those are added to the layer above, and then a subset of those are added to the layer above that, and so on.

During search, we start from a random vector in the top layer, partially traverse the graph to find (approximately) the nearest point to the query vector in that layer, and then use this vector as the starting point for our traversal of the layer below. We repeat this until we get to the bottom layer. At the bottom layer, we perform the traversal, but this time, instead of just searching for the nearest neighbor, we keep track of the k-nearest neighbors that are visited along the way.

The following figure illustrates this process (inspired by the image in the original paper, Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs).

You can tune three parameters for HNSW:

  • m – The maximum number of edges a vector will get in a graph. The higher this number is, the more memory the graph will consume, but the better the search approximation may be.
  • ef_search – The size of the queue of the candidate nodes to visit during traversal. When a node is visited, its neighbors are added to the queue to be visited in the future. When this queue is empty, the traversal will end. A larger value will increase search latency, but may provide better search approximation.
  • ef_construction – Very similar to ef_search. When a node is to be inserted into the graph, the algorithm will find its m edges by querying the graph with the new node as the query vector. This parameter controls the candidate queue size for this traversal. A larger value will increase index latency, but may provide a better search approximation.

For more information on HNSW, you can read through the paper Efficient and robust approximate nearest neighbor search using Hierarchical Navigable Small World graphs.

Memory consumption

Although HNSW provides very good approximate nearest neighbor search at low latencies, it can consume a large amount of memory. Each HNSW graph uses roughly 1.1 * (4 * d + 8 * m) * num_vectors bytes of memory:

  • d is the dimension of the vectors
  • m is the algorithm parameter that controls the number of connections each node will have in a layer
  • num_vectors is the number of vectors in the index

To ensure durability and availability, especially when running production workloads, OpenSearch indexes are recommended to have at least one replica shard. Therefore, the memory requirement is multiplied by (1 + number of replicas). For use cases where the data size is 1 billion vectors of 128 dimensions each and m is set to the default value of 16, the estimated amount of memory required would be:

1.1 * (4 * 128 + 8 * 16) * 1,000,000,000 * 2 = 1,408 GB.

If we increase the vector dimension to 512, for example, and set m to 100, which is recommended for vectors with high intrinsic dimensionality, some use cases can require approximately 4 TB of memory in total.
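If you want to reproduce the arithmetic above, the estimate fits in a small Go helper. The constants are the ones from the formula in this section; this is a back-of-the-envelope sizing aid, not an exact measurement.

package knnsizing

// EstimateHNSWBytes implements the rough sizing formula used above:
// 1.1 * (4 * dimension + 8 * m) * num_vectors, scaled by (1 + replicas).
func EstimateHNSWBytes(dimension, m int, numVectors int64, replicas int) float64 {
    perVector := float64(4*dimension + 8*m)
    return 1.1 * perVector * float64(numVectors) * float64(1+replicas)
}

// Example from the text: EstimateHNSWBytes(128, 16, 1_000_000_000, 1) is roughly
// 1.408e12 bytes, or about 1,408 GB.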

With OpenSearch, we can always horizontally scale the cluster to handle this memory requirement. However, this comes at the expense of raising infrastructure costs. For cases where scaling doesn’t make sense, options to reduce the memory footprint of the k-NN system need to be explored. Fortunately, there are algorithms that we can use to do this.

Inverted File System algorithm

Consider a different approach for approximating a nearest neighbor search: separate your index vectors into a set of buckets, then, to reduce your search time, only search through a subset of these buckets. From a high level, this is what the Inverted File System (IVF) ANN algorithm does. In OpenSearch 1.2, the k-NN plugin introduced support for the implementation of IVF by Faiss. Faiss is an open-source library from Meta for efficient similarity search and clustering of dense vectors.

However, if we just randomly split up our vectors into different buckets, and only search a subset of them, this will be a poor approximation. The IVF algorithm uses a more elegant approach. First, before indexing begins, it assigns each bucket a representative vector. When a vector is indexed, it gets added to the bucket that has the closest representative vector. This way, vectors that are closer to each other are placed roughly in the same or nearby buckets.

To determine what the representative vectors for the buckets are, the IVF algorithm requires a training step. In this step, k-Means clustering is run on a set of training data, and the centroids it produces become the representative vectors. The following diagram illustrates this process.

Inverted file system indexing concept
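The bucketing idea can be sketched in a few lines of Go. This is a conceptual illustration, not the Faiss implementation; the centroids are assumed to come from the k-Means training step described above.

package ivfsketch

import (
    "math"
    "sort"
)

// Index is a toy IVF structure: nlist centroids (from a k-Means training step)
// and one bucket of vectors per centroid.
type Index struct {
    Centroids [][]float32
    Buckets   [][][]float32
}

func l2(a, b []float32) float64 {
    var s float64
    for i := range a {
        d := float64(a[i] - b[i])
        s += d * d
    }
    return math.Sqrt(s)
}

// nearestCentroids returns bucket indexes ordered by distance to v.
func (ix *Index) nearestCentroids(v []float32) []int {
    order := make([]int, len(ix.Centroids))
    for i := range order {
        order[i] = i
    }
    sort.Slice(order, func(a, b int) bool {
        return l2(v, ix.Centroids[order[a]]) < l2(v, ix.Centroids[order[b]])
    })
    return order
}

// Add places a vector into the bucket whose centroid is closest to it.
func (ix *Index) Add(v []float32) {
    b := ix.nearestCentroids(v)[0]
    ix.Buckets[b] = append(ix.Buckets[b], v)
}

// Search scans only the nprobes closest buckets and returns the k nearest
// vectors found there; skipping the other buckets is the approximation IVF makes.
func (ix *Index) Search(query []float32, k, nprobes int) [][]float32 {
    if nprobes > len(ix.Centroids) {
        nprobes = len(ix.Centroids)
    }
    var candidates [][]float32
    for _, b := range ix.nearestCentroids(query)[:nprobes] {
        candidates = append(candidates, ix.Buckets[b]...)
    }
    sort.Slice(candidates, func(i, j int) bool {
        return l2(query, candidates[i]) < l2(query, candidates[j])
    })
    if len(candidates) > k {
        candidates = candidates[:k]
    }
    return candidates
}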

IVF has two parameters:

  • nlist – The number of buckets to create. More buckets will result in longer training times, but may improve the granularity of the search.
  • nprobes – The number of buckets to search. This parameter is fairly straightforward. The more buckets that are searched, the longer the search will take, but the better the approximation.

Memory consumption

In general, IVF requires less memory than HNSW because IVF doesn’t need to store a set of edges for each indexed vector.

We estimate that IVF will roughly require the following amount of memory:

1.1 * (((4 * dimension) * num_vectors) + (4 * nlist * dimension)) bytes

For the case explored for HNSW where there are 1,000,000,000 128-dimensional vectors with one layer of replication, an IVF algorithm with an nlist of 4096 would take roughly 1.1 * (((4 * 128) * 2,000,000,000) + (4 * 4096 * 128)) bytes = 1126 GB.

This savings does come at a cost, however, because HNSW offers a better query latency versus approximation accuracy tradeoff.

Product quantization vector compression

Although you can use HNSW and IVF to speed up nearest neighbor search, they can consume a considerable amount of memory. When we get into the billion-vector scale, we start to require thousands of GBs of memory to support their index structures. As we scale up the number of vectors or the dimension of vectors, this requirement continues to grow. Is there a way to use noticeably less space for our k-NN index?

The answer is yes! In fact, there are a lot of different ways to reduce the amount of memory vectors require. You can change your embedding model to produce smaller vectors, or you can apply techniques like Principal Component Analysis (PCA) to reduce the vectors' dimensionality. Another approach is to use quantization. The general idea of vector quantization is to map a large vector space with continuous values into a smaller space with discrete values. When a vector is mapped into a smaller space, it requires fewer bits to represent. However, this comes at a cost: when mapping to a smaller input space, some information about the vector is lost.

Product quantization (PQ) is a very popular quantization technique in the field of nearest neighbor search. It can be used together with ANN algorithms for nearest neighbor search. Along with IVF, the k-NN plugin added support for Faiss’s PQ implementation in OpenSearch 1.2.

The main idea of PQ is to break up a vector into several sub-vectors and encode the sub-vectors independently with a fixed number of bits. The number of sub-vectors that the original vector is broken up into is controlled by a parameter, m, and the number of bits to encode each sub-vector with is controlled by a parameter, code_size. After encoding finishes, a vector is compressed into roughly m * code_size bits. So, assume we have a set of 100,000 1024-dimensional vectors. With m = 8 and code_size = 8, PQ breaks each vector into 8 128-dimensional sub-vectors and encodes each sub-vector with 8 bits.

The values used for encoding are produced during a training step. During training, tables are created with 2^code_size entries for each sub-vector partition. Next, k-Means clustering, with a k value of 2^code_size, is run on the corresponding partition of sub-vectors from the training data. The centroids produced here are added as the entries to the partition's table.

After all the tables are created, we encode a vector by replacing each sub-vector with the ID of the closest vector in the partition's table. In the example where code_size = 8, we only need 8 bits to store an ID because there are 2^8 = 256 elements in the table. So, with dimension = 1024 and m = 8, the total size of one vector (assuming it uses a 32-bit floating point data type) is reduced from 4,096 bytes to roughly 8 bytes!

Product quantization encoding step

When we want to decode a vector, we can reconstruct an approximated version of it by using the stored IDs to retrieve the vectors from each partition's table. The distance from the query vector to the reconstructed vector can then be computed and used in a nearest neighbor search. (It's worth noting that, in practice, further optimization techniques like asymmetric distance computation (ADC) are used to speed up this process for k-NN search.)

Product quantization decoding step
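Here is a rough Go sketch of the encode and decode steps. The names are hypothetical and, unlike a real implementation, the codes are stored as whole bytes for readability; the codebooks are assumed to come from the training step described above.

package pqsketch

import "math"

// Codebooks[p] holds the 2^code_size centroids for sub-vector partition p.
// With code_size = 8, each table has 256 entries, so an ID fits in one byte.
type Codebooks [][][]float32

func l2sq(a, b []float32) float64 {
    var s float64
    for i := range a {
        d := float64(a[i] - b[i])
        s += d * d
    }
    return s
}

// Encode splits v into m equal sub-vectors and replaces each with the ID of
// the closest centroid in that partition's table.
func Encode(cb Codebooks, v []float32) []byte {
    m := len(cb)
    sub := len(v) / m
    code := make([]byte, m)
    for p := 0; p < m; p++ {
        part := v[p*sub : (p+1)*sub]
        best, bestDist := 0, math.Inf(1)
        for id, centroid := range cb[p] {
            if d := l2sq(part, centroid); d < bestDist {
                best, bestDist = id, d
            }
        }
        code[p] = byte(best)
    }
    return code
}

// Decode reconstructs an approximation of the original vector by looking up
// each stored ID in its partition's table.
func Decode(cb Codebooks, code []byte) []float32 {
    var out []float32
    for p, id := range code {
        out = append(out, cb[p][id]...)
    }
    return out
}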

Memory consumption

As we mentioned earlier, PQ will encode each vector into roughly m * code_size bits plus some overhead for each vector.

When combining it with IVF, we can estimate the index size as follows:

1.1 * ((((code_size / 8) * m + overhead_per_vector) * num_vectors) + (4 * nlist * dimension) + (2^code_size * 4 * dimension)) bytes

Using 1 billion vectors, dimension = 128, m = 8, code_size = 8, and nlist = 4096, we get an estimated total memory consumption of 70GB: 1.1 * ((((8 / 8) * 8 + 24) * 1,000,000,000) + (4 * 4096 * 128) + (2^8 * 4 * 128)) * 2 = 70 GB.
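As with HNSW, these estimates are easy to reproduce in Go. The helpers below follow the formulas in this section, including the roughly 24 bytes of per-vector overhead assumed for the PQ case.

package knnsizing

import "math"

// EstimateIVFBytes follows the formula above for an uncompressed IVF index:
// 1.1 * ((4 * dimension * num_vectors) + (4 * nlist * dimension)), scaled by (1 + replicas).
func EstimateIVFBytes(dimension, nlist int, numVectors int64, replicas int) float64 {
    data := 4.0 * float64(dimension) * float64(numVectors)
    centroids := 4.0 * float64(nlist) * float64(dimension)
    return 1.1 * (data + centroids) * float64(1+replicas)
}

// EstimateIVFPQBytes follows the IVFPQ formula above: each vector is compressed
// to m codes of code_size bits plus roughly 24 bytes of overhead, and the
// centroid and codebook terms are added on top.
func EstimateIVFPQBytes(dimension, nlist, m, codeSize int, numVectors int64, replicas int) float64 {
    const overheadPerVector = 24.0
    compressed := (float64(codeSize)/8.0*float64(m) + overheadPerVector) * float64(numVectors)
    centroids := 4.0 * float64(nlist) * float64(dimension)
    codebooks := math.Pow(2, float64(codeSize)) * 4.0 * float64(dimension)
    return 1.1 * (compressed + centroids + codebooks) * float64(1+replicas)
}

// Example from the text: EstimateIVFPQBytes(128, 4096, 8, 8, 1_000_000_000, 1)
// comes out to roughly 70 GB.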

Running k-NN with OpenSearch

First make sure you have an OpenSearch cluster up and running. For instructions, refer to Cluster formation. For a more managed solution, you can use Amazon OpenSearch Service.

Before getting into the experiments, let’s go over how to run k-NN workloads in OpenSearch. First, we need to create an index. An index stores a set of documents in a way that they can be easily searched. For k-NN, the index’s mapping tells OpenSearch what algorithms to use and what parameters to use with them. We start by creating an index that uses HNSW as its search algorithm:

PUT my-hnsw-index
{
  "settings": {
    "index": {
      "knn": true,
      "number_of_shards": 10,
      "number_of_replicas" 1,
    }
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "dimension": 4,
        "method": {
          "name": "hnsw",
          "space_type": "l2",
          "engine": "nmslib",
          "parameters": {
            "ef_construction": 128,
            "m": 24
          }
        }
      }
    }
  }
}

In the settings, we need to enable knn so that the index can be searched with the knn query type (more on this later). We also set the number of shards, and the number of replicas each shard will have. An index is made up of a collection of shards. Sharding is how OpenSearch distributes an index across multiple nodes in a cluster. For more information about shards, refer to Sizing Amazon OpenSearch Service domains.

In the mappings, we configure a field called my_vector of type knn_vector to store the vector data. We also pass nmslib as the engine to let OpenSearch know it should use nmslib’s implementation of HNSW. Additionally, we pass l2 as the space_type. The space_type determines the function used to compute the distance between two vectors. l2 refers to the Euclidean distance. OpenSearch also supports cosine similarity and the inner product distance functions.

After the index is created, we can ingest some fake data:

POST _bulk
{ "index": { "_index": "my-hnsw-index", "_id": "1" } }
{ "my_vector": [1.5, 2.5], "price": 12.2 }
{ "index": { "_index": "my-hnsw-index", "_id": "2" } }
{ "my_vector": [2.5, 3.5], "price": 7.1 }
{ "index": { "_index": "my-hnsw-index", "_id": "3" } }
{ "my_vector": [3.5, 4.5], "price": 12.9 }
{ "index": { "_index": "my-hnsw-index", "_id": "4" } }
{ "my_vector": [5.5, 6.5], "price": 1.2 }
{ "index": { "_index": "my-hnsw-index", "_id": "5" } }
{ "my_vector": [4.5, 5.5], "price": 3.7 }
{ "index": { "_index": "my-hnsw-index", "_id": "6" } }
{ "my_vector": [1.5, 5.5, 4.5, 6.4], "price": 10.3 }
{ "index": { "_index": "my-hnsw-index", "_id": "7" } }
{ "my_vector": [2.5, 3.5, 5.6, 6.7], "price": 5.5 }
{ "index": { "_index": "my-hnsw-index", "_id": "8" } }
{ "my_vector": [4.5, 5.5, 6.7, 3.7], "price": 4.4 }
{ "index": { "_index": "my-hnsw-index", "_id": "9" } }
{ "my_vector": [1.5, 5.5, 4.5, 6.4], "price": 8.9 }

After adding some documents to the index, we can search it:

GET my-hnsw-index/_search
{
  "size": 2,
  "query": {
    "knn": {
      "my_vector": {
        "vector": [2, 3, 5, 6],
        "k": 2
      }
    }
  }
}

Creating an index that uses IVF or PQ is a little bit different because these algorithms require training. Before creating the index, we need to create a model using the training API:

POST /_plugins/_knn/models/my_ivfpq_model/_train
{
  "training_index": "train-index",
  "training_field": "train-field",
  "dimension": 128,
  "description": "My model description",
  "method": {
      "name":"ivf",
      "engine":"faiss",
      "parameters":{
        "encoder":{
            "name":"pq",
            "parameters":{
                "code_size": 8,
                "m": 8
            }
        }
      }
  }
}

The training_index and training_field specify where the training data is stored. The only requirement for the training data index is that it has a knn_vector field that has the same dimension as you want your model to have. The method defines the algorithm that should be used for search.

After the training request is submitted, it will run in the background. To check if the training is complete, you can use the GET model API:

GET /_plugins/_knn/models/my_ivfpq_model?filter_path=model_id,state
{
  "model_id" : "my_ivfpq_model",
  "state" : "created"
}

After the model is created, you can create an index that uses this model:

PUT /my-ivfpq-index
{
  "settings" : {
    "index.knn": true,
    "number_of_shards" : 10,
    "number_of_replicas" : 1
  },
  "mappings": {
    "properties": {
      "my_vector": {
        "type": "knn_vector",
        "model_id": "my_ivfpq_model"
      }
    }
  }
}

After the index is created, we can add documents to it and search it just like we did for HNSW.

Experiments

Let’s run a few experiments to see how these algorithms perform in practice and what tradeoffs are made. We look at an HNSW versus an IVF index using PQ. For these experiments, we’re interested in search accuracy, query latency, and memory consumption. Because these trade-offs are mainly observed at scale, we use the BIGANN dataset containing 1 billion vectors of 128 dimensions. The dataset also contains 10,000 queries of test data mapping a query to the ground truth closest 100 vectors based on the Euclidean distance.

Specifically, we compute the following search metrics:

  • Latency p99 (ms), Latency p90 (ms), Latency p50 (ms) – Query latency at various quantiles in milliseconds
  • recall@10 – The fraction of the top 10 ground truth neighbors found in the 10 results returned by the plugin
  • Native memory consumption (GB) – The amount of memory used by the plugin during querying

One thing to note is that the BIGANN dataset uses an unsigned integer as the data type. Because the knn_vector field doesn’t support unsigned integers, the data is automatically converted to floats.
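For reference, the recall metric defined above boils down to a simple set intersection. A minimal Go helper (assuming both ID lists contain at least k entries) looks like this:

package benchmetrics

// RecallAtK returns the fraction of the true top-k neighbors that appear in
// the k results returned by the plugin, matching the recall@10 definition above.
func RecallAtK(groundTruth, returned []string, k int) float64 {
    truth := make(map[string]bool, k)
    for _, id := range groundTruth[:k] {
        truth[id] = true
    }
    found := 0
    for _, id := range returned[:k] {
        if truth[id] {
            found++
        }
    }
    return float64(found) / float64(k)
}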

To run the experiments, we complete the following steps:

  1. Ingest the dataset into the cluster using the OpenSearch Benchmarks framework (the code can be found on GitHub).
  2. When ingestion is complete, we use the warmup API to prepare the cluster for the search workload.
  3. We run the 10,000 test queries against the cluster 10 times and collect the aggregated results.

The queries return the document ID only, and not the vector, to improve performance (code for this can be found on GitHub).

Parameter selection

One tricky aspect of running experiments is selecting the parameters. There are too many different combinations of parameters to test them all. That being said, we decided to create three configurations for HNSW and IVFPQ:

  • Optimize for search latency and memory
  • Optimize for recall
  • Fall somewhere in the middle

For each optimization strategy, we chose two configurations.

For HNSW, we can tune the m, ef_construction, and ef_search parameters to achieve our desired trade-off:

  • m – Controls the maximum number of edges a node in a graph can have. Because each node has to store all of its edges, increasing this value will increase the memory footprint, but also increase the connectivity of the graph, which will improve recall.
  • ef_construction – Controls the size of the candidate queue for edges when adding a node to the graph. Increasing this value will increase the number of candidates to consider, which will increase the index latency. However, because more candidates will be considered, the quality of the graph will be better, leading to better recall during search.
  • ef_search – Similar to ef_construction, it controls the size of the candidate queue for graph traversal during search. Increasing this value will increase the search latency, but will also improve the recall.

In general, we chose configurations that gradually increased the parameters, as detailed in the following table.

Config ID Optimization Strategy m ef_construction ef_search
hnsw1 Optimize for memory and search latency 8 32 32
hnsw2 Optimize for memory and search latency 16 32 32
hnsw3 Balance between latency, memory, and recall 16 128 128
hnsw4 Balance between latency, memory, and recall 32 256 256
hnsw5 Optimize for recall 32 512 512
hnsw6 Optimize for recall 64 512 512

For IVF, we can tune two parameters:

  • nlist – Controls the granularity of the partitioning. The recommended value for this parameter is a function of the number of vectors in the index. One thing to keep in mind is that the k-NN plugin builds one Faiss index per Lucene segment, there are several Lucene segments per shard, and several shards per OpenSearch index. For our estimates, we assumed that there would be 100 segments per shard and 24 shards, so about 420,000 vectors per Faiss index. With this many vectors per Faiss index, we estimated a good nlist value to be 4096 and kept it constant for the experiments.
  • nprobes – Controls the number of nlist buckets we search. Higher values generally lead to improved recalls at the expense of increased search latencies.

For PQ, we can tune two parameters:

  • m – Controls the number of partitions to break the vector into. The larger this value is, the better the encoding will approximate the original, at the expense of raising memory consumption.
  • code_size – Controls the number of bits to encode a sub-vector with. The larger this value is, the better the encoding approximates the original, at the expense of raising memory consumption. The max value is 8, so we kept it constant at 8 for all experiments.

The following table summarizes our strategies.

Config ID Optimization Strategy nprobes m (num_sub_vectors)
ivfpq1 Optimize for memory and search latency 8 8
ivfpq2 Optimize for memory and search latency 16 8
ivfpq3 Balance between latency, memory, and recall 32 16
ivfpq4 Balance between latency, memory, and recall 64 32
ivfpq5 Optimize for recall 128 16
ivfpq6 Optimize for recall 128 32

Additionally, we need to figure out how much training data to use for IVFPQ. In general, Faiss recommends between 30 and 256 training vectors per centroid for components involving k-Means training. For our configurations, the maximum k for k-Means is 4096, from the nlist parameter. With this formula, the recommended training set size is between 122,880 and 1,048,576 vectors, so we settled on 1 million vectors. The training data comes from the index vector dataset.

Lastly, for the index configurations, we need to select the shard count. It is recommended to keep shard sizes between 10 GB and 50 GB for OpenSearch. Experimentally, we determined that for HNSW, a good number would be 64 shards, and for IVFPQ, 42. Both index configurations were configured with one replica.

Cluster configuration

To run these experiments, we used Amazon OpenSearch Service running OpenSearch version 1.3 to create the clusters. We decided to use the r5 instance family, which provides a good trade-off between memory size and cost.

The number of nodes will depend on the amount of memory that can be used for the algorithm per node and the total amount of memory required by the algorithm. Having more nodes and more memory will generally improve performance, but for these experiments, we want to minimize cost. The amount of memory available per node is computed as memory_available = (node_memory - jvm_size) * circuit_breaker_limit, with the following parameters:

  • node_memory – The total memory of the instance.
  • jvm_size – The OpenSearch JVM heap size. Set to 32 GB.
  • circuit_breaker_limit – The native memory usage threshold for the circuit breaker. Set to 0.5.

Because HNSW and IVFPQ have different memory requirements, we estimate how much memory is needed for each algorithm and determine the required number of nodes accordingly.

For HNSW, with m = 64, the total memory required using the formula from the previous sections is approximately 2,252 GB. Therefore, with r5.12xlarge (384 GB of memory), memory_available is 176 GB and the total number of nodes required is about 13, which we round up to 16 for stability purposes.

Because the IVFPQ algorithm requires less memory, we can use a smaller instance type, the r5.4xlarge instance, which has 128 GB of memory. Therefore, the memory_available for the algorithm is 48 GB. The estimated algorithm memory consumption where m = 64 is a total of 193 GB and the total number of nodes required is four, which we round up to six for stability purposes.
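The node-count arithmetic in the preceding paragraphs can be reproduced with a short helper; the instance memory sizes and the rounding decisions are the assumptions stated in this section.

package clustersizing

// NodesEstimate applies memory_available = (node_memory - jvm_size) * circuit_breaker_limit
// and returns the fractional number of data nodes needed to hold the algorithm's
// native memory estimate. Callers round up (plus headroom) as described above.
func NodesEstimate(algorithmGB, nodeMemoryGB, jvmGB, circuitBreakerLimit float64) float64 {
    available := (nodeMemoryGB - jvmGB) * circuitBreakerLimit
    return algorithmGB / available
}

// HNSW cluster from the text: NodesEstimate(2252, 384, 32, 0.5) is roughly 12.8,
// which is rounded up to 16 nodes for stability.
// IVFPQ cluster: NodesEstimate(193, 128, 32, 0.5) is roughly 4, rounded up to 6.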

For both clusters, we use c5.2xlarge instance types as dedicated leader nodes. This will provide more stability for the cluster.

According to the AWS Pricing Calculator, for this particular use case, the cost per hour of the HNSW cluster is around $75 an hour, and the IVFPQ cluster costs around $11 an hour. This is important to remember when comparing the results.

Also, keep in mind that these benchmarks can be run using your own custom infrastructure, such as Amazon Elastic Compute Cloud (Amazon EC2), as long as the instance types and their memory sizes are equivalent.

Results

The following tables summarize the results from the experiments.

Test ID p50 Query latency (ms) p90 Query latency (ms) p99 Query latency (ms) Recall@10 Native memory consumption (GB)
hnsw1 9.1 11 16.9 0.84 1182
hnsw2 11 12.1 17.8 0.93 1305
hnsw3 23.1 27.1 32.2 0.99 1306
hnsw4 54.1 68.3 80.2 0.99 1555
hnsw5 83.4 100.6 114.7 0.99 1555
hnsw6 103.7 131.8 151.7 0.99 2055
Test ID p50 Query latency (ms) p90 Query latency (ms) p99 Query latency (ms) Recall@10 Native memory consumption (GB)
ivfpq1 74.9 100.5 106.4 0.17 68
ivfpq2 78.5 104.6 110.2 0.18 68
ivfpq3 87.8 107 122 0.39 83
ivfpq4 117.2 131.1 151.8 0.61 114
ivfpq5 128.3 174.1 195.7 0.40 83
ivfpq6 163 196.5 228.9 0.61 114

As you might expect, given how many more resources it uses, the HNSW cluster has lower query latencies and better recall. However, the IVFPQ indexes use significantly less memory.

For HNSW, increasing the parameters does in fact lead to better recall at the expense of latency. For IVFPQ, increasing m has the most significant impact on improving recall. Increasing nprobes improves the recall marginally, but at the expense of significant increases in latencies.

Conclusion

In this post, we covered different algorithms and techniques used to perform approximate k-NN search at scale (over 1 billion data points) within OpenSearch. As we saw in the previous benchmarks section, there isn't one algorithm or approach that optimizes for all the metrics at once. HNSW, IVF, and PQ each allow you to optimize for different metrics in your k-NN workload. When choosing the k-NN algorithm to use, first understand the requirements of your use case (How accurate does my approximate nearest neighbor search need to be? How fast should it be? What's my budget?) and then tailor the algorithm configuration to meet them.

You can take a look at the benchmarking code base we used on GitHub. You can also get started with approximate k-NN search today following the instructions in Approximate k-NN search. If you’re looking for a managed solution for your OpenSearch cluster, check out Amazon OpenSearch Service.


About the Authors

Jack Mazanec is a software engineer working on OpenSearch plugins. His primary interests include machine learning and search engines. Outside of work, he enjoys skiing and watching sports.

Othmane Hamzaoui is a Data Scientist working at AWS. He is passionate about solving customer challenges using Machine Learning, with a focus on bridging the gap between research and business to achieve impactful outcomes. In his spare time, he enjoys running and discovering new coffee shops in the beautiful city of Paris.

Scaling Git’s garbage collection

Post Syndicated from Taylor Blau original https://github.blog/2022-09-13-scaling-gits-garbage-collection/

At GitHub, we store a lot of Git data: more than 18.6 petabytes of it, to be precise. That's more than six times the size of the Library of Congress's digital collections[1]. Most of that data comes from the contents of your repositories: your READMEs, source files, tests, licenses, and so on.

But some of that data is just junk: some bit of your repository that is no longer important. It could be a file that you force-pushed over, or the contents of a branch you deleted without merging. In general, this slice of repository data is anything that isn’t contained in at least one of your repository’s branches or tags. Normally, we don’t remove any unreachable data from repositories. But occasionally we do, usually to remove sensitive data, like passwords or SSH keys from your repository’s history.

The process for permanently removing unreachable objects from a repository’s history has a history of causing problems within GitHub, especially in busy repositories or ones with lots of objects. In this post, we’ll talk about what those problems were, why we had them, the tools we built to address them, and some interesting ways we’ve built on top of them. All of this work was contributed upstream to the open-source Git project. Let’s dive in.

Object reachability

In this post, we’re going to talk a lot about “reachable” and “unreachable” objects. You may have heard these terms before, but perhaps only casually. Since we’re going to use them a lot, it will help to have more concrete definitions of the two. An object is reachable when there is at least one branch or tag along which you can reach the object in question. An object is “reached” by crawling through history—from commits to their parents, commits to their root trees, and trees to their sub-trees and blobs. An object is unreachable when no such branch or tag exists.

Sample object graph showing commits, with arrows connecting them to their parents. A few commits have boxes that are connected to them, which represent the tips of branches and tags.

Here, we’re looking at a sample object graph. For simplicity, I’m only showing commits (identified here as circles). Arrows point from commits to their parent(s). A few commits have boxes that are connected to them, which represent the tips of branches and tags.

The parts of the graph that are colored blue are reachable, and the red parts are considered unreachable. You’ll find that if you start at any branch or tag, and follow its arrows, that all commits along that path are considered reachable. Note that unreachable commits which have reachable ones as parents (in our diagram above, anytime an arrow points from a red commit to a blue one) are still considered unreachable, since they are not contained within any branch or tag.

Unreachable objects can also appear in clusters that are totally disconnected from the main object graph, as indicated by the two lone red commits towards the right-hand side of the image.

Pruning unreachable objects

Normally, unreachable objects stick around in your repository until they are either automatically or manually cleaned up. If you’ve ever seen the message, “Auto packing the repository for optimum performance,” in your terminal, Git is doing this for you in the background. You can also trigger garbage collection manually by running:

$ git gc --prune=<date>

That tells Git to trigger a garbage collection and remove unreachable objects. But observant readers might notice the optional <date> parameter to the --prune flag. What is that? The short answer is that Git allows you to restrict which objects get permanently deleted based on the last time they were written. But to fully explain, we first need to talk a little bit about a race condition that can occur when removing objects from a Git repository.

Object deletion raciness

Normally, deleting an unreachable object from a Git repository should not be a notable event. Since the object is unreachable, it’s not part of any branch or tag, and so deleting it doesn’t change the repository’s reachable state. In other words, removing an unreachable object from a repository should be as simple as:

  1. Repacking the repository to remove any copies of the object in question (and recomputing any deltas that are based on that object).
  2. Removing any loose copies of the object that happen to exist.
  3. Updating any additional indexes (like the multi-pack index, or commit-graph) that depend on the (now stale) packs that were removed.

The racy behavior occurs when a repository receives one or more pushes during this process. The main culprit is that the server advertises its objects at one point in time, but processes the objects the client sent based on that advertisement at a later point in time.

Consider what happens if Git decides (as part of running a git gc operation) that it wants to delete some unreachable object C. If C becomes reachable by some background reference update (e.g., an incoming push that creates a new branch pointing at C), it will then be advertised to any incoming pushes. If one of these pushes happens before C is actually removed, then the repository can end up in a corrupt state. Since the pusher will assume C is reachable (since it was part of the object advertisement), it is allowed to include objects that either reference or depend on C, without sending C itself. If C is then deleted while other reachable parts of the repository depend on it, then the repository will be left in a corrupt state.

Suppose the server receives that push before proceeding to delete C. Then, any objects from the incoming push that are related to it would be immediately corrupt. Reachable parts of the repository that reference C are no longer closed[2] over reachability since C is missing. And any objects that are stored as a delta against C can no longer be inflated for the same reason.

Figure demonstrating that one side (responsible for garbage collecting the repository) decides that a certain object is unreachable, while another side makes that object reachable and accepts an incoming push based on that object—before the original side ultimately deletes that (now-reachable) object—leaving the repository in a corrupt state.

In case that was confusing, the above figure should help clear things up. The general idea is that one side (responsible for garbage collecting the repository) decides that a certain object is unreachable, while another side makes that object reachable and accepts an incoming push based on that object—before the original side ultimately deletes that (now-reachable) object—leaving the repository in a corrupt state.

Mitigating object deletion raciness

Git does not completely prevent this race from happening. Instead, it works around the race by gradually expiring unreachable objects based on the last time they were written. This explains the mysterious --prune=<date> option from a few sections ago: when garbage collecting a repository, only unreachable objects which haven’t been written since <date> are removed. Anything else (that is, the set of objects that have been written at least once since <date>) are left around.

The idea is that objects which have been written recently are more likely to become reachable again in the future, and would thus be more likely to be susceptible to the kind of race we talked about above if they were to be pruned. Objects which haven’t been written recently, on the other hand, are proportionally less likely to become reachable again, and so they are safe (or, at least, safer) to remove.

This idea isn’t foolproof, and it is certainly possible to run into the race we talked about earlier. We’ll discuss one such scenario towards the end of this post (along with the way we worked around it). But in practice, this strategy is simple and effective, preventing most instances of potential repository corruption.

Storing loose unreachable objects

But one question remains: how does Git keep track of the age of unreachable objects which haven’t yet aged out of the repository?

The answer, though simple, is at the heart of the problem we’re trying to solve here. Unreachable objects which have been written too recently to be removed from the repository are stored as loose objects, the individual object files stored in .git/objects. Storing these unreachable objects individually means that we can rely on their stat() modification time (hereafter, mtime) to tell us how recently they were written.

But this leads to an unfortunate problem: if a repository has many unreachable objects, and a large number of them were written recently, they must all be stored individually as loose objects. This is undesirable for a number of reasons:

  • Pairs of unreachable objects that share a vast majority of their contents must be stored separately, and can’t benefit from the kind of deduplication offered by packfiles. This can cause your repository to take up much more space than it otherwise would.
  • Having too many files (especially too many in a single directory) can lead to performance problems, including exhausting your system’s available inodes in the extreme case, leaving you unable to create new files, even if there may be space available for them.
  • Any Git operation which has to scan through all loose objects (for example, git repack -d, which creates a new pack containing just your repository’s unpacked objects) will slow down as there are more files to process.

It’s tempting to want to store all of a repository’s unreachable objects into a single pack. But there’s a problem there, too. Since all of the objects in a single pack share the same mtime (the mtime of the *.pack file itself), rewriting any single unreachable object has the effect of updating the mtimes of all of a repository’s unreachable objects. This is because Git optimizes out object writes for packed objects by simply updating the mtime of any pack(s) which contain that object. This makes it nearly impossible to expire any objects out of the repository permanently.

Cruft packs

To solve this problem, we turned to a long-discussed idea on the Git mailing list: cruft packs. The idea is simple: store an auxiliary list of mtime data alongside a pack containing just unreachable objects. To garbage collect a repository, Git places the unreachable objects in a pack. That pack is designated as a “cruft pack” because Git also writes the mtime data corresponding to each object in a separate file alongside that pack. This makes it possible to update the mtime of a single unreachable object without changing the mtimes of any other unreachable object.

To give you a sense of what this looks like in practice, here’s a small example:

a pack of Git objects (represented by rectangles of different colors)

The above figure shows a pack of Git objects (represented by rectangles of different colors), its pack index, and the new .mtimes file. Together, these three files make up what Git calls a “cruft pack,” and it’s what allows Git to store unreachable objects together, without needing a single file for each object.

So, how do they work? Git uses the cruft pack to store a collection of object mtimes together in an array stored in the *.mtimes file. To discover the mtime for an individual object in a pack, Git first does a binary search on the pack's index to find that object's lexicographic position. Git then reads the 4-byte, unsigned integer at that same position in the *.mtimes file. The *.mtimes file is simply a table of integers, one for each object in the associated *.pack file, each representing that object's mtime as a number of seconds since the Unix epoch.
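As a rough sketch of that lookup (not Git's actual code), the structure is a binary search over the pack index's sorted object IDs followed by a fixed-width read at the same position in the mtimes table:

package cruftsketch

import "sort"

// PackIndex pairs a pack's objects, sorted by object ID as in a pack *.idx
// file, with the parallel table of per-object mtimes from the *.mtimes file.
type PackIndex struct {
    ObjectIDs []string // sorted lexicographically
    Mtimes    []uint32 // seconds since the Unix epoch, one entry per object
}

// MtimeOf mirrors the lookup described above: binary search for the object's
// lexicographic position, then read the 4-byte entry at that position.
func (p *PackIndex) MtimeOf(oid string) (uint32, bool) {
    i := sort.SearchStrings(p.ObjectIDs, oid)
    if i >= len(p.ObjectIDs) || p.ObjectIDs[i] != oid {
        return 0, false // object not in this pack
    }
    return p.Mtimes[i], true
}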

Crucially, this makes it possible to store all of a repository’s unreachable objects together in a single pack, without having to store them as individual loose objects, bypassing all of the drawbacks we discussed in the last section. Moreover, it allows Git to update the mtime of a single unreachable object, without inadvertently triggering the same update across all unreachable objects.

Since Git doesn’t portably support updating a file in place, updating an object’s mtime (a process which Git calls “freshening”) takes place by writing a separate copy of that object out as a loose file. Of course, if we had to freshen all objects in a cruft pack, we would end up in a situation no better than before. But such updates tend to be unlikely in practice, and so writing individual copies of a small handful of unreachable objects ends up being a reasonable trade off most of the time.

Generating cruft packs

Now that we have introduced the concept of cruft packs, the question remains: how does Git generate them?

Despite being called git gc (short for “garbage collection”), running git gc does not always result in deleting unreachable objects. If you run git gc --prune=never, then Git will repack all reachable objects and move all unreachable objects to the cruft pack. If, however, you run git gc --prune=1.day.ago, then Git will repack all reachable objects, delete any unreachable objects that are older than one day, and repack the remaining unreachable objects into the cruft pack.

This is because of Git’s treatment of unreachable parts of the repository. While Git only relies on having a reachability closure over reachable objects, Git’s garbage collection routine tries to leave unreachable parts of the repository intact to the extent possible. That means if Git encounters some unreachable cluster of objects in your repository, it will either expire all or none of those objects, but never some subset of them.

We’ll discuss how cruft packs are generated with and without object expiration in the two sections below.

Cruft packs without object expiration

When generating a cruft pack with an object expiration of --prune=never, our only goal is to collect all unreachable objects together into a single cruft pack. Broadly speaking, this occurs in three steps:

  1. Starting at all of the branches and tags, generate a pack containing only reachable objects.
  2. Looking at all other existing packs, enumerate the list of objects which don’t appear in the new pack of reachable objects. Create a new pack containing just these objects, which are unreachable.
  3. Delete the existing packs.

If any of that was confusing, don’t worry: we’ll break it down here step by step. The first step to collecting a repository’s unreachable objects is to figure out the parts of it that are reachable. If you’ve ever run git repack -A, this is exactly how that command works. Git starts a reachability traversal beginning at each of the branches and tags in your repository. Then it traverses back through history by walking from commits to their parents, trees to their sub-trees, and so on, marking every object that it sees along the way as reachable.

Demonstration of how Git walks through a commit graph, from commit to parent

Here, we’re showing the same commit graph from earlier in the post. Git’s goal at this point is simply to mark every reachable object that it sees, and it’s those objects that will become the contents of a new pack containing just reachable objects. Git starts by examining each reference, and walking from a commit to its parents until it either finds a commit with no parents (indicating the beginning of history), or a commit that it has already marked as reachable.

In the above, the commit being walked is highlighted in dark blue, and any commits marked as reachable are marked in green. At each step, the commit currently being visited gets marked as reachable, and its parent(s) are visited in the next step. By repeating this process among all branches and tags, Git will mark all reachable objects in the repository.

We can then use this set of objects to produce a new pack containing all reachable objects in a repository. Next, Git needs to discover the set of objects that it didn’t mark in the previous stage. A reasonable first approach might be to store the IDs of all of a repository’s objects in a set, and then remove them one by one as we mark objects reachable along our walk.

But this approach tends to be impractical, since each object will require a minimum of 20 bytes of memory in order to insert into this set. At the time of writing, the linux.git repository contains nearly nine million objects, which would require nearly 180 MB of memory just to write out all of their object IDs.

Instead, Git looks through all of the objects in all of the existing packs, checking whether or not each is contained in the new pack of reachable objects. Any object found in an existing pack which doesn’t appear in the reachable pack is automatically included in the cruft pack.

Animation demonstrating how  Git looks through all of the objects in all of the existing packs, checking whether or not each is contained in the new pack of reachable objects.

Here, we’re going one by one among all of the pre-existing packs (here, labeled as pack-abc.pack, pack-def.pack, and pack-123.pack) and inspecting their objects one at a time. We first start with object c8, looking through the reachable pack (denoted as pack-xyz.pack) to see if any of its objects match c8. Since none do, c8 is marked unreachable (which we represent by filling the object with a red background).

This process is repeated for each object in each existing pack. Once this process is complete, all objects that existed in the repository before starting a garbage collection are marked either green, or red (indicating that they are either reachable, or unreachable, respectively).
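A simplified sketch of this enumeration step might look like the following; the Pack interface is a stand-in for Git's on-disk packs and their indexes, not real Git internals.

package cruftsketch

// Pack is a minimal stand-in for an on-disk packfile: it can report whether
// it contains an object (in Git this is a binary search over the pack index,
// not an in-memory set) and enumerate the objects it holds.
type Pack interface {
    Contains(oid string) bool
    Objects() []string
}

// CruftCandidates walks every object in every pre-existing pack and keeps the
// ones that are absent from the freshly written pack of reachable objects,
// mirroring the enumeration step described above.
func CruftCandidates(reachable Pack, existing []Pack) []string {
    var cruft []string
    seen := map[string]bool{}
    for _, p := range existing {
        for _, oid := range p.Objects() {
            if seen[oid] || reachable.Contains(oid) {
                continue
            }
            seen[oid] = true
            cruft = append(cruft, oid)
        }
    }
    return cruft
}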

Git can then use the set of unreachable objects to generate a new pack, like below:

A set of labeled Git packs

This pack (on the far right of the above image, denoted pack-cruft.pack) contains exactly the set of unreachable objects present in the repository at the beginning of garbage collection. By keeping track of each unreachable object’s mtime while marking existing objects, Git has enough data to write out a *.mtimes file in addition to the new pack, leaving us with a cruft pack containing just the repository’s unreachable objects.

Here, we’re eliding some technical details about keeping track of each object’s mtime along the way, for brevity and simplicity. The routine is straightforward, though: each time we discover an object, we mark its mtime based on how we discovered the object.

  • If an object is found in a packfile, it inherits its mtime from the packfile itself.
  • If an object is found as a loose object, its mtime comes from the loose object file.
  • And if an object is found in an existing cruft pack, its mtime comes from reading the cruft pack’s *.mtimes file at the appropriate index.

If an object is seen more than once (e.g., an unreachable object stored in a cruft pack was freshened, resulting in another loose copy of the object), the mtime which is ultimately recorded in the new cruft pack is the most recent mtime of all of the above.

Cruft packs with object expiration

Generating cruft packs where some objects are going to expire out of the repository follows a similar, but slightly trickier approach than in the non-expiring case.

Doing a garbage collection with a fixed expiration is known as “pruning.” This essentially boils down to asking Git to pack the contents of a repository into two packfiles: one containing reachable objects, and another containing any unreachable objects. But, it also means that for some fixed expiration date, any unreachable objects which have an mtime older than the expiration date are removed from the repository entirely.

The difficulty in this case stems from a fact briefly mentioned earlier in this post, which is that Git attempts to prevent connected clusters of unreachable objects from leaving the repository if some, but not all, of their objects have aged out.

To make things clearer, here’s an example. Suppose that a repository has a handful of blob objects, all connected to some tree object, and all of these objects are unreachable. Assuming that they’re all old enough, then they will all expire together: no big deal. But what if the tree isn’t old enough to be expired? In this case, even though the blobs connected to it could be expired on their own, Git will keep them around since they’re connected to a tree with a sufficiently recent mtime. Git does this to preserve the repository’s reachability closure in case that tree were to become reachable again (in which case, having the tree and its blobs becomes important).

To ensure that it preserves any unreachable objects which are reachable from recent objects, Git handles this case of cruft pack generation slightly differently. At a high level, it:

  1. Generates a candidate list of cruft objects, using the same process as outlined in the previous section.
  2. Then, to determine the actual list of cruft objects to keep around, it performs a reachability traversal using all of the candidate cruft objects, adding any object it sees along the way to the cruft pack.

To make things a little clearer, here’s an example:

Animation of Git performing  a reachability traversal

After determining the set of unreachable objects (represented above as colored red) Git does a reachability traversal from each entry point into the graph of unreachable objects. Above, commits are represented by circles, trees by rectangles, and tree entries as rows within the larger rectangles. The mtimes are written below each commit.

For now, let’s assume our expiration date is d, so any object whose mtime is greater than d must stay (despite being unreachable), and anything older than d can be pruned. Git traverses through each entry and asks, “Is this object old enough to be pruned?” When the answer is “yes” Git leaves the object alone and moves on to the next entry point. When the answer is “no,” however, (ie., Git is looking at an unreachable object whose mtime is too recent to prune), Git marks that object as “rescued” (indicated by turning it green) and then continues its traversal, marking any reachable objects as rescued.

Objects that are rescued during this pass are written to the cruft pack, preserving their existence in the repository, leaving them to either continue to age, or have their mtimes updated before the next garbage collection.

Let's take a closer look at the example above. Git starts by looking at object C(1,1), and notices that its mtime is d+5, meaning that (since it happens after our expiration time, d) it is too new to expire. That causes Git to start a reachability traversal beginning at C(1,1), rescuing every object it encounters along the way. Since many objects are shared between multiple commits, rescuing an object from a more recent part of the graph often ends up marking older objects as rescued, too.

After finishing the rescuing pass focused on C(1,1), Git moves on to look at C(0,2). But this commit’s mtime is d-10, which is before our expiration cutoff of d, meaning that it is safe to remove. Git can skip looking at any objects reachable from this commit, since none of them will be rescued.

Finally, Git looks at another connected cluster of the unreachable object graph, beginning at C(3,1). Since this object has an mtime of d+10, it is too new to expire, so Git performs another reachability traversal, rescuing it and any objects reachable from it.

Notice that in the final graph state, the main cluster of commits (the one beginning with C(0,2)) is only partially rescued. In fact, only the objects necessary to retain a reachability closure over the rescued objects among that cluster are saved from being pruned. So even though, for example, commit C(2,1) has only part of its tree entries rescued, that is OK, since C(2,1) itself will be pruned (hence any non-rescued tree entries connected to it are unimportant and will also be pruned).
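A heavily simplified sketch of this rescue pass is shown below. The object type is hypothetical, and a real implementation walks commits, trees, and blobs rather than a generic child list.

package cruftsketch

// UnreachableObject is a hypothetical node in the graph of unreachable
// objects: commits point at parents and trees, trees at sub-trees and blobs.
type UnreachableObject struct {
    Mtime    int64
    Children []*UnreachableObject
}

// Rescue walks the candidate cruft objects and keeps (rescues) every object
// that is too recent to expire, plus everything reachable from it, so that
// the cruft pack stays closed under reachability.
func Rescue(candidates []*UnreachableObject, expireBefore int64) map[*UnreachableObject]bool {
    rescued := map[*UnreachableObject]bool{}
    var walk func(o *UnreachableObject)
    walk = func(o *UnreachableObject) {
        if rescued[o] {
            return
        }
        rescued[o] = true
        for _, c := range o.Children {
            walk(c)
        }
    }
    for _, o := range candidates {
        // Objects written at or after the cutoff are too new to prune; rescue
        // them and everything they reference. Older entry points are skipped,
        // though they may still be rescued via a newer object that points at them.
        if o.Mtime >= expireBefore {
            walk(o)
        }
    }
    return rescued
}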

Putting it all together

Now that Git can generate a cruft pack and perform garbage collection on a repository with or without pruning objects, it was time to put all of the pieces together and submit the patches to the open-source Git project.

Other Git sub-commands, like repack and gc, needed to learn about cruft packs, and gain command-line flags and configuration knobs in order to opt in to the new behavior. With all of the pieces in place, you can now trigger a garbage collection by running either:

$ git gc --prune=1.day.ago --cruft

or

$ git repack -d --cruft --cruft-expiration=1.day.ago

to repack your repository into a reachable pack, and a cruft pack containing unreachable objects whose mtimes are within the past day. More details on the new command-line options and configuration can be found here, here, here, and here.

GitHub submitted the entirety of the patches that comprise cruft packs to the open-source Git project, and the results were released in v2.37.0. That means that you can use the same tools as what we run at GitHub on your own laptop, to run garbage collection on your own repositories.

For those curious about the details, you can read the complete thread on the mailing list archive here.

Cruft packs at GitHub

After a lengthy process of testing to ensure that using cruft packs was safe to carry out across all repositories on GitHub, we deployed and enabled the feature across all repositories. We kept a close eye on repositories with large numbers of unreachable objects, since the process of breaking any deltas between reachable and unreachable objects (since the two are now stored in separate packs, and object deltas cannot cross pack boundaries) can cause the initial cruft pack generation to take a long time. A small handful of repositories with many unreachable objects needed more time to generate their very first cruft pack. In those instances, we generated their cruft packs outside of our normal repository maintenance jobs to avoid triggering any timeouts.

Now, every repository on GitHub and in GitHub Enterprise (in version 3.3 and newer) uses cruft packs to store their unreachable objects. This has made garbage collecting repositories (especially busy ones with many unreachable objects) tractable where it often required significant human intervention before. Before cruft packs, many repositories which required clean up were simply out of our reach because of the possibility of creating an explosion of loose objects which could derail performance for all repositories stored on a fileserver. Now, garbage collecting a repository is a simple task, no matter its size or scale.

During our testing, we ran garbage collection on a handful of repositories, and got some exciting results. For repositories that regularly force-push a single commit to their main branch (leaving a majority of their objects unreachable), their on-disk size dropped significantly. The most extreme example we found during testing was a repository that shrank from 186 gigabytes to only 2 gigabytes.

On github/github, GitHub’s main codebase, we were able to shrink the repository from around 57 gigabytes to 27 gigabytes. Even though these savings are more modest, the real payoff is in the objects we no longer have to store. Before garbage collecting, each replica of this repository had nearly 60 million objects, including years of test-merges, force-pushes, and all kinds of sources of unreachable objects. Each of these objects contributed to the I/O cost of repacking this repository. After garbage collecting, only 11.8 million objects remained. Since each object in a repository requires around 150 bytes of memory during repacking, we save around 7 gigabytes of RAM during each maintenance routine.

Limbo repositories

Even though we can easily garbage collect a repository of any size, we still have to navigate the inherent raciness that we described at the beginning of this post.

At GitHub, our approach has been to make this situation easy to recover from automatically instead of preventing it entirely (which would require significant surgery to much of Git’s code). To do this, our approach is to create a “limbo” repository whenever a pruning garbage collection is done. Any objects which get expired from the main repository are stored in a separate pack in the limbo repository. Then, the process to garbage collect a repository looks something like:

  1. Generate a cruft pack of recent unreachable objects in the main repository.
  2. Generate a second cruft pack of expired unreachable objects, stored outside of the main repository, in the “limbo” repository.
  3. After garbage collection has completed, run a git fsck in the main repository to detect any object corruption.
  4. If any objects are missing, recover them by copying them over from the limbo repository.

The process for generating a cruft pack of expired unreachable objects boils down to creating another cruft pack (using exactly the same process we described earlier in this post), with two caveats:

  • The expiration cutoff is set to “never” since we want to keep around any objects which we did expire in the previous step.
  • The original cruft pack is treated as a pack containing reachable objects since we want to ignore any unreachable objects which were too recent to expire (and, thus, are stored in the cruft pack in the main repository).

We have used this idea at GitHub with great success, and now treat garbage collection as a hands-off process from start to finish. The patches to implement this approach are available as a preliminary RFC on the Git mailing list here.

Thank you

This work would not have been possible without generous review and collaboration from engineers from within and outside of GitHub. The Git Systems team at GitHub were great to work with while we developed and deployed cruft packs. Special thanks to Torsten Walter, and Michael Haggerty, who played substantial roles in developing limbo repositories.

Outside of GitHub, this work would not have been possible without careful review from the open-source Git community, especially Derrick Stolee, Jeff King, Jonathan Tan, Jonathan Nieder, and Junio C Hamano. In particular, Jeff King contributed significantly to the original development of many of the ideas discussed above.

Notes


  1. It’s true. According to the Library of Congress themselves, their digital collection amounts to more than 3 petabytes in size [source]. The 18.6 petabytes we store at GitHub actually overcounts by a factor of five, since we store a handful of copies of each repository. In reality, it’s hard to provide an exact number, since data is de-duplicated within a fork network, and is stored compressed on disk. Either way you slice it, it’s a lot of data: you get the point. 
  2. Meaning that for any reachable object part of some repository, any objects reachable from it are also contained in that repository. 

Contributing to open source at GitHub

Post Syndicated from Ariel Deitcher original https://github.blog/2022-09-06-contributing-to-open-source-at-github/

Ariel Deitcher (@mntlty) is a Senior Software Engineer at GitHub, working on Pull Requests and Merge Queue (beta). In this post, he shares the challenges he encountered finding his path to contributing to open source, what it was like contributing to open source at GitHub, and some of the lessons he learned.

Getting started with open source can be overwhelming

As a computer science graduate in 2011 and searching for my first tech job, I read that contributing to an open source project could help. It was a great way to build skills, make industry connections, and gain practical experience with a real-world problem. Perfect, I thought, I’ll just pick an open source project on this new website called GitHub, and, well, actually I wasn’t sure how to do that. Finding that “Goldilocks” project (where the size, language(s), domain, and community felt just right) was a lot harder than I thought, and I didn’t feel self-confident enough to make much progress. Overwhelmed, I decided the timing wasn’t right but resolved to try again someday.

It bugged me that the contribution graph on my GitHub profile remained stubbornly empty, as all the code I had committed lived in private repositories. That changed in 2016 when contributions to private repositories could be shown on my profile, but my contributions to open source had not. Between my family, work, life, and the explosive growth in projects to choose from, making that first contribution to open source felt more daunting than ever.

The opportunity to contribute to open source at GitHub

Fast forward to 2021. I read Working in Public: The Making and Maintenance of Open Source Software by Nadia Eghbal while interviewing at GitHub. I was especially captivated by the Stadium model of open source projects, where a small number of maintainers and occasional contributors are vastly outnumbered by a project’s users. This aligned with my mental model of open source projects, where a few performers on a digital stage would conjure feats of coding wizardry. I could only imagine how vulnerable working in public could be, and hoped it would feel less intimidating working at GitHub.

I joined the GitHub team building Merge Queue (beta), a feature which helps users coordinate their merges to a protected branch, ensures that changes are up to date, and that all required checks pass before automatically merging a pull request. Early on, I shared my long-held goal of contributing code to an open source project with my manager, and discussed the GitHub CLI, an open source tool written in Go which lets users interact with GitHub from the command line, as a possible candidate.

While building Merge Queue, our team carefully integrated it with GitHub’s many APIs and tools, checking each one for compatibility and correctness. Testing different scenarios of merging a pull request with the GitHub CLI, I saw that once a Merge Queue was required, running the CLI command gh pr merge would fail in most cases. The Merge Queue was correctly preventing direct merges to its protected branch, and so I began scoping out what changes the CLI might need to support Merge Queue.

As I didn’t have write access to the CLI repository, I forked it, started a new Codespace, and spent some time getting familiar with the CLI’s contributing guidelines and code. Wanting to minimize my changes, I targeted a few places in the merge command to modify. When I was ready, I pushed a commit to my fork and opened a pull request to share with the CLI maintainers. I expected that I would provide support but defer to them for the final implementation.

As I reviewed my pull request with the CLI maintainers, it quickly became clear that my changes were hard to reason about. The merge command had accumulated sufficient technical debt that adding more complexity to it was risky. The team asked if I could refactor the merge command in an initial pull request and follow up with a subsequent pull request for the Merge Queue changes after the first was merged. What I had thought would be a rough guide of changes for the CLI maintainers was, in fact, the opportunity I had been looking for to contribute to open source at GitHub. I confirmed that my manager was on board with this increased commitment, and was ready to get started.

Refactoring the merge command and adding Merge Queue support

I set out to refactor the merge command with a focus on simplicity, readability, and early returns rather than deeply nested conditionals. The existing test coverage gave me a confidence boost as I began stepping through the code, copying each section into a separate file for later reference, and writing comments which I felt captured the intent of the removed section. I then grouped related Git and API operations, consolidated common code into appropriately named functions and variables, trimmed unreachable code paths, created a MergeContext struct to encapsulate state, and leaned into Go’s explicit error returns, all of which gave the code a more linear and consistent structure.

As an example, the mergeRun function, which is the heart of the merge command, went from over 220 lines to just 30:

func mergeRun(opts *MergeOptions) error {
    ctx, err := NewMergeContext(opts)
    if err != nil {
        return err
    }

    // no further action is possible when disabling auto merge
    if opts.AutoMergeDisable {
        return ctx.disableAutoMerge()
    }

    ctx.warnIfDiverged()

    if err := ctx.canMerge(); err != nil {
        return err
    }

    if err := ctx.merge(); err != nil {
        return err
    }

    if err := ctx.deleteLocalBranch(); err != nil {
        return err
    }

    if err := ctx.deleteRemoteBranch(); err != nil {
        return err
    }

    return nil
}
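
To make the shape of that refactor concrete, here is a minimal, self-contained Go sketch of the “context struct plus early returns” pattern described above. The names (prState, MergeContext, run) and the simplified state are illustrative assumptions only and do not mirror the real GitHub CLI implementation:

package main

import (
    "errors"
    "fmt"
)

// prState is a stand-in for the pull request data the command needs.
type prState struct {
    merged    bool
    mergeable bool
}

// MergeContext encapsulates the state of a single merge attempt, so each
// step can be a small method that either succeeds or returns an error.
type MergeContext struct {
    pr           prState
    deleteBranch bool
}

func (ctx *MergeContext) canMerge() error {
    if ctx.pr.merged {
        return errors.New("pull request is already merged")
    }
    if !ctx.pr.mergeable {
        return errors.New("pull request is not mergeable")
    }
    return nil
}

func (ctx *MergeContext) merge() error {
    // A real tool would call the GitHub API here; the sketch just flips local state.
    ctx.pr.merged = true
    return nil
}

func (ctx *MergeContext) deleteLocalBranch() error {
    if !ctx.deleteBranch {
        return nil
    }
    // Placeholder for deleting the local branch with git.
    return nil
}

// run mirrors the linear, early-return structure of the refactored command:
// each step runs in order, and the first failure stops the flow.
func run(ctx *MergeContext) error {
    if err := ctx.canMerge(); err != nil {
        return err
    }
    if err := ctx.merge(); err != nil {
        return err
    }
    return ctx.deleteLocalBranch()
}

func main() {
    ctx := &MergeContext{pr: prState{mergeable: true}, deleteBranch: true}
    if err := run(ctx); err != nil {
        fmt.Println("merge failed:", err)
        return
    }
    fmt.Println("merged successfully")
}

Each step reads top to bottom, and the first failure short-circuits the rest, which is what keeps the refactored mergeRun so compact.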

When I was finished, I opened a pull request from my fork to the CLI repository, and was blown away by how supportive the code review process was. After a few rounds of feedback, my code was merged and ready to ship in the next release. I was an open source contributor at GitHub!

When I returned to my fork, my original Merge Queue changes were completely out of date. In fact, much of the code I had on my branch no longer existed on the CLI’s trunk branch. Fortunately, I was now intimately familiar with the merge command and was able to make the Merge Queue changes and tests in a subsequent pull request quickly and with confidence.

Lessons learned

Looking back, I learned that searching for the right open source project on my own, trying to create time outside of work, and context switching from my existing projects were obstacles I could not overcome. Instead, the key for me was to find an open source project that was important to what I was already working on, and that I was accountable for. If this sounds familiar, consider asking your manager if you can devote some time to work on an issue in an open source project that you or your team rely on. It’s much easier to get started with an open source project you know and can align with work you’re already committed to. I recognize how fortunate I was to be in the right place at the right time, and with the right support from my manager, but it wasn’t easy.

Many people I know struggle with impostor syndrome, and working in public made me even more aware of mine. I am learning to accept that even though my commits aren’t perfect, and even though I’m afraid of being judged for creating bugs like this regression, which will be discoverable forever, I should still contribute. Despite these challenges, I enjoy picking up new issues in the CLI labeled “help wanted (contributions welcome)” whenever I can, and hope you will too!

AWS Week in Review – September 5, 2022

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-september-5-2022/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

As a new week begins, let’s quickly look back at the most significant AWS news from the previous seven days.

Last Week’s Launches
Here are the launches that got my attention last week:

AWS announces open-sourced credentials-fetcher to simplify Microsoft AD access from Linux containers. You can find more in the What’s New post.

AWS Step Functions now has 14 new intrinsic functions that make it easier to perform data processing tasks such as array manipulation, JSON object manipulation, and math operations within your workflows, without having to invoke downstream services or add Task states.

AWS SAM CLI esbuild support is now generally available. You can now use esbuild in the SAM CLI build workflow for your JavaScript applications.

Amazon QuickSight launches a new user interface for dataset management that replaces the existing modal dialog with a full-page experience, providing a clearer breakdown of dataset management categories.

AWS GameKit adds Unity support. With this release, you can integrate cloud-based game features into Win64, macOS, Android, or iOS games from both the Unreal and Unity engines with just a few clicks.

AWS and VMware announce VMware Cloud on AWS integration with Amazon FSx for NetApp ONTAP. Read more in Veliswa’s blog post.

The AWS Region in the United Arab Emirates (UAE) is now open. More info in Marcia’s blog post.

[Image: View of Abu Dhabi in the United Arab Emirates]

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
A few more blog posts you might have missed:

Easy analytics and cost-optimization with Amazon Redshift Serverless – Four different use cases of Redshift Serverless are discussed in this post.

Building cost-effective AWS Step Functions workflows – In this blog post, Ben explains the difference between Standard and Express Workflows, including costs, migrating from Standard to Express, and some interesting ways of using both together.

How to subscribe to the new Security Hub Announcements topic for Amazon SNS – You can now receive updates about new Security Hub services and features, newly supported standards and controls, and other Security Hub changes.

Deploying AWS Lambda functions using AWS Controllers for Kubernetes (ACK) – With the ACK service controller for AWS Lambda, you can provision and manage Lambda functions with kubectl and custom resources.

For AWS open-source news and updates, here’s the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more.

Upcoming AWS Events
Depending on where you are on this planet, there are many opportunities to meet and learn:

AWS Summits – Come together to connect, collaborate, and learn about AWS. Registration is open for the following in-person AWS Summits: Ottawa (September 8), New Delhi (September 9), Mexico City (September 21–22), Bogotá (October 4), and Singapore (October 6).

AWS Community Days – AWS Community Day events are community-led conferences to share and learn with one another. In September, the AWS community in the US will run events in the Bay Area, California (September 9) and Arlington, Virginia (September 30). In Europe, Community Day events will be held in October. Join us in Amersfoort, Netherlands (October 3), Warsaw, Poland (October 14), and Dresden, Germany (October 19).

That’s all from me for this week. Come back next Monday for another Week in Review!

Danilo