Tag Archives: Scale

Scaling merge-ort across GitHub

Post Syndicated from Matt Cooper original https://github.blog/2023-07-27-scaling-merge-ort-across-github/

At GitHub, we perform a lot of merges and rebases in the background. For example, when you’re ready to merge your pull request, we already have the resulting merge assembled. Speeding up merge and rebase performance saves both user-visible time and backend resources. Git has recently learned some new tricks which we’re using at scale across GitHub. This post walks through what’s changed and how the experience has improved.

Our requirements for a merge strategy

There are a few non-negotiable parts of any merge strategy we want to employ:

  • It has to be fast. At GitHub’s scale, even a small slowdown is multiplied by the millions of activities going on in repositories we host each day.
  • It has to be correct. For merge strategies, what’s “correct” is occasionally a matter of debate. In those cases, we try to match what users expect (which is often whatever the Git command line does).
  • It can’t check out the repository. There are both scalability and security implications to having a working directory, so we simply don’t.

Previously, we used libgit2 to tick these boxes: it was faster than Git’s default merge strategy and it didn’t require a working directory. On the correctness front, we either performed the merge or reported a merge conflict and halted. However, because of additional code related to merge base selection, sometimes a user’s local Git could easily merge what our implementation could not. This led to a steady stream of support tickets asking why the GitHub web UI couldn’t merge two files when the local command line could. We weren’t meeting those users’ expectations, so from their perspective, we weren’t correct.

A new strategy emerges

Two years ago, Git learned a new merge strategy, merge-ort. As the author details on the mailing list, merge-ort is fast, correct, and addresses many shortcomings of the older default strategy. Even better, unlike merge-recursive, it doesn’t need a working directory. merge-ort is much faster even than our optimized, libgit2-based strategy. What’s more, merge-ort has since become Git’s default. That meant our strategy would fall even further behind on correctness.

It was clear that GitHub needed to upgrade to merge-ort. We split this effort into two parts: first deploy merge-ort for merges, then deploy it for rebases.

merge-ort for merges

Last September, we announced that we’re using merge-ort for merge commits. We used Scientist to run both code paths in production so we can compare timing, correctness, etc. without risking much. The customer still gets the result of the old code path, while the GitHub feature team gets to compare and contrast the behavior of the new code path. Our process was:

  1. Create and enable a Scientist experiment with the new code path.
  2. Roll it out to a fraction of traffic. In our case, we started with some GitHub-internal repositories first before moving to a percentage-based rollout across all of production.
  3. Measure gains, check correctness, and fix bugs iteratively.

We saw dramatic speedups across the board, especially on large, heavily-trafficked repositories. For our own github/github monolith, we saw a 10x speedup in both the average and P99 case. Across the entire experiment, our P50 saw the same 10x speedup and P99 case got nearly a 5x boost.

Chart showing experimental candidate versus control at P50. The candidate implementation fairly consistently stays below 0.1 seconds.

Chart showing experimental candidate versus control at P99. The candidate implementation follows the same spiky pattern as the control, but its peaks are much lower.

Dashboard widgets showing P50 average times for experimental candidate versus control. The control averages 71.07 milliseconds while the candidate averages 7.74 milliseconds.

Dashboard widgets showing P99 average times for experimental candidate versus control. The control averages 1.63 seconds while the candidate averages 329.82 milliseconds.

merge-ort for rebases

Like merges, we also do a huge number of rebases. Customers may choose rebase workflows in their pull requests. We also perform test rebases and other “behind the scenes” operations, so we also brought merge-ort to rebases.

This time around, we powered rebases using a new Git subcommand: git-replay. git replay was written by the original author of merge-ort, Elijah Newren (a prolific Git contributor). With this tool, we could perform rebases using merge-ort and without needing a worktree. Once again, the path was pretty similar:

  1. Merge git-replay into our fork of Git. (We were running the experiment with Git 2.39, which didn’t include the git-replay feature.)
  2. Before shipping, leverage our test suite to detect discrepancies between the old and the new implementations.
  3. Write automation to flush out bugs by performing test rebases of all open pull requests in github/github and comparing the results.
  4. Set up a Scientist experiment to measure the performance delta between libgit2-powered rebases and monitor for unexpected mismatches in behavior.
  5. Measure gains, check correctness, and fix bugs iteratively.

Once again, we were amazed at the results. The following is a great anecdote from testing, as relayed by @wincent (one of the GitHub engineers on this project):

Another way to think of this is in terms of resource usage. We ran the experiment over 730k times. In that interval, our computers spent 2.56 hours performing rebases with libgit2, but under 10 minutes doing the same work with merge-ort. And this was running the experiment for 0.5% of actors. Extrapolating those numbers out to 100%, if we had done all rebases during that interval with merge-ort, it would have taken us 2,000 minutes, or about 33 hours. That same work done with libgit2 would have taken 512 hours!

What’s next

While we’ve covered the most common uses, this is not the end of the story for merge-ort at GitHub. There are still other places in which we can leverage its superpowers to bring better performance, greater accuracy, and improved availability. Squashing and reverting are on our radar for the future, as well as considering what new product features it could unlock down the road.

Appreciation

Many thanks to all the GitHub folks who worked on these two projects. Also, GitHub continues to be grateful for the hundreds of volunteer contributors to the Git open source project, including Elijah Newren for designing, implementing, and continually improving merge-ort.

Migrating Netflix to GraphQL Safely

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/migrating-netflix-to-graphql-safely-8e1e4d4f1e72

By Jennifer Shin, Tejas Shikhare, Will Emmanuel

In 2022, a major change was made to Netflix’s iOS and Android applications. We migrated Netflix’s mobile apps to GraphQL with zero downtime, which involved a total overhaul from the client to the API layer.

Until recently, an internal API framework, Falcor, powered our mobile apps. They are now backed by Federated GraphQL, a distributed approach to APIs where domain teams can independently manage and own specific sections of the API.

Doing this safely for 100s of millions of customers without disruption is exceptionally challenging, especially considering the many dimensions of change involved. This blog post will share broadly-applicable techniques (beyond GraphQL) we used to perform this migration. The three strategies we will discuss today are AB Testing, Replay Testing, and Sticky Canaries.

Migration Details

Before diving into these techniques, let’s briefly examine the migration plan.

Before GraphQL: Monolithic Falcor API implemented and maintained by the API Team

Before moving to GraphQL, our API layer consisted of a monolithic server built with Falcor. A single API team maintained both the Java implementation of the Falcor framework and the API Server.

Phase 1

Created a GraphQL Shim Service on top of our existing Monolith Falcor API.

By the summer of 2020, many UI engineers were ready to move to GraphQL. Instead of embarking on a full-fledged migration top to bottom, we created a GraphQL shim on top of our existing Falcor API. The GraphQL shim enabled client engineers to move quickly onto GraphQL, figure out client-side concerns like cache normalization, experiment with different GraphQL clients, and investigate client performance without being blocked by server-side migrations. To launch Phase 1 safely, we used AB Testing.

Phase 2

Deprecate the GraphQL Shim Service and Legacy API Monolith in favor of GraphQL services owned by the domain teams.

We didn’t want the legacy Falcor API to linger forever, so we leaned into Federated GraphQL to power a single GraphQL API with multiple GraphQL servers.

We could also swap out the implementation of a field from GraphQL Shim to Video API with federation directives. To launch Phase 2 safely, we used Replay Testing and Sticky Canaries.

Testing Strategies: A Summary

Two key factors determined our testing strategies:

  • Functional vs. non-functional requirements
  • Idempotency

If we were testing functional requirements like data accuracy, and if the request was idempotent, we relied on Replay Testing. We knew we could test the same query with the same inputs and consistently expect the same results.

We couldn’t replay test GraphQL queries or mutations that requested non-idempotent fields.

And we definitely couldn’t replay test non-functional requirements like caching and logging user interaction. In such cases, we were not testing for response data but overall behavior. So, we relied on higher-level metrics-based testing: AB Testing and Sticky Canaries.

Let’s discuss the three testing strategies in further detail.

Tool: AB Testing

Netflix traditionally uses AB Testing to evaluate whether new product features resonate with customers. In Phase 1, we leveraged the AB testing framework to isolate a user segment into two groups totaling 1 million users. The control group’s traffic utilized the legacy Falcor stack, while the experiment population leveraged the new GraphQL client and was directed to the GraphQL Shim. To determine customer impact, we could compare various metrics such as error rates, latencies, and time to render.

We set up a client-side AB experiment that tested Falcor versus GraphQL and reported coarse-grained quality of experience metrics (QoE). The AB experiment results hinted that GraphQL’s correctness was not up to par with the legacy system. We spent the next few months diving into these high-level metrics and fixing issues such as cache TTLs, flawed client assumptions, etc.

Wins

High-Level Health Metrics: AB Testing provided the assurance we needed in our overall client-side GraphQL implementation. This helped us successfully migrate 100% of the traffic on the mobile homepage canvas to GraphQL in 6 months.

Gotchas

Error Diagnosis: With an AB test, we could see coarse-grained metrics which pointed to potential issues, but it was challenging to diagnose the exact issues.

Tool: Replay Testing — Validation at Scale!

The next phase in the migration was to reimplement our existing Falcor API in a GraphQL-first server (Video API Service). The Falcor API had become a logic-heavy monolith with over a decade of tech debt. So we had to ensure that the reimplemented Video API server was bug-free and identical to the already productized Shim service.

We developed a Replay Testing tool to verify that idempotent APIs were migrated correctly from the GraphQL Shim to the Video API service.

How does it work?

The Replay Testing framework leverages the @override directive available in GraphQL Federation. This directive tells the GraphQL Gateway to route to one GraphQL server over another. Take, for instance, the following two GraphQL schemas defined by the Shim Service and the Video Service:

The GraphQL Shim first defined the certificationRating field (things like Rated R or PG-13) in Phase 1. In Phase 2, we stood up the VideoService and defined the same certificationRating field marked with the @override directive. The presence of the identical field with the @override directive informed the GraphQL Gateway to route the resolution of this field to the new Video Service rather than the old Shim Service.

The Replay Tester tool samples raw traffic streams from Mantis. With these sampled events, the tool can capture a live request from production and run an identical GraphQL query against both the GraphQL Shim and the new Video API service. The tool then compares the results and outputs any differences in response payloads.

Note: We do not replay test Personally Identifiable Information. It’s used only for non-sensitive product features on the Netflix UI.

Once the test is completed, the engineer can view the diffs displayed as a flattened JSON node. You can see the control value on the left side of the comma in parentheses and the experiment value on the right.

/data/videos/0/tags/3/id: (81496962, null)
/data/videos/0/tags/5/displayName: (Série, value: “S\303\251rie”)

We captured two diffs above, the first had missing data for an ID field in the experiment, and the second had an encoding difference. We also saw differences in localization, date precisions, and floating point accuracy. It gave us confidence in replicated business logic, where subscriber plans and user geographic location determined the customer’s catalog availability.

Wins

  • Confidence in parity between the two GraphQL Implementations
  • Enabled tuning configs in cases where data was missing due to over-eager timeouts
  • Tested business logic that required many (unknown) inputs and where correctness can be hard to eyeball

Gotchas

  • PII and non-idempotent APIs should not be tested using Replay Tests, and it would be valuable to have a mechanism to prevent that.
  • Manually constructed queries are only as good as the features the developer remembers to test. We ended up with untested fields simply because we forgot about them.
  • Correctness: The idea of correctness can be confusing too. For example, is it more correct for an array to be empty or null, or is it just noise? Ultimately, we matched the existing behavior as much as possible because verifying the robustness of the client’s error handling was difficult.

Despite these shortcomings, Replay Testing was a key indicator that we had achieved functional correctness of most idempotent queries.

Tool: Sticky Canary

While Replay Testing validates the functional correctness of the new GraphQL APIs, it does not provide any performance or business metric insight, such as the overall perceived health of user interaction. Are users clicking play at the same rates? Are things loading in time before the user loses interest? Replay Testing also cannot be used for non-idempotent API validation. We reached for a Netflix tool called the Sticky Canary to build confidence.

A Sticky Canary is an infrastructure experiment where customers are assigned either to a canary or baseline host for the entire duration of an experiment. All incoming traffic is allocated to an experimental or baseline host based on their device and profile, similar to a bucket hash. The experimental host deployment serves all the customers assigned to the experiment. Watch our Chaos Engineering talk from AWS Reinvent to learn more about Sticky Canaries.

In the case of our GraphQL APIs, we used a Sticky Canary experiment to run two instances of our GraphQL gateway. The baseline gateway used the existing schema, which routes all traffic to the GraphQL Shim. The experimental gateway used the new proposed schema, which routes traffic to the latest Video API service. Zuul, our primary edge gateway, assigns traffic to either cluster based on the experiment parameters.

We then collect and analyze the performance of the two clusters. Some KPIs we monitor closely include:

  • Median and tail latencies
  • Error rates
  • Logs
  • Resource utilization–CPU, network traffic, memory, disk
  • Device QoE (Quality of Experience) metrics
  • Streaming health metrics

We started small, with tiny customer allocations for hour-long experiments. After validating performance, we slowly built up scope. We increased the percentage of customer allocations, introduced multi-region tests, and eventually 12-hour or day-long experiments. Validating along the way is essential since Sticky Canaries impact live production traffic and are assigned persistently to a customer.

After several sticky canary experiments, we had assurance that phase 2 of the migration improved all core metrics, and we could dial up GraphQL globally with confidence.

Wins

Sticky Canaries was essential to build confidence in our new GraphQL services.

  • Non-Idempotent APIs: these tests are compatible with mutating or non-idempotent APIs
  • Business metrics: Sticky Canaries validated our core Netflix business metrics had improved after the migration
  • System performance: Insights into latency and resource usage help us understand how scaling profiles change after migration

Gotchas

  • Negative Customer Impact: Sticky Canaries can impact real users. We needed confidence in our new services before persistently routing some customers to them. This is partially mitigated by real-time impact detection, which will automatically cancel experiments.
  • Short-lived: Sticky Canaries are meant for short-lived experiments. For longer-lived tests, a full-blown AB test should be used.

In Summary

Technology is constantly changing, and we, as engineers, spend a large part of our careers performing migrations. The question is not whether we are migrating but whether we are migrating safely, with zero downtime, in a timely manner.

At Netflix, we have developed tools that ensure confidence in these migrations, targeted toward each specific use case being tested. We covered three tools, AB testing, Replay Testing, and Sticky Canaries that we used for the GraphQL Migration.

This blog post is part of our Migrating Critical Traffic series. Also, check out: Migrating Critical Traffic at Scale (part 1, part 2) and Ensuring the Successful Launch of Ads.


Migrating Netflix to GraphQL Safely was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

An elastic deployment of Stable Diffusion with Discord on AWS

Post Syndicated from Steven Warren original https://aws.amazon.com/blogs/architecture/an-elastic-deployment-of-stable-diffusion-with-discord-on-aws/

Stable Diffusion is a state-of-the-art text-to-image model that generates images from text. Deploying text-to-image models such as Stable Diffusion can be difficult. Currently, Stable Diffusion requires specific computer hardware known as graphical processing units (GPUs). You can lower the bar to entry by offloading the text-to-image generation onto Amazon Web Services (AWS).

Discord is a popular voice, video, and text communication service. It provides a user interface that people can use to make text-to-image requests. When deployed, all members of a Discord server can create images by using Discord Slash Commands.

In this post, we discuss how to deploy a highly available solution on AWS. This solution will perform text-to-image generation with Stable Diffusion and use Discord as the user interface.

Solution architecture

Many of the services selected are serverless, which will offer many benefits. At the time of writing, Stable Diffusion requires a GPU for inference. Amazon Elastic Compute Cloud (Amazon EC2) was selected because it provides GPUs. The solution architecture is shown in the Figure 1.

Solution architecture diagram

Figure 1. Solution architecture diagram

Let us walk through the architecture of this solution.

Auto scaling with custom metrics

To properly scale the system, a custom Amazon CloudWatch metric is created. This custom CloudWatch metric calculates the number of Amazon Elastic Container Service (Amazon ECS) tasks required to adequately handle the amount of Amazon Simple Queue Service (Amazon SQS) messages. You should have a high-resolution CloudWatch metric to scale up quickly. For this use case, a high-resolution CloudWatch metric of every 10 seconds was implemented.

Next, let’s create the custom CW metric. Amazon EventBridge rules provide a serverless solution for starting actions on a schedule. Here we use an Amazon EventBridge rule, which initiates an AWS Step Function Express Workflow every minute. With the Express Workflow, we can create serverless workflows that take less than five minutes, which helps us avoid long running AWS Lambda functions. The Express Workflow runs a Lambda function every 10 seconds over a one-minute period, which generates the custom CloudWatch metric.

Two high-resolution CloudWatch alarms scale the system up and down, and are initiated by the custom CloudWatch metric. One CloudWatch alarm increases the ECS tasks and EC2 machines. The other alarm decreases the ECS tasks and EC2 machines.

Handling Discord requests

Someone on Discord sends a request. The Amazon API Gateway HTTP API receives the request and passes the information to an AWS Lambda function. The HTTP API provides a cost-effective option compared to REST APIs and provides tools for authentication and authorization. The HTTP API uses cross-origin resource sharing (CORS), which provides security because it only allows discord.com as an origin.

The AWS Lambda function provides a serverless solution for responding to the HTTP API requests. It transforms the HTTP API request and sends a message to the SQS First-In-First-Out (FIFO) queue. SQS seamelessly decouples the architecture between user requests and backend processing. A FIFO queue ensures that user requests are processed in the order they were requested. The AWS Lambda function sends a response back to the HTTP API within three seconds, which is a requirement of Discord Slash Commands.

When scaling up, an EC2 instance is registered with the ECS cluster. The EC2 instance type was selected because it provides GPU instances. ECS provides a repeatable solution to deploy the container across a variety of instance types. This solution currently only uses the g4dn.xlarge instance type. The ECS service will then place an ECS task onto the eligible EC2 instance. The ECS task will use the Amazon Elastic Container Registry (Amazon ECR) private registry to pull the image,  perform text-to-image processing, and respond to the Discord request. The ECR private registry is a managed container registry that manages the image.

Once there is an ECS task running on an Amazon EC2 instance, the ECS task will consume messages from the queue using long polling. This reduces the amount of ReceiveMessage requests the ECS task needs to send. When the ECS task receives a message from the queue, it will then processes the request.

Estimated monthly cost

The example assumes 1,000 requests per month and each request takes 16 seconds to complete. Extra EC2 time was added for the time to begin processing messages (seven minutes) and auto scaling cooldown time (30 minutes). You can adjust the pricing calculations with the AWS Pricing Calculator to reflect your usage and estimated cost.

Prerequisites

This blog assumes familiarity with Terraform, Docker, Discord, Amazon EC2Amazon Elastic Block Store (Amazon EBS)Amazon Virtual Private Cloud (Amazon VPC), AWS Identity and Access Management (IAM), Amazon API Gateway, AWS Lambda, Amazon SQS, Amazon Elastic Container Registry (Amazon ECR), Amazon ECS, Amazon EventBridge, AWS Step Functions, and Amazon CloudWatch.

For this walkthrough, you should have the following prerequisites:

  • Access to an AWS account, with permissions to create the resources described in the installation steps section
  • A virtual private cloud (VPC) with public subnets that is associated with an internet gateway in the region you are deploying into
    We suggest using the default VPC. The subnets will need the tag of key: Tier and value: Public and be attached to the VPC. If you decided to create your own VPC with subnets, make sure that auto-assign IP settings is enabled.
  • An IAM user with the required permissions to deploy the infrastructure
  • A new Discord application that is registered to a Discord server you own with the scope applications.command. Use this tutorial if you need a starting point on creating a Discord application.
    • Discord Bot token
    • Discord Application ID
    • Discord Public Key
  • A Hugging Face account
  • A computer with the following packages installed:

Walkthrough

Complete the following steps to deploy this solution on AWS.

Increase EC2 limits

This solution uses the g4dn.xlarge instance type, which might require you to request an EC2 limit increase. Check your current limit of Running On-Demand All G and VT instances. Make sure you have more than 4 vCPU; a single g4dn.xlarge requires 4 vCPU. We suggest requesting 8 vCPU so that you can access 2 g4dn.xlarge instances.

Deploy the infrastructure

  1. Ensure you have at least 60 GB of storage available and you’re running on a 64-bit x86 architecture system.
  2. Open a command line on the machine you will be deploying from.
  3. Log in as your AWS user through the AWS CLI with the command aws configure. If you are using an EC2 instance, create and use an instance profile rather than using the AWS CLI.
    The region you select will be the one you will deploy into.
  4. Clone the Terraform repository:
    git clone https://github.com/aws-samples/amazon-scalable-infra-discord-diffusion.git
  5. Navigate into the Terraform repository:
    cd amazon-scalable-infra-discord-diffusion
  6. Customize the variables in terraform.tfvars to match your deployment.
  7. Export the following secrets to the command line:
    • export TF_VAR_discord_bot_secret='DISCORD_BOT_SECRET_HERE'
    • export TF_VAR_huggingface_password='HUGGINGFACE_PASSWORD_HERE'
  8. Initialize the repository:
    terraform init
  9. Apply the infrastructure (this takes about 2 minutes):
    terraform apply
  10. Save the outputs for future use.

Set up Discord

This setup adds the Discord interactions URL to your Discord application. After terraform apply comes back successfully, move onto these steps.

  1. Open Discord Application Page -> General Information.
  2. Copy and paste the value from discord_interactions_endpoint_url into the Interactions Endpoint URL, and then save the changes.

If successful, there should be a green box with All your edits have been carefully recorded.

Docker image and Amazon Elastic Container Registry

In this section, you will create a docker image with the Stable Diffusion model.

  1. Exit the terraform repository:
    cd ..
  2. Clone the Docker build repository:
    git clone https://github.com/aws-samples/amazon-scalable-discord-diffusion.git
  3. Navigate to the Docker repository:
    cd amazon-scalable-discord-diffusion
  4. Build and push the docker image to ECR. This requires docker to be installed on the machine and actively running.
    You can find the commands for your deployment from the Amazon ECR repository.

    View push commands for Amazon ECR

    Figure 2. View push commands for Amazon ECR

This is a large image (10GB) and can take over 20 minutes to push depending on your machine’s internet connection.

Request an image with Discord Slash Commands

This section will describe how to request a text to image response with Discord.

  1. Log in to Discord and navigate to the server with your Discord application deployed.
  2. Navigate to a text channel.
  3. Type the command /sparkle.
    A box with COMMANDS MATCHING /sparkle will appear. Select the /sparkle command box.
    Depending on how you customized your Discord Application, the avatar image shown in Figure 3 might be different from what you have.

    Writing a Discord Slash Command

    Figure 3. Writing a Discord Slash Command

  4. Type in a prompt such as a corgi, style of monet.
    A response from YourBotName should appear with the response Submitted to Sparkle: YourPromptHere, as shown in Figure 4.

    First response from AWS Lambda function

    Figure 4. First response from AWS Lambda function

    It will take 10 minutes for an EC2 instance to start with an ECS Task running on the instance. Once an ECS Task is running on the instance, inference times should reduce to under 30 seconds, depending on the request.
    When an ECS Task is running your request, you will see a Processing your Sparkle message, as shown in Figure 5.

    Amazon ECS task processing a request

    Figure 5. Amazon ECS task processing a request

    The message is complete when it says Completed your Sparkle! as shown in Figure 6.

    Amazon ECS task returning the final response

    Figure 6. Amazon ECS task returning the final response

Cleaning up

To avoid incurring future charges, delete the resources created by the Terraform script.

  1. Return to the directory where you deployed your terraform script.
  2. To destroy the infrastructure in AWS, run the command terraform destroy.
  3. When prompted to confirm that you want to destroy the infrastructure, type yes and press Enter.

Conclusion

In summary, we created a solution that allows members of a Discord server to create images from text with a Stable Diffusion model. With this implementation, the deployment can scale to many Discord Servers and handle over one hundred requests per second.

Create projects on AWS that lower the bar to entry for people wanting to try text to image models.