Tag Archives: Engineering

Netflix Studio Engineering Overview

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/netflix-studio-engineering-overview-ed60afcfa0ce

By Steve Urban, Sridhar Seetharaman, Shilpa Motukuri, Tom Mack, Erik Strauss, Hema Kannan, CJ Barker

Netflix is revolutionizing the way a modern studio operates. Our mission in Studio Engineering is to build a unified, global, and digital studio that powers the effective production of amazing content.

Netflix produces some of the world’s most beloved and award-winning films and series, including The Irishman, The Crown, La Casa de Papel, Ozark, and Tiger King. In an effort to produce this content effectively and efficiently, we are looking to improve and automate many areas of the production process. We combine our entertainment knowledge and our technical expertise to provide innovative technical solutions from the initial pitch of an idea to the moment our members hit play.

Why Does Studio Engineering Exist?

We enable Netflix to build a unified, global and digital studio that powers the effective production of amazing content.
Studio Engineering’s ‘Why’

The journey of a Netflix Original title from the moment it first comes to us as a pitch, to that press of the play button is incredibly complex. Producing great content requires a significant amount of coordination and collaboration from Netflix employees and external vendors across the various production phases. This process starts before the deal has been struck and continues all the way through launch on the service, involving people representing finance, scheduling, human resources, facilities, asset delivery, and many other business functions. In this overview, we will shed light on the complexity and magnitude of this journey and update this post with links to deeper technical blogs over time.

Content Lifecycle: Pitch, Development, Production, On-Service
Pitch-to-Play

Mission at a Glance

  • Creative pitch: Combine the best of machine learning and human intuition to help Netflix understand how a proposed title compares to other titles, estimate how many subscribers will enjoy it, and decide whether or not to produce it.
  • Business negotiations: Empower the Netflix Legal team with data to help with deal negotiations and acquisition of rights to produce and stream the content.
  • Pre-Production: Provide solutions to plan for resource needs, and discovery of people and vendors to continue expanding the scale of our productions. Any given production requires the collaboration of hundreds of people with varying expertise, so finding exactly the right people and vendors for each job is essential.
  • Production: Enable content creation from script to screen that optimizes the production process for efficiency and transparency. Free up creative resources to focus on what’s important: producing amazing and entertaining content.
  • Post-Production: Help our creative partners collaborate to refine content into their final vision with digital content logistics and orchestration.

What’s Next?

Studio Engineering will be publishing a series of articles providing business and technical insights as we further explore the details behind the journey from pitch to play. Stay tuned as we expand on each stage of the content lifecycle over the coming months!


How we built our in-house chat platform for the web

Post Syndicated from Grab Tech original https://engineering.grab.com/how-we-built-our-in-house-chat-platform-for-the-web

At Grab, we’ve built an in-house chat platform to help connect our passengers with drivers during a booking, as well as with their friends and family for social sharing purposes.

P2P chat for the Angbow campaign and GrabHitch chat

We wanted to focus on our customer support chat experience, so we replaced the third-party live chat tool that we’d used for years with our newly developed chat platform. As part of this initiative, we extended the platform to the web to integrate with our internal Customer Support portal.

Sample chat between a driver and a customer support agent

This was the first time we’d introduced chat on the web, and we faced a few challenges while building it. In this article, we’ll go over some of these challenges and how we solved them.

Current Architecture

A vast majority of the communication from our Grab Passenger and Driver apps happens via TCP. Our TCP gateway takes care of processing all the incoming messages, authenticating, and routing them to the respective services. Our TCP connections are unicast, which means there is only one active connection possible per user at any point in time. This served us well, as we only allow our users to log in from one device at a time.

A TL;DR version of our current system

However, this model breaks on the web since our users can have multiple tabs open at the same time, and each would establish a new socket connection. Due to the unicast nature of our TCP connections, the older tabs would get disconnected and wouldn’t receive any messages from our servers. Our Customer Support agents love their tabs and have a gazillion open at any time. This behaviour would be too disruptive for them.

The obvious answer was to change our TCP connection strategy to multicast. We took a look at this and quickly realised that it was going to be a huge undertaking and could introduce a lot of unknowns for us to deal with.

We had to consider a different approach for the web, and zeroed in on a hybrid approach built on two little-known JavaScript APIs: SharedWorker and BroadcastChannel.

Understanding the basics

Before we jump in, let’s take a quick detour to review some of the terminologies that we’ll be using in this post.

If you’re familiar with how WebWorker works, feel free to skip ahead to the next section. For the uninitiated, JavaScript on the browser runs in a single-threaded environment. Workers are a mechanism to introduce background, OS-level threads in the browser. Creating a worker in JavaScript is simple. Let’s look at it with an example:

// instantiate a worker
const worker = new Worker("./worker.js");
worker.postMessage({ message: "Ping" });
worker.onmessage = (e) => {
  console.log("Message from the worker:", e.data.message);
};

// and in worker.js
onmessage = (e) => {
  console.log(e.data.message);
  postMessage({ message: "pong" });
};

The worker API comes with a handy postMessage method which can be used to pass messages between the main thread and worker thread. Workers are a great way to add concurrency in a JavaScript application and help in speeding up an expensive process in the background.

Note: While the method looks similar, worker.postMessage is not the same as window.postMessage.

What is a SharedWorker?

SharedWorker is similar to a WebWorker and spawns an OS thread, but as the name indicates, it’s shared across browser contexts. In other words, there is only one instance of that worker running for that domain across tabs/windows. The API is similar to WebWorker but has a few subtle differences.

SharedWorkers internally use MessagePort to pass messages between the worker thread and the main thread. Each connection has two ports: one held by the script on the main thread and one handed to the worker, and messages can be posted in either direction. Let’s explore it with an example:

// main.js
const mySharedWorker = new SharedWorker("./worker.js");
mySharedWorker.port.start();
mySharedWorker.port.postMessage({ message: "Ping" });

// worker.js
const connections = [];

onconnect = (e) => {
  const port = e.ports[0];
  connections.push(port);

  // Handle messages from the main thread
  port.onmessage = handleEventFromMainThread.bind(port);
};

// Message from the main thread
const handleEventFromMainThread = (e) => {
  console.log("I received", e.data, "from the main thread");
};

// Broadcast a message to every connected browser context
const sendEventToMainThread = (params) => {
  connections.forEach((c) => c.postMessage(params));
};

There is a lot to unpack here. Once a SharedWorker is created, we have to manually start the port using mySharedWorker.port.start() to establish a connection between the script running on the main thread and the worker thread. After that, messages can be passed via the port’s postMessage method. On the worker side, there is an onconnect callback which helps in setting up listeners for connections from each browser context.

Under the hood, SharedWorker spawns a single OS thread per worker script per domain. For instance, if a script named worker.js runs on the domain https://ce.grab.com, the logic inside worker.js runs exactly once for that domain, no matter how many tabs load it. The advantage of this approach is that we can run multiple worker scripts in the same origin, each managing a different part of the functionality. This was one of the key reasons why we picked SharedWorker over other solutions.
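
As a hypothetical illustration (the script names below aren’t our actual files), two different worker scripts on the same origin each become their own singleton, so separate concerns can live in separate workers:

// Hypothetical script names, for illustration only.
// Every tab on this origin that runs these lines talks to the SAME two worker
// instances – the browser spawns each script at most once per origin.
const socketWorker = new SharedWorker("./socket-worker.js");     // manages the server connection
socketWorker.port.start();

const presenceWorker = new SharedWorker("./presence-worker.js"); // manages a different concern
presenceWorker.port.start();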

What are Broadcast Channels?

In a multi-tab environment, our users may send messages from any of the tabs and switch to another for the next message. For a seamless experience, we need to ensure that the state is in sync across all the browser contexts.

Message passing across tabs

The BroadcastChannel API creates a message bus that allows us to pass messages between multiple browser contexts within the same origin. This helps us sync the message that’s being sent on the client to all the open tabs.

Let’s explore the API with a code example:

const channel = new BroadcastChannel("chat_messages");
// Sets up an event listener to receive messages from other browser contexts
channel.onmessage = ({ data }) => {
  console.log("Received ", data);
};

const sendMessage = (message) => {
  const event = { message, type: "new_message" };
  // Send the event to the server (e.g. via the worker)
  send(event);
  // Publish the event to all browser contexts listening on the chat_messages channel
  channel.postMessage(event);
};

const off = () => {
  // clear event listeners
  channel.close();
};

One thing to note here is that communication is restricted to listeners from the same origin.

How are our chat rooms powered?

Now that we have a basic understanding of how SharedWorker and BroadcastChannel work, let’s take a peek into how Grab is using them.

Our Chat SDK abstracts the calls to the worker and the underlying transport mechanism. On the surface, the interface exposes just a handful of methods: for sending messages and read receipts, subscribing to and unsubscribing from incoming events, and closing the connection.

export interface IChatSDK {
  sendMessage: (message: ChatMessage) => string;
  sendReadReceipt: (receiptAck: MessageReceiptACK) => void;
  on: (callback: ICallBack) => void;
  off: (topic?: SDKTopics) => void;
  close: () => void;
}

The SDK does all the heavy lifting to manage the connection with our TCP service and to keep the information in sync across tabs.

SDK flow

In our worker, we additionally maintain all the connections from browser contexts. When an incoming event arrives from the socket, we publish it to the first active connection. Our SDK listens to this event, processes it, sends out an acknowledgment to the server, and publishes it in the BroadcastChannel. Let’s look at how we’ve achieved this via a code example.

Managing connections in the worker:

let socket;
let instances = 0;
let connections = [];

let URI: string;

// Called when a new worker is connected.
// Worker is created at
onconnect = (e) => {
  const port = e.ports[0];

  port.start();
  port.onmessage = handleEventFromMainThread.bind(port);
  connections.push(port);
  instances++;
};

// Publish ONLY to the first connection.
// Let the caller decide on how to sync this with other tabs
const callback = (topic, payload) => {
  connections[0].postMessage({
    topic,
    payload,
  });
};

// Regular function (not an arrow) so that `this` refers to the bound port
function handleEventFromMainThread(e) {
  switch (e.data.topic) {
    case SocketTopics.CONNECT: {
      const config = e.data.payload;
      if (!socket) {
        // Establishes a WebSocket connection with the server
        socket = new SocketManager({...});
      } else {
        callback(SocketTopics.CONNECTED, '');
      }
      break;
    }
    case SocketTopics.CLOSE: {
      const index = connections.indexOf(this);
      if (index !== -1 && instances > 0) {
        connections.splice(index, 1);
        instances--;
      }
      break;
    }
    // Forward everything else to the server
    default: {
      const payload = e.data;
      socket.sendMessage(payload);
      break;
    }
  }
}

And in the ChatSDK:

// Implements IChatSDK

// Rough outline of our GrabChat implementation

class GrabChatSDK {
  constructor(config) {
    this.channel = new BroadcastChannel('incoming_events');
    this.channel.onmessage = ({ data }) => {
      switch (data.type) {
        // Handle events from other tabs
        // .....
      }
    };
    this.worker = new SharedWorker('./worker', {
      type: 'module',
      name: `${config.appID}-${config.appEnv}`,
      credentials: 'include',
    });
    this.worker.port.start();
    // Publish a connected event, so the worker manager can register this connection
    this.worker.port.postMessage({
      topic: SocketTopics.CONNECT,
      payload: config,
    });
    // Incoming event from the shared worker
    this.worker.port.onmessage = this._handleIncomingMessage.bind(this);
    // Disconnect this port before the tab closes
    this._disconnect = this._disconnect.bind(this);
    addEventListener('beforeunload', this._disconnect);
  }

  sendMessage(message) {
    // Attempt a delivery of the message via the worker
    this.worker.port.postMessage({
      topic: SocketTopics.NEW_MESSAGE,
      payload: getPayload(message),
    });
    // Send the message to all tabs to keep things in sync
    this.channel.postMessage(getPayload(message));
  }

  // Hit if this connection is the leader of the SharedWorker connection
  _handleIncomingMessage(event) {
    // Send an ACK to our servers confirming receipt of the message
    this.worker.port.postMessage({
      topic: SocketTopics.ACK,
      payload: event.data,
    });

    if (shouldBroadcast(event.data.type)) {
      this.channel.postMessage(event.data);
    }

    this.callback(event.data);
  }

  _disconnect() {
    // Tell the worker to drop this connection
    this.worker.port.postMessage({ topic: SocketTopics.CLOSE });
    removeEventListener('beforeunload', this._disconnect);
  }
}

This ensures that there is only one connection between our application and the TCP service irrespective of the number of tabs the page is open in.
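
To give a sense of how the pieces fit together, here’s a hypothetical usage of the SDK from a Customer Support tab; the configuration fields, message shape, and renderMessage helper are illustrative rather than our actual types:

// Hypothetical usage – field names are illustrative.
const sdk = new GrabChatSDK({ appID: 'cs-portal', appEnv: 'production' });

// Subscribe to incoming events (delivered via the worker and synced over BroadcastChannel)
sdk.on((event) => {
  renderMessage(event);
});

// Send a message; the SDK forwards it to the worker and keeps other tabs in sync
sdk.sendMessage({ roomID: 'booking-123', text: 'Hi, how can I help?' });

// Tear down when the agent closes the chat
sdk.close();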

Some caveats

While SharedWorker is a great way to enforce singleton objects across browser contexts, the developer experience of SharedWorker leaves a lot to be desired. There aren’t many resources on the web, and it could be quite confusing if this is the first time you’re using this feature.

We faced some trouble bundling the worker code with the rest of our application; this plugin from GoogleChromeLabs did a great job of alleviating some of the pain. Debugging issues with SharedWorker was also not obvious: Chrome has a dedicated page for inspecting SharedWorkers (chrome://inspect/#workers), but it took some getting used to.

The browser support for SharedWorker is far from universal. While it works great in Chrome, Firefox, and Opera, Safari and most mobile browsers lack support. This was an acceptable trade-off in our use case, as we built this for an internal portal and all our users are on Chrome.

Shared race

SharedWorker enforces uniqueness using a combination of the origin and the script name. This could potentially introduce an unintentional race condition during deploys if we’re not careful. Say a user has one tab that was opened before the latest deployment and opens another one after it; it’s possible to end up with two different versions of the same script. We built a wrapper over the SharedWorker which cedes control to the latest connection, ensuring that only one version of the worker is active.
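
The wrapper isn’t shown in this post, but a minimal sketch of the “latest connection wins” idea might look like the following; the HANDSHAKE/RELOAD_REQUIRED topics and the build-time version constant are assumptions for illustration, not our actual implementation:

// worker.js – illustrative sketch of a "latest connection wins" handshake
let activeVersion;
const ports = [];

onconnect = (e) => {
  const port = e.ports[0];
  port.start();
  ports.push(port);

  port.onmessage = ({ data }) => {
    if (data.topic !== 'HANDSHAKE') return;

    if (!activeVersion) {
      activeVersion = data.version;
    } else if (data.version !== activeVersion) {
      // A tab built from a newer deployment has connected: cede control to it
      // and ask the older tabs to reload so only one worker version stays active.
      activeVersion = data.version;
      ports.forEach((p) => p.postMessage({ topic: 'RELOAD_REQUIRED' }));
    }
  };
};

// main thread – each page reports the version it was built with
const versionedWorker = new SharedWorker('./worker.js');
versionedWorker.port.start();
versionedWorker.port.postMessage({ topic: 'HANDSHAKE', version: BUILD_VERSION }); // BUILD_VERSION injected at build time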

Wrapping up

We’re happy to have shared our learnings from building our in-house chat platform for the web, and we hope you found this post helpful. We’ve built the web solution as a reusable SDK for our internal portals and public-facing websites for quick and easy integration, providing a powerful user experience.

We hope this post also helped you get a deeper sense of how SharedWorker and BroadcastChannels work in a production application.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Introducing GitHub Super Linter: one linter to rule them all

Post Syndicated from Lucas Gravley original https://github.blog/2020-06-18-introducing-github-super-linter-one-linter-to-rule-them-all/

Setting up a new repository with all the right linters for the different types of code can be time-consuming and tedious. There are so many tools and configurations to choose from, and often more than one linter is needed to cover all the languages used.

The GitHub Super Linter was built out of necessity by the GitHub Services DevOps Engineering team to maintain consistency in our documentation and code while making communication and collaboration across the company a more productive experience. Now we’re open sourcing it so everyone can use and improve it!

The Super Linter solves many of these requirements through automation. Some of the included features:

  • Prevent broken code from being uploaded to master branches
  • Help establish coding best practices across multiple languages
  • Build guidelines for code layout and format
  • Automate the process to help streamline code reviews

With these basic criteria, we should be shipping better, cleaner, and more stable code internally and to our customers and partners.

What is it?

The Super Linter is a source code repository that is packaged into a Docker container and called by GitHub Actions. This allows for any repository on GitHub.com to call the Super Linter and start utilizing its benefits.

The Super Linter currently supports a lot of languages, with more coming in the future. For details on the supported languages, check out the README.md.

How it works

Once you’ve set up your repository to run this action, any time you open a pull request it will start linting the code base and report back via the Status API. It will let you know whether your code changes passed successfully or, if any errors were detected, where and what they are. The developer can then go back to their branch, fix any issues, and push again to the open pull request, at which point the Super Linter runs again, validates the updated code, and repeats the process. As an additional measure, you can configure your branch protection rules to require that all checks pass before code can be merged.
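
To give a rough idea of the setup, a minimal workflow along these lines can be added to the repository; the exact action reference, version tag, and environment variables may differ, so treat this as a sketch and check the Super Linter README for the current syntax:

name: Lint Code Base
on: [pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      # Check out the code so the linter can see the files in the pull request
      - uses: actions/checkout@v2

      # Run the Super Linter against the changed files
      - name: Lint Code Base
        uses: github/super-linter@v3
        env:
          VALIDATE_ALL_CODEBASE: false
          DEFAULT_BRANCH: master
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}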

There’s a ton of customization available through flags and templates to help you tailor the Super Linter to your individual repository. Just follow the detailed directions at the Super Linter repository and the Super Linter wiki.

This tool can also be helpful for any repository where multiple types of code and/or documentation all live together (monorepo).

GitHub Super Linter in action

Default rules

Standardizing a rule set across the Super Linter has been an interesting challenge as each developer is unique in how they code. This is why we allow users to use any rules for the linter as they see fit for their repository. But, if no ruleset is defined, we must default to a certain standard.

The rule set for Ruby and Rails is pulled from the rubocop-github Ruby gem and follows the same rules and versioning we use on GitHub.com.

For some languages, we use the defaults that come with the linter, such as coffeelint or yamllint. For others, we try to find a happy middle ground that lays simple groundwork and helps establish some best practices, like Markdownlint or pylint.

The beauty of this is that, out of the box, you start with an established framework, and if your team decides at any point that additional customization is needed, you have full ability to do so.

Just navigate to the Super Linter and copy templates from the TEMPLATES folder to your local repository.

Join in the fun

We encourage you to set up this action and start the process of cleaning up your codebase and building your team’s standards and best practices.

How can I contribute?

We’re always looking to update best practices, add additional languages, and make the tool easier for consumption. If you’d like to help contribute to this action, check out our contributing guide.

Learn more about our Super Linter

Using GitHub Actions for MLOps & Data Science

Post Syndicated from Hamel Husain original https://github.blog/2020-06-17-using-github-actions-for-mlops-data-science/

Background

Machine Learning Operations (or MLOps) enables data scientists to work in a more collaborative fashion by providing testing, lineage, versioning, and historical information in an automated way. Because the landscape of MLOps is nascent, data scientists are often forced to implement these tools from scratch. The closely related discipline of DevOps offers some help; however, many DevOps tools are generic and require the implementation of “ML awareness” through custom code. Furthermore, these platforms often require disparate tools that are decoupled from your code, leading to poor debugging and reproducibility.

To mitigate these concerns, we have created a series of GitHub Actions that integrate parts of the data science and machine learning workflow with a software development workflow. Furthermore, we provide components and examples that automate common tasks.

An Example Of MLOps Using GitHub Actions

Consider the example below of how an experiment tracking system can be integrated with GitHub Actions to enable MLOps. We demonstrate how you can orchestrate a machine learning pipeline to run on the infrastructure of your choice, collect metrics using an experiment tracking system, and report the results back to a pull request.

Screenshot of a pull request

A screenshot of this pull request.

For a live demonstration of the above example, please see this talk.

MLOps is not limited to the above example. Due to the composability of GitHub Actions, you can stack workflows in many ways that can help data scientists. Below is a concrete example of a very simple workflow that adds links to mybinder.org on pull requests:

name: Binder
on: 
  pull_request:
    types: [opened, reopened]

jobs:
  Create-Binder-Badge:
    runs-on: ubuntu-latest
    steps:

    - name: checkout pull request branch
      uses: actions/checkout@<version>
      with:
        ref: ${{ github.event.pull_request.head.sha }}

    - name: comment on PR with Binder link
      uses: actions/github-script@<version>
      with:
        github-token: ${{secrets.GITHUB_TOKEN}}
        script: |
          var BRANCH_NAME = process.env.BRANCH_NAME;
          github.issues.createComment({
            issue_number: context.issue.number,
            owner: context.repo.owner,
            repo: context.repo.repo,
            body: `[![Binder](https://mybinder.org/badge_logo.svg)](https://mybinder.org/v2/gh/${context.repo.owner}/${context.repo.repo}/${BRANCH_NAME}) :point_left: Launch a binder notebook on this branch`
          }) 
      env:
        BRANCH_NAME: ${{ github.event.pull_request.head.ref }}

When the above YAML file is added to a repository’s .github/workflows directory, pull requests can be annotated with a useful link as illustrated below [1]:

Screenshot showing Actions commenting with a Binder link

A Growing Ecosystem of MLOps & Data Science Actions

There is a growing number of Actions available for machine learning ops and data science. Below are some concrete examples that are in use today, categorized by topic.

Orchestrating Machine Learning Pipelines:

  • Submit Argo Workflows – allows you to orchestrate machine learning pipelines that run on Kubernetes.
  • Publish Kubeflow Pipelines to GKE – Kubeflow Pipelines is a platform for building and deploying portable, scalable machine learning (ML) workflows based on Docker containers.

Jupyter Notebooks:

  • Run parameterized Notebooks – run notebooks programmatically using papermill.
  • Repo2Docker Action – Automatically turn data-science repositories into Jupyter-enabled Docker containers using repo2docker.
  • fastai/fastpages – share information from Jupyter notebooks as blog posts using GitHub Actions & GitHub Pages.

End-To-End Workflow Orchestration:

Experiment Tracking

This is by no means an exhaustive list of the things you might want to automate with GitHub Actions with respect to data science and machine learning.   You can follow our progress towards this goal on our page, which contains links to blog posts, GitHub Actions, talks, and examples that are relevant to this topic.

We invite the community to create other Actions that might be broadly useful. Some ideas for getting started include data and model versioning, model deployment, data validation, as well as expanding upon some of the areas mentioned above. A great place to start is the documentation for GitHub Actions, particularly on how to build Actions for the community!

  • Our page with relevant materials.
  • GitHub Actions official documentation.
  • Hello world Docker Action: A template to demonstrate how to build a Docker Action for other people to use
  • Using self-hosted runners.
  • This talk introducing Actions for data science, including some live-coding!
  • Awesome Actions: A curated list of interesting GitHub Actions by topic
  • Useful GitHub Actions to know about when getting started:
    • actions/checkout: Allows you to quickly clone the contents of your repository into your environment, which you often want to do. This does a number of other things such as automatically mount your repository’s files into downstream Docker containers.
    • mxschmitt/action-tmate: Provides a way to debug Actions interactively. This uses port forwarding to give you a terminal in the browser that is connected to your Actions runner. Be careful not to expose sensitive information if you use this.
    • actions/github-script: Gives you a pre-authenticated octokit.js client that allows you to interact with the GitHub API to accomplish almost any task on GitHub automatically. Only these endpoints are supported.

Footnotes:

[1] This example workflow will not work on pull requests from forks.  To enable this, you have to trigger a PR comment to occur via a different event.

Using data science and machine learning for improved customer support

Post Syndicated from Junade Ali original https://blog.cloudflare.com/using-data-science-and-machine-learning-for-improved-customer-support/


In this blog post we’ll explore three data science tricks that helped us solve real problems for our customer support group and our customers: two for natural language processing in a customer support context, and one for identifying Internet attack traffic.

Through these examples, we hope to demonstrate how invaluable data processing tricks, visualisations and tools can be before putting data into a machine learning algorithm. By refining data prior to processing, we are able to achieve dramatically improved results without needing to change the underlying machine learning strategies which are used.

Know the Limits (Language Classification)

When browsing a social media site, you may find the site prompts you to translate a post even though it is in your language.

We recently came across a similar problem at Cloudflare when we were looking into language classification for chat support messages. Using an off-the-shelf classification algorithm, chats with short messages were often classified incorrectly, and our analysis found a correlation between the length of a message and the accuracy of the classification (measured against the browser Accept-Language header and the languages of the country where the request was submitted):


On a subset of tickets, comparing the classified language against the web browser Accept-Language header, we found there was broad agreement between these two properties. When we considered the languages associated with the user’s country, we found another signal.


In 67% of our sample, we found agreement between these three signals. In 15% of instances the classified language agreed with only the Accept-Language header and in 5% of cases there was only agreement with the languages associated with the user’s country.


We decided the ideal approach was to train a machine learning model that would take all three signals (plus the confidence rate from the language classification algorithm) and use that to make a prediction. By knowing the limits of a given classification algorithm, we were able to develop an approach that helped complement it.

A naive approach to the same problem may not even need a trained model: simply requiring agreement between two of the three properties (classified language, Accept-Language header and country languages) is enough to make a decision about the right language to use.
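
As an illustration of that naive rule (the function and field names here are ours, not part of our production system), a simple majority vote over the three signals could look like this:

// Illustrative sketch only: pick the language that at least two of the three
// signals agree on, otherwise fall back to the classifier's own guess.
function chooseLanguage({ classified, acceptLanguage, countryLanguages }) {
  const votes = {};
  for (const lang of [classified, acceptLanguage, ...countryLanguages]) {
    if (!lang) continue;
    votes[lang] = (votes[lang] || 0) + 1;
  }
  const [best, count] = Object.entries(votes).sort((a, b) => b[1] - a[1])[0];
  return count >= 2 ? best : classified;
}

// Example: a short message misclassified as "pt", but the Accept-Language
// header and the user's country both point to Spanish.
chooseLanguage({
  classified: "pt",
  acceptLanguage: "es",
  countryLanguages: ["es"],
}); // -> "es"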

Hold Your Fire (Fuzzy String Matching)

Fuzzy String Matching is often used in natural language processing when trying to extract information from human text. For example, this can be used for extracting error messages from customer support tickets to do automatic classification. At Cloudflare, we use this as one signal in our natural language processing pipeline for support tickets.

Engineers often use the Levenshtein distance algorithm for string matching; for example, this algorithm is implemented in the Python fuzzywuzzy library. This approach has a high computational overhead (for two strings of length k and l, the algorithm runs in O(k * l) time).

To understand the performance of different string matching algorithms in a customer support context, we compared multiple algorithms (Cosine, Dice, Damerau, LCS and Levenshtein) and measured the true positive rate (TP), false positive rate (FP) and the ratio of false positives to true positives (FP/TP).


We opted for the Cosine algorithm, not just because it outperformed the Levenshtein algorithm, but also because the computational cost was reduced to O(k + l) time. The Cosine similarity algorithm is very simple: it works by representing words or phrases as vectors in a multidimensional vector space, where each unique letter of an alphabet is a separate dimension. The smaller the angle between two vectors, the closer one word is to the other.
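
As a rough sketch of the idea (not our production implementation), cosine similarity over per-character counts can be computed in linear time:

// Illustrative sketch: cosine similarity between two strings, where each unique
// character is a dimension and the value is how often it appears.
// Runs in O(k + l) time for strings of length k and l.
function charCounts(str) {
  const counts = {};
  for (const ch of str.toLowerCase()) {
    counts[ch] = (counts[ch] || 0) + 1;
  }
  return counts;
}

function cosineSimilarity(a, b) {
  const ca = charCounts(a);
  const cb = charCounts(b);
  let dot = 0;
  for (const ch of Object.keys(ca)) {
    if (cb[ch]) dot += ca[ch] * cb[ch];
  }
  const norm = (c) => Math.sqrt(Object.values(c).reduce((sum, n) => sum + n * n, 0));
  return dot / (norm(ca) * norm(cb));
}

cosineSimilarity("connection timed out", "connection timeout"); // close to 1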

The mathematical definitions of each string similarity algorithm and a scientific comparison can be found in our paper: M. Pikies and J. Ali, “String similarity algorithms for a ticket classification system,” 2019 6th International Conference on Control, Decision and Information Technologies (CoDIT), Paris, France, 2019, pp. 36-41. https://doi.org/10.1109/CoDIT.2019.8820497

There were other optimisations we introduced to the fuzzy string matching approach. The similarity threshold is determined by evaluating the true positive and false positive rates on various sample data. We also devised a new tokenization approach for handling phrases and numeric strings, and used the FastText natural language processing library to determine candidate values for fuzzy string matching and to improve overall accuracy. We will share more about these optimisations in a future blog post.


“Beyond it is Another Dimension” (Threat Identification)

Attack alerting is particularly important at Cloudflare – this is useful for both monitoring the overall status of our network and providing proactive support to particular at-risk customers.

DDoS attacks can be characterised by a few different features, including differences in request or error rates over a temporal baseline, the relationship between errors and request volumes, and other metrics that indicate attack behaviour. One example of a metric we use to differentiate between whether a customer is under a low-volume attack or experiencing another issue is the relationship between 499 error codes and 5xx HTTP status codes. Cloudflare’s network edge returns a 499 status code when the client disconnects before the origin web server has an opportunity to respond, whilst 5xx status codes indicate an error handling the request. In the chart below, the x-axis measures the differential increase in 5xx errors over a baseline, whilst the y-axis represents the rate of 499 responses (each point represents a 15 minute interval). During a DDoS attack we notice a linear correlation between these criteria, whilst origin issues typically show an increase in one metric but not the other:


The next question is how this data can be used in more complicated situations – take the following example of identifying a credential stuffing attack in aggregate. We looked at a small number of anonymised data fields for the most prolific attackers of WordPress login portals. The data is based purely on HTTP headers; in total we saw 820 unique IPs hitting 16,248 distinct zones (the IPs were hashed and requests were placed into “buckets” as they were collected). As WordPress returns an HTTP 200 when a login fails and an HTTP 302 on a successful login (redirecting to the login panel), we’re able to analyse this just from the status code returned.

On the left-hand chart, the x-axis represents a normalised number of unique zones under attack (0 means the attacker is hitting the same site, whilst 1 means the attacker is hitting all different sites) and the y-axis represents the success rate (using HTTP status codes to identify the chance of a successful login). The right-hand chart switches the x-axis out for something called the “variety ratio” – this measures the rate of abnormal 4xx/5xx HTTP status codes (i.e. firewall blocks, rate limiting HTTP headers or 5xx status codes). We see clear clusters on both charts:
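
To make the two axes concrete, here is an illustrative way the per-attacker metrics could be derived from status-code counts; the field names and the choice of which codes count as “abnormal” are our assumptions for this sketch, not precise definitions:

// Illustrative sketch only: derive the plotted metrics for one attacker bucket.
function attackerMetrics({ statusCounts, zonesHit, totalZones }) {
  const total = Object.values(statusCounts).reduce((sum, n) => sum + n, 0);

  // Success rate: WordPress returns 302 on a successful login and 200 on a failure
  const successes = statusCounts[302] || 0;
  const failures = statusCounts[200] || 0;
  const successRate = successes / (successes + failures || 1);

  // Variety ratio: share of abnormal 4xx/5xx responses (e.g. firewall blocks, rate limits)
  const abnormal = Object.entries(statusCounts)
    .filter(([code]) => Number(code) >= 400)
    .reduce((sum, [, n]) => sum + n, 0);
  const varietyRatio = abnormal / total;

  // Zone spread: 0 = the attacker hits a single site, 1 = it hits all sites seen
  const zoneSpread = totalZones > 1 ? (zonesHit - 1) / (totalZones - 1) : 0;

  return { successRate, varietyRatio, zoneSpread };
}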


However, by plotting this data in three dimensions with all three fields represented, clusters appear. These clusters are then grouped using an unsupervised clustering algorithm (agglomerative hierarchical clustering):


Cluster 1 has 99.45% of requests coming from the same country and 99.45% from the same User-Agent. Looking at other clusters shows the advantage of this approach – for example, Cluster 0 had 89% of requests coming from three User-Agents (75%, 12.3% and 1.7%, respectively). By using this approach we are able to correlate such attacks together even when they would be hard to identify on a request-to-request basis (as they are being made from different IPs and with different request headers). Such strategies allow us to fingerprint attacks regardless of whether attackers are continuously changing how they make these requests to us.

By aggregating data together then representing the data in multiple dimensions, we are able to gain visibility into the data that would ordinarily not be possible on a request-to-request basis. In product level functionality, it is often important to make decisions on a signal-to-signal basis (“should this request be challenged whilst this one is allowed?”) but by looking at the data in aggregate we are able to focus  on the interesting clusters and provide alerting systems which identify anomalies. Performing this in multiple dimensions provides the tools to reduce false positives dramatically.

Conclusion

From natural language processing to intelligent threat fingerprinting, using data science techniques has improved our ability to build new functionality. Recently, new machine learning approaches and strategies have been designed to process this data more efficiently and effectively; however, preprocessing of data remains a vital tool for doing this. When seeking to optimise data processing pipelines, it often helps to look not just at the tools being used, but also the input and structure of the data you seek to process.

If you’re interested in using data science techniques to identify threats on a large scale network, we’re hiring for Support Engineers (including Security Operations, Technical Support and Support Operations Engineering) in San Francisco, Austin, Champaign, London, Lisbon, Munich and Singapore.

Go Modules- A guide for monorepos (Part 1)

Post Syndicated from Grab Tech original https://engineering.grab.com/go-module-a-guide-for-monorepos-part-1

Go modules are a new feature in Go for versioning packages and managing dependencies. They have been almost two years in the making, and are finally production-ready in the Go 1.14 release earlier this year. Go recommends using single-module repositories by default, and warns that multi-module repositories require great care.

At Grab, we have a large monorepo and changing from our existing monorepo structure has been an interesting and humbling adventure. We faced serious obstacles to fully adopting Go modules. This series of articles describes Grab’s experience working with Go modules in a multi-module monorepo, the challenges we faced along the way, and the solutions we came up with.

To fully appreciate Grab’s journey in using Go Modules, it’s important to learn about the beginning of our vendoring process.

Native support for vendoring using the vendor folder

With Go 1.5 came the concept of the vendor folder, a new package discovery method, providing native support for vendoring in Go for the first time.

With the vendor folder, projects influenced the lookup path simply by copying packages into a vendor folder nested at the project root. Go uses these packages before traversing the GOPATH root, which allows a monorepo structure to vendor packages within the same repo as if they were 3rd-party libraries. This enabled go build to work consistently without any need for extra scripts or env var modifications.

Initial obstacles

There was no official command for managing the vendor folder, and it was common to simply copy files into the vendor folder manually.

At Grab, different teams took different approaches. This meant that we had multiple version manifests and lock files for our monorepo’s vendor folder. It worked fine as long as there were no conflicts. At the time, very few 3rd-party libraries used proper tagging and semantic versioning, which made things worse: the lock files were largely a jumble of commit hashes and timestamps.

Jumbled commit hashes and timestamps

As a result of the multiple versions and lock files, the vendor directory was not reproducible, and we couldn’t be sure what versions we had in there.

Temporary relief

We eventually settled on using Glide, and standardized our vendoring process. Glide gave us a reproducible, verifiable vendor folder for our dependencies, which worked up until we switched to Go modules.

Vendoring using Go modules

I first heard about Go modules from Russ Cox’s talk at GopherCon Singapore in 2018, and soon after started working on adopting modules at Grab, beginning with using them to manage our existing vendor folder.

This allowed us to align with the official Go toolchain and familiarise ourselves with Go modules while the feature matured.

Switching to go mod

Go modules introduced a go mod vendor command for exporting all dependencies from go.mod into vendor. We didn’t plan to enable Go modules for builds at this point, so our builds continued to run exactly as before, indifferent to the fact that the vendor directory was created using go mod.

The initial task to switch to go mod vendor was relatively straightforward, as listed here (a rough command-level sketch follows the list):

  1. Generated a go.mod file from our glide.yaml dependencies. This was scripted so it could be kept up to date without manual effort.
  2. Replaced the vendor directory.
  3. Committed the changes.
  4. Used go mod instead of glide to manage the vendor folder.
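
As a rough command-level sketch of the steps above (the module path below is illustrative, and we scripted the go.mod generation rather than relying on the built-in converter), the switch looks roughly like this:

# Run in the monorepo root; the module path below is made up.
# go mod init can seed go.mod from an existing glide.yaml/glide.lock.
go mod init grab.com/monorepo

# Rebuild the vendor directory from go.mod instead of glide
rm -rf vendor
go mod vendor

# Commit the new go.mod, go.sum and vendor directory, then retire glide
git add go.mod go.sum vendor
git commit -m "Switch vendoring from glide to go mod"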

The change was extremely large (due to differences in how glide and go mod handled the pruning of unused code), but equivalent in terms of Go code. However, there were some additional changes needed besides porting the version file.

Addressing incompatible dependencies

Some of our dependencies were not yet compatible with Go modules, so we had to use the replace directive to substitute them with a working version.
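
For example (the module paths and versions below are made up, not our real dependencies), a replace directive in go.mod looks like this:

// go.mod – illustrative module paths and versions
module grab.com/monorepo

go 1.14

require github.com/some/dependency v1.2.3

// Substitute an incompatible dependency with a pinned working version (or a fork)
replace github.com/some/dependency => github.com/some/dependency v1.2.2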

A more complex issue was that parts of our codebase relied on nested vendor directories, and had dependencies that were incompatible with the top level. The go mod vendor command attempts to include all code nested under the root path, whether or not they have used a sub-vendor directory, so this led to conflicts.

Problematic paths

Rather than resolving all the incompatibilities, which would’ve been a major undertaking in the monorepo, we decided to exclude these paths from Go modules instead. This was accomplished by placing an empty go.mod file in the problematic paths.

Nested modules

The empty go.mod file worked. This brought us to an important rule of Go modules, which is central to understanding many of the issues we encountered:

A module cannot contain other modules

This means that although the modules are within the same repository, Go modules treat them as though they are completely independent. When running go mod commands in the root of the monorepo, Go doesn’t even ‘see’ the other modules nested within.

Tackling maintenance issues

Completing the initial migration of our vendor directory to go mod vendor, however, opened up a different set of problems related to maintenance.

With Glide, we could guarantee that the Glide files and vendor directory would not change unless we deliberately changed them. This was not the case after switching to Go modules; we found that the go.mod file frequently required unexpected changes to keep our vendor directory reproducible.

There are two frequent cases that cause the go.mod file to need updates: dependency inheritance and implicit updates.

Dependency inheritance

Dependency inheritance is a consequence of Go modules version selection. If one of the monorepo’s dependencies uses Go modules, then the monorepo inherits those version requirements as well.

When starting a new module, the default is to use the latest version of dependencies. This was an issue for us as some of our monorepo dependencies had not been updated for some time. As engineers wanted to import their module from the monorepo, it caused go mod vendor to pull in a huge amount of updates.

To solve this issue, we wrote a quick script to copy the dependency versions from one module to another.

One key learning here is to have other modules use the monorepo’s versions, and if any updates are needed then the monorepo should be updated first.

Implicit updates

Implicit updates are a more subtle problem. The typical Go modules workflow is to use standard Go commands: go build, go test, and so on, and they will automatically update the go.mod file as needed. However, this was sometimes surprising, and it wasn’t always clear why the go.mod file was being updated. Some of the reasons we found were:

  • A new import was added by mistake, causing the dependency to be added to the go.mod file
  • There is a local replace for some module B, and B changes its own go.mod. When there’s a local replace, it bypasses versioning, so the changes to B’s go.mod are immediately inherited.
  • The build imports a package from a dependency that can’t be satisfied with the current version, so Go attempts to update it.

This means that simply creating a tag in an external repository is sometimes enough to affect the go.mod file, if you already have a broken import in the codebase.

Resolving unexpected dependencies using graphs

To investigate the unexpected dependencies, the command go mod graph proved the most useful.

Running graph with good old grep was good enough, but its output is also compatible with the digraph tool for more sophisticated queries. For example, we could use the following command to trace the source of a dependency on cloud.google.com/go:

$ go mod graph | digraph somepath grab.com/example cloud.google.com/go@<version>

github.com/hashicorp/vault/<module>@<version> github.com/hashicorp/vault/<module>@<version>
github.com/hashicorp/vault/<module>@<version> google.golang.org/<module>@<version>
google.golang.org/<module>@<version> google.golang.org/<module>@<version>
google.golang.org/<module>@<version> cloud.google.com/go@<version>
Diagram generated using modgraphviz

Stay tuned for more

I hope you have enjoyed this article. In our next post, we’ll cover the other solutions we have for catching unexpected changes to the go.mod file and addressing dependency issues.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Credits

The cute Go gopher logo for this blog’s cover image was inspired by Renee French’s original work.

April service disruptions analysis

Post Syndicated from Keith Ballinger original https://github.blog/2020-05-22-april-service-disruptions-analysis/

In April, we experienced multiple service interruptions impacting GitHub.com. Three distinct incidents resulted in unavailability for all of GitHub.com and degraded services for a total of 5 hours and 36 minutes.

We know that any disruption to our service can impact your development workflow, and apologize for these interruptions. We investigated each of these incidents and want to share an update on the causes and remediations. 

Background 

We use a traditional primary-replica configuration, where a primary cluster accepts writes, which are then asynchronously fanned out to a number of replica cluster nodes that serve read traffic for the majority of the website.

We run the majority of our systems on our own bare metal infrastructure, with our networking infrastructure built around a Clos network topology with each network device sharing routes via Border Gateway Protocol (BGP). Our edge network devices have complete internet routing tables while our internal network devices have only internal routes which span the internal data centers. The GitHub Load Balancer (GLB) runs as the primary ingress point for customer traffic to our network, and also acts as an internal gateway between our applications and many of our internal services and databases they depend on.

Timeline 

April 2 20:20 UTC (lasting for one hour 44 minutes)

A misconfiguration of our software load balancers disrupted internal routing of traffic between applications that serve GitHub.com and the internal services they depend on. The misconfiguration introduced TCP socket binds that went over a predefined limit. The load balancer canary deployments don’t include all subsystems, so this problem only occurred when deploying the load balancer to each site. 

April 21 15:45 UTC (lasting for one hour and 21 minutes)

A misconfiguration of database connections, related to our ongoing data partitioning efforts, unexpectedly made it to production. This caused 40 percent of traffic to the main mysql1 database cluster powering GitHub.com to stop using replicas for reads and sent all of it to the mysql1 primary node.

This pressure overloaded the primary node and caused a majority of write traffic to fail. Around 40 percent of requests to GitHub.com failed during a period of 50 minutes. Because of the central nature of the data on the main mysql1 database cluster, all services of GitHub.com and all users were affected.

April 23 13:13 UTC (lasting for two hours and 31 minutes)

A networking configuration was inadvertently applied to our production network, causing a failure of our networking switches for 31 minutes. The intent of the configuration was to pass only a subset of routes to the routing tables of downstream devices, but the router couldn’t interpret the policy mistake and reacted by propagating too many routes to every downstream switch in the region. This resulted in disruptions for all GitHub.com services for an additional two hours.

What we learned and next steps

Following these three incidents, we were able to identify gaps between our staging/canary environments and production. We’re devoting more energy to close these gaps across engineering and more quickly detect and address issues like this in the future. 

One significant investment we’ve made is building a hardware staging environment for network continuous integration. We now have a network staging environment in place, which mirrors our production networks to test changes via CI. This has been in flight for the past nine months and landed last week. We intend to have the first CI jobs for network configuration working in the coming weeks. We are prioritizing this work for our networking template engines in order to reach 100 percent software coverage in the next quarter.

Another area we’ve identified that presents gaps is our staging labs environment. This staging environment does not set up the databases and database connections the same way as the production environment. This can lead to limited testability of connection changes specific to the production environment. We will be addressing this issue in the coming months.

In summary

Developers use GitHub every day to create, collaborate, learn, and get their work done, but none of our work matters if we can’t promise our customers the reliability they need. We’re committed to addressing and learning from these issues to earn the trust you place in us and make GitHub better every day.

You can follow our status page for updates about the availability of our systems. 

Check GitHub’s current status


Sawfish phishing campaign targets GitHub users

Post Syndicated from GitHub SIRT original https://github.blog/2020-04-14-sawfish-phishing-campaign-targets-github-users/

Over the last week, GitHub has received reports related to a phishing campaign targeting our customers. We’re publishing this blog to increase awareness of this ongoing threat.

Background

The phishing message claims that a repository or setting in a GitHub user’s account has changed or that unauthorized activity has been detected. The message goes on to invite users to click on a malicious link to review the change. Specific details may vary since there are many different lure messages in use. Here’s a typical example:

A phishing message stating that a user accessed an endpoint through the GitHub API tries to lure users into clicking a malicious link to check their recent activity.

Clicking the link takes the user to a phishing site mimicking the GitHub login page, which steals any credentials entered. For users with TOTP-based two-factor authentication enabled, the site also relays any TOTP codes to the attacker and GitHub in real-time, allowing the attacker to break into accounts protected by TOTP-based two-factor authentication. Accounts protected by hardware security keys are not vulnerable to this attack.

The attacker uses the following tactics, but not all tactics are used in every case:

  • The phishing email is sourced from legitimate domains, using compromised email servers or stolen API credentials for legitimate bulk email providers.
  • Targeting of currently-active GitHub users across many companies in the tech sector and in multiple countries via email addresses used for public commits.
  • Use of URL-shortening services to conceal the true destination of the malicious link. Sometimes the attacker chains multiple URL-shortening services for further obfuscation.
  • Use of PHP-based redirectors on compromised websites to redirect the victim from a less suspicious-looking URL to another malicious one.
  • If the attacker successfully steals GitHub user account credentials, they may quickly create GitHub personal access tokens or authorize OAuth applications on the account in order to preserve access in the event that the user changes their password.
  • In many cases, the attacker immediately downloads private repository contents accessible to the compromised user, including those owned by organization accounts and other collaborators.

What GitHub is doing

GitHub Security is monitoring for new phishing sites while filing abuse reports and takedown requests. We’re committed to enabling users and organizations to better secure their accounts and data, and provide assistance securing accounts and investigating activity associated with compromised accounts.

GitHub is working tirelessly to make existing security features more accessible, as well as adding new features designed to make user accounts significantly harder to compromise.

How to protect yourself

If you believe you may have entered credentials on a phishing site, reset your password right away and review your account’s security settings, authorized OAuth applications, and personal access tokens for anything you don’t recognize.

In order to prevent phishing attacks (which collect two-factor codes) from succeeding, consider using hardware security keys or WebAuthn two-factor authentication. Also consider using a browser-integrated password manager. Many commercial and open-source options exist including browser-based password management native to popular web browsers. These provide a degree of phishing protection by autofilling or otherwise recognizing only a legitimate domain for which you have previously saved a password. If your password manager doesn’t recognize the website you’re visiting, it might be a phishing site.

To verify that you’re not entering credentials in a phishing site, confirm that the URL in the address bar is https://github.com/login and that the site’s TLS certificate is issued to GitHub, Inc.

Viewing site information of a GitHub page in a browser shows a correct GitHub URL and a valid certificate issued to GitHub, Inc.

Further viewing site information for the GitHub page shows an authentic certificate for GitHub.com from DigiCert stating that the certificate is valid.

If you’ve received phishing emails related to this phishing campaign, please contact GitHub Support with details about the sender email address and URL of the malicious site to help us respond to this issue.

Known phishing domains

Currently, we’ve observed the following phishing domains used by the attacker. Most of these are already offline, but the attacker frequently creates new domains and will likely continue to do so:

  • aws-update[.]net
  • corp-github[.]com
  • ensure-https[.]com
  • git-hub[.]co
  • git-secure-service[.]in
  • githb[.]co
  • glt-app[.]net
  • glt-hub[.]com
  • glthub[.]co
  • glthub[.]info
  • glthub[.]net
  • glthubb[.]info
  • glthube[.]app
  • glthubs[.]com
  • glthubs[.]info
  • glthubs[.]net
  • glthubse[.]info
  • slack-app[.]net
  • ssl-connection[.]net
  • sso-github[.]com
  • sts-github[.]com
  • tsl-github[.]com
  • data-github[.]com
  • gilthub[.]com
  • gïthub[.]com
  • githube[.]app
  • githubs[.]info
  • gltgub[.]net
  • glthhubs[.]net
  • gthub[.]co
  • xn--gthub-cta[.]com

Helping sites get back online: the origin monitoring intern project

Post Syndicated from Annika Garbers original https://blog.cloudflare.com/helping-sites-get-back-online-the-origin-monitoring-intern-project/


The most impactful internship experiences involve building something meaningful from scratch and learning along the way. Those can be tough goals to accomplish during a short summer internship, but our experience with Cloudflare’s 2019 intern program met both of them and more! Over the course of ten weeks, our team of three interns (two engineering, one product management) went from a problem statement to a new feature, which is still working in production for all Cloudflare customers.

The project

Cloudflare sits between customers’ origin servers and end users. This means that all traffic to the origin server runs through Cloudflare, so we know when something goes wrong with a server and sometimes reflect that status back to users. For example, if an origin is refusing connections and there’s no cached version of the site available, Cloudflare will display a 521 error. If customers don’t have monitoring systems configured to detect and notify them when failures like this occur, their websites may go down silently, and they may hear about the issue for the first time from angry users.

When a customer's origin server is unreachable, Cloudflare sends a 5xx error back to the visitor.

This problem became the starting point for our summer internship project: since Cloudflare knows when customers’ origins are down, let’s send them a notification when it happens so they can take action to get their sites back online and reduce the impact to their users! This work became Cloudflare’s passive origin monitoring feature, which is currently available on all Cloudflare plans.

Over the course of our internship, we ran into lots of interesting technical and product problems, like:

Making big data small

Working with data from all requests going through Cloudflare’s 26 million+ Internet properties to look for unreachable origins is unrealistic from a data volume and performance perspective. Figuring out what datasets were available to analyze for the errors we were looking for, and how to adapt our whiteboarded algorithm ideas to use this data, was a challenge in itself.

Ensuring high alert quality

Because only a fraction of requests show up in the sampled timing and error dataset we chose to use, false positives and false negatives were disproportionately likely for low-traffic sites. These are the sites that are least likely to have sophisticated monitoring systems in place (and therefore are most in need of this feature!). To make the notifications as accurate and actionable as possible, we analyzed patterns of failed requests across different types of Cloudflare Internet properties. We used this data to determine thresholds that would maximize the number of true positive notifications, while making sure they weren't so sensitive that we'd end up spamming customers with emails about sporadic failures.
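
To make the trade-off concrete, here is a rough TypeScript sketch of the kind of thresholding involved. This illustrates the idea only, not Cloudflare's actual algorithm, and the threshold values are invented for the example.

// Hypothetical sketch: decide whether sampled 5xx data for a site warrants an alert.
interface SampledWindow {
  sampledRequests: number; // requests that landed in the sampled dataset
  originErrors: number;    // sampled requests that failed with origin errors
}

// Invented thresholds for illustration only.
const MIN_SAMPLED_REQUESTS = 20; // too few samples means too much noise to judge
const MIN_ERROR_RATIO = 0.9;     // most of the sampled traffic must be failing

function shouldNotify(window: SampledWindow): boolean {
  if (window.sampledRequests < MIN_SAMPLED_REQUESTS) {
    return false; // low-traffic site: not enough signal, avoid false positives
  }
  const errorRatio = window.originErrors / window.sampledRequests;
  return errorRatio >= MIN_ERROR_RATIO;
}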

Designing actionable notifications

Cloudflare has lots of different kinds of customers, from people running personal blogs with interest in DDoS mitigation to large enterprise companies with extremely sophisticated monitoring systems and global teams dedicated to incident response. We wanted to make sure that our notifications were understandable and actionable for people with varying technical backgrounds, so we enabled the feature for small samples of customers and tested many variations of the “origin monitoring email”. Customers responded right back to our notification emails, sent in support questions, and posted on our community forums. These were all great sources of feedback that helped us improve the message’s clarity and actionability.

We frontloaded our internship with lots of research (both digging into request data to understand patterns in origin unreachability problems and talking to customers/poring over support tickets about origin unreachability) and then spent the next few weeks iterating. We enabled passive origin monitoring for all customers with some time remaining before the end of our internships, so we could spend time improving the supportability of our product, documenting our design decisions, and working with the team that would be taking ownership of the project.

We were also able to develop some smaller internal capabilities that built on the work we’d done for the customer-facing feature, like notifications on origin outage events for larger sites to help our account teams provide proactive support to customers. It was super rewarding to see our work in production, helping Cloudflare users get their sites back online faster after receiving origin monitoring notifications.

Our internship experience

The Cloudflare internship program was a whirlwind ten weeks, with each day presenting new challenges and learnings! Some factors that led to our productive and memorable summer included:

A well-scoped project

It can be tough to find a project that’s meaningful enough to make an impact but still doable within the short time period available for summer internships. We’re grateful to our managers and mentors for identifying an interesting problem that was the perfect size for us to work on, and for keeping us on the rails if the technical or product scope started to creep beyond what would be realistic for the time we had left.

Working as a team of interns

The immediate team working on the origin monitoring project consisted of three interns: Annika in product management and Ilya and Zhengyao in engineering. Having a dedicated team with similar goals and perspectives on the project helped us stay focused and work together naturally.

Quick, agile cycles

Since our project faced strict time constraints and our team was distributed across two offices (Champaign and San Francisco), it was critical for us to communicate frequently and work in short, iterative sprints. Daily standups, weekly planning meetings, and frequent feedback from customers and internal stakeholders helped us stay on track.

Great mentorship & lots of freedom

Our managers challenged us, but also gave us room to explore our ideas and develop our own work process. Their trust encouraged us to set ambitious goals for ourselves and enabled us to accomplish way more than we may have under strict process requirements.

After the internship

In the last week of our internships, the engineering interns, who were based in the Champaign, IL office, visited the San Francisco office to meet with the team that would be taking over the project when we left and to present our work to the company at our all hands meeting. The most exciting aspect of the visit: our presentation was preempted by Cloudflare's co-founders announcing the company's public S-1 filing at the all hands! 🙂

Over the next few months, Cloudflare added a notifications page for easy configurability and announced the availability of passive origin monitoring along with some other tools to help customers monitor their servers and avoid downtime.

Ilya is working for Cloudflare part-time during the school semester and heading back for another internship this summer, and Annika is joining the team full-time after graduation this May. We’re excited to keep working on tools that help make the Internet a better place!

Also, Cloudflare is doubling the size of the 2020 intern class—if you or someone you know is interested in an internship like this one, check out the open positions in software engineering, security engineering, and product management.

From 48k lines of code to 10—the story of GitHub’s JavaScript SDK

Post Syndicated from Gregor Martynus original https://github.blog/2020-04-09-from-48k-lines-of-code-to-10-the-story-of-githubs-javascript-sdk/

Gregor is the maintainer of the JavaScript Octokit. He’s a seasoned open source maintainer with a passion for automating repetitive tasks and lowering the barrier for contributors of all kinds and backgrounds. Aside from Octokit, Gregor is a maintainer of Probot, nock, and semantic-release. Outside of work, he spends time taking care of his triplets Nico, Ada, and Kian. You can read more from Gregor on the DEV community and on Twitter.


In this post, we’ll cover the story of @octokit/rest—the official JavaScript SDK for GitHub’s REST APIs. @octokit/rest wasn’t originally created by GitHub. Instead, @bkeepers decided to adopt the most popular package at the time (2017): github.

The legacy

Later renamed to @octokit/rest, the github module was one of the oldest projects in the Node ecosystem, with its first commit from June 2010. That was the time of Node v0.1, and package.json didn’t exist yet because the npm registry was still being built.

In 2017, GitHub hired me to turn github into the official JavaScript SDK for GitHub’s APIs, for both Node.js and browsers. You can even see my first commit from September 2017. At the time, the project had close to 16 thousand lines of code (LOC) spread across four JavaScript files, one huge JSON file, and two files for TypeScript/Flow type definitions.

➜  rest.js git:(50720c8) wc -l lib/*  
     120 lib/error.js
    3246 lib/index.d.ts
     905 lib/index.js
    3232 lib/index.js.flow
      17 lib/promise.js
    7995 lib/routes.json
     143 lib/util.js
   15658 total

The adoption

The primary goal of the project was maintainability. The library’s core piece was a huge routes.json file with nearly eight thousand LOC, which was supposed to define all of GitHub’s REST API endpoints. The file was maintained manually, and endpoints were only added or fixed once someone realized one was missing or incorrect (check out an example).

With this in mind, instead of manually maintaining the definitions in routes.json, I created a script to scrape GitHub’s REST API documentation nightly and turn it into a machine-readable JSON representation: octokit/routes. If the script found changes, @octokit/rest received a pull request to update the routes.json file. After merging, a new version was released automatically. Thanks to the automation, the routes.json file was now guaranteed to cover all of GitHub’s REST API endpoints, and it grew to 10,275 LOC. The accompanying TypeScript definitions grew to over 26,700 LOC.
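
As a rough illustration of that pipeline (a simplified sketch, not the actual octokit/routes code; the EndpointDoc shape and file paths are assumptions), the generator’s job boils down to turning a machine-readable endpoint list into the routes.json structure:

// Hypothetical sketch of generating routes.json from scraped endpoint docs.
import { writeFileSync } from "fs";

interface EndpointDoc {          // assumed shape of one scraped endpoint
  scope: string;                 // e.g. "issues"
  id: string;                    // e.g. "create"
  method: string;                // e.g. "POST"
  path: string;                  // e.g. "/repos/:owner/:repo/issues"
  params: Record<string, { type: string; required?: boolean }>;
}

function buildRoutes(endpoints: EndpointDoc[]) {
  const routes: Record<string, Record<string, unknown>> = {};
  for (const endpoint of endpoints) {
    routes[endpoint.scope] = routes[endpoint.scope] || {};
    routes[endpoint.scope][endpoint.id] = {
      method: endpoint.method,
      url: endpoint.path,
      params: endpoint.params,
    };
  }
  return routes;
}

// A nightly job would scrape the docs, call buildRoutes(), and open a pull
// request if the generated file differs from the committed routes.json.
const scraped: EndpointDoc[] = []; // output of the (not shown) scraping step
writeFileSync("lib/routes.json", JSON.stringify(buildRoutes(scraped), null, 2));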

The architecture

Once completeness and maintainability were taken care of, I focused on another project goal: decomposability.

The JavaScript Octokit is meant for all JavaScript runtime environments, some of which have strict constraints. For example, bundle size is significant for browser usage. Instead of a single, monolithic library which implements all REST API endpoints, authentication strategies, and recommended best practices (such as pagination), it’s important to give users the choice to use lower-level libraries. That way users can make their own choices about trade-offs between bundle size and features.
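
For example, a browser application that only needs a couple of endpoints can depend on the lower-level request library alone instead of the full SDK. A minimal sketch (the repository names and token are placeholders):

// Minimal sketch: calling the REST API with @octokit/request only,
// skipping the full @octokit/rest bundle.
import { request } from "@octokit/request";

async function listOpenIssues() {
  const response = await request("GET /repos/{owner}/{repo}/issues", {
    owner: "octocat",
    repo: "hello-world",
    state: "open",
    headers: { authorization: "token <personal access token>" },
  });
  return response.data; // parsed JSON array of issues
}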

Here’s an overview of the architecture I came up with in January 2018:

The result of the internal refactoring into the new architecture looked like this:
Note that this output was simplified for readability.

➜  rest.js git:(f7c9f86) wc -l index.* lib/**/*.{js,json}
      31 index.js
    3474 index.d.ts
    3441 index.js.flow

     101 lib/endpoint/ # 4 files
     162 lib/request/ # 3 files

      83 lib/plugins/authentication/ # 3 files
     130 lib/plugins/endpoint-methods/ # 4 files
     130 lib/plugins/pagination/ # 11 files

      58 lib/parse-client-options.js
   10628 lib/routes.json

   18238 total

Over the next six months, I refactored the code and started extracting some of the modules:

  • @octokit/endpoint: Turns REST API endpoint options into generic HTTP request options
  • @octokit/request: Sends parameterized requests to GitHub’s APIs with sensible defaults in browsers and Node
  • before-after-hook: API used to hook into the request lifecycle

After using plugins internally for roughly six months, I announced the plugin API in November 2018 with version 16 and moved most of the library’s modules to internal plugins that I could extract in the future.
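
To give a sense of the plugin API (a simplified sketch using today’s @octokit/core for illustration; the logging plugin here is an invented example, not one of the official plugins), a plugin is just a function that receives the octokit instance and can hook into the request lifecycle or add methods:

// Hypothetical plugin: log every request's method, route, and response status.
import { Octokit } from "@octokit/core";

const simpleLogger = (octokit: Octokit) => {
  octokit.hook.wrap("request", async (request, options) => {
    const response = await request(options);
    console.log(`${options.method} ${options.url} -> ${response.status}`);
    return response;
  });
};

const MyOctokit = Octokit.plugin(simpleLogger);
const octokit = new MyOctokit({ auth: "<token>" });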

The new internal code architecture was now looking like this:
Note that this output was simplified for readability:

➜  rest.js git:(01763bf) wc -l index.* plugins/**/*.{js,json} lib/**/*.js
      14 index.js
   26714 index.d.ts

     110 lib/ # 6 files

      86 plugins/authentication/ # 3 files
      77 plugins/pagination/ # 3 files
      39 plugins/register-endpoints/ # 3 files
     108 plugins/validate/ # 2 files
   10275 plugins/rest-api-endpoints/routes.json
   37423 total

Later, I created @octokit/core.js: the new Octokit JavaScript core library that @octokit/rest and all other Octokit libraries will be based on moving forward. Most of its logic was extracted from @octokit/rest, without all deprecated features. I didn’t use it right away within @octokit/rest in order to avoid breaking changes.

As @octokit/core was free of any legacy code and had a lower cost for breaking changes, I experimented with decomposing the different means of authentication, too. The results are separate packages for each authentication strategy—all listed in @octokit/auth's README. If you’d like to learn more about GitHub’s authentication strategies, check out my series on GitHub API Authentication.
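
For instance, token authentication now lives in its own small package. A minimal sketch of using such a strategy on its own (the token value is a placeholder):

// Minimal sketch: using the standalone token authentication strategy.
import { createTokenAuth } from "@octokit/auth-token";

async function main() {
  const auth = createTokenAuth("<personal access token>");
  const authentication = await auth();
  // authentication.token and authentication.type ("token") can now be used
  // to set the Authorization header on requests.
  console.log(authentication.type, authentication.tokenType);
}

main();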

@octokit/core and the separate authentication libraries would replace all the code in lib/* and plugins/authentication/*. All that was left were three plugins that I extracted into their own modules: @octokit/plugin-request-log, @octokit/plugin-paginate-rest, and @octokit/plugin-rest-endpoint-methods.

The validate plugin became obsolete thanks to TypeScript’s code validation at compile time. There was no longer a need for validating request parameters in the client, which resulted in a significant reduction of code and bundle size. For example, here’s the v16 definition for the octokit.checks.create() method:

{
  checks: {
    create: {
      headers: { accept: "application/vnd.github.antiope-preview+json" },
      method: "POST",
      params: {
        actions: { type: "object[]" },
        "actions[].description": { required: true, type: "string" },
        "actions[].identifier": { required: true, type: "string" },
        "actions[].label": { required: true, type: "string" },
        completed_at: { type: "string" },
        conclusion: {
          enum: [
            "success",
            "failure",
            "neutral",
            "cancelled",
            "timed_out",
            "action_required"
          ],
          type: "string"
        },
        details_url: { type: "string" },
        external_id: { type: "string" },
        head_sha: { required: true, type: "string" },
        name: { required: true, type: "string" },
        output: { type: "object" },
        "output.annotations": { type: "object[]" },
        "output.annotations[].annotation_level": {
          enum: ["notice", "warning", "failure"],
          required: true,
          type: "string"
        },
        "output.annotations[].end_column": { type: "integer" },
        "output.annotations[].end_line": { required: true, type: "integer" },
        "output.annotations[].message": { required: true, type: "string" },
        "output.annotations[].path": { required: true, type: "string" },
        "output.annotations[].raw_details": { type: "string" },
        "output.annotations[].start_column": { type: "integer" },
        "output.annotations[].start_line": { required: true, type: "integer" },
        "output.annotations[].title": { type: "string" },
        "output.images": { type: "object[]" },
        "output.images[].alt": { required: true, type: "string" },
        "output.images[].caption": { type: "string" },
        "output.images[].image_url": { required: true, type: "string" },
        "output.summary": { required: true, type: "string" },
        "output.text": { type: "string" },
        "output.title": { required: true, type: "string" },
        owner: { required: true, type: "string" },
        repo: { required: true, type: "string" },
        started_at: { type: "string" },
        status: { enum: ["queued", "in_progress", "completed"], type: "string" }
      },
      url: "/repos/:owner/:repo/check-runs"
    }
  }
}

Starting with v17, the definition looks like this 😎:

{
  checks: {
    create: [
      "POST /repos/{owner}/{repo}/check-runs",
      { mediaType: { previews: ["antiope"] } }
    ]
  }
}
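
In both versions, application code calls the method the same way; the definition only tells the SDK how to turn the call into an HTTP request. A sketch of such a call (repository and SHA values are placeholders):

// Sketch: the compact v17 definition above maps this call to
// POST /repos/{owner}/{repo}/check-runs with the "antiope" preview media type.
import { Octokit } from "@octokit/rest";

const octokit = new Octokit({ auth: "<token>" });

async function createCheckRun() {
  const { data } = await octokit.checks.create({
    owner: "octocat",
    repo: "hello-world",
    name: "ci/lint",
    head_sha: "0123456789abcdef0123456789abcdef01234567",
    status: "in_progress",
  });
  return data.id;
}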

Finally, all I needed to do was to put the previously extracted code back together. As promised, 10 LOC:

import { Octokit as Core } from "@octokit/core";
import { requestLog } from "@octokit/plugin-request-log";
import { paginateRest } from "@octokit/plugin-paginate-rest";
import { restEndpointMethods } from "@octokit/plugin-rest-endpoint-methods";

import { VERSION } from "./version";

export const Octokit = Core
  .plugin([requestLog, paginateRest, restEndpointMethods])
  .defaults({ userAgent: `octokit-rest.js/${VERSION}` });

The tests

Every single line of code was changed between version 16 and the upcoming version 17 of @octokit/rest. The only way to be confident that no new bugs were introduced is to run extensive tests.

When we adopted the module back in 2017, no tests existed for our use case, but there were usage examples. The first thing I did was turn these usage examples into integration tests. And because the JavaScript Octokit SDK was meant to be the start of a suite of SDKs across all popular languages, I created octokit/fixtures—a language-agnostic, auto-updating set of HTTP mocks for common use cases.

For the remaining logic specific to @octokit/rest, I created integration tests until we reached 100% test coverage. The tests will fail today if coverage drops below that.

While working on the migration to version 17 with its 10 LOC, I continued to run the tests of version 16 against the new version, with the exception of tests for deprecated APIs. At the same time, too many tests aren’t helpful either. Once I got all tests to pass for version 17, I reviewed all existing tests and removed any that no longer belonged in @octokit/rest. Some of the tests were moved into the plugins, @octokit/core, or @octokit/request. Now, all that’s left are a few smoke tests and the scenario tests using @octokit/fixtures.

The future

@octokit/rest used to be the Node.js library for GitHub’s REST APIs. Starting with v17, it will be the JavaScript library implementing all best practices for GitHub’s REST APIs, including pagination, throttling, and automated request retries. It will support all existing and future authentication strategies and even GraphQL requests, as that is part of @octokit/core.

And finally, we’d like to say thanks to Fabian Jakobs, Mike de Boer, and Joe Gallo who created and maintained the github module before it became @octokit/rest.

The post From 48k lines of code to 10—the story of GitHub’s JavaScript SDK appeared first on The GitHub Blog.

Internship Experience: Cryptography Engineer

Post Syndicated from Watson Ladd original https://blog.cloudflare.com/internship-experience-cryptography-engineer/

Back in the summer of 2017 I was an intern at Cloudflare. During the scholastic year I was a graduate student working on automorphic forms and computational Langlands at Berkeley: a part of number theory with deep connections to representation theory, aimed at uncovering some of the deepest facts about number fields. I had also gotten involved in Internet standardization and security research, but much more on the applied side.

While I had published papers in computer security and had coded for my dissertation, building and deploying new protocols to production systems was going to be new. Going from the academic environment of little day-to-day supervision to the industrial one of more direction; from greenfield code that would only ever be run by one person to large projects that had to be understandable by a team; from goals measured in years or even decades to goals measured in days, weeks, or quarters; these transitions would present some challenges.

Cloudflare at that stage was a very different company from what it is now. Entire products and offices simply did not exist. Argo, now a mainstay of our offering for sophisticated companies, was slowly emerging. Access, which has been helping safeguard employees working from home these past weeks, was then experiencing teething issues. Workers was being extensively developed for launch that autumn. Quicksilver was still in the slow stages of replacing KyotoTycoon. Lisbon wasn’t on the map, and Austin was very new.

Day 1

My first job was to get my laptop working. Quickly I discovered that despite the promise of using either Mac or Linux, only Mac was supported as a local development environment. Most Linux users would take a good part of a month to tweak all the settings and get the local development environment up. I didn’t have months. After three days, I broke down and got a Mac.

Needless to say, I asked for some help. Like a drowning man in quicksand, I managed to attract three engineers to this near-insoluble problem of the edge dev stack, and after days of hacking on it, fixing problems that had long been ignored, we got it working well enough to test a few things. That development environment is now gone, replaced with one built on Kubernetes VMs, and it works much better that way. When things work on your machine, you can now send everyone your machine.

Speeding up

With setup complete enough, it was on to the problem we needed to solve. Our goal was to implement a set of three interrelated Internet drafts, one defining secondary certificates, one defining external authentication with TLS certificates, and a third permitting servers to advertise the websites they could serve.

External authentication is a TLS feature that permits a server or a client on an already opened connection to prove its possession of the private key of another certificate. This proof of possession is tied to the TLS connection, avoiding attacks on bearer tokens caused by the lack of this binding.

Secondary certificates is an HTTP/2 feature enabling clients and servers to send certificates together with proof that they actually know the private key. This feature has many applications such as certificate-based authentication, but also enables us to prove that we are permitted to serve the websites we claim to serve.

The last draft was the HTTP/2 ORIGIN frame. The ORIGIN frame enables a website to advertise other sites that it could serve, permitting more connection reuse than allowed under the traditional rules. Connection reuse is an important part of browser performance as it avoids much of the setup of a connection.

These drafts solved an important problem for Cloudflare. Many resources such as JavaScript, CSS, and images hosted by one website can be used by others. Because Cloudflare proxies so many different websites, our servers have often cached these resources as well. Browsers, though, do not know that these different websites are made faster by Cloudflare, and as a result they repeat all the steps to request the subresources again. This takes unnecessary time, since there is already an established and perfectly usable connection. If the browser knew this, it could reuse that connection.

We could only solve this problem by getting browsers and the broader community of TLS implementers on board. Some of these drafts such as external authentication and secondary certificates had a broader set of motivations, such as getting certificate based authentication to work with HTTP/2 and TLS 1.3. All of these needs had to be addressed in the drafts, even if we were only implementing a subset of the uses.

Successful standards cover the use cases that are needed while being simple enough to implement and achieve interoperability. Implementation experience is essential to achieving this success: a standard with no implementations fails to incorporate hard won lessons. Computers are hard.

Prototype

My first goal was to set up a simple prototype to test the much more complex production implementation, as well as to share outside of Cloudflare so that others could have confidence in their implementations. But these drafts that had to be implemented in the prototype were incremental improvements to an already massive stack of TLS and HTTP standards.

I decided it would be easiest to build on top of an already existing implementation of TLS and HTTP. I picked the Go standard library as my base: it’s simple, readable, and in a language I was already familiar with. There was already a basic demo showcasing support in Firefox for the ORIGIN frame, and it would be up to me to extend it.

Using that as my starting point, I was able in three weeks to set up a demonstration server and a client. This showed good progress, and that nothing in the specification was blocking implementation. But a prototype alone wasn’t enough: without integrating it into our servers for further experimentation, we might not discover rare issues that could be showstoppers. This was a bitter lesson learned from TLS 1.3, where it took months to track down a single brand of printer that was incompatible with the standard and forced a change.

From Prototype to Production

We also wanted to understand the benefits with some real world data, to convince others that this approach was worthwhile. Our position as a provider to many websites globally gives us diverse, real world data on performance that we use to make our products better, and perhaps more important, to learn lessons that help everyone make the Internet better. As a result we had to implement this in production: the experimental framework for TLS 1.3 development had been removed and we didn’t have an environment for experimentation.

At the time everything at Cloudflare was based on variants of NGINX. We had extended it with modules to implement features like Keyless and customized certificate handling to meet our needs, but much of the business logic was and is carried out in Lua via OpenResty.

Lua has many virtues, but at the time both the TLS termination and the core business logic lived in the same repo despite being different processes at runtime. This made it very difficult to understand what code was running when, and changes to basic libraries could create problems for both. The build system had the significant disadvantage of building the same targets with different settings. Lua is also a very dynamic language, but unlike the dynamic languages I was used to, there was no way to interact with the system as it was running on requests.

The first step was implementing the ORIGIN frame. In implementing this, we had to figure out which sites hosted the subresources used by the page we were serving. Luckily, we already had this logic to enable server push support driven by Link headers. Building on this let me quickly get ORIGIN working.

This work wasn’t the only thing I was up to as an intern. I was also participating in weekly team meetings, attending our engineering presentations, and getting a sense of what life was like at Cloudflare. We had an excursion for interns to the Computer History Museum in Mountain View and Moffett Field, where we saw the base museum.

The next challenge was getting the CERTIFICATE frame to work. This was a much deeper problem. NGINX processes a request in phases, and some of the phases, like the header processing phase, do not permit network I/O without locking up the event loop. Since we are parsing the headers to determine what to send, the frame is created in the header processing phase. But finding a certificate and telling Keyless to sign it required network I/O.

The standard solution to this problem is to have Lua execute a timer callback, in which network I/O is possible. But this context doesn’t have any data from the request: some serious refactoring was needed to create a way to get the keyless module to function outside the context of a request.

Once the signature was created, the battle was half over. Formatting the CERTIFICATE frame was simple, but it had to be stuck into the connection associated with the request that had demanded it be created. And there was no reason to expect the request was still alive, and no way to know what state it was in when the request was handled by the Keyless module.

To handle this issue, I made a shared btree, indexed by a number, with space for the data to be passed back and forth. This enabled the request to record that it was ready to send the CERTIFICATE frame, and Keyless to record that it was ready with a frame to send. Whichever of these happened second would do the work to enqueue the frame to send out.

This was not an easy solution: the Keyless module had been written years before and largely unmodified. It fundamentally assumed it could access data from the request, and changing this assumption opened the door to difficult to diagnose bugs. It integrates into BoringSSL callbacks through some pretty tricky mechanisms.

However, I was able to test it using the client from the prototype, and it worked. Unfortunately, when I pushed the commit in which it worked upstream, the CI system could not find the git repo where the client prototype lived, due to a setting I had forgotten to change. The CI system unfortunately didn’t associate this failure with the branch, but attempted to check it out whenever it checked out any other branch people were working on. Murphy ensured my accomplishment had happened on a Friday afternoon Pacific time, and the team that manages the SSL server was then exclusively in London…

Monday morning the issue was quickly fixed, and whatever tempers had frayed were smoothed over when we discovered the deficiency in the CI system that had enabled a single branch to break every build. It’s always tricky to work in a global team. Later Alessandro flew to San Francisco for a number of projects with the team here and we worked side by side trying to get a demonstration working on a test site. Unfortunately there was some difficulty tracking down a bug that prevented it working in production. We had run out of time, and my internship was over.  Alessandro flew back to London, and I flew to Idaho to see the eclipse.

The End

Ultimately we weren’t able to integrate this feature into the software at our edge: the risks of such intrusive changes for a very experimental feature outweighed the benefits. With not much prospect of support by clients, it would be difficult to get the real savings in performance promised. There also were nontechnical issues in standardization that have made this approach more difficult to implement: any form of traffic direction that doesn’t obey DNS creates issues for network debugging, and there were concerns about the impact of certificate misissuance.

While the project was less successful than I hoped it would be, I learned a lot of important skills: collaborating on large software projects, working with git, and communicating with other implementers about issues we found. I also got a taste of what it would be like to be on the Research team at Cloudflare, turning research from idea into practical reality, and this ultimately confirmed my choice to go into industrial research.

I’ve now returned to Cloudflare full-time, working on extensions for TLS as well as time synchronization. These drafts have continued to progress through the standardization process, and we’ve contributed some of the code I wrote as a starting point for other implementers to use.  If we knew all our projects would work out, they wouldn’t be ambitious enough to be research worth doing.

If this sort of research experience appeals to you, we’re hiring.

February service disruptions post-incident analysis

Post Syndicated from Keith Ballinger original https://github.blog/2020-03-26-february-service-disruptions-post-incident-analysis/

In late February, GitHub experienced multiple service interruptions that resulted in degraded service for a total of eight hours and 14 minutes over four distinct events. Unexpected variations in database load, coupled with an unintended configuration issue introduced as a part of ongoing scaling improvements, led to resource contention in our mysql1 database cluster.

We sincerely apologize for any negative impact these interruptions may have caused. We know you place trust in GitHub for your most important projects—and we make it our top priority to maintain that trust with a platform that is highly durable and available. We’re committed to applying what we’ve learned from these incidents to avoid disruptions moving forward.

Background

Originally, all our MySQL data lived in a single database cluster. Over time, as mysql1 grew larger and busier, we split functionally grouped sets of tables into new clusters and created new clusters for new features. However, much of our core dataset still resides within that original cluster.

We’re constantly scaling our databases to handle the additional load driven by new users and new products. In this case, an unexpected variance in database load contributed to cluster degradations and unavailability.

Timeline

2020, February 19 15:17 UTC (lasting for 52 minutes)

At this time, an unexpectedly resource-intensive query began running against our mysql1 database cluster. The intent was to run this load against our read replica pool at a much lower frequency, but we inadvertently sent this traffic to the master of the cluster, increasing the pressure on that host beyond surplus capacity. This pressure overloaded ProxySQL, which is responsible for connection pooling, resulting in an inability to consistently perform queries.

2020, February 20 21:31 UTC (lasting for 47 minutes)

Two days later, as part of a planned master database promotion, we saw unexpectedly high load which triggered a similar ProxySQL failure. The intent of this maintenance was to give our teams visibility into issues they might experience when a master is momentarily read-only.

After an initial load spike, we were able to use the same remediation steps as in the previous incident to restore the system to a working state. We suspended further maintenance events of this type as we investigated how this system failed.

2020, February 25 16:36 UTC (lasting for two hours 12 minutes)

In this third incident involving ProxySQL, active database connections crossed a critical threshold that changed the behavior of this new infrastructure. Because connections remained above the critical threshold after remediation, the system fell back into a degraded state. During this time, GitHub.com services were affected by stalled writes on our mysql1 database cluster.

As a result of our previous investigation, we understood that file descriptor limits on ProxySQL nodes were capped at levels significantly lower than intended and insufficient to maintain throughput at high load levels. Specifically, because of a system-level limit of 1048576, our process manager silently reduced our LimitNOFILE setting from 1073741824 to 65536. During remediation, we also encountered a race condition between our process manager and service configurations which slowed our ability to change our file limit to 1048576.

2020, February 27 14:31 UTC (lasting for four hours 23 minutes)

Application logic changes to database query patterns rapidly increased load on the master of our mysql1 database cluster. This spike slowed down the cluster enough to affect availability for all dependent services.

What we learned

Observability and performance improvements

We’ve used these events to identify operational readiness and observability improvements around how we need to operate ProxySQL. We have made changes to allow us to more quickly detect and address issues like this in the future. Remediating these issues was straightforward once we tracked down the interactions between systems. It’s clear to us that we need better system integration and performance testing at realistic load levels in some areas before fully deploying to production.

We’re also devoting more energy to understanding the performance characteristics of ProxySQL at scale and the trickle-down effect on other services before it affects our users.

Optimizing for stability

We decided to freeze production deployments for three days to address short-term hotspotting as a result of the final incident on February 27. This helped us stabilize GitHub.com. Our increased confidence in service reliability gave us the breathing room to plan next steps thoughtfully and also helped identify long-term investments we can make to mitigate underlying scalability issues.

Next steps

Immediate changes: Data partitioning

Just days after these incidents, we shipped a sizable chunk of the data partitioning effort we’ve been working on for the past six months, covering one of our more significant MySQL table domains: the “abilities” table. Given that every authentication request to GitHub uses this table domain in some way, we had to be very deliberate about how we performed this work to fulfill our requirement of zero downtime and minimal user impact. To accomplish this, we:

  1. Removed all JOIN queries between the abilities table and any other table in the database
  2. Built a new cluster to move all of the table data into
  3. Copied all data to the new cluster using Vitess's vertical replication feature, keeping the copied data up-to-date in real time
  4. Moved all reads to the new cluster
  5. Moved all writes to the new cluster using Vitess's proxy layer, vtgate

Steps one and two took months’ worth of effort, while steps three through five were completed in four hours.

These changes reduced load on the mysql1 cluster master by 20 percent, and queries per second by 15 percent. The following graph shows a snapshot of query performance on March 2nd, prior to partitioning, and on March 3rd, after partitioning. Queries per second peaked at 109.9k/second immediately prior to partitioning and decreased to a peak of 89.19k queries/second after.

Query performance graph showing decrease in queries per second on March 2-3.

Additional technical and organizational initiatives

  • Audit and lower reads from leader databases: We’re auditing reads from master databases with the intent to lower them. We put unnecessary load on our master databases by reading from them when replicas have the same data. This can exacerbate capacity issues and service reliability by adding load to the most critical databases at GitHub. By auditing and changing these reads to replica databases, we increase headroom in our SQL databases and reduce the risk of them being overloaded as production load varies.
  • Widen our usage of feature flags: We’re requiring feature flags for code updates, which allows us to disable problematic code dynamically, and will dramatically speed up recovery for active incidents.
  • Complete in-flight functional partitioning: We’re completing our in-flight functional partitioning of our mysql1 database cluster. We’ve been working on moving tables that comprise functional domains out of the cluster over the last year. We’re very close to finishing that work for a significant number of tables and schema domains. Shipping these improvements will immediately reduce writes to the cluster by 60 percent and storage requirements by 70 percent. Expect to see another post in the coming months that details the performance benefits from this work.
  • Refine our dashboards: We’re improving our visibility into the effects of deploys on our mysql1 database cluster. Our current deployment dashboard can be noisy, making it difficult to determine deploy safety. Interviews with engineers involved in these incidents noted the cognitive load required to process noisy dashboards. We predict refined dashboards will allow us to spot problems with deploys earlier in the process.
  • Invest in additional data partitioning opportunities: We’ve identified an additional 12 schema domains that we’ll split out from the cluster to further protect us from request and storage constraints.
  • Shard to scale: We’ll begin sharding our largest schema set. Functional partitioning buys us time, but it’s not a sufficient solution to our scaling needs. We plan to use sharding (tenant partitioning) to move us from vertical scaling to horizontal scaling, enabling us to expand capacity much more easily as we grow.

In summary

We know you depend on our platform to be reliable. With the immediate changes we’ve made and long-term plans in progress, we’ll continue to use what we’ve learned to make GitHub better every day.

The post February service disruptions post-incident analysis appeared first on The GitHub Blog.

Six years of the GitHub Security Bug Bounty program

Post Syndicated from Brian Anglin original https://github.blog/2020-03-25-six-years-of-the-github-security-bug-bounty-program/

Last month GitHub reached some big milestones for our Security Bug Bounty program. As of February 2020, it’s been six years since we started accepting submissions. Over the years we’ve been able to invest in the bug bounty community through live events, private bug bounties, feature previews, and of course through cash bounties.

We’re excited to announce that we recently passed $1,000,000 in total payments to researchers since we moved our program to HackerOne in 2016. We paid out over half of our total awards in the last year alone, reaching almost $590,000 in total bounty rewards across our programs. We’ve also been able to provide quick response times to an increasingly large number of submissions—maintaining an average response time of 17 hours. This is all while seeing a 40 percent increase in submissions since last year. We’re sharing some highlights from the past year along with our upcoming plans for the future.

2019 highlights

Cool bugs

One of my favorite parts of working on the bug bounty program is getting to see the amazing submissions we get from the community. Many of the best submissions show an understanding of GitHub and our technology that rivals that of our own engineering teams. We’ve offered very competitive bounties so we can attract those talented individuals and provide them an incentive to spend time digging deep into our codebase. The community in 2019 did not disappoint.

OAuth flow bypass using cross-site HEAD requests

@not-an-aardvark has made a lot of great submissions to our program, but this one was particularly impactful. He wrote a great, detailed post about it that I’ll quickly recap.

GitHub provides a few ways for integrators to interact with our ecosystem. One of the ways integrators can use GitHub is via OAuth applications which allow the application to take actions on behalf of a GitHub user. Before allowing access to a user’s data, an OAuth application must redirect the user to GitHub.com, allowing them to review the requested permissions and explicitly authorize the application. @not-an-aardvark found a way to bypass our controls to authorize OAuth applications without any user interaction. Let’s get into how this happened.

When we process state changing requests on GitHub.com, such as authorizing an OAuth application, we rely on Ruby on Rails’ Cross Site Request Forgery (CSRF) protection. We inject a special token into the DOM of every form element that we validate when receiving POST requests. The OAuth application authorization flow uses POST requests which require a valid CSRF token. However, the OAuth controller incorrectly allowed both POST and HEAD requests to trigger the authorization logic. We skip CSRF validation when processing HEAD requests since they’re not typically state changing. This allowed a malicious site to automatically authorize an OAuth application without any user interaction.
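
To illustrate the class of bug (a framework-agnostic TypeScript sketch, not GitHub's or Rails' actual code): a CSRF filter that only guards POST, combined with a handler that also runs its state-changing logic for HEAD, lets a cross-site HEAD request slip through without a token.

// Illustrative sketch of the vulnerability class (not GitHub's actual code).
interface Request {
  method: string;
  body?: Record<string, string>;
}

function csrfTokenIsValid(req: Request): boolean {
  // Real implementations compare a session-bound token; details omitted.
  return req.body?.csrf_token === "expected-session-token";
}

function authorizeOAuthApp(): string {
  // State-changing logic: grants the application access on behalf of the user.
  return "200 application authorized";
}

function handleAuthorize(req: Request): string {
  // CSRF validation is skipped for HEAD because HEAD is assumed to be "safe"...
  if (req.method === "POST" && !csrfTokenIsValid(req)) {
    return "403 invalid CSRF token";
  }
  // ...but the authorization logic still runs for both POST and HEAD,
  // so a forged cross-site HEAD request authorizes the app with no token.
  if (req.method === "POST" || req.method === "HEAD") {
    return authorizeOAuthApp();
  }
  return "405 method not allowed";
}

// A malicious page only needs to trigger: handleAuthorize({ method: "HEAD" })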

Due to the severity of the vulnerability, we needed to patch it as quickly as possible. We worked closely with the engineering team and shipped a fix to GitHub users within three hours of receiving the submission. We also conducted a full investigation with SIRT engineers and confirmed that this vulnerability wasn’t exploited in the wild. Additionally, we rolled out patches for GitHub Enterprise Server for all supported versions. We rewarded @not-an-aardvark with $25,000 for the severity of the vulnerability and their detailed writeup in their submission.

This bug demonstrates the important role that researchers play in our overall security. By identifying this issue via our bug bounty program, we were able to protect our users by patching the issue and validating that it wasn’t previously exploited.

GitHub.com remote code execution through command injection

@ajxchapman achieved remote code execution in GitHub.com by triggering command injection in our Mercurial import feature. The import logic didn’t correctly sanitize branch names which allowed a maliciously crafted branch name to execute code on our servers. Since the import feature is quite complicated, we’ve traditionally run the import code in a sandbox on dedicated servers isolated from our production network. This isolation limited the impact of the vulnerability, and we were able to quickly release a fix for GitHub.com and backported the fix for GitHub Enterprise Server customers. We also audited the import logic for similar issues and confirmed from our logging systems that this wasn’t exploited in the wild.

What makes this bug particularly interesting is the root cause: it was ultimately caused by an outdated dependency. The bug existed in a dependency that handles code imports and was previously fixed upstream. However, we failed to keep up with the latest version and were ultimately vulnerable to this issue. This issue highlights how critical dependency management is to the overall success of a security program. GitHub continues to invest in dependency management tooling to keep us and our customers secure. Find more of Alex’s work on his personal blog.

Expanded scope

GitHub released many new features in 2019 that were added to our Security Bug Bounty scope:

  • Pull reminders added functionality to help keep engineers informed of new pull requests that need attention. We integrated the solution into our core application and our existing Slack integration.
  • Automated security updates (formerly Dependabot) added a better way to track vulnerabilities in dependencies since it automatically opens new pull requests updating the version of a dependency when it finds a new security fix.
  • GitHub for mobile is GitHub’s first presence in the App Store. This brought new requirements of our API and new security concerns in our application. We’re delivering the same security and functionality that’s available on GitHub.com.
  • GitHub Actions is one of GitHub’s biggest releases since pull requests, and it introduced whole new classes of security corner cases. Through close collaboration with our engineering partners, we’ve provided users the ability to run their code right on GitHub.com.
  • Semmle’s LGTM tool was a significant addition to our suite of security tools, like Dependabot and the Maintainer Security Advisories. LGTM allows our users to scan for potential security issues in their code on every pull request.

We’ve had several valuable submissions that influenced the development of these products significantly. We paid out over $20,000 in bounties for vulnerabilities affecting the products in this expanded scope, and we’re excited to continue expanding our Bug Bounty scope as GitHub grows.

H1-702

In August 2019, we returned to Las Vegas to participate in our second H1-702 event. This event invited the top hackers from HackerOne’s platform to join us along with two other companies for three nights of live hacking. We were excited to participate and wanted to give researchers every incentive to dig deep into our application. We even added a bunch of bonuses on top of our base payouts, including bonuses for Best Proof of Concept, Longest Exploit Chain, and RCE. We also set up a CTF on GitHub.com to direct researchers to some of our newest attack surfaces. Lastly, we hid flags in a Maintainer Security Advisory and GitHub Package Registry with bonuses for every flag. We received positive feedback from some of our researchers about our CTF and will continue to include a CTF component in future events.

Overall, we paid out over $155,000 to researchers in one night, with half those rewards for high or critical severity issues. We can’t express how important live-hacking events, like H1-702, are to our bug bounty program. We look forward to more live-hacking events in the future and other new and innovative ways to engage the community.

Private bug bounty

Beyond the wide scope of our public program, we conducted an invite-only program where we previewed features to researchers before they launched to everyone. These private programs allow us to work closely with a small group, and give us the opportunity to find bugs before they can affect the majority of our users. We’ve paid out just over $37,000 via our private program this year, and many of these findings were fixed before new features reached a significant number of our customers.

Actions CI/CD

Following the success of our first private bug bounty targeting GitHub Actions, we wanted to re-run the private program to target the most recent iteration of our GitHub Actions product. We used what we learned in our first bug bounty to secure the product against similar issues. The community accepted the challenge and found novel bugs in our second iteration.

Automated security updates (formerly Dependabot)

Just like any combination of two complex systems, the acquisition of Dependabot presented a unique challenge for our security team in integrating these two separate architectures. We used the private bug bounty to supplement our own security review of these new services. The findings from the private bug bounty program greatly informed how we integrated Dependabot with GitHub.com. We were also able to surface a few issues before rolling it out.

Pull reminders

Like Dependabot, pull reminders required the same care and attention to ensure a secure transition from an integration to a first-party GitHub product. Pull reminders also added more complexity through its connection to Slack. Our own Slack integration provided a foundation for this feature, but there was significant re-architecture and development to tie these two features together. Again, we turned to our bug bounty community to test our pull reminder integration before releasing the feature widely.

2020 initiatives

We have a lot of plans for 2020 and want to highlight some of our upcoming changes.

Security Lab bounty program

We launched the GitHub Security Lab bounty program to incentivize researchers to help us secure all open source software. The new program rewards community members who write CodeQL queries that detect entire vulnerability classes so that the rest of the community can run those queries against their own projects. This results in removing vulnerabilities at scale.

Making a contribution to this program not only influences the global state of software security, but also prevents similar vulnerabilities from being released in the future. This is an exciting twist on our traditional bug bounty program, and we’re excited to see researchers using our new CodeQL tooling. To date, we’ve received 20 submissions and awarded almost $21,000, with hundreds of vulnerabilities fixed across the OSS ecosystem as a direct result.

CVEs and disclosure

This year, we’re assigning CVEs to bounty submissions which affect GitHub Enterprise Server. This is a big step forward in consistently communicating the state of our software to our customers, but also provides another accolade for our researchers who identify vulnerabilities in GitHub Enterprise Server.

Get involved

Are you excited by the new additions to our program? Get involved! Visit the GitHub Security Bug Bounty page for details of our scope, rules, and rewards. We can’t wait to make GitHub better for everyone with the help of your submissions.

Learn more about the GitHub Security Bug Bounty

The post Six years of the GitHub Security Bug Bounty program appeared first on The GitHub Blog.

CERT partners with GitHub Security Lab for automated remediation

Post Syndicated from Nico Waisman original https://github.blog/2020-03-18-cert-partners-with-github-security-lab-for-automated-remediation/

As security researchers, the GitHub Security Lab team constantly embarks on an emotional journey with each new vulnerability challenge. The excitement of starting new research, the disappointment that comes with hitting a plateau, the energy it takes to stay focused and on target…and hopefully, the sheer joy of achieving a tangible result after weeks or months of working on a problem that seemed unsolvable.

Regardless of how proud you are of the results, do you ever get a nagging feeling that maybe you didn’t make enough of an impact? While single bug fixes are worthwhile in improving code, they’re not sufficient to improve the state of security of the open source software (OSS) ecosystem as a whole. This holds true especially when you consider that software is always growing and changing—and as vulnerabilities are fixed, new ones are introduced.

Beyond single bug fixes

At GitHub, we host millions of OSS projects which puts us in a unique position to take a different approach with OSS security. We have the power and responsibility to make an impact beyond single bug fixes. This is why a big part of the GitHub Security Lab mission is to find ways to scale our vulnerability hunting efforts and empower others to do the same.

Our goal is to turn single vulnerabilities into hundreds, if not thousands, of bug fixes at a time. Enabled by the GitHub engineering teams, we aim to establish workflows that are open to the community that tackle vulnerabilities at scale on the GitHub platform.

Ultimately, we want to establish feedback loops with the developer and security communities, and act as security facilitators, all while working with the OSS community to secure the software we all depend upon.


We’re taking a deep dive into the remediation of a security vulnerability with CERT. Learn more about how we found ways to scale our vulnerability hunting efforts and empower others to do the same.

Continue reading on the Security Lab blog

The post CERT partners with GitHub Security Lab for automated remediation appeared first on The GitHub Blog.

Tackling UI test execution time imbalance for Xcode parallel testing

Post Syndicated from Grab Tech original https://engineering.grab.com/tackling-ui-test-execution-time-imbalance-for-xcode-parallel-testing

Introduction

Testing is a common practice to ensure that code logic is not easily broken during development and refactoring. Having tests run as part of Continuous Integration (CI) infrastructure is essential, especially with a large codebase that many engineers contribute to. However, the more tests we add, the longer the suite takes to execute. In the context of iOS development, the execution time of the whole test suite can be significantly affected by the growing number of tests. Running pre-merge CI pipelines against a change would therefore cost us more time. Reducing test execution time is a long-term epic we have to tackle in order to build a good CI infrastructure.

Apart from splitting tests into subsets and running each of them in a CI job, we can also make use of the Xcode parallel testing feature to achieve parallelism within one single CI job. However, due to platform-specific implementations, there are some constraints that prevent parallel testing from working efficiently. One constraint we found is that tests of the same Swift class run on the same simulator. In this post, we will discuss this constraint in detail and introduce a tip to overcome it.

Background

Xcode parallel testing

The parallel testing feature was shipped as part of the Xcode 10 release. This support enables us to easily configure test setup:

  • There is no need to care about how to split a given test suite.
  • The number of workers (i.e. parallel runners/instances) is configurable. We can pass this value in the xcodebuild CLI via the -parallel-testing-worker-count option.
  • Xcode takes care of cloning and starts simulators accordingly.

However, the distribution logic under the hood is a black-box. We do not really know how tests are assigned to each worker or simulator, and in which order.

Three simulators running tests in parallel

It is worth mentioning that even without the Xcode parallel testing support, we can still achieve similar improvements by running subsets of tests in different child processes. But it takes more effort to dispatch tests to each child process in an efficient way, and to handle the output from each test process appropriately.

Test time imbalance

Generally, a parallel execution system is at its best efficiency if each parallel task executes in roughly the same duration and ends at roughly the same time.

If the time spent on each parallel task differs significantly, it takes longer than expected to execute all tasks. For example, in the following image, the system on the left takes 13 mins to finish 3 tasks, whereas the one on the right takes only 10.5 mins to finish the same 3 tasks.

Bad parallelism vs. good parallelism

Assume there are N workers, and the i-th worker finishes its tasks at time t_i. In the left plot, t_1 = 10 mins, t_2 = 7 mins, t_3 = 13 mins.

We define the test time imbalance metric as the difference between the min and max end time:

max(t_i) − min(t_i)

For the example above, the test time imbalance is 13 mins – 7 mins = 6 mins.
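As a quick illustration, here is a minimal Swift sketch that computes this metric from the worker end times in the left plot (the variable names are ours, purely for illustration):

// Worker end times in minutes, taken from the left plot above.
let workerEndTimes: [Double] = [10, 7, 13]

// Test time imbalance = max end time - min end time.
let imbalance = (workerEndTimes.max() ?? 0) - (workerEndTimes.min() ?? 0)
print(imbalance) // prints 6.0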

Contributing factors in test time imbalance

There are several factors causing test time imbalance. The top two prominent factors are:

  1. Tests vary in execution time.
  2. Tests of the same class run on the same simulator.

An example of the first factor: in our project, around 50% of tests execute within 20-40 secs, some take under 15 secs, and several take up to 2 minutes. Longer execution times are sometimes inevitable since those tests usually touch many flows that cannot be split. If such tests run last, the test time imbalance may increase.

However, this issue generally does not matter that much because long-running tests do not always run last.

Regarding the second factor, there is no official Apple documentation that explicitly states this constraint. When Apple first introduced parallel testing support in Xcode 10, they only mentioned that test classes are distributed across runner processes:

“Test parallelization occurs by distributing the test classes in a target across multiple runner processes. Use the test log to see how your test classes were parallelized. You will see an entry in the log for each runner process that was launched, and below each runner you will see the list of classes that it executed.”

For example, suppose we have a test class JobFlowTests that includes five tests and another test class TutorialTests that has only a single test.

final class JobFlowTests: BaseXCTestCase {
  func testHappyFlow() { ... }
  func testRecoverFlow() { ... }
  func testJobIgnoreByDax() { ... }
  func testJobIgnoreByTimer() { ... }
  func testForceClearBooking() { ... }
}
...
final class TutorialTests: BaseXCTestCase {
  func testOnboardingFlow() { ... }
}

When executing these two test classes with two simulators running in parallel, the actual run looks like the left side of the following image, but ideally it should work like the right side.

Tests of the same class currently run on the same simulator (left), but ideally they should be able to run on different simulators (right).

Diving deep into Xcode parallel testing

Demystifying Xcode scheduling log

As mentioned above, Xcode distributes tests to simulators/workers in a black-box manner. However, by looking at the scheduling log generated when running tests, we can understand how Xcode parallel testing works.

When running UI tests via the xcodebuild command:

$ xcodebuild -workspace Driver/Driver.xcworkspace \
    -scheme Driver \
    -configuration Debug \
    -sdk 'iphonesimulator' \
    -destination 'platform=iOS Simulator,id=EEE06943-7D7B-4E76-A3E0-B9A5C1470DBE' \
    -derivedDataPath './DerivedData' \
    -parallel-testing-enabled YES \
    -parallel-testing-worker-count 2 \
    -only-testing:DriverUITests/JobFlowTests \    # 👈👈👈👈👈
    -only-testing:DriverUITests/TutorialTests \
    test-without-building

The log can be found inside the *.xcresult folder under DerivedData/Logs/Test. For example: DerivedData/Logs/Test/Test-Driver-2019.11.04_23-31-34-+0800.xcresult/1_Test/Diagnostics/DriverUITests-144D9549-FD53-437B-BE97-8A288855E259/scheduling.log

Scheduling log under the xcresult folder
2019-11-05 03:55:00 +0000: Received worker from worker provider: 0x7fe6a684c4e0 [0: Clone 1 of DaxIOS-XC10-1-iP7-1 (3D082B53-3159-4004-A798-EA5553C873C4)]
2019-11-05 03:55:13 +0000: Worker 0x7fe6a684c4e0 [4985: Clone 1 of DaxIOS-XC10-1-iP7-1 (3D082B53-3159-4004-A798-EA5553C873C4)] finished bootstrapping
2019-11-05 03:55:13 +0000: Parallelization enabled; test execution driven by the IDE
2019-11-05 03:55:13 +0000: Skipping test class discovery
2019-11-05 03:55:13 +0000: Executing tests {(	# 👈👈👈👈👈
    DriverUITests/JobFlowTests,
    DriverUITests/TutorialTests
)}; skipping tests {(
)}
2019-11-05 03:55:13 +0000: Load balancer requested an additional worker
2019-11-05 03:55:13 +0000: Dispatching tests {(  # 👈👈👈👈👈
    DriverUITests/JobFlowTests
)} to worker: 0x7fe6a684c4e0 [4985: Clone 1 of DaxIOS-XC10-1-iP7-1 (3D082B53-3159-4004-A798-EA5553C873C4)]
2019-11-05 03:55:13 +0000: Received worker from worker provider: 0x7fe6a1582e40 [0: Clone 2 of DaxIOS-XC10-1-iP7-1 (F640C2F1-59A7-4448-B700-7381949B5D00)]
2019-11-05 03:55:39 +0000: Dispatching tests {(  # 👈👈👈👈👈
    DriverUITests/TutorialTests
)} to worker: 0x7fe6a684c4e0 [4985: Clone 1 of DaxIOS-XC10-1-iP7-1 (3D082B53-3159-4004-A798-EA5553C873C4)]
...

Looking at the log below, we know that once a test class is dispatched or distributed to a worker/simulator, all tests of that class will be executed in that simulator.

2019-11-05 03:55:39 +0000: Dispatching tests {(
    DriverUITests/TutorialTests
)} to worker: 0x7fe6a684c4e0 [4985: Clone 1 of DaxIOS-XC10-1-iP7-1 (3D082B53-3159-4004-A798-EA5553C873C4)]

Even if we customize a test suite (by swizzling some XCTestSuite class methods or variables) to split it into multiple suites, it does not work because the customized test suite is only initialized after tests have been dispatched to a given worker.

Therefore, any hook to bypass this constraint must be done early on.

Passing the -only-testing argument to the xcodebuild command

Now we pass tests (instead of test classes) to the -only-testing argument.

$ xcodebuild -workspace Driver/Driver.xcworkspace \
    # ...
    -only-testing:DriverUITests/JobFlowTests/testJobIgnoreByTimer \
    -only-testing:DriverUITests/JobFlowTests/testRecoverFlow \
    -only-testing:DriverUITests/JobFlowTests/testJobIgnoreByDax \
    -only-testing:DriverUITests/JobFlowTests/testHappyFlow \
    -only-testing:DriverUITests/JobFlowTests/testForceClearBooking \
    -only-testing:DriverUITests/TutorialTests/testOnboardingFlow \
    test-without-building

But still, the scheduling log shows that tests are grouped by test class before being dispatched to workers (see the following log for reference). This grouping is done automatically by Xcode, which is not what we want.

2019-11-05 04:21:42 +0000: Executing tests {(	# 👈
    DriverUITests/JobFlowTests/testJobIgnoreByTimer,
    DriverUITests/JobFlowTests/testRecoverFlow,
    DriverUITests/JobFlowTests/testJobIgnoreByDax,
    DriverUITests/TutorialTests/testOnboardingFlow,
    DriverUITests/JobFlowTests/testHappyFlow,
    DriverUITests/JobFlowTests/testForceClearBooking
)}; skipping tests {(
)}
2019-11-05 04:21:42 +0000: Load balancer requested an additional worker
2019-11-05 04:21:42 +0000: Dispatching tests {(  # 👈 ❌
    DriverUITests/JobFlowTests/testJobIgnoreByTimer,
    DriverUITests/JobFlowTests/testForceClearBooking,
    DriverUITests/JobFlowTests/testJobIgnoreByDax,
    DriverUITests/JobFlowTests/testHappyFlow,
    DriverUITests/JobFlowTests/testRecoverFlow
)} to worker: 0x7fd781261940 [6300: Clone 1 of DaxIOS-XC10-1-iP7-1 (93F0FCB6-C83F-4419-9A75-C11765F4B1CA)]
......

Overcoming grouping logic in Xcode parallel testing

Tweaking the -only-testing argument values

Based on our observations, we can sketch how Xcode runs tests in parallel. See the pseudocode below.

Step 1.   tests = detect_tests_to_run()  # parse -only-testing arguments
Step 2.   groups_of_tests = group_tests_by_test_class(tests)
Step 3.   while groups_of_tests is not empty:
Step 3.1.     worker = find_free_worker()
Step 3.2.     if worker is not None:
                  dispatch_tests_to_worker(groups_of_tests.pop(), worker)

In the pseudo-code above, we do not have much control over step 2 since that grouping logic is implemented by Xcode. But we can make a good guess: Xcode groups tests by the first two components only (the target and the class name, for example DriverUITests/JobFlowTests). In other words, tests having the same class name run together on one simulator.
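
To make that guess about step 2 concrete, here is a minimal Swift sketch of the grouping we believe the load balancer performs, using the test identifiers from the earlier example (the grouping key is our assumption, based on the scheduling logs):

// Group fully qualified test identifiers by their first two path components
// (target/class), which is what the scheduling logs suggest Xcode does.
let tests = [
    "DriverUITests/JobFlowTests/testHappyFlow",
    "DriverUITests/JobFlowTests/testRecoverFlow",
    "DriverUITests/TutorialTests/testOnboardingFlow",
]
let groupsOfTests = Dictionary(grouping: tests) { test in
    test.split(separator: "/").prefix(2).joined(separator: "/")
}
// groupsOfTests["DriverUITests/JobFlowTests"]  -> 2 tests (same simulator)
// groupsOfTests["DriverUITests/TutorialTests"] -> 1 test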

The trick to break this constraint is simple: tweak the input (test names) so that each group contains only one test. By inserting a unique token into the class name, every class name passed via the -only-testing argument becomes different.

For example, instead of passing:

-only-testing:DriverUITests/JobFlowTests/testJobIgnoreByTimer \
-only-testing:DriverUITests/JobFlowTests/testRecoverFlow \

we use:

-only-testing:DriverUITests/JobFlowTests_AxY132z8/testJobIgnoreByTimer \
-only-testing:DriverUITests/JobFlowTests_By8MTk7l/testRecoverFlow \

Or we can use the test name itself as the token:

-only-testing:DriverUITests/JobFlowTests_testJobIgnoreByTimer/testJobIgnoreByTimer \
-only-testing:DriverUITests/JobFlowTests_testRecoverFlow/testRecoverFlow \
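
Generating these tweaked arguments is simple string manipulation. Below is a minimal Swift sketch (the function name is hypothetical, not part of our tooling) that builds the -only-testing arguments from fully qualified test identifiers:

// Build -only-testing arguments whose class component embeds the test name as a token,
// so that every test appears to Xcode as belonging to its own class.
func onlyTestingArguments(for tests: [String]) -> [String] {
    return tests.compactMap { identifier -> String? in
        // identifier looks like "DriverUITests/JobFlowTests/testRecoverFlow"
        let parts = identifier.split(separator: "/").map(String.init)
        guard parts.count == 3 else { return nil }
        return "-only-testing:\(parts[0])/\(parts[1])_\(parts[2])/\(parts[2])"
    }
}

// onlyTestingArguments(for: ["DriverUITests/JobFlowTests/testRecoverFlow"])
// -> ["-only-testing:DriverUITests/JobFlowTests_testRecoverFlow/testRecoverFlow"]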

After that, the scheduling log confirms that the trick bypasses the grouping logic: only one test is dispatched to a worker as soon as that worker is ready.

2019-11-05 06:06:56 +0000: Dispatching tests {(	# 👈 ✅
    DriverUITests/JobFlowTests_testJobIgnoreByDax/testJobIgnoreByDax
)} to worker: 0x7fef7952d0e0 [13857: Clone 2 of DaxIOS-XC10-1-iP7-1 (9BA030CD-C90F-4B7A-B9A7-D12F368A5A64)]
2019-11-05 06:06:58 +0000: Dispatching tests {(	# 👈 ✅
    DriverUITests/TutorialTests_testOnboardingFlow/testOnboardingFlow
)} to worker: 0x7fef7e85fd70 [13719: Clone 1 of DaxIOS-XC10-1-iP7-1 (584F99FE-49C2-4536-B6AC-90B8A10F361B)]
2019-11-05 06:07:07 +0000: Dispatching tests {(	# 👈 ✅
    DriverUITests/JobFlowTests_testRecoverFlow/testRecoverFlow
)} to worker: 0x7fef7952d0e0 [13857: Clone 2 of DaxIOS-XC10-1-iP7-1 (9BA030CD-C90F-4B7A-B9A7-D12F368A5A64)]

Handling tweaked test names

When a worker/simulator receives a request to run a test, the app (which could be the runner app or the hosting app) initializes an XCTestSuite corresponding to the test name. For the test suite to be constructed properly, we need to remove the inserted token.

This can be done easily by swizzling XCTestSuite.init(forTestCaseWithName:). Inside the swizzled function, we remove the token and then call the original init function.

extension XCTestSuite {
  /// For 'Selected tests' suite
  @objc dynamic class func swizzled_init(forTestCaseWithName maskedName: String) -> XCTestSuite {
    /// Recover the original test name
    /// - masked: UITestCaseA_testA1/testA1      	--> recovered: UITestCaseA/testA1
    /// - masked: Driver/UITestCaseA_testA1/testA1   --> recovered: Driver/UITestCaseA/testA1
    guard let testBaseName = maskedName.split(separator: "/").last else {
      return swizzled_init(forTestCaseWithName: maskedName)
    }
    let recoveredName = maskedName.replacingOccurrences(of: "_\(testBaseName)/", with: "/") // 👈 remove the token
    return swizzled_init(forTestCaseWithName: recoveredName) // 👈 call the original init
  }
}
Swizzle function to run tests properly
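
For completeness, here is a minimal sketch of one way the swizzle above could be installed early enough, via a test bundle principal class (set with the NSPrincipalClass key in the test bundle's Info.plist). The class name is hypothetical, and the underlying Objective-C selector for XCTestSuite.init(forTestCaseWithName:) is an assumption that should be verified against your Xcode version:

import XCTest
import ObjectiveC

// Hypothetical principal class; XCTest instantiates it before building any test suite,
// which is early enough to install the swizzle.
final class ParallelTestingPrincipal: NSObject {
  override init() {
    super.init()
    Self.installSuiteNameSwizzle()
  }

  private static func installSuiteNameSwizzle() {
    // Assumption: XCTestSuite.init(forTestCaseWithName:) is backed by the ObjC
    // class factory method +testSuiteForTestCaseWithName:.
    guard
      let original = class_getClassMethod(
        XCTestSuite.self, NSSelectorFromString("testSuiteForTestCaseWithName:")),
      let swizzled = class_getClassMethod(
        XCTestSuite.self, #selector(XCTestSuite.swizzled_init(forTestCaseWithName:)))
    else { return }
    method_exchangeImplementations(original, swizzled)
  }
}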

Test class discovery

In order to adopt this tip, we need to know in advance which tests we are going to run. Although Apple does not provide an API to obtain this list before running tests, it can be done in several ways. One approach is to use Sourcery to generate the list of test classes. Another alternative is to parse the binaries inside the .xctest bundles (in the build products) and look for symbols related to tests.
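
As a rough illustration of yet another lightweight option (our own naive sketch, not the exact tooling described above), the UI test sources can be scanned for test classes and test methods with simple regular expressions; Sourcery templates or parsing the .xctest binaries would be more robust:

import Foundation

// Naively scan .swift files under a test target folder and collect "Class/testMethod" pairs.
func discoverTests(inSourcesAt rootPath: String) -> [String] {
  guard let enumerator = FileManager.default.enumerator(atPath: rootPath) else { return [] }

  let classPattern = try! NSRegularExpression(pattern: "class\\s+(\\w+Tests)\\b")
  let testPattern = try! NSRegularExpression(pattern: "func\\s+(test\\w+)\\s*\\(")

  var results: [String] = []
  for case let relativePath as String in enumerator where relativePath.hasSuffix(".swift") {
    let filePath = rootPath + "/" + relativePath
    guard let source = try? String(contentsOfFile: filePath, encoding: .utf8) else { continue }
    let fullRange = NSRange(source.startIndex..., in: source)

    // Take the first test class declared in the file (one class per file in our layout).
    guard let classMatch = classPattern.firstMatch(in: source, range: fullRange),
          let classRange = Range(classMatch.range(at: 1), in: source) else { continue }
    let className = String(source[classRange])

    for match in testPattern.matches(in: source, range: fullRange) {
      if let testRange = Range(match.range(at: 1), in: source) {
        results.append("\(className)/\(source[testRange])")
      }
    }
  }
  return results
}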

Conclusion

In this article, we identified some factors causing test execution time imbalance in Xcode parallel testing (particularly for UI tests).

We also looked into how Xcode distributes tests in parallel testing and introduced a trick to mitigate the constraint that tests within the same class run on the same simulator. The trick not only reduces the imbalance but also gives us more confidence to add more tests to a class without worrying about the impact on our CI infrastructure.

Below is the test time imbalance metric recorded when running UI tests. After adopting the trick, we saw a decrease in the metric (a good sign). As of now, the metric stabilizes at around 0.4 mins.

Tracking data of UI test time imbalance (in minutes) in our project, collected by multiple runs

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

February service disruptions

Post Syndicated from Keith Ballinger original https://github.blog/2020-02-28-february-service-disruptions/

Recently, we’ve had multiple service interruptions on GitHub.com. We know how important reliability of our service is for your projects and teams. We take this responsibility very seriously and apologize for these disruptions.

These incidents were distinct events, but have a common theme of uncovering new challenges in scaling our database tier. Specifically, increased load on our largest database cluster contributed to degradations across multiple services.  

In the coming weeks, we’ll follow up with a more in-depth and technical report of these events and the work we are doing to improve the scalability and performance of our backend systems.  

We have several data partitioning initiatives already in progress, and we’ll be rolling out some of this work very soon. You can follow our status page for updates about the availability of our systems. 

 

Check GitHub’s current status

The post February service disruptions appeared first on The GitHub Blog.

Returning 575 Terabytes of storage space back to our users

Post Syndicated from Grab Tech original https://engineering.grab.com/returning-storage-space-back-to-our-users

Have you ever run out of storage on your phone? Mobile phones come with limited storage, and with the proliferation of apps and large video files, many of you are running out of space.

In this article, we explain how we measure and reduce the storage footprint of the Grab App on a user’s device to help you overcome this issue.

The wakeup call

Android vitals (information provided by the Google Play Console about our app's performance) gives us two main pieces of information about storage footprint.

15.7% of users have less than 1GB of free storage and they tend to uninstall more than other users (1.2x).

The proportion of 30-day active devices which reported less than 1GB of free storage, calculated as a 30-day rolling average.

Active devices with <1GB free space

This is the ratio of uninstalls on active devices with less than 1GB of free storage to uninstalls on all active devices, calculated as a 30-day rolling average.

Ratio of uninstalls on active devices with less than 1GB

Instrumentation to know where we stand

First things first, we needed to know how much space the Grab app occupies on a user's device, so we started with our personal devices. We can find this information by opening the phone settings and selecting the Grab app.

App Settings

For this device (screenshot), the application itself (installed binary) was 186 MB and the total footprint was 322 MB. Since these numbers vary a lot based on how the app is used, we needed to collect them directly from our users in production.

Disclaimer: We are only measuring files that are inside the internal Grab app folder (Cache/Database). We do NOT measure any file that is not inside the private Grab folder.

We decided to leverage our current implementation using the StorageManager API to gather the following information during each session launch:

  • Application Size (Installed binary size)
  • Cache folder size
  • Total footprint
Sample code to retrieve storage information on Android

Data analysis

We began analysing this data one month after our users updated their app and found that the cache size was anomalously large (>1GB) for a lot of users. Intrigued, we dug deeper.

We added code to log the largest files inside the cache folder, and we found that most of the files were inside a sub-cache folder that was no longer in use. This was due to a third-party library that had been removed from our app. We added a specific metric to track the size of this folder.

In the end, a lot of users still had this old cache data, and for some of them the amount was up to 1GB.

Root cause analysis

The Grab app relies a lot on third-party libraries. For example, Picasso was a library we used in the past for image display, which has since been replaced by Glide. Picasso uses a cache to store images and avoid making network calls again and again. After removing Picasso from the app, we didn't delete its cache folder on the user's device. We knew there would likely be more discontinued third-party libraries in the same situation, so we expanded our analysis to look at how other third-party libraries cached their data.

Freeing up space on user’s phone

Here comes the fun part. We implemented a cleanup mechanism to remove old cache folders: when users update the Grab app, any old cache folders that were there before are automatically removed. Through this, we freed up to 1GB of space in a second for some of our users. In total, we removed 575 terabytes of old cache data across more than 13 million devices (approximately 40MB per user on average).

Data summary

The following graph shows the total size of junk data (in terabytes) that we could potentially remove each day, calculated by summing up the maximum cache size reported when a user opens the Grab app each day.

The first half of the graph reflects the amount of junk data for the latest app version before auto-cleanup was activated. The second half shows a dramatic dip in junk data after auto-cleanup was activated. When we first started, we were deleting up to 33 terabytes of data per day from users' devices!

Sum of all junk data on users' devices reported per day, in terabytes

Next step

This is the first phase of our journey in reducing the storage footprint of our app on Android devices. We specifically focused on making improvements at scale, i.e. delivering large storage gains to the greatest number of users in the shortest time. In the next phase, we will look at more targeted improvements for specific groups of users that still have a high storage footprint. In addition, we are also reviewing iOS data to see if a round of cleanup is necessary.

Concurrently, we are also reducing the maximum size of cache created by some libraries. For example, Glide by default creates a cache of 250MB but this can be configured and optimised.

We hope you found this piece insightful and please remember to update your app regularly to benefit from the improvements we’re making every day. If you find that your app is still taking a lot of space on your phone, be assured that we’re looking into it.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Grab-Posisi – SouthEast Asia’s first comprehensive GPS trajectory dataset

Post Syndicated from Grab Tech original https://engineering.grab.com/grab-posisi

Introduction        

At Grab, thousands of bookings happen daily via the Grab app. Drivers' phones and GPS devices enable us to collect large-scale GPS trajectories.

Apart from the time and location of the object, GPS trajectories are also characterised by other parameters such as speed, heading, the area and distance covered during travel, and the travel time. The trajectory patterns mined from users' GPS data are thus a valuable source of information for a wide range of urban applications, such as solving transportation problems, predicting traffic, and developing sound urban plans.

Currently, it’s a herculean task to create and maintain GPS datasets since doing so is costly and laborious. As a result, most of the GPS datasets available today have poor coverage or contain outdated information. They cover only a small area of a city, have low sampling rates, and contain little contextual information about the GPS pings, such as accuracy level, bearing, and speed. Despite over a dozen mapping communities being engaged in collecting GPS trajectory datasets, a significant amount of effort is still required for data cleaning and pre-processing in order to utilise them.

To overcome the shortfalls in the existing datasets, we built Grab-Posisi, the first GPS trajectory dataset of Southeast Asia (the term Posisi means position in Bahasa). The data was collected from Grab drivers’ phones while in transit. By tackling the addition of major arterial roads in regions where existing maps have poor coverage, and the incremental improvement of coverage in regions where major roads are already mapped, Posisi substantially improves mapping productivity.

What’s inside the dataset

The whole Grab-Posisi dataset contains in total 84K trajectories that consist of more than 80 million GPS pings and cover over 1 million km. The average trajectory length is 11.94 km and the average duration per trip is 21.50 minutes.

The data was collected very recently, in April 2019, with a 1-second sampling rate, which is the highest amongst all publicly available datasets. It also has richer contextual information, including the accuracy level, bearing and speed. The accuracy level is important because GPS measurements are noisy, and the true location can be anywhere inside a circle centred at the reported location with a radius equal to the accuracy level. The bearing is the horizontal direction of travel, measured in degrees relative to true north. Finally, the speed is reported in meters/second over ground.

As the GPS trajectories were collected from Grab drivers’ phones while in transit, we labelled each trajectory with the phone device type, either Android or iOS. This is the first dataset to differentiate such device information. Furthermore, we also label the trajectories by driving mode (Car or Motorcycle).

All drivers’ personal information is encrypted and the real start/end locations are removed within the dataset.

Data format

Each trajectory is serialised in a file in Apache Parquet format. The whole dataset size is around 2 GB. Each GPS ping is associated with values for a trajectory ID, latitude, longitude, timestamp (UTC), accuracy level, bearing and speed. The GPS sampling rate is 1 second, which is the highest among all the existing open source datasets. Table 1 shows a sample of the dataset.

Table 1: Sample dataset

Coverage

Figure 1a shows the spatial coverage of the dataset in Singapore. Compared with the GPS datasets available in the market that only cover a specific area of a city, the Grab-Posisi dataset encompasses almost the whole island of Singapore. Figure 1b depicts the GPS density in Singapore. Red represents high density while green represents low density. Expressways in Singapore are clearly visible because of their dense GPS pings.

Figure 1a. Spatial coverage (Singapore)
Figure 1b. GPS density (highways have more GPS)

Figure 2a illustrates that the Grab-Posisi dataset encloses not only central Jakarta but also extends to external highways. Figure 2b depicts the GPS density of cars in Jakarta. Compared with Singapore, trips in Jakarta are spread out in all different areas, not just concentrated on highways.

Figure 2a. Spatial coverage (Jakarta)
Figure 2b. GPS density (Car)

Applications of Grab-Posisi

The following are some of the applications of the Grab-Posisi dataset.

On Map Inference

The traditional method of updating road networks in maps is time-consuming and labour-intensive. That’s why maps might be missing important roads and real-time traffic conditions might be unavailable. To address this problem, we can use GPS trajectories to reconstruct road networks automatically.

A number of map inference algorithms can be applied to infer both map topology and road attributes. Figure 3b shows a snippet of the map inferred from our GPS trajectories (Figure 3a) using one of these algorithms. As you can see from the blue dots, the skeleton of the underlying map is inferred correctly, although some sections of the inferred road are disconnected, and the roundabout in the bottom right corner is not a smooth curve.

Figure 3a. Raw GPS trajectories
Figure 3b. Inferred Map

On Map Matching                                         

Map matching refers to the task of automatically determining the correct route a driver has travelled on a digital map, given a sequence of raw and noisy GPS points. Correcting raw GPS data is important for many location-based applications such as navigation, tracking, and road attribute detection, as mentioned earlier. The accuracy levels provided in the Grab-Posisi dataset can be of great use in addressing this issue.

On Traffic Detection and Forecast                         

In addition to the inference of a static digital map, the Grab-Posisi GPS dataset can also be used to perform real-time traffic forecasting, which is very important for congestion detection, flow control, route planning, and navigation. Some examples of the fundamental indicators that are mostly used to monitor the current status of traffic conditions include the average speed, volume, and density in each road segment. These variables can be computed based on drivers’ GPS trajectories and can be used to predict the future traffic conditions.

On Mode Detection                         

Transportation mode detection refers to the task of identifying the travel mode of a user (examples of transportation modes include walking, cycling, car, bus, etc.). The GPS trajectories in our dataset are associated with rich attributes including GPS accuracy, bearing, and speed in addition to the latitude and longitude of the geo-coordinates, which can be used to develop mode detection models. Our dataset also labels each trajectory as collected from a car or a motorcycle, which can be used to verify the performance of those models.

Economics Perspective                                         

The real-world GPS trajectories of people reveal realistic travel patterns and demands, which can be of great help for city planning. As there are some realistic constraints faced by governments such as budget limitations and construction inconvenience, it is important to incorporate both the planning authorities’ requirements and the realistic travel demands mined from trajectories for intelligent city planning. For example, the trajectories of cars can provide suggestions on how to schedule highway constructions. The trajectories of motorcycles can help the government to choose the optimal locations to construct motorcycle lanes for safety concerns.

Want to access our dataset?

The Grab-Posisi dataset offers great value and is a significant resource for the community to benchmark and revisit existing technologies.

If you want to access our dataset for research purposes, email [email protected] with the following details:

  • Your Name and contact details
  • Your institution
  • Your potential usage of the dataset

When using the Grab-Posisi dataset, please cite the following paper:

Huang, X., Yin, Y., Lim, S., Wang, G., Hu, B., Varadarajan, J., … & Zimmermann, R. (2019, November). Grab-Posisi: An Extensive Real-Life GPS Trajectory Dataset in Southeast Asia. In Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Prediction of Human Mobility (pp. 1-10). DOI: https://doi.org/10.1145/3356995.3364536

Click here to download the published paper.

Click here to download the BibTex file.

Note: You cannot use Grab-Posisi dataset for commercial purposes.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Automating MySQL schema migrations with GitHub Actions and more

Post Syndicated from Shlomi Noach original https://github.blog/2020-02-14-automating-mysql-schema-migrations-with-github-actions-and-more/

In the past year, GitHub engineers shipped GitHub Packages, Actions, Sponsors, Mobile, security advisories and updates, notifications, code navigation, and more. Needless to say, the development pace at GitHub is accelerated.

With MySQL serving our backends, updating code requires changes to the underlying database schema. New features may require new tables, columns, changes to existing columns or indexes, dropping unused tables, and so on. On average, we have two schema migrations running daily on our production servers. Some days we have half a dozen migrations to run. We’ll cover how this amounted to significant toil for the database infrastructure team, and how we searched for a solution to automate the manual parts of the process.

What’s in a migration?

At first glance, migrating appears to be no more difficult than adding a CREATE, ALTER or DROP TABLE statement. At a closer look, the process is far more complex, and involves multiple owners, platforms, environments, and transitions between those pieces. Here’s the flow as we experience it at GitHub:

1. Starting the process

It begins with a developer who identifies the need for a schema change. Maybe they need a new table, or a new column in an existing table. The developer has a local testing environment where they can experiment however they like, until they’re satisfied and wish to apply changes to production.

2. Feedback and review

The developer doesn’t just apply their changes online. First, they seek review and discussion with their peers. Depending on the change, they may ask for a review from a group of schema reviewers (at GitHub, this is a volunteer group experienced with database design). Then, they seek the agreement of the database infrastructure team, who owns the production databases. The database infrastructure team reviews the changes, looking for performance concerns, among other potential issues. Assuming all reviews are favorable, it’s on the database infrastructure engineer to deploy the change to production.

3. Taking the change to production

At this point, we need to determine where the change is taking place since we have multiple clusters. Some of them are sharded, so we have to ask: Where do the affected tables exist in our clusters or schemas? Next, we need to know what to run. The developer presented the schema they want to see in production, but how do we transition the existing production schema into the one requested? What’s the formal CREATE, ALTER or DROP statement? Following what to run, we need to know how we should run the migration. Do we run the query directly? Or is it a blocking operation and we need an online schema change tool? And finally, we need to know when to execute the migration. Perhaps now is not a good time if there’s already a migration running on the cluster.

4. Migration

At long last, we’re ready to run the migration. Some of our larger tables may take hours and even days to migrate, especially since the site needs to be up and running. We want to track status. And we want to see what impact the migration may have on production, or, preferably, to ensure it does not have an impact.

5. Completing the process

Even as the migration completes there are further steps to take. There’s some cleanup process, and we want to unblock the next migration, if any currently exists. The database infrastructure team wishes to advertise to the developer that the changes have taken place, and the developer will have their own followup to address.

Throughout that flow, there’s a lot of potential for friction:

  • Does the database infrastructure team review the developer’s request in a timely fashion?
  • Is the review process productive?
  • Do we need to wait for something before running the migration?
  • Is the database infrastructure engineer actually available to run the migration, or perhaps they’re busy with other tasks?

The database infrastructure engineer needs to either create or review the migration statement, double-check their logic, ensure they can begin the migration, follow up, unblock other migrations as needed, advertise progress to the developer, and so on.

With our volume of daily migrations, this flow sometimes consumed hours of a database infrastructure engineer's time per day, and, in the best-case scenario, at least several hours of work per week. They would frequently multitask between two or three migrations and keep mental notes for next steps. Developers would ping us to ask what the status was, and their work was sometimes blocked until the migration was complete.

A brief history of schema migration automation at GitHub

GitHub was originally created as a Ruby on Rails (RoR) app. Like other frameworks, and in particular, those using Active Record, RoR has a built-in mechanism to generate database schema from code, as well as programmatically express migrations. RoR tooling can analyze code changes and create and run the SQL statements to change the database schema.

We use the GitHub flow to manage our own development: when suggesting a change, we create a branch, commit, push, and open a pull request. We use the declarative approach to schema definition: our RoR GitHub repository contains the full schema definition, such as the CREATE TABLE statements that generate the complete schema. This way, we know exactly what schema is associated with each commit or branch. Contrast that with the programmatic approach, where your commits contain migration statements, and where, to deduce a schema, you need to start at some baseline and run through all the statements sequentially.

The database infrastructure and the application teams collaborated to create a set of chatops tooling. We ran a chatops command to list pull requests with schema changes, and then another command to generate the CREATE/ALTER/DROP statement for a given pull request. For this, we used RoR’s rake command. Our wrapper scripts then added meta information, like which cluster is involved, and generated a script used to run the migration.

The generated statements and script were mostly fine, with occasional SQL syntax errors. We’d review the output and fix it manually as needed.

A few years ago we developed gh-ost, an online table migration solution, which added even more visibility and control through our chatops. We’d check progress, change runtime configuration, and cut over the migration through chat. While simple, these were still manual steps.

The heart of GitHub’s app remains the same RoR codebase, but we’ve expanded far beyond it. We created more repositories; some also use RoR, while others are written in other programming languages such as Go. However, we didn’t use an Object Relational Mapping practice with the new repositories.

As GitHub expanded, so did the toil on the database infrastructure team. We’d review pull requests, compare schemas, generate migration statements manually, and verify on a local machine. Other than the git log, no formal tracking for schema migrations existed. We’d check in chat, issues, and pull requests to see what was done and what wasn’t. We’d keep track of ongoing migrations in our heads, context switch between migrations throughout the day, and get interrupted by notifications frequently. And we did this while taking each migration through the next step, keeping mental notes, and communicating progress to our peers.

With these steps in mind, we wanted a solution to automate the process. We came up with various ideas, and in 2019 GitHub Actions was released. This was our solution: multiple loosely coupled components, each owning a specific aspect of the flow, all orchestrated by a controller service. The next section covers the breakdown of our solution.

Code

Our basic premise is that schema design should be treated as code. We want the schema to be versioned, and we want to know what schema is associated and with what version of our code.

To illustrate, GitHub provides not only github.com, but also GitHub Enterprise, an on-premise solution. On github.com we run continuous deployments. With GitHub Enterprise, we make periodic releases, and our customers can upgrade in-house. This means we need to be able to reproduce any schema changes we make to github.com on a customer’s Enterprise server.

Therefore we must keep our schema design coupled with the code in the same git repository. For a developer to design a schema change, they need to follow our normal development flow: create a branch, commit, push, and open a pull request. The pull request is where code is reviewed and discussion takes place for any changes. It’s where continuous integration and testing run. Our solution revolves around the pull request, and this is standardized across all our repositories.

The change

Once a pull request is opened, we need to be able to identify what changes we’d like to make. Typically, when we review code changes, we look at the diff. And it might be tempting to expect that git diff can help us formalize the schema change. Unfortunately, this is not the case, and git diff is poor at identifying these changes. For example, consider this simplified table definition:

CREATE TABLE some_table (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  hostname varchar(128) NOT NULL,
  PRIMARY KEY (id),
  KEY (hostname)
);

Suppose we decide to add a new column and drop the index on hostname. The new schema becomes:

CREATE TABLE some_table (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  hostname varchar(128) NOT NULL,
  time_created TIMESTAMP NOT NULL,
  PRIMARY KEY (id)
);

Running git diff on the two schemas yields the following:

@@ -1,6 +1,6 @@
 CREATE TABLE some_table (
   id int(10) unsigned NOT NULL AUTO_INCREMENT,
   hostname varchar(128) NOT NULL,
-  PRIMARY KEY (id),
-  KEY (hostname)
+  time_created TIMESTAMP NOT NULL,
+  PRIMARY KEY (id)
 );

The pull request’s “Files changed” tab shows the same:

This is a sample Pull Request where we change a table's schema. git diff does a poor job of analyzing the schema change.

See how the PRIMARY KEY line goes into the diff because of the trailing comma. This diff does not capture the schema change well, and while RoR provides tooling for that, we still had to review its output carefully.

skeema

skeema is an open source schema management utility developed by Evan Elias. It expects the declarative approach, and looks for a schema definition on your file system (hopefully as part of your repository). The file system layout should include a directory per schema/database, a file per table, and then some special configuration files telling skeema the identities of, and the credentials for, MySQL servers in various environments. Skeema is able to run useful tasks, such as:

  • skeema diff: generate SQL statements that convert the existing database schema into the schema defined in the file system. This includes as many CREATE, ALTER and DROP TABLE statements as needed.
  • skeema push: actually apply changes to the database server so that the schema matches the one on the file system.
  • skeema pull: rewrite the filesystem schema based on the existing schema in the MySQL server.

skeema can do much more, including the ability to invoke online schema change tools—but that’s outside this post’s scope.

Git users will feel comfortable with skeema. Indeed, skeema works very well with git-versioned schemas. For us, the most valuable asset is its diff output: a well formed, reliable set of statements to show the SQL transition from one schema to another. For example, skeema diff output for the above schema change is:

USE `test`;
ALTER TABLE `some_table` ADD COLUMN `time_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, DROP KEY `hostname`;

Note that the above is not only correct, but also formal. It is reproduced correctly whether our code uses lower or upper case, includes or omits default values, and so on.

We wanted to use skeema to tell us what statements we needed to run to get from our existing state into the state defined in the pull request. Assuming the master branch reflects our current production schema, this now becomes a matter of diffing the schemas between master and the pull request’s branch.

Skeema wasn’t without its challenges, and we had to figure out where to place skeema from a design perspective. Do the developers own it? Does every repository own it? Is there a central service to own it? Each presented its own problems, from false ownership to excessive responsibilities and access.

GitHub Actions

Enter GitHub Actions. With Actions, you’re able to run code as a response to events taking place in your repository. A new pull request, review, comment, issue, and quite a few others, are such events. The code (the action) is arbitrary, and GitHub spawns a container on its own infrastructure, where your code will run. What makes this extra interesting is that the container can get access to your repository. GitHub Actions implicitly receives an API token to interact with the repository.

The container comes with popular software packages pre-installed, such as a MySQL server.

Perhaps the most classic use of Actions is CI/CD: when a pull_request event occurs (a new pull request and any subsequent commit), run some code to build, test, lint, or validate the change. We took this approach to run skeema as part of a pull_request action flow, called skeema-diff.

Here’s a simplified breakdown of the action:

  1. Fetch skeema binary
  2. Checkout master branch
  3. Run skeema push to populate the container’s MySQL server with the schema as defined by the master branch
  4. Checkout pull request’s branch
  5. Run skeema diff to generate the statements that take the schema from the one in MySQL (remember, this is the master schema) to the one in the pull request’s branch
  6. Add the diff as a comment in the pull request
  7. Add a special label to indicate this pull request has a schema change

The GitHub Action, running skeema, generates schema diff output, which is added as a comment to the Pull Request. The comment presents the correct ALTER statement implied by the code change. This comment is both human and machine readable.

The code is more complex than what we’ve shown. We actually use base and head instead of master and branch, and there’s some logic to formalize, edit and validate the diff, to handle commits that further change the schema, among other processes.

By now, we have a partial flow, which works entirely on GitHub’s platform:

  • Schema change as code
  • Review process, based on GitHub’s pull request flow
  • Automated schema change analysis, based on skeema running in a GitHub Action
  • A visible output, presented as a pull request comment

Up to this point, everything is constrained to the repository. The repository itself doesn’t have information about where the schema gets deployed in production. This information is something that’s outside the repository’s scope, and it’s owned by the database infrastructure team rather than the repository’s developers. Neither the repository nor any action running on that repository has access to production, nor should they, as that would be a breach of domains.

Before we describe how the schema gets to production, let’s jump ahead and discuss the schema migration itself.

Schema migrations and gh-ost

Even the simplest schema migration isn’t simple. We are concerned with three types of table migrations:

  • CREATE TABLE is the simplest and the safest. We created something that didn’t exist before, and its creation time is instantaneous. Note that if the target cluster is sharded, this must be applied on all shards. If the cluster is sharded with vitess, then the vitess vtgate service automatically handles this for us.
  • DROP TABLE is a simple statement that comes with a great risk. What if it’s still in use and some code breaks as a result of the table going away? Note that we don’t actually drop tables as part of schema migrations. Any DROP TABLE statement is converted into a RENAME TABLE. Instead of DROP TABLE repositories (whoops!), our automation runs RENAME TABLE repositories TO _repositories_DROP_20200101123456. If our application fails because of this, we have an instant revert command: RENAME back to the original. Renamed tables are kept around for a few days prior to being garbage collected and dropped by our automation.
  • ALTER TABLE is the most complex case, mainly because it takes time to alter a table. We don’t actually ALTER tables in-place. We use gh-ost to emulate an ALTER TABLE, and the end result is the same even though the process is completely different. It doesn’t lock our apps, throttles as much as needed, and it’s controllable as well as auditable. We’ve run gh-ost in production for over three and a half years. It has little to no impact on production, and we generally don’t care that it’s running. But some of our larger tables may still take hours or even days to migrate. We also only run one ALTER (or, gh-ost) at a time on a cluster. Concurrent migrations are possible but compete over resources, leading to overall longer runtimes than sequential execution. This means that an ALTER migration requires scheduling. We need to be able to tell if a migration is already running on a cluster, as well as prioritize and queue migrations that apply to the same cluster. We also need to be able to tell the status over the duration of hours or days, and this needs to be communicated to the developer, the owner of the change. And, if the cluster is sharded, we need to run the migration per shard.

In order to run a migration, we must first determine the strategy for that migration (is it a direct query, gh-ost, or manual?). We need to be able to tell where it can run, how to go about the process if the cluster is sharded, as well as when to schedule it. While migrations can wait in queue while others are running, we want to be able to prioritize migrations in case the queue is large.

skeefree

We created skeefree as the glue: an orchestrating service that’s aware of our repositories, can communicate with our pull requests, knows about production (or can get information about production), and invokes the migrations. We run skeefree as a stateless Kubernetes service, backed by a MySQL database that holds the state. Fittingly, skeefree’s own schema is managed by skeefree.

skeefree uses GitHub’s API to interact with pull requests, GitHub’s internal inventory and discovery services to locate clusters in production, and gh-ost to run migrations. skeefree is best described by following a schema migration flow:

  1. A developer wishes to change the schema, so they open a pull request.
  2. The skeema-diff Action springs to life and looks for a schema change. If a schema change isn’t found in the pull request, nothing happens. If there is a schema change, the Action computes the change via skeema, adds a well-formed comment to the pull request indicating the change, and adds a migration:skeema:diff label to the pull request. This is done via the GitHub API.
  3. A developer looks into the change and seeks review from a team member. At this time they may communicate with team members without actually going to production. Finally, they add the label migration:for:review.
  4. skeefree is aware of the developer’s repository and uses the GitHub API to periodically look for open pull requests, which are labeled by both migration:skeema:diff and migration:for:review, and have been approved by at least one developer.
  5. Once detected, skeefree investigates the pull request, and reads the schema change comment, generated by the Action. It maps the schema/repository to the schema/production cluster, and uses our inventory and discovery services to know if the cluster is sharded. Then, it finds the location and name of the cluster.
  6. skeefree then adds this to its backend database and advertises its analysis on the pull request with another comment. This comment generally means “here’s what I will do if you approve”. It then proceeds to seek review from an authority.
  7. For most repositories, the authority is the database-infrastructure team. On our original RoR repository, we also seek review from a cross-functional team, known as the db-schema-reviewers, who are familiar with the general application and database design throughout the years and who have more context to offer. skeefree automatically knows which teams should be notified on which repositories.
  8. The relevant teams review and hopefully approve. skeefree detects the approval and chooses the proper strategy: a direct query for CREATE and DROP (where DROP is converted to a RENAME), and gh-ost for ALTER. It then queues the migration(s).
  9. skeefree’s scheduler periodically checks what next can be executed. Remember we only run a single ALTER migration on a given cluster at a time, but we also have a limited number of runner hosts. If there’s a free runner host and the cluster is not running any migration, skeefree then proceeds to kick off a migration. Skeefree advertises this fact as a pull request comment to notify the developer that the migration started.
  10. Once the migration is complete, skeefree announces it in a pull request comment. The same applies should the migration fail.
  11. The pull request may also have more than one migration. Perhaps the cluster is sharded, or there may be multiple tables changed in the pull request. Once all migrations are successfully completed, skeefree advertises this in a pull request comment. The developer is notified that all migrations are done, and they’re encouraged to proceed with their standard deploy/merge flow.

As skeefree runs the migrations, it adds comments on the pull request page to indicate its progress. When all migrations are complete, skeefree comments as much, again on the pull request page.

Analysis of the flow

There are a few nuances here that make for a good experience for everyone involved:

  • The database infrastructure team doesn’t know about the pull request until the developer explicitly adds the migration:for:review label. It’s like a draft pull request or a pull request that’s a work in progress, only this flag applies specifically to the schema migration flow. This allows the developer to use their preferred flow, and communicate with their team without interrupting the database infrastructure team or getting premature reviews.
  • The skeema analysis is contained within the repository, which means that no external service is required. The developer can check the diff result themselves.
  • The Action is the only part of the flow that looks at the code. Neither skeefree nor gh-ost look at the actual code, and they don’t need git access.
  • The database infrastructure team only needs to take a single step, which is review the pull request.
  • The developers own the creation of pull requests, getting peer reviews, and finally, deploying and merging. These are the exact operations that should be under their ownership. Moreover, they get visibility into the state of their migration. By looking at the pull request page or their GitHub notifications, they can tell whether the pull request has been reviewed, queued, started, completed, or failed. They don’t need to ask. Even better, we have chatops that give visibility into the overall state of migration queue, a running migration’s progress, and more. These chatops are available for all to invoke.
  • The database infrastructure team owns the process of mapping the repository schema to production. This is done via chatops, but can also be completed via configuration. The team is able to cancel a pull request, retry a failed migration, and more.
  • gh-ost is generally trusted, and we have control over a running migration. This means that we can force it to throttle, set a different throttle threshold, make it use fewer resources, or terminate it, if needed. We also have a throttling mechanism throughout our stack, so that long-running processes like migrations yield to higher priority operations; this extends their runtime but avoids generating too much load on our database servers.
  • We use our own preferred pull request flow, Actions (skeefree was an early adopter of Actions), the GitHub API, and our existing datacenter and database infrastructure, all of which are well understood internally.

Public availability

skeefree and the skeema-diff Action were authored internally at GitHub to solve a specific problem. skeefree uses our internal inventory and discovery services, it works with our chatops and uses some internal libraries.

Our experience in releasing open source software is that no one’s use case is exactly the same as ours. Our perception of an automated migrations flow may be very different from another organization’s perception. We still want to share more than just our words, so we’ve open sourced the code.

It’s a bit of a peculiar OSS release:

  • It’s missing some libraries; it will not build.
  • It expects some of our internal services to exist, which more than likely won’t be on your platform.
  • It expects chatops, and you may not be using chatops.
  • The code also needs to be rewritten to adapt to your environment.

Note that the code is available, but not open for issues and pull requests. We hope the community finds it useful.

Get the code

The post Automating MySQL schema migrations with GitHub Actions and more appeared first on The GitHub Blog.