Tag Archives: Engineering

How to get your organization started with containerized deployments

Post Syndicated from Sarah Khalife original https://github.blog/2020-10-15-how-to-get-your-organization-started-with-containerized-deployments/

This is our second post on cloud deployment with containers. Looking for more? Join our upcoming GitHub Actions webcast with Sarah, Solutions Engineer Benedict Oleforo, and Senior Product Manager Kayla Ngan on October 22.

In the past few years, businesses have moved towards cloud-native operating models to help streamline operations and move away from costly infrastructure. When running applications in dynamic environments with Docker, Kubernetes, and other tooling, a container becomes the tool of choice as a consistent, atomic unit of packaging, deployment, and application management. This sounds straightforward: build a new application, package it into containers, and scale elastically across the infrastructure of your choice. Then you can automatically update with new images as needed and focus more on solving problems for your end users and customers.

However, organizations don’t work in vacuums. They’re part of a larger ecosystem of customers, partners, and open source communities, with unique cultures, existing processes, applications, and tooling investments in place. This adds new challenges and complexity for adopting cloud native tools such as containers, Kubernetes, and other container schedulers.

Challenges for adopting container-based strategies in organizations

At GitHub, we’re fortunate to work with many customers on their container and DevOps strategy. When it comes to adopting containers, there are a few consistent challenges we see across organizations.

  • Containerizing and maintaining applications: Most organizations have existing applications and need to make the decision about whether to keep them as-is, or to place them in containers for an easier transition to the cloud. Even then, teams need to determine whether a single container for the application is appropriate (in a lift-and-shift motion to the cloud), or if more extensive work is needed to break it down into multiple services, delivered as a set of containers.
  • Efficiently configuring and managing permissions: Adopting containers often translates to better collaboration for everyone in your organization. DevOps is now more than just core developers and IT operators. It includes release and infosec engineers, data scientists, QA, project managers, and other roles. But collaborating across multiple teams introduces new needs for configuring and managing permissions for code, along with the automation to support it.
  • Standardizing best practices across the organization: Containers help teams scale and integrate quickly, but may also require updating your CI/CD practices to match. You have to validate that they work well for existing applications, while incorporating the correct user and package permissions and policies. The best practices you set have to be flexible for others too. Individual teams—who are transitioning to new ways of working—need to be able to optimize for their own goals.

Connecting teams and cloud-native tools with GitHub

Despite these challenges, more and more organizations continue to adopt containers and Kubernetes. Stepping over those hurdles allows enterprises to automate and streamline their operations, with support from package managers and CI/CD tools. At GitHub, we’ve introduced container support in GitHub Packages, CI/CD through GitHub Actions, and partnered within the ecosystem to simplify cloud-native workflows. Finding the right container tools should mean less work, not more—easily integrating alongside other tools, projects, and processes your organization already uses.

See container best practices in action

Want to simplify container deployments in your organization? Join me, Solutions Engineer Benedict Oleforo, and Senior Product Manager Kayla Ngan on October 22 to learn more about successfully adopting containers. We’ll walk through how to use them in the real world and demo best practices for deploying an application to Azure with GitHub Container Registry.

When
October 22, 2020
11:00 am PT / 2:00 pm ET

Sign up for the webcast.

Optimally scaling Kafka consumer applications

Post Syndicated from Grab Tech original https://engineering.grab.com/optimally-scaling-kafka-consumer-applications

Earlier this year, we took you on a journey on how we built and deployed our event sourcing and stream processing framework at Grab. We’re happy to share that we’re able to reliably maintain our uptime and continue to service close to 400 billion events a week. We haven’t stopped there though. To ensure that we can scale our framework as the Grab business continuously grows, we have invested effort in optimizing our infrastructure.

In this article, we will dive deeper into our Kubernetes infrastructure setup for our stream processing framework. We will cover why and how we focus on optimal scalability and availability of our infrastructure.

Quick Architecture Recap

Coban Platform Architecture

The Coban platform provides lightweight, Golang plugin-based data processing pipelines running in Kubernetes. These are essentially Kafka consumer pods that consume data, process it, and then materialize the results into various sinks (RDBMS, other Kafka topics).

Anatomy of a Processing Pod

Anatomy of a Processing Pod

Each stream processing pod (the smallest unit of a pipeline’s deployment) has three top-level components, sketched in code after this list:

  • Trigger: An interface that connects directly to the source of the data and converts it into an event channel.
  • Runtime: This is the app’s entry point and the orchestrator of the pod. It manages the worker pools, triggers, event channels, and lifecycle events.
  • Pipeline plugin: This is provided by the user, and conforms to a contract that the platform team publishes. It contains the domain logic for the pipeline and houses the pipeline orchestration defined by a user based on our Stream Processing Framework.
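
As a rough illustration of how these pieces fit together, here is a minimal Go sketch. The interface and type names are assumptions made for this example and are not Coban’s actual plugin contract.

package coban

import "context"

// Event is an illustrative envelope for a single record read from the source.
type Event struct {
    Key   []byte
    Value []byte
}

// Trigger connects to the data source (for example, a Kafka topic) and
// exposes it as a channel of events.
type Trigger interface {
    Start(ctx context.Context) (<-chan Event, error)
    Stop() error
}

// Pipeline is the user-provided plugin that holds the domain logic.
type Pipeline interface {
    Process(ctx context.Context, e Event) error
}

// Runtime is the pod's entry point: it wires the trigger to a worker pool
// and drives the pipeline plugin for every event.
type Runtime struct {
    trigger  Trigger
    pipeline Pipeline
    workers  int
}

func (r *Runtime) Run(ctx context.Context) error {
    events, err := r.trigger.Start(ctx)
    if err != nil {
        return err
    }
    defer r.trigger.Stop()

    for i := 0; i < r.workers; i++ {
        go func() {
            for e := range events {
                _ = r.pipeline.Process(ctx, e) // error handling elided for brevity
            }
        }()
    }
    <-ctx.Done()
    return ctx.Err()
}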

Optimal Scaling

We initially architected our Kubernetes setup around horizontal pod autoscaling (HPA), which scales the number of pods per deployment based on CPU and memory usage. HPA keeps the per-pod CPU and memory at the values specified in the deployment manifest, and scales the number of pods horizontally as the load changes.

These were the areas of resource wastage we observed across applications on our platform:

  • As Grab’s traffic is uneven, we’d always have to provision for peak traffic. As users would not (or could not) always account for ramps, they would be fairly liberal with setting limit values (CPU and memory), leading to resource wastage.
  • Pods often had uneven traffic distribution despite fairly even partition load distribution in Kafka. The Stream Processing Framework (SPF) is essentially a set of Kafka consumers consuming from Kafka topics, so as the number of pods scaled in and out, the partition load per pod became unequal.

Vertically Scaling with Fixed Number of Pods

We initially kept the number of pods for a pipeline equal to the number of partitions in the topic the pipeline consumes from. This ensured even distribution of partitions to each pod providing balanced consumption. In order to abstract this from the end user, we automated the application deployment process to directly call the Kafka API to fetch the number of partitions during runtime.
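
The lookup itself is straightforward. The sketch below shows one way to do it in Go with the open source sarama Kafka client; the broker address and topic name are placeholders, and this is an illustration rather than our actual deployment tooling.

package main

import (
    "fmt"
    "log"

    "github.com/Shopify/sarama"
)

// partitionCount returns the number of partitions for a topic, so the
// deployment tooling can set the pipeline's pod count to match it.
func partitionCount(brokers []string, topic string) (int, error) {
    client, err := sarama.NewClient(brokers, sarama.NewConfig())
    if err != nil {
        return 0, err
    }
    defer client.Close()

    partitions, err := client.Partitions(topic)
    if err != nil {
        return 0, err
    }
    return len(partitions), nil
}

func main() {
    n, err := partitionCount([]string{"broker-1:9092"}, "bookings-events")
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println("desired pod count:", n)
}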

After achieving a fixed number of pods for the pipeline, we wanted to move away from HPA. We wanted our pods to scale up and down as the load increases or decreases without any manual intervention. Vertical pod autoscaling (VPA) solves this problem as it relieves us from any manual operation for setting up resources for our deployment.

We just deploy the application and let VPA handle the resources required for its operation. VPA is known to react slowly to sudden load changes, as it trains its model on the deployment’s load trend over a period of time before recommending optimal resources. This process ensures optimal resource allocation for our pipelines based on their historic throughput trends.

We saw a ~45% reduction in our total resource usage versus resources requested after moving from HPA to VPA with a fixed number of pods.


Managing Availability

We broadly classify our workloads as latency sensitive (critical) and latency tolerant (non-critical). As a result, we could optimize scheduling and cost efficiency using priority classes and overprovisioning on heterogeneous node types on AWS.

Kubernetes Priority Classes

The main cost of running EKS in AWS is attributed to the EC2 machines that form the worker nodes for the Kubernetes cluster. Running On-Demand brings all the guarantees of instance availability but it is definitely very expensive. Hence, our first action to drive cost optimisation was to include Spot instances in our worker node group.

Because a Spot instance can be reclaimed at any time, we started assigning a priority to each of our applications. We then let the user choose the priority of their pipeline depending on their use case. Different priorities map to node affinities for different kinds of instance groups (On-Demand/Spot). For example, critical pipelines (latency sensitive) run on On-Demand worker node groups, and non-critical pipelines (latency tolerant) run on Spot instance worker node groups.

We use priority classes both as a method of preemption and, combined with node affinity, to steer pipelines of a given priority onto the appropriate node group.

Overprovisioning

With Spot instances in the mix, we needed our cluster to respond quickly to failures. We wanted evicted pods to be rescheduled fast, so we added overprovisioning to our cluster. This means we keep some low-priority no-op pods running in our worker node groups to reserve free space, so that evicted or newly deployed pods can be scheduled quickly.

The overprovisioned pods have the lowest priority, so they can be preempted by any pod waiting in the scheduling queue. We used the cluster proportional autoscaler to decide the right number of these overprovisioned pods; it scales them up and down in proportion to cluster size (i.e., the number of nodes and CPUs in the worker node group). This relieves us from retuning the number of no-op pods as the cluster scales, keeping the reserved free space proportional to current cluster capacity.

Lastly, overprovisioning also improved deployment time, because we no longer depend on the time required for an Auto Scaling Group (ASG) to add a new node to the cluster every time we deploy a new application.

Future Improvements

Evolution is an ongoing process. In the next few months, we plan to work on custom resources for combining VPA with a fixed deployment size. Our current architecture setup works fine for now, but we would like to create a more tuneable in-house CRD (Custom Resource Definition) for VPA that also incorporates rightsizing our Kubernetes deployments horizontally.


Authored By Shubham Badkur on behalf of the Coban team at Grab – Ryan Ooi, Karan Kamath, Hui Yang, Yuguang Xiao, Jump Char, Jason Cusick, Shrinand Thakkar, Dean Barlan, Shivam Dixit, Andy Nguyen, and Ravi Tandon.


Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Testing cloud apps with GitHub Actions and cloud-native open source tools

Post Syndicated from Sarah Khalife original https://github.blog/2020-10-09-devops-cloud-testing/

See this post in action during GitHub Demo Days on October 16.

What makes a project successful? For developers building cloud-native applications, successful projects thrive on transparent, consistent, and rigorous collaboration. That collaboration is one of the reasons that many open source projects, like Docker containers and Kubernetes, grow to become standards for how we build, deliver, and operate software. Our Open Source Guides and Introduction to innersourcing are great first steps to setting up and encouraging these best practices in your own projects.

However, a common challenge that application developers face is manually testing against inconsistent environments. Accurately testing Kubernetes applications can differ from one developer’s environment to another, and implementing a rigorous and consistent environment for end-to-end testing isn’t easy. It can also be very time consuming to spin up and down Kubernetes clusters. The inconsistencies between environments and the time required to spin up new Kubernetes clusters can negatively impact the speed and quality of cloud-native applications.

Building a transparent CI process

On GitHub, integration and testing becomes a little easier by combining GitHub Actions with open source tools. You can treat Actions as the native continuous integration and continuous delivery (CI/CD) tool for your project, and customize your Actions workflow to include automation and validation as next steps.

Since Actions can be triggered based on nearly any GitHub event, it’s also possible to build in accountability for updating tests and fixing bugs. For example, when a developer creates a pull request, Actions status checks can automatically block the merge if the test fails.

Here are a few more examples:

Branch protection rules in the repository help enforce certain workflows, such as requiring more than one pull request review or requiring certain status checks to pass before allowing a pull request to merge.

GitHub Actions are natively configured to act as status checks when they’re set up to trigger `on: [pull_request]`.

Continuous integration (CI) is extremely valuable as it allows you to run tests before each pull request is merged into production code. In turn, this will reduce the number of bugs that are pushed into production and increases confidence that newly introduced changes will not break existing functionality.

But transparency remains key: Requiring CI status checks on protected branches provides a clearly-defined, transparent way to let code reviewers know if the commits meet the conditions set for the repository—right in the pull request view.

Using community-powered workflows

Now that we’ve thought through the simple CI policies, automated workflows are next. Think of an Actions workflow as a set of “plug and play” open sourced, automated steps contributed by the community. You can use them as they are, or customize and make them your own. Once you’ve found the right one, open sourced Actions can be plugged into your workflow with the `- uses: repo/action-name` field.

You might ask, “So how do I find available Actions that suit my needs?”

The GitHub Marketplace!

As you’re building automation and CI pipelines, take advantage of Marketplace to find pre-built Actions provided by the community. Examples of pre-built Actions range from publishing Docker images and installing the kubectl CLI to container scanning and cloud deployments. When it comes to cloud-native Actions, the list keeps growing as container-based development continues to expand.

Testing with kind

Testing is a critical part of any CI/CD pipeline, but running tests in Kubernetes can absorb the extra time that automation saves. Enter kind. kind stands for “Kubernetes in Docker.” It’s an open source project from a Kubernetes Special Interest Group (SIG) community, and a tool for running local Kubernetes clusters using Docker container “nodes.” Creating a kind cluster is a simple way to run Kubernetes cluster and application testing—without having to spin up a complete Kubernetes environment.

As the number of Kubernetes users pushing critical applications to production grows, so does the need for a repeatable, reliable, and rigorous testing process. This can be accomplished by combining the creation of a homogenous Kubernetes testing environment with kind, the community-powered Marketplace, and the native and transparent Actions CI process.

Bringing it all together with kind and Actions

Come see kind and Actions at work during our next GitHub Demo Day live stream on October 16, 2020 at 11am PT. I’ll walk you through how to easily set up automated and consistent tests per pull request, including how to use kind with Actions to automatically run end-to-end tests across a common Kubernetes environment.

Our Journey to Continuous Delivery at Grab (Part 1)

Post Syndicated from Grab Tech original https://engineering.grab.com/our-journey-to-continuous-delivery-at-grab

This blog post is a two-part presentation of the effort that went into improving the continuous delivery processes for backend services at Grab in the past two years. In the first part, we take stock of where we started two years ago and describe the software and tools we created while introducing some of the integrations we’ve done to automate our software delivery in our staging environment.


Continuous Delivery is the principle of delivering software often, every day.

As a backend engineer at Grab, nothing matters more than the ability to innovate quickly and safely. Around the end of 2018, Grab’s transportation and deliveries backend architecture consisted of roughly 270 services (the majority being microservices). The deployment process was lengthy, required careful inputs and clear communication. The care needed to push changes in production and the risk associated with manual operations led to the introduction of a Slack bot to coordinate deployments. The bot ensures that deployments occur only during off-peak and within work hours:

Overview of the Grab Delivery Process

Once the build was completed, engineers who desired to deploy their software to the Staging environment would copy release versions from the build logs, and paste them in a Jenkins job’s parameter. Tests needed to be manually triggered from another dedicated Jenkins job.

Prior to production deployments, engineers would generate their release notes via a script and update them manually in a wiki document. Deployments would be scheduled through interactions with a Slack bot that controls release notes and deployment windows. Production deployments were made once again by pasting the correct parameters into two dedicated Jenkins jobs, one for the canary (a.k.a. one-box) deployment and the other for the full deployment, spread one hour apart. During the monitoring phase, engineers would continuously observe metrics reported on our dashboards.

In spite of the fragmented process and risky manual operations impacting our velocity and stability, around 614 builds were running each business day and changes were deployed on our staging environment at an average rate of 300 new code releases per business day, while production changes averaged a rate of 28 new code releases per business day.

Our Deployment Funnel, Towards the End of 2018

These figures meant that, on average, it took 10 business days between each service update in production, and only 10% of the staging deployments were eventually promoted to production.

Automating Continuous Deployments at Grab

With an increased focus on engineering efficiency, in 2018 we started an internal initiative, which became known as Conveyor, to address friction in our deployments. To build Conveyor with a small team of engineers, we had to rely on an already mature platform that exhibited the properties we needed to achieve our mission.

Hands-off deployments

Deployments should be an afterthought. Engineers should be as removed from the process as possible, and whenever possible, decisions should be taken early, during the code review process. The machine will do the heavy lifting, and only when it can’t decide for itself, should the engineer be involved. Notifications can be leveraged to ensure that engineers are only informed when something goes wrong and a human decision is required.

Hands-off Deployment Principle

Confidence in Deployments

Grab’s focus on gathering internal Engineering NPS feedback helped us collect valuable metrics. One of the metrics we cared about was our engineers’ confidence in their production deployments. A team’s entire deployment process to production could last for more than a day and may extend up to a week for teams with large infrastructures running critical services. The possibility of losing progress in deployments when individual steps may last for hours is detrimental to the improvement of engineering efficiency in the organisation. The deployment automation platform is the bedrock of that confidence. If the platform itself fails regularly or does not provide an upgrade path that is transparent to end users, any features built on top of it would suffer from these downtimes and ultimately erode confidence in deployments.

Tailored To Most But Extensible For The Few

Our backend engineering teams are working on diverse stacks, and so are their deployment processes. Right from the start, we wanted our product to benefit the largest population of engineers that had adopted the same process, so as to maximize returns on our investments. To ease adoption, we decided to tailor a deployment pipeline such that:

  1. It would model the exact sequence of manual processes followed by this population of engineers.
  2. Switching to use that pipeline should require as little work as possible by service teams.

However, in cases where this model would not fit a team’s specific process, our deployment platform should be open and extensible and support new customizations even when they are not originally supported by the product’s ecosystem.

Cloud-Agnosticity

While we were going to target a specific process and team, to ensure that our solution would stand the test of time, we needed to ensure that our solution would support the variety of environments currently used in production. This variety was also likely to increase, and we wanted a platform that would mature together with the rest of our ecosystem.

Overview Of Conveyor

Setting Sail With Spinnaker

Conveyor is based on Spinnaker, an open-source, multi-cloud continuous delivery platform. We’ve chosen Spinnaker over other platforms because it is a mature deployment platform with no single point of failure, supports complex workflows (referred to as pipelines in Spinnaker), and already supports a large array of cloud providers. Since Spinnaker is open-source and extensible, it allowed us to add the features we needed for the specificity of our ecosystem.

To further ease adoption within our organization, we built a tailored user interface and created our own domain-specific language (DSL) to manage its pipelines as code.

Outline of Conveyor’s Architecture

Onboarding To A Simpler Interface

Spinnaker comes with its own interface, which has all the features an engineer would want from an advanced continuous delivery system. However, Spinnaker’s interface is vastly different from Jenkins and makes for a steep learning curve.

To reduce our barrier to adoption, we decided early on to create a simple interface for our users. In this interface, deployment pipelines take the center stage of our application. Pipelines are objects managed by Spinnaker; they model the different steps in the workflow of each deployment. Each pipeline is made up of stages that can be assembled like Lego bricks to form the final pipeline. An instance of a pipeline is called an execution.

Conveyor dashboard. Sensitive information like authors and service names are redacted.
Conveyor Dashboard

With this interface, each engineer can focus on what matters to them immediately: the pipeline they have started, or those started by other teammates working on the same services as they are. Conveyor also provides a search bar (on the top) and filters (on the left) that work in concert to explore all pipelines executed at Grab.

We adopted a consistent set of colours to model all information in our interface:

  • blue: stages that are currently running;
  • red: stages that have failed, or important information;
  • yellow: stages that require human interaction;
  • and finally, green: stages that were successfully completed.

Conveyor also provides a task and notifications area, where all stages requiring human intervention are listed in one location. Manual interactions are often no more than just YES or NO questions:

Conveyor tasks. Sensitive information like author/service names is redacted.
Conveyor Tasks

Finally, in addition to supporting automated deployments, we greatly simplified the start of manual deployments. Instead of being required to copy/paste information, each parameter can be selected on the interface from a set of predefined items, sorted chronologically, and presented with contextual information to help engineers in their decision.

Several parameters are required for our deployments and their values are selected from the UI to ensure correctness.

Simplified Manual Deployments

Ease Of Adoption With Our Pipeline-As-Code DSL

Ease of adoption for the team is not simply about the learning curve of the new tools. We needed to make it easy for teams to configure their services to deploy with Conveyor. Since we focused on automating tasks that were already performed manually, we needed only to configure the layer that would enable the integration.

We set out to create a pipeline-as-code implementation at a time when none were widely developed in the Spinnaker community. It’s interesting to see that, two years on, this idea has grown in parallel in the community, with the birth of other pipeline-as-code implementations. Our pipeline-as-code is referred to as the Pipeline DSL, and its configuration is located inside each team’s repository. Artificer is the name of our Pipeline DSL interpreter, and it runs with every change inside our monorepository:

Artificer: Our Pipeline DSL

Pipelines are updated on every commit, if necessary.

Creating a conveyor.jsonnet file inside the service’s directory of our monorepository with the few lines below is all that’s required for Artificer to do its work and get the benefits of automation provided by Conveyor’s pipeline:

local default = import 'default.libsonnet';
[
  {
    name: "service-name",
    group: [
      "group-name",
    ],
  },
]

Sample minimal conveyor.jsonnet configuration to onboard services.

In this file, engineers simply specify the name of their service and the group that a user should belong to, to have deployment rights for the service.

Once the build is completed, teams can log in to Conveyor and start manual deployments of their services with our pipelines. Three pipelines are provided by default: the integration pipeline used for tests and developments, the staging pipeline used for pre-production tests, and the production pipeline for production deployment.

Thanks to the simplicity of this minimal configuration file, we were able to generate these configuration files for all existing services of our monorepository. This resulted in the automatic onboarding of a large number of teams and was a major contributing factor to the adoption of Conveyor throughout our organisation.

Our Journey To Engineering Efficiency (for backend services)

The sections below relate some of the improvements in engineering efficiency we’ve delivered since Conveyor’s inception. They were not made precisely in this order but for readability, they have been mapped to each step of the software development lifecycle.

Automate Deployments at Build Time

Continuous Integration Job

Continuous delivery begins with a pushed code commit in our trunk-based development flow. Whenever a developer pushes changes onto their development branch or onto the trunk, a continuous integration job is triggered on Jenkins. The products of this job (binaries, Docker images, etc.) are all uploaded into our artefact repositories. We’ve made two additions to our continuous integration process.

The first modification happens at the step “Upload & Register artefacts”. At this step, each artefact created is now registered in Conveyor with its associated metadata. When and if an engineer needs to trigger a deployment manually, Conveyor can display the list of versions to choose from, eliminating the need for error-prone manual inputs:

Staging

Each selectable version shows contextual information: title, author, version and link to the code change where it originated. During registration, the commit time is also recorded and used to order entries chronologically in the interface. To ensure this integration is not a single point of failure for deployments, manual input is still available optionally.
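
For illustration, the metadata recorded at registration time could look something like the struct below. The field names here are assumptions made for this sketch, not Conveyor’s actual schema.

package conveyor

import "time"

// ArtefactRegistration is an illustrative shape for the metadata recorded at
// the "Upload & Register artefacts" step. Field names are assumptions.
type ArtefactRegistration struct {
    Service    string    `json:"service"`
    Version    string    `json:"version"`
    Title      string    `json:"title"`       // commit title shown in the UI
    Author     string    `json:"author"`
    ChangeURL  string    `json:"change_url"`  // link to the originating code change
    CommitTime time.Time `json:"commit_time"` // used to order entries chronologically
}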

The second modification implements one of the essential features of continuous delivery: your deployments should happen often, automatically. Engineers are now given the possibility to start automatic deployments once continuous integration has successfully completed, by simply modifying their project’s continuous integration settings:

 "AfterBuild": [
  {
      "AutoDeploy": {
      "OnDiff": false,
      "OnLand": true
    }
    "TYPE": "conveyor"
  }
 ],

Sample settings needed to trigger auto-deployments. ‘Diff’ refers to code review submissions, and ‘Land’ refers to merged code changes.

Staging Pipeline

Before deploying a new artefact to a service in production, changes are validated in the staging environment. During the staging deployment, we verify both canary (one-box) and full deployments with automated smoke and functional test suites.

Staging Pipeline

We start by acquiring a deployment lock for this service and this environment. This prevents another deployment of the same service in the same environment from happening concurrently; other deployments wait in a FIFO queue until the lock is released.

The stage “Compute Changeset” ensures that the deployment is not a rollback. It does so by comparing the ancestry of the commits provided during artefact registration at build time: since we automate deployments after the build process has completed, a rollback may occur when two changes are created in quick succession and the latest build completes earlier than the older one.
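
One simple way to express such an ancestry check is to lean on `git merge-base --is-ancestor`, which exits with 0 when the first commit is an ancestor of the second. The Go sketch below is an illustration of that idea, not Conveyor’s actual implementation.

package conveyor

import "os/exec"

// isRollback reports whether deploying newCommit would move the service
// backwards, i.e. newCommit is already an ancestor of the currently
// deployed commit.
func isRollback(repoDir, newCommit, deployedCommit string) (bool, error) {
    cmd := exec.Command("git", "merge-base", "--is-ancestor", newCommit, deployedCommit)
    cmd.Dir = repoDir
    err := cmd.Run()
    if err == nil {
        return true, nil // newCommit is already contained in the deployed history
    }
    if _, ok := err.(*exec.ExitError); ok {
        return false, nil // non-zero exit: not an ancestor, so not a rollback
    }
    return false, err // git itself failed (unknown commit, missing repo, ...)
}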

After the stage “Deploy Canary” has completed, smoke tests run. There are three kinds of tests executed at different stages of the pipeline: smoke, functional, and security tests. Smoke tests directly reach the canary instance’s endpoint, bypassing load balancers. If the smoke tests fail, the canary is immediately rolled back and the deployment is terminated.

All tests are generated from the same builds as the artefact being tested, and their versions must match during testing. To ensure that the right version of the tests runs, and to distinguish between the different kinds of tests to perform, we provide additional metadata that Conveyor passes to the test system, known internally as Gandalf:

local default = import 'default.libsonnet';
[
  {
    name: "service-name",
    group: [
      "group-name",
    ],
    gandalf_smoke_tests: [
      {
        path: "repo.internal/path/to/my/smoke/tests",
      },
    ],
    gandalf_functional_tests: [
      {
        path: "repo.internal/path/to/my/functional/tests",
      },
    ],
    gandalf_security_tests: [
      {
        path: "repo.internal/path/to/my/security/tests",
      },
    ],
  },
]

Sample conveyor.jsonnet configuration with integration tests added.

Additionally, in parallel to the execution of the smoke tests, the canary is also being monitored from the moment its deployment has completed and for a predetermined duration. We leverage our integration with Datadog to allow engineers to select the alerts to monitor. If an alert is triggered during the monitoring period, and while the tests are executed, the canary is again rolled back, and the pipeline is terminated. Engineers can specify the alerts by adding them to the conveyor.jsonnet configuration file together with the monitoring duration:

local default = import 'default.libsonnet';
[
  {
    name: "service-name",
    group: [
      "group-name",
    ],
    gandalf_smoke_tests: [
      {
        path: "repo.internal/path/to/my/smoke/tests",
      },
    ],
    gandalf_functional_tests: [
      {
        path: "repo.internal/path/to/my/functional/tests",
      },
    ],
    gandalf_security_tests: [
      {
        path: "repo.internal/path/to/my/security/tests",
      },
    ],
    monitor: {
      stg: {
        duration_seconds: 300,
        alarms: [
          {
            type: "datadog",
            alert_id: 12345678,
          },
          {
            type: "datadog",
            alert_id: 23456789,
          },
        ],
      },
    },
  },
]

Sample conveyor.jsonnet configuration with alerts in staging added.

When the smoke tests and monitor pass and the deployment of new artefacts is completed, the pipeline execution triggers functional and security tests. Unlike smoke tests, functional & security tests run only after that step, as they communicate with the cluster through load-balancers, impersonating other services.

Before releasing the lock, release notes are generated to inform engineers of the delta of changes between the version they just released and the one currently running in production. Once the lock is released, the stage “Check Policies” verifies that the parameters and variables of the deployment obey a specific set of criteria, for example: whether the service metadata is up-to-date in our service inventory, or whether the base image used during deployment is sufficiently recent.

Here’s how the policy stage, the engine, and the providers interact with each other:

Check Policy Stage

In Spinnaker, each event of a pipeline’s execution updates the pipeline’s state in the database. The current state of the pipeline can be fetched by its API as a single JSON document, describing all information related to its execution: including its parameters, the contextual information related to each stage or even the response from the various interfacing components. The role of our “Policy Check” stage is to query this JSON representation of the pipeline, to extract and transform the variables which are forwarded to our policy engine for validation. Our policy engine gathers judgements passed by different policy providers. If the validation by the policy engine fails, the deployment is not rolled back this time; however, promotion to production is not possible and the pipeline is immediately terminated.
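
As a rough sketch of what this stage does, the Go snippet below fetches the execution document, extracts a couple of variables, and asks a policy engine for a verdict. The endpoints, field names, and payload shape are assumptions made for this illustration; they are not Spinnaker’s or Conveyor’s actual APIs.

package conveyor

import (
    "bytes"
    "encoding/json"
    "fmt"
    "net/http"
)

// checkPolicies fetches the pipeline execution's JSON document, extracts the
// variables the policy engine cares about, and asks the engine for a verdict.
func checkPolicies(spinnakerURL, policyEngineURL, executionID string) (bool, error) {
    resp, err := http.Get(fmt.Sprintf("%s/pipelines/%s", spinnakerURL, executionID))
    if err != nil {
        return false, err
    }
    defer resp.Body.Close()

    var execution map[string]interface{}
    if err := json.NewDecoder(resp.Body).Decode(&execution); err != nil {
        return false, err
    }

    // Extract and transform only what the policy providers need (illustrative fields).
    input, err := json.Marshal(map[string]interface{}{
        "application": execution["application"],
        "trigger":     execution["trigger"],
    })
    if err != nil {
        return false, err
    }

    verdict, err := http.Post(policyEngineURL+"/validate", "application/json", bytes.NewReader(input))
    if err != nil {
        return false, err
    }
    defer verdict.Body.Close()
    return verdict.StatusCode == http.StatusOK, nil
}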

The journey through staging deployment finally ends with the stage “Register Deployment”. This stage registers that a successful deployment was made in our staging environment as an artefact. Similarly to the policy check above, certain parameters of the deployment are picked up and consolidated into this document. We use this kind of artefact as proof for upcoming production deployment.

Continuing Our Journey to Engineering Efficiency

With the advancements made in continuous integration and deployment to staging, Conveyor has reduced the efforts needed by our engineers to just three clicks in its interface, when automated deployment is used. Even when the deployment is triggered manually, Conveyor gives the assurance that the parameters selected are valid and it does away with copy/pasting and human interactions across heterogeneous tools.

In the sequel to this blog post, we’ll dive into the improvements that we’ve made to our production deployments and introduce a crucial concept that led to the creation of our proof for successful staging deployment. Finally, we’ll cover the impact that Conveyor had on the continuous delivery of our backend services, by comparing our deployment velocity when we started two years ago versus where we are today.


All these improvements in efficiency for our engineers would never have been possible without the hard work of all team members involved in the project, past and present: Evan Sebastian, Tanun Chalermsinsuwan, Aufar Gilbran, Deepak Ramakrishnaiah, Repon Kumar Roy (Kowshik), Su Han, Voislav Dimitrijevikj, Qijia Wang, Oscar Ng, Jacob Sunny, Subhodip Mandal, and many others who have contributed and collaborated with them.


Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Uncovering the truth behind Lua and Redis data consistency

Post Syndicated from Grab Tech original https://engineering.grab.com/uncovering-the-truth-behind-lua-and-redis-data-consistency

Uncovering the truth behind Lua and Redis data consistency

Our team at Grab uses Redis as one of our message queues. The Redis server is deployed in a master/replica setup. Recently, we noticed a spike in the CPU usage of the Redis replicas every time we deploy our service, even when the replicas are not in use and there’s no read traffic to them. However, the issue is resolved once we reboot the replica.

Because a reboot of the replica fixes the issue every time, we thought that it might be due to some ElastiCache replication issue and didn’t pursue it further. However, a recent Redis failover brought this to our attention again. After the failover, the problematic replica became the new master and its CPU immediately went to 100% under the read traffic, which essentially means the cluster was not functional after the failover. This time we investigated the issue with new vigour. What we found in our investigation led us to deep dive into the details of Redis replication and its implementation of Hash.

Did you know that Redis master/replica can become inconsistent in certain scenarios?

Did you know the encoding of Hash objects on the master and the replica can differ, even if the write operations are exactly the same and in the same order? Read on to find out why.

The problem

The following graph shows the CPU utilization of the master vs. the replica immediately after our service is deployed.

CPU Utilization

From the graph, you can see the following trends in the replica’s CPU usage:

  • Increases immediately after our service is deployed.
  • Spikes higher than the master after a certain time.
  • Gets back to normal after a reboot.

Cursory investigation

Because the spike occurs only when we deploy our service, we scrutinised all the scripts that were triggered immediately after the deployment. The Lua monitor script was identified as a possible suspect. The script redistributes inactive service instances’ messages in the queue to active service instances so that messages can be processed by other healthy instances.

We ran a few experiments related to the Lua monitor script using the Redis MONITOR command to compare the script’s behaviour on the master and the replica. As a side note: because this command causes performance degradation, use it with discretion. Coming back to the script, we were surprised to note that the monitor script behaves differently between the master and the replica:

  • Redis executes the script separately on the master and the replica. We expected the script to execute only on master and the resulting changes to be replicated to the secondary.
  • The Redis command HGETALL used in the script returns the hash keys in a different order on master compared to the replica.

Due to the above reasons, the script causes data inconsistencies between the master and its replica. From that point on, the data on the master and the replica keeps diverging until they become completely distinct. Because of this inconsistency, data on the replica does not get deleted correctly, and it grows into an extremely large dataset. Any further operations on this large dataset require more CPU, which explains why the replica’s CPU usage is higher than the master’s.

During replica reboots, the data gets synced and consistent again, which is why the CPU usage gets to normal values after rebooting.

Diving deeper on HGETALL

We knew that the keys of a hash are not ordered and we should not rely on the order. But it still puzzled us that the order is different even when the writing sequence is the same between the master and the replica. Plus the fact that the orders are always the same in our local environment with a similar setup made us even more curious.

So to better understand the underlying magic of Redis and to avoid similar bugs in the future, we decided to hammer on and read the Redis source code to get more details.

HGETALL command handling code

The HGETALL command is handled by the function genericHgetallCommand and it further calls hashTypeNext to iterate through the Hash object. A snippet of the code is shown as follows:

/* Move to the next entry in the hash. Return C_OK when the next entry
 * could be found and C_ERR when the iterator reaches the end. */
int hashTypeNext(hashTypeIterator *hi) {
    if (hi->encoding == OBJ_ENCODING_ZIPLIST) {
        // call zipListNext
    } else if (hi->encoding == OBJ_ENCODING_HT) {
        // call dictNext
    } else {
        serverPanic("Unknown hash encoding");
    }
    return C_OK;
}

From the code snippet, you can see that the Redis Hash object actually has two underlying representations:

  • ZIPLIST
  • HASHTABLE (dict)

A bit of research online helped us understand that, to save memory, Redis chooses between the two hash representations based on the following limits:

  • By default, Redis stores the Hash object as a ziplist when the hash has fewer than 512 entries and each element is smaller than 64 bytes.
  • If either limit is exceeded, Redis converts the ziplist to a hashtable, and this is irreversible. That is, Redis won’t convert the hashtable back to a ziplist, even if the entries/size later falls below the limit. The sketch after this list demonstrates the conversion.
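
Here is a small Go sketch (using the open source go-redis client against a local Redis instance) that makes the conversion visible. It assumes the default hash-max-ziplist-entries of 512; note that recent Redis versions report the small encoding as “listpack” instead of “ziplist”.

package main

import (
    "context"
    "fmt"
    "log"

    "github.com/go-redis/redis/v8"
)

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

    key := "encoding-demo"
    rdb.Del(ctx, key)

    // Write 600 small fields; the encoding should flip once we pass 512 entries.
    for i := 0; i < 600; i++ {
        if err := rdb.HSet(ctx, key, fmt.Sprintf("field-%d", i), i).Err(); err != nil {
            log.Fatal(err)
        }
        if i == 10 || i == 599 {
            enc, _ := rdb.ObjectEncoding(ctx, key).Result()
            fmt.Printf("after %d fields: %s\n", i+1, enc) // ziplist/listpack first, then hashtable
        }
    }
}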

Eureka moment

Based on this understanding, we checked the encoding of the problematic hash in staging.

stg-bookings-qu-002.pcxebj.0001.apse1.cache.amazonaws.com:6379> object encoding queue_stats
"hashtable"

stg-bookings-qu-001.pcxebj.0001.apse1.cache.amazonaws.com:6379> object encoding queue_stats
"ziplist"

To our surprise, the encodings of the Hash object on the master and its replica were different. This means that if we add or delete elements in the hash, the order of the keys won’t be the same, because hashtable operations behave differently from ziplist operations!

Now that we have identified the root cause, we were still curious about the difference in encoding between the master and the replica.

How could the underlying representations be different?

We reasoned, “If the master and its replica’s writing operations are exactly the same and in the same order, why are the underlying representations still different?”

To answer this, we further looked through the Redis source to find all the possible places that a Hash object’s representation could be changed and soon found the following code snippet:

/* Load a Redis object of the specified type from the specified file.
 * On success a newly allocated object is returned, otherwise NULL. */
robj *rdbLoadObject(int rdbtype, rio *rdb) {
  //...
  if (rdbtype == RDB_TYPE_HASH) {
    //...
    o = createHashObject();  // ziplist

    /* Too many entries? Use a hash table. */
    if (len > server.hash_max_ziplist_entries)
        hashTypeConvert(o, OBJ_ENCODING_HT);

    //...
  }
}

Reading through the code we understand the following behaviour:

  • When restoring from an RDB file, Redis creates a ziplist first for Hash objects.
  • Only when the size of the Hash object is greater than the hash_max_ziplist_entries, the ziplist is converted to a hashtable.

So, if you have a Redis Hash object encoded as a hashtable in the master, with its length less than hash_max_ziplist_entries (512), then when you set up a replica, the same object is encoded as a ziplist on the replica.

We were able to verify this behaviour in our local setup as well.

How did we fix it?

We could use the following two approaches to address this issue:

  • Enable script effect replication mode. This tells Redis to replicate the commands generated by the script instead of running the whole script on the replica. One disadvantage to using this approach is that it adds network traffic between the master and the replica.
  • Ensure the behaviour of the Lua monitor script is deterministic. In our case, we can do this by sorting the outputs of HKEYS/HGETALL.

We chose the latter approach because:

  • The Hash object is pretty small (< 30 elements), so the sorting overhead is low: less than 1 ms for 100 elements based on our tests.
  • Replicating the script’s effects would end up replicating thousands of Redis write commands to the replica, causing much higher overhead than replicating just the script. A sketch of the deterministic-ordering idea follows.
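
The actual fix lives inside the Lua monitor script, where we sort the output of HKEYS/HGETALL before acting on it. As a rough client-side illustration of the same idea in Go: Go maps, much like Redis hashtables, have no stable iteration order, so you sort the keys before iterating. The key name below matches the example hash from earlier; the rest is an assumption for this sketch.

package main

import (
    "context"
    "fmt"
    "sort"

    "github.com/go-redis/redis/v8"
)

func main() {
    ctx := context.Background()
    rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})

    fields, err := rdb.HGetAll(ctx, "queue_stats").Result()
    if err != nil {
        panic(err)
    }

    // Sort the field names so the processing order is deterministic,
    // regardless of the hash's underlying encoding.
    keys := make([]string, 0, len(fields))
    for k := range fields {
        keys = append(keys, k)
    }
    sort.Strings(keys)

    for _, k := range keys {
        fmt.Println(k, fields[k])
    }
}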

After the fix, the CPU usage of the replica remained in range after each deployment. This also prevented the Redis cluster from being destroyed in the event of a master failover.

Key takeaways

In addition to writing clear and maintainable code, it’s equally important to understand the underlying storage layer that you are dealing with to produce efficient and bug-free code.

The following are some of the key learnings on Redis:

  • Redis does not guarantee the consistency between master and its replica nodes when Lua scripts are used. You have to ensure that the behaviour of the scripts are deterministic to avoid data inconsistency.
  • By default, Redis replicates the whole Lua script to the replica rather than the commands the script produces. This default behaviour can be changed (see script effect replication mode above).
  • To save memory, Redis uses different representations for Hash. Your Hash object could be stored as a list in memory or a hashtable. This is not guaranteed to be the same across the master and its replicas.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

GitHub Availability Report: August 2020

Post Syndicated from Keith Ballinger original https://github.blog/2020-09-02-github-availability-report-august-2020/

Introduction

In August, we experienced no incidents resulting in service downtime. This month’s GitHub Availability Report will dive into updates to the GitHub Status Page and provide follow-up details on how we’ve addressed the incident mentioned in July’s report.

Status page refresh

At GitHub, we are always striving to be more transparent and clear with our users. We know our customers trust us to communicate the current availability of our services and we are now making that more clear for everyone to see. Previously, our status page shared a 90-day history of GitHub’s availability by service, but this history can distract users from what’s happening actively, which during an incident is the most important piece of information. Starting today, the status page will display our current availability and inform users of any degradation in service with real-time updates from the team. The history view will continue to be available and can be found under the “incident history” link. As a reminder, if you want to receive updates on any status changes, you can subscribe to get notifications whenever GitHub creates, updates, or resolves an incident.

Check out the new GitHub Status Page.

Animation showing new GitHub Status page

Follow up to July 13 08:18 UTC

Since the incident mentioned in July’s GitHub Availability Report, we’ve worked on a number of improvements to both our deployment tooling and to the way we configure our Kubernetes deployments, with the goal of improving the reliability posture of our systems.

First, we audited all the Kubernetes deployments used in production to remove all usages of the ImagePullPolicy of Always configuration.

Our philosophy when dealing with Kubernetes configuration is to make it easy for internal users to ship their code to production while continuing to follow best practices. For this reason, we implemented a change that automatically replaces the ImagePullPolicy of Always setting in all our Kubernetes-deployed applications, while still allowing experienced users with particular needs to opt out of this automation.

Second, we implemented a mechanism equivalent to the one of Kubernetes mutating admission controllers that we use to inject the latest version of sidecar containers, identified by the SHA256 digest of the image.

These changes allowed us to remove the strong coupling between Kubernetes Pods and the availability of our Docker registry in case of container restarts. We have more improvements in the pipeline that will help further increase the resilience of our Kubernetes deployments and we plan to share more information about those in the future.

In Summary

We’re excited to be able to share these updates with you and look forward to future updates as we continue our efforts in making GitHub more resilient every day.

Upgrading GitHub to Ruby 2.7

Post Syndicated from Eileen M. Uchitelle original https://github.blog/2020-08-25-upgrading-github-to-ruby-2-7/

After many months of work, we deployed GitHub to production using Ruby 2.7 in July. For those who aren’t familiar with GitHub’s stack, we’ve been running on Ruby since the beginning. Many years ago, we ran GitHub on a fork of Ruby (and Rails!) and while that hasn’t been the case for some time, that experience taught us how important it is to keep up with new releases.

Ruby 2.7 is a unique upgrade because the Ruby Core team has deprecated how keyword arguments behave. With this release, future versions of Ruby will no longer accept passing an options hash when a method expects keyword arguments. At GitHub, we’re committed to running deprecation-free on both Ruby and Rails to prevent falling behind on future upgrades. It’s important to identify major changes early so we can evolve the application when necessary.

In order to run Ruby 2.7 deprecation-free, we had to fix over 11k warnings. Fixing that many warnings, some of which were coming from external libraries, takes a lot of coordination and teamwork. In order to be successful we needed a solid strategy for sharing the work.

Strategy

Just like we did with our Rails upgrade, we set up our application to be dual-bootable in both Ruby 2.6 and Ruby 2.7 by using an environment variable. This made it easy for us to make backwards compatible changes, merge those to the main branch, and avoid maintaining a long running branch for our upgrade. It also made it easier for other engineering teams who needed to make changes to get their system running with the new Ruby version. Due to how large our application is (over 400k lines!) and how many changes go in daily (100’s of PRs!), this drastically simplifies our upgrade process.

Once we had the build running, we weren’t quite ready to ask other teams to help fix warnings. Since Ruby warnings are simply strings in the test output, we needed to capture the deprecations and turn them into lists for each team to fix.

To accomplish this we monkey patched the Warning module in Ruby. Here’s a simplified version of our monkey patch:

module Warning
  def self.warn(warning)
    root = ENV["RAILS_ROOT"].to_s + "/"
    warning = warning.gsub(root, "")

    line = caller_locations.find do |location|
      location.path.end_with?("_test.rb")
    end

    origin = line&.path&.gsub(root, "")

    WarningsCollector.instance << [warning.chomp, origin]

    STDERR.print(warning)
  end
end

The patch stores the deprecation warning and the test path that caused the warning in a WarningsCollector object, which writes the warnings to a file and then processes them:

class WarningsCollector < ParallelCollector
  def process
    filename = "warnings.txt"
    path = File.join(dir, filename)

    File.open(path, "a") do |f|
      @data.each do |message, origin|
        f.puts [message, origin].join("*^.^*") # ascii art so we can split on it later.
      end
    end

    script = File.absolute_path("../../../script/process-ruby-warnings", __FILE__)
    system(script, dir)
  end
end

The WarningsCollector#process method stores all the warnings in a file called warnings.txt. We then parse the warnings using CODEOWNERS and turn them into files that correspond to each owning team.

Once we had all the warnings processed, we opened issues for those teams with easy-to-follow directions for booting the application in the new Ruby version. Our warning reports included the file emitting the warning, the warning itself, and the test suites that triggered the warnings. They looked like this:

- [x] `app/jobs/delete_job.rb`
  - **warnings**
    - Line 16: warning: Using the last argument as keyword parameters is deprecated; maybe ** should be added to the call
  - **test suites that trigger these warnings**
    - test/jobs/delete_job.rb

This process helped us avoid duplicating work across teams and made it simple to determine ownership and status of each warning.

We tracked warning counts in the Ruby 2.7 CI build to ensure that new code wasn’t introducing new warnings. After a few months of coordinating with 40 teams, 30+ gem upgrades, and 11k warnings, our CI build was 100% warning-free. Gems that were unmaintained were replaced with maintained gems. Once we had fixed the warnings, we altered our monkey patch to raise errors in Ruby 2.7, which ensured that all new code going into the GitHub codebase was warning-free.

Benefits of Upgrading to 2.7

You may be reading this and wondering why it’s worth doing all this work and investing the engineering resources and time in the Ruby upgrade. If you’ve been writing Ruby for a while you’re likely aware of the difficulty with this particular upgrade. It’s been the topic of conversation in the Ruby community since before the release in December. Regardless of how hard this upgrade was, we saw an impressive improvement in performance. The Ruby Core team is well on their way to fulfilling the promise of Ruby 3.0 being 3x faster.

First, we saw a drop in the amount of time it takes the application to boot in production mode. In production (this is when the entire application is eager loaded) we saw our boot time drop from an average of ~90 seconds to ~70 seconds. That’s a 20-second drop. Here’s a graph:

Chart showing time it takes the application to boot in production mode

This faster boot time means faster deploys which means you get our features, bug fixes, and performance improvements faster as well!

In addition to an improvement in boot time, we saw a decrease in object allocations which went from ~780k allocations to ~668k allocations. Object allocations affect available memory so it’s important to lower these numbers whenever possible.

Chart showing decrease in object allocations

Aside from the performance benefits of upgrading, ensuring you stay on the most recent version of your languages and frameworks helps keep your application healthy. Through this process we found a lot of unowned code that was no longer used in the application and deleted it. We also took this opportunity to remove or replace unmaintained gems in our application.

For gems that were maintained, we gave back to the community by sending patches for any gems that were emitting warnings in our application, including Rails, rails-controller-testing, capybara, factory_bot, view_component, posix-spawn, github-ds, ruby-kafka, and many others. GitHub believes strongly in supporting the open source community and upgrades are one of many ways that we do so directly.

Deployment

There are risks to deploying any major upgrade, but at GitHub we’ve designed processes that reduce this risk drastically.

For Ruby and Rails upgrades, we run dual builds until we’re sure all the tests are passing and the code is stable. In addition, we have every team that works on the core product click-test their area of the codebase in a staging environment to ensure there are no obvious issues with the new version.

Rolling out the upgrade is a big deal, so we do it carefully by increasing the percentage of traffic running on the new version and verifying each deployment is error-free in Sentry and regression-free in Datadog. For this deploy, we rolled out to 2% of traffic and quickly saw a new frozen string exception. Due to our process, we were able to roll back quickly, and fewer than 10 users saw an error on one endpoint.

Once we had a fix for the frozen string exception, we restarted the rollout process and again deployed to 2% of traffic. We let this one sit for 15 minutes before going to the next percentage: 30% of Kubernetes partitions. Again we waited about 15 minutes and after verifying there was no regression we deployed to another 30% to total 60% of Kubernetes partitions.

Finally, we deployed to 30% of our non-Kubernetes deployment partitions. These deploys take longer because they need to compile Ruby. It’s a bit nerve-wracking waiting 15 minutes for Ruby to compile, but everything went smoothly. From there we did a full-production deploy and merged the upgrade after 30 minutes. Overall the entire deploy took about 2 hours.

At GitHub, we’ve invested in building out processes for deploying Ruby and Rails upgrades so that we can be confident they are the lowest possible risk. We had no downtime while deploying the Ruby upgrade and our customer impact was almost zero.

Was it worth it?

For any company wondering whether this upgrade is worth it, the answer is: 100%. Even without the performance improvements, falling behind on Ruby upgrades has drastic negative effects on the stability of your codebase. Upgrading Ruby supports your application health, improves performance, fixes language and framework bugs, and guides the future of the language!

At GitHub, not only do we believe in the open source community, we believe that a strong foundation is the first step to a stable, resilient, and functioning application. Running on the most recent version of Ruby helps us do just that. We’re looking forward to Ruby 3.0 and beyond. Happy upgrading!

Securing and managing multi-cloud Presto Clusters with Grab’s DataGateway

Post Syndicated from Grab Tech original https://engineering.grab.com/data-gateway

Introduction

Data is the lifeblood of Grab and the insights we gain from it drive all the most critical business decisions made by Grabbers and our leaders every day.

Grab’s Data Engineering (DE) team is responsible for maintaining the data platform, which consists of data pipelines, job schedulers, and the query/computation engines that are the key components for generating insights from data. SQL is the core language for analytics at Grab and as of early 2020, our Presto platform serves about 200 user groups that add up to 500 users who run 350,000 queries every day. These queries span across 10,000 tables that process up to 1PB of data daily.

In 2016, we started the DataGateway project to enable us to manage data access for the hundreds of Grabbers who needed access to Presto for their work. Since then, DataGateway has grown to become much more than just an access control mechanism for Presto. In this blog, we want to share what we’ve achieved since the initial launch of the project.

The problems we wanted to solve

As we were reviewing the key challenges around data access in Grab and assessing possible solutions, we came up with this prioritized list of user requirements we wanted to work on:

  • Use a single endpoint to serve everyone.
  • Manage user access to clusters, schemas, tables, and fields.
  • Provide a seamless user experience when Presto clusters are scaled up/down, in/out, or provisioned/decommissioned.
  • Capture audit trail of user activities.

To meet Grabbers’ critical need for interactive querying, as well as for running extract, transform, load (ETL) jobs, we evaluated several technologies. Presto was among the ones we evaluated, and it’s what we eventually chose, even though it didn’t meet all of our requirements out of the box. To address these gaps, we came up with the idea of a security gateway for the Presto compute engine that could also act as a load balancer/proxy; this is how we ended up creating DataGateway.

DataGateway is a service that sits between clients and Presto clusters. It is essentially a smart HTTP proxy server that is an abstraction layer on top of the Presto clusters that handles the following actions:

  1. Parse incoming SQL statements to get requested schemas, tables, and fields.
  2. Manage user Access Control List (ACL) to limit users’ data access by checking against the SQL parsing results.
  3. Manage users’ cluster access.
  4. Redirect users’ traffic to the authorized clusters.
  5. Show meaningful error messages to users whenever the query is rejected or exceptions from clusters are encountered.

Anatomy of DataGateway

The DataGateway’s key components are as follows:

  • API Service
  • SQL Parser
  • Auth framework
  • Administration UI

We leveraged Kubernetes to run all these components as microservices.

Figure 1. DataGateway Key Components

API Service

This is the component that manages all user- and cluster-facing processes. We integrated this service with the Presto API, which means it appears to a client to be the same as a Presto cluster. It accepts query requests from clients, gets the parsing result from the SQL Parser, and runs authorization through the Auth Framework.

If everything is good to go, the API Service forwards queries to the assigned clusters and continues the entire query process.

Auth Framework

This handles both authentication and authorization requests. It stores the ACL of users and communicates with the API Service and the SQL Parser to run the entire authentication process. But why is it a microservice instead of a module in API Service, you ask? It’s because we keep evolving the security checks at Grab to ensure that everything is compliant with our security requirements, especially when dealing with data.

We wanted to make it flexible to fulfill ad-hoc requests from the security team without affecting the API Service. Furthermore, there are different authentication methods out there that we might need to deal with (OAuth2, SSO, you name it). The API Service supports multiple authentication frameworks that enable different authentication methods for different users.

SQL Parser

This is a SQL parsing engine that extracts schemas, tables, and fields by reading SQL statements. Since Presto SQL parsing works differently in each version, we compile multiple SQL Parsers, each matching the version of the Presto clusters we run. The SQL Parser becomes the single source of truth.

Admin UI

This is a UI for Presto administrators to manage clusters and user access, as well as to select an authentication framework, making it easier for the administrators to deal with the entire ecosystem.

How we deployed DataGateway using Kubernetes

In the past couple of years, we’ve had significant growth in workloads from analysts and data scientists. As we were very enthusiastic about Kubernetes, DataGateway was chosen as one of the earliest services for deployment in Kubernetes. DataGateway in Kubernetes is highly available and fully scalable to handle traffic from users and systems.

We also tested the Horizontal Pod Autoscaler (HPA) feature of Kubernetes, which dynamically scales the number of pods in or out based on actual traffic and resource consumption.

Figure 2. DataGateway deployment using Kubernetes

Functionality of DataGateway

This section highlights some of the ways we use DataGateway to manage our Presto ecosystem efficiently.

Restrict users based on Schema/Table level access

In a setup where a Presto cluster is deployed on Amazon Elastic MapReduce (EMR) or Elastic Kubernetes Service (EKS), we configure an IAM role and attach it to the EMR or EKS nodes. The IAM role is set to limit access to S3 storage. However, IAM only provides bucket-level and object-level control; it doesn’t meet our requirement for schema-, table-, and column-level ACLs. This is where DataGateway proves useful.

One of the DataGateway services is the SQL Parser. As previously covered, this is a service that parses a query and digs out the schemas and tables involved. The API service receives the parsing result, checks it against the user’s ACL, and decides whether to allow or reject the query. This is a remarkable improvement in our security control since we now have another layer to restrict access, on top of the S3 storage. We’ve implemented SQL-based access control down to the table level.

As shown in Figure 3, user A is trying to run the SQL statement select * from locations.cities. The SQL Parser reads the statement and tells the API service that user A is trying to read data from the table cities in the schema locations. The API service then checks user A’s ACL and finds that user A only has read access to the table countries in the schema locations. The API service therefore denies the attempt, because user A doesn’t have read access to the table cities in the schema locations.

Figure 3. An example of how to check user access to run SQL statements

The above flow shows an access denied result because the user doesn’t have the appropriate permissions.
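
To make that check concrete, here is a minimal Go sketch of the idea. It assumes a hypothetical ACL keyed by user and schema.table and a parser result reduced to the tables a query reads; DataGateway's real components are more involved than this.

    package main

    import "fmt"

    // TableRef is the shape we assume the SQL Parser produces for each table a query reads.
    type TableRef struct {
        Schema string
        Table  string
    }

    // acl maps user -> "schema.table" -> allowed; a stand-in for DataGateway's real ACL store.
    var acl = map[string]map[string]bool{
        "userA": {"locations.countries": true},
    }

    // authorize rejects the query if any referenced table is missing from the user's ACL.
    func authorize(user string, refs []TableRef) error {
        for _, r := range refs {
            key := r.Schema + "." + r.Table
            if !acl[user][key] {
                return fmt.Errorf("access denied: %s cannot read %s", user, key)
            }
        }
        return nil
    }

    func main() {
        // Parsing "select * from locations.cities" would yield this single reference.
        refs := []TableRef{{Schema: "locations", Table: "cities"}}
        if err := authorize("userA", refs); err != nil {
            fmt.Println(err) // access denied: userA cannot read locations.cities
        }
    }

Running it with the Figure 3 scenario prints the denial for locations.cities, which is exactly the outcome described above.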

Seamless User Experience during the EMR migration

We use AWS EMR to deploy Presto as our SQL query engine since deployment is really easy. However, without DataGateway, any EMR operation such as termination, new cluster deployment, config changes, or version upgrades would require quite a bit of user involvement. We would sometimes need users to make changes on their side, for example, asking them to change their endpoints to connect to the right clusters.

With DataGateway, an ACL exists for each user account. The ACL includes the list of EMR clusters that the user is allowed to access. As a Presto access management platform, DataGateway redirects user traffic to an appropriate cluster based on the ACL, like a proxy. Users always connect to the same endpoint we offer, which is DataGateway. To switch over from one cluster to another, we just edit the cluster ACL and everything is handled seamlessly.

Figure 4. Cluster switching using DataGateway

Figure 4 highlights the case when we’re switching EMR from one cluster to another. No changes are required from users.
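
As an illustration of the proxy behaviour only (not DataGateway's actual implementation), a gateway can rewrite each request to whichever cluster the ACL names for that user. The cluster addresses and the header-based user lookup below are assumptions.

    package main

    import (
        "log"
        "net/http"
        "net/http/httputil"
        "net/url"
    )

    // clusterFor stands in for a lookup against the per-user cluster ACL.
    // The cluster addresses here are hypothetical.
    func clusterFor(user string) string {
        clusters := map[string]string{
            "userA": "http://presto-emr-blue.internal:8080",
            "userB": "http://presto-emr-green.internal:8080",
        }
        if c, ok := clusters[user]; ok {
            return c
        }
        return "http://presto-default.internal:8080"
    }

    func main() {
        gateway := &httputil.ReverseProxy{
            Director: func(req *http.Request) {
                // Clients always hit the gateway endpoint; the gateway rewrites the
                // target cluster, so switching clusters never requires client changes.
                target, err := url.Parse(clusterFor(req.Header.Get("X-Presto-User")))
                if err != nil {
                    return
                }
                req.URL.Scheme = target.Scheme
                req.URL.Host = target.Host
            },
        }
        log.Fatal(http.ListenAndServe(":8080", gateway))
    }

Because the routing decision lives entirely in the gateway, pointing a user at a new cluster is a data change rather than a client change.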

We migrated our entire Presto platform from one AWS EMR instance to another using the same methodology, with little to no disruption for our users. In a few phases over a couple of months, we moved 40 clusters serving hundreds of users who issue millions of queries daily.

In most cases, users didn’t have to make any changes on their end, they just continued using Presto as usual while we made the changes in the background.

Multi-Cloud Data Lake/Presto Cluster maintenance

Recently, we started to build and maintain data lakes not just in one cloud, but two – in AWS and Azure. Since most end-users are AWS-based, and each team has their own AWS sub-account to run their services and workloads, it would be a nightmare to bridge all the connections and access routes between these two clouds from end-to-end, sub-account by sub-account.

Here, DataGateway plays the role of the multi-cloud gateway. Since all end-users’ AWS sub-accounts are peered to DataGateway’s network, everything becomes much easier to handle.

For end-users, they retain the same Presto connection profile. The DE team then handles the connection setup from DataGateway to Azure, and also the deployment of Presto clusters in Azure.

When all is set, end-users use the same DataGateway endpoint. We offer a feature called Cluster Switch that allows users to switch between the AWS Presto cluster and the Azure Presto cluster on the fly by filling in parameters in the connection string. This feature allows users to switch to their target Presto cluster without any endpoint changes, and the switch takes effect as soon as they make the change. That means users can run different queries in different clusters based on their requirements.

This feature has helped the DE team maintain Presto clusters easily. We can spin up different Presto clusters for different teams, so that each team has its own query engine to run queries with dedicated resources.

Figure 5. Sub-account connections and Queries

Figure 5 shows an example of how sub-accounts connect to DataGateway and run queries on resources in different clouds and clusters.

Figure 6. Sample scenario without DataGateway

Figure 6 shows what would happen if DataGateway didn’t exist: each account would have to maintain its own connections, Virtual Private Cloud (VPC) peering, and express link to connect to our Presto resources.

Summary

DataGateway plays a key role in Grab’s entire Presto ecosystem. It helps us manage user access and cluster selection through a single endpoint, ensuring that everyone runs their Presto queries in the same place. It also helps distribute workloads across different types and versions of Presto clusters.

When we started to deploy DataGateway on Kubernetes, our vision for the Presto ecosystem underwent an epic change, and it further motivated us to continuously improve. Since then, we’ve had new ideas on deployment methods and pipelines, microservice implementations, scaling strategy, and resource control; we even made use of Kubernetes to design an on-demand, container-based Presto cluster provisioning engine. We’ll share this in another engineering blog, so do stay tuned!

We also made crucial enhancements on data access control as we extended Presto’s access controls down to the schema/table-level.

In day-to-day operations, especially when we started to implement data lake in multiple clouds, DataGateway solved a lot of implementation issues. DataGateway made it simpler to switch a user’s Presto cluster from one cloud to another or allow a user to use a different Presto cluster using parameters. DataGateway allowed us to provide a seamless experience to our users.

Looking forward, we have more and more ideas for our Presto ecosystem, such as a Spark DataGateway or AWS Athena integrations, to keep our data safe at all times and to provide our users with a smoother experience when dealing with data used for analysis or research.


Authored by Vinnson Lee on behalf of the Presto Development Team at Grab – Edwin Law, Qui Hieu Nguyen, Rahul Penti, Wenli Wan, Wang Hui and the Data Engineering Team.


Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Introducing the Rally + GitHub integration

Post Syndicated from Jared Murrell original https://github.blog/2020-08-18-introducing-the-rally-github-integration/

GitHub’s Professional Services Engineering team has decided to open source another project: Rally + GitHub. You may have seen our most recent open source project, Super Linter. Well, the team has done it again, this time to help users ensure that Rally stays up to date with the latest development in GitHub! 🎉

Rally + GitHub

This project integrates GitHub Enterprise Server (and cloud, if you host it yourself) with Broadcom’s Rally project management.

Every time a pull request is created or updated, Rally + GitHub will check for the existence of a Rally User Story or Defect in the title, body, or commit messages, and then validate that they exist and are in the correct state within Rally.

Animation showing a pull request being created
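
For a rough idea of the detection step, here is a hedged sketch in Go; the real project is a GitHub App and may work quite differently, and the US/DE prefixes are an assumption based on Rally's default formatted IDs.

    package main

    import (
        "fmt"
        "regexp"
    )

    // rallyID matches Rally-style formatted IDs such as US1234 (User Story) or
    // DE5678 (Defect); the prefixes are an assumption based on Rally defaults.
    var rallyID = regexp.MustCompile(`\b(US|DE)\d+\b`)

    // findArtifacts collects candidate Rally IDs from a pull request's title,
    // body, and commit messages so they can then be validated against Rally.
    func findArtifacts(texts ...string) []string {
        var ids []string
        for _, t := range texts {
            ids = append(ids, rallyID.FindAllString(t, -1)...)
        }
        return ids
    }

    func main() {
        fmt.Println(findArtifacts(
            "US1234: add login throttling",       // pull request title
            "Fixes DE99 by retrying the request", // pull request body
        ))
    }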

Why was it created?

GitHub Enterprise Server had a legacy Services integration with Rally. The deprecation of legacy Services for GitHub was announced in 2018, and the release of GitHub Enterprise Server 2.20 officially removed this functionality. As a result, many GitHub Enterprise users will be left without the ability to integrate the two platforms when upgrading to recent releases of GitHub Enterprise Server.

While Broadcom created a new integration for github.com, this functionality does not extend to GitHub Enterprise Server environments.

Get Started

We encourage you to check out this project and set it up with your existing Rally instance. A good place to start getting set up is the Get Started guide in the project’s README.md.

We invite you to join us in developing this project! Come engage with us by opening an issue, even if it’s just to share your experience with the project.

Animation showing Rally and GitHub integration

Why Write ADRs

Post Syndicated from Eli Perkins original https://github.blog/2020-08-13-why-write-adrs/

Architecture decision records, also known as ADRs, are a great way to document how and why a decision was reached within a codebase. We’ve started to adopt them within the mobile team here at GitHub, documenting decisions that affect the iOS codebase and Android codebase, as well as decisions that affect both mobile clients.

ADRs are not the most common within open source codebases, but have gained more popularity since ~2017 within long-lived, “evolutionary” codebases like those in more enterprise-y settings.

So why write an ADR? Why spend time documenting something when a decision has already been made?

They’re not for you, they’re for future you

ADRs are not meant to be a self-discovery, reflection process on what you decided. ADRs will help you 6-12 months from now recall what your mindset was when you decided upon that architecture.

ADRs capture the decision at the time it’s being made. ADRs are the culmination of all those minutes and hours you spent in meetings, on Zoom, in Slack, or jamming through various proof-of-concepts in Xcode. All of the context that’s in your head has a chance to get out into words so that when you’re revisiting the architecture down the road, you can put that context right back into your head.

The real bonus comes when someone git blames you some months from now and asks you how the GitHubAPIClient module works. Better than setting up a 30-minute pairing call to walk them through the code, you can now link to the ADR you wrote to explain some more about the decisions made while building the GitHubAPIClient module.

They’re not for you, they’re for your peers

ADRs force you to write more than a one-liner “this ships the feature for #3128”. They are a longer form of prose to help your teammates understand why the feature is built the way it is, and not built some other way (see: “Alternatives Considered” and “Pros/Cons” within ADRs themselves).

Something that may be simple for you might be complicated for your teammates. Taking the time to write down what your thought process was as you made decisions gives your teammates the chance to get inside your head. Writing ADRs allows for “decision socialization”, where your team comes to a decision that the team is responsible for maintaining, rather than decisions made in isolation.

By expanding upon what you write in your pull request titles and descriptions (and you are still writing quality pull request descriptions, right?), you give your teammates more information about how a patch or diff works in a larger system.

Better yet, by writing an ADR ahead of putting your pull requests up, you’ll get better pull request reviews from the team reviewing it. No longer do you need to explain how line 387 in APIClient+Caching.swift will affect data fetching and caching architecture, because your teammates already understand how you’re changing the system from the ADR you wrote about “Adding cache support to E-Tag entities”.

They’re not for you, they’re for your future peers

ADRs are not meant for you to show off how smart you are, or for folks to fawn over the architecture you built. ADRs are for helping onboard new teammates as they work to understand the codebase and how it has evolved over time.

As teams scale and grow, the number of lines of communication between teammates increases. A team of three individuals only has three lines of communication (A <> B, A <> C, B <> C). A team of four has six (A <> B, A <> C, A <> D, B <> C, B <> D, C <> D). Wanna do the math for a team of five or six? How about 14 engineers, two designers, two PMs, and three EMs?
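
(For the curious: a fully connected team of n people has n(n-1)/2 lines of communication, so five people means 10 lines, six means 15, and that 21-person team means 21 × 20 / 2 = 210.)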

Writing down decisions made helps communicate to your current teammates, but also those who join your team as it scales and grows. By informing your team how and why a decision was made in an asynchronous fashion, you no longer need to “hop on a Zoom call” to onboard each new teammate on a per-architectural-decision basis.

In the best-case scenario, you’ll have your teammates writing new ADRs for you, superseding the decisions past-you made, so that you can learn from your teammates in the future.

Write more ADRs

I hope this has convinced you to document the decisions that you’re making as we build software that (hopefully) millions of folks use! As our team grows larger and our codebases grow more entangled and intertwined, architecture decision records are a great way to help future us, our current teammates, and future teammates.

👋 Thanks for coming to my TED talk.

Go Modules- A guide for monorepos (Part 2)

Post Syndicated from Grab Tech original https://engineering.grab.com/go-module-a-guide-for-monorepos-part-2

This is the second post on the Go module series, which highlights Grab’s experience working with Go modules in a multi-module monorepo. In this article, we’ll focus on suggested solutions for catching unexpected changes to the go.mod file and addressing dependency issues. We’ll also cover automatic upgrades and other learnings uncovered from the initial obstacles in using Go modules.

Vendoring process issues

Our previous vendoring process fell solely on the developer who wanted to add or update a dependency. However, it was often the case that the developer came across many unexpected changes, due to previous vendoring attempts, accidental imports and changes to dependencies.

The developer would then have to resolve these issues before being able to make a change, costing time and causing frustration with the process. It became clear that it wasn’t practical to expect the developer to catch all of the potential issues while vendoring, especially since Go modules itself was new and still in development.

Avoiding unexpected changes

Reluctantly, we added a check to our CI process that ran on every merge request. This helped ensure that there were no unexpected changes required to go.mod. It added time to every build and often flagged a failure, but it saved a lot of post-merge hassle. We then realized that we should have done this from the beginning.

Since we hadn’t enabled Go modules for builds yet, we couldn’t rely on the -mod=readonly flag. We implemented the check by running go mod vendor and then checking the resulting difference.

If there were any changes to go.mod or the vendor directory, the merge request would get rejected. This worked well in ensuring the integrity of our go.mod.
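
Our check was a CI step built from shell commands, but the logic is small enough to sketch; here is a rough Go equivalent, assuming it runs from the repository root with git and go on the PATH.

    package main

    import (
        "fmt"
        "os"
        "os/exec"
    )

    // run executes a command and fails the build if it errors.
    func run(name string, args ...string) []byte {
        out, err := exec.Command(name, args...).CombinedOutput()
        if err != nil {
            fmt.Fprintf(os.Stderr, "%s %v failed: %v\n%s", name, args, err, out)
            os.Exit(1)
        }
        return out
    }

    func main() {
        // Re-run vendoring, then fail if it left any difference behind.
        run("go", "mod", "vendor")
        diff := run("git", "status", "--porcelain", "--", "go.mod", "go.sum", "vendor")
        if len(diff) > 0 {
            fmt.Fprintf(os.Stderr, "unexpected go.mod/vendor changes:\n%s", diff)
            os.Exit(1)
        }
        fmt.Println("go.mod and vendor are clean")
    }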

Roadblocks and learnings

However, as this was the first time we were using Go modules on our CI system, it uncovered some more problems.

Private repository access

There was the problem of accessing private repositories. We had to ensure that the CI system was able to clone all of our private repositories as well as the main monorepo, by adding the relevant SSH deploy keys to the repository.

False positives

The check sometimes fired false positives – detecting a go mod failure when there were no changes. This was often due to network issues, especially when the modules are hosted by less reliable third-party servers. This is somewhat solved in Go 1.13 onwards with the introduction of proxy servers, but our workaround was simply to retry the command several times.

We also avoided adding dependencies hosted by a domain that we haven’t seen before, unless absolutely necessary.

Inconsistent Go versions

We found several inconsistencies between Go versions – running go mod vendor on one Go version gave different results to another. One example was a change to the checksums. These inconsistencies are less common now, but still remain between Go 1.12 and later versions. The only solution is to stick to a single version when running the vendoring process.

Automated upgrades

There are benefits to using Go modules for vendoring. It’s faster than previous solutions, better supported by the community and it’s part of the language, so it doesn’t require any extra tools or wrappers to use it.

One of the most useful benefits from using Go modules is that it enables automated upgrades of dependencies in the go.mod file – and it becomes more useful as more third-party modules adopt Go modules and semantic versioning.

Automated updates workflow

We call our solution for automating updates at Grab the AutoVend Bot. It is built around a single Go command, go list -m -u all, which finds and lists available updates to the dependencies listed in go.mod (add -json for JSON output). We integrated the bot with our development workflow and change-request system to take the output from this command and create merge requests automatically, one per update.
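
To give a rough idea of how a bot can consume that output, here is a sketch that runs go list -m -u -json all and prints each module with an available update; the merge-request creation itself depends on internal tooling and is omitted.

    package main

    import (
        "bytes"
        "encoding/json"
        "fmt"
        "io"
        "log"
        "os/exec"
    )

    // module mirrors the fields we care about from `go list -m -u -json all`,
    // which prints a stream of JSON objects, one per module in the build list.
    type module struct {
        Path    string
        Version string
        Update  *struct{ Version string }
    }

    func main() {
        out, err := exec.Command("go", "list", "-m", "-u", "-json", "all").Output()
        if err != nil {
            log.Fatal(err)
        }
        dec := json.NewDecoder(bytes.NewReader(out))
        for {
            var m module
            if err := dec.Decode(&m); err == io.EOF {
                break
            } else if err != nil {
                log.Fatal(err)
            }
            if m.Update != nil {
                // An available update; a bot could open one merge request per line.
                fmt.Printf("%s %s -> %s\n", m.Path, m.Version, m.Update.Version)
            }
        }
    }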

Once the merge request is approved (by a human, after verifying the test results), the bot would push the change. We have hundreds of dependencies in our main monorepo module, so we’ve scheduled it to run a small number each day so we’re not overwhelmed.

By reducing the manual effort required to update dependencies to almost nothing, we have been able to apply hundreds of updates to our dependencies, and ensure our most critical dependencies are on the latest version. This not only helps keep our dependencies free from bugs and security flaws, but it makes future updates far easier and less impactful by reducing the set of changes needed.

In Summary

Using Go modules for vendoring has given us valuable and low-risk exposure to the feature. We have been able to detect and solve issues early, without affecting our regular builds, and develop tooling that’ll help us in future.

Although Go modules is part of the standard Go toolchain, it shouldn’t be viewed as a complete off the shelf solution that can be dropped into a codebase, especially a monorepo.

Like many other Go tools, the Modules feature comprises many small, focused tools that work best when combined together with other code. By embracing this concept and leveraging things like go list, go mod graph and go mod vendor, Go modules can be made to integrate into existing workflows, and deliver the benefits of structured versioning and reproducible builds.

I hope you have enjoyed this article on using Go modules and vendoring within a monorepo.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Credits

The cute Go gopher logo for this blog’s cover image was inspired by Renee French’s original work.

GitHub Availability Report: July 2020

Post Syndicated from Keith Ballinger original https://github.blog/2020-08-05-github-availability-report-july-2020/

In July we experienced one specific incident resulting in a degraded state of availability for GitHub.com. We’d like to share our learnings from this incident with the community in the spirit of being transparent about our service disruptions, and helping other services improve their own operations.

July 13 08:18 UTC (lasting for four hours, 25 minutes)

The incident started when our production Kubernetes Pods started getting marked as unavailable. This cascaded through our clusters resulting in a reduction in capacity, which ultimately brought down our services. Investigation into the Pods revealed that a single container within the Pod was exceeding its defined memory limits and being terminated. Even though that container is not required for production traffic to be processed, the nature of Kubernetes requires that all containers be healthy for a Pod to be marked as available.

Normally when a Pod runs into this failure mode, the cluster will recover within a minute or so. In this case, the container in the Pod was configured with an ImagePullPolicy of Always, which instructed Kubernetes to fetch a new container image every time. However, due to a routine DNS maintenance operation that had been completed earlier, our clusters were unable to successfully reach our registry, resulting in Pods failing to start. The impact of this issue increased when a redeploy was triggered in an attempt to mitigate, and we saw the failure start to propagate across our production clusters. It wasn’t until we restarted the process with the cached DNS records that we were able to successfully fetch container images, redeploy, and recover our services.
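
For readers less familiar with this setting, the pull policy is declared per container in the Pod spec. The sketch below (a hypothetical container built with the Kubernetes Go API types, not one of our actual manifests) shows the knob involved: with IfNotPresent the kubelet can reuse a locally cached image when the registry is unreachable, whereas Always forces a registry fetch on every start.

    package main

    import (
        "fmt"

        corev1 "k8s.io/api/core/v1"
    )

    func main() {
        // Hypothetical sidecar container. ImagePullPolicy: Always forces a registry
        // fetch on every start, so a DNS or registry outage blocks Pod startup;
        // IfNotPresent lets the node reuse an image it has already pulled.
        sidecar := corev1.Container{
            Name:            "metrics-sidecar",
            Image:           "registry.example.com/metrics-sidecar:1.4.2",
            ImagePullPolicy: corev1.PullIfNotPresent,
        }
        fmt.Printf("%s pulls with policy %q\n", sidecar.Name, sidecar.ImagePullPolicy)
    }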

Moving forward, we’ve identified a number of areas to address this quarter: 

  • Enhancing monitoring to ensure Pod restarts would not fail again based on this same pattern
  • Minimizing our dependency on the image registry
  • Expanding validation during DNS changes
  • Reevaluating all the existing Kubernetes deployment policies

In parallel, we have an ongoing workstream to improve our approach to progressive deployments that will provide the ability to carefully evaluate the impact of deployments in a more incremental fashion. This is part of a broader engineering initiative focused on reliability that we will have more details on in the coming months.

In summary

We place great importance in the reliability of our service along with the trust that our users place in us every day. We look forward to continuing to share more details of our journey and hope you can learn from our experiences along the way. 

CodeGen: Semantic’s improved language support system

Post Syndicated from Ayman Nadeem original https://github.blog/2020-08-04-codegen-semantics-improved-language-support-system/

The Semantic Code team shipped a massive improvement to the language support system that powers code navigation. Code navigation features only scratch the surface of possibilities that start to open up when we combine Semantic‘s program analysis potential with GitHub’s scale.

GitHub is home to over 50 million developers worldwide. Our team’s mission is to analyze code on our platform and surface insights that empower users to feel more productive. Ideally, this analysis should work for all users, regardless of which programming language they use. Until now, however, we’ve only been able to support a handful of languages due to the high cost of adding and maintaining them. Our new language support system, CodeGen, cuts that cost dramatically by automating a significant portion of the pipeline, in addition to making it more resilient. The result is that it is now easier than ever to add new programming languages that get parsed by our library.

Language Support is mission-critical

Before Semantic can compute diffs or perform abstract interpretation, we need to transform code into a representation that is amenable to our analysis goals. For this reason, a significant portion of our pipeline deals with processing source code from files on GitHub.com into an appropriate representation—we call this our “language support system”.

Diagram showing semantic architecture

Adding languages has been difficult

Zooming into the part of Semantic that achieves language support, we see that it involved several development phases, including two parsing steps that required writing and maintaining two separate grammars per language.

Diagram showing language support pipeline

Reading the diagram from left to right, our historic language support pipeline:

  1. Parsed source code into ASTs. A grammar is hand-written for a given language using tree-sitter, an incremental GLR parsing library for programming tools.
  2. Read tree-sitter ASTs into Semantic. Connecting Semantic to tree-sitter‘s C library requires providing an interface to the C source. We achieve this through our haskell-tree-sitter library, which has Haskell bindings to tree-sitter.
  3. Parsed ASTs into a generalized representation of syntax. For these ASTs to be consumable by our Haskell project, we had to translate the tree-sitter parse trees into an appropriate representation. This required:
    • À la carte syntax types: generalization across programming languages. Many constructs, such as If statements, occur in several languages. Instead of having different representations of If statements for each language, could we reduce duplication by creating a generalized representation of syntax that could be shared across languages, such as a datatype modeling the semantics of conditional logic? This was the reasoning behind creating our hand-written, generalized à la carte syntaxes based on Wouter Swierstra’s Data types à la carte approach, allowing us to represent those shared semantics across languages. For example, this file captures a subset of à la carte datatypes that model expression syntaxes across languages.
    • Assignment: turning tree-sitter’s representation into a Haskell datatype representation. We had to translate tree-sitter AST nodes to be represented by the new generalized à la carte syntax. To do this, a second grammar was written in Haskell to assign the nodes of the tree-sitter ASTs onto a generalized representation of syntax modeled by the à la carte datatypes. As an example, here is the Assignment grammar written for Ruby.
  4. Performed Evaluation. Next, we captured what it meant to interpret the syntax datatypes. To do so, we wrote a polymorphic type class called Evaluatable, which defined the necessary interface for a term to be evaluated. Evaluatable instances were added for each of the à la carte syntaxes.
  5. Handled Effects. In addition to evaluation semantics, describing the control flow of a given program also necessitates modeling effects. This helps ensure we can represent things like the file system, state, non-determinism, and other effectful computations.
  6. Validated via tests. Tests for diffing, tagging, graphing, and evaluating source code written in that language were added along the process.

Challenges posed by the system

The process described had several obstacles. Not only was it very technically involved, but it had additional limitations.

  1. The system was brittle. Each language’s Assignment code was tightly coupled to the language’s tree-sitter grammar. This meant it could break at runtime if we changed the structure of the grammar, without any compile-time error. Preventing such errors required tracking ongoing changes in tree-sitter, which was tedious, manual, and error-prone. Each time a grammar changed, assignment changes had to be made to accommodate new tree-structures, such as nodes that changed names or shifted positions. Because improvements to the underlying grammars required changes to Assignment—which were costly in terms of time and risky in terms of the possibility of introducing bugs—our system had inadvertently become incentivized against iterative improvement.
  2. There were no named child nodes. tree-sitter‘s syntax nodes didn’t provide us with named child nodes. Instead, child nodes were structured as ordered-lists, without any name indicating the role of each child. This didn’t match Semantic’s internal representation of syntax nodes, where each type of node has a specific set of named children. This meant more Assignment work was necessary to compensate for the discrepancy. One concern, for example, was about how we represented comments, which could be any arbitrary node attached to any part of the AST. But if we had named child nodes, this would allow us to associate comments relative to their parent nodes (like if a comment appeared in an if statement, it could be the first child for that if statement node). This would also apply to any other nodes that could appear anywhere within the tree, such as Ruby heredocs.
  3. Evaluation and à la carte sum types were sub-optimal. Taking a step back to examine language support also gave us an opportunity to rethink our à la carte datatypes and the evaluation machinery. À la carte syntax types were motivated by a desire to better share tooling in evaluating common fragments of languages. However, the introduction of these syntax types (and the design of the Evaluatable typeclass) did not make our analysis sensitive to minor linguistic differences, or even to relate different pieces of syntax together. We could overcome this by adding language-specific syntax datatypes to be used with Assignment, along with their accompanying Evaluatable instances—but this would defeat the purpose of a generalized representation. This is because à la carte syntax was essentially untyped; it enforced only a minimal structure on the tree. As a result, any given subterm could be any element of the syntax, and not some limited subset. This meant that many Evaluatable instances had to deal with error conditions that in practice can’t occur. To make this idea more concrete, consider examples showcasing a before and after syntax type transformation:
    -- former system: à la carte syntax
    
    data Op a = Op { ann :: a, left :: Expression a, right :: Expression a }
    -- new system: auto-generated, precisely typed syntax
    
    data Op a = Op { ann :: a, left :: Err (Expression a), right :: Err (Expression a) }

    The shape of a syntax type in our à la carte paradigm has polymorphic children, compared with the monomorphic children of our new “precisely-typed” syntax, which offers better guarantees of what we could expect.

  4. Infeasible time and effort was required. A two-step parsing process required writing two separate language-specific grammars by hand. This was time-consuming, engineering-intensive, error-prone, and tedious. The Assignment grammar used parser combinators in Haskell mirroring the tree-sitter grammar specification, which felt like a lot of duplicated work. For a long time, this work’s mechanical nature begged the question of whether we could automate parts of it. While we’ve open-sourced Semantic, leveraging community support for adding languages has been difficult because, until recently, it was backed by such a grueling process.

Designing a new system

To address challenges, we introduced a few changes:

  1. Add named child nodes. To address the issue of not having named child nodes, we modified the tree-sitter library by adding a new function called field to the grammar API, and then updated every language grammar accordingly. When parsing, you can retrieve a node’s children based on their field name. Here is an example of what a Python if_statement looks like in the old and new tree-sitter grammar APIs:
    Screenshot of a diff highlighting an example of what a Python if_statement looks like in the old and new tree-sitter grammar APIs
  2. Generate a Node Interface File. Once a grammar has this way of associating child references, the parser generation code also produces a node-types.json file that indicates what kinds of children references you can expect for each node type. This JSON file provides static information about nodes’ fields based on the grammar. Using this JSON, applications like ours can use meta-programming to generate specific datatypes for each kind of node. Here is an example of the JSON generated from the grammar definition of an if statement. This file provided a schema for a language’s ASTs and introduced additional improvements, such as the way we specify highlighting.
  3. Auto-generate language-specific syntax datatypes. Using the structure provided by the node-types.json file, we can auto-generate syntax types instead of writing them by hand. First, we deserialize the JSON file to capture the structure we want to represent in the desired shape of datatypes. Specifically, we have four distinct shapes that the nodes in our node-types JSON file take on: sumsproductsnamed leaves, and anonymous leaves. We then use Template Haskell to generate syntax datatypes for each of the language constructs represented by the Node Interface File. This means that our hand-written à la carte syntax types get replaced with auto-generated language-specific types, saving all of the developer time historically spent writing them. Here is an example of an auto-generated datatype representing a Python if statement derived from the JSON snippet provided above, which is structurally a product type.
  4. Build ASTs generically. Once we have an exhaustive set of language-specific datatypes, we need to have a mechanism that can map appropriate auto-generated datatypes onto the ASTs representing the source code being parsed. Historically, this was accomplished by manually writing an Assignment grammar. To obviate the need for a second grammar, we have created an API that uses Haskell’s generic metaprogramming framework to unmarshal tree-sitter’s parse trees automatically. We iterate over tree-sitter‘s parse trees using its tree cursor API and produce Haskell ASTs, where each node is represented by a Template Haskell generated datatype described by the previous step. This allows us to parse a particular set of nodes according to their structure, and return an AST with meta-data (such as range and span information). Here is an example of the AST generated if the Python source code is simply 1
    Screenshot of CodeGen language support pipeline

The final result is a set of language-specific, strongly-typed, TH-generated datatypes represented as the sum of syntax possible at a given point in the grammar. Strongly-typed trees give us the ability to indicate only the subset of the syntax that can occur at a given position. For example, a function’s name would be strongly typed as an identifier; a switch statement would contain case statements; and so on. This provides better guarantees about where syntax can occur, and strong compile-time guarantees about both correctness and completeness.

The new system bypasses a significant part of the engineering effort historically required; it cuts code from our pipeline in addition to addressing the technical limitations described above. The diagram below provides a visual “diff” of the old and new systems.

Diagram showing language support pipeline

A big testament to our approach’s success was that we were able to remove our à la carte syntaxes completely. In addition, we were also able to ship two new languages, Java and CodeQL, using precise ASTs generated by the new system.

Contributions welcome!

To learn more about how you can help, check out our documentation here.

Highlights from Git 2.28

Post Syndicated from Taylor Blau original https://github.blog/2020-07-27-highlights-from-git-2-28/

The open source Git project just released Git 2.28 with features and bug fixes from over 58 contributors, 13 of them new. We last caught up with you on the latest in Git back when 2.26 was released. Here’s a look at some of the most interesting features and changes introduced since then.

Introducing init.defaultBranch

When you initialize a new Git repository from scratch with git init, Git has always created an initial first branch with the name master. In Git 2.28, a new configuration option, init.defaultBranch is being introduced to replace the hard-coded term. (For more background on this change, this statement from the Software Freedom Conservancy is an excellent place to look).

Starting in Git 2.28, git init will instead look to the value of init.defaultBranch when creating the first branch in a new repository. If that value is unset, init.defaultBranch defaults to master. Here, it’s important to note that:

  1. This configuration variable can be set by the user, and overriding the default value is as easy as:
    $ git config --global init.defaultBranch main
    
  2. This configuration variable only affects new repositories, and does not cause branches in existing projects to be renamed. git clone will also continue to respect the HEAD of the repository you’re cloning from, so you won’t see a change in branch names until a maintainer initiates one.

This change supports the many communities, both on GitHub and in the wider Git community, who are considering renaming the default branch name of their repository from master.

To learn more about the complementary changes GitHub is making, see github/renaming. GitLab and Bitbucket are also making similar changes.

[source]

Changed-path Bloom filters

In Git 2.27, the commit-graph file format was extended to store changed-path Bloom filters. What does all of that mean? In a sense, this new information helps Git find points in history that touched a given path much more quickly (for example, git log -- <path>, or git blame). Git 2.28 takes advantage of these optimizations to deliver a handful of sizeable performance improvements.

Before we get into all of that, it’s worth taking a refresher through commit graphs whether you’re new to the concept, or familiar with them. (If you are familiar, and want to take a deeper dive, check out this blog post explaining all of the juicy technical details).
In the very simplest terms, the commit-graph file stores information about commits. In essence, the commit-graph acts like a cache for commonly-accessed information about commits: who their parent(s) are, what their root tree is, and things like that. It also stores computed information, too, like a commit’s generation number, and changed-path Bloom filters (more on that in just a moment).

Why store all of this information? To understand the answer to this, it is helpful to have a cursory understanding of how Git stores objects. Git stores objects in one of two ways: either as a loose object (in which case the object’s contents are stored in a single file unique to that object on disk), or as a packed object (in which case the object is assembled from a compressed format in a *.pack file). No matter which way a commit is stored, we still have to parse and decompress it before its fields like “root tree” and “parents” can be accessed.

With a commit-graph file, all of that information is immediate: for a given commit C, Git knows exactly where to look in a commit-graph file for all of those fields that we store, and can read them off immediately, no decompression or piecing together required. This can shave some time off your usual Git operations by itself, but where the commit-graph really shines is in the computed data it stores.

Generation numbers are a sort of reachability index that can help Git answer questions about things like reachability and topological ordering very quickly. Since generation numbers aren’t new in this release (and trying to explain them quickly would lose a lot of the benefit of a more careful exposition), I’ll refer you instead to this blog post by freshly-minted Hubber Derrick Stolee on the matter.

What’s new in 2.28?

OK, if you’ve made it this far, you’ve got a pretty good handle on what commit graphs are, and what they’re useful for. Now, let’s get to the juicy details. In Git 2.27, the commit-graph file learned how to store changed-path Bloom filters. What are changed-path Bloom filters, you ask? A Bloom filter is a probabilistic set; that is it’s a set of items, but querying that set for the presence of some item x returns either “x is definitely not in this set” or “x might be in this set”, but never “x is definitely in this set”. The commit-graph stores one of these Bloom filters for commits that reside in the commit-graph, and it populates that Bloom filter with a list of paths changed between that commit and its first parent.

These Bloom filters are a huge boon for performance in lots of Git commands. The general pattern is something like: if you have a Git command that computes diffs (which can sometimes be proportionally expensive), then having Bloom filters allows Git to compute far fewer diffs by skipping the computation for certain commits when their Bloom filters return “definitely not” for paths of interest.
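
If Bloom filters are new to you, the following toy Go version (nothing like Git's actual on-disk filters, just the semantics) shows why a negative answer is trustworthy while a positive answer is only a "maybe".

    package main

    import (
        "fmt"
        "hash/fnv"
    )

    // bloom is a toy Bloom filter: adding an item sets k bit positions, and a
    // lookup answers "definitely not present" or "possibly present".
    type bloom struct {
        bits []bool
        k    uint32
    }

    func newBloom(m int, k uint32) *bloom {
        return &bloom{bits: make([]bool, m), k: k}
    }

    // positions derives k probe positions for s from a single 64-bit hash.
    func (b *bloom) positions(s string) []uint32 {
        h := fnv.New64a()
        h.Write([]byte(s))
        sum := h.Sum64()
        h1, h2 := uint32(sum), uint32(sum>>32)
        pos := make([]uint32, b.k)
        for i := uint32(0); i < b.k; i++ {
            pos[i] = (h1 + i*h2) % uint32(len(b.bits))
        }
        return pos
    }

    func (b *bloom) add(s string) {
        for _, p := range b.positions(s) {
            b.bits[p] = true
        }
    }

    func (b *bloom) mayContain(s string) bool {
        for _, p := range b.positions(s) {
            if !b.bits[p] {
                return false // definitely not in the set: safe to skip the diff
            }
        }
        return true // possibly in the set: compute the diff to be sure
    }

    func main() {
        changed := newBloom(256, 3)
        changed.add("src/main.c") // a path some commit touched relative to its first parent
        fmt.Println(changed.mayContain("docs/README.md")) // false: skip this commit
        fmt.Println(changed.mayContain("src/main.c"))     // true: worth computing the diff
    }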

Take git log -- /path/to/file, for example. Prior to Git 2.27, git log would have to compute a diff over every revision in its walk before determining whether or not to show it (i.e., whether or not that diff has any entries for /path/to/file). In Git 2.27 and newer, Git can skip computing many of those diffs altogether by consulting each commit C‘s changed-path Bloom filter and querying it for /path/to/file. Again: if querying returns “definitely not”, then Git knows that computing that diff is strictly uninteresting.

Because computing diffs between commits can be expensive (at least, relative to the complexity of the algorithm for which they are being generated), reducing the number of diffs computed overall can greatly improve performance.

To try this for yourself, you can run the command:

$ git commit-graph write --reachable --changed-paths

This generates a commit-graph file with changed path Bloom filters enabled.[1] You should be able to see performance improvements in commands like git log -- <path>, git log -L, git blame, and anything else that computes first-parent diffs against a given pathspec.

[source, source, source]

Tidbits

Now that we’ve talked about a few of the headlining changes from the past couple of releases, let’s look at a few more new features 🔎

  • Have you ever been looking for the parts of history that changed some path? Maybe you just want to know about the commits that have modified some file, and that can be found easily enough by running git log -- <path>. Sometimes, you might be interested not only in which commits touched <path>, but also in which merge commits brought those commits into the main line of development. Have you ever found those merges difficult to find? You’re not alone. In most cases, Git will skip showing you those kinds of merges with git log -- <path>, since those commits don’t modify <path> by themselves. Now you can bring those merges back into view with Git’s new --show-pulls flag to revision walking commands, like git log and git rev-list. For a particularly informative view, try:
    $ git log --oneline --graph --show-pulls -- <path>
    

    [source]

  • When you run git pull in a repository where you’re tracking a remote branch, one of four things can happen: there might be no changes, changes on the server, changes on the client, or changes on both. As long as there aren’t changes in both directions, resolving the difference is straightforward: when there are no changes at all, there’s nothing to do. When the server is strictly ahead of the client, the client fast-forwards to the state on the server. But when there are changes on both the client and the server, what happens? That depends on whether or not you have the pull.rebase configuration set. If you do, your branch is rebased on top of where you’re pulling from; otherwise, a merge is performed. These merges can clutter your history and be tricky to back out of without starting your pull over from scratch. Git 2.28 now warns you of this case (specifically, when pull.rebase is unset, and you didn’t explicitly specify --[no-]rebase as an argument to git pull).

    [source]

  • Git now includes a GitHub Actions workflow which you can use to run Git’s own integration tests on a variety of platforms and compilers. There’s no extra effort required on your part: if you have a fork of git/git on GitHub, each push will be run through the array of tests necessary to validate your change. But wait: doesn’t Git use a mailing list for development? Yes, it does, but now you can use GitGitGadget on the git/git repository. This means that you can open a pull request, and have GitGitGadget send it to the mailing list on your behalf. So, if you’re more comfortable contributing to Git like that instead of composing emails manually, you can now contribute to Git from start to finish using GitHub.

    [source]

  • On the other hand, if you don’t mind sending an email or two, it’s now much easier to interact with the Git mailing list when you encounter a bug by running git bugreport. Running this new command will open your $EDITOR with a pre-populated form of questions that will be useful in debugging your issue. It also includes some helpful information about your system, like your CPU architecture, what version of Git you’re running, and so on. When you’re done, you can send that file as the body of an email to the Git mailing list, and rest assured that you’ve opened a helpful bug report.

    [source]

  • We’ve talked a number of times about Git’s clean and smudge filters and the corresponding process filter (which simulates multiple clean and smudge filters in a single process). Up until recently, the protocol for these filters has been relatively straightforward: Git supplies one end of the content, and the filter produces the other. In Git 2.27, more information is supplied over the protocol, like metadata about the branch being checked out in the case of git checkout, or the remote that was contacted in the case of git fetch. This new information could be used by tools like Git LFS, for example, to figure out which remote to contact for extra data.

    [source]

  • Last but not least, git status learned some new tricks, too. You might recall from a recent blog post that we talked about how sparse checkouts can shrink the size of your monorepo. Now, git status can remind you when you are in a sparse checkout by telling you what percentage of files you have checked out. For fans of git-prompt.sh, the prompt will now display SPARSE if you are in a sparse checkout, too.

    [source]

The rest of the iceberg

That’s just a sample of changes from the latest couple of releases. For more, check out the release notes for 2.27 and 2.28, or any previous version in the Git repository.

[1]: Note that since Bloom filters are not persisted automatically (that is, you have to pass --changed-paths explicitly on each subsequent write), it is a good idea to disable configuration that automatically generates commit-graphs, like fetch.writeCommitGraph and gc.writeCommitGraph.

Introducing GitHub’s OpenAPI Description

Post Syndicated from Marc-Andre Giroux original https://github.blog/2020-07-27-introducing-githubs-openapi-description/

The GitHub REST API has been through three major revisions since it was first released, only a month after the site was launched. We often receive feedback that our REST API is an inspiration to many for design, and that it’s an industry reference for what an API should look like. Today, we’re excited to announce an improvement to how developers can interact with the API. GitHub has open sourced an OpenAPI description of the REST API.

OpenAPI

The OpenAPI specification is a programming language agnostic standard that lets providers describe the interface of their HTTP APIs. This allows both humans and machines to discover the capabilities of an API without needing to first read documentation or understand the implementation. OpenAPI is a widely adopted industry standard and GitHub is proud to be part of the community and help push the standard forward.

Try it Out

The GitHub OpenAPI description contains more than 600 operations exposed in our API. For visual exploration of the API, you can load the description as a Postman Collection. Programmatically, the description can be used to generate mock servers, test suites, and bindings for languages not supported by Octokit.

The description is provided under two formats. The bundled version is preferred for most use cases as it makes use of OpenAPI components for reuse and readability. For tooling that has poor support for inline references to components, we also provide a fully dereferenced version.
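
As a quick taste of programmatic use, here is a small Go sketch that loads a locally downloaded copy of the bundled description (the file name is assumed) and counts operations per HTTP method.

    package main

    import (
        "encoding/json"
        "fmt"
        "log"
        "os"
        "strings"
    )

    func main() {
        // Path to a locally downloaded copy of the bundled description (name assumed).
        data, err := os.ReadFile("api.github.com.json")
        if err != nil {
            log.Fatal(err)
        }

        var doc struct {
            Paths map[string]map[string]json.RawMessage `json:"paths"`
        }
        if err := json.Unmarshal(data, &doc); err != nil {
            log.Fatal(err)
        }

        // Count operations (path + method pairs) per HTTP verb, ignoring non-verb keys.
        verbs := map[string]bool{"get": true, "post": true, "put": true, "patch": true, "delete": true, "head": true, "options": true}
        counts := map[string]int{}
        for _, item := range doc.Paths {
            for key := range item {
                if verbs[key] {
                    counts[strings.ToUpper(key)]++
                }
            }
        }
        fmt.Println(counts)
    }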

Active Development

The description is currently in beta. Describing a 12-year-old API is no easy task. We’ve built this description using a mix of existing JSON schemas, documented examples, contract testing, and love. We expect to make the description even more complete and accurate as we go forward and as OpenAPI becomes central to our developer experience — internally and externally.

Quarterly releases of the description are available for GitHub Enterprise Server and GitHub Private Instances, with versions like v2.21. More frequent updates to the description will be available for GitHub.com.

How Can I Contribute?

We’re always looking to make our OpenAPI description more complete and accurate as well as making it easier to consume. If you’d like to help contribute to the description, check out our contributing guide. If something is not working for you, please file an Issue on the repository.

Building a complete OpenAPI description for the GitHub API was no easy task and would not have been possible without a great team. Thanks to Gregor Martynus for his initial work on describing the API, the Docs Engineering team for their amazing work around OpenAPI and documentation, Will Roden for his help validating the description with octo-go, as well as the folks at Redoc.ly who helped along the way.

Learn more about our REST API OpenAPI Description

*  The OpenAPI Initiative logo is a trademark of The Linux Foundation

Cloudflare outage on July 17, 2020

Post Syndicated from John Graham-Cumming original https://blog.cloudflare.com/cloudflare-outage-on-july-17-2020/


Today a configuration error in our backbone network caused an outage for Internet properties and Cloudflare services that lasted 27 minutes. We saw traffic drop by about 50% across our network. Because of the architecture of our backbone this outage didn’t affect the entire Cloudflare network and was localized to certain geographies.

The outage occurred because, while working on an unrelated issue with a segment of the backbone from Newark to Chicago, our network engineering team updated the configuration on a router in Atlanta to alleviate congestion. This configuration contained an error that caused all traffic across our backbone to be sent to Atlanta. This quickly overwhelmed the Atlanta router and caused Cloudflare network locations connected to the backbone to fail.

The affected locations were San Jose, Dallas, Seattle, Los Angeles, Chicago, Washington, DC, Richmond, Newark, Atlanta, London, Amsterdam, Frankfurt, Paris, Stockholm, Moscow, St. Petersburg, São Paulo, Curitiba, and Porto Alegre. Other locations continued to operate normally.

For the avoidance of doubt: this was not caused by an attack or breach of any kind.

We are sorry for this outage and have already made a global change to the backbone configuration that will prevent it from being able to occur again.

The Cloudflare Backbone


Cloudflare operates a backbone between many of our data centers around the world. The backbone is a series of private lines between our data centers that we use for faster and more reliable paths between them. These links allow us to carry traffic between different data centers, without going over the public Internet.

We use this, for example, to reach a website origin server sitting in New York, carrying requests over our private backbone from data centers as far away as San Jose, Frankfurt, or São Paulo. This additional option to avoid the public Internet allows a higher quality of service, as the private network can be used to avoid Internet congestion points. With the backbone, we have far greater control over where and how to route Internet requests and traffic than the public Internet provides.

Timeline

All timestamps are UTC.

First, an issue occurred on the backbone link between Newark and Chicago, which led to backbone congestion between Atlanta and Washington, DC.

In responding to that issue, a configuration change was made in Atlanta. That change started the outage at 21:12. Once the outage was understood, the Atlanta router was disabled and traffic began flowing normally again at 21:39.

Shortly after, we saw congestion at one of our core data centers that processes logs and metrics, causing some logs to be dropped. During this period the edge network continued to operate normally.

  • 20:25: Loss of backbone link between EWR and ORD
  • 20:25: Backbone between ATL and IAD is congesting
  • 21:12 to 21:39: ATL attracted traffic from across the backbone
  • 21:39 to 21:47: ATL dropped from the backbone, service restored
  • 21:47 to 22:10: Core congestion caused some logs to drop, edge continues operating
  • 22:10: Full recovery, including logs and metrics

Here’s a view of the impact from Cloudflare’s internal traffic manager tool. The red and orange region at the top shows CPU utilization in Atlanta reaching overload, and the white regions show affected data centers seeing CPU drop to near zero as they were no longer handling traffic. This is the period of the outage.

Other, unaffected data centers show no change in their CPU utilization during the incident. That’s indicated by the fact that the green color does not change during the incident for those data centers.


What happened and what we’re doing about it

As there was backbone congestion in Atlanta, the team had decided to remove some of Atlanta’s backbone traffic. But instead of removing the Atlanta routes from the backbone, a one-line change started leaking all BGP routes into the backbone.

{master}[edit]
atl01# show | compare 
[edit policy-options policy-statement 6-BBONE-OUT term 6-SITE-LOCAL from]
!       inactive: prefix-list 6-SITE-LOCAL { ... }

The complete term looks like this:

from {
    prefix-list 6-SITE-LOCAL;
}
then {
    local-preference 200;
    community add SITE-LOCAL-ROUTE;
    community add ATL01;
    community add NORTH-AMERICA;
    accept;
}

This term sets the local-preference, adds some communities, and accepts the routes that match the prefix-list. Local-preference is a transitive property on iBGP sessions (it will be transferred to the next BGP peer).

The correct change would have been to deactivate the term instead of the prefix-list.

By removing the prefix-list condition, the router was instructed to send all its BGP routes to all other backbone routers, with an increased local-preference of 200. Unfortunately at the time, local routes that the edge routers received from our compute nodes had a local-preference of 100. As the higher local-preference wins, all of the traffic meant for local compute nodes went to Atlanta compute nodes instead.
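
To make the failure mode concrete, here is a toy Python sketch (not Cloudflare’s implementation) of local-preference-based path selection: once the leaked routes arrived with a local-preference of 200, they beat the local routes carrying 100, so every location preferred the path through Atlanta.

from dataclasses import dataclass

@dataclass
class Route:
    prefix: str
    next_hop: str
    local_pref: int

def best_path(candidates):
    # After validity checks, the highest local-preference wins BGP best-path selection.
    return max(candidates, key=lambda route: route.local_pref)

routes = [
    Route("203.0.113.0/24", "local-compute", local_pref=100),   # route from local compute nodes
    Route("203.0.113.0/24", "atl01-backbone", local_pref=200),  # leaked route from Atlanta
]

print(best_path(routes).next_hop)  # prints "atl01-backbone"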

With the routes sent out, Atlanta started attracting traffic from across the backbone.

We are making the following changes:

  • Introduce a maximum-prefix limit on our backbone BGP sessions – this would have shut down the backbone in Atlanta, but our network is built to function properly without a backbone. This change will be deployed on Monday, July 20.
  • Change the BGP local-preference for local server routes. This change will prevent a single location from attracting other locations’ traffic in a similar manner. This change has been deployed following the incident.

Conclusion

We’ve never experienced an outage on our backbone and our team responded quickly to restore service in the affected locations, but this was a very painful period for everyone involved. We are sorry for the disruption to our customers and to all the users who were unable to access Internet properties while the outage was happening.

We’ve already made changes to the backbone configuration to make sure that this cannot happen again, and further changes will resume on Monday.

The journey of deploying Apache Airflow at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/the-journey-of-deploying-apache-airflow-at-Grab

At Grab, we use Apache Airflow to schedule and orchestrate the ingestion and transformation of data, train machine learning models, and copy data between clouds. There are many engineering teams at Grab that use Airflow, each of which originally had their own Airflow instance.

The proliferation of independently managed Airflow instances resulted in inefficient use of resources, where each team ended up solving the same problems of logging, scaling, monitoring, and more. From this morass came the idea of having a single dedicated team to manage all the Airflow instances for anyone in Grab that wants to use Airflow as a scheduling tool.

We designed and implemented an Apache Airflow-based scheduling and orchestration platform that currently runs close to 20 Airflow instances for different teams at Grab. Sounds interesting? What follows is a brief history.

Early days

Circa 2018, we were running a few hundred Directed Acyclic Graphs (DAGs) on one Airflow instance in the Data Engineering team. There was no dedicated team to maintain it, and no Airflow expert in our team. We were struggling to maintain our Airflow instance, which was causing many jobs to fail every day. We were facing issues with library management, scaling, managing and syncing artefacts across all Airflow components, upgrading Airflow versions, deployment, rollbacks, etc.

After a few postmortem reports, we realized that we needed a dedicated team to maintain our Airflow. This was how our Airflow team was born.

In the initial months, we dedicated ourselves to stabilizing our Airflow environment. During this process, we realized that Airflow has a steep learning curve and requires time and effort to understand and maintain properly. We also found that tweaking Airflow configurations required a thorough understanding of Airflow internals.

We felt that for the benefit of everyone at Grab, we should leverage what we learned about Airflow to help other teams at Grab; there was no need for anyone else to go through the same struggles we did. That’s when we started thinking about managing Airflow for other teams.

We talked to the Data Science and Engineering teams who were also running Airflow to schedule their jobs. Almost all the teams were struggling to maintain their Airflow instance. A few teams didn’t have enough technical expertise to maintain their instance. The Data Scientists and Analysts that we spoke to were more than happy to outsource the overhead of Airflow maintenance and wanted to focus more on their Data Science use cases instead.

We started working with one of the Data Science teams and initiated the discussion to create a dockerized Airflow instance and run it on our Kubernetes cluster.

We created the Airflow instance and maintained it for them. Later, we were approached by two more teams to help with their Airflow instances. This was the trigger for us to design and create a platform on which we can efficiently manage Airflow instances for different teams.

Current state

As mentioned, we are currently serving close to 20 Airflow instances for various teams on this platform and leverage Apache Airflow to schedule thousands of daily jobs. Each Airflow instance currently schedules 1k to 60k daily jobs. Also, new teams can quickly try out Airflow without worrying about infrastructure and maintenance overhead. Let’s go through the important aspects of this platform, such as design considerations, architecture, deployment, scalability, dependency management, monitoring and alerting, and more.

Design considerations

The initial step we took towards building our scheduling platform was to define a set of expectations and guidelines around ownership, infrastructure, authentication, common artifacts and CI/CD, to name a few.

These were the considerations we had in mind:

  • Deploy containerized Airflow instances on a Kubernetes cluster to isolate Airflow instances at the team level. They should scale up and scale out according to usage.
  • Each team can have different sets of jobs that require specific dependencies on the Airflow server.
  • Provide common CI/CD templates to build, test, and deploy Airflow instances. These CI/CD templates should be flexible enough to be extended by users and modified according to their use case.
  • Common plugins, operators, hooks, and sensors will be shipped to all Airflow instances. Moreover, each team can have its own plugins, operators, hooks, and sensors.
  • Support LDAP-based authentication as it is natively supported by Apache Airflow. Each team can authenticate to the Airflow UI with their LDAP credentials.
  • Use HashiCorp Vault to store Airflow-specific secrets. Inject these secrets via sidecar in Airflow servers.
  • Use the ELK stack to access all application logs and infrastructure logs.
  • Datadog and PagerDuty will be used for monitoring and alerting.
  • Ingest job statistics such as the total number of jobs scheduled, number of failed jobs, number of successful jobs, active DAGs, etc. into the data lake and make them accessible via Presto.
Architecture diagram

Infrastructure management

Initially, we started deploying Airflow instances on Kubernetes clusters managed via Kubernetes Operations (KOPS). Later, we migrated to Amazon EKS to reduce the overhead of managing the Kubernetes control plane. Each Airflow instance is deployed in its own Kubernetes namespace.

We chose Terraform to manage infrastructure as code. We deployed each Airflow instance using Terraform modules, which include a helm_release Terraform resource on top of our customized Airflow Helm Chart.

Each Airflow instance connects to its own Redis and RDS. RDS is responsible for storing Airflow metadata, and Redis acts as the Celery broker between the Airflow scheduler and Airflow workers.
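
As a rough sketch (not Grab’s actual configuration), wiring one Airflow 1.10.x instance to its own RDS database and Redis broker comes down to a handful of environment variables; the hostnames and credentials below are placeholders.

import os

os.environ.update({
    "AIRFLOW__CORE__EXECUTOR": "CeleryExecutor",
    # RDS (MySQL) stores Airflow metadata: DAG runs, task instances, connections, etc.
    "AIRFLOW__CORE__SQL_ALCHEMY_CONN": "mysql://airflow:***@team-a-rds:3306/airflow",
    # Redis brokers tasks between the Airflow scheduler and the Celery workers.
    "AIRFLOW__CELERY__BROKER_URL": "redis://team-a-redis:6379/0",
    "AIRFLOW__CELERY__RESULT_BACKEND": "db+mysql://airflow:***@team-a-rds:3306/airflow",
})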

HashiCorp Vault stores the secrets required by Airflow instances; these are injected via sidecar into each Airflow component. The ELK stack stores all logs related to Airflow instances and is used for troubleshooting any instance. Datadog, Slack, and PagerDuty are used to send alerts.

Presto is used to access job statistics, such as numbers of scheduled jobs, failed jobs, successful jobs, and active DAGs, to help each team analyze their usage and the stability of their jobs.

Doing things at scale

There are two kinds of scaling we need to talk about:

  • scaling Airflow instances at the resource level to handle different loads
  • scaling in terms of teams served on the platform

To scale Airflow instances, we set resource requests and limits for each Airflow component, allowing any of the components to scale up easily. To scale out Airflow workers, we decided to enable the horizontal pod autoscaler (HPA) using memory and CPU parameters. The cluster autoscaler on EKS helps in scaling the platform to accommodate more teams.

Moreover, we categorized all our Airflow instances into three sizes (small, medium, and large) to efficiently use the resources, based on how many hourly/daily jobs each instance schedules. Each instance size maps to a specific RDS instance type and storage, a Redis instance type with its CPU and memory, and requests/limits for the scheduler, workers, web server, and Flower. There are different Airflow configurations for each instance type to optimize the given resources for the Airflow instance.

Airflow image and version management

The Airflow team builds and releases one common base Docker image for each Airflow version. The base image has Airflow installed with specific versions, as well as common Python packages, plugins, helpers, tests, patches, and so on.

Each team has their customized Docker image on top of the base image. In their customized Docker image, they can update the Python packages and can download other artifacts that they require. Each Airflow instance will be deployed using the team’s customized image.

There are common CI/CD templates provided by the Airflow team to build the customized image, run unit tests, and deploy Airflow instances from their GitLab pipeline.

To upgrade the Airflow version, the Airflow team reviews and studies the changelog of the released Airflow version, notes down the important features and their impact, open issues, bugs, and workable solutions. Later, we build and release the base Docker image using the new Airflow version.

We support only one Airflow version for all Airflow instances to keep maintenance overhead low. In the case of minor or major version changes, we support both the old and new versions until the retirement period.

How do we deploy

There is a deployment ownership guideline that explains the schedule of deployments and the corresponding persons in charge (PICs). All teams have agreed on this guideline and share the responsibility with the Airflow team.

There are two kinds of deployment:

  • DAG deployment: This is part of the common GitLab CI/CD template. The Airflow team doesn’t trigger the DAG deployment, it’s fully owned by the teams.
  • Airflow instance deployment: The Airflow instance deployment is required in these scenarios:
    1. update in base Docker image
    2. add/update in Python packages by any team
    3. customization in the base image by any team
    4. change in Airflow configurations
    5. change in the resource of scheduler, worker, web server or flower

Base Docker image update

The Airflow team maintains the base Docker image on the AWS Elastic Container Registry. The GitLab CI/CD builds the updated base image whenever the Airflow team changes the base image. The base image is validated by automated deployment on the test environment and automated smoke test. The Airflow instance owner of each team needs to trigger their build and deployment pipeline to apply the base image changes on their Airflow instance.

Python package additions or updates

Each team can add or update their Python dependencies. The GitLab CI/CD pipeline builds a new image with the updated changes. The Airflow instance owner manually triggers the deployment from their CI/CD pipeline. There is also a flag to make the deployment automatic.

Base image customization

Each team can add customizations on top of the base image. Similar to the above scenario, the GitLab CI/CD pipeline builds a new image with the updated changes. The Airflow instance owner manually triggers the deployment from their CI/CD pipeline. To automate the deployment, a flag is available here as well.

Airflow configuration and component resource changes

To optimize the Airflow instances, the Airflow Team makes changes to the Airflow configurations and resources of any of the Airflow components. The Airflow configurations and resources are also part of the Terraform code. Atlantis (https://www.runatlantis.io/) deploys the Airflow instances with Terraform changes.

There is no downtime in any form of deployment, and deployments don’t impact running tasks or the Airflow UI.

Testing

During the process of making our first Airflow instance stable, we started exploring testing in Airflow. We wanted to validate the correctness of DAGs: checking for duplicate DAG IDs, typos, cycles in DAGs, and so on. We later wrote the tests ourselves and published a detailed blog post in several channels: usejournal (part 1) and medium (part 2).

These tests are available in the base image and run in the GitLab pipeline from the user’s repository to validate their DAGs. The unit tests run using the common GitLab CI/CD template provided by the Airflow team.
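
The sketch below shows the kind of checks such tests can perform, using Airflow’s DagBag; it is a minimal illustration rather than the tests published in the posts above, and the dags/ folder path is a placeholder.

import glob
from collections import Counter

from airflow.models import DagBag

DAG_FOLDER = "dags"  # placeholder path to the team's DAG files

def test_no_import_errors():
    # Surfaces syntax errors, missing imports, and cycles detected at load time.
    dag_bag = DagBag(dag_folder=DAG_FOLDER, include_examples=False)
    assert not dag_bag.import_errors, f"DAG import errors: {dag_bag.import_errors}"

def test_no_duplicate_dag_ids():
    # Parse each file individually, since DagBag de-duplicates DAG IDs on load.
    dag_bag = DagBag(dag_folder=DAG_FOLDER, include_examples=False)
    dag_ids = []
    for path in glob.glob(f"{DAG_FOLDER}/**/*.py", recursive=True):
        dags = dag_bag.process_file(path, only_if_updated=False) or []
        dag_ids.extend(dag.dag_id for dag in dags)
    duplicates = [dag_id for dag_id, count in Counter(dag_ids).items() if count > 1]
    assert not duplicates, f"Duplicate DAG IDs: {duplicates}"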

Monitoring & alerting

Our scheduling platform runs the Airflow instance for many critical jobs scheduled by each team. It’s important for us to monitor all Airflow instances and alert respective stakeholders in case of any failure.

We use Datadog for monitoring and alerting. Creating a common Datadog dashboard requires passing tags with metrics from Airflow, but up to Airflow 1.10.x, tagging Datadog metrics isn’t supported.

We contributed a change to the community to enable Datadog tagging support, and it will be released in Airflow 2.0.0 (https://github.com/apache/Airflow/pull/7376). We applied this pull request as an internal patch and created the common Datadog dashboard.

There are three categories of metrics that we are interested in:

  • EKS cluster metrics: It includes total In-Service Nodes, allocated CPU cores, allocated Memory, Node status, CPU/Memory request vs limit, Node disk and Memory pressure, Rx-Tx packets dropped/errors, etc.
  • Host Metrics: These metrics are for each host participating in the EKS cluster. It includes Host CPU/Memory utilization, Host free memory, System disk, and EBS IOPS, etc.
  • Airflow instance metrics: These metrics are for each Airflow instance. It includes scheduler heartbeats, DagBag size, DAG processing import errors, DAG processing time, open/used slots in a pool, each pod’s Memory/CPU usage, CPU and Memory utilization of metadata DB, database connections as well as the number of workers, active/paused DAGs, successful/failed/queued/running tasks, etc.
Sample Datadog dashboard

We alert respective stakeholders and oncalls using Slack and PagerDuty.

Benefits

These are the benefits of having our own Scheduling Platform:

  • Scaling: HPA on Airflow workers running on EKS, combined with the cluster autoscaler, lets workers scale automatically to a theoretically infinite scale. This enables teams to run thousands of DAGs.
  • Logging: Centralized logging using Kibana.
  • Better Isolation: Separate Docker images for each team provide better isolation.
  • Better Customization: All teams are provided with a mechanism to customize their Airflow worker environment according to their requirements.
  • Zero Downtime: Rolling upgrades and the termination period on Airflow workers help achieve zero downtime during deployments.
  • Efficient usage of infrastructure: Each team doesn’t need to allocate infrastructure for Airflow instances. All Airflow instances are deployed on one shared EKS cluster.
  • Less maintenance overhead for users: Users can focus on their core work and don’t need to spend time maintaining Airflow instances and their resources.
  • Common plugins and helpers: All common plugins and helpers are available on every Airflow instance, so each team doesn’t need to add them individually.

Conclusion

Designing and implementing our own scheduling platform started with many challenges and unknowns. We were not sure about the scale we were aiming for, the heterogeneous workload from each team, or the level of triviality or complexity we were going to face. After two years, we have successfully built and productionized a scalable scheduling platform that helps teams at Grab schedule their workloads.

We have many failure stories, odd things we ran into, and hacks and workarounds we patched. But we got through it all and now provide a cost-effective and scalable scheduling platform with low maintenance overhead to all teams at Grab.

What’s Next

Moving ahead, we will be exploring the following capabilities:

  • REST APIs to enable teams to access their Airflow instance programmatically and have better integration with other tools and frameworks.
  • Support of dynamic DAGs at scale to help in decreasing the DAG maintenance overhead.
  • Template-based engine to act as a middle layer between the scheduling platform and external systems. It will have a set of templates to generate DAGs, which helps with better integration with external systems.

We suggest that anyone who is running multiple Airflow instances across different teams look at this approach and consider building a centralized scheduling platform. Before you begin, review the feasibility of building the centralized platform, as it requires a vision, a lot of effort, and cross-communication with many teams.


Authored by Chandulal Kavar on behalf of the Airflow team at Grab – Charles Martinot, Vinson Lee, Akash Sihag, Piyush Gupta, Pramiti Goel, Dewin Goh, QuiHieu Nguyen, James Anh-Tu Nguyen, and the Data Engineering Team.


Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving Southeast Asia forward, apply to join our team today.

Introducing the GitHub Availability Report

Post Syndicated from Keith Ballinger original https://github.blog/2020-07-08-introducing-the-github-availability-report/

What is the Availability Report?

Historically, GitHub has published post-incident reviews for major incidents that impact service availability. Whether we’re sharing new investments in infrastructure or detailing site downtimes, our belief is that we can collectively grow as an industry by learning from one another. This month, we’re excited to introduce the GitHub Availability Report.

What can you expect?

On the first Wednesday of each month, we’ll publish a report describing GitHub’s availability, including a description of any incidents that may have occurred and update you on how we are evolving our engineering systems and practices in response. You should expect these updates to include a summary of what happened, as well as a technical explanation for incidents where we believe the occurrence was novel and contains information that helps engineers around the world learn how to improve product operations at scale.

Why are we doing this?

Availability and performance are core features, and that includes how GitHub responds to service disruptions. We strive to engineer systems that are highly available and fault-tolerant, and we expect that most of these monthly updates will recap periods of time where GitHub was >99% available. When things don’t go as planned, rather than waiting to share information about particularly interesting incidents, we want to describe all of the events that may impact you. Our hope is that by increasing our transparency and sharing what we’ve learned, rather than simply reporting minutes of downtime on a status page, everyone can learn from our experiences. At GitHub, we take the trust you place in us very seriously, and we hope this is a way for you to help hold us accountable for continuously improving our operational excellence as well as our product functionality.

Availability Report for May and June

In May and June, we experienced four distinct incidents resulting in a lack of availability or degraded service for GitHub.com.

May 5 00:45 UTC (lasting for two hours and 24 minutes)

During the incident, a shared database table’s auto-incrementing ID column exceeded the size that can be represented by the MySQL Integer type (Rails int(11)): 2147483647. When we attempted to insert larger integers into the column, the database rejected the value and Rails raised an ActiveModel::RangeError, which resulted in 500s from our API endpoint.

This impacted GitHub apps that rely on getting installation tokens. The top affected GitHub apps internally included Actions, Pages, and Dependabot.

GitHub’s monitoring systems currently alert when tables hit 70% of the primary key size used. We are now extending our test frameworks to include a linter for int/bigint foreign key mismatches.
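
As an illustration of what such an alert can look at (this is a hedged sketch, not GitHub’s monitoring code), MySQL’s information_schema exposes the next auto-increment value per table, which can be compared against the maximum signed INT; real monitoring would also need to inspect each column’s actual type, since bigint columns have far more headroom.

import pymysql  # third-party MySQL client, assumed to be available

INT_MAX = 2_147_483_647  # largest value a signed MySQL INT (int(11)) can hold
THRESHOLD = 0.70

QUERY = """
    SELECT table_schema, table_name, auto_increment
    FROM information_schema.tables
    WHERE auto_increment IS NOT NULL
"""

connection = pymysql.connect(host="db-host", user="monitor", password="***")
with connection.cursor() as cursor:
    cursor.execute(QUERY)
    for schema, table, next_id in cursor.fetchall():
        usage = next_id / INT_MAX
        if usage >= THRESHOLD:
            print(f"WARNING: {schema}.{table} has used {usage:.0%} of the INT key space")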

May 22 16:41 UTC (lasting for five hours and nine minutes)

During a planned maintenance operation (failing over a MySQL primary instance) we experienced a novel crash in the mysqld process on the newly promoted MySQL primary server. To mitigate the impact of the crash, we manually redirected traffic back to the original primary. However, the crashed MySQL primary had already served approximately six seconds of write traffic. At this point, a restore of replicas from the new primary was initiated which took approximately four hours with a further hour for cluster reconfiguration to re-enable full read capacity. For a period of approximately five hours, users may have observed delays before data written to the affected database cluster were visible in the web interface and API.

We’ve run multiple internal gameday exercises in response to ensure a higher degree of preparedness for similar topology inconsistencies and will continue to exercise our automated failover systems to reduce recovery time.

June 19 08:52 UTC (lasting for 51 minutes)

Changes to better instrument A/B experimentation for UI improvements introduced an unknown dependency on the presence of a specific, dynamically generated file that is served by a separate application.

During an application deployment, the file failed to be generated on a significant proportion of the application deployments due to a high retrieval rate being rate limited by the upstream application. This resulted in site-wide application errors for a percentage of users enrolled in the experiment. Upon detection, we were able to disable the requirement on this file which restored service to all users.

Going forward, configuration for A/B and multivariate experiments will be cached internally to ensure successful propagation of dependencies.

June 29 12:03 UTC (lasting for two hours and 29 minutes)

As part of maintenance, the database team rolled out an updated version of ProxySQL on Monday, June 22. A week later, the primary MySQL node on one of our main database clusters failed and was replaced automatically by a new host. Within seconds, the newly promoted primary crashed. Orchestrator’s anti-flapping mechanism prevented a subsequent automatic failover. After we recovered service manually, the new primary became CPU starved and crashed again. A new primary was promoted which also crashed shortly thereafter. To recover, we rolled back to the previous version of ProxySQL and disabled a change in our application that had required the new ProxySQL version. When this completed, we were able to allow writes on the primary node without it crashing.

We are analyzing application logs, MySQL core dumps, and our internal telemetry as part of continued investigation into the CPU starvation issue to avoid similar failure modes going forward.

In summary

As an organization we continue to invest heavily in reliability. We treat each incident discussed here as an invaluable opportunity from which to learn and grow. Our systems and processes continue to evolve based on these learnings and we look forward to sharing our progress in future updates.

Please follow our status page for real time updates and watch our blog for next month’s availability report.

How we launched docs.github.com

Post Syndicated from Sarah Schneider original https://github.blog/2020-07-02-how-we-launched-docs-github-com/

ICYMI: docs.github.com is the new place to discover all of GitHub’s product documentation!

We recently completed a major overhaul of GitHub’s documentation websites. When you visit docs.github.com today, you’ll see content from the former help.github.com and developer.github.com sites in a unified experience.

Our engineering goals were two-fold: 1) improve the reading and navigation experience for GitHub’s users; 2) improve the suite of tools that GitHub’s writers use to create and publish documentation.

Combining the content was the last of several complex projects we completed to reach these goals. Here’s the story behind this years-long effort, undertaken in collaboration with GitHub’s Product Documentation team and many other contributors.

 A brief history of the docs sites

Providing separate docs sites for different audiences was the right choice for us for many years. But our plans evolved along with GitHub’s products. Over time, we aspired to help an international audience use GitHub by:

  • Offering multi-language support for all content
  • Scaling docs for new products
  • Autogenerating API reference docs
  • Providing interactive experiences
  • Allowing anyone to easily contribute documentation

We couldn’t do these things when we had two static sites, each with its own codebase, its own way of organizing content, and its own markup conventions. Efforts were made to streamline the tooling over the years, but they were limited by the nature of static builds.

To achieve our goals, we determined we needed to write a custom dynamic backend, and eventually, combine the content.

Only fixing what was broken

Our docs sites were previously hosted on GitHub Pages using Jekyll practices: a content directory full of Markdown files and a data directory full of YAML files. This is a great setup for simple sites, and it worked for us for a long time. Although we outgrew Jekyll tooling, the writing conventions based in Markdown and YAML worked well. So we kept them, and we built the dynamic site around them.

Keeping these conventions let us alleviate pain points in the tooling without introducing a new paradigm for technical writing and asking writers to learn it. It also meant that writers could continue publishing content that helps people use GitHub in the old system while we built the new one.

What was broken?

We outgrew static site tooling for a number of reasons. A big factor was the complexity of versioning content for our GitHub Enterprise Server product.

We release a new version of GitHub Enterprise Server every three months, and we support docs for each version for one year before we deprecate them. At any time, we provide docs for four versions of GitHub Enterprise Server.

We handle this complexity by single-sourcing our docs. This means we provide multiple versions of each article, and a dropdown on the site lets users switch between versions. Here’s how it looked in the old help.github.com:

Screenshot showing versioned articles

Versioning can be hard. Some articles are available in all versions. Some are GitHub.com-only. Some are Enterprise Server-only. Some have lots of internal conditionals, where a single paragraph or even a word may be relevant in some versions but not others. We also have workflows and tools for releasing new versions and deprecating old versions.

What does single-sourcing look like under the hood? Writers use the Liquid templating language (another Jekyll holdover) to render version-specific content using if / else statements:

{% if page.version == 'dotcom' or page.version ver_gt '2.20' %}
  Content relevant to new versions
{% else %}
  Content relevant to old versions
{% endif %}

Statements like this are all over the content and data files.

Static site generators are designed to do one build. They don’t build multiple versions of pages. To support our single-source approach in the Jekyll days, we had to create a backport process, in which writers would build Enterprise Server versions separately from building GitHub.com docs. Backport pull requests had to be reviewed, deployed to staging, and published as a separate process. Over the years, as we released new Enterprise Server versions, the tooling started to fray around the edges. Backports took a long time to build, did weird things, or got forgotten entirely. Ultimately, backports became a liability.

Launching a dynamic help.github.com

When we set out to create a new dynamic site, we started with help.github.com. We built it over six months and carefully coordinated with the writers to swap out the backend, while mostly leaving the content alone. In February 2019, we launched the new Node.js site backed by Express. On the frontend, there is just vanilla JavaScript and CSS using Primer.

It was a big improvement:

  • No more build
    • The app loads metadata for all pages at server startup, but the contents are rendered dynamically at page load.
    • Shaved ~10 minutes off deploy times.
  • Dynamic version rendering
    • No more backports! Enterprise Server content is loaded at the same time as everything else.
  • Fastly CDN
    • Serves as a global edge cache to keep things fast.
  • Better search using Algolia
  • Less chatops, more GitHub flow
    • Staging and production deployments happen automatically.

Internationalized docs

Within a few months, the dynamic backend allowed us to reach our next major milestone: internationalization. We launched the Japanese and simplified Chinese versions of the site in June 2019 and added support for Spanish and Portuguese by the end of the year. (Look for a deep dive post into the internationalization process coming soon!)

This was progress. But developer.github.com was still running on the old static build, and parts of it were starting to break down. We needed to bring the developer content into the new codebase.

Supporting multiple products

First, we needed to more robustly support the idea of products.

When we originally launched the new help site, the homepage did allow users to choose a product:

Old help site

But the content for these products was organized in wildly different ways. For example, a directory called content/dotcom/articles contained nearly a thousand Markdown files with no hierarchy. URLs looked like help.github.com/articles/<article>, with no indication of which product they belonged to. Writers and contributors had a hard time navigating the repository. It all “worked,” but it wouldn’t scale.

So we created a new product-centric structure that would be consistent across the site: content/<product>/<category>/<article>, with URLs that matched. To support this change, we developed a new TOC system and refactored product handling on the backend. Once again, we coordinated the changes with writers who were still actively writing, and once again, we left the core Jekyll conventions untouched. We also added support for redirects from the legacy article URLs to the new product-based ones.

In 2019, we released GitHub Actions as the first new product on the help site.

With a more scalable content organization in place, we were ready to start thinking about how to get developer content into the codebase.

Autogenerating API documentation

Historically, developer.github.com hosted documentation for integrators, including docs for GitHub’s two APIs: REST and GraphQL.

REST docs via openAPI

From the time GitHub’s REST docs were first released roughly a decade ago, they were handwritten and updated by humans using unstructured formats. This workflow was sustainable at first, but as the API grew, it became a big drain on writers’ time. Updating REST input and response parameters manually was hard enough, but versioning them for GitHub.com and Enterprise Server was almost impossibly complex. Readers had long asked for code samples and other standard features of API documentation we were unable to provide. We’d dreamed of autogenerating REST docs from a structured schema, but this seemed unattainable for years.

With the new codebase in place, the door was open for us to think about autogeneration for real. And we happened upon some lucky timing.

Octokit maintainer Gregor Martynus had already started the process of generating an OpenAPI schema that described how GitHub’s API works. This schema happened to be exactly what we needed. Rather than reinventing the wheel, we invested in that existing schema effort and enlisted the services of Redoc.ly, a small firm that specializes in OpenAPI schema design and implementation. We worked with them to get the work-in-progress OpenAPI over the finish line and ready for production use, and created a pipeline to consume and render the docs from OpenAPI.

Check out the new REST docs: http://docs.github.com/rest/reference

GraphQL docs

GraphQL is a different story from REST. Since GitHub first released its GraphQL API in 2017, we’ve had a pipeline for autogenerating docs from a schema using https://github.com/gjtorikian/graphql-docs.

This tooling worked well, but it was written in Ruby, and with the new Node.js backend, we needed something more JavaScript-friendly. We looked for existing JavaScript GraphQL docs generators but didn’t find any that fit our specific needs. So we rolled our own.

We wrote a script that takes a GraphQL schema as input, does some sanitization, and outputs JSON files containing only the data needed for rendering documentation. Our HTML files loop over that JSON data and render it on page load.
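
For a sense of what that script does, here is a minimal sketch in Python using the graphql-core package (the production script described above is JavaScript, and the file names here are placeholders): it loads a schema, skips introspection types, and writes a JSON file of type and field documentation.

import json

from graphql import build_schema  # graphql-core

with open("schema.graphql") as schema_file:
    schema = build_schema(schema_file.read())

docs = []
for name, gql_type in sorted(schema.type_map.items()):
    if name.startswith("__"):
        continue  # skip GraphQL introspection types
    entry = {"name": name, "description": gql_type.description}
    fields = getattr(gql_type, "fields", None)
    if fields:
        entry["fields"] = [
            {"name": field_name, "type": str(field.type), "description": field.description}
            for field_name, field in fields.items()
        ]
    docs.append(entry)

with open("graphql-docs.json", "w") as out_file:
    json.dump(docs, out_file, indent=2)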

The script runs via a scheduled GitHub Actions workflow and automatically opens and merges PRs with the updates. This means writers never have to touch GraphQL documentation; it publishes itself.

Check out the new GraphQL docs: http://docs.github.com/graphql/reference

Scripting the content migration

In addition to API docs, developer.github.com contained content about GitHub and OAuth apps, GitHub Marketplace, and webhooks. The majority of this content is vanilla Markdown, so the project requirements were (finally) straightforward: import the files, process them, run tests. We wrote scripts to do this in a repeatable process.

The content strategist on the docs team created a comprehensive spreadsheet that mapped all the old developer content to their new product-based locations, with titles and intros for each. This formed the basis of our scripted efforts. We ran the scripts several times, doing reviews and making changes each time, before unveiling the final documentation.

Check out the new docs: http://docs.github.com/developers

Redirecting all the things

Getting a 404 on a documentation site is an abrupt end to a learning experience. GitHub’s docs team makes a commitment to prevent 404s when content files get renamed or moved around. There are a number of ways we support 20,000+ redirects in the codebase, and they can get complex.

For example, if you go to an Enterprise URL without a version, such as https://docs.github.com/enterprise, the site redirects you to the latest version by injecting the number in the URL. But we have to be careful – what if the URL happens to include enterprise somewhere but is not a real Enterprise Server path? No number should be injected in those cases.

We also redirect any URL without a language code to use the /en prefix. And we have special link rewriting under the hood to make sure that if you are on a Japanese page, all links to other GitHub articles take you to /ja versions of those pages instead of /en versions.
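
Here is a toy Python sketch of those two normalization rules (the real middleware is part of the Node.js application, and the language list and version number below are placeholders): add a default language prefix when one is missing, and inject the latest Enterprise Server version only when enterprise is actually the product segment.

import re

DEFAULT_LANGUAGE = "en"
LATEST_ENTERPRISE_VERSION = "2.28"  # placeholder for whatever version is current

def normalize(path: str) -> str:
    # Rule 1: paths without a language code get the default /en prefix.
    if not re.match(r"^/(en|ja|cn|es|pt)(/|$)", path):
        path = f"/{DEFAULT_LANGUAGE}{path}"
    # Rule 2: inject the latest version after /enterprise, but only when it is
    # the product segment and no version number already follows it.
    path = re.sub(
        r"^(/[a-z]{2})/enterprise(?=/|$)(?!/\d)",
        rf"\1/enterprise/{LATEST_ENTERPRISE_VERSION}",
        path,
    )
    return path

print(normalize("/enterprise/admin"))          # /en/enterprise/2.28/admin
print(normalize("/en/enterprise/2.22/admin"))  # unchanged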

Soon we will enable blanket redirects to point most https://developer.github.com links to https://docs.github.com. That step won’t be too hard, but in the course of the migration, we changed the names and locations of much of the developer content. For example, /v3 became /rest/reference, /apps became /developers/apps, and so on. To support all these redirects, we worked from a list of the top few hundred developer.github.com URLs from Google Analytics to whittle down dead links, path by path.

These redirects will help GitHub’s users arrive at the content they want when they navigate docs.github.com or follow legacy bookmarks or links.

What’s next

This is an exciting time for content at GitHub. With the foundation we built for docs.github.com, we can’t wait to continue improving the experience for people creating and using GitHub’s content. Keep an eye out for more behind-the-scenes posts about docs.github.com!

Netflix Studio Engineering Overview

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/netflix-studio-engineering-overview-ed60afcfa0ce

By Steve Urban, Sridhar Seetharaman, Shilpa Motukuri, Tom Mack, Erik Strauss, Hema Kannan, CJ Barker

Netflix is revolutionizing the way a modern studio operates. Our mission in Studio Engineering is to build a unified, global, and digital studio that powers the effective production of amazing content.

Netflix produces some of the world’s most beloved and award-winning films and series, including The Irishman, The Crown, La Casa de Papel, Ozark, and Tiger King. In an effort to effectively and efficiently produce this content we are looking to improve and automate many areas of the production process. We combine our entertainment knowledge and our technical expertise to provide innovative technical solutions from the initial pitch of an idea to the moment our members hit play.

Why Does Studio Engineering Exist?

We enable Netflix to build a unified, global and digital studio that powers the effective production of amazing content.
Studio Engineering’s ‘Why’

The journey of a Netflix Original title from the moment it first comes to us as a pitch, to that press of the play button is incredibly complex. Producing great content requires a significant amount of coordination and collaboration from Netflix employees and external vendors across the various production phases. This process starts before the deal has been struck and continues all the way through launch on the service, involving people representing finance, scheduling, human resources, facilities, asset delivery, and many other business functions. In this overview, we will shed light on the complexity and magnitude of this journey and update this post with links to deeper technical blogs over time.

Content Lifecycle: Pitch, Development, Production, On-Service
Pitch-to-Play

Mission at a Glance

  • Creative pitch: Combine the best of machine learning and human intuition to help Netflix understand how a proposed title compares to other titles, estimate how many subscribers will enjoy it, and decide whether or not to produce it.
  • Business negotiations: Empower the Netflix Legal team with data to help with deal negotiations and acquisition of rights to produce and stream the content.
  • Pre-Production: Provide solutions to plan for resource needs, and discovery of people and vendors to continue expanding the scale of our productions. Any given production requires the collaboration of hundreds of people with varying expertise, so finding exactly the right people and vendors for each job is essential.
  • Production: Enable content creation from script to screen that optimizes the production process for efficiency and transparency. Free up creative resources to focus on what’s important: producing amazing and entertaining content.
  • Post-Production: Help our creative partners collaborate to refine content into their final vision with digital content logistics and orchestration.

What’s Next?

Studio Engineering will be publishing a series of articles providing business and technical insights as we further explore the details behind the journey from pitch to play. Stay tuned as we expand on each stage of the content lifecycle over the coming months!



Netflix Studio Engineering Overview was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.