Tag Archives: Engineering

February service disruptions post-incident analysis

Post Syndicated from Keith Ballinger original https://github.blog/2020-03-26-february-service-disruptions-post-incident-analysis/

In late February, GitHub experienced multiple service interruptions that resulted in degraded service for a total of eight hours and 14 minutes over four distinct events. Unexpected variations in database load, coupled with an unintended configuration issue introduced as a part of ongoing scaling improvements, led to resource contention in our mysql1 database cluster.

We sincerely apologize for any negative impact these interruptions may have caused. We know you place trust in GitHub for your most important projects—and we make it our top priority to maintain that trust with a platform that is highly durable and available. We’re committed to applying what we’ve learned from these incidents to avoid disruptions moving forward.

Background

Originally, all our MySQL data lived in a single database cluster. Over time, as mysql1 grew larger and busier, we split functionally grouped sets of tables into new clusters and created new clusters for new features. However, much of our core dataset still resides within that original cluster.

We’re constantly scaling our databases to handle the additional load driven by new users and new products. In this case, an unexpected variance in database load contributed to cluster degradations and unavailability.

Timeline

2020, February 19 15:17 UTC (lasting for 52 minutes)

At this time, an unexpectedly resource-intensive query began running against our mysql1 database cluster. The intent was to run this load against our read replica pool at a much lower frequency, but we inadvertently sent this traffic to the master of the cluster, increasing the pressure on that host beyond surplus capacity. This pressure overloaded ProxySQL, which is responsible for connection pooling, resulting in an inability to consistently perform queries.

2020, February 20 21:31 UTC (lasting for 47 minutes)

Two days later, as part of a planned master database promotion, we saw unexpectedly high load which triggered a similar ProxySQL failure. The intent of this maintenance was to give our teams visibility into issues they might experience when a master is momentarily read-only.

After an initial load spike, we were able to use the same remediation steps as in the previous incident to restore the system to a working state. We suspended further maintenance events of this type as we investigated how this system failed.

2020, February 25 16:36 UTC (lasting for two hours 12 minutes)

In this third incident involving ProxySQL, active database connections crossed a critical threshold that changed the behavior of this new infrastructure. Because connections remained above the critical threshold after remediation, the system fell back into a degraded state. During this time, GitHub.com services were affected by stalled writes on our mysql1 database cluster.

As a result of our previous investigation, we understood that file descriptor limits on ProxySQL nodes were capped at levels significantly lower than intended and insufficient to maintain throughput at high load levels. Specifically, because of a system-level limit of 1048576, our process manager silently reduced our LimitNOFILE setting from 1073741824 to 65536. During remediation, we also encountered a race condition between our process manager and service configurations which slowed our ability to change our file limit to 1048576.

2020, February 27 14:31 UTC (lasting for four hours 23 minutes)

Application logic changes to database query patterns rapidly increased load on the master of our mysql1 database cluster. This spike slowed down the cluster enough to affect availability for all dependent services.

What we learned

Observability and performance improvements

We’ve used these events to identify operational readiness and observability improvements around how we need to operate ProxySQL. We have made changes to allow us to more quickly detect and address issues like this in the future. Remediating these issues were straightforward once we tracked down interactions between systems. It’s clear to us that we need better system integration and performance testing at realistic load levels in some areas before fully deploying to production.

We’re also devoting more energy to understanding the performance characteristics of ProxySQL at scale and the trickle-down effect on other services before it affects our users.

Optimizing for stability

We decided to freeze production deployments for three days to address short-term hotspotting as a result of the final incident on February 27. This helped us stabilize GitHub.com. Our increased confidence in service reliability gave us the breathing room to plan next steps thoughtfully and also helped identify long-term investments we can make to mitigate underlying scalability issues.

Next steps

Immediate changes: Data partitioning

We shipped a sizable chunk of data partitioning efforts we’ve worked on for the past six months just days after these incidents for one of our more significant MySQL table domains, the “abilities” table. Given that every authentication request to GitHub uses this table domain in some way, we had to be very deliberate about how we performed this work to fulfill our requirement of zero downtime and minimal user-impact. To accomplish this, we:

  1. Removed all JOIN queries between the abilities table and any other table in the database
  2. Built a new cluster to move all of the table data into
  3. Copied all data to the new cluster using Vitess‘s vertical replication feature, keeping the copied data up-to-date in real-time
  4. Moved all reads to the new cluster
  5. Moved all writes to the new cluster using Vitess’ proxy layer vtgate

Steps one and two took months’ worth of effort, while steps three through five were completed in four hours.

These changes reduced load on the mysql1 cluster master by 20 percent, and queries per second by 15 percent. The following graph shows a snapshot of query performance on March 2nd, prior to partitioning, and on March 3rd, after partitioning. Queries per second peaked at 109.9k/second immediately prior to partitioning and decreased to a peak of 89.19k queries/second after.

Query performance graph showing decrease in queries per second on March 2-3.

Additional technical and organizational initiatives

  • Audit and lower reads from leader databases: We’re auditing reads from master databases with the intent to lower them. We put unnecessary load on our master databases by reading from them when replicas have the same data. This can exacerbate capacity issues and service reliability by adding load to the most critical databases at GitHub. By auditing and changing these reads to replica databases, we increase headroom in our SQL databases and reduce the risk of them being overloaded as production load varies.
  • Widen our usage of feature flags: We’re requiring feature flags for code updates, which allows us to disable problematic code dynamically, and will dramatically speed up recovery for active incidents.
  • Complete in-flight functional partitioning: We’re completing our in-flight functional partitioning of our mysql1 database cluster. We’ve been working on moving tables that comprise functional domains out of the cluster over the last year. We’re very close to finishing that work for a significant number of tables and schema domains. Shipping these improvements will immediately reduce writes to the cluster by 60 percent and storage requirements by 70 percent. Expect to see another post in the coming months that details the performance benefits from this work.
  • Refine our dashboards: We’re improving our visibility into the effects of deploys on our mysql1 database cluster. Our current deployment dashboard can be noisy, making it difficult to determine deploy safety. Interviews with engineers involved in these incidents noted the cognitive load required to process noisy dashboards. We predict refined dashboards will allow us to spot problems with deploys earlier in the process.
  • Invest in additional data partitioning opportunities: We’ve identified an additional 12 schema domains that we’ll split out from the cluster to further protect us from request and storage constraints.
  • Shard to scale: We’ll begin sharding our largest schema set. Functional partitioning buys us time, but it’s not a sufficient solution to our scaling needs. We plan to use sharding (tenant partitioning) to move us from vertical scaling to horizontal scaling, enabling us to expand capacity much more easily as we grow.

In summary

We know you depend on our platform to be reliable. With the immediate changes we’ve made and long-term plans in progress, we’ll continue to use what we’ve learned to make GitHub better every day.

The post February service disruptions post-incident analysis appeared first on The GitHub Blog.

Six years of the GitHub Security Bug Bounty program

Post Syndicated from Brian Anglin original https://github.blog/2020-03-25-six-years-of-the-github-security-bug-bounty-program/

Last month GitHub reached some big milestones for our Security Bug Bounty program. As of February 2020, it’s been six years since we started accepting submissions. Over the years we’ve been able to invest in the bug bounty community through live events, private bug bounties, feature previews, and of course through cash bounties.

We’re excited to announce that we recently passed $1,000,000 in total payments to researchers since we moved our program to HackerOne in 2016. We paid out over half of our total awards in the last year alone, reaching almost $590,000 in total bounty rewards across our programs. We’ve also been able to provide quick response times to an increasingly large amount of submissions—maintaining an average response time of 17 hours. This is all while seeing a 40 percent increase in submissions since last year. We’re sharing some highlights from the past year along with our upcoming plans for the future.

2019 highlights

Cool bugs

One of my favorite parts of working on the bug bounty program is getting to see the amazing submissions we get from the community. Many of the best submissions show an understanding of GitHub and our technology that rivals that of our own engineering teams. We’ve offered very competitive bounties so we can attract those talented individuals and provide them an incentive to spend time digging deep into our codebase. The community in 2019 did not disappoint.

OAuth flow bypass using cross-site HEAD requests

@not-an-ardvark has a lot of great submissions to our program but this was particularly impactful. He wrote a great post about it in detail that I’ll quickly recap.

GitHub provides a few ways for integrators to interact with our ecosystem. One of the ways integrators can use GitHub is via OAuth applications which allow the application to take actions on behalf of a GitHub user. Before allowing access to a user’s data, an OAuth application must redirect the user to GitHub.com, allowing them to review the requested permissions and explicitly authorize the application. @not-an-ardvark found a way to bypass our controls to authorize OAuth applications without any user interaction. Let’s get into how this happened.

When we process state changing requests on GitHub.com, such as authorizing an OAuth application, we rely on Ruby on Rails’ Cross Site Request Forgery (CSRF) protection. We inject a special token into the DOM of every form element that we validate when receiving POST requests. The OAuth application authorization flow uses POST requests which require a valid CSRF token. However, the OAuth controller incorrectly allowed both POST and HEAD requests to trigger the authorization logic. We skip CSRF validation when processing HEAD requests since they’re not typically state changing. This allowed a malicious site to automatically authorize an OAuth application without any user interaction.

Due to the severity of the vulnerability, we needed to patch it as quickly as possible. We worked closely with the engineering team and shipped a fix to GitHub users within three hours of receiving the submission. We also conducted a full investigation with SIRT engineers and confirmed that this vulnerability wasn’t exploited in the wild. Additionally, we rolled out patches for GitHub Enterprise Server for all supported versions. We rewarded @not-an-aardvark with $25,000 for the severity of the vulnerability and their detailed writeup in their submission.

This bug demonstrates the important role that researchers play in our overall security. By identifying this issue via our bug bounty program, we were able to protect our users by patching the issue and validating that it wasn’t previously exploited.

GitHub.com remote code execution through command injection

@ajxchapman achieved remote code execution in GitHub.com by triggering command injection in our Mercurial import feature. The import logic didn’t correctly sanitize branch names which allowed a maliciously crafted branch name to execute code on our servers. Since the import feature is quite complicated, we’ve traditionally run the import code in a sandbox on dedicated servers isolated from our production network. This isolation limited the impact of the vulnerability, and we were able to quickly release a fix for GitHub.com and backported the fix for GitHub Enterprise Server customers. We also audited the import logic for similar issues and confirmed from our logging systems that this wasn’t exploited in the wild.

What makes this bug particularly interesting is the root cause: it was ultimately caused by an outdated dependency. The bug existed in a dependency that handles code imports and was previously fixed upstream. However, we failed to keep up with the latest version and were ultimately vulnerable to this issue. This issue highlights how critical dependency management is to the overall success of a security program. GitHub continues to invest in dependency management tooling to keep us and our customers secure. Find more of Alex’s work on his personal blog.

Expanded scope

GitHub released many new features in 2019 that were added to our Security Bug Bounty scope:

  • Pull reminders added functionality to help keep engineers informed of new pull requests that need attention. We included the solution into our core application and existing Slack integration.
  • Automated security updates (formerly Dependabot) added a better way to track vulnerabilities in dependencies since it automatically opens new pull requests updating the version of a dependency when it finds a new security fix.
  • GitHub for mobile is GitHub’s first presence in the App Store. This brought new requirements of our API and new security concerns in our application. We’re delivering the same security and functionality that’s available on GitHub.com.
  • GitHub Actions is one of GitHub’s biggest releases since pull requests and whole classes of new security corner cases. Through close collaboration with our engineering partners, we’ve provided users the ability to run their code right on GitHub.com.
  • Semmle’s LGTM tool was a significant addition to our suite of security tools, like Dependabot and the Maintainer Security Advisories. LGTM allows our users to scan for potential security issues in their code on every pull request.

We’ve had several valuable submissions that influenced the development of these products significantly. We paid out over $20,000 in bounties for vulnerabilities affecting the products in this expanded scope, and we’re excited to continue expanding our Bug Bounty scope as GitHub grows.

H1-702

In August 2019, we returned to Las Vegas to participate in our second H1-702 event. This event invited the top hackers from HackerOne’s platform to join us along with two other companies for three nights of live hacking. We were excited to participate and wanted to give researchers every incentive to dig deep into our application. We even added a bunch of bonuses on top of our base payouts, including bonuses for Best Proof of Concept, Longest Exploit Chain, and RCE. We also set up a CTF on GitHub.com to direct researchers to some of our newest attack surfaces. Lastly, we hid flags in a Maintainer Security Advisory and GitHub Package Registry with bonuses for every flag. We received positive feedback from some of our researchers about our CTF and will continue to include a CTF component in future events.

Overall, we paid out over $155,000 to researchers in one night, with half those rewards for high or critical severity issues. We can’t express how important live-hacking events, like H1-702, are to our bug bounty program. We look forward to more live-hacking events in the future and other new and innovative ways to engage the community.

Private bug bounty

Beyond the wide scope of our public program, we conducted an invite-only program where we preview features to researchers before they’re launched to everyone. These private programs allow us to work closely with a small group, and give us the opportunity to find bugs before they can affect the majority of our users. We’ve paid out just over $37,000 via our private program this year, and many of these findings were fixed before new features reached a significant number of our customers.

Actions CI/CD

Following the success of our first private bug bounty targeting GitHub Actions, we wanted to re-run the private program to target the most recent iteration of our GitHub Actions product. We used what we learned in our first bug bounty to secure the product against similar issues. The community accepted the challenge and found novel bugs in our second iteration.

Automated security updates (formerly Dependabot)

Just like any combination of two complex systems, the acquisition of Dependabot presented a unique challenge for our security team in integrating these two separate architectures. We used the private bug bounty to supplement our own security review of these new services. The findings from the private bug bounty program greatly informed how we integrated Dependabot with GitHub.com. We were also able to surface a few issues before rolling it out.

Pull reminders

Like Dependabot, pull reminders required the same care and attention to ensure a secure transition from an integration to a first-party GitHub product. Pull reminders also added more complexity through its connection to Slack. Our own Slack integration provided a foundation for this feature, but there was significant re-architecture and development to tie these two features together. Again, we turned to our bug bounty community to test our pull reminder integration before releasing the feature widely.

2020 initiatives

We have a lot of plans for 2020 and want to highlight some of our upcoming changes.

Security Lab bounty program

We launched the GitHub Security Lab bounty program to incentivize researchers to help us secure all open source software. The new program rewards community members who write CodeQL queries that detect entire vulnerability classes so that the rest of the community can run those queries against their own projects. This results in removing vulnerabilities at scale.

Making a contribution to this program not only influences the global state of software security,  but also prevents similar vulnerabilities from being released in the future. This is an exciting twist on our traditional bug bounty program, and we’re excited to see researchers using our new CodeQL tooling. To date, we received 20 submissions and awarded almost $21,000, with hundreds of vulnerabilities fixed across the OSS ecosystem as a direct result.

CVEs and disclosure

This year, we’re assigning CVEs to bounty submissions which affect GitHub Enterprise Server. This is a big step forward in consistently communicating the state of our software to our customers, but also provides another accolade for our researchers who identify vulnerabilities in GitHub Enterprise Server.

Get involved

Are you excited by the new additions to our program? Get involved! Visit the GitHub Security Bug Bounty page for details of our scope, rules, and rewards. We can’t wait to make GitHub better for everyone with the help of your submissions.

Learn more about the GitHub Security Bug Bounty

The post Six years of the GitHub Security Bug Bounty program appeared first on The GitHub Blog.

CERT partners with GitHub Security Lab for automated remediation

Post Syndicated from Nico Waisman original https://github.blog/2020-03-18-cert-partners-with-github-security-lab-for-automated-remediation/

As security researchers, the GitHub Security Lab team constantly embarks on an emotional journey with each new vulnerability challenge. The excitement of starting new research, the disappointment that comes with hitting a plateau, the energy it takes to stay focused and on target…and hopefully, the sheer joy of achieving a tangible result after weeks or months of working on a problem that seemed unsolvable.

Regardless of how proud you are of the results, do you ever get a nagging feeling that maybe you didn’t make enough of an impact? While single bug fixes are worthwhile in improving code, it’s not sufficient enough to improve the state of security of the open source software (OSS) ecosystem as a whole. This holds true especially when you consider that software is always growing and changing—and as vulnerabilities are fixed, new ones are introduced.

Beyond single bug fixes

At GitHub, we host millions of OSS projects which puts us in a unique position to take a different approach with OSS security. We have the power and responsibility to make an impact beyond single bug fixes. This is why a big part of the GitHub Security Lab mission is to find ways to scale our vulnerability hunting efforts and empower others to do the same.

Our goal is to turn single vulnerabilities into hundreds, if not thousands, of bug fixes at a time. Enabled by the GitHub engineering teams, we aim to establish workflows that are open to the community that tackle vulnerabilities at scale on the GitHub platform.

Ultimately, we want to establish feedback loops with the developer and security communities, and act as security facilitators, all while working with the OSS community to secure the software we all depend upon.


We’re taking a deep-dive in the remediation of a security vulnerability with CERT. Learn more about how we found ways to scale our vulnerability hunting efforts and empower others to do the same.

Continue reading on the Security Lab blog

The post CERT partners with GitHub Security Lab for automated remediation appeared first on The GitHub Blog.

Tackling UI test execution time imbalance for Xcode parallel testing

Post Syndicated from Grab Tech original https://engineering.grab.com/tackling-ui-test-execution-time-imbalance-for-xcode-parallel-testing

Introduction

Testing is a common practice to ensure that code logic is not easily broken during development and refactoring. Having tests running as part of Continuous Integration (CI) infrastructure is essential, especially with a large codebase contributed by many engineers. However, the more tests we add, the longer it takes to execute. In the context of iOS development, the execution time of the whole test suite might be significantly affected by the increasing number of tests written. Running CI pre-merge pipelines against a change, would cost us more time. Therefore, reducing test execution time is a long term epic we have to tackle in order to build a good CI infrastructure.

Apart from splitting tests into subsets and running each of them in a CI job, we can also make use of the Xcode parallel testing feature to achieve parallelism within one single CI job. However, due to platform-specific implementations, there are some constraints that prevent parallel testing from working efficiently. One constraint we found is that tests of the same Swift class run on the same simulator. In this post, we will discuss this constraint in detail and introduce a tip to overcome it.

Background

Xcode parallel testing

The parallel testing feature was shipped as part of the Xcode 10 release. This support enables us to easily configure test setup:

  • There is no need to care about how to split a given test suite.
  • The number of workers (i.e. parallel runners/instances) is configurable. We can pass this value in the xcodebuild CLI via the -parallel-testing-worker-count option.
  • Xcode takes care of cloning and starts simulators accordingly.

However, the distribution logic under the hood is a black-box. We do not really know how tests are assigned to each worker or simulator, and in which order.

Three simulators running tests in parallel
Three simulators running tests in parallel

It is worth mentioning that even without the Xcode parallel testing support, we can still achieve similar improvements by running subsets of tests in different child processes. But it takes more effort to dispatch tests to each child process in an efficient way, and to handle the output from each test process appropriately.

Test time imbalance

Generally, a parallel execution system is at its best efficiency if each parallel task executes in roughly the same duration and ends at roughly the same time.

If the time spent on each parallel task is significantly different, it will take more time than expected to execute all tasks. For example, in the following image, it takes the system on the left 13 mins to finish 3 tasks. Whereas, the one on the right takes only 10.5 mins to finish those 3 tasks.

Bad parallelism vs. good parallelism
Bad parallelism vs. good parallelism

Assume there are N workers. The ith worker executes its tasks in ti seconds/minutes. In the left plot, t1 = 10 mins, t2 = 7 mins, t3 = 13 mins.

We define the test time imbalance metric as the difference between the min and max end time:

max(ti) – min(ti)

For the example above, the test time imbalance is 13 mins – 7 mins = 6 mins.

Contributing factors in test time imbalance

There are several factors causing test time imbalance. The top two prominent factors are:

  1. Tests vary in execution time.
  2. Tests of the same class run on the same simulator.

An example of the first factor is that in our project, around 50% of tests execute in a range of 20-40 secs. Some tests take under 15 secs to run while several take up to 2 minutes. Sometimes tests taking longer execution time is inevitable since those tests usually touch many flows, which cannot be split. If such tests run last, the test time imbalance may increase.

However, this issue, in general, does not matter that much because long-time-execution tests do not always run last.

Regarding the second factor, there is no official Apple documentation that explicitly states this constraint. When Apple first introduced parallel testing support in Xcode 10, they only mentioned that test classes are distributed across runner processes:

“Test parallelization occurs by distributing the test classes in a target across multiple runner processes. Use the test log to see how your test classes were parallelized. You will see an entry in the log for each runner process that was launched, and below each runner you will see the list of classes that it executed.”

For example, we have a test class JobFlowTests that includes five tests and another test class TutorialTests that has only one single test.

final class JobFlowTests: BaseXCTestCase {
func testHappyFlow() { ... }
  func testRecoverFlow() { ... }
  func testJobIgnoreByDax() { ... }
  func testJobIgnoreByTimer() { ... }
  func testForceClearBooking() { ... }
}
...
final class TutorialTests: BaseXCTestCase {
  func testOnboardingFlow() { ... }
}

When executing the two tests with two simulators running in parallel, the actual run is like the one shown on the left side of the following image, but ideally it should work like the one on the right side.

Tests of the same class are supposed to run on the same simulator but they should be able to run on different simulators.
Tests of the same class are supposed to run on the same simulator but they should be able to run on different simulators.

Diving deep into Xcode parallel testing

Demystifying Xcode scheduling log

As mentioned above, Xcode distributes tests to simulators/workers in a black-box manner. However, by looking at the scheduling log generated when running tests, we can understand how Xcode parallel testing works.

When running UI tests via the xcodebuild command:

$ xcodebuild -workspace Driver/Driver.xcworkspace \
    -scheme Driver \
    -configuration Debug \
    -sdk 'iphonesimulator' \
    -destination 'platform=iOS Simulator,id=EEE06943-7D7B-4E76-A3E0-B9A5C1470DBE' \
    -derivedDataPath './DerivedData' \
    -parallel-testing-enabled YES \
    -parallel-testing-worker-count 2 \
    -only-testing:DriverUITests/JobFlowTests \    # 👈👈👈👈👈
    -only-testing:DriverUITests/TutorialTests \
    test-without-building

The log can be found inside the *.xcresult folder under DerivedData/Logs/Test. For example: DerivedData/Logs/Test/Test-Driver-2019.11.04\_23-31-34-+0800.xcresult/1\_Test/Diagnostics/DriverUITests-144D9549-FD53-437B-BE97-8A288855E259/scheduling.log

Scheduling log under xcresult folder.
Scheduling log under xcresult folder
2019-11-05 03:55:00 +0000: Received worker from worker provider: 0x7fe6a684c4e0 [0: Clone 1 of DaxIOS-XC10-1-iP7-1 (3D082B53-3159-4004-A798-EA5553C873C4)]
2019-11-05 03:55:13 +0000: Worker 0x7fe6a684c4e0 [4985: Clone 1 of DaxIOS-XC10-1-iP7-1 (3D082B53-3159-4004-A798-EA5553C873C4)] finished bootstrapping
2019-11-05 03:55:13 +0000: Parallelization enabled; test execution driven by the IDE
2019-11-05 03:55:13 +0000: Skipping test class discovery
2019-11-05 03:55:13 +0000: Executing tests {(	# 👈👈👈👈👈
    DriverUITests/JobFlowTests,
    DriverUITests/TutorialTests
)}; skipping tests {(
)}
2019-11-05 03:55:13 +0000: Load balancer requested an additional worker
2019-11-05 03:55:13 +0000: Dispatching tests {(  # 👈👈👈👈👈
    DriverUITests/JobFlowTests
)} to worker: 0x7fe6a684c4e0 [4985: Clone 1 of DaxIOS-XC10-1-iP7-1 (3D082B53-3159-4004-A798-EA5553C873C4)]
2019-11-05 03:55:13 +0000: Received worker from worker provider: 0x7fe6a1582e40 [0: Clone 2 of DaxIOS-XC10-1-iP7-1 (F640C2F1-59A7-4448-B700-7381949B5D00)]
2019-11-05 03:55:39 +0000: Dispatching tests {(  # 👈👈👈👈👈
    DriverUITests/TutorialTests
)} to worker: 0x7fe6a684c4e0 [4985: Clone 1 of DaxIOS-XC10-1-iP7-1 (3D082B53-3159-4004-A798-EA5553C873C4)]
...

Looking at the log below, we know that once a test class is dispatched or distributed to a worker/simulator, all tests of that class will be executed in that simulator.

2019-11-05 03:55:39 +0000: Dispatching tests {(
    DriverUITests/TutorialTests
)} to worker: 0x7fe6a684c4e0 [4985: Clone 1 of DaxIOS-XC10-1-iP7-1 (3D082B53-3159-4004-A798-EA5553C873C4)]

Even when we customize a test suite (by swizzling some XCTestSuite class methods or variables), to split a test suite into multiple suites, it does not work because the made-up test suite is only initialized after tests are dispatched to a given worker.

Therefore, any hook to bypass this constraint must be done early on.

Passing the -only-testing argument to xcodebuild command

Now we pass tests (instead of test classes) to the -only-testing argument.

$ xcodebuild -workspace Driver/Driver.xcworkspace \
    # ...
    -only-testing:DriverUITests/JobFlowTests/testJobIgnoreByTimer \
    -only-testing:DriverUITests/JobFlowTests/testRecoverFlow \
    -only-testing:DriverUITests/JobFlowTests/testJobIgnoreByDax \
    -only-testing:DriverUITests/JobFlowTests/testHappyFlow \
    -only-testing:DriverUITests/JobFlowTests/testForceClearBooking \
    -only-testing:DriverUITests/TutorialTests/testOnboardingFlow \
    test-without-building

But still, the scheduling log shows that tests are grouped by test class before being dispatched to workers (see the following log for reference). This grouping is automatically done by Xcode (which it should not).

2019-11-05 04:21:42 +0000: Executing tests {(	# 👈
    DriverUITests/JobFlowTests/testJobIgnoreByTimer,
    DriverUITests/JobFlowTests/testRecoverFlow,
    DriverUITests/JobFlowTests/testJobIgnoreByDax,
    DriverUITests/TutorialTests/testOnboardingFlow,
    DriverUITests/JobFlowTests/testHappyFlow,
    DriverUITests/JobFlowTests/testForceClearBooking
)}; skipping tests {(
)}
2019-11-05 04:21:42 +0000: Load balancer requested an additional worker
2019-11-05 04:21:42 +0000: Dispatching tests {(  # 👈 ❌
    DriverUITests/JobFlowTests/testJobIgnoreByTimer,
    DriverUITests/JobFlowTests/testForceClearBooking,
    DriverUITests/JobFlowTests/testJobIgnoreByDax,
    DriverUITests/JobFlowTests/testHappyFlow,
    DriverUITests/JobFlowTests/testRecoverFlow
)} to worker: 0x7fd781261940 [6300: Clone 1 of DaxIOS-XC10-1-iP7-1 (93F0FCB6-C83F-4419-9A75-C11765F4B1CA)]
......

Overcoming grouping logic in Xcode parallel testing

Tweaking the -only-testing argument values

Based on our observation, we can imagine how Xcode runs tests in parallel. See below.

Step 1.   tests = detect_tests_to_run() # parse -only-testing arguments
Step 2.   groups_of_tests = group_tests_by_test_class(tests)
Step 3.   while groups_of_tests is not empty:
Step 3.1. 	worker = find_free_worker()
Step 3.2.     if worker is not None:
                  dispatch_tests_to_workers(groups_of_tests.pop())

In the pseudo-code above, we do not have much control to change step 2 since that grouping logic is implemented by Xcode. But we have a good guess that Xcode groups tests, by the first two components (class name) only (For example, DriverUITests/JobFlowTests). In other words, tests having the same class name run together on one simulator.

The trick to break this constraint is simple. We can tweak the input (test names) so that each group contains only one test. By inserting a random token in the class name, all class names in the tests that are passed via -only-testing argument are different.

For example, instead of passing:

-only-testing:DriverUITests/JobFlowTests/testJobIgnoreByTimer \
-only-testing:DriverUITests/JobFlowTests/testRecoverFlow \

We rather use:

-only-testing:DriverUITests/JobFlowTests_AxY132z8/testJobIgnoreByTimer \
-only-testing:DriverUITests/JobFlowTests_By8MTk7l/testRecoverFlow \

Or we can use the test name itself as the token:

-only-testing:DriverUITests/JobFlowTests_testJobIgnoreByTimer/testJobIgnoreByTimer \
-only-testing:DriverUITests/JobFlowTests_testRecoverFlow/testRecoverFlow \

After that, looking at the scheduling log, we will see that the trick can bypass the grouping logic. Now, only one test is dispatched to a worker once ready.

2019-11-05 06:06:56 +0000: Dispatching tests {(	# 👈 ✅
    DriverUITests/JobFlowTests_testJobIgnoreByDax/testJobIgnoreByDax
)} to worker: 0x7fef7952d0e0 [13857: Clone 2 of DaxIOS-XC10-1-iP7-1 (9BA030CD-C90F-4B7A-B9A7-D12F368A5A64)]
2019-11-05 06:06:58 +0000: Dispatching tests {(	# 👈 ✅
    DriverUITests/TutorialTests_testOnboardingFlow/testOnboardingFlow
)} to worker: 0x7fef7e85fd70 [13719: Clone 1 of DaxIOS-XC10-1-iP7-1 (584F99FE-49C2-4536-B6AC-90B8A10F361B)]
2019-11-05 06:07:07 +0000: Dispatching tests {(	# 👈 ✅
    DriverUITests/JobFlowTests_testRecoverFlow/testRecoverFlow
)} to worker: 0x7fef7952d0e0 [13857: Clone 2 of DaxIOS-XC10-1-iP7-1 (9BA030CD-C90F-4B7A-B9A7-D12F368A5A64)]

Handling tweaked test names

When a worker/simulator receives a request to run a test, the app (could be the runner app or the hosting app) initializes an XCTestSuite corresponding to the test name. In order for the test suite to be properly made up, we need to remove the inserted token.

This could be done easily by swizzling the XCTestSuite.init(forTestCaseWithName:). Inside that swizzled function, we remove the token and then call the original init function.

extension XCTestSuite {
  /// For 'Selected tests' suite
  @objc dynamic class func swizzled_init(forTestCaseWithName maskedName: String) -> XCTestSuite {
    /// Recover the original test name
    /// - masked: UITestCaseA_testA1/testA1      	--> recovered: UITestCaseA/testA1
    /// - masked: Driver/UITestCaseA_testA1/testA1   --> recovered: Driver/UITestCaseA/testA1
    guard let testBaseName = maskedName.split(separator: "/").last else {
      return swizzled_init(forTestCaseWithName: maskedName)
    }
    let recoveredName = maskedName.replacingOccurrences(of: "_\(testBaseName)/", with: "/") # 👈 remove the token
    return swizzled_init(forTestCaseWithName: recoveredName) # 👈 call the original init
  }
}
Swizzle function to run tests properly
Swizzle function to run tests properly

Test class discovery

In order to adopt this tip, we need to know which test classes we need to run in advance. Although Apple does not provide an API to obtain the list before running tests, this can be done in several ways. One approach we can use is to generate test classes using Sourcery. Another alternative is to parse the binaries inside .xctest bundles (in build products) to look for symbols related to tests.

Conclusion

In this article, we identified some factors causing test execution time imbalance in Xcode parallel testing (particularly for UI tests).

We also looked into how Xcode distributes tests in parallel testing. We also try to mitigate a constraint in which tests within the same class run on the same simulator. The trick not only reduces the imbalance but also gives us more confidence in adding more tests to a class without caring about whether it affects our CI infrastructure.

Below is the metric about test time imbalance recorded when running UI tests. After adopting the trick, we saw a decrease in the metric (which is a good sign). As of now, the metric stabilizes at around 0.4 mins.

Tracking data of UI test time imbalance (in minutes) in our project, collected by multiple runs
Tracking data of UI test time imbalance (in minutes) in our project, collected by multiple runs

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

February service disruptions

Post Syndicated from Keith Ballinger original https://github.blog/2020-02-28-february-service-disruptions/

Recently, we’ve had multiple service interruptions on GitHub.com. We know how important reliability of our service is for your projects and teams. We take this responsibility very seriously and apologize for these disruptions.

These incidents were distinct events, but have a common theme of uncovering new challenges in scaling our database tier. Specifically, increased load on our largest database cluster contributed to degradations across multiple services.  

In the coming weeks, we’ll follow up with a more in-depth and technical report of these events and the work we are doing to improve the scalability and performance of our backend systems.  

We have several data partitioning initiatives already in progress, and we’ll be rolling out some of this work very soon. You can follow our status page for updates about the availability of our systems. 

 

Check GitHub’s current status

The post February service disruptions appeared first on The GitHub Blog.

Returning 575 Terabytes of storage space back to our users

Post Syndicated from Grab Tech original https://engineering.grab.com/returning-storage-space-back-to-our-users

Have you ever run out of storage on your phone? Mobile phones come with limited storage and with the multiplication of apps and large video files, many of you are running out of space.

In this article, we explain how we measure and reduce the storage footprint of the Grab App on a user’s device to help you overcome this issue.

The wakeup call

Android vitals (information provided by Google play Console about our app performance) gives us two main pieces of information about storage footprint.

15.7% of users have less than 1GB of free storage and they tend to uninstall more than other users (1.2x).

The proportion of 30 day active devices which reported less than 1GB free storage. Calculated as a 30 days rolling average.

Active devices with <1GB free space
Active devices with <1GB free space

This is the ratio of uninstalls on active devices with less than 1GB free storage to uninstalls on all active devices. Calculated as a 30 days rolling average.

Ratio of uninstalls on active devices with less than 1GB
Ratio of uninstalls on active devices with less than 1GB

Instrumentation to know where we stand

First things first, we needed to know how much space the Grab App occupies on user device. So we started using our personal devices. We can find this information by opening the phone settings and selecting Grab App.

App Settings
App Settings

For this device (screenshot), the application itself (Installed binary) was 186 MB and the total footprint was 322 MB. Since this information varies a lot based on the usage of the app, we needed this information directly from our users in production.

Disclaimer: We are only measuring files that are inside the internal Grab app folder (Cache/Database). We do NOT measure any file that is not inside the private Grab folder.

We decided to leverage on our current implementation using StorageManager API to gather the following information during each session launch:

  • Application Size (Installed binary size)
  • Cache folder size
  • Total footprint
Sample code to retrieve storage information on Android
Sample code to retrieve storage information on Android

Data analysis

We began analysing this data one month after our users’ updated their app and found that the cache size was anomaly huge (> 1GB) for a lot of users. Intrigued, we dug deeper.

We added code to log the top largest files inside the cache folder, and we found that most of the files were inside a sub cache folder that was no longer in use. This was due to a usage of a 3rd party library that was removed from our app. We added a specific metric to track the size of this folder.

In the end, a lot of users still had this old cache data and for some users the amount of data can be up to 1GB.

Root cause analysis

The Grab app relies a lot on 3rd party libraries. For example, Picasso was a library we used in the past for image display which is now replaced by Glide. Picasso uses a cache to store images and avoid making network calls again and again. After removing Picasso from the app, we didn’t delete this cache folder on the user device. We knew there would likely be more third-party libraries that had been discontinued so we expanded our analysis to look at how other 3rd party libraries cached their data.

Freeing up space on user’s phone

Here comes the fun part. We implemented a cleanup mechanism to remove old cache folders. When users update the GrabApp, any old cache folders which were there before would automatically be removed. Through this, we released up to 1GB of data in a second back to our users. In total, we removed 575 terabytes of old cache data across more than 13 million devices (approximately 40MB per user on average).

Data summary

The following graph shows the total size of junk data (in Terabytes) that we can potentially remove each day, calculated by summing up the maximum size of cache when a user opens the Grab app each day.

The first half of the graph reflects the amount of junk data in relation to the latest app version before auto-clean up was activated. The second half of the graph shows a dramatic dip in junk data after auto-clean up was activated. We were deleting up to 33 Terabytes of data per day on the user’s device when we first started!

Sum of all junk data on user’s device reported per day in Terabytes
Sum of all junk data on user’s device reported per day in Terabytes

Next step

This is the first phase of our journey in reducing the storage footprint of our app on Android devices. We specifically focused on making improvements at scale i.e. deliver huge storage gains to the most number of users in the shortest time. In the next phase, we will look at more targeted improvements for specific groups of users that still have a high storage footprint. In addition, we are also reviewing iOS data to see if a round of clean up is necessary.

Concurrently, we are also reducing the maximum size of cache created by some libraries. For example, Glide by default creates a cache of 250MB but this can be configured and optimised.

We hope you found this piece insightful and please remember to update your app regularly to benefit from the improvements we’re making every day. If you find that your app is still taking a lot of space on your phone, be assured that we’re looking into it.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Grab-Posisi – SouthEast Asia’s first comprehensive GPS trajectory dataset

Post Syndicated from Grab Tech original https://engineering.grab.com/grab-posisi

Introduction        

At Grab, thousands of bookings happen daily via the Grab app. The driver phones and GPS devices enable us to collect large-scale GPS trajectories.

Apart from the time and location of the object, GPS trajectories are also characterised by other parameters such as speed, the headed direction, the area and distance covered during its travel, and the travelled time. Thus, the trajectory patterns from users GPS data are a valuable source of information for a wide range of urban applications, such as solving transportation problems, traffic prediction, and developing reasonable urban planning.

Currently, it’s a herculean task to create and maintain the GPS datasets since it’s costly and laborious. As a result, most of the GPS datasets available today in the market have poor coverage or contain outdated information. They cover only a small area of a city, have low sampling rates and contain less contextual information of the GPS pings such as no accuracy level, bearing, and speed.Despite over a dozen mapping communities engaged in collecting GPS trajectory datasets, a significant amount of effort would be required for data cleaning and data pre-processing in order to utilize them.

To overcome the shortfalls in the existing datasets, we built Grab-Posisi, the first GPS trajectory dataset of Southeast Asia. The term Posisi refers to a position in Bahasa. The data was collected from Grab drivers’ phones while in transit. By tackling the addition of major arterial roads in regions where existing maps have poor coverage, and the incremental improvement of coverage in regions where major roads are already mapped, Posisi substantially improves mapping productivity.

What’s inside the dataset

The whole Grab-Posisi dataset contains in total 84K trajectories that consist of more than 80 million GPS pings and cover over 1 million km. The average trajectory length is 11.94 km and the average duration per trip is 21.50 minutes.

The data were collected very recently in April 2019 with a 1 second sampling rate, which is the highest amongst all the publicly available datasets. It also has richer contextual information, including the accuracy level, bearing and speed. The accuracy level is important because GPS measurements are noisy and the true location can be anywhere inside a circle centred at the reported location with a radius equal to the accuracy level. The bearing is the horizontal direction of travel, measured in degrees relative to true north. Finally, the speed is reported in meters/second over ground.

As the GPS trajectories were collected from Grab drivers’ phones while in transit, we labelled each trajectory by phone device type being either Android or iOS. This is the first dataset which differentiates such device information. Furthermore, we also label the trajectories by driving mode (Car or Motorcycle).

All drivers’ personal information is encrypted and the real start/end locations are removed within the dataset.

Data format

Each trajectory is serialised in a file in Apache Parquet format. The whole dataset size is around 2 GB. Each GPS ping is associated with values for a trajectory ID, latitude, longitude, timestamp (UTC), accuracy level, bearing and speed. The GPS sampling rate is 1 second, which is the highest among all the existing open source datasets. Table 1 shows a sample of the dataset.

Table 1: Sample dataset
Table 1: Sample dataset

Coverage

Figure 1a shows the spatial coverage of the dataset in Singapore. Compared with the GPS datasets available in the market that only cover a specific area of a city, the Grab-Posisi dataset encompasses almost the whole island of Singapore. Figure 1b depicts the GPS density in Singapore. Red represents high density while green represents low density. Expressways in Singapore are clearly visible because of their dense GPS pings.

Figure 1a. Spatial coverage (Singapore)
Figure 1a. Spatial coverage (Singapore)
Figure 1b. GPS density (highways have more GPS)
Figure 1b. GPS density (highways have more GPS)

Figure 2a illustrates that the Grab-Posisi dataset encloses not only central Jakarta but also extends to external highways. Figure 2b depicts the GPS density of cars in Jakarta. Compared with Singapore, trips in Jakarta are spread out in all different areas, not just concentrated on highways.

Figure 2a. Spatial coverage (Jakarta)
Figure 2a. Spatial coverage (Jakarta)
Figure 2b. GPS density (Car)
Figure 2b. GPS density (Car)

Applications of Grab-Posisi

The following are some of the applications of Grab-Posisi dataset.

On Map Inference

The traditional method used in updating road networks in maps is time-consuming and labour-intensive. That’s why maps might have important roads missing and real-time traffic conditions might be unavailable. To address this problem, we can use GPS trajectories in reconstructing road networks automatically.

A bunch of map generation algorithms can be applied to infer both map topology and road attributes. Figure 3b shows a snippet of the inferred map from our GPS trajectories (Figure 3a) using one of the algorithms. As you can see from the blue dots, the skeleton of the underlining map inferred is correct, although some section of the inferred road is disconnected, and at the roundabout in the bottom right corner it’s not a smooth curve.

Figure 3a. Raw GPS trajectories
Figure 3a. Raw GPS trajectories  
Figure 3b. Inferred Map
Figure 3b. Inferred Map

On Map Matching                                         

The map matching refers to the task of automatically determining the correct route where the driver has travelled on a digital map, given a sequence of raw and noisy GPS points. The correction of the raw GPS data has been important for many location-based applications such as navigation, tracking, and road attribute detection as aforementioned. The accuracy levels provided in the Grab-Posisi dataset can be of great use to address this issue.

On Traffic Detection and Forecast                         

In addition to the inference of a static digital map, the Grab-Posisi GPS dataset can also be used to perform real-time traffic forecasting, which is very important for congestion detection, flow control, route planning, and navigation. Some examples of the fundamental indicators that are mostly used to monitor the current status of traffic conditions include the average speed, volume, and density in each road segment. These variables can be computed based on drivers’ GPS trajectories and can be used to predict the future traffic conditions.

On Mode Detection                         

Transportation mode detection refers to the task of identifying the travel mode of a user (some examples of transportation mode include walk, bike, car, bus, etc.). The GPS trajectories in our dataset are associated with rich attributes including GPS accuracy, bearing, and speed in addition to the latitude and longitude of geo-coordinates, which can be used to develop mode detection models. Our dataset also provides labels for each trajectory to be collected from a car or motorcycle, which can be used to verify performance of those models.

Economics Perspective                                         

The real-world GPS trajectories of people reveal realistic travel patterns and demands, which can be of great help for city planning. As there are some realistic constraints faced by governments such as budget limitations and construction inconvenience, it is important to incorporate both the planning authorities’ requirements and the realistic travel demands mined from trajectories for intelligent city planning. For example, the trajectories of cars can provide suggestions on how to schedule highway constructions. The trajectories of motorcycles can help the government to choose the optimal locations to construct motorcycle lanes for safety concerns.

Want to access our dataset?

Grab-Posisi dataset offers a great value and is a significant resource to the community for benchmarking and revisiting existing technologies.         

If you want to access our dataset for research purposes, email [email protected] with the following details:

  • Your Name and contact details
  • Your institution
  • Your potential usage of the dataset

When using Grab-Posisi dataset, please cite the following paper:

Huang, X., Yin, Y., Lim, S., Wang, G., Hu, B., Varadarajan, J., … & Zimmermann, R. (2019, November). Grab-Posisi: An Extensive Real-Life GPS Trajectory Dataset in Southeast Asia. In Proceedings of the 3rd ACM SIGSPATIAL International Workshop on Prediction of Human Mobility (pp. 1-10). DOI: https://doi.org/10.1145/3356995.3364536

Click here to download the published paper.

Click here to download the BibTex file.

Note: You cannot use Grab-Posisi dataset for commercial purposes.

Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving South East Asia forward, apply to join our team today.

Automating MySQL schema migrations with GitHub Actions and more

Post Syndicated from Shlomi Noach original https://github.blog/2020-02-14-automating-mysql-schema-migrations-with-github-actions-and-more/

In the past year, GitHub engineers shipped GitHub Packages, Actions, Sponsors, Mobile, security advisories and updates, notifications, code navigation, and more. Needless to say, the development pace at GitHub is accelerated.

With MySQL serving our backends, updating code requires changes to the underlying database schema. New features may require new tables, columns, changes to existing columns or indexes, dropping unused tables, and so on. On average, we have two schema migrations running daily on our production servers. Some days we have a half dozen migrations to run. We’ll cover how this amounted to a significant toil on the database infrastructure team, and how we searched for a solution to automate the manual parts of the process.

What’s in a migration?

At first glance, migrating appears to be no more difficult than adding a CREATE, ALTER or DROP TABLE statement. At a closer look, the process is far more complex, and involves multiple owners, platforms, environments, and transitions between those pieces. Here’s the flow as we experience it at GitHub:

1. Starting the process

It begins with a developer who identifies the need for a schema change. Maybe they need a new table, or a new column in an existing table. The developer has a local testing environment where they can experiment however they like, until they’re satisfied and wish to apply changes to production.

2. Feedback and review

The developer doesn’t just apply their changes online. First, they seek review and discussion with their peers. Depending on the change, they may ask for a review from a group of schema reviewers (at GitHub, this is a volunteer group experienced with database design). Then, they seek the agreement of the database infrastructure team, who owns the production databases. The database infrastructure team reviews the changes, looking for performance concerns, among other potential issues. Assuming all reviews are favorable, it’s on the database infrastructure engineer to deploy the change to production.

3. Taking the change to production

At this point, we need to determine where the change is taking place since we have multiple clusters. Some of them are sharded, so we have to ask: Where do the affected tables exist in our clusters or schemas? Next, we need to know what to run. The developer presented the schema they want to see in production, but how do we transition the existing production schema into the one requested? What’s the formal CREATE, ALTER or DROP statement? Following what to run, we need to know how we should run the migration. Do we run the query directly? Or is it a blocking operation and we need an online schema change tool? And finally, we need to know when to execute the migration. Perhaps now is not a good time if there’s already a migration running on the cluster.

4. Migration

At long last, we’re ready to run the migration. Some of our larger tables may take hours and even days to migrate, especially since the site needs to be up and running. We want to track status. And we want to see what impact the migration may have on production, or, preferably, to ensure it does not have an impact.

5. Completing the process

Even as the migration completes there are further steps to take. There’s some cleanup process, and we want to unblock the next migration, if any currently exists. The database infrastructure team wishes to advertise to the developer that the changes have taken place, and the developer will have their own followup to address.

Throughout that flow, there’s a lot of potential for friction:

  • Does the database infrastructure team review the developer’s request in a timely fashion?
  • Is the review process productive?
  • Do we need to wait for something before running the migration?
  • Is the database infrastructure engineer actually available to run the migration, or perhaps they’re busy with other tasks?

The database infrastructure engineer needs to either create or review the migration statement, double-check their logic, ensure they can begin the migration, follow up, unblock other migrations as needed, advertise progress to the developer, and so on.

With our volume of daily migrations, this flow sometimes consumed hours of work of a database infrastructure engineer per day, and—in the best-case scenario—at least several hours of work per week. They would frequently multitask between two or three migrations and keep mental notes for next steps. Developers would ping us to ask what the status was, and their work was sometimes blocked until the migration was complete.

A brief history of schema migration automation at GitHub

GitHub was originally created as a Ruby on Rails (RoR) app. Like other frameworks, and in particular, those using Active Record, RoR has a built-in mechanism to generate database schema from code, as well as programmatically express migrations. RoR tooling can analyze code changes and create and run the SQL statements to change the database schema.

We use the GitHub flow to manage our own development: when suggesting a change, we create a branch, commit, push, and open a pull request. We use the declarative approach to schema definition: our RoR GitHub repository contains the full schema definition, such as the CREATE TABLE statements that generate the complete schema. This way, we know exactly what schema is associated with each commit or branch. Counter that with the programmatic approach, where your commits contain migration statements, and where to deduce a schema you need to start at some baseline and run through all statements sequentially.

The database infrastructure and the application teams collaborated to create a set of chatops tooling. We ran a chatops command to list pull requests with schema changes, and then another command to generate the CREATE/ALTER/DROP statement for a given pull request. For this, we used RoR’s rake command. Our wrapper scripts then added meta information, like which cluster is involved, and generated a script used to run the migration.

The generated statements and script were mostly fine, with occasional SQL syntax errors. We’d review the output and fix it manually as needed.

A few years ago we developed gh-ost, an online table migration solution, which added even more visibility and control through our chatops. We’d check progress, change runtime configuration, and cut-over the migration through chat. While simple, these were still manual steps.

The heart of GitHub’s app remains with the same RoR, but we’ve expanded far beyond it. We created more repositories and some also use RoR, while others are in other programming languages such as Go. However, we didn’t use Object Relational Mapping practice with the new repositories.

As GitHub expanded, the more toil the database infrastructure team had. We’d review pull requests, compare schemas, generate migration statements manually, and verify on a local machine. Other than the git log, no formal tracking for schema migrations existed. We’d check in chat, issues, and pull requests to see what was done and what wasn’t. We’d keep track of ongoing migrations in our heads, context switch between the migrations throughout the day, and how often we’d get interrupted by notifications. And we did this while taking each migration through the next step, keeping mental notes, and communicating the progress to our peers.

With these steps in mind, we wanted a solution to automate the process. We came up with various ideas, and in 2019 GitHub Actions was released. This was our solution: multiple loosely coupled components, each owning a specific aspect of the flow, all orchestrated by a controller service. The next section covers the breakdown of our solution.

Code

Our basic premise is that schema design should be treated as code. We want the schema to be versioned, and we want to know what schema is associated and with what version of our code.

To illustrate, GitHub provides not only github.com, but also GitHub Enterprise, an on-premise solution. On github.com we run continuous deployments. With GitHub Enterprise, we make periodic releases, and our customers can upgrade in-house. This means we need to be able to reproduce any schema changes we make to github.com on a customer’s Enterprise server.

Therefore we must keep our schema design coupled with the code in the same git repository. For a developer to design a schema change, they need to follow our normal development flow: create a branch, commit, push, and open a pull request. The pull request is where code is reviewed and discussion takes place for any changes. It’s where continuous integration and testing run. Our solution revolves around the pull request, and this is standardized across all our repositories.

The change

Once a pull request is opened, we need to be able to identify what changes we’d like to make. Typically, when we review code changes, we look at the diff. And it might be tempting to expect that git diff can help us formalize the schema change. Unfortunately, this is not the case, and git diff is poor at identifying these changes. For example, consider this simplified table definition:

CREATE TABLE some_table (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  hostname varchar(128) NOT NULL,
  PRIMARY KEY (id),
  KEY (hostname)
);

Suppose we decide to add a new column and drop the index on hostname. The new schema becomes:

CREATE TABLE some_table (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  hostname varchar(128) NOT NULL,
  time_created TIMESTAMP NOT NULL,
  PRIMARY KEY (id)
);

Running git diff on the two schemas yields the following:

@@ -1,6 +1,6 @@
 CREATE TABLE some_table (
   id int(10) unsigned NOT NULL DEFAULT 0,
   hostname varchar(128) NOT NULL,
-  PRIMARY KEY (id),
-  KEY (hostname)
+  time_created TIMESTAMP NOT NULL,
+  PRIMARY KEY (id)
 );

The pull request’s “Files changed” tab shows the same:

This is a sample Pull Request where we change a table's schema. git diff does a poor job of analyzing the schema change.

See how the PRIMARY KEY line goes into the diff because of the trailing comma. This diff does not capture the schema change well, and while RoR provides tooling for that,  we’ve still had to carefully review them. Fortunately, there’s a good MySQL-oriented tool to do the task.

skeema

skeema is an open source schema management utility developed by Evan Elias. It expects the declarative approach, and looks for a schema definition on your file system (hopefully as part of your repository). The file system layout should include a directory per schema/database, a file per table, and then some special configuration files telling skeema the identities of, and the credentials for, MySQL servers in various environments. Skeema is able to run useful tasks, such as:

  • skeema diff: generate SQL statements that convert the existing database schema into the schema defined in the file system. This includes as many CREATE, ALTER and DROP TABLE statements as needed.
  • skeema push: actually apply changes to the database server for the schema to match the one on file system
  • skeema pull: rewrite the filesystem schema based on the existing schema in the MySQL server.

skeema can do much more, including the ability to invoke online schema change tools—but that’s outside this post’s scope.

Git users will feel comfortable with skeema. Indeed, skeema works very well with git-versioned schemas. For us, the most valuable asset is its diff output: a well formed, reliable set of statements to show the SQL transition from one schema to another. For example, skeema diff output for the above schema change is:

USE `test`;
ALTER TABLE `some_table` ADD COLUMN `time_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, DROP KEY `hostname`;

Note that the above is not only correct, but also formal. It reproduces correctly whether our code uses lower/upper case, includes/omits default value, etc.

We wanted to use skeema to tell us what statements we needed to run to get from our existing state into the state defined in the pull request. Assuming the master branch reflects our current production schema, this now becomes a matter of diffing the schemas between master and the pull request’s branch.

Skeema wasn’t without its challenges, and we had to figure out where to place skeema from a design perspective. Do the developers own it? Does every repository own it? Is there a central service to own it? Each presented its own problems, from false ownership to excessive responsibilities and access.

GitHub Actions

Enter GitHub Actions. With Actions, you’re able to run code as a response to events taking place in your repository. A new pull request, review, comment, issue, and quite a few others, are such events. The code (the action) is arbitrary, and GitHub spawns a container on its own infrastructure, where your code will run. What makes this extra interesting is that the container can get access to your repository. GitHub Actions implicitly receives an API token to interact with the repository.

The container comes with popular software packages pre-installed, such as a MySQL server.

Perhaps the most classic use of Actions is CI/CD.  When  a pull_request event occurs (a new pull request and any subsequent commit) run some code to build, test, lint, or validate the change. We took this approach to run skeema as part of a pull_request action flow, called skeema-diff.

Here’s a simplified breakdown of the action:

  1. Fetch skeema binary
  2. Checkout master branch
  3. Run skeema push to populate the container’s MySQL server with the schema as defined by the master branch
  4. Checkout pull request’s branch
  5. Run skeema diff to generate the statements that take the schema from the one in MySQL (remember, this is the master schema) to the one in the pull request’s branch
  6. Add the diff as a comment in the pull request
  7. Add a special label to indicate this pull request has a schema change

The GitHub Action, running skeema, generates schema diff output, which is added as a comment to the Pull Request. The comment presents the correct ALTER statement implied by the code change. This comment is both human and machine readable.

The code is more complex than what we’ve shown. We actually use base and head instead of master and branch, and there’s some logic to formalize, edit and validate the diff, to handle commits that further change the schema, among other processes.

By now, we have a partial flow, which works entirely on GitHub’s platform:

  • Schema change as code
  • Review process, based on GitHub’s pull request flow
  • Automated schema change analysis, based on skeema running in a GitHub Action
  • A visible output, presented as a pull request comment

Up to this point, everything is constrained to the repository. The repository itself doesn’t have information about where the schema gets deployed in production. This information is something that’s outside the repository’s scope, and it’s owned by the database infrastructure team rather than the repository’s developers. Neither the repository nor any action running on that repository has access to production, nor should they, as that would be a breach of domains.

Before we describe how the schema gets to production, let’s jump ahead and discuss the schema migration itself.

Schema migrations and gh-ost

Even the simplest schema migration isn’t simple. We are concerned with three types of table migrations:

  • CREATE TABLE is the simplest and the safest. We created something that didn’t exist before, and its creation time is instantaneous. Note that if the target cluster is sharded, this must be applied on all shards. If the cluster is sharded with vitess, then the vitess vtgate service automatically handles this for us.
  • DROP TABLE is a simple statement that comes with a great risk. What if it’s still in use and some code breaks as a result of the table going away? Note that we don’t actually drop tables as part of schema migrations. Any DROP TABLE statement is converted into a RENAME TABLE. Instead of DROP TABLE repositories (whoops!), our automation runs RENAME TABLE repositories TO _repositories_DROP_20200101123456. If our application fails because of this, we have an instant revert command: RENAME back to the original. Renamed tables are kept around for a few days prior to being garbage collected and dropped by our automation.
  • ALTER TABLE is the most complex case, mainly because it takes time to alter a table. We don’t actually ALTER tables in-place. We use gh-ost to emulate an ALTER TABLE, and the end result is the same even though the process is completely different. It doesn’t lock our apps, throttles as much as needed, and it’s controllable as well as auditable. We’ve run gh-ost in production for over three and a half years. It has little to no impact on production, and we generally don’t care that it’s running. But some of our larger tables may still take hours or even days to migrate. We also only run one ALTER (or, gh-ost) at a time on a cluster. Concurrent migrations are possible but compete over resources, leading to overall longer runtimes than sequential execution. This means that an ALTER migration requires scheduling. We need to be able to tell if a migration is already running on a cluster, as well as prioritize and queue migrations that apply to the same cluster. We also need to be able to tell the status over the duration of hours or days, and this needs to be communicated to the developer, the owner of the change. And, if the cluster is sharded, we need to run the migration per shard.

In order to run a migration, we must first determine the strategy for that migration (Is it direct query, gh-ost, or a manual?). We need to be able to tell where it can run,  how to go about the process if the cluster is sharded, as well as When to schedule it. While migrations can wait in queue while others are running, we want to be able to prioritize migrations, in case the queue is large.

skeefree

We created skeefree as the glue, which means it’s an orchestrating service that’s aware of our repositories, can communicate with our pull requests, knows about production (or, can get information about production) and which invokes the migrations. We run skeefree as a stateless kubernetes service, backed by a MySQL database that holds the state. Note that skeefree’s own schema is managed by skeefree.

skeefree uses GitHub’s API to interact with pull requests, GitHub’s internal inventory and discovery services, to locate clusters in production, and gh-ost to run migrations. Skeefree is best described by following a schema migration flow:

  1. A developer wishes to change the schema, so they open a pull request.
  2. skeema-diff Action springs to life and seeks a schema change. If a schema change isn’t found in the pull request, nothing happens. If there is a schema change, the Action, computes the change via skeema, adds a well-formed comment to the pull request indicating the change, and adds a migration:skeema:diff label to the pull request. This is done via the GitHub API.
  3. A developer looks into the change, and seeks review from a team member. At this time they may communicate to team members without actually going to production. Finally, they add the label migration:for:review.
  4. skeefree is aware of the developer’s repository and uses the GitHub API to periodically look for open pull requests, which are labeled by both migration:skeema:diff and migration:for:review, and have been approved by at least one developer.
  5. Once detected, skeefree investigates the pull request, and reads the schema change comment, generated by the Action. It maps the schema/repository to the schema/production cluster, and uses our inventory and discovery services to know if the cluster is sharded. Then, it finds the location and name of the cluster.
  6. skeefree then adds this to its backend database, and advertises its analysis on the pull request with another comment. This comment generally means “here’s what I will do if you approve”. And it proceeds to get a review from an authority. Once the user labels the Pull Request as "migration:for:review", skeefree analyzes the migration and evaluates where it needs to run. It proceeds to seek review from an authority.
  7. For most repositories, the authority is the database-infrastructure team. On our original RoR repository, we also seek review from a cross-functional team, known as the db-schema-reviewers, who are familiar with the general application and database design throughout the years and who have more context to offer. skeefree automatically knows which teams should be notified on which repositories.
  8. The relevant teams review and hopefully approve, and skeefree detects the approval, before choosing the proper strategy (direct query for CREATE and DROP, or RENAME), and gh-ost for ALTER. It then queues the migration(s).
  9. skeefree’s scheduler periodically checks what next can be executed. Remember we only run a single ALTER migration on a given cluster at a time, but we also have a limited number of runner hosts. If there’s a free runner host and the cluster is not running any migration, skeefree then proceeds to kick off a migration. Skeefree advertises this fact as a pull request comment to notify the developer that the migration started.
  10. Once the migration is complete, skeefree announces it in a pull request comment. The same applies should the migration fail.
  11. The pull request may also have more than one migration. Perhaps the cluster is sharded, or there may be multiple tables changed in the pull request. Once all migrations are successfully completed, skeefree advertises this in a pull request comment. The developer is notified that all migrations are done, and they’re encouraged to proceed with their standard deploy/merge flow.

as skeefree runs the migrations, it adds comments on the Pull Request page to indicate its progress. When all migrations are complete, skeefree comments as much, again on the pull request page.

Analysis of the flow

There are a few nuances here that make a good experience to everyone involved:

  • The database infrastructure team doesn’t know about the pull request until the developer explicitly adds the migration:for:review label. It’s like a draft pull request or a pull request that’s a work in progress, only this flag applies specifically to the schema migration flow. This allows the developer to use their preferred flow, and communicate with their team without interrupting the database infrastructure team or getting premature reviews.
  • The skeema analysis is contained within the repository, which means That no external service is required. The developer can check the diff result, themselves.
  • The Action is the only part of the flow that looks at the code. Neither skeefree nor gh-ost look at the actual code, and they don’t need git access.
  • The database infrastructure team only needs to take a single step, which is review the pull request.
  • The developers own the creation of pull requests, getting peer reviews, and finally, deploying and merging. These are the exact operations that should be under their ownership. Moreover, they get visibility into the state of their migration. By looking at the pull request page or their GitHub notifications, they can tell whether the pull request has been reviewed, queued, started, completed, or failed. They don’t need to ask. Even better, we have chatops that give visibility into the overall state of migration queue, a running migration’s progress, and more. These chatops are available for all to invoke.
  • The database infrastructure team owns the process of mapping the repository schema to production. This is done via chatops, but can also be completed via configuration. The team is able to cancel a pull request, retry a failed migration, and more.
  • gh-ost is generally trusted, and we have control over a running migration. This means that we can force it to throttle, set up a different throttle threshold, make it use less resources, or terminate it, if needed. We also have a throttling mechanism throughout our stack, so that long running processes like migrations yield to higher priority operations, which extends their own runtime so it doesn’t generate too much load on our database servers.
  • We use our own prefered pull request flow, oActions (skeefree was an early adopter for Actions), GitHub API, and our existing datacenter and database infrastructure, all of which are well understood internally.

Public availability

skeefree and the skeema-diff Action were authored internally at GitHub to solve a specific problem. skeefree uses our internal inventory and discovery services, it works with our chatops and uses some internal libraries.

Our experience in releasing open source software is that no one’s use case is exactly the same as ours. Our perception of an automated migrations flow may be very different from another organization’s perception. We still want to share more than just our words, so we’ve open sourced the code.

It’s a bit of a peculiar OSS release:

  • it’s missing some libraries; it will not build.
  • It expects some of our internal services to exist, which more than likely won’t be on your platform.
  • It expects chatops, and you may not be using chatops.
  • The code also needs to be rewritten for adaptation to your environment,

Note that the code is available, but not open for issues and pull requests. We hope the community finds it useful.

Get the code

The post Automating MySQL schema migrations with GitHub Actions and more appeared first on The GitHub Blog.

Supercharge your command line experience: GitHub CLI is now in beta

Post Syndicated from Billy Griffin original https://github.blog/2020-02-12-supercharge-your-command-line-experience-github-cli-is-now-in-beta/

We’re introducing an easier and more seamless way to work with GitHub from the command line—GitHub CLI, now in beta. Millions of developers rely on GitHub to make building software more fun and collaborative, and gh brings the GitHub experience right to your terminal.

You can install GitHub CLI today on macOS, Windows, and Linux, and there’s more to come as we iterate on your feedback from the beta. It’s available today for GitHub Team and Enterprise Cloud, but not yet available for GitHub Enterprise Server. We’ll be exploring support for Enterprise Server when it’s out of beta.

How can you use GitHub CLI?

We started with issues and pull requests because many developers use them every day. Check out a few examples of how gh can improve your experience when contributing to an open source project and learn more from the manual.

Filter lists to your needs

Find an open source project you want to contribute to and clone the repository. And then, to see where maintainers want community contributions, use gh to filter the issues to only show those with help wanted labels.

Quickly view the details

Find an issue describing a bug that seems like something you can fix, and use gh to quickly open it in the browser to get all the details you need to get started.

Create a pull request

Create a branch, make several commits to fix the bug described in the issue, and use gh to create a pull request to share your contribution.

By using GitHub CLI to create pull requests, it also automatically creates a fork when you don’t already have one, and it pushes your branch and creates your pull request to get your change merged.

View the status of your work

Get a quick snapshot the next morning of what happened since you created your pull request. gh shows the review and check status of your pull requests.

Easily check out pull requests

One of the maintainers reviewed your pull request and requested changes. You probably switched branches since then, so use gh to checkout the pull request branch. We never remember the right commands either!

Make the changes, push them, and soon enough the pull request is merged—congratulations!

Help shape GitHub CLI

We hope you’ll love the foundation we’ve built with pull requests and issues. And we’re even more excited about the future as we explore what it looks like to build a truly delightful experience with GitHub on the command line. As GitHub CLI continues to make it even more seamless to contribute to projects on GitHub, the sky’s the limit on what we can achieve together.

We can’t wait to hear about your experience with GitHub CLI, and we’d love your feedback. Create an issue in our open source repository or provide feedback in our google form. What commands feel like you can’t live without them? What’s clunky or missing? Let us know so we can make GitHub CLI even better.

Learn more about GitHub CLI beta

The post Supercharge your command line experience: GitHub CLI is now in beta appeared first on The GitHub Blog.

How we built the good first issues feature

Post Syndicated from Tiferet Gazit original https://github.blog/2020-01-22-how-we-built-good-first-issues/

GitHub is leveraging machine learning (ML) to help more people contribute to open source. We’ve launched the good first issues feature, powered by deep learning, to help new contributors find easy issues they can tackle in projects that fit their interests.

If you want to start contributing, or to attract new contributors to a project you maintain, get started with our overview of the good first issues feature. Read on to learn about how we leverage machine learning to detect easy issues.

In May 2019, our initial launch surfaced recommendations based on labels that were applied to issues by project maintainers. Some analysis of our data, together with manual curation, led to a list of about 300 label names used by popular open source repositories—all synonyms for either “good first issue” or “documentation”. Some examples include “beginner friendly”, “easy bug fix”, and “low-hanging-fruit”. We include documentation issues because they’re often a good way for new people to begin contributing, but rank them lower relative to issues specifically tagged as easy for beginners.

Relying on these labels, however, means that only about 40 percent of the repositories we recommend have easy issues we can surface. Moreover, it leaves maintainers with the burden of triaging and labeling issues. Instead of relying on maintainers to manually label their issues, we wanted to use machine learning to broaden the set of issues we could surface.

Last month, we shipped an updated version, that includes both label-based and ML-based issue recommendations. With this new version, we’re able to surface issues in about 70 percent of repositories we recommend to users. There is a tradeoff between coverage and accuracy, which is the typical precision and recall tradeoff found in any ML product. To prevent the feed from being swamped with false positive detections, we aim for extremely high precision at the cost of recall. This is necessary because only a tiny minority of all issues are good first issues. The exact numbers are constantly evolving, both as we improve our training data and modeling, and as we adjust the precision and recall tradeoff based on feedback and user behavior.

How it works

As with any supervised machine learning project, the first challenge is building a labeled training set. In this case, manual labeling is a difficult task that requires domain-area expertise in a range of topics, projects, and programming languages. Instead, we’ve opted for a weakly-supervised approach, inferring labels for hundreds of thousands of candidate samples automatically.

To detect positive training samples, we begin with issues that have any of the roughly-300 labels in our curated list, but this training set is not sufficiently large or diverse for our needs. We supplement it with a few sets of issues that are also likely to be beginner-friendly. This includes issues that were closed by a pull request from a user who had never previously contributed to the repository, and issues that were closed by a pull request that touched only a few lines in a single file. Because we prefer to miss some good issues than to swamp the feed with false positives, we define all issues that were not explicitly detected as positive samples to be negative samples in our training set. Naturally, this gives us a hugely imbalanced set, so we subsample the negative set in addition to weighting the loss function. We detect and remove near-duplicate issues, and separate the training, validation, and test sets across repositories to prevent data leakage from similar content.

In order for our classifiers to detect good issues as soon as they’re opened, we train them using only issue titles and bodies, and avoid relying on additional signals such as conversations. The titles and bodies undergo some preprocessing and denoising, such as removing parts of the body that are likely to come from an issue template, since they carry no signal for our problem. We’ve experimented with a range of model variants and architectures, including both classical ML methods such as random forests using tf-idf input vectors and neural network models such as 1D convolutional neural networks and recurrent neural networks. The deep learning models are implemented in TensorFlow, and use one-hot encodings of issue titles and bodies, fed into trainable embedding layers, as separate inputs to networks, with features from the two inputs concatenated towards the top of the network. Unsurprisingly, these networks generally outperform the classical methods using tf-idf, because they can use the full information contained in word ordering, context, and sentence structure, rather than just information on the unordered relative count of words in the vocabulary. Because both the training and inference happen offline, the added cost of deep learning methods is not prohibitive in this case. However, given the limited size of the positive training set, we’ve found various textual data augmentation techniques to be crucial to the success of these networks, in addition to regularization and early stopping. 

To surface issue recommendations given a trained classifier, we run inference on all qualifying open issues from non-archived public repositories. Each issue for which the classifier predicts a probability above the required threshold is slated for recommendation, with a confidence score equal to its predicted probability. We also detect all open issues from non-archived public repositories that have at least one of the labels from the curated label list. These issues are given a confidence score based on the relevance of their labels, with synonyms of “good first issue” given higher confidence than synonyms of “documentation”. In general, label-based detections are given higher confidence than ML-based detections. Within each repository, all detected issues are then ranked primarily based on their confidence score, along with a penalty on issue age.

Our data acquisition, training, and inference pipelines run daily, using scheduled Argo workflows, to ensure our results remain fresh and relevant. As the first deep-learning-enabled product to launch on Github.com, this feature required careful design to ensure that the infrastructure would generalize to future projects.

What’s next

We continue to iterate on the training data, training pipeline, and classifier models to improve the surfaced issue recommendations. In parallel, we’re adding better signals to our repository recommendations to help users find and get involved with the best projects related to their interests. We also plan to add a mechanism for maintainers and triagers to approve or remove ML-based recommendations in their repositories. Finally, we plan on extending issue recommendations to offer personalized suggestions on next issues to tackle for anyone who has already made contributions to a project.

We hope you find these issue recommendations useful.

Learn more about good first issues

The post How we built the good first issues feature appeared first on The GitHub Blog.

Preventing App Performance Degradation Due to Sudden Ride Demand Spikes

Post Syndicated from Grab Tech original https://engineering.grab.com/preventing-app-performance-degradation-due-to-sudden-ride-demand-spikes

In Southeast Asia, when it rains, it pours. It’s a major mood dampener especially if you are stuck outside when the rain starts, you are about to have an awful day.

In the early days of Grab, if the rains came at the wrong time, like during morning rush hour, then we engineers were also in for a terrible day.

In those days, demand for Grab’s ride services grew much faster than our ability to scale our tech system and this often meant clocking late nights just to ensure our system could handle the ever-growing demand. When there’s a massive, sudden spike in ride bookings, our system often struggled to manage the load.

There were also other contributors to demand spikes, for example when public transport services broke down or when a major event such as an international concert ends and event-goers all need a ride at the same time.

Upon reflection, we realized there were two integral aspects to these incidents.

Firstly, they were localized events. The increase in demand came from a particular geographical location; in some cases a very small area. These localized events had the potential to cause so much load on our system that it impacted the experience of other users outside the geolocation.

Secondly, the underlying problem was a lack of drivers (supply) in that particular geographical area.

At Grab, our goal has always been to get everyone a ride when and where they needed it, but in this situation, it was just not possible. We needed to find a way to ensure this localized demand spike did not affect our ability to meet the needs of other users.

Enter the Spampede Filter

The Spampede (a play of the words spam and stampede) filter was inspired by another concept you may have read on this blog, circuit breakers.

In software, as in electronics, circuit breakers are designed to protect a system by short-circuiting in the face of adverse conditions.

Let’s break this down.

There are two key concepts here: short-circuiting and adverse conditions.

Firstly, short-circuiting, in this context means performing minimal processing on a particular booking, and by doing so, reducing the overall load on the system. Secondly, adverse conditions, in this, we refer to a large number of unfulfilled requests for a particular service, from a small geographical area, within a short time window. With these two concepts in mind, we devised the following process.

Spampede Design

First, we needed to track unallocated requests in a location-aware manner. To do this, we convert the requested pickup location of an unallocated request using the Geohash Integer algorithm.  

After the conversion, the resulting value is an exact location. We can convert this location into a “bucket” or area by reducing the precision.

This method is by no means smart or aware of the local geography, but it is incredibly CPU efficient and requires no external resources like network API calls.

Now that we can track unallocated requests, we needed a way for the tracking to be time-aware. After all, traffic conditions, driver locations, and passenger demand are continually changing. We could have implemented something precise like a sliding window sum, but that would have introduced a lot of complexity and a significantly higher CPU and memory cost.

By using the Unix timestamp, we converted the current time to a “bucket” of time by using the straightforward formula:

Event Sourcing

where bs is the size of the time buckets in seconds

With location and time buckets calculated, we can track the unallocated bookings using Redis. We could have used any data store, but Redis was familiar and battle-proven to us.

To do this, we first constructed the Redis key by combining the service type, the geographic location, and the time bucket. With this key, we call the INCR command, which increments the value stored in that location and returns the new value.

If the value returned is 1, this indicates that this is the first value stored for this bucket combination, and we would then make a second call, this time to EXPIRE. With this second call, we would set a time to live (TTL) on the Redis item, allowing the data to be self-cleaning.

You will notice that we are blindly calling increment and only making a second call if needed. This pattern is more efficient and resource-friendly than using a more traditional, load-check-store pattern.

The next step was the configuration. Specifically, setting how many unallocated bookings could happen in a particular location and time bucket before the circuit opened. For this, we decided on Redis again. Again, we could have used anything, but we were already using Redis and, as mentioned previously, quite familiar with it.

Finally, the last piece. We introduced code at the beginning of our booking processing, most importantly, before any calls to any other services and before any significant processing was done. This code compared the location, time, and requested service to the currently configured Spampede setting, along with the previously unallocated bookings. If the maximum had already been reached, then we immediately stopped processing the booking.

This might sound harsh- to immediately refuse a booking request without even trying to fulfill it. But the goal of the Spampede filter is to prevent excessive, localized demand from impacting all of the users of the system.

Conclusion

Reading about this as a programmer, it probably feels strange, intentionally dropping bookings and impacting the business this way.

After all, we want nothing more than to help people get to where they need to be. This process is a system safety mechanism to ensure that the system stays alive and able to do just that.

I would be remiss if I didn’t highlight the critical software-engineering takeaway here is a combination of the Observer effect and the underlying goals of the CAP theorem. Observing a system will influence the system due to the cost of instrumentation and monitoring.

Generally, the higher the accuracy or consistency of the monitoring and limits, the higher the resource cost.

In this case, we have intentionally chosen the most resource-efficient options and traded accuracy for more throughput.

Plumbing At Scale

Post Syndicated from Grab Tech original https://engineering.grab.com/plumbing-at-scale

When you open the Grab app and hit book, a series of events are generated that define your personalised experience with us: booking state machines kick into motion, driver partners are notified, reward points are computed, your feed is generated, etc. While it is important for you to know that a request has been received, a lot happens asynchronously in our back-end services.

As custodians and builders of the streaming platform at Grab operating at massive scale (think terabytes of data ingress each hour), the Coban team’s mission is to provide a NoOps, managed platform for seamless, secure access to event streams in real-time, for every team at Grab.

Coban Sewu Waterfall In Indonesia
Coban Sewu Waterfall In Indonesia. (Streams, get it?)

Streaming systems are often at the heart of event-driven architectures, and what starts as a need for a simple message bus for asynchronous processing of events quickly evolves into one that requires a more sophisticated stream processing paradigms.
Earlier this year, we saw common patterns of event processing emerge across our Go backend ecosystem, including:

  • Filtering and mapping stream events of one type to another
  • Aggregating events into time windows and materializing them back to the event log or to various types of transactional and analytics databases

Generally, a class of problems surfaced which could be elegantly solved through an event sourcing1 platform with a stream processing framework built over it, similar to the Keystone platform at Netflix2.

This article details our journey building and deploying an event sourcing platform in Go, building a stream processing framework over it, and then scaling it (reliably and efficiently) to service over 300 billion events a week.

Event Sourcing

Event sourcing is an architectural pattern where changes to an application state are stored as a sequence of events, which can be replayed, recomputed, and queried for state at any time. An implementation of the event sourcing pattern typically has three parts to it:

  • An event log
  • Processor selection logic: The logic that selects which chunk of domain logic to run based on an incoming event
  • Processor domain logic: The domain logic that mutates an application’s state
Event Sourcing
Event Sourcing

Event sourcing is a building block on which architectural patterns such as Command Query Responsibility Segregation3, serverless systems, and stream processing pipelines are built.

The Case For Stream Processing

Below are some use cases serviced by stream processing, built on event sourcing.

Asynchronous State Management

A pub-sub system allows for change events from one service to be fanned out to multiple interested subscribers without letting any one subscriber block the progress of others. Abstracting the event log and centralising it democratises access to this log to all back-end services. It enables the back-end services to apply changes from this centralised log to their own state, independent of downstream services, and/or publish their state changes to it.

Time Windowed Aggregations

Time-windowed aggregates are a common requirement for machine learning models (as features) as well as analytics. For example, personalising the Grab app landing page requires counting your interaction with various widget elements in recent history, not any one event in particular. Similarly, an analyst may not be interested in the details of a singular booking in real-time, but in building demand heatmaps segmented by geohashes. For latency-sensitive lookups, especially for the personalisation example, pre-aggregations are preferred instead of post-aggregations.

Stream Joins, Filtering, Mapping

Event logs are typically sharded by some notion of topics to logically divide events of interest around a theme (booking events, profile updates, etc.). Building bigger topics out of smaller ones, as well as smaller ones from bigger ones are common ways to compose “substreams” of the log of interest directed towards specific services. For example, a promo service may only be interested in listening to booking events for promotional bookings.

Realtime Business Intelligence

Outputs of stream processing workloads are also plugged into realtime Business Intelligence (BI) and stream analytics solutions upstream, as raw data for visualizations on operations dashboards.

Archival

For offline analytics, as well as reconciliation and disaster recovery, having an archive in a cold store helps for certain mission critical streams.

Platform Requirements

Any processing platform for event sourcing and stream processing has certain expectations around its functionality.

Scaling and Elasticity

Stream/Event Processing pipelines need to be elastic and responsive to changes in traffic patterns, especially considering that user activity (rides, food, deliveries, payments) varies dramatically during the course of a day or week. A spike in food orders on rainy days shouldn’t cause indefinite order processing latencies.

NoOps

For a platform team, it’s important that users can easily onboard and manage their pipeline lifecycles, at their preferred cadence. To scale effectively, the process of scaffolding, configuring, and deploying pipelines needs to be standardised, and infrastructure managed. Both the platform and users are able to leverage common standards of telemetry, configuration, and deployment strategies, and users benefit from a lack of infrastructure management overhead.

Multi-Tenancy

Our platform has quickly scaled to support hundreds of pipelines. Workload isolation, independent processing uptime guarantees, and resource allocation and cost audit are important requirements necessitating multi-tenancy, which help amortize platform overhead costs.

Resiliency

Whether latency sensitive or latency tolerant, all workloads have certain expectations on processing uptime. From a user’s perspective, there must be guarantees on pipeline uptimes and data completeness, upper bounds on processing delays, instrumentation for alerting, and self-healing properties of the platform for remediation.

Tunable Tradeoffs

Some pipelines are latency sensitive, and rely on processing completeness seconds after event ingress. Other pipelines are latency tolerant, and can tolerate disruption to processing lasting in tens of minutes. A one size fits all solution is likely to be either cost inefficient or unreliable. Having a way for users to make these tradeoffs consciously becomes important for ensuring efficient processing guarantees at a reasonable cost. Similarly, in the case of upstream failures or unavailability, being able to tune failure modes (like wait, continue, or retry) comes in handy.

Stream Processing Framework

While basic event sourcing covers simple use cases like archival, more complicated ones benefit from a common framework that shifts the mental model for processing from per event processing to stream pipeline orchestration.
Given that Go is a “paved road” for back-end development at Grab, and we have service code and bindings for streaming data in a mono-repository, we built a Go framework with a subset of capabilities provided by other streaming frameworks like Flink4.

Logic Blocks In A Stream Processing Pipeline
Logic Blocks In A Stream Processing Pipeline

Capabilities

Some capabilities built into the framework include:

  • Deduplication: Enables pipelines to idempotently reprocess data in case of rewinds/replays, and provides some processing guarantees within a time window for certain use cases including sinking to datastores.
  • Filtering and Mapping: An ability to filter a source stream data and map them onto target streams.
  • Aggregation: An ability to generate and execute aggregation logic such as sum, avg, max, and min in a window.
  • Windowing: An ability to window processing into tumbling, sliding, and session windows.
  • Join: An ability to combine two streams together with certain join keys in a window.
  • Processor Chaining: Various functionalities can be chained to build more complicated pipelines from simpler ones. For example: filter a large stream into a smaller one, aggregate it over a time window, and then map it to a new stream.
  • Rewind: The ability to rewind the processing logic by a few hours through configuration.
  • Replay: The ability to replay archived data into the same or a separate pipeline via configuration.
  • Sinks: A number of connectors to standard Grab stores are provided, with concerns of auth, telemetry, etc. managed in the runtime.
  • Error Handling: Providing an easy way to indicate whether to wait, skip, and/or retry in case of upstream failures is an important tuning parameter that users need for making sensible tradeoffs in dimensions of backpressure, latency, correctness, etc.

Architecture

Coban Platform
Coban Platform

Our event log is primarily a bunch of critical Kafka clusters, which are being polled by various pipelines deployed by service teams on the platform for incoming events. Each pipeline is an isolated deployment, has an identity, and the ability to connect to various upstream sinks to materialise results into, including the event log itself.
There is also a metastore available as an intermediate store for processing pipelines, so the pipelines themselves are stateless with their lifecycle completely beholden to the whims of their owners.

Anatomy of a Processing Pipeline

Anatomy Of A Stream Processing Pod
Anatomy Of A Stream Processing Pod

Anatomy of a Stream Processing Pod
Each stream processing pod (the smallest unit of a pipeline’s deployment) has three top level components:

  • Triggers: An interface that connects directly to the source of the data and converts it into an event channel.
  • Runtime: This is the app’s entrypoint and the orchestrator of the pod. It manages the worker pools, triggers, event channels, and lifecycle events.
  • The Pipeline plugin: The plugin is provided by the user, and conforms to a contract that the platform team publishes. It contains the domain logic for the pipeline and houses the pipeline orchestration defined by a user based on our Stream Processing Framework.

Deployment Infrastructure

Our deployment infrastructure heavily leverages Kubernetes on AWS. After a (pretty high) initial cost for infrastructure set up, we’ve found scaling to hundreds of pipelines a breeze with the Kubernetes provided controls. We package our stateless pipeline workloads into Kubernetes deployments, with each pod containing a unit of a stream pipeline, with sidecars that integrate them with our monitoring systems. Other cluster wide tooling deployed (usually as DaemonSets) deal with metric collection, log ingestion, and autoscaling. We currently use the Horizontal Pod Autoscaler5 to manage traffic elasticity, and the Cluster Autoscaler6 to manage worker node scaling.

Kubernetes
A Typical Kubernetes Set Up On AWS

Metastore

Some pipelines require storage for use cases ranging from deduplication to stores for materialised results of time windowed aggregations. All our pipelines have access to clusters of ScyllaDB instances (which we use as our internal store), made available to pipeline authors via interfaces in the Stream Processing Framework. Results of these aggregations are then made available to backend services via our GrabStats service, which is a thin query layer over the latest pipeline results.

Compute Isolation

A nice property of packaging pipelines as Kubernetes deployments is a good degree of compute workload isolation for pipelines. While node resources of pipeline pods are still shared (and there are potential noisy neighbour issues on matters like logging throughput), the pipeline pods of various pods can be scheduled and rescheduled across a wide range of nodes safely and swiftly, with minimal impact to pods of other pipelines.

Redundancy

Stateless processing pods mean we can set up backup or redundant Kubernetes clusters in hot-hot, hot-warm, or hot-cold modes. We use this to ensure high processing availability despite limited control plane guarantees from any single cluster. (Since EKS SLAs for the Kubernetes control plane guarantee only 99.9% uptime today7.) Transparent to our users, we make the deployment systems aware of multiple available targets for scheduling.

Availability vs Cost

As alluded to in the “Platform Requirements” section, having a way of trading off availability for cost becomes important where the requirements and criticality of each processing pipeline are very different. Given that AWS spot instances are a lot cheaper8 than on-demand ones, we use user annotated Kubernetes priority classes to determine deployment targets for pipelines. For latency tolerant pipelines, we schedule them on Spot instances which are routinely between 40-90% cheaper than on demand instances on which latency sensitive pipelines run. The caveat is that Spot instances occasionally disappear, and these workloads are disrupted until a replacement node for their scheduling can be found.

What’s Next?

  • Expand the ecosystem of triggers to support custom sources of data i.e. the “event log”, as well as push based (RPC driven) versus just pull based triggers
  • Build a control plane for API integration with pipeline lifecycle management
  • Move some workloads to use the Vertical Pod Autoscaler9 in Kubernetes instead of horizontal scaling, as most of our workloads have a limit on parallelism (which is their partition count in Kafka topics)
  • Move from Go plugins for pipelines to plugins over RPC, like what HashiCorp does10, to enable processing logic in non-Go languages.
  • Use either pod gossip or a service mesh with a control plane to set up quotas for shared infrastructure usage per pipeline. This is to protect upstream dependencies and the metastore from surges in event backlogs.
  • Improve availability guarantees for pipeline pods by occasionally redistributing/rescheduling pods across nodes in our Kubernetes cluster to prevent entire workloads being served out of a few large nodes.

Authored By Karan Kamath on behalf of the Coban team at Grab-
Zezhou Yu, Ryan Ooi, Hui Yang, Yuguang Xiao, Ling Zhang, Roy Kim, Matt Hino, Jump Char, Lincoln Lee, Jason Cusick, Shrinand Thakkar, Dean Barlan, Shivam Dixit, Shubham Badkur, Fahad Pervaiz, Andy Nguyen, Ravi Tandon, Ken Fishkin, and Jim Caputo.


Footnotes

Coban Sewu Waterfall Photo by Dwinanda Nurhanif Mujito on Unsplash

Cover Photo by tian kuan on Unsplash

  1. https://martinfowler.com/eaaDev/EventSourcing.html 

  2. https://medium.com/netflix-techblog/keystone-real-time-stream-processing-platform-a3ee651812a 

  3. https://martinfowler.com/bliki/CQRS.html 

  4. https://flink.apache.org 

  5. https://kubernetes.io/docs/tasks/run-application/horizontal-pod-autoscale/ 

  6. https://github.com/kubernetes/autoscaler/tree/master/cluster-autoscaler 

  7. https://aws.amazon.com/eks/sla/ 

  8. https://aws.amazon.com/ec2/pricing/ 

  9. https://github.com/kubernetes/autoscaler/tree/master/vertical-pod-autoscaler 

  10. https://github.com/hashicorp/go-plugin 

Journey to a Faster Everyday Super App Where Every Millisecond Counts

Post Syndicated from Grab Tech original https://engineering.grab.com/journey-to-a-faster-everyday-super-app

Introduction

At Grab, we are moving faster than ever. In 2019 alone, we released dozens of new features in the Grab passenger app. With our goal to delight users in Southeast Asia with a powerful everyday super app, the app’s performance became one of the most critical components in delivering that experience to our users.

This post narrates the journey of our performance improvement efforts on the Grab passenger app. It highlights how we were able to reduce the time spent starting the app by more than 60%, while preventing regressions introduced by new features. We use the p95 scale when referring to these improvements.

Here’s a quick look at the improvements and timeline:

Improvements Timeline

Improving App Performance

While app performance consists of different aspects – such as battery consumption rate, network performance, app responsiveness, etc. – the first thing users notice is the time it takes for an app to start. Apps that take too long to load frustrate users, leading to bad reviews and uninstalls.

We focused our efforts on the app’s time to interactive(TTI), which consists of two main operations:

  • Starting the app
  • Displaying interactive service tiles (these are the icons for the services offered on the app such as Transport, Food, Delivery, and so on)

There are many other operations that occur in the background, which we won’t cover in this article.

We prioritised on optimising the app’s ability to load the service tiles (highlighted in the image below) and render them as interactive upon startup (cold start). This allowed users to use the app as soon as they launch it.

Service Tiles

Instrumentation and Benchmarking

Before we could start improving the app’s performance, we needed to know where we stood and set measurable goals.

We couldn’t get a baseline from local performance testing as it did not simulate the real environment condition, where network variability and device performance are contributing factors. Thus, we needed to use real production data to get an accurate reflection of our current performance at a scale. In production, we measured the performance of ~8-9 millions users per day – a small subset of our overall active user base.

As a start, we measured the different components contributing to TTI, such as binary loading, library initialisations, and tiles loading. For example, if we had to measure the time taken by function A, this is how it looked like in the code:

functionA (){
// start the timer
....
....
...
//Stop the timer, calculate the time difference and send it as an analytic event
}

With all the numbers from the contributing components, we took the sum to calculate the full TTI (as shown in the following image).

Full TTI

When the numbers started rolling in from production, we needed specific measurements to interpret those numbers, so we started looking at TTI’s 50th, 90th, and 95th percentile. A 90th percentile (p90) of x seconds means that 90% of the users have an interactive screen in at most x seconds.

We chose to only focus on p50 and p95 as these cover the majority of our users who deal with performance issues. Improving performance for <p50 (who already have high-end devices) would not bring too much of a value, and improving for >p95 would be very difficult as the app performance improvements will be limited by device performance.

By the end of January, we got the p50, p90, and p95 numbers for the contributing components that summed up to TTI numbers for tiles, which allowed us to start identifying areas with potential improvements.

Caching and Animation Removal

While reviewing the TTI numbers, we were drawn to contributors with high time consumption rates such as tile loading and app start animation. Other evident improvement we worked on was caching data between app launches instead of waiting for a network response for loading tiles at every single app launch.

Tile Caching

Based on the gathered data, the service tiles only change when a user travels between cities. This is because the available services vary in each city. Since users do not frequently change cities, the service tiles do not change very frequently either, and so caching the tiles made sense. However, we also needed to sync the fresh tiles, in case of any change. So, we updated the logic based on these findings. as illustrated in the following image:

Tile Caching Logic

Caching tiles brought us a huge improvement of ~3s on each platform.

Animation Removal

We came across a beautifully created animation at appstart that didn’t provide any additional value in terms of information or practicality.

With detailed discussions and trade-offs with designers, we removed the animation and improved our TTI further by 1s.

In conclusion, with the caching and animation removal alone, we improved the TTI by 4s.

Welcome Static Linking and Coroutines

At this point, our users gained 4 seconds of their time back, but we didn’t want to stop with that number. So, we dug through the data to see what further enhancements we could do. When we could not find anything else that was similar to caching and animation removal, we shifted to architecture fundamentals.

We knew that this was not an easy route to take and that it would come with a cost; if we decided to choose a component related to architecture fundamentals, all the other teams working on the Grab app would be impacted. We had to evaluate our options and make decisions with trade-offs for overall improvements. And this eventually led to static linking on iOS and coroutines on Android.

Binary Loading

Binary loading is one of the first steps in both mobile platforms when an app is launched. It primarily contributes to pre-main and dex-loading, on iOS and Android respectively.

The pre-main time on iOS was about 7.9s. It is known in the iOS development world that each framework (binary) can either be dynamically or statically linked. While static helps in a faster app start, it brings complexity in building frameworks that are elaborate or contain resources bundles.Building a lot of libraries statically also impact build times negatively.With proper evaluations, we decided to take the route to enable more static linking due to the trade-offs.

Apple recommends a maximum of half a dozen dynamic frameworks for an optimal performance. Guess what? Our passenger app had 107 dynamically linked frameworks, a lot of them were internal.

The task looked daunting at first, since it affected all parts of the app, but we were ready to tackle the challenge head on. Deciding to take this on was the easy part, the actual work entailed lots of tricky coordination and collaboration with multiple teams.

We created an RFC (Request For Comments) doc to propose the static linking of frameworks, wherever applicable, and co-ordinated with teams with the agreed timelines to execute this change.

While collaborating with teams, we learned that we could remove 12 frameworks entirely that were no longer required. This exercise highlighted the importance of regular cleanup and deprecation in our codebase, and was added into our standard process.

And so, we were left with 95 frameworks; 75 of which were statically linked successfully, resulting in our p90 pre-main dropping by 41%.

As Grabbers, it’s in our DNA to push ourselves a little more. With the remaining 20 frameworks, our pre-main was still considerably high. Out of the 20 frameworks, 10 could not be statically linked without issues. As a workaround, we merged multiple dynamic frameworks into one. One of our outstanding engineers even created a plug-in for this, which is called the Cocoapod Merge. With this plug-in, we were able to merge 10 dynamically linked frameworks into 2. We’ve made this plug-in open source: https://github.com/grab/cocoapods-pod-merge.

With all of the above steps, we were finally left with 12 dynamic frameworks – a huge 88% reduction.

The following image illustrates the complex numbers mentioned above:

Static Linking

Using cocoapod merge further helped us with ~0.8s of improvement.

Coroutines

While we were executing the static linking initiative on iOS, we also started refactoring the application initialisation for a modular and clean code on Android. This resulted in creating an ApplicationInitialiser class, which handles the entire application initialisation process with maximum parallelism using coroutines.

Now all the libraries are being initialised in parallel via coroutines and thus enabling better utilisations of computing resources and a faster TTI.

This refactoring and background initialisation for libraries on Android helped in gaining ~0.4s of improvements.

Changing the Basics – Visualisation Setup

By the end of H1 2019, we observed a 50% improvement in TTI, and now it was time to set new goals for H2 2019. Until this point, we would query our database for all metric numbers, copy the numbers into a spreadsheet, and compare them against weeks and app versions.

Despite the high loads of manual work and other challenges, this method still worked at the beginning due to the improvements we had to focus on.

However, in H2 2019 it became apparent that we had to reassess our methodology of reading numbers. So, we started thinking about other ways to present and visualise these numbers better. With help from our Product Analyst, we took advantage of metabase’s advanced capabilities and presented our goals and metrics in a clear and easy to understand format.

For example, here is a graph that shows the top contributing metrics for Android:

Android Metrics

Looking at it, we could clearly tell which metric needed to be prioritised for improvements.

We did this not only for our metrics, but also for our main goals, which allowed us to easily see our progress and track our improvements on a daily basis.

Visualisation

The color bars in the above image depicts the status of our numbers against our goals and also shows the actual numbers at p50, p90, and p95.

As our tracking progressed, we started including more granular and precise measurements, to help guide the team and achieve more impactful improvements of around ~0.3-0.4s.

Fortunately, we were deprecating a third-party library for analytics and experimentation, which happened to be one of the highest contributing metrics for both platforms due to a high number of operations on the main thread. We started using our own in-house experimentation platform where we had better control over performance. We removed this third-party dependency, and it helped us with huge improvements of ~2.5s on Android and ~0.5-0.7s on iOS.

You might be wondering as to why there is such a big difference on the iOS and Android improvement numbers for this dependency. This was due to the setting user attributes operations that ran only in the Android codebase, which was performed on the main thread and took a huge amount of time. These were the times that made us realise that we should focus more on the consistency for both platforms, as well as to identify the third-party library APIs that are used, and to assess whether they are absolutely necessary.

*Tip*: So, it is time for you as well to eliminate such inconsistencies, if there are any.

Ok, there goes our third quarter with ~3s of improvement on Android and ~1.3s on iOS.

Performance Regression Detection

Entering into Q4 brought us many challenges as we were running out of improvements to make. Even finding an improvement worth ~0.05s was really difficult! We were also strongly challenged by regressions (increase in TTI numbers) because of continuous feature releases and code additions to the app start process.

So, maintaining the TTI numbers became our primary task for this period. We started looking into setting up processes to block regressions from being merged to the master, or at least get notified before they hit production.

To begin with, we identified the main sources of regressions: static linking breakage on iOS and library initialisation in the app startup process on Android.

We took the following measures to cover these cases:

Linters

We built linters on the Continuous Integration (CI) pipeline to detect potential changes in static linking on iOS and the ApplicationInitialiser class on Android. The linters block the changelist and enforce a special review process for such changes.

Library Integration Process

The team also focused on setting up a process for library integrations, where each library (internal or third party) will first be evaluated for performance impact before it is integrated into the codebase.

While regression guarding was in process, we were simultaneously trying to bring in more improvements for TTI. We enabled the Link Time Optimisations (LTO) flag on iOS to improve the overall app performance. We also experimented on order files on iOS and anko layout on Android, but were ruled out due to known issues.

On Android, we hit the bottom hard as there were minimal improvements. Fortunately, it was a different story for iOS. We managed to get improvements worth ~0.6s by opting for lazy loading, optimising I/O operations, and deferring more operations to post app start (if applicable).

Next Steps

We will be looking at the different aspects of performance such as network, battery, and storage, while maintaining our current numbers for TTI.

  • Network performance – Track the turnaround time for network requests then move on to optimisations.
  • Battery performance – Focus on profiling the app for CPU and energy intensive operations, which drains the battery, and then move to optimisations.
  • Storage performance – Review our caching and storage mechanisms, and then look for ways to optimise them.

In addition to these, we are also focusing on bringing performance initiatives for all the teams at Grab. We believe that performance is a collaborative approach, and we would like to improve the app performance in all aspects.

We defined different metrics to track performance e.g. Time to Interactive, Time to feedback (the time taken to get the feedback for a user action), UI smoothness indicators, storage, and network metrics.

We are enabling all teams to benchmark their performance numbers based on defined metrics and move on to a path of improvement.

Conclusion

Overall, we improved by 60%, and this calls for a big celebration! Woohoo! The bigger celebration came from knowing that we’ve improved our customers’ experience in using our app.

This graph represents our performance improvement journey for the entire 2019, in terms of TTI.

Performance Graph

Based on the graph, looking at our p95 improvements and converting them to number of hours saved per day gives us ~21,388 hours on iOS and ~38,194 hours saved per day on Android.

Hey, did you know that it takes approximately 80-85 hours to watch all the episodes of Friends? Just saying. 🙂

We will continue to serve our customers for a better and faster experience in the upcoming years.

Marionette – Enabling E2E user-scenario simulation

Post Syndicated from Grab Tech original https://engineering.grab.com/marionette-enabling-e2e-user-scenario-simulation

Introduction

A plethora of interconnected microservices is what powers the Grab’s app. The microservices work behind the scenes to delight millions of our customers in Southeast Asia. It is a no-brainer that we emphasize on strong testing tools, so our app performs flawlessly to continuously meet our customers’ needs.

Background

We have a microservices-based architecture, in which microservices are interconnected to numerous other microservices. Each passing day sees teams within Grab updating their microservices, which in turn enhances the overall app. If any of the microservices fail after changes are rolled out, it may lead to the whole app getting into an unstable state or worse. This is a major risk and that’s why we stress on conducting “end-to-end (E2E) testing” as an integral part of our software test life-cycle.

E2E tests are done for all crucial workflows in the app, but not for every detail. For that we have conventional tests such as unit tests, component tests, functional tests, etc. Consider E2E testing as the final approval in the quality assurance of the app.

Writing E2E tests in the microservices world is not a trivial task. We are not testing just a single monolithic application. To test a workflow on the app from a user’s perspective, we need to traverse multiple microservices, which communicate through different protocols such as HTTP/HTTPS and TCP. E2E testing gets even more challenging with the continuous addition of microservices. Over the years, we have grown tremendously with hundreds of microservices working in the background to power our app.

Some major challenges in writing E2E tests for the microservices-based apps are:

  • Availability

    Getting all microservices together for E2E testing is tough. Each development team works independently and is responsible only for its microservices. Teams use different programming languages, data stores, etc for each microservice. It’s hard to construct all pieces in a common test environment as a complete app for E2E testing each time.

  • Data or resource set up

    E2E testing requires comprehensive data set up. Otherwise, testing results are affected because of data constraints, and not due to any recent changes to underlying microservices. For example, we need to create real-life equivalent driver accounts, passenger accounts, etc and to have those, there are a few dependencies on other internal systems which manage user accounts. Further, data and booking generation should be robust enough to replicate real-world scenarios as far as possible.

  • Access and authentication

    Usually, the test cases require sequential execution in E2E testing. In a microservices architecture, it is difficult to test a workflow which requires access and permissions to several resources or data that should remain available throughout the test execution.

  • Resource and time intensive

    It is expensive and time consuming to run E2E tests; significant time is involved in deploying new changes, configuring all the necessary test data, etc.

Though there are several challenges, we had to find a way to overcome them and test workflows from the beginning to the end in our app.

Our approach to overcome challenges

We knew what our challenges were and what we wanted to achieve from E2E testing, so we started thinking about how to develop a platform for E2E tests. To begin with, we determined that the scope of E2E testing that we’re going to primarily focus on is Grab’s transport domain — the microservices powering the driver and passenger apps.

One approach is to “simulate” user scenarios through a single platform before any new versions of these microservices are released. Ideally, the platform should also have the capabilities to set up the data required for these simulations. For example, ride booking requires data set up such as driver accounts, passenger accounts, location coordinates, geofencing, etc.

We wanted to create a single platform that multiple teams could use to set up their test data and run E2E user-scenario simulations easily. We put ourselves to work on that idea, which resulted in the creation of an internal platform called “Marionette”. It simulates actions performed by Grab’s passenger and driver apps as they are expected to behave in the real world. The objective is to ensure that all standard user workflows are tested before deploying new app versions.

Introducing Marionette

Marionette enables Grabbers (developers and QAs) to run E2E user-scenario simulations without depending on the actual passenger and driver apps. Grabbers can set up data as well as configure data such as drivers, passengers, taxi types, etc to mimic the real-world behavior.

Let’s look at the overall architecture to understand Marionette better:

Overall Architecture

Grabbers can interact with Marionette through three channels: UI, SDK, and through RESTful API endpoints in their test scripts. All requests are routed through a load balancer to the Marionette platform. The Marionette platform in turn talks to the required microservices to create test data and to run the simulations.

The benefits

With Marionette, Grabbers now have the ability to:

  • Simulate the whole booking flow including customer and driver behavior as well as transition through the booking life cycle including pick-up, drop-off, cancellation, etc. For example, developers can make passenger booking from the UI and configure pick-up points, drop-off points, taxi types, and other parameters easily. They can define passenger behaviour such as “make bookings after a specified time interval”, “cancel each booking”, etc. They can also set driver locations, define driver behaviour such as “always accept booking manually”, “decline received bookings”, etc.
  • Simulate bookings in all cities where Grab operates. Further, developers can run simulations for multiple Grab taxi types such as JustGrab, GrabShare, etc.
  • Visualize passengers, drivers, and ride transitions on the UI, which lets them easily test their workflows.
  • Save efforts and time spent on installing third-party android or iOS emulators, troubleshooting or debugging .apk installation files, etc before testing workflows.
  • Conduct E2E testing without real mobile devices and installed apps.
  • Run automatic simulations, in which a particular set of scenarios are run continuously, thus helping developers with exploratory testing.

How we isolated simulations among users

It is important to have independent simulations for each user. Otherwise, simulations don’t yield correct results. This was one of the challenges we faced when we first started running simulations on Marionette.

To resolve this issue, we came up with the idea of “cohorts”. A cohort is a logical group of passengers and drivers who are located in a particular city. Each simulation on Marionette is run using a “cohort” containing the number of drivers and passengers required for that simulation. When a passenger/driver needs to interact with other passengers/drivers (such as for ride bookings), Marionette ensures that the interaction is constrained to resources within the cohort. This ensures that drivers and passengers are not shared in different test cases/simulations, resulting in more consistent test runs.

How to interact with Marionette

Let’s take a look at how to interact with Marionette starting with its user interface first.

User Interface

The Marionette UI is designed to provide the same level of granularity as available on the real passenger and driver apps.

Generally, the UI is used in the following scenarios:

  • To test common user scenarios/workflows after deploying a change on staging.
  • To test the end-to-end booking flow right from the point where a passenger makes a booking till drop-off at the destination.
  • To simulate functionality of other teams within Grab – the passenger app developers can simulate the driver app for their testing and vice versa. Usually, teams work independently and the ability to simulate the dependent app for testing allows developers to work independently.
  • To perform E2E testing (such as by QA teams) without writing any test scripts.

The Marionette UI also allows Grabbers to create and set up data. All that needs to be done is to specify the necessary resources such as number of drivers, number of passengers, city to run the simulation, etc. Running E2E simulations involves just the click of a button after data set up. Reports generated at the end of running simulations provide a graphical visualization of the results. Visual reports save developers’ time, which otherwise is spent on browsing through logs to ascertain errors.

SDK

Marionette also provides an SDK, written in the Go programming language.

It lets developers:

  • Create resources such as passengers, drivers, and cohorts for simulating booking flows.
  • Create booking simulations in both staging and production.
  • Set bookings to specific states as needed for simulation through customizable driver and passenger behaviour.
  • Make HTTP requests and receive responses that matter in tests.
  • Run load tests by scaling up booking requests to match the required workload (QPS).

Let’s look at a high-level booking test case example to understand the simulation workflow.

Assume we want to run an E2E booking test with this driver behavior type — “accepts passenger bookings and transits between booking states according to defined behavior parameters”. This is just one of the driver behavior types in Marionette; other behavior types are also supported. Similarly, passengers also have behaviour types.

To write the E2E test for this example case, we first define the driver behavior in a function like this:

Overall Architecture

Then, we handle the booking request for the driver like this:

Overall Architecture

The SDK client makes the handling of passengers, drivers, and bookings very easy as developers don’t need to worry about hitting multiple services and multiple APIs to set up their required driver and passenger actions. Instead, teams can focus on testing their use cases.

To ensure that passengers and drivers are isolated in our test, we need to group them together in a cohort before running the E2E test.

Overall Architecture

In summary, we have defined the driver’s behavior, created the booking request, created the SDK client and associated the driver and passenger to a cohort. Now, we just have to trigger the E2E test from our IDE. It’s just that simple and easy!

Previously, developers had to write boilerplate code to make HTTP requests and parse returned HTTP responses. With the Marionette SDK in place, developers don’t have to write any boilerplate code saving significant time and effort in E2E testing.

RESTful APIs in test scripts

Marionette provides several RESTful API endpoints that cover different simulation areas such as resource or data creation APIs, driver APIs, passenger APIs, etc. APIs are particularly suitable for scripted testing. Developers can directly call these APIs in their test scripts to facilitate their own tests such as load tests, integration tests, E2E tests, etc.

Developers use these APIs with their preferred programming languages to run simulations. They don’t need to worry about any underlying complexities when using the APIs. For example, developers in Grab have created custom libraries using Marionette APIs in Python, Java, and Bash to run simulations.

What’s next

Currently, we cover E2E tests for our transport domain (microservices for the passenger and driver apps) through Marionette. The next phase is to expand into a full-fledged platform that can test microservices in other Grab domains such as Food, Payments, and so on. Going forward, we are also looking to further simplify the writing of E2E tests and running them as a part of the CD pipeline for seamless testing before deployment.

In conclusion

We had an idea of creating a simulation platform that can run and facilitate E2E testing of microservices. With Marionette, we have achieved this objective. Marionette has helped us understand how end users use our apps, allowing us to make improvements to our services. Further, Marionette ensures there are no breaking changes and provides additional visibility into potential bugs that might be introduced as a result of any changes to microservices.

If you have any comments or questions about Marionette, please leave a comment below.

Behind the scenes: GitHub security alerts

Post Syndicated from Justin Hutchings original https://github.blog/2019-12-11-behind-the-scenes-github-vulnerability-alerts/

If you have code on GitHub, chances are that you’ve had a security vulnerability alert at some point. Since the feature launched, GitHub has sent more than 62 million security alerts for vulnerable dependencies.

How does it work?

Vulnerability alerts rely on two pieces of data: an inventory of all the software that your code depends on, and a curated list of known vulnerabilities in open-source code. 

Any time you push a change to a dependency manifest file, GitHub has a job that parses those manifest files, and stores your dependency on those packages in the dependency graph. If you’re dependent on something that hasn’t been seen before, a background task runs to get more information about the package from the package registries themselves and adds it. We use the information from the package registries to establish the canonical repository that the package came from, and to help populate metadata like readmes, known versions, and the published licenses. 

On GitHub Enterprise Server, this process works identically, except we don’t get any information from the public package registries in order to protect the privacy of the server and its code. 

The dependency graph supports manifests for JavaScript (npm, Yarn), .NET (Nuget), Java (Maven), PHP (Composer), Python (PyPI), and Ruby (Rubygems). This data powers our vulnerability alerts, but also dependency insights, the used by badge, and the community contributors experiences. 

Beyond the dependency graph, we aggregate data from a number of sources and curate those to bring you actionable security alerts. GitHub brings in security vulnerability data from a number of sources, including the National Vulnerability Database (a service of the United States National Institute of Standards and Technology), maintainer security advisories from open-source maintainers, community datasources, and our partner WhiteSource

Once we learn about a vulnerability, it passes through an advanced machine learning model that’s trained to recognize vulnerabilities which impact developers. This model rejects anything that isn’t related to an open-source toolchain. If the model accepts the vulnerability, a bot creates a pull request in a GitHub private repository for our  team of curation experts to manually review.

GitHub curates vulnerabilities because CVEs (Common Vulnerability Entries) are often ambiguous about which open-source projects are impacted. This can be particularly challenging when multiple libraries with similar names exist, or when they’re a part of a larger toolkit. Depending on the kind of vulnerability, our curation team may follow-up with outside security researchers or maintainers about the impact assessment. This follow-up helps to confirm that an alert is warranted and to identify the exact packages that are impacted. 

Once the curation team completes the mappings, we merge the pull request and it starts a background job that notifies users about any affected repositories. Depending on the vulnerability, this can cause a lot of alerts. In a recent incident, more than two million repositories were alerted about a vulnerable version of lodash, a popular JavaScript utility library.

GitHub Enterprise Server customers get a slightly different experience. If an admin has enabled security vulnerability alerts through GitHub Connect, the server will download the latest curated list of vulnerabilities from GitHub.com over the private GitHub Connect channel on its next scheduled sync (about once per hour). If a new vulnerability exists, the server determines the impacted users and repositories before generating alerts directly. 

Security vulnerabilities are a matter of public good. High-profile breaches impact the trustworthiness of the entire tech industry, so we publish a curated set of vulnerabilities on our GraphQL APIs for community projects and enterprise tools to use in custom workflows as necessary. Users can also browse the known vulnerabilities from public sources on the GitHub Advisory Database.

Engineers behind the feature

Despite advanced technology, security alerting is a human process driven by dedicated GitHubbers. Meet Rob (@rschultheis), one of the core members of our security team, and learn about his experiences at GitHub through a friendly Q&A:

Humphrey Dogart (German Shepherd) and Rob Schultheis (Software Engineer on the GitHub Security team)

How long have you been with GitHub? 

Two years

How did you get into software security? 

I’ve worked with open source software for most of my 20 year career in tech, and honestly for much of that time I didn’t pay much attention to security. When I started at GitHub I was given the opportunity to work on the first iteration of security alerts. It quickly became clear that having a high quality, open dataset was going to be a critical factor in the success of the feature. I dove into the task of curating that advisory dataset and found a whole side to the industry that was open for exploration, and I’ve stayed with it ever since!

What are the trickiest parts of vulnerability curation? 

The hardest problem is probably confirming that our advisory data correctly identifies which version(s) of a package are vulnerable to a given advisory, and which version(s) first address it.

What was the most difficult security vulnerability you’ve had to publish? 

One memorable vulnerability was CVE-2015-9284. This one was tough in several ways because it was a part of a popular library, it was also unpatched when it became fully public, and finally, it was published four years after the initial disclosure to maintainers. Even worse, all attempts to fix it had stalled.

We ended up proceeding to publish it and the community quickly responded and finally got the security issue patched.

What’s your favorite feel-good moment working in security? 

Seeing tweets and other feedback thanking us is always wonderful. We do read them! And that goes the same for those critical of the feature or the way certain advisories were disclosed or published. Please keep them coming—they’re really valuable to us as we keep evolving our security offerings.

Since you work at home, can you introduce us to your furry officemate? 

I live with a seven month old shepherd named Humphrey Dogart. His primary responsibilities are making sure I don’t spend all day on the computer, and he does a great job of that. I think we make a great team!


Learn more about GitHub security alerts

The post Behind the scenes: GitHub security alerts appeared first on The GitHub Blog.

Debugging network stalls on Kubernetes

Post Syndicated from Theo Julienne original https://github.blog/2019-11-21-debugging-network-stalls-on-kubernetes/

We’ve talked about Kubernetes before, and over the last couple of years it’s become the standard deployment pattern at GitHub. We now run a large portion of both internal and public-facing services on Kubernetes. As our Kubernetes clusters have grown, and our targets on the latency of our services have become more stringent, we began to notice that certain services running on Kubernetes in our environment were experiencing sporadic latency that couldn’t be attributed to the performance characteristics of the application itself.

Essentially, applications running on our Kubernetes clusters would observe seemingly random latency of up to and over 100ms on connections, which would cause downstream timeouts or retries. Services were expected to be able to respond to requests in well under 100ms, which wasn’t feasible when the connection itself was taking so long. Separately, we also observed very fast MySQL queries, which we expected to take a matter of milliseconds and that MySQL observed taking only milliseconds, were being observed taking 100ms or more from the perspective of the querying application.

The problem was initially narrowed down to communications that involved a Kubernetes node, even if the other side of a connection was outside Kubernetes. The most simple reproduction we had was a Vegeta benchmark that could be run from any internal host, targeting a Kubernetes service running on a node port, and would observe the sporadically high latency. In this post, we’ll walk through how we tracked down the underlying issue.

Removing complexity to find the path at fault

Using an example reproduction, we wanted to narrow down the problem and remove layers of complexity. Initially, there were too many moving parts in the flow between Vegeta and pods running on Kubernetes to determine if this was a deeper network problem, so we needed to rule some out.

vegeta to kube pod via NodePort

The client, Vegeta, creates a TCP connection to any kube-node in the cluster. Kubernetes runs in our data centers as an overlay network (a network that runs on top of our existing datacenter network) that uses IPIP (which encapsulates the overlay network’s IP packet inside the datacenter’s IP packet). When a connection is made to that first kube-node, it performs stateful Network Address Translation (NAT) to convert the kube-node’s IP and port to an IP and port on the overlay network (specifically, of the pod running the application). On return, it undoes each of these steps. This is a complex system with a lot of state, and a lot of moving parts that are constantly updating and changing as services deploy and move around.

As part of running a tcpdump on the original Vegeta benchmark, we observed the latency during a TCP handshake (between SYN and SYN-ACK). To simplify some of the complexity of HTTP and Vegeta, we can use hping3 to just “ping” with a SYN packet and see if we observe the latency in the response packet—then throw away the connection. We can filter it to only include packets over 100ms and get a simpler reproduction case than a full Layer 7 Vegeta benchmark or attack against the service. The following “pings” a kube-node using TCP SYN/SYN-ACK on the “node port” for the service (30927) with an interval of 10ms, filtered for slow responses:

[email protected] ~ $ sudo hping3 172.16.47.27 -S -p 30927 -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1485 win=29200 rtt=127.1 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1486 win=29200 rtt=117.0 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1487 win=29200 rtt=106.2 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=1488 win=29200 rtt=104.1 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5024 win=29200 rtt=109.2 ms

len=46 ip=172.16.47.27 ttl=59 DF id=0 sport=30927 flags=SA seq=5231 win=29200 rtt=109.2 ms

Our first new observation from the sequence numbers and timings is that this isn’t a one-off, but is often grouped, like a backlog that eventually gets processed.

Next up, we want to narrow down which component(s) were potentially at fault. Is it the kube-proxy iptables NAT rules that are hundreds of rules long? Is it the IPIP tunnel and something on the network handling them poorly? One way to validate this is to test each step of the system. What happens if we remove the NAT and firewall logic and only use the IPIP part:

hping 3 on the IPIP tunnel alone

Linux thankfully lets you just talk directly to an overlay IP when you’re on a machine that’s part of the same network, so that’s pretty easy to do:

[email protected] ~ $ sudo hping3 10.125.20.64 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'

len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7346 win=0 rtt=127.3 ms

len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7347 win=0 rtt=117.3 ms

len=40 ip=10.125.20.64 ttl=64 DF id=0 sport=0 flags=RA seq=7348 win=0 rtt=107.2 ms

Based on our results, the problem still remains! That rules out iptables and NAT. Is it TCP that’s the problem? Let’s see what happens when we perform a normal ICMP ping:

[email protected] ~ $ sudo hping3 10.125.20.64 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'

len=28 ip=10.125.20.64 ttl=64 id=42594 icmp_seq=104 rtt=110.0 ms

len=28 ip=10.125.20.64 ttl=64 id=49448 icmp_seq=4022 rtt=141.3 ms

len=28 ip=10.125.20.64 ttl=64 id=49449 icmp_seq=4023 rtt=131.3 ms

len=28 ip=10.125.20.64 ttl=64 id=49450 icmp_seq=4024 rtt=121.2 ms

len=28 ip=10.125.20.64 ttl=64 id=49451 icmp_seq=4025 rtt=111.2 ms

len=28 ip=10.125.20.64 ttl=64 id=49452 icmp_seq=4026 rtt=101.1 ms

len=28 ip=10.125.20.64 ttl=64 id=50023 icmp_seq=4343 rtt=126.8 ms

len=28 ip=10.125.20.64 ttl=64 id=50024 icmp_seq=4344 rtt=116.8 ms

len=28 ip=10.125.20.64 ttl=64 id=50025 icmp_seq=4345 rtt=106.8 ms

len=28 ip=10.125.20.64 ttl=64 id=59727 icmp_seq=9836 rtt=106.1 ms

Our results show that the problem still exists. Is it the IPIP tunnel that’s causing the problem? Let’s simplify things further:

hping 3 directly between hosts

Is it possible that it’s every packet between these two hosts?

[email protected] ~ $ sudo hping3 172.16.47.27 --icmp -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'

len=46 ip=172.16.47.27 ttl=61 id=41127 icmp_seq=12564 rtt=140.9 ms

len=46 ip=172.16.47.27 ttl=61 id=41128 icmp_seq=12565 rtt=130.9 ms

len=46 ip=172.16.47.27 ttl=61 id=41129 icmp_seq=12566 rtt=120.8 ms

len=46 ip=172.16.47.27 ttl=61 id=41130 icmp_seq=12567 rtt=110.8 ms

len=46 ip=172.16.47.27 ttl=61 id=41131 icmp_seq=12568 rtt=100.7 ms

len=46 ip=172.16.47.27 ttl=61 id=9062 icmp_seq=31443 rtt=134.2 ms

len=46 ip=172.16.47.27 ttl=61 id=9063 icmp_seq=31444 rtt=124.2 ms

len=46 ip=172.16.47.27 ttl=61 id=9064 icmp_seq=31445 rtt=114.2 ms

len=46 ip=172.16.47.27 ttl=61 id=9065 icmp_seq=31446 rtt=104.2 ms

Behind the complexity, it’s as simple as two kube-node hosts sending any packet, even ICMP pings, to each other. They’ll still see the latency, if the target host is a “bad” one (some are worse than others).

Now there’s one last thing to question: we clearly don’t observe this everywhere, so why is it just on kube-node servers? And does it occur when the kube-node is the sender or the receiver? Luckily, this is also pretty easy to narrow down by using a host outside Kubernetes as a sender, but with the same “known bad” target host (from a staff shell host to the same kube-node). We can observe this is still an issue in that direction:

[email protected] ~ $ sudo hping3 172.16.47.27 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=312 win=0 rtt=108.5 ms

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=5903 win=0 rtt=119.4 ms

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=6227 win=0 rtt=139.9 ms

len=46 ip=172.16.47.27 ttl=61 DF id=0 sport=9876 flags=RA seq=7929 win=0 rtt=131.2 ms

And then perform the same from the previous source kube-node to a staff shell host (which rules out the source host, since a ping has both an RX and TX component):

[email protected] ~ $ sudo hping3 172.16.33.44 -p 9876 -S -i u10000 | egrep --line-buffered 'rtt=[0-9]{3}\.'
^C
--- 172.16.33.44 hping statistic ---
22352 packets transmitted, 22350 packets received, 1% packet loss
round-trip min/avg/max = 0.2/7.6/1010.6 ms

Looking into the packet captures of the latency we observed, we get some more information. Specifically, that the “sender” host (bottom) observes this timeout while the “receiver” host (top) does not—see the Delta column (in seconds):

packet captures of the latency

Additionally, by looking at the difference between the ordering of the packets (based on the sequence numbers) on the receiver side of the TCP and ICMP results above, we can observe that ICMP packets always arrive in the same sequence they were sent, but with uneven timing, while TCP packets are sometimes interleaved, but a subset of them stall. Notably, we observe that if you count the ports of the SYN packets, the ports are not in order on the receiver side, while they’re in order on the sender side.

There is a subtle difference between how modern server NICs—like we have in our data centers—handle packets containing TCP vs ICMP. When a packet arrives, the NIC hashes the packet “per connection” and tries to divvy up the connections across receive queues, each (approximately) delegated to a given CPU core. For TCP, this hash includes both source and destination IP and port. In other words, each connection is hashed (potentially) differently. For ICMP, just the IP source and destination are hashed, since there are no ports.

Another new observation is that we can tell that ICMP observes stalls on all communications between the two hosts during this period from the sequence numbers in ICMP vs TCP, while TCP does not. This tells us that the RX queue hashing is likely in play, almost certainly indicating the stall is in processing RX packets, not in sending responses.

This rules out kube-node transmits, so we now know that it’s a stall in processing packets, and that it’s on the receive side on some kube-node servers.

Deep dive into Linux kernel packet processing

To understand why the problem could be on the receiving side on some kube-node servers, let’s take a look at how the Linux kernel processes packets.

Going back to the simplest traditional implementation, the network card receives a packet and sends an interrupt to the Linux kernel stating that there’s a packet that should be handled. The kernel stops other work, switches context to the interrupt handler, processes the packet, then switches back to what it was doing.

traditional interrupt-driven approach

This context switching is slow, which may have been fine on a 10Mbit NIC in the 90s, but on modern servers where the NIC is 10G and at maximal line rate can bring in around 15 million packets per second, on a smaller server with eight cores that could mean the kernel is interrupted millions of times per second per core.

Instead of constantly handling interrupts, many years ago Linux added NAPI, the networking API that modern drivers use for improved performance at high packet rates. At low rates, the kernel still accepts interrupts from the NIC in the method we mentioned. Once enough packets arrive and cross a threshold, it disables interrupts and instead begins polling the NIC and pulling off packets in batches. This processing is done in a “softirq”, or software interrupt context. This happens at the end of syscalls and hardware interrupts, which are times that the kernel (as opposed to userspace) is already running.

NAPI polling in softirq at end of syscalls

This is much faster, but brings up another problem. What happens if we have so many packets to process that we spend all our time processing packets from the NIC, but we never have time to let the userspace processes actually drain those queues (read from TCP connections, etc.)? Eventually the queues would fill up, and we’d start dropping packets. To try and make this fair, the kernel limits the amount of packets processed in a given softirq context to a certain budget. Once this budget is exceeded, it wakes up a separate thread called ksoftirqd (you’ll see one of these in ps for each core) which processes these softirqs outside of the normal syscall/interrupt path. This thread is scheduled using the standard process scheduler, which already tries to be fair.

NAPI polling crossing threshold in schedule ksoftirqd

With an overview of the way the kernel is processing packets, we can see there is definitely opportunity for this processing to become stalled. If the time between softirq processing calls grows, packets could sit in the NIC RX queue for a while before being processed. This could be something deadlocking the CPU core, or it could be something slow preventing the kernel from running softirqs.

Narrow down processing to a core or method

At this point, it makes sense that this could happen, and we know we’re observing something that looks a lot like it. The next step is to confirm this theory, and if we do, understand what’s causing it.

Let’s revisit the slow round trip packets we saw before:

len=46 ip=172.16.53.32 ttl=61 id=29573 icmp_seq=1953 rtt=99.3 ms

len=46 ip=172.16.53.32 ttl=61 id=29574 icmp_seq=1954 rtt=89.3 ms

len=46 ip=172.16.53.32 ttl=61 id=29575 icmp_seq=1955 rtt=79.2 ms

len=46 ip=172.16.53.32 ttl=61 id=29576 icmp_seq=1956 rtt=69.1 ms

len=46 ip=172.16.53.32 ttl=61 id=29577 icmp_seq=1957 rtt=59.1 ms

len=46 ip=172.16.53.32 ttl=61 id=29790 icmp_seq=2070 rtt=75.7 ms

len=46 ip=172.16.53.32 ttl=61 id=29791 icmp_seq=2071 rtt=65.6 ms

len=46 ip=172.16.53.32 ttl=61 id=29792 icmp_seq=2072 rtt=55.5 ms

As discussed previously, these ICMP packets are hashed to a single NIC RX queue and processed by a single CPU core. If we want to understand what the kernel is doing, it’s helpful to know where (cpu core) and how (softirq, ksoftirqd) it’s processing these packets so we can catch them in action.

Now it’s time to use the tools that allow live tracing of a running Linux kernel—bcc is what was used here. This allows you to write small C programs that hook arbitrary functions in the kernel, and buffer events back to a userspace Python program which can summarize and return them to you. The “hook arbitrary functions in the kernel” is the difficult part, but it actually goes out of its way to be as safe as possible to use, because it’s designed for tracing exactly this type of production issue that you can’t simply reproduce in a testing or dev environment.

The plan here is simple: we know the kernel is processing those ICMP ping packets, so let’s hook the kernel function icmp_echo which takes an incoming ICMP “echo request” packet and initiates sending the ICMP “echo response” reply. We can identify the packet using the incrementing icmp_seq shown by hping3 above.

The code for this bcc script looks complex, but breaking it down it’s not as scary as it sounds. The icmp_echo function is passed a struct sk_buff *skb, which is the packet containing the ICMP echo request. We can delve into this live and pull out the echo.sequence (which maps to the icmp_seq shown by hping3 above), and send that back to userspace. Conveniently, we can also grab the current process name/id as well. This gives us results like the following, live as the kernel processes these packets:

TGID    PID     PROCESS NAME    ICMP_SEQ
0       0       swapper/11      770
0       0       swapper/11      771
0       0       swapper/11      772
0       0       swapper/11      773
0       0       swapper/11      774
20041   20086   prometheus      775
0       0       swapper/11      776
0       0       swapper/11      777
0       0       swapper/11      778
4512    4542   spokes-report-s  779

One thing to note about this process name is that in a post-syscall softirq context, you see the process that made the syscall show as the “process”, even though really it’s the kernel processing it safely within the kernel context.

With that running, we can now correlate back from the stalled packets observed with hping3 to the process that’s handling it. A simple grep on that capture for the icmp_seq values with some context shows what happened before these packets were processed. The packets that line up with the above hping3 icmp_seq values have been marked along with the rtt’s we observed above (and what we’d have expected if <50ms rtt’s weren’t filtered out):

TGID    PID     PROCESS NAME    ICMP_SEQ ** RTT
--
10137   10436   cadvisor        1951
10137   10436   cadvisor        1952
76      76      ksoftirqd/11    1953 ** 99ms
76      76      ksoftirqd/11    1954 ** 89ms
76      76      ksoftirqd/11    1955 ** 79ms
76      76      ksoftirqd/11    1956 ** 69ms
76      76      ksoftirqd/11    1957 ** 59ms
76      76      ksoftirqd/11    1958 ** (49ms)
76      76      ksoftirqd/11    1959 ** (39ms)
76      76      ksoftirqd/11    1960 ** (29ms)
76      76      ksoftirqd/11    1961 ** (19ms)
76      76      ksoftirqd/11    1962 ** (9ms)
--
10137   10436   cadvisor        2068
10137   10436   cadvisor        2069
76      76      ksoftirqd/11    2070 ** 75ms
76      76      ksoftirqd/11    2071 ** 65ms
76      76      ksoftirqd/11    2072 ** 55ms
76      76      ksoftirqd/11    2073 ** (45ms)
76      76      ksoftirqd/11    2074 ** (35ms)
76      76      ksoftirqd/11    2075 ** (25ms)
76      76      ksoftirqd/11    2076 ** (15ms)
76      76      ksoftirqd/11    2077 ** (5ms)

The results tells us a few things. First, these packets are being processed by ksoftirqd/11 which conveniently tells us this particular pair of machines have their ICMP packets hashed to core 11 on the receiving side. We can also see that every time we see a stall, we always see some packets processed in cadvisor’s syscall softirq context, followed by ksoftirqd taking over and processing the backlog, exactly the number we’d expect to work through the backlog.

The fact that cadvisor is always running just prior to this immediately also implicates it in the problem. Ironically, cadvisor “analyzes resource usage and performance characteristics of running containers”, yet it’s triggering this performance problem. As with many things related to containers, it’s all relatively bleeding-edge tooling which can result in some somewhat expected corner cases of bad performance.

What is cadvisor doing to stall things?

With the understanding of how the stall can happen, the process causing it, and the CPU core it’s happening on, we now have a pretty good idea of what this looks like. For the kernel to hard block and not schedule ksoftirqd earlier, and given we see packets processed under cadvisor’s softirq context, it’s likely that cadvisor is running a slow syscall which ends with the rest of the packets being processed:

slow syscall causing stalled packet processing on NIC RX queue

That’s a theory but how do we validate this is actually happening? One thing we can do is trace what’s running on the CPU core throughout this process, catch the point where the packets are overflowing budget and processed by ksoftirqd, then look back a bit to see what was running on the CPU core. Think of it like taking an x-ray of the CPU every few milliseconds. It would look something like this:

tracing of cpu to catch bad syscall and preceding work

Conveniently, this is something that’s already mostly supported. The perf record tool samples a given CPU core at a certain frequency and can generate a call graph of the live system, including both userspace and the kernel. Taking that recording and manipulating it using a quick fork of a tool from Brendan Gregg’s FlameGraph that retained stack trace ordering, we can get a one-line stack trace for each 1ms sample, then get a sample of the 100ms before ksoftirqd is in the trace:

# record 999 times a second, or every 1ms with some offset so not to align exactly with timers
sudo perf record -C 11 -g -F 999
# take that recording and make a simpler stack trace.
sudo perf script 2>/dev/null | ./FlameGraph/stackcollapse-perf-ordered.pl | grep ksoftir -B 100

This results in the following:

(hundreds of traces that look similar)


cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_iter
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
cadvisor;[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];[cadvisor];entry_SYSCALL_64_after_swapgs;do_syscall_64;sys_read;vfs_read;seq_read;memcg_stat_show;mem_cgroup_nr_lru_pages;mem_cgroup_node_nr_lru_pages
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;ixgbe_poll;ixgbe_clean_rx_irq;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;bond_handle_frame;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;ipip_tunnel_xmit;ip_tunnel_xmit;iptunnel_xmit;ip_local_out;dst_output;__ip_local_out;nf_hook_slow;nf_iterate;nf_conntrack_in;generic_packet;ipt_do_table;set_match_v4;ip_set_test;hash_net4_kadt;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;hash_net4_test
ksoftirqd/11;ret_from_fork;kthread;kthread;smpboot_thread_fn;smpboot_thread_fn;run_ksoftirqd;__do_softirq;net_rx_action;gro_cell_poll;napi_gro_receive;netif_receive_skb_internal;inet_gro_receive;__netif_receive_skb_core;ip_rcv_finish;ip_rcv;ip_forward_finish;ip_forward;ip_finish_output;nf_iterate;ip_output;ip_finish_output2;__dev_queue_xmit;dev_hard_start_xmit;dev_queue_xmit_nit;packet_rcv;tpacket_rcv;sch_direct_xmit;validate_xmit_skb_list;validate_xmit_skb;netif_skb_features;ixgbe_xmit_frame_ring;swiotlb_dma_mapping_error;__dev_queue_xmit;dev_hard_start_xmit;__bpf_prog_run;__bpf_prog_run

There’s a lot there, but looking through it you can see it’s the cadvisor-then-ksoftirqd pattern we saw from the ICMP tracer above. What does it mean?

Each line is a trace of the CPU at a point in time. Each call down the stack is separated by ; on that line. Looking at the middle of the lines we can see the syscall being called is read(): .... ;do_syscall_64;sys_read; ... So cadvisor is spending a lot of time in a read() syscall relating to mem_cgroup_* functions (the top of the call stack / end of line).

The call stack trace isn’t convenient to see what’s being read, so let’s use strace to see what cadvisor is doing and find 100ms-or-slower syscalls:

[email protected] ~ $ sudo strace -p 10137 -T -ff 2>&1 | egrep '<0\.[1-9]'
[pid 10436] <... futex resumed> )       = 0 <0.156784>
[pid 10432] <... futex resumed> )       = 0 <0.258285>
[pid 10137] <... futex resumed> )       = 0 <0.678382>
[pid 10384] <... futex resumed> )       = 0 <0.762328>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 658 <0.179438>
[pid 10384] <... futex resumed> )       = 0 <0.104614>
[pid 10436] <... futex resumed> )       = 0 <0.175936>
[pid 10436] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.228091>
[pid 10427] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 577 <0.207334>
[pid 10411] <... epoll_ctl resumed> )   = 0 <0.118113>
[pid 10382] <... pselect6 resumed> )    = 0 (Timeout) <0.117717>
[pid 10436] <... read resumed> "cache 154234880\nrss 507904\nrss_h"..., 4096) = 660 <0.159891>
[pid 10417] <... futex resumed> )       = 0 <0.917495>
[pid 10436] <... futex resumed> )       = 0 <0.208172>
[pid 10417] <... futex resumed> )       = 0 <0.190763>
[pid 10417] <... read resumed> "cache 0\nrss 0\nrss_huge 0\nmapped_"..., 4096) = 576 <0.154442>

Sure enough, we see the slow read() calls. From the content being read and mem_cgroup context above, these read() calls are to a memory.stat file which shows the memory usage and limits of a cgroup (the resource isolation technology used by Docker). cadvisor is polling this file to get resource utilization details for the containers. Let’s see if it’s the kernel or cadvisor that’s doing something unexpected by attempting the read ourselves:

[email protected] ~ $ time cat /sys/fs/cgroup/memory/memory.stat >/dev/null

real    0m0.153s
user    0m0.000s
sys    0m0.152s
[email protected] ~ $

Since we can reproduce it, this indicates that it’s the kernel hitting a pathologically bad case.

What causes this read to be so slow

At this point it’s much more simple to find similar issues reported by others. As it turns out, this has been reported to cadvisor as an excessive CPU usage problem, it just hadn’t been observed that latency was also being introduced to the network stack randomly as well. In fact, some folks internally had noticed cadvisor was consuming more CPU than expected, but it didn’t seem to be causing an issue since our servers had plenty of CPU capacity, and so the CPU usage hadn’t yet been investigated.

The overview of the issue is that the memory cgroup is accounting for memory usage inside a namespace (container). When all processes in that cgroup exit, the memory cgroup is released by Docker. However, “memory” isn’t just process memory, and although processes memory usage itself is gone, it turns out the kernel also assigns cached content like dentries and inodes (directory and file metadata) that are cached to the memory cgroup. From that issue:

“zombie” cgroups: cgroups that have no processes and have been deleted but still have memory charged to them (in my case, from the dentry cache, but it could also be from page cache or tmpfs).

Rather than the kernel iterating over every page in the cache at cgroup release time, which could be very slow, they choose to wait for those pages to be reclaimed and then finally clean up the cgroup once all are reclaimed when memory is needed, lazily. In the meantime, the cgroup still needs to be counted during stats collection.

From a performance perspective, they are trading off time on a slow process by amortizing it over the reclamation of each page, opting to make the initial cleanup fast in return for leaving some cached memory around. That’s fine, when the kernel reclaims the last of the cached memory, the cgroup eventually gets cleaned up, so it’s not really a “leak”. Unfortunately the search that memory.stat performs, the way it’s implemented on the kernel version (4.9) we’re running on some servers, combined with the huge amount of memory on our servers, means it can take a significantly long time for the last of the cached data to be reclaimed and for the zombie cgroup to be cleaned up.

It turns out we had nodes that had such a large number of zombie cgroups that some had reads/stalls of over a second.

The workaround on that cadvisor issue, to immediately free the dentries/inodes cache systemwide, immediately stopped the read latency, and also the network latency stalls on the host, since the dropping of the cache included the cached pages in the “zombie” cgroups and so they were also freed. This isn’t a solution, but it does validate the cause of the issue.

As it turns out newer kernel releases (4.19+) have improved the performance of the memory.stat call and so this is no longer a problem after moving to that kernel. In the interim, we had existing tooling that was able to detect problems with nodes in our Kubernetes clusters and gracefully drain and reboot them, which we used to detect the cases of high enough latency that would cause issues, and treat them with a graceful reboot. This gave us breathing room while OS and kernel upgrades were rolled out to the remainder of the fleet.

Wrapping up

Since this problem manifested as NIC RX queues not being processed for hundreds of milliseconds, it was responsible for both high latency on short connections and latency observed mid-connection such as between MySQL query and response packets. Understanding and maintaining performance of our most foundational systems like Kubernetes is critical to the reliability and speed of all services that build on top of them. As we invest in and improve on this performance, every system we run benefits from those improvements.

The post Debugging network stalls on Kubernetes appeared first on The GitHub Blog.

How we implemented domain-driven development in Golang

Post Syndicated from Grab Tech original https://engineering.grab.com/domain-driven-development-in-golang

Partnerships have always been core to Grab’s super app strategy. We believe in collaborating with partners who are the best in what they do – combining their expertise with what we’re good at so that we can bring high-quality new services to our customers, at the same time create new opportunities for the merchant and driver-partners in our ecosystem.

That’s why we launched GrabPlatform last year. To make it easier for partners to either integrate Grab into their services, or integrate their services into Grab.

In view of that, part of the GrabPlatform’s team mission is to make it easy for partners to integrate with Grab services. These partners are external companies that would like to offer Grab’s services such as ride-booking through their own websites or applications. To do that, we decided to build a website that will serve as a one-stop-shop that would allow them to self-service these integrations.

The challenges we faced with the conventional approach

In the process of building this website, our team noticed that the majority of the functions and responsibilities were added to files without proper segregation. A single file would contain more than 500 lines of code. Each of these files were imported from different collections of source codes, resulting in an unstructured codebase. Any changes to the existing functions risked breaking existing functionality; we realized then that we needed to proactively plan for the future. Hence, we decided to use the principles of Domain-Driven Design (DDD) and idiomatic Go. This blog aims to demonstrate the process of how we leveraged those concepts to design a modern application.

How we implemented DDD in our codebase

Here’s how we went about solving our unstructured codebase using DDD principles.

Step 1: Gather domain (business) knowledge

We collaborated closely with our domain experts (in our case, this was our product team) to identify functionality and flow. From them, we discovered the following key points:

  • After creating a project, developers are added to the project.
  • The domain experts wanted an ability to add other products (e.g. Pricing service, ETA service, GrabPay service) to their projects.
  • They wanted the ability to create multiple authentication clients to access the above products.

Step 2: Break down domain knowledge into bounded context

Now that we had gathered the required domain knowledge (i.e. what our code needed to reflect to our partners), it was time to use the DDD strategic tool Bounded Context to break down problems into subcontexts. Here is a graphical representation of how we converted the problem into smaller units.

Bounded Context

We identified several dependencies on each of the units involved in the project. Take some of these examples:

  • The project domain overlapped with the product and developer domains.
  • Our RideBooking project can only exist if it has some products like Ridebooking APIs and not the other way around.

What this means is a product can exist independent of the project, but a project will have no significance without any product. In the same way, a project is dependent on the developers, but developers can exist whether or not they belong to a project.

Step 3: Identify value objects or entity (lowest layer)

Looking at the above bounded contexts, we figured out the building blocks (i.e. value objects or entity) to break down the above functionality and flow.

// ProjectDAO ...
type ProjectDAO struct {
  ID            int64
  UUID          string
  Status        ProjectStatus
  CreatedAt     time.Time
}

// DeveloperDAO ...
type DeveloperDAO struct {
  ID            int64
  UUID          string
  PhoneHash     *string
  Status        Status
  CreatedAt     time.Time
}

// ProductDAO ...
type ProductDAO struct {
  ID            int64
  UUID          string
  Name          string
  Description   *string
  Status        ProductStatus
  CreatedAt     time.Time
}

// DeveloperProjectDAO to map developer's to a project
type DeveloperProjectDAO struct {
  ID            int64
  DeveloperID   int64
  ProjectID     int64
  Status        DeveloperProjectStatus
}

// ProductProjectDAO to map product's to a project
type ProductProjectDAO struct {
  ID            int64
  ProjectID     int64
  ProductID     int64
  Status        ProjectProductStatus
}

All the objects shown above have ID as a field and can be identifiable, hence they are identified as entities and not as value objects. But if we apply domain knowledge, DeveloperProjectDAO and ProductProjectDAO are actually not independent entities. Project object is the aggregate root since it must exist before the child fields, DevProjectDAO and ProdcutProjectDAO, can exist.

Step 4: Create the repositories

As stated above, we created an interface to abstract the working logic of a particular domain (i.e. Repository). Here is an example of how we designed the repositories:

// ProductRepositoryImpl responsible for product functionality
type ProductRepositoryImpl struct {
  productDao storage.IProductDao // private field
}

type ProductRepository interface {
  GetProductsByIDs(ctx context.Context, ids []int64) ([]IProduct, error)
}

// DeveloperRepositoryImpl
type DeveloperRepositoryImpl struct {
  developerDAO storage.IDeveloperDao // private field
}

type DeveloperRepository interface {
  FindActiveAllowedByDeveloperIDs(ctx context.Context, developerIDs []interface{}) ([]*Developer, error)
  GetDeveloperDetailByProfile(ctx context.Context, developerProfile *appdto.DeveloperProfile) (IDeveloper, error)
}

Here is a look at how we designed our repository for aggregate root project:

// Unexported Struct
type productProjectRepositoryImpl struct {
  productProjectDAO storage.IProjectProductDao // private field
}

type ProductProjectRepository interface {
  GetAllProjectProductByProjectID(ctx context.Context, projectID int64) ([]*ProjectProduct, error)
}

// Unexported Struct
type developerProjectRepositoryImpl struct {
  developerProjectDAO storage.IDeveloperProjectDao // private field
}

type DeveloperProjectRepository interface {
  GetDevelopersByProjectIDs(ctx context.Context, projectIDs []interface{}) ([]*DeveloperProject, error)
  UpdateMappingWithRole(ctx context.Context, developer IDeveloper, project IProject, role string) (*DeveloperProject, error)
}

// Unexported Struct
type projectRepositoryImpl struct {
  projectDao storage.IProjectDao // private field
}

type ProjectRepository interface {
  GetProjectsByIDs(ctx context.Context, projectIDs []interface{}) ([]*Project, error)
  GetActiveProjectByUUID(ctx context.Context, uuid string) (IProject, error)
  GetProjectByUUID(ctx context.Context, uuid string) (*Project, error)
}

type ProjectAggregatorImpl struct {
  projectRepositoryImpl           // private field
  developerProjectRepositoryImpl  // private field
  productProjectRepositoryImpl    // private field
}

type ProjectAggregator interface {
  GetProjects(ctx context.Context) ([]*dto.Project, error)
  AddDeveloper(ctx context.Context, request *appdto.AddDeveloperRequest) (*appdto.AddDeveloperResponse, error)
  GetProjectWithProducts(ctx context.Context, uuid string) (IProject, error)
}

Step 5: Identify Domain Events

The functions described in Step 4 only returns the ID of the developer and product, which conveys no information to the users. In order to provide developer and product information, we use the domain-event technique to return the actual product and developer attributes.

A domain event is something that happened in a bounded context that you want another context of a domain to be aware of. For example, if there are new updates to the developer domain, it’s important to convey these updates to the project domain. This propagation technique is termed as domain event. Domain events enable independence between different classes.

One way to implement it is seen here:

// file: project\_aggregator.go
func (p *ProjectAggregatorImpl) GetProjects(ctx context.Context) ([]*dto.Project, error) {
  ....
  ....
  developers := p.EventHandler.Handle(DomainEvent.FindDeveloperByDeveloperIDs{DeveloperIDs})
  ....
}

// file: event\_type.go
type FindDeveloperByDeveloperIDs struct{ developerID []interface{} }

// file: event\_handler.go
func (e *EventHandler) Handle(event interface{}) interface{} {
  switch op := event.(type) {
      case FindDeveloperByDeveloperIDs:
            developers, _ := e.developerRepository.FindDeveloperByDeveloperIDs(op.developerIDs)
            return developers
      case ....
      ....
    }
}
Domain Event

Some common mistakes to avoid when implementing DDD in your codebase:

  • Not engaging with domain experts. Not interacting with domain experts is a common mistake when using DDD. Talking to domain experts to get an understanding of the problem domain from their perspective is at the core of DDD. Starting with schemas or data modelling instead of talking to domain experts may create code based on a relational model instead of it built around a domain model.
  • Ignoring the language of the domain experts. Creating a ubiquitous language shared with domain experts is also a core DDD practice. This common language must be used in all discussions as well as in the code, e.g. in class and method names.
  • Not identifying bounded contexts. A common approach to solving a complex problem is breaking it down into smaller parts. Creating bounded contexts is breaking down a large domain into smaller ones, each handling one cohesive part of the domain.
  • Using an anaemic domain model. This is a common sign that a team is not doing DDD and often a symptom of a failure in the modelling process. At first, an anaemic domain model often looks like a real domain model with correct names, but the classes lack functionalities. They contain only the Get and Set methods.

How the DDD model improved our software development

Thanks to this brand new clean up, we achieved the following:

  • Core functionalities are evenly distributed to the overall codebase and not limited to just a few files.
  • The developers are aware of what each folder is responsible for by simply looking at the file naming and folder structure.
  • The risk of breaking major functionalities by merely making small changes is greatly reduced. Changing a feature is now more efficient.

The team now finds the code well structured and we require less hand-holding for onboarders, thanks to the simplicity of the structure.

Finally, the most important thing, we now have a system oriented towards our business necessities. Everyone ends up using the same language and terms. Developers communicate better with the business team. The work is more efficient when it comes to establishing solutions for the models that reflect how the business operates, instead of how the software operates.

Lessons Learnt

  • Use DDD to collaborate among all project disciplines (product, business, partner, and so on) and clearly understand the business requirements.
  • Establish a ubiquitous language to discuss domain-related concepts.
  • Use bounded contexts to break down complex domains into manageable parts.
  • Implement a layered architecture (i.e. DDD building blocks) to focus on particular aspects of the application.
  • To simplify your dependency, use domain event to communicate with sub-bounded context.

Griffin, an anti-fraud risk rule engine making billions of predictions daily

Post Syndicated from Grab Tech original https://engineering.grab.com/griffin

Introduction

At Grab, the scale and fast-moving nature of our business means we need to be vigilant about potential risks to our customers and to our business. Some of the things we watch for include promotion abuse, or passenger safety on late-night ride allocations. To overcome these issues, the TIS (Trust/Identity/Safety) taskforce was formed with a group of AI developers dedicated to fraud detection and prevention.

The team’s mission is:

  • to keep fraudulent users away from our app or services
  • ensure our customers’ safety, and
  • Manage user identities to securely login to the Grab app.

The TIS team’s scope covers not just transport, but also our food, deliver and other Grab verticals.

How we prevented fraudulent transactions in the earlier days

In our early days when Grab was smaller, we used a rules-based approach to block potentially fraudulent transactions. Rules are like boolean conditions that determines if the result will be true or false. These rules were very effective in mitigating fraud risk, and we used to create them manually in the code.

We started with very simple rules. For example:

Rule 1:

 IF a credit card has been declined today

 THEN this card cannot be used for booking

To quickly incorporate rules in our app or service, we integrated them in our backend service code and deployed our service frequently to use the latest rules.

It worked really well in the beginning. Our logic was relatively simple, and only one developer managed the changes regularly. It was very lightweight to trigger the rule deployment and enforce the rules.

However, as the business rapidly expanded, we had to exponentially increase the rule complexity. For example, consider these two new rules:

Rule 2:

IF a credit card has been declined today but this passenger has good booking history

THEN we would still allow this booking to go through, but precharge X amount

Rule 3:

IF a credit card has been declined(but paid off) more than twice in the last 3-months

THEN we would still not allow this booking

The system scans through the rules, one by one, and if it determines that any rule is tripped it will check the other rules. In the example above, if a credit card has been declined more than twice in the last 3-months, the passenger will not be allowed to book even though he has a good booking history.

Though all rules follow a similar pattern, there are subtle differences in the logic and they enable different decisions. Maintaining these complex rules was getting harder and harder.

Now imagine we added more rules as shown in the example below. We first check if the device used by the passenger is a high-risk one. e.g using an emulator for booking. If not, we then check the payment method to evaluate the risk (e.g. any declined booking from the credit card), and then make a decision on whether this booking should be precharged or not. If passenger is using a low-risk  device but is in some risky location where we traditionally see a lot of fraud bookings, we would then run some further checks about the passenger booking history to decide if a pre-charge is also needed.

Now consider that instead of a single passenger, we have thousands of passengers. Each of these passengers can have a large number of rules for review. While not impossible to do, it can be difficult and time-consuming, and it gets exponentially more difficult the more rules you have to take into consideration. Time has to be spent carefully curating these rules.

Rules flow

The more rules you add to increase accuracy, the more difficult it becomes to take them all into consideration.

Our rules were getting 10X more complicated than the example shown above. Consequently, developers had to spend long hours understanding the logic of our rules, and also be very careful to avoid any interference with new rules.

In the beginning, we implemented rules through a three-step process:

  1. Data Scientists and Analysts dived deep into our transaction data, and discovered patterns.
  2. They abstracted these patterns and wrote rules in English (e.g. promotion based booking should be limited to 5 bookings and total finished bookings should be greater than 6, otherwise unallocate current ride)
  3. Developers implemented these rules and deployed the changes to production

Sometimes, the use of English between steps 2 and 3 caused inaccurate rule implementation (e.g. for “X should be limited to 5”, should the implementation be X < 5 or  X <= 5?)

Once a new rule is deployed, we monitored the performance of the rule. For example,

  • How often does the rule fire (after minutes, hours, or daily)?
  • Is it over-firing?
  • Does it conflict with other rules?

Based on implementation, each rule had dependency with other rules. For example, if Rule 1 is fired, we should not continue with Rule 2 and Rule 3.

As a result, we couldn’t  keep each rule evaluation independent.  We had no way to observe the performance of a rule with other rules interfering. Consider an example where we change Rule 1:

From IF a credit card has been declined today

To   IF a credit card has been declined this week

As Rules 2 and 3 depend on Rule 1, their trigger-rate would drop significantly. It means we would have unstable performance metrics for Rule 2 and Rule 3 even though the logic of Rule 2 and Rule 3 does not change. It is very hard for a rule owner to monitor the performance of Rules 2 and Rule 3.

When it comes to the of A/B testing of a new rule, Data Scientists need to put a lot of effort into cleaning up noise from other rules, but most of the time, it is mission-impossible.

After several misfiring events (wrong implementation of rules) and ever longer rule development time (weekly), we realized “No one can handle this manually.“

Birth of Griffin Rule Engine

We decided to take a step back, sit down and closely review our daily patterns. We realized that our daily patterns fall into two categories:

  1. Fetching new data:  e.g. “what is the credit card risk score”, or “how many food bookings has this user ordered in last 7 days”, and transform this data for easier consumption.
  2. Updating/creating rules: e.g. if a credit card risk score is high, decline a booking.

These two categories are essentially divided into two independent components:

  1. Data orchestration – collecting/transforming the data from different data sources.
  2. Rule-based prediction

Based on these findings, we got started with our Data Orchestrator (open sourced at https://github.com/grab/symphony) and Griffin projects.

The intent of Griffin is to provide data scientists and analysts with a way to add new rules to monitor, prevent, and detect fraud across Grab.

Griffin allows technical novices to apply their fraud expertise to add very complex rules that can automate the review of rules without manual intervention.

Griffin  now predicts billions of events every day with 100K+ Queries per second(QPS) at peak time (on only 6 regular EC2s).

Data scientists and analysts can self-service rule changes on the web portal directly, deploy rules with just a few clicks, experiment and monitor performance in real time.

Why we came up with Griffin instead of using third-party tools in the market

Before we decided to create our in-built tool, we did some research for common business rule engines available in the market such as Drools and checked if we should use them. In that process, we found:

  1. Drools has its own Java-based DSL with a non-trivial learning curve (whereas our major users are from Python background).
  2. Limited [expressive power](https://en.wikipedia.org/wiki/Expressive_power_(computer_science),
  3. Limited support for some common math functions (e.g. factorial/ Greatest Common Divisor).
  4. Our nature of business needed dynamic dataset for predictions (for example, a rule may need only passenger booking history on Day 1, but it may use passenger booking history, passenger credit balance, and passenger favorite places on Day 2). On the other hand, Drools usually works well with a static list of dataset instead of dynamic dataset.

Given the above constraints, we decided to build our own rule engine which can better fit our needs.

Griffin Architecture

The diagram depicts the high-level flow of making a prediction through Griffin.

High-level flow of making a prediction through Griffin

Components

  • Data Orchestration: a service that collects all data needed for predictions
  • Rule Engine: a service that makes prediction based on rules
  • Rule Editor: the portal through which users can create/update rules

Workflow

  1. Users create/update rules in the Rule Editor web portal, and save the rules in the database.
  2. Griffin Rule Engine reloads rules immediately as long as it detects any rule changes.
  3. Data Orchestrator sends all dataset (features) needed for a prediction (e.g. whether to block a ride based on passenger past ride pattern, credit card risk) to the Rule Engine
  4. Griffin Rule Engine makes a prediction.

How you can create rules using Griffin

In an abstract view, a rule inside Griffin is defined as:

Rule:

Input:JSON => Result:Boolean

We allow users (analysts, data scientists) to write Python-based rules on WebUI to accommodate some very complicated rules like:

len(list(filter(lambdax: x \>7, (map(lambdax: math.factorial(x), \[1,2,3,4,5,6\]))))) \>2

This significantly optimizes the expressive power of rules.

To match and evaluate a rule more efficiently, we also have other key components associated:

Scenarios

  • Here are some examples: PreBooking, PostBookingCompletion, PostFoodDelivery

Actions

  • Actions such as NotAllowBooking, AuthCapture, SendNotification
  • If a rule result is True, it returns a list of treatments as selected by users, e.g. AuthCapture and SendNotification (the example below is treatments for one Safety-related rule).The one below is for a checkpoint to detect credit-card risk.
Treatments: AuthCapture
  • Each checkpoint has a default treatment. If no rule inside this checkpoint is hit, the rule engine would return the default one (in most cases, it is just “do nothing”).
  • A treatment can only belong to one checkpoint, but one checkpoint can have multiple treatments.

For example, the graph below demonstrates a checkpoint PaxPreRide associated with three treatments: Pass, Decline, Hold

Treatments: Adding

Segments

  • The scope/dimension of a rule. Based on the sample segments below, a rule can be applied only to countries=\[MY,PH\] and verticals=\[GrabBus, GrabCar\]
  • It can be changed at any time on WebUI as well.
Segments

Values of a rule

 

When a rule is hit, more than just treatments, users also want some dynamic values returned. E.g. a max distance of the ride allowed if we believe this booking is medium risk.

Does Python make Griffin run slow?

We picked Python to enjoy its great expressive power and neatness of syntax, but some people ask: Python is slow, would this cause a latency bottleneck?

Our answer is No.

The below graph shows the Latency P99 of Prediction Request from load balancer side(actually the real latency for each prediction is < 6ms, the metrics are peaked at 30ms because some batch requests contain 50 predictions in a single call)

Prediction Request Latency P99

What we did to achieve this?

  • The key idea is to make all computations in CPU and memory only (in other words, no extra I/O).
  • We do not fetch the rules from database for each prediction. Instead, we keep a record called dirty_key, which keeps the latest rule update timestamp. The rule engine would actively check this timestamp and trigger a rule reload only when the dirty_key timestamp in the DB is newer than the latest rule reload time.
  • Rule engine would not fetch any additional new data, instead, all data should be from Data Orchestrator.
  • So the whole prediction flow is only between CPU & memory (and if the data size is small, it could be on CPU cache only).
  • Python GIL essentially enforces a process to have up to one active thread running at a time, no matter how many cores a CPU has. We have Gunicorn to wrap our service, so on the Production machine, we have (2x$num_cores) + 1 processes (see http://docs.gunicorn.org/en/latest/design.html#how-many-workers). The formula is based on the assumption that for a given core, one worker will be reading or writing from the socket while the other worker is processing a request.

The below screenshot is the process snapshot on C5.large machine with 2 vCPU. Note only green processes are active.

Process snapshot on C5.large machine

A lot of trial and error performance tuning:

  • We used to have python-jsonpath-rw for JSONPath query, but the performance was not strong enough. We switched to jmespath and observed about 10ms latency reduction.
  • We use sqlalchemy for DB Query and ORM. We enabled cache for some use cases, but turned out it was over-optimized with stale data. We ended up turning off some caching points to ensure the data consistency.
  • For new dict/list creation, we prefer native call (e.g. {}/[]) instead of function call (see the comparison below).
Native call and Function call
  • Use built-in functions https://docs.python.org/3/library/functions.html. It is written in C, no one can beat it.
  • Add randomness to rule reload so that not all machines run at the same time causing latency spikes.
  • Caching atomic feature units as they are used so that we don’t have to requery for them each time a checkpoint uses it.

How Griffin makes on-call engineers relax

One of the most popular aspects of Griffin is the WebUI. It opens a door for non-developers to make production changes in real time which significantly boosts organisation productivity. In the past a rule change needed 1 week for code change/test/deployment, now it is just 1 minute.

But this also introduces extra risks. Anyone can turn the whole checkpoint down, whether unintentionally or maliciously.

Hence we implemented Shadow Mode and Percentage-based rollout for each rule. Users can put a rule into Shadow Mode to verify the performance without any production impact, and if needed, rollout of a rule can be from 1% all the way to 100%.

We implemented version control for every rule change, and in case anything unexpected happened, we could rollback to the previous version quickly.

Version control
Rollback button

We also built RBAC-based permission system, along with Change Approval flow to make sure any prod change needs at least two people(and approver role has higher permission)

Closing thoughts

Griffin evolved from a fraud-based rule engine to generic rule engine. It can apply to any rule at Grab. For example, Grab just launched Appeal automation several days ago to reduce 50% of the  human effort it typically takes to review straightforward appeals from our passengers and drivers. It was an unplanned use case, but we are so excited about this.

This could happen because from the very beginning we designed Griffin with minimized business context, so that it can be generic enough.

After the launch of this, we observed an amazing adoption rate for various fraud/safety/identity use cases. More interestingly, people now treat Griffin as an automation point for various integration points.

Using Grab’s Trust Counter Service to Detect Fraud Successfully

Post Syndicated from Grab Tech original https://engineering.grab.com/using-grabs-trust-counter-service-to-detect-fraud-successfully

Background

Fraud is not a new phenomenon, but with the rise of the digital economy it has taken different and aggressive forms. Over the last decade, novel ways to exploit technology have appeared, and as a result, millions of people have been impacted and millions of dollars in revenue have been lost. According to ACFE survey, companies lost USD6.3 billion due to fraud. Organizations lose 5% of its revenue annually due to fraud.

In this blog, we take a closer look at how we developed an anti-fraud solution using the Counter service, which can be an indispensable tool in the highly complex world of fraud detection.

Anti-fraud solution using counters

At Grab, we detect fraud by deploying data science, analytics, and engineering tools to search for anomalous and suspicious transactions, or to identify high-risk individuals who are likely to commit fraud. Grab’s Trust Platform team provides a common anti-fraud solution across a variety of business verticals, such as transportation, payment, food, and safety. The team builds tools for managing data feeds, creates SDK for engineering integration, and builds rules engines and consoles for fraud detection.

One example of fraudulent behavior could be that of an individual who masquerades as both driver and passenger, and makes cashless payments to get promotions, for example, earn a one dollar rebate in the next transaction.In our system, we analyze real time booking and payment signals, compare it with the historical data of the driver and passenger pair, and create rules using the rule engine. We count the number of driver and passenger pairs at a given time frame. This counter is provided as an input to the rule.If the counter value exceeds a predefined threshold value, the rule evaluates it as a fraud transaction. We send this verdict back to the booking service.

The conventional method

Fraud detection is a job that requires cross-functional teams like data scientists, data analysts, data engineers, and backend engineers to work together. Usually data scientists or data analysts come up with an offline idea and apply it to real-time traffic. For example, a rule gets invented after brainstorming sessions by data scientists and data analysts. In the conventional method, the rule needs to be communicated to engineers.

Automated solution using the Counter service

To overcome the challenges in the conventional method, the Trust platform team decided to come out with the Counter service, a self-service platform, which provides management tools for users, and a computing engine for integrating with the backend services. This service provides an interface, such as a UI based rule editor and data feed, so that analysts can experiment and create rules without interacting with engineers. The platform team also decided to provide different data contracts, APIs, and SDKs to engineers so that the business verticals can use it quickly and easily.

The major engineering challenges faced in designing the Counter service

There are millions of transactions happening at Grab every day, which implies we needed to perform billions of fraud and safety detections. As seen from the example shared earlier, most predictions require a group of counters. In the above use case, we need to know how many counts of the cashless payment happened for a driver and passenger pair. Due to the scale of Grab’s business, the potential combinations of drivers and passengers could be exponential. However, this is only one use case. So imagine that there could be hundreds of counters for different use cases. Hence it’s important that we provide a platform for stakeholders to manage counters.

Some of the common challenges we faced were:

Scalability

As mentioned above, we could potentially have an exponential number of passengers and drivers in a single counter. So it’s a great challenge to store the counters in the database, read, and query them in real-time. When there are billions of counter keys across a long period of time, the Trust team had to find a scalable way to write and fetch keys effectively and meet the client’s SLA.

Self-serving

A counter is usually invented by data scientists or analysts and used by engineers. For example, every time a new type of counter is needed from data scientists, developers need to manually make code changes, such as adding a new stream, capturing related data sets for the counter, and storing it on the fraud service, then doing a deployment to make the counters ready. It usually takes two or more weeks for the whole iteration, and if there are any changes from the data analysts’ side, which happens often, the situation loops again. The team had to come up with a solution to prevent the long loop of manual tasks by coming out with a self-serving interface.

Manageable and extendable

Due to a lack of connection between real-time and offline data, data analysts and data scientists did not have a clear picture of what is written in the counters. That’s because the conventional counter data were stored in Redis database to satisfy the query SLA. They could not track the correctness of counter value, or its history. With the new solution, the stakeholders can get a real-time picture of what is stored in the counters using the data engineering tools.

The Machine Learning challenges solved by the Counter service

The Counter service plays an important role in our Machine Learning (ML) workflow.

Data Consistency Challenge/Issue

Most of the machine learning workflows need dedicated input data. However, when there is an anti-fraud model that is trained using offline data from the data lake, it is difficult to use the same model in real-time. This is because the model lacks the data contract and the consistency with the data source. In this case, the Counter service becomes a type of data source by providing the value of counters to file system.

ML featuring

Counters are important features for the ML models. Imagine there is a new invention of counters, which data scientists need to evaluate. We need to provide a historical data set for counters to work. The Counter service provides a counter replay feature, which allows data scientists to simulate the counters via historical payload.

In general, the Counter service is a bridge between online and offline datasets, data scientists, and engineers. There was technical debt with regards to data consistency and automation on the ML pipeline, and the Counter service closed this loop.

How we designed the Counter service

We followed the principle of asynchronized data ingestion, and synchronized transaction for designing the Counter service.

The diagram shows how the counters are generated and saved to database.

How the counters are generated and saved to the database

Counter creation workflow

  1. User opens the Counter Creation UI and creates a new key “fraud:counter:counter_name”.
  2. Configures required fields.
  3. The Counter service monitors the new counter-creation, puts a new counter into load script storage, and starts processing new counter events (see Counter Write below).

Counter write workflow

  1. The Counter service monitors multiple streams, assembles extra data from online data services (i.e. Common Data Service (CDS), passenger service, hydra service, etc), so that rich dataset would also be available for editors on each stream resource.
  2. The Counter Processor evaluates the user-configured expression and writes the evaluated values to the dedicated Grab-Stats stream using the GrabPlugin tool.

Counter read workflow

Counter read workflow

We use Grab-Stats as our storage service. Basically Grab-Stats runs above ScyllaDB, which is a distributed NoSQL data store. We use ScyllaDB because of its good performance on aggregation in memory to deal with the time series dataset. In comparison with in-memory storage like AWS elasticCache, it is 10 times cheaper and as reliable as AWS in terms of stability. The p99 of reading from ScyllaDB is less than 150ms which satisfies our SLA.

How we improved the Counter service performance

We used the multi-buckets strategy to improve the Counter service performance.

Background

There are different time windows when you perform a query. Some counters are time sensitive so that it needs to know what happened in the last 30 or 60 minutes. Some other counters focus on the long term and need to know the events in the last 30 or 90 days.

From a transactional database perspective, it’s not possible to serve small range as well as long term events at the same time. This is because the more the need for the accuracy of the data and the longer the time range, the more aggregations need to happen on database. Which means we would not be able to satisfy the SLA. Otherwise we will need to block other process which leads to the service downgrade.

Solution for improving the query

We resolved this problem by using different granularities of the tables. We pre-aggregated the signals into different time buckets, such as 15min, 1 hour, and 1 day.

When a request comes in, the time-range of the request will be divided by the buckets, and the results are conquered. For example, if there is a request for 9/10 23:15:20 to 9/12 17:20:18, the handler will query 15min buckets within the hour.  It will query for hourly buckets for the same day. And it will query the daily buckets for the rest of 2 days. This way, we avoid doing heavy aggregations, but still keep the accuracy in 15 minutes level in a scalable response time.

Counter service UI

We allowed data analysts and data scientists to onboard counters by themselves, from a dedicated web portal. After the counter is submitted, the Counter service takes care of the integration and parsing the logic at runtime.

Counter service UI

Backend integration

We provide SDK for quicker and better integration. The engineers only need to provide the counter identifier ID (which is shown in the UI) and the time duration in the query. Under the hood we provide a GRPC protocol to communicate across services. We divide the query time window to smaller granularities, fetching from different time series tables and then conquering the result. We are also providing a short TTL cache layer to take the uncommon traffic from client such as network retry or traffic throttle. Our QPS are designed to target 100K.

Monitoring the Counter service

The Counter service dashboard helps to track the human errors while editing the counters in real-time. The Counter service sends alerts to slack channel to notify users if there is any error.

Counter service dashboard

We setup Datadog for monitoring multiple system metrics. The figure below shows a portion of stream processing and counter writing. In the example below, the total stream QPS would reach 5k at peak hour, and the total counter saved to storage tier is about 4k. It will keep climbing without an upper limit, when more counters are onboarded.

Counter service dashboard with multiple metrics

The Counter service UI portal also helps users to fetch real-time counter results for verification purposes.

Counter service UI

Future plans

Here’s what we plan to do in the near future to improve the Counter service.

Close the ML workflow loop

As mentioned above, we plan to send the resource payload of the Counter service to the offline data lake, in order to complete the counter replay function for data scientists. We are working on the project called “time traveler”. As the name indicates, it is used not only for serving the online transactional data, but also supports historical data analytics, and provides more flexibility on counter inventions and experiments.

There are more automation steps we plan to do, such as adding a replay button on the web portal, and hooking up with the offline big data engine to trigger the analytics jobs. The performance metrics will be collected and displayed on the web portal. A single platform would be able to manage both the online and offline data.

Integration to Griffin

Griffin is our rule engine. Counters are sometimes an input to a particular rule, and one rule usually needs many counters to work together. We need to provide a better integration to Griffin on backend. We plan to minimize the current engineering effort when using counters on Griffin. A counter then becomes an automated input variable on Griffin, which can be configured on the web portal by any users.

Getting started with Git and GitHub is easier than ever with GitHub Desktop 2.2

Post Syndicated from Amanda Pinsker original https://github.blog/2019-10-02-get-started-easier-with-github-desktop-2-2/

Anyone who uses Git knows that it has a steep learning curve. We’ve learned from developers that most people tend to learn from a buddy, whether that’s a coworker, a professor, a friend, or even a YouTube video. In GitHub Desktop 2.2, we’re releasing the first version of an interactive Git and GitHub tutorial that can be your buddy and help you get started. If you’re new to Desktop, you can download and try out the tutorial at desktop.github.com.

Get set up

To get set up, we help you through two major pieces: creating a repository and connecting an editor. When you first open Desktop, a welcome page appears with a new option to “Create a Tutorial Repository”. Starting with this option creates a tutorial repository that guides you through the core concepts of working with Git using GitHub Desktop.

There are a lot of tools you need to get started with Git and GitHub. The most important of these is your code editor. In the first step of the tutorial, you’re prompted to install an editor if you don’t have one already.

Learn the GitHub flow

Next, we guide you through how to use GitHub Desktop to make changes to code locally and get your work on GitHub. You’ll create a new branch, make a change to a file, commit it, push it to GitHub, and open your first pull request.

We’ve also heard that new users initially experience confusion between Git, GitHub, and GitHub Desktop. We cover these differences in the tutorial and make sure to reinforce the explanations.

Keep going with your own project

In GitHub Desktop 1.6, we introduced suggested next steps based on the state of your repository. Now when you complete the tutorial, we similarly suggest next steps: exploring projects on GitHub that you might want to contribute to, creating a new project, or adding an existing project to Desktop. We always want GitHub Desktop to be the tool that makes your next steps clear, whether you’re in the flow of your work, or you’re a new developer just getting started.

What’s next?

With GitHub Desktop 2.2, we’re making the product our users love more approachable to newcomers. We’ll be iterating on the tutorial based on your feedback, and we’ll continue to build on the connection between GitHub and your local machine. If you want to start building something but don’t know how, think of GitHub Desktop as your buddy to help you get started.

Learn more about GitHub Desktop

The post Getting started with Git and GitHub is easier than ever with GitHub Desktop 2.2 appeared first on The GitHub Blog.