Note: Timestamps used in this article are in UTC+8 Singapore time, unless stated otherwise.
When we upgraded to Xcode 13.1 in April 2022, we ran into several issues related to the switch, most notably unstable CI tests.
After taking a step back, we investigated this issue by integrating some observability tools into our iOS CI development process. This gave us a comprehensive perspective of the entire process, from the beginning to the end of the UITest job. In this article, we share the improvements we made, the insights we gathered, and the impact of these improvements on the overall process and resource utilisation.
In the following sections, we elaborate on the steps we took to investigate issues such as unstable CI tests and high CPU utilisation, and on the improvements we made to make our iOS CI infrastructure more reliable.
Analyse Xcode 13.1 CPU utilisation
If you are an iOS developer, you have probably also experienced the Spotlight process-related CPU usage problems in Xcode 13.1, which have since been resolved in Xcode 13.2. After investigating, we found that the CPU usage issues were one of the root causes of UITest’s instability and something we needed to fix urgently. We decided not to wait for Apple’s update as it would cost us more time to perform another round of migration.
Before we started UITest, we moved the spotlight.app into a new folder. When the test was complete, we restored the application to its original location. This significantly decreased CPU utilisation by more than 50%.
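The move-then-restore step can be sketched as a small helper. This is a minimal illustration rather than the tooling we actually ran: the function name and the holding directory are hypothetical, and the real location of the application on a CI machine is deliberately left to the caller.

```python
import shutil
from contextlib import contextmanager
from pathlib import Path


@contextmanager
def temporarily_moved(target: Path, holding_dir: Path):
    """Park `target` inside `holding_dir` and move it back on exit.

    Both paths are supplied by the caller; nothing here assumes where
    the application actually lives on the machine.
    """
    holding_dir.mkdir(parents=True, exist_ok=True)
    parked = holding_dir / target.name
    shutil.move(str(target), str(parked))
    try:
        yield parked
    finally:
        # Restore the original location even if the test run raised.
        shutil.move(str(parked), str(target))
```

A CI script would wrap the UITest invocation in `with temporarily_moved(app_path, holding_dir): ...`, so the application is restored even when the tests fail.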
This section helps you better visualise how the different versions of Xcode affected CPU utilisation.
Remove iOS Safari’s dependency during deep link testing
As a superapp, there are countless scenarios that need to be thoroughly tested at Grab before the feature is released in production. One of these tests is deep link testing.
More than 10% of the total number of tests are deep link tests. Typically, it is advised to mock the dependencies throughout the test to ensure that it runs quickly and reliably. However, this creates another reliance on iOS Safari.
As a result, we created a mock browser in UITest. We used the URL to the mock browser as the launch argument, and the same URL is then called back. This method results in a 20% reduction in CI time and more stable tests.
Boot the iOS simulator with permission
It is always a good idea to reset the simulator before running UITest so that there are no residual presets or simulated data from a different test. Additionally, using any of the simulator’s services (location, ATT, contacts, etc.) will prompt the simulator to request permission, which slows down execution. We used a UI interruption handler (a handler block for managing alerts and other dialogues, registered in XCTest through addUIInterruptionMonitor) to manage asynchronous UI interruptions during the test.
We wanted to reduce test execution time, and we knew that many of our tests trigger permission prompts. Therefore, we now boot the simulator with the permissions already granted. This removes the need to handle permission prompts during UITest, which speeds up execution by 5%.
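One way to pre-grant permissions is the `privacy` subcommand of `xcrun simctl`. Whether this matches the exact mechanism we used is an assumption, and not every permission type is covered by that subcommand. The sketch below only builds the command lines; the device identifier, bundle identifier, and service list are hypothetical placeholders.

```python
from typing import List


def simctl_boot_with_permissions(
    device: str, bundle_id: str, services: List[str]
) -> List[List[str]]:
    """Build the simctl commands that boot a simulator and grant permissions.

    `device` is a simulator name or UDID, and `services` are simctl privacy
    service names such as "location" or "contacts" (example values only).
    """
    commands = [["xcrun", "simctl", "boot", device]]
    for service in services:
        commands.append(
            ["xcrun", "simctl", "privacy", device, "grant", service, bundle_id]
        )
    return commands
```

On a machine with Xcode installed, each command could be passed to `subprocess.run(cmd, check=True)` before the UITest job starts.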
Monitor HTTP traffic during the UITest
When writing tests, it is important to mock all resources as this enables us to focus on the code that’s being tested and not how external dependencies interact or respond. However, with a large team working concurrently, it can be challenging to ensure that nothing is actually downloaded from the internet.
Developers change code constantly, and UITests are essential for ensuring that these modifications do not adversely affect existing functionality. Despite the guidance to mock every dependency, we discovered that a significant number of resources were being downloaded each time we ran the tests, which was highly inefficient.
To tackle this issue, we devised a custom tool that tracks all URLs accessed throughout the UITest. This enabled us to identify resources being downloaded from the internet during the testing process.
By using our custom tool to analyse network traffic, we were able to ensure that no resources were being downloaded during testing. Instead, we relied on mocked dependencies, resulting in reduced testing times and improved stability.
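The core check behind such a tool can be illustrated in a few lines. This is a simplified sketch, not our internal tool: it assumes the list of requested URLs has already been captured, and the set of hosts treated as "mocked" is a hypothetical default.

```python
from typing import Iterable, List, Set
from urllib.parse import urlparse

# Assumption: the local mock server is reached via loopback addresses.
MOCK_HOSTS: Set[str] = {"localhost", "127.0.0.1"}


def external_requests(urls: Iterable[str], mock_hosts: Set[str] = MOCK_HOSTS) -> List[str]:
    """Return every URL whose host is not one of the mocked hosts."""
    offenders = []
    for url in urls:
        host = urlparse(url).hostname
        if host not in mock_hosts:
            offenders.append(url)
    return offenders
```

A CI step could fail the job, or simply report the offending URLs to the owning team, whenever the returned list is not empty.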
GitLab load runner analysis
At Grab, we have many teams of developers who maintain the app, make code changes, and raise merge requests (MRs) on a daily basis. To make sure that new changes don’t conflict with existing code, these MRs are integrated with CI.
Additionally, to manage the number of MRs, we maintain a list of clusters that run test runners concurrently for better resource utilisation and performance. We frequently run these tests to determine how many parallel processors are required for stable results.
Return HTTP responses to the local mock server
We have a tool that we use to mock API requests, which we improved to also support HTML responses. This increases the scope of testing and ensures the HTML response sequences work properly.
Use explicit waiting commands
When running multiple tests, timing issues are inevitable and they cause tests to occasionally pass and fail. To mitigate this, most developers prefer to add a sleep command so there is time for the element to render properly before we verify it, but this slows down execution. To improve CI execution, we introduced a lint rule that tracks sleep function usage and suggests that developers use waitForExistence wrappers in UI tests instead.
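A check of this kind can be approximated with a simple text scan over the Swift test sources. The sketch below is illustrative only: the pattern and the function name are assumptions, and a production rule would more likely live in a dedicated Swift linter.

```python
import re
from typing import List, Tuple

# Matches calls such as sleep(2), usleep(500) and Thread.sleep(forTimeInterval: 1).
SLEEP_CALL = re.compile(r"\b(?:Thread\.sleep|usleep|sleep)\s*\(")


def find_sleep_calls(swift_source: str) -> List[Tuple[int, str]]:
    """Return (line number, line text) for each non-comment line that sleeps."""
    findings = []
    for number, line in enumerate(swift_source.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("//"):
            continue  # ignore single-line comments
        if SLEEP_CALL.search(stripped):
            findings.append((number, stripped))
    return findings
```

Each finding can then be reported with a suggestion to replace the fixed delay with an explicit wait such as `element.waitForExistence(timeout:)`.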
Track each failure state
With large codebases, it is quite common to see flakiness in UITests, where tests occasionally succeed and fail without any code changes. This means that test results can be inconsistent and in some cases, faulty. Faulty testing can be frustrating, and quite expensive. This is because engineers need to re-trigger entire builds, which ends up consuming more time.
Initially, we used an internal tool that required all tests to pass on the first run before merging was allowed. However, we realised that this significantly increased the time engineers spent on manual retries, so we modified the rules to allow merging as long as a subsequent retry passes the tests. This minor change improved our engineers’ overall CI experience and did not result in more flaky tests.
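The difference between the two merge rules can be expressed compactly. This is a sketch of the policy only: the function names and the retry limit are hypothetical, and the flaky flag is included to show how such runs can still be tracked over time.

```python
from typing import List


def can_merge_strict(attempts: List[bool]) -> bool:
    """Old rule: the very first CI run must pass."""
    return bool(attempts) and attempts[0]


def can_merge_with_retry(attempts: List[bool], max_attempts: int = 3) -> bool:
    """New rule: any run within the retry limit may pass (limit is an assumption)."""
    return any(attempts[:max_attempts])


def is_flaky(attempts: List[bool]) -> bool:
    """A job is flagged as flaky when it failed at least once and later passed."""
    return (not can_merge_strict(attempts)) and any(attempts)
```

For example, a job whose runs were `[False, True]` is blocked under the strict rule, is mergeable under the retry rule, and is still recorded as flaky.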
Our journey to improve iOS CI infrastructure is still ongoing, but from this experience, we learnt several things:
Focus on the feature being tested by ensuring all external responses are mocked.
A certain degree of test flakiness is expected, but you should monitor past trends. If flakiness increases, there’s probably a deeper-lying issue within your code.
Regularly monitor resource utilisation and performance – detecting a sudden spike early could save you a lot of time and money.
At Cloudflare, we’re excited about the quickly-approaching 5G future. Increasingly, we’ll have access to high throughput and low-latency wireless networks wherever we are. It will make the Internet feel instantaneous, and we’ll find new uses for this connectivity such as sensors that will help us be more productive and energy-efficient. However, this type of connectivity doesn’t have to come at the expense of security, a concern raised in this recent Wired article. Today we’re announcing the creation of a new partnership program for mobile networks—Zero Trust for Mobile Operators—to jointly solve the biggest security and performance challenges.
SASE for Mobile Networks
Every network is different, and the key to managing the complicated security environment of an enterprise network is having lots of tools in the toolbox. Most of these functions fall under the industry buzzword SASE, which stands for Secure Access Service Edge. Cloudflare’s SASE product is Cloudflare One, and it’s a comprehensive platform for network operators. It includes:
Magic WAN, which offers secure Network-as-a-Service (NaaS) connectivity for your data centers, branch offices and cloud VPCs and integrates with your legacy MPLS networks
Cloudflare Access, which is a Zero Trust Network Access (ZTNA) service requiring strict verification for every user and every device before authorizing them to access internal resources.
Gateway, our Secure Web Gateway, which operates between a corporate network and the Internet to enforce security policies and protect company data.
A Cloud Access Security Broker, which monitors the network and external cloud services for security threats.
Cloudflare Area 1, an email threat detection tool to scan email for phishing, malware, and other threats.
We’re excited to partner with mobile network operators for these services because our networks and services are tremendously complementary. Let’s first think about SD-WAN (Software-Defined Wide Area Network) connectivity, which is the foundation on which much of the SASE framework rests. As an example, imagine a developer working from home developing a solution with a Mobile Network Operator’s (MNO) Internet of Things APIs. Maybe they’re developing tracking software for the number of drinks left in a soda machine, or want to track the routes for delivery trucks.
The developer at home and their fleet of devices should be on the same wide area network, securely, and at reasonable cost. What Cloudflare provides is the programmable software layer that enables this secure connectivity. The developer and the developer’s employer still need to have connectivity to the Internet at home, and for the fleet of devices. The ability to make a secure connection to your fleet of devices doesn’t do any good without enterprise connectivity, and the enterprise connectivity is only more valuable with the secure connection running on top of it. They’re the perfect match.
Once the connectivity is established, we can layer on a Zero Trust platform to ensure every user can only access a resource to which they’ve been explicitly granted permission. Any time a user wants to access a protected resource – via ssh, to a cloud service, etc. – they’re challenged to authenticate with their single-sign-on credentials before being allowed access. The networks we use are growing and becoming more distributed. A Zero Trust architecture enables that growth while protecting against known risks.
Given the potential of low-latency 5G networks, consumers and operators are both waiting for a “killer 5G app”. Maybe it will be autonomous vehicles and virtual reality, but our bet is on a quieter revolution: moving compute – the “work” that a server needs to do to respond to a request – from big regional data centers to small city-level data centers, embedding the compute capacity inside wireless networks, and eventually even to the base of cell towers.
Cloudflare’s edge compute platform is called Workers, and it does exactly this – execute code at the edge. It’s designed to be simple. When a developer is building an API to support their product or service, they don’t want to worry about regions and availability zones. With Workers, a developer writes code they want executed at the edge, deploys it, and within seconds it’s running at every Cloudflare data center globally.
Some workloads we already see, and expect to see more of, include:
IoT (Internet of Things) companies implementing complex device logic and security features directly at the edge, letting them add cutting-edge capabilities without adding cost or latency to their devices.
eCommerce platforms storing and caching customized assets close to their visitors for improved customer experience and great conversion rates.
Financial data platforms, including new Web3 players, providing near real-time information and transactions to their users.
A/B testing and experimentation run at the edge without adding latency or introducing dependencies on the client-side.
Fitness-type devices tracking a user’s movement and health statistics can offload compute-heavy workloads while maintaining great speed/latency.
Retail applications providing fast service and a customized experience for each customer without an expensive on-prem solution.
The Cloudflare Case Studies section has additional examples from NCR, Edgemesh, BlockFi, and others on how they’re using the Workers platform. While these examples are exciting, we’re most excited about providing the platform for new innovation.
You may have seen that last week we announced Workers for Platforms is now in General Availability. Workers for Platforms is an umbrella-like structure that allows a parent organization to enable Workers for their own customers. As an MNO, your focus is on providing the means for devices to send communication to clients. For IoT use cases, sending data is the first step, but the exciting potential of this connectivity is the applications it enables. With Workers for Platforms, MNOs can expose an embedded product that allows customers to access compute power at the edge.
The complementary nature of mobile networks and Cloudflare’s network is another area of opportunity. When a user is interacting with the Internet, one of the most important factors for the speed of their connection is the physical distance from their handset to the content and services they’re trying to access. If the data request from a user in Denver needs to wind its way to one of the major Internet hubs in Dallas, San Jose, or Chicago (and then all the way back!), that is going to be slow. But if the MNO can link to the service locally in Denver, the connection will be much faster.
One of the exciting developments with new 5G networks is the ability of MNOs to do more “local breakout”. Many MNOs are moving towards cloud-native and distributed radio access networks (RANs), which provide more flexibility to move and multiply packet cores. These packet cores are the heart of a mobile network and all of a subscriber’s data flows through one.
For Cloudflare – with a data center presence in 275+ cities globally – a user never has to wait long for our services. We can also take it a step further. In some cases, our services are embedded within the MNO or ISP’s own network. The traffic which connects a user to a device, authorizes the connection, and securely transmits data is all within the network boundary of the MNO – it never needs to touch the public Internet, incur added latency, or otherwise compromise the performance for your subscribers.
We’re excited to partner with mobile networks because our security services work best when our customers have excellent enterprise connectivity underneath. Likewise, we think mobile networks can offer more value to their customers with our security software added on top. If you’d like to talk about how to integrate Cloudflare One into your offerings, please email us at [email protected], and we’ll be in touch!
The humble cell phone is now a critical tool in the modern workplace; even more so as the modern workplace has shifted out of the office. Given the billions of mobile devices on the planet — they now outnumber PCs by an order of magnitude — it should come as no surprise that they have become the threat vector of choice for those attempting to break through corporate defenses.
The problem you face in defending against such attacks is that for most Zero Trust solutions, mobile is often a second-class citizen. Those solutions are typically hard to install and manage. And they only work at the software layer, such as with WARP, the mobile (and desktop) apps that connect devices directly into our Zero Trust network. And all this is before you add in the further complication of Bring Your Own Device (BYOD) that more employees are using — you’re trying to deploy Zero Trust on a device that doesn’t belong to the company.
It’s a tricky — and increasingly critical — problem to solve. But it’s also a problem which we think we can help with.
What if employers could offer their employees a deal: we’ll cover your monthly data costs if you agree to let us direct your work-related traffic through a network that has Zero Trust protections built right in? And what’s more, we’ll make it super easy to install — in fact, to take advantage of it, all you need to do is scan a QR code — which can be embedded in an employee’s onboarding material — from your phone’s camera.
Well, we’d like to introduce you to the Cloudflare SIM: the world’s first Zero Trust SIM.
In true Cloudflare fashion, we think that combining the software layer and the network layer enables better security, performance, and reliability. By targeting a foundational piece of technology that underpins every mobile device — the (not so) humble SIM card — we’re aiming to bring an unprecedented level of security (and performance) to the mobile world.
The threat is increasingly mobile
When we say that mobile is the new threat vector, we’re not talking in the abstract. Last month, Cloudflare was one of 130 companies that were targeted by a sophisticated phishing attack. Mobile was the cornerstone of the attack — employees were initially reached by SMS, and the attack relied heavily on compromising 2FA codes.
So far as we’re aware, we were the only company to not be compromised.
A big part of that was because we’re continuously pushing multi-layered Zero Trust defenses. Given how foundational mobile is to how companies operate today, we’ve been working hard to further shore up Zero Trust defenses in this sphere. And this is how we think about Zero Trust SIM: another layer of defense at a different level of the stack, making life even harder for those who are trying to penetrate your organization. With the Zero Trust SIM, you get the benefits of:
Preventing employees from visiting phishing and malware sites: DNS requests leaving the device can automatically and implicitly use Cloudflare Gateway for DNS filtering.
Mitigating common SIM attacks: an eSIM-first approach allows us to prevent SIM-swapping or cloning attacks, and by locking SIMs to individual employee devices, bring the same protections to physical SIMs.
Enabling secure, identity-based private connectivity to cloud services, on-premise infrastructure and even other devices (think: fleets of IoT devices) via Magic WAN. Each SIM can be strongly tied to a specific employee, and treated as an identity signal in conjunction with other device posture signals already supported by WARP.
By integrating Cloudflare’s security capabilities at the SIM-level, teams can better secure their fleets of mobile devices, especially in a world where BYOD is the norm and no longer the exception.
Zero Trust works better when it’s software + On-ramps
Beyond all the security benefits that we get for mobile devices, the Zero Trust SIM transforms mobile into another on-ramp pillar into the Cloudflare One platform.
Cloudflare One presents a single, unified control plane: allowing organizations to apply security controls across all the traffic coming to, and leaving from, their networks, devices and infrastructure. It’s the same with logging: you want one place to get your logs, and one location for all of your security analysis. With the Cloudflare SIM, mobile is now treated as just one more way that traffic gets passed around your corporate network.
Working at the on-ramp rather than the software level has another big benefit — it grants the flexibility to allow devices to reach services not on the Internet, including cloud infrastructure, data centers and branch offices connected into Magic WAN, our Network-as-a-Service platform. In fact, under the covers, we’re using the same software networking foundations that our customers use to build out the connectivity layer behind the Zero Trust SIM. This will also allow us to support new capabilities like Geneve, a new network tunneling protocol, further expanding how customers can connect their infrastructure into Cloudflare One.
We’re following efforts like IoT SAFE (and parallel, non-IoT standards) that enable SIM cards to be used as a root-of-trust, which will enable a stronger association between the Zero Trust SIM, employee identity, and the potential to act as a trusted hardware token.
Get Zero Trust up and running on mobile immediately (and easily)
Of course, every Zero Trust solutions provider promises protection for mobile. But especially in the case of BYOD, getting employees up and running can be tough. To get a device onboarded, there is a deep tour of the Settings app of your phone: accepting profiles, trusting certificates, and (in most cases) a requirement for a mature mobile device management (MDM) solution.
It’s a pain to install.
Now, we’re not advocating the elimination of the client software on the phone any more than we would be on the PC. More layers of defense are always better than fewer. And it remains necessary to secure Wi-Fi connections that are established on the phone. But a big advantage is that the Cloudflare SIM gets employees protected behind Cloudflare’s Zero Trust platform immediately for all mobile traffic.
It’s not just the on-device installation we wanted to simplify, however. It’s companies’ IT supply chains, as well.
One of the traditional challenges with SIM cards is that they have been, until recently, a physical card. A card that you have to mail to employees (a supply chain risk in modern times), that can be lost, stolen, and that can still fail. With a distributed workforce, all of this is made even harder. We know that whilst security is critical, security that is hard to deploy tends to be deployed haphazardly, ad-hoc, and often, not at all.
We’re also exploring introducing physical SIMs (just like the ones above): although we believe eSIMs are the future, especially given their deployment & security advantages, we understand that the future is not always evenly distributed. We’ll be working to make sure that the physical SIMs we ship are as secure as possible, and we’ll be sharing more of how this works in the coming months.
Privacy and transparency for employees
Of course, more and more of the devices that employees use for work are their own. And while employers want to make sure their corporate resources are secure, employees also have privacy concerns when work and private life are blended on the same device. You don’t want your boss knowing that you’re swiping on Tinder.
We want to be thoughtful about how we approach this, from the perspective of both sides. We have sophisticated logging set up as part of Cloudflare One, and this will extend to Cloudflare SIM. Today, Cloudflare One can be explicitly configured to log only the resources it blocks — the threats it’s protecting employees from — without logging every domain visited beyond that. We’re working to make this as obvious and transparent as possible to both employers and employees so that, in true Cloudflare fashion, security does not have to compromise privacy.
Like any product at Cloudflare, we’re testing this on ourselves first (or “dogfooding”, to those in the know). Given the services we provide for over 30% of the Fortune 1000, we continue to observe, and be the target of, increasingly sophisticated cybersecurity attacks. We believe that running the service first is an important step in ensuring we make the Zero Trust SIM both secure and as easy to deploy and manage across thousands of employees as possible.
We’re also bringing the Zero Trust SIM to the Internet of Things: almost every vehicle shipped today has an expectation of cellular connectivity, an increasing number of payment terminals have a SIM card, and so do a growing number of industrial devices across manufacturing and logistics. IoT device security is under increasing levels of scrutiny, and ensuring that the only way a device can connect is a secure one (protected by Cloudflare’s Zero Trust capabilities) can directly prevent devices from becoming part of the next big DDoS botnet.
We’ll be rolling the Zero Trust SIM out to customers on a regional basis as we build our regional connectivity across the globe (if you’re an operator: reach out). We’d especially love to talk to organizations who don’t have an existing mobile device solution in place at all, or who are struggling to make things work today. If you’re interested, then sign up here.
To scale up to the needs of our customers, we’ve adopted ways to efficiently deliver our services through our everyday superapp – whether it’s through continuous process improvements or coding best practices. For one, libraries have made it possible for us to increase our development velocity. In the Passenger App Android team, we’ve a mix of libraries – from libraries that we’ve built in-house to open source ones.
Every week, we release a new version of our Passenger App. Each update contains, on average, between five and ten library updates. In this article, we will explain how we keep all libraries used by our app up to date, and the different actions we take to avoid defects leaking into production.
How many libraries are we using?
Before we add a new library to a project, it goes through a rigorous assessment process covering many parts, such as security issue detection and usability tests measuring the impact on the app size and app startup time. This process ensures that only libraries up to our standards are added.
In total, there are more than 170 libraries powering the SuperApp, including 55 AndroidX artifacts and 22 libraries used for the sole purpose of writing automation testing (Unit Testing or UI Testing).
Who is responsible for updating
While we do have an internal process on how to update the libraries, it doesn’t specify who should do it or how often. In principle, it’s everyone’s responsibility to make sure our libraries are up to date. Each team should be aware of the libraries they’re using and of any new version that is released.
However, this isn’t really the case. In practice, a few developers take ownership of the libraries as a whole and try to maintain them. With more than 170 external libraries, we surveyed the Android developer community on how they manage libraries in the company. The result can be summarized as follows:
While most developers are aware of updates, they don’t update a library because the risk of defects leaking into production is too high.
The risk is to have a defect leaking into production. It can cause regressions on existing features or introduce new crashes in the app. In a worst case scenario, if this isn’t caught before publishing, it can force us to make a hotfix and a certain number of users will be impacted.
Before updating (bumping) a library, we evaluate two metrics:
the usage of this library in the codebase.
the number of changes introduced in the library between the current version and the targeted version.
The risk is assessed by weighing the number of usages of a library against the size of the changes. The following chart illustrates this point.
This arbitrary scale helps us decide whether we require additional signoff from the QA team. If the estimate places the update in the bottom-left corner, it is less risky; if it lands in the top-right corner, we should perform extra verification to reduce the risk.
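The two-axis assessment can be sketched as a small scoring function. The thresholds and the signoff cut-off below are hypothetical values chosen purely for illustration; the real scale is a judgment call made per update.

```python
def level(value: int, low: int, high: int) -> int:
    """Map a raw count to 0 (low), 1 (medium) or 2 (high)."""
    if value < low:
        return 0
    if value < high:
        return 1
    return 2


def needs_qa_signoff(usages: int, changed_lines: int) -> bool:
    """Return True when an update lands towards the top-right of the chart.

    Assumed thresholds: usages are bucketed at 10 and 100 call sites,
    changes at 50 and 500 changed lines.
    """
    usage_level = level(usages, low=10, high=100)
    change_level = level(changed_lines, low=50, high=500)
    return usage_level + change_level >= 3
```

With these example thresholds, a widely used library with a tiny diff does not require signoff, while the same library with a very large diff does.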
A good practice to reduce the risks of updating a library is to update it frequently, decreasing the diffs hence reducing the scope of impact.
Reducing the risk
The first thing we’re doing to reduce the risk is to update our libraries on a weekly basis. As described above, small changes are always less risky than large changes, even if the usage of this particular library is wide. By following incremental updates, we avoid accumulating potential issues over a longer period of time.
For example, the Android Jetpack and Firebase libraries follow a two-week release train. So every two weeks, we check for new updates, read the changelogs, and proceed with the update.
In case of a defect detected, we can easily revert the change until we figure out a proper solution or raise the issue to the library owner.
To reduce risk on any merge request (not limited to library update), we’ve spent a tremendous amount of effort on automating tests. For each new feature we’ve a set of test cases written in Gherkin syntax.
Automation is implemented as UI tests that run on continuous integration (CI) for every merge request. If those tests fail, we won’t be able to merge any changes.
To further elaborate, let’s take this example: Team A developed a lot of features and now has a total of 1,000 test cases. During regression testing before each release, only a subset of those are executed manually based on the impacted area. With automation in place, Team A now has 60% of those tests (600 test cases) executed as part of CI. So, when all the automated tests pass, 60% of the known scenarios have already been verified without any manual effort. This tremendously increases our confidence level while reducing manual testing.
When the update is in the risk threshold area and the automation tests are insufficient, the developer works with QA engineers on analyzing impacted areas. They would then execute test cases related to the impacted area.
For example, if we’re updating the Facebook library, the impacted area would be the “Login with Facebook” functionality. QA engineers would then run test cases related to social login.
One or more teams can be involved. In some cases, QA signoff can be required from all the teams if they’re all affected by the update.
This process requires a lot of effort from different teams and can affect the current roadmap. To avoid falling into this category, we refine the impacted area analysis to be as specific as possible.
Update before it becomes mandatory
Google updates the Google Play requirements regularly to ensure that published apps are fully compatible with the latest Android version.
For example, starting 1st November 2020 all apps must target API 29. This change causes behavior changes for some APIs. New behavior has to be supported and verified for our code, but also for all the libraries we use. Libraries bundled inside our app are also affected if they’re using Android APIs. However, the support for newer APIs is done by each library maintainer. By keeping our libraries up to date, we ensure compatibility with the latest Android API.
Keep updating your libraries. If they’re following a release plan, try to match it so you don’t accumulate too many changes. At Grab, we ship a new version each week, which includes between 5 and 10 library bumps.
For each update, identify the potential risks to your app and find the correct balance between the risk and the effort required to mitigate it. Don’t overestimate the risk, especially if the changes are minimal and only include some minor bug fixes. Some library updates don’t change a single line of code and are only documentation updates.
Invest in robust automation testing to create a high confidence level when making changes, including potentially large changes like a huge library bump.
Authored by Lucas Nelaupe on behalf of the Grab Android Development team. Special thanks to Tridip Thrizu and Karen Kue for the design and copyediting contributions.
As Android developers, we usually have the luxury of treating our backends as magic boxes running in the cloud, faithfully returning us JSON. At Netflix, we have adopted the Backend for Frontend (BFF) pattern: instead of having one general purpose “backend API”, we have one backend per client (Android/iOS/TV/web). On the Android team, while most of our time is spent working on the app, we are also responsible for maintaining this backend that our app communicates with, and its orchestration code.
Recently, we completed a year-long project rearchitecting and decoupling our backend from the centralized model used previously. We did this migration without slowing down the usual cadence of our releases, and with particular care to avoid any negative effects on the user experience. We went from an essentially serverless model in a monolithic service, to deploying and maintaining a new microservice that hosted our app backend endpoints. This allowed Android engineers to have much more control and observability over how we get our data. Over the course of this post, we will talk about our approach to this migration, the strategies that we employed, and the tools we built to support this.
The Netflix Android app uses the Falcor data model and query protocol. This allows the app to query a list of “paths” in each HTTP request, and get specially formatted JSON (jsonGraph) that we use to cache the data and hydrate the UI. As mentioned earlier, each client team owns its respective endpoints, which effectively means that we’re writing the resolvers for each of the paths that are in a query.
As an example, to render the screen shown here, the app sends a query that looks like this:
paths: ["videos", 80154610, "detail"]
A path starts from a root object, and is followed by a sequence of keys that we want to retrieve the data for. In the snippet above, we’re accessing the detail key for the video object with id 80154610.
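To make the idea of a path concrete, here is a minimal sketch of walking a path through a nested object. It is a simplification and not Falcor's actual resolution logic (which also handles references and ranges); the graph contents, the `resolvePath` helper and the field names are hypothetical.

```typescript
// A path segment is either a string key or a numeric id.
type PathSegment = string | number;

// Hypothetical, simplified data graph; not the real Netflix schema.
const graph: Record<string, unknown> = {
  videos: {
    80154610: {
      detail: { title: "Example Title", runtimeMinutes: 120 },
    },
  },
};

// Walk the graph from the root, following one key per path segment.
// Returns undefined as soon as a key cannot be followed.
function resolvePath(root: unknown, path: PathSegment[]): unknown {
  let node: unknown = root;
  for (const segment of path) {
    if (typeof node !== "object" || node === null) {
      return undefined;
    }
    node = (node as Record<string, unknown>)[String(segment)];
  }
  return node;
}

// Resolves the "detail" key of the video with id 80154610.
const detail = resolvePath(graph, ["videos", 80154610, "detail"]);
```

In this sketch, a path whose id is missing from the graph simply resolves to `undefined` rather than throwing.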
For that query, the response is:
In the Monolith
In the example you see above, the data that the app needs is served by different backend microservices. For example, the artwork service is separate from the video metadata service, but we need the data from both in the detail key.
We do this orchestration on our endpoint code using a library provided by our API team, which exposes an RxJava API to handle the downstream calls to the various backend microservices. Our endpoint route handlers are effectively fetching the data using this API, usually across multiple different calls, and massaging it into data models that the UI expects. These handlers we wrote were deployed into a service run by the API team, shown in the diagram below.
As you can see, our code was just a part (#2 in the diagram) of this monolithic service. In addition to hosting our route handlers, this service also handled the business logic necessary to make the downstream calls in a fault tolerant manner. While this gave client teams a very convenient “serverless” model, over time we ran into multiple operational and devex challenges with this service. You can read more about this in our previous posts here: part 1, part 2.
The big difference from the monolith, though, is that this is now a standalone service deployed as a separate “application” (service) in our cloud infrastructure. More importantly, we’re no longer just getting and returning requests from the context of an endpoint script running in a service: we’re now getting a chance to handle the HTTP request in its entirety. Starting from “terminating” the request from our public gateway, we then make downstream calls to the api application (using the previously mentioned JS API), and build up various parts of the response. Finally, we return the required JSON response from our service.
Before we look at what this change meant for us, we want to talk about how we did it. Our app had ~170 query paths (think: route handlers), so we had to figure out an iterative approach to this migration. Let’s take a look at what we built in the app to support this migration. Going back to the screenshot above, if you scroll a bit further down on that page, you will see the section titled “more like this”:
As you can imagine, this does not belong in the video details data for this title. Instead, it is part of a different path: [videos, <id>, similars]. The general idea here is that each UI screen (Activity/Fragment) needs data from multiple query paths to render the UI.
To prepare ourselves for a big change in the tech stack of our endpoint, we decided to track metrics around the time taken to respond to queries. After some consultation with our backend teams, we determined that the most effective way to group these metrics was by UI screen. Our app uses a version of the repository pattern, where each screen can fetch data using a list of query paths. These paths, along with some other configuration, build a Task. These Tasks already carry a uiLabel that uniquely identifies each screen: this label became our starting point, which we passed in a header to our endpoint. We then used this to log the time taken to respond to each query, grouped by the uiLabel. This meant that we could track any possible regressions to user experience by screen, which corresponds to how users navigate through the app. We will talk more about how we used these metrics in the sections to follow.
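As a rough illustration of grouping response times by screen, the sketch below wraps a handler, reads the label from a request header and records the elapsed time under that label. The header name `x-ui-label`, the `EndpointRequest` shape and the `timed` helper are assumptions made for this example; the handler is synchronous to keep the sketch short, whereas a real endpoint would be asynchronous.

```typescript
// Hypothetical header name; the real header is not named in this post.
const UI_LABEL_HEADER = "x-ui-label";

interface EndpointRequest {
  headers: Record<string, string>;
  paths: Array<Array<string | number>>;
}

// Latency samples in milliseconds, grouped by screen label.
const latenciesByLabel: Record<string, number[]> = {};

function recordLatency(label: string, millis: number): void {
  const samples = latenciesByLabel[label] || (latenciesByLabel[label] = []);
  samples.push(millis);
}

// Wrap a handler so that every call is timed and attributed to the
// screen that issued it. The clock is injectable to make testing easy.
function timed<T>(
  handler: (req: EndpointRequest) => T,
  now: () => number = Date.now
): (req: EndpointRequest) => T {
  return (req) => {
    const label = req.headers[UI_LABEL_HEADER] || "unknown";
    const start = now();
    try {
      return handler(req);
    } finally {
      // Record the latency even when the handler throws.
      recordLatency(label, now() - start);
    }
  };
}
```

Requests that arrive without the header are grouped under `"unknown"`, so missing labels show up in the metrics instead of being silently dropped.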
Fast forward a year: the 170 number we started with slowly but surely whittled down to 0, and we had all our “routes” (query paths) migrated to the new microservice. So, how did it go…?
Today, a big part of this migration is done: most of our app gets its data from this new microservice, and hopefully our users never noticed. As with any migration of this scale, we hit a few bumps along the way: but first, let’s look at the good parts.
Migration Testing Infrastructure
Our monolith had been around for many years and hadn’t been created with functional and unit testing in mind, so those were independently bolted on by each UI team. For the migration, testing was a first-class citizen. While there was no technical reason stopping us from adding full automation coverage earlier, it was just much easier to add this while migrating each query path.
For each route we migrated, we wanted to make sure we were not introducing any regressions: either in the form of missing (or worse, wrong) data, or by increasing the latency of each endpoint. If we pare down the problem to absolute basics, we essentially have two services returning JSON. We want to make sure that for a given set of paths as input, the returned JSON is always exactly the same. With lots of guidance from other platform and backend teams, we took a 3-pronged approach to ensure correctness for each route migrated.
Functional Testing
Functional testing was the most straightforward of them all: a set of tests alongside each path exercised it against the old and new endpoints. We used the excellent Jest testing framework with a set of custom matchers that sanitized a few things like timestamps and UUIDs. It gave us really high confidence during development, and helped us cover all the code paths that we had to migrate. The test suite automated a few things like setting up a test user, and matching the query parameters/headers sent by a real device: but that’s as far as it goes. The scope of functional testing was limited to the test scenarios we had already set up, and we would never be able to replicate the variety of device, language and locale combinations used by millions of our users across the globe.
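The sanitizing idea can be sketched without any test framework: replace values that legitimately differ between two calls with placeholders, then compare what is left. This is not the actual set of custom Jest matchers; the patterns, placeholder strings and function names below are assumptions for illustration.

```typescript
// Patterns for values that differ between two otherwise identical responses.
const UUID_PATTERN = /^[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}$/i;
const ISO_TIMESTAMP_PATTERN = /^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(\.\d+)?Z$/;

// Recursively replace volatile strings with fixed placeholders.
function sanitize(value: unknown): unknown {
  if (typeof value === "string") {
    if (UUID_PATTERN.test(value)) return "<uuid>";
    if (ISO_TIMESTAMP_PATTERN.test(value)) return "<timestamp>";
    return value;
  }
  if (Array.isArray(value)) {
    return value.map((item) => sanitize(item));
  }
  if (typeof value === "object" && value !== null) {
    const source = value as Record<string, unknown>;
    const result: Record<string, unknown> = {};
    // Sort keys so that key order does not affect the comparison.
    for (const key of Object.keys(source).sort()) {
      result[key] = sanitize(source[key]);
    }
    return result;
  }
  return value;
}

// Two responses are equivalent if they match once volatile values are masked.
function equivalentResponses(oldResponse: unknown, newResponse: unknown): boolean {
  return JSON.stringify(sanitize(oldResponse)) === JSON.stringify(sanitize(newResponse));
}
```

In a Jest suite, a function like `equivalentResponses` could back a custom matcher registered with `expect.extend`, so that each path test reads as a single assertion against the old and new endpoints.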
Replay Testing
Enter replay testing. This was a custom-built, 3-step pipeline:
1. Capture the production traffic for the desired path(s)
2. Replay the traffic against the two services in the TEST environment
3. Compare and assert for differences
It was a self-contained flow that, by design, captured entire requests, and not just the one path we requested. This test was the closest to production: it replayed real requests sent by the device, thus exercising the part of our service that fetches responses from the old endpoint and stitches them together with data from the new endpoint. The thoroughness and flexibility of this replay pipeline is best described in its own post. For us, the replay test tooling gave the confidence that our new code was nearly bug free.
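The replay and compare steps can be sketched as follows, assuming the capture step has already produced a list of requests. This is not the real pipeline: services are modelled as plain synchronous functions, the types and names are hypothetical, and the comparison is a naive string comparison of the serialized responses (so it is sensitive to key order).

```typescript
// A captured production request: the full list of paths plus its headers.
interface CapturedRequest {
  paths: string[];
  headers: Record<string, string>;
}

// Hypothetical stand-in for a service deployed in the TEST environment.
type Service = (request: CapturedRequest) => unknown;

interface Mismatch {
  request: CapturedRequest;
  expected: string;
  actual: string;
}

// Replay every captured request against both services and collect the
// requests whose serialized responses differ.
function replayAndCompare(
  captured: CapturedRequest[],
  oldService: Service,
  newService: Service
): Mismatch[] {
  const mismatches: Mismatch[] = [];
  for (const request of captured) {
    const expected = JSON.stringify(oldService(request));
    const actual = JSON.stringify(newService(request));
    if (expected !== actual) {
      mismatches.push({ request, expected, actual });
    }
  }
  return mismatches;
}
```

An empty result means the two services agreed on every replayed request; any entry carries the original request, which makes the difference reproducible.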
Canaries
Canaries were the last step involved in “vetting” our new route handler implementation. In this step, a pipeline picks our candidate change, deploys the service, makes it publicly discoverable, and redirects a small percentage of production traffic to this new service. You can find a lot more details about how this works in the Spinnaker canaries documentation.
This is where our previously mentioned uiLabel metrics become relevant: for the duration of the canary, Kayenta was configured to capture and compare these metrics for all requests (in addition to the system level metrics already being tracked, like server CPU and memory). At the end of the canary period, we got a report that aggregated and compared the percentiles of each request made by a particular UI screen. Looking at our high traffic UI screens (like the homepage) allowed us to identify any regressions caused by the endpoint before we enabled it for all our users. Here’s one such report to get an idea of what it looks like:
Each identified regression (like this one) was subject to a lot of analysis: chasing down a few of these led to previously unidentified performance gains! Being able to canary a new route let us verify latency and error rates were within acceptable limits. This type of tooling required time and effort to create, but in the end, the feedback it provided was well worth the cost.
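To illustrate what comparing percentiles per UI screen means, here is a simplified sketch. It does not reproduce Kayenta's analysis, which applies its own statistical judgment; it only compares one percentile per uiLabel against a fixed tolerance. The function names, the nearest-rank percentile method and the 10% default threshold are all assumptions for this example.

```typescript
// Nearest-rank percentile over a list of latency samples (milliseconds).
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) {
    throw new Error("cannot take a percentile of an empty sample set");
  }
  const sorted = samples.slice().sort((a, b) => a - b);
  const rank = Math.max(1, Math.ceil((p / 100) * sorted.length));
  return sorted[rank - 1];
}

type LatenciesByLabel = Record<string, number[]>;

interface Regression {
  uiLabel: string;
  baselineMs: number;
  canaryMs: number;
}

// Flag every screen whose canary percentile exceeds the baseline percentile
// by more than the tolerated ratio. The 10% default is arbitrary.
function findRegressions(
  baseline: LatenciesByLabel,
  canary: LatenciesByLabel,
  p: number = 99,
  tolerance: number = 0.1
): Regression[] {
  const regressions: Regression[] = [];
  for (const uiLabel of Object.keys(baseline)) {
    const baselineSamples = baseline[uiLabel];
    const canarySamples = canary[uiLabel];
    // Skip screens that have no data on one of the two sides.
    if (!baselineSamples || baselineSamples.length === 0) continue;
    if (!canarySamples || canarySamples.length === 0) continue;
    const baselineMs = percentile(baselineSamples, p);
    const canaryMs = percentile(canarySamples, p);
    if (canaryMs > baselineMs * (1 + tolerance)) {
      regressions.push({ uiLabel, baselineMs, canaryMs });
    }
  }
  return regressions;
}
```

Grouping by uiLabel is what lets a report like this point at a specific screen rather than at the service as a whole.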
Many Android engineers will be familiar with systrace or one of the excellent profilers in Android Studio. Imagine getting similar tracing for your endpoint code, spanning many different microservices: that is effectively what distributed tracing provides. Our microservice and router were already integrated into the Netflix request tracing infrastructure. We used Zipkin to consume the traces, which allowed us to search for a trace by path. Here’s what a typical trace looks like:
Request tracing has been critical to the success of Netflix infrastructure, but when we operated in the monolith, we did not have the ability to get this detailed look into how our app interacted with the various microservices. To demonstrate how this helped us, let us zoom into this part of the picture:
It’s pretty clear here that the calls are being serialized: however, at this point we’re already ~10 hops disconnected from our microservice. It’s hard to conclude this, and uncover such problems, from looking at raw numbers: either on our service or the testservice above, and even harder to attribute them back to the exact UI platform or screen. With the rich end-to-end tracing instrumented in the Netflix microservice ecosystem and made easily accessible via Zipkin, we were able to pretty quickly triage this problem to the responsible team.
As we mentioned earlier, our new service now had the “ownership” for the lifetime of the request. Where previously we only returned a Java object back to the api middleware, now the final step in the service was to flush the JSON down the request buffer. This increased ownership gave us the opportunity to easily test new optimisations at this layer. For example, with about a day’s worth of work, we had a prototype of the app using the binary msgpack response format instead of plain JSON. In addition to the flexible service architecture, this can also be attributed to the Node.js ecosystem and the rich selection of npm packages available.
Before the migration, developing and debugging on the endpoint was painful due to slow deployment and lack of local debugging (this post covers that in more detail). One of the Android team’s biggest motivations for doing this migration project was to improve this experience. The new microservice gave us fast deployment and debug support by running the service in a local Docker instance, which has led to significant productivity improvements.
In the arduous process of breaking a monolith, you might get a sharp shard or two flung at you. A lot of what follows is not specific to Android, but we want to briefly mention these issues because they did end up affecting our app.
The old api service was running on the same “machine” that also cached a lot of video metadata (by design). This meant that data that was static (e.g. video titles, descriptions) could be aggressively cached and reused across multiple requests. However, with the new microservice, even fetching this cached data needed to incur a network round trip, which added some latency.
This might sound like a classic example of “monoliths vs microservices”, but the reality is somewhat more complex. The monolith was also essentially still talking to a lot of downstream microservices: it just happened to have a custom-designed cache that helped a lot. Some of this increased latency was mitigated by better observability and more efficient batching of requests. But, for a small fraction of requests, after a lot of attempts at optimization, we just had to take the latency hit: sometimes, there are no silver bullets.
Increased Partial Query Errors
As each call to our endpoint might need to make multiple requests to the api service, some of these calls can fail, leaving us with partial data. Handling such partial query errors isn’t a new problem: it is baked into the nature of composite protocols like Falcor or GraphQL. However, as we moved our route handlers into a new microservice, we now introduced a network boundary for fetching any data, as mentioned earlier.
This meant that we now ran into partial states that weren’t possible before because of the custom caching. We were not completely aware of this problem in the beginning of our migration: we only saw it when some of our deserialized data objects had null fields. Since a lot of our code uses Kotlin, these partial data objects led to immediate crashes, which helped us notice the problem early: before it ever hit production.
As a result of increased partial errors, we’ve had to improve our overall error handling approach and explore ways to minimize the impact of network errors. In some cases, we also added custom retry logic on either the endpoint or the client code.
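One possible shape for such retry logic is sketched below: fetch every path, retry only the ones that failed, and report whatever still fails as a partial error instead of failing the whole query. This is not our production code; the `FetchPath` type, the result shapes and the default of two retries are hypothetical, and the fetcher is synchronous to keep the example short.

```typescript
// Result of fetching one query path: data on success, an error message on failure.
interface PathResult {
  data?: unknown;
  error?: string;
}

// Hypothetical fetcher for a single path; a real one would make a network call.
type FetchPath = (path: string) => PathResult;

interface QueryResult {
  data: Record<string, unknown>;
  errors: Record<string, string>;
}

// Fetch every path, then retry only the failed ones up to maxRetries times.
// Paths that still fail are returned in errors alongside the partial data.
function fetchWithRetry(
  paths: string[],
  fetchPath: FetchPath,
  maxRetries: number = 2
): QueryResult {
  const data: Record<string, unknown> = {};
  let errors: Record<string, string> = {};
  let pending = paths;
  for (let attempt = 0; attempt <= maxRetries && pending.length > 0; attempt++) {
    const failed: string[] = [];
    errors = {}; // Keep only the errors from the most recent attempt.
    for (const path of pending) {
      const result = fetchPath(path);
      if (result.error === undefined) {
        data[path] = result.data;
      } else {
        errors[path] = result.error;
        failed.push(path);
      }
    }
    pending = failed;
  }
  return { data, errors };
}
```

Returning `errors` next to `data` lets the caller decide per screen whether partial data is good enough to render or whether the whole request should be treated as failed.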
This has been a long (you can tell!) and fulfilling journey for us on the Android team: as we mentioned earlier, on our team we typically work on the app and, until now, we did not have a chance to work with our endpoint with this level of scrutiny. Not only did we learn more about the intriguing world of microservices, but this project also gave us the perfect opportunity to add observability to our app-endpoint interaction. At the same time, we ran into some unexpected issues like partial errors and made our app more resilient to them in the process.
As we continue to evolve and improve our app, we hope to share more insights like these with you.
The planning and successful migration to this new service was the combined effort of multiple backend and front end teams.
On the Android team, we ship the Netflix app on Android to millions of members around the world. Our responsibilities include extensive A/B testing on a wide variety of devices by building highly performant and often custom UI experiences. We work on data driven optimizations at scale in a diverse and sometimes unforgiving device and network ecosystem. If you find these challenges interesting, and want to work with us, we have an open position.