Tag Archives: Engineering

Unveiling the process: The creation of our powerful campaign builder

Post Syndicated from Grab Tech original https://engineering.grab.com/the-creation-of-our-powerful-campaign-builder

In a previous blog, we introduced Trident, Grab’s internal marketing campaign platform. Trident empowers our marketing team to configure If This, Then That (IFTTT) logic, and it processes real-time events based on that logic.

While we mainly covered how we scaled up the system to handle large volumes of real-time events, we did not explain the implementation of the event processing mechanism. This blog fills in that missing piece. We will walk you through the various processing mechanisms supported in Trident and how they were built.

Base building block: Treatment

In our system, we use the term “treatment” to refer to the core unit of a full IFTTT data structure. A treatment is an amalgamation of three key elements – an event, conditions (which are optional), and actions. For example, consider a promotional campaign that offers “100 GrabPoints for completing a ride paid with GrabPay Credit”. This campaign can be transformed into a treatment in which the event is “ride completion”, the condition is “payment made using GrabPay Credit”, and the action is “awarding 100 GrabPoints”.
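
To make the structure concrete, here is a minimal sketch in Go of how a treatment could be modelled. The field and type names are illustrative only, not Trident’s actual schema.

// A minimal sketch of a treatment. Field and type names are illustrative,
// not Trident's actual schema.
type Treatment struct {
  Event      string      // e.g. "onRideCompleted"
  Conditions []Condition // optional; all must pass for the actions to run
  Actions    []Action    // executed once every condition passes
}

type Condition struct {
  LHS      string // e.g. "payment.method"
  Operator string // e.g. "eq"
  RHS      string // e.g. "GrabPayCredit"
}

type Action struct {
  Type    string            // e.g. "awardReward"
  Payload map[string]string // e.g. {"points": "100"}
}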

Data generated across various Kafka streams by multiple services within Grab forms the crux of events and conditions for a treatment. Trident processes these Kafka streams, treating each data object as an event for the treatments. It evaluates the set conditions against the data received from these events. If all conditions are met, Trident then executes the actions.

Figure 1. Trident processes Kafka streams as events for treatments.

When the Trident user interface (UI) was first built, campaign creators had to understand the treatment concept and configure treatments themselves. As we improved the UI, it became more user-friendly.

Building on top of treatment

Campaigns can be more complex than the example we provided earlier. In such scenarios, a single campaign may need transformation into several treatments. All these individual treatments are categorised under what we refer to as a “treatment group”. In this section, we discuss features that we have developed to manage such intricate campaigns.

Counter

Let’s say we have a marketing campaign that “rewards users after they complete 4 rides”. For this requirement, it’s necessary for us to keep track of the number of rides each user has completed. To make this possible, we developed a capability known as counter.

On the backend, a single counter setup translates into two treatments.

Treatment 1:

  • Event: onRideCompleted
  • Condition: N/A
  • Action: incrementUserStats

Treatment 2:

  • Event: onProfileUpdate
  • Condition: Ride Count == 4
  • Action: awardReward

In this feature, we introduce a new event, onProfileUpdate. The incrementUserStats action in Treatment 1 triggers the onProfileUpdate event following the update of the user counter. This allows Treatment 2 to consume the event and perform subsequent evaluations.

Figure 2. The end-to-end evaluation process when using the Counter feature.

When the onRideCompleted event is consumed, Treatment 1 is evaluated, which then executes the incrementUserStats action. This action increments the user’s ride counter in the database, fetches the latest counter value, and publishes an onProfileUpdate event to Kafka.

Other consumers also listen to onProfileUpdate events. When this event is consumed, Treatment 2 is evaluated. This involves verifying whether the Ride Count equals 4. If the condition is satisfied, the awardReward action is triggered.
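
As a rough sketch, the incrementUserStats action could look like the following. The CounterStore and EventPublisher interfaces are assumptions standing in for Trident’s actual storage and Kafka layers.

// Sketch of the incrementUserStats action. CounterStore and EventPublisher
// are hypothetical interfaces, not Trident's real internals.
type CounterStore interface {
  // Increment atomically adds delta to the user's counter and returns the new value.
  Increment(userID, counterKey string, delta int64) (int64, error)
}

type EventPublisher interface {
  Publish(topic string, payload map[string]any) error
}

func incrementUserStats(store CounterStore, pub EventPublisher, userID string) error {
  // Increment the user's ride counter and read back the latest value.
  rideCount, err := store.Increment(userID, "ride_count", 1)
  if err != nil {
    return err
  }

  // Publish an onProfileUpdate event so that downstream treatments
  // (such as "Ride Count == 4 -> awardReward") can be evaluated.
  return pub.Publish("onProfileUpdate", map[string]any{
    "userID":    userID,
    "rideCount": rideCount,
  })
}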

This feature is not limited to counting event occurrences. It can also tally transaction amounts, among other things.

Delay

Another feature available on Trident is a delay function. This feature is particularly beneficial in situations where we want to time our actions based on user behaviour. For example, we might want to give a ride voucher to a user three hours after they’ve ordered a ride to a theme park. The intention for this is to offer them a voucher they can use for their return trip.

On the backend, a delay setup translates into two treatments. Given the above scenario, the treatments are as follows:

Treatment 1:

  • Event: onRideCompleted
  • Condition: Dropoff Location == Universal Studio
  • Action: scheduleDelayedEvent

Treatment 2:

  • Event: onDelayedEvent
  • Condition: N/A
  • Action: awardReward

We introduce a new event, onDelayedEvent, which Treatment 1 triggers during the scheduleDelayedEvent action. This is made possible by using Simple Queue Service (SQS), given its built-in capability to publish an event with a delay.

Figure 3. The end-to-end evaluation process when using the Delay feature.

The maximum delay that SQS supports is 15 minutes, while our platform allows delays of up to x hours. To work around this limitation, we chain messages: each time the delayed message is received before the target time, we republish it with a further delay of up to 15 minutes, repeating until the cumulative delay reaches the desired x hours.
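
A simplified sketch of this chaining logic is shown below. The Queue interface and DelayedEvent type are illustrative stand-ins for the SQS client and message payload, not the real types.

// Sketch of chaining delayed messages to exceed SQS's 15-minute cap.
// Queue and DelayedEvent are illustrative stand-ins, not the real SQS types.
const maxSQSDelay = 15 * time.Minute

type DelayedEvent struct {
  TreatmentGroupID string
  UserID           string
  FireAt           time.Time // when onDelayedEvent should actually fire
}

type Queue interface {
  SendWithDelay(evt DelayedEvent, delay time.Duration) error
}

// onDelayedMessage runs each time a delayed message is received.
func onDelayedMessage(q Queue, evt DelayedEvent, emitOnDelayedEvent func(DelayedEvent)) error {
  remaining := time.Until(evt.FireAt)
  if remaining <= 0 {
    // The full delay has elapsed: fire onDelayedEvent so Treatment 2 is evaluated.
    emitOnDelayedEvent(evt)
    return nil
  }
  // Otherwise re-enqueue with another hop of at most 15 minutes.
  hop := remaining
  if hop > maxSQSDelay {
    hop = maxSQSDelay
  }
  return q.SendWithDelay(evt, hop)
}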

Limit

The Limit feature is used to restrict the number of actions for a specific campaign or user within that campaign. This feature can be applied on a daily basis or for the full duration of the campaign.

For instance, we can use the Limit feature to distribute 1000 vouchers to users who have completed a ride and restrict it to only one voucher for one user per day. This ensures a controlled distribution of rewards and prevents a user from excessively using the benefits of a campaign.

In the backend, a limit setup translates into conditions within a single treatment. Given the above scenario, the treatment would be as follows:

  • Event: onRideCompleted
  • Condition: TotalUsageCount <= 1000 AND DailyUserUsageCount <= 1
  • Action: awardReward

Similar to the Counter feature, we need to keep track of usage counts in the database: the total number of rewards given out for the campaign, and the number given to each user per day.
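
A minimal sketch of how these limit conditions could be evaluated is shown below. The UsageStore interface is an assumption; in practice the counters would be incremented atomically so that concurrent events cannot over-issue rewards.

// Sketch of the Limit check. UsageStore is a hypothetical interface.
type UsageStore interface {
  TotalUsage(campaignID string) (int64, error)
  DailyUserUsage(campaignID, userID string, day time.Time) (int64, error)
}

func limitConditionsMet(store UsageStore, campaignID, userID string) (bool, error) {
  total, err := store.TotalUsage(campaignID)
  if err != nil {
    return false, err
  }
  daily, err := store.DailyUserUsage(campaignID, userID, time.Now())
  if err != nil {
    return false, err
  }
  // Mirrors the treatment condition: TotalUsageCount <= 1000 AND DailyUserUsageCount <= 1.
  return total <= 1000 && daily <= 1, nil
}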

Figure 4. The end-to-end evaluation process when using the Limit feature.

A better campaign builder

As our campaigns grew more complex, treatment creation quickly became overwhelming. A complex logic flow often required the creation of many treatments, which was cumbersome and error-prone. The need for a simpler, more visual campaign builder UI became evident.

Our design team came up with a flow-chart-like UI. Figures 5, 6, and 7 show how some hypothetical campaign setups would look in the new UI.

Figure 5. When users complete a food order, if they are a gold user, award them with A. However, if they are a silver user, award them with B.
Figure 6. When users complete a food or mart order, increment a counter. When the counter reaches 5, send them a message. Once the counter reaches 10, award them with points.
Figure 7. When a user confirms a ride booking, wait for 1 minute, and then conduct A/B testing by sending a message 50% of the time.

The campaign setup in the new UI can be naturally stored as a node tree structure. The following is how the example in figure 5 would look in JSON format. We assign each node a unique numeric ID and store a map from ID to node content.

{
  "1": {
    "type": "scenario",
    "data": { "eventType": "foodOrderComplete"  },
    "children": ["2", "3"]
  },
  "2": {
    "type": "condition",
    "data": { "lhs": "var.user.tier", "operator": "eq", "rhs": "gold" },
    "children": ["4"]
  },
  "3": {
    "type": "condition",
    "data": { "lhs": "var.user.tier", "operator": "eq", "rhs": "silver" },
    "children": ["5"]
  },
  "4": {
    "type": "action",
    "data": {
      "type": "awardReward",
      "payload": { "rewardID": "ID-of-A"  }
    }
  },
  "5": {
    "type": "action",
    "data": {
      "type": "awardReward",
      "payload": { "rewardID": "ID-of-B"  }
    }
  }
}
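
As a sketch, the node map above can be decoded into a simple Go structure. The struct fields mirror the JSON keys; everything else is illustrative.

// Sketch of decoding the node tree JSON. The struct mirrors the JSON keys above.
type Node struct {
  Type     string          `json:"type"`     // "scenario", "condition", or "action"
  Data     json.RawMessage `json:"data"`     // node-type-specific payload
  Children []string        `json:"children"` // IDs of child nodes
}

// decodeNodeTree parses the stored campaign into a map of node ID -> node content.
func decodeNodeTree(raw []byte) (map[string]Node, error) {
  tree := map[string]Node{}
  if err := json.Unmarshal(raw, &tree); err != nil {
    return nil, err
  }
  return tree, nil
}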

Conversion to treatments

The question then arises: how do we execute this node tree as treatments? This requires a conversion process, so we developed the following algorithm for converting the node tree into equivalent treatments:

// convertToTreatments is the main function
func convertToTreatments(rootNode) -> []Treatment:
  output = []

  for each scenario in rootNode.scenarios:
    // traverse down each branch
    context = createConversionContext(scenario)
    for child in scenario.children:
      treatments = convertHelper(context, child)
      output.append(treatments)

  return output

// convertHelper is a recursive helper function
func convertHelper(context, node) -> []Treatment:
  output = []
  f = getNodeConverterFunc(node.type)
  treatments, updatedContext = f(context, node)

  output.append(treatments)

  for child in node.children:
    treatments = convertHelper(updatedContext, child)
    output.append(treatments)

  return output

The getNodeConverterFunc function returns a different handler function according to the node type. Each handler function either updates the conversion context, creates treatments, or both.

Table 1. The handler logic mapping for each node type.
  • condition: Add conditions into the context and return the updated context.
  • action: Return a treatment with the event type, conditions from the context, and the action itself.
  • delay: Return a treatment with the event type, conditions from the context, and a scheduleDelayedEvent action.
  • count: Return a treatment with the event type, conditions from the context, and an incrementUserStats action.
  • count condition: Form a condition with the count key from the context, and return an updated context with the condition.
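
To make the table concrete, here is a rough sketch of what two of these handlers could look like, reusing the Treatment and Node sketches from earlier. ConversionContext, parseCondition, and parseAction are assumed helpers for illustration, not Trident’s real types.

// Sketch of two node converter functions from Table 1.
// ConversionContext, parseCondition, and parseAction are illustrative.
type ConversionContext struct {
  EventType  string
  Conditions []Condition
}

type nodeConverterFunc func(ctx ConversionContext, node Node) ([]Treatment, ConversionContext)

// "condition" nodes: add the condition to the context and return the updated context.
func convertConditionNode(ctx ConversionContext, node Node) ([]Treatment, ConversionContext) {
  // Copy the slice so sibling branches do not share a backing array.
  conds := append([]Condition{}, ctx.Conditions...)
  ctx.Conditions = append(conds, parseCondition(node.Data))
  return nil, ctx
}

// "action" nodes: emit a treatment combining the accumulated context and the action.
func convertActionNode(ctx ConversionContext, node Node) ([]Treatment, ConversionContext) {
  t := Treatment{
    Event:      ctx.EventType,
    Conditions: ctx.Conditions,
    Actions:    []Action{parseAction(node.Data)},
  }
  return []Treatment{t}, ctx
}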

It is important to note that treatments cannot always be reverted to their original node tree structure. This is because different node trees might be converted into the same set of treatments.

The following is an example where two different node tree setups correspond to the same set of treatments:

  • Food order complete -> if gold user -> then award A
  • Food order complete -> if silver user -> then award B
Figure 8. An example of two node tree setups corresponding to the same set of treatments.

Therefore, we need to store both the campaign node tree JSON and treatments, along with the mapping between the nodes and the treatments. Campaigns are executed using treatments, but displayed using the node tree JSON.

Figure 9. For each campaign, we store both the node tree JSON and treatments, along with their mapping.

How we handle campaign updates

There are instances where a marketing user updates a campaign after its creation. For such cases we need to identify:

  • Which existing treatments should be removed.
  • Which existing treatments should be updated.
  • What new treatments should be added.

We can do this by using the node-treatment mapping information we stored. The following is the pseudocode for this process:

func howToUpdateTreatments(oldTreatments []Treatment, newTreatments []Treatment):
  treatmentsUpdate = map[int]Treatment // treatment ID -> updated treatment
  treatmentsRemove = []int // list of treatment IDs
  treatmentsAdd = []Treatment // list of new treatments to be created

  matchedOldTreatmentIDs = set()

  for newTreatment in newTreatments:
    matched = false

    // see whether the nodes match any old treatment
    for oldTreatment in oldTreatments:
      // two treatments are considered matched if their linked node IDs are identical
      if isSame(oldTreatment.nodeIDs, newTreatment.nodeIDs):
        matched = true
        treatmentsUpdate[oldTreatment.ID] = newTreatment
        matchedOldTreatmentIDs.Add(oldTreatment.ID)
        break

    // if no match, that means it is a new treatment we need to create
    if not matched:
      treatmentsAdd.Append(newTreatment)

  // all the non-matched old treatments should be deleted
  for oldTreatment in oldTreatments:
    if not matchedOldTreatmentIDs.contains(oldTreatment.ID):
      treatmentsRemove.Append(oldTreatment.ID)

  return treatmentsAdd, treatmentsUpdate, treatmentsRemove

For a visual illustration, let’s consider a campaign that initially resembles the one shown in figure 10. The node IDs are highlighted in red.

Figure 10. A campaign in node tree structure.

This campaign will generate two treatments.

Table 2. The campaign shown in figure 10 will generate two treatments.
  • Treatment 1 (linked node IDs: 1, 2, 3): Event: food order complete; Condition: gold user; Action: award A.
  • Treatment 2 (linked node IDs: 1, 4, 5): Event: food order complete; Condition: silver user; Action: award B.

After creation, the campaign creator updates the upper condition branch, deletes the lower branch, and creates a new branch. Note that after node deletion, the deleted node ID will not be reused.

Figure 11. An updated campaign in node tree structure.

Applying our update logic to the changes shown in figure 11, the following updates will be performed:

  • Update the action of treatment 1 to “award C”.
  • Delete treatment 2.
  • Create a new treatment: food -> is promo used -> send push.

Conclusion

This article reveals the workings of Trident, our bespoke marketing campaign platform. By exploring the core concept of a “treatment” and additional features like Counter, Delay and Limit, we illustrated the flexibility and sophistication of our system.

We’ve explained changes to the Trident UI that have made campaign creation more intuitive. Transforming campaign setups into executable treatments while preserving the visual representation ensures seamless campaign execution and adaptation.

Our devotion to improving Trident aims to empower our marketing team to design engaging and dynamic campaigns, ultimately providing excellent experiences to our users.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 700 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Chimera Sandbox: A scalable experimentation and development platform for Notebook services

Post Syndicated from Grab Tech original https://engineering.grab.com/chimera-sandbox

Key to innovation and improvement in machine learning (ML) models is the ability for rapid iteration. Our team, Chimera, part of the Artificial Intelligence (AI) Platform team, provides the essential compute infrastructure, ML pipeline components, and backend services. This support enables our ML engineers, data scientists, and data analysts to efficiently experiment and develop ML solutions at scale.

With a commitment to leveraging the latest Generative AI (GenAI) technologies, Grab is enhancing productivity tools for all Grabbers. Our Chimera Sandbox, a scalable Notebook platform, facilitates swift experimentation and development of ML solutions, offering deep integration with our AI Gateway. This enables easy access to various Large Language Models (LLMs) (both proprietary and open source), ensuring scalability, compliance, and access control are managed seamlessly.

What is Chimera Sandbox?

Chimera Sandbox is a Notebook service platform. It allows users to launch multiple notebook and visualisation services for experimentation and development. The platform offers an extremely quick onboarding process enabling any Grabber to start learning, exploring and experimenting in just a few minutes. This inclusivity and ease of use have been key in driving the adoption of the platform across different teams within Grab and empowering all Grabbers to be GenAI-ready.

One significant challenge in harnessing ML for innovation, whether for technical experts or non-technical enthusiasts, has been the accessibility of resources. This includes GPU instances and specialised services for developing LLM-powered applications. Chimera Sandbox addresses this head-on by offering an extensive array of compute instances, both with and without GPU support, thus removing barriers to experimentation. Its deep integration with Grab’s suite of internal ML tools transforms the way users approach ML projects. Users benefit from features like hyperparameter tuning, tracking ML training metadata, accessing diverse LLMs through Grab’s AI Gateway, and experimenting with rich datasets from Grab’s data lake. Chimera Sandbox ensures that users have everything they need at their fingertips. This ecosystem not only accelerates the development process but also encourages innovative approaches to solving complex problems.

The underlying compute infrastructure of the Chimera Sandbox platform is Grab’s very own battle-tested, highly scalable ML compute infrastructure running on multiple Kubernetes clusters. Each cluster can scale up to thousands of nodes at peak times gracefully. This scalability ensures that the platform can handle the high computational demands of ML tasks. The robustness of Kubernetes ensures that the platform remains stable, reliable, and highly available even under heavy load. At any point in time, there can be hundreds of data scientists, ML engineers and developers experimenting and developing on the Chimera Sandbox platform.

Figure 1. Chimera Sandbox Platform.
Figure 2. UI for Starting Chimera Sandbox.

Best of both worlds

Chimera Sandbox is suitable both for new users who want to explore and experiment with ML solutions and for advanced users who want full control over the Notebook services they run. Users can launch Notebook services using default Docker images provided by the Chimera Sandbox platform. These images come pre-loaded with popular data science and ML libraries and various Grab internal system integrations. Chimera also provides basic Docker images that users can use as a base to build their own customised Notebook service Docker images. Once the images are built, users can configure their Notebook services to use their custom Docker images. This ensures their Notebook environment is exactly the way they want it to be.

Figure 3. Users are able to customise their Notebook service with additional packages.

Real-time collaboration

The Chimera Sandbox platform also features a real-time collaboration feature. This feature fosters a collaborative environment where users can exchange ideas and work together on projects.

CPU and GPU choices

Chimera Sandbox offers a wide variety of CPU and GPU choices to cater to specific needs, whether it is a CPU, memory, or GPU intensive experimentation. This flexibility allows users to choose the most suitable computational resources for their tasks, ensuring optimal performance and efficiency.

Deep integration with Spark

The platform is deeply integrated with internal Spark engines, enabling users to experiment with building extract, transform, and load (ETL) jobs on data from Grab’s data lake. Integrated helpers, such as the SparkConnect kernel and the %%spark_sql magic cell, provide a faster developer experience by letting users execute Spark SQL queries without writing additional code to start a Spark session and run the query.

Figure 4. %%spark_sql magic cell enables users to quickly explore data with Spark.

In addition to Magic Cell, the Chimera Sandbox offers advanced Spark functionalities. Users can write PySpark code using pre-configured and configurable Spark clients in the runtime environment. The underlying computation engine leverages Grab’s custom Spark-on-Kubernetes operator, enabling support for large-scale Spark workloads. This high-code capability complements the low-code Magic Cell feature, providing users with a versatile data processing environment.

Chimera Sandbox features an AI Gallery to guide and accelerate users who are starting to experiment with ML solutions or build GenAI-powered applications. This is especially useful for new or novice users who are keen to explore what they can do on the Chimera Sandbox platform.

With Chimera Sandbox, users are not just presented with a bare-bones compute solution; they are provided with ways to perform ML tasks right from Chimera Sandbox Notebooks. This approach saves users the hassle of piecing together examples from the public internet, which may not work on the platform. The ready-to-run, comprehensive notebooks in the AI Gallery assure users that they can run end-to-end examples without a hitch, and users can extend these examples for their specific needs. These tutorials and notebooks also showcase the platform’s capabilities and integrations in an interactive manner, rather than requiring users to refer to separate documentation.

Lastly, the AI Gallery encourages contributions from other Grabbers, fostering a collaborative environment. Users who are enthusiastic about creating educational contents on Chimera Sandbox can effectively share their work with other Grabbers.

Figure 5. Including AI Gallery in user specified sandbox images.

Integration with various LLM services

Notebook users on Chimera Sandbox can easily tap into a plethora of LLMs, both open source and proprietary, without any additional setup, via our AI Gateway. The platform takes care of the access mechanisms and endpoints for the various LLM services so that users can use their favourite libraries to create LLM-powered applications and conduct experimentations. This seamless integration enables users to focus on their GenAI-powered ideas rather than worry about the underlying logistics and technicalities of using different LLMs.

More than a notebook service

While Notebook is the most popular service on the platform, Chimera Sandbox offers much more than just notebook capabilities. It serves as a comprehensive namespace workspace equipped with a suite of ML/AI tools. Alongside notebooks, users can access essential ML tools such as Optuna for hyperparameter tuning, MLflow for experiment tracking, and other tools including Zeppelin, RStudio, Spark history, Polynote, and LabelStudio. All these services use a shared storage system, creating a tailored workspace for ML and AI tasks.

Figure 6. A Sandbox namespace with its out-of-the-box services.

Additionally, the Sandbox framework allows for the seamless integration of more services into personal workspaces. This high level of flexibility significantly enhances the capabilities of the Sandbox platform, making it an ideal environment for diverse ML and AI applications.

Cost attribution

For a multi-tenanted platform such as Chimera Sandbox, it is crucial to provide users with information on how much they have spent on their experimentations. Cost showback and chargeback capabilities are of utmost importance for a platform on which users can launch Notebook services that use accelerated instances with GPUs. The platform provides cost attribution to individual users, so each user knows exactly how much they are spending on their experimentations and can make budget-conscious decisions. This transparency encourages responsible usage of resources and helps users manage their budgets effectively.

Growth and future plans

In essence, Chimera Sandbox is more than just a tool; it’s a catalyst for innovation and growth, empowering Grabbers to explore the frontiers of ML and AI. By providing an inclusive, flexible, and powerful platform, Chimera Sandbox is helping shape the future of Grab, making every Grabber not just ready but excited to contribute to the AI-driven transformation of our products and services.

In July and August of this year, teams were given the opportunity to learn about and experiment intensively with AI. Since then, we have observed hockey-stick growth on the Chimera Sandbox platform, with teams across Grab experimenting with and building different GenAI-powered applications.

Figure 7. Chimera Sandbox daily active users.

Our future plans include mechanisms for better notebook discovery, collaboration and usability, and the ability to enable users to schedule their notebooks right from Chimera Sandbox. These enhancements aim to improve the user experience and make the platform even more versatile and powerful.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 700 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

How we improved translation experience with cost efficiency

Post Syndicated from Grab Tech original https://engineering.grab.com/improved-translation-experience-with-cost-efficiency

Introduction

As COVID restrictions were fully lifted in 2023, the number of tourists grew dramatically. People began to explore the world again, frequently using the Grab app to make bookings outside of their home country. However, we noticed that communication posed a challenge for some users. Despite our efforts to integrate an auto-translation feature in the booking chat, we received feedback about occasional missed or inaccurate translations. You can refer to this blog for a better understanding of Grab’s chat system.

An example of a bad translation. The correct translation is: ‘ok sir’.

In an effort to enhance the user experience for travellers using the Grab app, we formed an engineering squad to tackle this problem. The objectives are as follows:

  • Ensure translation is provided when it’s needed.
  • Improve the quality of translation.
  • Maintain the cost of this service within a reasonable range.

Ensure translation is provided when it’s needed

Originally, we relied on users’ device language settings to determine if translation is needed. For example, if both the passenger and driver’s language setting is set to English, translation is not needed. Interestingly, it turned out that the device language setting did not reliably indicate the language in which a user would send their messages. There were numerous cases where despite having their device language set to English, drivers sent messages in another language.

Therefore, we needed to detect the language of user messages on the fly to make sure we trigger translation when it’s needed.

Language detection

Simple as it may seem, language detection is not that straightforward a task. We were unable to find an open-source language detector library that covered all Southeast Asian languages. We looked for Golang libraries as our service was written in Golang. The closest we could find were the following:

  • Whatlang: unable to detect Malay
  • Lingua: unable to detect Burmese and Khmer

We decided to choose Lingua over Whatlang as the base detector due to the following factors:

  • Overall higher accuracy.
  • Capability to provide detection confidence level.
  • We have more users using Malay than those using Burmese or Khmer.

When a translation request comes in, our first step is to use Lingua for language detection. If the detection confidence level falls below a predefined threshold, we fall back to calling the third-party translation service, which can detect all Southeast Asian languages.
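
The flow looks roughly like the sketch below. The Detector interface is a simplification wrapping the Lingua library, and the threshold value is illustrative rather than our production setting.

// Sketch of the detection flow: try Lingua first, fall back to the
// third-party translate API only when confidence is too low.
// Detector is a simplified wrapper around the Lingua library.
type Detector interface {
  // Detect returns the most likely language and a confidence in [0, 1].
  Detect(text string) (lang string, confidence float64)
}

const confidenceThreshold = 0.80 // illustrative, not our production threshold

func detectLanguage(d Detector, detectViaTranslateAPI func(text string) string, text string) string {
  lang, conf := d.Detect(text)
  if conf >= confidenceThreshold {
    return lang
  }
  // Low confidence: let the third-party translate API, which covers all
  // Southeast Asian languages, determine the language as part of translation.
  return detectViaTranslateAPI(text)
}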

You may ask why we don’t simply use the third-party service in the first place. There are two reasons:

  • The third-party service only has a translate API that also does language detection, but it does not provide a standalone language detection API.
  • Using the translate API is costly, so we need to avoid calling it when it’s unnecessary. We will cover more on this in a later section.

Another challenge we encountered is the difficulty of distinguishing between Malay and Indonesian due to their strong similarities and shared vocabulary. The same text might convey different meanings in these two languages, which the third-party translation service struggles to detect and translate accurately.

Differentiating Malay and Indonesian is a tough problem in general. However, in our case, the detection has a very specific context, and we can make use of the context to enhance our detection accuracy.

Making use of translation context

All our translations are for the messages sent in the context of a booking or order, predominantly between passenger and driver. There are two simple facts that can aid in our language detection:

  • Booking/order happens in one single country.
  • Drivers are almost always local to that country.

So, for a booking that happens in an Indonesian city, if the driver’s message is detected as Malay, it’s highly likely that the message is actually in Bahasa Indonesia.
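
A small sketch of this context adjustment is shown below; the country and language codes, and the single rule, are illustrative.

// Sketch of using booking context to disambiguate Malay vs. Indonesian.
func adjustForBookingContext(detectedLang, bookingCountry string, senderIsDriver bool) string {
  // A booking happens in one country and drivers are almost always local,
  // so a "Malay" detection for a driver in an Indonesian booking is most
  // likely Bahasa Indonesia.
  if detectedLang == "ms" && bookingCountry == "ID" && senderIsDriver {
    return "id"
  }
  return detectedLang
}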

Improve quality of translation

Initially, we were entirely dependent on a third-party service for translating our chat messages. While overall powerful, the third-party service is not perfect, and it does generate weird translations from time to time.

An example of a weird translation from a third-party service recorded on 19 Dec 2023.

We then realised that we might be able to build an in-house translation model that could translate chat messages better than the third-party service, for the following reasons:

  • The scope of our chat content is highly specific. All the chats are related to bookings or orders. There would not be conversations about life or work in the chat. Maybe a small Machine Learning (ML) model would suffice to do the job.
  • The third-party service is a general translation service. It doesn’t know the context of our messages. We, however, know the whole context. Having the right context gives us a great edge on generating the right translation.

Training steps

To create our own translation model, we took the following steps:

  • Performed topic modelling on Grab chat conversations.
  • Worked with the localisation team to create a benchmark set of translations.
  • Measured existing translation solutions against the benchmark.
  • Used an open source Large Language Model (LLM) to produce synthetic training data.
  • Used the synthetic data to train our lightweight translation model.

Topic modelling

In this step, our aim was to generate a dataset which is both representative of the chat messages sent by our users and diverse enough to capture all of the nuances of the conversations. To achieve this, we took a stratified sampling approach. This involved a random sample of past chat conversation messages stratified by various topics to ensure a comprehensive and balanced representation.

Developing a benchmark

For this step we engaged Grab’s localisation team to create a benchmark for translations. The intention behind this step wasn’t to create enough translation examples to fully train or even finetune a model, but rather, it was to act as a benchmark for translation quality, and also as a set of few-shot learning examples for when we generate our synthetic data.

This second point was critical! Although LLMs can generate good quality translations, they are highly susceptible to their training examples. By using a set of handcrafted translation examples, we hoped to teach the model the exact style, level of formality, and correct tone for the context in which we planned to deploy the final model.

Benchmarking

From a theoretical perspective there are two ways that one can measure the performance of a machine translation system. The first is through the computation of some sort of translation quality score such as a BLEU or CHRF++ score. The second method is via subjective evaluation. For example, you could give each translation a score from 1 to 5 or pit two translations against each other and ask someone to assess which they prefer.

Both methods have their relative strengths and weaknesses. The advantage of a subjective method is that it corresponds better with what we want, a high quality translation experience for our users. The disadvantage of this method is that it is quite laborious. The opposite is true for the computed translation quality scores, that is to say that they correspond less well to a human’s subjective experience of our translation quality, but that they are easier and faster to compute.

To overcome the inherent limitations of each method, we decided to do the following:

  1. Set a benchmark score for the translation quality of various translation services using a CHRF++ score.
  2. Train our model until its CHRF++ score is significantly better than the benchmark score.
  3. Perform a manual A/B test between the newly trained model and the existing translation service.

Synthetic data generation

To generate the training data needed to create our model, we had to rely on an open source LLM to generate the synthetic translation data. For this task, we spent considerable effort looking for a model which had both a large enough parameter count to ensure high quality outputs, but also a model which had the correct tokenizer to handle the diverse sets of languages which Grab’s customers speak. This is particularly important for languages which use non-standard character sets such as Vietnamese and Thai. We settled on using a public model from Hugging Face for this task.

We then used a subset of the previously mentioned benchmark translations to input as few-shot learning examples to our prompt. After many rounds of iteration, we were able to generate translations which were superior to the benchmark CHRF++ scores which we had attained in the previous section.

Model fine tuning

We now had one last step before we had something that was production ready! Although we had successfully engineered a prompt capable of generating high quality translations from the public Hugging Face model, there was no way we’d be able to deploy such a model. The model was far too big for us to deploy it in a cost efficient manner and within an acceptable latency. Our solution to this was to fine-tune a smaller bespoke model using the synthetic training data which was derived from the larger model.

These models were language specific (e.g. English to Indonesian) and built solely for the purpose of language translation. They are 99% smaller than the public model. With approximately 10 million synthetic training examples, we were able to achieve performance that was 98% as effective as our larger model.

We deployed our model and ran several A/B tests with it. Our model performed pretty well overall, but we noticed a critical problem: sometimes, numbers got mutated in the translation. These numbers can be part of an address, phone number, price etc. Showing the wrong number in a translation can cause great confusion to the users. Unfortunately, an ML model’s output can never be fully controlled; therefore, we added an additional layer of programmatic check to mitigate this issue.

Post-translation quality check

Our goal is to ensure non-translatable content such as numbers, special symbols, and emojis in the original message doesn’t get mutated in the translation produced by our in-house model. We extract all the non-translatable content from the original message, count the occurrences of each, and then try to match the same in the translation. If it fails to match, we discard the in-house translation and fall back to using the third-party translation service.
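
Below is a minimal sketch of this check, covering only numeric tokens; the real check also handles special symbols and emojis.

// Sketch of the post-translation check: numbers in the original message must
// appear in the translation with the same counts, or we fall back to the
// third-party translation.
var nonTranslatable = regexp.MustCompile(`\d+`) // extend with symbol/emoji patterns as needed

func translationPreservesTokens(original, translated string) bool {
  want := tokenCounts(nonTranslatable.FindAllString(original, -1))
  got := tokenCounts(nonTranslatable.FindAllString(translated, -1))
  if len(want) != len(got) {
    return false
  }
  for token, count := range want {
    if got[token] != count {
      return false
    }
  }
  return true
}

func tokenCounts(tokens []string) map[string]int {
  counts := map[string]int{}
  for _, t := range tokens {
    counts[t]++
  }
  return counts
}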

Keep cost low

At Grab, we try to be as cost efficient as possible in all aspects. In the case of translation, we tried to minimise cost by avoiding unnecessary on-the-fly translations.

As you would have guessed, the first thing we did was to implement caching. A cache layer is placed before both the in-house translation model and the third-party translation. We try to serve translation from the cache first before hitting the underlying translation service. However, given that translation requests are in free text and can be quite dynamic, the impact of caching is limited. There’s more we need to do.
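
Before moving on, here is a rough sketch of the cache layer just described: it sits in front of both backends and is keyed on the source text plus the language pair. The Cache interface and key format are assumptions for illustration.

// Sketch of the cache layer in front of both translation backends.
// Cache and the key format are illustrative assumptions.
type Cache interface {
  Get(key string) (string, bool)
  Set(key, value string)
}

func translateWithCache(cache Cache, translate func(text string) (string, error),
  text, sourceLang, targetLang string) (string, error) {
  key := sourceLang + "|" + targetLang + "|" + text
  if cached, ok := cache.Get(key); ok {
    return cached, nil
  }
  out, err := translate(text)
  if err != nil {
    return "", err
  }
  cache.Set(key, out)
  return out, nil
}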

For context, in a booking chat, other than the users, Grab’s internal services can also send messages to the chat room. These messages are called system messages. For example, our food service always sends a message with information on the food order when an order is confirmed.

System messages are all fairly static in nature; however, we saw a very high translation cost attributed to them. Taking a deeper look, we noticed the following:

  • Many system messages were not sent in the recipient’s language, thus requiring on-the-fly translation.
  • Many system messages, though having the same static structure, contain quite a few variants such as passenger’s name and food order item name. This makes it challenging to utilise our translation cache effectively as each message is different.

Since all system messages are manually prepared, we should be able to get them all manually translated into all the required languages, and avoid on-the-fly translations altogether.

Therefore, we launched an internal campaign, mandating all internal services that send system messages to chat rooms to get manual translations prepared, and pass in the translated contents. This alone helped us save roughly US$255K a year!

Next steps

At Grab, we firmly believe that our proprietary in-house translation models are not only more cost-effective but cater more accurately to our unique use cases compared to third-party services. We will focus on expanding these models to more languages and countries across our operating regions.

Additionally, we are exploring opportunities to apply learnings of our chat translations to other Grab content. This strategy aims to guarantee a seamless language experience for our rapidly expanding user base, especially travellers. We are enthusiastically looking forward to the opportunities this journey brings!

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

GitHub Availability Report: June 2024

Post Syndicated from Jakub Oleksy original https://github.blog/2024-07-12-github-availability-report-june-2024/

In June, we experienced two incidents that resulted in degraded performance across GitHub services.

June 05 17:05 UTC (lasting 142 minutes)

On June 5, between 17:05 UTC and 19:27 UTC, the GitHub Issues service was degraded. During that time, events related to projects were not displayed on issue timelines. These events indicate when an issue was added to or removed from a project and when their status changed within a project. A misconfiguration of the service backing these events prevented the data from being loaded.

We determined the root cause to be a scheduled secret rotation that resulted in one of the configured services using old, expired secrets. Specifically, as part of our continual improvement efforts, we had an initiative to clean up, streamline, and simplify our service configurations for improved automation. A bug in the implementation led to a misconfiguration that caused the degradation.

We mitigated the incident by remediating the service configuration and we believe the simplified configuration will help avoid similar incidents in the future.

June 27 20:39 UTC (lasting 58 minutes)

On June 27, between 20:39 UTC and 21:37 UTC, the GitHub Migration service saw all in-progress migrations fail. Once the increased failures were detected, we paused new migrations so they could resume when the issue was mitigated. This resulted in longer migration times, but prevented further failures.

We attributed the root cause of this incident to an invalid infrastructure credential that required manual intervention to remediate.

Once the cause was identified, our first responders mitigated the incident, at which point we unpaused the queued migrations and continued processing them with the expected level of success.

To prevent recurrence of similar incidents in the future, we are mitigating specific gaps in our monitoring and alerting for infrastructure credentials.


Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.

The post GitHub Availability Report: June 2024 appeared first on The GitHub Blog.

Exploring the challenges in creating an accessible sortable list (drag-and-drop)

Post Syndicated from Kendall Gassner original https://github.blog/engineering/user-experience/exploring-the-challenges-in-creating-an-accessible-sortable-list-drag-and-drop/

Drag-and-drop is a highly interactive and visual interface. We often use drag-and-drop to perform tasks like uploading files, reordering browser bookmarks, or even moving a card in solitaire. It can be hard to imagine completing most of these tasks without a mouse and even harder without using a screen or visual aid. This is why the Accessibility team at GitHub considers drag-and-drop a “high-risk pattern,” often leading to accessibility barriers and a lack of effective solutions in the industry.

Recently, our team worked to develop a solution for a more accessible sortable list, which we refer to as ‘one-dimensional drag-and-drop.’ In our first step toward making drag-and-drop more accessible, we scoped our efforts to explore moving items along a single axis.

An example of a sortable to-do list. This list has 7 items, all of which are in a single column. One of the items is hovering over the list to indicate it is being dragged to another position.

Based on our findings, here are some of the challenges we faced and how we solved them:

Challenge: Screen readers use arrow keys to navigate through content

One of the very first challenges we faced involved setting up an interaction model for moving an item through keyboard navigation. We chose to use arrow keys as we wanted keyboard operation to feel natural for a visual keyboard user. However, this choice posed a problem for users who relied on screen readers to navigate throughout a webpage.

To note: Arrow keys are commonly used by screen readers to help users navigate through content, like reading text or navigating between cells in a table. Consequently, when we tested drag-and-drop with screen reader users, the users were unable to use the arrow keys to move the item as intended. The arrow keys ignored drag-and-drop actions and performed typical screen reader navigation instead.

In order to override these key bindings we used role='application'.

Take note: I cannot talk about using role='application' without also providing a warning. role='application' should almost never be used. It is important to use role='application' sparingly and to scope it to the smallest possible element. The role='application' attribute alters how a screen reader operates, making it treat the element and its contents as a single application.

Considering the mentioned caution, we applied the role to the drag-and-drop trigger to restrict the scope of the DOM impacted by role='application'. Additionally, we exclusively added role='application' to the DOM when a user activates drag-and-drop. We remove role='application' when the user completes or cancels out of drag-and-drop. By employing role='application', we are able to override or reassign the screen reader’s arrow commands to correspond with our drag-and-drop commands.

Remember, even if implemented thoughtfully, it’s crucial to rely on feedback from daily screen reader users to ensure you have implemented role='application' correctly. Their feedback and experience should be the determining factor in assessing whether or not using role='application' is truly accessible and necessary.

Challenge: NVDA simulates mouse events when a user presses Enter or Space

Another challenge we faced was determining whether a mouse or keyboard event was triggered when an NVDA screen reader user activated drag-and-drop.

When a user uses a mouse to drag-and-drop items, the expectation is that releasing the mouse button (triggering an onMouseUp event) will finalize the drag-and-drop operation. Whereas, when operating drag-and-drop via the keyboard, Enter or Escape is used to finalize the drag-and-drop operation.

To note: When a user activates a button with the Enter or Space key while using NVDA, the screen reader simulates an onMouseDown and onMouseUp event rather than an onKeyDown event.

Because most NVDA users rely on keyboard operations to operate drag-and-drop instead of a mouse, we had to find a way to make sure our code ignored the onMouseUp event triggered by an NVDA Enter or Space key press.

We accomplished this by using two HTML elements to separate keyboard and mouse functionality:
1. A Primer Icon button to handle keyboard interactions.
2. An invisible overlay to capture mouse interactions.

A screenshot of the drag-and-drop trigger. Added to the screenshot are two informational text boxes. The first explains the purpose of the invisible overlay, stating: “An empty <div> overlays the iconButton to handle mouse events. This is neither focusable nor visible to keyboard users. Associated event handlers: onMouseDown”. The second text box explains the purpose of the visible iconButton, stating: “The iconButton can only be activated by keyboard users. This is not clickable for mouse users. Associated event handlers: onMouseDown, onKeyDown”.

Challenge: Announcing movements in rapid succession

Once we had our keyboard operations working, our next big obstacle was announcing movements to users, in particular announcing rapid movements of a selected item.

To prepare for external user testing we tested our announcements ourselves with screen readers. Because we are not native screen reader users, we moved items slowly throughout the page and the announcements sounded great. However, users typically move items rapidly to complete tasks quickly, so our screen reader testing did not reflect how users would actually interact with our feature.

To note: It was not until user testing that we discovered that when a user moved an item in rapid succession the aria-live announcements would lag or sometimes announce movements that were no longer relevant. This would disorient users, leading to confusion about the item’s current position.

To solve this problem, we added a small debounce to our move announcements. We tested various debounce speeds with users and landed on 100ms to ensure that we did not slow down a user’s ability to interact with drag-and-drop. Additionally, we used aria-live='assertive' to ensure that stale positional announcements are interrupted by the new positional announcement.

// Debounce move announcements by 100ms so rapid, successive moves don't
// queue up stale aria-live messages; assertive announcements interrupt
// any announcement that is still in progress.
export const debounceAnnouncement = debounce((announcement: string) => {
  announce(announcement, {assertive: true})
}, 100)

Take note: aria-live='assertive' is reserved for time-sensitive or critical notifications. aria-live='assertive' interrupts any ongoing announcements the screen reader is making and can be disruptive for users. Use aria-live='assertive' sparingly and test with screen reader users to ensure your feature employs it correctly.

Challenge: First-time user experience of a new pattern

To note: During user testing we discovered that some of our users found it difficult to operate drag-and-drop with a keyboard. Oftentimes, drag-and-drop is not keyboard accessible or screen reader accessible. As a result, users with disabilities might not have had the opportunity to use drag-and-drop before, making the operations unfamiliar to them.

This problem was particularly challenging to solve because we wanted to make sure our instruction set was easy to find but not a constant distraction for users who frequently use the drag-and-drop functionality.

To address this, we added a dialog with a set of instructions that would open when a user activated drag-and-drop via the keyboard. This dialog has a “don’t show this again” checkbox preference for users who feel like they have a good grasp on the interaction and no longer want a reminder.

A screenshot of the instruction dialog. The dialog contains a table with three rows. Each row contains two columns: the first column is the movement and the second column is the keyboard command. At the bottom is the “don’t show this again” checkbox.

Challenge: Moving items with voice control in a scrollable container

One of the final big challenges we faced was operating drag-and-drop using voice control assistive technology. We found that using voice control to drag-and-drop items in a non-scrollable list was straightforward, but when the list became scrollable it was nearly impossible to move an item from the top of the list to the bottom.

To note: Voice control displays an overlay of numbers next to interactive items on the screen when requested. These numbers are references to items a user can interact with. For example, if a user says, “Click item 5” and item 5 is a button on a web page the assistive technology will then click the button. These number references dynamically update as the user scrolls through the webpage. Because references are updated as a user scrolls, scrolling the page while dragging an item via a numerical reference will cause the item to be dropped.

We found it critical to support two modes of operation in order to ensure that voice control users are able to sort items in a list. The first mode of operation, traditional drag-and-drop, has been discussed previously. The second mode is a move dialog.

The move dialog is a form that allows users to move items in a list without having to use traditional drag-and-drop:

A screenshot of the Move Dialog. At the top of the dialog is a span specifying  the item being moved. Followed by a form to move items then a preview of the movement.

The form includes two input fields: action and position.
The action field specifies the movement or direction of the operation, for example, “move item before” or “move item after.” The position field specifies where the item should be moved.

Below the input fields we show a preview of where an item will be moved based on the input values. This preview is announced using aria-live and provides a way for users to preview their movements before finalizing them.

During our testing we found that the move dialog was the preferred mode of operation for several of our users who do not use voice control assistive technology. Our users felt more confident when using the move dialog to move items and we were delighted to find that our accessibility feature provided unexpected benefits to a wide range of users.

In summary, creating an accessible drag-and-drop pattern is challenging, and it is important to leverage feedback and consider a diverse range of user needs. If you are working to create an accessible drag-and-drop experience, we hope our journey helps you understand the nuances and pitfalls of this complex pattern.

A big thanks to my colleagues, Alexis Lucio, Matt Pence, Hussam Ghazzi, and Aarya BC, for their hard work in making drag-and-drop more accessible. Your dedication and expertise have been invaluable. I am excited about the progress we made and what we will achieve in the future!

Lastly, if you are interested in testing the future of drag-and-drop with us, consider joining our Customer Research Panel.

The post Exploring the challenges in creating an accessible sortable list (drag-and-drop) appeared first on The GitHub Blog.

Unlocking the power of unstructured data with RAG

Post Syndicated from Nicole Choi original https://github.blog/2024-06-13-unlocking-the-power-of-unstructured-data-with-rag/


Whether they’re building a new product or improving a process or feature, developers and IT leaders need data and insights to make informed decisions.

When it comes to software development, this data exists in two ways: unstructured and structured. While structured data follows a specific and predefined format, unstructured data—like email, an audio or visual file, code comment, or commit message—doesn’t. This makes unstructured data hard to organize and interpret, which means teams can miss out on potentially valuable insights.

To make the most of their unstructured data, development teams are turning to retrieval-augmented generation, or RAG, a method for customizing large language models (LLMs). They can use RAG to keep LLMs up to date with organizational knowledge and the latest information available on the web. They can also use RAG and LLMs to surface and extract insights from unstructured data.

GitHub data scientists, Pam Moriarty and Jessica Guo, explain unstructured data’s unique value in software development, and how developers and organizations can use RAG to create greater efficiency and value in the development process.

Unstructured data in software development

When it comes to software development, unstructured data includes source code and the context surrounding it, as these sources of information don’t follow a predefined format.

Here are some examples of unstructured data on GitHub:

  • README files describe in text the purpose behind project source code, and include instructions for source code use, how to contribute, and other details that developers decide is important to include. While they’re usually written in Markdown, README files don’t follow a predefined structure.
  • Code files are more orderly than README files in that they follow the syntax of a programming language. But not all code files have the exact same fields nor are they all written in the same format. Additionally, some parts of the file, like coding logic and variable names, are decided by individual developers.
  • Package documentation explains how the software works and how to use it. Documentation, written in natural language, can include installation instructions, troubleshooting tips, a description of the package’s API, and a list of any dependencies required to use the package. It can also include code snippets that highlight the package’s features.
  • Code comments explain the function behind certain code blocks in a code file. They’re text comments written in natural language and make the source code easier to understand by other developers.
  • Wiki pages, while not limited to unstructured data, can contain helpful text documentation about installation instructions, API references, and other information.
  • Commit messages describe in natural language text the changes a developer made to a codebase and why.
  • Issue and pull request descriptions are written in natural language and in a text field. They can contain any kind of information a developer chooses to include about a bug, feature request, or general task in a project.
  • Discussions contain a wealth and variety of information, from developer and end-user feedback to open-ended conversations about a topic. As long as a repository enables discussions, anyone with a GitHub account can start a discussion.
  • Review comments are where developers can discuss changes before they’re merged into a codebase. Consequently, they contain information in natural language about code quality, context behind certain decisions, and concerns about potential bugs.

The value of unstructured data

The same features that make unstructured data valuable also make it hard to analyze.

Unstructured data lacks inherent organization, as it often consists of free-form text, images, or multimedia content.

“Without clear boundaries or predefined formats, extracting meaningful information from unstructured data becomes very challenging,” Guo says.

But LLMs can help to identify complex patterns in unstructured data—especially text. Though not all unstructured data is text, a lot of text is unstructured. And LLMs can help you to analyze it.

“When dealing with ambiguous, semi-structured or unstructured data, LLMs dramatically excel at identifying patterns, sentiments, entities, and topics within text data and uncover valuable insights that might otherwise remain hidden,” Guo explains.

Here are a few reasons why developers and IT leaders might consider using RAG-powered LLMs to leverage unstructured data:

  • Surface organizational best practices and establish consistency. Through RAG, an LLM can receive a prompt with additional context pulled from an organization’s repositories and documents. So, instead of sifting through and piece-mealing documents, developers can quickly receive answers from an LLM that align with their organization’s knowledge and best practices.
  • Accelerate and deepen understanding of an existing codebase—including its conventions, functions, common issues, and bugs. Understanding and familiarizing yourself with code written by another developer is a persisting challenge for several reasons, including but not limited to: code complexity, use of different coding styles, a lack of documentation, use of legacy code or deprecated libraries and APIs, and the buildup of technical debt from quick fixes and workarounds.

RAG can help to mediate these pain points by enabling developers to ask and receive answers in natural language about a specific codebase. It can also guide developers to relevant documentation or existing solutions.

Accelerated and deepened understanding of a codebase enables junior developers to contribute their first pull request with less onboarding time and senior developers to mitigate live site incidents, even when they’re unfamiliar with the service that’s failing. It also means that legacy code suffering from “code rot” and natural aging can be more quickly modernized and easily maintained.

Unstructured data doesn’t just help to improve development processes. It can also improve product decisions by surfacing user pain points.

Moriarty says, “Structured data might show a user’s decision to upgrade or renew a subscription, or how frequently they use a product or not. While those decisions represent the user’s attitude and feelings toward the product, it’s not a complete representation. Unstructured data allows for more nuanced and qualitative feedback, making for a more complete picture.”

A lot of information and feedback is shared during informal discussions, whether those discussions happen on a call, over email, on social platforms, or in an instant message. From these discussions, decision makers and builders can find helpful feedback to improve a service or product, and understand general public and user sentiment.

What about structured data?

Contrary to unstructured data, structured data—like relational databases, Protobuf files, and configuration files—follows a specific and predefined format.

We’re not saying unstructured data is more valuable than structured. But the processes for analyzing structured data are more straightforward: you can use SQL functions to modify the data and traditional statistical methods to understand the relationship between different variables.

That’s not to say AI isn’t used for structured data analysis. “There’s a reason that machine learning, given its predictive power, is and continues to be widespread across industries that use data,” according to Moriarty.

However, “Structured data is often numeric, and numbers are simply easier to analyze for patterns than words are,” Moriarty says. Not to mention that methods for analyzing structured data have been around longer than those for analyzing unstructured data: “A longer history with more focus just means there are more established approaches, and more people are familiar with it,” she explains.

That’s why the demand to enhance structured data might seem less urgent, according to Guo. “The potential for transformative impact is significantly greater when applied to unstructured data,” she says.

How does RAG extract value from unstructured data?

With RAG, an LLM can use data sources beyond its training data to generate an output.

RAG is a prompting method that uses retrieval—a process for searching for and accessing information—to add more context to a prompt that generates an LLM response.

This method is designed to improve the quality and relevance of an LLM’s outputs. Additional data sources include a vector database, traditional database, or search engine. So, developers who use an enterprise AI tool equipped with RAG can receive AI outputs customized to their organization’s best practices and knowledge, and proprietary data.

We break down these data sources in our RAG explainer, but here’s a quick summary:

  • Vector databases. While you code in your IDE, algorithms create embeddings for your code snippets, which are stored in a vector database. An AI coding tool can search that database to find snippets from across your codebase that are similar to the code you’re currently writing and generate a suggestion.

And when you’re engaging with GitHub Copilot Chat on GitHub.com or in the IDE, your query or code is transformed into an embedding. Our retrieval service then fetches relevant embeddings from the vector database for the repository you’ve indexed. These embeddings are turned back into text and code when they’re added to the prompt as additional context for the LLM. This entire process leverages unstructured data, even though the retrieval system uses embeddings internally.

  • General text search. When developers engage with GitHub Copilot Chat under a GitHub Copilot Enterprise plan, they can index repositories—specifically code and documentation. So, when a developer on GitHub.com or in the IDE asks GitHub Copilot Chat a question about an indexed repository, the AI coding tool can retrieve data from all of those indexed, unstructured data sources. And on GitHub.com, GitHub Copilot Chat can tap into a collection of unstructured data in Markdown files from across repositories, which we call knowledge bases.

Learn about GitHub Copilot Enterprise features >

But wait, why is Markdown considered unstructured data? Though you can use Markdown to format a file, the file itself can contain essentially any kind of data. Think about it this way: how would you put the contents of a Markdown file in a table?

  • External or internal search engine. The retrieval method searches and pulls information from a wide range of sources from the public web or your internal platforms and websites. That information is used for RAG, which means the AI model now has data from additional files—like text, image, video, and audio—to answer your questions.

Retrieval also taps into internal search engines. So, if a developer wants to ask a question about a specific repository, they can index the repository and then send their question to GitHub Copilot Chat on GitHub.com. Retrieval uses our internal search engine to find relevant code or text from the indexed files, which are then used by RAG to prompt the LLM for a contextually relevant response.
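Whichever retrieval path is used, the overall shape is the same: embed or search the query, retrieve relevant context, add it to the prompt, and generate. The following is a minimal, illustrative sketch of that flow in Go; the embed, vectorSearch, and callLLM helpers are hypothetical stand-ins for an embedding model, a vector store, and an LLM endpoint, not GitHub Copilot's actual implementation.

package rag

import (
    "fmt"
    "strings"
)

// Document is a chunk of unstructured data (code, Markdown, docs) plus its embedding.
type Document struct {
    Text      string
    Embedding []float64
}

// Hypothetical helpers: a real system would call an embedding model, a vector
// database or search index, and an LLM API here.
func embed(text string) []float64                     { return nil }
func vectorSearch(query []float64, k int) []Document  { return nil }
func callLLM(prompt string) string                    { return "" }

// answerWithRAG augments the user's question with retrieved context before
// prompting the LLM, which is the core idea of retrieval-augmented generation.
func answerWithRAG(question string, k int) string {
    queryEmbedding := embed(question)       // turn the question into an embedding
    docs := vectorSearch(queryEmbedding, k) // fetch the k most similar chunks

    var context strings.Builder
    for _, d := range docs {
        context.WriteString(d.Text + "\n---\n") // embeddings are turned back into text/code
    }

    prompt := fmt.Sprintf(
        "Use the following repository context to answer.\n\nContext:\n%s\nQuestion: %s",
        context.String(), question)
    return callLLM(prompt)
}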

Stay smart: LLMs can do things they weren’t trained to do, so it’s important to always evaluate and verify their outputs.

Use RAG to unlock insights from unstructured data

As developers improve their productivity and write more code with AI tools like GitHub Copilot, there’ll be even more unstructured data. Not just in the code itself, but also the information used to build, contextualize, maintain, and improve that code.

That means even more data containing rich insights that organizations can surface and leverage, or let sink and disappear.

Developers and IT leaders can use RAG as a tool to help improve their productivity, produce high-quality and consistent code at greater speed, preserve and share information, and increase their understanding of existing codebases, which can help reduce onboarding time.

With a RAG-powered AI tool, developers and IT leaders can quickly discover, analyze, and evaluate a wealth of unstructured data—simply by asking a question.

A RAG reading list 📚

The post Unlocking the power of unstructured data with RAG appeared first on The GitHub Blog.

GitHub Availability Report: May 2024

Post Syndicated from Jakub Oleksy original https://github.blog/2024-06-12-github-availability-report-may-2024/

In May, we experienced one incident that resulted in significant degraded performance across GitHub services.

May 21 11:40 UTC (lasting 7 hours 26 minutes)

On May 21, various GitHub services experienced latency due to a configuration change in an upstream cloud provider. GitHub Copilot Chat experienced p50 latency of up to 2.5s and p95 latency of up to 6s, GitHub Actions was degraded with 20 to 60 minute delays for workflow run updates, and GitHub Enterprise Importer customers experienced longer migration run times due to Actions delays.

Actions users experienced their runs stuck in stale states for some time even if the underlying runner was completed successfully, and Copilot Chat users experienced delays in receiving responses to their requests. Billing related metrics for budget notifications and UI reporting were also delayed, leading to outdated billing details. No data was lost and reporting was restored after mitigation.

We determined that the issue was caused by a scheduled operating system upgrade that resulted in unintended and uneven distribution of traffic within the cluster. A short-term strategy of increasing the number of network routes between our data centers and cloud provider helped mitigate the incident.

To prevent a recurrence of this incident, we have identified and are fixing gaps in our monitoring and alerting for load thresholds to improve both detection and mitigation times.


Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.

The post GitHub Availability Report: May 2024 appeared first on The GitHub Blog.

How we improved push processing on GitHub

Post Syndicated from Will Haltom original https://github.blog/engineering/architecture-optimization/how-we-improved-push-processing-on-github/


What happens when you push to GitHub? The answer, “My repository gets my changes” or maybe, “The refs on my remote get updated” is pretty much right—and that is a really important thing that happens, but there’s a whole lot more that goes on after that. To name a few examples:

  • Pull requests are synchronized, meaning the diff and commits in your pull request reflect your newly pushed changes.
  • Push webhooks are dispatched.
  • Workflows are triggered.
  • If you push an app configuration file (like for Dependabot or GitHub Actions), the app is automatically installed on your repository.
  • GitHub Pages are published.
  • Codespaces configuration is updated.
  • And much, much more.

Those are some pretty important things, and this is just a sample of what goes on for every push. In fact, in the GitHub monolith, there are over 60 different pieces of logic owned by 20 different services that run in direct response to a push. That’s actually really cool—we should be doing a bunch of interesting things when code gets pushed to GitHub. In some sense, that’s a big part of what GitHub is, the place you push code[1] and then cool stuff happens.

The problem

What’s not so cool is that, up until recently, all of these things were the responsibility of a single, enormous background job. Whenever GitHub’s Ruby on Rails monolith was notified of a push, it enqueued a massive job called the RepositoryPushJob. This job was the home for all push processing logic, and its size and complexity led to many problems. The job triggered one thing after another in a long, sequential series of steps, kind of like this:

A flow chart from left to right. The first step is "Push". Then second step is "GitHub Rails monolith". The third step is a large block labeled "RepositoryPushJob" which contains a sequence of steps inside it. These steps are: "Apps callback", "Codespaces callback", "PRs callback", followed by a callout that there are 50+ tasks after this one. The final step is "processing task n".

There are a few things wrong with this picture. Let’s highlight some of them:

  • This job was huge, and hard to retry. The size of the RepositoryPushJob made it very difficult for different push processing tasks to be retried correctly. On a retry, all the logic of the job is repeated from the beginning, which is not always appropriate for individual tasks. For example:
    • Writing Push records to the database can be retried liberally on errors and reattempted any amount of time after the push, and will gracefully handle duplicate data.
    • Sending push webhooks, on the other hand, is much more time-sensitive and should not be reattempted too long after the push has occurred. It is also not desirable to dispatch multiples of the same webhook.
  • Most of these steps were never retried at all. The above difficulties with conflicting retry concerns ultimately led to retries of RepositoryPushJob being avoided in most cases. To prevent one step from killing the entire job, however, much of the push handling logic was wrapped in code catching any and all errors. This lack of retries led to issues where crucial pieces of push processing never occurred.
  • Tight coupling of many concerns created a huge blast radius for problems. While most of the dozens of tasks in this job rescued all errors, for historical reasons, a few pieces of work in the beginning of the job did not. This meant that all of the later steps had an implicit dependency on the initial parts of the job. As more concerns are combined within the same job, the likelihood of errors impacting the entire job increases.
    • For example, writing data to our Pushes MySQL cluster occurred in the beginning of the RepositoryPushJob. This meant that everything occurring after that had an implicit dependency on this cluster. This structure led to incidents where errors from this database cluster meant that user pull requests were not synchronized, even though pull requests have no explicit need to connect to this cluster.
  • A super long sequential process is bad for latency. It’s fine for the first few steps, but what about the things that happen last? They have to wait for every other piece of logic to run before they get a chance. In some cases, this structure led to a second or more of unnecessary latency for user-facing push tasks, including pull request synchronization.

What did we do about this?

At a high level, we took this very long sequential process and decoupled it into many isolated, parallel processes. We used the following approach:

  • We added a new Kafka topic that we publish an event to for each push.
  • We examined each of the many push processing tasks and grouped them by owning service and/or logical relationships (for example, order dependency, retry-ability).
  • For each coherent group of tasks, we placed them into a new background job with a clear owner and appropriate retry configuration.
  • Finally, we configured these jobs to be enqueued for each publish of the new Kafka event.
    • To do this, we used an internal system at GitHub that facilitates enqueueing background jobs in response to Kafka events via independent consumers.

We had to make investments in several areas to support this architecture, including:

  • Creating a reliable publisher for our Kafka event, one that would retry until broker acknowledgement.
  • Setting up a dedicated pool of job workers to handle the new job queues we’d need for this level of fan out.
  • Improving observability to ensure we could carefully monitor the flow of push events throughout this pipeline and detect any bottlenecks or problems.
  • Devising a system for consistent per-event feature flagging, to ensure that we could gradually roll out (and roll back if needed) the new system without risk of data loss or double processing of events between the old and new pipelines.

Now, things look like this:

A flow chart from left to right. The first step is "Push". The second step is "GitHub Rails monolith". The connection between the second and third step is labeled "Push event". The third step is "Kafka". The fourth step is "Kafka to job queue bridge". Then, there are 16 parallel connectors branching out from the fourth step to the next steps. These are: "AppsOnPushJob", "CodespacesOnPushJob", "PullRequestsOnPushJob", "MarketPlaceOnPushJob", "ProjectStackOnPushJob", "SecurityCenterPushJob", "IssuesOnPushJob", "PagesOnPushJob", "MaintenanceOnPushJob", "NotificationsOnPushJob", "RepositoriesOnPushJob", "ReleasesOnPushJob", "ActionsOnPushJob", "WikisOnPushJob", "SearchOnPushJob", and "Push job n".

A push triggers a Kafka event, which is fanned out via independent consumers to many isolated jobs that can process the event without worrying about any other consumers.
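The Kafka-to-job-queue bridge itself is an internal GitHub system, but the general shape of the pattern is easy to sketch. The Go sketch below is illustrative only: it assumes a hypothetical enqueueJob helper for the background-job system, uses the open source segmentio/kafka-go client, and invents a topic name. Each push-processing job runs its own consumer group over the same topic, so a failure or backlog in one job cannot affect the others.

package pushfanout

import (
    "context"
    "log"

    "github.com/segmentio/kafka-go"
)

// enqueueJob is a hypothetical stand-in for whatever background-job system
// each consumer enqueues into (one queue per push-processing job).
func enqueueJob(jobName string, payload []byte) error { return nil }

// consumePushes runs one independent consumer group for a single
// push-processing job (for example, PullRequestsOnPushJob or PagesOnPushJob).
func consumePushes(ctx context.Context, groupID, jobName string) {
    r := kafka.NewReader(kafka.ReaderConfig{
        Brokers: []string{"kafka:9092"}, // placeholder broker address
        GroupID: groupID,                // each job gets its own consumer group and offsets
        Topic:   "repository.push",      // hypothetical name for the push event topic
    })
    defer r.Close()

    for {
        msg, err := r.FetchMessage(ctx)
        if err != nil {
            return // context cancelled or reader closed
        }
        if err := enqueueJob(jobName, msg.Value); err != nil {
            log.Printf("enqueue %s failed, will retry: %v", jobName, err)
            continue // don't commit the offset; the message will be redelivered
        }
        if err := r.CommitMessages(ctx, msg); err != nil {
            log.Printf("commit failed: %v", err)
        }
    }
}

Because each consumer group tracks its own offsets, retry policy and failure handling can be tuned per job, which is exactly the isolation the old RepositoryPushJob lacked.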

Results

  • A smaller blast radius for problems.
    • This can be clearly seen from the diagram. Previously, an issue with a single step in the very long push handling process could impact everything downstream. Now, issues with one piece of push handling logic don’t have the ability to take down much else.
    • Structurally, this decreases the risk of dependencies. For example, there are around 300 million push processing operations executed per day in the new pipeline that previously implicitly depended on the Pushes MySQL cluster and now have no such dependency, simply as a product of being moved into isolated processes.
    • Decoupling also means better ownership. In splitting up these jobs, we distributed ownership of the push processing code from one owning team to 15+ more appropriate service owners. New push functionality in our monolith can be added and iterated on by the owning team without unintentional impact to other teams.
  • Pushes are processed with lower latency.
    • By running these jobs in parallel, no push processing task has to wait for others to complete. This means better latency for just about everything that happens on push.
    • For example, we can see a notable decrease in pull request sync time:

    A line chart depicting the p50 pull request sync time since head ref update over several previous months. The line hovers around 3 seconds from September 2023 through November 2023. In December 2023, it drops to around 2 seconds.

  • Improved observability.

    • By breaking things up into smaller jobs, we get a much clearer picture of what’s going on with each job. This lets us set up observability and monitoring that is much more finely scoped than anything we had before, and helps us to quickly pinpoint any problems with pushes.
  • Pushes are more reliably processed.
    • By reducing the size and complexity of the jobs that process pushes, we are able to retry more things than in the previous system. Each job can have retry configuration that’s appropriate for its own small set of concerns, without having to worry about re-executing other, unrelated logic on retry.
    • If we define a “fully processed” push as a push event for which all the desired operations are completed with no failures, the old RepositoryPushJob system fully processed about 99.897% of pushes.
    • In the worst-case estimate, the new pipeline fully processes 99.999% of pushes.

Conclusion

Pushing code to GitHub is one of the most fundamental interactions that developers have with GitHub every day. It’s important that our system handles everyone’s pushes reliably and efficiently, and over the past several months we have significantly improved the ability of our monolith to correctly and fully process pushes from our users. Through platform level investments like this one, we strive to make GitHub the home for all developers (and their many pushes!) far into the future.

Notes


  1. People push to GitHub a whole lot, as you can imagine. In the last 30 days, we’ve received around 500 million pushes from 8.5 million users. 

The post How we improved push processing on GitHub appeared first on The GitHub Blog.

Profile-guided optimisation (PGO) on Grab services

Post Syndicated from Grab Tech original https://engineering.grab.com/profile-guided-optimisation

Profile-guided optimisation (PGO) is a technique where CPU profile data for an application is collected and fed back into the next compiler build of that Go application. The compiler then uses this CPU profile data to optimise the performance of that build, currently by around 2-14% (future releases will likely improve this figure further).

High level view of how PGO works

PGO is a widely used technique that can be implemented in many programming languages. In Go, it was introduced as a preview in Go 1.20.

Enabling PGO on a service

Profile the service to get a pprof file

First, make sure that your service is built using Go 1.20 or higher, as only these versions support PGO.

Next, enable pprof in your service.
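If pprof is not enabled yet, the standard library's net/http/pprof package is the usual way to expose it; a minimal sketch is shown below (the port matches the curl command that follows).

package main

import (
    "log"
    "net/http"
    _ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
    // Serve pprof on a local-only port, separate from the service's main listener.
    go func() {
        log.Println(http.ListenAndServe("localhost:6060", nil))
    }()

    // ... the rest of your service starts up as usual ...
    select {}
}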

If it’s already enabled, you can use the following command to capture a 6-minute profile and save it to /tmp/pprof.

curl 'http://localhost:6060/debug/pprof/profile?seconds=360' -o /tmp/pprof

Enabling PGO on the service

TalariaDB is a distributed, highly available, and low-latency time-series database for Presto, open sourced by Grab.

It runs on an EKS cluster and is entirely managed by our team, so we will use it as an example here.

Since the cluster deployment relies on a Docker image, we only need to update the Docker image’s go build command to include -pgo=./talaria.PGO. The talaria.PGO file is a pprof profile collected from production services over a span of 360 seconds.

If you’re utilising a Go plugin, as we do in TalariaDB, it’s crucial to ensure that PGO is also applied to the plugin.

Here’s our Dockerfile, with the additions to support PGO.

FROM arm64v8/golang:1.21 AS builder

ARG GO111MODULE="on"
ARG GOOS="linux"
ARG GOARCH="arm64"
ENV GO111MODULE=${GO111MODULE}
ENV GOOS=${GOOS}
ENV GOARCH=${GOARCH}

RUN mkdir -p /go/src/talaria
COPY . src/talaria
#RUN cd src/talaria && go mod download  && go build && test -x talaria
RUN cd src/talaria && go mod download && go build -pgo=./talaria.PGO && test -x talaria

RUN mkdir -p /go/src/talaria-plugin
COPY ./talaria-plugin  src/talaria-plugin
RUN cd src/talaria-plugin && make plugin && test -f talaria-plugin.so
FROM arm64v8/debian:latest AS base

RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/cache/apk/*

WORKDIR /root/ 
ARG GO_BINARY=talaria
COPY  --from=builder /go/src/talaria/${GO_BINARY} .
COPY  --from=builder /go/src/talaria-plugin/talaria-plugin.so .

ADD entrypoint.sh . 
RUN mkdir /etc/talaria/ && chmod +x /root/${GO_BINARY} /root/entrypoint.sh
ENV TALARIA_RC=/etc/talaria/talaria.rc 
EXPOSE 8027
ENTRYPOINT ["/root/entrypoint.sh"]

Results of enabling PGO on one GrabX service

It’s important to mention that the pprof utilised for PGO was not captured during peak hours and was limited to a duration of 360 seconds.

The TalariaDB service has three clusters. We enabled PGO on them at the following times:

  • Cluster 0: PGO enabled and deployed on 4 Sep at 11:16 AM.
  • Cluster 1: PGO enabled and deployed on 5 Sep at 15:00.
  • Cluster 2: PGO enabled and deployed on 6 Sep at 16:00.

The size of the instances, their quantity, and all other dependencies remained unchanged.

CPU metrics on cluster

Cluster CPU usage before enabling PGO
Cluster CPU usage after enabling PGO

It’s evident that enabling PGO resulted in at least a 10% reduction in CPU usage.

Memory metrics on cluster

Memory usage of the cluster before enabling PGO
Percentage of free memory after enabling PGO

It’s clear that enabling PGO led to a reduction of at least 10GB (30%) in memory usage.

Volume metrics on cluster

Persistent volume usage on cluster before enabling PGO
Volume usage after enabling PGO

Enabling PGO resulted in a reduction of at least 7GB (38%) in volume usage. This volume is utilised for storing events that are queued for ingestion.

Ingested event count/CPU metrics on cluster

To gauge the enhancements, we employed the metric of ingested event count per CPU unit (event count / CPU). This approach accounts for the variable influx of events, which makes direct observation of performance gains difficult.

Count of ingested events on cluster after enabling PGO

Upon activating PGO, there was a noticeable increase in the ingested event count per CPU, rising from 1.1 million to 1.7 million, as depicted by the blue line in the cluster screenshot.

How we enabled PGO on a Catwalk service

We also experimented with enabling PGO on certain orchestrators in a Catwalk service. This section covers our findings.

Enabling PGO on the test-golang-orch-tfs orchestrator

Attempt 1: Take pprof for 59 seconds

  • Just 1 pod running with a constant throughput of 420 QPS.
  • Load test started with a non-PGO image at 5:39 PM SGT.
  • Take pprof for 59 seconds.
  • Image with PGO enabled deployed at 5:49 PM SGT.

Observation: CPU usage increased after enabling PGO with pprof for 59 seconds.

We suspected that taking pprof for just 59 seconds may not be sufficient to collect accurate metrics. Hence, we extended the duration to 6 minutes in our second attempt.

Attempt 2: Take pprof for 6 minutes

  • Just 1 pod running with a constant throughput of 420 QPS.
  • Deployed a non-PGO image with a custom pprof server at 6:13 PM SGT.
  • pprof taken at 6:19 PM SGT for 6 minutes.
  • Image with PGO enabled deployed at 6:29 PM SGT.

Observation: CPU usage decreased after enabling PGO with pprof for 6 minutes.

CPU usage after enabling PGO on Catwalk
Container memory utilisation after enabling PGO on Catwalk

Based on this experiment, we found that the impact of PGO is around 5%, but the effort involved in enabling it outweighs the benefit. To enable PGO on Catwalk, we would need to create Docker images for each application through CI pipelines.

Additionally, the Catwalk team would require a workaround to pass the pprof dump, which is not a straightforward task. Hence, we decided to put off the PGO application for Catwalk services.

Looking into PGO for monorepo services

From the information provided above, enabling PGO for a service requires the following support mechanisms:

  • A pprof service, which is currently facilitated through Jenkins.
  • A build process that supports PGO arguments and can attach or retrieve the pprof file.

For services that are hosted outside the monorepo and are self-managed, the effort required to experiment is minimal. However, for those within the monorepo, we will require support from the build process, which is currently unable to support this.

Conclusion/Learnings

Enabling PGO has proven to be highly beneficial for some of our services, particularly TalariaDB. By using PGO, we’ve observed a clear reduction in both CPU usage and memory usage to the tune of approximately 10% and 30% respectively. Furthermore, the volume used for storing queued ingestion events has been reduced by a significant 38%. These improvements definitely underline the benefits and potential of utilising PGO on services.

Interestingly, applying PGO resulted in an increased rate of ingested event count per CPU unit on TalariaDB, which demonstrates an improvement in the service’s efficiency.

Experiments with the Catwalk service have however shown that the effort involved to enable PGO might not always justify the improvements gained. In our case, a mere 5% improvement did not appear to be worth the work required to generate Docker images for each application via CI pipelines and create a solution to pass the pprof dump.

On the whole, it is evident that the applicability and benefits of enabling PGO can vary across different services. Factors such as application characteristics, current architecture, and available support mechanisms can influence when and where PGO optimisation is feasible and beneficial.

Moving forward, further improvements to go-build and the introduction of PGO support for monorepo services may drive greater adoption of PGO. In turn, this has the potential to deliver powerful system-wide gains that translate to faster response times, lower resource consumption, and improved user experiences. As always, the relevance and impact of adopting new technologies or techniques should be considered on a case-by-case basis against operational realities and strategic objectives.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

How GitHub reduced testing time for iOS apps with new runner features

Post Syndicated from Stephen Glass original https://github.blog/2024-06-03-how-github-reduced-testing-time-for-ios-apps-with-new-runner-features/

GitHub Actions 🤝 GitHub for iOS

The GitHub iOS and GitHub Actions macOS runner teams are integral parts of each other’s development inner loop. Each team partners on testing new runner images and hardware long before the features land in the hands of developers. GitHub Actions has been working hard at bringing the latest Mac hardware to the community. Apple silicon (M1) macOS runners are available for free in public repositories, along with larger options available for those jobs that need more performance.

The GitHub iOS team has been busy improving the user experience in the app, recently shipping features such as GitHub Copilot Chat, code search, localization for German and Korean, and improvements that make it easier to work with issues and projects. In this blog, we will discuss how the GitHub iOS team brings the app to developers around the world, the benefits of Apple silicon, and building on GitHub Actions using macOS runners.

How GitHub reduced testing time for iOS apps with new runner features

The GitHub iOS team previously used a single workflow with one job to build and test the entire codebase on GitHub Actions, which took 38 minutes to complete on the prior generation of runners. The GitHub iOS app consists of about 60 first-party modules spanning various targets, such as dynamic frameworks, static libraries, app extensions, and the GitHub app itself. These modules range from networking layers to design system components to entire features or products, and the modular structure helps us maintain the app.

Breaking down the monolith

We decided to leverage the power of Apple silicon to speed up our testing process. We switched to M1 macOS runners (the macos-14-xlarge YAML label) on GitHub Actions and split our test suite into separate jobs for each module. This way, we could build and test each module independently and get faster feedback. Some of the smallest modules completed their tests in as little as 2-3 minutes on M1 macOS runners, getting feedback to developers on their pull requests faster than ever before. This also made it easier to identify and fix failures on specific modules without waiting for a monolithic build to finish.

By using Apple silicon, we reduced our testing time by 60%, from 38 minutes to 15 minutes, and improved our productivity and efficiency. The figure below demonstrates how we broke down the monolith into small modules in order to improve our build times.

Image demonstrates the monolith build on tip with the total CI time. The Image below it demonstrates how per-module builds are crafted and the reduction in CI time with the new approach.

As each build is kicked off, GitHub Actions is behind the scenes preparing the required number of machines to execute the workflow. Each request is sent to the GitHub Actions service where it picks up a freshly reimaged virtual machine to execute the required number of jobs. The figure below shows how a request travels from our repository to the Actions Mac servers in Azure.

Image displays the relationship between the request for workflow to run and how a machine is assigned to a job. From left to right, the flow starts at GitHub.com, then the request is sent to Actions. Actions then finds the available macOS VM to execute the workflow.

With shorter build times and a scaling CI fleet, Apple silicon hosts allowed the GitHub iOS team to scale their jobs out across many shorter, faster steps, with GitHub Actions abstracting over the complexity of distributing CI jobs.

Analyzing CI performance

We further investigated the CI performance and divided each module’s CI into two separate steps, build and test, using xcodebuild’s build-for-testing and test-without-building actions. This helped us identify unit tests that ran for a long time and highlight fast unit tests that finished in seconds.

Native development and test environments

With Apple silicon powering GitHub Actions runners and the developers’ laptops, our CI now had the same architecture as local development machines. Engineers could identify patterns that took a long time to compile or tests that failed due to the architecture from CI and fix them locally with confidence.

Benefits of Apple silicon

Apple silicon improves build performance, increases reliability, and lets iOS teams test natively for all Apple platforms throughout the software development lifecycle. They can avoid problems from cross-compilation or emulation and use the latest simulators on our GitHub Actions runner image. This ensures that their apps work well with the newest versions of iOS, iPadOS, watchOS, and tvOS. Our GitHub Actions M1 macOS runners help iOS teams leverage these benefits and deliver high-quality apps to their users faster and more efficiently. Additionally, GitHub Actions offers 50 concurrent runners for enterprise accounts and five for GitHub Free and Team plans. The GitHub for iOS team takes full advantage of these concurrent runners and initiates 50 jobs for every pull request to perform modular testing on the app in parallel.

Get started building on GitHub Actions using macOS runners

GitHub-hosted macOS runners are YAML-driven, meaning they are accessed by updating the runs-on: key in your workflow file (for example, runs-on: macos-14-xlarge for an M1 runner).

The post How GitHub reduced testing time for iOS apps with new runner features appeared first on The GitHub Blog.

Adopting OpenTelemetry for our logging pipeline

Post Syndicated from Colin Douch original https://blog.cloudflare.com/adopting-opentelemetry-for-our-logging-pipeline


Cloudflare’s logging pipeline is one of the largest data pipelines that Cloudflare has, serving millions of log events per second globally, from every server we run. Recently, we undertook a project to migrate the underlying systems of our logging pipeline from syslog-ng to OpenTelemetry Collector and in this post we want to share how we managed to swap out such a significant piece of our infrastructure, why we did it, what went well, what went wrong, and how we plan to improve the pipeline even more going forward.

Background

A full breakdown of our existing infrastructure can be found in our previous post An overview of Cloudflare’s logging pipeline, but to quickly summarize here:

  • We run a syslog-ng daemon on every server, reading from the local systemd-journald journal, and a set of named pipes.
  • We forward those logs to a set of centralized “log-x receivers”, in one of our core data centers.
  • We have a dead letter queue destination in another core data center, which receives messages that could not be sent to the primary receiver, and which get mirrored across to the primary receivers when possible.

The goal of this project was to replace those syslog-ng instances as transparently as possible. That means we needed to implement all these behaviors as precisely as possible, so that we didn’t need to modify any downstream systems.

There were a few reasons for wanting to make this shift, and enduring the difficulties of overhauling such a large part of our infrastructure:

  • syslog-ng is written in C, which is not a core competency of our team. While we have made upstream contributions to the project in the past, and the experience was great, having the OpenTelemetry collector in Go allows much more of our team to be able to contribute improvements to the system.
  • Building syslog-ng against our internal Post-Quantum cryptography libraries was difficult, due to having to maintain an often brittle C build chain, whereas our engineering teams have optimized the Go build model to make this as simple as possible.
  • OpenTelemetry Collectors have built-in support for Prometheus metrics, which allows us to gather much deeper levels of telemetry data around what the collectors are doing, and surface these insights as “meta-observability” to our engineering teams.
  • We already use OpenTelemetry Collectors for some of our tracing infrastructure, so unifying onto one daemon rather than having separate collectors for all our different types of telemetry reduces the cognitive load on the team.

The Migration Process

What we needed to build

While the upstream contrib repository contains a wealth of useful components, all packaged into its own distribution, it became clear early on that we would need our own internal components. Having our own internal components would require us to build our own distribution, so one of the first things we did was turn to OCB (OpenTelemetry Collector Builder) to provide us a way to build an internal distribution of an OpenTelemetry Collector. We eventually ended up templating our OCB configuration file to automatically include all the internal components we have built, so that we didn’t have to add them manually.

In total, we built four internal components for our initial version of the collector.

cfjs1exporter

Internally, our logging pipeline uses a line format we call “cfjs1”. This format describes a JSON-encoded log with two fields: a format field that determines the type of the log, and a “wrapper” field that contains the log body (itself a structured JSON object), whose field name changes depending on the format field. Together, these two fields determine which Kafka topic our receivers will place the log message in.
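As a purely illustrative example (the real format values and wrapper field names are Cloudflare-internal and not shown in this post), the shape of a cfjs1 line can be sketched with Go types like this:

package cfjs1

// line is a hypothetical sketch of the cfjs1 shape described above: the
// top-level format field selects both the name of the wrapper field and the
// Kafka topic the receivers route the message to.
type line struct {
    Format string `json:"format"` // identifies the log type

    // Exactly one wrapper field is present, and its name depends on Format.
    // Each wrapper holds the structured JSON body of the log itself.
    // These wrapper names are invented for illustration only.
    HTTPLog map[string]any `json:"http_log,omitempty"`
    AppLog  map[string]any `json:"app_log,omitempty"`
}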

Because we didn’t want to make changes to other parts of the pipeline, we needed to support this format in our collector. To do this, we took inspiration from the contrib repository’s syslogexporter, building our cfjs1 format into it.

Ultimately, we would like to move towards using OTLP (OpenTelemetry Protocol) as our line format. This would allow us to remove our custom exporter, and utilize open standards, enabling easier migrations in the future.

fileexporter

While the upstream contrib repo does have a file exporter component, it only supports two formats: JSON and Protobuf. We needed to support two other formats, plain text and syslog, so we ended up forking the file exporter internally. Our plain text formatter simply outputs the body of the log message into a file, with newlines as a delimiter. Our syslog format outputs RFC 5424 formatted syslog messages into a file.

The other feature we implemented on our internal fork was custom permissions. The upstream file exporter is a bit of a mess, in that it actually has two different modes of operation – a standard mode, not utilizing any of the compression or rotation features, and a more advanced mode which uses those features. Crucially, if you want to use any of the rotation features, you end up using lumberjack, whereas without those features you use a more native file handling. This leads to strange issues where some features of the exporter are supported in one mode, but not the other. In the case of permissions, the community seems open to the idea in the native handling, but lumberjack seems against the idea. This dichotomy is what led us to implement it ourselves internally.

Ultimately, we would love to upstream these improvements should the community be open to them. Having support for custom marshallers (https://github.com/open-telemetry/opentelemetry-collector-contrib/issues/30331) would have made this a bit easier, however it’s not clear how that would work with OCB. Either that, or we could open source them in the Cloudflare organization, but we would love to remove the need to maintain our own fork in the future.

externaljsonprocessor

We want to set the value of an attribute/field that comes from external sources: either from an HTTP endpoint or an output from running a specific command. In syslog-ng, we have a sidecar service that generates a syslog-ng configuration to achieve this. In replacing syslog-ng with our OpenTelemetry Collector, we thought it would be easier to implement this feature as a custom component of our collector instead.

To that end, we implemented an “external JSON processor”, which is able to periodically query external data sources and add those fields to all the logs that flow through the processor. Cloudflare has many internal tools and APIs, and we use this processor to fetch data like the status of a data center, or the status of a systemd unit. This enables our engineers to have more filtering options, such as to exclude logs from data centers that are not supposed to receive customer traffic, or servers that are disabled for maintenance. Crucially, this allows us to update these values much faster than the standard three-hour cadence of other configuration updates through salt, allowing more rapid updates to these fields that may change quickly as we operate our network.
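The mechanics of such a processor are simple: refresh a small set of key/value fields from an external source on a timer, and stamp them onto every record that passes through. The Go sketch below shows that shape only; it is not the actual collector component, and the endpoint and field names are placeholders.

package externaljson

import (
    "encoding/json"
    "net/http"
    "sync"
    "time"
)

// Enricher periodically fetches a flat JSON object from an external endpoint
// and merges those fields into every log record it processes.
type Enricher struct {
    mu     sync.RWMutex
    fields map[string]string
}

// Run refreshes the cached fields on the given interval until stop is closed.
func (e *Enricher) Run(url string, interval time.Duration, stop <-chan struct{}) {
    ticker := time.NewTicker(interval)
    defer ticker.Stop()
    for {
        if resp, err := http.Get(url); err == nil {
            fresh := map[string]string{}
            if json.NewDecoder(resp.Body).Decode(&fresh) == nil {
                e.mu.Lock()
                e.fields = fresh // e.g. {"colo_status": "enabled", "unit_state": "active"}
                e.mu.Unlock()
            }
            resp.Body.Close()
        }
        select {
        case <-ticker.C:
        case <-stop:
            return
        }
    }
}

// Enrich copies the cached fields onto a single record's attribute map.
func (e *Enricher) Enrich(attrs map[string]string) {
    e.mu.RLock()
    defer e.mu.RUnlock()
    for k, v := range e.fields {
        attrs[k] = v
    }
}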

ratelimit processor

The last component we needed to implement was a replacement for the syslog-ng ratelimit filter, also contributed by us upstream. The ratelimit filter allows applying rate limits based on a specific field of a log message, dropping messages that exceed some limit (with an optional burst limit). In our case, we apply rate limits over the service field, ensuring that no individual service can degrade the log collection for any other.
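The core of a filter like this is a per-key token bucket. Below is a minimal sketch of the idea in Go, keyed on the service field and built on golang.org/x/time/rate; it illustrates the technique rather than reproducing our actual collector component.

package ratelimit

import (
    "sync"

    "golang.org/x/time/rate"
)

// Limiter drops log records whose key (here, the service field) exceeds a
// per-second limit, with an optional burst allowance for short spikes.
type Limiter struct {
    mu       sync.Mutex
    limiters map[string]*rate.Limiter
    limit    rate.Limit
    burst    int
}

func New(perSecond float64, burst int) *Limiter {
    return &Limiter{
        limiters: make(map[string]*rate.Limiter),
        limit:    rate.Limit(perSecond),
        burst:    burst,
    }
}

// Allow reports whether a record for the given key should be kept. Records for
// which it returns false are intentionally dropped, so that one noisy service
// cannot degrade log collection for every other service on the host.
func (l *Limiter) Allow(key string) bool {
    l.mu.Lock()
    lim, ok := l.limiters[key]
    if !ok {
        lim = rate.NewLimiter(l.limit, l.burst)
        l.limiters[key] = lim
    }
    l.mu.Unlock()
    return lim.Allow()
}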

While there has been some upstream discussion of similar components, we couldn’t find anything that explicitly fit our needs. This was especially true when you consider that in our case the data loss during the rate limiting process is intentional, something that might be hard to sell when trying to build something more generally applicable.

How we migrated

Once we had an OpenTelemetry Collector binary, we had to deploy it. Our deployment process took two forks: Deploying to our core data centers, and deploying to our edge data centers. For those unfamiliar, Cloudflare’s core data centers contain a small number of servers with a very diverse set of workloads, from Postgresql, to ElasticSearch, to Kubernetes, and everything in between. Our edge data centers, on the other hand, are much more homogenous. They contain a much larger number of servers, each one running the same set of services.

Both edge and core use salt to configure the services running on their servers. This meant that the first step was to write salt states that would install the OpenTelemetry collector, and write the appropriate configurations to disk. Once we had those in place, we also needed to write some temporary migration pieces that would disable syslog-ng and start the OpenTelemetry collector, as well as the inverse in the case of a roll back.

For the edge data centers, once we had a set of configurations written, it mostly came down to rolling the changes out gradually across the edge servers. Because edge servers run the same set of services, once we had gained confidence in our set of configurations, it became a matter of rolling out the changes slowly and monitoring the logging pipelines along the way. We did have a few false starts here, and needed to instrument our cfjs1exporter a bit more to work around issues surrounding some of our more niche services and general Internet badness which we’ll detail below in our lessons learned.

The core data centers required a more hands-on approach. Many of our services in core have custom syslog-ng configurations. For example, our Postgresql servers have custom handling for their audit logs, and our Kubernetes servers have custom handling for contour ingress and error logs. This meant that each role with a custom config had to be manually onboarded, with extensive testing on the designated canary nodes of each role to validate the configurations.

Lessons Learned

Failover

At Cloudflare, we regularly schedule chaos testing on our core data centers which contain our centralized log receivers. During one of these chaos tests, our cfjs1 exporter did not notice that it could not send to the primary central logging server. This caused our collector to not failover to the secondary central logging server and its log buffer to fill up, which resulted in the collector failing to consume logs from its receivers. This is not a problem with journal receivers since logs are buffered by journald before they get consumed by the collector, but it is a different case with named pipe receivers. Due to this bug, our collectors stopped consuming logs from named pipes, and services writing to these named pipes started blocking threads waiting to write to them. Our syslog-ng deployment solved this issue using a monit script to periodically kill the connections between syslog-ng and the central receivers, however we opted to solve this more explicitly in our exporter by building in much tighter timeouts, and modifying the upstream failover receiver to better respond to these partial failures.

Cutover delays

As we’ve previously blogged about, at Cloudflare, we use Nomad for running dynamic tasks in our edge data centers. We use a custom driver to run containers and this custom driver handles the shipping of logs from the container to a named pipe.

We did the migration from syslog-ng to OpenTelemetry Collectors while servers were live and running production services. During the migration, there was a gap between syslog-ng being stopped by our configuration management and our OpenTelemetry Collector being started on the server. During this gap, logs in the named pipes were not consumed and, as with the failover issue above, services writing to those named pipes in blocking mode were affected. Like NGINX and PostgreSQL, Cloudflare’s driver for Nomad writes logs to the named pipe in blocking mode. Because of this delay, the driver timed out sending logs and rescheduled the containers.

We ultimately caught this pretty early on in testing, and changed our approach to the rollout. Instead of using Salt to separately stop syslog-ng and start the collector, we instead used salt to schedule a systemd “one shot” service that simultaneously stopped syslog-ng and started the collector, minimizing the downtime between the two.

What’s next?

Migrating such a critical part of our infrastructure is never easy, especially when it has remained largely untouched for nearly half a decade. Even with the issues we hit during our rollout, migrating to an OpenTelemetry Collector unlocks so many more improvements to our logging pipeline going forward. With the initial deployment complete, there are a number of changes we’re excited to work on next, including:

  • Better handling for log sampling, including tail sampling
  • Better insights for our engineering teams on their telemetry production
  • Migration to OTLP as our line protocol
  • Upstreaming of some of our custom components

If that sounds interesting to you, we’re hiring engineers to come work on our logging pipeline, so please reach out!

How we evaluated the business impact of marketing campaigns

Post Syndicated from Grab Tech original https://engineering.grab.com/evaluate-business-impact-of-marketing-campaigns

In a previous post, we introduced our systems for running marketing campaigns. Although we sent millions of messages daily, we had little insight into their effectiveness. Did they engage our users with our promotions? Did they encourage more transactions and bookings?

As Grab’s business expanded and the number of marketing campaigns increased, understanding the impact of these campaigns became crucial. This knowledge enables campaign managers to design more effective campaigns and avoid wasteful ones that degrade user experience.

Initially, campaign managers had to consult marketing analysts to gauge the impact of campaigns. However, this approach soon proved unsustainable:

  • Manual analysis doesn’t scale with an increasing number of campaigns.
  • Different analysts might assess the business impact in slightly different ways, leading to inconsistent results over time.

Thus, we recognised the need for a centralised solution allowing campaign managers to view their campaign impact analyses.

Marketing attribution model

The marketing analyst team designed a Marketing attribution model (MAM) for estimating the business impact of any campaign that sends messages to users. It quantifies business impact in terms of generated gross merchandise value (GMV), revenue, etc.

Unlike traditional models that only credit the last touchpoint (i.e. the last message user reads before making a transaction), MAM offers a more nuanced view. It recognises that users are exposed to various marketing messages (emails, pushes, feeds, etc.) throughout their decision-making process. As shown in Fig 1, MAM assigns credit to each touchpoint that influences a conversion (e.g., Grab usage) based on two key factors:

  • Relevance: Content directly related to the conversion receives a higher weightage. Imagine a user opening a GrabFood push notification before placing a food order. This push would be considered highly relevant and receive significant credit.
  • Recency: Touchpoints closer in time to the conversion hold more weight. For instance, a brand awareness email sent weeks before the purchase would be less impactful than a targeted GrabFood promotion right before the order.
By factoring in both relevance and recency, MAM avoids crediting the same touchpoint twice and provides a more accurate picture of which marketing campaigns are driving higher conversions.

Fig 1. How MAM does business attribution
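To make the relevance-and-recency idea concrete, here is a purely illustrative sketch and not Grab's actual MAM formula: it assumes a hypothetical per-touchpoint relevance score and an exponential recency decay, then normalises the weights so a conversion's value is split across touchpoints exactly once.

package mam

import (
    "math"
    "time"
)

// Touchpoint is one marketing message a user saw before converting.
type Touchpoint struct {
    Campaign  string
    Relevance float64 // hypothetical 0-1 score: how related the content is to the conversion
    SeenAt    time.Time
}

// AttributeGMV splits a conversion's GMV across touchpoints using
// relevance * exp(-age/decay) weights, normalised to sum to one so that no
// touchpoint is credited twice. Illustrative only; not the production model.
func AttributeGMV(gmv float64, convertedAt time.Time, decay time.Duration, tps []Touchpoint) map[string]float64 {
    weights := make([]float64, len(tps))
    var total float64
    for i, tp := range tps {
        age := convertedAt.Sub(tp.SeenAt).Hours()
        weights[i] = tp.Relevance * math.Exp(-age/decay.Hours())
        total += weights[i]
    }

    credit := make(map[string]float64)
    if total == 0 {
        return credit
    }
    for i, tp := range tps {
        credit[tp.Campaign] += gmv * weights[i] / total
    }
    return credit
}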

While MAM is effective for comparing the impacts of different campaigns, it struggles with the assessment of a single campaign because it does not account for negative impacts. For example, consider a message stating, “Hey, don’t use Grab.” Clearly, not all messages positively impact business.

Hold-out group

To better evaluate the impact of a single campaign, we divide targeted users into two groups:

  • Hold-out (control): do not send any message
  • Treatment: send the message
Fig 2. Campaign setup with hold-out group

We then compare the business performance of sending versus not sending messages. For the treatment group, we ideally count only the user transactions potentially linked to the message (i.e., transactions occurring within X days of message receipt). However, since the hold-out group receives no messages, there are no equivalent metrics for comparison.

The only business metrics available for the hold-out group are the aggregated totals of GMV, revenue, etc., over a given time, divided by the number of users. We must calculate the same for the treatment group to ensure a fair comparison.

Fig 3. Metrics calculation for both hold-out and treatment group

The comparison might seem unreliable due to:

  • The metrics are raw aggregations, lacking attribution logic.
  • The aggregated GMV and revenue might be skewed by other simultaneous campaigns involving the same users.

Here, we have to admit that figuring out true business impact is difficult. All we can do is try our best to get as close to the truth as possible. To make the comparison more precise, we employed the following strategies:

  • Stratify the two groups, so that both groups contain roughly the same distribution of users.
  • Calculate statistical significance to rule out the difference caused by random factors.
  • Allow users to narrow down the business metrics to compare according to campaign set-up. For example, we don’t compare ride bookings if the campaign is promoting food.

Statistical significance is a common, yet important technique for evaluating the result of controlled experiments. Let’s see how it’s used in our case.

Statistical significance

When we run an A/B test, we cannot simply conclude that A is better than B just because A’s result is better. The difference could be due to other random factors. If you ran an A/A test, you would still see differences in the results even without doing anything different to the two groups.

Statistical significance is a method to calculate the probability that the difference between two groups is really due to randomness. The lower the probability, the more confidently we can say our action is truly making some impact.

In our case, to derive statistical significance, we assume:

  • Our hold-out and treatment group are two sets of samples drawn from two populations, A and B.
  • A and B are the same except that B received our message. We can’t 100% prove this, but can reasonably guess this is close to true, since we split with stratification.

Assuming the business metric we are comparing is food GMV, the base numbers can be formulated as shown in Fig 4.

Fig 4. Formulation for calculating statistical significance

To calculate the probability, we then use a formula derived from the central limit theorem (CLT). The mathematical derivation of the formula is beyond the scope of this post. Programmatically, we use the popular jStat library for the calculation.
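Our implementation uses jStat in the UI, but the underlying computation is a standard two-sample z-test justified by the CLT for large groups. As an illustrative sketch of the same calculation in Go, taking the per-group user counts, means, and variances from Fig 4:

package significance

import "math"

// PValue returns the two-sided p-value for the difference between two group
// means (for example, per-user food GMV for the hold-out and treatment groups),
// using the normal approximation that the CLT justifies for large samples.
func PValue(n1, mean1, var1, n2, mean2, var2 float64) float64 {
    se := math.Sqrt(var1/n1 + var2/n2) // standard error of the difference in means
    if se == 0 {
        return 1
    }
    z := (mean2 - mean1) / se                  // how many standard errors apart the means are
    return math.Erfc(math.Abs(z) / math.Sqrt2) // equals 2 * (1 - Phi(|z|))
}

A p-value below a chosen threshold (commonly 0.05) suggests the observed difference is unlikely to be explained by randomness alone.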

The statistical significance result is surfaced to campaign owners as a special notice, as shown in Fig 5.

Fig 5. Display of business impact analysis with statistical significance

What’s next

Evaluating the true business impact remains challenging. We continue to refine our methodology and address potential biases, such as the assumption that both groups are of the same distribution, which might not hold true, especially in smaller group sizes. Furthermore, consistently reserving a 10% hold-out in each campaign is impractical for some campaigns, as sometimes campaign owners require messages to reach all targeted users.

We are committed to advancing our business impact evaluation solutions and will continue improving our existing solutions. We look forward to sharing more insights in future blogs.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

GitHub Availability Report: April 2024

Post Syndicated from Jakub Oleksy original https://github.blog/2024-05-10-github-availability-report-april-2024/

In April, we experienced four incidents that resulted in degraded performance across GitHub services.

April 05 08:11 UTC (lasting 47 minutes)

On April 5, between 8:11 and 8:58 UTC, several GitHub services experienced issues. Web request error rates peaked at 6% and API request error rates peaked at 10%, and over 100,000 GitHub Actions workflows failed to start. The root cause was traced to a change in the database load balancer, which caused connection failures to multiple critical databases in one of our three data centers. We resolved the incident by rolling back the change and have implemented new measures to detect similar problems earlier in the deployment pipeline to minimize user impact moving forward.

April 10 08:18 UTC (lasting 120 minutes)

On April 10, between 8:18 and 9:38 UTC, several services experienced increased error rates due to an overloaded primary database instance caused by an unbounded query. To mitigate the impact, we scaled up the instance and shipped an improved version of the query to run against read replicas. The incident resulted in a 17% failure rate for web-based repository file editing and failure rates between 1.5% and 8% for other repository management operations. Issue and pull request authoring were also heavily impacted, and work is ongoing to remove dependence on the impacted database primary. GitHub search saw a 5% failure rate due to reliance on the impacted primary database when authorizing repository access.

April 10 18:33 UTC (lasting 30 minutes)

On April 10, between 18:33 and 19:03 UTC, several services were degraded due to a compute-intensive database query that prevented a key database cluster from serving other queries. Impact was widespread due to the critical dependency on this cluster’s data. GitHub Actions experienced delays and failures, GitHub API requests had a significant number of timeouts, all GitHub Pages deployments during the incident period failed, and Git Systems saw HTTP 50X error codes for a portion of raw file and repository archive download requests. GitHub Issues also experienced increased latency for creation and updates, and GitHub Codespaces saw timeouts for requests to create and resume a codespace. The incident was mitigated by rolling back the offending query. We have a mechanism to detect similar compute-intensive queries in CI testing, but identified a gap in that coverage and have addressed that to prevent similar issues in the future. In addition, we have implemented improvements to various services to be more resilient to this dependency and to detect and stop deployments with similar regressions.

April 11 08:18 UTC (lasting 3 days, 4 hours, 23 minutes)

Between April 11 and April 14, GitHub.com experienced significant delays (up to two hours) in delivering emails, particularly for time-sensitive emails like password reset and unrecognized device verification. Users without 2FA attempting to sign in on an unrecognized device were unable to complete device verification, and users attempting to reset their password were unable to complete the reset. The delays were caused by increased usage of a shared resource pool, and a separate internal job queue that became unhealthy and prevented the mailer queue from processing. Immediate improvements have been made to better detect and react to similar situations in the future, including a queue-bypass ability for time-sensitive emails and updated methods of detection for anomalous email delivery. The unhealthy job queue has been paused to prevent impact to other queues using shared resources.


Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.

The post GitHub Availability Report: April 2024 appeared first on The GitHub Blog.

How we’re building more inclusive and accessible components at GitHub

Post Syndicated from Eric Bailey original https://github.blog/engineering/user-experience/how-were-building-more-inclusive-and-accessible-components-at-github/

One of GitHub’s core values is Diverse and Inclusive. It is a guiding thought for how we operate, reminding us that GitHub serves a developer community that spans a wide range of geography and ability.

Putting diversity and inclusivity into practice means incorporating a wide range of perspectives into our work. To that point, disability and accessibility are an integral part of our efforts.

This consideration has been instrumental in crafting resilient, accessible components at GitHub. These components, in turn, help guarantee that our experiences work regardless of how they are interacted with.

Using GitHub should be efficient and intuitive, regardless of your device, circumstance, or ability. To that point, we have been working on improving the accessibility of our lists of issues and pull requests, as well as our information tables.

Our lists of issues and pull requests are some of the highest-traffic experiences we have on GitHub. For many, they are the “homepage” of their open source projects, a jumping-off point for conducting and managing work.

Our tables help communicate complicated information relationships and make it easier to act on them with confidence. These experiences are workhorses, helping to communicate information about branches, repositories, secrets, attestations, configurations, internal documentation, etc.

Nothing about us without us

Before we discuss the particulars of these updates, I would like to call attention to the most important aspect of the work: direct participation of, and input from daily assistive technology users.

Disabled people’s direct involvement in the inception, design, and development stages is indispensable. It’s crucial for us to go beyond compliance and weave these practices into the core of our organization. Only by doing so can we create genuinely inclusive experiences.

With this context established, we can now talk about how this process manifests in component work.

Improvements we’re making to lists of issues and pull requests

A list of nine GitHub issues. The issues topics are a blend of list component work and general maintenance tasks. Each issue has a checkbox for selecting it, a status icon indicating that it is an open issue, a title, metadata about its issue number, author, creation date, and source repository. These issues also have secondary information including labels, tallies for linked pull requests and comments, avatars for issue assignees, and overflow actions. Additionally, some issues have a small badge that indicates the number of tasks the issue contains, as well as how many of them are completed. Above the list of issues is an area that lists the total number of issues, allows you to select them all, control how they are sorted, change the information display density, and additional overflow actions.

Lists of issues and pull requests will continue to support methods of navigation via assistive technology that you may already be familiar with—making experiences consistent and predictable is a huge and often overlooked aspect of the work.

In addition, these lists will soon be updated to also have:

  • A dedicated subheading for quickly navigating to the list itself.
  • A dedicated subheading per issue or pull request.
  • List and list item screen reader keyboard shortcut support.
  • Arrow keys and Home/End to quickly move through each list item.
  • Focus management that allows using Tab to explore individual list item content.
  • Support for Space keypresses for selecting list items, and Enter for navigating to the issue or pull request the list item links to.

This allows a wide range of assistive technologies to efficiently navigate and act on these experiences.
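To make the intended interaction model concrete, here is a minimal, hypothetical sketch of the roving-focus pattern described above, written in plain JavaScript against standard DOM APIs. It is not GitHub’s actual implementation, and the selectors are assumptions.

```javascript
// Minimal roving-focus sketch: Arrow keys, Home, and End move between list
// items, while Tab is left free to explore an item's own content.
const list = document.querySelector('[role="list"]');
const items = Array.from(list.querySelectorAll('[role="listitem"]'));

// Only one item is in the Tab order at a time.
items.forEach((item, i) => (item.tabIndex = i === 0 ? 0 : -1));

list.addEventListener("keydown", (event) => {
  const current = items.indexOf(document.activeElement);
  if (current === -1) return;
  let next = current;
  if (event.key === "ArrowDown") next = Math.min(current + 1, items.length - 1);
  else if (event.key === "ArrowUp") next = Math.max(current - 1, 0);
  else if (event.key === "Home") next = 0;
  else if (event.key === "End") next = items.length - 1;
  else return; // leave Tab, Space, and Enter to their default or item-level handlers
  event.preventDefault();
  items[current].tabIndex = -1;
  items[next].tabIndex = 0;
  items[next].focus();
});
```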

Improvements we’re making to tables

A table titled, ‘Active branches’. It has five columns and 7 rows. The columns are titled ‘branches’, ‘updated’, ‘check status’, ‘behind/ahead’, ‘pull request’, and ‘actions’. Each row lists a branch name and its associated metadata. The branch names use a GitHub user name/feature name pattern. The user names include people who worked on the table component, including Mike Perrotti, Josh Black, Eric Bailey, and James Scholes. They also include couple of subtle references to disability advocates Alice Wong and Patty Berne. The branches are sorted by last updated order, and after the table is a link titled, ‘View more branches’.

We are in the process of replacing one-off table implementations with a dedicated Primer component.

Primer-derived tables help provide consistency and predictability. This is important for expected table navigation, but also applies for other table-related experiences, such as loading content, sorting and pagination requests, and bulk and row-level actions.

At the time of this blog post’s publishing, there are 75 bespoke tables that have been replaced with the Primer component, spread across all of GitHub.

This quiet success has been due entirely to close collaboration with both our disabled partners and our design system experts. This collaboration helped to ensure:

  1. The new table experiences were seamlessly integrated.
  2. Doing so improved and enhanced the underlying assistive technology experience.

Progress over perfection

Meryl K. Evans’ Progress Over Perfection philosophy heavily influenced how we approached this work.

Accessibility is never done. Part of our dedication to this work is understanding that it will grow and change to meet the needs of the people who rely on it. This means making positive, iterative change based on feedback from the community GitHub serves.

More to come

Tables will continue to be updated, and the lists should be released publicly soon. Beyond that, we’re excited about the changes we’re making to improve GitHub’s accessibility. This includes both our services and our internal culture.

We hope that these components, and the process that led to their creation, help you as both part of our developer community and as people who build the world’s software.

Please visit accessibility.github.com to learn more and share feedback on our accessibility community discussion page.

The post How we’re building more inclusive and accessible components at GitHub appeared first on The GitHub Blog.

No version left behind: Our epic journey of GitLab upgrades

Post Syndicated from Grab Tech original https://engineering.grab.com/no-version-left-behind-our-epic-journey-of-gitlab-upgrades

In a tech-driven field, staying updated isn’t an option—it’s essential. At Grab, we’re committed to providing top-notch technology services. However, keeping pace can be demanding. At one point in time, our GitLab instance was trailing by roughly 14 months of releases. This blog post recounts our experience updating and formulating a consistent upgrade routine.

Recognising the need to upgrade

Our team, while skilled, was still learning GitLab’s complexities. Regular stability issues left us little time for necessary upgrades. Understanding that upgrades were essential for getting the latest patches for important security fixes and vulnerabilities, we started preparing for GitLab updates while managing system stability. This demanded quick learning and a careful approach to updates.

The following image illustrates the version discrepancy between our self-hosted GitLab instance and the official most recent release of GitLab as of July 2022. GitLab follows a set release schedule, issuing one minor update monthly and rolling out a major upgrade annually.

Fig 1. The difference between our hosted version and the latest available GitLab version by 22 July 2022

Addressing fears and concerns

We were concerned about potential downtime, data integrity, and the threat of encountering unforeseen issues. GitLab is critical for the daily activities of Grab engineers. It serves a critical user base of thousands of engineers and hosts multiple mono repositories with code bases ranging in size from 1GB to a sizable 15GB. When taking into account all its artefacts, the overall footprint of a monorepo can extend to an impressive 39TB.

Our self-hosted GitLab is firmly intertwined with multiple critical components. We’ve aligned our systems with GitLab’s official reference architecture for 5,000 users. We use Terraform to configure the complete infrastructure with immutable Amazon Machine Images (AMIs) built using Packer and Ansible. Our efficient GitLab setup is designed for reliable performance to serve our wide user base. However, any fault leading to outages can disrupt our engineers, resulting in a loss of productivity for hundreds of teams.

High-level GitLab Architecture Diagram

The above is the top level architecture diagram of our GitLab infrastructure. Here are the major components of the GitLab architecture and their functions: 

  • Gitaly: Handles low-level Git operations for GitLab, such as interacting directly with the code repository present on disk. It’s important to mention that these code repositories are also stored on the same Gitaly nodes, using the attached Amazon Elastic Block Store (Amazon EBS) disks.
  • Praefect: Praefect in GitLab acts as a manager, coordinating Gitaly nodes to maintain data consistency and high availability.
  • Sidekiq: The background processing framework for GitLab written in Ruby. It handles asynchronous tasks in GitLab, ensuring smooth operation without blocking the main application.
  • App Server: The core web application server that serves the GitLab user interface and interacts with other components.

The importance of preparation

Recognising the complexity of our task, we prioritised careful planning for a successful upgrade. We studied GitLab’s documentation, shared insights within the team, and planned to prevent data losses.

To minimise disruptions from major upgrades or database migrations, we scheduled these during weekends. We also developed a checklist and a systematic approach for each upgrade, which include the following:

  • Diligently go through the release notes for each version of GitLab that falls within the scope of our upgrade.
  • Read through all dependencies like RDS, Redis, and Elasticsearch to ensure version compatibility.
  • Create documentation outlining new features, any deprecated elements, and changes that could potentially impact our operations.
  • Generate immutable AMIs for various components reflecting the new version of GitLab.
  • Revisit and validate all the backup plans.
  • Refresh staging environment with production data for accurate, realistic testing and performance checks, and validation of migration scripts under conditions similar to the actual setup.
  • Upgrade the staging environment.
  • Conduct extensive testing, incorporating both automated and manual functional testing, as well as load testing.
  • Conduct rollback tests on the staging environment to the previous version to confirm the rollback procedure’s reliability.
  • Inform all impacted stakeholders, and provide a defined timeline for upcoming upgrades.

We systematically follow GitLab’s official documentation for each upgrade, ensuring compatibility across software versions and reviewing specific instructions and changes, including any deprecations or removals.

The first upgrade

Equipped with knowledge, backup plans, and a robust support system, we embarked on our first GitLab upgrade two years ago. We carefully followed our checklist, handling each important part systematically. GitLab comprises both stateful (Gitaly) and stateless (Praefect, Sidekiq, and App Server) components, all managed through auto-scaling groups. We use a ‘create before destroy’ strategy for deploying stateless components and an ‘in-place node rotation’ method via Terraform for stateful ones.

We deployed key parts like Gitaly, Praefect, Sidekiq, App Servers, Network File System (NFS) server, and Elasticsearch in a specific sequence. Starting with Gitaly, followed by Praefect, then Sidekiq and App Servers, and finally NFS and Elasticsearch. Our thorough testing showed this order to be the most dependable and safe.

However, the journey was full of challenges. For instance, we encountered issues such as the Gitaly cluster falling out of sync for the monorepo and the Praefect server failing to distribute the load effectively. Praefect assigns a primary Gitaly node to host each repository. All write operations are sent to the repository’s primary node, while read requests are spread across all synced nodes in the Gitaly cluster. If the Gitaly nodes aren’t synced, Praefect redirects all write and read operations to the repository’s primary node.

Because Gitaly is a stateful application, we upgraded each Gitaly node with the latest AMI using an in-place node rotation strategy. In older versions of GitLab (up to v14.0), if a Gitaly node was unhealthy, Praefect would immediately reassign the repository’s primary to any healthy Gitaly node. After the rolling upgrade of our 3-node Gitaly cluster, repositories ended up concentrated on only one Gitaly node.

In our situation, a very busy monorepo was assigned to a Gitaly node that was also the main node for many other repositories. When real traffic began after deployment, the Gitaly node had trouble syncing the monorepo with the other nodes in the cluster.

Because the Gitaly node was out of sync, Praefect started sending all changes and access requests for the monorepo to this struggling Gitaly node. This increased the load on the Gitaly server, causing it to fail. We found this to be the main issue and decided to manually move our monorepo to a Gitaly node that was less crowded. We also added a step to validate primary node distribution to our deployment checklist.

This immediate failover behaviour changed in GitLab version 14.1. Now, a primary is only elected lazily when a write request arrives for any repository. However, since we enabled maintenance mode before the Gitaly deployment, we didn’t receive any write requests. As a result, we did not see a shift in the primary node of the monorepo with new GitLab versions.

Regular upgrades: Our new normal

Embracing the practice of consistent upgrades dramatically transformed the way we operate. We initiated frequent upgrades and implemented measures to reduce the actual deployment time.  

  • Perform all major testing in one day before deployment.
  • Prepare a detailed checklist to follow during the deployment activity.
  • Reduce the minimum number of App Server and Sidekiq Servers required just after we start the deployment.
  • Upgrade components like App Server and Sidekiq in parallel.
  • Automate smoke testing to examine all major workflows after deployment.

Leveraging the lessons learned and the experience gained with each upgrade, we successfully cut the time spent on the entire operation by 50%. The chart below shows how we reduced our deployment time for major upgrades from 6 hours to 3 hours, and for minor upgrades from 4 hours to 1.5 hours.

Each upgrade enriched our comprehensive knowledge base, equipping us with insights into the possible behaviours of each component under varying circumstances. Our growing experience and enhanced knowledge helped us achieve successful upgrades with less downtime with each deployment.

Rather than moving up one minor version at a time, we learned about the feasibility of skipping versions. We began using the GitLab Upgrade Path. This method allowed us to skip several versions, closing the distance to the latest version with fewer deployments. This approach enabled us to catch up on 24 months’ worth of upgrades in just 11 months, even though we started 14 months behind. 

Time taken in hrs for each upgrade. The blue line depicts major and the red line is for minor upgrades

Overcoming challenges

Our journey was not without hurdles. We faced challenges in maintaining system stability during upgrades, navigating unexpected changes in functionality post upgrades, and ensuring data integrity.

However, these challenges served as an opportunity for our team to innovate and create robust workarounds. Here are a few highlights:

Unexpected project distribution: During upgrades and Gitaly server restarts, we observed unexpected migration of the monorepo to a crowded Gitaly server, resulting in higher rate limiting. We manually updated the primary node for the monorepo and made this validation a part of our deployment checklist.

NFS deprecation: We migrated all required data to S3 buckets and deprecated NFS to become more resilient and independent of Availability Zone (AZ).

Handling unexpected Continuous Integration (CI) operations: A sudden surge in CI operations sometimes resulted in rate limiting and interrupted more essential Git operations for developers. This is because GitLab uses different RPC calls, with different concurrency limits, for SSH and HTTP operations. We encouraged using HTTPS links for GitLab CI and automation scripts, and SSH links for regular Git operations.

Right-sizing resources: We countered resource limitations by right-sizing our infrastructure, ensuring each component had optimal resources to function efficiently.

Performance testing: We conducted performance testing of our GitLab using the GitLab Performance Tool (GPT). In addition, we used our custom scripts to load test Grab specific use cases and mono repositories.

Limiting maintenance windows: Each deployment required a maintenance window or downtime. To minimise this, we structured our deployment processes more efficiently, reducing potential downtime and ensuring uninterrupted service for users.

Dependency on GitLab.com image registry: We introduced measures to host necessary images internally, which increased our resilience and allowed us to cut ties with external dependencies.

The results

Through careful planning, we’ve improved our upgrade process, ensuring system stability and timely updates. We’ve also reduced the delay in aligning with official GitLab releases. The image below shows how the time delay between release date and deployment has been reduced with each upgrade: it was sharply brought down from 396 days (around 14 months) to 35 days.

At the time of this article, we’re just two minor versions behind the latest GitLab release, with a strong focus on security and resilience. We are also seeing a reduced number of reported issues after each upgrade.

Our refined process has allowed us to perform regular updates without any service disruptions. We aim to leverage these learnings to automate our upgrade deployments, painting a positive picture for our future updates, marked by efficiency and stability.

Time delay between official release date and date of deployment

Looking ahead

Our dedication extends beyond staying current with the most recent GitLab versions. With stabilised deployment, we are now focusing on:

  • Automated upgrades: Our efforts extend towards bringing in more automation to enhance efficiency. We’re already employing zero-downtime automated upgrades for patch versions involving no database migrations, utilising GitLab pipelines. Looking forward, we plan to automate minor version deployments as well, ensuring minimal human intervention during the upgrade process.
  • Automated runner onboarding for service teams: We’ve developed a ‘Runner as a Service’ solution for our service teams. Service teams can create their dedicated runners by providing minimal details, while we manage these runners centrally. This setup allows the service team to stay focused on development, ensuring smooth operations.
  • Improved communication and data safety: We’re regularly communicating new features and potential issues to our service teams. We also ensure targeted solutions for any disruptions. Additionally, we’re focusing on developing automated data validation via our data restoration process. 
  • Focus on development: With stabilised updates, we’ve created an environment where our development teams can focus more on crafting new features and supporting ongoing work, rather than handling upgrade issues.

Key takeaways

The upgrade process taught us the importance of adaptability, thorough preparation, effective communication, and continuous learning. Our ‘No Version Left Behind’ motto underscores the critical role of regular tech updates in boosting productivity, refining processes, and strengthening security. These insights will guide us as we navigate ongoing technological advancements.

Below are the key areas in which we improved:

Enhanced testing procedures: We’ve fine-tuned our testing strategies, using both automated and manual testing for GitLab, and regularly conducting performance tests before upgrades.

Approvals: We’ve designed approval workflows that allow us to efficiently obtain the necessary clearances before each upgrade, further ensuring the smooth execution of our processes.

Improved communication: We’ve improved stakeholder communication, regularly sharing updates and detailed documents about new features, deprecated items, and significant changes with each upgrade.

Streamlined planning: We’ve improved our upgrade planning, strictly following our checklist and rotating the role of Upgrade Ownership among team members.

Optimised activity time: We’ve significantly reduced the time for production upgrade activity through advanced planning, automation, and eliminating unnecessary steps.

Efficient issue management: We’ve improved our ability to handle potential GitLab upgrade issues, with minimal to no issues occurring. We’re prepared to handle any incidents that could cause an outage.

Knowledge base creation and automation: We’ve created a GitLab knowledge base and continuously enhanced it with rich content, making it even more invaluable for training new team members and for reference during unexpected situations. We’ve also automated routine tasks to improve efficiency and reduce manual errors.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Empowering accessibility: GitHub’s journey building an in-house Champions program

Post Syndicated from Carie Fisher original https://github.blog/engineering/engineering-principles/empowering-accessibility-githubs-journey-building-an-in-house-champions-program/

For more on this topic, check out Alexis Lucio, Catherine McNally, and Lindsey Wild’s axe-con 2024 talk, “Establishing a Scalable A11y Education Ecosystem,” which laid the foundation for this blog post. Free registration required.

Laying the foundation

In today’s digital world, accessibility isn’t merely a checkbox—it’s the cornerstone of creating an inclusive experience for all users. At GitHub, we recognize this fundamental truth. That’s why we’ve embarked on a journey to empower developers, including those with disabilities, to participate fully and thrive on our platform. Our commitment to accessibility isn’t a one-time endeavor; it’s an ongoing effort fueled by the desire to remove barriers and make technology accessible to everyone.

As part of GitHub’s dedication to accessibility, we’ve been expanding our internal accessibility program and have scaled up our assessment process to help remove or lower barriers for users with disabilities. Naturally, as the number of assessments increased, so did the issues requiring attention, which strained our centralized accessibility team. Understanding the importance of decentralizing ownership of accessibility across the organization, we took decisive action by launching GitHub’s Accessibility Champions program. This strategic initiative empowers employees from various disciplines to drive accessibility efforts within their teams, fostering a culture where accessibility is deeply ingrained and valued.

The journey to establish GitHub’s Accessibility Champions program began with a comprehensive examination of our existing challenges and opportunities. We understood that for the program to thrive, we needed to consider various factors, including different time zones and work schedules, the expertise levels of our employees, and their ability to dedicate time to accessibility efforts due to competing priorities. By thoroughly assessing these considerations, we aimed to ensure that the program would be effective and adaptable to our team’s evolving needs.

To lay a solid foundation for the program’s success, we established clear goals and defined responsibilities for our champions upon completing their training. By setting measurable objectives and metrics to track the program’s impact on accessibility efforts both within the company and beyond, we provided our champions with a clear roadmap to follow. This proactive approach ensured we were all aligned in our efforts to make GitHub a more inclusive platform.

Starting small

At the heart of the GitHub Accessibility Champions program’s success is the development of a comprehensive and dynamic curriculum. Understanding that people have different learning preferences, GitHub took a tailored approach by assembling different types of educational resources. These resources were carefully curated to cater to various learning styles and delivered asynchronously through videos, articles, and interactive exercises.

Participants in the program received training on digital accessibility fundamentals, including WCAG guidelines, inclusive design principles, testing techniques, and content/interface accessibility best practices. They learned to identify and address accessibility barriers, advocate for accessibility within their teams, and utilize assistive technologies. Participants gained practical experience creating inclusive digital experiences through hands-on exercises and interactive discussions.

The program began with a modest group of 17 engineering champions serving as pioneers in the initiative. This small-scale pilot allowed GitHub to fine-tune the curriculum, gather valuable feedback, and iterate on the program’s structure and content. As the program evolved and gained momentum, it gradually expanded to include 52 champions from a variety of backgrounds, spanning engineering, design, and content teams. Our plan for this year is to reach over 100 internal champions to help support our accessibility goals.

This phased approach to scaling the GitHub Accessibility Champions program has proved invaluable. By starting small and gradually growing the community of champions, we were able to refine the program iteratively, ensuring it met the evolving needs of participants. Moreover, this approach fostered a strong sense of camaraderie among champions, creating a network of advocates dedicated to advancing accessibility across the organization.

Embracing feedback and iteration

Feedback was instrumental in shaping the trajectory of the GitHub Accessibility Champions program, serving as a guiding force in its evolution. As participants engaged with the program, their voices were invaluable in driving improvements and enhancements to meet their needs.

One recurring theme in the feedback was the desire for more interactive experiences and community engagement. Participants expressed a hunger for opportunities to connect with fellow champions, share insights, and collaborate on addressing accessibility challenges. In response, we introduced monthly Champions Connect meetings, providing a platform for champions to come together, exchange ideas, and foster a sense of camaraderie. These gatherings facilitated knowledge sharing and motivated and inspired champions as they navigated their accessibility journeys.

“Being able to ask questions and get answers quickly on simple matters is important to my team’s success. Or, if the questions are too complex to get immediate answers, having a forum to take the time and unpack them to get the answers.”

Participants also emphasized the importance of hands-on experiences in honing their skills and understanding of accessibility principles. Recognizing this need, we organized bug bashes and collaborative events where teams worked together to identify and address accessibility issues in real-time. These sessions provided practical learning opportunities and fostered a culture of teamwork and collective problem-solving.

In addition to enhancing engagement within the champions community, we responded to the demand for more synchronous training sessions. We hosted live sessions tailored to the specific needs of engineers and product managers, providing a platform for interactive discussions, Q&A sessions, and technical deep dives. These sessions offered a valuable opportunity for participants to engage directly with experts, seek clarification on complex topics, and deepen their understanding of accessibility best practices.

“Getting a codespace to identify issues and identify remediations is an excellent way to move from using and understanding assistive technology to taking on the role of an auditor or engineer who is verifying fixes.”

Finally, we initiated roundtable discussions with customers with disabilities, recognizing the importance of incorporating diverse perspectives into the design and development process. These interactions provided invaluable insights into the experiences and needs of users with disabilities, highlighting the critical role of inclusive design practices. By engaging directly with end-users, every champion at GitHub gained a deeper understanding of accessibility challenges and priorities, informing the development of more user-centric and inclusive digital experiences.

“Communicating the value of why we should design and create accessible documentation is key to success on my team. Everyone wants to do the right thing and is willing to do more complex tasks if they understand how it helps people better use our product.”

Overall, feedback catalyzed continuous improvement and innovation within the GitHub Accessibility Champions program. By actively listening to participant input and responding with targeted initiatives, we demonstrate our commitment to fostering a culture of accessibility and inclusion. Through ongoing engagement, collaboration, and user-centered design, GitHub continues to advance accessibility efforts, empowering all users to access and interact with its platform seamlessly.

“I loved that the training was super detailed, to a point where someone with zero information on accessibility can get started with basic concepts all the way to acknowledging problems they didn’t know existed.”

Expanding reach and impact

While we are proud of our progress so far, the GitHub Accessibility Champions program isn’t just about addressing internal challenges and setting an example for the broader tech community. By sharing our experiences and best practices, we hope to inspire other organizations to prioritize accessibility and inclusion in their own initiatives.

As we reflect on the journey of GitHub’s Accessibility Champions program, there are several key takeaways and future directions that can provide valuable insights for other teams and organizations embarking on similar initiatives:

  1. Start where you are. Take stock of your current situation and identify areas where accessibility education can be improved. Understanding your organization’s unique needs and challenges is the first step toward meaningful progress.
  2. Go where you’re wanted. Invest your resources with a clear advocacy for accessibility and a willingness to engage in educational programs. By aligning your efforts with enthusiastic stakeholders, you can maximize the impact of your initiatives.
  3. Pilot with a small group. Begin with a small group to test your programs and gather feedback before scaling up. This phased approach allows for experimentation and refinement, ensuring that your initiatives are effective and sustainable in the long run.
  4. Lean into organic partnerships. Collaborate across teams and titles to create a cohesive ecosystem of accessibility education. By leveraging the expertise and resources available within your organization, you can amplify the impact of your efforts and foster a culture of inclusivity.
  5. Seek out, review, and take action on feedback. Actively solicit feedback from participants and stakeholders and use it to inform program improvements. By listening to the needs and experiences of your audience, you can continuously iterate and enhance the effectiveness of your initiatives.
  6. Collect and re-evaluate metrics. Continuously monitor and evaluate the impact of your educational initiatives to track progress and effectiveness over time. By collecting meaningful metrics and analyzing trends, you can identify areas for improvement and demonstrate the value of your efforts to key stakeholders.

Conclusion

The GitHub Accessibility Champions program demonstrates our dedication to fostering a culture of accessibility and inclusion. By prioritizing feedback, collaboration, and responsiveness, we have created a supportive ecosystem where individuals can learn, grow, and acquire the tools to build more inclusive digital experiences. Our champions are truly a community of passionate accessibility advocates.

Looking ahead, we’re committed to enhancing the GitHub Accessibility Champions program, advancing accessibility efforts across the organization, and sharing our journey with the broader tech community—paving the way for a more inclusive digital future for all.

Please visit accessibility.github.com to learn more and to share feedback on our accessibility community discussion page.

The post Empowering accessibility: GitHub’s journey building an in-house Champions program appeared first on The GitHub Blog.

5 tips to supercharge your developer career in 2024

Post Syndicated from Jeimy Ruiz original https://github.blog/2024-05-01-5-tips-to-supercharge-your-developer-career-in-2024/


The world of software development is constantly evolving. That means whether you’re a seasoned developer or just starting out on your coding journey, there’s always something new to learn.

Below, we’ll explore five actionable tips to take your career to the next level. From mastering prompt engineering to harnessing the power of AI for code security, these tips will help you learn the skills and uncover the knowledge you need to excel in today’s competitive job market.

Tip #1: Become a pro at prompt engineering

In the age of AI, you can use AI tools like GitHub Copilot to code up to 55% faster. But like any other tool or skill, our AI pair programmer has a learning curve, and there are certain techniques you can use that will make your work with AI even more effective. Enter prompt engineering. With prompt engineering, you provide GitHub Copilot with more context about your project—which yields better, more accurate results. Below are three best practices for crafting prompts for GitHub Copilot:

Open your relevant files

While you can begin using GitHub Copilot with a blank file, one easy way to introduce more context is to open related files in VS Code. Known as neighboring tabs, this technique enables Copilot to gain a deeper understanding of your code by processing all open files in your IDE.

This broader scope allows Copilot to identify matching code segments across your project, enhancing its suggestions and code completion capabilities.

Provide a top-level comment in your code file

Imagine being assigned a task with little to no context—that would make accomplishing it much more difficult, right? The same can be said for GitHub Copilot. When you add a brief, top-level comment in your code file, it helps Copilot understand the overarching objective before getting into the how.

Once you’ve broken down the ask and your goal, you can articulate the logic and steps required to achieve it. Then, allow Copilot to generate code incrementally, rather than all at once. This approach enhances Copilot’s understanding and improves the quality of the generated code.
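For example, a top-level comment might look like the hypothetical JavaScript sketch below. The task, names, and steps are ours, purely for illustration; in practice you would state the goal and steps, then let Copilot suggest each step’s code incrementally. The function body here shows one possible completion.

```javascript
// Goal: given an array of daily ride counts, return the 3-day moving average,
// rounded to two decimal places.
// Step 1: validate the input.
// Step 2: slide a window of size 3 across the array.
// Step 3: average each window and round the result.

function threeDayMovingAverage(rides) {
  if (!Array.isArray(rides) || rides.length < 3) return [];
  const averages = [];
  for (let i = 0; i <= rides.length - 3; i++) {
    const window = rides.slice(i, i + 3);
    const mean = window.reduce((sum, value) => sum + value, 0) / 3;
    averages.push(Math.round(mean * 100) / 100);
  }
  return averages;
}

console.log(threeDayMovingAverage([4, 7, 10, 8, 5])); // [7, 8.33, 7.67]
```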

Input sample code

Offer GitHub Copilot a snippet of code that closely resembles what you need. Even a brief example can further help Copilot craft suggestions tailored to your language and objectives!

Tip #2: Learn shortcuts and hacks

GitHub is full of shortcuts and hacks that make your work life easier and help you stay in the flow. Gain momentum in your projects and increase your productivity with these popular shortcuts:

Search for any file in your repositories

When you’re searching through repositories, type the letter “t” on your keyboard to activate the file finder and do away with hours of wasted time!

Link pull requests and commits to issues

Did you know that GitHub also has project management tools? One of them is a handy interlinking feature that allows you to link pull requests and Git commits to relevant issues in a project. This facilitates better organization, collaboration, and project management, not just for you, but for anyone looking for more context in your issue. Gone are the days of hunting down old issues every time you create a new pull request!

Create custom actions

Creating custom actions on GitHub enables you to enhance code reuse, bypass repetition, and simplify maintenance across multiple workflows. All you have to do is outline the necessary steps for a particular task and package them into an action using any supported programming or scripting language, and you’re all set!

Incorporate feedback in pull requests

Ever wish there was an easier way to review code? Well, it’s possible! Add comments directly to the pull request, propose changes, and even accept and add those suggestions seamlessly to make code reviews easier than ever. You can also save your replies by heading over to the comment box in an open pull request and selecting “create new saved reply,” and then “add saved reply,” to make it official.

Tip #3: Brush up on your soft skills

AI has introduced a host of hard skills that developers need to master in order to keep up with the latest tooling. Soft skills complement your new technical expertise and can contribute to your overall success by enhancing communication, collaboration, and problem-solving. Here are a few important ones to practice:

Communication

As you know, developer work rarely happens in a vacuum. Strong communication skills can facilitate clear understanding and efficient collaboration for both humans and AI tools, whether you’re collaborating with stakeholders, communicating complex technical concepts to non-technical audiences, or working on your prompt engineering.

Problem-solving

Critical thinking enables developers to approach complex challenges creatively, break them down into manageable tasks, and find innovative solutions with the help of AI coding tools.

Adaptability

AI coding tools are evolving rapidly, with new technologies, methodologies, and tools emerging regularly. Being adaptable allows developers to stay current, learn new skills quickly, and stay nimble as things change. To cultivate resilience and embrace discomfort (in and outside of the workplace), engage in activities that challenge you to anticipate and respond to the unexpected.

Ethics

Being aware of the ethical implications associated with these tools is essential. Developers should understand both the capabilities and limitations of AI coding tools and exercise critical thinking when interpreting responses from them. By remaining conscious of ethical considerations and actively working toward ethical practices, developers can ensure that these tools are used responsibly.

Empathy

Empathy is crucial for understanding the needs, preferences, and challenges of end-users. Empathy also fosters better collaboration within teams by promoting understanding and respect for colleagues’ perspectives and experiences.

Tip #4: Use AI to secure your code

Developers can leverage AI to enhance code security in several ways. First, AI can help prevent vulnerabilities by providing context and secure code suggestions right from the start. Traditionally, “shift left” meant getting security feedback after coding (but before deployment). By utilizing AI as a pair programmer, developers can “shift left” by addressing security concerns right where they bring their ideas to code.

A common pain point for developers is sifting through lengthy pages of alerts, many of which turn out to be false positives—wasting valuable time and resources. With features like code scanning autofix, AI and automation can step in to provide AI-generated code fixes alongside vulnerability alerts, streamlining remediation directly into the developer workflow. Similarly, secret scanning alerts developers to potential secrets detected in the code.

AI also presents an opportunity to improve the modeling of a vast array of open-source frameworks and libraries. Traditionally, security teams manually model numerous packages and APIs. This is a challenging task given the volume and diversity of these components, along with frequent updates and replacements. By infusing AI in modeling efforts, developers can increase the detection of vulnerabilities.

Tip #5: Attend GitHub Universe 2024

Attending conferences is a valuable investment in a developer’s career, providing opportunities for learning, networking, skill development, and professional growth all at the same time. GitHub Universe is our flagship, global event that brings together developers, leaders, and companies for two days of exploring the latest technologies and industry trends with fun, food, and networking in between. Here are some of the highlights:

100+ sessions on AI, DevEx, and security

Learn about frameworks and best practices directly from 150+ experts in the field through keynotes, breakout sessions, product demos, and more.

Gain and practice new skills

Git official by signing up for an interactive workshop or getting GitHub certified in GitHub Actions, GitHub Advanced Security, GitHub Foundations, or GitHub Administration. It’ll certainly look great on your resume and LinkedIn. 😉

Visibility

Sharing insights, presenting research findings, or showcasing projects can help developers establish themselves as thought leaders and experts in their field. The Universe call for sessions is open from now until May 10. Submit a session proposal today!

Professional development

Show your commitment to your career and continuous learning by visiting the dedicated Career Corner for professional development.

Community engagement

Build your network and find opportunities for collaboration and mentorship by engaging with peers and participating in the Discussions Lounge.

Learn more about our content tracks and what we have in store for the 10th anniversary of our global developer event.

By implementing the strategies outlined above, you’ll be well-equipped to unlock your dream career in 2024 and beyond. And remember: you can take your skills to the next level, network with industry leaders, and learn how to use the latest AI tools at GitHub Universe 2024.

Eager to get involved? Act fast to save 30% on in-person tickets with our Super Early Bird discount from now until July 8, or get notified about our free virtual event!

The post 5 tips to supercharge your developer career in 2024 appeared first on The GitHub Blog.

Ensuring data reliability and observability in risk systems

Post Syndicated from Grab Tech original https://engineering.grab.com/data-observability

Grab has an in-house Risk Management platform called GrabDefence, which relies on ingesting large amounts of data gathered from upstream services to power our heuristic risk rules and data science models in real time.

Fig 1. GrabDefence aggregates data from different upstream services

As Grab’s business grows, so does the amount of data. It becomes imperative that the data which fuels our risk systems is of reliable quality as any data discrepancy or missing data could impact fraud detection and prevention capabilities.

We need to quickly detect any data anomalies, which is where data observability comes in.

Data observability as a solution

Data observability is a type of data operation (DataOps; similar to DevOps) where teams build visibility over the health and quality of their data pipelines. This enables teams to be notified of data quality issues, and allows teams to investigate and resolve these issues faster.

We needed a solution that addresses the following issues:

  1. Alerts for any data quality issues as soon as possible – this meant the observability tool had to work in real time.
  2. With hundreds of data points to observe, we needed a neat and scalable solution which allows users to quickly pinpoint which data points were having issues.
  3. A consistent way to compare, analyse, and compute data that might have different formats.

Hence, we decided to use Flink to standardise data transformations, and to compute and observe data trends quickly (in real time) and at scale.

Flink SQL is a powerful, flexible tool for performing real-time analytics on streaming data. It allows users to query continuous data streams using standard SQL syntax, enabling complex event processing and data transformation within the Apache Flink ecosystem, which is particularly useful for scenarios requiring low-latency insights and decisions.

In Grab, data comes from multiple sources and while most of the data is in JSON format, the actual JSON structure differs between services. Because of JSON’s nested and dynamic data structure, it is difficult to consistently analyse the data – posing a significant challenge for real-time analysis.

To help address this issue, Apache Flink SQL has the capability to manage such intricacies with ease. It offers specialised functions tailored for parsing and querying JSON data, ensuring efficient processing.

Another standout feature of Flink SQL is the use of custom table functions, such as JSONEXPLOAD, which serves to deconstruct and flatten nested JSON structures into tabular rows. This transformation is crucial as it enables subsequent aggregation operations. By implementing a 5-minute tumbling window, Flink SQL can easily aggregate these now-flattened data streams. This technique is pivotal for monitoring, observing, and analysing data patterns and metrics in near real-time.
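Since the actual Flink SQL job definitions aren’t shown here, the following is a conceptual sketch in plain JavaScript (not Flink SQL) of what that pipeline does: flatten nested JSON events into rows, then count them per field in 5-minute tumbling windows. The event shape and field names are assumptions for illustration only.

```javascript
// Conceptual illustration of flatten-then-aggregate over 5-minute tumbling windows.
const WINDOW_MS = 5 * 60 * 1000;

function flatten(event) {
  // Deconstruct one nested event into flat { stream, field, ts } rows.
  return Object.entries(event.payload).map(([field]) => ({
    stream: event.source,
    field,
    ts: event.timestamp,
  }));
}

function aggregate(events) {
  const counters = new Map();
  for (const row of events.flatMap(flatten)) {
    const windowStart = Math.floor(row.ts / WINDOW_MS) * WINDOW_MS;
    const key = `${row.stream}.${row.field}@${windowStart}`;
    counters.set(key, (counters.get(key) || 0) + 1);
  }
  return counters; // per-field counts per 5-minute window, ready to ship to a monitor
}

console.log(aggregate([
  { source: "rides", timestamp: 1700000000000, payload: { fare: 12.5, city: "SG" } },
  { source: "rides", timestamp: 1700000100000, payload: { fare: 8.0, city: "SG" } },
]));
```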

Now that data is aggregated by Flink for easy analysis, we still needed a way to incorporate comprehensive monitoring so that teams could be notified of any data anomalies or discrepancies in real time.

How we interfaced the output with Datadog 

Datadog is the observability tool of choice in Grab, with many teams using Datadog for their service reliability observations and alerts. By aggregating data from Apache Flink and integrating it with Datadog, we can harness the synergy of real-time analytics and comprehensive monitoring. Flink excels in processing and aggregating data streams, which, when pushed to Datadog, can be further analysed and visualised. Datadog also provides seamless integration with collaboration tools like Slack, which enables teams to receive instant notifications and alerts.

With Datadog’s out-of-the-box features such as anomaly detection, teams can identify and be alerted to unusual patterns or outliers in their data streams. Taking a proactive approach to monitoring is crucial in maintaining system health and performance as teams can be alerted, then collaborate quickly to diagnose and address anomalies.

This integrated pipeline—from Flink’s real-time data aggregation to Datadog’s monitoring and Slack’s communication capabilities—creates a robust framework for real-time data operations. It ensures that any potential issues are quickly traced and brought to the team’s attention, facilitating a rapid response. Such an ecosystem empowers organisations to maintain high levels of system reliability and performance, ultimately enhancing the overall user experience.

Organising monitors and alerts using out-of-the-box solutions from Datadog

Once we integrated Flink data into Datadog, we realised that it could become unwieldy to try to identify the data point with issues from hundreds of other counters.

Fig 2. Hundreds of data points on a graph make it hard to decipher which ones have issues

We decided to organise the counters according to the service stream it was coming from, and create individual monitors for each service stream. We used Datadog’s Monitor Summary tool to help visualise the total number of service streams we are reading from and the number of underlying data points within each stream.  

Fig 3. Data is grouped according to their source stream

Within each individual stream, we used Datadog’s Anomaly Detection feature to create an alert whenever a data point from the stream exceeds a predefined threshold. This can be configured by the service teams on Datadog.

Fig 4. Datadog’s built-in Anomaly Detection function triggers alerts whenever a data point exceeds a threshold

These alerts are then sent to a Slack channel where the Data team is informed when a data point of interest starts throwing anomalous values.

Fig 5. Datadog integration with Slack to help alert users

Impact

Since the deployment of this data observability tool, we have seen significant improvement in the detection of anomalous values. If there are any anomalies or issues, we now get alerts within the same day (or hour) instead of days to weeks later.

Organising the alerts according to source streams has also helped simplify the monitoring load and allows users to quickly narrow down and identify which pipeline has failed.

What’s next?

At the moment, this data observability tool is only implemented on selected checkpoints in GrabDefence. We plan to expand the observability tool’s coverage to include more checkpoints, and continue to refine the workflows to detect and resolve these data issues.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

GitHub Availability Report: March 2024

Post Syndicated from Jakub Oleksy original https://github.blog/2024-04-10-github-availability-report-march-2024/

In March, we experienced two incidents that resulted in degraded performance across GitHub services.

March 15 19:42 UTC (lasting 42 minutes)

On March 15, GitHub experienced service degradation from 19:42 to 20:24 UTC due to a regression in the permissions system. This regression caused failures in GitHub Codespaces, GitHub Actions, and GitHub Pages. The problem stemmed from a framework upgrade that introduced MySQL query syntax that is incompatible with the database proxy service used in some production clusters. GitHub responded by rolling back the deployment and fixing a misconfiguration in development and CI environments to prevent similar issues in the future.

March 11 22:45 UTC (lasting 2 hours and 3 minutes)

On March 11, GitHub experienced service degradation from 22:45 to 00:48 UTC due to an inadvertent deployment of network configuration to the wrong environment. This led to intermittent errors in various services, including API requests, GitHub Copilot, GitHub secret scanning, and 2FA using GitHub Mobile. The issue was detected within 4 minutes, and a rollback was initiated immediately. The majority of impact was mitigated by 22:54 UTC. However, the rollback failed in one data center due to system-created configuration records missing a required field, causing 0.4% of requests to continue failing. Full rollback was successful after manual intervention to correct the configuration data, enabling full service restoration by 00:48 UTC. GitHub has implemented measures for safer configuration changes, such as prevention and automatic cleanup of obsolete configuration and faster issue detection, to prevent similar issues in the future.


Please follow our status page for real-time updates on status changes and post-incident recaps. To learn more about what we’re working on, check out the GitHub Engineering Blog.

The post GitHub Availability Report: March 2024 appeared first on The GitHub Blog.

4 ways GitHub engineers use GitHub Copilot

Post Syndicated from Holger Staudacher original https://github.blog/2024-04-09-4-ways-github-engineers-use-github-copilot/

Just recently, I was coding a new feature for GitHub Copilot Chat. My task was to enable the chat to recognize a user’s project dependencies, allowing it to provide magical answers when the user poses a question. While I could have easily listed the project dependencies and considered the task complete, I knew that to extract top-notch responses from these large language models, I needed to be careful to not overload the prompt to avoid confusing the model by providing too much context. This meant pre-processing the dependency list and selecting the most relevant ones to include in the chat prompt.

Creating machine-processable formats for the most prominent frameworks across various programming languages would have consumed days. It was during this time that I experienced one of those “Copilot moments.”

I simply queried the chat in my IDE:

Look at the data structure I have selected and create at least 10 examples that conform to the data structure. The data should cover the most prominent frameworks for the Go programming language.

Voilà, there it was: my initial batch of machine-processable dependencies. Just 30 minutes later, I had amassed a comprehensive collection of significant dependencies for nearly all supported languages, complete with parameterized unit tests. Completing a task that would likely have taken days without GitHub Copilot, in just 30 minutes, was truly remarkable.

This led me to ponder: what other “Copilot moments” might my colleagues here at GitHub have experienced? Thus, here are a few ways we use GitHub Copilot at GitHub.

1. Semi-automating repetitive tasks

Semi-automating repetitive tasks is a topic that resonates with a colleague of mine from another team. He mentions that they are tasked with developing and maintaining several live services, many of which utilize protocol buffers for data communication. During maintenance, they often encounter a situation where they need to increment ID numbers in the protobuf definitions, as illustrated in the code snippet below:

google.protobuf.StringValue fetcher = 130
  [(opts.cts_opt)={src:"Properties" key:"fetcher"}];
google.protobuf.StringValue proxy_enabled = 131
  [(opts.cts_opt)={src:"Properties" key:"proxy_enabled"}];
google.protobuf.StringValue proxy_auth = 132
  [(opts.cts_opt)={src:"Properties" key:"proxy_auth"}];

He particularly appreciates having GitHub Copilot completions in the editor for these tasks. It serves as a significant time saver, eliminating the need to manually generate ID numbers. Instead, one can simply tab through the completion suggestions until the task is complete.

2. Avoid getting sidetracked

Here’s another intriguing use case I heard about from a colleague. He needed to devise a regular expression to capture a Markdown code block and extract the language identifier. Fully immersed in his work, he preferred not to interrupt his flow by switching to chat, even though it could have provided a solution. Instead, he employed a creative approach by formalizing his task in a code comment:

// The string above contains a code block with a language identifier.
// Create a regexp that matches the code block and captures the language identifier.
// Use tagged capture groups for the language and the code.

This prompted GitHub Copilot to generate the regular expression as the subsequent statement in his editor:

const re = /```(?<lang>\w+)(?<code>[\s\S]+?)```/;

With the comment deleted, the task was swiftly accomplished!
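
To illustrate how the generated expression behaves (this usage snippet is mine, not part of his code), the named capture groups can be read straight off the match result:

// Example usage of the generated regexp; the sample Markdown string is made up.
const re = /```(?<lang>\w+)(?<code>[\s\S]+?)```/;

const markdown = "Some text\n```ts\nconsole.log('hello');\n```\nmore text";
const match = markdown.match(re);

if (match?.groups) {
  console.log(match.groups.lang); // "ts"
  console.log(match.groups.code); // "\nconsole.log('hello');\n"
}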

Screenshot of a GitHub engineer prompting GitHub Copilot, and GitHub Copilot returning a useful response

3. Structuring unstructured notes

During a pleasant coffee chat, one of our support engineers shared an incident she and a colleague had experienced the week before. It was a Friday afternoon, and they were attempting to troubleshoot an issue for a specific customer. Eventually, they pinpointed the solution by collecting various notes in VS Code. At GitHub, we prioritize remote collaboration, so merely resolving the task wasn’t sufficient; it was also essential to inform our colleagues about the process to ensure the best possible experience for future customer requests. Consequently, even after completing this exhausting task, they still needed to document how they arrived at the solution.

She initiated GitHub Copilot Chat and simply typed something along the lines of, “Organize my notes, structure them, and compile the data in the editor into Markdown tables.” Within seconds, the task was completed, allowing them to commence their well-deserved weekend.

4. Exploring and learning

Enhancing existing skills and acquiring new ones are integral aspects of every engineer’s journey. John Berryman, a colleague of mine, took on the challenge of leveraging GitHub Copilot to tackle a non-trivial coding task in a completely unfamiliar programming language. His goal was to delve into Rust, so on a Sunday he embarked on this endeavor with the assistance of GitHub Copilot Chat. The task he set out to accomplish was to develop a program capable of converting any numerical input into its written English equivalent. While it initially seemed straightforward, the task presented various complexities, such as handling teen numbers, naming conventions for tens, placement of “and” in the output, and more.

Twenty-three minutes and nine seconds later, he successfully produced a functional version written in Rust, despite having no prior experience with the language. Notably, he documented his entire process, recording himself throughout the endeavor.
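
His program was written in Rust, and the recording below shows how he got there. Purely to illustrate the kind of edge cases he mentions (teens, the names of the tens, and where the “and” goes), a stripped-down sketch of the core logic might look like this in TypeScript; this is my own approximation, not his code:

// Illustrative sketch: spell out whole numbers up to the thousands in English.
const ones = ["zero","one","two","three","four","five","six","seven","eight","nine",
  "ten","eleven","twelve","thirteen","fourteen","fifteen","sixteen","seventeen","eighteen","nineteen"];
const tens = ["","","twenty","thirty","forty","fifty","sixty","seventy","eighty","ninety"];

// Spell out 0..999, inserting "and" between the hundreds and the remainder.
function belowThousand(n: number): string {
  if (n < 20) return ones[n];
  if (n < 100) {
    const rest = n % 10;
    return tens[Math.floor(n / 10)] + (rest ? "-" + ones[rest] : "");
  }
  const rest = n % 100;
  return ones[Math.floor(n / 100)] + " hundred" + (rest ? " and " + belowThousand(rest) : "");
}

function numberToEnglish(n: number): string {
  if (n < 1000) return belowThousand(n);
  const rest = n % 1000;
  return belowThousand(Math.floor(n / 1000)) + " thousand" +
    (rest ? (rest < 100 ? " and " : " ") + belowThousand(rest) : "");
}

console.log(numberToEnglish(342));  // "three hundred and forty-two"
console.log(numberToEnglish(4017)); // "four thousand and seventeen"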

Berryman uses an older, experimental version of GitHub Copilot to write a program in Rust.

Your very own GitHub Copilot moment

I found it incredibly enlightening to discover how my fellow Hubbers utilize GitHub Copilot, and their innovative approaches inspired me to incorporate some of their ideas into my daily workflows. If you’re eager to explore GitHub Copilot firsthand, getting started is a breeze. Simply install it into your preferred editor and ask away.
