Tag Archives: Engineering

Graph concepts and applications

Post Syndicated from Grab Tech original https://engineering.grab.com/graph-concepts

Introduction

In an introductory article, we talked about the importance of Graph Networks in fraud detection. In this article, we will be adding some further context on graphs, graph technology and some common use cases.

Connectivity is the most prominent feature of today’s networks and systems. From molecular interactions, social networks and communication systems to power grids, shopping experiences or even supply chains, the networks behind real-world systems are neither random nor static: their connections carry structure and can look different at different points in time. Simple statistical analysis is insufficient to effectively characterise, let alone forecast, networked system behaviour.

As the world becomes more interconnected and systems become more complex, it is more important to employ technologies that are built to take advantage of relationships and their dynamic properties. There is no doubt that graphs have sparked a lot of attention because they are seen as a means to get insights from related data. Graph theory-based approaches show the concepts underlying the behaviour of massively complex systems and networks.

What are graphs?

Graphs are mathematical models frequently used in network science, a field whose tools can be applied to almost any subject. To put it simply, graphs are mathematical representations of complex systems.

Origin of graphs

The first graph was drawn in 1736 for the city of Königsberg, now known as Kaliningrad, Russia. The city comprised two islands and two mainland sections, connected to one another by seven bridges.

Famed mathematician Euler wanted to plot a journey through the entire city by crossing each bridge only once. Euler abstracted the four regions of the city into nodes and the seven bridges into edges, and demonstrated that the problem was unsolvable. A simplified abstract graph is shown in Fig 1.

Fig 1 Abstraction graph

The graph’s four dots represent Königsberg’s four zones, while the lines represent the seven bridges that connect them. A zone connected by an even number of bridges can be passed through freely, because every entry can be paired with an exit. A zone connected by an odd number of bridges can only serve as a starting or ending point, because each bridge may be crossed only once.

The number of edges associated with a node is known as the node degree. The Königsberg walk would only be possible if at most two nodes had an odd degree and the rest had an even degree; in other words, exactly two regions (or none at all) could have an odd number of bridges while the rest had an even number. However, as illustrated in Fig 1, every Königsberg region is touched by an odd number of bridges, rendering the problem unsolvable.

Definition of graphs

A graph is a structure that consists of vertices and edges. Vertices, or nodes, are the objects in a problem, while edges are the links that connect vertices in a graph.  

Vertices are the fundamental elements that a graph requires to function; there should be at least one in a graph. Vertices are mathematical abstractions that refer to objects that are linked by a condition.

On the other hand, edges are optional as graphs can still be defined without any edges. An edge is a link or connection between any two vertices in a graph, including a connection between a vertex and itself. An edge between two vertices indicates that a relationship exists between them.

We usually indicate V={v1, v2, …, vn} as the set of vertices, and E = {e1, e2, …, em} as the set of edges. From there, we can define a graph G as a structure G(V, E) which models the relationship between the two sets:

Fig 2 Graph structure

It is worth noting that the order of the two sets within parentheses matters, because we usually express the vertices first, followed by the edges. A graph H(X, Y) is therefore a structure that models the relationship between the set of vertices X and the set of edges Y, not the other way around.
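To make the definition and the earlier degree argument concrete, here is a minimal Python sketch (an illustration, not part of the original article) that stores the Königsberg graph as a vertex set V and an edge list E, computes each node’s degree, and checks Euler’s condition for a walk that crosses every bridge exactly once.

from collections import Counter

# Vertices: the four land masses of Königsberg
V = {"north_bank", "south_bank", "kneiphof", "east_island"}

# Edges: the seven bridges (a multigraph, so repeated pairs are allowed)
E = [
    ("north_bank", "kneiphof"), ("north_bank", "kneiphof"),
    ("south_bank", "kneiphof"), ("south_bank", "kneiphof"),
    ("north_bank", "east_island"),
    ("south_bank", "east_island"),
    ("kneiphof", "east_island"),
]

# Node degree = number of edges touching the node
degree = Counter()
for u, v in E:
    degree[u] += 1
    degree[v] += 1

odd_nodes = [n for n in V if degree[n] % 2 == 1]
print(dict(degree))    # every land mass has an odd degree (3, 3, 3, 5)
print(len(odd_nodes))  # 4 odd-degree nodes -> no walk crossing each bridge exactly once

An Eulerian path exists only when zero or two vertices have odd degree, which is exactly the condition the Königsberg graph fails.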

Graph data model

Now that we have covered graphs and their typical components, let us move on to graph data models, which help to translate a conceptual view of your data to a logical model. Two common graph data formats are Resource Description Framework (RDF) and Labelled Property Graph (LPG).

Resource Description Framework (RDF)

RDF is typically used for metadata and facilitates standardised exchange of data based on their relationships. An RDF statement is a triple: a subject, a predicate, and an object. A collection of such triples is an RDF graph. This can be depicted as a diagram of nodes and directed edges, with each triple forming a node-edge-node structure, as shown in Fig 3.

Fig 3 RDF graph

The three types of nodes that can exist are:

  • Internationalised Resource Identifiers (IRI) – online resource identification code.
  • Literals – data type value, i.e. text, integer, etc.
  • Blank nodes – have no identification; similar to anonymous or existential variables.

Let us use an example to illustrate this. We have a person with the name Art and we want to plot all his relationships. In this case, the IRI is http://example.org/art and this can be shortened by defining a prefix like ex.

In this example, the IRI http://xmlns.com/foaf/0.1/knows defines the relationship knows. We define foaf as the prefix for http://xmlns.com/foaf/0.1/. The following code snippet shows how a graph like this will look.

@prefix foaf: <http://xmlns.com/foaf/0.1/> .
@prefix ex: <http://example.org/> .

ex:art foaf:knows ex:bob .
ex:art foaf:knows ex:bea .
ex:bob foaf:knows ex:cal .
ex:bob foaf:knows ex:cam .
ex:bea foaf:knows ex:coe .
ex:bea foaf:knows ex:cory .
ex:bea foaf:age 23 .
ex:bea foaf:based_near _:o1 .

In the last two lines, you can see how a literal and a blank node are depicted in an RDF graph. The object of foaf:age is a literal node with the integer value 23, while the object of foaf:based_near is a blank node, _:o1, representing an anonymous spatial entity; the underscore prefix marks it as a blank node. Outside the context of this graph, the identifier o1 has no meaning.

Multiple IRIs, intended for use in RDF graphs, are typically stored in an RDF vocabulary. These IRIs often begin with a common substring known as a namespace IRI. In some cases, namespace IRIs are also associated with a short name known as a namespace prefix. In the example above, http://xmlns.com/foaf/0.1/ is the namespace IRI and foaf and ex are namespace prefixes.
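As an illustration of how triples, prefixes, literals, and blank nodes fit together, the sketch below rebuilds the same graph with the rdflib Python library; the choice of rdflib is our own assumption and not part of the original example.

from rdflib import Graph, Namespace, Literal, BNode

EX = Namespace("http://example.org/")
FOAF = Namespace("http://xmlns.com/foaf/0.1/")

g = Graph()
g.bind("ex", EX)      # register the "ex" namespace prefix
g.bind("foaf", FOAF)  # register the "foaf" namespace prefix

g.add((EX.art, FOAF.knows, EX.bob))
g.add((EX.art, FOAF.knows, EX.bea))
g.add((EX.bea, FOAF.age, Literal(23)))         # literal object
g.add((EX.bea, FOAF.based_near, BNode("o1")))  # blank node object

# Who does Art know?
for friend in g.objects(EX.art, FOAF.knows):
    print(friend)

# Serialise back to Turtle to see the prefixes applied
print(g.serialize(format="turtle"))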

Note: RDF graphs are considered atemporal as they provide a static snapshot of data. They can use appropriate language extensions to communicate information about events or other dynamic properties of entities.

An RDF dataset is a set of RDF graphs that includes one or more named graphs as well as exactly one default graph. A default graph is one that can be empty, and has no associated IRI or name, while each named graph has an IRI or a blank node corresponding to the RDF graph and its name. If there is no named graph specified in a query, the default graph is queried (hence its name).

Labelled Property Graph (LPG)

A labelled property graph is made up of nodes, links, and properties. Each node is given a label and a set of characteristics in the form of arbitrary key-value pairs. The keys are strings, and the values can be any data type. A relationship is then defined by adding a directed edge that is labelled and connects two nodes with a set of properties.

In Fig 4, we have an LPG with two nodes, art and bea, connected by a knows edge. The bea node has two properties, age and based_near, while the knows edge carries the property since, recording the year that art and bea first met.

Fig 4 Labelled Property Graph: Example 1

Nodes, edges and properties must be defined when designing an LPG data model. In this scenario, based_near might not apply to every vertex, but it still has to be defined in the model. You might be wondering: why not represent the city Seattle as a node and add an edge labelled based_near that connects a person and the city?

In general, if a value is linked to a large number of other nodes in the network and requires additional properties to relate it to those nodes, it should be represented as a node. In this scenario, the model shown in Fig 5 is more appropriate for traversing based_near connections. It also gives us the ability to attach new attributes to the based_near relationship.

Fig 5 Labelled Property Graph: Example 2
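To sketch this modelling choice in code, the snippet below uses the networkx Python library as a stand-in for a property graph store; the labels, the since value, and networkx itself are illustrative assumptions rather than an actual LPG database.

import networkx as nx

g = nx.MultiDiGraph()

# Nodes carry a label and arbitrary key-value properties
g.add_node("art", label="Person", name="Art")
g.add_node("bea", label="Person", name="Bea", age=23)
g.add_node("seattle", label="City", name="Seattle")

# Relationships are directed, labelled edges that can carry properties too
g.add_edge("art", "bea", key="knows", since=2010)
g.add_edge("art", "seattle", key="based_near")
g.add_edge("bea", "seattle", key="based_near")

# Traverse the based_near relationship: everyone based near Seattle
for person, _, rel in g.in_edges("seattle", keys=True):
    if rel == "based_near":
        print(g.nodes[person]["name"])

Because Seattle is a node rather than a property value, new attributes (population, time zone, and so on) can later be attached to the city or to each based_near edge without touching the person nodes.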

Now that we have the context of graphs, let us talk about graph databases, how they help with large data queries and the part they play in Graph Technology.

Graph database

A graph database is a type of NoSQL database that stores data using network topology. The idea is derived from LPG, which represents data sets with vertices, edges, and attributes.

  • Vertices are instances or entities of data that represent any object to be tracked, such as people, accounts, locations, etc.
  • Edges are the critical concepts in graph databases which represent relationships between vertices. The connections have a direction that can be unidirectional (one-way) or bidirectional (two-way).
  • Properties represent descriptive information associated with vertices. In some cases, edges have properties as well.

Graph databases provide a more conceptual view of data that is closer to reality. Modelling complex linkages becomes simpler because interconnections between data points are given the same weight as the data itself.

Graph database vs. relational database

Relational databases are currently the industry norm and take a structured approach to data, usually in the form of tables. On the other hand, graph databases are agile and focus on immediate relationship understanding. Neither type is designed to replace the other, so it is important to know what each database type has to offer.

Fig 6 Graph database vs relational database

There is a domain for both graph and relational databases. Graph databases outperform typical relational databases, especially in use cases involving complicated relationships, as they take a more naturalistic and flowing approach to data.

The key distinctions between graph and relational databases are summarised in the following table:

Type            | Graph                                       | Relational
Format          | Nodes and edges with properties             | Tables with rows and columns
Relationships   | Represented with edges between nodes        | Created using foreign keys between tables
Flexibility     | Flexible                                    | Rigid
Complex queries | Quick and responsive                        | Requires complex joins
Use case        | Systems with highly connected relationships | Transaction-focused systems with more straightforward relationships

Table 1. Graph vs. relational databases
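The “Complex queries” row is easiest to see in a small sketch: when relationships are stored as adjacency, a friends-of-friends query is just a couple of lookups per neighbour, whereas a relational model would answer the same question with self-joins over a table whose cost grows with its size. The data below is purely illustrative.

# Adjacency-list view of a small social graph: relationships are first-class
follows = {
    "alice": {"bob", "cara"},
    "bob": {"dan"},
    "cara": {"dan", "eve"},
    "dan": set(),
    "eve": set(),
}

def friends_of_friends(graph, user):
    direct = graph.get(user, set())
    fof = set()
    for friend in direct:
        # One extra hop is one more lookup per neighbour, not a table scan
        fof |= graph.get(friend, set())
    return fof - direct - {user}

print(friends_of_friends(follows, "alice"))  # {'dan', 'eve'}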

Advantages and disadvantages

Every database type has its advantages and disadvantages; knowing the distinctions as well as potential options for specific challenges is crucial. Graph databases are a rapidly evolving technology with improved functions compared with other database types.

Advantages

Some advantages of graph databases include:

  • Agile and flexible structures.
  • Explicit relationship representation between entities.
  • Real-time query output – speed depends on the number of relationships.

Disadvantages

The general disadvantages of graph databases are:

  • No standardised query language; depends on the platform used.
  • Not suitable for transactional-based systems.
  • Small user base, making it hard to find troubleshooting support.

Graph technology

Graph technology is the next step in improving analytics delivery. As data volumes expand, traditional analytics is insufficient to address complicated business operations, distribution, and analytical concerns.

Graph technology aids in the discovery of unknown correlations in data that would otherwise go undetected or unanalysed. When the term graph is used to describe a topic, three distinct concepts come to mind: graph theory, graph analytics, and graph data management.

  • Graph theory – A mathematical framework for finding paths, linkages, and networks of logical or physical objects, as well as their relationships. Can be used to model molecules, telephone lines, transport routes, manufacturing processes, and many other things.
  • Graph analytics – The application of graph theory to uncover nodes, edges, and data linkages that may be assigned semantic attributes. Can examine potentially interesting connections in data found in traditional analysis solutions, using node and edge relationships.
  • Graph database – A type of storage for data generated by graph analytics. Filling a knowledge graph, which is a model in data that indicates a common usage of acquired knowledge or data sets expressing a frequently held notion, is a typical use case for graph analytics output.

While the architecture and terminology are sometimes misunderstood, graph analytics’ output can be viewed through visualisation tools, knowledge graphs, particular applications, and even some advanced dashboard capabilities of business intelligence tools. All three concepts above are frequently used to improve system efficiency and even to assist in dynamic data management. In this approach, graph theory and analysis are inextricably linked, and analysis may always rely on graph databases.

Graph-centric user stories

Fraud detection

Traditional fraud prevention methods concentrate on discrete data points such as individual accounts, devices, or IP addresses. However, today’s sophisticated fraudsters avoid detection by building fraud rings using stolen and fake identities. To detect such fraud rings, we need to look beyond individual data points to the linkages that connect them.

Graph technology greatly transcends the capabilities of a relational database by revealing hard-to-find patterns. Enterprise businesses also employ Graph technology to supplement their existing fraud detection capabilities and tackle a wide range of financial crimes, including first-party bank fraud and money laundering.
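As a rough illustration of the idea (not Grab’s or any vendor’s actual detection logic), the sketch below links accounts to the devices they log in from and flags connected components containing more than one account; shared infrastructure across seemingly unrelated accounts is a typical fraud-ring signal.

import networkx as nx

# Bipartite graph: accounts connected to the devices they log in from
edges = [
    ("acct_1", "device_A"), ("acct_2", "device_A"),  # two accounts share a device
    ("acct_2", "device_B"), ("acct_3", "device_B"),
    ("acct_4", "device_C"),                          # an isolated, ordinary account
]

g = nx.Graph()
g.add_edges_from(edges)

# Accounts that fall into the same connected component share infrastructure
for component in nx.connected_components(g):
    accounts = {n for n in component if n.startswith("acct_")}
    if len(accounts) > 1:
        print("possible fraud ring:", sorted(accounts))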

Real-time recommendations

An online business’s success depends on systems that can generate meaningful recommendations in real time. To do so, we need the capacity to correlate product, customer, inventory, supplier, logistical, and even social sentiment data in real time. Furthermore, a real-time recommendation engine must be able to record any new interests displayed during the consumer’s current visit in real time, which batch processing cannot do.

Graph databases outperform relational and other NoSQL data stores in terms of delivering real-time suggestions. Graph databases can easily integrate different types of data to get insights into consumer requirements and product trends, making them an increasingly popular alternative to traditional relational databases.
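As a toy illustration (made-up data, not a production recommender), the same traversal idea drives “people who ordered this also ordered” suggestions:

from collections import Counter

orders = {
    "u1": {"laksa", "kopi"},
    "u2": {"laksa", "roti_prata"},
    "u3": {"laksa", "roti_prata", "teh"},
}

def also_ordered(item):
    # Hop item -> users who ordered it -> other items those users ordered
    scores = Counter()
    for basket in orders.values():
        if item in basket:
            scores.update(basket - {item})
    return scores.most_common()

print(also_ordered("laksa"))  # [('roti_prata', 2), ('kopi', 1), ('teh', 1)]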

Supply chain management

With complicated scenarios like supply chains, there are many different parties involved and companies need to stay vigilant in detecting issues like fraud, contamination, high-risk areas or unknown product sources. This means that there is a need to efficiently process large amounts of data and ensure transparency throughout the supply chain.

To have a transparent supply chain, relationships between each product and party need to be mapped out, which means there will be deep linkages. Graph databases are great for these as they are designed to search and analyse data with deep links. This means they can process enormous amounts of data without performance issues.

Identity and access management

Managing multiple changing roles, groups, products and authorisations can be difficult, especially in large organisations. Graph technology integrates your data and allows quick and effective identity and access control. It also allows you to track all identity and access authorisations and inheritances with significant depth and real-time insights.

Network and IT operations

Because of the scale and complexity of network and IT infrastructure, you need a configuration management database (CMDB) that is far more capable than relational databases. Amazon Neptune is one example of a graph database that can power a CMDB, allowing you to correlate your network, data centre, and IT assets to aid troubleshooting, impact analysis, and capacity or outage planning.

A graph database allows you to integrate various monitoring tools and acquire important insights into the complicated relationships that exist between various network or data centre processes. Possible applications of graphs in network and IT operations range from dependency management to automated microservice monitoring.

Risk assessment and monitoring

Risk assessment is crucial in the fintech business. With multiple sources of credit data such as ecommerce sites, mobile wallets and loan repayment records, it can be difficult to accurately assess an individual’s credit risk. Graph Technology makes it possible to combine these data sources, quantify an individual’s fraud risk and even generate full credit reviews.

One clear example of this is IceKredit, which employs artificial intelligence (AI) and machine learning (ML) techniques to make better risk-based decisions. With Graph technology, IceKredit has also successfully detected unreported links and increased efficiency of financial crime investigations.

Social network

Whether you’re using stated social connections or inferring links based on behaviour, social graph databases like Neptune introduce possibilities for building new social networks or integrating existing social graphs into commercial applications.

Having a data model that is identical to your domain model allows you to better understand your data, communicate more effectively, and save time. By decreasing the time spent data modelling, graph databases increase the quality and speed of development for your social network application.

Artificial intelligence (AI) and machine learning (ML)

AI and ML use statistical and analytical approaches to find patterns in data and provide insights. However, there are two prevalent concerns that arise – the quality of data and effectiveness of the analytics. Some AI and ML solutions have poor accuracy because there is not enough training data or variants that have a high correlation to the outcome.

These ML data issues can be solved with graph databases as it’s possible to connect and traverse links, as well as supplement raw data. With Graph technology, ML systems can treat each column as a feature and each connection as a distinct characteristic, and can then identify patterns in the data and train themselves to recognise these relationships.
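One common pattern, sketched below with the networkx Python library purely as an assumption, is to derive structural features such as degree, PageRank, and clustering coefficient from the graph and feed them to a downstream model alongside the usual tabular features.

import networkx as nx

g = nx.Graph()
g.add_edges_from([("u1", "u2"), ("u1", "u3"), ("u2", "u3"), ("u3", "u4")])

pagerank = nx.pagerank(g)      # global importance of each node
clustering = nx.clustering(g)  # how tightly knit each node's neighbourhood is

# Turn graph structure into per-node features for a downstream model
features = {
    node: {
        "degree": g.degree(node),
        "pagerank": pagerank[node],
        "clustering": clustering[node],
    }
    for node in g.nodes
}
print(features["u3"])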

Conclusion

Graphs are a great way to visually represent complex systems and can be used to easily detect patterns or relationships between entities. To help improve graphs’ ability to detect patterns early, businesses should consider using Graph technology, which is the next step in improving analytics delivery.

Graph technology typically consists of:

  • Graph theory – Used to find paths, linkages and networks of logical or physical objects.
  • Graph analytics – Application of graph theory to uncover nodes, edges, and data linkages.
  • Graph database – Storage for data generated by graph analytics.

Although predominantly used in fraud detection, Graph technology has many other use cases such as making real-time recommendations based on consumer behaviour, identity and access control, risk assessment and monitoring, AI and ML, and many more.

Check out our next blog article, where we will be talking about how our Graph Visualisation Platform enhances Grab’s fraud detection methods.

References

  1. https://www.baeldung.com/cs/graph-theory-intro
  2. https://web.stanford.edu/class/cs520/2020/notes/What_Are_Graph_Data_Models.html

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

GitHub Availability Report: May 2022

Post Syndicated from Jakub Oleksy original https://github.blog/2022-06-01-github-availability-report-may-2022/

In May, we experienced three distinct incidents that resulted in significant impact and degraded state of availability to multiple services across GitHub.com. This report also sheds light on the billing incident that impacted GitHub Actions and Codespaces users in April.

May 20 09:44 UTC (lasting 49 minutes)

During this incident, our alerting systems detected increased CPU utilization on one of the GitHub Container registry databases. When we received the alert, we immediately began investigating. Because of the preemptive monitoring added after the previous incident in April (at 8:59 UTC), the on-call engineer was already monitoring the service and prepared to run mitigations for this incident.

As CPU utilization on the database continued to rise, the Container registry began responding to requests with increased latency, followed by an internal server error for a percentage of requests. At this point we knew there was customer impact and changed the public status of the service. This increased CPU activity was due to a high volume of the “Put Manifest” command. Other package registries were not impacted.

The root cause of the elevated CPU utilization was that the throttling criteria configured on the API side for this command were too permissive, and a database query proved non-performant at that scale. This caused an outage for anyone using the GitHub Container registry: users experienced latency when pushing or pulling packages, as well as slow access to the packages UI.

To limit impact, we throttled requests from all organizations/users, and to restore normal operation, we reset our database state by restarting our front-end servers and then the database.

To avoid this in the future, we have added separate rate limiting for operation types from specific organizations/users and will continue working on performance improvements for SQL queries.
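The report does not describe the implementation, but the general shape of per-caller, per-operation throttling can be sketched as a token bucket keyed by (organization, operation); the rates, names, and structure below are illustrative assumptions only.

import time
from collections import defaultdict

class TokenBucket:
    def __init__(self, rate, capacity):
        self.rate = rate          # tokens refilled per second
        self.capacity = capacity  # maximum burst size
        self.tokens = capacity
        self.updated = time.monotonic()

    def allow(self):
        now = time.monotonic()
        self.tokens = min(self.capacity, self.tokens + (now - self.updated) * self.rate)
        self.updated = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False

# Separate limits per (organization, operation), so one heavy caller issuing
# "put_manifest" cannot exhaust capacity shared with everyone else
buckets = defaultdict(lambda: TokenBucket(rate=5, capacity=10))

def handle(org, operation):
    if not buckets[(org, operation)].allow():
        return 429  # Too Many Requests
    return 200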

May 27 04:26 UTC (lasting 21 minutes)

Our alerting systems detected degraded availability for API requests during this time. Due to the recency of these incidents, we are still investigating the contributing factors and will provide a more detailed update on the causes and remediations in the June Availability Report, which will be published the first Wednesday of July.

May 27 07:36 UTC (lasting 1 hour and 21 minutes)

During this incident, services including GitHub Actions, API Requests, Codespaces, Git Operations, Issues, GitHub Packages, GitHub Pages, Pull Requests, and Webhooks were impacted. As we continue to investigate the contributing factors, we will provide a more detailed update in the June Availability Report. We will also share more about our efforts to minimize the impact of similar incidents in the future.

Follow up to April 14 20:35 UTC (lasting 4 hours and 53 minutes)

As we mentioned in the April Availability Report, we are now providing a more detailed update on this incident following further investigation.

On April 14, GitHub Actions and Codespaces customers started reporting incorrect charges for metered services shown in their GitHub billing settings. As a result, customers were hitting their GitHub spending limits and unable to run new Actions or create new Codespaces. We immediately started an incident bridge. Our first step was to unblock all customers by giving unlimited Actions and Codespaces usage for no additional charge during the time of this incident.

From looking at the timing and list of recently pushed changes, we determined that the issue was caused by a code change in the metered billing pipeline. When attempting to improve performance of our metered usage processor, Actions and Codespaces minutes were mistakenly multiplied by 1,000,000,000 to convert gigabytes into bytes when this was not necessary for these products. This was due to a change to shared metered billing code that was not thought to impact these products.
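This class of bug, a unit conversion in shared code applied to products that do not need it, can be guarded against by having each metered product declare its own unit scale. The sketch below is a hypothetical illustration with made-up product names, not GitHub’s billing pipeline.

# Each metered product declares its own unit, so the GB-to-bytes factor is
# applied only where it belongs instead of in shared code for every product.
UNIT_SCALE = {
    "storage_gb": 1_000_000_000,  # gigabytes -> bytes
    "actions_minutes": 1,         # minutes are billed as-is
    "codespaces_minutes": 1,
}

def normalise_usage(product_unit, quantity):
    try:
        return quantity * UNIT_SCALE[product_unit]
    except KeyError:
        raise ValueError(f"unknown billing unit: {product_unit}")

assert normalise_usage("actions_minutes", 30) == 30
assert normalise_usage("storage_gb", 2) == 2_000_000_000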

To fix the issue, we reverted the code change and started repairing the corrupted data recorded during the incident. We did not re-enable metered billing for GitHub products until we had repaired the incorrect billing data, which happened 24 hours after this incident.

To prevent this incident in the future, we added a Rubocop (Ruby static code analyzer) rule to block pull requests containing non-safe billing code updates. In addition, we added anomaly monitoring for the billed quantity, so that next time we are alerted before customers are impacted. We also tightened the release process to require a feature flag and end-to-end test when shipping such changes.

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. To receive real-time updates on status changes, please follow our status page. You can also learn more about what we’re working on on the GitHub Engineering Blog.

Automated Experiment Analysis – Making experimental analysis scalable

Post Syndicated from Grab Tech original https://engineering.grab.com/automated-experiment-analysis

Introduction

Trustworthy experiments are key to making sound decisions, so analysts and data scientists put a lot of effort into analysing them and driving business impact. An extension of Grab’s Experimentation (GrabX) platform, Automated Experiment Analysis is one of Grab’s data products that helps automate statistical analyses of experiments. It also provides automatic experimental data pipelines and customised tests for different types of experiments.

Designed to help Grab in its journey of innovation and data-driven decision making, the data product helps to:

  1. Standardise and automate the basic experiment analysis process on Grab experiments.
  2. Ensure post-experiment results are reproducible under a company-wide standard and easily reviewed by peers.
  3. Democratise the institutional knowledge of experimentation across functions.

Background

Today, the GrabX platform provides the ability to define, configure, and execute online controlled experiments (OCEs), often called A/B tests, to gather trustworthy data and make data-driven decisions about how to improve our products.

Before the automated analysis, each experiment was analysed manually on an ad-hoc basis. This manual and federated model brings in several challenges at the company level:

  1. Inefficiency: The repetitive nature of data pipeline building and basic post-experiment analyses incurs large costs and eats into the bandwidth analysts could spend on deeper analyses.
  2. Lack of quality control: Risk of unstandardised, inaccurate or late results as the platform cannot exercise data-governance/control or extend offerings to Grab’s other entities.
  3. Lack of scalability and availability: GrabX users have varied backgrounds and skills, making their approaches to experiments different and not easily transferable or shared. For example, some teams may use more advanced techniques to speed up their experiments without using too many resources, but these techniques are not transferable without considerable training.

Solution

Architecture details

Architecture diagram

When users set up experiments on GrabX, they can configure the success metrics they are interested in. These metrics configurations are then stored in the metadata as “bronze”, “silver”, and “gold” datasets depending on the corresponding step in the automated data pipeline process.

Metrics configuration and “bronze” datasets

In this project, we have developed a metrics glossary that stores information about what the metrics are and how they are computed. The metrics glossary is stored in Azure Cosmos DB and exposed as an API endpoint to GrabX, so users can pick from the list of available metrics. If a metric is not available, users can input their custom metrics definition.

This metrics selection, as an analysis configuration, is then stored as a “bronze” dataset in Azure Data Lake as metadata, together with the experiment configurations. Once the experiment starts, the data pipeline gathers all experiment subjects and their assigned experiment groups from our clickstream tracking system.

In this case, the experiment subject refers to the facets of the experiment. For example, if the experiment subject is a user, then the user will go through the same experience throughout the entire experimentation period.

Metrics computation and “silver” datasets

In this step, the metrics engine gathers all metrics data based on the metrics configuration and computes the metrics for each experiment subject. This computed data is then stored as a “silver” dataset and is the foundation dataset for all statistical analyses.

“Silver” datasets are then passed through the “Decision Engine” to get the final “gold” datasets, which contain the experiment results.

Results visualisation and “gold” datasets

In “gold” datasets, we have the result of the experiment, along with some custom messages we want to show our users. These are saved in sets of fact and dimension tables (typically used in star schemas).

For users to visualise the result on GrabX, we leverage embedded Power BI visualisation. We build the visualisation on top of a “gold” dataset and embed it into each experiment page with a fixed filter. By doing so, users get the end-to-end flow directly from GrabX.

Implementation

The implementation consists of four key engineering components:

  1. Analysis configuration setup
  2. A data pipeline
  3. Automatic analysis
  4. Results visualisation

Analysis configuration is part of the experiment setup process where users select success metrics they are interested in. This is an essential configuration for post-experiment analysis, in addition to the usual experiment configurations (e.g. sampling strategies).

It ensures that the reported experiment results will align with the hypothesis setup, which helps avoid one of the common pitfalls in OCEs [1].

There are three types of metrics available (an illustrative configuration sketch follows the list):

  1. Pre-defined metrics: These metrics are already defined in the Scribe datamart, e.g. Gross Merchandise Value (GMV) per pax.
  2. Event-based metrics: Users can specify an ad-hoc metric in the form of a funnel with event names for funnel start and end.
  3. Build your own metrics: Users have the flexibility to define a metric in the form of a SQL query.
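As an illustration only (the field names below are assumptions, not the platform’s actual schema), an analysis configuration covering all three metric types might look like this:

analysis_config = {
    "experiment_id": "exp_123",
    "metrics": [
        # 1. Pre-defined metric from the metrics glossary
        {"type": "predefined", "name": "gmv_per_pax"},
        # 2. Event-based metric defined as a funnel
        {
            "type": "event_funnel",
            "name": "search_to_order",
            "funnel_start": "search_submitted",
            "funnel_end": "order_created",
        },
        # 3. Build-your-own metric expressed as SQL
        {
            "type": "custom_sql",
            "name": "late_night_orders",
            "query": "SELECT subject_id, COUNT(*) AS value FROM orders WHERE hour >= 22 GROUP BY subject_id",
        },
    ],
}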

A data pipeline here mainly consists of data sourcing and data processing. We use Azure Data Factory to schedule the ETL pipelines that compute the metrics and statistical analyses. ETL jobs are written in Spark and run on Databricks.

Data pipelines are streamlined to the following steps (a sketch follows the list):

  1. Load experiments and metrics metadata, defined at the experiment creation stage.
  2. Load experiment and clickstream events.
  3. Load experiment assignments. An experiment assignment maps a randomisation unit ID to the corresponding experiment or variant IDs.
  4. Merge the data mentioned above for each experiment variant, and obtain sufficient data to do a deeper results analysis.
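A highly simplified PySpark sketch of steps 2 to 4 is shown below; the paths, column names, and metric are placeholders rather than the production pipeline.

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.getOrCreate()

assignments = spark.read.parquet("/lake/bronze/experiment_assignments")  # subject_id, variant_id
events = spark.read.parquet("/lake/bronze/clickstream_events")           # subject_id, event

# "Silver" dataset: one metric value per experiment subject
per_subject = (
    events.filter(F.col("event") == "order_created")
          .groupBy("subject_id")
          .agg(F.count("*").alias("orders"))
)

silver = assignments.join(per_subject, on="subject_id", how="left").fillna(0, subset=["orders"])

# Variant-level summary that feeds the statistical tests in the "Decision Engine"
summary = silver.groupBy("variant_id").agg(
    F.count("*").alias("n"),
    F.avg("orders").alias("mean"),
    F.stddev("orders").alias("std"),
)
summary.show()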

Automatic analysis uses an internal Python package, “Decision Engine”, which decouples the dataset and the statistical tests so that we can incrementally improve applications of advanced techniques. It provides a comprehensive set of test results at the variant level, including statistics, p-values, confidence intervals, and the test choices that correspond to the experiment configurations. It is a crowdsourced project that allows everyone to contribute what they believe should be included in fundamental post-experiment analysis.
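To give a flavour of the kind of output such an engine produces, here is a minimal sketch of a variant-level comparison using Welch’s t-test from SciPy; it is not the internal Decision Engine, and the data is made up.

import numpy as np
from scipy import stats

# Per-subject metric values for control and treatment (illustrative data)
control = np.array([4.1, 3.8, 5.0, 4.4, 3.9, 4.6])
treatment = np.array([4.9, 5.1, 4.7, 5.4, 4.8, 5.2])

t_stat, p_value = stats.ttest_ind(treatment, control, equal_var=False)  # Welch's t-test

diff = treatment.mean() - control.mean()
se = np.sqrt(treatment.var(ddof=1) / len(treatment) + control.var(ddof=1) / len(control))
ci_95 = (diff - 1.96 * se, diff + 1.96 * se)  # normal approximation of the 95% CI

print({"difference": diff, "p_value": p_value, "ci_95": ci_95})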

Results visualisation leverages Power BI, which is embedded in the GrabX UI, so users can run experiments and review the results on a single platform.

Impact

At the individual user level, Automated Experiment Analysis is designed to enable analysts and data scientists to associate metrics with experiments, and to present the experiment results in a standardised and comprehensive manner. It speeds up the decision-making process and frees up the bandwidth of analysts and data scientists to conduct deeper analyses.

At the user community level, it improves the efficiency of running experimental analysis by capturing all experiments, their results, and the launch decision within a single platform.

Learnings/Conclusion

Automated Experiment Analysis is the first building block to boost the trustworthiness of OCEs in Grab. Not all types of experiments are fully onboarded, and they might not need to be. Through this journey, we believe these key learnings would be useful for experimenters and platform teams:

  1. To standardise and simplify several experimental analysis steps, the infrastructure needs automated data pipelines, analytics tools, and a metrics store.
  2. The “Decision Engine” analytics tool should be decoupled from the other engineering components, so that it can be incrementally improved in future.
  3. To democratise knowledge and ensure service coverage, many components need a crowdsourcing feature, e.g. the metrics store has a build-your-own-metrics (BYOM) function, and “Decision Engine” is an internally open-sourced Python package.
  4. Tracking implementation is important. To standardise data pipelines and achieve scalability, we need to standardise the way we implement tracking.

What’s next?

A centralised metric store – We built a metric calculation dictionary, which currently contains around 30-40 basic business metrics, but its functionality is limited to the GrabX Experimentation use case.

If the metric store is expected to serve more general uses, it needs to be further enriched by allowing some “smarts”, e.g. fabric-agnostic metrics computations [2], other types of data slicing, and some considerations for real-time metrics or signals.

An end-to-end experiment guardrail – Currently, we provide automatic data analysis after an experiment is done, but no guardrail features at the earlier experiment stages, e.g. sampling strategy choices and sample size recommendations at the planning stage, and data quality checks during and after the experiment window. Without end-to-end guardrails, running experiments is very prone to pitfalls. We therefore plan to add some degree of automation to ensure experiments adhere to the standards used by the post-experiment analysis.

A more comprehensive analysis toolbox – The current state of the project mainly focuses on infrastructure development, so it starts with basic frequentist A/B testing approaches. In future versions, it can be extended to include sequential testing, CUPED [3], attribution analysis, Causal Forest, heterogeneous treatment effects, and more.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

References

  1. Dmitriev, P., Gupta, S., Kim, D. W., & Vaz, G. (2017, August). A dirty dozen: twelve common metric interpretation pitfalls in online controlled experiments. In Proceedings of the 23rd ACM SIGKDD international conference on knowledge discovery and data mining (pp. 1427-1436). 

  2. Metric computation for multiple backends, Craig Boucher, Ulf Knoblich, Dan Miller, Sasha Patotski, Amin Saied, Microsoft Experimentation Platform 

  3. Deng, A., Xu, Y., Kohavi, R., & Walker, T. (2013, February). Improving the sensitivity of online controlled experiments by utilising pre-experiment data. In Proceedings of the sixth ACM international conference on Web search and data mining (pp. 123-132). 

Action needed by GitHub Connect customers using GHES 3.1 and older to adopt new authentication token format updates

Post Syndicated from Jim Boyle original https://github.blog/2022-05-20-action-needed-by-github-connect-customers-using-ghes-3-1-and-older-to-adopt-new-authentication-token-format-updates/

As previously announced, the format of GitHub authentication tokens has changed. The following token types have been affected:

Due to the updated format of GitHub authentication tokens, GitHub Connect will no longer support instances running GitHub Enterprise Server (GHES) 3.1 or older after June 3, 2022. To continue using GitHub Connect, an upgrade to GHES 3.2 or newer will be required by June 3. GitHub Connect is necessary in order to use the latest features, such as Dependabot updates, license sync, GitHub.com Actions synchronization, and unified contributions.

GHES customers seeking an upgrade can follow the instructions outlined in Upgrading GitHub Enterprise Server.

Search architecture revamp

Post Syndicated from Grab Tech original https://engineering.grab.com/search-architecture-revamp

Background

Prior to 2021, Grab’s search architecture was designed to only support textual matching, which takes in a user query and looks for exact matches within the ecosystem through an inverted index. This legacy system meant that only textual matching results could be fetched.

In the second half of 2021, the Deliveries search team worked on improving this architecture to make it smarter, more scalable and also unlock future growth for different search use cases at Grab. The figure below shows a simplified overview of the legacy architecture.

Legacy architecture

Problem statement

With the legacy system, we noticed several problems.

Search results were textually matched without considering intention and context

If a user types in the query “Roti Prata” (flatbread), they are most likely looking for Roti Prata dishes, so matches on the dish name should be prioritised over matches on the merchant-partner’s name or on other entities.

In the legacy system, all entities whose names partially matched “Roti Prata” were displayed and ranked according to hard coded weights, and matches with merchant-partner names were always prioritised, even if the user intention was clearly to search for the “Roti Prata” dish itself.  

This problem was more common in Mart, as users often intended to search for items instead of shops. Besides the lack of intention recognition, the search system was also unable to take context into consideration; users searching the same keyword at different times and locations could have different objectives. For example, users searching for “Bread” during the day may be looking for cafes, while searches at night could be for breakfast the next day.

Search results from multiple business verticals were not blended effectively

In Grab’s context, results from multiple verticals were often merged. For example, in Mart searches, Ads and Mart organic search results were displayed together; in Food searches, Ads, Food and Mart organic results were blended together.

In the legacy architecture, multiple business verticals were merged at the Deliveries API layer, which resulted in leaky abstractions and a loss of useful data, since data from the search recall stage was not taken into account during the merge stage.

Inability to quickly scale to new search use cases and difficulty in reusing existing components

The legacy code base was not written in a structured way that could scale to new use cases easily. When new search use cases cannot be built on top of an existing system, it becomes rather tedious to keep rebuilding similar functionality every time a new use case appears.

Solution

In this section, solutions from both architecture and implementation perspectives are presented to address the above problem statements.

Architecture

In the new architecture, the flow is extended from a single lexical recall step to a multi-layer flow that includes boosting, multi-recall, and ranking. The addition of boosting enables capabilities like intent recognition and query expansion, while the change from a single lexical recall to multi-recall opens up the potential for other recall methods, e.g. embedding based and graph based.

These help address the first problem statement. Furthermore, the multi-recall framework enables fetching results from multiple business verticals, addressing the second problem statement. In the new framework, results from different verticals and different recall methods are grouped and ranked together, without any leaky abstraction or loss of useful recall-stage data during ranking.

Upgraded architecture
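The overall shape of the boosting, multi-recall, and ranking flow can be sketched as follows; the recall sources, scores, and intent rule are stand-ins for illustration and not Grab’s production logic.

def boost(query):
    # Intent recognition and query expansion would happen here
    return {"text": query, "intent": "dish" if "prata" in query.lower() else "generic"}

def lexical_recall(q):
    return [{"id": "merchant_1", "source": "food", "score": 0.4}]

def embedding_recall(q):
    return [{"id": "dish_roti_prata", "source": "food", "score": 0.9}]

def ads_recall(q):
    return [{"id": "ad_42", "source": "ads", "score": 0.6}]

def search(query):
    q = boost(query)
    # Multi-recall: gather candidates from several verticals and methods in one stage
    candidates = lexical_recall(q) + embedding_recall(q) + ads_recall(q)
    # Ranking sees every candidate together with its recall metadata, so nothing
    # is lost in a later merge step
    return sorted(candidates, key=lambda c: c["score"], reverse=True)

print(search("Roti Prata"))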

Implementation

We believe that the key to a platform’s success is modularisation and flexible assembling of plugins to enable quick product iteration. That is why we implemented a combination of a framework defined by the platform and plugins provided by service teams. In this implementation, plugins are assembled through configurations, which addresses the third problem statement and has two advantages (a minimal sketch of this plugin assembly follows the list):

  • Separation of concern. With the main flow abstracted and maintained by the platform, service team developers could focus on the application logic by writing plugins and fitting them into the main flow. In this case, developers without search experience could quickly enable new search flows.
  • Reusing plugins and economies of scale. With more use cases onboarded, more plugins are written by service teams and these plugins are reusable assets, resulting in scale effect. For example, an Ads recall plugin could be reused in Food keyword or non-keyword searches, Mart keyword or non-keyword searches and universal search flows as all these searches contain non-organic Ads. Similarly, a Mart recall plugin could be reused in Mart keyword or non-keyword searches, universal search and Food keyword search flows, as all these flows contain Mart results. With more plugins accumulated on our platform, developers might be able to ship a new search flow by just reusing and assembling the existing plugins.
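A minimal sketch of this config-driven plugin assembly is shown below; the plugin names and flow configurations are illustrative assumptions.

PLUGINS = {}

def plugin(name):
    # Decorator that registers a plugin under a name the configuration can refer to
    def register(fn):
        PLUGINS[name] = fn
        return fn
    return register

@plugin("ads_recall")
def ads_recall(query):
    return [f"ad result for {query}"]

@plugin("mart_recall")
def mart_recall(query):
    return [f"mart result for {query}"]

# Each search flow is just a configuration listing the plugins it needs
FLOW_CONFIG = {
    "mart_keyword_search": ["ads_recall", "mart_recall"],
    "food_keyword_search": ["ads_recall"],
}

def run_flow(flow_name, query):
    results = []
    for plugin_name in FLOW_CONFIG[flow_name]:
        results.extend(PLUGINS[plugin_name](query))
    return results

print(run_flow("mart_keyword_search", "instant noodles"))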

Conclusion

Our platform now has a smart search with intent recognition and semantic (embedding-based) search. The process of adding new modules is also more straightforward: intent recognition was added to the boosting step, and embedding-based retrieval was added as an additional recall method in the multi-recall step. These modules can be easily reused by other use cases.

On top of that, we also have a framework for mixing Ads and organic results. This means that data from the recall stage is taken into consideration and Ads can now be ranked together with organic results on signals such as text relevance.

With a modularised design and plugins provided by the platform, it is easier for clients to use our platform with a simple onboarding process. Furthermore, plugins can be reused to cater to new use cases and achieve a scale effect.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!


How we’re using projects to build projects

Post Syndicated from Jed Verity original https://github.blog/2022-05-16-how-were-using-projects-to-build-projects/

At GitHub, we use GitHub to build our own products, whether that be moving our entire Engineering team over to Codespaces for the majority of GitHub.com development, or utilizing GitHub Actions to coordinate our GitHub Mobile releases. And while GitHub Issues has been a part of the GitHub experience since the early days and is an integral part of how we work together as Hubbers internally, the addition of powerful project planning has given us more opportunities to test out some of our most exciting products.

In this post, I’m going to share how we’ve been utilizing the new projects experience across our team (from an engineer like myself all the way to our VPs and team leads). We love working so closely with developers to ship requested features and updates (all of which roll up into the Changelogs you see), and using the new projects helps us stay consistent in our shipping cadence.

How we think about shipping

Our core team consists of members of the product, engineering, design, and user research teams. We recognize that good ideas can come from anywhere. Our process is designed to inspire, surface, and implement those ideas, whether they come from users, individual contributors, managers, directors, or VPs. To get the proper alignment for this group, we’ve agreed on a few guiding principles that drive what our roadmap will look like:

💭 The pitch: When it comes to what we’re going to work on (outside of the big pieces of work on our roadmap) people within our team can pitch ideas in our team’s repository for upcoming cycles (which we define as 6-8 weeks of work, inclusive of planning, engineering work, and an unstructured passion project week); these can be features, fixes, or even maintenance work. Every pitch must clearly state the problem it’s solving and why it’s important for us to prioritize. Some features that have come from this process include live updates, burn up charts for insights, and more. Note: these are all the changes you see as a developer, but we also have a lot of pitches come in from my fellow engineers focused around the developer experience. For example, a couple successful pitches have included reducing our CI time to 10 minutes, and streamlining our release process by switching to a ring deployment model and adding ChatOps.

💡 In addition to using issues to propose and converse on pitches from the team, we use the new projects experience to track and manage all the pitches from the team so we can see them in an all-up table or board view.

✂ Keep it small: We knew for ourselves, and for developers, that we didn’t want to lock them into a specific planning methodology and over-complicate a team’s planning and tracking process. For us, we wanted to plan shorter cycles for our team to increase our tempo and focus, so we opted for six-week cycles to break up and build features. Check out how we recommend getting started with this approach in a recent blog post.

📬 Ship to learn: Similar to how we ship a lot of our products, we knew developers and customers were going to be heavily intertwined with each and every ship, giving us immediate feedback in both the private and public beta. Their feedback both influenced what we built and then how we iterated and continued to better the experience once something did ship. While there are so many people to thank, we’re extremely grateful for all our customers along the way for being our partners in building GitHub Issues into the place for developers to project plan.

How we used projects to do it

We love that the product we’re building doesn’t tool a specific project management methodology, but equips users with powerful primitives that they can compose into their preferred experiences and workflows. This allows for many people (not just us engineers) involved in building and developing products at GitHub (team leads, marketing, design, sales, etc.) the ability to use the product in a way that makes sense for them.

With the above principles in mind, once a pitch has been agreed upon to move forward on building, that pitch issue becomes a tracking issue in a project table or board that we convert into pieces of work that fit into an upcoming cycle. A great example of this was when we updated the GitHub Issues icons to lessen confusion among developers. This came in as a pitch from a designer on the team, and was soon accepted and moved into epic planning in which the team responsible began to track the individual pieces of work needed to make this happen.

IC approach

Let’s start with how my fellow engineers, individual contributors and I use projects for day-to-day development within cycles. From our perspective on any given day, we’re hyper-focused on tackling what issues and pull requests are assigned to us (fun fact: we recently added the assignee:me filter to make this even easier) in a given cycle, so we work from more individually scoped project tables or boards that stem from the larger epic and iteration tracking. Because of this, we can easily zoom out from our individual tasks to see how our work fits into a given cycle, and then even zoom out more into how it fits into larger initiatives.

💡 In addition to scoping more specifically a given table or board, engineers across our organization utilize a personal project table or board to track all the things specific to themselves like what issues are assigned to them—even work not connected to a given cycle, like open source work.

EM approach

If we pull back to engineering managers overseeing those smaller cycles, they’re focused on kicking off an accepted pitch’s work, breaking it first into cycles and then into smaller iterations in which they can assign out work. A given cycle’s table or board view allows the managers to have a whole look at all members of their team and look specifically at things that are important to them, like all the pull requests that are open and quickly seeing which engineers are assigned, what pull requests have been merged, deployed, etc.

💡 Check out what this looks like in our team board.

Team lead approach

Now, if we put ourselves in the shoes of our team leads and Directors/VPs, we see that they’re using the new projects experience primarily to get the full picture of where product and feature development currently sits. They told me the main team roadmap and backlog are where they can get answers to questions like:

  • Which projects do we have in flight in which product area right now?
  • Who’s the key decision maker for this project?
  • Which engineers are working on which projects?
  • Which projects are at risk and need help (progress/status)?

What’s great about this is that they can quickly glance at what’s in motion and then click into any cycles or status to get more context on open issues, pull requests, and how everything is connected.

💡 Outside of checking in on what’s being worked on and where the organization’s current focus is, our leads have found additional use cases that may not apply to an engineer like me. They use private projects for more sensitive tasks, like managing our team’s hiring, tracking upcoming start dates, staying on top of career development, organizational change management, and more.

Wrap-up

This is how we, the planning and tracking team at GitHub, are using the very product we’re building for you to build the new projects experience. Many other teams across GitHub use the new project tables and boards as well, but we hope this gives you a little bit of inspiration for how to think about project planning on GitHub and how to optimize for all the stakeholders involved in building and shipping products.

What’s great about project planning on GitHub is that our powerful-primitives approach to project management gives you and your team an enormous amount of flexibility to play with, and there are likely many, many ways to use the product that we haven’t even thought of. So, please let us know how you’re using it and how we can improve the experience!

GitHub Availability Report: April 2022

Post Syndicated from Jakub Oleksy original https://github.blog/2022-05-04-github-availability-report-april-2022/

In April, we experienced three distinct incidents that resulted in significant impact and a degraded state of availability for Codespaces and GitHub Packages.

April 01  7:07 UTC (lasting 5 hours and 32 minutes)

Our alerting detected an increase in failures to create new Codespaces and start existing stopped Codespaces in the US West region. We immediately updated the GitHub status page and began to investigate.

Upon further investigation, we determined that some secrets used by the Codespaces service had expired. Codespaces maintains warm pools of resources to protect our users from intermittent failures in our dependent services. However, in the US West region, those pools were empty due to the expired secret. In this case, we didn’t have an early enough warning that pool resources were running low, and we ran out of capacity before we could react. As we worked to mitigate the incident, the pools in other regions also emptied due to the expired secret, and those regions began to see failures as well.

A limited number of GitHub engineers had access to rotate the secret, and communication issues delayed the start of the secret refresh process. The expired secret was eventually refreshed and rolled out to all regions, and the service was returned to full operation.

To prevent this failure pattern in the future, we now verify resources that expire and have monitors in place that alert well in advance if pool resources are not being maintained. We’ve also added monitors to notify us earlier when we approach resource exhaustion limits. In addition, we’ve initiated migrating the service to use a mechanism that doesn’t rely on secrets or the need to rotate credentials.

April 14 20:35 UTC (lasting 4 hours and 53 minutes)

We are still investigating the contributing factors and will provide a more detailed update in the May Availability Report, which will be published the first Wednesday of June. We will also share more about our efforts to minimize the impact of future incidents.

April 25 8:59 UTC (lasting 5 hours and 8 minutes)

During this incident, our alerting systems detected increased CPU utilization on one of the GitHub Packages Registry databases, starting approximately one hour before any customer impact occurred. The threshold for this alert was relatively low and it was not a paging alert, so we did not immediately investigate. CPU utilization continued to rise on the database, causing the Package Registry to respond to requests with internal server errors and eventually causing customer impact. The increased activity was due to a high volume of the “Create Manifest” command being used in an unexpected manner.

The throttling criteria configured at the database level were not enough to limit this command, and this caused an outage for anyone using the GitHub Packages Registry. Users were unable to push or pull packages, and could not access the packages UI or the repository landing page.

After investigating, we determined there was a performance bug related to the high volume of “Create Manifest” commands. In order to limit impact and restore normal operation, we blocked the activity causing this problem. We are actively following up on this issue by improving the rate limiting in packages and fixing the performance problem that was uncovered. We’ve also modified database alerting thresholds and severity so we get alerted to unexpected issues more quickly (rather than after customer impact).

During this incident, we also discovered that the repository home page has a hard dependency on the packages infrastructure. When the package registry is down, the home pages for repositories that list packages also fail to load. We decoupled the package listing from the repository home page, but that required manual intervention during the outage. We are working on a fix that loosely binds the packages listing, so if it fails, it does not take down the repository home pages for repositories that list packages.

In summary

We will continue to keep you updated on the progress and investments we’re making to ensure the reliability of our services. Please follow our status page for real-time updates. To learn more about what we’re working on, check out the GitHub Engineering Blog.

Embracing a Docs-as-Code approach

Post Syndicated from Grab Tech original https://engineering.grab.com/doc-as-code

The Docs-as-Code concept has been gaining traction in the past few years as more tech companies start implementing this approach. One of the most widely known examples is Spotify, which uses Docs-as-Code to publish documentation in an internal developer portal.

Since the start of 2021, Grab has also adopted a Docs-as-Code approach to improve our technical documentation. Before we talk about how this is done at Grab, let’s explain what this concept really means.

What is Docs-as-Code?

Docs-as-Code is a mindset for creating and maintaining technical documentation. The goal is to empower engineers to write technical documentation frequently and keep it up to date by integrating documentation with the tools and processes they already use.

This means that technical documentation is placed in the same repository as the code, making it easier for engineers to write and update. Next, we’ll go through the motivations behind this initiative.

Why embark on this journey?

After speaking to Grab engineers, we found that some of their biggest challenges are around finding and writing documentation. Like many other companies on the same journey, Grab is rather big and our engineers are split into many different teams. Within each team, technical documentation can be stored on different platforms and in different formats, such as Google Drive documents, text files, and so on. This makes it hard to find relevant information, especially if you are trying to find another team’s documentation.

On top of that, we realised that the documentation process is disconnected from an engineer’s everyday activities, making technical documentation an awkward afterthought. This means that even if people could find the information, there was a good chance that it would not be up to date.

To address these issues, we need a centralised platform, a single source of truth, so that people can find and discover technical documentation easily. But first, we need to change how we write technical documentation. This is where Docs-as-Code comes in.  

How does Docs-as-Code solve the problem?

With Docs-as-Code, technical documentation is:

  • Written in plaintext.
  • Editable in a code editor.
  • Stored in the same repository as the source code so it’s easier to update docs whenever a code change is committed.
  • Published on a central platform.

The idea is to consolidate all technical documentation on a central platform, making it easier to discover and find content by using an easy-to-navigate information architecture and targeted search.

How is Grab embracing Docs-as-Code?

We’ve developed an internal developer portal that simplifies the process of writing, reviewing and publishing technical documentation.

Here’s a brief overview of the process:

  1. Create a dedicated docs folder in a Git repository.
  2. Push Markdown files into the docs folder.
  3. Configure the developer portal to publish docs from the respective code repository.

The latest version of the documentation will automatically be built and published in the developer portal.
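
As a concrete illustration, here is a minimal sketch of those three steps from an engineer’s point of view. The repository name, file names, and branch are hypothetical, and the final step depends entirely on how your portal is configured:

$ cd my-service                                  # hypothetical repository
$ mkdir docs
$ echo "# My Service overview" > docs/index.md   # Markdown lives next to the code
$ git add docs/index.md
$ git commit -m "docs: add service overview"
$ git push origin main
# finally, register my-service and its docs folder in the developer portal's configuration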

Simplified documentation process

This way, technical documentation is closer to the source code and integrated into the code development process. Writing and updating technical documentation becomes part of writing code, and this encourages engineers to keep documentation updated.

Measuring success

Driving change across a big organisation like Grab can be tough. Thankfully, our engineers recognised the importance of improving documentation and making it easier to maintain and update.

We surveyed our users and here’s what some have said about our Docs-as-Code initiative:

“[W]ith the doc and source code in one place, test backend engineers can now make doc changes via standard code review process and re-use the same content for CLI helper message and documentation.” – Kang Yaw Ong, Test Automation – Engineering Manager

“[Docs-as-Code] is a great initiative, as it keeps documentation in line and up-to-date with the development of a project. Managing documentation using a version control system and the same tools to handle merges and conflicts reduces overhead and friction in an engineer’s workflow.” – Eugene Chiang, Foundations – Engineering Manager

Progress and future optimisations

Since we first started the Docs-as-Code initiative in Grab, we’ve made a lot of progress in terms of adoption – approximately 80% of Grab services will have their technical documentation on the internal portal by April 2022.

We’ve also improved overall user experience by enhancing stability and performance, improving navigation and content formatting, and enabling feedback. But it doesn’t stop there; we are continuously improving the internal portal and providing more features for our engineers.

Apart from technical documentation, we are also applying the Docs-as-Code approach to our technical training content. This means moving both self-paced and workshop training content to a centralised repository and providing engineers a single platform for all their learning needs.


Special thanks to the Tech Learning – Documentation team for their contributions to this blog post.


We are hiring!

We are looking for more technical content developers to join the team. If you’re keen on joining our Docs-as-Code journey and improving developer experience, check out our open listings in Singapore and Malaysia.

Join us in driving this initiative forward and making documentation more approachable for everyone!

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Being friendly: Strategies for friendly fork management

Post Syndicated from Lessley Dennington original https://github.blog/2022-05-02-friend-zone-strategies-friendly-fork-management/

This is the second and final post in a series describing friendly forks and alternative strategies for managing them. Make sure to check out Being friendly: friendly forks 101 for general information on friendly forks and background on the three forks on which we center this post’s discussion.

In the first post in this series, we discussed what friendly forks are and learned about three GitHub-managed friendly forks of git/git: git-for-windows/git, microsoft/git, and github/git. In this post, we deep dive into the management strategies we employ for each of these forks and provide scenarios to help you select the appropriate management strategy for your own friendly fork.

📋 The importance of friendly fork management

While the basics of friendly forks could make for mildly interesting cocktail party conversation 🍹, it takes a deeper understanding to successfully manage a fork once you have it. Management (or lack thereof) can make or break friendly forks for the following reasons:

  1. Contributions taken by the upstream project are also generally valuable to the fork.
  2. The number of changes to the upstream project since the last merge is correlated with the difficulty of merging those changes into the fork.
  3. When security patches are pushed upstream, it is critical to be able to easily apply them (without other conflicts getting in the way).

Although there is no one-size-fits-all approach to friendly fork management, our goal for the remainder of this post is to provide a solid starting point for your management journey that will help your friendly fork remain securely in the friend zone.

🎯 Our management strategies

We employ a different management strategy for each of the forks discussed in the previous post based on that fork’s unique needs. These strategies are illustrated in the graphic below.

If the above image makes you feel slightly dizzy, don’t worry! We know it’s a lot, so we’re going to break it down into detailed descriptions of how each fork works. Take a deep breath, and prepare to dive in.

git-for-windows/git

Git for Windows uses a custom merging rebase strategy to take changes from upstream. A merging rebase is just what it sounds like: a combination of a merge and a rebase. Merging rebases are executed at a predictable cadence that follows the git/git release cycle.

When it is time for a new release, git/git creates a series of release candidate tags (typically rc0, rc1, rc2, and final) on its default branch. As soon as each new candidate is released, we execute a merging rebase of the main branch on top of it. The merge portion comes first, with this command:

$ git merge -s ours -m "Start the merging-rebase to <version>" HEAD@{1}

This creates what we call a “fake merge” since it uses the “ours” strategy to discard all the changes from main. While this may seem odd, it actually provides the benefits of a clean slate for rebasing and the ability to fast forward from previous states of the branch.
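
The exact automation Git for Windows uses differs, but conceptually the setup around that fake merge looks something like the sketch below. The tag name is hypothetical; the key point is that HEAD is the new upstream release candidate and HEAD@{1} is the previous tip of main:

$ git checkout main
$ git reset --hard v2.99.0-rc0        # hypothetical upstream release candidate tag; HEAD@{1} now points at the old tip of main
$ git merge -s ours -m "Start the merging-rebase to v2.99.0-rc0" HEAD@{1}   # keep the upstream tree, record the old tip as a parent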

After the merge is complete, the rebase commences. This portion of the process helps us resolve merge conflicts that occur when upstream changes conflict with changes that have been made in git-for-windows/git. One type of conflict in particular is worth discussing in more depth: commits that have been added to git-for-windows/git and subsequently upstreamed.

When these commits are submitted upstream, the community supporting git/git usually requests changes before they are accepted. Additionally, since git/git only accepts patches sent via mailing list (instead of pull requests), commit IDs inevitably change when applied upstream. This means there will be conflicts when git-for-windows/git is rebased on top of a new release. Running the following command helps us identify commits that have been upstreamed when we encounter conflicts:

$ git range-diff --left-only <commit>^! <commit>..<upstream-branch>

This command compares the following two ranges of commits and only shows entries for commits in the first range:

  1. The range from the commit just before the one you are checking to the commit itself (in other words, a range containing only that one commit).
  2. The upstream commits that are not in the commit history of <commit>.

If there is a matching commit in upstream, it will be shown in the command’s output. In this case, we use git rebase --skip to bypass the commit (and implicitly accept the upstream version).
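
For instance, a conflict resolution session might look roughly like the sketch below. The commit IDs, subject line, and upstream branch name are all made up for illustration; the "=" in the output indicates that an equivalent commit already exists upstream, so we drop our copy:

$ git range-diff --left-only 1a2b3c4^! 1a2b3c4..upstream/master
1:  1a2b3c4 = 1:  9f8e7d6 mingw: teach the test suite about backslashes
$ git rebase --skip   # drop our copy and implicitly accept the upstream version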

Because git-for-windows/git begins its merging rebase immediately after the creation of each release candidate, we classify it as proactive: it ensures all new git/git features are integrated and released as soon as possible. The merging rebase is generally executed for each release candidate by one developer, but is reviewed by multiple other developers with the help of range-diff comparisons. You can find an example of such a review here. When complete, the changes are pushed directly to the main branch, and a new release is created.

microsoft/git

Like git-for-windows/git, microsoft/git is proactive, executing rebases immediately following the creation of each new git/git release candidate. It also uses the same strategies for identifying commits that have made it upstream. However, there are a few key differences between the git-for-windows/git approach and the microsoft/git approach. These are:

  1. Since microsoft/git is a fork of git-for-windows/git, it does not take commits directly from git/git. Instead, we wait for the git-for-windows/git merging rebase to complete for each candidate then rebase on top of the resulting tag.
  2. Instead of repeatedly rebasing a designated main branch, we cut a brand new branch for each version with the naming scheme vfs-2.X.Y (see the current default as an example). This branch is based off the initial git-for-windows/git release candidate tag and is updated with the rebase for each new release candidate. We based this strategy on release branches in Azure DevOps to make hotfixing easier and to clarify which commits are released with each new version.

Once the rebases are complete, we designate vfs-2.X.Y as the new default branch and create a new release.

github/git

github/git integrates new git/git releases using a traditional merge strategy. It is cautious in its cadence for taking releases; for this fork, we prefer to allow new features to “simmer” for some time before integration. This means that github/git is typically one or two versions behind the latest release of git/git.

To ensure merges are high-quality and accurate, the merge of a new git/git version is carried out by at least two developers in parallel. For commits that began in github/git and subsequently made it upstream, we generally accept the upstream version when resolving resulting conflicts. Occasionally, however, there are reasons to take the github/git version (or, some parts of both versions). It is up to the developers executing the merge to decide the correct strategy for a particular commit. When each merge is complete, the trees at the tip of each merge are compared, and the different approaches to conflict resolution are reviewed. The outcome of this review is merged and deployed as a new release.
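
One simple way to compare two independently produced merges is to diff their resulting trees. The branch names below are hypothetical, and this is just one possible approach rather than a description of our exact tooling:

$ git diff --stat merge-by-reviewer-a merge-by-reviewer-b
# an empty diff means both developers resolved the conflicts to an identical tree;
# any output points at spots where the conflict resolutions need to be reconciled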

Note that there are tradeoffs to the decision to use a traditional merge strategy to manage this fork. Merging is more straightforward than rebasing or rebase merging. However, merges in github/git can become somewhat tricky when it has drifted far from upstream or when a sweeping change in upstream affects many parts of its custom code. Additionally, this strategy requires all commits to be preserved (as opposed to git-for-windows/git and microsoft/git, which use the autosquash feature of the rebase command to remove squash and fixup commits), which means more commits are involved in github/git merges.

Comparison

Below is a side-by-side summary of the key similarities and differences in the management strategies discussed above.

Fork | Management strategy | # of developers executing | Proactive or cautious | Long-running main branch | Integrates release candidates
git-for-windows/git | merging rebase | 1 | Proactive | Yes | Yes
microsoft/git | merging rebase | 1 | Proactive | No | Yes
github/git | merge | >=2 | Cautious | Yes | No

As shown in the table, git-for-windows/git and microsoft/git have a lot in common. They are both proactive, executed by one developer, and use a form of rebase to integrate new releases (and release candidates). github/git differs in its choice of a merge-based management strategy, in the number of developers that execute that strategy in parallel, and in its cautious approach to integrating new releases (release candidates are not considered at all).

As lovely as the above table is, you may still be scratching your head, wondering which strategy is right for you or your organization. Never fear! Our final section provides a series of scenarios with the goal of making this decision easier for you.

🌹 Finding your perfect match

Well done! You’ve successfully made it through our deep dive into three alternatives for friendly fork management. 👏👏👏

However, now comes the really important part! It’s time to take a good look at each of the above strategies to understand which one is the best fit for you. We’ve organized this section as a series of scenarios to help you frame the above in the context of your own needs. Keep reading if you’re ready to choose your own friendly fork adventure!

Scenario 1: You have many contributors working simultaneously.

Constantly creating new default branches can leave developers with open pull requests in a bad state; obviously, this is particularly problematic when there’s a healthy amount of active development in your repository. Consider a merge or merging rebase strategy to avoid changing default branches and requiring your developers to constantly rebase.

Scenario 2: You need to support multiple versions with security releases or other bug fixes.

Consider the rebase model to have an easy story for cherry-picking fixes to supported versions.

Note: it is also possible to cherry-pick features from upstream with a merge-based workflow. However, if you later merge the commits containing the cherry-picked features, you may have to resolve some trivial conflicts depending on your merge strategy.

Scenario 3: You don’t (or do!) want to take new features immediately.

You can apply a cautious or proactive approach to any of the above strategies. Work with your team/management to find a cadence everyone is comfortable with and stick to it.

Scenario 4: You’re new to the fork game and want to keep it simple.

If this is your (or, your team’s) first time managing a fork and/or you’re just learning your way around Git, the merge strategy may be the most straightforward place for you to start.

Still not sure which strategy to use? Consider getting in touch with the maintainers of one of the friendly forks listed above (for example, via the appropriate mailing list(s) or a GitHub discussion) to get input from the experts!

🎁 Wrapping up

A friendly fork can help accelerate development and increase developer productivity and satisfaction. Additionally, managing a friendly fork carefully to stay in the friend zone can lead to successful collaboration between different communities and improved project quality for all parties involved. We hope this series has helped you understand whether a friendly fork is right for you or your organization and, if so, empowered you to create and begin managing your friendly fork successfully.

Bringing code navigation to communities

Post Syndicated from Patrick Thomson original https://github.blog/2022-04-29-bringing-code-navigation-to-communities/

We’re proud to announce the availability of search-based code navigation for the Elixir programming language. Yet this is more than just new functionality. It’s the first example of a language community writing and submitting their own code for search-based code navigation.

Since GitHub introduced code navigation, we’ve received enthusiastic feedback from users. However, we hear one question over and over again: “When will my favorite language be supported?” We believe that every programming language deserves best-in-class support in GitHub. However, GitHub hosts code written in thousands of different programming languages, and we at GitHub can commit to supporting only a small subset of these languages ourselves. To this end, we’ve been working hard to empower language communities to integrate with our code navigation systems. After all, nobody understands a programming language better than the people who build it!

Community contributions welcome!

Would you like to develop and contribute search-based code navigation for a language? If a Tree-sitter grammar exists for your language, you can do so using Tree-sitter’s tree query language. Tree queries describe how our code navigation systems scan through syntax trees and extract the important parts of a declaration or reference. This process is known as tagging, and it is integrated with the tree-sitter command-line tool so that you can run and test tag queries locally. When a user pushes new commits or creates a pull request, the GitHub code navigation systems use tag queries to extract jump-to-definition and find-all-references information.

Note: If you’re interested in how search-based code navigation works, you can read a technical report that describes the architecture and evolution of GitHub’s implementation of code navigation.
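
If you want to try this locally, a rough workflow with the tree-sitter CLI looks something like the sketch below. The grammar repository and file paths are placeholders, and the exact commands may differ for your language:

$ git clone https://github.com/your-org/tree-sitter-yourlang   # hypothetical grammar repository
$ cd tree-sitter-yourlang
$ tree-sitter generate                        # build the parser from grammar.js
$ tree-sitter tags examples/sample.yourlang   # print the definitions and references found by the tags queries
$ tree-sitter test                            # run the grammar's unit tests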

How do I get started?

Complete documentation, including how to write unit tests for your tags queries, can be found here. For examples of tag queries, you can check out the Elixir tags implementation, or those written for Python or Ruby. Once you’ve implemented tag queries for a language, written unit tests, and are satisfied with the output from tree-sitter tags, you can submit a request on the Code Search and Navigation Feedback discussion page in the GitHub feedback repository. The Code Navigation team will add this to their roadmap. When we’re confident in the quality of the yielded tags and the load on the back-end systems that handle code navigation, we can enable it in testing for a subset of contributors, eventually rolling it out to all GitHub users, on both private and public repositories.

Please note that search-based code navigation is distinct from GitHub code search. Code search provides a view across the entire corpus of code on GitHub, whereas search-based code navigation is part of the experience of reading code within a single repository. Search-based code navigation is also distinct from our support for precise code navigation. However, we hope to empower language communities to inform and contribute to the development of and support for these and other features in the future.

We’re committed to working with language maintainers and contributors to keep these rules as useful and up-to-date as possible. Whether you’re a longtime language contributor or someone seeking to enter the community for the first time, we encourage you to take a look at adding support for search-based code navigation, and join the community on the GitHub Discussions page.

Graph Networks – Striking fraud syndicates in the dark

Post Syndicated from Grab Tech original https://engineering.grab.com/graph-networks


As a leading superapp in Southeast Asia, Grab serves millions of consumers daily. This naturally makes us a target for fraudsters. To enhance our defences, the Integrity team at Grab has launched several hyper-scaled services, such as the Griffin real-time rule engine and Advanced Feature Engineering. These systems enable data scientists and risk analysts to develop real-time scoring and take fraudsters out of our ecosystems.

Apart from individual fraudsters, we have also observed the fast evolution of the dark side over time. We have had to evolve our defences to deal with professional syndicates that use advanced equipment such as device farms and GPS spoofing apps to perform fraud at scale. These professional fraudsters are able to camouflage themselves as normal users, making it significantly harder to identify them with rule-based detection.

Since 2020, Grab’s Integrity team has been advancing fraud detection with more sophisticated techniques and experimenting with a range of graph network technologies such as graph visualisations, graph neural networks and graph analytics. We’ve seen a lot of progress in this journey and will be sharing some key learnings that might help other teams who are facing similar issues.

What are Graph-based Prediction Platforms?

“You can fool some of the people all of the time, and all of the people some of the time, but you cannot fool all of the people all of the time.” – Abraham Lincoln

A Graph-based Prediction Platform connects multiple entities through one or more common features. When such entities are viewed as a macro graph network, we uncover new patterns that are otherwise invisible to the naked eye. For example, when investigating whether two users are sharing IP addresses or devices, we might not be able to tell if they are fraudulent or just family members sharing a device.

However, if we use a graph system and look at all users sharing this device or IP address, it could show us if these two users are part of a much larger syndicate network in a device farming operation. In operations like these, we may see up to hundreds of other fake accounts that were specifically created for promo and payment fraud. With graphs, we can identify fraudulent activity more easily.

Grab’s Graph-based Prediction Platform

Leveraging the power of graphs, the team has primarily built two types of systems:

  • Graph Database Platform: An ultra-scalable storage system with over one billion nodes that powers:
    1. Graph Visualisation: Risk specialists and data analysts can review user connections in real time and quickly capture new fraud patterns across more than 10 feature dimensions (see Fig 1).

      Fig 1: Graph visualisation
    2. Network-based feature system: A configurable system for engineers to adjust machine learning features based on network connectivity, e.g. the number of hops between two users, or the number of shared devices between two IP addresses.

  • Graph-based Machine Learning: Unlike traditional fraud detection models, Graph Neural Networks (GNN) are able to utilise the structural correlations on the graph and act as a sustainable foundation to combat many different kinds of fraud. The data science team has built large-scale GNN models for scenarios like anti-money laundering and fraud detection.

    Fig 2 shows a Money Laundering Network where hundreds of accounts coordinate the placement of funds, layer the illicit monies through a complex web of transactions that makes the funds hard to trace, and consolidate the funds into spending accounts.

Fig 2: Money Laundering Network

What’s next?

In the next article of our Graph Network blog series, we will dive deeper into how we develop the graph infrastructure and database using AWS Neptune. Stay tuned for the next part.


Being friendly: Friendly forks 101

Post Syndicated from Lessley Dennington original https://github.blog/2022-04-25-the-friend-zone-friendly-forks-101/

This is the first post in a two-part series describing friendly forks and alternative strategies for managing them. Stay tuned for part two coming in May!

This post covers what a friendly fork is, why friendly forks are beneficial, and how they differ from divergent forks. We’ll also look at some examples from the wild and provide details on three of our favorite friendly forks of the git/git repository.

💀 To fork or not to fork

Most developers are familiar with the concept of working with source code in repositories. However, what should you do when you want to make one or more major changes to a repository that you do not own? Two options are to submit feature requests or to contribute the features you need yourself. This is a very common approach in open source software, and, when it goes well, it can lead to productive collaboration and useful results for all parties.

But what if the proposed features are not accepted into the repository? What if they were never intended to be contributed back to the original project? If you (or your company) have a strong need for these features, creating a friendly fork of the repository could be the right choice.

👭 What is a friendly fork?

A friendly fork is a long-lived fork that complements its upstream repository (i.e., the repository from which it was forked) with customizations targeted to a subset of users. Typically, features from the friendly fork are contributed back to the upstream repository through a process known as upstreaming. If that is the case, developers working in the friendly fork sustain relationships with the maintainers of the upstream repository (this is the friend zone we’re so fond of!) to facilitate this flow of features and to improve the software for both user bases. It is also possible, however, for a friendly fork to simply take regular updates from its upstream repository with no maintainer interaction. Friendly forks may or may not eventually re-merge with the upstream repository.

Below are some examples of existing friendly forks of the git/git repository (which are maintained by folks at GitHub) and their purposes.

  1. git-for-windows/git: hosts Windows-specific features of Git. It also sometimes receives early features which are subsequently upstreamed into git/git.
  2. microsoft/git: focuses on features to support monorepos, some of which are subsequently upstreamed into git/git.
  3. github/git: powers GitHub’s backend. It includes GitHub-specific changes, like extra instrumentation specific to GitHub’s infrastructure, but also serves as a staging ground for new features before they are submitted upstream.

git/git is definitely not the only repository with friendly forks. Examples of friendly forks created from other repos include:

  1. MSYS2: a fork of Cygwin that provides an environment for building, installing, and running native Windows software.
  2. CentOS: a fork of Red Hat Enterprise Linux (RHEL) created in 2004 to offer a RHEL-like experience for the community. (Interestingly, Red Hat now owns the CentOS trademarks).

It is important to note that not all forks can be considered friendly. There are also divergent forks, which are typically created due to insurmountable disagreements in a community caused by disparate goals or personality conflicts. Divergent forks often become so different from their upstream repositories that sharing code becomes difficult, if not impossible. While it is good to know that divergent forks are a thing you may encounter in the wild, we emphatically believe in the power of the friend zone and will center our focus for the rest of this series on getting and keeping forks there.

📖 A tale of three forks

Three of the friendly fork examples provided above are based off of git/git: git-for-windows/git, microsoft/git, and github/git. Each of these forks has a unique history that contributes to the strategy used to maintain it. We will dedicate the remainder of this post to describing the history and purpose of each fork.

git-for-windows/git

Git for Windows logo

git-for-windows/git is the oldest of our forks for discussion. It was created in 2007 to provide the adjustments required for Git to run on Windows. While it may seem odd that Windows-specific features weren’t just added to the git/git repository, forking was deemed necessary because the git/git project was (and still is) run by experts in the Unix systems domain. And Windows, of course, falls outside the scope of that expertise.

Although this fork was originally intended to be short-lived, it soon became clear that Windows support would be an ongoing community need that would require a permanent fork. Thus, development in git-for-windows/git continues in earnest today.

Because it was created to give Windows users the option of using Git for version control, the main purposes of git-for-windows/git are:

  1. Provide a seamless, pain-free experience for Windows users.
  2. Maintain separation of concerns (in other words, Windows-specific and Unix-specific features are contributed in different repositories, while shared features can easily flow between them).

As with each fork we discuss in this post, there are some use cases for which it makes sense for new features to be contributed to the fork before they are added upstream. FS Monitor is an example of this: the Request for Comments went to the git/git mailing list, but early implementations of the feature were merged into git-for-windows/git for rapid testing.

microsoft/git

Microsoft logo

microsoft/git began as a private fork of git-for-windows/git, with the initial purpose of supporting Microsoft-internal products. However, it was open-sourced in 2017, and, as a result, its goals today are much more general and community-oriented:

  1. Facilitate easy dogfooding/quick releases of new features for GitHub’s monorepo customers.
  2. As with git-for-windows/git, determine which of these features make sense to contribute back upstream and submit/monitor them on the mailing list.

An example feature that was introduced to microsoft/git prior to upstreaming is the sparse index. This was done to speedily get this new feature into the hands of monorepo customers who needed it most. After that was done, the feature was gradually introduced and refined upstream.

github/git

github/git, our final fork for discussion, actually did not begin as a fork. In the early days of GitHub, we carried a handful of changes on top of new Git releases. Because there were only a few, we stored them as *.patch files that got applied on top of each new release before deployment. Over time, however, our custom changes became both more numerous and more applicable to being contributed to git/git. It became clear that a full-fledged fork would be beneficial to improve management, workflows, and our ability to contribute back to upstream. Thus, in 2012, the official github/git fork was born.

Note that github/git differs from the above forks in that it is a private friendly fork, while the other two are public. Private friendly forks can be very beneficial to organizations. For example, they can be an excellent testing ground for new features, as they allow you to be confident code has been battle-tested internally and works before submitting publicly upstream. They can also be less beholden to the upstream release cadence, which helps ensure stability for the product.

To this day, this fork serves the same purpose of powering GitHub’s infrastructure. It fulfills our need to support features specialized to this infrastructure and is also used as a “staging ground” for new features before submitting them to the open source git/git repository. Specific examples of the latter use include bitmaps and multi-pack bitmaps.

👋 That’s all for now!

In this post, we’ve discussed what a friendly fork is and how friendly forks differ from divergent forks. We’ve learned about three different friendly forks of the git/git repository and their purposes for existing. Thanks for sticking with us this far, and be sure to keep your 👀 peeled for our second post in the “friend zone” series, in which we’ll talk about how these forks are managed and how you can adapt their strategies to your very own friendly fork!

Improving Git push times through faster server side hooks

Post Syndicated from Carlos Martín Nieto original https://github.blog/2022-04-21-improving-git-push-times-through-faster-server-side-hooks/

At GitHub, we relentlessly pursue performance. Join me now for the tale of how we dropped a P99 time by 95% on code that runs for every single Git push operation.

Pre-receive hook execution time

Every time you push to GitHub, we run a set of checks to validate your push before accepting it. If you have ever tried to push an object larger than 100MB, you are already familiar with these checks, as the pre-receive hooks contain that logic. Similarly, they do other checks, such as verifying that LFS objects have been successfully uploaded. These hooks help keep our servers healthy and improve the user experience.
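
For readers unfamiliar with the mechanism, a pre-receive hook is a program that Git runs on the server before accepting a push: it reads one "<old-sha> <new-sha> <refname>" line per updated ref on standard input, and a non-zero exit status rejects the whole push. The snippet below is only a generic sketch of that contract with an oversized-object check, not GitHub's actual implementation:

#!/bin/sh
# Reject pushes that introduce any object larger than 100MB (illustrative only).
limit=$((100 * 1024 * 1024))
zero=0000000000000000000000000000000000000000
while read -r old new ref; do
  [ "$new" = "$zero" ] && continue   # ref deletion, nothing new to check
  # list objects reachable from the new tip that are not already in the repository
  for obj in $(git rev-list --objects "$new" --not --all | awk '{print $1}'); do
    size=$(git cat-file -s "$obj" 2>/dev/null || echo 0)
    if [ "$size" -gt "$limit" ]; then
      echo "error: object $obj on $ref exceeds the 100MB limit" >&2
      exit 1
    fi
  done
done
exit 0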

We recently rewrote these hooks from their original Ruby implementation into Go. This rewrite was something we had in mind for a while, but what really sold us on the effort was the potential performance improvement.

Today, we’ll talk about the history of these hooks, how we discovered that the performance was problematic, and how we went about safely replacing them.

How did we get here?

We created the first hook in 2013 to warn users that a repository had been renamed. Its only action was to check the database for a previous name and send a warning to the user to update their remote URL. At the time, almost all of GitHub was part of one Ruby on Rails application, so it was the logical choice for hooks as well. As time passed, more and more functionality was added to the hooks, requiring additional configuration, exception reporting, and logging.

This meant that hooks imported the same dependencies as the Ruby application. Over time, the number of dependencies, and therefore startup time, only increased. In a Rails application, these dependencies are loaded only once at startup, and each request then has them available, so startup time does not affect the user experience. However, these hooks run as subprocesses underneath the Git executable, so the dependencies are loaded for each request, making startup time critical to performance. When we investigated, we found that loading these dependencies took a rather long time: hooks took about 880 milliseconds to execute on average, and almost all of that time was spent loading dependencies. In addition, there are two sets of hooks: one runs while the new data is under quarantine and a second runs once the data is available in the repository. Especially with this double execution, startup time significantly affects each push. An empty push could take more than two seconds, which was unacceptable.

Why rewrite?

Since the performance issues were related to startup time, we had a few options. We could reduce the number of dependencies, we could change the architecture so that hooks only started up once, or we could rewrite the hooks to run independently of the monolith. Rewrites and changing the architecture carry risk, so we tried the simplest alternative first.

Loading fewer dependencies while staying within the Rails monolith proved quite tricky. There were a lot of dependencies (more than 450 gems leading to over 1,000 require calls), and they were all quite tangled up in the app’s configuration, because they were not designed to be used outside of the GitHub Rails monolith. However, careful use of the debugger and strace revealed a few outliers that we could avoid loading when running the hooks. This removal dropped 350-400 milliseconds from the startup time.

While this was already a decent improvement for a small tweak, the startup time was still quite slow, and we weren’t satisfied yet. Additionally, new dependencies are frequently added to the Rails application, which means that the startup time would creep up again over time even if our hook code did not change.

How did we rewrite?

We could not ignore the configuration from the Rails app as that is how we know how to connect to the database, send stats, etc. Some of that could be duplicated at the risk of having two parallel configuration paths that would almost certainly end up diverging.

To pass along configuration that only the app knows about, we reused an existing mechanism that already passes information such as the name of a repository and whether it is over its quota. This mechanism is invoked for every push alongside the other data needed to perform updates, so adding a bit more data there adds very little overhead. We identified the information the hooks need to perform their checks and added it to that payload.

For the past few years, the Git Systems Team has been extracting more and more of our service code from the Rails monolith and rewriting it as a dedicated service written in Go. Based on this experience with Go, and given what we learned about the Ruby hooks, moving the hooks into this Go service seemed like a natural fit as well. Just extracting the hooks to run independently of the Rails apps would have removed most of the boot time, but Go gives us the last few milliseconds and lets us make them part of the service in which the backend code increasingly lives. We expected such a significant rewrite to be worth the risk, because afterwards the hooks should be much faster.

Further, as we commonly do for high-impact changes, we put these rewritten hooks behind a feature flag. This gave us the ability to enable them for individual repositories or groups of them. We started with a few internal GitHub repositories to confirm the effect in production.

The results were so impressive that we had to double check that the hooks were still running. It was hard to distinguish between really fast hooks and completely disabled hooks. The median time was now 10ms, compared to roughly 880ms when we started the project. This made pushes noticeably faster for everyone. We even got unprompted questions about whether pushing had become faster after someone noticed it on their own.

Lessons learned

This is a project we had in mind for a long time. We had wanted to rewrite these hooks outside of the monolith to separate our area of responsibility better. However, merely having a better architecture often isn’t enough to make something a business priority. By tying the change to its impact on users we could prioritize this work. We came away with the dual benefits of a much better user experience and an architectural improvement.

This change has now been live on github.com for a couple of months, and has been shipped in GHES 3.4, so everyone now saves some time pushing to their GitHub repositories.

 

Codespaces for multi-repository and monorepo scenarios

Post Syndicated from Gabe Dominguez original https://github.blog/2022-04-20-codespaces-multi-repository-monorepo-scenarios/

Today, we’re releasing exciting improvements that will streamline your Codespaces experience when working with multi-repository projects and monorepos. Codespaces are instant cloud-powered development environments that aim to maximize your productivity by eliminating set-up time regardless of the type, size, and complexity of your projects.

With our initial release, we wanted to address the most common type of project hosted on GitHub: cloud-native applications housed in a single repository. As organization adoption began to scale, we quickly realized we needed to support additional types of projects that previously required extensive workarounds. With this latest update, we’re excited to release improved support for multi-repository and monorepo projects.

Codespaces configuration for microservices

Many of you told us that you often work with a number of interwoven repositories for your projects. Maybe there is a billing service, an event service, an authorization service, and they’re all dependent on each other. When developing a feature that spans many of these services, you might want to clone and interact with each repository within your codespace.

With this scenario in mind, we have added the ability for users to configure which permissions their codespace should have on creation. This means that users will no longer have to set up a personal access token inside of their codespace to clone or create pull requests for other repositories.

repository permissions code

Even better, you can now specify these repository permissions in your devcontainer.json under the customizations.codespaces.repositories key so that every developer is prompted for the right set of permissions while working on the project.
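
As a hedged sketch of what such a configuration might look like, the snippet below creates a minimal devcontainer.json that requests access to a sibling service's repository; the organization and repository names are hypothetical:

$ cat > .devcontainer/devcontainer.json <<'EOF'
{
  "customizations": {
    "codespaces": {
      "repositories": {
        "my_org/billing-service": {
          "permissions": {
            "contents": "write",
            "pull_requests": "write"
          }
        }
      }
    }
  }
}
EOF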

In the future, we plan to make it even simpler to work with microservices in Codespaces by automatically cloning across multiple services and allowing you to configure how your environment is initialized to run each repository.

Codespaces configuration for monorepos

If you are part of a larger organization and have many teams working in one repository, you may have wished there was an easy way to have a different codespace configuration for each team. We heard you loud and clear and are happy to announce that Codespaces now supports multiple devcontainer.json files inside of your .devcontainer directory, as long as they follow the pattern of .devcontainer/${DIR}/devcontainer.json. If multiple configurations exist, users will be able to select their specific configuration at the time of codespace creation, allowing you to better customize your codespaces to fit the specific needs of your teams.
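
For instance, a repository following that pattern might contain one configuration per team; the team directory names here are hypothetical:

$ find .devcontainer -name devcontainer.json
.devcontainer/app-team/devcontainer.json
.devcontainer/docs-team/devcontainer.json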

For example, imagine your docs team works primarily in a few directories and just needs a lightweight configuration to update Markdown files. You could have a devcontainer.json that looks like the following:

oncreatecommand script

This devcontainer.json runs an “onCreateCommand” script specific to setting up the environment for the Docs Team. The script in this scenario uses the permissions granted to “my_org/docs_linter” to pull in a linter repository, which is a useful tool when writing and editing documentation.
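
As a hedged sketch of such a configuration (the setup script path and repository names are hypothetical, and this is not the exact file pictured above), a docs team devcontainer might look like this:

$ cat > .devcontainer/docs-team/devcontainer.json <<'EOF'
{
  "name": "Docs Team",
  "onCreateCommand": "bash .devcontainer/docs-team/setup.sh",
  "customizations": {
    "codespaces": {
      "repositories": {
        "my_org/docs_linter": {
          "permissions": {
            "contents": "read"
          }
        }
      }
    }
  }
}
EOF

In this sketch, setup.sh could, for example, clone my_org/docs_linter and install a Markdown linter, relying on the read permission granted above.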

Advanced create

As we grow to handle more diverse project types and scenarios, we also want to ensure that we continue to provide ease of environment creation through simple one-click experiences that don’t require you to spend undue time understanding various configuration options.

However, if you need more flexibility, we’ve created a new advanced create flow for Codespaces that allows you to select various options, such as branch, region, machine type, and dev container configuration while creating your codespace.

configure and create codespace screen
Create a new codespace creation flow

If you want to skip the advanced creation flow, you can easily just select “Create codespace on <branch name>,” and it will create a codespace with the default configuration.

How to get started?

We believe that these three new features will allow for larger organizations to have a smoother experience as they onboard and scale with Codespaces. Repository administrators can create multiple devcontainers, each with permission sets, setup scripts, and a codespace configuration specific for certain teams. And, developers will be able to select the ideal devcontainer, machine type, and region during codespace creation with the advanced creation flow as needed. There’s something for everyone with Codespaces!

Here are some helpful links to help you get started!

If you have any feedback to help improve this experience, be sure to post it on our discussions forum.

Sharing security expertise through CodeQL packs (Part I)

Post Syndicated from Andrew Eisenberg original https://github.blog/2022-04-19-sharing-security-expertise-through-codeql-packs-part-i/

Congratulations! You’ve discovered a security bug in your own code before anyone has exploited it. It’s a big relief. You’ve created a CodeQL query to find other places where this happens and ensure this will never happen again, and you’ve deployed your new query to be run on every pull request in your repo to prevent similar mistakes from ever being made again.

What’s the best way to share this knowledge with the community to help protect the open source ecosystem by making sure that the same vulnerability is never introduced into anyone’s codebase, ever?

The short answer: produce a CodeQL pack containing your queries and publish it to GitHub. CodeQL packaging is a beta feature in the CodeQL ecosystem. With CodeQL packaging, your expertise is documented, concise, executable, and easily shareable.

This is the first post of a two-part series on CodeQL packaging. In this post, we show how to use CodeQL packs to share security expertise. In the next post, we will discuss some of our implementation and design decisions.

Modeling a vulnerability in CodeQL

CodeQL’s customizability makes it great for detecting vulnerabilities in your code. Let’s use the Exec call vulnerable to binary planting query as an example. This query was developed by our team in response to discovering a real vulnerability in one of our open source repositories.

The purpose of this query is to detect executables that are potentially vulnerable to Windows binary planting, an exploit where an attacker could inject a malicious executable into a pull request. This query is meant to be evaluated on JavaScript code that is run inside of a GitHub Action. It matches all arguments to calls to the ToolRunner (a GitHub Action API) where the argument has not been sanitized (that is, ensured to be safe) by having been wrapped in a call to safeWhich. The implementation details of this query are not relevant to this post, but you can explore this query and other domain-specific queries like it in the repository.

This query is currently protecting us on every pull request, but in its current form, it is not easily available for others to use. Even though this vulnerability is relatively difficult to attack, the surface area is large, and it could affect any GitHub Action running on Windows in public repositories that accept pull requests. You could write a stern blog post on the dangers of invoking unqualified Windows executables in untrusted pull requests (maybe you’re even reading such a post right now!), but your impact will be much higher if you could share the query to help anyone find the bug in their code. This is where CodeQL packaging comes in. Using CodeQL packaging, not only can developers easily learn about the binary planting pattern, but they can also automatically apply the pattern to find the bug in their own code.

Sharing queries through CodeQL packs

If you think that your query is general purpose and applicable to all repositories in all situations, then it is best to contribute it to our open source CodeQL query repository (and collect a bounty in the process!). That way, your query will be run on every pull request on every repository that has GitHub code scanning enabled.

However, many (if not most) queries are domain specific and not applicable to all repositories. For example, this particular binary planting query is only applicable to GitHub Actions implemented in JavaScript. The best way to share such queries is by creating a CodeQL pack and publishing it to the CodeQL package registry to make it available to the world. Once published, CodeQL packs are easily shared with others and executed in their CI/CD pipeline.

There are two kinds of CodeQL packs:

  • Query packs, which contain a set of pre-compiled queries that can be easily evaluated on a CodeQL database.
  • Library packs, which contain CodeQL libraries (*.qll files), but do not contain any runnable queries. Library packs are meant to be used as building blocks to produce other query packs or library packs.

In the rest of this post, we will show you how to create, share, and consume a CodeQL query pack. Library packs will be introduced in a future blog post.

To create a CodeQL pack, you’ll need to make sure that you’ve installed and set up the CodeQL CLI. You can follow the instructions here.

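As a quick sanity check (this snippet is ours, not from the original instructions), you can confirm that the CLI is on your PATH and that the pack commands are available:

$ codeql version
$ codeql pack --help
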
The next step is to create a qlpack.yml file. This file declares the CodeQL pack and information about it. Any *.ql files in the same directory (or sub-directory) as a qlpack.yml file are considered part of the package. In this case, you can place binary-planting.ql next to the qlpack.yml file.

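For instance, a minimal pack layout might look like this (the directory name is just for illustration):

codeql-actions-queries/
├── qlpack.yml
└── binary-planting.ql
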
Here is the qlpack.yml from our example:

name: aeisenberg/codeql-actions-queries
version: 1.0.1
dependencies:
 codeql/javascript-all: ~0.0.10

All CodeQL packs must have a name property. If they are going to be published to the CodeQL registry, then they must have a scope as part of the name. The scope is the part of the package name before the slash (in this example: aeisenberg). It should be the username or organization on github.com that will own this package. Anyone publishing a package must have the proper privileges to do so for that scope. The name part of the package name must be unique within the scope. Additionally, a version, following standard semver rules, is required for publishing.

The dependencies block lists all of the dependencies of this package and their compatible version ranges. Each dependency is referenced as the scope/name of a CodeQL library pack, and each library pack may in turn depend on other library packs declared in their qlpack.yml files. Each query pack must (transitively) depend on exactly one of the core language packs (for example, JavaScript, C#, Ruby, etc.), which determines the language your query can analyze.

In this query pack, the standard JavaScript library pack, codeql/javascript-all, is the only dependency, and the semver range ~0.0.10 means that any version >= 0.0.10 and < 0.1.0 suffices.

With the qlpack.yml defined, you can now install all of your declared dependencies. Run the codeql pack install command in the root directory of the CodeQL pack:

$ codeql pack install
Dependencies resolved. Installing packages...
Install location: /Users/andrew.eisenberg/.codeql/packages
Installed fresh codeql/javascript-all@0.0.10

After making any changes to the query, you can then publish the query to the GitHub registry. You do this by running the codeql pack publish command in the root of the CodeQL pack.

Here is the output of the command:

$ codeql pack publish
Running on packs: aeisenberg/codeql-actions-queries.
Bundling and then publishing qlpack located at '/Users/andrew.eisenberg/git-repos/codeql-actions-queries'.
Bundled qlpack created at '/var/folders/41/kxmfbgxj40dd2l_x63x9fw7c0000gn/T/codeql-docker17755193287422157173/.Docker Package Manager/codeql-actions-queries.1.0.1.tgz'.
Packaging> Package 'aeisenberg/codeql-actions-queries' will be published to registry 'https://ghcr.io/v2/' as 'aeisenberg/codeql-actions-queries'.
Packaging> Package 'aeisenberg/codeql-actions-queries@1.0.1' will be published locally to /Users/andrew.eisenberg/.codeql/packages/aeisenberg/codeql-actions-queries/1.0.1
Publish successful.

You have successfully published your first CodeQL pack! It is now available in the registry on GitHub.com for anyone else to run using the CodeQL CLI. You can view your newly-published package on github.com:

CodeQL pack on github.com

At the time of this writing, packages are initially uploaded as private packages. If you want to make a package public, you must explicitly change its permissions. To do this, go to the package page, click on package settings, then scroll down to the Danger Zone:

Danger Zone!

And click Change visibility.

Running queries from CodeQL packs using the CodeQL CLI

Running the queries in a CodeQL pack is simple using the CodeQL CLI. If you already have a database created, just call the codeql database analyze command with the --download option, passing a reference to the package you want to use in your analysis:

$ codeql database analyze --format=sarif-latest --output=out.sarif --download my-db aeisenberg/codeql-actions-queries@^1.0.1

The --download option asks CodeQL to download any CodeQL packs that aren’t already available. The @^1.0.1 is optional and specifies that you want to run the latest version of the package that is compatible with version 1.0.1. If no version range is specified, then the latest version is always used. You can also pass a list of packages to evaluate. The CodeQL CLI will download and cache each specified package and then run all queries in their default query suite.

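For example, to evaluate two packs in one run (the second pack name below is purely illustrative):

$ codeql database analyze --format=sarif-latest --output=out.sarif --download my-db \
    aeisenberg/codeql-actions-queries@^1.0.1 \
    octo-org/example-queries@^2.0.0
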
To run a subset of queries in a pack, add a : and a path after it:

aeisenberg/codeql-actions-queries@^1.0.1:binary-planting.ql

Everything after the : is interpreted as a path relative to the root of the pack, and you can specify a single query, a query directory, or a query suite (.qls file).

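Putting the two together, running just that single query against the database from the earlier example might look like this:

$ codeql database analyze --format=sarif-latest --output=out.sarif --download my-db \
    'aeisenberg/codeql-actions-queries@^1.0.1:binary-planting.ql'
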
Evaluating CodeQL packs from code scanning

Running the queries from your CodeQL pack in GitHub code scanning is easy! In your code scanning workflow, in the github/codeql-action/init step, add a packs entry to list the packs you want to run:

- uses: github/codeql-action/init@v1
  with:
    packs:
      - aeisenberg/codeql-actions-queries@^1.0.1
    languages: javascript

Note that specifying a path after a colon is not yet supported in the codeql-action, so using this approach, you can only run the default query suite of a pack in this manner.

Conclusion

We’ve shown how easy it is to share your CodeQL queries with the world using two CLI commands: the first resolves and retrieves your dependencies and the second compiles, bundles, and publishes your package.

To recap:

Publishing a CodeQL query pack consists of:

  1. Create the qlpack.yml file.
  2. Run codeql pack install to download dependencies.
  3. Write and test your queries.
  4. Run codeql pack publish to share your package in GHCR.

Using a CodeQL query pack from GHCR on the command line consists of:

  1. codeql database analyze --download path/to/my-db aeisenberg/codeql-actions-queries@^1.0.1

Using a CodeQL query pack from GHCR in code-scanning consists of:

  1. Adding a config-file input to the github/codeql-action/init action
  2. Adding a packs block in the config file

The CodeQL Team has already published all of our standard queries as query packs, and all of our core libraries as library packs. Any pack named {*}-queries is a query pack and contains queries that can be used to scan your code. Any pack named {*}-all is a library pack and contains CodeQL libraries (*.qll files) that can be used as the building blocks for your queries. When you are creating your own query packs, you should add the library pack for the language that your query will scan as a dependency.

If you are interested in understanding more about how we’ve implemented packaging and some of our design decisions, please check out our second post in this series. Also, if you are interested in learning more or contributing to CodeQL, get involved with the Security Lab.

Sharing your security expertise has never been easier!

How we reduced our CI YAML files from 1800 lines to 50 lines

Post Syndicated from Grab Tech original https://engineering.grab.com/how-we-reduced-our-ci-yaml

This article illustrates how the Cauldron Machine Learning (ML) Platform team uses GitLab parent-child pipelines to dynamically generate GitLab CI files for large repositories. This approach lets us:

  • Work around the limit on the number of includes (100 by default).
  • Simplify the GitLab CI file from 1800 lines to 50 lines.
  • Reduce the need for nested gitlab-ci.yml files.

Introduction

Cauldron is the Machine Learning (ML) Platform team at Grab. The Cauldron team provides tools for ML practitioners to manage the end-to-end lifecycle of ML models, from training to deployment. GitLab and its tooling are an integral part of our stack for continuous delivery of machine learning.

One of our core products is MerLin Pipelines. Each team has a dedicated repo to maintain the code for their ML pipelines. Each pipeline has its own subfolder. We rely heavily on GitLab rules to detect specific changes to trigger deployments for the different stages of different pipelines (for example, model serving with Catwalk, and so on).

Background

Approach 1: Nested child files

Our initial approach was to rely heavily on static code generation to generate the child gitlab-ci.yml files for individual stages. See Figure 1 for an example directory structure. These nested yml files are pre-generated by our CLI and committed to the repository.

Figure 1: Example directory structure with nested gitlab-ci.yml files.

Child gitlab-ci.yml files are added by using the include keyword.

Figure 2: Example root .gitlab-ci.yml file, and include clauses.

Figure 3: Example child .gitlab-ci.yml file for a given stage (Deploy Model) in a pipeline (pipeline 1).

As teams added more pipelines and stages, we soon hit a limitation in this approach:

There was a soft limit on the number of includes that could be used in the base .gitlab-ci.yml file.

It became evident that this approach would not scale to our use-cases.

Approach 2: Dynamically generating a big CI file

Our next attempt to solve this problem was to try to inject and inline the nested child gitlab-ci.yml contents into the root gitlab-ci.yml file, so that we no longer needed to rely on the in-built GitLab “include” clause.

To achieve this, we wrote a utility that parsed a raw gitlab-ci file, walked the tree to retrieve all “included” child gitlab-ci files, and replaced the includes inline to generate a final, large gitlab-ci.yml file.

Figure 4 illustrates the resulting file generated from the example in Figure 3.

Figure 4: “Fat” YAML file generated through this approach, assumes the original raw file of Figure 3.

This approach solved our issues temporarily. Unfortunately, we ended up with GitLab files that were up to 1800 lines long. There is also a soft limit on the size of gitlab-ci.yml files, so it became evident that we would eventually hit the limits of this approach as well.

Solution

Our initial attempt at using static code generation put us partially there. We were able to pre-generate and infer the stage and pipeline names from the information available to us. Code generation was definitely needed, but upfront generation of code had some key limitations, as shown above. We needed a way to improve on this, to somehow generate GitLab stages on the fly. After some research, we stumbled upon Dynamic Child Pipelines.

Quoting the official website:

Instead of running a child pipeline from a static YAML file, you can define a job that runs your own script to generate a YAML file, which is then used to trigger a child pipeline.

This technique can be very powerful in generating pipelines targeting content that changed or to build a matrix of targets and architectures.

We were already on the right track. We just needed to combine code generation with child pipelines, to dynamically generate the necessary stages on the fly.

Architecture details

Figure 5: Flow diagram of how we use dynamic yaml generation. The user raises a merge request in a branch, and subsequently merges the branch to master.

Implementation

The user Git flow can be seen in Figure 5, where the user modifies or adds some files in their respective Git team repo. As a refresher, a typical repo structure consists of pipelines and stages (see Figure 1). We would need to extract the information necessary from the branch environment in Figure 5, and have a stage to programmatically generate the proper stages (for example, Figure 3).

In short, our requirements can be summarized as:

  1. Detecting the files being changed in the Git branch.
  2. Extracting the information needed from the files that have changed.
  3. Passing this to be templated into the necessary stages.

Let’s take a very simple example, where a user is modifying a file in stage_1 in pipeline_1 in Figure 1. Our desired output would be:

Figure 6: Desired output that should be dynamically generated.

Our template would be in the form of:

Figure 7: Example template, and information needed. Let’s call it template_file.yml.

First, we need to detect the files being modified in the branch. We achieve this with native git diff commands, checking against the base of the branch to track what files are being modified in the merge request. The output (let’s call it diff.txt) would be in the form of:

M        pipelines/pipeline_1/stage_1/modelserving.yaml
Figure 8: Example diff.txt generated from git diff.

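As a rough sketch (the base ref and the exact flags here are our assumptions, not necessarily what our pipeline uses), the diff file can be produced with something like:

# Diff the branch against its merge base with the default branch,
# keeping only the change status and file path of each modified file.
git diff --name-status "$(git merge-base origin/master HEAD)" HEAD > diff.txt
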
From each line, we must extract two pieces of information, corresponding to pipeline_name and stage_name: the pipeline name (pipeline_1) and the stage name (stage_1).

Figure 9: Information that needs to be extracted from the file.

We take a very simple approach here, by introducing a concept called stop patterns.

Stop patterns are defined as a comma-separated list of variable names and the folder names (stop words) to stop at. The number of colons (:) denotes how many levels before the stop word to stop.

For example, the stop pattern:

pipeline_name:pipelines

tells the parser to look for the folder pipelines and stop just before it, extracting pipeline_1 from the example above and tagging it to the variable name pipeline_name.

The stop pattern with two colons (::):

stage_name::pipelines

tells the parser to stop two levels before the folder pipelines, and extract stage_1 as stage_name.

Our CLI tool allows multiple stop patterns to be supplied as a comma-separated list, so the final command would be:

cauldron_repo_util diff.txt template_file.yml \
  pipeline_name:pipelines,stage_name::pipelines > generated.yml

We elected to write the util in Rust due to its high performance, its rich templating libraries (for example, Tera), and its decent CLI libraries (for example, clap).

Combining all these together, we are able to extract the information needed from git diff, and use stop patterns to extract the necessary information to be passed into the template. Stop patterns are flexible enough to support different types of folder structures.

Figure 10: Example Rust code snippet for parsing the Git diff file.

When triggering pipelines in the master branch (see right side of Figure 5), the flow is the same, with a small caveat that we must retrieve the same diff.txt file from the source branch. We achieve this by using the rich GitLab API, retrieving the pipeline artifacts and using the same util above to generate the necessary GitLab steps dynamically.

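As an illustration only (the host, project ID, job ID, and token below are placeholders, not our actual setup), downloading a single artifact file through the GitLab API looks roughly like:

curl --header "PRIVATE-TOKEN: $GITLAB_API_TOKEN" \
  "https://gitlab.example.com/api/v4/projects/<project_id>/jobs/<job_id>/artifacts/diff.txt" \
  --output diff.txt
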
Impact

After implementing this change, our biggest success was reducing one of the biggest ML pipeline Git repositories from 1800 lines to 50 lines. This approach keeps the size of the .gitlab-ci.yaml file constant at 50 lines, and ensures that it scales with however many pipelines are added.

Our users, the machine learning practitioners, are also more productive as they no longer need to worry about GitLab YAML files.

Learnings and conclusion

With some creativity, and the flexibility of GitLab Child Pipelines, we were able to invest some engineering effort into making the configuration re-usable, adhering to DRY principles.


Special thanks to the Cauldron ML Platform team.


What’s next

We might open source our solution.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Highlights from Git 2.36

Post Syndicated from Taylor Blau original https://github.blog/2022-04-18-highlights-from-git-2-36/

The open source Git project just released Git 2.36, with features and bug fixes from over 96 contributors, 26 of them new. We last caught up with you on the latest in Git back when 2.35 was released.

To celebrate this most recent release, here’s GitHub’s look at some of the most interesting features and changes introduced since last time.

Review merge conflict resolution with –remerge-diff

Returning readers may remember our coverage of merge ort, the from-scratch rewrite of Git’s recursive merge engine.

This release brings another new feature powered by ort, which is the --remerge-diff option. To explain what --remerge-diff is and why you might be excited about it, let’s take a step back and talk about git show.

When given a commit git show will print out that commit’s log message as well as its diff. But it has slightly different behavior when given a merge commit, especially one that had merge conflicts. If you’ve ever passed a conflicted merge to git show, you might be familiar with this output:

If you look closely, you might notice that there are actually two columns of diff markers (the + and - characters to indicate lines added and removed). These come from the output of git diff-tree -cc, which is showing us the diff between each parent and the post-image of the given commit simultaneously.

In this particular example, the conflict occurs because one side has an extra argument in the dwim_ref() call, and the other includes an updated comment to reflect renaming a variable from sha1 to oid. The left-most markers show the latter resolution, and the right-most markers show the former.

But this output can be understandably difficult to interpret. In Git 2.36, --remerge-diff takes a different approach. Instead of showing you the diffs between the merge resolution and each parent simultaneously, --remerge-diff shows you the diff between the file with merge conflicts, and the resolution.

The above shows the output of git show with --remerge-diff on the same conflicted merge commit as before. Here, we can see the diff3-style conflicts (shown in red, since the merge commit removes the conflict markers during resolution) along with the resolution. By more clearly indicating which parts of the conflict were left as-is, we can more easily see how the given commit resolved its conflicts, instead of trying to weave-together the simultaneous diff output from git diff-tree -cc.

Reconstructing these merges is made possible using ort. The ort engine is significantly faster than its predecessor, recursive, and can reconstruct all conflicted merges in linux.git in about 3 seconds (as compared to diff-tree -cc, which takes more than 30 seconds to perform the same operation [source]).

Give it a whirl in your Git repositories on 2.36 by running git show --remerge-diff on some merge conflicts in your history.

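For example (the revision below is a placeholder), you could list a few recent merge commits and re-show one of them with the new option:

$ git log --merges --oneline -5
$ git show --remerge-diff <merge-commit>
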
[source]

More flexible fsync configuration

If you have ever looked around in your repository’s .git directory, you’ll notice a variety of files: objects, references, reflogs, packfiles, configuration, and the like. Git writes these objects to keep track of the state of your repository, creating new object files when you make new commits, update references, repack your repository, and so on.

Most likely, you haven’t had to think too hard about how these files are written and updated. If you’re curious about these details, then read on! When any application writes changes to your filesystem, those changes aren’t immediately persisted, since writing to the external storage medium is significantly slower than updating your filesystem’s in-memory caches.

Instead, changes are staged in memory and periodically flushed to disk at which point the changes are (usually, though disks and controllers can have their own write caches, too) written to the physical storage medium.

Aside from following standard best-practices (like writing new files to a temporary location and then atomically moving them into place), Git has had a somewhat limited set of configuration available to tune how and when it calls fsync, mostly limited to core.fsyncObjectFiles, which, when set, causes Git to call fsync() when creating new loose object files. (Git has had non-configurable fsync() calls scattered throughout its codebase for things like writing packfiles, the commit-graph, multi-pack index, and so on).

Git 2.36 introduces a significantly more flexible set of configuration options to tune how and when Git will explicitly fsync lots of different kinds of files, not just if it fsyncs loose objects.

At the heart of this new change are two new configuration variables:
core.fsync and core.fsyncMethod. The former lets you pick a comma-separated list of which parts of Git’s internal data structures you want to be explicitly flushed after writing. The full list can be found in the documentation, but you can pick from things like pack (to fsync files in $GIT_DIR/objects/pack), loose-object (to fsync loose objects), or reference (to fsync references in the $GIT_DIR/refs directory). There are also aggregate options like objects (which implies both loose-object and pack), along with others like derived-metadata, committed, and all.

You can also tune how Git ensures the durability of components included in your core.fsync configuration by setting the core.fsyncMethod to either fsync (which calls fsync(), or issues a special fcntl() on macOS), or writeout-only, which schedules the written data for flushing, though does not guarantee that metadata like directory entries are updated as part of the flush operation.

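For example, a repository could be opted into explicitly syncing loose objects, packfiles, and references with something like:

$ git config core.fsync loose-object,pack,reference
$ git config core.fsyncMethod fsync
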
Most users won’t need to change these defaults. But for server operators who have many Git repositories living on hardware that may suddenly lose power, having these new knobs to tune will provide new opportunities to enhance the durability of written data.

[source, source, source]

Stricter repository ownership checks

If you haven’t seen our blog post from last week announcing the security patches for versions 2.35 and earlier, let me give you a brief recap.

Beginning in Git 2.35.2, Git changed its default behavior to prevent you from executing git commands in a repository owned by a different user than the current one. This is designed to prevent git invocations from unintentionally executing commands which the repository owner configured.

You can bypass this check by setting the new safe.directory configuration to include trusted repositories owned by other users. If you can’t upgrade immediately, our blog post outlines some steps you can take to mitigate your risk, though the safest thing you can do is upgrade to the latest version of Git.

Since publishing that blog post, the safe.directory option now interprets the value * to consider all Git repositories as safe, regardless of their owner. You can set this in your --global config to opt-out of the new behavior in situations where it makes sense.

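For example, to opt out of the check entirely in your global configuration:

$ git config --global --add safe.directory '*'
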
[source]

Tidbits

Now that we have looked at some of the bigger features in detail, let’s turn to a handful of smaller topics from this release.

  • If you’ve ever spent time poking around in the internals of one of your Git repositories, you may have come across the git cat-file command. Reminiscent of cat, this command is useful for printing out the raw contents of Git objects in your repository. cat-file has a handful of other modes that go beyond just printing the contents of an object. Instead of printing out one object at a time, it can accept a stream of objects (via stdin) when passed the --batch or --batch-check command-line arguments. These two similarly-named options have slightly different outputs: --batch instructs cat-file to just print out each object’s contents, while --batch-check is used to print out information about the object itself, like its type and size1.

    But what if you want to dynamically switch between the two? Before, the only way was to run two separate copies of the cat-file command in the same repository, one in --batch mode and the other in --batch-check mode. In Git 2.36, you no longer need to do this. You can instead run a single git cat-file command with the new --batch-command mode. This mode lets you ask for the type of output you want for each object. Its input looks either like contents <object>, or info <object>, which correspond to the output you’d get from --batch, or --batch-check, respectively.

    For server operators who may have long-running cat-file commands intended to service multiple requests, --batch-command accepts a new flush command, which flushes the output buffer upon receipt.

    [source, source]

  • Speaking of Git internals, if you’ve ever needed to script around the contents of a tree object in your repository, then there’s no doubt that git ls-tree has come in handy.

    If you aren’t familiar with ls-tree, the gist is that it allows you to list the contents of a tree objects, optionally recursing through nested sub-trees. Its output looks something like this:

    $ git ls-tree HEAD -- builtin/
    100644 blob 3ffb86a43384f21cad4fdcc0d8549e37dba12227  builtin/add.c
    100644 blob 0f4111bafa0b0810ae29903509a0af74073013ff  builtin/am.c
    100644 blob 58ff977a2314e2878ee0c7d3bcd9874b71bfdeef  builtin/annotate.c
    100644 blob 3f099b960565ff2944209ba514ea7274dad852f5  builtin/apply.c
    100644 blob 7176b041b6d85b5760c91f94fcdde551a38d147f  builtin/archive.c
    [...]
    

    Previously, the customizability of ls-tree's output was somewhat limited. You could restrict the output to just the filenames with --name-only, print absolute paths with --full-name, or abbreviate the object IDs with --abbrev, but that was about it.

    In Git 2.36, you have a lot more control over how ls-tree's output should look. There's a new --object-only option to complement --name-only. But if you really want to customize its output, the new --format option is your best bet. You can select from any combination and order of each entry's mode, type, name, and size.

    Here’s a fun example of where something like this might come in handy. Let’s say you’re interested in the distribution of file-sizes of blobs in your repository. Before, to get a list of object sizes, you would have had to do either:

    $ git ls-tree ... | awk '{ print $3 }' | git cat-file --batch-check='%(objectsize)'
    

    or (ab)use the --long format and pull out the file sizes of blobs:

    $ git ls-tree -l | awk '{ print $4 }'
    

    but now you can ask for just the file sizes directly, making it much more convenient to script around them:

    $ dist () {
     ruby -lne 'print 10 ** (Math.log10($_.to_i).ceil)' | sort -n | uniq -c
    }
    $ git ls-tree --format='%(objectsize)' HEAD:builtin/ | dist
      8 1000
     59 10000
     53 100000
      2 1000000
    

    …showing us that we have 8 files that are between 1-10 KiB in size, 59 files between 10-100 KiB, 53 files between 100 KiB and 1 MiB, and 2 files larger than 1 MiB.

    [source, source, source, source]

  • If you’ve ever tried to track down a bug using Git, then you’re familiar with the git bisect command. If you haven’t, here’s a quick primer. git bisect takes two revisions of your repository, one corresponding to a known “good” state, and another corresponding to some broken state. The idea is then to run a binary search between those two points in history to find the first commit which transitioned the good state to the broken state.

    If you aren’t a frequent bisect user, you may not have heard of the git bisect run command. Instead of requiring you to classify whether each point along the search is good or bad, you can supply a script which Git will execute for you, using its exit status to classify each revision.

    This can be useful when trying to figure out which commit broke the build, which you can do by running:

    $ git bisect start <bad> <good>
    $ git bisect run make
    

    which will run make along the binary search between <bad> and <good>, outputting the first commit which broke compilation.

    But what about automating more complicated tests? It can often be useful to write a one-off shell script which runs some test for you, and then hand that off to git bisect. Here, you might do something like:

    $ vi test.sh
    # type type type
    $ git bisect run test.sh
    

    See the problem? We forgot to mark test.sh as executable! In previous versions of Git, git bisect would incorrectly carry on the search, classifying each revision as broken. In Git 2.36, git bisect will detect that you forgot to mark the script as executable, and halt the search early.

    [source]

  • When you run git fetch, your Git client communicates with the remote to carry out a process called negotiation to determine which objects the server needs to send to complete your request. Roughly speaking, your client and the server mutually advertise what they have at the tips of each reference, then your client lists which objects it wants, and the server sends back all objects between the requested objects and the ones you already have.

    This works well because Git always expects to maintain closure over reachable objects2, meaning that if you have some reachable object in your repository, you also have all of its ancestors.

    In other words, it’s fine for the Git server to omit objects you already have, since the combination of the objects it sends along with the ones you already have should be sufficient to assemble the branches and tags your client asked for.

    But if your repository is corrupt, then you may need the server to send you objects which are reachable from ones you already have, in which case it isn’t good enough for the server to just send you the objects between what you have and want. In the past, getting into a situation like this may have led you to re-clone your entire repository.

    Git 2.36 ships with a new option to git fetch which makes it easier to recover from certain kinds of repository corruption. By passing the new --refetch option, you can instruct git fetch to fetch all objects from the remote, regardless of which objects you already have, which is useful when the contents of your objects directory are suspect.

    [source]

  • Returning readers may remember our earlier discussions about the sparse index and sparse checkouts, which make it possible to only have part of your repository checked out at a time.

    Over the last handful of releases, more and more commands have become compatible with the sparse index. This release is no exception, with four more Git commands joining the pack. Git 2.36 brings sparse index support to git clean, git checkout-index, git update-index, and git read-tree.

    If you haven’t used these commands, there’s no need to worry: adding support to these plumbing commands is designed to lay the ground work for building a sparse index-aware git stash. In the meantime, sparse index support already exists in the commands that you are most likely already familiar with, like git status, git commit, git checkout, and more.

    As an added bonus, git sparse-checkout (which is used to enable the sparse checkout feature and dictate which parts of your repository you want checked out) gained support for the command-line completion Git ships in its contrib directory.

    [source, source, source]

  • Returning readers may remember our previous coverage on partial clones, a relatively new feature in Git which allows you to initialize your clones by downloading just some of the objects in your repository.

    If you used this feature in the past with git clone's --recurse-submodules flag, the partial clone filter was only applied to the top-level repository, cloning all of the objects in the submodules.

    This has been fixed in the latest release, where the --filter specification you use in your top-level clone is applied recursively to any submodules your repository might contain, too.

    [source, source]

  • While we’re talking about partial clones, now is a good time to mention partial bundles, which are new in Git 2.36. You may not have heard of Git bundles, which is a different way of transferring around parts of your repository.

    Roughly speaking, a bundle combines the data in a packfile, along with a list of references that are contained in the bundle. This allows you to capture information about the state of your repository into a single file that you can share. For example, the Git project uses bundles to share embargoed security releases with various Linux distribution maintainers. This allows us to send all of the objects which comprise a new release, along with the tags that point at them in a single file over email.

    In previous releases of Git, it was impossible to prepare a filtered bundle which you could apply to a partial clone. In Git 2.36, you can now prepare filtered bundles, whose contents are unpacked as if they arrived during a partial clone3. You can’t yet initialize a new clone from a partial bundle, but you can use it to fetch objects into a bare repository:

    $ git bundle create --filter=blob:none ../partial.bundle v2.36.0
    $ cd ..
    $ git init --bare example.repo
    $ git fetch --filter=blob:none ../partial.bundle 'refs/tags/*:refs/tags/*'
    [ ... ]
    From ../example.bundle
    * [new tag]             v2.36.0 -> v2.36.0
    

    [source, source]

  • Lastly, let’s discuss a bug fix concerning Git’s multi-pack reachability bitmaps. If you have started to use this new feature, you may have noticed a handful of new files in your .git/objects/pack directory:

    $ ls .git/objects/pack/multi-pack-index*
    .git/objects/pack/multi-pack-index
    .git/objects/pack/multi-pack-index-33cd13fb5d4166389dbbd51cabdb04b9df882582.bitmap
    .git/objects/pack/multi-pack-index-33cd13fb5d4166389dbbd51cabdb04b9df882582.rev
    

    In order, these are: the multi-pack index (MIDX) itself, the reachability bitmap data, and the reverse-index which tells Git which bits correspond to what objects in your repository.

    These are all associated back to the MIDX via the MIDX’s checksum, which is how Git knows that the three belong together. This release fixes a bug where the .rev file could fall out-of-sync with the MIDX and its bitmap, leading Git to report incorrect results when using a multi-pack bitmap. This happens when changing the object order of the MIDX without changing the set of objects tracked by the MIDX.

    If your .rev file has a modification time that is significantly older than the MIDX and .bitmap, you may have been bitten by this bug4. Luckily this bug can be resolved by dropping and regenerating your bitmaps5. To prevent a MIDX bitmap and its .rev file from falling out of sync again, the contents of the .rev are now included in the MIDX itself, forcing the MIDX’s checksum to change whenever the object order changes.

    [source]

The rest of the iceberg

That’s just a sample of changes from the latest release. For more, check out the release notes for 2.36, or any previous version in the Git repository.


  1. You can ask for other attributes, too, like %(objectsize:disk) which shows how many bytes it takes Git to store the object on disk (which can be smaller than %(objectsize) if, for example, the object is stored as a delta against some other, similar object). 
  2. This isn’t quite true, because of things like shallow and partial clones, along with grafts, but the assumption is good enough for our purposes here. What matters is that outside of scenarios where we expect to be missing objects, the only time we don’t have a reachability closure is when the repository itself is corrupt. 
  3. In Git parlance, this would be a packfile from a promisor remote
  4. This isn’t an entirely fool-proof way of detecting whether this bug occurred, since it’s possible your bitmaps were rewritten after first falling out-of-sync. When this happens, it’s possible that the corrupt bitmaps are propagated forward when generating new bitmaps. You can use git rev-list --test-bitmap HEAD to check if your bitmaps are OK. 
  5. By first running rm -f .git/objects/pack/multi-pack-index*, and then
    git repack -d --write-midx --write-bitmap-index

Git security vulnerability announced

Post Syndicated from Taylor Blau original https://github.blog/2022-04-12-git-security-vulnerability-announced/

Today, the Git project released new versions which address a pair of security vulnerabilities.

GitHub is unaffected by these vulnerabilities1. However, you should be aware of them and upgrade your local installation of Git, especially if you are using Git for Windows, or you use Git on a multi-user machine.

CVE-2022-24765

This vulnerability affects users working on multi-user machines where a malicious actor could create a .git directory in a shared location above a victim’s current working directory. On Windows, for example, an attacker could create C:\.git\config, which would cause all git invocations that occur outside of a repository to read its configured values.

Since some configuration variables (such as core.fsmonitor) cause Git to execute arbitrary commands, this can lead to arbitrary command
execution when working on a shared machine.

The most effective way to protect against this vulnerability is to upgrade to Git v2.35.2. This version changes Git’s behavior when looking for a top-level .git directory to stop when its directory traversal changes ownership from the current user. (If you wish to make an exception to this behavior, you can use the new multi-valued safe.directory configuration).

If you can’t upgrade immediately, the most effective ways to reduce your risk are the following:

  • Define the GIT_CEILING_DIRECTORIES environment variable to contain the parent directory of your user profile (i.e., /Users on macOS,
    /home on Linux, and C:\Users on Windows).
  • Avoid running Git on multi-user machines when your current working directory is not within a trusted repository.

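For the first mitigation, a concrete example on Linux would be:

$ export GIT_CEILING_DIRECTORIES=/home
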
Note that many tools (such as the Git for Windows installation of Git Bash, posh-git, and Visual Studio) run Git commands under the hood. If you are on a multi-user machine, avoid using these tools until you have upgraded to the latest release.

Credit for finding this vulnerability goes to 俞晨东.

[source]

CVE-2022-24767

This vulnerability affects the Git for Windows uninstaller, which runs in the user’s temporary directory. Because the SYSTEM user account inherits the default permissions of C:\Windows\Temp (which is world-writable), any authenticated user can place malicious .dll files that are loaded when the Git for Windows uninstaller is run via the SYSTEM account.

The most effective way to protect against this vulnerability is to upgrade to Git for Windows v2.35.2. If you can’t upgrade
immediately, reduce your risk with the following:

  • Avoid running the uninstaller until after upgrading
  • Override the SYSTEM user’s TMP environment variable to a directory which can only be written to by the SYSTEM user
  • Remove unknown .dll files from C:\Windows\Temp before running the
    uninstaller
  • Run the uninstaller under an administrator account rather than as the
    SYSTEM user

Credit for finding this vulnerability goes to the Lockheed Martin Red Team.

[source]

Download Git 2.35.2


  1. GitHub does not run git outside of known repositories, so is not susceptible to the attack described by CVE-2022-24765. Likewise, GitHub does not use Git for Windows, and so is unaffected by CVE-2022-24767 entirely. 

Performance at GitHub: deferring stats with rack.after_reply

Post Syndicated from blakewilliams original https://github.blog/2022-04-11-performance-at-github-deferring-stats-with-rack-after_reply/

Performance is an essential aspect of any production application, and GitHub is no exception. In the last year, we’ve been making significant investments to improve the performance of some of our most critical workflows. There’s a lot to share, but today we’re going to focus on a little-known feature of some Rack-powered web servers called rack.after_reply that saved us 30ms p50 and 50ms+ p99 on every request across the site.

The problem

First, let me talk about how we found ourselves looking at this Rack web server functionality. When diving into performance, we often look at flamegraphs that reveal where time is being spent for a given request. Eventually, we started to notice a pattern. It turned out we were spending a significant amount of time sending statsd metrics throughout a request. In fact, it could be up to 65ms per request! While telemetry enables us to improve GitHub, it doesn’t help users directly, especially when it impacts the performance of our page loads.

There’s no need to send telemetry immediately, especially if it’s expensive, so we started discussing ways to remove telemetry network calls out of the critical path. This resulted in the question, “Can we batch all of our telemetry calls and send them at once?” There’s far too much data to use a background job, so we had to look at other potential solutions. We came across Rack::Events and spiked out a proof of concept. Unfortunately, this approach caused any work performed in Rack::Events callbacks to block the response from closing. This resulted in two issues. First, users would see the browser loading indicator until the deferred code was complete. Second, this impacted our metrics since response times still included the time it took to run our deferred code.

A potential path forward

Recognizing the potential, we kept digging. Eventually, we came across a feature that seemed to do exactly what we wanted, rack.after_reply. The Puma source code describes rack.after_reply as: “A rack extension. If the app writes #call’ables to this array, we will invoke them when the request is done.” In other words, this would allow us to collect telemetry metrics during a request and flush them in a single batch after the browser received the full response, avoiding the issue of network calls slowing down response times. There was a problem though. We don’t use Puma at GitHub, we use Unicorn. 😱

We had a clear path forward. All we needed to do was implement rack.after_reply in Unicorn. This functionality has a lot of use cases outside of sending telemetry metrics, so we contributed rack.after_reply back to Unicorn. This allowed us to write a small wrapper class around rack.after_reply, making it easier to use in the application:


class AfterResponse
  def initialize(env)
    env["rack.after_reply"] ||= []
    env["rack.after_reply"] << -> do
      self.call
    end
    @to_perform = []
  end

  # Calls each callable defined via #perform
  def call
    @to_perform.each do |block|
      begin
        block.call(self)
      rescue Object => e
        Rails.logger.error(e)
      end
    end
  end

  # Adds given block to the array of callables that will
  # be called after the user has received the response.
  def perform(name, &block)
    @to_perform << block
  end
end

With the hard part completed, we wrote a small wrapper around our telemetry class to buffer all of our stats data. Combining that with a middleware that performs roughly the following, users now receive responses 30ms faster P50 across all pages. P99 response times improved too, with a 50ms+ reduction across the site.


GitHub.after_response.perform do
  GitHub.statsd.flush!
end

In conclusion

There are a few drawbacks to this approach that are worth mentioning. The primary disadvantage is that a Unicorn worker executes the rack.after_reply code, making it unable to serve another HTTP request until your callables finish running. If not kept in check, this can potentially increase HTTP queueing, the time spent waiting for a web server worker to serve a request. We mitigated this by adding timeout behavior to prevent any code from running too long.

It’s a little early to talk about it, but there is progress toward making this functionality an official part of the Rack spec. A recently released gem also implements additional functionality around this feature called maybe_later that’s worth checking out.

Overall, we’re pleased with the results. Unicorn-powered applications now have a new tool at their disposal, and our customers reap the benefits of this performance enhancement!