Tag Archives: Product

Getting started with Git and GitHub is easier than ever with GitHub Desktop 2.2

Post Syndicated from Amanda Pinsker original https://github.blog/2019-10-02-get-started-easier-with-github-desktop-2-2/

Anyone who uses Git knows that it has a steep learning curve. We’ve learned from developers that most people tend to learn from a buddy, whether that’s a coworker, a professor, a friend, or even a YouTube video. In GitHub Desktop 2.2, we’re releasing the first version of an interactive Git and GitHub tutorial that can be your buddy and help you get started. If you’re new to Desktop, you can download and try out the tutorial at desktop.github.com.

Get set up

To get set up, we help you through two major pieces: creating a repository and connecting an editor. When you first open Desktop, a welcome page appears with a new option to “Create a Tutorial Repository”. Starting with this option creates a tutorial repository that guides you through the core concepts of working with Git using GitHub Desktop.

There are a lot of tools you need to get started with Git and GitHub. The most important of these is your code editor. In the first step of the tutorial, you’re prompted to install an editor if you don’t have one already.

Learn the GitHub flow

Next, we guide you through how to use GitHub Desktop to make changes to code locally and get your work on GitHub. You’ll create a new branch, make a change to a file, commit it, push it to GitHub, and open your first pull request.

We’ve also heard that new users initially experience confusion between Git, GitHub, and GitHub Desktop. We cover these differences in the tutorial and make sure to reinforce the explanations.

Keep going with your own project

In GitHub Desktop 1.6, we introduced suggested next steps based on the state of your repository. Now when you complete the tutorial, we similarly suggest next steps: exploring projects on GitHub that you might want to contribute to, creating a new project, or adding an existing project to Desktop. We always want GitHub Desktop to be the tool that makes your next steps clear, whether you’re in the flow of your work, or you’re a new developer just getting started.

What’s next?

With GitHub Desktop 2.2, we’re making the product our users love more approachable to newcomers. We’ll be iterating on the tutorial based on your feedback, and we’ll continue to build on the connection between GitHub and your local machine. If you want to start building something but don’t know how, think of GitHub Desktop as your buddy to help you get started.

Learn more about GitHub Desktop

The post Getting started with Git and GitHub is easier than ever with GitHub Desktop 2.2 appeared first on The GitHub Blog.

New workflow editor for GitHub Actions

Post Syndicated from Chris Patterson original https://github.blog/2019-10-01-new-workflow-editor-for-github-actions/

It’s now even easier to create and edit a GitHub Actions workflow with the updated editor. We’ve added inline auto-completion and linting as you type, so you can say goodbye to YAML indentation issues and explore the full workflow syntax without going to the docs.

Get through code faster with auto-complete

Auto-complete can be triggered with Ctrl+Space almost anywhere, and in some cases, automatically. It suggests keys or values depending on the current position of the cursor, and displays brief contextual documentation so you can discover and understand all the available options without losing focus. Auto-complete even works inside expressions.
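For context, the kind of file being edited here is a workflow YAML like the minimal sketch below; the job name, checkout action version, and test command are placeholders rather than a recommended setup:

```yaml
name: CI

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v1
      - name: Run tests
        # Expressions such as the one below are also covered by auto-complete
        if: github.event_name == 'push'
        run: make test
```

With auto-complete, keys such as `runs-on` and context properties such as `github.event_name` are suggested as you type.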

Explore context options without losing focus

Additionally, the editor will insert a new line with the right indentation when you press enter. It will also either suggest the next key to insert or insert code snippets that you can navigate through using the tab key.

Snippets also work when using functions inside expressions, allowing you to easily write and navigate through the required arguments.

Minimize mistakes

There are many updates to help you write workflow files, but if you make a mistake, we’ve got you covered too.

The editor now highlights structural errors in your file, unexpected values, or even conflicting values, such as an invalid shell value for the chosen operating system.

Understand cron expressions quickly

Finally, the editor will also help you edit your scheduled jobs by describing the cron expressions you have used with natural language:
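As an illustration, a scheduled workflow uses a standard five-field cron expression like the sketch below; for this one, the editor would show a plain-language description along the lines of “at 07:30 UTC, Monday through Friday” (the schedule itself is just an example):

```yaml
on:
  schedule:
    # minute hour day-of-month month day-of-week
    - cron: '30 7 * * 1-5'
```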

Tell us what you think

These are just a few of the features we’ve added to help you edit workflows with fewer errors. We’d love to hear any feedback you have—share your questions and comments with us in the GitHub Actions Community Forum.

Learn more about GitHub Actions

The post New workflow editor for GitHub Actions appeared first on The GitHub Blog.

Accelerating the GitHub Sponsors beta with Stripe Connect

Post Syndicated from Katie Delfin original https://github.blog/2019-09-10-accelerating-the-github-sponsors-beta/

Update: GitHub Sponsors is available in every country where GitHub does business, not just the 30 countries supported by Stripe Connect. We’ll continue to use our existing manual payout system for anyone outside of Connect’s list of currently supported countries.


Back in May, we announced GitHub Sponsors, a new way to support the developers who build and maintain the open source software you use every day. We launched in beta as early as possible so we could work closely with the community to address feedback and understand their needs before adding more developers to the program. Today, we’re taking the next step in accelerating the availability of GitHub Sponsors through a new streamlined onboarding and payment experience with Stripe Connect.

The early days of manual payments

Within hours of announcing GitHub Sponsors, thousands of people signed up for the waitlist from all over the world. It was awesome to see so much excitement for the program, but we knew this meant we had a lot of work ahead of us. 

We started the beta with a small group of sponsored developers, and every few weeks, we invited more into the program. Everything was a manual process—setting up account information, verifying identity, waiting for approval, running reports, and processing payouts across multiple departments. Due to the manual nature of the process, we were limited to a small number of people in the program until we could automate and streamline our operations.

Onboarding more developers, faster

At GitHub, we do business with companies of all sizes, from Fortune 50 companies to early stage startups from all over the world. Collecting revenue globally is something we’ve done for years, but with GitHub Sponsors, it’s not just about receiving money—it’s also about paying it out. This means providing a seamless and secure experience for sponsored developers to verify their identities, enter banking information, receive funds, and manage payouts. We also know that, for many maintainers, this will be the first time they’ve been financially rewarded for their contributions to open source. We want to ensure the process is easy for developers to complete, so they can spend more time doing what they love, like contributing to open source.

Stripe Connect Express offers a simple, low overhead onboarding experience that complies with web accessibility guidelines, coupled with the ability to localize the experience to simplify payouts in countries with unique tax laws and compliance regulations. And by not building our own onboarding solution, we’re able to save months of engineering and maintenance time, and start onboarding more sponsored developers today.

Localization

One of Stripe Connect’s features that’s helped us deliver a great experience to our users is their localization support. Stripe just released international support for 30 countries (with more to come) for Connect Express. With this expanded support, Stripe takes care of adjusting for localized rules and regulations for supported countries—no small feat. We couldn’t be more thrilled to be one of the first to offer support across all 30 countries to serve our global community.

Our maintainer onboarding process can now start to keep up with the growing demand of the program. With Stripe, users can onboard using Connect Express, reducing onboarding time from a week-long process to under five minutes.  

And there’s much more to come. While we’re just under four months into the program, we’re focused on making GitHub Sponsors available to even more developers around the world.

Sign up to join the beta—you’ll hear from us soon.

Learn more about GitHub Sponsors

The post Accelerating the GitHub Sponsors beta with Stripe Connect appeared first on The GitHub Blog.

Save Your Place with Grab!

Post Syndicated from Grab Tech original https://engineering.grab.com/save-your-place-with-grab

Do you find it tedious to type and search for your destination, or have a hard time remembering the address of the friend you are going to meet? It can be really confusing to keep track of the many addresses you frequent on a regular basis. To solve this pain point, Grab rolled out a new feature called Saved Places in January 2019 across Southeast Asia.

With Saved Places, you can save an address and also add a label like “Home”, “Work”, or “Gym”, which makes finding and selecting an address for booking a ride or ordering your favourite food a breeze!

Never forget your addresses again!

To use the feature, fire up your Grab app, head to the “Saved Places” section in the app navigation bar, and start adding all your favourite destinations such as your home, your office, your favourite mall, or the airport. You’re done with the hassle of typing them again.

Save your place with Grab!

 

Hola! Your saved addresses are just a click away when you order a ride or your favourite meal.

Inspiration behind the work

We at Grab continuously engage with our customers to understand how we can outserve them better. Difficulty in choosing the correct address was one of the key pieces of feedback shared by our customers. Our drivers shared funny stories about places that have similar names but entirely different locations: Sime Road is in Bukit Timah but Simei Road is in Simei, almost 20 km away; Nicoll Highway is in Kallang but Nicoll Drive is in Changi, also almost 20 km away. In cases like these, even though users book from the address frequently, there is still room for misselection.

Data-Driven Decisions

Our vast repository of data and insights has helped us identify and solve some challenging problems. Our analysis of millions of transport bookings and food orders revealed that customers usually visit five to seven unique locations and order food at one or two addresses.

One intriguing yet intuitive insight was a set pattern in users’ booking behaviour during weekdays. Many passengers mostly commute between two addresses, likely going to the office in the morning and returning home in the evening. These identifiable clusters of pick-up and drop-off locations during peak hours supported our hypothesis that users rely on a small set of locations for their Grab bookings. The pictures below show such clusters in Singapore and Jakarta, where passengers generally commute out in the morning and back in the evening.

Save your place with Grab!

 

This insight also motivated us to test out the concept of user-created labels, which allow users to mark their saved places with their own labels like “Home”, “Changi Airport”, “Sis’s House”, etc. Initial experiment results were extremely encouraging, and we saw significantly higher usage and repeat rates from users.

A group of cross-functional teams (Analytics, Product, Design, Engineering, and others) came together, worked backwards from the customer, brainstormed multiple ideas, and finalised a product approach. We then conducted in-depth user research and usability testing to ensure that the final product met user expectations and was easy to understand and use.

And users love it!

Since the launch, we have seen significant user adoption of the feature. More than 14 million users have saved close to 45 million places. That’s roughly three places per user!

Customers from Singapore and Myanmar tend to save around 3 addresses each whereas customers from Indonesia, Malaysia, Philippines, Thailand, Vietnam and Cambodia save 2 addresses each. A customer from Indonesia has saved a whopping 1,191 addresses!

Users across South East Asia have adopted the feature and as of today, a significant portion of our bookings are made using a saved place for either pickup or drop off. If you were curious, here are the most frequently used labels for saving addresses in Singapore (left) and Indonesia (right):

Save your place with Grab!

 

Apart from saving home and office addresses, our customers are also saving their children’s school addresses and places of worship. Some of them are also saving their favourite shopping destinations.

Another observation, as some may have guessed, concerns the clustering of home addresses. Home addresses in Singapore are evenly scattered across the island (map on upper left), whereas in Jakarta they are concentrated in specific pockets of the city (map on lower left). However, office addresses are concentrated in specific areas in both cities: the CBD and Changi area in Singapore (map on upper right) and central Jakarta (map on lower right).

Save your place with Grab!

 

This is only the beginning

We’re constantly striving to improve the user experience with Grab and make it as seamless as possible. We have only taken the first steps with Saved Places and the path forward involves deeper understanding of user behaviour with the help of saved places data to create a more personalised experience. This is just the beginning and we’re planning to launch some very innovative features in the coming months.

No More Forgetting to Input ERP Charges – Hello Automated ERP!

Post Syndicated from Grab Tech original https://engineering.grab.com/automated-erp-charges

ERP, standing for Electronic Road Pricing, is a system used to manage road congestion in Singapore. Drivers are charged when they pass through ERP gantries during peak hours. ERP rates vary for different roads and time periods based on the traffic conditions at the time. This encourages people to change their mode of transport, travel route or time of travel during peak hours. ERP is seen as an effective measure in addressing traffic conditions and ensuring drivers continue to have a smooth journey.

Did you know that Singapore has a total of 79 active ERP gantries? Did you also know that every ERP gantry changes its fare 10 times a day on average? For example, total ERP charges for a journey from Ang Mo Kio to Marina will cost $10 if you leave at 8:50am, but $4 if you leave at 9:00am on a working day!

Imagine how troublesome it would have been for Grab’s driver-partners who, on top of having to drive and check navigation, would also have had to remember each and every gantry they passed, calculate their total fare, and then manually add the charges to the total ride cost at the end of the ride.

In fact, based on our driver-partners’ feedback, missing out on ERP charges was listed as one of their top pain points. Not only did drivers find the entire process troublesome, it also led to earnings loss as they had to bear the cost of the ERP fares.

We’re glad to share that, as of 15th March 2019, we’ve successfully resolved this pain point for our driver-partners by introducing automated ERP fare calculation!

So, how did we automate the ERP fare calculation for our driver-partners? How did we manage to reduce the number of trips where drivers forget to enter the ERP fare to almost zero? Read on!

How we approached the Problem

The question we wanted to answer was: how do we create an impactful feature to make sure that driver-partners have one less thing to handle when they drive?

We started by looking at the problem at hand. ERP fares in Singapore are very dynamic; they change based on the day and time.

Caption: Example of ERP fare changes on a normal weekday in Singapore

 

We wanted to create a system which can identify the dynamic ERP fares at any given time and location, while simultaneously identifying when a driver-partner has passed through any of these gantries.

However, that wasn’t enough. We wanted this feature to scale to every country Grab operates in, such as Indonesia, Thailand, Malaysia, the Philippines, and Vietnam. We started studying the ERP (or toll, as it is known locally) systems in other countries and realised that every country has its own style of calculating tolls. While in Singapore ERP charges for cars and taxis are the same, Malaysia applies different charges for cars and taxis. Similarly, Vietnam has different tolls for 4-seaters and 7-seaters. Indonesia and Thailand have paired gantries, where you pay at only one of the pair: suppose A and B are paired gantries; if you passed through A, you won’t need to pay at B, and vice versa. This is where our Ops team came to the rescue!

Boots on the Ground!

Collecting all the ERP or toll data for every country is no small feat, recalls Robinson Kudali, program manager for the project. “We had around 15 people travelling across the region for 2-3 weeks, working on collecting data from every possible source in every country.”

Getting the right geographical coordinates for every gantry is very important. We track driver GPS pings frequently, identify the nearest road to that GPS ping and check the presence of a gantry using its coordinates. The entire process requires you to be very accurate; incorrect gantry location can easily lead to us miscalculating the fare.
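To make the idea concrete, here is a rough sketch, not Grab’s actual implementation, of how a driver GPS ping could be matched against known gantry coordinates using a haversine distance check; the gantry list, detection radius, and function names are illustrative assumptions:

```python
from math import radians, sin, cos, asin, sqrt

# Illustrative gantry dataset: (gantry_id, latitude, longitude) - hypothetical values
GANTRIES = [
    ("ERP_CTE_01", 1.3521, 103.8198),
    ("ERP_PIE_07", 1.3400, 103.7700),
]

def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points, in metres."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371000 * asin(sqrt(a))

def gantries_near(ping_lat, ping_lon, radius_m=30):
    """Return gantry IDs within `radius_m` of a driver GPS ping."""
    return [
        gid for gid, glat, glon in GANTRIES
        if haversine_m(ping_lat, ping_lon, glat, glon) <= radius_m
    ]

# A crossing could then be registered once consecutive pings pass the same gantry.
print(gantries_near(1.3522, 103.8199))
```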

Bayu Yanuaragi, our regional mapops lead, explains – “To do this, the first step was to identify all toll gates for all expressways & highways in the country. The team used various mapping software to locate and plot all entry & exit gates using map sources, open data and more importantly government data as references. Each gate was manually plotted using satellite imagery and aligned with our road layers in order to extract the coordinates with a unique gantry ID.”

Location precision is vital in creating the dataset, as it dictates whether a toll gate will be detected by the Grab app or not. The next step was to identify the toll charge from one gate to another. The accuracy of the toll charge per segment directly reflects on the fare that the passenger pays after the trip.

Caption: ERP gantries visualisation on our map – The purple bars are the gantries that we drew on our map

 

Once the data compilation is done, the team conducts fieldwork to verify its integrity. If data gaps are identified, modifications are made accordingly.

Upon submission of the output, stack engineers perform a higher-level quality check of the content in staging.

Lastly, we worked with a local team of driver-partners who volunteered to make sure the new system is fully operational and the prices are correct. Inconsistencies observed were reported by these driver-partners, and then corrected in our system.

Closing the loop

Creating a strong dataset did help us predict correct fares, but we needed something that could also handle the dynamic behaviour of the changing toll status. For example, the Singapore government revises ERP fares every quarter, and there can also be ad-hoc changes, like activating or deactivating gantries, on an ongoing basis.

Garvee Garg, Product Manager for this feature explains: “Creating a system that solves the current problem isn’t sufficient. Your product should be robust enough to handle all future edge case scenarios too. Hence we thought of building a feedback mechanism with drivers.”

In case our ERP fare estimate isn’t correct or there are changes to ERPs on the ground, our driver-partners can provide feedback to us. This feedback flows directly to the Customer Experience team, who does the initial investigation, and from there to our Ops team. A dedicated person from the Ops team checks the validity of the feedback and recommends updates. It only takes one day on average to update the data from when we receive the feedback from the driver-partner.

However, validating the driver feedback was a time-consuming process. We needed a tool that could ease the work of the Ops team by helping them debug each and every case.

Hence the ERP Workflow tool came into the picture.

99% of the time, feedback from our driver-partners is about error cases. When feedback comes in, this tool allows the Ops team to check the entire ride history of the driver and map the driver’s ride trajectory against all the underlying ERP gantries at that particular point in time. The Ops team can then identify whether the ERP fare calculated by our system, or the one reported by the driver, is correct.

This is only the beginning

By creating a system that can automatically calculate and key in ERP fares for each trip, Grab is proud to say that our driver-partners can now drive with less hassle and focus more on the road, bringing the ride experience and safety for both drivers and passengers to a new level!

The Automated ERP feature is currently live in Singapore, and we are now testing it with our driver-partners in Indonesia and Thailand. Next up, we plan to pilot it in the Philippines and Malaysia, and soon in every country where Grab operates, so stay tuned for even more innovative ideas to enhance your experience on our super app!

To know more about what Grab has been doing to improve the convenience and experience for both our driver-partners and passengers, check out other stories on this blog!

Making Grab’s everyday app super

Post Syndicated from Grab Tech original https://engineering.grab.com/grab-everyday-super-app

Grab is Southeast Asia’s leading superapp, providing highly-used daily services such as ride-hailing, food delivery, payments, and more. Our goal is to give people better access to the services that matter to them, with more value and convenience, so we’ve been expanding our ecosystem to include bill payments, hotel bookings, trip planners, and videos – with more to come. We want to outserve our customers – not just by packing the Grab app with useful features and services, but by making the whole experience a unique and personalized one for each of them.

To realize our super app ambitions, we work with partners who, like us, want to help drive Southeast Asia forward.

A lot of the collaborative work we do with our partners can be seen in the Grab Feed. This is where we broadcast various types of content about Grab and our partners in an aggregated manner, adding value to the overall user experience. Here’s what the feed looks like:

Grab super app feed
Waiting for the next promo? Check the Feed.
Looking for news and entertainment? Check the Feed.
Want to know if it’s a good time to book a car? CHECK. THE. FEED.

 

As we continue to add more cards, services, and chunks of content into Grab Feed, there’s a risk that our users will find it harder to find the information relevant to them. So we work to ensure that our platform is able to distinguish and show information based on what’s most suited for the user’s profile. This goes back to what has always been our central focus – the customer – and is why we put so much importance in personalising the Grab experience for each of them.

To excel in a heavily diversified market like Southeast Asia, we leverage the depth of our data to understand what sorts of information users want to see and when they should see it. In this article we will discuss Grab Feed’s recommendation logic and strategies, as well as its future roadmap.

Start your Engines

Grab super app feed

The problem we’re trying to solve here is known as the recommendations problem. In a nutshell, this problem is about inferring the preference of consumers to recommend content and services to them. In Grab Feed, we have different types of content that we want to show to different types of consumers and our challenge is to ensure that everyone gets quality content served to them.

Grab super app feed

To solve this, we have built a recommendation engine, which is a system that suggests the type of content a user should consider consuming. In order to make a recommendation, we need to understand three factors:

  1. Users. There’s a lot we can infer about our users based on how they’ve used the Grab app, such as the number of rides they’ve taken, the type of food they like to order, the movie voucher deals they’ve purchased, the games they’ve played, and so on.
    This information gives us the opportunity to understand our users’ preferences better, enabling us to match their profiles with relevant and suitable content.
  2. Items. These are the characteristics of the content. We consider the type of the content (e.g. video, games, rewards) and consumability (e.g. purchase, view, redeem). We also consider other metadata such as store hours for merchants, points to burn for rewards, and GPS coordinates for points of interest.
  3. Context. This pertains to the setting in which a user is consuming our content. It could be the time of day, the user’s location, or the current feed category.

Using signals from all these factors, we build a model that returns a ranked set of cards to the user. More on this in the next few sections.
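As a highly simplified illustration of how user, item, and context signals might be combined into a ranked list (a sketch only, not the production model), each candidate card could be scored with a user-affinity term and a context boost on top of a model-predicted base score:

```python
from dataclasses import dataclass

@dataclass
class Card:
    card_id: str
    content_type: str   # e.g. "video", "game", "reward" (item signal)
    base_score: float   # e.g. output of a trained model

# Hypothetical hand-tuned context boosts, purely for illustration.
CONTEXT_BOOST = {
    ("food", "lunchtime"): 1.5,
    ("ride", "morning_peak"): 1.3,
}

def rank_cards(cards, user_affinity, context):
    """Return cards sorted by a combined user/item/context score."""
    def score(card):
        affinity = user_affinity.get(card.content_type, 1.0)            # user signal
        boost = CONTEXT_BOOST.get((card.content_type, context), 1.0)    # context signal
        return card.base_score * affinity * boost
    return sorted(cards, key=score, reverse=True)

cards = [Card("c1", "food", 0.4), Card("c2", "video", 0.6), Card("c3", "ride", 0.5)]
print([c.card_id for c in rank_cards(cards, user_affinity={"food": 2.0}, context="lunchtime")])
```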

Understanding our User

Grab super app feed

Interpreting user preference from the signals mentioned above is a whole challenge in itself. It’s important here to note that we are in a constant state of experimentation. Slowly but surely, we are continuing to fine tune how to measure content preferences. That being said, we look at two areas:

  1. Action. We firmly believe that not all interactions are made equal. Does liking a card actually mean you like it? Do you like things at the same rate as your friends? What about transactions, are those more preferred? The feed introduces a lot of ways for the users to give feedback to the platform. These events include likes, clicks, swipes, views, transactions, and call-to-actions.

Depending on the model, we can take slightly different approaches. We can learn the importance of each event and aggregate them to have an expected rating, or we can predict the probability of each event and rank accordingly.

  2. Recency. Old interactions are probably not as useful as new ones. The feed is a product that is constantly evolving, and so are the preferences of our users. Failing to decay the weight of older interactions will give us recommendations that are no longer meaningful to our users.
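One simple way to express both ideas, unequal event weights and recency decay, is an exponentially decayed weighted sum; the event weights and half-life below are made-up numbers for illustration:

```python
import time

# Hypothetical per-event weights: not all interactions are equal.
EVENT_WEIGHT = {"view": 0.1, "click": 0.5, "like": 1.0, "transaction": 3.0}

HALF_LIFE_DAYS = 14  # assumed: an interaction loses half its weight every two weeks

def preference_score(events, now=None):
    """events: list of (event_type, unix_timestamp). Returns a decayed preference score."""
    now = now or time.time()
    score = 0.0
    for event_type, ts in events:
        age_days = (now - ts) / 86400
        decay = 0.5 ** (age_days / HALF_LIFE_DAYS)   # exponential recency decay
        score += EVENT_WEIGHT.get(event_type, 0.0) * decay
    return score

now = time.time()
events = [("click", now - 86400), ("transaction", now - 30 * 86400)]
print(round(preference_score(events, now), 3))
```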

Optimising the Experience

Grab super app feed

Building a viable recommendation engine requires several phases. Working iteratively, we are able to create a few core recommendation strategies to produce the final model in determining the content’s relevance to the user. We’ll discuss each strategy in this section.

  1. Popularity. This strategy is better known as trending recommendations. We capture online clickstream events over a rolling time window and aggregate the events to show the user what’s popular to everyone at that point in time. Listening to the crowds is generally an effective strategy, but this particular strategy also helps us address the cold start problem by providing recommendations for new feed users.
  2. User Favourites. We understand that our users have different tastes, and that each user has content they engage with more than others do. In this strategy, we capture that personal engagement and the user’s evolving preferences.
  3. Collaborative Filtering. A key goal in building our everyday super app is to let users experience different services. To allow discoverability, we study similar users to uncover a set of similar preferences they may have, which we can then use to guide what we show other users.
  4. Habitual Behaviour. There will be times where users only want to do a specific thing, and we wouldn’t want them to scroll all the way down just to do it. We’ve built in habitual recommendations to address this. So if users always use the feed to scroll through food choices at lunch or to take a peek at ride peaks (pun intended) on Sunday morning, we’ve still got them covered.
  5. Deep Recommendations. We’ve shown you how we use Feed data to drive usage across the platform. But what about using the platform data to drive the user feed behaviour? By embedding users’ activities from across our multiple businesses, we’re also able to leverage this data along with clickstream to determine the content preferences for each user.

We apply all these strategies to find out the best recommendations to serve the users either by selection or by aggregation. These decisions are determined through regular experiments and studies of our users.
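To give a flavour of the simplest of these strategies, the popularity (trending) approach can be approximated by counting clickstream events per card over a rolling time window; the window size and card names in this sketch are arbitrary:

```python
from collections import Counter, deque
import time

WINDOW_SECONDS = 3600  # assumed one-hour rolling window

class TrendingCounter:
    """Counts card interactions over a rolling time window."""
    def __init__(self, window=WINDOW_SECONDS):
        self.window = window
        self.events = deque()  # (timestamp, card_id), appended in time order

    def record(self, card_id, ts=None):
        self.events.append((ts or time.time(), card_id))

    def top(self, k=5, now=None):
        now = now or time.time()
        # Drop events older than the window before counting.
        while self.events and self.events[0][0] < now - self.window:
            self.events.popleft()
        counts = Counter(card_id for _, card_id in self.events)
        return counts.most_common(k)

trending = TrendingCounter()
for card in ["promo_a", "promo_a", "video_b"]:
    trending.record(card)
print(trending.top(2))
```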

Always Learning

We’re constantly learning and relearning about our users. There are a lot of ways to understand behaviour and a lot of different ways to incorporate different strategies, so we’re always iterating on these to deliver the most personal experience on the app.

To identify a user’s preferences and the optimal strategy exposure, we capitalise on our Experimentation Platform to expose different configurations of our Recommendation Engine to different users. To monitor the quality of our recommendations, we measure the impact with online metrics such as interaction, clickthrough, and engagement rates, and with offline ranking metrics such as Normalized Discounted Cumulative Gain (NDCG).
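For reference, a bare-bones version of NDCG over binary relevance labels looks like the sketch below; this is only for intuition and is not our evaluation pipeline:

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain for the top-k ranked relevances."""
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """NDCG@k: DCG normalised by the DCG of the ideal (sorted) ranking."""
    ideal_dcg = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal_dcg if ideal_dcg > 0 else 0.0

# Relevance of the cards a user actually engaged with, in ranked order.
print(round(ndcg_at_k([1, 0, 1, 0, 0], k=5), 3))
```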

Future Work

Through our experience building out this recommendations platform, we realised that the space is large and that there are a lot of pieces that can continuously be built. To keep improving, we’re already working on the following items:

  1. Multi-objective optimisation for business and technical metrics
  2. Building out automation pipelines for hyperparameter optimisation
  3. Incorporating online learning for real-time model updates
  4. Multi-armed bandits for user personalised recommendation strategies
  5. Recsplanation system to allow stakeholders to better understand the system

Conclusion

Grab is one of Southeast Asia’s fastest growing companies. As its business, partnerships, and offerings continue to grow, the super app real estate problem will only keep getting bigger. In this post, we discussed how we are addressing that problem by building a recommendation system that understands our users and personalises the experience for each of them. This system (us included) continues to learn and iterate from our users’ feedback to deliver the best version for them.

If you’ve got any feedback, suggestions, or other great ideas, feel free to reach me at [email protected] Interested in working on these technologies yourself? Check out our career page.

C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages

Post Syndicated from Kavita Ganesan original https://github.blog/2019-07-02-c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages/

GitHub hosts over 300 programming languages—from commonly used languages such as Python, Java, and JavaScript to esoteric languages such as Befunge, only known to very small communities.

JavaScript is the top programming language on GitHub, followed by Java and HTML
Figure 1: Top 10 programming languages hosted by GitHub by repository count

One of the necessary challenges that GitHub faces is to be able to recognize these different languages. When some code is pushed to a repository, it’s important to recognize the type of code that was added for the purposes of search, security vulnerability alerting, and syntax highlighting—and to show the repository’s content distribution to users.

Despite the appearance, language recognition isn’t a trivial task. File names and extensions, while providing a good indication of what the coding language is likely to be, do not offer the full picture. In fact, many extensions are associated with the same language (e.g., “.pl”, “.pm”, “.t”, “.pod” are all associated with Perl), while others are ambiguous and used almost interchangeably across languages (e.g., “.h” is commonly used to indicate many languages of the “C” family, including C, C++, and Objective-C). In other cases, files are simply provided with no extension (especially for executable scripts) or with the incorrect extension (either on purpose or accidentally).

Linguist is the tool we currently use to detect coding languages at GitHub. Linguist is a Ruby-based application that uses various strategies for language detection, leveraging naming conventions and file extensions and also taking into account Vim or Emacs modelines, as well as the content at the top of the file (shebang). Linguist handles language disambiguation via heuristics and, failing that, via a Naive Bayes classifier trained on a small sample of data.

Although Linguist does a good job making file-level language predictions (84% accuracy), its performance declines considerably when files use unexpected naming conventions and, crucially, when a file extension is not provided. This renders Linguist unsuitable for content such as GitHub Gists or code snippets within READMEs, issues, and pull requests.

In order to make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua based on an Artificial Neural Network (ANN) architecture which can handle language predictions in tricky scenarios. The current version of the model is able to make predictions for the top 50 languages hosted by GitHub and surpasses Linguist in accuracy and performance. 

The Nuts and Bolts Behind OctoLingua

OctoLingua was built from scratch using Python and Keras with a TensorFlow backend, and is designed to be accurate, robust, and easy to maintain. In this section, we describe our data sources, model architecture, and performance benchmarks for OctoLingua. We also describe what it takes to add support for a new language.

Data sources

The current version of OctoLingua was trained on files retrieved from Rosetta Code and from a set of quality repositories internally crowdsourced. We limited our language set to the top 50 languages hosted on GitHub.

Rosetta Code was an excellent starter dataset as it contained source code for the same task expressed in different programming languages. For example, the task of generating a Fibonacci sequence is expressed in C, C++, CoffeeScript, D, Java, Julia, and more. However, coverage across languages was not uniform: some languages had only a handful of files, and some files were just too sparsely populated. Augmenting our training set with additional sources was therefore necessary, and it substantially improved language coverage and performance.

Our process for adding a new language is now fully automated. We programmatically collect source code from public repositories on GitHub. We choose repositories that meet minimum qualifying criteria, such as having a minimum number of forks, covering the target language, and covering specific file extensions. For this stage of data collection, we determine the primary language of a repository using the classification from Linguist.

Features: leveraging prior knowledge

Traditionally, for text classification problems with neural networks, memory-based architectures such as Recurrent Neural Networks (RNN) and Long Short Term Memory (LSTM) networks are often employed. However, given that programming languages differ in vocabulary, commenting style, file extensions, structure, library import style, and other minor ways, we opted for a simpler approach that leverages all this information by extracting relevant features in tabular form to be fed to our classifier. The features currently extracted are as follows:

  1. Top five special characters per file
  2. Top 20 tokens per file
  3. File extension
  4. Presence of certain special characters commonly used in source code files such as colons, curly braces, and semicolons
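A toy version of this kind of feature extraction might look like the sketch below; the character set, regex tokeniser, and feature names are simplified assumptions rather than OctoLingua’s actual preprocessor:

```python
import re
from collections import Counter

SPECIAL_CHARS = set(";:{}()[]<>#$@&|\\*=+-")

def extract_features(file_name, source_code):
    """Toy version of the tabular features described above."""
    chars = Counter(c for c in source_code if c in SPECIAL_CHARS)
    tokens = Counter(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", source_code))
    extension = file_name.rsplit(".", 1)[-1].lower() if "." in file_name else ""
    return {
        "top_special_chars": [c for c, _ in chars.most_common(5)],
        "top_tokens": [t for t, _ in tokens.most_common(20)],
        "extension": extension,
        "has_semicolons": ";" in source_code,
        "has_curly_braces": "{" in source_code,
    }

print(extract_features("hello.c", "#include <stdio.h>\nint main() { return 0; }"))
```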

The Artificial Neural Network (ANN) model

We use the above features as input to a two-layer Artificial Neural Network built using Keras with a TensorFlow backend.

The diagram below shows that the feature extraction step produces an n-dimensional tabular input for our classifier. As the information moves along the layers of our network, it is regularized by dropout and ultimately produces a 51-dimensional output which represents the predicted probability that the given code is written in each of the top 50 GitHub languages plus the probability that it is not written in any of those.

image
Figure 2: The ANN Structure of our initial model (50 languages + 1 for “other”)
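As a rough sketch of what such a network could look like in Keras, see below; the input dimensionality, hidden-layer sizes, and dropout rates are placeholders, not the tuned values used in OctoLingua:

```python
from tensorflow import keras

NUM_FEATURES = 100   # placeholder: dimensionality of the tabular feature vector
NUM_CLASSES = 51     # top 50 languages + "other"

model = keras.Sequential([
    keras.layers.Dense(256, activation="relu", input_shape=(NUM_FEATURES,)),
    keras.layers.Dropout(0.5),                               # regularisation between layers
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(NUM_CLASSES, activation="softmax"),   # probability per language
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```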

We used 90% of our dataset for training over approximately eight epochs. Additionally, we removed a percentage of file extensions from our training data at the training step, to encourage the model to learn from the vocabulary of the files, and not overfit on the file extension feature, which is highly predictive.

Performance benchmark

OctoLingua vs. Linguist

In Figure 3, we show the F1 Score (harmonic mean between precision and recall) of OctoLingua and Linguist calculated on the same test set (10% from our initial data source). 

Here we show three tests. The first test uses the test set untouched in any way. The second test uses the same set of test files with file extension information removed, and the third test also uses the same set of files, but this time with file extensions scrambled so as to confuse the classifiers (e.g., a Java file may have a “.txt” extension and a Python file may have a “.java” extension).

The intuition behind scrambling or removing the file extensions in our test set is to assess the robustness of OctoLingua in classifying files when a key feature is removed or is misleading. A classifier that does not rely heavily on extension would be extremely useful to classify gists and snippets, since in those cases it is common for people not to provide accurate extension information (e.g., many code-related gists have a .txt extension).

The table below shows how OctoLingua maintains a good performance under various conditions, suggesting that the model learns primarily from the vocabulary of the code, rather than from meta information (i.e. file extension), whereas Linguist fails as soon as the information on file extensions is altered.

image
Figure 3: Performance of OctoLingua vs. Linguist on the same test set

 

Effect of removing file extension during training time

As mentioned earlier, during training time we removed a percentage of file extensions from our training data to encourage the model to learn from the vocabulary of the files. The table below shows the performance of our model with different fractions of file extensions removed during training time. 

image
Figure 4: Performance of OctoLingua with different percentage of file extensions removed on our three test variations

Notice that with no file extension removed during training time, the performance of OctoLingua on test files with no extensions and randomized extensions decreases considerably from that on the regular test data. On the other hand, when the model is trained on a dataset where some file extensions are removed, the model performance does not decline much on the modified test set. This confirms that removing the file extension from a fraction of files at training time induces our classifier to learn more from the vocabulary. It also shows that the file extension feature, while highly predictive, had a tendency to dominate and prevented more weights from being assigned to the content features. 
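In practice, this kind of ablation can be as simple as blanking the extension feature for a random fraction of training rows; the 50% rate and field names below are only illustrative:

```python
import random

def drop_extensions(rows, fraction=0.5, seed=42):
    """Blank the 'extension' feature for a random fraction of training rows."""
    rng = random.Random(seed)
    augmented = []
    for row in rows:
        row = dict(row)
        if rng.random() < fraction:
            row["extension"] = ""   # force the model to rely on content features
        augmented.append(row)
    return augmented

train_rows = [{"extension": "py", "top_tokens": ["def", "import"]},
              {"extension": "java", "top_tokens": ["public", "class"]}]
print(drop_extensions(train_rows))
```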

Supporting a new language

Adding a new language in OctoLingua is fairly straightforward. It starts with obtaining a bulk of files in the new language (we can do this programmatically as described in data sources). These files are split into a training and a test set and then run through our preprocessor and feature extractor. This new train and test set is added to our existing pool of training and testing data. The new testing set allows us to verify that the accuracy of our model remains acceptable.

image
Figure 5: Adding a new language with OctoLingua

Our plans

As of now, OctoLingua is at the “advanced prototyping stage”. Our language classification engine is already robust and reliable, but it does not yet support all coding languages on our platform. Aside from broadening language support—which would be rather straightforward—we aim to enable language detection at various levels of granularity. Our current implementation already allows us, with a small modification to our machine learning engine, to classify code snippets. It wouldn’t be too far-fetched to take the model to the stage where it can reliably detect and classify embedded languages.

We are also contemplating the possibility of open sourcing our model and would love to hear from the community if you’re interested.

Summary

With OctoLingua, our goal is to provide a service that enables robust and reliable source code language detection at multiple levels of granularity, from file level or snippet level to potentially line-level language detection and classification. Eventually, this service can support, among others, code searchability, code sharing, language highlighting, and diff rendering—all of this aimed at supporting developers in their day to day development work in addition to helping them write quality code.  If you are interested in leveraging or contributing to our work, please feel free to get in touch on Twitter @github!

Authors

The post C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages appeared first on The GitHub Blog.

Atom editor is now faster

Post Syndicated from Rafael Oleza original https://github.blog/2019-06-12-atom-editor-is-now-faster/

The Atom Team improved a few of Atom’s most common features—they’re now dramatically faster and ready to help you be even more productive.

Fuzzy Finder

The fuzzy finder is one of the most popular features in Atom, and it’s used by nearly every user to quickly open files by name. We wanted to make it even better by speeding up the process. Atom version 1.38 introduces an experimental fast mode that brings drastic speed improvements to any project. With this mode, indexing a medium or large project is roughly six times faster than when using the standard mode.

Medium to large projects are crawled six times quicker using fast mode.

This change also made the process to display filtered results in Atom about 11 times faster. You’ll notice the difference in speed as soon as you start searching for a query.

Medium to large projects are filtered 11 times quicker using fast mode.

What’s next

Our goal is to make the new experimental mode the default option on Atom version 1.39 when it’s released next month. Until then, switch to this mode by clicking the try experimental fast mode option in the fuzzy finder.

The next Atom version 1.39 will also introduce performance improvements to the find and replace package. We’re adding a new mode to make searching for files between 10 and 20 times faster than the current mode.

Thank you

We’d like to thank the open source community since it has allowed us to build and improve Atom over the years. More specifically, we’ve made these performance improvements by building on the shoulders of two big giants:

Try out Atom.io

The post Atom editor is now faster appeared first on The GitHub Blog.

Guiding you Door-to-Door via our Super App!

Post Syndicated from Grab Tech original https://engineering.grab.com/poi-entrances-venues-door-to-door

Remember landing at an airport or going to your favourite mall and the hassle of finding the pickup spot when you booked a cab? When there are about a million entrances, it can get particularly annoying trying to find the right pickup location!

Rolling out across Southeast Asia is a brand new booking experience from Grab, designed to make it easier for you to make a booking at large venues like airports, shopping centres, and tourist destinations! With the new booking flow, it will not only be easier to select one of the pre-designated Grab pickup points, you will also find text and image directions to help you navigate your way through the venue for a smoother rendezvous with your driver!

Inspiration behind the work

Finding the pick-up point closest to you, let alone predicting it, is incredibly challenging, especially when you are inside huge buildings or in crowded areas. Neeraj Mishra, Product Owner for Places at Grab, explains: “We rely on GPS data to understand the user’s location, which can be tricky when you are indoors or surrounded by skyscrapers. Since the satellite signal has to go through layers of concrete and steel, it becomes weak, which adds to the inaccuracy. Furthermore, ensuring that passengers and drivers have the same pick-up point in mind can be tricky, especially with venues that have multiple entrances.”

Marina One POI

Grab’s data analysis revealed that “rendezvous distance” (walking distance between the selected pick-up point and where the car is waiting) is more than twice the Grab average when the booking is made from large venues such as airports.

To solve this issue, Grab launched “Entrances” (the green dots on the map) last year, which lists the various pick-up points available at a particular building and shows them on the map, allowing users to easily choose the one closest to them and ensuring their drivers know exactly where they want to be picked up. Since then, Grab has created more than 120,000 such entrances, and we are delighted to share that the average rendezvous distance across all countries has been steadily going down!

Decreasing rendezvous distance across region

One problem remained

But there was still one common pain-point to be solved. Just because a passenger has selected the pick-up point closest to them, doesn’t mean it’s easy for them to find it. This is particularly challenging at very large venues like airports and shopping centres, and especially difficult if the passenger is unfamiliar with the venue, for example – a tourist landing at Jakarta Airport for the very first time. To deliver an even smoother booking and pick-up experience, Grab has rolled out a new feature called Venues – the first in the region – that will give passengers in-app photo and text directions to the pick-up point closest to them.

Let’s break it down! How does it work?

Whether you are a local or a foreigner on holiday or business trip, fret not if you are not too familiar with the place that you are in!

Let’s imagine that you are now at Singapore Changi Airport: your new booking experience will look something like this!

Step 1: Fire up the Grab app and tap on Transport. You will see a welcome screen showing you where you are!

Welcome to Changi Airport

Step 2: On the booking screen, you will see a new pickup menu with a list of available pickup points. Confirm the pickup point you want and make the booking!

Booking screen at Changi Airport

Step 3: Once you’ve been allocated a driver, tap on the bubble to get directions to your pick-up point!

Driver allocated at Changi Airport

Step 4: Follow the landmarks and walking instructions and you’ve arrived at your pick-up point!

Directions to pick-up point at Changi Airport

Curious about how we got this done?

Data-Driven Decisions

Based on a thorough data analysis of historical bookings, Grab identified key venues across our markets in Southeast Asia. Then we dispatched our Operations team to the ground, to identify all pick up points and perform detailed on-ground survey of the venue.

Operations Team’s Leg Work

Nagur Hassan, Operations Manager at Grab, explains the process: “For the venue survey process, we send a team equipped with the tools required to capture the details, like cameras and WiFi and Bluetooth scanners. Once inside the venue, the team identifies strategic landmarks and clear direction signs that are related to drop-off and pick-up points. The team also captures turn-by-turn walking directions to make it easier for Grab users to navigate – for instance, walk towards Starbucks and take a left near the H&M store. All the photos and documentation taken on site are then brought back to the office for further processing.”

Quality Assurance

Once the data is collected, our in-house team checks the quality of the images and data. We also mask people’s faces and vehicle number plates to hide any identity-related information. As of today, we have collected 3,400+ images for 1,900+ pick-up points belonging to 600 key venues! This effort took more than 3,000 man-hours in total! And we aim to cover more than 10,000 such venues across the region in the next few months.

This is only the beginning

We’re constantly striving to improve location accuracy for our passengers by using advanced machine learning and a constant feedback mechanism. We understand GPS may not always be the most accurate determination of your current location, especially in crowded areas and skyscraper districts. This is just the beginning, and we’re planning to launch some very innovative features in the coming months! So stay tuned for more!

Recipe for building a widget: How we helped to “peak-shift” demand by helping passengers understand travel trends

Post Syndicated from Grab Tech original https://engineering.grab.com/peak-shift-demand-travel-trends

Credits: Photo by rawpixel on Unsplash

 

Stuck in traffic in a Grab ride? Pass the time by opening your Grab app and checking out the Feed – just scroll down! You’ll find widgets for games, polls, videos, news and even food recommendations!

Beyond serving your everyday needs, we want to provide our users with information that is interesting, useful and relevant. That’s why we’re always coming up with new widgets.

Building each widget takes close collaboration across multiple different teams – from Product Management to Design, Engineering, Behavioral Science, and Data Science and Analytics. Sounds like a lot of people, doesn’t it? But you’ll be surprised to hear that this behind-the-scenes collaboration works rapidly, usually in the span of one month! Which means we’re often moving from ideation phase to product release in just a few weeks.

Travel Trends Widget

This fast-and-furious process is anchored on one word – “customer-centric”. And that’s how it all began with our  “Travel Trends Widget” – a widget that provides passengers with an overview of historical supply and demand trends for their current location and nearby time periods.

Because we had so much fun developing this widget, we wanted to write a blog post to share with you what we did and how we did it!

Inspiration: Where it all started

Transport demand can be rather lumpy. Owing to organic patterns (e.g. office hours), a lot of passengers tend to request cars around the same time. In periods like this, the increase in demand can outpace the arrival of driver supply, increasing the waiting time for passengers.

Our goal at Grab is to make sure people get a ride when they want it and at the price they want, so we got to thinking about how we can ease this friction by leveraging our treasure trove – Big Data! – to help our passengers better plan their trips.

As we were looking at the data, we noticed that there is a seasonality to demand and supply: at certain times and days, imbalances appear, peak and disappear, and the process repeats itself. Studies say that humans in general, unless shown a compelling reason or benefit for change, are habitual beings subject to inertia. So we set out to achieve exactly that: To create a widget to surface information to our passengers that may help them alter their decisions on when they choose to book a ride, thereby redistributing some of the present peak demands to periods just before and after peak – also known as “peak shifting the demand”!

While this widget is the first-of-its-kind in the ride-hailing industry, “peak-shifting” was actually coined and introduced long ago!

London Transport Museum Trends

As you can see from this post from the London Transport Museum (Source: Transport for London), the London Underground tried peak-shifting long before anyone else: the original ad from 1928 is displayed on the left, and a 2015 ad comparing the trends to 1928 is displayed on the right.

Trends from a hotel in Beijing

You may also have seen something similar at the last hotel you stayed at. Notice here a poster in an elevator at a Beijing hotel, announcing the best times to eat breakfast in comfort and avoid the crowd. (Photo credits to Prashant, our Product Manager, who saw this on holiday.)

To apply “peak-shifting” and help our users better plan their trips, we decided to dig in and leverage our data. It was far more complex than we had initially thought, as market conditions can differ from day to day. This meant that generic statements like “5PM-8PM are peak hours and prices will be high” would not hold true. Contrary to general perception, we observed that even during peak hours there are buckets of time when there is no surge or low surge.

For instance, plots 1 and 2 below show what a typical Monday and Tuesday surge looks like in a given month, respectively. One of the key insights is that surge trends during peak hours differ between Monday and Tuesday. This reinforces our initial hypothesis that every day is unique.

So we used machine learning techniques to build a forecasting widget which can help our users and give them the power to plan their trips beforehand. This widget is able to provide the pricing trends for the next 2 hours. So with a bit of flexibility, riders can ride the tide!

Grab trends

So how exactly does this widget work?!

Historical trends for Monday

It pulls together historically observed imbalances between supply and demand for the consumer’s current location and nearby time periods. Aggregated data is displayed to consumers in easily interpreted visualisations, so that they can plan to leave at times when there is more supply, with potentially more savings on fares.
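Conceptually, the aggregation behind each bar can be thought of as bucketing historical rides by day of week and time slot and averaging the observed demand-supply imbalance per bucket; the field names and slot size in this sketch are assumptions for illustration:

```python
from collections import defaultdict
from statistics import mean

def trend_buckets(historical_rides, slot_minutes=30):
    """
    historical_rides: iterable of dicts with 'weekday' (0-6), 'minute_of_day'
    and 'surge' (observed demand/supply imbalance) - illustrative fields only.
    Returns the average surge per (weekday, time-slot) bucket.
    """
    buckets = defaultdict(list)
    for ride in historical_rides:
        slot = ride["minute_of_day"] // slot_minutes
        buckets[(ride["weekday"], slot)].append(ride["surge"])
    return {key: mean(values) for key, values in buckets.items()}

rides = [
    {"weekday": 0, "minute_of_day": 515, "surge": 1.8},  # Monday 08:35
    {"weekday": 0, "minute_of_day": 525, "surge": 1.4},  # Monday 08:45
    {"weekday": 0, "minute_of_day": 560, "surge": 1.0},  # Monday 09:20
]
print(trend_buckets(rides))
```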

How did we build the widget? Loop, agile working process, POC & workstream

Widget-building is an agile, collaborative, and simultaneous process. First, we started the process with analysis from Product Analytics team, pulling out data on traffic trends, surge patterns, and behavioral insights of both passengers and drivers in Singapore.

When we noticed the existence of seasonality for each day of the week, we came up with more precise analytical and business questions to dig deeper into the data. Upon verification of hypotheses, we decided that we will build a widget.

The Behavioural Science, UX (User Experience) Design, and Product Management teams then joined in and started giving shape to the problem we were solving. Our behavioural scientists shared their expertise on how information, suggestions, and choices should be presented to enable easy assimilation and beneficial action. Daily whiteboarding breakouts, endless back-and-forth conversations, and a healthy amount of challenge-and-accept culture ensured that we distilled the idea down to its core. We then presented the relevant information with just the right level of detail, and with the right amount of messaging, to allow users to take the intended action, i.e. shift their demand outside of peak periods if possible.

Our amazing regional Copywriting team then swung in to put our intent into words in 7 different languages for our users across South-East Asia. Simultaneously, our UX designers and Full-stack Engineers started exploring the best visual components to communicate data on time trends to users. More on this later, but suffice to say that plenty of ideas were explored and discarded in a collaborative process, which aimed to create something that’s intuitive and engaging while being robust and scalable to work across all types of devices.

While these designs made their way up to engineering, the Data Science team worked on finding the most rigorous method to deduce the historical trend of surge across all our cities and areas, and time periods within them. There were discussions on how to best store and update this data reliably so that the widget itself can access it with great performance.

Soon after, we went into the development process, and voila! We had the first iteration of the widget ready on our staging (internal testing) servers in just 2 weeks! This prototype was opened up to the core team for influx of feedback.

And just two weeks later, the widget made its way to our Singapore and Jakarta Feeds, accessible to the world at large! Feedback from our users started pouring in almost immediately (thanks to the rich feedback functionality that comes with each widget), ranging from great to sometimes not-so-great, and we listened to all of it with a keen ear! And thus began a new cycle of iterations and continuous improvement, more of which we will share in a subsequent post.

In the trenches with the creators: How multiple teams got together to make this come true

Various disciplines within our cross-functional team came together to whip up this widget, each contributing their expertise to the end product.

Using Behavioural Science to simplify choices and design good outcomes

Behavioural Science helped explore many facets of consumer behaviour in order to plan and design the widget: understanding how consumers think, and conceptualising a widget that consumers can easily understand and use.

While fares are governed entirely by market conditions, it’s important for us to explain the economics to customers. As a customer-centric company, we aim to make consumers feel that they own their decisions, which they can make based on full information. And this is the role of Behavioural Scientists at Grab!

In guiding customers through the information, the Behavioural Science team had the following three objectives in mind while building this Travel Trends widget:

  1. Offer transparency on fares: By exposing our historic surge levels for a 4-hour period, we wanted to ensure that passengers are aware of the surge levels and are not caught off guard by the fare.
  2. Give information that helps them plan: By showing surge levels for the next 2 hours, we wanted to help customers who have the flexibility to plan for a better time, giving them the power to decide based on transparent information.
  3. Provide helpful tips: Every bar gives users tips on the conditions at that time and in the immediate future. For instance, a low surge bar followed by a high surge bar gives the tip “Psst… Leave now, It might get busy later!”, helping people understand the graph better and nudging them to take action (a simplified sketch of this tip-selection logic follows below). If you are interested in saving on fares, may we suggest tapping around all the bars to reveal the secret pro-tips?
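
For readers curious how tips like these could be driven by the bar data, here is a simplified Go sketch; the surge threshold and all messages except the first are invented for illustration and are not Grab’s actual copy or rules.

```go
// Illustrative sketch (hypothetical thresholds and copy) of choosing a pro-tip
// for a tapped bar based on the surge level of that bar and the next one.
package main

import "fmt"

// tipFor returns a tip for the bar at index i, given surge levels per bar.
func tipFor(surge []float64, i int) string {
	const high = 1.5
	current, hasNext := surge[i], i+1 < len(surge)
	switch {
	case hasNext && current < high && surge[i+1] >= high:
		return "Psst… Leave now, It might get busy later!"
	case hasNext && current >= high && surge[i+1] < high:
		return "Fares may drop soon; consider waiting a little."
	case current >= high:
		return "It's busy right now; fares are higher than usual."
	default:
		return "It's a quiet period; a good time to book."
	}
}

func main() {
	bars := []float64{1.1, 1.7, 1.8, 1.3}
	for i := range bars {
		fmt.Printf("bar %d: %s\n", i, tipFor(bars, i))
	}
}
```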

Designing interfaces that lead to consumer success by abstracting complexity

The Design team is the one behind the colors and shapes that make up the widget you see and interact with! The team took inspiration from Google’s Popular Times.

Source/Credits: Google Live Popular Times

Right from the outset, our content and product teams were keen to surface additional information and actions with each bar to keep the widget interactive and useful. One of the early challenges was to arrive at the right gesture that invites the user to interact with and intuitively navigate the bars on the widget, but does not conflict with other gestures (e.g. scrolling and scrubbing) that the user is already trained to perform on the feed. We found that tapping was an unused yet intuitive gesture that we could use for interaction with the bars.

We then went into rounds of iteration on the visual design of the widget. In this process, multiple stakeholders were involved ranging from Product to Content to Engineering. We had to overcome a number of constraints i.e. the limited canvas of a widget and the context of a user when she is exploring the feed. By re-using existing libraries and components, we managed to keep the development light and ship something fast.

GrabCar trends near you

Dozens of revisions and four iterations later, we landed on a design that we felt equipped the feature for its user-facing goal, and did so in a manner that was aesthetically appealing!

And finally we managed to deliver on the feature’s goal, by surfacing just the right level of detail in a manner that is intuitive yet effective at peak-shifting demand.

Bringing all of this to fruition through high performance engineering

Our Development Engineering team was in charge of developing the widget and making it available to our users in just a few weeks’ time – materialising the work of the other teams.

One of their challenges was to find the best way to process the vast amount of data (millions of database entries) so it could be visualised simply as bar charts. Grab’s engineers had to achieve this while keeping performance as resilient as possible.

There were two options in doing this:

a) Fetch the data directly from the DB for each API call; or

b) Refresh the data into an in-memory data structure on a schedule, so that API calls no longer need to hit the DB.

Considering that this feature would likely attract a lot of traffic, and thus a high QPS, we decided that the former option would be too costly. We chose the latter option since it is more performant and more scalable.
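
As an illustration of option (b), here is a minimal Go sketch of a periodically refreshed in-memory cache; the type names, refresh interval, and loader function are assumptions for the example, not our production code.

```go
// Minimal sketch of option (b): refresh trend data into an in-memory structure
// on a schedule so API handlers never touch the DB on the request path.
package main

import (
	"fmt"
	"sync"
	"time"
)

type Trend struct {
	AreaID string
	Bars   []float64 // surge level per time slot
}

type TrendCache struct {
	mu     sync.RWMutex
	byArea map[string]Trend
}

// refresh swaps in a fresh snapshot; loadFromDB stands in for the real query.
func (c *TrendCache) refresh(loadFromDB func() map[string]Trend) {
	snapshot := loadFromDB()
	c.mu.Lock()
	c.byArea = snapshot
	c.mu.Unlock()
}

// Get serves reads from memory only, so request latency is independent of DB load.
func (c *TrendCache) Get(areaID string) (Trend, bool) {
	c.mu.RLock()
	defer c.mu.RUnlock()
	t, ok := c.byArea[areaID]
	return t, ok
}

// StartRefreshing warms the cache once, then refreshes it in the background.
func (c *TrendCache) StartRefreshing(interval time.Duration, loadFromDB func() map[string]Trend) {
	c.refresh(loadFromDB)
	go func() {
		for range time.Tick(interval) {
			c.refresh(loadFromDB)
		}
	}()
}

func main() {
	cache := &TrendCache{byArea: map[string]Trend{}}
	cache.StartRefreshing(time.Minute, func() map[string]Trend {
		// In the real service this would aggregate from the database.
		return map[string]Trend{"sg-orchard": {AreaID: "sg-orchard", Bars: []float64{1.0, 1.4, 1.7}}}
	})
	if t, ok := cache.Get("sg-orchard"); ok {
		fmt.Println(t.Bars)
	}
}
```

The trade-off in a design like this is that reads stay fast and cheap regardless of DB load, at the cost of data that can be up to one refresh interval stale, which is acceptable for hourly trend data.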

At the frontend, the challenge was to cater to the intricate requests from our designers. We used chart libraries to increase our development speed, but not all of the requirements were readily supported by these libraries.

For instance, the library we chose makes visualising charts easy, but not so much customising them. When the designers wanted the average line to be dotted, the library did not support this out of the box. The moving arrow pointer as you move between bars, and the bars changing colour when tapped, all required countless CSS tweaks.

CSS tweak on trends widget

Closing the product loop with user feedback and data driven insights

One of the most crucial parts of launching any product is to ensure that customers are engaging with the widget and finding it useful.

To understand what customers think about the widget, whether they find it useful, and whether it is helping them plan better, we delved into the huge mine of clickstream data.

User feedback on the trends widget

We found that 1 in 3 users who make a booking every day interact with the widget, and of these, more than 70% have given the widget a positive rating. This validates our initial hypothesis that, given the option, our customers love the freedom to plan their trips and value a more transparent ecosystem.

These users also indicated what they like most about the widget: 61% gave a positive rating for usefulness, 20% were impressed by the design (kudos to our fantastic designer Ajmal!), and 13% highlighted usability.

Tweet about the widget

Beyond internal data, our widget also made the rounds on social media channels. For example, here is a screenshot of what our users had to say on Twitter.

We closely track these engagement and feedback metrics to ensure that we keep improving and coming up with new iterations that serve our customers better.

Conclusion

We hope you enjoyed reading about how we went from ideation, through iterations to a finished widget in the hands of the user, all in 1 month! Many hands helped along the way. If you are interested in joining this hyper-proactive problem-solving team, please check out Grab’s career site!

And if you have feedback for us, we are here to listen! While we could not be happier with the positive reaction from the public, we are also thrilled to hear your suggestions and advice. Please leave us a note using the widget’s comment function!

Epilogue

We just released an upgrade to this widget that allows users to set reminders and be notified when good fares are available in a time period of their choosing. We will keep watch and come knocking! Go ahead, find the widget on your Grab feed, set a reminder, and save on fares on your next ride!

Atom understands your code better than ever before

Post Syndicated from Max Brunsfeld original https://github.blog/2018-10-31-atoms-new-parsing-system/

Text editors like Atom have many features that make code easier to read and write—syntax highlighting and code folding are two of the most important examples. For decades, all major text editors have implemented these kinds of features based on a very crude understanding of code, obtained by searching for simple, regular expression patterns. This approach has severely limited how helpful text editors can be.

At GitHub, we want to explore new ways of making programming intuitive and delightful, so we’ve developed a parsing system called Tree-sitter that will serve as a new foundation for code analysis in Atom. Tree-sitter makes it possible for Atom to parse your code while you type—maintaining a syntax tree at all times that precisely describes the structure of your code. We’ve enabled the new system by default in Atom, bringing a number of improvements.

Crisp syntax highlighting

Atom’s syntax highlighting is now based on the syntax trees provided by Tree-sitter. This lets us use color to outline your code’s structure more clearly than before. Notice the consistency with which fields, functions, keywords, types, and variables are highlighted across a variety of languages:

This animated GIF shows a snippet of C code rendered in Atom. It then switches to show an equivalent snippet of code written in several different languages: C++, Go, Rust, and TypeScript.

Reliable code folding

In most text editors, code folding is based on indentation: lines with greater indentation are considered to be nested more deeply than lines with less indentation. But this doesn’t always match the structure of our code and can make code folding useless in some files. With Tree-sitter, Atom folds code based on its syntax, which allows folding to work as expected, even for code like this:

This animated GIF shows a snippet of Ruby code, containing a heredoc with unindented text. Code folding is used to first collapse the block containing the heredoc, then collapse the method containing that block, and then collapse the class containing that method.

Syntax-aware selection

Atom also uses syntax trees as the basis for two new editing commands: Select Larger Syntax Node and Select Smaller Syntax Node, bound to Alt+Up and Alt+Down. These commands can make many editing tasks more efficient and fun, especially when used in combination with multiple cursors.

This animated GIF shows a snippet of JavaScript code in Atom. First, nothing is selected. Then, larger and larger constructs in the code are selected.

Speed

Parsing an entire source file can be time-consuming. This is why most IDEs wait to parse your code until you stop typing for a moment, and there is often a delay before syntax highlighting updates. We want to avoid these delays, so we designed Tree-sitter to parse your code incrementally: it keeps the syntax tree up to date as you edit your code without ever having to re-parse the entire file from scratch.

tree-editing

Language support

Currently, we use Tree-sitter to parse 11 languages: Bash, C, C++, ERB, EJS, Go, HTML, JavaScript, Python, Ruby, and TypeScript. And we’ve added Rust support on our Beta channel. If you’d like to help us bring the power of Tree-sitter to more languages, check out the Tree-sitter documentation and the grammar page of the Atom Flight Manual.

Feedback

Want to know more about Tree-sitter? Check out this talk from this year’s StrangeLoop conference.

If you write code and you’re interested in trying our new system, give the new version of Atom a try. We’d love to hear your feedback—tweet us at @AtomEditor or if you’ve run into a bug, open an issue.

The post Atom understands your code better than ever before appeared first on The GitHub Blog.

October 21 post-incident analysis

Post Syndicated from Jason Warner original https://github.blog/2018-10-30-oct21-post-incident-analysis/

Last week, GitHub experienced an incident that resulted in degraded service for 24 hours and 11 minutes. While portions of our platform were not affected by this incident, multiple internal systems were affected, which resulted in us displaying information that was out of date and inconsistent. Ultimately, no user data was lost; however, manual reconciliation for a few seconds of database writes is still in progress. For the majority of the incident, GitHub was also unable to serve webhook events or build and publish GitHub Pages sites.

All of us at GitHub would like to sincerely apologize for the impact this caused to each and every one of you. We’re aware of the trust you place in GitHub and take pride in building resilient systems that enable our platform to remain highly available. With this incident, we failed you, and we are deeply sorry. While we cannot undo the problems that were created by GitHub’s platform being unusable for an extended period of time, we can explain the events that led to this incident, the lessons we’ve learned, and the steps we’re taking as a company to better ensure this doesn’t happen again.

Background

The majority of user-facing GitHub services are run within our own data center facilities. The data center topology is designed to provide a robust and expandable edge network that operates in front of several regional data centers that power our compute and storage workloads. Despite the layers of redundancy built into the physical and logical components in this design, it is still possible that sites will be unable to communicate with each other for some amount of time.

At 22:52 UTC on October 21, routine maintenance work to replace failing 100G optical equipment resulted in the loss of connectivity between our US East Coast network hub and our primary US East Coast data center. Connectivity between these locations was restored in 43 seconds, but this brief outage triggered a chain of events that led to 24 hours and 11 minutes of service degradation.

A high-level depiction of GitHub's network architecture, including two physical datacenters, 3 POPS, and cloud capacity in multiple regions connected via peering.

In the past, we’ve discussed how we use MySQL to store GitHub metadata as well as our approach to MySQL High Availability. GitHub operates multiple MySQL clusters varying in size from hundreds of gigabytes to nearly five terabytes, each with up to dozens of read replicas per cluster to store non-Git metadata, so our applications can provide pull requests and issues, manage authentication, coordinate background processing, and serve additional functionality beyond raw Git object storage. Different data across different parts of the application is stored on various clusters through functional sharding.

To improve performance at scale, our applications will direct writes to the relevant primary for each cluster, but delegate read requests to a subset of replica servers in the vast majority of cases. We use Orchestrator to manage our MySQL cluster topologies and handle automated failover. Orchestrator considers a number of variables during this process and is built on top of Raft for consensus. It’s possible for Orchestrator to implement topologies that applications are unable to support, therefore care must be taken to align Orchestrator’s configuration with application-level expectations.
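
To make the “alignment” point concrete, here is a deliberately simplified Go sketch of the kind of guard that keeps automated failover within what the application tier can support. This is not Orchestrator’s actual API or our configuration, just an illustration of refusing cross-region promotions unless a human intervenes.

```go
// Hypothetical illustration (not Orchestrator's API) of constraining automated
// primary promotion to the failed primary's own region.
package main

import (
	"errors"
	"fmt"
)

type Server struct {
	Host   string
	Region string // e.g. "us-east", "us-west"
}

// choosePrimary picks a promotion candidate, refusing cross-region promotions.
func choosePrimary(failed Server, candidates []Server) (Server, error) {
	for _, c := range candidates {
		if c.Region == failed.Region {
			return c, nil
		}
	}
	return Server{}, errors.New("no in-region candidate: require a human decision before cross-region failover")
}

func main() {
	failed := Server{Host: "mysql-primary-1", Region: "us-east"}
	candidates := []Server{
		{Host: "mysql-replica-west-1", Region: "us-west"},
		{Host: "mysql-replica-east-2", Region: "us-east"},
	}
	next, err := choosePrimary(failed, candidates)
	fmt.Println(next, err)
}
```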

In the normal topology, all apps perform reads locally with low latency.

Incident timeline

2018 October 21 22:52 UTC

During the network partition described above, Orchestrator, which had been active in our primary data center, began a process of leadership deselection, according to Raft consensus. The US West Coast data center and US East Coast public cloud Orchestrator nodes were able to establish a quorum and start failing over clusters to direct writes to the US West Coast data center. Orchestrator proceeded to organize the US West Coast database cluster topologies. When connectivity was restored, our application tier immediately began directing write traffic to the new primaries in the West Coast site.

The database servers in the US East Coast data center contained a brief period of writes that had not been replicated to the US West Coast facility. Because the database clusters in both data centers now contained writes that were not present in the other data center, we were unable to fail the primary back over to the US East Coast data center safely.

2018 October 21 22:54 UTC

Our internal monitoring systems began generating alerts indicating that our systems were experiencing numerous faults. At this time there were several engineers responding and working to triage the incoming notifications. By 23:02 UTC, engineers in our first responder team had determined that topologies for numerous database clusters were in an unexpected state. Querying the Orchestrator API displayed a database replication topology that only included servers from our US West Coast data center.

2018 October 21 23:07 UTC

By this point, the responding team decided to manually lock our internal deployment tooling to prevent any additional changes from being introduced. At 23:09 UTC, the responding team placed the site into yellow status. This action automatically escalated the situation into an active incident and sent an alert to the incident coordinator. At 23:11 UTC, the incident coordinator joined and two minutes later made the decision to change the status to red.

2018 October 21 23:13 UTC

It was understood at this time that the problem affected multiple database clusters. Additional engineers from GitHub’s database engineering team were paged. They began investigating the current state in order to determine what actions needed to be taken to manually configure a US East Coast database as the primary for each cluster and rebuild the replication topology. This effort was challenging because by this point the West Coast database cluster had ingested writes from our application tier for nearly 40 minutes. Additionally, there were the several seconds of writes that existed in the East Coast cluster that had not been replicated to the West Coast and prevented replication of new writes back to the East Coast.

Guarding the confidentiality and integrity of user data is GitHub’s highest priority. In an effort to preserve this data, we decided that the 30+ minutes of data written to the US West Coast data center prevented us from considering options other than failing-forward in order to keep user data safe. However, applications running in the East Coast that depend on writing information to a West Coast MySQL cluster are currently unable to cope with the additional latency introduced by a cross-country round trip for the majority of their database calls. This decision would result in our service being unusable for many users. We believe that the extended degradation of service was worth ensuring the consistency of our users’ data.

In the invalid topology, replication from US West to US East is broken and apps are unable to read from current replicas as they depend on low latency to maintain transaction performance.

2018 October 21 23:19 UTC

It was clear through querying the state of the database clusters that we needed to stop running jobs that write metadata about things like pushes. We made an explicit choice to partially degrade site usability by pausing webhook delivery and GitHub Pages builds instead of jeopardizing data we had already received from users. In other words, our strategy was to prioritize data integrity over site usability and time to recovery.

2018 October 22 00:05 UTC

Engineers involved in the incident response team began developing a plan to resolve data inconsistencies and implement our failover procedures for MySQL. Our plan was to restore from backups, synchronize the replicas in both sites, fall back to a stable serving topology, and then resume processing queued jobs. We updated our status to inform users that we were going to be executing a controlled failover of an internal data storage system.

Overview of recovery plan was to fail forward, synchronize, fall back, then churn through backlogs before returning to green.

While MySQL data backups occur every four hours and are retained for many years, the backups are stored remotely in a public cloud blob storage service. The time required to restore multiple terabytes of backup data caused the process to take hours. A significant portion of the time was consumed transferring the data from the remote backup service. The process to decompress, checksum, prepare, and load large backup files onto newly provisioned MySQL servers took the majority of the time. This procedure is tested daily at minimum, so the recovery time frame was well understood; however, until this incident we had never needed to fully rebuild an entire cluster from backup, relying instead on other strategies such as delayed replicas.

2018 October 22 00:41 UTC

A backup process for all affected MySQL clusters had been initiated by this time and engineers were monitoring progress. Concurrently, multiple teams of engineers were investigating ways to speed up the transfer and recovery time without further degrading site usability or risking data corruption.

2018 October 22 06:51 UTC

Several clusters had completed restoration from backups in our US East Coast data center and begun replicating new data from the West Coast. This resulted in slow site load times for pages that had to execute a write operation over a cross-country link, but pages reading from those database clusters would return up-to-date results if the read request landed on the newly restored replica. Other larger database clusters were still restoring.

Our teams had identified ways to restore directly from the West Coast to overcome throughput restrictions caused by downloading from off-site storage and were increasingly confident that restoration was imminent, and the time left to establishing a healthy replication topology was dependent on how long it would take replication to catch up. This estimate was linearly interpolated from the replication telemetry we had available and the status page was updated to set an expectation of two hours as our estimated time of recovery.

2018 October 22 07:46 UTC

GitHub published a blog post to provide more context. We use GitHub Pages internally and all builds had been paused several hours earlier, so publishing this took additional effort. We apologize for the delay. We intended to send this communication out much sooner and will be ensuring we can publish updates in the future under these constraints.

2018 October 22 11:12 UTC

All database primaries established in US East Coast again. This resulted in the site becoming far more responsive as writes were now directed to a database server that was co-located in the same physical data center as our application tier. While this improved performance substantially, there were still dozens of database read replicas that were multiple hours delayed behind the primary. These delayed replicas resulted in users seeing inconsistent data as they interacted with our services. We spread the read load across a large pool of read replicas and each request to our services had a good chance of hitting a read replica that was multiple hours delayed.

In reality, the time required for replication to catch up had adhered to a power decay function instead of a linear trajectory. Due to increased write load on our database clusters as users woke up and began their workday in Europe and the US, the recovery process took longer than originally estimated.
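
As a rough illustration of why the linear estimate was optimistic, the Go sketch below (with made-up numbers) extrapolates time-to-catch-up from two replication-lag samples. The estimate only holds if the recovery rate stays constant, which it did not once daytime write load increased.

```go
// Back-of-the-envelope sketch (illustrative numbers only) of a linear ETA for
// replication catch-up, and why it can underestimate recovery time.
package main

import "fmt"

// linearETA extrapolates minutes until zero lag from two telemetry samples.
// prevLag and currLag are seconds of replication delay; dtMinutes separates them.
func linearETA(prevLag, currLag, dtMinutes float64) float64 {
	ratePerMinute := (prevLag - currLag) / dtMinutes // seconds of lag recovered per minute
	if ratePerMinute <= 0 {
		return -1 // not converging
	}
	return currLag / ratePerMinute // minutes until lag reaches zero, if the rate held
}

func main() {
	// Hypothetical samples: lag dropped from 6h to 5h of delay over 30 minutes.
	eta := linearETA(6*3600, 5*3600, 30)
	fmt.Printf("naive linear ETA: %.0f minutes\n", eta) // ~150 minutes
	// In practice the recovery rate decayed as write load grew, so the real ETA was longer.
}
```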

2018 October 22 13:15 UTC

By now, we were approaching peak traffic load on GitHub.com. A discussion was had by the incident response team on how to proceed. It was clear that replication delays were increasing instead of decreasing towards a consistent state. We’d begun provisioning additional MySQL read replicas in the US East Coast public cloud earlier in the incident. Once these became available it became easier to spread read request volume across more servers. Reducing the utilization in aggregate across the read replicas allowed replication to catch up.

2018 October 22 16:24 UTC

Once the replicas were in sync, we conducted a failover to the original topology, addressing the immediate latency/availability concerns. As part of a conscious decision to prioritize data integrity over a shorter incident window, we kept the service status red while we began processing the backlog of data we had accumulated.

2018 October 22 16:45 UTC

During this phase of the recovery, we had to balance the increased load represented by the backlog, potentially overloading our ecosystem partners with notifications, and getting our services back to 100% as quickly as possible. There were over five million hook events and 80 thousand Pages builds queued.

As we re-enabled processing of this data, ~200,000 webhook payloads that had outlived an internal TTL were dropped. Upon discovering this, we paused that processing and pushed a change to increase that TTL for the time being.
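
For readers unfamiliar with delivery TTLs, the following Go sketch shows the general shape of such an age check; the field names and the 24-hour TTL are assumptions for illustration, not GitHub’s actual webhook pipeline.

```go
// Simplified sketch of a delivery TTL check: events older than the TTL are
// skipped rather than delivered, which is why the TTL had to be raised while
// working through a multi-hour backlog.
package main

import (
	"fmt"
	"time"
)

type HookEvent struct {
	ID       string
	Enqueued time.Time
}

// shouldDeliver reports whether an event is still within its delivery TTL.
func shouldDeliver(e HookEvent, ttl time.Duration, now time.Time) bool {
	return now.Sub(e.Enqueued) <= ttl
}

func main() {
	now := time.Now()
	backlog := []HookEvent{
		{ID: "evt-1", Enqueued: now.Add(-2 * time.Hour)},
		{ID: "evt-2", Enqueued: now.Add(-30 * time.Hour)},
	}
	ttl := 24 * time.Hour // hypothetical default; raised during the recovery
	for _, e := range backlog {
		fmt.Printf("%s deliver=%v\n", e.ID, shouldDeliver(e, ttl, now))
	}
}
```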

To avoid further eroding the reliability of our status updates, we remained in degraded status until we had completed processing the entire backlog of data and ensured that our services had clearly settled back into normal performance levels.

2018 October 22 23:03 UTC

All pending webhooks and Pages builds had been processed and the integrity and proper operation of all systems had been confirmed. The site status was updated to green.

Next steps

Resolving data inconsistencies

During our recovery, we captured the MySQL binary logs containing the writes we took in our primary site that were not replicated to our West Coast site from each affected cluster. The total number of writes that were not replicated to the West Coast was relatively small. For example, one of our busiest clusters had 954 writes in the affected window. We are currently performing an analysis on these logs and determining which writes can be automatically reconciled and which will require outreach to users. We have multiple teams engaged in this effort, and our analysis has already determined a category of writes that have since been repeated by the user and successfully persisted. As stated in this analysis, our primary goal is preserving the integrity and accuracy of the data you store on GitHub.

Communication

In our desire to communicate meaningful information to you during the incident, we made several public estimates on time to repair based on the rate of processing of the backlog of data. In retrospect, our estimates did not factor in all variables. We are sorry for the confusion this caused and will strive to provide more accurate information in the future.

Technical initiatives

There are a number of technical initiatives that have been identified during this analysis. As we continue to work through an extensive post-incident analysis process internally, we expect to identify even more work that needs to happen.

  1. Adjust the configuration of Orchestrator to prevent the promotion of database primaries across regional boundaries. Orchestrator’s actions behaved as configured, despite our application tier being unable to support this topology change. Leader-election within a region is generally safe, but the sudden introduction of cross-country latency was a major contributing factor during this incident. This was emergent behavior of the system given that we hadn’t previously seen an internal network partition of this magnitude.

  2. We have accelerated our migration to a new status reporting mechanism that will provide a richer forum for us to talk about active incidents in crisper and clearer language. While many portions of GitHub were available throughout the incident, we were only able to set our status to green, yellow, and red. We recognize that this doesn’t give you an accurate picture of what is working and what is not, and in the future will be displaying the different components of the platform so you know the status of each service.

  3. In the weeks prior to this incident, we had started a company-wide engineering initiative to support serving GitHub traffic from multiple data centers in an active/active/active design. This project has the goal of supporting N+1 redundancy at the facility level, so that we can tolerate the full failure of a single data center without user impact. This is a major effort and will take some time, but we believe that multiple well-connected sites in a geography provide a good set of trade-offs. This incident has added urgency to the initiative.

  4. We will take a more proactive stance in testing our assumptions. GitHub is a fast growing company and has built up its fair share of complexity over the last decade. As we continue to grow, it becomes increasingly difficult to capture and transfer the historical context of trade-offs and decisions made to newer generations of Hubbers.

Organizational initiatives

This incident has led to a shift in our mindset around site reliability. We have learned that tighter operational controls or improved response times are insufficient safeguards for site reliability within a system of services as complicated as ours. To bolster those efforts, we will also begin a systemic practice of validating failure scenarios before they have a chance to affect you. This work will involve future investment in fault injection and chaos engineering tooling at GitHub.

Conclusion

We know how much you rely on GitHub for your projects and businesses to succeed. No one is more passionate about the availability of our services and the correctness of your data. We will continue to analyze this event for opportunities to serve you better and earn the trust you place in us.

The post October 21 post-incident analysis appeared first on The GitHub Blog.