Tag Archives: Insights

Introducing the CodeSearchNet challenge

Post Syndicated from Janey Jack original https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/

Searching for code to reuse, call into, or to see how others handle a problem is one of the most common tasks in a software developer’s day. However, unlike regular web search engines, search engines for code are often frustrating and never seem to fully understand what we want. We started using modern machine learning techniques to improve code search but quickly realized that we couldn’t measure our progress: unlike natural language processing, where standard benchmarks such as GLUE exist, there is no standard dataset suitable for evaluating code search.

With our partners from Weights & Biases, today we’re announcing the CodeSearchNet Challenge evaluation environment and leaderboard. We’re also releasing a large dataset to help data scientists build models for this task, as well as several baseline models showing the current state of the art. Our leaderboard uses an annotated dataset of queries to evaluate the quality of code search tools.

Learn more from our technical report 

The CodeSearchNet Corpus and models

We collected a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. We used our Tree-sitter infrastructure for this effort, and we’re also releasing our data preprocessing pipeline for others to use as a starting point in applying machine learning to code. While this data is not directly related to code search, its pairing of code with related natural language descriptions makes it suitable for training models for this task. Its substantial size also makes it possible to apply high-capacity models based on modern Transformer architectures.

Our fully preprocessed CodeSearchNet Corpus is available for download on Amazon S3, including:

  • Six million methods overall
  • Two million of which have associated documentation (docstrings, JavaDoc, and more)
  • Metadata that indicates the original location (for example, repository and line number) where the data was found
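
To make the corpus concrete, here is a minimal sketch of reading one of its files in Python. It assumes the data ships as gzipped JSON-lines files with fields such as code and docstring; the file name and field names are illustrative, so check the dataset documentation for the exact layout.

```python
import gzip
import json

def load_examples(path):
    """Yield one function record per line of a gzipped JSON-lines file.

    Field names like 'code' and 'docstring' are assumptions; adjust them
    to match the released corpus.
    """
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Example: count how many functions come with documentation.
documented = sum(
    1 for ex in load_examples("python_train_0.jsonl.gz") if ex.get("docstring")
)
print(f"functions with docstrings: {documented}")
```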

Building on our earlier efforts in semantic code search, we’re also releasing a collection of baseline models leveraging modern techniques in learning from sequences (including a BERT-like self-attentional model) to help data scientists get started on code search. 

The CodeSearchNet Challenge

To evaluate code search models, we collected an initial set of code search queries and had programmers annotate the relevance of potential results. We started by collecting common search queries from Bing that had high click-through rates to code and combined these with queries from StaQC, yielding 99 queries for concepts related to code (i.e., we removed everything that was just an API documentation lookup).

We then used a standard Elasticsearch installation and our baseline models to obtain 10 likely results per query from our CodeSearchNet Corpus. Finally, we asked programmers, data scientists, and machine learning researchers to annotate the proposed results for relevance to the query on a scale from zero (“totally irrelevant”) to three (“exact match”). See our technical report for an in-depth explanation of the annotation process and data.
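
To illustrate the retrieval side of this setup, the sketch below asks a standard Elasticsearch instance for ten candidates per query by matching against indexed docstrings and code. The index name and field names are hypothetical, not the benchmark’s actual configuration:

```python
from elasticsearch import Elasticsearch  # pip install elasticsearch

es = Elasticsearch("http://localhost:9200")

def top_candidates(query, k=10):
    """Return the k most likely code results for a natural-language query."""
    resp = es.search(
        index="codesearchnet",  # hypothetical index of corpus functions
        query={"multi_match": {"query": query, "fields": ["docstring", "code"]}},
        size=k,
    )
    return [hit["_source"] for hit in resp["hits"]["hits"]]

for result in top_candidates("convert int to string"):
    print(result.get("repo"), result.get("func_name"))
```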

We want to expand our evaluation dataset to include more languages, queries, and annotations over the next few months, and we aim to include this extended dataset in the next version of the CodeSearchNet Challenge.

Other use cases

We anticipate other use cases for this dataset beyond code search and are presenting code search as one possible task that leverages learned representations of natural language and code. We’re excited to see what the community builds next.

Special thanks

The CodeSearchNet Challenge would not be possible without the Microsoft Research Team and core contributors from GitHub, including Marc Brockschmidt, Miltos Allamanis, Ho-Hsiang Wu, Hamel Husain, and Tiferet Gazit.

We’re also thankful for all of the contributors from the community who helped put this project together:

@nbardy, @raubitsj, @staceysv, @cvphelps, @tejaskannan, @s-zanella, @AntonioND, @goutham7r, @campoy, @cal58, @febuiles, @letmaik, @sebastiandziadzio, @panthap2, @CoderPat.


Learn more about the CodeSearchNet Challenge

The post Introducing the CodeSearchNet challenge appeared first on The GitHub Blog.

C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages

Post Syndicated from Kavita Ganesan original https://github.blog/2019-07-02-c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages/

GitHub hosts over 300 programming languages—from commonly used languages such as Python, Java, and JavaScript to esoteric languages such as Befunge, known only to very small communities.

Figure 1: Top 10 programming languages hosted by GitHub by repository count (JavaScript leads, followed by Java and HTML)

One of the necessary challenges GitHub faces is recognizing these different languages. When code is pushed to a repository, it’s important to recognize its language for the purposes of search, security vulnerability alerting, and syntax highlighting—and to show the repository’s content distribution to users.

Despite appearances, language recognition isn’t a trivial task. File names and extensions, while providing a good indication of what the coding language is likely to be, do not offer the full picture. In fact, many extensions are associated with the same language (e.g., “.pl”, “.pm”, “.t”, and “.pod” are all associated with Perl), while others are ambiguous and used almost interchangeably across languages (e.g., “.h” commonly indicates many languages of the “C” family, including C, C++, and Objective-C). In other cases, files are simply provided with no extension (especially for executable scripts) or with an incorrect extension (either on purpose or accidentally).

Linguist is the tool we currently use to detect coding languages at GitHub. Linguist is a Ruby-based application that uses various strategies for language detection: it leverages naming conventions and file extensions, and also takes into account Vim or Emacs modelines as well as the content at the top of the file (the shebang line). Linguist handles language disambiguation via heuristics and, failing that, via a Naive Bayes classifier trained on a small sample of data.

Although Linguist does a good job making file-level language predictions (84% accuracy), its performance declines considerably when files use unexpected naming conventions and, crucially, when a file extension is not provided. This renders Linguist unsuitable for content such as GitHub Gists or code snippets within READMEs, issues, and pull requests.

In order to make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua, based on an Artificial Neural Network (ANN) architecture, that can handle language predictions in tricky scenarios. The current version of the model is able to make predictions for the top 50 languages hosted by GitHub and surpasses Linguist in accuracy and performance.

The Nuts and Bolts Behind OctoLingua

OctoLingua was built from scratch using Python and Keras with a TensorFlow backend, and is designed to be accurate, robust, and easy to maintain. In this section, we describe our data sources, model architecture, and performance benchmark for OctoLingua. We also describe what it takes to add support for a new language.

Data sources

The current version of OctoLingua was trained on files retrieved from Rosetta Code and from a set of quality repositories crowdsourced internally. We limited our language set to the top 50 languages hosted on GitHub.

Rosetta Code was an excellent starter dataset, as it contains source code for the same task expressed in different programming languages. For example, the task of generating a Fibonacci sequence is expressed in C, C++, CoffeeScript, D, Java, Julia, and more. However, coverage across languages was not uniform: some languages had only a handful of files, and some files were too sparsely populated. Augmenting our training set with additional sources was therefore necessary, and it substantially improved language coverage and performance.

Our process for adding a new language is now fully automated. We programmatically collect source code from public repositories on GitHub, choosing repositories that meet minimum qualifying criteria such as having a minimum number of forks, covering the target language, and covering specific file extensions. For this stage of data collection, we determine the primary language of a repository using the classification from Linguist.

Features: leveraging prior knowledge

Traditionally, for text classification problems with neural networks, memory-based architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory networks (LSTMs) are often employed. However, programming languages differ in vocabulary, commenting style, file extensions, structure, library import style, and other minor ways, so we opted for a simpler approach that leverages all of this information by extracting relevant features in tabular form to feed to our classifier. The features currently extracted are as follows (a sketch of the extraction follows the list):

  1. Top five special characters per file
  2. Top 20 tokens per file
  3. File extension
  4. Presence of certain special characters commonly used in source code files such as colons, curly braces, and semicolons
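
As a rough illustration of what this tabular extraction might look like (this is a reconstruction, not OctoLingua’s actual code; the character set and tokenization rule are assumptions):

```python
import re
from collections import Counter

SPECIAL_CHARS = set(":;{}()[]<>#$@&|\\=+-*/%!?^~.,'\"`")

def extract_features(path):
    """Turn one source file into tabular features like those listed above."""
    with open(path, encoding="utf-8", errors="ignore") as f:
        text = f.read()
    char_counts = Counter(c for c in text if c in SPECIAL_CHARS)
    token_counts = Counter(re.findall(r"[A-Za-z_][A-Za-z0-9_]*", text))
    return {
        "top_special_chars": [c for c, _ in char_counts.most_common(5)],
        "top_tokens": [t for t, _ in token_counts.most_common(20)],
        "extension": path.rsplit(".", 1)[-1] if "." in path else "",
        "has_semicolon": ";" in text,
        "has_curly_braces": "{" in text and "}" in text,
        "has_colon": ":" in text,
    }
```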

The Artificial Neural Network (ANN) model

We use the above features as input to a two-layer Artificial Neural Network built using Keras with a TensorFlow backend.

As the diagram below shows, the feature extraction step produces an n-dimensional tabular input for our classifier. As the information moves through the layers of our network, it is regularized by dropout and ultimately produces a 51-dimensional output, representing the predicted probability that the given code is written in each of the top 50 GitHub languages, plus the probability that it is not written in any of them.

Figure 2: The ANN Structure of our initial model (50 languages + 1 for “other”)
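
A minimal Keras sketch of the architecture in Figure 2 follows. The post does not state layer widths or the dropout rate, so the numbers below are assumptions; only the overall shape (dense layers regularized by dropout, ending in a 51-way softmax) comes from the description:

```python
from tensorflow import keras

N_FEATURES = 512  # assumed width of the tabular feature vector
N_CLASSES = 51    # top 50 languages + 1 "other" class

model = keras.Sequential([
    keras.Input(shape=(N_FEATURES,)),
    keras.layers.Dense(256, activation="relu"),  # hidden sizes are illustrative
    keras.layers.Dropout(0.5),                   # dropout regularization, as in Figure 2
    keras.layers.Dense(128, activation="relu"),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(N_CLASSES, activation="softmax"),
])

model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
model.summary()
```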

We used 90% of our dataset for training over approximately eight epochs. Additionally, we removed a percentage of file extensions from our training data during training, to encourage the model to learn from the vocabulary of the files rather than overfit on the file extension feature, which is highly predictive.
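
One simple way to implement that augmentation, assuming the extension is a single categorical feature per training row (as in the extraction sketch above), is to blank it out for a random fraction of rows before training:

```python
import random

def drop_extensions(rows, fraction=0.5, seed=0):
    """Blank the file-extension feature for a random subset of training rows.

    The fraction actually used by OctoLingua varies (see Figure 4); 0.5 here
    is just a placeholder.
    """
    rng = random.Random(seed)
    augmented = []
    for row in rows:
        row = dict(row)  # copy so the original training data is untouched
        if rng.random() < fraction:
            row["extension"] = ""
        augmented.append(row)
    return augmented
```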

Performance benchmark

OctoLingua vs. Linguist

In Figure 3, we show the F1 score (the harmonic mean of precision and recall) of OctoLingua and Linguist, calculated on the same test set (10% of our initial data source).

Here we show three tests. The first test uses the test set untouched. The second uses the same set of test files with file extension information removed, and the third also uses the same set of files but with file extensions scrambled so as to confuse the classifiers (e.g., a Java file may have a “.txt” extension and a Python file a “.java” extension).
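
A sketch of how the three test conditions could be built from feature rows (again assuming an "extension" feature; this illustrates the evaluation idea and is not the actual benchmark code):

```python
import random

def make_test_variants(rows, seed=0):
    """Build three test sets: untouched, extensions removed, extensions scrambled."""
    rng = random.Random(seed)
    untouched = [dict(r) for r in rows]
    no_ext = [dict(r, extension="") for r in rows]
    pool = [r["extension"] for r in rows]
    rng.shuffle(pool)  # reassign extensions at random to mislead the classifier
    scrambled = [dict(r, extension=e) for r, e in zip(rows, pool)]
    return untouched, no_ext, scrambled
```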

The intuition behind scrambling or removing the file extensions in our test set is to assess the robustness of OctoLingua in classifying files when a key feature is removed or misleading. A classifier that does not rely heavily on the extension would be extremely useful for classifying gists and snippets, since in those cases it is common for people not to provide accurate extension information (e.g., many code-related gists have a .txt extension).

The table below shows how OctoLingua maintains good performance under these various conditions, suggesting that the model learns primarily from the vocabulary of the code rather than from meta information (i.e., the file extension), whereas Linguist fails as soon as the information on file extensions is altered.

Figure 3: Performance of OctoLingua vs. Linguist on the same test set

Effect of removing file extension during training time

As mentioned earlier, during training time we removed a percentage of file extensions from our training data to encourage the model to learn from the vocabulary of the files. The table below shows the performance of our model with different fractions of file extensions removed during training time. 

Figure 4: Performance of OctoLingua with different percentages of file extensions removed, on our three test variations

Notice that with no file extensions removed during training, OctoLingua’s performance on test files without extensions and with randomized extensions decreases considerably from its performance on the regular test data. On the other hand, when the model is trained on a dataset where some file extensions are removed, its performance declines only slightly on the modified test sets. This confirms that removing the file extension from a fraction of files at training time induces our classifier to learn more from the vocabulary. It also shows that the file extension feature, while highly predictive, had a tendency to dominate and prevented more weight from being assigned to the content features.

Supporting a new language

Adding a new language in OctoLingua is fairly straightforward. It starts with obtaining a large set of files in the new language (we can do this programmatically, as described in the data sources section). These files are split into a training and a test set and then run through our preprocessor and feature extractor. The new training and test sets are added to our existing pools of training and testing data, and the new test set allows us to verify that the accuracy of our model remains acceptable.

Figure 5: Adding a new language with OctoLingua
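
Conceptually, the flow in Figure 5 amounts to something like the sketch below; the function names are placeholders rather than OctoLingua’s API, and extract_features stands in for the preprocessor and feature extractor described earlier:

```python
from sklearn.model_selection import train_test_split  # pip install scikit-learn

def add_language(file_paths, labels, train_pool, test_pool):
    """Fold a new language's files into the existing train/test pools.

    The 90/10 split mirrors the one used for the original training data.
    """
    rows = [extract_features(path) for path in file_paths]
    train_x, test_x, train_y, test_y = train_test_split(
        rows, labels, test_size=0.1, random_state=0
    )
    train_pool.extend(zip(train_x, train_y))
    test_pool.extend(zip(test_x, test_y))
    return train_pool, test_pool
```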

Our plans

As of now, OctoLingua is at the “advanced prototyping stage”. Our language classification engine is already robust and reliable, but it does not yet support all coding languages on our platform. Aside from broadening language support—which would be rather straightforward—we aim to enable language detection at various levels of granularity. Our current implementation already allows us, with a small modification to our machine learning engine, to classify code snippets. It wouldn’t be too far-fetched to take the model to the stage where it can reliably detect and classify embedded languages.

We are also contemplating the possibility of open sourcing our model and would love to hear from the community if you’re interested.

Summary

With OctoLingua, our goal is to provide a service that enables robust and reliable source code language detection at multiple levels of granularity, from the file or snippet level to potentially line-level detection and classification. Eventually, this service can support, among other things, code searchability, code sharing, language highlighting, and diff rendering—all aimed at supporting developers in their day-to-day work, in addition to helping them write quality code. If you are interested in leveraging or contributing to our work, please feel free to get in touch on Twitter @github!


The post C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages appeared first on The GitHub Blog.

Get Cloudflare insights in your preferred analytics provider

Post Syndicated from Simon Steiner original https://blog.cloudflare.com/cloudflare-partners-with-analytics-providers/


Today, we’re excited to announce our partnerships with Chronicle Security, Datadog, Elastic, Looker, Splunk, and Sumo Logic to make it easy for our customers to analyze Cloudflare logs and metrics using their analytics provider of choice. In a joint effort, we have developed pre-built dashboards that are available as a Cloudflare App in each partner’s platform. These dashboards help customers better understand events and trends from their websites and applications on our network.


Get Cloudflare insights in your preferred analytics provider

Get Cloudflare insights in your preferred analytics provider

Get Cloudflare insights in your preferred analytics provider

Get Cloudflare insights in your preferred analytics provider

Get Cloudflare insights in your preferred analytics provider

Get Cloudflare insights in your preferred analytics provider

Cloudflare insights in the tools you’re already using

Data analytics is a frequent theme in conversations with Cloudflare customers. Our customers want to understand how Cloudflare speeds up their websites and saves them bandwidth, to see their fastest and slowest pages ranked, and to be alerted if they are under attack. While providing insights is a core tenet of Cloudflare’s offering, the data analytics market has matured, and many of our customers have started using third-party providers to analyze data—including Cloudflare logs and metrics. By aggregating data from multiple applications, infrastructure, and cloud platforms in one dedicated analytics platform, customers can create a single pane of glass and benefit from better end-to-end visibility across their entire stack.


While these analytics platforms provide great benefits in terms of functionality and flexibility, they can take significant time to configure: from ingesting logs, to specifying data models that make data searchable, all the way to building dashboards to get the right insights out of the raw data. We see this as an opportunity to partner with the companies our customers are already using to offer a better and more integrated solution.

Providing flexibility through easy-to-use integrations

To address these complexities of aggregating, managing, and displaying data, we have developed a number of product features and partnerships to make it easier to get insights out of Cloudflare logs and metrics. In February we announced Logpush, which allows customers to automatically push Cloudflare logs to Google Cloud Storage and Amazon S3. Both of these cloud storage solutions are supported by the major analytics providers as a source for collecting logs, making it possible to get Cloudflare logs into an analytics platform with just a few clicks. With today’s announcement of Cloudflare’s Analytics Partnerships, we’re releasing a Cloudflare App—a set of pre-built and fully customizable dashboards—in each partner’s app store or integrations catalogue to make the experience even more seamless.
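
For readers who want to script that first step, the sketch below creates a Logpush job through Cloudflare’s v4 API using Python. The destination string and field list are illustrative examples, so consult the Logpush documentation for the exact options your account supports:

```python
import requests

API = "https://api.cloudflare.com/client/v4"
ZONE_ID = "your_zone_id"    # placeholder
API_TOKEN = "your_token"    # placeholder; needs permission to edit Logpush jobs

job = {
    "name": "s3-http-requests",
    # Both values below are examples, not required settings.
    "destination_conf": "s3://my-log-bucket/cloudflare?region=us-east-1",
    "logpull_options": "fields=ClientIP,EdgeStartTimestamp,EdgeResponseStatus&timestamps=rfc3339",
    "enabled": True,
}

resp = requests.post(
    f"{API}/zones/{ZONE_ID}/logpush/jobs",
    headers={"Authorization": f"Bearer {API_TOKEN}"},
    json=job,
)
resp.raise_for_status()
print(resp.json()["result"])
```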

By using these dashboards, customers can immediately analyze events and trends of their websites and applications without first needing to wade through individual log files and build custom searches. The dashboards feature all 55+ fields available in Cloudflare logs and include 90+ panels with information about the performance, security, and reliability of customers’ websites and applications.


Ultimately, we want to provide flexibility to our customers and make it easier to use Cloudflare with the analytics tools they already use. Improving our customers’ ability to get better data and insights continues to be a focus for us, so we’d love to hear about what tools you’re using—tell us via this brief survey. To learn more about each of our partnerships and how to get access to the dashboards, please visit our developer documentation or contact your Customer Success Manager. Similarly, if you’re an analytics provider who is interested in partnering with us, use the contact form on our analytics partnerships page to get in touch.
