Tag Archives: Insights

Automating MySQL schema migrations with GitHub Actions and more

Post Syndicated from Shlomi Noach original https://github.blog/2020-02-14-automating-mysql-schema-migrations-with-github-actions-and-more/

In the past year, GitHub engineers shipped GitHub Packages, Actions, Sponsors, Mobile, security advisories and updates, notifications, code navigation, and more. Needless to say, the development pace at GitHub has accelerated.

With MySQL serving our backends, updating code requires changes to the underlying database schema. New features may require new tables, columns, changes to existing columns or indexes, dropping unused tables, and so on. On average, we have two schema migrations running daily on our production servers. Some days we have a half dozen migrations to run. We’ll cover how this amounted to significant toil on the database infrastructure team, and how we searched for a solution to automate the manual parts of the process.

What’s in a migration?

At first glance, migrating appears to be no more difficult than adding a CREATE, ALTER or DROP TABLE statement. At a closer look, the process is far more complex, and involves multiple owners, platforms, environments, and transitions between those pieces. Here’s the flow as we experience it at GitHub:

1. Starting the process

It begins with a developer who identifies the need for a schema change. Maybe they need a new table, or a new column in an existing table. The developer has a local testing environment where they can experiment however they like, until they’re satisfied and wish to apply changes to production.

2. Feedback and review

The developer doesn’t just apply their changes online. First, they seek review and discussion with their peers. Depending on the change, they may ask for a review from a group of schema reviewers (at GitHub, this is a volunteer group experienced with database design). Then, they seek the agreement of the database infrastructure team, who owns the production databases. The database infrastructure team reviews the changes, looking for performance concerns, among other potential issues. Assuming all reviews are favorable, it’s on the database infrastructure engineer to deploy the change to production.

3. Taking the change to production

At this point, we need to determine where the change is taking place since we have multiple clusters. Some of them are sharded, so we have to ask: Where do the affected tables exist in our clusters or schemas? Next, we need to know what to run. The developer presented the schema they want to see in production, but how do we transition the existing production schema into the one requested? What’s the formal CREATE, ALTER or DROP statement? Following what to run, we need to know how we should run the migration. Do we run the query directly? Or is it a blocking operation and we need an online schema change tool? And finally, we need to know when to execute the migration. Perhaps now is not a good time if there’s already a migration running on the cluster.

4. Migration

At long last, we’re ready to run the migration. Some of our larger tables may take hours or even days to migrate, especially since the site needs to be up and running. We want to track status. And we want to see what impact the migration may have on production, or, preferably, to ensure it does not have an impact.

5. Completing the process

Even once the migration completes, there are further steps to take. There’s some cleanup process, and we want to unblock the next migration, if one is waiting. The database infrastructure team wishes to advertise to the developer that the changes have taken place, and the developer will have their own followup to address.

Throughout that flow, there’s a lot of potential for friction:

  • Does the database infrastructure team review the developer’s request in a timely fashion?
  • Is the review process productive?
  • Do we need to wait for something before running the migration?
  • Is the database infrastructure engineer actually available to run the migration, or perhaps they’re busy with other tasks?

The database infrastructure engineer needs to either create or review the migration statement, double-check their logic, ensure they can begin the migration, follow up, unblock other migrations as needed, advertise progress to the developer, and so on.

With our volume of daily migrations, this flow sometimes consumed hours of a database infrastructure engineer’s time per day, and—in the best-case scenario—at least several hours of work per week. They would frequently multitask between two or three migrations and keep mental notes for next steps. Developers would ping us to ask what the status was, and their work was sometimes blocked until the migration was complete.

A brief history of schema migration automation at GitHub

GitHub was originally created as a Ruby on Rails (RoR) app. Like other frameworks, and in particular, those using Active Record, RoR has a built-in mechanism to generate database schema from code, as well as programmatically express migrations. RoR tooling can analyze code changes and create and run the SQL statements to change the database schema.

We use the GitHub flow to manage our own development: when suggesting a change, we create a branch, commit, push, and open a pull request. We use the declarative approach to schema definition: our RoR GitHub repository contains the full schema definition, such as the CREATE TABLE statements that generate the complete schema. This way, we know exactly what schema is associated with each commit or branch. Contrast that with the programmatic approach, where your commits contain migration statements, and where, to deduce a schema, you need to start at some baseline and run through all statements sequentially.

The database infrastructure and the application teams collaborated to create a set of chatops tooling. We ran a chatops command to list pull requests with schema changes, and then another command to generate the CREATE/ALTER/DROP statement for a given pull request. For this, we used RoR’s rake command. Our wrapper scripts then added meta information, like which cluster is involved, and generated a script used to run the migration.

The generated statements and script were mostly fine, with occasional SQL syntax errors. We’d review the output and fix it manually as needed.

A few years ago we developed gh-ost, an online table migration solution, which added even more visibility and control through our chatops. We’d check progress, change runtime configuration, and cut over the migration through chat. While simple, these were still manual steps.

The heart of GitHub’s app remains the same RoR codebase, but we’ve expanded far beyond it. We created more repositories; some also use RoR, while others are written in other programming languages such as Go. However, we didn’t use an Object Relational Mapping practice with the new repositories.

As GitHub expanded, so did the toil on the database infrastructure team. We’d review pull requests, compare schemas, generate migration statements manually, and verify on a local machine. Other than the git log, no formal tracking for schema migrations existed. We’d check in chat, issues, and pull requests to see what was done and what wasn’t. We’d keep track of ongoing migrations in our heads, context switch between the migrations throughout the day, and get interrupted by notifications all too often. And we did this while taking each migration through the next step, keeping mental notes, and communicating the progress to our peers.

With these steps in mind, we wanted a solution to automate the process. We came up with various ideas, and in 2019 GitHub Actions was released. This was our solution: multiple loosely coupled components, each owning a specific aspect of the flow, all orchestrated by a controller service. The next section covers the breakdown of our solution.

Code

Our basic premise is that schema design should be treated as code. We want the schema to be versioned, and we want to know which schema is associated with which version of our code.

To illustrate, GitHub provides not only github.com, but also GitHub Enterprise, an on-premise solution. On github.com we run continuous deployments. With GitHub Enterprise, we make periodic releases, and our customers can upgrade in-house. This means we need to be able to reproduce any schema changes we make to github.com on a customer’s Enterprise server.

Therefore we must keep our schema design coupled with the code in the same git repository. For a developer to design a schema change, they need to follow our normal development flow: create a branch, commit, push, and open a pull request. The pull request is where code is reviewed and discussion takes place for any changes. It’s where continuous integration and testing run. Our solution revolves around the pull request, and this is standardized across all our repositories.

The change

Once a pull request is opened, we need to be able to identify what changes we’d like to make. Typically, when we review code changes, we look at the diff. And it might be tempting to expect that git diff can help us formalize the schema change. Unfortunately, this is not the case, and git diff is poor at identifying these changes. For example, consider this simplified table definition:

CREATE TABLE some_table (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  hostname varchar(128) NOT NULL,
  PRIMARY KEY (id),
  KEY (hostname)
);

Suppose we decide to add a new column and drop the index on hostname. The new schema becomes:

CREATE TABLE some_table (
  id int(10) unsigned NOT NULL AUTO_INCREMENT,
  hostname varchar(128) NOT NULL,
  time_created TIMESTAMP NOT NULL,
  PRIMARY KEY (id)
);

Running git diff on the two schemas yields the following:

@@ -1,6 +1,6 @@
 CREATE TABLE some_table (
   id int(10) unsigned NOT NULL AUTO_INCREMENT,
   hostname varchar(128) NOT NULL,
-  PRIMARY KEY (id),
-  KEY (hostname)
+  time_created TIMESTAMP NOT NULL,
+  PRIMARY KEY (id)
 );

The pull request’s “Files changed” tab shows the same:

This is a sample Pull Request where we change a table's schema. git diff does a poor job of analyzing the schema change.

See how the PRIMARY KEY line goes into the diff because of the trailing comma. This diff does not capture the schema change well, and while RoR provides tooling for analyzing schema changes, we still had to carefully review its output. Fortunately, there’s a good MySQL-oriented tool to do the task.

skeema

skeema is an open source schema management utility developed by Evan Elias. It expects the declarative approach, and looks for a schema definition on your file system (hopefully as part of your repository). The file system layout should include a directory per schema/database, a file per table, and then some special configuration files telling skeema the identities of, and the credentials for, MySQL servers in various environments. Skeema is able to run useful tasks, such as:

  • skeema diff: generate SQL statements that convert the existing database schema into the schema defined in the file system. This includes as many CREATE, ALTER and DROP TABLE statements as needed.
  • skeema push: actually apply changes to the database server so that its schema matches the one on the file system.
  • skeema pull: rewrite the filesystem schema based on the existing schema in the MySQL server.

skeema can do much more, including the ability to invoke online schema change tools—but that’s outside this post’s scope.

Git users will feel comfortable with skeema. Indeed, skeema works very well with git-versioned schemas. For us, the most valuable asset is its diff output: a well formed, reliable set of statements to show the SQL transition from one schema to another. For example, skeema diff output for the above schema change is:

USE `test`;
ALTER TABLE `some_table` ADD COLUMN `time_created` timestamp NOT NULL DEFAULT CURRENT_TIMESTAMP ON UPDATE CURRENT_TIMESTAMP, DROP KEY `hostname`;

Note that the above is not only correct, but also formal. It reproduces correctly whether our code uses lower/upper case, includes/omits default value, etc.

We wanted to use skeema to tell us what statements we needed to run to get from our existing state into the state defined in the pull request. Assuming the master branch reflects our current production schema, this now becomes a matter of diffing the schemas between master and the pull request’s branch.

Skeema wasn’t without its challenges, and we had to figure out where to place skeema from a design perspective. Do the developers own it? Does every repository own it? Is there a central service to own it? Each presented its own problems, from false ownership to excessive responsibilities and access.

GitHub Actions

Enter GitHub Actions. With Actions, you’re able to run code as a response to events taking place in your repository. A new pull request, review, comment, issue, and quite a few others, are such events. The code (the action) is arbitrary, and GitHub spawns a container on its own infrastructure, where your code will run. What makes this extra interesting is that the container can get access to your repository. GitHub Actions implicitly receives an API token to interact with the repository.

The container comes with popular software packages pre-installed, such as a MySQL server.

Perhaps the most classic use of Actions is CI/CD. When a pull_request event occurs (a new pull request and any subsequent commit), run some code to build, test, lint, or validate the change. We took this approach to run skeema as part of a pull_request action flow, called skeema-diff.

Here’s a simplified breakdown of the action:

  1. Fetch skeema binary
  2. Checkout master branch
  3. Run skeema push to populate the container’s MySQL server with the schema as defined by the master branch
  4. Checkout pull request’s branch
  5. Run skeema diff to generate the statements that take the schema from the one in MySQL (remember, this is the master schema) to the one in the pull request’s branch
  6. Add the diff as a comment in the pull request
  7. Add a special label to indicate this pull request has a schema change

The GitHub Action, running skeema, generates schema diff output, which is added as a comment to the Pull Request. The comment presents the correct ALTER statement implied by the code change. This comment is both human and machine readable.

The code is more complex than what we’ve shown. We actually use base and head instead of master and branch, and there’s some logic to formalize, edit and validate the diff, to handle commits that further change the schema, among other processes.
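
Still, to make the mechanics concrete, here is a heavily simplified sketch of those steps in Python, wrapping the skeema CLI and the GitHub API. It is illustrative only, not GitHub’s actual Action code: the "ci" skeema environment and the BASE_REF, HEAD_REF, and PR_NUMBER variables are assumptions for this sketch (GITHUB_REPOSITORY and GITHUB_TOKEN are standard Actions-provided variables).

import os
import subprocess

import requests

def run(*cmd):
    """Run a command and return its stdout, raising if it fails."""
    return subprocess.run(cmd, check=True, capture_output=True, text=True).stdout

def skeema_diff(environment):
    # skeema diff uses its exit code to signal "differences found",
    # so don't treat a nonzero exit as a hard failure here.
    return subprocess.run(["skeema", "diff", environment],
                          capture_output=True, text=True).stdout

def compute_schema_change(base_ref, head_ref, environment="ci"):
    run("git", "checkout", base_ref)
    run("skeema", "push", environment)   # load the base schema into the container's MySQL
    run("git", "checkout", head_ref)
    return skeema_diff(environment)      # statements that take the base schema to head

def annotate_pull_request(diff):
    repo = os.environ["GITHUB_REPOSITORY"]   # e.g. "someorg/somerepo"
    pr = os.environ["PR_NUMBER"]             # assumed to be passed in by the workflow
    headers = {"Authorization": "token " + os.environ["GITHUB_TOKEN"]}
    api = f"https://api.github.com/repos/{repo}/issues/{pr}"
    requests.post(f"{api}/comments", headers=headers,
                  json={"body": "Suggested schema change:\n\n" + diff}, timeout=10)
    requests.post(f"{api}/labels", headers=headers,
                  json={"labels": ["migration:skeema:diff"]}, timeout=10)

if __name__ == "__main__":
    diff = compute_schema_change(os.environ["BASE_REF"], os.environ["HEAD_REF"])
    if diff.strip():
        annotate_pull_request(diff)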

By now, we have a partial flow, which works entirely on GitHub’s platform:

  • Schema change as code
  • Review process, based on GitHub’s pull request flow
  • Automated schema change analysis, based on skeema running in a GitHub Action
  • A visible output, presented as a pull request comment

Up to this point, everything is constrained to the repository. The repository itself doesn’t have information about where the schema gets deployed in production. This information is something that’s outside the repository’s scope, and it’s owned by the database infrastructure team rather than the repository’s developers. Neither the repository nor any action running on that repository has access to production, nor should they, as that would be a breach of domains.

Before we describe how the schema gets to production, let’s jump ahead and discuss the schema migration itself.

Schema migrations and gh-ost

Even the simplest schema migration isn’t simple. We are concerned with three types of table migrations:

  • CREATE TABLE is the simplest and the safest. We created something that didn’t exist before, and its creation time is instantaneous. Note that if the target cluster is sharded, this must be applied on all shards. If the cluster is sharded with vitess, then the vitess vtgate service automatically handles this for us.
  • DROP TABLE is a simple statement that carries great risk. What if it’s still in use and some code breaks as a result of the table going away? Note that we don’t actually drop tables as part of schema migrations. Any DROP TABLE statement is converted into a RENAME TABLE. Instead of DROP TABLE repositories (whoops!), our automation runs RENAME TABLE repositories TO _repositories_DROP_20200101123456. If our application fails because of this, we have an instant revert command: RENAME back to the original. Renamed tables are kept around for a few days prior to being garbage collected and dropped by our automation. (A small sketch of this convention follows the list.)
  • ALTER TABLE is the most complex case, mainly because it takes time to alter a table. We don’t actually ALTER tables in-place. We use gh-ost to emulate an ALTER TABLE, and the end result is the same even though the process is completely different. It doesn’t lock our apps, throttles as much as needed, and it’s controllable as well as auditable. We’ve run gh-ost in production for over three and a half years. It has little to no impact on production, and we generally don’t care that it’s running. But some of our larger tables may still take hours or even days to migrate. We also only run one ALTER (or, gh-ost) at a time on a cluster. Concurrent migrations are possible but compete over resources, leading to overall longer runtimes than sequential execution. This means that an ALTER migration requires scheduling. We need to be able to tell if a migration is already running on a cluster, as well as prioritize and queue migrations that apply to the same cluster. We also need to be able to tell the status over the duration of hours or days, and this needs to be communicated to the developer, the owner of the change. And, if the cluster is sharded, we need to run the migration per shard.
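
To make the DROP-as-RENAME convention concrete, here is a tiny illustrative Python helper that builds the rename statement and its instant revert. It follows the naming pattern from the example above and is a sketch of the idea only, not GitHub’s actual tooling.

from datetime import datetime

def safe_drop_statements(table, now=None):
    """Return (rename, revert) statements implementing DROP-as-RENAME."""
    ts = (now or datetime.utcnow()).strftime("%Y%m%d%H%M%S")
    tombstone = "_{}_DROP_{}".format(table, ts)
    rename = "RENAME TABLE `{}` TO `{}`".format(table, tombstone)
    revert = "RENAME TABLE `{}` TO `{}`".format(tombstone, table)
    return rename, revert

# safe_drop_statements("repositories") yields something like:
#   RENAME TABLE `repositories` TO `_repositories_DROP_20200101123456`
# and the instant-revert statement:
#   RENAME TABLE `_repositories_DROP_20200101123456` TO `repositories`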

In order to run a migration, we must first determine the strategy for that migration (is it a direct query, gh-ost, or manual?). We need to be able to tell where it can run, how to go about the process if the cluster is sharded, and when to schedule it. While migrations can wait in queue while others are running, we want to be able to prioritize migrations, in case the queue is large.

skeefree

We created skeefree as the glue: an orchestrating service that’s aware of our repositories, can communicate with our pull requests, knows about production (or can get information about production), and invokes the migrations. We run skeefree as a stateless Kubernetes service, backed by a MySQL database that holds the state. Note that skeefree’s own schema is managed by skeefree.

skeefree uses the GitHub API to interact with pull requests, GitHub’s internal inventory and discovery services to locate clusters in production, and gh-ost to run migrations. Skeefree is best described by following a schema migration flow:

  1. A developer wishes to change the schema, so they open a pull request.
  2. skeema-diff Action springs to life and seeks a schema change. If a schema change isn’t found in the pull request, nothing happens. If there is a schema change, the Action computes the change via skeema, adds a well-formed comment to the pull request indicating the change, and adds a migration:skeema:diff label to the pull request. This is done via the GitHub API.
  3. A developer looks into the change, and seeks review from a team member. At this time they may communicate with team members without actually going to production. Finally, they add the label migration:for:review.
  4. skeefree is aware of the developer’s repository and uses the GitHub API to periodically look for open pull requests, which are labeled by both migration:skeema:diff and migration:for:review, and have been approved by at least one developer.
  5. Once detected, skeefree investigates the pull request, and reads the schema change comment, generated by the Action. It maps the schema/repository to the schema/production cluster, and uses our inventory and discovery services to know if the cluster is sharded. Then, it finds the location and name of the cluster.
  6. skeefree then adds this to its backend database and advertises its analysis on the pull request with another comment. This comment generally means “here’s what I will do if you approve”. It then proceeds to seek review from an authority.
  7. For most repositories, the authority is the database-infrastructure team. On our original RoR repository, we also seek review from a cross-functional team, known as the db-schema-reviewers, who are familiar with the general application and database design throughout the years and who have more context to offer. skeefree automatically knows which teams should be notified on which repositories.
  8. The relevant teams review and hopefully approve. skeefree detects the approval and chooses the proper strategy: direct query for CREATE and DROP (which, as noted, becomes a RENAME), or gh-ost for ALTER. It then queues the migration(s).
  9. skeefree’s scheduler periodically checks what can be executed next. Remember, we only run a single ALTER migration on a given cluster at a time, and we also have a limited number of runner hosts. If there’s a free runner host and the cluster is not running any migration, skeefree proceeds to kick off a migration and advertises this fact as a pull request comment to notify the developer that the migration started. (A small sketch of this scheduling check follows the list.)
  10. Once the migration is complete, skeefree announces it in a pull request comment. The same applies should the migration fail.
  11. The pull request may also have more than one migration. Perhaps the cluster is sharded, or there may be multiple tables changed in the pull request. Once all migrations are successfully completed, skeefree advertises this in a pull request comment. The developer is notified that all migrations are done, and they’re encouraged to proceed with their standard deploy/merge flow.
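
As promised in step 9, here is a minimal sketch of that scheduling check. It is illustrative only: the Migration shape below is hypothetical, and skeefree’s real scheduler, data model, and runner-host accounting are more involved.

from dataclasses import dataclass

@dataclass
class Migration:              # hypothetical shape, for illustration only
    pr_number: int
    cluster: str
    strategy: str             # "direct" or "gh-ost"

def next_runnable(queued, running, free_runner_hosts):
    """Pick the next queued migration that can start: a runner host must be
    free and the target cluster must not already be running a migration."""
    if free_runner_hosts <= 0:
        return None
    busy_clusters = {m.cluster for m in running}
    for migration in queued:          # `queued` is assumed ordered by priority
        if migration.cluster not in busy_clusters:
            return migration
    return None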

As skeefree runs the migrations, it adds comments on the pull request page to indicate its progress. When all migrations are complete, skeefree comments as much, again on the pull request page.

Analysis of the flow

There are a few nuances here that make for a good experience for everyone involved:

  • The database infrastructure team doesn’t know about the pull request until the developer explicitly adds the migration:for:review label. It’s like a draft pull request or a pull request that’s a work in progress, only this flag applies specifically to the schema migration flow. This allows the developer to use their preferred flow, and communicate with their team without interrupting the database infrastructure team or getting premature reviews.
  • The skeema analysis is contained within the repository, which means that no external service is required. The developer can check the diff result themselves.
  • The Action is the only part of the flow that looks at the code. Neither skeefree nor gh-ost look at the actual code, and they don’t need git access.
  • The database infrastructure team only needs to take a single step, which is review the pull request.
  • The developers own the creation of pull requests, getting peer reviews, and finally, deploying and merging. These are the exact operations that should be under their ownership. Moreover, they get visibility into the state of their migration. By looking at the pull request page or their GitHub notifications, they can tell whether the pull request has been reviewed, queued, started, completed, or failed. They don’t need to ask. Even better, we have chatops that give visibility into the overall state of the migration queue, a running migration’s progress, and more. These chatops are available for all to invoke.
  • The database infrastructure team owns the process of mapping the repository schema to production. This is done via chatops, but can also be completed via configuration. The team is able to cancel a pull request, retry a failed migration, and more.
  • gh-ost is generally trusted, and we have control over a running migration. This means that we can force it to throttle, set a different throttle threshold, make it use fewer resources, or terminate it, if needed. We also have a throttling mechanism throughout our stack, so that long running processes like migrations yield to higher priority operations; this extends the migration’s runtime, but ensures it doesn’t generate too much load on our database servers.
  • We use our own preferred pull request flow, Actions (skeefree was an early adopter of Actions), the GitHub API, and our existing datacenter and database infrastructure, all of which are well understood internally.

Public availability

skeefree and the skeema-diff Action were authored internally at GitHub to solve a specific problem. skeefree uses our internal inventory and discovery services, it works with our chatops and uses some internal libraries.

Our experience in releasing open source software is that no one’s use case is exactly the same as ours. Our perception of an automated migrations flow may be very different from another organization’s perception. We still want to share more than just our words, so we’ve open sourced the code.

It’s a bit of a peculiar OSS release:

  • It’s missing some libraries; it will not build.
  • It expects some of our internal services to exist, which more than likely won’t be on your platform.
  • It expects chatops, and you may not be using chatops.
  • The code also needs to be rewritten to adapt it to your environment.

Note that the code is available, but not open for issues and pull requests. We hope the community finds it useful.

Get the code

The post Automating MySQL schema migrations with GitHub Actions and more appeared first on The GitHub Blog.

New tools to monitor your server and avoid downtime

Post Syndicated from Brian Batraski original https://blog.cloudflare.com/new-tools-to-monitor-your-server-and-avoid-downtime/

When your server goes down, it’s a big problem. Today, Cloudflare is introducing two new tools to help you understand and respond faster to origin downtime — plus, a new service to automatically avoid downtime.

The new features are:

  • Standalone Health Checks, which notify you as soon as we detect problems at your origin server, without needing a Cloudflare Load Balancer.
  • Passive Origin Monitoring, which lets you know when your origin cannot be reached, with no configuration required.
  • Zero-Downtime Failover, which can automatically avert failures by retrying requests to origin.

Standalone Health Checks

Our first new tool is Standalone Health Checks, which will notify you as soon as we detect problems at your origin server — without needing a Cloudflare Load Balancer.

A Health Check is a service that runs on our edge network to monitor whether your origin server is online. Health Checks are a key part of our load balancing service because they allow us to quickly and actively route traffic to origin servers that are live and ready to serve requests. Standalone Health Checks allow you to monitor the health of your origin even if you only have one origin or do not yet need to balance traffic across your infrastructure.

We’ve provided many dimensions for you to home in on exactly what you’d like to check, including response code, protocol type, and interval. You can specify a particular path if your origin serves multiple applications, or you can check a larger subset of response codes for your staging environment. All of these options allow you to properly target your Health Check, giving you a precise picture of what is wrong with your origin.
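
As a rough sketch, creating a Health Check along those dimensions through the Cloudflare API could look like the Python snippet below. The endpoint and field names are assumptions modeled on this description, so treat them as illustrative and consult the current API documentation for the real ones.

import requests

API = "https://api.cloudflare.com/client/v4"
ZONE_ID = "your-zone-id"      # placeholder
TOKEN = "your-api-token"      # placeholder

health_check = {              # field names are assumptions, for illustration only
    "name": "www-origin",
    "address": "origin.example.com",
    "type": "HTTPS",                     # protocol type
    "interval": 60,                      # seconds between checks
    "http_config": {
        "path": "/healthz",              # a specific path on the origin
        "expected_codes": ["200"],       # response codes considered healthy
    },
}

response = requests.post(
    f"{API}/zones/{ZONE_ID}/healthchecks",
    headers={"Authorization": f"Bearer {TOKEN}"},
    json=health_check,
    timeout=10,
)
response.raise_for_status()
print(response.json())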

If one of your origin servers becomes unavailable, you will receive a notification letting you know of the health change, along with detailed information about the failure so you can take action to restore your origin’s health.  

Lastly, once you’ve set up your Health Checks across the different origin servers, you may want to see trends or the top unhealthy origins. With Health Check Analytics, you’ll be able to view all the change events for a given health check, isolate origins that may be top offenders or not performing up to par, and move forward with a fix. On top of this, in the near future we are working to provide you with access to all Health Check raw events, so you have the detail needed to compare Cloudflare Health Check event logs against your internal server logs.

Users on the Pro, Business, or Enterprise plan will have access to Standalone Health Checks and Health Check Analytics to promote top-tier application reliability and help maximize brand trust with their customers. You can access Standalone Health Checks and Health Check Analytics through the Traffic app in the dashboard.

Passive Origin Monitoring

Standalone Health Checks are a super flexible way to understand what’s happening at your origin server. However, they require some forethought to configure before an outage happens. That’s why we’re excited to introduce Passive Origin Monitoring, which will automatically notify you when a problem occurs — no configuration required.

Cloudflare knows when your origin is down, because we’re the ones trying to reach it to serve traffic! When we detect downtime lasting longer than a few minutes, we’ll send you an email.

Starting today, you can configure origin monitoring alerts to go to multiple email addresses. Origin Monitoring alerts are available in the new Notification Center (more on that below!) in the Cloudflare dashboard.

Passive Origin Monitoring is available to customers on all Cloudflare plans.

Zero-Downtime Failover

What’s better than getting notified about downtime? Never having downtime in the first place! With Zero-Downtime Failover, we can automatically retry requests to origin, even before Load Balancing kicks in.

How does it work? If a request to your origin fails, and Cloudflare has another record for your origin server, we’ll just try another origin within the same HTTP request. The alternate record could be either an A/AAAA record configured via Cloudflare DNS, or another origin server in the same Load Balancing pool.

Consider a website, example.com, that has web servers at two different IP addresses: 203.0.113.1 and 203.0.113.2. Before Zero-Downtime Failover, if 203.0.113.1 becomes unavailable, Cloudflare would attempt to connect, fail, and ultimately serve an error page to the user. With Zero-Downtime Failover, if 203.0.113.1 cannot be reached, then Cloudflare’s proxy will seamlessly attempt to connect to 203.0.113.2. If the second server can respond, then Cloudflare can avert serving an error to example.com’s user.

Since we rolled out Zero-Downtime Failover a few weeks ago, we’ve prevented tens of millions of requests per day from failing!

Zero-Downtime Failover works in conjunction with Load Balancing, Standalone Health Checks, and Passive Origin Monitoring to keep your website running without a hitch. Health Checks and Load Balancing can avert failure, but take time to kick in. Zero-Downtime failover works instantly, but adds latency on each connection attempt. In practice, Zero-Downtime Failover is helpful at the start of an event, when it can instantly recover from errors; once a Health Check has detected a problem, a Load Balancer can then kick in and properly re-route traffic. And if no origin is available, we’ll send an alert via Passive Origin Monitoring.

To see an example of this in practice, consider an incident from a recent customer. They saw a spike in errors at their origin that would ordinarily cause availability to plummet (red line), but thanks to Zero-Downtime failover, their actual availability stayed flat (blue line).

During a 30 minute time period, Zero-Downtime Failover improved overall availability from 99.53% to 99.98%, and prevented 140,000 HTTP requests from resulting in an error.

It’s important to note that we only attempt to retry requests that have failed during the TCP or TLS connection phase, which ensures that HTTP headers and payload have not been transmitted yet. Thanks to this safety mechanism, we’re able to make Zero-Downtime Failover Cloudflare’s default behavior for Pro, Business, and Enterprise plans. In other words, Zero-Downtime Failover makes connections to your origins more reliable with no configuration or action required.
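
Conceptually, that retry rule behaves something like the sketch below. This is a simplified illustration of the behavior described here, not Cloudflare’s implementation: it only moves on to another origin when the connection or handshake fails, never after any request bytes have been sent.

import socket
import ssl

def fetch_with_failover(hostname, origin_ips, request_bytes, port=443):
    """Try each origin IP in turn, moving on only when the TCP connection or
    TLS handshake fails, i.e. before any part of the request has been sent."""
    context = ssl.create_default_context()
    last_error = None
    for ip in origin_ips:
        try:
            raw = socket.create_connection((ip, port), timeout=5)
            conn = context.wrap_socket(raw, server_hostname=hostname)
        except OSError as err:      # connect/handshake failure: safe to retry elsewhere
            last_error = err
            continue
        with conn:
            conn.sendall(request_bytes)   # past this point we never retry another origin
            return conn.recv(65536)
    raise ConnectionError("no reachable origin for " + hostname) from last_error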

Coming soon: more notifications, more flexibility

Our customers are always asking us for more insights into the health of their critical edge infrastructure. Health Checks and Passive Origin monitoring are a significant step towards Cloudflare taking a proactive instead of reactive approach to insights.

To support this work, today we’re announcing the Notification Center as the central place to manage notifications. This is available in the dashboard today, accessible from your Account Home.

From here, you can create new notifications, as well as view any existing notifications you’ve already set up. Today’s release allows you to configure  Passive Origin Monitoring notifications, and set multiple email recipients.

We’re excited about today’s launches and how they’ll help our customers avoid downtime. Based on your feedback, we have lots of improvements planned that can help you get the timely insights you need:

  • New notification delivery mechanisms
  • More events that can trigger notifications
  • Advanced configuration options for Health Checks, including added protocols, threshold based notifications, and threshold based status changes.
  • More ways to configure Passive Health Checks, like the ability to add thresholds, and filter to specific status codes

Introducing Load Balancing Analytics

Post Syndicated from Brian Batraski original https://blog.cloudflare.com/introducing-load-balancing-analytics/

Cloudflare aspires to make Internet properties everywhere faster, more secure, and more reliable. Load Balancing helps with speed and reliability and has been evolving over the past three years.

Let’s go through a scenario that highlights a bit more of what a Load Balancer is and the value it can provide. A standard load balancer comprises a set of pools, each of which has origin servers that are hostnames and/or IP addresses. A routing policy is assigned to each load balancer, which determines the origin pool selection process.

Let’s say you build an API that uses cloud provider ACME Web Services. Unfortunately, ACME had a rough week, and their service had a regional outage in their Eastern US region. Consequently, your website was unable to serve traffic during this period, which resulted in reduced brand trust from users and missed revenue. To prevent this from happening again, you decide to take two steps: use a secondary cloud provider (in order to avoid having ACME as a single point of failure) and use Cloudflare’s Load Balancing to take advantage of the multi-cloud architecture.

Cloudflare’s Load Balancing can help you maximize your API’s availability for your new architecture. For example, you can assign health checks to each of your origin pools. These health checks can monitor your origin servers’ health by checking HTTP status codes, response bodies, and more. If an origin pool’s response doesn’t match what is expected, then traffic will stop being steered there. This reduces downtime for your API when ACME has a regional outage, because traffic in that region will seamlessly be rerouted to your fallback origin pool(s). In this scenario, you can set the fallback pool to be origin servers in your secondary cloud provider. In addition to health checks, you can use the ‘random’ routing policy in order to distribute your customers’ API requests evenly across your backend. If you want to optimize your response time instead, you can use ‘dynamic steering’, which will send traffic to the origin determined to be closest to your customer.
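
To make those moving pieces concrete, here is a deliberately simplified model of the selection process in Python. It is a conceptual sketch only, not Cloudflare’s implementation: health checks mark origins healthy or unhealthy, the routing policy picks an origin from the first healthy pool, and traffic falls back to another pool when none qualifies.

import random
from dataclasses import dataclass, field

@dataclass
class Origin:
    address: str
    healthy: bool = True        # updated by health checks
    rtt_ms: float = 50.0        # measured round-trip time, used by dynamic steering

@dataclass
class Pool:
    name: str
    origins: list = field(default_factory=list)

    def healthy_origins(self):
        return [o for o in self.origins if o.healthy]

def pick_origin(pools, fallback_pool, policy="random"):
    """Return an origin from the first pool with healthy origins, else fall back."""
    for pool in pools + [fallback_pool]:
        candidates = pool.healthy_origins()
        if not candidates:
            continue                                      # steer traffic away
        if policy == "dynamic":                           # closest / most responsive origin
            return min(candidates, key=lambda o: o.rtt_ms)
        return random.choice(candidates)                  # 'random' steering
    raise RuntimeError("no healthy origins anywhere")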

Our customers love Cloudflare Load Balancing, and we’re always looking to improve and make our customers’ lives easier. Since Cloudflare’s Load Balancing was first released, the most popular customer request was for an analytics service that would provide insights on traffic steering decisions.

Today, we are rolling out Load Balancing Analytics in the Traffic tab of the Cloudflare  dashboard. The three major components in the analytics service are:

  • An overview of traffic flow that can be filtered by load balancer, pool, origin, and region.
  • A latency map that indicates origin health status and latency metrics from Cloudflare’s global network spanning 194 cities and growing!
  • Event logs denoting changes in origin health. This feature was released in 2018 and tracks pool and origin transitions between healthy and unhealthy states. We’ve moved these logs under the new Load Balancing Analytics subtab. See the documentation to learn more.

In this blog post, we’ll discuss the traffic flow distribution and the latency map.

Traffic Flow Overview

Our users want a detailed view into where their traffic is going, why it is going there, and insights into what changes may optimize their infrastructure. With Load Balancing Analytics, users can graphically view traffic demands on load balancers, pools, and origins over variable time ranges.

Understanding how traffic flow is distributed informs the process of creating new origin pools, adapting to peak traffic demands, and observing failover response during origin pool failures.

Figure 1

In Figure 1, we can see an overview of traffic for a given domain. On Tuesday, the 24th, the red pool was created and added to the load balancer. In the following 36 hours, as the red pool handled more traffic, the blue and green pools both saw a reduced workload. In this scenario, the traffic distribution graph provided the customer with new insights. First, it demonstrated that traffic was being steered to the new red pool. It also allowed the customer to understand the new level of traffic distribution across their network. Finally, it allowed the customer to confirm whether traffic decreased in the expected pools. Over time, these graphs can be used to better manage capacity and plan for upcoming infrastructure needs.

Latency Map

The traffic distribution overview is only one part of the puzzle. Another essential component is understanding request performance around the world. This is useful because customers can ensure user requests are handled as fast as possible, regardless of where in the world the request originates.

The standard Load Balancing configuration contains monitors that probe the health of customer origins. These monitors can be configured to run from particular regions or, for Enterprise customers, from all Cloudflare locations. They collect useful information, such as round-trip time, that can be aggregated to create the latency map.

The map provides a summary of how responsive origins are from around the world, so customers can see regions where requests are underperforming and may need further investigation. A common metric used to identify performance is request latency. We found that the p90 latency for all Load Balancing origins being monitored is 300 milliseconds, which means that 90% of all monitors’ health checks had a round trip time faster than 300 milliseconds. We used this value to identify locations where latency was slower than the p90 latency seen by other Load Balancing customers.

Figure 2

In Figure 2, we can see the responsiveness of the Northeast Asia pool. The Northeast Asia pool is slow specifically for monitors in South America, the Middle East, and Southern Africa, but fast for monitors that are probing closer to the origin pool. Unfortunately, this means users for the pool in countries like Paraguay are seeing high request latency. High page load times have many unfortunate consequences: higher visitor bounce rate, decreased visitor satisfaction rate, and a lower search engine ranking. In order to avoid these repercussions, a site administrator could consider adding a new origin pool in a region closer to underserved regions. In Figure 3, we can see the result of adding a new origin pool in Eastern North America. We see the number of locations where the domain was found to be unhealthy drops to zero and the number of slow locations cut by more than 50%.

Figure 3

Tied with the traffic flow metrics from the Overview page, the latency map arms users with insights to optimize their internal systems, reduce their costs, and increase their application availability.

GraphQL Analytics API

Behind the scenes, Load Balancing Analytics is powered by the GraphQL Analytics API. As you’ll learn later this week, GraphQL provides many benefits to us at Cloudflare. Customers now only need to learn a single API format that will allow them to extract only the data they require. For internal development, GraphQL eliminates the need for customized analytics APIs for each service, reduces query cost by increasing cache hits, and reduces developer fatigue by using a straightforward query language with standardized input and output formats. Very soon, all Load Balancing customers on paid plans will be given the opportunity to extract insights from the GraphQL API.  Let’s walk through some examples of how you can utilize the GraphQL API to understand your Load Balancing logs.

Suppose you want to understand the number of requests the pools for a load balancer are seeing from the different locations in Cloudflare’s global network. The query in Figure 4 counts the number of unique (location, pool ID) combinations every fifteen minutes over the course of a week.

Figure 4

For context, our example load balancer, lb.example.com, utilizes dynamic steering. Dynamic steering directs requests to the most responsive, available, origin pool, which is often the closest. It does so using a weighted round-trip time measurement. Let’s try to understand why all traffic from Singapore (SIN) is being steered to our pool in Northeast Asia (asia-ne). We can run the query in Figure 5. This query shows us that the asia-ne pool has an avgRttMs value of 67ms, whereas the other two pools have avgRttMs values that exceed 150ms. The lower avgRttMs value explains why traffic in Singapore is being routed to the asia-ne pool.

Figure 5

Notice how the query in Figure 4 uses the loadBalancingRequestsGroups schema, whereas the query in Figure 5 uses the loadBalancingRequests schema. loadBalancingRequestsGroups queries aggregate data over the requested query interval, whereas loadBalancingRequests provides granular information on individual requests. For those ready to get started, Cloudflare has written a helpful guide. The GraphQL website is also a great resource. We recommend you use an IDE like GraphiQL to make your queries. GraphiQL embeds the schema documentation into the IDE, autocompletes, saves your queries, and manages your custom headers, all of which help make the developer experience smoother.
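
The exact queries live in the figures above. As a rough illustration of the mechanics, a loadBalancingRequests query can be posted to the GraphQL endpoint from Python as shown below; the endpoint is Cloudflare’s public one, but the filter and field selections here are assumptions for illustration, so use GraphiQL’s schema documentation for the real names.

import requests

QUERY = """
query {
  viewer {
    zones(filter: {zoneTag: "your-zone-id"}) {
      loadBalancingRequests(limit: 10,
                            filter: {datetime_gt: "2019-12-01T00:00:00Z", coloCode: "SIN"}) {
        datetime
        pools {
          poolName
          avgRttMs
        }
      }
    }
  }
}
"""

response = requests.post(
    "https://api.cloudflare.com/client/v4/graphql",
    headers={"Authorization": "Bearer your-api-token"},
    json={"query": QUERY},
    timeout=10,
)
response.raise_for_status()
print(response.json())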

Conclusion

Now that the Load Balancing Analytics solution is live and available to all Pro, Business, and Enterprise customers, we’re excited for you to start using it! We’ve attached a survey to the Traffic overview page, and we’d love to hear your feedback.

Firewall Analytics: Now available to all paid plans

Post Syndicated from Alex Cruz Farmer original https://blog.cloudflare.com/updates-to-firewall-analytics/

Our Firewall Analytics tool enables customers to quickly identify and investigate security threats using an intuitive interface. Until now, this tool had only been available to our Enterprise customers, who have been using it to get detailed insights into their traffic and better tailor their security configurations. Today, we are excited to make Firewall Analytics available to all paid plans and share details on several recent improvements we have made.

All paid plans are now able to take advantage of these capabilities, along with several important enhancements we’ve made to improve our customers’ workflow and productivity.

Increased Data Retention and Adaptive Sampling

Previously, Enterprise customers could view 14 days of Firewall Analytics for their domains. Today we’re increasing that retention to 30 days, and again to 90 days in the coming months. Business and Professional plan zones will get 30 and 3 days of retention, respectively.

In addition to the extended retention, we are introducing adaptive sampling to guarantee that Firewall Analytics results are displayed in the Cloudflare Dashboard quickly and reliably, even when you are under a massive attack or otherwise receiving a large volume of requests.

Adaptive sampling works similarly to Netflix: when your internet connection runs low on bandwidth, you receive a slightly downscaled version of the video stream you are watching. When your bandwidth recovers, Netflix upscales back to the highest quality available.

Firewall Analytics does this sampling on each query, ensuring that customers see the best precision available in the UI given the current load on the zone. When results are sampled, the applied sampling rate is displayed alongside them.
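
As a worked example of reading sampled numbers (a general illustration of sampled analytics, not Cloudflare’s internal mechanism): at a 10% sample rate, each displayed event stands in for roughly ten real requests.

def estimate_total(sampled_events, sample_rate):
    """Scale a sampled count back up to an estimate of the true count."""
    return round(sampled_events / sample_rate)

# 12,500 events displayed at a 10% sample rate is roughly 125,000 requests overall
assert estimate_total(12_500, 0.10) == 125_000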

Event-Based Logging

As adoption of our expressive Firewall Rules engine has grown, one consistent ask we’ve heard from customers is for a more streamlined way to see all Firewall Events generated by a specific rule. Until today, if a malicious request matched multiple rules, only the last one to execute was shown in the Activity Log, requiring customers to click into the request to see if the rule they’re investigating was listed as an “Additional match”.

To streamline this process, we’ve changed how the Firewall Analytics UI interacts with the Activity Log. Customers can now filter by a specific rule (or any other criteria) and see a row for each event generated by that rule. This change also makes it easier to review all requests that would have been blocked by a rule by creating it in Log mode first before changing it to Block.

Challenge Solve Rates to help reduce False Positives

When our customers write rules to block undesired, automated traffic they want to make sure they’re not blocking or challenging desired traffic, e.g., humans wanting to make a purchase should be allowed but not bots scraping pricing.

To help customers determine what percent of CAPTCHA challenges returned to users may have been unnecessary, i.e., false positives, we are now showing the Challenge Solve Rate (CSR) for each rule. If you’re seeing rates higher than expected, e.g., for your Bot Management rules, you may want to relax the rule criteria. If the rate you see is 0% indicating that no CAPTCHAs are being solved, you may want to change the rule to Block outright rather than challenge.

Hovering over the CSR rate will reveal the number of CAPTCHAs issued vs. solved.

Exporting Firewall Events

Business and Enterprise customers can now export a set of 500 events from the Activity Log. The data exported are those events that remain after any selected filters have been applied.

Column Customization

Sometimes the columns shown in the Activity Log do not contain the details you want to see to analyze the threat. When this happens, you can now click “Edit Columns” to select the fields you want to see. For example, a customer diagnosing a bot-related issue may want to also view the User-Agent and the source country, whereas a customer investigating a DDoS attack may want to see IP addresses, ASNs, Path, and other attributes. You can now customize the columns to show exactly the details you’d like to see.

We would love to hear your feedback and suggestions, so feel free to reach out to us via our Community forums or through your Customer Success team.

If you’d like to receive more updates like this one directly to your inbox, please subscribe to our Blog!

Announcing deeper insights and new monitoring capabilities from Cloudflare Analytics

Post Syndicated from Filipp Nisenzoun original https://blog.cloudflare.com/announcing-deeper-insights-and-new-monitoring-capabilities/

This week we’re excited to announce a number of new products and features that provide deeper security and reliability insights, “proactive” analytics when there’s a problem, and more powerful ways to explore your data.

If you’ve been a user or follower of Cloudflare for a little while, you might have noticed that we take pride in turning technical challenges into easy solutions. Flip a switch or run a few API commands, and the attack you’re facing is now under control or your site is now 20% faster. However, this ease of use is even more helpful if it’s complemented by analytics. Before you make a change, you want to be sure that you understand your current situation. After the change, you want to confirm that it worked as intended, ideally as fast as possible.

Because of the front-line position of Cloudflare’s network, we can provide comprehensive metrics regarding both your traffic and the security and performance of your Internet property. And best of all, there’s nothing to set up or enable. Cloudflare Analytics is automatically available to all Cloudflare users and doesn’t rely on JavaScript trackers, meaning that our metrics include traffic from APIs and bots and are not skewed by ad blockers.

Here’s a sneak peek of the product launches. Look out for individual blog posts this week for more details on each of these announcements.

  • Product Analytics:
    • Today, we’re making Firewall Analytics available to Business and Pro plans, so that more customers understand how well Cloudflare mitigates attacks and handles malicious traffic. And we’re highlighting some new metrics, such as the rate of solved captchas (useful for Bot Management), and features, such as customizable reports to facilitate sharing and archiving attack information.
    • We’re introducing Load Balancing Analytics, which shows traffic flows by load balancer, pool, origin, and region, and helps explain why a particular origin was selected to receive traffic.
  • Monitoring:
    • We’re announcing tools to help you monitor your origin either actively or passively and automatically reroute your traffic to a different server when needed. Because Cloudflare sits between your end users and your origin, we can spot problems with your servers without the use of external monitoring services.
  • Data tools:
    • The product analytics we’ll be featuring this week use a new API behind the scenes. We’re making this API generally available, allowing you to easily build custom dashboards and explore all of your Cloudflare data the same way we do, so you can easily gain insights and identify and debug issues.
  • Account Analytics:
    • We’re releasing (in beta) a new dashboard that shows aggregated information for all of the domains under your account, allowing you to know what’s happening at a glance.

We’re excited to tell you about all of these new products in this week’s posts and would love to hear your thoughts. If you’re not already subscribing to the blog, sign up now to receive daily updates in your inbox.

Introducing the CodeSearchNet challenge

Post Syndicated from Janey Jack original https://github.blog/2019-09-26-introducing-the-codesearchnet-challenge/

Searching for code to reuse, call into, or to see how others handle a problem is one of the most common tasks in a software developer’s day. However, search engines for code are often frustrating and never fully understand what we want, unlike regular web search engines. We started using modern machine learning techniques to improve code search but quickly realized that we were unable to measure our progress. Unlike natural language processing, which has standard benchmarks such as GLUE, there is no standard dataset suitable for code search evaluation.

With our partners from Weights & Biases, today we’re announcing the CodeSearchNet Challenge evaluation environment and leaderboard. We’re also releasing a large dataset to help data scientists build models for this task, as well as several baseline models showing the current state of the art. Our leaderboard uses an annotated dataset of queries to evaluate the quality of code search tools.

Learn more from our technical report 

The CodeSearchNet Corpus and models

We collected a large dataset of functions with associated documentation written in Go, Java, JavaScript, PHP, Python, and Ruby from open source projects on GitHub. We used our TreeSitter infrastructure for this effort, and we’re also releasing our data preprocessing pipeline for others to use as a starting point in applying machine learning to code. While this data is not directly related to code search, its pairing of code with related natural language description is suitable to train models for this task. Its substantial size also makes it possible to apply high-capacity models based on modern Transformer architectures.

Our fully preprocessed CodeSearchNet Corpus is available for download on Amazon S3, including:

  • Six million methods overall
  • Two million of which have associated documentation (docstrings, JavaDoc, and more)
  • Metadata that indicates the original location (repository or line number, for example) where the data was found
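
For a sense of how you might start exploring the corpus once downloaded, here is a small Python sketch that streams one of its files. The corpus ships as gzipped JSON-lines files; the file name and field names below are assumptions for illustration, so check the dataset’s documentation for the real layout.

import gzip
import json

def load_examples(path):
    """Yield one dict per function from a gzipped JSON-lines corpus file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            yield json.loads(line)

# Hypothetical file and field names, for illustration only.
for example in load_examples("python_train_0.jsonl.gz"):
    if example.get("docstring"):               # one of the ~2M documented methods
        print(example["func_name"], "->", example["docstring"][:60])
        break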

Building on our earlier efforts in semantic code search, we’re also releasing a collection of baseline models leveraging modern techniques in learning from sequences (including a BERT-like self-attentional model) to help data scientists get started on code search. 

The CodeSearchNet Challenge

To evaluate code search models, we collected an initial set of code search queries and had programmers annotate the relevance of potential results. We started by collecting common search queries from Bing that had high click-through rates to code and combined these with queries from StaQC, yielding 99 queries for concepts related to code (i.e., we removed everything that was just an API documentation lookup).

We then used a standard Elasticsearch installation and our baseline models to obtain 10 likely results per query from our CodeSearchNet Corpus. Finally, we asked programmers, data scientists, and machine learning researchers to annotate the proposed results for relevance to the query on a scale from zero (“totally irrelevant”) to three (“exact match”). See our technical report for an in-depth explanation of the annotation process and data.

We want to expand our evaluation dataset to include more languages, queries, and annotations in the future. As we continue adding more over the next few months, we aim to include an extended dataset in the next version of the CodeSearchNet Challenge.

Other use cases

We anticipate other use cases for this dataset beyond code search and are presenting code search as one possible task that leverages learned representations of natural language and code. We’re excited to see what the community builds next.

Special thanks

The CodeSearchNet Challenge would not be possible without the Microsoft Research Team and core contributors from GitHub, including Marc Brockschmidt, Miltos Allamanis, Ho-Hsiang Wu, Hamel Husain, and Tiferet Gazit.

We’re also thankful for all of the contributors from the community who helped put this project together:

@nbardy, @raubitsj, @staceysv, @cvphelps, @tejaskannan, @s-zanella, @AntonioND, @goutham7r, @campoy, @cal58, @febuiles, @letmaik, @sebastiandziadzio, @panthap2, @CoderPat.


Learn more about the CodeSearchNet Challenge

The post Introducing the CodeSearchNet challenge appeared first on The GitHub Blog.

C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages

Post Syndicated from Kavita Ganesan original https://github.blog/2019-07-02-c-or-java-typescript-or-javascript-machine-learning-based-classification-of-programming-languages/

GitHub hosts over 300 programming languages—from commonly used languages such as Python, Java, and JavaScript to esoteric languages such as Befunge, known only to very small communities.

Figure 1: Top 10 programming languages hosted by GitHub by repository count (JavaScript leads, followed by Java and HTML)

One of the challenges GitHub faces is recognizing these different languages. When code is pushed to a repository, it's important to recognize the type of code that was added for search, security vulnerability alerting, and syntax highlighting, and to show the repository's content distribution to users.

Despite appearances, language recognition isn't a trivial task. File names and extensions, while providing a good indication of what the coding language is likely to be, do not offer the full picture. In fact, many extensions are associated with the same language (e.g., “.pl”, “.pm”, “.t”, and “.pod” are all associated with Perl), while others are ambiguous and used almost interchangeably across languages (e.g., “.h” is commonly used to indicate many languages of the “C” family, including C, C++, and Objective-C). In other cases, files are simply provided with no extension (especially for executable scripts) or with the incorrect extension (either on purpose or accidentally).

Linguist is the tool we currently use to detect coding languages at GitHub. Linguist is a Ruby-based application that uses various strategies for language detection: it leverages naming conventions and file extensions, and also takes into account Vim or Emacs modelines as well as the content at the top of the file (shebang). Linguist handles language disambiguation via heuristics and, failing that, via a Naive Bayes classifier trained on a small sample of data.

Although Linguist does a good job making file-level language predictions (84% accuracy), its performance declines considerably when files use unexpected naming conventions and, crucially, when a file extension is not provided. This renders Linguist unsuitable for content such as GitHub Gists or code snippets within READMEs, issues, and pull requests.

In order to make language detection more robust and maintainable in the long run, we developed a machine learning classifier named OctoLingua, based on an Artificial Neural Network (ANN) architecture, that can handle language predictions in tricky scenarios. The current version of the model is able to make predictions for the top 50 languages hosted by GitHub and surpasses Linguist in accuracy and performance.

The Nuts and Bolts Behind OctoLingua

OctoLingua was built from scratch using Python and Keras with a TensorFlow backend, and is designed to be accurate, robust, and easy to maintain. In this section, we describe our data sources, model architecture, and performance benchmarks for OctoLingua. We also describe what it takes to add support for a new language.

Data sources

The current version of OctoLingua was trained on files retrieved from Rosetta Code and from an internally crowdsourced set of quality repositories. We limited our language set to the top 50 languages hosted on GitHub.

Rosetta Code was an excellent starter dataset as it contained source code for the same task expressed in different programming languages. For example, the task of generating a Fibonacci sequence is expressed in C, C++, CoffeeScript, D, Java, Julia, and more. However, coverage across languages was not uniform: some languages had only a handful of files, and some files were too sparsely populated. Augmenting our training set with additional sources was therefore necessary, and it substantially improved language coverage and performance.

Our process for adding a new language is now fully automated. We programmatically collect source code from public repositories on GitHub. We choose repositories that meet minimum qualifying criteria, such as having a minimum number of forks, covering the target language, and containing specific file extensions. For this stage of data collection, we determine the primary language of a repository using the classification from Linguist.
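
A stripped-down version of that collection step might look like the sketch below, which queries the public GitHub search API for repositories in a target language above a fork threshold. The threshold, result paging, and unauthenticated requests are illustrative assumptions; the production pipeline applies additional qualifying criteria.

```python
import requests

def qualifying_repositories(language, min_forks=50, per_page=100):
    """Return full names of repositories in a language above a fork threshold.

    Uses the public GitHub search API; unauthenticated requests are
    rate limited, and the threshold here is only illustrative.
    """
    response = requests.get(
        "https://api.github.com/search/repositories",
        params={
            "q": f"language:{language} forks:>={min_forks}",
            "sort": "forks",
            "order": "desc",
            "per_page": per_page,
        },
        headers={"Accept": "application/vnd.github+json"},
        timeout=30,
    )
    response.raise_for_status()
    return [item["full_name"] for item in response.json()["items"]]

print(qualifying_repositories("ruby")[:5])
```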

Features: leveraging prior knowledge

Traditionally, for text classification problems with neural networks, memory-based architectures such as Recurrent Neural Networks (RNNs) and Long Short-Term Memory (LSTM) networks are often employed. However, given that programming languages differ in vocabulary, commenting style, file extensions, structure, library import style, and other minor respects, we opted for a simpler approach that leverages all of this information by extracting relevant features in tabular form to feed to our classifier. The features currently extracted are as follows (a simplified extraction sketch follows the list):

  1. Top five special characters per file
  2. Top 20 tokens per file
  3. File extension
  4. Presence of certain special characters commonly used in source code files such as colons, curly braces, and semicolons
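
One simplified version of that extraction might look like this; the token pattern, the special-character set, and the exact presence flags are assumptions for illustration and differ from OctoLingua's internals.

```python
import os
import re
from collections import Counter

SPECIAL_CHARACTERS = set(":;{}()[]<>#$@&|\\/*=")  # illustrative set

def extract_features(path, source):
    """Build a flat, tabular feature dictionary from a file path and its contents."""
    tokens = re.findall(r"[A-Za-z_]\w*", source)
    specials = [ch for ch in source if ch in SPECIAL_CHARACTERS]
    features = {
        "extension": os.path.splitext(path)[1].lstrip(".").lower(),
        "top_tokens": [t for t, _ in Counter(tokens).most_common(20)],
        "top_special_chars": [c for c, _ in Counter(specials).most_common(5)],
    }
    # Presence flags for characters commonly used in source code
    for ch in (":", ";", "{", "}"):
        features[f"has_{ch}"] = ch in source
    return features
```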

The Artificial Neural Network (ANN) model

We use the above features as input to a two-layer Artificial Neural Network built using Keras with a TensorFlow backend.

The diagram below shows that the feature extraction step produces an n-dimensional tabular input for our classifier. As the information moves along the layers of our network, it is regularized by dropout and ultimately produces a 51-dimensional output which represents the predicted probability that the given code is written in each of the top 50 GitHub languages plus the probability that it is not written in any of those.

Figure 2: The ANN structure of our initial model (50 languages + 1 for “other”)
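
For readers who want to picture the network, here is a minimal Keras sketch of a two-layer model with dropout and a 51-way softmax output. The layer widths, dropout rate, optimizer, and input dimensionality are illustrative assumptions rather than OctoLingua's actual configuration.

```python
from tensorflow import keras
from tensorflow.keras import layers

NUM_FEATURES = 128  # dimensionality of the tabular feature vector (assumed)
NUM_CLASSES = 51    # top 50 languages + 1 "other" class

model = keras.Sequential([
    keras.Input(shape=(NUM_FEATURES,)),
    layers.Dense(256, activation="relu"),
    layers.Dropout(0.3),  # regularization between the hidden layers
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(NUM_CLASSES, activation="softmax"),
])
model.compile(
    optimizer="adam",
    loss="categorical_crossentropy",
    metrics=["accuracy"],
)
# model.fit(x_train, y_train, epochs=8, validation_split=0.1)
```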

We used 90% of our dataset for training over approximately eight epochs. Additionally, we removed a percentage of file extensions from our training data to encourage the model to learn from the vocabulary of the files rather than overfit on the file extension feature, which is highly predictive.
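
One straightforward way to apply that kind of augmentation is to blank the extension feature for a random fraction of training rows before fitting, as in the sketch below; the fraction shown is a placeholder, not the value we used.

```python
import random

def drop_extensions(feature_rows, fraction=0.5, seed=0):
    """Return a copy of the training rows with the extension feature blanked
    for a random fraction of rows, so the model cannot lean on it alone."""
    rng = random.Random(seed)
    augmented = []
    for row in feature_rows:
        row = dict(row)
        if rng.random() < fraction:
            row["extension"] = ""
        augmented.append(row)
    return augmented
```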

Performance benchmark

OctoLingua vs. Linguist

In Figure 3, we show the F1 Score (harmonic mean between precision and recall) of OctoLingua and Linguist calculated on the same test set (10% from our initial data source). 

Here we show three tests. The first test uses the test set untouched. The second test uses the same set of test files with file extension information removed, and the third test uses the same files but with file extensions scrambled so as to confuse the classifiers (e.g., a Java file may have a “.txt” extension and a Python file may have a “.java” extension).

The intuition behind scrambling or removing the file extensions in our test set is to assess the robustness of OctoLingua in classifying files when a key feature is removed or misleading. A classifier that does not rely heavily on extensions would be extremely useful for classifying gists and snippets, since in those cases it is common for people not to provide accurate extension information (e.g., many code-related gists have a .txt extension).

The table below shows how OctoLingua maintains good performance under these conditions, suggesting that the model learns primarily from the vocabulary of the code rather than from meta information (i.e., the file extension), whereas Linguist fails as soon as the file extension information is altered.

Figure 3: Performance of OctoLingua vs. Linguist on the same test set

 

Effect of removing file extension during training time

As mentioned earlier, during training time we removed a percentage of file extensions from our training data to encourage the model to learn from the vocabulary of the files. The table below shows the performance of our model with different fractions of file extensions removed during training time. 

Figure 4: Performance of OctoLingua with different percentages of file extensions removed, on our three test variations

Notice that with no file extension removed during training time, the performance of OctoLingua on test files with no extensions and randomized extensions decreases considerably from that on the regular test data. On the other hand, when the model is trained on a dataset where some file extensions are removed, the model performance does not decline much on the modified test set. This confirms that removing the file extension from a fraction of files at training time induces our classifier to learn more from the vocabulary. It also shows that the file extension feature, while highly predictive, had a tendency to dominate and prevented more weights from being assigned to the content features. 

Supporting a new language

Adding a new language to OctoLingua is fairly straightforward. It starts with obtaining a large set of files in the new language (we can do this programmatically, as described in the data sources section). These files are split into a training and a test set and then run through our preprocessor and feature extractor. The new train and test sets are added to our existing pool of training and testing data, and the new test set allows us to verify that the accuracy of our model remains acceptable.

Figure 5: Adding a new language with OctoLingua
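
At a high level, the wiring for that step is a split-and-append, as in this sketch; the split ratio is illustrative, and `extract_features` stands in for the preprocessing and feature extraction stages described above.

```python
from sklearn.model_selection import train_test_split

def add_language(new_files, pool_train, pool_test, extract_features):
    """Split a new language's files and fold them into the existing data pools.

    `new_files` is a list of (path, source) pairs; `extract_features` is a
    stand-in for the feature extractor sketched earlier.
    """
    rows = [extract_features(path, source) for path, source in new_files]
    new_train, new_test = train_test_split(rows, test_size=0.1, random_state=0)
    return pool_train + new_train, pool_test + new_test
```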

Our plans

As of now, OctoLingua is at the “advanced prototyping stage”. Our language classification engine is already robust and reliable, but it does not yet support all coding languages on our platform. Aside from broadening language support, which would be rather straightforward, we aim to enable language detection at various levels of granularity. Our current implementation already allows us, with a small modification to our machine learning engine, to classify code snippets. It wouldn't be too far-fetched to take the model to the stage where it can reliably detect and classify embedded languages.

We are also contemplating the possibility of open sourcing our model and would love to hear from the community if you’re interested.

Summary

With OctoLingua, our goal is to provide a service that enables robust and reliable source code language detection at multiple levels of granularity, from file level or snippet level to potentially line-level language detection and classification. Eventually, this service can support, among others, code searchability, code sharing, language highlighting, and diff rendering—all of this aimed at supporting developers in their day-to-day development work in addition to helping them write quality code. If you are interested in leveraging or contributing to our work, please feel free to get in touch on Twitter @github!

The post C# or Java? TypeScript or JavaScript? Machine learning based classification of programming languages appeared first on The GitHub Blog.

Get Cloudflare insights in your preferred analytics provider

Post Syndicated from Simon Steiner original https://blog.cloudflare.com/cloudflare-partners-with-analytics-providers/


Today, we’re excited to announce our partnerships with Chronicle Security, Datadog, Elastic, Looker, Splunk, and Sumo Logic to make it easy for our customers to analyze Cloudflare logs and metrics using their analytics provider of choice. In a joint effort, we have developed pre-built dashboards that are available as a Cloudflare App in each partner’s platform. These dashboards help customers better understand events and trends from their websites and applications on our network.



Cloudflare insights in the tools you’re already using

Data analytics is a frequent theme in conversations with Cloudflare customers. Our customers want to understand how Cloudflare speeds up their websites and saves them bandwidth, see their fastest and slowest pages ranked, and be alerted if they are under attack. While providing insights is a core tenet of Cloudflare’s offering, the data analytics market has matured and many of our customers have started using third-party providers to analyze data—including Cloudflare logs and metrics. By aggregating data from multiple applications, infrastructure, and cloud platforms in one dedicated analytics platform, customers can create a single pane of glass and benefit from better end-to-end visibility over their entire stack.


While these analytics platforms provide great benefits in terms of functionality and flexibility, they can take significant time to configure: from ingesting logs, to specifying data models that make data searchable, all the way to building dashboards to get the right insights out of the raw data. We see this as an opportunity to partner with the companies our customers are already using to offer a better and more integrated solution.

Providing flexibility through easy-to-use integrations

To address these complexities of aggregating, managing, and displaying data, we have developed a number of product features and partnerships to make it easier to get insights out of Cloudflare logs and metrics. In February we announced Logpush, which allows customers to automatically push Cloudflare logs to Google Cloud Storage and Amazon S3. Both of these cloud storage solutions are supported by the major analytics providers as a source for collecting logs, making it possible to get Cloudflare logs into an analytics platform with just a few clicks. With today’s announcement of Cloudflare’s Analytics Partnerships, we’re releasing a Cloudflare App—a set of pre-built and fully customizable dashboards—in each partner’s app store or integrations catalogue to make the experience even more seamless.

By using these dashboards, customers can immediately analyze events and trends of their websites and applications without first needing to wade through individual log files and build custom searches. The dashboards feature all 55+ fields available in Cloudflare logs and include 90+ panels with information about the performance, security, and reliability of customers’ websites and applications.
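
As a rough idea of what sits behind a single dashboard panel, the sketch below pulls one gzipped NDJSON Logpush file from Amazon S3 with boto3 and tallies a response status field. The bucket, key, and field name are illustrative assumptions, not a prescribed setup.

```python
import gzip
import json
from collections import Counter

import boto3

def field_counts(bucket, key, field="EdgeResponseStatus"):
    """Count the values of one field in a gzipped NDJSON log file stored in S3.

    Bucket, key, and field name are assumptions for illustration.
    """
    body = boto3.client("s3").get_object(Bucket=bucket, Key=key)["Body"].read()
    counts = Counter()
    for line in gzip.decompress(body).splitlines():
        if line.strip():
            counts[json.loads(line).get(field)] += 1
    return counts

# print(field_counts("my-logpush-bucket", "http_requests/2019/05/22/logs.gz"))
```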


Ultimately, we want to provide flexibility to our customers and make it easier to use Cloudflare with the analytics tools they already use. Improving our customers’ ability to get better data and insights continues to be a focus for us, so we’d love to hear about what tools you’re using—tell us via this brief survey. To learn more about each of our partnerships and how to get access to the dashboards, please visit our developer documentation or contact your Customer Success Manager. Similarly, if you’re an analytics provider who is interested in partnering with us, use the contact form on our analytics partnerships page to get in touch.
