Tag Archives: support

Cloudflare Support Portal gets an overhaul

Post Syndicated from Meghan Bevill original https://blog.cloudflare.com/cloudflare-support-portal-gets-an-overhaul/

Cloudflare Support Portal gets an overhaul

Cloudflare Support Portal gets an overhaul

The Cloudflare Support team is excited to announce the launch of our brand-new Customer Support Portal. When our customers open support tickets, we understand that they want quick and accurate responses from us. For those of you who have opened a support ticket in the past, we are certain you will notice the improvements we’ve made! The new Support Portal lives where our ticket submission form has always been, dash.cloudflare.com/support, but that’s where the similarities between the old and the new one end.

What can you expect in the new portal?

The new Support Portal will help you solve your problems quickly and effectively, by getting you on the fastest path to resolution. In some cases, the most efficient way to resolve your issue will be to use our self-help resources or our machine learning-trained Support Bot. Other times, the most efficient way to resolve your issue will be by working with one of our Support Engineers via ticket, phone or chat, depending on your plan type. Regardless of how we help you solve your issue, we will have more context about the products you are using and your issue up front, reducing time-consuming back and forth.

The new portal has several features that will make it easier for you to access the support you need, including:

  • Fast and secure ticket submission for verified Cloudflare users
  • An easier-to-use interface that serves relevant resources based on your issue summary
  • Machine learning-powered Support Bot to run diagnostics and serve targeted help guides

Everyone is encouraged to begin using our new portal. Tickets submitted through our legacy form are typically solved faster than tickets emailed to us, and we expect the updates in our new form to help us resolve your issues even faster!

If you are ready to be one of the first people to take advantage of our new Support Portal, you can now opt in and begin using the new experience to access resources and submit tickets. Just hit the Support dropdown in your dashboard and click Contact Support.

Cloudflare Support Portal gets an overhaul

Below is a preview of what you can expect with the new experience.

Relevant self-help resources at your fingertips

The biggest change you’ll notice from our old ticket submission form is that we’ve made it easier to get help. First, we link you directly to relevant resources and the ticket submission form immediately upon clicking “Contact Support”. You no longer have to navigate through multiple steps to get your problem resolved. Second, we’ve moved to a full-page experience allowing us to curate a selection of support articles and help guides targeting your specific problem, making it easier for you to find answers to your questions. Of course, there will still be times when you need to submit a support ticket, but if we have resources that address your problem, we want you to be able to find that information easily.

All the details you provide when searching for articles in the portal will be captured and added to your ticket if you are not able to find the answers to your questions.

Cloudflare Support Portal gets an overhaul

Take advantage of our Support Bot

Our machine learning-powered Support Bot has been integrated into the new portal to deliver a customized experience that identifies your specific problem. Support Bot has been helping our Support Engineers work more efficiently for years, and now we’re making some of this functionality customer-facing so that you can benefit from these efficiencies as well.

Within the portal, the Support Bot will run diagnostics (if your issue is domain-related), assess the issue summary you entered, and provide you with help guides to address the root cause of your problem. The more information you are able to provide, the better our bot can direct you to the resources most pertinent to your issue. This gives you the chance to solve your issue on the spot, rather than waiting for a response to your ticket.

For each issue submitted through the portal, our Support Bot can perform one of two actions. If your issue is domain-specific, the bot will run a set of diagnostics against your domain that check for common configuration issues. If any issue is detected, the bot will display the issue and a suggested solution. Regardless of whether your issue is domain-specific, the bot will also analyze the issue summary you’ve entered against our ensemble of Natural Language Processing models and keyword searches. The bot is trained on thousands of historic customer tickets to differentiate between specific customer issues. We retrain the model on a regular basis to ensure it is consistently learning from new and emerging issues. If the bot detects keywords in your summary that map to a relevant issue, it will present a known solution for that issue.

The solutions the bot surfaces are based on how successfully these resources resolved issues previously, and we will continue to refine the bot’s responses and solutions based on a couple of key success metrics. We consider a recommendation successful if a customer doesn’t need to ultimately open a ticket or if they acknowledge that a resource was helpful by voting on the page. We will evaluate this data along with any information you provide on why specific content wasn’t helpful, and make iterative improvements to the bot every time we retrain it.

Cloudflare Support Portal gets an overhaul

Fast and secure ticket submission

While we have a ton of helpful content for a wide range of problems, we know there will be instances where you need to speak to one of our very experienced Support Engineers. For plan types that include ticket support, we have built our ticket submission flow into the portal and introduced new features to make the experience more efficient. The first step for our Support Engineers in resolving most issues is for us to verify the identity of account users and admins. The new process ensures that tickets are only submitted by verified account users and admins, reducing some back and forth and allowing us to start working on your issue right away.

Along with this verification step, the new portal will collect detailed information about your problem up front, including issue category and impact level. These details will help route your ticket to the Support Engineer most knowledgeable in the area of your issue and enable that engineer to begin work on your ticket more quickly and without having to come to you with additional questions.

How to try the new experience

To take advantage of these improvements, we encourage everyone to use the new Support Portal as the starting point for troubleshooting your issues.

Over the next few months, we will be rolling out the new portal to all plan types, starting with an opt-in period where you can pilot the new experience. Once we are satisfied the portal is working as intended, we will close the opt-in phase and release the portal to all customers. At that point, we will begin redirecting emails received at our main support email addresses (support at cloudflare.com and billing at cloudflare.com) to the Support Portal so that they can be triaged, and resolved quicker and more efficiently. We are excited to start implementing these changes and are confident that these steps are the first of many planned in making your support experience as efficient and effective as possible. We can’t wait for you to check it out!

To start using the new portal today, you can opt in from your dashboard. Let us know what you think with the feedback form included at the top of the new portal.

Using data science and machine learning for improved customer support

Post Syndicated from Junade Ali original https://blog.cloudflare.com/using-data-science-and-machine-learning-for-improved-customer-support/

Using data science and machine learning for improved customer support

In this blog post we’ll explore three tricks that can be used for data science that helped us solve real problems for our customer support group and our customers. Two for natural language processing in a customer support context and one for identifying attack Internet attack traffic.

Through these examples, we hope to demonstrate how invaluable data processing tricks, visualisations and tools can be before putting data into a machine learning algorithm. By refining data prior to processing, we are able to achieve dramatically improved results without needing to change the underlying machine learning strategies which are used.

Know the Limits (Language Classification)

When browsing a social media site, you may find the site prompts you to translate a post even though it is in your language.

We recently came across a similar problem at Cloudflare when we were looking into language classification for chat support messages. Using an off-the-shelf classification algorithm, users with short messages often had their chats classified incorrectly and our analysis found there’s a correlation between the length of a message and the accuracy of the classification (based on the browser Accept-Language header and the languages of the country where the request was submitted):

Using data science and machine learning for improved customer support

On a subset of tickets, comparing the classified language against the web browser Accept-Language header, we found there was broad agreement between these two properties. When we considered the languages associated with the user’s country, we found another signal.

Using data science and machine learning for improved customer support

In 67% of our sample, we found agreement between these three signals. In 15% of instances the classified language agreed with only the Accept-Language header and in 5% of cases there was only agreement with the languages associated with the user’s country.

Using data science and machine learning for improved customer support

We decided the ideal approach was to train a machine learning model that would take all three signals (plus the confidence rate from the language classification algorithm) and use that to make a prediction. By knowing the limits of a given classification algorithm, we were able to develop an approach that helped compliment it.

A naive approach to do the same may not even need a trained model to do so, simply requiring agreement between two of three properties (classified language, Accept-Language header and country header) helps make a decision about the right language to use.

Hold Your Fire (Fuzzy String Matching)

Fuzzy String Matching is often used in natural language processing when trying to extract information from human text. For example, this can be used for extracting error messages from customer support tickets to do automatic classification. At Cloudflare, we use this as one signal in our natural language processing pipeline for support tickets.

Engineers often use the Levenshtein distance algorithm for string matching; for example, this algorithm is implemented in the Python fuzzywuzzy library. This approach has a high computational overhead (for two strings of length k and l, the algorithm runs in O(k * l) time).

To understand the performance of different string matching algorithms in a customer support context, we compared multiple algorithms (Cosine, Dice, Damerau, LCS and Levenshtein) and measured the true positive rate (TP), false positive rate (FP) and the ratio of false positives to true positives (FP/TP).

Using data science and machine learning for improved customer support

We opted for the Cosine algorithm, not just because it outperformed the Levenshtein algorithm, but also the computational difficulty was reduced to O(k + l) time. The Cosine similarity algorithm is a very simple algorithm; it works by representing words or phrases as a vector representation in a multidimensional vector space, where each unique letter of an alphabet is a separate dimension. The smaller the angle between the two vectors, the closer the word is to another.

The mathematical definitions of each string similarity algorithm and a scientific comparison can be found in our paper: M. Pikies and J. Ali, “String similarity algorithms for a ticket classification system,” 2019 6th International Conference on Control, Decision and Information Technologies (CoDIT), Paris, France, 2019, pp. 36-41. https://doi.org/10.1109/CoDIT.2019.8820497

There were other optimisations we introduce to the fuzzy string matching approaches; the similarity threshold is determined by evaluating the True Positive and False Positive rates on various sample data. We further devised a new tokenization approach for handling phrases and numeric strings whilst using the FastText natural language processing library to determine candidate values for fuzzy string matching and to improve overall accuracy, we will share more about these optimisations in a further blog post.

Using data science and machine learning for improved customer support

“Beyond it is Another Dimension” (Threat Identification)

Attack alerting is particularly important at Cloudflare – this is useful for both monitoring the overall status of our network and providing proactive support to particular at-risk customers.

DDoS attacks can be represented in granularity by a few different features; including differences in request or error rates over a temporal baseline, the relationship between errors and request volumes and other metrics that indicate attack behaviour. One example of a metric we use to differentiate between whether a customer is under a low volume attack or they are experiencing another issue is the relationship between 499 error codes vs 5xx HTTP status codes. Cloudflare’s network edge returns a 499 status code when the client disconnects before the origin web server has an opportunity to respond, whilst 5xx status codes indicate an error handling the request. In the chart below; the x-axis measures the differential increase in 5xx errors over a baseline, whilst the y-axis represents the rate of 499 responses (each scatter represents a 15 minute interval). During a DDoS attack we notice a linear correlation between these criteria, whilst origin issues typically have an increase in one metric instead of another:

Using data science and machine learning for improved customer support

The next question is how this data can be used in more complicated situations – take the following example of identifying a credential stuffing attack in aggregate. We looked at a small number of anonymised data fields for the most prolific attackers of WordPress login portals. The data is based purely on HTTP headers, in total we saw 820 unique IPs towards 16,248 distinct zones (the IPs were hashed and requests were placed into “buckets” as they were collected). As WordPress returns a HTTP 200 when a login fails and a HTTP 302 on a successful login (redirecting to the login panel), we’re able to analyse this just from the status code returned.

On the left hand chart, the x-axis represents a normalised number of unique zones that are under attack (0 means the attacker is hitting the same site whilst 1 means the attacker is hitting all different sites) and the y-axis represents the success rate (using HTTP status codes, identifying the chance of a successful login). The right hand side chart switches the x-axis out for something called the “variety ratio” – this measures the rate of abnormal 4xx/5xx HTTP status codes (i.e. firewall blocks, rate limiting HTTP headers or 5xx status codes). We see clear clusters on both charts:

Using data science and machine learning for improved customer support

However, by plotting this chart in three dimensions with all three fields represented – clusters appear. These clusters are then grouped using an unsupervised clustering algorithm (agglomerative hierarchical clustering):

Using data science and machine learning for improved customer support

Cluster 1 has 99.45% of requests from the same country and 99.45% from the same User-Agent. This tactic, however, has advantages when looking at other clusters – for example, Cluster 0 had 89% of requests coming from three User-Agents (75%, 12.3% and 1.7%, respectively). By using this approach we are able to correlate such attacks together even when they would be hard to identify on a request-to-request basis (as they are being made from different IPs and with different request headers). Such strategies allow us to fingerprint attacks regardless of whether attackers are continuously changing how they make these requests to us.

By aggregating data together then representing the data in multiple dimensions, we are able to gain visibility into the data that would ordinarily not be possible on a request-to-request basis. In product level functionality, it is often important to make decisions on a signal-to-signal basis (“should this request be challenged whilst this one is allowed?”) but by looking at the data in aggregate we are able to focus  on the interesting clusters and provide alerting systems which identify anomalies. Performing this in multiple dimensions provides the tools to reduce false positives dramatically.

Conclusion

From natural language processing to intelligent threat fingerprinting, using data science techniques has improved our ability to build new functionality. Recently, new machine learning approaches and strategies have been designed to process this data more efficiently and effectively; however, preprocessing of data remains a vital tool for doing this. When seeking to optimise data processing pipelines, it often helps to look not just at the tools being used, but also the input and structure of the data you seek to process.

If you’re interested in using data science techniques to identify threats on a large scale network, we’re hiring for Support Engineers (including Security Operations, Technical Support and Support Operations Engineering) in San Francisco, Austin, Champaign, London, Lisbon, Munich and Singapore.

Time-Based One-Time Passwords for Phone Support

Post Syndicated from Junade Ali original https://blog.cloudflare.com/time-based-one-time-passwords-for-phone-support/

Time-Based One-Time Passwords for Phone Support

Time-Based One-Time Passwords for Phone Support

As part of Cloudflare’s support offering, we provide phone support to Enterprise customers who are experiencing critical business issues.

For account security, specific account settings and sensitive details are not discussed via phone. From today, we are providing Enterprise customers with the ability to configure phone authentication to allow for greater support to be offered over the phone without need to perform validation through support tickets.

After providing your email address to a Cloudflare Support representative, you can now provide a token generated from the Cloudflare dashboard or via a 2FA app like Google Authenticator. So, a customer is able to prove over the phone that they are who they say they are.

Configuring Phone Authentication

If you are an existing Enterprise customer interested in phone support, please contact your Customer Success Manager for eligibility information and set-up. If you are interested in our Enterprise offering, please get in contact via our Enterprise plan page.

If you already have phone support eligibility, you can generate single-use tokens from the Cloudflare dashboard or configure an authenticator app to do the same remotely.

On the support page, you will see a card called “Emergency Phone Support Hotline – Authentication”. From here you can generate a Single-Use Token for authenticating a single call or configure an Authenticator App to generate tokens from a 2FA app.

Time-Based One-Time Passwords for Phone Support

For more detailed instructions, please see the “Emergency Phone” section of the Contacting Cloudflare Support article on the Cloudflare Knowledge Base.

How it Works

A standardised approach for generating TOTPs (Time-Based One-Time Passwords) is described in RFC 6238 – this is the approach that is often used for setting up Two Factor Authentication on websites.

When configuring a TOTP authenticator app, you are usually asked to scan a QR code or input a long alphanumeric string. This is a randomly generated secret that is shared between your local authenticator app and the web service where you are configuring TOTP. After TOTP is configured, this is stored between both the web server and your local device.

TOTP password generation relies on two key inputs; the shared secret and the number of seconds since the Unix epoch (Unix time). The timestamp is integer divided by a validity period (often 30 seconds) and this value is put into a cryptographic hash function alongside the secret to generate an output. The hexadecimal output is then truncated to provide the decimal digits which are shown to the user. The Avalanche Effect means that whenever the inputs that go into the hash function change slightly (e.g. the timestamp increments), a completely different hash output is generated.

This approach is fairly widely used and is available in a number of libraries depending on your preferred programming language. However, as our phone validation functionality offers both authenticator app support and generation of a single-use token from the dashboard (where no shared secret exists) – some deviation was required.

We generate a single use token by creating a hash of an internal user ID combined with a Cloudflare-internal secret, which in turn is used to generate RFC 6238 compliant time-based one-time passwords. Similarly, this service can generate random passwords for any user without needing to store additional secrets. This is then surfaced to the user every 30 seconds via a JavaScript request without exposing the secret used to generate the token.

Time-Based One-Time Passwords for Phone Support

One question you may be asking yourself after all of this is why don’t we simply use the 2FA mechanism which users use to login for phone validation too? Firstly, we don’t want to accustom users to providing their 2FA tokens to anyone else (they should purely be used for logging in). Secondly, as you may have noticed – we recently began supporting WebAuthn keys for logging in, as these are physical tokens used for website authentication they aren’t suited to usage on a mobile device.

To improve user experience during a phone call, we also validate tokens in the previous time step in the event it has expired by the time the user has read it out (indeed, RFC 6238 provides that “at most one time step is allowed as the network delay”). This means a token can be valid for up to one minute.

The APIs powering this service are then wrapped with API gateways that offer audit logging both for customer actions and actions completed by staff members. This provides a clear audit trail for customer authentication.

Future Work

Authentication is a critical component to securing customer support interactions. Authentication tooling must develop alongside support contact channels; from web forms behind logins to using JWT tokens for validating live chat sessions and now TOTP phone authentication. This is complimented by technical support engineers who will manage risk by routing certain issues into traditional support tickets and being able to refer some cases to named customer success managers for approval.

We are constantly advancing our support experience; for example, we plan to further improve our Enterprise Phone Support by giving users the ability to request a callback from a support agent within our dashboard. As always, right here on our blog we’ll keep you up-to-date with improvements in our service.

Project Crossbow: Lessons from Refactoring a Large-Scale Internal Tool

Post Syndicated from Junade Ali original https://blog.cloudflare.com/project-crossbow-lessons-from-refactoring-a-large-scale-internal-tool/

Project Crossbow: Lessons from Refactoring a Large-Scale Internal Tool

Project Crossbow: Lessons from Refactoring a Large-Scale Internal Tool

Cloudflare’s global network currently spans 200 cities in more than 90 countries. Engineers working in product, technical support and operations often need to be able to debug network issues from particular locations or individual servers.

Crossbow is the internal tool for doing just this; allowing Cloudflare’s Technical Support Engineers to perform diagnostic activities from running commands (like traceroutes, cURL requests and DNS queries) to debugging product features and performance using bespoke tools.

In September last year, an Engineering Manager at Cloudflare asked to transition Crossbow from a Product Engineering team to the Support Operations team. The tool had been a secondary focus and had been transitioned through multiple engineering teams without developing subject matter knowledge.

The Support Operations team at Cloudflare is closely aligned with Cloudflare’s Technical Support Engineers; developing diagnostic tooling and Natural Language Processing technology to drive efficiency. Based on this alignment, it was decided that Support Operations was the best team to own this tool.

Learning from Sisyphus

Whilst seeking advice on the transition process, an SRE Engineering Manager in Cloudflare suggested reading: “A Case Study in Community-Driven Software Adoption”. This book proved a truly invaluable read for anyone thinking of doing internal tool development or contributing to such tooling. The book describes why multiple tools are often created for the same purpose by different autonomous teams and how this issue can be overcome. The book also describes challenges and approaches to gaining adoption of tooling, especially where this requires some behaviour change for engineers who use such tools.

That said, there are some things we learnt along the way of taking over Crossbow and performing a refactor and revamp of a large-scale internal tool. This blog post seeks to be an addendum to such guidance and provide some further practical advice.

In this blog post we won’t dwell too much on the work of the Cloudflare Support Operations team, but this can be found in the SRECon talk: “Support Operations Engineering: Scaling Developer Products to the Millions”. The software development methodology used in Cloudflare’s Support Operations Group closely resembles Extreme Programming.

Cutting The Fat

There were two ways of using Crossbow, a CLI (command line interface) and UI in Cloudflare’s internal tool for Cloudflare’s Technical Support Engineers. Maintaining both interfaces clearly had significant overhead for improvement efforts, and we took the decision to deprecate one of the interfaces. This allowed us to focus our efforts on one platform to achieve large-scale improvements across technology, usability and functionality.

We set-up a poll to allow engineering, operations, solutions engineering and technical support teams to provide their feedback on how they used the tooling. Polling was not only critical for gaining vital information to how different teams used the tool, but also ensured that prior to deprecation that people knew their views were taken onboard. We polled not only on the option people preferred, but which options they felt were necessary to them and the reasons as to why.

We found that the reasons for favouring the web UI primarily revolved around the absence of documentation and training. Instead, we discovered those who used the CLI found it far more critical for their workflow. Product Engineering teams do not routinely have access to the support UI but some found it necessary to use Crossbow for their jobs and users wanted to be able to automate commands with shell scripts.

Technically, the UI was in JavaScript with an API Gateway service that converted HTTP requests to gRPC alongside some configuration to allow it to work in the support UI. The CLI directly interfaced with the gRPC API so it was a simpler system. Given the Cloudflare Support Operations team primarily works on Systems Engineering projects and had limited UI resources, the decision to deprecate the UI was also in our own interest.

We rolled out a new internal Crossbow user group, trained up teams and created new documentation, provided advance notification of deprecation and abrogated the source code of these services. We also dramatically improved the user experience when using the CLI for users through simple improvements to the help information and easier CLI usage.

Rearchitecting Pub/Sub with Cloudflare Access

One of the primary challenges we encountered was how the system architecture for Crossbow was designed many years ago. A gRPC API ran commands at Cloudflare’s edge network using a configuration management tool which the SRE team expressed a desire to deprecate (with Crossbow being the last user of it).

During a visit to the Singapore Office, the Edge SRE Engineering Manager locally wanted his team to understand Crossbow and how to contribute. During this meeting, we provided an overview of the current architecture and the team there were forthcoming in providing potential refactoring ideas to handle global network stability and move away from the old pipeline. This provided invaluable insight into the common issues experienced between technical approaches and instances of where the tool would fail requiring Technical Support Engineers to consult the SRE team.

We decided to adopt a more simple pub/sub pipeline, instead the edge network would expose a gRPC daemon that would listen for new jobs and execute them and then make a callback to the API service with the results (which would be relayed onto the client).

For authentication between the API service and the client or the API service and the network edge, we implemented a JWT authentication scheme. For a CLI user, the authentication was done by querying an HTTP endpoint behind Cloudflare Access using cloudflared, which provided a JWT the client could use for authentication with gRPC. In practice, this looks something like this:

  1. CLI makes request to authentication server using cloudflared
  2. Authentication server responds with signed JWT token
  3. CLI makes gRPC request with JWT authentication token to API service
  4. API service validates token using a public key

The gRPC API endpoint was placed on Cloudflare Spectrum; as users were authenticated using Cloudflare Access, we could remove the requirement for users to be on the company VPN to use the tool. The new authentication pipeline, combined with a single user interface, also allowed us to improve the collection of metrics and usage logs of the tool.

Risk Management

Risk is inherent in the activities undertaken by engineering professionals, meaning that members of the profession have a significant role to play in managing and limiting it.

As with all engineering projects, it was critical to manage risk. However, the risk to manage is different for different engineering projects. Availability wasn’t the largest factor, given that Technical Support Engineers could escalate issues to the SRE team if the tool wasn’t available. The main risk was security of the Cloudflare network and ensuring Crossbow did not affect the availability of any other services. To this end we took methodical steps to improve isolation and engaged the InfoSec team early to assist with specification and code reviews of the new pipeline. Where a risk to availability existed, we ensured this was properly communicated to the support team and the internal Crossbow user group to communicate the risk/reward that existed.

Feedback, Build, Refactor, Measure

The Support Operations team at Cloudflare works using a methodology based on Extreme Programming. A key tenant of Extreme Programming is that of Test Driven Development, this is often described as a “red-green-green” pattern or “red-green-refactor”. First the engineer enshrines the requirements in tests, then they make those tests pass and then refactor to improve code quality before pushing the software.

As we took on this project, the Cloudflare Support and SRE teams were working on Project Baton – an effort to allow Technical Support Engineers to handle more customer escalations without handover to the SRE teams.

As part of this effort, they had already created an invaluable resource in the form of a feature wish list for Crossbow. We associated JIRAs with all these items and prioritised this work to deliver such feature requests using a Test Driven Development workflow and the introduction of Continuous Integration. Critically we measured such improvements once deployed. Adding simple functionality like support for MTR (a Linux network diagnostic tool) and exposing support for different cURL flags provided improvements in usage.

We were also able to embed Crossbow support for other tools available at the network edge created by other teams, allowing them to maintain such tools and expose features to Crossbow users. Through the creation of an improved development environment and documentation, we were able to drive Product Engineering teams to contribute functionality that was in the mutual interest of them and the customer support team.

Finally, we owned a number of tools which were used by Technical Support Engineers to discover what Cloudflare configuration was applied to a given URL and performing distributed performance testing, we deprecated these tools and rolled them into Crossbow. Another tool owned by the Cloudflare Workers team, called Edge Worker Debug was rolled into Crossbow and the team deprecated their tool.

Results

From implementing user analytics on the tool on the 16 December 2019 to the week ending the 22 January 2020, we found a found usage increase of 4.5x. This growth primarily happened within a 4 week period; by adding the most wanted functionality, we were able to achieve a critical saturation of usage amongst Technical Support Engineers.

Project Crossbow: Lessons from Refactoring a Large-Scale Internal Tool

Beyond this point, it became critical to use the number of checks being run as a metric to evaluate how useful the tool was. For example, only the week starting January 27 saw no meaningful increase in unique users (a 14% usage increase over the previous week – within the normal fluctuation of stable usage). However, over the same timeframe, we saw a 2.6x increase in the number of tests being run – coinciding with introduction of a number of new high-usage functionalities.

Project Crossbow: Lessons from Refactoring a Large-Scale Internal Tool

Conclusion

Through removing low-value/high-maintenance functionality and merciless refactoring, we were dramatically able to improve the quality of Crossbow and therefore improve the velocity of delivery. We were able to dramatically improve usage through enabling functionality to measure usage, receive feature requests in feedback loops with users and test-driven development. Consolidation of tooling reduced overhead of developing support tooling across the business, providing a common framework for developing and exposing functionality for Technical Support Engineers.

There are two key counterintuitive learnings from this project. The first is that cutting functionality can drive usage, providing this is done intelligently. In our case, the web UI contained no additional functionality that wasn’t in the CLI, yet caused substantial engineering overhead for maintenance. By deprecating this functionality, we were able to reduce technical debt and thereby improve the velocity of delivering more important functionality. This effort requires effective communication of the decision making process and involvement from those who are impacted by such a decision.

Secondly, tool development efforts are often focussed by user feedback but lack a means of objectively measuring such improvements. When logging is added, it is often done purely for security and audit logging purposes. Whilst feedback loops with users are invaluable, it is critical to have an objective measure of how successful such a feature is and how it is used. Effective measurement drives the decision making process of future tooling and therefore, in the long run, the usage data can be more important than the original feature itself.

If you’re interested in debugging interesting technical problems on a network with these tools, we’re hiring for Support Engineers (including Security Operations, Technical Support and Support Operations Engineering) in San Francisco, Austin, Champaign, London, Lisbon, Munich and Singapore.