All posts by Jesse Kipp

Workers AI Update: Hello Mistral 7B

2023-11-21 Jesse Kipp

Post Syndicated from Jesse Kipp original http://blog.cloudflare.com/workers-ai-update-hello-mistral-7b/

Today we’re excited to announce that we’ve added the Mistral-7B-v0.1-instruct to Workers AI. Mistral 7B is a 7.3 billion parameter language model with a number of unique advantages. With some help from the founders of Mistral AI, we’ll look at some of the highlights of the Mistral 7B model, and use the opportunity to dive deeper into “attention” and its variations such as multi-query attention and grouped-query attention.

Mistral 7B tl;dr:

Mistral 7B is a 7.3 billion parameter model that puts up impressive numbers on benchmarks. The model:

Outperforms Llama 2 13B on all benchmarks
Outperforms Llama 1 34B on many benchmarks,
Approaches CodeLlama 7B performance on code, while remaining good at English tasks, and
The chat fine-tuned version we’ve deployed outperforms Llama 2 13B chat in the benchmarks provided by Mistral.

Here’s an example of using streaming with the REST API:

curl -X POST \
“https://api.cloudflare.com/client/v4/accounts/{account-id}/ai/run/@cf/mistral/mistral-7b-instruct-v0.1” \
-H “Authorization: Bearer {api-token}” \
-H “Content-Type:application/json” \
-d '{ “prompt”: “What is grouped query attention”, “stream”: true }'

API Response: { response: “Grouped query attention is a technique used in natural language processing  (NLP) and machine learning to improve the performance of models…” }

And here’s an example using a Worker script:

import { Ai } from ‘@cloudflare/ai’;
export default {
    async fetch(request, env) {
        const ai = new Ai(env.AI);
        const stream = await ai.run(‘@cf/mistral/mistral-7b-instruct-v0.1’, {
            prompt: ‘What is grouped query attention’,
            stream: true
        });
        return Response.json(stream, { headers: { “content-type”: “text/event-stream” } });
    }
}

Mistral takes advantage of grouped-query attention for faster inference. This recently-developed technique improves the speed of inference without compromising output quality. For 7 billion parameter models, we can generate close to 4x as many tokens per second with Mistral as we can with Llama, thanks to Grouped-Query attention.

You don’t need any information beyond this to start using Mistral-7B, you can test it out today ai.cloudflare.com. To learn more about attention and Grouped-Query attention, read on!

So what is “attention” anyway?

The basic mechanism of attention, specifically “Scaled Dot-Product Attention” as introduced in the landmark paper Attention Is All You Need, is fairly simple:

We call our particular attention “Scale Dot-Product Attention”. The input consists of query and keys of dimension d_k, and values of dimension d_v. We compute the dot products of the query with all the keys, divide each by sqrt(d_k) and apply a softmax function to obtain the weights on the values.

More concretely, this looks like this:

In simpler terms, this allows models to focus on important parts of the input. Imagine you are reading a sentence and trying to understand it. Scaled dot product attention enables you to pay more attention to certain words based on their relevance. It works by calculating the similarity between each word (K) in the sentence and a query (Q). Then, it scales the similarity scores by dividing them by the square root of the dimension of the query. This scaling helps to avoid very small or very large values. Finally, using these scaled similarity scores, we can determine how much attention or importance each word should receive. This attention mechanism helps models identify crucial information (V) and improve their understanding and translation capabilities.

Easy, right? To get from this simple mechanism to an AI that can write a “Seinfeld episode in which Jerry learns the bubble sort algorithm,” we’ll need to make it more complex. In fact, everything we’ve just covered doesn’t even have any learned parameters — constant values learned during model training that customize the output of the attention block!
Attention blocks in the style of Attention is All You Need add mainly three types of complexity:

Learned parameters

Learned parameters refer to values or weights that are adjusted during the training process of a model to improve its performance. These parameters are used to control the flow of information or attention within the model, allowing it to focus on the most relevant parts of the input data. In simpler terms, learned parameters are like adjustable knobs on a machine that can be turned to optimize its operation.

Vertical stacking – layered attention blocks

Vertical layered stacking is a way to stack multiple attention mechanisms on top of each other, with each layer building on the output of the previous layer. This allows the model to focus on different parts of the input data at different levels of abstraction, which can lead to better performance on certain tasks.

Horizontal stacking – aka Multi-Head Attention

The figure from the paper displays the full multi-head attention module. Multiple attention operations are carried out in parallel, with the Q-K-V input for each generated by a unique linear projection of the same input data (defined by a unique set of learned parameters). These parallel attention blocks are referred to as “attention heads”. The weighted-sum outputs of all attention heads are concatenated into a single vector and passed through another parameterized linear transformation to get the final output.

This mechanism allows a model to focus on different parts of the input data concurrently. Imagine you are trying to understand a complex piece of information, like a sentence or a paragraph. In order to understand it, you need to pay attention to different parts of it at the same time. For example, you might need to pay attention to the subject of the sentence, the verb, and the object, all simultaneously, in order to understand the meaning of the sentence. Multi-headed attention works similarly. It allows a model to pay attention to different parts of the input data at the same time, by using multiple “heads” of attention. Each head of attention focuses on a different aspect of the input data, and the outputs of all the heads are combined to produce the final output of the model.

Styles of attention

There are three common arrangements of attention blocks used by large language models developed in recent years: multi-head attention, grouped-query attention and multi-query attention. They differ in the number of K and V vectors relative to the number of query vectors. Multi-head attention uses the same number of K and V vectors as Q vectors, denoted by “N” in the table below. Multi-query attention uses only a single K and V vector. Grouped-query attention, the type used in the Mistral 7B model, divides the Q vectors evenly into groups containing “G” vectors each, then uses a single K and V vector for each group for a total of N divided by G sets of K and V vectors. This summarizes the differences, and we’ll dive into the implications of these below.

	Number of Key/Value Blocks	Quality	Memory Usage
Multi-head attention (MHA)	N	Best	Most
Grouped-query attention (GQA)	N / G	Better	Less
Multi-query attention (MQA)	1	Good	Least

Summary of attention styles

And this diagram helps illustrate the difference between the three styles:

Multi-Query Attention

Multi-query attention was described in 2019 in the paper from Google: Fast Transformer Decoding: One Write-Head is All You Need. The idea is that instead of creating separate K and V entries for every Q vector in the attention mechanism, as in multi-head attention above, only a single K and V vector is used for the entire set of Q vectors. Thus the name, multiple queries combined into a single attention mechanism. In the paper, this was benchmarked on a translation task and showed performance equal to multi-head attention on the benchmark task.

Originally the idea was to reduce the total size of memory that is accessed when performing inference for the model. Since then, as generalized models have emerged and grown in number of parameters, the GPU memory needed is often the bottleneck which is the strength of multi-query attention, as it requires the least accelerator memory of the three types of attention. However, as models grew in size and generality, performance of multi-query attention fell relative to multi-head attention.

Grouped-Query Attention

The newest of the bunch — and the one used by Mistral — is grouped-query attention, as described in the paper GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints that was published on arxiv.org in May 2023. Grouped-query attention combines the best of both worlds: the quality of multi-headed attention with the speed and low memory usage of multi-query attention. Instead of either a single set of K and V vectors or one set for every Q vector, a fixed ratio of 1 set of K and V vectors for every Q vector is used, reducing memory usage but retaining high performance on many tasks.

Often choosing a model for a production task is not just about picking the best model available because we must consider tradeoffs between performance, memory usage, batch size, and available hardware (or cloud costs). Understanding these three styles of attention can help guide those decisions and understand when we might choose a particular model given our circumstances.

Enter Mistral — try it today

Being one of the first large language models to leverage grouped-query attention and combining it with sliding window attention, Mistral seems to have hit the goldilocks zone — it’s low latency, high-throughput, and it performs really well on benchmarks even when compared to bigger models (13B). All this to say is that it packs a punch for its size, and we couldn’t be more excited to make it available to all developers today, via Workers AI.

Head over to our developer docs to get started, and if you need help, want to give feedback, or want to share what you’re building just pop into our Developer Discord!

The Workers AI team is also expanding and hiring; check our jobs page for open roles if you’re passionate about AI engineering and want to help us build and evolve our global, serverless GPU-powered inference platform.

Streaming and longer context lengths for LLMs on Workers AI

2023-11-14 Jesse Kipp

Post Syndicated from Jesse Kipp original http://blog.cloudflare.com/workers-ai-streaming/

Streaming LLMs and longer context lengths available in Workers AI

Workers AI is our serverless GPU-powered inference platform running on top of Cloudflare’s global network. It provides a growing catalog of off-the-shelf models that run seamlessly with Workers and enable developers to build powerful and scalable AI applications in minutes. We’ve already seen developers doing amazing things with Workers AI, and we can’t wait to see what they do as we continue to expand the platform. To that end, today we’re excited to announce some of our most-requested new features: streaming responses for all Large Language Models (LLMs) on Workers AI, larger context and sequence windows, and a full-precision Llama-2 model variant.

If you’ve used ChatGPT before, then you’re familiar with the benefits of response streaming, where responses flow in token by token. LLMs work internally by generating responses sequentially using a process of repeated inference — the full output of a LLM model is essentially a sequence of hundreds or thousands of individual prediction tasks. For this reason, while it only takes a few milliseconds to generate a single token, generating the full response takes longer, on the order of seconds. The good news is we can start displaying the response as soon as the first tokens are generated, and append each additional token until the response is complete. This yields a much better experience for the end user — displaying text incrementally as it’s generated not only provides instant responsiveness, but also gives the end-user time to read and interpret the text.

As of today, you can now use response streaming for any LLM model in our catalog, including the very popular Llama-2 model. Here’s how it works.

Server-sent events: a little gem in the browser API

Server-sent events are easy to use, simple to implement on the server side, standardized, and broadly available across many platforms natively or as a polyfill. Server-sent events fill a niche of handling a stream of updates from the server, removing the need for the boilerplate code that would otherwise be necessary to handle the event stream.

	Easy-to-use	Streaming	Bidirectional
fetch	✅
Server-sent events	✅	✅
Websockets		✅	✅

^{Comparing fetch, server-sent events, and websockets}

To get started using streaming on Workers AI’s text generation models with server-sent events, set the “stream” parameter to true in the input of request. This will change the response format and mime-type to text/event-stream.

Here’s an example of using streaming with the REST API:

curl -X POST \
"https://api.cloudflare.com/client/v4/accounts/<account>/ai/run/@cf/meta/llama-2-7b-chat-int8" \
-H "Authorization: Bearer <token>" \
-H "Content-Type:application/json" \
-d '{ "prompt": "where is new york?", "stream": true }'

data: {"response":"New"}

data: {"response":" York"}

data: {"response":" is"}

data: {"response":" located"}

data: {"response":" in"}

data: {"response":" the"}

...

data: [DONE]

And here’s an example using a Worker script:

import { Ai } from "@cloudflare/ai";
export default {
    async fetch(request, env, ctx) {
        const ai = new Ai(env.AI, { sessionOptions: { ctx: ctx } });
        const stream = await ai.run(
            "@cf/meta/llama-2-7b-chat-int8",
            { prompt: "where is new york?", stream: true  }
        );
        return new Response(stream,
            { headers: { "content-type": "text/event-stream" } }
        );
    }
}

If you want to consume the output event-stream from this Worker in a browser page, the client-side JavaScript is something like:

const source = new EventSource("/worker-endpoint");
source.onmessage = (event) => {
    if(event.data=="[DONE]") {
        // SSE spec says the connection is restarted
        // if we don't explicitly close it
        source.close();
        return;
    }
    const data = JSON.parse(event.data);
    el.innerHTML += data.response;
}

You can use this simple code with any simple HTML page, complex SPAs using React or other Web frameworks.

This creates a much more interactive experience for the user, who now sees the page update as the response is incrementally created, instead of waiting with a spinner until the entire response sequence has been generated. Try it out streaming on ai.cloudflare.com.

Workers AI supports streaming text responses for the Llama-2 model and any future LLM models we are adding to our catalog.

But this is not all.

Higher precision, longer context and sequence lengths

Another top request we heard from our community after the launch of Workers AI was for longer questions and answers in our Llama-2 model. In LLM terminology, this translates to higher context length (the number of tokens the model takes as input before making the prediction) and higher sequence length (the number of tokens the model generates in the response.)

We’re listening, and in conjunction with streaming, today we are adding a higher 16-bit full-precision Llama-2 variant to the catalog, and increasing the context and sequence lengths for the existing 8-bit version.

Model	Context length (in)	Sequence length (out)
@cf/meta/llama-2-7b-chat-int8	2048 (768 before)	1800 (256 before)
@cf/meta/llama-2-7b-chat-fp16	3072	2500

Streaming, higher precision, and longer context and sequence lengths provide a better user experience and enable new, richer applications using large language models in Workers AI.

Check the Workers AI developer documentation for more information and options. If you have any questions or feedback about Workers AI, please come see us in the Cloudflare Community and the Cloudflare Discord.
If you are interested in machine learning and serverless AI, the Cloudflare Workers AI team is building a global-scale platform and tools that enable our customers to run fast, low-latency inference tasks on top of our network. Check our jobs page for opportunities.

Using the power of Cloudflare’s global network to detect malicious domains using machine learning

2023-03-15 Jesse Kipp

Post Syndicated from Jesse Kipp original https://blog.cloudflare.com/threat-detection-machine-learning-models/

Using the power of Cloudflare’s global network to detect malicious domains using machine learning

Cloudflare secures outbound Internet traffic for thousands of organizations every day, protecting users, devices, and data from threats like ransomware and phishing. One way we do this is by intelligently classifying what Internet destinations are risky using the domain name system (DNS). DNS is essential to Internet navigation because it enables users to look up addresses using human-friendly names, like cloudflare.com. For websites, this means translating a domain name into the IP address of the server that can deliver the content for that site.

However, attackers can exploit the DNS system itself, and often use techniques to evade detection and control using domain names that look like random strings. In this blog, we will discuss two techniques threat actors use – DNS tunneling and domain generation algorithms – and explain how Cloudflare uses machine learning to detect them.

Domain Generation Algorithm (DGA)

Most websites don’t change their domain name very often. This is the point after all, having a stable human-friendly name to be able to connect to a resource on the Internet. However, as a side-effect stable domain names become a point of control, allowing network administrators to use restrictions on domain names to enforce policies, for example blocking access to malicious websites. Cloudflare Gateway – our secure web gateway service for threat defense – makes this easy to do by allowing administrators to block risky and suspicious domains based on integrated threat intelligence.

But what if instead of using a stable domain name, an attacker targeting your users generated random domain names to communicate with, making it more difficult to know in advance what domains to block? This is the idea of Domain Generation Algorithm domains (MITRE ATT&CK technique T1568.002).

After initial installation, malware reaches out to a command-and-control server to receive further instructions, this is called “command and control” (MITRE ATT&CK tactic TA0011). The attacker may send instructions to perform such actions as gathering and transmitting information about the infected device, downloading additional stages of malware, stealing credentials and private data and sending it to the server, or operating as a bot within a network to perform denial-of-service attacks. Using a domain generation algorithm to frequently generate random domain names to communicate with for command and control gives malware a way to bypass blocks on fixed domains or IP addresses. Each day the malware generates a random set of domain names. To rendezvous with the malware, the attacker registers one of these domain names and awaits communication from the infected device.

Speed in identifying these domains is important to disrupting an attack. Because the domains rotate each day, by the time the malicious disposition of a domain propagates through the cybersecurity community, the malware may have rotated to a new domain name. However, the random nature of these domain names (they are literally a random string of letters!) also gives us an opportunity to detect them using machine learning.

The machine learning model

To identify DGA domains, we trained a model that extends a pre-trained transformers-based neural network. Transformers-based neural networks are the state-of-the-art technique in natural language processing, and underlie large language models and services like ChatGPT. They are trained by using adjacent words and context around a word or character to “learn” what is likely to come next.

Domain names largely contain words and abbreviations that are meaningful in human language. Looking at the top domains on Cloudflare Radar, we see that they are largely composed of words and common abbreviations, “face” and “book” for example, or “cloud” and “flare”. This makes the knowledge of human language encoded in transformer models a powerful tool for detecting random domain names.

For DGA models, we curated ground truth data that consisted of domain names observed from Cloudflare’s 1.1.1.1 DNS resolver for the negative class, and we used domain names from known domain generation algorithms for the positive class (all uses of DNS resolver data is completed in accordance with our privacy commitments).

Our final training set contained over 250,000 domain names, and was weighted to include more negative (not DGA domains) than positive cases. We trained three different versions of the model with different architectures: LSTM (Long Short-Term Memory Neural Network), LightGBM (binary classification), and a transformer-based model. We selected the transformer based model based on it having the highest accuracy and F1 score (the F1 score is a measure of model fit that penalizes having very different precision and recall, on an imbalanced data set the highest accuracy model might be the one that predicts everything either true or false, not what we want!), with an accuracy of over 99% on the test data.

To compute the score for a new domain never seen before by the model, the domain name is tokenized (i.e. broken up into individual components, in this case characters), and the sequence of characters are passed to the model. The transformers Python package from Hugging Face makes it easy to use these types of models for a variety of applications. The library supports summarization, question answering, translation, text generation, classification, and more. In this case we use sequence classification, together with a model that was customized for this task. The output of the model is a score indicating the chance that the domain was generated by a domain generation algorithm. If the score is over our threshold, we label the domain and a domain generation algorithm domain.

Deployment

The expansive view of domain names Cloudflare has from our 1.1.1.1 resolver means we can quickly observe DGA domains after they become active. We process all DNS query names that successfully resolve using this model, so a single successful resolution of the domain name anywhere in Cloudflare’s public resolver network can be detected.

From the queries observed on 1.1.1.1, we filter down first to new and newly seen domain names. We then apply our DGA classifier to the new and newly seen domain names, allowing us to detect activated command and control domains as soon as they are observed anywhere in the world by the 1.1.1.1 resolver.

DNS Tunneling detection

In issuing commands or extracting data from an installed piece of malware, attackers seek to avoid detection. One way to send data and bypass traditional detection methods is to encode data within another protocol. When the attacker controls the authoritative name server for a domain, information can be encoded as DNS queries and responses. Instead of making a DNS query for a simple domain name, such as www.cloudflare.com, and getting a response like 104.16.124.96, attackers can send and receive long DNS queries and responses that contain encoded data.

Here is an example query made by an application performing DNS tunneling (query shortened and partially redacted):

3rroeuvx6bkvfwq7dvruh7adpxzmm3zfyi244myk4gmswch4lcwmkvtqq2cryyi.qrsptavsqmschy2zeghydiff4ogvcacaabc3mpya2baacabqtqcaa2iaaaaocjb.br1ns.example.com

The response data to a query like the one above can vary in length based on the response record type the server uses and the recursive DNS resolvers in the path. Generally, it is at most 255 characters per response record and looks like a random string of characters.

TXT	jdqjtv64k2w4iudbe6b7t2abgubis

This ability to take an arbitrary set of bytes and send it to the server as a DNS query and receive a response in the answer data creates a bi-directional communication channel that can be used to transmit any data. The malware running on the infected host encodes the data it wants to transmit as a DNS query name and the infected host sends the DNS query to its resolver.

Since this query is not a true hostname, but actually encodes some data the malware wishes to transmit, the query is very likely to be unique, and is passed on to the authoritative DNS server for that domain.

The authoritative DNS server decodes the query back into the original data, and if necessary can transmit it elsewhere on the Internet. Responses go back the other direction, the response data is encoded as a query response (for example a TXT record) and sent back to the malware running on the infected host.

One challenge with identifying this type of traffic, however, is that there are also many benign applications that use the DNS system to encode or transmit data as well. An example of a query that was classified as not DNS tunneling:

00641f74-8518-4f03-adc2-792a34ea2612.bbbb.example.com

As humans, we can see that the leading portion of this DNS query is a UUID. Queries like this are often used by security and monitoring applications and network appliances to check in. The leading portion of the query might be the unique id of the device or installation that is performing the check-in.

During the research and training phase our researchers identified a wide variety of different applications that use a large number of random looking DNS queries. Some examples of this include subdomains of content delivery networks, video streaming, advertising and tracking, security appliances, as well as DNS tunneling. Our researchers investigated and labeled many of these domains, and while doing so, identified features that can be used to distinguish between benign applications and true DNS tunneling.

The model

For this application, we trained a two-stage model. The first stage makes quick yes/no decisions about whether the domain might be a DNS tunneling domain. The second stage of the model makes finer-grained distinctions between legitimate domains that have large numbers of subdomains, such as security appliances or AV false-positive control, and malicious DNS tunneling.

The first stage is a gradient boosted decision tree that gives us an initial classification based on minimal information. A decision tree model is like playing 20 questions – each layer of the decision tree asks a yes or no question, which gets you closer to the final answer. Decision tree models are good at both predicting binary yes/no results as well as incorporating binary or nominal attributes into a prediction, and are fast and lightweight to execute, making them a good fit for this application. Gradient boosting is a reliable technique for training models that is particularly good at combining several attributes with weak predictive power into a strong predictor. It can be used to train multiple types of models including decision trees as well as numeric predictions.

If the first stage classifies the domain as “yes, potential DNS tunneling”, it is checked against the second stage, which incorporates data observed from Cloudflare’s 1.1.1.1 DNS resolver. This second model is a neural network model and refines the categorization of the first, in order to distinguish legitimate applications.

In this model, the neural network takes 28 features as input and classifies the domain into one of 17 applications, such as DNS tunneling, IT appliance beacons, or email delivery and spam related. Figure 2 shows a diagram generated from the popular Python software package Keras showing the layers of this neural network. We see the 28 input features at the top layer and at the bottom layer, the 17 output values indicating the prediction value for each type of application. This neural network is very small, having about 2,000 individual weights that can be set during the training process. In the next section we will see an example of a model that is based on a state-of-the-art pretrained model from a model family that has tens to hundreds of millions of predefined weights.

Figure 3 shows a plot of the feature values of the applications we are trying to distinguish in polar coordinates. Each color is the feature values of all the domains the model classified as a single type of application over a sample period. The position around the circle (theta) is the feature, and the distance from the center (rho) is the value of that feature. We can see how many of the applications have similar feature values.

When we observe a new domain and compute its feature values, our model uses those feature values to give us a prediction about which application the new domain resembles. As mentioned, the neural network has 28 inputs each of which is the value for a single feature and 17 outputs. The 17 output values represent the prediction that the domain is each of those 17 different types of applications, with malicious DNS tunneling being one of the 17 outputs. The job of the model is to convert the sometimes small differences between the feature values into a prediction. If the value of the malicious DNS tunneling output of the neural network is higher than the other outputs, the domain is labeled as a security threat.

Deployment

For the DNS tunneling model, our system consumes the logs from our secure web gateway service. The first stage model is applied to all DNS queries. Domains that are flagged as possible DNS tunneling are then sent to the second stage where the prediction is refined using additional features.

Looking forward: combining machine learning with human expertise

In September 2022, Cloudflare announced the general availability of our threat operations and research team, Cloudforce One, which allows our in-house experts to share insights directly with customers. Layering this human element on top of the ML models that we have already developed helps Cloudflare deliver additional protection threat protection for our customers, as we plan to explain in the next article in this blog series.

Until then, click here to create a free account, with no time limit for up to 50 users, and point just your DNS traffic, or all traffic (layers 4 to 7), to Cloudflare to protect your team, devices, and data with machine learning-driven threat defense.

Area 1 threat indicators now available in Cloudflare Zero Trust

2022-06-20 Jesse Kipp

Post Syndicated from Jesse Kipp original https://blog.cloudflare.com/phishing-threat-indicators-in-zero-trust/

Area 1 threat indicators now available in Cloudflare Zero Trust

Over the last several years, both Area 1 and Cloudflare built pipelines for ingesting threat indicator data, for use within our products. During the acquisition process we compared notes, and we discovered that the overlap of indicators between our two respective systems was smaller than we expected. This presented us with an opportunity: as one of our first tasks in bringing the two companies together, we have started bringing Area 1’s threat indicator data into the Cloudflare suite of products. This means that all the products today that use indicator data from Cloudflare’s own pipeline now get the benefit of Area 1’s data, too.

Area 1 built a data pipeline focused on identifying new and active phishing threats, which now supplements the Phishing category available today in Gateway. If you have a policy that references this category, you’re already benefiting from this additional threat coverage.

How Cloudflare identifies potential phishing threats

Cloudflare is able to combine the data, procedures and techniques developed independently by both the Cloudflare team and the Area 1 team prior to acquisition. Customers are able to benefit from the work of both teams across the suite of Cloudflare products.

Cloudflare curates a set of data feeds both from our own network traffic, OSINT sources, and numerous partnerships, and applies custom false positive control. Customers who rely on Cloudflare are spared the software development effort as well as the operational workload to distribute and update these feeds. Cloudflare handles this automatically, with updates happening as often as every minute.

Cloudflare is able to go beyond this and work to proactively identify phishing infrastructure in multiple ways. With the Area 1 acquisition, Cloudflare is now able to apply the adversary-focused threat research approach of Area1 across our network. A team of threat researchers track state-sponsored and financially motivated threat actors, newly disclosed CVEs, and current phishing trends.

Cloudflare now operates mail exchange servers for hundreds of organizations around the world, in addition to its DNS resolvers, Zero Trust suite, and network services. Each of these products generates data that is used to enhance the security of all of Cloudflare’s products. For example, as part of mail delivery, the mail engine performs domain lookups, scores potential phishing indicators via machine learning, and fetches URLs. Data which can now be used through Cloudflare’s offerings.

How Cloudflare Area 1 identifies potential phishing threats

The Cloudflare Area 1 team operates a suite of web crawling tools designed to identify phishing pages, capture phishing kits, and highlight attacker infrastructure. In addition, Cloudflare Area 1 threat models assess campaigns based on signals gathered from threat actor campaigns; and the associated IOCs of these campaign messages are further used to enrich Cloudflare Area 1 threat data for future campaign discovery. Together these techniques give Cloudflare Area 1 a leg up on identifying the indicators of compromise for an attacker prior to their attacks against our customers. As part of this proactive approach, Cloudflare Area 1 also houses a team of threat researchers that track state-sponsored and financially motivated threat actors, newly disclosed CVEs, and current phishing trends. Through this research, analysts regularly insert phishing indicators into an extensive indicator management system that may be used for our email product or any other product that may query it.

Cloudflare Area 1 also collects information about phishing threats during our normal operation as the mail exchange server for hundreds of organizations across the world. As part of that role, the mail engine performs domain lookups, scores potential phishing indicators via machine learning, and fetches URLs. For those emails found to be malicious, the indicators associated with the email are inserted into our indicator management system as part of a feedback loop for subsequent message evaluation.

How Cloudflare data will be used to improve phishing detection

In order to support Cloudflare products, including Gateway and Page Shield, Cloudflare has a data pipeline that ingests data from partnerships, OSINT sources, as well as threat intelligence generated in-house at Cloudflare. We are always working to curate a threat intelligence data set that is relevant to our customers and actionable in the products Cloudflare supports. This is our North star: what data can we provide that enhances our customer’s security without requiring our customers to manage the complexity of data, relationships, and configuration. We offer a variety of security threat categories, but some major focus areas include:

Malware distribution
Malware and Botnet Command & Control
Phishing,
New and newly seen domains

Phishing is a threat regardless of how the potential phishing link gets entry into an organization, whether via email, SMS, calendar invite or shared document, or other means. As such, detecting and blocking phishing domains has been an area of active development for Cloudflare’s threat data team since almost its inception.

Looking forward, we will be able to incorporate that work into Cloudflare Area 1’s phishing email detection process. Cloudflare’s list of phishing domains can help identify malicious email when those domains appear in the sender, delivery headers, message body or links of an email.

Threat actors have long had an unfair advantage — and that advantage is rooted in the knowledge of their target, and the time they have to set up specific campaigns against their targets. That dimension of time allows threat actors to set up the right infrastructure, perform reconnaissance, stage campaigns, perform test probes, observe their results, iterate, improve and then launch their ‘production’ campaigns. This precise element of time gives us the opportunity to discover, assess and proactively filter out campaign infrastructure prior to campaigns reaching critical mass. But to do that effectively, we need visibility and knowledge of threat activity across the public IP space.

With Cloudflare’s extensive network and global insight into the origins of DNS, email or web traffic, combined with Cloudflare Area 1’s datasets of campaign tactics, techniques, and procedures (TTPs), seed infrastructure and threat models — we are now better positioned than ever to help organizations secure themselves against sophisticated threat actor activity, and regain the advantage that for so long has been heavily weighted towards the bad guys.

If you’d like to extend Zero Trust to your email security to block advanced threats, contact your Customer Success manager, or request a Phishing Risk Assessment here.

Noise

All posts by Jesse Kipp

Workers AI Update: Hello Mistral 7B

Mistral 7B tl;dr:

So what is “attention” anyway?

Learned parameters

Vertical stacking – layered attention blocks

Horizontal stacking – aka Multi-Head Attention

Styles of attention

Multi-Query Attention

Grouped-Query Attention

Enter Mistral — try it today

Streaming and longer context lengths for LLMs on Workers AI

Server-sent events: a little gem in the browser API

Higher precision, longer context and sequence lengths

Using the power of Cloudflare’s global network to detect malicious domains using machine learning

Domain Generation Algorithm (DGA)

The machine learning model

Deployment

DNS Tunneling detection

The model

Deployment

Looking forward: combining machine learning with human expertise

Area 1 threat indicators now available in Cloudflare Zero Trust

How Cloudflare identifies potential phishing threats

How Cloudflare Area 1 identifies potential phishing threats

How Cloudflare data will be used to improve phishing detection

The collective thoughts of the interwebz

Mistral 7B tl;dr:

So what is “attention” anyway?

Learned parameters

Vertical stacking – layered attention blocks

Horizontal stacking – aka Multi-Head Attention

Styles of attention

Multi-Query Attention

Grouped-Query Attention

Enter Mistral — try it today

Server-sent events: a little gem in the browser API

Higher precision, longer context and sequence lengths

Domain Generation Algorithm (DGA)

The machine learning model

Deployment

DNS Tunneling detection

The model

Deployment

Looking forward: combining machine learning with human expertise

How Cloudflare identifies potential phishing threats

How Cloudflare Area 1 identifies potential phishing threats

How Cloudflare data will be used to improve phishing detection

1+1 = 3: Greater dataset sharing between Cloudflare and Area 1

The collective thoughts of the interwebz