Enable business users to analyze large datasets in your data lake with Amazon QuickSight

Post Syndicated from Eliad Maimon original https://aws.amazon.com/blogs/big-data/enable-business-users-to-analyze-large-datasets-in-your-data-lake-with-amazon-quicksight/

This blog post is co-written with Ori Nakar from Imperva.

Imperva Cloud WAF protects hundreds of thousands of websites and blocks billions of security events every day. Events and many other security data types are stored in Imperva’s Threat Research Multi-Region data lake.

Imperva harnesses data to improve their business outcomes. To enable this transformation to a data-driven organization, Imperva brings together data from structured, semi-structured, and unstructured sources into a data lake. As part of their solution, they are using Amazon QuickSight to unlock insights from their data.

Imperva’s data lake is based on Amazon Simple Storage Service (Amazon S3), where data is continually loaded. The data lake holds a few dozen datasets at petabyte scale. Each day, terabytes of new data are added to the data lake and then transformed, aggregated, partitioned, and compressed.

In this post, we explain how Imperva’s solution enables users across the organization to explore, visualize, and analyze data using Amazon Redshift Serverless, Amazon Athena, and QuickSight.

Challenges and needs

A modern data strategy gives you a comprehensive plan to manage, access, analyze, and act on data. AWS provides the most complete set of services for the entire end-to-end data journey for all workloads, all types of data, and all desired business outcomes. In turn, this makes AWS the best place to unlock value from your data and turn it into insight.

Redshift Serverless is a serverless option of Amazon Redshift that allows you to run and scale analytics without having to provision and manage data warehouse clusters. Redshift Serverless automatically provisions and intelligently scales data warehouse capacity to deliver high performance for all your analytics. You just need to load and query your data, and you only pay for the compute used for the duration of the workloads on a per-second basis. Redshift Serverless is ideal when it’s difficult to predict compute needs such as variable workloads, periodic workloads with idle time, and steady-state workloads with spikes.

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, straightforward to use, and makes it simple for anyone with SQL skills to quickly analyze large-scale datasets in multiple Regions.

QuickSight is a cloud-native business intelligence (BI) service that you can use to visually analyze data and share interactive dashboards with all users in the organization. QuickSight is fully managed and serverless, requires no client downloads for dashboard creation, and has a pay-per-session pricing model that allows you to pay for dashboard consumption. Imperva uses QuickSight to enable users with no technical expertise, from different teams such as marketing, product, sales, and others, to extract insight from the data without the help of data or research teams.

QuickSight offers SPICE, an in-memory, cloud-native data store that allows end-users to interactively explore data. SPICE provides consistently fast query performance and automatically scales for high concurrency. With SPICE, you save time and cost because you don’t need to retrieve data from the data source (whether a database or data warehouse) every time you change an analysis or update a visual, and you offload concurrent access and analytical complexity from the underlying data source.

In order for QuickSight to consume data from the data lake, some of the data undergoes additional transformations, filters, joins, and aggregations. Imperva cleans their data by filtering incomplete records, reducing the number of records by aggregations, and applying internal logic to curate millions of security incidents out of hundreds of millions of records.

Imperva had the following requirements for their solution:

  • High performance with low query latency to enable interactive dashboards
  • Continuously update and append data to queryable sources from the data lake
  • Data freshness of up to 1 day
  • Low cost
  • Engineering efficiency

The challenge faced by Imperva and many other companies is how to create a big data extract, transform, and load (ETL) pipeline solution that fits these requirements.

In this post, we review two approaches Imperva implemented to address their challenges and meet their requirements. The solutions can be easily implemented while maintaining engineering efficiency, especially with the introduction of Redshift Serverless.

Imperva’s solutions

Imperva needed to have the data lake’s data available through QuickSight continuously. The following solutions were chosen to connect the data lake to QuickSight:

  • QuickSight caching layer, SPICE – Use Athena to query the data into a QuickSight SPICE dataset
  • Redshift Serverless – Copy the data to Redshift Serverless and use it as a data source

Our recommendation is to use a solution based on the use case. Each solution has its own advantages and challenges, which we discuss as part of this post.

The high-level flow is the following:

  • Data is continuously updated from the data lake into either Redshift Serverless or the QuickSight caching layer, SPICE
  • An internal user can create an analysis and publish it as a dashboard for other internal or external users

The following architecture diagram shows the high-level flow.

High-level flow

In the following sections, we discuss the details about the flow and the different solutions, including a comparison between them, which can help you choose the right solution for you.

Solution 1: Query with Athena and import to SPICE

QuickSight provides inherent capabilities to upload data using Athena into SPICE, which is a straightforward approach that meets Imperva’s requirements regarding simple data management. For example, it suits stable data flows without frequent exceptions, because exceptions may require a full SPICE refresh.

You can use Athena to load data into a QuickSight SPICE dataset, and then use the SPICE incremental upload option to load new data to the dataset. A QuickSight dataset will be connected to a table or a view accessible by Athena. A time column (like day or hour) is used for incremental updates. The following table summarizes the options and details.

Option: Existing table

  • Description: Use the built-in option by QuickSight.
  • Pros/Cons: Not flexible; the table is imported as is in the data lake.

Option: Dedicated view

  • Description: A view will let you better control the data in your dataset. It allows joining data, aggregation, or choosing a filter like the date you want to start importing data from. Note that QuickSight allows building a dataset based on custom SQL, but this option doesn’t allow incremental updates.
  • Pros/Cons: Large Athena resource consumption on a full refresh.

Option: Dedicated ETL

  • Description: Create a dedicated ETL process, which is similar to a view, but unlike the view, it allows reuse of the results in case of a full refresh. In case your ETL or view contains grouping or other complex operations, you know that these operations will be done only by the ETL process, according to the schedule you define.
  • Pros/Cons: Most flexible, but requires ETL development and implementation and additional Amazon S3 storage.
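To make the incremental option more concrete, here is a minimal sketch of the request parameters for a lookback window on a time column. It assumes the parameter shape of the QuickSight PutDataSetRefreshProperties API as we understand it; the helper name, account ID, dataset ID, and column name are hypothetical placeholders and do not come from Imperva's setup.

```python
def incremental_refresh_properties(account_id, dataset_id, time_column,
                                   size=1, size_unit="DAY"):
    """Build keyword arguments for a lookback-window incremental refresh.

    The structure mirrors our understanding of PutDataSetRefreshProperties;
    verify it against the current QuickSight API reference before use.
    """
    if size_unit not in {"HOUR", "DAY", "WEEK"}:
        raise ValueError(f"unsupported size unit: {size_unit}")
    if size < 1:
        raise ValueError("lookback window size must be at least 1")
    return {
        "AwsAccountId": account_id,
        "DataSetId": dataset_id,
        "DataSetRefreshProperties": {
            "RefreshConfiguration": {
                "IncrementalRefresh": {
                    "LookbackWindow": {
                        "ColumnName": time_column,  # the time column used for increments
                        "Size": size,
                        "SizeUnit": size_unit,
                    }
                }
            }
        },
    }


# Placeholder identifiers, for illustration only.
params = incremental_refresh_properties(
    "111122223333", "my-dataset-id", "first_day_of_week", size=2, size_unit="WEEK"
)
# With boto3 installed and credentials configured, this could be sent as:
#   boto3.client("quicksight").put_data_set_refresh_properties(**params)
```

Keeping the payload construction in a pure function makes it possible to review or unit test the configuration without calling AWS.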

The following architecture diagram details the options for loading data by Athena into SPICE.

Architecture diagram details the options for loading data by Athena into SPICE

The following code provides a SQL example for a view creation. We assume the existence of two tables, my_customers and my_events, with one join column called customer_id. The view is used to do the following:

  • Aggregate the data from daily to weekly, and reduce the number of rows
  • Control the start date of the dataset (in this case, 30 weeks back)
  • Join the data to add more columns (customer_type) and filter it
CREATE VIEW my_dataset AS
-- Aggregate daily events into weekly totals per customer type and event type
SELECT DATE_ADD('day', -DAY_OF_WEEK(day) + 1, day) AS first_day_of_week,  -- Monday of that week
       customer_type, event_type, COUNT(*) AS total_events
FROM my_events INNER JOIN my_customers USING (customer_id)
WHERE customer_type NOT IN ('Reseller')
      -- Keep the last 30 complete weeks: Monday 30 weeks back through last Sunday
      AND day BETWEEN DATE_ADD('day', -7 * 30 - DAY_OF_WEEK(CURRENT_DATE) + 1, CURRENT_DATE)
      AND DATE_ADD('day', -DAY_OF_WEEK(CURRENT_DATE), CURRENT_DATE)
GROUP BY 1, 2, 3
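
The date arithmetic in the view can be hard to read at a glance. The following Python sketch reproduces it locally, assuming that DAY_OF_WEEK returns the ISO weekday (1 for Monday through 7 for Sunday), which matches Python's isoweekday(). The function names are hypothetical and exist only for illustration.

```python
from datetime import date, timedelta


def first_day_of_week(day: date) -> date:
    """Monday of the week containing `day` (same as the view's first column)."""
    return day - timedelta(days=day.isoweekday() - 1)


def dataset_window(today: date, weeks_back: int = 30):
    """Inclusive (start, end) range selected by the view's BETWEEN clause."""
    # Monday `weeks_back` weeks before the current week
    start = today + timedelta(days=-7 * weeks_back - today.isoweekday() + 1)
    # Sunday of the previous week, so the current partial week is excluded
    end = today - timedelta(days=today.isoweekday())
    return start, end


start, end = dataset_window(date(2023, 6, 28))
print(start, end)
```

The window always covers whole weeks, so each weekly row is complete when it is first imported.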

Solution 2: Load data into Redshift Serverless

Redshift Serverless provides full visibility to the data, which can be viewed or edited at any time. For example, if there is a delay in adding data to the data lake or the data isn’t properly added, with Redshift Serverless, you can edit data using SQL statements or retry data loading. Redshift Serverless is a scalable solution that doesn’t have a dataset size limitation.

Redshift Serverless is used as a serving layer for the datasets that are to be used in QuickSight. The pricing model for Redshift Serverless is based on storage utilization and the run of queries; idle compute resources have no associated cost. Setting up a cluster is simple and doesn’t require you to choose node types or amount of storage. You simply load the data to tables you create and start working.

To create a new dataset, you need to create an Amazon Redshift table and run the following process every time data is added:

  1. Transform the data using an ETL process (optional):
    • Read data from the tables.
    • Transform to the QuickSight dataset schema.
    • Write the data to an S3 bucket and load it to Amazon Redshift.
  2. Delete old data if it exists to avoid duplicate data.
  3. Load the data using the COPY command.
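
The delete-then-load steps above can be sketched as a small helper that produces the SQL statements for one daily partition. This is an illustration under stated assumptions, not Imperva's implementation: it assumes the target table has a day column that identifies the rows of a partition, and the function name, bucket prefix, and role value are hypothetical.

```python
import re
from datetime import date

# Accept only simple identifiers so the table name can be interpolated safely.
_IDENTIFIER = re.compile(r"^[A-Za-z_][A-Za-z0-9_]*$")


def reload_partition_statements(table, bucket_prefix, day, iam_role):
    """Return SQL statements that replace one day's data in a Redshift table."""
    if not _IDENTIFIER.match(table):
        raise ValueError(f"invalid table name: {table!r}")
    day_str = day.isoformat()
    source = f"{bucket_prefix.rstrip('/')}/{table}/day={day_str}"
    return [
        "BEGIN",
        # Step 2: delete old data for the day to avoid duplicates on retry
        f"DELETE FROM {table} WHERE day = '{day_str}'",
        # Step 3: load the partition with COPY
        f"COPY {table} FROM '{source}' IAM_ROLE '{iam_role}' FORMAT AS PARQUET",
        "COMMIT",
    ]


for statement in reload_partition_statements(
    "my_table", "s3://my_bucket", date(2022, 1, 1), "my_role"
):
    print(statement + ";")
```

Running the delete and the load in one transaction keeps readers from seeing the table without that day's data, and it makes the process safe to retry.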

The following architecture diagram details the options to load data into Redshift Serverless with or without an ETL process.

Architecture diagram details the options to load data into Redshift Serverless with or without an ETL process

The Amazon Redshift COPY command is simple and fast. For example, to copy daily partition Parquet data, use the following code:

COPY my_table
FROM 's3://my_bucket/my_table/day=2022-01-01'
IAM_ROLE 'my_role' 
FORMAT AS PARQUET

Use the following COPY command to load the output file of the ETL process. Values will be truncated according to Amazon Redshift column size. The column truncation is important because, unlike in the data lake, in Amazon Redshift, the column size must be set. This option prevents COPY failures:

COPY my_table
FROM 's3://my_bucket/my_table/day=2022-01-01'
IAM_ROLE 'my_role' 
FORMAT AS JSON GZIP TRUNCATECOLUMNS

The Amazon Redshift COPY operation provides many benefits and options. It supports multiple formats as well as column mapping, escaping, and more. It also allows more control over data format, object size, and options to tune the COPY operation for improved performance. Unlike data in the data lake, Amazon Redshift has column length specifications. We use TRUNCATECOLUMNS to truncate the data in columns to the appropriate number of characters so that it fits the column specification.
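
Because Amazon Redshift VARCHAR lengths are declared in bytes, it can help to check locally which values would be shortened before relying on TRUNCATECOLUMNS. The following is only a rough local analogue with a hypothetical helper name; the behavior of COPY itself is authoritative.

```python
def truncate_to_bytes(value: str, max_bytes: int) -> str:
    """Shorten `value` so its UTF-8 encoding fits in `max_bytes` bytes.

    A multibyte character that would be cut in half is dropped entirely.
    """
    encoded = value.encode("utf-8")
    if len(encoded) <= max_bytes:
        return value
    # errors="ignore" discards an incomplete trailing byte sequence
    return encoded[:max_bytes].decode("utf-8", errors="ignore")


def would_be_truncated(value: str, max_bytes: int) -> bool:
    """True when `value` does not fit a column of `max_bytes` bytes."""
    return len(value.encode("utf-8")) > max_bytes


print(would_be_truncated("a long user agent string", 10))
```

A check like this can be run in the ETL step to log how many values exceed the column size, so that silent truncation does not hide a schema that has become too narrow.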

Using this method provides full control over the data. In case of a problem, we can repair parts of the table by deleting old data and loading the data again. It’s also possible to use the QuickSight dataset JOIN option, which is not available in SPICE when using incremental update.

An additional benefit of this approach is that the data is available for other clients and services looking to use the same data, such as SQL clients or notebook servers such as Apache Zeppelin.

Conclusion

QuickSight allows Imperva to expose business data to various departments within an organization. In the post, we explored approaches for importing data from a data lake to QuickSight, whether continuously or incrementally.

However, it’s important to note that there is no one-size-fits-all solution; the optimal approach will depend on the specific use case. Both options—continuous and incremental updates—are scalable and flexible, with no significant cost differences observed for our dataset and access patterns.

Imperva found incremental refresh to be very useful and uses it for simple data management. For more complex datasets, Imperva has benefitted from the greater scalability and flexibility provided by Redshift Serverless.

In cases where a higher degree of control over the datasets was required, Imperva chose Redshift Serverless so that data issues could be addressed promptly by deleting, updating, or inserting new records as necessary.

With the integration of dashboards, individuals can now access data that was previously inaccessible to them. Moreover, QuickSight has played a crucial role in streamlining our data distribution processes, enabling data accessibility across all departments within the organization.

To learn more, visit Amazon QuickSight.


About the Authors

Eliad Maimon is a Senior Startups Solutions Architect at AWS in Tel-Aviv with over 20 years of experience in architecting, building, and maintaining software products. He creates architectural best practices and collaborates with customers to leverage cloud and innovation, transforming businesses and disrupting markets. Eliad specializes in machine learning on AWS, with a focus on areas such as generative AI, MLOps, and Amazon SageMaker.

Ori Nakar is a principal cyber-security researcher, a data engineer, and a data scientist at Imperva Threat Research group. Ori has many years of experience as a software engineer and engineering manager, focused on cloud technologies and big data infrastructure.

Enforce boundaries on AWS Glue interactive sessions

Post Syndicated from Nicolas Jacob Baer original https://aws.amazon.com/blogs/big-data/enforce-boundaries-on-aws-glue-interactive-sessions/

AWS Glue interactive sessions allow engineers to build, test, and run data preparation and analytics workloads in an interactive notebook. Interactive sessions provide isolated development environments, take care of the underlying compute cluster, and allow for configuration to stop idling resources.

AWS Glue interactive sessions provide default recommended configurations and also allow users to customize the session to meet their needs. For example, you can provision more workers to experiment on a larger dataset or set the idle timeout for long-running workloads. With the flexibility to change these options depending on the workload, you may need to ensure that the options are changed only within specific boundaries, and apply a control mechanism.

In this post, we present the process of deploying a reusable solution to enforce AWS Glue interactive session limits on three options: connection, number of workers, and maximum idle time. The first option addresses the need for applying custom inspection and controls on traffic, for example by enforcing an interactive session to only be run inside a VPC. The other two enforce limits on costs and usage of AWS Glue resources by enforcing an upper boundary on the number of workers and idle time per session. You can further extend the solution for other properties or services within AWS Glue.

Overview of solution

The proposed architecture is built on serverless components and runs whenever a new AWS Glue interactive session is created.

Architecture Diagram of the Solution

The workflow steps are as follows:

  1. A data engineer creates a new AWS Glue interactive session either through the AWS Management Console or in a Jupyter notebook locally.
  2. The interactive session produces a new event to AWS CloudTrail for the CreateSession event with all relevant information to identify and inspect a session as soon as the session is initiated.
  3. An Amazon EventBridge rule filters the CloudTrail events and invokes an AWS Lambda function to inspect the CreateSession event.
  4. The Lambda function inspects the CreateSession event and checks for all defined boundary conditions. Currently, the boundaries configurable with this solution are limited to maximum number of workers, idle timeout in minutes, and deployment with connection enforced.
  5. If any of the defined boundary conditions are not met, for example too many workers are provisioned for the session, depending on the provided configuration, the function ends the interactive session immediately and sends an email via Amazon Simple Notification Service (Amazon SNS). If the session hasn’t started yet, the function will wait for it to start before taking any action.
  6. If the session was stopped, an email is sent to an SNS topic. There is no information available in the interactive session notebook on the reason for the ending of the session. Therefore, additional context information is provided through the SNS topic to the data engineers.
  7. If the function fails, the sessions are logged in a dead-letter queue inside Amazon Simple Queue Service (Amazon SQS). Furthermore, the queue is monitored and in case of a message, it will trigger an Amazon CloudWatch alarm.
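
The boundary check in step 4 can be sketched as a pure function that takes the request parameters of a CreateSession event and returns the list of violated boundaries. This is a simplified illustration, not the code from the repository: the field names numberOfWorkers, idleTimeout, and connections are assumptions about how the event is shaped, and the class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Any, Dict, List, Optional


@dataclass(frozen=True)
class Boundaries:
    max_workers: Optional[int] = None
    max_idle_timeout_minutes: Optional[int] = None
    enforce_vpc_connection: bool = False


def find_violations(request_parameters: Dict[str, Any], limits: Boundaries) -> List[str]:
    """Return a human-readable message for every boundary the session breaks."""
    violations = []

    # A missing value means the session uses the service default; this sketch
    # does not flag defaults.
    workers = request_parameters.get("numberOfWorkers")
    if limits.max_workers is not None and workers is not None and workers > limits.max_workers:
        violations.append(f"{workers} workers requested, maximum is {limits.max_workers}")

    idle = request_parameters.get("idleTimeout")
    if (limits.max_idle_timeout_minutes is not None and idle is not None
            and idle > limits.max_idle_timeout_minutes):
        violations.append(
            f"idle timeout of {idle} minutes, maximum is {limits.max_idle_timeout_minutes}"
        )

    if limits.enforce_vpc_connection:
        connections = (request_parameters.get("connections") or {}).get("connections") or []
        if not connections:
            violations.append("session has no connection configured")

    return violations


limits = Boundaries(max_workers=10, max_idle_timeout_minutes=60, enforce_vpc_connection=True)
for message in find_violations({"numberOfWorkers": 50, "idleTimeout": 600}, limits):
    print(message)
```

Returning messages rather than a single boolean gives the notification step the context it needs, since the notebook itself shows no reason for the session ending.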

The following steps walk you through how to build and deploy the solution. The code is available in the GitHub repo.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Overview of the deployed resources

All the necessary resources are defined in an AWS CloudFormation file located under cfn/template.yaml. To deploy those resources, we use AWS Serverless Application Model (AWS SAM), which enables us to conveniently build and package all the dependencies and also manages the AWS CloudFormation steps for us.

The CloudFormation stack deploys the following resources:

  • A Lambda function with its library, both defined under the directory src/functions. The function is the control. It will validate that the session is started within the limits defined.
  • An EventBridge rule. This rule listens to CloudTrail events and, in case of a new interactive session, triggers the control Lambda function.
  • An SQS dead-letter queue (DLQ) attached to the Lambda function. This keeps a record of events that triggered a Lambda function failure.
  • Two CloudWatch alarms monitoring the Lambda function failures and the messages in the DLQ.
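
To illustrate what the EventBridge rule filters on, the following sketch shows an event pattern for CreateSession calls recorded by CloudTrail, together with a deliberately simplified matcher for local experiments. The pattern in the repository may differ, and the matcher covers only exact-value lists and nested objects, not the full EventBridge pattern language.

```python
# Event pattern for Glue CreateSession API calls delivered through CloudTrail.
CREATE_SESSION_PATTERN = {
    "source": ["aws.glue"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["glue.amazonaws.com"],
        "eventName": ["CreateSession"],
    },
}


def matches(pattern: dict, event: dict) -> bool:
    """Simplified matching: every pattern key must be present in the event,
    and the event value must be one of the listed values."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


sample_event = {
    "source": "aws.glue",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {"eventSource": "glue.amazonaws.com", "eventName": "CreateSession"},
}
print(matches(CREATE_SESSION_PATTERN, sample_event))
```

Filtering on the event name in the rule keeps the Lambda function from being invoked for unrelated AWS Glue API calls.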

If notification via email is enabled, two more resources are deployed:

Additionally, AWS CloudFormation deploys all the necessary AWS Identity and Access Management (IAM) roles and policies, and an AWS Key Management Service (AWS KMS) key to ensure that the exchanged data is encrypted.

Deploy the solution

To facilitate the deployment lifecycle, including the setup of the user local environment, we provide a Makefile that describes all the necessary steps. Make sure you have your AWS credentials renewed and have access to your account. For more information, refer to Configuration and credential file settings.

  • Explore the Makefile and adjust the Region and stack name as needed by modifying the values of the variables AWS_REGION and STACK_NAME.
  • Set KILL_SESSION = "True" if you want to immediately stop the interactive session that has been found out of boundaries. Allowed values are True or False; the default is True.
  • Set NOTIFICATION_EMAIL_ADDRESS = <your email address> in the Makefile if you want to get notified when a session has been found out of boundaries.
  • Set values for your controls:
    • ENFORCE_VPC_CONNECTION to stop sessions not running inside a VPC (true or false).
    • MAX_WORKERS to set the maximum number of workers for a session (numeric).
    • MAX_IDLE_TIMEOUT_MINUTES to define the maximum idle time for sessions in minutes (numeric).
  • Install all the prerequisite libraries:
    make install-pre-requisites

    These will be installed under a newly created Python virtual environment inside this repository in the directory .venv.

  • Deploy the new stack:
    make deploy

    This command will complete the following tasks:

    • Check if the prerequisites are met.
    • Run the pytest unit tests on the Python files.
    • Validate the CloudFormation template.
    • Build the artifacts (Lambda function and Lambda layers).
    • Deploy the resources via AWS SAM.

Test the solution

Refer to Introducing AWS Glue interactive sessions for Jupyter for information about running an interactive session. If you follow the instructions in the post (see the section Run your first code cell and author your AWS Glue notebook), the initialization of the interactive session should fail with an error similar to the following.

Example of code in the cell:

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
sc3 = SparkContext.getOrCreate()
glueContext1 = GlueContext(sc3)
spark = glueContext1.spark_session
job = Job(glueContext1)

Received output:

Authenticating with profile=XXXXXXXX
glue_role_arn defined by user: arn:aws:iam::XXXXXXXXXX:role/XXXXXXXX
Trying to create a Glue session for the kernel.
Worker Type: G.1X
Number of Workers: 5
Session ID: XXXXXXXXXXXXX
Applying the following default arguments:
--glue_kernel_version 0.35
--enable-glue-datacatalog true
Waiting for session xxxxxxxxx to get into ready status...
Session xxxxxxxxx has been created
Exception encountered while running statement: An error occurred (EntityNotFoundException) when calling the GetStatement operation: Session ID xxxxxxxxx not found

If you enabled the email feature, you should also get an email notification.

You can also check on the AWS Glue console that your session ID isn’t listed.

Clean up

Clean up the deployed resources by running the following command:

make clean-up

Note that the resources deployed from following the recommended post, Introducing AWS Glue interactive sessions for Jupyter, will not be removed with the previous command.

Limitations

The delivery guarantee for CloudTrail events to EventBridge is best effort. This means CloudTrail will attempt to deliver all events to EventBridge, but in some rare cases, an event might not be delivered. For more information, refer to Events from AWS services.

Conclusion

This post described how to build, deploy, and test a solution to enforce boundary conditions on AWS Glue interactive sessions in order to enforce constraints on the number of workers, idle timeouts, and AWS Glue connection.

You can adapt this solution based on your needs and further extend it to allow controls on other options.

To learn more about how to use AWS Glue interactive sessions, refer to Introducing AWS Glue interactive sessions for Jupyter and Author AWS Glue jobs with PyCharm using AWS Glue interactive sessions.


About the Authors

Nicolas Jacob Baer is a Senior Cloud Application Architect with a strong focus on data engineering and machine learning, based in Switzerland. He works closely with enterprise customers to design data platforms and build advanced analytics/ml use-cases.

Luca Mazzaferro is a Senior DevOps Architect at Amazon Web Services. He likes to have infrastructure automated, reproducible and secured. In his free time he likes to cook, especially pizza.

Kemeng Zhang is a Cloud Application Architect with a strong focus on machine learning and UX, based in Switzerland. She works closely with customers to design user experiences and build advanced analytics/ml use-cases.

Mark Walser, a Senior Global Data Architect at Amazon Web Services, collaborates with customers to develop innovative Big Data solutions that solve business problems and speed up the adoption of AWS services. Outside of work, he finds pleasure in running, swimming, and all things related to technology.

Gal Heyne is a Product Manager for AWS Glue with a strong focus on AI/ML, data engineering and BI, based in California. She is passionate about developing a deep understanding of customers’ business needs and collaborating with engineers to design easy-to-use data products.

[$] Reports from OSPM 2023, part 3

Post Syndicated from original https://lwn.net/Articles/935180/

The fifth conference on Power
Management and Scheduling in the Linux Kernel
(abbreviated “OSPM”) was
held on April 17 to 19 in Ancona, Italy. LWN was not there,
unfortunately, but the attendees of the event have gotten together to write
up summaries of the discussions that took place and LWN has the privilege
of being able to publish them. Reports from the third and final day of the
event appear below.

Security updates for Friday

Post Syndicated from original https://lwn.net/Articles/936040/

Security updates have been issued by Debian (asterisk, lua5.3, and trafficserver), Fedora (tang and trafficserver), Oracle (.NET 7.0, c-ares, firefox, openssl, postgresql, python3, texlive, and thunderbird), Red Hat (python27:2.7 and python39:3.9 and python39-devel:3.9), Scientific Linux (c-ares), Slackware (cups), SUSE (cups, dav1d, google-cloud-sap-agent, java-1_8_0-openjdk, libX11, openssl-1_0_0, openssl-1_1, openssl-3, openvswitch, and python-sqlparse), and Ubuntu (cups, dotnet6, dotnet7, and openssl).

Customer Compliance Guides now available on AWS Artifact

Post Syndicated from Kevin Donohue original https://aws.amazon.com/blogs/security/customer-compliance-guides-now-available-on-aws-artifact/

Amazon Web Services (AWS) has released Customer Compliance Guides (CCGs) to support customers, partners, and auditors in their understanding of how compliance requirements from leading frameworks map to AWS service security recommendations. CCGs cover 100+ services and features offering security guidance mapped to 10 different compliance frameworks. Customers can select any of the available frameworks and services to see a consolidated summary of recommendations that are mapped to security control requirements. 

CCGs summarize key details from public AWS user guides and map them to related security topics and control requirements. CCGs don’t cover compliance topics such as physical and maintenance controls, or organization-specific requirements such as policies and human resources controls. This makes the guides lightweight and focused only on the unique security considerations for AWS services.

Customer Compliance Guides work backwards from security configuration recommendations for each service and map the guidance and compliance considerations to the following frameworks:

  • National Institute of Standards and Technology (NIST) 800-53
  • NIST Cybersecurity Framework (CSF)
  • NIST 800-171
  • System and Organization Controls (SOC) II
  • Center for Internet Security (CIS) Critical Controls v8.0
  • ISO 27001
  • NERC Critical Infrastructure Protection (CIP)
  • Payment Card Industry Data Security Standard (PCI-DSS) v4.0
  • Department of Defense Cybersecurity Maturity Model Certification (CMMC)
  • HIPAA

Customer Compliance Guides help customers address three primary challenges:

  1. Explaining how configuration responsibility might vary depending on the service and summarizing security best practice guidance through the lens of compliance
  2. Assisting customers in determining the scope of their security or compliance assessments based on the services they use to run their workloads
  3. Providing customers with guidance to craft security compliance documentation that might be required to meet various compliance frameworks

CCGs are available for download in AWS Artifact. Artifact is your go-to, central resource for AWS compliance-related information. It provides on-demand access to security and compliance reports from AWS and independent software vendors (ISVs) who sell their products on AWS Marketplace. To access the new CCG resources, navigate to AWS Artifact from the console and search for Customer Compliance Guides. To learn more about the background of Customer Compliance Guides, see the YouTube video Simplify the Shared Responsibility Model.

 
Kevin Donohue

Kevin is a Senior Manager in AWS Security Assurance, specializing in shared responsibility compliance and regulatory operations across various industries. Kevin began his tenure with AWS in 2019 in support of U.S. Government customers in the AWS FedRAMP program.

Travis Goldbach

Travis has over 12 years’ experience as a cybersecurity and compliance professional with demonstrated ability to map key business drivers to ensure client success. He started at AWS in 2021 as a Sr. Business Development Manager to help AWS customers accelerate their DFARS, NIST, and CMMC compliance requirements while reducing their level of effort and risk.

How we scaled and protected Eurovision 2023 voting with Pages and Turnstile

Post Syndicated from Dirk-Jan van Helmond original http://blog.cloudflare.com/how-cloudflare-scaled-and-protected-eurovision-2023-voting/

2023 was the first year that non-participating countries could vote for their favorites during the Eurovision Song Contest, adding millions of additional viewers and voters to an already impressive 162 million tuning in from the participating countries. It became a truly global event with a potential for disruption from multiple sources. To prepare for anything, Cloudflare helped scale and protect the voting application, used by millions of dedicated fans around the world to choose the winner.

In this blog we will cover how once.net built their platform based.io to monitor, manage and scale the Eurovision voting application to handle all traffic using many Cloudflare services. The speed with which DNS changes made through the Cloudflare API propagate globally allowed them to scale their backend within seconds. At the same time, Cloudflare Pages was ready to serve any amount of traffic to the voting landing page so fans didn’t miss a beat. And to cap it off, by combining Cloudflare CDN, DDoS protection, WAF, and Turnstile, they made sure that attackers didn’t steal any of the limelight.

The unsung heroes

Based.io is a resilient live data platform built by the once.net team, with the capability to scale up to 400 million concurrent connected users. It’s built from the ground up for speed and performance, consisting of an observable real time graph database, networking layer, cloud functions, analytics and infrastructure orchestration. Since all system information, traffic analysis and disruptions are monitored in real time, it makes the platform instantly responsive to variable demand, which enables real time scaling of your infrastructure during spikes, outages and attacks.

Although the based.io platform on its own is currently in closed beta, it is already serving a few flagship customers in production assisted by the software and services of the once.net team. One such customer is Tally, a platform used by multiple broadcasters in Europe to add live interaction to traditional television. Over 100 live shows have been performed using the platform. Another is Airhub, a startup that handles and logs automatic drone flights. And of course the star of this blog post, the Eurovision Song Contest.

Setting the stage

The Eurovision Song Contest is one of the world’s most popular broadcast contests, and this year it reached 162 million people across 38 broadcasting countries. In addition, on TikTok the three live shows were viewed 4.8 million times, while 7.6 million people watched the Grand Final live on YouTube. With such an audience, it is no surprise that Cloudflare sees the impact of it on the Internet. Last year, we wrote a blog post where we showed lower than average traffic during, and higher than average traffic after, the Grand Final. This year, the traffic from participating countries showed an even more remarkable surge:

HTTP Requests per Second from Norway, with a similar pattern visible in countries such as the UK, Sweden and France. Internet traffic spiked at 21:20 UTC, when voting started.

Such large amounts of traffic are nothing new to the Eurovision Song Contest. Eurovision has relied on Cloudflare’s services for over a decade now, and Cloudflare has helped to protect Eurovision.tv and improve its performance through noticeably faster load times for visitors from all corners of the world. Year after year, the Eurovision team continued to use more of our services, discovering additional features to improve performance and reliability further, with increasingly fine-grained control over their traffic flows. Eurovision.tv uses Page Rules to cache additional content on Cloudflare’s edge, speeding up delivery without sacrificing up-to-the-minute updates during the global event. Finally, to protect their backend and content management system, the team has placed their admin portals behind Cloudflare Zero Trust to delegate responsibilities down to individual levels.

Since then the contest itself has also evolved – sometimes by choice, sometimes by force. During the COVID-19 pandemic in-person cheering became impossible for many people due to a reduced live audience, resulting in the Eurovision Song Contest asking once.net to build a new iOS and Android application in which fans could cheer virtually. The feature was an instant hit, and it was clear that it would become part of this year’s contest as well.

A screenshot of the official Eurovision Song Contest application showing the real-time number of connected fans (1) and allowing them to cheer (2) for their favorites.

In 2023, once.net was also asked to handle the paid voting from the regions where phone and SMS voting was not possible. It was the first time that Eurovision allowed voting online. The challenge that had to be overcome was the extreme peak demand on the platform when the show was live, and especially when the voting window started.

Complicating matters further, last year’s show had been hit by a large number of targeted and coordinated attacks.

To prepare for these spikes in demand and determined adversaries, once.net needed a platform that was not only resilient and highly scalable, but that could also act as a mitigation layer in front of its backend. once.net selected Cloudflare for this functionality and integrated Cloudflare deeply with its real-time monitoring and management platform. To understand how and why, it’s essential to understand based.io’s underlying architecture.

The based.io platform

Instead of relying on network or HTTP load balancers, based.io uses a client-side service discovery pattern, selecting the most suitable server to connect to and leveraging Cloudflare's fast cache propagation infrastructure to handle spikes in traffic (both malicious and benign).

First, each server continuously registers a unique access key that has an expiration of 15 seconds, which must be used when a client connects to the server. In addition, the backend servers register their health (such as active connections, CPU, memory usage, requests per second, etc.) to the service registry every 300 milliseconds. Clients then request the optimal server URL and associated access key from a central discovery registry and proceed to establish a long-lived connection with that server. When a server gets overloaded, it will disconnect a certain number of clients, and those clients will go through the discovery process again.
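The discovery step described above can be sketched roughly as follows. This is a minimal illustration, not based.io's actual implementation: the `ServerEntry` fields, the scoring rule (fewest connections, then CPU load) and the example URLs are all assumptions made for the sketch. Only the 15-second key lifetime comes from the text.

```python
import time
from dataclasses import dataclass

KEY_TTL_SECONDS = 15  # access-key lifetime mentioned in the post


@dataclass
class ServerEntry:
    """One registry row; the field names are hypothetical."""
    url: str
    access_key: str
    key_issued_at: float       # seconds since epoch
    active_connections: int
    cpu_load: float            # 0.0 .. 1.0


def pick_server(registry, now=None):
    """Return (url, access_key) for the least-loaded server whose key is still valid."""
    now = time.time() if now is None else now
    live = [s for s in registry if now - s.key_issued_at < KEY_TTL_SECONDS]
    if not live:
        return None
    # Assumed scoring: fewest active connections first, CPU load as tie-breaker.
    best = min(live, key=lambda s: (s.active_connections, s.cpu_load))
    return best.url, best.access_key


registry = [
    ServerEntry("wss://a.example.net", "key-a", 100.0, 35_000, 0.80),
    ServerEntry("wss://b.example.net", "key-b", 100.0, 12_000, 0.40),
    ServerEntry("wss://c.example.net", "key-c", 80.0, 500, 0.05),  # key has expired
]
choice = pick_server(registry, now=105.0)
print(choice)
```

A client that gets disconnected by an overloaded server would simply call the equivalent of `pick_server` again against a fresh copy of the registry.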

The central discovery registry would normally be a huge bottleneck and attack target. based.io resolves this by putting the registry behind Cloudflare's global network with a cache time of three seconds. Since the system relies on real-time stats to distribute load and uses short lived access keys, it is crucial that the cache updates fast and reliably. This is where Cloudflare’s infrastructure proved its worth, both due to the fast updating cache and reducing load with Tiered Caching.

Not using load balancers means the based.io system allows clients to connect to the backend servers through Cloudflare, resulting in better performance and a more resilient infrastructure by eliminating the load balancers as a potential attack surface. It also results in a better distribution of connections, using real-time information on server health, the number of active connections, and active subscriptions.

Scaling up the platform happens automatically under load by deploying additional machines that can each handle 40,000 connected users. These are spun up in batches of a couple of hundred and as each machine spins up, it reaches out directly to the Cloudflare API to configure its own DNS record and proxy status. Thanks to Cloudflare’s high speed DNS system, these changes are then propagated globally within seconds, resulting in a total machine turn-up time of around three seconds. This means faster discovery of new servers and faster dynamic rebalancing from the clients. And since the voting window of the Eurovision Song Contest is only 45 minutes, with the main peak within minutes after the window opens, every second counts!
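As an illustration of the self-registration step, the sketch below builds (but does not send) a request to the Cloudflare API endpoint for creating a DNS record. The zone ID, token, hostname and IP address are placeholders, and the helper name is hypothetical. How once.net actually structures this call is not described in the post.

```python
import json
import urllib.request

API_BASE = "https://api.cloudflare.com/client/v4"


def build_dns_record_request(zone_id, api_token, hostname, ip_address):
    """Build a POST request that creates a proxied A record for a new machine."""
    payload = {
        "type": "A",
        "name": hostname,
        "content": ip_address,
        "ttl": 1,         # 1 means "automatic" TTL
        "proxied": True,  # route traffic through Cloudflare's proxy
    }
    return urllib.request.Request(
        url=f"{API_BASE}/zones/{zone_id}/dns_records",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values; a real machine would read these from its environment.
req = build_dns_record_request("ZONE_ID", "API_TOKEN", "node-0001.example.net", "203.0.113.10")
print(req.get_method(), req.full_url)
# urllib.request.urlopen(req) would send it; omitted so the sketch runs offline.
```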

High level architecture of the based.io platform used for the 2023 Eurovision Song Contest

To vote, users of the mobile app and viewers globally were pointed to the voting landing page, esc.vote. Building a frontend web application able to handle this kind of an audience is a challenge in itself. Although hosting it yourself and putting a CDN in front seems straightforward, this still requires you to own, configure and manage your origin infrastructure. once.net decided to leverage Cloudflare’s infrastructure directly by hosting the voting landing page on Cloudflare Pages. Deploying was as quick as a commit to their Git repository, and they never had to worry about reachability or scaling of the webpage.

once.net also used Cloudflare Turnstile to protect their payment API endpoints that were used to validate online votes. They used the invisible Turnstile widget to make sure the request was not coming from emulated browsers (e.g. Selenium). And best of all, using the invisible Turnstile widget the user did not have to go through extra steps, which allowed for a better user experience and better conversion.
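On the server side, a Turnstile token is typically checked by posting it to the siteverify endpoint before the protected action is allowed. The sketch below shows one way that check might look in front of a payment endpoint. The function names and sample responses are illustrative, and nothing here is taken from once.net's code.

```python
import json
import urllib.parse
import urllib.request

SITEVERIFY_URL = "https://challenges.cloudflare.com/turnstile/v0/siteverify"


def build_siteverify_request(secret_key, token, remote_ip=None):
    """Build the form-encoded POST that asks Turnstile to validate a widget token."""
    fields = {"secret": secret_key, "response": token}
    if remote_ip:
        fields["remoteip"] = remote_ip
    return urllib.request.Request(
        SITEVERIFY_URL,
        data=urllib.parse.urlencode(fields).encode("ascii"),
        method="POST",
    )


def vote_allowed(siteverify_body):
    """Accept the vote only when the siteverify response reports success."""
    result = json.loads(siteverify_body)
    return result.get("success") is True


request = build_siteverify_request("TURNSTILE_SECRET", "token-from-widget", "203.0.113.7")

# Sample response bodies, used here instead of a live network call.
sample_ok = '{"success": true, "error-codes": []}'
sample_bad = '{"success": false, "error-codes": ["invalid-input-response"]}'
print(vote_allowed(sample_ok), vote_allowed(sample_bad))
```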

Cloudflare Pages stealing the show!

After the two semi-finals went according to plan with approximately 200,000 concurrent users during each, May 13 brought the Grand Final. The once.net team made sure that there were enough machines ready to take the initial load, jumped on a call with Cloudflare to monitor and started looking at the number of concurrent users slowly increasing. During the event, there were a few attempts to DDoS the site, which were automatically and instantaneously mitigated without any noticeable impact to any visitors.

The based.io discovery registry server also got some attention. Since the cache TTL was set quite low at five seconds, a high rate of distributed traffic to it could still result in a significant load. Luckily, on its own, the highly optimized based.io server can already handle around 300,000 requests per second. Still, it was great to see that during the event the cache hit ratio for normal traffic was 20%, and during one significant attack the cache hit ratio peaked towards 80%. This showed how easy it is to leverage a combination of Cloudflare CDN and DDoS protection to mitigate such attacks, while still being able to serve dynamic and real time content.

When the curtains finally closed, 1.3 million concurrent users connected to the based.io platform at peak. The based.io platform handled a total of 350 million events and served seven million unique users in three hours. The voting landing page hosted by Cloudflare Pages served 2.3 million requests per second at peak, and Turnstile made sure that the voting payments came from real human fans. Although the Cloudflare platform didn’t blink at such a flood of traffic, it is no surprise that it shows up as a short crescendo in our traffic statistics.

Get in touch with us

If you’re also working on or with an application that would benefit from Cloudflare’s speed and security, but don’t know where to start, reach out and we’ll work together.

All the way up to 11: Serve Brotli from origin and Introducing Compression Rules

Post Syndicated from Matt Bullock original http://blog.cloudflare.com/this-is-brotli-from-origin/

Throughout Speed Week, we have talked about the importance of optimizing performance. Compression plays a crucial role by reducing file sizes transmitted over the Internet. Smaller file sizes lead to faster downloads, quicker website loading, and an improved user experience.

Take household cleaning products as a real-world example. It is estimated that “a typical bottle of cleaner is 90% water and less than 10% actual valuable ingredients”. Removing 90% of the contents of a typical 500ml bottle of household cleaner reduces the weight from 600g to 60g. This reduction means only a 60g parcel, with instructions to rehydrate on receipt, needs to be sent. Scaled up to gallons, this weight reduction soon becomes a huge shipping saving for businesses, not to mention the reduced environmental impact.

This is how compression works. The sender compresses the file to its smallest possible size, and then sends the smaller file with instructions on how to handle it when received. By reducing the size of the files sent, compression ensures the amount of bandwidth needed to send files over the Internet is a lot less. Where files are stored in expensive cloud providers like AWS, reducing the size of files sent can directly equate to significant cost savings on bandwidth.

Smaller file sizes are also particularly beneficial for end users with limited Internet connections, such as mobile devices on cellular networks or users in areas with slow network speeds.

Cloudflare has always supported compression in the form of Gzip. Gzip is a widely used compression algorithm that has been around since 1992 and provides file compression for all Cloudflare users. However, in 2013 Google introduced Brotli, which supports higher compression levels and better performance overall. Switching from Gzip to Brotli results in smaller file sizes and faster load times for web pages. We have supported Brotli since 2017 for the connection between Cloudflare and client browsers. Today we are announcing end-to-end Brotli support for web content: support for Brotli compression, at the highest possible levels, from the origin server to the client.

If your origin server supports Brotli, turn it on, crank up the compression level, and enjoy the performance boost.

Brotli compression to 11

Brotli has 12 levels of compression ranging from 0 to 11, with 0 providing the fastest compression speed but the lowest compression ratio, and 11 offering the highest compression ratio but requiring more computational resources and time. During our initial implementation of Brotli five years ago, we identified that compression level 4 offered the best balance between bytes saved and compression time without compromising performance.

Since 2017, Cloudflare has been using a maximum compression of Brotli level 4 for all compressible assets based on the end user's "accept-encoding" header. However, one issue was that Cloudflare only requested Gzip compression from the origin, even if the origin supported Brotli. Furthermore, Cloudflare would always decompress the content received from the origin before compressing and sending it to the end user, resulting in additional processing time. As a result, customers were unable to fully leverage the benefits offered by Brotli compression.

Old world

With Cloudflare now fully supporting Brotli end to end, customers will start seeing our updated accept-encoding header arriving at their origins. Once it is available, customers can transfer, cache and serve heavily compressed Brotli files directly to us, all the way up to the maximum level of 11. This will help reduce latency and bandwidth consumption. If the end user device does not support Brotli compression, we will automatically decompress the file and serve it either in its decompressed format or as a Gzip-compressed file, depending on the Accept-Encoding header.
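The fallback described above amounts to a small content negotiation on the client's Accept-Encoding header. The following is a simplified sketch of that decision, not Cloudflare's implementation: it handles q-values but ignores the `*` wildcard, and the function names are made up for illustration.

```python
def parse_accept_encoding(header):
    """Map each listed content coding to its q-value (default 1.0)."""
    codings = {}
    for part in header.split(","):
        part = part.strip()
        if not part:
            continue
        name, _, params = part.partition(";")
        q = 1.0
        params = params.strip()
        if params.startswith("q="):
            try:
                q = float(params[2:])
            except ValueError:
                q = 0.0  # treat an unparsable q-value as "not acceptable"
        codings[name.strip().lower()] = q
    return codings


def choose_encoding(accept_encoding):
    """Prefer Brotli, fall back to Gzip, otherwise send the body uncompressed."""
    accepted = parse_accept_encoding(accept_encoding)
    for coding in ("br", "gzip"):
        if accepted.get(coding, 0.0) > 0:
            return coding
    return "identity"


for header in ("gzip, deflate, br", "gzip, deflate", "br;q=0, gzip", ""):
    print(repr(header), "->", choose_encoding(header))
```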

Full end-to-end Brotli compression support

End user cannot support Brotli compression

Customers can implement Brotli compression at their origin by referring to the appropriate online materials. For example, customers using NGINX can implement Brotli by following this tutorial and setting compression to level 11 within the nginx.conf configuration file as follows:

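# Compress responses on the fly with Brotli at the maximum level.
brotli on;
brotli_comp_level 11;
# Also serve pre-compressed .br files when they exist.
brotli_static on;
# MIME types that should be compressed.
brotli_types text/plain text/css application/javascript application/x-javascript text/xml
             application/xml application/xml+rss text/javascript image/x-icon
             image/vnd.microsoft.icon image/bmp image/svg+xml;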

Cloudflare will then serve these assets to the client at the exact same compression level (11) for the file types listed in brotli_types. This means any SVG or BMP images will be sent to the client compressed at Brotli level 11.

Testing

We applied compression against a simple CSS file, measuring the impact of various compression algorithms and levels. Our goal was to identify potential improvements that users could experience by optimizing compression techniques. These results can be seen in the following table:

Test | Size (bytes) | % reduction of original file (higher is better)
Uncompressed response (no compression used) | 2,747 | -
Cloudflare default Gzip compression (level 8) | 1,121 | 59.21%
Cloudflare default Brotli compression (level 4) | 1,110 | 59.58%
Compressed with max Gzip level (level 9) | 1,121 | 59.21%
Compressed with max Brotli level (level 11) | 909 | 66.94%

By compressing with Brotli at level 11, users are able to reduce their file sizes by 19% compared to the best Gzip compression level. Additionally, a file compressed at the strongest Brotli level is around 18% smaller than one compressed at the default level used by Cloudflare. This highlights a significant size reduction achieved by utilizing Brotli compression, particularly at its highest levels, which can lead to improved website performance, faster page load times and an overall reduction in egress fees.
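The 19% and 18% figures are relative savings between two compressed sizes, not reductions from the uncompressed file. A short calculation using the byte counts from the test results makes this explicit (the helper name is ours):

```python
def relative_saving(baseline_bytes, candidate_bytes):
    """Percentage by which candidate is smaller than baseline."""
    return (baseline_bytes - candidate_bytes) / baseline_bytes * 100


GZIP_MAX = 1121        # max Gzip (level 9)
BROTLI_DEFAULT = 1110  # Cloudflare default Brotli (level 4)
BROTLI_MAX = 909       # max Brotli (level 11)

vs_gzip = relative_saving(GZIP_MAX, BROTLI_MAX)
vs_default = relative_saving(BROTLI_DEFAULT, BROTLI_MAX)

print(f"Brotli 11 vs max Gzip:       {vs_gzip:.1f}% smaller")
print(f"Brotli 11 vs Brotli level 4: {vs_default:.1f}% smaller")
```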

To take advantage of higher end-to-end compression rates, the following Cloudflare proxy features need to be disabled:

  • Email Obfuscation
  • Rocket Loader
  • Server Side Excludes (SSE)
  • Mirage
  • HTML Minification – JavaScript and CSS can be left enabled.
  • Automatic HTTPS Rewrites

This is due to Cloudflare needing to decompress and access the body to apply the requested settings. Alternatively, a customer can disable these features for specific paths using Configuration Rules.

If any of these rewrite features are enabled, your origin can still send Brotli compression at higher levels. However, we will decompress, apply the Cloudflare feature(s) enabled, and recompress on the fly using Cloudflare’s default Brotli level 4 or Gzip level 8 depending on the user's accept-encoding header.

For browsers that do not accept Brotli compression, we will continue to decompress and send Gzip-compressed or uncompressed responses.

Implementation

The initial step towards implementing Brotli from the origin involved constructing a decompression module that could be integrated into Cloudflare’s software stack. It allows us to efficiently convert the compressed bits received from the origin into the original, uncompressed file. This step was crucial as numerous features, such as Email Obfuscation and Cloudflare Workers, rely on accessing the body of a response to apply customizations.

We integrated the decompressor into the core reverse web proxy of Cloudflare. This integration ensured that all Cloudflare products and features could access Brotli decompression effortlessly. This also allowed our Cloudflare Workers team to incorporate Brotli directly into Cloudflare Workers, so our Workers customers can interact with responses returned in Brotli or pass them through to the end user unmodified.

Introducing Compression rules – Granular control of compression to end users

By default, Cloudflare compresses certain content types based on the Content-Type header of the file. Today we are also announcing Compression Rules for our Enterprise customers. With Compression Rules, you gain enhanced control over Cloudflare's compression capabilities, enabling you to customize how and which content Cloudflare compresses to optimize your website's performance.

For example, by using Cloudflare's Compression Rules for .ktx files, customers can optimize the delivery of textures in WebGL applications, enhancing the overall user experience. Enabling compression minimizes the bandwidth usage and ensures that WebGL applications load quickly and smoothly, even when dealing with large and detailed textures.

Alternatively, customers can disable compression or specify a preference for how we compress. Another example could be an infrastructure company that only wants to support Gzip for its IoT devices but allows Brotli compression for all other hostnames.

Compression Rules use the filters that our other rules products are built on, with the added fields of Media Type and Extension type, allowing you to easily specify the content you wish to compress.

Deprecating the Brotli toggle

Brotli has been supported by some web browsers since 2016, and Cloudflare has offered Brotli support since 2017. As with all new web technologies, Brotli was unknown at first, so we gave customers the ability to selectively enable or disable Brotli via the API and our UI.

Now that Brotli has evolved and is supported by all browsers, we plan to enable Brotli on all zones by default in the coming months, mirroring the Gzip behavior we currently support, and to remove the toggle from our dashboard. If browsers do not support Brotli, Cloudflare will continue to support their accepted encoding types such as Gzip or uncompressed, and Enterprise customers will still be able to use Compression Rules to granularly control how we compress data towards their users.

The future of web compression

We've seen great adoption and great performance for Brotli as the new compression technique for the web. Looking forward, we are closely following trends and new compression algorithms such as zstd as a possible next-generation compression algorithm.

At the same time, we're looking to improve Brotli directly where we can. One development that we're particularly focused on is shared dictionaries with Brotli. Whenever you compress an asset, you use a "dictionary" that helps the compression to be more efficient. A simple analogy of this is typing OMW into an iPhone message. The iPhone will automatically translate it into On My Way using its own internal dictionary.

O M W
O n M y W a y

This internal dictionary has taken three characters and expanded them into nine characters (including spaces). The internal dictionary has saved six characters, which equals performance benefits for users.

By default, the Brotli RFC defines a static dictionary that both clients and origin servers use. The static dictionary was designed to be general purpose and apply to everyone, with its size optimized so that it is not too large while still generating the best compression results. However, what if an origin could generate a bespoke dictionary tailored to a specific website? For example, a Cloudflare-specific dictionary would allow us to compress the words and phrases that appear repeatedly on our site, such as the word “Cloudflare”. The bespoke dictionary would be designed to compress this as heavily as possible, and the browser using the same dictionary would be able to translate this back.
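To make the idea of a bespoke shared dictionary concrete, here is a small sketch. It uses the preset-dictionary feature of zlib (DEFLATE) from the Python standard library as a stand-in, because Brotli is not in the standard library. It is not Brotli and not the mechanism in the proposal, and the dictionary and page contents are invented for the example.

```python
import zlib


def compress(data, dictionary=None):
    """Compress data, optionally priming the compressor with a shared dictionary."""
    if dictionary is None:
        compressor = zlib.compressobj(level=9)
    else:
        compressor = zlib.compressobj(level=9, zdict=dictionary)
    return compressor.compress(data) + compressor.flush()


def decompress(blob, dictionary=None):
    """Decompress data; the receiver must hold the same dictionary as the sender."""
    if dictionary is None:
        decompressor = zlib.decompressobj()
    else:
        decompressor = zlib.decompressobj(zdict=dictionary)
    return decompressor.decompress(blob) + decompressor.flush()


# Hypothetical site-specific dictionary: markup that repeats on every page.
SITE_DICTIONARY = (
    b"<!DOCTYPE html><html><head><title>Cloudflare</title></head><body><h1>Cloudflare"
)
page = (
    b"<!DOCTYPE html><html><head><title>Cloudflare</title></head><body><h1>Cloudflare"
    b" Pages</h1></body></html>"
)

plain = compress(page)
shared = compress(page, SITE_DICTIONARY)
print(len(page), len(plain), len(shared))
```

Because the repeated markup is already in the dictionary, the compressor can refer to it instead of spelling it out, so the dictionary-primed output is smaller, provided the receiver holds the same dictionary.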

A new proposal by the Web Incubator CG aims to do just that, allowing you to specify your own dictionaries that browsers can use to allow websites to optimize compression further. We're excited about contributing to this proposal and plan on publishing our research soon.

Try it now

Compression Rules are available now, with end-to-end Brotli being rolled out over the coming weeks, allowing you to improve performance, reduce bandwidth and granularly control how Cloudflare handles compression to your end users.

Making Cloudflare Pages the fastest way to serve your sites

Post Syndicated from Sid Chatterjee original http://blog.cloudflare.com/how-we-decreased-pages-latency/

In an era where visitors expect instant gratification and content on-demand, every millisecond counts. If you’re a web application developer, it’s an excellent time to be in this line of business, but with great power comes great responsibility. You’re tasked with creating an experience that is not only intuitive and delightful but also quick, reactive and responsive – sometimes with the two sides being at odds with each other. To add to this, if your business completely runs on the internet (say ecommerce), then your site’s Core Web Vitals could make or break your bottom line.

You don’t just need fast – you need magic fast. For the past two years, Cloudflare Pages has been serving up performant applications for users across the globe, but this week, we’re showing off our brand new, lightning fast architecture, decreasing the TTFB by up to 10X when serving assets.

And while a magician never reveals their secrets, this trick is too good to keep to ourselves. For all our application builders, we’re thrilled to share the juicy technical details on how we adopted Workers for Platforms — our extension of Workers to build SaaS businesses on top of — to make Pages one of the fastest ways to serve your sites.

The problem

When we launched Pages in 2021, we didn’t anticipate the exponential growth we would experience for our platform in the months and years to come. As our users began to adopt Pages into their development workflows, usage of our platform began to skyrocket. However, while riding the high of Pages’ success, we began to notice a problem – a rather large one. As projects grew in size, with every deployment came a pinch more latency, slowly affecting the end users visiting the Pages site. Customers with tens of thousands of deployments were at risk of introducing latency to their site – a problem that needed to be solved.

Before we dive into our technical solution, let’s first explore further the setup of Pages and the relationship between number of deployments and the observed latency.

How could this be?

Built on top of Cloudflare Workers, Pages serves static assets through a highly optimised Worker. We refer to this as the Asset Server Worker.

Users can also add dynamic content through Pages Functions which eventually get compiled into a separate Worker. Every single Pages deployment corresponds to unique instances of these Workers composed in a pipeline.

When a request hits Cloudflare we need to look up which pipeline to execute. As you’d expect, this is a function of the hostname in the URL.

If a user requested https://2b469e16.example.pages.dev/index.html, the hostname is 2b469e16.example.pages.dev which is unique across all deployments on Pages — 2b469e16 is typically the commit hash and example in this case refers to the name of the project.

Every Pages project has its own routing table which is used to look up the pipeline to execute. The routing table happens to be a JSON object with a list of regexes for possible paths in that project (in our case, one for every deployment) and their corresponding pipelines.

The script_hash in the example below refers to the pipeline identifier. Naming is hard indeed.

{
 "filters": [
   {
     "pattern": "^(?:2b469e16.example.pages.dev(?:[:][0-9]+)?\\/(?<p1>.*))$",
     "script_hash": "..."
   },
   {
     "pattern": "^(?:example.pages.dev(?:[:][0-9]+)?\\/(?<p1>.*))$",
     "script_hash": "..."
   },
   {
     "pattern": "^(?:test.example.com(?:[:][0-9]+)?\\/(?<p1>.*))$",
     "script_hash": "..."
   }
 ],
 "v": 1
}

So to look up the pipeline in question, we would download this JSON object from Quicksilver, parse it, and then iterate through it until we found a regex that matched the current request.
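In rough terms, the legacy lookup behaves like the sketch below. It is an approximation for illustration only: the patterns are rewritten in Python's named-group syntax, the pipeline identifiers are made up, and the real system fetches the table from Quicksilver rather than from a local string.

```python
import json
import re

# Stand-in for the routing table fetched from the key-value store.
ROUTING_TABLE_JSON = json.dumps({
    "filters": [
        {"pattern": r"^(?:2b469e16\.example\.pages\.dev(?:[:][0-9]+)?/(?P<p1>.*))$",
         "script_hash": "pipeline-preview"},
        {"pattern": r"^(?:example\.pages\.dev(?:[:][0-9]+)?/(?P<p1>.*))$",
         "script_hash": "pipeline-production"},
        {"pattern": r"^(?:test\.example\.com(?:[:][0-9]+)?/(?P<p1>.*))$",
         "script_hash": "pipeline-custom-domain"},
    ],
    "v": 1,
})


def find_pipeline(routing_table_json, host_and_path):
    """Parse the whole table, then test patterns one by one: O(n) per request."""
    table = json.loads(routing_table_json)  # cost grows with table size
    for entry in table["filters"]:          # one regex test per deployment
        if re.match(entry["pattern"], host_and_path):
            return entry["script_hash"]
    return None


print(find_pipeline(ROUTING_TABLE_JSON, "example.pages.dev/index.html"))
```

Both the parse and the scan grow with the number of deployments, which is where the added latency comes from.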

Unsurprisingly, this is expensive. Let’s take a look at a quick real world example to see how expensive.

In one realistic case, it took us 107ms just to parse the JSON. The larger the JSON object gets, the more compute it takes to parse it — with tens of thousands of deployments (not unusual for very active projects that deploy immutable preview deployments for every git commit), this JSON could be several megabytes in size!

It doesn’t end there though. After parsing this, it took 29ms to then iterate and test several regexes to find the one that matched the current request.

To summarise, every single request to this project would take 136ms to just pick the right pipeline to execute. While this was the median case for projects with 10,000 deployments on average, we’ve seen projects with seconds in added latency making them unusable after 50,000 deployments, punishing users for using our platform.

Given most web sites load more than one asset for a page, this leads to timeouts and breakage, resulting in an unstable and unacceptable user experience.

The secret sauce is Workers for Platforms

We launched Workers for Platforms last year as a way to build ambitious platforms on top of Workers. Workers for Platforms lets one build complex pipelines where a request may be served by a Worker built and maintained by you but could then dispatch to a Worker written by a user of your platform. This allows your platform’s users to write their own Workers as they are used to, while you control how and when they are executed.

This isn’t very different from what we do with Pages. Users write their Pages functions which compile into a Worker. Users also upload their own static assets which are then bound to our special Asset Server Worker in unique pipelines for each of their deployments. And we control how and when which Worker gets executed based on a hostname in their URL.

Runtime lookups shouldn’t be O(n), though; they should be O(1). Because Workers for Platforms was designed for building entire platforms on top of it, lookups when dispatching to a user’s Worker were designed to be O(1), ensuring latency wasn’t a function of the number of Workers in an account.

The solution

By default, Workers for Platforms hashes the name of the Worker with a secret and uses that for lookups at runtime. However, because we need to dispatch by hostname, we had a different idea. At deployment time, we could hash the pipeline for the deployment by its hostname — 2b469e16.example.pages.dev, for example.

When a request comes in, we hash the hostname from the URL with our predefined secret and then use that value to look up the pipeline to execute. This entirely removes the necessity to fetch, parse and traverse the routing table JSON from before, now making our lookup O(1).
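The hashed lookup can be sketched as a keyed hash of the hostname used as a dictionary key. HMAC-SHA256, the secret value and the in-memory dictionary below are assumptions chosen for the illustration. The post does not say which hash function is used or how the mapping is stored.

```python
import hashlib
import hmac

ROUTING_SECRET = b"not-the-real-secret"  # placeholder for the predefined secret

# Stand-in for the store that maps hashed hostnames to pipelines.
pipelines = {}


def routing_key(hostname):
    """Keyed hash of the hostname; assumed here to be HMAC-SHA256."""
    return hmac.new(ROUTING_SECRET, hostname.lower().encode("utf-8"), hashlib.sha256).hexdigest()


def register_deployment(hostname, pipeline_id):
    """Done once at deployment time."""
    pipelines[routing_key(hostname)] = pipeline_id


def find_pipeline(hostname):
    """Done on every request: one hash plus one key lookup, independent of table size."""
    return pipelines.get(routing_key(hostname))


register_deployment("2b469e16.example.pages.dev", "pipeline-preview")
register_deployment("example.pages.dev", "pipeline-production")
print(find_pipeline("example.pages.dev"))
```

Unlike the regex scan, the cost per request stays the same whether a project has ten deployments or fifty thousand.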

Once we were happy with our new setup from internal testing, we wanted to onboard a real user. Our Developer Docs have been running on Pages since the start of 2022 and during that time, we’ve dogfooded many different features and experiments. Given the great relationship between our two teams, and the Docs being a sizable customer of ours, we wanted to bring them onto our new Workers for Platforms routing.

Before opting them in, TTFB was averaging at about 600ms.

After opting them in, TTFB is now 60ms. Web Analytics shows a noticeable drop in entire page load time as a result.

This improvement was also visible through Lighthouse scores now approaching a perfect score of 100 instead of 78 which was the average we saw previously.

The team was ecstatic about this, especially given all of this happened under the hood with no downtime and no engineering work required on their end. Not only is https://developers.cloudflare.com/ faster, we’re using less compute to serve it to all of you.

The big migration

Migrating developers.cloudflare.com was a big milestone for us and meant our new infrastructure was capable of handling traffic at scale. But a goal we were very certain of was migrating every Pages deployment ever created. We didn’t want to leave any users behind.

Turns out, that wasn’t a small number. There’d been over 14 million deployments so far over the years. This was about to be one of the biggest migrations we’d done to runtime assets and the risk was that we’d take down someone’s site.

We approached this migration with some key goals:

  • Customer impact in terms of downtime was a no go, all of this needed to happen under the hood without anyone’s site being affected;
  • We needed the ability to A/B test the old and new setup so we could revert on a per site basis if something went wrong or was incompatible;
  • Migrations at this scale have the ability to cause incidents because they exceed the typical request capacity of our APIs in a short window so they need to run slowly;
  • Because this was likely to be a long running migration, we needed the ability to look at metrics and retry failures.

The first step to all of this was to add the ability to A/B test between the legacy setup and the new one. To be able to switch between the two at any time, we needed to deploy both a regular pipeline (with an updated routing table) and a new Workers for Platforms hashed pipeline for every deployment.

We also added a feature flag that allowed us to route to either the legacy setup or the new one per site or per data centre, with the ability to explicitly opt out a site when an edge case didn’t work.

With this setup, we started running our long running migration behind the scenes that duplicated every single deployment to the new Workers for Platforms enabled pipelines.

Duplicating them instead of replacing them meant that risk was low and A/B testing would remain possible, with the tradeoff of more cleanup after we finished. We picked that approach with reliability for users in mind.

A few days later, after all 14 million deployments had finished migrating over, we started the rollout to the new infrastructure with a percentage-based rollout. This was a great way for us to find issues and ensure we were ready to serve all runtime traffic for Pages without the risk of an incident.
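A percentage-based rollout with a per-site opt-out can be sketched as below. This is a generic illustration, not the actual feature flag: the bucketing scheme (SHA-256 of the hostname, modulo 100) and all names are assumptions, and the per-data-centre dimension mentioned earlier is left out for brevity.

```python
import hashlib


def rollout_bucket(hostname):
    """Deterministically map a site to a bucket in the range 0..99."""
    digest = hashlib.sha256(hostname.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % 100


def use_new_routing(hostname, rollout_percent, opted_out=frozenset()):
    """Route a site to the new setup if it is inside the rollout and not opted out."""
    if hostname in opted_out:
        return False  # explicit per-site escape hatch
    return rollout_bucket(hostname) < rollout_percent


sites = ["example.pages.dev", "docs.example.com", "shop.example.org"]
opted_out = frozenset({"shop.example.org"})
for percent in (0, 50, 100):
    print(percent, [use_new_routing(site, percent, opted_out) for site in sites])
```

Because the bucket depends only on the hostname, a site that has moved to the new routing stays there as the percentage increases, and reverting a single site only requires adding it to the opt-out set.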

Feeding three birds with one scone

Alongside the significant latency improvements for Pages projects, this change also gave improvements in other areas:

  • Lower CPU usage – Since we no longer need to parse a huge JSON blob and do potentially thousands of regex matches, we saved a nice amount of CPU time across thousands of machines across our data centres.
  • Higher LRU hit rate – We have LRU caches for things we fetch from Quicksilver, both to reduce load on Quicksilver and to improve performance. However, with the large routing tables we had previously, we could easily fill up this cache with one or just a few routing tables. Now that we have turned these into tiny single-entry JSONs, we have improved the cache hit rate for all Workers.
  • Quicksilver storage reduction – We also reduced the storage we take up with our routing tables by 92%. This is a reduction of approximately 12 GiB in each of our hundreds of data centres.
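The LRU effect is easy to reproduce with a toy model. The sketch below assumes a cache bounded by total bytes; the class, the sizes, and the workload are all made up for illustration and say nothing about how our real caches are configured.

```python
from collections import OrderedDict


class ByteBoundedLRU:
    """Toy LRU cache limited by total entry size rather than entry count."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> size in bytes

    def get(self, key):
        if key not in self._entries:
            return False
        self._entries.move_to_end(key)  # mark as most recently used
        return True

    def put(self, key, size_bytes):
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)
        self._entries[key] = size_bytes
        self.used_bytes += size_bytes
        # Evict least recently used entries until we fit again.
        while self.used_bytes > self.capacity_bytes and self._entries:
            _, evicted_size = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size


def hit_rate(cache, accesses):
    """Replay (key, size) accesses, filling the cache on every miss."""
    hits = 0
    for key, size in accesses:
        if cache.get(key):
            hits += 1
        else:
            cache.put(key, size)
    return hits / len(accesses)


def workload(table_size):
    """Five small entries plus one routing entry per round (made-up sizes)."""
    accesses = []
    for round_number in range(4):
        accesses += [(f"worker-{i}", 10) for i in range(5)]
        accesses.append((f"table-{round_number % 2}", table_size))
    return accesses


large_tables = hit_rate(ByteBoundedLRU(1000), workload(table_size=990))
tiny_entries = hit_rate(ByteBoundedLRU(1000), workload(table_size=20))
print(f"large tables: {large_tables:.2f}, tiny entries: {tiny_entries:.2f}")
```

In this toy workload, a single routing table that nearly fills the cache keeps pushing out the small entries around it, while tiny routing entries leave room for everything to stay resident.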

We’re just getting started

Pages is now the fastest way to serve your sites across Netlify, Vercel and many others and we’re so proud.

But it’s going to get even faster. With projects like Flame, we can’t wait to shave many more milliseconds off every request a user makes to your site.

To a faster web for all of us.

Speeding up APIs with Ricochet for API Gateway

Post Syndicated from John Cosgrove original http://blog.cloudflare.com/speeding-up-apis-ricochet-for-api-gateway/

APIs form the backbone of communication between apps and services on the Internet. They are a quick way for an application to ask for data or ask that a task be performed by a service. For example, anyone can write a weather app without being a meteorologist: simply ask a weather API for the forecast and display it in your app.

Speed is inherent to the API use case. Rather than transferring bulky files like images and HTML, APIs only share the essential data needed to render a webpage or an app. However, despite their efficiency, Internet latency can still impede API data transfers. If the server processing a user’s API request is located far from that user, the network round trip time can degrade that user’s experience.

Cloudflare's global network is specifically designed to optimize and accelerate Internet traffic, including APIs. Our users enjoy features like 11 ms DNS responses, load balancing, and Argo Smart Routing, which significantly improve API traffic speed. For web content, Cloudflare customers have always been able to cache their web traffic, serving requests from the closest data center and thereby reducing network round trip time and server processing time to a bare minimum. Now, we are leveraging these benefits to enhance API traffic in exciting new ways.

Today we’re announcing Ricochet for API Gateway, the easiest way for Cloudflare customers to achieve faster API responses. Customers using Cloudflare’s API Gateway will be able to enable Ricochet for their API endpoints and automatically reduce average latency through intelligent caching of API requests that would otherwise go to origin. Ricochet will even work for things you previously thought un-cacheable, like GraphQL POST requests. Best of all, there are no changes to make at your origin. Just configure API Gateway with your API session identifiers and leave the rest to us.

Enabling Ricochet for your APIs will cause Cloudflare to cache many of the basic, repetitive API calls from your applications and deliver them to users faster than ever. At first, your product metrics might even look broken with lower latency and fewer requests at origin. But these metrics will be a new sign of success, reflecting your app’s new speedy user experience.

Why you should cache API responses

It isn’t news that page load times directly correlate with the dollar spend of site visitors. Organizations have spent the last decade obsessing over static content optimization to deliver websites quicker every year. Faster apps result in higher business metrics for most websites. For example, faster sites receive more sales orders and have higher customer loyalty. Faster sites are also critical for engagement during marketing campaigns. Caching API requests will make your sites and apps faster by lowering the amount of time required to populate your app with data for your users.

The tools for caching APIs have always been available. But why isn’t it more common? We hypothesize a few reasons. It could be that API developers assume that APIs are too dynamic to cache, and that it only helps after lots of analysis of the application’s performance. It could also be that the security concerns around caching APIs are non-trivial. If both of those are true, you can imagine how caching APIs would only be successful with a large cross-organizational approach and lots of effort.

Let’s say your organization decided to try API caching anyway. We know there are a few problems if you want to cache APIs. First, traditional caching methods aren’t necessarily ready out of the box for caching APIs; cache invalidation needs to happen quickly and automatically. Second, special standalone cache tooling exists, but it doesn’t help you if it’s placed next to the origin and your users are globally distributed. Lastly, it’s hard to get security right when caching user data. It’s strictly forbidden to serve one user’s data to another by accident. Cloudflare has superpowers in these areas: knowledge of origin response time and existing cache-hit ratios, customer-configured API session IDs to establish secure user association with API requests, and the scale of our global network. We’re bringing together API Management with our robust caching infrastructure to safely and automatically cache API requests.

Cloudflare’s unique approach

The HTTP methods POST, PUT, and DELETE aren't generally cacheable. These methods are meant to change information at the origin on a one-time basis, and are therefore “non-safe”. If you responded to a non-safe request from cache, the data on the API server would not be updated. Compare this with “safe” HTTP methods: GET, OPTIONS, and HEAD. Caching requests with safe methods is straightforward as data at the origin does not change per-request.

But what do we know about RESTful APIs that can make caching easier? The endpoints usually stay the same when operating on objects, but the methods change. We will enable caching for safe methods and then automatically invalidate the cache when we see a non-safe method request for a RESTful endpoint managed by API Gateway. It’s also possible you have updates on one endpoint that change another endpoint’s data. In that event, we urge you to consider whether API Gateway’s short default TTL timers fit your use case by allowing a small delay between updating data at the origin and serving that update from cache. Check out the below diagram for an example of how automatic cache invalidation would work for shared paths with different methods:

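The invalidation rule described above can be sketched in a few lines. This is a simplified model, not Ricochet's implementation: `EndpointCache`, its `handle` method, and the `fetch_from_origin` callback are hypothetical, and TTLs and per-user cache keys are left out to keep the focus on invalidation.

```python
SAFE_METHODS = {"GET", "HEAD", "OPTIONS"}


class EndpointCache:
    """Cache safe requests per endpoint; purge them on any non-safe request."""

    def __init__(self):
        self._store = {}  # (method, path) -> cached response

    def handle(self, method, path, fetch_from_origin):
        method = method.upper()
        key = (method, path)
        if method in SAFE_METHODS:
            if key in self._store:
                return self._store[key], "HIT"
            response = fetch_from_origin(method, path)
            self._store[key] = response
            return response, "MISS"
        # Non-safe methods always go to origin, then invalidate every
        # cached response that shares the same path.
        response = fetch_from_origin(method, path)
        for cached_key in [k for k in self._store if k[1] == path]:
            del self._store[cached_key]
        return response, "BYPASS"
```

With this model, a `GET` after a `PUT` to the same path is a cache miss and picks up the updated data from origin. Updates that change data behind a different path are not caught, which is where short TTLs come in.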

Even for safe requests, caching API data is risky when it comes to security. Done incorrectly, you could accidentally serve sensitive user data to the wrong person. That’s why Ricochet’s cache key includes the user’s API session identifier. This extra information in the cache key ensures that only an authorized user is able to receive their own cached data. For APIs without authentication, we hash the request parameters themselves and include that hash in the cache key, to ensure the correct data is returned for endpoints with static paths but variable inputs. Here’s an example for authenticated APIs:


And here’s an example for anonymous APIs, where we use a hash of the request body to preserve privacy and still enable unique, useful cache keys:

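In code, the two cases might be modelled as follows. The key layout, the `cache_key` function, and the choice to hash the session identifier with SHA-256 are assumptions made for this sketch; the actual key format is not described here.

```python
import hashlib


def cache_key(method, path, session_id=None, body=b""):
    """Build a cache key that separates users (or request inputs).

    Authenticated requests are keyed by a hash of the session identifier,
    anonymous requests by a hash of the request body. Simplification: an
    authenticated request's body is not part of the key in this sketch.
    """
    parts = [method.upper(), path]
    if session_id is not None:
        session_hash = hashlib.sha256(session_id.encode("utf-8")).hexdigest()
        parts.append("session:" + session_hash)
    else:
        parts.append("body:" + hashlib.sha256(body).hexdigest())
    return "|".join(parts)


# Two users hitting the same endpoint never share a cache entry.
alice = cache_key("GET", "/api/profile", session_id="session-alice")
bob = cache_key("GET", "/api/profile", session_id="session-bob")

# Anonymous GraphQL-style POSTs are separated by their request body.
query_a = cache_key("POST", "/graphql", body=b'{"query": "{ weather(city: 1) }"}')
query_b = cache_key("POST", "/graphql", body=b'{"query": "{ weather(city: 2) }"}')
```

Hashing keeps raw session tokens and request bodies out of the cache key itself while still producing a distinct key for each user or input.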

APIs can be ripe for caching

There are many API caching use cases out there, and we decided to start with two where our customers have asked us for help so far: mixed-authentication APIs where returning the correct data is critical, and APIs that have single endpoints that can return varied query results (think RESTful endpoints with variable inputs and GraphQL POST requests). These use cases include things like weather forecasts and current conditions, airline flight tracking and flight status, live sports scores, and collaborative tools with many users.

Even short cache control timers can be beneficial to reduce load at the origin and speed up responses. Consider an example of a popular public endpoint receiving 1,000 requests/second, or 60,000 requests/min at origin. Let’s assume the data at the origin changes unpredictably but not due to unique user interaction. For this use case, reporting stale data for a few seconds or even a minute could be acceptable, and most users wouldn’t know or mind the difference. You could set your cache control to a very low 1 second and serve 999 requests/second from cache. That would reduce the origin requests to only 60 requests/minute!
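The arithmetic behind that example, written out as a small helper. It assumes steady traffic to a single cached object that is refreshed from origin once per TTL window; `origin_load` is a name made up for this sketch.

```python
def origin_load(requests_per_second, ttl_seconds):
    """Return (origin req/s, cached req/s, origin req/min) for one object."""
    # At most one request per TTL window has to reach origin.
    origin_rps = min(requests_per_second, 1.0 / ttl_seconds)
    cached_rps = requests_per_second - origin_rps
    return origin_rps, cached_rps, origin_rps * 60


origin_rps, cached_rps, origin_rpm = origin_load(1000, 1)
print(origin_rps, cached_rps, origin_rpm)  # 1.0 999.0 60.0
```

Plugging in a longer TTL shows how quickly origin load falls off: with the same traffic and a 10-second TTL, origin would see about 6 requests per minute.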

This is a simple example, and we urge you to think about your API and the potential performance improvements caching could bring.

Potential impact of caching APIs

We profiled five top global airline website APIs for flight status checks and compared their retrieval time against the airline logo’s retrieval time. Here are the results:

All five airlines saw on average a ~7x slowdown for data that could easily be cached!

Successful caching will also lower load on your API origin. The hidden benefit of lower load at origin is faster responses for the requests that miss the cache and do end up hitting origin. So it’s two-for-the-price-of-one, and latency decreases all around. We aren’t going to reinvent your API overnight, but we are going to make a difference in your application’s response times by making it easy to add caching to your APIs.

Conclusion

We're launching Ricochet in 2024. It’s going to make a measurable difference speeding up your APIs, and the best part is that as an API Gateway customer it will be easy to get started without requiring tons of your team’s time. Let your account team know if you’d like to be on our waitlist for this feature. Our goal is to increase the amount of API caching on the Internet so that we can all benefit from faster response times and snappier apps.

Introducing the Cloudflare Radar Internet Quality Page

Post Syndicated from David Belson original http://blog.cloudflare.com/introducing-radar-internet-quality-page/

Internet connections are most often marketed and sold on the basis of "speed", with providers touting the number of megabits or gigabits per second that their various service tiers are supposed to provide. This marketing has largely been successful, as most subscribers believe that "more is better". Furthermore, many national broadband plans in countries around the world include specific target connection speeds. However, even with a high speed connection, gamers may encounter sluggish performance, while video conference participants may experience frozen video or audio dropouts. Speeds alone don't tell the whole story when it comes to Internet connection quality.

Additional factors like latency, jitter, and packet loss can significantly impact end user experience, potentially leading to situations where higher speed connections actually deliver a worse user experience than lower speed connections. Connection performance and quality can also vary based on usage – measured average speed will differ from peak available capacity, and latency varies under loaded and idle conditions.

The new Cloudflare Radar Internet Quality page

A little more than three years ago, as residential Internet connections were strained because of the shift towards working and learning from home due to the COVID-19 pandemic, Cloudflare announced the speed.cloudflare.com speed test tool, which enabled users to test the performance and quality of their Internet connection. Within the tool, users can download the results of their individual test as a CSV, or share the results on social media. However, there was no aggregated insight into Cloudflare speed test results at a network or country level to provide a perspective on connectivity characteristics across a larger population.

Today, we are launching these long-missing aggregated connection performance and quality insights on Cloudflare Radar. The new Internet Quality page provides both country and network (autonomous system) level insight into Internet connection performance (bandwidth) and quality (latency, jitter) over time. (Your Internet service provider is likely an autonomous system with its own autonomous system number (ASN), and many large companies, online platforms, and educational institutions also have their own autonomous systems and associated ASNs.) The insights we are providing are presented across two sections: the Internet Quality Index (IQI), which estimates average Internet quality based on aggregated measurements against a set of Cloudflare & third-party targets, and Connection Quality, which presents peak/best case connection characteristics based on speed.cloudflare.com test results aggregated over the previous 90 days. (Details on our approach to the analysis of this data are presented below.)

Users may note that individual speed test results, as well as the aggregate speed test results presented on the Internet Quality page will likely differ from those presented by other speed test tools. This can be due to a number of factors including differences in test endpoint locations (considering both geographic and network distance), test content selection, the impact of “rate boosting” by some ISPs, and testing over a single connection vs. multiple parallel connections. Infrequent testing (on any speed test tool) by users seeking to confirm perceived poor performance or validate purchased speeds will also contribute to the differences seen in the results published by the various speed test platforms.

And as we announced in April, Cloudflare has partnered with Measurement Lab (M-Lab) to create a publicly-available, queryable repository for speed test results. M-Lab is a non-profit third-party organization dedicated to providing a representative picture of Internet quality around the world. M-Lab produces and hosts the Network Diagnostic Tool, which is a very popular network quality test that records millions of samples a day. Given their mission to provide a publicly viewable, representative picture of Internet quality, we chose to partner with them to provide an accurate view of your Internet experience and the experience of others around the world using openly available data.

Connection speed & quality data is important

While most advertisements for fixed broadband and mobile connectivity tend to focus on download speeds (and peak speeds at that), there’s more to an Internet connection, and the user’s experience with that Internet connection, than that single metric. In addition to download speeds, users should also understand the upload speeds that their connection is capable of, as well as the quality of the connection, as expressed through metrics known as latency and jitter. Getting insight into all of these metrics provides a more well-rounded view of a given Internet connection, or in aggregate, the state of Internet connectivity across a geography or network.

The concept of download speed is fairly well understood as a measure of performance. However, it is important to note that the average download speeds experienced by a user during common Web browsing activities, which often involve the parallel retrieval of multiple smaller files from multiple hosts, can differ significantly from peak download speeds, where the user is downloading a single large file (such as a video or software update), which allows the connection to reach maximum performance. The bandwidth (speed) available for upload is sometimes mentioned in ISP advertisements, but doesn’t receive much attention. (And depending on the type of Internet connection, there’s often a significant difference between the available upload and download speeds.) However, the importance of upload came to the forefront in 2020 as video conferencing tools saw a surge in usage as both work meetings and school classes shifted to the Internet during the COVID-19 pandemic. To share your audio and video with other participants, you need sufficient upload bandwidth, and this issue was often compounded by multiple people sharing a single residential Internet connection.

Latency is the time it takes data to move through the Internet, and is measured in the number of milliseconds that it takes a packet of data to go from a client (such as your computer or mobile device) to a server, and then back to the client. In contrast to speed metrics, lower latency is preferable. This is especially true for use cases like online gaming where latency can make a difference between a character’s life and death in the game, as well as video conferencing, where higher latency can cause choppy audio and video experiences, but it also impacts web page performance. The latency metric can be further broken down into loaded and idle latency. The former measures latency on a loaded connection, where bandwidth is actively being consumed, while the latter measures latency on an “idle” connection, when there is no other network traffic present. (These specific loaded and idle definitions are from the device’s perspective, and more specifically, from the speed test application’s perspective. Unless the speed test is being performed directly from a router, the device/application doesn't have insight into traffic on the rest of the network.) Jitter is the average variation found in consecutive latency measurements, and can be measured on both idle and loaded connections. A lower number means that the latency measurements are more consistent. As with latency, Internet connections should have minimal jitter, which helps provide more consistent performance.
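As a rough illustration of the jitter definition, the sketch below takes a series of round-trip latency samples and averages the absolute difference between consecutive measurements. Reading "average variation" as the mean absolute difference is an assumption of this sketch, and `jitter_ms` is a made-up helper name.

```python
from statistics import mean


def jitter_ms(latencies_ms):
    """Mean absolute difference between consecutive latency samples."""
    if len(latencies_ms) < 2:
        raise ValueError("need at least two latency samples")
    return mean(abs(later - earlier)
                for earlier, later in zip(latencies_ms, latencies_ms[1:]))


steady = jitter_ms([20, 21, 20, 21])    # consistent connection
erratic = jitter_ms([20, 60, 15, 80])   # similar best case, very uneven
print(steady, erratic)
```

Both sample series have a similar minimum latency, but the second one varies far more between measurements, which is what a higher jitter value captures.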

Our approach to data analysis

The Internet Quality Index (IQI) and Connection Quality sections get their data from two different sources, providing two different (albeit related) perspectives. Under the hood they share some common principles, though.

IQI builds upon the mechanism we already use to regularly benchmark ourselves against other industry players. It is based on end user measurements against a set of Cloudflare and third-party targets, meant to represent a pattern that has become very common in the modern Internet, where most content is served from distribution networks with points of presence spread throughout the world. For this reason, and by design, IQI will show worse results for regions and Internet providers that rely on international (rather than peering) links for most content.

IQI is also designed to reflect the traffic load most commonly associated with web browsing, rather than more intensive use. This, and the chosen set of measurement targets, effectively biases the numbers towards what end users experience in practice (where latency plays an important role in how fast things can go).

For each metric covered by IQI, and for each ASN, we calculate the 25th percentile, median, and 75th percentile at 15 minute intervals. At the country level and above, the three calculated numbers for each ASN visible from that region are independently aggregated. This aggregation takes the estimated user population of each ASN into account, biasing the numbers away from networks that source a lot of automated traffic but have few end users.
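A simplified version of this two-step calculation is sketched below. The per-ASN quartiles use Python's `statistics.quantiles`; combining them with a user-weighted average is an assumption for illustration, since the exact aggregation is not spelled out here. The function names, sample values, and user counts are made up.

```python
from statistics import quantiles


def quartiles(samples):
    """Return (25th percentile, median, 75th percentile) of the samples."""
    q1, median, q3 = quantiles(samples, n=4, method="inclusive")
    return q1, median, q3


def weighted_country_quartiles(asn_samples, asn_users):
    """Aggregate each per-ASN statistic independently, weighted by users."""
    total_users = sum(asn_users[asn] for asn in asn_samples)
    totals = [0.0, 0.0, 0.0]
    for asn, samples in asn_samples.items():
        weight = asn_users[asn] / total_users
        for index, value in enumerate(quartiles(samples)):
            totals[index] += weight * value
    return tuple(totals)


# Hypothetical latency samples (ms) for one 15-minute interval.
samples = {
    "AS-large-isp": [10, 20, 30, 40, 50],
    "AS-bot-heavy": [110, 120, 130, 140, 150],
}
users = {"AS-large-isp": 900, "AS-bot-heavy": 100}
country_q1, country_median, country_q3 = weighted_country_quartiles(samples, users)
```

Because the weights come from estimated users rather than measurement volume, a network that sends many automated measurements but serves few people has little influence on the country-level numbers.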

The Connection Quality section gets its data from the Cloudflare Speed Test tool, which exercises a user's connection in order to see how well it is able to perform. It measures against the closest Cloudflare location, providing a good balance of realistic results and network proximity to the end user. We have a presence in 285 cities around the world, allowing us to be pretty close to most users.

Similar to the IQI, we calculate the 25th percentile, median, and 75th percentile for each ASN. But here these three numbers are immediately combined using an operation called the trimean — a single number meant to balance the best connection quality that most users have, with the best quality available from that ASN (users may not subscribe to the best available plan for a number of reasons).
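For reference, the trimean (as defined by Tukey) is a weighted average of the three quartiles, with the median counted twice: (Q1 + 2 × median + Q3) / 4. A minimal sketch, where the quantile interpolation method and the sample values are assumptions:

```python
from statistics import quantiles


def trimean(samples):
    """Tukey's trimean: (Q1 + 2 * median + Q3) / 4."""
    q1, median, q3 = quantiles(samples, n=4, method="inclusive")
    return (q1 + 2 * median + q3) / 4


# Hypothetical download speeds in Mbps, with one very fast outlier.
speeds = [10, 20, 30, 40, 100]
print(trimean(speeds))
```

Compared with a plain mean, the trimean is less affected by a handful of extreme results, while still giving some weight to the upper quartile.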

Because users may choose to run a speed test for different motives at different times, and also because we take privacy very seriously and don’t record any personally identifiable information along with test results, we aggregate at 90-day intervals to capture as much variability as we can.

At the country level and above, the calculated trimean for each ASN in that region is aggregated. This, again, takes the estimated user population of each ASN into account, biasing the numbers away from networks that have few end users but which may still have technicians using the Cloudflare Speed Test to assess the performance of their network.

The new Internet Quality page includes three views: Global, country-level, and autonomous system (AS). In line with the other pages on Cloudflare Radar, the country-level and AS pages show the same data sets, differing only in their level of aggregation. Below, we highlight the various components of the Internet Quality page.

Global

The top section of the global (worldwide) view includes time series graphs of the Internet Quality Index metrics aggregated at a continent level. The time frame shown in the graphs is governed by the selection made in the time frame drop-down at the upper right of the page, and at launch, data for only the last three months is available. For users interested in examining a specific continent, clicking on the other continent names in the legend removes them from the graph. Although continent-level aggregation is rather coarse, it still provides some insight into regional Internet quality around the world.

Further down the page, the Connection Quality section presents a choropleth map, with countries shaded according to the values of the speed, latency, or jitter metric selected from the drop-down menu. Hovering over a country displays a label with the country’s name and metric value, and clicking on the country takes you to the country’s Internet Quality page. Note that in contrast to the IQI section, the Connection Quality section always displays data aggregated over the previous 90 days.

Country-level

Within the country-level page (using Canada as an example in the figures below), the country’s IQI metrics over the selected time frame are displayed. These time series graphs show the median bandwidth, latency, and DNS response time within a shaded band bounded at the 25th and 75th percentile and represent the average expected user experience across the country, as discussed in the Our approach to data analysis section above.

Below that is the Connection Quality section, which provides a summary view of the country’s measured upload and download speeds, as well as latency and jitter, over the previous 90 days. The colored wedges in the Performance Summary graph are intended to illustrate aggregate connection quality at a glance, with an “ideal” connection having larger upload and download wedges and smaller latency and jitter wedges. Hovering over the wedges displays the metric’s value, which is also shown in the table to the right of the graph.

Below that, the Bandwidth and Latency/Jitter histograms illustrate the bucketed distribution of upload and download speeds, and latency and jitter measurements. In some cases, the speed histograms may show a noticeable bar at 1 Gbps, or 1000 ms (1 second) on the latency/jitter histograms. The presence of such a bar indicates that there is a set of measurements with values greater than the 1 Gbps/1000 ms maximum histogram values.
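The bar at the maximum value is what you get when out-of-range measurements are clamped into the last bucket. A small sketch of that bucketing, with a hypothetical bucket width and made-up measurements:

```python
def bucketed_histogram(values, bucket_width, max_value):
    """Count values per bucket; anything at or above max_value shares the last bucket."""
    bucket_count = int(max_value // bucket_width) + 1
    counts = [0] * bucket_count
    for value in values:
        clamped = min(value, max_value)  # overflow goes to the final bar
        counts[int(clamped // bucket_width)] += 1
    return counts


# Hypothetical latency measurements in ms, bucketed in 100 ms steps up to 1000 ms.
latency_counts = bucketed_histogram([50, 150, 950, 1000, 2500],
                                    bucket_width=100, max_value=1000)
```

Here the 2500 ms measurement lands in the same final bucket as the 1000 ms one, which is why a visible bar at the right edge of a histogram hints at values beyond the axis.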

Autonomous system level

Within the upper-right section of the country-level page, a list of the top five autonomous systems within the country is shown. Clicking on an ASN takes you to the Performance page for that autonomous system. For others not displayed in the top five list, you can use the search bar at the top of the page to search by autonomous system name or number. The graphs shown within the AS level view are identical to those shown at a country level, but obviously at a different level of aggregation. You can find the ASN that you are connected to from the My Connection page on Cloudflare Radar.

Exploring connection performance & quality data

Digging into the IQI and Connection Quality visualizations can surface some interesting observations, such as the characteristics of Internet connections and the impact of Internet disruptions, including shutdowns and network issues. We explore some examples below.

Characterizing Internet connections

Verizon FiOS is a residential fiber-based Internet service available to customers in the United States. Fiber-based Internet services (as opposed to cable-based, DSL, dial-up, or satellite) will generally offer symmetric upload and download speeds, and the FiOS plans page shows this to be the case, offering 300 Mbps (upload & download), 500 Mbps (upload & download), and “1 Gig” (Verizon claims average wired speeds between 750-940 Mbps download / 750-880 Mbps upload) plans. Verizon carries FiOS traffic on AS701 (labeled UUNET due to a historical acquisition), and in looking at the bandwidth histogram for AS701, several things stand out. The first is a rough symmetry in upload and download speeds. (A cable-based Internet service provider, in contrast, would generally show a wide spread of download speeds, but have upload speeds clustered at the lower end of the range.) Another is the peaks around 300 Mbps and 750 Mbps, suggesting that the 300 Mbps and “1 Gig” plans may be more popular than the 500 Mbps plan. It is also clear that there are a significant number of test results with speeds below 300 Mbps. This is due to several factors: one is that Verizon also carries lower speed non-FiOS traffic on AS701, while another is that the erratic nature of in-home WiFi often means that the speeds achieved on a test will be lower than the purchased service level.

Traffic shifts drive latency shifts

On May 9, 2023, the government of Pakistan ordered the shutdown of mobile network services in the wake of protests following the arrest of former Prime Minister Imran Khan. Our blog post covering this shutdown looked at the impact from a traffic perspective. Within the post, we noted that autonomous systems associated with fixed broadband networks saw significant increases in traffic when the mobile networks were shut down – that is, some users shifted to using fixed networks (home broadband) when mobile networks were unavailable.

Examining IQI data after the blog post was published, we found that the impact of this traffic shift was also visible in our latency data. As can be seen in the shaded area of the graph below, the shutdown of the mobile networks resulted in the median latency dropping about 25% as usage shifted from higher latency mobile networks to lower latency fixed broadband networks. An increase in latency is visible in the graph when mobile connectivity was restored on May 12.

Bandwidth shifts as a potential early warning sign

On April 4, UK mobile operator Virgin Media suffered several brief outages. In examining the IQI bandwidth graph for AS5089, the ASN used by Virgin Media (formerly branded as NTL), indications of a potential problem are visible several days before the outages occurred, as median bandwidth dropped by about a third, from around 35 Mbps to around 23 Mbps. The outages are visible in the circled area in the graph below. Published reports indicate that the problems lasted into April 5, in line with the lower median bandwidth measured through mid-day.

Submarine cable issues cause slower browsing

On June 5, Philippine Internet provider PLDT Tweeted an advisory that noted “One of our submarine cable partners confirms a loss in some of its internet bandwidth capacity, and thus causing slower Internet browsing.” IQI latency and bandwidth graphs for AS9299, a primary ASN used by PLDT, show clear shifts starting around 06:45 UTC (14:45 local time). Median bandwidth dropped by half, from 17 Mbps to 8 Mbps, while median latency increased by 75% from 37 ms to around 65 ms. 75th percentile latency also saw a significant increase, nearly tripling from 63 ms to 180 ms coincident with the reported submarine cable issue.
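The relative changes quoted above follow directly from the before and after values. A quick check, using only the numbers in the paragraph (`percent_change` is a helper written for this sketch):

```python
def percent_change(before, after):
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100


bandwidth_change = percent_change(17, 8)    # median bandwidth, Mbps
latency_change = percent_change(37, 65)     # median latency, ms
p75_latency_ratio = 180 / 63                # 75th percentile latency, ms

print(f"bandwidth: {bandwidth_change:.1f}%")
print(f"median latency: {latency_change:.1f}%")
print(f"p75 latency: {p75_latency_ratio:.2f}x")
```

The results line up with the rounded descriptions in the text: a drop of roughly half, an increase of about 75%, and a near tripling.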

Conclusion

Making network performance and quality insights available on Cloudflare Radar supports Cloudflare’s mission to help build a better Internet. However, we’re not done yet – we have more enhancements planned. These include making data available at a more granular geographical level (such as state and possibly city), incorporating AIM scores to help assess Internet quality for specific types of use cases, and embedding the Cloudflare speed test directly on Radar using the open source JavaScript module.

In the meantime, we invite you to use speed.cloudflare.com to test the performance and quality of your Internet connection, share any country or AS-level insights you discover on social media (tag @CloudflareRadar on Twitter or @[email protected] on Mastodon), and explore the underlying data through the M-Lab repository or the Radar API.

A Comprehensive Analysis of the GPL Issues With the Red Hat Enterprise Linux (RHEL) Business Model

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2023/06/23/rhel-red-hat-gpl.html

[ This is a crosspost from my professional blog at Software Freedom
Conservancy (SFC). I encourage you to use that copy of the post as the
canonical linkage for this essay; I crossposted here merely for posterity
and to reach a wider audience. ]

This article was originally published primarily as a response to IBM’s
Red Hat’s change to no longer publish complete, corresponding source (CCS)
for RHEL and the prior discontinuation of CentOS Linux (which are related
events, as described below). We hope that this will serve as a
comprehensive document that discusses the history of Red Hat’s RHEL
business model, the related source code provisioning, and the GPL
compliance issues with RHEL.


For approximately twenty years, Red Hat (now a fully owned subsidiary of
IBM) has experimented with building a business model for operating system deployment and
distribution that looks, feels, and acts like a proprietary one, but
nonetheless complies with the GPL and other standard copyleft
terms. Software rights activists,
including SFC, have spent decades talking to Red Hat and its
attorneys about how the Red Hat Enterprise Linux (RHEL) business model courts
disaster and is actively unfriendly to
community-oriented Free and Open Source Software (FOSS). These pleadings,
discussions, and encouragements have, as far as we can tell, been heard and
seriously listened to by key members of Red Hat’s legal and OSPO
departments, and even by key C-level executives, but they have ultimately been rejected
and ignored — sometimes even with a “fine, then sue us for GPL
violations” attitude. Activists have found this discussion
frustrating, but kept the nature and tenure of these discussions as an
“open secret” until now because we all had hoped that Red Hat’s behavior
would improve. Recent events show that the behavior has simply gotten worse, and is likely to get even
worse.

What Exactly Is the RHEL Business Model?

The most concise and pithy way to describe RHEL’s business model is:
“if you exercise your rights under the GPL, your money is no good
here”. Specifically, IBM’s Red Hat offers copies of RHEL to its
customers, and each copy comes with a support and automatic-update
subscription contract. As we understand it, this contract clearly states
that the terms do not intend to contradict any rights to copy, modify,
redistribute and/or reinstall the software as many times and as many
places as the customer likes (see §1.4). Additionally, though, the
contract indicates that if the customer engages in these activities,
Red Hat reserves the
right to cancel that contract and make no further contracts with the
customer for support and update services. In essence, Red Hat requires their customers
to choose between (a) their software freedom and rights, and (b) remaining a Red Hat
customer. In some versions of these contracts that we have reviewed, Red
Hat even reserves the right to “Review” a customer (effectively a BSA-style audit) to examine how
many copies of RHEL are actually installed (see §10) — presumably for the
purpose of Red Hat getting the information they need to decide
whether to “fire” the customer.

Red Hat’s lawyers clearly take the position that this business model complies with the GPL (though we aren’t so sure), on grounds that nothing in the GPL agreements requires an entity
keep a business relationship with any other entity. They have further argued that such business
relationships can be terminated based on any behaviors — including
exercising rights guaranteed by the GPL agreements. Whether that
analysis is correct is a matter of intense debate, and likely only a court
case that disputed this particular issue would yield a definitive answer
on whether that disagreeable behavior is permitted (or not) under the GPL agreements. Debates continue, even today,
in copyleft expert circles, over whether this
model itself violates the GPL. There is, however, no doubt that this
provision is not in the spirit of the GPL agreements. The RHEL business
model is unfriendly, captious, capricious, and cringe-worthy.

Furthermore, this RHEL
business model remains, to our knowledge, rather unique in the software
industry. IBM’s Red Hat definitely deserves credit for so carefully
constructing their business model such that it has spent most of the last
two decades in murky territory of “probably not violating the
GPL”.

Does the RHEL Business Model Violate the GPL Agreements?

Perhaps the biggest problem with a murky business model that skirts the
line of GPL compliance is that violations can and do happen — since
even a minor deviation from the business model clearly violates the GPL
agreements. Pre-IBM Red Hat deserves a certain amount of credit, as
SFC is aware of only two documented incidents of GPL violations that have
occurred since 2006 regarding the RHEL business model. We’ve decided to
share some general details of these violations for the purpose of
explaining where this business model can so easily cross the line.

In the first violation, a large Fortune 500 company (which we’ll
call Company A), which both used RHEL internally and also built
public-facing Linux-based products, decided to create a consumer-facing
product (which we’ll call Product P) based primarily on CentOS Linux,
but P included a few packages built from RHEL sources. Company A
did not seek or ask for support or update services for this separate
Product P. Red Hat later became aware that Product P contained
some part of RHEL, and Red Hat demanded royalty payments for Product
P. Red Hat threatened to revoke the support and update
services on Company A’s internal RHEL servers if such royalties were
not paid.

Since Company A was powerful and had good lawyers and savvy
business development staff, they did not acquiesce. Company A ultimately
continued (to our knowledge) on as a RHEL customer for their internal
servers and continued selling Product P without royalty payments. Nevertheless, a
demand for royalties for distribution is clearly a violation as that demand creates a
“further restriction” on the permissions granted by GPL. As
stated in GPLv3:

You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted under this License.

Red Hat tried to impose a further restriction in this situation, and therefore
violated the GPL. The violation was resolved since no royalty was paid
and Company A faced no consequences. SFC learned of
the incident later, and informed Red Hat that the past royalty demand was
a violation. Red Hat neither disputed nor agreed that it was a violation, but did informally agree
that such demands would not be made in the future.

In another violation incident, we learned that Red Hat, in a specific
non-USA country, was requiring that any customer who lowered the number of
RHEL machines under service contract with Red Hat sign an
additional agreement. This additional agreement promised that the customer
had deleted every copy of RHEL in their entire organization other than the
copies of RHEL that were currently contracted for service with Red Hat.
Again, this is a “further restriction”. The GPL agreements
give everyone the unfettered right to make and keep as many copies of the
software as they like, and a distributor of GPL’d software may not require
a user to attest that they’ve deleted these legitimate, licensed copies of
third-party-licensed software under the GPL. SFC informed Red Hat’s legal department
of this violation, and we were assured that this additional agreement would no longer
be presented to any Red Hat customers in the future.

In both these situations, we at SFC were worried they were merely a
“tip of the proverbial iceberg”. For years, we have heard from
Red Hat customers who are truly confused. It’s common in the industry to
talk about RHEL “seat licenses”, and many software acquisition
specialists in the industry are not aware of the nuances of the RHEL
business model and do not understand their rights. We remain very
concerned that RHEL salespeople purposely confuse customers to sell more
“seat licenses”. It’s often led us to ask: “If a GPL
violation happens in the woods, and everyone involved doesn’t hear it, how
does anyone know that software rights have indeed been trampled upon in
those woods?” As we do for as many GPL violation reports as we can, we zealously pursue RHEL-related GPL violations that
are reported to us, and if you’re aware of one, please
do email us at <[email protected]> immediately. We fear that,
be it through incompetence or malice, many RHEL salespeople and business
development professionals may regularly violate the GPL and no one knows about
it. That said, the business model as described by IBM’s Red Hat
may well comply with the GPL — it’s just so murky that any tweak to
the model in any direction seems to definitely violate, in our experience.

Furthermore, Red Hat exploits the classic “caveat emptor”
approach — popular in many a shady business deal throughout history. While,
technically speaking, a careful reader of the GPL and the RHEL agreements
understands the bargain they’re making, we suspect most small businesses
just don’t have the FOSS licensing acumen and knowledge to truly understand
that deal.

Why Was an Independent CentOS So Important?

Until Red Hat’s “acquisition” of CentOS in early 2014, CentOS
provided an excellent counterbalance to the problems with the RHEL
business model. Specifically, CentOS was a community-driven project,
with many volunteers, supported by some involvement from small
businesses, to re-create RHEL releases using the CCS releases
made for RHEL. Our pre-2014 view was that CentOS was the “canary in
the murky coalmine” of the RHEL business. If CentOS seemed vibrant,
usable, and a viable alternative to RHEL for those who didn’t want to
purchase Red Hat’s updates and services, the community could rest easy.
Even if there were GPL violations by Red Hat on RHEL, CentOS’ vibrancy
assured that such violations were having only a minor negative impact on
the FOSS community around RHEL’s codebase.

Red Hat, however, apparently knew that this vibrant community was cutting
into their profits. Starting in 2013, Red Hat engaged in a series of actions
that increased their grip. First, they “acquired”
CentOS. This was initially couched as a cooperation agreement, but Red Hat
systematically made job offers that key CentOS volunteers couldn’t refuse,
acquired the small businesses who might ultimately build CentOS into a
product, and otherwise integrated CentOS into Red Hat’s own operations.

After IBM acquired Red Hat, the situation got worse. Having gotten rights
to the CentOS brand as part of the “acquisition”, Red Hat slowly
began to change what CentOS was. CentOS Linux quickly ceased to be a
check-and-balance on RHEL, and just became a testing ground for RHEL.
Then, in 2020, when most of us were distracted by the worst of the COVID-19
pandemic, Red Hat unilaterally terminated all CentOS Linux development. Later (during
the Delta variant portion of the pandemic in late 2021) Red Hat ended CentOS Linux entirely.
IBM’s Red Hat
then used the name “CentOS Stream” to refer to experimental
source packages related to RHEL. These were (and are) not actually the RHEL
source releases — rather, they appear to be primarily a testing
ground for what might appear in RHEL later.

Finally, Red Hat announced two days ago that RHEL CCS will no longer be
publicly available in any way. Now, to be clear, the GPL agreements did
not obligate Red Hat to make its CCS publicly available to everyone.
This is a common misconception about GPL’s
requirements. While the details of CCS provisioning vary in the different
versions of the GPL agreements, the general principle is that CCS needs to
be provided either (a) along with the binary distributions to those who
receive them, or (b) to those who request it pursuant to a written offer for
source. In a normal situation, with no mitigating factors, the fact that
a company moved from distributing CCS publicly to everyone to only giving
it to customers who received the binaries already would not raise
concerns.

In this situation, however, this completes what appears to be a
decade-long plan by Red Hat to maximize the level of difficulty for
those in the community who wish to “trust but verify” that RHEL
complies with the GPL agreements. Namely, Red Hat has badly thwarted
efforts by entities such as Rocky Linux and Alma Linux. These entities
are de facto the intellectual successors to the CentOS Linux project that
Red Hat carefully dismantled over the last decade. These organizations
sought to build Linux-based distributions that mirrored RHEL
releases, and it is now unclear if they can do that effectively, since
Red Hat will undoubtedly capriciously refuse to sell them exactly one
RHEL service and update “seat license” at a reasonable price. It appears
that, as of this week, one must have at least that to get timely access
to RHEL CCS.

What Should Those Who Care About Software Rights Do About RHEL?

Due to this ongoing bad behavior by IBM’s Red Hat, the situation has
become increasingly complex and difficult to face. No third party can
effectively monitor RHEL compliance with the GPL agreements, since
customers live in fear of losing their much-needed service contracts.
Red Hat’s legal department
has systematically refused SFC’s requests in recent years to set up some
form of monitoring by SFC. (For example, we asked to review the training
materials and documents that RHEL salespeople are given to convince
customers to buy RHEL, and Red Hat has not been willing to share these
materials with us.) Nevertheless, since SFC serves as the global watchdog for
GPL compliance, we welcome reports of RHEL-related violations.

We finally express our sadness that this long road has led the FOSS community to such a disappointing place. I
personally remember standing with Erik Troan in a Red Hat booth at a USENIX
conference in the late 1990s, and meeting Bob Young around the same time.
Both expressed how much they wanted to build a company that respected,
collaborated with, engaged with, and most of all treated as equals the wide
spectrum of individuals, hobbyists, and small businesses that make up the
plurality of the FOSS community. We hope that the
modern Red Hat can find their way back to this mission under IBM’s control.

A Comprehensive Analysis of the GPL Issues With the Red Hat Enterprise Linux (RHEL) Business Model

Post Syndicated from Bradley M. Kuhn original http://ebb.org/bkuhn/blog/2023/06/23/rhel-red-hat-gpl.html

[ This is
a crosspost
from my professional blog at Software Freedom Conservancy
(SFC)
. I encourage you
to use
that copy of the post as the canonical linkage for this essay — I
crossposted here merely for posterity and to reach a wider
audience. ]

This article was originally published primarily as a response
to
IBM’s
Red Hat’s change
to no longer publish complete, corresponding source
(CCS) for RHEL and the
prior discontinuation of CentOS Linux (which are related events, as
described below). We hope that this will serve as a comprehensive
document that discusses the history of Red Hat’s RHEL business model,
the related source code provisioning, and the GPL compliance issues with RHEL.


For approximately twenty years, Red Hat (now a fully owned subsidiary of
IBM) has experimented with building a business model for operating system deployment and
distribution that looks, feels, and acts like a proprietary one, but
nonetheless complies with the GPL and other standard copyleft
terms. Software rights activists,
including SFC, have spent decades talking to Red Hat and its
attorneys about how the Red Hat Enterprise Linux (RHEL) business model courts
disaster and is actively unfriendly to
community-oriented Free and Open Source Software (FOSS). These pleadings,
discussions, and encouragements have, as far as we can tell, been heard and
seriously listened to by key members of Red Hat’s legal and OSPO
departments, and even by key C-level executives, but they have ultimately been rejected
and ignored — sometimes even with a “fine, then sue us for GPL
violations” attitude. Activists have found this discussion
frustrating, but kept the nature and tenure of these discussions as an
“open secret” until now because we all had hoped that Red Hat’s behavior
would improve. Recent events show that the behavior has simply gotten worse, and is likely to get even
worse.

What Exactly Is the RHEL Business Model?

The most concise and pithy way to describe RHEL’s business model is:
“if you exercise your rights under the GPL, your money is no good
here”. Specifically, IBM’s Red Hat offers copies of RHEL to its
customers, and each copy comes with a support and automatic-update
subscription contract. As we understand it, this contract
clearly states
that the terms do not intend to contradict any rights to copy, modify,
redistribute and/or reinstall the software
as many times and as many places
as the customer likes (see §1.4). Additionally, though, the contract indicates that
if the customer engages in these activities, that Red Hat reserves the
right to cancel that contract and make no further contracts with the
customer for support and update services. In essence, Red Hat requires their customers
to choose between (a) their software freedom and rights, and (b) remaining a Red Hat
customer. In some versions of these contracts that we have reviewed, Red
Hat even reserves the right to “Review” a customer (effectively a BSA-style audit) to examine how
many copies of RHEL are actually installed (see §10) — presumably for the
purpose of Red Hat getting the information they need to decide
whether to “fire” the customer.

Red Hat’s lawyers clearly take the position that this business model complies with the GPL (though we aren’t so sure), on grounds that that nothing in the GPL agreements requires an entity
keep a business relationship with any other entity. They have further argued that such business
relationships can be terminated based on any behaviors — including
exercising rights guaranteed by the GPL agreements. Whether that
analysis is correct is a matter of intense debate, and likely only a court
case that disputed this particular issue would yield a definitive answer
on whether that disagreeable behavior is permitted (or not) under the GPL agreements. Debates continue, even today,
in copyleft expert circles, whether this
model itself violates GPL. There is, however, no doubt that this
provision is not in the spirit of the GPL agreements. The RHEL business
model is unfriendly, captious, capricious, and cringe-worthy.

Furthermore, this RHEL
business model remains, to our knowledge, rather unique in the software
industry. IBM’s Red Hat definitely deserves credit for so carefully
constructing their business model such that it has spent most of the last
two decades in the murky territory of “probably not violating the
GPL”.

Does The RHEL Business Model Violate the GPL Agreements?

Perhaps the biggest problem with a murky business model that skirts the
line of GPL compliance is that violations can and do happen — since
even a minor deviation from the business model clearly violates the GPL
agreements. Pre-IBM Red Hat deserves a certain amount of credit, as
SFC is aware of only two documented incidents of GPL violations that have
occurred since 2006 regarding the RHEL business model. We’ve decided to
share some general details of these violations for the purpose of
explaining where this business model can so easily cross the line.

In the first violation, a large Fortune 500 company (which we’ll
call Company A), which both used RHEL internally and also built
public-facing Linux-based products, decided to create a consumer-facing
product (which we’ll call Product P) based primarily on CentOS Linux,
but P included a few packages built from RHEL sources. Company A
did not seek or ask for support or update services for this separate
Product P. Red Hat later became aware that Product P contained
some part of RHEL, and Red Hat demanded royalty payments for Product P.
Red Hat threatened to revoke the support and update
services on Company A’s internal RHEL servers if such royalties were
not paid.

Since Company A was powerful and had good lawyers and savvy
business development staff, they did not acquiesce. Company A ultimately
continued (to our knowledge) on as a RHEL customer for their internal
servers and continued selling Product P without royalty payments. Nevertheless, a
demand for royalties for distribution is clearly a violation as that demand creates a
“further restriction” on the permissions granted by GPL. As
stated in GPLv3:

You may not impose any further restrictions on the exercise of the
rights granted or affirmed under this License. For example, you may
not impose a license fee, royalty, or other charge for exercise of
rights granted
under this License.

Red Hat tried to impose a further restriction in this situation, and therefore
violated the GPL. The violation was resolved since no royalty was paid
and Company A faced no consequences. SFC learned of
the incident later, and informed Red Hat that the past royalty demand was
a violation. Red Hat neither disputed nor agreed that it was a violation, but did informally agree
that such demands would not be made in the future.

In another violation incident, we learned that Red Hat, in a specific
non-USA country, was requiring that any customer who lowered the number of
RHEL machines under service contract with Red Hat sign an
additional agreement. This additional agreement promised that the customer
had deleted every copy of RHEL in their entire organization other than the
copies of RHEL that were currently contracted for service with Red Hat.
Again, this is a “further restriction”. The GPL agreements
give everyone the unfettered right to make and keep as many copies of the
software as they like, and a distributor of GPL’d software may not require
a user to attest that they’ve deleted these legitimate, licensed copies of
third-party-licensed software under the GPL. SFC informed Red Hat’s legal department
of this violation, and we were assured that this additional agreement would no longer
be presented to any Red Hat customers in the future.

In both these situations, we at SFC were worried they were merely a
“tip of the proverbial iceberg”. For years, we have heard from
Red Hat customers who are truly confused. It’s common in the industry to
talk about RHEL “seat licenses”, and many software acquisition
specialists in the industry are not aware of the nuances of the RHEL
business model and do not understand their rights. We remain very
concerned that RHEL salespeople purposely confuse customers to sell more
“seat licenses”. It’s often led us to ask: “If a GPL
violation happens in the woods, and everyone involved doesn’t hear it, how
does anyone know that software rights have indeed been trampled upon in
those woods?”. As we do for as many GPL violation reports as we can, we zealously pursue RHEL-related GPL violations that
are reported to us, and if you’re aware of one, please
do email us at <[email protected]> immediately. We fear that,
be it through incompetence or malice, many RHEL salespeople and business
development professionals may regularly violate GPL and no one knows about
it. That said, the business model as described by IBM’s Red Hat
may well comply with the GPL — it’s just so murky that any tweak to
the model in any direction seems to definitely violate, in our experience.

Furthermore, Red Hat exploits the classic “caveat emptor”
approach — popular in many a shady business deal throughout history. While,
technically speaking, a careful reader of the GPL and the RHEL agreements
understands the bargain they’re making, we suspect most small businesses
just don’t have the FOSS licensing acumen and knowledge to truly understand
that deal.

Why Was an Independent CentOS So Important?

Until Red Hat’s “acquisition” of CentOS in early 2014, CentOS
provided an excellent counterbalance to the problems with the RHEL
business model. Specifically, CentOS was a community-driven project,
with many volunteers, supported by some involvement from small
businesses, to re-create RHEL releases using the
CCS releases
made for RHEL. Our pre-2014 view was that CentOS was the “canary in
the murky coalmine” of the RHEL business. If CentOS seemed vibrant,
usable, and a viable alternative to RHEL for those who didn’t want to
purchase Red Hat’s updates and services, the community could rest easy.
Even if there were GPL violations by Red Hat on RHEL, CentOS’ vibrancy
assured that such violations were having only a minor negative impact on
the FOSS community around RHEL’s codebase.

Red Hat, however, apparently knew that this vibrant community was cutting
into their profits. Starting in 2013, Red Hat engaged in a series of actions
that increased their grip. First, they “acquired”
CentOS. This was initially couched as a cooperation agreement, but Red Hat
systematically made job offers that key CentOS volunteers couldn’t refuse,
acquired the small businesses who might ultimately build CentOS into a
product, and otherwise integrated CentOS into Red Hat’s own operations.

After IBM acquired Red Hat, the situation got worse. Having gotten rights
to the CentOS brand as part of the “acquisition”, Red Hat slowly
began to change what CentOS was. CentOS Linux quickly ceased to be a
check-and-balance on RHEL, and just became a testing ground for RHEL.
Then, in 2020, when most of us were distracted by the worst of the COVID-19
pandemic, Red Hat unilaterally terminated all CentOS Linux development. Later (during
the Delta variant portion of the pandemic in late 2021) Red Hat ended CentOS Linux entirely.
IBM’s Red Hat
then used the name “CentOS Stream” to refer to experimental
source packages related to RHEL. These were (and are) not actually the RHEL
source releases — rather, they appear to be primarily a testing
ground for what might appear in RHEL later.

Finally, Red Hat announced two days ago that RHEL
CCS will no longer be publicly available in any way.
Now, to be clear, the GPL agreements did not obligate Red Hat to make its
CCS publicly available to everyone. This is a common misconception about GPL’s
requirements. While the details of CCS provisioning vary in the different
versions of the GPL agreements, the general principle is that CCS needs to
be provided either (a) along with the binary distributions to those who
receive them, or (b) to those who request it pursuant to a written offer for
source. In a normal situation, with no mitigating factors, the fact that
a company moved from distributing CCS publicly to everyone to only giving
it to customers who received the binaries already would not raise
concerns.

In this situation, however, this completes what appears to be a
decade-long plan by Red Hat to maximize the level of difficulty for
those in the community who wish to “trust but verify” that RHEL
complies with the GPL agreements. Namely, Red Hat has badly thwarted
efforts by entities such as Rocky Linux and Alma Linux.
These entities are de facto the intellectual successors to the
CentOS Linux project that Red Hat carefully dismantled over the last decade. These organizations
sought to build Linux-based distributions that mirrored RHEL
releases, and it is now unclear if they can do that effectively, since Red Hat will undoubtedly capriciously refuse to sell them exactly-one RHEL service and update “seat license” at a reasonable price. It appears that, as of this week, one must have at least that to get timely access to RHEL CCS.

What Should Those Who Care About Software Rights Do About RHEL?

Due to this ongoing bad behavior by IBM’s Red Hat, the situation has
become increasingly complex and difficult to face. No third party can
effectively monitor RHEL compliance with the GPL agreements, since
customers live in fear of losing their much-needed service contracts.
Red Hat’s legal department
has systematically refused SFC’s requests in recent years to set up some
form of monitoring by SFC. (For example, we asked to review the training
materials and documents that RHEL salespeople are given to convince
customers to buy RHEL, and Red Hat has not been willing to share these
materials with us.) Nevertheless, since SFC serves as the global watchdog for
GPL compliance, we welcome reports of RHEL-related violations.

We finally express our sadness that this long road has led the FOSS community to such a disappointing place. I
personally remember standing with Erik Troan in a Red Hat booth at a USENIX
conference in the late 1990s, and meeting Bob Young around the same time.
Both expressed how much they wanted to build a company that respected,
collaborated with, engaged with, and most of all treated as equals the wide
spectrum of individuals, hobbyists, and small businesses that make up the
plurality of the FOSS community. We hope that the
modern Red Hat can find their way back to this mission under IBM’s control.
