Tag Archives: Pricing

New Workers pricing — never pay to wait on I/O again

Post Syndicated from Rita Kozlov original http://blog.cloudflare.com/workers-pricing-scale-to-zero/

Today we are announcing new pricing for Cloudflare Workers and Pages Functions, where you are billed based on CPU time, and never for the idle time that your Worker spends waiting on network requests and other I/O. Unlike other platforms, when you build applications on Workers, you only pay for the compute resources you actually use.

Why is this exciting? To date, all large serverless compute platforms have billed based on how long your function runs: its duration, or “wall time”. This is a reflection of a new paradigm built on a leaky abstraction: your code may be neatly packaged up into a “function”, but under the hood there’s a virtual machine (VM). A VM can’t be paused and resumed quickly enough to execute another piece of code while it waits on I/O. So while a typical function might take 100ms to run, it might spend only 10ms doing CPU work, like crunching numbers or parsing JSON, with the rest of the time spent waiting on I/O.

This status quo has meant that you are billed for this idle time, while nothing is happening.
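To make the distinction concrete, here is a minimal Python sketch (not Workers code) that measures both quantities for a function that mostly waits. The names `handle_request` and `measure`, and the 50 ms sleep standing in for a network call, are illustrative assumptions rather than anything from the Workers runtime.

```python
import time


def handle_request(io_wait_seconds: float = 0.05) -> int:
    """Hypothetical handler: a little CPU work, then a simulated I/O wait."""
    total = sum(i * i for i in range(10_000))  # CPU work
    time.sleep(io_wait_seconds)                # stand-in for waiting on the network
    return total


def measure(io_wait_seconds: float = 0.05) -> tuple[float, float]:
    """Return (wall_seconds, cpu_seconds) for one call to handle_request."""
    wall_start = time.perf_counter()   # wall clock: keeps ticking while we sleep
    cpu_start = time.process_time()    # process CPU time: excludes time spent sleeping
    handle_request(io_wait_seconds)
    cpu_seconds = time.process_time() - cpu_start
    wall_seconds = time.perf_counter() - wall_start
    return wall_seconds, cpu_seconds


wall, cpu = measure()
print(f"wall time: {wall * 1000:.1f} ms, CPU time: {cpu * 1000:.1f} ms")
```

The wall-clock figure includes the whole sleep, while the CPU figure covers only the arithmetic. Duration-based billing charges for the first number; CPU-time billing charges for the second.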

With this announcement, Cloudflare is the first and only global serverless platform to offer standard pricing based on CPU time, rather than duration. We think you should only pay for the compute time you actually use, and that’s how we’re going to bill you going forward.

Old pricing — two pricing models, each with tradeoffs

New pricing — one simple and predictable pricing model

With the same generous Free plan

Compared with wall time (duration, or GB-s), CPU time is more predictable and under your control. When you make a request to a third party API, you can’t control how long that API takes to return a response. This time can be quite long, and vary dramatically, particularly when building AI applications that make inference requests to LLMs. If a request takes twice as long to complete, duration-based billing means you pay double. By contrast, CPU time is consistent and unaffected by time spent waiting on I/O. It is purely a function of your Worker’s logic and how it processes its inputs and outputs, and it is entirely under your control.

Starting October 31, 2023, you will have the option to opt in individual Workers and Pages Functions projects on your account to new pricing, and newly created projects will default to new pricing. You’ll be able to estimate how much new pricing will cost in the Cloudflare dashboard. For the majority of current applications, new pricing is the same or less expensive than the previous Bundled and Unbound pricing plans.

If you’re on our Workers Paid plan, you will have until March 1, 2024 to switch to the new pricing on your own, after which all of your projects will be automatically migrated to new pricing. If you’re an Enterprise customer, any contract renewals after March 1, 2024, will use the new pricing. You’ll receive plenty of advance notice via email and dashboard notifications before any changes go into effect. And since CPU time is fully in your control, the more you optimize your Worker’s compute time, the less you’ll pay. Your incentives are aligned with ours, to make efficient use of compute resources on Region: Earth.

The challenge of truly scaling to zero

The beauty of serverless is that it allows teams to focus on what matters most — delivering value to their customers, rather than managing infrastructure. It saves you money by effortlessly scaling up and down all over the world based on your traffic, whether you’re an early stage startup or Shopify during Black Friday.

One of the promises of serverless is the idea of scaling to zero — once those big days subside, you no longer have to pay for virtual machines to sit idle before your autoscaling kicks in, or be charged by the hour for instances that you barely ended up using. No compute = no bills for usage. Or so, at least, is the promise of serverless.

Yet, there’s one hidden cost, where even in the serverless world you will find yourself paying for idle resources — what happens when your function is sitting around waiting on I/O? With pricing based on the duration that a function runs, you’re still billed for time that your service is doing zero work, and just waiting on network requests.

Most applications spend far more time waiting on this I/O than they do using the CPU, often ten times more.

Imagine a similar scenario in your own life — you grab a cab to go to the airport. On the way, the driver decides to stop to refuel and grab a snack, but leaves the meter running. This is not time spent bringing you closer to your destination, but it’s time that you’re paying for. Now imagine for the time the driver was refueling the car, the meter was paused. That’s the difference between CPU time and duration, or wall clock time.

But rather than waiting on the driver to refuel or grab a Snickers bar, what is it that you’re actually paying for when it comes to serverless compute?

Time spent waiting on services you don’t control

Most applications depend on one or many external service providers. Providers of hosted AI models, such as the large language model (LLM) GPT-4 or the image model Stable Diffusion. Databases as a service. Payment processors. Or simply an API request to a system outside your control. This is where software development is headed: rather than reinventing the wheel and slowly building everything themselves, both fast-moving startups and the Fortune 500 increasingly build using other services to avoid undifferentiated heavy lifting.

Every time an application interacts with one of these external services, it has to send data over the network and wait until it receives a response. And while some services are lightning fast, others can take considerable time, like waiting for a payment processor or for a large media file to be uploaded or converted. Your own application sits idle for most of the request, waiting on services outside your control.

Until today, you’ve had to pay while your application waits. You’ve had to pay more when a service you depend on has an operational issue and slows down, or times out in responding to your request. This has been a disincentive to incrementally move parts of your application to serverless.

Cloudflare’s new pricing: the first serverless platform to truly scale down to zero

The idea of “scale to zero” is that you never have to keep instances of your application sitting idle, waiting for something to happen. Serverless is more than just not having to manage servers or virtual machines — you shouldn’t have to provision and manage the number of compute resources that are available or warm.

Our new pricing takes the “scale to zero” concept even further, and extends it to whether your application is actually performing work. If you’re still paying while nothing is happening, we don’t think that’s truly scale to zero. Your application is idle. The CPU can be used for other tasks. Whether your application is “running” is an old concept lifted from an era before multi-tenant cloud platforms. What matters is if you are actually using compute resources.

Pay less, deploy everywhere, without hidden costs

Let’s compare what you’d pay on new Workers pricing to AWS Lambda, for the following Worker:

  • One billion requests per month
  • Seven CPU milliseconds per request
  • 200ms duration per request

[Table not reproduced: cost comparison of new Workers pricing with AWS Lambda and Lambda @ Edge for the Worker described above.]

The above table is for informational purposes only. Prices are limited to the public fees as of September 20, 2023, and do not include taxes and any other fees. AWS Lambda and Lambda @ Edge prices are based on publicly available pricing in US-East (Ohio) region as published on https://aws.amazon.com/lambda/pricing/

Workers are the most cost-effective option, and are globally distributed, automatically optimized with Smart Placement, and integrated with Durable Objects, R2, KV, Cache, Queues, D1 and more. And with Workers, you never have to pay extra for provisioned concurrency, pay a penalty for streaming responses, or incur egregious egress fees.

New Workers pricing makes building AI applications dramatically cheaper

Yesterday we announced a new suite of products to let you build AI applications on Cloudflare — Workers AI, AI Gateway, and our new vector database, Vectorize.

Nearly everyone is building new products and features using AI models right now. Large language models and generative AI models are incredibly powerful. But they aren’t always fast — asking a model to create an image, transcribe a segment of audio, or write a story often takes multiple seconds — far longer than a typical API response or database query that we expect to return in tens of milliseconds. There is significant compute work going on behind the scenes, and that means longer duration per request to a Worker.

New Workers pricing makes this much less expensive than it was previously on the Unbound usage model.

Let’s take the same example as above, but instead assume the duration of the request is two seconds (2000ms), because the Worker makes an inference request to a large AI model. With new Workers pricing, you pay the exact same amount, no matter how long this request takes.
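The effect of that change on the billed quantity can be sketched with a few lines of Python. This is only arithmetic on the figures from the example (one billion requests, 7 ms of CPU, 200 ms or 2000 ms of duration); it deliberately leaves out any price per unit, and the name `billable_ms` is an illustrative assumption.

```python
def billable_ms(requests: int, ms_per_request: int) -> int:
    """Total milliseconds billed across all requests."""
    return requests * ms_per_request


REQUESTS = 1_000_000_000  # one billion requests per month
CPU_MS = 7                # CPU milliseconds per request

cpu_total = billable_ms(REQUESTS, CPU_MS)

for duration_ms in (200, 2000):
    wall_total = billable_ms(REQUESTS, duration_ms)
    print(
        f"duration {duration_ms} ms: "
        f"CPU-billed {cpu_total:,} ms, "
        f"duration-billed {wall_total:,} ms, "
        f"ratio {wall_total / cpu_total:.1f}x"
    )
```

The CPU-billed total is identical in both cases, while the duration-billed total grows tenfold when the request takes ten times as long.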

No surprise bills — set a maximum limit on CPU time for each Worker

Surprise bills from cloud providers are an unfortunately common horror story. In the old way of provisioning compute resources, forgetting to shut down an instance of a database or virtual machine can cost hundreds of dollars. And accidentally autoscaling up too high can be even worse.

We’re building new safeguards to prevent these kinds of scenarios on Workers. As part of new pricing, you will be able to cap CPU usage on a per-Worker basis.

For example, if you have a Worker with a p99 CPU time of 15ms, you might use this to set a max CPU limit of 40ms — enough headroom to ensure that your worker will run successfully, while ensuring that even if you ship a bug that causes a CPU time to ratchet up dramatically, or have an edge case that causes infinite recursion, you can’t suddenly rack up a giant unexpected bill, or be vulnerable to a denial of wallet attack. This can be particularly helpful if your worker handles variable or user-generated input, to guard against edge cases that you haven’t accounted for.

Alternatively, if you’re running a production service, but want to make sure you stay on top of your costs, we will also be adding the option to configure notifications that can automatically email you, page you, or send a webhook if your worker exceeds a particular amount of CPU time per request. You will be able to choose at what threshold you want to be notified, and how.

New ways to “hibernate” Durable Objects while keeping connections alive

While Workers are stateless functions, Durable Objects are stateful and long-lived, commonly used to coordinate and persist real-time state in chat, multiplayer games, or collaborative apps. And unlike Workers, duration-based pricing fits Durable Objects well. As long as one or more clients are connected to a Durable Object, it keeps state available in memory. Durable Objects pricing will remain duration-based, and is not changing as part of this announcement.

What about when a client is connected to a Durable Object, but no work has happened for a long time? Consider a collaborative whiteboard app built using Durable Objects. A user of the app opens the app in a browser tab, but then forgets about it, and leaves it running for days, with an open WebSocket connection. Just like with Workers, we don’t think you should have to pay for this idle time. But until recently, there hasn’t been an API to signal to us that a Durable Object can be safely “hibernated”.

The recently introduced Hibernation API, currently in beta, allows you to set an automatic response to be used while hibernated and serialize state such that it survives hibernation. This gives Cloudflare the inputs we need in order to maintain open WebSocket connections from clients, while “hibernating” the Durable Object such that it is not actively running, and you are not billed for idle time. The result is that your state is always available in-memory when you actually need it, but isn’t unnecessarily kept around when you don’t. As long as your Durable Object is hibernating, even if there are active clients still connected over a WebSocket, you won’t be billed for duration.

Snippets make Cloudflare’s CDN programmable — for free

What if you just want to modify a header, do a country code redirect, or cache a custom query? Developers have relied on Workers to program Cloudflare’s CDN like this for many years. With the announcement of Cloudflare Snippets last year, now in alpha, we’re making it free.

If you use Workers today for these smaller use cases, to customize any of Cloudflare’s application services, Snippets will be the optimal, zero cost option.

A serverless platform without limits

Developers are building ever larger and more complex full-stack applications on Workers each month. Our promise to you is to help you scale in any direction, without worrying about paying for idle time or having to manage and provision compute resources across regions.

This also means not having to worry about limits. Workers already serves many millions of requests per second, and scales and performs so well that we are rebuilding our own CDN on top of Workers. Individual Workers can now be up to 10MB, with a max startup time of 400ms, and can be easily composed together using Service Bindings. Entire platforms are built on top of Workers, with a growing number of companies allowing their own customers to write and deploy custom code and applications via Workers for Platforms. Some of the biggest platforms in the world rely on Cloudflare and the Workers platform during the most critical moments.

New pricing removes limits on the types of applications that could be built cost effectively with duration-based pricing. It removes the ceiling on CPU time from our original request-based pricing. We’re excited to see what you build, and are committed to being the development platform where you’re not constrained by limits on scale, regions, instances, concurrency or whatever else you need to handle to grow and operate globally.

When will new pricing be available?

Starting October 31, 2023, you will have the option to opt in individual Workers and Pages Functions projects on your account to new pricing, and newly created projects will default to new pricing. You will have until March 1, 2024, or the end of your Enterprise contract, whichever comes later, to switch to new pricing on your own, after which all of your projects will be automatically migrated to new pricing. You’ll receive plenty of advance notice via email and dashboard notifications before any changes go into effect.

Between now and then, we want to hear from you. We’ve based new pricing off feedback we’ve heard from developers building serverless applications, and companies estimating and projecting their costs. Tell us what you think of new pricing by sharing your feedback in this survey. We read every response.

Cloudflare Zaraz steps up: general availability and new pricing

Post Syndicated from Yair Dovrat original http://blog.cloudflare.com/cloudflare-zaraz-steps-up-general-availability-and-new-pricing/

Cloudflare Zaraz has transitioned out of beta and is now generally available to all customers. It is included under the free, paid, and enterprise plans of the Cloudflare Developer Platform. Visit our docs to learn more on our different plans.

Zaraz is part of the Cloudflare Developer Platform

Cloudflare Zaraz is a solution that developers and marketers use to load third-party tools like Google Analytics 4, Facebook CAPI, TikTok, and others. With Zaraz, Cloudflare customers can easily transition to server-side data collection with just a few clicks, without the need to set up and maintain their own cloud environment or make additional changes to their website for installation. Server-side data collection, as facilitated by Zaraz, simplifies analytics reporting from the server rather than loading numerous JavaScript files on the user's browser. It's a rapidly growing trend due to browser limitations on using third-party solutions and cookies. The result is significantly faster websites, plus enhanced security and privacy on the web.

We've had Zaraz in beta mode for a year and a half now. Throughout this time, we've dedicated our efforts to meeting as many customers as we could, gathering feedback, and getting a deep understanding of our users' needs before addressing them. We've been shipping features at a high rate and have now reached a stage where our product is robust, flexible, and competitive. It also offers unique features not found elsewhere, thanks to being built on Cloudflare’s global network, such as Zaraz’s Worker Variables. We have cultivated a strong and vibrant Discord community, and we have certified Zaraz developers ready to help anyone with implementation and configuration.

With more than 25,000 websites running Zaraz today – from personal sites to those of some of the world's biggest companies – we feel confident it's time to go out of beta, and introduce our new pricing system. We believe this pricing is not only generous to our customers, but also competitive and sustainable. We view this as the next logical step in our ongoing commitment to our customers, for whom we're building the future.

If you're building a web application, there's a good chance you've spent at least some time implementing third-party tools for analytics, marketing performance, conversion optimization, A/B testing, customer experience and more. Indeed, according to the Web Almanac report, 94% of mobile pages used at least one third-party solution in 2022, and third-party requests accounted for 45% of all requests made by websites. It's clear that third-party solutions are everywhere. They have become an integral part of how the web has evolved. Third-party tools are here to stay, and they require effective developer solutions. We are building Zaraz to help developers manage the third-party layer of their website properly.

Starting today, Cloudflare Zaraz is available to everyone for free under their Cloudflare dashboard, and the paid version of Zaraz is included in the Workers Paid plan. The Free plan is designed to meet the needs of most developers who want to use Zaraz for personal use cases. For a price starting at $5/month, customers of the Workers Paid plan can enjoy the extensive list of features that makes Zaraz powerful, deploy Zaraz on their professional projects, and utilize the pay-as-you-go system. This is in addition to everything else included in the Workers Paid plan. The Enterprise plan, on the other hand, addresses the needs of larger businesses looking to leverage our platform to its fullest potential.

How is Zaraz priced?

Zaraz pricing is based on two components: Zaraz Loads and the set of features. A Zaraz Load is counted each time a web page loads the Zaraz script within it, and/or the Pageview trigger is activated. For Single Page Applications, each URL navigation is counted as a new Zaraz Load. Under the Zaraz Monitoring dashboard, you can find a report showing how many Zaraz Loads your website has generated during a specific time period. Zaraz Loads and features are factored into our billing as follows:

Free plan

The Free Plan has a limit of 100,000 Zaraz Loads per month per account. This should allow almost everyone wanting to use Zaraz for personal use cases, like personal websites or side projects, to do so for free. After 100,000 Zaraz Loads, Zaraz will simply stop functioning.

Following the same logic, the free plan includes everything you need in order to use Zaraz for personal use cases. That includes Auto-injection, Zaraz Debugger, Zaraz Track and Zaraz Set from our Web API, Consent Management Platform (CMP), Data Layer compatibility mode, and many more.

If your websites generate more than 100,000 Zaraz loads combined, you will need to upgrade to the Workers Paid plan to avoid service interruption. If you desire some of the more advanced features, you can upgrade to Workers Paid and get access for only $5/month.

The Workers Paid Plan includes the first 200,000 Zaraz Loads per month per account, free of charge.

If you exceed the free Zaraz Loads allocations, you'll be charged $0.50 for every additional 1,000 Zaraz Loads, but the service will continue to function. (You can set notifications to get notified when you exceed a certain threshold of Zaraz Loads, to keep track of your usage.)
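As a rough illustration of the overage rule above, the following Python sketch computes the usage charge on the Workers Paid plan from a monthly Zaraz Loads count. The function name is hypothetical, the result excludes the base subscription, and the sketch assumes overage is pro-rated per load; the post does not say whether partial blocks of 1,000 are rounded.

```python
from decimal import Decimal

PAID_INCLUDED_LOADS = 200_000             # included per month per account on Workers Paid
OVERAGE_PRICE_PER_1000 = Decimal("0.50")  # dollars per additional 1,000 Zaraz Loads


def paid_overage_charge(loads: int) -> Decimal:
    """Overage in dollars for one month, excluding the base subscription.

    Assumption: charged pro rata per load rather than in whole blocks of 1,000.
    """
    extra_loads = max(0, loads - PAID_INCLUDED_LOADS)
    return (Decimal(extra_loads) / 1000) * OVERAGE_PRICE_PER_1000


for monthly_loads in (150_000, 201_000, 500_000):
    print(monthly_loads, paid_overage_charge(monthly_loads))
```

Under these assumptions, an account with 500,000 Zaraz Loads in a month is 300,000 over the included amount, which comes to 300 blocks at $0.50, or $150.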

Workers Paid customers can enjoy most of Zaraz’s robust existing features, including Zaraz E-commerce from our Web API, Custom Endpoints, Worker Variables, Preview/Publish Workflow, Privacy Features, and more.

If your websites generate Zaraz Loads in the millions, you might want to consider the Workers Enterprise plan. Beyond the free 200,000 Zaraz Loads per month for your account, it offers additional volume discounts based on your Zaraz Loads usage as well as Cloudflare’s professional services.

Enterprise plan

The Workers Enterprise Plan includes the first 200,000 Zaraz Loads per month per account free of charge. Based on your usage volume, Cloudflare’s sales representatives can offer compelling discounts. Get in touch with us here. Workers Enterprise customers enjoy all paid enterprise features.

I already use Zaraz, what should I do?

If you were using Zaraz under the free beta, you have a period of two months to adjust and decide how you want to go about this change. Nothing will change until September 20, 2023. In the meantime we advise you to:

  1. Get more clarity on your Zaraz Loads usage. Visit Monitoring to check how many Zaraz Loads you had in the previous couple of months. If you are worried about generating more than 100,000 Zaraz Loads per month, you might want to consider upgrading to Workers Paid via the plans page, to avoid service interruption. If you generate a large number of Zaraz Loads, you’d probably want to reach out to your sales representative and get volume discounts. You can leave your details here, and we’ll get back to you.
  2. Check if you are using one of the paid features as listed in the plans page. If you are, then you would need to purchase a Workers Paid subscription, starting at $5/month via the plans page. On September 20, these features will cease to work unless you upgrade.

* Please note, as of now, free plan users won't have access to any paid features. However, if you're already using a paid feature without a Workers Paid subscription, you can continue to use it risk-free until September 20. After this date, you'll need to upgrade to keep using any paid features.

We are here for you

As we make this important transition, we want to extend our sincere gratitude to all our beta users who have provided invaluable feedback and have helped us shape Zaraz into what it is today. We are excited to see Zaraz move beyond its beta stage and look forward to continuing to serve your needs and helping you build better, faster, and more secure web experiences. We know this change comes with adjustments, and we are committed to making the transition as smooth as possible. In the next couple of days, you can expect an email from us, with clear next steps and a way to get advice in case of need. You can always get in touch directly with the Cloudflare Zaraz team on Discord, or the community forum.

Thank you for joining us on this journey and for your ongoing support and trust in Cloudflare Zaraz. Let's continue to build the future of the web together!

Adjusting pricing, introducing annual plans, and accelerating innovation

Post Syndicated from Matthew Prince original https://blog.cloudflare.com/adjusting-pricing-introducing-annual-plans-and-accelerating-innovation/

Cloudflare is raising prices for the first time in the last 12 years. Beginning January 15, 2023, new sign ups will be charged $25 per month for our Pro Plan (up from $20 per month) and $250 per month for our Business Plan (up from $200 per month). Any paying customers who sign up before January 15, 2023, including any currently paying customers who signed up at any point over the last 12 years, will be grandfathered at the old monthly price until May 14, 2023.

We are also introducing an option to pay annually, rather than monthly, that we hope most customers will choose to switch to. Annual plans are available today and discounted from the new monthly rate to $240 per year for the Pro Plan (the equivalent of $20 per month, saving $60 per year) and $2,400 per year for the Business Plan (the equivalent of $200 per month, saving $600 per year). In other words, if you choose to pay annually for Cloudflare you can lock in our old monthly prices.
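The savings figures follow directly from the new monthly rates and the annual prices. A small Python check, with the function name chosen here purely for illustration:

```python
def annual_saving(new_monthly_price: int, annual_price: int) -> int:
    """Dollars saved per year by paying annually instead of at the new monthly rate."""
    return new_monthly_price * 12 - annual_price


pro_saving = annual_saving(new_monthly_price=25, annual_price=240)
business_saving = annual_saving(new_monthly_price=250, annual_price=2400)

print(f"Pro Plan saving: ${pro_saving} per year")            # 25 * 12 - 240 = 60
print(f"Business Plan saving: ${business_saving} per year")  # 250 * 12 - 2400 = 600
```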

Having never raised prices in our history, we thought carefully about this before deciding to do it. While we have over a decade of network expansion and innovation under our belts, what may not be intuitive is that our goal is not to increase revenue from this change. We need to invest up front in building out our network, and the main reason we’re making this change is to more closely map our business with the timing of our underlying costs. Doing so will enable us to further accelerate our network expansion and pace of innovation, which all of our customers will benefit from. Since this is a big change for us, I wanted to take the time to walk through how we came to this decision.

Cloudflare’s history

Cloudflare launched on September 27, 2010. At the time we had two plans: one Free Plan that was free, and a Pro Plan that cost $20 per month. Our network at the time consisted of “four and a half” data centers: Chicago, Illinois; Ashburn, Virginia; San Jose, California; Amsterdam, Netherlands; and Tokyo, Japan. The routing to Tokyo was so flaky that we’d turn it off for half the day to not mess up routing around the rest of the world. The biggest difference for the first couple years between our Free and Pro Plans was that only the latter included HTTPS support.

Slide from the Cloudflare Launch Presentation at TechCrunch Disrupt, September 27, 2010

In June 2012, we introduced our Business Plan for $200 per month and our Enterprise Plan which was customized for our largest customers. By then we’d not only gotten Tokyo to work reliably but added 18 more data centers around the world for a total of 23. Our Business plan added DDoS mitigation as the primary benefit, something prior to then we’d been terrified to offer.

Cloudflare’s Network as of June 16, 2012, courtesy of The Internet Archive’s Wayback Machine

My how you’ve grown

Fast-forward to today and a lot has changed. We’re up to presence in more than 275 cities in more than 100 countries worldwide. We included HTTPS support in our Free Plan with the launch of Universal SSL in September 2014. We included unlimited DDoS mitigation in our Free Plan with the launch of Unmetered DDoS Mitigation in September 2017. Today, we stop attacks for Free Plan customers on a daily basis that are more than 10-times as big as what was headline news back in 2013.

Our strategy has always been to roll features out, limit them at first to higher tiers of paying customers, but, over time, roll them down through our plans and eventually to even our Free Plan customers. We believe everyone should be fast, reliable, and secure online regardless of their budget. And we believe our continued success should be primarily driven by new innovation, not by milking old features for revenue.

And we’ve delivered on that promise, accelerating our roll out of new features across our platform and bundling them into our existing plans without increasing prices. What you get for our Free, Pro, and Business Plans today is orders of magnitude more valuable across every dimension — performance, reliability, and security — than those plans were when they launched.

And yet we know we are our customers’ infrastructure. You rely on us. And therefore we have been very reluctant to ever raise prices just to take price and capture more revenue.

Annual plans for even faster innovation

Early on, we only charged monthly because we were an unproven service we knew customers were taking a risk on. Today, that’s no longer the case. The majority of our customers have been using us for years and, from our conversations with them, plan to continue using us for the foreseeable future. In fact, one of the top requests we receive is from customers who want to pay once per year rather than getting billed every month.

While I’m proud of our pace of innovation, one of the challenges we have is managing the cash flow to fund those investments as quickly as we’d like. We invest up front in building out our network or developing a new feature, but then only get paid monthly by our customers. That, inherently, is a governor on our pace of innovation. We can invest even faster — hire more engineers, deploy more servers — if those customers who know they’re going to use us for the next year pay for us up front. We have no shortage of things we know customers want us to build, so by collecting revenue earlier we know we can unlock even faster innovation.

In other words, we are making this change hoping most of you won’t pay us anything more than you did before. Instead, our hope is that most of you will adopt our annual plans — you’ll get to lock in the existing pricing, and you’ll help us further accelerate our network growth and pace of innovation.

Finally, I wanted to mention that something isn’t changing: our Free Plan. It will still be free. It will still have all the features it has today. And we’re still committed to rolling many more features that are only available in paid plans today down to the Free Plan over time. Our mission is to help build a better Internet. We want to win by being the most innovative company in the world. And that means making our services available to as many people as possible, even those who can’t afford to pay us right now.

But, for those of you who can pay: thank you. You’ve funded our innovation to date. And I hope you’ll opt to switch to our annual billing, so we can further accelerate our network expansion and pace of innovation.

Democratizing Fare Storage at scale using Event Sourcing

Post Syndicated from Grab Tech original https://engineering.grab.com/democratizing-fare-storage-at-scale-using-event-sourcing

From humble beginnings, Grab has expanded across different markets in the last couple of years. We’ve added a wide range of features to the Grab platform to continue to delight our customers and driver-partners. We had to incessantly find ways to improve our existing solutions to better support new features.

In this blog, we discuss how we built Fare Storage, Grab’s single source of truth fare data store, and how we overcame the challenges to make it more reliable and scalable to support our expanding features.

High-level Flow

To set some context for this blog, let’s define some key terms before proceeding. A Fare is a dollar amount calculated to move someone or something from point A to point B. And, a Fee is a dollar amount added to or subtracted from the original fare amount for any additional service.

Now that you’re acquainted with the key concepts, let’s take a look at the following image. It illustrates that features such as Destination Change Fee, Waiting Fee, Cancellation Fee, Tolls, Promos, Surcharges, and many others store an additional fee breakdown along with the original fare. This set of information is crucial for generating receipts and debugging processes. However, our legacy storage system wasn’t designed to host massive quantities of information effectively.

Sample Flow with Fee Breakdown

In our legacy architecture, we stored all the booking and fare-related information in a single relational database table. Adding new fare fields and breakdowns required changes in our critical booking system, making iterations prohibitively expensive and hindering innovation.

The need to store the fare information and metadata for every additional feature along with other booking information resulted in a bloated booking entity. With millions of bookings created every day at Grab, this posed a scaling and stability threat to our booking service storage. Moreover, the legacy storage only tracked the latest value of fare and lacked a holistic view of all the modifications to the fare. So, debugging the fare was also a massive chore for our Engineering and Tech Operations teams.

Drafting a solution

The shortcomings of our legacy system led us to explore options for decoupling the fare and its metadata storage from the booking details. We wanted to build a platform that can store and provide access to both fare and its audit trail.

High-level functional requirements for the new fare store were:

  • Provide a platform to store and retrieve fare and associated breakdowns, with no tight coupling between services.
  • Act as a single source-of-truth for fare and associated fees in the Grab ecosystem.
  • Enable clients to access the metadata of fare change events in real-time, enabling the Product team to innovate freely.
  • Provide smooth access to a fare’s audit trail, improving the response time to our customers’ queries.

Non-functional requirements for the fare store were:

  • High availability for the read and write APIs, with latency of a few milliseconds.
  • Handle concurrent updates to the fare gracefully.
  • Detect duplicate events for a booking for the same transaction.

Storing change sequence with Event Sourcing

Our legacy storage solution used a defined schema and only stored the latest state of the fare. We needed an audit trail-based storage system with fast querying capabilities that can store and retrieve changes in chronological order.

The Event Sourcing pattern stood out as a flexible architectural pattern because it allowed us to store and query the sequence of changes in the order they occurred. In his blog, Martin Fowler describes Event Sourcing as follows:

“The fundamental idea of Event Sourcing is to ensure that every change to the state of an application is captured in an event object and that these event objects are themselves stored in the sequence they were applied for the same lifetime as the application state itself.”

With the Event Sourcing pattern, we store all the fare changes as events in the order they occurred for a booking. We iterate through these events to retrieve a complete snapshot of the modifications to the fare.

A sample Fare Event looks like this:

message Event {
  // type of the event, ADD, SUB, SET, resilient
  EventType type = 1;
  // value which was added, subtracted or modified
  double value = 2;
  // fare for the booking after applying discount
  double fare = 3;

  ...

  // description bytes generated by SDK
  bytes description = 11;
  // transactionID for the EventType
  string transactionID = 12;
}
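To make the replay step concrete, here is a minimal Go sketch of folding a booking's events into the fare. It is an illustration under assumptions, not Grab's implementation: the `Event` struct mirrors only the `type` and `value` fields of the message above, and the `EventType` constants, the `Replay` function, and the meaning given to ADD, SUB, and SET are hypothetical.

```go
package main

import "fmt"

// EventType is a hypothetical Go counterpart of the EventType enum.
type EventType int

const (
	Add EventType = iota // value is added to the fare
	Sub                  // value is subtracted from the fare
	Set                  // fare is overwritten with value
)

// Event keeps only the fields needed for replay; the real message has more.
type Event struct {
	Type  EventType
	Value float64
}

// Replay folds events in the order they were stored and returns the fare
// after each event, so the last element is the current fare.
func Replay(events []Event) []float64 {
	history := make([]float64, 0, len(events))
	fare := 0.0
	for _, e := range events {
		switch e.Type {
		case Add:
			fare += e.Value
		case Sub:
			fare -= e.Value
		case Set:
			fare = e.Value
		}
		history = append(history, fare)
	}
	return history
}

func main() {
	events := []Event{
		{Set, 10.0}, // initial fare
		{Add, 2.5},  // e.g. a toll
		{Sub, 3.0},  // e.g. a promo
	}
	fmt.Println(Replay(events)) // prints [10 12.5 9.5]
}
```

Because every intermediate value is returned, the same loop serves both use cases mentioned above: the last element is the current fare, and the full slice is the audit trail.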

The Event Sourcing pattern also enables us to use the Command Query Responsibility Segregation (CQRS) pattern to decouple the read responsibility for different use cases.

CQRS Pattern

Clients of the fare life cycle read the current fare and create events to change the fare value as per their logic. Clients can also access fare events when required. This pattern enables clients to modify fares independently, and gives them visibility into the sequence of changes for different business needs.

The diagram below describes the overall fare life cycle from creation, modification to display using the event store.

Overall Fare Life Cycle

Architecture overview

Fare Cycle Architecture

Clients interact with the Fare LifeCycle service through an SDK. The SDK offers various features such as metadata serialization and deserialization, retries, and timeout configuration, some of which are discussed later.

The Fare LifeCycle Store service uses DynamoDB as the Event Store to persist and read fare change events, backed by a cache for eventually consistent reads. For further processing, such as archiving and generation of receipts, the successfully updated events are streamed out to a message queue system.

Ensuring the integrity of the fare sequence

Democratizing the responsibility of fare modification means that multiple services might try to update the fare in parallel without prior synchronization. Concurrent fare updates for the same booking might result in a race condition. Concurrency and consistency are recurring challenges in distributed storage systems.

Let’s understand why the ordering of fare updates is important. Business rules for different cities and countries regulate the pricing features based on local market conditions and prevailing laws. For example, in some scenarios, Tolls and Waiting Fees may not be eligible for discounts or promotions. The service applying discounts needs to consider this information while applying a discount. Therefore, updates to the fare are not independent of the previous fare events.
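As a rough illustration of that dependency, the Go sketch below computes a percentage discount only over the components that are eligible for it. Everything in it is hypothetical: the `FareEvent` type, the component labels, and the eligibility rule are invented for the example, and amounts are kept in cents to avoid floating-point rounding.

```go
package main

import "fmt"

// FareEvent is a hypothetical, simplified event: a labelled amount in cents.
type FareEvent struct {
	Component string // e.g. "base", "toll", "waiting_fee"
	Cents     int64
}

// discountable reports whether a component may be discounted. The rule set
// here is invented for illustration; real rules vary by city and country.
func discountable(component string) bool {
	return component != "toll" && component != "waiting_fee"
}

// DiscountCents computes a percentage discount over the eligible part of the
// fare, which requires reading every earlier event for the booking.
func DiscountCents(events []FareEvent, percent int64) int64 {
	var eligible int64
	for _, e := range events {
		if discountable(e.Component) {
			eligible += e.Cents
		}
	}
	return eligible * percent / 100
}

func main() {
	events := []FareEvent{{"base", 1000}, {"toll", 250}}
	fmt.Println(DiscountCents(events, 20)) // prints 200: 20% of the base only
}
```

If a surcharge event lands between the moment the discount service reads the events and the moment it writes its own, the discount it computed is already out of date, which is the race the following sections address.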

Fare Integrity

We needed a mechanism to detect race conditions and handle them appropriately to ensure the integrity of the fare. To handle race conditions based on our use case, we explored Pessimistic and Optimistic locking mechanisms.

All the expected fare change events happen based on certain conditions being true or false. For example, less than 1% of the bookings have a payment change request initiated by passengers during a ride. And, the probability of multiple similar changes happening on the same booking is rather low. Optimistic Locking offers both efficiency and correctness for our requirements where the chances of race conditions are low, and the records are independent of each other.

The logic to calculate the fare/surcharge is coupled with the business logic of the system that calculates the fare component or fees. So, handling data race conditions on the data store layer was not an acceptable option either. It made more sense to let the clients handle it and keep the storage system decoupled from the business logic to compute the fare.

Optimistic Locking

To achieve Optimistic Locking, we store a fare version and increment it on every successful update. To modify the fare, the client must pass the version it read. If that version does not match the current fare version, the update is rejected. On a version mismatch, the client reads the updated version (checksum) and retries with the recalculated fare.
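The read, check, and retry cycle can be sketched in Go as follows. This is a minimal in-memory illustration, not the service's code: `FareStore`, `Update`, and `AddFee` are hypothetical names, and the mutex only stands in for whatever makes the version check and the write atomic in the real store (a database such as DynamoDB can express this as a conditional write, though the post does not say how it is done).

```go
package main

import (
	"errors"
	"fmt"
	"sync"
)

// ErrVersionMismatch tells the client to re-read and retry.
var ErrVersionMismatch = errors.New("fare version mismatch")

// FareStore is a hypothetical in-memory stand-in for the event store,
// holding the state of a single booking.
type FareStore struct {
	mu      sync.Mutex
	version int
	fare    float64
}

// Read returns the current fare and the version the client must echo back.
func (s *FareStore) Read() (fare float64, version int) {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.fare, s.version
}

// Update applies newFare only if readVersion is still current.
func (s *FareStore) Update(readVersion int, newFare float64) error {
	s.mu.Lock()
	defer s.mu.Unlock()
	if readVersion != s.version {
		return ErrVersionMismatch
	}
	s.fare = newFare
	s.version++
	return nil
}

// AddFee shows the client-side loop: read, recalculate, retry on mismatch.
func AddFee(s *FareStore, fee float64, maxAttempts int) error {
	for i := 0; i < maxAttempts; i++ {
		fare, version := s.Read()
		err := s.Update(version, fare+fee)
		if err == nil {
			return nil
		}
		if !errors.Is(err, ErrVersionMismatch) {
			return err
		}
	}
	return fmt.Errorf("gave up after %d attempts: %w", maxAttempts, ErrVersionMismatch)
}

func main() {
	store := &FareStore{fare: 10}
	_, stale := store.Read()            // this client reads version 0
	_ = AddFee(store, 2.5, 3)           // another client gets in first
	fmt.Println(store.Update(stale, 8)) // the stale version is rejected
	fmt.Println(store.Read())           // fare and version after one update
}
```

Note that the fare is recalculated inside the loop rather than before it, so a retry always builds on the latest stored value.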

Idempotency of event updates

The next challenge we came across was how to handle client retries while ensuring that we do not duplicate the same event for the booking. Clients might encounter sporadic errors as a result of network-related issues, although the update was successful. Under such circumstances, clients retry the update of the same event, resulting in duplicate events. Duplicate events not only require extra space, they also impair the clients’ understanding of whether we’ve taken an action multiple times on the fare.

As discussed in the previous section, retrying with the same version would fail due to the version mismatch. If the previous attempt successfully modified the fare, it would also update the version.

However, clients might not know if their update modified the version or if any other clients updated the data. Relying on clients to check for event duplication makes the client-side complex and leaves a chance of error if the clients do not handle it correctly.

Solution for Duplicate Events

To handle duplicate events, we associate each event with a UUID (transactionID) generated on the client side using a UUID library from the Fare LifeCycle service SDK. Before updating the fare, we check whether the transactionID is already among the successful transaction IDs. If it is, we return a duplicate event error to the client.

For a unique transactionID, we append it to the list of transactionIDs and save it to the Event Store along with the event.
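A minimal Go sketch of this check is shown below. The `BookingLog` type and its methods are hypothetical, and the sketch keeps the applied transaction IDs in a map for constant-time lookup, whereas the text above describes a list; the UUID in `main` is a made-up sample value.

```go
package main

import (
	"errors"
	"fmt"
)

// ErrDuplicateEvent is returned when a transaction ID was already applied.
var ErrDuplicateEvent = errors.New("duplicate event")

// Event is reduced to the two fields this sketch needs.
type Event struct {
	TransactionID string
	Value         float64
}

// BookingLog is a hypothetical per-booking record: the ordered events plus
// the set of transaction IDs that have already been applied.
type BookingLog struct {
	events []Event
	seen   map[string]struct{}
}

// NewBookingLog returns an empty log ready for use.
func NewBookingLog() *BookingLog {
	return &BookingLog{seen: make(map[string]struct{})}
}

// Append stores the event once; a retry with the same ID is reported as a
// duplicate and leaves the log unchanged.
func (b *BookingLog) Append(e Event) error {
	if _, ok := b.seen[e.TransactionID]; ok {
		return ErrDuplicateEvent
	}
	b.seen[e.TransactionID] = struct{}{}
	b.events = append(b.events, e)
	return nil
}

func main() {
	bookingLog := NewBookingLog()
	e := Event{TransactionID: "3f1c0a52-7d44-4c0e-9b6e-2a1f5c8d9e10", Value: 2.5}
	fmt.Println(bookingLog.Append(e))   // first attempt: <nil>
	fmt.Println(bookingLog.Append(e))   // retry: duplicate event
	fmt.Println(len(bookingLog.events)) // the event is stored only once
}
```

A client that receives the duplicate error after a network failure can treat it as confirmation that its earlier attempt succeeded.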

Schema-less metadata

Metadata are the breakdowns associated with the fare. We require the metadata of a specific fee/fare calculation for generating receipts and for debugging. The storage system and other clients therefore need not know the metadata definition of every event.

One goal for our data store was to give our clients the flexibility to add new fields to existing metadata or to define new metadata without changing the API. We adopted an SDK-based approach for metadata, where clients interact with the Fare LifeCycle service via SDK. The SDK has the following responsibilities for metadata:

  1. Serialize the metadata into bytes before making an API call to the Fare LifeCycle service.
  2. Deserialize the bytes metadata returned from the Fare LifeCycle service into a Go struct for client access.

Fare LifeCycle SDK

Serializing and deserializing the metadata on the client-side decoupled it from the Fare LifeCycle Store API. This helped teams update the metadata without deploying the storage service each time.

To read the breakdown, clients pass the metadata bytes to the SDK along with the Event Type, and the SDK converts them back into the corresponding proto schema. With this approach, clients can update the metadata without changing the Data Store Service.
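The round trip can be sketched in Go as follows. To keep the example self-contained it uses the standard library's `encoding/json` in place of protobuf, which is a deliberate substitution; the metadata structs, the `"TOLL"` and `"PROMO"` event type labels, and the `Serialize` and `Deserialize` functions are all hypothetical and not the SDK's actual API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TollMetadata and PromoMetadata are invented breakdowns for illustration.
type TollMetadata struct {
	Plaza string  `json:"plaza"`
	Fee   float64 `json:"fee"`
}

type PromoMetadata struct {
	Code string `json:"code"`
}

// Serialize turns any metadata struct into opaque bytes for the store.
func Serialize(metadata any) ([]byte, error) {
	return json.Marshal(metadata)
}

// Deserialize picks the target struct from the event type, so the storage
// service never needs to know the metadata schema.
func Deserialize(eventType string, raw []byte) (any, error) {
	switch eventType {
	case "TOLL":
		var m TollMetadata
		err := json.Unmarshal(raw, &m)
		return m, err
	case "PROMO":
		var m PromoMetadata
		err := json.Unmarshal(raw, &m)
		return m, err
	default:
		return nil, fmt.Errorf("unknown event type %q", eventType)
	}
}

func main() {
	raw, err := Serialize(TollMetadata{Plaza: "hypothetical-plaza", Fee: 2.5})
	if err != nil {
		panic(err)
	}
	m, err := Deserialize("TOLL", raw)
	fmt.Println(m, err)
}
```

Adding a field to `TollMetadata` changes only the client and the SDK; the store keeps handling the same opaque bytes.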

Conclusion

The Fare LifeCycle service enabled us to revolutionize the fare storage at scale for Grab’s ecosystem of services. Further benefits realized with the system are:

  • The feature-agnostic platform helped us reduce the time-to-market for our hyper-local features so that we can further outserve our customers and driver-partners.
  • Decoupling the fare information from the booking information helped us achieve a better separation of concerns between services.
  • Allowing fare and booking information to scale independently of each other improved the overall reliability and scalability of the Grab platform.
  • Services no longer need unnecessary coupling to fetch or update fare-related information.
  • The audit trail of fare changes in chronological order reduced the time to debug fares and improved our response to customers’ fare-related queries.

We hope this post gave you a closer look at how we used the Event Sourcing pattern to build a data store and how we handled a few caveats and challenges in the process.


Authored by Sourabh Suman on behalf of the Pricing team at Grab. Special thanks to Karthik Gandhi, Kurni Famili, ChandanKumar Agarwal, Adarsh Koyya, Niteesh Mehra, Sebastian Wong, Matthew Saw, Muhammad Muneer, and Vishal Sharma for their contributions.


Join us

Grab is more than just the leading ride-hailing and mobile payments platform in Southeast Asia. We use data and technology to improve everything from transportation to payments and financial services across a region of more than 620 million people. We aspire to unlock the true potential of Southeast Asia and look for like-minded individuals to join us on this ride.

If you share our vision of driving Southeast Asia forward, apply to join our team today.