All posts by Umesh Kalaspurkar

Reduce costs and enable integrated SMS tracking with Braze URL shortening

Post Syndicated from Umesh Kalaspurkar original https://aws.amazon.com/blogs/architecture/reduce-costs-and-enable-integrated-sms-tracking-with-braze-url-shortening/

As competition grows fiercer, marketers need ways to ensure they reach each user with personalized content on their most critical channels. Short message/messaging service (SMS) is a key part of that effort, touching more than 5 billion people worldwide, with an impressive 82% open rate. However, SMS lacks the built-in engagement metrics supported by other channels.

To bridge this gap, leading customer engagement platform, Braze, recently built an in-house SMS link shortening solution using Amazon DynamoDB and Amazon DynamoDB Accelerator (DAX). It’s designed to handle up to 27 billion redirects per month, allowing marketers to automatically shorten SMS-related URLs. Alongside the Braze Intelligence Suite, you can use SMS click data in reporting functions and retargeting actions. Read on to learn how Braze created this feature and the impact it’s having on marketers and consumers alike.

SMS link shortening approach

Many Braze customers have used third-party SMS link shortening solutions in the past. However, this approach complicates the SMS composition process and isolates click metrics from Braze analytics. This makes it difficult to get a full picture of SMS performance.

Multiple approaches for shortening URLs, SMS, 3rd party, and Braze. Includes SMS links

Figure 1. Multiple approaches for shortening URLs

The following table compares all 3 approaches for their pros and cons.

Scenario #1 – Unshortened URL in SMS #2 – 3rd Party Shortener #3 – Braze Link Shortening & Click Tracking
Low Character Count X
Total Clicks X
Ability to Retarget Users X X
Ability to Trigger Subsequent Messages X X

With link shortening built in-house and more tightly integrated into the Braze platform, Braze can maintain more control over their roadmap priority. By developing the tool internally, Braze achieved a 90% reduction in ongoing expenses compared with the $400,000 annual expense associated with using an outside solution.

Braze SMS link shortening: Flow and architecture

SMS link shortening architecture diagram

Figure 2. SMS link shortening architecture

The following steps explain the link shortening architecture:

  1. First, customers initiate campaigns via the Braze Dashboard. Using this interface, they can also make requests to shorten URLs.
  2. The URL registration process is managed by a Kubernetes-deployed Go-based service. This service not only shortens the provided URL but also maintains reference data in Amazon DynamoDB.
  3. After processing, the dashboard receives the generated campaign details alongside the shortened URL.
  4. The fully refined campaign can be efficiently distributed to intended recipients through SMS channels.
  5. Upon a user’s interaction with the shortened URL, the message gets directed to the URL redirect service. This redirection occurs through an Application Load Balancer.
  6. The redirect service processes links in messages, calls the service, and replaces links before sending to carriers.
  7. Asynchronous calls feed data to a Kafka queue for metrics, using the HTTP sink connector integrated with Braze systems.

The registration and redirect services are decoupled from the Braze platform to enable independent deployment and scaling due to different requirements. Both the services are running the same code, but with different endpoints exposed, depending on the functionality of a given Kubernetes pod. This restricts internal access to the registration endpoint and permits independent scaling of the services, while still maintaining a fast response time.

Braze SMS link shortening: Scale

Right now, our customers use the Braze platform to send about 200 million SMS messages each month, with peak speeds of around 2,000 messages per second. Many of these messages contain one or more URLs that need to be shortened. In order to support the scalability of the link shortening feature and give us room to grow, we designed the service to handle 33 million URLs sent per month, and 3.25 million redirects per month. We assumed that we’d see up to 65 million database writes per month and 3.25 million reads per month in connection with the redirect service. This would require storage of 65 GB per month, with peaks of ~2,000 writes and 100 reads per second.

With these needs in mind, we carried out testing and determined that Amazon DynamoDB made the most sense as the backend database for the redirect service. To determine this, we tested read and write performance and found that it exceeded our needs. Additionally, it was fully managed, thus requiring less maintenance expertise, and included DAX out of the box. Most clicks happen close to send, so leveraging DAX helps us smooth out the read and write load associated with the SMS link shortener.

Because we know how long we must keep the relevant written elements at write time, we’re able to use DynamoDB Time to Live (TTL) to effectively manage their lifecycle. Finally, we’re careful to evenly distribute partition keys to avoid hot partitions, and DynamoDB’s autoscaling capabilities make it possible for us to respond more efficiently to spikes in demand.

Braze SMS link shortening: Flow

Braze SMS link shortening flow, including Registration and Redirect service

Figure 3. Braze SMS link shortening flow

  1. When the marketer initiates an SMS send, Braze checks its primary datastore (a MongoDB collection) to see if the link has already been shortened (see Figure 3). If it has, Braze re-uses that shortened link and continues the send. If it hasn’t, the registration process is initiated to generate a new site identifier that encodes the generation date and saves campaign information in DynamoDB via DAX.
    1. The response from the registration service is used to generate a short link (1a) for the SMS.
  2. A recipient gets an SMS containing a short link (2).
  3. Recipient decides to tap it (3). Braze smoothly redirects them to the destination URL, and updates the campaign statistics to show that the link was tapped.
    1. Using Amazon Route 53’s latency-based routing, Braze directs the recipient to the nearest endpoint (Braze currently has North America and EU deployments), then inspects the link to ensure validity and that it hasn’t expired. If it passes those checks, the redirect service queries DynamoDB via DAX for information about the redirect (3a). Initial redirects are cached at send time, while later requests query the DAX cache.
    2. The user is redirected with a P99 redirect latency of less than 10 milliseconds (3b).
  4. Emit campaign-level metrics on redirects.

Braze generates URL identifiers, which serve as the partition key to the DynamoDB collection, by generating a random number. We concatenate the generation date timestamp to the number, then Base66 encode the value. This results in a generated URL that looks like https://brz.ai/5xRmz, with “5xRmz” being the encoded URL identifier. The use of randomized partition keys helps avoid hot, overloaded partitions. Embedding the generation date lets us see when a given link was generated without querying the database. This helps us maintain performance and reduce costs by removing old links from the database. Other cost control measures include autoscaling and the use of DAX to avoid repeat reads of the same data. We also query DynamoDB directly against a hash key, avoiding scatter-gather queries.

Braze link shortening feature results

Since its launch, SMS link shortening has been used by over 300 Braze customer companies in more than 700 million SMS messages. This includes 50% of the total SMS volume sent by Braze during last year’s Black Friday period. There has been a tangible reduction in the time it takes to build and send SMS. “The Motley Fool”, a financial media company, saved up to four hours of work per month while driving click rates of up to 15%. Another Braze client utilized multimedia messaging service (MMS) and link shortening to encourage users to shop during their “Smart Investment” campaign, rewarding users with additional store credit. Using the engagement data collected with Braze link shortening, they were able to offer engaged users unique messaging and follow-up offers. They retargeted users who did not interact with the message via other Braze messaging channels.

Conclusion

The Braze platform is designed to be both accessible to marketers and capable of supporting best-in-class cross-channel customer engagement. Our SMS link shortening feature, supported by AWS, enables marketers to provide an exceptional user experience and save time and money.

Further reading:

How SeatGeek uses AWS Serverless to control authorization, authentication, and rate-limiting in a multi-tenant SaaS application

Post Syndicated from Umesh Kalaspurkar original https://aws.amazon.com/blogs/architecture/how-seatgeek-uses-aws-to-control-authorization-authentication-and-rate-limiting-in-a-multi-tenant-saas-application/

SeatGeek is a ticketing platform for web and mobile users, offering ticket purchase and reselling for sports games, concerts, and theatrical productions. In 2022, SeatGeek had an average of 47 million daily tickets available, and their mobile app was downloaded 33+ million times.

Historically, SeatGeek used multiple identity and access tools internally. Applications were individually managing authorization, leading to increased overhead and a need for more standardization. SeatGeek sought to simplify the API provided to customers and partners by abstracting and standardizing the authorization layer. They were also looking to introduce centralized API rate-limiting to prevent noisy neighbor problems in their multi-tenant SaaS application.

In this blog, we will take you through SeatGeek’s journey and explore the solution architecture they’ve implemented. As of the publication of this post, many B2B customers have adopted this solution to query terabytes of business data.

Building multi-tenant SaaS environments

Multi-tenant SaaS environments allow highly performant and cost-efficient applications by sharing underlying resources across tenants. While this is a benefit, it is important to implement cross-tenant isolation practices to adhere to security, compliance, and performance objectives. With that, each tenant should only be able to access their authorized resources. Another consideration is the noisy neighbor problem that occurs when one of the tenants monopolizes excessive shared capacity, causing performance issues for other tenants.

Authentication, authorization, and rate-limiting are critical components of a secure and resilient multi-tenant environment. Without these mechanisms in place, there is a risk of unauthorized access, resource-hogging, and denial-of-service attacks, which can compromise the security and stability of the system. Validating access early in the workflow can help eliminate the need for individual applications to implement similar heavy-lifting validation techniques.

SeatGeek had several criteria for addressing these concerns:

  1. They wanted to use their existing Auth0 instance.
  2. SeatGeek did not want to introduce any additional infrastructure management overhead; plus, they preferred to use serverless services to “stitch” managed components together (with minimal effort) to implement their business requirements.
  3. They wanted this solution to scale as seamlessly as possible with demand and adoption increases; concurrently, SeatGeek did not want to pay for idle or over-provisioned resources.

Exploring the solution

The SeatGeek team used a combination of Amazon Web Services (AWS) serverless services to address the aforementioned criteria and achieve the desired business outcome. Amazon API Gateway was used to serve APIs at the entry point to SeatGeek’s cloud environment. API Gateway allowed SeatGeek to use a custom AWS Lambda authorizer for integration with Auth0 and defining throttling configurations for their tenants. Since all the services used in the solution are fully serverless, they do not require infrastructure management, are scaled up and down automatically on-demand, and provide pay-as-you-go pricing.

SeatGeek created a set of tiered usage plans in API Gateway (bronze, silver, and gold) to introduce rate-limiting. Each usage plan had a pre-defined request-per-second rate limit configuration. A unique API key was created by API Gateway for each tenant. Amazon DynamoDB was used to store the association of existing tenant IDs (managed by Auth0) to API keys (managed by API Gateway). This allowed us to keep API key management transparent to SeatGeek’s tenants.

Each new tenant goes through an onboarding workflow. This is an automated process managed with Terraform. During new tenant onboarding, SeatGeek creates a new tenant ID in Auth0, a new API key in API Gateway, and stores association between them in DynamoDB. Each API key is also associated with one of the usage plans.

Once onboarding completes, the new tenant can start invoking SeatGeek APIs (Figure 1).

SeatGeek's fully serverless architecture

Figure 1. SeatGeek’s fully serverless architecture

  1. Tenant authenticates with Auth0 using machine-to-machine authorization. Auth0 returns a JSON web token representing tenant authentication success. The token includes claims required for downstream authorization, such as tenant ID, expiration date, scopes, and signature.
  2. Tenant sends a request to the SeatGeak API. The request includes the token obtained in Step 1 and application-specific parameters, for example, retrieving the last 12 months of booking data.
  3. API Gateway extracts the token and passes it to Lambda authorizer.
  4. Lambda authorizer retrieves the token validation keys from Auth0. The keys are cached in the authorizer, so this happens only once for each authorizer launch environment. This allows token validation locally without calling Auth0 each time, reducing latency and preventing an excessive number of requests to Auth0.
  5. Lambda authorizer performs token validation, checking tokens’ structure, expiration date, signature, audience, and subject. In case validation succeeds, Lambda authorizer extracts the tenant ID from the token.
  6. Lambda authorizer uses tenant ID extracted in Step 5 to retrieve the associated API key from DynamoDB and return it back to API Gateway.
  7. The API Gateway uses API key to check if the client making this particular request is above the rate-limit threshold, based on the usage plan associated with API key. If the rate limit is exceeded, HTTP 429 (“Too Many Requests”) is returned to the client. Otherwise, the request will be forwarded to the backend for further processing.
  8. Optionally, the backend can perform additional application-specific token validations.

Architecture benefits

The architecture implemented by SeatGeek provides several benefits:

  • Centralized authorization: Using Auth0 with API Gateway and Lambda authorizer allows for standardization the API authentication and removes the burden of individual applications having to implement authorization.
  • Multiple levels of caching: Each Lambda authorizer launch environment caches token validation keys in memory to validate tokens locally. This reduces token validation time and helps to avoid excessive traffic to Auth0. In addition, API Gateway can be configured with up to 5 minutes of caching for Lambda authorizer response, so the same token will not be revalidated in that timespan. This reduces overall cost and load on Lambda authorizer and DynamoDB.
  • Noisy neighbor prevention: Usage plans and rate limits prevent any particular tenant from monopolizing the shared resources and causing a negative performance impact for other tenants.
  • Simple management and reduced total cost of ownership: Using AWS serverless services removed the infrastructure maintenance overhead and allowed SeatGeek to deliver business value faster. It also ensured they didn’t pay for over-provisioned capacity, and their environment could scale up and down automatically and on demand.

Conclusion

In this blog, we explored how SeatGeek used AWS serverless services, such as API Gateway, Lambda, and DynamoDB, to integrate with external identity provider Auth0, and implemented per-tenant rate limits with multi-tiered usage plans. Using AWS serverless services allowed SeatGeek to avoid undifferentiated heavy-lifting of infrastructure management and accelerate efforts to build a solution addressing business requirements.

Build a Virtual Waiting Room with Amazon DynamoDB and AWS Lambda at SeatGeek

Post Syndicated from Umesh Kalaspurkar original https://aws.amazon.com/blogs/architecture/build-a-virtual-waiting-room-with-amazon-dynamodb-and-aws-lambda-at-seatgeek/

As retail sales, products, and customers continue to expand online, we’ve seen a trend towards releasing products in limited quantities to larger audiences. Demand of these products can be high, due to limited production capacity, venue capacity limits, or product exclusivity. Providers can then experience spikes in transaction volume, especially when multiple event sales occur simultaneously. This increased traffic and load can negatively impact customer experience and infrastructure.

To enhance the customer experience when releasing tickets to high demand events, SeatGeek has introduced a prioritization and queueing mechanism based on event type, venue, and customer type. For example, Dallas Cowboys’ tickets could have a different priority depending on seat type, or whether it’s a suite or a general admission ticket.

SeatGeek previously used a third-party waiting room solution, but it presented a number of shortcomings:

  • Lack of configuration and customization capabilities
  • More manual process that resulted in limiting the number of concurrent events could be set up
  • Inability to capture custom insights and metrics (for example, how long was the customer waiting in the queue before they dropped?)

Resolving these issues is crucial to improve the customer experience and audience engagement. SeatGeek decided to build a custom solution on AWS, in order to create a more robust system and address these third-party issues.

Virtual Waiting Room overview

Our solution redirects overflow customers waiting to complete their purchase to a separate queue. Personalized content is presented to improve the waiting experience. Public services such as school or voting registration can use this solution for limited spots or time slot management.

Figure 1. User path through a Virtual Waiting Room

Figure 1. User path through a Virtual Waiting Room

During a sale event, all customers begin their purchase journey in the Virtual Waiting Room (see Figure 1). When the sale starts, they will be moved from the Virtual Waiting Room to the ticket selection page. This is referred to as the Protected Zone. Here is where the customer will complete their purchase. The Protected Zone is a group of customized pages that guide the user through the purchasing process.

When the virtual waiting room is enabled, it can operate in three modes: Waiting Room mode, Queueing mode, or a combination of the two.

In Waiting Room mode, any request made to an event ticketing page before the designated start time of sale is routed to a separate screen. This displays the on-sale information and other marketing materials. At the desired time, users are then routed to the event page at a predefined throughput rate. Figure 2 shows a screenshot of the Waiting Room mode:

Figure 2. Waiting Room mode

Figure 2. Waiting Room mode

In Queueing mode, the event can be configured to allow a preset number of concurrent users to access the Protected Zone. Those beyond that preconfigured number wait in a First-In-First-Out (FIFO) queue. Exempt users, such as the event coordinator, can bypass the queue for management and operational visibility.

Figure 3. Queueing mode flow

Figure 3. Queueing mode flow

 

Figure 4. Queueing mode

Figure 4. Queueing mode

In some cases, the two modes can work together sequentially. This can occur when the Waiting Room mode is used before a sale starts and the Queueing mode takes over to control flow.

Once the customers move to the front of the queue, they are next in line for the Protected Zone. A ticket selection page, shown in Figure 5, must be protected from an overflow of customers, which could result in overselling.

Figure 5. Ticket selection page

Figure 5. Ticket selection page

Virtual Waiting Room implementation

In the following diagram, you can see the AWS services and flow that SeatGeek implemented for the Virtual Waiting Room solution. When a SeatGeek customer requests a protected resource like a concert ticket, a gate keeper application scans to see if the resource has an active waiting room. It also confirms if the configuration rules are satisfied in order to grant the customer access. If the customer isn’t allowed access to the protected resource for whatever reason, then that customer is redirected to the Virtual Waiting Room.

Figure 6. Architecture overview

Figure 6. Architecture overview

SeatGeek built this initial iteration of the gate keeper service on Fastly’s Computer@Edge service to leverage its existing content delivery network (CDN) investment. However, similar functionality could be built using Amazon CloudFront and AWS Lambda@Edge.

The Bouncer, handling the user flow into either the protected zone or the waiting room, consists of 3 components – Amazon API Gateway, AWS Lambda, and a Token Service. The token service is at the heart of the Waiting Room’s core logic. Before a concert event sale goes live at SeatGeek, the number of access tokens generated is equivalent to the number of available tickets. The order of assigning access tokens to customers in the waiting room can be based on FIFO or customer status (VIP customers first). Tokens are allocated when the customer is admitted to the waiting room and expire when tickets are purchased or when the customer exits.

For data storage, SeatGeek uses Amazon DynamoDB to monitor protected resources, tokens, and queues. The key tables are:

  • Protected Zone table: This table contains metadata about available protected zones
  • Counters table: Monitors the number of access tokens issued per minute for a specific protected zone
  • User Connection table: Every time a customer connects to the Amazon API Gateway, a record is created in this table recording their visitor token and connection ID using AWS Lambda
  • Queue table: This is the main table where the visitor token to access token mapping is saved

For analytics, two types of metrics are captured to ensure operational integrity:

  • System metrics: These are built into the AWS runtime infrastructure, and are stored in Amazon CloudWatch. These metrics provide telemetry of each component of the solution: Lambda latency, DynamoDB throttle (read and write), API Gateway connections, and more.
  • Business metrics: These are used to understand previous user behavior to improve infrastructure provisioning and user experiences. SeatGeek uses an AWS Lambda function to capture metrics from data in a DynamoDB stream. It then forwards it to Amazon Timestream for time-based analytics processing. Metrics captured include queue length, waiting time per queue, number of users in the protected zone, and more.

For historical needs, long-lived data can be streamed to tiered data storage options such as Amazon Simple Storage Service (S3). They can then be used later for other purposes, such as auditing and data analysis.

Considerations and enhancements for the Virtual Waiting Room

  • Tokens: We recommend using first-party cookies and token confirmations to track the number of sessions. Use the same token at the same time to stop users from checking out multiple times and cutting in line.
  • DDoS protection: Token and first-party cookies usage must also comply with General Data Protection Regulation (GDPR) and California Consumer Privacy Act (CCPA) guidelines depending on the geographic region. This system is susceptible to DDoS attacks, XSS attacks, and others, like any web-based solution. But these threats can be mitigated by using AWS Shield, a DDoS protection service, and AWS WAF – Web Application Firewall. For more information on DDoS protection, read this security blog post.
  • Marketing: Opportunities to educate the customer about the venue or product(s) while they wait in the Virtual Waiting Room (for example, parking or food options).
  • Alerts: Customers can be alerted via SMS or voice when their turn is up by using Amazon Pinpoint as a marketing communication service.

Conclusion

We have shown how to set up a Virtual Waiting Room for your customers. This can be used to improve the customer experience while they wait to complete their registration or purchase through your website. The solution takes advantage of several AWS services like AWS Lambda, Amazon DynamoDB, and Amazon Timestream.

While this references a retail use case, the waiting room concept can be used whenever throttling access to a specific resource is required. It can be useful during an infrastructure or application outage. You can use it during a load spike, while more resources (EC2 instances) are being launched. To block access to an unreleased feature or product, temporarily place all users in the waiting room and let them in as needed per your own configuration.

Providing a friendly, streamlined, and responsive user experience, even during peak load times, is a valuable way to keep existing customers and gain new ones.

Be mindful that there are costs associated with running these services. To be cost-efficient, see the following pages for details: AWS Lambda, Amazon S3, Amazon DynamoDB, Amazon Timestream.