But for some applications, data stored in R2 doesn’t need to be retained forever. Over time, as this data grows, it can unnecessarily lead to higher storage costs. Today, we’re excited to announce that Object Lifecycle Management for R2 is generally available, allowing you to effectively manage object expiration, all from the R2 dashboard or via our API.
Object Lifecycle Management
Object lifecycles give you the ability to define rules (up to 1,000) that determine how long objects uploaded to your bucket are kept. For example, by implementing an object lifecycle rule that deletes objects after 30 days, you could automatically delete outdated logs or temporary files. You can also define rules to abort unfinished multipart uploads that are sitting around and contributing to storage costs.
Getting started with object lifecycles in R2
Cloudflare dashboard
From the Cloudflare dashboard, select R2.
Select your R2 bucket.
Navigate to the Settings tab and find the Object lifecycle rules section.
Select Add rule to define the name, relevant prefix, and lifecycle actions: delete uploaded objects or abort incomplete multipart uploads.
Select Add rule to complete the process.
S3 Compatible API
With R2’s S3-compatible API, it’s easy to apply any existing object lifecycle rules to your R2 buckets.
Here’s an example of how to configure your R2 bucket’s lifecycle policy using the AWS SDK for JavaScript. To try this out, you’ll need to generate an Access Key.
import S3 from "aws-sdk/clients/s3.js";
const client = new S3({
endpoint: `https://${ACCOUNT_ID}.r2.cloudflarestorage.com`,
credentials: {
accessKeyId: ACCESS_KEY_ID, // fill in your own
secretAccessKey: SECRET_ACCESS_KEY, // fill in your own
},
region: "auto",
});
await client
.putBucketLifecycleConfiguration({
LifecycleConfiguration: {
Bucket: "testBucket",
Rules: [
// Example: deleting objects by age
// Delete logs older than 90 days
{
ID: "Delete logs after 90 days",
Filter: {
Prefix: "logs/",
},
Expiration: {
Days: 90,
},
},
// Example: abort all incomplete multipart uploads after a week
{
ID: "Abort incomplete multipart uploads",
AbortIncompleteMultipartUpload: {
DaysAfterInitiation: 7,
},
},
],
},
})
.promise();
For more information on how object lifecycle policies work and how to configure them in the dashboard or API, see the documentation here.
Speaking of documentation, if you’d like to provide feedback for R2’s documentation, fill out our documentation survey!
What’s next?
Creating object lifecycle rules to delete ephemeral objects is a great way to reduce storage costs, but what if you need to keep objects around to access in the future? We’re working on new, lower cost ways to store objects in R2 that aren’t frequently accessed, like long tail user-generated content, archive data, and more. If you’re interested in providing feedback and gaining early access, let us know by joining the waitlist here.
Join the conversation: share your feedback and experiences
If you have any questions or feedback relating to R2, we encourage you to join our Discord community to share! Stay tuned for more exciting R2 updates in the future.
As always, there’s plenty to share this week: Amazon CodeCatalyst is now generally available, Amazon S3 is now available on Snowball Edge devices, version 1.0.0 of AWS Amplify Flutter is here, and a lot more. Let’s dive in!
Last Week’s Launches Here are some of the launches that caught my eye this past week:
Amazon CodeCatalyst – First announced at re:Invent in preview form (Announcing Amazon CodeCatalyst, a Unified Software Development Service), this unified software development and delivery service is now generally available. As Steve notes in the post that he wrote for the preview, “Amazon CodeCatalyst enables software development teams to quickly and easily plan, develop, collaborate on, build, and deliver applications on AWS, reducing friction throughout the development lifecycle.” During the preview we added the ability to use AWS Graviton2 for CI/CD workflows and deployment environments, along with other new features, as detailed in the What’s New.
Amazon S3 on Snowball Edge – You have had the power to create S3 buckets on AWS Snow Family devices for a couple of years, and to PUT and GET object. With this new launch you can, as Channy says, “…use an expanded set of Amazon S3 APIs to easily build applications on AWS and deploy them on Snowball Edge Compute Optimized devices.” This launch allows you to manage the storage using AWS OpsHub, and to address multiple Denied, Disrupted, Intermittent, and Limited Impact (DDIL) use cases. To learn more, read Amazon S3 Compatible Storage on AWS Snowball Edge Compute Optimized Devices Now Generally Available.
Amazon Redshift Updates – We announced multiple updates to Amazon Redshift including the MERGE SQL command so that you can combine a series of DML statements into a single statement, dynamic data masking to simplify the process of protecting sensitive data in your Amazon Redshift data warehouse, and centralized access control for data sharing with AWS Lake Formation.
AWS Open Source – My colleague Ricardo writes a weekly newsletter to highlight new open source projects, tools, and demos from the AWS Community. Read edition 154 to learn more.
AWS Graviton Weekly – Marcos Ortiz writes a weekly newsletter to highlight the latest developments in AWS custom silicon. Read AWS Graviton weekly #33 to see what’s up.
Upcoming Events Here are some upcoming live and online events that may be of interest to you:
AWS Summits are coming to Berlin (May 4), Washington, DC (June 7 and 8), London (June 7), and Toronto (June 14). These events are free but I highly recommend that you register ahead of time.
.NET Enterprise Developer Day EMEA is a free one-day virtual conference on April 25; register now.
We have added a collection of purpose-built services to the AWS Snow Family for customers, such as Snowball Edge in 2016 and Snowcone in 2020. These services run compute intensive workloads and stores data in edge locations with denied, disrupted, intermittent, or limited network connectivity and for transferring large amounts of data from on-premises and rugged or mobile environments.
Each new service is optimized for space- or weight-constrained environments, portability, and flexible networking options. For example, Snowball Edge devices have three options for device configurations. AWS Snowball Edge Compute Optimized provides a suitcase-sized, secure, and rugged device that customers can deploy in rugged and tactical edge locations to run their compute applications. Customers modernize their edge applications in the cloud use AWS compute services and storage services such as Amazon Simple Storage Service (Amazon S3), and then deploy these applications on Snow devices at the edge.
We heard from customers that they also needed access to local object store to run applications at the edge, such as 5G mobile core and real-time data analytics, to process end-user transactions, and they had limited storage infrastructure availability in these environments. Although the Amazon S3 Adapter for Snowball enables the basic storage and retrieval of objects on a Snow device, customers wanted access to a broader set of Amazon S3 APIs, including flexibility at scale, local bucket management, object tagging, and S3 event notifications.
Today, we’re announcing the general availability of Amazon S3 compatible storage on Snow for our Snowball Edge Compute Optimized devices. This makes it easy for you to store data and run applications with local S3 buckets that require low latency processing at the edge.
With Amazon S3 compatible storage on Snow, you can use an expanded set of Amazon S3 APIs to easily build applications on AWS and deploy them on Snowball Edge Compute Optimized devices. This eliminates the need to re-architect applications for each deployment. You can manage applications requiring Amazon S3 compatible storage across the cloud, on-premises, and at the edge in connected and disconnected environments with a consistent experience.
Moreover, you can use AWS OpsHub, a graphical user interface, to manage your Snow Family services and Amazon S3 compatible storage on the devices at the edge or remotely from a central location. You can also use Amazon S3 SDK or AWS Command Line Interface (AWS CLI) to create and manage S3 buckets, get S3 event notifications using MQTT, and local service notifications using SMTP, just as you do in AWS Regions.
With Amazon S3 compatible storage on Snow, we are now able to address various use cases in limited network environments, giving customers secure, durable local object storage. For example, customers in the intelligence community and in industrial IoT deploy applications such as video analytics in rugged and mobile edge locations.
Getting Started with S3 Compatible Storage on Snowball Edge Compute Optimized To order new Amazon S3 enabled Snowball Edge devices, create a job in the AWS Snow Family console. You can replace an existing Snow device or cluster with new replacement devices that support S3 compatible storage.
In Step 1 – Job type, input your job name and choose Local compute and storage only. In Step 2 – Compute and storage, choose your preferred Snowball Edge Compute Optimized device.
Select Amazon S3 compatible storage, a new option for S3 compatible storage. The current S3 Adapter solution is on deprecation path, and we recommend migrating workloads to use Amazon S3 compatible storage on Snow.
When you select Amazon S3 compatible storage, you can configure Amazon S3 compatible storage capacity for a single device or for a cluster. The Amazon S3 storage capacity depends on the quantity and type of Snowball Edge device.
For single-device deployment, you can provision granular Amazon S3 capacity up to a maximum of 31 TB on a Snowball Edge Compute Optimized device.
For a cluster setup, all storage capacity on a device is allocated to Amazon S3 compatible storage on Snow. You can provision a maximum of 500 TB on a 16 node cluster of Snowball Edge Compute Optimized devices.
When you provide all necessary job details and create your job, you can see the status of the delivery of your device in the job status section.
Manage S3 Compatible Storage on Snow with OpsHub Once your device arrives at your site, power it on, and connect it to your network. To manage your device, download, install, and launch the OpsHub application in your laptop. After installation, you can unlock the device and start managing it and using supported AWS services locally.
OpsHub provides a dashboard that summarizes key metrics, such as storage capacity and active instances on your device. It also provides a selection of AWS services that are supported on the Snow Family devices.
Log in to OpsHub, then choose Manage Storage. This takes you to the Amazon S3 compatible storage on Snow landing page.
For Start service setup type, choose Simple if your network uses dynamic host configuration protocol (DHCP). With this option, the virtual network interface cards (VNICs) are created automatically on each device when you start the service. When your network uses static IP addresses, you need to create VNICs for each device manually, so choose the Advanced option.
Once the service starts, you’ll see its status is active with a list of endpoints. The following example shows the service activated in a single device:
Choose Create bucket if you want the new S3 bucket in your device. Otherwise, you can upload files to your selected bucket. New uploaded objects have destination URLs such as s3-snow://test123/test_file with the unique bucket name in the device or cluster.
You can also use the bucket lifecycle rule to define when to trigger object deletion based on age or date. Choose Create lifecycle rule in the Management tab to add a new lifecycle rule.
You can select either Delete objects or Delete incomplete multipart uploads as a rule action. Configure the rule trigger that schedules deletion based on a specific date or object’s age. In this example, I set two days to delete objects after being uploaded.
Things to know Keep these things in mind regarding additional features and considerations when you use Amazon S3 compatible storage on Snow:
Capacity: If you fully utilize Amazon S3 capacity on your device or cluster, your write (PUT) requests return an insufficient capacity error. Read (GET) operations continue to function normally. To monitor the available Amazon S3 capacity, you can use the OpsHub S3 on the Snow page or use the describe-service CLI command. Upon detecting insufficient capacity on the Snow device or cluster, you must free up space by deleting data or transferring data to an S3 bucket in the Region or another on-premises device.
Resiliency: Amazon S3 compatible storage on Snow stores data redundantly across multiple disks on each Snow device and multiple devices in your cluster, with built-in protection against correlated hardware failures. In the event of a disk or device failure within the quorum range, Amazon S3 compatible storage on Snow continues to operate until hardware is replaced. Additionally, Amazon S3 compatible storage on Snow continuously scrubs data on the device to make sure of data integrity and recover any corrupted data. For workloads that require local storage, the best practice is to back up your data to further protect your data stored on Snow devices.
Notifications: Amazon S3 compatible storage on Snow continuously monitors the health status of the device or cluster. Background processes respond to data inconsistencies and temporary failures to heal and recover data to make sure of resiliency. In the case of nonrecoverable hardware failures, Amazon S3 compatible storage on Snow can continue operations and provides proactive notifications through emails, prompting you to work with AWS to replace failed devices. For connected devices, you have the option to enable the “Remote Monitoring” feature, which will allow AWS to monitor service health online and proactively notify you of any service issues.
Security: Amazon S3 compatible storage on Snow supports encryption using server-side encryption with Amazon S3 managed encryption keys (SSE-S3) or customer-provided keys (SSE-C) and authentication and authorization using Snow IAM actions namespace (s3:*) to provide you with distinct controls for data stored on your Snow devices. Amazon S3 compatible storage on Snow doesn’t support object-level access control list and bucket policies. Amazon S3 compatible storage on Snow defaults to Bucket Owner is Object Owner, making sure that the bucket owner has control over objects in the bucket.
Now Available Amazon S3 compatible storage on Snow is now generally available for AWS Snowball Edge Compute Optimized devices in all AWS Commercial and GovCloud Regions where AWS Snow is available.
A new week starts, and Spring is almost here! If you’re curious about AWS news from the previous seven days, I got you covered.
Last Week’s Launches Here are the launches that got my attention last week:
Amazon S3 – Last week there was AWS Pi Day 2023 celebrating 17 years of innovation since Amazon S3 was introduced on March 14, 2006. For the occasion, the team released many new capabilities:
Amazon S3 has also simplified private connectivity from on-premises networks: with private DNS for S3, on-premises applications can use AWS PrivateLink to access S3 over an interface endpoint, while requests from your in-VPC applications access S3 using gateway endpoints.
We released Mountpoint for Amazon S3, a high performance open source file client. Read more in the blog. Note that Mountpoint isn’t a general-purpose networked file system, and comes with some restrictions on file operations.
Amazon Neptune – Now offers a graph summary API to help understand important metadata about property graphs (PG) and resource description framework (RDF) graphs. Neptune added support for Slow Query Logs to help identify queries that need performance tuning.
Amazon OpenSearch Service – The team introduced security analytics that provides new threat monitoring, detection, and alerting features. The service now supports OpenSearchversion 2.5 that adds several new features such as support for Point in Time Search and improvements to observability and geospatial functionality.
AWS Lake Formation and Apache Hive on Amazon EMR – Introduced fine-grained access controls that allow data administrators to define and enforce fine-grained table and column level security for customers accessing data via Apache Hive running on Amazon EMR.
On this day 17 years ago, we launched a very simple object storage service. It allowed developers to create, list, and delete private storage spaces (known as buckets), upload and download files, and manage their access permissions. The service was available only through a REST and SOAP API. It was designed to provide highly durable data storage with 99.999999999 percent data durability (that’s 11 nines!).
Fast forward to 2023, Amazon Simple Storage Service (Amazon S3) holds more than 280 trillion objects and averages over 100 million requests per second. To protect data integrity, Amazon S3 performs over four billion checksum computations per second. Over the years, we added many capabilities, such as a range of storage classes, to store your colder data cost effectively. Every day, you restore on average more than 1 petabyte from the S3 Glacier Flexible Retrieval and S3 Glacier Deep Archive storage classes. Since launch, you have saved $1 billion from using Amazon S3 Intelligent-Tiering. In 2015, we added the possibility of replicating your data across Regions. Every week, Amazon S3 Replication moves more than 100 petabytes of data. Amazon S3 is also at the core of hundreds of thousands of data lakes. It also has become a critical component of a growing ecosystem of serverless applications. Every day, Amazon S3 sends over 125 billion event notifications to serverless applications. Altogether, Amazon S3 is helping people around the world securely store and extract value from their data.
To celebrate Amazon S3‘s birthday AWS is hosting the AWS Pi Day event for the third consecutive year. This live online event starts at 13:00 PDT today (March 14, 2023) on the AWS On Air channel on Twitch and will feature four hours of fresh educational content from AWS experts. We will discuss not only Amazon S3 best practices, we will also dive into the latest innovations across AWS data services, from storage to analytics and AI/ML. Tune in to learn how to get the most out of your data by making it more secure, available, accessible, and connected, and to help you respond to rapid growth and changing demand. You will also learn how to optimize your data costs, automate your cost savings, eliminate operational complexity, and get new insights from your data. Have a look at the full agenda on the registration page.
But we are not stopping there, and today we take the occasion of this celebration to announce seven new capabilities across our data services.
Mountpoint for Amazon S3 (alpha release): an open-source file client for Amazon S3 Mountpoint for Amazon S3 is an open-source file client for Amazon S3 that you can install on your compute instance. It translates local file storage API calls to REST API calls on objects in Amazon S3. When using Mountpoint for Amazon S3, data lake applications that access objects using file APIs can achieve high single-instance transfer rates, saving on compute costs.
You can get started with Mountpoint for Amazon S3 by mounting an Amazon S3 bucket at a local mount point on your compute instance. Once mounted, applications read objects as files available locally. Mountpoint for Amazon S3 supports sequential and random read operations on existing S3 objects. It is available to download for Linux operating systems as an alpha release and is not yet intended for production workloads. Instead, we want to collect your feedback early and incorporate your input into the design and implementation. To get started, visit the Mountpoint for Amazon S3 GitHub repo, read the technical launch blog, and share your feedback.
Now Generally Available: AWS Data Exchange for Amazon S3 AWS Data Exchange for Amazon S3 enables you to easily find, subscribe to, and use third-party data files for faster time to insight, storage cost optimization, simplified data licensing management, and more. Data Exchange subscribers can directly use files from data providers’ Amazon S3 buckets for their analysis with AWS services without needing to create or manage copies to their account. Data providers can license in-place access to data hosted in their Amazon S3 buckets.
Amazon S3 Multi-Region Access Points now support replicated datasets that span multiple AWS accounts We launched Amazon S3 Multi-Region Access Points in September 2021. We added failover control in November 2022. Amazon S3 Multi-Region Access Points now support datasets that are replicated across multiple AWS accounts. Cross-account Multi-Region Access Points simplify object storage access for applications that span both AWS Regions and accounts, avoiding the need for complex request routing logic in your application. They provide a single global endpoint for your multi-Region applications and dynamically route S3 requests based on policies that you define. This helps you to easily implement multi-Region resilience, latency-based routing, and active/passive failover, even when your data is stored in multiple AWS accounts.
Aliases for S3 Object Lambda Access Points as CloudFront origin Amazon S3 Object Lambda, launched in March 2021, lets you add your own code to S3 GET, HEAD, and LIST API requests to modify data as it is returned to an application. With today’s launch of aliases for S3 Object Lambda Access Points any application that requires an S3 bucket name can easily present different views of data depending on the requester. You can now use an S3 Object Lambda Access Point alias as an origin for your Amazon CloudFront distribution to modify the data requested. For example, you can dynamically transform an image depending on the device that a user is visiting from, such as a desktop or a smartphone.
Simplify private connectivity from on-premises networks Amazon Virtual Private Cloud (Amazon VPC) interface endpoints for Amazon S3 now offer private DNS options that can help you more easily route Amazon S3 requests to the lowest-cost endpoint in your VPC. With private DNS for Amazon S3, your on-premises applications can use AWS PrivateLink to access Amazon S3 over an interface endpoint, while requests from your in-VPC applications access Amazon S3 using gateway endpoints. Routing requests like this helps you take advantage of the lowest-cost private network path without having to make code or configuration changes to your clients.
Local Amazon S3 Replication on Outposts Amazon S3 on Outposts now supports S3 replication on Outposts. This extends S3’s fully managed approach to replication to S3 on Outposts buckets. It helps you meet your data residency and data redundancy requirements. With local S3 Replication on Outposts, you can create and configure replication rules to automatically replicate your S3 objects to another Outpost or to another bucket on the same Outpost. During replication, your S3 on Outposts objects are always sent over your local gateway, and objects do not travel back to the AWS Region. S3 Replication on Outposts provides an easy and flexible way to automatically replicate data within a specific data perimeter to address your data redundancy and compliance requirements.
Amazon OpenSearch Security Analytics The new Amazon OpenSearch Service’s security analytics capability enables your Security Operations (SecOps) teams to detect potential threats quickly while having the tools to help with security investigations on historical data—all with lower data storage costs. Like many other advanced capabilities of Amazon OpenSearch Service, there is no additional charge for security analytics.
Join Us Online Today You will learn more about these launches and about AWS data services in general. We have also prepared some live demos. We designed the AWS Pi Day event for system administrators, engineers, developers, and architects. Our sessions will bring you the latest and greatest information on storage, security, backup, archiving, training and certification, and more.
And to dive deeper, get Pi Day started early by attending AWS Innovate: Data and AI/ML Edition to learn about cutting-edge machine learning tools, strategies for building future-proof applications, and making data-driven decisions for your organization. Don’t miss Swami Sivasubramanian‘s keynote, starting at 9:00 PDT.
Today, we are launching aliases for S3 Object Lambda Access Points. Aliases are now automatically generated when S3 Object Lambda Access Points are created and are interchangeable with bucket names anywhere you use a bucket name to access data stored in Amazon S3. Therefore, your applications don’t need to know about S3 Object Lambda and can consider the alias to be a bucket name.
You can now use an S3 Object Lambda Access Point alias as an origin for your Amazon CloudFront distribution to tailor or customize data for end users. You can use this to implement automatic image resizing or to tag or annotate content as it is downloaded. Many images still use older formats like JPEG or PNG, and you can use a transcoding function to deliver images in more efficient formats like WebP, BPG, or HEIC. Digital images contain metadata, and you can implement a function that strips metadata to help satisfy data privacy requirements.
Let’s see how this works in practice. First, I’ll show a simple example using text that you can follow along by just using the AWS Management Console. After that, I’ll implement a more advanced use case processing images.
Using an S3 Object Lambda Access Point as the Origin of a CloudFront Distribution For simplicity, I am using the same application in the launch post that changes all text in the original file to uppercase. This time, I use the S3 Object Lambda Access Point alias to set up a public distribution with CloudFront.
To test that this is working, I open the same file from the bucket and through the S3 Object Lambda Access Point. In the S3 console, I select the bucket and a sample file (called s3.txt) that I uploaded earlier and choose Open.
A new browser tab is opened (you might need to disable the pop-up blocker in your browser), and its content is the original file with mixed-case text:
Amazon Simple Storage Service (Amazon S3) is an object storage service that offers...
I choose Object Lambda Access Points from the navigation pane and select the AWS Region I used before from the dropdown. Then, I search for the S3 Object Lambda Access Point that I just created. I select the same file as before and choose Open.
In the new tab, the text has been processed by the Lambda function and is now all in uppercase:
AMAZON SIMPLE STORAGE SERVICE (AMAZON S3) IS AN OBJECT STORAGE SERVICE THAT OFFERS...
Now that the S3 Object Lambda Access Point is correctly configured, I can create the CloudFront distribution. Before I do that, in the list of S3 Object Lambda Access Points in the S3 console, I copy the Object Lambda Access Point alias that has been automatically created:
In the CloudFront console, I choose Distributions in the navigation pane and then Create distribution. In the Origin domain, I use the S3 Object Lambda Access Point alias and the Region. The full syntax of the domain is:
ALIAS.s3.REGION.amazonaws.com
S3 Object Lambda Access Points cannot be public, and I use CloudFront origin access control (OAC) to authenticate requests to the origin. For Origin access, I select Origin access control settings and choose Create control setting. I write a name for the control setting and select Sign requests and S3 in the Origin type dropdown.
Now, my Origin access control settings use the configuration I just created.
To reduce the number of requests going through S3 Object Lambda, I enable Origin Shield and choose the closest Origin Shield Region to the Region I am using. Then, I select the CachingOptimized cache policy and create the distribution. As the distribution is being deployed, I update permissions for the resources used by the distribution.
Setting Up Permissions to Use an S3 Object Lambda Access Point as the Origin of a CloudFront Distribution First, the S3 Object Lambda Access Point needs to give access to the CloudFront distribution. In the S3 console, I select the S3 Object Lambda Access Point and, in the Permissions tab, I update the policy with the following:
The supporting access point also needs to allow access to CloudFront when called via S3 Object Lambda. I select the access point and update the policy in the Permissions tab:
Finally, CloudFront needs to be able to invoke the Lambda function. In the Lambda console, I choose the Lambda function used by S3 Object Lambda, and then, in the Configuration tab, I choose Permissions. In the Resource-based policy statements section, I choose Add permissions and select AWS Account. I enter a unique Statement ID. Then, I enter cloudfront.amazonaws.com as Principal and select lambda:InvokeFunction from the Action dropdown and Save. We are working to simplify this step in the future. I’ll update this post when that’s available.
Testing the CloudFront Distribution When the distribution has been deployed, I test that the setup is working with the same sample file I used before. In the CloudFront console, I select the distribution and copy the Distribution domain name. I can use the browser and enter https://DISTRIBUTION_DOMAIN_NAME/s3.txt in the navigation bar to send a request to CloudFront and get the file processed by S3 Object Lambda. To quickly get all the info, I use curl with the -i option to see the HTTP status and the headers in the response:
curl -i https://DISTRIBUTION_DOMAIN_NAME/s3.txt
HTTP/2 200
content-type: text/plain
content-length: 427
x-amzn-requestid: a85fe537-3502-4592-b2a9-a09261c8c00c
date: Mon, 06 Mar 2023 10:23:02 GMT
x-cache: Miss from cloudfront
via: 1.1 a2df4ad642d78d6dac65038e06ad10d2.cloudfront.net (CloudFront)
x-amz-cf-pop: DUB56-P1
x-amz-cf-id: KIiljCzYJBUVVxmNkl3EP2PMh96OBVoTyFSMYDupMd4muLGNm2AmgA==
AMAZON SIMPLE STORAGE SERVICE (AMAZON S3) IS AN OBJECT STORAGE SERVICE THAT OFFERS...
It works! As expected, the content processed by the Lambda function is all uppercase. Because this is the first invocation for the distribution, it has not been returned from the cache (x-cache: Miss from cloudfront). The request went through S3 Object Lambda to process the file using the Lambda function I provided.
Let’s try the same request again:
curl -i https://DISTRIBUTION_DOMAIN_NAME/s3.txt
HTTP/2 200
content-type: text/plain
content-length: 427
x-amzn-requestid: a85fe537-3502-4592-b2a9-a09261c8c00c
date: Mon, 06 Mar 2023 10:23:02 GMT
x-cache: Hit from cloudfront
via: 1.1 145b7e87a6273078e52d178985ceaa5e.cloudfront.net (CloudFront)
x-amz-cf-pop: DUB56-P1
x-amz-cf-id: HEx9Fodp184mnxLQZuW62U11Fr1bA-W1aIkWjeqpC9yHbd0Rg4eM3A==
age: 3
AMAZON SIMPLE STORAGE SERVICE (AMAZON S3) IS AN OBJECT STORAGE SERVICE THAT OFFERS...
This time the content is returned from the CloudFront cache (x-cache: Hit from cloudfront), and there was no further processing by S3 Object Lambda. By using S3 Object Lambda as the origin, the CloudFront distribution serves content that has been processed by a Lambda function and can be cached to reduce latency and optimize costs.
Resizing Images Using S3 Object Lambda and CloudFront As I mentioned at the beginning of this post, one of the use cases that can be implemented using S3 Object Lambda and CloudFront is image transformation. Let’s create a CloudFront distribution that can dynamically resize an image by passing the desired width and height as query parameters (w and h respectively). For example:
For this setup to work, I need to make two changes to the CloudFront distribution. First, I create a new cache policy to include query parameters in the cache key. In the CloudFront console, I choose Policies in the navigation pane. In the Cache tab, I choose Create cache policy. Then, I enter a name for the cache policy.
In the Query settings of the Cache key settings, I select the option to Include the following query parameters and add w (for the width) and h (for the height).
Then, in the Behaviors tab of the distribution, I select the default behavior and choose Edit.
There, I update the Cache key and origin requests section:
In the Cache policy, I use the new cache policy to include the w and h query parameters in the cache key.
In the Origin request policy, use the AllViewerExceptHostHeader managed policy to forward query parameters to the origin.
Now I can update the Lambda function code. To resize images, this function uses the Pillow module that needs to be packaged with the function when it is uploaded to Lambda. You can deploy the function using a tool like the AWS SAM CLI or the AWS CDK. Compared to the previous example, this function also handles and returns HTTP errors, such as when content is not found in the bucket.
import io
import boto3
from urllib.request import urlopen, HTTPError
from PIL import Image
from urllib.parse import urlparse, parse_qs
s3 = boto3.client('s3')
def lambda_handler(event, context):
print(event)
object_get_context = event['getObjectContext']
request_route = object_get_context['outputRoute']
request_token = object_get_context['outputToken']
s3_url = object_get_context['inputS3Url']
# Get object from S3
try:
original_image = Image.open(urlopen(s3_url))
except HTTPError as err:
s3.write_get_object_response(
StatusCode=err.code,
ErrorCode='HTTPError',
ErrorMessage=err.reason,
RequestRoute=request_route,
RequestToken=request_token)
return
# Get width and height from query parameters
user_request = event['userRequest']
url = user_request['url']
parsed_url = urlparse(url)
query_parameters = parse_qs(parsed_url.query)
try:
width, height = int(query_parameters['w'][0]), int(query_parameters['h'][0])
except (KeyError, ValueError):
width, height = 0, 0
# Transform object
if width > 0 and height > 0:
transformed_image = original_image.resize((width, height), Image.ANTIALIAS)
else:
transformed_image = original_image
transformed_bytes = io.BytesIO()
transformed_image.save(transformed_bytes, format='JPEG')
# Write object back to S3 Object Lambda
s3.write_get_object_response(
Body=transformed_bytes.getvalue(),
RequestRoute=request_route,
RequestToken=request_token)
return
I upload a picture I took of the Trevi Fountain in the source bucket. To start, I generate a small thumbnail (200 by 150 pixels).
It works as expected. The first invocation with a specific size is processed by the Lambda function. Further requests with the same width and height are served from the CloudFront cache.
Availability and Pricing Aliases for S3 Object Lambda Access Points are available today in all commercial AWS Regions. There is no additional cost for aliases. With S3 Object Lambda, you pay for the Lambda compute and request charges required to process the data, and for the data S3 Object Lambda returns to your application. You also pay for the S3 requests that are invoked by your Lambda function. For more information, see Amazon S3 Pricing.
Aliases are now automatically generated when an S3 Object Lambda Access Point is created. For existing S3 Object Lambda Access Points, aliases are automatically assigned and ready for use.
It’s now easier to use S3 Object Lambda with existing applications, and aliases open many new possibilities. For example, you can use aliases with CloudFront to create a website that converts content in Markdown to HTML, resizes and watermarks images, or masks personally identifiable information (PII) from text, images, and documents.
Building applications on Cloudflare Workers has always been fun. Workers applications have low latency response times by default, and easy developer ergonomics thanks to Wrangler. It’s no surprise that for years now, developers have been going from idea to production with Workers in just a few minutes.
Internally, we’re no different. When a member of our team has a project idea, we often reach for Workers first, and not just for the MVP stage, but in production, too. Workers have been a secret ingredient to Cloudflare’s innovation for some time now, allowing us to build products like Access, Stream and Workers KV. Even better, when we have new ideas and we can use new Cloudflare products to build them, it’s a great way to give feedback on those products.
We’ve discussed this in the past on the Cloudflare blog – in May last year, I wrote how we rebuilt Cloudflare’s developer documentation using many of the tools that had recently been released in the Workers ecosystem: Cloudflare Pages for hosting, and Bulk Redirects for the redirect rules. In November, we released a new version of our API documentation, which again used Pages for hosting, and Pages functions for intelligent caching and transformation of our API schema.
In this blog post, I’m excited to show off some of the new tools in Cloudflare’s developer arsenal, D1 and Queues, to prototype and ship an internal tool for our SEO experts at Cloudflare. We’ve made this project, which we’re calling Prospector, open-source too – check it out in our cloudflare/templates repo on GitHub. Whether you’re a developer looking to understand how to use multiple parts of Cloudflare’s developer stack together, or an SEO specialist who may want to deploy the tool in production, we’ve made it incredibly easy to get up and running.
What we’re building
Prospector is a tool that allows Cloudflare’s SEO experts to monitor our blog and marketing site for specific keywords. When a keyword is matched on a page, Prospector will notify an email address. This allows our SEO experts to stay informed of any changes to our website, and take action accordingly.
Using MailChannels’ integration with Workers, we can quickly and easily send emails from our application using a single API call. This allows us to focus on the core functionality of the application, and not worry about the details of sending emails.
Prospector uses Cloudflare Workers as the user-facing API for the application. It uses D1 to store and retrieve data in real-time, and Queues to handle the fetching of all URLs and the notification process. We’ve also included an intuitive user interface for the application, which is built with HTML, CSS, and JavaScript.
Why we built it
It is widely known in SEO that both internal and external links help Google and other search engines understand what a website is about, which impacts keyword rankings. Not only do these links guide readers to additional helpful information, they also allow web crawlers for search engines to discover and index content on the site.
Acquiring external links is often a time-consuming process and at the discretion of third parties, whereas website owners typically have much more control over internal links. As a result, internal linking is one of the most useful levers available in SEO.
In an ideal world, every piece of content would be fully formed upon publication, replete with helpful internal links throughout the piece. However, this is often not the case. Many times, content is edited after the fact or additional pieces of relevant content come along after initial publication. These situations result in missed opportunities for internal linking.
Like other large organizations, Cloudflare has published thousands of blogs and web pages over the years. We share new content every time a product/technology is introduced and improved. Ultimately, that also means it’s become more challenging to identify opportunities for internal linking in a timely, automated fashion. We needed a tool that would allow us to identify internal linking opportunities as they appear, and speed up the time it takes to identify new internal linking opportunities.
Although we tested several tools that might solve this problem, we found that they were limited in several ways. First, some tools only scanned the first 2,000 characters of a web page. Any opportunities found beyond that limit would not be detected. Next, some tools did not allow us to limit searches to certain areas of the site and resulted in many false positives. Finally, other potential solutions required manual operation, leaving the process at the mercy of human memory.
To solve our problem (and ultimately, improve our SEO), we needed an automated tool that could discover and notify us of new instances of targeted phrases on a specified range of pages.
How it works
Data model
First, let’s explore the data model for Prospector. We have two main tables: notifiers and urls. The notifiers table stores the email address and keyword that we want to monitor. The urls table stores the URL and sitemap that we want to scrape. The notifiers table has a one-to-many relationship with the urls table, meaning that each notifier can have many URLs associated with it.
In addition, we have a sitemaps table that stores the sitemap URLs that we’ve scraped. Many larger websites don’t just have a single sitemap: the Cloudflare blog, for instance, has a primary sitemap that contains four sub-sitemaps. When the application is deployed, a primary sitemap is provided as configuration, and Prospector will parse it to find all of the sub-sitemaps.
Finally, notifier_matches is a table that stores the matches between a notifier and a URL. This allows us to keep track of which URLs have already been matched, and which ones still need to be processed. When a match has been found, the notifier_matches table is updated to reflect that, and “matches” for a keyword are no longer processed. This saves our SEO experts from a crowded inbox, and allows them to focus and act on new matches.
Connecting the pieces with Cloudflare Queues Cloudflare Queues acts as the work queue for Prospector. When a new notifier is added, a new job is created for it and added to the queue. Behind the scenes, Queues will distribute the work across multiple Workers, allowing us to scale the application as needed. When a job is processed, Prospector will scrape the URL and check for matches. If a match is found, Prospector will send an email to the notifier’s email address.
Using the Cron Triggers functionality in Workers, we can schedule the scraping process to run at a regular interval – by default, once a day. This allows us to keep our data up-to-date, and ensures that we’re always notified of any changes to our website. It also allows the end-user to configure when they receive emails in case they want to receive them more or less frequently, or at the beginning of their workday.
The Module Workers syntax for Workers makes accessing the application bindings – the constants available in the application for querying D1, Queues, and other services – incredibly easy. src/index.ts, the entrypoint for the application, looks like this:
With this syntax, we can see where the various events incoming to the application – the fetch event, the queue event, and the scheduled event – are handled. The fetch event is the main entrypoint for the application, and is where we handle all of the API routes. The queue event is where we handle the work that’s been added to the queue, and the scheduled event is where we handle the scheduled scraping process.
Central to the application, of course, is Workers – acting as the API gateway and coordinator. We’ve elected to use the popular open-source framework Hono, an Express-style API for Workers, in Prospector. With Hono, we can quickly map out a REST API in just a few lines of code. Here’s an example of a few API routes and how they’re defined with Hono:
With these bindings, we can define the types for our environment variables, and use them in our application. Here’s an example of the Env type, which defines the environment variables that we use in the application:
Notice the types of the DB and QUEUE bindings – D1Database and Queue, respectively. These types are automatically generated, complete with type signatures for each method inside of the D1 and Queue APIs. This means that we can be sure that we’re using the correct methods, and that we’re passing the correct arguments to them, directly from our text editor – without having to refer to the documentation.
How to use it
One of my favorite things about Workers is that deploying applications is quick and easy. Using `wrangler.toml` and some simple build scripts, we can deploy a fully-functional application in just a few minutes. Prospector is no different. With just a few commands, we can create the necessary D1 database and Queues instance, and deploy the application to our account.
First, you’ll need to clone the repository from our cloudflare/templates repository:
git clone $URL
If you haven’t installed wrangler yet, you can do so by running:
npm install @cloudflare/wrangler -g
With Wrangler installed, you can login to your account by running:
wrangler login
After you’ve done that, you’ll need to create a new D1 database, as well as a Queues instance. You can do this by running the following commands:
Next, you can run the bin/migrate script to create the tables in your database:
bin/migrate
This will create all the needed tables in your database, both in development (locally) and in production. Note that you’ll even see the creation of a honest-to-goodness .sqlite3 file in your project directory – this is the local development database, which you can connect to directly using the same SQLite CLI that you’re used to:
Finally, you can deploy the application to your account:
npm run deploy
With a deployed application, you can visit your Workers URL to see the user interface. From there, you can add new notifiers and URLs, and see the results of your scraping process. When a new keyword match is found, you’ll receive an email with the details of the match instantly:
Conclusion
For some time, there have been a great deal of applications that were hard to build on Workers without relational data or background task tooling. Now, with D1 and Queues, we can build applications that seamlessly integrate between real-time user interfaces, geographically distributed data, background processing, and more, all using the same developer ergonomics and low latency that Workers is known for.
D1 has been crucial for building this application. On larger sites, the number of URLs that need to be scraped can be quite large. If we were to use Workers KV, our key-value store, for storing this data, we would quickly struggle with how to model, retrieve, and update the data needed for this use-case. With D1, we can build relational data models and quickly query just the data we need for each queued processing task.
Using these tools, developers can build internal tools and applications for their companies that are more powerful and more scalable than ever before. With the integration of Cloudflare’s Zero Trust suite, developers can make these applications secure by default, and deploy them to Cloudflare’s global network. This allows developers to build applications that are fast, secure, and reliable, all without having to worry about the underlying infrastructure.
Prospector is a great example of how easy it is to build applications on Cloudflare Workers. With the recent addition of D1 and Queues, we’ve been able to build fully-functional applications that require real-time data and background processing in just a few hours. We’re excited to share the open-source code for Prospector, and we’d love to hear your feedback on the project.
If you have any questions, feel free to reach out to us on Twitter at @cloudflaredev, or join us in the Cloudflare Workers Discord community, which recently hit 20k members and is a great place to ask questions and get help from other developers.
By continuing to use the site, you agree to the use of cookies. more information
The cookie settings on this website are set to "allow cookies" to give you the best browsing experience possible. If you continue to use this website without changing your cookie settings or you click "Accept" below then you are consenting to this.