Durable Objects aren’t just durable, they’re fast: a 10x speedup for Cloudflare Queues

Post Syndicated from Josh Wheeler original https://blog.cloudflare.com/how-we-built-cloudflare-queues

Cloudflare Queues let a developer decouple their Workers into event-driven services. Producer Workers write events to a Queue, and consumer Workers are invoked to take actions on the events. For example, you can use a Queue to decouple an e-commerce website from a service which sends purchase confirmation emails to users. During 2024’s Birthday Week, we announced that Cloudflare Queues is now Generally Available, with significant performance improvements that enable larger workloads. To accomplish this, we switched to a new architecture for Queues that enabled the following improvements:

  • Median latency for sending messages has dropped from ~200ms to ~60ms

  • Maximum throughput for each Queue has increased over 10x, from 400 to 5000 messages per second

  • Maximum Consumer concurrency for each Queue has increased from 20 to 250 concurrent invocations


Median latency drops from ~200ms to ~60ms as Queues are migrated to the new architecture

In this blog post, we’ll share details about how we built Queues using Durable Objects and the Cloudflare Developer Platform, and how we migrated from an initial Beta architecture to a geographically-distributed, horizontally-scalable architecture for General Availability.

v1 Beta architecture

When initially designing Cloudflare Queues, we decided to build something simple that we could get into users’ hands quickly. First, we considered leveraging an off-the-shelf messaging system such as Kafka or Pulsar. However, we decided that it would be too challenging to operate these systems at scale with the large number of isolated tenants that we wanted to support.

Instead of investing in new infrastructure, we decided to build on top of one of Cloudflare’s existing developer platform building blocks: Durable Objects. Durable Objects are a simple, yet powerful building block for coordination and storage in a distributed system. In our initial v1 architecture, each Queue was implemented using a single Durable Object. As shown below, clients would send messages to a Worker running in their region, which would be forwarded to the single Durable Object hosted in the WNAM (Western North America) region. We used a single Durable Object for simplicity, and hosted it in WNAM for proximity to our centralized configuration API service.


One of a Queue’s main responsibilities is to accept and store incoming messages. Sending a message to a v1 Queue used the following flow:

  • A client sends a POST request containing the message body to the Queues API at /accounts/:accountID/queues/:queueID/messages

  • The request is handled by an instance of the Queue Broker Worker in a Cloudflare data center running near the client.

  • The Worker performs authentication, and then uses Durable Objects idFromName API to route the request to the Queue Durable Object for the given queueID

  • The Queue Durable Object persists the message to storage before returning a success back to the client.

Durable Objects handled most of the heavy-lifting here: we did not need to set up any new servers, storage, or service discovery infrastructure. To route requests, we simply provided a queueID and the platform handled the rest. To store messages, we used the Durable Object storage API to put each message, and the platform handled reliably storing the data redundantly.

Consuming messages

The other main responsibility of a Queue is to deliver messages to a Consumer. Delivering messages in a v1 Queue used the following process:

  • Each Queue Durable Object maintained an alarm that was always set when there were undelivered messages in storage. The alarm guaranteed that the Durable Object would reliably wake up to deliver any messages in storage, even in the presence of failures. The alarm time was configured to fire after the user’s selected max wait time, if only a partial batch of messages was available. Whenever one or more full batches were available in storage, the alarm was scheduled to fire immediately.

  • The alarm would wake the Durable Object, which continually looked for batches of messages in storage to deliver.

  • Each batch of messages was sent to a “Dispatcher Worker” that used Workers for Platforms dynamic dispatch to pass the messages to the queue() function defined in a user’s Consumer Worker


This v1 architecture let us flesh out the initial version of the Queues Beta product and onboard users quickly. Using Durable Objects allowed us to focus on building application logic, instead of complex low-level systems challenges such as global routing and guaranteed durability for storage. Using a separate Durable Object for each Queue allowed us to host an essentially unlimited number of Queues, and provided isolation between them.

However, using only one Durable Object per queue had some significant limitations:

  • Latency: we created all of our v1 Queue Durable Objects in Western North America. Messages sent from distant regions incurred significant latency when traversing the globe.

  • Throughput: A single Durable Object is not scalable: it is single-threaded and has a fixed capacity for how many requests per second it can process. This is where the previous 400 messages per second limit came from.

  • Consumer Concurrency: Due to concurrent subrequest limits, a single Durable Object was limited in how many concurrent subrequests it could make to our Dispatcher Worker. This limited the number of queue() handler invocations that it could run simultaneously.

To solve these issues, we created a new v2 architecture that horizontally scales across multiple Durable Objects to implement each single high-performance Queue.

v2 Architecture

In the new v2 architecture for Queues, each Queue is implemented using multiple Durable Objects, instead of just one. Instead of a single region, we place Storage Shard Durable Objects in all available regions to enable lower latency. Within each region, we create multiple Storage Shards and load balance incoming requests amongst them. Just like that, we’ve multiplied message throughput.


Sending a message to a v2 Queue uses the following flow:

  • A client sends a POST request containing the message body to the Queues API at /accounts/:accountID/queues/:queueID/messages

  • The request is handled by an instance of the Queue Broker Worker running in a Cloudflare data center near the client.

  • The Worker:

    • Performs authentication

    • Reads from Workers KV to obtain a Shard Map that lists available storage shards for the given region and queueID

    • Picks one of the region’s Storage Shards at random, and uses Durable Objects idFromName API to route the request to the chosen shard

  • The Storage Shard persists the message to storage before returning a success back to the client.

In this v2 architecture, messages are stored in the closest available Durable Object storage cluster near the user, greatly reducing latency since messages don’t need to be shipped all the way to WNAM. Using multiple shards within each region removes the bottleneck of a single Durable Object, and allows us to scale each Queue horizontally to accept even more messages per second. Workers KV acts as a fast metadata store: our Worker can quickly look up the shard map to perform load balancing across shards.

To improve the Consumer side of v2 Queues, we used a similar “scale out” approach. A single Durable Object can only perform a limited number of concurrent subrequests. In v1 Queues, this limited the number of concurrent subrequests we could make to our Dispatcher Worker. To work around this, we created a new Consumer Shard Durable Object class that we can scale horizontally, enabling us to execute many more concurrent instances of our users’ queue() handlers.


Consumer Durable Objects in v2 Queues use the following approach:

  • Each Consumer maintains an alarm that guarantees it will wake up to process any pending messages. v2 Consumers are notified by the Queue’s Coordinator (introduced below) when there are messages ready for consumption. Upon notification, the Consumer sets an alarm to go off immediately.

  • The Consumer looks at the shard map, which contains information about the storage shards that exist for the Queue, including the number of available messages on each shard.

  • The Consumer picks a random storage shard with available messages, and asks for a batch.

  • The Consumer sends the batch to the Dispatcher Worker, just like for v1 Queues.

  • After processing the messages, the Consumer sends another request to the Storage Shard to either “acknowledge” or “retry” the messages.

This scale-out approach enabled us to work around the subrequest limits of a single Durable Object, and increase the maximum supported concurrency level of a Queue from 20 to 250. 

The Coordinator and “Control Plane”

So far, we have primarily discussed the “Data Plane” of a v2 Queue: how messages are load balanced amongst Storage Shards, and how Consumer Shards read and deliver messages. The other main piece of a v2 Queue is the “Control Plane”, which handles creating and managing all the individual Durable Objects in the system. In our v2 architecture, each Queue has a single Coordinator Durable Object that acts as the brain of the Queue. Requests to create a Queue, or change its settings, are sent to the Queue’s Coordinator.


The Coordinator maintains a Shard Map for the Queue, which includes metadata about all the Durable Objects in the Queue (including their region, number of available messages, current estimated load, etc.). The Coordinator periodically writes a fresh copy of the Shard Map into Workers KV, as pictured in step 1 of the diagram. Placing the shard map into Workers KV ensures that it is globally cached and available for our Worker to read quickly, so that it can pick a shard to accept the message.

Every shard in the system periodically sends a heartbeat to the Coordinator as shown in steps 2 and 3 of the diagram. Both Storage Shards and Consumer Shards send heartbeats, including information like the number of messages stored locally, and the current load (requests per second) that the shard is handling. The Coordinator uses this information to perform autoscaling. When it detects that the shards in a particular region are overloaded, it creates additional shards in the region, and adds them to the shard map in Workers KV. Our Worker sees the updated shard map and naturally load balances messages across the freshly added shards. Similarly, the Coordinator looks at the backlog of available messages in the Queue, and decides to add more Consumer shards to increase Consumer throughput when the backlog is growing. Consumer Shards pull messages from Storage Shards for processing as shown in step 4 of the diagram.

Switching to a new scalable architecture allowed us to meet our performance goals and take Queues to GA. As a recap, this new architecture delivered these significant improvements:

  • P50 latency for writing to a Queue has dropped from ~200ms to ~60ms.

  • Maximum throughput for a Queue has increased from 400 to 5000 messages per second.

  • Maximum consumer concurrency has increased from 20 to 250 invocations.

What’s next for Queues

  • We plan on leveraging the performance improvements in the new beta version of Durable Objects which use SQLite to continue to improve throughput/latency in Queues.

  • We will soon be adding message management features to Queues so that you can take actions to purge messages in a queue, pause consumption of messages, or “redrive”/move messages from one queue to another (for example messages that have been sent to a Dead Letter Queue could be “redriven” or moved back to the original queue).

  • Work to make Queues the “event hub” for the Cloudflare Developer Platform:

    • Create a low-friction way for events emitted from other Cloudflare services with event schemas to be sent to Queues.

    • Build multi-Consumer support for Queues so that Queues are no longer limited to one Consumer per queue.

To start using Queues, head over to our Getting Started guide. 

Do distributed systems like Cloudflare Queues and Durable Objects interest you? Would you like to help build them at Cloudflare? We’re Hiring!

Implementing a computing curriculum in Telangana

Post Syndicated from Fiona Coventry original https://www.raspberrypi.org/blog/implementing-a-computing-curriculum-in-telangana/

Last year we launched a partnership with the Government of Telangana Social Welfare Residential Educational Institutions Society (TGSWREIS) in Telangana, India to develop and implement a computing curriculum at their Coding Academy School and Coding Academy College. Our impact team is conducting an evaluation. Read on to find out more about the partnership and what we’ve learned so far.

Aim of the partnership 

The aim of our partnership is to enable students in the school and undergraduate college to learn about coding and computing by providing the best possible curriculum, resources, and training for teachers. 

Students sit in a classroom and watch the lecture slides.

As both institutions are government institutions, education is provided for free, with approximately 800 high-performing students from disadvantaged backgrounds currently benefiting. The school is co-educational up to grade 10 and the college is for female undergraduate students only. 

The partnership is strategically important for us at the Raspberry Pi Foundation because it helps us to test curriculum content in an Indian context, and specifically with learners from historically marginalised communities with limited resources.

Adapting our curriculum content for use in Telangana

Since our partnership began, we’ve developed curriculum content for students in grades 6–12 in the school, which is in line with India’s national education policy requiring coding to be introduced from grade 6. We’ve also developed curriculum content for the undergraduate students at the college. 

Students and educators engage in digital making.

In both cases, the content was developed based on an initial needs assessment — we used the assessment to adapt content from our previous work on The Computing Curriculum. Local examples were integrated to make the content relatable and culturally relevant for students in Telangana. Additionally, we tailored the content for different lesson durations and to allow a higher frequency of lessons. We captured impact and learning data through assessments, lesson observations, educator interviews, student surveys, and student focus groups.

Curriculum well received by educators and students

We have found that the partnership is succeeding in meeting many of its objectives. The curriculum resources have received lots of positive feedback from students, educators, and observers.

Students and educators engage in digital making.

In our recent survey, 96% of school students and 85% of college students reported that they’ve learned new things in their computing classes. This was backed up by assessment marks, with students scoring an average of 70% in the school and 69% in the college for each assessment, compared to a pass mark of 40%. Students were also positive about their experiences of the computing and coding classes, and particularly enjoyed the practical components.

“My favourite thing in this computing classes [sic] is doing practical projects. By doing [things] practically we learnt a lot.” – Third year undergraduate student, Coding Academy College

“Since their last SA [summative assessment] exam, students have learnt spreadsheet [concepts] and have enjoyed applying them in activities. Their favourite part has been example codes, programming, and web-designing activities.” – Student focus group facilitator, grade 9 students, Coding Academy School

However, we also found some variation in outcomes for different groups of students and identified some improvements that are needed to ensure the content is appropriate for all. For example, educators and students felt improvements were needed to the content for undergraduates specialising in data science — there was a wish for the content to be more challenging and to more effectively prepare students for the workplace. Some amendments have been made to this content and we will continue to keep this under review. 

In addition, we faced some challenges with the equipment and infrastructure available. For example, there were instances of power cuts and unstable internet connections. These issues have been addressed as far as possible with Wi-Fi dongles and educators adapting their delivery to work with the equipment available.

Our ambition for India

Our team has already made some improvements to our curriculum content in preparation for the new academic year. We will also make further improvements based on the feedback received. 

Students and educators engage in digital making.

The long-term vision for our work in India is to enable any school in India to teach students about computing and creating with digital technologies. Over our five-year partnership, we plan to work with TGSWREIS to roll out a computing curriculum to other government schools within the state. 

Through our work in Telangana and Odisha, we are learning about the unique challenges faced by government schools. We’re designing our curriculum to address these challenges and ensure that every student in India has the opportunity to thrive in the 21st century. If you would like to know more about our work and impact in India, please reach out to us at [email protected].

We take the evaluation of our work seriously and are always looking to understand how we can improve and increase the impact we have on the lives of young people. To find out more about our approach to impact, you can read about our recently updated theory of change, which supports how we evaluate what we do.

The post Implementing a computing curriculum in Telangana appeared first on Raspberry Pi Foundation.

Simplify your query performance diagnostics in Amazon Redshift with Query profiler

Post Syndicated from Raks Khare original https://aws.amazon.com/blogs/big-data/simplify-your-query-performance-diagnostics-in-amazon-redshift-with-query-profiler/

Amazon Redshift is a fast, scalable, secure, and fully managed cloud data warehouse that lets you analyze your data at scale. Amazon Redshift Serverless lets you access and analyze data without the usual configurations of a provisioned data warehouse. Resources are automatically provisioned and data warehouse capacity is intelligently scaled to deliver fast performance for even the most demanding and unpredictable workloads. If you prefer to manage your Amazon Redshift resources manually, you can create provisioned clusters for your data querying needs. For more information, refer to Amazon Redshift clusters.

Amazon Redshift provides performance metrics and data so you can track the health and performance of your provisioned clusters, serverless workgroups, and databases. The performance data you can use on the Amazon Redshift console falls into two categories:

  • Amazon CloudWatch metrics – Helps you monitor the physical aspects of your cluster or serverless, such as resource utilization, latency, and throughput.
  • Query and load performance data – Helps you monitor database activity, inspect and diagnose query performance problems.

Amazon Redshift has introduced a new feature called the Query profiler. The Query profiler is a graphical tool that helps users analyze the components and performance of a query. This feature is part of the Amazon Redshift console and provides a visual and graphical representation of the query’s run order, execution plan, and various statistics. The Query profiler makes it easier for users to understand and troubleshoot their queries.

In this post, we cover two common use cases for troubleshooting query performance. We show you step-by-step how to analyze and troubleshoot long-running queries using the Query profiler.

Overview

For Amazon Redshift Serverless, the Query profiler can be accessed by going to the Serverless console. Choose Query and database monitoring, select a query, and then navigate to the Query plan tab. If a query plan is available, you will observe a list of child queries. Choose a query to view it in Query profiler.

For Amazon Redshift provisioned, the Query profiler can be accessed by going to the provisioned clusters dashboard. Choose Query and loads, and choose a query. Navigate to the Query plan tab. If a query plan is available, you will observe a list of child queries. Choose a query to view it in Query profiler.

Prerequisites

  • You can use the following sample AWS Identity and Access Management (IAM) policy to configure your IAM user or role with minimum privileges to access Query profiler from the AWS console. If your IAM user or role already has access to Query and loads section of Redshift provisioned cluster dashboard or Query and database monitoring section of Redshift serverless dashboard, then no additional permissions are needed:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "redshift:DescribeClusters",
                "redshift-serverless:ListNamespaces",
                "redshift-serverless:ListWorkgroups",
                "redshift-data:ExecuteStatement",
                "redshift-data:DescribeStatement",
                "redshift-data:GetStatementResult"
            ],
            "Resource": [
                "arn:aws:redshift-serverless:<your-namespace>",
                "arn:aws:redshift-serverless:<your-workgroupname>",
                "arn:aws:redshift:<your-clustername>"
            ]
        }
    ]
}
  • You can choose to use Query profiler in your account with an existing Amazon Redshift data warehouse and queries. However, if you would like to implement this demo in your existing Amazon Redshift data warehouse, download Redshift query editor v2 notebook, Redshift Query profiler demo, and refer to the Data Loading section later in this post.
  • You must connect to the cluster using database credentials and grant the sys:operator or sys:monitor role to the database user to view queries run by users.

Data loading

Amazon Redshift Query Editor v2 comes with sample data that can be loaded into a sample database and corresponding schema. To test Query profiler against the sample data, load the tpcds sample data and run queries.

  1. To load the tpcds sample data, launch Redshift query editor v2 and expand the database sample_data_dev.
  2. Choose the icon associated with the tpcds.
  3. The query editor v2 then loads the data into a schema tpcds in the database sample_data_dev.

The following screenshot shows these steps.
Load Data

  1. Verify the data by running the following sample query, as shown in the following screenshot.
select count(*) from sample_data_dev.tpcds.customer;

Verify Data

Use cases

In this post, we describe two common uses cases around query performance and how to use Query profiler to troubleshoot the performance issues:

  1. Nested loop joins – This join type is the slowest of the possible join types. Nested loop joins are the cross-joins without a join condition that result in the Cartesian product of two tables.
  2. Suboptimal data distribution – If data distribution is suboptimal, you might notice a large broadcast or redistribution of data across compute nodes when two large tables are joined together.

Use case 1: Nested loop joins

To troubleshoot performance issues with nest loop joins using Query profiler, follow these steps:

  1. Import notebook downloaded previously in prerequisites section of the blog into Redshift query editor v2.
  2. Set the context of database to sample_data_dev in Query Editor v2, as shown in the following screenshot.
    Set the database context
  3. Run cell #3 from demo notebook to diagnose a query performance issue related to nested loop joins.
    Step 3

The query takes around 12 seconds to run, as shown in the Query Editor v2 results panel in the following screenshot.

Step 4 results

  1. Run cell #5 to capture the query id from the SYS_QUERY_HISTORY system view filtering based on the query label you set in the preceding step.Cell 5
  2. On the Amazon Redshift console, in the navigation pane, select Query and loads and choose the cluster name where the query was originally executed, as shown in the following screenshot.
    Query and loads
  3. This will open the new Query profiler. Under the Query history section, choose Connect to database.After successful connection to the database, you will observe the Status showing as Connected and displaying the query history, as shown in the following screenshot.
    Connec to database
  4. You can find your queries either by Query ID or Process ID. Enter the Query ID captured in the preceding step to filter the long-running query for further analysis and choose the corresponding Query ID, as shown in the following screenshot.
    Search query
  5. Under the Query plan section, choose Child query 1, as shown in the following screenshot. If there are multiple child queries, you will have to inspect each one for performance issues.
    Child queryThis will open the query plan in a tree view along with additional metrics on the side panel. This allows you to quickly analyze the query streams, segments and steps. For more information about streams, segments, and steps, refer to Query planning and execution workflow in the Amazon Redshift Database Developer Guide.
  6. Turn on View streams and, in the Streams side panel, investigate and identify which stream has the highest execution time. In this case, Streams ID 5 is where the query spends the majority of time, as shown in the following screenshot
    Enable view stream
  7. In the Streams side panel, under ID, select 5 to focus on Stream 5 for further analysis. Stream 5 shows a step of Nestloop, as shown in the following screenshot.
    Nestloop step
  8. Choose the Nestloop step to further analyze. The side panel will change with step details and additional metrics about the nested loop join.
  9. By looking at Step details – nestloop, we can inspect the Input rows and compare that with the Output rows, as shown in the following screenshot. In this case, due to the cross-joining with the Store_returns table, 287,514 input rows explodes to 950,233,770 rows, thus causing our query to run slower.
    Nestloop step details
  10. Fix the query by introducing a join condition between the store_sales and store_returns. Run cell #7 from Query editor v2 demo notebook.The re-written query runs in just 307 milliseconds.Cell 7

Use case 2: Suboptimal data distribution

  1. To demonstrate suboptimal data distribution, change the distribution style of tables web_sales and web_returns to EVEN by running cell #10 of Query editor v2 demo notebook.Cell 10
  1. Run cell #12. The query takes 409 milliseconds to run, as shown by the elapsed time in the following screenshot of the Query editor v2.Cell 12
  2. Follow steps 3–10 from use case 1 to locate the query_id and to open the Query profiler view for the preceding query.
  3. On the Query profiler page for the preceding query, turn on View streams. In the Streams side panel, investigate and identify which stream has the highest execution time. In this case, Stream ID 6 is where the query spends a majority of the time, as shown in the following screenshot.
    View streams
  4. Under ID, select 6 from the Streams side panel for further analysis.
    Streams side panel

Stream 6 shows a step of hash join, which involves a hash join of two tables that are both redistributed. This can be inferred from Hash Right Join DS_DIST_BOTH under Explain plan node information in the following screenshot. Usually, these redistributions occur because the tables aren’t joined on their distribution keys, or they don’t have the correct distribution style. In the case of large tables, these redistributions can lead to significant performance degradation and, hence, it is important to identify and fix such steps to optimize query performance.

Hashjoin step

  1. Fix this suboptimal data distribution pattern by choosing the appropriate distribution keys on the tables involved: web_sales and web_returns. To change the distribution styles, run cell #14 of demo notebook to alter table commands.
    Cell 14
  2. After the preceding commands finish running, run cell #16 to re-execute the select query. As shown in the Query Editor in the following screenshot, now the same query finished in 244 milliseconds after updating the distribution style to key for tables web_sales and web_returns.
    Cell 16

  3. In the Query profiler view, turn on View streams and notice that Streams 5 now took the most time. It took 8 milliseconds to finish, as compared to 13 milliseconds in the preceding step.
    View streams
  4. In the Streams side panel, under ID, select 5 to drill down further, then choose the Hashjoin As the following screenshot shows, after changing the distribution style to key for both web_sales and web_return tables, none of the tables need to be redistributed at the query runtime, resulting in optimized performance.
    Hashjoin step

Considerations

Consider the following details while using Query profiler:

  1. Query profiler displays information returned by the SYS_QUERY_HISTORY, SYS_QUERY_EXPLAIN, SYS_QUERY_DETAIL, and SYS_CHILD_QUERY_TEXT views.
  2. Query profiler only displays query information for queries that have recently run on the database. If a query completes using a prepopulated resultset cache, Query profiler won’t have information about it because Amazon Redshift doesn’t generate a query plan for such queries.
  3. Queries run by Query profiler to return the query information run on the same data warehouse as the user-defined queries.

Clean Up

To avoid unexpected costs, complete the following action to delete the resources you created:

Drop all the tables in the sample_data_dev under tpcds schema.

Conclusion

In this post, we discussed how to use Amazon Redshift Query profiler to monitor and troubleshoot long-running queries. We demonstrated a step-by-step approach to analyze query performance by examining the query execution plan and statistics and identifying the root cause of query slowness. Try this feature in your environment and share your feedback with us.


About the Authors

Raks KhareRaks Khare is a Senior Analytics Specialist Solutions Architect at AWS based out of Pennsylvania. He helps customers across varying industries and regions architect data analytics solutions at scale on the AWS platform. Outside of work, he likes exploring new travel and food destinations and spending quality time with his family.

Blessing Bamiduro is part of the Amazon Redshift Product Management team. She works with customers to help explore the use of Amazon Redshift ML in their data warehouse. In her spare time, Blessing loves travels and adventures.

Ekta Ahuja is an Amazon Redshift Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys landscape photography, traveling, and board games.

Are Automatic License Plate Scanners Constitutional?

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/10/are-automatic-license-plate-scanners-constitutional.html

An advocacy groups is filing a Fourth Amendment challenge against automatic license plate readers.

“The City of Norfolk, Virginia, has installed a network of cameras that make it functionally impossible for people to drive anywhere without having their movements tracked, photographed, and stored in an AI-assisted database that enables the warrantless surveillance of their every move. This civil rights lawsuit seeks to end this dragnet surveillance program,” the lawsuit notes. “In Norfolk, no one can escape the government’s 172 unblinking eyes,” it continues, referring to the 172 Flock cameras currently operational in Norfolk. The Fourth Amendment protects against unreasonable searches and seizures and has been ruled in many cases to protect against warrantless government surveillance, and the lawsuit specifically says Norfolk’s installation violates that.”

Introducing simplified interaction with the Airflow REST API in Amazon MWAA

Post Syndicated from Chandan Rupakheti original https://aws.amazon.com/blogs/big-data/introducing-simplified-interaction-with-the-airflow-rest-api-in-amazon-mwaa/

Amazon Managed Workflows for Apache Airflow (Amazon MWAA) is a fully managed service that builds upon Apache Airflow, offering its benefits while eliminating the need for you to set up, operate, and maintain the underlying infrastructure, reducing operational overhead while increasing security and resilience.

Today, we are excited to announce an enhancement to the Amazon MWAA integration with the Airflow REST API. This improvement streamlines the ability to access and manage your Airflow environments and their integration with external systems, and allows you to interact with your workflows programmatically. The Airflow REST API facilitates a wide range of use cases, from centralizing and automating administrative tasks to building event-driven, data-aware data pipelines.

In this post, we discuss the enhancement and present several use cases that the enhancement unlocks for your Amazon MWAA environment.

Airflow REST API

The Airflow REST API is a programmatic interface that allows you to interact with Airflow’s core functionalities. It’s a set of HTTP endpoints to perform operations such as invoking Directed Acyclic Graphs (DAGs), checking task statuses, retrieving metadata about workflows, managing connections and variables, and even initiating dataset-related events, without directly accessing the Airflow web interface or command line tools.

Before today, Amazon MWAA provided the foundation for interacting with the Airflow REST API. Though functional, the process of obtaining and managing access tokens and session cookies added complexity to the workflow. Amazon MWAA now supports a simplified mechanism for interacting with the Airflow REST API using AWS credentials, significantly reducing complexity and improving overall usability.

Enhancement overview

The new InvokeRestApi capability allows you to run Airflow REST API requests with a valid SigV4 signature using your existing AWS credentials. This feature is now available to all Amazon MWAA environments (2.4.3+) in supported Amazon MWAA AWS Regions. By acting as an intermediary, this REST API processes requests on behalf of users, requiring only the environment name and API request payload as inputs.

Integrating with the Airflow REST API through the enhanced Amazon MWAA API provides several key benefits:

  • Simplified integration – The new InvokeRestApi capability in Amazon MWAA removes the complexity of managing access tokens and session cookies, making it straightforward to interact with the Airflow REST API.
  • Improved usability – By acting as an intermediary, the enhanced API delivers Airflow REST API execution results directly to the client, reducing complexity and improving overall usability.
  • Automated administration – The simplified REST API access enables automating various administrative and management tasks, such as managing Airflow variables, connections, slot pools, and more.
  • Event-driven architectures – The enhanced API facilitates seamless integration with external events, enabling the triggering of Airflow DAGs based on these events. This supports the growing emphasis on event-driven data pipelines.
  • Data-aware scheduling – Using the dataset-based scheduling feature in Airflow, the enhanced API enables the Amazon MWAA environment to manage the incoming workload and scale resources accordingly, improving the overall reliability and efficiency of event-driven pipelines.

In the following sections, we demonstrate how to use the enhanced API in various use cases.

How to use the enhanced Amazon MWAA API

The following code snippet shows the general request format for the enhanced REST API:

POST /restapi/Name HTTP/1.1
Content-type: application/json

{
    Name: String,
    Method: String,
    Path: String,
    QueryParameters: Json,
    Body: Json
}

The Name of the Amazon MWAA environment, the Path of the Airflow REST API endpoint to be called, and the HTTP Method to use are the required parameters, whereas QueryParameters and Body are optional and can be used as needed in the API calls.

The following code snippet shows the general response format:

{
    RestApiStatusCode: Number,
    RestApiResponse: Json
}

The RestApiStatusCode represents the HTTP status code returned by the Airflow REST API call, and the RestApiResponse contains the response payload from the Airflow REST API.

The following sample code snippet showcases how to update the description field of an Airflow variable using the enhanced integration. The call uses the AWS Python SDK to invoke the Airflow REST API for the task.

import boto3

# Create a boto3 client
mwaa_client = boto3.client("mwaa")

# Call the enhanced REST API using boto3 client
# Using QueryParameters, you can selectively specify the field to be updated
# Without QueryParameters, all fields will be updated
response = mwaa_client.invoke_rest_api(
    Name="<your-environment-name>",
    Method="PATCH",
    Path=f"/variables/<your-variable-key>",
    Body={
        "key": "<key>",
        "value": "<value>",
        "description": "<description>"
    },
    QueryParameters={
        "update_mask": ["description"]
    }
)

# Access the outputs of the REST call
status_code = response["RestApiStatusCode"]
result = response['RestApiResponse']

To make the invoke_rest_api SDK call, the calling client should have an AWS Identity and Access Management (IAM) principal of airflow:InvokeRestAPI attached to call the requisite environment. The permission can be scoped to specific Airflow roles (Admin, Op, User, Viewer, or Public) to control access levels.

This simple yet powerful REST API supports various use cases for your Amazon MWAA environments. Let’s review some important ones in the subsequent sections.

Automate administration and management tasks

Prior to this launch, to automate configurations and setup of resources such as variables, connections, slot pools, and more, you had to develop a lengthy boilerplate code to make API requests to the Amazon MWAA web servers. You had to handle the cookie and session management in the process. You can simplify this automation with the new enhanced REST API support.

For this example, let’s assume you want to automate maintaining your Amazon MWAA environment variables. You need to perform API operations such as create, read, update, and delete on Airflow variables to achieve this task. The following is a simple Python client to do so (mwaa_variables_client.py):

import boto3

# Client for managing MWAA environment variables
class MWAAVariablesClient:
    # Initialize the client with environment name and optional MWAA boto3 client
    def __init__(self, env_name, mwaa_client=None):
        self.env_name = env_name
        self.client = mwaa_client or boto3.client("mwaa")

    # List all variables in the MWAA environment
    def list(self):
        response = self.client.invoke_rest_api(
            Name=self.env_name,
            Method="GET",
            Path="/variables"
        )

        output = response['RestApiResponse']['variables']
        return output

    # Get a specific variable by key
    def get(self, key):
        response = self.client.invoke_rest_api(
            Name=self.env_name,
            Method="GET",
            Path=f"/variables/{key}"
        )

        return response['RestApiResponse']

    # Create a new variable with key, value, and optional description
    def create(self, key, value, description=None):
        response = self.client.invoke_rest_api(
            Name=self.env_name,
            Method="POST",
            Path="/variables",
            Body={
                "key": key,
                "value": value,
                "description": description
            }
        )

        return response['RestApiResponse']
    
    # Update an existing variable's value and description
    def update(self, key, value, description, query_parameters=None):
        response = self.client.invoke_rest_api(
            Name=self.env_name,
            Method="PATCH",
            Path=f"/variables/{key}",
            Body={
                "key": key,
                "value": value,
                "description": description
            },
            QueryParameters=query_parameters
        )

        return response['RestApiResponse']

    # Delete a variable by key
    def delete(self, key):
        response = self.client.invoke_rest_api(
            Name=self.env_name,
            Method="DELETE",
            Path=f"/variables/{key}"
        )
        return response['RestApiStatusCode']

if __name__ == "__main__":
    client = MWAAVariablesClient("<your-mwaa-environment-name>")

    print("\nCreating a test variable ...")
    response = client.create(
        key="test",
        value="Test value",
        description="Test description"
    )
    print(response)

    print("\nListing all variables ...")
    variables = client.list()
    print(variables)

    print("\nGetting the test variable ...")
    response = client.get("test")
    print(response)

    print("\nUpdating the value and description of test variable ...")
    response = client.update(
        key="test",
        value="Updated Value",
        description="Updated description"
    )
    print(response)

    print("\nUpdating only description of test variable ...")
    response = client.update(
        key="test", 
        value="Updated Value", 
        description="Yet another updated description", 
        query_parameters={ "update_mask": ["description"] }
    )
    print(response)

    print("\nDeleting the test variable ...")
    response_code = client.delete("test")
    print(f"Response code: {response_code}")

    print("\nFinally, getting the deleted test variable ...")
    try:
        response = client.get("test")
        print(response)
    except Exception as e:
        print(e.response["RestApiResponse"])

Assuming that you have configured your terminal with appropriate AWS credentials, you can run the preceding Python script to achieve the following results:

$python mwaa_variables_client.py 

Creating a test variable ...
{'description': 'Test description', 'key': 'test', 'value': 'Test value'}

Listing all variables ...
[{'key': 'test', 'value': 'Test value'}]

Getting the test variable ...
{'key': 'test', 'value': 'Test value'}

Updating the value and description of test variable ...
{'description': 'Updated description', 'key': 'test', 'value': 'Updated Value'}

Updating only description of test variable ...
{'description': 'Yet another updated description', 'key': 'test', 'value': 'Updated Value'}

Deleting the test variable ...
Response code: 204

Finally, getting the deleted test variable ...
{'detail': 'Variable does not exist', 'status': 404, 'title': 'Variable not found', 'type': 'https://airflow.apache.org/docs/apache-airflow/2.8.1/stable-rest-api-ref.html#section/Errors/NotFound'}

Let’s further explore other useful use cases.

Build event-driven data pipelines

The Airflow community has been actively innovating to enhance the platform’s data awareness, enabling you to build more dynamic and responsive workflows. When we announced support for version 2.9.2 in Amazon MWAA, we introduced capabilities that allow pipelines to react to changes in datasets, both within Airflow environments and in external systems. The new simplified integration with the Airflow REST API makes the implementation of data-driven pipelines more straightforward.

Consider a use case where you need to run a pipeline that uses input from an external event. The following sample DAG runs a bash command supplied as a parameter (any_bash_command.py):

"""
This DAG allows you to execute a bash command supplied as a parameter to the DAG.
The command is passed as a parameter called `command` in the DAG configuration.
"""

from airflow import DAG
from airflow.operators.bash_operator import BashOperator
from airflow.models.param import Param

from datetime import datetime

with DAG(
    dag_id="any_bash_command", 
    schedule=None, 
    start_date=datetime(2022, 1, 1), 
    catchup=False,
    params={
        "command": Param("env", type="string")
    },
) as dag:
    cli_command = BashOperator(
        task_id="triggered_bash_command",
        bash_command="{{ dag_run.conf['command'] }}"
    )

With the help of the enhanced REST API, you can create a client that can invoke this DAG, supplying the bash command of your choice as follows (mwaa_dag_run_client.py):

import boto3

# Client for triggering DAG runs in Amazon MWAA
class MWAADagRunClient:
    # Initialize the client with MWAA environment name and optional MWAA boto3 client
    def __init__(self, env_name, mwaa_client=None):
        self.env_name = env_name
        self.client = mwaa_client or boto3.client("mwaa")

    # Trigger a DAG run with specified parameters
    def trigger_run(self, 
            dag_id, 
            dag_run_id=None,
            logical_date=None,
            data_interval_start=None,
            data_interval_end=None,
            note=None,
            conf=None,
    ):
        body = {}
        if dag_run_id:
            body["dag_run_id"] = dag_run_id
        if logical_date:
            body["logical_date"] = logical_date
        if data_interval_start:
            body["data_interval_start"] = data_interval_start
        if data_interval_end:
            body["data_interval_end"] = data_interval_end
        if note:
            body["note"] = note
        body["conf"] = conf or {}            

        response = self.client.invoke_rest_api(
            Name=self.env_name,
            Method="POST",
            Path=f"/dags/{dag_id}/dagRuns",
            Body=body
        )
        return response['RestApiResponse']

if __name__ == "__main__":
    client = MWAADagRunClient("<your-mwaa-environment-name>")
	
    print("\nTriggering a dag run ...")
    result = client.trigger_run(
        dag_id="any_bash_command", 
        conf={
            "command": "echo 'Hello from external trigger!'"
        }
    )
    print(result)

The following snippet shows a sample run of the script:

$python mwaa_dag_run_client.py
Triggering a dag run ...
{'conf': {'command': "echo 'Hello from external trigger!'"}, 'dag_id': 'any_bash_command', 'dag_run_id': 'manual__2024-10-21T16:56:09.852908+00:00', 'data_interval_end': '2024-10-21T16:56:09.852908+00:00', 'data_interval_start': '2024-10-21T16:56:09.852908+00:00', 'execution_date': '2024-10-21T16:56:09.852908+00:00', 'external_trigger': True, 'logical_date': '2024-10-21T16:56:09.852908+00:00', 'run_type': 'manual', 'state': 'queued'}

On the Airflow UI, the trigger_bash_command task shows the following execution log:

[2024-10-21, 16:56:12 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2024-10-21, 16:56:12 UTC] {subprocess.py:63} INFO - Tmp dir root location: /tmp
[2024-10-21, 16:56:12 UTC] {subprocess.py:75} INFO - Running command: ['/usr/bin/bash', '-c', "echo 'Hello from external trigger!'"]
[2024-10-21, 16:56:12 UTC] {subprocess.py:86} INFO - Output:
[2024-10-21, 16:56:12 UTC] {subprocess.py:93} INFO - Hello from external trigger!
[2024-10-21, 16:56:12 UTC] {subprocess.py:97} INFO - Command exited with return code 0
[2024-10-21, 16:56:12 UTC] {taskinstance.py:340} ▶ Post task execution logs

You can further expand this example to more useful event-driven architectures. Let’s expand the use case to run your data pipeline and perform extract, transform, and load (ETL) jobs when a new file lands in an Amazon Simple Storage Service (Amazon S3) bucket in your data lake. The following diagram illustrates one architectural approach.

In the context of invoking a DAG through an external input, the AWS Lambda function would have no knowledge of how busy the Amazon MWAA web server is, potentially leading to the function overwhelming the Amazon MWAA web server by processing a large number of files in a short timeframe.

One way to regulate the file processing throughput would be to introduce an Amazon Simple Queue Service (Amazon SQS) queue between the S3 bucket and the Lambda function, which can help with rate limiting the API requests to the web server. You can achieve this by configuring maximum concurrency for Lambda for the SQS event source. However, the Lambda function would still be unaware of the processing capacity available in the Amazon MWAA environment to run the DAGs.

In addition to the SQS queue, to help afford the Amazon MWAA environment manage the load natively, you can use the Airflow’s data-aware scheduling feature using datasets. This approach involves using the enhanced Amazon MWAA REST API to create dataset events, which are then used by the Airflow scheduler to schedule the DAG natively. This way, the Amazon MWAA environment can effectively batch the dataset events and scale resources based on the load. Let’s explore this approach in more detail.

Configure data-aware scheduling

Consider the following DAG that showcases a framework for an ETL pipeline (data_aware_pipeline.py). It uses a dataset for scheduling.

"""
DAG to run the given ETL pipeline based on the datalake dataset event.
"""
from airflow import DAG, Dataset
from airflow.decorators import task

from datetime import datetime

# Create a dataset
datalake = Dataset("datalake")

# Return a list of S3 file URIs from the supplied dataset events 
def get_resources(dataset_uri, triggering_dataset_events=None):
    events = triggering_dataset_events[dataset_uri] if triggering_dataset_events else []
    s3_uris = list(map(lambda e: e.extra["uri"], events))
    return s3_uris

with DAG(
    dag_id="data_aware_pipeline",
    schedule=[datalake],
    start_date=datetime(2022, 1, 1), 
    catchup=False
):
    @task
    def extract(triggering_dataset_events=None):
        resources = get_resources("datalake", triggering_dataset_events)
        for resource in resources:
            print(f"Running data extraction for {resource} ...")

    @task
    def transform(triggering_dataset_events=None):
        resources = get_resources("datalake", triggering_dataset_events)
        for resource in resources:
            print(f"Running data transformation for {resource} ...")

    @task()
    def load(triggering_dataset_events=None):
        resources = get_resources("datalake", triggering_dataset_events)
        for resource in resources:
            print(f"Loading finalized data for {resource} ...")
    
    extract() >> transform() >> load()

In the preceding code snippet, a Dataset object called datalake is used to schedule the DAG. The get_resources function extracts the extra information that contains the locations of the newly added files in the S3 data lake. Upon receiving dataset events, the Amazon MWAA environment batches the dataset events based on the load and schedules the DAG to handle them appropriately. The modified architecture to support the data-aware scheduling is presented below.

The following is a simplified client that can create a dataset event through the enhanced REST API (mwaa_dataset_client.py):

import boto3

# Client for interacting with MWAA datasets
class MWAADatasetClient:
    # Initialize the client with environment name and optional MWAA boto3 client
    def __init__(self, env_name, mwaa_client=None):
        self.env_name = env_name
        self.client = mwaa_client or boto3.client("mwaa")

    # Create a dataset event in the MWAA environment
    def create_event(self, dataset_uri, extra=None):
        body = {
            "dataset_uri": dataset_uri
        }
        if extra:
            body["extra"] = extra

        response = self.client.invoke_rest_api(
            Name=self.env_name,
            Method="POST",
            Path="/datasets/events",
            Body=body
        )
        return response['RestApiResponse']

The following is a code snippet for the Lambda function in the preceding architecture to generate the dataset event, assuming the function is configured to handle one S3 PUT event at a time (dataset_event_lambda.py):

import os
import json

from mwaa_dataset_client import MWAADatasetClient

environment = os.environ["MWAA_ENV_NAME"]
client = MWAADatasetClient(environment)

def handler(event, context):
    # Extract S3 file URI from SQS record
    record = event["Records"][0]
    bucket = record["s3"]["bucket"]["name"]
    key = record["s3"]["object"]["key"]
    s3_file_uri = f"s3://{bucket}/{key}"

    # Create a dataset event
    result = client.create_event(
        dataset_uri="datalake",
        extra={"uri": s3_file_uri}
    )

    return {
        "statusCode": 200,
        "body": json.dumps(result)
    }

As new files get dropped into the S3 bucket, the Lambda function will generate a dataset event per file, passing in the Amazon S3 location of the newly added files. The Amazon MWAA environment will schedule the ETL pipeline upon receiving the dataset events. The following diagram illustrates a sample run of the ETL pipeline on the Airflow UI.

The following snippet shows the execution log of the extract task from the pipeline. The log shows how the Airflow scheduler batched three dataset events together to handle the load.

[2024-10-21, 16:47:15 UTC] {local_task_job_runner.py:123} ▶ Pre task execution logs
[2024-10-21, 16:47:15 UTC] {logging_mixin.py:190} INFO - Running data extraction for s3://example-bucket/path/to/file1.csv ...
[2024-10-21, 16:47:15 UTC] {logging_mixin.py:190} INFO - Running data extraction for s3://example-bucket/path/to/file2.csv ...
[2024-10-21, 16:47:15 UTC] {logging_mixin.py:190} INFO - Running data extraction for s3://example-bucket/path/to/file3.csv ...
[2024-10-21, 16:47:15 UTC] {python.py:240} INFO - Done. Returned value was: None
[2024-10-21, 16:47:16 UTC] {taskinstance.py:340} ▶ Post task execution logs

In this way, you can use the enhanced REST API to create data-aware, event-driven pipelines.

Considerations

When implementing solutions using the enhanced Amazon MWAA REST API, it’s important to consider the following:

  • IAM permissions – Make sure the IAM principal making the invoke_rest_api SDK call has the airflow:InvokeRestAPI permission on the Amazon MWAA resource. To control access levels, the permission can be scoped to specific Airflow roles (Admin, Op, User, Viewer, or Public).
  • Error handling – Implement robust error handling mechanisms to handle various HTTP status codes and error responses from the Airflow REST API.
  • Monitoring and logging – Set up appropriate monitoring and logging to track the performance and reliability of your API-based integrations and data pipelines.
  • Versioning and compatibility – Monitor the versioning of the Airflow REST API and the Amazon MWAA service to make sure your integrations remain compatible with any future changes.
  • Security and compliance – Adhere to your organization’s security and compliance requirements when integrating external systems with Amazon MWAA and handling sensitive data.

You can start using the simplified integration with the Airflow REST API in your Amazon MWAA environments with Airflow version 2.4.3 or greater, in all currently supported Regions.

Conclusion

The enhanced integration between Amazon MWAA and the Airflow REST API represents a significant improvement in the ease of interacting with Airflow’s core functionalities. This new capability opens up a wide range of use cases, from centralizing and automating administrative tasks, improving overall usability, to building event-driven, data-aware data pipelines.

As you explore this new feature, consider the various use cases and best practices outlined in this post. By using the new InvokeRestApi, you can streamline your data management processes, enhance operational efficiency, and drive greater value from your data-driven systems.


About the Authors

Chandan Rupakheti is a Senior Solutions Architect at AWS. His main focus at AWS lies in the intersection of analytics, serverless, and AdTech services. He is a passionate technical leader, researcher, and mentor with a knack for building innovative solutions in the cloud. Outside of his professional life, he loves spending time with his family and friends, and listening to and playing music.

Hernan Garcia is a Senior Solutions Architect at AWS based out of Amsterdam. He has worked in the financial services industry since 2018, specializing in application modernization and supporting customers in their adoption of the cloud with a focus on serverless technologies.

EC2 Image Builder now supports building and testing macOS images

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/ec2-image-builder-now-supports-building-and-testing-macos-images/

I’m thrilled to announce macOS support in EC2 Image Builder. This new capability allows you to create and manage machine images for your macOS workloads in addition to the existing support for Windows and Linux.

A golden image is a bootable disk image, also called an Amazon Machine Image (AMI), pre-installed with the operating system and all the tools required for your workloads. In the context of a continuous integration and continuous deployment (CI/CD) pipeline, your golden image most probably contains the specific version of your operating system (macOS) and all required development tools and libraries to build and test your applications (Xcode, Fastlane, and so on.)

Developing and manually managing pipelines to build macOS golden images is time-consuming and diverts talented resources from other tasks. And when you have existing pipelines to build Linux or Windows images, you need to use different tools for creating macOS images, leading to a disjointed workflow.

For these reasons, many of you have been asking for the ability to manage your macOS images using EC2 Image Builder. You want to consolidate your image pipelines across operating systems and take advantage of the automation and cloud-centered integrations that EC2 Image Builder provides.

By adding macOS support to EC2 Image Builder, you can now streamline your image management processes and reduce the operational overhead of maintaining macOS images. EC2 Image Builder takes care of testing, versioning, and validating the base images at scale, saving you the costs associated with maintaining your preferred macOS versions.

Let’s see it in action
Let’s create a pipeline to create a macOS AMI with Xcode 16. You can follow a similar process to install Fastlane on your AMIs.

At a high level, there are four main steps.

  1. I define a component for each tool I want to install. A component is a YAML document that tells EC2 Image Builder what application to install and how. In this example, I create a custom component to install Xcode. If you want to install Fastlane, you create a second component. I use the ExecuteBash action to enter the shell commands required to install Xcode.
  2. I define a recipe. A recipe starts from a base image and lists the components I want to install on it.
  3. I define the infrastructure configuration I want to use to build my image. This defines the pool of Amazon Elastic Compute Cloud (Amazon EC2) instances to build the image. In my case, I allocate an EC2 Mac Dedicated Host in my account and reference it in the infrastructure configuration.
  4. I create a pipeline and a schedule to run on the infrastructure with the given recipes and an image workflow. I test the output AMI and deliver it at the chosen destination (my account or another account)

It’s much easier than it sounds. I’ll show you the steps in the AWS Management Console. I can also configure EC2 Image Builder with the AWS Command Line Interface (AWS CLI) or write code using one of our AWS SDKs.

Step 1: Create a component
I open the console and select EC2 Image Builder, then Components, and finally Create component.

Image Builder - Create component

I select a base Image operating system and the Compatible OS Versions. Then, I enter a Component name and Component version. I select Define document content and enter this YAML as Content.

name: InstallXCodeDocument
description: This downloads and installs Xcode. Be sure to run `xcodeinstall authenticate -s us-east-1` from your laptop first.
schemaVersion: 1.0

phases:
  - name: build
    steps:
      - name: InstallXcode
        action: ExecuteBash
        inputs:
          commands:
             - sudo -u ec2-user /opt/homebrew/bin/brew tap sebsto/macos
             - sudo -u ec2-user /opt/homebrew/bin/brew install xcodeinstall
             - sudo -u ec2-user /opt/homebrew/bin/xcodeinstall download -s us-east-1 --name "Xcode 16.xip"
             - sudo -u ec2-user /opt/homebrew/bin/xcodeinstall install --name "Xcode 16.xip"
  
  - name: validate
    steps:
      - name: TestXcode
        action: ExecuteBash
        inputs:
          commands:
            -  xcodebuild -version && xcode-select -p   

I use a tool I wrote to download and install Xcode from the command line. xcodeinstall integrates with AWS Secrets Manager to securely store authentication web tokens. Before running the pipeline, I authenticate from my laptop with the command xcodeinstall authenticate -s us-east-1. This command starts a session with Apple server’s and stores the session token in Secrets Manager. xcodeinstall uses this token during the image creation pipeline to download Xcode.

When you use xcodeinstall with Secrets Manager, you must give permission to your pipeline to access the secrets. Here is the policy document I added to the role attached to the EC2 instance used by EC2 Image Builder (in the following infrastructure configuration).

{
	"Sid": "xcodeinstall",
	"Effect": "Allow",
	"Action": [
            "secretsmanager:GetSecretValue"
            "secretsmanager:PutSecretValue"
        ],
	"Resource": "arn:aws:secretsmanager:us-east-1:<YOUR ACCOUNT ID>:secret:xcodeinstall*"
}

To test and debug these components locally, without having to wait for long cycle to start and recycle the EC2 Mac instance, you can use the AWS Task Orchestrator and Executor (AWSTOE) command.

Step 2: Create a recipe
The next step is to create a recipe. On the console, I select Image recipes and Create image recipe.

I select macOS as the base Image Operating System. I choose macOS Sonoma ARM64 as Image name.

In the Build components section, I select the Xcode 16 component I just created during step 1.

Finally, I make sure the volume is large enough to store the operating system, Xcode, and my builds. I usually select a 500 Gb gp3 volume.

Image Builder - Create a recipe

Steps 3 and 4: Create the pipeline (and the infrastructure configuration)
On the EC2 Image Builder page, I select Image pipelines and Create image pipeline. I give my pipeline a name and select a Build schedule. For this demo, I select a manual trigger.Image Builder - Create Pipeline 1

Then, I select the recipe I just created (Sonoma-Xcode).

Image Builder - Create Pipeline 2

I chose Default workflows for Define image creation process (not shown for brevity).

I create or select an existing infrastructure configuration. In the context of building macOS images, you have to allocate Amazon EC2 Dedicated Hosts first. This is where I choose the instance type that EC2 Image Builder will use to create the AMI. I may also optionally select my virtual private cloud (VPC), security group, AWS Identity and Access Management (IAM) roles with permissions required during the preparation of the image, key pair, and all the parameters I usually select when I start an EC2 instance.

Image Builder - Create Pipeline 4

Finally, I select where I want to distribute the output AMI. By default, it stays on my account. But I can also share or copy it to other accounts.

Image Builder - Create Pipeline 5

Run the pipeline
Now I’m ready to run the pipeline. I select Image pipelines, then I select the pipeline I just created (Sonoma-Xcode). From the Actions menu, I select Run pipeline.

Image Builder - launch pipeline

I can observe the progress and the detailed logs from Amazon CloudWatch.

After a while, the AMI is created and ready to use.

Image Builder - AMI build succeeded

Testing my AMI
To finish the demo, I start an EC2 Mac instance with the AMI I just created (remember to allocate a Dedicated Host first or to reuse the one you used for EC2 Image Builder).

Once the instance is started, I connect to it using secure shell (SSH) and verify that Xcode is correctly installed.

Image Builder - Connect to new AMI

Pricing and availability
EC2 Image Builder for macOS is now available in all AWS Regions where EC2 Mac instances are available: US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Mumbai, Seoul, Singapore, Sydney, Tokyo), and Europe (Frankfurt, Ireland, London, Stockholm) (not all Mac instance types are available in all Regions).

It comes at no additional cost, and you’re only charged for the resources in use during the pipeline execution, namely the time your EC2 Mac Dedicated Host is allocated, with a minimum of 24 hours.

The preview of macOS support in EC2 Image Builder allows you to consolidate your image pipelines, automate your golden image creation processes, and use the benefits of cloud-focused integrations on AWS. As the EC2 Mac platform continues to expand with more instance types, this new capability positions EC2 Image Builder as a comprehensive solution for image management across Windows, Linux, and macOS.

Create your first pipeline today! 

— seb

[$] Toward safe transmutation in Rust

Post Syndicated from daroc original https://lwn.net/Articles/994334/

Currently in Rust, there is no efficient and safe way to turn an array of bytes
into a structure that corresponds to the array. Changing that was the topic of
Jack Wrenn’s talk this year at

RustConf
:

“Safety Goggles for Alchemists”
. The goal is to be able to “transmute” —
Rust’s name for this kind of conversion — values into arbitrary user-defined
types in a safer way. Wrenn justified the approach that the project has taken to
accomplish this, and spoke about the future work required to stabilize it.

Tor Browser 14.0 released

Post Syndicated from jzb original https://lwn.net/Articles/995353/

Version
14.0
of the privacy-focused Tor browser has been released.

This is our first stable release based on Firefox
ESR 128
, incorporating a year’s worth of changes shipped upstream
in Firefox. As part of this process we’ve also completed our annual
ESR transition audit, where we reviewed and addressed over 200 Bugzilla issues for changes in Firefox that
may negatively affect the privacy and security of Tor Browser
users. Our final reports from this audit are now available in the tor-browser-spec
repository
on our Gitlab instance.

Fortinet FortiManager CVE-2024-47575 Exploited in Zero-Day Attacks

Post Syndicated from Caitlin Condon original https://blog.rapid7.com/2024/10/23/etr-fortinet-fortimanager-cve-2024-47575-exploited-in-zero-day-attacks/

Fortinet FortiManager CVE-2024-47575 Exploited in Zero-Day Attacks

On Wednesday, October 23, 2024, security company Fortinet published an advisory on CVE-2024-47575, a critical zero-day vulnerability affecting their FortiManager network management solution. The vulnerability arises from a missing authentication for a critical function [CWE-306] in the FortiManager fgfmd daemon that allows a remote unauthenticated attacker to execute arbitrary code or commands via specially crafted requests. The vulnerability carries a CVSS v3 score of 9.8.

Fortinet’s advisory notes that CVE-2024-47575 has been “reported” as exploited in the wild. Rapid7 customers have also reported receiving communications from service providers indicating the vulnerability may have been exploited in their environments. According to the vendor, “The identified actions of this attack in the wild have been to automate via a script the exfiltration of various files from the FortiManager which contained the IPs, credentials and configurations of the managed devices.” Rapid7 strongly recommends reviewing the vendor advisory for indicators of compromise and mitigation strategies.

Background

Since roughly October 13, there have been private industry discussions and a number of public posts on Reddit, Twitter, and Mastodon about a rumored zero-day vulnerability in FortiManager. Public Reddit conversations indicated that Fortinet contacted some of their customers by email circa October 15 to “privately disclose” a FortiManager vulnerability and advise on mitigations. Despite embargoed communications and the publication of several news articles, neither a public advisory nor a CVE was issued until October 23.

On the evening of October 22, high-profile cybersecurity researcher Kevin Beaumont published a blog alleging that a state-sponsored adversary has been using this FortiManager zero-day vulnerability in espionage attacks. While Fortinet’s advisory doesn’t include any information about specific adversaries exploiting the vulnerability, Fortinet devices have long been popular targets for state-sponsored threat actors.

Mitigation guidance

Per Fortinet’s advisory, the following versions of FortiManager are vulnerable to CVE-2024-47575 and have mitigation guidance available:

  • FortiManager 7.6.0
  • FortiManager 7.4.0 through 7.4.4
  • FortiManager 7.2.0 through 7.2.7
  • FortiManager 7.0.0 through 7.0.12
  • FortiManager 6.4.0 through 6.4.14
  • FortiManager 6.2.0 through 6.2.12
  • FortiManager Cloud 7.4.1 through 7.4.4
  • FortiManager Cloud 7.2 (all versions)
  • FortiManager Cloud 7.0 (all versions)
  • FortiManager Cloud 6.4 (all versions)

The advisory indicates FortiManager Cloud 7.6 is not affected.

FortiManager customers should update to a supported, fixed version on an emergency basis, without waiting for a regular patch cycle to occur. See the vendor advisory for the latest list of fixed versions. A workaround is also available for some versions.

Fortinet’s advisory also includes a list of indicators of compromise (IOCs) that FortiManager customers should look for in their environments.

Rapid7 customers

InsightVM and Nexpose customers will be able to assess their exposure to CVE-2024-47575 with an authenticated check expected to be available in the Wednesday, October 23 content release.

Kadlčík: Copr Modularity, the End of an Era

Post Syndicated from jzb original https://lwn.net/Articles/995337/

Jakub Kadlčík announced
on his blog
that Fedora’s Copr build system will
be dropping support for building modules
(groups of RPM packages that are built, installed, and shipped
together) soon:

The Fedora Modularity project never really took off, and building
modules in Copr even less so. We’ve had only 14 builds in the last two
years. It’s not feasible to maintain the code for so few
users. Modularity has also been retired
since Fedora 39
and will die with RHEL 9.

Modularity features in Copr are now deprecated, and it will not be
possible to submit new module builds after April 2025. LWN covered some of the
problems with Fedora’s modularity initiative in 2019.

How Getir unleashed data democratization using a data mesh architecture with Amazon Redshift

Post Syndicated from Asser Moustafa original https://aws.amazon.com/blogs/big-data/how-getir-unleashed-data-democratization-using-a-data-mesh-architecture-with-amazon-redshift/

This blog post is co-written with Pinar Yasar from Getir.

Amazon Redshift is a fully managed cloud data warehouse that’s used by tens of thousands of customers for price-performance, scale, and advanced data analytics. Amazon Redshift enables data warehousing by seamlessly integrating with other data stores and services in the modern data organization through features such as Zero-ETL, data sharing, streaming ingestion, data lake integration, and Redshift ML.

In this post, we explain how ultrafast delivery pioneer, Getir, unleashed the power of data democratization on a large scale through their data mesh architecture using Amazon Redshift.

We start by introducing Getir and their vision—to seamlessly, securely, and efficiently share business data across different teams within the organization for BI, extract, transform, and load (ETL), and other use cases. We’ll then explore how Amazon Redshift data sharing powered the data mesh architecture that allowed Getir to achieve this transformative vision. We will also explain how Getir’s data mesh architecture enabled data democratization, shorter time-to-market, and cost-efficiencies. Next, we’ll provide a broader overview of modern data trends reinforced by Getir’s vision. In conclusion, we’ll offer some thoughts on how you can apply a similar approach to eliminate costly and barrier-inducing data silos using Amazon Redshift.

Who is Getir?

Getir is an ultrafast delivery pioneer that revolutionized last-mile delivery in 2015 with its 10-minute grocery delivery proposition.Getir’s story started in Istanbul, and they have launched multiple products since inception: GetirFood, GetirMore, GetirWater, GetirLocals, GetirBitaksi (taxi service), GetirDrive (car rental service), and GetirJobs (recruitment).

Getir serves dozens of cities throughout the world with more than 30,000 employees. The following figure shows the Getir app.

Figure 1: Getir app

Figure 1: Getir app

Overview of Getir’s main use case

Getir’s business is characterized by a tremendous volume of data generation and growth, in addition to ample opportunities to gain valuable insights. However, siloing this data and creating friction for teams trying to access the information they needed wasn’t a viable option. Allowing teams to duplicate data wherever required can be an anti-pattern, leading to operational complexity, cost overruns, and fragile data storage bloat.

Similarly, relying on dedicated teams to create data extracts or insights for downstream consumers introduces bottlenecks, stifles innovation, and increases the time-to-market. This approach isn’t optimal for a data-driven organization like Getir, which needs to empower its teams with seamless access to the information they require to drive the business forward. The various business lines within the organization made it abundantly clear that they wanted unfettered access to the company’s entire data ecosystem in a secure, cost-efficient, near real-time, and well-governed manner.

Furthermore, the organization was anticipating the emergence of data-as-a-serviceservice and generative AI use cases in the near future. This would necessitate the ability to securely share and potentially monetize the company’s data with external partners, such as franchises.

Overview of Getir’s use of Amazon Redshift and modern data architecture

To strike a balance that addresses these concerns and enables Getir teams to effectively use the wealth of data to generate meaningful insights and drive strategic decision-making across the organization, we chose a data mesh architecture.

Getir’s data analytics environment encompasses hundreds of terabytes of data, thousands of tables, and billions upon billions of data rows. Additionally, it processes millions of messaging events daily, all of which must be ingested, refined, and made available to analysts querying multiple Amazon Redshift warehouses. The end-to-end service level agreements (SLAs) for this data ecosystem can be extremely aggressive, with requirements that can be as stringent as single-digit minutes to single-digit seconds. This underscores the scale and complexity of Getir’s data analytics capabilities, which must operate with the utmost efficiency and responsiveness to meet the demands of the business. We were able to easily implement the envisioned data mesh architecture using Amazon Redshift’s native data sharing capabilities.

Figure 2: Data mesh architecture using Amazon Redshift data sharing

Figure 2: Data mesh architecture using Amazon Redshift data sharing

As the preceding diagram shows, at the heart of Getir’s architecture, was an ETL Redshift data warehouse that was used for various data sets from all over the organization, creating a refined 360-degree view of critical assets. It also was a producer for downstream Redshift data warehouses.

The demand was quite heavy on this main ETL cluster, so we relied on data sharing to isolate noisy workloads on a different Redshift data warehouse without having to duplicate the data on the main ETL cluster.

Using Redshift data sharing, individual business line teams could now rely solely on their dedicated Redshift cluster to provide them with their own data and analytics capabilities, but also the refined 360-degree views of data generated from all over the organization—without any data duplication or overstepping compute boundaries. BI analysts gained access to all of the data they needed to power their most complex dashboards with consistent performance free of noisy jobs. Additional warehouses were integrated into the data mesh for visualization, reporting, and machine learning.

Another benefit of Amazon Redshift data sharing and the data mesh architecture, was the relative ease with which we were able to maintain a chargeback model for ensuring costs were spread fairly across different teams.

Finally, the data sharing capability also enabled the seamless propagation of newly created tables within a schema to the subscribed consumers.

Modern data trends reinforced by Getir’s case study

Getir’s case study showcases the strategic uses of a data mesh architecture and Amazon Redshift, but more importantly provides tremendous insights into five key trends across all industries as modern data organizations move away from costly data silos that hinder collaboration, business insights, and time-to-market. As highlighted in the following diagram, those trends are 1/interconnected, purpose-built data stores that enable users to access data regardless of its physical location, 2/data democratization empowering users with self-service analytics capabilities, 3/real-time insights to drive greater value from data, 4/resilient data services ensuring business continuity, 5/leveraging generative AI to extract even deeper insights from data more expeditiously.

Figure 3: Key trends in the modern data organization reinforced by Getir's use case and solution

Figure 3: Key trends in the modern data organization reinforced by Getir’s use case and solution

As Getir showed, the modern data organization is adopting data architectures that democratize data securely and enable self-service analytics. To realize data’s true potential, the modern data organization has progressed beyond basic dashboarding and reporting on limited, point-in-time data sets, and evolved to use more sophisticated ETL processes that can ingest data from diverse sources. Near real-time analytics in addition to predictive models have become standard fare, significantly reducing the time to actionable insights.

Furthermore, the data landscape has been democratized to empower analysts in numerous ways through the rise of transactional data lakes powered by open table formats such as Apache Iceberg and the assistance of generative AI. This holistic approach has elevated data organizations’ capabilities well beyond traditional reporting, unlocking greater business value from the wealth of data available.

Using generative AI with data mesh architecture

In addition to the five key trends previously mentioned, the present-day data landscape is characterized by three key facts that are leading data organizations like Getir to increasingly harness the power of generative AI to drive the next evolution of data-informed decision-making.

Data is an organization’s most valuable asset and the ability to effectively use data is central to an organization’s success and growth. Data analytics and insights are absolutely crucial to strengthening and expanding the business. Deriving meaningful insights from data is essential for making informed, strategic decisions. Democratizing data and enabling self-service analytics can greatly expand the range of business insights, while reducing the time to market for those insights. Empowering users across the organization to access and analyze data can unlock tremendous value. Generative AI’s ability to respond to natural language prompts, explore and analyze complex data, and summarize lengthy content makes it a valuable tool for translating large amounts of data into valuable insights. However, the true potential of generative AI for organizations lies in Retrieval Augmented Generation (RAG).

Out of the box, generative AI models start with a relatively generic knowledge base, which can lead to unreliable or inaccurate information. RAG addresses this by introducing the model to additional datasets that are specific to the organization or context. This allows generative AI models to produce far more accurate, attributable, and highly contextualized outputs to support decision-making.

Data mesh architecture can play a crucial role in enabling and facilitating RAG. By facilitating access to multiple data sources within the organization, the data mesh provides the necessary fuel for the generative AI model to draw from, resulting in more reliable and insightful information. This, in turn, empowers data-driven decision-making and helps organizations harness the full potential of their data assets.

Conclusion

In this post, we examined how Getir implemented a data mesh architecture and Amazon Redshift data sharing to meet their evolving data requirements. This entailed dedicated data warehouses tailored to different business lines and needs, while maintaining robust data governance and secure data access. Additionally, we highlighted the key industry trends that Getir’s case study reinforces across the broader data landscape. For more information, contact AWS or connect with your AWS Technical Account Manager or Solutions Architect, who will be happy to provide more detailed guidance and support.


About the Authors

Asser Moustafa is a Principal Worldwide Specialist Solutions Architect at AWS, based in Dallas, Texas, USA. He partners with customers worldwide, advising them on all aspects of their data architectures, migrations, and strategic data visions to help organizations adopt cloud-based solutions, maximize the value of their data assets, modernize legacy infrastructures, and implement cutting-edge capabilities like machine learning and advanced analytics. Prior to joining AWS, Asser held various data and analytics leadership roles, completing an MBA from New York University and an MS in Computer Science from Columbia University in New York. He is passionate about empowering organizations to become truly data-driven and unlock the transformative potential of their data.

Pinar Yasar is the Data Engineering Manager at Getir. Her passion is to accelerate self-service analytics for her internal customers and build highly scalable and cost-effective solutions in the cloud.

Apache HBase online migration to Amazon EMR

Post Syndicated from Dalei Xu original https://aws.amazon.com/blogs/big-data/apache-hbase-online-migration-to-amazon-emr/

Apache HBase is an open source, non-relational distributed database developed as part of the Apache Software Foundation’s Hadoop project. HBase can run on Hadoop Distributed File System (HDFS) or Amazon Simple Storage Service (Amazon S3), and can host very large tables with billions of rows and millions of columns.

The followings are some typical use cases for HBase:

  • In an ecommerce scenario, when retrieving detailed product information based on the product ID, HBase can provide a quick and random query function.
  • In security assessment and fraud detection cases, the evaluation dimensions for users vary. HBase’s non-relational architectural design and ability to freely scale columns help cater to the complex needs.
  • In a high-frequency, real-time trading platform, HBase can support highly concurrent reads and writes, resulting in higher productivity and business agility.

Recommended HBase deployment mode

Starting with Amazon EMR 5.2.0, you have the option to run Apache HBase on Amazon S3.

Running HBase on Amazon S3 has several added benefits, including lower costs, data durability, and easier scalability. And during HBase migration, you can export the snapshot files to S3 and use them for recovery.

Recommended HBase migration mode

For existing HBase clusters (including self-built based on open source HBase or provided by vendors or other cloud service providers), we recommend using HBase snapshot and replication technologies to migrate to Apache HBase on Amazon EMR without significant downtime of service.

This blog post introduces a set of typical HBase migration solutions with best practices based on real-world customers’ migration case studies. Additionally, we deep dive into some key challenges faced during migrations, such as:

  • Using HBase snapshots to implement initial migration and HBase replication for real-time data migration.
  • HBase provided by other cloud platforms doesn’t support snapshots.
  • A single table with large amounts of data, for example more than 50 TB.
  • Using BucketCache to improve read performance after migration.

HBase snapshots allow you to take a snapshot of a table without too much impact on region servers, snapshots, clones, and restore operations don’t involve data copying. Also, exporting a snapshot to another cluster has little impact on the region servers.

HBase replication is a way to copy data between HBase clusters. It allows you to keep one cluster’s state synchronized with that of another cluster, using the write-ahead log (WAL) of the source cluster to propagate changes. It can work as a disaster recovery solution and also provides higher availability in the architecture.

Prerequisites

To implement HBase migration, you must have the following prerequisites:

Solution summary

In this example, we walk through a typical migration solution, which is from the source HBase on HDFS cluster (Cluster A) to the target Amazon EMR HBase on S3 (Cluster B). The following diagram illustrates the solution architecture.

Solution architecture

To demonstrate the best practice recommended to the HBase migration process, the following are the detailed steps we will walk through, as shown in the preceding diagram.

Step Activity Description Estimated time
1

Configure cluster A

(Source HBase)

Modify the configuration of the source HBase cluster to prepare for subsequent snapshot exports Less than 5 minutes
2

Create cluster B

(Amazon EMR HBase on S3)

Create an EMR cluster with HBase on Amazon S3 as the migration target cluster Less than 10 minutes
3 Configure replication Configure replication from the source HBase cluster to Amazon EMR HBase, but do not start Less than 1 minute
4 Pause service Pause the service of the source HBase cluster Less than 1 minute
5 Create snapshot Create a snapshot for each table on the source HBase cluster Less than 5 minutes
6 Resume service Resume the service of the source HBase cluster Less than 1 minute
7 Snapshot export and restore Use snapshot to migrate data from the source HBase cluster to the Amazon EMR HBase cluster Depends on the size of the table data volume
8 Start replication Start the replication of the source HBase cluster to Amazon EMR HBase and synchronize incremental data Depends on the size of the data accumulated during the snapshot export restore.
9 Test and verify Test the and verify the Amazon EMR HBase

Solution walkthrough

In the preceding diagram and table, we have listed the operational steps of the solution. Next, we will elaborate the specific operations for each step shown in the preceding table.

1. Configure cluster A (source HBase)

When exporting a snapshot from the source HBase cluster to the Amazon EMR HBase cluster, you must modify the following settings on the source cluster to ensure the performance and stability of data transmission.

Configuration classification Configuration item Suggested value Comment
core-site fs.s3.awsAccessKeyId Your AWS access key ID The snapshot export takes a relatively long time. Without an access key and secret key, the snapshot export to Amazon S3 will encounter errors such as com.amazon.aws.emr.hadoop.fs.shaded.com. amazonaws.SdkClientException: Unable to load AWS credentials from any provider in the chain.
core-site fs.s3.awsSecretAccessKey Your AWS secret access key
yarn-site yarn.nodemanager.resource.memory-mb Half of a single core node RAM The amount of physical memory, in MB, that can be allocated for containers.
yarn-site yarn.scheduler.maximum-allocation-mb Half of a single core node RAM The maximum allocation for every container request at the ResourceManager in MB. Because snapshot export runs in the YARN Map Reduce task, it’s necessary to allocate sufficient memory to YARN to ensure transmission speed.

These values are set depending on the cluster resource, workload, and table data volume. The modification can be done using a web UI if available or by suing a standard configuration XML file. Restart the HBase service after the change is complete.

2. Create cluster B (EMR HBase on S3)

Use the following recommend settings to launch an EMR cluster:

Configuration classification Configuration item Suggested value Comment
yarn-site yarn.nodemanager.resource.memory-mb 20% of a single core node RAM Amount of physical memory, in MB, that can be allocated for containers.
yarn-site yarn.scheduler.maximum-allocation-mb 20% of a single core node RAM The maximum allocation for every container request at the ResourceManager in MB. Because snapshot restore runs in the HBase, it’s necessary to allocate a small amount of small memory to YARN and leave sufficient memory to HBase to ensure restore.
hbase-env.export HBASE_MASTER_OPTS 70% of a single core node RAM Set the Java heap size for the primary HBase.
hbase-env.export HBASE_REGIONSERVER_OPTS 70% of a single core node RAM Set the Java heap size for the HBase region server.
hbase hbase.emr.storageMode S3 Indicates that HBase uses S3 to store data.
hbase-site hbase.rootdir <Your-HBase-Folder-on-S3> Your HBase data folder on S3.

See Configure HBase for more details. Additionally, the default YARN configuration on Amazon EMR for each Amazon EC2 instance type can be found in Task configuration.

The configuration of our example is as shown in the following figure.

Instance group configurations

3. Config replication

Next, we configure the replication peer from the source HBase to the EMR cluster.

The operations include:

  • Create a peer.
  • Because the snapshot migration hasn’t been done, we start by disabling the peer.
  • Specify the table that requires replication for the peer.
  • Enable table replication.

Let’s use the table usertable as an example. The shell script is as follows:

MASTER_IP="<Master-IP>"
TABLE_NAME="usertable"
cat << EOF | sudo -u hbase hbase shell 2>/dev/null
add_peer 'PEER_$TABLE_NAME', CLUSTER_KEY => '$MASTER_IP:2181:/hbase'
disable_peer 'PEER_$TABLE_NAME'
enable_table_replication '$TABLE_NAME'
EOF

The result will like the following text.

hbase:001:0> add_peer 'PEER_usertable', CLUSTER_KEY => '<Master-IP>:2181:/hbase'
Took 13.4117 seconds 
hbase:002:0> disable_peer 'PEER_usertable'
Took 8.1317 seconds 
hbase:003:0> enable_table_replication 'usertable'
The replication of table 'usertable' successfully enabled
Took 168.7254 seconds

In this experiment, we’re using the table usertable as an example. If we have many tables that need to be configured for replication, we can use the following code:

MASTER_IP="<Master-IP>"

# Get all tables
TABLE_LIST=$(echo 'list' | sudo -u hbase hbase shell 2>/dev/null | sed -e '1,/TABLE/d' -e '/seconds/,$d' -e '/row/,$d')
# Iterate each table
for TABLE_NAME in $TABLE_LIST; do
# Add the operation
cat << EOF | sudo -u hbase hbase shell 2>/dev/null
add_peer 'PEER_$TABLE_NAME', CLUSTER_KEY => '$MASTER_IP:2181:/hbase'
disable_peer 'PEER_$TABLE_NAME'
enable_table_replication '$TABLE_NAME'
EOF
done

In the scripts of following steps, if you need to apply the operations for all tables, you can refer to the preceding code sample.

At this point, the status of the peer is Disabled, so replication won’t be started. And the data that needs to be synchronized from the source to the target EMR cluster will be backlogged at the source HBase cluster and won’t be synchronized to HBase on the EMR cluster.

After the snapshot restore (step 7) is completed on the HBase on Amazon EMR cluster, we can enable the peer to start synchronizing data.

If the source HBase version is 1.x, you must run the set_peer_tableCFs function. See HBase Cluster Replication.

4. Pause the service

To pause the service of the source HBase cluster, disable the HBase tables. You can use the following script:

sudo -u hbase bash /usr/lib/hbase/bin/disable_all_tables.sh 2>/dev/null

The result is shown in the following figure.

Disable all tables

After disabling all tables, observe the HBase UI to ensure that no background tasks are being run, and then stop any services accessing the source HBase. This will take 5-10 minutes.

The HBase UI is as shown in the following figure.

Check background tasks

5. Create a snapshot

Make sure the tables in the source HBase are disabled. Then, you can create a snapshot of the source. This process will take 1-5 minutes.

Let’s use the table usertable as an example. The shell script is as follows:

DATE=`date +"%Y%m%d"`
TABLE_NAME="usertable"
sudo -u hbase hbase snapshot create -n "${TABLE_NAME/:/_}-$DATE" -t ${TABLE_NAME} 2>/dev/null

You can check the snapshot with a script:

sudo -u hbase hbase snapshot info -list-snapshots 2>/dev/null

And the result is as shown in the following figure.

Create snapshot

6. Resume service

After the snapshot is successfully created on the source HBase, you can enable the tables and resume the services that access the source HBase. These operations take several minutes, so the total data unavailability time on the source HBase during the implementation (steps 3 to step 6) will be approximately 10 minutes.

The command to enable the table is as follows:

TABLE_NAME="usertable"
echo -e "enable '$TABLE_NAME'" | sudo -u hbase hbase shell 2>/dev/null

The result is shown in the following figure.

Enable table

At this point, you can write data to the source HBase, because the status of the replication peer is disabled, so the incremental data won’t be synchronized to the target cluster.

7. Snapshot export and restore

After the snapshot created in the source HBase, it’s time to export the snapshot to the HBase data directory on the target EMR cluster. The example script is as follows:

DATE=`date +"%Y%m%d"`
TABLE_NAME="usertable"
TARGET_BUCKET="<Your-HBase-Folder-on-S3>"
nohup sudo -u hbase hbase snapshot export -snapshot ${TABLE_NAME/:/_}-$DATE -copy-to $TARGET_BUCKET &> ${TABLE_NAME/:/_}-$DATE-export.log &

Exporting the snapshot will take from 10 minute to several hours to complete, depending on the amount of data to be exported. So we run it in the background. You can check the progress by using the yarn application -list command, as shown in the following figure.

Exporting snapshot process

As an example, if you’re using an HBase cluster with 20 r6g.4xlarge core nodes, it will take about 3 hours for 50 TB of data to be exported to Amazon S3 in same AWS Region.

After the snapshot export is completed at the source HBase, you can check the snapshot in the target EMR cluster using the following script:

sudo -u hbase hbase snapshot info -list-snapshots 2>/dev/null

The result is shown in the following figure.

Check snapshot

Confirm the snapshot name—for example, usertable-20240710 and run snapshot restore on the target EMR cluster using the following script.

TABLE_NAME="usertable"
SNAPSHOT_NAME="usertable-20240710"
cat << EOF | nohup sudo -u hbase hbase shell &> restore-snapshot.out &
disable '$TABLE_NAME'
restore_snapshot '$SNAPSHOT_NAME'
enable '$TABLE_NAME'
EOF

The snapshot restore will take from 10 minute to several hours to complete, depending on the amount of data to be restored, so we run it in the background. The result is as shown in the following figure.

Restore snapshot

You can check the progress of the restore through The Amazon EMR web interface for HBase, as shown in the following figure.

Check snapshot restore

From the Amazon EMR web interface for HBase, you can found it takes about 2 hours to run Clone Snapshot for a sample table with 50 TB of data, and then 1 additional hour to run . After these two stages, the snapshot restore is completed.

8. Start replication

After the snapshot restore is completed on the EMR cluster and the status of the table is set to enabled, you can enable HBase replication in the source HBase. The incremental data will be synchronized to the target EMR cluster.

In the source HBase, the example script is as follows:

TABLE_NAME="usertable"
echo -e "enable_peer 'PEER_$TABLE_NAME'" | sudo -u hbase hbase shell 2>/dev/null

The result is as shown in the following figure.

Enable peer

Wait for the incremental data to be synchronized from the source HBase to the HBase on EMR cluster. The time taken depends on the amount of data accumulated in the source HBase during the snapshot export and restore. In our example, it took about 10 minutes to complete the data synchronization.

You can check the replication status with scripts:

echo -e "status 'replication'" | sudo -u hbase hbase shell 2>/dev/null

The result is shown in the following figure.

Replication status

9. Test and verify

After incremental data synchronization is complete, you can start testing and verifying the results. You can use the same HBase API to access both the source and the target HBase clusters and compare the results.

To guarantee the data integrity, you can check the number of HBase table region and store files for the replicated tables from the Amazon EMR web interface for HBase, as shown in the following figure.

Check hbase region and store files

For small tables, we recommend using the HBase command to verify the number of records. After signing in to the primary node of the Amazon EMR using SSH, you can run the following command:

sudo -u hbase hbase org.apache.hadoop.hbase.mapreduce.RowCounter 'usertable'

Then, in the hbase.log file of the HBase log directory, find the number of records for the table usertable.

For large tables, you can use the HBase Java API to validate the row count in a range of row keys.

We provided sample Java code to implement this functionality. For example, we imported the demo data to usertable using the following script:

java -classpath hbase-utils-1.0-SNAPSHOT-jar-with-dependencies.jar HBaseAccess <Your-Zookeeper-IP> put 1000 20

The result is shown in the following figure.

Put demo data into HBase table

You can run the script multiple times to import enough demo data into the table, then you can use the following script to count the number of records where the value of Row Key is between user1000 and user5000, and the value of the column family:field0 is value0

java -classpath hbase-utils-1.0-SNAPSHOT-jar-with-dependencies.jar HBaseRowCounter <Your-Zookeeper-IP> usertable "user1000" "user1100" "family:field0" "value0"

The result is shown in the following figure.

HBase table row counter

You can run the same code on both the source HBase and the target Amazon EMR HBase to verify that the results are consistent. See complete code.

After these steps are complete, you can switch from the source HBase to the target Amazon EMR Hbase, completing the migration.

Clean up

After you’re done with the solution walkthrough, complete the following steps to clean up your resources:

  1. Stop the Amazon EMR on EC2 cluster.
  2. Delete the S3 bucket that stores the HBase data.
  3. Stop the source HBase cluster, and release its related resource, for example, the Amazon EC2 cluster or resources provided by other vendors or cloud service providers.

Key challenges in HBase migration

In the previous sections, we have detailed the steps to implement HBase online migration through snapshots and replication for general scenario. Many customers’ scenarios may have some differences from the general scenario, and you need to make some modifications to the process steps in order to accomplish the migration.

HBase in the cloud doesn’t support snapshot

Many cloud providers have made modifications to the open source version of HBase, resulting in these versions of HBase not providing snapshot and replication functions. However, these cloud providers will provide data transfer tools for HBase, such as Lindorm Tunnel Service, that can be used to transfer HBase data to an HBase cluster with data on HDFS.

To deploy HBase on Amazon S3, you should follow the previous migration process as the best practice, using snapshot and replication techniques to migrate to an Amazon EMR environment. To solve the problem of HBase versions that don’t support snapshots and replication, you can create an HBase on HDFS as a jump or relay cluster, which can be used to synchronize the data from a source HBase to an HDFS-based HBase cluster, then migrate from the middle cluster to the target HBase on S3.

The following diagram illustrates the solution architecture.

Solution architecture for HBase in the cloud doesn’t support snapshot

You need to add three more steps in addition to the migration steps described previously.

Step Activity Description Estimated time
1

Create Cluster B

(EMR HBase on HDFS)

Create an EMR cluster with HBase on HDFS as the relay cluster. Less than 10 minutes
2 Configure data transfer Configure the data transfer from the outer HBase cluster to Amazon EMR HBase on HDFS and start the data transfer. Less than 5 minutes
3

HBase migration

(snapshot and replication)

Treat the outer HBase cluster as an application which writes data into the Amazon EMR HBase cluster, then you can use the steps in the previous scenario to complete the migration to Amazon EMR HBase on Amazon S3.

Single table with large amounts of data

During the migration process, if the amount of data in a single table in the source HBase (Cluster A) is too large—such as 10 TB or even 50TB—you must modify the target Amazon EMR HBase cluster (Cluster B) configuration to ensure that there are no interruptions during the migration process, especially during the snapshot restore on the Amazon EMR HBase cluster. After the snapshot restore is complete, you can rebuild the Amazon EMR HBase cluster (Cluster C).

The following diagram illustrates the solution architecture for handling a very large table.

Solution architecture for handling a very large table

The following are the steps.

Step Activity Description Estimated time
1 Create Cluster B (EMR HBase on S3 for restore) Create an EMR cluster with the required configuration for a large table snapshot restore. Less than 10 minutes
2

HBase migration

(snapshot and replication)

Consider the Amazon EMR HBase on Amazon S3 as the target cluster, then you can use the steps in the first scenario to complete the migration from the source HBase to the Amazon EMR HBase on S3.
3

Recreate Cluster C

(EMR HBase on S3 for production)

After the migration is complete, Cluster B needs to be changed back to its previous configuration before migration. If it’s inconvenient to modify the parameters, you can use the previous configuration to recreate the EMR cluster (Cluster C). Less than 15 minutes
4 Rebuild replication After recreating the EMR cluster, if replication is still needed to synchronize the data, the replication from the source HBase cluster to the new EMR HBase cluster must be rebuilt. Before building the new EMR cluster, the write service on the source HBase cluster should be paused to avoid data loss on the Amazon EMR HBase. Less than 1 minute

In Step 1, create cluster B (EMR HBase on S3 for restore), use the following configuration for snapshot restore. All time values are in milliseconds.

Configuration classification Configuration item Default value Suggested value Explanation
emrfs-site fs.s3.maxConnections 1000 50000 The number of concurrent Amazon S3 connections that your applications need. The default value is 1000 and must be increased to avoid errors such as com. amazon. ws. emr. hadoop. fs. shaded. com. amazonaws. SdkClientException: Unable to execute HTTP request: timeout waiting for connection from pool.
hbase-site hbase.client.operation.timeout 300000 18000000 Operation timeout is a top-level restriction that makes sure a blocking operation in a table will not be blocked for longer than the timeout.
hbase-site hbase.master.cleaner.interval 60000 18000000 The default is to run HBase Clearer in 60,000 milliseconds, which will clear some files in the archive, resulting in an error that HFile cannot be found.
hbase-site hbase.rpc.timeout 60000 18000000 This property limits how long a single RPC call can run before timing out.
hbase-site hbase.snapshot.master.timeout.millis 300000 18000000 Timeout for the primary HBase for the snapshot procedure.
hbase-site hbase.snapshot.region.timeout 300000 18000000 Timeout for region servers to keep threads in a snapshot request pool waiting.
hbase-site hbase.hregion.majorcompaction 604800000 0

Default is 604,800,000 ms (1 week). Set to 0 to disable automatic triggering of compaction. Note that because of the change to manual triggering, you must make compaction one of the daily operation and maintenance tasks, and run it during periods of low activity to avoid impacting production. The following is the compaction script:

echo -e "major_compact '$TABLE_NAME'" | sudo -u hbase hbase shell

Adjust the suggested values based on the amount of table data, which requires conducting some experiments to determine the values to be used in the final migration plan.

Before you recreate a new EMR cluster in the production environment, disable the HBase table on the active EMR cluster. The command line is as follows:

sudo -u hbase bash /usr/lib/hbase/bin/disable_all_tables.sh 2>/dev/null
echo -e "flush 'hbase:meta'" | sudo -u hbase hbase shell 2>/dev/null

Wait for the command to execute successfully and terminate the current EMR cluster (Cluster B). Now the HBase data is kept in Amazon S3; create a new EMR cluster (Cluster C) with the previous configuration before migration, and specify HBase date folder on S3 to be the same as Cluster B.

Using Bucket Cache to improve read performance

To enhance HBase’s read performance, one of most effective method involves caching data. HBase uses BlockCache to implement caching mechanisms for the region server. Currently, HBase provides two different BlockCache implementations to cache data read from HDFS: the default on-heap LruBlockCache and the BucketCache, which is usually off-heap. Bucket cache is the most used method.

BucketCache can be deployed in offheap, file, or mmaped file mode. These three working modes are the same in terms of memory logical organization and caching process; however, the final storage media corresponding to these three working modes are different, that is, the IOEngine is different.

We recommend that customers use BucketCache in file mode, because the default storage type of Amazon Elastic Block Store (Amazon EBS) in Amazon EMR is SSD. You can put all hot data into BucketCache, which is on Amazon EBS. You can then determine the file size used by BucketCache based on the volume of hot data.

The following are the HBase configurations for BucketCache.

Configuration classification Configuration item Suggested value Explanation
emrfs-site hbase.bucketcache.ioengine files:/mnt1/hbase/cache_01.data Where to store the contents of the bucket cache. You can use offheap, file, files, mmap, or pmem. If a file or files, set it to files. Note that some earlier Amazon EMR versions can only support using one file per core node.
hbase-site hbase.bucketcache.persistent.path file:/mnt1/hbase/cache.meta The path to store the metadata of the bucket cache, used to recover the cache during startup.
hbase-site hbase.bucketcache.size Depends on the hot data volume be cached The capacity in MBs of BucketCache for each core node. If you use multiple cache files, then this size is the sum of the capacities of multiple files.
hbase-site hbase.rs.prefetchblocksonopen TRUE Whether the server should asynchronously load all the blocks when a store file is opened (data, metadata, and index). Note that enabling this property contributes to the time the region server takes to open a region and therefore initialize.
hbase-site hbase.rs.cacheblocksonwrite TRUE Whether an HFile block should be added to the block cache when the block is finished.
hbase-site hbase.rs.cachecompactedblocksonwrite TRUE Whether to cache compressed blocks during writing.

More configuration instructions for BucketCache would refer to Configuration properties.

We provided sample Java code to test HBase read performance. In the Java code, we use the putDemoData method to write test data to the table usertable, ensuring that the data is evenly distributed across HBase table regions, and then use the getDemoData method to read the data.

We tested three scenarios: HBase data stored on HDFS, on Amazon S3 without using BucketCache, and on S3 using BucketCache. To ensure that the written data isn’t cached in the first two scenarios, the cache can be cleared by restarting the the region servers.

We tested on the EMR HBase cluster, which has 10 r6g. 2xlarge core nodes. The command is as follows:

java -classpath hbase-utils-1.0-SNAPSHOT-jar-with-dependencies.jar HBaseAccess <Your-Zookeeper-IP> get 20 100

The result is shown in the following figure.

Read HBase table

Benchmark results and key learning

For the three scenarios, we use 100 HBase record row keys as input, ensure these row keys distributed evenly in HBase table regions, we call the API consecutively with 20 times, 50 times, and 100 times, and we got the time cost result as the following figure. We found the read latency is the shortest when the data is on S3 and using BucketCache.

Read performance

Above, we introduced four migration scenarios. In migration processes in production, we’ve gained invaluable knowledge and experience. We’re sharing the results here for you to use as best practices and a recommended run book.

Configuration parameters

In the configurations provided earlier, some are necessary for BucketCache settings while others mitigate known errors to reduce the snapshot duration. For example, the parameter hbase.snapshot.master.timeout.millis is related to the HBASE-14680 issue. It’s advisable to retain these configurations as much as possible throughout the migration process.

Version choice

When migrating to Amazon EMR and choosing a suitable HBase version, it is recommended to select a more recent minor version and patch version, while keeping the major version unchanged. That is to say:

  • If the source HBase is version 1.x, we recommend using EMR 5.36.1, whose HBase version is 1.4.13, because the HBase 1.x API is compatible and won’t require you to make code changes.
  • If the source HBase is version 2.x, we recommend using EMR 6.15.0, which has an HBase of 2.4.17.

The HBase API under the same major version can be universal. See HBase Version to learn more.

Allocating enough space

HDFS needs to leave enough space when exporting snapshots, it depends on the data volume. Data will be moved to the archive, causing double storage is needed for the table.

Replication

At present, there is compatibility problem if replication and WAL compression are used together. If you are using replication, set the hbase.regionserver.wal.enablecompression property to false. See HBASE-26849 for more information.

By default, replication in HBase is asynchronous, as the write-ahead log (WAL) is sent to another cluster in the background. This implies that when using replication for data recovery, some data loss may occur. Additionally, there is a potential for data overwrite conflicts when concurrently writing the same record within a short time frame. However, HBase v2.x introduces synchronous replication as a solution to this issue. For more details, refer to the Serial Replication documentation.

Disk type for BucketCache

Because BucketCache uses a portion of Amazon EBS IO and throughput to synchronize data, it’s recommended to choose Amazon EBS gp3 volumes for their higher IOPS and throughput.

Response latency when accessing to HBase

Users sometimes face the issue of high response latency from their Hbase on EMR clusters using API calls or the HBase shell tool.

In our testing, we found that the communication between HMaster and RegionServer takes an unusually long time to resolve through DNS. You can reduce latency by adding the host name and IP mapping to the /etc/hosts file in the HBase client host.

Conclusion

In this post, we used the results of real-world migration cases to introduce the process of migrating HBase to Amazon EMR HBase using HBase snapshot and replication and the deployment mode of HBase on Amazon S3. We included how to resolve challenges, such as how to configure the cluster to make the migration process smoother when migrating a single large table, or how to use BucketCache to improve reading performance. We also described techniques for testing performance.

We encourage you to migrate HBase to Amazon EMR HBase. For more information about HBase migration, see Amazon EMR HBase best practices.


About the Authors

Dalei XuDalei Xu is a Analytics Specialist Solution Architect at Amazon Web Services, responsible for consulting, designing, and implementing AWS data analytics solutions. With over 20 years of experience in data-related work, proficient in data development, migration to AWS, architecture design, and performance optimization. Hoping to promote AWS data analytics services to more customers, achieving a win-win situation and mutual growth with customers.

Zhiyong Su is a Migration Specialist Solution Architect at Amazon Web Services, primarily responsible for cloud migration or cross-cloud migration for enterprise-level clients. Has held positions such as R&D Engineer, Solutions Architect, and has years of practical experience in IT professional services and enterprise application architecture.

Shijian TangShijian Tang is a Analytics Specialist Solution Architect at Amazon Web Services.

[$] Free-software foundations face fundraising problems

Post Syndicated from jzb original https://lwn.net/Articles/993665/

In July, at the GNOME annual general meeting (AGM),
held at GUADEC
2024
,
the message from the GNOME Foundation board was that all was well,
financially speaking. Not great, but the foundation was on a
break-even budget and expected to go into its next fiscal year with a
similar budget and headcount. On October 7, however, the board announced
that it had had to make some cuts, including reducing its staff by
two people. This is not, however, strictly a GNOME problem: similar
organizations, such as the Python Software Foundation (PSF), KDE e.V.,
and the Free Software Foundation Europe (FSFE) are seeing declines in
fundraising while also being affected by inflation.