Get started with Amazon OpenSearch Service: T-shirt size your domain for log analytics

Post Syndicated from Harsh Bansal original https://aws.amazon.com/blogs/big-data/get-started-with-amazon-opensearch-service-t-shirt-size-your-domain-for-log-analytics/

When you’re spinning up your Amazon OpenSearch Service domain, you need to figure out the storage, instance types, and instance count; decide the sharding strategies and whether to use a cluster manager; and enable zone awareness. Generally, we consider storage as a guideline for determining instance count, but not other parameters. In this post, we offer some recommendations based on T-shirt sizing for log analytics workloads.

Log analytics and streaming workload characteristics

When you use OpenSearch Service for your streaming workloads, you send data from one or more sources into OpenSearch Service. OpenSearch Service indexes your data in an index that you define.

Log data naturally follows a time series pattern, and therefore a time-based indexing strategy (daily or weekly indexes) is recommended. For efficient management of log data, you must implement time-based index patterns and set retention periods. You further define time slicing and a retention period for the data to manage its lifecycle in your domain.

For illustration, consider that you have a data source producing a continuous stream of log data, and you’ve configured a daily rolling index and set a retention period of 3 days. As the logs arrive, OpenSearch Service creates an index per day with names like stream1_2025.05.21, stream1_2025.05.22, and so on. The prefix stream1_* is what we call an index pattern, a naming convention that helps group-related indexes.

The following diagram shows three primary shards for each daily index. These shards are deployed across three OpenSearch Service data instances, with one replica for each primary shard. (For simplicity, the diagram doesn’t show that primary and replica shards are always placed on different instances for fault tolerance.)

When OpenSearch Service processes new log entries, they are sent to all relevant primary shards and their replicas in the active index, which in this example is only today’s index due to the daily index configuration.

There are several important characteristics of how OpenSearch Service processes your new entries:

  • Total shard count – Each index pattern will have a D * P * (1 + R) total shards, where D represents retention in days, P represents primary shards, and R is the number of replicas. These shards are distributed across your data nodes.
  • Active index – Time slicing means that new log entries are only written to today’s index.
  • Resource utilization – When sending a _bulk request with log entries, these are distributed across all shards in the active index. In our example with three primary shards and one replica per shard, that’s a total of six shards processing new data simultaneously, requiring 6 vCPUs to efficiently handle a single _bulk request.

Similarly, OpenSearch Service distributes queries across the shards for the indexes involved. If you query this index pattern across all 3 days, you will engage 9 shards, and need 9 vCPUs to process the request.

This will get even more complicated when you add in more data streams and index patterns. For each additional data stream or index pattern, you deploy shards for each of the daily indexes and use vCPUs to process requests in proportion to the shards deployed, as shown in the preceding diagram. When you make concurrent requests to more than one index, each shard for all the indexes involved must process those requests.

Cluster capacity

As the number of index patterns and concurrent requests increases, you can quickly overwhelm the cluster’s resources. OpenSearch Service includes internal queues that buffer requests and mitigate this concurrency demand. You can monitor these queues using the _cat/thread_pool API, which shows queue depths and helps you understand when your cluster is approaching capacity limits.

Another complicating dimension is that the time to process your updates and queries depends on the contents of the updates and queries. As requests come in, the queues are filling at the rate you are sending them. They are draining at a rate that is governed by the available vCPUs, the time they take on each request, and the processing time for that request. You can interleave more requests if those requests clear in a millisecond than if they clear in a second. You can use the _nodes/stats OpenSearch API to monitor average load on your CPUs. For more information about the query phases, refer to A query, or There and Back Again on the OpenSearch blog.

If you see the queue depths increasing, you are moving into a “warning” area, where the cluster is handling load. But if you continue, you can start to exceed the available queues and must scale to add more CPUs. If you start to see load increasing, which is correlated with queue depth increasing, you are also in a “warning” area and should consider scaling.

Recommendations

For sizing a domain, consider the following steps:

  • Determine the storage required – Total storage = (daily source data in bytes × 1.45) × (number_of_replicas + 1) × number of days retained. This accounts for the additional 45% overhead on daily source data, broken down as follows:
    • 10% for larger index size than source data.
    • 5% for operating system overhead (reserved by Linux for system recovery and disk defragmentation protection).
    • 20% for OpenSearch reserved space per instance (segment merges, logs, and internal operations).
    • 10% for additional storage buffer (minimizes impact of node failure and Availability Zone outages).
  • Define the shard count – Approximate number of primary shards = storage size required per index / desired shard size. Round up to the nearest multiple of your data node count to maintain even distribution. For more detailed guidance on shard sizing and distribution strategies, refer to “Amazon OpenSearch Service 101: How many shards do I need” For log analytics workloads, consider the following:
    • Recommended shard size: 30–50 GB
    • Optimal target: 50 GB per shard
  • Calculate CPU requirements – Recommended ratio is 1.25 vCPU:1 Shard for lower data volumes. Higher ratios are recommended for larger volumes. Target utilization is 60% average, 80% maximum.
  • Choose the right instance type – Consider the following based on your nodes:

Let’s look at an example for domain sizing. The initial requirements are as follows:

  • Daily log volume: 3 TB
  • Retention period: 3 months (90 days)
  • Replica count: 1

We make the following instance calculation.

The following table recommends instances, amount of source data, storage needed for 7 days of retention, and active shards based on the preceding guidelines.

T-Shirt Size Data (Per Day) Storage Needed (with 7 days Retention) Active Shards Data Nodes Primary Nodes
XSmall 10 GB 175 GB 2 @ 50 GB 3 * r7g.large. search 3 * m7g.large. search
Small 100 GB 1.75 TB 6 @ 50 GB 3 * r7g.xlarge. search 3 * m7g.large. search
Medium 500 GB 8.75 TB 30 @ 50 GB 6 * r7g.2xlarge.search 3 * m7g.large. search
Large 1 TB 17.5 TB 60 @ 50 GB 6 * r7g.4xlarge.search 3 * m7g.large. search
XLarge 10 TB 175 TB 600 @ 50 GB 30 * i4g.8xlarge 3 * m7g.2xlarge.search
XXL 80 TB 1.4 PB 2400 @ 50 GB 87 * I4g.16xlarge 3 * m7g.4xlarge.search

As with all sizing recommendations, these guidelines represent a starting point and are based on assumptions. Your workload will differ, and so your actual needs will differ from these recommendations. Make sure to deploy, monitor, and adjust your configuration as needed.

For T-shirt sizing the workloads, an extra-small use case encompasses 10 GB or less of data per day from a single data stream to a single index pattern. A small use case falls between 10–100 GB per day of data, a medium use case between 100–500 GB of data, and so on. Default instance count per domain is 80 for most of the instance family. Refer to the “Amazon OpenSearch Service quotas “ for details.

Additionally, consider the following best practices:

Conclusion

This post provided comprehensive guidelines for sizing your OpenSearch Service domain for log analytic workloads, covering several critical aspects. These recommendations serve as a solid starting point, but each workload has unique characteristics. For optimal performance, consider implementing additional optimizations like data tiering and storage tiers. Evaluate cost-saving options such as reserved instances, and scale your deployment based on actual performance metrics and queue depths.By following these guidelines and actively monitoring your deployment, you can build a well-performing OpenSearch Service domain that meets your log analytics needs while maintaining efficiency and cost-effectiveness.


About the authors

Harsh Bansal

Harsh Bansal

Harsh is an Analytics and AI Solutions Architect at Amazon Web Services. Bansal collaborates closely with clients, assisting in their migration to cloud platforms and optimizing cluster setups to enhance performance and reduce costs. Before joining AWS, Bansal supported clients in leveraging OpenSearch and Elasticsearch for diverse search and log analytics requirements.

Aditya Challa

Aditya Challa

Aditya is a Senior Solutions Architect at Amazon Web Services. Aditya loves helping customers through their AWS journeys because he knows that journeys are always better when there’s company. He’s a big fan of travel, history, engineering marvels, and learning something new every day.

Raaga NG

Raaga NG

Raaga is a Solutions Architect at Amazon Web Services. Raaga is a technologist with over 5 years of experience specializing in Analytics. Raaga is passionate about helping AWS customers navigate their journey to the cloud.

Amazon SageMaker introduces Amazon S3 based shared storage for enhanced project collaboration

Post Syndicated from Hari Ramesh original https://aws.amazon.com/blogs/big-data/amazon-sagemaker-introduces-amazon-s3-based-shared-storage-for-enhanced-project-collaboration/

AWS recently announced that Amazon SageMaker now offers Amazon Simple Storage Service (Amazon S3) based shared storage as the default project file storage option for new Amazon SageMaker Unified Studio projects. This feature addresses the deprecation of AWS CodeCommit while providing teams with a straightforward and consistent way to collaborate on project files across the integrated development tools in SageMaker.

This new Amazon S3 storage option provides the following benefits:

  • Simplified collaboration – File sharing between project members directly without Git operations
  • Universal access – Consistent file access across SageMaker tools (JupyterLab, Query Editor, Visual ETL)
  • Clear workspace separation – Built-in personal storage separation with Amazon Elastic Block Store (Amazon EBS) volumes
  • Global availability – Available in AWS Regions where SageMaker is supported

Although Amazon S3 is the default option for file storage, you can also use Git version control for more robust source control capabilities.

In this post, we discuss this new feature and how to get started using Amazon S3 shared storage in SageMaker Unified Studio.

Solution overview

When you create a new SageMaker Unified Studio domain, the service automatically configures Amazon S3 storage as your default project storage option. Each project receives a dedicated shared location in Amazon S3, accessible to project members, following the structure [bucket]/[domain-id]/[project-id]/shared/.

SageMaker tools JupyterLab and Code Editor provide the following to users:

  • A personal EBS volume for individual work in JupyterLab and Code Editor tools
  • A mounted shared folder containing the project’s Amazon S3 shared storage
  • Clear separation between personal and shared spaces

The shared storage is accessible across SageMaker integrated development tools:

  • JupyterLab and Code Editor show shared files along with personal files
  • Query Editor filters for relevant SQL notebooks
  • Visual ETL provides direct access to shared extract, transform, and load (ETL) workflows

Files saved to the shared location are immediately visible and available to project members. Users can continue working with personal files in their EBS volumes in tools like JupyterLab and Code Editor and explicitly move files to shared storage when ready to collaborate.If you want to use Git for collaboration, you can continue to do so by integrating projects with your GitHub version control, GitLab version control, or managed Bitbucket repositories.

Migration and version control options

For teams currently using Amazon CodeCommit, existing projects will remain fully functional. New projects will default to Amazon S3 storage. If you want to have version control for Amazon S3 based projects, you can enable versioning in Amazon S3 directly.

Prerequisites

You will need to complete the following prerequisites before you can follow the instructions in the next section:

  1. Sign up for an AWS account.
  2. Create a user with administrative access.
  3. Enable IAM Identity Center in the same AWS Region you want to create your SageMaker Unified Studio domain. Confirm in which Region SageMaker Unified Studio is currently available. Set up your IdP and synchronize identities and groups with IAM Identity Center. For more information, refer to IAM Identity Center Identity source tutorials.

Get started with Amazon S3 shared storage

To begin using Amazon S3 shared storage, complete the following steps:

  1. Create a new SageMaker Unified Studio domain.
  2. Create a new project (Amazon S3 storage is the default file storage option).
  3. Open the new project and choose JupyterLab from the Build menu.
  4. Save the new notebook you just created.
  5. Rename the file.

After the project is saved, project users can view the saved notebook in the Project files section under the S3 path [bucket]/[domain-id]/[project-id]/shared/.

Enable version control using Git

To enable version control using Git, complete the following steps:

  1. On the SageMaker console, create a new project profile.
  2. Provide the necessary details for your project profile.
  3. In the Project files storage section, the Amazon S3 option is selected by default. To enable version control for the project, you can use existing Git repository connections by selecting Git repository.

Use shared storage in Query Editor

To use the shared storage feature in Query Editor, complete the following steps:

  1. Choose Query Editor from the Build menu.
  2. Compose your query, and on the Actions menu, choose Save to save the query to shared storage.
  3. Navigate back to the Project files section, where you can view the query notebook files under the S3 path [bucket]/[domain-id]/[project-id]/shared/.

Use shared storage in Visual ETL flows

To use the shared storage feature in Visual ETL flows, complete the following steps:

  1. Choose Visual ETL flows from the Build menu.
  2. Develop your ETL workflow and save the code to the project.
  3. Navigate back to the Project files section, where you can view the files under the S3 path [bucket]/[domain-id]/[project-id]/shared/jobs/uploads/<ETL name>.

Clean up

Make sure you remove the SageMaker Unified Studio resources to mitigate any unexpected costs. This involves a few steps:

  1. Delete the projects.
  2. Delete the domain.
  3. Delete the S3 bucket named amazon-datazone-AWSACCOUNTID-AWSREGION-DOMAINID

Conclusion

The launch of Amazon S3 shared storage in SageMaker represents another step in simplifying the analytics and machine learning (ML) development experience for our customers. By reducing the complexity of Git operations while maintaining robust collaboration capabilities, teams can now focus on building and deploying analytics and ML solutions faster. The feature is now available in Regions where SageMaker is available.

For detailed information about this feature, including setup instructions and best practices, refer to Unified storage in Amazon SageMaker Unified Studio. Share your feedback on this feature in the comments section.


About the Authors

Hari Ramesh

Hari Ramesh

Hari is a Senior Analytics Specialist Solutions Architect at AWS. He focuses on crafting cloud-based data platforms, enabling real-time streaming, big data processing, and robust data governance.

Anagha Barve

Anagha Barve

Anagha is a Software Development Manager on the Amazon SageMaker Unified Studio team. Her team is focused on building tools and integrated experiences for the developers using Amazon SageMaker Unified Studio. In her spare time, she enjoys cooking, gardening and traveling.

Zach Mitchell

Zach Mitchell

Zach is a Sr. Big Data Architect. He works within the product team to enhance understanding between product engineers and their customers while guiding customers through their journey to develop data lakes and other data solutions on AWS analytics services.

Saurabh Bhutyani

Saurabh Bhutyani

Saurabh is a Principal Analytics Specialist Solutions Architect at AWS. He is passionate about new technologies. He joined AWS in 2019 and works with customers to provide architectural guidance for running generative AI use cases, scalable analytics solutions and data mesh architectures using AWS services like Amazon Bedrock, Amazon SageMaker, Amazon EMR, Amazon Athena, AWS Glue, AWS Lake Formation, and Amazon DataZone.

Anchit Gupta

Anchit Gupta

Anchit is a Senior Product Manager for Amazon SageMaker Studio. She focuses on enabling interactive data science and data engineering workflows from within the SageMaker Studio IDE. In her spare time, she enjoys cooking, playing board/card games, and reading.

Multi-Region keys: A new approach to key replication in AWS Payment Cryptography

Post Syndicated from Ruy Cavalcanti original https://aws.amazon.com/blogs/security/multi-region-keys-a-new-approach-to-key-replication-in-aws-payment-cryptography/

In our previous blog post (Part 1 of our key replication series), Automatically replicate your card payment keys across AWS Regions, we explored an event-driven, serverless architecture using AWS PrivateLink to securely replicate card payment keys across AWS Regions. That solution demonstrated how to build a custom replication framework for payment cryptography keys.

Based on customer feedback requesting a more automated, no-code approach, we’re excited to announce an additional option to this capability with Multi-Region keys for AWS Payment Cryptography in Part 2 of our series.

By using this new feature, you can automatically synchronize payment cryptography keys from a primary Region to other Regions that you select, improving resilience and availability of payment applications. You can also choose between account-level replication or key-level replication, giving more flexibility in how to manage payment keys across Regions.

Multi-Region keys: Overview and benefits

The new Multi-Region key replication feature for AWS Payment Cryptography offers you flexible control over your key replication strategy through the following primary capabilities:

  • Control whether keys are replicated
  • Select specific Regions for key replication
  • Manage replication configuration changes
  • Configure either account-level or key-level replication to meet business needs

Multi-Region keys help deliver several benefits for global payment operations, including:

  • Improved availability: Access your payment keys even if a Region becomes unavailable
  • Disaster recovery: Maintain business continuity with replicated keys across Regions
  • Global operations: Support payment processing across multiple geographic regions
  • Simplified management: Centralized control with distributed availability
  • Consistent key IDs: The same key ID across Regions simplifies application development

Configuration options

Payment Cryptography provides two distinct methods for configuring Multi-Region key replication, giving flexibility to implement a strategy that best fits your organization’s needs. You can choose between a broad, account-level approach or a more granular, key-level method.

Account-level

With account-level configuration, AWS automatically replicates exportable symmetric keys created in your Payment Cryptography account from your designated primary Region to other Regions you specify. This simplifies key management in multi-Region deployments, provides consistent key availability in the Regions that you specify, and reduces the operational overhead of key management.

To configure account-level replication using the AWS Command Line Interface (AWS CLI), use the new enable-default-key-replication-regions API to set the Regions where AWS will replicate your keys. To remove Regions from your default replication list, use the disable-default-key-replication-regions API.

Note: Only symmetric keys created after the account-level replication is enabled will be replicated.

Key-level replication

By using key-level replication, you can achieve more granular control by:

  • Designating specific keys as multi-Region keys
  • Defining custom replication targets for each multi-Region key
  • Maintaining Region-specific keys when needed

Note: Within each Region, Payment Cryptography maintains redundancy of your keys across multiple Availability Zones for high availability. Multi-Region key replication extends across geographic boundaries, giving you additional resilience against Regional outages while maintaining control over where your keys are stored.

You can specify replication Regions during key creation using the --replication-regions parameter, using the AWS CLI, with the create-key or import-key APIs. For existing keys, you can use the new add-key-replication-regions and remove-key-replication-regions APIs to manage which regions receive your replicated keys.

Important: When you specify replication Regions during key creation, these settings take precedence over default replication Regions configured at the account level.

How it works

Figure 1 shows the process when you replicate a key in Payment Cryptography.

  1. The key is created in your designated primary Region
  2. Payment Cryptography automatically replicates the key material asynchronously to the specified replica Regions
  3. The replicated keys maintain the same key ID across Regions; only the Region portion of the Amazon Resource Name (ARN) changes
  4. The key in the primary Region is marked with MultiRegionKeyType: PRIMARY
  5. Keys in replica Regions are marked with MultiRegionKeyType: REPLICA and include a reference to the primary Region
  6. When deleting a key, its deletion cascades from the primary to replica Regions

Figure 1: Representation of key replication from us-east-1 to us-west-2

Figure 1: Representation of key replication from us-east-1 to us-west-2

Example: Creating a multi-Region key at key level

The following is an example of creating a card verification key (CVK) in the primary Region (us-east-1) with replication to us-west-2:

aws payment-cryptography create-key \
--exportable \
--key-attributes KeyAlgorithm=TDES_2KEY,\
KeyUsage=TR31_C0_CARD_VERIFICATION_KEY,\
KeyClass=SYMMETRIC_KEY,KeyModesOfUse='{Generate=true,Verify=true}' \
--region us-east-1 \
--replication-regions us-west-2

The response shows the key being created with replication in progress:

{
  "Key": {
    "KeyArn": "arn:aws:payment-cryptography:us-east-1:111122223333:key/qs6643jl4ohibtqk",
    "KeyAttributes": {
      "KeyUsage": "TR31_C0_CARD_VERIFICATION_KEY",
      "KeyClass": "SYMMETRIC_KEY",
      "KeyAlgorithm": "TDES_2KEY",
      "KeyModesOfUse": {
        "Encrypt": false,
        "Decrypt": false,
        "Wrap": false,
        "Unwrap": false,
        "Generate": true,
        "Sign": false,
        "Verify": true,
        "DeriveKey": false,
        "NoRestrictions": false
      }
    },
    "KeyCheckValue": "CC5EE2",
    "KeyCheckValueAlgorithm": "ANSI_X9_24",
    "Enabled": true,
    "Exportable": true,
    "KeyState": "CREATE_COMPLETE",
    "KeyOrigin": "AWS_PAYMENT_CRYPTOGRAPHY",
    "CreateTimestamp": "2025-08-21T15:25:54.475000-03:00",
    "UsageStartTimestamp": "2025-08-21T15:25:54.287000-03:00",
    "MultiRegionKeyType": "PRIMARY",
    "ReplicationStatus": {
      "us-west-2": {
        "Status": "IN_PROGRESS"
      }
    },
    "UsingDefaultReplicationRegions": false
  }
}

After replication completes, the status updates to SYNCHRONIZED:

aws payment-cryptography get-key \
--key-identifier arn:aws:payment-cryptography:us-east-1:111122223333:key/qs6643jl4ohibtqk \
--region us-east-1

{
    "Key": {
        "KeyArn": "arn:aws:payment-cryptography:us-east-1:111122223333:key/qs6643jl4ohibtqk",
        "KeyAttributes": {
            "KeyUsage": "TR31_C0_CARD_VERIFICATION_KEY",
            "KeyClass": "SYMMETRIC_KEY",
            "KeyAlgorithm": "TDES_2KEY",
            "KeyModesOfUse": {
                "Encrypt": false,
                "Decrypt": false,
                "Wrap": false,
                "Unwrap": false,
                "Generate": true,
                "Sign": false,
                "Verify": true,
                "DeriveKey": false,
                "NoRestrictions": false
            }
        },
        "KeyCheckValue": "CC5EE2",
        "KeyCheckValueAlgorithm": "ANSI_X9_24",
        "Enabled": true,
        "Exportable": true,
        "KeyState": "CREATE_COMPLETE",
        "KeyOrigin": "AWS_PAYMENT_CRYPTOGRAPHY",
        "CreateTimestamp": "2025-08-21T15:25:54.475000-03:00",
        "UsageStartTimestamp": "2025-08-21T15:25:54.287000-03:00",
        "MultiRegionKeyType": "PRIMARY",
        "ReplicationStatus": {
            "us-west-2": {
                "Status": "SYNCHRONIZED"
            }
        },
        "UsingDefaultReplicationRegions": false
    }
}

You can then access the key in the replica Region (us-west-2) using the same key ID and changing only the Region name:

aws payment-cryptography get-key \
--key-identifier arn:aws:payment-cryptography:us-west-2:111122223333:key/qs6643jl4ohibtqk \
--region us-west-2

The response shows the replica key with a reference to the primary Region:

{
    "Key": {
        "KeyArn": "arn:aws:payment-cryptography:us-west-2:111122223333:key/qs6643jl4ohibtqk",
        "KeyAttributes": {
            "KeyUsage": "TR31_C0_CARD_VERIFICATION_KEY",
            "KeyClass": "SYMMETRIC_KEY",
            "KeyAlgorithm": "TDES_2KEY",
            "KeyModesOfUse": {
                "Encrypt": false,
                "Decrypt": false,
                "Wrap": false,
                "Unwrap": false,
                "Generate": true,
                "Sign": false,
                "Verify": true,
                "DeriveKey": false,
                "NoRestrictions": false
            }
        },
        "KeyCheckValue": "CC5EE2",
        "KeyCheckValueAlgorithm": "ANSI_X9_24",
        "Enabled": true,
        "Exportable": true,
        "KeyState": "CREATE_COMPLETE",
        "KeyOrigin": "AWS_PAYMENT_CRYPTOGRAPHY",
        "CreateTimestamp": "2025-08-21T15:25:54.475000-03:00",
        "UsageStartTimestamp": "2025-08-21T15:25:54.287000-03:00",
        "MultiRegionKeyType": "REPLICA",
        "PrimaryRegion": "us-east-1"
    }
}

Things to consider

When using multi-Region keys, several important aspects should be considered. Multi-Region key replication supports only symmetric keys with the exportable attribute enabled, and asymmetric keys are not supported. For billing purposes, AWS bills per key per Region, which means replicating to three Regions incurs costs for the primary key plus costs for each key in the replica Regions.

Key aliases and tags require separate management in each Region because they are not part of the replication process. While primary keys support modifications and updates, replica keys are read-only copies that support only cryptographic operations. Modifications must be made to the key in the primary Region, and Payment Cryptography automatically propagates these changes to the replica Regions. Monitor the replication status to confirm successful synchronization of these changes.

The deletion process for multi-Region keys follows specific behavior patterns that are important to understand. When a primary key is scheduled for deletion, associated replica keys are deleted immediately. The primary key enters a pending deletion state with a minimum 3-day waiting period, during which the deletion can be canceled. However, if you restore the primary key by canceling its deletion, you will need to re-enable replication to recreate the replica keys in your desired Regions. After the 3-day waiting period expires, the primary key is permanently deleted and becomes unrecoverable. Note that deleting a replica key affects only that specific Region and does not impact the primary key or other replica keys.

Multi-Region key replication operates with eventual consistency. When creating new keys or making changes to existing keys, these updates might not appear immediately across all Regions. Applications should be designed to handle this eventual consistency model and not assume immediate availability of keys or key changes in replica Regions. If your application requires strong consistency, implement polling mechanisms using the GetKey API to verify that changes have been synchronized before proceeding with key operations.

Logging and monitoring

Payment Cryptography logs API activity through AWS CloudTrail, which now includes new events and attributes specific to Multi-Region key replication.

New CloudTrail event

The service logs a new event type called SynchronizeMultiRegionKey, which appears in primary and replica Regions.

Primary Region events:

Two SynchronizeMultiRegionKey events are logged in the primary Region for each replication Region defined:

One event related to a key export process.

{
    "eventVersion": "1.11",
    "userIdentity": {
        "accountId": "111122223333",
        "invokedBy": "payment-cryptography.amazonaws.com"
    },
    "eventTime": "2025-08-21T18:25:56Z",
    "eventSource": "payment-cryptography.amazonaws.com",
    "eventName": "SynchronizeMultiRegionKey",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "payment-cryptography.amazonaws.com",
    "userAgent": "payment-cryptography.amazonaws.com",
    "requestParameters": null,
    "responseElements": null,
    "eventID": "fbae27f1-f2ad-49d1-ab05-d460b0b4ca25",
    "readOnly": false,
    "eventType": "AwsServiceEvent",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "serviceEventDetails": {
        "keyArn": "arn:aws:payment-cryptography:us-east-1:111122223333:key/qs6643jl4ohibtqk",
        "replicationRegion": "us-west-2",
        "replicationType": "ExportKeyReplica"
    },
    "eventCategory": "Management"
}

One event related to a key import process.

{
    "eventVersion": "1.11",
    "userIdentity": {
        "accountId": "111122223333",
        "invokedBy": "payment-cryptography.amazonaws.com"
    },
    "eventTime": "2025-08-21T18:25:56Z",
    "eventSource": "payment-cryptography.amazonaws.com",
    "eventName": "SynchronizeMultiRegionKey",
    "awsRegion": "us-east-1",
    "sourceIPAddress": "payment-cryptography.amazonaws.com",
    "userAgent": "payment-cryptography.amazonaws.com",
    "requestParameters": null,
    "responseElements": null,
    "eventID": "5c06716f-88ea-4315-b633-5dde83d7232c",
    "readOnly": false,
    "eventType": "AwsServiceEvent",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "serviceEventDetails": {
        "keyArn": "arn:aws:payment-cryptography:us-east-1:111122223333:key/qs6643jl4ohibtqk",
        "replicationRegion": "us-west-2",
        "replicationType": "ImportKeyReplica"
    },
    "eventCategory": "Management"
}

Replica Region events:

One SynchronizeMultiRegionKey event is logged as an import key process in each replicated Region.

{
    "eventVersion": "1.11",
    "userIdentity": {
        "accountId": "111122223333",
        "invokedBy": "payment-cryptography.amazonaws.com"
    },
    "eventTime": "2025-08-21T18:25:56Z",
    "eventSource": "payment-cryptography.amazonaws.com",
    "eventName": "SynchronizeMultiRegionKey",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "payment-cryptography.amazonaws.com",
    "userAgent": "payment-cryptography.amazonaws.com",
    "requestParameters": null,
    "responseElements": null,
    "eventID": "0a952017-dd89-435e-8959-5de7b43c86d5",
    "readOnly": false,
    "eventType": "AwsServiceEvent",
    "managementEvent": true,
    "recipientAccountId": "111122223333",
    "serviceEventDetails": {
        "keyArn": "arn:aws:payment-cryptography:us-west-2:111122223333:key/qs6643jl4ohibtqk",
        "replicationRegion": "us-west-2",
        "replicationType": "ImportKeyReplica"
    },
    "eventCategory": "Management"
}

New CloudTrail event attributes

New attributes were included in the service key management APIs. The following are examples of the CreateKey API highlighting the new attributes.

One CreateKey event in the primary Region:

{
    "eventVersion": "1.11",
...
    "eventTime": "2025-08-21T18:25:54Z",
    "eventSource": "payment-cryptography.amazonaws.com",
    "eventName": "CreateKey",
    "awsRegion": "us-east-1",
...
    "requestParameters": {
        "keyAttributes": {
            "keyUsage": "TR31_C0_CARD_VERIFICATION_KEY",
            "keyClass": "SYMMETRIC_KEY",
            "keyAlgorithm": "TDES_2KEY",
            "keyModesOfUse": {
                "encrypt": false,
                "decrypt": false,
                "wrap": false,
                "unwrap": false,
                "generate": true,
                "sign": false,
                "verify": true,
                "deriveKey": false,
                "noRestrictions": false
            }
        },
        "exportable": true,
        "replicationRegions": [
            "us-west-2"
        ]
    },
    "responseElements": {
        "key": {
            "keyArn": "arn:aws:payment-cryptography:us-east-1:111122223333:key/qs6643jl4ohibtqk",
            "keyAttributes": {
                "keyUsage": "TR31_C0_CARD_VERIFICATION_KEY",
                "keyClass": "SYMMETRIC_KEY",
                "keyAlgorithm": "TDES_2KEY",
                "keyModesOfUse": {
                    "encrypt": false,
                    "decrypt": false,
                    "wrap": false,
                    "unwrap": false,
                    "generate": true,
                    "sign": false,
                    "verify": true,
                    "deriveKey": false,
                    "noRestrictions": false
                }
            },
            "keyCheckValue": "CC5EE2",
            "keyCheckValueAlgorithm": "ANSI_X9_24",
            "enabled": true,
            "exportable": true,
            "keyState": "CREATE_COMPLETE",
            "keyOrigin": "AWS_PAYMENT_CRYPTOGRAPHY",
            "createTimestamp": "Aug 21, 2025, 6:25:54 PM",
            "usageStartTimestamp": "Aug 21, 2025, 6:25:54 PM",
            "multiRegionKeyType": "PRIMARY",
            "replicationStatus": {
                "us-west-2": {
                    "status": "IN_PROGRESS"
                }
            },
            "usingDefaultReplicationRegions": false
        }
    },
...
}

One CreateKey event in a replica Region:

{
    "eventVersion": "1.11",
    "userIdentity": {
...
        "invokedBy": "payment-cryptography.amazonaws.com"
    },
    "eventTime": "2025-08-21T18:25:54Z",
    "eventSource": "payment-cryptography.amazonaws.com",
    "eventName": "CreateKey",
    "awsRegion": "us-west-2",
    "sourceIPAddress": "payment-cryptography.amazonaws.com",
    "userAgent": "payment-cryptography.amazonaws.com",
    "requestParameters": {
        "keyAttributes": {
            "keyUsage": "TR31_C0_CARD_VERIFICATION_KEY",
            "keyClass": "SYMMETRIC_KEY",
            "keyAlgorithm": "TDES_2KEY",
            "keyModesOfUse": {
                "encrypt": false,
                "decrypt": false,
                "wrap": false,
                "unwrap": false,
                "generate": true,
                "sign": false,
                "verify": true,
                "deriveKey": false,
                "noRestrictions": false
            }
        },
        "exportable": true,
        "enabled": true
    },
    "responseElements": {
        "key": {
            "keyArn": "arn:aws:payment-cryptography:us-west-2:111122223333:key/qs6643jl4ohibtqk",
            "keyAttributes": {
                "keyUsage": "TR31_C0_CARD_VERIFICATION_KEY",
                "keyClass": "SYMMETRIC_KEY",
                "keyAlgorithm": "TDES_2KEY",
                "keyModesOfUse": {
                    "encrypt": false,
                    "decrypt": false,
                    "wrap": false,
                    "unwrap": false,
                    "generate": true,
                    "sign": false,
                    "verify": true,
                    "deriveKey": false,
                    "noRestrictions": false
                }
            },
            "keyCheckValue": "CC5EE2",
            "keyCheckValueAlgorithm": "ANSI_X9_24",
            "enabled": true,
            "exportable": true,
            "keyState": "CREATE_COMPLETE",
            "keyOrigin": "AWS_PAYMENT_CRYPTOGRAPHY",
            "usageStartTimestamp": "Aug 21, 2025, 6:25:54 PM"
        }
    },
...
}

Getting started

To start using Multi-Region key replication in Payment Cryptography:

  1. Determine your primary Region.
  2. Determine your replica Regions and if you will use account-level or key-level configuration.
  3. Create new exportable symmetric keys or update existing keys to use the Multi-Region key replication feature.
  4. Update your applications to use the consistent key IDs across Regions.

Conclusion

The new Multi-Region key replication feature in Payment Cryptography enhances our automatic key replication capabilities, providing improved resilience and simplified management for global payment applications. This feature helps make sure your payment cryptography keys are available when and where you need them, with the flexibility to choose between account-level or key-level replication strategies.

For more information about AWS Payment Cryptography, visit https://aws.amazon.com/payment-cryptography/.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.
 

Ruy Cavalcanti
Ruy Cavalcanti

Ruy is a Senior Security Architect for the Latin American financial industry at AWS. He has worked in IT and Security for over 19 years, helping customers create secure architectures and solve data protection and compliance challenges. When he’s not architecting secure solutions, he enjoys jamming on his guitar, cooking Brazilian-style barbecue, and spending time with his family and friends.
Mark Cline
Mark Cline

Mark is a Principal Product Manager in AWS Payments, where he brings over 15 years of financial services experience across a variety of use cases and disciplines. He works with leading banks, financial institutions, and technology providers to alleviate heavy lifting in the payment system, allowing customers to focus on innovation. When he’s not simplifying payments, you can find him coaching little league or out on a run.

The Truth About Cloud Security Costs: Why High Costs Don’t Always Mean Better Protection

Post Syndicated from Kari Rivas original https://www.backblaze.com/blog/the-truth-about-cloud-security-costs-why-high-costs-dont-always-mean-better-protection/

A decorative image showing a shield and gears.

When evaluating cloud providers, cost is often the most visible factor—but in enterprise IT, information security (InfoSec), and compliance, security is always the first (and likely most important) concern. As a technology leader, you know that determining “acceptable” risk is a moving target, but you’re likely also regularly squeezed by budget pressures and a mandate to contribute to the company’s bottom line. 

Taking a chance on providers with lower price tags might feel like too big of a risk—lower-cost providers must be sacrificing something, and all too often, that something is security. Right?

It’s a fair question, but the answer might surprise you. Today, we’re talking about how specialized cloud providers provide surprising value—and even provide security benefits—when compared with traditional, hyperscaler architectures. Let’s talk about what you need to know to evaluate a cloud provider’s security posture.

Want to hear from the experts?

Join our upcoming session to hear from Backblaze experts Troy Liljedahl, Sr. Director, Solutions Engineering, and Pat Patterson, Chief Technical Evangelist, about the knowledge and features you need to stay ahead of modern threats.

Join us to learn:

  • Foundational controls: Master the best practices for using encryption, Object Lock, access keys, role-based access controls, and more to build a solid defense.
  • Advanced threat detection: Get an exclusive look at Backblaze’s new feature, Anomaly Alerts, which helps detect irregular and potentially suspicious data access patterns.
  • A unified approach: Understand how to integrate these powerful features to create a strong, easy-to-manage security strategy.

Ask the Experts

How specialized cloud providers provide security benefits

In theory, cloud architecture encourages redundancy. But in practice, many companies—even those using multi-cloud strategies—tend to consolidate key services like authentication and orchestration with a single vendor. When that vendor’s services go down, it doesn’t matter that your data is replicated across three availability zones in the same data center. If you can’t log in to access it, your redundancy becomes purely theoretical. This year alone, there have been major outages that had widespread consequences from the likes of Google, IBM cloud, and others.

Specialized cloud providers and multi-cloud strategies provide inherent benefits here.

  • Vendor transparency: Open cloud providers publish clear, detailed practices around architecture, encryption, and compliance rather than burying them behind opaque marketing claims. This transparency allows your teams to independently validate security assurances.
  • Avoiding lock-in: Multi-cloud strategies ensure you’re not beholden to a single vendor’s security practices. If one provider falls short, data replication and redundancy across platforms can maintain both compliance and resilience.
  • Risk distribution: By spreading workloads across providers, organizations mitigate the risk of a single point of failure, outage, or vendor breach.
  • Compliance flexibility: Different providers may align more strongly with specific frameworks (SOC 2, HIPAA, GDPR, etc.), giving enterprises options to meet evolving regulatory demands.

This means that organizations don’t have to choose between cost efficiency and security—they can and should get both.

How to evaluate a cloud provider’s security posture

Choosing the right cloud provider isn’t just about price, features, or performance—it’s about knowing they can safeguard your data and prove it. Here are key areas to assess:

  1. Architecture & physical security
    • Does the provider operate its own infrastructure or rely on generic colocation facilities?
    • What physical safeguards (biometrics, restricted access, surveillance) protect the data centers?
  2. Encryption & data protection
    • Is data encrypted both in transit (TLS/SSL) and at rest (AES-256 or equivalent)?
    • Are key management options available, including customer-managed keys?
    • Is immutability (Object Lock or write once, read many (WORM) storage) supported for ransomware defense?
  3. Access & identity controls
    • Are granular permissioning and role-based access (RBAC) controls available?
    • Does the provider support single sign on (SSO), multi-factor authentication (MFA), and integration with enterprise identity systems?
    • Can admins maintain clear audit logs of all access and changes?
  4. Compliance & certifications
    • Which third-party attestations does the provider maintain (SOC 2, HIPAA, PCI-DSS, GDPR, ISO)?
    • Can they provide signed agreements (such as Business Associate Agreements (BAAs)) as needed for regulated industries?
  5. Resilience & multi-cloud strategy
    • Do they offer replication across regions or the ability to integrate into a multi-cloud strategy?
    • How quickly can you move workloads or data out if you need to change vendors or access data in case of emergency?

By using this evaluation framework, IT leaders can look past marketing promises and price tags, focusing on verifiable controls and independent certifications.

The hyperscaler tax for cloud security

Many enterprises assume that higher cloud storage costs from hyperscalers like AWS, Azure, or Google Cloud translate directly into better security. In reality, much of that premium is a “hyperscaler tax” driven by complex business models, bundled services, and legacy infrastructure—not inherently superior protection. Specialized cloud providers can often deliver the same enterprise-grade security controls—encryption, compliance certifications, access management—without the inflated price tag, proving that security and affordability are not mutually exclusive.

Building a better mousetrap: The innovation behind Backblaze B2

From the beginning, Backblaze has architected its storage solution to be both performant and cost-effective. And, by specializing in storage (as opposed to the myriad of solutions offered by, say, Amazon Web Services and other hyperscalers), Backblaze is able to optimize for the economics of storage and storage alone.

To help you get past the price tag and into the technical details, let’s break down the pillars of Backblaze B2 security and compliance.

Compliance? We’ve got a visual for that.

Want a quick glance on how Backblaze compares to other cloud storage providers on key security and compliance elements? Check out our comparison matrices.

Architecture and physical security: The foundation of trust

Our security starts with our physical infrastructure. Our data centers are designed for 11 nines of data durability and are staffed 24/7/365. They feature:

  • Best-in-class security features: Biometric security, ID checks, and multi-layered access controls.
  • A purpose-built infrastructure: From Backblaze Storage Pods to projects like Shard Stash and ongoing feature releases, the Backblaze platform is designed for maximum data durability and security.

This physical and architectural security is the bedrock of our service, and it’s backed by industry-standard certifications like SOC 2 Type 2 certification.

Data storage security: Protecting data at rest and in transit

Data security is a core tenet of our platform. From the moment your data leaves your system until it is stored on our pods, it is protected by multiple layers of encryption.

  • Encryption in transit: All files are transmitted to Backblaze B2 using an encrypted TLS connection.
  • Encryption at rest: Your data is encrypted before it is stored on disk. We offer two options for server-side encryption with 256-bit Advanced Encryption Standard (AES-256):
    • Server-side encryption (SSE) with Backblaze managed keys (SSE-B2): We handle the key management for you, providing seamless, built-in protection.
    • SSE with customer managed keys (SSE-C): For organizations with strict compliance requirements, you can manage your own keys, giving you complete control over your data’s access.
  • Object Lock for immutability: Our Object Lock feature provides a powerful layer of ransomware protection. Using a write-once, read-many (WORM) model, it prevents files from being modified, manipulated, or deleted for a customer-determined retention period. This is an essential tool for compliance and disaster recovery.
  • Cloud Replication: For businesses with high-availability or geographical redundancy requirements, Backblaze B2 supports automatic replication of data across different regions, ensuring your data is always available and safe from regional outages or other incidents.

Access management security: Granting control and ensuring accountability

Controlling who can access your data is paramount. We provide granular, enterprise-grade access management controls that give you full command over your storage:

  • Fine-grained API key control: Create and manage accounts, groups, and specific data access permissions with robust API key control.
  • Multi-factor authentication (MFA) & single sign-on (SSO): We offer multiple account authentication options, including MFA and SSO via providers like Google Workspace and Office 365, to prevent unauthorized access.
  • Comprehensive logging: Backblaze provides detailed logs and reports on all activities within your account, so you can maintain a clear audit trail.

Compliance: Demonstrating our commitment to best practices

Security is not just a feature; it’s a commitment that’s verified by independent third parties. Backblaze has achieved a number of security and compliance attestations, including:

  • SOC 2, Type 2: We have been independently audited and certified for SOC 2, Type 2 compliance, demonstrating our commitment to protecting customer data.
  • HIPAA: For business customers who are Covered Entities under the Health Insurance Portability and Accountability Act (HIPAA), we can provide a Business Associate Agreement (BAA) upon request.
  • PCI-DSS: Backblaze’s adherence to the Payment Card Industry Data Security Standard (PCI-DSS) is supported by our use of Stripe to handle card information and our internal security controls.
  • GDPR: We adhere to General Data Protection Regulation (GDPR) privacy policies and provide Data Processing Agreement Addendums (DPAs) for EEA/EU and UK residents.

While some competitors may also offer these certifications, Backblaze’s pricing model is built to ensure you don’t have to pay a premium for them. Our efficiencies mean that we can pass the savings directly to you without compromising on the security and compliance that your business demands.

Specialized cloud storage: Enabling enterprises to evaluate their best options

In the end, our goal is to free you from the false choice between security and affordability. The reality is that the high cost of some cloud providers is a result of their complex, multi-tiered business models—not a reflection of superior security. Backblaze’s commitment to building a focused, innovative, and transparent cloud storage solution allows us to deliver on our promise: enterprise-grade security and compliance, at a fraction of the cost.

The post The Truth About Cloud Security Costs: Why High Costs Don’t Always Mean Better Protection appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

[$] Comparing Rust to Carbon

Post Syndicated from daroc original https://lwn.net/Articles/1036912/

Safe, ergonomic interoperability between Rust and C/C++ was a popular topic at

RustConf 2025
in Seattle, Washington. Chandler Carruth gave a presentation
about the different approaches to interoperability in Rust and

Carbon
, the
experimental “(C++)++” language.
His ultimate conclusion was that
while Rust’s ability to interface with other languages is expanding over time,
it wouldn’t offer a complete solution to C++ interoperability anytime soon — and so there is room for
Carbon to take a different approach to incrementally upgrading existing C++ projects.
His

slides
are available for readers wishing to study his example code in more
detail.

OSPAR 2025 report now available with 170 services in scope based on the newly enhanced OSPAR v2.0 guidelines

Post Syndicated from Joseph Goh original https://aws.amazon.com/blogs/security/ospar-2025-report-now-available-with-170-services-in-scope-based-on-the-newly-enhanced-ospar-v2-0-guidelines/

We’re pleased to announce the completion of our annual AWS Outsourced Service Provider’s Audit Report (OSPAR) audit cycle on August 7, 2025, based on the newly enhanced version 2.0 guidelines (OSPAR v2.0). AWS is the first global cloud service provider in Singapore to obtain the report using the new OSPAR v2.0 guidelines.

The Association of Banks in Singapore (ABS) established the Guidelines on Control Objectives and Procedures for Outsourced Service Providers (ABS Guidelines) to provide baseline controls criteria that outsourced service providers (OSPs) operating in Singapore should have in place. ABS enhanced the ABS Guidelines to version 2.0, which OSPs—such as AWS—need to comply with for the audit period commencing on or after January 1, 2025. The enhanced ABS Guidelines integrate key elements from the Monetary Authority of Singapore (MAS) regulatory updates on cyber hygiene, technology risk management, and business continuity management, and include new control domains such as data security, cryptography, software application development and management, and business continuity management.

The 2025 OSPAR certification cycle includes the addition of seven new services in scope, bringing the total number of services in scope to 170 in the AWS Asia Pacific (Singapore) Region. Newly added services in scope include the following:

Successfully completing the OSPAR assessment demonstrates that AWS continues to maintain a robust system of controls to meet these guidelines. This underscores our commitment to fulfill the security expectations for cloud service providers set by the financial services industry in Singapore.Customers can use OSPAR to streamline their due diligence processes, thereby reducing the effort and costs associated with compliance. OSPAR remains a core assurance program for our financial services customers because it is closely aligned with local regulatory requirements from MAS.

You can download the latest OSPAR report from AWS Artifact, a self-service portal for on-demand access to AWS compliance reports. Sign in to AWS Artifact in the AWS Management Console, or learn more at Getting Started with AWS Artifact. The list of services in scope for OSPAR is available in the report, and is also available on the AWS Services in Scope by Compliance Program webpage.

As always, we’re committed to bringing new services into the scope of our OSPAR program based on your architectural and regulatory needs. If you have questions about the OSPAR report, contact your AWS account team.

If you have feedback about this post, submit comments in the Comments section below.

Joseph Goh

Joseph Goh
Joseph is the APJ ASEAN Lead at AWS, based in Singapore. He leads security audits, certifications, and compliance programs across the Asia Pacific region. Joseph is passionate about delivering programs that build trust with customers and providing them assurance on cloud security.

Join the UK Bebras Challenge 2025

Post Syndicated from Andrew Csizmadia original https://www.raspberrypi.org/blog/join-the-uk-bebras-challenge-2025/

Graphic showcasing this year's UK Bebras Challenge dates "10th - 21st November 2025".

The UK Bebras Challenge, the nation’s largest computing competition, is back! Schools can enter now for this year’s Challenge, which runs from 10 to 21 November.

Last year, more than 467,000 students from across the UK took part, tackling fun and thought-provoking puzzles that introduce key ideas in computational thinking with no extra preparation needed.

Read on to learn how your school can get involved.

What is the UK Bebras Challenge?

The UK Bebras Challenge is a free-to-enter annual challenge that is designed to spark interest in both computational thinking and computer science among students aged 6 to 19. The 45-minute challenge is accessible to everyone, offering age-appropriate but challenging interactive tasks for students at different levels, including a tailored version for students with severe sight impairments. 

The tasks are designed to give every student the opportunity to showcase their potential and all participating students receive a certificate. There are also certificates based on performance within school and gold certificates based on national boundaries. With self-marking tasks and no text-based programming required, it’s easy to have your school participate in the UK Bebras Challenge.

This year, each task’s background section is linked to the Ada Computer Science platform. Teachers and students can now explore detailed explanations of the computing concepts behind the Bebras tasks, along with the computational thinking skills students may use to solve them.

“Bebras provides us with incredibly useful insights and conversations about our students who might otherwise have struggled with accessing traditional CS materials. It’s also a great resource for baseline tests where students may not have studied the curriculum the way we present it.” – Teacher in the UK

“I think the problems are a really great way to promote critical thinking, computational thinking, and logic in general. I have only seen a couple but I am very impressed and excited to share with some of my students. Thanks for what you are doing.” – Teacher in the UK

What does a UK Bebras task look like?

The tasks are inspired by classic computing problems and presented in a fun, age-appropriate way. For example, younger students aged 6 to 8 might solve the mystery of a dancing doll, while older students aged 16 to 19 could take on the challenge of an art theft. Both of these tasks involve the use of data structures.

Here’s a question we ran in 2021 for the Juniors group (ages 10 to 12). Can you solve it? 

Beaver farmer

Farmer Mert grows wheat in the fields, which contain a symbol in the map below.

He also has stony fields where nothing grows, shown by the  symbol.

To save water, Mert only wants to water the wheat fields. He can block the water channels coming from the lake at the spots marked with the letters A to J.

The water will only flow downwards towards the fields and will never flow back towards the lake.

Task

Select the letters to block the water from flowing to the empty fields while still letting it flow to the wheat fields.

Example image from a task 'Beaver Farmer' featured in the UK Bebras Challenge.

Example answer

The correct answer is that spots labelled B, F, H, and J must be closed, as shown in the following figure:

Example answer image from a task 'Beaver Farmer' featured in the UK Bebras Challenge.

Explanation

If none of the gates are closed, the water will reach the stony fields. If more gates than shown above are closed, then some wheat fields will get no water. If we review all spots:

  • A must be open to allow water to flow to field 1.
  • B has to be closed to avoid watering field 2. It also brings water to field 3, but this field can also get it through C.
  • C must be open both for field 4 and for field 3, which cannot be watered though A since B is closed.
  • D must be open for fields 7 and 8.
  • E must also be open field 7.
  • F must be closed to prevent water flowing to field 5.
  • G, if closed, would only prevent field 3 from being watered and so must be open as well.
  • H must also be closed even if F is also closed, as water can flow from the open D and E.
  • I must be open for field 11.
  • J has to be closed to prevent fields 9 and 10 from getting watered.

This Bebras task was developed by the Bebras team in Turkey and refined by members of the international Bebras community.

Did you get it right?

How do I get my school involved?

If you are either a UK school or teach a UK-based curriculum, then visit the UK Bebras website for more information and to register your school. 

Once you’ve registered, you’ll get access to the entire UK Bebras back catalogue of questions, allowing you to create custom quizzes for your students to tackle at any time throughout the year. These quizzes are self-marking, and you can download your students’ results to keep track of their progress. The questions are perfect for enrichment activities, end-of-term quizzes, lesson starters, and even full lessons to develop computational thinking skills and promote computing concepts.

Join for free at bebras.uk/admin.

The post Join the UK Bebras Challenge 2025 appeared first on Raspberry Pi Foundation.

Another npm supply-chain attack

Post Syndicated from corbet original https://lwn.net/Articles/1038326/

The Socket.dev blog describes
this week’s attack
on JavaScript packages in the npm repository.

A malicious update to @ctrl/tinycolor (2.2M weekly
downloads) was detected on npm as part of a broader supply chain
attack that impacted more than 40 packages spanning multiple
maintainers.

The compromised versions include a function
(NpmModule.updatePackage) that downloads a package
tarball, modifies package.json, injects a local script
(bundle.js), repacks the archive, and republishes it,
enabling automatic trojanization of downstream packages.

Кого конкретно обслужва законопроектът на ИТН в останалата част на София?

Post Syndicated from Боян Юруков original https://yurukov.net/blog/2025/itn-zakonoproekt2/

Миналата седмица писах за законопроект на ИТН, който цели да вдигне възможната височина на строителство в София на места на 125 метра. Законопроектът е едно изречение и въвежда много странно изключение (на изключението) за имоти по Цариградско шосе. Всъщност, става дума за доста конкретни имоти и така формулирано предложението ме накара да се зачудя кои са те. Именно тях показах в предишната си статия. Както обещах, разгледах и останалите из София, макар там височината да се вдига „само“ с 25%.

За подробно описание какво предлага предложението на тримата депутати ще намерите в предишната ми статия. Най-общо обаче текстът, който искат да променят сега позволява 100 метра височина в Смф зоните (в сиво) на 400 метра от метростанции (в зелено) изключвайки централната част и четири района (в червено). Тук съм добавил бъдещите метростанции и съм махнал тези изцяло в изключените райони. Те вдигат 100 на 125 и добавят няколко имота в Младост по Цариградско. Ако ви изглежда крайно специфично, това е защото е. Практическата липса на аргументация, това че тримата депутати нямат нищо общо със София и как е формулиран законопроектът ме накара да се разровя.

Първо, ще добавя двете карти, които направих. Те са по подобие на другите ми за държавните имоти за продажба от списъка на Желязков и 3D картата на застрояването на София. Те показват имотите отговарящи на горните критерии, които открих. Изключих общинските и държавните имоти, въпреки че, както се разбра, може да бъдат продадени без много шум. Изключих най-малките под 500 кв. макар би могло да се обединят с други по-големи, за да се застоят високо. Махнах метростанциите, около които няма Смф или не намерих имоти. Остават тези на картата. Възможно е да съм пропуснал някои предвид, че данните за собствеността в КАИС и iSofMap не са непременно актуални все още. Ако натиснете върху парцелите ще намерите последните данни за вид и предназначение според КАИС.

Тази карта показва същите данни, но само с парцелите в обемен вид. В някои от откритите парцели вече има построени сгради и повечето надали ще бъдат съборени, за да се възползва някой от новата височина. Тук виждате състоянието на застрояването до преди две години, тъй като тогава последно е 3D заснемането, което използва. Повече за това тук.

Отделно, размерът на имота и съседните има значение, защото има редица други изисквания като отстояние и разгърната площ. Те могат да се преценят при предложен проект, какъвто сега няма или нямам достъп за всеки от тези имоти. Затова маркирам само онези, които са засегнати по този един параметър обсъждан във внесеното предложение.

Може да отворите двете карти на цял екран тук и тук.

Засегнати райони из цяла София

Първо, в близост до горната част на Цариградско виждаме голям участък, който попада в хипотезата на този законопроект. Тези имоти се намират точно на границата на район Младост. Този най-вляво промени проекта си поне веднъж, а преди месец беше на комисия в НАГ с други предложения. Вдясно виждаме доста отделни сгради и към настоящия момент някои от празните парцели вече са запълнени. Получих дост сигнали за озеленяването там и колко е бутафорно.

Друг интересен парцел е този до районната община в Слатина. Той също попада в описаните критерии и е още на виза за проектиране. Има проект до 75 метра. Над това, както и при другите такива, трябва гласуване в СОС. Спорно е обаче колко има желание да се спират такива проекти, както и дали няма гласуване за подобни сгради да бъде из търгувано за подкрепа за други проекти, в парламента или просто за асансьори. Виждали сме и от трите. Не бих могъл да твърдя такова нещо за конкретния парцел. Давам просто пример за такава възможност след промените. Докато за 8 етажа повече може би не биха се занимавали, за 16 или 50 метра отгоре – биха.

Следващите имоти са малки, но показателни. Един от тях се вижда вече застроен – NV Tower. При нея имаше един интересен казус преди години – осветяваше целия квартал като дискотека нощем. После се оказа, че според отговор на ДНСК артистичното осветление беше незаконно и спряха. Съседният парцел е изцяло жилищна сграда в Смф (уж смесен) имот. Питайте Здравков защо. До тях се виждат няколко частни имота, които са отредени към този момент за многоетажен гараж. Попадат в хипотезата за 125 метра небостъргачи и предвид местоположението изглежда отстоянията им ще достатъчни.

На това кръстовище се вижда метростанцията Г.М. Димитров и това показва ключов проблем с този законопроект. Интересен факт тук е, че сградата отсреща на обсъжданите парцели е построена от инвеститорска компания с едноименна поправка предшестваща и наподобяваща обсъждания законопроект. Живущи в сградата на горни етажи се оплакват отдавна от неприятни вибрации и клатене когато минава метрото. При това говорим са само 50 метрова сграда. Този ефект ще е значително по-голям при по-високо строителство. От това следва, че би трябвало да се намалява етажността, а не да се уличава. Имало е, разбира се, ограничения за т.н. „вибрационна зона“ около метрото, както и изисквания за луфтове и конкретни елементи. За съжаление, според архитекти, доста от тези са отпаднали незнайно как в голяма част от града. Резултатът е видим, както казваше един партиен лозунг.

Продължаваме на юг с незастроени парцели до Студентски парк и Симеоновско шосе. Отново – на границата на районите с ограничения. Някои от тях са застроени, а други се подготвят за това и нямат започната процедура. Тук е полезно да се отбележи скандалната кула, която ще бъде забучена по очертанията на парцела на автокъщата в Студентски парк. Изискванията по ЗУТ и ЗУЗСО са специфично направени, за да са възможни такива опасни неща. А са опасни, защото тази отсечка от пътя е навярно най-опасната в Студентски град. Затова ще бъде отрязано още от парка, за да се направи нов булевард – специално за тази 60 метрова сграда. Очакваме същото за други подобни засегнати от този законопроект, независимо, че вносителите настояват, че промените им не пречат и не засягат никого и няма да има допълнителна тежест или разходи за общината.

Доколкото Черни връх е изключен от списъка, там има други проблеми, включително 215 метровата кула и стената от сгради по горната част от булеварда, някои от които според главния архитект незаконно са настроили етажи, но това е било прието от Здравков и ДНСК. Има обаче няколко парцела от другата страна, където сега има складове и един магазин Фантастико. Някой ден някой може да реши да инвестира там и тази промяна ги засяга.

На запад из Овча купел има няколко парцела, но ще обърна внимание на тези по булевард Цар Борис. Някои са застроени, а други – не. Това ще стане възможно с планираните спирки на метрото стигащи Южната дъга.

В северната част на София има доста имоти, особено до Надежда в индустриалните части. Правят впечатление обаче имотите до и зад гарата. Особени тези от северната част на линията. Там вече има застроени масивни комплекси, но ще има още много. Отстоянията няма да са им проблем и отново ще могат да стигнат 125 метра с малко машинации в СОС.

Новото метро донася този проблем този проблем и в квартал Левски. Някои от отбелязаните парцели са вече застроени, други не са и дори нямат виза за проектиране.

Какво следва?

Утре, 17-ти септември ще има обсъждане на законопроектът в комисия в парламента. Един от вносителите в заместник председател на ИТН в тази комисия. С липсата на аргументи на този проект не виждам как може да мине. В сегашната политическа среда и изтъргуване на лобистки поправки срещу подкрепа за други гласувания – като вето на недоверие – не изключвам да мине през гласуването доста бързо и необезпокоявано.

Ще следим какво ще се случи на комисията утре. Би трябвало да се излъчва на живо и ще разберем дали ще бъдат изнесени някакви аргументи и обяснена липсата на такива към законопроекта или ще бъде задушена всякаква дискусия и критика по темата.

The post Кого конкретно обслужва законопроектът на ИТН в останалата част на София? first appeared on Блогът на Юруков.

Security updates for Tuesday

Post Syndicated from corbet original https://lwn.net/Articles/1038325/

Security updates have been issued by AlmaLinux (kernel and kernel-rt), Debian (node-sha.js and python-django), Fedora (chromium, cups, exiv2, perl-Catalyst-Authentication-Credential-HTTP, perl-Catalyst-Plugin-Session, perl-Plack-Middleware-Session, and qemu), Red Hat (container-tools:rhel8, podman, and udisks2), SUSE (cargo-audit, cargo-c, cargo-packaging, and kernel-devel), and Ubuntu (libcpanel-json-xs-perl, libjson-xs-perl, rubygems, sqlite3, and vim).

Building HA Zabbix with PostgreSQL and Patroni

Post Syndicated from Patrik Uytterhoeven original https://blog.zabbix.com/building-ha-zabbix-with-postgresql-and-patroni/30960/

Running a monitoring platform like Zabbix in a production environment demands reliability and resilience. When your monitoring solution is down, you’re flying blind – and for many organizations, that simply isn’t acceptable. This post introduces a robust high-availability (HA) architecture for Zabbix, using PostgreSQL,  Patroni, etcd, HAProxy, keepalived and PgBackRest. Built on RHEL 9 or derrivates, this solution combines modern open-source tools to provide automatic failover, load balancing, and seamless monitoring, all while maintaining consistency and performance.

Architecture overview

The HA design consists of multiple layers working in tandem to maintain continuity even during node or service failures:

Database Cluster Layer

2 or more nodes form the PostgreSQL cluster, managed by Patroni and coordinated using etcd. At any given time, one node is the primary (read/write), and the others are hot standbys ready to take over automatically.

Consensus layer

etcd runs on the same nodes and acts as the distributed configuration store and coordination layer for Patroni. It ensures a consistent cluster state and enables safe failover decisions.

Load balancing layer  

Two HAProxy nodes provide a single point of entry for all clients (including Zabbix), routing requests to the current PostgreSQL primary. These nodes are monitored and coordinated via Keepalived to maintain a floating Virtual IP (VIP), ensuring seamless failover at the connection layer.

Backup layer

A separate backup server is responsible for running PgBackRest, which handles full and incremental backups, WAL archiving, and Point-In-Time Recovery (PITR). This server communicates securely with all database nodes over SSH.

Monitoring layer

Two Zabbix servers, running in active-passive mode, continuously monitor all layers of this stack including the HAProxy health, Patroni cluster role, and etcd status by accessing the PostgreSQL VIP for backend connectivity.

This multi-tiered setup ensures that no single failure be it a database, load balancer, or monitoring server brings down the monitoring platform.

Why HA matters for Zabbix

Zabbix depends heavily on its PostgreSQL database backend. Every metric, trigger, event, and alert is stored there. If PostgreSQL becomes unavailable, even briefly, data loss or monitoring blind spots can occur. That’s why introducing HA at the database layer is a crucial step when scaling Zabbix for enterprise environments.

While Zabbix itself supports HA at the application level, this architecture ensures that the database backend is also fully fault-tolerant, using modern consensus-based clustering with automatic failover.

Component overview

To achieve HA, we bring together several specialized components, each fulfilling a critical role in the system:

PostgreSQL

The relational database engine used by Zabbix. In this example setup, it runs on three nodes, forming a cluster managed by Patroni.

Patroni

Patroni is the orchestrator for the PostgreSQL cluster. It monitors node health, manages replication, promotes standbys when needed, and ensures only one writable leader exists at any time. Patroni leverages a distributed consensus store in this case, etcd but other DCS’s are possible to coordinate decisions across the cluster.

etcd

etcd is a lightweight and highly available key-value store used by Patroni to maintain the cluster’s state. It stores leader election data, health statuses, and locks. We deploy it as a three-node cluster, co-located with the PostgreSQL nodes for convenience, though this setup can be scaled independently if needed as etcd is very latency prone.

HAProxy

To simplify application connectivity, HAProxy acts as a load balancer in front of the database cluster. It monitors the role of each node using Patroni’s REST API and routes connections to the active primary server. If the leader fails, HAProxy automatically reroutes traffic to the new primary.

Keepalived

Keepalived provides a floating virtual IP address (VIP) across the HAProxy nodes. This VIP allows client systems, such as the Zabbix frontend, to connect to a single stable IP even if one HAProxy node fails.

PgBackRest

To protect the data itself, we use PgBackRest for full and incremental backups, as well as Point-In-Time Recovery (PITR). A dedicated backup server is included to pull and store archive logs and backups securely via SSH.

Zabbix server

Finally, we run two Zabbix servers in active-passive mode. Both are configured to connect to the PostgreSQL cluster through the VIP exposed by HAProxy. The Zabbix frontend is deployed on both nodes as well, ensuring continued accessibility through the load-balanced setup.

Topology at a glance

Here’s a simplified view of the architecture:

  • 2 or more database nodes (PostgreSQL + Patroni + etcd)
  • Two HAProxy nodes, each configured with Keepalived to manage a floating virtual IP
  • One backup node for PgBackRest
  • Two Zabbix servers pointing to the PostgreSQL VIP

All systems are tied together with consistent hostname mappings, time synchronization (Chrony), and service monitoring.

Notes:

  • PgBackRest is directly connected to all three PostgreSQL nodes, allowing it to archive WAL segments and pull backups regardless of which node is primary.
  • This design enables full standby backups and supports Point-In-Time Recovery (PITR).
  • HAProxy ensures Zabbix always talks to the current primary node, while Patroni and etcd handle automatic failover and cluster state management.

Design rationale

This setup prioritizes resilience and self-healing. If any single component fails a database node, a load balancer, or even a monitoring server the system continues to function.

Using Patroni with etcd ensures that failovers are handled automatically, without human intervention. HAProxy ensures client traffic is always routed to the current primary, while Keepalived ensures that this routing layer itself is highly available.

We opted for PgBackRest over simple scripts or base backups because it provides not just efficient incremental backups, but also full WAL archiving and point-in-time recovery, which are invaluable for both disaster recovery and debugging.

Lastly, we chose to integrate Zabbix itself into this HA design, treating it not just as a application but as a fully resilient service able to monitor itself, so to speak.

Real-world considerations
  • Resource planning: While our nodes run comfortably, scaling this setup to heavy workloads requires careful tuning of memory, I/O, and PostgreSQL parameters.
  • etcd placement: Although we run etcd co-located with the database nodes in this example, separating etcd onto dedicated infrastructure is ideal for large-scale environments. This avoids resource contention and preserves quorum in extreme failure scenarios.
  • Monitoring the monitors: Zabbix itself must be monitored. In our setup, each component including etcd, Patroni, and PostgreSQL exposes health endpoints that can be used by Zabbix agents or scripts to generate alerts on replication lag, cluster health, and failover events.

Conclusion

This architecture provides a solid foundation for running Zabbix in a fault-tolerant, production-ready environment. It not only ensures high availability for the database layer but also offers flexibility, observability, and operational safety.

Whether you’re running internal infrastructure monitoring or offering Zabbix as a managed service, adopting this type of HA setup removes single points of failure and gives you peace of mind — all using open-source technologies that are battle-tested and widely supported.

If you need assistance with the migration or want to ensure best practices for scaling and optimizing Zabbix, don’t hesitate to reach out to OICTS. We are a Zabbix Premium Partner operating globally, with offices in the USAUKNetherlands, and Belgium, and we’re ready to help you every step of the way.

 

The post Building HA Zabbix with PostgreSQL and Patroni appeared first on Zabbix Blog.

Taming the monorepo beast: Our journey to a leaner, faster GitLab repo

Post Syndicated from Grab Tech original https://engineering.grab.com/taming-monorepo-beast

At Grab, our engineering teams rely on a massive Go monorepo that serves as the backbone for a large portion of our backend services. This repository has been our development foundation for over a decade, but age brought complexity, and size brought sluggishness. What was once a source of unified code became a bottleneck that was slowing down our developers and straining our infrastructure.

A primer on GitLab, Gitaly, and replication

To understand our core problem, it’s helpful to know how GitLab handles repositories at scale. GitLab uses Gitaly, its Git RPC service, to manage all Git operations. In a high-availability setup like ours, we use a Gitaly Cluster with multiple nodes.

Here’s how it works:

  • Write operations: A primary Gitaly node handles all write operations.
  • Replication: Data is replicated to secondary nodes.
  • Read operations: Secondary nodes handle read operations, such as clones and fetches, effectively distributing the load across the cluster.
  • Failover: If the primary node fails, a secondary node can take over.
    For the system to function effectively, replication must be nearly instantaneous. When secondary nodes experience significant delays syncing with the primary—a condition called replication lag—GitLab stops routing read requests to the secondary nodes to ensure data consistency. This forces all traffic back to the primary node, eliminating the benefits of our distributed setup. Figure 1 illustrates the replication architecture of Gitaly nodes.
Figure 1: The replication architecture of Gitaly nodes in a high-availability setup.

The scale of our problem

Our Go monorepo started as a simple repository 11 years ago but ballooned as Grab grew. A Git analysis using the git-sizer utility in early 2025 revealed the shocking scale:

  • 12.7 million commits accumulated over a decade.
  • 22.1 million Git trees consuming 73GB of metadata.
  • 5.16 million blob objects totaling 176GB.
  • 12 million references, mostly leftovers from automated processes.
  • 429,000 commits deep on some branches.
  • 444,000 files in the latest checkout.

This massive size wasn’t just a number—it was crippling our daily operations.

Infrastructure problems

Figure 2: Replication delays of up to four minutes during peak working hours.

In high-availability setups, replication is critical for distributing workloads and ensuring system reliability. However, when replication delays occur, they can severely impact infrastructure performance and create bottlenecks. Figure 2 illustrates replication delays of up to four minutes which caused both secondary nodes, Gitaly S1 (orange) and Gitaly S2 (blue), to lag behind the primary node, Gitaly P (green). As a result, all requests were routed exclusively to the primary node, creating significant performance challenges.

The key issues here are:

  • Single point of failure: Only one of our three Gitaly nodes could handle the load, creating a bottleneck.
  • Throttled throughput: The system limits the read capacity to just one-third of the cluster’s potential.

Developer experience issues

The growing size of the monorepo directly impacted developer workflows:

  • Slow clones: 8+ minutes even on fast networks.
  • Painful Git operations: Every commit, diff, and blame had to process millions of objects.
  • CI pipeline overhead: Repository cloning added up 5-8 minutes to every CI job.
  • Frustrated developers: “Why is this repo so slow?” became a common question.

Operational challenges

The repository’s scale introduced significant operational hurdles:

  • Storage issues: 250GB of Git data made backups and maintenance cumbersome.
  • GitLab UI timeouts: The web interface struggled to handle millions of commits and refs, frequently timing out.
  • Limited CI scalability: Adding more CI runners overloaded the single working node.

All these factors were dragging down developer productivity. It was clear that continuing to let the monorepo grow unchecked wasn’t sustainable. We needed to make the repository leaner and faster, without losing the important history that teams relied on.

Our solution journey

Proof of concept: Validating the theory

Before making any changes, we needed to answer a critical question: “Would trimming repository history solve our replication issues?” Without proof, committing to such a major change felt risky. So we set out to test the idea.

The test setup:

We designed a simple experiment. In our staging environment, we created two repositories:

  • Full history repository: This repository mirrored the original repository with full history.
  • Shallow history repository: This repository contained only a single commit history.

Both repositories contained the same number of files and directories. We then simulated production-like load on both of the repositories.

The results:

  • Full history repository: 160-240 seconds replication delay.
  • Shallow history repository: 1-2.5 seconds replication delay.

This was nearly a 100x improvement in replication performance.

This proof of concept gave us confidence that history trimming was the right approach and provided baseline performance expectations.

Content preservation strategies: What to keep

Initial strategy: Time-based approach (1-2 years)

Initially, we wanted to keep commits from the last 1-2 years and archive everything else, as this seemed like a reasonable balance between recent history and size reduction. However, when we developed our custom migration script, we discovered it could only process 100 commits per hour, approximately 2,400 commits per day. With millions of commits in the original repository, even keeping 1-2 years of history would take months.

  • We can only process ~100 commits per hour in batches of 20 to avoid memory limits on GitLab runners.
  • Each batch takes 2 minutes to process, but requires 10 minutes of cleanup (git gc, git reflog expire) to prevent local disk and memory exhaustion.
  • This means each batch takes 12 minutes, allowing only 5 batches per hour (60 ÷ 12 = 5), totaling to 100 commits per hour (5 × 20 = 100).
  • Larger batches increased cleanup time and skipping cleanup caused jobs to crash after 200-300 commits.

The bottleneck wasn’t just the number of commits, it was the 10-minute cleanup process.

Additional constraints discovered:

As we dug deeper, we discovered more obstacles.

  • Critical dependencies extended beyond two years. Some Go module tags from six years ago were still actively used.
  • A pure time-based cut would break existing build pipelines.
  • Development teams needed some recent history for troubleshooting and daily operations.

Revised strategy: Tag-based + recent history

Given the processing speed constraint of 100 commits per hour, we needed to drastically reduce the number of commits while preserving essential functionality. After careful evaluation, we settled on a tag-based approach combined with recent history.

What we decided to keep:

  • Critical tags: All commits reachable by 2,000+ identified tags, ensuring semantic importance for releases and dependencies.
  • Recent history: Complete commit history for the last month only addressing stakeholder needs within processing constraints.
  • Simplified merge commits: Converted complex merge commits into single commits to further reduce processing time.

Why this approach worked:

  • Time-feasible: Reduced processing time from months to weeks.
  • Functionally complete: Preserved all tagged releases and recent development context.
  • Stakeholder satisfaction: Met development teams’ need for recent history.
  • Massive size reduction: Achieved 99.9% fewer commits while keeping what matters.

The trade-off:

We sacrificed deep historical browsing of 1 to 2 years for practical migration feasibility, while ensuring no critical functionality was lost.

Technical implementation methods: How to execute

Method 1: git filter-repo (Failed)

The approach: Use Git’s filter-repo tool with git replace --graft to remove commits older than a specified criteria.

Why it failed:

  • Complex history: Our repository’s highly non-linear history, with multiple branches and merges, made this approach impractical.
  • Workflow complexity: The process required numerous git replace --graft commands to account for various branches and dependencies, significantly complicating the workflow.
  • Risk of inconsistencies: The complexity introduced a high risk of errors and inconsistencies, making this method unsuitable.

Method 2: git rebase –onto (Failed)

The approach: Use git rebase --onto to preserve selected commits while pruning unwanted history.

Why it failed:

  • Scale issues: The repository size overwhelmed the rebase process.
  • Conflict resolution: High number of unexpected conflicts that couldn’t be resolved automatically.
  • Technical limitations: Batch processing couldn’t solve the performance issues; Git’s internal mechanisms struggled with the scale.

Method 3: Patch-based implementation (Failed)

The approach: Create and apply patches for each commit individually to preserve repository history.

Why it failed:

  • Merge commit complexity: Couldn’t maintain correct parent-child relationships for merge commits.
  • History integrity: Resulted in linear sequence instead of preserving original merge structure.
  • Missing commits: Important merge commits were lost or incorrectly applied.

Method 4: Custom migration script (Success!)

The breakthrough: A sophisticated custom script that could handle our specific requirements and processing constraints. Unlike traditional Git history rewriting tools, our script implements a two-phase chronological processing approach that efficiently handles large-scale repositories.

Phase 1: Bulk migration

In this phase, the script focuses on reconstructing history based on critical tags.

  1. Fetch tags chronologically: Retrieve all tags in the order they were created.
  2. Pre-fetch Large File Storage (LFS) objects: Collect LFS objects for tag-related commits before processing.
  3. Batch processing: Process tags in batches of 20 to optimize memory and network usage. For each tag:
    • Check for associated LFS objects.
    • Perform selective LFS fetch if required.
    • Create a new commit using the original tree hash and metadata.
    • Embed the original commit hash in the commit message for traceability.
    • Gracefully handle LFS checkout failures.

Then, push the processed batch of 20 commits to the destination repository, with LFS tolerance.

  1. Cleanup and continue: Perform cleanup operations after each batch and proceed to the next.

Phase 2: Delta migration

This phase integrates recent commits after the cutoff date.

  1. Fetch recent commits: Retrieve all commits created after the cutoff date in chronological order.
  2. Batch processing: Process commits in batches of 20 for efficiency. For each commit:
    • Check for associated LFS objects.
    • Perform selective LFS fetch if required.
    • Recreate the commit with its original metadata.
    • Embed the original commit hash for resumption tracking in case of interruptions.
    • Gracefully handle LFS checkout failures.

Then, push the processed batch of commits to the destination repository, with LFS tolerance.

  1. Tag mapping: Map tags to their corresponding new commit hashes.
  2. Push tags: Push related tags pointing to the correct new commits.
  3. Final validation: Validate all LFS objects to ensure completeness.

LFS handling

The script incorporates robust mechanisms to handle Git LFS efficiently.

  • Configure LFS for incomplete pushes.
  • Skip LFS download errors when possible.
  • Retry checkout with LFS smudge skip.
  • Perform selective LFS object fetching.
  • Gracefully degrade processing for missing LFS objects.

Key features:

  • Sequential processing of tags and commits in chronological order.
  • Resumable operations that could restart from the last processed item if interrupted.
  • Batch processing to manage memory and network resources efficiently.
  • Robust error handling for network issues and Git complications.
  • Maintains repository integrity while simplifying complex merge structures.
  • Optimized for our specific preservation strategy (tags + recent history).

Implementation: Executing the migration

With our strategy defined (tags + last month), we executed the migration using our custom script. This process involved careful planning, smart processing techniques, and overcoming technical challenges.

Smart processing approach

Our custom script employed several key strategies to ensure efficient and reliable migration:

  • Sequential tag processing: Replay tags chronologically to maintain logical history.
  • Resumable operations: The migration could restart from the last processed item if interrupted.
  • Batch processing: Handle items in manageable groups to prevent resource exhaustion.
  • Progress tracking: Monitor processing rate and estimated completion time.

Technical challenges solved

The migration addressed several critical technical hurdles.

  • Large file support: Handled Git LFS objects with incomplete push allowances.
  • Error handling: Robust retry logic for network issues and Git errors.
  • Merge commit simplification: Converted complex merge structures to linear commits.

Two-phase migration strategy

The migration was executed in two carefully planned phases.

  • Phase 1 – Bulk migration: Migrated 95% of tags while keeping the old repo live.
  • Phase 2 – Delta migration: Performed final synchronization during a maintenance window to migrate recent changes.

Results and impact

Infrastructure transformation

Replication delay, or the time required to sync across all Gitaly nodes, improved by 99.4% following the pruning process. As illustrated in Figures 3 and 4, the new pruned monorepo achieves replication in under ~1.5 seconds on average, compared to ~240 seconds for the old repository. This transformation eliminated the previous single-node bottleneck, enabling read requests to be distributed evenly across all three storage nodes, significantly enhancing system reliability and performance.

Figure 3: In the new pruned monorepo, replication delay ranges from 200 – 2,000 ms.
Figure 4: In the old monorepo, replication delay ranged from 16,000 – 28,000 ms.

The migration significantly improved load distribution across Gitaly nodes. As shown in Figure 5, the new monorepo leverages all three Gitaly nodes to serve requests, effectively tripling read capacity. Additionally, the migration eliminated the single point of failure that existed in the old monorepo, ensuring greater reliability and scalability.

Figure 5: In the new monorepo, requests are evenly distributed across all three servers, demonstrating improved performance and replication across nodes.
Figure 6: In the old monorepo, requests were served only by a single server during working hours, creating a single point of failure.

Performance improvements

The migration resulted in significant improvements across multiple areas.

  • Clone time: Reduced from 7.9 minutes to 5.1 minutes, achieving a 36% improvement, making repository cloning faster and more efficient.
  • Commit count: Achieved a 99.9% reduction, trimming the repository from 13 million commits to just 15.8 thousand commits, drastically simplifying its structure.
  • References: Reduced by 99.9%, going from 12 million to 9.8 thousand refs, streamlining repository metadata.
  • Storage: Reduced by 59%, shrinking storage requirements from 214GB to 87GB, optimizing resource usage.

Developer experience

The migration also transformed the developer experience.

  • Faster Git operations: Commits, diffs, and history commands are noticeably snappier.
  • Responsive GitLab UI: Web interface no longer times out.
  • Scalable CI: The system can now safely run 3x more concurrent jobs.

The following table summarizes the key repository metrics, comparing the state of the repository before and after the migration:

Metric Old Monorepo New Monorepo Reduction
Commits ~13,000,000 ~15,800 −99.9% (histories squashed)
Git trees ~23,600,000 ~2,080,000 −91% (pruned)
Git references ~12,200,000 9,860 −99.9% (cleaned)
Blob storage 214 GiB 86.8 GiB −59% (smaller packs)
Files in checkout ~444,000 ~444,000 ~0% (no change)
Latest code size ~9.9 GiB ~8.4 GiB ~−15% (slightly leaner)

Key challenges and lessons learned

Such a large-scale migration wasn’t without its hiccups and lessons. Here are some challenges we faced and what we learned:

Git LFS woes

Initially, GitLab rejected some commits due to missing LFS objects, even old commits that we weren’t keeping. This happened because GitLab’s push hook expected the content of LFS pointers, even if the files weren’t required. To fix this, we had to allow incomplete pushes and skip LFS download errors. We also wrote logic to selectively fetch LFS objects for commits we were keeping. This ensured that any binary assets needed by tagged commits were present in the new repo. The takeaway is that LFS adds complexity to history rewrites – plan for it by adjusting Git LFS settings (e.g., lfs.allowincompletepush) and verifying important large files are carried over.

Pipeline token scoping

Right after the cutover, some CI pipelines failed to access resources. We discovered a GitLab CI/CD pipeline token issue – our new repo’s ID wasn’t in the allowed list for certain secure token scopes. We quickly updated the settings to include the new project, resolving the authorization error. If your CI jobs interact with other projects or use project-scoped tokens, remember to update those references when you migrate repositories.

Commit hash references broke

One of our internal tools was using commit SHA-1 hashes to track deployed versions. Since rewriting history means changing all commit hashes, the tool couldn’t find the expected commits. The solution was to map old hashes to new ones for the tagged releases, or better, to modify the tool to use tag names instead of raw hashes going forward. We learned to communicate early with teams that have any dependency on Git commit IDs or history assumptions. In our case, providing a mapping of old tag→new tag (which were mostly 1-to-1 except for the commit SHA) helped them adjust. In hindsight, using stable identifiers like semantic version tags, is much more robust than relying on commit hashes, which are ephemeral in a rewritten history.

Developer concerns: “Where’s my history?”

A few engineers were concerned when they noticed that the git log in the new repo only showed two years of history. From their perspective, useful historical context seemed gone. We addressed this by pointing them to the archived full-history repo. In fact, we kept the old repository read-only in our GitLab, so anyone can still search the old history if needed (just not in the main repo). Additionally, we received suggestions on making the archive easily accessible or even automate a way to query old commits on demand. From this we learned, if you prune history, ensure there’s a plan to access legacy information for those rare times it’s needed – whether that’s an archive repo, a Git bundle, or a read-only mirror.

Office network bottleneck

Interestingly, after the migration, a few developers in certain offices didn’t feel a huge speed improvement in clones. It turned out their corporate network/VPN was the limiting factor – cloning 8 GiB vs 10 GiB over a slow link is not a night and day difference. This highlighted that we should continue to work with the IT team on improving network performance. The repo is faster, but the environment matters too. We’re using this as an opportunity to improve our office VPN throughput so that the 36% clone improvement is realized by everyone, not just CI machines.

Automation and hardcoded IDs

We had a lot of automation around the monorepo (scripts, webhooks, integrations). Most of these referenced the project by name, which remained the same, so they were fine. However, a few used the project’s numeric ID in the GitLab API, which changed when we created a new repo. Those broke. We had to scan and update some configs to use the new project ID. Our learning here is to audit all external references such as CI configs, deploy scripts, and monitor jobs when migrating repositories. Ideally, use identifiable names instead of IDs, or ensure you’re prepared to update them during the cutover.

Adjusting to new boundaries

Some teams had to adjust their workflows after the prune. For instance, one team was in the habit of digging into 3 to 5 year old commit logs to debug issues. Post-migration, git log doesn’t go back that far in the main repo; they have to consult the archive for that. It’s a cultural shift to not have all history at your fingertips. We held a short information session to explain how to access the archived repo and emphasized the benefits (faster operations) that come with the lean history. After a while, teams embraced the new normal, appreciating the speed and rarely needing the older commits anyway.

In the end, we had zero data loss – all actual code and tags were preserved – and only some minor inconveniences that were resolved within a day or two. The challenges reinforced the importance of thorough testing (our staging dry-runs caught many issues) and cross-team communication when making such a change.

Impact and next steps

This migration transformed our development infrastructure from a bottleneck into a performance enabler. We eliminated the single point of failure, restored confidence in our Git operations, and created a foundation that can support our growing engineering team.

As the next step, we plan to generalize our pruning script to apply the same optimization techniques to other repositories, ensuring consistency and scalability across our infrastructure. Additionally, we will implement continuous performance monitoring to track repository health and proactively address any emerging issues. To prevent future repository bloat, we aim to establish clear best practices and guidelines, empowering teams to maintain efficiency while supporting the growth of our engineering operations.

Conclusion

What started as a performance crisis became one of our most successful infrastructure projects. By focusing on the right problems—infrastructure reliability and performance rather than just size—we achieved dramatic improvements that benefit every developer daily.

The key takeaway is that sometimes the biggest technical challenges require custom solutions, careful planning, and willingness to iterate until you find what works. Our 99% improvement in replication performance is just the beginning of what’s possible when you tackle infrastructure problems systematically.

This migration was completed by Grab Tech Infra DevTools team, involving months of analysis, custom tooling development, and careful production migration of critical infrastructure serving thousands of developers across multiple time zones.

Join us

Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people everyday to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line – we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

The collective thoughts of the interwebz