New general-purpose Amazon EC2 M8a instances are now available

Post Syndicated from Betty Zheng (郑予彬) original https://aws.amazon.com/blogs/aws/new-general-purpose-amazon-ec2-m8a-instances-are-now-available/

Today, we’re announcing the availability of Amazon Elastic Compute Cloud (Amazon EC2) M8a instances, the latest addition to the general-purpose M instance family. These instances are powered by the 5th Generation AMD EPYC (codename Turin) processors with a maximum frequency of 4.5GHz. Customers can expect up to 30% higher performance and up to 19% better price performance compared to M7a instances. They also provide higher memory bandwidth, improved networking and storage throughput, and flexible configuration options for a broad set of general-purpose workloads.

Improvements in M8a
M8a instances deliver up to 30% better performance per vCPU compared to M7a instances, making them ideal for applications that benefit from high performance and high throughput, such as financial applications, gaming, rendering, application servers, simulation modeling, midsize data stores, application development environments, and caching fleets.

They provide 45% more memory bandwidth compared to M7a instances, accelerating in-memory databases, distributed caches, and real-time analytics.

For workloads with high I/O requirements, M8a instances provide up to 75 Gbps of networking bandwidth and 60 Gbps of Amazon Elastic Block Store (Amazon EBS) bandwidth, a 50% improvement over the previous generation. These enhancements support modern applications that rely on rapid data transfer and low-latency network communication.

Each vCPU on an M8a instance corresponds to a physical CPU core, meaning there is no simultaneous multithreading (SMT). In application benchmarks, M8a instances delivered up to 60% faster performance for GroovyJVM and up to 39% faster performance for Cassandra compared to M7a instances.

M8a instances support instance bandwidth configuration (IBC), which provides flexibility to allocate resources between networking and EBS bandwidth. This gives customers the flexibility to scale network or EBS bandwidth by up to 25% and improve database performance, query processing, and logging speeds.

M8a is available in ten virtualized sizes and two bare metal options (metal-24xl and metal-48xl), providing deployment choices that scale from small applications to large enterprise workloads. All of these improvements are built on the AWS Nitro System, which delivers low virtualization overhead, consistent performance, and advanced security across all instance sizes. These instances are built using the latest sixth generation AWS Nitro Cards, which offload and accelerate I/O for functions, increasing overall system performance.

M8a instances feature sizes of up to 192 vCPUs with 768 GiB of RAM. Here are the detailed specs:

M8a          vCPUs   Memory (GiB)   Network bandwidth (Gbps)   EBS bandwidth (Gbps)
medium       1       4              Up to 12.5                 Up to 10
large        2       8              Up to 12.5                 Up to 10
xlarge       4       16             Up to 12.5                 Up to 10
2xlarge      8       32             Up to 15                   Up to 10
4xlarge      16      64             Up to 15                   Up to 10
8xlarge      32      128            15                         10
12xlarge     48      192            22.5                       15
16xlarge     64      256            30                         20
24xlarge     96      384            40                         30
48xlarge     192     768            75                         60
metal-24xl   96      384            40                         30
metal-48xl   192     768            75                         60

For a complete list of instance sizes and specifications, refer to the Amazon EC2 M8a instances page.
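As a small illustration of how the table can be used for sizing, here is a minimal Python sketch that returns the smallest virtualized M8a size meeting a vCPU and memory requirement. The sizes are copied from the table above; the helper name `smallest_m8a_size` and the `m8a.<size>` naming are assumptions made for this example, and the function ignores network and EBS bandwidth.

```python
# Virtualized sizes copied from the table above: (size, vCPUs, memory in GiB).
M8A_SIZES = [
    ("medium", 1, 4),
    ("large", 2, 8),
    ("xlarge", 4, 16),
    ("2xlarge", 8, 32),
    ("4xlarge", 16, 64),
    ("8xlarge", 32, 128),
    ("12xlarge", 48, 192),
    ("16xlarge", 64, 256),
    ("24xlarge", 96, 384),
    ("48xlarge", 192, 768),
]


def smallest_m8a_size(min_vcpus, min_memory_gib):
    """Return the smallest size that meets both minimums, or None if none does."""
    # The list is ordered from smallest to largest, so the first match wins.
    for size, vcpus, memory_gib in M8A_SIZES:
        if vcpus >= min_vcpus and memory_gib >= min_memory_gib:
            return f"m8a.{size}"
    return None


if __name__ == "__main__":
    print(smallest_m8a_size(6, 20))
```

Because every size in the table keeps the same 4 GiB of memory per vCPU, whichever of the two requirements is larger relative to that ratio decides the result.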

When to use M8a instances
M8a is a strong fit for general-purpose applications that need a balance of compute, memory, and networking. M8a instances are ideal for web and application hosting, microservices architectures, and databases where predictable performance and efficient scaling are important.

These instances are SAP certified and also well suited for enterprise workloads such as financial applications and enterprise resource planning (ERP) systems. They’re equally effective for in-memory caching and customer relationship management (CRM), in addition to development and test environments that require cost efficiency and flexibility. With this versatility, M8a supports a wide spectrum of workloads while helping customers improve price performance.

Now available
Amazon EC2 M8a instances are available today in the US East (Ohio), US West (Oregon), and Europe (Spain) AWS Regions. M8a instances can be purchased as On-Demand, Savings Plans, and Spot Instances. M8a instances are also available on Dedicated Hosts. For pricing details, visit the Amazon EC2 Pricing page.

To learn more, visit the Amazon EC2 M8a instances page and send feedback to AWS re:Post for EC2 or through your usual AWS support contacts.

Betty

StackSets Deployment Strategies: Balancing Speed, Safety, and Scale to Optimize Deployments for Different Organizational Needs

Post Syndicated from Amar Meriche original https://aws.amazon.com/blogs/devops/stacksets-deployment-strategies-balancing-speed-safety-and-scale-to-optimize-deployments-for-different-organizational-needs/

AWS CloudFormation StackSets enables organizations to deploy infrastructure consistently across multiple AWS accounts and regions. However, success depends on choosing the right deployment strategy that balances three critical factors: deployment speed, operational safety, and organizational scale. This guide explores proven StackSets deployment strategies specifically designed for multi-account infrastructure management.

Understanding StackSets Deployment Fundamentals

What are StackSets Actually Used For?

Unlike single-account AWS CloudFormation templates, StackSets are specifically designed for multi-account infrastructure governance. Common use cases include:

  • Security baselines: deploying IAM policies, security groups, and access controls across all accounts
  • Compliance controls: rolling out AWS Config rules, AWS CloudTrail configurations, and audit requirements
  • Organizational standards: establishing consistent VPC configurations, tagging policies, and naming conventions
  • Shared services: deploying monitoring solutions, logging infrastructure, and backup policies
  • Cost management: implementing budget controls, cost allocation tags, and resource optimization policies

The Multi-Account Challenge

Managing infrastructure across dozens or hundreds of AWS accounts presents unique challenges:

Single Account (CFN Template)     Multi-Account (StackSets)
      App A                           Org Unit A (50 accounts)
        |                                     |
   [Deploy Once]               [Deploy consistently across all]
        |                                     |
    Success/Fail                Complex success/failure matrix

Multi-account and multi-Region CloudFormation deployment complexity

The Speed-Safety-Scale Triangle

Every StackSets deployment strategy involves trade-offs: Speed (how quickly changes propagate across your organization), Safety (risk mitigation and failure containment) and Scale (ability to manage hundreds of accounts efficiently)

Prerequisites

Before implementing any of the deployment strategies described in this guide, ensure you have:

  1. AWS CLI Installation
    1. Install the latest version of AWS CLI by following the AWS CLI installation guide
    2. Verify installation with: aws --version
  2. AWS Profile Configuration
    1. Configure your AWS credentials using: aws configure
    2. For details on configuration, see AWS CLI configuration basics
    3. Ensure your profile has appropriate permissions for CloudFormation StackSets operations as described in AWS StackSets prerequisites
  3. Proper Account Access: the commands in this guide must be executed from either:
    1. The management account of your AWS Organization
    2. OR a delegated administrator account for CloudFormation

For information on setting up a delegated administrator, see Register a delegated administrator

Note: StackSets deployments using service-managed permissions cannot be performed from standalone accounts.

Verify you’re using the correct account with:

# For management account
aws organizations describe-organization
# For delegated admin
aws cloudformation list-stack-sets --call-as DELEGATED_ADMIN

AWS CLI to check the usage of an Organization and not a Standalone account

Core Deployment Strategies

As explained in the StackSet documentation:

  • “For a more conservative deployment, set Maximum Concurrent Accounts to 1, and Failure Tolerance to 0. Set your lowest-impact region to be first in the Region Order list. Start with one region.”
  • “For a faster deployment, increase the values of Maximum Concurrent Accounts and Failure Tolerance as needed.”

Based on this guidance, we propose several deployment strategies below, depending on the speed, safety, and scale you want to achieve.
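Before choosing values, it helps to see how the percentage-based preferences translate into account counts. The sketch below is a rough model based on our reading of the StackSetOperationPreferences API reference, not an official calculator: percentages are rounded down, the concurrent count never drops below one, and under the default strict failure tolerance mode the concurrency is capped at the failure tolerance plus one. The function name and the `strict` flag are assumptions made for this example, and the numbers apply per Region.

```python
def resolve_operation_preferences(target_accounts, max_concurrent_pct,
                                  failure_tolerance_pct, strict=True):
    """Estimate the per-Region account counts behind percentage preferences.

    Assumptions (our reading of the API reference, to be verified):
    - both percentages are rounded down to a whole number of accounts
    - the concurrent count is at least 1
    - in strict mode, concurrency is at most failure tolerance + 1
    """
    max_concurrent = max(1, target_accounts * max_concurrent_pct // 100)
    failure_tolerance = target_accounts * failure_tolerance_pct // 100
    if strict:
        max_concurrent = min(max_concurrent, failure_tolerance + 1)
    return {"max_concurrent": max_concurrent,
            "failure_tolerance": failure_tolerance}


if __name__ == "__main__":
    # 200 accounts with the three percentage pairs used in this guide.
    for pct_pair in [(5, 0), (80, 20), (25, 8)]:
        print(pct_pair, resolve_operation_preferences(200, *pct_pair))
```

If this model holds, a failure tolerance of 0 effectively deploys one account at a time in strict mode, whatever the concurrency percentage is. That is the safest behavior, but it is worth knowing when estimating how long a rollout will take.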

1. Sequential Deployment: Maximum Safety

Use case: critical security updates, compliance requirements, first-time organizational rollouts

Some possible use cases:

  • Security baseline updates: New IAM policies affecting root access
  • Compliance rollouts: SOX, HIPAA, or PCI-DSS control implementations
  • Critical infrastructure changes: VPC security group modifications
  • Organizational policy changes: New AWS Config rules for audit compliance

Implementation Example:

For this example, download the ConfigRuleCloudtrailEnabled.yml template from the CloudFormation sample library in the AWS documentation. It configures an AWS Config rule that checks whether AWS CloudTrail is enabled. Then follow these steps:

Step 1: Create the StackSet

With the AWS CLI:

# Create Stackset for security baseline
# StackSet operation managed from us-east-1
aws cloudformation create-stack-set \
  --stack-set-name security-baseline \
  --template-body file://ConfigRuleCloudtrailEnabled.yml \
  --capabilities CAPABILITY_NAMED_IAM \
  --permission-model SERVICE_MANAGED \
  --auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
  --region us-east-1

AWS CLI to create a security-baseline Stackset

The expected response should be similar to the following:

{"StackSetId": "security-baseline: ...."}

Step 2: Create Stack Instances

Before you launch the below command, you need to adjust the values of the following parameters:

  • OrganizationalUnitIds: replace the “ou-test” value in the command below with the ID of the target OU you want to deploy to. I recommend creating a new test OU in the console or via the CLI for this test.
  • regions: if needed, change the “us-east-1 eu-west-1” value to list all the Regions you want to deploy to. AWS Config must be active in the accounts and Regions that you choose; otherwise you’ll get an error when deploying the stack.
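A quick local check of these two values can catch a leftover placeholder before the operation is submitted. The following is a minimal sketch: the OU ID pattern follows the format documented for AWS Organizations as we understand it, the Region pattern is only a loose shape check (it does not confirm that a Region exists or is enabled), and the helper name `check_targets` is hypothetical.

```python
import re

# OU IDs look like "ou-" + root suffix + "-" + OU suffix (assumed format).
OU_ID_PATTERN = re.compile(r"^ou-[0-9a-z]{4,32}-[0-9a-z]{8,32}$")
# Loose shape check for Region codes such as us-east-1 or us-gov-west-1.
REGION_PATTERN = re.compile(r"^[a-z]{2}(-[a-z]+)+-\d+$")


def check_targets(ou_ids, regions):
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    for ou_id in ou_ids:
        if not OU_ID_PATTERN.match(ou_id):
            problems.append(f"'{ou_id}' does not look like an OU ID")
    for region in regions:
        if not REGION_PATTERN.match(region):
            problems.append(f"'{region}' does not look like a Region code")
    return problems


if __name__ == "__main__":
    # The placeholder from the command below is flagged; the Regions pass.
    for problem in check_targets(["ou-test"], ["us-east-1", "eu-west-1"]):
        print(problem)
```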

# Deploy security baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1 and eu-west-1
# SEQUENTIAL = One region at a time, sequentially
# MaxConcurrentPercentage = Deploy to 5% of accounts at once
# FailureTolerancePercentage = Stop on first failure
aws cloudformation create-stack-instances \
  --stack-set-name security-baseline \
  --deployment-targets OrganizationalUnitIds=ou-test \
  --regions us-east-1 eu-west-1 \
  --region us-east-1 \
  --operation-preferences RegionConcurrencyType=SEQUENTIAL,MaxConcurrentPercentage=5,FailureTolerancePercentage=0

AWS CLI to create security-baseline Stack Instances sequentially for maximum safety

The CLI output should look like the following:

{"OperationId": ....}

Or create the StackSet and add the Stacks with the AWS Console:

In the CloudFormation Console, click “Create StackSet”

AWS CloudFormation Console: create a security-baseline Stackset

Upload your template from S3 or from your computer and click Next:

AWS CloudFormation Console: specify a template

Specify the StackSet name and parameters and click Next:

AWS CloudFormation Console: specify the StackSet name and parameters

Configure StackSet options and click Next:

AWS CloudFormation Console: configure the StackSet options

Set deployment options and click Next:

AWS CloudFormation Console: set deployment options

AWS CloudFormation Console: set more deployment options

Then Review and Submit.

To keep this post concise, we provide the CLI output and console screenshots only for this example. The “Parallel Deployment” and “Balanced Approach” examples follow the same steps; you only need to update the StackSet operation options.
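Since only the operation options change between strategies, they can be kept in one place and rendered into the shorthand string expected by --operation-preferences. The values below are the ones used in the three examples of this guide; the `PRESETS` dictionary and the function name are assumptions made for this sketch.

```python
# Operation preferences used in this guide, keyed by strategy name.
PRESETS = {
    "sequential": {
        "RegionConcurrencyType": "SEQUENTIAL",
        "MaxConcurrentPercentage": 5,
        "FailureTolerancePercentage": 0,
    },
    "parallel": {
        "RegionConcurrencyType": "PARALLEL",
        "MaxConcurrentPercentage": 80,
        "FailureTolerancePercentage": 20,
    },
    "balanced": {
        "RegionConcurrencyType": "PARALLEL",
        "MaxConcurrentPercentage": 25,
        "FailureTolerancePercentage": 8,
    },
}


def operation_preferences_arg(strategy):
    """Render a preset as AWS CLI shorthand: Key=value,Key=value,..."""
    preferences = PRESETS[strategy]  # raises KeyError for unknown strategies
    return ",".join(f"{key}={value}" for key, value in preferences.items())


if __name__ == "__main__":
    for name in PRESETS:
        print(name, "->", operation_preferences_arg(name))
```

The rendered string can be pasted after --operation-preferences in the create-stack-instances commands shown in this guide.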

A real-world example would be a financial services company deploying new MFA requirements across 200 production accounts. They could use sequential deployment with 5% concurrency so that each batch is validated before the next one starts.

2. Parallel Deployment: Maximum Speed

Parallel deployment is best for non-critical updates, development environments, and routine maintenance.

Here are some possible use cases:

  • Development account standardization: Rolling out new development tools
  • Monitoring infrastructure: Deploying Amazon CloudWatch dashboards and alarms
  • Cost optimization: Implementing automated resource cleanup policies
  • Non-production updates: Updating development and staging environments

Implementation Example:

For this example, copy the .yml template from this re:Post article about monitoring IAM events into a file called “monitoring-baseline.yml”, and use it in the following commands.

Step 1: Create the StackSet

# Create Stackset for monitoring baseline
# StackSet operation managed from us-east-1
aws cloudformation create-stack-set \
--stack-set-name monitoring-baseline \
--template-body file://monitoring-baseline.yml \
--capabilities CAPABILITY_NAMED_IAM \
--permission-model SERVICE_MANAGED \
--auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
--region us-east-1

AWS CLI to create a monitoring-baseline Stackset

Step 2: Create Stack Instances

Just like in the previous example, before you launch the below command, you need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to dev and sandbox accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1 and eu-west-1
# PARALLEL = Deployment in parallel
# MaxConcurrentPercentage = Deploy to 80% of accounts at once
# FailureTolerancePercentage = Tolerate failures in 20% of accounts
aws cloudformation create-stack-instances \
--stack-set-name monitoring-baseline \
--deployment-targets OrganizationalUnitIds=ou-development,ou-sandbox \
--regions us-east-1 eu-west-1 \
--region us-east-1 \
--operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=80,FailureTolerancePercentage=20

AWS CLI to create monitoring-baseline Stack Instances in parallel with high value for max concurrent percentage for maximum speed

3. Progressive Deployment: Balanced Approach or Multi Phase Approach (Recommended)

For most production scenarios with moderate risk tolerance, it is recommended to use a Balanced Approach, or Multi-Phase Implementation.

Balanced Approach

For this example, to make it easier, you can create a copy of “monitoring-baseline.yml” created previously, and name it “balanced-template.yml”.

cp monitoring-baseline.yml balanced-template.yml

bash command to copy the monitoring-baseline.yml file to balanced-template.yml

Then you can use it in the following command lines.

Step 1: Create the StackSet

# Create Stackset for a balanced creation
# StackSet operation managed from us-east-1
aws cloudformation create-stack-set \
--stack-set-name balanced-deployment \
--template-body file://balanced-template.yml \
--capabilities CAPABILITY_NAMED_IAM \
--permission-model SERVICE_MANAGED \
--auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
--region us-east-1

AWS CLI to create a balanced-deployment Stackset

Step 2: Create Stack Instances

You need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1, eu-west-1 and ap-southeast-1
# PARALLEL = Deployment in parallel
# MaxConcurrentPercentage = Deploy to 25% of accounts at once
# FailureTolerancePercentage = Tolerate failures in 8% of accounts
aws cloudformation create-stack-instances \
--stack-set-name balanced-deployment \
--deployment-targets OrganizationalUnitIds=ou-development,ou-sandbox \
--regions us-east-1 eu-west-1 ap-southeast-1 \
--region us-east-1 \
--operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=25,FailureTolerancePercentage=8

AWS CLI to create balanced-deployment Stack Instances in parallel with low max concurrent percentage for a balanced deployment

Multi-Phase Implementation:

Step 1: Create the StackSet

# Create Stackset for a balanced creation
# StackSet operation managed from us-east-1
aws cloudformation create-stack-set \
--stack-set-name balanced-deployment \
--template-body file://balanced-template.yml \
--capabilities CAPABILITY_NAMED_IAM \
--permission-model SERVICE_MANAGED \
--auto-deployment Enabled=true,RetainStacksOnAccountRemoval=false \
--region us-east-1

AWS CLI to create a balanced-deployment Stackset

Phase 1: Pilot Accounts (10% of target)

Phase 1: Create Pilot Stack Instances

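You need to adjust the values of the OrganizationalUnitIds, Accounts, and regions parameters. With service-managed permissions, individual accounts can only be targeted together with the OU that contains them, using AccountFilterType=INTERSECTION.

# Deploy monitoring baseline to pilot accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1
# SEQUENTIAL = Deployment in sequence
# MaxConcurrentPercentage = 100% Deploy full speed for small pilot
# FailureTolerancePercentage = Zero tolerance in pilot
# ou-pilot = Placeholder for the OU that contains the pilot accounts
# AccountFilterType=INTERSECTION = Deploy only to the listed accounts within that OU
aws cloudformation create-stack-instances \
--stack-set-name balanced-deployment \
--deployment-targets OrganizationalUnitIds=ou-pilot,Accounts=pilot-account-1,pilot-account-2,AccountFilterType=INTERSECTION \
--regions us-east-1 \
--region us-east-1 \
--operation-preferences RegionConcurrencyType=SEQUENTIAL,MaxConcurrentPercentage=100,FailureTolerancePercentage=0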

AWS CLI to create balanced-deployment Stack Instances sequentially for maximum safety in Pilot accounts

Wait for Pilot validation before proceeding to Phase 2

Phase 2: Early Adopter OUs (30% of target)

Phase 2: Create Early Adopter Stack Instances

You need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1, eu-west-1
# PARALLEL = Deployment in parallel
# MaxConcurrentPercentage = Deploy to 25% of accounts at once
# FailureTolerancePercentage = Tolerate failures in 5% of accounts
aws cloudformation create-stack-instances \
--stack-set-name balanced-deployment \
--deployment-targets OrganizationalUnitIds=ou-early-adopter \
--regions us-east-1 eu-west-1 \
--region us-east-1 \
--operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=25,FailureTolerancePercentage=5

AWS CLI to create balanced-deployment Stack Instances in parallel with low max concurrent percentage for a balanced deployment in Early Adopter OU

Wait for Early Adopter validation before proceeding to Phase 3

Phase 3: Full Deployment (Remaining 60%)

Phase 3: Full Deployment

You need to adjust the values of the OrganizationalUnitIds and regions parameters.

# Deploy monitoring baseline to production accounts
# StackSet operation managed from us-east-1
# Deployed to regions us-east-1, eu-west-1 and ap-southeast-1
# PARALLEL = Deployment in parallel
# MaxConcurrentPercentage = Deploy to 40% of accounts at once for higher speed after validation
# FailureTolerancePercentage = Tolerate failures in 10% of accounts for moderate tolerance
aws cloudformation create-stack-instances \
--stack-set-name balanced-deployment \
--deployment-targets OrganizationalUnitIds=ou-standard-prod,ou-legacy-prod \
--regions us-east-1 eu-west-1 ap-southeast-1 \
--region us-east-1 \
--operation-preferences RegionConcurrencyType=PARALLEL,MaxConcurrentPercentage=40,FailureTolerancePercentage=10

AWS CLI to create balanced-deployment Stack Instances in parallel with a higher max concurrent percentage for the full deployment in the remaining OUs
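To size the three phases for a given organization, the 10% / 30% / 60% split from the phase titles can be turned into account counts. This is a minimal sketch: the `PHASE_SHARES` structure and the function name are assumptions, and the last phase simply takes whatever remains so that rounding never drops an account.

```python
# Share of the target accounts per phase, taken from the phase titles above.
PHASE_SHARES = [("pilot", 10), ("early-adopter", 30), ("full", 60)]


def plan_phases(total_accounts, shares=PHASE_SHARES):
    """Split total_accounts across phases; the last phase absorbs rounding."""
    if sum(pct for _, pct in shares) != 100:
        raise ValueError("phase shares must add up to 100%")
    plan = []
    remaining = total_accounts
    for index, (name, pct) in enumerate(shares):
        is_last = index == len(shares) - 1
        # Round down for early phases so the pilot stays as small as possible.
        count = remaining if is_last else total_accounts * pct // 100
        remaining -= count
        plan.append((name, count))
    return plan


if __name__ == "__main__":
    for name, count in plan_phases(200):
        print(f"{name}: {count} accounts")
```

For the 200-account example used earlier in this guide, this gives 20 pilot accounts, 60 early adopters, and 120 accounts in the full deployment.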

Using Step Functions for Orchestration

AWS Step Functions provides a serverless workflow service that can orchestrate StackSets deployments with advanced control flow, error handling, and state management capabilities. This approach enhances your multi-account deployments with features not available through standard StackSets operations alone.

Some of the key benefits include:

  • Advanced Deployment Orchestration: Coordinate multi-phase rollouts with validation gates
  • Human Approval Workflows: Implement manual approval steps for critical changes
  • Enhanced Error Handling: Define sophisticated retry policies and fallback mechanisms
  • Visual Monitoring: Track deployment progress through the Step Functions visual console

Real-World Use Case: Compliance Control Rollout

In regulated industries, AWS Step Functions enables a phased approach that combines automation with necessary governance. For instance, you can:

  1. Deploy compliance controls to test accounts
  2. Run automated validation and generate compliance reports
  3. Obtain manual approval from compliance team
  4. Deploy to production accounts with comprehensive monitoring

This approach ensures consistent governance while maintaining the complete audit trail required for regulatory compliance.
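The four steps above could be sketched as an Amazon States Language definition, built here as a Python dictionary so it can be checked and serialized to JSON. This is an illustrative outline rather than a complete workflow: the four Lambda function names are hypothetical placeholders, and the approval step assumes the callback pattern (waitForTaskToken), where a reviewer's action later returns the task token to Step Functions.

```python
import json


def lambda_task(function_name, payload, next_state=None, wait_for_token=False):
    """Build a Task state that invokes a (hypothetical) Lambda function."""
    resource = "arn:aws:states:::lambda:invoke"
    if wait_for_token:
        # Callback pattern: execution pauses until the token is sent back.
        resource += ".waitForTaskToken"
        payload = dict(payload, **{"taskToken.$": "$$.Task.Token"})
    state = {
        "Type": "Task",
        "Resource": resource,
        "Parameters": {"FunctionName": function_name, "Payload": payload},
    }
    if next_state:
        state["Next"] = next_state
    else:
        state["End"] = True
    return state


def build_compliance_rollout_definition():
    """State machine mirroring the four steps listed above."""
    return {
        "Comment": "Phased compliance control rollout (illustrative sketch)",
        "StartAt": "DeployToTestAccounts",
        "States": {
            "DeployToTestAccounts": lambda_task(
                "deploy-stack-instances", {"target": "test"}, "ValidateAndReport"),
            "ValidateAndReport": lambda_task(
                "validate-compliance", {"target": "test"}, "WaitForApproval"),
            "WaitForApproval": lambda_task(
                "request-approval", {"approvers": "compliance-team"},
                "DeployToProduction", wait_for_token=True),
            "DeployToProduction": lambda_task(
                "deploy-stack-instances", {"target": "production"}),
        },
    }


def dangling_transitions(definition):
    """Return state names that are referenced but not defined."""
    states = definition["States"]
    referenced = [definition["StartAt"]]
    referenced += [s["Next"] for s in states.values() if "Next" in s]
    return [name for name in referenced if name not in states]


if __name__ == "__main__":
    definition = build_compliance_rollout_definition()
    assert dangling_transitions(definition) == []
    print(json.dumps(definition, indent=2))
```

Keeping the definition in code makes it easy to run a structural check such as `dangling_transitions` in a pipeline before the state machine is created or updated.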

Monitoring and Optimization

AWS CloudFormation StackSets do not have extensive built-in Amazon CloudWatch metrics specifically designed for monitoring StackSet operations and health. This gap is why the monitoring implementation in our blog post is valuable.

Here’s what AWS does and doesn’t provide out of the box:

What AWS provides natively:

  • Basic AWS API call metrics via AWS CloudTrail (which show that operations happened but don’t track success rates or performance)
  • General service quotas and throttling metrics for CloudFormation as a whole
  • CloudFormation provides some metrics for individual stacks, but not consolidated StackSet-specific metrics

What requires custom implementation (as in our blog post):

  • Success rate metrics for StackSet operations across accounts
  • Deployment completion time tracking
  • Configuration drift detection and monitoring
  • Account-specific failure analysis
  • Comprehensive dashboards that show StackSet health across your organization

The code in our blog post demonstrates how to implement the success rate custom metrics by:

  1. Gathering data from the CloudFormation API about StackSet operations
  2. Calculating the success rate metrics for StackSet deployments
  3. Creating custom Amazon CloudWatch metrics in a custom namespace (like “StackSetMonitoring”)
  4. Setting up alerts for issues

This explains why organizations need to implement custom monitoring solutions like the one shown in our blog post rather than relying solely on built-in metrics.
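The calculation at the heart of steps 2 and 3 can be sketched independently of the Lambda function. The example below assumes operation summaries shaped like the entries returned by the CloudFormation ListStackSetOperations API (each with a Status field) and builds one datum in the shape accepted by CloudWatch PutMetricData. The metric name OperationSuccessRate and the StackSetName dimension are assumptions for this sketch and may differ from the names used in the template.

```python
# Statuses that mean an operation has finished (assumed from the API reference).
FINISHED_STATUSES = ("SUCCEEDED", "FAILED", "STOPPED")


def success_rate(operations):
    """Percentage of finished operations that succeeded, or None if none finished."""
    finished = [op for op in operations if op["Status"] in FINISHED_STATUSES]
    if not finished:
        return None
    succeeded = sum(1 for op in finished if op["Status"] == "SUCCEEDED")
    return 100.0 * succeeded / len(finished)


def success_rate_datum(stack_set_name, operations):
    """Build one MetricData entry for the StackSetMonitoring namespace."""
    rate = success_rate(operations)
    if rate is None:
        return None  # nothing to publish yet
    return {
        "MetricName": "OperationSuccessRate",  # hypothetical metric name
        "Dimensions": [{"Name": "StackSetName", "Value": stack_set_name}],
        "Value": rate,
        "Unit": "Percent",
    }


if __name__ == "__main__":
    sample = [{"Status": s} for s in
              ("SUCCEEDED", "SUCCEEDED", "SUCCEEDED", "FAILED", "RUNNING")]
    print(success_rate_datum("security-baseline", sample))
```

Running and queued operations are left out of the denominator so that an in-progress rollout does not temporarily lower the reported rate.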

Automated Monitoring Implementation: example of a custom metric to monitor the StackSet operations success rate

The following AWS CloudFormation template provides real-time monitoring and alerting for AWS CloudFormation StackSet operations through automated infrastructure deployment. This solution creates a complete monitoring system using an AWS Lambda function, Amazon EventBridge rules, Amazon SNS notifications, and Amazon CloudWatch dashboards to track StackSet success and failure rates. The core Lambda function named StackSetMonitor continuously monitors all active StackSets in your account, calculating success rates and publishing custom metrics to Amazon CloudWatch under the StackSetMonitoring namespace.

Below you’ll find a few examples of possible custom metrics that could be implemented based on this AWS CloudFormation template:

  • Count of all operations (CREATE, UPDATE, DELETE) per StackSet over time periods
  • Number of stack instances with configuration drift (requires additional API calls)
  • Average time taken for StackSet operations to complete
  • Rate of StackSet operations to identify peak usage times
  • Number of individual stack instances that failed during operations
  • Number of retried operations (indicates infrastructure issues)
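Two of these metrics, the operation count per action and the average completion time, can be derived from the same operation summaries. The sketch below assumes each summary carries Action, CreationTimestamp, and EndTimestamp fields as datetime objects, which is how we understand the ListStackSetOperations response; the function name and the shape of the returned dictionary are choices made for this example.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def summarize_operations(operations):
    """Count operations per action and average the duration of finished ones."""
    counts = Counter(op["Action"] for op in operations)
    durations = [
        (op["EndTimestamp"] - op["CreationTimestamp"]).total_seconds()
        for op in operations
        # Operations still in progress have no end timestamp yet.
        if op.get("EndTimestamp") and op.get("CreationTimestamp")
    ]
    average = sum(durations) / len(durations) if durations else None
    return {"count_by_action": dict(counts), "average_duration_seconds": average}


if __name__ == "__main__":
    start = datetime(2025, 1, 1, tzinfo=timezone.utc)
    sample = [
        {"Action": "CREATE", "CreationTimestamp": start,
         "EndTimestamp": start + timedelta(seconds=60)},
        {"Action": "UPDATE", "CreationTimestamp": start,
         "EndTimestamp": start + timedelta(seconds=120)},
        {"Action": "UPDATE", "CreationTimestamp": start},  # still running
    ]
    print(summarize_operations(sample))
```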

Here’s the StackSetMonitor.yml CloudFormation Template:

# StackSetMonitor.yml 
# CFN template for monitoring AWS CloudFormation StackSet operations with real-time alerts, metrics, and dashboards.

AWSTemplateFormatVersion: '2010-09-09'
Description: 'CloudFormation template for StackSet operation monitoring using CloudWatch and SNS'

Parameters:
  StackSetName:
    Type: String
    Description: 'Name of the StackSet to monitor'
    Default: 'security-baseline'
    MinLength: 1
    MaxLength: 128
    AllowedPattern: '[a-zA-Z][-a-zA-Z0-9]*'
    ConstraintDescription: 'Must be a valid StackSet name (1-128 characters, alphanumeric and hyphens, must start with a letter)'
  
  VpcId:
    Type: String
    Description: 'VPC ID where the Lambda function will be deployed (leave empty to create new VPC)'
    Default: ''
  
  SubnetIds:
    Type: CommaDelimitedList
    Description: 'List of subnet IDs for the Lambda function (leave empty to create new subnets)'
    Default: ''
    
  SecurityGroupIds:
    Type: CommaDelimitedList
    Description: 'List of security group IDs for the Lambda function (leave empty to create new security group)'
    Default: ''

Conditions:
  CreateVPC: !Equals [!Ref VpcId, '']
  CreateVPCAndSubnets: !And [!Equals [!Ref VpcId, ''], !Equals [!Join [',', !Ref SubnetIds], '']]
  HasCustomSecurityGroups: !Not [!Equals [!Join [',', !Ref SecurityGroupIds], '']]
  
Resources:
  # KMS Key for CloudWatch Logs encryption
  LogsKMSKey:
    Type: AWS::KMS::Key
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      Description: 'KMS Key for StackSet Monitor CloudWatch Logs and Lambda environment variable encryption'
      EnableKeyRotation: true
      KeyPolicy:
        Version: '2012-10-17'
        Statement:
          - Sid: Enable IAM User Permissions
            Effect: Allow
            Principal:
              AWS: !Sub 'arn:${AWS::Partition}:iam::${AWS::AccountId}:root'
            Action: 'kms:*'
            Resource: '*'
          - Sid: Allow CloudWatch Logs
            Effect: Allow
            Principal:
              Service: !Sub 'logs.${AWS::Region}.amazonaws.com'
            Action:
              - 'kms:Encrypt'
              - 'kms:Decrypt'
              - 'kms:ReEncrypt*'
              - 'kms:GenerateDataKey*'
              - 'kms:DescribeKey'
            Resource: '*'
            Condition:
              ArnEquals:
                'kms:EncryptionContext:aws:logs:arn': 
                  - !Sub 'arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/StackSetMonitor'
                  - !Sub 'arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/cloudformation/stacksets'
          - Sid: Allow Lambda Service
            Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action:
              - 'kms:Encrypt'
              - 'kms:Decrypt'
              - 'kms:ReEncrypt*'
              - 'kms:GenerateDataKey*'
              - 'kms:DescribeKey'
            Resource: '*'

  LogsKMSKeyAlias:
    Type: AWS::KMS::Alias
    Properties:
      AliasName: alias/stackset-monitor-logs
      TargetKeyId: !Ref LogsKMSKey

  # VPC Resources (created when no existing VPC is provided)
  StackSetMonitorVPC:
    Type: AWS::EC2::VPC
    Condition: CreateVPC
    Properties:
      CidrBlock: 10.0.0.0/16
      EnableDnsHostnames: true
      EnableDnsSupport: true
      Tags:
        - Key: Name
          Value: StackSetMonitor-VPC
        - Key: Purpose
          Value: VPC for StackSet Monitor Lambda function


  PrivateSubnet1:
    Type: AWS::EC2::Subnet
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      CidrBlock: 10.0.1.0/24
      AvailabilityZone: !Select [0, !GetAZs '']
      Tags:
        - Key: Name
          Value: StackSetMonitor-Private-Subnet-1
        - Key: Purpose
          Value: Private subnet for StackSet Monitor Lambda

  PrivateSubnet2:
    Type: AWS::EC2::Subnet
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      CidrBlock: 10.0.2.0/24
      AvailabilityZone: !Select [1, !GetAZs '']
      Tags:
        - Key: Name
          Value: StackSetMonitor-Private-Subnet-2
        - Key: Purpose
          Value: Private subnet for StackSet Monitor Lambda

  PrivateRouteTable1:
    Type: AWS::EC2::RouteTable
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      Tags:
        - Key: Name
          Value: StackSetMonitor-Private-RT-1

  PrivateRouteTable2:
    Type: AWS::EC2::RouteTable
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      Tags:
        - Key: Name
          Value: StackSetMonitor-Private-RT-2

  PrivateSubnet1RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Condition: CreateVPC
    Properties:
      RouteTableId: !Ref PrivateRouteTable1
      SubnetId: !Ref PrivateSubnet1

  PrivateSubnet2RouteTableAssociation:
    Type: AWS::EC2::SubnetRouteTableAssociation
    Condition: CreateVPC
    Properties:
      RouteTableId: !Ref PrivateRouteTable2
      SubnetId: !Ref PrivateSubnet2

  # VPC Endpoints for AWS Services (no internet access needed)
  CloudFormationVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.cloudformation
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - cloudformation:ListStackSets
              - cloudformation:ListStackSetOperations
              - cloudformation:ListStackInstances
              - cloudformation:DescribeStackInstance
              - cloudformation:DescribeStacks
              - cloudformation:GetTemplate
            Resource: '*'

  CloudWatchVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.monitoring
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - cloudwatch:PutMetricData
            Resource: '*'

  SNSVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.sns
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - sns:Publish
            Resource: '*'

  EventsVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.events
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - events:PutEvents
            Resource: '*'

  LogsVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.logs
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - logs:CreateLogGroup
              - logs:CreateLogStream
              - logs:PutLogEvents
            Resource: '*'

  SQSVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.sqs
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - sqs:SendMessage
            Resource: '*'

  STSVPCEndpoint:
    Type: AWS::EC2::VPCEndpoint
    Condition: CreateVPC
    Properties:
      VpcId: !Ref StackSetMonitorVPC
      ServiceName: !Sub com.amazonaws.${AWS::Region}.sts
      VpcEndpointType: Interface
      SubnetIds:
        - !Ref PrivateSubnet1
        - !Ref PrivateSubnet2
      SecurityGroupIds:
        - !Ref VPCEndpointSecurityGroup
      PrivateDnsEnabled: true
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal: '*'
            Action:
              - sts:AssumeRole
              - sts:GetCallerIdentity
              - sts:AssumeRoleWithWebIdentity
            Resource: '*'

  # Security Group for Lambda function
  LambdaSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Properties:
      GroupDescription: Security group for StackSet Monitor Lambda function
      VpcId: !If
        - CreateVPC
        - !Ref StackSetMonitorVPC
        - !Ref VpcId
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 10.0.0.0/16
          Description: HTTPS to VPC Endpoints
        - IpProtocol: tcp
          FromPort: 53
          ToPort: 53
          CidrIp: 10.0.0.0/16
          Description: DNS TCP to VPC for name resolution
        - IpProtocol: udp
          FromPort: 53
          ToPort: 53
          CidrIp: 10.0.0.0/16
          Description: DNS UDP to VPC for name resolution
      Tags:
        - Key: Name
          Value: StackSetMonitor-Lambda-SG
        - Key: Purpose
          Value: Security group for StackSet Monitor Lambda

  VPCEndpointSecurityGroup:
    Type: AWS::EC2::SecurityGroup
    Condition: CreateVPC
    Properties:
      GroupDescription: Security group for VPC Endpoints
      VpcId: !Ref StackSetMonitorVPC
      SecurityGroupIngress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          SourceSecurityGroupId: !Ref LambdaSecurityGroup
          Description: HTTPS from Lambda security group
        - IpProtocol: tcp
          FromPort: 53
          ToPort: 53
          SourceSecurityGroupId: !Ref LambdaSecurityGroup
          Description: DNS TCP from Lambda security group
        - IpProtocol: udp
          FromPort: 53
          ToPort: 53
          SourceSecurityGroupId: !Ref LambdaSecurityGroup
          Description: DNS UDP from Lambda security group
      SecurityGroupEgress:
        - IpProtocol: tcp
          FromPort: 443
          ToPort: 443
          CidrIp: 10.0.0.0/16
          Description: HTTPS outbound within VPC
        - IpProtocol: tcp
          FromPort: 53
          ToPort: 53
          CidrIp: 10.0.0.0/16
          Description: DNS TCP outbound within VPC
        - IpProtocol: udp
          FromPort: 53
          ToPort: 53
          CidrIp: 10.0.0.0/16
          Description: DNS UDP outbound within VPC
      Tags:
        - Key: Name
          Value: StackSetMonitor-VPCEndpoint-SG
        - Key: Purpose
          Value: Security group for VPC Endpoints

  # Dead Letter Queue for Lambda function
  StackSetMonitorDLQ:
    Type: AWS::SQS::Queue
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      QueueName: StackSetMonitor-DLQ
      MessageRetentionPeriod: 1209600  # 14 days
      KmsMasterKeyId: alias/aws/sqs
      Tags:
        - Key: Purpose
          Value: Dead Letter Queue for StackSet Monitor Lambda

  StackSetAlertsTopic:
    Type: AWS::SNS::Topic
    Properties: 
      TopicName: StackSetAlerts
      DisplayName: StackSet Monitoring Alerts
      KmsMasterKeyId: alias/aws/sns
  
  StackSetLogGroup:
    Type: AWS::Logs::LogGroup
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties: 
      LogGroupName: /aws/cloudformation/stacksets
      RetentionInDays: 30
      KmsKeyId: !GetAtt LogsKMSKey.Arn

  LambdaLogGroup:
    Type: AWS::Logs::LogGroup
    DeletionPolicy: Delete
    UpdateReplacePolicy: Delete
    Properties:
      LogGroupName: /aws/lambda/StackSetMonitor
      RetentionInDays: 30
      KmsKeyId: !GetAtt LogsKMSKey.Arn
  
  StackSetMonitoringDashboard:
    Type: AWS::CloudWatch::Dashboard
    Properties:
      DashboardName: StackSetMonitoring
      DashboardBody: !Sub |
        {
          "widgets": [
            {
              "type": "metric",
              "width": 24,
              "height": 8,
              "properties": {
                "metrics": [
                  [ "StackSetMonitoring", "SuccessRate", "StackSetName", "${StackSetName}" ]
                ],
                "region": "${AWS::Region}",
                "title": "StackSet Operations",
                "period": 300,
                "stat": "Average"
              }
            },
            {
              "type": "log",
              "width": 24,
              "height": 6,
              "properties": {
                "query": "SOURCE '/aws/lambda/StackSetMonitor' | fields @timestamp, @message\n| sort @timestamp desc\n| limit 20",
                "region": "${AWS::Region}",
                "title": "Latest StackSet Monitor Logs",
                "view": "table"
              }
            }
          ]
        }
  
  # Consolidated rule to catch ALL StackSet events for comprehensive monitoring
  AllStackSetOperationsRule:
    Type: AWS::Events::Rule
    Properties:
      Name: AllStackSetOperationsRule
      Description: "Rule for monitoring all CloudFormation StackSet operations with failure notifications"
      EventPattern: {source: ["aws.cloudformation"], detail-type: ["CloudFormation StackSet Operation Status Change"]}
      State: ENABLED
      Targets:
        - Id: ProcessAllEvents
          Arn: !GetAtt StackSetMonitorLambda.Arn
        - Id: NotifyFailure
          Arn: !Ref StackSetAlertsTopic
          InputTransformer:
            InputPathsMap:
              "stackSetId": "$.detail.stack-set-id"
              "operationId": "$.detail.operation-id"
              "status": "$.detail.status"
              "time": "$.time"
            InputTemplate: '"StackSet Event: ID: <stackSetId>, Op: <operationId>, Status: <status>, Time: <time>"'

  StackSetMonitorLambda:
    Type: AWS::Lambda::Function
    DependsOn: LambdaLogGroup
    Properties:
      FunctionName: StackSetMonitor
      Handler: index.lambda_handler
      Role: !GetAtt StackSetMonitorRole.Arn
      Runtime: python3.12
      Timeout: 300
      MemorySize: 512
      ReservedConcurrentExecutions: 1
      DeadLetterConfig:
        TargetArn: !GetAtt StackSetMonitorDLQ.Arn
      VpcConfig:
        SecurityGroupIds: !If
          - HasCustomSecurityGroups
          - !Ref SecurityGroupIds
          - - !Ref LambdaSecurityGroup
        SubnetIds: !If
          - CreateVPCAndSubnets
          - - !Ref PrivateSubnet1
            - !Ref PrivateSubnet2
          - !Ref SubnetIds
      KmsKeyArn: !GetAtt LogsKMSKey.Arn
      Code:
        ZipFile: |
          import boto3
          import json
          import os
          import logging
          import time
          import datetime
          from typing import Dict, Any, Optional
          
          # Custom JSON encoder to handle datetime objects
          class DateTimeEncoder(json.JSONEncoder):
              def default(self, obj):
                  if isinstance(obj, datetime.datetime):
                      return obj.isoformat()
                  return super().default(obj)
          
          # Set up logging with more details
          logger = logging.getLogger()
          logger.setLevel(logging.INFO)
          
          # Log initialization to verify Lambda is loading correctly
          logger.info("StackSetMonitor Lambda initializing...")
          
          def validate_event(event: Dict[str, Any]) -> bool:
              """Validate the incoming event structure"""
              if not isinstance(event, dict):
                  logger.error("Event must be a dictionary")
                  return False
              
              # If it's an EventBridge event, validate required fields
              if 'detail' in event:
                  detail = event.get('detail', {})
                  if not isinstance(detail, dict):
                      logger.error("Event detail must be a dictionary")
                      return False
                  
                  # Validate StackSet event structure
                  if 'stack-set-id' in detail:
                      stack_set_id = detail.get('stack-set-id')
                      if not isinstance(stack_set_id, str) or not stack_set_id.strip():
                          logger.error("stack-set-id must be a non-empty string")
                          return False
                      
                      # Validate operation-id if present
                      operation_id = detail.get('operation-id')
                      if operation_id is not None and not isinstance(operation_id, str):
                          logger.error("operation-id must be a string if provided")
                          return False
                      
                      # Validate status if present
                      status = detail.get('status')
                      if status is not None and not isinstance(status, str):
                          logger.error("status must be a string if provided")
                          return False
              
              return True
          
          def validate_context(context: Any) -> bool:
              """Validate the Lambda context object"""
              if context is None:
                  logger.error("Context cannot be None")
                  return False
              
              # Check for required context attributes
              required_attrs = ['function_name', 'function_version', 'invoked_function_arn', 'memory_limit_in_mb']
              for attr in required_attrs:
                  if not hasattr(context, attr):
                      logger.error(f"Context missing required attribute: {attr}")
                      return False
              
              return True
          
          def sanitize_string(value: str, max_length: int = 255) -> str:
              """Sanitize and truncate string inputs"""
              if not isinstance(value, str):
                  return str(value)[:max_length]
              return value.strip()[:max_length]
          
          def lambda_handler(event: Dict[str, Any], context: Any) -> Dict[str, Any]:
              """Main Lambda handler function for StackSet monitoring with input validation"""
              
              # Input validation
              if not validate_event(event):
                  return {
                      "statusCode": 400,
                      "body": json.dumps({
                          "status": "error",
                          "message": "Invalid event structure"
                      }, cls=DateTimeEncoder)
                  }
              
              if not validate_context(context):
                  return {
                      "statusCode": 400,
                      "body": json.dumps({
                          "status": "error",
                          "message": "Invalid context object"
                      }, cls=DateTimeEncoder)
                  }
              
              # Log the validated event for debugging
              logger.info(f"Event received: {json.dumps(event, cls=DateTimeEncoder)}")
              logger.info(f"Function: {context.function_name}, Version: {context.function_version}")
              
              try:
                  cf = boto3.client('cloudformation')
                  cw = boto3.client('cloudwatch')
                  
                  # Log that we're starting processing
                  logger.info(f"Starting StackSet monitoring at {time.time()}")
                  
                  # Check if this is an event from EventBridge
                  if 'detail' in event and 'stack-set-id' in event.get('detail', {}):
                      detail = event['detail']
                      stack_set_id = sanitize_string(detail['stack-set-id'])
                      operation_id = sanitize_string(detail.get('operation-id', 'N/A'))
                      status = sanitize_string(detail.get('status', 'N/A'))
                      
                      # Validate stack_set_id format
                      if not stack_set_id or len(stack_set_id) > 128:
                          logger.error(f"Invalid stack_set_id: {stack_set_id}")
                          return {
                              "statusCode": 400,
                              "body": json.dumps({
                                  "status": "error",
                                  "message": "Invalid stack_set_id format"
                              }, cls=DateTimeEncoder)
                          }
                      
                      # Log the StackSet operation with additional context
                      logger.info(f"Processing StackSet event - ID: {stack_set_id}, Op: {operation_id}, Status: {status}")
                      
                      # Extract stack set name from the ID
                      stack_set_name = stack_set_id.split('/')[-1] if '/' in stack_set_id else stack_set_id
                      stack_set_name = sanitize_string(stack_set_name, 128)
                      logger.info(f"Extracted StackSet name: {stack_set_name}")
                  
                  # Always gather metrics regardless of event type
                  # Get all active StackSets
                  stack_sets_response = cf.list_stack_sets(Status='ACTIVE')
                  stack_sets = stack_sets_response.get('Summaries', [])
                  
                  if not isinstance(stack_sets, list):
                      logger.error("Invalid response from list_stack_sets")
                      return {
                          "statusCode": 500,
                          "body": json.dumps({
                              "status": "error",
                              "message": "Invalid CloudFormation API response"
                          }, cls=DateTimeEncoder)
                      }
                  
                  logger.info(f"Found {len(stack_sets)} active StackSets")
                  
                  for stack_set in stack_sets:
                      if not isinstance(stack_set, dict) or 'StackSetName' not in stack_set:
                          logger.warning(f"Skipping invalid stack_set entry: {stack_set}")
                          continue
                      
                      stack_set_name = sanitize_string(stack_set['StackSetName'], 128)
                      logger.info(f"Processing StackSet: {stack_set_name}")
                      
                      try:
                          operations = cf.list_stack_set_operations(StackSetName=stack_set_name, MaxResults=5)
                          
                          # Validate operations response
                          if not isinstance(operations, dict):
                              logger.error(f"Invalid operations response for {stack_set_name}")
                              continue
                          
                          # Calculate success rate
                          successes = 0
                          operations_list = operations.get('Summaries', [])
                          
                          if not isinstance(operations_list, list):
                              logger.error(f"Invalid operations list for {stack_set_name}")
                              continue
                          
                          total_ops = len(operations_list)
                          logger.info(f"Found {total_ops} recent operations for {stack_set_name}")
                          
                          for op in operations_list:
                              if isinstance(op, dict) and op.get('Status') == 'SUCCEEDED':
                                  successes += 1
                          
                          success_rate = (successes / total_ops * 100) if total_ops > 0 else 100
                          
                          # Validate success_rate is within expected bounds
                          if not (0 <= success_rate <= 100):
                              logger.error(f"Invalid success_rate calculated: {success_rate}")
                              continue
                          
                          # Publish metrics to CloudWatch
                          cw.put_metric_data(
                              Namespace='StackSetMonitoring',
                              MetricData=[
                                  {'MetricName': 'SuccessRate', 'Value': success_rate, 
                                   'Dimensions': [{'Name': 'StackSetName', 'Value': stack_set_name}]}
                              ]
                          )
                          
                          logger.info(f"Published metrics for {stack_set_name}: Success Rate = {success_rate}%")
                      except Exception as e:
                          logger.error(f"Error processing StackSet {stack_set_name}: {str(e)}")
                  
                  return {
                      "statusCode": 200,
                      "body": json.dumps({
                          "status": "completed",
                          "message": f"Processed {len(stack_sets)} StackSets"
                      }, cls=DateTimeEncoder)
                  }
                  
              except Exception as e:
                  logger.error(f"Error in Lambda function: {str(e)}")
                  # Return a proper response even on error
                  return {
                      "statusCode": 500,
                      "body": json.dumps({
                          "status": "error",
                          "message": str(e)
                      }, cls=DateTimeEncoder)
                  }
  
  # Managed IAM Policies
  CloudFormationAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: 'Policy for CloudFormation and CloudWatch access for StackSet Monitor'
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - cloudformation:ListStackSets
              - cloudformation:ListStackSetOperations
              - cloudformation:ListStackInstances
              - cloudformation:DescribeStackInstance
            Resource: 
              - !Sub "arn:${AWS::Partition}:cloudformation:${AWS::Region}:${AWS::AccountId}:stackset/*"
              - !Sub "arn:${AWS::Partition}:cloudformation:${AWS::Region}:${AWS::AccountId}:stackset-target/*"
          - Effect: Allow
            Action:
              - cloudwatch:PutMetricData
            Resource: "*"
            Condition:
              StringEquals:
                "cloudwatch:namespace": "StackSetMonitoring"
          - Effect: Allow
            Action:
              - sns:Publish
            Resource: !Ref StackSetAlertsTopic

  EventsAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: 'Policy for EventBridge access for StackSet Monitor'
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - events:PutEvents
            Resource: !Sub "arn:${AWS::Partition}:events:${AWS::Region}:${AWS::AccountId}:event-bus/default"

  LogsAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: 'Policy for CloudWatch Logs access for StackSet Monitor'
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - logs:CreateLogGroup
              - logs:CreateLogStream
              - logs:PutLogEvents
            Resource: 
              - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/StackSetMonitor"
              - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/lambda/StackSetMonitor:*"
              - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/cloudformation/stacksets"
              - !Sub "arn:${AWS::Partition}:logs:${AWS::Region}:${AWS::AccountId}:log-group:/aws/cloudformation/stacksets:*"

  DLQAccessPolicy:
    Type: AWS::IAM::ManagedPolicy
    Properties:
      Description: 'Policy for Dead Letter Queue access for StackSet Monitor'
      PolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Action:
              - sqs:SendMessage
            Resource: !GetAtt StackSetMonitorDLQ.Arn

  StackSetMonitorRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
          - Effect: Allow
            Principal:
              Service: lambda.amazonaws.com
            Action: sts:AssumeRole
      ManagedPolicyArns:
        - !Sub "arn:${AWS::Partition}:iam::aws:policy/service-role/AWSLambdaVPCAccessExecutionRole"
        - !Ref CloudFormationAccessPolicy
        - !Ref EventsAccessPolicy
        - !Ref LogsAccessPolicy
        - !Ref DLQAccessPolicy

  # Permissions for event rules to invoke Lambda
  AllOperationsRuleLambdaPermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref StackSetMonitorLambda
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt AllStackSetOperationsRule.Arn
  
  # Using a one minute schedule for testing, but you can change this value
  StackSetMonitorSchedule:
    Type: AWS::Events::Rule
    Properties:
      Name: RegularStackSetMonitoring
      Description: "Triggers Lambda function every 1 minute to check StackSet operations"
      ScheduleExpression: "rate(1 minute)"
      State: ENABLED
      Targets:
        - Id: RunMonitor
          Arn: !GetAtt StackSetMonitorLambda.Arn
  
  ScheduleLambdaInvokePermission:
    Type: AWS::Lambda::Permission
    Properties:
      FunctionName: !Ref StackSetMonitorLambda
      Action: lambda:InvokeFunction
      Principal: events.amazonaws.com
      SourceArn: !GetAtt StackSetMonitorSchedule.Arn
  
  StackSetSuccessRateAlarm:
    Type: AWS::CloudWatch::Alarm
    Properties:
      AlarmDescription: "Alarm when StackSet operation success rate is low"
      MetricName: SuccessRate
      Namespace: "StackSetMonitoring"
      Statistic: Average
      Period: 300
      EvaluationPeriods: 3
      DatapointsToAlarm: 2
      Threshold: 80
      ComparisonOperator: LessThanThreshold
      AlarmActions: [!Ref StackSetAlertsTopic]
      Dimensions: [{Name: StackSetName, Value: !Ref StackSetName}]

Outputs:
  SNSTopicArn: 
    Description: The ARN of the SNS topic for alerts
    Value: !Ref StackSetAlertsTopic
  DashboardURL: 
    Description: URL to the CloudWatch Dashboard
    Value: !Sub https://console.aws.amazon.com/cloudwatch/home?region=${AWS::Region}#dashboards:name=StackSetMonitoring
  LambdaLogGroupName:
    Description: Name of the CloudWatch Log Group for Lambda logs
    Value: !Ref LambdaLogGroup
  DeadLetterQueueArn:
    Description: ARN of the Dead Letter Queue for Lambda function failures
    Value: !GetAtt StackSetMonitorDLQ.Arn
  DeadLetterQueueURL:
    Description: URL of the Dead Letter Queue for monitoring failed Lambda executions
    Value: !Ref StackSetMonitorDLQ
  TestLambdaCommand:
    Description: Command to manually test the Lambda function
    Value: !Sub "aws lambda invoke --function-name ${StackSetMonitorLambda} --payload '{}' response.json && cat response.json"
  LambdaFunctionArn:
    Description: ARN of the Lambda function configured with VPC
    Value: !GetAtt StackSetMonitorLambda.Arn
  LambdaSecurityGroupId:
    Description: Security Group ID created for the Lambda function
    Value: !Ref LambdaSecurityGroup
  VpcConfiguration:
    Description: VPC configuration summary for the Lambda function
    Value: !Sub 
      - "VPC: ${VpcId}, Subnets: ${SubnetList}, Security Groups: ${LambdaSecurityGroup}"
      - SubnetList: !Join [',', !Ref SubnetIds]
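
The success-rate calculation inside the Lambda handler above can be tried locally without AWS access. The following is a minimal sketch, not the deployed code: the function name `success_rate` and the sample data are assumptions made for illustration, and only the `Status` field of each operation summary is used.

```python
from typing import Any, Dict, List


def success_rate(operations: List[Dict[str, Any]]) -> float:
    """Return the percentage of operations whose Status is SUCCEEDED.

    Follows the handler's convention: an empty list counts as 100%.
    """
    total = len(operations)
    if total == 0:
        return 100.0
    succeeded = sum(1 for op in operations if op.get("Status") == "SUCCEEDED")
    return succeeded / total * 100


# Hypothetical sample shaped like the 'Summaries' list that the handler
# reads from list_stack_set_operations (MaxResults=5 in the template).
sample = [
    {"Status": "SUCCEEDED"},
    {"Status": "FAILED"},
    {"Status": "SUCCEEDED"},
    {"Status": "SUCCEEDED"},
    {"Status": "STOPPED"},
]

print(success_rate(sample))
print(success_rate([]))
```

Because only SUCCEEDED counts as a success, operations that are still RUNNING or QUEUED lower the rate. When five operations are returned, the alarm threshold of 80 is breached as soon as two or more of them have any status other than SUCCEEDED.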

Run the following CLI command to deploy the CloudFormation stack. Set the ParameterValue of StackSetName to the name of the StackSet you want to monitor; the default value is “security-baseline”. Your CLI profile should use region=“us-east-1”.

aws cloudformation create-stack --stack-name stackset-monitor --template-body file://StackSetMonitor.yml --parameters ParameterKey=StackSetName,ParameterValue="security-baseline" --capabilities CAPABILITY_IAM

AWS CLI to deploy the StackSetMonitor.yml CloudFormation template

The CLI output should look like the following:

{"StackId": "arn:aws:cloudformation:...."}

Here’s the expected console output for the deployed CloudFormation stack:

StackSetMonitor Console output

And an example of Amazon CloudWatch Dashboard and Alarm screen:

Amazon CloudWatch Dashboard screenshot for StackSetMonitor stack to track StackSet operations success rate

Amazon CloudWatch Alarm screenshot for StackSetMonitor stack to track StackSet operations success rate

To set up an SNS subscription, retrieve the topic ARN from the stack outputs and subscribe an email or SMS endpoint. The following CLI example subscribes an email address:

aws sns subscribe --topic-arn $SNS_TOPIC_ARN --protocol email --notification-endpoint [email protected]

AWS CLI to subscribe to the topic providing the user email
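
The topic ARN used in the command above is exposed by the stack as the SNSTopicArn output. As a rough sketch, it could be looked up from Python as follows. The helper names `find_output` and `fetch_topic_arn` are illustrative assumptions, the sample response is made up, and the stack name assumes the `stackset-monitor` deployment shown earlier.

```python
from typing import Any, Dict, Optional


def find_output(describe_stacks_response: Dict[str, Any], key: str) -> Optional[str]:
    """Return the OutputValue for `key` from a describe_stacks response, or None."""
    for stack in describe_stacks_response.get("Stacks", []):
        for output in stack.get("Outputs", []):
            if output.get("OutputKey") == key:
                return output.get("OutputValue")
    return None


def fetch_topic_arn(stack_name: str = "stackset-monitor") -> Optional[str]:
    """Look up the SNSTopicArn output of the deployed stack (needs AWS credentials)."""
    import boto3  # imported lazily so find_output can be exercised offline

    cf = boto3.client("cloudformation")
    return find_output(cf.describe_stacks(StackName=stack_name), "SNSTopicArn")


# Offline check with a hypothetical response; the account ID is a placeholder.
fake_response = {
    "Stacks": [
        {
            "StackName": "stackset-monitor",
            "Outputs": [
                {
                    "OutputKey": "SNSTopicArn",
                    "OutputValue": "arn:aws:sns:us-east-1:111122223333:StackSetAlerts",
                }
            ],
        }
    ]
}

print(find_output(fake_response, "SNSTopicArn"))
```

The value returned by `fetch_topic_arn()` can then be exported as `SNS_TOPIC_ARN` before running the subscribe command.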

Cost:

The estimated monthly expense ranges between 5 and 15 USD depending on StackSet activity levels, with approximately 1,440 scheduled Lambda executions per day (one per minute) under the default monitoring schedule, plus any event-driven invocations triggered by StackSet operations.

The solution supports customization of monitoring frequency by modifying the ScheduleExpression from the default one-minute interval. The cost will decrease if the monitoring is less frequent.
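
To see how the schedule affects the invocation count, the sketch below estimates scheduled invocations per day for an EventBridge `rate()` expression. It is an illustration only: the function name `invocations_per_day` is an assumption, it does not handle `cron()` expressions, and it does not count the extra invocations triggered by StackSet operation events.

```python
import re

# Minutes represented by each unit accepted in a rate() expression.
_MINUTES_PER_UNIT = {"minute": 1, "hour": 60, "day": 1440}


def invocations_per_day(schedule_expression: str) -> float:
    """Estimate daily scheduled invocations for a rate() schedule expression."""
    match = re.fullmatch(
        r"rate\((\d+) (minute|hour|day)s?\)", schedule_expression.strip()
    )
    if not match:
        raise ValueError(f"Unsupported schedule expression: {schedule_expression!r}")
    value, unit = int(match.group(1)), match.group(2)
    if value == 0:
        raise ValueError("Rate value must be positive")
    return 1440 / (value * _MINUTES_PER_UNIT[unit])


for expr in ("rate(1 minute)", "rate(5 minutes)", "rate(1 hour)"):
    print(expr, invocations_per_day(expr))
```

Moving from a one-minute to a five-minute schedule reduces scheduled invocations from 1,440 to 288 per day.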

Cleanup:

To clean up, run the following commands:

  • To clean up the stack instances and StackSets created in the Core Deployment Strategies section:

aws cloudformation delete-stack-instances --stack-set-name security-baseline --deployment-targets OrganizationalUnitIds=ou-xxx --regions us-east-1 eu-west-1 --region us-east-1 --no-retain-stacks

AWS CLI to delete the Stack Instances

Replace the OrganizationalUnitIds value with the ID of your OU, the regions value with the list of Regions where you want to delete your stack instances, and the stack-set-name value with the name of your StackSet (security-baseline, monitoring-baseline, balanced-deployment…).

Then you can delete the StackSet:

aws cloudformation delete-stack-set --stack-set-name security-baseline

AWS CLI to delete the StackSet

You can change the value of the stack-set-name parameter.

  • To clean up the stackset-monitor stack:

aws cloudformation delete-stack --stack-name stackset-monitor

AWS CLI to delete the stackset-monitor Stack

You can also remove any IAM roles or policies that you created specifically for this post and no longer need.

Conclusion

Throughout this guide, we’ve explored the nuanced approaches to AWS CloudFormation StackSets deployments across large-scale environments. The key takeaways include:

  • Balance is Critical: Every deployment strategy requires careful consideration of the trade-offs between speed, safety, and scale based on your organizational needs.
  • Progressive Adoption Works: For most organizations, a progressive deployment approach with validation gates provides the optimal balance of safety and efficiency.
  • Organizational Context Matters: Enterprise, startup, and regulated industry patterns demonstrate that deployment strategies should be tailored to your specific business requirements and risk tolerance.
  • Monitoring is Essential: As organizations scale to hundreds of accounts, comprehensive monitoring becomes critical for maintaining visibility and ensuring compliance.

These approaches will help you adopt the right strategy for your AWS CloudFormation StackSets deployments in your AWS Organization.

You can now test them in a sandbox environment before adapting them to your specific needs, balancing speed, safety, and scale to optimize your deployments.

Amar Meriche

Amar is a Sr Cloud Operations Architect at AWS in Paris. He helps his customers improve their operational posture through advocacy and guidance, and is an active member of the DevOps and IaC community at AWS. He’s passionate about helping customers use the various IaC tools available at AWS following best practices. When he’s not working with customers, Amar can be found on the mountain trails with his family or playing basketball with his team.

Idriss Laouali Abdou

Idriss is a Sr. Product Manager Technical for AWS Infrastructure-as-Code based in Seattle. He focuses on improving developer productivity through StackSets and CloudFormation Infrastructure provisioning experiences. Outside of work, you can find him creating educational content for thousands of students, cooking, or dancing.

Better profile management coming to Firefox

Post Syndicated from jzb original https://lwn.net/Articles/1041266/

Firefox has long had support for multiple profiles
to store personal information such as bookmarks, passwords, and user
preferences. However, Firefox did not make profiles particularly
discoverable or easy to manage. That is about to change; Mozilla has
announced
that it is launching a profile-management feature that will make it
easier to create and switch between profiles. According to the support
page
for the feature, it will be rolled out to users gradually
beginning on October 14.

[$] Upcoming Rust language features for kernel development

Post Syndicated from daroc original https://lwn.net/Articles/1039073/

The Rust for Linux project has been good for Rust, Tyler Mandry, one of the co-leads of Rust’s language-design team, said. He gave a talk at Kangrejos 2025 covering upcoming Rust language features and thanking the Rust for Linux developers for helping drive them forward. Afterward, Benno Lossin and Xiangfei Ding went into more detail about their work on the three most important language features for kernel development: field projections, in-place initialization, and arbitrary self types.

Flok License Plate Surveillance

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2025/10/flok-license-plate-surveillance.html

The company Flock is surveilling us as we drive:

A retired veteran named Lee Schmidt wanted to know how often Norfolk, Virginia’s 176 Flock Safety automated license-plate-reader cameras were tracking him. The answer, according to a U.S. District Court lawsuit filed in September, was more than four times a day, or 526 times from mid-February to early July. No, there’s no warrant out for Schmidt’s arrest, nor is there a warrant for Schmidt’s co-plaintiff, Crystal Arrington, whom the system tagged 849 times in roughly the same period.

You might think this sounds like it violates the Fourth Amendment, which protects American citizens from unreasonable searches and seizures without probable cause. Well, so does the American Civil Liberties Union. Norfolk, Virginia Judge Jamilah LeCruise also agrees, and in 2024 she ruled that plate-reader data obtained without a search warrant couldn’t be used against a defendant in a robbery case.

Wiring Money at Allah’s Behest

Post Syndicated from Атанас Шиников original https://www.toest.bg/da-prevedesh-pari-po-zarukata-na-allah/

Some phenomena and practices, however flexible they are with respect to context, seem to show remarkable resilience over history. They stretch a finger toward us across the centuries. If you have read Umberto Eco’s “The Name of the Rose”, you can picture it. For example, if I were playing the part of an influencer or a corporate coach here, I could remind you of the story about the horse’s hindquarters. You know it: it circulates on social media and links the dimensions of a horse’s rear end to the US space program and the space shuttle.

According to this story, the width of the axle between the wheels of ancient Roman chariots was determined by the width of the rumps of two war horses. That axle-width standard was later built into the track width of British wagons and carts. This in turn determined the distance between the rails of British, and later American, railways. From there, transport infrastructure, including the size of tunnels, was determined by this standard. Which in turn dictates the size of the space shuttle’s booster rockets in the American program.

So there you have it, the story claims: horses’ rear ends play a decisive role in the design of one of the most advanced technologies known to humanity.

This story, of course, bears the marks of an artificially embellished and inflated account, whose factual shortcomings could be the subject of a separate exercise in the style of the show “MythBusters”. Today, however, I have decided to remind you of something more pragmatic in historical terms. Something not so well known. It is not so influencer-like either, and has no tabloid flavor, but it does have a connection to the present day and could be thought of as topical.

Let us now resort to some mental speculation and try to draw a connection between the smugglers of refugees through Bulgaria, ISIS, former Prosecutor General Ivan Geshev, the 12th-century philosopher Ibn Rushd, known to us as Averroes, and the Prophet of Islam himself, the “excellent example” for humanity: Muhammad.

I know that for many of our contemporaries, the possibility that something written in the seventh century of our era could prompt reflection somewhere around the twelfth century, which in turn continues to be discussed and practiced to this day, is scandalous. Yes, I also know that drawing parallels between historical and cultural realities that at first glance are so far apart may look like forcing the thought process. But why not force it? Something amusing or useful might come of it. Besides, classical Muslim educational literature says that things do not always work “with gentleness, with kindness”. There, in the “midwifing” of the thought process, known among philosophers as Socratic maieutics, the rod and the shoe are indispensable helpers.

Let us begin with the more recent evidence.

At the beginning of June this year, Europol came out with an interesting piece of news. Following a coordinated operation led by the Bulgarian authorities, a criminal network that smuggled mostly Syrian migrants from Turkey to Western Europe was dismantled. According to Europol, anyone who wants to travel illegally from Syria to the European Union through this channel pays a sum of around 8,500 euros. Migrants typically pay between 2,000 and 2,500 euros to be transported from Syria to Turkey. After a stay of several months in Turkey, they pay between 5,000 and 6,000 euros to enter the European Union.

We go back a little further and read a 2023 press release from the US Department of the Treasury and its Office of Foreign Assets Control. The US and Turkey are taking joint action to cut off the financing of ISIS. Because, you will agree, all the channels through which the now-defunct entity, which laid claim to a caliphate in the sense of Islamic sacred law (shari‘a), sells archaeological artifacts to finance itself cannot possibly be transparent. But they are real.

In 2019, meanwhile, six people were detained in Bulgaria for financing terrorism; they were in the business of shipping cars for sale in the Middle East. Five of them are Syrians, the local media claim, and one is a Bulgarian woman. Her task, according to the Arabist Prof. Chukov, whom we often see these days in connection with the war in Gaza, is to look for intermediaries and buyers, while the other participants assist with transporting the cars.

What do these few news items have in common?

I once had an English director from Newcastle who always answered that kind of question with “a lot”. And that was it. It was very annoying. We can easily draw formal parallels. For example, how are the Earth and an orange alike? In their round shape. How are the Muslim religious school (madrasa) and the Western university alike? Both are educational institutions that arose in the Middle Ages. How are the Qur’an and the Bible alike? Well, both are sacred scriptures. But as another former boss of mine, from Venezuela, used to say in the spirit of corporate cliché, “People, let’s compare apples with apples.”

Naturally, we can say that these are activities beyond what the law allows. That is an easy analogy. We can also say that these are transnational criminal networks. That is not hard to conclude either. We can link them to the Middle East as well. Not that interesting. But here I propose that we think about them through something more concrete, less abstract: the problem of moving the money. Surely we do not imagine Syrian refugees diligently filling in a payment order for 8,500 euros with the payment reason given as “Smuggling fee Syria–EU”? Or some British dealer in historical antiquities wiring a few tens of thousands of dollars from his bank account to a recipient named Abu Bakr al-Baghdadi, caliph, in Raqqa, Syria, with the reason “Transfer for purchase of an antique statue”?

For that there is another mechanism, very convenient and sufficiently opaque. Islamic law and practice have coined a term for it: hawala. And it needs no formal banking system.

Hawala in Arabic comes from the triliteral root made up of the letters ha’-waw-lam. It denotes things such as “transfer”, “transformation”, “conversion”, “turning over”.

Almost like the “churning” of cases and the “rolling over of the files” of former justice minister Danail Kirilov. Hence also a monetary “transfer”, but not quite. Because in its technical meaning, according to the theory and practice of Muslim law and finance, it means the following.

Let’s say I am Hasan, a Bulgarian citizen from the Roma neighborhood in Pazardzhik and a recent convert to Islam, who sees himself as part of the globalized Muslim community (umma) around the world. I use social media, I have taken a course in classical Arabic, the Qur’an, the Sunna and law (fiqh) in Jordan, I regard traditional Islam in Bulgaria as infidel because Muslims in Bulgaria drink rakia and eat pork, and I have my own network of international contacts. I want to send money to Ahmed, a teacher at a religious school (madrasa) in Kandahar, Afghanistan, so that he can buy new Qur’ans or fund a kitchen for the poor, a so-called imaret, which is also where the name of the second mosque in Plovdiv, the Imaret Mosque, comes from. Or I want to send money for Kalashnikovs made in Kazanlak, for the pious spiritual struggle and supreme effort (jihad) of the Taliban against the infidels from the US on the ground.

This is where the figure of, say, Musa from Pazardzhik comes into play. He is known in informal circles as a hawaladar, i.e. a hawala intermediary. A broker. An agent. I go to Musa and tell him that I want to send 250 dollars to Ahmed in Kandahar.

And Musa, imagine, turns over money from a small business, a shop for halal meat and dairy products. (Like the one at the entrance to the neighborhood in Pazardzhik where, a few years ago, under a plastic clock bearing a quotation from sura 112 of the Qur’an, I asked the saleswoman, “Do you have veal liver?” And she answered, “We do, and it’s even from a calf!”) Meanwhile, however, Musa, imagined for the purposes of this example, turns out to be a seasoned international operator. He knows another hawaladar in Kandahar named Mustafa. I give Musa 250 dollars and tell him a code word. I send the code word to Ahmed as well. Musa contacts Mustafa and tells him the code word. Ahmed goes to Mustafa and, in exchange for the code word, receives 250 dollars. A small commission is also paid for the service.

After that, it turns out that Musa owes Mustafa 250 dollars. And since they are good acquaintances, have known each other for many years, and have eaten a lot of hummus, chicken and doner kebabs together, they settle their debts whenever and however they can along the Pazardzhik–Kandahar line. There is no actual physical transfer of money. Trust is decisive. Because Musa and Mustafa maintain contacts with other “colleagues” of theirs in Mumbai, Karachi, Istanbul, Cairo, Marrakesh and dozens of other places. There are no traces, no record of the transfer. At most, some hawaladar’s notebook might contain a table of notes with code names of clients, other colleagues, dates, times, amounts. Say, “Abu Mazen (the code name for Mustafa in our made-up story), 6 October 2025, 250 dollars”.

And how the hawaladars settle their accounts afterwards is their own business. Whether at some point in the indefinite future they send each other cash by post in an envelope tucked between the pages of a Qur’an, whether they stuff banknotes into a frozen cut of beef, send them by camel, pigeon or duck (as in the film “Mission London”), use an ordinary bank transfer with the reason “transfer” on some other occasion, or whether one buys the other goods or services of the corresponding value, does not concern us. It does not concern Hasan from Pazardzhik or Ahmed from Kandahar either. Just as, at the moment I am writing this text, I am not concerned with how the word processor’s software, the social networks’ servers, my mobile phone’s operating system or the SWIFT network for interbank operations work.

This is how international money exchange works under the traditional hawala system. A typical example of the transfer of a monetary obligation. And this, as is clear from the speculative and anecdotal example above, opens up room for many transactions in the dark and for money laundering. The examples are countless; just look at the 2023 report by the UN Office on Drugs and Crime on the drug trade in Afghanistan¹.

Hawala even falls within the field of academic pursuits of Bulgaria’s (now former) Prosecutor General Ivan Geshev himself.

In 2020 he, then a doctoral student in international law and relations, together with co-author Nikolay Marin, published the article “The ‘Havala’ System: Between Customary Law and Organized Crime”. Once havala, the other common reading of the original Arabic hawala, has been preferred, the “brokers” become havaladars. Curiously, the practice is compared with blockchain technologies and bitcoin because of the decentralization, the peer-to-peer link and the anonymization of the transaction. An example is given of the uncovering in January 2019 of a network of individuals taking part in hawala “to conceal particularly serious crimes”. This was reportedly considered the first encounter of the law enforcement authorities with the phenomenon, and in the context of organized crime at that. The fact that logistical support was thereby being provided to terrorist organizations is counted as an aggravating circumstance. Clearly, this violates not only banking legislation but also the laws on credit institutions, on cash payments and on carrying cash across borders. That is why the practice is defined as “a phenomenon with a high degree of public danger”.

I admit I have some difficulty thinking of the former Prosecutor General in his purely academic capacity, but I read his doctoral publication with curiosity.

Clearly, Bulgaria’s position as a transit territory on the route between the Near and Middle East and Western Europe invites interest in the topical aspects of money flows. According to Gergana Yordanova, author of a monograph on the subject², hawala is hawala, but cheese still costs money. If we are to play with the Bulgarian proverb, you can use the first to buy the second. And not just cheese. If you want to treat a friend or help relatives, something like a five-lev note in a greeting card sent via Bulgarian Posts (don’t do that), you can also use a hawaladar. According to the Bulgarian researcher, this falls into the category of “white hawala”: household support, education, medical treatment, aid for migrants, secular and religious events, pilgrimage (hajj), alms, publishing. All those cases in which the lawful origin and purpose of the money can be proven. But in this “yin-yang” of shari‘a transactions it is logical to expect “black hawala” too. There things get rough. Tourist services, airline ticket agencies, car trading, taxis, leasing, currency exchange, trafficking in people and organs, trade in jewelry and drugs³.

But enough doomscrolling. That lovely term denotes our weakness for insatiably scrolling through ominous content on the screens of our devices. We do not want to dig too deep into the hot aspects of hawala. Other authoritative voices have written and continue to write about them, including even, as we have seen, former prosecutors general of Bulgaria. Clearly, hawala falls within the purview of legislation and of anti-money-laundering efforts, and so attracts the attention of the authorities. But let us set current affairs aside. I suggest we take a few steps back in time.

Where are the old religious foundations of hawala?

(To be continued.)

1 UNODC, The Hawala System: Its operations and misuse by opiate traffickers and migrant smugglers, 2023.

2 Yordanova, Gergana. Hawala: Nature. Typologies of Money Laundering, Terrorist Financing and Proliferation (in Bulgarian). Sofia: Institute for National and International Security Foundation, 2023.

3 Yordanova, Gergana. Distinguishing between “black” and “white” hawala transactions in the context of countering money laundering (in Bulgarian). Security and Defence, 2023, No. 1, pp. 129–143.

In the “Orient Café” column, Atanas Shinikov serves up intriguing topics connected not so much with hot politics as with the history and culture of the Middle East. And the Middle East, ancient and modern, is closer to us and our present than we imagine.

How we found a bug in Go’s arm64 compiler

Post Syndicated from Thea Heinen original https://blog.cloudflare.com/how-we-found-a-bug-in-gos-arm64-compiler/

Every second, 84 million HTTP requests hit Cloudflare across our fleet of data centers in 330 cities. At that scale, even the rarest of bugs can show up frequently. In fact, it was our scale that recently led us to discover a bug in Go’s arm64 compiler that causes a race condition in the generated code.

This post breaks down how we first encountered the bug, investigated it, and ultimately drove to the root cause.

Investigating a strange panic

We run a service in our network which configures the kernel to handle traffic for some products like Magic Transit and Magic WAN. Our monitoring watches this closely, and it started to observe very sporadic panics on arm64 machines.

We first saw one with a fatal error stating that traceback did not unwind completely. That error suggests that invariants were violated when traversing the stack, likely because of stack corruption. After a brief investigation we decided that it was probably rare stack memory corruption. This was a largely idle control plane service where unplanned restarts have negligible impact, and so we felt that following up was not a priority unless it kept happening.

And then it kept happening. 

[Figure: coredumps per hour]

When we first looked into this bug, we noticed that the fatal errors correlated with recovered panics. These were caused by some old code which used panic/recover as error handling.

At this point, our theory was: 

  1. All of the fatal panics happen within stack unwinding.

  2. We correlated an increased volume of recovered panics with these fatal panics.

  3. Recovering a panic unwinds goroutine stacks to call deferred functions.

  4. A related Go issue (#73259) reported an arm64 stack unwinding crash.

  5. Let’s stop using panic/recover for error handling and wait out the upstream fix?

So we did that and watched as fatal panics stopped occurring as the release rolled out. Fatal panics gone, our theoretical mitigation seemed to work, and this was no longer our problem. We subscribed to the upstream issue so we could update when it was resolved and put it out of our minds.
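For readers unfamiliar with the pattern, the kind of change described here can be sketched as follows. This is a minimal illustration, not the service's actual code: the function names are hypothetical. The first style signals failure with panic and converts it back with recover, which makes the runtime unwind the goroutine stack on every failure; the second returns the error as an ordinary value.

```go
package main

import (
	"errors"
	"fmt"
)

// parseWithPanic mimics the older style: it signals failure by panicking.
func parseWithPanic(s string) int {
	if s == "" {
		panic(errors.New("empty input"))
	}
	return len(s)
}

// recoverStyle turns the panic back into an error using recover.
// Every recovered panic unwinds the goroutine stack to run deferred calls.
func recoverStyle(s string) (n int, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	return parseWithPanic(s), nil
}

// errorStyle reports the same failure as a plain return value,
// so no panic-driven stack unwinding is involved.
func errorStyle(s string) (int, error) {
	if s == "" {
		return 0, errors.New("empty input")
	}
	return len(s), nil
}

func main() {
	_, err1 := recoverStyle("")
	_, err2 := errorStyle("")
	fmt.Println(err1)
	fmt.Println(err2)
}
```

Both functions report the failure to the caller; only the first one exercises the unwinding path on every error.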

But this turned out to be a much stranger bug than expected. Putting it out of our minds was premature, as the same class of fatal panics came back at a much higher rate. A month later, we were seeing up to 30 fatal panics a day with no discernible cause; while that might account for only one machine a day in less than 10% of our data centers, we found it concerning that we didn’t understand the cause. The first thing we checked was the number of recovered panics, to match our previous pattern, but there were none. More interestingly, we could not correlate this increased rate of fatal panics with anything. A release? Infrastructure changes? The position of Mars?

At this point we felt like we needed to dive deeper to better understand the root cause. Pattern matching and hoping was clearly insufficient. 

We saw two classes of this bug — a crash while accessing invalid memory and an explicitly checked fatal error. 

Fatal Error

goroutine 153 gp=0x4000105340 m=324 mp=0x400639ea08 [GC worker (active)]:
/usr/local/go/src/runtime/asm_arm64.s:244 +0x6c fp=0x7ff97fffe870 sp=0x7ff97fffe860 pc=0x55558d4098fc
runtime.systemstack(0x0)
       /usr/local/go/src/runtime/mgc.go:1508 +0x68 fp=0x7ff97fffe860 sp=0x7ff97fffe810 pc=0x55558d3a9408
runtime.gcBgMarkWorker.func2()
       /usr/local/go/src/runtime/mgcmark.go:1102
runtime.gcDrainMarkWorkerIdle(...)
       /usr/local/go/src/runtime/mgcmark.go:1188 +0x434 fp=0x7ff97fffe810 sp=0x7ff97fffe7a0 pc=0x55558d3ad514
runtime.gcDrain(0x400005bc50, 0x7)
       /usr/local/go/src/runtime/mgcmark.go:212 +0x1c8 fp=0x7ff97fffe7a0 sp=0x7ff97fffe6f0 pc=0x55558d3ab248
runtime.markroot(0x400005bc50, 0x17e6, 0x1)
       /usr/local/go/src/runtime/mgcmark.go:238 +0xa8 fp=0x7ff97fffe6f0 sp=0x7ff97fffe6a0 pc=0x55558d3ab578
runtime.markroot.func1()
       /usr/local/go/src/runtime/mgcmark.go:887 +0x290 fp=0x7ff97fffe6a0 sp=0x7ff97fffe560 pc=0x55558d3acaa0
runtime.scanstack(0x4014494380, 0x400005bc50)
       /usr/local/go/src/runtime/traceback.go:447 +0x2ac fp=0x7ff97fffe560 sp=0x7ff97fffe4d0 pc=0x55558d3eeb7c
runtime.(*unwinder).next(0x7ff97fffe5b0?)
       /usr/local/go/src/runtime/traceback.go:566 +0x110 fp=0x7ff97fffe4d0 sp=0x7ff97fffe490 pc=0x55558d3eed40
runtime.(*unwinder).finishInternal(0x7ff97fffe4f8?)
       /usr/local/go/src/runtime/panic.go:1073 +0x38 fp=0x7ff97fffe490 sp=0x7ff97fffe460 pc=0x55558d403388
runtime.throw({0x55558de6aa27?, 0x7ff97fffe638?})
runtime stack:
fatal error: traceback did not unwind completely
       stack=[0x4015d6a000-0x4015d8a000
runtime: g8221077: frame.sp=0x4015d784c0 top=0x4015d89fd0

Segmentation fault

goroutine 187 gp=0x40003aea80 m=13 mp=0x40003ca008 [GC worker (active)]:
       /usr/local/go/src/runtime/asm_arm64.s:244 +0x6c fp=0x7fff2afde870 sp=0x7fff2afde860 pc=0x55557e2d98fc
runtime.systemstack(0x0)
       /usr/local/go/src/runtime/mgc.go:1489 +0x94 fp=0x7fff2afde860 sp=0x7fff2afde810 pc=0x55557e279434
runtime.gcBgMarkWorker.func2()
       /usr/local/go/src/runtime/mgcmark.go:1112
runtime.gcDrainMarkWorkerDedicated(...)
       /usr/local/go/src/runtime/mgcmark.go:1188 +0x434 fp=0x7fff2afde810 sp=0x7fff2afde7a0 pc=0x55557e27d514
runtime.gcDrain(0x4000059750, 0x3)
       /usr/local/go/src/runtime/mgcmark.go:212 +0x1c8 fp=0x7fff2afde7a0 sp=0x7fff2afde6f0 pc=0x55557e27b248
runtime.markroot(0x4000059750, 0xb8, 0x1)
       /usr/local/go/src/runtime/mgcmark.go:238 +0xa8 fp=0x7fff2afde6f0 sp=0x7fff2afde6a0 pc=0x55557e27b578
runtime.markroot.func1()
       /usr/local/go/src/runtime/mgcmark.go:887 +0x290 fp=0x7fff2afde6a0 sp=0x7fff2afde560 pc=0x55557e27caa0
runtime.scanstack(0x40042cc000, 0x4000059750)
       /usr/local/go/src/runtime/traceback.go:458 +0x188 fp=0x7fff2afde560 sp=0x7fff2afde4d0 pc=0x55557e2bea58
runtime.(*unwinder).next(0x7fff2afde5b0)
goroutine 0 gp=0x40003af880 m=13 mp=0x40003ca008 [idle]:
PC=0x55557e2bea58 m=13 sigcode=1 addr=0x118
SIGSEGV: segmentation violation

Now we could observe some clear patterns. Both errors occur when unwinding the stack in (*unwinder).next. In one case we saw an intentional fatal error as the runtime identified that unwinding could not complete and the stack was in a bad state. In the other case there was a direct memory access error that happened while trying to unwind the stack. The segfault was discussed in the GitHub issue, and a Go engineer identified it as a dereference of a Go scheduler struct, m, when unwinding.

A review of Go scheduler structs

Go uses a lightweight userspace scheduler to manage concurrency. Many goroutines are scheduled on a smaller number of kernel threads – this is often referred to as M:N scheduling. Any individual goroutine can be scheduled on any kernel thread. The scheduler has three core types – g (the goroutine), m (the kernel thread, or “machine”), and p (the execution context, or “processor”). For a goroutine to be scheduled, a free m must acquire a free p, which will execute a g. Each g contains a field for its m if it is currently running; otherwise that field will be nil. This is all the context needed for this post, but the Go runtime docs explore this more comprehensively.
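As a rough illustration of M:N scheduling from the user's side, the sketch below starts far more goroutines than there are Ps and lets the scheduler multiplex them. The helper name runMany is hypothetical; runtime.GOMAXPROCS(0) only queries the current number of Ps without changing it.

```go
package main

import (
	"fmt"
	"runtime"
	"sync"
)

// runMany starts n goroutines (g) and waits for all of them to finish.
// It returns how many ran and how many Ps were available to run them.
func runMany(n int) (finished int, procs int) {
	var wg sync.WaitGroup
	var mu sync.Mutex
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			mu.Lock()
			finished++
			mu.Unlock()
		}()
	}
	wg.Wait()
	return finished, runtime.GOMAXPROCS(0)
}

func main() {
	finished, procs := runMany(10000)
	fmt.Printf("%d goroutines ran on at most %d Ps\n", finished, procs)
}
```

The number of goroutines is limited mainly by memory, while the number of Ps caps how many of them execute Go code at the same instant.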

At this point we can start to make inferences on what’s happening: the program crashes because we try to unwind a goroutine stack which is invalid. In the first backtrace, if a return address is null, we call finishInternal and abort because the stack was not fully unwound. The segmentation fault case in the second backtrace is a bit more interesting: if instead the return address is non-zero but not a function then the unwinder code assumes that the goroutine is currently running. It’ll then dereference m and fault by accessing m.incgo (the offset of incgo into struct m is 0x118, the faulting memory access).

What, then, is causing this corruption? The traces were difficult to get anything useful from – our service has hundreds if not thousands of active goroutines. It was fairly clear from the beginning that the panic was remote from the actual bug. The crashes were all observed while unwinding the stack and if this were an issue any time the stack was unwound on arm64 we would be seeing it in many more services. We felt pretty confident that the stack unwinding was happening correctly but on an invalid stack. 

Our investigation stalled for a while at this point – making guesses, testing guesses, trying to infer if the panic rate went up or down, or if nothing changed. There was a known issue on Go’s GitHub issue tracker which matched our symptoms almost exactly, but what they discussed was mostly what we already knew. At some point when looking through the linked stack traces we realized that their crash referenced an old version of a library that we were also using – Go Netlink.

goroutine 1267 gp=0x4002a8ea80 m=nil [runnable (scan)]:
runtime.asyncPreempt2()
        /usr/local/go/src/runtime/preempt.go:308 +0x3c fp=0x4004cec4c0 sp=0x4004cec4a0 pc=0x46353c
runtime.asyncPreempt()
        /usr/local/go/src/runtime/preempt_arm64.s:47 +0x9c fp=0x4004cec6b0 sp=0x4004cec4c0 pc=0x4a6a8c
github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0x14360300000000?)
        /go/pkg/mod/github.com/!data!dog/[email protected]/nl/nl_linux.go:803 +0x130 fp=0x4004cfc710 sp=0x4004cec6c0 pc=0xf95de0

We spot-checked a few stack traces and confirmed the presence of this Netlink library. Querying our logs showed that not only did we share a library – every single segmentation fault we observed had happened while preempting NetlinkSocket.Receive.

What’s (async) preemption?

In the prehistoric era of Go (<=1.13) the runtime was cooperatively scheduled. A goroutine would run until it decided it was ready to yield to the scheduler – usually due to explicit calls to runtime.Gosched() or injected yield points at function calls/IO operations. Since Go 1.14 the runtime instead does async preemption. The Go runtime has a thread, sysmon, which tracks how long goroutines have been running and will preempt any that run for longer than 10ms (at time of writing). It does this by sending SIGURG to the OS thread; the signal handler then modifies the program counter and stack to mimic a call to asyncPreempt.

At this point we had two broad theories:

  • This is a Go Netlink bug – likely due to unsafe.Pointer usage which invoked undefined behavior but is only actually broken on arm64

  • This is a Go runtime bug and we’re only triggering it in NetlinkSocket.Receive for some reason

After finding the same bug publicly reported upstream, we were feeling confident this was caused by a Go runtime bug. However, upon seeing that both issues implicated the same function, we felt more skeptical – notably the Go Netlink library uses unsafe.Pointer so memory corruption was a plausible explanation even if we didn’t understand why.

After an unsuccessful code audit we had hit a wall. The crashes were rare and remote from the root cause. Maybe these crashes were caused by a runtime bug, maybe they were caused by a Go Netlink bug. It seemed clear that there was something wrong with this area of the code, but code auditing wasn’t going anywhere.

Breakthrough

At this point we had a fairly good understanding of what was crashing but very little understanding of why it was happening. It was clear that the root cause of the stack unwinder crashing was remote from the actual crash, and that it had to do with (*NetlinkSocket).Receive, but why? We were able to capture a coredump of a production crash and view it in a debugger. The backtrace confirmed what we already knew – that there was a segmentation fault when unwinding a stack. The crux of the issue revealed itself when we looked at the goroutine which had been preempted while calling (*NetlinkSocket).Receive.

(dlv) bt
0  0x0000555577579dec in runtime.asyncPreempt2
   at /usr/local/go/src/runtime/preempt.go:306
1  0x00005555775bc94c in runtime.asyncPreempt
   at /usr/local/go/src/runtime/preempt_arm64.s:47
2  0x0000555577cb2880 in github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive
   at /vendor/github.com/vishvananda/netlink/nl/nl_linux.go:779
3  0x0000555577cb19a8 in github.com/vishvananda/netlink/nl.(*NetlinkRequest).Execute
   at /vendor/github.com/vishvananda/netlink/nl/nl_linux.go:532
4  0x0000555577551124 in runtime.heapSetType
   at /usr/local/go/src/runtime/mbitmap.go:714
5  0x0000555577551124 in runtime.heapSetType
   at /usr/local/go/src/runtime/mbitmap.go:714
...
(dlv) disass -a 0x555577cb2878 0x555577cb2888
TEXT github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(SB) /vendor/github.com/vishvananda/netlink/nl/nl_linux.go
        nl_linux.go:779 0x555577cb2878  fdfb7fa9        LDP -8(RSP), (R29, R30)
        nl_linux.go:779 0x555577cb287c  ff430191        ADD $80, RSP, RSP
        nl_linux.go:779 0x555577cb2880  ff434091        ADD $(16<<12), RSP, RSP
        nl_linux.go:779 0x555577cb2884  c0035fd6        RET

The goroutine was paused between two opcodes in the function epilogue. Since the process of unwinding a stack relies on the stack frame being in a consistent state, it felt immediately suspicious that we preempted in the middle of adjusting the stack pointer. The goroutine had been paused at 0x555577cb2880, between ADD $80, RSP, RSP and ADD $(16<<12), RSP, RSP.

We queried the service logs to confirm our theory. This wasn’t isolated – the majority of stack traces showed that this same opcode was preempted. This was no longer a weird production crash we couldn’t reproduce. A crash happened when the Go runtime preempted between these two stack pointer adjustments. We had our smoking gun. 
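To make the inconsistent state concrete, the arithmetic of the two-step epilogue can be worked through in a few lines. This is only a model of the two ADD instructions from the disassembly; the starting stack pointer value in main is made up for illustration, and the function name is hypothetical.

```go
package main

import "fmt"

// Immediates taken from the disassembled epilogue of Receive:
// ADD $80, RSP, RSP followed by ADD $(16<<12), RSP, RSP.
const (
	lowPart  = 80
	highPart = 16 << 12
)

// epilogueSP returns the stack pointer after the first ADD and after the
// second ADD, given the stack pointer on entry to the epilogue.
func epilogueSP(sp uint64) (afterFirst, afterSecond uint64) {
	afterFirst = sp + lowPart
	afterSecond = afterFirst + highPart
	return afterFirst, afterSecond
}

func main() {
	const sp = 0x4000100000 // illustrative value, not from the coredump
	mid, final := epilogueSP(sp)
	fmt.Printf("total adjustment: %d bytes\n", lowPart+highPart)
	fmt.Printf("after first ADD:  %#x\n", mid)
	fmt.Printf("after second ADD: %#x\n", final)
	fmt.Printf("still to release when paused in between: %d bytes\n", final-mid)
}
```

A goroutine paused after the first ADD has a stack pointer that matches neither the function's own frame nor its caller's, which is the state the unwinder then has to interpret.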

Building a minimal reproducer

At this point we felt pretty confident that this was actually just a runtime bug and it should be reproducible in an isolated environment without any dependencies. The theory at this point was:

  1. Stack unwinding is triggered by garbage collection

  2. Async preemption between a split stack pointer adjustment causes a crash

  3. What if we make a function which splits the adjustment and then call it in a loop?

package main

import (
	"runtime"
)

//go:noinline
func big_stack(val int) int {
	var big_buffer = make([]byte, 1 << 16)

	sum := 0
	// prevent the compiler from optimizing out the stack
	for i := 0; i < (1<<16); i++ {
		big_buffer[i] = byte(val)
	}
	for i := 0; i < (1<<16); i++ {
		sum ^= int(big_buffer[i])
	}
	return sum
}

func main() {
	go func() {
		for {
			runtime.GC()
		}
	}()
	for {
		_ = big_stack(1000)
	}
}

This function ends up with a stack frame slightly larger than can be represented in 16 bits, and so on arm64 the Go compiler will split the stack pointer adjustment into two opcodes. If the runtime preempts between these opcodes then the stack unwinder will read an invalid stack pointer and crash. 

; epilogue for main.big_stack
ADD $8, RSP, R29
ADD $(16<<12), R29, R29
ADD $16, RSP, RSP
; preemption is problematic between these opcodes
ADD $(16<<12), RSP, RSP
RET

After running this for a few minutes the program panicked as expected!

SIGSEGV: segmentation violation
PC=0x60598 m=8 sigcode=1 addr=0x118

goroutine 0 gp=0x400019c540 m=8 mp=0x4000198708 [idle]:
runtime.(*unwinder).next(0x400030fd10)
        /home/thea/sdk/go1.23.4/src/runtime/traceback.go:458 +0x188 fp=0x400030fcc0 sp=0x400030fc30 pc=0x60598
runtime.scanstack(0x40000021c0, 0x400002f750)
        /home/thea/sdk/go1.23.4/src/runtime/mgcmark.go:887 +0x290 

[...]

goroutine 1 gp=0x40000021c0 m=nil [runnable (scan)]:
runtime.asyncPreempt2()
        /home/thea/sdk/go1.23.4/src/runtime/preempt.go:308 +0x3c fp=0x40003bfcf0 sp=0x40003bfcd0 pc=0x400cc
runtime.asyncPreempt()
        /home/thea/sdk/go1.23.4/src/runtime/preempt_arm64.s:47 +0x9c fp=0x40003bfee0 sp=0x40003bfcf0 pc=0x75aec
main.big_stack(0x40003cff38?)
        /home/thea/dev/stack_corruption_reproducer/main.go:29 +0x94 fp=0x40003cff00 sp=0x40003bfef0 pc=0x77c04
Segmentation fault (core dumped)

real    1m29.165s
user    4m4.987s
sys     0m43.212s

A reproducible crash with standard library only? This felt like conclusive evidence that our problem was a runtime bug.

This was an extremely particular reproducer! Even now with a good understanding of the bug and its fix, some of the behavior is still puzzling. It’s a one-instruction race condition, so it’s unsurprising that small changes could have large impact. For example, this reproducer was originally written and tested on Go 1.23.4, but did not crash when compiled with 1.23.9 (the version in production), even though we could objdump the binary and see the split ADD still present! We don’t have a definite explanation for this behavior – even with the bug present there remain a few unknown variables which affect the likelihood of hitting the race condition. 

A single-instruction race condition window

arm64 is a fixed-length 4-byte instruction set architecture. This has a lot of implications for codegen, but most relevant to this bug is the fact that immediate length is limited: ADD gets a 12-bit immediate, MOV gets a 16-bit immediate, etc. How does the architecture handle this when the operands don’t fit? It depends. ADD in particular reserves a bit for “shift left by 12”, so any 24-bit addition can be decomposed into two opcodes. Other instructions are decomposed similarly, or just require loading an immediate into a register first.
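As a rough illustration of that decomposition, the sketch below splits an immediate into the two 12-bit fields that a pair of ADD instructions can carry. The function name splitAddImm is made up for this example and is not part of the Go toolchain; the value 65552 is simply 16 + (16<<12), the two immediates visible in the epilogue listing.

```go
package main

import "fmt"

// splitAddImm is a hypothetical helper (not from the Go assembler). It splits
// an immediate of up to 24 bits into the low 12 bits, which fit a plain ADD,
// and the high 12 bits, which fit an ADD with "shift left by 12".
func splitAddImm(imm uint32) (lo, hi uint32, ok bool) {
	if imm >= 1<<24 {
		return 0, 0, false // too wide for two ADD instructions
	}
	return imm & 0xfff, imm >> 12, true
}

func main() {
	lo, hi, ok := splitAddImm(65552)
	fmt.Println(lo, hi, ok)          // 16 16 true
	fmt.Println(lo+hi<<12 == 65552) // true: the two parts add back up
}
```

The two results correspond to ADD $16 and ADD $(16<<12) in the disassembly.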

The very last step of the Go compiler before emitting machine code involves transforming the program into obj.Prog structs. It’s a very low level intermediate representation (IR) that mostly serves to be translated into machine code. 

//https://github.com/golang/go/blob/fa2bb342d7b0024440d996c2d6d6778b7a5e0247/src/cmd/internal/obj/arm64/obj7.go#L856

// Pop stack frame.
// ADD $framesize, RSP, RSP
p = obj.Appendp(p, c.newprog)
p.As = AADD
p.From.Type = obj.TYPE_CONST
p.From.Offset = int64(c.autosize)
p.To.Type = obj.TYPE_REG
p.To.Reg = REGSP
p.Spadj = -c.autosize

Notably, this IR is not aware of immediate length limitations. Those are handled later, in asm7.go, when the obj.Prog representation is translated into arm64 machine code. The assembler classifies an immediate in conclass based on its bit size and uses that class when emitting instructions, adding extra instructions if needed.

The Go assembler uses a (mov, add) pair of opcodes for some adds whose immediates fit in 16 bits, and prefers an (add, add + lsl 12) pair for immediates wider than 16 bits.

Compare a stack of (slightly larger than) 1<<15:

; //go:noinline
; func big_stack() byte {
; 	var big_stack = make([]byte, 1<<15)
; 	return big_stack[0]
; }
MOVD $32776, R27
ADD R27, RSP, R29
MOVD $32784, R27
ADD R27, RSP, RSP
RET

With a stack of 1<<16:

; //go:noinline
; func big_stack() byte {
; 	var big_stack = make([]byte, 1<<16)
; 	return big_stack[0]
; } 
ADD $8, RSP, R29
ADD $(16<<12), R29, R29
ADD $16, RSP, RSP
ADD $(16<<12), RSP, RSP
RET

In the larger stack case, there is a point between ADD x, RSP, RSP opcodes where the stack pointer is not pointing to the tip of a stack frame. We thought at first that this was a matter of memory corruption – that in handling async preemption the runtime would push a function call on the stack and corrupt the middle of the stack. However, this goroutine is already in the function epilogue – any data we corrupt is actively in the process of being thrown away. What’s the issue then?
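To make that intermediate state concrete, here is a small Go sketch that replays the two stack pointer adjustments and prints the value of sp after each one. The starting address and the helper name spAfterEachOp are invented for the example; only the two immediates come from the listing.

```go
package main

import "fmt"

// spAfterEachOp returns the stack pointer value visible after each
// instruction that adds one of the given immediates to sp. Each entry is a
// point at which an asynchronous preemption could observe the register.
func spAfterEachOp(sp uint64, imms ...uint64) []uint64 {
	seen := make([]uint64, 0, len(imms))
	for _, imm := range imms {
		sp += imm
		seen = append(seen, sp)
	}
	return seen
}

func main() {
	const frameSize = 16 + 16<<12 // 65552, the sum of the two immediates
	sp := uint64(0x4000100000)    // made-up starting stack pointer

	// ADD $16, RSP, RSP followed by ADD $(16<<12), RSP, RSP.
	for _, v := range spAfterEachOp(sp, 16, 16<<12) {
		fmt.Printf("sp=%#x fully adjusted=%t\n", v, v == sp+frameSize)
	}
}
```

After the first instruction, sp is neither the old frame's stack pointer nor the caller's. That is the single-instruction window.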

The Go runtime often needs to unwind the stack, which means walking backwards through the chain of function calls. For example, garbage collection uses it to find live references on the stack, panicking relies on it to evaluate defer functions, and generating stack traces needs it to print the call stack. For this to work, the stack pointer must be accurate during unwinding, because the Go runtime dereferences sp to determine the calling function. If the stack pointer is partially modified, the unwinder will look for the calling function in the middle of the stack. The underlying data is meaningless when interpreted as directions to a parent stack frame, so the runtime will likely crash.

//https://github.com/golang/go/blob/66536242fce34787230c42078a7bbd373ef8dcb0/src/runtime/traceback.go#L373

if innermost && frame.sp < frame.fp || frame.lr == 0 {
    lrPtr = frame.sp
    frame.lr = *(*uintptr)(unsafe.Pointer(lrPtr))
}

When async preemption happens it will push a function call onto the stack but the parent stack frame is no longer correct because sp was only partially adjusted when the preemption happened. The crash flow looks something like this:

  1. Async preemption happens between the two opcodes that add x, rsp expands to

  2. Garbage collection triggers stack unwinding (to check for heap object liveness)

  3. The unwinder starts traversing the stack of the problematic goroutine and correctly unwinds up to the problematic function

  4. The unwinder dereferences sp to determine the parent function

  5. Almost certainly the data behind sp is not a function

  6. Crash
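The steps above can be sketched with a toy model. This is not how the runtime's unwinder is implemented: the stack is reduced to a map of addresses to words, the function table to a single made-up return address, and every name and address here is hypothetical. It only shows why reading the word at a partially adjusted sp fails to yield a parent function.

```go
package main

import "fmt"

// toyStack models stack memory as address -> 8-byte word. It is a simplified
// stand-in for a real goroutine stack.
type toyStack map[uint64]uint64

// funcs stands in for the runtime's PC lookup tables. The address is made up.
var funcs = map[uint64]string{0x77d20: "main.main"}

// callerOf treats the word at sp as the saved return address and looks up
// which function it belongs to, loosely mirroring the snippet above.
func callerOf(stack toyStack, sp uint64) (string, bool) {
	name, ok := funcs[stack[sp]]
	return name, ok
}

func main() {
	base := uint64(0x4000100000) // sp of the large frame (made up)
	stack := toyStack{
		base:      0x77d20,    // saved return address into the caller
		base + 16: 0xdeadbeef, // arbitrary bytes inside the big buffer
	}

	name, ok := callerOf(stack, base)
	fmt.Printf("correct sp: %q %t\n", name, ok)

	name, ok = callerOf(stack, base+16) // sp after only the first ADD
	fmt.Printf("partial sp: %q %t\n", name, ok)
}
```

In the toy model the bad lookup just returns false. In the real runtime the garbage value is used as an address, which is where the segmentation fault comes from.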


We saw earlier a faulting stack trace which ended in (*NetlinkSocket).Receive – in this case stack unwinding faulted while it was trying to determine the parent frame. 

goroutine 90 gp=0x40042cc000 m=nil [preempted (scan)]:
runtime.asyncPreempt2()
/usr/local/go/src/runtime/preempt.go:306 +0x2c fp=0x40060a25d0 sp=0x40060a25b0 pc=0x55557e299dec
runtime.asyncPreempt()
/usr/local/go/src/runtime/preempt_arm64.s:47 +0x9c fp=0x40060a27c0 sp=0x40060a25d0 pc=0x55557e2dc94c
github.com/vishvananda/netlink/nl.(*NetlinkSocket).Receive(0xff48ce6e060b2848?)
/vendor/github.com/vishvananda/netlink/nl/nl_linux.go:779 +0x130 fp=0x40060b2820 sp=0x40060a27d0 pc=0x55557e9d2880

Once we discovered the root cause, we reported it with a reproducer and the bug was quickly fixed. The fix is included in go1.23.12, go1.24.6, and go1.25.0. Previously, the Go compiler emitted a single add x, rsp instruction and relied on the assembler to split immediates into multiple opcodes as necessary. After this change, stack frames larger than 1<<12 will build the offset in a temporary register and then add that to rsp in a single, indivisible opcode. A goroutine can be preempted before or after the stack pointer modification, but never during. This means that the stack pointer is always valid and there is no race condition.

LDP -8(RSP), (R29, R30)
MOVD $32, R27
MOVK $(1<<16), R27
ADD R27, RSP, RSP
RET
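For readers less familiar with MOVK, the following Go sketch emulates how the fixed epilogue builds the offset in R27 before the single ADD. The movk helper and the starting stack pointer are assumptions made for illustration, based on MOVK replacing one 16-bit field and keeping the other bits.

```go
package main

import "fmt"

// movk emulates arm64 MOVK: it replaces one 16-bit field of reg, selected by
// shift (0, 16, 32 or 48), with imm16 and leaves every other bit unchanged.
func movk(reg uint64, imm16 uint16, shift uint) uint64 {
	mask := uint64(0xffff) << shift
	return reg&^mask | uint64(imm16)<<shift
}

func main() {
	r27 := uint64(32)      // MOVD $32, R27
	r27 = movk(r27, 1, 16) // MOVK $(1<<16), R27
	fmt.Println(r27)       // 65568

	sp := uint64(0x4000100000) // made-up stack pointer
	sp += r27                  // ADD R27, RSP, RSP: the only write to RSP
	fmt.Printf("%#x\n", sp)    // 0x4000110020
}
```

However many instructions it takes to build the constant, they only touch the scratch register, so RSP changes exactly once.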

This was a very fun problem to debug. We don’t often see bugs where you can accurately blame the compiler. Debugging it took weeks and we had to learn about areas of the Go runtime that people don’t usually need to think about. It’s a nice example of a rare race condition, the sort of bug that can only really be quantified at a large scale.

We’re always looking for people who enjoy this kind of detective work. Our engineering teams are hiring.

Security updates for Wednesday

Post Syndicated from jzb original https://lwn.net/Articles/1041243/

Security updates have been issued by Fedora (apptainer, civetweb, mod_http2, openssl, pandoc, and pandoc-cli), Oracle (kernel), Red Hat (gstreamer1-plugins-bad-free, iputils, kernel, open-vm-tools, and podman), SUSE (cairo, firefox, ghostscript, gimp, gstreamer-plugins-rs, libxslt, logback, openssl-1_0_0, openssl-1_1, python-xmltodict, and rubygem-puma), and Ubuntu (gst-plugins-base1.0, linux-aws-6.8, linux-aws-fips, linux-azure, linux-azure-nvidia, linux-gke, linux-nvidia-tegra-igx, and linux-raspi).

Creating community at our 2025 Sri Lanka Global Clubs Partner meetup

Post Syndicated from Sonja Bienert original https://www.raspberrypi.org/blog/creating-community-at-our-2025-sri-lanka-global-clubs-partner-meetup/

The Global Clubs Partner network brings together 52 partners from 45 countries, all working together to positively impact their local communities and open up opportunities for the next generation.

Last month, our Global Clubs Partners came together in Sri Lanka for our annual meetup, celebrating collaboration and community. Hosted alongside the second-ever Coolest Projects Sri Lanka, the gathering brought partners from across the world to share ideas, learn together, and build connections that will carry into the year ahead.

A group of educators at a conference.

Building connections across the network

The in-person meetups are all about strengthening the sense of community among Global Clubs Partners. Ellie Proffitt from the RPF Global Partners team shared why coming together is so important:

“It was wonderful to hear everyone’s experiences and connect over our shared mission. Everyone was so keen to learn from each other, and being able to do that face to face made it even more valuable.”

The Sri Lanka meetup gave partners the space to get to know one another, learn from each other’s experiences, and think together about future goals for their clubs.

Educators collaborate together during a workshop.

Exploring together

The agenda combined presentations, workshops, and plenty of time for discussion. It began with the Raspberry Pi Foundation team giving a presentation on our mission and strategy for partner work and beyond. Afterwards, each organisation introduced themselves and partners showcased their activities and successes. Together we explored how to build and sustain Code Club communities, experimented with creative ways to use AI in clubs, and got hands-on with unplugged activities that make computing accessible in low-tech settings. Partners also shared approaches to adapting content for their local contexts, and finished by developing their own vision and strategy for the next 12 months.

The variety of sessions meant there was something for everyone, whether partners were looking for new teaching ideas, strategic guidance, or inspiration from peers.

“There were so many interesting topics, I wish we had more time to go more in depth. I love being able to speak with other organisations doing similar work. Having the Raspberry Pi Foundation facilitate the connections is an asset.” – Global Partner

Educators collaborate together during a workshop.

Inspiring moments at Coolest Projects Sri Lanka

As part of the meetup, partners had the chance to attend Coolest Projects Sri Lanka 2025, where young people showcased their incredible tech creations. Seeing children proudly present projects ranging from apps to hardware builds was a highlight for many.

“We have picked the top 100 [Coolest Projects] entries and they are here today exhibiting that to all our visitors joining from various parts of the world and we are happy to have representatives from the global Raspberry Pi Foundation family also.” – Prabhath, Code Club mentor and founder of STEMUp Educational Foundation

Educators at Coolest Projects Sri Lanka.

Looking ahead

The team left Sri Lanka with deeper connections, renewed energy, and a shared commitment to making computing education accessible to all.

“I feel part of something bigger — a worldwide movement where kids everywhere are learning to create with technology, not just consume it. Being a Global Partner means we can learn from what’s working in other countries, adapt those ideas for us, and also contribute our own innovations back to the network.” – Global Partner 

Could your organisation become a Global Clubs Partner?

You can find out how your organisation could join our Global Clubs Partner network on the Code Club website, or contact us directly with your questions or ideas about a partnership.

The post Creating community at our 2025 Sri Lanka Global Clubs Partner meetup appeared first on Raspberry Pi Foundation.
