Discover duplicate AWS Config rules for streamlined compliance

Post Syndicated from Aaron Klotnia original https://aws.amazon.com/blogs/security/discover-duplicate-aws-config-rules-for-streamlined-compliance/

Amazon Web Services (AWS) customers use various AWS services to migrate, build, and innovate in the AWS Cloud. To align with compliance requirements, customers need to monitor, evaluate, and detect changes made to AWS resources. AWS Config continuously audits, assesses, and evaluates the configurations of your AWS resources.

AWS Config rules continuously evaluate your AWS resource configurations for desired settings. Depending on the rule, AWS Config evaluates your resources either in response to configuration changes or periodically. AWS Config provides AWS managed rules, which are predefined, customizable rules that evaluate whether your AWS resources comply with common best practices. For example, you could use a managed rule to assess whether your Amazon Elastic Block Store (Amazon EBS) volumes have encryption enabled or whether specific tags are applied to resources. AWS Config rules can be enabled individually or through AWS Config conformance packs, which group rules and remediations together. You also have other options for deploying AWS Config rules: AWS Security Hub groups checks against rules together as standards, and AWS Control Tower offers controls through the controls library. Many AWS customers use a combination of these tools, which can create duplicate AWS Config rules in a single AWS account.

In this post, we introduce our Duplicate Rule Detection tool, built to help customers identify duplicate AWS Config rules and sources. You can assess the results and review opportunities to reduce duplicate evaluations, consolidate rule deployment, and help to optimize your compliance posture.

Solution overview

This serverless solution collects the currently active AWS Config rules and identifies duplicates based on identical sources, scopes, input parameters, and states.

Figure 1 illustrates the solution.

Figure 1: Architectural diagram of the AWS Config Duplicate Rule Detection tool

The architecture shown in Figure 1 uses the following steps:

  1. An Amazon EventBridge Scheduler triggers an AWS Lambda function.
  2. The Lambda function completes the following tasks:
    1. Sends describe-config-rules to the AWS Config API, which returns details about the enabled AWS Config rules in the current AWS account and AWS Region.
    2. Iterates through the returned AWS Config rules to determine whether there are duplicate rules. If duplicate rules are found, they’re grouped together in JSON format.
    3. Writes the output to a time-stamped JSON file and saves it to an Amazon Simple Storage Service (Amazon S3) bucket for further analysis.
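The Lambda steps above can be sketched roughly as follows. This is a minimal illustration, not the code shipped in the CloudFormation template: the function names, the object key format, and the choice to include the rule owner in the comparison are assumptions. The boto3 calls used (`describe_config_rules` through a paginator, and `put_object`) are standard, and the grouping logic itself has no AWS dependency.

```python
import json
from collections import defaultdict
from datetime import datetime, timezone


def duplicate_key(rule):
    """Build a hashable key from source, scope, input parameters, and state."""
    source = rule.get("Source", {})
    return (
        source.get("Owner"),  # assumption: owner is treated as part of the source
        source.get("SourceIdentifier"),
        json.dumps(rule.get("Scope", {}), sort_keys=True),
        # InputParameters arrives as a JSON string; normalise it so key order is irrelevant.
        json.dumps(json.loads(rule.get("InputParameters") or "{}"), sort_keys=True),
        rule.get("ConfigRuleState"),
    )


def find_duplicates(rules):
    """Return a list of groups, each group holding two or more matching rules."""
    groups = defaultdict(list)
    for rule in rules:
        groups[duplicate_key(rule)].append(rule)
    return [group for group in groups.values() if len(group) > 1]


def fetch_config_rules():
    """Collect all rules in the current account and Region."""
    import boto3  # imported lazily so the grouping logic can run without AWS access

    paginator = boto3.client("config").get_paginator("describe_config_rules")
    rules = []
    for page in paginator.paginate():
        rules.extend(page["ConfigRules"])
    return rules


def write_results(s3_client, bucket, groups, now=None):
    """Write the duplicate groups to a time-stamped JSON object (hypothetical key format)."""
    now = now or datetime.now(timezone.utc)
    key = f"duplicate-rules-{now:%Y-%m-%dT%H-%M-%SZ}.json"
    body = json.dumps(groups, indent=2, default=str).encode("utf-8")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body)
    return key
```

Normalising `InputParameters` before comparing avoids treating two rules as different only because their parameter JSON is serialised with a different key order or spacing.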

Prerequisites

You will need an AWS account with rules enabled using AWS Config, Security Hub standards, or AWS Control Tower controls. Before getting started, make sure that you also have a basic understanding of the following:

Walkthrough

To demonstrate the tool, use an AWS account that has two AWS Config conformance packs deployed—Operational Best Practices for HIPAA Security and Operational Best Practices for NIST CSF—along with the AWS Foundational Security Best Practices (FSBP) standard in Security Hub.

CloudFormation template review

The AWS CloudFormation template included in this post deploys several components:

  • DuplicateRuleDetectionLambda – A Lambda function that:
    • Sends describe-config-rules to the AWS Config API to return enabled Config rules.
    • Queries the returned rules to identify duplicate rules with identical parameters.
    • Writes the date-stamped output JSON file to the DetectionLambdaResultsBucket bucket.
  • DetectionLambdaPolicy – An AWS Identity and Access Management (IAM) policy attached to the DetectionLambdaRole role that allows access to:
    • Basic Lambda execution permissions.
    • config:DescribeConfigRules.
    • s3:PutObject with a constraint to only allow on the DetectionLambdaResultsBucket bucket.
  • DetectionLambdaRole – IAM role with a trust policy to allow only the AWS Lambda service to assume the role.
  • DetectionLambdaResultsBucket – An Amazon S3 bucket for storing the output JSON files written by the DuplicateRuleDetectionLambda function.
  • SchedulerForDuplicateRuleDetectionLambda – An EventBridge Scheduler used to trigger the DuplicateRuleDetectionLambda function.
    • ScheduleExpression – Property to define when the schedule runs.
  • IAMRoleforDuplicateRuleDetectionLambdaScheduler – An IAM role for SchedulerForDuplicateRuleDetectionLambda with an inline IAM policy to allow Lambda invocation.

Deployment

To deploy the solution, follow these steps:

  1. Download the CloudFormation template or open the template in CloudFormation.

    Note: The default frequency of the EventBridge Scheduler is to run on the first day of each month. Update the template CRON expression as needed before creating the stack.

  2. Sign in to the AWS Management Console and navigate to AWS CloudFormation by using the search feature at the top of the page.
  3. In the navigation pane, choose Stacks.
  4. At the top of the Stacks page, choose Create Stack, then select With new resources from the dropdown menu.
  5. On the Create stack page:
    1. For Prerequisite – Prepare template, leave the default setting: Template is ready.
    2. Under Specify template, choose Upload a template file, then select the downloaded duplicate-rule-detection.yaml file and choose Open.
  6. At the bottom of the page, choose Next.
  7. On the Specify stack details page:
    1. For Stack name, enter a name for the stack, for example, duplicate-detection-rule-stack.
  8. At the bottom of the page, choose Next.
  9. On the Configure stack options page:
    1. (Optional) For Tags, add tags as needed.
    2. For Permissions, don’t choose a role; CloudFormation uses permissions based on your user credentials.
    3. For Stack failure options, leave the default option of Roll back all stack resources.
  10. At the bottom of the page, choose Next.
  11. On the Review page, review the details of your stack.
  12. After you review the stack creation settings, choose Create stack to launch your stack.
  13. From the CloudFormation Stack page, monitor the status of the stack as it updates from CREATE_IN_PROGRESS to CREATE_COMPLETE.
  14. From the Resources tab, you will see the resources that were created from the template.

Test

Use the following steps to invoke the Lambda function to create a one-time output for testing.

  1. Sign in to the AWS CloudFormation console using the AWS account from the prerequisites.
  2. From the navigation pane, choose Stacks and then select the Stack name you used when deploying this solution.
  3. Choose the Resources tab of the duplicate-detection-rule-stack and note the name of the Lambda function created for this solution.
  4. Navigate to the Lambda console and choose Functions from the navigation pane.
  5. Select the function name noted in Step 3.
  6. From the Code tab, choose Test, which will open a test window, then choose Invoke.
  7. Navigate to the Amazon S3 console and select the bucket name that was created as part of this solution to see the JSON output created by the Lambda function.
  8. Select the object created and choose Download to view the output file locally.
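If you prefer to script this test rather than use the console, the same steps can be approximated with boto3. The sketch below is an assumption-laden illustration: the helper name is hypothetical, and you would pass in the function and bucket names noted from the stack's Resources tab. It takes the clients as arguments so it can be exercised with stand-ins.

```python
def invoke_and_list_outputs(lambda_client, s3_client, function_name, bucket_name):
    """Invoke the detection function once and return the object keys in the results bucket."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the function to finish
    )
    if response["StatusCode"] != 200 or "FunctionError" in response:
        raise RuntimeError(f"Invocation of {function_name} failed: {response}")

    listing = s3_client.list_objects_v2(Bucket=bucket_name)
    return sorted(obj["Key"] for obj in listing.get("Contents", []))


# Example usage (requires AWS credentials; names are placeholders):
#   import boto3
#   keys = invoke_and_list_outputs(
#       boto3.client("lambda"), boto3.client("s3"),
#       "<function-name-from-stack>", "<bucket-name-from-stack>")
```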

Validation

To view the JSON output file and understand the structure, open the downloaded output file with a text editor that supports JSON. Each duplicate rule is presented as a JSON object defined within left ({) and right (}) braces. Matching duplicate rules are grouped together in an array within left ([) and right (]) brackets and separated by commas.

From the sample output that follows, you can see that there are three instances of the same AWS Config managed rule in this account:

  • The first two rules are deployed from two different conformance packs and the third rule was created by Security Hub.
  • The SourceIdentifier key value identifies the managed rule as ACCESS_KEYS_ROTATED.
  • The CreatedBy key value identifies the service that enabled the rule.

Each rule has the same InputParameters, which is a qualifier for how a duplicate rule is defined.
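To make the structure easier to scan, a short script can summarize each duplicate group. The sketch below assumes the output file is a JSON array of groups, where each group is an array of rule objects carrying the keys described above; the sample values are made up for illustration and are not taken from a real account.

```python
def summarize(groups):
    """Produce one summary entry per group of duplicate rules."""
    summary = []
    for group in groups:
        summary.append({
            "SourceIdentifier": group[0]["Source"]["SourceIdentifier"],
            "Count": len(group),
            "CreatedBy": sorted({rule.get("CreatedBy", "unknown") for rule in group}),
            "RuleNames": [rule["ConfigRuleName"] for rule in group],
        })
    return summary


# Hypothetical sample shaped like the output described in this section.
sample_groups = [[
    {"ConfigRuleName": "access-keys-rotated-conformance-pack-a1b2c3d4e",
     "Source": {"Owner": "AWS", "SourceIdentifier": "ACCESS_KEYS_ROTATED"},
     "CreatedBy": "config-conforms.amazonaws.com"},
    {"ConfigRuleName": "securityhub-access-keys-rotated-a1b2c3",
     "Source": {"Owner": "AWS", "SourceIdentifier": "ACCESS_KEYS_ROTATED"},
     "CreatedBy": "securityhub.amazonaws.com"},
]]

for entry in summarize(sample_groups):
    print(entry["SourceIdentifier"], entry["Count"], entry["CreatedBy"])
```

For a downloaded file, you would load the groups with `json.load` and pass them to `summarize` in the same way.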

Figure 2: Solution output showing duplicate rules and keys

Now that you’ve identified the duplicate rules, further investigation is needed to identify the specific conformance pack and Security Hub standards that the rules are included in. The ConfigRuleName value is different for each duplicate rule and includes prefixes and suffixes based on how the rule was deployed:

  • Rules deployed using conformance packs will include a suffix to the displayed AWS Config rule name (for example, access-keys-rotated-conformance-pack-a1b2c3d4e).
  • Rules deployed using Security Hub standards include both a prefix and a suffix to the displayed AWS Config rule name (for example, securityhub-access-keys-rotated-a1b2c3).
  • Rules deployed using AWS Control Tower include a prefix to the displayed AWS Config rule name (for example, AWSControlTower_AWS-GR_EBS_OPTIMIZED_INSTANCE).

The ConfigRuleName value maps back to the specific conformance pack or Security Hub standard.
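The naming conventions above can be turned into a small classifier. This is a heuristic sketch based only on the example names in this post; the function name and the returned labels are assumptions, and rule names in your account might follow other patterns.

```python
def deployment_source(config_rule_name):
    """Guess how a rule was deployed from its ConfigRuleName."""
    if config_rule_name.startswith("securityhub-"):
        return "Security Hub standard"
    if config_rule_name.startswith("AWSControlTower_"):
        return "AWS Control Tower control"
    if "-conformance-pack-" in config_rule_name:
        return "conformance pack"
    return "individually deployed or unknown"


for name in (
    "access-keys-rotated-conformance-pack-a1b2c3d4e",
    "securityhub-access-keys-rotated-a1b2c3",
    "AWSControlTower_AWS-GR_EBS_OPTIMIZED_INSTANCE",
):
    print(f"{name}: {deployment_source(name)}")
```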

Figure 3: AWS Config conformance pack dashboard showing mapping between a rule and the conformance pack that enabled the rule

To identify which conformance pack and which Security Hub standards the rule is enabled with, use the following steps.

  1. From the AWS Config console, choose Conformance pack from the navigation pane. Select a conformance pack and search the rules by filtering with the SourceIdentifier value from the output file.
  2. Using the AWS Config Developer Guide, search the List of Managed Rules using the SourceIdentifier and note the Resource Types for the managed rule (for example, AWS::IAM::User).
  3. Use the Security Hub controls reference to search for the AWS service that was included in the Resource Type from the previous step (for example, the IAM controls).
  4. Search for the corresponding control by using the SourceIdentifier and note the Control ID (for example, IAM.3).
  5. Sign in to the Security Hub console and choose Controls from the navigation pane. Search for the Control ID by filtering on ID and select the Control Title.
  6. Choose the Investigate tab and select the Config rule to view the corresponding AWS Config rule.
  7. Select the Standards and requirements tab on the Control page to view the standards that the AWS Config rule is a part of.

Figure 4: AWS Security Hub dashboard

Duplicate resolution

After the assessment is complete and duplicate rules are identified, you can work to consolidate rules and resolve duplicates.

If the AWS account being evaluated is in AWS Organizations, a delegated administrator account in the organization might be registered to manage specific AWS services, such as AWS Config and Security Hub. Resolution might need to be completed from the delegated administrator account.

Some options you can take to resolve duplicate AWS Config rules include:

When deciding on an effective approach to consolidate rules and resolve duplicates, it’s helpful to consider additional capabilities such as visualization and automated remediation:

  • AWS Config provides a dashboard to view resources, rules, conformance packs, and their compliance states. You can also configure remediation actions in custom templates to target AWS Systems Manager Automation runbooks that define the actions that Systems Manager performs.
  • Security Hub provides a summary dashboard to identify areas of concern, including aggregating findings across an organization. You can customize the dashboard layout, add or remove widgets, and filter the data to focus on areas of particular interest. To configure automated response and remediation, Security Hub automatically sends new findings and updates to existing findings to EventBridge as EventBridge events. Customers can write simple rules to indicate which events and what automated actions to take when an event matches a rule.
  • AWS Control Tower provides a console to view control categories, individual controls, and status along with enabled OUs or accounts. Remediation of non-compliant resources is currently not supported through AWS Control Tower.

The best approach for consolidating rules and resolving duplicates is to start with an assessment of the preceding factors and develop a strategy for governance at scale. Security Hub provides a comprehensive view of compliance across an organization by collecting security data across AWS accounts, AWS services, and supported third-party products. Enabling one or more Security Hub standards provides a mechanism to deploy controls without risk of duplication. You can deploy additional controls individually from AWS Config or AWS Control Tower.

Clean up

Use the following steps to remove the resources you created in this walkthrough.

  1. Sign in to the AWS CloudFormation console and choose Stacks in the navigation pane.
  2. Select the Stack name you used when deploying this solution.
  3. Choose the Resources tab of the duplicate-detection-rule-stack and note the name of the S3 bucket created for this solution.
  4. Navigate to the Amazon S3 console.
  5. Select the radio button next to the bucket noted in Step 3, choose Empty, and follow the steps to empty the bucket.
  6. Navigate to the AWS CloudFormation console and choose Stacks from the navigation pane.
  7. Select the radio button next to the stack name used in the deployment step and choose Delete.
  8. Choose Delete to confirm that you want to delete the stack.
  9. From the CloudFormation Stack page, monitor the status of the duplicate-detection-rule-stack stack as it updates from DELETE_IN_PROGRESS to DELETE_COMPLETE.

Conclusion

For AWS customers, it’s critical to understand the compliance of resources as it relates to specific rules—such as default encryption settings or making sure that network connections are encrypted. You can use detective controls to evaluate the evolving state of your resources on AWS.

AWS Config rules, one type of detective control available on AWS, can be deployed individually or grouped together in AWS Config conformance packs or through Security Hub standards and the AWS Control Tower controls library. However, using more than one of these mechanisms can result in duplicate rules being deployed in an AWS account. This post provides a solution to assess the currently deployed AWS Config rules in a single AWS account and Region to identify when duplicate rules exist. After duplicates have been identified, you can make informed decisions about changes that you can make to consolidate rules and resolve duplicates. This approach will help to optimize your compliance posture by reducing complexity and eliminating unnecessary redundancy.

Aaron Klotnia

Aaron is an AWS Solutions Architect focused on enabling healthcare customers in the worldwide public sector. He is passionate about analytics, security, and infrastructure modernization.

Abeera Hussain

Abeera is an AWS Solutions Architect supporting the Worldwide Public Sector. She is an active member of the healthcare and life sciences technical field community and supports customers in using AWS to run efficient and secure workloads.

Todd Kudlicki

Todd is an AWS Solutions Architect for the Worldwide Public Sector. He is an active member of the healthcare and life sciences technical field community. Todd advises healthcare customers on deploying cloud foundations with established best practices for security and compliance.

Netflix’s Distributed Counter Abstraction

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/netflixs-distributed-counter-abstraction-8d0c45eb66b2

By: Rajiv Shringi, Oleksii Tkachuk, Kartik Sathyanarayanan

Introduction

In our previous blog post, we introduced Netflix’s TimeSeries Abstraction, a distributed service designed to store and query large volumes of temporal event data with low millisecond latencies. Today, we’re excited to present the Distributed Counter Abstraction. This counting service, built on top of the TimeSeries Abstraction, enables distributed counting at scale while maintaining similar low latency performance. As with all our abstractions, we use our Data Gateway Control Plane to shard, configure, and deploy this service globally.

Distributed counting is a challenging problem in computer science. In this blog post, we’ll explore the diverse counting requirements at Netflix, the challenges of achieving accurate counts in near real-time, and the rationale behind our chosen approach, including the necessary trade-offs.

Note: When it comes to distributed counters, terms such as ‘accurate’ or ‘precise’ should be taken with a grain of salt. In this context, they refer to a count very close to accurate, presented with minimal delays.

Use Cases and Requirements

At Netflix, our counting use cases include tracking millions of user interactions, monitoring how often specific features or experiences are shown to users, and counting multiple facets of data during A/B test experiments, among others.

These use cases can be classified into two broad categories:

  1. Best-Effort: For this category, the count doesn’t have to be very accurate or durable. However, this category requires near-immediate access to the current count at low latencies, all while keeping infrastructure costs to a minimum.
  2. Eventually Consistent: This category needs accurate and durable counts, and is willing to tolerate a slight delay in accuracy and a slightly higher infrastructure cost as a trade-off.

Both categories share common requirements, such as high throughput and high availability. The table below provides a detailed overview of the diverse requirements across these two categories.

Distributed Counter Abstraction

To meet the outlined requirements, the Counter Abstraction was designed to be highly configurable. It allows users to choose between different counting modes, such as Best-Effort or Eventually Consistent, while considering the documented trade-offs of each option. After selecting a mode, users can interact with APIs without needing to worry about the underlying storage mechanisms and counting methods.

Let’s take a closer look at the structure and functionality of the API.

API

Counters are organized into separate namespaces that users set up for each of their specific use cases. Each namespace can be configured with different parameters, such as Type of Counter, Time-To-Live (TTL), and Counter Cardinality, using the service’s Control Plane.

The Counter Abstraction API resembles Java’s AtomicInteger interface:

AddCount/AddAndGetCount: Adjusts the count for the specified counter by the given delta value within a dataset. The delta value can be positive or negative. The AddAndGetCount counterpart also returns the count after performing the add operation.

{
  "namespace": "my_dataset",
  "counter_name": "counter123",
  "delta": 2,
  "idempotency_token": {
    "token": "some_event_id",
    "generation_time": "2024-10-05T14:48:00Z"
  }
}

The idempotency token can be used for counter types that support them. Clients can use this token to safely retry or hedge their requests. Failures in a distributed system are a given, and having the ability to safely retry requests enhances the reliability of the service.

GetCount: Retrieves the count value of the specified counter within a dataset.

{
  "namespace": "my_dataset",
  "counter_name": "counter123"
}

ClearCount: Effectively resets the count to 0 for the specified counter within a dataset.

{
  "namespace": "my_dataset",
  "counter_name": "counter456",
  "idempotency_token": {...}
}
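To make the semantics of these operations concrete, here is a toy, single-process sketch in Python. It is not the service's implementation: the class name, method names, and the in-memory dictionaries are assumptions chosen to mirror the request fields above, and it only illustrates how an idempotency token lets a retried request be applied once.

```python
from dataclasses import dataclass, field


@dataclass
class InMemoryCounterService:
    """Toy stand-in for the counter API, keyed by (namespace, counter_name)."""
    counts: dict = field(default_factory=dict)
    seen_tokens: set = field(default_factory=set)

    def _first_time(self, namespace, counter_name, idempotency_token):
        # A request without a token is always applied; a repeated token is skipped.
        if idempotency_token is None:
            return True
        key = (namespace, counter_name, idempotency_token)
        if key in self.seen_tokens:
            return False
        self.seen_tokens.add(key)
        return True

    def add_count(self, namespace, counter_name, delta, idempotency_token=None):
        if self._first_time(namespace, counter_name, idempotency_token):
            key = (namespace, counter_name)
            self.counts[key] = self.counts.get(key, 0) + delta

    def add_and_get_count(self, namespace, counter_name, delta, idempotency_token=None):
        self.add_count(namespace, counter_name, delta, idempotency_token)
        return self.get_count(namespace, counter_name)

    def get_count(self, namespace, counter_name):
        return self.counts.get((namespace, counter_name), 0)

    def clear_count(self, namespace, counter_name, idempotency_token=None):
        if self._first_time(namespace, counter_name, idempotency_token):
            self.counts[(namespace, counter_name)] = 0


service = InMemoryCounterService()
token = ("some_event_id", "2024-10-05T14:48:00Z")  # (token, generation_time)
service.add_count("my_dataset", "counter123", 2, token)
service.add_count("my_dataset", "counter123", 2, token)  # retry of the same request
print(service.get_count("my_dataset", "counter123"))
```

Because the retry carries the same token, the second `add_count` call leaves the count unchanged.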

Now, let’s look at the different types of counters supported within the Abstraction.

Types of Counters

The service primarily supports two types of counters: Best-Effort and Eventually Consistent, along with a third experimental type: Accurate. In the following sections, we’ll describe the different approaches for these types of counters and the trade-offs associated with each.

Best Effort Regional Counter

This type of counter is powered by EVCache, Netflix’s distributed caching solution built on the widely popular Memcached. It is suitable for use cases like A/B experiments, where many concurrent experiments are run for relatively short durations and an approximate count is sufficient. Setting aside the complexities of provisioning, resource allocation, and control plane management, the core of this solution is remarkably straightforward:

// counter cache key
counterCacheKey = <namespace>:<counter_name>

// add operation
return delta > 0
    ? cache.incr(counterCacheKey, delta, TTL)
    : cache.decr(counterCacheKey, Math.abs(delta), TTL);

// get operation
cache.get(counterCacheKey);

// clear counts from all replicas
cache.delete(counterCacheKey, ReplicaPolicy.ALL);

EVCache delivers extremely high throughput at low millisecond latency or better within a single region, enabling a multi-tenant setup within a shared cluster, saving infrastructure costs. However, there are some trade-offs: it lacks cross-region replication for the increment operation and does not provide consistency guarantees, which may be necessary for an accurate count. Additionally, idempotency is not natively supported, making it unsafe to retry or hedge requests.

Edit: A note on probabilistic data structures:

Probabilistic data structures like HyperLogLog (HLL) can be useful for tracking an approximate number of distinct elements, like distinct views or visits to a website, but are not ideally suited for implementing distinct increments and decrements for a given key. Count-Min Sketch (CMS) is an alternative that can be used to adjust the values of keys by a given amount. Data stores like Redis support both HLL and CMS. However, we chose not to pursue this direction for several reasons:

  • We chose to build on top of data stores that we already operate at scale.
  • Probabilistic data structures do not natively support several of our requirements, such as resetting the count for a given key or having TTLs for counts. Additional data structures, including more sketches, would be needed to support these requirements.
  • On the other hand, the EVCache solution is quite simple, requiring minimal lines of code and using natively supported elements. However, it comes at the trade-off of using a small amount of memory per counter key.

Eventually Consistent Global Counter

While some users may accept the limitations of a Best-Effort counter, others opt for precise counts, durability and global availability. In the following sections, we’ll explore various strategies for achieving durable and accurate counts. Our objective is to highlight the challenges inherent in global distributed counting and explain the reasoning behind our chosen approach.

Approach 1: Storing a Single Row per Counter

Let’s start simple by using a single row per counter key within a table in a globally replicated datastore.

Let’s examine some of the drawbacks of this approach:

  • Lack of Idempotency: There is no idempotency key baked into the storage data model, which prevents users from safely retrying requests. Implementing idempotency would likely require using an external system for such keys, which can further degrade performance or cause race conditions.
  • Heavy Contention: To update counts reliably, every writer must perform a Compare-And-Swap operation for a given counter using locks or transactions. Depending on the throughput and concurrency of operations, this can lead to significant contention, heavily impacting performance.

Secondary Keys: One way to reduce contention in this approach would be to use a secondary key, such as a bucket_id, which allows for distributing writes by splitting a given counter into buckets, while enabling reads to aggregate across buckets. The challenge lies in determining the appropriate number of buckets. A static number may still lead to contention with hot keys, while dynamically assigning the number of buckets per counter across millions of counters presents a more complex problem.
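The bucketing idea can be illustrated with a small in-memory sketch. The class name, the fixed bucket count, and the random bucket choice are assumptions for illustration; a real datastore would hold one row per (counter, bucket_id) pair.

```python
import random
from collections import defaultdict


class BucketedCounter:
    """Spread writes for one counter across several rows, and sum them on read."""

    def __init__(self, num_buckets=8, rng=None):
        self.num_buckets = num_buckets
        self.rows = defaultdict(int)  # (counter_name, bucket_id) -> partial count
        self.rng = rng or random.Random()

    def add(self, counter_name, delta):
        # Each write touches one randomly chosen bucket, which reduces contention on any single row.
        bucket_id = self.rng.randrange(self.num_buckets)
        self.rows[(counter_name, bucket_id)] += delta

    def get(self, counter_name):
        # Reads pay for the spread: they must aggregate across all buckets.
        return sum(self.rows.get((counter_name, b), 0) for b in range(self.num_buckets))


counter = BucketedCounter(num_buckets=8, rng=random.Random(42))
for _ in range(100):
    counter.add("counter123", 1)
print(counter.get("counter123"))
```

The static `num_buckets` here is exactly the weak point described above: a very hot counter may need more buckets, while most counters need far fewer.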

Let’s see if we can iterate on our solution to overcome these drawbacks.

Approach 2: Per Instance Aggregation

To address issues of hot keys and contention from writing to the same row in real-time, we could implement a strategy where each instance aggregates the counts in memory and then flushes them to disk at regular intervals. Introducing sufficient jitter to the flush process can further reduce contention.

However, this solution presents a new set of issues:

  • Vulnerability to Data Loss: The solution is vulnerable to data loss for all in-memory data during instance failures, restarts, or deployments.
  • Inability to Reliably Reset Counts: Due to counting requests being distributed across multiple machines, it is challenging to establish consensus on the exact point in time when a counter reset occurred.
  • Lack of Idempotency: Similar to the previous approach, this method does not natively guarantee idempotency. One way to achieve idempotency is by consistently routing the same set of counters to the same instance. However, this approach may introduce additional complexities, such as leader election, and potential challenges with availability and latency in the write path.
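A minimal sketch of per-instance aggregation follows. The class and parameter names are hypothetical, and a plain dictionary stands in for the durable store. It shows both the benefit (one write per counter per flush) and the data-loss window (anything still pending when an instance dies is gone).

```python
import random
from collections import defaultdict


class InstanceAggregator:
    """Accumulate deltas in memory and flush them to a shared store periodically."""

    def __init__(self, store, flush_interval_s=5.0, jitter_s=1.0, rng=None):
        self.store = store  # stand-in for a durable datastore
        self.pending = defaultdict(int)
        self.flush_interval_s = flush_interval_s
        self.jitter_s = jitter_s
        self.rng = rng or random.Random()

    def add(self, counter_name, delta):
        self.pending[counter_name] += delta  # lost if the instance crashes before flush

    def next_flush_delay(self):
        # Jitter keeps instances from flushing at the same moment.
        return self.flush_interval_s + self.rng.uniform(0, self.jitter_s)

    def flush(self):
        pending, self.pending = self.pending, defaultdict(int)
        for counter_name, delta in pending.items():
            self.store[counter_name] = self.store.get(counter_name, 0) + delta


shared_store = {}
instance_a = InstanceAggregator(shared_store)
instance_b = InstanceAggregator(shared_store)
instance_a.add("counter123", 2)
instance_b.add("counter123", 3)
unflushed_view = dict(shared_store)  # still empty: the counts live only in memory
instance_a.flush()
instance_b.flush()
print(unflushed_view, shared_store)
```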

That said, this approach may still be suitable in scenarios where these trade-offs are acceptable. However, let’s see if we can address some of these issues with a different event-based approach.

Approach 3: Using Durable Queues

In this approach, we log counter events into a durable queuing system like Apache Kafka to prevent any potential data loss. By creating multiple topic partitions and hashing the counter key to a specific partition, we ensure that the same set of counters are processed by the same set of consumers. This setup simplifies facilitating idempotency checks and resetting counts. Furthermore, by leveraging additional stream processing frameworks such as Kafka Streams or Apache Flink, we can implement windowed aggregations.
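The key-to-partition routing can be sketched as below. This is a generic stable-hash illustration, not Kafka's own partitioner; the function name and the use of SHA-256 are assumptions.

```python
import hashlib


def partition_for(counter_key, num_partitions):
    """Map a counter key to a partition using a hash that is stable across processes."""
    digest = hashlib.sha256(counter_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_partitions


# Every event for the same counter lands on the same partition,
# so a single consumer sees all of that counter's events in order.
first = partition_for("my_dataset:counter123", 12)
second = partition_for("my_dataset:counter123", 12)
print(first == second)
```

Note that changing `num_partitions` remaps keys, which is one reason rebalancing partitions is listed as a challenge below.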

However, this approach comes with some challenges:

  • Potential Delays: Having the same consumer process all the counts from a given partition can lead to backups and delays, resulting in stale counts.
  • Rebalancing Partitions: This approach requires auto-scaling and rebalancing of topic partitions as the cardinality of counters and throughput increases.

Furthermore, all approaches that pre-aggregate counts make it challenging to support two of our requirements for accurate counters:

  • Auditing of Counts: Auditing involves extracting data to an offline system for analysis to ensure that increments were applied correctly to reach the final value. This process can also be used to track the provenance of increments. However, auditing becomes infeasible when counts are aggregated without storing the individual increments.
  • Potential Recounting: Similar to auditing, if adjustments to increments are necessary and recounting of events within a time window is required, pre-aggregating counts makes this infeasible.

Barring those few requirements, this approach can still be effective if we determine the right way to scale our queue partitions and consumers while maintaining idempotency. However, let’s explore how we can adjust this approach to meet the auditing and recounting requirements.

Approach 4: Event Log of Individual Increments

In this approach, we log each individual counter increment along with its event_time and event_id. The event_id can include the source information of where the increment originated. The combination of event_time and event_id can also serve as the idempotency key for the write.
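In its simplest form, this approach could look like the following sketch. The class and method names are hypothetical, and a nested dictionary stands in for the datastore. Writes are deduplicated on (event_time, event_id), and each read scans every stored increment.

```python
class EventLog:
    """Store each increment individually, keyed by its idempotency key."""

    def __init__(self):
        self.events = {}  # counter_name -> {(event_time, event_id): delta}

    def append(self, counter_name, event_time, event_id, delta):
        # setdefault keeps the first write for a key, so retries are no-ops.
        self.events.setdefault(counter_name, {}).setdefault((event_time, event_id), delta)

    def get_count(self, counter_name):
        # Naive read path: scan and sum all increments for the counter.
        return sum(self.events.get(counter_name, {}).values())


log = EventLog()
log.append("counter123", "2024-10-05T14:48:00Z", "event-1", 2)
log.append("counter123", "2024-10-05T14:48:00Z", "event-1", 2)  # retried write
log.append("counter123", "2024-10-05T14:48:01Z", "event-2", -1)
print(log.get_count("counter123"))
```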

However, in its simplest form, this approach has several drawbacks:

  • Read Latency: Each read request requires scanning all increments for a given counter, potentially degrading performance.
  • Duplicate Work: Multiple threads might duplicate the effort of aggregating the same set of counters during read operations, leading to wasted effort and subpar resource utilization.
  • Wide Partitions: If using a datastore like Apache Cassandra, storing many increments for the same counter could lead to a wide partition, affecting read performance.
  • Large Data Footprint: Storing each increment individually could also result in a substantial data footprint over time. Without an efficient data retention strategy, this approach may struggle to scale effectively.

The combined impact of these issues can lead to increased infrastructure costs that may be difficult to justify. However, adopting an event-driven approach seems to be a significant step forward in addressing some of the challenges we’ve encountered and meeting our requirements.

How can we improve this solution further?

Netflix’s Approach

We use a combination of the previous approaches, where we log each counting activity as an event, and continuously aggregate these events in the background using queues and a sliding time window. Additionally, we employ a bucketing strategy to prevent wide partitions. In the following sections, we’ll explore how this approach addresses the previously mentioned drawbacks and meets all our requirements.

Note: From here on, we will use the words “rollup” and “aggregate” interchangeably. They essentially mean the same thing, i.e., collecting individual counter increments/decrements and arriving at the final value.

TimeSeries Event Store:

We chose the TimeSeries Data Abstraction as our event store, where counter mutations are ingested as event records. Some of the benefits of storing events in TimeSeries include:

High-Performance: The TimeSeries abstraction already addresses many of our requirements, including high availability and throughput, reliable and fast performance, and more.

Reducing Code Complexity: We reduce a lot of code complexity in Counter Abstraction by delegating a major portion of the functionality to an existing service.

TimeSeries Abstraction uses Cassandra as the underlying event store, but it can be configured to work with any persistent store. Here is what it looks like:

Handling Wide Partitions: The time_bucket and event_bucket columns play a crucial role in breaking up a wide partition, preventing high-throughput counter events from overwhelming a given partition. For more information regarding this, refer to our previous blog.

No Over-Counting: The event_time, event_id and event_item_key columns form the idempotency key for the events for a given counter, enabling clients to retry safely without the risk of over-counting.

Event Ordering: TimeSeries orders all events in descending order of time, allowing us to leverage this property for events like count resets.

Event Retention: The TimeSeries Abstraction includes retention policies to ensure that events are not stored indefinitely, saving disk space and reducing infrastructure costs. Once events have been aggregated and moved to a more cost-effective store for audits, there’s no need to retain them in the primary storage.

Now, let’s see how these events are aggregated for a given counter.

Aggregating Count Events:

As mentioned earlier, collecting all individual increments for every read request would be cost-prohibitive in terms of read performance. Therefore, a background aggregation process is necessary to continually converge counts and ensure optimal read performance.

But how can we safely aggregate count events amidst ongoing write operations?

This is where the concept of Eventually Consistent counts becomes crucial. By intentionally lagging behind the current time by a safe margin, we ensure that aggregation always occurs within an immutable window.

Let’s see what that looks like:

Let’s break this down:

  • lastRollupTs: This represents the most recent time when the counter value was last aggregated. For a counter being operated for the first time, this timestamp defaults to a reasonable time in the past.
  • Immutable Window and Lag: Aggregation can only occur safely within an immutable window that is no longer receiving counter events. The “acceptLimit” parameter of the TimeSeries Abstraction plays a crucial role here, as it rejects incoming events with timestamps beyond this limit. During aggregations, this window is pushed slightly further back to account for clock skews.

This does mean that the counter value will lag behind its most recent update by some margin (typically on the order of seconds). This approach does leave the door open for missed events due to cross-region replication issues. See the “Future Work” section at the end.

  • Aggregation Process: The rollup process aggregates all events in the aggregation window since the last rollup to arrive at the new value.
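The three points above can be sketched as a small Python function. This is an illustration under stated assumptions, not the service's implementation: the function name, the `skew_margin` parameter, and the inclusive/exclusive window boundaries are all hypothetical choices.

```python
def rollup(events, last_rollup_count, last_rollup_ts, now,
           accept_limit=5.0, skew_margin=1.0):
    """Aggregate (event_time, delta) pairs inside the immutable window.

    The window ends at now - accept_limit - skew_margin, i.e. it lags
    behind the current time so that no new events can still land in it.
    """
    window_end = now - accept_limit - skew_margin
    if window_end <= last_rollup_ts:
        # Nothing new is safely aggregatable yet; keep the checkpoint.
        return last_rollup_count, last_rollup_ts
    delta = sum(d for ts, d in events if last_rollup_ts < ts <= window_end)
    return last_rollup_count + delta, window_end


events = [(90.0, 1), (95.0, 2), (99.0, 4)]
count, ts = rollup(events, last_rollup_count=10, last_rollup_ts=80.0, now=100.0)
print(count, ts)  # 11 94.0
count, ts = rollup(events, count, ts, now=110.0)
print(count, ts)  # 17 104.0
```

The first call only sees the event at 90.0 because the other two are still inside the lagging margin; the second call picks them up once the window has moved past them.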

Rollup Store:

We save the results of this aggregation in a persistent store. The next aggregation will simply continue from this checkpoint.

We create one such Rollup table per dataset and use Cassandra as our persistent store. However, as you will soon see in the Control Plane section, the Counter service can be configured to work with any persistent store.

LastWriteTs: Every time a given counter receives a write, we also log a last-write-timestamp as a columnar update in this table. This is done using Cassandra’s USING TIMESTAMP feature to predictably apply Last-Write-Wins (LWW) semantics. This timestamp is the same as the event_time for the event. In the subsequent sections, we’ll see how this timestamp is used to keep some counters in active rollup circulation until they have caught up to their latest value.

Rollup Cache

To optimize read performance, these values are cached in EVCache for each counter. We combine the lastRollupCount and lastRollupTs into a single cached value per counter to prevent potential mismatches between the count and its corresponding checkpoint timestamp.

But, how do we know which counters to trigger rollups for? Let’s explore our Write and Read path to understand this better.

Add/Clear Count:

An add or clear count request writes durably to the TimeSeries Abstraction and updates the last-write-timestamp in the Rollup store. If the durability acknowledgement fails, clients can retry their requests with the same idempotency token without the risk of overcounting. Upon durability, we send a fire-and-forget request to trigger the rollup for the request counter.

GetCount:

We return the last rolled-up count as a quick point-read operation, accepting the trade-off of potentially delivering a slightly stale count. We also trigger a rollup during the read operation to advance the last-rollup-timestamp, enhancing the performance of subsequent aggregations. This process also self-remediates a stale count if any previous rollups had failed.

With this approach, the counts continually converge to their latest value. Now, let’s see how we scale this approach to millions of counters and thousands of concurrent operations using our Rollup Pipeline.

Rollup Pipeline:

Each Counter-Rollup server operates a rollup pipeline to efficiently aggregate counts across millions of counters. This is where most of the complexity in Counter Abstraction comes in. In the following sections, we will share key details on how efficient aggregations are achieved.

Light-Weight Roll-Up Event: As seen in our Write and Read paths above, every operation on a counter sends a light-weight event to the Rollup server:

rollupEvent: {
  "namespace": "my_dataset",
  "counter": "counter123"
}

Note that this event does not include the increment. This is only an indication to the Rollup server that this counter has been accessed and now needs to be aggregated. Knowing exactly which specific counters need to be aggregated prevents scanning the entire event dataset for the purpose of aggregations.

In-Memory Rollup Queues: A given Rollup server instance runs a set of in-memory queues to receive rollup events and parallelize aggregations. In the first version of this service, we settled on using in-memory queues to reduce provisioning complexity, save on infrastructure costs, and make rebalancing the number of queues fairly straightforward. However, this comes with the trade-off of potentially missing rollup events in case of an instance crash. For more details, see the “Stale Counts” section in “Future Work.”

Minimize Duplicate Effort: We use a fast non-cryptographic hash like XXHash to ensure that the same set of counters end up on the same queue. Further, we try to minimize the amount of duplicate aggregation work by having a separate rollup stack that chooses to run fewer beefier instances.
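A minimal sketch of this routing in Python follows. XXHash is not part of the Python standard library, so `zlib.crc32` is used here purely as a stand-in non-cryptographic hash; the function name `queue_index` and the key format are assumptions for illustration.

```python
import zlib


def queue_index(namespace: str, counter: str, num_queues: int) -> int:
    """Map a counter to a queue deterministically.

    zlib.crc32 is a stand-in for a fast non-cryptographic hash such as
    XXHash; any stable hash gives the same routing property.
    """
    key = f"{namespace}:{counter}".encode("utf-8")
    return zlib.crc32(key) % num_queues


first = queue_index("my_dataset", "counter123", num_queues=8)
second = queue_index("my_dataset", "counter123", num_queues=8)
print(first == second)  # True: the same counter always lands on the same queue
```

Since every event for a given counter reaches the same queue, a single consumer can coalesce them instead of several consumers repeating the same aggregation.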

Availability and Race Conditions: Having a single Rollup server instance can minimize duplicate aggregation work but may create availability challenges for triggering rollups. If we choose to horizontally scale the Rollup servers, we allow threads to overwrite rollup values while avoiding any form of distributed locking mechanisms to maintain high availability and performance. This approach remains safe because aggregation occurs within an immutable window. Although the concept of now() may differ between threads, causing rollup values to sometimes fluctuate, the counts will eventually converge to an accurate value within each immutable aggregation window.

Rebalancing Queues: If we need to scale the number of queues, a simple Control Plane configuration update followed by a re-deploy is enough to rebalance the number of queues.

"eventual_counter_config": {
  "queue_config": {
    "num_queues" : 8, // change to 16 and re-deploy
    ...

Handling Deployments: During deployments, these queues shut down gracefully, draining all existing events first, while the new Rollup server instance starts up with potentially new queue configurations. There may be a brief period when both the old and new Rollup servers are active, but as mentioned before, this race condition is managed since aggregations occur within immutable windows.

Minimize Rollup Effort: Receiving multiple events for the same counter doesn’t mean rolling it up multiple times. We drain these rollup events into a Set, ensuring a given counter is rolled up only once during a rollup window.
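The coalescing step can be sketched in a few lines of Python. The `deque` stands in for one in-memory rollup queue, and the `drain` helper is a hypothetical name.

```python
from collections import deque


def drain(queue: deque) -> set:
    """Empty the queue into a set so each counter appears at most once."""
    pending = set()
    while queue:
        pending.add(queue.popleft())
    return pending


# Three rollup events arrive, two of them for the same counter.
q = deque([
    ("my_dataset", "counter123"),
    ("my_dataset", "counter456"),
    ("my_dataset", "counter123"),
])
to_roll_up = drain(q)
print(len(to_roll_up))  # 2
```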

Efficient Aggregation: Each rollup consumer processes a batch of counters simultaneously. Within each batch, it queries the underlying TimeSeries abstraction in parallel to aggregate events within specified time boundaries. The TimeSeries abstraction optimizes these range scans to achieve low millisecond latencies.

Dynamic Batching: The Rollup server dynamically adjusts the number of time partitions that need to be scanned based on cardinality of counters in order to prevent overwhelming the underlying store with many parallel read requests.

Adaptive Back-Pressure: Each consumer waits for one batch to complete before issuing the rollups for the next batch. It adjusts the wait time between batches based on the performance of the previous batch. This approach provides back-pressure during rollups to prevent overwhelming the underlying TimeSeries store.
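The exact adjustment policy is not described here, so the following Python sketch shows just one plausible way to adapt the wait time: back off when the previous batch was slow, and shorten the wait when it was fast. The function name, the latency target, and the doubling/halving rule are all assumptions.

```python
def next_wait_ms(current_wait_ms: float, batch_latency_ms: float,
                 target_latency_ms: float = 50.0,
                 min_wait_ms: float = 0.0,
                 max_wait_ms: float = 1000.0) -> float:
    """Pick the pause before the next batch from the last batch's latency."""
    if batch_latency_ms > target_latency_ms:
        # The store looks busy: wait longer (at least 10 ms) before retrying.
        proposed = max(current_wait_ms * 2, 10.0)
    else:
        # The store kept up: shrink the pause to regain throughput.
        proposed = current_wait_ms / 2
    return min(max(proposed, min_wait_ms), max_wait_ms)


wait = 0.0
for latency in (80.0, 80.0, 30.0):
    wait = next_wait_ms(wait, latency)
print(wait)  # 10.0
```

A consumer would sleep for the returned duration between batches, so a struggling store automatically receives fewer parallel range scans.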

Handling Convergence:

In order to prevent low-cardinality counters from lagging behind too much and subsequently scanning too many time partitions, they are kept in constant rollup circulation. For high-cardinality counters, continuously circulating them would consume excessive memory in our Rollup queues. This is where the last-write-timestamp mentioned previously plays a crucial role. The Rollup server inspects this timestamp to determine if a given counter needs to be re-queued, ensuring that we continue aggregating until it has fully caught up with the writes.
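As a rough sketch of that decision in Python (the function name and the string values for cardinality are hypothetical, and the real check may consider more than these two timestamps):

```python
def should_requeue(cardinality: str, last_write_ts: float,
                   last_rollup_ts: float) -> bool:
    """Decide whether a counter goes back onto a rollup queue."""
    if cardinality == "LOW":
        # Low-cardinality counters stay in constant circulation.
        return True
    # High-cardinality counters are re-queued only while writes exist
    # that the last rollup has not yet covered.
    return last_write_ts > last_rollup_ts


print(should_requeue("HIGH", last_write_ts=105.0, last_rollup_ts=100.0))  # True
print(should_requeue("HIGH", last_write_ts=95.0, last_rollup_ts=100.0))   # False
```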

Now, let’s see how we leverage this counter type to provide an up-to-date current count in near-realtime.

Experimental: Accurate Global Counter

We are experimenting with a slightly modified version of the Eventually Consistent counter. Again, take the term ‘Accurate’ with a grain of salt. The key difference between this type of counter and its counterpart is that the delta, representing the counts since the last-rolled-up timestamp, is computed in real-time.

And then, currentAccurateCount = lastRollupCount + delta

Aggregating this delta in real-time can impact the performance of this operation, depending on the number of events and partitions that need to be scanned to retrieve this delta. The same principle of rolling up in batches applies here to prevent scanning too many partitions in parallel. Conversely, if the counters in this dataset are accessed frequently, the time gap for the delta remains narrow, making this approach of fetching current counts quite effective.
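The formula above can be sketched in Python as follows. The event list stands in for a real-time scan of the TimeSeries store, and the function name is an assumption.

```python
def current_accurate_count(last_rollup_count: int, last_rollup_ts: float,
                           events) -> int:
    """Return lastRollupCount + delta, where delta sums the events
    recorded after the last rollup checkpoint."""
    delta = sum(d for ts, d in events if ts > last_rollup_ts)
    return last_rollup_count + delta


events = [(40.0, 5), (55.0, 3), (60.0, -1)]
# The event at 40.0 is already included in the rolled-up count of 100.
print(current_accurate_count(100, 50.0, events))  # 102
```

The narrower the gap between the checkpoint and now, the fewer events the delta scan has to read, which is why frequently accessed counters suit this approach.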

Now, let’s see how all this complexity is managed by having a unified Control Plane configuration.

Control Plane

The Data Gateway Platform Control Plane manages control settings for all abstractions and namespaces, including the Counter Abstraction. Below is an example of a control plane configuration for a namespace that supports eventually consistent counters with low cardinality:

"persistence_configuration": [
  {
    "id": "CACHE", // Counter cache config
    "scope": "dal=counter",
    "physical_storage": {
      "type": "EVCACHE", // type of cache storage
      "cluster": "evcache_dgw_counter_tier1" // Shared EVCache cluster
    }
  },
  {
    "id": "COUNTER_ROLLUP",
    "scope": "dal=counter", // Counter abstraction config
    "physical_storage": {
      "type": "CASSANDRA", // type of Rollup store
      "cluster": "cass_dgw_counter_uc1", // physical cluster name
      "dataset": "my_dataset_1" // namespace/dataset
    },
    "counter_cardinality": "LOW", // supported counter cardinality
    "config": {
      "counter_type": "EVENTUAL", // Type of counter
      "eventual_counter_config": { // eventual counter type
        "internal_config": {
          "queue_config": { // adjust w.r.t cardinality
            "num_queues" : 8, // Rollup queues per instance
            "coalesce_ms": 10000, // coalesce duration for rollups
            "capacity_bytes": 16777216 // allocated memory per queue
          },
          "rollup_batch_count": 32 // parallelization factor
        }
      }
    }
  },
  {
    "id": "EVENT_STORAGE",
    "scope": "dal=ts", // TimeSeries Event store
    "physical_storage": {
      "type": "CASSANDRA", // persistent store type
      "cluster": "cass_dgw_counter_uc1", // physical cluster name
      "dataset": "my_dataset_1" // keyspace name
    },
    "config": {
      "time_partition": { // time-partitioning for events
        "buckets_per_id": 4, // event buckets within
        "seconds_per_bucket": "600", // smaller width for LOW card
        "seconds_per_slice": "86400" // width of a time slice table
      },
      "accept_limit": "5s" // boundary for immutability
    },
    "lifecycleConfigs": {
      "lifecycleConfig": [
        {
          "type": "retention", // Event retention
          "config": {
            "close_after": "518400s",
            "delete_after": "604800s" // 7 day count event retention
          }
        }
      ]
    }
  }
]

Using such a control plane configuration, we compose multiple abstraction layers using containers deployed on the same host, with each container fetching configuration specific to its scope.

Provisioning

As with the TimeSeries abstraction, our automation uses a set of user inputs about their workload and cardinalities to arrive at the right infrastructure and related control plane configuration. You can learn more about this process in a talk given by one of our stunning colleagues, Joey Lynch: How Netflix optimally provisions infrastructure in the cloud.

Performance

At the time of writing this blog, this service was processing close to 75K count requests/second globally across the different API endpoints and datasets, while providing single-digit millisecond latencies for all its endpoints.

Future Work

While our system is robust, we still have work to do in making it more reliable and enhancing its features. Some of that work includes:

  • Regional Rollups: Cross-region replication issues can result in missed events from other regions. An alternate strategy involves establishing a rollup table for each region, and then tallying them in a global rollup table. A key challenge in this design would be effectively communicating the clearing of the counter across regions.
  • Error Detection and Stale Counts: Excessively stale counts can occur if rollup events are lost or if a rollup fails and isn’t retried. This isn’t an issue for frequently accessed counters, as they remain in rollup circulation. This issue is more pronounced for counters that aren’t accessed frequently. Typically, the initial read for such a counter will trigger a rollup, self-remediating the issue. However, for use cases that cannot accept potentially stale initial reads, we plan to implement improved error detection, rollup handoffs, and durable queues for resilient retries.

Conclusion

Distributed counting remains a challenging problem in computer science. In this blog, we explored multiple approaches to implement and deploy a Counting service at scale. While there may be other methods for distributed counting, our goal has been to deliver blazing fast performance at low infrastructure costs while maintaining high availability and providing idempotency guarantees. Along the way, we make various trade-offs to meet the diverse counting requirements at Netflix. We hope you found this blog post insightful.

Stay tuned for Part 3 of Composite Abstractions at Netflix, where we’ll introduce our Graph Abstraction, a new service being built on top of the Key-Value Abstraction and the TimeSeries Abstraction to handle high-throughput, low-latency graphs.

Acknowledgments

Special thanks to our stunning colleagues who contributed to the Counter Abstraction’s success: Joey Lynch, Vinay Chella, Kaidan Fullerton, Tom DeVoe, Mengqing Wang


Netflix’s Distributed Counter Abstraction was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.

[$] Progress on toolchain security features

Post Syndicated from jake original https://lwn.net/Articles/996344/

Over the years, there has been steady progress in adding security features to compilers and other tools to assist with hardening the Linux kernel (and, of course, other programs). In something of a tradition in the toolchains track at the Linux Plumbers Conference, Kees Cook and Qing Zhao have led a session on that progress and further plans; this year, they were joined by Justin Stitt (YouTube video).

Thailand Under the Skin (Part Three)

Post Syndicated from Емине Садкъ original https://www.toest.bg/tayland-pod-kozhata-treta-chast/


The Koh Lanta archipelago is Thailand’s paradise for divers from all over the world. They are the ones who popularized Koh Lanta Yai, or the Big Island, home to some of the most beautiful beaches and coral reefs in southern Thailand. The archipelago’s tourist life is concentrated on the Big Island, and foreigners generally call it simply Lanta.

Lanta is active during the so-called dry season, from December to March. The rest of the time it rains for weeks on end. It is ungodly humid. The island is deserted. Shops and restaurants stay open only where the locals live: in Lanta Old Town and in the Ban Saladan quarter. There are almost no beaches there, which is why tourists rarely stay. The shores of both quarters offer views of the Andaman Sea and of some of the archipelago’s islands. The views are most beautiful from the piers of the Old Town and Ban Saladan.

Unlike the noisy and popular islands of Krabi province, Koh Kradan and Phi Phi Don, Lanta is unusually calm. This is due to the religious customs of the predominant local population, the Malaysian Muslims. Drugs, prostitution, and parties are kept to a minimum, and the island falls asleep after two in the morning.

Lanta truly resembles no other place in Thailand. Its geographic location, with its outlet to the Andaman Sea and its proximity to Malaysia, has allowed for an extraordinary religious and ethnic pluralism that began as early as five hundred years ago.

That was when Chinese and Indian merchant ships first stopped at the Old Town’s harbor. Nothing remains of the Indian settlers, but the Thai-Chinese, their culture, and their architecture are present everywhere on the island. Many of the houses in the Old Town give the impression of a well-developed, settled society of merchants. And if you arrive by sea, you will admire the appearance of a stately and dignified town raised above the water.

Some researchers of Lanta claim that when the first ships arrived, the local tribal population, the Urak Lawoi, the Moken, or sea nomads, were already living in stilt dwellings along the island’s shores. Others say that the sea nomads arrived with the Malaysian Muslim population three hundred years ago. When they came to the island, the Malaysian Muslims made a living from fishing, trading in coconut products, and working on the rubber plantations.

Although Thailand is one of the world’s four leading exporters of natural polymers, and condoms and sanitary gloves are astonishingly cheap, a large part of Lanta’s rubber plantations have been destroyed. Only a few groves remain, easily recognized by the little bowls tied to the lower part of the trunks. These bowls, which look like bird feeders, collect the tree’s white sap, the latex.

Today the Malaysian Muslims, like everyone else on Lanta, make their living mainly from tourism. Their mosques resemble pineapples, with domes much rounder than those familiar from the Balkans. Their dress is traditional, and they are easy to tell apart from the Buddhists and the sea nomads.

Thais call the sea nomads Chao Le, or “those who live in and feed themselves from the sea,” and also Chao Nam, “the sea people.” Some call them Orang Lanta, or “the people of Lanta.” And those who embrace the country’s assimilation policy call them Thai Mai, or “the new Thais.” As with the treatment of most minorities around the world, the government of Thailand exerts appalling pressure on the sea nomads and does not recognize their right to property, their citizenship, or their language, Moken, part of the Austronesian language family.

The sea nomads are animists. They believe that not only people but also animals, plants, and every part of nature possess a soul capable of communicating with everything around it. They regard the sea as their home and hold that its resources cannot be owned by humans. The sea nomads hunt from an early age. Their children are known to have the superpower of seeing twice as well underwater as European children. On Lanta the sea nomads live in isolation, in stilt dwellings or in kabang, large boats that they trade with and use as kitchen and bedroom. A small number of them live on land. What I saw in their quarter on land in Lanta resembled any ghetto.

Although the sea nomads, including those on Lanta, are heavily stigmatized, they are advertised as an exotic tourist attraction. Their community copes with dignity with the unpleasant attitude toward them, and some of its members work professionally with divers and reveal to them the secrets of the Andaman Sea.

Until 2015, apart from the divers, Lanta’s main tourists were Swedes. Accommodation on the island was in bamboo stilt huts or in houses on stilts, which I mentioned in the previous article. There were also a few small restaurants. The infrastructure had not yet managed to connect all points of the island, which is 25 kilometers long and 6 wide. And before the Sri Lanta bridge was built between Lanta and Koh Lanta Noi, the archipelago’s second-largest island, the ferry from the administrative center, Krabi, took more than an hour.

It could be said that ten to fifteen years ago Lanta was a wild place where monkeys, monitor lizards, and stray dogs roamed freely, and the locals made their living as they had for centuries. And with a little tourism. The nearest veterinary clinic was in Krabi, and the stray dogs and cats went around not only hungry but also sick, so bites were not uncommon.

Junie Kovacs opened the first shelter for stray animals on Lanta and works with tourist volunteers from all over the world. It is now almost impossible to come across stray animals on the island. Lanta Animal Welfare is supported by donations and through the cooking school Time for Lime. In one day at Time for Lime I learned the basic rules of Thai cuisine. Some people from the course left with an island pet of their own, and Junie’s foundation helped with all the paperwork.

I must admit that after a month of Thai food and a cooking course, from time to time I crave something familiar. Besides the pizzeria Loro Loco, which surprised us last year with its wonderful pizzas, this year we discovered Baja Taco, and had we not been embarrassed, we would have moved in next door to Baja’s place.

Another pleasant discovery for me was Mu Ko Lanta National Park, where I saw bathing elephants, monitor lizards soaking in a small stream, and countless kinds of monkeys. And a tree snake that goggled at me. But the most pleasant part of the whole walk in the park, I admit, was its seabed, and I am glad I had brought my snorkel so I could see countless fish of fanciful shapes and colors, which looked as if I were swimming in someone’s dream.

I felt as if in a dream on Lanta Noi too. Only locals live there. A tour of it by motorcycle takes a few hours. There are no beaches on Lanta Noi, only fishing villages, a sense of spaciousness, and two or three restaurants. As well as the abandoned yellowish palace of a Chinese merchant. On the shore of the estate, yet another fishing village has been founded.

It will probably take another decade or two for Lanta’s bustle to spill over to Lanta Noi, but for now there are no such fears. And since my mode of existence is such that I vacation in the same place until every pebble learns my name, I suppose I will keep visiting this part of Thailand until it changes so much that I can no longer recognize it.

To finish: if you decide to travel to Lanta, do not forget 1) to apply for a visa online, which is faster, easier, cheaper, and more humane; and 2) to drink a Cuba libre out of a coconut while watching the sun sink into the Andaman Sea.

Cheers!

On Second Reading: “Grand Hotel Europa”

Post Syndicated from original https://www.toest.bg/na-vtoro-chetene-grand-hotel-evropa/

“Grand Hotel Europa” by Ilja Leonard Pfeijffer

translated from the Dutch by Maria Encheva, Sofia: Colibri, 2022

It is astonishing that almost nothing can be found in the Bulgarian internet space about this absolute masterpiece of European literature, a novel by and about Europe, deserving of the most significant literary awards and translated into almost all European languages. Published in 2018 (and perhaps for that reason alone missing the chance to reflect on the pandemic as well) and translated into Bulgarian in 2022, “Grand Hotel Europa” has remained out of the spotlight. Or perhaps this goes to show how far we still stand from the big, significant themes in the discourse about the Old Continent today. From the plots and questions that a classical philologist by training, with more than forty titles in his bibliography, has not merely diagnosed but unravels down to the last thread, with disconcerting directness and ironic insight, leading us to more than disturbing conclusions.

“Grand Hotel Europa” is a superb hybrid in genre, theme, and language alike. It is a contemporary, topical text, yet one sustained in the tradition of the great novels of the twentieth century, those of the caliber of writers such as Stefan Zweig and Thomas Mann (Pfeijffer himself calls his novel “The Magic Mountain” of the twenty-first century). The imposing scale of the text is due to so many things. To the erudition, all-encompassing to the point of implausibility, which nevertheless flows naturally both from the author-narrator (Pfeijffer himself, of course) and from his characters. To the gaze turned simultaneously toward the past, the present, and the future. To the writer’s ability to create an echo of metaphors and symbols, resounding from the most mundane to the deeply existential.

What Pfeijffer has written could be read as a detailed and insightful study, perhaps even as a series of articles across a host of disciplines. Yet it has been skillfully integrated into a novel, a real, engaging, challenging one that never sinks for a moment. The text combines autobiography, bitter satire, philosophical études, almost lecture-like dialogues, and even a detective mystery. The intellect of this writer, almost Renaissance in his sensibility, walks us meticulously, with first-hand knowledge, precise data, and a capacity for deep philosophical inference, from art to economics, from history to statistics, from religion to ecology, and from geopolitics to the most important theme of the book: mass tourism, “the most visible consequence of globalization.”

While holding together dozens of thematic threads, Pfeijffer also tells us two central stories about himself as protagonist. The middle-aged Dutch writer, who has lived for years in Genoa, about which he wrote his novel La Superba, moves into a hotel in an unnamed part of Europe. On the one hand, to make sense through writing of his recently ended relationship with the Italian Clio, an art historian from an aristocratic family. On the other, to work on his novel about tourism.

Before they part, the new job of the emotional, irascible, and professionally frustrated Clio (the name of the muse of history is no accident) takes them to Venice. It is there, faced with the tourist monster that the sinking La Serenissima has become, that the idea for the novel in question first occurs to them. The project will take Pfeijffer (in person or through the stories of other characters) to other destinations as well: ones gasping under the influx of tourists, ones that never wanted the hordes of visitors, and still others that are desperately trying to attract them. We will go to Genoa, Venice, Cinque Terre, Amsterdam, Giethoorn, Abu Dhabi, Pakistan, and even Skopje, whose absurd ambitions to acquire a history through the megalomaniac project “Skopje 2014” the author explains and ridicules with enviable understanding.

Understanding, because, as Pfeijffer concludes:

The history of Europe can be described as a history of the longing for history.

Alone among all the continents, Europe is obsessed with (to the point of being hostage to) its past, with nostalgia and melancholy, with the “golden age” complex. As well as with the seductive delusion of right-wing populism that an idyllic and idealized past can repair a present that is slipping away. Hence the instinct to preserve this past, which stands as the identity and sole asset of the aging continent, by now marginalized both economically and demographically. The past is perhaps the last currency, the thing Europe still has to offer “for sale” to the rest of the world before it goes under “à la The Decline of the West.” This is precisely why Europe has become the amusement park of the world.

Tourism does not sound like an appetizing literary subject. Yet under Pfeijffer’s pen it is exactly that. Through the aperture of twenty-first-century mass tourism, the writer shows us everything that is happening on the Old Continent. The psychology of tourism, at times utterly absurd and paradoxical, is mercilessly unraveled. The types of tourists, with their motives, expectations, habits, and even manner of dress, are described precisely and grotesquely. The hedonism and consumerism that are destructive to local societies, cultures, economies, and nature are explained and exposed. Pfeijffer examines even the smallest aspects of this far from straightforward game of consuming vs. getting to know, of the real vs. the fake and staged, of the old and imposing vs. the new and practical, of expectations vs. illusory authenticity.

Tourism that aspires to be non-tourism, avoiding other tourists off the beaten track, holds a special place in Pfeijffer’s reflections. The search for the proverbial authenticity at any cost is precisely what ultimately kills it. It leads to a fake, secondhand, theatrical, and staged authenticity, fed by the superficial expectations of tourists and materially supplied by the Chinese economy.

And while Asians and Americans pour into Europe in search of the faded splendor of old in order to take a selfie, Westerners at the same time roam Asia and Africa supposedly in search of an unvarnished, poor, wretched genuineness in order to feel privileged. In one of the most harrowing scenes, Dutch travelers wandering through Pakistan witness a tribal court and the public rape of a young girl, only for us to learn that the whole horror may have been merely a performance put on for tourists, the kind a Westerner expects in order to satisfy his whim for authenticity.

“The invasion of the barbarians” runs in both directions. Mass tourism and aggressive Asian business fit that description with respect to Europe, far more than the supposed apocalyptic invasion of refugees and migrants here. Tourists are always the others, Pfeijffer writes, but he does not leave the idea there and, as expected, extends it to the aforementioned unwanted others.

Tourism stands in awkward contrast to the other form of migration that results from globalization. That one we call “problematic” without any hesitation. We open our borders wide to foreigners who have come to spend their money here, and slam them shut in the faces of foreigners who have come to earn a little.

In offering its past to tourists, Europe expects money. In coming to seek a livelihood and a new life there, refugees (political or economic) are a chance at a future for the aging continent. A future that Europe rejects and fears.

The author introduces us to one of the newcomers in the figure of the bellboy Abdul. The story of the young man who fled the desert is brought full circle by the story of the European who set out for the desert from his Europe. The magnificent irony is that the path of the refugee Abdul repeats that of Aeneas; that is, it is retold through one of the foundational works of European literature. Archetypes have no nationality, and it is precisely nationalist impulses that Pfeijffer rejects in the name of a united Europe. A Europe that is ultimately the only place where he can exist, love, reflect, and write, paying homage to its foundational thinkers.


Yes, this novel is a declaration of love for Europe.

In the second storyline, which unfolds entirely in the grand hotel “Europa,” this is especially palpable. Pfeijffer introduces us to its permanent guests: eccentrics, loners, and erudites amid an almost museum-like atmosphere reminiscent of a bygone era of European style and elegance in manners, language, and dress. This deliberate old-fashionedness depicts nostalgia for what is gone and revives the spirit of a number of European novels. And what happens in the hotel becomes a perfect metaphor for the fate of Europe. Everything begins to change with the arrival of the new owner, a Chinese businessman with a different vision of the “authentically European” and of running the business, which until then had been fading elegantly under the impeccable control of Mr. Montebello, the majordomo fatefully called to his post.

Separately, the two lovers also bring a thread of adventure mystery into the novel. While reflecting on history and European art, Leo and Clio search for the last painting by Caravaggio, lost to the world, whose story also features abundantly in the novel. They develop their own theory, irresistible in its logic, about the fate and death of the controversial painter and about his paintings. And they show us that “Europe is mired in nostalgia” is not a diagnosis of the present alone but a centuries-old approach in art, valid in the Renaissance and in all other eras.

Yes, there is no doubt that this novel is the fruit of Pfeijffer, sometimes called “the bad boy of Dutch literature.” An author who is charmingly or irritatingly (depending on the reader) egocentric, turned toward and referring to himself and his novels, self-promoting, and even offering explanations and praise for the book we are yet to hold in our hands. But there is no doubt that his readers must be up to the mark to cope with this Venetian carnival of the intellect, which moves from the profane to the sublime and back. “Grand Hotel Europa” is an unavoidable part of the discourse about our continent and its identity (in which something Bulgarian is mentioned several times). As well as a not particularly optimistic forecast.

But “beautiful stories never end well”, this true European patriot reminds us. And immediately afterwards he explains to us (reassures us?) that in its past Europe has “witnessed the decline of such mighty empires that the cycle of birth, flourishing, and decay has lodged itself deep in the cerebral cortex of our historical consciousness”. Perhaps that is precisely why “for us, decline is part of a completed structure which, without its final element, would be aesthetically imperfect and incomplete”. And aesthetics is a fundamental European category and value. Just as aesthetic and solemn is the final funeral of “Europa” (here, however, there is a mystery that we will not reveal).

We leave you with one of the most nostalgic passages from the book, steeped in difficult love:

I think I cannot live outside Europe. I do not merely think so; I am certain. In Europe, where the only sure thing is trust in thought; where in the course of a long and wearying history we have found so many solutions that a love of problems has arisen in us; where there is no convincing justification for showing diligence, so we prefer to bet on elegance; where both snobbery and irony were invented; where our scars seem beautiful to us because they make us cautious […]; where […] we have not yet reached agreement on definitions and points of departure so that we might hold a meaningful discussion about the beautiful, the good, and the true; where doubt has been elevated to a religion; where there are more philosophers than waiters to serve them, and more poets than readers; where every landscape, every cityscape, and the cheeks of every woman are cracked with maturity; where the past is as tangible as stone and the streets are as legible as a palimpsest; […] where everything was once many times better and more beautiful than it is now; and where it is worth resting from the endless filling-in of the annals of a thousand-year history; only there can I breathe and love.


 Active donors to Toest receive a permanent 20% discount off the cover price of all titles in the catalogue of Colibri Publishers, as well as of several other Bulgarian publishers, as part of the Toest Readers’ Club partner programme. For more information, see toest.bg/club.

None of us reads only the newest books. So why are they the only ones written about? “On Second Reading” is a column in which we open the lists of books published at least a year ago, read them, and recommend our favourites. The column is part of the Toest Readers’ Club partner programme. The choice of titles, however, is entirely the authors’ own: Stefan Ivanov and Antonia Apostolova, who would recommend these books to you even if you could take a walk with them through the bookshop once every two weeks.

 

AWS Wickr recognized in The Forrester Wave for Secure Communications Solutions

Post Syndicated from Anne Grahn original https://aws.amazon.com/blogs/messaging-and-targeting/aws-wickr-recognized-in-the-forrester-wave-for-secure-communications-solutions/

Evolving threats, flexible work models, and a growing patchwork of data protection and privacy laws have made securing business communications a challenge. We are excited to announce that Amazon Web Services (AWS) Wickr has been named a Strong Performer in The Forrester Wave™: Secure Communications Solutions, Q3 2024. We believe this recognition from Forrester underscores the potential value of AWS Wickr to security-conscious customers with demands for compliance, flexible deployment, and high assurance.

The Forrester Wave: Secure Communications Solutions, Q3 2024 evaluates Leaders, Strong Performers, Contenders, and Challengers in the secure communications solutions market. It’s an assessment of the top vendors, providing insights to help security professionals select the right solution for their needs.

The report covers key trends, such as the need for solutions that enable mission-critical communications and collaboration, while aligning with use case-specific security and privacy requirements. Vendors are evaluated across 21 criteria, including assurance, retention, and postquantum cryptography.

AWS Wickr, an end-to-end encrypted messaging and collaboration service, is among 12 Secure Communications Solutions offerings evaluated by Forrester. AWS Wickr provides advanced security for sensitive communications, flexible administrative controls for user and policy management, and data retention to help meet auditing and regulatory needs.

AWS Wickr customers include U.S. Department of Defense organizations such as the U.S. Air Force and the U.S. Army Telemedicine & Advanced Technology Research Center (TATRC). They also include non-profit humanitarian organizations such as Operation Recovery, and private-sector organizations such as Les Ambassadeurs Club. These customers leverage the robust security and collaboration capabilities that Wickr provides across multiple use cases.

As you look to maintain secure and compliant business communications, Forrester’s report offers a valuable guide to finding a solution that works for your organization. Access a complimentary copy of The Forrester Wave: Secure Communications Solutions, Q3 2024 here. 

To learn more about AWS Wickr, visit our website or contact us.

Exploring how well Experience AI maps to UNESCO’s AI competency framework for students

Post Syndicated from Ben Garside original https://www.raspberrypi.org/blog/experience-ai-unesco-ai-competency-framework/

During this year’s annual Digital Learning Week conference in September, UNESCO launched their AI competency frameworks for students and teachers. 

What is the AI competency framework for students? 

The UNESCO competency framework for students serves as a guide for education systems across the world to help students develop the necessary skills in AI literacy and to build inclusive, just, and sustainable futures in this new technological era.

It is an exciting document because, as well as being comprehensive, it’s the first global framework of its kind in the area of AI education.

The framework serves three specific purposes:

  • It offers a guide on essential AI concepts and skills for students, which can help shape AI education policies or programs at schools
  • It aims to shape students’ values, knowledge, and skills so they can understand AI critically and ethically
  • It suggests a flexible plan for when and how students should learn about AI as they progress through different school grades

The framework is a starting point for policy-makers, curriculum developers, school leaders, teachers, and educational experts to look at how it could apply in their local contexts. 

It is not possible to create a single curriculum suitable for all national and local contexts, but the framework flags the necessary competencies for students across the world to acquire the values, knowledge, and skills necessary to examine and understand AI critically from a holistic perspective.

How does Experience AI compare with the framework?

A group of researchers and curriculum developers from the Raspberry Pi Foundation, with a focus on AI literacy, attended the conference and afterwards we tasked ourselves with taking a deep dive into the student framework and mapping our Experience AI resources to it. Our aims were to:

  • Identify how the framework aligns with Experience AI
  • See how the framework aligns with our research-informed design principles
  • Identify gaps or next steps

Experience AI is a free educational programme that offers cutting-edge resources on artificial intelligence and machine learning for teachers, and their students aged 11 to 14. Developed in collaboration with the Raspberry Pi Foundation and Google DeepMind, the programme provides everything that teachers need to confidently deliver engaging lessons that will teach, inspire, and engage young people about AI and the role that it could play in their lives. The current curriculum offering includes a ‘Foundations of AI’ 6-lesson unit, 2 standalone lessons (‘AI and ecosystems’ and ‘Large language models’), and the 3 newly released AI safety resources. 

Working through each lesson objective in the Experience AI offering, we compared them with each curricular goal to see where they overlapped. We have made this mapping publicly available so that you can see this for yourself: Experience AI – UNESCO AI Competency framework students – learning objective mapping (rpf.io/unesco-mapping)

The first thing we discovered was that the mapping of the objectives did not have a 1:1 basis. For example, when we looked at a learning objective, we often felt that it covered more than one curricular goal from the framework. That’s not to say that the learning objective fully met each curricular goal, rather that it covers elements of the goal and in turn the student competency. 

Once we had completed the mapping process, we analysed the results by totalling the number of objectives that had been mapped against each competency aspect and level within the framework.
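The totalling step described above can be sketched in a few lines of Python. The rows below are placeholders invented for illustration (they are not the real Experience AI mapping, which is available at rpf.io/unesco-mapping), and the tuple layout is an assumption. The sketch only shows how a one-to-many mapping can be counted per competency aspect and progression level.

```python
from collections import Counter

# Placeholder rows: (learning objective, competency aspect, progression level).
# One objective may appear several times because the mapping is not 1:1.
mapping_rows = [
    ("Objective A", "Human-centred mindset", "Understand"),
    ("Objective A", "Ethics of AI", "Understand"),
    ("Objective B", "ML techniques and applications", "Apply"),
    ("Objective C", "Human-centred mindset", "Apply"),
]

# Total per (aspect, level) cell, which is what a heatmap would display.
totals_by_cell = Counter((aspect, level) for _, aspect, level in mapping_rows)

# Total per progression level, which is what a bar chart would display.
totals_by_level = Counter(level for _, _, level in mapping_rows)

for (aspect, level), count in sorted(totals_by_cell.items()):
    print(f"{aspect} / {level}: {count}")
```

Counting rows rather than distinct objectives matches the idea that a single learning objective can contribute to more than one curricular goal.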

This provided us with an overall picture of where our resources are positioned against the framework. Whilst the majority of the objectives for all of the resources are in the ‘Human-centred mindset’ category, the analysis showed that there is still a relatively even spread of objectives in the other three categories (Ethics of AI, ML techniques and applications, and AI system design). 

As the current resource offering is targeted at the entry level to AI literacy, it is unsurprising to see that the majority of the objectives were at the level of ‘Understand’. It was, however, interesting to see how many objectives were also at the ‘Apply’ level. 

It is encouraging to see that the different resources from Experience AI map to different competencies in the framework. For example, the 6-lesson foundations unit aims to give students a basic understanding of how AI systems work and the data-driven approach to problem solving. In contrast, the AI safety resources focus more on the principles of Fairness, Accountability, Transparency, Privacy, and Security (FATPS), most of which fall more heavily under the ethics of AI and human-centred mindset categories of the competency framework. 

What did we learn from the process? 

Our principles align 

We built the Experience AI resources on design principles based on the knowledge curated by Jane Waite and the Foundation’s researchers. One of our aims of the mapping process was to see if the principles that underpin the UNESCO competency framework align with our own.

Avoiding anthropomorphism 

Anthropomorphism refers to the concept of attributing human characteristics to objects or living beings that aren’t human. For reasons outlined in the blog I previously wrote on the issue, a key design principle for Experience AI is to avoid anthropomorphism at all costs. In our resources, we are particularly careful with the language and images that we use. Putting the human in the process is a key way in which we can remind students that it is humans who design and are responsible for AI systems. 

It was reassuring to see that the UNESCO framework has many curricular goals that align closely to this, for example:

  • Foster an understanding that AI is human-led
  • Facilitate an understanding on the necessity of exercising sufficient human control over AI
  • Nurture critical thinking on the dynamic relationship between human agency and machine agency

SEAME

The SEAME framework created by Paul Curzon and Jane Waite offers a way for teachers, resource developers, and researchers to talk about the focus of AI learning activities by separating them into four layers: Social and Ethical (SE), Application (A), Models (M), and Engines (E). 

The SEAME model and the UNESCO AI competency framework take two different approaches to categorising AI education — SEAME describes levels of abstraction for conceptual learning about AI systems, whereas the competency framework separates concepts into strands with progression. We found that although the alignment between the frameworks is not direct, the same core AI and machine learning concepts are broadly covered across both. 

Computational thinking 2.0 (CT2.0)

The concept of computational thinking 2.0 (a data-driven approach) stems from research by Professor Matti Tedre and Dr Henriikka Vartiainen from the University of Eastern Finland. The essence of this approach establishes AI as a different way to solve problems using computers compared to a more traditional computational thinking approach (a rule-based approach). This does not replace the traditional computational approach, but instead requires students to approach the problem differently when using AI as a tool. 

The UNESCO framework includes many references within its curricular goals that place the data-driven approach at the forefront of problem solving using AI, including:

  • Develop conceptual knowledge on how AI is trained based on data 
  • Develop skills on assessing AI systems’ need for data, algorithms, and computing resources

Where we slightly differ in our approach is the regular use of the term ‘algorithm’, particularly in the Understand and Apply levels of the framework. We have chosen to differentiate AI systems from traditional computational thinking approaches by avoiding the term ‘algorithm’ at the foundational stage of AI education. We believe that learners need a firm mental model of data-driven systems before they can understand that the Models and Engines layers of the SEAME model refer to algorithms (which would possibly correspond to the Create stage of the UNESCO framework).

We can identify areas for exploration

As part of the international expansion of Experience AI, we have been working with partners from across the globe to bring AI literacy education to students in their settings. Part of this process has involved working with our partners to localise the resources, but also to provide training on the concepts covered in Experience AI. During localisation and training, our partners often have lots of queries about the lesson on bias. 

As a result, we decided to see if mapping taught us anything about this lesson in particular, and if there was any learning we could take from it. On close inspection, we found that the lesson covers two out of the three curricular goals for the Understand element of the ‘Ethics of AI’ category (Embodied ethics).

Specifically, we felt the lesson:

  • Illustrates dilemmas around AI and identifies the main reasons behind ethical conflicts
  • Facilitates scenario-based understandings of ethical principles on AI and their personal implications

What we felt isn’t covered in the lesson is:

  • Guide the embodied reflection and internalisation of ethical principles on AI

Exploring this further, the framework describes this curricular goal as:

Guide students to understand the implications of ethical principles on AI for their human rights, data privacy, safety, human agency, as well as for equity, inclusion, social justice and environmental sustainability. Guide students to develop embodied comprehension of ethical principles; and offer opportunities to reflect on personal attitudes that can help address ethical challenges (e.g. advocating for inclusive interfaces for AI tools, promoting inclusion in AI and reporting discriminatory biases found in AI tools).

We realised that this doesn’t mean that the lesson on bias is ineffective or incomplete, but it does help us to think more deeply about the learning objective for the lesson. This may be something we will look to address in future iterations of the foundations unit or even in the development of new resources. What we have identified is a process that we can follow, which will help us with our decision making in the next phases of resource development. 

How does this inform our next steps?

As part of the analysis of the resources, we created a simple heatmap of how the Experience AI objectives relate to the UNESCO progression levels. As with the bar charts, the heatmap indicated that the majority of the objectives sit within the Understand level of progression, with fewer in Apply, and fewest in Create. As previously mentioned, this is to be expected with the resources being “foundational”.

The heatmap has, however, helped us to identify some interesting points about our resources that warrant further thought. For example, under the ‘Human-centred mindset’ competency aspect, there are more objectives under Apply than there are Understand. For ‘AI system design’, architecture design is the least covered aspect of Apply. 

Identifying these areas for investigation shows again that we can draw on the UNESCO framework to help us make decisions.

What next? 

This mapping process has been a very useful exercise in many ways for those of us working on AI literacy at the Raspberry Pi Foundation. The process of mapping the resources gave us an opportunity to have deep conversations about the learning objectives and question our own understanding of our resources. It was also very satisfying to see that the framework aligns well with our own research-informed design principles, such as the SEAME model and avoiding anthropomorphisation.

The mapping process has been a good starting point for us to understand UNESCO’s framework and we’re sure that it will act as a useful tool to help us make decisions around future enhancements to our foundational units and new free educational materials. We’re looking forward to applying what we’ve learnt to our future work! 

The post Exploring how well Experience AI maps to UNESCO’s AI competency framework for students appeared first on Raspberry Pi Foundation.

Anaconda’s new “Web UI” (Fedora Magazine)

Post Syndicated from jzb original https://lwn.net/Articles/997927/

Garrett LeSage has written an in-depth article
for Fedora Magazine about a new web-based user interface (UI) for Fedora’s
Anaconda
installer, planned to ship with Fedora 42. The article looks at
the rationale for moving from GTK 3 to a web-based UI, provides a
number of screenshots and demo screencasts, as well as instructions on
trying out the new installer with Fedora Rawhide.

Streamlining AWS Glue Studio visual jobs: Building an integrated CI/CD pipeline for seamless environment synchronization

Post Syndicated from Andrei Maksimov original https://aws.amazon.com/blogs/big-data/streamlining-aws-glue-studio-visual-jobs-building-an-integrated-ci-cd-pipeline-for-seamless-environment-synchronization/

Many Amazon Web Services (AWS) customers have integrated their data across multiple sources using AWS Glue, a serverless data integration service. By providing seamless integration throughout the development lifecycle, AWS Glue enables organizations to make data-driven business decisions.

AWS Glue Studio visual jobs provide a graphical interface called the visual editor that you can use to author extract, transform, and load (ETL) jobs in AWS Glue visually. The visual editor maintains a visual representation of a variety of data sources, transformations, and data sinks. With its intuitive interface, you can easily create large-scale data integration jobs without needing coding expertise, simplifying workflows and eliminating the need for manual ETL script programming.

As data engineers increasingly rely on the AWS Glue Studio visual editor to create data integration jobs, the need for a streamlined development lifecycle and seamless synchronization between environments has become paramount. Additionally, managing versions of visual directed acyclic graphs (DAGs) is crucial for tracking changes, collaboration, and maintaining consistency across environments.

This post introduces an end-to-end solution that addresses these needs by combining the power of the AWS Glue Visual Job API, a custom AWS Glue Resource Sync Utility, and an AWS Cloud Development Kit (AWS CDK)-based continuous integration and continuous deployment (CI/CD) pipeline.

A few common questions from our customers include:

  • What are the best practices for moving our workloads from a pre-production environment to production?
  • What are the recommended best practices for provisioning data integration components?
  • How can I build AWS Glue visual jobs in the development environment and automatically propagate them to the production account using the CI/CD pipeline?
  • How can I version control and track changes to my AWS Glue Studio visual jobs?

End-to-end development lifecycle for data integration pipeline

The software development lifecycle on AWS has six phases: plan, design, implement, test, deploy, and maintain, as shown in the following diagram.

SDLC

For more information regarding each component, check out End-to-end development lifecycle for data engineers to build a data integration pipeline using AWS Glue.

AWS Glue Resource Sync Utility

As part of synchronizing AWS Glue visual jobs across different environments, requirements include:

  • Manage version control of visual DAGs by tracking changes to AWS Glue Studio visual jobs using version control systems such as Git
  • Promote AWS Glue visual jobs from a pre-production environment to a production environment
  • Transfer ownership of AWS Glue visual jobs between different AWS accounts
  • Replicate AWS Glue visual jobs from one AWS Region to another as part of a disaster recovery strategy

The AWS Glue Resource Sync Utility is a Python application developed on top of the AWS Glue Visual Job API, designed to synchronize AWS Glue Studio visual jobs across different accounts without losing the visual representation. It operates by using source and target AWS environment profiles. Optionally, a list of jobs for synchronization can be provided along with a mapping file to replace environment-specific resources.
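To make the idea of a mapping file more concrete, the following is a minimal sketch of how environment-specific strings in a serialized job definition could be rewritten. It is not the utility's actual implementation. The function name apply_mapping, the sample job fields, and the account IDs are all hypothetical and chosen only for illustration.

```python
import json

def apply_mapping(value, mapping):
    """Recursively replace every mapped substring in strings nested inside value."""
    if isinstance(value, str):
        for source, target in mapping.items():
            value = value.replace(source, target)
        return value
    if isinstance(value, list):
        return [apply_mapping(item, mapping) for item in value]
    if isinstance(value, dict):
        return {key: apply_mapping(item, mapping) for key, item in value.items()}
    return value  # numbers, booleans, and None are returned unchanged

# Hypothetical development job definition (illustrative fields only).
dev_job = {
    "Name": "sample-visual-job",
    "Role": "arn:aws:iam::111111111111:role/service-role/AWSGlueServiceRole",
    "Command": {
        "ScriptLocation": "s3://aws-glue-assets-111111111111-us-east-1/scripts/sample.py"
    },
    "MaxRetries": 0,
}

# Hypothetical mapping from development resources to production resources.
mapping = {
    "s3://aws-glue-assets-111111111111-us-east-1": "s3://aws-glue-assets-222222222222-us-east-1",
    "arn:aws:iam::111111111111:role/": "arn:aws:iam::222222222222:role/",
}

prod_job = apply_mapping(dev_job, mapping)
print(json.dumps(prod_job, indent=2))
```

The function builds a new structure instead of modifying its input, so the development definition remains available for comparison.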

For more information on the AWS Glue Resource Sync Utility, refer to Synchronize your AWS Glue Studio Visual Jobs to different environments.

Solution overview

As shown in the following diagram, this solution uses three separate AWS accounts. One account is designated for the development environment, another for the production environment, and a third to host the CI/CD infrastructure and pipeline.

Solution Overview

The solution emphasizes version controlling AWS Glue Studio visual jobs by serializing them into JSON files and storing them in a Git repository. As a result, you can:

  • Track changes to your visual DAGs over time.
  • Collaborate with team members.
  • Restore and deploy visual DAGs in different environments seamlessly.
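As a rough illustration of why JSON serialization suits version control, the sketch below writes a job definition with sorted keys and fixed indentation so that two versions differ only where the job really changed. The helper name serialize_job and the sample fields are assumptions made for this example. The format actually produced by the AWS Glue Resource Sync Utility may differ.

```python
import difflib
import json

def serialize_job(job):
    """Serialize with sorted keys and fixed indentation so diffs stay stable."""
    return json.dumps(job, indent=2, sort_keys=True) + "\n"

# Two hypothetical versions of the same job; key order differs on purpose.
before = {"Name": "sample-visual-job", "NumberOfWorkers": 2, "WorkerType": "G.1X"}
after = {"WorkerType": "G.1X", "Name": "sample-visual-job", "NumberOfWorkers": 4}

diff = list(difflib.unified_diff(
    serialize_job(before).splitlines(),
    serialize_job(after).splitlines(),
    fromfile="resources.json (previous commit)",
    tofile="resources.json (working copy)",
    lineterm="",
))
print("\n".join(diff))
```

Because the keys are sorted, reordering fields in the editor does not show up as a change. Only the worker count appears in the diff.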

The AWS account responsible for hosting the CI/CD pipeline is composed of three key components:

  • Managing AWS Glue Job updates – Provides smooth updates and maintenance of AWS Glue jobs.
  • Cross-Account Access Management – Enables secure promotion of updates from the development environment to the production environment.
  • Version Control Integration – Incorporates serialized visual DAGs into the CI/CD pipeline for deployment to target environments.

You can create AWS Glue Studio visual jobs using the intuitive visual editor in your development account. After these jobs are configured, you can serialize the visual DAGs into JSON files and commit them to a Git repository. The CI/CD pipeline detects changes to the repository and automatically triggers the deployment process.

The pipeline includes a step where the AWS Glue Resource Sync Utility deserializes the visual DAGs from the committed JSON files and deploys them to the production environment. This approach promotes consistent deployment of jobs while maintaining their visual representation.

The solution uses the AWS Glue Visual Job API, AWS Glue Resource Sync Utility, and AWS CDK to streamline deployment across environments. It enables seamless synchronization and consistent versioning of AWS Glue jobs between development and production, preserving visual workflows and reducing manual tasks. The solution consists of two main parts:

  • Initial steps (one-time setup) – Setting up the development environment, bootstrapping AWS environments, deploying the CI/CD pipeline, and integrating the AWS Glue Resource Sync Utility
  • Day-to-day development (repeated) – Ongoing activities such as creating visual jobs, serializing them, committing changes to the repository, deploying to production through the pipeline, and verifying the jobs

The solution follows these high-level steps for the initial setup:

  1. Set up the development environment
  2. Bootstrap your AWS environments
  3. Deploy the CI/CD pipeline
  4. Configure AWS developer tools connection to GitHub
  5. Integrate the CI/CD pipeline with the AWS Glue Resource Sync Utility

The solution follows these high-level steps for the day-to-day development:

  1. Create visual jobs in the development account
  2. Serialize visual jobs
  3. Commit changes to Git repository
  4. Deploy visual jobs to production
  5. Verify visual jobs in production

Prerequisites

Before you begin, make sure you have the following:

  • GitHub account
  • Git (git command)
  • Python 3.9 or later
  • Package installer for Python (pip command)
  • AWS CDK Toolkit (cdk command) 2.155.0 or later
  • AWS CLI configured with appropriate credentials for your accounts
  • Three AWS accounts:
    • Development account
    • Production account
    • Pipeline account (for hosting the CI/CD pipeline)

Technical solution walkthrough

This section provides a detailed guide to setting up and using an automated CI/CD pipeline for AWS Glue Studio visual jobs.

Initial steps (one-time setup)

In this section, we walk through the foundational steps required to establish the CI/CD pipeline for AWS Glue Studio visual jobs. These initial steps set up the necessary infrastructure and configurations, providing a smooth and automated deployment process across your development and production environments.

Set up the development environment

To set up the development environment, follow these steps:

  1. Fork the aws-glue-cdk-baseline repository.
  2. Clone the forked repository:
git clone https://github.com/<YOUR-GITHUB-USERNAME>/aws-glue-cdk-baseline.git

cd aws-glue-cdk-baseline
  3. Create and activate a Python virtual environment:
python3 -m venv .venv

# On Windows, use .venv\Scripts\activate.bat
source .venv/bin/activate
  4. Install required dependencies:
pip install -r requirements.txt

pip install -r requirements-dev.txt
  5. To configure the default settings, edit the default-config.yaml file and replace the placeholders with your AWS account details:
    • Pipeline account: awsAccountId and awsRegion.
    • Development account: awsAccountId and awsRegion.
    • Production account: awsAccountId and awsRegion.
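Before deploying, it can help to confirm that no placeholder was left in the configuration. The following is a minimal sketch of such a check, assuming the file has already been loaded into a dictionary. The function name validate_config and the Region pattern are assumptions made for illustration, and only the devAccount and prodAccount sections (the key names used later in this post) are checked.

```python
import re

def validate_config(config, sections=("devAccount", "prodAccount")):
    """Return a list of problems; an empty list means the check passed."""
    problems = []
    for section in sections:
        settings = config.get(section) or {}
        # Account IDs may be loaded as integers, so compare their string form.
        account_id = str(settings.get("awsAccountId", ""))
        region = str(settings.get("awsRegion", ""))
        if not re.fullmatch(r"\d{12}", account_id):
            problems.append(f"{section}.awsAccountId should be a 12-digit account ID")
        # Heuristic pattern for Region codes such as us-east-1 (an assumption).
        if not re.fullmatch(r"[a-z]{2}(-[a-z]+)+-\d", region):
            problems.append(f"{section}.awsRegion does not look like a Region code")
    return problems

# Hypothetical configuration with one placeholder left unreplaced.
sample_config = {
    "devAccount": {"awsAccountId": 111111111111, "awsRegion": "us-east-1"},
    "prodAccount": {"awsAccountId": "<PROD-ACCOUNT-ID>", "awsRegion": "us-east-1"},
}

for problem in validate_config(sample_config):
    print(problem)
```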

Bootstrap your AWS environments

Bootstrapping prepares your AWS accounts for AWS CDK deployments. To bootstrap your AWS environments, run the following commands, replacing placeholders with your account numbers, Regions, and AWS CLI profiles:

# Bootstrap the pipeline account
cdk bootstrap aws://<PIPELINE-ACCOUNT-NUMBER>/<REGION> --profile <PIPELINE-PROFILE>

# Bootstrap the development account, trusting the pipeline account
cdk bootstrap aws://<DEV-ACCOUNT-NUMBER>/<REGION> --profile <DEV-PROFILE> --trust <PIPELINE-ACCOUNT-NUMBER>

# Bootstrap the production account, trusting the pipeline account
cdk bootstrap aws://<PROD-ACCOUNT-NUMBER>/<REGION> --profile <PROD-PROFILE> --trust <PIPELINE-ACCOUNT-NUMBER>
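If you prefer to generate these commands instead of typing them, the pattern can be sketched in Python. The dictionary layout, the profile names, and the function name bootstrap_commands below are hypothetical. The sketch uses only the cdk bootstrap arguments already shown above and prints the commands without running them.

```python
import shlex

def bootstrap_commands(accounts, pipeline_key="pipeline"):
    """Build one cdk bootstrap command per account; non-pipeline accounts trust the pipeline account."""
    pipeline_account = accounts[pipeline_key]["account"]
    commands = []
    for name, env in accounts.items():
        argv = [
            "cdk", "bootstrap", f"aws://{env['account']}/{env['region']}",
            "--profile", env["profile"],
        ]
        if name != pipeline_key:
            argv += ["--trust", pipeline_account]
        commands.append(shlex.join(argv))
    return commands

# Hypothetical account numbers, Regions, and AWS CLI profile names.
accounts = {
    "pipeline": {"account": "000000000000", "region": "us-east-1", "profile": "pipeline-profile"},
    "dev": {"account": "111111111111", "region": "us-east-1", "profile": "dev-profile"},
    "prod": {"account": "222222222222", "region": "us-east-1", "profile": "prod-profile"},
}

commands = bootstrap_commands(accounts)
for command in commands:
    print(command)
```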

Deploy the CI/CD pipeline

Deploy the pipeline stack to your pipeline account:

cdk deploy --profile <PIPELINE-PROFILE>

This command creates:

  • The pipeline stack in the pipeline account
  • The AWS Glue app stack in the development account

Configure AWS developer tools connection to GitHub

To establish a connection between AWS CodePipeline and your GitHub repository, follow these steps:

  1. Create a GitHub connection:
    • In the AWS Management Console for your pipeline account, navigate to AWS CodePipeline
    • In the navigation pane, choose Connections
    • Choose Create connection
    • Select GitHub as the source provider
  2. Authorize the connection:
    • Provide a connection name (such as MyGitHubConnection)
    • Choose Connect to GitHub
    • Follow the prompts to authorize AWS CodePipeline to access your GitHub account
    • Make sure that the connection has access to your forked aws-glue-cdk-baseline repository
  3. Note the connection Amazon Resource Name (ARN):
    • After the connection is established, note the connection ARN because you’ll need it when configuring the pipeline
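Because the connection ARN is later pasted into the pipeline code by hand, a quick format check can catch a forgotten placeholder. The sketch below is a heuristic only. The pattern assumes the standard aws partition and the codestar-connections or codeconnections service prefix, and the function name looks_like_connection_arn and the sample ARN are made up for illustration.

```python
import re

# Heuristic pattern (an assumption): standard partition, connection ID in UUID form.
CONNECTION_ARN_PATTERN = re.compile(
    r"arn:aws:(codestar-connections|codeconnections):"
    r"[a-z0-9-]+:\d{12}:connection/[0-9a-f-]{36}"
)

def looks_like_connection_arn(value):
    """Return True when value has the general shape of a connection ARN."""
    return CONNECTION_ARN_PATTERN.fullmatch(value) is not None

# Hypothetical ARN used only to exercise the check.
sample_arn = (
    "arn:aws:codestar-connections:us-east-1:111111111111:"
    "connection/12345678-1234-1234-1234-123456789012"
)

print(looks_like_connection_arn(sample_arn))
print(looks_like_connection_arn("YOUR-GITHUB-CONNECTION-ARN"))
```

A passing check does not prove that the connection exists or is authorized. It only shows that the value is no longer a placeholder.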

Integrate the CI/CD pipeline with the AWS Glue Resource Sync Utility

To integrate the AWS Glue Resource Sync Utility into the pipeline to automate the synchronization of AWS Glue visual jobs, follow these steps:

  1. Download the sync.py script from the AWS Glue Samples repository:
wget https://raw.githubusercontent.com/aws-samples/aws-glue-samples/master/utilities/resource_sync/sync.py \
-O aws_glue_cdk_baseline/job_scripts/sync.py
  2. Create a new file aws_glue_cdk_baseline/job_scripts/generate_mapping.py with the following content:
import yaml
import json
 
def generate_mapping():
    with open('default-config.yaml', 'r') as config_file:
        config = yaml.safe_load(config_file)
    mapping = {
        f"s3://aws-glue-assets-{config['devAccount']['awsAccountId']}-{config['devAccount']['awsRegion']}": f"s3://aws-glue-assets-{config['prodAccount']['awsAccountId']}-{config['prodAccount']['awsRegion']}",
        f"arn:aws:iam::{config['devAccount']['awsAccountId']}:role/service-role/AWSGlueServiceRole": f"arn:aws:iam::{config['prodAccount']['awsAccountId']}:role/service-role/AWSGlueServiceRole",
        f"s3://dev-glue-data-{config['devAccount']['awsAccountId']}-{config['devAccount']['awsRegion']}": f"s3://prod-glue-data-{config['prodAccount']['awsAccountId']}-{config['prodAccount']['awsRegion']}"
    }
    with open('mapping.json', 'w') as mapping_file:
        json.dump(mapping, mapping_file, indent=2)
 
if __name__ == "__main__":
    generate_mapping()

This script generates a mapping.json file that the sync.py script will use to synchronize the jobs between the development and production environments. The mapping.json file contains the mapping of the development environment assets to the production environment assets:

  • The s3://aws-glue-assets-* Amazon Simple Storage Service (Amazon S3) bucket contains the AWS Glue Studio visual job definitions
  • The arn:aws:iam::*:role/service-role/AWSGlueServiceRole AWS Identity and Access Management (IAM) role is used by the AWS Glue Studio jobs to access AWS resources
  • The s3://dev-glue-data-* and s3://prod-glue-data-* S3 buckets contain scripts and data used by the AWS Glue Studio jobs
  3. Update the aws_glue_cdk_baseline/pipeline_stack.py file to include a step that deserializes the JSON file and deploys the AWS Glue jobs to the production environment:
from typing import Dict
import aws_cdk as cdk
from aws_cdk import (
    Stack,
    aws_iam as iam
)
from constructs import Construct
from aws_cdk.pipelines import CodePipeline, CodePipelineSource, CodeBuildStep
from aws_glue_cdk_baseline.glue_app_stage import GlueAppStage
 
GITHUB_REPO = "YOUR-GITHUB-USERNAME/aws-glue-cdk-baseline"
GITHUB_BRANCH = "main"
GITHUB_CONNECTION_ARN = "YOUR-GITHUB-CONNECTION-ARN"
 
class PipelineStack(Stack):
 
    def __init__(self, scope: Construct, construct_id: str, config: Dict, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)
 
        source = CodePipelineSource.connection(
            GITHUB_REPO,
            GITHUB_BRANCH,
            connection_arn=GITHUB_CONNECTION_ARN
        )
 
        pipeline = CodePipeline(self, "GluePipeline",
            pipeline_name="GluePipeline",
            cross_account_keys=True,
            docker_enabled_for_synth=True,
            synth=CodeBuildStep("CdkSynth",
                input=source,
                install_commands=[
                    "pip install -r requirements.txt",
                    "pip install -r requirements-dev.txt",
                    "npm install -g aws-cdk",
                ],
                commands=[
                    "cdk synth",
                ]
            )
        )
 
        # Add development stage
        dev_stage = GlueAppStage(self, "DevStage", config=config, stage="dev", 
            env=cdk.Environment(
                account=str(config['devAccount']['awsAccountId']),
                region=config['devAccount']['awsRegion']
            ))
        pipeline.add_stage(dev_stage)

        # Add production stage
        prod_stage = GlueAppStage(self, "ProdStage", config=config, stage="prod", 
            env=cdk.Environment(
                account=str(config['prodAccount']['awsAccountId']),
                region=config['prodAccount']['awsRegion']
            ))
        pipeline.add_stage(prod_stage)
 
        # Glue Resource Sync as a separate step in the pipeline
        pipeline.add_wave("GlueJobSync").add_post(CodeBuildStep("GlueJobSync",
            input=source,
            commands=[
                "python $(pwd)/aws_glue_cdk_baseline/job_scripts/generate_mapping.py",
                "python aws_glue_cdk_baseline/job_scripts/sync.py "
                   "--dst-role-arn arn:aws:iam::{0}:role/GlueCrossAccountRole-prod "
                   "--dst-region {1} "
                   "--deserialize-from-file aws_glue_cdk_baseline/resources/resources.json "
                   "--config-path mapping.json "
                   "--targets job,catalog "
                   "--skip-prompt".format(
                       config['prodAccount']['awsAccountId'],
                       config['prodAccount']['awsRegion']
                   ),
            ],
            role_policy_statements=[
                iam.PolicyStatement(
                    actions=[
                        "sts:AssumeRole",
                    ],
                    resources=["*"]
                )
            ]
        ))

Replace the placeholders in the pipeline_stack.py file with your values:

  • GITHUB_REPO with the name of your GitHub repository
  • GITHUB_BRANCH with the name of the branch you want to use for the pipeline
  • GITHUB_CONNECTION_ARN with the ARN of the GitHub connection you created in Step 4
  1. Update the aws_glue_cdk_baseline/glue_app_stack.py file to create a cross-account role with the necessary permissions to access the development environment resources:
    self.cross_account_role = self.create_cross_account_role(
        f"GlueCrossAccountRole-{stage}",
        str(config['pipelineAccount']['awsAccountId'])
    )
 
    def create_cross_account_role(self, role_name: str, trusted_account_id: str):
        return iam.Role(self, f"{role_name}CrossAccountRole",
            role_name=role_name,
            assumed_by=iam.AccountPrincipal(trusted_account_id),
            managed_policies=[iam.ManagedPolicy.from_aws_managed_policy_name("AdministratorAccess")]
        )
 
    @property
    def cross_account_role_arn(self):
        return self.cross_account_role.role_arn

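    # The property below reads self.glue_app_stack, so it presumably belongs on the
    # stage class that wraps the stack rather than on the stack itself.
    @property
    def cross_account_role_arn(self):
        return self.glue_app_stack.cross_account_role_arn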

See the andreimaksimov/aws-glue-cdk-baseline repository for a complete diff.

  1. Commit your changes to the repository:
git add aws_glue_cdk_baseline/job_scripts/sync.py
git add aws_glue_cdk_baseline/job_scripts/generate_mapping.py
git add aws_glue_cdk_baseline/pipeline_stack.py
git add aws_glue_cdk_baseline/glue_app_stack.py

git commit -m "Integrate Glue Resource Sync Utility into the pipeline"

git push

Day-to-day development (repeated)

With the initial setup complete, you can now proceed with your regular development activities. This section outlines the steps you’ll repeat during your day-to-day work to develop, version control, and deploy AWS Glue visual jobs.

Create visual jobs in the development account

In this step, you’ll use AWS Glue Studio to create and configure your visual jobs within the development environment.

  1. In your development account, open AWS Glue Studio
  2. To create a new visual job, choose Create job
  3. Choose Visual with a blank canvas and use the visual editor to design your ETL job
  4. Configure the job settings:
     • Job name: Provide a meaningful name
     • IAM role: Select an IAM role with necessary permissions
     • Other configurations: Adjust as needed
  5. To save the job, choose Save

Repeat these steps to create additional jobs as required.

Serialize visual jobs

To serialize your visual jobs to enable version control and preparation for deployment, follow these steps:

  1. Run the AWS Glue Resource Sync Utility:
python sync.py \
  --src-role-arn arn:aws:iam::<DEV-ACCOUNT-NUMBER>:role/GlueCrossAccountRole-dev \
  --src-region <DEV-REGION> \
  --serialize-to-file resources.json \
  --targets job,catalog \
  --skip-prompt
     • Replace <DEV-ACCOUNT-NUMBER> with your development account number
     • Replace <DEV-REGION> with your development Region (for example, us-east-1)
  2. Verify the serialized file:
     • Locate the JSON file in aws_glue_cdk_baseline/resources/
     • Make sure it contains the definitions of your visual jobs
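
The verification step can be partly automated. The sketch below only confirms that the file is valid, non-empty JSON and reports its top-level structure; it deliberately assumes nothing about the schema that sync.py writes, and the `summarize_serialized_file` helper is a hypothetical name.

```python
import json
from pathlib import Path


def summarize_serialized_file(path):
    """Load the serialized resources and return a short, schema-agnostic summary."""
    # json.loads raises a ValueError subclass if the file is not valid JSON.
    data = json.loads(Path(path).read_text(encoding="utf-8"))
    if not data:
        raise ValueError(f"{path} is valid JSON but contains no data")
    if isinstance(data, dict):
        # Report each top-level key with the size of its value, when it has one.
        return {
            key: len(value) if isinstance(value, (list, dict)) else 1
            for key, value in data.items()
        }
    if isinstance(data, list):
        return {"items": len(data)}
    return {"value": 1}


if __name__ == "__main__":
    resources = Path("aws_glue_cdk_baseline/resources/resources.json")
    if resources.exists():
        print(summarize_serialized_file(resources))
```

A check like this catches a truncated or empty file before it is committed; it does not replace opening the file to confirm that your visual jobs are actually in it.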

Commit changes to Git repository

To commit changes to the Git repository, follow these steps:

  1. Add the serialized resources to Git:
git add aws_glue_cdk_baseline/resources/resources.json
  1. Commit your changes:
git commit -m "Add serialized Glue Visual Jobs"
  1. Push to GitHub:
git push

This action triggers the CI/CD pipeline.

Deploy visual jobs to production

The CI/CD pipeline automatically performs the following steps:

  • Synthesizes the AWS CDK application
  • Deploys to the development environment
  • Deploys to the production environment
  • Runs the AWS Glue Resource Sync Utility

The following screenshot shows the CI/CD pipeline.

[Screenshot: CI/CD pipeline]

Verify visual jobs in production

After the pipeline has completed the deployment, it’s important to verify that the visual jobs are correctly reflected in the production environment. To do so, follow these steps:

  1. In the production account, on the AWS Glue Studio console, select AWS Glue Studio
  2. Verify the deployed jobs:
     • Make sure that the visual jobs are present
     • Open each job to confirm that the visual DAGs are preserved
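
If you have many jobs, the same check can be scripted. The following sketch assumes that a job created in the visual editor carries its DAG in the `CodeGenConfigurationNodes` field returned by the AWS Glue `GetJob` API; the `find_jobs_without_visual_dag` helper and the job name in the usage comment are hypothetical.

```python
def find_jobs_without_visual_dag(glue_client, job_names):
    """Return the names of jobs whose definition has no visual DAG nodes."""
    missing = []
    for name in job_names:
        # get_job raises an error if the job does not exist in this account and Region.
        job = glue_client.get_job(JobName=name)["Job"]
        # Assumption: visual jobs keep their DAG in CodeGenConfigurationNodes.
        if not job.get("CodeGenConfigurationNodes"):
            missing.append(name)
    return missing


# Usage (requires boto3 and credentials for the production account):
#   import boto3
#   glue = boto3.client("glue", region_name="us-east-1")
#   print(find_jobs_without_visual_dag(glue, ["my-visual-job"]))
```

An empty result suggests that every listed job still has its visual representation; opening a job in AWS Glue Studio remains the definitive check.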

By following these steps in your day-to-day workflow, you make sure that your AWS Glue visual jobs are version-controlled, consistent across environments, and that your production environment reflects the latest tested changes.

Version control for AWS Glue visual jobs

By serializing AWS Glue Studio visual jobs to JSON files and committing them to a Git repository, you enable version control for your data integration workflows. By following this approach you can:

  • Track changes – Monitor modifications to your AWS Glue jobs over time
  • Collaborate – Work with team members on developing and refining jobs
  • Restore and deploy – Easily restore jobs in other accounts or environments

The serialization and deserialization steps are integral to your development and deployment process, making sure that all changes are captured and seamlessly propagated.

Conclusion

By combining the AWS Glue Visual Job API, AWS Glue Resource Sync Utility, and an AWS CDK based CI/CD pipeline, we’ve crafted a comprehensive solution for managing AWS Glue Studio visual jobs across different environments. This integrated approach offers several benefits:

  • Version control integration – Manage and track changes to your AWS Glue visual jobs using Git, enabling collaboration and change tracking
  • Streamlined development – Easily develop and test AWS Glue jobs using the Visual Editor in the development environment
  • Automated deployment – Use a CI/CD pipeline to automatically deploy serialized visual DAGs to the production environment
  • Environment consistency – Promote consistency across development and production environments by using the same job definitions
  • Visual representation preservation – Maintain the visual DAG representation when synchronizing jobs between environments

This solution empowers data engineers to focus on building robust data integration pipelines while automating the complexities of managing and deploying AWS Glue Studio visual jobs across multiple environments.

We encourage you to try this solution and adapt it to your needs. As always, we welcome your feedback and suggestions for further improvements.


About the Authors

Andrei Maksimov is an AWS Senior Cloud Infrastructure Architect specializing in cloud infrastructure, software development, and DevOps. He designs and implements scalable, secure, and efficient cloud solutions and helps customers optimize their cloud environments. Outside of work, Andrei enjoys participating in hackathons, contributing to open source projects, and exploring the latest advancements in AI. You can connect with him on LinkedIn.

David Zhang is an AWS Data Architect specializing in designing and implementing analytics infrastructure, data management, ETL, and extensive data systems. He helps customers modernize their AWS data platforms. David is also an active speaker at and contributor to AWS conferences, technical content, and open source initiatives. He enjoys playing volleyball and tennis and weightlifting in his free time. Feel free to connect with him on LinkedIn.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for designing AWS features, implementing software artifacts, and helping with customer architectures. In his spare time, he enjoys watching anime on Prime Video. You can connect with him on LinkedIn.

Security updates for Tuesday

Post Syndicated from corbet original https://lwn.net/Articles/997903/

Security updates have been issued by AlmaLinux (gstreamer1-plugins-base), Debian (chromium, ghostscript, libarchive, mpg123, ruby-saml, and symfony), Fedora (buildah and podman), Red Hat (buildah, containernetworking-plugins, podman, skopeo, and xorg-x11-server-Xwayland), Slackware (wget), SUSE (pcp), and Ubuntu (linux, linux-aws-5.15, linux-gcp, linux-gcp-5.15, linux-gke, linux-gkeop, linux-gkeop-5.15, linux-hwe-5.15, linux-ibm, linux-ibm-5.15, linux-kvm, linux-lowlatency, linux-lowlatency-hwe-5.15, linux-nvidia, linux-oracle, linux-oracle-5.15, linux-raspi, linux-xilinx-zynqmp and mysql-8.0).

Backblaze Drive Stats for Q3 2024

Post Syndicated from Andy Klein original https://www.backblaze.com/blog/backblaze-drive-stats-for-q3-2024/


As of the end of Q3 2024, Backblaze was monitoring 292,647 hard disk drives (HDDs) and solid state drives (SSDs) in our cloud storage servers located in our data centers around the world. We removed from this analysis 4,100 boot drives, consisting of 3,344 SSDs and 756 HDDs. This leaves us with 288,547 hard drives under management to review for this report. We’ll review the annualized failure rates (AFRs) for Q3 2024 and the lifetime AFRs of the qualifying drive models. Along the way, we’ll share our observations and insights on the data presented and, as always, we look forward to you doing the same in the comments section at the end of the post.

Hard drive failure rates for Q3 2024

For our Q3 2024 quarterly analysis, we remove the following from consideration: drive models which did not have at least 100 drives in service at the end of the quarter, drive models which did not accumulate 10,000 or more drive days during the quarter, and individual drives which exceeded their manufacturer’s temperature specification during their lifetime. The removed pool totalled 471 drives, leaving us with 288,076 drives grouped into 29 drive models for our Q3 2024 analysis. 

The table below lists the AFRs and related data for these drive models. The table is sorted ascending by drive size then ascending by AFR within drive size.

Notes and observations on the Q3 2024 Drive Stats

  • Upward AFR. The quarter-to-quarter AFR continues to creep up, rising from 1.71% in Q2 2024 to 1.89% in Q3 2024. The rise can’t be attributed to the aging 4TB drives, as our CVT drive migration system continues to replace these drives. As a consequence, the AFR for the remaining 4TB drives was 0.26% in Q3. The primary culprit is the collection of 8TB drives, which are now on average over seven years old. As a group, the AFR for the 8TB drives rose to 3.04% in Q3 2024, up from 2.31% in Q2. The CVT team is gearing up to begin the migration of 8TB drives over the next few months.
  • Yet another golden oldie is gone. You may have noticed that the 4TB Seagate drives (model: ST4000DM000) are missing from the table. All of the Backblaze Vaults containing these drives have been migrated, and as a consequence there are only two of these drives remaining, not enough to make the quarterly chart. You can read more about their demise in our recent Halloween post. 
  • A new drive in town. In Q3, the 20TB Toshiba drives (model: MG10ACA20TE) arrived in force, populating three complete Backblaze Vaults of 1,200 drives each. Over the last few months our drive qualification team put the 20TB drive model through its paces and, having passed the test, they are now on the list of drive models we can deploy.
  • One zero. For the second quarter in a row, the 16TB Seagate (model: ST16000NM00J) drive model had zero failures. With only 185 drives in service, there is a lot of potential variability in the future, but for the moment, they are settling in quite well.
  • The nine year club. There are no data drives with 10 or more years of service, but there are 39 drives that are nine years or older. They are all 4TB HGST drives (model: HMS5C4040ALE640) spread across 31 different Storage Pods, in five different Backblaze Vaults and two different data centers. Will any of those drives make it to 10 years? Probably not, given that four of the five vaults have started their CVT migrations and will be gone by the end of the year. And, while the fifth vault is not scheduled for migration yet, it is just a matter of time before all of the 4TB drives we are using will be gone.

Reactive and proactive drive failures

In the Drive Stats dataset schema, there is a field named failure, which displays either a 1 for failure or a 0 for not failed. Over the years in various posts, we have stated that for our purposes drive failure is either reactive or proactive. Furthermore, we have suggested that failed drives fall basically evenly into these two categories. We’d like to put some data behind that 50/50 number, but first let’s start by defining our two categories of drive failure, reactive and proactive. 

  • Reactive: A reactive failure is when any of the following conditions occur: the drive crashes and refuses to boot or spin up, the drive won’t respond to system commands, or the drive won’t stay operational. 
  • Proactive: A proactive failure is generally anything not a reactive failure, and typically is when one or more indicators such as SMART stats, FSCK (file system) checks, etc., signal that the drive is having difficulty and drive failure is highly probable. Typically a multitude of indicators are present in drives declared as proactive failures.

A drive that is removed and replaced as either a proactive or reactive failure is considered a drive failure in Drive Stats unless we learn otherwise. For example, suppose a drive is experiencing communications errors and command timeouts and is scheduled for a proactive drive replacement. During the replacement process, the data center tech realizes the drive does not appear to be fully seated. After gently securing the drive, further testing reveals no issues and the drive is no longer considered failed. At that point, the Drive Stats dataset is updated accordingly.

As noted above, the Drive Stats dataset includes the failure status (0 or 1) but not the type of failure (proactive or reactive). That’s a project for the future. To get a breakdown of the different types of drive failure, we have to interrogate the data center maintenance ticketing system used by each data center to record any maintenance activities on Storage Pods and related equipment. Historically, the drive failure data was not readily accessible, but a recent software upgrade now allows us access to this data for the first time. So in the spirit of Drive Stats, we’d like to share our drive failure types with you. 

Drive failure type stats

Q3 2024 will be our starting point for any drive failure type stats we publish going forward. For consistency, we will use the same drive models listed in the Drive Stats quarterly report, in this case Q3 2024. For this period, there were 1,361 drive failures across 29 drive models. 

We actually have been using the data center maintenance data for several years as each quarter we validate the failed drives reported by the Drive Stats system with the maintenance records. Only validated failed drives are used for the Drive Stats reports we publish quarterly and in the data we publish on our Drive Stats webpage.

The recent upgrades to the data center maintenance ticketing system have not only made the drive failure validation process easier, but also let us easily join the two sources together. This gives us the ability to look at the drive failure data across several different attributes as shown in the tables below. We’ll start with the number of failed drives in each category and go from there. This will form our baseline data.

Reactive vs. proactive drive failures for Q3 2024

Observation period    | Reactive failures | Proactive failures | Total failures | Reactive % | Proactive %
Q3 2024 failed drives | 640               | 721                | 1,361          | 47.0%      | 53.0%

Reactive vs. proactive drive failures by manufacturer

Manufacturer | Reactive failures | Proactive failures | Total failures | Reactive % | Proactive %
HGST         | 194               | 177                | 371            | 52.3%      | 47.7%
Seagate      | 258               | 272                | 530            | 48.7%      | 51.3%
Toshiba      | 124               | 221                | 345            | 35.9%      | 64.1%
WDC          | 64                | 51                 | 115            | 55.7%      | 44.3%

Reactive vs. proactive drive failures by Backblaze data center

Backblaze data center | Reactive failures | Proactive failures | Total failures | Reactive % | Proactive %
AMS                   | 36                | 77                 | 113            | 31.9%      | 68.1%
IAD                   | 50                | 92                 | 142            | 35.2%      | 64.8%
PHX                   | 179               | 201                | 380            | 47.1%      | 52.9%
SAC 0                 | 151               | 148                | 299            | 50.5%      | 49.5%
SAC 2                 | 224               | 203                | 427            | 52.5%      | 47.5%

Reactive vs. proactive drive failures by server type

Server type                     | Reactive failures | Proactive failures | Total failures | Reactive % | Proactive %
5.0 red Storage Pod (45 drives) | 4                 | 2                  | 6              | 66.7%      | 33.3%
6.0 red Storage Pod (60 drives) | 433               | 349                | 782            | 55.4%      | 44.6%
6.1 red Storage Pod (60 drives) | 70                | 107                | 177            | 39.5%      | 60.5%
Dell Server (26 drives)         | 22                | 61                 | 83             | 26.5%      | 73.5%
Supermicro Server (60 drives)   | 111               | 202                | 313            | 35.5%      | 64.5%

Obviously, there are many things we could analyze here, but for the moment we just want to establish a baseline. Next, we’ll collect additional data to see how consistent and reliable our data is over time. We’ll let you know what we find.
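
For readers who want to reproduce the percentages, here is a small sketch that recomputes the reactive and proactive shares from the per-manufacturer counts and confirms that they add up to the Q3 2024 totals. The counts are copied from the manufacturer table above; the `shares` helper is a hypothetical name.

```python
# (reactive, proactive) failure counts by manufacturer, copied from the table above.
failures = {
    "HGST": (194, 177),
    "Seagate": (258, 272),
    "Toshiba": (124, 221),
    "WDC": (64, 51),
}


def shares(reactive, proactive):
    """Return the reactive and proactive shares as percentages with one decimal."""
    total = reactive + proactive
    return round(100 * reactive / total, 1), round(100 * proactive / total, 1)


for maker, (reactive, proactive) in failures.items():
    reactive_pct, proactive_pct = shares(reactive, proactive)
    print(f"{maker:8} {reactive + proactive:5} {reactive_pct:5.1f}% {proactive_pct:5.1f}%")

# The per-manufacturer counts should sum to the overall Q3 2024 figures.
total_reactive = sum(reactive for reactive, _ in failures.values())
total_proactive = sum(proactive for _, proactive in failures.values())
print(total_reactive, total_proactive, shares(total_reactive, total_proactive))
```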

Learning more about proactive failures

One item of interest to us is the different reasons that cause a drive to be designated as a proactive failure. Today we record the reasons for the proactive designation at the time the drive is flagged for replacement, but currently multiple reasons are allowed for a given drive. This makes the primary reason difficult to determine. Of course, there may be no such thing as a primary reason, as it is often a combination of factors causing the problem. That analysis could be interesting as well. Regardless of the exact reason, such drives are in bad shape and replacing degraded drives to protect the data they store is our first priority.

Lifetime hard drive failure rates

As of the end of Q3 2024, we were tracking 288,547 operational hard drives. To be considered for the lifetime review, a drive model was required to have 500 or more drives as of the end of Q3 2024 and have over 100,000 accumulated drive days during their lifetime. When we removed those drive models which did not meet the lifetime criteria, we had 286,892 drives grouped into 25 models remaining for analysis as shown in the table below.

Downward lifetime AFR

In Q2 2024, the lifetime AFR for the drives listed was 1.47%. In Q3, the lifetime AFR went down to 1.31%, a significant decrease from one quarter to the next. This decrease also runs contrary to the increase in the quarterly AFR over the same period. At first blush, that doesn’t make much sense, as an increasing quarter-to-quarter AFR should increase the lifetime AFR. There are two related factors which explain this seemingly contradictory data. Let’s take a look. 

We’ll start with the table below which summarizes the differences between the Q2 and Q3 lifetime stats.

Period  | Drive count | Drive days  | Drive failures | Lifetime AFR
Q2 2024 | 283,065     | 469,219,469 | 18,949         | 1.47%
Q3 2024 | 286,892     | 398,476,931 | 14,308         | 1.31%
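
The lifetime AFR values in this table can be reproduced from the drive days and drive failures columns. The sketch below assumes the usual Drive Stats definition, which is not restated in this section: failures divided by drive years, where drive years are drive days divided by 365, expressed as a percentage. The function name is hypothetical.

```python
def annualized_failure_rate(failures, drive_days):
    """AFR in percent: failures per drive year of service."""
    drive_years = drive_days / 365
    return 100 * failures / drive_years


# (drive failures, drive days) copied from the table above.
lifetime = {
    "Q2 2024": (18_949, 469_219_469),
    "Q3 2024": (14_308, 398_476_931),
}

for period, (failures, drive_days) in lifetime.items():
    print(period, f"{annualized_failure_rate(failures, drive_days):.2f}%")
```

Because both the numerator and the denominator are lifetime totals, removing a drive model with a long history changes both at once, which is why the lifetime AFR can fall even while the quarterly AFR rises.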

To create the dataset for the lifetime AFR tables, two criteria are applied: first, at the end of a given quarter, the number of drives of a drive model must be greater than 500, and, second, the number of drive days must be greater than 100,000. The first criterion ensures that the drive models are relevant to the data presented; that is, we have a significant number of each of the included drive models. The second criterion ensures that the drive models listed in the lifetime AFR table have a sufficient number of data points; that is, they have enough drive days to be significant. 

As we can see in the table above, while the number of drives went up from Q2 to Q3, the number of drive days and the number of drive failures went down significantly. This is explained by comparing the drive models listed in the Q2 lifetime table versus the Q3 lifetime table. Let’s summarize.

  • Added: In Q3, we added the 20TB Toshiba drive model (MG10ACA20TE). In Q2, there were only two of these drives in service.
  • Removed: In Q3, we removed the 4TB Seagate drive model (ST4000DM000) as there were only two drives remaining as of the end of Q3, well below the criterion of 500 drives.

When we removed the 4TB Seagate drives we also removed 80,400,065 lifetime drive days and 5,789 lifetime drive failures from the Q3 lifetime AFR computations. If the 4TB Seagate drive model data (drive days and drive failures) was included in the Q3 Lifetime stats, the AFR would have been 1.50%. 

Why not include the 4TB Seagate data? In other words, why have a drive count criterion at all? Shouldn’t we compute lifetime AFR using all of the drive models we have ever used which accumulated over 100,000 drive days in a lifetime? If we did things that way, the list of drive models used to compute the lifetime AFR would include drive models we stopped using years ago and would grow to nearly 100 different drive models. As a result, a majority of the drive models used to compute the lifetime AFR would be outdated and the lifetime AFR table would contain rows of basically useless data that has no current or future value. In short, having drive count as one of the criteria in computing lifetime AFR keeps the table relevant and approachable.

The Hard Drive Stats data

It has now been over 11 years since we began recording, storing, and reporting the operational statistics of the HDDs and SSDs we use to store data at Backblaze. We look at the telemetry data of the drives, including their SMART stats and other health related attributes. We do not read or otherwise examine the actual customer data stored. 

Over the years, we have analyzed the data we have gathered and published our findings and insights from our analyses. For transparency, we also publish the data itself, known as the Drive Stats dataset. This dataset is open source and can be downloaded from our Drive Stats webpage.

You can download and use the Drive Stats dataset for free for your own purposes. All we ask are three things: 1) you cite Backblaze as the source if you use the data, 2) you accept that you are solely responsible for how you use the data, and 3) you may sell derivative works based on the data, but you cannot sell this data to anyone; it is free.
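
As a starting point for your own analysis, here is a minimal sketch that tallies drive days and failures per model from daily rows. It assumes the daily CSV layout with `date`, `serial_number`, `model`, `capacity_bytes`, and `failure` columns, where each row is one drive observed on one day; the three sample rows and the `tally` helper are made up for illustration.

```python
import csv
import io
from collections import defaultdict

# Made-up daily rows in the shape of the Drive Stats CSV files
# (the real files carry many more columns, such as SMART attributes).
SAMPLE = """date,serial_number,model,capacity_bytes,failure
2024-09-01,SN001,MODEL_A,16000900608000,0
2024-09-01,SN002,MODEL_A,16000900608000,0
2024-09-02,SN001,MODEL_A,16000900608000,1
"""


def tally(rows):
    """Count drive days and failures per drive model."""
    stats = defaultdict(lambda: {"drive_days": 0, "failures": 0})
    for row in rows:
        entry = stats[row["model"]]
        entry["drive_days"] += 1  # each row is one drive observed for one day
        entry["failures"] += int(row["failure"])
    return dict(stats)


print(tally(csv.DictReader(io.StringIO(SAMPLE))))
```

To run this on downloaded data, replace the `io.StringIO(SAMPLE)` argument with an open file handle for one of the daily CSV files.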

Good luck, and let us know if you find anything interesting.

The post Backblaze Drive Stats for Q3 2024 appeared first on Backblaze Blog | Cloud Storage & Cloud Backup

LodaRAT: Established malware, new victim patterns

Post Syndicated from Natalie Zargarov original https://blog.rapid7.com/2024/11/12/lodarat-established-malware-new-victim-patterns/

Executive Summary


Rapid7 has observed an ongoing malware campaign involving a new version of LodaRAT. This version possesses the ability to steal cookies and passwords from Microsoft Edge and Brave. LodaRAT, first observed in 2016, is a remote access tool (RAT) written in AutoIt. Development of LodaRAT has continued over the past 8 years, with an Android version distributed in the wild since 2021. This article analyzes the Windows version only.

Originally created for information gathering, LodaRAT has a variety of capabilities for collecting and exfiltrating victim data, delivering additional malware, capturing the victim’s screen, controlling the victim’s camera or mouse, and even spreading in infected environments. Notably, the browser credential theft described above appears to be the only update made to the RAT since 2022. Even the embedded DLLs remain the same.

Distribution

Old versions of LodaRAT used Phishing (T1566) and Known Vulnerability Exploitation (T1203) techniques in their delivery process, but Rapid7 spotted new versions being distributed by DonutLoader (S0695) and CobaltStrike (S0154). We also observed LodaRAT on systems infected with other malware families like AsyncRAT (S1087), Remcos (S0332), Xworm, and more, though we aren’t able to say for sure whether LodaRAT was distributed with those malware families or was simply present by coincidence. New LodaRAT samples masquerade (T1036) as well-known Windows software such as Discord, Skype, and Windows Update, amongst others.

Victimology

While in previous campaigns the threat actor behind this RAT showed interest in specific country-based organizations, the new campaign seems to infect victims all over the world. Approximately 30% of VirusTotal samples were uploaded from the USA.


Attribution

LodaRAT was attributed to the Kasablanka APT by Cisco in 2021; the group was focused on information gathering and espionage targeting Russia and Bangladesh in 2022. The 2024 campaign observed by Rapid7 shows a notable shift in threat actor behavior — i.e., preferring worldwide distribution over specific regional targets — and therefore we would not necessarily attribute this year’s campaign to the same APT. Being an AutoIt compiled binary, LodaRAT source code can be easily extracted and customized by a skilled threat actor. Rapid7 also found a GitHub repository with leaked LodaRAT source code. Based on capabilities, variable names, and strings, the leaked code is a four-year-old LodaRAT version, meaning adversaries have had plenty of time to analyze and update the code in newer versions.

InsightIDR and Managed Detection and Response customers have existing detection coverage through Rapid7’s expansive library of detection rules. Rapid7 recommends installing the Insight Agent on all applicable hosts to ensure visibility into suspicious processes and proper detection coverage. Below is a non-exhaustive list of detections that are deployed and will alert on behavior related to this malware campaign:

  • Suspicious Process – LodaRAT Malware Executed
  • Suspicious Process – Renamed AutoIt Interpreter

Technical Analysis

In this section, we briefly describe the overall capabilities of LodaRAT. For the full capability list, please see our LodaRAT repository on GitHub. It’s worth mentioning that most of the LodaRAT samples we investigated as part of the 2024 campaign had a string obfuscation mechanism. We built a Python script to decrypt those strings and make the AutoIt script human-readable.

The LodaRAT string deobfuscator is available to the community and can be downloaded here. Some of the samples were also packed with the UPX packer.

LodaRAT execution starts with a check for a specifically named window (for example, `UOMGAYFFBC`). This is done to make sure that only one instance of the malware is executed on the system. Next, the malware changes its window title. It also checks whether the infected OS is Windows 10 or 11. Then, it defines local variables and establishes registry persistence by adding a new value under the `HKCU\Software\Microsoft\Windows\CurrentVersion\Run` registry key (T1547.001). However, persistence is not always achieved by adding a new registry value: Rapid7 observed that some LodaRAT samples instead created a new scheduled task that executes a compiled AutoIt script every minute (T1053), while others did not attempt to establish persistence at all. Interestingly, in both cases where Rapid7 did not observe a new registry value being added for persistence, the malware still attempted to delete the registry value during the uninstall process.

The malware also checks if one of the following registry values is set:

  • HKCU\Software\Win32\data
  • HKCU\Software\Win32\img
  • HKCU\Software\Win32\keyx
  • HKCU\Software\Win32\imgCli
  • HKCU\Software\Win32\pidx

All the above keys are set by the malware in response to a specific command from the command-and-control (C2) server. The malware checks whether Windata and Windata\mon folders exist in the user’s %AppData% directory, and if not, it creates them. It also sets the mon directory attributes to System and Hidden to evade detection (T1564.001).
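
As a hunting aid, the registry values listed above can be matched against registry paths exported from your own telemetry. The sketch below is a simple case-insensitive comparison, not a Rapid7 detection rule; the helper names and the input format (one registry path per string) are assumptions.

```python
# Registry values named in this article.
LODARAT_REGISTRY_VALUES = {
    r"HKCU\Software\Win32\data",
    r"HKCU\Software\Win32\img",
    r"HKCU\Software\Win32\keyx",
    r"HKCU\Software\Win32\imgCli",
    r"HKCU\Software\Win32\pidx",
}
_NORMALIZED = {path.lower() for path in LODARAT_REGISTRY_VALUES}


def normalize(path):
    """Lowercase the path and shorten the long hive name to HKCU."""
    path = path.strip().replace("/", "\\")
    long_hive = "HKEY_CURRENT_USER"
    if path.upper().startswith(long_hive + "\\"):
        path = "HKCU" + path[len(long_hive):]
    return path.lower()


def match_indicators(observed_paths):
    """Return the observed paths that correspond to a listed registry value."""
    return [path for path in observed_paths if normalize(path) in _NORMALIZED]


observed = [
    r"HKEY_CURRENT_USER\Software\Win32\keyx",
    r"HKCU\Software\Microsoft\Windows\CurrentVersion\Run",
    r"hkcu\software\win32\IMG",
]
print(match_indicators(observed))
```

Telemetry that records per-user hives under a SID-based path would need an extra normalization step that this sketch does not attempt.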

The malware will then start a TCP connection to the C2 server, capture the victim’s screen, and save the capture in the mon folder (T1113). The C2 beacon contains basic victim information, such as:

  1. Whether the user has Administrator rights; if they do, the Admin string will be passed to the C2 server, otherwise the passed parameter will be a string that varies from sample to sample.
  2. Username
  3. OS version and architecture
  4. Whether any antivirus (AV) solution is running on the system; the malware will tell the C2 server No if no AV solution is found, and Disabled in cases where it is present but not running.
  5. Host IP address
  6. Desktop resolution
  7. Whether the endpoint is a laptop or a desktop
  8. Number of files in the mon folder

That information will be combined into the following packet:
x|<Admin/hardcoded_string>|x|<Username>|<OS Version>|<OS Architecture>| | |<Disabled/No>|<Host IP address>|ddd|Pr|<Desktop Height>|X2|<Desktop Width>|X3|<Laptop/Desktop>|<Amount of files in mon folder>|beta
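
For analysts who capture this traffic, the packet layout above can be split into named fields. The following sketch assumes that the beacon always has the 19 pipe-separated fields shown above, in that order; the field names, the `parse_beacon` helper, and the sample values are made up for illustration.

```python
# Positions of the victim-specific fields within the pipe-separated beacon.
FIELDS = {
    1: "privilege",
    3: "username",
    4: "os_version",
    5: "os_architecture",
    8: "antivirus",
    9: "host_ip",
    12: "desktop_height",
    14: "desktop_width",
    16: "device_type",
    17: "mon_file_count",
}
EXPECTED_FIELD_COUNT = 19


def parse_beacon(packet):
    """Split a beacon into named fields, rejecting packets with another layout."""
    parts = packet.split("|")
    if len(parts) != EXPECTED_FIELD_COUNT:
        raise ValueError(f"expected {EXPECTED_FIELD_COUNT} fields, got {len(parts)}")
    return {name: parts[index].strip() for index, name in FIELDS.items()}


# Made-up example values; 192.0.2.10 is a documentation-only address.
sample = "x|Admin|x|alice|WIN_10|X64| | |No|192.0.2.10|ddd|Pr|1080|X2|1920|X3|Desktop|4|beta"
print(parse_beacon(sample))
```

The first field value varies from sample to sample when the user is not an administrator, so the sketch keeps it as a raw string rather than mapping it to a boolean.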

The RAT then waits for a command from the C2 server. While a full list of LodaRAT capabilities can be found here, notable capabilities include:

  1. Downloading and executing additional payloads: We were able to spot the use of the ngrok reverse proxy utility based on the command the malware executes when receiving it from the C2 server. We can also assess with medium confidence that one other tool downloaded from the C2 server is a lateral movement utility that exploits the SMB protocol to drop and/or execute a malicious binary on a remote host. This assumption is based on the malware’s attempt to connect to an internal IP on port 445, after which it receives a tool from the C2 server and uses that utility to run a .bin file on the remote host.
  2. Executing commands on the victim’s host
  3. Controlling the victim’s mouse
  4. Screen capturing
  5. Stealing browser cookies and credentials
  6. Disabling Windows Firewall
  7. File enumeration and exfiltration
  8. Webcam recording
  9. Microphone recording
  10. New local user creation

In addition, the malware is capable of opening and closing a CD tray and creating a GUI chat window whose conversation is saved to a file.

IOCs

An updated IOC list can be found here.

Conclusion

LodaRAT shows that even older malware can still be a serious threat if it works well enough. While new malware families pop up all the time with fancy updates, LodaRAT has stayed mostly the same since 2021, yet it’s still spreading and infecting systems worldwide. The recent campaign, with its ability to steal credentials from browsers like Microsoft Edge and Brave, proves that small tweaks can keep malware effective without major updates. The fact that LodaRAT keeps working so well reminds us that even older threats shouldn’t be underestimated.

Using the zabbix_utils Library for Tool Development

Post Syndicated from Aleksandr Iantsen original https://blog.zabbix.com/python-zabbix-utils-alert-tracker-tool/29010/

In this article, we will explore a practical example of using the zabbix_utils library to solve a non-trivial task – obtaining a list of alert recipients for triggers associated with a specific Zabbix host. You will learn how to easily automate the process of collecting this information, and see examples of real code that can be adapted to your needs.

Over the last year, the zabbix_utils library has become one of the most popular tools for working with the Zabbix API. It is a convenient tool that simplifies interacting with the Zabbix server, proxy, or agent, especially for those who automate monitoring and management tasks.

Due to its ease of use and extensive functionality, zabbix_utils has found a following among system administrators and monitoring and DevOps engineers. According to data from PyPI, the library has already been downloaded over 140,000 times since its release, confirming its demand within the community. It’s all thanks to you and your attention to zabbix_utils!

Task Description

Administrators often need to check which Zabbix users receive alerts for specific triggers. This can be useful for auditing, configuring new notifications, or simply for a quick diagnosis of issues. The task becomes especially relevant when you have many hosts with numerous triggers, and manually checking the recipients for each trigger through the Zabbix interface becomes very time-consuming.

In such cases, it is advisable to use a custom solution based on the Zabbix API. You can directly access all the required data using the API, and then use additional logic to determine the final alert recipients. The zabbix_utils library makes working with the Zabbix API more convenient and allows you to automate this process. In this project, we use the zabbix_utils library to write a Python script that collects a list of alert recipients for the triggers of the selected Zabbix host. This will allow you to obtain the necessary information faster and with minimal effort.

Environment Setup and Installation

To get started with zabbix_utils, you need to install the library and configure the connection to the Zabbix API. A separate article provides more details and examples on getting started with the library, but the basic steps for preparing the environment are described here as well.

The library supports several installation methods described in the official README, making it convenient for use in different environments.

1. Installation via pip

The simplest and most common installation method is using the pip package manager. To do this, execute the command:

~$ pip install zabbix_utils

To install all necessary dependencies for asynchronous work, you can use the command:

~$ pip install zabbix_utils[async]

This method is suitable for most users, as pip automatically installs all required dependencies.

2. Installation from Zabbix Repository

Since the previous articles were written, we have added one more installation method: from the official Zabbix repository. First, add the repository to your system if you have not done so already. Official Zabbix packages for Red Hat Enterprise Linux and Debian-based distributions are available on the Zabbix website.

For Red Hat Enterprise Linux and derivatives:

~# dnf install python3-zabbix-utils

For Debian / Ubuntu and derivatives:

~# apt install python3-zabbix-utils

3. Installation from Source Code

If you require the latest version of the library that has not yet been published on PyPI, or you want to customize the code, you can install the library directly from GitHub:

1. Clone the repository from GitHub:

~$ git clone https://github.com/zabbix/python-zabbix-utils

2. Navigate to the project folder:

~$ cd python-zabbix-utils/

3. Install the library by executing the command:

~$ python3 setup.py install

4. Testing the Connection to Zabbix API

After installing zabbix_utils, it is a good idea to check the connection to your Zabbix server via the API. To do this, you need the URL of the Zabbix server and either an API token or the username and password of a user who has permission to access the Zabbix API.

Example code for checking the connection:

from zabbix_utils import ZabbixAPI

ZABBIX_AUTH = {
    "url": "your_zabbix_server",
    "user": "your_username",
    "password": "your_password"
}

api = ZabbixAPI(**ZABBIX_AUTH)

hosts = api.host.get(
    output=['hostid', 'name']
)

print(hosts)

api.logout()
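Hardcoded credentials are fine for a first test, but in practice they are usually read from the environment. Here is a minimal sketch of that idea, assuming hypothetical environment variable names (ZABBIX_URL, ZABBIX_TOKEN, ZABBIX_USER, ZABBIX_PASSWORD). The build_auth helper is not part of zabbix_utils; it only prepares a dictionary with the url, token, user, and password keys that can be unpacked into ZabbixAPI in the same way as ZABBIX_AUTH above.

```python
import os


def build_auth(env=None):
    """Build connection arguments, preferring a token over user/password."""
    env = os.environ if env is None else env
    url = env.get("ZABBIX_URL")
    if not url:
        raise ValueError("ZABBIX_URL is not set")
    auth = {"url": url}
    if env.get("ZABBIX_TOKEN"):
        # A token takes precedence when both kinds of credentials are present
        auth["token"] = env["ZABBIX_TOKEN"]
    elif env.get("ZABBIX_USER") and env.get("ZABBIX_PASSWORD"):
        auth["user"] = env["ZABBIX_USER"]
        auth["password"] = env["ZABBIX_PASSWORD"]
    else:
        raise ValueError(
            "Set ZABBIX_TOKEN or both ZABBIX_USER and ZABBIX_PASSWORD"
        )
    return auth


# Example with an explicit mapping instead of the real environment
print(build_auth({"ZABBIX_URL": "https://zabbix.example.com", "ZABBIX_TOKEN": "secret"}))
```

With the real environment populated, the connection would then be created as api = ZabbixAPI(**build_auth()).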

Main Steps of the Task Solution

Now that the environment is set up, let’s look at the main steps for retrieving the list of alert recipients for triggers associated with a specific Zabbix host.

In zabbix_utils, asynchronous API interaction support is built in through the AsyncZabbixAPI class. This allows multiple requests to be sent simultaneously and their results to be handled as they become ready, significantly reducing latencies when making multiple API calls. Therefore, we will use the AsyncZabbixAPI class and the asynchronous approach in this project.
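The benefit of the asynchronous approach can be illustrated without a Zabbix server. In the sketch below, fake_api_call is a stand-in (it is not part of zabbix_utils) that simulates the network delay of a single API request. asyncio.gather awaits three such requests concurrently, so the total time stays close to that of the slowest request rather than the sum of all three.

```python
import asyncio
import time


async def fake_api_call(name, delay):
    # Stand-in for an awaited API request; only simulates waiting on the network
    await asyncio.sleep(delay)
    return name


async def fetch_all():
    started = time.perf_counter()
    # The three coroutines run concurrently; results keep the argument order
    results = await asyncio.gather(
        fake_api_call("actions", 0.1),
        fake_api_call("mediatypes", 0.1),
        fake_api_call("usergroups", 0.1),
    )
    elapsed = time.perf_counter() - started
    return results, elapsed


results, elapsed = asyncio.run(fetch_all())
print(results, f"{elapsed:.2f}s")
```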

Below are the main steps for solving the task, and code examples for each step. Please note that the code in this project is for demonstration purposes, may not be optimal, or could contain errors. Use it as an example or a base for your project, but not as a complete tool.

Step 1. Obtain Host ID

The first step is to identify the host for which we will retrieve information about triggers and alerts. To do this, we need to find the hostid by the host’s visible name (name) or technical name (host). The Zabbix API provides a method to obtain this information, and using zabbix_utils makes this process much simpler.

Example of obtaining the host ID by its name:

host = api.host.get(
    output=["hostid"],
    filter={"name": "your_host_name"}
)

This call returns a list of matching host objects, each carrying the host’s unique identifier, which can be used in the later steps. However, for our test project, we will use a manually specified host identifier.
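Because host.get returns a list, the identifier still has to be taken from its first element, and an empty or ambiguous result is worth catching early. A small sketch of that step, run here against hand-written sample data shaped like the API response; extract_hostid is a hypothetical helper, not part of zabbix_utils.

```python
def extract_hostid(hosts, name):
    """Return the hostid from a host.get result, or raise if it is unusable."""
    if not hosts:
        raise LookupError(f"Host {name!r} not found")
    if len(hosts) > 1:
        raise LookupError(f"Host name {name!r} matched {len(hosts)} hosts")
    return hosts[0]["hostid"]


# Sample data in the shape returned for output=["hostid"]
sample_response = [{"hostid": "10084"}]
hostid = extract_hostid(sample_response, "your_host_name")
print(hostid)
```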

Step 2. Retrieve Host Triggers

With the hostid in hand, the next step is to retrieve all triggers associated with this host. Triggers contain the conditions that trigger the alerts. We need to collect information about all triggers so that we can then use it to select actions that match all the conditions.

Example of retrieving host triggers:

triggers = api.trigger.get(
    hostids=[hostid],
    selectTags="extend",
    selectHosts=["hostid"],
    selectHostGroups=["groupid"],
    selectDiscoveryRule=["templateid"],
    output="extend",
)

This request returns complete information about the triggers for the host. We get not only the triggers but also their tags, associated host and host groups, and discovery rule information. All this information will be necessary to check the conditions of the actions.

Step 3. Initialize Trigger Metadata

At this stage, objects for each trigger are created to store their metadata. This is done using the Trigger class, which includes information about the trigger such as its name, ID, associated host groups, hosts, tags, templates, and operations.

Here’s the code defining the Trigger class:

import copy
import re


class Trigger:
    def __init__(self, trigger):
        self.name = trigger["description"]
        self.triggerid = trigger["triggerid"]
        self.hostgroups = [g["groupid"] for g in trigger["hostgroups"]]
        self.hosts = [h["hostid"] for h in trigger["hosts"]]
        self.tags = {t["tag"]: t["value"] for t in trigger["tags"]}
        self.tmpl_triggerid = self.triggerid
        self.lld_rule = trigger["discoveryRule"] or {}
        if trigger["templateid"] != "0":
            self.tmpl_triggerid = trigger["templateid"]
        self.templates = []
        self.messages = []
        # Trigger data keyed by the action condition type it is checked against
        self._conditions = {
            "0": self.hostgroups,  # Host group
            "1": self.hosts,  # Host
            "2": [self.triggerid],  # Trigger
            "3": trigger["event_name"] or trigger["description"],  # Event name
            "4": trigger["priority"],  # Trigger severity
            "13": self.templates,  # Template
            # A list (not a keys view), so the "equals" check below can match it
            "25": list(self.tags.keys()),  # Tag name
            "26": self.tags,  # Tag value
        }

    def eval_condition(self, operator, value, trigger_data):
        # equals or does not equal
        if operator in ["0", "1"]:
            equals = operator == "0"
            if isinstance(value, dict) and isinstance(trigger_data, dict):
                if value["tag"] in trigger_data:
                    if value["value"] == trigger_data[value["tag"]]:
                        return equals
            elif value in trigger_data and isinstance(trigger_data, list):
                return equals
            elif value == trigger_data:
                return equals
            return not equals

        # contains or does not contain
        if operator in ["2", "3"]:
            contains = operator == "2"
            if isinstance(value, dict) and isinstance(trigger_data, dict):
                if value["tag"] in trigger_data:
                    if value["value"] in trigger_data[value["tag"]]:
                        return contains
            elif value in trigger_data:
                return contains
            return not contains

        # is greater/less than or equals
        if operator in ["5", "6"]:
            greater = operator != "5"
            try:
                if int(value) < int(trigger_data):
                    return not greater
                if int(value) == int(trigger_data):
                    return True
                if int(value) > int(trigger_data):
                    return greater
            except (TypeError, ValueError) as err:
                raise ValueError(
                    "Values must be numbers to compare them"
                ) from err

    def select_templates(self, templates):
        for template in templates:
            if self.tmpl_triggerid in [
                t["triggerid"] for t in template["triggers"]
            ]:
                self.templates.append(template["templateid"])
            if self.lld_rule.get("templateid") in [
                d["itemid"] for d in template["discoveries"]
            ]:
                self.templates.append(template["templateid"])

    def select_actions(self, actions):
        selected_actions = []
        for action in actions:
            conditions = []
            if "filter" in action:
                conditions = action["filter"]["conditions"]
                eval_formula = action["filter"]["eval_formula"]
            # Add actions without conditions directly
            if not conditions:
                selected_actions.append(action)
                continue
            condition_check = {}
            for condition in conditions:
                if (
                    condition["conditiontype"] != "6"
                    and condition["conditiontype"] != "16"
                ):
                    # Tag value conditions carry the tag name in "value2"
                    if (
                        condition["conditiontype"] == "26"
                        and isinstance(condition["value"], str)
                    ):
                        condition["value"] = {
                            "tag": condition["value2"],
                            "value": condition["value"],
                        }
                    if condition["conditiontype"] in self._conditions:
                        condition_check[
                            condition["formulaid"]
                        ] = self.eval_condition(
                            condition["operator"],
                            condition["value"],
                            self._conditions[
                                condition["conditiontype"]
                            ],
                        )
                else:
                    # Types "6" and "16" are not checked against trigger data
                    # and are treated as always true
                    condition_check[
                        condition["formulaid"]
                    ] = True
            # Substitute all formula IDs in a single pass, so that an ID such
            # as "F" or "T" cannot corrupt an already inserted "False"/"True"
            eval_formula = re.sub(
                r"\b[A-Z]+\b",
                lambda m: str(condition_check.get(m.group(0), m.group(0))),
                eval_formula,
            )
            # Evaluate the final condition formula
            if eval(eval_formula):
                selected_actions.append(action)
        return selected_actions

    def select_operations(self, actions, mediatypes):
        messages_metadata = []
        for action in self.select_actions(actions):
            messages_metadata += self.check_operations(
                "operations", action, mediatypes
            )
            messages_metadata += self.check_operations(
                "update_operations", action, mediatypes
            )
            messages_metadata += self.check_operations(
                "recovery_operations", action, mediatypes
            )
        return messages_metadata

    def check_operations(self, optype, action, mediatypes):
        messages_metadata = []
        optype_mapping = {
            "operations": "0",  # Problem event
            "recovery_operations": "1",  # Recovery event
            "update_operations": "2",  # Update event
        }
        operations = copy.deepcopy(action[optype])
        # Processing "notify all involved" scenarios; iterate over a snapshot
        # because the list is modified inside the loop
        for notify_op in list(operations):
            if notify_op["operationtype"] not in ["11", "12"]:
                continue
            # Copy operation as a template for reuse
            op_template = copy.deepcopy(notify_op)
            operations.remove(notify_op)
            # Checking for message sending operations
            for key in [
                k for k in ["operations", "update_operations"] if k != optype
            ]:
                if not action[key]:
                    continue
                # Checking for message sending type operations
                for op in [
                    o for o in action[key] if o["operationtype"] == "0"
                ]:
                    # Copy template for the current operation
                    operation = copy.deepcopy(op_template)
                    operation.update(
                        {
                            "operationtype": "0",
                            "opmessage_usr": op["opmessage_usr"],
                            "opmessage_grp": op["opmessage_grp"],
                        }
                    )
                    operation["opmessage"]["mediatypeid"] = op[
                        "opmessage"
                    ]["mediatypeid"]
                    operations.append(operation)
        for operation in operations:
            if operation["operationtype"] != "0":
                continue
            # Processing "all mediatypes" scenario
            if operation["opmessage"]["mediatypeid"] == "0":
                for mediatype in mediatypes:
                    operation["opmessage"]["mediatypeid"] = mediatype[
                        "mediatypeid"
                    ]
                    messages_metadata.append(
                        self.create_messages(
                            optype_mapping[optype], action, operation, [
                                mediatype
                            ]
                        )
                    )
            else:
                messages_metadata.append(
                    self.create_messages(
                        optype_mapping[optype],
                        action,
                        operation,
                        mediatypes
                    )
                )
        return messages_metadata

    def create_messages(self, optype, action, operation, mediatypes):
        message = Message(optype, action, operation)
        message.select_mediatypes(mediatypes)
        self.messages.append(message)
        return message

The code for creating Trigger class objects for each of the retrieved triggers:

triggers_metadata = {}

for trigger in triggers:
    triggers_metadata[trigger["triggerid"]] = Trigger(trigger)

This loop iterates through all triggers and saves them in a dictionary called triggers_metadata, where the key is the triggerid and the value is the trigger object.

Step 4. Retrieve Template Information

The next step is to obtain data about the templates associated with all the triggers:

templates = api.template.get(
    triggerids=list(set([t.tmpl_triggerid for t in triggers_metadata.values()])),
    selectTriggers=["triggerid"],
    selectDiscoveries=["itemid"],
    output=["templateid"],
)

This request returns information about all templates linked to the triggers of the host being examined. Executing a single query for all triggers is more efficient than making an individual request for each trigger. This information will be needed for evaluating the “Template” condition in actions.

Step 5. Get Actions and Media Types

Next, we obtain the list of actions and media types configured in the system:

actions = api.action.get(
    selectFilter="extend",
    selectOperations="extend",
    selectRecoveryOperations="extend",
    selectUpdateOperations="extend",
    filter={"eventsource": 0, "status": 0},
    output=["actionid", "esc_period", "eval_formula", "name"],
)

mediatypes = api.mediatype.get(
    selectUsers="extend",
    selectActions="extend",
    selectMessageTemplates="extend",
    filter={"status": 0},
    output=["mediatypeid", "name"],
)

Here we retrieve actions that define how and to whom alerts are sent, and mediatypes through which users can receive notifications (for example, email or SMS).

Step 6. Match Triggers with Templates and Actions

At this stage, each trigger is associated with the corresponding templates and actions:

messages = []

for trigger in triggers_metadata.values():
    trigger.select_templates(templates)
    messages += trigger.select_operations(actions, mediatypes)

Here, for each trigger, we update information about its templates and configured actions for sending notifications. The list of associated actions is determined by checking the conditions specified in them against the accumulated data for each trigger.
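An action’s eval_formula is a boolean expression over condition labels, for example "A and (B or C)". One way to evaluate such a formula without passing a string to eval() is a small recursive-descent parser. The sketch below is an alternative approach for illustration, not the code used in this project; it assumes that formulas contain only uppercase labels, the words and and or, and parentheses.

```python
import re

TOKEN_RE = re.compile(r"\s*(\(|\)|and|or|[A-Z]+)")


def tokenize(formula):
    """Split a formula such as "A and (B or C)" into tokens."""
    formula = formula.strip()
    tokens, pos = [], 0
    while pos < len(formula):
        match = TOKEN_RE.match(formula, pos)
        if not match:
            raise ValueError(f"unexpected input at position {pos}")
        tokens.append(match.group(1))
        pos = match.end()
    return tokens


def evaluate_formula(formula, results):
    """Evaluate the formula using per-label boolean results, without eval()."""
    tokens = tokenize(formula)
    pos = 0

    def parse_or():
        nonlocal pos
        value = parse_and()
        while pos < len(tokens) and tokens[pos] == "or":
            pos += 1
            rhs = parse_and()
            value = value or rhs
        return value

    def parse_and():
        nonlocal pos
        value = parse_atom()
        while pos < len(tokens) and tokens[pos] == "and":
            pos += 1
            rhs = parse_atom()
            value = value and rhs
        return value

    def parse_atom():
        nonlocal pos
        if pos >= len(tokens):
            raise ValueError("unexpected end of formula")
        token = tokens[pos]
        pos += 1
        if token == "(":
            value = parse_or()
            if pos >= len(tokens) or tokens[pos] != ")":
                raise ValueError("missing closing parenthesis")
            pos += 1
            return value
        if token in results:
            return bool(results[token])
        raise ValueError(f"unexpected token {token!r}")

    value = parse_or()
    if pos != len(tokens):
        raise ValueError("unexpected trailing tokens")
    return value


print(evaluate_formula("A and (B or C)", {"A": True, "B": False, "C": True}))
```

Because labels are looked up as whole tokens, a label such as F or T is never confused with the text of a substituted result.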

For each operation of the corresponding trigger action, a Message class object is created:

import copy
import re


class Message:
    def __init__(self, optype, action, operation):
        self.optype = optype
        self.mediatypename = ""
        self.actionid = action["actionid"]
        self.actionname = action["name"]
        self.operationid = operation["operationid"]
        self.mediatypeid = operation["opmessage"]["mediatypeid"]
        self.subject = operation["opmessage"]["subject"]
        self.message = operation["opmessage"]["message"]
        self.default_msg = operation["opmessage"]["default_msg"]
        self.users = [u["userid"] for u in operation["opmessage_usr"]]
        self.groups = [g["usrgrpid"] for g in operation["opmessage_grp"]]
        self.recipients = []
        # Escalation period set to action's period if not specified
        self.esc_period = operation.get("esc_period", "0")
        if self.esc_period == "0":
            self.esc_period = action["esc_period"]
        # Delay before the first message: period multiplied by (first step - 1)
        self.esc_step_from = self.multiply_time(
            self.esc_period, int(operation.get("esc_step_from", "1")) - 1
        )
        if operation.get("esc_step_to", "0") != "0":
            self.repeat_count = str(
                int(operation["esc_step_to"]) - int(operation["esc_step_from"]) + 1
            )
        # If not a problem event, set repeat count to 1
        elif self.optype != "0":
            self.repeat_count = "1"
        # Infinite repeat count if esc_step_to is 0
        else:
            self.repeat_count = "&infin;"

    def multiply_time(self, time_str, multiplier):
        # Multiply numbers within the time string
        result = re.sub(
            r"(\d+)",
            lambda m: str(int(m.group(1)) * multiplier),
            time_str
        )
        if result[0] == "0":
            return "0"
        return result

    def select_mediatypes(self, mediatypes):
        for mediatype in mediatypes:
            if mediatype["mediatypeid"] == self.mediatypeid:
                self.mediatypename = mediatype["name"]
                # Select message templates related to operation type
                msg_template = [
                    m
                    for m in mediatype["message_templates"]
                    if (
                        m["recovery"] == self.optype
                        and m["eventsource"] == "0"
                    )
                ]
                # Use default message if applicable
                if msg_template and self.default_msg == "1":
                    self.subject = msg_template[0]["subject"]
                    self.message = msg_template[0]["message"]

    def select_recipients(self, user_groups, recipients):
        # Expand user groups into their members
        for groupid in self.groups:
            if groupid in user_groups:
                self.users += user_groups[groupid]
        for userid in self.users:
            if userid in recipients:
                recipient = copy.deepcopy(recipients[userid])
                # Mark recipients who have this media type configured
                if self.mediatypeid in recipient.sendto:
                    recipient.mediatype = True
                self.recipients.append(recipient)

Each such object represents a separate message sent to users (recipients) and will contain all message information – its subject, text, recipients, and escalation parameters.
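The escalation parameters follow simple arithmetic: the first message of an operation is delayed by the escalation period multiplied by (first step - 1), and the number of repetitions is last step - first step + 1. Below is a worked sketch that follows the same conventions as the Message class; escalation_window is an illustrative helper, and the sample values are made up.

```python
import re


def multiply_time(time_str, multiplier):
    """Multiply every number in a time string such as "30m" or "3600"."""
    result = re.sub(
        r"(\d+)", lambda m: str(int(m.group(1)) * multiplier), time_str
    )
    return "0" if result.startswith("0") else result


def escalation_window(esc_period, step_from, step_to):
    """Return (delay before the first message, number of repetitions)."""
    delay = multiply_time(esc_period, step_from - 1)
    # step_to == 0 means the operation repeats indefinitely
    repeats = None if step_to == 0 else step_to - step_from + 1
    return delay, repeats


# Steps 2 to 4 with a 30-minute period: first message after 30m, sent 3 times
print(escalation_window("30m", 2, 4))
# Step 1 with no upper bound: sent immediately, repeated indefinitely
print(escalation_window("1h", 1, 0))
```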

Step 7. Collect User and Group Identifiers

After matching the triggers with actions, the process of collecting unique identifiers for users and groups starts:

userids = set()
groupids = set()

for message in messages:
    userids.update(message.users)
    groupids.update(message.groups)

This code snippet collects the IDs of all users and groups involved in the operations for each trigger. This is necessary to perform only one request to the Zabbix API for all involved users and their groups, rather than making separate requests for each trigger.

Step 8. Obtain User and Group Information

The next step is to collect detailed information about users and user groups:

usergroups = {
    group["usrgrpid"]: group
    for group in api.usergroup.get(
        selectUsers=["userid"],
        selectHostGroupRights="extend",
        output=["usrgrpid", "role"],
    )
}

users = {
    user["userid"]: user
    for user in api.user.get(
        selectUsrgrps=["usrgrpid"],
        selectMedias=["mediatypeid", "active", "sendto"],
        selectRole=["roleid", "type"],
        filter={"status": 0},
        output=["userid", "username", "name", "surname"],
    )
}

Here we gather data about users, including their role and media types through which they receive notifications, as well as data about user groups, including access rights to host groups and the list of users in each group. All this information will be needed to check access to the host with the triggers we are working with.

Step 9. Match Users and Groups with Triggers

After obtaining user information, we match users and groups with their respective rights to receive notifications. Here we also link users with groups, updating the information regarding rights and groups for each user.

recipients = {}
user_groups = {}

for userid in userids:
    if userid in users:
        user = users[userid]
        recipients[userid] = Recipient(user)
        for group in user["usrgrps"]:
            if group["usrgrpid"] in usergroups:
                recipients[userid].permissions.update([
                    h["id"]
                    for h in usergroups[group["usrgrpid"]]["hostgroup_rights"]
                    if int(h["permission"]) > 1
                ])

for groupid in groupids:
    if groupid in usergroups:
        group = usergroups[groupid]
        user_groups[group["usrgrpid"]] = []
        for user in group["users"]:
            user_groups[group["usrgrpid"]].append(user["userid"])
            if user["userid"] in recipients:
                # add() stores the ID itself; update() would add its characters
                recipients[user["userid"]].groups.add(group["usrgrpid"])
            elif user["userid"] in users:
                recipients[user["userid"]] = Recipient(users[user["userid"]])
            else:
                # Skip group members that were not returned by user.get
                continue
            recipients[user["userid"]].permissions.update([
                h["id"]
                for h in group["hostgroup_rights"]
                if int(h["permission"]) > 1
            ])

This code fragment connects each user with their groups and vice versa, creating a complete list of users with their access rights to the host, and thus their eligibility to receive notifications about events for this host.

For each recipient, a Recipient class object is created containing data about the recipient, such as the notification address, access rights to hosts, configured mediatypes, etc.

Here’s the code that describes the Recipient class:

class Recipient:
    def __init__(self, user):
        self.userid = user["userid"]
        self.username = user["username"]
        self.fullname = "{name} {surname}".format(**user).strip()
        self.type = user["role"]["type"]
        self.groups = set([g["usrgrpid"] for g in user["usrgrps"]])
        self.has_right = False
        self.permissions = set()
        self.sendto = {
            m["mediatypeid"]: m["sendto"] for m in user["medias"] if m["active"] == "0"
        }
        # Check if the user is a super admin (type 3)
        if self.type == "3":
            self.has_right = True

Step 10. Match Messages with Recipients

Finally, we match recipients with specific messages from Step 6:

for message in messages:
    message.select_recipients(user_groups, recipients)

This step completes the main process – each message is assigned to the relevant recipients.

Step 11. Check Recipient Access Rights and Output the Result

Before outputting the list of recipients, we can check each recipient’s rights and keep only those who are permitted to receive notifications for events related to the trigger, or those who have all the required media types configured and active. After that, the information can be output in any convenient way, whether by exporting it to a file or displaying it on the screen:

for trigger in triggers_metadata.values():
    for message in trigger.messages:
        for recipient in message.recipients:
            recipient.show = True
            if not recipient.has_right:
                recipient.has_right = any(
                    gid in recipient.permissions
                    for gid in trigger.hostgroups
                )
            if not recipient.has_right and not show_unavail:
                recipient.show = False
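As one possible way of exporting the result to a file, the visible recipients can be flattened into CSV rows. The sketch below uses plain dictionaries in place of the Trigger, Message, and Recipient objects, so the column names and sample values are assumptions made for illustration only.

```python
import csv
import io


def recipients_to_csv(rows):
    """Render rows (dicts) as CSV text, skipping recipients marked as hidden."""
    fieldnames = ["trigger", "action", "mediatype", "username", "sendto"]
    buffer = io.StringIO()
    # extrasaction="ignore" drops helper keys such as "show" from the output
    writer = csv.DictWriter(buffer, fieldnames=fieldnames, extrasaction="ignore")
    writer.writeheader()
    for row in rows:
        if row.get("show", True):
            writer.writerow(row)
    return buffer.getvalue()


sample_rows = [
    {"trigger": "High CPU load", "action": "Notify admins", "mediatype": "Email",
     "username": "alice", "sendto": "alice@example.com", "show": True},
    {"trigger": "High CPU load", "action": "Notify admins", "mediatype": "Email",
     "username": "bob", "sendto": "bob@example.com", "show": False},
]
print(recipients_to_csv(sample_rows))
```

The returned text can be written to a file as is, or the StringIO buffer can be swapped for an open file object.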

Example Implementation

All the examples and code snippets described above have been compiled into a solution that demonstrates the algorithm for obtaining notification recipients for triggers associated with the selected host. We have implemented this algorithm as a simple web interface to make the result more illustrative and easier to explore.

This interface allows users to enter the host’s ID. The script then processes the data and provides a list of notification recipients associated with the triggers on that host. The web interface uses asynchronous requests to the Zabbix API and the zabbix_utils library to ensure fast data processing and ease of use with many triggers and users.

This way, you can not only study the steps and code examples but also try the solution in practice.

Please note once again that the code in this project is for demonstration purposes, may not be optimal, or could contain errors. Use it as an example or a base for your project, but not as a complete tool.

The web interface’s complete source code and installation instructions can be found on GitHub.

Conclusion

In this article, we explored a practical example of using the zabbix_utils library to solve the task of obtaining alert recipients for triggers associated with a selected Zabbix host using the Zabbix API. We detailed the key steps, from setting up the environment and initializing trigger metadata to working with notification recipients and optimizing performance with asynchronous requests.

Using zabbix_utils allowed us to optimize and accelerate interaction with the Zabbix API, expanding the capabilities of the Zabbix web interface and increasing efficiency when working with large volumes of data. Thanks to support for asynchronous processing and selective API requests, it is possible to significantly reduce the load on the server and improve system performance when working with Zabbix, which is especially important in large infrastructures.

We hope this example will assist you in implementing your own solutions based on the Zabbix API and zabbix_utils, and demonstrate the possibilities for optimizing your interaction with the Zabbix API.

The post Using the zabbix_utils Library for Tool Development appeared first on Zabbix Blog.

Criminals Exploiting FBI Emergency Data Requests

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2024/11/criminals-exploiting-fbi-emergency-data-requests.html

I’ve been writing about the problem with lawful-access backdoors in encryption for decades now: that as soon as you create a mechanism for law enforcement to bypass encryption, the bad guys will use it too.

Turns out the same thing is true for non-technical backdoors:

The advisory said that the cybercriminals were successful in masquerading as law enforcement by using compromised police accounts to send emails to companies requesting user data. In some cases, the requests cited false threats, like claims of human trafficking and, in one case, that an individual would “suffer greatly or die” unless the company in question returns the requested information.

The FBI said the compromised access to law enforcement accounts allowed the hackers to generate legitimate-looking subpoenas that resulted in companies turning over usernames, emails, phone numbers, and other private information about their users.

ASRock Rack GNRD8-2L2T Motherboard Review Intel Xeon 6 Single Socket

Post Syndicated from John Lee original https://www.servethehome.com/asrock-rack-gnrd8-2l2t-motherboard-review-intel-xeon-6-single-socket/

The ASRock Rack GNRD8-2L2T is an Intel Xeon 6 motherboard designed to provide plenty of I/O for the Intel Xeon 6700 series

The post ASRock Rack GNRD8-2L2T Motherboard Review Intel Xeon 6 Single Socket appeared first on ServeTheHome.