Building a Spark observability product with StarRocks: Real-time and historical performance analysis

Post Syndicated from Grab Tech original https://engineering.grab.com/building-a-spark-observability

Introduction

At Grab, we’ve been working to perfect our Spark observability tools. Our initial solution, Iris, was developed to provide a custom, in-depth observability tool for Spark jobs. As described in our previous blog post, Iris collects and analyses metrics and metadata at the job level, providing insights into resource usage, performance, and query patterns across our Spark clusters.

Iris addresses a critical gap in Spark observability by providing real-time performance metrics at the Spark application level. Unlike traditional monitoring tools that typically provide metrics only at the EC2 instance level, Iris dives deeper into the Spark ecosystem. It bridges the observability gap by making Spark metrics accessible through a tabular dataset, enabling real-time monitoring and historical analysis. This approach eliminates the need to parse complex Spark event log JSON files, which users are often unable to access when they need immediate insights. Iris empowers users with on-demand access to comprehensive Spark performance data, facilitating quicker decision-making and more efficient resource management.

Iris served us well, offering basic dashboards and charts that helped our teams understand trends, discover issues, and debug their Spark jobs. However, as our needs evolved and usage grew, we began to encounter limitations:

  1. Fragmented user experience and access control: Observability data was split between Grafana (real-time) and Superset (historical), forcing users to switch platforms for a complete view. The complex Grafana dashboards, while powerful, were challenging for non-technical users, and the lack of granular permissions hindered wider adoption. We needed a unified, user-friendly interface with role-based access to serve all Grabbers effectively.

  2. Operational overhead: Our data pipeline for offline analytics included multiple hops and complex transformations.

  3. Data management: We faced challenges managing real-time data in InfluxDB alongside offline data in our data lake, particularly with string-type metadata.

These challenges and the need for a centralised, user-friendly web application prompted us to seek a more robust solution. Enter StarRocks – a modern analytical database that addresses many of our pain points:

| Pain points with InfluxDB | StarRocks solution |
| --- | --- |
| Limited SQL compatibility: Requires use of Flux query language instead of full SQL | Full MySQL-compatible SQL support, enabling seamless integration with existing tools and skills |
| Complex data ingestion pipeline: Requires external agents like Telegraf to consume Kafka and insert into InfluxDB | Direct Kafka ingestion, eliminating the need for intermediate agents and simplifying the data pipeline |
| Limited pre-aggregation capabilities: Aggregation is limited to time windows and indexed columns, not string columns | Flexible materialised views supporting complex aggregations on any column type, improving query performance |
| Poor support for metadata and joins: Designed primarily for numerical time series data, with slow performance on string data and joins | Efficient handling of both time-series and string-type metadata in a single system, with optimised join performance |
| Difficult integration with data lake: There is no official way to back up or stream data directly to the data lake, requiring separate pipelines | Native S3 integration for easy backup and direct data lake accessibility, eliminating the need for separate ingestion pipelines |
| Performance issues with high cardinality data: Indexing unique identifiers (like app_id) causes huge indexes and slow queries | Optimised for high cardinality data, allowing efficient querying on unique identifiers without performance degradation |

In this blog post, we will dive into leveraging StarRocks to build the next generation of the Spark observability platform. We will explore the architecture, data model, and key features that are helping us overcome previous limitations and provide more value to Spark users at Grab.

System architecture overview

To enhance the user experience, we’ve made substantial changes to the architecture, moving from the Telegraf/InfluxDB/Grafana (TIG) stack to a more streamlined and powerful setup centred around StarRocks. This new architecture addresses the previous challenges and provides a more unified, flexible, and efficient solution.

Figure 1. New Iris architecture with StarRocks integration

Key components of the new architecture:

1. StarRocks database

  • Replaces InfluxDB for both real-time and historical data storage

  • Supports complex queries on metrics and metadata tables

2. Direct Kafka ingestion

  • StarRocks ingests data directly from Kafka, eliminating Telegraf

3. Custom web application (Iris UI)

  • Replaces Grafana dashboards

  • Centralised, flexible interface with custom API

4. Superset integration

  • Maintained and now connected directly to StarRocks

  • Provides real-time data access, consistent with the custom web app

5. Simplified offline data process

  • Scheduled backups from StarRocks to S3 directly

  • Replaces previous complex data lake pipelines

Key improvements:

1. Unified data store: Single source for real-time and historical data

2. Streamlined data flow: A simplified pipeline reduces latency and failure points

3. Flexible visualisation: Custom web app with intuitive, role-specific interfaces

4. Consistent real-time access: Across both custom app and Superset

5. Simplified backup and data lake integration: Direct S3 backups

Data model and ingestion

The Iris observability system is designed to monitor both job executions and ad-hoc cluster usage, encompassing what we call “cluster observation”. This model accounts for two scenarios:

  • Ad-hoc use: Pre-created clusters shared among team users

  • Job execution: New clusters are created for each job submission

Key design points

For each cluster, we capture both metadata and metrics:

| Key point | Description |
| --- | --- |
| Linkage | We use worker_uuid to link metadata with worker metrics, and app_id to link metadata with Spark event metrics. |
| Granularity | Worker metrics are captured every 5 seconds, linked by worker_uuid. Spark events are captured as they occur, linked by app_id. Metadata can be captured multiple times. |
| Flexibility | This schema allows for queries at various levels: individual worker level, job level, cluster level. |
| Historical analysis | The design enables insights from historical runs, such as auto-scaling behaviour, maximum worker count per job, and maximum or average memory usage over time. |
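
To make the linkage concrete, here is a minimal in-memory sketch of joining metadata to worker metrics on worker_uuid and deriving the maximum worker count per job. The sample rows and the helper name `max_workers_per_job` are hypothetical; only the column names come from the Iris tables.

```python
from collections import defaultdict

# Hypothetical sample rows; column names follow the Iris tables.
metadata = [
    {"worker_uuid": "w1", "job_id": "job-a", "app_id": "app-1"},
    {"worker_uuid": "w2", "job_id": "job-a", "app_id": "app-1"},
    {"worker_uuid": "w3", "job_id": "job-b", "app_id": "app-2"},
]
worker_metrics = [
    {"worker_uuid": "w1", "epoch_ms": 5000, "bytes_heap_used": 100.0},
    {"worker_uuid": "w2", "epoch_ms": 5000, "bytes_heap_used": 300.0},
    {"worker_uuid": "w1", "epoch_ms": 10000, "bytes_heap_used": 150.0},
    {"worker_uuid": "w3", "epoch_ms": 5000, "bytes_heap_used": 50.0},
]


def max_workers_per_job(metadata, worker_metrics):
    """Join metrics to metadata on worker_uuid, count the distinct workers
    reporting at each sample time, and keep the maximum per job."""
    job_of_worker = {m["worker_uuid"]: m["job_id"] for m in metadata}
    workers_at = defaultdict(set)  # (job_id, epoch_ms) -> set of worker_uuid
    for row in worker_metrics:
        job_id = job_of_worker.get(row["worker_uuid"])
        if job_id is not None:
            workers_at[(job_id, row["epoch_ms"])].add(row["worker_uuid"])
    result = defaultdict(int)
    for (job_id, _), workers in workers_at.items():
        result[job_id] = max(result[job_id], len(workers))
    return dict(result)


print(max_workers_per_job(metadata, worker_metrics))
```

In the real system this kind of question is answered with SQL over the StarRocks tables; the sketch only shows why a shared worker_uuid is enough to move between worker level and job level.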

Schemas

Let’s break down our table schemas:

Cluster metadata

    CREATE TABLE `cluster_worker_metadata_raw` (
        `report_date` date  NOT NULL COMMENT "Report date",
        `platform` varchar(128) NOT NULL COMMENT "Platform",
        `worker_uuid` varchar(128) NULL COMMENT "Worker UUID (Iris UUID)",
        `worker_role` varchar(128) NULL COMMENT "Worker role",
        `epoch_ms` bigint(20) NULL COMMENT "Event Time",
        `cluster_id` varchar(128) NULL COMMENT "Cluster ID",
        `job_id` varchar(128) NULL COMMENT "User Job ID",
        `run_id` varchar(128) NULL COMMENT "User Job Run ID",
        `job_owner` varchar(128) NULL COMMENT "User Job Owner",
        `app_id` varchar(128) NULL COMMENT "Spark Application ID",
        `spark_ui_url` varchar(256) NULL COMMENT "Spark UI URL",
        `driver_log_location` varchar(256) NULL COMMENT "Spark Driver Log Location",
        -- other relevant metadata fields
    )
    ENGINE=OLAP
    DUPLICATE KEY(`report_date`, `platform`,`worker_uuid`,`worker_role`)
    PARTITION BY RANGE(`report_date`)()
    DISTRIBUTED BY HASH(`report_date`,`platform`)
    PROPERTIES (
        "replication_num" = "3"
    );

Cluster worker metrics

    CREATE TABLE `cluster_worker_metrics_raw` (
        `report_date` date NOT NULL COMMENT "Report date",
        `platform` varchar(128) NOT NULL COMMENT "Platform",
        `worker_uuid` varchar(128) NULL COMMENT "Worker UUID",
        `worker_role` varchar(128) NULL COMMENT "Worker Role",
        `epoch_ms` bigint(20) NULL COMMENT "EpochMillis",
        `cpus` bigint(20) NULL COMMENT "Worker CPU Cores",
        `memory` bigint(20) NULL COMMENT "Worker Memory",
        `bytes_heap_used` double NULL COMMENT "Byte Heap Used",
        `bytes_non_heap_used` double NULL COMMENT "Byte Non Heap Used",
        `gc_collection_time` double NULL COMMENT "GC Collection Time",
        `cpu_time` double NULL COMMENT "CPU Time",
        -- other relevant metrics fields
    )
    ENGINE=OLAP
    DUPLICATE KEY(`report_date`, `platform`,`worker_uuid`,`worker_role`)
    PARTITION BY RANGE(`report_date`)()
    DISTRIBUTED BY HASH(`report_date`,`platform`)
    PROPERTIES (
        "replication_num" = "3"
    );

Cluster spark metrics

    CREATE TABLE `cluster_spark_metrics_raw`
    (
        `report_date`                 date           NOT NULL COMMENT "Report date",
        `platform`                    varchar(128)   NOT NULL COMMENT "Platform",
        `app_id`                      varchar(128)   NOT NULL COMMENT "Spark Application ID",
        `app_attempt_id`              varchar(128) DEFAULT '1' COMMENT "Spark Application Attempt ID",
        `measure_name`                varchar(128)   NULL COMMENT "The spark measure name",
        `epoch_ms`                    bigint(20)     NULL COMMENT "EpochMillis",
        `records_read`                double         NULL COMMENT "Stage Records Read",
        `records_written`             double         NULL COMMENT "Stage Records Written",
        `bytes_read`                  double         NULL COMMENT "Stage Bytes Read",
        `bytes_written`               double         NULL COMMENT "Stage Bytes Written",
        `memory_bytes_spilled`        double         NULL COMMENT "Stage Memory Bytes Spilled",
        `disk_bytes_spilled`          double         NULL COMMENT "Stage Disk Bytes Spilled",
        `shuffle_total_bytes_read`    double         NULL COMMENT "Stage Shuffle Total Bytes Read",
        `shuffle_total_bytes_written` double         NULL COMMENT "Stage Shuffle Total Bytes Written",
        `total_tasks`                 double         NULL COMMENT "Stage Total Tasks",
        `shuffle_write_time`          double         NULL COMMENT "Shuffle Write Time",
        `shuffle_fetch_wait_time`     double         NULL COMMENT "Shuffle Fetch Waiting Time",
        `result_serialization_time`   double         NULL COMMENT "Result Serialization Time",
        -- other relevant metrics fields
    )
    ENGINE = OLAP
    DUPLICATE KEY(`report_date`, `platform`,`app_id`, `app_attempt_id`)
    PARTITION BY RANGE(`report_date`)()
    DISTRIBUTED BY HASH(`report_date`,`platform`)
    PROPERTIES (
        "replication_num" = "3"
    );

Data ingestion from Kafka to StarRocks

We use StarRocks’ routine load feature to ingest data directly from Kafka into our tables. Refer to the StarRocks documentation: Load data using routine load.

Here is a simple example of creating a routine load job for cluster worker metrics:

    CREATE ROUTINE LOAD iris.routetine_cluster_worker_metrics_raw ON cluster_worker_metrics_raw
    COLUMNS(platform, worker_uuid, worker_role, epoch_ms, cpus, `memory`, bytes_heap_used, bytes_non_heap_used, gc_collection_time, report_date=date(from_unixtime(epoch_ms / 1000)))
    WHERE ISNOTNULL(platform)
    PROPERTIES
    (
        "desired_concurrent_number" = "3",
        "format" = "json",
        "jsonpaths" = "[\"$.platform\",\"$.workerUuid\",\"$.workerRole\",\"$.epochMillis\",\"$.cpuCores\",\"$.memory\",\"$.heapMemoryTotalUsed\",\"$.nonHeapMemoryTotalUsed\",\"$.gc-collectionTime\"]"
    )
    FROM KAFKA
    (
        "kafka_broker_list" ="broker:9092",
        "kafka_topic" = "<worker metrics topic>",
        "property.kafka_default_offsets" = "OFFSET_END"
    );

This configuration sets up continuous ingestion from the specified Kafka topic into the cluster_worker_metrics_raw table, parsing each JSON message with the listed JSON paths.
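
As a rough illustration of what the column mapping in the routine load job above does to a single message, the following sketch applies the same JSON paths, the platform filter, and the derived report_date in plain Python. The function name `to_row` and the sample message are hypothetical, only top-level `$.key` paths are handled, and UTC is assumed for the date conversion (StarRocks uses the session time zone).

```python
import json
from datetime import datetime, timezone

# Target columns and JSON paths, in the same order as the routine load job.
COLUMNS = ["platform", "worker_uuid", "worker_role", "epoch_ms", "cpus",
           "memory", "bytes_heap_used", "bytes_non_heap_used",
           "gc_collection_time"]
JSONPATHS = ["$.platform", "$.workerUuid", "$.workerRole", "$.epochMillis",
             "$.cpuCores", "$.memory", "$.heapMemoryTotalUsed",
             "$.nonHeapMemoryTotalUsed", "$.gc-collectionTime"]


def to_row(message):
    """Map one Kafka JSON message to a table row, or None if filtered out."""
    payload = json.loads(message)
    # Strip the leading "$." and look the key up at the top level.
    row = {col: payload.get(path[2:]) for col, path in zip(COLUMNS, JSONPATHS)}
    if row["platform"] is None:  # mirrors WHERE ISNOTNULL(platform)
        return None
    # Mirrors report_date = date(from_unixtime(epoch_ms / 1000)).
    # Assumes epochMillis is present; UTC is an assumption of this sketch.
    ts = datetime.fromtimestamp(row["epoch_ms"] / 1000, tz=timezone.utc)
    row["report_date"] = ts.date().isoformat()
    return row


# Hypothetical message, shaped after the JSON paths above.
sample = json.dumps({
    "platform": "platform-x", "workerUuid": "w1", "workerRole": "driver",
    "epochMillis": 1700000000000, "cpuCores": 4, "memory": 16384,
    "heapMemoryTotalUsed": 1.5e9, "nonHeapMemoryTotalUsed": 2.0e8,
    "gc-collectionTime": 120.0,
})
print(to_row(sample))
```

Messages without a platform field are dropped before they reach the table, which is the effect of the WHERE clause in the job definition.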

To monitor routine load jobs, StarRocks provides built-in commands that show the status and error logs of each job. Example query to check a load job:

    SHOW ROUTINE LOAD WHERE NAME = "iris.routetine_cluster_worker_metrics_raw";

Handle both real-time and historical data in the unified system

The new Iris system uses StarRocks to efficiently manage both real-time and historical data. We have implemented three key features to achieve this:

  1. StarRocks’ routine load enables near real-time data ingestion from Kafka. Multiple load tasks concurrently consume messages from different topic partitions, resulting in data appearing in Iris tables within seconds of collection. This quick ingestion keeps our monitoring capabilities current, providing users with up-to-date information about their Spark jobs.

  2. For historical analysis, StarRocks serves as a persistent dataset, storing metadata and job metrics with a time-to-live of over 30 days. This allows us to perform analysis based on the last 30 days of job runs directly in StarRocks, which is significantly faster than using offline data in our data lake.

  3. We’ve also implemented materialised views in StarRocks to pre-calculate and aggregate data for each job run. These views combine information from metadata, worker metrics, and Spark metrics, creating ready-to-use summary data. This approach eliminates the need for complex join operations when users access the job run summary screen in the UI, improving response times for both SQL queries and API access.

This setup offers substantial improvements over our previous InfluxDB-based system. As a time-series database, InfluxDB made complex queries and joins challenging. It also lacked support for materialised views, making it difficult to create pre-built job-run summaries. Previously, we had to query our data lake using Spark and Presto to view historical runs for a particular job over the last 30 days, which was slower than querying StarRocks directly.

By combining real-time ingestion, persistent storage, and materialised views, Iris now provides a unified, efficient platform for both immediate monitoring and in-depth historical analysis of Spark jobs.

Query performance and optimisation

StarRocks has significantly improved our query performance for Spark observability. Here are some key aspects of our optimisation strategy.

Materialised views

As mentioned, we leverage StarRocks’ materialised views to pre-aggregate job run summaries. This approach significantly reduces query complexity and improves response times for common UI operations. Materialised views combine data from metadata, worker metrics, and Spark metrics tables, thus eliminating the need for complex joins during query execution. This is particularly beneficial for our job-run summary screen, where pre-calculated aggregations can be retrieved instantly, improving both speed and user experience.

Here’s an example:

    CREATE MATERIALIZED VIEW job_runs_001
    PARTITION BY (`report_date`)
    DISTRIBUTED BY HASH(`report_date`,`platform`)
    REFRESH ASYNC
    PROPERTIES (
        "auto_refresh_partitions_limit" = "3",
        "partition_ttl" = "33 DAY"
    )
    AS
    select m.report_date                                                                     as report_date,
        m.platform,
        m.job_id,
        m.run_id,
        m.app_id,
        m.app_attempt_id,
        ANY_VALUE(COALESCE(m.cluster_id, m.cluster_name))                                 as cluster_id,
        ANY_VALUE(m.cluster_name)                                                         as cluster_name,
        ANY_VALUE(m.job_name)                                                             as job_name,
        ANY_VALUE(m.job_owner)                                                            as job_owner,
        ANY_VALUE(m.job_client)                                                           as job_client,
        ANY_VALUE(CASE WHEN m.worker_role = 'driver' THEN m.spark_ui_url END)             as spark_ui_url,
        ANY_VALUE(CASE WHEN m.worker_role = 'driver' THEN m.spark_history_url END)        as spark_history_url,
        ANY_VALUE(CASE WHEN m.worker_role = 'driver' THEN m.driver_log_location END)      as driver_log_location,
        COUNT(d.worker_uuid)                                                              as total_instances,
        from_unixtime(MIN(d.start_time) / 1000, 'yyyy-MM-dd HH:mm:ss')                    as start_time,
        from_unixtime(MAX(d.end_time) / 1000, 'yyyy-MM-dd HH:mm:ss')                      as end_time,
        COALESCE((((MAX(d.end_time) - MIN(d.start_time)) + 120000) / (1000 * 3600)), 0)   as job_hour,
        SUM(COALESCE(d.machine_hour, 0))                                                  as machine_hour,
        SUM(COALESCE(d.cpu_hour, 0))                                                      as cpu_hour,
        MAX(COALESCE(CASE WHEN d.worker_role = 'driver' THEN d.cpu_utilization END, 0))   as driver_cpu_utilization,
        MAX(COALESCE(CASE WHEN d.worker_role = 'driver' THEN d.memory_utilization END,
                        0))                                                                  as driver_memory_utilization,
        MAX(COALESCE(CASE WHEN d.worker_role = 'executor' THEN d.cpu_utilization END, 0)) as worker_cpu_utilization,
        MAX(COALESCE(CASE WHEN d.worker_role = 'executor' THEN d.memory_utilization END,
                        0))                                                                  as worker_memory_utilization,
        -- other relevant metrics fields
    from iris.cluster_worker_metadata_view_001 m
            left join iris.cluster_worker_metrics_view_006 d
                    on d.report_date >= m.report_date and d.platform = m.platform and d.worker_uuid = m.worker_uuid and
                        d.worker_role = m.worker_role
    where m.job_id is not null
    group by m.report_date,
            m.platform,
            m.job_id,
            m.run_id,
            m.app_id,
            m.app_attempt_id;
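
A few of the aggregations in this view can be checked by hand. The sketch below mirrors total_instances, job_hour (including the fixed 120000 ms added in the view), and the per-role maximum CPU utilisation for one job run. The function name `summarise_run` and the sample rows are hypothetical, and the sketch assumes utilisation values are never null.

```python
MS_PER_HOUR = 1000 * 3600
PADDING_MS = 120_000  # the fixed 120000 ms (2 minutes) added in the view


def summarise_run(workers):
    """Aggregate the per-worker rows of one job run.

    Each row is a dict with start_time and end_time in epoch milliseconds,
    plus worker_role and cpu_utilization.
    """
    start = min(w["start_time"] for w in workers)
    end = max(w["end_time"] for w in workers)
    return {
        "total_instances": len(workers),
        "job_hour": (end - start + PADDING_MS) / MS_PER_HOUR,
        "driver_cpu_utilization": max(
            (w["cpu_utilization"] for w in workers
             if w["worker_role"] == "driver"),
            default=0,
        ),
        "worker_cpu_utilization": max(
            (w["cpu_utilization"] for w in workers
             if w["worker_role"] == "executor"),
            default=0,
        ),
    }


# Hypothetical run: the driver lives for 58 minutes, one executor for 49.
run = [
    {"worker_role": "driver", "start_time": 0, "end_time": 3_480_000,
     "cpu_utilization": 0.4},
    {"worker_role": "executor", "start_time": 60_000, "end_time": 3_000_000,
     "cpu_utilization": 0.7},
]
summary = summarise_run(run)
print(summary)
```

With these inputs the run spans 58 minutes, so adding the 2-minute padding gives a job_hour of exactly 1.0.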

StarRocks offers powerful and flexible materialised view capabilities that significantly enhance our query performance and data management in Iris. Here are three key features we leverage:

SYNC and ASYNC

StarRocks supports both SYNC and ASYNC materialised views. We primarily use ASYNC views as they allow us to join multiple underlying tables, which is crucial for our job-run summaries. We can configure these views to refresh:

  • Immediately when the underlying base tables are updated.

  • At set intervals (e.g., every 1 minute).

This flexibility allows us to balance data freshness with system performance.

Example setting:

    REFRESH ASYNC START('2022-09-01 10:00:00') EVERY (interval 1 day)

For more details on supported features and settings, refer to the StarRocks documentation: Materialised view.

Partition TTL

We utilise the partition Time To Live (TTL) feature for materialised views. This allows us to control the amount of historical data stored in the views, typically setting it to 33 days. This ensures that the views remain performant and do not consume excessive storage while still providing quick access to recent historical data.

    PROPERTIES (
        "partition_ttl" = "33 DAY"
    )

Selective partition refresh

StarRocks allows us to refresh only specific partitions of a materialised view instead of the entire dataset. We take advantage of this by configuring our views to refresh only the most recent partitions (e.g., the last few days) where new data is typically added. This approach significantly reduces the computational overhead of keeping our materialised views up-to-date, especially for large historical datasets.

    PROPERTIES (
        "auto_refresh_partitions_limit" = "3"
    )
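
The effect of this limit can be modelled in a few lines. This is only a sketch of the selection it implies, not of how StarRocks schedules refreshes internally; the function name `partitions_to_refresh` and the dates are hypothetical.

```python
from datetime import date, timedelta


def partitions_to_refresh(partition_dates, limit=3):
    """Return the `limit` most recent partition dates, newest first."""
    return sorted(partition_dates, reverse=True)[:limit]


# Hypothetical view holding 33 daily partitions ending on 2024-01-10.
today = date(2024, 1, 10)
all_partitions = [today - timedelta(days=n) for n in range(33)]
recent = partitions_to_refresh(all_partitions, limit=3)
print(recent)
```

Of the 33 daily partitions in this example, only the three newest are candidates for an automatic refresh; the other 30 are left untouched.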

Partitioning

Our tables are partitioned by date, allowing for efficient pruning of historical data. This partitioning strategy is crucial for queries that focus on recent job runs or specific time ranges. By quickly eliminating irrelevant partitions, we significantly reduce the amount of data scanned for each query, leading to faster execution times.

    PARTITION BY RANGE(`report_date`)()
    DISTRIBUTED BY HASH(`report_date`,`platform`)

Dynamic partitioning

We utilise StarRocks’ dynamic partitioning feature to automatically manage our partitions. This ensures that new partitions are created as fresh data arrives and old partitions are dropped when data expires. Dynamic partitioning helps maintain optimal query performance over time without manual intervention, which is especially important for our continuous data ingestion process.

Here’s an example of how we configure dynamic partitioning for a 33-day retention period:

    PROPERTIES (
        "dynamic_partition.enable" = "true",
        "dynamic_partition.time_unit" = "DAY",
        "dynamic_partition.start" = "-33",
        "dynamic_partition.end" = "3",
        "dynamic_partition.prefix" = "p",
        "dynamic_partition.buckets" = "32",
        "dynamic_partition.history_partition_num" = "30"
    );
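
To reason about which partitions this configuration should leave in place on a given day, the sketch below lists the daily partitions between the start and end offsets. It assumes partition names are the prefix followed by the date as yyyyMMdd, which is our understanding of the naming for the DAY time unit; the function name `dynamic_partition_window` is hypothetical.

```python
from datetime import date, timedelta


def dynamic_partition_window(today, start=-33, end=3, prefix="p"):
    """List the daily partition names from today+start to today+end."""
    return [
        prefix + (today + timedelta(days=offset)).strftime("%Y%m%d")
        for offset in range(start, end + 1)
    ]


names = dynamic_partition_window(date(2024, 3, 1))
print(names[0], names[-1], len(names))
```

For 1 March 2024 this yields a window from p20240128 to p20240304, 37 partitions in total: 33 days of history, the current day, and 3 days created ahead of time.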

To verify that dynamic partitioning is working correctly and to monitor the state of your partitions, you can use the following SQL command:

    SHOW PARTITIONS FROM iris.cluster_worker_metrics_raw;

This command provides a summary of all partitions for the specified table (in this case, iris.cluster_worker_metrics_raw). The output includes valuable information such as:

  • The total number of partitions

  • The date range covered by each partition

  • Row count per partition

  • Size of each partition

While dynamic partitioning keeps the most recent 33 days of data readily available in StarRocks for fast querying, we’ve implemented a strategy to retain older data for long-term analysis.

We use a daily cron job to back up data older than 30 days to Amazon S3. This ensures we maintain historical data without impacting the performance of our primary StarRocks cluster.

Here’s an example of the backup query we use:

    INSERT INTO
        FILES(
            "path" = "{s3backUpPath}/{table_name}/",
            "format" = "parquet",
            "compression" = "zstd",
            "partition_by" = "report_date",
            "aws.s3.region" = "ap-southeast-1"
        )
        SELECT * FROM iris.{table_name} WHERE report_date between '{start_date}' and '{end_date}';
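
The placeholders in this query suggest it is rendered by a script before being submitted. The sketch below shows one way a daily job might fill them in. The choice to export the single day that is 31 days before the run date, the function name `render_backup_sql`, and the bucket path are all assumptions made for illustration.

```python
from datetime import date, timedelta

BACKUP_SQL = """
INSERT INTO
    FILES(
        "path" = "{s3backUpPath}/{table_name}/",
        "format" = "parquet",
        "compression" = "zstd",
        "partition_by" = "report_date",
        "aws.s3.region" = "ap-southeast-1"
    )
    SELECT * FROM iris.{table_name} WHERE report_date between '{start_date}' and '{end_date}';
"""


def render_backup_sql(table_name, s3_path, run_date, age_days=31):
    """Fill the template for the one day that is `age_days` before run_date."""
    target = run_date - timedelta(days=age_days)
    return BACKUP_SQL.format(
        s3backUpPath=s3_path,
        table_name=table_name,
        start_date=target.isoformat(),
        end_date=target.isoformat(),
    )


# Hypothetical bucket path; the table name comes from the schemas above.
sql = render_backup_sql(
    "cluster_worker_metrics_raw",
    "s3://example-bucket/iris-backup",
    date(2024, 3, 1),
)
print(sql)
```

Exporting one day per run keeps each backup small and safe to re-run; a wider window only requires different start_date and end_date values.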

After backing up to S3, we map this data to a data lake table, enabling us to query historical data beyond the 33-day window in StarRocks when needed, without affecting the performance of our primary observability system.

    df_snapshot = spark.read.parquet(f"{s3backUpPath}/{table_name}")

    # do the transformation if needed here

    (
        df_snapshot.write.format("delta")
        .mode("overwrite")
        .option("partitionOverwriteMode", "dynamic")
        .option("mergeSchema", "true")
        .partitionBy("report_date")
        .save(f"{s3SinkPath}/{table_name}")
    )

    %sql
    CREATE TABLE IF NOT EXISTS iris.{table_name}
    USING DELTA
    LOCATION '{s3SinkPath}/{table_name}'

Data replication

StarRocks replicates data across multiple nodes, which is crucial for both fault tolerance and query performance. Replication allows queries to execute in parallel across replicas, speeding up data retrieval. This is particularly beneficial for our front-end queries, where low latency is crucial for user experience. The approach aligns with best practices seen in other distributed database systems such as Cassandra, DynamoDB, and MySQL’s master-slave architecture.

    PROPERTIES (
        "replication_num" = "3"
    );

Unified web application

We’ve developed a comprehensive web application for Iris, consisting of both backend and frontend components. This unified interface offers users a seamless experience for monitoring and analysing Spark jobs.

Backend

  • Built using Golang, our backend service connects directly to the StarRocks database.

  • It queries data from both raw tables and materialised views, leveraging the optimised data structures we’ve set up in StarRocks.

  • The backend handles authentication and authorisation, ensuring that users have appropriate access to job data.

Frontend

The frontend offers several key screens to show:

  • List of job runs

  • Job status

  • Job metadata

  • Driver log

  • Spark UI

  • Statistics on resource usage and cost

Here is an example of the job overview screen, which displays key summary information: total number of runs, job owner details, performance trends, and cost analysis charts. This comprehensive view provides users with a quick snapshot of their Spark job’s overall health and resource utilisation.

Figure 2. Example of job overview screen

Advanced analytics and insights

One of the key features we’ve implemented in Iris is the ability to perform analytics on historical job runs to capture trends. This feature leverages the power of StarRocks and our data model to provide users with valuable insights and recommendations. Here’s how we’ve implemented it:

Historical run analysis

We’ve created a materialised view that aggregates job run data over the last 30 days. This view includes metrics such as the count of runs and p95 values for various resource utilisation measures.

    CREATE MATERIALIZED VIEW job_run_summaries_001
    REFRESH ASYNC EVERY(INTERVAL 1 DAY)
    AS
    select platform,
        job_id,
        count(distinct run_id)                                as count_run,
        ceil(percentile_approx(total_instances, 0.95))        as p95_total_instances,
        ceil(percentile_approx(worker_instances, 0.95))       as p95_worker_instances,
        percentile_approx(job_hour, 0.95)                     as p95_job_hour,
        percentile_approx(machine_hour, 0.95)                 as p95_machine_hour,
        percentile_approx(cpu_hour, 0.95)                     as p95_cpu_hour,
        percentile_approx(worker_gc_hour, 0.95)               as p95_worker_gc_hour,
        ceil(percentile_approx(driver_cpus, 0.95))            as p95_driver_cpus,
        ceil(percentile_approx(worker_cpus, 0.95))            as p95_worker_cpus,
        ceil(percentile_approx(driver_memory_gb, 0.95))       as p95_driver_memory_gb,
        ceil(percentile_approx(worker_memory_gb, 0.95))       as p95_worker_memory_gb,
        percentile_approx(driver_cpu_utilization, 0.95)       as p95_driver_cpu_utilization,
        percentile_approx(worker_cpu_utilization, 0.95)       as p95_worker_cpu_utilization,
        percentile_approx(driver_memory_utilization, 0.95)    as p95_driver_memory_utilization,
        percentile_approx(worker_memory_utilization, 0.95)    as p95_worker_memory_utilization,
        percentile_approx(total_gb_read, 0.95)                as p95_gb_read,
        percentile_approx(total_gb_written, 0.95)             as p95_gb_written,
        percentile_approx(total_memory_gb_spilled, 0.95)      as p95_memory_gb_spilled,
        percentile_approx(disk_spilled_rate, 0.95)            as p95_disk_spilled_rate
    from iris.job_runs
    where report_date >= current_date - interval 30 day
    group by platform, job_id;

Using this aggregated data, we can identify trends in job performance and resource usage over time, such as increasing run times or spikes in resource consumption.
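
For readers less familiar with percentile aggregates, here is a small sketch of the same grouping in plain Python. It uses an exact nearest-rank percentile as a stand-in for percentile_approx, which is approximate in StarRocks and may return slightly different values. The function names and sample runs are hypothetical.

```python
from collections import defaultdict


def nearest_rank_percentile(values, percent):
    """Exact nearest-rank percentile; `percent` is an integer from 1 to 100."""
    ordered = sorted(values)
    rank = (percent * len(ordered) + 99) // 100  # ceil(percent / 100 * n)
    return ordered[rank - 1]


def summarise_jobs(job_runs):
    """Group runs by (platform, job_id) and compute two summary columns."""
    grouped = defaultdict(list)
    for run in job_runs:
        grouped[(run["platform"], run["job_id"])].append(run)
    return {
        key: {
            "count_run": len({r["run_id"] for r in runs}),
            "p95_job_hour": nearest_rank_percentile(
                [r["job_hour"] for r in runs], 95
            ),
        }
        for key, runs in grouped.items()
    }


# Hypothetical history: 20 runs of one job, taking 1.0 to 20.0 hours.
job_runs = [
    {"platform": "platform-x", "job_id": "job-a", "run_id": f"run-{i}",
     "job_hour": float(i)}
    for i in range(1, 21)
]
summaries = summarise_jobs(job_runs)
print(summaries)
```

Using p95 rather than the maximum keeps a single unusually slow run from dominating the summary, while still reflecting the upper end of normal behaviour.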

Recommendation API

Based on trend analysis insights, we’ve built a recommendation API that suggests optimisations, such as adjusting resource allocations, identifying potential bottlenecks, or proposing schedule changes to optimise cost and performance.

Frontend integration

The recommendations generated by our API are integrated into the Iris front end. Users can view these recommendations directly in the job overview or details screens, offering actionable insights to improve Spark jobs.

Here is an example: in a job with consistently low resource utilisation (less than 25% over time), our system suggests reducing the worker size by half to optimise costs.

Figure 3. Example of job with low resource utilisation.
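
The rule in the example above can be sketched as a simple threshold check. The inputs (p95 CPU and memory utilisation expressed as fractions between 0 and 1), the function name `recommend_worker_size`, and the shape of the returned dictionary are assumptions; the actual recommendation API is not described in this post.

```python
LOW_UTILISATION_THRESHOLD = 0.25  # "less than 25%" in the example above


def recommend_worker_size(current_worker_cpus, p95_cpu_utilization,
                          p95_memory_utilization):
    """Suggest halving the worker size when both CPU and memory utilisation
    stay below the threshold; otherwise keep the current size."""
    highest = max(p95_cpu_utilization, p95_memory_utilization)
    if highest < LOW_UTILISATION_THRESHOLD:
        return {
            "action": "reduce_worker_size",
            "suggested_worker_cpus": max(1, current_worker_cpus // 2),
        }
    return {"action": "keep", "suggested_worker_cpus": current_worker_cpus}


print(recommend_worker_size(16, 0.12, 0.20))
print(recommend_worker_size(16, 0.12, 0.60))
```

Requiring both figures to be low is a conservative choice in this sketch: a job that is light on CPU but heavy on memory would not be downsized.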

Slackbot integration

To make these insights more accessible, we’ve integrated the recommendation system with a SpellVault app (SpellVault is a GenAI platform at Grab). Users can interact with the recommendation system directly from Slack and stay informed about job performance and potential optimisations without constantly checking the Iris web interface.

Figure 4. Example of integration with SpellVault.

Migration and adoption

Migration strategy

  • Fully migrating real-time CPU/Memory charts from Grafana to the new Iris UI

  • Will deprecate the Grafana dashboard after migration

  • Retaining Superset for platform metrics and specific BI needs

User onboarding and feedback

Iris is deployed within the One DE app, centralising access to data engineering tools. The feedback button in the UI allows users to submit comments easily.

Lessons learned and future roadmap

Lessons learned

  • Unified data store: Using StarRocks as a single source for both real-time and historical data has significantly improved query performance and streamlined our architecture.

  • Materialised views: Leveraging StarRocks’ materialised views for pre-aggregations has significantly enhanced query response times, especially for common UI operations.

  • Dynamic partitioning: Implementing dynamic partitioning has helped in maintaining optimal performance as data volumes grow, automatically managing data retention.

  • Direct Kafka ingestion: StarRocks’ ability to ingest data directly from Kafka has streamlined our data pipeline, reducing latency and complexity.

  • Flexible data model: Compared to the previous time-series-focused InfluxDB, the StarRocks relational model enables more complex queries and simplifies metadata handling.

Future roadmap

  1. Enhanced recommendations: Expand the recommendation system to include more in-depth suggestions, such as identifying potential bottlenecks and recommending Spark configurations to add or remove from jobs. These recommendations, aimed at improving runtime and cost performance, will leverage the detailed Spark metrics and event data we’re already collecting.

  2. Advanced analytics: Leverage the comprehensive Spark metrics data to provide deeper insights into job performance and resource utilisation.

  3. Integration expansion: Enhance Iris integration with other internal tools and platforms to increase adoption and ensure a seamless experience across the data engineering ecosystem.

  4. Machine learning integration: Explore the possibility of incorporating machine learning models for predictive analytics on Spark performance.

  5. Scalability improvements: Continue to optimise the system to handle increasing data volumes and user loads as adoption grows.

  6. User experience enhancements: Continuously improve the Iris application’s UI/UX based on user feedback to make it more intuitive and informative.

Conclusion

The journey of building the Iris web application, powered by StarRocks, has been transformative for our Spark observability capabilities at Grab. This evolution was driven by the need for a user-friendly, centralised platform for Spark monitoring and logging.

By leveraging StarRocks’ capabilities, we’ve created a unified interface that seamlessly handles both real-time and historical data. This has allowed us to consolidate previously fragmented tools like Grafana and Superset into a single, cohesive platform. The ability to capture and analyse job metadata and metrics in one place has been crucial, enabling us to implement effective showback/chargeback mechanisms at the job level.

Looking ahead, we’re excited about the potential for more advanced analytics and machine learning-driven insights. The lessons learned from this project will guide our approach to building robust, scalable, and user-friendly data tools at Grab.

Join us

Grab is a leading superapp in Southeast Asia, operating across the deliveries, mobility and digital financial services sectors. Serving over 800 cities in eight Southeast Asian countries, Grab enables millions of people every day to order food or groceries, send packages, hail a ride or taxi, pay for online purchases or access services such as lending and insurance, all through a single app. Grab was founded in 2012 with the mission to drive Southeast Asia forward by creating economic empowerment for everyone. Grab strives to serve a triple bottom line – we aim to simultaneously deliver financial performance for our shareholders and have a positive social impact, which includes economic empowerment for millions of people in the region, while mitigating our environmental footprint.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Zen and the Art of Microcode Hacking (Google Bug Hunters)

Post Syndicated from corbet original https://lwn.net/Articles/1013136/

The Google Bug Hunters blog has a detailed description of how a
vulnerability in AMD’s microcode-patching functionality was discovered and
exploited; the authors have also released a set of tools to assist with this
kind of research in the future.

Secure hash functions are designed in such a way that there is no
secret key, and there is no way to use knowledge of the
intermediate state in order to generate a collision. However, CMAC
was not designed as a hash function, and therefore it is a weak
hash function against an adversary who has the key. Remember that
every AMD Zen CPU has to have the same AES-CMAC key in order to
successfully calculate the hash of the AMD public key and the
microcode patch contents. Therefore, the key only needs to be
revealed from a single CPU in order to compromise all other CPUs
using the same key. This opens up the potential for hardware
attacks (e.g., reading the key from ROM with a scanning electron
microscope), side-channel attacks (e.g., using Correlation Power
Analysis to leak the key during validation), or other software or
hardware attacks that can somehow reveal the key. In summary, it is
a safe assumption that such a key will not remain secret forever.
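
The weakness described in the quote can be illustrated with a toy model. The following is a minimal Python sketch, not AMD's actual scheme: it uses a plain CBC-MAC rather than full CMAC, and a hypothetical four-round Feistel cipher built from SHA-256 stands in for AES so that the example needs only the standard library. All function names and values are illustrative assumptions. It shows that anyone who knows the key can run the block cipher backwards and pick a final block that forces any desired tag.

```python
import hashlib

BLOCK = 16  # block size in bytes, the same as AES


def _xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))


def _round(key: bytes, i: int, half: bytes) -> bytes:
    # Hypothetical round function for the toy cipher (this is not AES).
    return hashlib.sha256(key + bytes([i]) + half).digest()[: BLOCK // 2]


def encrypt_block(key: bytes, block: bytes) -> bytes:
    # Four Feistel rounds: an invertible keyed permutation on 16 bytes.
    left, right = block[: BLOCK // 2], block[BLOCK // 2 :]
    for i in range(4):
        left, right = right, _xor(left, _round(key, i, right))
    return left + right


def decrypt_block(key: bytes, block: bytes) -> bytes:
    # Undo the rounds in reverse order.
    left, right = block[: BLOCK // 2], block[BLOCK // 2 :]
    for i in reversed(range(4)):
        left, right = _xor(right, _round(key, i, left)), left
    return left + right


def cbc_mac(key: bytes, message: bytes) -> bytes:
    # Simplified CBC-MAC over whole blocks (no padding, no CMAC subkeys).
    assert len(message) % BLOCK == 0
    state = bytes(BLOCK)
    for off in range(0, len(message), BLOCK):
        state = encrypt_block(key, _xor(state, message[off : off + BLOCK]))
    return state


def forge(key: bytes, prefix: bytes, target_tag: bytes) -> bytes:
    # With the key known, solve for the last block that yields target_tag.
    state = cbc_mac(key, prefix)
    last = _xor(decrypt_block(key, target_tag), state)
    return prefix + last


key = b"shared-toy-key!!"
genuine = b"genuine patch contents, 32 bytes"
tag = cbc_mac(key, genuine)
forged = forge(key, b"attacker payload", tag)
print(forged != genuine, cbc_mac(key, forged) == tag)
```

Real CMAC additionally XORs a subkey into the final block, but that subkey is derived from the same key, so a key holder can account for it in the same way. A secure hash function has no key and no inverse step of this kind.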

America's Honey and Sting, or On the Conversation Between Trump and Zelensky

Post Syndicated from Iskren Ivanov original https://www.toest.bg/medut-i-zhiloto-na-sasht-ili-za-razgovora-mezhdu-trump-i-zelenski/

Anyone who has ever watched an argument or debate involving Donald Trump can confirm that on such occasions the current US president does everything he can to crush his interlocutor, no matter who is standing in front of him. This scandalous showman style of Trump's is no accident, because it is precisely the strategy that returned the billionaire to the White House. Unlike the European tradition, which has drawn clear red lines for politicians' rhetoric, things are not quite the same in the US.

On the other hand, reducing what happened a few days ago in the White House to an ordinary political spat in which a hero and a villain collided is a naive strategy that will not tell us what actually took place at the meeting between Donald Trump and Volodymyr Zelensky. Many idealists and ideologues see things this way because they fall victim to so-called wishful thinking, in which an individual's attitudes and ideas are formed on the basis of their desires rather than objective reality. The answer, meanwhile, should be sought not at the level of geopolitics, nor even at the level of leadership, but at the level of perceptions, or even the lack of them.

The role of perceptions and attitudes in foreign policy

Every expert in the field of international relations should have read, at least once in their life, the book Perception and Misperception in International Politics, the work of one of the greatest American scholars, Robert Jervis.

After the Cuban Missile Crisis of 1962, the young Jervis's work quickly caught the attention of the CIA, and he began working on several scholarly studies devoted to nuclear deterrence, Soviet strategic culture, and the way politicians make decisions. In fact, if you open this book, you will find the answer not only to the question of why Trump and Zelensky quarrelled, but also to why the war in Ukraine came about at all. And no, these are not the routine scribblings of people trying to convince us how good or bad one of the two teams is; it is an enormous, empirically argued work in which Jervis explains how we can predict or analyse leaders' moves.

It is important to say that the theory of perceptions has nothing to do with political psychology, which seeks to show how politicians think. In contrast, the theory of perceptions in politics directly answers the questions of how and why politicians make decisions. According to it, three main factors shape leaders' perceptions:

  • their beliefs;
  • the value system in which they were raised;
  • their ability to read their environment.

On the basis of these variables, two types of leaders are distinguished: rational politicians, who have clear beliefs and a developed value system and perceive their environment pragmatically; and others, without clear beliefs and values, which prevents them from being rational. A very important caveat here is that a rational leader can become completely irrational if he loses a realistic picture of what is happening around him. The main cause of this is fear, which not only Jervis but many other researchers identify as a source of conflict and division. According to the theory of perceptions, this fear can be overcome, but the cure in such cases is worse than the disease: the leader in question has to put himself in his opponent's place. And practice shows that this almost never happens, which is why conflict is inevitable.

In other words, weapons, sanctions, politics and ideology are only one side of the coin. Everything can change if leaders' attitudes are reversed and they begin to perceive the other not as an enemy but as a partner. This reversal of attitudes seems to bring us closer to what has been happening over the past few months, since Donald Trump returned to the White House. Evidence for the claim that everything can change is the failure of many European politicians and analysts to foresee what would happen after Trump won the election, and that he might go to extremes that would leave a lasting mark on transatlantic solidarity.

Some told themselves that "it can't get any worse", while others assured us that "it will get even better". In the end, these are not pragmatic assessments but reference points which, though accurate in some cases, fail to prepare society adequately for the approaching storm.

That is why, above all, such an analysis requires us to distance ourselves from what is happening in global politics today. It is illogical, chaotic, hopeless… Actually, not at all; it only looks that way.

Why? We can understand only if we move our analysis to the individual level.

What is Donald Trump afraid of?

The first and most important question is what could frighten the US president to such an extent that he directs all his deliberate criticism at Ukraine and its leader. Conclusions such as "Trump is a marauder and racketeer, and Zelensky is our hero" or "Trump wants lasting peace, and Zelensky wants endless war" will not help here. Such claims, though partially substantiated, are political and ideological rather than empirical in character.

The answer is very simple: Trump fears for three things and would give anything not to lose them: his power, his security and his money. The first and the last are in no danger at this stage, but the effect of a threat to them was visible after the court cases against Trump, which turned him into a martyr in the eyes of his voters.

For the first time since the end of the Cold War, however, US national security is under immediate threat. And not just from a transnational terrorist network, but from two nuclear powers, Russia and China: an alliance capable of ending American global leadership.

This threat cannot be neutralised by sending troops abroad, because that could lead to nuclear war. There would be no winners in it, and then Trump would lose both his power and his money. A simple example can be given here, Zelensky's words:

Mr President, we have felt this for years. But if we fall, you will feel it too.

To which Trump reacted with an angry:

You don't know that! Don't tell us what we will feel and what we won't!

From here on the conversation veers sharply in a direction that, as the British prime minister put it, "nobody wants to see". At the same time, there has hardly been a politician in American history with a bigger ego than Trump. This fact, combined with the fear of nuclear war and the rise of China, is a perfect illustration of the political metaphor of the honey and the sting, whose author is Dragan Tsankov. In this case Zelensky felt both the honey and the sting of the US.

What is Zelensky afraid of?

The most simplistic claim we could accept is that Zelensky fears for his life and his power. In fact, if that were so, he would have left Ukraine long ago and would be living as one of the richest politicians in Europe. The same goes for the arguments that make Zelensky out to be an absolute hero or an absolute villain.

To believe that a politician is the good guy or the bad guy is very naive, and such attitudes have more than once played a bad joke on politicians in our most recent history.

That is why it is hardest to define what Zelensky fears so much that he was prepared to argue with the president and vice president of the US. Rumours of similar arguments between the Ukrainian leader and Biden surfaced in 2022 as well, when the mild-mannered Democrat reportedly lost his temper in a phone call and urged Zelensky to be a little more grateful for the weapons he was receiving. It seems that Biden feared an escalation of the conflict just as much as Trump does.

To answer the question of what frightens Zelensky, it is worth going back to see what the strongest promise was with which he won the election in Ukraine by a crushing margin over Petro Poroshenko in 2019. Back then Zelensky promised Ukrainians two things: that he would crush the oligarchy in the country and that he would drastically change its foreign policy course.

Part of the Ukrainian leader's fears is being let down by the US, because that would mean he has betrayed his voters' trust, since the promised change in foreign policy course cannot be delivered without the US.

Zelensky perceives America as the great hero who will bang a fist on the table and bring Ukraine into NATO, crushing Russia once and for all. These impressions grew stronger over time under the influence of many European politicians who saw the war as an opportunity to revise their ties with Moscow. But neither Zelensky nor Europe took into account that Great Powers have only eternal interests, not eternal friends. Trump's cold pragmatism makes Zelensky's greatest fear come true: that America will abandon Ukraine as it abandoned Afghanistan, Vietnam and large parts of Africa. It is a fear that is increasingly being felt in Europe.

What are European politicians afraid of?

Although Europe is not a direct participant in the verbal exchange of fire between Trump and Zelensky, it will in all likelihood be directly affected by it. The most confused of all, of course, is Germany, which, burdened by dark memories of the Second World War, has for years feared that it would have to fight again. Now it will have to do exactly that if the Community starts building its own army.

Macron, who at first glance appears confident because of France's nuclear weapons, is well aware that the nuclear triads of Paris and London were dismantled after the collapse of the USSR, which will not allow them to extend a nuclear umbrella over the Old Continent if needed, and that achieving nuclear parity with Russia is a long and complex process that would bury the quiet life of the French for good.

The rest of Western Europe fears that the old security architecture is gone, and that Spaniards, Portuguese and Scandinavians will have to draw inspiration from the cultural heritage of knights and Vikings to rebuild their armies from scratch.

The most frightened, however, are Central and Eastern Europe, since many states in the region believe they are next if Ukraine falls.

And Britain is afraid that the Great War will begin again and London will have to defend… the European Union, which it left. For the British, after all, the First World War is a far more painful memory than the fight against Hitler, because that is when the empire lost almost all its colonies.

And let us not forget that for some European politicians the crisis of the global liberal order is a shock, while for others it is an inevitable reality. But the greatest fear of European leaders is that the same thing that befell Zelensky may befall them in the Oval Office. And for no other reason than that one small sign of disrespect towards Trump could cost the country in question everything in its relations with the US.

What are Putin and Xi Jinping afraid of?

In fact, this category could include not only Russia and China but also many US allies that want to avoid a global military conflict. Russia's great fears are actually tied to the crossing of the red lines beyond which it would use nuclear weapons against Ukraine: if the Russian president's power is threatened or if Russia's very existence is put at stake. For Putin's administration, its own existence is tantamount to the existence of the Russian state. The latest changes to Russia's nuclear doctrine prove it, and the attitudes are clear: if Russia sees that it is losing the war, it will use every means to prevent that.

Some would object that conquering territories that are depopulated and whose infrastructure is completely destroyed is no victory at all, but in the Russian mind it is exactly the opposite: it is enough that these territories are Russian, and it does not matter that citizens may be exposed to radiation. The start of the war in Ukraine, which many doubted would happen, and the Chernobyl accident of 1986 are the historical proof of these attitudes.

And this is exactly where the solution to the equation comes in: it turns out that Trump and Putin are afraid of the same thing. That is why their moves are so alike. That is why they want the same thing: an end to the war at any price.

Things are similar with China's fears. Xi Jinping evidently wants to go down in Chinese history as the strongest leader since Mao Zedong. That can happen in only two ways: through unification with Taiwan, or if China manages to overtake the US. A war between the great powers would destroy any prospects for both. His fears are shared with those of Putin and Trump, which is why he is inclined to strike a deal with them, but he is drifting ever further from the Europeans. Kissinger's China card was shamefully squandered by the European Union, which worsened its relations with China because of the US, and this is now beginning to turn against it, leaving it alone against the US, Russia and China.

Taking stock

The heated conversation between Trump and Zelensky on 28 February effectively proves that the fears of the US and Europe are now too different to unite them. Washington, which for decades was criticised by Europeans for meddling in their affairs and getting in their way, has finally decided to reduce its commitment to Europe. But this has created far more problems than it has solved. During the Cold War, Richard Nixon threatened the Europeans several times that the US would not go to war with the USSR over them, while Jimmy Carter believed to the last that peaceful coexistence with Moscow was possible and that the states of the Soviet bloc should remain in its sphere of influence so that the superpowers could live in peace.

Paradoxically, back then the fears of Washington, Moscow and Brussels were shared: that at some point the balance of power might be upset in a way that would lead to a Third World War. That is why authors such as Robert Jervis argue that the bipolar world was not a security dilemma at all, but a clash of two systems, socialism and capitalism.

Today, however, leaders' fears diverge sharply. A large part of Europe's elites still do not realise the real danger of nuclear war, while Russia deliberately uses nuclear blackmail to exploit the lack of rational perceptions of what could happen. Some would say that the blackmail works on Trump but not on Europe. It is hard to deny, however, that the crisis of political representation is so pervasive and deep that it drives voters in democratic states to vote for leaders whom they later curse, or to admire authoritarian strongmen whom they see as the new masters of the world.

In this sense, the conversation between Trump and Zelensky is the finished product of the crisis of liberal democracy, in which one of the richest US presidents collides with a leader who, with the help of the pro-American arms industry, is trying to destroy the pro-Russian oligarchy in Ukraine and to defend his invaded country.

The middle class is not part of this equation, and the politicians of the past who strove to preserve it are no longer around to shape a consensus and dialogue acceptable to all.

The question remains whether we really should test the readiness of the nuclear powers to fight, and to what extent. The rational answer is that this must not happen, because today it is not Kennedy and Khrushchev who sit in Washington and Moscow, but Trump and Putin.

AWS completes the annual Dubai Electronic Security Centre certification audit to operate as a Tier 1 cloud service provider in the Emirate of Dubai

Post Syndicated from Vishal Pabari original https://aws.amazon.com/blogs/security/aws-completes-the-annual-dubai-electronic-security-centre-certification-audit-to-operate-as-a-tier-1-cloud-service-provider-in-the-emirate-of-dubai-2/

We’re excited to announce that Amazon Web Services (AWS) has completed the annual Dubai Electronic Security Centre (DESC) certification audit to operate as a Tier 1 Cloud Service Provider (CSP) for the AWS Middle East (UAE) Region.

This alignment with DESC requirements demonstrates our continued commitment to adhere to the heightened expectations for CSPs. Government customers of AWS can run their applications in AWS Cloud-certified Regions with confidence.

The independent third-party auditor (BSI) issued the Certificate of Compliance to AWS on behalf of DESC on January 23, 2025. The Certificate of Compliance that illustrates the compliance status of AWS is available through AWS Artifact. AWS Artifact is a self-service portal for on-demand access to AWS compliance reports. Sign in to AWS Artifact in the AWS Management Console, or learn more at Getting Started with AWS Artifact.

The certification includes 11 additional services in scope, for a total of 98 services. This is a 13% year-on-year increase in the number of services in the Middle East (UAE) Region that are in scope of the DESC CSP certification. For up-to-date information, including when additional services are added, see the AWS Services in Scope by Compliance Program webpage and choose DESC CSP.

AWS strives to continuously bring services into the scope of its compliance programs to help you adhere to your architectural and regulatory needs. If you have questions or feedback about DESC compliance, reach out to your AWS account team.

To learn more about our compliance and security programs, see AWS Compliance Programs. As always, we value your feedback and questions; reach out to the AWS Compliance team through the Contact Us page.

If you have feedback about this post, submit comments in the Comments section below.
 

Vishal Pabari

Vishal is a Security Assurance Program Manager at AWS, based in London, UK. Vishal is responsible for third-party and customer audits, attestations, certifications, and assessments across EMEA. Vishal previously worked in risk and control, and technology in the financial services industry.

[$] Two new graph-based functional programming languages

Post Syndicated from daroc original https://lwn.net/Articles/1011803/

Functional programming languages have a long association with graphs. In the
1990s, it was even thought that parallel graph-reduction
architectures could make functional programming languages much faster than their
imperative counterparts. Alas, that prediction mostly failed to materialize.
Even though graphs are still used as a theoretical formalism in order to define
and optimize functional languages (such as Haskell’s spineless tagless
graph-machine), they are still mostly compiled down to the same old
non-parallel assembly code that every other language uses. Now, two
projects, Bend and Vine, have sprung up attempting to change that, and prove that
parallel graph reduction can be a useful technique for real programs.

Cross-account data collaboration with Amazon DataZone and AWS analytical tools

Post Syndicated from Arun Pradeep Selvaraj original https://aws.amazon.com/blogs/big-data/cross-account-data-collaboration-with-amazon-datazone-and-aws-analytical-tools/

Data sharing has become a crucial aspect of driving innovation, contributing to growth, and fostering collaboration across industries. According to this Gartner study, organizations promoting data sharing outperform their peers on most business value metrics. A straightforward data access and sharing mechanism is crucial for enabling effective data sharing across an organization. Organizations that try to share data products across AWS accounts face challenges such as the complexity of managing cross-account permissions and the difficulty of discovering the right data across accounts. Amazon DataZone is a fully managed data management service that customers can use to catalog, discover, share, and govern data stored across Amazon Web Services (AWS).

In this post, we will cover how you can use Amazon DataZone to facilitate data collaboration between AWS accounts.

Solution overview

This solution provides a streamlined way to enable cross-account data collaboration using Amazon DataZone domain association while maintaining security and governance. This post describes the process of using the business data catalog resource of Amazon DataZone to publish data assets so they’re discoverable by other accounts. After they’ve been published, you can query the published assets from another AWS account using analytical tools such as Amazon Athena and the Amazon Redshift query editor, as shown in the following figure.

In this solution (as shown in the preceding figure), the AWS account that contains the data assets is referred to as the producer account. The AWS account that needs to access or use the data from the producer account is referred to as the consumer account. The Amazon DataZone domain is created and managed within the producer account and then the consumer account is associated with that domain.

As part of Amazon DataZone domain association, Amazon DataZone uses AWS Resource Access Manager (AWS RAM) to share the resource. When the producer and consumer AWS accounts are in the same organization within AWS Organizations, the domain association happens automatically. If the producer and consumer AWS accounts are in different organizations, AWS RAM sends an invitation to the consumer AWS account to accept or reject the resource grant.
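
For the cross-organization case, the consumer account can check which AWS RAM invitations are waiting for it. The following is a minimal Python sketch under stated assumptions: the helper name `pending_invitations` is hypothetical, the client passed in is expected to be a boto3 AWS RAM client (or anything exposing the same `get_resource_share_invitations` call), and pagination via `nextToken` is ignored for brevity. It only lists invitations and does not accept them.

```python
from typing import Any, Dict, List


def pending_invitations(ram_client: Any, producer_account_id: str) -> List[Dict[str, Any]]:
    """Return pending AWS RAM invitations sent by the producer account.

    Hypothetical helper: ram_client is assumed to behave like
    boto3.client("ram"). Only the first page of results is inspected.
    """
    response = ram_client.get_resource_share_invitations()
    return [
        invitation
        for invitation in response.get("resourceShareInvitations", [])
        if invitation.get("status") == "PENDING"
        and invitation.get("senderAccountId") == producer_account_id
    ]
```

In the consumer account this might be called as `pending_invitations(boto3.client("ram"), "<producer account ID>")`. The walkthrough in this post accepts the association from the Amazon DataZone console instead, so treat the sketch as a way to inspect state rather than as the documented procedure.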

This solution involves three Amazon DataZone user personas:

  • Data administrators: Account owners in both producer and consumer AWS accounts. The data administrators are responsible for creating Amazon DataZone domains, configuring domain associations, and accepting domain associations within the Amazon DataZone domain.
  • Data publishers: Users in producer AWS accounts. The data publishers are responsible for creating Amazon DataZone publish projects and environments, producing and publishing data assets, and accepting subscription requests.
  • Data subscribers: Users in consumer AWS accounts. The data subscribers are responsible for creating Amazon DataZone subscribe projects and environments, searching for and subscribing to data assets, and querying the data and deriving insights.

Prerequisites

To follow along with the instructions, you will need:

  • Two AWS accounts, one serving as the producer and the other as the consumer. Create new AWS accounts if necessary.
  • An Amazon Redshift provisioned cluster or Amazon Redshift Serverless workgroup in the producer and consumer AWS accounts provisioned by a data administrator.
  • A secret in AWS Secrets Manager storing the master user credentials for the Amazon Redshift cluster or workgroup in the producer and consumer AWS accounts.
    • The data administrators are responsible for creating secrets.
    • The data publishers and data subscribers can obtain the Amazon Resource Name (ARN) of the secrets from the data administrators during the environment or environment profile creation steps.

Amazon DataZone uses Amazon Redshift datashares to share data across clusters and accounts. There are specific requirements and limitations for using Amazon Redshift datashares:

  • For cross-account data sharing, both the producer and consumer clusters must be encrypted. See the Cluster encryption section of the datashare considerations for more information about the encryption process.
  • Data sharing is supported only for provisioned RA3 cluster types (ra3.16xlarge, ra3.4xlarge, and ra3.xlplus) and Amazon Redshift Serverless.
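
The two requirements listed above can be checked before starting the walkthrough. The following is a minimal Python sketch for provisioned clusters only, assuming the input is one entry from the `Clusters` list returned by the Amazon Redshift `describe_clusters` API (which carries `NodeType` and `Encrypted` fields). The function name and the wording of the messages are hypothetical.

```python
from typing import Any, Dict, List


def datashare_blockers(cluster: Dict[str, Any]) -> List[str]:
    """List the reasons a provisioned cluster fails the datashare requirements.

    Hypothetical helper: `cluster` is assumed to look like one item of the
    "Clusters" list in a describe_clusters response.
    """
    problems = []
    # Only RA3 node types support data sharing on provisioned clusters.
    if not str(cluster.get("NodeType", "")).lower().startswith("ra3."):
        problems.append("node type is not RA3")
    # Cross-account sharing needs encryption on both sides.
    if not cluster.get("Encrypted", False):
        problems.append("cluster is not encrypted")
    return problems
```

An empty list means neither requirement is violated. Running the check against both the producer and the consumer cluster mirrors the fact that encryption is required on both sides.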

Walkthrough

The following are the high-level steps to configure cross-account access. We’ve provided step-by-step instructions in the following sections.

  1. Create an Amazon DataZone domain in the producer account. The data administrator creates an Amazon DataZone domain.
  2. Request Amazon DataZone domain association from the producer account to the consumer account.
  3. Accept the domain association request in the consumer account. The data administrator accepts the domain association.
  4. Add data users to the Amazon DataZone domain.
  5. Create the necessary publish project for AWS Glue and Amazon Redshift in the producer account.
  6. Create AWS Glue and Amazon Redshift environments to publish the data assets in the producer account.
  7. Create and run a data source for AWS Glue and Amazon Redshift to publish assets into the business catalog.
  8. Create subscribe projects for AWS Glue and Amazon Redshift.
  9. Create AWS Glue and Amazon Redshift environment profiles and environments in the subscribe project.
  10. Subscribe to AWS Glue and Amazon Redshift tables. Consume the data using Athena and the Amazon Redshift query editor. This step is performed by the data subscriber.

Create the Amazon DataZone domain in the producer account

Amazon DataZone domains serve as high-level organizational units for assets, users, and projects, facilitating cross-team and cross-account collaboration. This step focuses on creating the Amazon DataZone domain in the producer account.

  1. Sign in to the producer account AWS Management Console for Amazon DataZone using the data administrator credentials.
  2. Create an Amazon DataZone domain titled Demo_cross_account_domain using the instructions at create domains.
  3. On the Create domain screen, select the Quick setup checkbox to automate several configuration steps, saving time and reducing the potential for setup errors. Quick setup enables two default blueprints and creates the default environment profiles for the data lake and data warehouse default blueprints.


Request Amazon DataZone domain association from the producer account to the consumer account

To associate the Amazon DataZone domain with the consumer account, the producer account requests a domain association. This involves providing necessary information about the consumer account and granting appropriate permissions for data access and management.

  1. Sign in to the Amazon DataZone console of the producer account using the data administrator credentials.
  2. Navigate to the domain detail page, and then scroll down and select the Associated Accounts tab.
  3. Enter the IDs of the consumer accounts with which you want to request association. Choose Add another account if you want to add more than one account. When you’re satisfied with the list of account IDs, choose Request association.
    • Use the latest AWS RAM DataZonePortalReadWrite policy when requesting the account association. This policy allows users in the consumer account to execute Amazon DataZone APIs and to use the data portal interface.

Accept an account association request from an Amazon DataZone domain

This step focuses on accepting the account association request from the Amazon DataZone domain in the consumer account. This allows the consumer account to be linked with the Amazon DataZone domain to enable data sharing and collaboration between the producer and consumer accounts.

  1. Sign in to the consumer account and go to the Amazon DataZone console in the same AWS Region as the domain. On the Amazon DataZone home page, choose View requests.
  2. Select the name of the inviting Amazon DataZone domain and choose Review request.
  3. Choose Accept association. You should see the Demo_cross_account_domain state as associated on the Associated domains screen.
  4. Choose the domain for which you want to enable an environment blueprint.
  5. From the Blueprints list, choose the DefaultDataLake blueprint.
  6. On the Permissions and resources page, to enable the DefaultDataLake blueprint, for Glue Manage Access role, specify a new role that grants Amazon DataZone authorization to ingest and manage access to tables in AWS Glue and AWS Lake Formation.
  7. Repeat steps 4 to 6 to enable the DefaultDataWarehouse blueprint, choosing DefaultDataWarehouse instead of DefaultDataLake.

Add data users to the Amazon DataZone domain

To grant the data publisher and data subscriber IAM users access to the Amazon DataZone data portal from the console, add them in the User management section of the Amazon DataZone domain using the following steps. See Manage users in the Amazon DataZone console for additional details.

  1. Sign in to the Amazon DataZone console as a data administrator using the producer account.
  2. Select the Amazon DataZone domain and, in the User management section, choose Add and select Add IAM users.
  3. On the Add users page, choose Current account, enter the ARN of the data publisher user, and choose Add users.
  4. Next, choose Associated account, enter the ARN of the data subscriber user, and choose Add users.

Create the publish project for AWS Glue and Amazon Redshift

This step focuses on creating the publish project for AWS Glue and Amazon Redshift in the producer account. The project will be used to publish data from your data sources to the appropriate AWS services.

  1. Using the producer account, sign in to the Amazon DataZone console as a data publisher.
  2. Select View domains and select the demo_cross_account_domain.
  3. Choose the Open data portal link and sign in to the data portal.
  4. Choose Create New Project and create a project named Glue_Publish_Project under demo_cross_account_domain for publishing AWS Glue data assets.
  5. Create another project named Redshift_Publish_Project for publishing Amazon Redshift data assets, also under the demo_cross_account_domain.
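
The two projects can also be created programmatically rather than through the data portal. The following is a minimal Python sketch, assuming a client that behaves like the boto3 Amazon DataZone client and its `create_project` call; the helper name and the domain identifier used in the usage note are hypothetical, and the post itself only describes the portal route.

```python
from typing import Any, Dict, Iterable


def create_publish_projects(
    datazone_client: Any, domain_id: str, names: Iterable[str]
) -> Dict[str, str]:
    """Create one Amazon DataZone project per name; return {name: project id}.

    Hypothetical helper: datazone_client is assumed to behave like
    boto3.client("datazone").
    """
    created = {}
    for name in names:
        response = datazone_client.create_project(
            domainIdentifier=domain_id,
            name=name,
        )
        created[name] = response["id"]
    return created
```

A call such as `create_publish_projects(client, "<domain ID>", ["Glue_Publish_Project", "Redshift_Publish_Project"])` would mirror steps 4 and 5, with the domain ID taken from the domain detail page.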

Create AWS Glue and Amazon Redshift environments to publish the data assets

In this step, you set up AWS Glue and Amazon Redshift environments in the producer account to share data assets. The required infrastructure, such as the AWS Glue Data Catalog and Redshift cluster for storing data, should already be in place. After setup, this will allow the consumer account to access and use the shared data assets. See Create a new environment for detailed instructions on creating a new environment.

Create the AWS Glue environment and a new AWS Glue table

  1. In the same Amazon DataZone domain demo_cross_account_domain, choose Browse Project, select Glue_Publish_Project, and create Glue_Publish_Environment using the default DataLakeProfile.
  2. Leave the producer_glue_db_name, consumer_glue_db_name and Workgroup_name blank.
  3. Choose Create Environment and wait for the process to complete.
  4. After the environment is created, browse the list of available projects and choose Glue_Publish_Project.
  5. Next, navigate to the Glue_Publish_Environment, and under Analytics tools, choose Amazon Athena to open the Athena query editor.
  6. Choose Open Athena and make sure that Glue_Publish_Environment is selected in the Amazon DataZone environment dropdown at the upper right and that in Data on the left, glue_publish_environment_pub_db is selected as the Database.
  7. Create a new AWS Glue table for publishing to Amazon DataZone. Paste the following create table as select (CTAS) query script in the Query window and run it to create a new table named mkt_sls_table. The script creates a table with sample marketing and sales data.
    CREATE TABLE mkt_sls_table AS
    SELECT 146776932 AS ord_num, 23 AS sales_qty_sld, 23.4 AS wholesale_cost, 45.0 as lst_pr, 43.0 as sell_pr, 2.0 as disnt, 12 as ship_mode,13 as warehouse_id, 23 as item_id, 34 as ctlg_page, 232 as ship_cust_id, 4556 as bill_cust_id
    UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
    UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
    UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
    UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
    UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
    UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
    UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
    UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
    UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
    UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561

  8. Go to the Tables and Views section and verify that the mkt_sls_table table was successfully created.
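The CTAS pattern used above can be tried locally before running it in Athena. The following is a minimal sketch, assuming SQLite (from the Python standard library) as a stand-in engine and a shortened three-column version of the script; the helper name create_and_inspect is hypothetical. It shows that the column names of the new table come from the aliases in the first SELECT, and that each UNION ALL branch contributes one row.

```python
import sqlite3

# Shortened version of the CTAS script: three columns and three rows only.
# SQLite is used here as a local stand-in; it is NOT the Athena engine.
CTAS = """
CREATE TABLE mkt_sls_table AS
SELECT 146776932 AS ord_num, 23 AS sales_qty_sld, 23.4 AS wholesale_cost
UNION ALL SELECT 46776931, 24, 24.4
UNION ALL SELECT 46777394, 42, 43.4
"""


def create_and_inspect(ctas_sql, table):
    """Run a CTAS statement in an in-memory database (hypothetical helper).

    Returns the column names of the new table and its row count.
    """
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(ctas_sql)
        # PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk).
        columns = [row[1] for row in conn.execute(f"PRAGMA table_info({table})")]
        row_count = conn.execute(f"SELECT COUNT(*) FROM {table}").fetchone()[0]
        return columns, row_count
    finally:
        conn.close()


columns, row_count = create_and_inspect(CTAS, "mkt_sls_table")
print(columns, row_count)
```

Data types and SQL dialect details differ between SQLite and Athena, so this only checks the shape of the statement, not its behavior in the data lake.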

Create the Amazon Redshift publish environment and a new Redshift table

  1. Staying in the same Amazon DataZone domain demo_cross_account_domain, choose Browse Project and select the Redshift_Publish_Project. To create an Amazon Redshift publish environment, create Redshift_Publish_Environment using the default data warehouse profile.
  2. To configure environment parameters, enter the name of your Amazon Redshift cluster or workgroup, specify the database name, and enter the AWS Secrets Manager secret ARN for the Redshift cluster or workgroup. Make sure that the secret in Secrets Manager includes the following tags. These tags help Amazon DataZone implement proper access control so that only authorized users within the correct Amazon DataZone project and domain can access the Amazon Redshift resource:
    1. For Amazon Redshift cluster: DataZone.rs.cluster: <cluster_name:database_name>
    2. For Amazon Redshift Serverless workgroup: DataZone.rs.workgroup: <workgroup_name:database_name>
    3. AmazonDataZoneProject: <projectID>
    4. AmazonDataZoneDomain: <domainID>

For more information about creating a Redshift database user secret in Secrets Manager, see Storing database credentials in AWS Secrets Manager.
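As an illustration of the tag set listed above, the following sketch builds the tags as a list of Key/Value dictionaries. The function name datazone_secret_tags and the example IDs are hypothetical; the tag keys and the name:database value format are copied from the list above and should be checked against your own domain.

```python
def datazone_secret_tags(project_id, domain_id, database, cluster=None, workgroup=None):
    """Build the tag list for a Redshift secret (hypothetical helper).

    Exactly one of `cluster` or `workgroup` must be given, matching the
    two alternatives in the list above.
    """
    if (cluster is None) == (workgroup is None):
        raise ValueError("pass exactly one of cluster or workgroup")
    if cluster is not None:
        target = ("DataZone.rs.cluster", f"{cluster}:{database}")
    else:
        target = ("DataZone.rs.workgroup", f"{workgroup}:{database}")
    pairs = [
        target,
        ("AmazonDataZoneProject", project_id),
        ("AmazonDataZoneDomain", domain_id),
    ]
    return [{"Key": key, "Value": value} for key, value in pairs]


# Example values are placeholders, not real identifiers.
for tag in datazone_secret_tags("my-project-id", "my-domain-id", "dev", cluster="my-cluster"):
    print(tag)
```

A list in this Key/Value shape is what the Secrets Manager tagging API accepts, for example through the boto3 secretsmanager client's tag_resource call, so the result could be passed along when tagging the secret programmatically.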

  1. Note that the database user you provide in Secrets Manager must have superuser permissions. Data publishers should work with the data administrator to get the details of the Redshift cluster or workgroup, database name, and secret ARN.
  2. The schema is optional.
  3. Choose Create Environment and wait for the process to complete.
  4. Verify that the environment is created successfully without errors.
  5. Browse the list of available projects and select Redshift_Publish_Project. Navigate to Redshift_Publish_Environment.
  6. Under Analytics tools, choose Amazon Redshift to open the Amazon Redshift query editor.
  7. Select the Redshift cluster that you want to connect to, choose Save, and then choose Create Connection using temporary credentials with your IAM identity.
  8. Create a new Redshift table. You can use the CTAS query to create a new table named rs_sls_tbl. Use the provided CTAS script, which creates a table with sample sales data in the datazone_env_redshift_publish_environment schema.
    CREATE TABLE "datazone_env_redshift_publish_environment"."rs_sls_tbl" AS
    SELECT 146776932 AS ord_num, 23 AS sales_qty_sld, 23.4 AS wholesale_cost, 45.0 as lst_pr, 43.0 as sell_pr, 2.0 as disnt, 12 as ship_mode,13 as warehouse_id, 23 as item_id, 34 as ctlg_page, 232 as ship_cust_id, 4556 as bill_cust_id
    UNION ALL SELECT 46776931, 24, 24.4, 46, 44, 1, 14, 15, 24, 35, 222, 4551
    UNION ALL SELECT 46777394, 42, 43.4, 60, 50, 10, 30, 20, 27, 43, 241, 4565
    UNION ALL SELECT 46777831, 33, 40.4, 51, 46, 15, 16, 26, 33, 40, 234, 4563
    UNION ALL SELECT 46779160, 29, 26.4, 50, 61, 8, 31, 15, 36, 40, 242, 4562
    UNION ALL SELECT 46778595, 43, 28.4, 49, 47, 7, 28, 22, 27, 43, 224, 4555
    UNION ALL SELECT 46779482, 34, 33.4, 64, 44, 10, 17, 27, 43, 52, 222, 4556
    UNION ALL SELECT 46779650, 39, 37.4, 51, 62, 13, 31, 25, 31, 52, 224, 4551
    UNION ALL SELECT 46780524, 33, 40.4, 60, 53, 18, 32, 31, 31, 39, 232, 4563
    UNION ALL SELECT 46780634, 39, 35.4, 46, 44, 16, 33, 19, 31, 52, 242, 4557
    UNION ALL SELECT 46781887, 24, 30.4, 54, 62, 13, 18, 29, 24, 52, 223, 4561

  9. Make sure that the rs_sls_tbl table is successfully created.

Publish assets into the common business catalog

In this step, you create and run the Amazon DataZone data sources for AWS Glue and Amazon Redshift. You will then publish the data assets from these data sources.

The Amazon DataZone data sources allow you to connect to various data sources, including databases, data warehouses, and data lakes, and ingest metadata into Amazon DataZone. By creating and running these data sources, you can make your data available for analysis, transformation, and sharing within your organization.

After the data sources are set up, you can publish the data assets from these sources to make them accessible to other users and applications. This process involves mapping the data assets to the appropriate business terms and metadata, making sure that the data is properly described and categorized.

Add an AWS Glue data source to publish the new AWS Glue table

  1. Stay signed in to the Amazon DataZone console in the producer account as a data publisher.
  2. Choose Select project from the top navigation pane and select the Glue_Publish_Project that you want to add the data source to.
  3. Select the Glue_Publish_Environment.
  4. Choose Create data source. Enter glue-publish-datasource as the name.
  5. Under Data source type, choose AWS Glue.
  6. Under Select an environment, select Glue_Publish_Environment.
  7. Under Data selection, select the AWS Glue database glue_publish_environment_pub_db, enter “*” as your table selection criteria, and then choose Next.
  8. Leave all other settings as default and choose Next.
  9. For Run Preference, select Run on demand to ingest metadata from the specified AWS Glue tables into Amazon DataZone.
  10. Review and choose Create.
  11. After the data source has been created, choose Run. The mkt_sls_table will be listed in the inventory and available to publish.
  12. Select the mkt_sls_table table and review the metadata that was generated. Choose Accept All if you’re satisfied with the metadata.
  13. Choose Publish Asset and the mkt_sls_table table will be published to the business data catalog, making it discoverable and understandable across your organization.

Add an Amazon Redshift data source to publish the new Amazon Redshift table

  1. Stay signed in to the Amazon DataZone console in the producer account as a data publisher.
  2. Choose Select project from the top navigation pane and select the Redshift_Publish_Project that you want to add the data source to.
  3. Choose the Redshift_Publish_Environment.
  4. Choose Create data source. Enter rs-publish-datasource as the name.
  5. Under Data source type, select Amazon Redshift.
  6. Under Select an environment, select Redshift_Publish_Environment.
  7. Under Redshift Credentials, enter the Redshift cluster and secret details provided by the data administrator.
  8. Under Data Selection, select the database dev and schema datazone_env_redshift_publish_environment.
  9. Keep the other settings as default and choose Next.
  10. For Run Preference, select Run on Demand.
  11. Choose Save. After the data source is created, choose Run. The data source runs and the rs_sls_tbl will be listed in the inventory and available to publish.
  12. Select the rs_sls_tbl table and review the metadata that was generated. Choose Accept All if you are satisfied with the metadata.
  13. Choose Publish Asset and the rs_sls_tbl table will be published to the business data catalog.

Create subscribe projects for AWS Glue and Amazon Redshift

In this step, you create the projects for subscribing to AWS Glue and Amazon Redshift data assets within your Amazon DataZone domain.

  1. Sign in to the Amazon DataZone console as a data subscriber IAM user using the consumer account.
  2. Choose Associated domains and select the demo_cross_account_domain.
  3. Select the Open data portal link and sign in to the data portal.
  4. Choose Create New Project and create a project named Glue_Subscribe_Project for subscribing to the AWS Glue data assets.
  5. Create another project named Redshift_Subscribe_Project for subscribing to the Redshift data assets.

Create AWS Glue and Amazon Redshift environment profiles

In this step, you will set up the environment profiles and environments for AWS Glue and Amazon Redshift in your Amazon DataZone projects. This will allow you to connect and interact with resources across AWS accounts.

The purpose of environment profiles in Amazon DataZone is to streamline the process of environment creation. By using environment profiles, you can preconfigure essential placement information such as AWS account and AWS Region. In this solution, you will configure environment profiles with placement information pointing to your consumer account.

You will also create an Amazon DataZone environment from the profiles you are about to create. This will provision the necessary resources in the consumer account and establish the connections between the Amazon DataZone domain and the consumer account. After the environments are created, you can work with AWS Glue and Amazon Redshift assets seamlessly across different AWS accounts within your Amazon DataZone ecosystem.

Create an AWS Glue profile and environment

  1. Stay signed in to the consumer account’s Amazon DataZone console as a data subscriber IAM user, select the Environments tab, and then choose Create environment profile.
  2. Configure the fields as follows:
    1. Name: Enter glue_subscribe-env-profile.
    2. Owner: The project where the profile is being created is selected by default in this field. Verify that it’s Glue_Subscribe_Project.
    3. Blueprint: Select Default Data Lake.
    4. AWS account parameters: Enter the consumer AWS account number and select the Region.
    5. Authorized projects: Select All projects.
    6. Publishing: Select Publish from any database.
    7. Choose Create Environment Profile.
  3. On the Create environment page, enter the following:
    1. Name: Enter glue_subscribe_environment.
    2. Verify that the Environment profile is set to glue_subscribe-env-profile.
  4. (Optional) Parameters: Enter the Producer glue db name, Consumer glue db name, and Workgroup name.
  5. Choose Create environment.
  6. It takes a few minutes for the environment to be created. Verify that the environment creation is successful without any errors.

Create a Redshift environment profile and environment

  1. Staying in the consumer account’s Amazon DataZone management console as a data subscriber IAM user, navigate to the Redshift_Subscribe_Project you created previously.
  2. Select the Environments tab and then choose Create environment profile.
  3. Configure the fields as follows:
    1. Name: Enter redshift_subscribe-env-profile.
    2. Owner: Verify that Project is set to Redshift_Subscribe_Project.
    3. Blueprint: Select Default Data Warehouse.
    4. Parameter set: Select Enter my own.
    5. AWS account parameters: Enter the consumer AWS account number and select the Region.
    6. Parameters: Select either Amazon Redshift Cluster or Amazon Redshift Serverless in the consumer account.
      • AWS Secret ARN: Enter the AWS Secrets Manager secret ARN for the Redshift cluster or workgroup. You need to make sure that the secret in Secrets Manager includes the following tags. These tags help Amazon DataZone implement proper access control so that only authorized users within the correct Amazon DataZone project and domain can access the Amazon Redshift resource.
        1. AmazonDataZoneDomain: [Domain_ID]
        2. AmazonDataZoneProject:  [Project_ID]

      For more information about creating a Redshift database user secret in Secrets Manager, see Storing database credentials in AWS Secrets Manager.

      Note that the database user you provide in AWS Secrets Manager must have superuser permissions. Data publishers should work with the data administrator to get the details of the Redshift cluster or workgroup, database name, and secret ARN.

      • Redshift cluster name: Enter the name of the Amazon Redshift cluster or Amazon Redshift Serverless workgroup.
      • Database name: Enter the name of the database within the selected Amazon Redshift cluster or Amazon Redshift Serverless workgroup.
    7. Authorized projects: Select All projects.
    8. Publishing: Select Publish any schema.
  4. Choose Create environment profile.
  5. Create an environment from this profile:
    1. Name: Enter redshift_subscribe_environment.
    2. Verify that the Environment profile is set to redshift_subscribe-env-profile.
  6. Choose Create Environment.

It takes a few minutes for the environment to be created. Verify that the environment creation is successful without any errors.

Subscribe to the AWS Glue and Redshift tables

In this step, you will subscribe to the AWS Glue and Amazon Redshift tables published by the data producer.

Subscribe to the AWS Glue table

  1. Sign in to the Amazon DataZone console of the consumer account using the data subscriber credentials and navigate to the Glue_Subscribe_Project you created previously.
  2. Search for the Market Sales Table in the Search bar.
  3. Select the Market Sales Table and choose Subscribe.
  4. In the Subscribe pop-up window, provide the following information:
    • Project: Enter the name of the project that you want to subscribe to the asset. By default this will be Glue_Subscribe_Project.
    • Enter a justification for your subscription request.
  5. Choose Subscribe.
  6. Switch to the data publisher role and choose Approve to approve the subscription request, then switch back to the data subscriber.
  7. Select the Glue_Subscribe_Project and choose Subscribed Assets. Verify that the Market Sales Table is added to your environment.
  8. Navigate to the Amazon Athena query editor using the link in the project’s home page.
  9. Choose OPEN AMAZON ATHENA.
  10. You will now be automatically routed to the Athena console. Make sure that the Amazon DataZone Environment is set to glue_subscribe_environment.
  11. For Database, select glue_subscribe_environment_sub_db.
  12. You should see the mkt_sls_table in the Tables list. Preview the table by choosing the three-dot menu next to the table name and selecting Preview Table.
  13. Review the table preview results. You will be able to see all the sales-related data from the mkt_sls_table.

Subscribe to the Redshift table

  1. Stay signed in to the Amazon DataZone management console as the data subscriber, choose Select project from the top navigation pane, and select the Redshift_Subscribe_Project.
  2. Search for Sales Table in the search bar, select the Sales Table, and choose Subscribe.
  3. In the Subscribe pop-up window, provide the following information:
    • Project: Enter the name of the project that you want to subscribe to the asset. By default this will be Redshift_Subscribe_Project.
    • Enter a justification for your subscription request.
  4. Choose Subscribe.
  5. Switch to the data publisher, who is the producer of the Sales Table, and choose Approve.
  6. After the subscription request is approved, switch back to data subscriber.
  7. Select the Redshift_Subscribe_Project and choose Subscribed Assets. After the Sales Table is added to your environment, you can query the data in the table.
  8. Select the Amazon Redshift link in the right side panel of the project home page and navigate to the Amazon Redshift query editor.
  9. Select Open Amazon Redshift and the Redshift query editor v2 will open in a new tab.
  10. In the query editor, right-click your Amazon DataZone environment’s Amazon Redshift cluster and select Create a connection.
  11. Select Temporary credentials using your IAM identity for authentication.
    • If that authentication method isn’t available, open Account settings by choosing the gear icon in the bottom left corner, choose Authenticate with IAM credentials and choose Save.
  12. Enter the name of the Amazon DataZone environment’s database to create the connection.
  13. Choose Create connection.
  14. You can now view the Redshift table rs_sls_tbl in the datazone_env_redshift_subscribe_environment.
  15. Execute the following query to make sure the data is accessible:
    SELECT * FROM "dev"."datazone_env_redshift_subscribe_environment"."rs_sls_tbl";

You will be able to preview the rs_sls_tbl table, which shows the sales data.
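The database and schema names that appear in this walkthrough follow a visible pattern: the environment name in lowercase, with a _pub_db or _sub_db suffix for AWS Glue and a datazone_env_ prefix for Amazon Redshift. The sketch below encodes that pattern as inferred from the names in this post only; it is an assumption, not a documented naming rule, and all three helper names are hypothetical. It can help when composing the fully qualified table name for a verification query.

```python
def glue_database_name(environment_name, role):
    """Glue database name as it appears in this walkthrough (inferred pattern)."""
    suffix = {"publish": "pub_db", "subscribe": "sub_db"}[role]
    return f"{environment_name.lower()}_{suffix}"


def redshift_schema_name(environment_name):
    """Redshift schema name as it appears in this walkthrough (inferred pattern)."""
    return f"datazone_env_{environment_name.lower()}"


def qualified_table(database, schema, table):
    """Join identifiers into a double-quoted, dot-separated table reference."""
    def quote(identifier):
        # Double any embedded quote characters, as in standard SQL identifiers.
        return '"' + identifier.replace('"', '""') + '"'

    return ".".join(quote(part) for part in (database, schema, table))


schema = redshift_schema_name("redshift_subscribe_environment")
print(f"SELECT * FROM {qualified_table('dev', schema, 'rs_sls_tbl')};")
```

If the names in your account differ from what these helpers produce, use the names shown in the Athena or Redshift query editor instead.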

Clean up

To avoid unnecessary future charges, follow these steps:

Summary

Organizations often face significant challenges when trying to share data products across multiple AWS accounts. These challenges stem from the complexity of configuring proper cross-account access permissions and roles while maintaining robust data governance and security controls.

You can use the solution described in the post to publish and consume data across AWS accounts and make sure that reliable access and consistent data governance is in place. By combining the power of AWS Glue and Amazon Redshift, you can unlock valuable insights and accelerate your data-driven decision-making processes.

In this post, you followed a step-by-step guide to set up cross-account data sharing using Amazon DataZone domain association. You learned how to publish data assets from a producer account. You also learned how to subscribe to and query the published assets from a consumer account. You can optionally use AWS Lake Formation access monitoring to view permissions and data access activities. AWS Lake Formation uses AWS CloudTrail for historical analysis and CloudTrail retains logs for 90 days by default.

Now that you’re familiar with the elements involved in cross-account data sharing using Amazon DataZone and your choice of analytical tool, you’re ready to try it with multiple accounts.


About the Authors

Arun Pradeep Selvaraj is a Senior Solutions Architect at AWS. Arun is passionate about working with his customers and stakeholders on digital transformations and innovation in the cloud while continuing to learn, build and reinvent. He is creative, fast-paced, deeply customer-obsessed, and uses the working backwards process to build modern architectures to help customers solve their unique challenges. Connect with him on LinkedIn.

Piyush Mattoo is a Senior Solution Architect for the Financial Services Data Provider segment at Amazon Web Services. He’s a software technology leader with over a decade of experience building scalable and distributed software systems to enable business value through the use of technology. He has an educational background in Computer Science with a master’s degree in computer and information science from University of Massachusetts. He is based in Southern California, and his current interests include camping and nature walks.

Mani Yamaraja is a Senior Customer Solutions Manager for the Financial Services Data Provider segment at Amazon Web Services. He has over a decade of experience working with financial services customers, enabling their digital transformation journey. Mani adopts a customer-centric approach and provides technology solutions working backwards from customers’ business goals. He is passionate about the financial services industry and helps customers accelerate their cloud-based transformation using the proven mechanisms of AWS.

Linux from Scratch version 12.3 released

Post Syndicated from jzb original https://lwn.net/Articles/1013096/

Version 12.3 of Linux From Scratch (LFS) has been released, along with Beyond Linux From Scratch (BLFS) 12.3. LFS provides step-by-step instructions on building a customized Linux system entirely from source, and BLFS helps to extend an LFS installation into a more usable system. Notable changes in this release include toolchain updates to GNU Binutils 2.44, GNU C Library (glibc) 2.41, and Linux 6.13.2. The Changelog has a full list of changes since the previous stable release.

Security updates for Wednesday

Post Syndicated from jzb original https://lwn.net/Articles/1013063/

Security updates have been issued by Debian (libreoffice), Fedora (exim and fscrypt), Red Hat (kernel), Slackware (mozilla), SUSE (docker, firefox, and podman), and Ubuntu (linux, linux-lowlatency, linux-lowlatency-hwe-5.15, linux, linux-lowlatency, linux-lowlatency-hwe-6.8, linux, linux-oem-6.11, linux-aws, linux-aws-6.8, linux-oracle, linux-oracle-6.8, linux-raspi, linux-aws, linux-gcp, linux-hwe-6.11, linux-oracle, linux-raspi, linux-realtime, linux-aws, linux-gkeop, linux-ibm, linux-intel-iotg, linux-intel-iotg-5.15, linux-oracle, linux-oracle-5.15, linux-raspi, and linux-gcp, linux-gcp-6.8, linux-gke, linux-gkeop).

Inside the Take Command Summit 2025 Agenda: What’s in Store for This Year’s Event?

Post Syndicated from Rapid7 original https://blog.rapid7.com/2025/03/05/inside-the-take-command-summit-2025-agenda-whats-in-store-for-this-years-event/

The cybersecurity landscape is shifting fast—ransomware is evolving, AI is reshaping security operations, and regulations are becoming more complex than ever. Security teams are under pressure to outpace adversaries, manage risk, and defend against sophisticated threats.

That’s why Take Command 2025 is built to deliver the most relevant, actionable insights security leaders need to navigate these challenges. This full-day virtual event brings together top security minds—from Rapid7’s experts to industry analysts and frontline defenders—covering the strategies, tools, and intelligence to help you take command of your attack surface.

A pre-recorded message from Rapid7 CEO Corey Thomas is already live on our event site, providing an inside look at what you can expect from Take Command 2025, and how our global summit will help security teams stay ahead of emerging threats. See the full list of speakers and watch Corey Thomas’s message on the Take Command 2025 registration page.

A Glimpse Into This Year’s Key Themes

This year’s agenda is packed with deep-dive discussions, real-world case studies, and expert insights on the most pressing security topics today. Here are just a few of the key focus areas you can expect at Take Command 2025:

Understanding the Evolving Threat Landscape

Cybercriminals are always one step ahead—until you learn to think like they do. This panel discussion, led by Raj Samani, Rapid7’s Chief Scientist, will explore the latest attack methodologies, emerging ransomware tactics, and evolving adversary behaviors.

Raj will be joined by Trent Teyema, Founder and President of CSG Strategies, a former FBI Special Agent (SES retired), as they analyze real-world attacker techniques and share how security teams can leverage threat intelligence to anticipate and disrupt threats before they escalate.

Session: Inside the Mind of an Attacker: Navigating the Threat Horizon

AI & Cloud Security: Opportunities and Challenges

AI is transforming cybersecurity, but how can organizations implement it responsibly and effectively? Take Command 2025 will examine:

  • The future of AI-powered security operations—what’s hype vs. reality?
  • How SOC and MDR teams are leveraging AI to improve detection and response
  • Cloud security challenges and why cloud detection & response (CDR) is becoming a critical SOC capability

Thom Langford, Regional CTO at Rapid7, will host this discussion, featuring Ted Harrington, Executive Partner at ISE (the Company of Ethical Hackers). Together, they will explore how AI-powered, Zero Trust-based security models are changing how organizations approach risk and resilience, and what the next era of cybersecurity defense will look like in our ‘From Zero to Hero: Building the Perfect Defense’ session.

Exposure Management & Red Teaming: Proactive Security in Action

Security teams can’t afford to wait for attacks to happen. Implementing proactive security strategies is critical. Take Command 2025 will explore:

  • How red teaming is evolving to match today’s complex threat landscape
  • Real-world lessons from leading vulnerability management programs
  • Why organizations are shifting from traditional vulnerability scanning to proactive exposure management

Industry analyst Tyler Shields (ESG) and offensive security consultant Will Hunt (In.Security) will lead key discussions, sharing practical insights on prioritizing risk, testing defenses, and staying ahead of attackers.

With NIS2, DORA, SEC regulations, and other global mandates becoming more prescriptive, CISOs need to stay ahead of compliance changes—but these evolving policies also present an opportunity to strengthen security programs.

Sessions will focus on:

  • How regulatory frameworks are reshaping security practices across industries
  • Key compliance challenges for global organizations and strategies for staying ahead
  • The intersection of security, policy, and business risk—how to turn compliance into a competitive advantage

Sabeen Malik, Rapid7’s VP of Global Government Affairs & Public Policy, will help demystify cyber regulations, compliance challenges, and evolving data residency concerns in ‘From Chaos to Compliant: Demystifying Cyber Regulations’.

More to Come: A Full Day of Cybersecurity Insights

This is just a preview of the cutting-edge discussions, expert panels, and strategic deep-dives planned for Take Command 2025. Across the day, you’ll also hear from Rapid7’s own SOC experts, product leaders, and security researchers, who will provide real-world insights into:

  • What’s next for AI-driven security operations
  • How real-world attack simulations are changing security strategy
  • Inside the SOC: Expert stories from frontline threat hunters

Whether you’re a practitioner, security leader, or researcher, this event is designed to give you the insights and strategies needed to strengthen your security posture in 2025 and beyond.

Register Now to Take Command

Take Command 2025 is a free, global, virtual event happening on April 9. Don’t miss your chance to hear from security leaders and experts on the biggest challenges shaping the industry.

Register Now!

Critique of Grammatical Reason

Post Syndicated from original https://www.toest.bg/kritika-na-ghramatichniya-razum/

Will you follow me into the deep end? I promise we won’t drown. Let us dive together into the academic platform Български езикови ресурси онлайн (Bulgarian Language Resources Online, БЕРОН) and check some words that we use every day.

Let us start, for example, with аз (“I”). Directly below the word we read: “class: noun pronoun” (съществително местоимение). “Class” means part of speech¹, and that much we can digest, but “noun pronoun” is... let’s say, puzzling. So is this a noun or a pronoun? It must be a pronoun that is a noun. Or could it be a noun that is a pronoun? Wait, wait,

I remember very well from school that аз is a personal pronoun. Is that no longer the case either?

If we scroll sideways, we reach the following information: “The forms of the noun pronoun аз denote the author of the text, that is, the speaker (writer). By tradition, аз is defined as a personal pronoun for the first person.” From this text we understand, first, that аз probably belongs to the class of nouns², with the pronoun being one of its subclasses, and second, that this is our familiar old personal pronoun, but with a new categorization.

What is behind this change in the classification of the grammatical classes of words? As we learn from publications in scholarly periodicals, a new academic grammar of the Bulgarian language is about to be published, and it will also be normative³. It is high time: the previous one dates from the last century (1982–1983).

For now, information about the future grammar, which is being prepared by scholars from the Institute for Bulgarian Language (ИБЕ) at the Bulgarian Academy of Sciences (БАН), is quite limited, but we do know that it will present something fundamentally different.

A new classification of word classes (parts of speech)

There is nothing wrong in itself with new opinions, views, concepts, models and so on in science. On the contrary, this is how human knowledge of the world around us develops and grows richer. When different opinions meet in the scientific arena and they are solidly argued, the ones that usually prevail are those that better conceptualize reality (oh, what a convoluted word, forgive me, but it is the one that came to mind) and help us get to know it better, and then to work with the material we have mastered.

Let us see what is known so far about the new grammatical classification and how, or whether, it will help us explain the language to ourselves better and use it. We will have to start from the existing, traditional one, according to which

there are 10 word classes: noun, adjective, numeral, pronoun, verb (including participles), adverb, preposition, conjunction, particle, interjection.

It rests on two main criteria, semantic and grammatical; that is, each of the classes above unites words that share something very general in their meaning as well as common grammatical categories. Adjectives, for example, are united by the fact that they denote attributes of persons and objects, and they also have gender, number, and definiteness/indefiniteness (they can take the definite article); some also have degree. Verbs name an action or a state and have person, number, tense, voice, and so on.

The new classification ignores the semantic criterion (that is, the most general element of meaning has no meaning) and is based entirely on a formal approach to distributing words into classes:

This approach is based on the strict distinction between form and semantics through the application of strict morphosyntactic criteria.

For now we have fuller information only about the class of adjectives. Here are the subclasses it will include:

1. Adjectives in the proper sense (true adjectives).

These are our old acquaintances: words such as добър (good), жълт (yellow), градски (urban), that is, adjectives as we know and picture them today.

2. Adjectival pronouns.

The new addition to the class of adjectives includes several types of the pronouns familiar to us:

a) demonstrative (този, онзи);
b) possessive (мой/ми, твой/ти, негов/му);
c) reflexive possessive (свой);
d) interrogative (кой, какъв);
e) relative (който, какъвто);
f) indefinite (някой, някакъв, нечий);
g) negative (никой, никакъв, ничий);
h) generalizing (всеки, всякакъв).

In fact, these are all the pronouns except the personal ones (аз, ти) and the reflexive personal one (себе си).

The grounds for assigning these words to the adjectives are that most of them can change for gender and number and take the definite article, for example мой (моя, моят), моя(та), мое(то), мои(те). Here we must make an important clarification: кой, който, някой, никой and всеки are adjectives in cases such as Всеки приятел ще се зарадва на успеха ти ("Every friend will be glad of your success", where всеки modifies приятел), but if they are used on their own they fall into the large class of nouns: Всеки ще се зарадва на успеха ти ("Everyone will be glad of your success", where всеки is the subject). And that is not all. There are cases in which the word is the subject, yet we will probably have to count it as an adjective because it changes for gender: Всяка от вас трябва да внимава ("Each of you must be careful", when the speaker is addressing women).

3. Adjectival numerals.

The ordinal numerals we know (първи "first", втори "second", трети "third") do indeed closely resemble adjectives in their grammatical characteristics: they change for gender and number and take the definite article. Quite a few Bulgarian linguists have noted this, and it cannot be ignored. It is also fair to admit that what currently unites them with the cardinal numerals (едно "one", две "two", три "three") is above all meaning. Still, they share at least one grammatical category: definiteness/indefiniteness (they take the article).

4. Verbal adjectives.

Just as with the pronoun subclass, we will need to pay close attention, because once again we will be drawing distinctions. Verbal adjectives are in fact some of the participles, which the traditional classification treats as verb forms. And indeed, they are formed from verbs. Let us recall the types of participles:

a) present active: търсещ (seeking);
b) past aorist active: търсил, потърсил (having sought);
c) past imperfect active: търсел;
d) past passive: търсен (sought);
e) verbal adverb (деепричастие): търсейки (while seeking).

The participles that change for gender and number and take the definite article (a, b, d) will now be verbal adjectives: търсещ(ият) човек, търсеща(та) жена; потърсила(та) правата си майка, потърсили(те) правата си хора; търсена(та) стока, търсени(те) професии.

Those, however, that are used to form the various verb tenses and moods will remain participles:

Майката търсела правата си. (renarrative form: past imperfect active participle)
Майката е търсила правата си. (present perfect form, containing a past aorist active participle)
Тази стока сега е търсена на пазара. (passive voice form, containing a past passive participle)

And so, once again we will have to tell apart one and the same words and assign them now to the verbal adjectives (търсена стока, "sought-after goods"), now to the participles, that is, to the verb system (стоката е търсена, "the goods are sought after"). This applies to the past aorist active participle and the past passive participle.

Between you and me, I have started to get confused myself. I can imagine how much effort a non-philologist must make to follow these transformations, relations and distinctions. As we wrap up the information on verbal adjectives, let us say our goodbyes.

Farewell, present active participle; farewell, verbal adverb! See you in some next life, perhaps.

The fate of the words we now categorize as verbal adverbs is unclear, at least to me, at the moment, but I assume they will be assigned to the adverbs.

5. Indeclinable adjectives.

I will only touch on the last subclass of adjectives, since it is the subject of much discussion in the linguistic literature. Besides words such as инат (stubborn), сербез (bold) and късметлия (lucky), which are usually assigned to it, the number of lexemes from Western European languages, especially English, keeps swelling: супер, екстра, денс, макси and others.

Other grammatical classifications

The recategorization of words proposed by the scholars at ИБЕ (the Institute for Bulgarian Language), as we might expect, is not an isolated phenomenon in Bulgarian linguistics, and it is right to put it in context. In more recent times one can point to the classifications of Yuri Maslov (1981) and Stanyo Georgiev (1991). Konstantin Kutsarov gives a detailed overview of the concepts in his dissertation „Българските лексемни класове и учението на частите на речта“ ("The Bulgarian Lexeme Classes and the Doctrine of the Parts of Speech", 2019)6. Building on three criteria (lexico-semantic, morphological and syntactic), he argues for distinguishing 12 lexeme classes. On the whole, pronouns do not form a separate class; only the personal ones (аз, ти, той) are set apart in a "discursive" class. Participles, with the exception of verbal adverbs, form an independent class, as do the "determinatives", which syntax currently treats as parenthetical words (впрочем "by the way", наистина "indeed", например "for example"). At the end of the author's abstract of his dissertation, Kutsarov rightly notes:

… there is no classification model that is not vulnerable, since it is inherently burdened by the subjectivity of the researcher's theses. This scholarly work offers only one of the possible points of view on the problem.

I fully support Prof. Kutsarov's view. When you propose something fundamentally new, something that departs from tradition, revises tradition and challenges tradition, you ought to be aware that it is "only one of the possible points of view" and not impose it as the final word.

Like any model of language, the current classification, as we have known it for decades, is not perfect. I by no means make a cult of it, nor do I think it was given once and for all and that better models should not be sought. I do think, however, that a classification cannot appear out of nowhere, be accepted at once by the academic community and society, and become normative without friction.

The new grammatical classification of ИБЕ as reflected in БЕРОН

We may still be in the dark about the overall concept of the ИБЕ scholars, but it has already been quietly pushed into БЕРОН, which, in their words, is "a powerful literacy tool for every Bulgarian with internet access", "especially valuable for education and above all for the teaching of Bulgarian in and outside Bulgaria". As we mentioned, кой and някой, for example, appear there as noun pronouns and adjectival pronouns. In addition, other words have been reclassified: безброй ("countless") is no longer an adverb but a numeral, and колко and толкова are not pronouns of quantity but numerals (толкова приятели, "so many friends") and adverbs (толкова те обичам, "I love you so much"). One fact surprised me: много ("much, many") and малко ("little, few") are also numerals and adverbs, but only малко can also be a noun. Why is this status denied to много? Participles, meanwhile, are still in their current form, and all participial forms are listed under the respective verbs, which leads me to think that the reclassification has not been carried through.

Even in this form, however, we have sufficient grounds to ask:

who needs all this, and what is the benefit of it?

We know fragmentary things about the new word classes (and that is we, the specialists, who have stumbled upon the changes), yet terms that are unfamiliar and confusing even to philologists, such as съществително местоимение ("noun pronoun"), already appear in БЕРОН. Where is the fully developed new theory of word classes? Should it not have been published and discussed, and only then possibly made official and reflected in БЕРОН, a platform which, as declared, is meant to serve "every Bulgarian", the whole of society? Let me remind my colleagues at ИБЕ that this same society consists overwhelmingly of non-philologists. But even philologists, again overwhelmingly, have no inkling of the plans for grammatical innovations on such a scale.

It is astonishing that the ИБЕ scholars have the boldness (and that is not quite the word that deserves to be used) simply to impose as normative a grammatical classification that few specialists know about. Not to mention that the classification in question has not been tested in practice at all.

Let me recall what the academic grammar of 1982-1983 says:

In view of its main purpose, the grammar has been compiled using a concept, methodology and terminology that are traditional at their core. In principle, the compilers rely on positions established in Bulgarian grammatical literature, and markedly debatable propositions are avoided.7

I wonder why the ИБЕ scholars are departing from the sensible approach of their distinguished predecessors, and on what their confidence rests that they will present a concept which, at the very least, will not provoke substantial objections.

The main question, in my view, however, is what practical benefits the new classification will bring us.

Will we explain linguistic reality to ourselves better and order it better in our minds? Will this lead to faster and more lasting mastery of the grammatical, spelling and punctuation norms? Or, on the contrary, will we squander valuable energy recutting, rearranging and restitching the fabric of the language? And in the meantime we may get lost somewhere out there, among the noun pronouns, the verbal adjectives and the words that can be adverbs, numerals and nouns.

With its opaque language policy, ИБЕ has long been in debt to Bulgarian society. The Institute owes answers all the more now, before we wake up one day under the dictate of a new normative grammar.

Or worse: with a new normative grammar that we will not want to comply with.

The opinion set out in this article is my own, and I do not commit anyone else to it.

1 The term части на речта (parts of speech) is already being displaced in Bulgarian linguistics by (граматични) класове думи/лексеми ((grammatical) classes of words/lexemes). I find the new term to be correct.

2 In съществително местоимение ("noun pronoun") the first word can also be perceived as an adjective. In fact, the term should be understood as "class noun, subclass pronoun".

3 Stancheva, R., Tomov, M. Класът на прилагателните имена в нормативната граматика на българския език: формален подход [The class of adjectives in the normative grammar of Bulgarian: a formal approach]. Български език, 2023, Supplement, pp. 73-91.

4 Ibid., p. 73.

5 The examples in parentheses do not exhaust all the pronouns of the respective type.

6 The dissertation was published as a monograph under the same title in 2022 by the publishing house „Колибри“.

7 Граматика на съвременния български книжовен език [Grammar of the Contemporary Bulgarian Literary Language]. Vol. 1. Фонетика [Phonetics]. Sofia: Издателство на БАН, 1982, p. 7.

Език, the Bulgarian word for both "tongue" and "language", can be tasty off the plate too: that language, Bulgarian, which we have spoken since childhood and to which we swear our love around 24 May. Yet in essence it is a means of communication, and in order to serve us well it changes constantly. Let us look at it in its dynamics and try to understand what is happening and why, what the driving mechanisms are and how they are connected to social processes. And since the task is not an easy one, we will do it gradually, in portions.

Title Launch Observability at Netflix Scale

Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/title-launch-observability-at-netflix-scale-8efe69ebd653

Part 3: System Strategies and Architecture

By: Varun Khaitan

With special thanks to my stunning colleagues: Mallika Rao, Esmir Mesic, Hugo Marques

This blog post is a continuation of Part 2, where we cleared the ambiguity around title launch observability at Netflix. In this installment, we will explore the strategies, tools, and methodologies that were employed to achieve comprehensive title observability at scale.

Defining the observability endpoint

To create a comprehensive solution, we decided to introduce observability endpoints first. Each microservice involved in our Personalization stack that integrated with our observability solution had to introduce a new “Title Health” endpoint. Our goal was for each new endpoint to adhere to a few principles:

  1. Accurate reflection of production behavior
  2. Standardization across all endpoints
  3. Answering the Insight Triad: “Healthy” or not, why not and how to fix it.

Accurately Reflecting Production Behavior

A key part of our solution is insight into production behavior. This requires that our requests to the endpoint result in traffic to the real service functions, following the same pathways the traffic would take if it came from the usual callers.

To allow for this mimicking, many systems implement “event” handling, where they convert our request into a call to the real service with properties enabled to log when titles are filtered out of their response and why. Building services that adhere to software best practices, such as Object-Oriented Programming (OOP), the SOLID principles, and modularization, is crucial to success at this stage. Without these practices, service endpoints may become tightly coupled to business logic, making it challenging and costly to add a new endpoint that seamlessly integrates with the observability solution while following the same production logic.

A service with modular business logic facilitates the seamless addition of an observability endpoint.
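
As a rough illustration of why modular business logic helps, the sketch below keeps row-building in one shared function and lets a hypothetical observability endpoint call it with tracing switched on. All names here (`build_row`, `FilterTrace`, `title_health_endpoint`) are invented for the example and are not taken from any Netflix service.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional

# A filter returns None to keep a title, or a human-readable reason to drop it.
TitleFilter = Callable[[int], Optional[str]]


@dataclass
class FilterTrace:
    """Records why each title was dropped; only populated when tracing is on."""
    dropped: Dict[int, str] = field(default_factory=dict)


def build_row(candidates: Iterable[int], filters: List[TitleFilter],
              trace: Optional[FilterTrace] = None) -> List[int]:
    """Shared business logic: the single code path used by every caller."""
    kept = []
    for title_id in candidates:
        reason = None
        for title_filter in filters:
            reason = title_filter(title_id)
            if reason is not None:
                break  # first filter that rejects the title wins
        if reason is None:
            kept.append(title_id)
        elif trace is not None:
            trace.dropped[title_id] = reason
    return kept


def production_endpoint(candidates, filters):
    """Usual caller: no tracing, just the row."""
    return build_row(candidates, filters)


def title_health_endpoint(candidates, filters):
    """Hypothetical observability caller: same logic, reasons captured."""
    trace = FilterTrace()
    kept = build_row(candidates, filters, trace)
    return {"eligible": kept, "filtered": trace.dropped}
```

Because both endpoints go through `build_row`, the observability answer cannot drift from what production would actually serve.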

Standardization

To standardize communication between our observability service and the personalization stack’s observability endpoints, we’ve developed a stable proto request/response format. This centralized format, defined and maintained by our team, ensures all endpoints adhere to a consistent protocol. As a result, requests are uniformly handled, and responses are processed cohesively. This standardization enhances adoption within the personalization stack, simplifies the system, and improves understanding and debuggability for engineers.

The request schema for the observability endpoint.

The Insight Triad API

To efficiently understand the health of a title and triage issues quickly, all implementations of the observability endpoint must answer three questions: is the title eligible for this phase of promotion; if not, why is it not eligible; and what can be done to fix any problems?

The end-users of this observability system are Launch Managers, whose job it is to ensure smooth title launches. As such, they must be able to quickly see whether there is a problem, what the problem is, and how to solve it. Teams implementing the endpoint must provide as much information as possible so that a non-engineer (Launch Manager) can understand the root cause of the issue and fix any title setup issues as they arise. They must also provide enough information for partner engineers to identify the problem with the underlying service in cases of system-level issues.

These requirements are captured in the following protobuf object that defines the endpoint response.

The response schema for the observability endpoint.
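
The actual protobuf definition is not reproduced in this text. Purely as an illustration of the Insight Triad, a response could carry a status, the reasons behind it, and suggested fixes, plus extra detail for partner engineers. The field names below are assumptions, written as a Python dataclass rather than the real proto message.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class HealthStatus(Enum):
    HEALTHY = "HEALTHY"
    UNHEALTHY = "UNHEALTHY"


@dataclass
class TitleHealthResponse:
    """Hypothetical response shape; not the real protobuf schema."""
    title_id: int
    status: HealthStatus                                  # healthy or not?
    reasons: List[str] = field(default_factory=list)      # why not?
    fixes: List[str] = field(default_factory=list)        # how to fix it?
    debug_details: Dict[str, str] = field(default_factory=dict)  # for engineers


def launch_manager_summary(response: TitleHealthResponse) -> str:
    """Render the triad as one line a non-engineer can act on."""
    if response.status is HealthStatus.HEALTHY:
        return f"Title {response.title_id}: healthy"
    why = "; ".join(response.reasons) or "no reason reported"
    fix = "; ".join(response.fixes) or "no fix suggested"
    return f"Title {response.title_id}: unhealthy. Why: {why}. Fix: {fix}."
```

Keeping `debug_details` separate from `reasons` and `fixes` reflects the two audiences described above: Launch Managers read the summary, while partner engineers dig into the details.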

High level architecture

We’ve distilled our comprehensive solution into the following key steps, capturing the essence of our approach:

  1. Establish observability endpoints across all services within our Personalization and Discovery Stack.
  2. Implement proactive monitoring for each of these endpoints.
  3. Track real-time title impressions from the Netflix UI.
  4. Store the data in an optimized, highly distributed datastore.
  5. Offer easy-to-integrate APIs for our dashboard, enabling stakeholders to track specific titles effectively.
  6. “Time Travel” to validate ahead of time.

Observability stack high level architecture diagram

In the following sections, we will explore each of these concepts and components as illustrated in the diagram above.

Key Features

Proactive monitoring through scheduled collector jobs

Our Title Health microservice runs a scheduled collector job every 30 minutes for most of our personalization stack.

For each Netflix row we support (such as Trending Now, Coming Soon, etc.), there is a dedicated collector. These collectors retrieve the relevant list of titles from our catalog that qualify for a specific row by interfacing with our catalog services. These services are informed about the expected subset of titles for each row, for which we are assessing title health.

Once a collector retrieves its list of candidate titles, it orchestrates batched calls to its assigned row services, using the standardized schema above, to retrieve all relevant health information for the titles. Some collectors instead poll our Kafka queue for impressions data.
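
The batching step can be sketched as follows. This is a minimal illustration under assumptions: `fetch_health` stands in for whatever client calls a row service's Title Health endpoint, and the batch size of 50 is arbitrary.

```python
from itertools import islice
from typing import Callable, Dict, Iterable, Iterator, List


def batched(items: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    iterator = iter(items)
    while True:
        chunk = list(islice(iterator, size))
        if not chunk:
            return
        yield chunk


def run_collector(title_ids: Iterable[int],
                  fetch_health: Callable[[List[int]], Dict[int, str]],
                  batch_size: int = 50) -> Dict[int, str]:
    """Call the (hypothetical) row service once per batch and merge results."""
    results: Dict[int, str] = {}
    for batch in batched(title_ids, batch_size):
        results.update(fetch_health(batch))
    return results
```

A scheduler would invoke `run_collector` once per row on each cycle; retries and error handling are left out to keep the sketch short.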

Real-time Title Impressions and Kafka Queue

In addition to evaluating title health via our personalization stack services, we also keep an eye on how our recommendation algorithms treat titles by reviewing impressions data. It’s essential that our algorithms treat all titles equitably, for each one has limitless potential.

This data is processed from a real-time impressions stream into a Kafka queue, which our title health system regularly polls. Specialized collectors access the Kafka queue every two minutes to retrieve impressions data. This data is then aggregated into minute-level intervals, giving the number of impressions each title receives in near-real-time, and is presented as an additional health status indicator for stakeholders.
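
The aggregation step alone can be illustrated without a Kafka client. The sketch below assumes each impression has already been read off the queue as a `(title_id, epoch_seconds)` pair, which is an invented shape for the example, and counts impressions per title per fixed-width time bucket.

```python
from collections import Counter
from typing import Iterable, Tuple


def aggregate_impressions(events: Iterable[Tuple[int, int]],
                          bucket_seconds: int = 60) -> Counter:
    """Count impressions per (title_id, bucket_start) pair.

    `events` holds (title_id, epoch_seconds) tuples; bucket_start is the
    event time rounded down to a multiple of `bucket_seconds`.
    """
    counts: Counter = Counter()
    for title_id, epoch_seconds in events:
        bucket_start = epoch_seconds - epoch_seconds % bucket_seconds
        counts[(title_id, bucket_start)] += 1
    return counts
```

A title that shows zero impressions in recent buckets while its row services report it as healthy is the kind of discrepancy this extra indicator is meant to surface.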

Data storage and distribution through Hollow Feeds

Netflix Hollow is an open source Java library and toolset for disseminating in-memory datasets from a single producer to many consumers for high-performance read-only access. Given the shape of our data, Hollow feeds are an excellent strategy to distribute the data across our service boxes.

Once collectors gather health data from partner services in the personalization stack or from our impressions stream, this data is stored in a dedicated Hollow feed for each collector. Hollow offers numerous features that help us monitor the overall health of a Netflix row, including ensuring there are no large-scale issues across a feed publish. It also allows us to track the history of each title by maintaining a per-title data history, calculate differences between previous and current data versions, and roll back to earlier versions if a problematic data change is detected.
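
To make the versioning ideas concrete (history, diffs between versions, rollback), here is a toy in-memory feed. It does not use the Hollow API, which is a Java library; the class and method names are invented and only mirror the concepts described above.

```python
from typing import Dict, List, Optional, Tuple


class VersionedFeed:
    """Toy stand-in for a versioned feed: every publish is a full snapshot."""

    def __init__(self) -> None:
        self._versions: List[Dict[int, str]] = []

    def publish(self, snapshot: Dict[int, str]) -> int:
        """Store a copy of the snapshot and return its version number."""
        self._versions.append(dict(snapshot))
        return len(self._versions) - 1

    def current(self) -> Dict[int, str]:
        """Return a copy of the latest snapshot (raises if nothing published)."""
        return dict(self._versions[-1])

    def diff(self, old: int, new: int) -> Dict[int, Tuple[Optional[str], Optional[str]]]:
        """Map each changed title to its (old, new) status; None means absent."""
        before, after = self._versions[old], self._versions[new]
        return {
            title_id: (before.get(title_id), after.get(title_id))
            for title_id in sorted(before.keys() | after.keys())
            if before.get(title_id) != after.get(title_id)
        }

    def roll_back(self, version: int) -> int:
        """Republish an earlier snapshot as the newest version."""
        return self.publish(self._versions[version])
```

A publish-time check could, for example, refuse a new version when `diff` shows a large fraction of titles flipping to unhealthy at once, then call `roll_back`.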

Observability Dashboard using Health Check Engine

We maintain several dashboards that utilize our title health service to present the status of titles to stakeholders. These user interfaces access an endpoint in our service, enabling them to request the current status of a title across all supported rows. This endpoint efficiently reads from all available Hollow Feeds to obtain the current status, thanks to Hollow’s in-memory capabilities. The results are returned in a standardized format, ensuring easy support for future UIs.

Additionally, we have other endpoints that can summarize the health of a title across subsets of sections to highlight specific member experiences.

Message depicting a dashboard request.
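
A dashboard lookup of this kind might look like the sketch below, assuming each row's feed is available in memory as a mapping from title ID to status. The row names, status strings, and function names are illustrative assumptions, not the service's real contract.

```python
from typing import Dict, Iterable, Optional

# Hypothetical in-memory view: row name -> {title_id -> status string}.
Feeds = Dict[str, Dict[int, str]]


def title_status(feeds: Feeds, title_id: int,
                 rows: Optional[Iterable[str]] = None) -> Dict[str, str]:
    """Status of one title per row; restrict to `rows` when given."""
    selected = list(feeds) if rows is None else rows
    return {
        row: feeds[row].get(title_id, "UNKNOWN")
        for row in selected
        if row in feeds
    }


def overall(statuses: Dict[str, str]) -> str:
    """Collapse per-row statuses: anything not HEALTHY needs attention."""
    if not statuses:
        return "UNKNOWN"
    if all(status == "HEALTHY" for status in statuses.values()):
        return "HEALTHY"
    return "NEEDS_ATTENTION"
```

Passing a subset of rows to `title_status` corresponds to the summary endpoints that focus on a specific member experience.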

Time Traveling: Catching issues before launch

Titles launching at Netflix go through several phases of pre-promotion before ultimately launching on our platform. For each of these phases, the first several hours of promotion are critical for the reach and effective personalization of a title, especially once the title has launched. Thus, to prevent issues as titles go through the launch lifecycle, our observability system needs to be capable of simulating traffic ahead of time so that relevant teams can catch and fix issues before they impact members. We call this capability “Time Travel”.

Many of the metadata and assets involved in title setup have specific timelines for when they become available to members. To determine if a title will be viewable at the start of an experience, we must simulate a request to a partner service as if it were from a future time when those specific metadata or assets are available. This is achieved by including a future timestamp in our request to the observability endpoint, corresponding to when the title is expected to appear for a given experience. The endpoint then communicates with any further downstream services using the context of that future timestamp.

An example request with a future timestamp.
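
The effect of the future timestamp can be illustrated with a small sketch. It assumes each asset has a single "available from" time and that the request carries an `as_of` field; both names are invented for the example, and the real request format is the proto schema maintained by the team.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Dict, List


@dataclass(frozen=True)
class TitleHealthRequest:
    """Hypothetical request: evaluate the title as if it were `as_of`."""
    title_id: int
    as_of: datetime


def assets_not_yet_available(available_from: Dict[str, datetime],
                             request: TitleHealthRequest) -> List[str]:
    """Names of assets whose availability starts after the simulated time."""
    return sorted(
        name for name, start in available_from.items()
        if start > request.as_of
    )
```

Evaluating the same title with `as_of` set to the planned launch time, rather than the current time, shows which assets would still be missing when members first see it.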

Conclusion

Throughout this series, we’ve explored the journey of enhancing title launch observability at Netflix. In Part 1, we identified the challenges of managing vast content launches and the need for scalable solutions to ensure each title’s success. Part 2 highlighted the strategic approach to navigating ambiguity, introducing “Title Health” as a framework to align teams and prioritize core issues. In this final part, we detailed the system strategies and architecture, including observability endpoints, proactive monitoring, and “Time Travel” capabilities, all designed to ensure a thrilling viewing experience.

By investing in these innovative solutions, we enhance the discoverability and success of each title, fostering trust with content creators and partners. This journey not only bolsters our operational capabilities but also lays the groundwork for future innovations, ensuring that every story reaches its intended audience and that every member enjoys their favorite titles on Netflix.

Thank you for joining us on this exploration, and stay tuned for more insights and innovations as we continue to entertain the world.


Title Launch Observability at Netflix Scale was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.
