Tag Archives: metrics

Iris – Turning observations into actionable insights for enhanced decision making

Post Syndicated from Grab Tech original https://engineering.grab.com/iris

Introduction

Iris (/ˈaɪrɪs/), a name inspired by the Olympian mythological figure who personified the rainbow and served as the messenger of the gods, is a comprehensive observability platform for Extract, Transform, Load (ETL) jobs. Just as the mythological Iris connected the gods to humanity, our Iris platform bridges the gap between raw data and meaningful insights, serving the needs of data-driven organisations. Specialising in meticulous monitoring and tracking of Spark and Presto jobs, Iris stands as a transformative tool for peak observability and effective decision-making.

  • Iris captures critical job metrics right at the Java Virtual Machine (JVM) level, including but not limited to runtime, CPU and memory utilisation rates, garbage collection statistics, stage and task execution details, and much more.
  • Iris not only regularly records these metrics but also supports real-time monitoring and offline analytics of metrics in the data lake. This gives you multi-faceted control and insights into the operational aspects of your workloads.
  • Iris gives you an overview of your jobs, predicts if your jobs are over or under-provisioned, and provides suggestions on how to optimise resource usage and save costs.

Understanding the needs

When examining ETL job monitoring across various platforms, a common deficiency became apparent. Existing tools could only provide CPU and memory usage data at the instance level, where an instance could refer to an EC2 unit or a Kubernetes pod with resources bound to the container level.

However, this CPU and memory usage data included usage from the operating system and other background tasks, making it difficult to isolate usage specific to Spark jobs (JVM level). A sizeable fraction of resource consumption, thus, could not be attributed directly to our ETL jobs. This lack of granularity posed significant challenges when trying to perform effective resource optimisation for individual jobs.

Gap between total instance and JVM provisioned resources

The situation was further complicated when compute instances were shared among various jobs. In such cases, determining the precise resource consumption for a specific job became nearly impossible. This made in-depth analysis and performance optimisation of specific jobs a complex and often ineffective process.

In the initial stages of my career with Spark, I took the reins of handling SEGP ETL jobs deployed in Chimera. At the time, Chimera did not have any tool for observing and understanding SEGP jobs. The lack of an efficient tool for close-to-real-time visualisation of Spark cluster/job metrics, for profiling code class/function runtime durations, and for investigating deep-level job metrics to assess CPU and memory usage posed a significant challenge even back then.

In the quest for solutions within Grab, I found no tool that could fulfill all these needs. This prompted me to extend my search beyond the organisation, leading me to discover that Uber had an exceptional tool known as the JVM Profiler. This tool could collect JVM metrics and profile the job. Further research also led me to sparkMeasure, a standalone tool known for its ability to measure Spark metrics on-the-fly without any code changes.

This personal research and journey highlights the importance of a comprehensive, in-depth observability tool – emphasising the need that Iris aims to fulfill in the world of ETL job monitoring. Through this journey, Iris was ideated, named after the Greek deity, encapsulating the mission to bridge the gap between the realm of raw ETL job metrics and the world of actionable insights.

Observability with Iris

Platform architecture

Platform architecture of Iris

Iris's robust architecture is designed to deliver observability into Spark jobs with high reliability. It consists of three main modules: the Metrics Collector, the Kafka queue, and the Telegraf, InfluxDB, and Grafana (TIG) stack.

Metrics Collector: This module listens to Spark jobs, collects metrics, and funnels them to the Kafka queue. What sets it apart is its unobtrusive nature – there is no need for end-users to update their application code or notebooks.

Kafka Queue: Serving as an asynchronous deliverer of metrics messages, Kafka is leveraged to prevent Iris from becoming another bottleneck slowing down user jobs. By functioning as a message queue, it enables the efficient processing of metric data.

TIG Stack: This component is utilised for real-time monitoring, making visualising performance metrics a cinch. The TIG stack proves to be an effective solution for real-time data visualisation.

For offline analytics, Iris pushes metrics data from Kafka into our data lake. This creates a wealth of historical data that can be utilised for future research, analysis, and predictions. The strategic combination of real-time monitoring and offline analysis forms the basis of Iris’s ability to provide valuable insights.

Next, we will delve into how Iris collects the metrics.

Data collection

Iris's metrics collection is currently driven primarily by two tools that operate under the Metrics Collector module: JVM Profiler and sparkMeasure.

JVM Profiler

As mentioned earlier, JVM Profiler is an exceptional tool that helps collect and profile metrics at the JVM level.
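
The Metrics Collector attaches the profiler for users automatically, but for context, this is roughly how the agent could be wired into a Spark session by hand. The sketch below is a minimal illustration: the jar location, Kafka broker, topic prefix, interval, and tag are placeholders, and the reporter class and option names follow the profiler's public README, so treat the exact values as assumptions.

from pyspark.sql import SparkSession

# Placeholder values for illustration: jar location, broker list, topic prefix and tag.
# The reporter class and option names follow the Uber JVM Profiler README.
PROFILER_JAR = "s3://example-bucket/jars/jvm-profiler-1.0.0.jar"
AGENT_OPTS = (
    "-javaagent:jvm-profiler-1.0.0.jar="
    "reporter=com.uber.profiling.reporters.KafkaOutputReporter,"
    "brokerList=kafka-broker:9092,topicPrefix=iris_,metricInterval=5000,tag=mytag"
)

spark = (
    SparkSession.builder
    .appName("profiled-etl-job")
    .config("spark.jars", PROFILER_JAR)                      # ship the profiler jar with the job
    .config("spark.driver.extraJavaOptions", AGENT_OPTS)     # attach the agent to the driver JVM
    .config("spark.executor.extraJavaOptions", AGENT_OPTS)   # and to every executor JVM
    .getOrCreate()
)

In Iris, this wiring is handled by the platform itself, which is why end-users never have to touch these settings.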

Java process for the JVM Profiler tool

Uber JVM Profiler supports the following features:

  • Debug memory usage for all your Spark application executors, including Java heap memory, non-heap memory, native memory (VmRSS, VmHWM), memory pool, and buffer pool (direct/mapped buffer).
  • Debug CPU usage and garbage collection time for all Spark executors.
  • Debug arbitrary Java class methods (how many times they run, how long they take), also called Duration Profiling.
  • Debug arbitrary Java class method calls and trace their argument values, also known as Argument Profiling.
  • Do Stacktrace Profiling and generate a flamegraph to visualise CPU time spent by the Spark application.
  • Debug I/O metrics (disk read/write bytes for the application, CPU iowait for the machine).
  • Debug JVM thread metrics such as count of total threads, peak threads, live/active threads, and newThreads.

Example metrics (Source code)

{
        "nonHeapMemoryTotalUsed": 11890584.0,
        "bufferPools": [
                {
                        "totalCapacity": 0,
                        "name": "direct",
                        "count": 0,
                        "memoryUsed": 0
                },
                {
                        "totalCapacity": 0,
                        "name": "mapped",
                        "count": 0,
                        "memoryUsed": 0
                }
        ],
        "heapMemoryTotalUsed": 24330736.0,
        "epochMillis": 1515627003374,
        "nonHeapMemoryCommitted": 13565952.0,
        "heapMemoryCommitted": 257425408.0,
        "memoryPools": [
                {
                        "peakUsageMax": 251658240,
                        "usageMax": 251658240,
                        "peakUsageUsed": 1194496,
                        "name": "Code Cache",
                        "peakUsageCommitted": 2555904,
                        "usageUsed": 1173504,
                        "type": "Non-heap memory",
                        "usageCommitted": 2555904
                },
                {
                        "peakUsageMax": -1,
                        "usageMax": -1,
                        "peakUsageUsed": 9622920,
                        "name": "Metaspace",
                        "peakUsageCommitted": 9830400,
                        "usageUsed": 9622920,
                        "type": "Non-heap memory",
                        "usageCommitted": 9830400
                },
                {
                        "peakUsageMax": 1073741824,
                        "usageMax": 1073741824,
                        "peakUsageUsed": 1094160,
                        "name": "Compressed Class Space",
                        "peakUsageCommitted": 1179648,
                        "usageUsed": 1094160,
                        "type": "Non-heap memory",
                        "usageCommitted": 1179648
                },
                {
                        "peakUsageMax": 1409286144,
                        "usageMax": 1409286144,
                        "peakUsageUsed": 24330736,
                        "name": "PS Eden Space",
                        "peakUsageCommitted": 67108864,
                        "usageUsed": 24330736,
                        "type": "Heap memory",
                        "usageCommitted": 67108864
                },
                {
                        "peakUsageMax": 11010048,
                        "usageMax": 11010048,
                        "peakUsageUsed": 0,
                        "name": "PS Survivor Space",
                        "peakUsageCommitted": 11010048,
                        "usageUsed": 0,
                        "type": "Heap memory",
                        "usageCommitted": 11010048
                },
                {
                        "peakUsageMax": 2863661056,
                        "usageMax": 2863661056,
                        "peakUsageUsed": 0,
                        "name": "PS Old Gen",
                        "peakUsageCommitted": 179306496,
                        "usageUsed": 0,
                        "type": "Heap memory",
                        "usageCommitted": 179306496
                }
        ],
        "processCpuLoad": 0.0008024004394748531,
        "systemCpuLoad": 0.23138430784607697,
        "processCpuTime": 496918000,
        "appId": null,
        "name": "24103@machine01",
        "host": "machine01",
        "processUuid": "3c2ec835-749d-45ea-a7ec-e4b9fe17c23a",
        "tag": "mytag",
        "gc": [
                {
                        "collectionTime": 0,
                        "name": "PS Scavenge",
                        "collectionCount": 0
                },
                {
                        "collectionTime": 0,
                        "name": "PS MarkSweep",
                        "collectionCount": 0
                }
        ]
}

A list of all metrics and information corresponding to them can be found here.

sparkMeasure

Complementing the JVM Profiler is sparkMeasure, a standalone tool that was built to robustly capture Spark job-specific metrics.

Architecture of Spark Task Metrics, Listener Bus, and sparkMeasure (Source)

It is registered as a custom listener and operates by collecting the built-in metrics that Spark exchanges between the driver node and executor nodes. Its standout feature is the ability to collect all metrics supported by Spark, as defined in Spark's official documentation here.
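
For reference, this is roughly what using sparkMeasure's stage-level instrumentation looks like from PySpark; in Iris the listener is registered on the platform side, so end-users do not write this code themselves. The workload below is a placeholder, and the package coordinates are the ones published by the sparkMeasure project.

from pyspark.sql import SparkSession
from sparkmeasure import StageMetrics

# Requires the sparkMeasure jar on the Spark classpath,
# e.g. spark-submit --packages ch.cern.sparkmeasure:spark-measure_2.12:<version>
spark = SparkSession.builder.appName("sparkmeasure-demo").getOrCreate()
stagemetrics = StageMetrics(spark)

stagemetrics.begin()                                   # start collecting stage metrics
spark.range(2000).selectExpr("id % 8 AS k").groupBy("k").count().collect()   # placeholder workload
stagemetrics.end()                                     # stop collecting

stagemetrics.print_report()                            # prints an aggregated report like the one below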

Example stage metrics collected by sparkMeasure (Source code)

Scheduling mode = FIFO

Spark Context default degree of parallelism = 8

Aggregated Spark stage metrics:

numStages => 3
numTasks => 17
elapsedTime => 1291 (1 s)
stageDuration => 1058 (1 s)
executorRunTime => 2774 (3 s)
executorCpuTime => 2004 (2 s)
executorDeserializeTime => 2868 (3 s)
executorDeserializeCpuTime => 1051 (1 s)
resultSerializationTime => 5 (5 ms)
jvmGCTime => 88 (88 ms)
shuffleFetchWaitTime => 0 (0 ms)
shuffleWriteTime => 16 (16 ms)
resultSize => 16091 (15.0 KB)
diskBytesSpilled => 0 (0 Bytes)
memoryBytesSpilled => 0 (0 Bytes)
peakExecutionMemory => 0
recordsRead => 2000
bytesRead => 0 (0 Bytes)
recordsWritten => 0
bytesWritten => 0 (0 Bytes)
shuffleRecordsRead => 8
shuffleTotalBlocksFetched => 8
shuffleLocalBlocksFetched => 8
shuffleRemoteBlocksFetched => 0
shuffleTotalBytesRead => 472 (472 Bytes)
shuffleLocalBytesRead => 472 (472 Bytes)
shuffleRemoteBytesRead => 0 (0 Bytes)
shuffleRemoteBytesReadToDisk => 0 (0 Bytes)
shuffleBytesWritten => 472 (472 Bytes)
shuffleRecordsWritten => 8

Stages and their duration:
Stage 0 duration => 593 (0.6 s)
Stage 1 duration => 416 (0.4 s)
Stage 3 duration => 49 (49 ms)

Data organisation

The architecture of Iris is designed to efficiently route metrics to two key destinations:

  • Real-time datasets: InfluxDB
  • Offline datasets: GrabTech Datalake in AWS

Real-time dataset

Freshness/latency: 5 to 10 seconds

All metrics flowing in through Kafka topics are instantly wired into InfluxDB. A crucial part of this process is handled by Telegraf, a plugin-driven server agent for collecting and sending metrics. Acting as a Kafka consumer, Telegraf listens to each Kafka topic for its corresponding metrics profile, parses the incoming JSON messages, and extracts crucial data points (such as role, hostname, jobname, etc.). Once the data is processed, Telegraf writes it into InfluxDB.
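
Conceptually, that Telegraf step does something like the following sketch, shown here with the kafka-python and influxdb client libraries purely for illustration; the broker address, tag keys, and field selection are assumptions, and the real pipeline uses Telegraf rather than custom code.

import json

from kafka import KafkaConsumer          # pip install kafka-python
from influxdb import InfluxDBClient      # pip install influxdb (1.x client)

consumer = KafkaConsumer(
    "prd-iris-chimera-jvmprofiler-cpuandmemory",            # example topic
    bootstrap_servers="kafka-broker:9092",                  # assumed broker address
    value_deserializer=lambda raw: json.loads(raw.decode("utf-8")),
)
influx = InfluxDBClient(host="influxdb-host", port=8086, database="iris")

for message in consumer:
    event = message.value
    # Extract a few tags and fields from the profiler JSON and write one point per event.
    influx.write_points([{
        "measurement": "CpuAndMemory",
        "tags": {"host": event.get("host"), "tag": event.get("tag")},
        "fields": {
            "processCpuLoad": event.get("processCpuLoad"),
            "heapMemoryTotalUsed": event.get("heapMemoryTotalUsed"),
        },
    }])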

InfluxDB organises the stored data into 'measurements', which are roughly analogous to tables in traditional relational databases.

In Iris’s context, we have structured our real-time data into the following crucial measurements:

  1. CpuAndMemory: This measures CPU and memory-related metrics, giving us insights into resource utilisation by Spark jobs.
  2. I/O: This records input/output metrics, providing data on the reading and writing operations happening during the execution of jobs.
  3. ThreadInfo: This measurement holds data related to job threading, allowing us to monitor concurrency and synchronisation aspects.
  4. application_started and application_ended: These measurements allow us to track Spark application lifecycles, from initiation to completion.
  5. executors_started and executors_removed: These measurements give us a look at the executor dynamics during Spark application execution.
  6. jobs_started and jobs_ended: These provide vital data points relating to the lifecycle of individual Spark jobs within applications.
  7. queries_started and queries_ended: These measurements are designed to track the lifecycle of individual Spark SQL queries.
  8. stage_metrics, stages_started, and stages_ended: These measurements help monitor individual stages within Spark jobs, a valuable resource for tracking job progress and identifying potential bottlenecks.

The real-time data collected in these measurements forms the backbone of Iris's monitoring capabilities, providing an accurate and current picture of Spark job performance.
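
As a small illustration, the CpuAndMemory measurement could be queried from InfluxDB like this; the field and tag names follow the profiler payload shown earlier, and the client connection details are placeholders.

from influxdb import InfluxDBClient   # 1.x Python client

client = InfluxDBClient(host="influxdb-host", port=8086, database="iris")

# Average JVM process CPU load per host over the last hour, in 5-minute buckets.
result = client.query(
    'SELECT MEAN("processCpuLoad") FROM "CpuAndMemory" '
    'WHERE time > now() - 1h GROUP BY time(5m), "host"'
)
for point in result.get_points():
    print(point)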

Offline dataset

Freshness/latency: 1 hour

In addition to real-time data management with InfluxDB, Iris is also responsible for routing metrics to our offline data storage in the Grab Tech Datalake for long-term trend studies, pattern analysis, and anomaly detection.

The metrics from Kafka are periodically synchronised to Amazon S3 tables under the iris schema in the Grab Tech AWS catalogue. This valuable historical data is organised with a one-to-one mapping between the platform or Kafka topic and the table in the iris schema. For example, iris.chimera_jvmprofiler_cpuandmemory maps to the prd-iris-chimera-jvmprofiler-cpuandmemory Kafka topic.


This streamlined organisation means you can write queries to retrieve information from the AWS dataset very similarly to how you would do it from InfluxDB. Whether it's CPU and memory usage, I/O, thread info, or Spark metrics, you can conveniently fetch historical data for your analysis.
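
For example, here is a hedged sketch of an offline query against the historical CPU/memory table; the column names are assumptions based on the metric payload shown earlier.

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("iris-offline-analysis").getOrCreate()

# Daily average process CPU load per host over the last week (assumed column names).
daily_cpu = spark.sql("""
    SELECT to_date(from_unixtime(epochMillis / 1000)) AS day,
           host,
           avg(processCpuLoad)                        AS avg_cpu_load
    FROM   iris.chimera_jvmprofiler_cpuandmemory
    WHERE  to_date(from_unixtime(epochMillis / 1000)) >= date_sub(current_date(), 7)
    GROUP  BY 1, 2
    ORDER  BY 1, 2
""")
daily_cpu.show(10, truncate=False)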

Data visualisation

A well-designed visual representation makes it easier to see patterns, trends, and outliers in groups of data. Iris employs different visualisation tools based on whether the data is real-time or historical.

Real-Time data visualisation – Grafana

Iris uses Grafana for showcasing real-time data. For each platform, two primary dashboards have been set up: JVM metrics and Spark metrics.

JVM metrics dashboard: This dashboard is designed to display information related to the JVM.
Spark metrics dashboard: This dashboard primarily focuses on visualising Spark-specific elements.

Offline data visualisation

While real-time visualisation is crucial for immediate awareness and decision-making, visualising historical data provides invaluable insights about long-term trends, patterns, and anomalies. Developers can query the raw or aggregated data from the Iris tables for their specific analyses.

Moreover, to assist platform owners and end-users in obtaining a quick summary of their job data, we provide built-in dashboards with pre-aggregated visuals. These dashboards contain a wealth of information expressed in an easy-to-understand format. Key metrics include:

  • Total instances
  • Total CPU cores
  • Total memory
  • CPU and memory utilisation
  • Total machine runtimes

    Besides visualisations for individual jobs, we have designed an overview dashboard providing a comprehensive summary of all resources consumed by all ETL jobs. This is particularly useful for platform owners and tech leads, giving them all-encompassing visibility of performance and resource usage across ETL jobs.

    Dashboard for monitoring ETL jobs

    These dashboards’ visuals effectively turn the historical metrics data into clear, comprehensible, and insightful information, guiding users towards objective-driven decision-making.

    Transforming observations into insights

    While our journey with Iris is just in the early stages, we’ve already begun harnessing its ability to transform raw data into concrete insights. The strength of Iris lies not just in its data collection capabilities but also in its potential to analyse and infer patterns from the collated data.

    Currently, we’re experimenting with a job classification model that aims to predict resource allocation efficiency (i.e. identifying jobs as over or under-provisioned). This information, once accurately predicted, can help optimise the usage of resources by fine-tuning the provisions for each job. While this model is still in its early stages of testing and lacks sufficient validation data, it exemplifies the direction we’re heading – integrating advanced analytics with operational observability.

    As we continue to refine Iris and develop more models, our aim is to empower users with deep insights into their Spark applications. These insights can potentially identify bottlenecks, optimise resource allocation and ultimately, enhance overall performance. In the long run, we see Iris evolving from being a data collection tool to a platform that can provide actionable recommendations and enable data-driven decision-making.

    Job classification feature set

    At the core of our job classification model, there are two carefully selected metrics:

    1. CPU cores per hour: This represents the number of tasks a job can handle concurrently in a given hour. A higher number would mean more tasks being processed simultaneously.

    2. Total Terabytes of data input per core: This considers only the input from the underlying HDFS/S3 input, excluding shuffle data. It represents the volume of data one CPU core needs to process. A larger input would mean more CPUs are required to complete the job in a reasonable timeframe.
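
    Below is a minimal, hedged sketch of how these two features might be computed for each job run. The column names, sample values, and the exact definition of "CPU cores per hour" are assumptions for illustration only; in practice the features are derived from the Iris tables described earlier.

    import pandas as pd

    # One row per job run; column names and values are assumptions for this sketch.
    jobs = pd.DataFrame({
        "job_name":        ["job_a", "job_b"],
        "total_cpu_cores": [64, 16],          # cores allocated to the run
        "runtime_hours":   [0.5, 4.0],        # wall-clock runtime in hours
        "input_bytes":     [2.0e12, 2.0e11],  # bytes read from HDFS/S3, shuffle excluded
    })

    # Feature 1: one plausible reading of "CPU cores per hour".
    jobs["cpu_cores_per_hour"] = jobs["total_cpu_cores"] / jobs["runtime_hours"]

    # Feature 2: terabytes of input data per allocated core.
    jobs["input_tb_per_core"] = (jobs["input_bytes"] / 1e12) / jobs["total_cpu_cores"]

    print(jobs[["job_name", "cpu_cores_per_hour", "input_tb_per_core"]])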

    The choice of these two metrics for building feature sets is based on a nuanced understanding of Spark job dynamics:

  • Allocating the right CPU cores is crucial as a higher number of cores means more tasks being processed concurrently. This is especially important for jobs with larger input data and more partitioned files, as they often require more concurrent processing capacity, hence, more CPU cores.
  • The total data input helps to estimate the data processing load of a job. A job tasked with processing a high volume of input data but assigned low CPU cores might be under-provisioned and result in an extended runtime.

  • As for CPU and memory utilisation, while it could offer useful insights, we’ve found it may not always contribute to predicting if a job is over or under-provisioned because utilisation can vary run-to-run. Thus, to keep our feature set robust and consistent, we primarily focus on CPU cores per hour and total terabytes of input data.

    With these metrics as our foundation, we are developing models that can classify jobs into over-provisioned or under-provisioned, helping us optimise resource allocation and improve job performance in the long run.

    As always, treat any information related to our job classification feature set and the insights derived from it with utmost care for data confidentiality and integrity.

    We’d like to reiterate that these models are still in the early stages of testing and we are constantly working to enhance their predictive accuracy. The true value of this model will be unlocked as it is refined and as we gather more validation data.

    Model training and optimisation

    Choosing the right model is crucial for deriving meaningful insights from datasets. We decided to start with a simple, yet powerful algorithm – K-means clustering, for job classification. K-means is a type of unsupervised machine learning algorithm used to classify items into groups (or clusters) based on their features.

    Here is our process:

    1. Model exploration: We began by exploring the K-means algorithm using a small dataset for validation.
    2. Platform-specific cluster numbers: To account for the uniqueness of every platform, we ran a Score Test (an evaluation method to determine the optimal number of clusters) for each platform. The derived optimal number of clusters is then used in the monthly job for that respective platform’s data.
    3. Set up a scheduled job: After ensuring the code was functioning correctly, we set up a job to run the model on a monthly schedule. Monthly re-training was chosen to encapsulate possible changes in the data patterns over time.
    4. Model saving and utilisation: The trained model is saved to our S3 bucket and used to classify jobs as over-provisioned or under-provisioned based on the daily job runs.

    This iterative learning approach, through which our model learns from an ever-increasing pool of historical data, helps maintain its relevance and improve its accuracy over time.
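
    A simplified sketch of this monthly training flow is shown below, using scikit-learn's K-means. The feature columns match the earlier sketch, the "Score Test" is approximated here with the silhouette score, and the cluster range, file paths, and S3 bucket are placeholders rather than the actual Iris pipeline.

    import boto3
    import joblib
    import pandas as pd
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score
    from sklearn.preprocessing import StandardScaler

    # Placeholder input: one row per job with the two features sketched earlier.
    features = pd.read_parquet("s3://example-iris-bucket/features/platform=chimera/")
    X = StandardScaler().fit_transform(features[["cpu_cores_per_hour", "input_tb_per_core"]])

    # "Score Test" approximated with the silhouette score: pick the best cluster count per platform.
    best_k = max(
        range(2, 8),
        key=lambda k: silhouette_score(
            X, KMeans(n_clusters=k, n_init=10, random_state=42).fit_predict(X)
        ),
    )

    model = KMeans(n_clusters=best_k, n_init=10, random_state=42).fit(X)
    features["cluster"] = model.labels_

    # Persist the monthly model so the daily classification runs can reuse it.
    joblib.dump(model, "/tmp/kmeans_chimera.joblib")
    boto3.client("s3").upload_file(
        "/tmp/kmeans_chimera.joblib", "example-iris-bucket", "models/kmeans_chimera.joblib"
    )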

    Here is an example output from a Databricks training run:

  • Blue/green group: input per core is too large but CPU per hour is small, so the job may take a long time to complete.
  • Purple group: input per core is too small but CPU per hour is too high, so there may be a lot of wasted CPU here.
  • Yellow group: this is likely the ideal group, where neither input per core nor CPU per hour is high.

    Keep in mind that the classification insights provided by our K-means model are still in the experimental stage. As we continue to refine the approach, the reliability of these insights is expected to grow, providing increasingly valuable direction for resource allocation optimisation.

    Seeing Iris in action

    This section provides practical examples and real-case scenarios that demonstrate Iris’s capacity for delivering insights from ETL job observations.

    Case study 1: Spark benchmarking

    From August to September 2023, we carried out a Spark benchmarking exercise to measure and compare the cost and performance of Grab's Spark platforms: open-source Spark on Kubernetes (Chimera), Databricks, and AWS EMR. Since each platform has its own way of measuring a job's performance and cost, Iris was used to collect the necessary Spark metrics to calculate the cost of each job. Iris also collected many other metrics, such as CPU and memory utilisation and runtime, to compare the platforms' performance.

    Case study 2: Improving Databricks Infra Cost Unit (DBIU) Accuracy with Iris

    Being able to accurately calculate and fairly distribute Databricks infrastructure costs has always been a challenge, primarily due to difficulties in distinguishing between on-demand and Spot instance usage. This was further complicated by two conditions:

    • Fallback to on-demand instances: Databricks has a feature that automatically falls back to on-demand instances when Spot instances are not readily available. While beneficial for job execution, this feature has traditionally made it difficult to accurately track per-job Spot vs. on-demand usage.
    • User configurable hybrid policy: Users can specify a mix of on-demand and Spot instances for their jobs. This flexible, hybrid approach often results in complex, non-uniform usage patterns, further complicating cost categorisation.

    Iris has made a key difference in resolving these dilemmas. By providing granular, instance-level metrics including whether each instance is on-demand or Spot, Iris has greatly improved our visibility into per-job instance usage.

    This precise data enables us to isolate the on-demand instance usage, which was previously bundled in the total cost calculation. Similarly, it allows us to accurately gauge and consider the usage ratio of on-demand instances in hybrid policy scenarios.

    The enhanced transparency provided by Iris metrics allows us to standardise DBIU cost calculations, making them fairer for users who mostly or exclusively use Spot instances. In other words, users pay more if they intentionally choose, or fall back to, on-demand instances for their jobs.

    The practical application of Iris in enhancing DBIU accuracy illustrates its potential in driving data-informed decisions and fostering fairness in resource usage and cost distribution.

    Case study 3: Optimising job configuration for better performance and cost efficiency

    One of the key utilities of Iris is its potential to assist with job optimisation. For instance, we have been able to pinpoint jobs that were consistently over-provisioned and work with end-users to tune their job configurations.

    Through this exercise and continuous monitoring, we’ve seen substantial results from the job optimisations:

  • Cost reductions ranging from 20% to 50% for most jobs.
  • Positive feedback from users about improvements in job performance and cost efficiency.

    Interestingly, our analysis also led us to identify the following patterns, which could be leveraged to widen the impact of our optimisation efforts across multiple use cases on our platforms:

    Pattern and recommendation:

  • Pattern: job duration < 20 minutes, input per core < 1 GB, and total instances used is 2x/3x the max worker nodes. Recommendation: use a fixed number of worker nodes, potentially speeding up performance and certainly reducing costs.
  • Pattern: CPU utilisation < 25%. Recommendation: cut max workers in half (e.g. 10 to 5 workers) or downgrade the instance size by half (e.g. 4xlarge -> 2xlarge).
  • Pattern: the job has a lot of shuffle. Recommendation: bump the instance size and reduce the number of workers (e.g. bump 2xlarge -> 4xlarge and reduce the number of workers from 100 -> 50).

    However, we acknowledge that these findings may not apply uniformly to every instance. The optimisation recommendations derived from these patterns might not yield the desired outcomes in all cases.

    The future of Iris

    Building upon its firm foundation as a robust Spark observability tool, we envision a future for Iris wherein it not only monitors metrics but provides actionable insights, discerns usage patterns, and drives predictions.

    Our plans to make Iris more accessible include developing API endpoints for platform teams to query performance by job name. Another addition we're aiming for is the ability for Iris to provide resource tuning recommendations. By making platform-specific and job-specific recommendations easily accessible, we hope to assist platform teams in making informed, data-driven decisions on resource allocation and cost efficiency.

    We’re also looking to expand Iris’s capabilities with the development of a listener for Presto jobs, similar to the sparkMeasure tool currently used for Spark jobs. The listener would provide valuable metrics and insights into the performance of Presto jobs, opening up new avenues for optimisation and cost management.

    Another major focus will be building a feedback loop for Iris to further enhance accuracy, continually refine its models, and improve insights provided. This effort would greatly benefit from the close collaboration and inputs from platform teams and other tech leads, as their expertise aids in interpreting Iris’s metrics and predictions and validating its meaningfulness.

    In conclusion, as Iris continues to develop and mature, we foresee it evolving into a crucial tool for data-driven decision-making and proactive management of Spark applications, playing a significant role in the efficient usage of cloud computing resources.

    Conclusion

    The role of Iris as an observability tool for Spark jobs in the world of Big Data is rapidly evolving. Iris has proven to be more than a simple data collection tool; it is a platform that integrates advanced analytics with operational observability.

    Even though Iris is in its early stages, it’s already been instrumental in creating detailed visualisations of both real-time and historical data from varied platforms. Besides that, Iris has started making strides in its journey towards using machine learning models like K-means clustering to classify jobs, demonstrating its potential in helping operators fine-tune resource allocation.

    Using instance-level metrics, Iris is helping improve cost distribution fairness and accuracy, making it a potent tool for resource optimisation. Furthermore, the successful case study of reducing job costs and enhancing performance through resource reallocation provides a promising outlook into Iris’s future applicability.

    With ongoing development plans, such as the Presto listener and the creation of endpoints for broader accessibility, Iris is poised to become an integral tool for data-informed decision-making. As we strive to enhance Iris, we will continue to collaborate with platform teams and tech leads whose feedback is invaluable in fulfilling Iris’s potential.

    Our journey with Iris is a testament to Grab’s commitment to creating a data-informed and efficient cloud computing environment. Iris, with its observed and planned capabilities, is on its way to revolutionising the way resource allocation is managed and optimised.

    Join us

    Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

    Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

    Metrics for issues, pull requests, and discussions

    Post Syndicated from Zack Koppert original https://github.blog/2023-07-19-metrics-for-issues-pull-requests-and-discussions/

    Data-driven insights

    At GitHub, we believe that data-driven insights are the keys to success for any software development project. Understanding the health and progress of your issues, pull requests, and discussions is crucial for effective collaboration, maintainership, and project management.

    That is why we’re excited to announce the release of the Issue Metrics GitHub Action, a powerful tool that empowers developers and teams to measure key metrics and gain valuable insights into their projects.

    With the new Issue Metrics GitHub Action, you can now easily track and monitor important metrics related to issues, pull requests, and discussions, such as time to first response, time to close, and more for any given time period.

    Whether you’re an individual developer, a small team, or a large organization, these metrics will help you gauge the overall health, progress, and engagement of your projects.

    Sample report

    A sample report showing 2 tables. The first table contains overall metrics like average time to first response, and a corresponding value of 50 minutes and 44 seconds. The second table contains a list of the issues measured, with links to the issue and the metrics as measured on the individual issue.

    Common use cases

    Maintainers: ensuring proper attention

    As a maintainer, it is essential to give reasonable attention to the issues and pull requests in the repositories you maintain. With the Issue Metrics GitHub Action, you can track metrics, such as the number of open issues, closed issues, open pull requests, and merged pull requests.

    These metrics can provide you with a clear overview of the workload for a project over a given week, month, or even year. The action can also allow you to consider how you or your team prioritize time and attention effectively while also highlighting potentially overlooked requests in need of attention.

    First responders: timely user contact

    As a first responder in a repository, it’s part of the job description to ensure that users receive contact in a reasonable amount of time. By utilizing the Issue Metrics GitHub Action, you can keep track of metrics like the number of discussions awaiting replies, unresolved issues, or pull requests waiting for reviews. These metrics enable you to maintain a high level of responsiveness, fostering a positive user experience and timely problem resolution. These can be used to build a to-do list or retrospectively to reflect on how long users had to wait for a response during a given time period.

    Open Source Program Office (OSPO): streamlining open source requests

    An important part of what OSPOs do is making the open source release process easy and efficient while adhering to company policy. This process usually involves employees opening an issue, pull request, or discussion. With the Issue Metrics GitHub Action, OSPOs can gain valuable insights into the number of requests, the ratio of open to closed requests, and metrics related to the time it takes to navigate the open-source process to completion.

    These metrics empower you to streamline your workflows, optimize response times, and ensure a smooth open-source collaboration experience. Optimizing the open source release process encourages employees to continue to produce open source projects on the organization’s behalf.

    Product development teams: optimizing pull request reviews

    Product development teams rely heavily on the code review process to collaborate and build high-quality software. By leveraging the Issue Metrics GitHub Action, teams can measure metrics such as the time it takes to get pull request reviews. These insights allow you to reflect on the data during retrospectives, identify areas for improvement, and optimize the review process to enhance team collaboration and accelerate development cycles.

    Certain aspects of efficiency and flow may be hard to measure but often it is possible to spot and remove inefficiencies in the value stream.

    – Forsgren et al. 2021

    Setup and workflow integration

    Setting up the Issue Metrics GitHub Action takes a few minutes, compared to the few hours it takes to calculate these metrics manually. You also only need to set up the action once, and it will run on a regular basis of your own choosing. It integrates into your existing GitHub Actions workflow or you can create a new workflow specifically for metrics tracking.

    The action provides a wide range of customizable options, allowing you to tailor the issues, pull requests, and discussions measured by utilizing GitHub's powerful search filtering. Ready-to-use configurations have been tested and used internally at GitHub and are now available for you to try out as well.

    Here is one such example that runs monthly to report on metrics for issues created last month:

    name: Monthly issue metrics
    on:
      workflow_dispatch:
      schedule:
        - cron: '3 2 1 * *'
    
    jobs:
      build:
        name: issue metrics
        runs-on: ubuntu-latest
    
        steps:
    
        - name: Get dates for last month
          shell: bash
          run: |
            # Get the current date
            current_date=$(date +'%Y-%m-%d')
    
            # Calculate the previous month
            previous_date=$(date -d "$current_date -1 month" +'%Y-%m-%d')
    
            # Extract the year and month from the previous date
            previous_year=$(date -d "$previous_date" +'%Y')
            previous_month=$(date -d "$previous_date" +'%m')
    
            # Calculate the first day of the previous month
            first_day=$(date -d "$previous_year-$previous_month-01" +'%Y-%m-%d')
    
            # Calculate the last day of the previous month
            last_day=$(date -d "$first_day +1 month -1 day" +'%Y-%m-%d')
    
            echo "$first_day..$last_day"
            echo "last_month=$first_day..$last_day" >> "$GITHUB_ENV"
    
        - name: Run issue-metrics tool
          uses: github/issue-metrics@v2
          env:
            GH_TOKEN: ${{ secrets.GH_TOKEN }}
            SEARCH_QUERY: 'repo:owner/repo is:issue created:${{ env.last_month }} -reason:"not planned"'
    
        - name: Create issue
          uses: peter-evans/create-issue-from-file@v4
          with:
            title: Monthly issue metrics report
            content-filepath: ./issue_metrics.md
            assignees: <YOUR_GITHUB_HANDLE_HERE>
    

    Ready to start leveling up your GitHub project management?

    Head over to the Issue Metrics GitHub Action repository to explore the documentation, installation instructions, and examples. The repository provides a comprehensive README file that guides you through the setup process and showcases the wide range of metrics you can measure. If you need additional help, feel free to open an issue in the repository.

    GitHub is committed to providing developers with the best tools to enhance collaboration and productivity. The Issue Metrics GitHub Action is a significant step towards empowering teams to measure key metrics related to issues, pull requests, and discussions. By gaining valuable insights into the pulse of your projects, you can drive continuous improvement and deliver exceptional software. We are using this in several places internally across GitHub to help us continually improve and hope this action can help you as well. Happy coding!

    Amazon SES – How to track email deliverability to domain level with CloudWatch

    Post Syndicated from Alaa Hammad original https://aws.amazon.com/blogs/messaging-and-targeting/amazon-ses-how-to-track-email-deliverability-to-domain-level-with-cloudwatch/

    Why is it important to track email deliverability per domain with Amazon Simple Email Service (SES)?

    Amazon Simple Email Service (Amazon SES) is a scalable cloud email service provider that enables businesses to build a large-scale email solution and host multiple domains from the same SES account for different purposes, for example: one domain for sending marketing emails such as special offers, another domain for transactional emails such as order confirmations, and others for correspondence such as newsletters.

    As your product, service, or solution built on Amazon SES grows and you have multiple verified domains, it is important to track email deliverability for the emails you send from each domain for business continuity, billing purposes, and incident investigations. This can help you identify whether a business domain has low email deliverability, or whether a domain is generating high bounce or complaint rates, so you can take proactive action before it impacts the account's ability to send emails from any other domain.

    SES offers features that automatically manage deliverability per domain through Virtual Deliverability Manager. Virtual Deliverability Manager helps enhance email deliverability and provides insights into sending and delivery data, as well as offering solutions to fix negative email sending reputation. You can learn more about Virtual Deliverability Manager here.

    Solution Walkthrough

    Amazon SES provides a way to monitor sender reputation metrics such as bounce and complaint rates per account or configuration sets using event publishing. This blog will discuss how you can use Amazon SES message auto-tags to monitor and publish email deliverability events (Send, Delivery, Bounce, Complaints) to CloudWatch custom metrics per domain. In addition, you will see how to create a custom CloudWatch dashboard that’s easy to access in a single view to monitor your domain metrics. This CloudWatch dashboard can help to provide guidance for your team members during operational events about how to respond to specific incidents for your sending domain.

    What are Amazon SES Auto-Tags:

    Message tags are a form of name/value pairs to categorize the email you are sending. For example, if you advertise books, you could name a message tag general, and assign a value of sci-fi or western, when you send an email for the associated campaign. Depending on which email sending interface you use, you can provide the message tag as a parameter to the API call (SendEmail, SendRawEmail) or as an Amazon SES-specific email header.

    In addition to the message tags you add to any emails you send, Amazon SES adds a set of Auto-Tags that are automatically included in any emails you send. You don’t need to pass the parameters of the auto-tags to the API call or email headers since SES does this automatically.

    The auto-tags in the list below are used to track email deliverability for specific events (e.g. Send, Delivery, Bounce, Complaint). SES does this by using the auto-tag's name/value pair as a dimension in a CloudWatch metric to track the count of events for that auto-tag. This blog post uses the "ses:from-domain" auto-tag to configure event publishing, tracking and publishing the email deliverability events (Send, Delivery, Bounce, Complaints) you receive per domain to CloudWatch metrics and a CloudWatch dashboard.

    Amazon SES auto-tags added to messages you send

    Prerequisites:

    For this walkthrough, you should have the following prerequisites:

    Configure Amazon SES to publish email deliverability events to CloudWatch destination:

    To configure event publishing for tracking email deliverability events, you first need to create a configuration set. Configuration sets in SES are groups of rules that you can apply to your verified identities. When you apply a configuration set to an email, all of the rules in that configuration set are applied to the email.

    After your configuration set is created, you need to create an Amazon SES event destination. Amazon SES will send all the email deliverability events you intend to track to this event destination. In this blog, the event destination is Amazon CloudWatch.

      1. Sign in to the Amazon SES console.
      2. In the navigation pane, under Configuration, choose Configuration sets. Choose Create set.
      3. Enter a Configuration set name, leave the rest of the fields at their defaults, scroll to the end, and click Create set.
      4. On the configuration set home page, click the Event destinations tab and select Add destination.
         Add SES event destination to configuration set
      5. Under Select event types, check the Sends, Deliveries, Hard bounces, and Complaints boxes and click Next.
         Selecting event types to track
      6. Under Specify destination, select Amazon CloudWatch.
         Select event destination as Amazon CloudWatch
      7. Name – enter the name of the destination for this configuration set. The name can include letters, numbers, dashes, and hyphens (example: Tracking_per_Domain).
      8. Under Amazon CloudWatch dimensions, select Value source: Message tag, Dimension name: ses:from-domain, and Default value: example.com (add the verified domain name you want to track), as shown below:
         Add message auto-tag as CloudWatch dimension to track
      9. Review. When you are satisfied that your entries are correct, click Add destination to add your event destination.

    Send a test email via the Amazon SES mailbox simulator to trigger events in CloudWatch custom metrics

    After you select Amazon CloudWatch as the event destination, CloudWatch will create a custom metric with the auto-tag dimension and value you chose. For this custom metric to appear in the CloudWatch console, you must send an email to trigger each selected event. We recommend using the Amazon SES Mailbox Simulator to avoid generating real bounces or complaints that could impact your account's reputation.

    The section below shows how to send these test emails to the following recipients manually using the AWS CLI. If you would like to use the console to send them instead, you will need to send three separate test emails, since the console only allows one recipient per message:

    Amazon SES Mailbox Simulator recipients to trigger the events in CloudWatch metrics:
    success@simulator.amazonses.com
    bounce@simulator.amazonses.com
    complaint@simulator.amazonses.com

    Note: You must pass the name of the configuration set when sending an email. This can be done by either specifying the configuration set name in the headers of emails, or specifying it as a default configuration set. This can be done at the time of identity creation, or later while editing a verified identity.

    The following example uses the send-email CLI command to send a formatted email to the Amazon SES simulator recipients:

    Before you run any commands, set your default credentials by following Configuring the AWS CLI. The IAM user must have the "ses:SendEmail" permission to send email.

    1. Navigate to your terminal where the AWS CLI is installed and configured. Create a message.json file for the message to send and add the following content:

       {
         "Subject": {
           "Data": "Testing CW events with email simulator",
           "Charset": "UTF-8"
         },
         "Body": {
           "Text": {
             "Data": "This is the message body for testing CW events with the email simulator.",
             "Charset": "UTF-8"
           }
         }
       }

    2. Create a destination.json file to add the Amazon SES simulator recipients for the bounce, complaint, and delivery events as shown below:

       {
         "ToAddresses": ["success@simulator.amazonses.com", "bounce@simulator.amazonses.com", "complaint@simulator.amazonses.com"]
       }

    3. Send a test email using the send-email CLI command to send a formatted email to the Amazon SES simulator recipients:

       aws ses send-email --from sender@example.com --destination file://destination.json --message file://message.json --configuration-set-name SES_Config_Set --region <AWS Region>

    4. After the message is sent, you should see output similar to the following:

       {
         "MessageId": "EXAMPLEf3a5efcd1-51adec81-d2a4-4e3f-9fe2-5d85c1b23783-000000"
       }

    Now that you have sent a test email to trigger the events you want to track in CloudWatch custom metrics, let's create the CloudWatch dashboard to see those metrics.

    Create a CloudWatch dashboard to track the email deliverability events for your domain

    1. Sign in to the Amazon CloudWatch console.
    2. In the navigation pane, choose Dashboards, and then choose Create dashboard.
    3. In the Create new dashboard dialog box, enter a name like 'CW_Domain_Tracking' for the dashboard, and then choose Create dashboard.
    4. In the Add widget dialog box, choose Number to add a number displaying a metric to the dashboard, and then choose Next.
    5. Under Add metric graph, click the edit icon to rename the graph with your domain, example.com. This makes it easier to select the right dashboard if you have multiple domains.
    6. In the Browse tab, select the AWS Region where you are running your SES account and, in the search bar, search for "ses:from-domain".
    7. You will get four metrics returned with your domain name "example.com". Select the checkbox beside each of the four metrics and click Create widget.
       CloudWatch dashboard with the metrics
    8. Choose Save dashboard in the top-right corner of the dashboard page to save the widget settings.
       Save CloudWatch dashboard settings

    After the CloudWatch dashboard is created, any email you send from the example.com domain with the configuration set name passed in the email headers will have its deliverability events counted in your CloudWatch metrics, and you will be able to see them in the CloudWatch dashboard.

    As an additional step, you can also set up CloudWatch alarms for these custom metrics and add a threshold for each metric. When a metric breaches its threshold, the alarm fires and sends an SNS notification so you can take the necessary actions.
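
    As a hedged sketch, such an alarm could be created with boto3 as shown below. The namespace, metric name, threshold, and SNS topic ARN are assumptions to adapt; confirm the exact metric details in the CloudWatch console where the metrics from your SES event destination appear.

    import boto3

    cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")  # use your SES Region

    cloudwatch.put_metric_alarm(
        AlarmName="ses-example-com-bounce-alarm",
        # Assumed namespace/metric name; confirm them in the CloudWatch console
        # where the metrics from your SES event destination appear.
        Namespace="AWS/SES",
        MetricName="Bounce",
        Dimensions=[{"Name": "ses:from-domain", "Value": "example.com"}],
        Statistic="Sum",
        Period=3600,                 # evaluate bounces per hour
        EvaluationPeriods=1,
        Threshold=5,                 # example threshold: more than 5 bounces in an hour
        ComparisonOperator="GreaterThanThreshold",
        TreatMissingData="notBreaching",
        AlarmActions=["arn:aws:sns:us-east-1:123456789012:ses-deliverability-alerts"],
    )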

    Cleaning Up:

    This setup includes Amazon CloudWatch and Amazon SES service charges. To avoid incurring any extra charges, remember to delete any resources created manually if you no longer need them for monitoring.

    Resources to delete from Amazon SES console.

    1. In the navigation pane, under Configuration, choose Configuration sets.
    2. Check the box beside Configuration set you created and select Delete.

    Resources to delete from Amazon CloudWatch console.

    1. In the navigation pane, choose Dashboards, and then choose the dashboard you created.
    2. In the upper-right corner of the graph that you want to remove, choose Actions, and then choose Delete Dashboard.
    3. Save dashboard.

    Conclusion:

    You have now seen how to configure Amazon SES to track email deliverability at the domain level with a CloudWatch dashboard. Tracking email deliverability for the emails you send from each domain is essential for business continuity, billing purposes, and incident investigations. Using SES message auto-tags and CloudWatch metrics, you can quickly identify domains with low email deliverability and take the necessary actions to maximise deliverability, acting proactively before the account's ability to send emails from any other domain is impacted.

    About the author:

    Alaa Hammad

    Alaa Hammad is a Senior Cloud Support Engineer at AWS and a subject matter expert in Amazon Simple Email Service and AWS Backup. She has 10 years of diverse experience supporting enterprise customers across different industries. She enjoys cooking and trying new recipes from different cuisines.

    Are you measuring what matters? A fresh look at Time To First Byte

    Post Syndicated from Sam Marsh original http://blog.cloudflare.com/ttfb-is-not-what-it-used-to-be/

    Today, we’re making the case for why Time To First Byte (TTFB) is not a good metric for evaluating how fast web pages load. There are better metrics out there that give a more accurate representation of how well a server or content delivery network performs for end users. In this blog, we’ll go over the ambiguity of measuring TTFB, touch on more meaningful metrics such as Core Web Vitals that should be used instead, and finish on scenarios where TTFB still makes sense to measure.

    Many of our customers ask what the best way would be to evaluate how well a network like ours works. This is a good question! Measuring performance is difficult. It’s easy to simplify the question to “How close is Cloudflare to end users?” The predominant metric that’s been used to measure that is round trip time (RTT). This is the time it takes for one network packet to travel from an end user to Cloudflare and back. We measure this metric and mention it from time to time: Cloudflare has an average RTT of 50 milliseconds for 95% of the Internet-connected population.

    Whilst RTT is a relatively good indicator of the quality of a network, it doesn't necessarily tell you that much about how good it is at actually delivering websites to end users. For instance, what if the web server is really slow? A user might be very close to the data center that serves the traffic, but if it takes a long time to actually grab the asset from disk and serve it, the result will still be a poor experience.

    This is where TTFB comes in. It measures the time it takes between a request being sent from an end user until the very first byte of the response being received. This sounds great on paper! However it doesn’t capture how a webpage or web application loads, and what happens after the first byte is received.

    In this blog we’ll cover what TTFB is a good indicator of, what it's not great for, and what you should be using instead.

    What is TTFB?

    TTFB is a metric which reports the duration between sending the request from the client to a server for a given file, and the receipt of the first byte of said file. For example, if you were to download the Cloudflare logo from our website the TTFB would be how long it took to receive the first byte of that image. Similarly, if you were to measure the TTFB of a request to cloudflare.com the metric would return the TTFB of how long it took from request to receiving the first byte of the first HTTP response. Not how long it took for the image to be fully visible or for the web page to be loaded in a state that allowed a user to begin using it.

    The simplest answer therefore is to look at the diametrically opposite measurement, Time to Last Byte (TTLB). TTLB, as you’d expect, measures how long it takes until the last byte of data is received from the server. For the Cloudflare logo file this would make sense, as until the image is fully downloaded it's not exactly useful. But what about for a webpage? Do you really need to wait until every single file is fully downloaded, even those images at the bottom of the page you can't immediately see? TTLB is fine for measuring how long it took to download a single file from a CDN / server. However for multi-faceted traffic, like web pages, it is too conservative, as it doesn’t tell you how long it took for the web page to be usable.
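
    To make the two definitions concrete, here is a rough Python sketch that measures TTFB and TTLB for a single HTTPS request. The host is a placeholder, and the timings include DNS, TCP, and TLS setup, which is consistent with how TTFB is usually reported.

    import socket
    import ssl
    import time

    host = "example.com"  # placeholder target
    request = f"GET / HTTP/1.1\r\nHost: {host}\r\nConnection: close\r\n\r\n".encode()

    start = time.monotonic()
    raw_sock = socket.create_connection((host, 443))
    tls_sock = ssl.create_default_context().wrap_socket(raw_sock, server_hostname=host)
    tls_sock.sendall(request)

    first_byte = tls_sock.recv(1)     # first byte of the HTTP response stops the TTFB clock
    ttfb = time.monotonic() - start

    while tls_sock.recv(65536):       # drain the rest of the response
        pass
    ttlb = time.monotonic() - start   # last byte received stops the TTLB clock
    tls_sock.close()

    print(f"TTFB: {ttfb * 1000:.0f} ms, TTLB: {ttlb * 1000:.0f} ms")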

    As an analogy, we can look at measuring how long it takes to process an incoming airplane full of passengers. What's important is to understand how long it takes for those passengers to disembark, pass through passport control, collect their baggage, and leave the terminal, assuming they have no onward journeys. TTFB would measure success as how long it took to get the first passenger off of the airplane. TTLB would measure how long it took the last passenger to leave the terminal, even if this passenger remained in the terminal for hours afterwards due to passport issues or getting lost. Neither is a good measure of success for the airline.

    Why TTFB doesn't make sense

    TTFB is a widely-used metric because it is easy-to-understand and it is a great signal for connection setup time, server time and network latency. It can help website owners identify when performance issues originate from their server. But is TTFB a good signal for how real users experience the loading speed of a web page in a browser?

    When a web page loads in a browser, the user’s perception of speed isn’t related to the moment the browser first receives bytes of data. It is related to when the user starts to see the page rendering on the screen.

    The loading of a web page in a browser is a very complex process. Almost all of this process happens after TTFB is reported. After the first byte has been received, the browser still has to load the main HTML file. It also has to load fonts, stylesheets, javascript, images and other resources. Often these resources link to other resources that also must be downloaded. Often these resources entirely block the rendering of the page. Alongside all these downloads, the browser is also parsing the HTML, CSS and JavaScript. It is building data structures that represent the content of the web page as well as how it is styled. All of this is in preparation to start rendering the final page onto the screen for the user.

    When the user starts seeing the web page actually rendered on the screen, TTFB has become a distant memory for the browser. For a metric that signals the loading speed as perceived by the user, TTFB falls dramatically short.

    Receiving the first byte isn't sufficient to determine a good end user experience as most pages have additional render blocking resources that get loaded after the initial document request. Render-blocking resources are scripts, stylesheets, and HTML imports that prevent a web page from loading quickly. From a TTFB perspective it means the client could stop the ‘TTFB clock’ on receipt of the first byte of one of these files, but the web browser is blocked from showing anything to the user until the remaining critical assets are downloaded.

    This is because browsers need instructions for what to render and what resources need to be fetched to complete “painting” a given web page. These instructions come from a server response. But the servers sending these responses often need time to compile these resources — this is known as “server think time.” While the servers are busy during this time… browsers sit idle and wait. And the TTFB counter goes up.

    There have been a number of attempts over the years to benefit from this “think time”. First came Server Push, which was superseded last year by Early Hints. Early Hints take advantage of “server think time” to asynchronously send instructions to the browser to begin loading resources while the origin server is compiling the full response. By sending these hints to a browser before the full response is prepared, the browser can figure out what it needs to do to load the webpage faster for the end user. It also stops the TTFB clock, meaning a lower TTFB. This helps ensure the browser gets the critical files sooner to begin loading the webpage, and it also means the first byte is delivered sooner as there is no waiting on the server for the whole dataset to be prepared and ready to send. Even with Early Hints, though, TTFB doesn’t accurately define how long it took the web page to be in a usable state.


TTFB also does not take into account the multiplexing benefits of HTTP/2 and HTTP/3, which allow browsers to load files in parallel. Nor does it take into account compression on the origin, which can result in a higher TTFB but a quicker page load overall, because the server spends time compressing the assets and then sends them in a smaller format over the network.

Cloudflare offers many features that can improve the loading speed of a website but don't necessarily impact the TTFB score. These features include Zaraz, Rocket Loader, HTTP/2 and HTTP/3 Prioritization, Mirage, Polish, Image Resizing, Auto Minify and Cache. They improve the loading time of a webpage through a series of enhancements, from image optimization and compression to render-blocking elimination and sending assets from the server to the browser in the best possible order.

More comprehensive metrics are required to illustrate the full loading process of a web page, and the benefit provided by these features. This is where Real User Monitoring helps. At Cloudflare we are all-in on Real User Monitoring (RUM) as the future of website performance. We're investing heavily in it, both from an observation point of view and from an optimization one.

    For those unfamiliar with RUM, we typically optimize websites for three main metrics – known as the “Core Web Vitals”. This is a set of key metrics which are believed to be the best and most accurate representation of a poorly performing website vs a well performing one. These key metrics are Largest Contentful Paint, First Input Delay and Cumulative Layout Shift.

Core Web Vitals thresholds. Source: https://addyosmani.com/blog/web-vitals-extension/

LCP measures loading performance: typically, how long it takes to load the largest image or text block visible in the browser. FID measures interactivity, for example the time between when a user clicks or taps on a button and when the browser responds and starts doing something. Finally, CLS measures visual stability. A good (or rather, bad) example of CLS is when you go to a website on your mobile phone, tap on a link and the page shifts at the last second, so you tap something you didn't want to. That would result in a poor CLS score, as it's a poor user experience.

    Looking at these metrics gives us a good idea of how the end user is truly experiencing your website (RUM) vs. how quickly the first byte of the file was retrieved from the nearest Cloudflare data center (TTFB).
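
As a reference point, here is a small sketch that rates a single page view against the commonly published "good" thresholds for these metrics (2.5 s for LCP, 100 ms for FID, 0.1 for CLS, and 800 ms for TTFB). The thresholds are the general Web Vitals guidance, not anything specific to Cloudflare, and the field names are illustrative.

# Commonly published "good" thresholds for Core Web Vitals and TTFB
THRESHOLDS = {
    "ttfb_ms": 800,   # Time To First Byte
    "lcp_ms": 2500,   # Largest Contentful Paint
    "fid_ms": 100,    # First Input Delay
    "cls": 0.1,       # Cumulative Layout Shift (unitless)
}

def rate_page_view(metrics: dict[str, float]) -> dict[str, bool]:
    """Return, per metric, whether this page view falls in the 'good' bucket."""
    return {name: metrics[name] <= limit for name, limit in THRESHOLDS.items()}

# Hypothetical page view: fast first byte, but a slow render and a large layout shift
view = {"ttfb_ms": 350, "lcp_ms": 4200, "fid_ms": 80, "cls": 0.32}
print(rate_page_view(view))
# {'ttfb_ms': True, 'lcp_ms': False, 'fid_ms': True, 'cls': False}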

    Good TTFB, bad user experience

    One of the “sub parts” that comprise LCP is TTFB. That means a poor TTFB is very likely to result in a poor LCP. If it takes you 20 seconds to retrieve the first byte of the first image, your user isn't going to have a good experience – regardless of your outlook on TTFB vs RUM.

Conversely, we found that a good TTFB does not always mean a good LCP score, or FID or CLS. We ran a query to collect RUM metrics of web pages we served which had a good TTFB score. Good is defined as a TTFB of less than 800 ms. This allowed us to ask the question: TTFB says these websites are good. Does the RUM data support that?

    We took four distinct samples from our RUM data in June. Each sample had a different date-range and sample-rate combination. In each sample we queried for 200,000 page views. From these 200,000 page views we filtered for only the page views that reported a 'Good' TTFB. Across the samples, of all page views that have a good TTFB, about 21% of them did not have a “good” LCP score. 46% of them did not have a “good” FID score. And 57% of them did not have a good CLS score.

    This clearly shows the disparity between measuring the time it takes to receive the first byte of traffic, vs the time it takes for a webpage to become stable and interactive. In summary, LCP includes TTFB but also includes other parts of the loading experience. LCP is a more comprehensive, user-centric metric.
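
The shape of that analysis is simple enough to sketch. Assuming a pandas DataFrame of RUM page views with hypothetical column names (ttfb_ms, lcp_ms, fid_ms, cls), the share of "good TTFB" page views that miss each Core Web Vital threshold could be computed roughly like this:

import pandas as pd

def share_not_good(page_views: pd.DataFrame) -> pd.Series:
    """Among page views with a 'good' TTFB, what share misses each Core Web Vital?"""
    good_ttfb = page_views[page_views["ttfb_ms"] <= 800]
    return pd.Series({
        "lcp_not_good": (good_ttfb["lcp_ms"] > 2500).mean(),
        "fid_not_good": (good_ttfb["fid_ms"] > 100).mean(),
        "cls_not_good": (good_ttfb["cls"] > 0.1).mean(),
    })

# page_views = pd.read_parquet("rum_sample.parquet")  # hypothetical sample export
# print(share_not_good(page_views))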

    TTFB is not all bad

Reading this post and others from Speed Week 2023, you may conclude that we really don't like TTFB and that you should stop using it. That isn't the case.

    There are a few situations where TTFB does matter. For starters, there are many applications that aren’t websites. File servers, APIs and all sorts of streaming protocols don’t have the same semantics as web pages and the best way to objectively measure performance is to in fact look at exactly when the first byte is returned from a server.

    To help optimize TTFB for these scenarios we are announcing Timing Insights, a new analytics tool to help you understand what is contributing to "Time to First Byte" (TTFB) of Cloudflare and your origin. Timing Insights breaks down TTFB from the perspective of our servers to help you understand what is slow, so that you can begin addressing it.

    Get started with RUM today

To help you understand the real user experience of your website, we have today launched Cloudflare Observatory, the new home of performance at Cloudflare.


    Cloudflare users can now easily monitor website performance using Real User Monitoring (RUM) data along with scheduled synthetic tests from different regions in a single dashboard. This will identify any performance issues your website may have. The best bit? Once we’ve identified any issues, Observatory will highlight customized recommendations to resolve these issues, all with a single click.

    Start making your website faster today with Observatory.

    Query and visualize Amazon Redshift operational metrics using the Amazon Redshift plugin for Grafana

    Post Syndicated from Sergey Konoplev original https://aws.amazon.com/blogs/big-data/query-and-visualize-amazon-redshift-operational-metrics-using-the-amazon-redshift-plugin-for-grafana/

    Grafana is a rich interactive open-source tool by Grafana Labs for visualizing data across one or many data sources. It’s used in a variety of modern monitoring stacks, allowing you to have a common technical base and apply common monitoring practices across different systems. Amazon Managed Grafana is a fully managed, scalable, and secure Grafana-as-a-service solution developed by AWS in collaboration with Grafana Labs.

Amazon Redshift is the most widely used data warehouse in the cloud. You can view your Amazon Redshift cluster’s operational metrics on the Amazon Redshift console, use Amazon CloudWatch, and query Amazon Redshift system tables directly from your cluster. The first two options provide a set of predefined general metrics and visualizations. The last one allows you to use the flexibility of SQL to get deep insights into the details of the workload. However, querying system tables requires knowledge of system table structures. To address that, we came up with a consolidated Amazon Redshift Grafana dashboard that visualizes a set of curated operational metrics and works on top of the Amazon Redshift Grafana data source. You can easily add it to an Amazon Managed Grafana workspace, as well as to any other Grafana deployments where the data source is installed.

    This post guides you through a step-by-step process to create an Amazon Managed Grafana workspace and configure an Amazon Redshift cluster with a Grafana data source for it. Lastly, we show you how to set up the Amazon Redshift Grafana dashboard to visualize the cluster metrics.

    Solution overview

    The following diagram illustrates the solution architecture.

    Architecture Diagram

    The solution includes the following components:

    • The Amazon Redshift cluster to get the metrics from.
• Amazon Managed Grafana, with the Amazon Redshift data source plugin added to it. Amazon Managed Grafana communicates with the Amazon Redshift cluster via the Amazon Redshift Data API.
    • The Grafana web UI, with the Amazon Redshift dashboard using the Amazon Redshift cluster as the data source. The web UI communicates with Amazon Managed Grafana via an HTTP API.

    We walk you through the following steps during the configuration process:

    1. Configure an Amazon Redshift cluster.
    2. Create a database user for Amazon Managed Grafana on the cluster.
    3. Configure a user in AWS Single Sign-On (AWS SSO) for Amazon Managed Grafana UI access.
    4. Configure an Amazon Managed Grafana workspace and sign in to Grafana.
    5. Set up Amazon Redshift as the data source in Grafana.
    6. Import the Amazon Redshift dashboard supplied with the data source.

    Prerequisites

    To follow along with this walkthrough, you should have the following prerequisites:

    • An AWS account
    • Familiarity with the basic concepts of the following services:
      • Amazon Redshift
      • Amazon Managed Grafana
      • AWS SSO

    Configure an Amazon Redshift cluster

    If you don’t have an Amazon Redshift cluster, create a sample cluster before proceeding with the following steps. For this post, we assume that the cluster identifier is called redshift-demo-cluster-1 and the admin user name is awsuser.

    1. On the Amazon Redshift console, choose Clusters in the navigation pane.
    2. Choose your cluster.
    3. Choose the Properties tab.

    Redshift Cluster Properties

    To make the cluster discoverable by Amazon Managed Grafana, you must add a special tag to it.

1. Choose Add tags.
    2. For Key, enter GrafanaDataSource.
    3. For Value, enter true.
    4. Choose Save changes.

    Redshift Cluster Tags
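
If you prefer to tag the cluster programmatically instead of through the console, the same tag can be added with the AWS SDK. A minimal boto3 sketch (the Region and the account ID in the cluster ARN are placeholders you would replace with your own):

import boto3

redshift = boto3.client("redshift", region_name="us-east-1")  # adjust the Region

# ARN of the cluster created earlier (placeholder account ID)
cluster_arn = "arn:aws:redshift:us-east-1:123456789012:cluster:redshift-demo-cluster-1"

# Tag that makes the cluster discoverable by Amazon Managed Grafana
redshift.create_tags(
    ResourceName=cluster_arn,
    Tags=[{"Key": "GrafanaDataSource", "Value": "true"}],
)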

    Create a database user for Amazon Managed Grafana

    Grafana will be directly querying the cluster, and it requires a database user to connect to the cluster. In this step, we create the user redshift_data_api_user and apply some security best practices.

1. On the cluster details page, choose Query data, then Query in query editor v2.
    2. Choose the redshift-demo-cluster-1 cluster we created previously.
    3. For Database, enter the default dev.
    4. Enter the user name and password that you used to create the cluster.
5. Choose Create connection.
    6. In the query editor, enter the following statements and choose Run:
CREATE USER redshift_data_api_user PASSWORD '<password>' CREATEUSER;
ALTER USER redshift_data_api_user SET readonly TO TRUE;
ALTER USER redshift_data_api_user SET query_group TO 'superuser';

    The first statement creates a user with superuser privileges necessary to access system tables and views (make sure to use a unique password). The second prohibits the user from making modifications. The last statement isolates the queries the user can run to the superuser queue, so they don’t interfere with the main workload.

In this example, we use service managed permissions in Amazon Managed Grafana and a workspace AWS Identity and Access Management (IAM) role as the authentication provider in the Amazon Redshift Grafana data source. The database user name redshift_data_api_user matches the user that the AmazonGrafanaRedshiftAccess managed policy, attached to the workspace role, allows Amazon Managed Grafana to connect as.
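
Because Amazon Managed Grafana reaches the cluster through the Amazon Redshift Data API rather than a direct database connection, you can verify that the new database user works by issuing the same kind of call yourself. A minimal boto3 sketch (the Region is a placeholder, the query is illustrative, and it assumes the statement succeeds):

import time
import boto3

client = boto3.client("redshift-data", region_name="us-east-1")  # adjust the Region

# Run a query as the Grafana database user via the Data API
resp = client.execute_statement(
    ClusterIdentifier="redshift-demo-cluster-1",
    Database="dev",
    DbUser="redshift_data_api_user",
    Sql="SELECT query, starttime, elapsed FROM svl_qlog ORDER BY starttime DESC LIMIT 10",
)

# The Data API is asynchronous: poll until the statement finishes
statement_id = resp["Id"]
while client.describe_statement(Id=statement_id)["Status"] not in ("FINISHED", "FAILED", "ABORTED"):
    time.sleep(1)

# Fetch the result set (assumes the statement finished successfully)
result = client.get_statement_result(Id=statement_id)
print(result["Records"][:3])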

    Configure a user in AWS SSO for Amazon Managed Grafana UI access

    Two authentication methods are available for accessing Amazon Managed Grafana: AWS SSO and SAML. In this example, we use AWS SSO.

    1. On the AWS SSO console, choose Users in the navigation pane.
    2. Choose Add user.
    3. In the Add user section, provide the required information.

    SSO add user

    In this post, we select Send an email to the user with password setup instructions. You need to be able to access the email address you enter because you use this email further in the process.

    1. Choose Next to proceed to the next step.
    2. Choose Add user.

    An email is sent to the email address you specified.

    1. Choose Accept invitation in the email.

    You’re redirected to sign in as a new user and set a password for the user.

    1. Enter a new password and choose Set new password to finish the user creation.

    Configure an Amazon Managed Grafana workspace and sign in to Grafana

    Now you’re ready to set up an Amazon Managed Grafana workspace.

    1. On the Amazon Grafana console, choose Create workspace.
    2. For Workspace name, enter a name, for example grafana-demo-workspace-1.
    3. Choose Next.
    4. For Authentication access, select AWS Single Sign-On.
    5. For Permission type, select Service managed.
6. Choose Next to proceed.
7. For IAM permission access settings, select Current account.
    8. For Data sources, select Amazon Redshift.
9. Choose Next to finish the workspace creation.

    You’re redirected to the workspace page.

    Next, we need to enable AWS SSO as an authentication method.

1. On the workspace page, choose Assign new user or group.
2. Select the previously created AWS SSO user in the Users table under Select users and groups.

    You need to make the user an admin, because we set up the Amazon Redshift data source with it.

    1. Select the user from the Users list and choose Make admin.
2. Go back to the workspace and choose the Grafana workspace URL link to open the Grafana UI.
    3. Sign in with the user name and password you created in the AWS SSO configuration step.

    Set up an Amazon Redshift data source in Grafana

    To visualize the data in Grafana, we need to access the data first. To do so, we must create a data source pointing to the Amazon Redshift cluster.

    1. On the navigation bar, choose the lower AWS icon (there are two) and then choose Redshift from the list.
    2. For Regions, choose the Region of your cluster.
3. Select the cluster from the list and choose Add 1 data source.
    4. On the Provisioned data sources page, choose Go to settings.
    5. For Name, enter a name for your data source.
    6. By default, Authentication Provider should be set as Workspace IAM Role, Default Region should be the Region of your cluster, and Cluster Identifier should be the name of the chosen cluster.
    7. For Database, enter dev.
    8. For Database User, enter redshift_data_api_user.
9. Choose Save & Test.

    A success message should appear.

    Data source working

    Import the Amazon Redshift dashboard supplied with the data source

    As the last step, we import the default Amazon Redshift dashboard and make sure that it works.

1. In the data source we just created, choose Dashboards on the top navigation bar and choose Import to import the Amazon Redshift dashboard.
    2. Under Dashboards on the navigation sidebar, choose Manage.
    3. In the dashboards list, choose Amazon Redshift.

The dashboard appears, showing operational data from your cluster. When you add more clusters and create data sources for them in Grafana, you can choose them from the Data source list on the dashboard.

    Clean up

    To avoid incurring unnecessary charges, delete the Amazon Redshift cluster, AWS SSO user, and Amazon Managed Grafana workspace resources that you created as part of this solution.

    Conclusion

    In this post, we covered the process of setting up an Amazon Redshift dashboard working under Amazon Managed Grafana with AWS SSO authentication and querying from the Amazon Redshift cluster under the same AWS account. This is just one way to create the dashboard. You can modify the process to set it up with SAML as an authentication method, use custom IAM roles to manage permissions with more granularity, query Amazon Redshift clusters outside of the AWS account where the Grafana workspace is, use an access key and secret or AWS Secrets Manager based connection credentials in data sources, and more. You can also customize the dashboard by adding or altering visualizations using the feature-rich Grafana UI.

    Because the Amazon Redshift data source plugin is an open-source project, you can install it in any Grafana deployment, whether it’s in the cloud, on premises, or even in a container running on your laptop. That allows you to seamlessly integrate Amazon Redshift monitoring into virtually all your existing Grafana-based monitoring stacks.



    About the Authors

    Sergey Konoplev is a Senior Database Engineer on the Amazon Redshift team. Sergey has been focusing on automation and improvement of database and data operations for more than a decade.

    Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.

    Handy Tips #14: Gain new insights into your metrics with the Graph widget

    Post Syndicated from Arturs Lontons original https://blog.zabbix.com/handy-tips-13-gain-new-insights-into-your-metrics-with-the-graph-widget/17416/

    Group, aggregate, and visualize your collected metrics with the Graph widget to obtain additional insights.

    Sometimes simple graphs may not be sufficient – we may need to visualize data trends, display all metrics matching a pattern, or define custom draw styles for specific metrics.

    The Graph widget allows you to visualize your metrics in an advanced fashion:

    • Customize your graph draw styles
    • Select between displaying History or Trend values
    • Calculate and display aggregated metric values on the fly

    • Define custom aggregation intervals
    • Define flexible problem display logic
    • Visually group metrics by defining item and host data sets

    Check out the video to learn how to visualize your metrics by using the Graph widget.

    How to visualize data with the Graph widget:

    1. Navigate to Monitoring → Dashboards – All Dashboards
    2. Click Create Dashboard to create a new Dashboard
    3. Left-click on an empty space and select Add Widget
    4. Select Widget type – Graph
    5. Type in the data set host and item patterns
6. Use the wildcard symbol '*' to match multiple hosts/items
    7. Click Add new data set to add a second data set
    8. Fill in the host pattern and item pattern to match a different set of metrics
9. For both data sets, select an Aggregation function and an Aggregation interval, and set Aggregate to Data set
    10. Observe how metrics get aggregated in the graph on the fly

    Tips and best practices:
    • Wildcards can be used in host and item pattern matching
    • The Overrides section can be used to further customize a particular set of items
    • Items within a single data set will use the same base color
    • Up to 50 items may be displayed in the graph

    Handy Tips #11: Collect and send custom metrics with Zabbix sender

    Post Syndicated from vitalijsm original https://blog.zabbix.com/handy-tips-11-collect-and-send-custom-metrics-with-zabbix-sender/17452/

    Collect custom metrics from your in-house applications and forward them to your Zabbix instance by utilizing the Zabbix sender command-line tool.

As you scale up your monitoring, you will inevitably encounter situations where a particular metric can be quite complex to collect with native monitoring techniques.
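
As a quick illustration of what that looks like in practice, here is a minimal Python sketch that shells out to the zabbix_sender binary to push one custom value to a trapper item. The server address, host name, and item key are placeholders, and the corresponding Zabbix trapper item must already exist on the host.

import subprocess

def send_metric(server: str, host: str, key: str, value: str) -> None:
    """Push a single custom value to a Zabbix trapper item via zabbix_sender."""
    subprocess.run(
        [
            "zabbix_sender",
            "-z", server,   # Zabbix server or proxy address
            "-s", host,     # host name as configured in the Zabbix frontend
            "-k", key,      # trapper item key
            "-o", value,    # value to send
        ],
        check=True,
    )

# Hypothetical example: report a queue depth from an in-house application
send_metric("zabbix.example.com", "app-server-01", "app.queue.depth", "42")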


    Safe Updates of Client Applications at Netflix

    Post Syndicated from Netflix Technology Blog original https://netflixtechblog.com/safe-updates-of-client-applications-at-netflix-1d01c71a930c

    By Minal Mishra

    Quality of a client application is of paramount importance to global digital products, as it is the primary way customers interact with a brand. At Netflix, we have significant investments in ensuring new versions of our applications are well tested. However, Netflix is available for streaming on thousands of types of devices and it is powered by hundreds of micro-services which are deployed independently, making it extremely challenging to comprehensively test internally. Hence, it became important to supplement our release decisions with strong evidence received from the field during the update process.

    Our team was formed to mine health signals from the field to quickly evaluate new versions of the client applications. As we invested in systems to enable this vision, it led to increased development velocity, which arguably led to better development practices and quality of the applications. The goal of this blog post is to highlight the investment areas for this vision and the challenges we are facing today.

    Client Applications

    We deal with two classes of client application updates. The first is where an application package is downloaded from the service or a CDN. An example of this is Netflix’s video player or the TV UI javascript package. The second is one where an application is hosted behind an app store, for example mobile phones or even game consoles. We have more flexibility to control the distribution of the application in the former than in the latter case.

    Deployment Strategies

We are all familiar with the advantages of releasing frequently and in smaller chunks. It helps bring a healthy balance to the velocity and quality equation. The challenge for clients is that each instance of the application runs on a Netflix member’s device, and signals are derived from a firehose of events being sent by devices across the globe. Depending on the type of client, we need to determine the right strategy to sample consumer devices and provide a system that enables the various client engineering teams to look for their signals. Hence, the sampling strategy is different for a mobile application versus a smart TV. In contrast, a server application runs on servers which are typically identical, and a routing abstraction can serve sampled traffic to new versions. The signals to evaluate a new version are derived from a comparatively few thousand homogeneous servers instead of millions of heterogeneous devices.

Staged rollouts of apps mimic the different phases of the moon

    A widely adopted technique for client applications is gradually rolling out a new version of software rather than making the release available to all users instantly, also known as staged or phased rollout. There are two main benefits to this approach.

    • First, if something were to fail catastrophically, the release can be paused for triage, limiting the number of customers impacted.
    • Second, backend services or infrastructure can be scaled intelligently as adoption ramps up.
    Application version adoption over time for a staged rollout

    This chart represents a counter metric, which exhibits version adoption over the duration of a staged rollout. There is a gradual increase in the percentage of devices switching to N+1 version. In the past, during this period client engineering teams would visually monitor their metric dashboards to evaluate signals as more consumers migrated to a new version of their application.

    Client-side error rate during the staged rollout

The chart of client-side error rate during the same time period as the version migration is shown here. We observe that the metric for the new version N+1 stabilizes as the rollout ramps up and reaches closer to 100%, whereas the metric for the current version N becomes noisy over the same time duration. Trying to compare any metric during this time period can be a futile effort, as is obvious in this case: there was no customer-impacting shift in the error rate, yet we cannot tell that from the chart. Typically, teams time-shift one metric over the other to visually detect metric deviations, but time can still be a confounder. Staged rollouts have a lot of benefits, but there is a significant opportunity cost in waiting for the new version to reach a critical level of adoption.

    AB Tests/Client Canaries

So we brought the science of controlled testing into this decision framework, using what has already been utilized for feature evaluations. The main goal of A/B testing is to design a robust experiment that is going to yield repeatable results and enable us to make sound decisions about whether or not to launch a product feature (read more about A/B tests at Netflix here). In the application update use case, we recommend an extreme version of A/B testing: we test the entire application. The new version may include a user-facing feature which is designed to be A/B tested and resides behind a feature flag. Most of the time, however, a release ships new obvious improvements: simple bug fixes, performance enhancements, productized outcomes from previous A/B tests, logging changes, and so on. If we apply the A/B test methodology (or client canaries, as we like to call them to differentiate them from traditional feature-based A/B tests), the allocation looks identical for both versions at any time.

    Client Canary and Control allocation along with the client-side error rate metric

This chart shows the new and the baseline version allocations growing over time. Although the majority of users are already on the baseline version, we randomly “allocate” a fraction of those users to be the control group of our experiment. This ensures there is no sampling mismatch between the treatment and the control group. It is easier to visually compare the client-side error rate for both versions, and we can even apply statistical inference to change the conversation from “we think” there is a shift in metrics to “we know”.
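
The statistical inference itself can be as simple as a two-proportion test on an error-rate metric. The following is a self-contained sketch of that idea with made-up counts, not Netflix's actual analysis (which, as described below, is built on Kayenta and is being evolved):

from math import erf, sqrt

def two_proportion_z_test(err_canary: int, n_canary: int,
                          err_control: int, n_control: int) -> tuple[float, float]:
    """Two-sided z-test for a difference in error rates between canary and control."""
    p1, p2 = err_canary / n_canary, err_control / n_control
    pooled = (err_canary + err_control) / (n_canary + n_control)
    se = sqrt(pooled * (1 - pooled) * (1 / n_canary + 1 / n_control))
    z = (p1 - p2) / se
    # Two-sided p-value from the standard normal CDF
    p_value = 2 * (1 - 0.5 * (1 + erf(abs(z) / sqrt(2))))
    return z, p_value

# Hypothetical counts: 1.2% errors on the canary vs. 1.0% on the control
z, p = two_proportion_z_test(err_canary=600, n_canary=50_000,
                             err_control=500, n_control=50_000)
print(f"z = {z:.2f}, p = {p:.4f}")

As the Analysis section below explains, repeatedly re-running a fixed-horizon test like this during the canary window inflates false positives unless a correction is applied.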

    Client Canaries and A/B tests

But there are differences between feature-related A/B tests at Netflix and the incremental product changes used for client canaries. The main distinctions are a shorter runtime, multiple executions of the analysis, sometimes concurrent with allocation, and the use of data to support the null hypothesis. The runtime is predetermined, which, in a way, is the stopping rule for client canaries. Unlike feature A/B tests at Netflix, we limit our evidence collection to a few hours, so we can release updates within a working day. We continuously analyze metrics to find egregious regressions sooner rather than once all the evidence has been collected.

    Phases of A/B Tests

    The three key phases of any A/B tests can be split into Allocation, Metric Collection and Analysis. We use orchestration to connect and manage client applications through the A/B test lifecycle, thereby reducing the cognitive load of deploying them frequently.

    Allocation

    Sampling is the first stage once your new application has been packaged, tested and published. As time is of the essence here, we rely on dynamic allocation and allocate devices which come to the service during the canary time period based on pre-configured rules. We leverage the allocation service used for all experimentation at Netflix for this purpose.

    However, for applications gated behind an external app store (example mobile apps), we only have access to staged rollout solutions provided by the app stores. We can control the percentage of users receiving updated apps, which can increase over time. In order to mimic the client canary solution, we built a synthetic allocation service to perform sampling post-installation of the app updates. This service tries to allocate a device to the control group that typically matches the profile of a device seen in the treatment group, which was allocated by the app store’s staged rollout solution. This ensures we are controlling for key variables which have the potential to impact the analysis.
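
Conceptually, the synthetic allocation boils down to sampling control devices so that their profile mix matches the treatment group's. A simplified sketch of that idea follows; it is not Netflix's service, and the device attributes and matching key are purely illustrative.

import random
from collections import Counter, defaultdict

def allocate_synthetic_control(treatment, baseline_pool,
                               key=lambda d: (d["device_type"], d["country"])):
    """Sample a control group from baseline devices whose profile mix matches treatment."""
    needed = Counter(key(d) for d in treatment)
    by_profile = defaultdict(list)
    for device in baseline_pool:
        by_profile[key(device)].append(device)

    control = []
    for profile, count in needed.items():
        candidates = by_profile.get(profile, [])
        control.extend(random.sample(candidates, min(count, len(candidates))))
    return control

# Hypothetical usage: match on device type and country
treatment = [{"id": 1, "device_type": "tv", "country": "US"},
             {"id": 2, "device_type": "mobile", "country": "BR"}]
baseline_pool = [{"id": i, "device_type": t, "country": c}
                 for i, (t, c) in enumerate([("tv", "US"), ("tv", "US"), ("mobile", "BR")], start=100)]
print(allocate_synthetic_control(treatment, baseline_pool))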

    Metrics

Metrics are a foundational component for client canaries and A/B tests, as they give us the insight required to make decisions. For our use case, metrics need to be computed in real time from millions of user events being sent to our service. Operating at Netflix’s scale, we have to process the event streams on a scalable platform like Mantis and store the time-series data in Apache Druid. To be further cost-efficient with the time-series data, we store the metrics for a sliding time window of a few weeks and compress them to a 1-minute time granularity.
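
To make the shape of that computation concrete, here is a toy pandas sketch that rolls a raw event stream up into 1-minute error-rate time series per application version. Netflix does this with Mantis and Druid at a vastly larger scale; the snippet only illustrates the rollup itself, with made-up events.

import pandas as pd

# Hypothetical raw events: one row per client event
events = pd.DataFrame({
    "timestamp": pd.to_datetime(["2023-06-01 10:00:05", "2023-06-01 10:00:40",
                                 "2023-06-01 10:01:10", "2023-06-01 10:01:55"]),
    "app_version": ["N+1", "N+1", "N", "N+1"],
    "is_error": [0, 1, 0, 0],
})

# Roll up to 1-minute granularity per version: event count and error rate
rollup = (events
          .set_index("timestamp")
          .groupby([pd.Grouper(freq="1min"), "app_version"])["is_error"]
          .agg(["size", "mean"])
          .rename(columns={"size": "events", "mean": "error_rate"}))
print(rollup)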

The other challenge is to enable client application engineers to contribute to metrics and dimensions as they are aware of what can be a valuable insight. To do this, our real-time metric data pipeline provides the right abstractions to remove the complexity of a distributed stream processing system and also enables these contributions to be used in offline computations for feature A/B test evaluations. The former reduces the barrier to entry and the latter provides additional motivation for client engineers to contribute. Additionally, this gets us closer to consistent metric definitions in both realtime and offline systems.

    As we accept contributions, we have to have the right checks in place to ensure the data pipeline is reliable and robust. Changes in user events, stream processing jobs or even in the platform can impact metrics, and so it is imperative that we actively monitor the data pipeline and ingestion.

    Analysis

Historically, we have relied on conventional statistical tests built into Kayenta to detect metric shifts for the release of new versions of applications. It has served us well over the last few years; however, at Netflix we are always looking to improve. Some reasons to explore alternative solutions:

1. Under the hood, ACA (automated canary analysis) uses a fixed-horizon statistical hypothesis test, which is subject to peeking because of the frequent analysis executions during the canary time period. Without a correction, this can erode our false positive guarantees, and the correction itself is a function of the number of peeks, which is not known in advance. This often leads to more false errors in the outcomes.
    2. Due to limited time for the canary, rare event metrics such as errors can often be missing from control or treatment and hence might not get evaluated.
    3. Our intuition suggests any form of metric compression, like aggregating to 1 minute granularity, leads to a loss in power for the analysis, and the tradeoff is that we need more time to confidently detect the metric shifts.

    We are actively working on a promising solution to tackle some of these limitations and hope to share more in future.

    Orchestration

Orchestration reduces the cognitive load of frequently setting up, executing, analyzing and making decisions for client application canaries. To manage the A/B test lifecycle, we decided to build a Node.js-powered, extensible backend service that serves the JavaScript competency of client engineers while complementing the continuous deployment platform Spinnaker. The drawback of most orchestration solutions is the lack of version control and testing, so the main design tenets for this service, along with reusability and extensibility, are testability and traceability.

    Conclusion

    Today, most client applications at Netflix use the client canary model to continuously update their applications. We have seen a significant increase in adoption of this methodology over the past 4 years as shown in this cumulative graph of client canary counts.

    Year-over-year increase in Client Canaries at Netflix

Time constraints and the need for both speed and quality have created several challenges in the client application update domain that our team at Netflix aims to solve. We covered some metric-related ones in a previous post describing “How Netflix uses Druid for Real-time Insights to Ensure a High-Quality Experience”. We intend to share more in the future, diving into the challenges and solutions in the Allocation, Analysis and Orchestration space.


    Safe Updates of Client Applications at Netflix was originally published in Netflix TechBlog on Medium, where people are continuing the conversation by highlighting and responding to this story.