Tag Archives: Amazon Managed Grafana

Migrating from API keys to service account tokens in Grafana dashboards using Terraform

Post Syndicated from Majdoulina Makbal original https://aws.amazon.com/blogs/big-data/migrating-from-api-keys-to-service-account-tokens-in-grafana-dashboards-using-terraform/

With the release of Grafana 9.4, Amazon Managed Grafana added support for service accounts, which have become the recommended authentication method for applications interacting with Amazon Managed Grafana, replacing the previous API key system.

While API keys are created with a specific role that determines their level of access, service accounts offer a more flexible and maintainable approach. They support multiple tokens, can be enabled or disabled independently, and aren’t tied to individual users, allowing applications to remain authenticated even if a user is deleted. Permissions can be assigned directly to service accounts using role-based access control, simplifying management of long-lived access for non-human entities like applications or scripts.

In this blog post, we walk through how to migrate from API keys to service account tokens when automating Amazon Managed Grafana resource management. We will also show how to securely store tokens using AWS Secrets Manager and automate token rotation with AWS Lambda. All infrastructure is deployed using Terraform, though the pattern can be adapted to your infrastructure-as-code framework of choice.

What are service accounts and tokens?

A service account is designed to authenticate automated tools and systems with Amazon Managed Grafana and is intended for programmatic access. A service account token is a secure credential issued to a service account and can be used to authenticate requests to the Amazon Managed Grafana HTTP API. Multiple tokens can be associated with a single service account, and tokens can be individually revoked or rotated without affecting other services or requiring changes to user accounts.

For a deeper understanding, see the Grafana service account documentation.

Solution overview

In this solution, we show you how to create a service account, reference it in your Terraform stack, and then implement rotation of the token associated with it using Lambda and Secrets Manager as shown in the following diagram:

Workflow diagram showing automated secret management between Terraform, AWS Secrets Manager, and Grafana workspace with Lambda rotation

Architecture diagram illustrating the integration between Terraform, AWS Secrets Manager secret store, and an Amazon Managed Grafana workspace, with secret rotation functionality.

The following are the basic steps to set up the solution.

  1. Set up Amazon Managed Grafana with service accounts.
  2. Update the secret in Secrets Manager with the token value.
  3. Automate resource creation in Amazon Managed Grafana using service account tokens in Terraform.
  4. Create a service account and token in your Amazon Managed Grafana workspace.
  5. Store the token securely using Secrets Manager.
  6. Use Terraform to automate Amazon Managed Grafana resource creation with the token.
  7. Automate the rotation of the service account token.

GitHub repo for cloning the code and deploying the Terraform stack.

Prerequisites

Before starting this walkthrough, make sure that you have the following:

Solution walkthrough

Use the following steps to set up and configure the solution.

Provision resources using the Terraform stack

The full source code of the solution is in sample-migrate-from-apikeys-grafana and is deployed using Terraform.

  1. Clone the repository.
git clone https://github.com/aws-samples/sample-migrate-from-apikeys-grafana.git
  1. Initialise a Terraform project.
terraform init
  1. Create infrastructure for the secrets and the Amazon Managed Grafana instance.
terraform apply —target=aws_secretsmanager_secret.token —target=aws_grafana_workspace.grafana

This step creates the Amazon Managed Grafana workspace and the Secrets Manager secret. In the next step, you bind the workspace with AWS IAM Identity Center and generate the service account token.

Retrieve service account token from the Amazon Managed Grafana workspace

You must have administrative privileges in your Amazon Managed Grafana workspace to perform this step. This applies whether you’re using IAM Identity Center or an external identity provider for authentication.

  1. To change a user’s role in AWS IAM Identity Center (console)
    1. Open the Amazon Managed Grafana console.
    2. In the navigation pane, choose Workspaces.
    3. Select the workspace you want to manage.
    4. On the AWS IAM Identity Center, choose the Assigned users tab.
    5. Select the row of the user that you want to modify.
    6. For Action, choose the following:
      • Make admin
    7. Confirm the role change.

  1. Select the workspace URL and sign in using your credentials, you should be able to create a service account under the name grafana-sa (or the name of the variable defined in /variables.tf).

  1. Assign the Editor role to the service account to allow it to create dashboards and folders. Learn more about service account roles in the Assign roles to a service account in Grafana.
  2. After the service account is created, add a service account token to it, again the name should be similar to the one defined in /variables.tf.

Add the token to Secrets Manager and create the rest of the resources

After you complete this step, the access token will be stored in Secrets Manager and will automatically be used in the provider definition during future runs of terraform apply.

  1. Copy the service account token.

  1. Paste it into the plaintext section of the Secrets Manager secret created in the previous section

  1. With the access token stored in Secrets Manager, there is no longer a need to restrict the apply operation to the rotation module using the --target flag. Use the following code to remove the restriction.
    provider "grafana" {
      url  = "https://${aws_grafana_workspace.grafana.endpoint}"
      auth = module.grafana_sa_key_automation.grafana_sa_token
    }

Clean up

To avoid incurring future charges, use the following command to delete unused Amazon Managed Grafana service accounts and Terraform-managed resources run the cli command terraform destroy.

Security notes

To protect the security of your organization, we recommend the following best practices:

  • Always follow least privilege principles. Grant the minimum permissions needed to the service account (for example, Editor instead of Admin).
  • Make sure that Amazon Simple Queue Service (Amazon SQS) queues, Secrets Manager secrets, and Amazon CloudWatch Logs are encrypted with a customer-managed KMS key if required by your organization.
  • Rotate secrets regularly to minimize exposure.

Conclusion

In this post, we demonstrated how to migrate from API keys to Amazon Managed Grafana service account tokens using Terraform, with secure storage in AWS Secrets Manager and optional automated token rotation via AWS Lambda.This modern approach improves security, scalability, and auditing in your automation pipelines.

For more information, see the Amazon Managed Grafana service account documentation.


About the authors

Majdoulina

Majdoulina Makbal

Majdoulina is a Delivery Consultant in AWS Professional Services, specialising in AI and ML solutions. With a strong background in industrial connected services, she brings extensive experience helping organisations across diverse industries transform their business vision into technological reality. Based in Munich, she’s mastering the art of explaining transformer architectures and federated learning over a Maß at Oktoberfest.

Enhance Amazon EMR observability with automated incident mitigation using Amazon Bedrock and Amazon Managed Grafana

Post Syndicated from Yu-Ting Su original https://aws.amazon.com/blogs/big-data/enhance-amazon-emr-observability-with-automated-incident-mitigation-using-amazon-bedrock-and-amazon-managed-grafana/

Maintaining high availability and quick incident response for Amazon EMR clusters is important in data analytics environments. In this post, we show you how to build an automated observability system that combines Amazon Managed Grafana with Amazon Bedrock to detect and remediate EMR cluster issues. We demonstrate how to integrate real-time monitoring with AI-powered remediation suggestions, combining Amazon Managed Grafana for visualization, Amazon Bedrock for intelligent response recommendations, and AWS Systems Manager for automated remediation actions on Amazon Web Services (AWS).

Solution overview

This solution helps you improve EMR cluster observability through a comprehensive four-layer architecture—comprising monitoring, notification, remediation, and knowledge management—to provide the following features:

  • Real-time monitoring of EMR clusters using Amazon Managed Service for Prometheus and Amazon Managed Grafana
  • Automated first-aid remediation through Systems Manager
  • AI-powered incident response suggestions using Amazon Bedrock
  • Integration with the AWS Premium Support knowledge base
  • Historical incident data archival and analysis

The implementation of this architecture delivers the following key benefit:

  • Reduced Mean time to resolution (MTTR)
  • Proactive incident prevention
  • Automated first-response actions
  • Knowledge base enrichment through machine learning

The following diagram illustrates the solution architecture.

End-to-end AWS monitoring solution diagram integrating Knowledge Center, Support, CloudWatch metrics with EventBridge rules and Lambda processing

The architecture comprises the following core components:

  • Monitoring layer – The monitoring layer uses Amazon Managed Service for Prometheus and Amazon CloudWatch to capture real-time metrics from EMR clusters. Amazon Managed Grafana serves as the visualization layer, offering comprehensive dashboards for Apache YARN, HDFS, Apache HBase, and Apache Hudi performance monitoring. Advanced alerting mechanisms trigger notifications based on predefined query results.
  • Notification layer – To provide timely and reliable alert delivery, the notification layer uses Amazon Simple Notification Service (Amazon SNS) for distribution and Amazon Simple Queue Service (Amazon SQS) for message queuing. This architecture prevents message delays and provides a robust trigger mechanism for AWS Lambda functions.
  • Remediation layer – The remediation layer enables automatic issue resolution through:
    • Lambda functions for orchestration
    • Systems Manager for script execution
    • Amazon Bedrock (amazon.nova-lite-v1:0) for generating intelligent response recommendations
  • Knowledge management layer – To maintain an up-to-date knowledge base, the solution:

We provide an AWS CloudFormation template to deploy the solution resources.

Prerequisites

Before starting this walkthrough, make sure you have access to the following AWS resources and configurations:

  • An AWS account
  • Access to the US East (N. Virginia) AWS Region
    • Add access to Amazon Bedrock foundation models (amazon.nova-lite-v1:0)

  • Amazon EMR version 6.15.0 (used in this demo)
  • Archived technical or troubleshooting articles
  • AWS IAM Identity Center enabled with at least one role that can become a Grafana administrator
  • (Optional) AWS Premium Support with a business support plan or higher for enhanced troubleshooting capabilities

Throughout this walkthrough, we provide detailed instructions to set up and configure these prerequisites if you haven’t already done so.

Configure resources using AWS CloudFormation

Complete the following steps to configure your resources:

  1. Launch the CloudFormation stack:

launch stack

  1. Provide emrobservability as the stack name.
  2. Select a virtual private cloud (VPC) and assign a public subnet.
  3. For EMRClusterName, enter a name for your cluster (default: emrObservability).
  4. Enter an existing Amazon S3 location as the Apache HBase root directory location (for example, s3://mybucket/my/hbase/rootdir/).
  5. For MasterInstanceType and CoreInstanceType, enter your instance types (default: m5.xlarge for both).
  6. For CoreInstanceCount, enter your instance count (default: 2).
  7. For SSHIPRange, use CheckIp and enter your IP (for example, 10.1.10/32).
  8. Choose the release label (default: 6.15.0).
  9. For KeyName, enter a key name to SSH to Amazon Elastic Compute Cloud (Amazon EC2) instances.
  10. For LatestAmiId, enter your AMI (default: /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-x86_64-gp2).
  11. For KBS3Bucket, enter a name for your S3 bucket (for example, mykbbucket).
  12. For SubscriptionEndpoint, enter an email address to receive notifications and responses (for example, [email protected]).

Accept subscription confirmation

Accept the subscription confirmation sent to the email address you specified in the CloudFormation stack parameters. The following screenshot shows an example of the email you receive.

AWS email confirmation for SNS topic subscription to QA Lambda function responses with opt-out instructions

Prepare the knowledge base

Complete the following steps to populate the S3 bucket with archived technical articles and cases:

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose the function CustomFunctionCopyKCArticlesToS3Bucket.

AWS Lambda console displaying Functions page with CustomFunctionCopyKCArticlesToS3Bucket function details

  1. Manually invoke the function by choosing Test on the Test tab.

AWS Lambda Test tab interface with event configuration options

  1. Verify successful execution by checking the CloudWatch logs.

AWS Lambda successful function execution result with null output

  1. Repeat the process for the Lambda function CustomFunctionCopyCasesToS3Bucket.

Lambda function interface displaying CustomFunctionCopyCasesToS3Bucket configuration with CloudFormation ID and description panel

AWS Lambda test interface showing Test event configuration options and action buttons

AWS Lambda function execution success message with null response and SHA-256 code

  1. Confirm the S3 bucket has been populated with archived technical articles and cases.

Amazon S3 bucket interface showing two folders with action buttons and search functionality

Sync data to the Amazon Bedrock knowledge base

Complete the following steps to sync the data to your knowledge base:

  1. On the Lambda console, choose Functions in the navigation pane.
  2. Choose the function KBDataSourceSync.

AWS Lambda console displaying filtered functions with CloudFormation tags, Python runtime versions, and modification timestamps

  1. Manually invoke the function by choosing Test on the Test tab.

This task might take 10–15 minutes to complete.

AWS Lambda console test configuration panel with CloudWatch integration and event creation controls

  1. Verify successful execution by checking the CloudWatch logs.

Lambda function execution results showing successful completion status and details

Configure your Amazon Managed Grafana workspace

Complete the following steps to configure your Amazon Managed Grafana workspace:

  1. On the Amazon Managed Grafana console, choose Workspaces in the navigation pane.
  2. Open your workspace.
  3. Choose Assign new user or group.

Amazon Grafana workspace showing IAM configuration notice and user assignment button

  1. Select your IAM Identity Center role and choose Assign users and groups.

Amazon Grafana IAM Identity Center user assignment panel with search and selection controls

  1. On the Admin dropdown menu, choose Make admin.

Amazon Grafana user list showing assigned viewer with admin action options

  1. Enable Grafana alerting, then choose Save changes.

Amazon Grafana alerting configuration panel showing disabled status with navigation tabs and edit button

Amazon Grafana configuration panel showing enabled alerting and plugin management settings

  1. Wait 10 minutes for the workspace to become active.
  2. When it’s active, sign in to the Grafana workspace. (For more information, refer to Connect to your workspace.)

Configure data sources

Add and configure the following data sources:

  1. For Service, choose CloudWatch, then select your Region and add CloudWatch as a data source.

  1. Choose Amazon Managed Service for Prometheus as a second data source and select your Region.

  1. Validate CloudWatch connectivity:
    1. Run test queries (for example, Namespace: AWS/EC2, Metric name: CPUUtilization, Statistic: Maximum).
      Amazon Managed Gragana interface showing CPU utilization query setup for EC2 instance.
    2. Verify CloudWatch metric retrieval.
      Line graph showing CPU utilization over time with peak at 40%.
  1. Validate Amazon Managed Service for Prometheus connectivity:
    1. Run test queries (for example, Metric: hadoop_hbase_numregionservers, Label filters: cluster_id = <Amazon EMR cluster ID>).
      Amazon Managed Grafana query interface showing Hadoop HBase metric configuration.
    2. Verify Prometheus metric retrieval.
      Amazon Managed Grafana monitoring dashboard showing a graph with HBase Region Server amount from 0 to 2

Confirm SNS notification channels

Complete the following steps to confirm your SNS notification is set up:

  1. On the Amazon SNS console, choose Topics in the navigation pane.
  2. Locate and note the ARNs for -LambdaFunctionTopic and -QALambdaFunctionTopic.

AWS SNS Topics list showing 4 topics with names, types, and ARNs

AWS SNS Topics console showing filtered search results for "LambdaFunctionTopic"

AWS SNS Topics console showing filtered search results for "QALambdaFunctionTopic"

  1. Choose Contact points under Alerting.

  1. Create the first contact point:
    1. For Name, enter SNS_SSM.
    2. For Integration, choose AWS SNS.
    3. For Topic, enter the ARN for LambdaFunctionTopic.
    4. For Auth Provider, choose Workspace IAM role.
    5. For Alert Message format, choose JSON.

  1. Create the second contact point:
    1. For Name, enter SNS_QA.
    2. For Integration, choose AWS SNS.
    3. For Topic, enter the ARN for QALambdaFunctionTopic.
    4. For Auth Provider, choose Workspace IAM role.
    5. For Alert Message format, choose JSON.

Create alert rules

Complete the following steps to set up two critical alert rules:

  1. Choose Alert rules under Alerting.

  1. Set up alerting if the Apache HBase region server status is abnormal:
    1. For Alert name, enter HBase region server down.
    2. For Data source, choose Amazon Managed Service for Prometheus.
    3. For Metric, choose hadoop_hbase_numregionservers.
      Alert rule configuration interface for HBase region server monitoring
    4. For Threshold, configure to alert if the region server count is less than 2 for 3 minutes.
      Amazon Managed Grafana alert rule configuration interface with expressions setup
    5. For Evaluation interval, set to 1 minute.
      New evaluation group creation modal showing P0_RegionServer name input and 1m interval settingHBase alert configuration panel showing P0_RegionServer group and 3m pending period
    6. For Contact point, choose SNS_SSM.
      Amazon Managed Grafana alert configuration interface showing labels and notifications setup with AWS SNS integration
  1. Create a second alert for if Amazon EC2 CPU utilization is abnormal:
    1. For Alert name, enter EC2 CPU utilization too high.
    2. For Data source, choose Amazon CloudWatch.
    3. For Namespace, choose AWS/EC2.
    4. For Metric name, choose CPUUtilization
    5. For Statistic, choose Maximum.
      Amazon CloudWatch query interface for setting up EC2 CPU utilization alert conditions
    6. For Threshold, configure to alert if CPU utilization is more than 95% for 3 minutes.
      Amazon Managed Grafana alert interface with Reduce and Threshold expressions for alert condition management
    7. For Evaluation interval, configure to 1 minute.
      New evaluation group configuration modal showing CPU utilization monitoring setup with 1-minute interval
      AWS Managed Grafana alert rule configuration screen showing evaluation behavior settings
    8. For Contact point, choose SNS_QA.Amazon Managed Grafana alert configuration showing customizable labels, contact point selection for SNS_QA integration
  1. On the alert rule creation page, scroll to 5. Add annotations and for Summary, add a clear description of the alert, for example, CPU utilization on EC2 instance is too high.

Alert configuration summary field with "CPU utilization on EC2 instance is too high" warning message

Apache HBase region server incident test

To confirm the system is working as expected, complete the following Apache HBase region server incident test:

  1. SSH into an EMR core instance.
  2. Stop the Apache HBase region server using systemctl:
 # Stop HBase region server service 
 sudo systemctl stop hbase-regionserver.service 

  1. Verify the service status:
 # Check the current state of HBase region server service 
 sudo systemctl status hbase-regionserver.service
  1. Observe Amazon Managed Grafana alert progression:
    1. Monitor alert status changes.
      Alert dashboard showing HBase region server alert status in pending state
      Alert dashboard showing HBase region server alert in firing state
    2. Verify SNS message generation.
    3. Confirm SQS message queuing.
    4. Track the Lambda function triggered for remediation.

Terminal output showing HBase RegionServer service status and daemon processes

HBase monitoring interface displaying region server status with health indicators and action buttons

CPU utilization stress test

Complete the following CPU utilization stress test:

  1. SSH into the EMR primary instance.
  2. Install stress testing tools:
 sudo amazon-linux-extras install epel -y
 sudo yum install stress -y 

  1. Verify the installation:
 stress --version 

  1. Generate high CPU load using the stress command and the following command structure:
 sudo stress [options] 

For our Amazon EMR test, use the following command:

 # For m5.xlarge instances (4 vCPUs) sudo stress --cpu 4 

-c 4 in the command creates 4 CPU-bound processes (one for each vCPU).The following are instance type vCPUs for your reference:

  • m5.xlarge: 4 vCPUs
  • m5.2xlarge: 8 vCPUs
  • m5.4xlarge: 16 vCPUs
  1. Monitor system response:
    1. Observe Amazon Managed Grafana alert status changes.
      Amazon Managed Grafana dashboard header showing rules status
    2. Verify Amazon Bedrock recommendation generation.
    3. Check SNS email notification delivery.
      AWS SNS notification email showing troubleshooting steps for high CPU usageCode snippet showing CPU usage troubleshooting steps in red text

Best practices and considerations

Monitoring infrastructure requires precise alert prioritization and threshold configuration. Alert aggregation techniques prevent notification overload by consolidating event streams and reducing redundant alerts. Operational teams must maintain dashboards through consistent updates and metric integration, providing real-time visibility into system performance and health.

Security implementations focus on least-privilege AWS Identity and Access Management (IAM) roles, restricting access to critical resources and minimizing potential breach vectors. Data protection strategies involve encryption protocols for information at rest and in transit, using AES-256 standards. Automated security audit processes scan automation scripts, identifying potential vulnerabilities through code analysis and runtime inspection.

Performance optimization in serverless architectures uses Lambda extensions to cache knowledge base content, reducing latency and improving response times. Retry mechanisms for API calls implement exponential backoff strategies, mitigating transient network exceptions and enhancing system resilience. Execution time monitoring of Lambda functions enables detection of anomalies through statistical analysis, providing insights into potential system-wide incidents or performance degradations.

Clean up

To avoid incurring future charges, delete the resources by deleting the parent stack on the AWS CloudFormation console.

Conclusion

This solution provides a robust framework for automated EMR cluster monitoring and incident response. By combining real-time monitoring with AI-powered remediation suggestions and automated execution, organizations can significantly reduce MTTR for common Amazon EMR issues while building a knowledge base for future incident response.

Try out this solution for your own use case, and leave your feedback in the comments section.


About the authors

Author Yu-ting Su, Sr. Hadoop System Engineer, AWS Support Engineering. Yu-Ting is a Sr. Hadoop Systems Engineer at Amazon Web Services (AWS). Her expertise is in Amazon EMR and Amazon OpenSearch Service. She’s passionate about distributing computation and helping people to bring their ideas to life.

Process millions of observability events with Apache Flink and write directly to Prometheus

Post Syndicated from Lorenzo Nicora original https://aws.amazon.com/blogs/big-data/process-millions-of-observability-events-with-apache-flink-and-write-directly-to-prometheus/

AWS recently announced support for a new Apache Flink connector for Prometheus. The new connector, contributed by AWS to the Flink open source project, adds Prometheus and Amazon Managed Service for Prometheus as a new destination for Flink.

In this post, we explain how the new connector works. We also show how you can manage your Prometheus metrics data cardinality by preprocessing raw data with Flink to build real-time observability with Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Amazon Managed Service for Prometheus is a secure, serverless, scaleable, Prometheus-compatible monitoring service. You can use the same open source Prometheus data model and query language that you use today to monitor the performance of your workloads without having to manage the underlying infrastructure. Flink connectors are software components that move data into and out of an Amazon Managed Service for Apache Flink application. You can use the new connector to send processed data to an Amazon Managed Service for Prometheus destination starting with Flink version 1.19. With Amazon Managed Service for Apache Flink, you can transform and analyze data in real time. There are no servers and clusters to manage, and there is no compute and storage infrastructure to set up.

Observability beyond compute

In an increasingly connected world, the boundary of systems extends beyond compute assets, IT infrastructure, and applications. Distributed assets such as Internet of Things (IoT) devices, connected cars, and end-user media streaming devices are an integral part of business operations in many sectors. The ability to observe every asset of your business is key to detecting potential issues early, improving the experience of your customers, and protecting the profitability of the business.

Metrics and time series

It is helpful to think of observability as three pillars: metrics, logs, and traces. The most relevant pillar for distributed devices, like IoT, is metrics. This is because metrics can capture measurements from sensors or counting of specific events emitted by the device.

Metrics are series of samples of a given measurement at specific times. For example, in the case of a connected vehicle, they can be the readings from the electric motor RPM sensor. Metrics are normally represented as time series, or sequences of discrete data points in chronological order. Metrics’ time series are normally associated with dimensions, also called labels or tags, to help with classifying and analyzing the data. In the case of a connected vehicle, labels might be something like the following:

  • Metric name – For example, “Electric Motor RPM”
  • Vehicle ID – A unique identifier of the vehicle, like the Vehicle Identification Number (VIN)

Prometheus as a specialized time series database

Prometheus is a popular solution for storing and analyzing metrics. Prometheus defines a standard interface for storing and querying time series. Commonly used in combination with visualization tools like Grafana, Prometheus is optimized for real-time dashboards and real-time alerting.

Often considered mainly for observing compute resources, like containers or applications, Prometheus is actually a specialized time series database that can effectively be used to observe different types of distributed assets, including IoT devices.

Amazon Managed Service for Prometheus is a serverless, Prometheus-compatible monitoring service. See What is Amazon Managed Service for Prometheus? to learn more about Amazon Managed Service for Prometheus.

Effectively processing observability events, at scale

Handling observability data at scale becomes more challenging, due to the number of assets and unique metrics, especially when observing massively distributed devices, for the following reasons:

  • High cardinality – Each device emits multiple metrics or types of events, each to be tracked independently.
  • High frequency – Devices might emit events very frequently, multiple times per second. This might result in a large volume of raw data. This aspect in particular represents the main difference from observing compute resources, which are usually scraped at longer intervals.
  • Events arrive at irregular intervals and out of order – Unlike compute assets that are usually scraped at regular intervals, we often see delays of transmission or temporarily disconnected devices, which cause events to arrive at irregular intervals. Concurrent events from different devices might follow different paths and arrive at different times.
  • Lack of contextual information – Devices often transmit over channels with limited bandwidth, such as GPRS or Bluetooth. To optimize communication, events seldom contain contextual information, such as device model or user detail. However, this information is required for an effective observability.
  • Derive metrics from events – Devices often emit specific events when specific facts happen. For example, when the vehicle ignition is turned on or off, or when a warning is emitted by the onboard computer. These are not direct metrics. However, counting and measuring the rates of these events are valuable metrics that can be inferred from these events.

Effectively extracting value from raw events requires processing. Processing might happen on read, when you query the data, or upfront, before storing.

Storing and analyzing raw events

The common approach with observability events, and with metrics in particular, is “storing first.” You can simply write the raw metrics into Prometheus. Processing, such as grouping, aggregating, and calculating derived metrics, happens “on query,” when data is extracted from Prometheus.

This approach might become particularly inefficient when you’re building real-time dashboards or alerting, and your data has very high cardinality or high frequency. As a time series database is continuously queried, a large volume of data is repeatedly extracted from the storage and processed. The following diagram illustrates this workflow.

Process on query

Preprocessing raw observability events

Preprocessing raw events before storing shifts the work left, as illustrated in the following diagram. This increases the efficiency of real-time dashboards and alerts, allowing the solution to scale.

Pre-process

Apache Flink for preprocessing observability events

Preprocessing raw observability events requires a processing engine that allows you to do the following:

  • Enrich events efficiently, looking up reference data and adding new dimensions to the raw events. For example, adding the vehicle model based on the vehicle ID. Enrichment allows adding new dimensions to the time series, enabling analysis otherwise impossible.
  • Aggregate raw events over time windows, to reduce frequency. For example, if a vehicle emits an engine temperature measurement every second, you can emit a single sample with the average over 5 seconds. Prometheus can efficiently aggregate frequent samples on read. However, ingesting data with a frequency much higher than what is useful for dashboarding and real-time alerting is not an efficient use of Prometheus ingestion throughout and storage.
  • Aggregate raw events over dimensions, to reduce cardinality. For example, aggregating some measurement per vehicle model.
  • Calculate derived metrics applying arbitrary logic. For example, counting the number of warning events emitted by each vehicle. This also enables analysis otherwise impossible using only Prometheus and Grafana.
  • Support event-time semantics, to aggregate over time events from different sources.

Such a preprocessing engine must also be able to scale and process the large volume of input raw events, and to process data with low latency—normally subsecond or single-digit seconds—to enable real-time dashboards and altering. To address these requirements, we see many customers using Flink.

Apache Flink meets the aforementioned requirements. Flink is a framework and distributed stream processing engine, designed to perform computations at in-memory speed and at scale. Amazon Managed Service for Apache Flink offers a fully managed, serverless experience, allowing you to run your Flink applications without managing infrastructure or clusters.

Amazon Managed Service for Apache Flink can process the ingested raw events. The resulting metrics, with lower cardinality and frequency, and additional dimensions, can be written to Prometheus for a more effective visualization and analysis. The following diagram illustrates this workflow.

Amazon Managed Service for Apache Flink, Amazon Managed Prometheus and Grafana

Integrating Apache Flink and Prometheus

The new Flink Prometheus connector allows Flink applications to seamlessly write preprocessed time series data to Prometheus. No intermediate component is needed, and there is no requirement to implement a custom integration. The connector is designed to scale, using the ability of Flink to scale horizontally, and optimizing the writes to a Prometheus backend using a Remote-Write interface.

Example use case

AnyCompany is a car rental company managing a fleet of hundreds of thousands hybrid connected vehicles, in multiple regions. Each vehicle continuously transmits measurements from several sensors. Each sensor emits a sample every second or more frequently. Vehicles also communicate warning events when something wrong is detected by the onboard computer. The following diagram illustrates the workflow.

Example use case: connected cars

AnyCompany is planning to use Amazon Managed Service for Prometheus and Amazon Managed Grafana to visualize vehicle metrics and set up custom alerts.

However, building a real-time dashboard based on raw data, as transmitted by the vehicles, might be complicated and inefficient. Each vehicle might have hundreds of sensors, each of them resulting in a separate time series to display. Additionally, AnyCompany wants to monitor the behavior of different vehicle models. Unfortunately, the events transmitted by the vehicles only contain the VIN. The model can be inferred by looking up (joining) some reference data.

To overcome these challenges, AnyCompany has built a preprocessing stage based on Amazon Managed Service for Apache Flink. This stage has the following capabilities:

  • Enrich the raw data by adding the vehicle model, and looking up reference data based on the vehicle identification.
  • Reduce the cardinality, aggregating the results per vehicle model, available after the enrichment step.
  • Reduce the frequency of the raw metrics to reduce write bandwidth, aggregating over time windows of a few seconds.
  • Calculate derived metrics based on multiple raw metrics. For example, determine whether a vehicle is in motion when either the internal combustion engine or the electrical motor are rotating.

The result of preprocessing are more actionable metrics. A dashboard built on these metrics can, for example, help determine whether the last software update released over-the-air to all vehicles of a specific model in specific regions, is causing issues.

Using the Flink Prometheus connector, the preprocessor application can write directly to Amazon Managed Service for Prometheus, without intermediate components.

Nothing prevents you from choosing to write raw metrics with full cardinality and frequency to Prometheus, allowing you to drill down to the single vehicle. The Flink Prometheus connector is designed to scale by batching and parallelizing writes.

Solution overview

The following GitHub repository contains a fictional end-to-end example covering this use case. The following diagram illustrates the architecture of this example.

Example architecture

The workflow consists of the following steps:

  1. Vehicles, radio transmission, and ingestion of IoT events have been abstracted away, and replaced by a data generator that produces raw events for a hundred thousand fictional vehicles. For simplicity, the data generator is itself an Amazon Managed Service for Apache Flink application.
  2. Raw vehicle events are sent to a stream storage service. In this example, we use Amazon Managed Streaming for Apache Kafka (Amazon MSK).
  3. The core of the system is the preprocessor application, running in Amazon Managed Service for Apache Flink. We will dive deeper into the details of the processor in the following sections.
  4. Processed metrics are directly written to the Prometheus backend, in Amazon Managed Service for Prometheus.
  5. Metrics are used to generate real-time dashboards on Amazon Managed Grafana.

The following screenshot shows a sample dashboard.

Grafana dashboard

Raw vehicle events

Each vehicle transmits three metrics almost every second:

  • Internal combustion (IC) engine RPM
  • Electric motor RPM
  • Number of reported warnings

The raw events are identified by the vehicle ID and the region where the vehicle is located.

Preprocessor application

The following diagram illustrates the logical flow of the preprocessing application running in Amazon Managed Service for Apache Flink.

Flink application logical data flow

The workflow consists of the following steps:

  1. Raw events are ingested from Amazon MSK from Flink Kafka source.
  2. An enrichment operator adds the vehicle model, which is not contained in the raw events. This additional dimension is then used to aggregate the raw events. The resulting metrics have only two dimensions: vehicle model and region.
  3. Raw events are then aggregated over time windows (5 seconds) to reduce frequency. In this example, the aggregation logic also generates a derived metric: the number of vehicles in motion. A new metric can be derived from raw metrics with arbitrary logic. For the sake of the example, a vehicle is considered “in motion” if either the IC engine or electric motor RPM metric are not zero.
  4. The processed metrics are mapped into the input data structure of the Flink Prometheus connector, which maps directly to the time series records expected by the Prometheus Remote-Write interface. Refer to the connector documentation for more details.
  5. Finally, the metrics are sent to Prometheus using the Flink Prometheus connector. Write authentication, required by Amazon Managed Service for Prometheus, is seamlessly enabled using the Amazon Managed Service for Prometheus request signer provided with the connector. Credentials are automatically derived from the AWS Identity and Access Management (IAM) role of the Amazon Managed Service for Apache Flink application. No additional secret or credential is required.

In the GitHub repository, you can find the step-by-step instructions to set up the working example and create the Grafana dashboard.

Flink Prometheus connector key features

The Flink Prometheus connector allows Flink applications to write processed metrics to Prometheus, using the Remote-Write interface.

The connector is designed to scale write throughput by:

  • Parallelizing writes, using the Flink parallelism capability
  • Batching multiple samples in a single write request to the Prometheus endpoint

Error handling complies with Prometheus Remote-Write 1.0 specifications. The specifications are particularly strict about malformed or out-of-order data rejected by Prometheus.

When a malformed or out-of-order write is rejected, the connector discards the offending write request and continues, preferring data freshness over completeness. However, the connector makes data loss observable, emitting WARN log entries and exposing metrics that measure the volume of discarded data. In Amazon Managed Service for Apache Flink, these connector metrics can be automatically exported to Amazon CloudWatch.

Responsibilities of the user

The connector is optimized for efficiency, write throughput, and latency. Validation of incoming data would be particularly expensive in terms of CPU utilization. Additionally, different Prometheus backend implementations enforce constraints differently. For these reasons, the connector doesn’t validate incoming data before writing to Prometheus.

The user is responsible of making sure that the data sent to the Flink Prometheus connector follows the constraints enforced by the particular Prometheus implementations they are using.

Ordering

Ordering is particularly relevant. Prometheus expects that samples belonging to the same time series—samples with the same metric name and labels—are written in time order. The connector makes sure ordering is not lost when data is partitioned to parallelize writes.

However, the user is responsible for retaining the ordering upstream in the pipeline. To achieve this, the user must carefully design data partitioning within the Flink application and the stream storage. Only partitioning by key must be used, and partitioning keys must compound the metric name and all labels that will be used in Prometheus.

Conclusion

Prometheus is a specialized time series database, designed for building real-time dashboards and altering. Amazon Managed Service for Prometheus is a fully managed, serverless backend compatible with the Prometheus open source standard. Amazon Managed Grafana allows you to build real-time dashboards, seamlessly interfacing with Amazon Managed Service for Prometheus.

You can use Prometheus for observability use cases beyond compute resource, to observe IoT devices, connected cars, media streaming devices, and other highly distributed assets providing telemetry data.

Directly visualizing and analyzing high-cardinality and high-frequency data can be inefficient. Preprocessing raw observability events with Amazon Managed Service for Apache Flink shifts the work left, greatly simplifying the dashboards or alerting you can build on top of Amazon Managed Service for Prometheus.

For more information about running Flink, Prometheus, and Grafana on AWS, see the resources of these services:

For more information about the Flink Prometheus integration, see the Apache Flink documentation.


About the authors

Lorenzo NicoraLorenzo Nicora works as Senior Streaming Solution Architect at AWS, helping customers across EMEA. He has been building cloud-centered, data-intensive systems for over 25 years, working across industries both through consultancies and product companies. He has used open-source technologies extensively and contributed to several projects, including Apache Flink, and is the maintainer of the Flink Prometheus connector.

Francisco MorilloFrancisco Morillo is a Senior Streaming Solutions Architect at AWS. Francisco works with AWS customers, helping them design real-time analytics architectures using AWS services, supporting Amazon MSK and Amazon Managed Service for Apache Flink. He is also a main contributor to the Flink Prometheus connector.

Correlate telemetry data with Amazon OpenSearch Service and Amazon Managed Grafana

Post Syndicated from Balaji Mohan original https://aws.amazon.com/blogs/big-data/correlate-telemetry-data-with-amazon-opensearch-service-and-amazon-managed-grafana/

Troubleshooting a large, complex, distributed enterprise application involves challenges like tracing requests across multiple services, identifying performance bottlenecks across the stack, and understanding cascading failures between dependent services. Customers often need to work with isolated data to identify the underlying cause of the problem. By correlating different signals like logs, traces, metrics, and other performance indicators, you can get valuable insight into what caused the problem, where, and why.

Amazon OpenSearch Service is a managed service to deploy, operate, and search data at scale within AWS. Amazon Managed Grafana is a secure data visualization service to query operational data from multiple sources, including OpenSearch Service.

In this post, we show you how to use these services to correlate the various observability signals that improve root cause analysis, thereby resulting in reduced Mean Time to Resolution (MTTR). We also provide a reference solution that can be used at scale for proactive monitoring of enterprise applications to avoid a problem before they occur.

Solution overview

The following diagram shows the solution architecture for collecting and correlating various enterprise telemetry signals at scale.

At the core of this architecture are applications composed of microservices (represented by orange boxes) running on Amazon Elastic Kubernetes Service (Amazon EKS). These microservices contain instrumentation that emit telemetry data in the form of metrics, logs, and traces. This data is exported into the OpenTelemetry Collector, which serves as a central vendor agnostic gateway to collect this data uniformly.

In this post, we use an OpenTelemetry demo application as a sample enterprise application. Large enterprise customers typically separate their observability signal data into various stores for scalability, fault isolation, access control, and ease of operation. To aid in these functions, we recommend and use Amazon OpenSearch Ingestion for a serverless, scalable, and fully managed data pipeline. We separate log and trace data and send them to distinct OpenSearch Service domains. The solution also sends the metrics data to Amazon Managed Service for Prometheus.

We use Amazon Managed Grafana as a data visualization and analytics platform to query and visualize this data. We also show how to employ correlations as a valuable tool to gain insights from these signals spread across various data stores.

The following sections outline building this architecture at scale.

Prerequisites

Complete the following prerequisite steps:

  1. Provision and configure the Amazon Managed Prometheus workspace to receive metrics from the OpenTelemetry Collector.
  2. Create two dedicated OpenSearch Service domains (or use existing ones) to ingest logs and traces from the OpenTelemetry Collector.
  3. Create an Amazon Managed Grafana workspace and configure data sources to connect to Amazon Managed Prometheus and OpenSearch Service.
  4. Set up an EKS cluster to deploy applications and the OpenTelemetry Collector.

Create log and trace OpenSearch Ingestion pipelines

Before setting up the ingestion pipelines, you need to create the necessary AWS Identity and Access Management (IAM) policies and roles. This process involves creating two policies for domain and OSIS access, followed by creating a pipeline role that uses these policies.

Create a policy for ingestion

Complete the following steps to create an IAM policy:

  1. Open the IAM console.
  2. Choose Policies in the navigation pane, then choose Create policy.
  3. On the JSON tab, enter the following policy into the editor:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "es:DescribeDomain",
            "Resource": "arn:aws:es:*:{accountId}:domain/*"
        },
        {
            "Effect": "Allow",
            "Action": [ "es:ESHttpGet", "es:HttpHead", "es:HttpDelete", "es:HttpPatch", "es:HttpPost", "es:HttpPut" ],
            "Resource": "arn:aws:es:us-east-1:{accountId}:domain/otel-traces"
        },
        {
            "Effect": "Allow",
            "Action": [ "es:ESHttpGet", "es:HttpHead", "es:HttpDelete", "es:HttpPatch", "es:HttpPost", "es:HttpPut" ],
            "Resource": "arn:aws:es:us-east-1:{accountId}:domain/otel-logs"
        }
        }
    ]
}

// Replace {accountId} with your own values
  1. Choose Next, choose Next again, and name your policy domain-policy.
  2. Choose Create policy.
  3. Create another policy with the name osis-policy and use the following JSON:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "osis:Ingest",
            "Resource": "arn:aws:osis:us-east-1:{accountId}:pipeline/osi-pipeline-otellogs"
        },
        {
            "Effect": "Allow",
            "Action": "osis:Ingest",
            "Resource": "arn:aws:osis:us-east-1:{accountId}:pipeline/osi-pipeline-oteltraces"
        }
    ]
}
// Replace {accountId} with your own values

Create a pipeline role

Complete the following steps to create a pipeline role:

  1. On the IAM console, choose Roles in the navigation pane, then choose Create role.
  2. Select Custom trust policy and enter the following policy into the editor:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": [
                    "eks.amazonaws.com",
                    "osis-pipelines.amazonaws.com"
                ],
                "AWS": "{nodegroup_arn}"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}

// Replace {nodegroup_arn} with your own values
  1. Choose Next, then search for and select the policies osis-policy and domain-policy you just created.
  2. Choose Next and name the role PipelineRole.
  3. Choose Create role.

Allow access for the pipeline role in OpenSearch Service domains

To enable access for the pipeline role in OpenSearch Service domains, complete the following steps:

  1. Open the OpenSearch Service console.
  2. Choose your domain (either logs or traces).
  3. Choose the OpenSearch Dashboards URL
  4. Sign in with your credentials.

Then, complete the following steps for each OpenSearch Service domain (logs and traces domains).

  1. In OpenSearch Dashboards, go to the Security
  2. Choose Roles and then all_access.

This procedure uses the all_access role for demonstration purposes only. This grants full administrative privileges to the pipeline role, which violates the principle of least privilege and could pose security risks. For production environments, you should create a custom role with minimal permissions required for data ingestion, limit permissions to specific indexes and operations, consider implementing index patterns and time-based access controls, and regularly audit role mappings and permissions. For detailed guidance on creating custom roles with appropriate permissions, refer to Security in Amazon OpenSearch Service.

  1. Choose Mapped users and then Managed mapping.
  2. On the Map user page, under Backend roles, update the backend role with the Amazon Resource Name (ARN) for the role PiplelineRole.
  3. Choose Map.

Create a pipeline for logs

Complete the following steps to create a pipeline for logs:

  1. Open the OpenSearch Service console.
  2. Choose Ingestion pipelines.
  3. Choose Create pipeline.
  4. Define the pipeline configuration by entering the following:
version: "2"
otel-logs-pipeline:
  source:
    otel_logs_source:
      path: "/v1/logs"
  sink:
    - opensearch:
        hosts: ["{OpenSearch_domain_endpoint}"]
        aws:
          sts_role_arn: "arn:aws:iam::{accountId}:role/osi-pipeline-role"
          region: "us-east-1"
          serverless: false
        index: "observability-otel-logs%{yyyy-MM-dd}"
       
 # To get the values for the placeholders:
 # 1. {OpenSearch_domain_endpoint}: You can find the domain endpoint by navigating to the Amazon Managed Opensearch managed clusters in the AWS Management Console, and then clicking on the domain.
 # After obtaining the necessary values, replace the placeholders in the configuration with the actual values.           

Create a pipeline for traces

Complete the following steps to create a pipeline for traces:

  1. Open the OpenSearch Service console.
  2. Choose Ingestion pipelines.
  3. Choose Create pipeline.
  4. Define the pipeline configuration by entering the following:
version: "2"
entry-pipeline:
  source:
    otel_trace_source:
      path: "/v1/traces"
  processor:
    - trace_peer_forwarder:
  sink:
    - pipeline:
        name: "span-pipeline"
    - pipeline:
        name: "service-map-pipeline"
span-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    - otel_traces:
  sink:
    - opensearch:
        index_type: "trace-analytics-raw"
        hosts: ["{OpenSearch_domain_endpoint}"]
        aws:                 
          sts_role_arn: "arn:aws:iam::{accountId}:role/osi-pipeline-role"
          region: "us-east-1"
service-map-pipeline:
  source:
    pipeline:
      name: "entry-pipeline"
  processor:
    - service_map:
  sink:
    - opensearch:
        index_type: "trace-analytics-service-map"
        hosts: ["{OpenSearch_domain_endpoint}"]
        aws:                 
          sts_role_arn: "arn:aws:iam::{accountId}:role/osi-pipeline-role"
          region: "us-east-1"
         
 # To get the values for the placeholders:
 # 1. {OpenSearch_domain_endpoint}: You can find the domain endpoint by navigating to the Amazon Managed Opensearch managed clusters in the AWS Management Console, and then clicking on the domain.  # 2. {accountId}: This is your AWS account ID. You can find your account ID by clicking on your username in the top-right corner of the AWS Management Console and selecting "My Account" from the dropdown menu.
 # After obtaining the necessary values, replace the placeholders in the configuration with the actual values.     

Install the OpenTelemetry demo application in Amazon EKS

Use the EKS cluster you set up earlier along with AWS CloudShell or another tool to complete these steps:

  1. Open the AWS Management Console.
  2. Choose the CloudShell icon in the top navigation bar, or go directly to the CloudShell console.
  3. Wait for the shell environment to initialize—it comes preinstalled with common AWS Command Line Interface (AWS CLI) tools.

Now you can complete the following steps to install the application.

  1. Clone the OpenTelemetry Demo repository:
git clone https://github.com/aws-samples/sample-correlation-opensearch-repository
  1. Navigate to the Kubernetes directory:
cd deployment_files
  1. Deploy the demo application using kubectl apply:
kubectl apply -f .
  1. Use a load balancer to expose the frontend service so you can reach the source application web URL:
kubectl expose deployment opentelemetry-demo-frontendproxy --type=LoadBalancer --name=frontendproxy
  1. After you have deployed the application, access the frontend application using the load balancer on port 8080. Use your browser to visit http://<LoadBalancerIP>:8080/ to open the source application for OpenTelemetry.

By following these steps, you can successfully install and access demo applications on your EKS cluster.

Configure the OpenTelemetry Collector exporter for logs, traces, and metrics

The OpenTelemetry Collector is a tool that manages the receiving, processing, and exporting of telemetry data from your application to a target repository.

In this step, we send logs and traces to OpenSearch Service and metrics to Amazon Managed Prometheus. The OpenTelemetry Collector also works with popular data repositories like Jaeger and a variety of other open source and commercial platforms. In this section, we include steps to configure the OpenTelemetry Collector in an EKS environment. Then we deploy the demo application and explore the OpenTelemetry exporters using AWS Managed Solutions instead of the open source versions.

Complete the following steps:

  1. Open the otel-collector-config ConfigMap in your preferred editor:
kubectl edit configmap opentelemetry-demo-otelcol -n otel-demo
  1. Update the exporters section with the following configuration (provide the appropriate Amazon Managed Service for Prometheus endpoint and OpenSearch Service log ingestion URLs):
exporters:
     logging: {}
     otlphttp/logs:
       logs_endpoint: "<AWS_OPENSEARCH_LOG_INGESTION_URL>/v1/logs"
       auth:
         authenticator: sigv4auth
       compression: none
     otlphttp/traces:
       traces_endpoint: "<AWS_OPENSEARCH_TRACE_INGESTION_URL>/v1/traces"
       auth:
         authenticator: sigv4auth
       compression: none
     prometheusremotewrite:
        endpoint: "<AWS_MANAGED_PROMETHEUS_ENDPOINT>"
        auth:
          authenticator: sigv4auth 
  1. Locate the extensions section and update the IAM role ARN in the sigv4auth configuration:
sigv4auth:
        assume_role:
            arn: "arn:aws:iam::{accountId}:role/osi-pipeline-role"
            sts_region: "us-east-1"
        region: "us-east-1"
        service: "osis"
 #  {accountId}: replace accountID with your account id
  1. After updating the ConfigMap, restart the OpenTelemetry Collector deployment:
kubectl rollout restart deployment opentelemetry-demo-otelcol -n otel-demo

With these changes, the OpenTelemetry Collector will send trace data to the OpenSearch Service domain, metrics data to the AWS Managed Service for Prometheus endpoint, and log data to the OpenSearch Service domain.

Configure Amazon Managed Grafana

Before you can visualize your logs and traces, you need to configure OpenSearch Service as a data source in your Amazon Managed Grafana workspace. This configuration is done through the Amazon Managed Grafana console.

Configure the OpenSearch Service data source

Complete the following steps to configure the OpenSearch Service data source:

  1. Open the Amazon Managed Grafana console.
  2. Select your workspace and choose the workspace URL to access your Grafana instance.
  3. Log in to your Amazon Managed Grafana instance.
  4. From the side menu, choose the configuration (gear) icon.
  5. On the Configuration menu, choose Data Sources.
  6. Choose Add data source.
  7. On the Add data source page, select OpenSearch Service from the list of available data sources.
  8. In the Name field, enter a descriptive name for the data source.
  9. In the URL field, enter the URL (OpenSearch Service domain endpoint) of your OpenSearch Service domain, including the protocol and port number.
  10. If your OpenSearch cluster is configured with authentication, provide the required credentials in the User and Password
  11. If you want to use a specific index pattern for the data source, you can specify it in the Index name field (For example, logstash-*).
  12. Adjust any other settings as needed, such as the Time field name and Time interval.
  13. Choose Save & Test to verify the connection to your OpenSearch cluster.

If the test is successful, you should see a green notification with the message “Data source is working.”

  1. Choose Save to save the data source configuration.
  2. Repeat the same steps for the OpenSearch logs and traces domains.

Configure the Prometheus data source

Complete the following steps to configure the Prometheus data source:

  1. Open the Amazon Managed Grafana console.
  2. Select your workspace and choose the workspace URL to access your Grafana instance.
  3. Log in to your Amazon Managed Grafana instance.
  4. From the side menu, choose the configuration (gear) icon.
  5. On the Configuration menu, choose Data Sources.
  6. Choose Add data source.
  7. On the Add data source page, select Amazon Managed Prometheus from the list of available data sources.
  8. In the Name field, enter a descriptive name for the data source.
  9. The AWS Auth Provider and Default Region fields should be automatically populated based on your Amazon Managed Grafana workspace configuration.
  10. In the Workspace field, enter the ID or alias of your Amazon Managed Prometheus workspace.
  11. Choose Save & Test to verify the connection to your Amazon Managed Prometheus workspace.

If the test is successful, you should see a green notification with the message “Data source is working.”

  1. Choose Save to save the data source configuration.

Create correlations in Amazon Managed Grafana

To establish connections between your logs and traces data, you need to set up data correlations in Amazon Managed Grafana. This allows you to navigate seamlessly between related logs and traces. Follow these steps in your Amazon Managed Grafana workspace:

  1. Open the Amazon Managed Grafana console.
  2. Select your workspace and choose the workspace URL to access your Grafana instance.
  3. In the Amazon Managed Grafana portal, on the Administration menu, choose Plugins and Data, and choose Correlation.

  1. On the Set up the target for the correlation page, under Target, choose your traces data source (OpenSearch Service, for example, otel-traces) from the dropdown list and define the query that will execute when the link is followed. You can use variables to query specific field values. For example, traceId: ${__value.raw}.

  1. On the Set up the target for the correlation page, choose the log data source from the dropdown list, and enter the field name to be linked or correlated with the traces data source in the OpenSearch Service data source. For example, traceID.

  1. Choose Save to complete the correlation configuration.

  1. Repeat the steps to create a correlation between metrics on Prometheus to logs in OpenSearch Service.

Validate results

In Amazon Managed Grafana, using the Prometheus data source, locate the desired instance for correlation. The instance ID will be displayed as a link. Follow the link to open the corresponding log details in a panel on the right side of the page.

With the logs to traces correlation configured, you can access trace information directly from the logs page. Choose traces on the log details panel to view the corresponding trace data.

The following screenshot demonstrates the node graph visualization showing the correlation flow: instance metrics to logs to traces.

Clean up

Remove the infrastructure for this solution when not in use to avoid incurring unnecessary costs.

Conclusion

In this post, we showed how to use correlation as a helpful tool to gain insight into observability data stored in various stores.

Separating logs and traces into dedicated domains provides the following benefits:

  • Better resource allocation and scaling based on different workload patterns
  • Independent performance optimization for each data type
  • Simplified cost tracking and management
  • Enhanced security control with separate access policies

You can use this solution as a reference to build a scalable observability solution for your enterprise to detect, investigate, and remediate problems faster. This ability, when used along next-generation artificial intelligence and machine learning (AI/ML), helps to not only proactively react but predict and prevent problems before they occur. You can learn more about AI/ML with AWS.


About the Authors

Balaji Mohan is a Senior Delivery Consultant specializing in application and data modernization to the cloud. His business-first approach provides seamless transitions, aligning technology with organizational goals. Using cloud-centered architectures, he delivers scalable, agile, and cost-effective solutions, driving innovation and growth.

Senthil Ramasamy is a Senior Database Consultant at Amazon Web Services. He works with AWS customers to provide guidance and technical assistance on database services, helping them with database migrations to the AWS Cloud and improving the value of their solutions when using AWS.

Muthu Pitchaimani is a Search Specialist with Amazon OpenSearch Service. He builds large-scale search applications and solutions. Muthu is interested in the topics of networking and security, and is based out of Austin, Texas.

Enhance monitoring and debugging for AWS Glue jobs using new job observability metrics: Part 2

Post Syndicated from Noritaka Sekiyama original https://aws.amazon.com/blogs/big-data/part-2-enhance-monitoring-and-debugging-for-aws-glue-jobs-using-new-job-observability-metrics/

Monitoring data pipelines in real time is critical for catching issues early and minimizing disruptions. AWS Glue has made this more straightforward with the launch of AWS Glue job observability metrics, which provide valuable insights into your data integration pipelines built on AWS Glue. However, you might need to track key performance indicators across multiple jobs. In this case, a dashboard that can visualize the same metrics with the ability to drill down into individual issues is an effective solution to monitor at scale.

This post, walks through how to integrate AWS Glue job observability metrics with Grafana using Amazon Managed Grafana. We discuss the types of metrics and charts available to surface key insights along with two use cases on monitoring error classes and throughput of your AWS Glue jobs.

Solution overview

Grafana is an open source visualization tool that allows you to query, visualize, alert on, and understand your metrics no matter where they are stored. With Grafana, you can create, explore, and share visually rich, data-driven dashboards. The new AWS Glue job observability metrics can be effortlessly integrated with Grafana for real-time monitoring purpose. Metrics like worker utilization, skewness, I/O rate, and errors are captured and visualized in easy-to-read Grafana dashboards. The integration with Grafana provides a flexible way to build custom views of pipeline health tailored to your needs. Observability metrics open up monitoring capabilities that weren’t possible before for AWS Glue. Companies relying on AWS Glue for critical data integration pipelines can have greater confidence that their pipelines are running efficiently.

AWS Glue job observability metrics are emitted as Amazon CloudWatch metrics. You can provision and manage Amazon Managed Grafana, and configure the CloudWatch plugin for the given metrics. The following diagram illustrates the solution architecture.

Implement the solution

Complete following steps to set up the solution:

  1. Set up an Amazon Managed Grafana workspace.
  2. Sign in to your workspace.
  3. Choose Administration.
  4. Choose Add new data source.
  5. Choose CloudWatch.
  6. For Default Region, select your preferred AWS Region.
  7. For Namespaces of Custom Metrics, enter Glue.
  8. Choose Save & test.

Now the CloudWatch data source has been registered.

  1. Copy the data source ID from the URL https://g-XXXXXXXXXX.grafana-workspace.<region>.amazonaws.com/datasources/edit/<data-source-ID>/.

The next step is to prepare the JSON template file.

  1. Download the Grafana template.
  2. Replace <data-source-id> in the JSON file with your Grafana data source ID.

Lastly, configure the dashboard.

  1. On the Grafana console, choose Dashboards.
  2. Choose Import on the New menu.
  3. Upload your JSON file, and choose Import.

The Grafana dashboard visualizes AWS Glue observability metrics, as shown in the following screenshots.

The sample dashboard has the following charts:

  • [Reliability] Job Run Errors Breakdown
  • [Throughput] Bytes Read & Write
  • [Throughput] Records Read & Write
  • [Resource Utilization] Worker Utilization
  • [Job Performance] Skewness
  • [Resource Utilization] Disk Used (%)
  • [Resource Utilization] Disk Available (GB)
  • [Executor OOM] OOM Error Count
  • [Executor OOM] Heap Memory Used (%)
  • [Driver OOM] OOM Error Count
  • [Driver OOM] Heap Memory Used (%)

Analyze the causes of job failures

Let’s try analyzing the causes of job run failures of the job iot_data_processing.

First, look at the pie chart [Reliability] Job Run Errors Breakdown. This pie chart quickly identifies which errors are most common.

Then filter with the job name iot_data_processing to see the common errors for this job.

We can observe that the majority (75%) of failures were due to glue.error.DISK_NO_SPACE_ERROR.

Next, look at the line chart [Resource Utilization] Disk Used (%) to understand the driver’s used disk space during the job runs. For this job, the green line shows the driver’s disk usage, and the yellow line shows the average of the executors’ disk usage.

We can observe that there were three times when 100% of disk was used in executors.

Next, look at the line chart [Throughput] Records Read & Write to see whether the data volume was changed and whether it impacted disk usage.

The chart shows that around four billion records were read at the beginning of this range; however, around 63 billion records were read at the peak. This means that the incoming data volume has significantly increased, and caused local disk space shortage in the worker nodes. For such cases, you can increase the number of workers, enable auto scaling, or choose larger worker types.

After implementing those suggestions, we can see lower disk usage and a successful job run.

(Optional) Configure cross-account setup

We can optionally configure a cross-account setup. Cross-account metrics depend on CloudWatch cross-account observability. In this setup, we expect the following environment:

  • AWS accounts are not managed in AWS Organizations
  • You have two accounts: one account is used as the monitoring account where Grafana is located, another account is used as the source account where the AWS Glue-based data integration pipeline is located

To configure a cross-account setup for this environment, complete the following steps for each account.

Monitoring account

Complete the following steps to configure your monitoring account:

  1. Sign in to the AWS Management Console using the account you will use for monitoring.
  2. On the CloudWatch console, choose Settings in the navigation pane.
  3. Under Monitoring account configuration, choose Configure.
  4. For Select data, choose Metrics.
  5. For List source accounts, enter the AWS account ID of the source account that this monitoring account will view.
  6. For Define a label to identify your source account, choose Account name.
  7. Choose Configure.

Now the account is successfully configured as a monitoring account.

  1. Under Monitoring account configuration, choose Resources to link accounts.
  2. Choose Any account to get a URL for setting up individual accounts as source accounts.
  3. Choose Copy URL.

You will use the copied URL from the source account in the next steps.

Source account

Complete the following steps to configure your source account:

  1. Sign in to the console using your source account.
  2. Enter the URL that you copied from the monitoring account.

You can see the CloudWatch settings page, with some information filled in.

  1. For Select data, choose Metrics.
  2. Do not change the ARN in Enter monitoring account configuration ARN.
  3. The Define a label to identify your source account section is pre-filled with the label choice from the monitoring account. Optionally, choose Edit to change it.
  4. Choose Link.
  5. Enter Confirm in the box and choose Confirm.

Now your source account has been configured to link to the monitoring account. The metrics emitted in the source account will show on the Grafana dashboard in the monitoring account.

To learn more, see CloudWatch cross-account observability.

Considerations

The following are some considerations when using this solution:

  • Grafana integration is defined for real-time monitoring. If you have a basic understanding of your jobs, it will be straightforward for you to monitor performance, errors, and more on the Grafana dashboard.
  • Amazon Managed Grafana depends on AWS IAM Identify Center. This means you need to manage single sign-on (SSO) users separately, not just AWS Identity and Access Management (IAM) users and roles. It also requires another sign-in step from the AWS console. The Amazon Managed Grafana pricing model depends on an active user license per workspace. More users can cause more charges.
  • Graph lines are visualized per job. If you want to see the lines across all the jobs, you can choose ALL in the control.

Conclusion

AWS Glue job observability metrics offer a powerful new capability for monitoring data pipeline performance in real time. By streaming key metrics to CloudWatch and visualizing them in Grafana, you gain more fine-grained visibility that wasn’t possible before. This post showed how straightforward it is to enable observability metrics and integrate the data with Grafana using Amazon Managed Grafana. We explored the different metrics available and how to build customized Grafana dashboards to surface actionable insights.

Observability is now an essential part of robust data orchestration on AWS. With the ability to monitor data integration trends in real time, you can optimize costs, performance, and reliability.


About the Authors

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Xiaoxi Liu is a Software Development Engineer on the AWS Glue team. Her passion is building scalable distributed systems for efficiently managing big data on the cloud, and her concentrations are distributed system, big data, and cloud computing.

Akira Ajisaka is a Senior Software Development Engineer on the AWS Glue team. He likes open source software and distributed systems. In his spare time, he enjoys playing arcade games.

Shenoda Guirguis is a Senior Software Development Engineer on the AWS Glue team. His passion is in building scalable and distributed data infrastructure and processing systems. When he gets a chance, Shenoda enjoys reading and playing soccer.

Sean Ma is a Principal Product Manager on the AWS Glue team. He has an 18-year track record of innovating and delivering enterprise products that unlock the power of data for users. Outside of work, Sean enjoys scuba diving and college football.

Mohit Saxena is a Senior Software Development Manager on the AWS Glue team. His team focuses on building distributed systems to enable customers with interactive and simple to use interfaces to efficiently manage and transform petabytes of data seamlessly across data lakes on Amazon S3, databases and data-warehouses on cloud.

Amazon Managed Service for Prometheus collector provides agentless metric collection for Amazon EKS

Post Syndicated from Donnie Prakoso original https://aws.amazon.com/blogs/aws/amazon-managed-service-for-prometheus-collector-provides-agentless-metric-collection-for-amazon-eks/

Today, I’m happy to announce a new capability, Amazon Managed Service for Prometheus collector, to automatically and agentlessly discover and collect Prometheus metrics from Amazon Elastic Kubernetes Service (Amazon EKS). Amazon Managed Service for Prometheus collector consists of a scraper that discovers and collects metrics from Amazon EKS applications and infrastructure without needing to run any collectors in-cluster.

This new capability provides fully managed Prometheus-compatible monitoring and alerting with Amazon Managed Service for Prometheus. One of the significant benefits is that the collector is fully managed, automatically right-sized, and scaled for your use case. This means you don’t have to run any compute for collectors to collect the available metrics. This helps you optimize metric collection costs to monitor your applications and infrastructure running on EKS.

With this launch, Amazon Managed Service for Prometheus now supports two major modes of Prometheus metrics collection: AWS managed collection, a fully managed and agentless collector, and customer managed collection.

Getting started with Amazon Managed Service for Prometheus Collector
Let’s take a look at how to use AWS managed collectors to ingest metrics using this new capability into a workspace in Amazon Managed Service for Prometheus. Then, we will evaluate the collected metrics in Amazon Managed Service for Grafana.

When you create a new EKS cluster using the Amazon EKS console, you now have the option to enable AWS managed collector by selecting Send Prometheus metrics to Amazon Managed Service for Prometheus. In the Destination section, you can also create a new workspace or select your existing Amazon Managed Service for Prometheus workspace. You can learn more about how to create a workspace by following the getting started guide.

Then, you have the flexibility to define your scraper configuration using the editor or upload your existing configuration. The scraper configuration controls how you would like the scraper to discover and collect metrics. To see possible values you can configure, please visit the Prometheus Configuration page.

Once you’ve finished the EKS cluster creation, you can go to the Observability tab on your cluster page to see the list of scrapers running in your EKS cluster.

The next step is to configure your EKS cluster to allow the scraper to access metrics. You can find the steps and information on Configuring your Amazon EKS cluster.

Once your EKS cluster is properly configured, the collector will automatically discover metrics from your EKS cluster and nodes. To visualize the metrics, you can use Amazon Managed Grafana integrated with your Prometheus workspace. Visit the Set up Amazon Managed Grafana for use with Amazon Managed Service for Prometheus page to learn more.

The following is a screenshot of metrics ingested by the collectors and visualized in an Amazon Managed Grafana workspace. From here, you can run a simple query to get the metrics that you need.

Using AWS CLI and APIs
Besides using the Amazon EKS console, you can also use the APIs or AWS Command Line Interface (AWS CLI) to add an AWS managed collector. This approach is useful if you want to add an AWS managed collector into an existing EKS cluster or make some modifications to the existing collector configuration.

To create a scraper, you can run the following command:

aws amp create-scraper \ 
       --source eksConfiguration="{clusterArn=<EKS-CLUSTER-ARN>,securityGroupIds=[<SG-SECURITY-GROUP-ID>],subnetIds=[<SUBNET-ID>]}" \ 
       --scrape-configuration configurationBlob=<BASE64-CONFIGURATION-BLOB> \ 
       --destination=ampConfiguration={workspaceArn="<WORKSPACE_ARN>"}

You can get most of the parameter values from the respective AWS console, such as your EKS cluster ARN and your Amazon Managed Service for Prometheus workspace ARN. Other than that, you also need to define the scraper configuration defined as configurationBlob.

Once you’ve defined the scraper configuration, you need to encode the configuration file into base64 encoding before passing the API call. The following is the command that I use in my Linux development machine to encode sample-configuration.yml into base64 and copy it onto the clipboard.

$ base64 sample-configuration.yml | pbcopy

Now Available
The Amazon Managed Service for Prometheus collector capability is now available to all AWS customers in all AWS Regions where Amazon Managed Service for Prometheus is supported.

Learn more:

Happy building!
Donnie

Monitoring Amazon DevOps Guru insights using Amazon Managed Grafana

Post Syndicated from MJ Kubba original https://aws.amazon.com/blogs/devops/monitoring-amazon-devops-guru-insights-using-amazon-managed-grafana/

As organizations operate day-to-day, having insights into their cloud infrastructure state can be crucial for the durability and availability of their systems. Industry research estimates[1] that downtime costs small businesses around $427 per minute of downtime, and medium to large businesses an average of $9,000 per minute of downtime. Amazon DevOps Guru customers want to monitor and generate alerts using a single dashboard. This allows them to reduce context switching between applications, providing them an opportunity to respond to operational issues faster.

DevOps Guru can integrate with Amazon Managed Grafana to create and display operational insights. Alerts can be created and communicated for any critical events captured by DevOps Guru and notifications can be sent to operation teams to respond to these events. The key telemetry data types of logs and metrics are parsed and filtered to provide the necessary insights into observability.

Furthermore, it provides plug-ins to popular open-source databases, third-party ISV monitoring tools, and other cloud services. With Amazon Managed Grafana, you can easily visualize information from multiple AWS services, AWS accounts, and Regions in a single Grafana dashboard.

In this post, we will walk you through integrating the insights generated from DevOps Guru with Amazon Managed Grafana.

Solution Overview:

This architecture diagram shows the flow of the logs and metrics that will be utilized by Amazon Managed Grafana. Insights originate from DevOps Guru, each insight generating an event. These events are captured by Amazon EventBridge, and then saved as logs to Amazon CloudWatch Log Group DevOps Guru service metrics, and then parsed by Amazon Managed Grafana to create new dashboards.

This architecture diagram shows the flow of the logs and metrics that will be utilized by Amazon Managed Grafana, starting with DevOps Guru and then using Amazon EventBridge to save the insight event logs to Amazon CloudWatch Log Group DevOps Guru service metrics to be parsed by Amazon Managed Grafana and create new dashboards in Grafana from these logs and Metrics.

Now we will walk you through how to do this and set up notifications to your operations team.

Prerequisites:

The following prerequisites are required for this walkthrough:

  • An AWS Account
  • Enabled DevOps Guru on your account with CloudFormation stack, or tagged resources monitored.

Using Amazon CloudWatch Metrics

 

DevOps Guru sends service metrics to CloudWatch Metrics. We will use these to      track metrics for insights and metrics for your DevOps Guru usage; the DevOps Guru service reports the metrics to the AWS/DevOps-Guru namespace in CloudWatch by default.

First, we will provision an Amazon Managed Grafana workspace and then create a Dashboard in the workspace that uses Amazon CloudWatch as a data source.

Setting up Amazon CloudWatch Metrics

  1. Create Grafana Workspace
    Navigate to Amazon Managed Grafana from AWS console, then click Create workspace

a. Select the Authentication mechanism

i. AWS IAM Identity Center (AWS SSO) or SAML v2 based Identity Providers

ii. Service Managed Permission or Customer Managed

iii. Choose Next

b. Under “Data sources and notification channels”, choose Amazon CloudWatch

c. Create the Service.

You can use this post for more information on how to create and configure the Grafana workspace with SAML based authentication.

Next, we will show you how to create a dashboard and parse the Logs and Metrics to display the DevOps Guru insights and recommendations.

2. Configure Amazon Managed Grafana

a. Add CloudWatch as a data source:
From the left bar navigation menu, hover over AWS and select Data sources.

b. From the Services dropdown select and configure CloudWatch.

3. Create a Dashboard

a. From the left navigation bar, click on add a new Panel.

b. You will see a demo panel.

c. In the demo panel – Click on Data source and select Amazon CloudWatch.

The Amazon Grafana Workspace dashboard with the Grafana data source dropdown menu open. The drop down has 'Amazon CloudWatch (region name)' highlighted, other options include 'Mixed, 'Dashboard', and 'Grafana'.

d. For this panel we will use CloudWatch metrics to display the number of insights.

e. From Namespace select the AWS/DevOps-Guru name space, Insights as Metric name and Average for Statistics.

In the Amazon Grafana Workspace dashboard the user has entered values in three fields. "Grafana Query with Namespace" has the chosen value: AWS/DevOps-Guru. "Metric name" has the chosen value: Insights. "Statistic" has the chosen value: Average.

click apply

Time series graph contains a single new data point, indicting a recent event.

f. This is our first panel. We can change the panel name from the right-side bar under Title. We will name this panel “Insights

g. From the top right menu, click save dashboard and give your new dashboard a name

Using Amazon CloudWatch Logs via Amazon EventBridge

For other insights outside of the service metrics, such as a number of insights per specific service or the average for a region or for a specific AWS account, we will need to parse the event logs. These logs first need to be sent to Amazon CloudWatch Logs. We will go over the details on how to set this up and how we can parse these logs in Amazon Managed Grafana using CloudWatch Logs Query Syntax. In this post, we will show a couple of examples. For more details, please check out this User Guide documentation. This is not done by default and we will need to use Amazon EventBridge to pass these logs to CloudWatch.

DevOps Guru logs include other details that can be helpful when building Dashboards, such as region, Insight Severity (High, Medium, or Low), associated resources, and DevOps guru dashboard URL, among other things.  For more information, please check out this User Guide documentation.

EventBridge offers a serverless event bus that helps you receive, filter, transform, route, and deliver events. It provides one to many messaging solutions to support decoupled architectures, and it is easy to integrate with AWS Services and 3rd-party tools. Using Amazon EventBridge with DevOps Guru provides a solution that is easy to extend to create a ticketing system through integrations with ServiceNow, Jira, and other tools. It also makes it easy to set up alert systems through integrations with PagerDuty, Slack, and more.

 

Setting up Amazon CloudWatch Logs

  1. Let’s dive in to creating the EventBridge rule and enhance our Grafana dashboard:

a. First head to Amazon EventBridge in the AWS console.

b. Click Create rule.

     Type in rule Name and Description. You can leave the Event bus to default and Rule type to Rule with an event pattern.

c. Select AWS events or EventBridge partner events.

    For event Pattern change to Customer patterns (JSON editor) and use:

{"source": ["aws.devops-guru"]}

This filters for all events generated from DevOps Guru. You can use the same mechanism to filter out specific messages such as new insights, or insights closed to a different channel. For this demonstration, let’s consider extracting all events.

As the user configures their EventBridge Rule, for the Creation method they have chosen "Custom pattern (JSON editor) write an event pattern in JSON." For the Event pattern editor just below they have entered {"source":["aws.devops-guru"]}

d. Next, for Target, select AWS service.

    Then use CloudWatch log Group.

    For the Log Group, give your group a name, such as “devops-guru”.

In the prompt for the new Target's configurations, the user has chosen AWS service as the Target type. For the Select a target drop down, they chose CloudWatch log Group. For the log group, they selected the /aws/events radio option, and then filled in the following input text box with the kebab case group name devops-guru.

e. Click Create rule.

f. Navigate back to Amazon Managed Grafana.
It’s time to add a couple more additional Panels to our dashboard.  Click Add panel.
    Then Select Amazon CloudWatch, and change from metrics to CloudWatch Logs and select the Log Group we created previously.

In the Grafana Workspace, the user has "Data source" selected as Amazon CloudWatch us-east-1. Underneath that they have chosen to use the default region and CloudWatch Logs. Below that, for the Log Groups they have entered /aws/events/DevOpsGuru

g. For the query use the following to get the number of closed insights:

fields @detail.messageType
| filter detail.messageType="CLOSED_INSIGHT"
| count(detail.messageType)

You’ll see the new dashboard get updated with “Data is missing a time field”.

New panel suggestion with switch to table or open visualization suggestions

You can either open the suggestions and select a gauge that makes sense;

New Suggestions display a dial graph, a bar graph, and a count numerical tracker

Or choose from multiple visualization options.

Now we have 2 panels:

Two panels are shown, one is the new dial graph, and the other is the time series graph that was created earlier.

h. You can repeat the same process. To create 3rd panel for the new insights using this query:

fields @detail.messageType 
| filter detail.messageType="NEW_INSIGHT" 
| count(detail.messageType)

Now we have 3 panels:

Grafana now shows three 3 panels. Two dial graphs, and the time series graph.

Next, depending on the visualizations, you can work with the Logs and metrics data types to parse and filter the data.

Setting up a 4th panel as table. Under the Query tab, in the query editor, the user has entered the text: fields detail.messageType, detail.insightSeverity, detail.insightUrlfilter | filter detail.messageType="CLOSED_INSIGHT" or detail.messageType="NEW_INSIGHT"

i. For our fourth panel, we will add DevOps Guru dashboard direct link to the AWS Console.

Repeat the same process as demonstrated previously one more time with this query:

fields detail.messageType, detail.insightSeverity, detail.insightUrlfilter 
| filter detail.messageType="CLOSED_INSIGHT" or detail.messageType="NEW_INSIGHT"                       

                        Switch to table when prompted on the panel.

Grafana now shows 4 panels. The new panel displays a data table that contains information about the most recent DevOps Guru insights. There are also the two dial graphs, and the time series graph from before.

This will give us a direct link to the DevOps Guru dashboard and help us get to the insight details and Recommendations.

Grafana now shows 4 panels. The new panel displays a data table that contains information about the most recent DevOps Guru insights. There are also the two dial graphs, and the time series graph from before.

Save your dashboard.

  1. You can extend observability by sending notifications through alerts on dashboards of panels providing metrics. The alerts will be triggered when a condition is met. The Alerts are communicated with Amazon SNS notification mechanism. This is our SNS notification channel setup.

Screenshot: notification settings show Name: DevopsGuruAlertsFromGrafana and Type: SNS

A previously created notification is used next to communicate any alerts when the condition is met across the metrics being observed.

Screenshot: notification setting with condition when count of query is above 5, a notification is sent to DevopsGuruAlertsFromGrafana with message, "More than 5 insights in the past 1 hour"

Cleanup

To avoid incurring future charges, delete the resources.

  • Navigate to EventBridge in AWS console and delete the rule created in step 4 (a-e) “devops-guru”.
  • Navigate to CloudWatch logs in AWS console and delete the log group created as results of step 4 (a-e) named “devops-guru”.
  • Amazon Managed Grafana: Navigate to Amazon Managed Grafana service and delete the Grafana services you created in step 1.

Conclusion

In this post, we have demonstrated how to successfully incorporate Amazon DevOps Guru insights into Amazon Managed Grafana and use Grafana as the observability tool. This will allow Operations team to successfully observe the state of their AWS resources and notify them through Alarms on any preset thresholds on DevOps Guru metrics and logs. You can expand on this to create other panels and dashboards specific to your needs. If you don’t have DevOps Guru, you can start monitoring your AWS applications with AWS DevOps Guru today using this link.

[1] https://www.atlassian.com/incident-management/kpis/cost-of-downtime

About the authors:

MJ Kubba

MJ Kubba is a Solutions Architect who enjoys working with public sector customers to build solutions that meet their business needs. MJ has over 15 years of experience designing and implementing software solutions. He has a keen passion for DevOps and cultural transformation.

David Ernst

David is a Sr. Specialist Solution Architect – DevOps, with 20+ years of experience in designing and implementing software solutions for various industries. David is an automation enthusiast and works with AWS customers to design, deploy, and manage their AWS workloads/architectures.

Sofia Kendall

Sofia Kendall is a Solutions Architect who helps small and medium businesses achieve their goals as they utilize the cloud. Sofia has a background in Software Engineering and enjoys working to make systems reliable, efficient, and scalable.

Visualize database privileges on Amazon Redshift using Grafana

Post Syndicated from Yota Hamaoka original https://aws.amazon.com/blogs/big-data/visualize-database-privileges-on-amazon-redshift-using-grafana/

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. Amazon Redshift enables you to use SQL for analyzing structured and semi-structured data with best price performance along with secure access to the data.

As more users start querying data in a data warehouse, access control is paramount to protect valuable organizational data. Database administrators want to continuously monitor and manage user privileges to maintain proper data access in the data warehouse. Amazon Redshift provides granular access control on the database, schema, table, column, row, and other database objects by granting privileges to roles, groups, and users from a SQL interface. To monitor privileges configured in Amazon Redshift, you can retrieve them by querying system tables.

Although Amazon Redshift provides a broad capability of managing access to database objects, we have heard from customers that they want to visualize and monitor privileges without using a SQL interface. In this post, we introduce predefined dashboards using Grafana which visualizes database privileges without writing SQL. This dashboard will help database administrators to reduce the time spent on database administration and increase the frequency of monitoring cycles.

Database security in Amazon Redshift

Security is the top priority at AWS. Amazon Redshift provides four levels of control:

  • Cluster management
  • Cluster connectivity
  • Database access
  • Temporary database credentials and single sign-on

This post focuses on database access, which relates to user access control against database objects. For more information, see Managing database security.

Amazon Redshift uses the GRANT command to define permissions in the database. For most database objects, GRANT takes three parameters:

  • Identity – The entity you grant access to. This could be a user, role, or group.
  • Object – The type of database object. This could be a database, schema, table or view, column, row, function, procedure, language, datashare, machine leaning (ML) model, and more.
  • Privilege – The type of operation. Examples include CREATE, SELECT, ALTER, DROP, DELETE, and INSERT. The level of privilege depends on the object.

To remove access, use the REVOKE command.

Additionally, Amazon Redshift offers granular access control with the Row-level security (RLS) feature. You can attach or detach RLS policies to identities with the ATTACH RLS POLICY and DETACH RLS POLICY commands, respectively. See RLS policy ownership and management for more details.

Generally, database administrator monitors and reviews the identities, objects, and privileges periodically to ensure proper access is configured. They also need to investigate access configurations if database users face permission errors. These tasks require a SQL interface to query multiple system tables, which can be a repetitive and undifferentiated operation. Therefore, database administrators need a single pane of glass to quickly navigate through identities, objects, and privileges without writing SQL.

Solution overview

The following diagram illustrates the solution architecture and its key components:

  • Amazon Redshift contains database privilege information in system tables.
  • Grafana provides a predefined dashboard to visualize database privileges. The dashboard runs queries against the Amazon Redshift system table via the Amazon Redshift Data API.

Note that the dashboard focuses on visualization. SQL interface is required to configure privileges in Amazon Redshift. You can use query editor v2, a web-based SQL interface which enables users to run SQL commands from a browser.

Prerequisites

Before moving to the next section, you should have the following prerequisites:

While Amazon Managed Grafana controls the plugin version and updates periodically, local Grafana allows user to control the version. Therefore, local Grafana could be an option if you need earlier access for the latest features. Refer to plugin changelog for released features and versions.

Import the dashboards

After you have finished the prerequisites, you should have access to Grafana configured with Amazon Redshift as a data source. Next, import two dashboards for visualization.

  1. In Grafana console, go to the created Redshift data source and click Dashboards
  2. Import the Amazon Redshift Identities and Objects
  3. Go to the data source again and import the Amazon Redshift Privileges

Each dashboard will appear once imported.

Amazon Redshift Identities and Objects dashboard

The Amazon Redshift Identities and Objects dashboard shows identites and database objects in Amazon Redshift, as shown in the following screenshot.

The Identities section shows the detail of each user, role, and group in the source database.

One of the key features in this dashboard is the Role assigned to Role, User section, which uses a node graph panel to visualize the hierarchical structure of roles and users from multiple system tables. This visualization can help administrators quickly examine which roles are inherited to users instead of querying multiple system tables. For more information about role-based access, refer to Role-based access control (RBAC).

Amazon Redshift Privileges dashboard

The Amazon Redshift Privileges dashboard shows privileges defined in Amazon Redshift.

In the Role and Group assigned to User section, open the Role assigned to User panel to list the roles for a specific user. In this panel, you can list and compare roles assigned to multiple users. Use the User drop-down at the top of the dashboard to select users.

The dashboard will refresh immediately and show filtered result for selected users. Following screenshot is the filtered result for user hr1, hr2 and it3.

The Object Privileges section shows the privileges granted for each database object and identity. Note that objects with no privileges granted are not listed here. To show the full list of database objects, use the Amazon Redshift Identities and Objects dashboard.

The Object Privileges (RLS) section contains visualizations for row-level security (RLS). The Policy attachments panel enables you to examine RLS configuration by visualizing relation between of tables, policies, roles and users.

Conclusion

In this post, we introduced a visualization for database privileges of Amazon Redshift using predefined Grafana dashboards. Database administrators can use these dashboards to quickly navigate through identities, objects, and privileges without writing SQL. You can also customize the dashboard to meet your business requirements. The JSON definition file of this dashboard is maintained as part of OSS in the Redshift data source for Grafana GitHub repository.

For more information about the topics described to in this post, refer to the following:


About the author

Yota Hamaoka is an Analytics Solution Architect at Amazon Web Services. He is focused on driving customers to accelerate their analytics journey with Amazon Redshift.

Enhance operational insights for Amazon MSK using Amazon Managed Service for Prometheus and Amazon Managed Grafana

Post Syndicated from Anand Mandilwar original https://aws.amazon.com/blogs/big-data/enhance-operational-insights-for-amazon-msk-using-amazon-managed-service-for-prometheus-and-amazon-managed-grafana/

Amazon Managed Streaming for Apache Kafka (Amazon MSK) is an event streaming platform that you can use to build asynchronous applications by decoupling producers and consumers. Monitoring of different Amazon MSK metrics is critical for efficient operations of production workloads. Amazon MSK gathers Apache Kafka metrics and sends them to Amazon CloudWatch, where you can view them. You can also monitor Amazon MSK with Prometheus, an open-source monitoring application. Many of our customers use such open-source monitoring tools like Prometheus and Grafana, but doing it in self-managed environment comes with its own challenges regarding manageability, availability, and security.

In this post, we show how you can build an AWS Cloud native monitoring platform for Amazon MSK using the fully managed, highly available, scalable, and secure services Amazon Managed service for Prometheus and Amazon Managed Grafana for better operational insights.

Why is Kafka monitoring critical?

As a critical component of the IT infrastructure, it is necessary to track Amazon MSK clusters’ operations and their efficiencies. Amazon MSK metrics helps monitor critical tasks while operating applications. You can not only troubleshoot problems that have already occurred, but also discover anomalous behavior patterns and prevent problems from occurring in the first place.

Some customers currently use various third-party monitoring solutions like lenses.io, AppDynamics, Splunk, and others to monitor Amazon MSK operational metrics. In the context of cloud computing, customers are looking for an AWS Cloud native service that offers equivalent or better capabilities but with the added advantage of being highly scalable, available, secure, and fully managed.

Amazon MSK clusters emit a very large number of metrics via JMX, many of which can be useful for tuning the performance of your cluster, producers, and consumers. However, that large volume brings complexity with monitoring. By default, Amazon MSK clusters come with CloudWatch monitoring of your essential metrics. You can extend your monitoring capabilities by using open-source monitoring with Prometheus. This feature enables you to scrape a Prometheus friendly API to gather all the JMX metrics and work with the data in Prometheus.

This solution provides a simple and easy observability platform for Amazon MSK along with much needed insights into various critical operational metrics that yields the following organizational benefits for your IT operations or application teams:

  • You can quickly drill down to various Amazon MSK components (broker level, topic level, or cluster level) and identify issues that need investigation
  • You can investigate Amazon MSK issues after the event using the historical data in Amazon Managed Service for Prometheus
  • You can shorten or eliminate long calls that waste time questioning business users on Amazon MSK issues

In this post, we set up Amazon Managed Service for Prometheus, Amazon Managed Grafana, and a Prometheus server running as container on Amazon Elastic Compute Cloud (Amazon EC2) to provide a fully managed monitoring solution for Amazon MSK.

The solution provides an easy-to-configure dashboard in Amazon Managed Grafana for various critical operation metrics, as demonstrated in the following video.

Solution overview

Amazon Managed Service for Prometheus reduces the heavy lifting required to get started with monitoring applications across Amazon MSK, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS Fargate, as well as self-managed Kubernetes clusters. The service also seamlessly integrates with Amazon Managed Grafana to simplify data visualization, team management authentication, and authorization.

Grafana empowers you to create dashboards and alerts from multiple sources such as an Amazon Managed Prometheus workspace, CloudWatch, AWS X-Ray, Amazon OpenSearch Service, Amazon Redshift, and Amazon Timestream.

The following diagram demonstrates the solution architecture. This solution deploys a Prometheus server running as a container within Amazon EC2, which constantly scrapes metrics from the MSK brokers and remote write metrics to an Amazon Managed Service for Prometheus workspace. As of this writing, Amazon Managed Service for Prometheus is not able to scrape the metrics directly, therefore a Prometheus server is necessary to do so. We use Amazon Managed Grafana to query and visualize the operational metrics for the Amazon MSK platform.

Solution Architecture

The following are the high-level steps to deploy the solution:

  1. Create an EC2 key pair.
  2. Configure your Amazon MSK cluster and associated resources. We demonstrate how to configure an existing Amazon MSK cluster or create a new one.
    1. Option A:- Modify an existing Amazon MSK cluster
    2. Option B:- Create a new Amazon MSK cluster
  3. Enable AWS IAM Identity Center (successor to AWS Single Sign-On), if not enabled.
  4. Configure Amazon Managed Grafana and Amazon Managed Service for Prometheus.
  5. Configure Prometheus and start the service.
  6. Configure the data sources in Amazon Managed Grafana.
  7. Import the Grafana dashboard.

Prerequisites

git clone https://github.com/aws-samples/amazonmsk-managed-observability
  • You download three CloudFormation template files along with the Prometheus configuration file (prometheus.yml), targets.json file (you need this to update the MSK broker DNS later on), and three JSON files for creating a dashboard within Amazon Managed Grafana.
  • Make sure internet connection is allowed to download docker image of Prometheus from within Prometheus server

1. Create an EC2 key pair

To create your EC2 key pair, complete the following steps:

  1. On the Amazon EC2 console, under Network & Security in the navigation pane, choose Key Pairs.
  2. Choose Create key pair.
  3. For Name, enter DemoMSKKeyPair.
  4. For Key pair type¸ select RSA.
  5. For Private key file format, choose the format in which to save the private key:
    1. To save the private key in a format that can be used with OpenSSH, select .pem.
    2. To save the private key in a format that can be used with PuTTY, select .ppk.

Create EC2 Key Pair - DemoMSKKeyPair

The private key file is automatically downloaded by your browser. The base file name is the name that you specified as the name of your key pair, and the file name extension is determined by the file format that you chose.

  1. Save the private key file in a safe place.

2. Configure your Amazon MSK cluster and associated resources.

Using the following options to configure an existing Amazon MSK cluster or create a new one.

2.a Modify an existing Amazon MSK cluster

If you want to create a new Amazon MSK cluster for this solution, skip to the section – 2.b.Create a new Amazon MSK cluster, otherwise complete the steps in this section to modify an existing cluster.

Validate cluster monitoring settings

We must enable enhanced partition-level monitoring (available at an additional cost) and open monitoring with Prometheus. Note that open monitoring with Prometheus is only available for provisioned mode clusters.

  1. Sign in to the account where the Amazon MSK cluster is that you want to monitor.
  2. Open your Amazon MSK cluster.
  3. On the Properties tab, navigate to Monitoring metrics.
  4. Check the monitoring level for Amazon CloudWatch metrics for this cluster, and choose Edit to edit the cluster.
  5. Select Enhance partition-level monitoring.

Editing monitoring attributes of Amazon MSK Cluster

  1. Check the monitoring label for Open monitoring with Prometheus, and choose Edit to edit the cluster.
  2. Select Enable open monitoring for Prometheus.
  3. Under Prometheus exporters, select JMX Exporter and Note Exporter.

enable Open monitoring with Prometheus

  1. Under Broker log delivery, select Deliver to Amazon CloudWatch Logs.
  2. For Log group, enter your log group for Amazon MSK.
  3. Choose Save changes.

Deploy CloudFormation stack

Now we deploy the CloudFormation stack Prometheus_Cloudformation.yml that we downloaded earlier.

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose Create stack.
  3. For Prepare template, select Template is ready.
  4. For Template source, select Upload a template.
  5. Upload the Prometheus_Cloudformation.yml file, then choose Next.

CloudFormation - Create Stack for Prometheus_Cloudformation.yml

  1. For Stack name, enter Prometheus.
  2. VPCID – Provide the VPC ID where your Amazon MSK cluster is deployed (mandatory)
  3. VPCCIdr – Provide the VPC CIDR where your Amazon MSK Cluster is deployed (mandatory)
  4. SubnetID – Provide any one of the subnets ID where your existing Amazon MSK cluster is deployed (mandatory)
  5. MSKClusterName – Provide the name your existing Amazon MSK Cluster
  6. Leave Cloud9InstanceType, KeyName, and LatestAmild as default.
  7. Choose Next.

CloudFormation - Specify stack details for Prometheus Stack

  1. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
  2. Choose Create stack.

You’re redirected to the AWS CloudFormation console, and can see the status as CREATE_IN_PROGRESSS. Wait until the status changes to COMPLETE.

CloudFormation stack creation status for Prometheus Stack

  1. On the stack’s Outputs tab, note the values for the following keys (if you don’t see anything under Outputs tab, click on refresh icon):
    1. PrometheusInstancePrivateIP
    2. PrometheusSecurityGroupId

getting private IP and Security Group ID for Prometheus Stack

Update the Amazon MSK cluster security group

Complete the following steps to update the security group of the existing Amazon MSK cluster to allow communication from the Kafka client and Prometheus server:

  1. On the Amazon MSK console, navigate to your Amazon MSK cluster.
  2. On the Properties tab, under Network settings, open the security group.
  3. Choose Edit inbound rules.
  4. Choose Add rule and create your rule with the following parameters:
    1. Type – Custom TCP
    2. Port range – 11001–11002
    3. Source – The Prometheus server security group ID

Set up your AWS Cloud9 environment

To configure your AWS Cloud9 environment, complete the following steps:

  1. On the AWS Cloud9 console, choose Environments in the navigation pane.
  2. Select Cloud9EC2Bastion and choose Open in Cloud9.

AWS Cloud9 home page

  1. Close the Welcome tab and open a new terminal tab
  2. Create an SSH key file with the contents from the private key file DemoMSKKeyPair using the following command:
    touch /home/ec2-user/environment/EC2KeyMSKDemo

  3. Run the following command to list the newly created key file
    ls -ltr

  4. Open the file, enter the contents of the private key file DemoMSKKeyPair, then save the file.

Updating the key file with the private file content

  1. Change the permissions of the file using the following command:
    chmod 600 /home/ec2-user/environment/EC2KeyMSKDemo

  2. Log in to the Prometheus server using this key file and the private IP noted earlier:
    ssh -i /home/ec2-user/environment/EC2KeyMSKDemo ec2-user@<Private-IP-of-Prometheus-instance>

  3. Once you’re logged in, check if the Docker service is up and running using the following command:
    systemctl status docker

checking docker service status

  1. To exit the server, enter exit and press Enter.

2.b Create a new Amazon MSK cluster

If you don’t have an Amazon MSK cluster running in your environment, or you don’t want to use an existing cluster for this solution, complete the steps in this section.

As part of these steps, your cluster will have the following properties:

Deploy CloudFormation stack

Complete the following steps to deploy the CloudFormation stack MSKResource_Cloudformation.yml:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose Create stack.
  3. For Prepare template, select Template is ready.
  4. For Template source, select Upload a template.
  5. Upload the MSKResource_Cloudformation.yml file, then choose Next.
  6. For Stack name, enter MSKDemo.
  7. Network Configuration – Generic (mandatory)
    1. Stack to be deployed in NEW VPC? (true/false) – if false, you MUST provide VPCCidr and other details under Existing VPC section (Default is true)
    2. VPCCidr – Default is 10.0.0.0/16 for a new VPC. You can have any valid values as per your environment. If deploying in an existing VPC, provide the CIDR for the same
  8. Network Configuration – For New VPC
    1. PrivateSubnetMSKOneCidr (Default is 10.0.1.0/24)
    2. PrivateSubnetMSKTwoCidr (Default is 10.0.2.0/24)
    3. PrivateSubnetMSKThreeCidr (Default is 10.0.3.0/24)
    4. PublicOneCidr (Default is 10.0.0.0/24)
  9. Network Configuration – For Existing VPC (You need at least 4 subnets)
    1. VpcId – Provide the value if you are using any existing VPC to deploy the resources else leave it blank(default)
    2. SubnetID1 – Any one of the existing subnets from the given VPCID
    3. SubnetID2 – Any one of the existing subnets from the given VPCID
    4. SubnetID3 – Any one of the existing subnets from the given VPCID
    5. PublicSubnetID – Any one of the existing subnets from the given VPCID
  10. Leave the remaining parameters as default and choose Next.

Specifying the parameter details for MSKDemo stack

  1. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
  2. Choose Create stack.

You’re redirected to the AWS CloudFormation console, and can see the status as CREATE_IN_PROGRESSS. Wait until the status changes to COMPLETE.

CloudFormation stack creation status

  1. On the stack’s Outputs tab, note the values for the following (if you don’t see anything under Outputs tab, click on refresh icon):
    1. KafkaClientPrivateIP
    2. PrometheusInstancePrivateIP

Set up your AWS Cloud9 environment

Follow the steps as outlined in the previous section to configure your AWS Cloud9 environment.

Retrieve the cluster broker list

To get your MSK cluster broker list, complete the following steps:

  1. On the Amazon MSK console, navigate to your cluster.
  2. In the Cluster summary section, choose View client information.
  3. In the Bootstrap servers section, copy the private endpoint.

You need this value to perform some operations later, such as creating an MSK topic, producing sample messages, and consuming those sample messages.

  1. Choose Done.
  2. On the Properties tab, in the Brokers details section, note the endpoints listed.

These need to be updated in the targets.json file (used for Prometheus configuration in a later step).

3. Enable IAM Identity Center

Before you deploy the CloudFormation stack for Amazon Managed Service for Prometheus and Amazon Managed Grafana, make sure to enable IAM Identity Center.

If you don’t use IAM Identity Center, alternatively, you can set up user authentication via SAML. For more information, refer to Using SAML with your Amazon Managed Grafana workspace.

If IAM Identity Center is currently enabled/configured in another region, you don’t need to enable in your current region.

Complete the following steps to enable IAM Identity Center:

  1. On the IAM Identity Center console, under Enable IAM Identity Center, choose Enable.

Enabling IAM Identity Center

  1. Choose Create AWS organization.

4. Configure Amazon Managed Grafana and Amazon Managed Service for Prometheus

Complete the steps in this section to set up Amazon Managed Service for Prometheus and Amazon Managed Grafana.

Deploy CloudFormation template

Complete the following steps to deploy the CloudFormation stack AMG_AMP_Cloudformation:

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Choose Create stack.
  3. For Prepare template, select Template is ready.
  4. For Template source, select Upload a template.
  5. Upload the AMG_AMP_Cloudformation.yml file, then choose Next.
  6. For Stack name, enter ManagedPrometheusAndGrafanaStack, then choose Next.
  7. On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
  8. Choose Create stack.

You’re redirected to the AWS CloudFormation console, and can see the status as CREATE_IN_PROGRESSS. Wait until the status changes to COMPLETE.

  1. On the stack’s Outputs tab, note the values for the following (if you don’t see anything under Outputs tab, click on refresh icon):
    1. GrafanaWorkspaceURL – This is Amazon Managed Grafana URL
    2. PrometheusEndpointWriteURL – This is the Amazon Managed Service for Prometheus write endpoint URL

Outputs information for ManagedPromethusAndGrafana Stack

Create a user for Amazon Managed Grafana

Complete the following steps to create a user for Amazon Managed Grafana:

  1. On the IAM Identity Center console, choose Users in the navigation pane.
  2. Choose Add user.
  3. For Username, enter grafana-admin.
  4. Enter and confirm your email address to receive a confirmation email.

IAM Identity Center - Specifying user information

  1. Skip the optional steps, then choose Add user.

A success message appears at the top of the console.

  1. In the confirmation email, choose Accept invitation and set your user password.

Screenshot showing Invitation link for the new user

  1. On the Amazon Managed Grafana console, choose Workspaces in the navigation pane.
  2. Open the workspace Amazon-Managed-Grafana.
  3. Make a note of the Grafana workspace URL.

You use this URL to log in to view your Grafana dashboards.

  1. On the Authentication tab, choose Assign new user or group.

Amazon Managed Grafana - Workspace information - Assigning user

  1. Select the user you created earlier and choose Assign users and groups.

Amazon Managed Grafana - Assigning user to workspace

  1. On the Action menu, choose what kind of user to make it: admin, editor, or viewer.

Note that your Grafana workspace needs as least one admin user.

  1. Navigate to the Grafana URL you copied earlier in your browser.
  2. Choose Sign in with AWS IAM Identity Center.

Amazon Managed Grafana URL landing page

  1. Log in with your IAM Identity Center credentials.

5. Configure Prometheus and start the service

When you cloned the GitHub repo, you downloaded two configuration files: prometheus.yml and targets.json. In this section, we configure these two files.

  1. Use any IDE (Visual Studio Code or Notepad++) to open prometheus.yml.
  2. In the remote_write section, update the remote write URL and Region.

prometheus.yml file

  1. Use any IDE to open targets.json.
  2. Update the targets with the broker endpoints you obtained earlier.

targets.json file

  1. In your AWS Cloud9 environment, choose File, then Upload Local Files.
  2. Choose Select Files and upload targets.json and prometheus.yml from your local machine.

Upload file dialog page within AWS Cloud9 Environment

  1. In the AWS Cloud9 environment, run the following command using the key file you created earlier:
    ssh -i /home/ec2-user/environment/EC2KeyMSKDemo ec2-user@<Private-IP-of-Prometheus instance> mkdir -p /home/ec2-user/Prometheus

  2. copy targets.json to the Prometheus server:
    scp -i /home/ec2-user/environment/EC2KeyMSKDemo targets.json ec2-user@<Private-IP-of-Prometheus-instance>:/home/ec2-user/prometheus/

  3. copy prometheus.yml to the Prometheus server:
    scp -i /home/ec2-user/environment/EC2KeyMSKDemo prometheus.yml ec2-user@<Private-IP-of-Prometheus-instance>:/home/ec2-user/prometheus/

  4. SSH into the Prometheus server and start the container service for Prometheus
    ssh -i /home/ec2-user/environment/EC2KeyMSKDemo ec2-user@<Private-IP-of-Prometheus instance>

  5. start the prometheus container
    sudo docker run -d -p 9090:9090 --name=prometheus -v /home/ec2-user/prometheus/prometheus.yml:/etc/prometheus/prometheus.yml -v /home/ec2-user/prometheus/targets.json:/etc/prometheus/targets.json prom/prometheus --config.file=/etc/prometheus/prometheus.yml

  6. Check if the Docker service is running:

Start the prometheus container inside Prometheus Server

6. Configure data sources in Amazon Managed Grafana

To configure your data sources, complete the following steps:

  1. Log in to the Amazon Managed Grafana URL.
  2. Choose AWS Data Services in the navigation pane, then choose Data Sources.

Amazon Managed Grafana - Configuring data source

  1. For Service, choose Amazon Managed Service for Prometheus.
  2. For Region, choose your Region.

The correct resource ID is populated automatically.

  1. Select your resource ID and choose Add 1 data source.

Configuring data source for Amazon managed Prometheus

  1. Choose Go to settings.

Configuring data source for Amazon managed Prometheus

  1. For Name, enter Amazon Managed Prometheus and enable Default.

The URL is automatically populated.

  1. Leave everything else as default.

Configuring data source for Amazon managed Prometheus

  1. Choose Save & Test.

If everything is correct, the message Data source is working appears.

Now we configure CloudWatch as a data source.

  1. Choose AWS Data Services, then choose Data source.
  2. For Services, choose CloudWatch.
  3. For Region, choose your correct Region.
  4. Choose Add data source.
  5. Select the CloudWatch data source and choose Go to settings.

Configuring data source with Service as CloudWatch

  1. For Name, enter AmazonMSK-CloudWatch.
  2. Choose Save & Test.

7. Import the Grafana dashboard

You can use the following preconfigured dashboards, which are available to download from the GitHub repo:

  • Kafka Metrics
  • MSK Cluster Overview
  • AWS MSK – Kafka Cluster-CloudWatch

To import your dashboard, complete the following steps:

  1. In Amazon Managed Grafana, choose the plus sign in the navigation pane.
  2. Choose Import.

Importing preconfigured dashboard

  1. Choose Upload JSON file.
  2. Choose the dashboard you downloaded.
  3. Choose Load.

Configuring dashboard by importing the json file

The following screenshot shows your loaded dashboard.

Dashboard showing cluster level metrics

Generate sample data in Amazon MSK (Optional – when you create a new Amazon MSK Cluster)

To generate sample data in Amazon MSK, complete the following steps:

  1. In your AWS Cloud9 environment, log in to the Kafka client.
    ssh -i /home/ec2-user/environment/EC2KeyMSKDemo ec2-user@<Private-IP-of-Kafka Client>

  2. Set the broker endpoint variable
    export MYBROKERS="<Bootstrap Servers Endpoint – captured earlier>"

  3. Run the following command to create a topic called TLSTestTopic60:
    /home/ec2-user/kafka/bin/kafka-topics.sh --command-config /home/ec2-user/kafka/config/client.properties --bootstrap-server $MYBROKERS --create --topic TLSTestTopic60 --partitions 5 --replication-factor 2

  4. Still logged in to the Kafka client, run the following command to start the producer service:
    /home/ec2-user/kafka/bin/kafka-console-producer.sh --producer.config /home/ec2-user/kafka/config/client.properties --broker-list $MYBROKERS --topic TLSTestTopic60

Creating topic and producing messages in Kafka Producer service

  1. Open a new terminal from within your AWS Cloud9 environment and log in to the Kafka client instance
    ssh -i /home/ec2-user/environment/EC2KeyMSKDemo ec2-user@<Private-IP-of-Kafka Client>

  2. Set the broker endpoint variable
    export MYBROKERS="<Bootstrap Servers Endpoint – captured earlier>"

  3. Now you can start the consumer service and see the incoming messages
    /home/ec2-user/kafka/bin/kafka-console-consumer.sh --consumer.config /home/ec2-user/kafka/config/client.properties --bootstrap-server $MYBROKERS --topic TLSTestTopic60 --from-beginning

Starting Kafka consumer service

  1. Press CTRL+C to stop the producer/consumer service.

Kafka metrics dashboards on Amazon Managed Grafana

You can now view your Kafka metrics dashboards on Amazon Managed Grafana:

  • Cluster overall health – Configured using Amazon Managed Service for Prometheus as the data source:
    • Critical metrics

Amazon MSK cluster overview – Configured using Amazon Managed Service for Prometheus as the data source:

  • Critical metrics
  • Cluster throughput (broker-level metrics)

  • Cluster metrics (JVM)

Kafka cluster operation metrics – Configured using CloudWatch as the data source:

  • General overall stats

  • CPU and Memory metrics

Clean up

You will continue to incur costs until you delete the infrastructure that you created for this post. Delete the CloudFormation stack you used to create the respective resources.

If you used an existing cluster, make sure to remove the inbound rules you updated in the security group (otherwise the stack deletion will fail).

  1. On the Amazon MSK console, navigate to your existing cluster.
  2. On the Properties tab, in the Networking settings section, open the security group you applied.
  3. Choose Edit inbound rules.

  1. Choose Delete to remove the rules you added.
  2. Choose Save rules.

Now you can delete your CloudFormation stacks.

  1. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  2. Select ManagedPrometheusAndGrafana and choose Delete.
  3. If you used an existing Amazon MSK cluster, delete the stack Prometheus.
  4. If you created a new Amazon MSK cluster, delete the stack MSKDemo.

Conclusion

This post showed how you can deploy a fully managed, highly available, scalable, and secure monitoring system for Amazon MSK using Amazon Managed Service for Prometheus and Amazon Managed Grafana, and use Grafana dashboards to gain deep insights into various operational metrics. Although this post only discussed using Amazon Managed Service for Prometheus and CloudWatch as the data sources in Amazon Managed Grafana, you can enable various other data sources, such as AWS IoT SiteWise, AWS X-Ray, Redshift, and Amazon Athena, and build a dashboard on top of those metrics. You can use these managed services for monitoring any number of Amazon MSK platforms. Metrics are available to query in Amazon Managed Grafana or Amazon Managed Service for Prometheus in near-real time.

You can use this post as prescriptive guidance and deploy an observability solution for a new or an existing Amazon MSK cluster, identify the metrics that are important for your applications and then create a dashboard using Amazon Managed Grafana and Prometheus.


About the Authors

Anand Mandilwar is an Enterprise Solutions Architect at AWS. He works with enterprise customers helping customers innovate and transform their business in AWS. He is passionate about automation around Cloud operation , Infrastructure provisioning and Cloud Optimization. He also likes python programming. In his spare time, he enjoys honing his photography skill especially in Portrait and landscape area.

Ajit Puthiyavettle is a Solution Architect working with enterprise clients, architecting solutions to achieve business outcomes. He is passionate about solving customer challenges with innovative solutions. His experience is with leading DevOps and security teams for enterprise and SaaS (Software as a Service) companies. Recently he is focussed on helping customers with Security, ML and HCLS workload.

Build near real-time logistics dashboards using Amazon Redshift and Amazon Managed Grafana for better operational intelligence

Post Syndicated from Paul Villena original https://aws.amazon.com/blogs/big-data/build-near-real-time-logistics-dashboards-using-amazon-redshift-and-amazon-managed-grafana-for-better-operational-intelligence/

Amazon Redshift is a fully managed data warehousing service that is currently helping tens of thousands of customers manage analytics at scale. It continues to lead price-performance benchmarks, and separates compute and storage so each can be scaled independently and you only pay for what you need. It also eliminates data silos by simplifying access to your operational databases, data warehouse, and data lake with consistent security and governance policies.

With the Amazon Redshift streaming ingestion feature, it’s easier than ever to access and analyze data coming from real-time data sources. It simplifies the streaming architecture by providing native integration between Amazon Redshift and the streaming engines in AWS, which are Amazon Kinesis Data Streams and Amazon Managed Streaming for Apache Kafka (Amazon MSK). Streaming data sources like system logs, social media feeds, and IoT streams can continue to push events to the streaming engines, and Amazon Redshift simply becomes just another consumer. Before Amazon Redshift streaming was available, we had to stage the streaming data first in Amazon Simple Storage Service (Amazon S3) and then run the copy command to load it into Amazon Redshift. Eliminating the need to stage data in Amazon S3 results in faster performance and improved latency. With this feature, we can ingest hundreds of megabytes of data per second and have a latency of just a few seconds.

Another common challenge for our customers is the additional skill required when using streaming data. In Amazon Redshift streaming ingestion, only SQL is required. We use SQL to do the following:

  • Define the integration between Amazon Redshift and our streaming engines with the creation of external schema
  • Create the different streaming database objects that are actually materialized views
  • Query and analyze the streaming data
  • Generate new features that are used to predict delays using machine learning (ML)
  • Perform inferencing natively using Amazon Redshift ML

In this post, we build a near real-time logistics dashboard using Amazon Redshift and Amazon Managed Grafana. Our example is an operational intelligence dashboard for a logistics company that provides situational awareness and augmented intelligence for their operations team. From this dashboard, the team can see the current state of their consignments and their logistics fleet based on events that happened only a few seconds ago. It also shows the consignment delay predictions of an Amazon Redshift ML model that helps them proactively respond to disruptions before they even happen.

Solution overview

This solution is composed of the following components, and the provisioning of resources is automated using the AWS Cloud Development Kit (AWS CDK):

  • Multiple streaming data sources are simulated through Python code running in our serverless compute service, AWS Lambda
  • The streaming events are captured by Amazon Kinesis Data Streams, which is a highly scalable serverless streaming data service
  • We use the Amazon Redshift streaming ingestion feature to process and store the streaming data and Amazon Redshift ML to predict the likelihood of a consignment getting delayed
  • We use AWS Step Functions for serverless workflow orchestration
  • The solution includes a consumption layer built on Amazon Managed Grafana where we can visualize the insights and even generate alerts through Amazon Simple Notification Service (Amazon SNS) for our operations team

The following diagram illustrates our solution architecture.

Prerequisites

The project has the following prerequisites:

Sample deployment using the AWS CDK

The AWS CDK is an open-source project that allows you to define your cloud infrastructure using familiar programming languages. It uses high-level constructs to represent AWS components to simplify the build process. In this post, we use Python to define the cloud infrastructure due to its familiarity to many data and analytics professionals.

Clone the GitHub repository and install the Python dependencies:

git clone https://github.com/aws-samples/amazon-redshift-streaming-workshop
cd amazon-redshift-streaming-workshop
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Next, bootstrap the AWS CDK. This sets up the resources required by the AWS CDK to deploy into the AWS account. This step is only required if you haven’t used the AWS CDK in the deployment account and Region.

cdk bootstrap

Deploy all stacks:

cdk deploy IngestionStack 
cdk deploy RedshiftStack 
cdk deploy StepFunctionStack

The entire deployment time takes 10–15 minutes.

Access streaming data using Amazon Redshift streaming ingestion

The AWS CDK deployment provisions an Amazon Redshift cluster with the appropriate default IAM role to access the Kinesis data stream. We can create an external schema to establish a connection between the Amazon Redshift cluster and the Kinesis data stream:

CREATE EXTERNAL SCHEMA ext_kinesis FROM KINESIS
IAM_ROLE default;

For instructions on how to connect to the cluster, refer to Connecting to the Redshift Cluster.

We use a materialized view to parse data in the Kinesis data stream. In this case, the whole payload is ingested as is and stored using the SUPER data type in Amazon Redshift. Data stored in streaming engines is usually in semi-structured format, and the SUPER data type provides a fast and efficient way to analyze semi-structured data within Amazon Redshift.


See the following code:

CREATE MATERIALIZED VIEW consignment_stream AS
SELECT approximate_arrival_timestamp,
JSON_PARSE(from_varbyte(kinesis_data, 'utf-8')) as consignment_data FROM ext_kinesis.consignment_stream
WHERE is_utf8(kinesis_data)
AND is_valid_json(from_varbyte(kinesis_data, 'utf-8'));

Refreshing the materialized view invokes Amazon Redshift to read data directly from the Kinesis data stream and load it into the materialized view. This refresh can be done automatically by adding the AUTO REFRESH clause in the materialized view definition. However, in this example, we are orchestrating the end-to-end data pipeline using AWS Step Functions.

REFRESH MATERIALIZED VIEW consignment_stream;

Now we can start running queries against our streaming data and unify it with other datasets like logistics fleet data. If we want to know the distribution of our consignments across different states, we can easily unpack the contents of the JSON payload using the PartiQL syntax.

SELECT cs.consignment_data.origin_state::VARCHAR,
COUNT(1) number_of_consignments,
AVG(on_the_move) running_fleet,
AVG(scheduled_maintenance + unscheduled_maintenance) under_maintenance
FROM consignment_stream cs
INNER JOIN fleet_summary fs
on TRIM(cs.consignment_data.origin_state::VARCHAR) = fs.vehicle_location
GROUP BY 1

Generate features using Amazon Redshift SQL functions

The next step is to transform and enrich the streaming data using Amazon Redshift SQL to generate additional features that will be used by Amazon Redshift ML for its predictions. We use date and time functions to identify the day of the week, and calculate the number of days between the order date and target delivery date.

We also use geospatial functions, specifically ST_DistanceSphere, to calculate the distance between origin and destination locations. The GEOMETRY data type within Amazon Redshift provides a cost-effective way to analyze geospatial data such as longitude and latitudes at scale. In this example, the addresses have already been converted to longitude and latitude. However, if you need to perform geocoding, you can integrate Amazon Location Services with Amazon Redshift using user-defined functions (UDFs). On top of geocoding, Amazon Location Service also allows you to more accurately calculate route distance between origin and destination, and even specify waypoints along the way.

We use another materialized view to persist these transformations. A materialized view provides a simple yet efficient way to create data pipelines using its incremental refresh capability. Amazon Redshift identifies the incremental changes from the last refresh and only updates the target materialized view based on these changes. In this materialized view, all our transformations are deterministic, so we expect our data to be consistent when going through a full refresh or an incremental refresh.


See the following code:

CREATE MATERIALIZED VIEW consignment_transformed AS
SELECT
consignment_data.consignmentid::INT consignment_id,
consignment_data.consignment_date::TIMESTAMP consignment_date,
consignment_data.delivery_date::TIMESTAMP delivery_date,
consignment_data.origin_state::VARCHAR origin_state,
consignment_data.destination_state::VARCHAR destination_state,
consignment_data.revenue::FLOAT revenue,
consignment_data.cost::FLOAT cost,
DATE_PART(dayofweek, consignment_data.consignment_date::TIMESTAMP)::INT day_of_week,
DATE_PART(hour, consignment_data.consignment_date::TIMESTAMP)::INT "hour",
DATEDIFF(days,
consignment_data.consignment_date::TIMESTAMP,
consignment_data.delivery_date::TIMESTAMP
)::INT days_to_deliver,
(ST_DistanceSphere(
ST_Point(consignment_data.origin_lat::FLOAT, consignment_data.origin_long::FLOAT),
ST_Point(consignment_data.destination_lat::FLOAT, consignment_data.destination_long::FLOAT)
) / 1000 --convert to km
) delivery_distance
FROM consignment_stream;

Predict delays using Amazon Redshift ML

We can use this enriched data to make predictions on the delay probability of a consignment. Amazon Redshift ML is a feature of Amazon Redshift that allows you to use the power of Amazon Redshift to build, train, and deploy ML models directly within your data warehouse.

The training of a new Amazon Redshift ML model has been initiated as part of the AWS CDK deployment using the CREATE MODEL statement. The training dataset is defined in the FROM clause, and TARGET defines which column the model is trying to predict. The FUNCTION clause defines the name of the function that is used to make predictions.

CREATE MODEL ml_delay_prediction -- already executed by CDK
FROM (SELECT * FROM ext_s3.consignment_train)
TARGET probability
FUNCTION fnc_delay_probabilty
IAM_ROLE default
SETTINGS (
MAX_RUNTIME 1800, --seconds
S3_BUCKET '<ingestionstack-s3bucketname>' --replace S3 bucket name
)

This simplified model is trained using historical observations, and the training process takes around 30 minutes to complete. You can check the status of the training job by running the SHOW MODEL statement:

SHOW MODEL ml_delay_prediction;

When the model is ready, we can start making predictions on new data that is streamed into Amazon Redshift. Predictions are generated using the Amazon Redshift ML function that was defined during the training process. We pass the calculated features from the transformed materialized view into this function, and the prediction results populate the delay_probability column.

This final output is persisted into the consignment_predictions table, and Step Functions is orchestrating the ongoing incremental data load into this target table. We use a table for the final output, instead of a materialized view, because ML predictions have randomness involved and it may give us non-deterministic results. Using a table gives us more control on how data is loaded.


See the following code:

CREATE TABLE consignment_predictions AS
SELECT *, fnc_delay_probability(
day_of_week, "hour", days_to_deliver, delivery_distance) delay_probability
FROM consignment_transformed;

Create an Amazon Managed Grafana dashboard

We use Amazon Managed Grafana to create a near real-time logistics dashboard. Amazon Managed Grafana is a fully managed service that makes it easy to create, configure, and share interactive dashboards and charts for monitoring your data. We can also use Grafana to set up alerts and notifications based on specific conditions or thresholds, allowing you to quickly identify and respond to issues.

The high-level steps in setting up the dashboard are as follows:

  1. Create a Grafana workspace.
  2. Set up Grafana authentication using AWS IAM Identity Center (successor to AWS Single Sign-On) or using direct SAML integration.
  3. Configure Amazon Redshift as a Grafana data source.
  4. Import the JSON file for the near real-time logistics dashboard.

A more detailed set of instructions is available in the GitHub repository for your reference.

Clean up

To avoid ongoing charges, delete the resources deployed. Access the Amazon Linux 2 environment and run the AWS CDK destroy command. Delete the Grafana objects related to this deployment.

cd amazon-redshift-streaming-workshop
source .venv/bin/activate
cdk destroy –all

Conclusion

In this post, we showed how easy it is to build a near real-time logistics dashboard using Amazon Redshift and Amazon Managed Grafana. We created an end-to-end modern data pipeline using only SQL. This shows how Amazon Redshift is a powerful platform for democratizing your data—it enables a wide range of users, including business analysts, data scientists, and others, to work with and analyze data without requiring specialized technical skills or expertise.

We encourage you to explore what else can be achieved with Amazon Redshift and Amazon Managed Grafana. We also recommend you visit the AWS Big Data Blog for other useful blog posts on Amazon Redshift.


About the Author

Paul Villena is an Analytics Solutions Architect in AWS with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interests are infrastructure-as-code, serverless technologies and coding in Python.

Visualize Amazon S3 data using Amazon Athena and Amazon Managed Grafana

Post Syndicated from Pedro Sola Pimentel original https://aws.amazon.com/blogs/big-data/visualize-amazon-s3-data-using-amazon-athena-and-amazon-managed-grafana/

Grafana is a popular open-source analytics platform that you can employ to create, explore, and share your data through flexible dashboards. Its use cases include application and IoT device monitoring, and visualization of operational and business data, among others. You can create your dashboard with your own datasets or publicly available datasets related to your industry.

In November 2021, the AWS team together with Grafana Labs announced the Amazon Athena data source plugin for Grafana. The feature allows you to visualize information on a Grafana dashboard using data stored in Amazon Simple Storage Service (Amazon S3) buckets, with help from Amazon Athena, a serverless interactive query service. In addition, you can provision Grafana dashboards using Amazon Managed Grafana, a fully managed service for open-source Grafana and Enterprise Grafana.

In this post, we show how you can create and configure a dashboard in Amazon Managed Grafana that queries data stored on Amazon S3 using Athena.

Solution overview

The following diagram is the architecture of the solution.

Architecture diagram

The solution is comprised of a Grafana dashboard, created in Amazon Managed Grafana, populated with data queried using Athena. Athena runs queries against data stored in Amazon S3 using standard SQL. Athena integrates with the AWS Glue Data Catalog, a metadata store for data in Amazon S3, which includes information such as the table schema.

To implement this solution, you complete the following high-level steps:

  1. Create and configure an Athena workgroup.
  2. Configure the dataset in Athena.
  3. Create and configure a Grafana workspace.
  4. Create a Grafana dashboard.

Create and configure an Athena workgroup

By default, the AWS Identity and Access Management (IAM) role used by Amazon Managed Grafana has the AmazonGrafanaAthenaAccess IAM policy attached. This policy gives the Grafana workspace access to query all Athena databases and tables. More importantly, it gives the service access to read data written to S3 buckets with the prefix grafana-athena-query-results-. In order for Grafana to be able to read the Athena query results, you have two options:

In this post, we go with the first option. To do that, complete the following steps:

  1. Create an S3 bucket named grafana-athena-query-results-<name>. Replace <name> with a unique name of your choice.
  2. On the Athena console, choose Workgroups in the navigation pane.
  3. Choose Create workgroup.
  4. Under Workgroup name, enter a unique name of your choice.
  5. For Query result configuration, choose Browse S3.
  6. Select the bucket you created and choose Choose.
  7. For Tags, choose Add new tag.
  8. Add a tag with the key GrafanaDataSource and the value true.
  9. Choose Create workgroup.

It’s important that you add the tag described in steps 7–8. If the tag isn’t present, the workgroup won’t be accessible by Amazon Managed Grafana.

For more information about the Athena query results location, refer to Working with query results, recent queries, and output files.

Configure the dataset in Athena

For this post, we use the NOAA Global Historical Climatology Network Daily (GHCN-D) dataset, from the National Oceanic and Atmospheric Administration (NOAA) agency. The dataset is available in the Registry of Open Data on AWS, a registry that exists to help people discover and share datasets.

The GHCN-D dataset contains meteorological elements such as daily maximum and minimum temperatures. It’s a composite of climate records from numerous locations—some locations contain more than 175 years recorded.

The GHCN-D data is in CSV format and is stored in a public S3 bucket (s3://noaa-ghcn-pds/). You access the data through Athena. To start using Athena, you need to create a database:

  1. On the Athena console, choose Query editor in the navigation pane.
  2. Choose the workgroup, created in the previous step, on the top right menu.
  3. To create a database named mydatabase, enter the following statement:
CREATE DATABASE mydatabase
  1. Choose Run.
  2. From the Database list on the left, choose mydatabase to make it your current database.

Now that you have a database, you can create a table in the AWS Glue Data Catalog to start querying the GHCN-D dataset.

  1. In the Athena query editor, run the following query:
CREATE EXTERNAL TABLE `noaa_ghcn_pds`(
  `id` string, 
  `year_date` string, 
  `element` string, 
  `data_value` string, 
  `m_flag` string, 
  `q_flag` string, 
  `s_flag` string, 
  `obs_time` string
)
ROW FORMAT SERDE 
  'org.apache.hadoop.hive.serde2.OpenCSVSerde' 
WITH SERDEPROPERTIES ('separatorChar'=',')
STORED AS INPUTFORMAT 'org.apache.hadoop.mapred.TextInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://noaa-ghcn-pds/csv/'
TBLPROPERTIES ('classification'='csv')

After that, the table noaa_ghcn_pds should appear under the list of tables for your database. In the preceding statement, we define columns based on the GHCN-D data structure. For a full description of the variables and data structure, refer to the dataset’s readme file.

With the database and the table configured, you can start running SQL queries against the entire dataset. For the purpose of this post, you create a second table containing a subset of the data: the maximum temperatures of one weather station located in Retiro Park (or simply El Retiro), one of the largest parks of the city of Madrid, Spain. The identification of the station is SP000003195 and the element of interest is TMAX.

  1. Run the following statement on the Athena console to create the second table:
CREATE TABLE madrid_tmax WITH (format = 'PARQUET') AS
SELECT CAST(data_value AS real) / 10 AS t_max,
  CAST(
    SUBSTR(year_date, 1, 4) || '-' || SUBSTR(year_date, 5, 2) || '-' || SUBSTR(year_date, 7, 2) AS date
  ) AS iso_date
FROM "noaa_ghcn_pds"
WHERE id = 'SP000003195'
  AND element = 'TMAX'

After that, the table madrid_tmax should appear under the list of tables for your database. Note that in the preceding statement, the temperature value is divided by 10. That’s because temperatures are originally recorded in tenths of Celsius degrees. We also adjust the date format. Both adjustments make the consumption of the data easier.

Unlike the noaa_ghcn_pds table, the madrid_tmax table isn’t linked with the original dataset. That means its data won’t reflect updates made to the GHCN-D dataset. Instead, it holds a snapshot of the moment of its creation. That may not be ideal in certain scenarios, but is acceptable here.

Create and configure a Grafana workspace

The next step is to provision and configure a Grafana workspace and assign a user to the workspace.

Create your workspace

In this post, we use the AWS Single Sign-On (AWS SSO) option to set up the users. You can skip this step if you already have a Grafana workspace.

  1. On the Amazon Managed Grafana console, choose Create Workspace.
  2. Give your workspace a name, and optionally a description.
  3. Choose Next.
  4. Select AWS IAM Identity Center (successor to AWS SSO).
  5. For Permission type, choose Service Managed and choose Next.
  6. For Account access, select Current account.
  7. For Data sources, select Amazon Athena and choose Next.
  8. Review the details and choose Create workspace.

This starts the creation of the Grafana workspace.

Create a user and assign it to the workspace

The last step of the configuration is to create a user to access the Grafana dashboard. Complete the following steps:

  1. Create a user for your AWS SSO identity store if you don’t have one already.
  2. On the Amazon Managed Grafana console, choose All workspaces in the navigation pane.
  3. Choose your Grafana workspace to open the workspace details.
  4. On the Authentication tab, choose Assign new user or group.
  5. Select the user you created and choose Assign users and groups.
  6. Change the user type by selecting the user and on the Action menu, choose Make admin.

Create a Grafana dashboard

Now that you have Athena and Amazon Managed Grafana configured, create a Grafana dashboard with data fetched from Amazon S3 using Athena. Complete the following steps:

  1. On the Amazon Managed Grafana console, choose All workspaces in the navigation pane.
  2. Choose the Grafana workspace URL link.
  3. Log in with the user you assigned in the previous step.
  4. In the navigation pane, choose the lower AWS icon (there are two) and then choose Athena on the AWS services tab.
  5. Choose the Region, database, and workgroup used previously, then choose Add 1 data source.
  6. Under Provisioned data sources, choose Go to settings on the newly created data source.
  7. Select Default and then choose Save & test.
  8. In the navigation pane, hover over the plus sign and then choose Dashboard to create a new dashboard.
  9. Choose Add a new panel.
  10. In the query pane, enter the following query:
select iso_date as time, t_max from madrid_tmax where $__dateFilter(iso_date) order by iso_date
  1. Choose Apply.
  2. Change the time range on the top right corner.

For example, if you change to Last 2 years, you should see something similar to the following screenshot.

Temperature visualization

Now that you’re able to populate your Grafana dashboard with data fetched from Amazon S3 using Athena, you can experiment with different visualizations and configurations. Grafana provides lots of options, and you can adjust your dashboard to your preferences, as shown in the following example screenshot of daily maximum temperatures.

Temperature visualization - colorful

As you can see in this visualization, Madrid can get really hot on the summer!

For more information on how to customize Grafana visualizations, refer to Visualization panels.

Clean up

If you followed the instructions in this post in your own AWS account, don’t forget to clean up the created resources to avoid further charges.

Conclusion

In this post, you learned how to use Amazon Managed Grafana in conjunction with Athena to query data stored in an S3 bucket. As an example, we used a subset of the GHCN-D dataset, available in the Registry of Open Data on AWS.

Check out Amazon Managed Grafana and start creating other dashboards using your own data or other publicly available datasets stored in Amazon S3.


About the authors

Pedro Pimentel is a Prototyping Architect working on the AWS Cloud Engineering and Prototyping team, based in Brazil. He works with AWS customers to innovate using new technologies and services. In his spare time, Pedro enjoys traveling and cycling.

Rafael Werneck is a Senior Prototyping Architect at AWS Cloud Engineering and Prototyping, based in Brazil. Previously, he worked as a Software Development Engineer on Amazon.com.br and Amazon RDS Performance Insights.

Create cross-account, custom Amazon Managed Grafana dashboards for Amazon Redshift

Post Syndicated from Tahir Aziz original https://aws.amazon.com/blogs/big-data/create-cross-account-custom-amazon-managed-grafana-dashboards-for-amazon-redshift/

Amazon Managed Grafana recently announced a new data source plugin for Amazon Redshift, enabling you to query, visualize, and alert on your Amazon Redshift data from Amazon Managed Grafana workspaces. With the new Amazon Redshift data source, you can now create dashboards and alerts in your Amazon Managed Grafana workspaces to analyze your structured and semi-structured data across data warehouses, operational databases, and data lakes. The Amazon Redshift plugin also comes with default out-of-the-box dashboards that make it simple to get started monitoring the health and performance of your Amazon Redshift clusters.

In this post, we present a step-by-step tutorial to use the Amazon Redshift data source plugin to visualize metrics from your Amazon Redshift clusters hosted in different AWS accounts using AWS Single Sign-On (AWS SSO) as well as how to create custom dashboards visualizing data from Amazon Redshift system tables in Amazon Managed Grafana.

Solution overview

Let’s look at the AWS services that we use in our tutorial:

Amazon Managed Grafana is a fully managed service for open-source Grafana developed in collaboration with Grafana Labs. Grafana is a popular open-source analytics platform that enables you to query, visualize, alert on, and understand your operational metrics. You can create, explore, and share observability dashboards with your team, and spend less time managing your Grafana infrastructure and more time improving the health, performance, and availability of your applications. Amazon Managed Grafana natively integrates with AWS services (like Amazon Redshift) so you can securely add, query, visualize, and analyze operational and performance data across multiple accounts and Regions for the underlying AWS service.

Amazon Redshift is a fully managed, petabyte-scale data warehouse service in the cloud. You can start with just a few hundred gigabytes of data and scale to a petabyte or more. This enables you to use your data to acquire new insights for your business and customers. Today, tens of thousands of AWS customers from Fortune 500 companies, startups, and everything in between use Amazon Redshift to run mission-critical business intelligence (BI) dashboards, analyze real-time streaming data, and run predictive analytics jobs. With the constant increase in generated data, Amazon Redshift customers continue to achieve successes in delivering better service to their end-users, improving their products, and running an efficient and effective business.

AWS SSO is where you create or connect your workforce identities in AWS and manage access centrally across your AWS organization. You can choose to manage access just to your AWS accounts or cloud applications. You can create user identities directly in AWS SSO, or you can bring them from your Microsoft Active Directory or a standards-based identity provider, such as Okta Universal Directory or Azure AD. With AWS SSO, you get a unified administration experience to define, customize, and assign fine-grained access. Your workforce users get a user portal to access all their assigned AWS accounts, Amazon Elastic Compute Cloud (Amazon EC2) Windows instances, or cloud applications. AWS SSO can be flexibly configured to run alongside or replace AWS account access management via AWS Identity and Access Management (IAM).

The following diagram illustrates the solution architecture.

The solution includes the following components:

  • Captured metrics from the Amazon Redshift clusters in the development and production AWS accounts.
  • Amazon Managed Grafana, with the Amazon Redshift data source plugin added to it. Amazon Managed Grafana communicates with the Amazon Redshift cluster via the Amazon Redshift Data Service API.
  • The Grafana web UI, with the Amazon Redshift dashboard using the Amazon Redshift cluster as the data source. The web UI communicates with Amazon Managed Grafana via an HTTP API.

We walk you through the following steps in this post:

  1. Create a user in AWS SSO for Amazon Managed Grafana workspace access.
  2. Configure an Amazon Managed Grafana workspace.
  3. Set up two Amazon Redshift clusters as the data sources in Grafana.
  4. Import the Amazon Redshift dashboard supplied with the data source.
  5. Create a custom Amazon Redshift dashboard to visualize metrics from the Amazon Redshift clusters.

Prerequisites

To follow along with this post, you should have the following prerequisites:

Set up AWS SSO

In this section, we set up AWS SSO and register users.

In addition to AWS SSO integration, Amazon Managed Grafana also supports direct SAML integration with SAML 2.0 identity providers.

  1. If you don’t have AWS SSO enabled, open the AWS SSO console and choose Enable AWS SSO.
  2. After AWS SSO is enabled, choose Users in the navigation pane.
  3. Choose Add user.
  4. Enter the user details and choose Next: Groups.
  5. Choose Add user.

Set up your Amazon Grafana workspace

In this section, we demonstrate how to set up a Grafana workspace using Amazon Managed Grafana. We set up authentication using AWS SSO, register data sources, and add administrative users for the workspace.

  1. On the Amazon Managed Grafana console, choose Create workspace.
  2. For Workspace name, enter a suitable name.
  3. Choose Next.
  4. For Authentication access, select AWS Single Sign-On.
  5. For Permission type, select Service managed.
  6. Choose Next.
  7. Select Current account.
  8. For Data sources, select Amazon Redshift.
  9. Choose Next.
  10. Review the details and choose Create workspace.

    Now we assign a user to the workspace.
  11. On the Workspaces page, choose the workspace you created.
  12. Note the IAM role attached to your workspace.
  13. Choose Assign new user or group.
  14. Select the user to assign to the workspace.
  15. Choose Assign users and groups.

    For the purposes of this post, we need an admin user.
  16. To change the permissions of the user you just assigned, select the user name and choose Make admin.

For the cross-account setup, we use two Amazon Redshift clusters: production and development. In the next section, we configure IAM roles in both the production and development accounts so that the Grafana in the production account is able to connect to the Amazon Redshift clusters in the production account as well as in the development account.

Configure an IAM role for the development account

In this section, we set up the IAM role in the AWS account hosting the development environment. This role is assumed by the Amazon Managed Grafana service from the production AWS account to establish the connection between Amazon Managed Grafana and Amazon Redshift cluster in the development account.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Choose Create role.
  3. Select Custom trust policy.
  4. Use the following policy code (update the account number for your production account and the Grafana service role attached to the workspace):
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": "arn:aws:iam::<production-account-number>:role/service-role/AmazonGrafanaServiceRole-xxxxxxxxxx",
                    "Service": "grafana.amazonaws.com"
                },
                "Action": "sts:AssumeRole"
            }
        ]
    }

  5. Choose Next.
  6. Attach the managed IAM policy AmazonGrafanaRedshiftAccess to this role. For instructions, refer to Modifying a role permissions policy (console).
  7. Provide a role name, description, and tags (optional), and create the role.

Configure an IAM role for the production account

Next, we configure the IAM role created by the Amazon Managed Grafana service in order to establish the connection between Amazon Managed Grafana and the Amazon Redshift cluster in the production account.

  1. On the IAM console, choose Roles in the navigation pane.
  2. Search for the AmazonGrafanaServiceRole-xxxxxxx role attached to your Grafana workspace.
  3. Create an inline IAM policy and attach it to this role with the following policy code:
    {
    	"Version": "2012-10-17",
    	"Statement": [{
    		"Sid": "VisualEditor0",
    		"Effect": "Allow",
    		"Action": [
    			"sts:AssumeRole"
    		],
    		"Resource":"arn:aws:iam::<dev-account-number>:role/<DevAccountRoleName>"
    	}]
    }

  4. Provide a role name, description, and tags (optional), and create the role.

Import the default dashboard

In this section, we connect to the Amazon Redshift clusters in the production and development accounts from the Amazon Managed Grafana console and import the default dashboard.

  1. On the Amazon Managed Grafana console, choose Workspaces in the navigation pane.
  2. Choose the workspace you just created (authenticate and sign in if needed).
  3. In the navigation pane, choose Settings and on the Configuration menu, choose Data sources.
  4. Choose Add data source.
  5. Search for and choose Amazon Redshift.
  6. On the Settings tab, for Authentication provider, choose Workspace IAM role.
  7. For Default Region, choose us-east-1.
  8. Under Redshift Details, choose Temporary credentials.
  9. Enter the cluster identifier and database name for your Amazon Redshift cluster in the development account.
  10. For Database user, enter redshift_data_api_user.
  11. Choose Save & test.
    When the connection is successfully established, a message appears that the data source is working. You can now move on to the next step.
  12. Repeat these steps to add another data source to connect to the Amazon Redshift cluster in the development account.
  13. On the Settings tab, for Authentication provider, choose Workspace IAM role.
  14. Enter the workspace role as the ARN of the IAM role you created earlier (arn:aws:iam::dev-account-number:role/cross-account-role-name).
  15. For Default Region, choose us-east-1.
  16. Under Redshift Details, choose Temporary credentials.
  17. Enter the cluster identifier and database name for your Amazon Redshift cluster in the development account.
  18. For Database user, enter redshift_data_api_user.
  19. Choose Save & test.
    When the connection is successfully established, a message appears that the data source is working.
  20. On the Dashboards tab, choose Import next to Amazon Redshift.

On the dashboard page, you can change the data source between your production and development clusters on a drop-down menu.

The default Amazon Redshift dashboard, as shown in the following screenshot, makes it easy to monitor the overall health of the cluster by showing different cluster metrics, like total storage capacity used, storage utilization per node, open and closed connections, WLM mode, AQUA status, and more.

Additionally, the default dashboard displays several table-level metrics such as size of the tables, total number of rows, unsorted rows percentage, and more, in the Schema Insights section.

Add a custom dashboard for Amazon Redshift

The Amazon Redshift data source plugin allows you to query and visualize Amazon Redshift data metrics from within Amazon Managed Grafana. It’s preconfigured with general metrics. To add a custom metric from the Amazon Redshift cluster, complete the following steps:

  1. On the Amazon Managed Grafana console, choose All workspaces in the navigation pane.
  2. Choose the Grafana workspace URL for the workspace you want to modify.
  3. Choose Sign in with AWS SSO and provide your credentials.
  4. On the Amazon Managed Grafana workspace page, choose the plus sign and on the Create menu, choose Dashboard.
  5. Choose Add a new panel.
  6. Add the following custom SQL to get the data from the Amazon Redshift cluster:
    select 
    p.usename,
    count(*) as Num_Query,
    SUM(DATEDIFF('second',starttime,endtime)) as Total_Execution_seconds from stl_query s 
    inner join pg_user p on s.userid= p.usesysid where starttime between $__timeFrom() and $__timeTo()
    and s.userid>1 group by 1

    For this post, we use the default settings, but you can control and link the time range using the $__timeFrom() and $__timeTo() macros; they’re bound with the time range control of your dashboard. For more information and details about the supported expressions, see Query Redshift data.

  7. To inspect the data, choose Query inspector to test the custom query outcome.
    Amazon Managed Grafana supports a variety of visualizations. For this post, we create a bar chart.
  8. On the Visualizations tab in the right pane, choose Bar chart.
  9. Enter a title and description for the custom chart, and leave all other properties as default.
    For more information about supported properties, see Visualizations.
  10. Choose Save.
  11. In the pop-up window, enter a dashboard name and choose Save.

    A new dashboard is created with a custom metric.
  12. To add more metrics, choose the Add panel icon, choose Add a new panel, and repeat the previous steps.

Clean up

To avoid incurring future charges, complete the following steps:

  1. Delete the Amazon Managed Grafana workspace.
  2. If you created a new Amazon Redshift cluster for this demonstration, delete the cluster.

Conclusion

In this post, we demonstrated how to use AWS SSO and Amazon Managed Grafana to create an operational view to monitor the health and performance of Amazon Redshift clusters. We learned how to extend your default dashboard by adding custom and insightful dashboards to your Grafana workspace.

We look forward to hearing from you about your experience. If you have questions or suggestions, please leave a comment.


About the Authors

Tahir Aziz is an Analytics Solution Architect at AWS. He has worked with building data warehouses and big data solutions for over 13 years. He loves to help customers design end-to-end analytics solutions on AWS. Outside of work, he enjoys traveling and cooking.

Shawn Sachdev is a Sr. Analytics Specialist Solutions Architect at AWS. He works with customers and provides guidance to help them innovate and build well-architected and high-performance data warehouses and implement analytics at scale on the AWS platform. Before AWS, he worked in several analytics and system engineering roles. Outside of work, he loves watching sports, and is an avid foodie and craft beer enthusiast.

Ekta Ahuja is an Analytics Specialist Solutions Architect at AWS. She is passionate about helping customers build scalable and robust data and analytics solutions. Before AWS, she worked in several different data engineering and analytics roles. Outside of work, she enjoys baking, traveling, and board games.

Query and visualize Amazon Redshift operational metrics using the Amazon Redshift plugin for Grafana

Post Syndicated from Sergey Konoplev original https://aws.amazon.com/blogs/big-data/query-and-visualize-amazon-redshift-operational-metrics-using-the-amazon-redshift-plugin-for-grafana/

Grafana is a rich interactive open-source tool by Grafana Labs for visualizing data across one or many data sources. It’s used in a variety of modern monitoring stacks, allowing you to have a common technical base and apply common monitoring practices across different systems. Amazon Managed Grafana is a fully managed, scalable, and secure Grafana-as-a-service solution developed by AWS in collaboration with Grafana Labs.

Amazon Redshift is the most widely used data warehouse in the cloud. You can view your Amazon Redshift cluster’s operational metrics on the Amazon Redshift console, use AWS CloudWatch, and query Amazon Redshift system tables directly from your cluster. The first two options provide a set of predefined general metrics and visualizations. The last one allows you to use the flexibility of SQL to get deep insights into the details of the workload. However, querying system tables requires knowledge of system table structures. To address that, we came up with a consolidated Amazon Redshift Grafana dashboard that visualizes a set of curated operational metrics and works on top of the Amazon Redshift Grafana data source. You can easily add it to an Amazon Managed Grafana workspace, as well as to any other Grafana deployments where the data source is installed.

This post guides you through a step-by-step process to create an Amazon Managed Grafana workspace and configure an Amazon Redshift cluster with a Grafana data source for it. Lastly, we show you how to set up the Amazon Redshift Grafana dashboard to visualize the cluster metrics.

Solution overview

The following diagram illustrates the solution architecture.

Architecture Diagram

The solution includes the following components:

  • The Amazon Redshift cluster to get the metrics from.
  • Amazon Managed Grafana, with the Amazon Redshift data source plugin added to it. Amazon Managed Grafana communicates with the Amazon Redshift cluster via the Amazon Redshift Data Service API.
  • The Grafana web UI, with the Amazon Redshift dashboard using the Amazon Redshift cluster as the data source. The web UI communicates with Amazon Managed Grafana via an HTTP API.

We walk you through the following steps during the configuration process:

  1. Configure an Amazon Redshift cluster.
  2. Create a database user for Amazon Managed Grafana on the cluster.
  3. Configure a user in AWS Single Sign-On (AWS SSO) for Amazon Managed Grafana UI access.
  4. Configure an Amazon Managed Grafana workspace and sign in to Grafana.
  5. Set up Amazon Redshift as the data source in Grafana.
  6. Import the Amazon Redshift dashboard supplied with the data source.

Prerequisites

To follow along with this walkthrough, you should have the following prerequisites:

  • An AWS account
  • Familiarity with the basic concepts of the following services:
    • Amazon Redshift
    • Amazon Managed Grafana
    • AWS SSO

Configure an Amazon Redshift cluster

If you don’t have an Amazon Redshift cluster, create a sample cluster before proceeding with the following steps. For this post, we assume that the cluster identifier is called redshift-demo-cluster-1 and the admin user name is awsuser.

  1. On the Amazon Redshift console, choose Clusters in the navigation pane.
  2. Choose your cluster.
  3. Choose the Properties tab.

Redshift Cluster Properties

To make the cluster discoverable by Amazon Managed Grafana, you must add a special tag to it.

  1. Choose Add tags. Redshift Cluster Tags
  2. For Key, enter GrafanaDataSource.
  3. For Value, enter true.
  4. Choose Save changes.

Redshift Cluster Tags

Create a database user for Amazon Managed Grafana

Grafana will be directly querying the cluster, and it requires a database user to connect to the cluster. In this step, we create the user redshift_data_api_user and apply some security best practices.

  1. On the cluster details page, choose Query data and Query in query editor v2.Query Editor v2
  2. Choose the redshift-demo-cluster-1 cluster we created previously.
  3. For Database, enter the default dev.
  4. Enter the user name and password that you used to create the cluster.
  5. Choose Create connection.Redshift SU
  6. In the query editor, enter the following statements and choose Run:
CREATE USER redshift_data_api_user PASSWORD '&lt;password&gt;' CREATEUSER;
ALTER USER redshift_data_api_user SET readonly TO TRUE;
ALTER USER redshift_data_api_user SET query_group TO 'superuser';

The first statement creates a user with superuser privileges necessary to access system tables and views (make sure to use a unique password). The second prohibits the user from making modifications. The last statement isolates the queries the user can run to the superuser queue, so they don’t interfere with the main workload.

In this example, we use service managed permissions in Amazon Managed Grafana and a workspace AWS Identity and Access Management (IAM) role as an authentication provider in the Amazon Redshift Grafana data source. We create the database user redshift_data_api_user using the AmazonGrafanaRedshiftAccess policy.

Configure a user in AWS SSO for Amazon Managed Grafana UI access

Two authentication methods are available for accessing Amazon Managed Grafana: AWS SSO and SAML. In this example, we use AWS SSO.

  1. On the AWS SSO console, choose Users in the navigation pane.
  2. Choose Add user.
  3. In the Add user section, provide the required information.

SSO add user

In this post, we select Send an email to the user with password setup instructions. You need to be able to access the email address you enter because you use this email further in the process.

  1. Choose Next to proceed to the next step.
  2. Choose Add user.

An email is sent to the email address you specified.

  1. Choose Accept invitation in the email.

You’re redirected to sign in as a new user and set a password for the user.

  1. Enter a new password and choose Set new password to finish the user creation.

Configure an Amazon Managed Grafana workspace and sign in to Grafana

Now you’re ready to set up an Amazon Managed Grafana workspace.

  1. On the Amazon Grafana console, choose Create workspace.
  2. For Workspace name, enter a name, for example grafana-demo-workspace-1.
  3. Choose Next.
  4. For Authentication access, select AWS Single Sign-On.
  5. For Permission type, select Service managed.
  6. Chose Next to proceed.AMG Workspace configure
  7. For IAM permission access settings, select Current account.AMG permission
  8. For Data sources, select Amazon Redshift.
  9. Choose Next to finish the workspace creation.Redshift to workspace

You’re redirected to the workspace page.

Next, we need to enable AWS SSO as an authentication method.

  1. On the workspace page, choose Assign new user or group.SSO new user
  2. Select the previously created AWS SSO user under Users and Select users and groups tables.SSO User

You need to make the user an admin, because we set up the Amazon Redshift data source with it.

  1. Select the user from the Users list and choose Make admin.
  2. Go back to the workspace and choose the Grafana workspace URL link to open the Grafana UI.AMG workspace
  3. Sign in with the user name and password you created in the AWS SSO configuration step.

Set up an Amazon Redshift data source in Grafana

To visualize the data in Grafana, we need to access the data first. To do so, we must create a data source pointing to the Amazon Redshift cluster.

  1. On the navigation bar, choose the lower AWS icon (there are two) and then choose Redshift from the list.
  2. For Regions, choose the Region of your cluster.
  3. Select the cluster from the list and choose Add 1 data source.Choose Redshift Cluster
  4. On the Provisioned data sources page, choose Go to settings.
  5. For Name, enter a name for your data source.
  6. By default, Authentication Provider should be set as Workspace IAM Role, Default Region should be the Region of your cluster, and Cluster Identifier should be the name of the chosen cluster.
  7. For Database, enter dev.
  8. For Database User, enter redshift_data_api_user.
  9. Choose Save & Test.Settings for Data Source

A success message should appear.

Data source working

Import the Amazon Redshift dashboard supplied with the data source

As the last step, we import the default Amazon Redshift dashboard and make sure that it works.

  1. In the data source we just created, choose Dashboards on the top navigation bar and choose Import to import the Amazon Redshift dashboard.Dashboards in the plugin
  2. Under Dashboards on the navigation sidebar, choose Manage.
  3. In the dashboards list, choose Amazon Redshift.

The dashboard appear, showing operational data from your cluster. When you add more clusters and create data sources for them in Grafana, you can choose them from the Data source list on the dashboard.

Clean up

To avoid incurring unnecessary charges, delete the Amazon Redshift cluster, AWS SSO user, and Amazon Managed Grafana workspace resources that you created as part of this solution.

Conclusion

In this post, we covered the process of setting up an Amazon Redshift dashboard working under Amazon Managed Grafana with AWS SSO authentication and querying from the Amazon Redshift cluster under the same AWS account. This is just one way to create the dashboard. You can modify the process to set it up with SAML as an authentication method, use custom IAM roles to manage permissions with more granularity, query Amazon Redshift clusters outside of the AWS account where the Grafana workspace is, use an access key and secret or AWS Secrets Manager based connection credentials in data sources, and more. You can also customize the dashboard by adding or altering visualizations using the feature-rich Grafana UI.

Because the Amazon Redshift data source plugin is an open-source project, you can install it in any Grafana deployment, whether it’s in the cloud, on premises, or even in a container running on your laptop. That allows you to seamlessly integrate Amazon Redshift monitoring into virtually all your existing Grafana-based monitoring stacks.

For more details about the systems and processes described in this post, refer to the following:


About the Authors

Sergey Konoplev is a Senior Database Engineer on the Amazon Redshift team. Sergey has been focusing on automation and improvement of database and data operations for more than a decade.

Milind Oke is a Data Warehouse Specialist Solutions Architect based out of New York. He has been building data warehouse solutions for over 15 years and specializes in Amazon Redshift.

Amazon Managed Grafana Is Now Generally Available with Many New Features

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/amazon-managed-grafana-is-now-generally-available-with-many-new-features/

In December, we introduced the preview of Amazon Managed Grafana, a fully managed service developed in collaboration with Grafana Labs that makes it easy to use the open-source and the enterprise versions of Grafana to visualize and analyze your data from multiple sources. With Amazon Managed Grafana, you can analyze your metrics, logs, and traces without having to provision servers, or configure and update software.

During the preview, Amazon Managed Grafana was updated with new capabilities. Today, I am happy to announce that Amazon Managed Grafana is now generally available with additional new features:

  • Grafana has been upgraded to version 8 and offers new data sources, visualizations, and features, including library panels that you can build once and re-use on multiple dashboards, a Prometheus metrics browser to quickly find and query metrics, and new state timeline and status history visualizations.
  • To centralize the querying of additional data sources within an Amazon Managed Grafana workspace, you can now query data using the JSON data source plugin. You can now also query Redis, SAP HANA, Salesforce, ServiceNow, Atlassian Jira, and many more data sources.
  • You can use Grafana API keys to publish your own dashboards or give programmatic access to your Grafana workspace. For example, this is a Terraform recipe that you can use to add data sources and dashboards.
  • You can enable single sign-on to your Amazon Managed Grafana workspaces using Security Assertion Markup Language 2.0 (SAML 2.0). We have worked with these identity providers (IdP) to have them integrated at launch: CyberArk, Okta, OneLogin, Ping Identity, and Azure Active Directory.
  • All calls from the Amazon Managed Grafana console and code calls to Amazon Managed Grafana API operations are captured by AWS CloudTrail. In this way, you can have a record of actions taken in Amazon Managed Grafana by a user, role, or AWS service. Additionally, you can now audit mutating changes that occur in your Amazon Managed Grafana workspace, such as when a dashboard is deleted or data source permissions are changed.
  • The service is available in ten AWS Regions (full list at the end of the post).

Let’s do a quick walkthrough to see how this works in practice.

Using Amazon Managed Grafana
In the Amazon Managed Grafana console, I choose Create workspace. A workspace is a logically isolated, highly available Grafana server. I enter a name and a description for the workspace, and then choose Next.

Console screenshot.

I can use AWS Single Sign-On (AWS SSO) or an external identity provider via SAML to authenticate the users of my workspace. For simplicity, I select AWS SSO. Later in the post, I’ll show how SAML authentication works. If this is your first time using AWS SSO, you can see the prerequisites (such as having AWS Organizations set up) in the documentation.

Console screenshot.

Then, I choose the Service managed permission type. In this way, Amazon Managed Grafana will automatically provision the necessary IAM permissions to access the AWS Services that I select in the next step.

Console screenshot.

In Service managed permission settings, I choose to monitor resources in my current AWS account. If you use AWS Organizations to centrally manage your AWS environment, you can use Grafana to monitor resources in your organizational units (OUs).

Console screenshot.

I can optionally select the AWS data sources that I am planning to use. This configuration creates an AWS Identity and Access Management (IAM) role that enables Amazon Managed Grafana to access those resources in my account. Later, in the Grafana console, I can set up the selected services as data sources. For now, I select Amazon CloudWatch so that I can quickly visualize CloudWatch metrics in my Grafana dashboards.

Here I also configure permissions to use Amazon Managed Service for Prometheus (AMP) as a data source and have a fully managed monitoring solution for my applications. For example, I can collect Prometheus metrics from Amazon Elastic Kubernetes Service (EKS) and Amazon Elastic Container Service (Amazon ECS) environments, using AWS Distro for OpenTelemetry or Prometheus servers as collection agents.

Console screenshot.

In this step I also select Amazon Simple Notification Service (SNS) as a notification channel. Similar to the data sources before, this option gives Amazon Managed Grafana access to SNS but does not set up the notification channel. I can do that later in the Grafana console. Specifically, this setting adds SNS publish permissions to topics that start with grafana to the IAM role created by the Amazon Managed Grafana console. If you prefer to have tighter control on permissions for SNS or any data source, you can edit the role in the IAM console or use customer-managed permissions for your workspace.

Finally, I review all the options and create the workspace.

After a few minutes, the workspace is ready, and I find the workspace URL that I can use to access the Grafana console.

Console screenshot.

I need to assign at least one user or group to the Grafana workspace to be able to access the workspace URL. I choose Assign new user or group and then select one of my AWS SSO users.

Console screenshot.

By default, the user is assigned a Viewer user type and has view-only access to the workspace. To give this user permissions to create and manage dashboards and alerts, I select the user and then choose Make admin.

Console screenshot.

Back to the workspace summary, I follow the workspace URL and sign in using my AWS SSO user credentials. I am now using the open-source version of Grafana. If you are a Grafana user, everything is familiar. For my first configurations, I will focus on AWS data sources so I choose the AWS logo on the left vertical bar.

Console screenshot.

Here, I choose CloudWatch. Permissions are already set because I selected CloudWatch in the service-managed permission settings earlier. I select the default AWS Region and add the data source. I choose the CloudWatch data source and on the Dashboards tab, I find a few dashboards for AWS services such as Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Block Store (EBS), AWS Lambda, Amazon Relational Database Service (RDS), and CloudWatch Logs.

Console screenshot.

I import the AWS Lambda dashboard. I can now use Grafana to monitor invocations, errors, and throttles for Lambda functions in my account. I’ll save you the screenshot because I don’t have any interesting data in this Region.

Using SAML Authentication
If I don’t have AWS SSO enabled, I can authenticate users to the Amazon Managed Grafana workspace using an external identity provider (IdP) by selecting the SAML authentication option when I create the workspace. For existing workspaces, I can choose Setup SAML configuration in the workspace summary.

First, I have to provide the workspace ID and URL information to my IdP in order to generate IdP metadata for configuring this workspace.

Console screenshot.

After my IdP is configured, I import the IdP metadata by specifying a URL or copying and pasting to the editor.

Console screenshot.

Finally, I can map user permissions in my IdP to Grafana user permissions, such as specifying which users will have Administrator, Editor, and Viewer permissions in my Amazon Managed Grafana workspace.

Console screenshot.

Availability and Pricing
Amazon Managed Grafana is available today in ten AWS Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), Europe (Frankfurt), Europe (London), Asia Pacific (Singapore), Asia Pacific (Tokyo), Asia Pacific (Sydney), and Asia Pacific (Seoul). For more information, see the AWS Regional Services List.

With Amazon Managed Grafana, you pay for the active users per workspace each month. Grafana API keys used to publish dashboards are billed as an API user license per workspace each month. You can upgrade to Grafana Enterprise to have access to enterprise plugins, support, and on-demand training directly from Grafana Labs. For more information, see the Amazon Managed Grafana pricing page.

To learn more, you are invited to this webinar on Thursday, September 9 at 9:00 am PDT / 12:00 pm EDT / 6:00 pm CEST.

Start using Amazon Managed Grafana today to visualize and analyze your operational data at any scale.

Danilo