Today, I’m happy to announce a new capability, Amazon Managed Service for Prometheus collector, to automatically and agentlessly discover and collect Prometheus metrics from Amazon Elastic Kubernetes Service (Amazon EKS). Amazon Managed Service for Prometheus collector consists of a scraper that discovers and collects metrics from Amazon EKS applications and infrastructure without needing to run any collectors in-cluster.
This new capability provides fully managed Prometheus-compatible monitoring and alerting with Amazon Managed Service for Prometheus. One of the significant benefits is that the collector is fully managed, automatically right-sized, and scaled for your use case. This means you don’t have to run any compute for collectors to collect the available metrics. This helps you optimize metric collection costs to monitor your applications and infrastructure running on EKS.
With this launch, Amazon Managed Service for Prometheus now supports two major modes of Prometheus metrics collection: AWS managed collection, a fully managed and agentless collector, and customer managed collection.
Getting started with Amazon Managed Service for Prometheus collector
Let’s take a look at how to use AWS managed collectors to ingest metrics using this new capability into a workspace in Amazon Managed Service for Prometheus. Then, we will evaluate the collected metrics in Amazon Managed Grafana.
When you create a new EKS cluster using the Amazon EKS console, you now have the option to enable an AWS managed collector by selecting Send Prometheus metrics to Amazon Managed Service for Prometheus. In the Destination section, you can create a new workspace or select an existing Amazon Managed Service for Prometheus workspace. You can learn more about how to create a workspace by following the getting started guide.
Then, you have the flexibility to define your scraper configuration using the editor or upload your existing configuration. The scraper configuration controls how you would like the scraper to discover and collect metrics. To see possible values you can configure, please visit the Prometheus Configuration page.
Once you’ve finished the EKS cluster creation, you can go to the Observability tab on your cluster page to see the list of scrapers running in your EKS cluster.
The next step is to configure your EKS cluster to allow the scraper to access metrics. You can find the steps and information on Configuring your Amazon EKS cluster.
Once your EKS cluster is properly configured, the collector will automatically discover metrics from your EKS cluster and nodes. To visualize the metrics, you can use Amazon Managed Grafana integrated with your Prometheus workspace. Visit the Set up Amazon Managed Grafana for use with Amazon Managed Service for Prometheus page to learn more.
The following is a screenshot of metrics ingested by the collectors and visualized in an Amazon Managed Grafana workspace. From here, you can run a simple query to get the metrics that you need.
Using the AWS CLI and APIs
Besides using the Amazon EKS console, you can also use the APIs or AWS Command Line Interface (AWS CLI) to add an AWS managed collector. This approach is useful if you want to add an AWS managed collector to an existing EKS cluster or modify an existing collector configuration.
To create a scraper, you can run the following command:
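The command itself isn’t reproduced above. The following is a sketch of the `aws amp create-scraper` call; the ARNs and subnet IDs are placeholders, and the shorthand parameter syntax is one option (you can also pass JSON via `file://`):

```shell
# Placeholders: substitute your own EKS cluster ARN, subnet IDs, and workspace ARN
CLUSTER_ARN="arn:aws:eks:us-east-1:111122223333:cluster/my-eks-cluster"
WORKSPACE_ARN="arn:aws:aps:us-east-1:111122223333:workspace/ws-EXAMPLE"

# The scraper configuration is passed as a base64-encoded blob (see below)
BLOB=$(base64 sample-configuration.yml | tr -d '\n')

aws amp create-scraper \
  --source "eksConfiguration={clusterArn=${CLUSTER_ARN},subnetIds=[subnet-EXAMPLE1,subnet-EXAMPLE2]}" \
  --destination "ampConfiguration={workspaceArn=${WORKSPACE_ARN}}" \
  --scrape-configuration "configurationBlob=${BLOB}"
```

The call returns a scraper ID that you can use later with `describe-scraper` or `delete-scraper`.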
You can get most of the parameter values from the respective AWS console pages, such as your EKS cluster ARN and your Amazon Managed Service for Prometheus workspace ARN. You also need to provide the scraper configuration, passed as configurationBlob.
Once you’ve defined the scraper configuration, you need to encode the configuration file into base64 before making the API call. The following is the command that I use on my macOS development machine to encode sample-configuration.yml into base64 and copy it onto the clipboard.
$ base64 sample-configuration.yml | pbcopy
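pbcopy is specific to macOS. On Linux you can write the encoded blob to a file instead, and it’s worth sanity-checking that the blob decodes back to the original. The sketch below uses a stand-in configuration file purely for illustration:

```shell
# Stand-in scraper configuration (illustrative content only)
printf 'global:\n  scrape_interval: 30s\n' > sample-configuration.yml

# Encode it onto a single line, as the API expects
base64 sample-configuration.yml | tr -d '\n' > blob.txt

# Verify the blob round-trips back to the original file
base64 -d blob.txt | diff - sample-configuration.yml && echo "round-trip OK"
```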
Now Available
The Amazon Managed Service for Prometheus collector capability is now available to all AWS customers in all AWS Regions where Amazon Managed Service for Prometheus is supported.
Amazon Managed Streaming for Apache Kafka (Amazon MSK) is an event streaming platform that you can use to build asynchronous applications by decoupling producers and consumers. Monitoring of different Amazon MSK metrics is critical for efficient operations of production workloads. Amazon MSK gathers Apache Kafka metrics and sends them to Amazon CloudWatch, where you can view them. You can also monitor Amazon MSK with Prometheus, an open-source monitoring application. Many of our customers use open-source monitoring tools like Prometheus and Grafana, but running them in a self-managed environment comes with its own challenges regarding manageability, availability, and security.
In this post, we show how you can build an AWS Cloud native monitoring platform for Amazon MSK using the fully managed, highly available, scalable, and secure services Amazon Managed Service for Prometheus and Amazon Managed Grafana for better operational insights.
Why is Kafka monitoring critical?
As a critical component of the IT infrastructure, it is necessary to track your Amazon MSK clusters’ operations and their efficiency. Amazon MSK metrics help you monitor critical tasks while operating your applications. You can not only troubleshoot problems that have already occurred, but also discover anomalous behavior patterns and prevent problems from occurring in the first place.
Some customers currently use various third-party monitoring solutions like lenses.io, AppDynamics, Splunk, and others to monitor Amazon MSK operational metrics. In the context of cloud computing, customers are looking for an AWS Cloud native service that offers equivalent or better capabilities but with the added advantage of being highly scalable, available, secure, and fully managed.
Amazon MSK clusters emit a very large number of metrics via JMX, many of which can be useful for tuning the performance of your cluster, producers, and consumers. However, that large volume brings complexity to monitoring. By default, Amazon MSK clusters come with CloudWatch monitoring of your essential metrics. You can extend your monitoring capabilities by using open-source monitoring with Prometheus. This feature enables you to scrape a Prometheus-friendly API to gather all the JMX metrics and work with the data in Prometheus.
This solution provides a simple and easy observability platform for Amazon MSK, along with much-needed insights into various critical operational metrics, which yields the following organizational benefits for your IT operations or application teams:
You can quickly drill down to various Amazon MSK components (broker level, topic level, or cluster level) and identify issues that need investigation
You can investigate Amazon MSK issues after the event using the historical data in Amazon Managed Service for Prometheus
You can shorten or eliminate long calls that waste time questioning business users on Amazon MSK issues
In this post, we set up Amazon Managed Service for Prometheus, Amazon Managed Grafana, and a Prometheus server running as a container on Amazon Elastic Compute Cloud (Amazon EC2) to provide a fully managed monitoring solution for Amazon MSK.
The solution provides an easy-to-configure dashboard in Amazon Managed Grafana for various critical operation metrics, as demonstrated in the following video.
Solution overview
Amazon Managed Service for Prometheus reduces the heavy lifting required to get started with monitoring applications across Amazon MSK, Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Elastic Container Service (Amazon ECS), and AWS Fargate, as well as self-managed Kubernetes clusters. The service also seamlessly integrates with Amazon Managed Grafana to simplify data visualization, team management, authentication, and authorization.
The following diagram demonstrates the solution architecture. This solution deploys a Prometheus server running as a container within Amazon EC2, which constantly scrapes metrics from the MSK brokers and remote writes them to an Amazon Managed Service for Prometheus workspace. As of this writing, Amazon Managed Service for Prometheus can’t scrape the metrics directly; therefore, a Prometheus server is necessary to do so. We use Amazon Managed Grafana to query and visualize the operational metrics for the Amazon MSK platform.
The following are the high-level steps to deploy the solution:
You download three CloudFormation template files along with the Prometheus configuration file (prometheus.yml), targets.json file (you need this to update the MSK broker DNS later on), and three JSON files for creating a dashboard within Amazon Managed Grafana.
Make sure the Prometheus server has internet access so it can download the Prometheus Docker image.
1. Create an EC2 key pair
To create your EC2 key pair, complete the following steps:
On the Amazon EC2 console, under Network & Security in the navigation pane, choose Key Pairs.
Choose Create key pair.
For Name, enter DemoMSKKeyPair.
For Key pair type¸ select RSA.
For Private key file format, choose the format in which to save the private key:
To save the private key in a format that can be used with OpenSSH, select .pem.
To save the private key in a format that can be used with PuTTY, select .ppk.
The private key file is automatically downloaded by your browser. The base file name is the name that you specified as the name of your key pair, and the file name extension is determined by the file format that you chose.
Save the private key file in a safe place.
2. Configure your Amazon MSK cluster and associated resources.
Use one of the following options to configure an existing Amazon MSK cluster or create a new one.
2.a Modify an existing Amazon MSK cluster
If you want to create a new Amazon MSK cluster for this solution, skip to section 2.b, Create a new Amazon MSK cluster. Otherwise, complete the steps in this section to modify an existing cluster.
Validate cluster monitoring settings
We must enable enhanced partition-level monitoring (available at an additional cost) and open monitoring with Prometheus. Note that open monitoring with Prometheus is only available for provisioned mode clusters.
Sign in to the account that contains the Amazon MSK cluster you want to monitor.
Open your Amazon MSK cluster.
On the Properties tab, navigate to Monitoring metrics.
Check the monitoring level for Amazon CloudWatch metrics for this cluster, and choose Edit to edit the cluster.
Select Enhanced partition-level monitoring.
Check the monitoring level for Open monitoring with Prometheus, and choose Edit to edit the cluster.
Select Enable open monitoring with Prometheus.
Under Prometheus exporters, select JMX Exporter and Node Exporter.
Under Broker log delivery, select Deliver to Amazon CloudWatch Logs.
For Log group, enter your log group for Amazon MSK.
Choose Save changes.
Deploy CloudFormation stack
Now we deploy the CloudFormation stack Prometheus_Cloudformation.yml that we downloaded earlier.
On the AWS CloudFormation console, choose Stacks in the navigation pane.
Choose Create stack.
For Prepare template, select Template is ready.
For Template source, select Upload a template.
Upload the Prometheus_Cloudformation.yml file, then choose Next.
For Stack name, enter Prometheus.
VPCID – Provide the VPC ID where your Amazon MSK cluster is deployed (mandatory)
VPCCIdr – Provide the VPC CIDR where your Amazon MSK Cluster is deployed (mandatory)
SubnetID – Provide any one of the subnets ID where your existing Amazon MSK cluster is deployed (mandatory)
MSKClusterName – Provide the name of your existing Amazon MSK cluster
Leave Cloud9InstanceType, KeyName, and LatestAmiId as default.
Choose Next.
On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create stack.
You’re redirected to the AWS CloudFormation console, where you can see the status CREATE_IN_PROGRESS. Wait until the status changes to CREATE_COMPLETE.
On the stack’s Outputs tab, note the values for the following keys (if you don’t see anything on the Outputs tab, choose the refresh icon):
PrometheusInstancePrivateIP
PrometheusSecurityGroupId
Update the Amazon MSK cluster security group
Complete the following steps to update the security group of the existing Amazon MSK cluster to allow communication from the Kafka client and Prometheus server:
On the Amazon MSK console, navigate to your Amazon MSK cluster.
On the Properties tab, under Network settings, open the security group.
Choose Edit inbound rules.
Choose Add rule and create your rule with the following parameters:
Type – Custom TCP
Port range – 11001–11002
Source – The Prometheus server security group ID
Set up your AWS Cloud9 environment
To configure your AWS Cloud9 environment, complete the following steps:
On the AWS Cloud9 console, choose Environments in the navigation pane.
Select Cloud9EC2Bastion and choose Open in Cloud9.
Close the Welcome tab and open a new terminal tab.
Create an SSH key file with the contents from the private key file DemoMSKKeyPair using the following command:
touch /home/ec2-user/environment/EC2KeyMSKDemo
Run the following command to list the newly created key file:
ls -ltr
Open the file, enter the contents of the private key file DemoMSKKeyPair, then save the file.
Change the permissions of the file using the following command:
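The command itself isn’t shown above. A minimal sketch follows, recapping the file creation so the commands run as a unit (in Cloud9 the full path from the earlier step is /home/ec2-user/environment/EC2KeyMSKDemo; a relative path is used here). The ssh line is commented because the private IP comes from your stack outputs:

```shell
# Key file from the earlier step; paste in the contents of DemoMSKKeyPair before connecting
KEY=EC2KeyMSKDemo
touch "$KEY"
chmod 600 "$KEY"   # SSH refuses private keys that other users can read

# Then connect to the Prometheus server (substitute PrometheusInstancePrivateIP):
# ssh -i "$KEY" ec2-user@PROMETHEUS_INSTANCE_PRIVATE_IP
```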
Once you’re logged in, check if the Docker service is up and running using the following command:
systemctl status docker
To exit the server, enter exit and press Enter.
2.b Create a new Amazon MSK cluster
If you don’t have an Amazon MSK cluster running in your environment, or you don’t want to use an existing cluster for this solution, complete the steps in this section.
As part of these steps, your cluster will have the following properties:
Complete the following steps to deploy the CloudFormation stack MSKResource_Cloudformation.yml:
On the AWS CloudFormation console, choose Stacks in the navigation pane.
Choose Create stack.
For Prepare template, select Template is ready.
For Template source, select Upload a template.
Upload the MSKResource_Cloudformation.yml file, then choose Next.
For Stack name, enter MSKDemo.
Network Configuration – Generic (mandatory)
Stack to be deployed in NEW VPC? (true/false) – if false, you MUST provide VPCCidr and other details under Existing VPC section (Default is true)
VPCCidr – Default is 10.0.0.0/16 for a new VPC. You can have any valid values as per your environment. If deploying in an existing VPC, provide the CIDR for the same
Network Configuration – For New VPC
PrivateSubnetMSKOneCidr (Default is 10.0.1.0/24)
PrivateSubnetMSKTwoCidr (Default is 10.0.2.0/24)
PrivateSubnetMSKThreeCidr (Default is 10.0.3.0/24)
PublicOneCidr (Default is 10.0.0.0/24)
Network Configuration – For Existing VPC (You need at least 4 subnets)
VpcId – Provide the value if you are using an existing VPC to deploy the resources; otherwise, leave it blank (default)
SubnetID1 – Any one of the existing subnets from the given VPCID
SubnetID2 – Any one of the existing subnets from the given VPCID
SubnetID3 – Any one of the existing subnets from the given VPCID
PublicSubnetID – Any one of the existing subnets from the given VPCID
Leave the remaining parameters as default and choose Next.
On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create stack.
You’re redirected to the AWS CloudFormation console, where you can see the status CREATE_IN_PROGRESS. Wait until the status changes to CREATE_COMPLETE.
On the stack’s Outputs tab, note the values for the following (if you don’t see anything on the Outputs tab, choose the refresh icon):
KafkaClientPrivateIP
PrometheusInstancePrivateIP
Set up your AWS Cloud9 environment
Follow the steps as outlined in the previous section to configure your AWS Cloud9 environment.
Retrieve the cluster broker list
To get your MSK cluster broker list, complete the following steps:
On the Amazon MSK console, navigate to your cluster.
In the Cluster summary section, choose View client information.
In the Bootstrap servers section, copy the private endpoint.
You need this value to perform some operations later, such as creating an MSK topic, producing sample messages, and consuming those sample messages.
Choose Done.
On the Properties tab, in the Brokers details section, note the endpoints listed.
These need to be updated in the targets.json file (used for Prometheus configuration in a later step).
3. Enable IAM Identity Center
Before you deploy the CloudFormation stack for Amazon Managed Service for Prometheus and Amazon Managed Grafana, make sure to enable IAM Identity Center.
If IAM Identity Center is already enabled in another Region, you don’t need to enable it in your current Region.
Complete the following steps to enable IAM Identity Center:
On the IAM Identity Center console, under Enable IAM Identity Center, choose Enable.
Choose Create AWS organization.
4. Configure Amazon Managed Grafana and Amazon Managed Service for Prometheus
Complete the steps in this section to set up Amazon Managed Service for Prometheus and Amazon Managed Grafana.
Deploy CloudFormation template
Complete the following steps to deploy the CloudFormation stack AMG_AMP_Cloudformation:
On the AWS CloudFormation console, choose Stacks in the navigation pane.
Choose Create stack.
For Prepare template, select Template is ready.
For Template source, select Upload a template.
Upload the AMG_AMP_Cloudformation.yml file, then choose Next.
For Stack name, enter ManagedPrometheusAndGrafanaStack, then choose Next.
On the Review page, select I acknowledge that AWS CloudFormation might create IAM resources.
Choose Create stack.
You’re redirected to the AWS CloudFormation console, where you can see the status CREATE_IN_PROGRESS. Wait until the status changes to CREATE_COMPLETE.
On the stack’s Outputs tab, note the values for the following (if you don’t see anything on the Outputs tab, choose the refresh icon):
GrafanaWorkspaceURL – This is the Amazon Managed Grafana URL
PrometheusEndpointWriteURL – This is the Amazon Managed Service for Prometheus write endpoint URL
Create a user for Amazon Managed Grafana
Complete the following steps to create a user for Amazon Managed Grafana:
On the IAM Identity Center console, choose Users in the navigation pane.
Choose Add user.
For Username, enter grafana-admin.
Enter and confirm your email address to receive a confirmation email.
Skip the optional steps, then choose Add user.
A success message appears at the top of the console.
In the confirmation email, choose Accept invitation and set your user password.
On the Amazon Managed Grafana console, choose Workspaces in the navigation pane.
Open the workspace Amazon-Managed-Grafana.
Make a note of the Grafana workspace URL.
You use this URL to log in to view your Grafana dashboards.
On the Authentication tab, choose Assign new user or group.
Select the user you created earlier and choose Assign users and groups.
On the Action menu, choose the role to assign the user: admin, editor, or viewer.
Note that your Grafana workspace needs at least one admin user.
Navigate to the Grafana URL you copied earlier in your browser.
Choose Sign in with AWS IAM Identity Center.
Log in with your IAM Identity Center credentials.
5. Configure Prometheus and start the service
When you cloned the GitHub repo, you downloaded two configuration files: prometheus.yml and targets.json. In this section, we configure these two files.
Use any IDE (Visual Studio Code or Notepad++) to open prometheus.yml.
In the remote_write section, update the remote write URL and Region.
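The remote_write block ends up looking something like the following sketch; the workspace ID and Region are placeholders that you take from the PrometheusEndpointWriteURL stack output, and sigv4 tells Prometheus to sign requests with your AWS credentials:

```yaml
remote_write:
  - url: https://aps-workspaces.us-east-1.amazonaws.com/workspaces/ws-EXAMPLE/api/v1/remote_write
    sigv4:
      region: us-east-1   # Region of your Amazon Managed Service for Prometheus workspace
    queue_config:
      max_samples_per_send: 1000
      max_shards: 200
      capacity: 2500
```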
Use any IDE to open targets.json.
Update the targets with the broker endpoints you obtained earlier.
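targets.json takes the usual Prometheus file-based service discovery shape. The broker DNS names below are placeholders; port 11001 serves the JMX exporter and 11002 the node exporter, matching the security group rule you added earlier:

```json
[
  {
    "labels": { "job": "msk" },
    "targets": [
      "b-1.mycluster.abc123.c2.kafka.us-east-1.amazonaws.com:11001",
      "b-1.mycluster.abc123.c2.kafka.us-east-1.amazonaws.com:11002",
      "b-2.mycluster.abc123.c2.kafka.us-east-1.amazonaws.com:11001",
      "b-2.mycluster.abc123.c2.kafka.us-east-1.amazonaws.com:11002"
    ]
  }
]
```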
In your AWS Cloud9 environment, choose File, then Upload Local Files.
Choose Select Files and upload targets.json and prometheus.yml from your local machine.
In the AWS Cloud9 environment, run the following command using the key file you created earlier:
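The command isn’t reproduced above; the following is a sketch of what this step typically looks like, with the private IP and paths as placeholder assumptions: copy the two configuration files to the Prometheus server, log in, and start Prometheus as a container with those files mounted:

```shell
PROM_IP=10.0.1.100   # substitute PrometheusInstancePrivateIP from the stack outputs

# Copy the configuration files to the Prometheus server
scp -i EC2KeyMSKDemo prometheus.yml targets.json ec2-user@${PROM_IP}:/home/ec2-user/

# Log in to the server
ssh -i EC2KeyMSKDemo ec2-user@${PROM_IP}

# On the Prometheus server: start Prometheus with the config files mounted
sudo docker run -d -p 9090:9090 \
  -v /home/ec2-user/prometheus.yml:/etc/prometheus/prometheus.yml \
  -v /home/ec2-user/targets.json:/etc/prometheus/targets.json \
  prom/prometheus
```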
Press CTRL+C to stop the producer/consumer service.
Kafka metrics dashboards on Amazon Managed Grafana
You can now view your Kafka metrics dashboards on Amazon Managed Grafana:
Cluster overall health – Configured using Amazon Managed Service for Prometheus as the data source:
Critical metrics
Amazon MSK cluster overview – Configured using Amazon Managed Service for Prometheus as the data source:
Critical metrics
Cluster throughput (broker-level metrics)
Cluster metrics (JVM)
Kafka cluster operation metrics – Configured using CloudWatch as the data source:
General overall stats
CPU and Memory metrics
Clean up
You will continue to incur costs until you delete the infrastructure that you created for this post. Delete the CloudFormation stack you used to create the respective resources.
If you used an existing cluster, make sure to remove the inbound rules you updated in the security group (otherwise the stack deletion will fail).
On the Amazon MSK console, navigate to your existing cluster.
On the Properties tab, in the Networking settings section, open the security group you applied.
Choose Edit inbound rules.
Choose Delete to remove the rules you added.
Choose Save rules.
Now you can delete your CloudFormation stacks.
On the AWS CloudFormation console, choose Stacks in the navigation pane.
Select ManagedPrometheusAndGrafanaStack and choose Delete.
If you used an existing Amazon MSK cluster, delete the stack Prometheus.
If you created a new Amazon MSK cluster, delete the stack MSKDemo.
Conclusion
This post showed how you can deploy a fully managed, highly available, scalable, and secure monitoring system for Amazon MSK using Amazon Managed Service for Prometheus and Amazon Managed Grafana, and use Grafana dashboards to gain deep insights into various operational metrics. Although this post only discussed using Amazon Managed Service for Prometheus and CloudWatch as the data sources in Amazon Managed Grafana, you can enable various other data sources, such as AWS IoT SiteWise, AWS X-Ray, Redshift, and Amazon Athena, and build a dashboard on top of those metrics. You can use these managed services for monitoring any number of Amazon MSK platforms. Metrics are available to query in Amazon Managed Grafana or Amazon Managed Service for Prometheus in near-real time.
You can use this post as prescriptive guidance and deploy an observability solution for a new or an existing Amazon MSK cluster, identify the metrics that are important for your applications and then create a dashboard using Amazon Managed Grafana and Prometheus.
About the Authors
Anand Mandilwar is an Enterprise Solutions Architect at AWS. He works with enterprise customers, helping them innovate and transform their business on AWS. He is passionate about automation around cloud operations, infrastructure provisioning, and cloud optimization. He also likes Python programming. In his spare time, he enjoys honing his photography skills, especially in portrait and landscape photography.
Ajit Puthiyavettle is a Solution Architect working with enterprise clients, architecting solutions to achieve business outcomes. He is passionate about solving customer challenges with innovative solutions. His experience is in leading DevOps and security teams for enterprise and SaaS (software as a service) companies. Recently, he has focused on helping customers with security, ML, and HCLS workloads.
A common hurdle to DevOps strategies is the manual testing, sign-off, and deployment steps required to deliver new or enhanced feature sets. If an application is updated frequently, these actions can be time-consuming and error prone. You can address these challenges by incorporating progressive delivery concepts along with the Amazon Elastic Kubernetes Service (Amazon EKS) container platform and Argo Rollouts.
Progressive delivery deployments
Progressive delivery is a deployment paradigm in which new features are gradually rolled out to an expanding set of users. Real-time measurements of key performance indicators (KPIs) enable deployment teams to measure customer experience. These measurements can detect any negative impact during deployment and perform an automated rollback before it impacts a larger group of users. Since predefined KPIs are being measured, the rollout can continue autonomously and alleviate the bottleneck of manual approval or rollbacks.
These progressive delivery concepts can be applied to common deployment strategies such as blue/green and canary deployments. A blue/green deployment is a strategy where separate environments are created that are identical to one another. One environment (blue) runs the current application version, while the other environment (green) runs the new version. This enables teams to test on the new environment and move application traffic to the green environment when validated. Canary deployments slowly release your new application to the production environment so that you can build confidence while it is being deployed. Gradually, the new version will replace the current version in its entirety.
Using Kubernetes, you already can perform a rolling update, which can incrementally replace your resource’s Pods with new ones. However, you have limited control of the speed of the rollout, and can’t automatically revert a deployment. KPIs are also difficult to measure in this scenario, resulting in more manual work validating the integrity of the deployment.
To exert more granular control over the deployments, a progressive delivery controller such as Argo Rollouts can be implemented. By using a progressive delivery controller in conjunction with AWS services, you can tune the speed of your deployments and measure your success with KPIs. During the deployment, Argo Rollouts will query metric providers such as Prometheus to perform analysis. (You can find the complete list of the supported metric providers at Argo Rollouts.) If there is an issue with the deployment, automatic rollback actions can be taken to minimize any type of disruption.
Using blue/green deployments for progressive delivery
Blue/green deployments provide zero downtime during deployments and the ability to test your application in production without impacting the stable version. In a typical blue/green deployment on Amazon EKS using Kubernetes native resources, a new deployment that includes the new feature version is spun up in parallel with the stable deployment (see Figure 1). The new deployment is then tested by a QA team.
Figure 1. Blue/green deployment in progress
Once all the tests have been successfully conducted, the traffic must be directed from the live version to the new version (Figure 2). At this point, all live traffic is funneled to the new version. If there are any issues, a rollback can be conducted by swapping the pointer back to the previous stable version.
Figure 2. Blue/green deployment post-promotion
Keeping this process in mind, there are several manual interactions and decisions involved during a blue/green deployment. Using Argo Rollouts, you can replace these manual steps with automation. It automatically creates a preview service for testing the green service. With Argo Rollouts, test metrics can be captured by using a monitoring service, such as Amazon Managed Service for Prometheus (Figure 3).
Prometheus is monitoring software that can be used to collect metrics from your application. With PromQL (Prometheus Query Language), you can write queries to obtain KPIs. These KPIs can then be used to define the success or failure of the deployment. An Argo Rollouts deployment includes stages to analyze the KPIs before and after promoting the new version. During the prePromotionAnalysis stage, you can validate the new version using the preview endpoint; this stage can run smoke tests or integration tests. Upon meeting the desired KPIs, live traffic is routed to the new version. In the postPromotionAnalysis stage, you can verify KPIs from the production environment. Failure of the KPIs during any analysis stage will automatically shut down the deployment and revert to the previous stable version.
Figure 3. Blue/green deployment using KPIs
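The blue/green flow with analysis stages described above can be sketched as a Rollout manifest. The service and analysis template names (demo-app-active, demo-app-preview, success-rate) and the image are hypothetical:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: demo-app
spec:
  replicas: 3
  selector:
    matchLabels:
      app: demo-app
  template:
    metadata:
      labels:
        app: demo-app
    spec:
      containers:
        - name: demo-app
          image: demo-app:v2
  strategy:
    blueGreen:
      activeService: demo-app-active     # receives live traffic
      previewService: demo-app-preview   # used to test the green version
      autoPromotionEnabled: false
      prePromotionAnalysis:              # KPIs checked before switching live traffic
        templates:
          - templateName: success-rate
      postPromotionAnalysis:             # KPIs checked after promotion; failure reverts
        templates:
          - templateName: success-rate
```

The referenced success-rate AnalysisTemplate would use the Prometheus metric provider to query your workspace and define the success condition.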
Using canary deployment for progressive delivery
Unlike in a blue/green deployment strategy, in a canary deployment a subset of the traffic is gradually shifted to the new version in your production environment. Since the new version is being deployed in a live environment, feedback can be obtained in real-time, and adjustments can be made accordingly (Figure 4).
Figure 4. An example of a canary deployment
Argo Rollouts supports integration with an Application Load Balancer to manipulate the flow of traffic to different versions of an application. Argo Rollouts can gradually and automatically increase the amount of traffic to the canary service at specific intervals. You can also fully automate the promotion process by using KPIs and metrics from Prometheus as discussed in the blue/green strategy. The analysis will run while the canary deployment is progressing. The success of the KPIs will gradually increase the traffic on the canary service. Any failure will stop the deployment and stop routing live traffic.
Conclusion
Implementing progressive delivery in your application can help you deploy new versions of applications with confidence. This approach mitigates the risk of rolling out new application versions by providing visibility into live error rate and performance. You can measure KPIs and automate rollout and rollback procedures. By leveraging Argo Rollouts, you can have more granular control over how your application is deployed in an EKS cluster.
For additional information on progressive delivery or Argo Rollouts:
At AWS re:Invent 2020, we introduced the preview of Amazon Managed Service for Prometheus, an open source Prometheus-compatible monitoring service that makes it easy to monitor containerized applications at scale. With Amazon Managed Service for Prometheus, you can use the Prometheus query language (PromQL) to monitor the performance of containerized workloads without having to manage the underlying infrastructure required to scale and secure the ingestion, storage, alert, and querying of operational metrics.
Amazon Managed Service for Prometheus automatically scales as your monitoring needs grow. It is a highly available service deployed across multiple Availability Zones (AZs) that integrates AWS security and compliance capabilities. The service offers native support for PromQL as well as the ability to ingest Prometheus metrics from over 150 Prometheus exporters maintained by the open source community.
Today, I am happy to announce the general availability of Amazon Managed Service for Prometheus with new features, alert manager and ruler, which support Amazon Simple Notification Service (Amazon SNS) as a receiver destination for notifications from the alert manager. Through Amazon SNS, you can route notifications to destinations such as email, webhooks, Slack, PagerDuty, OpsGenie, or VictorOps.
Getting Started with Alert Manager and Ruler
To get started in the AWS Management Console, you can simply create a workspace, a logical space dedicated to the storage, alerting, and querying of metrics from one or more Prometheus servers. You can set up the ingestion of Prometheus metrics into this workspace using Helm and query those metrics. To learn more, see Getting started in the Amazon Managed Service for Prometheus User Guide.
At general availability, we added new alert manager and rules management features. The service supports two types of rules: recording rules and alerting rules. These rules files use the same YAML format as standalone Prometheus and, once configured, are evaluated at regular intervals.
To configure your workspace with a set of rules, choose Add namespace in Rules management and select a YAML format rules file.
An example rules file records CPU usage metrics for container workloads and triggers an alert if CPU usage stays high for five minutes.
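A minimal sketch of such a rules file might look like the following; the recording rule name, expression, and alert threshold here are illustrative, not prescribed by the service:

```yaml
groups:
  - name: example
    rules:
      # Recording rule: precompute per-pod CPU usage over a 5-minute window.
      - record: pod:container_cpu_usage:sum
        expr: sum(rate(container_cpu_usage_seconds_total[5m])) by (pod)
      # Alerting rule: fire if average CPU usage stays above the
      # (illustrative) threshold for five minutes.
      - alert: HighCPUUsage
        expr: avg(rate(container_cpu_usage_seconds_total[5m])) by (pod) > 0.8
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: CPU usage has been high for five minutes
```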
Next, create a new Amazon SNS topic, or reuse an existing one, to which the alerts will be routed. Configured alerting rules fire alerts to the Alert Manager, which deduplicates, groups, and routes them to Amazon SNS via the SNS receiver; SNS then delivers them to downstream destinations. If you’d like to receive email notifications for your alerts, configure an email subscription for the SNS topic.
To give Amazon Managed Service for Prometheus permission to send messages to your SNS topic, select the topic you plan to send to and add the following access policy block:
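The policy statement below is a sketch of such an access policy block; the Region, account ID, and topic name in the Resource ARN are placeholders you would replace with your own:

```json
{
  "Sid": "Allow_Publish_Alarms",
  "Effect": "Allow",
  "Principal": {
    "Service": "aps.amazonaws.com"
  },
  "Action": [
    "sns:Publish",
    "sns:GetTopicAttributes"
  ],
  "Resource": "arn:aws:sns:us-east-1:123456789012:My-Topic"
}
```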
Once you have a topic for alerts, you can configure the SNS receiver in the Alert Manager configuration. The configuration file uses the same format as Prometheus, but you must nest it underneath an alertmanager_config: block for the service’s Alert Manager. For more information about the Alert Manager config, visit Alerting Configuration in the Prometheus documentation.
Replace topic_arn with the ARN of the topic that you created while setting up the SNS connection. To learn more about the SNS receiver in the Alert Manager config, visit Prometheus SNS receiver on the Prometheus GitHub page.
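A minimal Alert Manager definition with an SNS receiver might look like this; the topic ARN and Region are placeholders, and the route shown is the simplest possible (everything goes to one receiver):

```yaml
alertmanager_config: |
  route:
    receiver: default
  receivers:
    - name: default
      sns_configs:
        - topic_arn: arn:aws:sns:us-east-1:123456789012:My-Topic
          sigv4:
            region: us-east-1
```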
To configure the Alert Manager, open the Alert Manager and choose Add definition, then select a YAML format alert config file.
When an alert is created by Prometheus and sent to the Alert Manager, you can query the ListAlerts endpoint to see all the active alerts in the system. The response contains the list of actively firing alerts.
A successful notification will result in an email received from your SNS topic with the alert details. Also, you can output messages in JSON format to be easily processed downstream of SNS by AWS Lambda or other APIs and webhook receiving endpoints. For example, you can connect SNS with a Lambda function for message transformation or triggering automation. To learn more, visit Configuring Alertmanager to output JSON to SNS in the Amazon Managed Service for Prometheus User Guide.
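As a sketch of that downstream processing, the Lambda handler below parses a JSON-formatted notification delivered via SNS and turns it into a one-line summary (for example, to forward to a chat webhook). The field names (status, commonLabels, commonAnnotations) follow the Alert Manager payload shape; adjust them to match the JSON template you actually configure, and note that the posting step itself is left out:

```python
import json


def summarize_alert(sns_message: str) -> str:
    """Turn a JSON-formatted Alert Manager notification (as delivered
    via Amazon SNS) into a one-line, human-readable summary."""
    payload = json.loads(sns_message)
    status = payload.get("status", "unknown")
    name = payload.get("commonLabels", {}).get("alertname", "unnamed")
    summary = payload.get("commonAnnotations", {}).get("summary", "")
    return f"[{status.upper()}] {name}: {summary}"


def lambda_handler(event, context):
    # SNS invokes Lambda with one or more records; the raw message body
    # is under Sns.Message in each record.
    lines = [summarize_alert(r["Sns"]["Message"]) for r in event["Records"]]
    # In a real integration, you would POST these lines to a webhook here.
    return {"summaries": lines}
```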
Sending from Amazon SNS to Other Notification Destinations You can connect Amazon SNS to a variety of outbound destinations such as email, webhooks (HTTP), Slack, PagerDuty, and OpsGenie.
Webhook – To configure a preexisting SNS topic to output messages to a webhook endpoint, create an HTTP/HTTPS subscription to the topic and confirm it. Once the subscription is active, your HTTP endpoint will receive SNS notifications.
Slack – You can either use Slack’s email-to-channel integration, which accepts an email and forwards it to a Slack channel, or use a Lambda function to rewrite the SNS notification for Slack. To learn more, see forwarding emails to Slack channels and AWS Lambda function to convert SNS messages to Slack.
PagerDuty – To customize the payload sent to PagerDuty, adjust or update the template_files block in your alertmanager definition; this template is used to generate the message sent to SNS.
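As a sketch, a template_files block sits alongside alertmanager_config in the definition file, and the receiver can reference the named template; the template name, topic ARN, and Region below are placeholders:

```yaml
template_files:
  default_template: |
    {{ define "sns.default.subject" }}[{{ .Status | toUpper }}] {{ .CommonLabels.alertname }}{{ end }}
alertmanager_config: |
  route:
    receiver: default
  receivers:
    - name: default
      sns_configs:
        - topic_arn: arn:aws:sns:us-east-1:123456789012:My-Topic
          sigv4:
            region: us-east-1
          subject: '{{ template "sns.default.subject" . }}'
```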
Available Now Amazon Managed Service for Prometheus is available today in nine AWS Regions: US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Frankfurt), Europe (Ireland), Europe (Stockholm), Asia Pacific (Singapore), Asia Pacific (Sydney), and Asia Pacific (Tokyo).
You pay only for what you use, based on metrics ingested, queried, and stored. As part of the AWS Free Tier, you can get started with Amazon Managed Service for Prometheus for 40 million metric samples ingested and 10 GB metrics stored per month. To learn more, visit the pricing page.
To learn more about observability on AWS, visit the One Observability Workshop, which provides hands-on experience with the wide variety of tools AWS offers for setting up monitoring and observability for your applications.