
Microsoft SAM File Readability CVE-2021-36934: What You Need to Know

Post Syndicated from Caitlin Condon original https://blog.rapid7.com/2021/07/21/microsoft-sam-file-readability-cve-2021-36934-what-you-need-to-know/

On Monday, July 19, 2021, community security researchers began reporting that the Security Account Manager (SAM) file on Windows 10 and 11 systems was READ-enabled for all local users. The SAM file is used to store sensitive security information, such as hashed user and admin passwords. READ enablement means attackers with a foothold on the system can use this security-related information to escalate privileges or access other data in the target environment.

On Tuesday, July 20, Microsoft issued an out-of-band advisory for this vulnerability, which is now tracked as CVE-2021-36934. As of July 21, 2021, the vulnerability has been confirmed to affect Windows 10 version 1809 and later. A public proof-of-concept is available that allows non-admin users to retrieve all registry hives. Researcher Kevin Beaumont has also released a demo that confirms CVE-2021-36934 can be used to achieve remote code execution as SYSTEM on vulnerable targets (in addition to privilege escalation). The security community has christened this vulnerability “HiveNightmare” and “SeriousSAM.”

CERT/CC published in-depth vulnerability notes on CVE-2021-36934, which we highly recommend reading. Their analysis reveals that starting with Windows 10 build 1809, the BUILTIN\Users group is given RX permissions to files in the %windir%\system32\config directory. If a VSS shadow copy of the system drive is available, a non-privileged user may leverage access to these files to:

  • Extract and leverage account password hashes.
  • Discover the original Windows installation password.
  • Obtain DPAPI computer keys, which can be used to decrypt all computer private keys.
  • Obtain a computer machine account, which can be used in a silver ticket attack.

There is no patch for CVE-2021-36934 as of July 21, 2021. Microsoft has released workarounds for Windows 10 and 11 customers that mitigate the risk of immediate exploitation—we have reproduced these workarounds in the Mitigation Guidance section below. Please note that Windows customers must BOTH restrict access and delete shadow copies to prevent exploitation of CVE-2021-36934. We recommend applying the workarounds on an emergency basis.

Mitigation Guidance

1. Restrict access to the contents of %windir%\system32\config:

  • Open Command Prompt or Windows PowerShell as an administrator.
  • Run this command:
icacls %windir%\system32\config\*.* /inheritance:e

2. Delete Volume Shadow Copy Service (VSS) shadow copies:

  • Delete any System Restore points and Shadow volumes that existed prior to restricting access to %windir%\system32\config (for example, with vssadmin, as sketched after this list).
  • Create a new System Restore point if desired.
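As a minimal sketch (not taken verbatim from the Microsoft advisory), assuming an elevated Command Prompt and that removing existing restore points is acceptable in your environment, the built-in vssadmin utility can list and delete shadow copies; adjust the drive letter as needed:

vssadmin list shadows
vssadmin delete shadows /for=C: /all

Deleting shadow copies removes existing restore points, so confirm your backup posture before running the second command.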

Windows 10 and 11 users must apply both workarounds to mitigate the risk of exploitation. Microsoft has noted that deleting shadow copies may impact restore operations, including the ability to restore data with third-party backup applications.

This story is developing quickly. We will update this blog with new information as it becomes available.


Amazon EBS io2 Block Express Volumes with Amazon EC2 R5b Instances Are Now Generally Available

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/amazon-ebs-io2-block-express-volumes-with-amazon-ec2-r5b-instances-are-now-generally-available/

At AWS re:Invent 2020, we previewed Amazon EBS io2 Block Express volumes, the next-generation server storage architecture that delivers the first SAN built for the cloud. Block Express is designed to meet the requirements of the largest, most I/O-intensive, mission-critical deployments of Microsoft SQL Server, Oracle, SAP HANA, and SAS Analytics on AWS.

Today, I am happy to announce the general availability of Amazon EBS io2 Block Express volumes, with Amazon EC2 R5b instances powered by the AWS Nitro System to provide the best network-attached storage performance available on EC2. The io2 Block Express volumes now also support io2 features such as Multi-Attach and Elastic Volumes.

In the past, customers had to stripe multiple volumes together in order to go beyond single-volume performance. Today, io2 volumes can meet the needs of mission-critical, performance-intensive applications without striping and the management overhead that comes along with it. With io2 Block Express, customers can get the highest-performance block storage in the cloud, with four times higher throughput, IOPS, and capacity than io2 volumes, sub-millisecond latency, and no additional cost.

Here is a summary of the use cases and characteristics of the key Solid State Drive (SSD)-backed EBS volumes:

Volume type    | gp2 (General Purpose SSD) | gp3 (General Purpose SSD) | io2 (Provisioned IOPS SSD) | io2 Block Express (Provisioned IOPS SSD)
Durability     | 99.8%-99.9% | 99.8%-99.9% | 99.999% | 99.999%
Use cases      | General applications; good to start with when you do not fully understand the performance profile yet | General applications; good to start with when you do not fully understand the performance profile yet | I/O-intensive applications and databases | Business-critical applications and databases that demand the highest performance
Volume size    | 1 GiB – 16 TiB | 1 GiB – 16 TiB | 4 GiB – 16 TiB | 4 GiB – 64 TiB
Max IOPS       | 16,000 | 16,000 | 64,000 ** | 256,000
Max throughput | 250 MiB/s * | 1,000 MiB/s | 1,000 MiB/s ** | 4,000 MiB/s

* The throughput limit is between 128 MiB/s and 250 MiB/s, depending on the volume size.
** Maximum IOPS and throughput are guaranteed only on instances built on the Nitro System provisioned with more than 32,000 IOPS.

The new Block Express architecture delivers the highest levels of performance with sub-millisecond latency by communicating with an AWS Nitro System-based instance using the Scalable Reliable Datagrams (SRD) protocol, which is implemented in the Nitro Card dedicated for EBS I/O function on the host hardware of the instance. Block Express also offers modular software and hardware building blocks that can be assembled in many ways, giving you the flexibility to design and deliver improved performance and new features at a faster rate.

Getting Started with io2 Block Express Volumes
You can now create io2 Block Express volumes in the Amazon EC2 console, AWS Command Line Interface (AWS CLI), or using an SDK with the Amazon EC2 API when you create R5b instances.

After you choose the EC2 R5b instance type, on the Add Storage page, under Volume Type, choose Provisioned IOPS SSD (io2). Your new volumes will be created in the Block Express format.
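If you prefer the AWS CLI over the console, a minimal sketch along these lines launches an R5b instance with an additional io2 volume; the AMI ID, subnet ID, and device name are placeholders, and the size and IOPS values are only illustrative:

aws ec2 run-instances \
    --image-id ami-0123456789abcdef0 \
    --instance-type r5b.2xlarge \
    --subnet-id subnet-0123456789abcdef0 \
    --block-device-mappings '[{"DeviceName":"/dev/sdf","Ebs":{"VolumeType":"io2","VolumeSize":4000,"Iops":100000,"DeleteOnTermination":true}}]'

Because the volume is attached to an R5b instance, it is created in the Block Express format described above.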

Things to Know
Here are a couple of things to keep in mind:

  • You can’t modify the size or provisioned IOPS of an io2 Block Express volume.
  • You can’t launch an R5b instance with an encrypted io2 Block Express volume that has a size greater than 16 TiB or IOPS greater than 64,000 from an unencrypted AMI or a shared encrypted AMI. In this case, you must first create an encrypted AMI in your account and then use that AMI to launch the instance.
  • io2 Block Express volumes do not currently support fast snapshot restore. We recommend that you initialize these volumes to ensure that they deliver full performance. For more information, see Initialize Amazon EBS volumes in the Amazon EC2 User Guide.

Available Now
The io2 Block Express volumes are available in all AWS Regions where R5b instances are available: US East (Ohio), US East (N. Virginia), US West (Oregon), Asia Pacific (Singapore), Asia Pacific (Tokyo), and Europe (Frankfurt), with support for more AWS Regions coming soon. We plan to allow EC2 instances of all types to connect to io2 Block Express volumes, and will have updates on this later in the year.

In terms of pricing and billing, io2 volumes and io2 Block Express volumes are billed at the same rate. Usage reports do not distinguish between io2 Block Express volumes and io2 volumes. We recommend that you use tags to help you identify costs associated with io2 Block Express volumes. For more information, see the Amazon EBS pricing page.

To learn more, visit the EBS Provisioned IOPS Volume page and io2 Block Express Volumes in the Amazon EC2 User Guide.

Channy

Multi-Cloud and Hybrid Threat Protection with Sumo Logic Cloud SIEM Powered by AWS

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/hybrid-threat-protection-with-sumo-logic-cloud-siem-powered-by-aws/

IT security teams need to have a real-time understanding of what’s happening with their infrastructure and applications. They need to be able to find and correlate data in this continuous flood of information to identify unexpected behaviors or patterns that can lead to a security breach.

To simplify and automate this process, many solutions have been implemented over the years. For example:

  • Security information management (SIM) systems collect data such as log files into a central repository for analysis.
  • Security event management (SEM) platforms simplify data inspection and the interpretation of logs or events.

Many years ago these two approaches were merged to address both information analysis and interpretation of events. These security information and event management (SIEM) platforms provide real-time analysis of security alerts generated by applications, network hardware, and domain-specific security tools (such as firewalls and endpoint protection tools).

Today, I’d like to introduce a solution created by one of our partners: Sumo Logic Cloud SIEM powered by AWS. Sumo Logic Cloud SIEM provides deep security analytics and contextualized threat data across multi-cloud and hybrid environments to reduce the time to detect and respond to threats. You can use this solution to quickly detect and respond to the higher-priority issues, including malicious activities that negatively impact your business or brand. The solution comes with more than 300 out-of-the-box integrations, including key AWS services, and can help reduce time and effort required to conduct compliance audits for regulations such as PCI and HIPAA.

Sumo Logic Cloud SIEM is available in AWS Marketplace and you can use a free trial to evaluate the solution. Before having a look at how this works in practice, let’s see how it’s being used.

Customer Case Study – Medidata
I had the chance to meet a very interesting customer to talk about how they use Sumo Logic Cloud SIEM. Scott Sumner is the VP and CISO at Medidata, a company that is redefining what’s possible in clinical trials. Medidata is processing patient data for clinical trials of the Moderna and Johnson & Johnson COVID-19 vaccines, so you can see why security is a priority for them. With such critical workloads, the company must keep the trust of the people participating in those trials.

Scott told me, “There is an old saying: If you can’t measure it, you can’t manage it.” In fact, when he joined Medidata in 2015, one of the first things Scott did was to implement a SIEM. Medidata has been using Sumo Logic for more than five years now. They appreciate that it’s a cloud-native solution, and that has made it easier for them to follow the evolution of the tool over the years.

“Not having transparency in the environment you process data is not good for security professionals.” Scott wanted his team to be able to respond quickly and, to do so, they needed to be able to look at a single screen that displays all IP calls, network flows, and any relevant information. For example, Medidata has very aggressive checks for security scans and any kind of external access. To do so, they have to look not just at the perimeter, but at the entire environment. Sumo Logic Cloud SIEM allows them to react without breaking anything in their corporate environment, including the resources they have in more than 45 AWS accounts.

“One of the metrics that is floated around by security specialists is that you have up to five hours to respond to a tentative intrusion,” Scott says. “With Sumo Logic Cloud SIEM, we can match that time aggressively, and most of the time we respond within five minutes.” The ability to respond quickly allows Medidata to keep patient trust. This is very important for them, because delaying a clinical trial can affect people’s health.

Medidata security response is managed by a global team through three levels. Level 1 is covered by a partner who is empowered to block simple attacks. For the next level of escalation, Level 2, Medidata has a team in each region. Beyond that there is Level 3: a hardcore team of forensics examiners distributed across the US, Europe, and Asia who deal with more complicated attacks.

Availability is also important to Medidata. In addition to providing the Cloud SIEM functionality, Sumo Logic helps them monitor availability issues, such as web server failover, and very quickly figure out possible problems as they happen. Interestingly, they use Sumo Logic to better understand how applications talk with each other. These different use cases don’t complicate their setup because the security and application teams are segregated and use Sumo Logic as a single platform to share information seamlessly between the two teams when needed.

I asked Scott Sumner if Medidata was affected by the move to remote work in 2020 due to the COVID-19 pandemic. It’s an important topic for them because at that time Medidata was already involved in clinical trials to fight the pandemic. “We were a mobile environment anyway. A significant part of the company was mobile before. So, we were ready and not being impacted much by working remotely. All our tools are remote, and that helped a lot. Not sure we’d done it easily with an on-premises solution.”

Now, let’s see how this solution works in practice.

Setting Up Sumo Logic Cloud SIEM
In AWS Marketplace I search for “Sumo Logic Cloud SIEM” and look at the product page. I can either subscribe or start the one-month free trial. The free trial includes 1 GB of log ingest for security and observability. There is no automatic conversion to a paid offer when the free trial expires. After the free trial I have the option to either buy Sumo Logic Cloud SIEM from AWS Marketplace or remain a free user. I create and accept the contract and set up my Sumo Logic account.


In the setup, I choose the Sumo Logic deployment region to use. The Sumo Logic documentation provides a table that describes the AWS Regions used by each Sumo Logic deployment. I need this information later when I set up the integration between AWS security services and Sumo Logic Cloud SIEM. For now, I select US2, which corresponds to the US West (Oregon) Region in AWS.

When my Sumo Logic account is ready, I use the Sumo Logic Security Integrations on AWS Quick Start to deploy the required integrations in my AWS account. You’ll find the source files used by this Quick Start in this GitHub repository. I open the deployment guide and follow along.

An architecture diagram in the deployment guide shows the environment deployed by this Quick Start.

Following the steps in the deployment guide, I create an access key and access ID in my Sumo Logic account, and write down my organization ID. Then, I launch the Quick Start to deploy the integrations.

The Quick Start uses an AWS CloudFormation template to create a stack. First, I check that I am in the right AWS Region. I use US West (Oregon) because I am using US2 with Sumo Logic. Then, I leave all default values and choose Next. In the parameters, I select US2 as my Sumo Logic deployment region and enter my Sumo Logic access ID, access key, and organization ID.


After that, I enable and configure the integrations with AWS security services and tools such as AWS CloudTrail, Amazon GuardDuty, VPC Flow Logs, and AWS Config.


If you have a delegated administrator with GuardDuty, you can enable multiple member accounts as long as they are part of the same AWS organization. With that, all findings from member accounts are vended to the delegated administrator and then to Sumo Logic Cloud SIEM through the GuardDuty events processor.

In the next step, I leave the stack options to their default values. I review the configuration and acknowledge the additional capabilities required by this stack (such as the creation of IAM resources) and then choose Create stack.

When the stack creation is complete, the integration with Sumo Logic Cloud SIEM is ready. If I had a hybrid architecture, I could connect those resources for a single point of view and analysis of my security events.

Using Sumo Logic Cloud SIEM
To see how the integration with AWS security services works, and how security events are handled by the SIEM, I use the amazon-guardduty-tester open-source scripts to generate security findings.

First, I use the included CloudFormation template to launch an Amazon Elastic Compute Cloud (Amazon EC2) instance in an Amazon Virtual Private Cloud (VPC) private subnet. The stack also includes a bastion host to provide external access. When the stack has been created, I write down the IP addresses of the two instances from the stack output.


Then, I use SSH to connect to the EC2 instance in the private subnet through the bastion host. There are easy-to-follow instructions in the README file. I use the guardduty_tester.sh script, installed in the instance by CloudFormation, to generate security findings for my AWS account.

$ ./guardduty_tester.sh


GuardDuty processes these findings and the events are sent to Sumo Logic through the integration I just set up. In the Sumo Logic GuardDuty dashboard, I see the threats ready to be analyzed and addressed.
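If you only want to see findings flow through the pipeline, GuardDuty can also generate sample findings directly, without launching the tester instances. A rough AWS CLI sketch (the detector ID comes from the first command, and the finding type shown is just one example):

aws guardduty list-detectors
aws guardduty create-sample-findings \
    --detector-id <detector-id-from-previous-command> \
    --finding-types "Recon:EC2/PortProbeUnprotectedPort"

Sample findings are flagged as samples in GuardDuty, and they should propagate to Sumo Logic Cloud SIEM through the same integration.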


Availability and Pricing
Sumo Logic Cloud SIEM powered by AWS is a multi-tenant Software as a Service (SaaS) available in AWS Marketplace that ingests data over HTTPS/TLS 1.2 on the public internet. You can connect data from any AWS Region and from multi-cloud and hybrid architectures for a single point of view of your security events.

Start a free trial of Sumo Logic Cloud SIEM and see how it can help your security team.

Read the Sumo Logic team’s blog post for more information on the service.

Danilo

Amazon Simple Queue Service (SQS) – 15 Years and Still Queueing!

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-sqs-15-years-and-still-queueing/

Time sure does fly! I wrote about the production launch of Amazon Simple Queue Service (SQS) back in 2006, with no inkling that I would still be blogging fifteen years later, or that this service would still be growing rapidly while remaining fundamental to the architecture of so many different types of web-scale applications.

The first beta of SQS was announced with little fanfare in late 2004. Even though we have added many features since that beta, the original description (“a reliable, highly scalable hosted queue for buffering messages between distributed application components”) still applies. Our customers think of SQS as an infinite buffer in the cloud and use SQS queues to implement the connections between the functional parts of their application architecture.

Over the years, we have worked hard to keep the SQS interface simple and easy to use, even though there’s a lot of complexity inside. In order to make SQS scalable, durable, and reliable, messages are stored in a fleet that consists of thousands of servers in each AWS Region. Within a region, we save three copies of each message, taking care to distribute the messages across storage nodes and Availability Zones. In addition to this built-in redundant storage, SQS is self-healing and resilient to host failures & network interruptions. Even though SQS is now 15 years old, we continue to improve our scaling and load management applications so that you can always count on SQS to handle your mission-critical workloads.
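That simplicity is visible in the API itself. A minimal AWS CLI sketch of the send/receive/delete cycle looks roughly like this; the queue name is arbitrary, and the queue URL and receipt handle are taken from the output of the preceding commands:

aws sqs create-queue --queue-name my-demo-queue
aws sqs send-message --queue-url <queue-url> --message-body "Hello from SQS"
aws sqs receive-message --queue-url <queue-url> --wait-time-seconds 10
aws sqs delete-message --queue-url <queue-url> --receipt-handle <receipt-handle>

Explicitly deleting a message after processing it is what gives consumers at-least-once semantics: if the consumer crashes before the delete, the message becomes visible again after the visibility timeout.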

SQS runs at an incredible scale. For example:

Amazon Product Fulfillment – During Prime Day 2021, traffic to SQS set a new record, peaking at 47.7 million messages per second.

Rapid7 – Amazon customer Rapid7 makes extensive use of SQS. According to Jeff Myers (Platform Software Architect):

Amazon SQS provides us with a simple to use, highly performant, and highly scalable messaging service without any management headaches. It is a crucial component of our architecture that has allowed us to scale to handle 10s of billions of messages per day.

You can visit the Amazon SQS home page to read about other high-scale use cases from NASA, BMW Group, Capital One, and other AWS customers.

Serverless Office Hours
Be sure to join us today (June 13th) for Serverless Office Hours at Noon PT on Twitch.tv. Rumor has it that there will be cake!

Back in Time
Here’s a quick recap of some of the major SQS milestones:

2006 – Production launch. An unlimited number of queues per account and items per queue, with each item up to 8 KB in length. Pay as you go pricing based on messages sent and data transferred.

2007 – Additional functions including control of the visibility timeout and access to the approximate number of queued messages.

2009 – SQS in Europe, and additional control over queue access via AddPermission and RemovePermission. This launch also marked the debut of what we then called the Access Policy Language, which has evolved into today’s IAM Policies.

2010 – A new free tier (100K requests/month then, since expanded to 1 million requests/month), configurable maximum message length (up to 64 KB), and configurable message retention time.

2011 – Additional CloudWatch Metrics for each SQS queue. Later that year we added batch operations (SendMessageBatch and DeleteMessageBatch), delay queues, and message timers.

2012 – Support for long polling, along with SDK support for request batching and client-side buffering.

2013 – Support for even larger payloads (256 KB) and a 50% price reduction.

2014 – Support for dead letter queues, to accept messages that have become stuck due to a timeout or to a processing error within your code.

2015 – Support for extended payloads (up to 2 GB) using the Extended Client Library.

2016 – Support for FIFO queues with exactly-once processing and deduplication and another price reduction.

2017 – Support for server-side encryption of messages and cost allocation tags.

2018 – Support for Amazon VPC Endpoints using AWS PrivateLink and the ability to invoke Lambda functions.

2019 – Support for Tag-on-Create and X-Ray tracing.

2020 – Support for 1-minute metrics for more fine-grained queue monitoring, a new console experience, and result pagination for the ListQueues and ListDeadLetterSourceQueues functions.

2021 – Tiered pricing so that you save money as your usage grows, and a High Throughput mode for FIFO queues.

Today, SQS remains simple, scalable, cost-effective, and highly reliable.

From the AWS Heroes
We asked some of the AWS Heroes to reflect on the success of SQS and to share some of their success stories. Here’s what we learned:

Eric Hammond (Serverless Hero) uses AWS Lambda Dead Letter Queues. They put alarms on the queues and send internal emails as alerts when problems need to be investigated.

Tom McLaughlin (Serverless Hero) has been using SQS since 2015. He said “My favorite use case is anytime someone wants a queue and I don’t want to manage a queuing platform. Which is always.”

Ken Collins (Serverless Hero) was not entirely sure how long he had been using SQS, and offered a range of 2 to 8 years! He uses it to power the Lambdakiq gem, which implements ActiveJob on SQS & Lambda.

Kyuhyun Byun (Serverless Hero) has been using SQS to power a push system that must sustain massive amounts of traffic, and tells us “Thanks to SQS, I don’t consider building a queuing system anymore.”

Prashanth HN (Serverless Hero) has been using SQS since 2017, and considers himself “late to the party.” He used SQS as part of his first serverless application, and finds that it is ideal for connecting services with differing throughput.

Ben Kehoe (Serverless Hero) told us that “I first saw the power of SQS when a colleague showed me how to retain state in a fleet of EC2 Spot instances by storing the state in SQS when an instance was getting shut down, and having new instances check the queue for state on startup.”

Jeremy Daly (Serverless Hero) started using SQS in 2010 as a lightweight queue that fed a facial recognition process on user-supplied images. Today, he often uses it as a buffer to throttle requests to downstream services that are not yet serverless.

Casey Lee (Container Hero) also started to use SQS in 2010 as a replacement for ActiveMQ between loosely coupled Java services. Casey implements auto scaling based on queue depth, and has found it to be an accurate way to handle the load.

Vlad Ionescu (Container Hero) began his AWS journey with SQS back in 2014. Vlad found that the API was very easy to understand, and used SQS to power his thesis project.

Sheen Brisals (Serverless Hero) started to use SQS in 2018 while building a proof-of-concept that also used Lambda and S3. Sheen appreciates the ability to adjust the characteristics of each queue to create a good match for the message processing functions, and also makes use of both high and low priority queues.

Gojko Adzic (Serverless Hero) began to use SQS in 2013 as a task dispatch for exporters in MindMup. This online mind-mapping application allows large groups of users to collaborate, and requires strict ordering of updates to each document. Gojko used FIFO queues to allow them to process messages for different documents in parallel, with sequential order within each document.

Sebastian Müller (Serverless Hero) started to use SQS in 2016 by building a notification center for the website builder jimdo.com. The center ensures that customers are kept aware of events (orders, support messages, and contact requests) on a timely basis.

Luca Bianchi (Serverless Hero) first used SQS in 2012. He decoupled a pair of microservices running on AWS Elastic Beanstalk, and also created a fan-out processing system for a gamification platform. Today, his favorite SQS use case stores inference jobs and makes them available to a worker process running on Amazon SageMaker.

Peter Hanssens (Serverless Hero) uses SQS to offload tasks that do not need to be processed immediately. Several years ago, while assisting some data scientists, he created an event-driven batch-processing system that used a Lambda function to check a queue every 5 minutes and fire up EC2 instances to build models, while keeping strict control over the maximum number of running instances.

Serkan Ozal (Serverless Hero) has been using SQS since 2013 or so. He focuses on asynchronous message processing and counts on the ability of SQS to handle a huge number of events. He also makes uses of the message visibility feature so that he can re-process messages if necessary.

Matthieu Napoli (Serverless Hero) has been using SQS for about five years, starting out with EC2, and as a replacement for other queues. As he says, “Paired with Lambda, it gives massive parallelism out of the box without having to think too much about it. Plus the built-in failure handling makes it very robust.”

As you can see, there are a multitude of use cases for SQS.

SQS Resources
If you are not already using SQS, then it is time to get in the queue and get started. Here are some resources to help you find your way:

Happy queueing!

Jeff;

Scheduled report generation in Zabbix 5.4

Post Syndicated from Arturs Lontons original https://blog.zabbix.com/scheduled-report-generation-in-zabbix-5-4/14776/

The release of version 5.4 grants Zabbix users the ability to receive scheduled PDF reports in their mailbox, which is a very sought-after feature. This post and the video will cover all-new report-related configuration parameters and walk you through setting up scheduled report generation.

Contents

I. Reporting in Zabbix 5.4 (0:45)
II. Scheduled reports (2:26)

III. Questions & Answers (13:28)

Reporting in Zabbix 5.4

Zabbix 5.4 is our first big step in bringing out-of-the-box reporting for our end users. With this feature, we now have a foundation to build upon in the future and make reporting more robust and versatile over time. Since reports are 100% based on dashboard widgets, it’s only a matter of time until more report-focused widgets get released, thus enabling not only better dashboards, but also improving the reporting functionality.

  • We have implemented a new web service component responsible for generating reports — of course, you can install this server in a quick and easy fashion by using the provided packages.
  • Reporting works out of the box without the need to deploy or develop any custom scripts.
  • The initial configuration is easy to understand and implement.
  • Reporting will use the existing Email media types to send out these reports.
  • The reports do respect your permissions, as well as roles introduced in Zabbix 5.2.
  • You will be able to test the report before it goes out on its schedule, just by clicking the Test button.

Scheduled reports

We have added a new Scheduled report section, where the list of reports is available, displaying the report Name, the Owner, Repeats (daily, weekly, etc.), the Period for which the report is generated, and the Last sent date.

Scheduled reports

NOTE. When you configure new reports, and they have not been sent out yet, the Last sent date will be set to ‘Never.’

Creating a report

When you create a report, you will also have to fill in a couple of fields:

  • Owner,
  • Name of the report,
  • Dashboard the report will be based on,
  • Period — whether the report covers the Previous day, Previous week, Previous month, or Previous year,
  • Cycle — how often the report is sent: Daily/Weekly/Monthly/Yearly,
  • Start time (Zabbix server time is used here),
  • Start date and end date.

Creating reports

Receiving a report

When you receive a PDF report to your mailbox, you can also use the {TIME} macro to display server time both in the subject and the body of the message.

Receiving a report

In the PDF report, you can display any information from the included dashboard – Graphs, Problems, Latest data, and much more. Thanks to all of the available widgets, we will be able to customize our reports in a very granular fashion.

Receiving a report example

The report does respect user permissions. So, in the example above, the report shows only the data to which the user (either the recipient or the report creator) has access.

Permissions

After upgrading to Zabbix 5.4, you will see two new options in the User roles section:

  • Scheduled reports UI element. Under the UI elements, you can grant or deny access to the Scheduled reports section. This is accessible only to Super admin and Admin user types.

Permissions

If the Scheduled reports UI element is unchecked for the role, the user won’t be able to access the Scheduled report section and will see an error message. The same behavior is true if you use a URL to access the Scheduled reports.

Access to scheduled reports denied message for the users of a user role

You can also manage scheduled report permissions in the Access to actions section by checking the Manage scheduled reports box. This action permission grants or denies the ability to create or edit scheduled reports and is also accessible to Admin and Super admin user types.

Manage scheduled reports

If this check box is unchecked, the users won’t be able to create new or edit existing reports, though they will be able to access the UI section and see the list of reports and how they are configured.

Access to Manage scheduled reports restricted

Recipients of scheduled reports

When you are defining a new report, you can select the recipient. Report subscription can contain a user or a user group.

  • When selecting a user, you can specify to include or exclude the user from the subscription.
  • User group to host group permissions still apply.
  • You can specify which user is going to be generating the report – the recipient or the creator of the report.

Report recipients

For example, if we need to send some extra information to our NOC team that might not be directly available to them, you can select Current user, and the report will be generated with the permissions of the report creator. Since it is the admin that is creating the report, you can add some extra information that wouldn’t be visible to your NOC team or other regular users. They still won’t be able to access it in Zabbix, but they’ll receive it in their mailbox if you configure the report for them.

Report prerequisites

Diving a bit deeper into the technical side of things, we need to set up two additional packages to enable the reports:

  • zabbix-web-service — the additional reporting service, by default listening on port 10053. The service needs to be reachable from the Zabbix server and can be deployed on the same machine as our frontend or our server. We also have the option to deploy it on a completely separate machine. The zabbix-web-service package should be available if you have added the Zabbix repository.
#yum install zabbix-web-service
  • Google Chrome is required. However, on some distributions, Chromium is reported to also work, though this is not 100% tested. Note that Google Chrome packages are not included in Zabbix. The Google Chrome packages can simply be downloaded from the Google Chrome website and then installed on the zabbix-web-service host.
#wget https://dl.google.com/linux/direct/google-chrome-stable_current_x86_64.rpm

#yum install google-chrome-stable_current_x86_64.rpm

Configuring reports — Web service

We have a whole new configuration file for the web service. The web service supports many different configuration options, summarized below; a minimal example configuration follows the list:

  • Logging — similar to that for the server and the proxy. You can set up debug levels, select the log types, rotation, and so on.
### Option: LogType - system (syslog), file, console (standard output)
### Option: LogFile - log file location
### Option: LogFileSize - size in MB before rotation
### Option: DebugLevel - 0-5
  • List of allowed server addresses that can access this web service.
### Option: AllowedIP - list of comma-delimited IP addresses, optionally in CIDR notation, or DNS names of Zabbix servers
  • Timeout settings.
### Option: Timeout - spend no more than Timeout seconds on processing (default: 3)
  • Listen port.
### Option: ListenPort - the service will listen on this port for connections from the server (default: ListenPort=10053)
  • Encryption settings using certificates. This way the communication with the web service can be secured.
### Option: TLSAccept - unencrypted or cert
### Option: TLSCAFile - pathname of a file containing top-level CA(s) certificates
### Option: TLSCertFile - pathname of a file containing the service certificate
### Option: TLSKeyFile - pathname of a file containing the service private key
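Putting a few of these options together, a minimal configuration sketch could look like the following; the file path shown is what the packages typically use, and the AllowedIP value is an assumption, so substitute your own Zabbix server address:

# /etc/zabbix/zabbix_web_service.conf
LogType=file
LogFile=/var/log/zabbix/zabbix_web_service.log
AllowedIP=192.168.1.155
ListenPort=10053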

Configuring reports — Server

In addition, the server settings now contain report-related parameters:

  • The number of report writer instances.
### Option: StartReportWriters - number of pre-forked report writer instances (default: 0)

NOTE. You need at least one report writer (StartReportWriters set to 1 or more) for reports to be generated.

NOTE. The number of the necessary report writers will depend on the number of reports and how often you generate them.

  • Zabbix Web Service URL (to be passed on to the server)
### Option: WebServiceURL - URL to the Zabbix web service, used to perform web-related tasks (no default value)
# Example: http://192.168.1.156:10053/report

You need to make sure that the Zabbix server can reach the Zabbix web service URL and that incoming traffic on this port is permitted to the web service.
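For example, if the web service runs on a dedicated machine with firewalld, a sketch like this opens the default port (adjust it if you changed ListenPort):

firewall-cmd --permanent --add-port=10053/tcp
firewall-cmd --reload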

Configuring reports — Frontend

As the last step, you need to enable communication between the frontend and the web service.

In Administration > General > Other, we have a new configuration parameter where you need to specify your frontend URL that will be reachable by the web service.

Frontend URL

Once this is done, we can create a report.

Reports — testing

After you have created the report, you can test it. You can click the Test button and send out your test report to see if it works. The users to whom we’re sending the report need to have an Email media assigned to them in the User settings.

NOTE. Currently, {TIME} macros are resolved only with the scheduled generation and are not available in test reports, though this might change in the future.

Testing reports

Common issues

Some parameters can certainly be misconfigured, so let’s look at the most common issues:

  • Make sure that you have a properly configured Email media assigned to the user that should be receiving the report. Otherwise, they will fail to receive it.

— Make sure that the Email media type settings are properly configured.
— Once you define the media type, if you’re creating it from scratch, make sure that you test the media type and generate a test report.

Media configuration failed

NOTE. Sending out the report failed in this example since no media is configured for the report recipients.

  • Make sure that the correct Web service address is configured on the Zabbix server in the WebServiceURL parameter.

— Confirm that the Zabbix server can connect to the Zabbix web service on the specified IP address and port.
— Check your firewall settings if the web service is running on a dedicated machine.
— Make sure that third-party security software, such as SELinux or firewalls, doesn’t block the communication.

Wrong WebServiceURL parameter

Otherwise, you will receive an error message on the frontend. The error messages should be detailed enough to point you in the right direction.

  • Make sure that the Web service URL is configured without any typos. Otherwise, you will reach the web service, but the report page will output an error — ‘404 page not found’.
WebServiceURL=http://192.168.1.156:10053/reportwrong

Typos in configuration error message

NOTE. If you see this error message, check for typos in the Zabbix server configuration file for WebServiceURL.

  • Don’t forget to assign the Frontend URL in Administration > General > Other.

— If a URL is misconfigured, you might start receiving empty reports.
— If the URL syntax is wrong, you will receive an error message about the malformed URL.

Malformed URL error message

Frontend URL configuration parameter

  • Google Chrome is not pre-packaged with Zabbix.

— You need to have the Google Chrome package installed separately. You can download Google Chrome from the official Google Chrome website, for instance, by using wget.

— Make sure that Google Chrome is available via the $PATH environment variable. If it isn’t, you will receive an error message, and you will need to modify the path variable and make sure the executable is available there.

$PATH environmental variable error

Questions & Answers

Question. What are the possibilities to customize the page size like A4, A3?

Answer. It will be based on how you customize your Dashboard. Currently, you cannot customize the page and select portrait or landscape, for instance.

Scalability improvements

Post Syndicated from Sergey Simonenko original https://blog.zabbix.com/scalability-improvements/14832/

These improvements might go unnoticed by many Zabbix users since they relate to scalability rather than to new features or the user interface experience. However, they can be very beneficial for those Zabbix users who run really large instances.

Contents

I. More efficient database use (1:15)

1. New worker processes (3:03)

2. In-memory trend cache (4:49)
3. More server resiliency (7:35)

II. Questions & Answers (10:54)

In case of large instances, the main performance bottleneck would be the database. Zabbix doesn’t establish ad-hoc connections and uses only persistent connections to the database. In Zabbix 5.4, the use of database connections has been further drastically optimized.

More efficient database use

  • In earlier versions, not only database syncers but also pollers and some other processes had a dedicated persistent connection to the database. These connections were necessary for calculated items and aggregate checks, which are not regular collected items, since they are based on queries to the database, particularly to the history tables.

Connections were also required to update host availability status. Pollers (unreachable pollers, JMX pollers, as well as the IPMI manager) were updating it directly in the database.

  • In addition, in some cases when proxies were used (which would be true for large instances), host availability was updated by the proxy poller (for passive proxies) and by the trapper.

Why was it decided to avoid these connections in Zabbix 5.4?

  • First, they don’t really work smoothly with the default database configuration (PostgreSQL, Oracle). For instance, in PostgreSQL, max_connections is by default set to 100.
  • They can cause locking on the database side.
  • They also result in inefficient memory and CPU utilization.
  • Finally, in earlier versions, it was impossible to perfectly fine-tune the number of connections to the database.

New worker processes

In Zabbix 5.4, two new processes were introduced: history pollers and the availability manager. If you have already upgraded your Zabbix instance, when you log onto your server and run ps aux | grep zabbix_server you will notice these new processes:

/usr/sbin/zabbix_server: history poller #1 [got 0 values in 0.000008 sec, idle 1 sec] 
/usr/sbin/zabbix_server: history poller #2 [got 2 values in 0.000186 sec, idle 1 sec] 
/usr/sbin/zabbix_server: history poller #3 [got 0 values in 0.000050 sec, idle 1 sec] 
/usr/sbin/zabbix_server: history poller #4 [got 0 values in 0.000010 sec, idle 1 sec] 
/usr/sbin/zabbix_server: history poller #5 [got 0 values in 0.000012 sec, idle 1 sec] 
/usr/sbin/zabbix_server: availability manager #1 [queued 0, processed 0 values, idle 5.016162 sec during 5.016415 sec]

History pollers

Since calculated items and aggregate checks represent a different type of item, they now have their own poller – the history poller. History pollers also handle several internal items (zabbix[*] item keys).

New configuration parameters

History poller comes with a new configuration parameter. Here, it is important to keep in mind that more is not always better. So, the StartHistoryPollers value (how many history pollers are being pre-forked) should be increased only if history pollers are too busy according to internal self-monitoring, but should be kept as low as possible to avoid unnecessary connections to the database.

### Option: StartHistoryPollers
#     Number of pre-forked instances of history pollers.
#     Only required for calculated, aggregated and internal checks.
#     A database connection is required for each history poller instance.
#
# Mandatory: no
# Range: 0-1000
# Default:
# StartHistoryPollers=5
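To see whether the history pollers are actually the bottleneck, the usual internal self-monitoring item can be used; a sketch of such an item key (standard internal item syntax, added to the Zabbix server host):

zabbix[process,history poller,avg,busy]

If this stays high, increase StartHistoryPollers; if it stays near zero, keep the value low to save database connections.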

Availability manager

In earlier versions, pollers, unreachable pollers, JMX pollers, and the IPMI manager updated host availability directly in the database with a separate transaction for each host. Now, we have this separate availability manager, and all processes — pollers, trappers, etc. — communicate with the availability manager, and the statistics queue is flushed by the availability manager to the database every 5 seconds.

In-memory trend cache

In Zabbix 5.2, new trigger functions such as trendavg and trendmax were introduced, which operate on trend data over long periods. Similarly to calculated items, these triggers used database queries to obtain the necessary data.

In Zabbix 5.4, finally, the trend cache has been implemented. It stores the results of calculated trends functions. If the value is not available in the cache yet, Zabbix will query the database and update the cache.

As with all newly introduced processes, this cache’s effectiveness can be monitored using internal check zabbix[tcache,cache,], which can be used to set the relevant TrendFunctionCacheSize parameter value.

### Option: TrendFunctionCacheSize
#           Size of trend function cache, in bytes.
#           Shared memory size for caching calculated trend function data.
#
# Mandatory: no
# Range: 128K-2G
# Default:
# TrendFunctionCacheSize=4M
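As an illustration of what this cache serves, a trigger expression along these lines (the host and item key are placeholders, and the threshold is arbitrary) operates on a whole month of trend data and previously required a database query on every evaluation:

trendavg(/host/key,1M:now/M) > 100

With the trend cache, the result of such a function is computed from the database once and then served from the cache.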

To sum it up, with all these database-related optimizations:

  • Now it is possible to have as many database connections as you really need. So, if you, for instance, operate a very large instance and you need a hundred or more pollers, and at the same time, you don’t rely much on some calculated items or aggregate checks functionality, before Zabbix 5.4 you would end up with hundreds or more database connections that you didn’t need.

Moreover, with PostgreSQL with default configuration, if you increased the number of pollers, your database server could go down and bring down your Zabbix instance. For each PostgreSQL worker process, you would have had a limited work_mem as you had too many database connections. So, your overall database performance would have been sacrificed. That is not the case anymore.

  • In addition, if you are using trend functions with triggers using large periods of time, in the past you might have noticed, for instance, slow queries. Now, these changes will help you to drastically decrease the database load.

More server resiliency

  • Another important feature — graceful start. Active proxies can keep a backlog, which is useful if the communication between the server and the proxy breaks for any reason, for instance:

— server maintenance during upgrade to the next minor release;
— loss of Internet access at a remote site due to fiber cut, etc.

When communication restores, the proxies can easily overload the server after long downtimes, especially in large instances.

  • Since Zabbix 5.4, the server lets the proxies know if it’s busy, so the proxies throttle data sending.

Earlier, the data uploaded by the proxies was throttled when the history cache usage was 80% or greater. However, as the server was responsible for that task, all proxies were getting disabled in some situations. That meant the history data upload, as well as other tasks, such as processing of regular data and processing tasks, were getting suspended until the history cache utilization dropped lower than 80%.

This method was ineffective and unacceptable in large environments. Now, the proxies are responsible for checking whether the server can handle the data. When the history cache usage hits 80%, the following scenario is used:

  • the proxies send the data to the server and the data is accepted;
  • if the server thinks it’s busy it will respond with a special JSON tag upload set to ‘disable’;
  • the proxies will stop uploading history data, but will keep polling the server for tasks and uploading other data;
  • in a while, the proxies will upload data again;
  • if the server is not too busy, it will respond with the JSON tag upload set to ‘enable’.

Unlike the previous two scalability improvements which are based on serious architectural changes, this change was backported to earlier Zabbix versions — 5.0 and 5.2.

Questions & Answers

Question. Would you recommend using proxies even on the local site to allow for the server to be upgraded without losing data or for performance improvements?

Answer. Yes, in some cases there’re such setups. This idea mainly is to have a unified configuration, not only to improve performance. And in some cases, if you use a lot of proxies, you might want to monitor all the items only with the proxies. Such scenarios are used by many Zabbix customers.

Question. So, throttling can give you some noticeable performance benefits. Which version is required on the server and on the proxy for throttling?

Answer. All these changes have been backported to earlier versions, so you can use either Zabbix 5.4.0 released recently or the latest releases of Zabbix 5.0 or Zabbix 5.2.

Question. Is it possible to have two databases in a cluster and point the select queries to one database and, for instance, execution queries to another database? How would database clustering generally work? Is it of benefit to Zabbix? Can Zabbix utilize it?

Answer. In general, our HA setups use some basic features, which are built-in into database servers. They use replication. So, you have to use the servers that will provide some virtual IP for your cluster. That is completely transparent to Zabbix.

However, it is not recommended to split different queries across different nodes. They should still hit a single specific node. So, it is more of an HA approach rather than a horizontal scalability approach.

Question. Would you elaborate on what a large, or medium, or small instance means? What new values per second should we be looking at?

Answer. We can judge from large instances of our customers, and might not know about even larger instances managed by the customers themselves. Large instances can have, for instance, 100,000 NVPS and more. Sometimes, we upgrade really large instances with databases of dozens of terabytes. Some users like keeping really long records.

In my experience, large instances of 20,000 to 40,000 NVPS are quite common, and they could benefit a lot from these changes.

Prime Day 2021 – Two Chart-Topping Days

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/prime-day-2021-two-chart-topping-days/

In what has now become an annual tradition (check out my 2016, 2017, 2019, and 2020 posts for a look back), I am happy to share some of the metrics from this year’s Prime Day and to tell you how AWS helped to make it happen.

This year I bought all sorts of useful goodies including a Toshiba 43 Inch Smart TV that I plan to use as a MagicMirror, some watering cans, and a Dremel Rotary Tool Kit for my workshop.

Powered by AWS
As in years past, AWS played a critical role in making Prime Day a success. A multitude of two-pizza teams worked together to make sure that every part of our infrastructure was scaled, tested, and ready to serve our customers. Here are a few examples:

Amazon EC2 – Our internal measure of compute power is an NI, or a normalized instance. We use this unit to allow us to make meaningful comparisons across different types and sizes of EC2 instances. For Prime Day 2021, we increased our number of NIs by 12.5%. Interestingly enough, due to increased efficiency (more on that in a moment), we actually used about 6,000 fewer physical servers than we did in Cyber Monday 2020.

Graviton2 Instances – Graviton2-powered EC2 instances supported 12 core retail services. This was our first peak event that was supported at scale by the AWS Graviton2 instances, and is a strong indicator that the Arm architecture is well-suited to the data center.

An internal service called Datapath is a key part of the Amazon site. It is highly optimized for our peculiar needs, and supports lookups, queries, and joins across structured blobs of data. After an in-depth evaluation and consideration of all of the alternatives, the team decided to port Datapath to Graviton and to run it on a three-Region cluster composed of over 53,200 C6g instances.

At this scale, the price-performance advantage of the Graviton2 of up to 40% versus the comparable fifth-generation x86-based instances, along with the 20% lower cost, turns into a big win for us and for our customers. As a bonus, the power efficiency of the Graviton2 helps us to achieve our goals for addressing climate change. If you are thinking about moving your workloads to Graviton2, be sure to study our very detailed Getting Started with AWS Graviton2 document, and also consider entering the Graviton Challenge! You can also use Graviton2 database instances on Amazon RDS and Amazon Aurora; read about Key Considerations in Moving to Graviton2 for Amazon RDS and Amazon Aurora Databases to learn more.

Amazon CloudFront – Fast, efficient content delivery is essential when millions of customers are shopping and making purchases. Amazon CloudFront handled a peak load of over 290 million HTTP requests per minute, for a total of over 600 billion HTTP requests during Prime Day.

Amazon Simple Queue Service – The fulfillment process for every order depends on Amazon Simple Queue Service (SQS). This year, traffic set a new record, processing 47.7 million messages per second at the peak.

Amazon Elastic Block Store – In preparation for Prime Day, the team added 159 petabytes of EBS storage. The resulting fleet handled 11.1 trillion requests per day and transferred 614 petabytes per day.

Amazon Aurora – Amazon Fulfillment Technologies (AFT) powers physical fulfillment for purchases made on Amazon. On Prime Day, 3,715 instances of AFT’s PostgreSQL-compatible edition of Amazon Aurora processed 233 billion transactions, stored 1,595 terabytes of data, and transferred 615 terabytes of data.

Amazon DynamoDB – DynamoDB powers multiple high-traffic Amazon properties and systems including Alexa, the Amazon.com sites, and all Amazon fulfillment centers. Over the course of the 66-hour Prime Day, these sources made trillions of API calls while maintaining high availability with single-digit millisecond performance, and peaking at 89.2 million requests per second.

Prepare to Scale
As I have detailed in my previous posts, rigorous preparation is key to the success of Prime Day and our other large-scale events. If you are preparing for a similar event of your own, I can heartily recommend AWS Infrastructure Event Management. As part of an IEM engagement, AWS experts will provide you with architectural and operational guidance that will help you to execute your event with confidence.

Jeff;

Triggers, calculated and aggregated items in Zabbix 5.4

Post Syndicated from Aigars Kadiķis original https://blog.zabbix.com/triggers-calculated-and-aggregated-items-in-zabbix-5-4/14880/

In earlier Zabbix versions, we had three categories of things we could manage inside the instance — triggers, calculated items, and aggregated items. Each of them had its own syntax, so we could not always be sure which syntax to use in a given case. That’s why we introduced a unified syntax for all of these categories, which simplifies both the documentation and the configuration process. To appreciate the innovation, we need to recap these three categories in Zabbix.

Contents

I. What’s different in Zabbix 5.4? (1:52)
II. Examples (9:20)
III. Aggregated checks (14:09)
IV. Questions & Answers (19:08)

A trigger is logic that evaluates a formula; if the formula is true, it generates an event, a so-called problem. At the same time, calculated items do calculations as well, but in the end they store a value inside the database. This value will be either an integer or a floating-point number. So, both of these do calculations, but before Zabbix 5.4 the syntax was different.

What’s different in Zabbix 5.4?

Each item has a tag:tagvalue

It all starts with the item, as that is how we collect the metrics. Previously, items were always grouped by application: whenever we designed an item, it could belong to a memory, network, or CPU application, for example. Now, we have tags for this instead.

TAG:TAGVALUE

As you can see in this example, the tag value is ‘Application’. This really lets us have a flawless upgrade operation. The ‘Application’ field is completely gone, and now we have only tags on this item level.

Period and time shift is one argument now

Previously, whenever we were dealing with trigger functions, most of the time, the first argument was the interval to be used as an input. We had two arguments: we could analyze metrics for the specified period with <period> and, to go back in time and use <time shift> after comma — <period>,<time shift>.

We have decided to use one argument — <period:time shift>.

If you want to analyze data for yesterday, or last week, or last year, you can now use one argument:

  • <1h> — during the last one hour,
  • <1h:now-1d> — during the same hour one day ago,
  • <1d:now/d> — yesterday during the day.

str(), regexp(), iregexp() => find()

Another big change — the functions searching for the data containing a string will now be converted into one function — find().

So, the string function — str() — which was less CPU-intensive, could search for a plain string. If we needed a more complex pattern, we used regexp() or, for case-insensitive matching, iregexp(). Now, we can use a single function covering all these needs, including case-insensitive search — find().

find() takes additional arguments that specify how to search: as a plain string, which consumes less CPU power, or as a regular expression when we really need that flexibility.
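For illustration, a condition along these lines (the host and item key are placeholders) fires when the string ‘error’ appears in any value collected over the last 10 minutes, using a case-insensitive regular expression match:

find(/host/key,10m,"iregexp","error")=1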

New syntax

So, in earlier Zabbix versions, a trigger expression was wrapped in curly braces, with the host and key in the middle and the trigger function appended after a dot: {host:key.max(15m)}.

The unified syntax always starts with the function. This allows for nesting: since functions support multiple arguments, we can put a function inside another function, eliminating the previous limitations: max(/host/key,15m).

This syntax is very similar to a Linux file system path: it starts with a forward slash, then comes the host name, then the item key, with the interval (including any time shift) after the first comma.

Now, the same syntax is used for triggers and calculated items. In addition, previously, we did have the aggregated items. Now, if you want to do some aggregation, you’ll have to use calculated items inside the interface.

Examples

Now, you can accomplish many things that were not possible before.

Absolute time periods

We might need to know what has happened today, for instance, whether a workstation has been turned on by 10 a.m. We can do this by summing up the agent.ping values for the day. If the sum is zero, the workstation is still offline and no one has turned it on; otherwise, agent.ping will have returned 1.

sum(/host/key, 1d:now/d+1d) = 0 and time() > 100000

  • We might need to find out the maximum workstation uptime yesterday. If the uptime is bigger than 10 hours, a person might have been working at the computer for too long and might need a reminder that there’s life outside.

trendmax(/host/key, 1d:now/d)>10h

  • When analyzing user sessions, we might look at peaks. Here we compare yesterday’s peak with last week’s peak: if yesterday’s peak exceeds half of last week’s, it is considered a problem and we get an event.

trendmax(/host/key, 1d:now/d) > trendmax(/host/key, 1d:now/w) / 2

  • We can also analyze activity, for instance, the stock template’s memory utilization on a Windows machine for the previous month. Here we look at the maximum memory used during the previous month per server (not workstation this time) and compare it with the memory available.

trendmax(/host/key1, 1M:now/M) < last(/host/key2) / 2

Here, the server has not used even half of its available memory during the previous month, so we will get notified.

The beauty of this trigger is that the stock template already contains the required items: one collecting used memory and one collecting total memory. All we need to do is go to the Triggers section and add this trigger, and as soon as a single new metric comes in, it can generate an event about the last month, because all the historical data is already stored in the database.

Compare string between two hosts

Now, we can compare text strings. For instance, we can take two different servers (say, node 1 and node 2) and compare their agent versions. You decide whether a difference (or a match) constitutes a problem; the trigger will fire an event, and we can indicate the difference between the versions in the event title.

last(/node1/agent.version) <> last(/node2/agent.version)

Aggregated checks

Aggregation on current host (foreach)

You might need to calculate something and store the data in the database.

In this example, we have a website monitored by a template containing a web scenario, which consists of multiple steps such as checking individual pages. It provides, on the Latest data page, the response time per page.

In the Templates, we can select this calculated item, and, for instance, aggregate all those built-in response time metrics.

Average web page response time: avg(last_foreach(//web.test.time[check performance,*,resp]))

The ‘*’ wildcard tells Zabbix to aggregate all the items reflecting response time, and the double forward slash means aggregation on the current host. ‘foreach’ will now appear everywhere we do aggregation; much like the foreach construct in Windows PowerShell, it iterates over a set of elements, either hosts or items.

We might use a different type of aggregation with the same approach. For instance, when monitoring disk space on a Linux or Windows machine, we capture the used space per drive or per mount point. Using one calculated item, we can aggregate the total used space across all the mounts or drives on the server. This might be useful, for instance, to estimate how much disk space you need when it’s time to purchase a new SSD.

Total used space on all drives: sum(last_foreach(//vfs.fs.size[*,used]))

Aggregation by host group

Another type of functionality — aggregation by the host group.

You might have a host group, for instance, ‘MySQL servers’, with the official monitoring solution collecting queries per second. The calculated item will aggregate the queries per second into one single item. If there is a pool of MySQL servers, you will end up seeing the total number of queries across the pool and can afterward link a trigger to this item.

MySQL queries per second: sum(last_foreach(/*/mysql.queries.rate?[group=”MySQL servers”]))

Aggregation by tag:value

You can do the aggregation not only by host group but also by utilizing tags (tag name and tag value). The tag does not need to be on the item level: you can mark a host with a role tag at the host level. So, you create this calculated item inside a template, it searches for all items that belong to hosts with the given tag and tag value, and it performs the aggregation. You end up with an item holding the aggregated used-space metric (for domain controllers in this example).

Used space on domain controllers: sum(last_foreach(/*/vfs.fs.size[*,used]?[tag=”Role:Domain Controller”]))

Questions & Answers

Question. After the upgrade, what will happen with the trigger item syntax? Do we have to rebuild everything?

Answer. It gets upgraded automatically. However, as per the results of my testing in the test environment, some items have to be checked. I would highly suggest extracting the configuration and testing the upgrade process beforehand to learn how those items will affect the system (just to be on the safe side).

Question. Are there any changes or improvements in the predictive trigger syntax?

Answer. You should definitely take a look at the roadmap for the exact information.

Question. When we’re doing an aggregation, is it always going to be on a single host or can we specify a host group or a tag?

Answer. Certainly. Aggregation by host group is still possible, and it’s more flexible now. We use a wildcard in place of the host, and then we can narrow the selection down by host group or by specifying which tag the item should have. We can combine all of those filters, that is, filter by host group and by tag name and tag value.
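For instance, a calculated item that combines both filters could look something like this sketch (the group name and tag are hypothetical):

sum(last_foreach(/*/vfs.fs.size[*,used]?[group="Linux servers" and tag="Env:Production"]))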

In addition, when we’re doing the upgrade, the prototype items and triggers for the low-level discovery rules are also going to be automatically switched over.

 

Planetary-Scale Computing – 9.95 PFLOPS & Position 41 on the TOP500 List

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/planetary-scale-computing-9-95-pflops-position-41-on-the-top500-list/

Weather forecasting, genome sequencing, geoanalytics, computational fluid dynamics (CFD), and other types of high-performance computing (HPC) workloads can take advantage of massive amounts of compute power. These workloads are often spikey and massively parallel, and are used in situations where time to results is critical.

Old Way
Governments, well-funded research organizations, and Fortune 500 companies invest tens of millions of dollars in supercomputers in an attempt to gain a competitive edge. Building a state-of-the-art supercomputer requires specialized expertise, years of planning, and a long-term commitment to the architecture and the implementation. Once built, the supercomputer must be kept busy in order to justify the investment, resulting in lengthy queues while jobs wait their turn. Adding capacity and taking advantage of new technology is costly and can also be disruptive.

New Way
It is now possible to build a virtual supercomputer in the cloud! Instead of committing tens of millions of dollars over the course of a decade or more, you simply acquire the resources you need, solve your problem, and release the resources. You can get as much power as you need, when you need it, and only when you need it. Instead of force-fitting your problem to the available resources, you figure out how many resources you need, get them, and solve the problem in the most natural and expeditious way possible. You do not need to make a decade-long commitment to a single processor architecture, and you can easily adopt new technology as it becomes available. You can perform experiments at any scale without long term commitment, and you can gain experience with emerging technologies such as GPUs and specialized hardware for machine learning training and inferencing.

Top500 Run


Descartes Labs optical and radar satellite imagery analysis of historical deforestation and estimated forest carbon loss for a region in Kalimantan, Borneo.

AWS customer Descartes Labs uses HPC to understand the world and to handle the flood of data that comes from sensors on the ground, in the water, and in space. The company has been cloud-based from the start, and focuses on geospatial applications that often involve petabytes of data.

CTO & Co-Founder Mike Warren told me that their intent is to never be limited by compute power. In the early days of his career, Mike worked on simulations of the universe and built multiple clusters and supercomputers including Loki, Avalon, and Space Simulator. Mike was one of the first to build clusters from commodity hardware, and has learned a lot along the way.

After retiring from Los Alamos National Lab, Mike co-founded Descartes Labs. In 2019, Descartes Labs used AWS to power a TOP500 run that delivered 1.93 PFLOPS, landing at position 136 on the TOP500 list for June 2019. That run made use of 41,472 cores on a cluster of C5 instances. Notably, Mike told me that they launched this run without any help from or coordination with the EC2 team (because Descartes Labs routinely runs production jobs of this magnitude for their customers, their account already had sufficiently high service quotas). To learn more about this run, read Thunder from the Cloud: 40,000 Cores Running in Concert on AWS. This is my favorite part of that story:

We were granted access to a group of nodes in the AWS US-East 1 region for approximately $5,000 charged to the company credit card. The potential for democratization of HPC was palpable since the cost to run custom hardware at that speed is probably closer to $20 to $30 million. Not to mention a 6–12 month wait time.

After the success of this run, Mike and his team decided to work on an even more substantial one for 2021, with a target of 7.5 PFLOPS. Working with the EC2 team, they obtained an EC2 On-Demand Capacity Reservation for a 48 hour period in early June. After some “small” runs that used just 1024 instances at a time, they were ready to take their shot. They launched 4,096 EC2 instances (C5, C5d, R5, R5d, M5, and M5d) with a total of 172,692 cores. Here are the results:

  • Rmax – 9.95 PFLOPS. This is the actual performance that was achieved: Almost 10 quadrillion floating point operations per second.
  • Rpeak – 15.11 PFLOPS. This is the theoretical peak performance.
  • HPL Efficiency – 65.87%. The ratio of Rmax to Rpeak, or a measure of how well the hardware is utilized.
  • N – 7,864,320. This is the size of the matrix that is inverted to perform the Top500 benchmark. N² is about 61.84 trillion.
  • P x Q – 64 x 128. This is a parameter for the run, and represents a processing grid.

This run sits at position 41 on the June 2021 TOP500 list, and represents a 417% performance increase in just two years. When compared to the other CPU-based runs, this one sits at position 20. The GPU-based runs are certainly impressive, but ranking them separately makes for the best apples-to-apples comparison.

Mike and his team were very pleased with the results, and believe that it demonstrates the power and value of the cloud for HPC jobs of any scale. Mike noted that the Thinking Machines CM-5 that took the top spot in 1993 (and made a guest appearance in Jurassic Park) is actually slower than a single AWS core!

The run wrapped up at 11:56 AM PST on June 4th. By 12:20 PM, just 24 minutes later, the cluster had been taken down and all of the instances had been stopped. This is the power of on-demand supercomputing!

Imagine a Beowulf Cluster
Back in the early days of Slashdot, every post that referenced some then-impressive piece of hardware would invariably include a comment to the effect of “Imagine a Beowulf cluster.” Today, you can easily imagine (and then launch) clusters of just about any size and use them to address your large-scale computational needs.

If you have planetary-scale problems that can benefit from the speed and flexibility of the AWS Cloud, it is time to put your imagination to work! Here are some resources to get you started:

Congratulations
I would like to offer my congratulations to Mike and to his team at Descartes Labs for this amazing achievement! Mike has worked for decades to prove to the world that mass-produced, commodity hardware and software can be used to build a supercomputer, and the results more than speak for themselves.

To learn more about this run and about Descartes Labs, read Descartes Labs Achieves #41 in TOP500 with Cloud-based Supercomputing Demonstration Powered by AWS, Signaling New Era for Geospatial Data Analysis at Scale.

Jeff;

 

Customize and Package Dependencies With Your Apache Spark Applications on Amazon EMR on Amazon EKS

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/customize-and-package-dependencies-with-your-apache-spark-applications-on-amazon-emr-on-amazon-eks/

Last AWS re:Invent, we announced the general availability of Amazon EMR on Amazon Elastic Kubernetes Service (Amazon EKS), a new deployment option for Amazon EMR that allows customers to automate the provisioning and management of Apache Spark on Amazon EKS.

With Amazon EMR on EKS, customers can deploy EMR applications on the same Amazon EKS cluster as other types of applications, which allows them to share resources and standardize on a single solution for operating and managing all their applications. Customers running Apache Spark on Kubernetes can migrate to EMR on EKS and take advantage of the performance-optimized runtime, integration with Amazon EMR Studio for interactive jobs, integration with Apache Airflow and AWS Step Functions for running pipelines, and Spark UI for debugging.

When customers submit jobs, EMR automatically packages the application into a container with the big data framework and provides prebuilt connectors for integrating with other AWS services. EMR then deploys the application on the EKS cluster and manages running the jobs, logging, and monitoring. If you currently run Apache Spark workloads and use Amazon EKS for other Kubernetes-based applications, you can use EMR on EKS to consolidate these on the same Amazon EKS cluster to improve resource utilization and simplify infrastructure management.

Developers who run containerized, big data analytical workloads told us they just want to point to an image and run it. Currently, EMR on EKS dynamically adds externally stored application dependencies during job submission.

Today, I am happy to announce customizable image support for Amazon EMR on EKS that allows customers to modify the Docker runtime image that runs their analytics application using Apache Spark on their EKS clusters.

With customizable images, you can create a container that contains both your application and its dependencies, based on the performance-optimized EMR Spark runtime, using your own continuous integration (CI) pipeline. This reduces the time needed to build the image and makes container launches more predictable for local development and testing.

Now, data engineers and platform teams can create a base image, add their corporate standard libraries, and then store it in Amazon Elastic Container Registry (Amazon ECR). Data scientists can customize the image to include their application-specific dependencies. The resulting immutable image can be vulnerability-scanned and deployed to test and production environments. Developers can now simply point to the customized image and run it on EMR on EKS.

Customizable Runtime Images – Getting Started
To get started with customizable images, use the AWS Command Line Interface (AWS CLI) to perform these steps:

  1. Register your EKS cluster with Amazon EMR.
  2. Download the EMR-provided base images from Amazon ECR and modify the image with your application and libraries.
  3. Publish your customized image to a Docker registry such as Amazon ECR and then submit your job while referencing your image.

You can download one of the following base images. These images contain the Spark runtime that can be used to run batch workloads using the EMR Jobs API. Here is the latest full image list available.

Release Label | Spark/Hadoop Versions | Base Image Tag
emr-5.32.0-latest | Spark 2.4.7 + Hadoop 2.10.1 | emr-5.32.0-20210129
emr-5.33-latest | Spark 2.4.7-amzn-1 + Hadoop 2.10.1-amzn-1 | emr-5.33.0-20210323
emr-6.2.0-latest | Spark 3.0.1 + Hadoop 3.2.1 | emr-6.2.0-20210129
emr-6.3-latest | Spark 3.1.1-amzn-0 + Hadoop 3.2.1-amzn-3 | emr-6.3.0:latest

These base images are located in an Amazon ECR repository in each AWS Region, with an image URI that combines the ECR registry account, the AWS Region code, and the base image tag. For example, here is the base image URI for the US East (N. Virginia) Region:

755674844232.dkr.ecr.us-east-1.amazonaws.com/spark/emr-5.32.0-20210129

Now, sign in to the Amazon ECR repository and pull the image into your local workspace. If you want to reduce network latency, pull the image from the ECR repository in the AWS Region closest to you; in this example, the image is pulled from the US West (Oregon) Region.

$ aws ecr get-login-password --region us-west-2 | docker login --username AWS --password-stdin 895885662937.dkr.ecr.us-west-2.amazonaws.com
$ docker pull 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-5.32.0-20210129

Create a Dockerfile on your local workspace with the EMR-provided base image and add commands to customize the image. If the application requires custom Java SDK, Python, or R libraries, you can add them to the image directly, just as with other containerized applications.

The following example Dockerfile covers a use case in which you want to install useful Python libraries for Natural Language Processing (NLP) with Spark, along with Pandas.

FROM 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-5.32.0-20210129
USER root
### Add customizations here ####
# Install Python NLP libraries
RUN pip3 install pyspark pandas spark-nlp
USER hadoop:hadoop

In another use case, as I mentioned, you can install a different version of Java (for example, Java 11):

FROM 895885662937.dkr.ecr.us-west-2.amazonaws.com/spark/emr-5.32.0-20210129
USER root
### Add customizations here ####
# Install Java 11 and set JAVA_HOME
RUN yum install -y java-11-amazon-corretto
ENV JAVA_HOME /usr/lib/jvm/java-11-amazon-corretto.x86_64
USER hadoop:hadoop

If you’re changing the Java version to 11, you also need to change the Java Virtual Machine (JVM) options for Spark. Provide the following options in applicationConfiguration when you submit jobs; they are needed because Java 11 does not support some Java 8 JVM parameters.

"applicationConfiguration": [ 
  {
    "classification": "spark-defaults",
    "properties": {
        "spark.driver.defaultJavaOptions" : "
		    -XX:OnOutOfMemoryError='kill -9 %p' -XX:MaxHeapFreeRatio=70",
        "spark.executor.defaultJavaOptions" : "
		    -verbose:gc -Xlog:gc*::time -XX:+PrintGCDetails -XX:+PrintGCDateStamps 
			-XX:OnOutOfMemoryError='kill -9 %p' -XX:MaxHeapFreeRatio=70 
			-XX:+IgnoreUnrecognizedVMOptions"
    }
  }
]

To use custom images with EMR on EKS, publish your customized image and submit a Spark workload in Amazon EMR on EKS using the available Spark parameters.

You can submit batch workloads using your customized Spark image. To submit batch workloads using the StartJobRun API or CLI, use the spark.kubernetes.container.image parameter.

$ aws emr-containers start-job-run \
    --virtual-cluster-id <enter-virtual-cluster-id> \
    --name sample-job-name \
    --execution-role-arn <enter-execution-role-arn> \
    --release-label <base-release-label> \ # Base EMR Release Label for the custom image
    --job-driver '{
        "sparkSubmitJobDriver": {
        "entryPoint": "local:///usr/lib/spark/examples/jars/spark-examples.jar",
        "entryPointArguments": ["1000"],
        "sparkSubmitParameters": [ "--class org.apache.spark.examples.SparkPi --conf spark.kubernetes.container.image=123456789012.dkr.ecr.us-west-2.amazonaws.com/emr5.32_custom"
		  ]
      }
  }'

Use the kubectl command to confirm the job is running your custom image.

$ kubectl get pod -n <namespace> | grep "driver" | awk '{print $1}'
Example output: k8dfb78cb-a2cc-4101-8837-f28befbadc92-1618856977200-driver

Get the image for the main container in the Driver pod (Uses jq).

$ kubectl get pod/<driver-pod-name> -n <namespace> -o json | jq '.spec.containers | .[] | select(.name=="spark-kubernetes-driver") | .image'
Example output: 123456789012.dkr.ecr.us-west-2.amazonaws.com/emr5.32_custom

To view jobs in the Amazon EMR console, under EMR on EKS, choose Virtual clusters. From the list of virtual clusters, select the virtual cluster for which you want to view logs. On the Job runs table, select View logs to view the details of a job run.

Automating Your CI Process and Workflows
You can now customize an EMR-provided base image to include an application to simplify application development and management. With custom images, you can add the dependencies using your existing CI process, which allows you to create a single immutable image that contains the Spark application and all of its dependencies.

You can apply your existing development processes, such as vulnerability scans against your Amazon EMR image. You can also validate for correct file structure and runtime versions using the EMR validation tool, which can be run locally or integrated into your CI workflow.

The APIs for Amazon EMR on EKS are integrated with orchestration services like AWS Step Functions and AWS Managed Workflows for Apache Airflow (MWAA), allowing you to include EMR custom images in your automated workflows.

Now Available
You can now set up customizable images in all AWS Regions where Amazon EMR on EKS is available. There is no additional charge for custom images. To learn more, see the Amazon EMR on EKS Development Guide and a demo video on how to build your own images for running Spark jobs on Amazon EMR on EKS.

You can send feedback to the AWS forum for Amazon EMR or through your usual AWS support contacts.

Channy

Introducing a Public Registry for AWS CloudFormation

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/introducing-a-public-registry-for-aws-cloudformation/

AWS CloudFormation and the AWS Cloud Development Kit (CDK) provide scalable and consistent provisioning of AWS resources (for example, compute infrastructure, monitoring tools, databases, and more). We’ve heard from many customers that they’d like to benefit from the same consistency and scalability when provisioning resources from AWS Partner Network (APN) members, third-party vendors, and open-source technologies, regardless of whether they are using CloudFormation templates or have adopted the CDK to define their cloud infrastructure.

I’m pleased to announce a new public registry for CloudFormation, providing a searchable collection of extensions – resource types or modules – published by AWS, APN partners, third parties, and the developer community. The registry makes it easy to discover and provision these extensions in your CloudFormation templates and CDK applications in the same manner you use AWS-provided resources. Using extensions, you no longer need to create and maintain custom provisioning logic for resource types from third-party vendors. And, you are able to use a single infrastructure as code tool, CloudFormation, to provision and manage AWS and third-party resources, further simplifying the infrastructure provisioning process (the CDK uses CloudFormation under the hood).

Launch Partners
We’re excited to be joined by over a dozen APN Partners for the launch of the registry, with more than 35 extensions available for you to use today. Blog posts and announcements from the APN Partners who collaborated on this launch, along with AWS Quick Starts, can be found below (some will be added in the next few days).

Registries and Resource Types
In 2019, CloudFormation launched support for private registries. These enabled registration and use of resource providers (Lambda functions) in your account, including providers from AWS and third-party vendors. After you registered a provider, you could use resource types, composed of custom provisioning logic, from the provider in your CloudFormation templates. Resource types were uploaded by providers to an Amazon Simple Storage Service (Amazon S3) bucket, and you used the types by referencing the relevant S3 URL. The public registry provides consistency in the sourcing of resource types and modules, and you no longer need to use a collection of S3 buckets.

Third-party resource types in the public registry also integrate with drift detection. After creating a resource from a third-party resource type, CloudFormation will detect changes to the resource from its template configuration, known as configuration drift, just as it would with AWS resources. You can also use AWS Config to manage compliance for third-party resources consumed from the registry. The resource types are automatically tracked as Configuration items when you have configured AWS Config to record them, and used CloudFormation to create, update, and delete them. Whether the resource types you use are third-party or AWS resources, you can view configuration history for them, in addition to being able to write AWS Config rules to verify configuration best practices.
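As a side note, drift detection for a stack that uses such resources is started with the same CloudFormation operations as for any other stack; here is a quick sketch with a hypothetical stack name:

$ aws cloudformation detect-stack-drift --stack-name my-extensions-stack
$ aws cloudformation describe-stack-drift-detection-status --stack-drift-detection-id <detection-id-from-previous-command>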

The public registry also supports Type Configuration, enabling you to configure third-party resource types with API keys and OAuth tokens per account and region. Once set, the configuration is stored securely and can be updated. This also provides a centralized way to configure third-party resource types.
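The same configuration can also be applied from the AWS CLI; here is a hedged sketch in which the type name and configuration document are hypothetical and will vary per extension:

$ aws cloudformation set-type-configuration \
    --type RESOURCE \
    --type-name "ThirdParty::Service::Resource" \
    --configuration '{"Credentials": {"ApiKey": "example-api-key"}}'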

Publishing Extensions to the Public Registry
Extension publishers must be verified as AWS Marketplace sellers, or as GitHub or Bitbucket users, and extensions are validated against best practices. To publish extensions (resource types or modules) to the registry, you must first register in an AWS Region, using one of the mentioned account types.

After you’ve registered, you next publish your extension to a private registry in the same Region. Then, you need to test that the extension meets publishing requirements. For a resource type extension, this means it must pass all the contract tests defined for the type. Modules are subject to different requirements, and you can find more details in the documentation. With testing complete, you can publish your extension to the public registry for your Region. See the user guide for detailed information on publishing extensions.
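Assuming the AWS CLI, the overall flow looks roughly like the following sketch (the type name is hypothetical, and each command accepts more options than shown):

$ aws cloudformation register-publisher --accept-terms-and-conditions
$ aws cloudformation test-type --type RESOURCE --type-name "MyCompany::Service::Widget"
$ aws cloudformation publish-type --type RESOURCE --type-name "MyCompany::Service::Widget"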

Using Extensions in the Public Registry
I decided to try a couple of extensions related to Kubernetes, contributed by AWS Quick Starts, to make configuration changes to a cluster. Personally, I don’t have a great deal of experience with Kubernetes and its API so this was a great chance to examine how extensions could save me significant time and effort. During the process of writing this post I learned from others that using the Kubernetes API (the usual way to achieve the changes I had in mind) would normally involve effort even for those with more experience.

For this example I needed a Kubernetes cluster, so I followed this tutorial to set one up in Amazon Elastic Kubernetes Service (EKS), using the Managed nodes – Linux node type. With my cluster ready, I want to make two configuration changes.

First, I want to add a new namespace to the cluster. A namespace is a partitioning construct that lets me deploy the same set of resources to different namespaces in the same cluster without conflict thanks to the isolation namespaces provide. Second, I want to set up and use Helm, a package manager for Kubernetes. I’ll use Helm to install the kube-state-metrics package from the Prometheus helm-charts repository for gathering cluster metrics. While I can use CloudFormation to provision clusters and compute resources, previously, to perform these two configuration tasks, I’d have had to switch to the API or various bespoke tool chains. With the registry, and these two extensions, I can now do everything using CloudFormation (and of course, as I mentioned earlier, I could also use the extensions with the CDK, which I’ll show later).

Before using an extension, it needs to be activated in my account. While activation is easy to do for single accounts using the console, as we’ll see in a moment, if I were using AWS Organizations and wanted to activate various third-party extensions across my entire organization, or for a specific organization unit (OU), I could achieve this using Service-Managed StackSets in CloudFormation. Using the resource type AWS::CloudFormation::TypeActivation in a template submitted to a Service-Managed StackSet, I can target an entire Organization, or a particular OU, passing the Amazon Resource Name (ARN) identifying the third-party extension to be activated. Activation of extensions is also very easy to achieve (whether using AWS Organizations or not) using the CDK with just a few lines of code, again making use of the aforementioned TypeActivation resource type.
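For reference, a minimal template snippet using that resource type might look like the following sketch, with placeholder values for the ARNs:

Resources:
  ActivateKubernetesResource:
    Type: AWS::CloudFormation::TypeActivation
    Properties:
      PublicTypeArn: <public-extension-arn>
      ExecutionRoleArn: <execution-role-arn>
      AutoUpdate: true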

To activate the extensions, I head to the CloudFormation console and click Public extensions from the navigation bar. This takes me to the Registry:Public extensions home page, where I switch to viewing third party resource type extensions.

Viewing third-party types in the registry

The extensions I want are AWSQS::Kubernetes::Resource and AWSQS::Kubernetes::Helm. The Resource extension is used to apply a manifest describing configuration changes to a cluster. In my case, the manifest requests a namespace be created. Clicking the name of the AWSQS::Kubernetes::Resource extension takes me to a page where I can view schema, configuration details, and versions for the extension.

Viewing details of the Resource extension

What happens if you deactivate an extension you’re using, or an extension is withdrawn by the publisher? If you deactivate an extension a stack depends on, any resources created from that extension won’t be affected, but you’ll be unable to perform further stack operations, such as Read, Update, Delete, and List (these will fail until the extension is re-activated). Publishers must request their extensions be withdrawn from the registry (there is no “delete” API). If the request is granted, customers who activated the extension prior to withdrawal can still perform Create/Read/Update/Delete/List operations, using what is effectively a snapshot of the extension in their account.

Clicking Activate takes me to a page where I need to specify the ARN of an execution role that CloudFormation will assume when it runs the code behind the extension. I create a role following this user guide topic, but the basic trust relationship is below for reference.

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {
        "Service": "resources.cloudformation.amazonaws.com"
      },
      "Action": "sts:AssumeRole"
    }
  ]
}

I also add permissions for the resource types I’m using to my execution role. Details on the permissions needed for the types I chose can be found on GitHub, for Helm, and for Kubernetes (note the GitHub examples include the trust relationship too).

When activating an extension, I can elect to use the default name, which is how I will refer to the type in my templates or CDK applications, or I can enter a new name. The name chosen has to be unique within my account, so if I’ve enabled a version of an extension with its default name, and want to enable a different version, I must change the name. Once I’ve filled in the details, and chosen my versioning strategy (extensions use semantic versioning, and I can elect to accept automatic updates for minor version changes, or to “lock” to a specific version) clicking Activate extension completes the process.

Activating an extension from the registry

That completes the process for the first extension, and I follow the same steps for the AWSQS::Kubernetes::Helm extension. Navigating to Activated extensions I can view a list of all my enabled extensions.

Viewing the list of enabled extensions

I have one more set of permissions to update. Resource types make calls to the Kubernetes API on my behalf so I need to update the aws-auth ConfigMap for my cluster to reference the execution role I just used, otherwise the calls made by the resource types I’m using will fail. To do this, I run the command kubectl edit cm aws-auth -n kube-system at a command prompt. In the text editor that opens, I update the ConfigMap with a new group referencing my CfnRegistryExtensionExecRole, shown below (if you’re following along, be sure to change the account ID and role name to match yours).

apiVersion: v1
data:
  mapRoles: |
    - groups:
      - system:bootstrappers
      - system:nodes
      rolearn: arn:aws:iam::111122223333:role/myAmazonEKSNodeRole
      username: system:node:{{EC2PrivateDNSName}}
    - groups:
      - system:masters
      rolearn: arn:aws:iam::111122223333:role/CfnRegistryExtensionExecRole
      username: cfnresourcetypes
kind: ConfigMap
metadata:
  creationTimestamp: "2021-06-04T20:44:24Z"
  name: aws-auth
  namespace: kube-system
  resourceVersion: "6355"
  selfLink: /api/v1/namespaces/kube-system/configmaps/aws-auth
  uid: dc91bfa8-1663-45d0-8954-1e841913b324

Now I’m ready to use the extensions to configure my cluster with a new namespace, Helm, and the kube-state-metrics package. I create a CloudFormation template that uses the extensions, adding parameters for the elements I want to specify when creating a stack: the name of the cluster to update, and the namespace name. The properties for the KubeStateMetrics resource reference the package I want Helm to install.

AWSTemplateFormatVersion: "2010-09-09"
Parameters:
  ClusterName:
    Type: String
  Namespace:
    Type: String
Resources:
  KubeStateMetrics:
    Type: AWSQS::Kubernetes::Helm
    Properties:
      ClusterID: !Ref ClusterName
      Name: kube-state-metrics
      Namespace: !GetAtt KubeNamespace.Name
      Repository: https://prometheus-community.github.io/helm-charts
      Chart: prometheus-community/kube-state-metrics
  KubeNamespace:
    Type: AWSQS::Kubernetes::Resource
    Properties:
      ClusterName: !Ref ClusterName
      Namespace: default
      Manifest: !Sub |
        apiVersion: v1
        kind: Namespace
        metadata:
          name: ${Namespace}
          labels:
            name: ${Namespace}

On the Stacks page of the CloudFormation console, I click Create stack, upload my template, and then give my stack a name and the values for my declared parameters.

Launching a stack with my activated extensions

I click Next to proceed through the rest of the wizard, leaving other settings at their default values, and then Create stack to complete the process.

Once stack creation is complete, I verify my changes using the kubectl command line tool. I first check that the new namespace, newsblog-sample-namespace, is present with the command kubectl get namespaces. I then run the kubectl get all --namespace newsblog-sample-namespace command to verify the kube-state-metrics package is installed.
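For reference, those verification commands are:

$ kubectl get namespaces
$ kubectl get all --namespace newsblog-sample-namespace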

Verifying the changes applied by the extensions

Extensions can also be used with the AWS Cloud Development Kit. To wrap up this exploration of using the new registry, I’ve included an example below of a CDK application snippet in TypeScript that achieves the same effect, using the same extensions, as the YAML template I showed earlier (I could also have written this using any of the languages supported by the CDK – C#, Java, or Python).

import {Stack, Construct, CfnResource} from '@aws-cdk/core';
export class UnoStack extends Stack {
  constructor(scope: Construct, id: string) {
    super(scope, id);
    const clusterName = 'newsblog-cluster';
    const namespace = 'newsblog-sample-namespace';

    const kubeNamespace = new CfnResource(this, 'KubeNamespace', {
      type: 'AWSQS::Kubernetes::Resource',
      properties: {
        ClusterName: clusterName,
        Namespace: 'default',
        Manifest: this.toJsonString({
          apiVersion: 'v1',
          kind: 'Namespace',
          metadata: {
            name: namespace,
            labels: {
              name: namespace,
            }
          },
        }),
      },
    });
    
    new CfnResource(this, 'KubeStateMetrics', {
      type: 'AWSQS::Kubernetes::Helm',
      properties: {
        ClusterID: clusterName,
        Name: 'kube-state-metrics',
        Namespace: kubeNamespace.getAtt('Name').toString(),
        Repository: 'https://prometheus-community.github.io/helm-charts',
        Chart: 'prometheus-community/kube-state-metrics',
      },
    });
  }
};

As mentioned earlier in this post, I don’t have much experience with the Kubernetes API, and Kubernetes in general. However, by making use of the resource types in the public registry, in conjunction with CloudFormation, I was able to easily configure my cluster using a familiar environment, without needing to resort to the API or bespoke tool chains.

Get Started with the CloudFormation Public Registry
Pricing for the public registry is the same as for the existing registry and private resource types. There is no additional charge for using native AWS resource types; for third-party resource types you will incur charges based on the number of handler operations (add, delete, list, etc.) you run per month. For details, see the AWS CloudFormation Pricing page. The new public registry is available today in the US East (N. Virginia, Ohio), US West (Oregon, N. California), Canada (Central), Europe (Ireland, Frankfurt, London, Stockholm, Paris, Milan), Asia Pacific (Hong Kong, Mumbai, Osaka, Singapore, Sydney, Seoul, Tokyo), South America (Sao Paulo), Middle East (Bahrain), and Africa (Cape Town) AWS Regions.

For more information, see the AWS CloudFormation User Guide and User Guide for Extension Development, and start publishing or using extensions today!

— Steve

Migrate Your Workloads with the Graviton Challenge!

Post Syndicated from Steve Roberts original https://aws.amazon.com/blogs/aws/migrate-your-workloads-with-the-graviton-challenge/

Today, Dave Brown, VP of Amazon EC2 at AWS, announced the Graviton Challenge as part of his session on AWS silicon innovation at the Six Five Summit 2021. We invite you to take the Graviton Challenge and move your applications to run on AWS Graviton2. The challenge, intended for individual developers and small teams, is based on the experiences of customers who’ve already migrated. It provides a framework of eight, approximately four-hour chunks to prepare, port, optimize, and finally deploy your application onto Graviton2 instances. Getting your application running on Graviton2, and enjoying the improved price performance, aren’t the only rewards. There are prizes and swag for those who complete the challenge!

AWS Graviton2 is a custom-built processor from AWS that’s based on the Arm64 architecture. It’s supported by popular Linux operating systems including Amazon Linux 2, Red Hat Enterprise Linux, SUSE Linux Enterprise Server, and Ubuntu. Compared to fifth-generation x86-based Amazon Elastic Compute Cloud (Amazon EC2) instance types, Graviton2 instance types have a 20% lower cost. Overall, customers who have moved applications to Graviton2 typically see up to 40% better price performance for a broad range of workloads including application servers, container-based applications, microservices, caching fleets, data analytics, video encoding, electronic design automation, gaming, open-source databases, and more.

Before I dive in to talk more about the challenge, check out the fun introductory video below from Jeff Barr, Chief Evangelist, AWS and Dave Brown, Vice President, EC2. As Jeff mentions in the video: same exact workload, same or better performance, and up to 40% better price performance!

After you complete the challenge, we invite you to tell us about your adoption journey and enter the contest. If you post on social media with the hashtag #ITookTheGravitonChallenge, you’ll earn a t-shirt. To earn a hoodie, include a short video with your post.

To enter the competition, you’ll need to create a 5 to 10-minute video that describes your project and the application you migrated, any hurdles you needed to overcome, and the price performance benefits you realized.

All valid contest entries will each receive a $500 AWS credit (limited to a quantity of 500). A panel of judges will evaluate the content entries and award additional prizes across six categories. All category winners will receive an AWS re:Invent 2021 conference pass, flight, and hotel for one company representative, and winners will be able to meet with senior members of the Graviton2 team at the conference. Here are additional category-specific prizes:

  • Best adoption – enterprise
    Based on the performance gains, total cost savings, number of instances the workload is running on, and time taken to migrate the workload (faster is better), for companies with over 1000 employees. The winner will also receive a chance to present at the conference.
  • Best adoption – small/medium business
    Based on the performance gains, total cost savings, number of instances the workload is running on, and time taken to migrate the workload (faster is better), for companies with 100-1000 employees. The winner will also receive a chance to present at the conference.
  • Best adoption – startup
    Based on the performance gains, total cost savings, number of instances the workload is running on, and time taken to migrate the workload (faster is better), for companies with fewer than 100 employees. The winner will also receive a chance to present at the conference.
  • Best new workload adoption
    Awarded to a workload that’s new to EC2 (migrated to Graviton2 from on-premises, or other cloud) based on the performance gains, total cost savings, number of instances the workload is running on, and time taken to migrate the workload (faster is better). The winner will also receive a chance to participate in a video or written case study.
  • Most impactful adoption
    Awarded to the workload with the biggest social impact based on details provided about what the workload/application does. Applications in this category are related to fields such as sustainability, healthcare and life sciences, conservation, learning/education, justice/equity. The winner will also receive a chance to participate in a video or written case study.
  • Most innovative adoption
    Applications in this category solve unique problems for their customers, address new use cases, or are groundbreaking. The award will be based on the workload description, price performance gains, and total cost savings. The winner will also receive a chance to participate in a video or written case study.

Competition submissions open on June 22 and close August 31. Winners will be announced on October 1 2021.

Identifying a workload to migrate
Now that you know what’s possible with Graviton2, you’re probably eager to get started and identify a workload to tackle as part of the challenge. The ideal workload is one that already runs on Linux and uses open-source components. This means you’ll have full access to the source code of every component and can easily make any required changes. If you don’t have an existing Linux workload that is entirely open-source based, you can, of course, move other workloads. A robust ecosystem of ISVs and AWS services already support Graviton2. However, if you are using software from a vendor that does not support Arm64/Graviton2, reach out to the Graviton Challenge Slack channel for support.

What’s involved in the challenge?
The challenge includes eight steps performed over four days (but you don’t have to do the challenge in four consecutive days). If you need assistance from Graviton2 experts, a dedicated Slack channel is available and you can sign up for emails containing helpful tips and guidance. In addition to support on Slack and supporting emails, you also get a $25 AWS credit to cover the cost of taking the challenge. Graviton2-based burstable T4g instances also have a free trial, available until December 31, 2021, that can be used to qualify your workloads.

You can download the complete whitepaper from the Graviton Challenge page, but here is an outline of the process.

Day 1: Learn and explore
The first day you’ll learn about Graviton2 and then assess your selected workload. I recommend that you start by checking out the 2020 AWS re:Invent session, Deep dive on AWS Graviton2 processor-powered EC2 instances. The Getting Started with AWS Graviton GitHub repository will be a useful reference as you work through the challenge.

Assessment involves identifying the application’s dependencies and requirements. As with all preparatory work, the more thorough you are at this stage, the better positioned you are for success. So, don’t skimp on this task!

Day 2: Create a plan and start porting
On the second day, you’ll create a Graviton2 environment. You can use EC2 virtual machine instances with AWS-provided images or build your own custom images. Alternatively, you can go the container route, because both Amazon Elastic Container Service (Amazon ECS) and Amazon Elastic Kubernetes Service (EKS) support Graviton2-based instances.
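As a sketch of that first option (not part of the official challenge material; the key pair name and instance type are placeholders), you could look up the latest Arm64 Amazon Linux 2 AMI and launch a Graviton2-based instance with the AWS CLI:

$ aws ssm get-parameters --names /aws/service/ami-amazon-linux-latest/amzn2-ami-hvm-arm64-gp2 \
    --query 'Parameters[0].Value' --output text
$ aws ec2 run-instances --instance-type m6g.large --image-id <arm64-ami-id> \
    --key-name <your-key-pair> --count 1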

After you have created your environment, you’ll bootstrap the application. The Getting Started Guide on GitHub contains language-specific getting started information. If your application uses Java, Python, Node.js, .NET, or other high-level languages, then it might run as-is or need minimal changes. Other languages like C, C++, or Go will need to be compiled for the 64-bit Arm architecture. For more information, see the guides on GitHub.
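As a small illustration of what compiling for 64-bit Arm can look like in practice, here is a hedged example of cross-compiling a Go program for Graviton2 (assuming your project is written in Go; other toolchains have their own equivalents):

$ GOOS=linux GOARCH=arm64 go build -o myapp-arm64 .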

Day 3: Debug and optimize
Now that the application is running on a Graviton2 environment, it’s time to test and verify its functionality. When you have a fully functional application, you can test performance and compare it to x86-64 environments. If you don’t observe the expected performance, reach out to your account team, or get support on the Graviton Challenge Slack channel. We’re here to help analyze and resolve any potential performance gaps.

Day 4: Update infrastructure and start deployments
It’s shipping day! You’ll update your infrastructure to add Graviton2-based instances, and then start deploying. We recommend that you use canary or blue-green deployments so that a portion of your traffic is redirected to the new environments. When you’re comfortable, you can transition all traffic.

At this point, you can celebrate completing the challenge, publish a post on social media using the #ITookTheGravitonChallenge hashtag, let us know about your success, and consider entering the competition. Remember, entries for the competition are due by August 31, 2021.

Start the challenge today!
Now that you have some details about the challenge and rewards, it’s time to start your (migration) engines. Download the whitepaper from the Graviton Challenge landing page, familiarize yourself with the details, and off you go! And, if you do decide to enter the competition, good luck!

Footnote
In my role as a .NET Developer Advocate at AWS, I would be remiss if I failed to mention that this challenge is equally applicable to .NET applications using .NET Core or .NET 5 and later! In fact, .NET 5 includes ARM64-specific optimizations. For information about performance improvements my colleagues found for .NET applications running on AWS Graviton2, see the Powering .NET 5 with AWS Graviton2: Benchmarks blog post. There’s also a lab for .NET 5 on Graviton2. I invite you to check out the getting started material for .NET in the aws-graviton-getting-started GitHub repository and start migrating.

— Steve

Heads Up – AWS News Blog RSS Feed Change

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/heads-up-aws-news-blog-rss-feed-change/

TL;DR – If you are using the ancient Feedburner feed for this blog, please change to the new one (https://aws.amazon.com/blogs/aws/feed/) ASAP.

Back in the early days of AWS, I paid a visit to the Chicago headquarters of a cool startup called FeedBurner. My friend Matt Shobe demo’ed their new product to me and showed me how “burning” a raw RSS feed could add value to it in various ways. Upon my return to Seattle I promptly burned the RSS feed of the original (TypePad-hosted) version of this blog, and shared that feed for many years.

FeedBurner has served us well over the years, but its time has passed. The company was acquired many years ago and the new owners are now putting the product into maintenance mode.

While the existing feed will continue to work for the foreseeable future, I would like to encourage you to use the one directly generated by this blog instead. Simply update your feed reader to refer to https://aws.amazon.com/blogs/aws/feed/ .

Jeff;

In the Works – AWS Region in Tel Aviv, Israel

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/in-the-works-aws-region-in-tel-aviv-israel/

We launched three AWS Regions (Italy, South Africa, and Japan) in the last 12 months, and are working on regions in Australia, Indonesia, Spain, India, Switzerland, and the United Arab Emirates.

Tel Aviv, Israel in the Works
Today I am happy to announce that the AWS Israel (Tel Aviv) Region is in the works and will open in the first half of 2023. This region will have three Availability Zones and will give AWS customers in Israel the ability to run workloads and store data that must remain in-country.

There are 81 Availability Zones within 25 AWS Regions in operation today, with 21 more Availability Zones and seven announced regions (including this one) underway.

As is always the case with an AWS Region, each of the Availability Zones will be a fully isolated part of the AWS infrastructure. The AZs in this region will be connected together via high-bandwidth, low-latency network connections over dedicated, fully redundant metro fiber. This connectivity supports applications that need synchronous replication between AZs for availability or redundancy. You can take a peek at the AWS Global Infrastructure page to learn more about how we design and build regions and AZs.

AWS in Israel
I first visited Israel in 2013 and have been back several (but definitely not enough) times since then. I have spoken at several AWS Summits and visited many of our early customers in the area. Today, AWS has the following resources on the ground in Israel:

Israel is also home to Annapurna Labs, an Amazon.com subsidiary that is responsible for developing much of the innovative hardware that powers AWS.

In addition, the government of Israel announced that it has selected AWS as the primary cloud provider for the Nimbus Project. As part of this project, government ministries and subsidiaries in Israel will use cloud computing to power a digital transformation and to provide new digital services for the citizens of Israel.

Stay Tuned
We’ll announce the opening of the AWS Israel (Tel Aviv) Region in a forthcoming blog post, so be sure to stay tuned!

Jeff;

 

What’s new in Zabbix 5.4

Post Syndicated from Alexei Vladishev original https://blog.zabbix.com/whats-new-in-zabbix-5-4/14603/

Zabbix 5.4.0, released on May 17, 2021, is a non-LTS release that will be supported for only 7-8 months. It has already received a lot of attention from our users, community, and customers thanks to a number of significant and long-anticipated improvements. The Zabbix 5.4.0 release comes with scheduled PDF report generation, more robust problem detection, advanced data aggregation, and other significant improvements.

Contents

I. Reporting and visualization (1:22)
II. More powerful and simple (9:04)
III. Breaking changes (37:59)
IV. Upgrade notes (40:12)
V. Questions & Answers (44:01)

Reporting and visualization

Unification of screens

In Zabbix 5.2, we introduced pre-defined views for Problems. In Monitoring > Problems, you can filter problems by certain criteria and then save the filter as a separate view. You can then switch between views with one click, for instance between ‘All problems’, ‘Services’ (i.e. service-related problems), and ‘High severity problems’ in this example.

Pre-defined views for Monitoring > Problems

In Zabbix 5.4, we unified screens and dashboards, which means that screens are not supported anymore. In Zabbix 5.2, we had both Dashboard and Screens on the menu. In Zabbix 5.4, the Screens functionality has been moved to Dashboards, where all screens and all dashboards are available, which makes the workflow much simpler and more user-friendly.

New Dashboard menu item

This change affected global screens, as well as local screens, which we always had on a template level. Now, we have introduced the dashboards for templates in Configuration > Templates.

For instance, we have a template for Nginx performance monitoring, so in Configuration > Templates we have a dashboard ‘Nginx performance’. In Monitoring > Hosts, we have two dashboards for the host (cdn.example.com in this example): ‘Nginx performance’ and a second dashboard that came from another template.

Templates for Dashboards

By clicking Dashboards for this specific host, you can go to the Dashboards view of this host.

Dashboards view for a specific host

Here, you can quickly switch between the dashboards currently available for the host, such as ‘Nginx performance’ and ‘System performance’ in this example, to see some Linux OS-specific metrics such as CPU load, disk I/O, and so on.

Multi-page dashboards

Previously, we had a very nice feature in Zabbix: slideshows, or screen slideshows. Since we have moved everything to Dashboards, we spent a lot of time thinking about how to fit this into the Dashboard functionality and found a very good solution. We introduced Multi-page Dashboards – dashboards containing several pages.

Multi-page Dashboards

In this example, in the Zabbix ‘Server performance‘ dashboard, you can see CPU and memory metrics, graphs related to network performance, or any other page, created by clicking Add page and defining the dashboard page properties. All the available pages are then listed at the top of the dashboard, and you can switch easily between them. You can run the slideshow with the slideshow controls available in full-screen mode.

Slideshow mode

Scheduled PDF reports

Scheduled reporting was highly anticipated by our community members, our customers, and our users. In Zabbix 5.4, we introduced scheduled PDF reporting allowing you to define, generate, and send PDF reports straight to your inbox.

Scheduled PDF reporting

This new functionality provides some nice features, for instance:

  • Centralized management of reports, so that super admins can see what reports are generated by Zabbix and sent to different users.
  • Any dashboard can be converted to a PDF report and sent to your email box.
  • This functionality is accessible to all users, though it can be restricted with a new user role.

In addition, you can now specify, for instance, that you need a report with the previous week’s data every Monday at 7:00 am. All you need to do is select the period for the report, and Zabbix will generate it and email it to you or any other user. You can also choose to receive reports daily, monthly, or yearly.

PDF reports can be scheduled daily, weekly, monthly, or yearly

There are many other configuration parameters for PDF reports. Still, the most important advantage of PDF reporting is that this functionality is accessible from Dashboards. You just click Dashboard in Monitoring > Dashboards and select the period for PDF reporting.

PDF reporting accessible from Dashboards

 

More powerful and simple

When we think about which features to implement in Zabbix and in what direction we would like Zabbix to go, we always consider improvements that, on the one hand, improve usability and make Zabbix a simpler monitoring solution and, on the other hand, make it more powerful and much more flexible. Zabbix 5.4 is no exception. We have introduced a number of very significant improvements, which simplify monitoring and make Zabbix even more flexible.

Tags for items

Zabbix already supports tags for almost all essential objects, such as triggers, hosts, host prototypes, and templates. Tags are everywhere. In Zabbix 5.4, we introduced tags for items (metrics).

The item-level applications have now been replaced with tags, so applications are not supported anymore. We now use the much more flexible concept of named tags: each tag can carry a value, and you can have as many tags as you want, which is much more flexible compared to applications.

You will notice that in the configuration view of your item you now have an additional tab, Tags, where you can see the defined tags as well as their values.

Tags instead of applications

You don't have to worry about your applications though. Your applications will be automatically converted to the tag “Application: <app name>” during the upgrade. For instance, the application ‘CPU’ will be converted to the tag “Application: CPU”. All information will be preserved during the upgrade.
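
To give a concrete feel for item-level tags, here is a minimal Python sketch that creates an item with two tags through the Zabbix API. It is only an illustration: the endpoint URL, credential, host ID, item key, and tag names are placeholder assumptions, not values from this presentation.

import requests

ZABBIX_API = "https://zabbix.example.com/api_jsonrpc.php"  # placeholder endpoint
AUTH = "replace-with-a-session-id-or-api-token"            # placeholder credential

# Create a trapper item and attach two named tags to it
# (item-level tags replace applications starting with Zabbix 5.4).
payload = {
    "jsonrpc": "2.0",
    "method": "item.create",
    "params": {
        "hostid": "10084",              # placeholder host ID
        "name": "Custom application metric",
        "key_": "app.custom.metric",    # placeholder item key
        "type": 2,                      # Zabbix trapper, no interface required
        "value_type": 0,                # numeric (float)
        "tags": [
            {"tag": "component", "value": "application"},
            {"tag": "class", "value": "performance"},
        ],
    },
    "auth": AUTH,
    "id": 1,
}

print(requests.post(ZABBIX_API, json=payload, timeout=10).json())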

Syntax

Another very interesting functionality, about which we have been thinking for several years, is a new syntax for trigger expressions. We now have unified syntax for everything in Zabbix. Let’s talk about why this is an important step forward.

In previous versions we used a special syntax for triggers, some functional syntax for calculated items, and a different syntax for aggregated items:

  • TRIGGERS: {host:key.func(params)}>0
  • CALCULATED: 100*last("vfs.fs.size[/,free]")/last("vfs.fs.size[/,total]")
  • AGGREGATE: grpsum["MySQL Servers","vfs.fs.size[/,total]",last]

This was quite confusing, and users had to remember the syntax or consult the documentation. In Zabbix 5.4, we introduced a new and, most importantly, unified syntax. Now we use exactly the same syntax for triggers, calculated items, and aggregated items. In addition, this new syntax is more functional, while the old one was more object-oriented.

  • TRIGGERS: func(/host/key, params)>0
  • CALCULATED: sum(/host/vfs.fs.size[*,free], 10m)
  • AGGREGATE: min(avg_foreach(/*/qps?[group="PostgreSQL" and tag="Env:Production"], 5m))

The new syntax has a number of advantages. The new unified syntax:

  • is much simpler and unified for everything: triggers, calculated and aggregated items;
  • supports absolute time periods, such as last day, previous hour, etc. So, now, it’s easy to calculate, for instance, the minimum for the previous hour or an average value for the previous or the current day;
  • is free from any limitations of the old syntax;
  • allows for powerful aggregations and selection of items by tags utilizing wildcards, etc.;
  • allows for a function to be applied to results of other functions: func1(func2(item));
  • allows for multiple items as function arguments: min(item1, item2);
  • supports calculated metrics for everything working around the old limitations.
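
As an illustration of the new syntax in practice, the short Python sketch below creates a trigger through the Zabbix API using a new-style expression. The endpoint, credential, host name, and threshold are assumptions for the example only.

import requests

ZABBIX_API = "https://zabbix.example.com/api_jsonrpc.php"  # placeholder endpoint
AUTH = "replace-with-a-session-id-or-api-token"            # placeholder credential

# New unified syntax: func(/host/key, params) instead of {host:key.func(params)}.
payload = {
    "jsonrpc": "2.0",
    "method": "trigger.create",
    "params": {
        "description": "Less than 10% free space on /",
        "expression": "last(/cdn.example.com/vfs.fs.size[/,pfree])<10",
        "priority": 4,  # High severity
    },
    "auth": AUTH,
    "id": 1,
}

print(requests.post(ZABBIX_API, json=payload, timeout=10).json())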

New set of functions

We have also introduced new sets of functions:

https://www.zabbix.com/documentation/current/manual/appendix/functions

NOTE. If you think that something is missing or you’re not able to represent a problem condition using the new syntax or a new function, just let us know. 

API tokens

Finally, we have added support for per-user named API tokens with expiry dates. We have introduced a new user role option, access to API tokens, and now any user can generate a private API token in Zabbix for a specific use. All tokens can be managed globally by super admins with appropriate permissions, so there is now a very clear way to define which users are allowed to operate with API tokens and which are not.

So, if you are an ordinary user, you may go to User settings, click API tokens, and you will see your tokens with their name, optional expiry date, the time each was last accessed, and the status — Any, Enabled, or Disabled.

API token properties

API tokens can be created by any user with appropriate permissions. Super admins can select the user.

API tokens created by any user with permissions

NOTE. You can use the tokens for any integrations with the Zabbix API. You are not forced to use a username and password to start using the API; you just need to generate a token, copy it to the clipboard (it will not be visible later), store it in a secure location, and then reuse it.
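
For example, here is a minimal Python sketch of such an integration that lists hosts with a generated token, assuming the token is passed in the standard auth field of the JSON-RPC request; the endpoint and the token value are placeholders.

import requests

ZABBIX_API = "https://zabbix.example.com/api_jsonrpc.php"  # placeholder endpoint
API_TOKEN = "replace-with-the-generated-api-token"         # placeholder token

# Retrieve hosts using the API token instead of a username/password login.
payload = {
    "jsonrpc": "2.0",
    "method": "host.get",
    "params": {"output": ["hostid", "host"]},
    "auth": API_TOKEN,
    "id": 1,
}

for host in requests.post(ZABBIX_API, json=payload, timeout=10).json()["result"]:
    print(host["hostid"], host["host"])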

All tokens generated by different users can be managed by super admins with sufficient permissions to review the tokens.

API tokens managed globally by super admins

Easy-to-manage templates

In Zabbix, updating templates has not been that simple. When you make some adjustments, for instance, create new items or new triggers, or add any other entity, it is not easy to apply the update because you don't quite understand the final impact of the changes.

In Zabbix 5.4, we introduced unique universal IDs for each template element, such as items, triggers, graphs, and so on, which help to perform template updates in a safe way.

Universal template IDs

In the example above, these universal IDs are contained in a YAML-format template for monitoring Memcached. The IDs are unique, and they are used to match an item, a trigger, etc. These IDs serve as the uniqueness criteria, so Zabbix can easily understand which item we are trying to update, which items no longer exist, and whether something is a new item or an adjustment to an existing item.

The IDs also simplify working with templates. Now you can keep all your templates in a Git repository, for instance, in JSON or YAML format, and then push them to Zabbix via a CI/CD pipeline using the Zabbix API. As soon as you have made some adjustments to your template, your CI/CD system takes the latest version of the template and, using the Zabbix API, applies it in Zabbix. Such an approach really helps you treat your infrastructure as code.
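
A rough sketch of such a pipeline step is shown below. It assumes a YAML template exported from Zabbix is stored in the repository; the endpoint, token, file path, and the exact set of import rules are assumptions that depend on your setup and Zabbix version.

import requests

ZABBIX_API = "https://zabbix.example.com/api_jsonrpc.php"  # placeholder endpoint
API_TOKEN = "replace-with-an-api-token"                    # placeholder token

# Read the template from the repository checkout.
with open("templates/memcached.yaml", "r", encoding="utf-8") as f:  # placeholder path
    template_source = f.read()

# Push the template to Zabbix; the universal IDs inside the template
# let Zabbix match and update existing items, triggers, and so on.
payload = {
    "jsonrpc": "2.0",
    "method": "configuration.import",
    "params": {
        "format": "yaml",
        "source": template_source,
        "rules": {
            "templates": {"createMissing": True, "updateExisting": True},
            "items": {"createMissing": True, "updateExisting": True, "deleteMissing": True},
            "triggers": {"createMissing": True, "updateExisting": True, "deleteMissing": True},
        },
    },
    "auth": API_TOKEN,
    "id": 1,
}

print(requests.post(ZABBIX_API, json=payload, timeout=10).json())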

Better import

When importing a new template or a new version of an existing template, Zabbix will now show you the differences compared to the existing template, as well as the changes that are going to be made in Zabbix.

Comparing existing and new templates

In this example, I have a new version of the Memcached template. I removed one tag, Application, and replaced it with two tags: Service (Memcached) and Class (Infrastructure). During import, I can see the difference between my existing template in Zabbix and the new template, so that I can easily review all of the changes. After pressing Import, these changes are applied to the configuration.

Scalability improvements

We have also introduced a few scalability improvements:

  • Zabbix Server and Proxy poller processes no longer require a database connection.
  • An in-memory cache for trend data has been introduced, providing a significant speedup for trend-related functions.
  • Better parallel data processing on the Zabbix Server side for heavy loads. For instance, environments with 10,000–50,000 or more new values per second will benefit from the improved performance.
  • Graceful startup of the Zabbix Server, which is very useful especially for instances with thousands or tens of thousands of proxies.

NOTE. When the server goes down, for instance, for maintenance or for an upgrade from one minor version to another, you have downtime in the range of 30 seconds to a few minutes. When the Zabbix Server is back again, all proxies start pushing large amounts of data at the same time. So, it's really important to maintain the stability and good health of the Zabbix Server at this moment. That's why we have implemented a graceful startup for the Zabbix Server.

Universal global scripts

Another nice improvement simplifying the Zabbix setup — universal global scripts.

First, we introduced JavaScript Webhooks for global scripts for easy integration with third-party alerting and ticketing systems. Zabbix uses Webhooks for many different purposes — preprocessing, integration, data collection, etc. Now you can use them for global scripts as well.

Global scripts can now be used for everything: auto-remediation, alerting, integrations, and manual execution on hosts from the Zabbix UI. When you define a global script, you can also specify where it should be used, for instance, only for Action operations.

Universal global scripts’ parameters

In this example, this particular script is based on a Webhook; it accepts parameters such as the event ID, event severity, and the tags in JSON format, and it opens a ticket in ServiceNow. After the script has been defined, you need to navigate to Actions and define which operations should be executed. You can send a message to admins or just open a ticket in the service desk.

Defining Actions

It is a very simple, easy-to-use, and easy-to-understand feature that simplifies the configuration of actions.

Powerful value maps

In Zabbix 5.4, we have introduced some changes related to value maps. The value map is a simple way to convert, for instance, numeric values collected by Zabbix, into human-readable values.

A common example would be monitoring the state of a service, which returns the number zero or one, with zero meaning ‘Down’ and one meaning ‘Up’. In this case, we can define a value map based on an exact value match. This is the existing behavior.

In Zabbix 5.4, we extended it to support matching by ranges. So, you can define that if you receive a status between 0 and 127, the service is considered ‘Down’, and any other value means ‘Up’, or vice versa.

We have also introduced matching by regular expression. Now, when you define value mapping, for a service state like in my example, you may specify that if the value is in the range between 0 and 31 or 64 and 127, it will be mapped to ‘Up’, and any other value to ‘Down’.

Value mapping by range

Value maps for templates and hosts

We have introduced value maps on the template level and on the host level. This means that we do not support global value maps anymore.

Value maps were easy to maintain on the global level, but only for smaller installations. As soon as you have multiple templates and different teams working with a different set of templates, value maps become a nightmare to maintain on a global level. In addition, global value mapping hinders multi-tenancy support for any object that is linked to a value map.

Advantages of value maps on the template level:

  • Now we can deliver self-contained templates without any references to global objects. A template contains all the information needed for monitoring: a set of items, triggers, graphs, template-level dashboards, and value maps. Templates have become truly independent. You can go to zabbix.com/integrations, download a template to monitor, for instance, a Cisco device, and be absolutely sure that this template will work on your system, as it isn't linked to any global elements.
  • This new feature enables better support for multi-tenant environments.
  • We have also introduced new mass actions — mass-update operations for easier management of value maps on a template level.

You can Add, Update, Rename, Remove, or Remove all value maps.

Mass-update operations

Usability improvements

  • In Zabbix 5.4, we have introduced a number of usability improvements. The one visible straight away is the third-level menu for better navigation. The features hidden under Administration > General (Autoregistration, Housekeeping, Images, Icon mapping, Regular expressions, etc.) were difficult to spot. The third-level menu provides better visibility for these submenus.

Third-level menu

  • In Zabbix, there are some usability problems where you sometimes have to jump from one page to another, for instance, from a graph to Latest data, or from Latest data to a host. We have been thinking about how to keep users focused on what they do, so we have introduced modal windows for mass-update and import forms. It is a small though important step in improving overall usability: when you select a list of hosts for a mass-update operation, you stay on that list and don't have to go to a separate page.

Modal windows for mass-update and import forms

  • Another small but useful feature is negated filtering for tags. Now, in Monitoring > Problems, you can, for instance, display all problems that are not from your staging environment. You can define a tag condition such as ‘Env’ Does not equal ‘Staging’ and save it as a new view ‘Excluding staging’.

Negated filtering by tags

Better support of XML preprocessing

Zabbix has supported XPath preprocessing for XML data for four years. Now we have decided to introduce another conversion — a native conversion from XML to JSON format.

Since most of the operations in Zabbix are JSON-based, it is a nice way to work with XML data — you convert your XML document to JSON as, for instance, the first preprocessing step, and then you work with this data as with any JSON document.

Better XML preprocessing
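
Zabbix performs this conversion natively as a preprocessing step. Purely to illustrate the shape of the transformation, here is a small Python sketch using the third-party xmltodict package with a made-up XML document:

import json

import xmltodict  # pip install xmltodict

xml_payload = """
<health>
  <service name="nginx">
    <status>up</status>
    <connections>42</connections>
  </service>
</health>
"""

# Convert the XML document into a dictionary and print it as JSON,
# which is roughly what the XML-to-JSON preprocessing step produces.
print(json.dumps(xmltodict.parse(xml_payload), indent=2))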

More improvements

There are also security-related changes and other changes related to real-time export.

  • Security-related:
    — Support of all SNMPv3 encryption protocols.
    — Unified error messages in case of unsuccessful login.
    — Disabled autocomplete for password fields in Zabbix UI.
  • Real-time export:
    — Information about event severity is included in real-time export files.
    — More granular configuration of information exported in real-time.
  • Also:
    — Support of VMware cluster monitoring.
    — Support of filtering by the presence of LLD macro for low-level discovery.
    — Support of macro {ITEM.VALUETYPE} for notifications.
    — Support of service name lookup for Oracle for HA setups, so that Zabbix can switch from one node to another.
    — Support of NTLM authentication for JavaScript Webhooks.
    — Support of multiple JMX metrics having the same key on one host.
    — Increased size of memory available to JavaScript Webhooks and preprocessing.
    — CurlHttpRequest renamed to HttpRequest in Webhooks.
    — ‘Alias‘ renamed to ‘Username‘ in user configuration.
    — American English is the default language of the Zabbix UI and also the Zabbix documentation.

New integrations and templates

  • With any new Zabbix major version, we have new integrations and templates. Zabbix 5.4 is not an exception, and it comes with new integrations with iTOP, VictorOps, Rocket.Chat, Signal, Express.ms, and other solutions.
  • We have also introduced a new set of official templates for monitoring APC UPS hardware, Hikvision cameras, etcd, Hadoop, ZooKeeper, Kafka, AMQ, HashiCorp Vault, MS SharePoint, MS Exchange, smartctl, GitLab, Jenkins, Apache Ignite, and more applications and services.

New integrations and templates

To find out what Zabbix is capable of monitoring and what integrations are available with third-party systems, you are welcome to visit zabbix.com/integrations, as the solution you want to monitor or the system you'd like to integrate Zabbix with might already be supported, either officially or as a community-supported solution.

Official solutions for monitoring and alerting

Recently, we introduced device-specific templates on our integration page, where you can see what devices are currently supported. For instance, we started with APC devices. Here, if you click the required device, you will go to the page with the template for the specific device.

Templates for hardware vendor devices

We are planning to increase the number of out-of-the-box supported vendor devices, and we'll have a look at Cisco, Juniper, F5, and some other vendors very soon.

NOTE. Zabbix is a free and open-source solution. We don’t have any closed source components. We are just as free as Linux and use exactly the same GPLv2 license. You can download Zabbix from zabbix.com/download and deploy it anywhere you want.

Deploy Zabbix on-premise

We support a range of the most popular platforms. You may deploy Zabbix on macOS or Windows, in Docker containers, or in the cloud on AWS, Azure, OpenStack, DigitalOcean, or Google Cloud. Recently, we have introduced support for Linode.

Deploy in the cloud

To have your Zabbix instance running in a cloud, you don't need to spend a lot. You can use DigitalOcean or Linode, and for about $5 per month, you may have the Zabbix Server up and running with the Zabbix UI, capable of monitoring thousands of devices.

Breaking changes

  • Applications and screens are not supported and many related API methods are affected.
  • We don’t support global value maps anymore, so we have to switch mentally from supporting value maps globally to supporting value maps on a template level (preferred) or on a host level.
  • New syntax for trigger expressions and calculated metrics, which is easier to understand and use.
    —  Aggregate metrics are merged into calculated metrics.

Integration with Grafana

Following the release of Zabbix 5.4, our users quickly realized that the integration with Grafana was broken because of the aforementioned API changes. Since we no longer support applications and screens, a number of API methods changed, and these changes were documented just a few days before the official release of Zabbix 5.4.

Thanks to the swift reaction of Alexander Zobnin, maintainer of the Grafana plugin for Zabbix, this broken integration was fixed very quickly, and you now can easily and safely use Grafana with Zabbix 5.4.

Upgrade notes

  • Upgrading to Zabbix 5.4 from your existing 5.0, 5.2, 4.4, or 4.2 installation is as easy as usual. You need to install new binaries for the Zabbix Server and Zabbix Proxies, upgrade the Zabbix UI, and start the Zabbix Server; it will upgrade the structure of your database automatically.
  • All trigger expressions, calculated and aggregate metrics will be automatically converted to the new syntax.
  • All applications will be automatically converted to tags. For instance, “CPU” will be converted to the tag “Application:CPU”.
  • Global value maps will be moved to the template and host level.
  • All screens will be automatically converted to Dashboards.

NOTE. The next LTS release, Zabbix 6.0, is expected by the end of 2021. Our development team is working on High Availability for the Zabbix Server, which will be available out of the box so that you can install the Zabbix Server and run it in high-availability cluster mode.

We are also investing a lot of time in improving Business Service Monitoring (BSM) in Zabbix, which covers the service tree, dependencies between services, business service SLAs, and SLA reporting.

Zabbix roadmap

More information about the expected features is available at zabbix.com/roadmap. Here, you may click the version you're interested in, for instance, 6.0, 6.2, 6.4, or 7.0 LTS. We are now keeping a dashboard of improvements, so you will immediately see any progress along our 2.5-year plan.

Zabbix roadmap

Now, we maintain a dynamic dashboard for our roadmap displaying the progress of development of any feature. For instance, ‘Support of multi-tenancy for Services’ is marked with ‘In dev’, as it is under development and has been added to the roadmap recently.

In addition, this feature has a reference to the ZBXNEXT-59 issue, where more detailed information on the feature is available. The ‘Top voted’ mark here means that the feature has been voted for by many Zabbix users and community members.

Questions & Answers

Question. Is migration from screens to dashboards automatic? How does that work?

Answer. Yes, all screens, both global screens and template-level screens, will be automatically converted to global dashboards and template-level dashboards. You don't need to worry about that; all information will be preserved.

Question. Now you have dashboards delivered as reports. Do you plan to natively send these dashboards to some third-party integration, for instance, embed them in a frame on a company's website?

Answer. If we have such plans, they should be on the roadmap. You can do it right now using some sort of fixed-frame technology, though it is an ugly way to integrate one solution with another. I am looking forward to adding the ability to include a widget or a dashboard into a third-party HTML page. I think it will be implemented in Zabbix sooner or later.

Question. Do you have any plans to have reports based on some other sections of Zabbix, such as inventory reports, or availability reports, and so on?

Answer. This functionality can evolve as a result of improving the visualization capabilities of Zabbix dashboards, more specifically, by introducing additional widgets for different purposes, such as geographical maps, data table widgets (already planned for Zabbix 6.0), capacity planning, and widgets for all possible use cases. So, as soon as we have a rich set of widgets for dashboards, those visualizations will automatically carry over into PDF reports.

Some changes planned for Zabbix 6.0 are related to business service monitoring and as part of this development, we will also introduce new widgets made specifically for SLA reporting. So, we are developing in this direction — we are going to have more widgets, more widget flexibility, and maybe more visually appealing widgets in the future.

Question. Why did we implement value maps the way we did? Why didn't we implement them on three levels, just as user macros exist on the global, template, and host levels?

Answer. If we implemented it in three levels, the global value maps would still be there, and it would be very hard to manage value maps in this case. Suppose you have a global map defined, such as a service state. What should Zabbix do after you import a new template with the same value map service state, which is defined on a different level? Should it keep the value map service state on a template level or should it upgrade the global service state without creating the service state on the template level? This would introduce a new set of different problems and confusion in the end. So, we really need to keep templates independent to prevent those hidden dependencies on global objects such as value maps, especially for larger Zabbix deployments. Even one hundred templates in your setup might become a huge problem from the maintenance point of view.

Question. Why do we release so many intermediate versions? Why don’t we support versions 5.2 and 5.4 for two or three years?

Answer. The reason is very simple. We maintain backward and forward compatibility within one major release, and we guarantee that the database structure remains intact. So, if you install 5.4.0, everything within 5.4 (5.4.0, 5.4.2, 5.4.10) remains backward and forward compatible, and the database structure remains the same. If we started introducing new features in minor releases, we would have to modify and extend the structure of the database, and then minor versions of a major release would not be compatible anymore. I don't think that is a good approach, as you would have no possibility to downgrade if a newer version of Zabbix doesn't work the way you want. This approach is described in the release cycle document on our website and has proved its feasibility over time.

We support 5.2 and 5.4 only for several months so as not to put additional load on our support team. We have two different types of releases: LTS releases with five-year support and non-LTS releases. If we started supporting everything, then at any given moment there would be 10 major Zabbix versions supported by our team. If a customer reports a problem, we have to fix it, for instance, create a patch, follow the QA procedure, test the solution very carefully, etc. Even Microsoft doesn't maintain 10 major versions. So, we support just a few LTS releases at a time.

Question. Will you continue support of Red Hat 7 or CentOS 7 when Zabbix 6.0 is released? What about RHEL 7?

Answer. We dropped support of CentOS 7 for a good reason. We discovered that some of the software we depend on, for instance, for TLS encryption, is outdated in CentOS 7, and there are other outdated dependencies, for example, PHP. In addition, we realized that Zabbix 6.0 will not support CentOS 7 anymore, as Zabbix 6.0 will be supported for five years and we just won't be able to support CentOS 7 for an extra five years starting from the end of 2021. So, we could either drop support of CentOS 7 starting from Zabbix 5.2, or keep it supported in Zabbix 5.2 and Zabbix 5.4 and drop it starting from Zabbix 6.0. We decided to drop support of CentOS 7 for the Zabbix Server, Zabbix Proxy, and Zabbix UI immediately, starting from Zabbix 5.2. Otherwise, we would also have had to rely on a third-party repository to get the particular versions of the software dependencies that Zabbix requires.

Now Open Third Availability Zone in the AWS China (Beijing) Region

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/now-open-third-availability-zone-in-the-aws-china-beijing-region/

I made my first trip to China in late 2008. I was able to speak to developers and entrepreneurs and to get a sense of the then-nascent market for cloud computing. With over 900 million Internet users as of 2020 (according to a recent report from China Internet Network Information Center), China now has the largest user base in the world.

A limited preview of the China (Beijing) Region was launched in 2013 and brought to general availability in 2016. A year later the AWS China (Ningxia) Region launched. In order to comply with China's legal and regulatory requirements, we collaborated with local Chinese partners. These partners have the proper telecom licenses to provide cloud services in Mainland China. Today, developers can deploy cloud-based applications inside of China using the same APIs, protocols, services, and operational practices used by our customers in other parts of the world. This commonality has been particularly attractive to multinational companies that can take advantage of their existing AWS experience when they expand their cloud infrastructure into Mainland China.

Third Availability Zone in Beijing
Today I am happy to announce that we are adding a third Availability Zone (AZ) to the China (Beijing) Region operated by Sinnet in order to support the demands of our growing customer base in China. As is the case with all AWS Availability Zones, this one encompasses one or more discrete data centers in separate facilities, each with redundant power, networking, and connectivity. With this launch, both AWS Regions in China offer three AZs and allow customers to build applications that are scalable, fault-tolerant, and highly available.

AWS Customers in the Beijing Region
Many enterprise customers in China are using the China (Beijing) Region to support their digital transformations. For example:

Yantai Shinho was founded in 1992 and now manufactures 13 popular condiments. They now have a presence in over 100 countries and supply products that tens of millions of families enjoy. They are already using the region to support their front-end and big data efforts, and plan to make use of the additional architectural options made possible by the new AZ.

Kingdee International Software Group was founded in 1993 and now provides corporate management and cloud services for more than 6.8 million enterprises and government organizations. They now have over 8,000 employees and are committed to changing the way that hundreds of millions of people work.

As I noted earlier, our multinational customers are using the AWS Regions in China to expand their global presence. Here are a few examples:

Australian independent software vendor Canva offers its design-on-demand application to 150 million active users in 190 countries. They launched their Chinese products in August 2018, and have since built it into a first-class design platform that includes tens of millions of high-resolution pictures, Chinese fonts, original templates, and more. Chinese users have already created over 50 million designs on the Canva.cn platform.

Swire is a 200-year-old business group that spans the aviation, beverage, food, industrial, marine services, and property industries. Their Swire Coca-Cola division has the exclusive right to manufacture, market, and distribute Coca-Cola products in eleven Chinese provinces, the Shanghai Municipality, Hong Kong, Taiwan, and part of the Western United States — a total customer base of 728 million people. Swire Coca-Cola's systems primarily operate in the China (Beijing) Region and will soon make use of the third AZ.

Finally, startups are using the region to power their fast-growing businesses:

CraiditX applies machine learning technology originally developed for search engines to the financial services industry. Established in 2015, they use behavioral language processing, natural language processing, neural networks, and integrated modeling to build risk management systems.

Founded in 2016, Momenta is a Chinese startup that is building a “brain” for autonomous vehicles. Powered by deep learning and data-driven path planning, they are working on autonomous driving for passenger vehicles and full autonomy for mobile service vehicles, all deployed in the China (Beijing) Region.

81 and 25
This launch raises the global AWS footprint to a total of 81 Availability Zones across 25 geographic regions, with plans to launch 18 additional Availability Zones and six more regions in Australia, India, Indonesia, Spain, Switzerland, and United Arab Emirates (UAE).

–Jeff, with lots of help from Lillian Shao;

PS – The operator and service provider for the AWS China (Beijing) Region is Beijing Sinnet Technology Co., Ltd. The operator and service provider for the AWS China (Ningxia) Region is Ningxia Western Cloud Data Technology Co., Ltd.

Introducing the newest AWS Heroes – June, 2021

Post Syndicated from Ross Barich original https://aws.amazon.com/blogs/aws/introducing-the-newest-aws-heroes-june-2021/

We at AWS continue to be impressed by the passion AWS enthusiasts have for knowledge sharing and supporting peer-to-peer learning in tech communities. A select few of the most influential and active community leaders in the world, who truly go above and beyond to create content and help others build better & faster on AWS, are recognized as AWS Heroes.

Today we are thrilled to introduce the newest AWS Heroes, including the first Heroes based in Perú and Ukraine:

Anahit Pogosova – Tampere, Finland

Data Hero Anahit Pogosova is a Lead Cloud Software Engineer at Solita. She has been architecting and building software solutions with various customers for over a decade. Anahit started working with monolithic on-prem software, but has since moved all the way to the cloud, nowadays focusing mostly on AWS Data and Serverless services. She has been particularly interested in the AWS Kinesis family and how it integrates with AWS Lambda. You can find Anahit speaking at various local and international events, such as AWS meetups, AWS Community Days, ServerlessDays, and Code Mesh. She also writes about AWS on Solita developers’ blog and has been a frequent guest on various podcasts.

Anurag Kale – Gothenburg, Sweden

Data Hero Anurag Kale is a Cloud Consultant at Cybercom Group. He has been using AWS professionally since 2017 and holds the AWS Solutions Architect – Associate certification. He is a co-organizer of the AWS User Group Pune; helping host and organize AWS Community Day Pune 2020 and AWS Community Day India 2020 – Virtual Edition. Anurag’s areas of interest include Amazon DynamoDB, relational databases, serverless data pipelines, data analytics, Infrastructure as Code, and sustainable cloud solutions. He is an active advocate of DynamoDB and Amazon Aurora, and has spoken at various national and international events such as AWS Community Day Nordics 2020 and various AWS Meetups.

Arseny Zinchenko – Kiev, Ukraine

Container Hero Arseny Zinchenko has over 15 years in IT, and currently works as a DevOps Team Lead and Data Security Officer at BetterMe Inc., a leading health & fitness mobile publisher. Since 2011 Arseny has used his blog to share expertise about DevOps, system administration, containerization, and cloud computing. Currently he is focused primarily on Amazon Elastic Kubernetes Service (EKS) and security solutions provided by AWS. He is a member of the biggest Ukrainian DevOps community, UrkOps, where he helps others to build their best with AWS and containers. He also helps them implement DevOps methodology in their organizations by using Amazon CloudFormation and AWS managed services like Amazon RDS, Amazon Aurora, and EKS.

Azmi Mengü – Istanbul, Turkey

Community Hero Azmi Mengü is a Sr. Software Engineer on the Infrastructure Team at Armut / HomeRun. He has over 5 years of AWS cloud development experience and has expertise in serverless, containers, data, and storage services. Since 2019, Azmi has been on the organizing committee of the Cloud and Serverless Turkey community. He co-organized and acted as a speaker at over 50 physical and online events during this time. He actively writes blog posts about developing serverless, container, and IaC technologies on AWS. Azmi also co-organized the first-ever ServerlessDays Istanbul in Turkey and AWS Community Day Turkey events.

Carlos Cortez – Lima, Perú

Community Hero Carlos Cortez is the founder and leader of AWS User Group Perú and Founder and CTO of CENNTI, which helps Peruvian companies in their difficult journey to the cloud and the development of Machine Learning solutions. The two biggest AWS events in Perú, AWS Community Day Lima 2019 and AWS UG Perú Conference in 2021, were organized by Carlos. He is the owner of the first AWS Podcast in Latam, “Imperio Cloud” and “Al día con AWS”. He loves to create content for emerging technologies, which is why he created DeepFridays to educate people in Reinforcement Learning.

Chris Miller – Santa Cruz, USA

Machine Learning Hero Chris Miller is an entrepreneur, inventor, and CEO of Cloud Brigade. After winning the 2019 AWS DeepRacer Summit race in Santa Clara, he founded the Santa Cruz DeepRacer Meetup group. Chris has worked with AWS AI/ML product teams with DeepLens and DeepRacer on projects including The Poopinator, and What’s in my Fridge. He prides himself on being a technical founder with experience across a broad range of disciplines, which has led to a lot of crazy projects in competitions and hackathons, such as an automated beer brewery, animatronic ventriloquist dummy, and his team even won a Cardboard Boat Race!

Gert Leenders – Brussels, Belgium

DevTools Hero Gert Leenders started his career as a developer in 2001. Eight years ago, his focus shifted entirely towards AWS. Today, he’s an AWS Cloud Solution Architect helping teams build and deploy cloud-native applications and manage their cloud infrastructure. On his blog, Gert emphasizes hidden gems in AWS developer tools and day-to-day topics for cloud engineers like logging, debugging, error handling and Infrastructure as Code. He also often shares code on GitHub.

Lei Wu – Beijing, China

Machine Learning Hero Lei Wu is head of the machine learning team at FreeWheel. He enjoys sharing technology with others, and he publishes many Chinese language tech blogs at infoQ, covering machine learning, big data, and distributed computing systems. Lei works hard to promote deep learning adoption with AWS services wherever he can, including talks at Spark Summit China, World Artificial Intelligence Conference, AWS Innovate AI/ML edition, and AWS re:Invent where he shared FreeWheel’s best practices on deep learning with Amazon SageMaker.

Hidetoshi Matsui – Hamamatsu, Japan

Serverless Hero Hidetoshi Matsui is a developer at Startup Technology Inc. and a member of the Japan AWS User Group (JAWS-UG). On “builders.flash,” a web magazine for developers run by AWS Japan, the articles he has contributed are among the most viewed pages on the site since 2020. His most impactful achievement is the construction of a distribution site for JAWS-UG’s largest event, JAWS DAYS 2021 re:Connect. He made full use of various AWS services to build a low-latency and scalable distribution system with a serverless architecture and smooth streaming video viewing for nearly 4000 participants.

Philipp Schmid – Nuremberg, Germany

Machine Learning Hero Philipp Schmid is a Machine Learning & Tech Lead at Hugging Face, working to democratize artificial intelligence through open source and open science. He has extensive experience in deep learning, deploying NLP models into production using AWS Lambda, and is an avid advocate of Amazon SageMaker to simplify machine learning, such as “Distributed Training: Train BART/T5 for Summarization using Transformers and Amazon SageMaker.” He loves to share his knowledge on AI and NLP at various meetups such as Data Science on AWS, and on his technical Blog.

Simone Merlini – Milan, Italy

Community Hero Simone Merlini is CEO and CTO at beSharp. In 2012 he co-founded the first AWS User Group in Italy, and he’s currently the organizer of the AWS User Group Milan. He’s also actively involved in the development of Leapp, an open-source project for managing and securing Cloud access in multi-account environments. Simone is also the editor in chief and writer for Proud2beCloud, a blog aimed to share highly specialized AWS knowledge to enable the adoption of cloud technologies.

Virginie Mathivet – Lyon, France

Machine Learning Hero Virginie Mathivet has been leading the DataSquad team at TeamWork since 2017, focused on Data and Artificial Intelligence. Their purpose is to make the most of their clients’ data, via Data Science or Data Engineering / Big Data, mainly on AWS. Virginie regularly participates in conferences and writes books and articles, both for the public (introduction to AI) and for an informed audience (technical subjects). She also campaigns for a better visibility of women in the digital industry and for diversity in the data professions. Her favorite cloud service? Amazon SageMaker of course!

Walid A. Shaari – Dhahran, Saudi Arabia

Container Hero Walid A. Shaari is the community lead for the Dammam Cloud-native AWS User Group, working closely with CNCF ambassadors, K8saraby, and AWS MENA community leaders to enable knowledge sharing, collaboration, and networking. He helped organize the first AWS Community Day – MENA 2020. Walid also maintains GitHub content for Certified Kubernetes Administrators (CKA) and Certified Kubernetes Security Specialists (CKS), and holds several active professional certifications: AWS Certified Associate Solutions Architect, Certified Kubernetes Administrator, Certified Kubernetes Application Developer, Red Hat Certified Architect level IV, and more.

If you’d like to learn more about the new Heroes, or connect with a Hero near you, please visit the AWS Hero website.

Ross;

Kinda a big announcement

Post Syndicated from Joel Spolsky original https://www.joelonsoftware.com/2021/06/02/kinda-a-big-announcement/

The other day I was talking to a young developer working on a code base with tons of COM code, and I told him that even before he was born, everyone knew that COM was already so deeply obsolete that it was impossible to find anyone who knew enough to work on it. And yet they still have this old COM code base, and they still have one old programmer holding onto their job by being the only human left on the planet with a brain big enough to manually manage multithreaded objects. I remember that COM was like Gödel's Theorem: it seemed important, and you could understand it all long enough to pass an exam, but ultimately it is mostly just a demonstration of how far human intelligence can be made to stretch under extreme duress.

And, bubbeleh, if there is one thing we have learned, it’s that the things that make it easier on your brain are the things that matter.

Programming changes slowly. Really slowly.

Since I learned to code forty years ago, one thing that has mostly, mostly, changed about programming is that most developers no longer have to manage their own memory. Even getting that going took a long long time.

I took a few stupid years trying to be the CEO of a growing company during which I didn't have time to code, and when I came back to web programming, after a break of about 10 years, I found Node, React, and other goodies, which are, don't get me wrong, amazing? Really really great? But I also found that it took approximately the same amount of work to make a CRUD web app as it always has, and that there were some things (like handling a file upload, or centering) that were, shockingly, still just as randomly difficult as they were in VBScript twenty years ago.

Where are the flying cars?

The biggest problem is that developers of programming tools love to add things and hate to take things away. So things get harder and harder and more and more complex because there are more and more ways to do the same thing, each has pros and cons, and you are likely to spend as much time just figuring out which “rich text editor” to use as you are to implement it.

(Bill Gates, 1990: “How many f*cking programmers in this company are working on rich text editors?!”)

So, in this world of slow, gradual change and alleged improvement, one thing did change literally overnight, or, to be precise, on September 15, 2008, which was when Stack Overflow launched.

Six-to-eight weeks before that, Stack Overflow was only an idea. (*Actually Jeff started development in April). Six-to-eight weeks after that, it was a standard part of every developer’s toolkit: something they used every day. Something had changed about programming, and changed very fast: the way developers learned and got help and taught each other.

For many years, I was able to coast by telling delightful little stories about our incredible growth numbers, about the pay web site we made obsolete, and even about that one time when a Major Computer Book Publisher threatened to BURY US and launched their own Q&A platform, which turned out to be more of a Scoff Generator than a Q&A platform, but actually now almost anyone I talk to is too young to imagine The Days Before Stack Overflow, when the bookstore had an entire wall of Java and the way you picked a Rich Text Editor was going to Barnes and Noble and browsing through printed books for an hour, in the Rich Text Editor Component shelf.

Stack Overflow got to be pretty big as a business. The company grew faster than any individual’s skills at managing companies, especially mine, so a lot of the business team has changed over, and we now have a really world-class, experienced team that is doing much better than us founders. We’ve done incredible work building a recruiting platform for great developers, a “reach and relevance” platform for getting developers excited about your products, and, most importantly, Stack Overflow for Teams, which is growing so quickly that soon every developer in the world will be using the power of Stack Overflow to get help with their own code base, not just common languages and libraries.

And yeah, the one thing I made sure of was that everyone that came into the company understood exactly why Stack Overflow works, and what is important to the developers that it is by and for. So while we haven’t always been perfect, we have kept true to our mission, and the current leadership is just as committed to the vision of Stack Overflow as the founders are.

Today we’re pleased to announce that Stack Overflow is joining Prosus. Prosus is an investment and holding company, which means that the most important part of this announcement is that Stack Overflow will continue to operate independently, with the exact same team in place that has been operating it, according to the exact same plan and the exact same business practices. Don’t expect to see major changes or awkward “synergies”. The business of Stack Overflow will continue to focus on Reach and Relevance, and Stack Overflow for Teams. The entire company is staying in place: we just have different owners now.

This is, in some ways, the best possible outcome. Stack Overflow stays independent. The company has plenty of cash on hand to expand and deliver more features and fix the old broken ones. Right now, the biggest gating factor to how fast we can do this is just how fast we can hire excellent people.

I've been out of the day-to-day for a while now. Together with David Wilkinson, I'm helping to build HASH. HASH makes it easy to build powerful simulations and make better decisions. As we worked on that, we discovered that too much of the data that you might need to run simulations needs to be fixed up before you can use it. That's because data is often published on the web, using page description languages that are more concerned with formatting and consumption by humans. They lack the structure to make the data they contain readily accessible programmatically, so step one is miserable screen scraping and data cleanup. That's where a lot of people give up.

We think we have an interesting way to fix this. If it works, we’ll change the web as quickly and completely as Stack Overflow changed programming. But it’s kind of ambitious and maybe a little too GRAND. If you are interested in joining that crazy journey, do reach out. The whole thing is going to be open source, so just hang on, and we’ll have something up on GitHub for you to play with.

See you soon!

Amazon Redshift ML Is Now Generally Available – Use SQL to Create Machine Learning Models and Make Predictions from Your Data

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/amazon-redshift-ml-is-now-generally-available-use-sql-to-create-machine-learning-models-and-make-predictions-from-your-data/

With Amazon Redshift, you can use SQL to query and combine exabytes of structured and semi-structured data across your data warehouse, operational databases, and data lake. Now that AQUA (Advanced Query Accelerator) is generally available, you can improve the performance of your queries by up to 10 times with no additional costs and no code changes. In fact, Amazon Redshift provides up to three times better price/performance than other cloud data warehouses.

But what if you want to go a step further and process this data to train machine learning (ML) models and use these models to generate insights from data in your warehouse? For example, to implement use cases such as forecasting revenue, predicting customer churn, and detecting anomalies? In the past, you would need to export the training data from Amazon Redshift to an Amazon Simple Storage Service (Amazon S3) bucket, and then configure and start a machine learning training process (for example, using Amazon SageMaker). This process required many different skills and usually more than one person to complete. Can we make it easier?

Today, Amazon Redshift ML is generally available to help you create, train, and deploy machine learning models directly from your Amazon Redshift cluster. To create a machine learning model, you use a simple SQL query to specify the data you want to use to train your model, and the output value you want to predict. For example, to create a model that predicts the success rate for your marketing activities, you define your inputs by selecting the columns (in one or more tables) that include customer profiles and results from previous marketing campaigns, and the output column you want to predict. In this example, the output column could be one that shows whether a customer has shown interest in a campaign.

After you run the SQL command to create the model, Redshift ML securely exports the specified data from Amazon Redshift to your S3 bucket and calls Amazon SageMaker Autopilot to prepare the data (pre-processing and feature engineering), select the appropriate pre-built algorithm, and apply the algorithm for model training. You can optionally specify the algorithm to use, for example XGBoost.

Architectural diagram.

Redshift ML handles all of the interactions between Amazon Redshift, S3, and SageMaker, including all the steps involved in training and compilation. When the model has been trained, Redshift ML uses Amazon SageMaker Neo to optimize the model for deployment and makes it available as a SQL function. You can use the SQL function to apply the machine learning model to your data in queries, reports, and dashboards.

Redshift ML now includes many new features that were not available during the preview, including Amazon Virtual Private Cloud (VPC) support. For example:

Architectural diagram.

  • You can also create SQL functions that use existing SageMaker endpoints to make predictions (remote inference). In this case, Redshift ML is batching calls to the endpoint to speed up processing.

Before looking into how to use these new capabilities in practice, let’s see the difference between Redshift ML and similar features in AWS databases and analytics services.

  • Amazon Redshift ML
    — Data: data warehouse, federated relational databases, and the S3 data lake (with Redshift Spectrum).
    — Training from SQL: yes, using Amazon SageMaker Autopilot.
    — Predictions using SQL functions: yes, a model can be imported and executed inside the Amazon Redshift cluster, or invoked using a SageMaker endpoint.
  • Amazon Aurora ML
    — Data: relational database (compatible with MySQL or PostgreSQL).
    — Training from SQL: no.
    — Predictions using SQL functions: yes, using a SageMaker endpoint. A native integration with Amazon Comprehend for sentiment analysis is also available.
  • Amazon Athena ML
    — Data: S3 data lake. Other data sources can be used through Athena Federated Query.
    — Training from SQL: no.
    — Predictions using SQL functions: yes, using a SageMaker endpoint.

Building a Machine Learning Model with Redshift ML
Let’s build a model that predicts if customers will accept or decline a marketing offer.

To manage the interactions with S3 and SageMaker, Redshift ML needs permissions to access those resources. I create an AWS Identity and Access Management (IAM) role as described in the documentation. I use RedshiftML for the role name. Note that the trust policy of the role allows both Amazon Redshift and SageMaker to assume the role to interact with other AWS services.

From the Amazon Redshift console, I create a cluster. In the cluster permissions, I associate the RedshiftML IAM role. When the cluster is available, I load the same dataset used in this super interesting blog post that my colleague Julien wrote when SageMaker Autopilot was announced.

The file I am using (bank-additional-full.csv) is in CSV format. Each line describes a direct marketing activity with a customer. The last column (y) describes the outcome of the activity (if the customer subscribed to a service that was marketed to them).

Here are the first few lines of the file. The first line contains the headers.

age,job,marital,education,default,housing,loan,contact,month,day_of_week,duration,campaign,pdays,previous,poutcome,emp.var.rate,cons.price.idx,cons.conf.idx,euribor3m,nr.employed,y
56,housemaid,married,basic.4y,no,no,no,telephone,may,mon,261,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
57,services,married,high.school,unknown,no,no,telephone,may,mon,149,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
37,services,married,high.school,no,yes,no,telephone,may,mon,226,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no
40,admin.,married,basic.6y,no,no,no,telephone,may,mon,151,1,999,0,nonexistent,1.1,93.994,-36.4,4.857,5191.0,no

I store the file in one of my S3 buckets. The S3 bucket is used to unload data and store SageMaker training artifacts.
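
If you prefer to script this step rather than use the console, a quick boto3 sketch could look like the following; the bucket name and key simply mirror the path used in the COPY command below.

import boto3

# Upload the training data to the S3 bucket used by Redshift ML.
s3 = boto3.client("s3")
s3.upload_file(
    Filename="bank-additional-full.csv",
    Bucket="my-bucket",
    Key="direct_marketing/bank-additional-full.csv",
)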

Then, using the Amazon Redshift query editor in the console, I create a table to load the data.

CREATE TABLE direct_marketing (
	age DECIMAL NOT NULL, 
	job VARCHAR NOT NULL, 
	marital VARCHAR NOT NULL, 
	education VARCHAR NOT NULL, 
	credit_default VARCHAR NOT NULL, 
	housing VARCHAR NOT NULL, 
	loan VARCHAR NOT NULL, 
	contact VARCHAR NOT NULL, 
	month VARCHAR NOT NULL, 
	day_of_week VARCHAR NOT NULL, 
	duration DECIMAL NOT NULL, 
	campaign DECIMAL NOT NULL, 
	pdays DECIMAL NOT NULL, 
	previous DECIMAL NOT NULL, 
	poutcome VARCHAR NOT NULL, 
	emp_var_rate DECIMAL NOT NULL, 
	cons_price_idx DECIMAL NOT NULL, 
	cons_conf_idx DECIMAL NOT NULL, 
	euribor3m DECIMAL NOT NULL, 
	nr_employed DECIMAL NOT NULL, 
	y BOOLEAN NOT NULL
);

I load the data into the table using the COPY command. I can use the same IAM role I created earlier (RedshiftML) because I am using the same S3 bucket to import and export the data.

COPY direct_marketing 
FROM 's3://my-bucket/direct_marketing/bank-additional-full.csv' 
DELIMITER ',' IGNOREHEADER 1
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
REGION 'us-east-1';

Now, I create the model straight from the SQL interface using the new CREATE MODEL statement:

CREATE MODEL direct_marketing
FROM direct_marketing
TARGET y
FUNCTION predict_direct_marketing
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'my-bucket'
);

In this SQL command, I specify the parameters required to create the model:

  • FROM – I select all the rows in the direct_marketing table, but I can replace the name of the table with a nested query (see example below).
  • TARGET – This is the column that I want to predict (in this case, y).
  • FUNCTION – The name of the SQL function to make predictions.
  • IAM_ROLE – The IAM role assumed by Amazon Redshift and SageMaker to create, train, and deploy the model.
  • S3_BUCKET – The S3 bucket where the training data is temporarily stored, and where model artifacts are stored if you choose to retain a copy of them.

Here I am using a simple syntax for the CREATE MODEL statement. For more advanced users, other options are available, such as:

  • MODEL_TYPE – To use a specific model type for training, such as XGBoost or multilayer perceptron (MLP). If I don’t specify this parameter, SageMaker Autopilot selects the appropriate model class to use.
  • PROBLEM_TYPE – To define the type of problem to solve: regression, binary classification, or multiclass classification. If I don’t specify this parameter, the problem type is discovered during training, based on my data.
  • OBJECTIVE – The objective metric used to measure the quality of the model. This metric is optimized during training to provide the best estimate from data. If I don’t specify a metric, the default behavior is to use mean squared error (MSE) for regression, the F1 score for binary classification, and accuracy for multiclass classification. Other available options are F1Macro (to apply F1 scoring to multiclass classification) and area under the curve (AUC). More information on objective metrics is available in the SageMaker documentation.

Depending on the complexity of the model and the amount of data, it can take some time for the model to be available. I use the SHOW MODEL command to see when it is available:

SHOW MODEL direct_marketing

When I execute this command using the query editor in the console, I get the following output:

Console screenshot.

As expected, the model is currently in the TRAINING state.

When I created this model, I selected all the columns in the table as input parameters. I wonder what happens if I create a model that uses fewer input parameters? I am in the cloud and I am not slowed down by limited resources, so I create another model using a subset of the columns in the table:

CREATE MODEL simple_direct_marketing
FROM (
        SELECT age, job, marital, education, housing, contact, month, day_of_week, y
 	  FROM direct_marketing
)
TARGET y
FUNCTION predict_simple_direct_marketing
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML'
SETTINGS (
  S3_BUCKET 'my-bucket'
);

After some time, my first model is ready, and I get this output from SHOW MODEL. The actual output in the console spans multiple pages; I merged the results here to make it easier to follow:

Console screenshot.

From the output, I see that the model has been correctly recognized as BinaryClassification, and F1 has been selected as the objective. The F1 score is a metric that considers both precision and recall. It returns a value between 0 (lowest possible score) and 1 (perfect precision and recall). The final score for the model (validation:f1) is 0.79. In this table I also find the name of the SQL function (predict_direct_marketing) that has been created for the model, its parameters and their types, and an estimate of the training costs.

When the second model is ready, I compare the F1 scores. The F1 score of the second model is lower (0.66) than the first one. However, with fewer parameters the SQL function is easier to apply to new data. As is often the case with machine learning, I have to find the right balance between complexity and usability.

Using Redshift ML to Make Predictions
Now that the two models are ready, I can make predictions using SQL functions. Using the first model, I check how many false positives (wrong positive predictions) and false negatives (wrong negative predictions) I get when applying the model to the same data used for training:

SELECT predict_direct_marketing, y, COUNT(*)
  FROM (SELECT predict_direct_marketing(
                   age, job, marital, education, credit_default, housing,
                   loan, contact, month, day_of_week, duration, campaign,
                   pdays, previous, poutcome, emp_var_rate, cons_price_idx,
                   cons_conf_idx, euribor3m, nr_employed), y
          FROM direct_marketing)
 GROUP BY predict_direct_marketing, y;

The result of the query shows that the model is better at predicting negative rather than positive outcomes. In fact, even though the number of true negatives is much bigger than the number of true positives, there are many more false positives than false negatives. I added some comments in green and red to the following screenshot to clarify the meaning of the results.

Console screenshot.
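
To put numbers on that observation, precision and recall can be computed directly from the same predictions. This is a rough sketch I am adding for illustration; it assumes, as the queries in this post suggest, that both y and the prediction are booleans:

SELECT SUM(CASE WHEN prediction AND y THEN 1 ELSE 0 END)::float
       / NULLIF(SUM(CASE WHEN prediction THEN 1 ELSE 0 END), 0) AS model_precision,
       SUM(CASE WHEN prediction AND y THEN 1 ELSE 0 END)::float
       / NULLIF(SUM(CASE WHEN y THEN 1 ELSE 0 END), 0) AS model_recall
  FROM (SELECT predict_direct_marketing(
                   age, job, marital, education, credit_default, housing,
                   loan, contact, month, day_of_week, duration, campaign,
                   pdays, previous, poutcome, emp_var_rate, cons_price_idx,
                   cons_conf_idx, euribor3m, nr_employed) AS prediction, y  -- assumes boolean columns
          FROM direct_marketing) p;

Many false positives translate into low precision, which is consistent with the confusion matrix above.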

Using the second model, I see how many customers might be interested in a marketing campaign. Ideally, I should run this query on new customer data, not the same data I used for training.

SELECT COUNT(*)
  FROM direct_marketing
 WHERE predict_simple_direct_marketing(
           age, job, marital, education, housing,
           contact, month, day_of_week) = true;

Wow, looking at the results, there are more than 7,000 prospects!

Console screenshot.

Availability and Pricing
Redshift ML is available today in the following AWS Regions: US East (Ohio), US East (N. Virginia), US West (Oregon), US West (N. California), Canada (Central), Europe (Frankfurt), Europe (Ireland), Europe (Paris), Europe (Stockholm), Asia Pacific (Hong Kong), Asia Pacific (Tokyo), Asia Pacific (Singapore), Asia Pacific (Sydney), and South America (São Paulo). For more information, see the AWS Regional Services list.

With Redshift ML, you pay only for what you use. When training a new model, you pay for the Amazon SageMaker Autopilot and S3 resources used by Redshift ML. When making predictions, there is no additional cost for models imported into your Amazon Redshift cluster, as in the example I used in this post.

Redshift ML also allows you to use existing Amazon SageMaker endpoints for inference. In that case, the usual SageMaker pricing for real-time inference applies. Here you can find a few tips on how to control your costs with Redshift ML.
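
As a rough sketch of that bring-your-own-endpoint option (the model name, function signature, and endpoint name below are hypothetical), the statement delegates inference to the endpoint instead of training a new model:

-- Hypothetical example: predictions are served by an existing SageMaker endpoint.
CREATE MODEL remote_inference_model
FUNCTION predict_remote(int, varchar, float)
RETURNS float
SAGEMAKER 'my-existing-endpoint'
IAM_ROLE 'arn:aws:iam::123412341234:role/RedshiftML';

Once created, predict_remote can be called in SQL like the locally trained functions in this post.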

To learn more, you can see this blog post from when Redshift ML was announced in preview, as well as the documentation.

Start getting better insights from your data with Redshift ML.

Danilo

Getting Started with Amazon ECS Anywhere – Now Generally Available

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/getting-started-with-amazon-ecs-anywhere-now-generally-available/

Since Amazon Elastic Container Service (Amazon ECS) was launched in 2014, AWS has released other options for running Amazon ECS tasks outside of an AWS Region, such as AWS Wavelength, an offering for mobile edge devices, and AWS Outposts, a service that extends AWS infrastructure to customers’ environments using hardware owned and fully managed by AWS.

But some customers have applications that need to run on premises due to regulatory, latency, and data residency requirements or the desire to leverage existing infrastructure investments. In these cases, customers have to install, operate, and manage separate container orchestration software and need to use disparate tooling across their AWS and on-premises environments. Customers asked us for a way to manage their on-premises containers without this added complexity and cost.

Following Jeff’s preannouncement last year, I am happy to announce the general availability of Amazon ECS Anywhere, a new capability in Amazon ECS that enables customers to easily run and manage container-based applications on premises, including virtual machines (VMs), bare metal servers, and other customer-managed infrastructure.

With ECS Anywhere, you can run and manage containers on any customer-managed infrastructure using the same cloud-based, fully managed, and highly scalable container orchestration service you use in AWS today. By installing simple agents, you no longer need to prepare, run, update, or maintain your own container orchestrators on premises, which makes it easier to manage your hybrid environment and leverage the cloud for your infrastructure.

ECS Anywhere provides consistent tooling and APIs for all container-based applications and the same Amazon ECS experience for cluster management, workload scheduling, and monitoring both in the cloud and on customer-managed infrastructure. You can now enjoy the benefits of reduced cost and complexity by running container workloads such as data processing at edge locations on your own hardware to keep latency low, and in the cloud, all using a single, consistent container orchestrator.

Amazon ECS Anywhere – Getting Started
To get started with ECS Anywhere, register your on-premises servers or VMs (also referred to as External instances) in the ECS cluster. The AWS Systems Manager Agent, Amazon ECS container agent, and Docker must be installed on these external instances. Your external instances require an IAM role that permits them to communicate with AWS APIs. For more information, see Required IAM permissions in the ECS Developer Guide.
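
As an illustration of that role setup, a minimal sketch with the AWS CLI could look like the following (the role name is hypothetical, and the Required IAM permissions page is the authoritative reference for the policies to attach):

# Sketch only: create a role that external instances assume through Systems Manager,
# then attach the SSM and ECS managed policies they typically need.
aws iam create-role --role-name ecsAnywhereRole \
  --assume-role-policy-document '{
    "Version": "2012-10-17",
    "Statement": [{
      "Effect": "Allow",
      "Principal": { "Service": "ssm.amazonaws.com" },
      "Action": "sts:AssumeRole"
    }]
  }'
aws iam attach-role-policy --role-name ecsAnywhereRole \
  --policy-arn arn:aws:iam::aws:policy/AmazonSSMManagedInstanceCore
aws iam attach-role-policy --role-name ecsAnywhereRole \
  --policy-arn arn:aws:iam::aws:policy/service-role/AmazonEC2ContainerServiceforEC2Role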

To create a cluster for ECS Anywhere, on the Create Cluster page in the ECS console, choose the Networking Only template. This option is for use with either AWS Fargate or external instance capacity. We recommend that you use the AWS Region that is geographically closest to the on-premises servers you want to register.

This creates an empty cluster to register external instances. On the ECS Instances tab, choose Register External Instances to get activation codes and an installation script.

On the Step 1: External instances activation details page, in Activation key duration (in days), enter the number of days the activation key should remain active. The activation key can be used for up to 1,000 activations. In Number of instances, enter the number of external instances you want to register to your cluster. In Instance role, enter the IAM role to associate with your external instances.

Choose Next step to get a registration command.

On the Step 2: Register external instances page, copy the registration command. Run this command on the external instances you want to register to your cluster.
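
The generated command has roughly this shape; treat the following as an illustrative sketch (the Region, cluster name, and placeholders are mine) and copy the exact command, including the activation ID and code, from the console:

# Illustrative only: download and run the ECS Anywhere install script with the
# activation details generated by the console.
curl --proto "https" -o "/tmp/ecs-anywhere-install.sh" \
  "https://amazon-ecs-agent.s3.amazonaws.com/ecs-anywhere-install-latest.sh"
sudo bash /tmp/ecs-anywhere-install.sh \
  --region us-east-1 \
  --cluster my-external-cluster \
  --activation-id "<activation-id>" \
  --activation-code "<activation-code>"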

After you run the command on your on-premises servers or VMs, each external instance is registered as an AWS Systems Manager managed instance, which is in turn registered to your Amazon ECS cluster.

Both x86_64 and ARM64 CPU architectures are supported. The following is a list of supported operating systems:

  • CentOS 7, CentOS 8
  • RHEL 7
  • Fedora 32, Fedora 33
  • openSUSE Tumbleweed
  • Ubuntu 18, Ubuntu 20
  • Debian 9, Debian 10
  • SUSE Enterprise Server 15

When the ECS agent has started and completed the registration, your external instance will appear on the ECS Instances tab.

You can also add your external instances to an existing cluster. In this case, you can see Amazon EC2 instances and external instances together; the external instances are the ones prefixed with mi-*.

Now that the external instances are registered to your cluster, you are ready to create a task definition. Amazon ECS provides the requiresCompatibilities parameter to validate that the task definition is compatible with the EXTERNAL launch type when creating your service or running your standalone task. The following is an example task definition:

{
	"requiresCompatibilities": [
		"EXTERNAL"
	],
	"containerDefinitions": [{
		"name": "nginx",
		"image": "public.ecr.aws/nginx/nginx:latest",
		"memory": 256,
		"cpu": 256,
		"essential": true,
		"portMappings": [{
			"containerPort": 80,
			"hostPort": 8080,
			"protocol": "tcp"
		}]
	}],
	"networkMode": "bridge",
	"family": "nginx"
}

You can create a task definition in the ECS console. In Task Definition, choose Create new task definition. For Launch type, choose EXTERNAL and then configure the task and container definitions to use external instances.
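
If you prefer the AWS CLI, a sketch of registering the same task definition (saved locally under a hypothetical file name) looks like this:

# Register the task definition JSON shown above.
aws ecs register-task-definition --cli-input-json file://nginx-external.json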

On the Tasks tab, choose Run new task. On the Run Task page, for Cluster, choose the cluster to run your task definition on. In Number of tasks, enter the number of copies of that task to run with the EXTERNAL launch type.

Or, on the Services tab, choose Create. The Configure service page lets you specify how many copies of your task definition to run and maintain in the cluster. To run your tasks on the registered external instances, for Launch type, choose EXTERNAL. When you choose this launch type, load balancers, tag propagation, and service discovery integration are not supported.
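
The CLI equivalents of these two options look roughly like the following sketch (the cluster and service names are hypothetical):

# Run two standalone copies of the task on external instances.
aws ecs run-task --cluster my-external-cluster \
  --launch-type EXTERNAL --task-definition nginx --count 2

# Or keep two copies running as a service on external instances.
aws ecs create-service --cluster my-external-cluster \
  --service-name nginx-external --launch-type EXTERNAL \
  --task-definition nginx --desired-count 2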

The tasks you run on your external instances must use the bridge, host, or none network modes. The awsvpc network mode isn’t supported. For more information about each network mode, see Choosing a network mode in the Amazon ECS Best Practices Guide.

Now you can run your tasks and associate a mix of EXTERNAL, FARGATE, and EC2 capacity provider types with the same ECS service and specify how you would like your tasks to be split across them.

Things to Know
Here are a couple of things to keep in mind:

Connectivity: In the event of loss of network connectivity between the ECS agent running on the on-premises servers and the ECS control plane in the AWS Region, existing ECS tasks will continue to run as usual. If tasks still have connectivity with other AWS services, they will continue to communicate with them for as long as the task role credentials are active. If a task launched as part of a service crashes or exits on its own, ECS will be unable to replace it until connectivity is restored.

Monitoring: With ECS Anywhere, you can get Amazon CloudWatch metrics for your clusters and services, use the CloudWatch Logs driver (awslogs) to get your containers’ logs, and access the ECS CloudWatch event stream to monitor your clusters’ events.
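
For example, sending the nginx container’s logs to CloudWatch Logs is a matter of adding a log configuration such as the following to the container definition above (the log group name is a hypothetical placeholder, and JSON does not allow comments, so treat every value as illustrative):

"logConfiguration": {
  "logDriver": "awslogs",
  "options": {
    "awslogs-group": "/ecs/nginx-external",
    "awslogs-region": "us-east-1",
    "awslogs-stream-prefix": "ecs"
  }
}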

Networking: ECS external instances are optimized for running applications that generate outbound traffic or process data. If your application requires inbound traffic, such as a web service, you will need to employ a workaround to place these workloads behind a load balancer until the feature is supported natively. For more information, see Networking with ECS Anywhere.

Data Security: To help customers maintain data security, ECS Anywhere only sends back to the AWS Region metadata related to the state of the tasks or the state of the containers (whether they are running or not running, performance counters, and so on). This communication is authenticated and encrypted in transit through Transport Layer Security (TLS).

ECS Anywhere Partners
ECS Anywhere integrates with a variety of partners who help customers take advantage of the feature and add functionality on top of it. Here are some of the blog posts that our partners wrote to share their experiences and offerings. (I am updating this article with links as they are published.)

Now Available
Amazon ECS Anywhere is now available in all commercial AWS Regions where ECS is supported, except the AWS China Regions. With ECS Anywhere, there are no minimum fees or upfront commitments. You pay per instance hour for each managed ECS Anywhere on-premises instance that runs tasks. The ECS Anywhere free tier includes 2,200 instance hours per month for six months per account, across all Regions. For more information, see the pricing page.

To learn more, see ECS Anywhere in the Amazon ECS Developer Guide. Please send feedback to the AWS forum for Amazon ECS or through your usual AWS Support contacts.

Get started with Amazon ECS Anywhere today.

Channy

Update: Watch a cool demo of using ECS Anywhere to operate a Raspberry Pi cluster in a home office, and read the accompanying deep-dive blog post.