Tag Archives: Amazon OpenSearch Service (Successor to Amazon Elasticsearch Service)

Build a cost-effective extension to your Elasticsearch cluster with Amazon OpenSearch Service

Post Syndicated from Alexandre Levret original https://aws.amazon.com/blogs/big-data/build-a-cost-effective-extension-to-your-elasticsearch-cluster-with-amazon-opensearch-service/

During the past year, we’ve seen customers running self-managed Elasticsearch clusters on AWS run out of compute and storage capacity because of the non-elasticity of their clusters. They adopted Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) to benefit from better flexibility for their logs and longer retention periods.

In this post, we discuss how to build a cost-effective extension to your Elasticsearch cluster with Amazon OpenSearch Service to extend the retention time of your data.

In May 2021, we published the blog post Introducing Cold Storage for Amazon OpenSearch Service, which explained how to reduce your overall cost. Cold storage decouples compute from storage costs: when you detach indices from the domain, you pay only for storage and benefit from a better overall storage-to-compute ratio. Cold storage can reduce your data retention cost by up to 90% per GB versus storing the same data in the Hot tier. You can automate your data lifecycle management and move your data between the three tiers (Hot, UltraWarm, and Cold) thanks to Index State Management.

AWS Professional Services teams worked with one customer to add an OpenSearch Service domain as a second target for their logs. This customer could only keep their indices for 8 days on their existing self-managed Elasticsearch cluster. Because of legal and security requirements, they needed to retain data for up to 6 months. The solution was to use an Elasticsearch cluster (running version 7.10) on Amazon OpenSearch Service as an extension of their existing Elasticsearch cluster. This gave their internal application teams an additional Kibana dashboard to visualize their indices beyond 8 days. The extension uses the UltraWarm tier to provide warm access to the data, then moves data to the Cold storage tier when it’s not actively used, removing compute resources and reducing cost.

Building this solution as an extension to their existing self-managed cluster gave them 172 extra days of access to their logs (21.5 times the data retention length) at an incremental cost of 15%.

Demystifying Index State Management

Index State Management (ISM) enables you to create a policy to automate index management within different tiers in an OpenSearch Service domain.

As of February 2022, three tiers are available in Amazon OpenSearch Service: Hot, UltraWarm, and Cold.

The default Hot tier is for active writing and low-latency analytics. UltraWarm is for read-only data up to three petabytes at one-tenth of the Hot tier cost, and Cold is for unlimited long-term archival. Although Hot storage is used for indexing and provides the fastest access, UltraWarm complements the Hot storage tier by providing less expensive storage for older and less-frequently accessed data. This is done while maintaining the same interactive analytics experience. Rather than attached storage, UltraWarm nodes use Amazon Simple Storage Service (Amazon S3) and a sophisticated caching solution to improve performance.

ISM also helps from a cost perspective: when you no longer need to access your data after a certain period but still have to keep it (because of legal requirements, for instance), ISM automates the transition of that data between tiers. Transitions are based on index age, size, and other conditions.

Also, transitions must follow the tier order: from Hot to UltraWarm to Cold, and back from Cold to UltraWarm to Hot. You can’t change this order.

Solution overview

Our solution enables you to extend the retention time for your data. We show you how to add a second Cold OpenSearch Service domain to your existing self-managed Hot deployment. You use Elasticsearch snapshots to move data from the Hot cluster to the Cold domain, and you apply ISM policies to the restored indices with different retention periods before deletion, ranging from 14 to 180 days.

In addition to that, you add 9 recommended alarms for Amazon OpenSearch Service in Amazon CloudWatch via an AWS CloudFormation template to enhance your ability to monitor your stack. Those recommended alarms notify you, through an Amazon Simple Notification Service (Amazon SNS) topic, on key metrics you should monitor, like ClusterStatus, FreeStorageSpace, CPUUtilization, and JVMMemoryPressure.

The following diagram illustrates the solution architecture:

The diagram contains the following components in our solution for extending your self-managed Elasticsearch cluster with Amazon OpenSearch Service (available on GitHub):

  1. Snapshots repository
    1. You run an AWS Lambda function one time to register your S3 bucket (snapshots-bucket in the diagram) as a snapshots repository for your OpenSearch Service domain.
  2. ISM policies
    1. You run a Lambda function one time to create six ISM policies that automate the migration of your indices from the Hot tier to UltraWarm and from UltraWarm to Cold storage, as soon as they are restored within the domain, with different retention periods (14, 21, 35, 60, 90, and 180 days before deletion).
  3. Index migration
    1. You use an Amazon EventBridge rule to automatically trigger a Lambda function (RestoreIndices in the diagram) once a day.
    2. This function parses the latest snapshots that have been pushed by the Elasticsearch cluster.
    3. When the function finds a new index that doesn’t exist yet in the OpenSearch Service domain, it initiates a restore operation and attaches an ISM policy (created during step 2.1).
  4. Free UltraWarm cache
    1. You use an EventBridge rule to automatically trigger an AWS Lambda function (MoveToCold in the diagram) once a day.
    2. This function checks for indices that have been accessed in the UltraWarm tier and moves them back to the Cold tier to free the UltraWarm node cache.
  5. Alerting
    1. You use CloudWatch to create 9 alarms based on Amazon OpenSearch Service CloudWatch metrics.
    2. CloudWatch redirects alarms to an SNS topic.
    3. You receive notifications from the SNS topic, which sends emails as soon as an alarm is raised.

Prerequisites

Complete the following prerequisite steps:

  1. Deploy a self-managed Elasticsearch cluster (running on premises or in AWS) that pushes snapshots periodically to an S3 bucket (ideally once a day).
  2. Deploy an OpenSearch Service domain (running OpenSearch 1.1 version) and enable UltraWarm and Cold options.
  3. Deploy a proxy server (NGINX on the architecture diagram) in a public subnet that allows access to dashboards for your OpenSearch Service domains, hosted within a VPC.
  4. To automate multiple mechanisms in this solution, create an AWS Identity and Access Management (IAM) role for our different Lambda functions. Use the following IAM policy:
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": "logs:CreateLogGroup",
            "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:*",
            "Effect": "Allow"
        },
        {
            "Action": [
                "logs:CreateLogStream",
                "logs:PutLogEvents"
            ],
            "Resource": "arn:aws:logs:us-east-1:123456789012:log-group:/aws/lambda/*:*",
            "Effect": "Allow"
        },
        {
            "Action": "iam:PassRole",
            "Resource": "arn:aws:iam::123456789012:role/snapshotsRole",
            "Effect": "Allow"
        },
        {
            "Action": [
                "es:ESHttpPut",
                "es:ESHttpGet",
                "es:ESHttpPost"
            ],
            "Resource": "arn:aws:es:us-east-1:123456789012:domain/my-test-domain/*",
            "Effect": "Allow"
        }
    ]
}

This policy allows our Lambda functions to send PUT, GET and POST requests to our OpenSearch Service domain, register their logs in CloudWatch Logs, and pass an IAM role used to access the S3 bucket that stores snapshots.

  5. Additionally, edit the trust relationship to be assumed by Lambda:
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "lambda.amazonaws.com"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }

You use this IAM role for the Lambda functions you create.

You also need to configure the OpenSearch security plugin to grant permissions for the traffic that Lambda sends to OpenSearch.

  6. Sign in to your Cold domain’s Kibana dashboard and in the Security section, choose Roles.

Here you can find existing and predefined Kibana roles.

  7. Select the all_access role and choose Mapped users.
  8. Choose Manage mapping to edit the mapped users.
  9. Enter the ARN of the IAM role you just created as a new backend role on this Kibana role.
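If you prefer the API over the console steps above, you can create the same mapping with the security plugin’s REST API, as in the following sketch. The Lambda role name shown here is a hypothetical example (only the account ID matches the earlier policy), and a PUT replaces the whole mapping, so keep any existing backend roles for all_access in the list. On Elasticsearch-engine domains, the path prefix is _opendistro instead of _plugins.

PUT _plugins/_security/api/rolesmapping/all_access
{
	"backend_roles": [
		"arn:aws:iam::123456789012:role/lambda-opensearch-role"
	],
	"hosts": [],
	"users": []
}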

In the following sections, we walk you through the steps to set up each component in the solution architecture.

Snapshots repository

To migrate your logs from the Hot cluster to the Cold domain, you register your S3 bucket that stores logs in the form of snapshots (from the Elasticsearch cluster) as a snapshots repository for your OpenSearch Service domain.

  1. Create an IAM role (for this post, we use SnapshotsRole for the role name) to give permissions to the Cold domain to access your S3 bucket that stores snapshots from your Elasticsearch cluster. Use the following IAM policy for this role:
    {
      "Version": "2012-10-17",
      "Statement": [{
          "Action": [
            "s3:ListBucket"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:s3:::s3-bucket-name"
          ]
        },
        {
          "Action": [
            "s3:GetObject",
            "s3:PutObject",
            "s3:DeleteObject"
          ],
          "Effect": "Allow",
          "Resource": [
            "arn:aws:s3:::s3-bucket-name/*"
          ]
        }
      ]
    }

  2. Edit the trust relationship to be used from Amazon OpenSearch Service:
    {
      "Version": "2012-10-17",
      "Statement": [{
        "Sid": "",
        "Effect": "Allow",
        "Principal": {
          "Service": "es.amazonaws.com"
        },
        "Action": "sts:AssumeRole"
      }]
      
    }

  3. Create the Lambda function that is responsible for registering this S3 bucket as the snapshots repository.

On the GitHub repository, you can find the files needed to build this part. See the lambda-functions/register-snapshots-repository.py Python file to create the Lambda function.

  4. Choose Test on the Lambda console to run the function.

You only need to run it once. It registers the S3 bucket as a new snapshots repository for your OpenSearch Service domain.

  5. Verify the snapshots repository by navigating to the Kibana dashboard of the Cold domain on the Dev Tools tab and running the following command:
    GET _snapshot/myelasticsearch-snapshots-repository (replace with your repository name)

You can also achieve this step from an Amazon Elastic Compute Cloud (Amazon EC2) instance (instead of a Lambda function) because it only has to be run once, with an instance profile IAM role attached to the EC2 instance.
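For reference, the registration request that the function sends looks similar to the following sketch. The repository name, bucket name, and Region are placeholders to adjust to your environment. On Amazon OpenSearch Service, this request must be signed with AWS Signature Version 4 by an identity allowed to pass SnapshotsRole, which is why it is issued from the Lambda function (or an EC2 instance) rather than from Dev Tools.

PUT _snapshot/myelasticsearch-snapshots-repository
{
	"type": "s3",
	"settings": {
		"bucket": "snapshots-bucket",
		"region": "us-east-1",
		"role_arn": "arn:aws:iam::123456789012:role/SnapshotsRole"
	}
}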

Index State Management policies

You use Index State Management to automate the transition of your indices between storage tiers in Amazon OpenSearch Service. To use ISM, you create policies (small JSON documents that define a state automaton) and attach these policies to the indices in your domain. ISM policies specify states with actions and transitions that enable you to move and delete indices. You can use the functions/create-indexstatemanagement-policy.py Lambda code to create six ISM policies that automate transition within tiers and delete your Cold indices after 14, 21, 35, 60, 90, and 180 days. You use the IAM role you created earlier, and run that function once to create the policies in your domain.

Navigate to Kibana in your OpenSearch Service domain and choose Index Management. On the State management policies page, verify that you can see your ISM policies.
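As a reference, one of the six policies might look similar to the following sketch, which uses the ISM warm_migration, cold_migration, and cold_delete actions. The policy name, the 2-day warm-to-cold delay, and the timestamp field are illustrative assumptions; the actual policies created by the Lambda code may differ.

PUT _plugins/_ism/policies/delete-after-14-days
{
	"policy": {
		"description": "Migrate restored indices to UltraWarm, then to cold storage, and delete after 14 days",
		"default_state": "warm",
		"states": [
			{
				"name": "warm",
				"actions": [{ "warm_migration": {} }],
				"transitions": [{ "state_name": "cold", "conditions": { "min_index_age": "2d" } }]
			},
			{
				"name": "cold",
				"actions": [{ "cold_migration": { "timestamp_field": "@timestamp" } }],
				"transitions": [{ "state_name": "delete", "conditions": { "min_index_age": "14d" } }]
			},
			{
				"name": "delete",
				"actions": [{ "cold_delete": {} }],
				"transitions": []
			}
		]
	}
}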

Index migration

To migrate your data from the Hot cluster to the Cold domain, you use the functions/restore-indices.py code to create a Lambda function (RestoreIndices) and the cfn-templates/event-bridge-lambda-function.yaml CloudFormation template to create its trigger, an EventBridge rule (scheduled once a day at 12 AM). The Lambda function parses the indices within your snapshots repository and initiates a restore operation for each new index that doesn’t exist in the Cold domain. As soon as an index is restored in the domain, the function attaches an ISM policy to it, based on its index pattern, to determine its retention period.

The Python code looks for an application name consisting of exactly three letters (for example, aws). If your logs follow a different index pattern, you need to update the relevant code lines (trigramme = index[5:8]).
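Conceptually, for each new index the function issues a restore request against the snapshots repository and then attaches the matching ISM policy, similar to the following sketch. The snapshot, index, and policy names are hypothetical examples.

POST _snapshot/myelasticsearch-snapshots-repository/snapshot-2022-02-01/_restore
{
	"indices": "logs-aws-2022-02-01",
	"include_global_state": false
}

POST _plugins/_ism/add/logs-aws-2022-02-01
{
	"policy_id": "delete-after-14-days"
}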

Free UltraWarm cache

To free the UltraWarm node cache on the Cold domain, you use the functions/move-to-Cold.py code to create a Lambda function (MoveToCold) and the cfn-templates/event-bridge-lambda-function.yaml CloudFormation template to create its trigger, an EventBridge rule (change its schedule to avoid running in parallel with the previous rule). Indices that were moved to the UltraWarm tier for warm access are moved back to Cold storage, which frees the node cache for the next index migration and reduces cost.
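Under the hood, moving an index from UltraWarm to cold storage uses the UltraWarm migration API, so the MoveToCold function presumably issues requests like the following sketch (the index name is a placeholder):

POST _ultrawarm/migration/logs-aws-2022-02-01/_cold
{
	"timestamp_field": "@timestamp"
}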

Alerting

To get alerted via email when the Cold domain requires your attention, you use the cfn-templates/alarms.yaml CloudFormation template to create an SNS topic that receives notifications when one of the 9 CloudWatch alarms is raised, based on Amazon OpenSearch Service metrics. Those alarms come from the recommended CloudWatch alarms for Amazon OpenSearch Service.

Conclusion

In this post, we covered a solution that enables an OpenSearch Service domain as an extension to your existing self-managed Elasticsearch cluster, in order to extend the retention period of application logs in a serverless and cost-effective way.

If you’re interested in going deeper into Amazon OpenSearch Service and AWS Analytics capabilities in general, you can get help and join discussions on our forums.


About the Authors

Alexandre Levret is a Professional Services consultant within Amazon Web Services (AWS) dedicated to the public sector in Europe. He aims to build, innovate, and inspire his customers, who face challenges that cloud computing can help them resolve.

Improved performance with AWS Graviton2 instances on Amazon OpenSearch Service

Post Syndicated from Rohin Bhargava original https://aws.amazon.com/blogs/big-data/improved-performance-with-aws-graviton2-instances-on-amazon-opensearch-service/

Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) is a fully managed service at AWS for OpenSearch. It’s an open-source search and analytics suite used for a broad set of use cases, like real-time application monitoring, log analytics, and website search.

While running an OpenSearch Service domain, you can choose from a variety of instances for your primary nodes and data nodes suitable for your workload: general purpose, compute optimized, memory optimized, or storage optimized. With the release of each new generation, Amazon OpenSearch Service has brought even better price performance.

Amazon OpenSearch Service now supports AWS Graviton2 instances: general purpose (M6g), compute optimized (C6g), memory optimized (R6g), and memory optimized with attached disk (R6gd). These instances offer up to a 38% improvement in indexing throughput, 50% reduction in indexing latency, and 40% improvement in query performance depending upon the instance family and size compared to the corresponding Intel-based instances from the current generation (M5, C5, R5).

The AWS Graviton2 instance family includes several new performance optimizations, such as larger caches per core, higher Amazon Elastic Block Store (Amazon EBS) throughput than comparable x86 instances, fully encrypted RAM, and many others. You can benefit from these optimizations with minimal effort by provisioning or migrating your OpenSearch Service instances today.

Performance analysis compared to fifth-generation Intel-based instances

We conducted tests using the AWS Graviton2 instances against the fifth-generation Intel-based instances and measured performance improvements. Our setup included two six-node domains with three dedicated primary nodes and three data nodes, running Elasticsearch 7.10. For the Intel-based setup, we used c5.xlarge for the primary nodes and r5.xlarge for the data nodes. Similarly, on the AWS Graviton2-based setup, we used c6g.xlarge for the primary nodes and r6g.xlarge for the data nodes. Both domains had three Availability Zones enabled and were VPC enabled, with advanced security and 512 GB of EBS volume attached to each node. Each index had six shards with a single replica.

The dataset contained 2,000 documents with a flat document structure. Each document had 20 fields: 1 date field, 16 text fields, 1 float field, and 2 long fields. Documents were generated on the fly using random samples so that the corpus was infinite.

For ingestion, we used a load generation host where each bulk request had a 4 MB payload (approximately 2,048 documents per request) and nine clients.

We used one query generation host with one client. We ran a mix of low-latency queries (approximately 10 milliseconds), medium-latency queries (100 milliseconds), and high-latency queries (1,000 milliseconds):

  • Low-latency queries – These were match-all queries.
  • Medium-latency queries – These were multi-match queries or queries with filters based on one randomly selected keyword. The results were aggregated in a date histogram and sorted by the descending ingest timestamp.
  • High-latency queries – These were multi-match queries or queries with filters based on five randomly selected keywords. The results were aggregated using two aggregations: aggregated in a date histogram with a 3-hour interval based on the ingest timestamp, and a date histogram with a 1-minute interval based on the ingest timestamp.

We ran 60 minutes of burn-in time followed by 3 hours of 90/10 ingest to query workloads with a mix of 20% low-latency, 50% medium-latency, and 30% high-latency queries. The amount of load sent to the clusters was identical.

Graphs and results

When ingesting documents at the same throughput, the AWS Graviton2 domain shows much lower latency than the Intel-based domain, as shown in the following graph. Even the p99 latency of the AWS Graviton2 domain is consistently lower than the p50 latency of the Intel-based domain. In addition, AWS Graviton2 latencies are more consistent than those of the Intel-based instances, providing a more predictable user experience.

When querying documents at the same throughput, the AWS Graviton2 domain outperforms the Intel-based instances. The p50 latency of AWS Graviton2 is better than the p50 latency of the Intel-based instances.

Similarly, the p99 latency of AWS Graviton2 is better than that of the Intel-based instances. Note in the following graph that the increase in latency over time is due to the growing corpus size.

Conclusion

As demonstrated in our performance analysis, the new AWS Graviton2-based instances consistently yield better performance compared to the fifth-generation Intel-based instances. Try these new instances out and let us know how they perform for you!

As usual, let us know your feedback.


About the Authors

Rohin Bhargava is a Sr. Product Manager with the Amazon OpenSearch Service team. His passion at AWS is to help customers find the correct mix of AWS services to achieve success for their business goals.

Chase Engelbrecht is a Software Engineer working with the Amazon OpenSearch Service team. He is interested in performance tuning and optimization of OpenSearch running on Amazon OpenSearch Service.

Enhance resiliency with admission control in Amazon OpenSearch Service (successor to Amazon Elasticsearch Service)

Post Syndicated from Mital Awachat original https://aws.amazon.com/blogs/big-data/enhance-resiliency-with-admission-control-in-amazon-opensearch-service-successor-to-amazon-elasticsearch-service/

OpenSearch is a distributed, open-source search and analytics suite used for a broad set of use cases like real-time application monitoring, log analytics, and website search. Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) is a managed service that makes it easy to secure, deploy, and operate OpenSearch clusters at scale. Amazon OpenSearch Service provides a broad range of cluster configurations to meet your use cases. In 2021, we released automated memory management under Auto-Tune. Auto-Tune is an adaptive resource management system in Amazon OpenSearch Service that continuously monitors incoming workloads and optimizes cluster resources to improve efficiency and performance.

Today, we’re excited to announce the release of admission control for Auto-Tune. Admission control in Amazon OpenSearch Service enhances the overall resiliency of OpenSearch clusters by limiting new incoming requests early, at the REST layer, when a node is stressed. This mechanism prevents potential node failures and cascading effects on the cluster.

Overview of admission control

Admission control acts like a lever to regulate traffic based on cluster state. It does so by allocating tokens for each OpenSearch request, based on predicted resource usage. It releases the tokens when the process is complete. After all the tokens are acquired, any additional requests to the node are throttled with a “too many requests” exception until tokens are available again for request processing. In some cases, an operator can utilize admission control to completely shut down traffic and prevent frequent node drops until a certain condition is met, such as shards being assigned.

Admission control is a gatekeeper for nodes, limiting the number of requests processed to a node based on its current capacity.

Admission control prevents Amazon OpenSearch Service domains from getting overloaded by both steady increases and sudden surges in traffic. It’s resource-aware, so it tunes the cluster based on incoming request cost (the content length of the request payload) and the point-in-time state of the node (overall Java Virtual Machine (JVM) memory pressure). This awareness enables real-time, state-based admission control on the node. Admission control for Auto-Tune is available in all AWS Regions on domains running OpenSearch 1.0, or Elasticsearch 6.7 and higher.

By default, admission control throttles _search and _bulk requests when JVM memory pressure and request size thresholds are breached.

  • JVM memory pressure threshold
    Admission control keeps track of the current state of JVM memory pressure and throttles incoming requests based on a preconfigured JVM memory pressure threshold. When the threshold is breached, all configured _search and _bulk requests are throttled until the memory is released on the node and memory pressure is below the threshold.
  • Request size threshold
    The size of a particular request is determined by its content length. Admission control keeps track of in-flight requests and allocates tokens to every request based on this content length. Admission control then throttles incoming requests based on memory occupancy when the aggregated size of in-flight requests breaches the preconfigured threshold. All new _search and _bulk requests are throttled until the in-flight requests complete, relinquishing their quota to new requests.

The following diagram illustrates this process.

How Auto-Tune works

Auto-Tune uses performance and usage metrics from OpenSearch clusters to suggest memory-related configuration changes to improve cluster speed and stability. You can view its recommendations on the Amazon OpenSearch Service console. Admission control is a non-disruptive change, meaning that the changes can be applied without rebooting the node.

Admission control’s predefined request size threshold of 10% satisfies most use cases. However, Auto-Tune can now dynamically increase and decrease the default threshold, typically between 5–15%, based on the amount of JVM that is currently occupied on the system. Request size threshold auto-tuning is enabled by default when you enable Auto-Tune.

Auto-Tune currently doesn’t tune the JVM memory pressure threshold.

Monitoring admission control

Amazon OpenSearch Service sends two Auto-Tune metrics to Amazon CloudWatch: AutoTuneSucceeded and AutoTuneFailed. Each metric contains a sub-category called AutotuningType, which indicates the specific type of change in question. Admission control adds a new type called ADMISSION_CONTROL_TUNING.

To view it, choose ES/OpenSearchService on the Metrics page on the CloudWatch console.

Then choose AutotuningType, ClientId, DomainName, TargetId.

For AutotuningType, filter by ADMISSION_CONTROL_TUNING.

Conclusion

Admission control introduces request-based rejections of _search and _bulk requests when there are too many requests or JVM usage is high, breaching thresholds. This prevents the nodes from running into cascading effects of failures arising due to the following:

  • Surges in traffic – Sudden surges or spikes in request traffic, leading to quick buildup in usage across the nodes
  • Skew in shard distribution – Improper distribution of shards, leading to hot spots and bottlenecks, affecting the overall performance
  • Slow nodes – An entire data node starts to slow down due to degraded hardware (such as disks or network volumes) or software bugs

Stay tuned for more exciting updates about Amazon OpenSearch Service and features.


About the Authors

Mital Awachat is an SDE-II working on Amazon OpenSearch Service at Amazon Web Services.

Saurabh Singh is a Senior Software Engineer working on AWS OpenSearch at Amazon Web Services. He is passionate about solving problems related to data retrieval and large-scale distributed systems. He is an active contributor to OpenSearch.

Ranjith Ramachandra is an Engineering Manager working on Amazon OpenSearch Service at Amazon Web Services.

Detect anomalies on one million unique entities with Amazon OpenSearch Service

Post Syndicated from Kaituo Li original https://aws.amazon.com/blogs/big-data/detect-anomalies-on-one-million-unique-entities-with-amazon-opensearch-service/

Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) supports a highly performant, integrated anomaly detection engine that enables the real-time identification of anomalies in streaming data. Last year, we released high-cardinality anomaly detection (HCAD) to detect individual entities’ anomalies. With the 1.1 release, we have allowed you to monitor a million entities with steady, predictable performance. HCAD is easiest when described in contrast to the non-HCAD single-stream solution. In a single-stream detector, we detect anomalies for an aggregate entity. For example, we can use a single-stream detector to sift through aggregated traffic across all IP addresses so that users can be notified when unusual spikes occur. However, we often need to identify anomalies in entities, such as individual hosts and IP addresses. Each entity may work on a different baseline, which means its time series’ distribution (measured in parameters such as magnitude, trend, and seasonality, to name a few) are different. The different baselines make it inaccurate to detect anomalies using a single monolithic model. HCAD distinguishes itself from single-stream detectors by customizing anomaly detection models to entities.

Example use cases of HCAD include the following:

  • Internet of things – Continuously tracking the temperature of fridges and warning users of temperatures at which food or medicine longevity is at risk, so users can take measures to avoid them. Each entity has specific categorical fields that describe it, and you can think of the categorical fields as characteristics for those entities. A fridge’s serial number is the categorical field that uniquely identifies the fridges. Using a single model generates a lot of false alarms because ambient temperatures can be different. A temperature of 5° C is normal during winter in Seattle, US, but such a temperature in a tropical place during winter is likely anomalous. Also, users may open the door to a fridge several times, triggering a spike in the temperature. The duration and frequency of spikes can vary according to user behavior. HCAD can group temperature data into geographies and users to detect varying local temperatures and user behavior.
  • Security – An intrusion detection system identifying an increase in failed login attempts in authentication logs. The user name and host IP are the categorical fields used to determine the user accessing from the host. Hackers might guess user passwords by brute force, and not all users on the same host IP may be targeted. The number of failed login counts varies on a host for a particular user at a specific time of day. HCAD creates a representative baseline per user on each host and adapts to changes in the baseline.
  • IT operations – Monitoring access traffic by shard in a distributed service. The shard ID is the categorical field, and the entity is the shard. A modern distributed system usually consists of shards linked together. When a shard experiences an outage, the traffic increases significantly for dependent shards due to retry storms. It’s hard to discover the increase because only a limited number of shards are affected. For example, traffic on the related shards might be as much as 64 times that of normal levels, whereas average traffic across all shards might just grow by a small constant factor (less than 2).

Making HCAD real time and performant while achieving completeness and scalability is a formidable challenge:

  • Completeness – Model all or as many entities as possible.
  • Scalability – Horizontal and vertical scaling without changing model fidelity. That is, when scaling the machine up or out, an anomaly detector can add models monotonically. HCAD uses the same model and gives the same answer for an entity’s time series as in single-stream detection.
  • Performance – Low impact to system resource usage and high overall throughput.

The first release of HCAD in Amazon OpenSearch Service traded completeness and scalability for performance: the anomaly detector limited the number of entities to 1,000. You can change the setting plugins.anomaly_detection.max_entities_per_query to increase the number of monitored entities per interval. However, such a change incurs a non-negligible cost, which opens the door to cluster instability. Each entity uses memory to host models, disk I/O to read and write model checkpoints and anomaly results, CPU cycles for metadata maintenance and model training and inference, and garbage collection for deleted models and metadata. The more entities, the more resource usage. Furthermore, HCAD could suffer a combinatorial explosion of entities when supporting multiple categorical fields (a feature released in Amazon OpenSearch Service 1.1). Imagine a detector with only one categorical field geolocation. Geolocation has 1,000 possible values. Adding another categorical field product with 1,000 allowed values gives the detector 1 million entities.

For the next version of HCAD, we devoted much effort to improving completeness and scalability. Our approach combines sizing the cluster correctly with in-memory model hosting and on-disk model loading. Performance metrics show that HCAD doesn’t saturate the cluster with substantial cost and still leaves plenty of room for other tasks. As a result, HCAD can analyze one million entities in 10 minutes and flag anomalies in different patterns. In this post, we explore how HCAD can analyze one million entities and the technical implementations behind the improvements.

How to size domains

Model management is a trade-off: disk-based solutions that reload, use, stop, and store models on every interval offer memory savings but suffer high overhead and are hard to scale. Memory-based solutions offer lower overhead and higher throughput but typically increase memory requirements. We exploit the trade-off by implementing an adaptive mechanism that hosts models in memory as much as allowed (capped via the cluster setting plugins.anomaly_detection.model_max_size_percent), because in-memory hosting gives the best performance. When models don’t fit in memory, we process extra model requests by loading models from disk.

The use of memory whenever possible is what makes HCAD scalable. Therefore, it’s crucial to size a cluster correctly to offer enough memory for HCAD. The main factors to consider when sizing a cluster are:

  • Sum of all detectors’ total entity count – A detector’s total entity count is the cardinality of the categorical fields. If there are multiple categorical fields, the number counts all unique combinations of values of these fields present in data. You can decide the cardinality via cardinality aggregation in Amazon OpenSearch Service. If the detector is a single-stream detector, the number of entities is one because there is no defined categorical field.
  • Heap size – Amazon OpenSearch Service sets aside 50% of RAM for heap. To determine the heap size of an instance type, refer to Amazon OpenSearch Service pricing. For example, an r5.2xlarge host has 64 GB RAM. Therefore, the host’s heap size is 32 GB.
  • Anomaly detection (AD) maximum memory percentage – AD can use up to 10% of the heap by default. You can customize the percentage via the cluster setting plugins.anomaly_detection.model_max_size_percent. The following update allows AD to use half of the heap via the aforementioned setting:
PUT /_cluster/settings
{
	"persistent": {
		"plugins.anomaly_detection.model_max_size_percent": "0.5"
	}
}
  • Entity in-memory model size – An entity’s in-memory model size varies according to shingle size, the number of features, and Amazon OpenSearch Service version as we’re constantly improving. All entity models of the same detector configuration in the same software version have the same size. A safe way to obtain the size is to run a profile API on the same detector configuration on an experimental cluster before creating a production cluster. In the following case, each entity model of detector fkzfBX0BHok1ZbMqLMdu is of size 470,491 bytes:

Enter the following profile request:

GET /_plugins/_anomaly_detection/detectors/fkzfBX0BHok1ZbMqLMdu/_profile/models

We get the following response:

{
	...{
		"model_id": "fkzfBX0BHok1ZbMqLMdu_entity_GOIubzeHCXV-k6y_AA4K3Q",
		"entity": [{
				"name": "host",
				"value": "host141"
			},
			{
				"name": "process",
				"value": "process54"
			}
		],
		"model_size_in_bytes": 470491,
		"node_id": "OcxBDJKYRYKwCLDtWUKItQ"
	}
	...
}
  • Storage requirement for result indexes – Real-time detectors store detection results as much as possible when the indexing pressure isn’t high, including both anomalous and non-anomalous results. When the indexing pressure is high, we save anomalous results and a random subset of non-anomalous results. OpenSearch Dashboards uses non-anomalous results as context for anomalous results and plots the results as a function of time. Additionally, AD stores the history of all generated results for a configurable number of days after generating results. This result retention period is 30 days by default, and adjustable via the cluster setting plugins.anomaly_detection.ad_result_history_retention_period. We need to ensure enough disk space is available to store the results by multiplying the amount of data generated per day by the retention period. For example, consider a 10-minute interval detector with 1 million entities that generates 1 million result documents per interval. One document’s size is about 1 KB. That’s roughly 144 GB per day, or 4,320 GB after a 30-day retention period. The total disk requirement should also be multiplied by the number of shard copies. Currently, AD chooses one primary shard per node (up to 10) and one replica when called for the first time. Because the number of replicas is 1, every shard has two copies, and the total disk requirement is closer to 8,640 GB for the million entities in our example.
  • Anomaly detection overhead – AD incurs memory overhead for historical analyses and internal operations. We recommend reserving 20% more memory for the overhead to keep running models uninterrupted.

In order to derive the required number of data nodes D, we must first derive an expression for the number of entity models N that a node can host in memory. We define Si to be the entity model size of detector i. If we use an instance type with heap size H where the maximum AD memory percentage is P, then N is equal to the AD memory allowance divided by the maximum entity model size among all detectors:

$N = \left\lfloor \dfrac{H \times P}{\max_i S_i} \right\rfloor$
We consider the required number of data nodes D as a function of N. Let’s denote by Ci the total entity count of detector i. Given n detectors, it follows that:

$D = \left\lceil 1.2 \times \dfrac{\sum_{i=1}^{n} C_i}{N} \right\rceil$

The fact that AD needs an extra 20% memory overhead is expressed by multiplying by 1.2 in the formula. The ceil function represents the smallest integer greater than or equal to the argument.

For example, an r5.2xlarge Amazon Elastic Compute Cloud (Amazon EC2) instance has 64 GB RAM, so the heap size is 32 GB. We configure AD to use at most half of the allowed heap size. We have two HCAD detectors, whose model sizes are 471 KB and 403 KB, respectively. To host 500,000 entities for each detector, we need a 36-data-node cluster according to the following calculation:

$D = \left\lceil 1.2 \times \dfrac{500{,}000 + 500{,}000}{\left\lfloor 32\,\mathrm{GB} \times 0.5 \,/\, 471\,\mathrm{KB} \right\rfloor} \right\rceil = \left\lceil 1.2 \times \dfrac{1{,}000{,}000}{33{,}970} \right\rceil = \lceil 35.3 \rceil = 36$
We also need to ensure there is enough disk space. In the end, we used a 39-node r5.2xlarge cluster (3 primary and 36 data nodes) with 4 TB Amazon Elastic Block Store (EBS) storage on each node.

What if a detector’s entity count is unknown?

Sometimes, it’s hard to know a detector’s entity count. We can check historical data and estimate the cardinality. But it’s impossible to predict the future accurately. A general guideline is to allocate buffer memory during planning. Appropriately used, buffer memory provides room for small changes. If the changes are significant, you can adjust the number of data nodes because HCAD can scale in and out horizontally.

What if the number of active entities is changing?

The total number of entities created can be higher than the number of active entities, as evident from the following two figures. The total number of entities in the HTTP logs dataset is 2 million within 2 months, but each entity only appears seven times on average. The number of active entities within a time-boxed interval is much less than 2 million. The following figure presents an example time series of network size of IP addresses from the HTTP logs dataset.

http log data distribution

The KPI dataset also shows similar behavior, where entities often appear in a short amount of time during bursts of entity activities.

kpi data distribution

AD requires large sample sizes to create a comprehensive picture of the data patterns, making it suitable for dense time series that can be uniformly sampled. AD can still train models and produce predictions if the preceding bursty behavior can last a while and provide at least 400 points. However, training becomes more difficult, and prediction accuracy is lower as data gets more sparse.

It’s wasteful to preallocate memory according to the total number of entities in this case. Instead of the total number of entities, we need to consider the maximal active entities within an interval. You can get an approximate number by using a date_histogram and cardinality aggregation pipeline, and sorting during a representative period. You can run the following query if you’re indexing host-cloudwatch and want to find out the maximal number of active hosts within a 10-minute interval throughout 10 days:

GET /host-cloudwatch/_search?size=0
{
	"query": {
		"range": {
			"@timestamp": {
				"gte": "2021-11-17T22:21:48",
				"lte": "2021-11-27T22:22:48"
			}
		}
	},
	"aggs": {
		"by_10m": {
			"date_histogram": {
				"field": "@timestamp",
				"fixed_interval": "10m"
			},
			"aggs": {
				"dimension": {
					"cardinality": {
						"field": "host"
					}
				},
				"multi_buckets_sort": {
					"bucket_sort": {
						"sort": [{
							"dimension": {
								"order": "desc"
							}
						}],
						"size": 1
					}
				}
			}
		}
	}
}

The query result shows that at most about 1,000 hosts are active during a ten-minute interval:

{
	...
	"aggregations": {
		"by_10m": {
			"buckets": [{
				"key_as_string": "2021-11-17T22:30:00.000Z",
				"key": 1637188200000,
				"doc_count": 1000000,
				"dimension": {
					"value": 1000
				}
			}]
		}
	}
	...
}

HCAD has a cache to store models and maintain a timestamp of last access for each model. For each model, an hourly job checks the time of inactivity and invalidates the model if the time of inactivity is longer than 1 hour. Depending on the timing of the hourly check and the cache capacity, the elapsed time a model is cached varies. If the cache capacity isn’t large enough to hold all non-expired models, we have an adapted least frequently used (LFU) cache policy to evict models (more on this in a later section), and the cache time of those invalidated models is less than 1 hour. If the last access time of a model is reset immediately after the hourly check, when the next hourly check happens, the model doesn’t expire. The model can take another hour to expire when the next hourly check comes. So the max cache time is 2 hours.

The upper bound of active entities that detector i can observe is:

$B_i = A_i \times \left\lceil \dfrac{120}{\Delta T_i} \right\rceil$
This equation has the following parameters:

  • Ai is the maximum number of active entities per interval of detector i. We get the number from the preceding query.
  • 120 is the number of minutes in 2 hours. ∆Ti denotes detector i’s interval in minutes. The ceil function represents the smallest integer greater than or equal to the argument. ceil(120÷∆Ti) refers to the max number of intervals a model is cached.

Accordingly, we should account for Bi instead of Ci in the sizing formula:

$D = \left\lceil 1.2 \times \dfrac{\sum_{i=1}^{n} B_i}{N} \right\rceil$
Sizing calculation flow chart

With the definitions of calculating the number of data nodes in place, we can use the following flow chart to make decisions under different scenarios.

sizing flowchart

What if the cluster is underscaled?

If the cluster is underscaled, AD prioritizes more frequent and recent entities. AD makes its best effort to accommodate extra entities by loading their models on demand from disk without hosting them in the in-memory cache. Loading models on demand means reloading, using, stopping, and storing models at every interval, which incurs high overhead. The overhead mostly comes from network and disk I/O rather than from model inference, so HCAD performs on-demand loading in a steady, controlled manner. If system resource usage isn’t heavy and there is enough time, HCAD may finish processing the extra entities. Otherwise, HCAD doesn’t necessarily find all the anomalies it could otherwise find.

Example: Analysis of 1 million entities

In the following example, you will learn how to set up a detector to analyze one million entities.

Ingest data

We generated 10 billion documents for 1 million entities in our evaluation of scalability and completeness improvement. Each entity has a cosine wave time series with randomly injected anomalies. With help from the tips in this post, we created the index host-cloudwatch and ingested the documents into the cluster. host-cloudwatch records elapsed CPU and JVM garbage collection (GC) time by a process within a host. Index mapping is as follows:

{
	...
	"mappings": {
		"properties": {
			"@timestamp": {
				"type": "date"
			},
			"cpuTime": {
				"type": "double"
			},
			"jvmGcTime": {
				"type": "double"
			},
			"host": {
				"type": "keyword"
			},
			"process": {
				"type": "keyword"
			}
		}
	}
	...
}

Create a detector

Consider the following factors before you create a detector:

  • Indexes to monitor – You can use a group of index names, aliases, or patterns. Here we use the host-cloudwatch index created in the last step.
  • Timestamp field – A detector monitors time series data. Each document in the provided index must be associated with a timestamp. In our example, we use the @timestamp field.
  • Filter – A filter selects data you want to analyze based on some condition. For example, a filter could select only requests with HTTP status codes 400 and higher from HTTP request logs. The 4xx and 5xx classes of HTTP status codes indicate that a request returned with an error, so you could create an anomaly detector for the number of error requests. In our running example, we analyze all of the data, and thus no filter is used.
  • Category field – Every entity has specific characteristics that describe it. Category fields provide categories of those characteristics. An entity can have up to two category fields as of Amazon OpenSearch Service 1.1. Here we monitor a specific process of a particular host by specifying the process and host field.
  • Detector interval – The detector interval is typically application-defined. We aggregate data within an interval and run models on the aggregated data. As mentioned earlier, AD is suitable for dense time series that can be uniformly sampled. You should at least make sure most intervals have data. Also, different detector intervals require different trade-offs between delay and accuracy. Long intervals smooth out long-term and short-term workload fluctuations and, therefore, may be less prone to noise, resulting in a high delay in detection. Short intervals lead to quicker detection but may find anticipated workload fluctuations instead of anomalies. You can plot your time series with various intervals and observe which interval keeps relevant anomalies while reducing noise. For this example, we use the default 10-minute interval.
  • Feature – A feature is an aggregated value extracted from the monitored data. It gets sent to models to measure the degrees of abnormality. Forming a feature can be as simple as picking a field to monitor and the aggregation function that summarizes the field data as metrics. We provide a suite of functions such as min and average. You can also use a runtime field via scripting. We’re interested in the garbage collection time field aggregated via the average function in this example.
  • Window delay – Ingestion delay. If the value isn’t configured correctly, a detector might analyze data before the late data arrives at the cluster. Because we ingested all the data in advance, the window delay is 0 in this case.

Our detector’s configuration aggregates average garbage collection processing time every 10 minutes and analyzes the average at the granularity of processes on different hosts. The API request to create such a detector is as follows. You can also use our streamlined UI to create and start a detector.

POST _plugins/_anomaly_detection/detectors
{
	"name": "detect_gc_time",
	"description": "detect gc processing time anomaly",
	"time_field": "@timestamp",
	"indices": [
		"host-cloudwatch"
	],
	"category_field": ["host", "process"],
	"feature_attributes": [{
		"feature_name": "jvmGcTime average",
		"feature_enabled": true,
		"importance": 1,
		"aggregation_query": {
			"gc_time_average": {
				"avg": {
					"field": "jvmGcTime"
				}
			}
		}
	}],
	"detection_interval": {
		"period": {
			"interval": 10,
			"unit": "MINUTES"
		}
	},
	"schema_version": 2
}

After the initial training is complete, all models of the 1 million entities are loaded in memory, and, after a few hours, 1 million results are generated every detector interval. To verify the number of active models in the cache, you can run the profile API:

GET /_plugins/_anomaly_detection/detectors/fkzfBX0BHok1ZbMqLMdu/_profile/models

We get the following response:

{
	...
	"model_count": 1000000
}

You can observe how many results are generated every detector interval (in our case 10 minutes) by invoking the result search API:

GET /_plugins/_anomaly_detection/detectors/results/_search
{
	"query": {
		"range": {
			"execution_start_time": {
				"gte": 1636501467000,
				"lte": 1636502067000
			}
		}
	},
	"track_total_hits": true
}

We get the following response:

{
	...
	"hits": {
		"total": {
			"value": 1000000,
			"relation": "eq"
		},
		...
	}
	...
}

OpenSearch Dashboards shows the top entities producing the most severe or the most numerous anomalies.

anomaly overview

You can choose a colored cell to review the details of anomalies occurring within that given period.

press anomaly

You can view anomaly grade, confidence, and the corresponding features in a shaded area.

feature graph

Create a monitor

You can create an alerting monitor to notify you of anomalies based on the defined anomaly detector, as shown in the following screenshot.

create monitor

We use anomaly grade and confidence to define a trigger. Both anomaly grade and confidence are values between 0 and 1.

Anomaly grade represents the severity of an anomaly. The closer the grade is to 1, the higher the severity. A grade of 0 means the corresponding prediction isn’t an anomaly.

Confidence measures whether an entity’s model has observed enough data such that the model contains enough unique, real-world data points. If a confidence value from one model is larger than the confidence of a different model, then the anomaly from the first model has observed more data.

Because we want to receive high fidelity alerts, we configured the grade threshold to be 0 and the confidence threshold to be 0.99.

edit trigger

The final step of creating a monitor is to add an action on what to include in the notification. Our example detector finds anomalies at a particular process in a host. The notification message should contain the entity identity. In this example, we use ctx.results.0.hits.hits.0._source.entity to grab the entity identity.

edit action

A monitor based on a detector extracts the maximum grade anomaly and triggers an alert based on the configured grade and confidence threshold. The following is an example alert message:

Attention

Monitor detect_cpu_gc_time2-Monitor just entered alert status. Please investigate the issue.
- Trigger: detect_cpu_gc_time2-trigger
- Severity: 1
- Period start: 2021-12-08T01:01:15.919Z
- Period end: 2021-12-08T01:21:15.919Z
- Entity: {0={name=host, value=host107}, 1={name=process, value=process622}}

You can customize the extraction query and trigger condition by changing the monitor defining method to Extraction query monitor and modifying the corresponding query and condition. Here is the explanation of all anomaly result index fields you can query.
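If you prefer to define the monitor programmatically, an extraction query monitor created via the Alerting API might look roughly like the following sketch. The detector ID, index pattern, and thresholds come from this example, but the query is simplified and illustrative; the exact query and time range handling the console generates may differ.

POST _plugins/_alerting/monitors
{
	"type": "monitor",
	"name": "detect_gc_time-monitor",
	"enabled": true,
	"schedule": { "period": { "interval": 10, "unit": "MINUTES" } },
	"inputs": [{
		"search": {
			"indices": [".opendistro-anomaly-results*"],
			"query": {
				"size": 1,
				"sort": [{ "anomaly_grade": "desc" }],
				"query": {
					"bool": {
						"filter": [
							{ "term": { "detector_id": "fkzfBX0BHok1ZbMqLMdu" } },
							{ "range": { "execution_end_time": { "from": "{{period_end}}||-10m", "to": "{{period_end}}" } } }
						]
					}
				}
			}
		}
	}],
	"triggers": [{
		"name": "detect_gc_time-trigger",
		"severity": "1",
		"condition": {
			"script": {
				"source": "ctx.results[0].hits.total.value > 0 && ctx.results[0].hits.hits[0]._source.anomaly_grade > 0 && ctx.results[0].hits.hits[0]._source.confidence > 0.99",
				"lang": "painless"
			}
		},
		"actions": []
	}]
}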

edit monitor

Evaluation

In this section, we evaluate HCAD’s precision, recall, and overall performance.

Precision and recall

We evaluated precision and recall over the cosine wave data, as mentioned earlier. Such evaluations aren’t easy in the context of real-time processing because only one point is available per entity during each detector interval (10 minutes in the example). Processing all the points takes a long time. Instead, we simulated real-time processing by fast-forwarding the processing in a script. The results are an average of 100 runs. The standard deviation is around 0.12.

The overall average precision, including the effects of cold start using linear interpolation, for the synthetic data is 0.57. The recall is 0.61. We note that no transformations were applied; it’s possible and likely that transformations improve these numbers. The precision is 0.09, and recall is 0.34 for the first 300 points due to interpolated cold start data for training. The numbers pick up as the model observes more real data. After another 5,000 real data points, the precision and recall improve to 0.57 and 0.63, respectively. We reiterate that the exact numbers vary based on the data characteristics—a different benchmark or detection configuration would have other numbers. Further, if there is no missing data, the fidelity of the HCAD model would be the same as that of a single-stream detector.

Performance

We ran HCAD on an idle cluster without ingestion or search traffic. Metrics such as JVM memory pressure and CPU of each node are well within the safe zone, as shown in the following screenshot. JVM memory pressure varies between 23–39%. CPU is mostly around 1%, with hourly spikes up to 65%. An internal hourly maintenance job can account for the spike due to saving hundreds of thousands of model checkpoints, clearing unused models, and performing bookkeeping for internal states. However, this can be a future improvement.

jvm memory pressure

cpu

Implementation

We next discuss the specifics of the technical work that is germane to HCAD’s completeness and scalability.

RCF 2.0

In Amazon OpenSearch Service 1.1, we integrated the Random Cut Forest (RCF) library 2.0. RCF is based on partitioning data into different bounding boxes. The previous RCF version maintains bounding boxes in memory. However, a real-time detector only uses the bounding boxes when processing a new data point and leaves them dormant most of the time. RCF 2.0 allows for recreating those bounding boxes when required, so that bounding boxes are present in memory while processing the corresponding input. The on-demand recreation reduces memory overhead by a factor of nine and therefore supports hosting nine times as many models on a node. In addition, RCF 2.0 revamps the serialization module. The new module serializes and deserializes a model 26 times faster while using 20 times less disk space.

Pagination

Regarding feature aggregation, we switched from getting top hits using terms aggregation to pagination via composite aggregation. We evaluated multiple pagination implementations using a generated dataset with 1 million entities. Each entity has two documents. The experiment configurations can vary according to the number of data nodes, primary shards, and categorical fields. We believe composite queries are the right choice because even though they may not be the fastest in all cases, they’re the most stable on average (40 seconds).
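To illustrate, a composite aggregation pages through entities in fixed-size batches and aggregates the feature per entity. The following sketch shows the general shape of such a query against our example index; the exact query HCAD builds internally may differ.

GET /host-cloudwatch/_search?size=0
{
	"aggs": {
		"entities": {
			"composite": {
				"size": 1000,
				"sources": [
					{ "host": { "terms": { "field": "host" } } },
					{ "process": { "terms": { "field": "process" } } }
				]
			},
			"aggs": {
				"gc_time_average": { "avg": { "field": "jvmGcTime" } }
			}
		}
	}
}

Each response returns an after_key, which you pass back in the next request’s after parameter to fetch the next page of entities.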

Amortize expensive operations

HCAD can face thundering herd traffic, in which many entities make requests like reading checkpoints from disks at approximately the same time. Therefore, we create various queues to buffer pent-up requests. These queues amortize expensive costs by performing a small and bounded amount of work steadily. Therefore, HCAD can offer predictable performance and availability at scale.

In-memory cache

HCAD relies on caching to process entities whose total memory requirement is larger than the configured memory size. At first, we tried a least recently used (LRU) cache but experienced thrashing when running the HTTP logs workload: with 100 1-minute interval detectors and millions of entities for each detector, we saw few cache hits (many hundreds) within 7 hours. We were wasting CPU cycles swapping models in and out of memory all the time. As a general rule, caching for quick model access isn’t worth it when the hit-to-miss ratio is worse than 3:1.

Instead, we turned to a modified LFU caching, augmented to include heavy hitters’ approximation. A decayed count is maintained for each model in the cache. The decayed count for a model in the cache is incremented when the model is accessed. The model with the smallest decayed count is the least frequently used model. When the cache reaches its capacity, it invalidates and removes the least frequently used model if the new entity’s frequency is no smaller than the least frequently used entity. This connection between heavy hitter approximation and traditional LFU allows us to make the more frequent and recent models sticky in memory and phase out models with lesser cache hit probabilities.

Fault tolerance

Unrecoverable memory state is limited, and enough information of models is stored on disk for crash resilience. Models are recovered on a different host after a crash is detected.

High performance

HCAD builds on asynchronous I/O: all I/O requests such as network calls or disk accesses are non-blocking. In addition, model distribution is balanced across the cluster using a consistent hash ring.

Summary

We enhanced HCAD to improve its scalability and completeness without altering the fidelity of the computation. As a result of these improvements, we showed you how to size an OpenSearch domain and use HCAD to monitor 1 million entities in 10 minutes. To learn more about HCAD, see the anomaly detection documentation.

If you have feedback about this post, submit comments in the comments section below. If you have questions about this post, start a new thread on the Machine Learning forum.


About the Author


Kaituo Li is an engineer in Amazon OpenSearch Service. He has worked on distributed systems, applied machine learning, monitoring, and database storage in Amazon. Before Amazon, Kaituo was a PhD student in Computer Science at University of Massachusetts, Amherst. He likes reading and sports.

Automating Index State Management for Amazon OpenSearch Service (successor to Amazon Elasticsearch Service)

Post Syndicated from Gene Alpert original https://aws.amazon.com/blogs/big-data/automating-index-state-management-for-amazon-opensearch-service-successor-to-amazon-elasticsearch-service/

When it comes to time-series data, it's more common to access new data than existing data, such as the last four hours or one day. Application teams often must maintain multiple indexes for diverse data workloads, which creates the need for a custom solution to manage the index lifecycles. This becomes tedious as the number of indexes grows, and it results in housekeeping overhead.

Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) enables you to automate recurring index management activities. This avoids having to use any additional tools to manage the index lifecycles. Index State Management (ISM) lets you create a policy that automates these operations based on index age, size, and other conditions, all from within your Amazon OpenSearch Service domain.

In this post, we discuss how you can implement a policy to automate routine index management tasks and apply them to indexes and index patterns.

Prerequisites

Before you get started, make sure to complete the following prerequisites:

  1. Have Elasticsearch 6.8 or later (required to use ISM and UltraWarm), or set up a new Amazon OpenSearch Service domain with UltraWarm enabled.
  2. Register a manual snapshot repository for your new domain.
  3. Make sure that your user role has sufficient permissions to access the Kibana console of the Amazon OpenSearch Service domain. If required, validate and configure the access to your domains.

Use case

Amazon OpenSearch Service supports three storage tiers: the default “hot” for active writing and low-latency analytics, UltraWarm for read-only data up to three petabytes at one-tenth of the hot tier cost, and cold for unlimited long-term archival. Although hot storage is used for indexing and providing the fastest access, UltraWarm complements the hot storage tier by providing less expensive storage for older and less-frequently accessed data. This is done while maintaining the same interactive analytics experience. Rather than attached storage, UltraWarm nodes use Amazon Simple Storage Service (Amazon S3) and a sophisticated caching solution to improve performance.

To demonstrate the functionality, we present a sample use case of handling time-series data in daily indices. In this use case, we automatically snapshot each index after 24 hours, then we migrate the index from the default hot state to UltraWarm storage after two days, cold storage after 30 days, and finally delete the index after 60 days.

After you create the Amazon OpenSearch Service domain and register a snapshot repository, complete the following steps:

  1. Wait for the domain status to turn active and choose the Kibana endpoint.
  2. Log in using the Kibana UI endpoint.
  3. On Kibana’s splash page, add all of the sample data listed by choosing Try our sample data and choosing Add data.
  4. After adding the data, choose Index Management (the IM icon on the left navigation pane), which takes you to the Index Policies page.
  5. Choose Create policy.
  6. For Name policy, enter ism-policy-example.
  7. Replace the default policy with the following policy:
{
    "policy": {
        "description": "Demonstrate a hot-snapshot-warm-cold-delete workflow.",
        "default_state": "hot",
        "states": [
            {
                "name": "hot",
                "actions": [],
                "transitions": [
                    {
                        "state_name": "snapshot",
                        "conditions": {
                            "min_index_age": "24h"
                        }
                    }
                ]
            },
            {
                "name": "snapshot",
                "actions": [
                    {
                        "retry": {
                            "count": 5,
                            "backoff": "exponential",
                            "delay": "30m"
                        },
                        "snapshot": {
                            "repository": "snapshot-repo",
                            "snapshot": "ism-snapshot"
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "warm",
                        "conditions": {
                            "min_index_age": "2d"
                        }
                    }
                ]
            },
            {
                "name": "warm",
                "actions": [
                    {
                        "retry": {
                            "count": 5,
                            "backoff": "exponential",
                            "delay": "1h"
                        },
                        "warm_migration": {}
                    }
                ],
                "transitions": [
                    {
                        "state_name": "cold",
                        "conditions": {
                            "min_index_age": "30d"
                        }
                    }
                ]
            },
            {
                "name": "cold",
                "actions": [
                    {
                        "retry": {
                            "count": 5,
                            "backoff": "exponential",
                            "delay": "1h"
                        },
                        "cold_migration": {
                            "start_time": null,
                            "end_time": null,
                            "timestamp_field": "@timestamp",
                            "ignore": "none"
                        }
                    }
                ],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "60d"
                        }
                    }
                ]
            },
            {
                "name": "delete",
                "actions": [
                    {
                        "cold_delete": {}
                    }
                ],
                "transitions": []
            }
        ],
        "ism_template": [
            {
                "index_patterns": [
                    "index-*"
                ],
                "priority": 100
            }
        ]
    }
}    

Note that the last section of the policy, “ism_template”, is used to automatically attach the ISM policy to any newly created index that matches the index_patterns. You can define multiple patterns.
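For example (index name illustrative), any index created after the policy exists and whose name matches one of the patterns picks up the policy automatically, which you can confirm on the Managed Indices page:

PUT index-2022-02-18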

Furthermore, you can use the ISM API to programmatically work with policies and managed indexes.
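For instance, the following calls (index name illustrative) attach the policy we created to an existing index and then check its ISM status. On Elasticsearch-based domains, the ISM endpoints use the _opendistro prefix instead of _plugins.

POST _plugins/_ism/add/index-2022-02-18
{
    "policy_id": "ism-policy-example"
}

GET _plugins/_ism/explain/index-2022-02-18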

  1. Choose Create. You can now see your index policy on the Index Policies page.
  2. On the Indices page, search for kibana_sample, which should list all of the sample data indexes that you added earlier.
  3. Select all of the indexes and choose Apply policy.
  4. From the Policy ID drop-down menu, choose the policy created in the previous step.
  5. Choose Apply.

The policy is now assigned and starts managing the indexes. On the Managed Indices page, you can observe the status as Initializing.

When initialization is complete, the status changes to Running.

You can also set a refresh frequency to refresh the managed indexes’ status information.

Demystifying the policy

In this section, we explain the index policy and its structure.

Policies are JSON documents that define the following:

  • The states an index can be in
  • Any actions that you want the plugin to take when an index enters the state
  • Conditions that must be met for an index to move or transition into a new state

The policy document begins with basic metadata such as description, the default_state that the index should enter, and finally a series of state definitions.

state is the status that the managed index is currently in. A managed index can only be in one state at a time. Each state has associated actions that are run sequentially upon entering a state, and transitions that are checked after all of the actions are complete.

The first state is hot. In this use case, no actions are defined in this hot state. Moreover, the managed indexes land in this state initially, and then transition to snapshot, warm, cold, and finally delete. Transitions define the conditions that must be met for a state to change (in this case, change to snapshot after the index crosses 24 hours).

We can quickly verify the status in the Index Management section of Kibana. Each index will have state details listed under Policy managed indices.

You can also verify this on the Amazon OpenSearch Service console under the UltraWarm storage usage and Cold storage usage columns.

The last state of the policy document marks the indexes for deletion based on the configured actions. This policy state assumes that your index is non-critical and no longer receiving write requests; having zero replicas carries some risk of data loss.

Among other actions, you can also configure the notification to send policy action information to Chime, Slack, or a webhook URL. Complete details can be found here.
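As a sketch (the webhook URL and message are placeholders), a notification action inside a state's actions array looks like the following:

{
    "notification": {
        "destination": {
            "slack": {
                "url": "https://hooks.slack.com/services/xxx/yyy/zzz"
            }
        },
        "message_template": {
            "source": "The index {{ctx.index}} has entered the delete state."
        }
    }
}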

The following screenshot shows the index status on the console.

Additional information on ISM policies

If you have an existing Amazon OpenSearch Service domain with no UltraWarm support (because of any missing prerequisites), you can use the read_only and reduce_replicas policy actions to replace the warm state. The following code is the policy template for these two states:

{
                "name": "reduce_replicas",
                "actions": [{
                  "replica_count": {
                    "number_of_replicas": 0
                  }
                }],
                "transitions": [{
                  "state_name": "read_only",
                  "conditions": {
                    "min_index_age": "2d"
                  }
                }]
            },
            {
                "name": "read_only",
                "actions": [
                    {
                        "read_only": {}
                      }
                ],
                "transitions": [
                    {
                        "state_name": "delete",
                        "conditions": {
                            "min_index_age": "3d"
                        }
                    }
                ]
            },

Summary

In this post, you learned how to use the ISM feature with UltraWarm and cold storage tiers for Amazon OpenSearch Service. The walkthrough illustrated how to manage indexes using this plugin with a sample lifecycle policy.

For more information about the ISM plugin, see Index State Management. If you need enhancements or have other feature requests, then please file an issue. To get involved with the project, see Contributing Guidelines.

A big takeaway for me as I evaluated the ISM plugin in Amazon OpenSearch Service was that the ISM plugin is fully compatible and works on any deployment of the open source OpenSearch software. For more information, see the Index State Management in OpenSearch documentation.


About the Author

Gene Alpert is a Big Data and Analytics Consultant with Amazon Web Services. He helps AWS customers optimize their cloud data operations.

Unify log aggregation and analytics across compute platforms

Post Syndicated from Hari Ohm Prasath original https://aws.amazon.com/blogs/big-data/unify-log-aggregation-and-analytics-across-compute-platforms/

Our customers want to make sure their users have the best experience running their application on AWS. To make this happen, you need to monitor and fix software problems as quickly as possible. Doing this gets challenging as the volume of data that needs to be quickly detected, analyzed, and stored keeps growing. In this post, we walk you through an automated process to aggregate and monitor logging-application data in near-real time, so you can remediate application issues faster.

This post shows how to unify and centralize logs across different computing platforms. With this solution, you can unify logs from Amazon Elastic Compute Cloud (Amazon EC2), Amazon Elastic Container Service (Amazon ECS), Amazon Elastic Kubernetes Service (Amazon EKS), Amazon Kinesis Data Firehose, and AWS Lambda using agents, log routers, and extensions. We use Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) with OpenSearch Dashboards to visualize and analyze the logs, collected across different computing platforms to get application insights. You can deploy the solution using the AWS Cloud Development Kit (AWS CDK) scripts provided as part of the solution.

Customer benefits

A unified aggregated log system provides the following benefits:

  • A single point of access to all the logs across different computing platforms
  • A consistent way to define and standardize the transformations of logs before they get delivered to downstream systems like Amazon Simple Storage Service (Amazon S3), Amazon OpenSearch Service, Amazon Redshift, and other services
  • The ability to use Amazon OpenSearch Service to quickly index logs, and OpenSearch Dashboards to search and visualize them across routers, applications, and other devices

Solution overview

In this post, we use the following services to demonstrate log aggregation across different compute platforms:

  • Amazon EC2 – A web service that provides secure, resizable compute capacity in the cloud. It’s designed to make web-scale cloud computing easier for developers.
  • Amazon ECS – A web service that makes it easy to run, scale, and manage Docker containers on AWS, designed to make the Docker experience easier for developers.
  • Amazon EKS – A managed service that makes it easy to run Kubernetes on AWS without needing to install and operate your own Kubernetes control plane.
  • Kinesis Data Firehose – A fully managed service that makes it easy to stream data to Amazon S3, Amazon Redshift, or Amazon OpenSearch Service.
  • Lambda – A serverless compute service that lets you run code without provisioning or managing servers.
  • Amazon OpenSearch Service – A fully managed service that makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more.

The following diagram shows the architecture of our solution.

The architecture uses various log aggregation tools such as log agents, log routers, and Lambda extensions to collect logs from multiple compute platforms and deliver them to Kinesis Data Firehose. Kinesis Data Firehose streams the logs to Amazon OpenSearch Service. Log records that fail to be persisted in Amazon OpenSearch Service are written to Amazon S3. To scale this architecture, each of these compute platforms streams its logs to a different Firehose delivery stream, which is added as a separate index and rotated every 24 hours.

The following sections demonstrate how the solution is implemented on each of these computing platforms.

Amazon EC2

The Kinesis agent collects and streams logs from the applications running on EC2 instances to Kinesis Data Firehose. The agent is a standalone Java software application that offers an easy way to collect and send data to Kinesis Data Firehose. The agent continuously monitors files and sends logs to the Firehose delivery stream.


The AWS CDK script provided as part of this solution deploys a simple PHP application that generates logs under the /etc/httpd/logs directory on the EC2 instance. The Kinesis agent is configured via /etc/aws-kinesis/agent.json to collect data from access_logs and error_logs, and stream them periodically to Kinesis Data Firehose (ec2-logs-delivery-stream).
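The following is a minimal sketch of what that agent.json configuration might look like; the endpoint Region and file patterns are illustrative and should match your own deployment:

{
    "cloudwatch.emitMetrics": true,
    "firehose.endpoint": "firehose.us-east-1.amazonaws.com",
    "flows": [
        {
            "filePattern": "/etc/httpd/logs/access_log*",
            "deliveryStream": "ec2-logs-delivery-stream"
        },
        {
            "filePattern": "/etc/httpd/logs/error_log*",
            "deliveryStream": "ec2-logs-delivery-stream"
        }
    ]
}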

Because Amazon OpenSearch Service expects data in JSON format, you can add a call to a Lambda function to transform the log data to JSON format within Kinesis Data Firehose before streaming to Amazon OpenSearch Service. The following is a sample input for the data transformer:

46.99.153.40 - - [29/Jul/2021:15:32:33 +0000] "GET / HTTP/1.1" 200 173 "-" "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"

The following is our output:

{
    "logs" : "46.99.153.40 - - [29/Jul/2021:15:32:33 +0000] \"GET / HTTP/1.1\" 200 173 \"-\" \"Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36\""
}

We can enhance the Lambda function to extract the timestamp, HTTP, and browser information from the log data, and store them as separate attributes in the JSON document.
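For example, an enriched document derived from the preceding sample log line might look like the following; the field names are illustrative, not the exact output of the solution's transformer:

{
    "client_ip": "46.99.153.40",
    "timestamp": "29/Jul/2021:15:32:33 +0000",
    "request": "GET / HTTP/1.1",
    "status": 200,
    "bytes": 173,
    "user_agent": "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}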

Amazon ECS

In the case of Amazon ECS, we use FireLens to send logs directly to Kinesis Data Firehose. FireLens is a container log router for Amazon ECS and AWS Fargate that gives you the extensibility to use the breadth of services at AWS or partner solutions for log analytics and storage.


The architecture hosts FireLens as a sidecar, which collects logs from the main container running an httpd application and sends them to Kinesis Data Firehose and streams to Amazon OpenSearch Service. The AWS CDK script provided as part of this solution deploys a httpd container hosted behind an Application Load Balancer. The httpd logs are pushed to Kinesis Data Firehose (ecs-logs-delivery-stream) through the FireLens log router.
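As a sketch of how the application container points at FireLens, the container definition's log configuration uses the awsfirelens log driver with the Fluent Bit firehose output options; the Region shown is an assumption, and the delivery stream name matches this solution:

"logConfiguration": {
    "logDriver": "awsfirelens",
    "options": {
        "Name": "firehose",
        "region": "us-east-1",
        "delivery_stream": "ecs-logs-delivery-stream"
    }
}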

Amazon EKS

With the recent announcement of Fluent Bit support for Amazon EKS, you no longer need to run a sidecar to route container logs from Amazon EKS pods running on Fargate. With the new built-in logging support, you can select a destination of your choice to send the records to. Amazon EKS on Fargate uses a version of Fluent Bit for AWS, an upstream conformant distribution of Fluent Bit managed by AWS.


The AWS CDK script provided as part of this solution deploys an NGINX container hosted behind an internal Application Load Balancer. The NGINX container logs are pushed to Kinesis Data Firehose (eks-logs-delivery-stream) through the Fluent Bit plugin.

Lambda

For Lambda functions, you can send logs directly to Kinesis Data Firehose using the Lambda extension. You can also deny the records from being written to Amazon CloudWatch.


After deployment, the workflow is as follows:

  1. On startup, the extension subscribes to receive logs for the platform and function events. A local HTTP server is started inside the external extension, which receives the logs.
  2. The extension buffers the log events in a synchronized queue and writes them to Kinesis Data Firehose via PUT records.
  3. The logs are sent to downstream systems.
  4. The logs are sent to Amazon OpenSearch Service.

The Firehose delivery stream name gets specified as an environment variable (AWS_KINESIS_STREAM_NAME).

For this solution, because we’re only focusing on collecting the run logs of the Lambda function, the data transformer of the Kinesis Data Firehose delivery stream keeps only the records of type function ("type":"function") and drops the rest before sending them to Amazon OpenSearch Service.

The following is a sample input for the data transformer:

[
   {
      "time":"2021-07-29T19:54:08.949Z",
      "type":"platform.start",
      "record":{
         "requestId":"024ae572-72c7-44e0-90f5-3f002a1df3f2",
         "version":"$LATEST"
      }
   },
   {
      "time":"2021-07-29T19:54:09.094Z",
      "type":"platform.logsSubscription",
      "record":{
         "name":"kinesisfirehose-logs-extension-demo",
         "state":"Subscribed",
         "types":[
            "platform",
            "function"
         ]
      }
   },
   {
      "time":"2021-07-29T19:54:09.096Z",
      "type":"function",
      "record":"2021-07-29T19:54:09.094Z\tundefined\tINFO\tLoading function\n"
   },
   {
      "time":"2021-07-29T19:54:09.096Z",
      "type":"platform.extension",
      "record":{
         "name":"kinesisfirehose-logs-extension-demo",
         "state":"Ready",
         "events":[
            "INVOKE",
            "SHUTDOWN"
         ]
      }
   },
   {
      "time":"2021-07-29T19:54:09.097Z",
      "type":"function",
      "record":"2021-07-29T19:54:09.097Z\t024ae572-72c7-44e0-90f5-3f002a1df3f2\tINFO\tvalue1 = value1\n"
   },   
   {
      "time":"2021-07-29T19:54:09.098Z",
      "type":"platform.runtimeDone",
      "record":{
         "requestId":"024ae572-72c7-44e0-90f5-3f002a1df3f2",
         "status":"success"
      }
   }
]
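After filtering, only the function-type entries from the preceding sample are forwarded, so a document similar to the following (shape assumed for illustration) is what reaches Amazon OpenSearch Service:

{
    "time": "2021-07-29T19:54:09.097Z",
    "type": "function",
    "record": "2021-07-29T19:54:09.097Z\t024ae572-72c7-44e0-90f5-3f002a1df3f2\tINFO\tvalue1 = value1\n"
}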

Prerequisites

To implement this solution, you need the following prerequisites:

Build the code

Check out the AWS CDK code by running the following command:

mkdir unified-logs && cd unified-logs
git clone https://github.com/aws-samples/unified-log-aggregation-and-analytics .

Build the lambda extension by running the following command:

cd lib/computes/lambda/extensions
chmod +x extension.sh
./extension.sh
cd ../../../../

Make sure to replace the default AWS Region specified as the value of the firehose.endpoint attribute inside lib/computes/ec2/ec2-startup.sh.

Build the code by running the following command:

yarn install && npm run build

Deploy the code

If you’re running AWS CDK for the first time, run the following command to bootstrap the AWS CDK environment (provide your AWS account ID and AWS Region):

cdk bootstrap \
    --cloudformation-execution-policies arn:aws:iam::aws:policy/AdministratorAccess \
    aws://<AWS Account Id>/<AWS_REGION>

You only need to bootstrap the AWS CDK one time (skip this step if you have already done this).

Run the following command to deploy the code:

cdk deploy --require-approval never

You get the following output:

 ✅  CdkUnifiedLogStack

Outputs:
CdkUnifiedLogStack.ec2ipaddress = xx.xx.xx.xx
CdkUnifiedLogStack.ecsloadbalancerurl = CdkUn-ecsse-PY4D8DVQLK5H-xxxxx.us-east-1.elb.amazonaws.com
CdkUnifiedLogStack.ecsserviceLoadBalancerDNS570CB744 = CdkUn-ecsse-PY4D8DVQLK5H-xxxx.us-east-1.elb.amazonaws.com
CdkUnifiedLogStack.ecsserviceServiceURL88A7B1EE = http://CdkUn-ecsse-PY4D8DVQLK5H-xxxx.us-east-1.elb.amazonaws.com
CdkUnifiedLogStack.eksclusterClusterNameCE21A0DB = ekscluster92983EFB-d29892f99efc4419bc08534a3d253160
CdkUnifiedLogStack.eksclusterConfigCommand515C0544 = aws eks update-kubeconfig --name ekscluster92983EFB-d29892f99efc4419bc08534a3d253160 --region us-east-1 --role-arn arn:aws:iam::xxx:role/CdkUnifiedLogStack-clustermasterroleCD184EDB-12U2TZHS28DW4
CdkUnifiedLogStack.eksclusterGetTokenCommand3C33A2A5 = aws eks get-token --cluster-name ekscluster92983EFB-d29892f99efc4419bc08534a3d253160 --region us-east-1 --role-arn arn:aws:iam::xxx:role/CdkUnifiedLogStack-clustermasterroleCD184EDB-12U2TZHS28DW4
CdkUnifiedLogStack.elasticdomainarn = arn:aws:es:us-east-1:xxx:domain/cdkunif-elasti-rkiuv6bc52rp
CdkUnifiedLogStack.s3bucketname = cdkunifiedlogstack-logsfailederrcapturebucket0bcc-xxxxx
CdkUnifiedLogStack.samplelambdafunction = CdkUnifiedLogStack-LambdatransformerfunctionFA3659-c8u392491FrW

Stack ARN:
arn:aws:cloudformation:us-east-1:xxxx:stack/CdkUnifiedLogStack/6d53ef40-efd2-11eb-9a9d-1230a5204572

AWS CDK takes care of building the required infrastructure, deploying the sample application, and collecting logs from different sources to Amazon OpenSearch Service.

The following is some of the key information about the stack:

  • ec2ipaddress – The public IP address of the EC2 instance, deployed with the sample PHP application
  • ecsloadbalancerurl – The URL of the Amazon ECS Load Balancer, deployed with the httpd application
  • eksclusterClusterNameCE21A0DB – The Amazon EKS cluster name, deployed with the NGINX application
  • samplelambdafunction – The sample Lambda function using the Lambda extension to send logs to Kinesis Data Firehose
  • elasticdomainarn – The ARN of the Amazon OpenSearch Service domain

Generate logs

To visualize the logs, you first need to generate some sample logs.

  1. To generate Lambda logs, invoke the function using the following AWS CLI command (run it a few times):
aws lambda invoke \
--function-name "<<samplelambdafunction>>" \
--payload '{"payload": "hello"}' /tmp/invoke-result \
--cli-binary-format raw-in-base64-out \
--log-type Tail

Make sure to replace samplelambdafunction with the actual Lambda function name. The file path needs to be updated based on the underlying operating system.

The function should return "StatusCode": 200, with the following output:

{
    "StatusCode": 200,
    "LogResult": "<<Encoded>>",
    "ExecutedVersion": "$LATEST"
}
  2. Run the following command a couple of times to generate Amazon EC2 logs:
curl http://ec2ipaddress:80

Make sure to replace ec2ipaddress with the public IP address of the EC2 instance.

  3. Run the following command a couple of times to generate Amazon ECS logs:
curl http://ecsloadbalancerurl:80

Make sure to replace ecsloadbalancerurl with the public DNS name of the Application Load Balancer from the stack output.

We deployed the NGINX application with an internal load balancer, so the load balancer hits the health checkpoint of the application, which is sufficient to generate the Amazon EKS access logs.

Visualize the logs

To visualize the logs, complete the following steps:

  1. On the Amazon OpenSearch Service console, choose the hyperlink provided for the OpenSearch Dashboards URL.
  2. Configure access to the OpenSearch Dashboard.
  3. Under OpenSearch Dashboard, on the Discover menu, start creating a new index pattern for each compute log.

We can see separate indexes for each compute log partitioned by date, as in the following screenshot.


The following screenshot shows the process to create index patterns for Amazon EC2 logs.


After you create the index pattern, we can start analyzing the logs using the Discover menu under OpenSearch Dashboard in the navigation pane. This tool provides a single searchable and unified interface for all the records with various compute platforms. We can switch between different logs using the Change index pattern submenu.


Clean up

Run the following command from the root directory to delete the stack:

cdk destroy

Conclusion

In this post, we showed how to unify and centralize logs across different compute platforms using Kinesis Data Firehose and Amazon OpenSearch Service. This approach allows you to analyze logs quickly and identify the root cause of failures, using a single platform rather than different platforms for different services.

If you have feedback about this post, submit your comments in the comments section.


About the author

Hari Ohm Prasath is a Senior Modernization Architect at AWS, helping customers with their modernization journey to become cloud native. Hari loves to code and actively contributes to open source initiatives. You can find him on Medium, GitHub, and Twitter @hariohmprasath.

Ballu Singh is a Principal Solutions Architect at AWS. He lives in the San Francisco Bay Area and helps customers architect and optimize applications on AWS. In his spare time, he enjoys reading and spending time with his family.

Set advanced settings with the Amazon OpenSearch Service Dashboards API

Post Syndicated from Prashant Agrawal original https://aws.amazon.com/blogs/big-data/set-advanced-settings-with-the-amazon-opensearch-service-dashboards-api/

Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) is a fully managed service that you can use to deploy and operate OpenSearch clusters cost-effectively at scale in the AWS Cloud. The service makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more by offering the latest versions of OpenSearch, support for 19 versions of Elasticsearch (1.5 to 7.10 versions), and visualization capabilities powered by OpenSearch Dashboards and Kibana (1.5 to 7.10 versions).

A common use case of OpenSearch in multi-tenant environments is to use tenants in OpenSearch Dashboards and provide segregated index patterns, dashboards, and visualizations to different teams in the organization. Tenants in OpenSearch Dashboards aren’t the same as indexes, where OpenSearch organizes all data. You may still have multiple indexes for multi-tenancy and tenants for controlling access to OpenSearch Dashboards’ saved objects.

In this post, we focus on operationalizing advanced settings for OpenSearch Dashboards tenants with programmatic ways, in particular with the Dashboards Advanced Settings API. For a deeper insight into multi-tenancy in OpenSearch, refer to OpenSearch Dashboards multi-tenancy.

One example of advanced settings configurations is deploying time zone settings in an environment where each tenant is aligned to a different geographic area with specific time zone. We explain the time zone configuration with the UI and demonstrate configuring it with the OpenSearch Dashboards Advanced Settings API using curl. This post also provides guidance for other advanced settings you may wish to include in your deployment.

To follow along in this post, make sure you have an Amazon OpenSearch Service domain with access to OpenSearch Dashboards through a role with administrator privileges for the domain. For more information about enabling access control mechanisms for your domains, see Fine-grained access control in Amazon OpenSearch Service.

The following examples use Amazon OpenSearch Service version 1.0, which was the latest release at the time of writing.

Configure advanced settings in the OpenSearch Dashboards UI

To configure advanced settings via the OpenSearch Dashboards UI, complete the following steps:

  1. Log in to OpenSearch Dashboards.
  2. Choose your user icon and choose Switch Tenants to choose the tenant you want to change configuration for.

By default, all OpenSearch Dashboards users have access to two tenants: private and global. The global tenant is shared between every OpenSearch Dashboards user. The private tenant is exclusive to each user and used mostly for experimenting before publishing configuration to other tenants. Make sure to check your configurations in the private tenant before replicating in other tenants, including global.

  1. Choose Stack Management in the navigation pane, then choose Advanced Settings.
  2. In your desired tenant context, choose a value for Timezone for date formatting.

In this example, we change the time zone from the default selection Browser to US/Eastern.

  1. Choose Save changes.

Configure advanced settings with the OpenSearch Dashboards API

For environments where you prefer to perform operations programmatically, Amazon OpenSearch Service provides the ability to configure advanced settings with the OpenSearch Dashboards advanced settings API.

Let’s walk through configuring the time zone using curl.

  1. First, you need to authenticate to the API endpoint with your user name and password, and retrieve the authorization cookies into the file auth.txt:
curl -X POST  https://<domain_endpoint>/_dashboards/auth/login \
-H "osd-xsrf: true" \
-H "content-type:application/json" \
-d '{"username":"<username>", "password":"<password>"}' \
-c auth.txt

In this example, we configure OpenSearch Dashboards to use the internal user database, and the user inherits administrative permissions under the global tenant. In multi-tenant environments, the user is required to have relevant tenant permissions. You can see an example of this in the next section, where we illustrate a multi-tenant environment. Access control in OpenSearch Dashboards is a broad and important topic, and it would be unfair to try to squeeze all of it in this post. Therefore, we don’t cover access control in depth here. For additional information on access control in multi-tenant OpenSearch Dashboards, refer to OpenSearch Dashboards multi-tenancy.

The auth.txt file holds authorization cookies that you use to pass configuration changes to the API endpoint. The auth.txt file should look similar to the following code:

# Netscape HTTP Cookie File
# https://curl.haxx.se/docs/http-cookies.html
# This file was generated by libcurl! Edit at your own risk.

#HttpOnly_<domain_endpoint> FALSE   /_dashboards    TRUE    0       security_authentication Fe26.2**80fca234dd0974fb6dfe9427e6b8362ba1dd78fc5a71
e7f9803694f40980012b*k9QboTT5A24hs71_wN32Cw*9-RvY2UhS-Cmat4RZPHohTbczyGRjmHezlIlhwePG1gv_P2bgSuZhx9XBV9I-zzdxrZIbJTTpymy4mv1rAB_GRuXjt-6ITUfsG58GrI7TI7D3pWKaw8n6lrhamccGYqL9K_dQrE4kr_godwEDLydR1d_
Oc11jEG98yi_O0qhBTu1kDNzNAEqgXEoaLS--afnbwPS0zvqUc4MUgrfGQOTt7mUoWMC778Tpii4V4gxhAcRqe_KoYQG1LhUq-j9XTHCouzB4qTJ8gR3tlbVYMFwhA**f278b1c9f2c9e4f50924c47bfd1a992234400c6f11ee6f005beecc4201760998*3Aj8gQAIKKPoUR0PX-5doFgZ9zqxlcB3YbfDgJIBNLU
  2. Construct configuration changes within the curl body and submit them using an authorization cookie. In this example, we included a sample to modify the time zone to US/Eastern.
curl -X PUT https://<domain_endpoint>/_dashboards/api/saved_objects/config/1.0.0-SNAPSHOT \
-H "osd-xsrf:true" \
-H "content-type:application/json" \
-d '{"attributes":{"dateFormat:tz":"US/Eastern"}}' \
-b auth.txt

By default, the constructed API modifies the configuration in the private tenant, which is exclusive to each user, can’t be shared, and is ideal for testing. We provide instructions to modify configuration in multi-tenant environments later in the post.

Your API call should receive a response similar to the following code, indicating the changes you submitted:

{"id":"1.0.0-SNAPSHOT","type":"config","updated_at":"2021-09-06T19:59:42.425Z","version":"WzcsMV0=","namespaces":["default"],"attributes":{"dateFormat:tz":"US/Eastern"}}

If you prefer to make multiple changes, you can construct the API call as follows:

curl -X PUT https://<domain_endpoint>/_dashboards/api/saved_objects/config/1.0.0-SNAPSHOT \
-H "osd-xsrf:true" \
-H "content-type:application/json" \
-d \
'{
    "attributes":{
      "dateFormat:tz":"US/Eastern",
      "dateFormat:dow":"Monday"
    }
 }' \
-b auth.txt

To retrieve the latest configuration changes, construct a GET request as follows:

curl -X GET https://<domain_endpoint>/_dashboards/api/saved_objects/config/1.0.0-SNAPSHOT \
-H "osd-xsrf:true" \
-H "content-type:application/json" \
-b auth.txt

Configure advanced settings with the OpenSearch Dashboards API in multi-tenant environments

Tenants in OpenSearch Dashboards are commonly used to share custom index patterns, visualizations, dashboards, and other OpenSearch objects with different teams or organizations.

The OpenSearch Dashboards API provides the ability to modify advanced settings in different tenants. In the previous section, we covered making advanced configuration changes for a private tenant. We now walk through a similar scenario for multiple tenants.

  1. First, you need to authenticate to the API endpoint and retrieve the authorization cookies into the file auth.txt. You can construct this request in the same way you would in a single-tenant environment as described in the previous section.

In multi-tenant environments, make sure you configure the user’s role with relevant tenant permissions. One pattern is to associate the user to the kibana_user and a custom group that has tenant permissions. In our example, we associated the tenant admin user tenant-a_admin_user to the two roles as shown in the following code: the kibana_user system role and a custom tenant-a_admin_role that includes tenant permissions.

GET _plugins/_security/api/account
{
  "user_name" : "tenant-a_admin_user",
  "is_reserved" : false,
  "is_hidden" : false,
  "is_internal_user" : true,
  "user_requested_tenant" : "tenant-a",
  "backend_roles" : [
    ""
  ],
  "custom_attribute_names" : [ ],
  "tenants" : {
    "global_tenant" : true,
    "tenant-a_admin_user" : true,
    "tenant-a" : true
  },
  "roles" : [
    "tenant-a_admin_role",
    "kibana_user"
  ]
}


GET _plugins/_security/api/roles/tenant-a_admin_role
{
  "tenant-a_admin_role" : {
    "reserved" : false,
    "hidden" : false,
    "cluster_permissions" : [ ],
    "index_permissions" : [ ],
    "tenant_permissions" : [
      {
        "tenant_patterns" : [
          "tenant-a"
        ],
        "allowed_actions" : [
          "kibana_all_write"
        ]
      }
    ],
    "static" : false
  }
}

After authenticating to the OpenSearch Dashboards API, the auth.txt file holds authorization cookies that you use to pass configuration changes to the API endpoint. The content of the auth.txt file should be similar to the one we illustrated in the previous section.

  2. Construct the configuration changes by adding a securitytenant header. In this example, we modify the time zone and day of week in tenant-a:
curl -X PUT https://<domain_endpoint>/_dashboards/api/saved_objects/config/1.0.0-SNAPSHOT \
-H "osd-xsrf:true" \
-H "content-type:application/json" \
-H "securitytenant: tenant-a" \
-d \
'{
    "attributes":{
     "dateFormat:tz":"US/Eastern",
     "dateFormat:dow":"Monday"
    }
 }' \
-b auth.txt

The OpenSearch Dashboards API endpoint returns a response similar to the following:

{"id":"1.0.0-SNAPSHOT","type":"config","updated_at":"2021-10-10T17:41:47.249Z","version":"WzEsMV0=","namespaces":["default"],"attributes":{"dateFormat:tz":"US/Eastern","dateFormat:dow":"Monday"}}

You could also verify the configuration changes in the OpenSearch Dashboards UI, as shown in the following screenshot.

Conclusion

In this post, you used the Amazon OpenSearch Service Dashboards UI and API to configure advanced settings for a single-tenant and multi-tenant environment. Implementing OpenSearch Dashboards at scale in multi-tenant environments requires more efficient methods than simply using the UI. This is especially important in environments where you serve centralized logging and monitoring domains for different teams. You can use the OpenSearch Dashboards APIs we illustrated in this post and bake your advanced setting configurations into your infrastructure code to accelerate your deployments!

Let us know about your questions and other topics you’d like us to cover in the comment section.


About the Authors

Prashant Agrawal is a Specialist Solutions Architect at Amazon Web Services based in Seattle, WA. Prashant works closely with the Amazon OpenSearch Service team, helping customers migrate their workloads to the AWS Cloud. Before joining AWS, Prashant helped various customers use Elasticsearch for their search and analytics use cases.

Evren Sen is a Solutions Architect at AWS, focusing on strategic financial services customers. He helps his customers create Cloud Center of Excellence and design, and deploy solutions on the AWS Cloud. Outside of AWS, Evren enjoys spending time with family and friends, traveling, and cycling.

View summarized data with Amazon OpenSearch Service Index Transforms

Post Syndicated from Viral Shah original https://aws.amazon.com/blogs/big-data/view-summarized-data-with-amazon-opensearch-service-index-transforms/

Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) recently announced support for Index Transforms. You can use Index Transforms to extract meaningful information from an existing index, and store the aggregated information in a new index. The key benefit of Index Transforms is faster retrieval of data by performing aggregations, grouping in advance, and storing those results in summarized views. For example, you can run continuous aggregations on ecommerce order data to summarize and learn the spending behaviors of your customers. With Index Transforms, you have the flexibility to select specific fields from the source index. You can also run Index Transform jobs on indices that don’t have a timestamp field.

There are two ways to configure Index Transform jobs: by using the OpenSearch Dashboards UI or the Index Transform REST APIs. In this post, we discuss these two methods and share some best practices.

Use the OpenSearch Dashboards UI

To configure an Index Transform job in the Dashboards UI, first identify the source index you want to transform. You can also use sample ecommerce orders data available on the OpenSearch Dashboards home page.

  1. After you log in to OpenSearch Dashboards, choose Home in the navigation pane, then choose Add sample data.
  2. Choose Add Data to create a sample index (for example, opensearch_dashboards_sample_data_ecommerce).
  3. Launch OpenSearch Dashboards and on the menu bar, choose Index Management.
  4. Choose Transform Jobs in the navigation pane.
  5. Choose Create Transform Job.
  6. Specify the Index Transform job name and select the recently created sample ecommerce index as the source.
  7. Choose an existing index or create a new one when selecting the target index.
  8. Choose Edit data filter. You have the option to run transformations only on the filtered data. For this post, we run transformations on products sold more than 10 times but less than 200.
  9. Choose Next.

The sample ecommerce source index has over 50 fields. We only want to select the fields that are relevant to tracking the sales data by product category.

  1. Select the fields category.keyword, total_quantity, and products.price. The Index Transform wizard lets you filter the specific fields of interest, and then select transform operations on those fields.
  2. Because we want to aggregate by product category, choose the plus sign next to the field category.keyword and choose Group by terms.
  3. Similarly, choose Aggregate by max, min, avg for the products.price field and Aggregate by sum for the total_quantity field.

The Index Transform wizard provides a preview of the transformed fields on sample data for quick review. Additionally, you can edit the transformed field names in favor of more descriptive names.

Currently, Index Transform jobs support histogram, date_histogram, and terms groupings. For more information about groupings, see Bucket aggregations. For metrics aggregations, you can choose from sum, avg, max, min, value_count, percentiles, and scripted_metric.

Scripted metrics can be useful when you need to calculate a value based on an existing attribute of the document, for example, finding the latest follower count on a continuous social feed, or finding the customer who placed the first order over a certain amount on a particular day. Scripted metrics are coded in Painless, a simple, secure scripting language designed specifically for use with search platforms.

The following is an example script to find the first customer who placed an order valued at more than $100.

{
   "init_script": "state.timestamp_earliest = -1l; state.order_total = 0l; state.customer_id = ''",
   "map_script": "if (!doc['order_date'].empty) { def current_date = doc['order_date'].getValue().toInstant().toEpochMilli(); def order_total = doc['taxful_total_price'].getValue(); if ((current_date < state.timestamp_earliest && order_total >= 100) || (state.timestamp_earliest == -1 && order_total >= 100)) { state.timestamp_earliest = current_date; state.customer_id = doc['customer_id'].getValue();}}",
   "combine_script": "return state",
   "reduce_script": "def customer_id = ''; def earliest_timestamp = -1L; for (s in states) { if (s.timestamp_earliest != -1 && (earliest_timestamp == -1 || s.timestamp_earliest < earliest_timestamp)) { earliest_timestamp = s.timestamp_earliest; customer_id = s.customer_id;}} return customer_id"
}

Scripted metrics run in four phases:

  • Initialize phase (init_script) – Optional initialization phase where shard level variables can be initialized.
  • Map phase (map_script) – Runs the code on each collected document.
  • Combine phase (combine_script) – Returns the results from all shards or nodes to the coordinating node.
  • Reduce phase (reduce_script) – Produces the final result by processing the results from all shards.

If your use case involves multiple complex scripted metrics calculations, plan to perform calculations prior to ingesting data into the OpenSearch Service domain.

  1. In the last step, specify the schedule for the Index Transform job, for example every 12 hours.
  2. On the Advanced tab, you can modify the pages per run.

This setting indicates the data that can be processed in each search request. Raising this number can increase the memory utilization and lead to higher latency. We recommend using the default setting (1000 pages per run).

  3. Review all the selections and choose Create to schedule the Index Transform job.

Index Transform jobs are enabled by default and run based on a selected schedule.  Choose Refresh to view the status of the Index Transform job.

After the job runs successfully, you can view the details around the number of documents processed, and the time taken to index and search the data.

You can also view the target index contents using the _search API using the OpenSearch Dev Tools console.

Use REST APIs

Index Transform APIs can also be used to create, update, start, and stop Index Transform jobs. For example, you can use the Create Transform API to create an Index Transform job that runs every minute. The Index Transform API gives you the flexibility to customize the job interval to meet your specific requirements.
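The following is a sketch of a Create Transform API call that summarizes the sample ecommerce index every minute; the job name, target index, and start_time are placeholders, and the exact body should be adapted from the Create Transform API documentation:

PUT _plugins/_transform/ecommerce-summary-job
{
    "transform": {
        "enabled": true,
        "schedule": {
            "interval": {
                "start_time": 1633987200000,
                "period": 1,
                "unit": "Minutes"
            }
        },
        "description": "Summarize ecommerce orders by product category",
        "source_index": "opensearch_dashboards_sample_data_ecommerce",
        "target_index": "ecommerce_category_summary",
        "page_size": 1000,
        "data_selection_query": {
            "match_all": {}
        },
        "groups": [
            {
                "terms": {
                    "source_field": "category.keyword",
                    "target_field": "category"
                }
            }
        ],
        "aggregations": {
            "total_quantity": {
                "sum": { "field": "total_quantity" }
            }
        }
    }
}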

Use the following API to get details of your scheduled Index Transform job:

GET _plugins/_transform/kibana_ecommerce_transform_job

To check the status and statistics of a previously run Index Transform job, use the _explain API:

GET _plugins/_transform/kibana_ecommerce_transform_job/_explain

We get the following response from our API call:

{
  "kibana_ecommerce_transform_job" : {
    "metadata_id" : "uA45cToY8nOCsVSyCZs2yA",
    "transform_metadata" : {
      "transform_id" : "kibana_ecommerce_transform_job",
      "last_updated_at" : 1633987988049,
      "status" : "finished",
      "failure_reason" : null,
      "stats" : {
        "pages_processed" : 2,
        "documents_processed" : 7409,
        "documents_indexed" : 6,
        "index_time_in_millis" : 56,
        "search_time_in_millis" : 7
      }
    }
  }
}

To delete an existing Index Transform job, disable the job and then issue the Delete API:

POST _plugins/_transform/kibana_ecommerce_transform_job/_stop

DELETE _plugins/_transform/kibana_ecommerce_transform_job

Response:
{
  "took" : 12,
  "errors" : false,
  "items" : [
    {
      "delete" : {
        "_index" : ".opendistro-ism-config",
        "_type" : "_doc",
        "_id" : "kibana_ecommerce_transform_job",
        "_version" : 3,
        "result" : "deleted",
        "forced_refresh" : true,
        "_shards" : {
          "total" : 2,
          "successful" : 2,
          "failed" : 0
        },
        "_seq_no" : 35091,
        "_primary_term" : 1,
        "status" : 200
      }
    }
  ]
}

Best practices

Index Transform jobs are ideal for continuously aggregating data and maintaining summarized views, instead of performing the same complex aggregations at query time over and over. The jobs are designed to run on an index or indices as a whole, not on the changes between job runs.

Consider the following best practices when using Index Transforms:

  • Avoid running Index Transform jobs on rotating indexes with index patterns as the job scans all documents in those indices at each run. Use APIs to create a new Index Transform job for each rotating index.
  • Factor in additional compute capacity if your Index Transform job involves multiple aggregations, because this process can be CPU intensive. For example, if your job scans 5 indices with 3 shards each and takes 5 minutes to complete, a minimum of 17 vCPUs (5*3=15 for reading the source indices and 2 for writing to the target index, assuming 1 replica) is needed for those 5 minutes.
  • Try to schedule Index Transform jobs at non-peak times to minimize the impact on real-time search queries.
  • Make sure that there is sufficient storage for the target indexes. The size of the target index depends on the cardinality of the chosen group-by term(s) and the number of attributes computed as part of the transform. Make sure you have enough storage overhead, as discussed in our sizing guide.
  • Monitor and adjust the OpenSearch Service cluster configurations.

Conclusion

This post describes how you can use OpenSearch Index Transforms to aggregate specific fields from an existing index and store the summarized data into a new index using the OpenSearch Dashboards UI or Index Transform REST APIs. The Index Transform feature is powered by OpenSearch, an open-source search and analytics engine that makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more. Index Transforms are available on all domains running Amazon OpenSearch Service 1.0 or greater, across 25 AWS Regions globally.


About the Authors

Viral Shah is a Principal Solutions Architect with the AWS Data Lab team based out of New York, NY. He has over 20 years of experience working with enterprise customers and startups, primarily in the data and database space. He loves to travel and spend quality time with his family.

Arun Lakshmanan is a Search Specialist Solution Architect at AWS based out of Chicago, IL.

Configure alerts of high CPU usage in applications using Amazon OpenSearch Service anomaly detection: Part 1

Post Syndicated from Jey Vell original https://aws.amazon.com/blogs/big-data/part-1-configure-alerts-of-high-cpu-usage-in-applications-using-amazon-opensearch-service-anomaly-detection/

Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) is a fully managed service that makes it easy to deploy, secure, and run Elasticsearch cost-effectively at scale. Amazon OpenSearch Service supports many use cases, including application monitoring, search, security information and event management (SIEM), and infrastructure monitoring. Amazon OpenSearch Service also offers a rich set of functionalities such as UltraWarm, fine-grained access control, alerting, and anomaly detection.

In this two-part post, we show you how to use anomaly detection in Amazon OpenSearch Service and configure alerts for high CPU usage in applications. In Part 1, we discuss how to set up your anomaly detector.

Solution overview

Anomaly detection in Amazon OpenSearch Service automatically detects anomalies in your Amazon OpenSearch Service data in near-real time by using the Random Cut Forest (RCF) algorithm. The RCF algorithm computes an anomaly grade and confidence score value for each incoming data point. Anomaly detection uses these values to differentiate an anomaly from normal variations in your data.

The following screenshot shows a sample anomaly history dashboard on the Amazon OpenSearch Service console.

You can configure anomaly detectors via the Amazon OpenSearch Service Kibana dashboard or API. The key elements for creating an anomaly detector are detector creation and model configuration. In the following steps, we create an anomaly detector for application log files with CPU usage data.
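For reference, the following is a sketch of creating a detector through the API; the index, timestamp field, and CPU usage field names are illustrative, and on Elasticsearch-based domains the endpoint uses the _opendistro prefix instead of _plugins:

POST _plugins/_anomaly_detection/detectors
{
    "name": "cpu-usage-detector",
    "description": "Detect anomalies in application CPU usage",
    "time_field": "@timestamp",
    "indices": ["app-logs-*"],
    "feature_attributes": [
        {
            "feature_name": "sum_cpu",
            "feature_enabled": true,
            "aggregation_query": {
                "sum_cpu": {
                    "sum": { "field": "cpu_usage_percent" }
                }
            }
        }
    ],
    "detection_interval": {
        "period": { "interval": 10, "unit": "Minutes" }
    },
    "window_delay": {
        "period": { "interval": 1, "unit": "Minutes" }
    }
}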

Create a detector

The first step in creating an anomaly detection solution is creating a detector. A detector is an individual anomaly detection task. You can have more than one detector, and they all can run simultaneously. To create your detector, complete the following steps:

  1. On the Anomaly detection dashboard inside your Kibana dashboard, choose Detectors.
  2. Choose Create detector.
  3. For Name, enter a unique name for your detector.
  4. For Description, enter an optional description.
  5. For Index, choose an index where you want to identify the anomaly.
  6. For Timestamp field, choose the timestamp field from your index.

Optionally, you can add a data filter. This data filter helps you analyze only a subset of your data source and reduce the noisy data.

  1. For Data filter, choose Visual editor.
  2. Choose your field, operator, and value.

Alternatively, choose Custom expression and add in your own filter query.

  1. Set the detector operation settings:
    1. Detector interval – This defines the time interval for the detector to collect the data. During this time, the detector aggregates the data, then feeds the aggregated results into the anomaly detection model. The number of data points depends on the interval value. A shorter interval time results in a smaller sample size. We recommend setting the interval based on actual data. Too long of an interval might delay the results, and too short might miss some data points.
    2. Window delay – This adds extra processing time to ensure that all data within the window is present.

Configure the model

In order to run the anomaly detection model, you must configure certain model parameters. One key parameter is feature selection. A feature is a field in the index that is monitored for anomalies using different aggregation methods. You can apply anomaly detection to more than one feature for the index specified in the detector’s data source. After you create the detector, you need to configure the model with the right features to enable anomaly detection.

  1. On the anomaly detection dashboard, under the detector name, choose Configure model.
  2. On the Edit model configuration page, for Feature name, enter a name.
  3. For Feature state, select Enable feature.
  4. For Find anomalies based on, choose Field value.
  5. For Aggregation method, choose your appropriate aggregation method.

For example, if you choose average(), the detector finds anomalies based on the average values of your feature. For this post, we choose sum().

  1. For Field, choose your field.

As of this writing, Amazon OpenSearch Service supports the category field for high cardinality. You can use the category field for keyword or IP field types. The category field categorizes or slices the source time series with a dimension like IP addresses, product SKUs, zip codes, and so on. This provides a granular view of anomalies within each entity of the category field, to help you isolate and debug issues. For example, the CPU usage percentage doesn't help identify the specific instance causing the issue, but by using the host IP categorical value, you may be able to find the specific host causing the anomaly.

  1. In the Category field, select Enable category field.
  2. For Field, choose your field.
  3. In the Advanced Settings section, for Window size, set the number of intervals to consider in a detection window.

We recommend choosing the window size value based on the actual data. If you expect missing values in your data or if you want the anomalies based on the current interval, choose 1. If your data is continuously ingested and you want the anomalies based on multiple intervals, choose a larger window size.

In the Sample anomaly history section, you can see a preview of the anomalies.

  1. Choose Save and start detector.
  2. In the pop-up window, select when to start the detector (automatically or manually).
  3. Choose Confirm.

Summary

This post explained the different steps required to create an anomaly detector with Amazon OpenSearch Service. You can use anomaly detection for many use cases, including finding anomalies in access logs from different services, using clickstream data, using IP address data, and more.

Amazon OpenSearch Service anomaly detection is available on domains running any OpenSearch version or Elasticsearch 7.4 or later. All instance types support anomaly detection, except for t2.micro and t2.small.

In the next part of this post, we cover how to set up an alert for these anomalies using the Amazon OpenSearch Service alerting feature.


About the Authors

Jey Vell is a Senior Solutions Architect with AWS, based in Austin TX. Jey specializes in Analytics and ML/AI. Jey works closely with Amazon OpenSearch Service team, providing architecture guidance and technical support to the AWS customers with their search workloads. He brings to his role over 20 years of technology experience in software development and architecture, and IT management.

Jon Handler is a Senior Principal Solutions Architect, specializing in AWS search technologies: Amazon CloudSearch and Amazon OpenSearch Service. Based in Palo Alto, he helps a broad range of customers get their search and log analytics workloads deployed right and functioning well.

SEEK Asia modernizes search with CI/CD and Amazon OpenSearch Service

Post Syndicated from Fabian Tan original https://aws.amazon.com/blogs/big-data/seek-asia-modernizes-search-with-ci-cd-and-amazon-opensearch-service/

This post was written in collaboration with Abdulsalam Alshallah (Salam), Software Architect, and Hans Roessler, Principal Software Engineer at SEEK Asia.

SEEK is a market leader in online employment marketplaces with deep and rich insights into the future of work. As a global business, SEEK has a presence in Australia, New Zealand, Hong Kong, Southeast Asia, Brazil, and Mexico, and its websites attract over 400 million visits per year. SEEK Asia's business operates across seven countries, includes leading portal brands such as jobsdb.com and jobstreet.com, and leverages data and technology to create innovative solutions for candidates and hirers.

In this post, we share how SEEK Asia modernized their search-based system with a continuous integration and continuous delivery (CI/CD) pipeline and Amazon OpenSearch Service (successor to Amazon Elasticsearch Service).

Challenges associated with a self-managed search system

SEEK Asia provides a search-based system that enables employers to manage interactions between hirers and candidates. Although the system was already on AWS, it was a self-managed system running on Amazon Elastic Compute Cloud (Amazon EC2) with limited automation.

The self-managed system posed several challenges:

  • Slower release cycles – Deploying new configurations or new field mappings into the Elasticsearch cluster was a high-risk activity because changes affected the stability of the system. The limited automation across the self-managed cluster and its workflows led to slower release cycles.
  • Higher operational overhead – Sizing the cluster to deliver greater performance while managing cost effectively was the other challenge. As with every other distributed system, even with sizing guidance, identifying the appropriate number of shards per node and the number of nodes to meet performance requirements still required some amount of trial and error, turning the exercise into a tedious and time-consuming activity. This consequently also led to slower release cycles. To overcome this challenge, on many occasions, oversizing the cluster became the quickest way to achieve the desired time to market, at the expense of cost.

Further challenges the team faced with self-managing their own Elasticsearch cluster included keeping up with new security patches, and minor and major platform upgrades.

Automating search delivery with Amazon OpenSearch Service

SEEK Asia knew that automation would be the key to solving the challenges of their existing search service. Automating the undifferentiated heavy lifting would enable them to deliver more value to their customers quickly and improve staff productivity.

With the problems defined, the team set out to solve the challenges by automating the following:

  • Search infrastructure deployment
  • Search A/B testing infrastructure deployment
  • Redeployment of search infrastructure for any new infrastructure configuration (such as security patches or platform upgrades) and index mapping updates

The key services enabling the automation would be Amazon OpenSearch Service and establishing a search infrastructure CI/CD pipeline.

Architecture overview

The following diagram illustrates the architecture of the SEEK infrastructure and CI/CD pipeline with Amazon OpenSearch Service.

The workflow includes the following steps:

  1. Before the workflow kicks off, a live feeder hydrates the existing Amazon OpenSearch Service cluster. The live feeder is a serverless application built on Amazon Simple Queue Service (Amazon SQS), Amazon Simple Notification Service (Amazon SNS), and AWS Lambda. Amazon SQS queues documents for processing, Amazon SNS enables data fanout (if required), and a Lambda function is invoked to process messages in the SQS queue and import data into Amazon OpenSearch Service. The feeder receives live updates for changes that need to be reflected on the cluster. Write concurrency to Amazon OpenSearch Service is managed by limiting the number of concurrent Lambda function invocations.
  2. The Amazon OpenSearch Service index mapping is version controlled in SEEK’s Git repository. Whenever an update to the index mapping is committed, the CI/CD pipeline kicks off a new Amazon OpenSearch Service cluster provisioning workflow.
  3. As part of the workflow, a new data hydration initialization feeder is deployed. The initialization feeder construct is similar to the live feeder, with one additional component: a script that runs within the CI/CD pipeline to calculate the number of batches required to hydrate the newly provisioned Amazon OpenSearch Service cluster up to a specific timestamp. The feeder systems were designed for idempotent processing: unique identifiers (UIDs) from the source data stores are reused for each document, so a duplicated document simply overwrites the existing document with the exact same values.
  4. At the same time as Step 3, an Amazon OpenSearch Service cluster is deployed. To temporarily accelerate the initial data hydration process, the new cluster may be sized two or three times larger than sizing guidance suggests, with shard replicas and the index refresh interval disabled until the hydration process is complete. The existing Amazon OpenSearch Service cluster remains as is, which means that two clusters are running concurrently.
  5. The script inspects the number of documents the source data store has and groups the documents by batch sizes. SEEK identified that 1,000 documents per batch provided the optimal ingestion import time, after running numerous experiments.
  6. Each batch is represented as one message and is queued into Amazon SQS via Amazon SNS. Every message that lands in Amazon SQS invokes a Lambda function. The Lambda function queries a separate data store, builds the document, and loads it into Amazon OpenSearch Service (a minimal sketch of such a feeder function follows this list). The more messages that go into the queue, the more functions are invoked. To create baselines that allowed for further indexing optimization, the team took the following configurations into consideration and iterated to achieve higher ingestion performance:
    1. Memory of the Lambda function
    2. Size of batch
    3. Size of each document in the batch
    4. Size of cluster (memory, vCPU, and number of primary shards)
  7. With the initialization feeder running, new documents are streamed to the cluster until it is synced with the data source. Eventually, the newly provisioned Amazon OpenSearch Service cluster catches up and is in the same state as the existing cluster. The hydration is complete when there are no remaining messages in the SQS queue.
  8. The initialization feeder is deleted and the Amazon OpenSearch Service cluster is downsized automatically to complete the deployment workflow, with replica shards created and the index refresh interval configured.
  9. Live search traffic is routed to the newly provisioned cluster when A/B testing is enabled via the API layer built on Application Load Balancer, Amazon Elastic Container Service (Amazon ECS), and Amazon CloudFront. The API layer decouples the client interface from the backend implementation that runs on Amazon OpenSearch Service.
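Step 6 describes the feeder Lambda function that turns each SQS message into documents in Amazon OpenSearch Service. The following sketch illustrates that pattern under stated assumptions; it is not SEEK's actual code, and the environment variables, index name, message shape, and build_document helper are placeholders. It assumes the opensearch-py and requests-aws4auth libraries and IAM-signed access to the domain.

```python
import json
import os

import boto3
from opensearchpy import OpenSearch, RequestsHttpConnection, helpers
from requests_aws4auth import AWS4Auth

# Placeholder configuration; supplied through Lambda environment variables.
OPENSEARCH_HOST = os.environ["OPENSEARCH_HOST"]      # e.g. vpc-....es.amazonaws.com
INDEX_NAME = os.environ.get("INDEX_NAME", "jobs")    # hypothetical index name
REGION = os.environ.get("AWS_REGION", "ap-southeast-1")

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   REGION, "es", session_token=credentials.token)

client = OpenSearch(
    hosts=[{"host": OPENSEARCH_HOST, "port": 443}],
    http_auth=awsauth,
    use_ssl=True,
    verify_certs=True,
    connection_class=RequestsHttpConnection,
)


def build_document(item):
    # Placeholder for the enrichment step: the real feeder queries a separate
    # data store to assemble the full document for each item in the batch.
    return item


def handler(event, context):
    """Process one SQS batch of document references and bulk-index them."""
    actions = []
    for record in event["Records"]:
        batch = json.loads(record["body"])
        for item in batch.get("documents", []):          # assumed message shape
            doc = build_document(item)
            actions.append({
                "_op_type": "index",
                "_index": INDEX_NAME,
                # Reusing the source UID keeps ingestion idempotent: a
                # re-delivered message simply overwrites the same document.
                "_id": doc["uid"],
                "_source": doc,
            })
    helpers.bulk(client, actions)
    return {"indexed": len(actions)}
```

Write concurrency against the cluster is then capped by limiting the function's concurrent invocations, as noted in step 1.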

Improved time to market and other outcomes

With Amazon OpenSearch Service, SEEK was able to automate the provisioning of an entire cluster, complete with Kibana, in a secure, managed environment. If testing didn't produce the desired results, the team could change the dimensions of the cluster horizontally or vertically using different instance offerings within minutes. This enabled them to perform stress tests quickly to identify the sweet spot between performance and cost for the workload.

“By integrating Amazon OpenSearch Service with our existing CI/CD tools, we’re able to fully automate our search function deployments, which accelerated software delivery time,” says Abdulsalam Alshallah, APAC Software Architect. “The newly found confidence in the modern stack, alongside improved engineering practices, allowed us to mitigate the risk of changes—improving our time to market by 89% with zero impact to uptime.”

With the adoption of Amazon OpenSearch Service, other teams also saw improvements, including the following:

  • Common Vulnerabilities and Exposures (CVE) findings dropped to zero, with Amazon OpenSearch Service handling the underlying hardware security updates on SEEK's behalf, improving their security posture
  • Improved availability with the Amazon OpenSearch Service Availability Zone awareness feature

Conclusion

Amazon OpenSearch Service's managed capabilities have helped SEEK Asia improve customer experience with speed and automation. By removing the undifferentiated heavy lifting, teams can deploy changes quickly to their search engines, allowing customers to get the latest search features faster and ultimately contributing to the SEEK purpose of helping people live more productive working lives and helping organisations succeed.

To learn more about Amazon OpenSearch Service, see Amazon OpenSearch Service features, the Developer Guide, or Introducing OpenSearch.


About the Authors

Fabian Tan is a Principal Solutions Architect at Amazon Web Services. He has a strong passion for software development, databases, data analytics and machine learning. He works closely with the Malaysian developer community to help them bring their ideas to life.

Hans Roessler is a Principal Software Architect at SEEK Asia. He is excited about new technologies and upgrading legacy systems to newer stacks. Staying in touch with the latest technologies is one of his passions.

Abdulsalam Alshallah (Salam) is a Software Architect at SEEK. Previously a Lead Cloud Architect for SEEK Asia, Salam has always been excited about new technologies, cloud, serverless, and DevOps, and is passionate about eliminating wasted time, effort, and resources. He is also one of the leaders of the AWS User Group Malaysia.

Visualize live analytics from Amazon QuickSight connected to Amazon OpenSearch Service

Post Syndicated from Lokesh Yellanur original https://aws.amazon.com/blogs/big-data/visualize-live-analytics-from-amazon-quicksight-connected-to-amazon-opensearch-service/

Live analytics refers to the process of preparing and measuring data as soon as it enters the database or persistent store. In other words, you get insights or arrive at conclusions immediately. Live analytics enables businesses to respond to events without delay. You can seize opportunities or prevent problems before they happen. Speed is the main benefit of live analytics. The faster a business can use data for insights, the faster they can act on critical decisions.

Some live analytics use cases include:

  • Analyzing access logs and application logs from servers to identify server performance issues that could lead to application downtime, or to detect unusual activity. For instance, analyzing monitoring data from a manufacturing line can enable early intervention before machinery malfunctions.
  • Targeting individual customers in retail outlets with promotions and incentives while the customers are in the store and close to the merchandise.

We see customers performing real-time analytics with the ELK stack. The ELK stack is an acronym used to describe a stack that comprises three popular open-source projects: Elasticsearch, Logstash, and Kibana. Often referred to as Elasticsearch, the ELK stack gives you the ability to aggregate logs from all your systems and applications, analyze these logs, and create visualizations for application and infrastructure monitoring, faster troubleshooting, security analytics, and more. In this post, we extend the live analytics visualizations using Amazon QuickSight.

Solution overview

Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) is a fully managed service that makes it easy for you to deploy, secure, and run OpenSearch cost-effectively at scale. You can build, monitor, and troubleshoot your applications using the tools you love at the scale you need. The service provides support for open-source OpenSearch APIs, managed Kibana, integration with Logstash and other AWS services, and built-in alerting and SQL querying. In addition, Amazon OpenSearch Service lets you pay only for what you use—there are no upfront costs or usage requirements. With Amazon OpenSearch Service, you get the ELK stack you need without the operational overhead.

QuickSight is a scalable, serverless, embeddable, machine learning (ML)-powered business intelligence (BI) service built for the cloud. QuickSight lets you easily create and publish interactive BI dashboards that include ML-powered insights. QuickSight dashboards can be accessed from any device and seamlessly embedded into your applications, portals, and websites.

This post helps you visualize the Centralized Logging solution using QuickSight. Centralized logging helps organizations collect, analyze, and display Amazon CloudWatch logs in a single dashboard in QuickSight.

This solution consolidates, manages, and analyzes log files from various sources. You can collect CloudWatch logs from multiple accounts and AWS Regions. Access log information can be beneficial in security and access audits. It can also help you learn about your customer base and understand your Amazon Simple Storage Service (Amazon S3) bill.

The following diagram illustrates the solution architecture.

For more information about the solution, see Centralized Logging.

Prerequisites

Before you implement the solution, complete the prerequisite steps in this section.

Provision your resources

Launch the following AWS CloudFormation template to deploy the Centralized Logging solution:

After you create the stack, you receive an email (to the administrator email address) with your login information, as shown in the following screenshot.

Launch QuickSight in a VPC

Sign up for a QuickSight subscription with the Enterprise license.

QuickSight Enterprise Edition is fully integrated with Amazon Virtual Private Cloud (Amazon VPC). A VPC based on this service closely resembles a traditional network that you operate in your own data center. It enables you to secure and isolate traffic between resources.

Allow QuickSight to access Amazon OpenSearch Service

Make sure QuickSight has access to both the VPC and Amazon OpenSearch Service.

  1. On the QuickSight dashboard, choose the user icon and choose Manage QuickSight.
  2. Choose Security & permissions in the navigation pane.
  3. Choose Add or Remove to update QuickSight access to AWS services.
  1. For Allow access and autodiscovery for these resources, select Amazon OpenSearch Service.

Manage the VPC and security group connections

You need to give permissions on the QuickSight console to connect to Amazon OpenSearch Service. After you enable Amazon OpenSearch Service on the Security & permissions page, you add a VPC connection with the same VPC and subnet as your Amazon OpenSearch Service domain and create a new security group.

You first create a security group for QuickSight.

  1. Add an inbound rule to allow all communication from the Amazon OpenSearch Service domain.
  2. For Type, choose All TCP.
  3. For Source, select Custom, then enter the ID of the security group used by your Amazon OpenSearch Service domain.
  4. Add an outbound rule to allow all traffic to the Amazon OpenSearch Service domain.
  5. For Type, choose Custom TCP Rule.
  6. For Port Range, enter 443.
  7. For Destination, select Custom, then enter the ID of the security group used by your Amazon OpenSearch Service domain.

Next, you create a security group for the Amazon OpenSearch Service domain (a scripted sketch covering both security groups follows these steps).

  1. Add an inbound rule that allows all incoming traffic from the QuickSight security group.
  2. For Type, choose Custom TCP.
  3. For Port Range, enter 443.
  4. For Source, select Custom, then enter the QuickSight security group ID.
  5. Add an outbound rule that allows all traffic to the QuickSight security group.
  6. For Type, choose All TCP.
  7. For Destination, select Custom, then enter the QuickSight security group ID.
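If you prefer to script these rules, the following boto3 sketch creates the two essential port 443 rules that reference each security group. The Region and security group IDs are placeholders, and the remaining return-traffic rules from the steps above can be added the same way.

```python
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")   # assumed Region

# Placeholder security group IDs; use the groups created for QuickSight
# and for the Amazon OpenSearch Service domain.
QUICKSIGHT_SG = "sg-0aaaabbbbccccdddd"
OPENSEARCH_SG = "sg-0eeeeffff00001111"

# QuickSight security group: outbound HTTPS (443) to the domain's security group.
ec2.authorize_security_group_egress(
    GroupId=QUICKSIGHT_SG,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "UserIdGroupPairs": [{"GroupId": OPENSEARCH_SG}],
    }],
)

# Domain security group: inbound 443 from the QuickSight security group.
ec2.authorize_security_group_ingress(
    GroupId=OPENSEARCH_SG,
    IpPermissions=[{
        "IpProtocol": "tcp", "FromPort": 443, "ToPort": 443,
        "UserIdGroupPairs": [{"GroupId": QUICKSIGHT_SG}],
    }],
)
```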

Choose your datasets

To validate the connection and create the data source, complete the following steps:

  1. On the QuickSight console, choose Datasets.
  2. Choose Create dataset.
  3. Choose Amazon OpenSearch Service.
  4. For Data source name, enter a name.
  5. Depending on whether your Amazon OpenSearch Service domain uses a public or VPC endpoint, choose your connection type and Amazon OpenSearch Service domain.
  6. Choose Validate connection.
  7. Choose Create data source.

  1. Choose Tables.
  2. Select the table in the data source you created.
  3. Review your settings and choose Visualize.

Visualize the data loaded

QuickSight, with its wide array of visuals available, allows you to create meaningful visuals from Amazon OpenSearch Service data.

When you choose Visualize from the previous steps, you start creating an analysis. QuickSight provides a range of visual types to display data, such as graphs, tables, heat maps, scatter plots, line charts, pie charts, and more. The following steps allow you to add a visual type to display the data from the datasets.

  1. On the Add menu, choose Add visual.
  2. Choose your visual type.
  3. Add fields to the field wells to bring data into the visuals to be displayed.

The following screenshot shows a sample group of visuals.

Automatically refresh your data

You can access and visualize your data through direct queries. Your data is queried live each time a visual is rendered. This gives you live access to your data. Additionally, you can automatically refresh the visuals every 1–60 minutes, so that you don’t have to reload the page to see the most up-to-date information. The following screenshot shows the auto-refresh settings while preparing to publish your dashboard.

For more information about the auto-refresh option, see Using Amazon OpenSearch with Amazon QuickSight.

The following screenshot shows an example visualization.

Clean up

When you're done using this solution, to avoid incurring future charges, delete the resources you created in this walkthrough, including your S3 buckets, Amazon OpenSearch Service cluster, and other associated resources.

Summary

This post demonstrated how to extend your ELK stack with QuickSight in a secure way for analyzing access logs. The application logs help you identify any server performance issues that could lead to application downtime. They can also help detect unusual activity.

As always, AWS welcomes feedback. Please submit comments or questions in the comments section.


About the Authors

Lokesh Yellanur is a Solutions Architect at AWS. He helps customers with data and analytics solutions in AWS.

Joshua Morrison is a Senior Solutions Architect at AWS based in Richmond, Virginia. He spends time working with customers to help with their adoption of modern cloud technology and security best practices. He enjoys being a father and picking up heavy objects.

Suresh Patnam is a Sr Solutions Architect at AWS. He works with customers on IT strategy, making digital transformation through the cloud more accessible, with a focus on big data, data lakes, and AI/ML. In his spare time, Suresh enjoys playing tennis and spending time with his family. Connect with him on LinkedIn.

Choose the right storage tier for your needs in Amazon OpenSearch Service

Post Syndicated from Changbin Gong original https://aws.amazon.com/blogs/big-data/choose-the-right-storage-tier-for-your-needs-in-amazon-opensearch-service/

Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) enables organizations to perform interactive log analytics, real-time application monitoring, website search, and more. OpenSearch is an open-source, distributed search and analytics suite derived from Elasticsearch. Amazon OpenSearch Service offers the latest versions of OpenSearch, support for 19 versions of Elasticsearch (versions 1.5 to 7.10), and visualization capabilities powered by OpenSearch Dashboards and Kibana (versions 1.5 to 7.10).

In this post, we present the three storage tiers of Amazon OpenSearch Service (hot, UltraWarm, and cold storage) and discuss how to choose the right storage tier for your needs. We explain how these tiers integrate and the trade-offs each one makes, so you can weigh performance, latency, and cost when deciding which tier fits your use case.

Amazon OpenSearch Service storage tiers overview

There are three different storage tiers for Amazon OpenSearch Service: hot, UltraWarm, and cold. The following diagram illustrates these three storage tiers.

Hot storage

Hot storage for Amazon OpenSearch Service is used for indexing and updating, while providing fast access to data. Standard data nodes use hot storage, which takes the form of instance store or Amazon Elastic Block Store (Amazon EBS) volumes attached to each node. Hot storage provides the fastest possible performance for indexing and searching new data.

You get the lowest latency for reading data in the hot tier, so you should use the hot tier to store frequently accessed data driving real-time analysis and dashboards. As your data ages, you access it less frequently and can tolerate higher latency, so keeping data in the hot tier is no longer cost-efficient.

If you want to have low latency and fast access to the data, hot storage is a good choice for you.

UltraWarm storage

UltraWarm nodes use Amazon Simple Storage Service (Amazon S3) with related caching solutions to improve performance. UltraWarm offers significantly lower costs per GiB for read-only data that you query less frequently and don’t need the same performance as hot storage. Although you can’t modify the data while in UltraWarm, you can move the data to the hot storage tier for edits before moving it back.

When calculating UltraWarm storage requirements, you consider only the size of the primary shards. When you query for the list of shards in UltraWarm, you still see the primary and replicas listed. Both shards are stubs for the same, single copy of the data, which is in Amazon S3. The durability of data in Amazon S3 removes the need for replicas, and Amazon S3 abstracts away any operating system or service considerations. In the hot tier, accounting for one replica, 20 GB of index uses 40 GB of storage. In the UltraWarm tier, it’s billed at 20 GB.

The UltraWarm tier acts like a caching layer on top of the data in Amazon S3. UltraWarm moves data from Amazon S3 onto the UltraWarm nodes on demand, which speeds up access for subsequent queries on that data. For that reason, UltraWarm works best for use cases that access the same, small slice of data multiple times. You can add or remove UltraWarm nodes to increase or decrease the amount of cache against your data in Amazon S3 to optimize your cost per GB. To dial in your cost, be sure to test using a representative dataset. To monitor performance, use the WarmCPUUtilization and WarmJVMMemoryPressure metrics. See UltraWarm metrics for a complete list of metrics.

The combined CPU cores and RAM allocated to UltraWarm nodes affects performance for simultaneous searches across shards. We recommend deploying enough UltraWarm instances so that you store no more than 400 shards per ultrawarm1.medium.search node and 1,000 shards per ultrawarm1.large.search node (including both primaries and replicas). We recommend a maximum shard size of 50 GB for both hot and warm tiers. When you query UltraWarm, each shard uses a CPU and moves data from Amazon S3 to local storage. Running single or concurrent queries that access many indexes can overwhelm the CPU and local disk resources. This can cause longer latencies through inefficient use of local storage, and even cause cluster failures.

UltraWarm storage requires OpenSearch 1.0 or later, or Elasticsearch version 6.8 or later.

If you have large amounts of read-only data and want to balance the cost and performance, use UltraWarm for your infrequently accessed, older data.
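Moving an index into UltraWarm is an API call against the domain (or an automated ISM transition). The sketch below assumes a hypothetical index name and IAM-signed requests from Python; the _ultrawarm migration endpoints shown are the ones documented for Amazon OpenSearch Service, and the same path with _hot reverses the move.

```python
import boto3
import requests
from requests_aws4auth import AWS4Auth

# Hypothetical endpoint and index name; replace with your own.
DOMAIN_ENDPOINT = "https://search-my-domain.us-east-1.es.amazonaws.com"
REGION = "us-east-1"
INDEX = "logs-2021-09-01"

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   REGION, "es", session_token=credentials.token)

# Migrate the index from hot to UltraWarm storage.
resp = requests.post(
    f"{DOMAIN_ENDPOINT}/_ultrawarm/migration/{INDEX}/_warm", auth=awsauth)
print(resp.json())

# Check migration progress.
status = requests.get(
    f"{DOMAIN_ENDPOINT}/_ultrawarm/migration/{INDEX}/_status", auth=awsauth)
print(status.json())
```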

Cold storage

Cold storage is optimized to store infrequently accessed or historical data at $0.024 per GB per month. When you use cold storage, you detach your indexes from the UltraWarm tier, making them inaccessible. You can reattach these indexes in a few seconds when you need to query that data. Cold storage is a great fit for scenarios in which a low ROI necessitates an archive or delete action on historical data, or if you need to conduct research or perform forensic analysis on older data with Amazon OpenSearch Service.

Cold storage doesn’t have specific instance types because it doesn’t have any compute capacity attached to it. You can store any amount of data in cold storage.

Cold storage requires OpenSearch 1.0 or later, or Elasticsearch version 7.9 or later and UltraWarm.

Manage storage tiers in OpenSearch Dashboards

OpenSearch Dashboards installed on your Amazon OpenSearch Service domain provides a useful UI for managing indexes in different storage tiers on your domain. From the OpenSearch Dashboards main menu, you can view all indexes in hot, UltraWarm, and cold storage. You can also see the indexes managed by Index State Management (ISM) policies. OpenSearch Dashboards enables you to migrate indexes between UltraWarm and cold storage, and monitor index migration status, without using the AWS Command Line Interface (AWS CLI) or configuration API. For more information on OpenSearch Dashboards, see Using OpenSearch Dashboards with Amazon OpenSearch Service.

Cost considerations

The hot tier requires you to pay for what is provisioned, which includes the hourly rate for the instance type. Storage is either Amazon EBS or a local SSD instance store. For Amazon EBS-only instance types, additional EBS volume pricing applies. You pay for the amount of storage you deploy.

UltraWarm nodes charge per hour just like other node types, but you only pay for the storage actually stored in Amazon S3. For example, although the instance type ultrawarm1.large.elasticsearch provides up to 20 TiB addressable storage on Amazon S3, if you only store 2 TiB of data, you’re only billed for 2 TiB. Like the standard data node types, you also pay an hourly rate for each UltraWarm node. For more information, see Pricing for Amazon OpenSearch Service.

Cold storage doesn’t incur compute costs, and like UltraWarm, you’re only billed for the amount of data stored in Amazon S3. There are no additional transfer charges when moving data between cold and UltraWarm storage.

Example use case

Let’s look at an example with 1 TB of source data per day, 7 days hot, 83 days warm, 365 days cold. For more information on sizing the cluster, see Sizing Amazon OpenSearch Service domains.

For hot storage, you can estimate a baseline with the calculation: storage needed = (daily source data in bytes * 1.25) * (number_of_replicas + 1) * number of days of retention. Following the best practice in this example, we use two replicas. The minimum storage requirement to retain 7 TB of data on the hot tier is (7 TB * 1.25) * (2 + 1) = 26.25 TB. For this amount of storage, we need 6x R6g.4xlarge.search instances, given the Amazon EBS size limit.

We also need to check the CPU requirements: we need 25 primary shards (1 TB * 1.25 / 50 GB = 25). With two replicas, that gives 75 active shards in total, so the total vCPU requirement is 75 * 1.5 = 112.5 vCPUs, which means 8x R6g.4xlarge.search instances. This also requires three dedicated c6g.xlarge.search leader nodes.
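The following back-of-the-envelope sketch reproduces this hot-tier arithmetic so you can rerun it with your own numbers. The overhead factor, shard size, vCPU-per-shard ratio, and instance vCPU count are the assumptions used in this example, not universal constants.

```python
import math

# Assumptions taken from the example above; adjust for your workload.
daily_source_tb = 1.0        # 1 TB of source data per day
index_overhead = 1.25        # indexing overhead factor
replicas = 2                 # two replicas
hot_retention_days = 7
shard_size_gb = 50           # recommended maximum shard size
vcpu_per_active_shard = 1.5  # rule of thumb used in this example
vcpu_per_data_node = 16      # R6g.4xlarge.search (assumed 16 vCPUs)

# Storage: (daily source data * 1.25) * (replicas + 1) * retention days
hot_storage_tb = daily_source_tb * index_overhead * (replicas + 1) * hot_retention_days
print(f"Hot storage needed: {hot_storage_tb:.2f} TB")                      # 26.25 TB

# CPU: primary shards for a daily index, active shards, and data nodes
primary_shards = daily_source_tb * index_overhead * 1000 / shard_size_gb   # 25
active_shards = primary_shards * (replicas + 1)                            # 75
vcpus_needed = active_shards * vcpu_per_active_shard                       # 112.5
data_nodes = math.ceil(vcpus_needed / vcpu_per_data_node)                  # 8
print(f"{primary_shards:.0f} primaries, {active_shards:.0f} active shards, "
      f"{vcpus_needed:.1f} vCPUs -> {data_nodes} x R6g.4xlarge.search")
```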

When calculating UltraWarm storage requirements, you consider only the size of the primary shards, because that's the amount of data stored in Amazon S3. For this example, the total primary shard size for warm storage is 83 days * 1 TB * 1.25 = 103.75 TB. Each ultrawarm1.large.search instance has 16 CPU cores and can address up to 20 TiB of storage on Amazon S3. A minimum of six ultrawarm1.large.search nodes is recommended. You're charged for the actual storage, which is 103.75 TB.

For cold storage, you only pay for the cost of storing 365 days * 1 TB * 1.25 = 456.25 TB on Amazon S3. The following table contains a breakdown of the monthly costs (USD) you're likely to incur. This assumes a 1-year reserved instance for the cluster instances with no upfront payment in the US East (N. Virginia) Region.

| Cost Type | Pricing | Usage | Cost per month |
| --- | --- | --- | --- |
| Instance usage | R6g.4xlarge.search = $0.924 per hour | 8 instances * 730 hours in a month = 5,840 hours | 5,840 hours * $0.924 = $5,396.16 |
| Instance usage | c6g.xlarge.search = $0.156 per hour | 3 instances (leader nodes) * 730 hours in a month = 2,190 hours | 2,190 hours * $0.156 = $341.64 |
| Instance usage | ultrawarm1.large.search = $2.68 per hour | 6 instances * 730 hours = 4,380 hours | 4,380 hours * $2.68 = $11,738.40 |
| Storage cost | Hot storage (Amazon EBS general purpose SSD, gp3) = $0.08 per GB per month | 7 days hot = 26.25 TB | 26,880 GB * $0.08 = $2,150.40 |
| Storage cost | UltraWarm managed storage = $0.024 per GB per month | 83 days warm = 103.75 TB per month | 106,240 GB * $0.024 = $2,549.76 |
| Storage cost | Cold storage on Amazon S3 = $0.022 per GB per month | 365 days cold = 456.25 TB per month | 467,200 GB * $0.022 = $10,278.40 |

The total monthly cost is $32,454.76. The hot tier costs $7,888.20, UltraWarm costs $14,288.16, and cold storage is $10,278.40. UltraWarm allows 83 days of additional retention for slightly more cost than the hot tier, which only provides 7 days. For nearly the same cost as the hot tier, the cold tier stores the primary shards for up to 1 year.

Conclusion

Amazon OpenSearch Service supports three integrated storage tiers: hot, UltraWarm, and cold storage. Based on your data retention, query latency, and budget requirements, you can choose the best strategy to balance cost and performance, and you can migrate data between tiers as it ages. To start using these storage tiers, sign in to the AWS Management Console, or use the AWS SDK or AWS CLI, and enable the corresponding storage tier.


About the Author

Changbin Gong is a Senior Solutions Architect at Amazon Web Services (AWS). He engages with customers to create innovative solutions that address customer business problems and accelerate the adoption of AWS services. In his spare time, Changbin enjoys reading, running, and traveling.

Rich Giuli is a Principal Solutions Architect at Amazon Web Services (AWS). He works within a specialized group helping ISVs accelerate adoption of cloud services. Outside of work, Rich enjoys running and playing guitar.

Disaster Recovery with AWS Managed Services, Part I: Single Region

Post Syndicated from Dhruv Bakshi original https://aws.amazon.com/blogs/architecture/disaster-recovery-with-aws-managed-services-part-i-single-region/

This 3-part blog series discusses disaster recovery (DR) strategies that you can implement to ensure your data is safe and that your workload stays available during a disaster. In Part I, we’ll discuss the single AWS Region/multi-Availability Zone (AZ) DR strategy.

The strategy outlined in this blog post addresses how to integrate AWS managed services into a single-Region DR strategy. This will minimize maintenance and operational overhead, create fault-tolerant systems, ensure high availability, and protect your data with robust backup/recovery processes. This strategy replicates workloads across multiple AZs and continuously backs up your data to another Region with point-in-time recovery, so your application is safe even if all AZs within your source Region fail.

Implementing the single Region/multi-AZ strategy

The following sections list the components of the example application presented in Figure 1, which illustrates a multi-AZ environment with a secondary Region that is strictly utilized for backups. This example architecture refers to an application that processes payment transactions that has been modernized with AMS. We’ll show you which AWS services it uses and how they work to maintain the single Region/multi-AZ strategy.


Figure 1. Single Region/multi-AZ with secondary Region for backups

Amazon EKS control plane

Amazon Elastic Kubernetes Service (Amazon EKS) runs the Kubernetes management infrastructure across multiple AZs to eliminate a single point of failure.

This means that if your infrastructure or an AZ fails, Amazon EKS automatically scales control plane nodes based on load, detects and replaces unhealthy control plane instances, and restarts them across the AZs within the Region as needed.

Amazon EKS data plane

Instead of creating individual Amazon Elastic Compute Cloud (Amazon EC2) instances, create worker nodes using an Amazon EC2 Auto Scaling group. Join the group to a cluster, and the group will automatically replace any terminated or failed nodes if an AZ fails. This ensures that the cluster can always run your workload.

Amazon ElastiCache

Amazon ElastiCache continually monitors the state of the primary node. If the primary node fails, it will promote the read replica with the least replication lag to primary. A replacement read replica is then created and provisioned in the same AZ as the failed primary. This is to ensure high availability of the service and application.

An ElastiCache for Redis (cluster mode disabled) cluster with multiple nodes has three types of endpoints: the primary endpoint, the reader endpoint and the node endpoints. The primary endpoint is a DNS name that always resolves to the primary node in the cluster.

Amazon Redshift

Currently, Amazon Redshift only supports single-AZ deployments. Although there are ways to work around this, we are focusing on cluster relocation. Parts II and III of this series will show you how to implement this service in a multi-Region DR deployment.

Cluster relocation enables Amazon Redshift to move a cluster to another AZ with no loss of data or changes to your applications. When Amazon Redshift relocates a cluster to a new AZ, the new cluster has the same endpoint as the original cluster. Your applications can reconnect to the endpoint and continue operations without modifications or loss of data.

Note: Amazon Redshift may also relocate clusters in non-AZ failure situations, such as when issues in the current AZ prevent optimal cluster operation or to improve service availability.

Amazon OpenSearch Service

Deploying your data nodes into three AZs with Amazon OpenSearch Service (formerly Amazon Elasticsearch Service) can improve the availability of your domain and increase your workload’s tolerance for AZ failures.

Amazon OpenSearch Service automatically deploys into three AZs when you select a multi-AZ deployment. This distribution helps prevent cluster downtime if an AZ experiences a service disruption. When you deploy across three AZs, Amazon OpenSearch Service distributes master nodes equally across all three AZs. That way, in the rare event of an AZ disruption, two master nodes will still be available.

Amazon OpenSearch Service also distributes primary shards and their corresponding replica shards to different zones. In addition to distributing shards by AZ, Amazon OpenSearch Service distributes them by node. When you deploy the data nodes across three AZs with one replica enabled, shards are distributed across the three AZs.

Note: For more information on multi-AZ configurations, please refer to the AZ disruptions table.

Amazon RDS PostgreSQL

Amazon Relational Database Service (Amazon RDS) handles failovers automatically so you can resume database operations as quickly as possible.

In a Multi-AZ deployment, Amazon RDS automatically provisions and maintains a synchronous standby replica in a different AZ. The primary DB instance is synchronously replicated across AZs to a standby replica. If an AZ or infrastructure fails, Amazon RDS performs an automatic failover to the standby. This minimizes the disruption to your applications without administrative intervention.

Backing up data across Regions

Here is how the managed services back up data to a secondary Region:

  • Manage snapshots of persistent volumes for Amazon EKS with Velero. Amazon Simple Storage Service (Amazon S3) stores these snapshots in an S3 bucket in the primary Region. Amazon S3 replicates these snapshots to an S3 bucket in another Region via S3 cross-Region replication.
  • Create manual snapshots of Amazon OpenSearch Service clusters, which are stored in a registered repository such as Amazon S3. You can take these snapshots manually or automate them via an AWS Lambda function, and the snapshot objects can then be copied asynchronously to another Region (see the sketch after this list).
  • Use manual backups and copy API calls for Amazon ElastiCache to establish a snapshot and restore strategy in a secondary Region. You can manually back your data up to an S3 bucket or automate the backup via Lambda. Once your data is backed up, a snapshot of the ElastiCache cluster will be stored in an S3 bucket. Then S3 cross-Region replication will asynchronously copy the backup to an S3 bucket in a secondary Region.
  • Take automatic, incremental snapshots of your data periodically with Amazon Redshift and save them to Amazon S3. You can precisely control when snapshots are taken and can create a snapshot schedule and attach it to one or more clusters. You can also configure a cross-Region snapshot copy, which automatically copies your automated and manual snapshots to another Region.
  • Use AWS Backup to support AWS resources and third-party applications. AWS Backup copies RDS backups to multiple Regions on demand or automatically as part of a scheduled backup plan.
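For Amazon OpenSearch Service, the manual-snapshot flow in the second bullet amounts to two signed requests: register an S3 repository once, then create snapshots on a schedule (for example, from a Lambda function). The sketch below assumes a hypothetical domain endpoint, bucket, snapshot role, repository name, and snapshot name.

```python
import boto3
import requests
from requests_aws4auth import AWS4Auth

# Hypothetical values; replace with your domain endpoint, Region, bucket, and role.
DOMAIN_ENDPOINT = "https://search-my-domain.us-east-1.es.amazonaws.com"
REGION = "us-east-1"
SNAPSHOT_BUCKET = "my-opensearch-snapshots"
SNAPSHOT_ROLE_ARN = "arn:aws:iam::123456789012:role/OpenSearchSnapshotRole"

credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   REGION, "es", session_token=credentials.token)

# 1. Register an S3 bucket as a snapshot repository (one-time setup).
repo_body = {
    "type": "s3",
    "settings": {
        "bucket": SNAPSHOT_BUCKET,
        "region": REGION,
        "role_arn": SNAPSHOT_ROLE_ARN,  # role the domain assumes to write to S3
    },
}
requests.put(f"{DOMAIN_ENDPOINT}/_snapshot/dr-repo", auth=awsauth, json=repo_body)

# 2. Take a manual snapshot; in a Lambda-based automation this call runs on a
#    schedule, and S3 cross-Region replication then copies the snapshot objects
#    to the bucket in the secondary Region.
requests.put(f"{DOMAIN_ENDPOINT}/_snapshot/dr-repo/snapshot-2021-09-01", auth=awsauth)
```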

Note: You can add a layer of protection to your backups through AWS Backup Vault Lock and S3 Object Lock.

Conclusion

The single Region/multi-AZ strategy safeguards your workloads against a disaster that disrupts an Amazon data center by replicating workloads across multiple AZs in the same Region. This blog shows you how AWS managed services automatically fail over between AZs without interruption during a localized disaster, and how backups to a separate Region ensure data protection.

In the next post, we will discuss a multi-Region warm standby strategy for the same application stack illustrated in this post.

Related information

Query data in Amazon OpenSearch Service using SQL from Amazon Athena

Post Syndicated from Behram Irani original https://aws.amazon.com/blogs/big-data/query-data-in-amazon-opensearch-service-using-sql-from-amazon-athena/

Amazon Athena is an interactive serverless query service that lets you query data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Amazon OpenSearch Service (successor to Amazon Elasticsearch Service) is a fully managed, open-source, distributed search and analytics suite derived from Elasticsearch, allowing you to run OpenSearch or Elasticsearch clusters at scale without having to manage hardware provisioning, software installation, patching, backups, and so on.

Although both services have their own use cases, there might be situations when you want to run queries that combine data in Amazon S3 with data in an Amazon OpenSearch Service Cluster. In such cases, federated queries in Athena are a good option—they provide the capability to combine data from multiple data sources and analyze them in a single query. The federated query feature works by using data source connectors that are built for different data sources. To allow Amazon OpenSearch Service to be one of the sources that Athena can query against, AWS has made available a data source connector for OpenSearch Service clusters to be queried from Athena.

This post demonstrates how to query data in Amazon OpenSearch Service and Amazon S3 in a single query. We use the data made available in the public COVID-19 data lake, documented as part of the post A public data lake for analysis of COVID-19 data.

In particular, we use the following two datasets:

  1. alleninstitute_metadata : Metadata on papers pulled from the COVID-19 Open Research Dataset (CORD-19). The sha column indicates the paper ID, which is the file name of the paper in the data lake. This dataset is stored in Amazon OpenSearch Service because it contains the column abstract, which you can search on.
  2. alleninstitute_comprehend_medical : Results containing annotations obtained by running the papers in the preceding dataset through Amazon Comprehend Medical. This is accessed from its public storage at s3://covid19-lake/alleninstitute/CORD19/comprehendmedical/comprehend_medical.json.

Data flow when combining data from Amazon OpenSearch Service and Amazon S3

The data source connectors are implemented as AWS Lambda functions. When a user issues a query that combines data from Amazon OpenSearch Service and Amazon S3, Athena refers to the AWS Glue Data Catalog metadata to look up the table definitions. For the table whose data is in Amazon S3, Athena fetches the data from Amazon S3 directly. For the tables in Amazon OpenSearch Service, Athena invokes the Lambda function (part of the data source connector application) to read the data from Amazon OpenSearch Service. Depending on the amount of data, Athena can invoke this function multiple times in parallel for the same query to enable faster reads.

The following diagram illustrates this data flow.

Set up the two data sources using AWS CloudFormation

To prepare for querying both data sources, launch the AWS CloudFormation template using the “Launch Stack” button below. All you need to do is choose Create stack.

To run the CloudFormation stack, you need to be logged in to an AWS account with permissions to do the following:

  • Create a CloudFormation stack
  • Create an Identity and Access Management (IAM) role
  • Create a Lambda function, assign an IAM role to it, and invoke it
  • Launch an OpenSearch Service cluster
  • Create AWS Glue databases and table
  • Create an S3 bucket

For instructions on creating a CloudFormation stack, see Get started.

For more information about controlling permissions and access for these services, see the following resources:

The CloudFormation template creates the following:

  • A table in the AWS Glue Data Catalog named alleninstitute_comprehend_medical that points to the S3 location s3://covid19-lake/alleninstitute/CORD19/comprehendmedical/comprehend_medical.json. This contains the results extracted from the CORD-19 data using the natural language processing service Amazon Comprehend Medical.
  • An S3 bucket with the name athena-es-connector-spill-bucket- followed by the first few characters from the stack ID to keep the bucket name unique.
  • An OpenSearch Service cluster with the name es-alleninstitute-data, which has two instances configured to allow a role to access the cluster.
  • An IAM role to access the OpenSearch Service cluster.
  • A Lambda function that contains a piece of Python code that reads all the metadata of the papers along with the abstract. This data is available as JSON at s3://covid19-lake/alleninstitute/CORD19/json/metadata/. For this post, we load just one of the four JSON files available.
  • A custom resource that invokes the Lambda function to load the data into the OpenSearch Service cluster.

The stack can take 15–30 minutes to complete.

When the stack is fully deployed, navigate to the Outputs tab of the stack and note the name of the S3 bucket created (the value for SpillBucket).

For the rest of the steps, you need permissions to do the following:

Deploy the Amazon Athena OpenSearch connector

When the OpenSearch Service domain with an index containing the metadata related to the COVID-19 research papers and the AWS Glue table pointing to the Amazon Comprehend Medical output data is ready, you can deploy the Amazon Athena OpenSearch connector using the AWS Serverless Application Repository.

  1. On the AWS Serverless Application Repository console, choose Available applications.
  2. Search for Athena Elasticsearch and select Show apps that create custom IAM roles or resource policies.
  3. Choose AthenaElasticsearchConnector.

You’re redirected to the application screen.

  1. Scroll down to the Application settings section.
  2. For AthenaCatalogName, enter a name (for this post, we use es-connector).

This name is the name of the application and the Lambda function that connects to Amazon OpenSearch Service every time you run a query from Athena. For more details about all the parameters, refer to the connector’s GitHub page.

  1. For SpillBucket, enter the name you noted in the previous section when we deployed the CloudFormation stack (it begins with athena-es-connector-spill-bucket).
  2. Leave all other settings as default.
  3. Select I acknowledge that this app creates custom IAM roles.
  4. Choose Deploy.

In a few seconds, you’re redirected to the Applications page. You can see the status of your deployment on the Deployments tab. The deployment takes 1–2 minutes to complete.

Create a new data source in Athena

Now that the connector application has been deployed, it’s time to set up the OpenSearch Service domain to show as a catalog on Athena.

  1. On the Athena console, navigate to the cord19 database.

The database contains the table alleninstitute_comprehend_medical, which was created as part of the CloudFormation template. This refers to the data sitting in Amazon S3 at s3://covid19-lake/alleninstitute/CORD19/comprehendmedical/.

  1. Choose Data sources in the navigation pane.
  2. Choose Connect data source.
  3. Select Custom data source.
  4. For Data source name, enter a name (for example, es-cord19-catalog).
  5. Select Use an existing Lambda function and choose es-connector on the drop-down menu.
  6. Choose Connect data source.
  7. Choose Next.
  8. For Lambda function, choose es-connector.
  9. Choose Connect.

A new catalog es-cord19-catalog should now be available, as in the following screenshot.

  1. On the Query editor tab, for Data source, choose es-cord19-catalog.

You can now query this data source from Athena.

Query OpenSearch Service domains from Athena

When you choose the es-cord19-catalog data source, the Lambda function (which was part of the connector application that we deployed) gets invoked and fetches the details about the domain and the index. The OpenSearch Service domain shows up as a database, and the index is shown as a table. You can also query the table with the following query:

select count(*) from "es-cord19-catalog"."es-alleninstitute-data".alleninstitute_metadata

Now you can join data from both Amazon OpenSearch Service and Amazon S3 with queries, such as the following:

select es.title, es.url from 
"es-cord19-catalog"."es-alleninstitute-data".alleninstitute_metadata es
    inner join
AwsDataCatalog.cord19.alleninstitute_comprehend_medical s3
    on es.sha = s3.paper_id
WHERE 
    array_join(s3.dx_name, ',') like '%infectious disease%'

The preceding query gets the title and the URL of all the research papers where the diagnosis was related to infectious diseases.

The following screenshot shows the query results.

Clean up

To clean up the resources created as part of this post, complete the following steps:

  1. On the Amazon S3 console, locate and select your S3 bucket (the same bucket you noted from the CloudFormation stack).
  2. Choose Empty.

You can also achieve this by running the following command from a command line:

aws s3 rm s3://athena-es-connector-spill-bucket-f7eb2cb0 --recursive
  1. On the AWS CloudFormation console, delete the stack you created.
  2. Delete the stack created for the Amazon Athena OpenSearch connector application. The default name is serverlessrepo-AthenaElasticsearchConnector.
  3. On the Athena console, delete the es-cord19-catalog data source.

You can also delete the data source with the following command:

aws athena delete-data-catalog --name "es-cord19-catalog"

Conclusion

In this post, we saw how to combine data from OpenSearch Service clusters with other data sources like Amazon S3 to run federated queries. You can apply this solution to other use cases, such as combining AWS CloudTrail logs loaded into OpenSearch Service clusters with VPC flow logs data in Amazon S3 to analyze unusual network traffic, or combining product reviews data in Amazon OpenSearch Service with product data in Amazon S3 or other data sources. You can also pull data from Amazon OpenSearch Service and create an AWS Glue table out of it using a CTAS query in Athena.

To learn more about the Amazon Athena OpenSearch connector and its other configuration options, see the GitHub repo.

To learn more about query federation in Athena, refer to Using Amazon Athena Federated Query or Query any data source with Amazon Athena’s new federated query.


About the Authors

Behram Irani, Sr Analytics Solutions Architect

Madhav Vishnubhatta, Sr Technical Account Manager

Now Available: Updated guidance on the Data Analytics Lens for AWS Well-Architected Framework

Post Syndicated from Wallace Printz original https://aws.amazon.com/blogs/big-data/now-available-updated-guidance-on-the-data-analytics-lens-for-aws-well-architected-framework/

Nearly all businesses today require some form of data analytics processing, from auditing user access to generating sales reports. For all your analytics needs, the Data Analytics Lens for AWS Well-Architected Framework provides prescriptive guidance to help you assess your workloads and identify best practices aligned to the AWS Well-Architected Pillars: Operational Excellence, Security, Reliability, Performance Efficiency, and Cost Optimization. Today, we’re pleased to announce a completely revised and updated version of the Data Analytics Lens whitepaper.

Self-assess with Well-Architected design principles

The updated version of the Data Analytics Lens whitepaper has been revised to provide guidance to CxOs as well as all data personas. Within each of the five Well-Architected Pillars, we provide top-level design principles for CxOs to quickly identify areas for teams and fundamental rules that analytics workloads designers should follow. Each design principle is followed by a series of questions and best practices that architects and system designers can use to perform self-assessments. Additionally, the Data Analytics Lens includes suggestions that prescriptively explain steps to implement best practices useful for implementation teams.

For example, the Security Pillar design principle “Control data access” works with the best practice to build user identity solutions that uniquely identify people and systems. The associated suggestion for this best practice is to centralize workforce identities, which details how to use this principle and includes links to more documentation on the suggestion.

“Building Data Analytics platform or workloads is one of the complex architecture patterns. It involves multi-layered approach such as Data Ingestion, Data Landing, Transformation Layer, Analytical/Insight and Reporting. Choices of technology and service for each of these layers are wide. The AWS Well-Architected Analytics Lens helps us to design and validate with great confidence against each of the pillars. Now Cognizant Architects can perform assessments using the Data Analytics Lens to validate and help build secure, scalable and innovative data solutions for customers.”

– Supriyo Chakraborty, Principal Architect & Head of Data Engineering Guild, Cognizant Germany
– Somasundaram Janavikulam, Cloud Enterprise Architect & Well Architected Partner Program Lead, Cognizant

In addition to performing your own assessment, AWS can provide a guided experience through reviewing your workload with a Well-Architected Framework Review engagement. For customers building data analytics workloads with AWS Professional Services, our teams of Data Architects can perform assessments using the Data Analytics Lens during the project engagements. This provides you with an objective assessment of your workloads and guidance on future improvements. The integration is available now for customers of the AWS Data Lake launch offering, with additional Data Analytics offerings coming in 2022. Reach out to your AWS Account Team if you’d like to know more about these guided Reviews.

Updated architectural patterns and scenarios

In this version of the Data Analytics Lens, we have also revised the discussion of data analytics patterns and scenarios to keep up with the industry and modern data analytics practices. Each scenario includes sections on characteristics that help you plan when developing systems for that scenario, a reference architecture to visualize and explain how the components work together, and configuration notes to help you properly configure your solution.

This version covers the following topics:

  • Building a modern data architecture (formerly Lake House Architecture)
  • Organize around data domains by delivering data as a product using a data mesh
  • Efficiently and securely provide batch data processing
  • Use streaming ingest and stream processing for real-time workloads
  • Build operational analytics systems to improve business processes and performance
  • Provide data visualization securely and cost-effectively at scale

Changed from the first release, the machine learning and tenant analytics scenarios have been migrated to a separate Machine Learning Lens whitepaper and SaaS Lens whitepaper.

Conclusion

We expect this updated version to provide better guidance for validating your existing architectures, as well as recommendations for any gaps that are identified.

For more information about building your own Well-Architected systems using the Data Analytics Lens, see the Data Analytics Lens whitepaper.

Special thanks to everyone across the AWS Solution Architecture and Data Analytics communities who contributed. These contributions encompassed diverse perspectives, expertise, and experiences in developing the new AWS Well-Architected Data Analytics Lens.


About the Authors

Wallace Printz is a Senior Solutions Architect based in Austin, Texas. He helps customers across Texas transform their businesses in the cloud. He has a background in semiconductors, R&D, and machine learning.

Indira Balakrishnan is a Senior Solutions Architect in the AWS Analytics Specialist SA Team. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems using data-driven decisions. Outside of work, she volunteers at her kids’ activities and spends time with her family.

Top 5: Featured Architecture Content for September

Post Syndicated from Elyse Lopez original https://aws.amazon.com/blogs/architecture/top-5-featured-architecture-content-for-september/

The AWS Architecture Center provides new and notable reference architecture diagrams, vetted architecture solutions, AWS Well-Architected best practices, whitepapers, and more. This blog post features some of our best picks from the new and newly updated content we released in the past month.

1. AWS Best Practices for DDoS Resiliency

Prioritizing the availability and responsiveness of your application helps you maintain customer trust. That's why it's crucial to protect your business from the impact of distributed denial of service (DDoS) and other cyberattacks. This whitepaper provides prescriptive guidance to improve the resiliency of your applications and best practices for managing different attack types.

2. Predictive Modeling for Automotive Retail

Automotive retailers use data to better understand how their incentives are helping to sell cars. This new reference architecture diagram shows you how to design a modeling system that provides granular return on investment (ROI) predictions for automotive sales incentives.

3. AWS Graviton Performance Testing – Tips for Independent Software Vendors

If you're deciding whether to phase in AWS Graviton processors for your workload, this whitepaper covers best practices and common pitfalls for defining test approaches to evaluate Amazon Elastic Compute Cloud (Amazon EC2) instance performance, how to set success factors, and how to compare different test methods and their implementations.

4. Text Analysis with Amazon OpenSearch Service and Amazon Comprehend

This AWS Solutions Implementation was recently updated with new guidance related to Amazon OpenSearch Service, the successor to Amazon Elasticsearch Service. Learn how Amazon OpenSearch Service and Amazon Comprehend work together to deploy a cost-effective, end-to-end solution to extract meaningful insights from unstructured text-based data such as customer calls, support tickets, and online customer feedback.

5. Back to Basics: Hosting a Static Website on AWS

In this episode of Back to Basics, join SA Readiness Specialist Even Zhang as he breaks down the AWS services you can use to host and scale your static website without a single server. You’ll also learn how to use additional functionalities to enhance your observability and security posture or run A/B tests.

Figure 1. CloudFront Edge Locations and Caches from Back to Basics video

 

Amazon Elasticsearch Service Is Now Amazon OpenSearch Service and Supports OpenSearch 1.0

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/amazon-elasticsearch-service-is-now-amazon-opensearch-service-and-supports-opensearch-10/

In 2015, we launched Amazon Elasticsearch Service (Amazon ES), a fully managed service that makes it easy for you to perform interactive log analytics, real-time application monitoring, website search, and more.

Amazon ES has been a popular service for log analytics because of its ability to ingest high volumes of log data. Additionally, with UltraWarm and cold storage tiers, you can lower costs to one-tenth of traditional hot storage on Amazon ES. Because Amazon ES integrates with Logstash, Amazon Kinesis Data Firehose, Amazon CloudWatch Logs, and AWS IoT, you can select the secure data ingestion tool that meets your use case requirements.

Developers embrace open-source software for many reasons. One of the most important reasons is the freedom to use that software where and how they want. On January 21, 2021, Elastic NV announced that they would change their software licensing strategy. After Elasticsearch version 7.10.2 and Kibana 7.10.2, they will not release new versions of Elasticsearch and Kibana under the permissive 2.0 version of the Apache License (ALv2). Instead, Elastic NV is releasing Elasticsearch and Kibana under the Elastic license, with source code available under the Elastic License or Server Side Public License (SSPL). These licenses are not open source and do not offer users the same freedom.

For this reason, we decided to create and maintain OpenSearch, a community-driven, open-source fork from the last ALv2 version of Elasticsearch and Kibana. We are making a long-term investment in the OpenSearch project and recently released version 1.0.

OpenSearch provides a highly scalable system for providing fast access and response to large volumes of data with an integrated visualization tool, OpenSearch Dashboards, that makes it easy for users to explore their data. OpenSearch and OpenSearch Dashboards were originally derived from Elasticsearch 7.10.2 and Kibana 7.10.2. Like Elasticsearch and Apache Solr, OpenSearch is powered by the Apache Lucene search library.

Announcing Amazon OpenSearch Service
Today, we rename Amazon Elasticsearch Service to Amazon OpenSearch Service because the service now supports OpenSearch 1.0. Although the name has changed, we will continue to deliver the same experiences without any negative impact to ongoing operations, development methodology, or business use.

Amazon OpenSearch Service offers a choice of open-source engines to deploy and run, including OpenSearch 1.0 and the 19 currently available versions of ALv2 Elasticsearch (7.10 and earlier). We will continue to support and maintain the ALv2 Elasticsearch versions with security and bug fixes. We will deliver all new features and functionality through OpenSearch and OpenSearch Dashboards. Amazon OpenSearch Service APIs will be backward-compatible with the existing service APIs, so there is no need for you to update your current client code or applications. We will keep clients of OpenSearch compatible with open source.

To get started, in the AWS Management Console, choose Create a domain. In Step 1: Choose deployment type, select OpenSearch 1.0 (latest).

We recommend OpenSearch 1.0 if you are deploying a new cluster and want access to the latest features and enhancements. OpenSearch 1.0 is compatible with the open-source Elasticsearch 7.10 APIs and most clients.
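
If you prefer to script domain creation rather than use the console steps above, the following is a minimal sketch using the AWS SDK for Python (boto3) and its opensearch client. The domain name, Region, instance type, and volume size are illustrative placeholders, and a production domain would typically also configure access policies, encryption, and networking options.

    import boto3

    # Create an OpenSearch Service client in the Region where the domain should live.
    client = boto3.client("opensearch", region_name="us-east-1")

    # Create a domain running OpenSearch 1.0 (equivalent to selecting
    # "OpenSearch 1.0 (latest)" in the console). All values below are placeholders.
    response = client.create_domain(
        DomainName="my-opensearch-domain",      # hypothetical domain name
        EngineVersion="OpenSearch_1.0",
        ClusterConfig={
            "InstanceType": "r6g.large.search",
            "InstanceCount": 3,
        },
        EBSOptions={
            "EBSEnabled": True,
            "VolumeType": "gp2",
            "VolumeSize": 100,                  # GiB per data node
        },
    )

    status = response["DomainStatus"]
    print(status["DomainName"], status["EngineVersion"])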

Upgrading to OpenSearch 1.0
Amazon OpenSearch Service offers a seamless in-place upgrade path from existing Elasticsearch 6.x and 7.x managed clusters to OpenSearch. To upgrade a domain to OpenSearch 1.0 in the AWS Management Console, choose the domain that you want to upgrade, choose Actions, and then select Upgrade domain.

Next, select OpenSearch 1.0 as the version to upgrade to. The existing domain is upgraded in place, without creating a separate domain or migrating your data.

The upgrade process is irreversible, and it can’t be paused or canceled. During an upgrade, you can’t make configuration changes to the domain. Before you start an upgrade, select Check upgrade eligibility to run pre-upgrade checks for issues that can block the upgrade and to take a snapshot of the cluster.

Amazon OpenSearch Service starts the upgrade, which can take from 15 minutes to several hours to complete. To learn more, see Upgrading Elasticsearch and Service Software Updates in the Amazon OpenSearch Service Developer Guide.
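
The same upgrade flow can also be scripted. The sketch below, again using boto3’s opensearch client, first runs the eligibility check only and then starts the in-place upgrade; the domain name is a placeholder, and the exact fields returned by the status call may vary.

    import boto3

    client = boto3.client("opensearch", region_name="us-east-1")

    # Run the pre-upgrade eligibility checks only (the scripted equivalent of
    # "Check upgrade eligibility" in the console); no upgrade is started yet.
    client.upgrade_domain(
        DomainName="my-existing-domain",    # hypothetical existing 6.x/7.x domain
        TargetVersion="OpenSearch_1.0",
        PerformCheckOnly=True,
    )

    # Start the actual in-place upgrade. Remember: it can't be paused or canceled.
    client.upgrade_domain(
        DomainName="my-existing-domain",
        TargetVersion="OpenSearch_1.0",
    )

    # Poll progress while the upgrade runs (15 minutes to several hours).
    status = client.get_upgrade_status(DomainName="my-existing-domain")
    print(status.get("UpgradeStep"), status.get("StepStatus"))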

OpenSearch Features
OpenSearch provides the following features that were not previously available in open-source Elasticsearch.

• Advanced Security: Offers encryption, authentication, authorization, and auditing features, including integrations with Active Directory, LDAP, SAML, Kerberos, JSON web tokens, and more. OpenSearch also provides fine-grained, role-based access control to indices, documents, and fields.
• SQL Query Syntax: Provides the familiar SQL query syntax. Use aggregations, group by, and where clauses to investigate your data. Read data as JSON documents or CSV tables so you have the flexibility to use the format that works best for you. (A query sketch follows this list.)
• Reporting: Schedule, export, and share reports from dashboards, saved searches, alerts, and visualizations.
• Anomaly Detection: Use machine learning anomaly detection based on the Random Cut Forest (RCF) algorithm to automatically detect anomalies as your data is ingested. Combine with alerting to monitor data in near real time and send alert notifications automatically.
• Index Management: Define custom policies to automate routine index management tasks, such as rollover and delete, and apply them to indices and index patterns.
• Performance Analyzer and RCA Framework: Query numerous cluster performance metrics and aggregations. Use PerfTop, the command line interface (CLI), to quickly display and analyze those metrics. Use the root cause analysis (RCA) framework to investigate performance and reliability issues in clusters.
• Asynchronous Search: Run complex queries without worrying about timeouts; asynchronous search queries run in the background. Track query progress and retrieve partial results as they become available.
• Trace Analytics: Ingest and visualize OpenTelemetry data for distributed applications. Visualize the flow of events between these applications to identify performance problems.
• Alerting: Automatically monitor data and send alert notifications to stakeholders. With an intuitive interface and a powerful API, easily set up, manage, and monitor alerts. Craft highly specific alert conditions using OpenSearch’s full query language and scripting capabilities.
• k-NN Search: Using machine learning, run the nearest neighbor search algorithm on billions of documents across thousands of dimensions with the same ease as running any regular OpenSearch query. Use aggregations and filter clauses to further refine similarity search operations. k-NN similarity search powers use cases such as product recommendations, fraud detection, image and video search, related document search, and more.
• Piped Processing Language: Provides a familiar query syntax with a comprehensive set of commands delimited by pipes (|) to query data.
• Dashboard Notebooks: Combine dashboards, visualizations, text, and more to provide context and detailed explanations when analyzing data.
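
As a quick illustration of the SQL and Piped Processing Language features above, the following sketch sends the same question to a domain’s _plugins/_sql and _plugins/_ppl endpoints using the Python requests library. The domain endpoint, index name, and basic-auth credentials are placeholders; domains that use IAM-based authentication would sign requests instead of using basic auth.

    import requests

    DOMAIN = "https://my-domain.us-east-1.es.amazonaws.com"   # placeholder domain endpoint
    AUTH = ("master-user", "master-password")                 # placeholder credentials

    # SQL: familiar SELECT / WHERE / GROUP BY syntax against an index named "logs".
    sql = {"query": "SELECT status, COUNT(*) FROM logs WHERE status >= 500 GROUP BY status"}
    r = requests.post(f"{DOMAIN}/_plugins/_sql", json=sql, auth=AUTH)
    print(r.json())

    # PPL: the same question expressed as pipe-delimited commands.
    ppl = {"query": "source=logs | where status >= 500 | stats count() by status"}
    r = requests.post(f"{DOMAIN}/_plugins/_ppl", json=ppl, auth=AUTH)
    print(r.json())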

OpenSearch 1.0 supports three new features that are not available in the existing Elasticsearch versions supported on Amazon OpenSearch Service: Transforms, Data Streams, and Notebooks in OpenSearch Dashboards.

To engage with the OpenSearch community, we welcome pull requests through GitHub to fix bugs, improve performance and stability, or add new features. You can leave feedback in the OpenSearch community forum.

Now Available
Starting today, Amazon Elasticsearch Service has been renamed to Amazon OpenSearch Service in all AWS Regions. For more information, see the Amazon OpenSearch Service page.

You can send feedback to the AWS forum for Amazon OpenSearch Service or through your usual AWS Support contacts.

Channy