Amazon Elasticsearch Service (Amazon ES) is a fully managed service that you can use to deploy, secure, and run Elasticsearch cost-effectively at scale. The service provides support for open-source Elasticsearch APIs, managed Kibana, and integration with Logstash and other AWS services. Amazon ES provides a deep security model that spans many layers of interaction and supports fine-grained access control at the cluster, index, document, and field level, on a per-user basis. The service’s security plugin integrates with federated identity providers for Kibana login.
A common use case for Amazon ES is log analytics. Customers configure their applications to store log data to the Elasticsearch cluster, where the data can be queried for insights into the functionality and use of the applications over time. In many cases, users reviewing those insights should not have access to all the details from the log data. The log data for a web application, for example, might include the source IP addresses of incoming requests. Privacy rules in many countries require that those details be masked, wholly or in part. This post explains how to set up field masking within your Amazon ES domain.
Field masking is an alternative to field-level security that lets you anonymize the data in a field rather than remove it altogether. When you create a role, you add a list of fields to mask; masking affects whether the contents of those fields are visible in search results. You can use field masking to apply either a random hash or a pattern-based substitution to sensitive values, hiding them from users who shouldn't have access to that information.
When you use field masking, Amazon ES creates a hash of the actual field values before returning the search results. You can apply field masking on a per-role basis, supporting different levels of visibility depending on the identity of the user making the query. Currently, field masking is only available for string-based fields. A search result with a masked field (clientIP) looks like this:
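For example, a hit in a search response might look like the following sketch; the document fields and the hash value shown here are illustrative only:
{
  "_index": "web_logs",
  "_type": "_doc",
  "_id": "1",
  "_source": {
    "request": "/apm-server/metricbeat",
    "response": 200,
    "bytes": 6219,
    "clientIP": "4338a420c18985dcbc8db0da18a1d4eb83c6f543dfd90a802bdbc6885fbba104"
  }
}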
To follow along in this post, make sure you have an Amazon ES domain with Elasticsearch version 6.7 or higher, sample data loaded (this example uses the web logs data supplied by Kibana), and access to Kibana through a role with administrator privileges for the domain.
Configure field masking
You manage field masking by defining access controls within Kibana. You'll need to create a new Kibana role, define the fine-grained access-control privileges for that role, specify which fields to mask, and apply that role to specific users.
You can use either the Kibana console or direct-to-API calls to set up field masking. In our first example, we’ll use the Kibana console.
To configure field masking in the Kibana console
Log in to Kibana, choose the Security pane, and then choose Roles, as shown in Figure 1.
Figure 1: Choose security roles
Choose the plus sign (+) to create a new role, as shown in Figure 2.
Figure 2: Create role
Choose the Index Permissions tab, and then choose Add index permissions, as shown in Figure 3.
Once you’ve set Index Patterns, Permissions: Action Groups, Document Level Security Query, and Include or exclude fields, you can use the Anonymize fields entry to mask the clientIP, as shown in Figure 4.
Figure 4: Anonymize field
Choose Save Role Definition.
Next, you need to create one or more users and apply the role to the new users. Go back to the Security page and choose Internal User Database, as shown in Figure 5.
Figure 5: Select Internal User Database
Choose the plus sign (+) to create a new user, as shown in Figure 6.
Figure 6: Create user
Add a username and password, and under Open Distro Security Roles, select the role es-mask-role, as shown in Figure 7.
Figure 7: Select the username, password, and roles
Choose Submit.
If you prefer, you can perform the same task by calling the Amazon ES REST API from the Kibana dev tools console.
Use the roles API to create the role, as shown in the following snippet and in Figure 8.
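A minimal sketch of the request follows; the role name matches the one created earlier, while the web_logs index pattern and read-only permissions are assumptions you should adjust to your own indexes:
PUT _opendistro/_security/api/roles/es-mask-role
{
  "cluster_permissions": [
    "cluster_composite_ops_ro"
  ],
  "index_permissions": [
    {
      "index_patterns": ["web_logs*"],
      "allowed_actions": ["read"],
      "masked_fields": ["clientIP"]
    }
  ]
}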
You can verify field masking by running a simple search query using Kibana dev tools (GET web_logs/_search) and retrieving the data first by using the kibana_user (with no field masking), and then by using the es-mask-user (with field masking) you just created.
Queries run as the kibana_user (which has full access) return the original values in all fields, as shown in Figure 10.
Figure 10: Retrieval of the full clientIP data with kibana_user
Figure 11, following, shows an example of what you would see if you logged in as the es-mask-user. In this case, the clientIP field is masked because of the es-mask-role you created.
Figure 11: Retrieval of the masked clientIP data with es-mask-user
Use pattern-based field masking
Rather than creating a hash, you can use one or more regular expressions and replacement strings to mask a field. The syntax is <field>::/<regular-expression>/::<replacement-string>.
You can use either the Kibana console or direct-to-API calls to set up pattern-based field masking. In the following example, the last three octets of clientIP are replaced with xxx using the pattern clientIP::/[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}$/::xxx.xxx.xxx, so you see only the first octet of the IP address, as shown in Figure 12.
Figure 12: Anonymize the field with a pattern
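If you use the API instead, the pattern goes into the role's masked_fields entry. A minimal sketch follows, reusing the es-mask-role name and the assumed web_logs index pattern from earlier:
PUT _opendistro/_security/api/roles/es-mask-role
{
  "index_permissions": [
    {
      "index_patterns": ["web_logs*"],
      "allowed_actions": ["read"],
      "masked_fields": ["clientIP::/[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}$/::xxx.xxx.xxx"]
    }
  ]
}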
Run the search query to verify that the last three parts of clientIP are masked by custom characters and only the first part is shown to the requester, as shown in Figure 13.
Figure 13: Retrieval of the masked clientIP (according to the defined pattern) with es-mask-user
Conclusion
Field-level security should be the primary approach for controlling access to data. However, if specific business requirements can't be met that way, field masking can offer a viable alternative. By using field masking, you can control whether your users see private information such as personally identifiable information (PII) or personal health information (PHI). For more information about fine-grained access control, see the Amazon Elasticsearch Service Developer Guide.
Amazon Elasticsearch Service (Amazon ES) provides fine-grained access control, powered by the Open Distro for Elasticsearch security plugin. The security plugin adds Kibana authentication and access control at the cluster, index, document, and field levels that can help you secure your data. You now have many different ways to configure your Amazon ES domain to provide access control. In this post, I offer basic configuration information to get you started.
Figure 1: A high-level view of data flow and security
Figure 1 details the authentication and access control provided in Amazon ES. The left half of the diagram shows the different methods of authenticating. Reading horizontally, requests originate either from Kibana or go directly to the REST API. When you use Kibana, you can log in through a screen powered by the Open Distro security plugin, your SAML identity provider, or Amazon Cognito. Each of these methods results in an authenticated identity: SAML providers via the SAML assertion, Amazon Cognito via an AWS Identity and Access Management (IAM) identity, and Open Distro via an internal user identity. When you use the REST API directly, you can authenticate with AWS Signature Version 4 request signing (SigV4 signing) or with a user name and password. You can also send unauthenticated traffic, but your domain should be configured to reject all such traffic.
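As an illustration of SigV4-signed access to the REST API, the following sketch assumes the third-party requests and requests-aws4auth Python packages and a placeholder Region and domain endpoint:
import boto3
import requests
from requests_aws4auth import AWS4Auth

region = "us-east-1"                                              # placeholder Region
endpoint = "https://search-my-domain.us-east-1.es.amazonaws.com"  # placeholder endpoint

# Sign the request with the credentials of the current IAM identity
credentials = boto3.Session().get_credentials()
awsauth = AWS4Auth(credentials.access_key, credentials.secret_key,
                   region, "es", session_token=credentials.token)

# Fine-grained access control then maps this IAM identity to a role in the domain
response = requests.get(endpoint + "/_cluster/health", auth=awsauth)
print(response.json())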
The right side of the diagram details the access control points. It's easiest to think of access control in two phases: at the edge, by IAM, and within the Amazon ES domain, by the Open Distro security plugin.
First, requests from Kibana or direct API calls have to reach your domain endpoint. If you follow best practices and the domain is in an Amazon Virtual Private Cloud (VPC), you can use Amazon Elastic Compute Cloud (Amazon EC2) security groups to allow or deny traffic based on the originating IP address or security group of the Amazon EC2 instances. Best practice includes least privilege based on subnet ACLs and security group ingress and egress restrictions. In this post, we assume that your requests are legitimate, meet your access control criteria, and can reach your domain.
When a request reaches the domain endpoint (the edge of your domain), it can be anonymous or it can carry the identity and authentication information described previously. Each Amazon ES domain carries a resource-based IAM policy, with which you can allow or deny traffic based on an IAM identity attached to the request. When your policy specifies an IAM principal, Amazon ES evaluates the request against the allowed Actions in the policy and allows or denies the request. If the request doesn't carry an IAM identity (for example, it uses a SAML assertion or a user name and password instead), you leave the domain policy open, and traffic passes through to fine-grained access control in Amazon ES without any checks at this layer. You should still employ IAM security best practices and add additional IAM restrictions for direct-to-API access control once your domain is set up.
The Open Distro for Elasticsearch security plugin has its own internal user database for user name and password authentication and handles access control for all users. When traffic reaches the Elasticsearch cluster, the plugin validates any user name and password authentication information against this internal database to identify the user and grant a set of permissions. If a request comes with identity information from either SAML or an IAM role, you map that backend role onto the roles or users that you have created in Open Distro security.
The Amazon ES console provides a guided wizard that lets you configure, and later reconfigure, your Amazon ES domain. Step 1 offers you the opportunity to select some predefined configurations that carry through the wizard. In Step 2, you choose the instances to deploy in your domain. In Step 3, you configure security, which is the focus of this post. See also these tutorials that explain using an IAM master user and using an HTTP-authenticated master user.
Note: At the time of writing, you cannot enable fine-grained access control on existing domains; you must create a new domain and enable the feature at domain creation time. You can use fine-grained access control with Elasticsearch versions 6.8 and later.
Set your endpoint
Amazon ES gives you a DNS name that resolves to an IP address, which you use to send traffic to the Elasticsearch cluster in the domain. The IP address can be in the IP space of the public internet, or it can resolve to an IP address in your VPC. Although fine-grained access control gives you the means to secure your cluster even when the endpoint is a public IP address, we recommend using VPC access as the more secure option, as shown in Figure 2.
Figure 2: Select VPC access
With the endpoint in your VPC, you use security groups to control which ports accept traffic and limit access to the endpoints of your Amazon ES domain to IP addresses in your VPC. Make sure to use least privilege when setting up security group access.
Enable fine-grained access control
You should enable fine-grained access control, as shown in Figure 3.
Figure 3: Enabled fine-grained access control
Set up the master user
The master user is the administrator identity for your Amazon ES domain. This user can set up additional users in the Amazon ES security plugin, assign roles to them, and assign permissions for those roles. You can choose user name and password authentication for the master user, or use an IAM identity. User name and password authentication, shown in Figure 4, is simpler to set up and—with a strong password—may provide sufficient security depending on your use case. We recommend you follow your organization’s policy for password length and complexity. If you lose this password, you can return to the domain’s dashboard in the AWS Management Console and reset it. You’ll use these credentials to log in to Kibana. Following best practices on choosing your master user, you should move to an IAM master user once setup is complete.
Note: Password strength is a function of length, complexity of characters (e.g., upper and lower case letters, numbers, and special characters), and unpredictability to decrease the likelihood the password could be guessed or cracked over a period of time.
Figure 4: Setting up the master username and password
Do not enable Amazon Cognito authentication
When you use Kibana, Amazon ES includes a login experience. You currently have three choices for the source of the login screen:
The Open Distro security plugin
Amazon Cognito
Your SAML-compliant system
You can apply fine-grained access control regardless of how you log in. However, setting up fine-grained access control for the master user and additional users is most straightforward if you use the login experience provided by the Open Distro security plugin. After your first login, and when you have set up additional users, you should migrate to either Cognito or SAML for login, taking advantage of the additional security they offer. To use the Open Distro login experience, disable Amazon Cognito authentication, as shown in Figure 5.
Figure 5: Amazon Cognito authentication is not enabled
If you plan to integrate with your SAML identity provider, check the Prepare SAML authentication box. You will complete the setup when the domain is active.
Figure 6: Choose Prepare SAML authentication if you plan to use it
Use an open access policy
When you create your domain, you attach an IAM policy to it that controls whether your traffic must be signed with AWS SigV4 request signing for authentication. Policies that specify an IAM principal require that you use AWS SigV4 signing to authenticate those requests. The domain sends your traffic to IAM, which authenticates signed requests to resolve the user or role that sent the traffic. The domain and IAM apply the policy's access controls and either accept or reject the traffic based on the requested actions; these checks apply down to the index level for single-index API calls.
When you use fine-grained access control, your traffic is also authenticated by the Amazon ES security plugin, which makes the IAM authentication redundant. Create an open access policy, as shown in Figure 7; an open policy doesn't specify a principal and so doesn't require request signing. This can be acceptable because you can still require an authenticated identity on all traffic: the security plugin authenticates the traffic as described above and applies access control based on the internal user database.
Figure 7: Selected open access policy
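The resulting domain access policy looks roughly like the following; the Region, account ID, and domain name are placeholders:
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": { "AWS": "*" },
      "Action": "es:*",
      "Resource": "arn:aws:es:us-east-1:123456789012:domain/my-domain/*"
    }
  ]
}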
Encrypted data
Amazon ES provides an option to encrypt data in transit and at rest for any domain. When you enable fine-grained access control, encryption is required, and the corresponding check boxes are automatically selected and can't be changed, as shown in Figure 8. This includes Transport Layer Security (TLS) for requests to the domain and for traffic between nodes in the domain, and encryption of data at rest through AWS Key Management Service (AWS KMS).
Figure 8: Enabled encryption
Accessing Kibana
When you complete the domain creation wizard, it takes about 10 minutes for your domain to activate. Return to the console and the Overview tab of your Amazon ES dashboard. When the Domain Status is Active, select the Kibana URL. Since you created your domain in your VPC, you must be able to access the Kibana endpoint via proxy, VPN, SSH tunnel, or similar. Use the master user name and password that you configured earlier to log in to Kibana, as shown in Figure 9. As detailed above, you should only ever log in as the master user to set up additional users—administrators, users with read-only access, and others.
Figure 9: Kibana login page
Conclusion
Congratulations, you now know the basic steps to set up the minimum configuration to access your Amazon ES domain with a master user. You can examine the settings for fine-grained access control in the Kibana console Security tab. Here, you can add additional users, assign permissions, map IAM users to security roles, and set up your Kibana tenancy. We’ll cover those topics in future posts.
Amazon Elasticsearch Service (Amazon ES) is a fully managed service that makes it easy to deploy, secure, scale, and monitor your Elasticsearch cluster in the AWS Cloud. Elasticsearch is a distributed database solution, which can be difficult to plan for and execute. This post discusses some best practices for deploying Amazon ES domains.
The most important practice is to iterate. If you follow these best practices, you can plan for a baseline Amazon ES deployment. Elasticsearch behaves differently for every workload—its latency and throughput are largely determined by the request mix, the requests themselves, and the data or queries that you run. There is no deterministic rule that can 100% predict how your workload will behave. Plan for time to tune and refine your deployment, monitor your domain’s behavior, and adjust accordingly.
Deploying Amazon ES
Whether you deploy on the AWS Management Console, in AWS CloudFormation, or via Amazon ES APIs, you have a wealth of options to configure your domain’s hardware, high availability, and security features. This post covers best practices for choosing your data nodes and your dedicated master nodes configuration.
When you configure your Amazon ES domain, you choose the instance type and count for data and the dedicated master nodes. Elasticsearch is a distributed database that runs on a cluster of instances or nodes. These node types have different functions and require different sizing. Data nodes store the data in your indexes and process indexing and query requests. Dedicated master nodes don’t process these requests; they maintain the cluster state and orchestrate. This post focuses on instance types. For more information about instance sizing for data nodes, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain. For more information about instance sizing for dedicated master nodes, see Get Started with Amazon Elasticsearch Service: Use Dedicated Master Instances to Improve Cluster Stability.
Amazon ES supports five instance classes: M, R, I, C, and T. As a best practice, use the latest generation instance type from each instance class. As of this writing, these are the M5, R5, I3, C5, and T2.
Choosing your instance type for data nodes
When choosing an instance type for your data nodes, bear in mind that these nodes carry all the data in your indexes (storage) and do all the processing for your requests (CPU). As a best practice, for heavy production workloads, choose the R5 or I3 instance type. If your emphasis is primarily on performance, the R5 typically delivers the best performance for log analytics workloads, and often for search workloads. The I3 instances are strong contenders and may suit your workload better, so you should test both. If your emphasis is on cost, the I3 instances have better cost efficiency at scale, especially if you choose to purchase reserved instances.
For an entry-level instance or a smaller workload, choose the M5s. The C5s are a specialized instance, relevant for heavy query use cases, which require more CPU work than disk or network. Use the T2 instances for development or QA workloads, but not for production. For more information about how many instances to choose, and a deeper analysis of the data handling footprint, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain.
Choosing your instance type for dedicated master nodes
When choosing an instance type for your dedicated master nodes, keep in mind that these nodes are primarily CPU-bound, with some RAM and network demand as well. The C5 instances work best as dedicated masters for clusters of up to about 75 data nodes. Above that node count, choose R5 instances.
Choosing Availability Zones
Amazon ES makes it easy to increase the availability of your cluster by using the Zone Awareness feature. You can choose to deploy your data and master nodes in one, two, or three Availability Zones. As a best practice, choose three Availability Zones for your production deployments.
When you choose more than one Availability Zone, Amazon ES deploys data nodes equally across the zones and makes sure that replicas go into different zones. Additionally, when you choose more than one Availability Zone, Amazon ES always deploys dedicated master nodes in three zones (if the Region supports three zones). Deploying into more than one Availability Zone gives your domain more stability and increases your availability.
Elasticsearch index and shard design
When you use Amazon ES, you send data to indexes in your cluster. An index is like a table in a relational database. Each search document is like a row, and each JSON field is like a column.
Amazon ES partitions the data in each index into shards, routing each document to a shard based on a hash of its ID by default. You must configure the shard count, and you should use the best practices in this section.
Index patterns
For log analytics use cases, you want to control the life cycle of data in your cluster. You can do this with a rolling index pattern. Each day, you create a new index, then archive and delete the oldest index in the cluster. You define a retention period that controls how many days (indexes) of data you keep in the domain based on your analysis needs. For more information, see Index State Management.
Setting your shard counts
There are two types of shards: primary and replica. The primary shard count defines how many partitions of data Elasticsearch creates; the replica count specifies how many additional copies of the primary shards it creates. You set the primary shard count at index creation and can't change it afterward (the _shrink and _split APIs exist, but using them on clusters under load at scale isn't recommended). You also set the replica count at index creation, but you can change it on the fly, and Elasticsearch adjusts accordingly by creating or removing replicas.
You can set the primary and replica shard counts when you create an index manually through the API. A better way for log analytics is to set an index template. See the following code:
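The following sketch shows the general shape of such a template on an Elasticsearch 6.x domain; the template name, index pattern, and counts here are illustrative:
PUT _template/web_logs_template
{
  "index_patterns": ["web_logs-*"],
  "settings": {
    "number_of_shards": 15,
    "number_of_replicas": 1
  }
}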
When you set a template like this, every new index whose name matches the index_pattern gets the settings, and the mapping if you specify one, applied at creation time. This gives you a convenient way of managing your shard strategy for rolling indexes: if you change the template, you get the new shard count in the next indexing cycle.
You should set the number_of_shards based on your source data size, using the following guideline: primary shard count = (daily source data in bytes * 1.25) / 50 GB.
For search use cases, where you’re not using rolling indexes, use 30 GB as the divisor, targeting 30 GB shards. However, these are guidelines. Always test with your own data, indexing, and queries to find your optimal shard size.
You should try to align your shard and instance counts so that your shards distribute equally across your nodes. You do this by adjusting shard counts or data node counts so that they are evenly divisible. For example, the default settings for Elasticsearch versions 6 and below are 5 primary shards and 1 replica (a total of 10 shards). You can get even distribution by choosing 2, 5, or 10 data nodes. Although it’s important to distribute your workload evenly on your data nodes, it’s not always possible to get every index deployed equally. Use the shard size as the primary guide for shard count and make small (< 20%) adjustments, generally favoring more instances or smaller shards, based on even distribution.
Determining storage size
So far, you’ve mapped out a shard count, based on the storage needed. Now you need to make sure that you have sufficient storage and CPU resources to process your requests. First, find your overall storage need: storage needed = (daily source data in bytes * 1.25) * (number_of_replicas + 1) * number of days retention.
You multiply your unreplicated daily index size by one plus the number of replicas, and then by the days of retention, to determine the total storage needed. Each replica adds storage equal to the primary storage size, and you add that daily total again for every day you want to retain data in the cluster. For search use cases, set the number of days of retention to 1.
The total storage needed drives a minimum instance type and instance count, based on the maximum storage each instance provides. If you're using EBS-backed instances like the M5 or R5, you can deploy EBS volumes up to the supported limit. For more information, see Amazon Elasticsearch Service Limits.
For instances with ephemeral store, storage is limited by the instance type (for example, I3.8xlarge.elasticsearch has 7.8 TB of attached storage). If you choose EBS, you should use the general purpose, GP2, volume type. Although the service does support the io1 volume type and provisioned IOPS, you generally don’t need them. Use provisioned IOPS only in special circumstances, when metrics support it.
Take the total storage needed and divide by the maximum storage per instance of your chosen instance type to get the minimum instance count.
After you have an instance type and count, make sure you have sufficient vCPUs to process your requests. Multiply the instance count by the vCPUs that instance provides to get the total vCPU count for the cluster. As an initial scale point, make sure that your vCPU count is 1.5 times your active shard count. An active shard is any shard, primary or replica, of an index that is receiving substantial reads or writes. For log analytics, only the shards of the current index are active. For search use cases, which are read heavy, all shards are typically active.
Although 1.5 is recommended, this is highly workload-dependent. Be sure to test and monitor CPU utilization and scale accordingly.
As you work with shard and instance counts, bear in mind that Amazon ES works best when the total shard count is as small as possible (fewer than 10,000 is a good soft limit). Each instance should also have no more than 25 shards per GB of JVM heap on that instance. For example, the R5.xlarge has 32 GB of RAM total. The service allocates half the RAM (16 GB) to the heap (the maximum heap size for any instance is 31.5 GB), so you should never have more than 16 * 25 = 400 shards on any node in that cluster.
Use case
Assume you have a log analytics workload supporting Apache web logs (500 GB/day) and syslogs (500 GB/day), retained for 7 days. This post focuses on the R5 instance type as the best choice for log analytics. You use a three-Availability Zone deployment, one primary and two replicas per index. With a three-zone deployment, you have to deploy nodes in multiples of three, which drives instance count and, to some extent, shard count.
The primary shard count for each index is (500 * 1.25) / 50 GB = 12.5 shards, which you round to 15. Using 15 primaries allows additional space to grow in each shard and is divisible by three (the number of Availability Zones, and therefore the number of instances, are a multiple of 3). The total storage needed is 1,000 * 1.25 * 3 * 7 = 26.25 TB. You can provide that storage with 18x R5.xlarge.elasticsearch, 9x R5.2xlarge.elasticsearch, or 6x R5.4xlarge.elasticsearch instances (based on EBS limits of 1.5 TB, 3 TB, and 6 TB, respectively). You should pick the 4xlarge instances, on the general guideline that vertical scaling is usually higher performance than horizontal scaling (there are many exceptions to this general rule, so make sure to iterate appropriately).
Having found a minimum deployment, you now need to validate the vCPU count. Each index has 15 primary shards and 2 replicas of each, for a total of 45 shards; the most recent indexes receive substantial writes, so each contributes 45 active shards, giving a total of 90. You can ignore the other six days of indexes because they are infrequently accessed; for log analytics, read volume is usually low and drops off as the data ages. Each R5.4xlarge.elasticsearch has 16 vCPUs, for a total of 96 in your cluster. The best-practice guideline calls for 90 * 1.5 = 135 vCPUs, so as a starting scale point you need to increase to 9x R5.4xlarge.elasticsearch, with 144 vCPUs. Testing may reveal that you're over-provisioned (which is likely), and you may be able to reduce to six. Finally, given your data node and shard counts, provision 3x C5.large.elasticsearch dedicated master nodes.
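As a sanity check, you can reproduce the arithmetic in this use case with a short sketch; the constants are the guidelines from this post and the inputs are the example workload:
# Sizing sketch for the example log analytics workload in this post
daily_gb_per_index = 500      # Apache web logs and syslogs each produce 500 GB/day
index_count = 2               # two rolling index patterns
replicas = 2                  # one primary plus two replicas of each shard
retention_days = 7

# Primary shards per index: (daily source data * 1.25) / 50 GB target shard size
primary_shards = (daily_gb_per_index * 1.25) / 50             # 12.5; round up to 15

# Total storage: daily source data * 1.25 * (replicas + 1) * retention
storage_tb = daily_gb_per_index * index_count * 1.25 * (replicas + 1) * retention_days / 1000

# Active shards: only today's indexes receive substantial traffic
active_shards = index_count * 15 * (replicas + 1)             # 90
vcpus_needed = active_shards * 1.5                            # 135

print(primary_shards, storage_tb, active_shards, vcpus_needed)
# 12.5 26.25 90 135.0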
Conclusion
This post covered some of the core best practices for deploying your Amazon ES domain. These guidelines give you a reasonable estimate of the number and type of data nodes. Stay tuned for subsequent posts that cover best practices for deploying secure domains, monitoring your domain’s performance, and ingesting data into your domain.
About the Author
Jon Handler (@_searchgeek) is a Principal Solutions Architect at Amazon Web Services based in Palo Alto, CA. Jon works closely with the CloudSearch and Elasticsearch teams, providing help and guidance to a broad range of customers who have search workloads that they want to move to the AWS Cloud. Prior to joining AWS, Jon’s career as a software developer included four years of coding a large-scale, eCommerce search engine.
A customer has been successfully creating and running multiple Amazon Elasticsearch Service (Amazon ES) domains to support their business users’ search needs across products, orders, support documentation, and a growing suite of similar needs. The service has become heavily used across the organization. This led to some domains running at 100% capacity during peak times, while others began to run low on storage space. Because of this increased usage, the technical teams were in danger of missing their service level agreements. They contacted me for help.
This post shows how you can set up automated alarms to warn when domains need attention.
Solution overview
Amazon ES is a fully managed service that delivers Elasticsearch’s easy-to-use APIs and real-time analytics capabilities along with the availability, scalability, and security that production workloads require. The service offers built-in integrations with a number of other components and AWS services, enabling customers to go from raw data to actionable insights quickly and securely.
One of these other integrated services is Amazon CloudWatch. CloudWatch is a monitoring service for AWS Cloud resources and the applications that you run on AWS. You can use CloudWatch to collect and track metrics, collect and monitor log files, set alarms, and automatically react to changes in your AWS resources.
CloudWatch collects metrics for Amazon ES. You can use these metrics to monitor the state of your Amazon ES domains, and set alarms to notify you about high utilization of system resources. For more information, see Amazon Elasticsearch Service Metrics and Dimensions.
While the metrics are automatically collected, the missing piece is how to set alarms on these metrics at appropriate levels for each of your domains. This post includes sample Python code to evaluate the current state of your Amazon ES environment, and to set up alarms according to AWS recommendations and best practices.
There are two components to the sample solution:
es-check-cwalarms.py: This Python script checks the CloudWatch alarms that have been set, for all Amazon ES domains in a given account and region.
es-create-cwalarms.py: This Python script sets up a set of CloudWatch alarms for a single given domain.
The sample code can also be found in the amazon-es-check-cw-alarms GitHub repo. The scripts are easy to extend or combine, as described in the section “Extensions and Adaptations”.
Assessing the current state
The first script, es-check-cwalarms.py, is used to give an overview of the configurations and alarm settings for all the Amazon ES domains in the given region. The script takes the following parameters:
python es-checkcwalarms.py -h
usage: es-checkcwalarms.py [-h] [-e ESPREFIX] [-n NOTIFY] [-f FREE] [-p PROFILE] [-r REGION]
Checks a set of recommended CloudWatch alarms for Amazon Elasticsearch Service domains (optionally, those beginning with a given prefix).
optional arguments:
-h, --help            show this help message and exit
-e ESPREFIX, --esprefix ESPREFIX    Only check Amazon Elasticsearch Service domains that begin with this prefix.
-n NOTIFY, --notify NOTIFY    List of CloudWatch alarm actions; e.g. ['arn:aws:sns:xxxx']
-f FREE, --free FREE    Minimum free storage (MB) on which to alarm
-p PROFILE, --profile PROFILE    IAM profile name to use
-r REGION, --region REGION    AWS region for the domain. Default: us-east-1
The script first identifies all the domains in the given region (or, optionally, limits them to the subset that begins with a given prefix). It then starts running a set of checks against each one.
The script can be run from the command line or set up as a scheduled Lambda function. For example, for one customer, it was deemed appropriate to regularly run the script to check that alarms were correctly set for all domains. In addition, because configuration changes—cluster size increases to accommodate larger workloads being a common change—might require updates to alarms, this approach allowed the automatic identification of alarms no longer appropriately set as the domain configurations changed.
The following is the output for one domain in my account.
Starting checks for Elasticsearch domain iotfleet , version is 53
Iotfleet Automated snapshot hour (UTC): 0
Iotfleet Instance configuration: 1 instances; type:m3.medium.elasticsearch
Iotfleet Instance storage definition is: 4 GB; free storage calced to: 819.2 MB
iotfleet Desired free storage set to (in MB): 819.2
iotfleet WARNING: Not using VPC Endpoint
iotfleet WARNING: Does not have Zone Awareness enabled
iotfleet WARNING: Instance count is ODD. Best practice is for an even number of data nodes and zone awareness.
iotfleet WARNING: Does not have Dedicated Masters.
iotfleet WARNING: Neither index nor search slow logs are enabled.
iotfleet WARNING: EBS not in use. Using instance storage only.
iotfleet Alarm ok; definition matches. Test-Elasticsearch-iotfleet-ClusterStatus.yellow-Alarm ClusterStatus.yellow
iotfleet Alarm ok; definition matches. Test-Elasticsearch-iotfleet-ClusterStatus.red-Alarm ClusterStatus.red
iotfleet Alarm ok; definition matches. Test-Elasticsearch-iotfleet-CPUUtilization-Alarm CPUUtilization
iotfleet Alarm ok; definition matches. Test-Elasticsearch-iotfleet-JVMMemoryPressure-Alarm JVMMemoryPressure
iotfleet WARNING: Missing alarm!! ('ClusterIndexWritesBlocked', 'Maximum', 60, 5, 'GreaterThanOrEqualToThreshold', 1.0)
iotfleet Alarm ok; definition matches. Test-Elasticsearch-iotfleet-AutomatedSnapshotFailure-Alarm AutomatedSnapshotFailure
iotfleet Alarm: Threshold does not match: Test-Elasticsearch-iotfleet-FreeStorageSpace-Alarm Should be: 819.2 ; is 3000.0
The output messages fall into the following categories:
System overview, Informational: The Amazon ES version and configuration, including instance type and number, storage, automated snapshot hour, etc.
Free storage: A calculation for the appropriate amount of free storage, based on the recommended 20% of total storage.
Warnings: best practices that are not being followed for this domain. (For more about this, read on.)
Alarms: An assessment of the CloudWatch alarms currently set for this domain, against a recommended set.
The script contains an array of recommended CloudWatch alarms, based on best practices for these metrics and statistics. Using the array allows alarm parameters (such as free space) to be updated within the code based on current domain statistics and configurations.
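The following boto3 sketch shows the general shape of such a check; the esAlarms entries and message formats here are illustrative, not the script's exact contents (the ClusterIndexWritesBlocked row matches the recommendation shown in the output above):
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

# One row per recommended alarm:
# (metric, statistic, period in seconds, evaluation periods, comparison operator, threshold)
esAlarms = [
    ("ClusterStatus.red", "Maximum", 60, 5, "GreaterThanOrEqualToThreshold", 1.0),
    ("ClusterIndexWritesBlocked", "Maximum", 60, 5, "GreaterThanOrEqualToThreshold", 1.0),
]

def check_domain_alarms(domain, account_id):
    # Amazon ES metrics are dimensioned by domain name and account (ClientId)
    dimensions = [{"Name": "DomainName", "Value": domain},
                  {"Name": "ClientId", "Value": account_id}]
    for metric, stat, period, evals, operator, threshold in esAlarms:
        found = cloudwatch.describe_alarms_for_metric(
            MetricName=metric, Namespace="AWS/ES",
            Dimensions=dimensions, Statistic=stat, Period=period)
        if not found["MetricAlarms"]:
            print(f"{domain} WARNING: Missing alarm!! {metric}")
            continue
        alarm = found["MetricAlarms"][0]
        if (alarm["Threshold"] != threshold or alarm["EvaluationPeriods"] != evals
                or alarm["ComparisonOperator"] != operator):
            print(f"{domain} Alarm: definition does not match: {alarm['AlarmName']}")
        else:
            print(f"{domain} Alarm ok; definition matches. {alarm['AlarmName']} {metric}")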
For a given domain, the script checks if each alarm has been set. If the alarm is set, it checks whether the values match those in the array esAlarms. In the output above, you can see three different situations being reported:
Alarm ok; definition matches. The alarm set for the domain matches the settings in the array.
Alarm: Threshold does not match. An alarm exists, but the threshold value at which the alarm is triggered does not match.
WARNING: Missing alarm!! The recommended alarm is missing.
All in all, the list above shows that this domain does not have a configuration that adheres to best practices, nor does it have all the recommended alarms.
Setting up alarms
Now that you know that the domains in their current state are missing critical alarms, you can correct the situation.
To demonstrate the script, set up a new domain named “ver” in us-west-2, with one node and a 10-GB EBS disk. Also create an SNS topic named “sendnotification” in us-west-2 that sends you an email.
Run the second script, es-create-cwalarms.py, from the command line. This script creates (or updates) the desired CloudWatch alarms for the specified Amazon ES domain, “ver”.
python es-create-cwalarms.py -r us-west-2 -e test -c ver -n "['arn:aws:sns:us-west-2:xxxxxxxxxx:sendnotification']"
EBS enabled: True type: gp2 size (GB): 10 No Iops 10240 total storage (MB)
Desired free storage set to (in MB): 2048.0
Creating Test-Elasticsearch-ver-ClusterStatus.yellow-Alarm
Creating Test-Elasticsearch-ver-ClusterStatus.red-Alarm
Creating Test-Elasticsearch-ver-CPUUtilization-Alarm
Creating Test-Elasticsearch-ver-JVMMemoryPressure-Alarm
Creating Test-Elasticsearch-ver-FreeStorageSpace-Alarm
Creating Test-Elasticsearch-ver-ClusterIndexWritesBlocked-Alarm
Creating Test-Elasticsearch-ver-AutomatedSnapshotFailure-Alarm
Successfully finished creating alarms!
As with the first script, this script contains an array of recommended CloudWatch alarms, based on best practices for these metrics and statistics. This approach allows you to add or modify alarms based on your use case (more on that below).
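For reference, creating one of these alarms with boto3 looks roughly like the following sketch; the statistic, period, and evaluation settings are assumptions rather than the script's exact values, while the alarm name, threshold, and SNS topic mirror the output above:
import boto3

cloudwatch = boto3.client("cloudwatch", region_name="us-west-2")

cloudwatch.put_metric_alarm(
    AlarmName="Test-Elasticsearch-ver-FreeStorageSpace-Alarm",
    Namespace="AWS/ES",
    MetricName="FreeStorageSpace",
    Dimensions=[{"Name": "DomainName", "Value": "ver"},
                {"Name": "ClientId", "Value": "xxxxxxxxxx"}],  # your AWS account ID
    Statistic="Minimum",
    Period=60,
    EvaluationPeriods=5,
    ComparisonOperator="LessThanOrEqualToThreshold",
    Threshold=2048.0,  # 20% of the 10-GB volume, in MB
    AlarmActions=["arn:aws:sns:us-west-2:xxxxxxxxxx:sendnotification"],
)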
After running the script, navigate to Alarms on the CloudWatch console. You can see the set of alarms set up on your domain.
Because the “ver” domain has only a single node, cluster status is yellow, and that alarm is in an “ALARM” state. It’s already sent a notification that the alarm has been triggered.
In most cases, the alarm triggers due to an increased workload. The likely action is to reconfigure the system to handle the increased workload, rather than reducing the incoming workload. Reconfiguring any backend store—a category of systems that includes Elasticsearch—is best performed when the system is quiescent or lightly loaded. Reconfigurations such as setting zone awareness or modifying the disk type cause Amazon ES to enter a “processing” state, potentially disrupting client access.
Other changes, such as increasing the number of data nodes, may cause Elasticsearch to begin moving shards, potentially impacting search performance on those shards while this happens. These actions should be considered in the context of your production usage. For the same reason, I also do not recommend running a script that resets all domains to match best practices.
Avoid the need to reconfigure during heavy workload by setting alarms at a level that allows a considered approach to making the needed changes. For example, if you identify that each weekly peak is increasing, you can reconfigure during a weekly quiet period.
While Elasticsearch can be reconfigured without being quiesced, it is not a best practice to automatically scale it up and down based on usage patterns. Unlike some other AWS services, I recommend against setting a CloudWatch action that automatically reconfigures the system when alarms are triggered.
There are other situations where the planned reconfiguration approach may not work, such as low or zero free disk space causing the domain to reject writes. If the business is dependent on the domain continuing to accept incoming writes and deleting data is not an option, the team may choose to reconfigure immediately.
Extensions and adaptations
You may wish to modify the best practices encoded in the scripts for your own environment or workloads. It’s always better to avoid situations where alerts are generated but routinely ignored. All alerts should trigger a review and one or more actions, either immediately or at a planned date. The following is a list of common situations where you may wish to set different alarms for different domains:
Dev/test vs. production You may have a different set of configuration rules and alarms for your dev and test environments than for production. For example, you may require zone awareness and dedicated masters for your production environment, but not for your development domains. Or, you may not have any alarms set in dev. For test environments that mirror your potential peak load, test to ensure that the alarms are appropriately triggered.
Differing workloads or SLAs for different domains You may have one domain with a requirement for superfast search performance, and another domain with a heavy ingest load that tolerates slower search response. Your reaction to slow response for these two workloads is likely to be different, so perhaps the thresholds for these two domains should be set at a different level. In this case, you might add a “max CPU utilization” alarm at 100% for 1 minute for the fast search domain, while the other domain only triggers an alarm when the average has been higher than 60% for 5 minutes. You might also add a “free space” rule with a higher threshold to reflect the need for more space for the heavy ingest load if there is danger that it could fill the available disk quickly.
“Normal” alarms versus “emergency” alarms If, for example, free disk space drops to 25% of total capacity, an alarm is triggered that indicates action should be taken as soon as possible, such as cleaning up old indexes or reconfiguring at the next quiet period for this domain. However, if free space drops below a critical level (20% free space), action must be taken immediately in order to prevent Amazon ES from setting the domain to read-only. Similarly, if the “ClusterIndexWritesBlocked” alarm triggers, the domain has already stopped accepting writes, so immediate action is needed. In this case, you may wish to set “laddered” alarms, where one threshold causes an alarm to be triggered to review the current workload for a planned reconfiguration, but a different threshold raises a “DefCon 3” alarm that immediate action is required.
The sample scripts provided here are a starting point, intended for you to adapt to your own environment and needs.
Running the scripts one time can identify how far your current state is from your desired state and create an initial set of alarms. Regularly re-running the scripts can capture changes in your environment and configurations over time and adjust your alarms to match. One customer has set them up to run nightly, automatically creating and updating alarms to match their preferred settings.
Removing unwanted alarms
Each CloudWatch alarm costs approximately $0.10 per month. You can remove unwanted alarms in the CloudWatch console, under Alarms. If you set up the “ver” test domain above, remember to delete the domain to avoid continuing charges.
Conclusion
Setting CloudWatch alarms appropriately for your Amazon ES domains can help you avoid suboptimal performance and allow you to respond to workload growth or configuration issues well before they become urgent. This post gives you a starting point for doing so. The additional sleep you’ll get knowing you don’t need to be concerned about Elasticsearch domain performance will allow you to focus on building creative solutions for your business and solving problems for your customers.
Dr. Veronika Megler is a senior consultant at Amazon Web Services. She works with our customers to implement innovative big data, AI and ML projects, helping them accelerate their time-to-value when using AWS.
This is a guest post by Yukinori Koide, the head of development for the Newspass department at Gunosy.
Gunosy is a news curation application that covers a wide range of topics, such as entertainment, sports, politics, and gourmet news. The application has been installed more than 20 million times.
Gunosy aims to provide people with the content they want without the stress of dealing with a large influx of information. We analyze user attributes, such as gender and age, and past activity logs like click-through rate (CTR). We combine this information with article attributes to provide trending, personalized news articles to users.
Users need fresh and personalized news. There are two constraints to consider when delivering appropriate articles:
Time: Articles have freshness—that is, they lose value over time. New articles need to reach users as soon as possible.
Frequency (volume): Only a limited number of articles can be shown. It’s unreasonable to display all articles in the application, and users can’t read all of them anyway.
To deliver fresh articles with a high probability that the user is interested in them, it’s necessary to include not only past user activity logs and some feature values of articles, but also the most recent (real-time) user activity logs.
We optimize the delivery of articles with these two steps.
Personalization: Deliver articles based on each user’s attributes, past activity logs, and feature values of each article—to account for each user’s interests.
Trends analysis/identification: Optimize delivering articles using recent (real-time) user activity logs—to incorporate the latest trends from all users.
Optimizing the delivery of articles always starts cold: initially, we deliver articles based on past logs, and we then use real-time data to optimize as quickly as possible. In addition, news stays fresh for only a short time; day-old news is old news, and even news that is three hours old is already old news. Therefore, shortening the time between step 1 and step 2 is important.
To tackle this issue, we chose AWS for processing streaming data because of its fully managed services, cost-effectiveness, and so on.
Solution
The following diagrams depict the architecture for optimizing article delivery by processing real-time user activity logs.
There are three processing flows:
Process real-time user activity logs.
Store and process all user-based and article-based logs.
Execute ad hoc or heavy queries.
In this post, I focus on the first processing flow and explain how it works.
Process real-time user activity logs
The following are the steps for processing user activity logs in real time using Kinesis Data Streams and Kinesis Data Analytics.
The Fluentd server sends the following user activity logs to Kinesis Data Streams:
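A record in this stream might look like the following sketch; the field names are inferred from the SQL that follows, and the values (and the timestamp field) are illustrative:
{"user_id": 12345, "article_id": 98765, "action": "click", "timestamp": "2017-10-15T12:34:56+09:00"}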
b. Insert the joined source stream and application reference data source into the temporary stream.
CREATE OR REPLACE PUMP "TMP_PUMP" AS
INSERT INTO "TMP_SQL_STREAM"
SELECT STREAM
R.GENDER, R.SEGMENT_ID, S.ARTICLE_ID, S.ACTION
FROM "SOURCE_SQL_STREAM_001" S
LEFT JOIN "REFERENCE_DATA_SOURCE" R
ON S.USER_ID = R.USER_ID;
c. Define the destination stream named DESTINATION_SQL_STREAM.
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
TIME TIMESTAMP, GENDER VARCHAR(32), SEGMENT_ID INTEGER, ARTICLE_ID INTEGER,
IMPRESSION INTEGER, CLICK INTEGER
);
d. Insert the processed temporary stream, using a tumbling window, into the destination stream per minute.
CREATE OR REPLACE PUMP "STREAM_PUMP" AS
INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM
ROWTIME AS TIME,
GENDER, SEGMENT_ID, ARTICLE_ID,
SUM(CASE ACTION WHEN 'impression' THEN 1 ELSE 0 END) AS IMPRESSION,
SUM(CASE ACTION WHEN 'click' THEN 1 ELSE 0 END) AS CLICK
FROM "TMP_SQL_STREAM"
GROUP BY
GENDER, SEGMENT_ID, ARTICLE_ID,
FLOOR("TMP_SQL_STREAM".ROWTIME TO MINUTE);
Batch servers get results from Amazon ES every minute. They then optimize delivering articles with other data sources using a proprietary optimization algorithm.
How to connect a stream to another stream in another AWS Region
When we built the solution, Kinesis Data Analytics was not available in the Asia Pacific (Tokyo) Region, so we used the US West (Oregon) Region. The following shows how we connected a data stream to another data stream in the other Region.
You don't need to keep all components in a single AWS Region unless a response-time difference at the millisecond level is critical to your service.
Benefits
The solution provides benefits for both our company and our users. The benefits for the company are cost savings, including development, operational, and infrastructure costs, and reduced delivery time. Users can now find articles of interest more quickly. The solution can process more than 500,000 records per minute, and it enables fast, personalized news curation for our users.
Conclusion
In this post, I showed you how we optimize trending user activities to personalize news using Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics, and related AWS services in Gunosy.
AWS gives us a quick and economical solution and a good experience.
If you have questions or suggestions, please comment below.
Yukinori Koide is the head of development for the Newspass department at Gunosy. He is working on standardization of provisioning and deployment flow, promoting the utilization of serverless and containers for machine learning and AI services. His favorite AWS services are DynamoDB, Lambda, Kinesis, and ECS.
Akihiro Tsukada is a start-up solutions architect with AWS. He supports start-up companies in Japan technically at many levels, ranging from seed to later-stage.
Yuta Ishii is a solutions architect with AWS. He works with our customers to provide architectural guidance for building media & entertainment services, helping them improve the value of their services when using AWS.
We can't believe that there are just a few days left before re:Invent 2017. If you're attending this year, you'll want to check out our Big Data sessions! The Big Data and Machine Learning categories are bigger than ever. As in previous years, you can find these sessions in various tracks, including Analytics & Big Data, Deep Learning Summit, Artificial Intelligence & Machine Learning, Architecture, and Databases.
We have great sessions from organizations and companies like Vanguard, Cox Automotive, Pinterest, Netflix, FINRA, Amtrak, AmazonFresh, Sysco Foods, Twilio, American Heart Association, Expedia, Esri, Nextdoor, and many more. All sessions are recorded and made available on YouTube. In addition, all slide decks from the sessions will be available on SlideShare.net after the conference.
This post highlights the sessions that will be presented as part of the Analytics & Big Data track, as well as relevant sessions from other tracks like Architecture, Artificial Intelligence & Machine Learning, and IoT. If you’re interested in Machine Learning sessions, don’t forget to check out our Guide to Machine Learning at re:Invent 2017.
This year’s session catalog contains the following breakout sessions.
Raju Gulabani, VP of Database, Analytics, and AI at AWS, will discuss the evolution of database and analytics services in AWS, the new database and analytics services and features we launched this year, and our vision for continued innovation in this space. We are witnessing unprecedented growth in the amount of data collected, in many different forms. Storage, management, and analysis of this data require database services that scale and perform in ways not possible before. AWS offers a collection of database and other data services—including Amazon Aurora, Amazon DynamoDB, Amazon RDS, Amazon Redshift, Amazon ElastiCache, Amazon Kinesis, and Amazon EMR—to process, store, manage, and analyze data. In this session, we provide an overview of AWS database and analytics services and discuss how customers are using these services today.
Deep dive customer use cases
ABD401 – How Netflix Monitors Applications in Near Real-Time with Amazon Kinesis Thousands of services work in concert to deliver millions of hours of video streams to Netflix customers every day. These applications vary in size, function, and technology, but they all make use of the Netflix network to communicate. Understanding the interactions between these services is a daunting challenge both because of the sheer volume of traffic and the dynamic nature of deployments. In this session, we first discuss why Netflix chose Kinesis Streams to address these challenges at scale. We then dive deep into how Netflix uses Kinesis Streams to enrich network traffic logs and identify usage patterns in real time. Lastly, we cover how Netflix uses this system to build comprehensive dependency maps, increase network efficiency, and improve failure resiliency. From this session, you will learn how to build a real-time application monitoring system using network traffic logs and get real-time, actionable insights.
In this session, learn how Nextdoor replaced their home-grown data pipeline based on a topology of Flume nodes with a completely serverless architecture based on Kinesis and Lambda. By making these changes, they improved both the reliability of their data and the delivery times of billions of records of data to their Amazon S3–based data lake and Amazon Redshift cluster. Nextdoor is a private social networking service for neighborhoods.
ABD205 – Taking a Page Out of Ivy Tech’s Book: Using Data for Student Success Data speaks. Discover how Ivy Tech, the nation’s largest singly accredited community college, uses AWS to gather, analyze, and take action on student behavioral data for the betterment of over 3,100 students. This session outlines the process from inception to implementation across the state of Indiana and highlights how Ivy Tech’s model can be applied to your own complex business problems.
ABD207 – Leveraging AWS to Fight Financial Crime and Protect National Security Banks aren’t known to share data and collaborate with one another. But that is exactly what the Mid-Sized Bank Coalition of America (MBCA) is doing to fight digital financial crime—and protect national security. Using the AWS Cloud, the MBCA developed a shared data analytics utility that processes terabytes of non-competitive customer account, transaction, and government risk data. The intelligence produced from the data helps banks increase the efficiency of their operations, cut labor and operating costs, and reduce false positive volumes. The collective intelligence also allows greater enforcement of Anti-Money Laundering (AML) regulations by helping members detect internal risks—and identify the challenges to detecting these risks in the first place. This session demonstrates how the AWS Cloud supports the MBCA to deliver advanced data analytics, provide consistent operating models across financial institutions, reduce costs, and strengthen national security.
ABD208 – Cox Automotive Empowered to Scale with Splunk Cloud & AWS and Explores New Innovation with Amazon Kinesis Firehose In this session, learn how Cox Automotive is using Splunk Cloud for real time visibility into its AWS and hybrid environments to achieve near instantaneous MTTI, reduce auction incidents by 90%, and proactively predict outages. We also introduce a highly anticipated capability that allows you to ingest, transform, and analyze data in real time using Splunk and Amazon Kinesis Firehose to gain valuable insights from your cloud resources. It’s now quicker and easier than ever to gain access to analytics-driven infrastructure monitoring using Splunk Enterprise & Splunk Cloud.
ABD209 – Accelerating the Speed of Innovation with a Data Sciences Data & Analytics Hub at Takeda Historically, silos of data, analytics, and processes across functions, stages of development, and geography created a barrier to R&D efficiency. Gathering the right data necessary for decision-making was challenging due to issues of accessibility, trust, and timeliness. In this session, learn how Takeda is undergoing a transformation in R&D to increase the speed-to-market of high-impact therapies to improve patient lives. The Data and Analytics Hub was built, with Deloitte, to address these issues and support the efficient generation of data insights for functions such as clinical operations, clinical development, medical affairs, portfolio management, and R&D finance. In the AWS hosted data lake, this data is processed, integrated, and made available to business end users through data visualization interfaces, and to data scientists through direct connectivity. Learn how Takeda has achieved significant time reductions—from weeks to minutes—to gather and provision data that has the potential to reduce cycle times in drug development. The hub also enables more efficient operations and alignment to achieve product goals through cross functional team accountability and collaboration due to the ability to access the same cross domain data.
ABD210 – Modernizing Amtrak: Serverless Solution for Real-Time Data Capabilities As the nation’s only high-speed intercity passenger rail provider, Amtrak needs to know critical information to run their business such as: Who’s onboard any train at any time? How are booking and revenue trending? Amtrak was faced with unpredictable and often slow response times from existing databases, ranging from seconds to hours; existing booking and revenue dashboards were spreadsheet-based and manual; multiple copies of data were stored in different repositories, lacking integration and consistency; and operations and maintenance (O&M) costs were relatively high. Join us as we demonstrate how Deloitte and Amtrak successfully went live with a cloud-native operational database and analytical datamart for near-real-time reporting in under six months. We highlight the specific challenges and the modernization of architecture on an AWS-native Platform as a Service (PaaS) solution. The solution includes cloud-native components such as AWS Lambda for microservices, Amazon Kinesis and AWS Data Pipeline for moving data, Amazon S3 for storage, Amazon DynamoDB for a managed NoSQL database service, and Amazon Redshift for near-real-time reports and dashboards. Deloitte’s solution enabled “at scale” processing of 1 million transactions/day and up to 2K transactions/minute. It provided flexibility and scalability, largely eliminated the need for system management, and dramatically reduced operating costs. Moreover, it laid the groundwork for decommissioning legacy systems, anticipated to save at least $1M over 3 years.
ABD211 – Sysco Foods: A Journey from Too Much Data to Curated Insights In this session, we detail Sysco’s journey from a company focused on hindsight-based reporting to one focused on insights and foresight. For this shift, Sysco moved from multiple data warehouses to an AWS ecosystem, including Amazon Redshift, Amazon EMR, AWS Data Pipeline, and more. As the team at Sysco worked with Tableau, they gained agile insight across their business. Learn how Sysco decided to use AWS, how they scaled, and how they became more strategic with the AWS ecosystem and Tableau.
ABD217 – From Batch to Streaming: How Amazon Flex Uses Real-time Analytics to Deliver Packages on Time Reducing the time to get actionable insights from data is important to all businesses, and customers who employ batch data analytics tools are exploring the benefits of streaming analytics. Learn best practices to extend your architecture from data warehouses and databases to real-time solutions. Learn how to use Amazon Kinesis to get real-time data insights and integrate them with Amazon Aurora, Amazon RDS, Amazon Redshift, and Amazon S3. The Amazon Flex team describes how they used streaming analytics in their Amazon Flex mobile app, used by Amazon delivery drivers to deliver millions of packages each month on time. They discuss the architecture that enabled the move from a batch processing system to a real-time system, overcoming the challenges of migrating existing batch data to streaming data, and how to benefit from real-time analytics.
ABD218 – How EuroLeague Basketball Uses IoT Analytics to Engage Fans IoT and big data have made their way out of industrial applications, general automation, and consumer goods, and are now a valuable tool for improving consumer engagement across a number of industries, including media, entertainment, and sports. The low cost and ease of implementation of AWS analytics services and AWS IoT have allowed AGT, a leader in IoT, to develop their IoTA analytics platform. Using IoTA, AGT brought a tailored solution to EuroLeague Basketball for real-time content production and fan engagement during the 2017-18 season. In this session, we take a deep dive into how this solution is architected for secure, scalable, and highly performant data collection from athletes, coaches, and fans. We also talk about how the data is transformed into insights and integrated into a content generation pipeline. Lastly, we demonstrate how this solution can be easily adapted for other industries and applications.
ABD222 – How to Confidently Unleash Data to Meet the Needs of Your Entire Organization Where are you on the spectrum of IT leaders? Are you confident that you’re providing the technology and solutions that consistently meet or exceed the needs of your internal customers? Do your peers at the executive table see you as an innovative technology leader? Innovative IT leaders understand the value of getting data and analytics directly into the hands of decision makers, and into their own. In this session, Daren Thayne, Domo’s Chief Technology Officer, shares how innovative IT leaders are helping drive a culture change at their organizations. See how transformative it can be to have real-time access to all of the data that is relevant to YOUR job (including a complete view of your entire AWS environment), as well as understand how it can help you lead the way in applying that same pattern throughout your entire company.
ABD303 – Developing an Insights Platform – Sysco’s Journey from Disparate Systems to Data Lake and Beyond Sysco has nearly 200 operating companies across its multiple lines of business throughout the United States, Canada, Central/South America, and Europe. As the global leader in food services, Sysco identified the need to streamline the collection, transformation, and presentation of data produced by the distributed units and systems into a central data ecosystem. Sysco’s Business Intelligence and Analytics team addressed these requirements by creating a data lake with scalable analytics and query engines leveraging AWS services. In this session, Sysco will outline their journey from a hindsight-reporting-focused company to an insights-driven organization. They will cover solution architecture, challenges, and lessons learned from deploying a self-service insights platform. They will also walk through the design patterns they used and how they designed the solution to provide predictive analytics using Amazon Redshift Spectrum, Amazon S3, Amazon EMR, AWS Glue, Amazon Elasticsearch Service, and other AWS services.
ABD309 – How Twilio Scaled Its Data-Driven Culture As a leading cloud communications platform, Twilio has always been strongly data-driven. But as headcount and data volumes grew—and grew quickly—they faced many new challenges. One-off, static reports work when you’re a small startup, but how do you support a growth stage company to a successful IPO and beyond? Today, Twilio’s data team relies on AWS and Looker to provide data access to 700 colleagues. Departments have the data they need to make decisions, and cloud-based scale means they get answers fast. Data delivers real business value at Twilio, providing a 360-degree view of their customer, product, and business. In this session, you hear firsthand stories directly from the Twilio data team and learn real-world tips for fostering a truly data-driven culture at scale.
ABD310 – How FINRA Secures Its Big Data and Data Science Platform on AWS FINRA uses big data and data science technologies to detect fraud, market manipulation, and insider trading across US capital markets. As a financial regulator, FINRA analyzes highly sensitive data, so information security is critical. Learn how FINRA secures its Amazon S3 Data Lake and its data science platform on Amazon EMR and Amazon Redshift, while empowering data scientists with tools they need to be effective. In addition, FINRA shares AWS security best practices, covering topics such as AMI updates, micro segmentation, encryption, key management, logging, identity and access management, and compliance.
ABD331 – Log Analytics at Expedia Using Amazon Elasticsearch Service Expedia uses Amazon Elasticsearch Service (Amazon ES) for a variety of mission-critical use cases, ranging from log aggregation to application monitoring and pricing optimization. In this session, the Expedia team reviews how they use Amazon ES and Kibana to analyze and visualize Docker startup logs, AWS CloudTrail data, and application metrics. They share best practices for architecting a scalable, secure log analytics solution using Amazon ES, so you can add new data sources almost effortlessly and get insights quickly.
ABD316 – American Heart Association: Finding Cures to Heart Disease Through the Power of Technology Combining disparate datasets and making them accessible to data scientists and researchers is a prevalent challenge for many organizations, not just in healthcare research. The American Heart Association (AHA) has built a data science platform using Amazon EMR, Amazon Elasticsearch Service, and other AWS services that corrals multiple datasets and enables advanced research on phenotype and genotype datasets, aimed at curing heart diseases. In this session, we present how AHA built this platform and the key challenges they addressed with the solution. We also provide a demo of the platform, and leave you with suggestions and next steps so you can build similar solutions for your use cases.
ABD319 – Tooling Up for Efficiency: DIY Solutions @ Netflix At Netflix, we have traditionally approached cloud efficiency from a human standpoint, whether it be in-person meetings with the largest service teams or manually flipping reservations. Over time, we realized that these manual processes are not scalable as the business continues to grow. Therefore, in the past year, we have focused on building out tools that allow us to make more insightful, data-driven decisions around capacity and efficiency. In this session, we discuss the DIY applications, dashboards, and processes we built to help with capacity and efficiency. We start at the ten thousand foot view to understand the unique business and cloud problems that drove us to create these products, and discuss implementation details, including the challenges encountered along the way. Tools discussed include Picsou, the successor to our AWS billing file cost analyzer; Libra, an easy-to-use reservation conversion application; and cost and efficiency dashboards that relay useful financial context to 50+ engineering teams and managers.
ABD312 – Deep Dive: Migrating Big Data Workloads to AWS Customers are migrating their analytics, data processing (ETL), and data science workloads running on Apache Hadoop, Spark, and data warehouse appliances from on-premises deployments to AWS in order to save costs, increase availability, and improve performance. AWS offers a broad set of analytics services, including solutions for batch processing, stream processing, machine learning, data workflow orchestration, and data warehousing. This session will focus on identifying the components and workflows in your current environment and providing best practices to migrate these workloads to the right AWS data analytics product. We will cover services such as Amazon EMR, Amazon Athena, Amazon Redshift, Amazon Kinesis, and more. We will also feature Vanguard, an American investment management company based in Malvern, Pennsylvania with over $4.4 trillion in assets under management. Ritesh Shah, Sr. Program Manager for Cloud Analytics Program at Vanguard, will describe how they orchestrated their migration to AWS analytics services, including Hadoop and Spark workloads to Amazon EMR. Ritesh will highlight the technical challenges they faced and overcame along the way, as well as share common recommendations and tuning tips to accelerate the time to production.
ABD402 – How Esri Optimizes Massive Image Archives for Analytics in the Cloud Petabyte-scale archives of satellite, plane, and drone imagery continue to grow exponentially. They mostly exist as semi-structured data, but they are only valuable when accessed and processed by a wide range of products for both visualization and analysis. This session provides an overview of how ArcGIS indexes and structures data so that any part of it can be quickly accessed, processed, and analyzed by reading only the minimum amount of data needed for the task. In this session, we share best practices for structuring and compressing massive datasets in Amazon S3, so they can be analyzed efficiently. We also review a number of different image formats, including GeoTIFF (used for the Public Datasets on AWS program, Landsat on AWS), cloud-optimized GeoTIFF, MRF, and CRF, as well as different compression approaches to show the effect on processing performance. Finally, we provide examples of how this technology has been used to help image processing and analysis for the response to Hurricane Harvey.
ABD329 – A Look Under the Hood – How Amazon.com Uses AWS Services for Analytics at Massive Scale Amazon’s consumer business continues to grow, and so does the volume of data and the number and complexity of the analytics done in support of the business. In this session, we talk about how Amazon.com uses AWS technologies to build a scalable environment for data and analytics. We look at how Amazon is evolving the world of data warehousing with a combination of a data lake and parallel, scalable compute engines such as Amazon EMR and Amazon Redshift.
ABD327 – Migrating Your Traditional Data Warehouse to a Modern Data Lake In this session, we discuss the latest features of Amazon Redshift and Redshift Spectrum, and take a deep dive into their architecture and inner workings. We share many of the recent availability, performance, and management enhancements and how they improve your end-user experience. You also hear from 21st Century Fox, who presents a case study of their fast migration from an on-premises data warehouse to Amazon Redshift. Learn how they are expanding their data warehouse to a data lake that encompasses multiple data sources and data formats. This architecture helps them tie together siloed business units and get actionable 360-degree insights across their consumer base.
MCL202 – Ally Bank & Cognizant: Transforming Customer Experience Using Amazon Alexa Given the increasing popularity of natural language interfaces such as Voice as User technology or conversational artificial intelligence (AI), Ally® Bank was looking to interact with customers by enabling direct transactions through conversation or voice. They also needed to develop a capability that allows third parties to connect to the bank securely for information sharing and exchange, using OAuth, an authentication protocol seen as the future of secure banking technology. Cognizant’s Architecture team partnered with Ally Bank’s Enterprise Architecture group and identified the right product for OAuth integration with Amazon Alexa and third-party technologies. In this session, we discuss how building products with conversational AI helps Ally Bank offer an innovative customer experience; increase retention through improved data-driven personalization; increase the efficiency and convenience of customer service; and gain deep insights into customer needs through data analysis and predictive analytics to offer new products and services.
MCL317 – Orchestrating Machine Learning Training for Netflix Recommendations At Netflix, we use machine learning (ML) algorithms extensively to recommend relevant titles to our 100+ million members based on their tastes. Everything on the member home page is an evidence-driven, A/B-tested experience that we roll out backed by ML models. These models are trained using Meson, our workflow orchestration system. Meson distinguishes itself from other workflow engines by handling more sophisticated execution graphs, such as loops and parameterized fan-outs. Meson can schedule Spark jobs, Docker containers, bash scripts, gists of Scala code, and more. Meson also provides a rich visual interface for monitoring active workflows and inspecting execution logs. It has a powerful Scala DSL for authoring workflows as well as a REST API. In this session, we focus on how Meson trains recommendation ML models in production, and how we have re-architected it to scale up for a growing need for broad ETL applications within Netflix. As a driver for this change, we have had to evolve the persistence layer for Meson. We talk about how we migrated from Cassandra to Amazon RDS backed by Amazon Aurora.
MCL350 – Humans vs. the Machines: How Pinterest Uses Amazon Mechanical Turk’s Worker Community to Improve Machine Learning Ever since the term “crowdsourcing” was coined in 2006, it’s been a buzzword for technology companies and social institutions. In the technology sector, crowdsourcing is instrumental for verifying machine learning algorithms, which, in turn, improves the user’s experience. In this session, we explore how Pinterest adapted to an increased reliance on human evaluation to improve their product, with a focus on how they’ve integrated with Mechanical Turk’s platform. This presentation is aimed at engineers, analysts, program managers, and product managers who are interested in how companies rely on Mechanical Turk’s human evaluation platform to better understand content and improve machine learning algorithms. The discussion focuses on the analysis and product decisions related to building a high-quality crowdsourcing system that takes advantage of Mechanical Turk’s powerful worker community.
ABD201 – Big Data Architectural Patterns and Best Practices on AWS In this session, we simplify big data processing as a data bus comprising various stages: collect, store, process, analyze, and visualize. Next, we discuss how to choose the right technology in each stage based on criteria such as data structure, query latency, cost, request rate, item size, data volume, durability, and so on. Finally, we provide reference architectures, design patterns, and best practices for assembling these technologies to solve your big data problems at the right cost.
ABD202 – Best Practices for Building Serverless Big Data Applications Serverless technologies let you build and scale applications and services rapidly without the need to provision or manage servers. In this session, we show you how to incorporate serverless concepts into your big data architectures. We explore the concepts behind and benefits of serverless architectures for big data, looking at design patterns to ingest, store, process, and visualize your data. Along the way, we explain when and how you can use serverless technologies to streamline data processing, minimize infrastructure management, and improve agility and robustness, and we share a reference architecture using a combination of cloud and open source technologies to solve your big data problems. Topics include: use cases and best practices for serverless big data applications; leveraging AWS technologies such as Amazon DynamoDB, Amazon S3, Amazon Kinesis, AWS Lambda, Amazon Athena, and Amazon EMR; and serverless ETL, event processing, ad hoc analysis, and real-time analytics.
ABD206 – Building Visualizations and Dashboards with Amazon QuickSight Just as a picture is worth a thousand words, a visual is worth a thousand data points. A key aspect of our ability to gain insights from our data is to look for patterns, and these patterns are often not evident when we simply look at data in tables. The right visualization will help you gain a deeper understanding in a much quicker timeframe. In this session, we will show you how to quickly and easily visualize your data using Amazon QuickSight. We will show you how you can connect to data sources, generate custom metrics and calculations, create comprehensive business dashboards with various chart types, and set up filters and drill-downs to slice and dice the data.
ABD203 – Real-Time Streaming Applications on AWS: Use Cases and Patterns To win in the marketplace and provide differentiated customer experiences, businesses need to be able to use live data in real time to facilitate fast decision making. In this session, you learn common streaming data processing use cases and architectures. First, we give an overview of streaming data and AWS streaming data capabilities. Next, we look at a few customer examples and their real-time streaming applications. Finally, we walk through common architectures and design patterns of top streaming data use cases.
ABD213 – How to Build a Data Lake with AWS Glue Data Catalog As data volumes grow and customers store more data on AWS, they often have valuable data that is not easily discoverable and available for analytics. The AWS Glue Data Catalog provides a central view of your data lake, making data readily available for analytics. We introduce key features of the AWS Glue Data Catalog and its use cases. Learn how crawlers can automatically discover your data, extract relevant metadata, and add it as table definitions to the AWS Glue Data Catalog. We will also explore the integration between AWS Glue Data Catalog and Amazon Athena, Amazon EMR, and Amazon Redshift Spectrum.
ABD214 – Real-time User Insights for Mobile and Web Applications with Amazon Pinpoint With customers demanding relevant and real-time experiences across a range of devices, digital businesses are looking to gather user data at scale, understand this data, and respond to customer needs instantly. This requires tools that can record large volumes of user data in a structured fashion, and then instantly make this data available to generate insights. In this session, we demonstrate how you can use Amazon Pinpoint to capture user data in a structured yet flexible manner. Further, we demonstrate how this data can be set up for instant consumption using services like Amazon Kinesis Firehose and Amazon Redshift. We walk through example data based on real world scenarios, to illustrate how Amazon Pinpoint lets you easily organize millions of events, record them in real-time, and store them for further analysis.
ABD223 – IT Innovators: New Technology for Leveraging Data to Enable Agility, Innovation, and Business Optimization Companies of all sizes are looking for technology to efficiently leverage data and their existing IT investments to stay competitive and understand where to find new growth. Regardless of where companies are in their data-driven journey, they face greater demands for information by customers, prospects, partners, vendors and employees. All stakeholders inside and outside the organization want information on-demand or in “real time”, available anywhere on any device. They want to use it to optimize business outcomes without having to rely on complex software tools or human gatekeepers to relevant information. Learn how IT innovators at companies such as MasterCard, Jefferson Health, and TELUS are using Domo’s Business Cloud to help their organizations more effectively leverage data at scale.
ABD301 – Analyzing Streaming Data in Real Time with Amazon Kinesis Amazon Kinesis makes it easy to collect, process, and analyze real-time, streaming data so you can get timely insights and react quickly to new information. In this session, we present an end-to-end streaming data solution using Kinesis Streams for data ingestion, Kinesis Analytics for real-time processing, and Kinesis Firehose for persistence. We review in detail how to write SQL queries using streaming data and discuss best practices to optimize and monitor your Kinesis Analytics applications. Lastly, we discuss how to estimate the cost of the entire system.
ABD302 – Real-Time Data Exploration and Analytics with Amazon Elasticsearch Service and Kibana In this session, we use Apache web logs as an example and show you how to build an end-to-end analytics solution. First, we cover how to configure an Amazon ES cluster and ingest data using Amazon Kinesis Firehose. We look at best practices for choosing instance types, storage options, shard counts, and index rotations based on the throughput of incoming data. Then we demonstrate how to set up a Kibana dashboard and build custom dashboard widgets. Finally, we review approaches for generating custom, ad-hoc reports.
ABD304 – Best Practices for Data Warehousing with Amazon Redshift & Redshift Spectrum Most companies are overrun with data, yet they lack critical insights to make timely and accurate business decisions. They are missing the opportunity to combine large amounts of new, unstructured big data that resides outside their data warehouse with trusted, structured data inside their data warehouse. In this session, we take an in-depth look at how modern data warehousing blends and analyzes all your data, inside and outside your data warehouse without moving the data, to give you deeper insights to run your business. We will cover best practices on how to design optimal schemas, load data efficiently, and optimize your queries to deliver high throughput and performance.
ABD305 – Design Patterns and Best Practices for Data Analytics with Amazon EMR Amazon EMR is one of the largest Hadoop operators in the world, enabling customers to run ETL, machine learning, real-time processing, data science, and low-latency SQL at petabyte scale. In this session, we introduce you to Amazon EMR design patterns such as using Amazon S3 instead of HDFS, taking advantage of both long and short-lived clusters, and other Amazon EMR architectural best practices. We talk about lowering cost with Auto Scaling and Spot Instances, and security best practices for encryption and fine-grained access control. Finally, we dive into some of our recent launches to keep you current on our latest features.
ABD307 – Deep Analytics for Global AWS Marketing Organization To meet the needs of the global marketing organization, the AWS marketing analytics team built a scalable platform that allows the data science team to deliver custom econometric and machine learning models for end user self-service. To meet data security standards, we use end-to-end data encryption and different AWS services such as Amazon Redshift, Amazon RDS, Amazon S3, Amazon EMR with Apache Spark and Auto Scaling. In this session, you see real examples of how we have scaled and automated critical analysis, such as calculating the impact of marketing programs like re:Invent and prioritizing leads for our sales teams.
ABD311 – Deploying Business Analytics at Enterprise Scale with Amazon QuickSight One of the biggest tradeoffs customers usually make when deploying BI solutions at scale is agility versus governance. Large-scale BI implementations with the right governance structure can take months to design and deploy. In this session, learn how you can avoid making this tradeoff using Amazon QuickSight. Learn how to easily deploy Amazon QuickSight to thousands of users using Active Directory and Federated SSO, while securely accessing your data sources in Amazon VPCs or on-premises. We also cover how to control access to your datasets, implement row-level security, create scheduled email reports, and audit access to your data.
ABD315 – Building Serverless ETL Pipelines with AWS Glue Organizations need to gain insight and knowledge from a growing number of Internet of Things (IoT), API, clickstream, unstructured, and log data sources. However, organizations are also often limited by legacy data warehouses and ETL processes that were designed for transactional data. In this session, we introduce key ETL features of AWS Glue and cover common use cases ranging from scheduled nightly data warehouse loads to near real-time, event-driven ETL flows for your data lake. We discuss how to build scalable, efficient, and serverless ETL pipelines using AWS Glue. Additionally, Merck will share how they built an end-to-end ETL pipeline for their application release management system, and launched it in production in less than a week using AWS Glue.
ABD318 – Architecting a data lake with Amazon S3, Amazon Kinesis, and Amazon Athena Learn how to architect a data lake where different teams within your organization can publish and consume data in a self-service manner. As organizations aim to become more data-driven, data engineering teams have to build architectures that can cater to the needs of diverse users – from developers, to business analysts, to data scientists. Each of these user groups employs different tools, has different data needs, and accesses data in different ways. In this talk, we will dive deep into assembling a data lake using Amazon S3, Amazon Kinesis, Amazon Athena, Amazon EMR, and AWS Glue. The session will feature Mohit Rao, Architect and Integration lead at Atlassian, the maker of products such as JIRA, Confluence, and Stride. First, we will look at a couple of common architectures for building a data lake. Then we will show how Atlassian built a self-service data lake, where any team within the company can publish a dataset to be consumed by a broad set of users.
Companies have valuable data that they may not be analyzing due to the complexity, scalability, and performance issues of loading the data into their data warehouse. However, with the right tools, you can extend your analytics to query data in your data lake—with no loading required. Amazon Redshift Spectrum extends the analytic power of Amazon Redshift beyond data stored in your data warehouse to run SQL queries directly against vast amounts of unstructured data in your Amazon S3 data lake. This gives you the freedom to store your data where you want, in the format you want, and have it available for analytics when you need it. Join a discussion with AWS solution architects to ask questions.
ABD330 – Combining Batch and Stream Processing to Get the Best of Both Worlds Today, many architects and developers are looking to build solutions that integrate batch and real-time data processing, and deliver the best of both approaches. Lambda architecture (not to be confused with the AWS Lambda service) is a design pattern that leverages both batch and real-time processing within a single solution to meet the latency, accuracy, and throughput requirements of big data use cases. Come join us for a discussion on how to implement Lambda architecture (batch, speed, and serving layers) and best practices for data processing, loading, and performance tuning.
ABD335 – Real-Time Anomaly Detection Using Amazon Kinesis Amazon Kinesis Analytics offers a built-in machine learning algorithm that you can use to easily detect anomalies in your VPC network traffic and improve security monitoring. Join us for an interactive discussion on how to stream your VPC Flow Logs to Amazon Kinesis Streams and identify anomalies using Kinesis Analytics.
ABD339 – Deep Dive and Best Practices for Amazon Athena Amazon Athena is an interactive query service that enables you to process data directly from Amazon S3 without the need for infrastructure. Since its launch at re:Invent 2016, several organizations have adopted Athena as the central tool to process all their data. In this talk, we dive deep into the most common use cases, including working with other AWS services. We review the best practices for creating tables and partitions and performance optimizations. We also dive into how Athena handles security, authorization, and authentication. Lastly, we hear from a customer who has reduced costs and improved time to market by deploying Athena across their organization.
We look forward to meeting you at re:Invent 2017!
About the Author
Roy Ben-Alta is a solution architect and principal business development manager at Amazon Web Services in New York. He focuses on Data Analytics and ML Technologies, working with AWS customers to build innovative data-driven products.
Starting today, you can connect to your Amazon Elasticsearch Service domains from within an Amazon VPC without the need for NAT instances or Internet gateways. VPC support for Amazon ES is easy to configure, reliable, and offers an extra layer of security. With VPC support, traffic between other services and Amazon ES stays entirely within the AWS network, isolated from the public Internet. You can manage network access using existing VPC security groups, and you can use AWS Identity and Access Management (IAM) policies for additional protection. VPC support for Amazon ES domains is available at no additional charge.
Getting Started
Creating an Amazon Elasticsearch Service domain in your VPC is easy. Follow all the steps you would normally follow to create your cluster and then select “VPC access”.
That’s it. There are no additional steps. You can now access your domain from within your VPC!
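If you prefer to script the setup, you can do the same thing through the API. The following is a minimal boto3 sketch of creating a VPC-enabled domain; the domain name, version, instance type, subnet IDs, and security group IDs are placeholders that you would replace with your own values.
# Hedged sketch: create a VPC-enabled Amazon ES domain with boto3.
# All names, versions, and IDs below are placeholders, not values from this post.
import boto3
es = boto3.client("es")
response = es.create_elasticsearch_domain(
    DomainName="my-vpc-domain",                      # hypothetical domain name
    ElasticsearchVersion="6.7",                      # pick the version you need
    ElasticsearchClusterConfig={
        "InstanceType": "m4.large.elasticsearch",
        "InstanceCount": 2,
        "ZoneAwarenessEnabled": True,                # spreads nodes across two subnets
    },
    EBSOptions={"EBSEnabled": True, "VolumeType": "gp2", "VolumeSize": 10},
    VPCOptions={
        "SubnetIds": ["subnet-aaaa1111", "subnet-bbbb2222"],   # placeholder subnets
        "SecurityGroupIds": ["sg-cccc3333"],                   # placeholder security group
    },
)
print(response["DomainStatus"]["ARN"])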
Things To Know
To support VPCs, Amazon ES places an endpoint into at least one subnet of your VPC. Amazon ES places an Elastic Network Interface (ENI) into the VPC for each data node in the cluster. Each ENI uses a private IP address from the IPv4 range of your subnet and receives a public DNS hostname. If you enable zone awareness, Amazon ES creates endpoints in two subnets in different Availability Zones, which provides greater data durability.
You need to set aside three times as many IP addresses as the number of nodes in your cluster. You can divide that number by two if zone awareness is enabled; for example, a six-node cluster needs 18 reserved IP addresses, or nine with zone awareness. Ideally, you would create separate subnets just for Amazon ES.
A few notes:
Currently, you cannot move existing domains to a VPC or vice-versa. To take advantage of VPC support, you must create a new domain and migrate your data.
Currently, Amazon ES does not support Amazon Kinesis Firehose integration for domains inside a VPC.
Nowadays, streaming data is seen and used everywhere—from social networks, to mobile and web applications, IoT devices, instrumentation in data centers, and many other sources. As the speed and volume of this type of data increases, the need to perform data analysis in real time with machine learning algorithms and extract a deeper understanding from the data becomes ever more important. For example, you might want a continuous monitoring system to detect sentiment changes in a social media feed so that you can react to the sentiment in near real time.
In this post, we use Amazon Kinesis Streams to collect and store streaming data. We then use Amazon Kinesis Analytics to process and analyze the streaming data continuously. Specifically, we use the Kinesis Analytics built-in RANDOM_CUT_FOREST function, a machine learning algorithm, to detect anomalies in the streaming data. Finally, we use Amazon Kinesis Firehose to export the anomalies data to Amazon Elasticsearch Service (Amazon ES). We then build a simple dashboard in the open source tool Kibana to visualize the result.
Solution overview
The following diagram depicts a high-level overview of this solution.
Amazon Kinesis Streams
You can use Amazon Kinesis Streams to build your own streaming application. This application can process and analyze streaming data by continuously capturing and storing terabytes of data per hour from hundreds of thousands of sources.
Amazon Kinesis Analytics
Kinesis Analytics provides an easy and familiar standard SQL language to analyze streaming data in real time. One of its most powerful features is that there are no new languages, processing frameworks, or complex machine learning algorithms that you need to learn.
Amazon Kinesis Firehose
Kinesis Firehose is the easiest way to load streaming data into AWS. It can capture, transform, and load streaming data into Amazon S3, Amazon Redshift, and Amazon Elasticsearch Service.
Amazon Elasticsearch Service
Amazon ES is a fully managed service that makes it easy to deploy, operate, and scale Elasticsearch for log analytics, full text search, application monitoring, and more.
Solution summary
The following is a quick walkthrough of the solution that’s presented in the diagram:
IoT sensors send streaming data into Kinesis Streams. In this post, you use a Python script to simulate an IoT temperature sensor device that sends the streaming data.
By using the built-in RANDOM_CUT_FOREST function in Kinesis Analytics, you can detect anomalies in real time with the sensor data that is stored in Kinesis Streams. RANDOM_CUT_FOREST is also an appropriate algorithm for many other kinds of anomaly-detection use cases—for example, the media sentiment example mentioned earlier in this post.
The processed anomaly data is then loaded into the Kinesis Firehose delivery stream.
By using the built-in integration that Kinesis Firehose has with Amazon ES, you can easily export the processed anomaly data into the service and visualize it with Kibana.
Implementation steps
The following sections walk through the implementation steps in detail.
Creating the delivery stream
Open the Amazon Kinesis Streams console.
Create a new Kinesis stream. Give it a name that indicates it’s for raw incoming stream data—for example, RawStreamData. For Number of shards, type 1.
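If you would rather create the stream programmatically, a minimal boto3 equivalent of this step might look like the following sketch; the Region is an assumption and should match your configuration.
# Hedged sketch: create the raw ingest stream with boto3 instead of the console.
import boto3
kinesis = boto3.client("kinesis", region_name="us-east-1")   # assumed Region
kinesis.create_stream(StreamName="RawStreamData", ShardCount=1)
# Wait until the stream is ACTIVE before writing to it.
kinesis.get_waiter("stream_exists").wait(StreamName="RawStreamData")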
The Python code provided below simulates a streaming application, such as an IoT device, and generates random data and anomalies into a Kinesis stream. The code generates two temperature ranges, where the first range is the hypothetical sensor’s normal operating temperature range (10–20), and the second is the anomaly temperature range (100–120). Make sure to change the stream name on lines 16 and 20 and the Region on line 6 to match your configuration. Alternatively, you can download the Amazon Kinesis Data Generator from this repository and use it to generate the data.
import json
import datetime
import random
import testdata
from boto import kinesis
kinesis = kinesis.connect_to_region("us-east-1")
def getData(iotName, lowVal, highVal):
    data = {}
    data["iotName"] = iotName
    data["iotValue"] = random.randint(lowVal, highVal)
    return data
while 1:
    rnd = random.random()
    if (rnd < 0.01):
        data = json.dumps(getData("DemoSensor", 100, 120))
        kinesis.put_record("RawStreamData", data, "DemoSensor")
        print '***************************** anomaly ************************* ' + data
    else:
        data = json.dumps(getData("DemoSensor", 10, 20))
        kinesis.put_record("RawStreamData", data, "DemoSensor")
        print data
Open the Amazon Elasticsearch Service console and create a new domain.
Give the domain a unique name. In the Configure cluster screen, use the default settings.
In the Set up access policy screen, in the Set the domain access policy list, choose Allow access to the domain from specific IP(s).
Enter the public IP address of your computer. Note: If you’re working behind a proxy or firewall, see the “Use a proxy to simplify request signing” section in this AWS Database blog post to learn how to work with a proxy. For additional information about securing access to your Amazon ES domain, see How to Control Access to Your Amazon Elasticsearch Domain in the AWS Security Blog.
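For reference, the console step above produces an IP-based access policy roughly like the one sketched below, which you could also apply with boto3. The account ID, Region, domain name, and IP address are placeholders.
# Rough sketch of the IP-based access policy the console step generates.
# Account ID, Region, domain name, and IP address are placeholders.
import json
import boto3
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"AWS": "*"},
            "Action": "es:*",
            "Resource": "arn:aws:es:us-east-1:111122223333:domain/my-domain/*",
            "Condition": {"IpAddress": {"aws:SourceIp": ["203.0.113.10/32"]}},
        }
    ],
}
boto3.client("es").update_elasticsearch_domain_config(
    DomainName="my-domain",
    AccessPolicies=json.dumps(policy),
)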
After the Amazon ES domain is up and running, you can set up and configure Kinesis Firehose to export results to Amazon ES:
Open the Amazon Kinesis Firehose console and choose Create Delivery Stream.
In the Destination dropdown list, choose Amazon Elasticsearch Service.
Type a stream name, and choose the Amazon ES domain that you created in Step 4.
Provide an index name and ES type. In the S3 bucket dropdown list, choose Create New S3 bucket. Choose Next.
In the configuration, change the Elasticsearch Buffer size to 1 MB and the Buffer interval to 60s. Use the default settings for all other fields. This shortens the time for the data to reach the ES cluster.
Under IAM Role, choose Create/Update existing IAM role. The best practice is to create a new role every time. Otherwise, the console keeps adding policy documents to the same role. Eventually, the size of the attached policies causes IAM to reject the role, but it fails in a non-obvious way: the console simply stops working.
Choose Next to move to the Review page.
Review the configuration, and then choose Create Delivery Stream.
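If you later want to reproduce this delivery stream through the API instead of the console, a hedged boto3 sketch that mirrors the settings above (1 MB buffer, 60-second interval, failed documents backed up to S3) might look like the following. The stream name, ARNs, and index name are placeholders.
# Hedged sketch: create the Firehose delivery stream via the API.
# All ARNs and names are placeholders; match them to your own resources.
import boto3
firehose = boto3.client("firehose")
firehose.create_delivery_stream(
    DeliveryStreamName="AnomalyDataToES",            # hypothetical stream name
    ElasticsearchDestinationConfiguration={
        "RoleARN": "arn:aws:iam::111122223333:role/firehose_delivery_role",
        "DomainARN": "arn:aws:es:us-east-1:111122223333:domain/my-domain",
        "IndexName": "anomaly-scores",               # hypothetical index name
        "TypeName": "log",
        "IndexRotationPeriod": "OneDay",
        "BufferingHints": {"IntervalInSeconds": 60, "SizeInMBs": 1},
        "S3BackupMode": "FailedDocumentsOnly",
        "S3Configuration": {
            "RoleARN": "arn:aws:iam::111122223333:role/firehose_delivery_role",
            "BucketARN": "arn:aws:s3:::my-firehose-backup-bucket",
        },
    },
)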
Run the Python file for 1–2 minutes, and then press Ctrl+C to stop the execution. This loads some data into the stream for you to visualize in the next step.
Analyzing the data
Now it’s time to analyze the IoT streaming data using Amazon Kinesis Analytics.
Open the Amazon Kinesis Analytics console and create a new application. Give the application a name, and then choose Create Application.
On the next screen, choose Connect to a source. Choose the raw incoming data stream that you created earlier. (Note the stream name Source_SQL_STREAM_001 because you will need it later.)
Use the default settings for everything else. When the schema discovery process is complete, it displays a success message with the formatted stream sample in a table as shown in the following screenshot. Review the data, and then choose Save and continue.
Next, choose Go to SQL editor. When prompted, choose Yes, start application.
Copy the following SQL code and paste it into the SQL editor window.
CREATE OR REPLACE STREAM "TEMP_STREAM" (
"iotName" varchar (40),
"iotValue" integer,
"ANOMALY_SCORE" DOUBLE);
-- Creates an output stream and defines a schema
CREATE OR REPLACE STREAM "DESTINATION_SQL_STREAM" (
"iotName" varchar(40),
"iotValue" integer,
"ANOMALY_SCORE" DOUBLE,
"created" TimeStamp);
-- Compute an anomaly score for each record in the source stream
-- using Random Cut Forest
CREATE OR REPLACE PUMP "STREAM_PUMP_1" AS INSERT INTO "TEMP_STREAM"
SELECT STREAM "iotName", "iotValue", ANOMALY_SCORE FROM
TABLE(RANDOM_CUT_FOREST(
CURSOR(SELECT STREAM * FROM "SOURCE_SQL_STREAM_001")
)
);
-- Sort records by descending anomaly score, insert into output stream
CREATE OR REPLACE PUMP "OUTPUT_PUMP" AS INSERT INTO "DESTINATION_SQL_STREAM"
SELECT STREAM "iotName", "iotValue", ANOMALY_SCORE, ROWTIME FROM "TEMP_STREAM"
ORDER BY FLOOR("TEMP_STREAM".ROWTIME TO SECOND), ANOMALY_SCORE DESC;
Choose Save and run SQL. As the application runs, it displays results as stream data arrives. If you don’t see any data coming in, run the Python script again to generate some fresh data. When there is data, it appears in a grid as shown in the following screenshot. Note that you are selecting data from the source stream SOURCE_SQL_STREAM_001 that you created previously. Also note the ANOMALY_SCORE column: this is the value that the RANDOM_CUT_FOREST function calculates based on the temperature ranges provided by the Python script, and higher (anomaly) temperature ranges receive a higher score. Looking at the SQL code, the first two blocks create two new streams, one to store temporary data and one for the final result. The third block, the STREAM_PUMP_1 pump, analyzes the raw source data from SOURCE_SQL_STREAM_001 using the RANDOM_CUT_FOREST function, calculates an anomaly score (ANOMALY_SCORE), and inserts it into the TEMP_STREAM stream. The final block, the OUTPUT_PUMP pump, loads the result stored in TEMP_STREAM into DESTINATION_SQL_STREAM.
Choose Exit (done editing) next to the Save and run SQL button to return to the application configuration page.
Load processed data into the Kinesis Firehose delivery stream
Now, you can export the result from DESTINATION_SQL_STREAM into the Amazon Kinesis Firehose stream that you created previously.
On the application configuration page, choose Connect to a destination.
Choose the stream name that you created earlier, and use the default settings for everything else. Then choose Save and Continue.
On the application configuration page, choose Exit to Kinesis Analytics applications to return to the Amazon Kinesis Analytics console.
Run the Python script again for 4–5 minutes to generate enough data to flow through Amazon Kinesis Streams, Kinesis Analytics, Kinesis Firehose, and finally into the Amazon ES domain.
Open the Kinesis Firehose console, choose the stream, and then choose the Monitoring tab.
As the processed data flows into Kinesis Firehose and Amazon ES, the metrics appear on the Delivery Stream metrics page. Keep in mind that the metrics page takes a few minutes to refresh with the latest data.
Open the Amazon Elasticsearch Service dashboard in the AWS Management Console. The count in the Searchable documents column increases as shown in the following screenshot. In addition, the domain shows a cluster health of Yellow. This is because, with a single instance, the cluster has nowhere to place the redundant copies of the index. To fix this, you can deploy two instances instead of one.
Visualize the data using Kibana
Now it’s time to launch Kibana and visualize the data.
Use the ES domain link to go to the cluster detail page, and then choose the Kibana link as shown in the following screenshot. If you’re working behind a proxy or firewall, see the “Use a proxy to simplify request signing” section in this blog post to learn how to work with a proxy.
In the Kibana dashboard, choose the Discover tab to perform a query.
You can also visualize the data using the different types of charts offered by Kibana. For example, by going to the Visualize tab, you can quickly create a split bar chart that aggregates by ANOMALY_SCORE per minute.
Conclusion
In this post, you learned how to use Amazon Kinesis to collect, process, and analyze real-time streaming data, and then export the results to Amazon ES for analysis and visualization with Kibana. If you have comments about this post, add them to the “Comments” section below. If you have questions or issues with implementing this solution, please open a new thread on the Amazon Kinesis or Amazon ES discussion forums.
Tristan Li is a Solutions Architect with Amazon Web Services. He works with enterprise customers in the US, helping them adopt cloud technology to build scalable and secure solutions on AWS.
Monitoring your AWS environment is important for security, performance, and cost control purposes. For example, by monitoring and analyzing API calls made to your Amazon EC2 instances, you can trace security incidents and gain insights into administrative behaviors and access patterns. The kinds of events you might monitor include console logins, Amazon EBS snapshot creation/deletion/modification, VPC creation/deletion/modification, and instance reboots.
In this post, I show you how to build a near real-time API monitoring solution for EC2 events using Amazon CloudWatch Events and Amazon Kinesis Firehose. Please be sure to have Amazon CloudTrail enabled in your account.
CloudWatch Events offers a near real-time stream of system events that describe changes in AWS resources. CloudWatch Events now supports Kinesis Firehose as a target.
Kinesis Firehose is a fully managed service for continuously capturing, transforming, and delivering data in minutes to storage and analytics destinations such as Amazon S3, Amazon Kinesis Analytics, Amazon Redshift, and Amazon Elasticsearch Service.
Walkthrough
For this walkthrough, you create a CloudWatch event rule that matches specific EC2 events such as:
Starting, stopping, and terminating an instance
Creating and deleting VPC route tables
Creating and deleting a security group
Creating, deleting, and modifying instance volumes and snapshots
Your CloudWatch event target is a Kinesis Firehose delivery stream that delivers this data to an Elasticsearch cluster, where you set up Kibana for visualization. Using this solution, you can easily load and visualize EC2 events in minutes without setting up complicated data pipelines.
Set up the Elasticsearch cluster
Create the Amazon ES domain in the Amazon ES console, or by using the create-elasticsearch-domain command in the AWS CLI.
Create a rule, and configure the event source and target. You can configure multiple event sources across several AWS resources, along with options to specify a single event type or multiple event types.
In the CloudWatch console, choose Events.
For Service Name, choose EC2.
In Event Pattern Preview, choose Edit and copy the pattern below. For this walkthrough, I selected events that are specific to the EC2 API, but you can modify it to include events for any of your AWS resources.
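The original pattern is not reproduced here; as an assumption, a minimal pattern that matches EC2 API calls recorded by CloudTrail might look like the sketch below, shown as a boto3 call so that you can also script the rule. The rule name and the list of event names are illustrative only; you can paste the same JSON into the console editor instead.
# Assumed example only: a minimal event pattern matching EC2 API calls
# recorded by CloudTrail, created via boto3 instead of the console editor.
import json
import boto3
events = boto3.client("events")
pattern = {
    "source": ["aws.ec2"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["ec2.amazonaws.com"],
        # Narrow this list to the operations you care about.
        "eventName": ["StartInstances", "StopInstances", "TerminateInstances"],
    },
}
events.put_rule(
    Name="ec2-api-monitoring",                 # hypothetical rule name
    EventPattern=json.dumps(pattern),
    State="ENABLED",
)
If you script the target step as well, events.put_targets can attach the delivery stream ARN to this rule, along with an IAM role that allows CloudWatch Events to write to Firehose.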
The following screenshot shows what your event looks like in the console.
Next, choose Add target and select the delivery stream that you just created.
Set up Kibana on the Elasticsearch cluster
Amazon ES provides a default installation of Kibana with every Amazon ES domain. You can find the Kibana endpoint on your domain dashboard in the Amazon ES console. You can restrict access to the Amazon ES domain with an IP-based access policy.
In the Kibana console, for Index name or pattern, type log. This is the name of the Elasticsearch index.
For Time-field name, choose @time.
To view the events, choose Discover.
The following chart demonstrates the API operations and the number of times that they have been triggered in the past 12 hours.
Summary
In this post, you created a continuous, near real-time solution to monitor various EC2 events such as starting and shutting down instances, creating VPCs, etc. Likewise, you can build a continuous monitoring solution for all the API operations that are relevant to your daily AWS operations and resources.
With Kinesis Firehose as a new target for CloudWatch Events, you can retrieve, transform, and load system events to the storage and analytics destination of your choice in minutes, without setting up complicated data pipelines.
If you have any questions or suggestions, please comment below.
Another month of big data solutions on the Big Data Blog. Please take a look at our summaries below and learn, comment, and share. Thank you for reading!
NEW POSTS
Amazon QuickSight Spring Announcement: KPI Charts, Export to CSV, AD Connector, and More! In this blog post, we share a number of new features and enhancements in Amazon QuickSight. You can now create key performance indicator (KPI) charts, define custom ranges when importing Microsoft Excel spreadsheets, export data to comma-separated value (CSV) format, and create aggregate filters for SPICE data sets. In the Enterprise Edition, we added an additional option to connect to your on-premises Active Directory using AD Connector.
Securely Analyze Data from Another AWS Account with EMRFS Sometimes, data to be analyzed is spread across buckets owned by different accounts. In order to ensure data security, appropriate credentials management needs to be in place. This is especially true for large enterprises storing data in different Amazon S3 buckets for different departments. This post shows how you can use a custom credentials provider to access S3 objects that cannot be accessed by the default credentials provider of EMRFS.
Querying OpenStreetMap with Amazon Athena This post explains how anyone can use Amazon Athena to quickly query publicly available OSM data stored in Amazon S3 (updated weekly) as an AWS Public Dataset. Imagine that you work for an NGO interested in improving knowledge of and access to health centers in Africa. You might want to know what’s already been mapped, to facilitate the production of maps of surrounding villages, and to determine where infrastructure investments are likely to be most effective.
Build a Real-time Stream Processing Pipeline with Apache Flink on AWS This post outlines a reference architecture for a consistent, scalable, and reliable stream processing pipeline that is based on Apache Flink using Amazon EMR, Amazon Kinesis, and Amazon Elasticsearch Service. An AWSLabs GitHub repository provides the artifacts that are required to explore the reference architecture in action. Resources include a producer application that ingests sample data into an Amazon Kinesis stream and a Flink program that analyses the data in real time and sends the result to Amazon ES for visualization.
Manage Query Workloads with Query Monitoring Rules in Amazon Redshift Amazon Redshift is a powerful, fully managed data warehouse that can offer significantly increased performance and lower cost in the cloud. However, queries which hog cluster resources (rogue queries) can affect your experience. In this post, you learn how query monitoring rules can help spot and act against such queries. This, in turn, can help you to perform smooth business operations in supporting mixed workloads to maximize cluster performance and throughput.
Amazon QuickSight Now Supports Audit Logging with AWS CloudTrail In this post, we announce support for AWS CloudTrail in Amazon QuickSight, which allows logging of QuickSight events across an AWS account. Whether you have an enterprise setting or a small team scenario, this integration will allow QuickSight administrators to accurately answer questions such as who last changed an analysis, or who has connected to sensitive data. With CloudTrail, administrators have better governance, auditing and risk management of their QuickSight usage.
Near Zero Downtime Migration from MySQL to DynamoDB This post introduces two methods of seamlessly migrating data from MySQL to DynamoDB, minimizing downtime and converting the MySQL key design into one more suitable for NoSQL.
Want to learn more about Big Data or Streaming Data? Check out our Big Data and Streaming Data educational pages.
Leave a comment below to let us know what big data topics you’d like to see next on the AWS Big Data Blog.
Many organizations begin their cloud journey to AWS by moving a few applications to demonstrate the power and flexibility of AWS. This initial application architecture includes building security groups that control the network ports, protocols, and IP addresses that govern access and traffic to their AWS Virtual Private Cloud (VPC). When the architecture process is complete and an application is fully functional, some organizations forget to revisit their security groups to optimize rules and help ensure the appropriate level of governance and compliance. Not optimizing security groups can create less-than-optimal security, with ports open that may not be needed or source IP ranges set that are broader than required.
Removing unused rules or limiting source IP addresses requires either an in-depth knowledge of an application’s active ports on Amazon EC2 instances or analysis of active network traffic. In this blog post, I discuss a method to:
Use VPC Flow Logs to capture information about the IP traffic in an Amazon VPC.
Enrich the VPC Flow Logs dataset with security group IDs by using Firehose and Lambda.
Demonstrate how to visualize and analyze network traffic from VPC Flow Logs by using Amazon Elasticsearch Service (Amazon ES).
Using this approach can help you remediate security group rules to necessary source IPs, ports, and nested security groups, helping to improve the security of your AWS resources while minimizing the potential risk to production environments.
As illustrated in the preceding diagram, this is how the data flows in this model:
The Lambda ingestor function passes the data to Firehose.
Firehose then passes the data to the Lambda decorator function.
The Lambda decorator function performs a number of lookups for each record and returns the data to Firehose with additional fields.
Firehose then posts the enhanced dataset to the Amazon ES endpoint and any errors to Amazon S3.
The solution
Step 1: Set up your Amazon ES cluster and VPC Flow Logs
Create an Amazon ES cluster
The first step in this solution is to create an Amazon ES cluster. Do this first because it takes some time for the cluster to become available. If you are new to Amazon ES, you can learn more about it in the Amazon ES documentation.
Type es-flowlogs for the Elasticsearch domain name.
Set Version to 1 in the drop-down list. Choose Next.
Set Instance count to 2 and select the Enable zone awareness check box. (This ensures cluster stability in the event of an Availability Zone outage.) Accept the defaults for the rest of the page.
[Optional] If you use this domain for production purposes, I recommend using dedicated master nodes. Select the Enable dedicated master check box and select medium.elasticsearch from the Instance type drop-down list. Leave the Instance count at 3, which is the default.
Choose Next.
From the Set the domain access policy to drop-down list on the next page, select Allow access to the domain from specific IP(s). In the dialog box, type or paste the comma-separated list of valid IPv4 addresses or Classless Inter-Domain Routing (CIDR) blocks you would like to be able to access the Amazon ES domain.
It will take a few minutes for the cluster to be available. In the meantime, you can begin enabling VPC Flow Logs.
Enable VPC Flow Logs
VPC Flow Logs is a feature that lets you capture information about the IP traffic going to and from network interfaces in your VPC. Flow log data is stored using Amazon CloudWatch Logs. For more information about VPC Flow Logs, see VPC Flow Logs and CloudWatch Logs.
Choose Your VPCs in the navigation pane, and select the VPC you would like to analyze. (You can also enable VPC Flow Logs on only a subnet if you do not want to enable it on the entire VPC.)
Choose the Flow Logs tab in the bottom pane, and then choose Create Flow Log.
In the text beneath the Role box, choose Set Up Permissions (this will open an IAM management page).
Choose Allow on the IAM management page. Return to the VPC Flow Logs setup page.
Choose All from the Filter drop-down list.
Choose flowlogsRole from the Role drop-down list (you created this role in steps 3 and 4 in this procedure).
Choose Flowlogs from the Destination Log Group drop-down list.
Choose Create Flow Log.
Step 2: Set up AWS Lambda to enrich the VPC Flow Logs dataset with security group IDs
If you completed Step 1, VPC Flow Logs data is now streaming to CloudWatch Logs. Next, you will deploy two Lambda functions. The first, the ingestor function, moves the data into Firehose, and the second, the decorator function, adds three new fields to the VPC Flow Logs dataset and returns records to Firehose for delivery to Amazon ES.
The new fields added by the decorator function are:
Direction – By comparing the primary IP address of the elastic network interface (ENI) with the destination IP address, you can set the direction for the IP connection (a sketch follows this list).
Security group IDs – Each ENI can be associated with as many as five security groups. The security group IDs are added as an array in the record.
Source – This includes a number of fields that result from looking up srcaddr in a free geographical lookup service.
The Source includes:
source-country-code
source-country-name
source-region-code
source-region-name
source-city
source-location, latitude, and longitude.
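As a rough illustration of how the Direction field could be derived (a hypothetical helper, not the decorator function's actual code), assuming the ENI's primary private IP address has already been looked up, for example with DescribeNetworkInterfaces:

static String direction(String eniPrimaryIp, String dstaddr) {
    // If the flow's destination address is the ENI itself, the traffic is inbound; otherwise outbound.
    return eniPrimaryIp.equals(dstaddr) ? "inbound" : "outbound";
}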
Follow the instructions in this GitHub repository to deploy the two Lambda functions and the associated permissions that are required.
Step 3: Set up Firehose
Amazon Kinesis Firehose is a fully managed service that allows you to transform flow log data and stream it into Amazon ES. The service scales automatically with load, and you only pay for the data transmitted through the service.
Choose Go to Firehose and then choose Create Delivery Stream.
Step 3.1: Define the destination
Choose Amazon Elasticsearch Service from the Destination drop-down list.
For Delivery stream name, type VPCFlowLogsToElasticSearch (the name must match the default environment variable in the ingestion Lambda function).
Choose es-flowlogs from the Elasticsearch domain drop-down list. (The Amazon ES cluster configuration state needs to be Active for es-flowlogs to be available in the drop-down list.)
For Index, type cwl.
Choose OneDay from the Index rotation drop-down list.
For Type, type log.
For Backup mode, select Failed Documents Only.
For S3 bucket, select New S3 bucket in the drop-down list and type a bucket name of your choice. Choose Create bucket.
Choose Next.
Step 3.2: Configure Lambda
Choose Enable for Data transformation.
Choose vpc-flow-log-appender-dev-FlowLogDecoratorFunction-xxxxx from the Lambda function drop-down list (make sure you select the Decorator function).
From the IAM role drop-down list, choose Create/Update existing IAM role, Firehose delivery IAM role.
Choose Allow. This takes you back to the Firehose Configuration.
Choose Next and then choose Create Delivery Stream.
Step 4: Stream data to Firehose
The next step is to enable the data to stream from CloudWatch Logs to Firehose. You will use the Lambda ingestion function you deployed earlier: vpc-flow-log-appender-dev-FlowLogIngestionFunction-xxxxxxx.
Choose Logs in the navigation pane, and select the check box next to Flowlogs under Log Groups.
From the Actions menu, choose Stream to AWS Lambda. Choose vpc-flow-log-appender-dev-FlowLogIngestionFunction-xxxxxxx (select the Ingestion function). Choose Next.
Choose Amazon VPC Flow Logs from the Log Format drop-down list. Choose Next.
Choose Start Streaming.
VPC Flow Logs will now be forwarded to Firehose, capturing information about the IP traffic going to and from network interfaces in your VPC. Firehose appends additional data fields and forwards the enriched data to your Amazon ES cluster.
Data is now flowing to your Amazon ES cluster, but be patient because it can take up to 30 minutes for the data to begin appearing there.
Step 5: Verify that the flow log data is streaming through Firehose to the Amazon ES cluster
You should see VPC Flow Logs with ENI IDs under Log Streams (see the following screenshot) and Stored Bytes greater than zero in the CloudWatch log group.
Do you have logs from the Lambda ingestion function in the CloudWatch log group? As shown in the following screenshot, you should see START, END, and REPORT records. These show that the ingestion function is running and streaming data to Firehose.
Do you have logs from the Lambda decorator function in the CloudWatch log group? You should see START, END, and REPORT records as well as entries similar to: “Processing completed. Successful records XXX, Failed records 0.”
Do you have cwl-* indexes in the Amazon ES dashboard, as shown in the following screenshot? If you do, you are successfully streaming through Firehose and populating the Amazon ES cluster, and you are ready to proceed to Step 6. Remember, it can take up to 30 minutes for the flow logs from your workloads to begin flowing to the Amazon ES cluster.
Step 6: Using the SGDashboard to analyze VPC network traffic
You now need to set up a Kibana dashboard to monitor the traffic in your VPC.
Choose es-flowlogs under Elasticsearch domain name.
Click the link next to Kibana, as shown in the following screenshot.
The first time you access Kibana, you will be asked to set the default index. To set the default index in the Amazon ES cluster:
Set the Index name or pattern to cwl-*.
For Time-field name, type @timestamp.
Choose Create.
Load the SGDashboard:
Download this JSON file and save it to your computer. The file includes a dashboard and visualizations I created for this blog post’s purposes.
In Kibana, choose Management in the navigation pane, choose Saved Objects, and then import the file you just downloaded.
Choose Dashboard and Open to load the SGDashboard you just imported. (You might have to press Enter in the top search box to have the dashboard load the first time.)
The following screenshot shows the SGDashboard after it has loaded.
The SGDashboard is composed of a set of visualizations. Each visualization contains a view or summary of the underlying data contained in the Amazon ES cluster, as shown in the preceding screenshot. You can control the timeframe for the dashboard in the upper right corner. By clicking the timeframe, the dashboard exposes alternative timeframes that you can select.
The SGDashboard includes a list of security groups, destination ports, source IP addresses, actions, protocols, and connection directions as well as raw VPC Flow Log records. This information is useful because you can compare this to your security group configurations. Ports might be open in the security group but have no network traffic flowing to the instances on those ports, which means the corresponding rules can probably be removed. Also, by evaluating IP ranges in use, you can narrow the ranges to only those IP addresses required for the application. The following screenshot on the left shows a view of the SGDashboard for a specific security group. By comparing its accepted inbound IP addresses with the security group rules in the following screenshot on the right, you can ensure the source IP ranges are sufficiently restrictive.
Analyze VPC Flow Logs data
Amazon ES allows you to quickly view and filter VPC Flow Logs data to determine what network traffic is flowing in your VPC. This analysis requires an understanding of security groups and elastic network interfaces (ENIs). If two security groups are associated with the same ENI, traffic accepted by the first security group is registered against both groups, because the flow log record is tied to the ENI rather than to a single security group. As a result, when you click a security group that you want to filter on, additional groups might still be on the list because they are included in the same VPC Flow Logs records.
The following screenshot on the left is a view of the SGDashboard with a security group selected (sg-978414e8). Even though that security group has a filter, two additional security groups remain in the dashboard. The following screenshot on the right shows the raw log data where each record contains all three security groups and demonstrates that all three security groups share a common set of flow log records.
Also, note that security groups are stateful, so if the instance itself is initiating traffic to a different location, the return traffic will be displayed in the Kibana dashboard. The best example of this is port 123 Network Time Protocol (NTP). This type of traffic can be easily removed from the display by choosing the port on the right side of the dashboard, and then reversing the filter, as shown in the following screenshot. By reversing the filter, you can exclude data from the view.
Example: Unused security groups
Let’s say that some security groups are no longer in use. First, I change the time range by clicking the current time range in the top right corner of the dashboard, as shown in the following screenshot. I select Week to date.
As the following screenshot shows, the dashboard has identified five security groups that have had traffic during the week to date.
As you can see in the following screenshot, I have many security groups in my test account that are not in use. Any security groups not in the SGDashboard are candidates for removal.
Example: Unused inbound rules
Let’s take a look at security group sg-63ed8c1c from the preceding screenshot. When I click sg-63ed8c1c (the security group ID) in the dashboard, a filter is applied that reduces the security groups displayed to only the records with that security group included. We can compare the traffic associated with this security group in the SGDashboard (shown in the following screenshot) to the security group rules in the EC2 console.
As the following screenshot of the EC2 console shows, this security group has only two inbound rules: one for HTTP on port 80 and one for RDP. The SGDashboard shows that traffic is not flowing on port 80, so I can safely remove that rule from the security group.
Summary
It can be challenging to help ensure that your AWS Cloud environment allows only intended traffic and is as secure and manageable as possible. In this post, I have shown how to enable VPC Flow Logs. I then showed how to use Firehose and Lambda to add security group IDs, directions, and locations to the VPC Flow Logs dataset. The SGDashboard then enables you to analyze the flow log data and compare it with your security group configurations to improve your cloud security.
If you have comments about this blog post, submit them in the “Comments” section below. If you have implementation or troubleshooting questions about the solution in this post, please start a new thread on the AWS WAF forum.
In today’s business environments, data is generated in a continuous fashion by a steadily increasing number of diverse data sources. Therefore, the ability to continuously capture, store, and process this data to quickly turn high volume streams of raw data into actionable insights has become a substantial competitive advantage for organizations.
Apache Flink is an open source project that is well suited to form the basis of such a stream processing pipeline. It offers unique capabilities that are tailored to the continuous analysis of streaming data. However, building and maintaining a pipeline based on Flink often requires considerable expertise, in addition to physical resources and operational efforts.
This post outlines a reference architecture for a consistent, scalable, and reliable stream processing pipeline that is based on Apache Flink using Amazon EMR, Amazon Kinesis, and Amazon Elasticsearch Service. An AWSLabs GitHub repository provides the artifacts that are required to explore the reference architecture in action. Resources include a producer application that ingests sample data into an Amazon Kinesis stream and a Flink program that analyzes the data in real time and sends the result to Amazon ES for visualization.
Analyzing geospatial taxi data in real time
Consider a scenario related to optimizing taxi fleet operations. You obtain information continuously from a fleet of taxis currently operating in New York City. Using this data, you want to optimize the operations by analyzing the gathered data in real time and making data-based decisions.
You would like, for instance, to identify hot spots—areas that are currently in high demand for taxis—so that you can direct unoccupied taxis there. You also want to track current traffic conditions so that you can give approximate trip durations to customers, for example, for rides to the nearby airports. Naturally, your decisions should be based on information that closely reflects the current demand and traffic conditions. The incoming data needs to be analyzed in a continuous and timely fashion. Relevant KPIs and derived insights should be accessible to real-time dashboards.
For the purpose of this post, you emulate a stream of trip events by replaying a dataset of historic taxi trips collected in New York City into Amazon Kinesis Streams. The dataset is available from the New York City Taxi & Limousine Commission website. It contains information on the geolocation and collected fares of individual taxi trips.
In more realistic scenarios, you could leverage AWS IoT to collect the data from telemetry units installed in the taxis and then ingest the data into an Amazon Kinesis stream.
Architecture of a reliable and scalable stream processing pipeline
Because the pipeline serves as the central tool to operate and optimize the taxi fleet, it’s crucial to build an architecture that is tolerant against the failure of single nodes. The pipeline should adapt to changing rates of incoming events. Therefore, you should separate the ingestion of events, their actual processing, and the visualization of the gathered insights into different components. By loosely coupling these components of the infrastructure and using managed services, you can increase the robustness of the pipeline in case of failures. You can also scale the different parts of your infrastructure individually and reduce the efforts that are required to build and operate the entire pipeline.
By decoupling the ingestion and storage of events sent by the taxis from the computation of queries deriving the desired insights, you can substantially increase the robustness of the infrastructure.
Events are initially persisted by means of Amazon Kinesis Streams, which holds a replayable, ordered log and redundantly stores events in multiple Availability Zones. Later, the events are read from the stream and processed by Apache Flink. As Flink continuously snapshots its internal state, the failure of an operator or entire node can be recovered by restoring the internal state from the snapshot and replaying events that need to be reprocessed from the stream.
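The following fragment is a minimal sketch of enabling this behavior with the Flink DataStream API (the StreamExecutionEnvironment class); the checkpoint interval is an illustrative value, not one taken from the reference architecture.

StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
env.enableCheckpointing(10_000); // snapshot the operator state every 10 seconds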
Another advantage of a central log for storing events is the ability to consume data by multiple applications. It is feasible to run different versions of a Flink application side by side for benchmarking and testing purposes. Or, you could use Amazon Kinesis Firehose to persist the data from the stream to Amazon S3 for long-term archival and then thorough historical analytics, using Amazon Athena.
Because Amazon Kinesis Streams, Amazon EMR, and Amazon ES are managed services that can be created and scaled by means of simple API calls, using these services allows you to focus your expertise on providing business value. Let AWS do the undifferentiated heavy lifting that is required to build and, more importantly, operate and scale the entire pipeline. The creation of the pipeline can be fully automated with AWS CloudFormation and individual components can be monitored and automatically scaled by means of Amazon CloudWatch. Failures are detected and automatically mitigated.
For the rest of this post, I focus on aspects that are related to building and running the reference architecture on AWS. To learn more about Flink, see the Flink training session that discusses how to implement Flink programs using its APIs. The scenario used in the session is fairly similar to the one discussed in this post.
Building and running the reference architecture
To see the taxi trip analysis application in action, use two CloudFormation templates to build and run the reference architecture:
The first template builds the runtime artifacts for ingesting taxi trips into the stream and for analyzing trips with Flink
The second template creates the resources of the infrastructure that run the application
The resources that are required to build and run the reference architecture, including the source code of the Flink application and the CloudFormation templates, are available from the flink-stream-processing-refarch AWSLabs GitHub repository.
Building the runtime artifacts and creating the infrastructure
Execute the first CloudFormation template to create an AWS CodePipeline pipeline, which builds the artifacts by means of AWS CodeBuild in a serverless fashion. You can also install Maven and build the Flink Amazon Kinesis connector and the other runtime artifacts manually. After all stages of the pipeline complete successfully, you can retrieve the artifacts from the S3 bucket that is specified in the output section of the CloudFormation template.
When the first template is created and the runtime artifacts are built, execute the second CloudFormation template, which creates the resources of the reference architecture described earlier.
Wait until both templates have been created successfully before proceeding to the next step. This takes up to 15 minutes, so feel free to get a fresh cup of coffee while CloudFormation does all the work for you.
Starting the Flink runtime and submitting a Flink program
To start the Flink runtime and submit the Flink program that is doing the analysis, connect to the EMR master node. The parameters of this and later commands can be obtained from the output sections of the two CloudFormation templates, which have been used to provision the infrastructure and build the runtime artifacts.
$ ssh -C -D 8157 «EMR master node IP»
The EMR cluster that is provisioned by the CloudFormation template comes with two c4.xlarge core nodes with four vCPUs each. Generally, you match the number of node cores to the number of slots per task manager. For this post, it is reasonable to start a long-running Flink cluster with two task managers and four slots per task manager.
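A plausible way to start such a session from the EMR master node is shown below; the flag values are illustrative rather than taken verbatim from the reference architecture (-n sets the number of task managers, -s the slots per task manager, and -d runs the YARN session detached).

$ flink-yarn-session -n 2 -s 4 -d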
After the Flink runtime is up and running, the taxi stream processor program can be submitted to the Flink runtime to start the real-time analysis of the trip events in the Amazon Kinesis stream.
Now that the Flink application is running, it is reading the incoming events from the stream, aggregating them in time windows according to the time of the events, and sending the results to Amazon ES. The Flink application takes care of batching records so as not to overload the Elasticsearch cluster with small requests and of signing the batched requests to enable a secure configuration of the Elasticsearch cluster.
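The program itself lives in the flink-taxi-stream-processor repository; the following fragment is only a sketch of the pattern just described. The types TripEvent, TripEventSchema, and CountTripsPerCell, the pickupCell field, and the elasticsearchSink variable are hypothetical placeholders, while FlinkKinesisConsumer and the windowing operators are part of the Flink Kinesis connector and DataStream API. The kinesisConsumerConfig Properties object is configured later in this post.

DataStream<TripEvent> trips = env.addSource(
        new FlinkKinesisConsumer<>("«stream name»", new TripEventSchema(), kinesisConsumerConfig));

trips
        .keyBy(trip -> trip.pickupCell)        // group trips by pickup location cell (hypothetical field)
        .timeWindow(Time.minutes(10))          // aggregate in event-time windows
        .apply(new CountTripsPerCell())        // hypothetical WindowFunction that produces the KPI
        .addSink(elasticsearchSink);           // the custom Jest-based Amazon ES sink described below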
Ingesting trip events into the Amazon Kinesis stream
To ingest the events, use the taxi stream producer application, which replays a historic dataset of taxi trips recorded in New York City from S3 into an Amazon Kinesis stream with eight shards. In addition to the taxi trips, the producer application also ingests watermark events into the stream so that the Flink application can determine the time up to which the producer has replayed the historic dataset.
This application is by no means specific to the reference architecture discussed in this post. You can easily reuse it for other purposes as well, for example, building a similar stream processing architecture based on Amazon Kinesis Analytics instead of Apache Flink.
Exploring the Kibana dashboard
Now that the entire pipeline is running, you can finally explore the Kibana dashboard that displays insights that are derived in real time by the Flink application:
For the purpose of this post, the Elasticsearch cluster is configured to accept connections from the IP address range specified as a parameter of the CloudFormation template that creates the infrastructure. For production-ready applications, this may not always be desirable or possible. For more information about how to securely connect to your Elasticsearch cluster, see the Set Access Control for Amazon Elasticsearch Service post on the AWS Database blog.
In the Kibana dashboard, the map on the left visualizes the start points of taxi trips. The redder a rectangle is, the more taxi trips started in that location. The line chart on the right visualizes the average duration of taxi trips to John F. Kennedy International Airport and LaGuardia Airport, respectively.
Given this information, taxi fleet operations can be optimized by proactively sending unoccupied taxis to locations that are currently in high demand, and by estimating trip durations to the local airports more precisely.
You can now scale the underlying infrastructure. For example, scale the shard capacity of the stream, change the instance count or the instance types of the Elasticsearch cluster, and verify that the entire pipeline remains functional and responsive even during the rescale operation.
Running Apache Flink on AWS
As you have just seen, the Flink runtime can be deployed by means of YARN, so EMR is well suited to run Flink on AWS. However, there are some AWS-related considerations that need to be addressed to build and run the Flink application:
Building the Flink Amazon Kinesis connector
Adapting the Amazon Kinesis consumer configuration
Enabling event time processing by submitting watermarks to Amazon Kinesis
Connecting Flink to Amazon ES
Building the Flink Amazon Kinesis connector
Flink provides a connector for Amazon Kinesis streams. In contrast to other Flink artifacts, the Amazon Kinesis connector is not available from Maven central, so you need to build it yourself. I recommend building Flink with Maven 3.2.x instead of the more recent Maven 3.3.x release, as Maven 3.3.x may produce outputs with improperly shaded dependencies.
Adapting the Amazon Kinesis consumer configuration
Flink recently introduced support for obtaining AWS credentials from the role that is associated with an EMR cluster. Enable this functionality in the Flink application source code by setting the AWS_CREDENTIALS_PROVIDER property to AUTO and by omitting any AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY parameters from the Properties object.
Properties kinesisConsumerConfig = new Properties();
kinesisConsumerConfig.setProperty(AWSConfigConstants.AWS_CREDENTIALS_PROVIDER, "AUTO");
Credentials are automatically retrieved from the instance’s metadata and there is no need to store long-term credentials in the source code of the Flink application or on the EMR cluster.
As the producer application ingests thousands of events per second into the stream, it helps to increase the number of records fetched by Flink in a single GetRecords call. Change this value to the maximum value that is supported by Amazon Kinesis.
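For example, assuming the ConsumerConfigConstants class from the Flink Kinesis connector (10,000 records is the documented per-call maximum for GetRecords):

kinesisConsumerConfig.setProperty(ConsumerConfigConstants.SHARD_GETRECORDS_MAX, "10000");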
Enabling event time processing by submitting watermarks to Amazon Kinesis
Flink supports several notions of time, most notably event time. Event time is desirable for streaming applications as it results in very stable semantics of queries. The time of events is determined by the producer or close to the producer. The reordering of events due to network effects has substantially less impact on query results.
To realize event time, Flink relies on watermarks that are sent by the producer in regular intervals to signal the current time at the source to the Flink runtime. When integrating with Amazon Kinesis Streams, there are two different ways of supplying watermarks to Flink:
Manually adding watermarks to the stream
Relying on ApproximateArrivalTime, which is automatically added to events on their ingestion to a stream
By just setting the time characteristic to event time for an Amazon Kinesis stream, Flink automatically uses the ApproximateArrivalTime value supplied by Amazon Kinesis.
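A sketch of selecting event time as the time characteristic in the Flink program, after which the connector's default timestamps based on the approximate arrival time apply:

env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime);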
Alternatively, you can choose to use the time that is determined by the producer by specifying a custom Timestamp Assigner operator that extracts the watermark information from the corresponding events of the stream.
If you rely on PunctuatedAssigner, it is important to ingest watermarks to all individual shards, as Flink processes each shard of a stream individually. This can be realized by enumerating the shards of a stream. Ingest watermarks to specific shards by explicitly setting the hash key to the hash range of the shard to which the watermark should be sent.
The producer that is ingesting the taxi trips into Amazon Kinesis uses the latter approach. You can explore the details of the implementation in the flink-stream-processing-refarch AWSLabs GitHub repository.
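The following fragment is only a rough illustration of the explicit hash key technique, not the producer's actual code. It assumes the AWS SDK for Java Kinesis client, a Shard object obtained from DescribeStream, and a hypothetical watermarkPayload string.

AmazonKinesis kinesis = AmazonKinesisClientBuilder.defaultClient();

PutRecordRequest watermark = new PutRecordRequest()
        .withStreamName("«stream name»")
        .withPartitionKey("watermark")  // required, but ignored for placement because an explicit hash key is set
        .withExplicitHashKey(shard.getHashKeyRange().getStartingHashKey())
        .withData(ByteBuffer.wrap(watermarkPayload.getBytes(StandardCharsets.UTF_8)));

kinesis.putRecord(watermark);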
Connecting Flink to Amazon ES
Flink provides several connectors for Elasticsearch. However, all these connectors only support the TCP transport protocol of Elasticsearch, whereas Amazon ES relies on the HTTP protocol. As of Elasticsearch 5, the TCP transport protocol is deprecated. While an Elasticsearch connector for Flink that supports the HTTP protocol is still in the works, you can use the Jest library to build a custom sink that connects to Amazon ES and signs requests with IAM credentials, as the following example shows.
final AWSCredentialsProvider credentialsProvider = new DefaultAWSCredentialsProviderChain();
final Supplier<LocalDateTime> clock = () -> LocalDateTime.now(ZoneOffset.UTC);
final AWSSigner awsSigner = new AWSSigner(credentialsProvider, "«region»", "es", clock);
final AWSSigningRequestInterceptor requestInterceptor = new AWSSigningRequestInterceptor(awsSigner);

// Register the signing interceptor on both the synchronous and asynchronous HTTP clients so that
// every request to the Amazon ES endpoint is signed with the credentials resolved by the provider chain.
final JestClientFactory factory = new JestClientFactory() {
    @Override
    protected HttpClientBuilder configureHttpClient(HttpClientBuilder builder) {
        builder.addInterceptorLast(requestInterceptor);
        return builder;
    }

    @Override
    protected HttpAsyncClientBuilder configureHttpClient(HttpAsyncClientBuilder builder) {
        builder.addInterceptorLast(requestInterceptor);
        return builder;
    }
};

factory.setHttpClientConfig(new HttpClientConfig
        .Builder("«es endpoint»")
        .multiThreaded(true)
        .build());

final JestClient client = factory.getObject();
For the full implementation details of the Elasticsearch sink, see the flink-taxi-stream-processor AWSLabs GitHub repository, which contains the source code of the Flink application.
Summary
This post discussed how to build a consistent, scalable, and reliable stream processing architecture based on Apache Flink. It illustrated how to leverage managed services to reduce the expertise and operational effort that is usually required to build and maintain a low-latency, high-throughput stream processing pipeline, so that you can focus your expertise on providing business value.
Start using Apache Flink on Amazon EMR today. The AWSLabs GitHub repository contains the resources that are required to run through the given example and includes further information that helps you to get started quickly.
If you have questions or suggestions, please comment below.
About the Author
Dr. Steffen Hausmann is a Solutions Architect with Amazon Web Services. He has a strong background in the area of complex event and stream processing and supports customers on their cloud journey. In his spare time, he likes hiking in the nearby mountains.
To help you secure your AWS resources, we recommend that you adopt a layered approach that includes the use of preventative and detective controls. For example, incorporating host-based controls for your Amazon EC2 instances can restrict access and provide appropriate levels of visibility into system behaviors and access patterns. These controls often include a host-based intrusion detection system (HIDS) that monitors and analyzes network traffic, log files, and file access on a host. A HIDS typically integrates with alerting and automated remediation solutions to detect and address attacks, unauthorized or suspicious activities, and general errors in your environment.
In this blog post, I show how you can use Amazon CloudWatch Logs to collect and aggregate alerts from an open-source security (OSSEC) HIDS. I use a CloudWatch Logs subscription to deliver the alerts to Amazon Elasticsearch Service (Amazon ES) for analysis and visualization with Kibana – a popular open-source visualization tool. To make it easier for you to see this solution in action, I provide a CloudFormation template to handle most of the deployment work. You can use this solution to gain improved visibility and insights across your EC2 fleet and help drive security remediation activities. For example, if specific hosts are scanning your EC2 instances and triggering OSSEC alerts, you can implement a VPC network access control list (ACL) or AWS WAF rule to block those source IP addresses or CIDR blocks.
Solution overview
The following diagram depicts a high-level overview of this post’s solution.
Here is how the solution works:
On the target EC2 instances, the OSSEC HIDS generates alerts that the CloudWatch Logs agent captures. The HIDS performs log analysis, integrity checking, Windows registry monitoring, rootkit detection, real-time alerting, and active response. For more information, see Getting started with OSSEC.
The CloudWatch Logs group receives the alerts as events.
A CloudWatch Logs subscription is applied to the target log group to forward the events through AWS Lambda to Amazon ES (see the sketch after this list).
Amazon ES loads the logged alert data.
Kibana visualizes the alerts in near-real time. Amazon ES provides a default installation of Kibana with every Amazon ES domain.
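As background on the Lambda step in this flow: CloudWatch Logs delivers the subscribed events to Lambda base64-encoded and gzip-compressed under awslogs.data. The following sketch is a hypothetical utility, not the Lambda function deployed by the CloudFormation template; it only illustrates how that payload expands into a JSON document whose logEvents array holds the OSSEC alerts that are then indexed into Amazon ES.

import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.Base64;
import java.util.zip.GZIPInputStream;

public class SubscriptionPayloadSketch {

    public static String decode(String awslogsData) throws IOException {
        byte[] compressed = Base64.getDecoder().decode(awslogsData);
        try (GZIPInputStream gzip = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[4096];
            int read;
            while ((read = gzip.read(buffer)) != -1) {
                out.write(buffer, 0, read);
            }
            // The decompressed bytes are a UTF-8 JSON document containing logGroup, logStream,
            // and a logEvents array that holds the individual OSSEC alerts.
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        }
    }
}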
Deployment considerations
For the purposes of this post, the primary OSSEC HIDS deployment consists of a Linux-based installation for which the alerts are generated locally within each system. Note that this solution depends on Amazon ES and Lambda in the target region for deployment. You can find the latest information about AWS service availability in the Region table. You also must identify an Amazon Virtual Private Cloud (VPC) subnet that has Internet access and DNS resolution for your EC2 instances to provision the required components properly.
To simplify the deployment process, I created a test environment AWS CloudFormation template. You can use this template to provision a test environment stack automatically into an existing Amazon VPC subnet. You will use CloudFormation to provision the core components of this solution and then configure Kibana for alert analysis. The source code for this solution is available on GitHub.
This post’s template performs the following high-level steps in the region you choose:
Creates two EC2 instances running Amazon Linux with an AWS Identity and Access Management (IAM) role for CloudWatch Logs access. Note: To provide sample HIDS alert data, the two EC2 instances are configured automatically to generate simulated HIDS alerts locally.
Installs and configures OSSEC, the CloudWatch Logs agent, and additional packages used for the test environment.
Creates the target HIDS Amazon ES domain.
Creates the target HIDS CloudWatch Logs group.
Creates the Lambda function and CloudWatch Logs subscription to send HIDS alerts to Amazon ES.
After the CloudFormation stack has been deployed, you can access the Kibana instance on the Amazon ES domain to complete the final steps of the setup for the test environment, which I show later in the post.
Although out of scope for this blog post, when deploying OSSEC into your existing EC2 environment, you should determine the desired configuration, including target log files for monitoring, directories for integrity checking, and active response. This typically also requires time for testing and tuning of the system to optimize it for your environment. The OSSEC documentation is a good place to start to familiarize yourself with this process. You could take another approach to OSSEC deployment, which involves an agent installation and a separate OSSEC manager to process events centrally before exporting them to CloudWatch Logs. This deployment requires an additional server component and network communication between the agent and the manager. Note that although Windows Server is supported by OSSEC, it requires an agent-based installation and therefore requires an OSSEC manager to be present. Review OSSEC Architecture for additional information about OSSEC architecture and deployment options.
Deploy the solution
This solution’s high-level steps are:
Launch the CloudFormation stack.
Configure a Kibana index pattern and begin exploring alerts.
Configure a Kibana HIDS dashboard and visualize alerts.
1. Launch the CloudFormation stack
You will launch your test environment by using a CloudFormation template that automates the provisioning process. For the following input parameters, you must identify a target VPC and subnet (which requires Internet access) for deployment. If the target subnet uses an Internet gateway, set the AssignPublicIP parameter to true. If the target subnet uses a NAT gateway, you can leave the default setting of AssignPublicIP as false.
First, you will need to stage the Lambda function deployment package in an S3 bucket located in the region into which you are deploying. To do this, download the zipped deployment package and upload it to your in-region bucket. For additional information about uploading objects to S3, see Uploading Object into Amazon S3.
You also must provide a trusted source IP address or CIDR block for access to the environment following the creation of the stack and an EC2 key pair to associate with the instances. For information about creating an EC2 key pair, see Creating a Key Pair Using Amazon EC2. Note that the trusted IP address or CIDR block also is used to create the Amazon ES access policy automatically for Kibana access. We recommend that you use a specific IP address or CIDR range rather than using 0.0.0.0/0, which would allow all IPv4 addresses to access your instances. For more information about authorizing inbound traffic to your instances, see Authorizing Inbound Traffic for Your Linux Instances.
After you have confirmed the input parameters (see the following screenshot and table for more details), create the CloudFormation stack.
Input parameter – Description
HIDSInstanceSize – EC2 instance size for test server
ESInstanceSize – Amazon ES instance size
MyKeyPair – A public/private key pair that allows you to connect securely to your instance after it launches
SubnetId – A subnet with outbound connectivity within the VPC you selected (requires Internet access)
AssignPublicIP – Set to true if your subnet is configured to connect through an Internet gateway; set to false if your subnet is configured to connect through a NAT gateway
MyTrustedNetwork – Your trusted source IP or CIDR block that is used to whitelist access to the EC2 instances and the Amazon ES endpoint
To finish creating the CloudFormation stack:
Enter the input parameters and choose Next.
On the Options page, accept the defaults and choose Next.
On the Review page, confirm the details, select the I acknowledge that AWS CloudFormation might create IAM resources check box, and then choose Create. (The stack will be created in approximately 10 minutes.)
After the stack has been created, note the HIDSESKibanaURL on the CloudFormation Outputs tab. Then, proceed to the Kibana configuration instructions in the next section.
2. Configure a Kibana index pattern and begin exploring alerts
In this section, you perform the initial setup of Kibana. To access Kibana, find the HIDSESKibanaURL in the CloudFormation stack outputs (see the previous section) and choose it. This will bring you to the Kibana instance, which is automatically provisioned to your Amazon ES instance. The source IP you provided in the CloudFormation input parameters is used to automatically populate the Amazon ES access policy. If you receive an error similar to the following error, you must confirm that your Amazon ES access policy is correct.
{"Message":"User: anonymous is not authorized to perform: es:ESHttpGet on resource: hids-alerts"}
The OSSEC HIDS alerts now are being processed into Amazon ES. To use Kibana to analyze the alert data interactively, you must configure an index pattern that identifies the data you wish to analyze in Amazon ES. You can read additional information about index patterns in the Kibana documentation.
In the Index name or pattern box, type cwl-2017.*. The index pattern is generated within the Lambda function as cwl-YYYY.MM.DD, so you can use a wildcard character for the month and day to match data from 2017. From the Time-field name drop-down list, choose @timestamp, and then choose Create.
In Kibana, you should now be able to choose the Discover pane and see alerts being populated. To set the refresh rate for the display of near-real-time alerts, choose your desired time range in the top right (such as Last 15 minutes).
Choose Auto-refresh, and then choose an interval, such as 5 seconds.
Kibana should now be configured to auto-refresh at a 5-second interval within the timeframe you configured. You should now see your alerts updating along with a count graph, as shown in the following screenshot.
The EC2 instances are automatically configured by CloudFormation to simulate activity to display several types of alerts, including:
Successful sudo to ROOT executed – The Linux sudo command was successfully executed.
Web server 400 error code – The server cannot process the request due to an apparent client error (such as malformed request syntax, a request size that is too large, invalid request message framing, or deceptive request routing).
SSH insecure connection attempt (scan) – Invalid connection attempt to the SSH listener.
Login session opened – Opened login session on the system.
Login session closed – Closed login session on the system.
New Yum package installed – Package installed on the system.
Yum package deleted – Package deleted from the system.
Let’s take a closer look at some of the alert fields, as shown in the following screenshot.
The numbered alert fields in the preceding screenshot are defined as follows:
@log_group – The source CloudWatch Logs group
@log_stream – The CloudWatch Logs stream name (InstanceID)
@message – The JSON payload from the source alerts.json OSSEC log
@owner – The AWS account ID where the alert originated
@timestamp – The time stamp applied by the consumer Lambda function
full_log – The log event from the source file
location – The source log file path and file name
rule.comment – A brief description of the OSSEC rule that was matched
rule.level – The OSSEC rule classification from 0 to 16 (see Rules Classification for more information)
rule.sidid – The rule ID of the OSSEC rule that was matched
srcip – The source IP address that triggered the alert; in this case, the simulated alerts contain the local IP of the server
You can enter search criteria in the Kibana query bar to explore HIDS alert data interactively. For example, you can run the following query to see all the rule.level 6 alerts for the EC2 InstanceID i-0e427a8594852eca2 where the source IP is 10.10.10.10.
rule.level: 6 AND @log_stream: "i-0e427a8594852eca2" AND srcip: 10.10.10.10
You can perform searches using simple text, Lucene query syntax, or the full JSON-based Elasticsearch Query DSL. You can find additional information on searching your data in the Elasticsearch documentation.
3. Configure a Kibana HIDS dashboard and visualize alerts
To analyze alert trends and patterns over time, it can be helpful to use charts and graphs to represent the alert data. I have configured a basic dashboard template that you can import into your Kibana instance.
Save the template locally and then choose Management in the Kibana navigation pane.
Choose Saved Objects, Import, and the HIDS dashboard template.
Choose the eye icon to the right of the HIDS Alerts dashboard entry. This will take you to the imported dashboard.
After importing the Kibana dashboard template and selecting it, you will see the HIDS dashboard, as shown in the following screenshot. This sample HIDS dashboard includes Alerts Over Time, Top 20 Alert Types, Rule Level Breakdown, Top 10 Rule Source ID, and Top 10 Source IPs.
To explore the alert data in more detail, you can choose an alert type on which to filter, as shown in the following two screenshots.
You can see more details about the alerts based on criteria such as source IP address or time range. For more information about using Kibana to visualize alert data, see the Kibana User Guide.
Summary
In this blog post, I showed how to use CloudWatch Logs to collect alerts in near-real time from an OSSEC HIDS and use a CloudWatch Logs subscription to pass the alerts into Amazon ES for analysis and visualization with Kibana. The dashboard deployed by this solution can help you improve the security monitoring of your EC2 fleet as part of a defense-in-depth security strategy in your AWS environment.
You can use this solution to help detect attacks, anomalous activities, and error trends across your EC2 fleet. You can also use it to help prioritize remediation efforts for your systems or help determine where to introduce additional security controls such as VPC security group rules, VPC network ACLs, or AWS WAF rules.
If you have comments about this post, add them to the “Comments” section below. If you have questions about or issues implementing this solution, start a new thread on the CloudWatch or Amazon ES forum. The source code for this solution is available on GitHub. If you need OSSEC-specific support, see OSSEC Support Options.
– Cameron