
How US federal agencies can use AWS to improve logging and log retention

Post Syndicated from Derek Doerr original https://aws.amazon.com/blogs/security/how-us-federal-agencies-can-use-aws-to-improve-logging-and-log-retention/

This post is part of a series about how Amazon Web Services (AWS) can help your US federal agency meet the requirements of the President’s Executive Order on Improving the Nation’s Cybersecurity. You will learn how you can use AWS information security practices to help meet the requirement to improve logging and log retention practices in your AWS environment.

Improving the security and operational readiness of applications relies on improving the observability of the applications and the infrastructure on which they operate. For our customers, this translates to questions of how to gather the right telemetry data, how to securely store it over its lifecycle, and how to analyze the data in order to make it actionable. These questions take on more importance as our federal customers seek to improve their collection and management of log data in all their IT environments, including their AWS environments, as mandated by the executive order.

Given the interest in the technologies used to support logging and log retention, we’d like to share our perspective. This starts with an overview of logging concepts in AWS, including log storage and management, and then proceeds to how to gain actionable insights from that logging data. This post will address how to improve logging and log retention practices consistent with the Security and Operational Excellence pillars of the AWS Well-Architected Framework.

Log actions and activity within your AWS account

AWS provides you with extensive logging capabilities to provide visibility into actions and activity within your AWS account. A security best practice is to establish a wide range of detection mechanisms across all of your AWS accounts. Starting with services such as AWS CloudTrail, AWS Config, Amazon CloudWatch, Amazon GuardDuty, and AWS Security Hub provides a foundation upon which you can base detective controls, remediation actions, and forensics data to support incident response. Here is more detail on how these services can help you gain more security insights into your AWS workloads:

  • AWS CloudTrail provides event history for all of your AWS account activity, including API-level actions taken through the AWS Management Console, AWS SDKs, command line tools, and other AWS services. You can use CloudTrail to identify who or what took which action, what resources were acted upon, when the event occurred, and other details. If your agency uses AWS Organizations, you can automate this process for all of the accounts in the organization.
  • CloudTrail logs can be delivered from all of your accounts into a centralized account. This places all logs in a tightly controlled, central location, making it easier to protect, store, and analyze them. If your agency uses AWS Organizations, you can automate this process for all of the accounts in the organization. CloudTrail can also be configured to emit metric data into the CloudWatch monitoring service, giving near real-time insights into the usage of various services.
  • CloudTrail log file integrity validation produces and cryptographically signs a digest file that contains references and hashes for every CloudTrail file that was delivered in that hour. This makes it computationally infeasible to modify, delete, or forge CloudTrail log files without detection. Validated log files are invaluable in security and forensic investigations. For example, a validated log file enables you to assert positively that the log file itself has not changed, or that particular user credentials performed specific API activity.
  • AWS Config monitors and records your AWS resource configurations and allows you to automate the evaluation of recorded configurations against desired configurations. For example, you can use AWS Config to verify that resources are encrypted, multi-factor authentication (MFA) is enabled, and logging is turned on, and you can use AWS Config rules to identify noncompliant resources. Additionally, you can review changes in configurations and relationships between AWS resources and dive into detailed resource configuration histories, helping you to determine when compliance status changed and the reason for the change.
  • Amazon GuardDuty is a threat detection service that continuously monitors for malicious activity and unauthorized behavior to protect your AWS accounts and workloads. Amazon GuardDuty analyzes and processes the following data sources: VPC Flow Logs, AWS CloudTrail management event logs, CloudTrail Amazon Simple Storage Service (Amazon S3) data event logs, and DNS logs. It uses threat intelligence feeds, such as lists of malicious IP addresses and domains, and machine learning to identify potential threats within your AWS environment.
  • AWS Security Hub provides a single place that aggregates, organizes, and prioritizes your security alerts, or findings, from multiple AWS services and optional third-party products to give you a comprehensive view of security alerts and compliance status.
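
As a minimal sketch of the CloudTrail points above, the following Python builds the arguments for an organization-wide, multi-Region trail with log file integrity validation turned on. The trail name and bucket name are hypothetical, and the sketch assumes the bucket already has a policy that lets CloudTrail write to it. The dictionary keys correspond to parameters of the CloudTrail CreateTrail API as exposed by boto3's create_trail call.

```python
def build_org_trail_request(trail_name, bucket_name):
    """Return keyword arguments for CloudTrail's create_trail call."""
    return {
        "Name": trail_name,
        "S3BucketName": bucket_name,       # central logging bucket (hypothetical name below)
        "IsMultiRegionTrail": True,        # record activity from every Region
        "EnableLogFileValidation": True,   # produce signed digest files
        "IsOrganizationTrail": True,       # cover all accounts in the organization
    }


request = build_org_trail_request("agency-org-trail", "example-central-log-bucket")
print(request)
```

With boto3 installed and credentials for the organization's management account, you could pass the result to boto3.client("cloudtrail").create_trail(**request) and then call start_logging(Name="agency-org-trail") to begin recording.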

You should be aware that most AWS services do not charge you for enabling logging (for example, AWS WAF), but the storage of logs will incur ongoing costs. Always consult the AWS service’s pricing page to understand cost impacts. Related services, such as Amazon Kinesis Data Firehose (used to stream data to storage services) and Amazon Simple Storage Service (Amazon S3) (used to store log data), will also incur charges.

Turn on service-specific logging as desired

After you have the foundational logging services enabled and configured, next turn your attention to service-specific logging. Many AWS services produce service-specific logs that include additional information. These services can be configured to record and send out information that is necessary to understand their internal state, including application, workload, user activity, dependency, and transaction telemetry. Here’s a sampling of key services with service-specific logging features:

  • Amazon CloudWatch provides you with data and actionable insights to monitor your applications, respond to system-wide performance changes, optimize resource utilization, and get a unified view of operational health. CloudWatch collects monitoring and operational data in the form of logs, metrics, and events, providing you with a unified view of AWS resources, applications, and services that run on AWS and on-premises servers. You can gain additional operational insights from your AWS compute instances (Amazon Elastic Compute Cloud, or EC2) as well as on-premises servers using the CloudWatch agent. Additionally, you can use CloudWatch to detect anomalous behavior in your environments, set alarms, visualize logs and metrics side by side, take automated actions, troubleshoot issues, and discover insights to keep your applications running smoothly.
  • Amazon CloudWatch Logs is a component of Amazon CloudWatch which you can use to monitor, store, and access your log files from Amazon Elastic Compute Cloud (Amazon EC2) instances, AWS CloudTrail, Route 53, and other sources. CloudWatch Logs enables you to centralize the logs from all of your systems, applications, and AWS services that you use, in a single, highly scalable service. You can then easily view them, search them for specific error codes or patterns, filter them based on specific fields, or archive them securely for future analysis. CloudWatch Logs enables you to see all of your logs, regardless of their source, as a single and consistent flow of events ordered by time, and you can query them and sort them based on other dimensions, group them by specific fields, create custom computations with a powerful query language, and visualize log data in dashboards.
  • Traffic Mirroring allows you to achieve full packet capture (as well as filtered subsets) of network traffic from an elastic network interface of EC2 instances inside your VPC. You can then send the captured traffic to out-of-band security and monitoring appliances for content inspection, threat monitoring, and troubleshooting.
  • The Elastic Load Balancing service provides access logs that capture detailed information about requests that are sent to your load balancer. Each log contains information such as the time the request was received, the client’s IP address, latencies, request paths, and server responses. The specific information logged varies by load balancer type.
  • Amazon S3 access logs record the S3 bucket and account that are being accessed, the API action, and requester information.
  • AWS Web Application Firewall (WAF) logs record web requests that are processed by AWS WAF, and indicate whether the requests matched AWS WAF rules and what actions, if any, were taken. These logs are delivered to Amazon S3 by using Amazon Kinesis Data Firehose.
  • Amazon Relational Database Service (Amazon RDS) log files can be downloaded or published to Amazon CloudWatch Logs. Log settings are specific to each database engine. Agencies use these settings to apply their desired logging configurations and choose which events are logged. Amazon Aurora and Amazon RDS for Oracle also support a real-time logging feature called “database activity streams” which provides even more detail and cannot be accessed or modified by database administrators.
  • Amazon Route 53 provides options for logging both public DNS query requests and Route 53 Resolver DNS queries:
    • Route 53 Resolver DNS query logs record DNS queries and responses that originate from your VPC, that use an inbound Resolver endpoint, that use an outbound Resolver endpoint, or that use a Route 53 Resolver DNS Firewall.
    • Route 53 DNS public query logs record queries to Route 53 for domains you have hosted with AWS, including the domain or subdomain that was requested; the date and time of the request; the DNS record type; the Route 53 edge location that responded to the DNS query; and the DNS response code.
  • Amazon Elastic Compute Cloud (Amazon EC2) instances can use the unified CloudWatch agent to collect logs and metrics from Linux, macOS, and Windows EC2 instances and publish them to the Amazon CloudWatch service.
  • Elastic Beanstalk logs can be streamed to CloudWatch Logs. You can also use the AWS Management Console to request the last 100 log entries from the web and application servers, or request a bundle of all log files that is uploaded to Amazon S3 as a ZIP file.
  • Amazon CloudFront logs record user requests for content that is cached by CloudFront.
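
Log groups in CloudWatch Logs keep their data indefinitely unless you set a retention period, so retention is worth configuring alongside the service-specific logging described above. The following is a minimal sketch that picks the shortest retention period CloudWatch Logs accepts that still covers a required number of days. The list of accepted values is an assumption written from memory and may be incomplete, and the log group name is hypothetical; confirm both against the PutRetentionPolicy API reference before relying on them.

```python
# Retention periods (in days) assumed to be accepted by CloudWatch Logs.
# This list is an assumption and may be incomplete; check the API reference.
ALLOWED_RETENTION_DAYS = [
    1, 3, 5, 7, 14, 30, 60, 90, 120, 150, 180, 365, 400, 545, 731, 1827, 3653,
]


def pick_retention_days(required_days):
    """Return the smallest allowed retention period that covers required_days."""
    for days in ALLOWED_RETENTION_DAYS:
        if days >= required_days:
            return days
    raise ValueError(f"No allowed retention period covers {required_days} days")


def build_retention_request(log_group_name, required_days):
    """Return keyword arguments for the CloudWatch Logs put_retention_policy call."""
    return {
        "logGroupName": log_group_name,
        "retentionInDays": pick_retention_days(required_days),
    }


request = build_retention_request("/agency/app/web", 360)  # hypothetical log group
print(request)
```

With boto3 available, the result can be passed to boto3.client("logs").put_retention_policy(**request).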

Store and analyze log data

Now that you’ve enabled foundational and service-specific logging in your AWS accounts, that data needs to be persisted and managed throughout its lifecycle. AWS offers a variety of solutions and services to consolidate your log data and store it, secure access to it, and perform analytics.

Store log data

The primary service for storing all of this logging data is Amazon S3. Amazon S3 is ideal for this role, because it’s a highly scalable, highly resilient object storage service. AWS provides a rich set of multi-layered capabilities to secure log data that is stored in Amazon S3, including encrypting objects (log records), preventing deletion (the S3 Object Lock feature), and using lifecycle policies to transition data to lower-cost storage over time (for example, to S3 Glacier). Access to data in Amazon S3 can also be restricted through AWS Identity and Access Management (IAM) policies, AWS Organizations service control policies (SCPs), S3 bucket policies, Amazon S3 Access Points, and AWS PrivateLink interfaces. While S3 is particularly easy to use with other AWS services given its various integrations, many customers also centralize their storage and analysis of their on-premises log data, or log data from other cloud environments, on AWS using S3 and the analytic features described below.
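
To make the lifecycle idea concrete, here is a minimal sketch of a lifecycle configuration that moves log objects to S3 Glacier after a period and expires them later. The prefix, rule ID, and day counts are hypothetical illustrations, not retention recommendations; choose values that match your agency's retention requirements. The dictionary layout follows the LifecycleConfiguration argument of boto3's put_bucket_lifecycle_configuration call.

```python
def build_log_lifecycle(prefix, glacier_after_days, expire_after_days):
    """Return an S3 lifecycle configuration that archives and then expires logs."""
    if expire_after_days <= glacier_after_days:
        raise ValueError("Expiration must come after the Glacier transition")
    return {
        "Rules": [
            {
                "ID": "archive-then-expire-logs",   # hypothetical rule name
                "Filter": {"Prefix": prefix},       # apply only to log objects
                "Status": "Enabled",
                "Transitions": [
                    {"Days": glacier_after_days, "StorageClass": "GLACIER"}
                ],
                "Expiration": {"Days": expire_after_days},
            }
        ]
    }


lifecycle = build_log_lifecycle("AWSLogs/", 365, 2555)
print(lifecycle)
```

With boto3 available, you could apply it with boto3.client("s3").put_bucket_lifecycle_configuration(Bucket="example-central-log-bucket", LifecycleConfiguration=lifecycle), where the bucket name is again a placeholder.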

If your AWS accounts are organized in a multi-account architecture, you can make use of the AWS Centralized Logging solution. This solution enables organizations to collect, analyze, and display CloudWatch Logs data in a single dashboard. AWS services generate log data, such as audit logs for access, configuration changes, and billing events. In addition, web servers, applications, and operating systems all generate log files in various formats. This solution uses the Amazon Elasticsearch Service (Amazon ES) and Kibana to deploy a centralized logging solution that provides a unified view of all the log events. In combination with other AWS-managed services, this solution provides you with a turnkey environment to begin logging and analyzing your AWS environment and applications.

You can also make use of services such as Amazon Kinesis Data Firehose, which you can use to transport log information to S3, Amazon ES, or any third-party service that is provided with an HTTP endpoint, such as Datadog, New Relic, or Splunk.

Finally, you can use Amazon EventBridge to route and integrate event data between AWS services and to third-party solutions such as software as a service (SaaS) providers or help desk ticketing systems. EventBridge is a serverless event bus service that allows you to connect your applications with data from a variety of sources. EventBridge delivers a stream of real-time data from your own applications, SaaS applications, and AWS services, and then routes that data to targets such as AWS Lambda.
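
As an illustration of event routing, the sketch below defines an EventBridge event pattern that matches CloudTrail records of someone stopping or deleting a trail, which is the kind of event you might route to a ticketing system or a Lambda function. The choice of event names is an example, and the matches function is a simplified local check written for this sketch (it handles only exact-value lists and nested objects); it is not how EventBridge itself evaluates patterns.

```python
import json

# Event pattern for CloudTrail API calls that could interrupt logging.
STOP_LOGGING_PATTERN = {
    "source": ["aws.cloudtrail"],
    "detail-type": ["AWS API Call via CloudTrail"],
    "detail": {
        "eventSource": ["cloudtrail.amazonaws.com"],
        "eventName": ["StopLogging", "DeleteTrail"],
    },
}


def matches(pattern, event):
    """Simplified local matcher: exact-value lists and nested dicts only."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        if isinstance(expected, dict):
            if not isinstance(event[key], dict) or not matches(expected, event[key]):
                return False
        elif event[key] not in expected:
            return False
    return True


sample_event = {
    "source": "aws.cloudtrail",
    "detail-type": "AWS API Call via CloudTrail",
    "detail": {"eventSource": "cloudtrail.amazonaws.com", "eventName": "StopLogging"},
}
print(json.dumps(STOP_LOGGING_PATTERN))
print(matches(STOP_LOGGING_PATTERN, sample_event))
```

The JSON string printed first is the form you would supply as the EventPattern argument when creating a rule, for example with boto3.client("events").put_rule.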

Analyze log data and respond to incidents

As the final step in managing your log data, you can use AWS services such as Amazon Detective, Amazon ES, CloudWatch Logs Insights, and Amazon Athena to analyze your log data and gain operational insights.

  • Amazon Detective makes it easy to analyze, investigate, and quickly identify the root cause of security findings or suspicious activities. Detective automatically collects log data from your AWS resources. It then uses machine learning, statistical analysis, and graph theory to help you visualize and conduct faster and more efficient security investigations.
  • Incident Manager is a component of AWS Systems Manager which enables you to automatically take action when a critical issue is detected by an Amazon CloudWatch alarm or Amazon EventBridge event. Incident Manager executes pre-configured response plans to engage responders via SMS and phone calls, enable chat commands and notifications using AWS Chatbot, and execute AWS Systems Manager Automation runbooks. The Incident Manager console integrates with AWS Systems Manager OpsCenter to help you track incidents and post-incident action items from a central place that also synchronizes with popular third-party incident management tools such as Jira Service Desk and ServiceNow.
  • Amazon Elasticsearch Service (Amazon ES) is a fully managed service that collects, indexes, and unifies logs and metrics across your environment to give you unprecedented visibility into your applications and infrastructure. With Amazon ES, you get the scalability, flexibility, and security you need for the most demanding log analytics workloads. You can configure a CloudWatch Logs log group to stream data it receives to your Amazon ES cluster in near real time through a CloudWatch Logs subscription.
  • CloudWatch Logs Insights enables you to interactively search and analyze your log data in CloudWatch Logs.
  • Amazon Athena is an interactive query service that you can use to analyze data in Amazon S3 by using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
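
To give a feel for CloudWatch Logs Insights, the following sketch assembles a request that searches a log group for recent messages containing "AccessDenied". The log group name and the search term are hypothetical. The request keys follow the parameters of the CloudWatch Logs StartQuery API (boto3's start_query), which takes start and end times as epoch seconds.

```python
import datetime

# Logs Insights query: newest 20 messages that mention AccessDenied.
INSIGHTS_QUERY = (
    "fields @timestamp, @message "
    "| filter @message like /AccessDenied/ "
    "| sort @timestamp desc "
    "| limit 20"
)


def build_insights_request(log_group_name, hours_back, now=None):
    """Return keyword arguments for the CloudWatch Logs start_query call."""
    now = now or datetime.datetime.now(datetime.timezone.utc)
    start = now - datetime.timedelta(hours=hours_back)
    return {
        "logGroupName": log_group_name,
        "startTime": int(start.timestamp()),
        "endTime": int(now.timestamp()),
        "queryString": INSIGHTS_QUERY,
    }


request = build_insights_request("/agency/app/web", 24)
print(request["queryString"])
```

With boto3 available, boto3.client("logs").start_query(**request) returns a query ID that you can poll with get_query_results.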


As called out in the executive order, information from network and systems logs is invaluable for both investigation and remediation services. AWS provides a broad set of services to collect an unprecedented amount of data at very low cost, optionally store it for long periods of time in tiered storage, and analyze that telemetry information from your cloud-based workloads. These insights will help you improve your organization’s security posture and operational readiness and, as a result, improve your organization’s ability to deliver on its mission.

Next steps

To learn more about how AWS can help you meet the requirements of the executive order, see the other post in this series:



Derek Doerr

Derek is a Senior Solutions Architect with the Public Sector team at AWS. He has been working with AWS technology for over four years. Specializing in enterprise management and governance, he is passionate about helping AWS customers navigate their journeys to the cloud. In his free time, he enjoys time with family and friends, as well as scuba diving.

Automate Amazon Athena queries for PCI DSS log review using AWS Lambda

Post Syndicated from Logan Culotta original https://aws.amazon.com/blogs/security/automate-amazon-athena-queries-for-pci-dss-log-review-using-aws-lambda/

In this post, I will show you how to use AWS Lambda to automate PCI DSS (v3.2.1) evidence generation and daily log review to assist with your ongoing PCI DSS activities. We will specifically be looking at AWS CloudTrail logs stored centrally in Amazon Simple Storage Service (Amazon S3) (which is also a Well-Architected Security Pillar best practice) and using Amazon Athena to query them.

This post assumes familiarity with creating a database in Athena. If you’re new to Athena, please take a look at the Athena getting started guide and create a database before continuing. Take note of the bucket chosen for the output of Athena query results; we will use it later in this post.

In this post, we walk through:

  • Creating a partitioned table for your AWS CloudTrail logs. In order to reduce costs and time to query results in Athena, we’ll show you how to partition your data. If you’re not already familiar with partitioning, you can learn about it in the Athena user guide.
  • Constructing SQL queries to search for PCI DSS audit log evidence. The SQL queries that are provided in this post are directly related to PCI DSS requirement 10. Customizing these queries to meet your responsibilities may be able to assist you in preparing for a PCI DSS assessment.
  • Creating an AWS Lambda function to automate running these SQL queries daily, in order to help address the PCI DSS daily log review requirement 10.6.1.

Create and partition a table

The following code will create and partition a table for CloudTrail logs. Before you execute this query, be sure to replace the variable placeholders with the information from your database. They are:

  • <YOUR_TABLE> – the name of your Athena table
  • LOCATION – the path to your CloudTrail logs in Amazon S3. An example is included in the following code. It includes the variable placeholders:
    • <AWS_ACCOUNT_NUMBER> – your AWS account number. If using organizational CloudTrail, use the following format throughout the post for this variable: o-<orgID>/<ACCOUNT_NUMBER>
    • <LOG_BUCKET> – the bucket name where the CloudTrail logs to be queried reside

CREATE EXTERNAL TABLE <YOUR_TABLE> (
    eventVersion STRING,
    userIdentity STRUCT<
        type: STRING,
        principalId: STRING,
        arn: STRING,
        accountId: STRING,
        invokedBy: STRING,
        accessKeyId: STRING,
        userName: STRING,
        sessionContext: STRUCT<
            attributes: STRUCT<
                mfaAuthenticated: STRING,
                creationDate: STRING>,
            sessionIssuer: STRUCT<
                type: STRING,
                principalId: STRING,
                arn: STRING,
                accountId: STRING,
                userName: STRING>>>,
    eventTime STRING,
    eventSource STRING,
    eventName STRING,
    awsRegion STRING,
    sourceIpAddress STRING,
    userAgent STRING,
    errorCode STRING,
    errorMessage STRING,
    requestParameters STRING,
    responseElements STRING,
    additionalEventData STRING,
    requestId STRING,
    eventId STRING,
    resources ARRAY<STRUCT<
        arn: STRING,
        accountId: STRING,
        type: STRING>>,
    eventType STRING,
    apiVersion STRING,
    readOnly STRING,
    recipientAccountId STRING,
    serviceEventDetails STRING,
    sharedEventID STRING,
    vpcEndpointId STRING
)
COMMENT 'CloudTrail table'
PARTITIONED BY(region string, year string, month string, day string)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://<LOG_BUCKET>/AWSLogs/<AWS_ACCOUNT_NUMBER>/CloudTrail/'
TBLPROPERTIES ('classification'='cloudtrail');

Execute the query. You should see a message stating Query successful.

Figure 1: Query successful

The preceding query creates a CloudTrail table and defines the partitions in your Athena database. Before you begin running queries to generate evidence, you will need to run alter table commands to finalize the partitioning.

Be sure to update the following variable placeholders with your information:

  • <YOUR_DATABASE> – the name of your Athena database

Provide values for the following variables:

  • region – region of the logs to partition
  • month – month of the logs to partition
  • day – day of the logs to partition
  • year – year of the logs to partition
  • LOCATION – the path to your CloudTrail logs in Amazon S3 to partition, down to the specific day (should match the preceding values of region, month, day, and year). It includes the variable placeholders:
    • <LOG_BUCKET>

ALTER TABLE <YOUR_DATABASE>.<YOUR_TABLE> ADD PARTITION (region='us-east-1', month='02', day='28', year='2020') LOCATION 's3://<LOG_BUCKET>/AWSLogs/<AWS_ACCOUNT_NUMBER>/CloudTrail/us-east-1/2020/02/28/';

After the partition has been configured, you can query logs from the date and region that was partitioned. Here’s an example for PCI DSS requirement 10.2.4 (all relevant PCI DSS requirements are described later in this post).

SELECT * FROM <YOUR_DATABASE>.<YOUR_TABLE> WHERE eventname = 'ConsoleLogin' AND responseelements LIKE '%Failure%' AND region= 'us-east-1' AND year='2020' AND month='02' AND day='28';
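
Because each Region and each day needs its own partition, it can help to generate the ALTER TABLE statements rather than type them. The following is a minimal sketch that builds one statement per Region for a given date, following the S3 path layout shown above. The database, table, bucket, and account values are hypothetical placeholders, and IF NOT EXISTS is added so that rerunning a statement for an existing partition does not fail.

```python
import datetime


def build_add_partition(database, table, bucket, account, region, day):
    """Return an Athena ALTER TABLE statement for one Region and one day."""
    year, month, dom = day.strftime("%Y"), day.strftime("%m"), day.strftime("%d")
    location = (
        f"s3://{bucket}/AWSLogs/{account}/CloudTrail/{region}/{year}/{month}/{dom}/"
    )
    return (
        f"ALTER TABLE {database}.{table} ADD IF NOT EXISTS "
        f"PARTITION (region='{region}', month='{month}', day='{dom}', year='{year}') "
        f"LOCATION '{location}';"
    )


# Hypothetical values for illustration only.
log_day = datetime.date(2020, 2, 28)
statements = [
    build_add_partition(
        "cloudtrail_db", "cloudtrail_logs", "example-log-bucket",
        "111122223333", region, log_day,
    )
    for region in ["us-east-1", "us-west-2"]
]
for statement in statements:
    print(statement)
```

Each generated statement can be run in the Athena console in the same way as the hand-written example.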

Create a Lambda function to save time

As you can see, the process above can involve a lot of manual steps as you set up partitioning for each region and then query for each day or region. Let’s simplify the process by putting these steps into a Lambda function.

Use the Lambda console to create a function

To create the Lambda function:

  1. Open the Lambda console and choose Create function, and select the option to Author from scratch.
  2. Enter Athena_log_query as the function name, and select Python 3.8 as the runtime.
  3. Under Choose or create an execution role, select Create new role with basic Lambda permissions.
  4. Choose Create function.
  5. Once the function is created, select the Permissions tab at the top of the page and select the Execution role to view in the IAM console. It will look similar to the following figure.
    Figure 2: Permissions tab

Update the IAM Role to allow Lambda permissions to relevant services

  1. In the IAM console, select the policy name. Choose Edit policy, then select the JSON tab and paste the following code into the window, replacing the following variables and placeholders:
    • us-east-1 – This is the region where resources are. Change only if necessary.
    • <YOUR_TABLE>
    • <LOG_BUCKET>
    • <OUTPUT_LOG_BUCKET> – bucket name you chose to store the query results when setting up Athena.
        {
            "Version": "2012-10-17",
            "Statement": [
                {
                    "Effect": "Allow",
                    "Action": [
                        ...
                    ],
                    "Resource": [
                        ...
                    ]
                },
                {
                    "Effect": "Allow",
                    "Action": [
                        ...
                    ],
                    "Resource": "arn:aws:logs:us-east-1:<AWS_ACCOUNT_NUMBER>:*"
                }
            ]
        }

    Note: Depending on the environment, this policy might not be restrictive enough and should be limited to only users needing access to the cardholder data environment and audit logs. More information about restricting IAM policies can be found in IAM JSON Policy Elements: Condition Operators.

  2. Choose Review policy and then Save changes.
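
For orientation, the following sketch generates an execution policy of the general shape this function needs. It is not the policy from the original post: the action lists and resource ARNs are assumptions about what running Athena queries over a Glue-cataloged table typically requires, they have not been validated against this function, and the `aws` partition and the `primary` Athena workgroup are also assumed. All names passed in are hypothetical. Treat it as a starting point to review and tighten.

```python
import json


def build_execution_policy(region, account, database, table, log_bucket, output_bucket):
    """Return an IAM policy document (as a dict) for the Athena query function."""
    glue = f"arn:aws:glue:{region}:{account}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {   # run and monitor Athena queries (assumed workgroup: primary)
                "Effect": "Allow",
                "Action": ["athena:StartQueryExecution", "athena:GetQueryExecution",
                           "athena:StopQueryExecution"],
                "Resource": f"arn:aws:athena:{region}:{account}:workgroup/primary",
            },
            {   # read table metadata and add partitions in the Glue Data Catalog
                "Effect": "Allow",
                "Action": ["glue:GetDatabase", "glue:GetTable", "glue:GetPartitions",
                           "glue:BatchCreatePartition"],
                "Resource": [f"{glue}:catalog", f"{glue}:database/{database}",
                             f"{glue}:table/{database}/{table}"],
            },
            {   # read CloudTrail logs
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation", "s3:ListBucket", "s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{log_bucket}", f"arn:aws:s3:::{log_bucket}/*"],
            },
            {   # write and read query results
                "Effect": "Allow",
                "Action": ["s3:GetBucketLocation", "s3:ListBucket", "s3:GetObject",
                           "s3:PutObject"],
                "Resource": [f"arn:aws:s3:::{output_bucket}",
                             f"arn:aws:s3:::{output_bucket}/*"],
            },
            {   # write the function's own logs
                "Effect": "Allow",
                "Action": ["logs:CreateLogGroup", "logs:CreateLogStream",
                           "logs:PutLogEvents"],
                "Resource": f"arn:aws:logs:{region}:{account}:*",
            },
        ],
    }


policy = build_execution_policy(
    "us-east-1", "111122223333", "cloudtrail_db", "cloudtrail_logs",
    "example-log-bucket", "example-athena-results",
)
print(json.dumps(policy, indent=4))
```

The printed JSON can be pasted into the JSON tab of the policy editor after you have reviewed each statement.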

Customize the Lambda Function

  1. On the Lambda dashboard, choose the Configuration tab. In Basic settings, increase the function timeout to 5 minutes to ensure that the function always has time to finish running your queries, and then select Save. Best Practices for Developing on AWS Lambda has more tips for using Lambda.
  2. Paste the following code into the function editor on the Configuration tab, replacing the existing text. The code includes eight example queries to run and can be customized as needed.

    The first query will add partitions to your Amazon S3 logs so that the following seven queries will run quickly and be cost effective.

    This code combines the partitioning and example Athena queries to assist in meeting PCI DSS logging requirements, which are explained in more detail below.

    Replace these values in the code that follows:

    • <YOUR_TABLE>
    • <LOG_BUCKET>
    • REGION1 – first region to partition
    • REGION2 – second region to partition*
    import boto3
    import datetime
    import time

    #This should be the name of your Athena database
    ATHENA_DATABASE = "<YOUR_DATABASE>"
    #This should be the name of your Athena database table
    ATHENA_TABLE = "<YOUR_TABLE>"
    #This is the Amazon S3 bucket name you want partitioned and logs queried from:
    LOG_BUCKET = "<LOG_BUCKET>"
    #AWS Account number for the Amazon S3 path to your CloudTrail logs
    AWS_ACCOUNT_ID = "<AWS_ACCOUNT_NUMBER>"
    #This is the Amazon S3 bucket name for the Athena Query results:
    OUTPUT_LOG_BUCKET = "<OUTPUT_LOG_BUCKET>"
    #Define regions to partition
    REGION1 = "us-east-1"
    REGION2 = "us-west-2"
    RETRY_COUNT = 50

    #Getting the current date and splitting into variables to use in queries below
    CURRENT_DATE = datetime.datetime.today()
    ATHENA_YEAR = CURRENT_DATE.strftime('%Y')
    ATHENA_MONTH = CURRENT_DATE.strftime('%m')
    ATHENA_DAY = CURRENT_DATE.strftime('%d')
    #location for the Athena query results
    OUTPUT_LOCATION = "s3://"+OUTPUT_LOG_BUCKET+"/DailyAthenaLogs/CloudTrail/"+str(CURRENT_DATE.isoformat())

    #Athena Query definitions for PCI DSS requirements
    YEAR_MONTH_DAY = f'year=\'{ATHENA_YEAR}\' AND month=\'{ATHENA_MONTH}\' AND day=\'{ATHENA_DAY}\';'
    ATHENA_DB_TABLE = f'{ATHENA_DATABASE}.{ATHENA_TABLE}'
    SELECT_STATEMENT = f'SELECT * FROM {ATHENA_DB_TABLE} WHERE'
    PARTITION_STATEMENT_1 = f'partition (region="{REGION1}", month="{ATHENA_MONTH}", day="{ATHENA_DAY}", year="{ATHENA_YEAR}")'
    LOCATION_1 = f"location 's3://{LOG_BUCKET}/AWSLogs/{AWS_ACCOUNT_ID}/CloudTrail/{REGION1}/{ATHENA_YEAR}/{ATHENA_MONTH}/{ATHENA_DAY}/'"
    PARTITION_STATEMENT_2 = f'partition (region="{REGION2}", month="{ATHENA_MONTH}", day="{ATHENA_DAY}", year="{ATHENA_YEAR}")'
    LOCATION_2 = f"location 's3://{LOG_BUCKET}/AWSLogs/{AWS_ACCOUNT_ID}/CloudTrail/{REGION2}/{ATHENA_YEAR}/{ATHENA_MONTH}/{ATHENA_DAY}/'"
    LIKE_BUCKET = f' \'%{LOG_BUCKET}%\''

    #Query to partition selected regions
    QUERY_1 = f'ALTER TABLE {ATHENA_DB_TABLE} ADD IF NOT EXISTS {PARTITION_STATEMENT_1} {LOCATION_1} {PARTITION_STATEMENT_2} {LOCATION_2}'
    #Access to audit trails or CHD 10.2.1/10.2.3
    QUERY_2 = f'{SELECT_STATEMENT} requestparameters LIKE {LIKE_BUCKET} AND sourceipaddress <> \'cloudtrail.amazonaws.com\' AND sourceipaddress <> \'athena.amazonaws.com\' AND eventName = \'GetObject\' AND {YEAR_MONTH_DAY}'
    #Root Actions PCI DSS 10.2.2
    QUERY_3 = f'{SELECT_STATEMENT} userIdentity.sessionContext.sessionIssuer.userName LIKE \'%root%\' AND {YEAR_MONTH_DAY}'
    #Failed Logons PCI DSS 10.2.4
    QUERY_4 = f'{SELECT_STATEMENT} eventname = \'ConsoleLogin\' AND responseelements LIKE \'%Failure%\' AND {YEAR_MONTH_DAY}'
    #Privilege changes PCI DSS 10.2.5.b, 10.2.5.c
    QUERY_5 = f'{SELECT_STATEMENT} eventname LIKE \'%AddUserToGroup%\' AND requestparameters LIKE \'%Admin%\' AND {YEAR_MONTH_DAY}'
    # Initialization, stopping, or pausing of the audit logs PCI DSS 10.2.6
    # The OR conditions are parenthesized so that the date filter applies to all of them
    QUERY_6 = f'{SELECT_STATEMENT} (eventname = \'StopLogging\' OR eventname = \'StartLogging\') AND {YEAR_MONTH_DAY}'
    #Suspicious activity PCI DSS 10.6
    QUERY_7 = f'{SELECT_STATEMENT} (eventname LIKE \'%DeleteSecurityGroup%\' OR eventname LIKE \'%CreateSecurityGroup%\' OR eventname LIKE \'%UpdateSecurityGroup%\' OR eventname LIKE \'%AuthorizeSecurityGroup%\') AND {YEAR_MONTH_DAY}'
    QUERY_8 = f'{SELECT_STATEMENT} eventname LIKE \'%Subnet%\' and eventname NOT LIKE \'Describe%\' AND {YEAR_MONTH_DAY}'

    #Defining function to generate query status for each query
    def query_stat_fun(query, response):
        client = boto3.client('athena')
        query_execution_id = response['QueryExecutionId']
        print(query_execution_id +' : '+query)
        for i in range(1, 1 + RETRY_COUNT):
            query_status = client.get_query_execution(QueryExecutionId=query_execution_id)
            query_fail_status = query_status['QueryExecution']['Status']
            query_execution_status = query_fail_status['State']
            if query_execution_status == 'SUCCEEDED':
                print("STATUS:" + query_execution_status)
                break
            if query_execution_status in ('FAILED', 'CANCELLED'):
                #Print the full status, including the failure reason, and move on to the next query
                print(query_fail_status)
                break
            #Query is still queued or running: wait a little longer on each attempt
            print("STATUS:" + query_execution_status)
            time.sleep(i)
        else:
            #The loop finished without the query reaching a final state
            client.stop_query_execution(QueryExecutionId=query_execution_id)
            raise Exception('Maximum Retries Exceeded')

    def lambda_handler(event, context):
        client = boto3.client('athena')
        queries = [QUERY_1, QUERY_2, QUERY_3, QUERY_4, QUERY_5, QUERY_6, QUERY_7, QUERY_8]
        for query in queries:
            response = client.start_query_execution(
                QueryString=query,
                QueryExecutionContext={
                    'Database': ATHENA_DATABASE },
                ResultConfiguration={
                    'OutputLocation': OUTPUT_LOCATION })
            query_stat_fun(query, response)

    Note: More regions can be added if you have additional regions to partition. The ADD partition statement can be copied and pasted to add additional regions as needed. Additionally, you can hard code the regions for your environment into the statements.

  3. Choose Save in the top right.

Athena Queries used to collect evidence

The queries used to gather evidence for PCI DSS are broken down from the Lambda function we created, using the partitioned date example from above. They are listed with their respective requirement.

Note: AWS owns the security OF the cloud, providing high levels of security in alignment with our numerous compliance programs. The customer is responsible for the security of their resources IN the cloud, keeping its content secure and compliant. The queries below are meant to be a proof of concept and should be tailored to your environment.

10.2.1/10.2.3 – Implement automated audit trails for all system components to reconstruct access to either or both cardholder data and audit trails:

"SELECT * FROM <YOUR_DATABASE>.<YOUR_TABLE> WHERE requestparameters LIKE '%<LOG_BUCKET>%' AND sourceipaddress <> 'cloudtrail.amazonaws.com' AND sourceipaddress <> 'athena.amazonaws.com' AND eventName = 'GetObject' AND year='2020' AND month='02' AND day='28';"

10.2.2 – Implement automated audit trails for all system components to reconstruct all actions taken by anyone using root or administrative privileges.

"SELECT * FROM <YOUR_DATABASE>.<YOUR_TABLE> WHERE userIdentity.sessionContext.sessionIssuer.userName LIKE '%root%' AND year='2020' AND month='02' AND day='28';"

10.2.4 – Implement automated audit trails for all system components to reconstruct invalid logical access attempts.

"SELECT * FROM <YOUR_DATABASE>.<YOUR_TABLE> WHERE eventname = 'ConsoleLogin' AND responseelements LIKE '%Failure%' AND year='2020' AND month='02' AND day='28';"

10.2.5.b – Verify all elevation of privileges is logged.

10.2.5.c – Verify all changes, additions, or deletions to any account with root or administrative privileges are logged:

"SELECT * FROM <YOUR_DATABASE>.<YOUR_TABLE> WHERE eventname LIKE '%AddUserToGroup%' AND requestparameters LIKE '%Admin%' AND year='2020' AND month='02' AND day='28';"

10.2.6 – Implement automated audit trails for all system components to reconstruct the initialization, stopping, or pausing of the audit logs:

"SELECT * FROM <YOUR_DATABASE>.<YOUR_TABLE> WHERE (eventname = 'StopLogging' OR eventname = 'StartLogging') AND year='2020' AND month='02' AND day='28';"

10.6 – Review logs and security events for all system components to identify anomalies or suspicious activity:

"SELECT * FROM <YOUR_DATABASE>.<YOUR_TABLE> WHERE (eventname LIKE '%DeleteSecurityGroup%' OR eventname LIKE '%CreateSecurityGroup%' OR eventname LIKE '%UpdateSecurityGroup%' OR eventname LIKE '%AuthorizeSecurityGroup%') AND year='2020' AND month='02' AND day='28';"

"SELECT * FROM <YOUR_DATABASE>.<YOUR_TABLE> WHERE eventname LIKE '%Subnet%' and eventname NOT LIKE 'Describe%' AND year='2020' AND month='02' AND day='28';" 

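Every query above ends with the same year, month, and day partition filter. Because SQL evaluates AND before OR, an event predicate that contains OR needs parentheses, or the date filter applies only to its last condition. The following is a minimal Python sketch of a helper that wraps the predicate and appends the filter. The name build_audit_query and its parameters are illustrative assumptions and are not part of the Lambda function described in this post.

```python
from datetime import date


def build_audit_query(database: str, table: str, predicate: str, day: date) -> str:
    """Return an Athena query restricted to a single day's partition.

    The predicate is wrapped in parentheses so that any OR inside it is
    evaluated before the partition filter is combined with AND.
    """
    return (
        f"SELECT * FROM {database}.{table} "
        f"WHERE ({predicate}) "
        f"AND year='{day:%Y}' AND month='{day:%m}' AND day='{day:%d}';"
    )


if __name__ == "__main__":
    # Hypothetical database and table names, for illustration only.
    print(build_audit_query(
        "my_database",
        "my_table",
        "eventname = 'StopLogging' OR eventname = 'StartLogging'",
        date(2020, 2, 28),
    ))
```

Passing the date as a value rather than hard-coding it also makes it easier to run the same queries for the previous day on a schedule.
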
You can use the AWS Command Line Interface (AWS CLI) to invoke the Lambda function using the following command, replacing <YOUR_FUNCTION> with the name of the Lambda function you created:

aws lambda invoke --function-name <YOUR_FUNCTION> outfile

The AWS Lambda API Reference has more information on using Lambda with AWS CLI.

Note: The function writes its results to the location specified by the OUTPUT_LOCATION variable within the Lambda function.
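If you prefer to invoke the function from a script rather than the AWS CLI, the same call can be sketched in Python. This is a minimal example that assumes you pass in a boto3 Lambda client (for example, boto3.client("lambda")); the helper name invoke_function and the shape of the returned dictionary are illustrative choices.

```python
import json


def invoke_function(lambda_client, function_name: str) -> dict:
    """Invoke a Lambda function synchronously and return its status and payload."""
    response = lambda_client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # wait for the function to finish
    )
    raw = response["Payload"].read()
    return {
        "status_code": response["StatusCode"],
        "payload": json.loads(raw) if raw else None,
    }
```

A call such as invoke_function(boto3.client("lambda"), "<YOUR_FUNCTION>") plays the same role as the CLI command above, with the response payload returned to the caller instead of written to outfile.
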

Use Amazon CloudWatch to run your Lambda function

You can create a rule in CloudWatch to have this function run automatically on a set schedule.

Create a CloudWatch rule

  1. From the CloudWatch dashboard, under Events, select Rules, then Create rule.
  2. Under Event Source, select the radio button for Schedule, and then either choose a fixed rate or enter a custom cron expression.
  3. Finally, in the Targets section, choose Lambda function and select your Lambda function from the dropdown list.

    The example screenshot shows a CloudWatch rule configured to invoke the Lambda function daily:

    Figure 3: CloudWatch rule


  4. Once the schedule is configured, choose Configure details to move to the next screen.
  5. Enter a name for your rule, make sure that State is enabled, and choose Create rule.
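The console steps above can also be sketched with boto3. This is a minimal example under stated assumptions: the events and Lambda clients are passed in (for example, boto3.client("events") and boto3.client("lambda")), the schedule is a fixed daily rate, and the helper name schedule_daily_invocation and the statement ID format are illustrative. When you use the console, the permission that allows the rule to invoke the function is normally added on your behalf; in a script, you add it yourself.

```python
def schedule_daily_invocation(events_client, lambda_client, rule_name,
                              function_name, function_arn):
    """Create an enabled daily schedule rule that targets a Lambda function."""
    rule = events_client.put_rule(
        Name=rule_name,
        ScheduleExpression="rate(1 day)",
        State="ENABLED",
    )
    # Allow the rule to invoke the function.
    lambda_client.add_permission(
        FunctionName=function_name,
        StatementId=f"{rule_name}-invoke",  # assumes rule_name has no dots
        Action="lambda:InvokeFunction",
        Principal="events.amazonaws.com",
        SourceArn=rule["RuleArn"],
    )
    # Point the rule at the function.
    events_client.put_targets(
        Rule=rule_name,
        Targets=[{"Id": "1", "Arn": function_arn}],
    )
    return rule["RuleArn"]
```
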

Check that your function is running

You can then navigate to your Lambda function’s CloudWatch log group to see if the function is running as intended.

To locate the appropriate CloudWatch log group, open your Lambda function in the console, select the Monitoring tab, then choose View logs in CloudWatch.

Figure 4: View logs in CloudWatch


You can take this a step further and set up an Amazon Simple Notification Service (Amazon SNS) notification to email you when the function is triggered.


In this post, we walked through partitioning an Athena table, which assists in reducing time and cost when running queries on your S3 buckets. We then constructed example SQL queries related to PCI DSS requirement 10, to assist in audit preparation. Finally, we created a Lambda function to automate running daily queries to pull PCI DSS audit log evidence from Amazon S3, to assist with the PCI DSS daily log review requirement. I encourage you to customize, add, or remove the SQL queries to best fit your needs and compliance requirements.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Logan Culotta

Logan Culotta is a Security Assurance Consultant, and a current Qualified Security Assessor (QSA). Logan is part of the AWS Security Assurance team, which is also a Qualified Security Assessor Company (QSAC). He enjoys finding ways to automate compliance and security in the AWS cloud. In his free time, you can find him spending time with his family, road cycling, or cooking.

Export logs from Cloudflare Gateway with Logpush

Post Syndicated from Sam Rhea original https://blog.cloudflare.com/export-logs-from-cloudflare-gateway-with-logpush/


Like many people, I have spent a lot more time at home in the last several weeks. I use the free version of Cloudflare Gateway, part of Cloudflare for Teams, to secure the Internet-connected devices on my WiFi network. In the last week, Gateway has processed about 114,000 DNS queries from those devices and blocked nearly 100 as potential security risks.

I can search those requests in the Cloudflare for Teams UI. The logs capture the hostname requested, the time of the request, and Gateway’s decision to allow or block. This works fine for one-off investigations into a block, but does not help if I want to analyze the data more thoroughly. The last thing I want to do is click through hundreds or thousands of pages.

That problem is even more difficult for organizations attempting to keep hundreds or thousands of users and their devices secure. Whether they secure roaming devices with DoH or a static IP address, or keep users safe as they return to offices, deployments at that scale need a better option for auditing tens or hundreds of millions of queries each week.

Starting today, you can configure the automatic export of logs from Cloudflare Gateway to third-party storage destinations or security information and event management (SIEM) tools. Once exported, your team can analyze and audit the data as needed. The feature builds on the same robust Cloudflare Logpush Service that powers data export from Cloudflare’s infrastructure products.

Cloudflare Gateway

Cloudflare Gateway is one-half of Cloudflare for Teams, Cloudflare’s platform for securing users, devices, and data. With Cloudflare for Teams, our global network becomes your team’s network, replacing on-premise appliances and security subscriptions with a single solution delivered closer to your users – wherever they work.


As part of that platform, Cloudflare Gateway blocks threats on the public Internet from becoming incidents inside your organization. Gateway’s first release added DNS security filtering and content blocking to the world’s fastest DNS resolver, Cloudflare’s

Deployment takes less than 5 minutes. Teams can secure entire office networks and segment traffic reports by location. For distributed organizations, Gateway can be deployed via MDM on networks that support IPv6 or using a dedicated IPv4 address as part of a Cloudflare Enterprise account.

With secure DNS filtering, administrators can click a single button to block known threats, like sources of malware or phishing sites. Policies can be extended to block specific categories, like gambling sites or social media. When users request a filtered site, Gateway stops the DNS query from resolving and prevents the device from connecting to a malicious destination or hostname with blocked material.

Cloudflare Logpush

The average user makes about 5,000 DNS queries each day. For an organization with 1,000 employees, that produces 5M rows of data daily. That data includes regular Internet traffic, but also potential trends like targeted phishing campaigns or the use of cloud storage tools that are not approved by your IT organization.
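The arithmetic behind those figures is simple to reproduce. The following short sketch uses the 5,000 queries per user per day quoted above and extends it to a weekly total; the function names are illustrative.

```python
QUERIES_PER_USER_PER_DAY = 5_000  # average quoted in the paragraph above


def rows_per_day(users: int) -> int:
    """Estimated DNS log rows produced per day for a given number of users."""
    return users * QUERIES_PER_USER_PER_DAY


def rows_per_week(users: int) -> int:
    """Estimated DNS log rows produced per week for a given number of users."""
    return rows_per_day(users) * 7


if __name__ == "__main__":
    print(rows_per_day(1_000))   # 5000000
    print(rows_per_week(1_000))  # 35000000
```

At 1,000 employees that is already 35M rows each week, which is why paging through results in a UI stops being practical.
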

The Cloudflare for Teams UI presents some summary views of that data, but each organization has different needs for audit, retention, or analysis. The best way to let you investigate the data in any way you need is to give you all of it. However, the volume of data and how often you might need to review it mean that API calls or CSV downloads are not suitable. A real logging pipeline is required.

Cloudflare Logpush solves that challenge. Cloudflare’s Logpush Service exports the data captured by Cloudflare’s network to storage destinations that you control. Rather than requiring your team to build a system to call Cloudflare APIs and pull data, Logpush routinely exports data with fields that you configure.

Cloudflare’s data team built the Logpush pipeline to make it easy to integrate with popular storage providers. Logpush supports AWS S3, Google Cloud Storage, Sumo Logic, and Microsoft Azure out of the box. Administrators can choose a storage provider, validate they own the destination, and configure exports of logs that will send deltas every five minutes from that point onward.

How it works

When enabled, you can navigate to a new section of the Logs component in the Cloudflare for Teams UI, titled “Logpush”. Once there, you’ll be able to choose which fields you want to export from Cloudflare Gateway and the storage destination.


The Logpush wizard will walk you through validating that you own the destination and configuring how you want folders to be structured. When saved, Logpush will send updated logs every five minutes to that destination. You can configure multiple destinations and monitor for any issues by returning to this section of the Cloudflare for Teams UI.
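Once the logs land in your storage destination, analysis can be as simple as a short script. The following is a minimal sketch that assumes the exported files are gzip-compressed, newline-delimited JSON. The field name "decision" is a hypothetical placeholder; substitute whichever field you chose to export in the wizard.

```python
import gzip
import json
from collections import Counter


def count_decisions(path: str, decision_field: str = "decision") -> Counter:
    """Count log records by the value of one field in a gzipped NDJSON file."""
    counts = Counter()
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue  # skip blank lines
            record = json.loads(line)
            counts[record.get(decision_field, "unknown")] += 1
    return counts
```

Running this over a week of exported files gives the allowed and blocked totals without clicking through pages in the UI.
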


What’s next?

Cloudflare’s Logpush Service is only available to customers on a contract plan. If you are interested in upgrading, please let us know. All Cloudflare for Teams plans include 30 days of data that can be searched in the UI.

Cloudflare Access, the other half of Cloudflare for Teams, also supports granular log export. You can configure Logpush for Access in the Cloudflare dashboard that houses Infrastructure features like the WAF and CDN. We plan to migrate that configuration to this UI in the near future.

Enable automatic logging of web ACLs by using AWS Config

Post Syndicated from Mike George original https://aws.amazon.com/blogs/security/enable-automatic-logging-of-web-acls-by-using-aws-config/

In this blog post, I will show you how to use AWS Config, with its auto-remediation functionality, to ensure that all web ACLs have logging enabled. The AWS CloudFormation template included in this blog post implements this solution and will help you get started with managing web ACL logging at scale.

AWS Firewall Manager can automatically deploy an AWS Web Application Firewall (WAF) rule to protect your applications when your organization creates new Application Load Balancers, API Gateways, and CloudFront distributions. However, you still have to enable logging for web ACLs on an individual basis. Information contained in web ACL logs includes the time that AWS WAF received the request for your AWS resource, detailed information about the request, and the action for the rule that each request matched. This data can be extremely important for compliance and auditing needs, debugging, or forensic research.

Web ACL logging is a best practice, and is a business requirement within many organizations. Rather than leaving logging as a manual step in a deployment process, I will show you how to use automated mechanisms to enable logging, so that your business can meet its security and compliance requirements.


The solution in this blog post assumes that you are already using AWS Web Application Firewall (AWS WAF) and AWS Firewall Manager to manage your firewall rules at scale. The solution also uses AWS CloudFormation, AWS Config, AWS Lambda, AWS Systems Manager, Amazon Kinesis Data Firehose, and Amazon S3.

Using AWS Config to ensure automatic logging

AWS Config is a service that enables you to evaluate the configurations of the AWS resources in your account. AWS Config continuously monitors and records resource configuration changes. AWS Config can alert you and perform actions when resources get added, removed, or change state. AWS Config has a set of built-in rules that it can evaluate your AWS resources against, or you can build your own AWS Config rules.

In fact, when you enable AWS Firewall Manager to automatically apply AWS WAF rules to your Application Load Balancers, API Gateways, or CloudFront distributions, AWS Firewall Manager creates AWS Config rules behind the scenes. These AWS Config rules are designed so that the correct web ACLs are automatically applied whenever new Application Load Balancers, API Gateways, or CloudFront distributions are created. Enterprises use AWS Config rules to ensure consistent compliance with their internal organizational policies. You can use AWS Config to ensure that your AWS WAF rules have logging enabled.

When creating custom AWS Config rules, you associate each custom rule with an AWS Lambda function, which contains the logic that evaluates whether your AWS resource complies with the rule. You can configure the custom AWS Config rule to invoke the Lambda function in response to a configuration change, or to run periodically. After the Lambda function executes, it evaluates whether your resource complies with your rule, and it then sends the results back to AWS Config. If the resource violates the conditions of the rule, then AWS Config flags the resource as noncompliant. For more information, see How AWS Config Works in the AWS Developer Documentation.

You can also perform auto-remediation on non-compliant resources by using the built-in remediation functionality in AWS Config. When AWS Config detects a noncompliant resource, it can invoke an automation function that is defined as a Systems Manager Automation document. Systems Manager has a number of pre-built Automation documents that can do things such as create an Amazon Machine Image (AMI), create a Jira issue, and create a ServiceNow incident. For the full list of built-in Automation documents, see Systems Manager Automation Document Details Reference.

You can also create your own Automation documents to support business cases not covered by the built-in Systems Manager Automation documents. Systems Manager Automation documents can run scripts, call AWS API functions, call custom Lambda functions, or execute a CloudFormation stack, and more.

Overview of the solution

The following is a high-level overview diagram of the solution described in this post, when an AWS WAF web ACL has a configuration change:

Figure 1: High-level solution overview


When an AWS WAF web ACL has a configuration change, the following steps will occur:

  1. The creation of the AWS WAF web ACL generates a ConfigurationItemChangeNotification, which is sent to AWS Config (step 1).
  2. AWS Config in turn sends the notification on to an AWS Lambda function (step 2), which determines if the web ACL in question is “compliant”. In this case, compliant means that the web ACL has logging configured.
  3. Lambda queries the web ACL (step 3) to determine if logging is enabled.
  4. The Lambda query results are then reported back to AWS Config (step 4).
  5. If logging is not enabled, the web ACL is seen as noncompliant and AWS Config kicks off an auto-remediation step (step 5) by executing a Systems Manager Automation document.
  6. The Automation document calls a Lambda function (step 6).
  7. The Lambda function attempts to enable logging on the web ACL (step 7).
  8. If logging is successfully enabled, then the web ACL automatically sends logs through a Kinesis Data Firehose delivery stream (step 8).
  9. The Kinesis Data Firehose delivery stream stores the data in an S3 bucket (step 9).
  10. After the Lambda function has completed enabling logging functionality, it reports back to Systems Manager (step 10).
  11. Systems Manager reports back to AWS Config (step 11).
  12. At this point, the web ACL compliance status still hasn’t been updated. AWS Config still believes the web ACL is noncompliant, so AWS Config calls the Lambda function (step 2) to determine if the compliance status has changed.
  13. Lambda checks the web ACL again (step 3), determines that it is compliant, and returns the results to AWS Config (step 4).

Because AWS Config stores the compliance history of the web ACL configuration, compliance team members will be able to go into AWS Config and see the history of the web ACL, as shown in the following screenshot. You will be able to see that the configuration state was noncompliant when the web ACL was created, and that it became compliant after logging was enabled.

Figure 2: Web ACL compliance history in AWS Config


Using the CloudFormation template

To automatically enable logging on all web ACLs, I created a CloudFormation template for you to use to set up all the necessary components. The CloudFormation template creates the following:

  • An S3 bucket to store the logs.
  • A Kinesis Data Firehose delivery stream.
  • An AWS Config rule.
  • A Systems Manager Automation document.
  • Two Lambda functions. The first Lambda function is used by AWS Config to evaluate whether the web ACL has logging enabled. The second Lambda function is used by the Systems Manager Automation document to automatically enable logging.
  • AWS IAM policies and roles to ensure that everything works correctly.

I designed this CloudFormation template to be executed in an AWS account that already has AWS Firewall Manager enabled; however, nothing prevents you from running it in an AWS account that does not have it enabled. Accounts without AWS Firewall Manager won’t benefit from the central configuration and management that AWS Firewall Manager provides. However, this stack will still allow you to ensure that existing or new web ACLs have logging enabled.

To deploy the template

  1. Copy the CloudFormation template file that follows these instructions, and save it to your computer.
  2. Sign in to the AWS account where you want to deploy this stack.
  3. Choose Services, choose CloudFormation, and then choose Stacks.
  4. In the upper right, choose Create stack, and then choose With new resources (standard).
  5. In the Specify template section, choose Upload a template file, and then select Choose file.
  6. Navigate to the file that you saved in step 1. Choose Next.
  7. In the Stack name field, enter a stack name that is meaningful to you. Choose Next, and choose Next again.
  8. Select the checkbox that says I acknowledge that AWS CloudFormation might create IAM resources and choose the Create stack button.

CloudFormation template file

# This CloudFormation template enables auto-logging of web ACLs through the use of 
# AWS Config and Systems Manager Automation documents.
# This solution creates an S3 bucket, a Kinesis Data Firehose, an AWS Config rule, 
# a Systems Manager Automation document, and two Lambda functions to evaluate and 
# remediate when web ACLs are not configured for logging.

    Value: !Ref S3Bucket
    Value: !Ref Firehose

    Type: AWS::S3::Bucket

    Type: AWS::KinesisFirehose::DeliveryStream
          - ''
          - - aws-waf-logs-
            - !Ref AWS::StackName
        RoleARN: !GetAtt DeliveryRole.Arn
        BucketARN: !GetAtt S3Bucket.Arn
          IntervalInSeconds: 300
          SizeInMBs: 5
        CompressionFormat: UNCOMPRESSED

    Type: AWS::IAM::Role
        Version: '2012-10-17'
          - Sid: ''
            Effect: Allow
              Service: firehose.amazonaws.com
            Action: 'sts:AssumeRole'
                'sts:ExternalId': !Ref 'AWS::AccountId'

    Type: AWS::IAM::Policy
      PolicyName: 'firehose_delivery_policy'
        Version: 2012-10-17
          - Effect: Allow
              - 's3:AbortMultipartUpload'
              - 's3:GetBucketLocation'
              - 's3:GetObject'
              - 's3:ListBucket'
              - 's3:ListBucketMultipartUploads'
              - 's3:PutObject'
              - !GetAtt S3Bucket.Arn
              - !Join
                - ''
                - - !GetAtt S3Bucket.Arn
                  - '*'
        - !Ref DeliveryRole

    Type: "AWS::SSM::Document"
        schemaVersion: "0.3"
        description: "Adds logging to non-compliant WebACLs"
        assumeRole: "{{ AutomationAssumeRole }}"
            type: "String"
            description: "(Required) The WebACLId of the WebACL"
            type: "String"
            description: "(Optional) The ARN of the role that allows Automation to perform the actions on your behalf"
          - name: performRemediation
            action: aws:invokeLambdaFunction
              FunctionName: !GetAtt WafLambda.Arn
              Payload: '{"webAclName":"{{ WebACLId }}"}'
      DocumentType: Automation

    Type: AWS::Lambda::Function
      # The AmazonSSMAutomationRole role expects the Lambda function name to begin with Automation*
      FunctionName: !Sub Automation-${AWS::StackName}-EnableWafLogging
          !Sub |
            #CODE GOES HERE
            import boto3
            import json
            import os

            # This Lambda function ensures that all WAF web ACLs have logging enabled.
            # Trigger Type: SSM Automation
            # Scope of Automation: AWS::WAF::WebACL & AWS::WAFRegional::WebACL

            FIREHOSE_ARN = os.environ['FIREHOSE_ARN']
            CONFIG_RULE_NAME = os.environ['CONFIG_RULE_NAME']

            def evaluate_compliance(webAclName):
              hasConfig = False

              #Setting up variables
              client = ''
              response = ''
              wafArn = ''

              #Check if this is a WAFv2. The ResourceId passed in is already the ARN
              if webAclName.find('arn:aws:wafv2:') >= 0:
                wafArn = webAclName
                client = boto3.client('wafv2')
              else:
                isWebAcl = True
                #Test if this is AWS::WAF::WebACL
                try:
                  print('Testing for WAF::WebACL')
                  client = boto3.client('waf')
                  response = client.get_web_acl(WebACLId=webAclName)
                except Exception:
                  isWebAcl = False

                if not isWebAcl:
                  #Test if this is AWS::WAFRegional::WebACL
                  try:
                    print('Testing for WAFRegional::WebACL')
                    client = boto3.client('waf-regional')
                    response = client.get_web_acl(WebACLId=webAclName)
                  except Exception:
                    #Neither API recognizes this ID, so there is nothing to remediate
                    print('Unable to find web ACL: ' + webAclName)
                    raise

                wafArn = response['WebACL']['WebACLArn']

              try:
                response = client.get_logging_configuration(ResourceArn=wafArn)
                hasConfig = True
              except Exception:
                print('Attempting to fix non-compliance')
                print('WAF ARN: ' + wafArn)
                response = client.put_logging_configuration(LoggingConfiguration={'ResourceArn': wafArn,'LogDestinationConfigs': [ FIREHOSE_ARN ]})

            def regen_compliance():
                print("Attempting to re-run AWS Config rule to update compliance status")
                client = boto3.client('config')
                response = client.start_config_rules_evaluation(ConfigRuleNames=[CONFIG_RULE_NAME])

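            def handler(event, context):
              aclName = event['webAclName']
              # Enable logging if it is missing, then ask AWS Config to re-evaluate the rule
              evaluate_compliance(aclName)
              regen_compliance()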


      Handler: "index.handler"
          FIREHOSE_ARN: !GetAtt Firehose.Arn
          CONFIG_RULE_NAME: !Ref ConfigRule
      Runtime: python3.7
      Timeout: 30
      Role: !GetAtt LambdaExecutionRole.Arn

    Type: AWS::Config::ConfigRule
          - ''
          - - Enable-WebACL-Logging-
            - !Ref AWS::StackName
      Description: 'Ensures that all new web ACLs have logging enabled'
          - AWS::WAF::WebACL
          - AWS::WAFv2::WebACL
          - AWS::WAFRegional::WebACL
        Owner: "CUSTOM_LAMBDA"
        - EventSource: "aws.config"
          MessageType: ConfigurationItemChangeNotification
        - EventSource: "aws.config"
          MessageType: OversizedConfigurationItemChangeNotification
        SourceIdentifier: !GetAtt Lambda.Arn
    DependsOn: PermissionToCallLambda

    Type: "AWS::Config::RemediationConfiguration"
      # AutomationAssumeRole, MaximumAutomaticAttempts and RetryAttemptSeconds are Required if Automatic is true
      Automatic: true
      ConfigRuleName: !Ref ConfigRule
      MaximumAutomaticAttempts: 1
              - !GetAtt AutoRemediationIamRole.Arn
            Value: RESOURCE_ID
      RetryAttemptSeconds: 60
      TargetId: !Ref AutomationDoc
      TargetType: SSM_DOCUMENT

    Type: 'AWS::IAM::Role'
        Version: '2012-10-17'
          - Effect: Allow
                - ssm.amazonaws.com
              - 'sts:AssumeRole'
        - 'arn:aws:iam::aws:policy/service-role/AmazonSSMAutomationRole'
      Policies: []

    Type: AWS::Lambda::Permission
      FunctionName: !GetAtt WafLambda.Arn
      Action: "lambda:InvokeFunction"
      Principal: "ssm.amazonaws.com"

    Type: AWS::Lambda::Permission
      FunctionName: !GetAtt Lambda.Arn
      Action: "lambda:InvokeFunction"
      Principal: "config.amazonaws.com"

    Type: AWS::Lambda::Function
          !Sub |
            import boto3
            import json
            # This Lambda function determines if WAF web ACLs have logging enabled
            # Trigger Type: Config: Change Triggered
            # Scope of Changes: AWS::WAF::WebACL, AWS::WAFv2::WebACL & AWS::WAFRegional::WebACL

            def is_applicable(config_item, event):
              status = config_item['configurationItemStatus']
              event_left_scope = event['eventLeftScope']
              test = ((status in ['OK', 'ResourceDiscovered']) and
                not event_left_scope)
              return test

            def evaluate_compliance(config_item):
              wafArn = config_item['ARN']
              hasConfig = False

              client = ''
              if (config_item['resourceType'] == 'AWS::WAF::WebACL'):
                client = boto3.client('waf')
              elif (config_item['resourceType'] == 'AWS::WAFRegional::WebACL'):
                client = boto3.client('waf-regional')
              elif (config_item['resourceType'] == 'AWS::WAFv2::WebACL'):
                client = boto3.client('wafv2')

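              try:
                response = client.get_logging_configuration(ResourceArn=wafArn)
                hasConfig = True
              except Exception:
                # No logging configuration could be retrieved for this web ACL
                hasConfig = False

              if not hasConfig:
                return 'NON_COMPLIANT'
              else:
                return 'COMPLIANT'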

            def handler(event, context):
              invoking_event = json.loads(event['invokingEvent'])
              compliance_value = 'NOT_APPLICABLE'

              if is_applicable(invoking_event['configurationItem'], event):
                compliance_value = evaluate_compliance(invoking_event['configurationItem'])

              config = boto3.client('config')
              response = config.put_evaluations(
                Evaluations=[
                  {
                    'ComplianceResourceType': invoking_event['configurationItem']['resourceType'],
                    'ComplianceResourceId': invoking_event['configurationItem']['resourceId'],
                    'ComplianceType': compliance_value,
                    'OrderingTimestamp': invoking_event['configurationItem']['configurationItemCaptureTime']
                  }
                ],
                ResultToken=event['resultToken'])
      Handler: "index.handler"
      Runtime: python3.7
      Timeout: 30
      Role: !GetAtt LambdaExecutionRole.Arn

    Type: AWS::IAM::Role
        Version: '2012-10-17'
        - Effect: Allow
            - lambda.amazonaws.com
          - sts:AssumeRole
      Path: "/"
      - PolicyName: lambda-logging
          Version: '2012-10-17'
          - Effect: Allow
            - logs:*
            Resource: arn:aws:logs:*:*:*
      - PolicyName: waf-config
          Version: '2012-10-17'
          - Effect: Allow
            - waf:PutLoggingConfiguration
            - waf:GetLoggingConfiguration
            - waf:GetWebACL
            - wafv2:PutLoggingConfiguration
            - wafv2:GetLoggingConfiguration
            - wafv2:GetWebACL
            - waf-regional:PutLoggingConfiguration
            - waf-regional:GetLoggingConfiguration
            - waf-regional:GetWebACL
            - arn:aws:waf::*:*
            - arn:aws:wafv2:*:*:*/*/*
            - arn:aws:waf-regional:*:*:*
      - PolicyName: config-evaluate
          Version: '2012-10-17'
          - Effect: Allow
            - config:PutEvaluations
            - config:StartConfigRulesEvaluation
            Resource: '*'
      - PolicyName: allow-lambda-servicelinkedrole
          Version: '2012-10-17'
          - Effect: Allow
            - iam:CreateServiceLinkedRole
            Resource: arn:aws:iam::*:role/aws-service-role/*

How the CloudFormation template works

To enable logging on a web ACL, the web ACL expects a Kinesis Data Firehose delivery stream that has a name that starts with aws-waf-logs-. You typically configure a Kinesis Data Firehose delivery stream to deliver data to an S3 bucket. This CloudFormation template creates a Kinesis Data Firehose delivery stream whose name matches what the web ACL expects and that is configured to deliver data to an S3 bucket. The delivery stream is named aws-waf-logs-StackName, where StackName is the name you provided when you created this CloudFormation stack.
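Because the delivery stream name is derived from the stack name, it can help to check the result before you deploy. The following is a minimal sketch that mirrors the naming rule described above. The 64-character limit and the allowed character set reflect my understanding of Kinesis Data Firehose naming constraints and should be treated as assumptions; the function name is illustrative.

```python
import re

WAF_LOG_PREFIX = "aws-waf-logs-"
# Assumed naming constraints: 1 to 64 characters from this set.
_VALID_NAME = re.compile(r"[a-zA-Z0-9_.-]{1,64}")


def waf_delivery_stream_name(stack_name: str) -> str:
    """Return the delivery stream name the template would produce for a stack."""
    name = WAF_LOG_PREFIX + stack_name
    if not _VALID_NAME.fullmatch(name):
        raise ValueError(f"'{name}' is not a valid delivery stream name")
    return name
```

For example, a stack named web-acl-logging yields aws-waf-logs-web-acl-logging, while a very long stack name is rejected before deployment rather than failing during stack creation.
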

The CloudFormation template also creates an AWS Config rule with the name Enable-WebACL-Logging-StackName. This AWS Config rule is configured to monitor resources of type AWS::WAF::WebACL (typically a CloudFront distribution), AWS::WAFRegional::WebACL (typically an API Gateway or an Application Load Balancer), and AWS::WAFv2::WebACL, which is the latest version of the AWS WAF API. When AWS Config detects a change to one of your web ACLs (for example, an AWS WAF rule being added to an Application Load Balancer), the event is sent off to a Lambda function for evaluation against your rule.

The Lambda function is where all the heavy lifting is performed. When the Lambda function is invoked, control is passed to the handler method. This method calls the evaluate_compliance method, which uses the Boto3 Python library to pull the logging configuration of the web ACL in question. The function simply checks to see if it can pull a logging configuration from the web ACL. If it can pull a logging configuration, that means that logging is enabled. If it cannot pull a logging configuration, it means logging is not enabled. The Lambda function then reports back the status of COMPLIANT (meaning logging is enabled) or NON_COMPLIANT (meaning logging is not enabled) to AWS Config.
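The check described above can be reduced to a few lines. The following is a minimal sketch that takes the WAF client as a parameter so that it can be exercised without AWS access. The function name evaluate_logging is illustrative and is not the name used in the template.

```python
def evaluate_logging(waf_client, web_acl_arn: str) -> str:
    """Return 'COMPLIANT' if a logging configuration can be retrieved."""
    try:
        waf_client.get_logging_configuration(ResourceArn=web_acl_arn)
    except Exception:
        # No logging configuration could be pulled for this web ACL.
        return "NON_COMPLIANT"
    return "COMPLIANT"
```

Note that catching every exception also treats throttling or permission errors as noncompliance. That is acceptable for a sketch, but a stricter version might catch only the service's "nonexistent item" error and let other failures surface.
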

This AWS Config rule is configured to auto-remediate noncompliant web ACLs. When a noncompliant web ACL is identified, AWS Config executes a Systems Manager Automation document, which calls a Lambda function to enable logging. This Lambda function is configured with an environment variable called FIREHOSE_ARN, which is the ARN of the Kinesis Data Firehose delivery stream that is created as part of this CloudFormation stack. In this Lambda function, if it cannot pull a logging configuration, it creates a new logging configuration using the Kinesis Data Firehose delivery stream that has already been configured. The Lambda function then attempts to call a method on AWS Config to re-evaluate compliance for this rule.
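The remediation logic follows the same pattern as the evaluation: try to read the logging configuration, and create one only if the read fails. The following is a minimal sketch with the WAF client and the delivery stream ARN passed in as parameters. The function name ensure_logging and its boolean return value are illustrative assumptions.

```python
def ensure_logging(waf_client, web_acl_arn: str, firehose_arn: str) -> bool:
    """Enable logging if it is missing. Return True if a change was made."""
    try:
        waf_client.get_logging_configuration(ResourceArn=web_acl_arn)
        return False  # logging is already configured
    except Exception:
        waf_client.put_logging_configuration(
            LoggingConfiguration={
                "ResourceArn": web_acl_arn,
                "LogDestinationConfigs": [firehose_arn],
            }
        )
        return True
```

Because the function does nothing when logging is already enabled, it is safe for AWS Config to trigger the remediation more than once for the same web ACL.
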

When you view the details of this rule within the AWS Config console, you’ll see all web ACLs listed under the Resource ID column. The Resource compliance status column will show as Compliant, meaning that these web ACLs comply with your AWS Config rule. Because the AWS Config rule enforces logging on web ACLs, you can be confident that logging is properly enabled.

Figure 3: Compliance status of web ACLs in AWS Config


The remaining parts of the CloudFormation template are in place to ensure that the system has sufficient permissions to work correctly. The Kinesis Data Firehose delivery stream is assigned to an IAM role, which has a policy assigned that grants it appropriate permissions to write to your S3 bucket. The AWS Config rule is granted permission to call the first Lambda function, and then Systems Manager is granted permission to call the second Lambda function. Finally, the Lambda functions are assigned to an IAM role that has permissions to request and modify the logging configurations of the web ACLs, and to update AWS Config with the results of those actions.

The CloudFormation template in this post provides a simple solution for automatically enabling logging of all web ACLs within an AWS Region. If your organization is looking for additional operational control, you can extend this example CloudFormation template to verify that all web ACLs are using the same logging configuration. This change could be accomplished by modifying the Lambda functions to ensure that the web ACL has both a logging configuration and is using the same Kinesis Data Firehose delivery stream that is defined within the CloudFormation template. If a logging configuration exists for a web ACL, but it is using the wrong Kinesis Data Firehose delivery stream, a Lambda function can delete that logging configuration and re-create it using the correct Kinesis Data Firehose delivery stream.
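The extension described above might look like the following. This is a minimal sketch, not part of the template: it assumes the WAF client is passed in, that the logging configuration response exposes its destinations under LoggingConfiguration and LogDestinationConfigs, and that exactly one delivery stream is expected. The function name reconcile_logging and its return values are illustrative.

```python
def reconcile_logging(waf_client, web_acl_arn: str, expected_firehose_arn: str) -> str:
    """Make the web ACL log to the expected delivery stream.

    Returns the action taken: 'unchanged', 'created', or 'replaced'.
    """
    desired = {
        "ResourceArn": web_acl_arn,
        "LogDestinationConfigs": [expected_firehose_arn],
    }

    try:
        current = waf_client.get_logging_configuration(ResourceArn=web_acl_arn)
    except Exception:
        current = None  # no logging configuration exists yet

    if current is None:
        waf_client.put_logging_configuration(LoggingConfiguration=desired)
        return "created"

    destinations = current["LoggingConfiguration"]["LogDestinationConfigs"]
    if destinations == [expected_firehose_arn]:
        return "unchanged"

    # Wrong destination: delete the configuration and re-create it.
    waf_client.delete_logging_configuration(ResourceArn=web_acl_arn)
    waf_client.put_logging_configuration(LoggingConfiguration=desired)
    return "replaced"
```

The evaluation Lambda function would need the matching check, so that a web ACL logging to a different delivery stream is reported as NON_COMPLIANT and triggers the remediation.
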

While the solution described in this blog post uses custom AWS Config rules and Automation documents to enable logging on web ACLs, the approach can be generalized to other contexts and other resource types. For example, you can use this same approach to ensure that your Amazon Elastic Compute Cloud (Amazon EC2) instances comply with your internal IT security policies.

Cost Considerations

For customers who already use AWS WAF and AWS Firewall Manager, this solution incurs additional costs for the use of AWS Config, Amazon Kinesis Data Firehose, and Amazon S3.

With AWS Config, you pay per configuration item recorded in your AWS account per AWS Region, and per active rule evaluation recorded. For more information, see AWS Config pricing.

With AWS Systems Manager, you pay for the number of initiated actions performed (called steps) in the Automation and the duration of each step per second. I expect that my usage for this solution would fall under the free tier, but your usage may vary. For more information, see AWS Systems Manager pricing.

With AWS Lambda, you pay for the number of requests and the duration of those requests. Because I don’t expect a lot of requests to Lambda in this solution, I expect that my usage would fall under the free tier, but your usage may vary. For more information, see AWS Lambda pricing.

With Amazon Kinesis Data Firehose, you pay only for the volume of data you ingest into the service. For more information, see Amazon Kinesis Data Firehose pricing.

For customers who want managed distributed denial of service (DDoS) protection, AWS Shield Advanced may be a good solution. Additionally, AWS Shield Advanced customers get AWS WAF and AWS Firewall Manager at no additional cost for usage on their resources that are protected by AWS Shield Advanced. For more information, see AWS Shield Pricing.


AWS Firewall Manager is a powerful solution for managing web ACLs at scale. By using a custom AWS Config rule—the same underlying technology used by AWS Firewall Manager—you can create a scalable approach to verify that all your web ACLs within an AWS Region have logging enabled. The CloudFormation template included in this blog post gives your organization a good starting point for being able to manage web ACL logging at scale.


If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS WAF forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Mike George

Mike is a Senior Solutions Architect based out of Salt Lake City, Utah. He enjoys helping customers solve their technology problems. His interests include software engineering, security, and AI/ML.

How We Built A Logging Stack at Grab

Post Syndicated from Grab Tech original https://engineering.grab.com/how-built-logging-stack

And Solved Our Inhouse Logging Problem


Let me take you back a year at Grab, to when we lacked any visualizations or metrics for our service logs, and when performing a query for a string from the last three days was something you only ran before you went for a beverage.

When a service stops responding, Grab’s core problems were and are:

  • We need to know it happened before the customer does.
  • We need to know why it happened.
  • We need to solve our customers’ problems fast.

We had a hodgepodge of log-based solutions for developers when they needed to figure out the above, or why a driver never showed up, or a customer wasn’t receiving our promised promotions. These included logs in a cloud-based storage service (which could take hours to retrieve). Or a SaaS provider constantly timing out on our queries. Or even asking our SREs to fetch logs from the potential machines for the service engineer, a rather laborious process.

Here’s what we did with our logs to solve these problems.


Our current size and growth rate ruled out several available logging systems. By size, we mean a LOT of data and a LOT of users who search through hundreds of billions of logs to generate reports. Or who track down that one user who managed to find that pesky corner case in our code.

When we started this project, we generated 25TB of logging data a day. Our first thought was “Do we really need all of these logs?”. To this day our feeling is “probably not”.

However, we can’t always define what another developer can and cannot do. Besides, this gave us an amazing opportunity to build something to allow for all that data!

Some of our SREs had used the ELK Stack (Elasticsearch / Logstash / Kibana). They thought it could handle our data and access loads, so it was our starting point.

How We Built a Multi-Petabyte Cluster:

Information Gathering:

It started with gathering numbers. How much data did we produce each day? How many days were retained? What’s a reasonable response time to wait for?

Before starting a project, understand your parameters. This helps you spec out your cluster, get buy-in from higher ups, and increase your success rate when rolling out a product used by the entire engineering organization. Remember, if it’s not better than what they have now, why will they switch?

A good starting point was opening the floor to our users. What features did they want? If we offered a visualization suite so they can see ERROR event spikes, would they use it? How about alerting them about SEGFAULTs? Hands down the most requested feature was speed; “I want an easy webUI that shows me the user ID when I search for it, and get all the results in <5 seconds!”

Getting Our Feet Wet:

New concerns always pop up during a project. We’re sure someone has correlated the time spent in R&D to the number of problems. We had a constantly moving target: as our proof of concept began, our daily log volume kept increasing.

Thankfully, using Elasticsearch as our data store meant we could fully utilize horizontal scaling. This let us start with a simple 5 node cluster as we built out our proof-of-concept (POC). Once we were ready to onboard more services, we could move into a larger footprint.

The specs at the time called for about 80 nodes to handle all our data. But if we designed our system correctly, we’d only need to increase the number of Elasticsearch nodes as we enrolled more customers. Our key operating metrics were CPU utilization, heap memory needed for the JVM, and total disk space.

Initial Design:

First, we set up tooling to use Ansible both to launch a machine and to install and configure Elasticsearch. Then we were ready to scale.

Our initial goal was to keep the design as simple as possible, so we opted to allow each node in our cluster to perform all responsibilities. In this setup each node would behave as all of the four available types:

  • Ingest: Used for transforming and enriching documents before sending them to data nodes for indexing.
  • Coordinator: Proxy node for directing search and indexing requests.
  • Master: Used to control cluster operations and determine a quorum on indexed documents.
  • Data: Nodes that hold the indexed data.

These were all design decisions made to move our proof of concept along, but in hindsight they might have created more headaches down the road with troubleshooting, indexing speed, and general stability. Remember to do your homework when spec’ing out your cluster.

It’s challenging to figure out why you are losing master nodes because someone filled up the field data cache performing a search. Separating your nodes can be a huge help in tracking down your problem.

We also decided to further reduce complexity by going with ingest nodes over Logstash. But at the time, the documentation wasn’t great so we had a lot of trial and error in figuring out how they work. Particularly as compared to something more battle tested like Logstash.

If you’re unfamiliar with ingest node design, they are lightweight proxies to your data nodes that accept a bulk payload, pre-process the documents, and then send the documents to be indexed by your data nodes. In theory, this helps keep your entire pipeline simple. And in Elasticsearch’s defense, ingest nodes have made massive improvements since we began.

But adding more ingest nodes means ADDING MORE NODES! This can create a lot of chatter in your cluster and cause more complexity when troubleshooting problems. We’ve seen an ingest node failing in an odd way cause larger cluster concerns than just a failed bulk send request.


This isn’t anything new, but we can’t overstate the usefulness of monitoring. Thankfully, we already had a robust tool called Datadog with an additional integration for Elasticsearch. Seeing your heap utilization over time, then breaking it into smaller graphs to display the field data cache or segment memory, has been a lifesaver. There’s nothing worse than a node falling over due to an OOM with no explanation and just hoping it doesn’t happen again.

At this point, we’ve built out several dashboards which visualize a wide range of metrics from query rates to index latency. They tell us if we sharply drop on log ingestion or if circuit breakers are tripping. And yes, Kibana has some nice monitoring pages for some cluster stats. But to know each node’s JVM memory utilization on a 400+ node cluster, you need a robust metric system.


Common Problems:

There are many blogs about the common problems encountered when creating an Elasticsearch cluster and Elastic does a good job of keeping blog posts up to date. We strongly encourage you to read them. Of course, we ran into classic problems like ensuring our Java object pointers were compressed (Hints: Don’t exceed 31GB of heap for your JVM and always confirm that compressed object pointers are enabled).

But we also ran into some interesting problems that were less common. Let’s look at some major concerns you have to deal with at this scale.

Grab’s Problems:

Field Data Cache:

So, things are going well, all your logs are indexing smoothly, and suddenly you’re getting Out Of Memory (OOM) events on your data nodes. You rush to find out what’s happening, as more nodes crash.

A visual representation of your JVM heap’s memory usage is very helpful here. You can always hit the Elasticsearch API, but after adding more than 5 nodes to your cluster this kind of breaks down. Also, you don’t want to know what’s going on while a node is down, but what happened before it died.

Using our graphs, we determined the field data cache went from virtually zero memory used in the heap to 20GB! This forced us to read up on how this value is set, and, as of this writing, the default value is still 100% of the parent heap memory. Basically, this breaks down to allowing 70% of your total heap being allocated to a single search in the form of field data.
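The arithmetic behind that statement can be sketched as below. The 70% parent limit and the 100% field data limit are the figures quoted above; the 31GB heap is an assumption taken from the heap-size hint earlier in this post, and the function name is made up for illustration.

```python
def max_fielddata_gb(heap_gb: float, parent_limit: float = 0.70, fielddata_limit: float = 1.00) -> float:
    """Upper bound, in GB, on field data held in the heap before the breaker trips.

    fielddata_limit is expressed as a fraction of the parent limit,
    which is itself a fraction of the total JVM heap.
    """
    return heap_gb * parent_limit * fielddata_limit


# Assumed 31GB heap with the defaults described above.
print(round(max_fielddata_gb(31), 1))

# Lowering the field data share (here to a hypothetical 20% of the parent) shrinks the bound.
print(round(max_fielddata_gb(31, fielddata_limit=0.20), 1))
```

With those assumptions the bound lands a little above 20GB, which is in line with the 20GB of field data seen in the graphs.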

Now, this should be a rare case and it’s very helpful to keep the field names and values in memory for quick lookup. But, if, like us, you have several trillion documents, you might want to watch out.

From our logs, we tracked down a user who was sorting by the _id field. We believe this is a design decision in how Kibana interacts with Elasticsearch. A good counter argument would be that a user wants a quick memory lookup if they search for a document using the _id. But for us, this meant a user could load into memory every ID in the indices over a 14 day period.

The consequences? 20+GB of data loaded into the heap before the circuit breaker tripped. It then only took 2 queries at a time to knock a node over.

You can’t disable indexing that field, and you probably don’t want to. But you can prevent users from stumbling into this and disable the _id field in the Kibana advanced settings. And make sure you re-evaluate your circuit breakers. We drastically lowered the available field cache and removed any further issues.

Translog Compression:

At first glance, compression seems an obvious choice for shipping shards between nodes. Especially if you have the free clock cycles, why not minimize the bandwidth between nodes?

However, we found compression between nodes can drastically slow down shard transfers. By disabling compression, shipping time for a 50GB shard went from 1h to 20m. This was because Lucene segments are already compressed, a new issue we ran into full force and are actively working with the community to fix. But it’s also a configuration to watch out for in your setup, especially if you want a fast recovery of a shard.

Segment Memory:

Most of our issues involved the heap memory being exhausted. We can’t stress enough the importance of having visualizations around how the JVM is used. We learned this lesson the hard way around segment memory.

This is a prime example of why you need to understand your data when building a cluster. We were hitting a lot of OOMs and couldn’t figure out why. We had fixed the field cache issue, but what was using all our RAM?

There is a reason why having a 16TB data node might be a poorly spec’d machine. Digging into it, we realized we simply allocated too many shards to our nodes. Looking up the total segment memory used per index should give a good idea of how many shards you can put on a node before you start running out of heap space. We calculated on average our 2TB indices used about 5GB of segment memory spread over 30 nodes.

The numbers have since changed and our layout was tweaked, but we came up with calculations showing we could allocate about 8TB of shards to a node with 32GB heap memory before we ran into issues. That’s if you really want to push it, but it’s also a metric used to keep your segment memory per node around 50%. This allows enough memory to run queries without knocking out your data nodes. Naturally this led us to ask “What is using all this segment memory per node?”
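A back-of-the-envelope version of this kind of calculation is sketched below. It is not the calculation we actually ran: it simply scales the 2TB / 5GB ratio quoted above to a heap budget, and the function and parameter names are made up for illustration. Because the underlying numbers changed over time, this simple ratio is not expected to reproduce the 8TB figure exactly.

```python
def max_shard_tb_per_node(heap_gb: float, heap_fraction: float,
                          index_size_tb: float, index_segment_gb: float) -> float:
    """Estimate how many TB of shards fit on a node for a given segment memory budget.

    heap_fraction is the share of the heap allowed for segment memory.
    index_size_tb and index_segment_gb describe one measured index.
    """
    segment_gb_per_tb = index_segment_gb / index_size_tb
    return heap_gb * heap_fraction / segment_gb_per_tb


# 32GB heap, 50% budget for segment memory, 2TB index using about 5GB of segment memory.
print(round(max_shard_tb_per_node(32, 0.5, 2, 5), 1))
```

Measuring the segment memory of your own indices and plugging it into a formula like this gives a starting point for how many shards to place on each node.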

Index Mapping and Field Types:

Could we lower how much segment memory our indices used to cut our cluster operation costs? Using the segments data found in the ES cluster and some simple Python loops, we tracked down the total memory used per field in our index.

We used a lot of segment memory for the _id field (but can’t do much about that). It also gave us a good breakdown of our other fields. And we realized we indexed fields in completely unnecessary ways. A few fields should have been integers but were keyword fields. We had fields no one would ever search against and which could be dropped from index memory.
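The loops involved are roughly of the shape sketched below. The input format is an assumption made for this example: each segment is represented as a dict with a hypothetical `field_memory_bytes` mapping, which stands in for whatever per-field numbers you extract from your own cluster. It is not the raw format of any Elasticsearch API response.

```python
from collections import defaultdict


def memory_per_field(segments):
    """Sum memory usage per field across segments, largest consumer first."""
    totals = defaultdict(int)
    for segment in segments:
        for field, nbytes in segment["field_memory_bytes"].items():
            totals[field] += nbytes
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


# Made-up numbers, for illustration only.
sample_segments = [
    {"field_memory_bytes": {"_id": 900, "message": 300, "status_code": 50}},
    {"field_memory_bytes": {"_id": 1100, "message": 200}},
]

for field, nbytes in memory_per_field(sample_segments):
    print(field, nbytes)
```

Sorting by total makes the heaviest fields stand out, which is how a field such as _id shows up at the top of the list.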

Most importantly, this began our learning process of how tokens and analyzers work in Elasticsearch/Lucene.

Picking the Wrong Analyzer:

By default, we use Elasticsearch’s Standard Analyzer on all analyzed fields. It’s great, offering a very close approximation to how users search and it doesn’t explode your index memory like an N-gram tokenizer would.

But it does a few things we thought unnecessary, and removing them looked like a way to save a significant amount of heap memory. For starters, it keeps the original tokens: the Standard Analyzer would break IDXVB56KLM into tokens IDXVB, 56, and KLM. This usually works well, but it really hurts you if you have a lot of alphanumeric strings.

We never have a user search for a user ID as a partial value. It would be more useful to only return the entire match of an alphanumeric string. This has the added benefit of only storing the single token in our index memory. This modification alone stripped a whole 1GB off our index memory, or at our scale meant we could eliminate 8 nodes.
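The difference between the two tokenization strategies can be illustrated with plain Python. This is only an imitation of the splitting behavior described above, written with a regular expression. It is not Elasticsearch analyzer code, and the function names are made up.

```python
import re


def split_alphanumeric(text):
    """Split into runs of letters and runs of digits, as in the example above."""
    return re.findall(r"[A-Za-z]+|[0-9]+", text)


def whole_token(text):
    """Keep the entire string as a single token."""
    return [text]


user_id = "IDXVB56KLM"
print(split_alphanumeric(user_id))
print(whole_token(user_id))
```

The first strategy produces three tokens per ID and the second produces one, which is where the index memory saving comes from. The trade-off is that partial matches on an ID no longer return results.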

We can’t stress enough how cautious you need to be when changing analyzers on a production system. Throughout this process, end users were confused why search results were no longer returning or returning weird results. There is a nice Kibana plugin that gives you a representation of how your tokens look with a different analyzer, or use the built-in ES tools to get the same understanding.

Be Careful with Cloud Maintainers:

We realized that running a cluster at this scale is expensive. The hardware alone sets you back a lot, but our hidden bigger cost was cross traffic between availability zones.

Most cloud providers offer different “zones” for your machines to entice you to achieve a High-Availability environment. That’s a very useful thing to have, but you need to do a cost/risk analysis. If you migrate shards from HOT to WARM to COLD nodes constantly, you can really rack up a bill. This alone was about 30% of our total cluster cost, which wasn’t cheap at our scale.

We re-worked how our indices sat in the cluster. This let us create a different index for each zone and pin logging data so it never left the zone it was generated in. One small tweak to how we stored data cut our costs dramatically. Plus, it was a smaller scope for troubleshooting. We’d know a zone was misbehaving and could focus there vs. looking at everything.


Running our own logging stack started as a challenge. We roughly knew the scale we were aiming for; it wasn’t going to be trivial or easy. A year later, we’ve gone from pipe-dream to production and immensely grown the team’s ELK stack knowledge.

We could probably fill 30 more pages with odd things we ran into, hacks we implemented, or times we wanted to pull our hair out. But we made it through and provide a superior logging platform to our engineers at a significant price reduction while maintaining a stable platform.

There are many different ways we could have started knowing what we do now. For example, using Logstash over Ingest nodes, changing default circuit breakers, and properly using heap space to prevent node failures. But hindsight is 20/20 and it’s rare for projects to not change.

We suggest anyone wanting to revamp their centralized logging system look at the ELK solutions. There is a learning curve, but the scalability is outstanding and having subsecond lookup time for assisting a customer is phenomenal. But, before you begin, do your homework to save yourself weeks of troubleshooting down the road. In the end though, we’ve received nothing but praise from Grab engineers about their experiences with our new logging system.

Structured Logging: The Best Friend You’ll Want When Things Go Wrong

Post Syndicated from Grab Tech original https://engineering.grab.com/structured-logging


Every day, millions of people around Southeast Asia count on Grab to get themselves or what they need from point A to B in a safe, comfortable and reliable manner. In fact, just very recently we crossed our 3 billion transport rides milestone, gaining the last billion in just a mere 6 months!

We take this responsibility very seriously, and as we continue to grow and expand, it’s important for us to maintain a sophisticated backend system that is capable of sustaining the kind of scale needed to support all our customers in Southeast Asia. This backend system is comprised of multiple services that interact with each other in many different ways. As Grab evolves, maintaining them becomes a significantly larger and harder task as developers continuously develop new features.

To maintain these systems well, it’s important to have better observability: data that helps us better understand what is happening in the system by having good monitoring (metrics), event logs, and tracing for request scope data. Out of these, logs provide the most complete picture of what happened within the system – and are typically the first and most engaged point of contact. With good logs, the backend becomes much easier to understand, maintain, and debug. Without logs or with bad logs – we have a recipe for disaster, making it nearly impossible to understand what’s happening.

In this article, we focus on a form of logging called structured logging. We discuss what it is, why it is better, and how we built a framework that integrates well with our current Elastic stack-based logging backend, allowing us to do logging better and more efficiently.

Structured Logging is a part of a larger endeavour which will enable us to reduce the Mean Time To Resolve (MTTR), helping developers to mitigate issues faster when outages happen.

What are Logs?

Logs are lines of text containing some information about some event that occurred in our system, and they serve a crucial function of helping us understand what’s happening in the backend. Logs are usually placed at points in the code where a significant event has happened (for example, some database operation succeeded or a passenger got assigned to a driver) or at any other place in the code that we are interested in observing.

The first thing that a developer would normally do when an error is reported is check the logs – sort of like walking through the history of the system and finding out what happened. Therefore, logs can be a developer’s best friend in times of service outages, errors, and failed builds.

Logs in today’s world have varying formats and features.

  • Log Format: These range from simple key-value based (like syslog) to quite structured and detailed (like JSON). Since logs are mostly meant for developer eyes, how detailed or structured a log is dictates how fast the developer can query the logs, as well as read them. The more structured the data is – the larger the size is per log line, although it’s more queryable and contains richer information.
  • Levelled Logging (or Log Levels): Logs with different severities can be logged at different levels. The visibility can be limited to a single level, limiting all logs only with a certain severity or above (for example, only logs WARN and above). Usually log levels are static in production environments, and finding DEBUG logs usually requires redeploying.
  • Log Aggregation Backend: Logs can have different log aggregation backends, which means different backends (e.g. Splunk, Kibana) decide what your logs might look like or what you might be able to do with them. Some might cost a lot more than others.
  • Causal Ordering: Logs might or might not preserve the exact time in which they are written. This is important, as how exact the time is dictates how accurately we can predict the sequence of events via logs.
  • Log Correlation: We serve countless requests from our backend services. Being able to see all the logs relevant to a particular request or a particular event helps us drill down to relevant information for a specific request (e.g. for a specific passenger trying to book a ride).
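To make the log format trade-off concrete, here is the same event written once as a key-value line and once as JSON. This is a small Python sketch rather than code from any of our Golang libraries, and the event fields (`booking_id`, `eta_s`) are invented for the example.

```python
import json

# A hypothetical event; the field names are invented for illustration.
event = {"level": "INFO", "msg": "driver assigned", "booking_id": "B123", "eta_s": 240}


def to_key_value(fields):
    """Render fields as a flat key=value line; strings are quoted."""
    return " ".join(f"{key}={json.dumps(value)}" for key, value in fields.items())


def to_json(fields):
    """Render fields as one JSON object per line."""
    return json.dumps(fields, sort_keys=True)


print(to_key_value(event))
print(to_json(event))
```

The JSON line is longer, but a backend can parse it back into typed fields without custom parsing rules, which is what makes it more queryable.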

Combine this with the plethora of logging libraries available and you easily have a developer who is holding their head in confusion, unable to decide what to use. Also, each library has its own set of advantages and disadvantages, so the discussion might quickly become subjective and polarized – therefore it is crucial that you choose the appropriate library and backend pair for your applications.

We at Grab use different types of logging libraries. However, as requirements changed – we also found ourselves re-evaluating our logging strategy.

The State of Logging at Grab

The number of Golang services at Grab has continuously grown. Most services used syslog-style key-value format logs, recognized as the most common format of logs for server-side applications due to its simplicity and ease for reading and writing. All these logs were made possible by a handful of common libraries, which were directly imported and used by different services.

We used a cloud-based SaaS vendor as a frontend for these logs, where application-emitted logs were routed to files and sent to our logging vendor, making it possible to view and query them in real time. Things were pretty great and frictionless for a long time.

However, as time went by, our logging bills started mounting to unprecedented levels and we found ourselves revisiting and re-evaluating how we did logging. A few issues surfaced:

  • Logging volume reduction efforts were successful to some extent – but were arduous and painful. Part of the reason was that almost all the logs were at a single log level – INFO.
Figure 1: Log Level Usage


This issue was not limited to a single service, but pervasive across services. For mitigation, some services added sampling to logs, some removed logs altogether. The latter is only a recipe for disaster, so it was known that we had to improve levelled logging.

  • The vendor was expensive for us at the time and also had a few concerns – primarily with limitations around DSL (query language). There were many good open source alternatives available – Elastic stack to name one. Our engineers felt confident that we could probably manage our logging infrastructure and manage the costs better – which led to the proposal and building of an Elastic stack logging cluster. Elasticsearch is vastly more powerful and rich than our vendor at the time and our current libraries weren’t enough to fully leverage its capabilities, so we needed a library which can leverage structure in logs better and easily integrate with Elastic stack.
  • There were some minor issues in our logging libraries namely:
    • Singleton initialisation pattern that made unit-testing harder
    • Single logger interface that reduced the possibility of extending the core logging functionality as almost all the services imported the logger interface directly
    • No out-of-the-box support for multiple writers
  • If we were to write a library, we had to fix these issues – and also encourage usage of best practices.

  • Grab’s critical path (number of services traversed by a single booking flow request) has grown in size. On average, a single booking request touches multiple microservices – each of which does something different. At the large scale at which we operate, it’s necessary therefore to easily view logs from all the services for a single request – however this was not something which was done automatically by the library. Hence, we also wanted to make log correlation easier and better.
  • Logs are events which happened at some point of time. The order in which these events occurred gives us a complete history of what happened in the system. However, the core logging library which formed the base of the logging across our Golang services didn’t preserve the log generation time (it instead used write time). This led to jumbling of logs which are generated in a span of a few microseconds – which not only makes the lives of our developers harder, but makes it near impossible to get an exact history of the system. This is why we wanted to also improve and enable causal ordering of logs – one of the key steps in understanding what’s happening in the system.

Why Change?

As mentioned, we knew there were issues with how we were logging. To best approach the problem and be able to solve it as much as possible without affecting existing infrastructure and services, it was decided to bootstrap a new library from the ground up. This library would solve known issues, as well as contain features which would not have been possible by modifying existing libraries. For a recap, here’s what we wanted to solve:

  • Improve levelled logging
  • Leverage structure in logs better
  • Easily integrate with Elastic stack
  • Encourage usage of best practices
  • Make log correlation easier and better
  • Improve and enable causal ordering of logs for a better understanding of service distribution

Enter Structured Logging. Structured Logging has been quite popular around the world, finding widespread adoption. It was easily integrable with our Elastic stack backend and would also solve most of our pain points.

Structured Logging

Keeping our previous problems and requirements in mind, we bootstrapped a library in Golang, which has the following features:

Dynamic Log Levels

This allows us to change our initialized log levels at runtime from a configuration management system – something which was neither possible nor encouraged before.

This makes the log levels actually more meaningful now – developers can now deploy with the usual WARN or INFO log levels, and when things go wrong, just with a configuration change they can update the log level to DEBUG and make their services output more logs when debugging. This also helps us keep our logging costs in check. We made support for integrating this with our configuration management system easy and straightforward.
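The idea can be sketched with Python's standard `logging` module. This is an illustration of the concept only, not our Golang library's API: the logger name and the `apply_log_level` callback are hypothetical, standing in for whatever hook your configuration management system provides.

```python
import logging

logger = logging.getLogger("booking-service")  # hypothetical service name
logger.setLevel(logging.WARNING)               # the level the service is deployed with


def apply_log_level(level_name: str) -> None:
    """Called when the (hypothetical) configuration system pushes a new level.

    Raises ValueError for an unknown level name.
    """
    logger.setLevel(level_name.upper())


print(logger.isEnabledFor(logging.DEBUG))  # DEBUG is filtered out at deploy time
apply_log_level("debug")                   # configuration change, no redeploy
print(logger.isEnabledFor(logging.DEBUG))
```

Because the level is changed on the live logger object, DEBUG output can be switched on during an incident and switched off again afterwards.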

Consistent Structure in Logs

Logs sit somewhere between a database schema, which is rigid, and freeform text, which has no structure. Our Elastic stack backend is primarily based on indices (sort of like tables) with mapping (sort of like a loose schema). For this, we needed to output logs in JSON with a consistent structure (for example, we cannot output an integer and a string under the same JSON field because that will cause an indexing failure in Elasticsearch). Also, we were aware that one of our primary goals was keeping our logging costs in check, and since it didn’t make sense to structure and index almost every field – adding only the structure which is useful to us made sense.

To address this, we built a utility that allows us to add structure to our logs deterministically. This is built on top of a schema in which we can add key-value pairs with a specific key name and type, generate code based on that – and use the generated code to make sure that things are consistently formatted and don’t break. We called this schema (a collection of key name and type pairs) the Common Grab Log Schema (CGLS). We only add structure to CGLS which is important – everything included in CGLS gets formatted into its own field and everything else gets formatted into a single field in the generated JSON. This helps keep our structure consistent and easily usable with Elastic stack.
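The effect of such a schema can be sketched in a few lines of Python. This is not the CGLS utility itself: the real utility generates Golang code from the schema, while this sketch checks types at runtime. The schema entries and the `extra` field name are invented for the example.

```python
import json

# Hypothetical schema: key name -> required type.
SCHEMA = {"booking_id": str, "latency_ms": int}


def format_log(message, **fields):
    """Emit one JSON log line; schema fields are typed, the rest share one field."""
    record = {"msg": message}
    extra = {}
    for key, value in fields.items():
        expected = SCHEMA.get(key)
        if expected is None:
            extra[key] = value            # not in the schema: goes into the catch-all
        elif isinstance(value, expected):
            record[key] = value           # in the schema with the right type
        else:
            raise TypeError(f"{key} must be {expected.__name__}, got {type(value).__name__}")
    if extra:
        # Serialized to a string so unknown keys never create new index fields.
        record["extra"] = json.dumps(extra, sort_keys=True)
    return json.dumps(record, sort_keys=True)


print(format_log("booking created", booking_id="B123", latency_ms=12, note="promo applied"))
```

Rejecting a string where the schema expects an integer is what prevents the mixed-type indexing failures described earlier.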

Figure 2: Overview of Common Grab Log Schema for Golang backend services

Plug and Play support with Grab-Kit

We made the initialization and use easy and out-of-the-box with our in-house support for Grab-Kit, so developers can just use it without making any drastic changes. Also, as part of this integration, we added automatic log correlation based on request IDs present in traces, which ensured that all the logs generated for a particular request already have that trace ID.
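Automatic log correlation of this kind can be sketched in Python with a context variable and a logging filter. This illustrates the technique only and is not Grab-Kit code: the `request_id` variable, the `handle_request` function, and the log messages are hypothetical.

```python
import contextvars
import io
import logging

# Holds the ID of the request currently being handled ("-" outside a request).
request_id = contextvars.ContextVar("request_id", default="-")


class RequestIdFilter(logging.Filter):
    """Stamps every log record with the current request ID."""

    def filter(self, record):
        record.request_id = request_id.get()
        return True


stream = io.StringIO()  # stands in for the real log destination
handler = logging.StreamHandler(stream)
handler.addFilter(RequestIdFilter())
handler.setFormatter(logging.Formatter("request_id=%(request_id)s msg=%(message)s"))

log = logging.getLogger("correlation-demo")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)


def handle_request(req_id):
    token = request_id.set(req_id)  # middleware would do this once per request
    try:
        log.info("looking for drivers")
        log.info("driver assigned")
    finally:
        request_id.reset(token)


handle_request("req-42")
print(stream.getvalue(), end="")
```

The code that writes the log lines never mentions the request ID, yet every line carries it, so all logs for one request can be retrieved with a single filter.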

Configurable Log Format

Our primary requirement was building a logger expressive and consistent enough to integrate with the Elastic stack backend well – without going through fancy log parsing in the downstream. Therefore, the library is expressive and configurable enough to allow any log format, with a default option of JSON output. We can write different log formats for different future use cases: for example, a readable format in development settings and JSON output in production settings. This ensures that we can produce log output which is compatible with Elastic stack, but still be configurable enough for different use cases.

Support for Multiple Writers with Different Formats

As part of extending the library’s functionality, we needed enough configurability to be able to send different logs to different places at different settings. For example, sending FATAL logs to Slack asynchronously in some readable format, while sending all the usual logs to our Elastic stack backend. This library includes support for chaining such “cores” to any arbitrary degree possible – making sure that this logger can be used in such highly specialized cases as well.
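A comparable setup can be sketched with Python's standard `logging` module, where each handler plays the role of one writer. This is an analogy rather than our library's "core" API: the in-memory streams stand in for the Elastic stack and Slack destinations, the writes are synchronous, and Python's CRITICAL level is used in place of FATAL.

```python
import io
import logging

all_logs = io.StringIO()  # stands in for the Elastic stack writer
alerts = io.StringIO()    # stands in for a Slack writer

# Writer 1: every record, in a JSON-like format.
# (Naive formatting for brevity: the message is not JSON-escaped.)
main_handler = logging.StreamHandler(all_logs)
main_handler.setFormatter(logging.Formatter('{"level": "%(levelname)s", "msg": "%(message)s"}'))

# Writer 2: only the most severe records, in a readable format.
alert_handler = logging.StreamHandler(alerts)
alert_handler.setLevel(logging.CRITICAL)
alert_handler.setFormatter(logging.Formatter("ALERT: %(message)s"))

log = logging.getLogger("multi-writer-demo")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(main_handler)
log.addHandler(alert_handler)

log.info("booking created")
log.critical("database unreachable")

print(all_logs.getvalue(), end="")
print(alerts.getvalue(), end="")
```

Each writer has its own level and its own format, so the same log call can fan out to destinations with very different needs.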

Production-like Logging Environment in Development

Developers have been seeing console logs since the dawn of time, but structured JSON logs, which are normally only produced in production, are more searchable and provide more power. To leverage this power in development better and allow developers to directly see their logs in Kibana, we provide a dockerized version of Kibana which can be spun up locally to accept structured logs. This allows developers to directly use the structured logs and see their logs in Kibana – just like production!

Having this library enabled us to do logging in a much better way. The most noticeable impact was that our simple access logs can now be queried better – with more filters and conditions.

Figure 3: Production-like Logging Environment in Development

Causal Ordering

Having an exact history of events makes debugging issues in production systems easier – as one can just look at the history and quickly hypothesize what’s wrong and fix it. To this end, the structured logging library adds the exact write timestamp in nanoseconds in the logger. This combined with the structured JSON-like format makes it possible to sort all the logs by this field – so we can see logs in the exact order as they happened – achieving causal ordering in logs. This is an underplayed but highly powerful feature that makes debugging easier.
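As a small sketch of this idea (in Python rather than the library's Go, and with the field name `ts_ns` chosen here purely for illustration), each entry carries a nanosecond write timestamp and a reader sorts on that one field. Ordering across machines still depends on how well their clocks are synchronized.

```python
import json
import time


def make_entry(message, **fields):
    """Build a structured log entry stamped with a nanosecond write time."""
    return {"ts_ns": time.time_ns(), "message": message, **fields}


def causal_order(json_lines):
    """Parse JSON log lines and return them sorted by write timestamp."""
    entries = (json.loads(line) for line in json_lines)
    return sorted(entries, key=lambda entry: entry["ts_ns"])
```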

Figure 4: Causal ordering of logs with Y'ALL

But Why Structured Logging?

Now that you know about the history and the reasons behind our logging strategy, let’s discuss the benefits that you reap from it.

At the outset, having logs well-defined and structured (like JSON) has multiple benefits, including but not limited to:

  • Better root cause analysis: With structured logs, we can ingest and perform more powerful queries which won’t be possible with simple unstructured logs. Developers can do more informative queries on finding the logs which are relevant to the situation. Not only this, log correlation and causal ordering make it possible to gain a better understanding of the distributed logs. Unlike unstructured data, where we are only limited to full-text or a handful of log types, structured logs take the possibility to a whole new level.
  • More transparency or better observability: With structured logs, you increase the visibility of what is happening with your system – since now you can log information in a better, more expressive way. This enables you to have a more transparent view of what is happening in the system and makes your systems easier to maintain and debug over longer periods of time.
  • Better consistency: With structured logs, you increase the structure present in your logs and, in turn, make your logs more consistent as the systems evolve. This allows us to index our logs in a system like Elastic stack more easily, because we can be sure that we are sticking to some structure. Also, with the adoption of a common schema, we can rest assured that we are all using the same structure.
  • Better standardization: Having a single, well-defined, structured way to do logging allows us to standardize logging – which reduces cognitive overhead of figuring out what happened in systems via logs and allows easier adoption. Instead of going through 100 different types of logs, you instead would only have a single format. This is also one of the goals of the library – standardizing the usage of the library across Golang backend services.

We get some additional benefits as well:

  • Dynamic Log Levels: This allows us to have meaningful log levels in our code – where we can deploy with baseline warning settings and switch to lower levels (debug logs) only when we need them. This helps keep our logging costs low, as well as reduces the noise that developers usually need to go through when debugging.
  • Future-proof Consistency in Logs: With the adoption of a common schema, we make sure that we stick with the same structure, even if say tomorrow our logging infrastructure changes – making us future-ready. Instead of manually specifying what to log, we can simply expose a function in our loggers.
  • Production-Like Logging Environment in Development: The dockerized Kibana allows developers to enjoy the same benefits as the production Kibana. This also encourages developers to use Elastic stack more and explore its features such as building dashboards based on the log data, having better watchers, and so on.
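The dynamic log level idea from the list above can be sketched with Python's standard `logging` module; this is not the library's own API, and `set_level` is a hypothetical helper. The service starts at a quiet baseline and is switched to debug output only while someone is investigating.

```python
import logging


def set_level(logger, level_name):
    """Switch a logger to the named level at runtime, rejecting unknown names."""
    level = logging.getLevelName(level_name.upper())
    if not isinstance(level, int):
        raise ValueError("unknown log level: %r" % level_name)
    logger.setLevel(level)


service_log = logging.getLogger("orders")
set_level(service_log, "warning")  # baseline: debug and info are dropped
set_level(service_log, "debug")    # temporarily verbose while debugging
```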

I hope you have enjoyed this article and found it useful. Comments and corrections are always welcome.

Happy Logging!

Alerting, monitoring, and reporting for PCI-DSS awareness with Amazon Elasticsearch Service and AWS Lambda

Post Syndicated from Michael Coyne original https://aws.amazon.com/blogs/security/alerting-monitoring-and-reporting-for-pci-dss-awareness-with-amazon-elasticsearch-service-and-aws-lambda/

Logging account activity within your AWS infrastructure is paramount to your security posture and could even be required by compliance standards such as PCI-DSS (Payment Card Industry Data Security Standard). Organizations often analyze these logs to adapt to changes and respond quickly to security events. For example, if users are reporting that their resources are unable to communicate with the public internet, it would be beneficial to know if a network access list had been changed just prior to the incident. Many of our customers ship AWS CloudTrail event logs to an Amazon Elasticsearch Service cluster for this type of analysis. However, security best practices and compliance standards could require additional considerations. Common concerns include how to analyze log data without the data leaving the security constraints of your private VPC.

In this post, I’ll show you not only how to store your logs, but how to put them to work to help you meet your compliance goals. This implementation deploys an Amazon Elasticsearch Service domain with Amazon Virtual Private Cloud (Amazon VPC) support by utilizing VPC endpoints. A VPC endpoint enables you to privately connect your VPC to Amazon Elasticsearch Service without requiring an internet gateway, NAT device, VPN connection, or AWS Direct Connect connection. An AWS Lambda function is used to ship AWS CloudTrail event logs to the Elasticsearch cluster. A separate AWS Lambda function performs scheduled queries on log sets to look for patterns of concern. Amazon Simple Notification Service (SNS) generates automated reports based on a sample set of PCI guidelines discussed further in this post and notifies stakeholders when specific events occur. Kibana serves as the command center, providing visualizations of CloudTrail events that need to be logged based on the provided sample set of PCI-DSS compliance guidelines. The automated report and dashboard that are constructed around the sample PCI-DSS guidelines assist in event awareness regarding your security posture and should not be viewed as a de facto means of achieving certification. This solution serves as an additional tool to provide visibility into the actions and events within your environment. Deployment is made simple with a provided AWS CloudFormation template.

Figure 1: Architectural diagram


The figure above depicts the architecture discussed in this post. An Elasticsearch cluster with VPC support is deployed within an AWS Region and Availability Zone. This creates a VPC endpoint in a private subnet within a VPC. Kibana is an Elasticsearch plugin that resides within the Elasticsearch cluster; it is accessed through an endpoint provided in the output section of the CloudFormation template. CloudTrail is enabled in the VPC and ships CloudTrail events to both an S3 bucket and a CloudWatch Log Group. The CloudWatch Log Group triggers a custom Lambda function that ships the CloudTrail event logs to the Elasticsearch domain through the VPC endpoint. An additional Lambda function is created that performs a periodic set of Elasticsearch queries and produces a report that is sent to an SNS topic. A Windows-based EC2 instance is deployed in a public subnet so users can view and interact with a Kibana dashboard. Access to the EC2 instance can be restricted to an allowed CIDR range through a parameter set in the CloudFormation deployment. Access to the Elasticsearch cluster and Kibana is restricted to a security group that is created and associated with the EC2 instance and the custom Lambda functions.
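To make the log-shipping step more concrete, here is a minimal sketch of what a function like the shipping Lambda has to do with its input. It is not the code bundled with the template. It relies on the documented shape of a CloudWatch Logs subscription event (base64-encoded, gzip-compressed JSON under `awslogs.data`), and the helper names and the daily `cwl-YYYY.MM.DD` index naming are assumptions chosen to line up with the `cwl-*` index pattern used later. On Elasticsearch 6.x the bulk request also needs a document type, supplied in the action line or the URL path; that detail is left out here.

```python
import base64
import gzip
import json
from datetime import datetime, timezone


def decode_subscription_event(event):
    """Return the decoded CloudWatch Logs payload from a subscription event."""
    compressed = base64.b64decode(event["awslogs"]["data"])
    return json.loads(gzip.decompress(compressed))


def to_bulk_body(payload, index_prefix="cwl-"):
    """Build an Elasticsearch _bulk body, using one daily index per event date."""
    lines = []
    for log_event in payload["logEvents"]:
        # CloudWatch Logs timestamps are milliseconds since the epoch.
        ts = datetime.fromtimestamp(log_event["timestamp"] / 1000, tz=timezone.utc)
        index = index_prefix + ts.strftime("%Y.%m.%d")
        doc = json.loads(log_event["message"])  # each CloudTrail event is JSON
        doc["@timestamp"] = ts.isoformat()
        lines.append(json.dumps({"index": {"_index": index, "_id": log_event["id"]}}))
        lines.append(json.dumps(doc))
    # The bulk API expects newline-delimited JSON ending with a newline.
    return "\n".join(lines) + "\n"
```

The resulting string would then be sent with an HTTP POST to the domain's `_bulk` endpoint from inside the VPC.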

Sample PCI-DSS Guidelines

This solution provides a sample set of 10 PCI-DSS guidelines for events that need to be logged.

  • All Commands, API action taken by AWS root user
  • All failed logins at the AWS platform level
  • Action related to RDS (configuration changes)
  • Action related to enabling/disabling/changing of CloudTrail, CloudWatch logs
  • All access to S3 bucket that stores the AWS logs
  • Action related to VPCs (creation, deletion and changes)
  • Action related to changes to SGs/NACLs (creation, deletion and changes)
  • Action related to IAM users, roles, and groups (creation, deletion and changes)
  • Action related to route tables (creation, deletion and changes)
  • Action related to subnets (creation, deletion and changes)
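As a rough illustration of how a CloudTrail event could be matched against a few of these guidelines, the sketch below inspects standard CloudTrail record fields (`userIdentity.type`, `eventName`, `eventSource`, `responseElements`). The labels and the keyword matching on EC2 event names are simplifying assumptions, not the queries shipped with the solution, and the matching is deliberately coarse (for example, `Vpc` also matches VPC endpoint events).

```python
# Keywords looked for in EC2 event names, mapped to illustrative labels.
EC2_KEYWORDS = {
    "Vpc": "VPC change",
    "SecurityGroup": "security group or NACL change",
    "NetworkAcl": "security group or NACL change",
    "RouteTable": "route table change",
    "Subnet": "subnet change",
}

# Event sources that map directly to a guideline.
SOURCE_LABELS = {
    "rds.amazonaws.com": "RDS action",
    "iam.amazonaws.com": "IAM action",
    "cloudtrail.amazonaws.com": "CloudTrail configuration action",
}


def matched_guidelines(event):
    """Return the sample guideline labels that a CloudTrail event falls under."""
    name = event.get("eventName", "")
    source = event.get("eventSource", "")
    matches = []
    if event.get("userIdentity", {}).get("type") == "Root":
        matches.append("root user activity")
    response = event.get("responseElements") or {}
    if name == "ConsoleLogin" and response.get("ConsoleLogin") == "Failure":
        matches.append("failed login")
    if source in SOURCE_LABELS:
        matches.append(SOURCE_LABELS[source])
    if source == "ec2.amazonaws.com":
        for keyword, label in EC2_KEYWORDS.items():
            if keyword in name and label not in matches:
                matches.append(label)
    return matches
```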

Solution overview

In this walkthrough, you’ll create an Elasticsearch cluster within an Amazon VPC environment. You’ll ship AWS CloudTrail logs to both an Amazon S3 bucket (to maintain an immutable copy of the logs) and to a custom AWS Lambda function that will stream the logs to the Elasticsearch cluster. You’ll also create an additional Lambda function that will run once a day, build a report of the number of CloudTrail events that occurred based on the example set of 10 PCI-DSS guidelines, and then notify stakeholders via SNS.

To make it easier to get started, I’ve included an AWS CloudFormation template that will automatically deploy the solution. The CloudFormation template along with additional files can be downloaded from this link. You’ll need the following resources to set it up:

  • An S3 bucket to upload and store the sample AWS Lambda code and sample Kibana dashboards. This bucket name will be requested during the CloudFormation template deployment.
  • An Amazon Virtual Private Cloud (Amazon VPC).

If you’re unfamiliar with how CloudFormation templates work, you can find more info in the CloudFormation Getting Started guide.

AWS CloudFormation deployment

The following parameters are available in this template.

Parameter | Default | Description
Elasticsearch Domain Name | | Name of the Amazon Elasticsearch Service domain.
Elasticsearch Version | 6.2 | Version of Elasticsearch to deploy.
Elasticsearch Instance Count | 3 | The number of data nodes to deploy into the Elasticsearch cluster.
Elasticsearch Instance Class | | The instance class to deploy for the Elasticsearch data nodes.
Elasticsearch Instance Volume Size | 10 | The size of the volume for each Elasticsearch data node in GB.
VPC to launch into | | The VPC to launch the Amazon Elasticsearch Service cluster into.
Availability Zone to launch into | | The Availability Zone to launch the Amazon Elasticsearch Service cluster into.
Private Subnet ID | | The subnet to launch the Amazon Elasticsearch Service cluster into.
Elasticsearch Security Group | | A new Security Group is created that will be associated with the Amazon Elasticsearch Service cluster.
Security Group Description | | A description for the above created Security Group.
Windows EC2 Instance Class | m5.large | Windows instance for interaction with Kibana.
EC2 Key Pair | | EC2 Key Pair to associate with the Windows EC2 instance.
Public Subnet | | Public subnet to associate with the Windows EC2 instance for access.
Remote Access Allowed CIDR | | The CIDR range to allow remote access (port 3389) to the EC2 instance.
S3 Bucket Name—Lambda Functions | | S3 Bucket that contains custom AWS Lambda functions.
Private Subnet | | Private subnet to associate with AWS Lambda functions that are deployed within a VPC.
CloudWatch Log Group Name | | This will create a CloudWatch Log Group for the AWS CloudTrail event logs.
S3 Bucket Name—CloudTrail logging | | This will create a new Amazon S3 Bucket for logging CloudTrail events. Name must be a globally unique value.
Date range to perform queries | now-1d | Examples: now-1d, now-7d, now-90d
Lambda Subnet CIDR | | Create a Subnet CIDR to deploy the AWS Lambda Elasticsearch query function into.
Availability Zone—Lambda | | The Availability Zone to associate with the preceding AWS Lambda Subnet.
Email Address | [email protected] | Email address for reporting to notify stakeholders via SNS. You must accept the subscription by selecting the link sent to this address before alerts will arrive.

It takes 30-45 minutes for this stack to be created. When it’s complete, the CloudFormation console will display the following resource values in the Outputs tab. These values can be referenced at any time and will be needed in the following sections.

Output | Description
oElasticsearchDomainEndpoint | Elasticsearch Domain Endpoint Hostname
oKibanaEndpoint | Kibana Endpoint Hostname
oEC2Instance | Windows EC2 Instance Name used for Kibana access
oSNSSubscriber | SNS Subscriber Email Address
oElasticsearchDomainArn | Arn of the Elasticsearch Domain
oEC2InstancePublicIp | Public IP address of the Windows EC2 instance

Managing and testing the solution

Now that you’ve set up the environment, it’s time to configure the Kibana dashboard.

Kibana configuration

From the AWS CloudFormation output, gather information related to the Windows-based EC2 instance. Once you have retrieved that information, move on to the next steps.

Initial configuration and index pattern

  1. Log into the Windows EC2 instance via Remote Desktop Protocol (RDP) from a resource that is within the allowed CIDR range for remote access to the instance.
  2. Open a browser window and navigate to the Kibana endpoint hostname URL from the output of the AWS CloudFormation stack. Access to the Elasticsearch cluster and Kibana is restricted to the security group that is associated with the EC2 instance and custom Lambda functions during deployment.
  3. In the Kibana dashboard, select Management from the left panel and choose the link for Index Patterns.
  4. Add one index pattern containing the following: cwl-*
    Figure 2: Define the index pattern


  5. Select Next Step.
  6. Select the Time Filter Field named @timestamp.
    Figure 3: Select "@timestamp"


  7. Select Create index pattern.

At this point we’ve launched our environment and have accessed the Kibana console. Within the Kibana console, we’ve configured the index pattern for the CloudWatch logs that will contain the CloudTrail events. Next, we’ll configure visualizations and a dashboard.

Importing sample PCI DSS queries and Kibana dashboard

  1. Copy export.json from the location where you extracted the downloaded zip file to the Windows EC2 instance used for Kibana access.
  2. Select Management on the left panel and choose the link for Saved Objects.
  3. Select Import in upper right corner and navigate to export.json.
  4. Select Yes, overwrite all saved objects, then select Index Pattern cwl-* and confirm all changes.
  5. Once the import completes, select PCI DSS Dashboard to see the sample dashboard and queries.

Note: You might encounter an error during the import that looks like this:

Figure 4: Error message


This simply means that your streamed logs do not contain any login-type events for the time period since your deployment. To correct this, you can index a placeholder document that contains the missing field.

  1. From the left panel, select Dev Tools and copy the following JSON into the left panel of the console:
            POST /cwl-/default/
            {
                "userIdentity": {
                    "userName": "test"
                }
            }

  2. Select the green Play triangle to execute the POST of a document with the missing field.
    Figure 5: Select the "Play" button


  3. Now reimport the dashboard using the steps in Importing Sample PCI DSS Queries and Kibana Dashboard. You should be able to complete the import with no errors.

At this point, you should have CloudTrail events that have been streamed to the Elasticsearch cluster, with a configured Kibana dashboard that looks similar to the following graphic:

Figure 6: A configured Kibana dashboard


Automated Reports

A custom AWS Lambda function was created during the deployment of the AWS CloudFormation stack. This function uses the sample PCI-DSS guidelines from the Kibana dashboard to build a daily report. The Lambda function is triggered every 24 hours and performs a series of Elasticsearch time-based queries over the window now-1d (the last 24 hours) on the sample guidelines. The results are compiled into a message that is forwarded to Amazon Simple Notification Service (SNS), which sends a report to stakeholders based on the email address you provided in the CloudFormation deployment.
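A time-bounded query of this kind might be assembled as in the following sketch, which uses standard Elasticsearch query DSL (a `bool` query with a `query_string` clause and a `range` filter on `@timestamp` using date math). The function name, the example query string, and the `QUERY_TIME_WINDOW` environment variable are assumptions for illustration; the deployed function may structure its queries differently.

```python
import os


def build_count_query(query_string, time_window="now-1d"):
    """Build a query body matching `query_string` within the time window."""
    return {
        "query": {
            "bool": {
                "must": [{"query_string": {"query": query_string}}],
                "filter": [
                    {"range": {"@timestamp": {"gte": time_window, "lte": "now"}}}
                ],
            }
        }
    }


# QUERY_TIME_WINDOW is a hypothetical environment variable name.
window = os.environ.get("QUERY_TIME_WINDOW", "now-1d")
root_activity = build_count_query("userIdentity.type:Root", window)
```

A body like this can be posted to the `_count` or `_search` endpoint of the `cwl-*` indices, and the returned count becomes one line of the report.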

The Lambda function will be named <CloudFormation Stack Name>-ES-Query-LambdaFunction. The Lambda Function enables environment variables such as your query time window to be adjusted or additional functionality like additional Elasticsearch queries to be added to the code. The below sample report allows you to monitor any events against the sample PCI-DSS guidelines. These reports can then be further analyzed in the Kibana dashboard.

    Logging Compliance Report - Wednesday, 11. July 2018 01:06PM
    Violations for time period: 'now-1d'
    All Failed login attempts
    - No Alerts Found
    All Commands, API action taken by AWS root user
    - No Alerts Found
    Action related to RDS (configuration changes)
    - No Alerts Found
    Action related to enabling/disabling/changing of CloudTrail CloudWatch logs
    - 3 API calls indicating alteration of log sources detected
    All access to S3 bucket that stores the AWS logs
    - No Alerts Found
    Action related to VPCs (creation, deletion and changes)
    - No Alerts Found
    Action related to changes to SGs/NACLs (creation, deletion and changes)
    - No Alerts Found
    Action related to changes to IAM roles, users, and groups (creation, deletion and changes)
    - 2 API calls indicating creation, alteration or deletion of IAM roles, users, and groups
    Action related to changes to Route Tables (creation, deletion and changes)
    - No Alerts Found
    Action related to changes to Subnets (creation, deletion and changes)
    - No Alerts Found         
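A report in roughly this layout could be compiled from per-guideline counts as sketched below. The function name and the wording of the non-zero lines are assumptions; the deployed function uses its own per-guideline messages.

```python
def format_report(results, time_window="now-1d"):
    """Compile (guideline, count) pairs into a plain-text report body."""
    lines = ["Violations for time period: '%s'" % time_window]
    for guideline, count in results:
        lines.append(guideline)
        if count:
            lines.append("- %d matching API calls detected" % count)
        else:
            lines.append("- No Alerts Found")
    return "\n".join(lines)


report = format_report([
    ("All Failed login attempts", 0),
    ("Action related to RDS (configuration changes)", 3),
])
```

The resulting text would then be published to the SNS topic as the message body.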


At this point, you have created a private Elasticsearch cluster with Kibana dashboards that monitor AWS CloudTrail events on a sample set of PCI-DSS guidelines and that uses Amazon SNS to send a daily report providing awareness into your environment, all isolated securely within a VPC. In addition to CloudTrail events streaming to the Elasticsearch cluster, events are also shipped to an Amazon S3 bucket to maintain an immutable source of your log files. The provided Lambda functions can be further modified to add more complex search queries and to create more customized reports for your organization. With minimal effort, you could begin sending additional log data from your instances or containers to gain even more insight into the security state of your environment. The more data you retain, the more visibility you have into your resources and the closer you are to achieving Compliance-on-Demand.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.


Michael Coyne

Michael is a consultant for AWS Professional Services. He enjoys the fast-paced environment of ever-changing technology and assisting customers in solving complex issues. Away from AWS, Michael can typically be found with a guitar and spending time with his wife and two young kiddos. He holds a BS in Computer Science from WGU.

Friday Squid Blogging: Do Cephalopods Contain Alien DNA?

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/06/friday_squid_bl_627.html

Maybe not DNA, but biological somethings.

“Cause of Cambrian Explosion - Terrestrial or Cosmic?”:

Abstract: We review the salient evidence consistent with or predicted by the Hoyle-Wickramasinghe (H-W) thesis of Cometary (Cosmic) Biology. Much of this physical and biological evidence is multifactorial. One particular focus are the recent studies which date the emergence of the complex retroviruses of vertebrate lines at or just before the Cambrian Explosion of ~500 Ma. Such viruses are known to be plausibly associated with major evolutionary genomic processes. We believe this coincidence is not fortuitous but is consistent with a key prediction of H-W theory whereby major extinction-diversification evolutionary boundaries coincide with virus-bearing cometary-bolide bombardment events. A second focus is the remarkable evolution of intelligent complexity (Cephalopods) culminating in the emergence of the Octopus. A third focus concerns the micro-organism fossil evidence contained within meteorites as well as the detection in the upper atmosphere of apparent incoming life-bearing particles from space. In our view the totality of the multifactorial data and critical analyses assembled by Fred Hoyle, Chandra Wickramasinghe and their many colleagues since the 1960s leads to a very plausible conclusion — life may have been seeded here on Earth by life-bearing comets as soon as conditions on Earth allowed it to flourish (about or just before 4.1 Billion years ago); and living organisms such as space-resistant and space-hardy bacteria, viruses, more complex eukaryotic cells, fertilised ova and seeds have been continuously delivered ever since to Earth so being one important driver of further terrestrial evolution which has resulted in considerable genetic diversity and which has led to the emergence of mankind.

Two commentaries.

This is almost certainly not true.

As usual, you can also use this squid post to talk about the security stories in the news that I haven’t covered.

Read my blog posting guidelines here.

Security and Human Behavior (SHB 2018)

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/05/security_and_hu_7.html

I’m at Carnegie Mellon University, at the eleventh Workshop on Security and Human Behavior.

SHB is a small invitational gathering of people studying various aspects of the human side of security, organized each year by Alessandro Acquisti, Ross Anderson, and myself. The 50 or so people in the room include psychologists, economists, computer security researchers, sociologists, political scientists, neuroscientists, designers, lawyers, philosophers, anthropologists, business school professors, and a smattering of others. It’s not just an interdisciplinary event; most of the people here are individually interdisciplinary.

The goal is to maximize discussion and interaction. We do that by putting everyone on panels, and limiting talks to 7-10 minutes. The rest of the time is left to open discussion. Four hour-and-a-half panels per day over two days equals eight panels; six people per panel means that 48 people get to speak. We also have lunches, dinners, and receptions — all designed so people from different disciplines talk to each other.

I invariably find this to be the most intellectually stimulating conference of my year. It influences my thinking in many different, and sometimes surprising, ways.

This year’s program is here. This page lists the participants and includes links to some of their work. As he does every year, Ross Anderson is liveblogging the talks. (Ross also maintains a good webpage of psychology and security resources.)

Here are my posts on the first, second, third, fourth, fifth, sixth, seventh, eighth, ninth, and tenth SHB workshops. Follow those links to find summaries, papers, and occasionally audio recordings of the various workshops.

Next year, I’ll be hosting the event at Harvard.

Use Slack ChatOps to Deploy Your Code – How to Integrate Your Pipeline in AWS CodePipeline with Your Slack Channel

Post Syndicated from Rumi Olsen original https://aws.amazon.com/blogs/devops/use-slack-chatops-to-deploy-your-code-how-to-integrate-your-pipeline-in-aws-codepipeline-with-your-slack-channel/

Slack is widely used by DevOps and development teams to communicate status. Typically, when a build has been tested and is ready to be promoted to a staging environment, a QA engineer or DevOps engineer kicks off the deployment. Using Slack in a ChatOps collaboration model, the promotion can be done in a single click from a Slack channel. And because the promotion happens through a Slack channel, the whole development team knows what’s happening without checking email.

In this blog post, I will show you how to integrate AWS services with a Slack application. I use an interactive message button and incoming webhook to promote a stage with a single click.

To follow along with the steps in this post, you’ll need a pipeline in AWS CodePipeline. If you don’t have a pipeline, the fastest way to create one for this use case is to use AWS CodeStar. Go to the AWS CodeStar console and select the Static Website template (shown in the screenshot). AWS CodeStar will create a pipeline with an AWS CodeCommit repository and an AWS CodeDeploy deployment for you. After the pipeline is created, you will need to add a manual approval stage.

You’ll also need to build a Slack app with webhooks and interactive components, write two Lambda functions, and create an API Gateway API and an SNS topic.

As you’ll see in the following diagram, when I make a change and merge a new feature into the master branch in AWS CodeCommit, the check-in kicks off my CI/CD pipeline in AWS CodePipeline. When CodePipeline reaches the approval stage, it sends a notification to Amazon SNS, which triggers an AWS Lambda function (ApprovalRequester).

The Slack channel receives a prompt that looks like the following screenshot. When I click Yes to approve the build promotion, the approval result is sent to CodePipeline through API Gateway and Lambda (ApprovalHandler). The pipeline continues on to deploy the build to the next environment.

Create a Slack app

For App Name, type a name for your app. For Development Slack Workspace, choose the name of your workspace. You’ll see in the following screenshot that my workspace is AWS ChatOps.

After the Slack application has been created, you will see the Basic Information page, where you can create incoming webhooks and enable interactive components.

To add incoming webhooks:

  1. Under Add features and functionality, choose Incoming Webhooks. Turn the feature on by selecting the toggle that currently reads Off, as shown in the following screenshot.
  2. Now that the feature is turned on, choose Add New Webhook to Workspace. In the process of creating the webhook, Slack lets you choose the channel where messages will be posted.
  3. After the webhook has been created, you’ll see its URL. You will use this URL when you create the Lambda function.

If you followed the steps in the post, the pipeline should look like the following.

Write the Lambda function for approval requests

This Lambda function is invoked by the SNS notification. It sends a request that consists of an interactive message button to the incoming webhook you created earlier. The following sample code sends the request to the incoming webhook. SLACK_WEBHOOK_URL and SLACK_CHANNEL are the environment variables that hold the webhook URL that you created and the Slack channel where you want the interactive message button to appear.

# This function is invoked via SNS when the CodePipeline manual approval action starts.
# It will take the details from this approval notification and send an interactive message to Slack that allows users to approve or cancel the deployment.

import os
import json
import logging
import urllib.parse

from base64 import b64decode
from urllib.request import Request, urlopen
from urllib.error import URLError, HTTPError

# These are passed as plain-text environment variables for ease of demonstration.
# Consider encrypting the value with KMS or use an encrypted parameter in Parameter Store for production deployments.
SLACK_WEBHOOK_URL = os.environ['SLACK_WEBHOOK_URL']
SLACK_CHANNEL = os.environ['SLACK_CHANNEL']

logger = logging.getLogger()

def lambda_handler(event, context):
    print("Received event: " + json.dumps(event, indent=2))
    message = event["Records"][0]["Sns"]["Message"]
    data = json.loads(message) 
    token = data["approval"]["token"]
    codepipeline_name = data["approval"]["pipelineName"]
    slack_message = {
        "channel": SLACK_CHANNEL,
        "text": "Would you like to promote the build to production?",
        "attachments": [
            {
                "text": "Yes to deploy your build to production",
                "fallback": "You are unable to promote a build",
                "callback_id": "wopr_game",
                "color": "#3AA3E3",
                "attachment_type": "default",
                "actions": [
                    {
                        "name": "deployment",
                        "text": "Yes",
                        "style": "danger",
                        "type": "button",
                        "value": json.dumps({"approve": True, "codePipelineToken": token, "codePipelineName": codepipeline_name}),
                        "confirm": {
                            "title": "Are you sure?",
                            "text": "This will deploy the build to production",
                            "ok_text": "Yes",
                            "dismiss_text": "No"
                        }
                    },
                    {
                        "name": "deployment",
                        "text": "No",
                        "type": "button",
                        "value": json.dumps({"approve": False, "codePipelineToken": token, "codePipelineName": codepipeline_name})
                    }
                ]
            }
        ]
    }

    req = Request(SLACK_WEBHOOK_URL, json.dumps(slack_message).encode('utf-8'))

    response = urlopen(req)
    return None


Create an SNS topic

Create a topic and then create a subscription that invokes the ApprovalRequester Lambda function. You can configure the manual approval action in the pipeline to send a message to this SNS topic when an approval action is required. When the pipeline reaches the approval stage, it sends a notification to this SNS topic. SNS publishes the notification to all of the subscribed endpoints. In this case, the Lambda function is the only endpoint, so SNS invokes it. For information about how to create an SNS topic, see Create a Topic in the Amazon SNS Developer Guide.
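To see what the ApprovalRequester function receives, it can help to run the parsing step locally against a hand-built event. The sketch below assumes the usual shape of a CodePipeline manual approval notification delivered through SNS, in which the SNS message is a JSON string with an `approval` object. The helper name and all sample values are made up for illustration.

```python
import json


def extract_approval(sns_event):
    """Pull the approval details out of an SNS-delivered CodePipeline notification."""
    message = json.loads(sns_event["Records"][0]["Sns"]["Message"])
    approval = message["approval"]
    return {
        "token": approval["token"],
        "pipeline": approval["pipelineName"],
        # The stage and action names identify where the approval is pending.
        "stage": approval.get("stageName"),
        "action": approval.get("actionName"),
    }


# Hand-built sample event; every value here is a placeholder.
sample_event = {
    "Records": [{
        "Sns": {
            "Message": json.dumps({
                "approval": {
                    "pipelineName": "demo-pipeline",
                    "stageName": "Approval",
                    "actionName": "ManualApproval",
                    "token": "example-token",
                }
            })
        }
    }]
}
details = extract_approval(sample_event)
```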

Write the Lambda function for handling the interactive message button

This Lambda function is invoked by API Gateway. It receives the result of the interactive message button whether or not the build promotion was approved. If approved, an API call is made to CodePipeline to promote the build to the next environment. If not approved, the pipeline stops and does not move to the next stage.

The Lambda function code might look like the following. SLACK_VERIFICATION_TOKEN is the environment variable that contains your Slack verification token. You can find the verification token on the Basic Information page of your Slack app’s management page: scroll down to App Credentials, and the verification token is listed in that section.

# This function is triggered via API Gateway when a user acts on the Slack interactive message sent by approval_requester.py.

from urllib.parse import parse_qs
import json
import os
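import boto3

# Read the Slack verification token from the Lambda environment.
SLACK_VERIFICATION_TOKEN = os.environ['SLACK_VERIFICATION_TOKEN']
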
#Triggered by API Gateway
#It kicks off a particular CodePipeline project
def lambda_handler(event, context):
	#print("Received event: " + json.dumps(event, indent=2))
	body = parse_qs(event['body'])
	payload = json.loads(body['payload'][0])

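	# Validate Slack token
	if SLACK_VERIFICATION_TOKEN == payload['token']:
		# The button value carries the approval decision set by the requester function.
		send_slack_message(json.loads(payload['actions'][0]['value']))

		# This will replace the interactive message with a simple text response.
		# You can implement a more complex message update if you would like.
		return {
			"isBase64Encoded": False,
			"statusCode": 200,
			"body": "{\"text\": \"The approval has been processed\"}"
		}
	else:
		return {
			"isBase64Encoded": False,
			"statusCode": 403,
			"body": "{\"error\": \"This request does not include a valid verification token.\"}"
		}

# Despite its name, this function reports the approval decision to CodePipeline.
def send_slack_message(action_details):
	codepipeline_status = "Approved" if action_details["approve"] else "Rejected"
	codepipeline_name = action_details["codePipelineName"]
	token = action_details["codePipelineToken"]

	client = boto3.client('codepipeline')
	# stageName and actionName are placeholders (assumed names); they must match
	# the manual approval stage and action defined in your own pipeline.
	response_approval = client.put_approval_result(
		pipelineName=codepipeline_name,
		stageName='Approval',
		actionName='Approval',
		result={'summary': '', 'status': codepipeline_status},
		token=token)
	print(response_approval)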

Create the API Gateway API

  1. In the Amazon API Gateway console, create a resource called InteractiveMessageHandler.
  2. Create a POST method.
    • For Integration type, choose Lambda Function.
    • Select Use Lambda Proxy integration.
    • From Lambda Region, choose a region.
    • In Lambda Function, type a name for your function.
  3. Deploy to a stage.

For more information, see Getting Started with Amazon API Gateway in the Amazon API Gateway Developer Guide.

Now go back to your Slack application and enable interactive components.

To enable interactive components for the interactive message (Yes) button:

  1. Under Features, choose Interactive Components.
  2. Choose Enable Interactive Components.
  3. Type a request URL in the text box. Use the invoke URL in Amazon API Gateway that will be called when the approval button is clicked.
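The request URL in step 3 is the invoke URL of the deployed API followed by the resource path. As a rough sketch, the URL can be assembled as follows; the function name and the sample API ID, Region, and stage values are hypothetical, so copy the real invoke URL from the stage page in the API Gateway console.

```python
def invoke_url(rest_api_id, region, stage, resource_path):
    """Assemble the invoke URL of a resource in a deployed REST API stage."""
    base = f"https://{rest_api_id}.execute-api.{region}.amazonaws.com/{stage}"
    return f"{base}/{resource_path.lstrip('/')}"


# Placeholder values for illustration only.
print(invoke_url("abc123", "us-east-1", "prod", "/InteractiveMessageHandler"))
```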

Now that all the pieces have been created, run the solution by checking in a code change to your CodeCommit repo. That will release the change through CodePipeline. When the pipeline reaches the approval stage, it posts a prompt to your Slack channel asking whether you want to promote the build to your staging or production environment. Choose Yes and then check that your change was deployed to the environment.


That is it! You have now created a Slack ChatOps solution using AWS CodeCommit, AWS CodePipeline, AWS Lambda, Amazon API Gateway, and Amazon Simple Notification Service.

Now that you know how to do this Slack and CodePipeline integration, you can use the same method to interact with other AWS services using API Gateway and Lambda. You can also use Slack’s slash command to initiate an action from a Slack channel, rather than responding in the way demonstrated in this post.

From Framework to Function: Deploying AWS Lambda Functions for Java 8 using Apache Maven Archetype

Post Syndicated from Ryosuke Iwanaga original https://aws.amazon.com/blogs/compute/from-framework-to-function-deploying-aws-lambda-functions-for-java-8-using-apache-maven-archetype/

As a serverless computing platform that supports the Java 8 runtime, AWS Lambda makes it easy to run any type of Java function simply by uploading a JAR file. To help define not only a Lambda serverless application but also Amazon API Gateway, Amazon DynamoDB, and other related services, the AWS Serverless Application Model (SAM) allows developers to use a simple AWS CloudFormation template.

AWS provides the AWS Toolkit for Eclipse that supports both Lambda and SAM. AWS also gives customers an easy way to create Lambda functions and SAM applications in Java using the AWS Command Line Interface (AWS CLI). After you build a JAR file, all you have to do is type the following commands:

aws cloudformation package 
aws cloudformation deploy
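The two commands above are shown without their arguments. As a sketch of what a full invocation might look like, the following Python snippet assembles both argument lists; the file names, bucket, and stack name are placeholders, and CAPABILITY_IAM is an assumption that applies when the template creates IAM roles.

```python
import shlex


def package_command(template, bucket, output_template):
    """Arguments that upload local artifacts to S3 and write a packaged template."""
    return ["aws", "cloudformation", "package",
            "--template-file", template,
            "--s3-bucket", bucket,
            "--output-template-file", output_template]


def deploy_command(packaged_template, stack_name):
    """Arguments that create or update a stack from the packaged template."""
    return ["aws", "cloudformation", "deploy",
            "--template-file", packaged_template,
            "--stack-name", stack_name,
            "--capabilities", "CAPABILITY_IAM"]


# Print the commands; pass each list to subprocess.run(cmd, check=True) to execute.
for cmd in (package_command("template.yaml", "YOUR_BUCKET", "target/sam.yaml"),
            deploy_command("target/sam.yaml", "YOUR_STACK_NAME")):
    print(shlex.join(cmd))
```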

To consolidate these steps, customers can use Archetype by Apache Maven. Archetype uses a predefined package template that makes it exceptionally simple to start developing a function.

In this post, I introduce a Maven archetype that allows you to create a skeleton of AWS SAM for a Java function. Using this archetype, you can generate sample Java code and an accompanying SAM template, and deploy them to AWS Lambda with a single Maven action.


Make sure that the following software is installed on your workstation:

  • Java
  • Maven
  • (Optional) AWS SAM CLI

Install Archetype

After you’ve set up those packages, install Archetype with the following commands:

git clone https://github.com/awslabs/aws-serverless-java-archetype
cd aws-serverless-java-archetype
mvn install

These are one-time operations, so you don’t run them for every new package. If you’d like, you can add Archetype to your company’s Maven repository so that other developers can use it later.

With those packages installed, you’re ready to develop your new Lambda function.

Start a project

Now that you have the archetype, customize it and run the code:

cd /path/to/project_home
# -DarchetypeRepository=local forces the use of the local Maven repository.
# -DinteractiveMode=false selects batch mode; omit it to enter the properties below interactively.
mvn archetype:generate \
  -DarchetypeGroupId=com.amazonaws.serverless.archetypes \
  -DarchetypeArtifactId=aws-serverless-java-archetype \
  -DarchetypeVersion=1.0.0 \
  -DarchetypeRepository=local \
  -DinteractiveMode=false \
  -DgroupId=YOUR_GROUP_ID \
  -DartifactId=YOUR_ARTIFACT_ID \
  -Dversion=YOUR_VERSION \

You should have a directory called YOUR_ARTIFACT_ID that contains the files and folders shown below:

├── event.json
├── pom.xml
├── src
│   └── main
│       ├── java
│       │   └── Package
│       │       └── Example.java
│       └── resources
│           └── log4j2.xml
└── template.yaml

The sample code is a working example. If you have installed the SAM CLI, you can invoke it with the command below:

mvn -P invoke verify
[INFO] Scanning for projects...
[INFO] ---------------------------< com.riywo:foo >----------------------------
[INFO] Building foo 1.0
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] --- maven-jar-plugin:3.0.2:jar (default-jar) @ foo ---
[INFO] Building jar: /private/tmp/foo/target/foo-1.0.jar
[INFO] --- maven-shade-plugin:3.1.0:shade (shade) @ foo ---
[INFO] Including com.amazonaws:aws-lambda-java-core:jar:1.2.0 in the shaded jar.
[INFO] Replacing /private/tmp/foo/target/lambda.jar with /private/tmp/foo/target/foo-1.0-shaded.jar
[INFO] --- exec-maven-plugin:1.6.0:exec (sam-local-invoke) @ foo ---
2018/04/06 16:34:35 Successfully parsed template.yaml
2018/04/06 16:34:35 Connected to Docker 1.37
2018/04/06 16:34:35 Fetching lambci/lambda:java8 image for java8 runtime...
java8: Pulling from lambci/lambda
Digest: sha256:14df0a5914d000e15753d739612a506ddb8fa89eaa28dcceff5497d9df2cf7aa
Status: Image is up to date for lambci/lambda:java8
2018/04/06 16:34:37 Invoking Package.Example::handleRequest (java8)
2018/04/06 16:34:37 Decompressing /tmp/foo/target/lambda.jar
2018/04/06 16:34:37 Mounting /private/var/folders/x5/ldp7c38545v9x5dg_zmkr5kxmpdprx/T/aws-sam-local-1523000077594231063 as /var/task:ro inside runtime container
START RequestId: a6ae19fe-b1b0-41e2-80bc-68a40d094d74 Version: $LATEST
Log output: Greeting is 'Hello Tim Wagner.'
END RequestId: a6ae19fe-b1b0-41e2-80bc-68a40d094d74
REPORT RequestId: a6ae19fe-b1b0-41e2-80bc-68a40d094d74	Duration: 96.60 ms	Billed Duration: 100 ms	Memory Size: 128 MB	Max Memory Used: 7 MB

{"greetings":"Hello Tim Wagner."}

[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 10.452 s
[INFO] Finished at: 2018-04-06T16:34:40+09:00
[INFO] ------------------------------------------------------------------------

This Maven goal runs sam local invoke -e event.json, so you see the sample output that greets Tim Wagner.

To deploy this application to AWS, you need an Amazon S3 bucket to upload your package. You can use the following command to create a bucket if you want:

aws s3 mb s3://YOUR_BUCKET --region YOUR_REGION

Now, you can deploy your application with just one command!

mvn deploy \
    -DawsRegion=YOUR_REGION \
    -Ds3Bucket=YOUR_BUCKET \
[INFO] Scanning for projects...
[INFO] ---------------------------< com.riywo:foo >----------------------------
[INFO] Building foo 1.0
[INFO] --------------------------------[ jar ]---------------------------------
[INFO] --- exec-maven-plugin:1.6.0:exec (sam-package) @ foo ---
Uploading to aws-serverless-java/com.riywo:foo:1.0/924732f1f8e4705c87e26ef77b080b47  11657 / 11657.0  (100.00%)
Successfully packaged artifacts and wrote output template to file target/sam.yaml.
Execute the following command to deploy the packaged template
aws cloudformation deploy --template-file /private/tmp/foo/target/sam.yaml --stack-name <YOUR STACK NAME>
[INFO] --- maven-deploy-plugin:2.8.2:deploy (default-deploy) @ foo ---
[INFO] Skipping artifact deployment
[INFO] --- exec-maven-plugin:1.6.0:exec (sam-deploy) @ foo ---

Waiting for changeset to be created..
Waiting for stack create/update to complete
Successfully created/updated stack - archetype
[INFO] ------------------------------------------------------------------------
[INFO] ------------------------------------------------------------------------
[INFO] Total time: 37.176 s
[INFO] Finished at: 2018-04-06T16:41:02+09:00
[INFO] ------------------------------------------------------------------------

Maven automatically creates a shaded JAR file, uploads it to your S3 bucket, replaces template.yaml, and creates or updates the CloudFormation stack.

To customize the process, modify the pom.xml file. For example, to avoid typing values for awsRegion, s3Bucket, or stackName, write them inside pom.xml and check the file in to your VCS. Afterward, you and the rest of your team can deploy the function by typing just the following command:

mvn deploy


The Lambda Java 8 runtime supports several types of handlers: POJO, simple type, and stream. This archetype defaults to the POJO style, which requires request and response classes; the archetype generates them for you. To use another type of handler, set the handlerType property as shown below:

## POJO type (default)
mvn archetype:generate \

## Simple type - String
mvn archetype:generate \

## Stream type
mvn archetype:generate \

See documentation for more details about handlers.

The Lambda Java 8 runtime also supports two logging options: Log4j 2 and LambdaLogger. This archetype generates a LambdaLogger implementation by default, but you can use Log4j 2 if you want:

## LambdaLogger (default)
mvn archetype:generate \

## Log4j 2
mvn archetype:generate \

If you use LambdaLogger, you can delete ./src/main/resources/log4j2.xml. See documentation for more details.


So, what’s next? Develop your Lambda function locally and then type the following command: mvn deploy

With this Archetype code example, available in the GitHub repo, you should be able to deploy Lambda functions for Java 8 in a snap. If you have any questions or comments, please submit them below or leave them on GitHub.

Spring 2018 AWS SOC Reports are Now Available with 11 Services Added in Scope

Post Syndicated from Chris Gile original https://aws.amazon.com/blogs/security/spring-2018-aws-soc-reports-are-now-available-with-11-services-added-in-scope/

Since our last System and Organization Control (SOC) audit, our service and compliance teams have been working to increase the number of AWS services in scope, prioritized based on customer requests. Today, we’re happy to report that 11 services are newly SOC compliant, which is a 21 percent increase in the last six months.

With the addition of the following 11 new services, you can now select from a total of 62 SOC-compliant services. To see the full list, go to our Services in Scope by Compliance Program page:

• Amazon Athena
• Amazon QuickSight
• Amazon WorkDocs
• AWS Batch
• AWS CodeBuild
• AWS Config
• AWS OpsWorks Stacks
• AWS Snowball
• AWS Snowball Edge
• AWS Snowmobile
• AWS X-Ray
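The 21 percent figure follows from the counts above: adding 11 services to reach 62 implies 51 services were previously in scope. A small sketch of that arithmetic (the function name is illustrative only):

```python
def percent_increase(added, total_after):
    """Percentage growth implied by adding `added` items to reach `total_after`."""
    total_before = total_after - added
    return 100 * added / total_before


# 11 new services out of 62 total means 51 before: 11 / 51 is about 21.6 percent.
print(f"{percent_increase(11, 62):.1f}%")
```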

Our latest SOC 1, 2, and 3 reports covering the period from October 1, 2017 to March 31, 2018 are now available. The SOC 1 and 2 reports are available on-demand through AWS Artifact by logging into the AWS Management Console. The SOC 3 report can be downloaded here.

Finally, prospective customers can read our SOC 1 and 2 reports by reaching out to AWS Compliance.

Want more AWS Security news? Follow us on Twitter.

Bad Software Is Our Fault

Post Syndicated from Bozho original https://techblog.bozho.net/bad-software-is-our-fault/

Bad software is everywhere. One can even claim that all software is bad. Cool companies, tech giants, established companies, all produce bad software. And no, yours is not an exception.

Who’s to blame for bad software? It’s all complicated and many factors are intertwined – there’s business requirements, there’s organizational context, there’s lack of sufficient skilled developers, there’s the inherent complexity of software development, there’s leaky abstractions, reliance on 3rd party software, consequences of wrong business and purchase decisions, time limitations, flawed business analysis, etc. So yes, despite the catchy title, I’m aware it’s actually complicated.

But in every “it’s complicated” scenario, there’s always one or two factors that are decisive. All of them contribute somehow, but the major drivers are usually a handful of things. And in the case of bad software, I think it’s the fault of technical people. Developers, architects, ops.

We don’t seem to care about best practices. And I’ll do some nasty generalizations here, but bear with me. We can spend hours arguing about tabs vs spaces, curly bracket on new line, git merge vs rebase, which IDE is better, which framework is better and other largely irrelevant stuff. But we tend to ignore the important aspects that span beyond the code itself. The context in which the code lives, the non-functional requirements – robustness, security, resilience, etc.

We don’t seem to get security. Even trivial stuff such as user authentication is almost always implemented wrong. Recently Twitter and GitHub realized they had been logging plain-text passwords, for example, but that’s just the tip of the iceberg. Too often we ignore the security implications.

“But the business didn’t request the security features”, one may say. The business never requested 2-factor authentication, encryption at rest, PKI, secure (or any) audit trail, log masking, crypto shredding, etc., etc. Because the business doesn’t know these things – we do and we have to put them on the backlog and fight for them to be implemented. Each organization has its specifics and tech people can influence the backlog in different ways, but almost everywhere we can put things there and prioritize them.
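To make one of these items concrete, here is a minimal sketch of log masking using Python’s standard logging module. The pattern, the class name, and the helper are hypothetical, and a real deployment would need a pattern matched to its own log formats; the point is only that masking can sit in the logging pipeline so that secrets never reach a handler.

```python
import io
import logging
import re

# Hypothetical pattern: masks the value after password=, passwd=, or token=.
SECRET_PATTERN = re.compile(r"(?i)\b(password|passwd|token)=(\S+)")


class MaskSecretsFilter(logging.Filter):
    """Rewrite each record so that secret values are replaced before output."""

    def filter(self, record):
        message = record.getMessage()  # merge msg and args before masking
        record.msg = SECRET_PATTERN.sub(r"\1=***", message)
        record.args = ()  # args are already merged into msg
        return True  # keep the record, now masked


def make_masked_logger(name, stream):
    """Return a logger whose only handler masks secrets before writing."""
    logger = logging.getLogger(name)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.addFilter(MaskSecretsFilter())
    logger.addHandler(handler)
    return logger


# Usage: the password value never reaches the output stream.
stream = io.StringIO()
log = make_masked_logger("masking-demo", stream)
log.info("login user=%s password=%s", "alice", "hunter2")
print(stream.getvalue(), end="")
```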

The other aspect is testing. We should all be well aware by now that automated testing is mandatory. We have all the tools in the world for unit, functional, integration, performance and whatnot testing, and yet many software projects lack the necessary test coverage to be able to change stuff without accidentally breaking things. “But testing takes time, we don’t have it”. We are perfectly aware that testing saves time, as we’ve all had those “not again!” recurring bugs. And yet we think of all sorts of excuses – “let the QAs test it”, “we have to ship that now, we’ll test it later”, “this is too trivial to be tested”, etc.

And you may say it’s not our job. We don’t define what has to be done, we just do it. We don’t define the budget, the scope, the features. We just write whatever has been decided. And that’s plain wrong. It’s not our job to make money out of our code, and it’s not our job to define what customers need, but apart from that everything is our job. The way the software is structured, the security aspects and security features, the stability of the code base, the way the software behaves in different environments. The non-functional requirements are our job, and putting them on the backlog is our job.

You’ve probably heard that all software becomes “legacy” after 6 months. And that’s because of us, our sloppiness, our inability to mitigate external factors and constraints. Too often we create a mess through “just doing our job”.

And of course that’s a generalization. I happen to know a lot of great professionals who don’t make these mistakes, who strive for excellence and implement things the right way. But our industry as a whole doesn’t. Our industry as a whole produces bad software. And it’s our fault, as developers – as the only people who know why a certain piece of software is bad.

In a talk of his, Bob Martin warns us of the risks of our sloppiness. We have been building websites so far, but we are more and more building stuff that interacts with the real world, directly and indirectly. Ultimately, lives may depend on our software (like the recent unfortunate death caused by a self-driving car). And I’ll agree with Uncle Bob that it’s high time we self-regulate as an industry, before some technically incompetent politician decides to do that.

How, I don’t know. We’ll have to think more about it. But I’m pretty sure it’s our fault that software is bad, and no amount of blaming the management, the budget, the timing, the tools or the process can eliminate our responsibility.

Why do I insist on bashing my fellow software engineers? Because if we start looking at software development with more responsibility, accepting that if it fails, it’s our fault, then we’re more likely to get out of our current bug-ridden, security-flawed, fragile software hole and really become the experts of the future.

The post Bad Software Is Our Fault appeared first on Bozho's tech blog.

Friday Squid Blogging: US Army Developing 3D-Printable Battlefield Robot Squid

Post Syndicated from Bruce Schneier original https://www.schneier.com/blog/archives/2018/05/friday_squid_bl_623.html

The next major war will be super weird.

As usual, you can also use this squid post to talk about the security stories in the news that I haven’t covered.

Read my blog posting guidelines here.