Tag Archives: Amazon Simple Storage Services (S3)

New for Amazon Redshift – Data Lake Export and Federated Query

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-for-amazon-redshift-data-lake-export-and-federated-queries/

A data warehouse is a database optimized to analyze relational data coming from transactional systems and line of business applications. Amazon Redshift is a fast, fully managed data warehouse that makes it simple and cost-effective to analyze data using standard SQL and existing Business Intelligence (BI) tools.

To get information from unstructured data that would not fit in a data warehouse, you can build a data lake. A data lake is a centralized repository that allows you to store all your structured and unstructured data at any scale. With a data lake built on Amazon Simple Storage Service (S3), you can easily run big data analytics and use machine learning to gain insights from your semi-structured (such as JSON, XML) and unstructured datasets.

Today, we are launching two new features to help you improve the way you manage your data warehouse and integrate with a data lake:

  • Data Lake Export to unload data from a Redshift cluster to S3 in Apache Parquet format, an efficient open columnar storage format optimized for analytics.
  • Federated Query, which lets a Redshift cluster query across data stored in the cluster, in your S3 data lake, and in one or more Amazon Relational Database Service (RDS) for PostgreSQL and Amazon Aurora PostgreSQL databases.

This architectural diagram gives a quick summary of how these features work and how they can be used together with other AWS services.

Let’s take a closer look at the interactions shown in the diagram, starting with how you can use these features and the advantages they provide.

Using Redshift Data Lake Export

You can now unload the result of a Redshift query to your S3 data lake in Apache Parquet format. The Parquet format is up to 2x faster to unload and consumes up to 6x less storage in S3, compared to text formats. This enables you to save the results of the data transformation and enrichment you have done in Redshift to your S3 data lake in an open format.

You can then analyze the data in your data lake with Redshift Spectrum, a feature of Redshift that allows you to query data directly from files on S3. Or you can use different tools such as Amazon Athena, Amazon EMR, or Amazon SageMaker.

To try this new feature, I create a new cluster from the Redshift console, and follow this tutorial to load sample data that keeps track of sales of musical events across different venues. I want to correlate this data with social media comments on the events stored in my data lake. To understand their relevance, each event should have a way of comparing its relative sales to other events.

Let’s build a query in Redshift to export the data to S3. My data is stored across multiple tables. I need to create a query that gives me a single view of what is going on with sales. I want to join the content of the sales and date tables, adding information on the gross sales for an event (total_price in the query), and the percentile in terms of all-time gross sales compared to all events.

To export the result of the query to S3 in Parquet format, I use the following SQL command:

UNLOAD ('SELECT sales.*, date.*, total_price, percentile
           FROM sales, date,
                (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) / 10.0 as percentile
                   FROM (SELECT eventid, sum(pricepaid) total_price
                           FROM sales
                       GROUP BY eventid)) as percentile_events
          WHERE sales.dateid = date.dateid
            AND percentile_events.eventid = sales.eventid')
TO 's3://MY-BUCKET/DataLake/Sales/'
FORMAT AS PARQUET
CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole';

To give Redshift write access to my S3 bucket, I am using an AWS Identity and Access Management (IAM) role. I can see the result of the UNLOAD command using the AWS Command Line Interface (CLI). As expected, the output of the query is exported using the Parquet columnar data format:

$ aws s3 ls s3://MY-BUCKET/DataLake/Sales/
2019-11-25 14:26:56 1638550 0000_part_00.parquet
2019-11-25 14:26:56 1635489 0001_part_00.parquet
2019-11-25 14:26:56 1624418 0002_part_00.parquet
2019-11-25 14:26:56 1646179 0003_part_00.parquet
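
The UNLOAD statement can also be issued from code instead of a SQL client. Here is a minimal Python sketch using the psycopg2 driver; the cluster endpoint, credentials, bucket, and IAM role ARN are placeholders, not values from this walkthrough.

import psycopg2

# Hypothetical connection details for the Redshift cluster
conn = psycopg2.connect(
    host='my-cluster.abc123xyz.us-east-1.redshift.amazonaws.com',
    port=5439,
    dbname='dev',
    user='awsuser',
    password='REPLACE_ME'
)
conn.autocommit = True  # run the statement outside an explicit transaction

unload_sql = """
UNLOAD ('SELECT * FROM sales')
TO 's3://MY-BUCKET/DataLake/Sales/'
FORMAT AS PARQUET
CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole';
"""

with conn.cursor() as cur:
    cur.execute(unload_sql)  # Redshift writes the Parquet files directly to S3

conn.close()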

To optimize access to data, I can specify one or more partition columns so that unloaded data is automatically partitioned into folders in my S3 bucket. For example, I can unload sales data partitioned by year, month, and day. This enables my queries to take advantage of partition pruning and skip scanning irrelevant partitions, improving query performance and minimizing cost.

To use partitioning, I need to add to the previous SQL command the PARTITION BY option, followed by the columns I want to use to partition the data in different directories. In my case, I want to partition the output based on the year and the calendar date (caldate in the query) of the sales.

UNLOAD ('SELECT sales.*, date.*, total_price, percentile
           FROM sales, date,
                (SELECT eventid, total_price, ntile(1000) over(order by total_price desc) / 10.0 as percentile
                   FROM (SELECT eventid, sum(pricepaid) total_price
                           FROM sales
                       GROUP BY eventid)) as percentile_events
          WHERE sales.dateid = date.dateid
            AND percentile_events.eventid = sales.eventid')
TO 's3://MY-BUCKET/DataLake/SalesPartitioned/'
FORMAT AS PARQUET
PARTITION BY (year, caldate)
CREDENTIALS 'aws_iam_role=arn:aws:iam::123412341234:role/myRedshiftRole';

This time, the output of the query is stored in multiple partitions. For example, here’s the content of a folder for a specific year and date:

$ aws s3 ls s3://MY-BUCKET/DataLake/SalesPartitioned/year=2008/caldate=2008-07-20/
2019-11-25 14:36:17 11940 0000_part_00.parquet
2019-11-25 14:36:17 11052 0001_part_00.parquet
2019-11-25 14:36:17 11138 0002_part_00.parquet
2019-11-25 14:36:18 12582 0003_part_00.parquet

Optionally, I can use AWS Glue to set up a Crawler that (on demand or on a schedule) looks for data in my S3 bucket to update the Glue Data Catalog. When the Data Catalog is updated, I can easily query the data using Redshift Spectrum, Athena, or EMR.
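
As a sketch of that Glue setup (the crawler name, database, and role below are hypothetical, not from this post), a crawler pointing at the unloaded prefix can be created and started with boto3:

import boto3

glue = boto3.client('glue')

# Hypothetical names; the role must allow Glue to read the S3 prefix
glue.create_crawler(
    Name='sales-datalake-crawler',
    Role='arn:aws:iam::123412341234:role/myGlueCrawlerRole',
    DatabaseName='sales_datalake',
    Targets={'S3Targets': [{'Path': 's3://MY-BUCKET/DataLake/SalesPartitioned/'}]},
    Schedule='cron(0 0 * * ? *)'  # optional daily schedule; omit to run on demand
)
glue.start_crawler(Name='sales-datalake-crawler')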

The sales data is now ready to be processed together with the unstructured and semi-structured (JSON, XML, Parquet) data in my data lake. For example, I can now use Apache Spark with EMR, or any SageMaker built-in algorithm to access the data and get new insights.

Using Redshift Federated Query
You can now also access data in RDS and Aurora PostgreSQL stores directly from your Redshift data warehouse. In this way, you can access data as soon as it is available. Straight from Redshift, you can now perform queries processing data in your data warehouse, transactional databases, and data lake, without requiring ETL jobs to transfer data to the data warehouse.

Redshift leverages its advanced optimization capabilities to push down and distribute a significant portion of the computation directly into the transactional databases, minimizing the amount of data moving over the network.

Using this syntax, you can add an external schema from an RDS or Aurora PostgreSQL database to a Redshift cluster:

CREATE EXTERNAL SCHEMA IF NOT EXISTS online_system
FROM POSTGRES
DATABASE 'online_sales_db' SCHEMA 'online_system'
URI 'my-hostname' PORT 5432
IAM_ROLE 'iam-role-arn'
SECRET_ARN 'ssm-secret-arn';

The schema and port are optional here. The schema defaults to public if left unspecified, and the default port for PostgreSQL databases is 5432. Redshift uses AWS Secrets Manager to manage the credentials used to connect to the external databases.

With this command, all tables in the external schema are available and can be used by Redshift for any complex SQL query processing data in the cluster or, using Redshift Spectrum, in your S3 data lake.
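
The SECRET_ARN in the command above refers to a secret that stores the PostgreSQL credentials. Here is a hedged sketch of creating such a secret with boto3; the secret name and values are placeholders, and the exact key names Redshift expects should be checked against the documentation.

import json
import boto3

secretsmanager = boto3.client('secretsmanager')

# Hypothetical credentials for the online sales database
response = secretsmanager.create_secret(
    Name='online-sales-db-credentials',
    SecretString=json.dumps({
        'username': 'postgres_user',
        'password': 'REPLACE_ME'
    })
)

# Use this ARN as the SECRET_ARN value in CREATE EXTERNAL SCHEMA
print(response['ARN'])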

Coming back to the sales data example I used before, I can now correlate the trends of my historical data of musical events with real-time sales. In this way, I can understand if an event is performing as expected or not, and calibrate my marketing activities without delays.

For example, after I define the online commerce database as the online_system external schema in my Redshift cluster, I can compare previous sales with what is in the online commerce system with this simple query:

SELECT eventid,
       sum(pricepaid) total_price,
       sum(online_pricepaid) online_total_price
  FROM sales, online_system.current_sales
 WHERE eventid = online_eventid
 GROUP BY eventid;

Redshift doesn’t import the database or schema catalog in its entirety. When a query is run, it localizes the metadata for the Aurora and RDS tables (and views) that are part of the query. This localized metadata is then used for query compilation and plan generation.

Available Now
Amazon Redshift data lake export is a new tool to improve your data processing pipeline and is supported with Redshift release version 1.0.10480 or later. Refer to the AWS Region Table for Redshift availability, and check the version of your clusters.

The new federation capability in Amazon Redshift is released as a public preview and allows you to bring together data stored in Redshift, S3, and one or more RDS and Aurora PostgreSQL databases. When creating a cluster in the Amazon Redshift management console, you can pick one of three maintenance tracks: Current, Trailing, or Preview. Within the Preview track, choose preview_features to participate in the Federated Query public preview.

These features simplify data processing and analytics, giving you more tools to react quickly, and a single point of view for your data. Let me know what you are going to use them for!

Danilo

Easily Manage Shared Data Sets with Amazon S3 Access Points

Post Syndicated from Brandon West original https://aws.amazon.com/blogs/aws/easily-manage-shared-data-sets-with-amazon-s3-access-points/

Storage that is secure, scalable, durable, and highly available is a fundamental component of cloud computing. That’s why Amazon Simple Storage Service (S3) was the first service launched by AWS, back in 2006. It has been a building block of many of the more than 175 services that AWS now offers. As we approach the beginning of a new decade, capabilities like Amazon Redshift, Amazon Athena, Amazon EMR and AWS Lake Formation have made S3 not just a way to store objects but an engine for turning that data into insights. These capabilities mean that access patterns and requirements for the data stored in buckets have evolved.

Today we’re launching a new way to manage data access at scale for shared data sets in S3: Amazon S3 Access Points. S3 Access Points are unique hostnames with dedicated access policies that describe how data can be accessed using that endpoint. Before S3 Access Points, shared access to data meant managing a single policy document on a bucket. These policies could represent hundreds of applications with many differing permissions, making audits and updates a potential bottleneck affecting many systems.

With S3 Access Points, you can add access points as you add additional applications or teams, keeping your policies specific and easier to manage. A bucket can have multiple access points, and each access point has its own AWS Identity and Access Management (IAM) policy. Access point policies are similar to bucket policies, but associated with the access point. S3 Access Points can also be restricted to only allow access from within an Amazon Virtual Private Cloud. And because each access point has a unique DNS name, you can now address your buckets with any name that is unique within your AWS account and region.

Creating S3 Access Points

Let’s add an access point to a bucket using the S3 Console. You can also create and manage your S3 Access Points using the AWS Command Line Interface (CLI), AWS SDKs, or via the API. I’ve selected a bucket that contains artifacts generated by an AWS Lambda function, and clicked on the access points tab.

Access points tab in S3 Console

Let’s create a new access point. I want to give an IAM user Alice permission to GET and PUT objects with the prefix Alice. I’m going to name this access point alices-access-point. There are options for restricting access to a Virtual Private Cloud, which just requires a Virtual Private Cloud ID. In this case, I want to allow access from outside the VPC as well, so after I took this screenshot, I selected Internet and moved on to the next step.

Creating an Access Point

S3 Access Points makes it easy to block public access. I’m going to block all public access to this access point.

Public access settings

And now I can attach my policy. In this policy, our Principal is our user Alice, and the resource is our access point combined with every object with the prefix /Alice. For more examples of the kinds of policies you might want to attach to your S3 Access Points, take a look at the docs.

Creating access point policy
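
The same setup can be done programmatically through the S3 control plane API. Here is a minimal boto3 sketch, assuming a hypothetical account ID and bucket name:

import json
import boto3

s3control = boto3.client('s3control')
account_id = '123456789012'  # hypothetical account ID

# Create the access point on the bucket (add VpcConfiguration to restrict it to a VPC)
s3control.create_access_point(
    AccountId=account_id,
    Name='alices-access-point',
    Bucket='my-artifact-bucket'
)

# Attach a policy granting Alice GET and PUT on objects under the Alice prefix
policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"AWS": f"arn:aws:iam::{account_id}:user/Alice"},
        "Action": ["s3:GetObject", "s3:PutObject"],
        "Resource": f"arn:aws:s3:us-east-1:{account_id}:accesspoint/alices-access-point/object/Alice/*"
    }]
}
s3control.put_access_point_policy(
    AccountId=account_id,
    Name='alices-access-point',
    Policy=json.dumps(policy)
)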

After I create the access point, I can access it by hostname using the format https://[access_point_name]-[accountID].s3-accesspoint.[region].amazonaws.com. Via the SDKs and CLI, I can use it the same way I would use a bucket once I’ve updated to the latest version. For example, assuming I were authenticated as Alice, I could do the following:

$ aws s3api get-object --key /Alice/object.zip --bucket arn:aws:s3:us-east-1:[my-account-id]:accesspoint/alices-access-point download.zip
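
The equivalent call with boto3 (assuming a recent SDK version and a hypothetical account ID) simply passes the access point ARN where a bucket name would normally go:

import boto3

s3 = boto3.client('s3')

obj = s3.get_object(
    Bucket='arn:aws:s3:us-east-1:123456789012:accesspoint/alices-access-point',
    Key='Alice/object.zip'
)
with open('download.zip', 'wb') as f:
    f.write(obj['Body'].read())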

Access points that are not restricted to VPCs can also be used via the S3 Console.

Things to Know

When it comes to software design, keeping scopes small and focused on a specific task is almost always a good decision. With S3 Access Points, you can customize hostnames and permissions for any user or application that needs access to your shared data set. Let us know how you like this new capability, and happy building!

— Brandon

Continuously monitor unused IAM roles with AWS Config

Post Syndicated from Michael Chan original https://aws.amazon.com/blogs/security/continuously-monitor-unused-iam-roles-aws-config/

Developing in the cloud encourages you to iterate frequently as your applications and resources evolve. You should also apply this iterative approach to the AWS Identity and Access Management (IAM) roles you create. Periodically ensuring that all the resources you’ve created are still being used can reduce operational complexity by eliminating the need to track unnecessary resources. It also improves security: identifying unused IAM roles helps reduce the potential for improper or unintended access to your critical infrastructure and workloads.

The IAM API now provides you with information about when a role has last been used to make an AWS request. In this post, I demonstrate how you can identify inactive roles using role last used information. Additionally, I’ll show you how to implement continuous monitoring of role activity using AWS Config.

AWS services and features used

This solution uses the following services and features:

  • AWS IAM: This service enables you to manage access to AWS services and resources securely. It provides an API to retrieve the timestamp of an IAM role’s last use when making an AWS request, and the region where the request was made.
  • AWS Config: This service allows you to continuously monitor and record your AWS resource configurations. It will periodically trigger your AWS Config rule (see next bullet) and will record compliance status.
  • AWS Config Rule: This resource represents your desired configuration settings for specific AWS resources or for an entire AWS account. This resource will check the compliance status of your AWS resources. You can provide the logic that determines compliance, which enables you to mark IAM roles in use as “compliant” and inactive roles as “non-compliant.”
  • AWS Lambda: This service lets you run code without provisioning or managing servers. Lambda will be used to execute API calls to retrieve role last used information and to provide compliance evaluations to AWS Config.
  • Amazon Simple Storage Service (Amazon S3): This is a highly available and durable object store. You’ll use it to store your Lambda code in .zip format prior to deploying your Lambda function.
  • AWS CloudFormation: This service provides a common language for you to describe and provision all the infrastructure resources in your cloud environment. You’ll use it to provision all the resources described in this solution.

Solution logic

This solution identifies unused IAM roles within your account. First, you’ll identify unused roles based on a time window (last number of days) you set. I use 60 days in my example, but this range is configurable. Second, you’ll use AWS Lambda to process all the roles in your account. Third, you’ll determine if they’re compliant based on their creation time and role last used information. Last, you’ll send your evaluations to AWS Config, which records the results and reports if each role is compliant or not. If not, you can take steps to remediate, such as denying all actions that the role can perform.

Prerequisites

This solution has the following prerequisites:

Solution architecture

 

Figure 1: Solution architecture

As shown in the diagram, AWS Config (1) executes the AWS Config custom rule daily, and this frequency is configurable (2), which in turn invokes the Lambda function (3). The Lambda function enumerates each role and determines its creation date and role last used timestamp, both of which are provided via IAM’s GetAccountAuthorizationDetails API (4). When the Lambda function has determined the compliance of all your roles, the function returns the compliance results to AWS Config (5). AWS Config retains the history of compliance changes evaluated by the rule. If configured, compliance notifications can be sent to an Amazon Simple Notification Service (Amazon SNS) topic. Compliance status is viewable either in the AWS Management Console or through use of the AWS CLI or AWS SDK.

Deploying the solution

The resources for this solution are deployed through AWS CloudFormation. You must prepare the Lambda function’s source code for packaging before AWS CloudFormation can deploy the complete solution into your account.

Step 1: Prepare the Lambda deployment

First, make sure you’re running a *nix prompt (Linux, Mac, or Windows subsystem for Linux). Follow the commands below to create an empty folder named iam-role-last-used where you’ll place your Lambda source code.


mkdir iam-role-last-used
cd iam-role-last-used
touch lambda_function.py

Note that the directory you create and the code it contains will later be compressed into a .zip file by the AWS CLI’s cloudformation package command. This command also uploads the deployment .zip file to your S3 bucket. The cloudformation deploy command will reference this bucket when deploying the solution.

Next, create a Lambda layer with the latest boto3 package. This ensures that your Lambda function is using an up-to-date boto3 SDK and allows you to control the dependencies in your function’s deployment package. You can do this by following steps 1 through 4 in these directions. Be sure to record the Lambda layer ARN that you create because you will use it later.

Finally, open the lambda_function.py file in your favorite editor or integrated development environment (IDE), and place the following code into the lambda_function.py file:


import boto3
from botocore.exceptions import ClientError
from botocore.config import Config
import datetime
import fnmatch
import json
import os
import re
import logging


logger = logging.getLogger()
logging.basicConfig(
    format="[%(asctime)s] %(levelname)s [%(module)s.%(funcName)s:%(lineno)d] %(message)s", datefmt="%H:%M:%S"
)
logger.setLevel(os.getenv('log_level', logging.INFO))

# Configure boto retries
BOTO_CONFIG = Config(retries=dict(max_attempts=5))

# Define the default resource to report to Config Rules
DEFAULT_RESOURCE_TYPE = 'AWS::IAM::Role'

CONFIG_ROLE_TIMEOUT_SECONDS = 60

# Set to True to get the lambda to assume the Role attached on the Config service (useful for cross-account).
ASSUME_ROLE_MODE = False

# Evaluation strings for Config evaluations
COMPLIANT = 'COMPLIANT'
NON_COMPLIANT = 'NON_COMPLIANT'


# This gets the client after assuming the Config service role either in the same AWS account or cross-account.
def get_client(service, execution_role_arn):
    if not ASSUME_ROLE_MODE:
        return boto3.client(service)
    credentials = get_assume_role_credentials(execution_role_arn)
    return boto3.client(service, aws_access_key_id=credentials['AccessKeyId'],
                        aws_secret_access_key=credentials['SecretAccessKey'],
                        aws_session_token=credentials['SessionToken'],
                        config=BOTO_CONFIG
                        )


def get_assume_role_credentials(execution_role_arn):
    sts_client = boto3.client('sts')
    try:
        assume_role_response = sts_client.assume_role(RoleArn=execution_role_arn,
                                                      RoleSessionName="configLambdaExecution",
                                                      DurationSeconds=CONFIG_ROLE_TIMEOUT_SECONDS)
        return assume_role_response['Credentials']
    except ClientError as ex:
        if 'AccessDenied' in ex.response['Error']['Code']:
            ex.response['Error']['Message'] = "AWS Config does not have permission to assume the IAM role."
        else:
            ex.response['Error']['Message'] = "InternalError"
            ex.response['Error']['Code'] = "InternalError"
        raise ex


# Validates the role pathname whitelist passed via AWS Config parameters and returns a list of patterns (split on the pipe character).
def validate_whitelist(unvalidated_role_pattern_whitelist):
    # Names of users, groups, roles must be alphanumeric, including the following common
    # characters: plus (+), equal (=), comma (,), period (.), at (@), underscore (_), and hyphen (-).

    if not unvalidated_role_pattern_whitelist:
        return None

    # Reject the whitelist if it contains any character outside the allowed set
    regex = re.compile('[^-a-zA-Z0-9+=,.@_/|*]')
    if regex.search(unvalidated_role_pattern_whitelist):
        raise ValueError("[Error] Provided whitelist has invalid characters")

    return unvalidated_role_pattern_whitelist.split('|')


# This uses Unix filename pattern matching (as opposed to regular expressions), as documented here:
# https://docs.python.org/3.7/library/fnmatch.html.  Please note that if using a wildcard, e.g. "*", you should use
# it sparingly/appropriately.
# If the rolename matches the pattern, then it is whitelisted
def is_whitelisted_role(role_pathname, pattern_list):
    if not pattern_list:
        return False

    # If role_pathname matches pattern, then return True, else False
    # eg. /service-role/aws-codestar-service-role matches pattern /service-role/*
    # https://docs.python.org/3.7/library/fnmatch.html
    for pattern in pattern_list:
        if fnmatch.fnmatch(role_pathname, pattern):
            # whitelisted
            return True

    # not whitelisted
    return False


# Form an evaluation as a dictionary. Suited to report on scheduled rules.  More info here:
#   https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/config.html#ConfigService.Client.put_evaluations
def build_evaluation(resource_id, compliance_type, notification_creation_time, resource_type=DEFAULT_RESOURCE_TYPE, annotation=None):
    evaluation = {}
    if annotation:
        evaluation['Annotation'] = annotation
    evaluation['ComplianceResourceType'] = resource_type
    evaluation['ComplianceResourceId'] = resource_id
    evaluation['ComplianceType'] = compliance_type
    evaluation['OrderingTimestamp'] = notification_creation_time
    return evaluation


# Determine if any roles were used to make an AWS request
def determine_last_used(role_name, role_last_used, max_age_in_days, notification_creation_time):

    last_used_date = role_last_used.get('LastUsedDate', None)
    used_region = role_last_used.get('Region', None)

    if not last_used_date:
        compliance_result = NON_COMPLIANT
        reason = "No record of usage"
        logger.info(f"NON_COMPLIANT: {role_name} has never been used")
        return build_evaluation(role_name, compliance_result, notification_creation_time, resource_type=DEFAULT_RESOURCE_TYPE, annotation=reason)


    days_unused = (datetime.datetime.now() - last_used_date.replace(tzinfo=None)).days

    if days_unused > max_age_in_days:
        compliance_result = NON_COMPLIANT
        reason = f"Was used {days_unused} days ago in {used_region}"
        logger.info(f"NON_COMPLIANT: {role_name} has not been used for {days_unused} days, last use in {used_region}")
        return build_evaluation(role_name, compliance_result, notification_creation_time, resource_type=DEFAULT_RESOURCE_TYPE, annotation=reason)

    compliance_result = COMPLIANT
    reason = f"Was used {days_unused} days ago in {used_region}"
    logger.info(f"COMPLIANT: {role_name} used {days_unused} days ago in {used_region}")
    return build_evaluation(role_name, compliance_result, notification_creation_time, resource_type=DEFAULT_RESOURCE_TYPE, annotation=reason)


# Returns a list of dicts, each of which has authorization details of each role.  More info here:
#   https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/iam.html#IAM.Client.get_account_authorization_details
def get_role_authorization_details(iam_client):

    roles_authorization_details = []
    roles_list = iam_client.get_account_authorization_details(Filter=['Role'])

    while True:
        roles_authorization_details += roles_list['RoleDetailList']
        if 'Marker' in roles_list:
            roles_list = iam_client.get_account_authorization_details(Filter=['Role'], MaxItems=100, Marker=roles_list['Marker'])
        else:
            break

    return roles_authorization_details


# Check the compliance of each role by determining if role last used is > than max_days_for_last_used
def evaluate_compliance(event, context):

    # Initialize our AWS clients
    iam_client = get_client('iam', event["executionRoleArn"])
    config_client = get_client('config', event["executionRoleArn"])

    # List of resource evaluations to return back to AWS Config
    evaluations = []

    # List of dicts of each role's authorization details as returned by boto3
    all_roles = get_role_authorization_details(iam_client)

    # Timestamp of when AWS Config triggered this evaluation
    notification_creation_time = str(json.loads(event['invokingEvent'])['notificationCreationTime'])

    # ruleParameters is received from AWS Config's user-defined parameters
    rule_parameters = json.loads(event["ruleParameters"])

    # Maximum allowed days that a role can be unused, or has been last used for an AWS request
    max_days_for_last_used = int(os.environ.get('max_days_for_last_used', '60'))
    if 'max_days_for_last_used' in rule_parameters:
        max_days_for_last_used = int(rule_parameters['max_days_for_last_used'])

    whitelisted_role_pattern_list = []
    if 'role_whitelist' in rule_parameters:
        whitelisted_role_pattern_list = validate_whitelist(rule_parameters['role_whitelist'])

    # Iterate over all our roles.  If the age of a role in days is <= max_days_for_last_used, it is compliant
    for role in all_roles:

        role_name = role['RoleName']
        role_path = role['Path']
        role_creation_date = role['CreateDate']
        role_last_used = role['RoleLastUsed']
        role_age_in_days = (datetime.datetime.now() - role_creation_date.replace(tzinfo=None)).days

        if is_whitelisted_role(role_path + role_name, whitelisted_role_pattern_list):
            compliance_result = COMPLIANT
            reason = "Role is whitelisted"
            evaluations.append(
                build_evaluation(role_name, compliance_result, notification_creation_time, resource_type=DEFAULT_RESOURCE_TYPE, annotation=reason))
            logger.info(f"COMPLIANT: {role} is whitelisted")
            continue

        if role_age_in_days <= max_days_for_last_used:
            compliance_result = COMPLIANT
            reason = f"Role age is {role_age_in_days} days"
            evaluations.append(
                build_evaluation(role_name, compliance_result, notification_creation_time, resource_type=DEFAULT_RESOURCE_TYPE, annotation=reason))
            logger.info(f"COMPLIANT: {role_name} - {role_age_in_days} is newer or equal to {max_days_for_last_used} days")
            continue

        evaluation_result = determine_last_used(role_name, role_last_used, max_days_for_last_used, notification_creation_time)
        evaluations.append(evaluation_result)

    # Iterate over our evaluations 100 at a time, as put_evaluations only accepts a max of 100 evals.
    evaluations_copy = evaluations[:]
    while evaluations_copy:
        config_client.put_evaluations(Evaluations=evaluations_copy[:100], ResultToken=event['resultToken'])
        del evaluations_copy[:100]

Here’s how the above code works. The AWS Config custom rule invokes the Lambda function, calling the evaluate_compliance() method. evaluate_compliance() does the following:

  1. Retrieves information on all roles from IAM using the GetAccountAuthorizationDetails API as mentioned previously. This includes each role’s creation date and role last used timestamp.
  2. Marks each role as compliant if the role name matches one of the patterns in your whitelisted_role_pattern_list. This pattern list is passed to your rule via a user-configurable AWS CloudFormation parameter named RolePatternWhitelist. “Whitelisting roles,” below, provides instructions about how to do this.
  3. Marks each role as compliant if the age of the role in days (role_age_in_days) is less than or equal to the parameter MaxDaysForLastUsed (max_days_for_last_used). This is set via a user-configurable parameter in your CloudFormation stack. You’ll use this parameter to set the time window for how long a role can be inactive.
  4. If neither of the above conditions are met, then determine_last_used() is called, and each role will be marked as non-compliant if days_unused is greater than max_age_in_days.
  5. Finally, evaluate_compliance() calls put_evaluations() against AWS Config to store your evaluations of each role.

Step 2: Deploy the AWS CloudFormation template

Next, create an AWS CloudFormation template file named  iam-role-last-used.yml. This template uses the AWS Serverless Application Model (AWS SAM), which is an extension of CloudFormation. AWS SAM simplifies the deployment so that you don’t have to manually upload your deployment .zip file to your Amazon S3 bucket. To ensure that your template knows the location of your code .zip file, place the file on the same directory level as the iam-role-last-used directory that you created above. Then copy and paste the code below and save it to the iam-role-last-used.yml file.


AWSTemplateFormatVersion: '2010-09-09'
Description: "Creates an AWS Config rule and Lambda to check all roles' last used compliance"
Transform: 'AWS::Serverless-2016-10-31'
Parameters:

  MaxDaysForLastUsed:
    Description: Checks the number of days allowed for a role to not be used before being non-compliant
    Type: Number
    Default: 60
    MaxValue: 365

  NameOfSolution:
    Type: String
    Default: iam-role-last-used
    Description: The name of the solution - used for naming of created resources

  RolePatternWhitelist:
    Description: Pipe separated whitelist of role pathnames using simple pathname matching
    Type: String
    Default: ''
    AllowedPattern: '[-a-zA-Z0-9+=,.@_/|*]+|^$'

  LambdaLayerArn:
    Type: String
    Description: The ARN for the Lambda Layer you will use.
  
Resources:
  LambdaInvokePermission:
    Type: 'AWS::Lambda::Permission'
    DependsOn: CheckRoleLastUsedLambda
    Properties: 
      FunctionName: !GetAtt CheckRoleLastUsedLambda.Arn
      Action: lambda:InvokeFunction
      Principal: config.amazonaws.com
      SourceAccount: !Ref 'AWS::AccountId'

  LambdaExecutionRole:
    Type: 'AWS::IAM::Role'
    Properties:
      RoleName: !Sub '${NameOfSolution}-${AWS::Region}'
      AssumeRolePolicyDocument:
        Version: '2012-10-17'
        Statement:
        - Effect: Allow
          Principal:
            Service: lambda.amazonaws.com
          Action:
          - sts:AssumeRole
      Path: /
      Policies:
      - PolicyName: !Sub '${NameOfSolution}'
        PolicyDocument:
          Version: '2012-10-17'
          Statement:
          - Effect: Allow
            Action:
            - config:PutEvaluations
            Resource: '*'
          - Effect: Allow
            Action:
            - iam:GetAccountAuthorizationDetails
            Resource: '*'
          - Effect: Allow
            Action:
            - logs:CreateLogStream
            - logs:PutLogEvents
            Resource:
            - !Sub 'arn:${AWS::Partition}:logs:${AWS::Region}:*:log-group:/aws/lambda/${NameOfSolution}:log-stream:*'

  CheckRoleLastUsedLambda:
    Type: 'AWS::Serverless::Function'
    Properties:
      Description: "Checks IAM roles' last used info for AWS Config"
      FunctionName: !Sub '${NameOfSolution}'
      Handler: lambda_function.evaluate_compliance
      MemorySize: 256
      Role: !GetAtt LambdaExecutionRole.Arn
      Runtime: python3.7
      Timeout: 300
      CodeUri: ./iam-role-last-used
      Layers:
      - !Ref LambdaLayerArn

  LambdaLogGroup:
    Type: 'AWS::Logs::LogGroup'
    Properties: 
      LogGroupName: !Sub '/aws/lambda/${NameOfSolution}'
      RetentionInDays: 30

  ConfigCustomRule:
    Type: 'AWS::Config::ConfigRule'
    DependsOn:
    - LambdaInvokePermission
    - LambdaExecutionRole
    Properties:
      ConfigRuleName: !Sub '${NameOfSolution}'
      Description: Checks the number of days that an IAM role has not been used to make a service request. If the number of days exceeds the specified threshold, it is marked as non-compliant.
      InputParameters: !Sub '{"role_whitelist":"${RolePatternWhitelist}","max_days_for_last_used":"${MaxDaysForLastUsed}"}'
      Source: 
        Owner: CUSTOM_LAMBDA
        SourceDetails: 
        - EventSource: aws.config
          MaximumExecutionFrequency: TwentyFour_Hours
          MessageType: ScheduledNotification
        SourceIdentifier: !GetAtt CheckRoleLastUsedLambda.Arn

For your reference, below is a summary of the template.

  • Parameters (these are user-configurable variables):
    • MaxDaysForLastUsed—maximum amount of days allowed for a role that has not been used to make an AWS request before becoming non-compliant
    • NameOfSolution—the name of the solution, used for naming of created resources
    • RolePatternWhitelist—a pipe (“|”) separated whitelist of role pathnames using simple pathname matching (see Whitelisting roles below)
    • LambdaLayerArn—the unique ARN for your Lambda layer
  • Resources (these are the AWS resources that will be created within your account):
    • LambdaInvokePermission—allows AWS Config to invoke your Lambda function
    • LambdaExecutionRole—the role and permissions that Lambda will assume to process your roles. The policies assigned to this role allow you to perform the iam:GetAccountAuthorizationDetails, config:PutEvaluations, logs:CreateLogStream, and logs:PutLogEvents actions. The PutEvaluations action allows you to send evaluation results back to AWS Config. The CreateLogStream and PutLogEvents actions allow you to write the Lambda execution logs to Amazon CloudWatch Logs.
    • CheckRoleLastUsedLambda—defines your Lambda function and its attributes
    • LambdaLogGroup—logs from Lambda will be written to this CloudWatch Log Group
    • ConfigCustomRule—defines your custom AWS Config rule and its attributes

With the CloudFormation template you created above, use the AWS CLI’s cloudformation package command to zip the deployment package and upload it to the S3 bucket that you specify, as shown below. Make sure to replace <YOUR S3 BUCKET> with your bucket name only. Do not include the s3:// prefix:


aws cloudformation package --region <YOUR REGION> --template-file iam-role-last-used.yml \
--s3-bucket <YOUR S3 BUCKET> \
--output-template-file iam-role-last-used-transformed.yml

This will create the file iam-role-last-used-transformed.yml, which adds a reference to the S3 bucket and the pathname needed by CloudFormation to deploy your Lambda function.

Finally, deploy the solution into your AWS account using the cloudformation deploy command below. You can provide different values for NameOfSolution, MaxDaysForLastUsed, or RolePatternWhitelist by using the --parameter-overrides option. Otherwise, defaults will be used. These are specified at the top of the AWS CloudFormation template pasted above, under the Parameters section.


aws cloudformation deploy --region <YOUR REGION> --template-file iam-role-last-used-transformed.yml \
--stack-name iam-role-last-used \
--parameter-overrides NameOfSolution='iam-role-last-used' \
MaxDaysForLastUsed=60 \
RolePatternWhitelist='/breakglass-role|/security-*' \
LambdaLayerArn='<YOUR LAMBDA LAYER ARN>' \
--capabilities CAPABILITY_NAMED_IAM

The deployment is complete after the AWS CLI indicates success. This typically takes only a few minutes:


Waiting for changeset to be created..
Waiting for stack create/update to complete
Successfully created/updated stack - iam-role-last-used

Step 3: View your findings

Now that your deployment is complete, you can view your compliance findings by going to the AWS Config console.

  1. Select the same region where you deployed the CloudFormation template.
  2. Select Rules in the left pane, which brings up the current list of rules in your account.
  3. Select the iam-role-last-used rule to view the rule’s details, as shown in Figure 2.

When a successful evaluation is indicated in the Overall rule status field, the compliance evaluation is complete. You may need to wait a few minutes for the function to complete successfully as results may not be available yet. You can periodically refresh your web browser to check for completion.
 

Figure 2: AWS Config custom rule details

After the rule completes its evaluations of your roles, you’ll be able to view your compliance results on the same page. In the screenshot below, you can see that there are multiple non-compliant roles. You can switch between viewing compliant and non-compliant resources by selecting the dropdown menu under Compliance status.
 

Figure 3: Viewing the compliance status

For more insight, you can hover over the “i” symbol, which provides additional information about the role’s non-compliant status (see Figure 4).
 

Figure 4: Hover over the information icon

Step 4: Export a report of your compliance

Once a successful evaluation has completed, you may want to create an exportable report of compliance. You can use the AWS CLI to programmatically script and automatically generate reports for your application, infrastructure, and security teams. They can use these reports to review non-compliant roles and take action if the role is no longer needed. The AWS CLI command below demonstrates how you can achieve this. Note that the command below encompasses a single line:

aws configservice get-compliance-details-by-config-rule --config-rule-name iam-role-last-used --output text --query 'EvaluationResults[*].{A:EvaluationResultIdentifier.EvaluationResultQualifier.ResourceId,B:ComplianceType,C:Annotation}'

The output is tab-delimited and will be similar to the lines below. The first column displays the role name. The second column states the compliance status. The last column explains the reason for the compliance status:

AdminRole   COMPLIANT      Was last used in us-west-2 46 days ago
Ec2DevRole  NON_COMPLIANT  No record of usage
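
If you would rather post-process the results in code, the same data is available from the AWS Config API. Here is a minimal boto3 sketch with simple NextToken pagination:

import boto3

config = boto3.client('config')

kwargs = {'ConfigRuleName': 'iam-role-last-used'}
while True:
    page = config.get_compliance_details_by_config_rule(**kwargs)
    for result in page['EvaluationResults']:
        role_name = result['EvaluationResultIdentifier']['EvaluationResultQualifier']['ResourceId']
        print(role_name, result['ComplianceType'], result.get('Annotation', ''))
    if page.get('NextToken'):
        kwargs['NextToken'] = page['NextToken']
    else:
        break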

Remediation

Now that you have a report of non-compliant roles, you must decide what to do with them. If your teams agree that a role is not necessary, the remediation can be to simply delete the role. If unsure, you can retain the role but deny it from performing any action. You can do this by attaching a new permissions policy that will deny all actions for all resources. Re-enabling the role would be as easy as removing the added policy. Otherwise, if the role is necessary but not frequently used, you can whitelist the role through the method below.
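
Here is a hedged sketch of the deny-all approach, using a hypothetical role name and inline policy name; removing the inline policy later re-enables the role.

import json
import boto3

iam = boto3.client('iam')

deny_all = {
    "Version": "2012-10-17",
    "Statement": [{"Effect": "Deny", "Action": "*", "Resource": "*"}]
}

# Quarantine the unused role by denying all actions on all resources
iam.put_role_policy(
    RoleName='Ec2DevRole',
    PolicyName='DenyAllQuarantine',
    PolicyDocument=json.dumps(deny_all)
)

# To re-enable the role later:
# iam.delete_role_policy(RoleName='Ec2DevRole', PolicyName='DenyAllQuarantine')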

Whitelisting roles

Whitelisted roles will be reported as compliant by the custom rule even if left unused. You might have roles such as a security incident response or a break-glass role that require whitelisting.

The whitelist is supplied via the CloudFormation parameter RolePatternWhitelist and is stored as an AWS Config rule parameter. The syntax uses UNIX filename pattern matching. If you need to specify multiple patterns, you can use the | (pipe) character as a delimiter between each pattern. Each delimited pattern will then be matched against the role name, including the path. For example, if you wish to whitelist the breakglass-role, security-incident-response-role and security-audit-role roles, the whitelist patterns you provide to the AWS CloudFormation template might be:

/breakglass-role|/security-*

Important: Wildcards (*) should be used thoughtfully, as they will match anything.

Enhancements

In this walkthrough, I’ve kept the architecture and code simple to make the solution easier to follow. You can further customize the solution through the following enhancements:

Conclusion

In this post, I’ve shown you how to use AWS IAM and AWS Config to implement a detective security control that provides visibility into your IAM roles and their last time of use. I’ve also shown how you can view the results in the AWS Management Console and export them using the AWS CLI. Finally, I’ve presented different options for remediation and a means to whitelist roles that are necessary but infrequently used. These techniques can augment your security and compliance program by preventing unintended access through your IAM roles.

Additional resources

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the IAM forum or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Michael Chan

Michael is a Developer Advocate for AWS Identity. Prior to this, he was a Professional Services Consultant who assisted customers with their journey to AWS. He enjoys understanding customer problems and working backwards to provide practical solutions.

Roland AbiHanna

Roland is a Sr. Solutions Architect with Amazon Web Services. He’s focused on helping enterprise customers realize their business needs through cloud solutions, specializing in DevOps and automation. Prior to AWS, Roland ran DevOps for a variety of start-ups in Europe and the Middle East. Outside of work, Roland enjoys hiking and searching for the perfect blend of hops, barley, and water.

S3 Replication Update: Replication SLA, Metrics, and Events

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/s3-replication-update-replication-sla-metrics-and-events/

S3 Cross-Region Replication has been around since early 2015 (new Cross-Region Replication for Amazon S3), and Same-Region Replication has been around for a couple of months.

Replication is very easy to set up, and lets you use rules to specify that you want to copy objects from one S3 bucket to another one. The rules can specify replication of the entire bucket, or of a subset based on prefix or tag:

You can use replication to copy critical data within or between AWS regions in order to meet regulatory requirements for geographic redundancy as part of a disaster recovery plan, or for other operational reasons. You can copy within a region to aggregate logs, set up test & development environments, and address compliance requirements.

S3’s replication features have been put to great use: Since the launch in 2015, our customers have replicated trillions of objects and exabytes of data! Today I am happy to be able to tell you that we are making it even more powerful, with the addition of Replication Time Control. This feature builds on the existing rule-driven replication and gives you fine-grained control based on tag or prefix so that you can use Replication Time Control with the data set you specify. Here’s what you get:

Replication SLA – You can now take advantage of a replication SLA to increase the predictability of replication time.

Replication Metrics – You can now monitor the maximum replication time for each rule using new CloudWatch metrics.

Replication Events – You can now use events to track any object replications that deviate from the SLA.

Let’s take a closer look!

New Replication SLA
S3 replicates your objects to the destination bucket, with timing influenced by object size & count, available bandwidth, other traffic to the buckets, and so forth. In situations where you need additional control over replication time, you can use our new Replication Time Control feature, which is designed to perform as follows:

  • Most of the objects will be replicated within seconds.
  • 99% of the objects will be replicated within 5 minutes.
  • 99.99% of the objects will be replicated within 15 minutes.

When you enable this feature, you benefit from the associated Service Level Agreement. The SLA is expressed in terms of a percentage of objects that are expected to be replicated within 15 minutes, and provides for billing credits if the SLA is not met:

  • 99.9% to 98.0% – 10% credit
  • 98.0% to 95.0% – 25% credit
  • 95% to 0% – 100% credit

The billing credit applies to a percentage of the Replication Time Control fee, replication data transfer, S3 requests, and S3 storage charges in the destination for the billing period.

I can enable Replication Time Control when I create a new replication rule, and I can also add it to an existing rule:
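
Replication Time Control can also be configured through the S3 API. Here is a hedged boto3 sketch that adds a rule with both the 15-minute replication time and metrics enabled; the bucket names, role ARN, and prefix are placeholders:

import boto3

s3 = boto3.client('s3')

s3.put_bucket_replication(
    Bucket='my-source-bucket',  # versioning must be enabled on both buckets
    ReplicationConfiguration={
        'Role': 'arn:aws:iam::123456789012:role/my-replication-role',
        'Rules': [{
            'ID': 'rtc-rule',
            'Priority': 1,
            'Status': 'Enabled',
            'Filter': {'Prefix': 'critical/'},
            'DeleteMarkerReplication': {'Status': 'Disabled'},
            'Destination': {
                'Bucket': 'arn:aws:s3:::my-destination-bucket',
                'ReplicationTime': {'Status': 'Enabled', 'Time': {'Minutes': 15}},
                'Metrics': {'Status': 'Enabled', 'EventThreshold': {'Minutes': 15}}
            }
        }]
    }
)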

Replication begins as soon as I create or update the rule. I can use the Replication Metrics and the Replication Events to monitor compliance.

In addition to the existing charges for S3 requests and data transfer between regions, you will pay an extra per-GB charge to use Replication Time Control; see the S3 Pricing page for more information.

Replication Metrics
Each time I enable Replication Time Control for a rule, S3 starts to publish three new metrics to CloudWatch. They are available in the S3 and CloudWatch Consoles:

I created some large tar files, and uploaded them to my source bucket. I took a quick break, and inspected the metrics. Note that I did my testing before the launch, so don’t get overly concerned with the actual numbers. Also, keep in mind that these metrics are aggregated across the replication for display, and are not a precise indication of per-object SLA compliance.

BytesPendingReplication jumps up right after the upload, and then drops down as the replication takes place:

ReplicationLatency peaks and then quickly drops down to zero after S3 Replication transfers over 37 GB from the United States to Australia with a maximum latency of 8.3 minutes:

And OperationsPendingCount tracks the number of objects to be replicated:

I can also set CloudWatch Alarms on the metrics. For example, I might want to know if I have a replication backlog larger than 75 GB (for this to work as expected, I must set the Missing data treatment to Treat missing data as ignore (maintain the alarm state)):
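
A sketch of that alarm in boto3 follows; the alarm name, SNS topic, and metric dimensions are assumptions to adapt to your own buckets and rules.

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='s3-replication-backlog',
    Namespace='AWS/S3',
    MetricName='BytesPendingReplication',
    Dimensions=[
        {'Name': 'SourceBucket', 'Value': 'my-source-bucket'},
        {'Name': 'RuleId', 'Value': 'rtc-rule'}
    ],
    Statistic='Maximum',
    Period=300,
    EvaluationPeriods=1,
    Threshold=75 * 1024 ** 3,       # 75 GB backlog
    ComparisonOperator='GreaterThanThreshold',
    TreatMissingData='ignore',      # maintain the alarm state when no data points are reported
    AlarmActions=['arn:aws:sns:us-east-1:123456789012:replication-alerts']
)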

These metrics are billed as CloudWatch Custom Metrics.

Replication Events
Finally, you can track replication issues by setting up events on an SQS queue, SNS topic, or Lambda function. Start at the console’s Events section:

You can use these events to monitor adherence to the SLA. For example, you could store Replication time missed threshold and Replication time completed after threshold events in a database to track occasions where replication took longer than expected. The first event will tell you that the replication is running late, and the second will tell you that it has completed, and how late it was.

To learn more, read about Replication.

Available Now
You can start using these features today in all commercial AWS Regions, excluding the AWS China (Beijing) and AWS China (Ningxia) Regions.

Jeff;

PS – If you want to learn more about how S3 works, be sure to attend the re:Invent session: Beyond Eleven Nines: Lessons from the Amazon S3 Culture of Durability.

 

Provisioning the Intuit Data Lake with Amazon EMR, Amazon SageMaker, and AWS Service Catalog

Post Syndicated from Michael Sambol original https://aws.amazon.com/blogs/big-data/provisioning-the-intuit-data-lake-with-amazon-emr-amazon-sagemaker-and-aws-service-catalog/

This post shares Intuit’s learnings and recommendations for running a data lake on AWS. The Intuit Data Lake is built and operated by numerous teams in Intuit Data Platform. Thanks to Tristan Baker (Chief Architect), Neil Lamka (Principal Product Manager), Achal Kumar (Development Manager), Nicholas Audo, and Jimmy Armitage for their feedback and support.

A data lake is a centralized repository for storing structured and unstructured data at any scale. At Intuit, creating such a pile of raw data is easy. However, more interesting challenges present themselves:

  1. How should AWS accounts be organized?
  2. What ingestion methods will be used? How will analysts find the data they need?
  3. Where should data be stored? How should access be managed?
  4. What security measures are needed to protect Intuit’s sensitive data?
  5. Which parts of this ecosystem can be automated?

This post outlines the approach taken by Intuit, though it is important to remember that there are many ways to build a data lake (for example, AWS Lake Formation).

We’ll cover the technologies and processes involved in creating the Intuit Data Lake at a high level, including the overall structure and the automation used in provisioning accounts and resources. Watch this space in the future for more detailed blog posts on specific aspects of the system, from the other teams and engineers who worked together to build the Intuit Data Lake.

Architecture

Account Structure

Data lakes typically follow a hub-and-spoke model, with the hub account containing shared services that control access to data sources. For the purposes of this post, we’ll refer to the hub account as Central Data Lake.

In this pattern, access to Central Data Lake is apportioned to spoke accounts called Processing Accounts. This model maintains separation between end users and allows for division of billing among distinct business units.

 

 

It is common to maintain two ecosystems: pre-production (Pre-Prod) and production (Prod). This allows data lake administrators to silo access to data by preventing connectivity between Pre-Prod and Prod.

To enable experimentation and testing, it may also be advisable to maintain separate VPC-based environments within Pre-Prod accounts, such as dev, qa, and e2e. Processing Account VPCs would then be connected to the corresponding VPC in Central Data Lake.

Note that at first, we connected accounts via VPC Peering. However, as we scaled we quickly approached the hard limit of 125 VPC peering connections, requiring us to migrate to AWS Transit Gateway. As of this writing, we connect multiple new Processing Accounts weekly.

 

 

Central Data Lake

There may be numerous services running in a hub account, but we’ll focus on the aspects that are most relevant to this blog: ingestion, sanitization, storage, and a data catalog.

 

 

Ingestion, Sanitization, and Storage

A key component to Central Data Lake is a uniform ingestion pattern for streaming data. One example is an Apache Kafka cluster running on Amazon EC2. (You can read about how Intuit engineers do this in another AWS blog.) As we deal with hundreds of data sources, we’ve enabled access to ingestion mechanisms via AWS PrivateLink.

Note: Amazon Managed Streaming for Apache Kafka (Amazon MSK) is an alternative for running Apache Kafka on Amazon EC2, but was not available at the start of Intuit’s migration.

In addition to stream processing, another method of ingestion is batch processing, such as jobs running on Amazon EMR. After data is ingested by one of these methods, it can be stored in Amazon S3 for further processing and analysis.

Intuit deals with a large volume of customer data, and each field is carefully considered and classified with a sensitivity level. All sensitive data that enters the lake is encrypted at the source. The ingestion systems retrieve the encrypted data and move it into the lake. Before it is written to S3, the data is sanitized by a proprietary RESTful service. Analysts and engineers operating within the data lake consume this masked data.

Data Catalog

A data catalog is a common way to give end users information about the data and where it lives. One example is a Hive Metastore backed by Amazon Aurora. Another alternative is the AWS Glue Data Catalog.

Processing Accounts

When Processing Accounts are delivered to end users, they include an identical set of resources. We’ll discuss the automation of Processing Accounts below, but the primary components are as follows:

 

 

                           Processing Account structure upon delivery to the customer

 

Data Storage Mechanisms

One reasonable question is whether all data should reside in Central Data Lake, or if it’s acceptable to distribute data across multiple accounts. A data lake might employ a combination of the two approaches, and classify data locations as primary or secondary.

The primary location for data is Central Data Lake, and it arrives there via the ingestion pipelines discussed previously. Processing Accounts can read from the primary source, either directly from the ingestion pipelines or from S3. Processing Accounts can contribute their transformed data back into Central Data Lake (primary), or store it in their own accounts (secondary). The proper storage location depends on the type of data, and who needs to consume it.

One rule worth enforcing is that no cross-account writes should be permitted. In other words, the IAM principal (in most cases, an IAM role assumed by EC2 via an instance profile) must be in the same account as the destination S3 bucket. This is because cross-account delegation is not supported—specifically, S3 bucket policies in Central Data Lake cannot grant Processing Account A access to objects written by a role in Processing Account B.

Another possibility is for EMR to assume different IAM roles via a custom credentials provider (see this AWS blog), but we chose not to go down this path at Intuit because it would have required many EMR jobs to be rewritten.

 

 

Data Access Patterns

The majority of end users are interested in the data that resides in S3. In Central Data Lake and some Processing Accounts, there may be a set of read-only S3 buckets: any account in the data lake ecosystem can read data from this type of bucket.

To facilitate management of S3 access for read-only buckets, we built a mechanism to control S3 bucket policies, administered entirely via code. Our deployment pipelines use account metadata to dynamically generate the correct S3 bucket policy based on the type of account (Pre-Prod or Prod). These policies are committed back into our code repository for auditability and ease of management.

We employ the same method for managing KMS key policies, as we use KMS with customer managed customer master keys (CMKs) for at-rest encryption in S3.

Here’s an example of a generated S3 bucket policy for a read-only bucket:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ProcessingAccountReadOnly",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::111111111111:root",
                    "arn:aws:iam::222222222222:root",
                    "arn:aws:iam::333333333333:root",
                    "arn:aws:iam::444444444444:root",
                    "arn:aws:iam::555555555555:root",
                    ...
                    ...
                    ...
                    "arn:aws:iam::999999999999:root",
                ]
            },
            "Action": [
                "s3:ListBucket",
                "s3:GetObject"
            ],
            "Resource": [
                "arn:aws:s3:::intuit-data-lake-example/*",
                "arn:aws:s3:::intuit-data-lake-example"
            ]
        }
    ]
}

Note that we grant access at the account level, rather than using explicit IAM principal ARNs. Because the reads are cross-account, permissions are also required on the IAM principals in Processing Accounts. Maintaining these policies—with automation, at that level of granularity—is untenable at scale. Furthermore, using specific IAM principal ARNs would create an external dependency on foreign accounts. For example, if a Processing Account deletes an IAM role that is referenced in an S3 bucket policy in Central Data Lake, the bucket policy can no longer be saved, causing interruptions to deployment pipelines.
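
As a rough illustration of the approach, the following sketch renders a read-only bucket policy of the shape shown above from a list of account IDs taken from account metadata. The bucket name, account IDs, and metadata source are placeholders rather than our actual implementation; the rendered document would then be committed to the repository and applied by the deployment pipeline.

import json

def render_read_only_policy(bucket_name, account_ids):
    # Build a read-only S3 bucket policy that grants access at the account level.
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "ProcessingAccountReadOnly",
                "Effect": "Allow",
                "Principal": {
                    "AWS": [f"arn:aws:iam::{account_id}:root" for account_id in account_ids]
                },
                "Action": ["s3:ListBucket", "s3:GetObject"],
                "Resource": [
                    f"arn:aws:s3:::{bucket_name}/*",
                    f"arn:aws:s3:::{bucket_name}",
                ],
            }
        ],
    }

# The account IDs would normally come from the account metadata used by the pipelines.
policy = render_read_only_policy("intuit-data-lake-example", ["111111111111", "222222222222"])
print(json.dumps(policy, indent=4))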

Security

Security is mission critical for any data lake. We’ll mention a subset of the controls we use, but not dive deep.

Encryption

Encryption can be enforced both in transit and at rest, using multiple methods:

  1. Traffic within the lake should use the latest version of TLS (1.2 as of this writing)
  2. Data can be encrypted with application-level (client-side) encryption
  3. KMS keys can be used for at-rest encryption of S3, EBS, and RDS

Ingress and Egress

There’s nothing out of the ordinary in our approach to ingress and egress, but it’s worth mentioning the standard patterns we’ve found important:

Policies restricting ingress and egress are the primary points at which a data lake can guarantee quality (ingress) and prevent loss (egress).

Authorization

Access to the Intuit Data Lake is controlled via IAM roles, meaning no IAM users (with long-term credentials) are created. End users are granted access via an internal service that manages role-based, federated access to AWS accounts. Regular reviews are conducted to remove nonessential users.

Configuration Management

We use an internal fork of Cloud Custodian, which is a suite of preventative, detective, and responsive controls consisting of Amazon CloudWatch Events and AWS Config rules. Some of the violations it reports and (optionally) mitigates include:

  • Unauthorized CIDRs in inbound security group rules
  • Public S3 bucket policies and ACLs
  • IAM user console access
  • Unencrypted S3 buckets, EBS volumes, and RDS instances

Lastly, Amazon GuardDuty is enabled in all Intuit Data Lake accounts and is monitored by Intuit Security.

Automation

If there is one thing we’ve learned building the Intuit Data Lake, it is to automate everything.

There are four areas of automation we’ll discuss in this blog:

  1. Creation of Processing Accounts
  2. Processing Account Orchestration Pipeline
  3. Processing Account Terraform Pipeline
  4. EMR and SageMaker deployment via Service Catalog

Creation of Processing Accounts

The first step in creating a Processing Account is to make a request through an internal tool. This triggers automation that provisions an Intuit-stamped AWS account under the correct business unit.

Note: AWS Control Tower’s Account Factory was not available at the start of our journey, but it can be leveraged to provision new AWS accounts in a secured, best practice, self-service way.

Account setup also includes automated VPC creation (with optional VPN), fully automated using Service Catalog. End users simply specify subnet sizes.

It’s worth noting that Intuit leverages Service Catalog for self-service deployment of other common patterns, including ingress security groups, VPC endpoints, and VPC peering. Here’s an example portfolio:

Processing Account Orchestration Pipeline

After account creation and VPC provisioning, the Processing Account Orchestration Pipeline runs. This pipeline executes one-time tasks required for Processing Accounts. These tasks include:

  • Bootstrapping an IAM role for use in further configuration management
  • Creation of KMS keys for S3, EBS, and RDS encryption
  • Creation of variable files for the new account
  • Updating the master configuration file with account metadata
  • Generation of scripts to orchestrate the Terraform pipeline discussed below
  • Sharing Transit Gateways via Resource Access Manager

Processing Account Terraform Pipeline

This pipeline manages the lifecycle of dynamic, frequently-updated resources, including IAM roles, S3 buckets and bucket policies, KMS key policies, security groups, NACLs, and bastion hosts.

There is one pipeline for every Processing Account, and each pipeline deploys a series of layers into the account, using a set of parameterized deployment jobs. A layer is a logical grouping of Terraform modules and AWS resources, providing a way to shrink Terraform state files and reduce blast radius if redeployment of specific resources is required.

EMR and SageMaker Deployment via Service Catalog

AWS Service Catalog facilitates the provisioning of Amazon EMR and Amazon SageMaker, allowing end users to launch EMR clusters and SageMaker notebook instances that work out of the box, with embedded security.

Service Catalog allows data scientists and data engineers to launch EMR clusters in a self-service fashion with user-friendly parameters, and provides them with the following:

  • Bootstrap action to enable connectivity to services in Central Data Lake
  • EC2 instance profile to control S3, KMS, and other granular permissions
  • Security configuration that enables at-rest and in-transit encryption
  • Configuration classifications for optimal EMR performance
  • Encrypted AMI with monitoring and logging enabled
  • Custom Kerberos connection to LDAP

For SageMaker, we use Service Catalog to launch notebook instances with custom lifecycle configurations that set up connections or initialize the following: Hive Metastore, Kerberos, security, Splunk logging, and OpenDNS. You can read more about lifecycle configurations in this AWS blog. Launching a SageMaker notebook instance with best-practice configuration is as easy as follows:

Conclusion

This post illustrates the building blocks we used in creating the Intuit Data Lake. Our solution isn’t wholly unique, but comprised of common-sense approaches we’ve gleaned from dozens of engineers across Intuit, representing decades of experience. These practices have enabled us to push petabytes of data into the lake, and serve hundreds of Processing Accounts with varying needs. We are still building, but we hope our story helps you in your data lake journey.

The content and opinions in this post are those of the third-party author and AWS is not responsible for the content or accuracy of this post.

 


About the Authors

Michael Sambol is a senior consultant at AWS. He holds an MS in computer science from Georgia Tech. Michael enjoys working out, playing tennis, traveling, and watching Western movies.

Ben Covi is a staff software engineer at Intuit. At any given moment, he’s probably losing a game of Catan.

New – Insert, Update, Delete Data on S3 with Amazon EMR and Apache Hudi

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-insert-update-delete-data-on-s3-with-amazon-emr-and-apache-hudi/

Storing your data in Amazon S3 provides lots of benefits in terms of scale, reliability, and cost effectiveness. On top of that, you can leverage Amazon EMR to process and analyze your data using open source tools like Apache Spark, Hive, and Presto. As powerful as these tools are, it can still be challenging to deal with use cases where you need to do incremental data processing and record-level insert, update, and delete operations.

Talking with customers, we found that there are use cases that need to handle incremental changes to individual records, for example:

  • Complying with data privacy regulations, where your users choose to exercise their right to be forgotten, or change their consent as to how their data can be used.
  • Working with streaming data, when you have to handle specific data insertion and update events.
  • Using change data capture (CDC) architectures to track and ingest database change logs from enterprise data warehouses or operational data stores.
  • Reinstating late arriving data, or analyzing data as of a specific point in time.

Starting today, EMR release 5.28.0 includes Apache Hudi (incubating), so that you no longer need to build custom solutions to perform record-level insert, update, and delete operations. Hudi development started at Uber in 2016 to address inefficiencies across ingest and ETL pipelines. In recent months, the EMR team has worked closely with the Apache Hudi community, contributing patches that include updating Hudi to Spark 2.4.4 (HUDI-12), supporting Spark Avro (HUDI-91), adding support for AWS Glue Data Catalog (HUDI-306), as well as multiple bug fixes.

Using Hudi, you can perform record-level inserts, updates, and deletes on S3, allowing you to comply with data privacy laws, consume real-time streams and change data capture, reinstate late-arriving data, and track history and rollbacks in an open, vendor-neutral format. You create datasets and tables and Hudi manages the underlying data format. Hudi uses Apache Parquet and Apache Avro for data storage, and includes built-in integrations with Spark, Hive, and Presto, enabling you to query Hudi datasets using the same tools that you use today with near real-time access to fresh data.

When launching an EMR cluster, the libraries and tools for Hudi are installed and configured automatically any time at least one of the following components is selected: Hive, Spark, or Presto. You can use Spark to create new Hudi datasets, and insert, update, and delete data. Each Hudi dataset is registered in your cluster’s configured metastore (including the AWS Glue Data Catalog), and appears as a table that can be queried using Spark, Hive, and Presto.

Hudi supports two storage types that define how data is written, indexed, and read from S3:

  • Copy on Write – data is stored in columnar format (Parquet) and updates create a new version of the files during writes. This storage type is best used for read-heavy workloads, because the latest version of the dataset is always available in efficient columnar files.
  • Merge on Read – data is stored with a combination of columnar (Parquet) and row-based (Avro) formats; updates are logged to row-based “delta files” and compacted later creating a new version of the columnar files. This storage type is best used for write-heavy workloads, because new commits are written quickly as delta files, but reading the data set requires merging the compacted columnar files with the delta files.

Let’s do a quick overview of how you can set up and use Hudi datasets in an EMR cluster.

Using Apache Hudi with Amazon EMR
I start creating a cluster from the EMR console. In the advanced options I select EMR release 5.28.0 (the first including Hudi) and the following applications: Spark, Hive, and Tez. In the hardware options, I add 3 task nodes to ensure I have enough capacity to run both Spark and Hive.

When the cluster is ready, I use the key pair I selected in the security options to SSH into the master node and access the Spark Shell. I use the following command to start the Spark Shell to use it with Hudi:

$ spark-shell --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" \
              --conf "spark.sql.hive.convertMetastoreParquet=false" \
              --jars /usr/lib/hudi/hudi-spark-bundle.jar,/usr/lib/spark/external/lib/spark-avro.jar

There, I use the following Scala code to import some sample ELB logs into a Hudi dataset using the Copy on Write storage type:

import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.functions._
import org.apache.hudi.DataSourceWriteOptions
import org.apache.hudi.config.HoodieWriteConfig
import org.apache.hudi.hive.MultiPartKeysValueExtractor

//Set up various input values as variables
val inputDataPath = "s3://athena-examples-us-west-2/elb/parquet/year=2015/month=1/day=1/"
val hudiTableName = "elb_logs_hudi_cow"
val hudiTablePath = "s3://MY-BUCKET/PATH/" + hudiTableName

// Set up our Hudi Data Source Options
val hudiOptions = Map[String,String](
    DataSourceWriteOptions.RECORDKEY_FIELD_OPT_KEY -> "request_ip",
    DataSourceWriteOptions.PARTITIONPATH_FIELD_OPT_KEY -> "request_verb", 
    HoodieWriteConfig.TABLE_NAME -> hudiTableName, 
    DataSourceWriteOptions.OPERATION_OPT_KEY ->
        DataSourceWriteOptions.INSERT_OPERATION_OPT_VAL, 
    DataSourceWriteOptions.PRECOMBINE_FIELD_OPT_KEY -> "request_timestamp", 
    DataSourceWriteOptions.HIVE_SYNC_ENABLED_OPT_KEY -> "true", 
    DataSourceWriteOptions.HIVE_TABLE_OPT_KEY -> hudiTableName, 
    DataSourceWriteOptions.HIVE_PARTITION_FIELDS_OPT_KEY -> "request_verb", 
    DataSourceWriteOptions.HIVE_ASSUME_DATE_PARTITION_OPT_KEY -> "false", 
    DataSourceWriteOptions.HIVE_PARTITION_EXTRACTOR_CLASS_OPT_KEY ->
        classOf[MultiPartKeysValueExtractor].getName)

// Read data from S3 and create a DataFrame with Partition and Record Key
val inputDF = spark.read.format("parquet").load(inputDataPath)

// Write data into the Hudi dataset
inputDF.write
       .format("org.apache.hudi")
       .options(hudiOptions)
       .mode(SaveMode.Overwrite)
       .save(hudiTablePath)

In the Spark Shell, I can now count the records in the Hudi dataset:

scala> inputDF.count()
res1: Long = 10491958

In the options, I used the integration with the Hive metastore configured for the cluster, so that the table is created in the default database. In this way, I can use Hive to query the data in the Hudi dataset:

hive> use default;
hive> select count(*) from elb_logs_hudi_cow;
...
OK
10491958
...

I can now update or delete a single record in the dataset. In the Spark Shell, I prepare some variables to find the record I want to update, and a SQL statement to select the value of the column I want to change:

val requestIpToUpdate = "243.80.62.181"
val sqlStatement = s"SELECT elb_name FROM elb_logs_hudi_cow WHERE request_ip = '$requestIpToUpdate'"

I execute the SQL statement to see the current value of the column:

scala> spark.sql(sqlStatement).show()
+------------+                                                                  
|    elb_name|
+------------+
|elb_demo_003|
+------------+

Then, I select and update the record:

// Create a DataFrame with a single record and update column value
val updateDF = inputDF.filter(col("request_ip") === requestIpToUpdate)
                      .withColumn("elb_name", lit("elb_demo_001"))

Now I update the Hudi dataset with a syntax similar to the one I used to create it. But this time, the DataFrame I am writing contains only one record:

// Write the DataFrame as an update to existing Hudi dataset
updateDF.write
        .format("org.apache.hudi")
        .options(hudiOptions)
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
                DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
        .mode(SaveMode.Append)
        .save(hudiTablePath)

In the Spark Shell, I check the result of the update:

scala> spark.sql(sqlStatement).show()
+------------+                                                                  
|    elb_name|
+------------+
|elb_demo_001|
+------------+

Now I want to delete the same record. To delete it, I pass the EmptyHoodieRecordPayload payload in the write options:

// Write the DataFrame with an EmptyHoodieRecordPayload for deleting a record
updateDF.write
        .format("org.apache.hudi")
        .options(hudiOptions)
        .option(DataSourceWriteOptions.OPERATION_OPT_KEY,
                DataSourceWriteOptions.UPSERT_OPERATION_OPT_VAL)
        .option(DataSourceWriteOptions.PAYLOAD_CLASS_OPT_KEY,
                "org.apache.hudi.EmptyHoodieRecordPayload")
        .mode(SaveMode.Append)
        .save(hudiTablePath)

In the Spark Shell, I see that the record is no longer available:

scala> spark.sql(sqlStatement).show()
+--------+                                                                      
|elb_name|
+--------+
+--------+

How are all those updates and deletes managed by Hudi? Let’s use the Hudi Command Line Interface (CLI) to connect to the dataset and see how those changes are interpreted as commits:

This dataset is a Copy on Write dataset, which means that each time there is an update to a record, the file that contains that record will be rewritten to contain the updated values. You can see how many records have been written for each commit. The bottom row of the table describes the initial creation of the dataset, above it is the single-record update, and at the top is the single-record delete.

With Hudi, you can roll back to each commit. For example, I can roll back the delete operation with:

hudi:elb_logs_hudi_cow->commit rollback --commit 20191104121031

In the Spark Shell, the record is now back to where it was, just after the update:

scala> spark.sql(sqlStatement).show()
+------------+                                                                  
|    elb_name|
+------------+
|elb_demo_001|
+------------+

Copy on Write is the default storage type. I can repeat the steps above to create and update a Merge on Read dataset type by adding this to our hudiOptions:

DataSourceWriteOptions.STORAGE_TYPE_OPT_KEY -> "MERGE_ON_READ"

If you update a Merge on Read dataset and look at the commits with the Hudi CLI, you can see how different Merge on Read is compared to Copy on Write. With Merge on Read, you are only writing the updated rows and not whole files as with Copy on Write. This is why Merge on Read is helpful for write-heavy or update/delete-heavy workloads with fewer reads. Delta commits are written to disk as Avro records (row-based storage), and compacted data is written as Parquet files (columnar storage). To avoid creating too many delta files, Hudi automatically compacts your dataset so that your reads are as performant as possible.

When a Merge On Read dataset is created, two Hive tables are created:

  • The first table matches the name of the dataset.
  • The second table has the characters _rt appended to its name; the _rt postfix stands for real-time.

When queried, the first table returns the data that has been compacted, and does not show the latest delta commits. Using this table provides the best performance, but omits the freshest data. Querying the real-time table merges the compacted data with the delta commits on read, hence this dataset being called “Merge on Read”. This results in the freshest data being available, but incurs a performance overhead, and is not as performant as querying the compacted data. In this way, data engineers and analysts have the flexibility to choose between performance and data freshness.

Available Now
This new feature is available now in all regions with EMR 5.28.0. There is no additional cost for using Hudi with EMR. You can learn more about Hudi in the EMR documentation. This new tool can simplify the way you process, update, and delete data in S3. Let me know which use cases you are going to use it for!

Danilo

Architecting a Low-Cost Web Content Publishing System

Post Syndicated from Craig Jordan original https://aws.amazon.com/blogs/architecture/architecting-a-low-cost-web-content-publishing-system/

Introduction

When an IT team first contemplates reducing the on-premises hardware they manage to support their workloads, they often feel a tension between wanting to use cloud-native services and taking a lift-and-shift approach. Cloud-native services based on serverless designs could reduce costs and enable a solution that is easier to operate, but they can appear disruptive to end user processes and tools. A lift-and-shift migration, though it can eliminate on-premises hardware and maintain existing workflows, doesn’t eliminate the need to manage a server infrastructure, does nothing to improve a team’s agility in releasing enhancements after migration to the cloud, and may not optimize the cost of the resulting solution. Rather than settling for an either/or option that sacrifices cost savings and ease of operation in order to be non-intrusive to their web authors’ daily work, the University of Saint Thomas, Minnesota team implemented a creative hybrid approach that both avoids end user disruption and achieves the cost savings, agility, and simplified administration that a cloud-native solution can provide.

The Situation

University of St. Thomas wanted to reduce on-premises management of hardware for their university website. In addition, by migrating this functionality to the cloud, they intended to increase the website’s availability. The on-premises solution was deployed on an IIS server maintained by the IT team, but the content of the website was authored by staff members in departments across the university using two different Content Management Systems (CMS). The publishing process from these tools to the web server worked well, and there was no appetite for eliminating the distributed nature of the website’s development or the content management systems that the authors were comfortable with.

The IT team hoped to implement a serverless solution utilizing only Amazon Simple Storage Service (S3) to host the static website content. Not only would that reduce the cost of the solution, it would also eliminate having to manage web servers. One of the two content management systems could publish directly to S3, but unfortunately the other CMS could not.

A lift-and-shift migration approach would move the website onto an IIS server in Amazon Elastic Compute Cloud (EC2) and update the publishing process to write its outputs to this new server. This solution would avoid any impact to the authors because all the change would be accomplished behind the scenes by the IT team. However, this approach did not achieve the team’s goals of creating a solution that cost less and was easier to manage than the current on-premises one.

Rather than giving up on creating a cloud-native solution, the team worked from the constraints on the edges of the solution toward the middle.

Solution

Achieving the cost savings, management ease, and high availability for the solution depended upon using S3 to store the website’s contents (#1 in the diagram). If the CMS tools could have published directly to S3, the solution would have been completed by simply adjusting the CMS tools to target their output to S3. However, only one of the two CMS tools could do this. The other one expects to publish its output to a file system that is accessible to the on-premises server where the CMS tool runs. The team solved this problem by launching a t3.small EC2 instance (#2) to sit between the CMS tools and the S3 bucket that would store the website’s production content. Initially, it seemed like using two simple file sync processes could keep the file system of the EC2 instance synchronized with the CMS files. However, when the team first attempted this approach to build a copy of the website on EC2’s file system, they discovered that one of the sync processes would delete the other tool’s output rather than ignoring it when synchronizing updates from its tool to EC2.

To overcome this issue, the team created separate website roots in the EC2 file system into which each CMS would synchronize. Using Unionfs, a Linux utility that combines multiple directories into a single logical directory, a unified root folder for the website (#3) was created that could be easily pushed to S3 using the S3 CLI.

With this much of the solution in place the team had successfully created a new architecture for their website that was nearly as inexpensive as a static website hosted on S3, but that also maintained the tools and processes that their website authors were familiar with.

There was just one more technical issue to address: The IIS site contained internal metadata that redirected its users from virtual directories to the physical content located elsewhere in the website’s content. For example, https://..../law might be redirected to https://..../lawschool/. To achieve similar functionality, the IT team created one HTML file for each of these redirects and added them to a third website root directory in the EC2 instance (#4). These files include the static HTML headers needed to redirect the user’s browser to the desired endpoint. Blending this directory with the other two through Unionfs creates a single logical copy of the website’s contents that can be synchronized out to S3 with an S3 sync CLI command.
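
To give a sense of what those redirect files look like, here is a minimal sketch that generates one static HTML page per virtual directory from a mapping; the mapping, file system paths, and URLs are illustrative placeholders, not the university’s actual configuration.

from pathlib import Path

# Map of virtual directories to the real locations of their content.
REDIRECTS = {
    "law": "https://www.example.edu/lawschool/",                     # placeholder URLs
    "engineering": "https://www.example.edu/school-of-engineering/",
}

# Third website root that Unionfs blends with the two CMS roots.
REDIRECT_ROOT = Path("/var/www/redirect-root")

TEMPLATE = """<!DOCTYPE html>
<html>
  <head>
    <meta http-equiv="refresh" content="0; url={target}">
    <link rel="canonical" href="{target}">
  </head>
  <body>Redirecting to <a href="{target}">{target}</a></body>
</html>
"""

for virtual_dir, target in REDIRECTS.items():
    out_dir = REDIRECT_ROOT / virtual_dir
    out_dir.mkdir(parents=True, exist_ok=True)
    # Writing index.html means a request for /law is answered by this redirect page.
    (out_dir / "index.html").write_text(TEMPLATE.format(target=target))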

A final enhancement to the website was to use an Amazon CloudFront distribution (#5) to cache its contents, providing improved response time for website visitors. The distribution object caching TTLs are set to defaults. The publishing process runs every 15 minutes, so to ensure that website visitors receive the latest content, the team wrote an AWS Lambda function (#6) that invalidates the cache each time an object is created in or removed from the S3 bucket, using S3 event notifications.
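
A minimal sketch of such a Lambda function follows, assuming the distribution ID is passed in through an environment variable; the team’s actual implementation may differ in its details.

import os
import time
from urllib.parse import unquote_plus

import boto3

cloudfront = boto3.client("cloudfront")

def handler(event, context):
    # Invalidate one CloudFront path per object key referenced in the S3 event notification.
    paths = ["/" + unquote_plus(record["s3"]["object"]["key"]) for record in event["Records"]]
    cloudfront.create_invalidation(
        DistributionId=os.environ["DISTRIBUTION_ID"],  # assumed environment variable
        InvalidationBatch={
            "Paths": {"Quantity": len(paths), "Items": paths},
            # CallerReference must be unique for each invalidation request.
            "CallerReference": str(time.time()),
        },
    )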

Conclusion

The University of Saint Thomas IT team found a creative way to implement a new solution for their university website that reduces the time and effort required to manage servers, achieves operational simplicity and cost savings by using cloud-native services, and yet doesn’t interfere with the web authoring tools and processes their customers were happy with. The mix of server-based and serverless components in their design illustrates how flexible cloud architectures can be and highlights the ingenuity of the team that built it.

Acknowledgements

Thank you to the following people at the University of Saint Thomas:

This solution was architected by Julian Mino, Cloud Architect. The creative use of Unionfs was suggested by William Bear, AVP for Applications and Infrastructure and former Linux administrator. Vicky Vue, Systems Engineer and Keith Ketchmark, Sr. Systems Engineer implemented the solution using Terraform, Ansible and Python. Daniel Strojny (Associate Director, Networks & IT Operations) helped resolve some internal DNS issues the team encountered.

Automated Disaster Recovery using CloudEndure

Post Syndicated from Ryan Jaeger original https://aws.amazon.com/blogs/architecture/automated-disaster-recovery-using-cloudendure/

There are any number of events that cause IT outages and impact business continuity. These could include the unexpected infrastructure or application outages caused by flooding, earthquakes, fires, hardware failures, or even malicious attacks. Cloud computing opens a new door to support disaster recovery strategies, with benefits such as elasticity, agility, speed to innovate, and cost savings—all which aid new disaster recovery solutions.

With AWS, organizations can acquire IT resources on-demand, and pay only for the resources they use. Automating disaster recovery (DR) has always been challenging. This blog post shows how you can use automation to orchestrate recovery and eliminate manual processes. CloudEndure Disaster Recovery, an AWS Company, Amazon Route 53, and AWS Lambda are the building blocks to deliver a cost-effective automated DR solution. The example in this post demonstrates how you can recover a production web application with sub-second Recovery Point Objectives (RPOs) and Recovery Time Objectives (RTOs) in minutes.

As part of a DR strategy, knowing RPOs and RTOs will determine what kind of solution architecture you need. The RPO represents the point in time of the last recoverable data point (for example, the “last backup”). Any disaster after that point would result in data loss.

The time from the outage to restoration is the RTO. Minimizing RTO and RPO is a cost tradeoff. Restoring from backups and recreating infrastructure after the event is the lowest cost but highest RTO. Conversely, the highest cost and lowest RTO is a solution running a duplicate auto-failover environment.

Solution Overview

CloudEndure is an automated IT resilience solution that lets you recover your environment from unexpected infrastructure or application outages, data corruption, ransomware, or other malicious attacks. It utilizes block-level continuous data replication, which ensures that target machines are spun up in their most current state during a disaster or drill, so that you can achieve sub-second RPOs. In the event of a disaster, CloudEndure triggers a highly automated machine conversion process and a scalable orchestration engine that can spin up machines in the target AWS Region within minutes. This process enables you to achieve RTOs in minutes. The CloudEndure solution uses a software agent that installs on physical or virtual servers. It connects to a self-service, web-based user console, which then issues an API call to the selected AWS target Region to create a Staging Area in the customer’s AWS account designated to receive the source machine’s replicated data.

Architecture

In the above example, a webserver and database server have the CloudEndure Agent installed, and the disk volumes on each server are replicated to a staging environment in the customer’s AWS account. The CloudEndure Replication Server receives the encrypted data replication traffic and writes it to the appropriate corresponding EBS volumes. It’s also possible to configure data replication traffic to use VPN or AWS Direct Connect.

With this current setup, if an infrastructure or application outage occurs, a failover to AWS is executed by manually starting the process from the CloudEndure Console. When this happens, CloudEndure creates EC2 instances from the synchronized target EBS volumes. After the failover completes, additional manual steps are needed to change the website’s DNS entry to point to the IP address of the failed over webserver.

Could the CloudEndure failover and DNS update be automated? Yes.

Amazon Route 53 is a highly available and scalable Domain Name System (DNS) web service with three main functions: domain registration, DNS routing, and health checking. A configured Route 53 health check monitors the endpoint of a webserver. If the health check fails over a specified period, an alarm is raised to execute an AWS Lambda function that starts the CloudEndure failover process. In addition to health checks, Route 53 DNS Failover allows the DNS record for the webserver to be automatically updated based on a healthy endpoint. Now the previously manual process of updating the DNS record to point to the restored web server is automated. You can also build Route 53 DNS Failover decision trees to handle more complex configurations.
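
As a rough sketch of that configuration, the snippet below associates a health check with the primary failover record and defines a secondary record for the failed-over webserver; the hosted zone ID, health check ID, domain name, and IP addresses are placeholders.

import boto3

route53 = boto3.client("route53")

HOSTED_ZONE_ID = "Z0000000EXAMPLE"                                  # placeholder
PRIMARY_HEALTH_CHECK_ID = "11111111-2222-3333-4444-555555555555"    # placeholder

def failover_record(ip_address, role, health_check_id=None):
    # Build an A record that participates in Route 53 failover routing.
    record = {
        "Name": "www.example.com.",
        "Type": "A",
        "SetIdentifier": role.lower(),
        "Failover": role,  # "PRIMARY" or "SECONDARY"
        "TTL": 60,
        "ResourceRecords": [{"Value": ip_address}],
    }
    if health_check_id:
        record["HealthCheckId"] = health_check_id
    return record

route53.change_resource_record_sets(
    HostedZoneId=HOSTED_ZONE_ID,
    ChangeBatch={
        "Changes": [
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record("203.0.113.10", "PRIMARY", PRIMARY_HEALTH_CHECK_ID)},
            {"Action": "UPSERT",
             "ResourceRecordSet": failover_record("203.0.113.20", "SECONDARY")},
        ]
    },
)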

To illustrate this, the following builds on the example by having a primary, secondary, and tertiary DNS Failover choice for the web application:

How Health Checks Work in Complex Amazon Route 53 Configurations

When the CloudEndure failover action executes, it takes several minutes until the target EC2 instances are launched and configured by CloudEndure. An S3 static web page can be returned to the end-user to improve communication while the failover is happening.

To support this example, an Amazon Route 53 DNS failover decision tree can be configured to have a primary, secondary, and tertiary failover. The decision tree logic to support the scenario is the following:

  1. If the primary health check passes, return the primary webserver.
  2. Else, if the secondary health check passes, return the failover webserver.
  3. Else, return the S3 static site.

When the Route 53 health check monitoring the webserver’s primary endpoint fails, a CloudWatch alarm is configured to enter the ALARM state after a set time. This CloudWatch alarm then executes a Lambda function that calls the CloudEndure API to begin the failover.
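
The Lambda function itself can be a thin wrapper around the CloudEndure REST API. The sketch below outlines the idea using only the Python standard library; the base URL, endpoint paths, and payload are illustrative placeholders, so consult the CloudEndure API documentation for the exact calls and required headers.

import json
import os
import urllib.request

CLOUDENDURE_API = "https://console.cloudendure.com/api/latest"  # assumed base URL

def handler(event, context):
    # Triggered by the CloudWatch alarm; asks CloudEndure to launch recovery machines.
    opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor())

    # 1. Authenticate with an API token to obtain a session cookie.
    login = urllib.request.Request(
        f"{CLOUDENDURE_API}/login",
        data=json.dumps({"userApiToken": os.environ["CLOUDENDURE_API_TOKEN"]}).encode(),
        headers={"Content-Type": "application/json"},
    )
    opener.open(login)

    # 2. Start a recovery launch for the protected machines in the project.
    #    The path and payload here are placeholders for the real launch call.
    launch = urllib.request.Request(
        f"{CLOUDENDURE_API}/projects/{os.environ['CLOUDENDURE_PROJECT_ID']}/launchMachines",
        data=json.dumps({"launchType": "RECOVERY", "items": []}).encode(),
        headers={"Content-Type": "application/json"},
    )
    opener.open(launch)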

In the screenshot below, both health checks are reporting “Unhealthy” while the primary health check is in a state of ALARM. At this point, the DNS failover logic should be returning the path to the static S3 site, and the Lambda function is executed to start the CloudEndure failover.

The following architecture illustrates the completed scenario:

Conclusion

Having a disaster recovery strategy is critical for business continuity. The benefits of AWS combined with CloudEndure Disaster Recovery create a non-disruptive DR solution that provides minimal RTO and RPO while reducing total cost of ownership for customers. Leveraging CloudWatch Alarms combined with AWS Lambda for serverless computing provides the building blocks for a variety of automation scenarios.

AWS DataSync News – S3 Storage Class Support and Much More

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-datasync-news-s3-storage-class-support-and-much-more/

AWS DataSync helps you to move large amounts of data into or out of the AWS Cloud (read my post, New – AWS DataSync – Automated and Accelerated Data Transfer, to learn more). As I explained in that post, DataSync is a great fit for your Migration, Upload & Process, and Backup / DR use cases. DataSync is a managed service, and can be used to do one-time or periodic transfers of any size.

Newest Features
We launched DataSync at AWS re:Invent 2018 and have been adding features to it ever since. Today I would like to give you a brief recap of some of the newest features, and also introduce a few new ones:

  • S3 Storage Class Support
  • SMB Support
  • Additional Regions
  • VPC Endpoint Support
  • FIPS for US Endpoints
  • File and Folder Filtering
  • Embedded CloudWatch Metrics

Let’s take a look at each one…

S3 Storage Class Support
If you are transferring data to an Amazon S3 bucket, you now have control over the storage class that is used for the objects. You simply choose the class when you create a new location for use with DataSync:

You can choose from any of the S3 storage classes:

Objects stored in certain storage classes can incur additional charges for overwriting, deleting, or retrieving. To learn more, read Considerations When Working with S3 Storage Classes in DataSync.
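
If you prefer to script the setup, the storage class can also be set when creating the location through the API. Here is a minimal sketch using boto3; the bucket ARN, IAM role ARN, and subdirectory are placeholders.

import boto3

datasync = boto3.client("datasync")

# Create an S3 location that writes transferred objects using the STANDARD_IA storage class.
response = datasync.create_location_s3(
    S3BucketArn="arn:aws:s3:::my-datasync-destination",  # placeholder bucket
    Subdirectory="/incoming",
    S3StorageClass="STANDARD_IA",
    S3Config={
        "BucketAccessRoleArn": "arn:aws:iam::123456789012:role/datasync-s3-access"  # placeholder role
    },
)
print(response["LocationArn"])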

SMB Support
Late last month we announced that AWS DataSync Can Now Transfer Data to and from SMB File Shares. SMB (Server Message Block) protocol is common in Windows-centric environments, and is also the preferred protocol for many file servers and network attached storage (NAS) devices. You can use filter patterns to control the files that are included in or excluded from the transfer, and you can use SMB file shares as the data transfer source or destination (Amazon S3 and Amazon EFS can also be used). You simply create a DataSync location that references your SMB server and share:

To learn more, read Creating a Location for SMB.

Additional Regions
AWS DataSync is now available in more locations. Earlier this year it became available in the AWS GovCloud (US-West) and Middle East (Bahrain) Regions.

VPC Endpoint Support
You can deploy AWS DataSync in a Virtual Private Cloud (VPC). If you do this, data transferred between the DataSync agent and the AWS DataSync service will not traverse the public internet:

The VPC endpoints for DataSync are powered by AWS PrivateLink; to learn more read AWS DataSync Now Supports Amazon VPC Endpoints and Using AWS DataSync in a Virtual Private Cloud.

FIPS for US Endpoints
In addition to support for VPC endpoints, we announced that AWS DataSync supports FIPS 140-2 Validated Endpoints in US Regions. The endpoints in these regions use a FIPS 140-2 validated cryptographic security module, making it easier for you to use DataSync for regulated workloads. You can use these endpoints by selecting them when you create your DataSync agent:

File and Folder Filtering
Earlier this year we added the ability to use file path and object key filters to exercise additional control over the data copied in a data transfer. To learn more, read about Excluding and including specific data in transfer tasks using AWS DataSync filters.

Embedded CloudWatch Metrics
Data transfer metrics are available in the Task Execution Details page so that you can track the progress of your transfer:

Other AWS DataSync Resources
Here are some resources to help you to learn more about AWS DataSync:

Jeff;

Learn about AWS Services & Solutions – September AWS Online Tech Talks

Post Syndicated from Jenny Hang original https://aws.amazon.com/blogs/aws/learn-about-aws-services-solutions-september-aws-online-tech-talks/

AWS Tech Talks

Join us this September to learn about AWS services and solutions. The AWS Online Tech Talks are live, online presentations that cover a broad range of topics at varying technical levels. These tech talks, led by AWS solutions architects and engineers, feature technical deep dives, live demonstrations, customer examples, and Q&A with AWS experts. Register Now!

Note – All sessions are free and in Pacific Time.

Tech talks this month:

 

Compute:

September 23, 2019 | 11:00 AM – 12:00 PM PT – Build Your Hybrid Cloud Architecture with AWS – Learn about the extensive range of services AWS offers to help you build a hybrid cloud architecture best suited for your use case.

September 26, 2019 | 1:00 PM – 2:00 PM PT – Self-Hosted WordPress: It’s Easier Than You Think – Learn how you can easily build a fault-tolerant WordPress site using Amazon Lightsail.

October 3, 2019 | 11:00 AM – 12:00 PM PT – Lower Costs by Right Sizing Your Instance with Amazon EC2 T3 General Purpose Burstable Instances – Get an overview of T3 instances, understand what workloads are ideal for them, and understand how the T3 credit system works so that you can lower your EC2 instance costs today.

 

Containers:

September 26, 2019 | 11:00 AM – 12:00 PM PT – Develop a Web App Using Amazon ECS and AWS Cloud Development Kit (CDK) – Learn how to build your first app using CDK and AWS container services.

 

Data Lakes & Analytics:

September 26, 2019 | 9:00 AM – 10:00 AM PT – Best Practices for Provisioning Amazon MSK Clusters and Using Popular Apache Kafka-Compatible Tooling – Learn best practices on running Apache Kafka production workloads at a lower cost on Amazon MSK.

 

Databases:

September 25, 2019 | 1:00 PM – 2:00 PM PT – What’s New in Amazon DocumentDB (with MongoDB compatibility) – Learn what’s new in Amazon DocumentDB, a fully managed MongoDB compatible database service designed from the ground up to be fast, scalable, and highly available.

October 3, 2019 | 9:00 AM – 10:00 AM PT – Best Practices for Enterprise-Class Security, High-Availability, and Scalability with Amazon ElastiCache – Learn about new enterprise-friendly Amazon ElastiCache enhancements like customer managed key and online scaling up or down to make your critical workloads more secure, scalable and available.

 

DevOps:

October 1, 2019 | 9:00 AM – 10:00 AM PT – CI/CD for Containers: A Way Forward for Your DevOps Pipeline – Learn how to build CI/CD pipelines using AWS services to get the most out of the agility afforded by containers.

 

Enterprise & Hybrid:

September 24, 2019 | 1:00 PM – 2:30 PM PT – Virtual Workshop: How to Monitor and Manage Your AWS Costs – Learn how to visualize and manage your AWS cost and usage in this virtual hands-on workshop.

October 2, 2019 | 1:00 PM – 2:00 PM PT – Accelerate Cloud Adoption and Reduce Operational Risk with AWS Managed Services – Learn how AMS accelerates your migration to AWS, reduces your operating costs, improves security and compliance, and enables you to focus on your differentiating business priorities.

 

IoT:

September 25, 2019 | 9:00 AM – 10:00 AM PT – Complex Monitoring for Industrial with AWS IoT Data Services – Learn how to solve your complex event monitoring challenges with AWS IoT Data Services.

 

Machine Learning:

September 23, 2019 | 9:00 AM – 10:00 AM PT – Training Machine Learning Models Faster – Learn how to train machine learning models quickly and with a single click using Amazon SageMaker.

September 30, 2019 | 11:00 AM – 12:00 PM PT – Using Containers for Deep Learning Workflows – Learn how containers can help address challenges in deploying deep learning environments.

October 3, 2019 | 1:00 PM – 2:30 PM PT – Virtual Workshop: Getting Hands-On with Machine Learning and Ready to Race in the AWS DeepRacer League – Join DeClercq Wentzel, Senior Product Manager for AWS DeepRacer, for a presentation on the basics of machine learning and how to build a reinforcement learning model that you can use to join the AWS DeepRacer League.

 

AWS Marketplace:

September 30, 2019 | 9:00 AM – 10:00 AM PT – Advancing Software Procurement in a Containerized World – Learn how to deploy applications faster with third-party container products.

 

Migration:

September 24, 2019 | 11:00 AM – 12:00 PM PT – Application Migrations Using AWS Server Migration Service (SMS) – Learn how to use AWS Server Migration Service (SMS) for automating application migration and scheduling continuous replication, from your on-premises data centers or Microsoft Azure to AWS.

 

Networking & Content Delivery:

September 25, 2019 | 11:00 AM – 12:00 PM PT – Building Highly Available and Performant Applications using AWS Global Accelerator – Learn how to build highly available and performant architectures for your applications with AWS Global Accelerator, now with source IP preservation.

September 30, 2019 | 1:00 PM – 2:00 PM PT – AWS Office Hours: Amazon CloudFront – Just getting started with Amazon CloudFront and Lambda@Edge? Get answers directly from our experts during AWS Office Hours.

 

Robotics:

October 1, 2019 | 11:00 AM – 12:00 PM PT – Robots and STEM: AWS RoboMaker and AWS Educate Unite! – Come join members of the AWS RoboMaker and AWS Educate teams as we provide an overview of our education initiatives and walk you through the newly launched RoboMaker Badge.

 

Security, Identity & Compliance:

October 1, 2019 | 1:00 PM – 2:00 PM PT – Deep Dive on Running Active Directory on AWS – Learn how to deploy Active Directory on AWS and start migrating your Windows workloads.

 

Serverless:

October 2, 2019 | 9:00 AM – 10:00 AM PT – Deep Dive on Amazon EventBridge – Learn how to optimize event-driven applications, and use rules and policies to route, transform, and control access to these events that react to data from SaaS apps.

 

Storage:

September 24, 2019 | 9:00 AM – 10:00 AM PT – Optimize Your Amazon S3 Data Lake with S3 Storage Classes and Management Tools – Learn how to use the Amazon S3 Storage Classes and management tools to better manage your data lake at scale and to optimize storage costs and resources.

October 2, 2019 | 11:00 AM – 12:00 PM PT – The Great Migration to Cloud Storage: Choosing the Right Storage Solution for Your Workload – Learn more about AWS storage services and identify which service is the right fit for your business.

 

 

Extract Oracle OLTP data in real time with GoldenGate and query from Amazon Athena

Post Syndicated from Sreekanth Krishnavajjala original https://aws.amazon.com/blogs/big-data/extract-oracle-oltp-data-in-real-time-with-goldengate-and-query-from-amazon-athena/

This post describes how you can improve performance and reduce costs by offloading reporting workloads from an online transaction processing (OLTP) database to Amazon Athena and Amazon S3. The architecture described allows you to implement a reporting system and have an understanding of the data that you receive by being able to query it on arrival. In this solution:

  • Oracle GoldenGate generates a new row on the target for every change on the source to create Slowly Changing Dimension Type 2 (SCD Type 2) data.
  • Athena allows you to run ad hoc queries on the SCD Type 2 data.

Principles of a modern reporting solution

Advanced database solutions use a set of principles to help them build cost-effective reporting solutions. Some of these principles are:

  • Separate the reporting activity from the OLTP. This approach provides resource isolation and enables databases to scale for their respective workloads.
  • Use query engines running on top of distributed file systems like Hadoop Distributed File System (HDFS) and cloud object stores, such as Amazon S3. The advent of query engines that can run on top of open-source HDFS and cloud object stores further reduces the cost of implementing dedicated reporting systems.

Furthermore, you can use these principles when building reporting solutions:

  • To reduce licensing costs of the commercial databases, move the reporting activity to an open-source database.
  • Use a log-based, real-time, change data capture (CDC), data-integration solution, which can replicate OLTP data from source systems, preferably in real-time mode, and provide a current view of the data. You can enable the data replication between the source and the target reporting systems using database CDC solutions. The transaction log-based CDC solutions capture database changes noninvasively from the source database and replicate them to the target datastore or file systems.

Prerequisites

If you use GoldenGate with Kafka and are considering cloud migration, you can benefit from this post. This post also assumes prior knowledge of GoldenGate and does not detail steps to install and configure GoldenGate. Knowledge of Java and Maven is also assumed. Ensure that a VPC with three subnets is available for manual deployment.

Understanding the architecture of this solution

The following workflow diagram (Figure 1) illustrates the solution that this post describes:

  1. Amazon RDS for Oracle acts as the source.
  2. A GoldenGate CDC solution produces data for Amazon Managed Streaming for Apache Kafka (Amazon MSK). GoldenGate streams the database CDC data to the consumer, and Kafka topics in an MSK cluster receive the data from GoldenGate.
  3. The Apache Flink application running on Amazon EMR consumes the data and sinks it into an S3 bucket.
  4. Athena analyzes the data through queries. You can optionally run queries from Amazon Redshift Spectrum.

Data Pipeline

Figure 1

Amazon MSK is a fully managed service for Apache Kafka that makes it easy to provision Kafka clusters with a few clicks, without the need to provision servers, manage storage, or configure Apache ZooKeeper manually. Kafka is an open-source platform for building real-time streaming data pipelines and applications.

Amazon RDS for Oracle is a fully managed database that frees up your time to focus on application development. It manages time-consuming database administration tasks, including provisioning, backups, software patching, monitoring, and hardware scaling.

GoldenGate is a real-time, log-based, heterogeneous database CDC solution. GoldenGate supports data replication from any supported database to various target databases or big data platforms like Kafka. GoldenGate’s ability to write the transactional data captured from the source in different formats, including delimited text, JSON, and Avro, enables seamless integration with a variety of BI tools. Each row has additional metadata columns including database operation type (Insert/Update/Delete).

Flink is an open-source, stream-processing framework with a distributed streaming dataflow engine for stateful computations over unbounded and bounded data streams. EMR supports Flink, letting you create managed clusters from the AWS Management Console. Flink also supports exactly-once semantics with the checkpointing feature, which is vital to ensure data accuracy when processing database CDC data. You can also use Flink to transform the streaming data row by row or in batches using windowing capabilities.

S3 is an object storage service with high scalability, data availability, security, and performance. You can run big data analytics across your S3 objects with AWS query-in-place services like Athena.

Athena is a serverless query service that makes it easy to query and analyze data in S3. With Athena and S3 as a data source, you define the schema and start querying using standard SQL. There’s no need for complex ETL jobs to prepare your data for analysis, which makes it easy for anyone familiar with SQL skills to analyze large-scale datasets quickly.

The following diagram shows a more detailed view of the data pipeline:

  1. RDS for Oracle runs in a Single-AZ.
  2. GoldenGate runs on an Amazon EC2 instance.
  3. The MSK cluster spans across three Availability Zones.
  4. Kafka topic is set up in MSK.
  5. Flink runs on an EMR Cluster.
  6. Producer Security Group for Oracle DB and GoldenGate instance.
  7. Consumer Security Group for EMR with Flink.
  8. Gateway endpoint for S3 private access.
  9. NAT Gateway to download software components on GoldenGate instance.
  10. S3 bucket and Athena.

For simplicity, this setup uses a single VPC with multiple subnets to deploy resources.

Figure 2

Configuring single-click deployment using AWS CloudFormation

The AWS CloudFormation template included in this post automates the deployment of the end-to-end solution that this blog post describes. The template provisions all required resources including RDS for Oracle, MSK, EMR, S3 bucket, and also adds an EMR step with a JAR file to consume messages from Kafka topic on MSK. Here’s the list of steps to launch the template and test the solution:

  1. Launch the AWS CloudFormation template in the us-east-1 Region.
  2. After successful stack creation, obtain the GoldenGate hub server public IP from the Outputs tab of the CloudFormation stack.
  3. Log in to the GoldenGate hub server using the IP address from step 2 as ec2-user, and then switch to the oracle user: sudo su - oracle
  4. Connect to the source RDS for Oracle database using the sqlplus client and provide the password (source): sqlplus source@<source-db>
  5. Generate database transactions using SQL statements available in oracle user’s home directory.
    SQL> @s
    
     SQL> @s1
    
     SQL> @s2

  6. Query the STOCK_TRADES table from the Amazon Athena console. It takes a few seconds after committing transactions on the source database for the changes to be available to Athena for querying; a scripted alternative using the Athena API is sketched below.
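
For reference, the same query can be issued programmatically with boto3. The sketch below assumes the table landed in the default database and that an S3 location is available for query results; both names are placeholders.

import time

import boto3

athena = boto3.client("athena", region_name="us-east-1")

# Start the query against the table populated by the Flink sink.
execution = athena.start_query_execution(
    QueryString="SELECT COUNT(*) FROM stock_trades",
    QueryExecutionContext={"Database": "default"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-results/goldengate/"},  # placeholder bucket
)
query_id = execution["QueryExecutionId"]

# Poll until the query finishes, then print the result rows.
while True:
    state = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(1)

if state == "SUCCEEDED":
    results = athena.get_query_results(QueryExecutionId=query_id)
    for row in results["ResultSet"]["Rows"]:
        print([field.get("VarCharValue") for field in row["Data"]])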

Manually deploying components

The following steps describe the configurations required to stream Oracle-changed data to MSK and sink it to an S3 bucket using Flink running on EMR. You can then query the S3 bucket using Athena. If you deployed the solution using AWS CloudFormation as described in the previous step, skip to the Testing the solution section.

 

  1. Prepare an RDS source database for CDC using GoldenGate. The RDS source database version is Enterprise Edition 12.1.0.2.14. For instructions on configuring the RDS database, see Using Oracle GoldenGate with Amazon RDS. This post does not consider capturing data definition language (DDL).
  2. Configure an EC2 instance for the GoldenGate hub server. Configure the GoldenGate hub server using the Oracle Linux server 7.6 (ami-b9c38ad3) image in the us-east-1 Region. The GoldenGate hub server runs the GoldenGate extract process that extracts changes in real time from the database transaction log files. The server also runs a replicat process that publishes database changes to MSK. The GoldenGate hub server requires the following software components:
  • Java JDK 1.8.0 (required for GoldenGate big data adapter).
  • GoldenGate for Oracle (12.3.0.1.4) and GoldenGate for big data adapter (12.3.0.1).
  • Kafka 1.1.1 binaries (required for GoldenGate big data adapter classpath).
  • An IAM role attached to the GoldenGate hub server to allow access to the MSK cluster for GoldenGate processes running on the hub server. Use the GoldenGate (12.3.0) documentation to install and configure the GoldenGate for Oracle database. The GoldenGate Integrated Extract parameter file is eora2msk.prm.
    EXTRACT eora2msk
    SETENV (NLS_LANG=AL32UTF8)
    
    USERID ggadmin@<source-db>, password ggadmin
    TRANLOGOPTIONS INTEGRATEDPARAMS (max_sga_size 256)
    EXTTRAIL /u01/app/oracle/product/ogg/dirdat/or
    LOGALLSUPCOLS
    
    TABLE SOURCE.STOCK_TRADES;

    The logallsupcols extract parameter ensures that a full database table row is generated for every DML operation on the source, including updates and deletes.

  3. Create a Kafka cluster using MSK and configure a Kafka topic. You can create the MSK cluster from the AWS Management Console, using the AWS CLI, or through an AWS CloudFormation template.
  • Use the list-clusters command to obtain a ClusterArn and a Zookeeper connection string after creating the cluster. You need this information to configure the GoldenGate big data adapter and Flink consumer. The following code illustrates the commands to run:
    $aws kafka list-clusters --region us-east-1
    {
        "ClusterInfoList": [
            {
                "EncryptionInfo": {
                    "EncryptionAtRest": {
                        "DataVolumeKMSKeyId": "arn:aws:kms:us-east-1:xxxxxxxxxxxx:key/717d53d8-9d08-4bbb-832e-de97fadcaf00"
                    }
                }, 
                "BrokerNodeGroupInfo": {
                    "BrokerAZDistribution": "DEFAULT", 
                    "ClientSubnets": [
                        "subnet-098210ac85a046999", 
                        "subnet-0c4b5ee5ff5ef70f2", 
                        "subnet-076c99d28d4ee87b4"
                    ], 
                    "StorageInfo": {
                        "EbsStorageInfo": {
                            "VolumeSize": 1000
                        }
                    }, 
                    "InstanceType": "kafka.m5.large"
                }, 
                "ClusterName": "mskcluster", 
                "CurrentBrokerSoftwareInfo": {
                    "KafkaVersion": "1.1.1"
                }, 
                "CreationTime": "2019-01-24T04:41:56.493Z", 
                "NumberOfBrokerNodes": 3, 
                "ZookeeperConnectString": "10.0.2.9:2181,10.0.0.4:2181,10.0.3.14:2181", 
                "State": "ACTIVE", 
                "CurrentVersion": "K13V1IB3VIYZZH", 
                "ClusterArn": "arn:aws:kafka:us-east-1:xxxxxxxxx:cluster/mskcluster/8920bb38-c227-4bef-9f6c-f5d6b01d2239-3", 
                "EnhancedMonitoring": "DEFAULT"
            }
        ]
    }

  • Obtain the IP addresses of the Kafka broker nodes by using the ClusterArn.
    $aws kafka get-bootstrap-brokers --region us-east-1 --cluster-arn arn:aws:kafka:us-east-1:xxxxxxxxxxxx:cluster/mskcluster/8920bb38-c227-4bef-9f6c-f5d6b01d2239-3
    {
        "BootstrapBrokerString": "10.0.3.6:9092,10.0.2.10:9092,10.0.0.5:9092"
    }

  • Create a Kafka topic. The solution in this post uses the same name as table name for Kafka topic.
    ./kafka-topics.sh --create --zookeeper 10.0.2.9:2181,10.0.0.4:2181,10.0.3.14:2181 --replication-factor 3 --partitions 1 --topic STOCK_TRADES

  4. Provision an EMR cluster with Flink. Create an EMR 5.25 cluster with Flink 1.8.0 (an advanced option of the EMR cluster), and enable SSH access to the master node. Create and attach a role to the EMR master node so that Flink consumers can access the Kafka topic in the MSK cluster.
  5. Configure the Oracle GoldenGate big data adapter for Kafka on the GoldenGate hub server. Download and install the Oracle GoldenGate big data adapter (12.3.0.1.0) using the Oracle GoldenGate download link. For more information, see the Oracle GoldenGate 12c (12.3.0.1) installation documentation. The following is the GoldenGate producer property file for Kafka (custom_kafka_producer.properties):
    #Bootstrap broker string obtained from Step 3
    bootstrap.servers= 10.0.3.6:9092,10.0.2.10:9092,10.0.0.5:9092
    #bootstrap.servers=localhost:9092
    acks=1
    reconnect.backoff.ms=1000
    value.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
    key.serializer=org.apache.kafka.common.serialization.ByteArraySerializer
    # 16KB batch size per partition
    batch.size=16384
    linger.ms=0

    The following is the GoldenGate properties file for Kafka (Kafka.props):

    gg.handlerlist = kafkahandler
    gg.handler.kafkahandler.type=kafka
    gg.handler.kafkahandler.KafkaProducerConfigFile=custom_kafka_producer.properties
    #The following resolves the topic name using the short table name
    #gg.handler.kafkahandler.topicName=SOURCE
    gg.handler.kafkahandler.topicMappingTemplate=${tableName}
    #The following selects the message key using the concatenated primary keys
    gg.handler.kafkahandler.keyMappingTemplate=${primaryKeys}
    gg.handler.kafkahandler.format=json_row
    #gg.handler.kafkahandler.format=delimitedtext
    #gg.handler.kafkahandler.SchemaTopicName=mySchemaTopic
    #gg.handler.kafkahandler.SchemaTopicName=oratopic
    gg.handler.kafkahandler.BlockingSend =false
    gg.handler.kafkahandler.includeTokens=false
    gg.handler.kafkahandler.mode=op
    goldengate.userexit.writers=javawriter
    javawriter.stats.display=TRUE
    javawriter.stats.full=TRUE
    
    gg.log=log4j
    #gg.log.level=INFO
    gg.log.level=DEBUG
    gg.report.time=30sec
    gg.classpath=dirprm/:/home/oracle/kafka/kafka_2.11-1.1.1/libs/*
    
    javawriter.bootoptions=-Xmx512m -Xms32m -Djava.class.path=ggjava/ggjava.jar

    The following is the GoldenGate replicat parameter file (rkafka.prm):

    REPLICAT rkafka
    -- Trail file for this example is located in "AdapterExamples/trail" directory
    -- Command to add REPLICAT
    -- add replicat rkafka, exttrail AdapterExamples/trail/tr
    TARGETDB LIBFILE libggjava.so SET property=dirprm/kafka.props
    REPORTCOUNT EVERY 1 MINUTES, RATE
    GROUPTRANSOPS 10000
    MAP SOURCE.STOCK_TRADES, TARGET SOURCE.STOCK_TRADES;

  3. Create an S3 bucket with a directory named after the table underneath it, for Flink to store (sink) the Oracle CDC data; a sketch of this step follows.
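
    A minimal boto3 sketch of this step (the bucket name matches the sink path used in the Flink application later in this post; the Region is an assumption):

    import boto3

    s3 = boto3.client('s3', region_name='us-east-1')

    # Create the bucket and an empty "directory" (prefix) named after the table.
    # In Regions other than us-east-1, also pass
    # CreateBucketConfiguration={'LocationConstraint': '<region>'}.
    s3.create_bucket(Bucket='flink-stream-demo')
    s3.put_object(Bucket='flink-stream-demo', Key='STOCK_TRADES/')
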
  4. Configure a Flink consumer to read from the Kafka topic and write the CDC data to an S3 bucket. For instructions on setting up a Flink project using the Maven archetype, see Flink Project Build Setup. The following code example is the pom.xml file, used with the Maven project. For more information, see Getting Started with Maven.
    <project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
      xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd">
      <modelVersion>4.0.0</modelVersion>
    
      <groupId>org.apache.flink</groupId>
      <artifactId>flink-quickstart-java</artifactId>
      <version>1.8.0</version>
      <packaging>jar</packaging>
    
      <name>flink-quickstart-java</name>
      <url>http://www.example.com</url>
    
      <properties>
        <project.build.sourceEncoding>UTF-8</project.build.sourceEncoding>
        <slf4j.version>@slf4j.version@</slf4j.version>
        <log4j.version>@log4j.version@</log4j.version>
        <java.version>1.8</java.version>
        <maven.compiler.source>1.7</maven.compiler.source>
        <maven.compiler.target>1.7</maven.compiler.target>
      </properties>
    
    <dependencies>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-java</artifactId>
            <version>1.8.0</version>
            <scope>compile</scope>
        </dependency>
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-hadoop-compatibility_2.11</artifactId>
            <version>1.8.0</version>
        </dependency>
        <dependency>
         <groupId>org.apache.flink</groupId>
         <artifactId>flink-connector-filesystem_2.11</artifactId>
         <version>1.8.0</version>
        </dependency>
    
        <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-streaming-java_2.11</artifactId>
            <version>1.8.0</version>
            <scope>compile</scope>
        </dependency>
         <dependency>
            <groupId>org.apache.flink</groupId>
            <artifactId>flink-s3-fs-presto</artifactId>
            <version>1.8.0</version>
        </dependency>
        <dependency>
       <groupId>org.apache.flink</groupId>
            <artifactId>flink-connector-kafka_2.11</artifactId>
            <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-clients_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-scala_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
    
        <dependency>
          <groupId>org.apache.flink</groupId>
          <artifactId>flink-streaming-scala_2.11</artifactId>
          <version>1.8.0</version>
        </dependency>
    
        <dependency>
          <groupId>com.typesafe.akka</groupId>
          <artifactId>akka-actor_2.11</artifactId>
          <version>2.4.20</version>
        </dependency>
        <dependency>
           <groupId>com.typesafe.akka</groupId>
           <artifactId>akka-protobuf_2.11</artifactId>
           <version>2.4.20</version>
        </dependency>
    </dependencies>

    <build>
      <plugins>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-shade-plugin</artifactId>
          <version>3.2.1</version>
          <executions>
            <execution>
              <phase>package</phase>
              <goals>
                <goal>shade</goal>
              </goals>
              <configuration>
                <artifactSet>
                  <excludes>
                    <!-- Excludes here -->
                  </excludes>
                </artifactSet>
                <filters>
                  <filter>
                    <artifact>org.apache.flink:*</artifact>
                  </filter>
                </filters>
                <transformers>
                  <!-- add Main-Class to manifest file -->
                  <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                    <mainClass>flinkconsumer.flinkconsumer</mainClass>
                  </transformer>
                  <transformer implementation="org.apache.maven.plugins.shade.resource.AppendingTransformer">
                    <resource>reference.conf</resource>
                  </transformer>
                </transformers>
                <relocations>
                  <relocation>
                    <pattern>org.codehaus.plexus.util</pattern>
                    <shadedPattern>org.shaded.plexus.util</shadedPattern>
                    <excludes>
                      <exclude>org.codehaus.plexus.util.xml.Xpp3Dom</exclude>
                      <exclude>org.codehaus.plexus.util.xml.pull.*</exclude>
                    </excludes>
                  </relocation>
                </relocations>
                <createDependencyReducedPom>false</createDependencyReducedPom>
              </configuration>
            </execution>
          </executions>
        </plugin>
        <!-- Add the main class as a manifest entry -->
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-jar-plugin</artifactId>
          <version>2.5</version>
          <configuration>
            <archive>
              <manifestEntries>
                <Main-Class>flinkconsumer.flinkconsumer</Main-Class>
              </manifestEntries>
            </archive>
          </configuration>
        </plugin>
        <plugin>
          <groupId>org.apache.maven.plugins</groupId>
          <artifactId>maven-compiler-plugin</artifactId>
          <version>3.1</version>
          <configuration>
            <source>1.7</source>
            <target>1.7</target>
          </configuration>
        </plugin>
      </plugins>
    </build>

    <profiles>
      <profile>
        <id>build-jar</id>
        <activation>
          <activeByDefault>false</activeByDefault>
        </activation>
      </profile>
    </profiles>

    </project>

    Compile the following Java program using mvn clean install and generate the JAR file:

    package flinkconsumer;
    import org.apache.flink.api.common.typeinfo.TypeInformation;
    import org.apache.flink.api.java.typeutils.TypeExtractor;
    import org.apache.flink.api.java.utils.ParameterTool;
    import org.apache.flink.streaming.api.datastream.DataStream;
    import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;
    import org.apache.flink.streaming.api.functions.source.SourceFunction;
    import org.apache.flink.streaming.util.serialization.DeserializationSchema;
    import org.apache.flink.streaming.util.serialization.SerializationSchema;
    import org.apache.flink.streaming.util.serialization.SimpleStringSchema;
    import org.apache.flink.api.common.functions.FlatMapFunction;
    import org.apache.flink.api.common.functions.MapFunction;
    import org.apache.flink.streaming.api.windowing.time.Time;
    import org.apache.flink.util.Collector;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumer;
    import org.slf4j.LoggerFactory;
    import org.apache.flink.runtime.state.filesystem.FsStateBackend;
    import akka.actor.ActorSystem;
    import akka.stream.ActorMaterializer;
    import akka.stream.Materializer;
    import com.typesafe.config.Config;
    import org.apache.flink.streaming.connectors.fs.*;
    import org.apache.flink.streaming.api.datastream.*;
    import org.apache.flink.runtime.fs.hdfs.HadoopFileSystem;
    import java.util.stream.Collectors;
    import java.util.Arrays;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;
    import java.util.Properties;
    import java.util.regex.Pattern;
    import java.io.*;
    import java.net.BindException;
    import java.util.*;
    import java.util.Map.*;
    
    public class flinkconsumer{
    
        public static void main(String[] args) throws Exception {
            // create Streaming execution environment
            StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();
            env.setBufferTimeout(1000);
            env.enableCheckpointing(5000);
            Properties properties = new Properties();
            properties.setProperty("bootstrap.servers", "10.0.3.6:9092,10.0.2.10:9092,10.0.0.5:9092");
            properties.setProperty("group.id", "flink");
            properties.setProperty("client.id", "demo1");
    
            DataStream<String> message = env.addSource(new FlinkKafkaConsumer<>("STOCK_TRADES", new SimpleStringSchema(),properties));
            env.enableCheckpointing(60_00);
            env.setStateBackend(new FsStateBackend("hdfs://ip-10-0-3-12.ec2.internal:8020/flink/checkpoints"));
    
            RollingSink<String> sink = new RollingSink<String>("s3://flink-stream-demo/STOCK_TRADES");
            // sink.setBucketer(new DateTimeBucketer("yyyy-MM-dd-HHmm"));
            // The bucket part file size in bytes.
            sink.setBatchSize(400);
            message.map(new MapFunction<String, String>() {
                private static final long serialVersionUID = -6867736771747690202L;
                @Override
                public String map(String value) throws Exception {
                    // return " Value: " + value;
                    return value;
                }
            }).addSink(sink).setParallelism(1);
            env.execute();
        }
    }

    Log in as a Hadoop user to an EMR master node, start Flink, and execute the JAR file:

    $ /usr/bin/flink run ./flink-quickstart-java-1.8.0.jar

  5. Create the stock_trades table from the Athena console. Each JSON document must be on a new line.
    CREATE EXTERNAL TABLE `stock_trades`(
      `trade_id` string COMMENT 'from deserializer', 
      `ticker_symbol` string COMMENT 'from deserializer', 
      `units` int COMMENT 'from deserializer', 
      `unit_price` float COMMENT 'from deserializer', 
      `trade_date` timestamp COMMENT 'from deserializer', 
      `op_type` string COMMENT 'from deserializer')
    ROW FORMAT SERDE 
      'org.openx.data.jsonserde.JsonSerDe' 
    STORED AS INPUTFORMAT 
      'org.apache.hadoop.mapred.TextInputFormat' 
    OUTPUTFORMAT 
      'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
    LOCATION
      's3://flink-cdc-demo/STOCK_TRADES'
    TBLPROPERTIES (
      'has_encrypted_data'='false', 
      'transient_lastDdlTime'='1561051196')

    For more information, see Hive JSON SerDe.

Testing the solution

To test that the solution works, complete the following steps:

  1. Log in to the source RDS instance from the GoldenGate hub server and perform insert, update, and delete operations on the stock_trades table:
    $sqlplus [email protected]
    SQL> insert into stock_trades values(6,'NEW',29,75,sysdate);
    SQL> update stock_trades set units=999 where trade_id=6;
    SQL> insert into stock_trades values(7,'TEST',30,80,SYSDATE);
    SQL> insert into stock_trades values (8,'XYZC', 20, 1800,sysdate);
    SQL> commit;

  2. Monitor the GoldenGate capture from the source database using the following stats command:
    [[email protected] 12.3.0]$ pwd
    /u02/app/oracle/product/ogg/12.3.0
    [[email protected] 12.3.0]$ ./ggsci
    
    Oracle GoldenGate Command Interpreter for Oracle
    Version 12.3.0.1.4 OGGCORE_12.3.0.1.0_PLATFORMS_180415.0359_FBO
    Linux, x64, 64bit (optimized), Oracle 12c on Apr 16 2018 00:53:30
    Operating system character set identified as UTF-8.
    
    Copyright (C) 1995, 2018, Oracle and/or its affiliates. All rights reserved.
    
    
    
    GGSCI (ip-10-0-1-170.ec2.internal) 1> stats eora2msk

  3. Monitor the GoldenGate replicat to a Kafka topic with the following:
    [[email protected] 12.3.0]$ pwd
    /u03/app/oracle/product/ogg/bdata/12.3.0
    [[email protected] 12.3.0]$ ./ggsci
    
    Oracle GoldenGate for Big Data
    Version 12.3.2.1.1 (Build 005)
    
    Oracle GoldenGate Command Interpreter
    Version 12.3.0.1.2 OGGCORE_OGGADP.12.3.0.1.2_PLATFORMS_180712.2305
    Linux, x64, 64bit (optimized), Generic on Jul 13 2018 00:46:09
    Operating system character set identified as UTF-8.
    
    Copyright (C) 1995, 2018, Oracle and/or its affiliates. All rights reserved.
    
    
    
    GGSCI (ip-10-0-1-170.ec2.internal) 1> stats rkafka

  4. Query the stock_trades table using the Athena console.

Summary

This post illustrates how you can offload reporting activity to Athena with S3 to reduce reporting costs and improve OLTP performance on the source database. This post serves as a guide for setting up a solution in a staging environment.

Deploying this solution in a production environment may require additional considerations, for example, high availability of GoldenGate hub servers, different file encoding formats for optimal query performance, and security considerations. Additionally, you can achieve similar outcomes using technologies like AWS Database Migration Service instead of GoldenGate for database CDC and Kafka Connect for the S3 sink.

 


About the Authors

Sreekanth Krishnavajjala is a solutions architect at Amazon Web Services.

Vinod Kataria is a senior partner solutions architect at Amazon Web Services.

AWS Lake Formation – Now Generally Available

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-lake-formation-now-generally-available/

As soon as companies started to have data in digital format, it was possible for them to build a data warehouse, collecting data from their operational systems, such as Customer relationship management (CRM) and Enterprise resource planning (ERP) systems, and use this information to support their business decisions.

The reduction in costs of storage, together with an even greater reduction in complexity for managing large quantities of data, made possible by services such as Amazon S3, has allowed companies to retain more information, including raw data that is not structured, such as logs, images, video, and scanned documents.

This is the idea of a data lake: to store all your data in one, centralized repository, at any scale. We are seeing this approach with customers like Netflix, Zillow, NASDAQ, Yelp, iRobot, FINRA, and Lyft. They can run their analytics on this larger dataset, from simple aggregations to complex machine learning algorithms, to better discover patterns in their data and understand their business.

Last year at re:Invent we introduced in preview AWS Lake Formation, a service that makes it easy to ingest, clean, catalog, transform, and secure your data and make it available for analytics and machine learning. I am happy to share that Lake Formation is generally available today!

With Lake Formation you have a central console to manage your data lake, for example to configure the jobs that move data from multiple sources, such as databases and logs, to your data lake. Having such a large and diversified amount of data also makes configuring the right access permissions critical. You can secure access to metadata in the Glue Data Catalog and data stored in S3 using a single set of granular data access policies defined in Lake Formation. These policies allow you to define table- and column-level data access.

One of the things I like most about Lake Formation is that it works with your data already in S3! You can easily register your existing data with Lake Formation, and you don’t need to change the existing processes that load your data to S3. Since data remains in your account, you have full control.

You can also use Glue ML Transforms to easily deduplicate your data. Deduplication is important not only to reduce the amount of storage you need, but also to make analyzing your data more efficient, because you avoid both the overhead and the possible confusion of looking at the same data twice. This problem is trivial if duplicate records can be identified by a unique key, but becomes very challenging when you have to do a “fuzzy match”. A similar approach can be used for record linkage, which is when you are looking for similar items in different tables, for example to do a “fuzzy join” of two databases that do not share a unique key.

In this way, implementing a data lake from scratch is much faster, and managing a data lake is much easier, making these technologies available to more customers.

Creating a Data Lake
Let’s build a data lake using the Lake Formation console. First I register the S3 buckets that are going to be part of my data lake. Then I create a database and grant permission to the IAM users and roles that I am going to use to manage my data lake. The database is registered in the Glue Data Catalog and holds the metadata required to analyze the raw data, such as the structure of the tables that are going to be automatically generated during data ingestion.

Managing permissions is one of the most complex tasks for a data lake. Consider, for example, the huge amount of data that can be part of it, the sensitive, mission-critical nature of some of that data, and the different structured, semi-structured, and unstructured formats in which data can reside. Lake Formation makes this easier with a central location where you can give IAM users, roles, groups, and Active Directory users (via federation) access to databases and tables, optionally allowing or denying access to specific columns within a table.

To simplify data ingestion, I can use blueprints that create the necessary workflows, crawlers and jobs on AWS Glue for common use cases. Workflows enable orchestration of your data loading workloads by building dependencies between Glue entities, such as triggers, crawlers and jobs, and allow you to track visually the status of the different nodes in the workflows on the console, making it easier to monitor progress and troubleshoot issues.

Database blueprints help load data from operational databases. For example, if you have an e-commerce website, you can ingest all your orders in your data lake. You can load a full snapshot from an existing database, or incrementally load new data. In case of an incremental load, you can select a table and one or more of its columns as bookmark keys (for example, a timestamp in your orders) to determine previously imported data.

Log file blueprints simplify ingesting the logging formats used by Application Load Balancers, Elastic Load Balancers, and AWS CloudTrail. Let’s see how that works in more depth.

Security is always a top priority, and I want to have a forensic log of all management operations across my account, so I choose the CloudTrail blueprint. As the source, I select a trail collecting my CloudTrail logs from all regions into an S3 bucket. In this way, I’ll be able to query account activity across all my AWS infrastructure. This works similarly for a larger organization with multiple AWS accounts: they just need to apply the trail to their whole organization when configuring it in the CloudTrail console.

I then select the target database, and the S3 location for the data lake. As data format I use Parquet, a columnar storage format that will make querying the data faster and cheaper. The import frequency can be hourly to monthly, with the option to choose the day of the week and the time. For now, I want to run the workflow on demand. I can do that from the console or programmatically, for example using any AWS SDK or the AWS Command Line Interface (CLI).

Finally, I give the workflow a name, the IAM role to use during execution, and a prefix for the tables that will be automatically created by this workflow.

I start the workflow from the Lake Formation console and select to view the workflow graph. This opens the AWS Glue console, where I can visually see the steps of the workflow and monitor the progress of this run.
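
Because the blueprint creates an AWS Glue workflow behind the scenes, the same run can also be started programmatically. Here is a minimal boto3 sketch, assuming the workflow is named cloudtrail-ingest (the name is hypothetical):

import boto3

glue = boto3.client('glue')

# Start an on-demand run of the workflow created by the Lake Formation blueprint
run = glue.start_workflow_run(Name='cloudtrail-ingest')
print(run['RunId'])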

When the workflow is completed, a new table is available in my data lake database. The source data remains as logs in the S3 bucket where CloudTrail delivers it, but now I also have it consolidated, in Parquet format and partitioned by date, in my data lake S3 location. To optimize costs, I can set up an S3 lifecycle policy that automatically expires data in the source S3 bucket after a safe amount of time has passed.
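
As an example, here is a minimal boto3 sketch of such a lifecycle policy; the bucket name, prefix, and retention period are assumptions:

import boto3

s3 = boto3.client('s3')

# Expire raw CloudTrail logs in the source bucket 90 days after delivery
s3.put_bucket_lifecycle_configuration(
    Bucket='my-cloudtrail-logs',
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'expire-raw-cloudtrail-logs',
            'Filter': {'Prefix': 'AWSLogs/'},
            'Status': 'Enabled',
            'Expiration': {'Days': 90},
        }]
    },
)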

Securing Access to the Data Lake
Lake Formation provides secure and granular access to data stored in the data lake, via a new grant/revoke permissions model that augments IAM policies. It is simple to set up these permissions, for example using the console:

I simply select the IAM user or role I want to grant access to. Then I select the database and optionally the tables and the columns I want to provide access to. It is also possible to select which type of access to provide. For this demo, simple select permissions are sufficient.
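
The same grant can be made programmatically. Here is a minimal boto3 sketch; the account ID and column names are placeholders, while the user, database, and table names match the ones used later in this walkthrough:

import boto3

lf = boto3.client('lakeformation')

# Grant SELECT on a subset of columns to an IAM user
lf.grant_permissions(
    Principal={'DataLakePrincipalIdentifier': 'arn:aws:iam::123456789012:user/limitedview'},
    Resource={
        'TableWithColumns': {
            'DatabaseName': 'mylake',
            'Name': 'my_trail_cloudtrail',
            'ColumnNames': ['eventsource', 'eventname', 'eventtime'],  # placeholder columns
        }
    },
    Permissions=['SELECT'],
)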

Accessing the Data Lake
Now I can query the data using tools like Amazon Athena or Amazon Redshift. For example, I open the query editor in the Athena console. First, I want to use my new data lake to look into which source IP addresses are most common in my AWS Account activity:

SELECT sourceipaddress, count(*)
FROM my_trail_cloudtrail
GROUP BY  sourceipaddress
ORDER BY  2 DESC;

Looking at the result of the query, you can see which AWS API endpoints I use the most. Then, I’d like to check which user identity types are used. That information is stored in JSON format inside one of the columns. I can use some of the JSON functions available with Amazon Athena to extract it in my SQL statements:

SELECT json_extract_scalar(useridentity, '$.type'), count(*)
FROM "mylake"."my_trail_cloudtrail"
GROUP BY  json_extract_scalar(useridentity, '$.type')
ORDER BY  2 DESC;

Most of the time, AWS services are the ones creating activity in my trail. These queries are just an example, but they quickly give me deeper insight into what is happening in my AWS account.

Think of what could be a similar impact for your business! Using database and logs blueprints, you can quickly create workflows to ingest data from multiple sources within your organization, set the right permission at column level of who can have access to any information collected, clean and prepare your data using machine learning transforms, and correlate and visualize the information using tools like Amazon Athena, Amazon Redshift, and Amazon QuickSight.

Customizing Data Access with Column-Level Permissions
To follow data privacy guidelines and compliance requirements, the mission-critical data stored in a data lake often requires custom views for different stakeholders inside the company. Let’s compare the visibility of two IAM users in my AWS account: one that has full permissions on a table, and one that has select access to only a subset of the columns of the same table.

I already have a user with full access to the table containing my CloudTrail data, it’s called danilop. I create a new limitedview IAM user and I give it access to the Athena console. In the Lake Formation console, I only give this new user select permissions on three of the columns.

To verify the different access to the data in the table, I log in with one user at a time and go to the Athena console. On the left I can explore which tables and columns the logged-in user can see in the Glue Data Catalog. Here’s a comparison for the two users, side-by-side:

The limited user has access only to the three columns that I explicitly configured, and to the four columns used for partitioning the table, whose access is required to see any data. When I query the table in the Athena console with a select * SQL statement, logged in as the limitedview user, I only see data from those seven columns:

Available Now
There is no additional cost in using AWS Lake Formation; you pay for the use of the underlying services such as Amazon S3 and AWS Glue. One of the core benefits of Lake Formation is the security policies it introduces. Previously, you had to use separate policies to secure data and metadata access, and those policies only allowed table-level access. Now you can give each user, from a central location, access to only the columns they need to use.

AWS Lake Formation is now available in US East (N. Virginia), US East (Ohio), US West (Oregon), Europe (Ireland), and Asia Pacific (Tokyo). Redshift integration with Lake Formation requires Redshift cluster version 1.0.8610 or higher, your clusters should have been automatically updated by the time you read this. Support for Apache Spark with Amazon EMR will follow over the next few months.

I only scratched the surface of what you can do with Lake Formation. Building and managing a data lake for your business is now much easier, let me know how you are using these new capabilities!

Danilo

AWS Project Resilience – Up to $2K in AWS Credits to Support DR Preparation

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-project-resilience-up-to-2k-in-aws-credits-to-support-dr-preparation/

We want to help state and local governments, community organizations, and educational institutions to better prepare for natural and man-made disasters that could affect their ability to run their mission-critical IT systems.

Today we are launching AWS Project Resilience. This new element of our existing Disaster Response program offers up to $2,000 in AWS credits to organizations of the types that I listed above. The program is open to new and existing customers, with distinct benefits for each:

New Customers – Eligible new customers can submit a request for up to $2,000 in AWS Project Resilience credits that can be used to offset costs incurred by storing critical datasets in Amazon Simple Storage Service (S3).

Existing Customers – Eligible existing customers can submit a request for up to $2,000 in AWS Project Resilience credits to offset the costs incurred by engaging CloudEndure and AWS Disaster Response experts to do a deep dive on an existing business continuity architecture.

Earlier this month I sat down with my colleague Ana Visneski to learn more about disaster preparedness, disaster recovery, and AWS Project Resilience. Here’s our video:

To learn more and to apply to the program, visit the AWS Project Resilience page!

Jeff;

 

Amazon S3 Update – SigV2 Deprecation Period Extended & Modified

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-s3-update-sigv2-deprecation-period-extended-modified/

Every request that you make to the Amazon S3 API must be signed to ensure that it is authentic. In the early days of AWS we used a signing model that is known as Signature Version 2, or SigV2 for short. Back in 2012, we announced SigV4, a more flexible signing method, and made it the sole signing method for all regions launched after 2013. At that time, we recommended that you use it for all new S3 applications.

Last year we announced that we would be ending support for SigV2 later this month. While many customers have updated their applications (often with nothing more than a simple SDK update) to use SigV4, we have also received many requests for us to extend support.

New Date, New Plan
In response to the feedback on our original plan, we are making an important change. Here’s the summary:

Original Plan – Support for SigV2 ends on June 24, 2019.

Revised Plan – Any new buckets created after June 24, 2020 will not support SigV2 signed requests, although existing buckets will continue to support SigV2 while we work with customers to move off this older request signing method.

Even though you can continue to use SigV2 on existing buckets, and in the subset of AWS regions that support SigV2, I encourage you to migrate to SigV4, gaining some important security and efficiency benefits in the process. The newer signing method uses a separate, specialized signing key that is derived from the long-term AWS access key. The key is specific to the service, region, and date. This provides additional isolation between services and regions, and provides better protection against key reuse. Internally, our SigV4 implementation is able to securely cache the results of authentication checks; this reduces latency and adds to the overall resiliency of your application. To learn more, read Changes in Signature Version 4.
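
As a rough illustration of the derived-key idea, here is a sketch of the SigV4 signing key derivation in Python; the secret key, date, Region, and service values are placeholders, and the full signing process (canonical request, string to sign, and so on) is described in the SigV4 documentation:

import hmac
import hashlib

def sign(key, msg):
    return hmac.new(key, msg.encode('utf-8'), hashlib.sha256).digest()

def sigv4_signing_key(secret_key, date_stamp, region, service):
    # The signing key is derived from the long-term secret key and is
    # scoped to a single date, Region, and service.
    k_date = sign(('AWS4' + secret_key).encode('utf-8'), date_stamp)
    k_region = sign(k_date, region)
    k_service = sign(k_region, service)
    return sign(k_service, 'aws4_request')

signing_key = sigv4_signing_key('EXAMPLE-SECRET-KEY', '20190624', 'us-east-1', 's3')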

Identifying Use of SigV2
S3 has been around since 2006 and some of the code that you or your predecessors wrote way back then might still be around, dutifully making requests that are signed with SigV2. You can use CloudTrail Data Events or S3 Server Access Logs to find the old-school requests and target the applications for updates:

CloudTrail Data Events – Look for the SignatureVersion element within the additionalEventData element of each CloudTrail event entry (read Using AWS CloudTrail to Identify Amazon S3 Signature Version 2 Requests to learn more).

S3 Server Access Logs – Look for the SignatureVersion element in the logs (read Using Amazon S3 Access Logs to Identify Signature Version 2 Requests to learn more).
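
If your CloudTrail data events are already queryable in Athena, a query along the following lines can surface remaining SigV2 callers. This is only a sketch; the database, table, and output location are placeholders, and the additionaleventdata layout is an assumption about how your CloudTrail table is defined:

import boto3

athena = boto3.client('athena')

# Find principals and operations still signing S3 requests with SigV2
query = """
SELECT useridentity.arn, eventname, count(*) AS requests
FROM cloudtrail_logs
WHERE json_extract_scalar(additionaleventdata, '$.SignatureVersion') = 'SigV2'
GROUP BY useridentity.arn, eventname
ORDER BY requests DESC
"""

athena.start_query_execution(
    QueryString=query,
    QueryExecutionContext={'Database': 'default'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-query-results/'},
)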

Updating to SigV4

“Do we need to change our code?”

The Europe (Frankfurt), US East (Ohio), Canada (Central), Europe (London), Asia Pacific (Seoul), Asia Pacific (Mumbai), Europe (Paris), China (Ningxia), Europe (Stockholm), Asia Pacific (Osaka Local), AWS GovCloud (US-East), and Asia Pacific (Hong Kong) Regions were launched after 2013, and support SigV4 but not SigV2. If you have code that accesses S3 buckets in those Regions, it is already making exclusive use of SigV4.

If you are using the latest version of the AWS SDKs, you are either ready or just about ready for the SigV4 requirement on new buckets beginning June 24, 2020. If you are using an older SDK, please check out the detailed version list at Moving from Signature Version 2 to Signature Version 4 for more information.

There are a few situations where you will need to make some changes to your code. For example, if you are using pre-signed URLs with the AWS Java, JavaScript (node.js), or Python SDK, you need to set the correct region and signature version in the client configuration. Also, be aware that SigV4 pre-signed URLs are valid for a maximum of 7 days, while SigV2 pre-signed URLs can be created with a maximum expiry time that could be many weeks or years in the future (in almost all cases, using time-limited URLs is a much better practice). Using SigV4 will improve your security profile, but might also mandate a change in the way that you create, store, and use the pre-signed URLs. While using long-lived pre-signed URLs was easy and convenient for developers, using SigV4 with URLs that have a finite expiration is a much better security practice.
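
With the Python SDK, for example, the change is limited to the client configuration. This is a sketch; the bucket, key, and Region are placeholders:

import boto3
from botocore.client import Config

# Explicitly request SigV4 and the bucket's Region when creating the client
s3 = boto3.client(
    's3',
    region_name='us-east-2',
    config=Config(signature_version='s3v4'),
)

# SigV4 pre-signed URLs are valid for at most 7 days (604800 seconds)
url = s3.generate_presigned_url(
    'get_object',
    Params={'Bucket': 'my-bucket', 'Key': 'my-object'},
    ExpiresIn=3600,
)
print(url)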

If you are using Amazon EMR, you should upgrade your clusters to version 5.22.0 or later so that all requests to S3 are made using SigV4 (see Amazon EMR 5.x Release Versions for more info).

If your S3 objects are fronted by Amazon CloudFront and you are signing your own requests, be sure to update your code to use SigV4. If you are using Origin Access Identities to restrict access to S3, be sure to include the x-amz-content-sha256 header and the proper regional S3 domain endpoint.

We’re Here to Help
The AWS team wants to help make your transition to SigV4 as smooth and painless as possible. If you run into problems, I strongly encourage you to make use of AWS Support, as described in Getting Started with AWS Support.

You can also Discuss this Post on Reddit!

Jeff;

 

How to export an Amazon DynamoDB table to Amazon S3 using AWS Step Functions and AWS Glue

Post Syndicated from Joe Feeney original https://aws.amazon.com/blogs/big-data/how-to-export-an-amazon-dynamodb-table-to-amazon-s3-using-aws-step-functions-and-aws-glue/

In typical AWS fashion, not a week had gone by after I published How Goodreads offloads Amazon DynamoDB tables to Amazon S3 and queries them using Amazon Athena on the AWS Big Data blog when the AWS Glue team released the ability for AWS Glue crawlers and AWS Glue ETL jobs to read from DynamoDB tables natively. I was actually pretty excited about this. Less code means fewer bugs. The original architecture had been around for at least 18 months and could be simplified significantly with a little bit of work.

Refactoring the data pipeline

The AWS Data Pipeline architecture outlined in my previous blog post is just under two years old now. We had used data pipelines as a way to back up Amazon DynamoDB data to Amazon S3 in case of a catastrophic developer error. However, with DynamoDB point-in-time recovery we have a better, native mechanism for disaster recovery. Additionally, with data pipelines we still own the operations associated with the clusters themselves, even if they are transient. A common challenge is keeping our clusters up to date with recent releases of Amazon EMR to help mitigate any outstanding bugs. Another is the inefficiency of needing to spin up an EMR cluster for each DynamoDB table.

I decided to take a step back and list the capabilities I wanted to have in the next iteration:

  • Export tables using AWS Glue instead of EMR.
    • AWS Glue provides a serverless ETL environment where I don’t have to worry about the underlying infrastructure. This minimizes operational tasks like keeping up with the EMR release tags.
  • Use a workflow solution that works across services like AWS Glue and Amazon Athena.
    • In the first iteration, the workflow was spread across various services. Unless you had the entire pipeline in your head, it was difficult to get a bird’s-eye view of how the pipeline was progressing.
  • Ability to select different formats.
    • For data engineering, I prefer Apache Parquet. However, customers might prefer a different format.
  • Add exported data to Athena.
    • I find that the easier it is for the data to be queried, the more likely it’s used.

Architecture overview

At a high level, this is the architecture:

  • We’re using AWS Step Functions as the workflow engine.
    • Each step is either a built-in Step Functions state, a service integration, or a simple Python AWS Lambda function. For example, GlueStartJobRun uses the synchronous job run service integration, as discussed in the documentation.
    • We get a visual representation of the entire pipeline.
    • It’s quick to onboard new developers.
  • An event in Amazon CloudWatch Events, which is disabled to start, triggers a Step Functions state machine with a JSON payload that contains the following:
    • AWS Glue job name
    • Export destination
    • DynamoDB table name
    • Desired read percentage
    • AWS Glue crawler name
  • AWS Glue exports a DynamoDB table in your preferred format to S3 as snapshots_your_table_name. The data is partitioned by the snapshot_timestamp.
  • An AWS Glue crawler adds or updates your data’s schema and partitions in the AWS Glue Data Catalog.
  • Finally, we create an Athena view that only has data from the latest export snapshot.

A simple AWS Glue ETL job

The script that I created accepts AWS Glue ETL job arguments for the table name, read throughput, output, and format. Behind the scenes, AWS Glue scans the DynamoDB table. AWS Glue makes sure that every top-level attribute makes it into the schema, no matter how sparse your attributes are (as discussed in the DynamoDB documentation).

Here’s the script:

import sys
import datetime
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext

ARG_TABLE_NAME = "table_name"
ARG_READ_PERCENT = "read_percentage"
ARG_OUTPUT = "output_prefix"
ARG_FORMAT = "output_format"

PARTITION = "snapshot_timestamp"

args = getResolvedOptions(sys.argv,
  [
    'JOB_NAME',
    ARG_TABLE_NAME,
    ARG_READ_PERCENT,
    ARG_OUTPUT,
    ARG_FORMAT
  ]
)

table_name = args[ARG_TABLE_NAME]
read = args[ARG_READ_PERCENT]
output_prefix = args[ARG_OUTPUT]
fmt = args[ARG_FORMAT]

print("Table name:", table_name)
print("Read percentage:", read)
print("Output prefix:", output_prefix)
print("Format:", fmt)

date_str = datetime.datetime.utcnow().strftime('%Y-%m-%dT%H:%M')
output = "%s/%s=%s" % (output_prefix, PARTITION, date_str)

sc = SparkContext()
glueContext = GlueContext(sc)

table = glueContext.create_dynamic_frame.from_options(
  "dynamodb",
  connection_options={
    "dynamodb.input.tableName": table_name,
    "dynamodb.throughput.read.percent": read
  }
)

glueContext.write_dynamic_frame.from_options(
  frame=table,
  connection_type="s3",
  connection_options={
    "path": output
  },
  format=fmt,
  transformation_ctx="datasink"
)

There’s not a lot here. We’re creating a DynamicFrameReader of connection type dynamodb and passing in the table name and desired maximum read throughput consumption. We pass that data frame to a DynamicFrameWriter that writes the table to S3 in the specified format.

Athena views

Most teams at Amazon own applications that have multiple DynamoDB tables, including my own team. Our current application uses five primary tables. Ideally, at the end of an export workflow you can write simple, obvious queries across a consistent view of your tables. However, each exported table is partitioned by the timestamp from when the table was exported. This makes querying across one or more tables very cumbersome, because you have to add a WHERE snapshot_timestamp = clause to every table reference in your query. Additionally, each table might have a different snapshot_timestamp value for any given day!

The final step in this export workflow creates an Athena view that adds that WHERE clause for you. This means that you can interact with your DynamoDB exports as if they were one sane view of your exported DynamoDB tables.
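
The view that this step creates is conceptually similar to the following. This is a sketch built around the Reviews example used later in this post, not the exact DDL produced by the workflow; the Athena results location is a placeholder:

import boto3

athena = boto3.client('athena')

# Create a view that always points at the most recent snapshot
view_ddl = """
CREATE OR REPLACE VIEW reviews AS
SELECT *
FROM snapshots_reviews
WHERE snapshot_timestamp =
    (SELECT max(snapshot_timestamp) FROM snapshots_reviews)
"""

athena.start_query_execution(
    QueryString=view_ddl,
    QueryExecutionContext={'Database': 'dynamodb_exports'},
    ResultConfiguration={'OutputLocation': 's3://my-athena-query-results/'},
)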

Setting up the infrastructure

The AWS CloudFormation stacks I create are split into two stacks. The common stack contains shared infrastructure, and you need only one of these per AWS Region. The table stacks are designed in such a way that you can create one per table-format combination in any given AWS Region. It contains the CloudWatch event logic and AWS Glue components needed to export and transform DynamoDB tables.

Creating the common stack

The common stack contains the majority of the infrastructure. That includes the Step Functions state machine and Lambda functions to trigger and check the state of asynchronous jobs. It also includes IAM roles that the export stacks use, and the S3 bucket to store the exports.

To create the common stack, do the following:

  1. Choose this Launch Stack
  2. Choose I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  3. Choose Create Stack.

Creating the table export stack

If you don’t have a DynamoDB table to export, follow the original blog post. Start with the Working with the Reviews stack section and continue until you’ve added the two Items to the table. Otherwise, feel free to point this CloudFormation stack at your favorite DynamoDB table that is using provisioned throughput. Tables that use on-demand throughput are not currently supported.

Because so much of this architecture is shareable, there’s not much in the table export stack. This stack defines the CloudWatch event used to trigger the Step Functions state machine with a JSON payload containing all the necessary metadata. Additionally, it contains the AWS Glue ETL job that exports the table and the AWS Glue Crawler that updates metadata in the AWS Glue Data Catalog.

Technically, you can define the AWS Glue ETL job in the common stack because it’s already parameterized. However, the default limit for concurrent runs for an AWS Glue job is three. This is a soft limit, but with this architecture you have headroom to export up to 25 tables before asking for a limit increase.

To create the table export stack, do the following:

  1. Choose this Launch Stack
  2. Choose an output format from the list. All the available formats are supported by Athena natively.
  3. Enter your DynamoDB table name.
  4. Enter the percentage of Read Capacity Units (RCUs) that the job should consume from your table’s currently provisioned throughput. This percentage is expressed as a float between 0.1 and 1.0 inclusive. The default is 0.25 (25 percent).

As an example, suppose that your table’s RCUs are set to 100 and you use the default of 0.25 (25 percent). The AWS Glue job then consumes 25 RCUs while running.

  5. Choose Create.

Kicking off a state machine execution

To demonstrate how this works, we run the DynamoDB export state machine manually by passing it the JSON payload that the CloudWatch event would pass to Step Functions.

Getting the JSON payload from CloudWatch Events

To get the JSON payload, do the following:

  1. Open CloudWatch in the AWS Management Console.
  2. In the left column under Events, choose Rules.
  3. Choose your rule from the list. It is prefixed by AWSBigDataBlog-.
  4. For Actions, choose Edit.
  5. Copy the JSON payload from the Configure input section of Targets.
  6. Choose Cancel to exit edit mode.

Starting a state machine execution

To start an execution of the state machine, take the following steps:

  1. Open Step Functions in the console.
  2. Choose the DynamoDBExportAndAthenaLoad state machine.
  3. Choose Start execution.
  4. Paste the JSON payload into the Input field.
  5. Choose Start execution.

There are a few ways to follow along with the execution. As steps are entered and exited, entries are added to the Execution event history list. This is a great way to see what state (event in Lambda speak) is passed to each step, in case you need to debug.

You can also expand the Visual workflow. It’s a great high-level view to see how the workflow is progressing.

After the workflow is finished, you see two new tables under the dynamodb_exports database in your AWS Glue Data Catalog. Your DynamoDB snapshots table name is prefixed with snapshots_. The schema is formatted for the AWS Glue Data Catalog (lowercase and hyphens transformed to underscores). You also have a view table with the same table name formatted for AWS Glue Data Catalog but without the snapshots_ prefix.

Querying your data

To showcase how having a separate view table of the most recent snapshot of a table is useful, I use the Reviews table from the previous blog post. The table has two items. I have also run the export workflow twice. As you can see when you preview the table, there are four items total. That’s because each snapshot contains two items.

From the items, the latest snapshot_timestamp is 2019-01-11T23:26. When I run the same preview query against the view table reviews, I see only two items, which is what I expect. The view takes care of specifying the where snapshot_timestamp=… clause so you don’t have to.

Wrapping up

In this post, I showed you how to use AWS Glue’s DynamoDB integration and AWS Step Functions to create a workflow that exports your DynamoDB tables to S3 in Parquet format. I also showed how to create an Athena view for each table’s latest snapshot, giving you a consistent view of your DynamoDB table exports.


About the Author

Joe Feeney is a Software Engineer at Amazon Go, where he does secret stuff and he’s quite chuffed with that. He enjoys embarrassing his family by taking Mario Kart entirely too seriously.

Trigger cross-region replication of pre-existing objects using Amazon S3 inventory, Amazon EMR, and Amazon Athena

Post Syndicated from Michael Sambol original https://aws.amazon.com/blogs/big-data/trigger-cross-region-replication-of-pre-existing-objects-using-amazon-s3-inventory-amazon-emr-and-amazon-athena/

In Amazon Simple Storage Service (Amazon S3), you can use cross-region replication (CRR) to copy objects automatically and asynchronously across buckets in different AWS Regions. CRR is a bucket-level configuration, and it can help you meet compliance requirements and minimize latency by keeping copies of your data in different Regions. CRR replicates all objects in the source bucket, or optionally a subset, controlled by prefix and tags.

Objects that exist before you enable CRR (pre-existing objects) are not replicated. Similarly, objects might fail to replicate (failed objects) if permissions aren’t in place, either on the IAM role used for replication or the bucket policy (if the buckets are in different AWS accounts).

In our work with customers, we have seen situations where large numbers of objects aren’t replicated for the previously mentioned reasons. In this post, we show you how to trigger cross-region replication for pre-existing and failed objects.

Methodology

At a high level, our strategy is to perform a copy-in-place operation on pre-existing and failed objects. This operation uses the Amazon S3 API to copy the objects over the top of themselves, preserving tags, access control lists (ACLs), metadata, and encryption keys. The operation also resets the Replication_Status flag on the objects. This triggers cross-region replication, which then copies the objects to the destination bucket.

To accomplish this, we use the following:

  • Amazon S3 inventory to identify objects to copy in place. These objects don’t have a replication status, or they have a status of FAILED.
  • Amazon Athena and AWS Glue to expose the S3 inventory files as a table.
  • Amazon EMR to execute an Apache Spark job that queries the AWS Glue table and performs the copy-in-place operation.

Object filtering

To reduce the size of the problem (we’ve seen buckets with billions of objects!) and eliminate S3 List operations, we use Amazon S3 inventory. S3 inventory is enabled at the bucket level, and it provides a report of S3 objects. The inventory files contain the objects’ replication status: PENDING, COMPLETED, FAILED, or REPLICA. Pre-existing objects do not have a replication status in the inventory.

Interactive analysis

To simplify working with the files that are created by S3 inventory, we create a table in the AWS Glue Data Catalog. You can query this table using Amazon Athena and analyze the objects.  You can also use this table in the Spark job running on Amazon EMR to identify the objects to copy in place.

Copy-in-place execution

We use a Spark job running on Amazon EMR to perform concurrent copy-in-place operations of the S3 objects. This step allows the number of simultaneous copy operations to be scaled up. This improves performance on a large number of objects compared to doing the copy operations consecutively with a single-threaded application.

Account setup

For the purpose of this example, we created three S3 buckets. The buckets are specific to our demonstration. If you’re following along, you need to create your own buckets (with different names).

We’re using a source bucket named crr-preexisting-demo-source and a destination bucket named crr-preexisting-demo-destination. The source bucket contains the pre-existing objects and the objects with the replication status of FAILED. We store the S3 inventory files in a third bucket named crr-preexisting-demo-inventory.

The following diagram illustrates the basic setup.

You can use any bucket to store the inventory, but the bucket policy must include the following statement (change Resource and aws:SourceAccount to match yours).

{
    "Version": "2012-10-17",
    "Id": "S3InventoryPolicy",
    "Statement": [
        {
            "Sid": "S3InventoryStatement",
            "Effect": "Allow",
            "Principal": {
                "Service": "s3.amazonaws.com"
            },
            "Action": "s3:PutObject",
            "Resource": "arn:aws:s3:::crr-preexisting-demo-inventory/*",
            "Condition": {
                "StringEquals": {
                    "s3:x-amz-acl": "bucket-owner-full-control",
                    "aws:SourceAccount": "111111111111"
                }
            }
        }
    ]
}

In our example, we uploaded six objects to crr-preexisting-demo-source. We added three objects (preexisting-*.txt) before CRR was enabled. We also added three objects (failed-*.txt) after permissions were removed from the CRR IAM role, causing CRR to fail.

Enable S3 inventory

You need to enable S3 inventory on the source bucket. You can do this on the Amazon S3 console as follows:

On the Management tab for the source bucket, choose Inventory.

Choose Add new, and complete the settings as shown, choosing the CSV format and selecting the Replication status check box. For detailed instructions for creating an inventory, see How Do I Configure Amazon S3 Inventory? in the Amazon S3 Console User Guide.
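
If you prefer to configure the inventory programmatically, the equivalent boto3 call is sketched below; the daily frequency is an assumption, and the account ID matches the placeholder used in the bucket policy above:

import boto3

s3 = boto3.client('s3')

# Deliver a daily CSV inventory of the source bucket, including replication status
s3.put_bucket_inventory_configuration(
    Bucket='crr-preexisting-demo-source',
    Id='crr-preexisting-demo',
    InventoryConfiguration={
        'Id': 'crr-preexisting-demo',
        'IsEnabled': True,
        'IncludedObjectVersions': 'Current',
        'Schedule': {'Frequency': 'Daily'},
        'OptionalFields': ['ReplicationStatus'],
        'Destination': {
            'S3BucketDestination': {
                'Bucket': 'arn:aws:s3:::crr-preexisting-demo-inventory',
                'AccountId': '111111111111',
                'Format': 'CSV',
            }
        },
    },
)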

After enabling S3 inventory, you need to wait for the inventory files to be delivered. It can take up to 48 hours to deliver the first report. If you’re following the demo, ensure that the inventory report is delivered before proceeding.

Here’s what our example inventory file looks like:

You can also look at the objects’ Overview tab on the S3 console. The pre-existing objects do not have a replication status, but the failed objects show the following:

Register the table in the AWS Glue Data Catalog using Amazon Athena

To be able to query the inventory files using SQL, first you need to create an external table in the AWS Glue Data Catalog. Open the Amazon Athena console at https://console.aws.amazon.com/athena/home.

On the Query Editor tab, run the following SQL statement. This statement registers the external table in the AWS Glue Data Catalog.

CREATE EXTERNAL TABLE IF NOT EXISTS
crr_preexisting_demo (
    `bucket` string,
    key string,
    replication_status string
)
PARTITIONED BY (dt string)
ROW FORMAT DELIMITED
    FIELDS TERMINATED BY ','
    ESCAPED BY '\\'
    LINES TERMINATED BY '\n'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
LOCATION 's3://crr-preexisting-demo-inventory/crr-preexisting-demo-source/crr-preexisting-demo/hive';

After creating the table, you need to make the AWS Glue Data Catalog aware of any existing data and partitions by adding partition metadata to the table. To do this, you use the Metastore Consistency Check utility to scan for and add partition metadata to the AWS Glue Data Catalog.

MSCK REPAIR TABLE crr_preexisting_demo;

To learn more about why this is required, see the documentation on MSCK REPAIR TABLE and data partitioning in the Amazon Athena User Guide.

Now that the table and partitions are registered in the Data Catalog, you can query the inventory files with Amazon Athena.

SELECT * FROM crr_preexisting_demo where dt='2019-02-24-04-00';

The results of the query are as follows.

The query returns all rows in the S3 inventory for a specific delivery date. You’re now ready to launch an EMR cluster to copy in place the pre-existing and failed objects.

Note: If your goal is to fix FAILED objects, make sure that you correct what caused the failure (IAM permissions or S3 bucket policies) before proceeding to the next step.

Create an EMR cluster to copy objects

To parallelize the copy-in-place operations, run a Spark job on Amazon EMR. To facilitate EMR cluster creation and EMR step submission, we wrote a bash script (available in this GitHub repository).

To run the script, clone the GitHub repo. Then launch the EMR cluster as follows:

$ git clone https://github.com/aws-samples/amazon-s3-crr-preexisting-objects
$ ./launch_emr.sh

Note: Running the bash script results in AWS charges. By default, it creates two Amazon EC2 instances, one m4.xlarge and one m4.2xlarge. Auto-termination is enabled so when the cluster is finished with the in-place copies, it terminates.

The script performs the following tasks:

  1. Creates the default EMR roles (EMR_EC2_DefaultRole and EMR_DefaultRole).
  2. Uploads the files used for bootstrap actions and steps to Amazon S3 (we use crr-preexisting-demo-inventory to store these files).
  3. Creates an EMR cluster with Apache Spark installed using the create-cluster command.

After the cluster is provisioned:

  1. A bootstrap action installs boto3 and awscli.
  2. Two steps execute, copying the Spark application to the master node and then running the application.

The following are highlights from the Spark application. You can find the complete code for this example in the amazon-s3-crr-preexisting-objects repo on GitHub.

Here we select records from the table registered with the AWS Glue Data Catalog, filtering for objects with a replication_status of "FAILED" or "" (empty).

query = """
        SELECT bucket, key
        FROM {}
        WHERE dt = '{}'
        AND (replication_status = '""'
        OR replication_status = '"FAILED"')
        """.format(inventory_table, inventory_date)

print('Query: {}'.format(query))

crr_failed = spark.sql(query)

We call the copy_object function for each key returned by the previous query.

def copy_object(self, bucket, key, copy_acls):
        dest_bucket = self._s3.Bucket(bucket)
        dest_obj = dest_bucket.Object(key)

        src_bucket = self._s3.Bucket(bucket)
        src_obj = src_bucket.Object(key)

        # Get the S3 Object's Storage Class, Metadata, 
        # and Server Side Encryption
        storage_class, metadata, sse_type, last_modified = \
            self._get_object_attributes(src_obj)

        # Update the Metadata so the copy will work
        metadata['forcedreplication'] = runtime

        # Get and copy the current ACL
        if copy_acls:
            src_acl = src_obj.Acl()
            src_acl.load()
            dest_acl = {
                'Grants': src_acl.grants,
                'Owner': src_acl.owner
            }

        params = {
            'CopySource': {
                'Bucket': bucket,
                'Key': key
            },
            'MetadataDirective': 'REPLACE',
            'TaggingDirective': 'COPY',
            'Metadata': metadata,
            'StorageClass': storage_class
        }

        # Set Server Side Encryption
        if sse_type == 'AES256':
            params['ServerSideEncryption'] = 'AES256'
        elif sse_type == 'aws:kms':
            kms_key = src_obj.ssekms_key_id
            params['ServerSideEncryption'] = 'aws:kms'
            params['SSEKMSKeyId'] = kms_key

        # Copy the S3 Object over the top of itself, 
        # with the Storage Class, updated Metadata, 
        # and Server Side Encryption
        result = dest_obj.copy_from(**params)

        # Put the ACL back on the Object
        if copy_acls:
            dest_obj.Acl().put(AccessControlPolicy=dest_acl)

        return {
            'CopyInPlace': 'TRUE',
            'LastModified': str(result['CopyObjectResult']['LastModified'])
        }

Note: The Spark application adds a forcedreplication key to the objects’ metadata. It does this because Amazon S3 doesn’t allow you to copy in place without changing the object or its metadata.
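
Putting the highlights together, here is a hedged sketch of how the driver might invoke copy_object for every key returned by the query. ObjectCopier is a hypothetical name for the class that holds the method shown above; the actual application in the GitHub repo may structure this differently.

import boto3

def copy_partition(rows):
    # One boto3 resource per partition; ObjectCopier is the hypothetical
    # wrapper whose copy_object method is shown above
    copier = ObjectCopier(boto3.resource('s3'))
    for row in rows:
        # The inventory values carry literal double quotes (as the WHERE
        # clause above suggests), so strip them before calling the API
        bucket = row.bucket.strip('"')
        key = row.key.strip('"')
        result = copier.copy_object(bucket, key, copy_acls=True)
        yield (bucket, key, result['CopyInPlace'], result['LastModified'])

# Fan the copies out across the cluster; the real application writes these
# results to S3 so they can be verified with Athena in the next step
results = crr_failed.rdd.mapPartitions(copy_partition)
print('Objects copied in place:', results.count())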

Verify the success of the EMR job by running a query in Amazon Athena

The Spark application outputs its results to S3. You can create another external table with Amazon Athena and register it with the AWS Glue Data Catalog. You can then query the table with Athena to ensure that the copy-in-place operation was successful.

CREATE EXTERNAL TABLE IF NOT EXISTS
crr_preexisting_demo_results (
  `bucket` string,
  key string,
  replication_status string,
  last_modified string
)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
  STORED AS TEXTFILE
LOCATION 's3://crr-preexisting-demo-inventory/results';

SELECT * FROM crr_preexisting_demo_results;

The results appear as follows on the console.

Although this shows that the copy-in-place operation was successful, CRR still needs to replicate the objects. Subsequent inventory files show the objects’ replication status as COMPLETED. You can also verify on the console that preexisting-*.txt and failed-*.txt are COMPLETED.

It is worth noting that because CRR requires versioned buckets, the copy-in-place operation produces another version of the objects. You can use S3 lifecycle policies to manage noncurrent versions.

Conclusion

In this post, we showed how to use Amazon S3 inventory, Amazon Athena, the AWS Glue Data Catalog, and Amazon EMR to perform copy-in-place operations on pre-existing and failed objects at scale.

Note: Amazon S3 Batch Operations is an alternative way to copy objects. The difference is that S3 Batch Operations does not inspect each object’s existing properties, so it cannot set ACLs, storage class, and encryption on an object-by-object basis the way the Spark application above does. For more information, see Introduction to Amazon S3 Batch Operations in the Amazon S3 Console User Guide.

About the Authors

Michael Sambol is a senior consultant at AWS. He holds an MS in computer science from Georgia Tech. Michael enjoys working out, playing tennis, traveling, and watching Western movies.

Chauncy McCaughey is a senior data architect at AWS. His current side project is using statistical analysis of driving habits and traffic patterns to understand how he always ends up in the slow lane.

Amazon S3 Path Deprecation Plan – The Rest of the Story

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-s3-path-deprecation-plan-the-rest-of-the-story/

Last week we made a fairly quiet (too quiet, in fact) announcement of our plan to slowly and carefully deprecate the path-based access model that is used to specify the address of an object in an S3 bucket. I spent some time talking to the S3 team to get a better understanding of the situation before writing this blog post. Here’s what I learned…

We launched S3 in early 2006. Jeff Bezos’ original spec for S3 was very succinct – he wanted malloc (a key memory allocation function for C programs) for the Internet. From that starting point, S3 has grown to the point where it now stores many trillions of objects and processes millions of requests per second for them. Over the intervening 13 years, we have added many new storage options, features, and security controls to S3.

Old vs. New
S3 currently supports two different addressing models: path-style and virtual-hosted style. Let’s take a quick look at each one. The path-style model looks like either this (the global S3 endpoint):

https://s3.amazonaws.com/jbarr-public/images/ritchie_and_thompson_pdp11.jpeg
https://s3.amazonaws.com/jeffbarr-public/classic_amazon_door_desk.png

Or this (one of the regional S3 endpoints):

https://s3-us-east-2.amazonaws.com/jbarr-public/images/ritchie_and_thompson_pdp11.jpeg
https://s3-us-east-2.amazonaws.com/jeffbarr-public/classic_amazon_door_desk.png

In this example, jbarr-public and jeffbarr-public are bucket names; /images/ritchie_and_thompson_pdp11.jpeg and /classic_amazon_door_desk.png are object keys.

Even though the objects are owned by distinct AWS accounts and are in different S3 buckets (and possibly in distinct AWS regions), both of them are in the DNS subdomain s3.amazonaws.com. Hold that thought while we look at the equivalent virtual-hosted style references (although you might think of these as “new,” they have been around since at least 2010):

https://jbarr-public.s3.amazonaws.com/images/ritchie_and_thompson_pdp11.jpeg
https://jeffbarr-public.s3.amazonaws.com/classic_amazon_door_desk.png

These URLs reference the same objects, but the objects are now in distinct DNS subdomains (jbarr-public.s3.amazonaws.com and jeffbarr-public.s3.amazonaws.com, respectively). The difference is subtle, but very important. When you use a URL to reference an object, DNS resolution is used to map the subdomain name to an IP address. With the path-style model, the subdomain is always s3.amazonaws.com or one of the regional endpoints; with the virtual-hosted style, the subdomain is specific to the bucket. This additional degree of endpoint specificity is the key that opens the door to many important improvements to S3.
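
To make the distinction concrete, here is a tiny illustrative snippet that builds both styles of URL for the same example bucket and key:

bucket = 'jbarr-public'                          # example bucket
key = 'images/ritchie_and_thompson_pdp11.jpeg'   # example key

path_style = f'https://s3.amazonaws.com/{bucket}/{key}'
virtual_hosted = f'https://{bucket}.s3.amazonaws.com/{key}'

# Path-style: DNS always resolves s3.amazonaws.com (or a regional endpoint).
# Virtual-hosted style: DNS resolves a bucket-specific subdomain.
print(path_style)
print(virtual_hosted)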

Out with the Old
In response to feedback on the original deprecation plan that we announced last week, we are making an important change. Here’s the executive summary:

Original Plan – Support for the path-style model ends on September 30, 2020.

Revised Plan – Support for the path-style model continues for buckets created on or before September 30, 2020. Buckets created after that date must be referenced using the virtual-hosted model.

We are moving to virtual-hosted references for two reasons:

First, anticipating a world with billions of buckets homed in many dozens of regions, routing all incoming requests directly to a small set of endpoints makes less and less sense over time. DNS resolution, scaling, security, and traffic management (including DDoS protection) are more challenging with this centralized model. The virtual-hosted model reduces the area of impact (which we call the “blast radius” internally) when problems arise; this helps us to increase availability and performance.

Second, the team has a lot of powerful features in the works, many of which depend on the use of unique, virtual-hosted style subdomains. Moving to this model will allow you to benefit from these new features as soon as they are announced. For example, we are planning to deprecate some of the oldest security ciphers and versions (details to come later). The deprecation process is easier and smoother (for you and for us) if you are using virtual-hosted references.

In With the New
As just one example of what becomes possible when using virtual-hosted references, we are thinking about providing you with increased control over the security configuration (including ciphers and cipher versions) for each bucket. If you have ideas of your own, feel free to get in touch.

Moving Ahead
Here are some things to know about our plans:

Identifying Path-Style References – You can use S3 Access Logs (look for the Host Header field) and AWS CloudTrail Data Events (look for the host element of the requestParameters entry) to identify the applications that are making path-style requests.

Programmatic Access – If your application accesses S3 using one of the AWS SDKs, you don’t need to do anything, other than ensuring that your SDK is current. The SDKs already use virtual-hosted references to S3, except if the bucket name contains one or more “.” characters.
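
If you want to be explicit rather than relying on the SDK default, boto3 lets you force virtual-hosted addressing through the client configuration; a minimal sketch (the bucket name is an example):

import boto3
from botocore.client import Config

# Force virtual-hosted style addressing (the SDKs already prefer this,
# except for bucket names that contain ".")
s3 = boto3.client('s3', config=Config(s3={'addressing_style': 'virtual'}))

response = s3.list_objects_v2(Bucket='jbarr-public', MaxKeys=10)
for obj in response.get('Contents', []):
    print(obj['Key'])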

Bucket Names with Dots – It is important to note that bucket names with “.” characters are perfectly valid for website hosting and other use cases. However, there are some known issues with TLS and with SSL certificates. We are hard at work on a plan to support virtual-host requests to these buckets, and will share the details well ahead of September 30, 2020.

Non-Routable Names – Some characters that are valid in the path component of a URL are not valid as part of a domain name. Also, paths are case-sensitive, but domain and subdomain names are not. We’ve been enforcing more stringent rules for new bucket names since last year. If you have data in a bucket with a non-routable name and you want to switch to virtual-host requests, you can use the new S3 Batch Operations feature to move the data. However, if this is not a viable option, please reach out to AWS Developer Support.

Documentation – We are planning to update the S3 Documentation to encourage all developers to build applications that use virtual-host requests. The Virtual Hosting documentation is a good starting point.

We’re Here to Help
The S3 team has been working with some of our customers to help them to migrate, and they are ready to work with many more.

Our goal is to make this deprecation smooth and uneventful, and we want to help minimize any costs you may incur! Please do not hesitate to reach out to us if you have questions, challenges, or concerns.

Jeff;

PS – Stay tuned for more information on tools and other resources.

New – Amazon S3 Batch Operations

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-amazon-s3-batch-operations/

AWS customers routinely store millions or billions of objects in individual Amazon Simple Storage Service (S3) buckets, taking advantage of S3’s scale, durability, low cost, security, and storage options. These customers store images, videos, log files, backups, and other mission-critical data, and use S3 as a crucial part of their data storage strategy.

Batch Operations
Today, I would like to tell you about Amazon S3 Batch Operations. You can use this new feature to easily process hundreds, millions, or billions of S3 objects in a simple and straightforward fashion. You can copy objects to another bucket, set tags or access control lists (ACLs), initiate a restore from Glacier, or invoke an AWS Lambda function on each one.

This feature builds on S3’s existing support for inventory reports (read my S3 Storage Management Update post to learn more), and can use the reports or CSV files to drive your batch operations. You don’t have to write code, set up any server fleets, or figure out how to partition the work and distribute it to the fleet. Instead, you create a job in minutes with a couple of clicks, turn it loose, and sit back while S3 uses massive, behind-the-scenes parallelism to take care of the work. You can create, monitor, and manage your batch jobs using the S3 Console, the S3 CLI, or the S3 APIs.

A Quick Vocabulary Lesson
Before we get started and create a batch job, let’s review and introduce a couple of important terms:

Bucket – An S3 bucket holds a collection of any number of S3 objects, with optional per-object versioning.

Inventory Report – An S3 inventory report is generated each time a daily or weekly bucket inventory is run. A report can be configured to include all of the objects in a bucket, or to focus on a prefix-delimited subset.

Manifest – A list (either an Inventory Report, or a file in CSV format) that identifies the objects to be processed in the batch job.

Batch Action – The desired action on the objects described by a Manifest. Applying an action to an object constitutes an S3 Batch Task.

IAM Role – An IAM role that provides S3 with permission to read the objects in the inventory report, perform the desired actions, and to write the optional completion report. If you choose Invoke AWS Lambda function as your action, the function’s execution role must grant permission to access the desired AWS services and resources.

Batch Job – References all of the items above. Each job has a status and a priority; higher priority (numerically) jobs take precedence over those with lower priority.

Running a Batch Job
Ok, let’s use the S3 Console to create and run a batch job! In preparation for this blog post I enabled inventory reports for one of my S3 buckets (jbarr-batch-camera) earlier this week, with the reports routed to jbarr-batch-inventory:

I select the desired inventory item, and click Create job from manifest to get started (I can also click Batch operations while browsing my list of buckets). All of the relevant information is already filled in, but I can choose an earlier version of the manifest if I want (this option is only applicable if the manifest is stored in a bucket that has versioning enabled). I click Next to proceed:

I choose my operation (Replace all tags), enter the options that are specific to it (I’ll review the other operations later), and click Next:

I enter a name for my job, set its priority, and request a completion report that encompasses all tasks. Then I choose a bucket for the report and select an IAM Role that grants the necessary permissions (the console also displays a role policy and a trust policy that I can copy and use), and click Next:

Finally, I review my job, and click Create job:

The job enters the Preparing state. S3 Batch Operations checks the manifest and does some other verification, and the job enters the Awaiting your confirmation state (this only happens when I use the console). I select it and click Confirm and run:

I review the confirmation (not shown) to make sure that I understand the action to be performed, and click Run job. The job enters the Ready state, and starts to run shortly thereafter. When it is done it enters the Complete state:

If I were running a job that processed a substantially larger number of objects, I could refresh this page to monitor status. One important thing to know: After the first 1000 objects have been processed, S3 Batch Operations examines and monitors the overall failure rate, and will stop the job if the rate exceeds 50%.
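
If I want to watch progress programmatically rather than refreshing the console, a hedged sketch using the S3 Control API might look like this (the account ID and job ID are placeholders):

import time
import boto3

s3control = boto3.client('s3control')

ACCOUNT_ID = '111122223333'    # placeholder
JOB_ID = 'example-job-id'      # placeholder, returned when the job is created

while True:
    job = s3control.describe_job(AccountId=ACCOUNT_ID, JobId=JOB_ID)['Job']
    progress = job.get('ProgressSummary', {})
    print(job['Status'],
          progress.get('NumberOfTasksSucceeded'),
          progress.get('NumberOfTasksFailed'),
          progress.get('TotalNumberOfTasks'))
    if job['Status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(30)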

The completion report contains one line for each of my objects, and looks like this:

Other Built-In Batch Operations
I don’t have enough space to give you a full run-through of the other built-in batch operations. Here’s an overview:

The PUT copy operation copies my objects, with control of the storage class, encryption, access control list, tags, and metadata:

I can copy objects to the same bucket to change their encryption status. I can also copy them to another region, or to a bucket owned by another AWS account.

The Replace Access Control List (ACL) operation does exactly that, with control over the permissions that are granted:

And the Restore operation initiates an object-level restore from the Glacier or Glacier Deep Archive storage class:

Invoking AWS Lambda Functions
I have saved the most general option for last. I can invoke a Lambda function for each object, and that Lambda function can programmatically analyze and manipulate each object. The Execution Role for the function must trust S3 Batch Operations:

Also, the Role for the Batch job must allow Lambda functions to be invoked.
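
As a rough illustration of that second requirement, the following boto3 sketch creates a role that S3 Batch Operations can assume and grants it permission to invoke the function. The role name, account ID, and function ARN are placeholders, and the real role also needs permissions to read the manifest and write the completion report.

import json
import boto3

iam = boto3.client('iam')

# Trust policy letting S3 Batch Operations assume the batch job role
batch_trust = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': {'Service': 'batchoperations.s3.amazonaws.com'},
        'Action': 'sts:AssumeRole',
    }],
}

iam.create_role(RoleName='s3-batch-invoke-lambda',  # hypothetical name
                AssumeRolePolicyDocument=json.dumps(batch_trust))

# Allow the batch job role to invoke the processing function
invoke_policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Action': 'lambda:InvokeFunction',
        'Resource': 'arn:aws:lambda:us-east-1:111122223333:function:BatchProcessObject',  # placeholder
    }],
}

iam.put_role_policy(RoleName='s3-batch-invoke-lambda',
                    PolicyName='invoke-batch-process-object',
                    PolicyDocument=json.dumps(invoke_policy))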

With the necessary roles in place, I can create a simple function that calls Amazon Rekognition for each image:

import boto3
def lambda_handler(event, context):
    s3Client = boto3.client('s3')
    rekClient = boto3.client('rekognition')
    
    # Parse job parameters
    jobId = event['job']['id']
    invocationId = event['invocationId']
    invocationSchemaVersion = event['invocationSchemaVersion']

    # Process the task
    task = event['tasks'][0]
    taskId = task['taskId']
    s3Key = task['s3Key']
    s3VersionId = task['s3VersionId']
    s3BucketArn = task['s3BucketArn']
    s3Bucket = s3BucketArn.split(':')[-1]
    print('BatchProcessObject(' + s3Bucket + "/" + s3Key + ')')
    resp = rekClient.detect_labels(Image={'S3Object':{'Bucket' : s3Bucket, 'Name' : s3Key}}, MaxLabels=10, MinConfidence=85)
    
    l = [lb['Name'] for lb in resp['Labels']]
    print(s3Key + ' - Detected:' + str(sorted(l)))

    results = [{
        'taskId': taskId,
        'resultCode': 'Succeeded',
        'resultString': 'Succeeded'
    }]
    
    return {
        'invocationSchemaVersion': invocationSchemaVersion,
        'treatMissingKeysAs': 'PermanentFailure',
        'invocationId': invocationId,
        'results': results
    }

With my function in place, I select Invoke AWS Lambda function as my operation when I create my job, and choose my BatchProcessObject function:

Then I create and confirm my job as usual. The function will be invoked for each object, taking advantage of Lambda’s ability to scale and allowing this moderately-sized job to run to completion in less than a minute:

I can find the “Detected” messages in the CloudWatch Logs Console:

As you can see from my very simple example, the ability to easily run Lambda functions on large numbers of S3 objects opens the door to all sorts of interesting applications.

Things to Know
I am looking forward to seeing and hearing about the use cases that you discover for S3 Batch Operations! Before I wrap up, here are some final thoughts:

Job Cloning – You can clone an existing job, fine-tune the parameters, and resubmit it as a fresh job. You can use this to re-run a failed job or to make any necessary adjustments.

Programmatic Job Creation – You could attach a Lambda function to the bucket where you generate your inventory reports and create a fresh batch job each time a report arrives. Jobs that are created programmatically do not need to be confirmed, and are immediately ready to execute.
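
As a rough sketch of what that programmatic job creation could look like with the S3 Control API (every name, ARN, and the manifest ETag below is a placeholder):

import uuid
import boto3

s3control = boto3.client('s3control')

response = s3control.create_job(
    AccountId='111122223333',                    # placeholder account ID
    ConfirmationRequired=False,                  # programmatic jobs run without confirmation
    ClientRequestToken=str(uuid.uuid4()),
    Priority=10,
    RoleArn='arn:aws:iam::111122223333:role/s3-batch-role',  # placeholder
    Operation={
        'S3PutObjectTagging': {
            'TagSet': [{'Key': 'processed', 'Value': 'true'}]
        }
    },
    Manifest={
        'Spec': {
            'Format': 'S3BatchOperations_CSV_20180820',
            'Fields': ['Bucket', 'Key'],
        },
        'Location': {
            'ObjectArn': 'arn:aws:s3:::jbarr-batch-inventory/manifest.csv',  # placeholder
            'ETag': 'example-etag',                                          # placeholder
        },
    },
    Report={
        'Bucket': 'arn:aws:s3:::jbarr-batch-inventory',   # placeholder report bucket
        'Format': 'Report_CSV_20180820',
        'Enabled': True,
        'Prefix': 'completion-reports',
        'ReportScope': 'AllTasks',
    },
)
print('Created job:', response['JobId'])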

CSV Object Lists – If you need to process a subset of the objects in a bucket and cannot use a common prefix to identify them, you can create a CSV file and use it to drive your job. You could start from an inventory report and filter the objects based on name or by checking them against a database or other reference. For example, perhaps you use Amazon Comprehend to perform sentiment analysis on all of your stored documents. You can process inventory reports to find documents that have not yet been analyzed and add them to a CSV file.

Job Priorities – You can have multiple jobs active at once in each AWS region. Your jobs with a higher priority take precedence, and can cause existing jobs to be paused momentarily. You can select an active job and click Update priority in order to make changes on the fly:

Learn More
Here are some resources to help you learn more about S3 Batch Operations:

Documentation – Read about Creating a Job, Batch Operations, and Managing Batch Operations Jobs.

Tutorial Videos – Check out the S3 Batch Operations Video Tutorials to learn how to Create a Job, Manage and Track a Job, and to Grant Permissions.

Now Available
You can start using S3 Batch Operations in all commercial AWS regions except Asia Pacific (Osaka) today. S3 Batch Operations is also available in both of the AWS GovCloud (US) regions.

Jeff;

New Amazon S3 Storage Class – Glacier Deep Archive

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/new-amazon-s3-storage-class-glacier-deep-archive/

Many AWS customers collect and store large volumes (often a petabyte or more) of important data but seldom access it. In some cases raw data is collected and immediately processed, then stored for years or decades just in case there’s a need for further processing or analysis. In other cases, the data is retained for compliance or auditing purposes. Here are some of the industries and use cases that fit this description:

Financial – Transaction archives, activity & audit logs, and communication logs.

Health Care / Life Sciences – Electronic medical records, images (X-Ray, MRI, or CT), genome sequences, records of pharmaceutical development.

Media & Entertainment – Media archives and raw production footage.

Physical Security – Raw camera footage.

Online Advertising – Clickstreams and ad delivery logs.

Transportation – Vehicle telemetry, video, RADAR, and LIDAR data.

Science / Research / Education – Research input and results, including data relevant to seismic tests for oil & gas exploration.

Today we are introducing a new and even more cost-effective way to store important, infrequently accessed data in Amazon S3.

Amazon S3 Glacier Deep Archive Storage Class
The new Glacier Deep Archive storage class is designed to provide durable and secure long-term storage for large amounts of data at a price that is competitive with off-premises tape archival services. Data is stored across 3 or more AWS Availability Zones and can be retrieved in 12 hours or less. You no longer need to deal with expensive and finicky tape drives, arrange for off-premises storage, or worry about migrating data to newer generations of media.

Your existing S3-compatible applications, tools, code, scripts, and lifecycle rules can all take advantage of Glacier Deep Archive storage. You can specify the new storage class when you upload objects, alter the storage class of existing objects manually or programmatically, or use lifecycle rules to arrange for migration based on object age. You can also make use of other S3 features such as Storage Class Analysis, Object Tagging, Object Lock, and Cross-Region Replication.

The existing S3 Glacier storage class allows you to access your data in minutes (using expedited retrieval) and is a good fit for data that requires faster access. To learn more about the entire range of options, read Storage Classes in the S3 Developer Guide. If you are already making use of the Glacier storage class and rarely access your data, you can switch to Deep Archive and begin to see cost savings right away.

Using Glacier Deep Archive Storage – Console
I can switch the storage class of an existing S3 object to Glacier Deep Archive using the S3 Console. I locate the file and click Properties:

Then I click Storage class:

Next, I select Glacier Deep Archive and click Save:

I cannot download the object or edit any of its properties or permissions after I make this change:

In the unlikely event that I need to access this 2013-era video, I select it and choose Restore from the Actions menu:

Then I specify the number of days to keep the restored copy available, and choose either bulk or standard retrieval:

Using Glacier Deep Archive Storage – Lifecycle Rules
I can also use S3 lifecycle rules. I select the bucket and click Management, then select Lifecycle:

Then I click Add lifecycle rule and create my rule. I enter a name (ArchiveOldMovies), and can optionally use a path or tag filter to limit the scope of the rule:

Next, I indicate that I want the rule to apply to the Current version of my objects, and specify that I want my objects to transition to Glacier Deep Archive 30 days after they are created:
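
If you prefer to define the same rule programmatically rather than in the console, a minimal boto3 sketch might look like this (the bucket name and prefix filter are hypothetical):

import boto3

s3 = boto3.client('s3')

s3.put_bucket_lifecycle_configuration(
    Bucket='awsroadtrip-videos-raw',      # hypothetical bucket
    LifecycleConfiguration={
        'Rules': [{
            'ID': 'ArchiveOldMovies',
            'Filter': {'Prefix': 'raw/'},  # optional scope, as in the console rule
            'Status': 'Enabled',
            'Transitions': [{
                'Days': 30,
                'StorageClass': 'DEEP_ARCHIVE',
            }],
        }],
    },
)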

Using Glacier Deep Archive – CLI / Programmatic Access
I can use the CLI to upload a new object and set the storage class:

$ aws s3 cp new.mov s3://awsroadtrip-videos-raw/ --storage-class DEEP_ARCHIVE

I can also change the storage class of an existing object by copying it over itself:

$ aws s3 cp s3://awsroadtrip-videos-raw/new.mov s3://awsroadtrip-videos-raw/new.mov --storage-class DEEP_ARCHIVE
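
The same operations are available through the SDKs; here is a short boto3 sketch that uploads an object directly into Deep Archive and later requests a temporary restored copy (the bucket and key reuse the example above, and the retrieval tier and duration are illustrative):

import boto3

s3 = boto3.client('s3')

# Upload directly into the Glacier Deep Archive storage class
s3.upload_file('new.mov', 'awsroadtrip-videos-raw', 'new.mov',
               ExtraArgs={'StorageClass': 'DEEP_ARCHIVE'})

# Later, request a restored copy kept available for 7 days, using bulk retrieval
s3.restore_object(
    Bucket='awsroadtrip-videos-raw',
    Key='new.mov',
    RestoreRequest={
        'Days': 7,
        'GlacierJobParameters': {'Tier': 'Bulk'},
    },
)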

If I am building a system that manages archiving and restoration, I can opt to receive notifications on an SNS topic, an SQS queue, or a Lambda function when a restore is initiated and/or completed:

Other Access Methods
You can also use the Tape Gateway configuration of AWS Storage Gateway to create a Virtual Tape Library (VTL) and configure it to use Glacier Deep Archive for storage of archived virtual tapes. This allows you to move your existing tape-based backups to the AWS Cloud without making any changes to your existing backup workflows. You can retrieve virtual tapes archived in Glacier Deep Archive to S3 within twelve hours. With Tape Gateway and S3 Glacier Deep Archive, you no longer need on-premises physical tape libraries, and you don’t need to manage hardware refreshes and rewrite data to new physical tapes as technologies evolve. For more information, visit the Test Your Gateway Setup with Backup Software page of the Storage Gateway User Guide.

Now Available
The S3 Glacier Deep Archive storage class is available today in all commercial regions and in both AWS GovCloud regions. Pricing varies by region, and the storage cost is up to 75% less than for the existing S3 Glacier storage class; visit the S3 Pricing page for more information.

Jeff;

Improve Apache Spark write performance on Apache Parquet formats with the EMRFS S3-optimized committer

Post Syndicated from Peter Slawski original https://aws.amazon.com/blogs/big-data/improve-apache-spark-write-performance-on-apache-parquet-formats-with-the-emrfs-s3-optimized-committer/

The EMRFS S3-optimized committer is a new output committer available for use with Apache Spark jobs as of Amazon EMR 5.19.0. This committer improves performance when writing Apache Parquet files to Amazon S3 using the EMR File System (EMRFS). In this post, we run a performance benchmark to compare this new optimized committer with existing committer algorithms, namely FileOutputCommitter algorithm versions 1 and 2. We close with a discussion on current limitations for the new committer, providing workarounds where possible.

Comparison with FileOutputCommitter

In Amazon EMR version 5.19.0 and earlier, Spark jobs that write Parquet to Amazon S3 use a Hadoop commit algorithm called FileOutputCommitter by default. There are two versions of this algorithm, version 1 and 2. Both versions rely on writing intermediate task output to temporary locations. They subsequently perform rename operations to make the data visible at task or job completion time.

Algorithm version 1 has two phases of rename: one to commit the individual task output, and the other to commit the overall job output from completed/successful tasks. Algorithm version 2 is more efficient because task commits rename files directly to the final output location. This eliminates the second rename phase, but it makes partial data visible before the job completes, which not all workloads can tolerate.

The renames that are performed are fast, metadata-only operations on the Hadoop Distributed File System (HDFS). However, when output is written to object stores such as Amazon S3, renames are implemented by copying data to the target and then deleting the source. This rename “penalty” is exacerbated with directory renames, which can happen in both phases of FileOutputCommitter v1. Whereas these are single metadata-only operations on HDFS, committers must execute N copy-and-delete operations on S3.

To partially mitigate this, Amazon EMR 5.14.0+ defaults to FileOutputCommitter v2 when writing Parquet data to S3 with EMRFS in Spark. The new EMRFS S3-optimized committer improves on that work to avoid rename operations altogether by using the transactional properties of Amazon S3 multipart uploads. Tasks may then write their data directly to the final output location, but defer completion of each output file until task commit time.

Performance test

We evaluated the write performance of the different committers by executing the following INSERT OVERWRITE Spark SQL query. The SELECT * FROM range(…) clause generated data at execution time. This produced ~15 GB of data across exactly 100 Parquet files in Amazon S3.

SET rows=4e9; -- 4 Billion
SET partitions=100;

INSERT OVERWRITE DIRECTORY 's3://${bucket}/perf-test/${trial_id}'
USING PARQUET SELECT * FROM range(0, ${rows}, 1, ${partitions});

Note: The EMR cluster ran in the same AWS Region as the S3 bucket. The trial_id property used a UUID generator to ensure that there was no conflict between test runs.

We executed our test on an EMR cluster created with the emr-5.19.0 release label, with a single m5d.2xlarge instance in the master group, and eight m5d.2xlarge instances in the core group. We used the default Spark configuration properties set by Amazon EMR for this cluster configuration, which include the following:

spark.dynamicAllocation.enabled true
spark.executor.memory 11168M
spark.executor.cores 4

After running 10 trials for each committer, we captured and summarized query execution times in the following chart. Whereas FileOutputCommitter v2 averaged 49 seconds, the EMRFS S3-optimized committer averaged only 31 seconds—a 1.6x speedup.

As mentioned earlier, FileOutputCommitter v2 eliminates some, but not all, rename operations that FileOutputCommitter v1 uses. To illustrate the full performance impact of renames against S3, we reran the test using FileOutputCommitter v1. In this scenario, we observed an average runtime of 450 seconds, which is 14.5x slower than the EMRFS S3-optimized committer.

The last scenario we evaluated is the case when EMRFS consistent view is enabled, which addresses issues that can arise due to the Amazon S3 data consistency model. In this mode, the EMRFS S3-optimized committer time was unaffected by this change and still averaged 30 seconds. On the other hand, FileOutputCommitter v2 averaged 53 seconds, which was slower than when the consistent view feature was turned off, widening the overall performance difference to 1.8x.

Job correctness

The EMRFS S3-optimized committer has the same limitations that FileOutputCommitter v2 has because both improve performance by fully delegating commit responsibilities to the individual tasks. The following is a discussion of the notable consequences of this design choice.

Partial results from incomplete or failed jobs

Because both committers have their tasks write to the final output location, concurrent readers of that output location can view partial results when using either of them. If a job fails, partial results are left behind from any tasks that have committed before the overall job failed. This situation can lead to duplicate output if the job is run again without first cleaning up the output location.

One way to mitigate this issue is to ensure that a job uses a different output location each time it runs, publishing the location to downstream readers only if the job succeeds. The following code block is an example of this strategy for workloads that use Hive tables. Notice how output_location is set to a unique value each time the job is run, and that the table partition is registered only if the rest of the query succeeds. As long as readers exclusively access data via the table abstraction, they cannot see results before the job finishes.

SET attempt_id=<a random UUID>;
SET output_location=s3://bucket/${attempt_id};

INSERT OVERWRITE DIRECTORY '${output_location}'
USING PARQUET SELECT * FROM input;

ALTER TABLE output ADD PARTITION (dt = '2018-11-26')
LOCATION '${output_location}';

This approach requires treating the locations that partitions point to as immutable. Updates to partition contents require restating all results into a new location in S3, and then updating the partition metadata to point to that new location.

Duplicate results from non-idempotent tasks

Another scenario that can cause both committers to produce incorrect results is when jobs composed of non-idempotent tasks produce outputs into non-deterministic locations for each task attempt.

The following is an example of a query that illustrates the issue. It uses a timestamp-based table partitioning scheme to ensure that it writes to a different location for each task attempt.

SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

INSERT INTO data PARTITION (time) SELECT 42, current_timestamp();

You can avoid the issue of duplicate results in this scenario by ensuring that tasks write to a consistent location across task attempts. For example, instead of calling functions that return the current timestamp within tasks, consider providing the current timestamp as an input to the job. Similarly, if a random number generator is used within jobs, consider using a fixed seed or one that is based on the task’s partition number, so that task reattempts use the same value.

Note: Spark’s built-in random functions rand(), randn(), and uuid() are already designed with this in mind.

Enabling the EMRFS S3-optimized committer

Starting with Amazon EMR version 5.20.0, the EMRFS S3-optimized committer is enabled by default. In Amazon EMR version 5.19.0, you can enable the committer by setting the spark.sql.parquet.fs.optimized.committer.optimization-enabled property to true from within Spark or when creating clusters. The committer takes effect when you use Spark’s built-in Parquet support to write Parquet files into Amazon S3 with EMRFS. This includes using the Parquet data source with Spark SQL, DataFrames, or Datasets. However, there are some use cases when the EMRFS S3-optimized committer does not take effect, and some use cases where Spark performs its own renames entirely outside of the committer. For more information about the committer and about these special cases, see Using the EMRFS S3-optimized Committer in the Amazon EMR Release Guide.
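
For reference, here is a minimal PySpark sketch that sets the property on an Amazon EMR 5.19.0 cluster and writes Parquet to S3 through EMRFS (the output path is a placeholder); on 5.20.0 and later the setting is already the default:

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName('emrfs-optimized-committer-demo')
         .config('spark.sql.parquet.fs.optimized.committer.optimization-enabled', 'true')
         .getOrCreate())

# Any Parquet write through EMRFS to S3 now uses the optimized committer
df = spark.range(0, 1000)
df.write.mode('overwrite').parquet('s3://your-bucket/perf-test/demo/')  # placeholder path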

Related Work – S3A Committers

The EMRFS S3-optimized committer was inspired by concepts used by committers that support the S3A file system. The key take-away is that these committers use the transactional nature of S3 multipart uploads to eliminate some or all of the rename costs. This is also the core concept used by the EMRFS S3-optimized committer.

For more information about the various committers available within the ecosystem, including those that support the S3A file system, see the official Apache Hadoop documentation.

Summary

The EMRFS S3-optimized committer improves write performance compared to FileOutputCommitter. Starting with Amazon EMR version 5.19.0, you can use it with Spark’s built-in Parquet support. For more information, see Using the EMRFS S3-optimized Committer in the Amazon EMR Release Guide.

About the authors

Peter Slawski is a software development engineer with Amazon Web Services.

Jonathan Kelly is a senior software development engineer with Amazon Web Services.