
The anatomy of a ransomware event targeting data in Amazon S3

Post Syndicated from Megan O'Neil original https://aws.amazon.com/blogs/security/anatomy-of-a-ransomware-event-targeting-data-in-amazon-s3/

Ransomware events have significantly increased over the past several years and captured worldwide attention. Traditional ransomware events affect mostly infrastructure resources like servers, databases, and connected file systems. However, there are also non-traditional events that you may not be as familiar with, such as ransomware events that target data stored in Amazon Simple Storage Service (Amazon S3). There are important steps you can take to help prevent these events, and to identify possible ransomware events early so that you can take action to recover. The goal of this post is to help you learn about the AWS services and features that you can use to protect against ransomware events in your environment, and to investigate possible ransomware events if they occur.

Ransomware is a type of malware that bad actors can use to extort money from entities. The actors can use a range of tactics to gain unauthorized access to their target’s data and systems, including but not limited to taking advantage of unpatched software flaws, misuse of weak credentials or previous unintended disclosure of credentials, and using social engineering. In a ransomware event, a legitimate entity’s access to their data and systems is restricted by the bad actors, and a ransom demand is made for the safe return of these digital assets. There are several methods actors use to restrict or disable authorized access to resources including a) encryption or deletion, b) modified access controls, and c) network-based Denial of Service (DoS) attacks. In some cases, after the target’s data access is restored by providing the encryption key or transferring the data back, bad actors who retained a copy of the data demand a second ransom in exchange for a promise not to sell or publicly release it.

In the next sections, we’ll describe several important stages of your response to a ransomware event in Amazon S3, including detection, response, recovery, and protection.

Observable activity

The most common event that leads to a ransomware event that targets data in Amazon S3, as observed by the AWS Customer Incident Response Team (CIRT), is unintended disclosure of AWS Identity and Access Management (IAM) access keys. Another likely cause is an application with a software flaw that is hosted on an Amazon Elastic Compute Cloud (Amazon EC2) instance with an attached IAM instance profile and associated permissions, where the instance is using Instance Metadata Service Version 1 (IMDSv1). In this case, an unauthorized user might be able to use AWS Security Token Service (AWS STS) session keys from the IAM instance profile for your EC2 instance to ransom objects in S3 buckets. In this post, we will focus on the most common scenario, which is unintended disclosure of static IAM access keys.

Detection

After a bad actor has obtained credentials, they iterate through AWS API actions to discover the type of access that the exposed IAM principal has been granted. Bad actors can do this in multiple ways, which can generate different levels of activity. This activity might alert your security teams because of an increase in API calls that result in errors. Other times, if a bad actor’s goal is to ransom S3 objects, the API calls will be specific to Amazon S3. If access to Amazon S3 is permitted through the exposed IAM principal, then you might see an increase in API actions such as s3:ListBuckets, s3:GetBucketLocation, s3:GetBucketPolicy, and s3:GetBucketAcl.
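If CloudTrail is enabled (it records management events by default), you can review recent S3 control-plane calls directly. The following is a minimal boto3 sketch that lists S3 API calls from the past 24 hours through the CloudTrail LookupEvents API; the 24-hour window is only an example, and object-level actions appear here only if you have enabled data events.

import boto3
from datetime import datetime, timedelta

# List Amazon S3 management calls recorded in CloudTrail event history.
cloudtrail = boto3.client("cloudtrail")
pages = cloudtrail.get_paginator("lookup_events").paginate(
    LookupAttributes=[
        {"AttributeKey": "EventSource", "AttributeValue": "s3.amazonaws.com"}
    ],
    StartTime=datetime.utcnow() - timedelta(hours=24),
    EndTime=datetime.utcnow(),
)
for page in pages:
    for event in page["Events"]:
        print(event["EventTime"], event["EventName"], event.get("Username", "-"))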

Analysis

In this section, we’ll describe where to find the log and metric data to help you analyze this type of ransomware event in more detail.

When a ransomware event targets data stored in Amazon S3, often the objects stored in S3 buckets are deleted, without the bad actor making copies. This is more like a data destruction event than a ransomware event where objects are encrypted.

There are several logs that will capture this activity. You can enable AWS CloudTrail event logging for Amazon S3 data, which allows you to review the activity logs to understand read and delete actions that were taken on specific objects.

In addition, if you have enabled Amazon CloudWatch metrics for Amazon S3 prior to the ransomware event, you can use the sum of the BytesDownloaded metric to gain insight into abnormal transfer spikes.
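As a sketch of what that looks like in practice, the following boto3 call sums BytesDownloaded per hour over the past day. It assumes a request metrics filter (named EntireBucket here) was configured on the bucket before the event; the bucket name is a placeholder.

import boto3
from datetime import datetime, timedelta

cloudwatch = boto3.client("cloudwatch")
# Hourly sums of bytes downloaded from the bucket over the past 24 hours.
response = cloudwatch.get_metric_statistics(
    Namespace="AWS/S3",
    MetricName="BytesDownloaded",
    Dimensions=[
        {"Name": "BucketName", "Value": "amzn-s3-demo-bucket"},
        {"Name": "FilterId", "Value": "EntireBucket"},
    ],
    StartTime=datetime.utcnow() - timedelta(days=1),
    EndTime=datetime.utcnow(),
    Period=3600,
    Statistics=["Sum"],
)
for point in sorted(response["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], int(point["Sum"]))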

Another way to gain information is to use the region-DataTransfer-Out-Bytes metric, which shows the amount of data transferred from Amazon S3 to the internet. This metric is enabled by default and is associated with your AWS billing and usage reports for Amazon S3.

For more information, see the AWS CIRT team’s Incident Response Playbook: Ransom Response for S3, as well as the other publicly available response frameworks available at the AWS customer playbooks GitHub repository.

Response

Next, we’ll walk through how to respond to the unintended disclosure of IAM access keys. Based on the business impact, you may decide to create a second set of access keys to replace all legitimate use of the compromised credentials, so that legitimate systems are not interrupted when you deactivate them. You can deactivate the access keys by using the IAM console or through automation, as defined in your incident response plan. You should also document specific details about the event in your secure, private incident response documentation so that you can reference them later. If the activity was related to the use of an IAM role or temporary credentials, you need to take an additional step and revoke any active sessions: in the IAM console, choose the Revoke active sessions button, which attaches a policy that denies access to users who assumed the role before that moment. Then you can delete the exposed access keys.
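The following is a minimal boto3 sketch of those two actions; the user name, role name, key ID, and cutoff timestamp are placeholders. The deny policy mirrors the AWSRevokeOlderSessions policy that the console attaches.

import json
import boto3

iam = boto3.client("iam")

# Deactivate (rather than delete) the exposed key so it can still be
# referenced during the investigation.
iam.update_access_key(
    UserName="compromised-user",
    AccessKeyId="AKIAIOSFODNN7EXAMPLE",
    Status="Inactive",
)

# For a compromised role, deny requests whose temporary credentials were
# issued before the cutoff time, which revokes existing sessions.
revoke_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Deny",
        "Action": "*",
        "Resource": "*",
        "Condition": {"DateLessThan": {"aws:TokenIssueTime": "2022-11-04T00:00:00Z"}},
    }],
}
iam.put_role_policy(
    RoleName="compromised-role",
    PolicyName="AWSRevokeOlderSessions",
    PolicyDocument=json.dumps(revoke_policy),
)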

In addition, you can use the AWS CloudTrail dashboard and event history (which includes 90 days of logs) to review the IAM-related activities by the compromised IAM user or role. Your analysis can reveal persistent access that the bad actor might have created. You can also use the IAM console to review the IAM credential report (updated every 4 hours) for activity such as access key last used, user creation time, and password last used. Alternatively, you can use Amazon Athena to query the CloudTrail logs for the same information. The following example Athena query takes an IAM user Amazon Resource Name (ARN) and shows activity for a particular time frame.

SELECT eventtime, eventname, awsregion, sourceipaddress, useragent
FROM cloudtrail
WHERE useridentity.arn = 'arn:aws:iam::111122223333:user/Name' AND
-- Enter timeframe
(event_date >= '2022/08/04' AND event_date <= '2022/11/04')
ORDER BY eventtime ASC

Recovery

After you’ve removed access from the bad actor, you have multiple options to recover data, which we discuss in the following sections. Keep in mind that there is currently no undelete capability for Amazon S3, and AWS does not have the ability to recover data after a delete operation. In addition, many of the recovery options require configuration upon bucket creation.

S3 Versioning

Using versioning in S3 buckets is a way to keep multiple versions of an object in the same bucket, which gives you the ability to restore a particular version during the recovery process. You can use the S3 Versioning feature to preserve, retrieve, and restore every version of every object stored in your buckets. With versioning, you can recover more easily from both unintended user actions and application failures. Versioning-enabled buckets can help you recover objects from accidental deletion or overwrite. For example, if you delete an object, Amazon S3 inserts a delete marker instead of removing the object permanently. The previous version remains in the bucket and becomes a noncurrent version. You can restore the previous version. Versioning is not enabled by default and incurs additional costs, because you are maintaining multiple copies of the same object. For more information about cost, see the Amazon S3 pricing page.
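To illustrate, here is a brief boto3 sketch that enables versioning on a bucket and then restores a deleted object by removing its delete marker; the bucket and key names are examples.

import boto3

s3 = boto3.client("s3")
bucket = "amzn-s3-demo-bucket"

# Enable versioning; from this point on, deletes create delete markers
# instead of permanently removing objects.
s3.put_bucket_versioning(
    Bucket=bucket,
    VersioningConfiguration={"Status": "Enabled"},
)

# Restore a deleted object by removing its delete marker; the most recent
# noncurrent version becomes the current version again.
versions = s3.list_object_versions(Bucket=bucket, Prefix="reports/q3.csv")
for marker in versions.get("DeleteMarkers", []):
    if marker["IsLatest"]:
        s3.delete_object(Bucket=bucket, Key=marker["Key"], VersionId=marker["VersionId"])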

AWS Backup

Using AWS Backup gives you the ability to create and maintain separate copies of your S3 data under separate access credentials that can be used to restore data during a recovery process. AWS Backup provides centralized backup for several AWS services, so you can manage your backups in one location. AWS Backup for Amazon S3 provides you with two options: continuous backups, which allow you to restore to any point in time within the last 35 days; and periodic backups, which allow you to retain data for a specified duration, including indefinitely. For more information, see Using AWS Backup for Amazon S3.
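As a rough sketch, you can start an on-demand S3 backup job with boto3 as follows; the vault name, bucket, and role ARN are placeholders, and the role needs the AWS Backup permissions for Amazon S3.

import boto3

backup = boto3.client("backup")
# Start an on-demand backup of an S3 bucket into a dedicated vault.
backup.start_backup_job(
    BackupVaultName="s3-recovery-vault",
    ResourceArn="arn:aws:s3:::amzn-s3-demo-bucket",
    IamRoleArn="arn:aws:iam::111122223333:role/backup-service-role",
)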

Protection

In this section, we’ll describe some of the preventative security controls available in AWS.

S3 Object Lock

You can add another layer of protection against object changes and deletion by enabling S3 Object Lock for your S3 buckets. With S3 Object Lock, you can store objects using a write-once-read-many (WORM) model, which can help prevent objects from being deleted or overwritten for a fixed amount of time or indefinitely.
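Keep in mind that Object Lock can only be enabled when a bucket is created. The following boto3 sketch creates a new bucket with Object Lock and applies a 30-day default retention in compliance mode; the bucket name and retention values are examples.

import boto3

s3 = boto3.client("s3")

# Object Lock must be enabled at bucket creation time.
s3.create_bucket(
    Bucket="amzn-s3-demo-locked-bucket",
    ObjectLockEnabledForBucket=True,
)

# Default retention: objects can't be deleted or overwritten for 30 days.
# In COMPLIANCE mode, the retention period can't be shortened or removed.
s3.put_object_lock_configuration(
    Bucket="amzn-s3-demo-locked-bucket",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Days": 30}},
    },
)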

AWS Backup Vault Lock

Similar to S3 Object Lock, which adds protection to S3 objects, if you use AWS Backup you can consider enabling AWS Backup Vault Lock, which enforces the same WORM setting for all the backups you store and create in a backup vault. AWS Backup Vault Lock helps you to prevent inadvertent or malicious delete operations by the AWS account root user.
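A minimal sketch of locking a vault with boto3 follows; the vault name and retention values are illustrative. ChangeableForDays gives you a grace period before the lock becomes immutable.

import boto3

backup = boto3.client("backup")
# Prevent recovery points in the vault from being deleted before 35 days.
backup.put_backup_vault_lock_configuration(
    BackupVaultName="s3-recovery-vault",
    MinRetentionDays=35,
    ChangeableForDays=3,
)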

Amazon S3 Inventory

To make sure that your organization understands the sensitivity of the objects you store in Amazon S3, you should inventory your most critical and sensitive data across Amazon S3 and make sure that the appropriate bucket configuration is in place to protect and enable recovery of your data. You can use Amazon S3 Inventory to understand what objects are in your S3 buckets, and the existing configurations, including encryption status, replication status, and object lock information. You can use resource tags to label the classification and owner of the objects in Amazon S3, and take automated action and apply controls that match the sensitivity of the objects stored in a particular S3 bucket.

MFA delete

Another preventative control you can use is to enforce multi-factor authentication (MFA) delete in S3 Versioning. MFA delete provides added security and can help prevent accidental bucket deletions, by requiring the user who initiates the delete action to prove physical or virtual possession of an MFA device with an MFA code. This adds an extra layer of friction and security to the delete action.
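MFA delete can only be configured by the bucket owner using root-account credentials. The following boto3 sketch assumes such a session; the MFA argument combines the device serial (or ARN) and the current code, and all names are placeholders.

import boto3

s3 = boto3.client("s3")
s3.put_bucket_versioning(
    Bucket="amzn-s3-demo-bucket",
    # Device ARN and current MFA code, separated by a space.
    MFA="arn:aws:iam::111122223333:mfa/root-account-mfa-device 123456",
    VersioningConfiguration={"Status": "Enabled", "MFADelete": "Enabled"},
)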

Use IAM roles for short-term credentials

Because many ransomware events arise from unintended disclosure of static IAM access keys, AWS recommends that you use IAM roles that provide short-term credentials, rather than using long-term IAM access keys. This includes using identity federation for your developers who are accessing AWS, using IAM roles for system-to-system access, and using IAM Roles Anywhere for hybrid access. For most use cases, you shouldn’t need static, long-term access keys at all. Now is a good time to audit and work toward eliminating the use of these keys in your environment. Consider taking the following steps (step 1 is sketched in code after the list):

  1. Create an inventory across all of your AWS accounts and identify the IAM user, when the credentials were last rotated and last used, and the attached policy.
  2. Disable and delete all AWS account root access keys.
  3. Rotate the credentials and apply MFA to the user.
  4. Re-architect to take advantage of temporary role-based access, such as IAM roles or IAM Roles Anywhere.
  5. Review attached policies to make sure that you’re enforcing least privilege access, including removing wildcards from the policy.
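As a sketch of step 1, the following boto3 snippet generates and parses the IAM credential report for one account; the column names come from the credential report format.

import csv
import io
import time
import boto3

iam = boto3.client("iam")

# Request the credential report and wait for it to finish generating.
while iam.generate_credential_report()["State"] != "COMPLETE":
    time.sleep(2)

report = iam.get_credential_report()["Content"].decode("utf-8")
for row in csv.DictReader(io.StringIO(report)):
    print(
        row["user"],
        row["access_key_1_active"],
        row["access_key_1_last_rotated"],
        row["access_key_1_last_used_date"],
    )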

Server-side encryption with customer managed KMS keys

Another protection you can use is to implement server-side encryption with AWS Key Management Service (SSE-KMS) and use customer managed keys to encrypt your S3 objects. Using a customer managed key requires you to apply a specific key policy around who can encrypt and decrypt the data within your bucket, which provides an additional access control mechanism to protect your data. You can also centrally manage AWS KMS keys and audit their usage with an audit trail of when the key was used and by whom.
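For example, the following boto3 sketch sets a bucket’s default encryption to SSE-KMS with a customer managed key; the bucket name and key ARN are placeholders, and the key policy on the KMS key independently controls who can use it to decrypt.

import boto3

s3 = boto3.client("s3")
s3.put_bucket_encryption(
    Bucket="amzn-s3-demo-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [{
            "ApplyServerSideEncryptionByDefault": {
                "SSEAlgorithm": "aws:kms",
                "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab",
            },
            # Reduce KMS request costs for high-volume buckets.
            "BucketKeyEnabled": True,
        }]
    },
)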

GuardDuty protections for Amazon S3

You can enable Amazon S3 protection in Amazon GuardDuty. With S3 protection, GuardDuty monitors object-level API operations to identify potential security risks for data in your S3 buckets. This includes findings for anomalous API activity and unusual behavior involving your data in Amazon S3, and it can help you identify a security event early.
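If you already run a GuardDuty detector, a short boto3 sketch to enable S3 protection on it might look like the following (assuming a single detector in the Region).

import boto3

guardduty = boto3.client("guardduty")
detector_id = guardduty.list_detectors()["DetectorIds"][0]
# Turn on S3 protection for the existing detector.
guardduty.update_detector(
    DetectorId=detector_id,
    DataSources={"S3Logs": {"Enable": True}},
)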

Conclusion

In this post, you learned about ransomware events that target data stored in Amazon S3. By taking proactive steps, you can identify potential ransomware events quickly, and you can put in place additional protections to help you reduce the risk of this type of security event in the future.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Security, Identity and Compliance re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Authors

Megan O’Neil

Megan is a Principal Specialist Solutions Architect focused on threat detection and incident response. Megan and her team enable AWS customers to implement sophisticated, scalable, and secure solutions that solve their business challenges.

Karthik Ram


Karthik is a Senior Solutions Architect with Amazon Web Services based in Columbus, Ohio. He has a background in IT networking, infrastructure architecture, and security. At AWS, Karthik helps customers build secure and innovative cloud solutions, solving their business problems using data-driven approaches. Karthik’s area of depth is cloud security, with a focus on threat detection and incident response (TDIR).

Kyle Dickinson


Kyle is a Sr. Security Solutions Architect, specializing in threat detection and incident response. He focuses on working with customers to respond to security events with confidence. He also hosts AWS on Air: Lockdown, a livestream security show. When he’s not working, he enjoys hockey, BBQ, and trying to convince his Shih Tzu that he’s in fact not a large dog.

Analyze Amazon S3 storage costs using AWS Cost and Usage Reports, Amazon S3 Inventory, and Amazon Athena

Post Syndicated from Dagar Katyal original https://aws.amazon.com/blogs/big-data/analyze-amazon-s3-storage-costs-using-aws-cost-and-usage-reports-amazon-s3-inventory-and-amazon-athena/

Since its launch in 2006, Amazon Simple Storage Service (Amazon S3) has experienced major growth, supporting multiple use cases such as hosting websites, creating data lakes, serving as object storage for consumer applications, storing logs, and archiving data. As the application portfolio grows, customers tend to store data from multiple applications and different business functions in a single S3 bucket, which can grow the storage in S3 buckets to hundreds of TBs. The AWS Billing console provides a way to look at the total storage cost of data stored in Amazon S3, but sometimes IT organizations need to understand the breakdown of costs of a particular S3 bucket by various prefixes or objects corresponding to a particular user or application. There are various reasons to analyze the costs of S3 buckets, such as to identify the spend breakdown, do internal chargebacks, understand the cost breakdown by business unit and application, and many more. As of this writing, there is no easy way to do a cost breakdown of S3 buckets by objects and prefixes.

In this post, we discuss a solution using Amazon Athena to query AWS Cost and Usage Reports and Amazon S3 Inventory reports to analyze the cost by prefixes and objects in an S3 bucket.

Overview of solution

The following figure shows the architecture for this solution. First, we enable the AWS Cost and Usage Reports (AWS CUR) and Amazon S3 Inventory features, which save the output into two separate pre-created S3 buckets. We then use Athena to query these S3 buckets for AWS CUR data and S3 object inventory data to correlate and allocate the cost breakdown at the object or prefix level.


To implement the solution, we complete the following steps:

  1. Create S3 buckets for AWS CUR, S3 object inventory, and Athena results. Alternatively, you can create these respective buckets when enabling the respective individual features, but for the purpose of this post, we create all of them at the beginning.
  2. Enable the Cost and Usage Reports.
  3. Enable Amazon S3 Inventory configuration.
  4. Create AWS Glue Data Catalog tables for the CUR and S3 object inventory to query using Athena.
  5. Run queries in Athena.

Prerequisites

For this walkthrough, you need an AWS account with permissions to create S3 buckets, enable AWS Cost and Usage Reports, configure Amazon S3 Inventory, and create and run Athena queries.

Create S3 buckets

Amazon S3 is an object storage service offering industry-leading scalability, data availability, security, and performance. Customers of all sizes and industries can store and protect any amount of data for virtually any use case, such as data lakes, cloud-native applications, and mobile apps. With cost-effective storage classes and easy-to-use management features, you can optimize costs, organize data, and configure fine-tuned access controls to meet specific business, organizational, and compliance requirements.

For this post, we use the S3 bucket s3-object-cost-allocation as the primary bucket for cost allocation. This S3 bucket is conveniently modeled to contain several prefixes and objects of different sizes for which cost allocation needs to be done based on the overall cost of the bucket. In a real-world scenario, you should use a bucket that has data for multiple teams and for which you need to allocate costs by prefix or object. Going forward, we refer to this bucket as the primary object bucket.

The following screenshot shows our S3 bucket and folders.


Now let’s create the three additional operational S3 buckets to store the datasets generated to calculate costs for the objects. You can create the following buckets, or use existing buckets as needed:

  • cur-cost-usage-reports-<account_number> – This bucket is used to save the Cost and Usage Reports for the account.
  • s3-inventory-configurations-<account_number> – This bucket is used to save the inventory configurations of our primary object bucket.
  • athena-query-bucket-<account_number> – This bucket is used to save the query results from Athena.

Complete the following steps to create your S3 buckets (a scripted sketch follows the list):

  • On the Amazon S3 console, choose Buckets in the navigation pane.
  • Choose Create bucket.
  • For Bucket name, enter the name of your bucket (cur-cost-usage-reports-<account_number>).
  • For AWS Region, choose your preferred Region.
  • Leave all other settings at default (or according to your organization’s standards).
  • Choose Create bucket.
  • Repeat these steps to create s3-inventory-configurations-<account_number> and athena-query-bucket-<account_number>.
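If you prefer to script this, a minimal boto3 sketch follows; it assumes the us-east-1 Region (other Regions need a CreateBucketConfiguration) and derives the account number from the caller identity.

import boto3

account = boto3.client("sts").get_caller_identity()["Account"]
s3 = boto3.client("s3", region_name="us-east-1")

# Create the three operational buckets used in this post.
for name in (
    f"cur-cost-usage-reports-{account}",
    f"s3-inventory-configurations-{account}",
    f"athena-query-bucket-{account}",
):
    s3.create_bucket(Bucket=name)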

Enable the Cost and Usage Reports

The AWS Cost and Usage Reports (AWS CUR) contains the most comprehensive set of cost and usage data available. You can use Cost and Usage Reports to publish your AWS billing reports to an S3 bucket that you own. You can receive reports that break down your costs by the hour, day, or month; by product or product resource; or by tags that you define yourself.

Complete the following steps to enable Cost and Usage Reports for your account:

  • On the AWS Billing console, in the navigation pane, choose Cost & Usage Reports.
  • Choose Create report.
  • For Report name, enter a name for your report, such as account-cur-s3.
  • For Additional report details, select Include resource IDs to include the IDs of each individual resource in the report. Including resource IDs creates individual line items for each of your resources, which can significantly increase the size of your Cost and Usage Reports files and affect the S3 storage costs for your CUR, based on your AWS usage. We need this feature enabled for this post.
  • For Data refresh settings, select whether you want the Cost and Usage Reports to refresh if AWS applies refunds, credits, or support fees to your account after finalizing your bill. When a report refreshes, a new report is uploaded to Amazon S3.
  • Choose Next.
  • For S3 bucket, choose Configure.
  • For Configure S3 Bucket, select an existing bucket created in the previous section (cur-cost-usage-reports-<account_number>) and choose Next.
  • Review the bucket policy, select I have confirmed that this policy is correct, and choose Save. This default bucket policy provides Cost and Usage Reports access to write data to Amazon S3.
  • For Report path prefix, enter cur-data/account-cur-daily.
  • For Time granularity, choose Daily.
  • For Report versioning, choose Overwrite existing report.
  • For Enable report data integration for, select Amazon Athena.
  • Choose Next.
  • After you have reviewed the settings for your report, choose Review and Complete.

The Cost and Usage Reports will be delivered to the S3 bucket within 24 hours.

The following sample CUR in CSV format shows different columns of the Cost and Usage Report, including bill_invoice_id, bill_invoicing_entity, bill_payer_account_id, and line_item_product_code, to name a few.


Enable Amazon S3 Inventory configuration

Amazon S3 Inventory is one of the tools Amazon S3 provides to help manage your storage. You can use it to audit and report on the replication and encryption status of your objects for business, compliance, and regulatory needs. Amazon S3 Inventory provides comma-separated values (CSV), Apache Optimized Row Columnar (ORC), or Apache Parquet output files that list your objects and their corresponding metadata on a daily or weekly basis for an S3 bucket or a shared prefix (objects that have names that begin with a common string).

Complete the following steps to enable Amazon S3 Inventory on the primary object bucket:

  • On the Amazon S3 console, choose Buckets in the navigation pane.
  • Choose the bucket for which you want to configure Amazon S3 Inventory.
    This will be the existing bucket in your account that has data that needs to be analyzed. This could be your data lake or application S3 bucket. We created the bucket s3-object-cost-allocation with some sample data and folder structure.
  • Choose Management.
  • Under Inventory configurations, choose Create inventory configuration.
  • For Inventory configuration name, enter s3-object-cost-allocation.
  • For Inventory scope, leave Prefix blank.
    This is to ensure that all objects are covered for the report.
  • For Object Versions, select Current version only.
  • For Report details, choose This account.
  • For Destination, choose the destination bucket we created (s3-inventory-configurations-<account_number>).
  • For Frequency, choose Daily.
  • For Output format, choose Apache Parquet.
  • For Status, choose Enable.
  • Keep server-side encryption disabled. To use server-side encryption, choose Enable and specify the encryption key.
  • For Additional fields, select the following to add to the inventory report:
    • Size – The object size in bytes.
    • Last modified date – The object creation date or the last modified date, whichever is the latest.
    • Multipart upload – Specifies that the object was uploaded as a multipart upload. For more information, see Uploading and copying objects using multipart upload.
    • Replication status – The replication status of the object. For more information, see Using the S3 console.
    • Encryption status – The server-side encryption used to encrypt the object. For more information, see Protecting data using server-side encryption.
    • Bucket key status – Indicates whether a bucket-level key generated by AWS KMS applies to the object.
    • Storage class – The storage class used for storing the object.
    • Intelligent-Tiering: Access tier – Indicates the access tier of the object if it was stored in S3 Intelligent-Tiering
  • Choose Create.

It may take up to 48 hours to deliver the first report.

Create AWS Glue Data Catalog tables for CUR and Amazon S3 Inventory reports

Wait for up to 48 hours for the previous step to generate the reports. In this section, we use Athena to create and define AWS Glue Data Catalog tables for the data that has been created using Cost and Usage Reports and Amazon S3 Inventory reports.

Athena is a serverless, interactive analytics service built on open-source frameworks, supporting open-table and file formats. Athena provides a simplified, flexible way to analyze petabytes of data where it lives.

Complete the following steps to create the tables using Athena:

  • Navigate to the Athena console.
  • If you’re using Athena for the first time, you need to set up a query result location in Amazon S3. If you preconfigured this in Athena, you can skip this step.
    • Choose View settings.
    • Choose Manage.
    • In the section Query result location and encryption, choose Browse S3 and choose the bucket that we created (athena-query-bucket-<account_number>).
    • Choose Save.
    • Navigate back to the Athena query editor.
  • Run the following query in Athena to create a table for Cost and Usage Reports. Verify and update the section for <<LOCATION>> at the end of the query and point it to the correct S3 bucket and location. Note that the new table name should be account_cur.
    CREATE EXTERNAL TABLE `account_cur`(
    `identity_line_item_id` string,
    `identity_time_interval` string,
    `bill_invoice_id` string,
    `bill_billing_entity` string,
    `bill_bill_type` string,
    `bill_payer_account_id` string,
    `bill_billing_period_start_date` timestamp,
    `bill_billing_period_end_date` timestamp,
    `line_item_usage_account_id` string,
    `line_item_line_item_type` string,
    `line_item_usage_start_date` timestamp,
    `line_item_usage_end_date` timestamp,
    `line_item_product_code` string,
    `line_item_usage_type` string,
    `line_item_operation` string,
    `line_item_availability_zone` string,
    `line_item_resource_id` string,
    `line_item_usage_amount` double,
    `line_item_normalization_factor` double,
    `line_item_normalized_usage_amount` double,
    `line_item_currency_code` string,
    `line_item_unblended_rate` string,
    `line_item_unblended_cost` double,
    `line_item_blended_rate` string,
    `line_item_blended_cost` double,
    `line_item_line_item_description` string,
    `line_item_tax_type` string,
    `line_item_legal_entity` string,
    `product_product_name` string,
    `product_availability` string,
    `product_description` string,
    `product_durability` string,
    `product_event_type` string,
    `product_fee_code` string,
    `product_fee_description` string,
    `product_free_query_types` string,
    `product_from_location` string,
    `product_from_location_type` string,
    `product_from_region_code` string,
    `product_group` string,
    `product_group_description` string,
    `product_location` string,
    `product_location_type` string,
    `product_message_delivery_frequency` string,
    `product_message_delivery_order` string,
    `product_operation` string,
    `product_platopricingtype` string,
    `product_product_family` string,
    `product_queue_type` string,
    `product_region` string,
    `product_region_code` string,
    `product_servicecode` string,
    `product_servicename` string,
    `product_sku` string,
    `product_storage_class` string,
    `product_storage_media` string,
    `product_to_location` string,
    `product_to_location_type` string,
    `product_to_region_code` string,
    `product_transfer_type` string,
    `product_usagetype` string,
    `product_version` string,
    `product_volume_type` string,
    `pricing_rate_code` string,
    `pricing_rate_id` string,
    `pricing_currency` string,
    `pricing_public_on_demand_cost` double,
    `pricing_public_on_demand_rate` string,
    `pricing_term` string,
    `pricing_unit` string,
    `reservation_amortized_upfront_cost_for_usage` double,
    `reservation_amortized_upfront_fee_for_billing_period` double,
    `reservation_effective_cost` double,
    `reservation_end_time` string,
    `reservation_modification_status` string,
    `reservation_normalized_units_per_reservation` string,
    `reservation_number_of_reservations` string,
    `reservation_recurring_fee_for_usage` double,
    `reservation_start_time` string,
    `reservation_subscription_id` string,
    `reservation_total_reserved_normalized_units` string,
    `reservation_total_reserved_units` string,
    `reservation_units_per_reservation` string,
    `reservation_unused_amortized_upfront_fee_for_billing_period` double,
    `reservation_unused_normalized_unit_quantity` double,
    `reservation_unused_quantity` double,
    `reservation_unused_recurring_fee` double,
    `reservation_upfront_value` double,
    `savings_plan_total_commitment_to_date` double,
    `savings_plan_savings_plan_a_r_n` string,
    `savings_plan_savings_plan_rate` double,
    `savings_plan_used_commitment` double,
    `savings_plan_savings_plan_effective_cost` double,
    `savings_plan_amortized_upfront_commitment_for_billing_period` double,
    `savings_plan_recurring_commitment_for_billing_period` double,
    `resource_tags_user_bucket_name` string,
    `resource_tags_user_cost_tracking` string)
    PARTITIONED BY (
    `year` string,
    `month` string)
    ROW FORMAT SERDE
    'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
    STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
    OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
    LOCATION
    '<<LOCATION>>'
  • Run the following query in Athena to create the table for Amazon S3 Inventory. Verify and update the section for <<LOCATION>> at the end of the query and point it to the correct S3 bucket and location.
    • To get the exact value of the location, navigate to the bucket where inventory configurations are stored and navigate to the folder path hive. Use the S3 URI of that folder to replace <<LOCATION>> in the query.
      CREATE EXTERNAL TABLE s3_object_inventory(
               bucket string,
               key string,
               version_id string,
               is_latest boolean,
               is_delete_marker boolean,
               size bigint,
               last_modified_date bigint,
               storage_class string,
               is_multipart_uploaded boolean,
               replication_status string,
               encryption_status string,
               intelligent_tiering_access_tier string,
               bucket_key_status string
      ) PARTITIONED BY (
              dt string
      )
      ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
        STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.SymlinkTextInputFormat'
        OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.IgnoreKeyTextOutputFormat'
        LOCATION '<<LOCATION>>';
      
  • We need to refresh the partitions and add new inventory lists to the table. Use the following commands to add data to the CUR table and Amazon S3 Inventory table:
    MSCK REPAIR TABLE `account_cur`;
    
    MSCK REPAIR TABLE s3_object_inventory;

Run queries in Athena to allocate the cost of objects in an S3 bucket

Now we can query the data we have available to get a cost allocation breakdown at the prefix level.

We need to provide some information in the following queries:

  • Update <<YYYY-MM-DD>> with the date for which you want to analyze the data
  • Update <<prefix>> with the prefix values for your bucket that needs to be analyzed
  • Update <<bucket_name>> with the name of the bucket that needs to be analyzed

We use the following part of the query to calculate the size of storage being used by the target prefix that we want to calculate the cost for:

select date_parse(dt,'%Y-%m-%d-%H-%i') dt, cast (sum(size) as double) targetPrefixBytes
from s3_object_inventory
where date_parse(dt,'%Y-%m-%d-%H-%i') = cast('<<YYYY-MM-DD>>' as timestamp)
and key like '<<prefix>>/%'
group by dt

Next, we calculate the total size of the bucket on that particular date:

select date_parse(dt,'%Y-%m-%d-%H-%i') dt, cast (sum(size) as double) totalBytes
from s3_object_inventory
where date_parse(dt,'%Y-%m-%d-%H-%i') = cast('<<YYYY-MM-DD>>' as timestamp)
group by dt

We query the CUR table to get the cost of a particular bucket on a particular date:

select line_item_usage_start_date as dt, sum(line_item_blended_cost) as line_item_blended_cost
from "account_cur"
where line_item_product_code = 'AmazonS3'
and product_servicecode = 'AmazonS3'
and line_item_operation = 'StandardStorage'
and line_item_resource_id = '<<bucket_name>>'
and line_item_usage_start_date = cast('<<YYYY-MM-DD>>' as timestamp)
group by line_item_usage_start_date

Putting all of this together, we can calculate the cost of a particular prefix (folder or a file) on a specific date. The complete query is as follows:

with
cost as (select line_item_usage_start_date as dt, sum(line_item_blended_cost) as line_item_blended_cost
from "account_cur"
where line_item_product_code = 'AmazonS3'
and product_servicecode = 'AmazonS3'
and line_item_operation = 'StandardStorage'
and line_item_resource_id = '<<bucket_name>>'
and line_item_usage_start_date = cast('<<YYYY-MM-DD>>' as timestamp)
group by line_item_usage_start_date),
total as (select date_parse(dt,'%Y-%m-%d-%H-%i') dt, cast (sum(size) as double) totalBytes
from s3_object_inventory
where date_parse(dt,'%Y-%m-%d-%H-%i') = cast('<<YYYY-MM-DD>>' as timestamp)
group by dt),
target as (select date_parse(dt,'%Y-%m-%d-%H-%i') dt, cast (sum(size) as double) targetPrefixBytes
from s3_object_inventory
where date_parse(dt,'%Y-%m-%d-%H-%i') = cast('<<YYYY-MM-DD>>' as timestamp)
and key like '<<prefix>>/%'
group by dt)
select target.dt,
(target.targetPrefixBytes/ total.totalBytes * 100) percentUsed,
cost.line_item_blended_cost totalCost,
cost.line_item_blended_cost*(target.targetPrefixBytes/ total.totalBytes) as prefixCost
from target, total, cost
where target.dt = total.dt
and target.dt = cost.dt

The following screenshot shows the results table for the sample data we used in this post. We get the following information:

  • dt – Date
  • percentUsed – The percentage of prefix space compared to overall bucket space
  • totalCost – The total cost of the bucket
  • prefixCost – The cost of the space used by the prefix


Clean up

To stop incurring costs, be sure to disable Amazon S3 Inventory and Cost and Usage Reports when you’re done.

Delete the S3 buckets created for the Amazon S3 Inventory reports and Cost and Usage Reports to avoid storage charges.

Other methods for Amazon S3 storage analysis

Amazon S3 Storage Lens can provide a single view of object storage usage and activity across your entire Amazon S3 storage. With S3 Storage Lens, you can understand, analyze, and optimize storage with over 29 usage and activity metrics and interactive dashboards to aggregate data for your entire organization, specific accounts, Regions, buckets, or prefixes. All of this data is accessible on the Amazon S3 console or as raw data in an S3 bucket.

S3 Storage Lens doesn’t provide cost analysis based on an object or prefix in a single bucket. If you want visibility into storage usage and trends across your entire storage footprint, along with recommendations on cost efficiency and data protection best practices, S3 Storage Lens is the right option. But if you want a cost analysis of specific S3 buckets and are looking for ways to allocate costs at the object or prefix level, the solution in this post is the best fit.

Conclusion

In this post, we detailed how to create a cost breakdown model at the object or prefix level for S3 buckets that contain data for multiple business units and applications. We used Athena to query the reports and datasets produced by the AWS CUR and Amazon S3 Inventory features, which, when correlated, give us the cost allocation at the object and prefix level. This solution gives you an easy way to calculate costs for independent objects and prefixes, which can be used for internal chargebacks or simply to know the per-object or per-prefix spend in a shared S3 bucket.


About the Authors


Dagar Katyal is a Senior Solutions Architect at AWS, based in Chicago, Illinois. He works with customers and provides guidance on key strategic initiatives important to their business. Dagar has an MBA and has spent over 15 years working with customers on analytics strategy and roadmap projects, using data as a key differentiator. When not working with customers, Dagar spends time with his family and works on home improvement projects.


Saiteja Pudi is a Solutions Architect at AWS, based in Dallas, TX. He has been with AWS for more than 3 years, helping customers realize the true potential of AWS as their trusted advisor. He comes from an application development background and is interested in data science and machine learning.

How Amazon Devices scaled and optimized real-time demand and supply forecasts using serverless analytics

Post Syndicated from Avinash Kolluri original https://aws.amazon.com/blogs/big-data/how-amazon-devices-scaled-and-optimized-real-time-demand-and-supply-forecasts-using-serverless-analytics/

Every day, Amazon Devices processes and analyzes billions of transactions from global shipping, inventory, capacity, supply, sales, marketing, producers, and customer service teams. This data is used to procure device inventory that meets Amazon customers’ demand. With data volumes exhibiting double-digit percentage growth year over year, and with the COVID pandemic disrupting global logistics in 2021, it became more critical to scale and generate near-real-time data.

This post shows you how we migrated to a serverless data lake built on AWS that automatically consumes data from multiple sources and in different formats. The migration also created new opportunities for our data scientists and engineers to use AI and machine learning (ML) services to continuously feed and analyze data.

Challenges and design concerns

Our legacy architecture primarily used Amazon Elastic Compute Cloud (Amazon EC2) to extract the data from various internal heterogeneous data sources and REST APIs with the combination of Amazon Simple Storage Service (Amazon S3) to load the data and Amazon Redshift for further analysis and generating the purchase orders.

We found this approach resulted in a few deficiencies and therefore drove improvements in the following areas:

  • Developer velocity – Due to the lack of unification and discovery of schema, which are primary reasons for runtime failures, developers often spent time dealing with operational and maintenance issues.
  • Scalability – Most of these datasets are shared across the globe. Therefore, we must meet the scaling limits while querying the data.
  • Minimal infrastructure maintenance – The current process spans multiple computes depending on the data source. Therefore, reducing infrastructure maintenance is critical.
  • Responsiveness to data source changes – Our current system gets data from various heterogeneous data stores and services. Any updates to those services take months of developer cycles. The response times for these data sources are critical to our key stakeholders. Therefore, we must take a data-driven approach to selecting a high-performance architecture.
  • Storage and redundancy – Due to the heterogeneous data stores and models, it was challenging to store the different datasets from various business stakeholder teams. Therefore, having versioning along with incremental and differential data to compare will provide a remarkable ability to generate more optimized plans.
  • Flexibility and accessibility – Due to the volatile nature of logistics, a few business stakeholder teams have a requirement to analyze the data on demand and generate the near-real-time optimal plan for the purchase orders. This introduces the need for both polling and pushing the data to access and analyze it in near-real time.

Implementation strategy

Based on these requirements, we changed strategies and started analyzing each issue to identify a solution. Architecturally, we chose a serverless model, and the data lake architecture addresses the architectural gaps and challenges we identified as targets for improvement. From an operational standpoint, we designed a new shared responsibility model for data ingestion, using AWS Glue instead of internal services (REST APIs) running on Amazon EC2 to extract the data. We also used AWS Lambda for data processing. Then we chose Amazon Athena as our query service. To further optimize and improve developer velocity for our data consumers, we added Amazon DynamoDB as a metadata store for the different data sources landing in the data lake. These two decisions drove every design and implementation decision we made.

The following diagram illustrates the architecture.


In the following sections, we look at each component in the architecture in more detail as we move through the process flow.

AWS Glue for ETL

To meet customer demand while supporting the scale of new businesses’ data sources, it was critical for us to have a high degree of agility, scalability, and responsiveness in querying various data sources.

AWS Glue is a serverless data integration service that makes it easy for analytics users to discover, prepare, move, and integrate data from multiple sources. You can use it for analytics, ML, and application development. It also includes additional productivity and DataOps tooling for authoring, running jobs, and implementing business workflows.

With AWS Glue, you can discover and connect to more than 70 diverse data sources and manage your data in a centralized data catalog. You can visually create, run, and monitor extract, transform, and load (ETL) pipelines to load data into your data lakes. Also, you can immediately search and query cataloged data using Athena, Amazon EMR, and Amazon Redshift Spectrum.

AWS Glue made it easy for us to connect to the data in various data stores, edit and clean the data as needed, and load the data into an AWS-provisioned store for a unified view. AWS Glue jobs can be scheduled or called on demand to extract data from the client’s resource and from the data lake.

Some responsibilities of these jobs are as follows:

  • Extracting and converting a source entity to a data entity
  • Enriching the data with year, month, and day for better cataloging, and including a snapshot ID for better querying
  • Performing input validation and path generation for Amazon S3
  • Associating the accredited metadata based on the source system

Querying REST APIs exposed by internal services is one of our core challenges, and given the goal of minimal infrastructure, we wanted to keep using those APIs in this project. AWS Glue connectors helped us meet that requirement. To query data from REST APIs and other data sources, we used PySpark and JDBC modules.

AWS Glue supports a wide variety of connection types. For more details, refer to Connection Types and Options for ETL in AWS Glue.

S3 bucket as landing zone

We used an S3 bucket as the immediate landing zone of the extracted data, which is further processed and optimized.

Lambda as AWS Glue ETL Trigger

We enabled S3 event notifications on the S3 bucket to trigger Lambda, which further partitions our data. The data is partitioned on InputDataSetName, Year, Month, and Date. Any query processor running on top of this data will scan only a subset of data for better cost and performance optimization. Our data can be stored in various formats, such as CSV, JSON, and Parquet.
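A hypothetical sketch of such a partitioning function follows; the destination bucket name and the convention that the landing prefix names the dataset are assumptions, not the team’s actual code.

import boto3
from datetime import datetime, timezone
from urllib.parse import unquote_plus

s3 = boto3.client("s3")

def handler(event, context):
    # Copy each newly landed object under a dataset/year/month/date prefix.
    now = datetime.now(timezone.utc)
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = unquote_plus(record["s3"]["object"]["key"])
        dataset = key.split("/")[0]  # assume the landing prefix names the dataset
        target = (
            f"{dataset}/year={now.year}/month={now.month:02d}/"
            f"date={now.day:02d}/{key.rsplit('/', 1)[-1]}"
        )
        s3.copy_object(
            Bucket="partitioned-data-bucket",
            CopySource={"Bucket": bucket, "Key": key},
            Key=target,
        )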

The raw data isn’t ideal for most of our use cases to generate the optimal plan because it often has duplicates or incorrect data types. Most importantly, the data is in multiple formats, but we quickly modified the data and observed significant query performance gains from using the Parquet format. Here, we used one of the performance tips in Top 10 performance tuning tips for Amazon Athena.

AWS Glue jobs for ETL

We wanted better data segregation and accessibility, so we chose to have a different S3 bucket to improve performance further. We used the same AWS Glue jobs to further transform and load the data into the required S3 bucket and a portion of extracted metadata into DynamoDB.

DynamoDB as metadata store

Now that we have the data, various business stakeholders consume it. This leaves us with two questions: which source data resides on the data lake, and which version of it. We chose DynamoDB as our metadata store, which provides the latest details to consumers so they can query the data effectively. Every dataset in our system is uniquely identified by a snapshot ID, which we can search from our metadata store. Clients access this data store through an API.

Amazon S3 as data lake

For better data quality, we extracted the enriched data into another S3 bucket with the same AWS Glue job.

AWS Glue Crawler

Crawlers are the “secret sauce” that enables us to be responsive to schema changes. Throughout the process, we chose to make each step as schema-agnostic as possible, which allows any schema changes to flow through until they reach AWS Glue. With a crawler, we could maintain the agnostic changes happening to the schema. This helped us automatically crawl the data from Amazon S3 and generate the schema and tables.

AWS Glue Data Catalog

The Data Catalog helped us maintain the catalog as an index to the data’s location, schema, and runtime metrics in Amazon S3. Information in the Data Catalog is stored as metadata tables, where each table specifies a single data store.

Athena for SQL queries

Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries you run. We considered operational stability and increasing developer velocity as our key improvement factors.

We further optimized the process to query Athena so that users can plug in the values and the queries to get data out of Athena by creating the following:

  • An AWS Cloud Development Kit (AWS CDK) template to create Athena infrastructure and AWS Identity and Access Management (IAM) roles to access data lake S3 buckets and the Data Catalog from any account
  • A library so that clients can provide an IAM role, query, data format, and output location to start an Athena query and get the status and result of the query run in the bucket of their choice.

Querying Athena is a two-step process:

  • StartQueryExecution – This starts the query run and gets the run ID. Users can provide the output location where the output of the query will be stored.
  • GetQueryExecution – This gets the query status because the run is asynchronous. When successful, you can query the output in an S3 file or via API.

The helper method for starting the query run and getting the result would be in the library.
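A condensed sketch of that helper, using boto3 directly, might look like the following; the database, query, and output location are placeholders.

import time
import boto3

athena = boto3.client("athena")

def run_query(query, database, output_location):
    # Step 1: start the asynchronous query run.
    query_id = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )["QueryExecutionId"]
    # Step 2: poll the run status until it reaches a terminal state.
    while True:
        state = athena.get_query_execution(QueryExecutionId=query_id)[
            "QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            return query_id, state
        time.sleep(2)

run_query("SELECT 1", "default", "s3://athena-query-bucket-111122223333/results/")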

Data lake metadata service

This service is custom developed and interacts with DynamoDB to get the metadata (dataset name, snapshot ID, partition string, timestamp, and S3 link of the data) in the form of a REST API. When the schema is discovered, clients use Athena as their query processor to query the data.

Because all datasets have a snapshot ID and are partitioned, a join query doesn’t result in a full table scan, only a partition scan on Amazon S3. We used Athena as our query processor because it frees us from managing our own query infrastructure. Later, if we need something more, we can use either Redshift Spectrum or Amazon EMR.

Conclusion

Amazon Devices teams discovered significant value by moving to a data lake architecture using AWS Glue, which enabled multiple global business stakeholders to ingest data in more productive ways. This enabled the teams to generate the optimal plan to place purchase orders for devices by analyzing the different datasets in near-real time with appropriate business logic to solve the problems of the supply chain, demand, and forecast.

From an operational perspective, the investment has already started to pay off:

  • It standardized our ingestion, storage, and retrieval mechanisms, saving onboarding time. Before the implementation of this system, one dataset took 1 month to onboard. Due to our new architecture, we were able to onboard 15 new datasets in less than 2 months, which improved our agility by 70%.
  • It removed scaling bottlenecks, creating a homogeneous system that can quickly scale to thousands of runs.
  • The solution added schema and data quality validation before accepting any inputs and rejecting them if data quality violations are discovered.
  • It made it easy to retrieve datasets while supporting future simulations and back tester use cases requiring versioned inputs. This will make launching and testing models simpler.
  • The solution created a common infrastructure that can be easily extended to other teams across DIAL having similar issues with data ingestion, storage, and retrieval use cases.
  • Our operating costs have fallen by almost 90%.
  • This data lake can be accessed efficiently by our data scientists and engineers to perform other analytics and have a predictive approach as a future opportunity to generate accurate plans for the purchase orders.

The steps in this post can help you plan to build a similar modern data strategy using AWS-managed services to ingest data from different sources, automatically create metadata catalogs, share data seamlessly between the data lake and data warehouse, and create alerts in the event of an orchestrated data workflow failure.


About the authors

Avinash Kolluri is a Senior Solutions Architect at AWS. He works across Amazon Alexa and Devices to architect and design modern distributed solutions. His passion is to build cost-effective and highly scalable solutions on AWS. In his spare time, he enjoys cooking fusion recipes and traveling.

Vipul Verma is a Sr. Software Engineer at Amazon.com. He has been with Amazon since 2015, solving real-world challenges through technology that directly impacts and improves the life of Amazon customers. In his spare time, he enjoys hiking.

Visualize multivariate data using a radar chart in Amazon QuickSight

Post Syndicated from Bhupinder Chadha original https://aws.amazon.com/blogs/big-data/visualize-multivariate-data-using-a-radar-chart-in-amazon-quicksight/

At AWS re:Invent 2022, we announced the general availability of two new Amazon QuickSight visuals: small multiples and text boxes. We are excited to add another new visual to QuickSight: radar charts. With radar charts, you can compare two or more items across multiple variables in QuickSight.

In this post, we explore radar charts, its use cases, and how to configure one.

What is a radar chart?

Radar charts (also known as spider charts, polar charts, web charts, or star plots) are a way to visualize multivariate data similar to a parallel coordinates chart. They are used to plot one or more groups of values over multiple common variables. They do this by providing an axis for each variable, and these axes are arranged radially around a central point and spaced equally. The center of the chart represents the minimum value, and the edges represent the maximum value on the axis. The data from a single observation is plotted along each axis and connected to form a polygon. Multiple observations can be placed in a single chart by displaying multiple polygons.

For example, consider an HR team that wants to compare employee satisfaction scores for departments like sales, marketing, and finance against metrics such as work/life balance, diversity, inclusiveness, growth opportunities, and wages. As shown in the following radar chart, each employee metric forms an axis, with each department represented as an individual series.

Another effective way to use radar charts is to compare a given department against an average or baseline value. For instance, the sales department ranks lower than the baseline on compensation but higher on work/life balance.

When to use radar charts

Radar charts are a great option when space is a constraint and you want to compare multiple groups in a compact space. Radar charts are best used for the following:

  • Visualizing multivariate data, such as comparing cars across different stats like mileage, max speed, engine power, and driving pleasure
  • Comparative analysis (comparing two or more items across a list of common variables)
  • Spotting outliers and commonalities

Compared to parallel coordinates, radar charts are ideal when there are a few groups of items to be compared. You should also be mindful of not displaying too many variables, which can make the chart look cluttered and difficult to read.

Radar chart use cases

Radar charts have a wide variety of industry use cases, some of which are as follows:

  • Sports analytics – Compare athlete performance across different performance parameters for selection criteria
  • Strategy – Compare and measure different technology costs between various parameters, such as contact center, claims, massive claims, and others
  • Sales – Compare performance of sales reps across different parameters like deals closed, average deal size, net new customer wins, total revenue, and deals in the pipeline
  • Call centers – Compare call center staff performance against the staff average across different dimensions
  • HR – Compare company scores in terms of diversity, work/life balance, benefits, and more
  • User research and customer success – Compare customer satisfaction scores across different parts of the product

Different radar chart configurations

Let’s use an example of visualizing staff performance within a team, using the following sample data. The goal is to compare employee performance based on various qualities like communication, work quality, productivity, creativity, dependability, punctuality, and technical skills, each scored from 0 to 10.

To add a radar chart to your analysis, choose the radar chart icon from the visual selector.

Depending on your use case and how the data is structured, you can configure radar charts in different ways.

Value as axis (UC1 and 2 tab from the dataset)

In this scenario, all qualities (communication, dependability, and so on) are defined as measures, and the employee is defined as a dimension in the dataset.

To visualize this data in a radar chart, drag all the variables to the Values field well and the Employee field to the Color field well.

Category as axis (UC1 and 2 tab from the dataset)

Another way to visualize the same data is to reverse the series and axis configuration, where each quality is displayed as a series and employees are displayed on the axis. For this, drag the Employee field to the Category field well and all the qualities to the Value field well.

Category as axis with color (UC3 tab from the dataset)

We can visualize the same use case with a different data structure, where all the qualities and employees are defined as a dimension and scores as values.

To achieve this use case, drag the field that you want to visualize as the axis to the Category field and individual series to the Color field. In our case, we chose Qualities as our axis, added Score to the Value field well, and visualized the values for each employee by adding Employee to the Color field well.

Styling radar charts

You can customize your radar charts with the following formatting options:

  • Series style – You can choose to display the chart as either a line (default) or area series

  • Start angle – By default, this is set to 90 degrees, but you can choose a different angle if you want to rotate the radar chart to better utilize the available real estate

  • Fill area – This option applies odd/even coloring for the plot area

  • Grid shape – Choose between circle or polygon for grid shape

Summary

In this post, we looked at how radar charts can help you visualize and compare items across different variables. We also learned about the different configurations supported by radar charts and styling options to help you customize its look and feel.

We encourage you to explore radar charts and leave a comment with your feedback.


About the author

Bhupinder Chadha is a senior product manager for Amazon QuickSight focused on visualization and front-end experiences. He is passionate about BI, data visualization, and low-code/no-code experiences. Prior to QuickSight, he was the lead product manager for Inforiver, responsible for building an enterprise BI product from the ground up. Bhupinder started his career in presales, followed by a short stint in consulting and then product management for xViz, an add-on visualization product.

Define a custom session duration and terminate active sessions in IAM Identity Center

Post Syndicated from Ron Cully original https://aws.amazon.com/blogs/security/define-a-custom-session-duration-and-terminate-active-sessions-in-iam-identity-center/

Managing access to accounts and applications requires a balance between delivering simple, convenient access and managing the risks associated with active user sessions. Based on your organization’s needs, you might want to make it simple for end users to sign in and to operate long enough to get their work done, without the disruptions associated with requiring re-authentication. You might also consider shortening the session to help meet your compliance or security requirements. At the same time, you might want to terminate active sessions that your users don’t need, such as sessions for former employees, sessions for which the user failed to sign out on a second device, or sessions with suspicious activity.

With AWS IAM Identity Center (successor to AWS Single Sign-On), you now have the option to configure the appropriate session duration for your organization’s needs while using new session management capabilities to look up active user sessions and revoke unwanted sessions.

In this blog post, I show you how to use these new features in IAM Identity Center. First, I walk you through how to configure the session duration for your IAM Identity Center users. Then I show you how to identify existing active sessions and terminate them.

What is IAM Identity Center?

IAM Identity Center helps you securely create or connect your workforce identities and manage their access centrally across AWS accounts and applications. IAM Identity Center is the recommended approach for workforce identities to access AWS resources. In IAM Identity Center, you can integrate with an external identity provider (IdP), such as Okta Universal Directory, Microsoft Azure Active Directory, or Microsoft Active Directory Domain Services, as an identity source or you can create users directly in IAM Identity Center. The service is built on the capabilities of AWS Identity and Access Management (IAM) and is offered at no additional cost.

IAM Identity Center sign-in and sessions

You can use IAM Identity Center to access applications and accounts and to get credentials for the AWS Management Console, AWS Command Line Interface (AWS CLI), and AWS SDK sessions. When you log in to IAM Identity Center through a browser or the AWS CLI, an AWS access portal session is created. When you federate into the console, IAM Identity Center uses the session duration setting on the permission set to control the duration of the session.

Note: The access portal session duration for IAM Identity Center differs from the IAM permission set session duration, which defines how long a user can access their account through the IAM Identity Center console.

Before the release of the new session management feature, the AWS access portal session duration was fixed at 8 hours. Now you can configure the session duration for the AWS access portal in IAM Identity Center from 15 minutes to 7 days. The access portal session duration determines how long the user can access the portal, applications, and accounts, and run CLI commands without re-authenticating. If you have an external IdP connected to IAM Identity Center, the access portal session duration will be the lesser of either the session duration that you set in your IdP or the session duration defined in IAM Identity Center. Users can access accounts and applications until the access portal session expires and initiates re-authentication.

When users access accounts or applications through IAM Identity Center, it creates an additional session that is separate but related to the AWS access portal session. AWS CLI sessions use the AWS access portal session to access roles. The duration of console sessions is defined as part of the permission set that the user accessed. When a console session starts, it continues until the duration expires or the user ends the session. IAM Identity Center-enabled application sessions re-verify the AWS access portal session approximately every 60 minutes. These sessions continue until the AWS access portal session terminates, until another application-specific condition terminates the session, or until the user terminates the session.

To summarize:

  • After a user signs in to IAM Identity Center, they can access their assigned roles and applications for a fixed period, after which they must re-authenticate.
  • If a user accesses an assigned permission set, the user has access to the corresponding role for the duration defined in the permission set (or by the user terminating the session).
  • The AWS CLI uses the AWS access portal session to access roles, refreshing the IAM permission set credentials in the background. The CLI job continues to run until the access portal session expires (see the sketch after this list).
  • If users access an IAM Identity Center-enabled application, the user can retain access to an application for up to an hour after the access portal session has expired.
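The following minimal sketch illustrates the CLI and SDK behavior from the last two points. It assumes you have already configured an IAM Identity Center profile with aws configure sso and signed in with aws sso login; the profile name my-sso-profile is a placeholder:

import boto3

# The SDK exchanges the cached access portal session for short-lived role
# credentials and refreshes them in the background until the portal session
# expires, at which point calls fail and you must sign in again.
session = boto3.Session(profile_name="my-sso-profile")  # placeholder profile name
s3 = session.client("s3")
print([bucket["Name"] for bucket in s3.list_buckets()["Buckets"]])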

Note: IAM Identity Center doesn’t currently support session management capabilities for Active Directory identity sources.

For more information about session management features, see Authentication sessions in the documentation.

Configure session duration

In this section, I show you how to configure the session duration for the AWS access portal in IAM Identity Center. You can choose a session duration between 15 minutes and 7 days.

Session duration is a global setting in IAM Identity Center. After you set the session duration, it applies as the maximum session duration for all IAM Identity Center users.

To configure session duration for the AWS access portal:

  1. Open the IAM Identity Center console.
  2. In the left navigation pane, choose Settings.
  3. On the Settings page, choose the Authentication tab.
  4. Under Authentication, next to Session settings, choose Configure.
  5. For Configure session settings, choose a maximum session duration from the list of pre-defined session durations in the dropdown. To set a custom session duration, select Custom duration, enter the length for the session in minutes, and then choose Save.
Figure 1: Set access portal session duration

Congratulations! You have just modified the session duration for your users. This new duration will take effect on each user’s next sign-in.

Find and terminate AWS access portal sessions

With this new release, you can find active portal sessions for your IAM Identity Center users, and if needed, you can terminate the sessions. This can be useful in situations such as the following:

  • A user no longer works for your organization or was removed from projects that gave them access to applications or permission sets that they should no longer use.
  • If a device is lost or stolen, the user can contact you to end the session. This reduces the risk that someone will access the device and use the open session.

In these cases, you can find a user’s active sessions in the AWS access portal, select the session that you’re interested in, and terminate it. Depending on the situation, you might also want to deactivate sign-in for the user from the system before revoking the user’s session. You can deactivate sign-in for users in the IAM Identity Center console or in your third-party IdP.

If you first deactivate the user’s sign-in in your IdP and then deactivate it in IAM Identity Center, deactivation takes effect in IAM Identity Center without synchronization latency. However, if you deactivate the user in IAM Identity Center first, it’s possible that the IdP could activate the user again. By deactivating the user’s sign-in in your IdP first, you prevent the user from signing in again when you revoke their session. This order is advisable when a user has left your organization and should no longer have access, or if you suspect that a valid user’s credentials were stolen and you want to block access until you reset the user’s password.

Termination of the access portal session does not affect the active permission set session started from the access portal: an IAM role session assumed from the access portal lasts as long as the duration specified in the permission set. For AWS CLI sessions, it can take up to an hour for the CLI to stop working after the access portal session is terminated.

Tip: Activate multi-factor authentication (MFA) wherever possible. MFA offers an additional layer of protection to help prevent unauthorized individuals from gaining access to systems or data.

To manage active access portal sessions in the AWS access portal:

  1. Open the IAM Identity Center console.
  2. In the left navigation pane, choose Users.
  3. On the Users page, choose the username of the user whose sessions you want to manage. This takes you to a page with the user’s information.
  4. On the user’s page, choose the Active sessions tab. The number in parentheses next to Active sessions indicates the number of current active sessions for this user.
    Figure 2: View active access portal sessions

  5. Select the sessions that you want to delete, and then choose Delete session. A dialog box appears that confirms you’re deleting active sessions for this user.
    Figure 3: Delete selected active sessions

  6. Review the information in the dialog box, and if you want to continue, choose Delete session.

Conclusion

In this blog post, you learned how IAM Identity Center manages sessions, how to modify the session duration for the AWS access portal, and how to view, search, and terminate active access portal sessions. I also shared some tips on how to think about the appropriate session duration for your use case and related steps that you should take when terminating sessions for users who shouldn’t have permission to sign in again after their session has ended.

With this new feature, you now have more control over user session management. You can use the console to set configurable session lengths based on your organization’s security requirements and desired end-user experience, and you can also terminate sessions, enabling you to manage sessions that are no longer needed or potentially suspicious.

To learn more, see Manage IAM Identity Center integrated application sessions.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Ron Cully

Ron is a Principal Product Manager at AWS where he has led feature and roadmap planning for workforce identity products for over 6 years. He has over 25 years of experience leading networking and directory related product delivery. Ron is passionate about delivering solutions to help make it easier for you to migrate identity-aware workloads, simplify resource and application authorization, and give people a simple sign-in and access experience in the cloud.

Palak Arora

Palak is a Senior Product Manager at AWS Identity. She has over eight years of cyber security experience, specializing in the Identity and Access Management (IAM) domain. She has helped customers across different sectors define their enterprise and customer IAM roadmap and strategy, and improve their overall technology risk landscape.

Create more partitions and retain data for longer in your MSK Serverless clusters

Post Syndicated from Usama Naseem original https://aws.amazon.com/blogs/big-data/create-more-partitions-and-retain-data-for-longer-in-your-msk-serverless-clusters/

In April 2022, Amazon Managed Streaming for Apache Kafka (Amazon MSK) launched an exciting new capability, Amazon MSK Serverless. Amazon MSK is a fully managed service for Apache Kafka that makes it easier for developers to build and run highly available, secure, and scalable applications based on Apache Kafka. With MSK Serverless, developers can run their applications without having to provision, configure, or optimize their Apache Kafka clusters. MSK Serverless automatically provisions and scales compute and storage resources, so developers have access to on-demand streaming capacity and storage.

Over the remainder of 2022, the team collected customer feedback and worked backward from customer requirements to add new capabilities that made MSK Serverless even better. In this post, we discuss a few of these enhancements in detail and provide an example use case.

Higher default quota for partitions in a cluster

Data in Apache Kafka is written to topics, which can be partitioned into multiple log files called partitions. When a producer application writes data to a topic, it is appended to one of these partitions. MSK Serverless launched with a maximum quota of 120 partitions per cluster. However, our customers told us that they needed more partitions per cluster for a variety of use cases, ranging from change data capture (CDC) to faster real-time data processing.

In December 2022, we increased the default quota for partitions for MSK Serverless clusters. With the increased quota, you can create up to 2,400 partitions per cluster. The 20-fold increase in the number of partitions you can have per cluster lets you create more topics per cluster and have more applications consume data in parallel. You can also implement better isolation of data with fine-grained access control. More partitions are particularly useful for CDC use cases where each table in the database has hundreds of unique keys, each mapped to a unique partition. With more partitions, you can use MSK Serverless for capturing changes in larger databases with lots of tables and hundreds of keys. Note that the 2,400 limit only applies to leader partitions. MSK Serverless creates two replicas of each partition by default at no additional cost, and these replicas don’t count toward this limit.
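To make the CDC example concrete, the following sketch shows the idea of keyed partitioning. Kafka’s default partitioner actually uses murmur2 hashing; Python’s built-in hash() stands in here purely for illustration, and the key format is invented:

NUM_PARTITIONS = 2400  # the new MSK Serverless leader-partition quota

def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    # All changes keyed by the same table row land in the same partition,
    # preserving per-key ordering across the stream.
    return hash(key) % num_partitions

print(partition_for("orders:customer-42"))  # hypothetical CDC key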

Unlimited data retention duration

The data you produce to your topics can be retained in Apache Kafka for a configurable duration, depending on how long you need to access data using Apache Kafka consumer APIs. Typically, customers retain data for short periods of time, ranging from a few hours to a few days. Previously, MSK Serverless limited data retention to a maximum of 24 hours (1 day), which is sufficient for most popular Apache Kafka use cases. However, some use cases require customers to retain data for longer, such as retaining data for audit purposes or maintaining application recovery SLAs.

Now, with the increase in the data retention duration quota, you can retain data for as long as you need in your MSK Serverless clusters. Longer data retention is particularly useful for use cases where your consumer applications need quick access to older data. For instance, in the case of a failure, the application may need to access data from the start of the topic to reconstruct its state. Because you can now retain data in your topics for longer durations, you can restore your application’s state by accessing older data using Kafka’s consumer API, making it easier to recover from such failures. After the application recovers, you can configure your application to start consuming the data from the earliest timestamp you need to reestablish your application’s state. Note that you can only retain up to 250 GB of data per partition. As long as your partition doesn’t reach 250 GB in size, you may retain it for as long as you wish. You may create more partitions if you need more storage for a given topic.
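The recovery pattern described above looks roughly like the following sketch, written with the open-source kafka-python client. The topic, endpoint, and consumer group are placeholders, and MSK Serverless requires IAM authentication in practice, which is omitted here for brevity:

from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "msk-sample-topic",               # placeholder topic
    bootstrap_servers="my-endpoint",  # placeholder bootstrap server string
    auto_offset_reset="earliest",     # replay from the start of retained data
    enable_auto_commit=False,
    group_id="state-rebuild",         # fresh group, so reads start at 'earliest'
)

for record in consumer:
    rebuild_state(record.key, record.value)  # hypothetical state-rebuild function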

These new quotas are available in all Regions where MSK Serverless is available. For more information, navigate to the MSK Serverless tab on the Amazon MSK pricing page and choose the Region drop-down menu.

You can also request an increase to the maximum number of partitions quota by contacting AWS Support if you need more than 2,400 partitions in a cluster. The quotas for more partitions and longer retention are applied to both existing and new clusters.

Getting started: Create a topic with 1,000 partitions and 7-day retention

In this section, we demonstrate how to create a topic in MSK Serverless, specify the number of partitions, and set its retention duration.

As a prerequisite, you must have an MSK Serverless cluster and an Apache Kafka client. Refer to Getting started using MSK Serverless clusters for step-by-step instructions.

  1. On your client machine, access kafka_2.12-2.8.1/bin and run the following export command (replace my-endpoint with the bootstrap server string of your MSK Serverless cluster):
    export BS=my-endpoint

  2. Run the following command to create a topic called msk-sample-topic with 1,000 partitions and 7-day data retention (604,800,000 milliseconds):
    <path-to-your-kafka-installation>/bin/kafka-topics.sh \
    --bootstrap-server $BS \
    --command-config client.properties \
    --create \
    --topic msk-sample-topic \
    --partitions 1000 \
    --config retention.ms=604800000

  3. (Optional) Run the following command to view the details of the topic you created in step 2 above:
    <path-to-your-kafka-installation>/bin/kafka-topics.sh \
    --bootstrap-server $BS \
    --command-config client.properties \
    --topic msk-sample-topic \
    --describe | head -n1

    You will see the following result:

    Topic: msk-sample-topic  TopicId: Ze76LY9EQuiH0xOIenx_HA  PartitionCount: 1000  ReplicationFactor: 3  Configs: min.insync.replicas=2,segment.bytes=134217728,retention.ms=604800000,message.format.version=2.8-IV2,unclean.leader.election.enable=false,retention.bytes=268435456000

Clean up

To avoid incurring charges on the AWS resources created in this post, delete the MSK Serverless cluster and the Amazon Elastic Compute Cloud (Amazon EC2) instance for your client machine.

  1. On the Amazon MSK console, select the MSK Serverless cluster you used for this solution.
  2. Choose Actions, then choose Delete.
  3. On the Amazon EC2 console, select the instance that you created for your Apache Kafka client machine.
  4. Choose Instance state, then choose Terminate instance.

Conclusion

This post demonstrated how to create an MSK Serverless cluster topic with 1,000 partitions and 7-day retention. With the new quota increases, you can create up to 2,400 partitions per cluster and retain data for as long as you need. If you have comments or feedback, please feel free to leave them in the comments.


About the author

Usama Naseem is a Senior Product Manager for Amazon MSK and focuses on MSK Serverless. Previously, he held product management roles for AWS Lambda and Amazon Fresh. He is passionate about giving customers the tools to build real-time applications in the cloud. Outside of work, he continues to be under the delusion that he will be the best squash player in the world one day.

Run Apache Spark workloads 3.5 times faster with Amazon EMR 6.9

Post Syndicated from Sekar Srinivasan original https://aws.amazon.com/blogs/big-data/run-apache-spark-workloads-3-5-times-faster-with-amazon-emr-6-9/

The Amazon EMR runtime for Apache Spark is a performance-optimized runtime for Apache Spark that is 100% API compatible with open-source Apache Spark. With Amazon EMR release 6.9.0, the EMR runtime for Apache Spark supports equivalent Spark version 3.3.0.

With Amazon EMR 6.9.0, you can now run your Apache Spark 3.x applications faster and at lower cost without requiring any changes to your applications. In our performance benchmark tests, derived from TPC-DS performance tests at 3 TB scale, we found the EMR runtime for Apache Spark 3.3.0 provides a 3.5 times (using total runtime) performance improvement on average over open-source Apache Spark 3.3.0.

In this post, we analyze the results from our benchmark tests running a TPC-DS application on open-source Apache Spark and then on Amazon EMR 6.9, which comes with an optimized Spark runtime that is compatible with open-source Spark. We walk through a detailed cost analysis and finally provide step-by-step instructions to run the benchmark.

Results observed

To evaluate the performance improvements, we used an open-source Spark performance test utility that is derived from the TPC-DS performance test toolkit. We ran the tests on a seven-node (six core nodes and one primary node) c5d.9xlarge EMR cluster with the EMR runtime for Apache Spark, and a second seven-node self-managed cluster on Amazon Elastic Compute Cloud (Amazon EC2) with the equivalent open-source version of Spark. We ran both the tests with data in Amazon Simple Storage Service (Amazon S3).

Dynamic Resource Allocation (DRA) is a great feature to use for varying workloads. However, for a benchmarking exercise where we compare two platforms purely on performance, and test data volumes don’t change (3 TB in our case), we believe it’s best to avoid variability in order to run an apples-to-apples comparison. In our tests in both open-source Spark and Amazon EMR, we disabled DRA while running the benchmarking application.
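For reference, disabling DRA in a Spark session looks like the following sketch (the application name and executor count are illustrative; exact settings depend on your cluster):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder.appName("tpcds-benchmark")
    # Turn off Dynamic Resource Allocation so both platforms run with a
    # fixed, comparable executor footprint for the benchmark.
    .config("spark.dynamicAllocation.enabled", "false")
    .config("spark.executor.instances", "47")  # illustrative fixed count
    .getOrCreate()
)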

The following table shows the total job runtime for all queries (in seconds) in the 3 TB query dataset between Amazon EMR version 6.9.0 and open-source Spark version 3.3.0. We observed that our TPC-DS tests had a total job runtime on Amazon EMR on Amazon EC2 that was 3.5 times faster than that using an open-source Spark cluster of the same configuration.

The per-query speedup on Amazon EMR 6.9 with and without the EMR runtime for Apache Spark is illustrated in the following chart. The horizontal axis shows each query in the 3 TB benchmark. The vertical axis shows the speedup of each query due to the EMR runtime. Notable performance gains are over 10 times faster for TPC-DS queries 24b, 72, 95, and 96.

Cost analysis

The performance improvements of the EMR runtime for Apache Spark directly translate to lower costs. We were able to realize a 67% cost savings running the benchmark application on Amazon EMR in comparison with the cost incurred to run the same application on open-source Spark on Amazon EC2 with the same cluster sizing due to reduced hours of Amazon EMR and Amazon EC2 usage. Amazon EMR pricing is for EMR applications running on EMR clusters with EC2 instances. The Amazon EMR price is added to the underlying compute and storage prices such as EC2 instance price and Amazon Elastic Block Store (Amazon EBS) cost (if attaching EBS volumes). Overall, the estimated benchmark cost in the US East (N. Virginia) Region is $27.01 per run for the open-source Spark on Amazon EC2 and $8.82 per run for Amazon EMR.

The following summarizes the benchmark environment and results (each cluster has 1 primary and 6 core nodes):

  • Open-source Spark on Amazon EC2 – Job runtime: 2.23 hours; estimated cost: $27.01; total EC2 instances: 7; total vCPU: 252; total memory: 504 GiB; root device (Amazon EBS): 20 GiB gp2
  • Amazon EMR on Amazon EC2 – Job runtime: 0.63 hours; estimated cost: $8.82; total EC2 instances: 7; total vCPU: 252; total memory: 504 GiB; root device (Amazon EBS): 20 GiB gp2

Cost breakdown

The following is the cost breakdown for the open-source Spark on Amazon EC2 job ($27.01):

  • Total Amazon EC2 cost – (7 * $1.728 * 2.23) = (number of instances * c5d.9xlarge hourly rate * job runtime in hour) = $26.97
  • Amazon EBS cost – ($0.1/730 * 20 * 7 * 2.23) = (Amazon EBS per GB-hourly rate * root EBS size * number of instances * job runtime in hour) = $0.042

The following is the cost breakdown for the Amazon EMR on Amazon EC2 job ($8.82); a short script that reproduces both calculations follows the list:

  • Total Amazon EMR cost – (7 * $0.27 * 0.63) = ((number of core nodes + number of primary nodes) * c5d.9xlarge Amazon EMR price * job runtime in hour) = $1.19
  • Total Amazon EC2 cost – (7 * $1.728 * 0.63) = ((number of core nodes + number of primary nodes) * c5d.9xlarge instance price * job runtime in hour) = $7.62
  • Amazon EBS cost – ($0.1/730 * 20 GiB * 7 * 0.63) = (Amazon EBS per GB-hourly rate * EBS size * number of instances * job runtime in hour) = $0.012
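The script below reproduces both cost calculations, using the US East (N. Virginia) rates quoted in this post; intermediate values are rounded the same way as in the breakdowns above:

INSTANCES = 7
EC2_RATE = 1.728        # c5d.9xlarge On-Demand rate, USD per hour
EMR_RATE = 0.27         # c5d.9xlarge Amazon EMR price, USD per hour
EBS_RATE = 0.10 / 730   # gp2 rate, USD per GB-hour
EBS_GB = 20

def job_cost(runtime_hours: float, emr: bool = False) -> float:
    ec2 = round(INSTANCES * EC2_RATE * runtime_hours, 2)              # EC2 instances
    emr_uplift = round(INSTANCES * EMR_RATE * runtime_hours, 2) if emr else 0.0
    ebs = round(EBS_RATE * EBS_GB * INSTANCES * runtime_hours, 3)     # root EBS volumes
    return round(ec2 + emr_uplift + ebs, 2)

print(job_cost(2.23))            # open-source Spark on EC2 -> 27.01
print(job_cost(0.63, emr=True))  # Amazon EMR on EC2 -> 8.82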

Set up OSS Spark benchmarking

In the following sections, we provide a brief outline of the steps involved in setting up the benchmarking. For detailed instructions with examples, refer to the GitHub repo.

For our OSS Spark benchmarking, we use the open-source tool Flintrock to launch our Amazon EC2-based Apache Spark cluster. Flintrock provides a quick way to launch an Apache Spark cluster on Amazon EC2 using the command line.

Prerequisites

Complete the following prerequisite steps:

  1. Have Python 3.7.x or above.
  2. Have Pip3 22.2.2 or above.
  3. Add the Python bin directory to your environment path. The Flintrock binary will be installed in this path.
  4. Run aws configure to configure your AWS Command Line Interface (AWS CLI) shell to point to the benchmarking account. Refer to Quick configuration with aws configure for instructions.
  5. Have a key pair with restrictive file permissions to access the OSS Spark primary node.
  6. Create a new S3 bucket in your test account if needed.
  7. Copy the TPC-DS source data as input to your S3 bucket.
  8. Build the benchmark application following the steps provided in Steps to build spark-benchmark-assembly application. Alternatively, you can download a pre-built spark-benchmark-assembly-3.3.0.jar if you want a Spark 3.3.0-based application.

Deploy the Spark cluster and run the benchmark job

Complete the following steps:

  1. Install the Flintrock tool via pip as shown in Steps to setup OSS Spark Benchmarking.
  2. Run the command flintrock configure, which pops up a default configuration file.
  3. Modify the default config.yaml file based on your needs. Alternatively, copy and paste the config.yaml file content into the default configuration file. Then save the file in its original location.
  4. Finally, launch the 7-node Spark cluster on Amazon EC2 via Flintrock.

This should create a Spark cluster with one primary node and six worker nodes. If you see any error messages, double-check the config file values, especially the Spark and Hadoop versions and the attributes of download-source and the AMI.

The OSS Spark cluster doesn’t come with the YARN resource manager. To enable it, we need to configure the cluster.

  1. Download the yarn-site.xml and enable-yarn.sh files from the GitHub repo.
  2. Replace <private ip of primary node> with the IP address of the primary node in your Flintrock cluster.

You can retrieve the IP address from the Amazon EC2 console.

  3. Upload the files to all the nodes of the Spark cluster.
  4. Run the enable-yarn script.
  5. Enable Snappy support in Hadoop (the benchmark job reads Snappy compressed data).
  6. Download the benchmark utility application JAR file spark-benchmark-assembly-3.3.0.jar to your local machine.
  7. Copy this file to the cluster.
  8. Log in to the primary node and start YARN.
  9. Submit the benchmark job on the open-source Spark cluster as shown in Submit the benchmark job.

Summarize the results

Download the test result file from the output S3 bucket s3://$YOUR_S3_BUCKET/EC2_TPCDS-TEST-3T-RESULT/timestamp=xxxx/summary.csv/xxx.csv. (Replace $YOUR_S3_BUCKET with your S3 bucket name.) You can use the Amazon S3 console and navigate to the output S3 location or use the AWS CLI.

The Spark benchmark application creates a timestamp folder and writes a summary file inside a summary.csv prefix. Your timestamp and file name will be different from the one shown in the preceding example.

The output CSV files have four columns without header names. They are:

  • Query name
  • Median time
  • Minimum time
  • Maximum time

The following screenshot shows a sample output. We have manually added column names. The way we calculate the geomean and the total job runtime is based on arithmetic means. We first take the mean of the med, min, and max values using the formula AVERAGE(B2:D2). Then we take a geometric mean of the Avg column using the formula GEOMEAN(E2:E105).
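In code, the same calculation looks like the following sketch (the rows are invented sample values; statistics.geometric_mean requires Python 3.8 or later):

import statistics

# Each row: (query name, median, min, max) runtime in seconds -- invented values.
rows = [
    ("q1", 10.0, 9.0, 12.0),
    ("q2", 30.0, 28.0, 35.0),
]

# Arithmetic mean of med/min/max per query (the Avg column), then a sum for the
# total job runtime and a geometric mean over the Avg column for the geomean.
avgs = [statistics.mean((med, mn, mx)) for _, med, mn, mx in rows]
print("total runtime:", sum(avgs))
print("geomean:", statistics.geometric_mean(avgs))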

Set up Amazon EMR benchmarking

For detailed instructions, see Steps to setup EMR Benchmarking.

Prerequisites

Complete the following prerequisite steps:

  1. Run aws configure to configure your AWS CLI shell to point to the benchmarking account. Refer to Quick configuration with aws configure for instructions.
  2. Upload the benchmark application to Amazon S3.

Deploy the EMR cluster and run the benchmark job

Complete the following steps:

  1. Spin up an EMR cluster from your AWS CLI shell using the command line, as shown in Deploy EMR Cluster and run benchmark job.
  2. Configure Amazon EMR with one primary (c5d.9xlarge) and six core (c5d.9xlarge) nodes. Refer to create-cluster for a detailed description of AWS CLI options.
  3. Store the cluster ID from the response. You need this in the next step.
  4. Submit the benchmark job in Amazon EMR using add-steps in the AWS CLI.

Summarize the results

Summarize the results from the output bucket s3://$YOUR_S3_BUCKET/blog/EMRONEC2_TPCDS-TEST-3T-RESULT in the same manner as we did for the OSS results and compare.

Clean up

To avoid incurring future charges, delete the resources you created using the instructions in the Cleanup section of the GitHub repo.

  1. Stop the EMR and OSS Spark clusters. You may also delete them if you don’t want to retain the content. You can delete these resources by running the script cleanup-benchmark-env.sh from a terminal in your benchmark environment.
  2. If you used AWS Cloud9 as your IDE for building the benchmark application JAR file using Steps to build spark-benchmark-assembly application, you may want to delete the environment as well.

Conclusion

You can run your Apache Spark workloads 3.5 times (based on total runtime) faster and at lower cost without making any changes to your applications by using Amazon EMR 6.9.0.

To keep up to date, subscribe to the Big Data Blog’s RSS feed to learn more about the EMR runtime for Apache Spark, configuration best practices, and tuning advice.

For past benchmark tests, see Run Apache Spark 3.0 workloads 1.7 times faster with Amazon EMR runtime for Apache Spark. Note that the past benchmark result of 1.7 times performance was based on geometric mean. Based on geometric mean, the performance in Amazon EMR 6.9 was two times faster.


About the authors

Sekar Srinivasan is a Sr. Specialist Solutions Architect at AWS focused on Big Data and Analytics. Sekar has over 20 years of experience working with data. He is passionate about helping customers build scalable solutions, modernize their architecture, and generate insights from their data. In his spare time, he likes to work on non-profit projects, especially those focused on underprivileged children’s education.

Prabu Ravichandran is a Senior Data Architect with Amazon Web Services, focused on analytics, data lake architecture, and implementation. He helps customers architect and build scalable and robust solutions using AWS services. In his free time, Prabu enjoys traveling and spending time with family.

Introducing native support for Apache Hudi, Delta Lake, and Apache Iceberg on AWS Glue for Apache Spark, Part 1: Getting Started

Post Syndicated from Akira Ajisaka original https://aws.amazon.com/blogs/big-data/part-1-getting-started-introducing-native-support-for-apache-hudi-delta-lake-and-apache-iceberg-on-aws-glue-for-apache-spark/

AWS Glue is a serverless, scalable data integration service that makes it easier to discover, prepare, move, and integrate data from multiple sources. AWS Glue provides an extensible architecture that supports users across a wide range of data processing use cases.

A common use case is building data lakes on Amazon Simple Storage Service (Amazon S3) using AWS Glue extract, transform, and load (ETL) jobs. Data lakes free you from proprietary data formats defined by the business intelligence (BI) tools and limited capacity of proprietary storage. In addition, data lakes help you break down data silos to maximize end-to-end data insights. As data lakes have grown in size and matured in usage, a significant amount of effort can be spent keeping the data up to date by ensuring files are updated in a transactionally consistent manner.

AWS Glue customers can now use the following open-source data lake storage frameworks: Apache Hudi, Linux Foundation Delta Lake, and Apache Iceberg. These data lake frameworks help you store data and interface data with your applications and frameworks. Although popular data file formats such as Apache Parquet, CSV, and JSON can store big data, data lake frameworks bundle distributed big data files into tabular structures that are otherwise hard to manage. This makes data lake table frameworks the building constructs of databases on data lakes.

We announced general availability for native support for Apache Hudi, Linux Foundation Delta Lake, and Apache Iceberg on AWS Glue for Spark. This feature removes the need to install a separate connector or associated dependencies, manage versions, and simplifies the configuration steps required to use these frameworks in AWS Glue for Apache Spark. With these open-source data lake frameworks, you can simplify incremental data processing in data lakes built on Amazon S3 by using ACID (atomicity, consistency, isolation, durability) transactions, upserts, and deletes.

This post demonstrates how AWS Glue for Apache Spark works with Hudi, Delta, and Iceberg dataset tables, and describes typical use cases on an AWS Glue Studio notebook.

Enable Hudi, Delta Lake, and Iceberg in AWS Glue for Apache Spark

You can use Hudi, Delta, or Iceberg by specifying a new job parameter --datalake-formats. For example, if you want to use Hudi, you need to specify the key as --datalake-formats and the value as hudi. If the option is set, AWS Glue automatically adds the required JAR files into the runtime Java classpath, and that’s all you need. You don’t need to build and configure the required libraries or install a separate connector. You can use the following library versions with this option.

AWS Glue version Hudi Delta Lake Iceberg
AWS Glue 3.0 0.10.1 1.0.0 0.13.1
AWS Glue 4.0 0.12.1 2.1.0 1.0.0
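
Outside of notebooks, you can set the same parameter as a job default argument. The following boto3 sketch shows one way to do this; the job name, role ARN, and script location are placeholders:

import boto3

glue = boto3.client("glue")

glue.create_job(
    Name="iceberg-etl-job",  # placeholder
    Role="arn:aws:iam::123456789012:role/GlueJobRole",  # placeholder
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-bucket/scripts/iceberg_etl.py",  # placeholder
        "PythonVersion": "3",
    },
    GlueVersion="3.0",
    # This single argument makes AWS Glue add the required Iceberg JAR files
    # to the runtime Java classpath.
    DefaultArguments={"--datalake-formats": "iceberg"},
)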

If you want to use other versions of the preceding libraries, you can choose either of the following options:

  • Use the AWS Marketplace connectors for Apache Hudi, Delta Lake, or Apache Iceberg
  • Bring the library JAR files you need and reference them with the --extra-jars job parameter

If you choose either of the preceding options, you need to make sure the --datalake-formats job parameter is unspecified. For more information, see Process Apache Hudi, Delta Lake, Apache Iceberg datasets at scale, part 1: AWS Glue Studio Notebook.

Prerequisites

To continue this tutorial, you need to create the following AWS resources in advance:

Process Hudi, Delta, and Iceberg datasets on an AWS Glue Studio notebook

AWS Glue Studio notebooks provide serverless notebooks with minimal setup, making it possible for data engineers and developers to quickly and interactively explore and process their datasets. You can start using Hudi, Delta, or Iceberg in an AWS Glue Studio notebook by specifying the parameter via the %%configure magic and setting the AWS Glue version to 3.0 as follows:

# Use Glue version 3.0
%glue_version 3.0

# Configure '--datalake-formats' Job parameter
%%configure
{
  "--datalake-formats": "your_comma_separated_formats"
}

For more information, refer to the example notebooks available in the GitHub repository:

For this post, we use an Iceberg DataFrame as an example.

The following sections explain how to use an AWS Glue Studio notebook to create an Iceberg table and append records to the table.

Launch a Jupyter notebook to process Iceberg tables

Complete the following steps to launch an AWS Glue Studio notebook:

  1. Download the Jupyter notebook file.
  2. On the AWS Glue console, choose Jobs in the navigation plane.
  3. Under Create job, select Jupyter Notebook.

  4. Select Upload and edit an existing notebook.
  5. Upload native_iceberg_dataframe.ipynb through Choose file under File upload.

  6. Choose Create.
  7. For Job name, enter native_iceberg_dataframe.
  8. For IAM Role, choose your IAM role.
  9. Choose Start notebook job.

Prepare and configure SparkSession with Iceberg configuration

Complete the following steps to configure SparkSession to process Iceberg tables:

  1. Run the following cell.

You can see --datalake-formats iceberg is set by the %%configure Jupyter magic command. For more information about Jupyter magics, refer to Configuring AWS Glue interactive sessions for Jupyter and AWS Glue Studio notebooks.

  2. Provide your S3 bucket name and bucket prefix for your Iceberg table location in the following cell, and run it.

  3. Run the following cells to initialize SparkSession.

  4. Optionally, if you previously ran the notebook, you need to run the following cell to clean up existing resources.

Now you’re ready to create Iceberg tables using the notebook.

Create an Iceberg table

Complete the following steps to create an Iceberg table using the notebook:

  1. Run the following cell to create a DataFrame (df_products) to write.

If successful, you can see the following table.

  2. Run the following cell to create an Iceberg table using the DataFrame.

  3. Now you can read data from the Iceberg table by running the following cell.

Append records to the Iceberg table

Complete the following steps to append records to the Iceberg table:

  1. Run the following cell to create a DataFrame (df_products_appends) to append.

  2. Run the following cell to append the records to the table.

  3. Run the following cell to confirm that the preceding records are successfully appended to the table.

Clean up

To avoid incurring ongoing charges, clean up your resources:

  1. Run step 4 in the Prepare and configure SparkSession with Iceberg configuration section in this post to delete the table and underlying S3 objects.
  2. On the AWS Glue console, choose Jobs in the navigation plane.
  3. Select your job and on the Actions menu, choose Delete job(s).
  4. Choose Delete to confirm.

Considerations

With this capability, you have three different options to access Hudi, Delta, and Iceberg tables (a combined sketch follows the list):

  • Spark DataFrames, for example spark.read.format("hudi").load("s3://path_to_data")
  • SparkSQL, for example SELECT * FROM table
  • GlueContext, for example create_data_frame.from_catalog, write_data_frame.from_catalog, getDataFrame, and writeDataFrame
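The following sketch shows all three options side by side. It assumes it runs inside an AWS Glue job or notebook where spark and glueContext are already defined; the bucket, database, and table names are placeholders:

# 1. Spark DataFrames (the S3 path is a placeholder)
df = spark.read.format("hudi").load("s3://my-bucket/path_to_data/")

# 2. SparkSQL (assumes the table is registered in the Data Catalog)
spark.sql("SELECT * FROM my_database.my_table").show()

# 3. GlueContext (database and table names are placeholders)
df_from_catalog = glueContext.create_data_frame.from_catalog(
    database="my_database",
    table_name="my_table",
)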

Learn more in Using the Hudi framework in AWS Glue, Using the Delta Lake framework in AWS Glue, and Using the Iceberg framework in AWS Glue.

Delta Lake native integration works with the catalog tables created from native Delta Lake tables by AWS Glue crawlers. This integration does not depend on manifest files. For more information, refer to Introducing native Delta Lake table support with AWS Glue crawlers.

Conclusion

This post demonstrated how to process Apache Hudi, Delta Lake, Apache Iceberg datasets using AWS Glue for Apache Spark. You can integrate your data using those data lake formats easily without struggling with library dependency management.

In subsequent posts in this series, we’ll show you how you can use AWS Glue Studio to visually author your ETL jobs with simpler configuration and setup for these data lake formats, and how to use AWS Glue workflows to orchestrate data pipelines and automate ingestion into your data lakes on Amazon S3 with AWS Glue jobs. Stay tuned!

If you have comments or feedback, please leave them in the comments.


About the authors

Akira Ajisaka is a Senior Software Development Engineer on the AWS Glue team. He likes open-source software and distributed systems. In his spare time, he enjoys playing both arcade and console games.

Noritaka Sekiyama is a Principal Big Data Architect on the AWS Glue team. He is responsible for building software artifacts to help customers. In his spare time, he enjoys cycling with his new road bike.

Savio Dsouza is a Software Development Manager on the AWS Glue team. His teams work on building and innovating in distributed compute systems and frameworks, namely on Apache Spark.

Visualize AWS WAF logs with an Amazon CloudWatch dashboard

Post Syndicated from Diana Alvarado original https://aws.amazon.com/blogs/security/visualize-aws-waf-logs-with-an-amazon-cloudwatch-dashboard/

AWS WAF is a web application firewall service that helps you protect your applications from common exploits that could affect your application’s availability and your security posture. One of the most useful ways to detect and respond to malicious web activity is to collect and analyze AWS WAF logs. You can perform this task conveniently by sending your AWS WAF logs to Amazon CloudWatch Logs and visualizing them through an Amazon CloudWatch dashboard.

In this blog post, I’ll show you how to use Amazon CloudWatch to monitor and analyze AWS WAF activity using the options in CloudWatch metrics, Contributor Insights, and Logs Insights. I’ll also walk you through how to deploy this solution in your own AWS account by using an AWS CloudFormation template.

Prerequisites

This blog post builds on the concepts introduced in the blog post Analyzing AWS WAF Logs in Amazon CloudWatch Logs. That post showed how to natively set up AWS WAF logging to Amazon CloudWatch Logs and discussed the basic options that are available for visualizing and analyzing the data provided in the logs.

The only AWS services that you need to turn on for this solution are Amazon CloudWatch and AWS WAF. The solution assumes that you’ve previously set up AWS WAF log delivery to Amazon CloudWatch Logs. If you have not done so, follow the instructions for AWS WAF logging destinations – CloudWatch Logs.

You will need to provide the following parameters for the CloudFormation template:

  • CloudWatch log group name for the AWS WAF logs
  • The AWS Region for the logs
  • The name of the AWS WAF web access control list (web ACL)

Solution overview

The architecture of the solution is outlined in Figure 1. The solution takes advantage of the native integration available between AWS WAF and CloudWatch, which simplifies the setup and management of this solution.

Figure 1: Solution architecture

In the solution, the logs are sent to CloudWatch (when you enable log delivery). From there, they’re ready to be consumed by all the different service options that CloudWatch offers, including the ones that we’ll use in this solution: CloudWatch Logs Insights and Contributor Insights.

Deploy the solution

Choose the following Launch stack button to launch the CloudFormation stack in your account.

Launch Stack

You’ll be redirected to the CloudFormation service in the AWS US East (N. Virginia) Region, which is the default Region to deploy this solution, although this can vary depending on where your web ACL is located. You can change the Region as preferred. The template will spin up multiple cloud resources, such as the following:

  • CloudWatch Logs Insights queries
  • CloudWatch Contributor Insights visuals
  • CloudWatch dashboard

The solution is quickly deployed to your account and is ready to use in less than 30 minutes. You can use the solution when the status of the stack changes to CREATE_COMPLETE.

As a measure to control costs, you can also choose whether to create the Contributor Insights rules and enable them by default. For more information on costs, see the Cost considerations section later in this post.

Explore and validate the dashboard

When the CloudFormation stack is complete, you can choose the Output tab in the CloudFormation console and then choose the dashboard link. This will take you to the CloudWatch service in the AWS Management Console. The dashboard time range presents information for the last hour of activity by default, and can go up to one week, but keep in mind that Contributor Insights has a maximum time range of 24 hours. You can also select a different dashboard refresh interval from 10 seconds up to 15 minutes.

The dashboard provides the following information from CloudWatch.

Rule name Description
WAF_top_terminating_rules This rule shows the top rules where the requests are being terminated by AWS WAF. This can help you understand the main cause of blocked requests.
WAF_top_ips This rule shows the top source IPs for requests. This can help you understand if the traffic and activity that you see is spread across many IPs or concentrated in a small group of IPs.
WAF_top_countries This rule shows the main source countries for the IPs in the requests. This can help you visualize where the traffic is originating.
WAF_top_user_agents This rule shows the main user agents that are being used to generate the requests. This will help you isolate problematic devices or identify potential false positives.
WAF_top_uri This rule shows the main URIs in the requests that are being evaluated. This can help you identify if one specific path is the target of activity.
WAF_top_http This rule shows the HTTP methods used for the requests examined by AWS WAF. This can help you understand the pattern of behavior of the traffic.
WAF_top_referrer_hosts This rule shows the main referrer from which requests are being sent. This can help you identify incorrect or suspicious origins of requests based on the known application flow.
WAF_top_rate_rules This rule shows the main rate rules being applied to traffic. It helps you understand volumetric activity identified by AWS WAF.
WAF_top_labels This rule shows the top labels found in logs. This can help you visualize the main rules that are matching on the requests evaluated by AWS WAF.

The dashboard also provides the following information from the default CloudWatch metrics sent by AWS WAF.

Rule name Description
AllowedvsBlockedRequests This metric shows the number of all blocked and allowed requests. This can help you understand the number of requests that AWS WAF is actively blocking.
Bot Requests vs non-Bot requests This visual shows the number of requests identified as bots versus non-bots (if you’re using AWS WAF Bot Control).
All Requests This metric shows the number of all requests, separated by bot and non-bot origin. This can help you understand all requests that AWS WAF is evaluating.
CountedRequests This metric shows the number of all counted requests. This can help you understand the requests that are matching a rule but not being blocked, and aid the decision of a configuration change during the testing phase.
CaptchaRequests This metric shows requests that go through the CAPTCHA rule.

Figure 2 shows an example of how the CloudWatch dashboard displays the data within this solution. You can rearrange and customize the elements within the dashboard as needed.

You can review each of the queries and rules deployed with this solution. You can also customize these baseline queries and rules to provide more detailed information or to add custom queries and rules to the solution code. For more information on how to build queries and use CloudWatch Logs and Contributor Insights, see the CloudWatch documentation.

Use the dashboard for monitoring

After you’ve set up the dashboard, you can monitor the activity of the sites that are protected by AWS WAF. If suspicious activity is reported, you can use the visuals to understand the traffic in more detail, and drive incident response actions as needed.

Let’s consider an example of how to use your new dashboard and its data to drive security operations decisions. Suppose that you have a website that sells custom clothing at a bargain price. It has a sign-up link to receive offers, and you’re getting reports of unusual activity by the application team. By looking at the metrics for the web ACL that protects the site, you can see the main country for source traffic and the contributing URIs, as shown in Figure 3. You can also see that most of the activity is being detected by rules that you have in place, so you can set the rules to block traffic, or if they are already blocking, you can just monitor the activity.

You can use the same visuals to decide whether an AWS WAF rule with high activity can be changed to autoblock suspicious web traffic without affecting valid customer traffic. By looking at the top terminating rules and cross-referencing information, such as source IPs, user agents, top URIs, and other request identifiers, you can understand the traffic pattern and activity of different applications and endpoints. From here, you can investigate further by using specific queries with CloudWatch Logs Insights.

Operational and security management with CloudWatch Logs Insights

You can use CloudWatch Logs Insights to interactively search and analyze log data in Amazon CloudWatch Logs using advanced queries to effectively investigate operational issues and security incidents.

Examine a bot reported as a false positive

You can use CloudWatch Logs Insights to identify requests that have specific labels to understand where the traffic is originating from based on source IP address and other essential event details. A simple example is investigating requests flagged as potential false positives.

Imagine that you have a reported false positive request that was flagged as a non-browser by AWS WAF Bot Control. You can run the non-browser user agent query that was created by the provided template on CloudWatch Logs Insights, as shown in the following example, and then verify the source IPs for the top hits for this rule group. Or you can look for a specific request that has been flagged as a false positive, in order to review the details and make adjustments as needed.

fields @timestamp, httpRequest.clientIp
| filter @message like "awswaf:managed:aws:botcontrol:signal:non_browser_user_agent"
| parse @message '"labels":[*]' as Labels
| stats count(*) as requestCount by httpRequest.clientIp
| display @timestamp, httpRequest.clientIp, httpRequest.uri, Labels
| sort requestCount desc
| limit 10

The non-browser user agent query also allows you to confirm whether this request has other rule hits that were in count mode and were non-terminating; you can do this by examining the labels. If multiple rules match the requests, that can be an indicator of suspicious activity.

If you have a CAPTCHA challenge configured on the endpoint, you can also look at CAPTCHA responses. The CaptchaTokenQueryDefinition query provided in this solution uses a variation of the preceding format, and can display the main IPs from which bad tokens are being sent. An example query is shown following, along with the query results in Figure 4. If you have signals from non-browser user agents and missing CAPTCHA tokens, that is a strong indicator of suspicious activity.

fields @timestamp, httpRequest.clientIp 
| filter captchaResponse.failureReason = "TOKEN_MISSING" 
| stats count(*) as requestCount by httpRequest.clientIp, httpRequest.country 
| sort requestCount desc 
| limit 10
Figure 4: Main IP addresses and number of counts for CAPTCHA responses

This information can provide an indication of the main source of activity. You can then use other visuals, like top user agents or top referrers, to provide more context to the information and inform further actions, such as adding new rules to the AWS WAF configuration.

You can adapt the queries provided in the sample solution to other use cases by using the fields provided in the left-hand pane of CloudWatch Logs Insights.
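If you prefer to run these queries programmatically instead of in the console, you can use the CloudWatch Logs Insights API. The following boto3 sketch runs the preceding CAPTCHA query; the log group name is a placeholder:

import time
import boto3

logs = boto3.client("logs")

query = """fields @timestamp, httpRequest.clientIp
| filter captchaResponse.failureReason = "TOKEN_MISSING"
| stats count(*) as requestCount by httpRequest.clientIp, httpRequest.country
| sort requestCount desc
| limit 10"""

# 'aws-waf-logs-example' is a placeholder log group name.
start = logs.start_query(
    logGroupName="aws-waf-logs-example",
    startTime=int(time.time()) - 3600,  # last hour
    endTime=int(time.time()),
    queryString=query,
)

# Poll until the query finishes, then print the result rows.
results = logs.get_query_results(queryId=start["queryId"])
while results["status"] in ("Scheduled", "Running"):
    time.sleep(1)
    results = logs.get_query_results(queryId=start["queryId"])
print(results["results"])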

Cost considerations

Configuring AWS WAF to send logs to Amazon CloudWatch logs doesn’t have an additional cost. The cost incurred is for the use of the CloudWatch features and services, such as log storage and retention, Contributor Insights rules enabled, Logs Insights queries run, matched log events, and CloudWatch dashboards. For detailed information on the pricing of these features, see the CloudWatch Logs pricing information. You can also get an estimate of potential costs by using the AWS pricing calculator for CloudWatch.

One way to help offset the cost of CloudWatch features and services is to restrict the use of the dashboard and enforce a log retention policy for AWS WAF that makes it cost effective. If you use the queries and monitoring only as-needed, this can also help reduce costs. By limiting the running of queries and the matched log events for the Contributor Insights rules, you can enable the rules only when you need them. AWS WAF also provides the option to filter the logs that are sent when logging is enabled. For more information, see AWS WAF log filtering.

Conclusion

In this post, you learned how to use a pre-built CloudWatch dashboard to monitor AWS WAF activity by using metrics and Contributor Insights rules. The dashboard can help you identify traffic patterns and activity, and you can use the sample Logs Insights queries to explore the log information in more detail and examine false positives and suspicious activity, for rule tuning.

For more information on AWS WAF and the features mentioned in this post, see the AWS WAF documentation.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on AWS WAF re:Post.

Want more AWS Security news? Follow us on Twitter.

Diana Alvarado

Diana is a senior security solutions architect at AWS. She is passionate about helping customers solve difficult cloud challenges, and she has a soft spot for all things logs.

Automate deployment and version updates for Amazon Kinesis Data Analytics applications with AWS CodePipeline

Post Syndicated from Anand Shah original https://aws.amazon.com/blogs/big-data/automate-deployment-and-version-updates-for-amazon-kinesis-data-analytics-applications-with-aws-codepipeline/

Amazon Kinesis Data Analytics is the easiest way to transform and analyze streaming data in real time using Apache Flink. Customers are already using Kinesis Data Analytics to perform real-time analytics on fast-moving data generated from data sources like IoT sensors, change data capture (CDC) events, gaming, social media, and many others. Apache Flink is a popular open-source framework and distributed processing engine for stateful computations over unbounded and bounded data streams.

Although building Apache Flink applications is typically the responsibility of a data engineering team, automating the deployment and provisioning infrastructure as code (IaC) is usually owned by the platform (or DevOps) team.

The following are typical responsibilities of the data engineering role:

  • Write code for real-time analytics Apache Flink applications
  • Roll out new application versions or roll them back (for example, in the case of a critical bug)

The following are typical responsibilities of the platform role:

  • Write code for IaC
  • Provision the required resources in the cloud and manage their access

In this post, we show how you can automate deployment and version updates for Kinesis Data Analytics applications and allow both the platform and data engineering teams to effectively collaborate and co-own the final solution using AWS CodePipeline with the AWS Cloud Development Kit (AWS CDK).

Solution overview

To demonstrate the automated deployment and version update of a Kinesis Data Analytics application, we use the following example real-time data analytics architecture for this post.

Real-time data analytics architecture

The workflow includes the following steps:

  1. An AWS Lambda function (acting as data source) is the event producer pushing events on demand to Amazon Kinesis Data Streams when invoked.
  2. The Kinesis data stream receives and stores real-time events.
  3. The Kinesis Data Analytics application reads events from the data stream and performs real-time analytics on it.

Generic architecture

You can refer to the following generic architecture to adapt this example to your preferred CI/CD tool (for example, Jenkins). The overall deployment process is divided into three high-level parts:

  1. Infrastructure CI/CD – This portion is highlighted in orange. The infrastructure CI/CD pipeline is responsible for deploying all the real-time streaming architecture components, including the Kinesis Data Analytics application and any connected resources typically deployed using AWS CloudFormation.
  2. ApplicationStack – This portion is highlighted in gray. The application stack is deployed by the infrastructure CI/CD component using AWS CloudFormation.
  3. Application CI/CD – This portion is highlighted in green. The application CI/CD pipeline updates the Kinesis Data Analytics application in three steps:
    1. The pipeline builds the Java or Python source code of the Kinesis Data Analytics application and produces the application as a binary file.
    2. The pipeline pushes the latest binary file to the Amazon Simple Storage Service (Amazon S3) artifact bucket after a successful build, because Kinesis Data Analytics application binary files are referenced from S3.
    3. The S3 bucket file put event triggers a Lambda function, which updates the version of the Kinesis Data Analytics application by deploying the latest binary (a sketch of this function follows the list).
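
The following is a minimal sketch of that version-update Lambda function, assuming a hypothetical application name; the UpdateApplication call it makes is the same one discussed later in this post, and it restarts the application.

import boto3

kda = boto3.client("kinesisanalyticsv2")

def handler(event, context):
    # The S3 put event carries the bucket and key of the newly built JAR.
    s3_info = event["Records"][0]["s3"]
    bucket = s3_info["bucket"]["name"]
    key = s3_info["object"]["key"]

    app_name = "kda-flink-application"  # hypothetical application name

    # UpdateApplication requires the current version ID for optimistic locking.
    version_id = kda.describe_application(ApplicationName=app_name)[
        "ApplicationDetail"]["ApplicationVersionId"]

    # Point the application code at the new JAR; this restarts the application.
    kda.update_application(
        ApplicationName=app_name,
        CurrentApplicationVersionId=version_id,
        ApplicationConfigurationUpdate={
            "ApplicationCodeConfigurationUpdate": {
                "CodeContentUpdate": {
                    "S3ContentLocationUpdate": {
                        "BucketARNUpdate": f"arn:aws:s3:::{bucket}",
                        "FileKeyUpdate": key,
                    }
                }
            }
        },
    )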

The following diagram illustrates this workflow.

Workflow illustrated

CI/CD architecture with CodePipeline

In this post, we implement the generic architecture using CodePipeline. The following diagram illustrates our updated architecture.

Updated architecture illustrated

The final solution includes the following steps:

  1. The platform (DevOps) team and data engineering team push their source code to their respective code repositories.
  2. CodePipeline deploys the whole infrastructure as three stacks (see the CDK sketch after this list):
    1. InfraPipelineStack – Contains a pipeline to deploy the overall infrastructure.
    2. ApplicationPipelineStack – Contains a pipeline to build and deploy Kinesis Data Analytics application binaries. In this post, we build a Java source using the JavaBuildPipeline AWS CDK construct. You can use the PythonBuildPipeline AWS CDK construct to build a Python source.
    3. ApplicationStack – Contains real-time data analytics pipeline resources including Lambda (data source), Kinesis Data Streams (storage), and Kinesis Data Analytics (Apache Flink application).
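
For orientation, a CDK application entry point that wires these stacks together might look like the following Python sketch; the module paths are illustrative assumptions, not the repository's exact layout.

import aws_cdk as cdk

# Illustrative module paths -- the actual repository layout may differ.
from infrastructure_cdk.application_pipeline_stack import ApplicationPipelineStack
from infrastructure_cdk.infra_pipeline_stack import InfraPipelineStack

app = cdk.App()

# Pipeline that builds and publishes the Kinesis Data Analytics binaries.
ApplicationPipelineStack(app, "ApplicationPipelineStack")

# Pipeline that deploys ApplicationStack (Lambda, Kinesis Data Streams,
# and the Kinesis Data Analytics application) through CloudFormation.
InfraPipelineStack(app, "InfraPipelineStack")

app.synth()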

Deploy resources using AWS CDK

The following GitHub repository contains the AWS CDK code to create all the necessary resources for the data pipeline. This removes opportunities for manual error, increases efficiency, and ensures consistent configurations over time. To deploy the resources, complete the following steps:

  1. Clone the GitHub repository to your local computer using the following command:
git clone https://github.com/aws-samples/automate-deployment-and-version-update-of-kda-application
  2. Download and install the latest Node.js.
  3. Run the following command to install the latest version of AWS CDK:
npm install -g aws-cdk
  4. Run cdk bootstrap to initialize the AWS CDK environment in your AWS account. Replace your AWS account ID and Region before running the following command.
cdk bootstrap aws://123456789012/us-east-1

To learn more about the bootstrapping process, refer to Bootstrapping.

Part 1: Data engineering and platform teams push source code to their code repositories

The data engineering and platform teams begin work in their respective code repositories, as illustrated in the following figure.

In this post, we use two folders instead of two GitHub repositories, which you can find under the root folder of the cloned repository:

  • kinesis-analytics-application – This folder contains example source code of the Kinesis Data Analytics application. This represents your Kinesis Data Analytics application source code developed by your data engineering team.
  • infrastructure-cdk – This folder contains example AWS CDK source code of the final solution used for provisioning all the required resources and CodePipeline. You can reuse this code for your Kinesis Data Analytics application deployment.

Application development teams usually store the application source code in Git repositories. For demonstration purposes, we use the source code as a .zip file downloaded from GitHub instead of connecting CodePipeline to the GitHub repository. You may want to connect your source repository directly with CodePipeline. To learn more about how to connect, refer to Create a connection to GitHub.

Part 2: The platform team deploys the application pipeline

The following figure illustrates the next step in the workflow.

Next step in the workflow illustrated

In this step, you deploy the first pipeline to build the Java source code from kinesis-analytics-application. Complete the following steps to deploy ApplicationPipelineStack:

  1. Open your terminal, bash, or command window depending on your OS.
  2. Switch the current path to the folder infrastructure-cdk.
  3. Run npm install to download all dependencies.
  4. Run cdk deploy ApplicationPipelineStack to deploy the application pipeline.

This process should take about 5 minutes to complete and deploys the following resources to your AWS account, highlighted in green in the preceding diagram:

  • CodePipeline, containing stages for AWS CodeBuild and AWS CodeDeploy
  • An S3 bucket to store binaries
  • A Lambda function to update the Kinesis Data Analytics application JAR after manual approval

Trigger an automatic build for the application pipeline

After the cdk deploy command is successful, complete the following steps to automatically run the pipeline:

  1. Download the source code .zip file.
  2. On the AWS CloudFormation console, choose Stacks in the navigation pane.
  3. Choose the stack ApplicationPipelineStack.
  4. On the Outputs tab, choose the link for the key ArtifactBucketLink.

You’re redirected to the S3 artifact bucket.

  5. Choose Upload.
  6. Upload the source code .zip file you downloaded.

The first pipeline run (shown as Auto Build in the following diagram) starts automatically and takes about 5 minutes to reach the manual approval stage. The pipeline automatically downloads the source code from the artifact bucket, builds the Java project kinesis-analytics-application using Maven, and publishes the output binary JAR file back to the artifact bucket under the directory jars.

View the application pipeline run

Complete the following steps to view the application pipeline run:

  1. On the AWS CloudFormation console, navigate to the stack ApplicationPipelineStack.
  2. On the Outputs tab, choose the link for the key ApplicationCodePipelineLink.

You’re redirected to the pipeline details page. You can see a detailed view of the pipeline, including the state of each action in each stage and the state of the transitions.

Do not approve the build for the manual approval stage yet; this is done later.

Part 3: The platform team deploys the infrastructure pipeline

The application pipeline run publishes a JAR file named kinesis-analytics-application-final.jar to the artifact bucket. Next, we deploy the Kinesis Data Analytics architecture. Complete the following steps to deploy the example flow:

  1. Open a terminal, bash, or command window depending on your OS.
  2. Switch the current path to the folder infrastructure-cdk.
  3. Run cdk deploy InfraPipelineStack to deploy the infrastructure pipeline.

This process should take about 5 minutes to complete and deploys a pipeline containing stages for CodeBuild and CodeDeploy to your AWS account, as highlighted in green in the following diagram.

When the cdk deploy is complete, the infrastructure pipeline run starts automatically (shown as Auto Build 1 in the following diagram) and takes about 10 minutes to download the source code from the artifact bucket, build the AWS CDK project infrastructure-stack, and deploy ApplicationStack automatically to your AWS account. When the infrastructure pipeline run is complete, the following resources are deployed to your account (shown in green in the following diagram):

  • A CloudFormation template named app-ApplicationStack
  • A Lambda function acting as a data source
  • A Kinesis data stream acting as the stream storage
  • A Kinesis Data Analytics application with the first version of kinesis-analytics-application-final.jar

View the infrastructure pipeline run

Complete the following steps to view the infrastructure pipeline run:

  1. On the AWS CloudFormation console, navigate to the stack InfraPipelineStack.
  2. On the Outputs tab, choose the link for the key InfraCodePipelineLink.

You’re redirected to the pipeline details page. You can see a detailed view of the pipeline, including the state of each action in each stage and the state of the transitions.

Part 4: The data engineering team deploys the application

Now your account has everything in place for the data engineering team to work independently and roll out new versions of the Kinesis Data Analytics application. You can approve the respective application build from the application pipeline to deploy new versions of the application. The following diagram illustrates the full workflow.

Diagram illustrates the full workflow.

The build process starts automatically when it detects changes in the source code. You can test a version update by re-uploading the source code .zip file to the S3 artifact bucket. In a real-world use case, you update the main branch either via a pull request or by merging your changes, and this action triggers a new pipeline run automatically.

View the current application version

To view the current version of the Kinesis Data Analytics application, complete the following steps:

  1. On the AWS CloudFormation console, navigate to the stack InfraPipelineStack.
  2. On the Outputs tab, choose the link for the key KDAApplicationLink.

You’re redirected to the Kinesis Data Analytics application details page. You can find the current application version by looking at Version ID.

Find the current application version by looking at Version ID

Approve the application deployment

Complete the following steps to approve the deployment (or version update) of the Kinesis Data Analytics application:

  1. On the AWS CloudFormation console, navigate to the stack ApplicationPipelineStack.
  2. On the Outputs tab, choose the link for the key ApplicationCodePipelineLink.
  3. Choose Review from the pipeline approval stage.
  4. When prompted, choose Approve to provide approval (optionally adding any comments) for the Kinesis Data Analytics application deployment or version update.
  5. Repeat the steps mentioned earlier to view the current application version.

You should see the application version as defined in Version ID increased by one, as shown in the following screenshot.

Application version as defined in Version ID increased by one

Deploying a new version of the Kinesis Data Analytics application will cause a downtime of around 5 minutes because the Lambda function responsible for the version update makes the API call UpdateApplication, which restarts the application after updating the version. However, the application resumes stream processing where it left off after the restart.

Clean up

Complete the following steps to delete your resources and stop incurring costs:

  1. On the AWS CloudFormation console, select the stack InfraPipelineStack and choose Delete.
  2. Select the stack app-ApplicationStack and choose Delete.
  3. Select stack ApplicationPipelineStack and choose Delete.
  4. On the Amazon S3 console, select the bucket with the name starting with javaappCodePipeline and choose Empty.
  5. Enter permanently delete to confirm the choice.
  6. Select the bucket again and choose Delete.
  7. Confirm the action by entering the bucket name when prompted.
  8. Repeat these steps to delete the bucket with the name starting with infrapipelinestack-pipelineartifactsbucket.

Summary

This post demonstrated how to automate deployment and version updates for your Kinesis Data Analytics applications using CodePipeline and AWS CDK.

For more information, see Continuous integration and delivery (CI/CD) using CDK Pipelines and CodePipeline tutorials.


About the Author

Anand Shah is a Big Data Prototyping Solutions Architect at AWS. He works with AWS customers and their engineering teams to build prototypes using AWS analytics services and purpose-built databases. Anand helps customers solve their most challenging problems using art-of-the-possible technology. He enjoys beaches in his leisure time.

How to run AWS CloudHSM workloads in container environments

Post Syndicated from Derek Tumulak original https://aws.amazon.com/blogs/security/how-to-run-aws-cloudhsm-workloads-on-docker-containers/

January 25, 2023: We updated this post to reflect the fact that CloudHSM SDK3 does not support serverless environments and we strongly recommend deploying SDK5.


AWS CloudHSM provides hardware security modules (HSMs) in the AWS Cloud. With CloudHSM, you can generate and use your own encryption keys in the AWS Cloud, and manage your keys by using FIPS 140-2 Level 3 validated HSMs. Your HSMs are part of a CloudHSM cluster. CloudHSM automatically manages synchronization, high availability, and failover within a cluster.

CloudHSM is part of the AWS Cryptography suite of services, which also includes AWS Key Management Service (AWS KMS), AWS Secrets Manager, and AWS Private Certificate Authority (AWS Private CA). AWS KMS, Secrets Manager, and AWS Private CA are fully managed services that are convenient to use and integrate. You’ll generally use CloudHSM only if your workload requires single-tenant HSMs under your own control, or if you need cryptographic algorithms or interfaces that aren’t available in the fully managed alternatives.

CloudHSM offers several options for you to connect your application to your HSMs, including PKCS#11, Java Cryptography Extensions (JCE), OpenSSL Dynamic Engine, or Microsoft Cryptography API: Next Generation (CNG). Regardless of which library you choose, you’ll use the CloudHSM client to connect to HSMs in your cluster.

In this blog post, I’ll show you how to use Docker to develop, deploy, and run applications by using the CloudHSM SDK, and how to manage and orchestrate workloads by using tools and services like Amazon Elastic Container Service (Amazon ECS), Kubernetes, Amazon Elastic Kubernetes Service (Amazon EKS), and Jenkins.

Solution overview

This solution demonstrates how to create a Docker container that uses the CloudHSM JCE SDK to generate a key and use it to encrypt and decrypt data.

Note: In this example, you must manually enter the crypto user (CU) credentials as environment variables when you run the container. For production workloads, you’ll need to consider how to secure and automate the handling and distribution of these credentials. You should work with your security or compliance officer to ensure that you’re using an appropriate method of securing HSM login credentials. For more information on securing credentials, see AWS Secrets Manager.
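
As one illustration of what that automation could look like, the following is a minimal boto3 sketch that reads CU credentials from a hypothetical Secrets Manager secret; you could pass the returned values to docker run as the HSM_USER and HSM_PASSWORD environment variables shown later in this post.

import json

import boto3

def get_cu_credentials(secret_id="cloudhsm/cu-credentials"):  # hypothetical secret name
    """Fetch the crypto user (CU) credentials from AWS Secrets Manager."""
    client = boto3.client("secretsmanager")
    secret = json.loads(client.get_secret_value(SecretId=secret_id)["SecretString"])
    return secret["HSM_USER"], secret["HSM_PASSWORD"]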

Figure 1 shows the solution architecture. The Java application, running in a Docker container, integrates with JCE and communicates with CloudHSM instances in a CloudHSM cluster through HSM elastic network interfaces (ENIs). The Docker container runs in an EC2 instance, and access to the HSM ENIs is controlled with a security group.

Figure 1: Architecture diagram

Figure 1: Architecture diagram

Prerequisites

To implement this solution, you need to have working knowledge of the following items:

  • CloudHSM
  • Docker 20.10.17 – used at the time of this post
  • Java 8 or Java 11 – supported at the time of this post
  • Maven 3.0.5 – used at the time of this post

Here’s what you’ll need to follow along with my example:

  1. An active CloudHSM cluster with at least one active HSM instance. You can follow the CloudHSM getting started guide to create, initialize, and activate a CloudHSM cluster.

    Note: For a production cluster, you should have at least two active HSM instances spread across Availability Zones in the Region.

  2. An Amazon Linux 2 EC2 instance in the same virtual private cloud (VPC) in which you created your CloudHSM cluster. The Amazon Elastic Compute Cloud (Amazon EC2) instance must have the CloudHSM cluster security group attached—this security group is automatically created during the cluster initialization and is used to control network access to the HSMs. To learn about attaching security groups to allow EC2 instances to connect to your HSMs, see Create a cluster in the AWS CloudHSM User Guide.
  3. A CloudHSM crypto user (CU) account. You can create a CU by following the steps in the topic Managing HSM users in AWS CloudHSM in the AWS CloudHSM User Guide.

Solution details

In this section, I’ll walk you through how to download, configure, compile, and run a solution in Docker.

To set up Docker and run the application that encrypts and decrypts data with a key in AWS CloudHSM

  1. On your Amazon Linux EC2 instance, install Docker by running the following command.

    # sudo yum -y install docker

  2. Start the docker service.

    # sudo service docker start

  3. Create a new directory and move to it. In my example, I use a directory named cloudhsm_container. You’ll use the new directory to configure the Docker image.

    # mkdir cloudhsm_container
    # cd cloudhsm_container

  4. Copy the CloudHSM cluster’s trust anchor certificate (customerCA.crt) to the directory that you just created. You can find the trust anchor certificate on a working CloudHSM client instance under the path /opt/cloudhsm/etc/customerCA.crt. The certificate is created during initialization of the CloudHSM cluster and is required to connect to the CloudHSM cluster. This enables our application to validate that the certificate presented by the CloudHSM cluster was signed by our trust anchor certificate.
  5. In your new directory (cloudhsm_container), create a new file with the name run_sample.sh that includes the following contents. The script runs the Java class that is used to generate an Advanced Encryption Standard (AES) key to encrypt and decrypt your data.
    #! /bin/bash
    
    # start application
    echo -e "\n* Entering AES GCM encrypt/decrypt sample in Docker ... \n"
    
    java -ea -jar target/assembly/aesgcm-runner.jar -method environment
    
    echo -e "\n* Exiting AES GCM encrypt/decrypt sample in Docker ... \n"

  6. In the new directory, create another new file and name it Dockerfile (with no extension). This file will specify that the Docker image is built with the following components:
    • The CloudHSM client package.
    • The CloudHSM Java JCE package.
    • OpenJDK 1.8 (Java 8). This is needed to compile and run the Java classes and JAR files.
    • Maven, a build automation tool that is needed to assist with building the Java classes and JAR files.
    • The AWS CloudHSM Java JCE samples that will be downloaded and built as part of the solution.
  7. Cut and paste the following contents into Dockerfile.

    Note: You will need to customize your Dockerfile, as follows:

    • Make sure to specify the SDK version to replace the one specified in the pom.xml file in the sample code. As of the writing of this post, the most current version is 5.7.0. To find the SDK version, follow the steps in the topic Check your client SDK version. For more information, see the Building section in the README file for the Cloud HSM JCE examples.
    • Make sure to update the HSM_IP line with the IP of an HSM in your CloudHSM cluster. You can get your HSM IPs from the CloudHSM console, or by running the describe-clusters AWS CLI command.
      # Use the Amazon Linux image
      FROM amazonlinux:2

      # Pass HSM IP address as a build argument
      ARG HSM_IP

      # Install CloudHSM client
      RUN yum install -y https://s3.amazonaws.com/cloudhsmv2-software/CloudHsmClient/EL7/cloudhsm-jce-latest.el7.x86_64.rpm

      # Install Java, Maven, wget, unzip and ncurses-compat-libs
      RUN yum install -y java maven wget unzip ncurses-compat-libs

      # Create a work dir
      WORKDIR /app

      # Download sample code
      RUN wget https://github.com/aws-samples/aws-cloudhsm-jce-examples/archive/refs/heads/sdk5.zip

      # Unzip sample code
      RUN unzip sdk5.zip

      # Change to the created directory
      WORKDIR aws-cloudhsm-jce-examples-sdk5

      # Build JAR files using the installed CloudHSM JCE Provider version
      RUN export CLOUDHSM_CLIENT_VERSION=`rpm -qi cloudhsm-jce | awk -F': ' '/Version/ {print $2}'` \
          && mvn validate -DcloudhsmVersion=$CLOUDHSM_CLIENT_VERSION \
          && mvn clean package -DcloudhsmVersion=$CLOUDHSM_CLIENT_VERSION

      # Configure cloudhsm-client
      COPY customerCA.crt /opt/cloudhsm/etc/
      RUN /opt/cloudhsm/bin/configure-jce -a $HSM_IP

      # Copy the run_sample.sh script
      COPY run_sample.sh .

      # Run the script
      CMD ["bash","run_sample.sh"]

  8. Now you’re ready to build the Docker image. Run the following command to build an image named jce_sample, using the Dockerfile that you created in step 6.

    # sudo docker build --build-arg HSM_IP="<your HSM IP address>" -t jce_sample .

  9. To run a Docker container from the Docker image that you just created, run the following command. Make sure to replace the user and password with your actual CU username and password. (If you need help setting up your CU credentials, see prerequisite 3. For more information on how to provide CU credentials to the AWS CloudHSM Java JCE Library, see Providing credentials to the JCE provider in the CloudHSM User Guide).

    # sudo docker run --env HSM_USER=<user> --env HSM_PASSWORD=<password> jce_sample

    If successful, the output should look like this:

    	* Entering AES GCM encrypt/decrypt sample in Docker ... 
    
    	737F92D1B7346267D329C16E
    	Successful decryption
    
    	* Exiting AES GCM encrypt/decrypt sample in Docker ...

Conclusion

This solution provides an example of how to run CloudHSM client workloads in Docker containers. You can use the solution as a reference to implement your cryptographic application in a way that benefits from the high availability and load balancing built in to CloudHSM without compromising the flexibility that Docker provides for developing, deploying, and running applications.

If you have comments about this post, submit them in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Derek Tumulak

Derek joined AWS in May 2021 as a Principal Product Manager. He is a data protection and cybersecurity expert who is enthusiastic about assisting customers with a wide range of sophisticated use cases.

How to improve security incident investigations using Amazon Detective finding groups

Post Syndicated from Anna McAbee original https://aws.amazon.com/blogs/security/how-to-improve-security-incident-investigations-using-amazon-detective-finding-groups/

Uncovering the root cause of an Amazon GuardDuty finding can be a complex task, requiring security operations center (SOC) analysts to collect a variety of logs, correlate information across logs, and determine the full scope of affected resources.

Sometimes you need to do this type of in-depth analysis because investigating individual security findings in isolation doesn’t always capture the full impact of affected resources.

With Amazon Detective, you can analyze and visualize various logs and relationships between AWS entities to streamline your investigation. In this post, you will learn how to use a feature of Detective—finding groups—to simplify and expedite the investigation of a GuardDuty finding.

Detective uses machine learning, statistical analysis, and graph theory to generate visualizations that help you to conduct faster and more efficient security investigations. The finding groups feature reduces triage time and provides a clear view of related GuardDuty findings. With finding groups, you can investigate entities and security findings that might have been overlooked in isolation. Finding groups also map GuardDuty findings and their relevant tactics, techniques, and procedures to the MITRE ATT&CK framework. By using MITRE ATT&CK, you can better understand the event lifecycle of a finding group.

Finding groups are automatically enabled for both existing and new customers in AWS Regions that support Detective. There is no additional charge for finding groups. If you don’t currently use Detective, you can start a free 30-day trial.

Use finding groups to simplify an investigation

Because finding groups are enabled by default, you start your investigation by simply navigating to the Detective console. You will see these finding groups in two different places: the Summary and the Finding groups pages. On the Finding groups overview page, you can also use the search capability to look for collected metadata for finding groups, such as severity, title, finding group ID, observed tactics, AWS accounts, entities, finding ID, and status. The entities information can help you narrow down finding groups that are more relevant for specific workloads.

Figure 1 shows the finding groups area on the Summary page in the Amazon Detective console, which provides high-level information on some of the individual finding groups.

Figure 1: Detective console summary page

Figure 1: Detective console summary page

Figure 2 shows the Finding groups overview page, with a list of finding groups filtered by status. The finding group shown has a status of Active.

Figure 2: Detective console finding groups overview page

Figure 2: Detective console finding groups overview page

You can choose the finding group title to see details like the severity of the finding group, the status, scope time, parent or child finding groups, and the observed tactics from the MITRE ATT&CK framework. Figure 3 shows a specific finding group details page.

Figure 3: Detective console showing a specific finding group details page

Figure 3: Detective console showing a specific finding group details page

Below the finding group details, you can review the entities and associated findings for this finding group, as shown in Figure 4. From the Involved entities tab, you can pivot to the entity profile pages for more details about that entity’s behavior. From the Involved findings tab, you can select a finding to review the details pane.

Figure 4: Detective console showing involved entities of a finding group

Figure 4: Detective console showing involved entities of a finding group

In Figure 4, the search functionality on the Involved entities tab is being used to look at involved entities that are of type AWS role or EC2 instance. With such a search filter in Detective, you have more data in a single place to understand which Amazon Elastic Compute Cloud (Amazon EC2) instances and AWS Identity and Access Management (IAM) roles were involved in the GuardDuty finding and what findings were associated with each entity. You can also select these different entities to see more details. With finding groups, you no longer have to craft specific log searches or search for the AWS resources and entities that you should investigate. Detective has done this correlation for you, which reduces the triage time and provides a more comprehensive investigation.

With the release of finding groups, Detective infers relationships between findings and groups them together, providing a more convenient starting point for investigations. Detective has evolved from helping you determine which resources are related to a single entity (for example, what EC2 instances are communicating with a malicious IP), to correlating multiple related findings together and showing what MITRE tactics are aligned across those findings, helping you better understand a more advanced single security event.

Conclusion

In this blog post, we showed how you can use Detective finding groups to simplify security investigations through grouping related GuardDuty findings and AWS entities, which provides a more comprehensive view of the lifecycle of the potential security incident. Finding groups are automatically enabled for both existing and new customers in AWS Regions that support Detective. There is no additional charge for finding groups. If you don’t currently use Detective, you can start a free 30-day trial. For more information on finding groups, see Analyzing finding groups in the Amazon Detective User Guide.

If you have feedback about this post, submit comments in the Comments section below. You can also start a new thread on the Amazon Detective re:Post or contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Author

Anna McAbee

Anna is a Security Specialist Solutions Architect focused on threat detection and incident response at AWS. Before AWS, she worked as an AWS customer in financial services on both the offensive and defensive sides of security. Outside of work, Anna enjoys cheering on the Florida Gators football team, wine tasting, and traveling the world.

Author

Marshall Jones

Marshall is a Worldwide Security Specialist Solutions Architect at AWS. His background is in AWS consulting and security architecture, focused on a variety of security domains including edge, threat detection, and compliance. Today, he is focused on helping enterprise AWS customers adopt and operationalize AWS security services to increase security effectiveness and reduce risk.

Luis Pastor

Luis is a Security Specialist Solutions Architect focused on infrastructure security at AWS. Before AWS, he worked with large and boutique system integrators, helping clients in an array of industries improve their security posture and reach and maintain compliance in hybrid environments. Luis enjoys keeping active, cooking, and eating spicy food, especially Mexican cuisine.

Use AWS WAF CAPTCHA to protect your application against common bot traffic

Post Syndicated from Abhinav Bannerjee original https://aws.amazon.com/blogs/security/use-aws-waf-captcha-to-protect-your-application-against-common-bot-traffic/

In this blog post, you’ll learn how you can use a Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) with other AWS WAF controls as part of a layered approach to provide comprehensive protection against bot traffic. We’ll describe a workflow that tracks the number of incoming requests to a site’s store page. The workflow then limits those requests if they exceed a certain threshold. Requests from IP addresses that exceed the threshold will be presented a CAPTCHA challenge to prove that the requests are being made by a human.

Amazon Web Services (AWS) offers many tools and recommendations that companies can use as they face challenges with bot traffic on their websites. Web applications can be compromised through a variety of vectors, including cross-site scripting, SQL injection, path traversal, local file inclusion, and distributed denial-of-service (DDoS) attacks. AWS WAF offers managed rules that are designed to provide protection against common application vulnerabilities or other unwanted web traffic, without requiring you to write your own rules.

There are some web attacks like web scraping, credential stuffing, and layer 7 DDoS attempts conducted by bots (as well as by humans) that target sensitive areas of your website, such as your store page. A CAPTCHA mitigates undesirable traffic by requiring the visitor to complete challenges before they are allowed to access protected resources. You can implement CAPTCHA to help prevent unwanted activities. Last year, AWS introduced AWS WAF CAPTCHA, which allows customers to set up AWS WAF rules that require CAPTCHA challenges to be completed for common targets such as forms (for example, search forms).

Scenario

Consider an attack where the unauthorized user is attempting to overwhelm a site’s store page by repeatedly sending search requests for different items.

Assume that traffic visits a website that is hosted through Amazon CloudFront and attempts the above behavior on the /store URL. In this scenario, there is a rate-based rule in place that will track the number of requests coming in from each IP. This rate-based rule tracks the rate of requests for each originating IP address and invokes the rule action on IPs with rates that go over the limit. With CAPTCHA implemented as the rule action, excessive attempts to search within a 5-minute window will result in a CAPTCHA challenge being presented to the user. This workflow is shown in Figure 1.

Figure 1: User visits a store page and is evaluated by a rate-based rule

Figure 1: User visits a store page and is evaluated by a rate-based rule

When a user solves a CAPTCHA challenge, AWS automatically generates and encrypts a token and sends it to the client as a cookie. The client requests aren’t challenged again until the token has expired. AWS WAF calculates token expiration by using the immunity time configuration. You can configure the immunity time in a web access control list (web ACL) CAPTCHA configuration and in the configuration for a rule’s action setting. When a user provides an incorrect answer to a CAPTCHA challenge, the challenge informs the user and loads a new puzzle. When the user solves the challenge, the challenge automatically submits the original web request, updated with the CAPTCHA token from the successful puzzle completion.

Walkthrough

This workflow will require an AWS WAF rule within a new or existing rule group or web ACL. The rule will define how web requests are inspected and the action to take.

To create an AWS WAF rate-based rule

  1. Open the AWS WAF console and in the left navigation pane, choose Web ACLs.
  2. Choose an existing web ACL, or choose Create web ACL at the top right to create a new web ACL.
  3. Under Rules, choose Add rules, and then in the drop-down list, choose Add my own rules and rule groups.
  4. For Rule type, choose Rule builder.
  5. In the Rule builder section, for Name, enter your rule name. For Type, choose Rate-based rule.
  6. In the Request rate details section, enter your rate limit (for example, 100). For IP address to use for rate limiting, choose Source IP address, and for Criteria to count requests toward rate limit, choose Only consider requests that match criteria in a rule statement.
  7. For Count only the requests that match the following statement, choose Matches the statement from the drop-down list.
  8. In the Statement section, for Inspect, choose URI path. For Match type, choose Contains string.
  9. For String to match, enter the URI path of your web page (for example, /store).
  10. In the Action section, choose CAPTCHA.
  11. (Optional) For Immunity time, choose Set a custom immunity time for this rule, or keep the default value (300 seconds).
  12. To finish, choose Add rule, and then choose Save to add the rule to your web ACL.

After you add the rule, go to the Rules tab of your web ACL and navigate to your rule. Confirm that the output resembles what is shown in Figure 2. You should have a rate-based rule with a scope-down statement that matches the store URI path you entered earlier, and the action should be set to CAPTCHA.

The following is the JSON for the CAPTCHA rule that you just created. You can use this to validate your configuration. You can also use this JSON in the rule builder while creating the rule.

{
  "Name": "CaptchaOnRBR",
  "Priority": 0,
  "Statement": {
    "RateBasedStatement": {
      "Limit": 100,
      "AggregateKeyType": "IP",
      "ScopeDownStatement": {
        "ByteMatchStatement": {
          "SearchString": "/store",
          "FieldToMatch": {
            "UriPath": {}
          },
          "TextTransformations": [
            {
              "Priority": 0,
              "Type": "NONE"
            }
          ],
          "PositionalConstraint": "CONTAINS"
        }
      }
    }
  },
  "Action": {
    "Captcha": {}
  },
  "VisibilityConfig": {
    "SampledRequestsEnabled": true,
    "CloudWatchMetricsEnabled": true,
    "MetricName": "CaptchaOnRBR"
  },
  "CaptchaConfig": {
    "ImmunityTimeProperty": {
      "ImmunityTime": 60
    }
  }
}

After you complete this configuration, the rule will be invoked when an IP address unsuccessfully attempts to search the store at a rate that exceeds the threshold. This user will be presented with a CAPTCHA challenge, as shown in Figure 3. If the user is successful, they will be routed back to the store page. Otherwise, they will be served a new puzzle until it is solved.

Figure 3: CAPTCHA challenge presented to a request that exceeded the threshold

Figure 3: CAPTCHA challenge presented to a request that exceeded the threshold

Implementing rate-based rules and CAPTCHA also allows you to track IP addresses, limit the number of invalid search attempts, and use the specific IP information available to you within sampled requests and AWS WAF logs to help prevent that traffic from affecting your resources. Additionally, you have visibility into IP addresses blocked by rate-based rules, so that you can later add these addresses to a block list or create custom logic as needed to mitigate false positives.
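
For example, the following boto3 sketch retrieves the IP addresses that a rate-based rule is currently managing; the web ACL name and ID are placeholders you would copy from the AWS WAF console or a ListWebACLs call.

import boto3

# A CloudFront-scoped web ACL must be queried through the us-east-1 Region.
wafv2 = boto3.client("wafv2", region_name="us-east-1")

response = wafv2.get_rate_based_statement_managed_keys(
    Scope="CLOUDFRONT",       # use "REGIONAL" for ALB or API Gateway web ACLs
    WebACLName="my-web-acl",  # placeholder web ACL name
    WebACLId="11111111-2222-3333-4444-555555555555",  # placeholder web ACL ID
    RuleName="CaptchaOnRBR",
)

# IPv4 addresses currently rate limited by this rule.
print(response["ManagedKeysIPV4"]["Addresses"])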

Conclusion

In this blog post, you learned how to configure and deploy a CAPTCHA challenge with AWS WAF that checks for web requests that exceed a certain rate threshold and requires the client sending such requests to solve a challenge. Please note the additional charge for enabling CAPTCHA on your web ACL (pricing can be found here). Although CAPTCHA challenges are simple for humans to complete, they should be harder for common bots to complete with any meaningful rate of success. You can use a CAPTCHA challenge when a block action would stop too many legitimate requests, but letting all traffic through would result in unacceptably high levels of unwanted requests, such as from bots.

For more information and guidance on AWS WAF rate-based rules, see the blog post The three most important AWS WAF rate-based rules and the AWS whitepaper AWS Best Practices for DDoS Resiliency.
If you have feedback about this blog post, submit comments in the Comments section below. You can also start a new thread on AWS WAF re:Post to get answers from the community.

Want more AWS Security news? Follow us on Twitter.

Abhinav Bannerjee

Abhinav is a Solutions Architect based out of Texas. He works closely with small and medium-sized businesses to help them scale their adoption of Amazon Web Services.

Fenil Patel

Fenil is a Solutions Architect based out of New Jersey. His main focus is helping customers optimize and secure content delivery using AWS Edge Services.

A dive into redBus’s data platform and how they used Amazon QuickSight to accelerate business insights

Post Syndicated from Girish Kumar Chidananda original https://aws.amazon.com/blogs/big-data/a-dive-into-redbuss-data-platform-and-how-they-used-amazon-quicksight-to-accelerate-business-insights/

This post is co-authored with Girish Kumar Chidananda from redBus.

redBus is one of the earliest adopters of AWS in India, and most of its services and applications are hosted on the AWS Cloud. AWS provided redBus the flexibility to scale their infrastructure rapidly while keeping costs extremely low. AWS has a comprehensive suite of services to cater to most of their needs, including providing customer support that redBus can vouch for.

In this post, we share redBus’s data platform architecture, and how various components are connected to form their data highway. We also discuss the challenges redBus faced in building dashboards for their real-time business intelligence (BI) use cases, and how they used Amazon QuickSight, a fast, easy-to-use, cloud-powered business analytics service that makes it easy for all employees within redBus to build visualizations and perform ad hoc analysis to gain business insights from their data, any time, and on any device.

About redBus

redBus is the world’s largest online bus ticketing platform built in India and serving more than 36 million happy customers around the world. Along with its bus ticketing vertical, redBus also runs a rail ticketing service called redRails and a bus and car rental service called rYde. It is part of the GO-MMT group, which is India’s leading online travel company, with an extensive brand portfolio that includes other prominent online travel brands like MakeMyTrip and Goibibo.

redBus’s data highway 1.0

redBus relies heavily on making data-driven decisions at every level, from its traveler journey tracking, forecasting demand during high traffic, identifying and addressing bottlenecks in their bus operators signup process, and more. As redBus’s business started growing in terms of the number of cities and countries they operated in and the number of bus operators and travelers using the service in each city, the amount of incoming data also increased. The need to access and analyze the data in one place required them to build their own data platform, as shown in the following diagram.

redBus data platform 1.0

In the following sections, we look at each component in more detail.

Data ingestion sources

With the data platform 1.0, the data is ingested from various sources:

  • Real time – Real-time data flows from redBus mobile apps and backend microservices whenever a passenger, bus operator, or application performs an operation like booking bus tickets, searching the bus inventory, uploading a KYC document, and more
  • Batch mode – Scheduled jobs fetch data from multiple persistent data stores like Amazon Relational Database Service (Amazon RDS), where the OLTP data from all its applications is stored, Apache Cassandra clusters, where the bus inventory from various operators is stored, ArangoDB, where the user identity graphs are stored, and more

Data cataloging

The real-time data is ingested into their self-managed Apache Nifi clusters, an open-source data platform that is used to clean, analyze, and catalog the data with its routing capabilities before sending the data to its destination.

Storage and analytics

redBus uses the following services for its storage and analytical needs:

  • Amazon Simple Storage Service (Amazon S3), an object storage service that provides the foundation for their data lake because of its virtually unlimited scalability and higher durability. Real-time data flows from Apache Druid and data from the data stores flow at regular intervals based on the schedules.
  • Apache Druid, an OLAP-style data store (data flows via Kafka Druid data loader), which computes facts and metrics against various dimensions during the data loading process.
  • Amazon Redshift, a cloud data warehouse service that helps you analyze exabytes of data and run complex analytical queries. redBus uses Amazon Redshift to store the processed data from Amazon S3 and the aggregated data from Apache Druid.

Querying and visualization

To make redBus as data-driven as possible, they ensured that the data is accessible to their SRE engineers, data engineers, and business analysts via a visualization layer. This layer features dashboards served using Apache Superset, an open-source data visualization application, and Amazon Athena, an interactive query service to analyze data in Amazon S3 using standard SQL for ad hoc querying requirements.

The challenges

Initially, redBus handled data that was being ingested at the rate of 10 million events per day. Over time, as its business started growing, so did the data volume (from gigabytes to terabytes to petabytes), data ingestion per day (from 10 million to 320 million events), and its business intelligence dashboard needs. Soon after, they started facing challenges with their self-managed Superset’s BI capabilities, and the increased operational complexities.

Limited BI capabilities

redBus encountered the following BI limitations:

  • Inability to create visualizations from multiple data sources – Superset doesn’t allow creating visualizations from multiple tables within its data exploration layer. redBus data engineers had to have the tables joined beforehand at the data source level itself. In order to create a 360-degree view for redBus’s business stakeholders, it became inconvenient for data engineers to maintain multiple tables supporting the visualization layer.
  • No global filter for visuals in a dashboard – A global or primary filter across visuals in a dashboard is not supported in Superset. For example, consider there are visuals like Sales Wins by Region, YTD Revenue Realized by Region, Sales Pipeline by Region, and more in a dashboard, and a filter Region is added to the dashboard with values like EMEA, APAC, and US. The filter Region will only apply to one of the visuals, not the entire dashboard. However, dashboard users expected filtering across the dashboard.
  • Not a business-user friendly tool – Superset is highly developer centric when it comes to customization. For example, if a redBus business analyst had to customize a timed refresh that automatically re-queries every slice on a dashboard according to a pre-set value, then the analyst has to update the dashboard’s JSON metadata field. Therefore, having knowledge of JSON and its syntax is mandatory for doing any customization on the visuals or dashboard.

Increased operational cost

Although Superset is open source, which means there are no licensing costs, it also means there is more effort in maintaining all the components required for it to function as an enterprise-grade BI tool. redBus has deployed and maintained a web server (Nginx) fronted by an Application Load Balancer to do the load balancing; a metadata database server (MySQL) where Superset stores its internal information like users, slices, and dashboard definitions; an asynchronous task queue (Celery) for supporting long-running queries; a message broker (RabbitMQ); and a distributed caching server (Redis) for caching the results, charting data, and more on Amazon Elastic Compute Cloud (Amazon EC2) instances. The following diagram illustrates this architecture.

Apache Superset Deployment at redBus

redBus’s DevOps team had to do the heavy lifting of provisioning the infrastructure, taking backups, scaling the components manually as needed, upgrading the components individually, and more. It also required a Python web developer to be available to make configuration changes so that all the components work together seamlessly. All these manual operations increased the total cost of ownership for redBus.

Journey towards QuickSight

redBus started exploring BI solutions primarily around a couple of its dashboarding requirements:

  • BI dashboards for business stakeholders and analysts, where the data is sourced via Amazon S3 and Amazon Redshift.
  • A real-time application performance monitoring (APM) dashboard to help their SRE engineers and developers identify the root cause of an issue in their microservices deployment so they can fix the issues before they affect their customer’s experience. In this case, the data is sourced via Druid.

QuickSight fit into most of redBus’s BI dashboard requirements, and in no time their data platform team started with a proof of concept (POC) for a couple of their complex dashboards. At the end of the POC, which spanned a month’s time, the team shared their findings.

First, QuickSight is rich in BI capabilities, including the following:

  • It’s a self-service BI solution with drag-and-drop features that could help redBus analysts comfortably use it without any coding efforts.
  • Visualizations from multiple data sources in a single dashboard could help redBus business stakeholders get a 360-degree view of sales, forecasting, and insights in a single pane of glass.
  • Cascading filters across visuals and across sheets in a dashboard are much-needed features for redBus’s BI requirements.
  • QuickSight offers Excel-like visuals—tables with calculations, pivot tables with cell grouping, and styling are attractive for the viewers.
  • The Super-fast, Parallel, In-memory Calculation Engine (SPICE) in QuickSight could help redBus scale to hundreds of thousands of users, who can all simultaneously perform fast interactive analysis across a wide variety of AWS data sources.
  • Off-the-shelf ML insights and forecasting at no additional cost would allow redBus’s data science team to focus on ML models besides sales forecasting and similar models.
  • Built-in row-level security (RLS) could allow redBus to grant filtered access for their viewers. For example, redBus has many business analysts who manage different countries. With RLS, each business analyst only sees data related to their assigned country within a single dashboard.
  • redBus uses OneLogin as its identity provider, which supports Security Assertion Markup Language 2.0 (SAML 2.0). With the help of identity federation and single sign-on support from QuickSight, redBus could provide a simple onboarding flow for their QuickSight users.
  • QuickSight offers built-in alerts and email notification capabilities.

Secondly, QuickSight is a fully managed, cloud-native, serverless BI service offering from AWS, with the following features:

  • redBus engineers don’t need to focus on the heavy lifting of provisioning, scaling, and maintaining their BI solution on EC2 instances.
  • QuickSight offers native integration with AWS services like Amazon Redshift, Amazon S3, and Athena, and other popular frameworks like Presto, Snowflake, Teradata, and more. QuickSight connects to most of the data sources that redBus already has except Apache Druid, because native integration with Druid was not available as of December 2022. For a complete list of the supported data sources, see Supported data sources.

The outcome

Considering all the rich features and lower total cost of ownership, redBus chose QuickSight for their BI dashboard requirements. With QuickSight, redBus’s data engineers have built a number of dashboards in no time to give insights from petabytes of data to business stakeholders and analysts. The redBus data highway evolved to bring business intelligence to a much wider audience in their organization, with better performance and faster time-to-value. As of November 2022, it combines QuickSight for business users and Superset for real-time APM dashboards (at the time of writing, QuickSight doesn’t offer a native connector to Druid), as shown in the following diagram.

redBus data platform 2.0

Sales anomaly detection dashboard

Although there are many dashboards that redBus deployed to production, sales anomaly detection is one of the interesting dashboards that redBus built. It uses redBus’s proprietary sales forecasting model, which in turn is sourced by historical sales data from Amazon Redshift tables and real-time sales data from Druid tables, as shown in the following figure.

Sales anomaly detection data flow

At regular intervals, the scheduled jobs feed the redBus forecasting model with real-time and historical sales data, and then the forecasted data is pushed into an Amazon Redshift table. The sales anomaly detection dashboard in QuickSight is served by the resultant Amazon Redshift table.

The following is one of the visuals from the sales anomaly detection dashboard. It’s built using a line chart representing hourly actual sales, predicted sales, and an alert threshold for a time series for a particular business cohort in redBus.

Sales and Predicted Sales for a particular cohort

In this visual, each bar represents the number of sales anomalies triggered at a particular point in the time series.

redBus’s analysts could further drill down to the sales details and anomalies at the minute level, as shown in the following diagram. This drill-down feature comes out of the box with QuickSight.

Drill-Down Chart - Sales and Predicted Sales for a particular cohort

For more details on adding drill-downs to QuickSight dashboard visuals, see Adding drill-downs to visual data in Amazon QuickSight.

Apart from the visuals, it has become one of viewers’ favorite dashboards at redBus due to the following notable features:

  • Because filtering across visuals is an out-of-the-box feature in QuickSight, a timestamp-based filter is added to the dashboard. This helps in filtering multiple visuals in the dashboard in a single click.
  • URL actions configured on the visuals help the viewers navigate to the context-sensitive in-house applications.
  • Email alerts configured on KPIs and Gauge visuals help the viewers get notifications on time.

Next steps

Apart from building new dashboards for their BI dashboard needs, redBus is taking the following next steps:

  • Exploring QuickSight Embedded Analytics for a couple of their application requirements to accelerate time to insights for users with in-context data visuals, interactive dashboards, and more directly within applications
  • Exploring QuickSight Q, which could enable their business stakeholders to ask questions in natural language and receive accurate answers with relevant visualizations that can help them gain insights from the data
  • Building a unified dashboarding solution using QuickSight covering all their data sources as integrations become available

Conclusion

In this post, we showed you how redBus built its data platform using various AWS services and Apache frameworks, the challenges the platform went through (especially in their BI dashboard requirements and challenges while scaling), and how they used QuickSight and lowered the total cost of ownership.

To know more about engineering at redBus, check out their medium blog posts. To learn more about what is happening in QuickSight or if you have any questions, reach out to the QuickSight Community, which is very active and offers several resources.


About the Authors


Girish Kumar Chidananda

Girish works as a Senior Engineering Manager – Data Engineering at redBus, where he has been building various data engineering applications and components for redBus for the last 5 years. Prior to starting his journey in the IT industry, he worked as a mechanical and control systems engineer in various organizations, and he holds an MS degree in Fluid Power Engineering from the University of Bath.


Kayalvizhi Kandasamy

Kayalvizhi works with digital-native companies to support their innovation. As a Senior Solutions Architect (APAC) at Amazon Web Services, she uses her experience to help people bring their ideas to life, focusing primarily on microservice architectures and cloud-native solutions using AWS services. Outside of work, she likes playing chess and is a FIDE-rated chess player. She also coaches her daughters in the art of playing chess, and prepares them for various chess tournaments.

Automate data lineage on Amazon MWAA with OpenLineage

Post Syndicated from Stephen Said original https://aws.amazon.com/blogs/big-data/automate-data-lineage-on-amazon-mwaa-with-openlineage/

In modern data architectures, datasets are combined across an organization using a variety of purpose-built services to unlock insights. As a result, data governance becomes a key component for data consumers and producers to know that their data-driven decisions are based on trusted and accurate datasets. One aspect of data governance is data lineage, which captures the flow of data as it goes through various systems and allows consumers to understand how a dataset was derived.

In order to capture data lineage consistently across various analytical services, you need a common lineage model and a robust job orchestrator that can tie together diverse data flows. One possible solution is the open-source OpenLineage project. It provides a technology-agnostic metadata model for capturing data lineage and integrates with widely used tools. For job orchestration, it integrates with Apache Airflow, which you can run on AWS conveniently through the managed service Amazon Managed Workflows for Apache Airflow (Amazon MWAA). OpenLineage provides a plugin for Apache Airflow that extracts data lineage from Directed Acyclic Graphs (DAGs).

In this post, we show how to get started with data lineage on AWS using OpenLineage. We provide a step-by-step configuration guide for the openlineage-airflow plugin on Amazon MWAA. Additionally, we share an AWS Cloud Development Kit (AWS CDK) project that deploys a pre-configured demo environment for evaluating and experiencing OpenLineage first-hand.

OpenLineage on Apache Airflow

In the following example, Airflow turns OLTP data into a star schema on Amazon Redshift Serverless.

After staging and preparing source data from Amazon Simple Storage Service (Amazon S3), fact and dimension tables are eventually created. For this, Airflow orchestrates the execution of SQL statements that create and populate tables on Redshift Serverless.

Overview of DAGs in Amazon MWAA
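
To make this concrete, the following is a minimal sketch of what one task in such a DAG might look like; the connection ID, schema, and SQL are illustrative assumptions, not the exact code from this post’s demo environment. The openlineage-airflow plugin ships with extractors for SQL operators like this one, so lineage is captured without changes to the DAG code.

from datetime import datetime

from airflow import DAG
from airflow.providers.postgres.operators.postgres import PostgresOperator

# Illustrative DAG: connection ID, schema, and SQL are assumptions.
with DAG(
    dag_id="example_staging_to_dm",
    start_date=datetime(2023, 1, 1),
    schedule_interval=None,
    catchup=False,
) as dag:
    # Populate a dimension table on Redshift Serverless from a staging table.
    load_user_dim = PostgresOperator(
        task_id="load_user_dim",
        postgres_conn_id="redshift_default",
        sql="""
            INSERT INTO dm.user_dim (user_id, first_name, last_name)
            SELECT user_id, first_name, last_name FROM staging.users;
        """,
    )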

The openlineage-airflow plugin collects metadata about the creation of datasets and the dependencies between them. This allows us to move from Airflow’s job-centric view to a dataset-centric view, improving the observability of workflows.

The following screenshot shows parts of the captured lineage for the previous example. It’s displayed in Marquez, an open-source metadata service for collection and visualization of data lineage with support for the OpenLineage standard. In Marquez, you can analyze the upstream datasets and transformations that eventually create the user dimension table on the right.

Data lineage graph in Marquez

The example in this post is based on SQL and Amazon Redshift. OpenLineage also supports other transformation engines and data stores such as Apache Spark and dbt.

Solution overview

The following diagram shows the AWS setup required to capture data lineage using OpenLineage.

Solution overview

The workflow includes the following components:

  1. The openlineage-airflow plugin is configured on Airflow as a lineage backend. Metadata about the DAG runs is passed by Airflow core to the plugin, which converts it into OpenLineage format and sends it to an external metadata store. In our demo setup, we use Marquez as the metadata store.
  2. The openlineage-airflow plugin receives its configuration from environment variables. To populate these variables on Amazon MWAA, a custom Airflow plugin is used. First, the plugin reads source values from AWS Secrets Manager. Then, it creates environment variables.
  3. Secrets Manager is configured as a secrets backend. Typically, this type of configuration is stored in Airflow’s native metadata database. However, this approach has limitations. For instance, in case of multiple Airflow environments, you need to track and store credentials across multiple environments, and updating credentials requires you to update all the environments. With a secrets backend, you can centralize configuration.
  4. For demo purposes, we collect data lineage from a data pipeline, which creates a star schema in Redshift Serverless.

In the following sections, we walk you through the steps for end-to-end configuration.

Install the openlineage-airflow plugin

Specify the following dependency in the requirements.txt file of the Amazon MWAA environment. Note that the latest Airflow version currently available on Amazon MWAA is 2.4.3; for this post, use the compatible version 0.19.2 of the plugin:

openlineage-airflow==0.19.2

For more details on installing Python dependencies on Amazon MWAA, refer to Installing Python dependencies.

For Airflow < 2.3, configure the plugin’s lineage backend through the following configuration overrides on the Amazon MWAA environment and load it immediately at Airflow start by disabling lazy load of plugins:

AirflowConfigurationOptions:
    core.lazy_load_plugins: False
    lineage.backend: openlineage.lineage_backend.OpenLineageBackend

For more information on configuration overrides, refer to Configuration options overview.

Configure the Secrets Manager backend with Amazon MWAA

Using Secrets Manager as a secrets backend for Amazon MWAA is straightforward. First, provide the execution role of Amazon MWAA with read permission to Secrets Manager. You can use the following policy template as a starting point:

{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": [
        "secretsmanager:GetResourcePolicy",
        "secretsmanager:GetSecretValue",
        "secretsmanager:DescribeSecret",
        "secretsmanager:ListSecretVersionIds"
      ],
      "Resource": "arn:aws:secretsmanager:AWS_REGION:<YOUR_ACCOUNT_ID>:secret:"
    },
    {
      "Effect": "Allow",
      "Action": "secretsmanager:ListSecrets",
      "Resource": ""
    }
  ]
}

Second, configure Secrets Manager as a backend in Amazon MWAA through the following configuration overrides:

AirflowConfigurationOptions:
    secrets.backend: airflow.contrib.secrets.aws_secrets_manager.SecretsManagerBackend
    secrets.backend_kwargs: '{"connections_prefix" : "airflow/connections", "variables_prefix" : "airflow/variables"}'

For more information about configuring a secrets backend in Amazon MWAA, refer to Configuring an Apache Airflow connection using a Secrets Manager secret and Move your Apache Airflow connections and variables to AWS Secrets Manager.

Deploy a custom envvar plugin to Amazon MWAA

Apache Airflow has a built-in plugin manager through which it can be extended with custom functionality. In our case, this functionality is to populate OpenLineage-specific environment variables based on values in Secrets Manager. Natively, Amazon MWAA allows environment variables with the prefix AIRFLOW__, but the openlineage-airflow plugin expects the prefix OPENLINEAGE__.

The following Python code is used in the plugin. We assume the file is called envvar_plugin.py:

from airflow.plugins_manager import AirflowPlugin
from airflow.models import Variable
import os

# Read the Marquez URL from an Airflow variable (resolved through the
# Secrets Manager backend) and expose it as the environment variable
# that the openlineage-airflow plugin expects.
os.environ["OPENLINEAGE_URL"] = Variable.get('OPENLINEAGE_URL', default_var='')

class EnvVarPlugin(AirflowPlugin):
  name = "env_var_plugin"

Amazon MWAA has a mechanism to install a plugin through a zip archive. You zip your code, upload the archive to an S3 bucket, and pass the file’s S3 URL to Amazon MWAA:

zip plugins.zip envvar_plugin.py

Upload plugins.zip to an S3 bucket and configure the URL in Amazon MWAA. The following screenshot shows the configuration via the Amazon MWAA console.

Configuration of a custom plugin in Amazon MWAA

For more information on installing custom plugins on Amazon MWAA, refer to Creating a custom plugin that generates runtime environment variables.

Configure connectivity between the openlineage-airflow plugin and Marquez

As a last step, store the URL to Marquez in Secrets Manager. For this, create a secret called airflow/variables/OPENLINEAGE_URL with value <protocol>://<hostname/ip>:<port> (for example, https://marquez.mysite.com:5000).
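
If you prefer to create the secret programmatically, the following boto3 sketch is equivalent; the URL shown is a placeholder for your Marquez endpoint.

import boto3

# Create the secret that the custom envvar plugin resolves at Airflow startup.
secretsmanager = boto3.client("secretsmanager")
secretsmanager.create_secret(
    Name="airflow/variables/OPENLINEAGE_URL",
    SecretString="https://marquez.mysite.com:5000",  # placeholder endpoint
)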

Configuration of OPENLINEAGE_URL as secret

In case you need to spin up Marquez on AWS, you have multiple options to host, including running it on Amazon Elastic Kubernetes Service (Amazon EKS) or Amazon Elastic Compute Cloud (Amazon EC2). Refer to Running Marquez on AWS or check out our infrastructure template in the next section to deploy Marquez on AWS.

Deploy with an AWS CDK-based solution template

Assuming you want to set up a demo infrastructure for all of the above in one step, you can use the following template based on the AWS CDK.

The template has the following prerequisites:

Complete the following steps to deploy the template:

  1. Clone the GitHub repository and install the Python dependencies. Bootstrap the AWS CDK if required.
    git clone https://github.com/aws-samples/aws-mwaa-openlineage
    cd aws-mwaa-openlineage
    python3 -m venv .env && source .env/bin/activate
    python3 -m pip install -r requirements.txt
    cdk bootstrap

  2. Update the value for the variable EXTERNAL_IP in constants.py to your outbound IP for connecting to the internet:
    # Set variable to outbound IP for connecting to the internet.
    EXTERNAL_IP = "255.255.255.255"

    This configures security groups so that you can access Marquez while blocking other clients. constants.py is found in the root folder of the cloned repository.

  3. Deploy the VPC_S3 stack to provision a new VPC dedicated for this solution as well as the security groups that are used by the different components:
    cdk deploy vpc-s3

    It creates a new S3 bucket and uploads the source raw data based on the TICKIT sample database. This serves as the landing area from the OLTP database. We then need to parse the metadata of these files through an AWS Glue crawler, which facilitates the native integration between Amazon Redshift and the S3 data lake.

  4. Deploy the lineage stack to create an EC2 instance that hosts Marquez:
    cdk deploy marquez

    Access the Marquez web UI through https://{ec2.public_dns_name}:3000/. This URL is also available as part of the AWS CDK outputs for the lineage stack.

  5. Deploy the Amazon Redshift stack to create a Redshift Serverless endpoint:
    cdk deploy redshift

  6. Deploy the Amazon MWAA stack to create an Amazon MWAA environment:
    cdk deploy mwaa

    You can access the Amazon MWAA UI through the URL provided in the AWS CDK output.

Test a sample data pipeline

On Amazon MWAA, you can see an example data pipeline deployed that consists of two DAGs. It builds a star schema on top of the TICKIT sample database. One DAG is responsible for loading data from the S3 data lake into an Amazon Redshift staging layer; the second DAG loads data from the staging layer to the dimensional model.

Datamodel of star schema

Open the Amazon MWAA UI through the URL obtained in the deployment steps and launch the following DAGs: rs_source_to_staging and rs_staging_to_dm. As part of the run, the lineage metadata is sent to Marquez.

After the DAGs have run, open the Marquez URL obtained in the deployment steps. In Marquez, you can find the lineage metadata for the computed star schema and related data assets on Amazon Redshift.

Clean up

Delete the AWS CDK stacks to avoid ongoing charges for the resources that you created. Run the following command in the aws-mwaa-openlineage project directory so that all resources are undeployed:

cdk destroy --all

Summary

In this post, we showed you how to automate data lineage with OpenLineage on Amazon MWAA. As part of this, we covered how to install and configure the openlineage-airflow plugin on Amazon MWAA. Additionally, we provided a ready-to-use infrastructure template for a complete demo environment.

We encourage you to explore what else can be achieved with OpenLineage. A job orchestrator like Apache Airflow is only one piece of a data platform and not all possible data lineage can be captured on it. We recommend exploring OpenLineage’s integration with other platforms like Apache Spark or dbt. For more information, refer to Integrations.

Additionally, we recommend you visit the AWS Big Data Blog for other useful blog posts on Amazon MWAA and data governance on AWS.


About the Authors

Stephen Said is a Senior Solutions Architect and works with Digital Native Businesses. His areas of interest are data analytics, data platforms and cloud-native software engineering.

Vishwanatha Nayak is a Senior Solutions Architect at AWS. He works with large enterprise customers to help them design and build secure, cost-effective, and reliable modern data platforms using the AWS Cloud. He is passionate about technology and likes sharing knowledge through blog posts and Twitch sessions.

Paul Villena is an Analytics Solutions Architect with expertise in building modern data and analytics solutions to drive business value. He works with customers to help them harness the power of the cloud. His areas of interests are infrastructure-as-code, serverless technologies and coding in Python.

Enable cross-account sharing with direct IAM principals using AWS Lake Formation Tags

Post Syndicated from Aarthi Srinivasan original https://aws.amazon.com/blogs/big-data/enable-cross-account-sharing-with-direct-iam-principals-using-aws-lake-formation-tags/

With AWS Lake Formation, you can build data lakes with multiple AWS accounts in a variety of ways. For example, you could build a data mesh, implementing a centralized data governance model and decoupling data producers from the central governance. Such data lakes enable the data as an asset paradigm and unleash new possibilities with data discovery and exploration across organization-wide consumers. While enabling the power of data in decision-making across your organization, it’s also crucial to secure the data. With Lake Formation, sharing datasets across accounts only requires a few simple steps, and you can control what you share.

Lake Formation has launched Version 3 capabilities for sharing AWS Glue Data Catalog resources across accounts. When moving to Lake Formation cross-account sharing V3, you get several benefits. When moving from V1, you get more optimized usage of AWS Resource Access Manager (AWS RAM) to scale sharing of resources. When moving from V2, you get a few enhancements. First, you don’t have to maintain AWS Glue resource policies to share using LF-tags because Version 3 uses AWS RAM. Second, you can share with AWS Organizations using LF-tags. Third, you can share to individual AWS Identity and Access Management (IAM) users and roles in other accounts, thereby providing data owners control over which individuals can access their data.

Lake Formation tag-based access control (LF-TBAC) is an authorization strategy that defines permissions based on attributes called LF-tags. LF-tags are different from IAM resource tags and are associated only with Lake Formation databases, tables, and columns. LF-TBAC allows you to define the grant and revoke permissions policy by grouping Data Catalog resources, and therefore helps in scaling permissions across a large number of databases and tables. LF-tags are inherited from a database to all its tables and all the columns of each table.

Version 3 offers the following benefits:

  • True central governance with cross-account sharing to specific IAM principals in the target account
  • Ease of use in not having to maintain an AWS Glue resource policy for LF-TBAC
  • Efficient reuse of AWS RAM shares
  • Ease of use in scaling to hundreds of accounts with LF-TBAC

In this post, we illustrate the new features of cross-account sharing Version 3 in a producer-consumer scenario using TPC datasets. We walk through the setup of using LF-TBAC to share data catalog resources from the data producer account to direct IAM users in the consumer account. We also go through the steps in the receiving account to accept the shares and query the data.

Solution overview

To demonstrate the Lake Formation cross-account Version 3 features, we use the TPC datasets available at s3://aws-data-analytics-workshops/shared_datasets/tpcparquet/. The solution consists of steps in both accounts.

In account A, complete the following steps:

  1. As a data producer, register the dataset with Lake Formation and create AWS Glue Data Catalog tables.
  2. Create LF-tags and associate them with the database and tables.
  3. Grant LF-tag based permissions on resources directly to personas in consumer account B.

The following steps take place in account B:

  1. The consumer account data lake admin reviews and accepts the AWS RAM invitations.
  2. The data lake admin gives CREATE DATABASE access to the IAM user lf-business-analysts.
  3. The data lake admin creates a database for the marketing team and grants CREATE TABLE access to lf-campaign-manager.
  4. The IAM users create resource links on the shared database and tables and query them in Amazon Athena.

The producer account A has the following personas:

  • Data lake admin – Manages the data lake in the producer account
  • lf-producersteward – Manages the data and user access

The consumer account B has the following personas:

  • Data lake admin – Manages the data lake in the consumer account
  • lf-business-analysts – The business analysts in the sales team need access to non-PII data
  • lf-campaign-manager – The manager in the marketing team needs access to data related to products and promotions

Prerequisites

You need the following prerequisites:

  • Two AWS accounts. For this demonstration of how AWS RAM invites are created and accepted, you should use two accounts that are not part of the same organization.
  • An admin IAM user in both accounts to launch the AWS CloudFormation stacks.
  • Lake Formation enabled in both the producer and consumer accounts, with cross-account sharing set to Version 3. For instructions, refer to Change the default permission model.

Lake Formation and AWS CloudFormation setup in account A

To keep the setup simple, we have an IAM admin registered as the data lake admin.

  1. Sign into the AWS Management Console in the us-east-1 Region.
  2. On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
  3. Under Data lake administrators, select Choose Administrators.
  4. In the Manage data lake administrators pop-up window, under IAM users and roles, choose the IAM admin user and choose Save.
  5. Choose Launch Stack to deploy the CloudFormation template.
  6. Choose Next.
  7. Provide a name for the stack and choose Next.
  8. On the next page, choose Next.
  9. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  10. Choose Create.

Stack creation should take about 2–3 minutes. The stack establishes the producer setup as follows:

  • Creates an Amazon Simple Storage Service (Amazon S3) data lake bucket
  • Registers the data lake bucket with Lake Formation
  • Creates an AWS Glue database and tables
  • Creates an IAM user (lf-producersteward) who will act as producer steward
  • Creates LF-tags and assigns them to the created catalog resources as specified in the following table
Database | Table | LF-Tag Key | LF-Tag Value | Resource Tagged
lftpcdb | . | Sensitivity | Public | DATABASE
lftpcdb | items | HasCampaign | true | TABLE
lftpcdb | promotions | HasCampaign | true | TABLE
lftpcdb | customers (columns "c_last_name", "c_first_name", "c_email_address") | Sensitivity | Confidential | TABLECOLUMNS
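
The CloudFormation stack performs this tagging for you; purely for illustration, a boto3 sketch of the equivalent API calls (run as the data lake admin) might look like the following.

import boto3

lakeformation = boto3.client("lakeformation")

# Define the LF-tags used in this walkthrough.
lakeformation.create_lf_tag(TagKey="Sensitivity", TagValues=["Public", "Confidential"])
lakeformation.create_lf_tag(TagKey="HasCampaign", TagValues=["true"])

# Tag the database; its tables and columns inherit Sensitivity=Public.
lakeformation.add_lf_tags_to_resource(
    Resource={"Database": {"Name": "lftpcdb"}},
    LFTags=[{"TagKey": "Sensitivity", "TagValues": ["Public"]}],
)

# Tag the campaign-related tables.
for table in ["items", "promotions"]:
    lakeformation.add_lf_tags_to_resource(
        Resource={"Table": {"DatabaseName": "lftpcdb", "Name": table}},
        LFTags=[{"TagKey": "HasCampaign", "TagValues": ["true"]}],
    )

# Override the inherited tag on the PII columns of the customers table.
lakeformation.add_lf_tags_to_resource(
    Resource={
        "TableWithColumns": {
            "DatabaseName": "lftpcdb",
            "Name": "customers",
            "ColumnNames": ["c_last_name", "c_first_name", "c_email_address"],
        }
    },
    LFTags=[{"TagKey": "Sensitivity", "TagValues": ["Confidential"]}],
)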

Verify permissions in account A

After the CloudFormation stack launches, complete the following steps in account A:

  1. On the AWS CloudFormation console, navigate to the Outputs tab of the stack.
  2. Choose the LFProducerStewardCredentials value to navigate to the AWS Secrets Manager console.
  3. In the Secret value section, choose Retrieve secret value.
  4. Note down the secret value for the password for IAM user lf-producersteward.

You need this to log in to the console later as the user lf-producersteward.

  5. On the Lake Formation console, choose Databases in the navigation pane.
  6. Open the database lftpcdb.
  7. Verify that the LF-tags on the database are created.
  8. Choose View tables and choose the items table to verify the LF-tags.
  9. Repeat the steps for the promotions and customers tables to verify the LF-tags assigned.
  10. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  11. Select the database lftpcdb and on the Actions menu, choose View Permissions.
  12. Verify that there are no default permissions granted on the database lftpcdb for IAMAllowedPrincipals.
  13. If you find any, select the permission and choose Revoke to revoke the permission.
  14. On the AWS Management Console, choose the AWS CloudShell icon on the top menu.

This opens AWS CloudShell in another tab of the browser. Allow a few minutes for the CloudShell environment to set up.

  15. Run the following AWS Command Line Interface (AWS CLI) command after replacing {BUCKET_NAME} with the DataLakeBucket value from the stack output.
aws s3 cp s3://aws-data-analytics-workshops/shared_datasets/tpcparquet/ s3://${BUCKET_NAME}/tpcparquet/  --recursive

If CloudShell isn’t available in your chosen Region, run the same AWS CLI command from your preferred AWS CLI environment as the IAM admin user.

  16. Verify that your S3 bucket has the dataset copied in it.
  17. Log out as the IAM admin user.

Grant permissions in account A

Next, we continue granting Lake Formation permissions to the dataset as a data steward within the producer account. The data steward grants the following LF-tag-based permissions to the consumer personas.

Consumer Persona | LF-tag Policy
lf-business-analysts | Sensitivity=Public
lf-campaign-manager | HasCampaign=true
  1. Log in to account A as user lf-producersteward, using the password you noted from Secrets Manager earlier.
  2. On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
  3. Choose Grant.
  4. Under Principals, select External accounts.
  5. Provide the ARN of the IAM user in the consumer account (arn:aws:iam::<accountB_id>:user/lf-business-analysts) and press Enter.

  6. Under LF-Tags or catalog resources, select Resources matched by LF-Tags.
  7. Choose Add LF-Tag to add a new key-value pair.
  8. For the key, choose Sensitivity and for the value, choose Public.
  9. Under Database permissions, select Describe, and under Table permissions, select Select and Describe.

  10. Choose Grant to apply the permissions.
  11. On the Lake Formation console, under Permissions in the navigation pane, choose Data Lake permissions.
  12. Choose Grant.
  13. Under Principals, select External accounts.
  14. Provide the ARN of the IAM user in the consumer account (arn:aws:iam::<accountB_id>:user/lf-campaign-manager) and press Enter.
  15. Under LF-Tags or catalog resources, select Resources matched by LF-Tags.
  16. Choose Add LF-Tag to add a new key-value pair.
  17. For the key, choose HasCampaign and for the value, choose true.
  18. Under Database permissions, select Describe, and under Table permissions, select Select and Describe.
  19. Choose Grant to apply the permissions.
  20. Verify on the Data lake permissions tab that the permissions you have granted show up correctly.
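
For readers who script their grants, a boto3 sketch of the first of these two grants might look like the following; replace <accountB_id> with the consumer account ID. The lf-campaign-manager grant is analogous, with the LF-tag expression HasCampaign=true.

import boto3

lakeformation = boto3.client("lakeformation")

analyst_arn = "arn:aws:iam::<accountB_id>:user/lf-business-analysts"

# Grant Describe on databases and Select/Describe on tables that match
# the LF-tag expression Sensitivity=Public.
for resource_type, permissions in [
    ("DATABASE", ["DESCRIBE"]),
    ("TABLE", ["SELECT", "DESCRIBE"]),
]:
    lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": analyst_arn},
        Resource={
            "LFTagPolicy": {
                "ResourceType": resource_type,
                "Expression": [{"TagKey": "Sensitivity", "TagValues": ["Public"]}],
            }
        },
        Permissions=permissions,
    )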

AWS CloudFormation setup in account B

Complete the following steps in the consumer account:

  1. Log in as an IAM admin user in account B and launch the CloudFormation stack.
  2. Choose Next.
  3. Provide a name for the stack, then choose Next.
  4. On the next page, choose Next.
  5. Review the details on the final page and select I acknowledge that AWS CloudFormation might create IAM resources.
  6. Choose Create.

Stack creation should take about 2–3 minutes. The stack sets up the following resources in account B:

  • IAM users datalakeadmin1, lf-business-analysts, and lf-campaign-manager, with relevant IAM and Lake Formation permissions
  • A database called db_for_shared_tables with Create_Table permissions granted to the lf-campaign-manager user
  • An S3 bucket named lfblog-athenaresults-<your-accountB-id>-us-east-1 with ListBucket and write permissions to lf-business-analysts and lf-campaign-manager

Note down the stack output details.

Accept resource shares in account B

After you launch the CloudFormation stack, complete the following steps in account B:

  1. On the CloudFormation stack Outputs tab, choose the link for DataLakeAdminCredentials.

This takes you to the Secrets Manager console.

  2. On the Secrets Manager console, choose Retrieve secret value and copy the password for the DataLakeAdmin user.
  3. Use the ConsoleIAMLoginURL value from the CloudFormation template output to log in to account B with the data lake admin user name datalakeadmin1 and the password you copied from Secrets Manager.
  4. Open the AWS RAM console in another browser tab.
  5. In the navigation pane, under Shared with me, choose Resource shares to view the pending invitations.

You should see two resource share invitations from the producer account A: one for database-level share and one for table-level share.

  6. Choose each resource share link, review the details, and choose Accept.

After you accept the invitations, the status of the resource shares changes from Pending to Active.
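
You can also accept the invitations programmatically. The following boto3 sketch accepts every pending invitation, so review the invitations first in anything other than a demo account.

import boto3

ram = boto3.client("ram")

# Accept all pending resource share invitations in the consumer account.
response = ram.get_resource_share_invitations()
for invitation in response["resourceShareInvitations"]:
    if invitation["status"] == "PENDING":
        ram.accept_resource_share_invitation(
            resourceShareInvitationArn=invitation["resourceShareInvitationArn"]
        )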

Grant permissions in account B

To grant permissions in account B, complete the following steps:

  1. On the Lake Formation console, under Permissions in the navigation pane, choose Administrative roles and tasks.
  2. Under Database creators, choose Grant.
  3. Under IAM users and roles, choose lf-business-analysts.
  4. For Catalog permissions, select Create database.
  5. Choose Grant.
  6. Log out of the console as the data lake admin user.

Query the shared datasets as consumer users

To validate the lf-business-analysts user’s data access, perform the following steps:

  1. Log in to the console as lf-business-analysts, using the credentials noted from the CloudFormation stack output.
  2. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.

  3. Select the database lftpcdb and on the Actions menu, choose Create resource link.

  4. Under Resource link name, enter rl_lftpcdb.
  5. Choose Create.
  6. After the resource link is created, select the resource link and choose View tables.

You can now see the four tables in the shared database.
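
Resource links can also be created through the AWS Glue API. The following boto3 sketch creates the same database resource link; replace <accountA_id> with the producer account ID.

import boto3

glue = boto3.client("glue")

# Create a resource link that points at the database shared from account A.
glue.create_database(
    DatabaseInput={
        "Name": "rl_lftpcdb",
        "TargetDatabase": {
            "CatalogId": "<accountA_id>",
            "DatabaseName": "lftpcdb",
        },
    }
)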

  7. Open the Athena console in another browser tab and choose the lfblog-athenaresults-<your-accountB-id>-us-east-1 bucket as the query results location.
  8. Verify data access using the following query (for more information, refer to Running SQL queries using Amazon Athena):
Select * from rl_lftpcdb.customers limit 10;

The following screenshot shows the query output.

Notice that account A shared the database lftpcdb with account B using the LF-tag expression Sensitivity=Public. The columns c_first_name, c_last_name, and c_email_address in the customers table were tagged Sensitivity=Confidential, overriding the database-level tag. Therefore, these three columns are not visible to the user lf-business-analysts.

You can preview the other tables from the database similarly to see the available columns and data.

  9. Log out of the console as lf-business-analysts.

Now we can validate the lf-campaign-manager user’s data access.

  1. Log in to the console as lf-campaign-manager using the credentials noted from the CloudFormation stack output.
  2. On the Lake Formation console, under Data catalog in the navigation pane, choose Databases.
  3. Verify that you can see the database db_for_shared_tables shared by the data lake admin.

  4. Under Data catalog in the navigation pane, choose Tables.

You should be able to see the two tables shared from account A using the LF-tag expression HasCampaign=true. The two tables show the Owner account ID as account A.

Because lf-campaign-manager received table-level shares, this user will create table-level resource links for querying in Athena.

  5. Select the promotions table, and on the Actions menu, choose Create resource link.

  6. For Resource link name, enter rl_promotions.

  7. Under Database, choose db_for_shared_tables for the database to contain the resource link.
  8. Choose Create.
  9. Repeat the resource link creation for the other table, items.

Notice that the resource links show account B as owner, whereas the actual tables show account A as the owner.

  10. Open the Athena console in another browser tab and choose the lfblog-athenaresults-<your-accountB-id>-us-east-1 bucket as the query results location.
  11. Query the tables using the resource links.

As shown in the following screenshot, all columns of both tables are accessible to lf-campaign-manager.

In summary, you have seen how LF-tags are used to share a database and select tables from one account to another account’s IAM users.

Clean up

To avoid incurring charges on the AWS resources created in this post, you can perform the following steps.

First, clean up resources in account A:

  1. Empty the S3 bucket created for this post by deleting the dataset objects that you copied into it.
  2. Delete the CloudFormation stack.

This deletes the S3 bucket, custom IAM roles, policies, and the LF database, tables, and permissions.

  3. You may also choose to undo the Lake Formation settings and add IAM access back from the Lake Formation console Settings page.

Now complete the following steps in account B:

  1. Empty the S3 bucket lfblog-athenaresults-<your-accountB-id>-us-east-1 used as the Athena query results location.
  2. Revoke the database creator permission from lf-business-analysts.
  3. Delete the CloudFormation stack.

This deletes the IAM users, S3 bucket, Lake Formation database db_for_shared_tables, resource links, and all the permissions from Lake Formation.

If there are any resource links and permissions left, delete them manually in Lake Formation from both accounts.

Conclusion

In this post, we illustrated the benefits of using Lake Formation cross-account sharing Version 3 with LF-tags to share directly with IAM principals, and how to receive the shared tables in the consumer account. We used a two-account scenario in which a data producer account shares a database and specific tables to individual IAM users in another account using LF-tags. In the receiving account, we showed the role played by a data lake admin versus the receiving IAM users. We also illustrated how to override column tags to mask and share PII data.

With Version 3 of its cross-account sharing features, Lake Formation makes more modern data mesh models possible, where a producer can share directly with an IAM principal in another account instead of the entire account. Data mesh implementation becomes easier for data administrators and data platform owners because they can easily scale to hundreds of consumer accounts using LF-tag-based sharing to organizational units or account IDs.

We encourage you to upgrade your Lake Formation cross-account sharing to Version 3 and benefit from the enhancements. For more details, see Updating cross-account data sharing version settings.


About the authors

Aarthi Srinivasan is a Senior Big Data Architect with AWS Lake Formation. She likes building data lake solutions for AWS customers and partners. When not on the keyboard, she explores the latest science and technology trends and spends time with her family.

Srividya Parthasarathy is a Senior Big Data Architect on the AWS Lake Formation team. She enjoys building analytics and data mesh solutions on AWS and sharing them with the community.

How to encrypt sensitive caller voice input in Amazon Lex

Post Syndicated from Herbert Guerrero original https://aws.amazon.com/blogs/security/how-to-encrypt-sensitive-caller-authentication-voice-input-in-amazon-lex/

In the telecommunications industry, sensitive authentication and user data are typically received through mobile voice and keypads, and companies are responsible for protecting the data obtained through these channels. The increasing use of voice-driven interactive voice response (IVR) has resulted in a need to provide solutions that can protect user data that is gathered from mobile voice inputs. In this blog post, you’ll see how to protect a caller’s sensitive voice data that was captured through Amazon Lex by using data encryption implemented through AWS Lambda functions. The solution described in this post helps you to protect customer data received through voice channels from inadvertent or unknown access. The solution also includes decryption capabilities, which give an authorized administrator or operator the ability to decrypt user data from a Lambda console.

Solution overview

To demonstrate the IVR solution described in this post, a caller speaks two sensitive pieces of data—credit card number and zip code—from an Amazon Connect contact flow. The spoken values are encrypted and returned to the contact flow to be stored in contact attributes. The encrypted ciphertext is retained as a contact attribute for decryption purposes. Amazon CloudWatch Logs is enabled in the contact flow, but only the encrypted values are logged in log streams.

For this solution, conversation logs for this Amazon Lex bot are not enabled. An operator with assigned AWS Identity and Access Management (IAM) permissions can monitor the logged encrypted entries from CloudWatch Logs. For more information, see Working with log groups and log streams in the Amazon CloudWatch Logs User Guide.

Solution architecture

Figure 1 shows the overview of the solution described in this blog post.

Figure 1: Example of solution architecture


Figure 1 shows the high-level steps of the solution; the number labels in the diagram correspond to the following steps.

  1. A caller places an inbound call.
  2. An Amazon Connect contact flow uses a Get customer input block, backed by an Amazon Lex bot, to prompt the caller for numerical data.
  3. The Amazon Lex bot invokes the Lambda function dev-encryption-core-EncryptFn.
  4. The Lambda function uses the AWS Encryption SDK to encrypt the caller’s plain text data.
  5. The AWS Encryption SDK obtains encryption keys from AWS Key Management Service (AWS KMS).
  6. The caller’s data is encrypted by using the AWS KMS keys obtained from AWS KMS.
  7. The Lambda function appends the encrypted data to the Amazon Lex bot session attributes.
  8. Amazon Lex returns the fully encrypted data back to Amazon Connect.
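
To illustrate steps 4 through 7, the following minimal Python sketch shows how such an encryption Lambda function might look when using the AWS Encryption SDK; the KMS key ARN and the event and response shapes are assumptions for illustration, not the exact code deployed by the solution.

import base64
import aws_encryption_sdk
from aws_encryption_sdk import CommitmentPolicy

# Assumed key ARN; the deployed stack provisions the real AWS KMS key.
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"

client = aws_encryption_sdk.EncryptionSDKClient(
    commitment_policy=CommitmentPolicy.REQUIRE_ENCRYPT_REQUIRE_DECRYPT
)
key_provider = aws_encryption_sdk.StrictAwsKmsMasterKeyProvider(key_ids=[KMS_KEY_ARN])

def handler(event, context):
    # Amazon Lex passes the caller's transcribed input in the event
    # (shape assumed here for illustration).
    plaintext = event["inputTranscript"]

    # Envelope-encrypt the value; data keys are obtained from AWS KMS.
    ciphertext, _header = client.encrypt(source=plaintext, key_provider=key_provider)

    # Return the base64-encoded ciphertext in the session attributes so
    # Amazon Connect can store it as a contact attribute.
    return {
        "sessionAttributes": {
            "encryptedValue": base64.b64encode(ciphertext).decode("utf-8")
        }
    }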

Overview of a contact flow

Figure 2: Contact flow captures input values using Amazon Lex and returns their encrypted values


Figure 2 shows an overview of the contact flow, which has two main steps:

  1. The first numerical data (in this example, an encrypted credit card number value) is stored in contact attributes.
  2. The second numerical data (in this example, an encrypted zip code value) is stored in contact attributes.

Prerequisites

This solution uses the following AWS services:

The following need to be installed on your local machine:

To implement the solution in this post, you first need the Amazon Connect instance prerequisite in place.

To set up the Amazon Connect instance (if none exists)

  1. Create an Amazon Connect instance with a claimed phone number and a configured Amazon Connect user linked to a basic routing profile. For more information about setting up a contact center, see Set up your contact center in the Amazon Connect Administrator Guide.
  2. Assign the CallCenterManager or Admin security profile to an Amazon Connect user.
  3. In the newly created Amazon Connect instance, under the Overview section, find the access URL with the format
    https://<aliasname>.awsapps.com/connect/login

    • Make note of the access URL, which you will use later to log in to the Amazon Connect Dashboard.
  4. Log in to your Amazon Connect instance with a Connect user that has Admin or CallCenterManager permissions.

Solution procedures

This solution includes the following procedures:

  1. Clone the project or download the solution zip file.
  2. Create AWS resources needed for encryption and decryption.
  3. Configure the Amazon Lex bot in Amazon Connect.
  4. Create the contact flow in Amazon Connect.
  5. Validate the solution.
  6. Decrypt the collected data.

To clone or download the solution

  • Navigate to the GitHub repo.
  • Clone or download the solution files to your local machine.

The downloaded file contains the artifacts needed for the deployment.

To create AWS resources needed for encryption and decryption

  1. From the command line, change directory to the project’s root directory.
  2. Run npm install.
  3. Run npm run build to transpile TypeScript to JavaScript and package code and its dependencies before deploying to AWS.
  4. Run cdk deploy CoreStack.

To configure the Amazon Lex bot in your Amazon Connect instance

  1. In the Amazon Connect console, choose Contact flows and scroll to the Amazon Lex section.
    Figure 3: Select Contact flows


  2. From the Bot menu, select secure_LexInput(Classic). Then select +Add Amazon Lex Bot.
    Figure 4: Configure the Amazon Lex bot to Amazon Connect


To import contact flow into Amazon Connect

  1. In the Amazon Connect console, choose Overview, and then choose Login as administrator.
  2. From the Routing menu on the left side, choose Contact flows to show the list of contact flows.
  3. Choose Create Contact flow.
  4. Choose the arrow to the right of the Save button and choose Import flow (beta). This imports the contact flow that you previously downloaded in the procedure To clone or download the solution.

    The contact flow already has the Amazon Lex bot configured.

    Figure 5: Select Import flow (beta)


  5. In the upper right corner of the contact flow, choose Save, and then choose OK to save the changes.
  6. Choose Publish to make the contact flow ready for use during the validation steps.
  7. (Optional) Claim a phone number (if none is available), using the following steps:
    1. In the Connect Dashboard, on the navigation menu, choose Channels, and then choose Phone numbers.
    2. On the right side of the page, choose Claim a number.
    3. Select the DID (Direct Inward Dialing) tab. Use the drop-down arrow to choose your country/region. When numbers are returned, choose one.
    4. Write down the phone number. You will call it later in this post.
  8. (Optional) On the Edit Phone number page, in the Description box, you can type a note if desired.
  9. To assign the contact flow to your claimed phone number, for Contact flow / IVR, choose the drop-down arrow, and then choose Secure_Lex_Input.
  10. Choose Save.
    Figure 6: Under Contact flow / IVR, select the imported contact flow


For more information, see Set up phone numbers for your contact center in the Amazon Connect Administrator Guide.

To validate the solution

  1. Dial the test phone number to go through the voice prompt flow.
  2. When prompted, speak a 16-digit credit card number (you have a maximum of two retries), then speak a 5-digit zip code (also a maximum of two retries).
  3. After you complete your test call, review the log streams in Amazon CloudWatch Logs to confirm that the digits that you spoke are now encrypted and stored as contact attributes. The two captured values, zipcode and creditcard, are both encrypted.
    Figure 7: Sample log showing encrypted values for zipcode and creditcard


  4. Log in to your Amazon Connect Dashboard as a Supervisor. The URL is provided after the Connect instance has been created. In the navigation menu, choose Contact search.
    Figure 8: Choose Contact search to look for the call information


  5. Locate your inbound call on the Contact search list. Note that it can take up to 60 seconds for data to appear in the Contact search list.
  6. Select the Contact ID for your call.
    Figure 9: The Contact search showing the contact details for your test call


  7. Copy the encrypted values for creditcard and zipcode and make note of them; you will use these values in the next procedure.
    Figure 10: Contact attributes stored in a contact flow are registered as part of the contact details


To decrypt the collected data

  1. In the AWS Lambda console, choose Functions.
  2. Use the Search bar to look for the dev-encryption-core-DecryptFn Lambda function, and then select the name link to open it.
  3. Under the encryption-master folder, open the test folder and locate the file decrypt.json under the events folder.
  4. Use the following steps to create a sample test event in the console by using the contents from decrypt.json. For more details, see Testing Lambda functions in the console.
    1. Choose the down arrow on the right side of Test.
    2. Choose Configure test event.
    3. Choose Create new test event.
    4. For Event name, enter decryptTest.
    5. Paste the contents from decrypt.json.
      {
          "Details": {
              "Parameters": {
                  "encrypted": "<encrypted-value-here>"
              }
          }
      }

    6. Choose Save.
  5. Use the encrypted values saved in the To validate the solution procedure and replace the placeholder value in the recently created test event.
    Figure 11: Replace the creditcard or zipCode values with the ones from the Contact Search page


  6. Choose Test. The output from the test shows the values decrypted by the Lambda function. This is shown in Figure 12 under the Execution result tab.
    Figure 12: Result from the decryption operation


Note: Make sure that only the appropriate authorized administrator or operator, application, or AWS service is able to invoke the decryption Lambda function.
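
For reference, a decryption handler under the same assumptions could mirror the encryption sketch shown earlier, reading the ciphertext from the test event shape shown above:

import base64
import aws_encryption_sdk
from aws_encryption_sdk import CommitmentPolicy

KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"  # assumed

client = aws_encryption_sdk.EncryptionSDKClient(
    commitment_policy=CommitmentPolicy.REQUIRE_ENCRYPT_REQUIRE_DECRYPT
)
key_provider = aws_encryption_sdk.StrictAwsKmsMasterKeyProvider(key_ids=[KMS_KEY_ARN])

def handler(event, context):
    # The sample test event passes the ciphertext under Details.Parameters.encrypted.
    ciphertext = base64.b64decode(event["Details"]["Parameters"]["encrypted"])

    # Decrypt with data keys obtained from AWS KMS; the caller must be authorized.
    plaintext, _header = client.decrypt(source=ciphertext, key_provider=key_provider)
    return {"decrypted": plaintext.decode("utf-8")}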

You have now successfully implemented the solution by encrypting and decrypting the voice input of your test call, which you collected through Amazon Lex.

Cleanup

To avoid incurring future charges, follow these steps to clean up the deployed resources that you created when implementing this solution.

To delete the Amazon Connect instance

  1. In the Amazon Connect console, under Instance alias, select the name of the Amazon Connect instance, and choose Delete.
  2. When prompted, type the name of the instance, and then choose Delete.

To delete the Amazon Lex bot

  1. In the Amazon Lex console, choose the bot that you created in the To configure the Amazon Lex bot procedure.
  2. Choose Delete, and then choose Continue.

To delete the AWS CloudFormation stack

  1. In the AWS CloudFormation console, on the Stacks page, select the stack you created in the procedure To create AWS resources needed for encryption and decryption.
  2. In the stack details pane, choose Delete.
  3. Choose Delete stack when prompted. This deletes the Amazon S3 bucket, IAM roles, and AWS Lambda functions that you created for testing. It also schedules a deletion date for the AWS KMS key.

Conclusion

In this post, you learned how an Amazon Connect contact flow can collect voice inputs from a caller by using Amazon Lex, and how you can encrypt these inputs by using your own AWS KMS key. This solution can help improve the security of voice input that is collected through Amazon Connect. For cost information, see the Amazon Connect pricing page.

For more information, see the blog post Creating a secure IVR solution with Amazon Connect and the topic Encrypt customer input (using OpenSSL) in the Amazon Connect Administrator Guide. As previously mentioned, the increasing use of voice-driven IVR has resulted in a need to provide solutions that can protect user data gathered from mobile voice inputs.

Additional resources include the AWS Lambda Developer Guide, the Amazon Lex Developer Guide, the Amazon Connect Administrator Guide, the AWS SDK for JavaScript (Node.js), and the AWS SDK for Python (Boto3).

If you need help with setting up this solution, you can get assistance from AWS Professional Services. You can also seek assistance from Amazon Connect partners available worldwide.

 
If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security news? Follow us on Twitter.

Herbert Guerrero


Herbert is a Senior ProServe Consultant for Amazon Connect. He enjoys designing and developing high-usability and scalable solutions. Understanding success criteria helps Herbert work backward and deliver well-architected solutions. His engineering background informs the way he engages with customers’ mental models of what their solutions should look like.

Ed Valdez


Ed is a Specialty Consultant with Amazon Web Services. As a software development professional with over 23 years of experience, he specializes in designing and delivering customer-centric solutions within the contact center domain.

How BookMyShow saved 80% in costs by migrating to an AWS modern data architecture

Post Syndicated from Mahesh Vandi Chalil original https://aws.amazon.com/blogs/big-data/how-bookmyshow-saved-80-in-costs-by-migrating-to-an-aws-modern-data-architecture/

This is a guest post co-authored by Mahesh Vandi Chalil, Chief Technology Officer of BookMyShow.

BookMyShow (BMS), a leading entertainment company in India, provides an online ticketing platform for movies, plays, concerts, and sporting events. Selling up to 200 million tickets on an annual run rate basis (pre-COVID) to customers in India, Sri Lanka, Singapore, Indonesia, and the Middle East, BookMyShow also offers an online media streaming service and end-to-end management for virtual and on-ground entertainment experiences across all genres.

The pandemic gave BMS the opportunity to migrate and modernize our 15-year-old analytics solution to a modern data architecture on AWS. This architecture is modern, secure, governed, and cost-optimized, with the ability to scale to petabytes. BMS migrated and modernized from on-premises and other cloud platforms to AWS in just four months. This project was run in parallel with our application migration project and achieved 90% cost savings in storage and 80% cost savings in analytics spend.

The BMS analytics platform caters to business needs for sales and marketing, finance, and business partners (e.g., cinemas and event owners), and provides application functionality for audience, personalization, pricing, and data science teams. The prior analytics solution had multiple copies of data, for a total of over 40 TB, with approximately 80 TB of data in other cloud storage. Data was stored on‑premises and in the cloud in various data stores. Growing organically, the teams had the freedom to choose their technology stack for individual projects, which led to the proliferation of various tools, technology, and practices. Individual teams for personalization, audience, data engineering, data science, and analytics used a variety of products for ingestion, data processing, and visualization.

This post discusses BMS’s migration and modernization journey, and how BMS, AWS, and the AWS Partner Minfy Technologies team worked together to successfully complete the migration in four months while saving costs. The migration tenets of the AWS modern data architecture made the project a huge success.

Challenges in the prior analytics platform

  • Varied Technology: Multiple teams used various products, languages, and versions of software.
  • Larger Migration Project: Because the analytics modernization was a parallel project with application migration, planning was crucial in order to consider the changes in core applications and project timelines.
  • Resources: The team experienced resource churn from the application migration project and had very little documentation of the current systems.
  • Data: There were multiple copies of data and no single source of truth; each data store provided a view for its business unit.
  • Ingestion Pipelines: Complex data pipelines moved data across various data stores at varied frequencies. We had multiple approaches in place to ingest data to Cloudera, via over 100 Kafka consumers from transaction systems and MQTT (Message Queuing Telemetry Transport) for clickstreams, stored procedures, and Spark jobs. We had approximately 100 jobs for data ingestion across Spark, Alteryx, Beam, NiFi, and more.
  • Hadoop Clusters: The Hadoop clusters ran on large, dedicated hardware, incurring fixed costs. The on-premises Cloudera setup catered to most of the data engineering, audience, and personalization batch processing workloads. Teams had their own implementations of HBase and Hive for the audience and personalization applications.
  • Data warehouse: The data engineering team used TiDB as their on-premises data warehouse. However, each consumer team had their own perspective of data needed for analysis. As this siloed architecture evolved, it resulted in expensive storage and operational costs to maintain these separate environments.
  • Analytics Database: The analytics team used data sourced from other transactional systems and denormalized data. The team had their own extract, transform, and load (ETL) pipeline, using Alteryx with a visualization tool.

The following migration tenets led to project success:

  • Prioritize by business functionality.
  • Apply best practices when building a modern data architecture from Day 1.
  • Move only required data, canonicalize the data, and store it in the most optimal format in the target. Remove data redundancy as much as possible. Mark scope for future optimization where changes would be intrusive.
  • Build the data architecture while keeping data formats, volumes, governance, and security in mind.
  • Simplify ELT and processing jobs by categorizing the jobs as rehosted, rewritten, and retired. Finalize canonical data format, transformation, enrichment, compression, and storage format as Parquet.
  • Rehost machine learning (ML) jobs that were critical for business.
  • Work backward from our goals, clearing roadblocks and altering decisions to move forward.
  • Use serverless options as a first option and pay per use. Assess the cost and effort for rearchitecting to select the right approach. Execute a proof of concept to validate this for each component and service.

The following strategies were applied to succeed in this migration:

  • Team – We created a unified team with people from data engineering, analytics, and data science as part of the analytics migration project. Site reliability engineering (SRE) and application teams were involved when critical decisions were needed regarding data or timeline for alignment. The analytics, data engineering, and data science teams spent considerable time planning, understanding the code, and iteratively looking at the existing data sources, data pipelines, and processing jobs. AWS team with partner team from Minfy Technologies helped BMS arrive at a migration plan after a proof of concept for each of the components in data ingestion, data processing, data warehouse, ML, and analytics dashboards.
  • Workshops – The AWS team conducted a series of workshops and immersion days, and coached the BMS team on the technology and best practices to deploy the analytics services. The AWS team helped BMS explore the configuration and benefits of the migration approach for each scenario (data migration, data pipeline, data processing, visualization, and machine learning) via proof-of-concepts (POCs). The team captured the changes required in the existing code for migration. BMS team also got acquainted with the following AWS services:
  • Proof of concept – The BMS team, with help from the partner and AWS team, implemented multiple proofs of concept to validate the migration approach:
    • Performed batch processing of Spark jobs in Amazon EMR, in which we checked the runtime, required code changes, and cost.
    • Ran clickstream analysis jobs in Amazon EMR, testing the end-to-end pipeline. Team conducted proofs of concept on AWS IoT Core for MQTT protocol and streaming to Amazon S3.
    • Migrated ML models to Amazon SageMaker and orchestrated with Amazon MWAA.
    • Created sample QuickSight reports and dashboards, in which features and time to build were assessed.
    • Configured for key scenarios for Amazon Redshift, in which time for loading data, query performance, and cost were assessed.
  • Effort vs. cost analysis – Team performed the following assessments:
    • Compared the ingestion pipelines, the difference in data structure in each store, the basis of the current business need for the data source, the activity for preprocessing the data before migration, data migration to Amazon S3, and change data capture (CDC) from the migrated applications in AWS.
    • Assessed the effort to migrate approximately 200 jobs, determined which jobs were redundant or needed improvement from a functional perspective, and completed a migration list for the target state. Because the modernization of the MQTT workflow code to serverless was time-consuming, the team decided to rehost it on Amazon Elastic Compute Cloud (Amazon EC2) and defer modernization to Amazon Kinesis to the next phase.
    • Reviewed over 400 reports and dashboards, prioritized development in phases, and reassessed business user needs.

The following AWS Cloud services were chosen for the proposed architecture:

  • Data lake – We used Amazon S3 as the data lake to store the single truth of information for all raw and processed data, thereby reducing the copies of data storage and storage costs.
  • Ingestion – Because we had multiple sources of truth in the current architecture, we arrived at a common structure before migration to Amazon S3, and existing pipelines were modified to do preprocessing. These one-time preprocessing jobs were run in Cloudera, because the source data was on-premises, and on Amazon EMR for data in the cloud. We designed new data pipelines for ingestion from transactional systems on the AWS cloud using AWS Glue ETL.
  • Processing – Processing jobs were segregated based on runtime into two categories: batch and near-real time. Batch processes were further divided into transient Amazon EMR clusters with varying runtimes and Hadoop application requirements like HBase. Near-real-time jobs were provisioned in an Amazon EMR permanent cluster for clickstream analytics, and a data pipeline from transactional systems. We adopted a serverless approach using AWS Glue ETL for new data pipelines from transactional systems on the AWS cloud.
  • Data warehouse – We chose Amazon Redshift as our data warehouse, and planned on how the data would be distributed based on query patterns.
  • Visualization – We built the reports in Amazon QuickSight in phases and prioritized them based on business demand. We discussed with business users their current needs and identified the immediate reports required. We defined the phases of report and dashboard creation and built the reports in Amazon QuickSight. We plan to use embedded reports for external users in the future.
  • Machine learning – Custom ML models were deployed on Amazon SageMaker. Existing Airflow DAGs were migrated to Amazon MWAA.
  • Governance, security, and compliance – Governance with AWS Lake Formation was adopted from Day 1. We configured the AWS Glue Data Catalog to reference data used as sources and targets. We had to comply with Payment Card Industry (PCI) guidelines because payment information was in the data lake, so we ensured the necessary security policies were in place.

Solution overview

BMS modern data architecture

The following diagram illustrates our modern data architecture.

The architecture includes the following components:

  1. Source systems – These include the following:
    • Data from transactional systems stored in MariaDB (booking and transactions).
    • User interaction clickstream data via Kafka consumers to DataOps MariaDB.
    • Members and seat allocation information from MongoDB.
    • SQL Server for specific offers and payment information.
  2. Data pipeline – Spark jobs on an Amazon EMR permanent cluster process the clickstream data from Kafka clusters.
  3. Data lake – Data from source systems was stored in their respective Amazon S3 buckets, with prefixes for optimized data querying. For Amazon S3, we followed a hierarchy to store raw, summarized, and team- or service-related data in different parent folders according to the source and type of data. Lifecycle policies were added to the logs and temp folders of different services per each team’s requirements.
  4. Data processing – Transient Amazon EMR clusters are used for processing data into a curated format for the audience, personalization, and analytics teams. Small file merger jobs merge the clickstream data to a larger file size, which saved costs for one-time queries.
  5. Governance – AWS Lake Formation enables the usage of AWS Glue crawlers to capture the schema of data stored in the data lake and version changes in the schema. The Data Catalog and security policy in AWS Lake Formation enable access to data for roles and users in Amazon Redshift, Amazon Athena, Amazon QuickSight, and data science jobs. AWS Glue ETL jobs load the processed data to Amazon Redshift at scheduled intervals.
  6. Queries – The analytics team used Amazon Athena to perform one-time queries raised by business teams on the data lake. Because report development is in phases, Amazon Athena was also used for exporting data.
  7. Data warehouse – Amazon Redshift was used as the data warehouse, where the reports for the sales teams, management, and third parties (i.e., theaters and events) are processed and stored for quick retrieval. Views to analyze the total sales, movie sale trends, member behavior, and payment modes are configured here. We use materialized views for denormalized tables, different schemas for metadata, and transactional and behavior data.
  8. Reports – We used Amazon QuickSight reports for various business, marketing, and product use cases.
  9. Machine learning – Some of the models deployed on Amazon SageMaker are as follows:
    • Content popularity – Decides the recommended content for users.
    • Live event popularity – Calculates the popularity of live entertainment events in different regions.
    • Trending searches – Identifies trending searches across regions.

Walkthrough

Migration execution steps

We standardized tools, services, and processes for data engineering, analytics, and data science:

  • Data lake
    • Identified the source data to be migrated from Archival DB, BigQuery, TiDB, and the analytics database.
    • Built a canonical data model that catered to multiple business teams and reduced the copies of data, and therefore storage and operational costs. Modified existing jobs to facilitate migration to a canonical format.
    • Identified the source systems, capacity required, anticipated growth, owners, and access requirements.
    • Ran the bulk data migration to Amazon S3 from various sources.
  • Ingestion
    • Transaction systems – Retained the existing Kafka queues and consumers.
    • Clickstream data – Successfully conducted a proof of concept to use AWS IoT Core for MQTT protocol. But because we needed to make changes in the application to publish to AWS IoT Core, we decided to implement it as part of mobile application modernization at a later time. We decided to rehost the MQTT server on Amazon EC2.
  • Processing
    • Listed the data pipelines relevant to the business and migrated them with minimal modification.
    • Categorized workloads into critical jobs, redundant jobs, or jobs that can be optimized:
      • Spark jobs were migrated to Amazon EMR.
      • HBase jobs were migrated to Amazon EMR with HBase.
      • Hive-based jobs were modified to store their metadata in the AWS Glue Data Catalog.
      • NiFi jobs were simplified and rewritten in Spark to run on Amazon EMR.
    • We configured one persistent Amazon EMR cluster for streaming the clickstream and personalization workloads, and multiple transient clusters for running all other Spark ETL or processing jobs. We used Spot Instances for task nodes to save costs. We optimized data storage with specific jobs to merge small files and convert files to compressed formats.
    • AWS Glue crawlers identified new data in Amazon S3. AWS Glue ETL jobs transformed and uploaded processed data to the Amazon Redshift data warehouse.
  • Data warehouse
    • Defined the data warehouse schema by categorizing the critical reports required by the business, keeping in mind the current workload and the reports required in the future.
    • Defined the staging area for incremental data loads into Amazon Redshift, created materialized views, and tuned queries based on usage. The transaction and primary metadata are stored in Amazon Redshift to cater to all data analysis and reporting requirements. We created materialized views and denormalized tables in Amazon Redshift to use as data sources for Amazon QuickSight dashboards and segmentation jobs, respectively.
    • Optimally used the Amazon Redshift cluster by loading only the last two years of data into Amazon Redshift, and used Amazon Redshift Spectrum to query historical data through external tables (see the sketch after this list). This helped balance the usage and cost of the Amazon Redshift cluster.
  • Visualization
    • Amazon QuickSight dashboards were created for the sales and marketing team in Phase 1:
      • Sales summary report – An executive summary dashboard to get an overview of sales across the country by region, city, movie, theatre, genre, and more.
      • Live entertainment – A dedicated report for live entertainment vertical events.
      • Coupons – A report for coupons purchased and redeemed.
      • BookASmile – A dashboard to analyze the data for BookASmile, a charity initiative.
  • Machine learning
    • Listed the ML workloads to be migrated based on current business needs.
    • Priority ML processing jobs were deployed on Amazon EMR. Models were modified to use Amazon S3 as source and target, and new APIs were exposed to use the functionality. ML models were deployed on Amazon SageMaker for movies, live event clickstream analysis, and personalization.
    • Existing artifacts in Airflow orchestration were migrated to Amazon MWAA.
  • Security
    • AWS Lake Formation was the foundation of the data lake, with the AWS Glue Data Catalog as the foundation for the central catalog for the data stored in Amazon S3. This provided access to the data by various functionalities, including the audience, personalization, analytics, and data science teams.
    • Personally identifiable information (PII) and payment data was stored in the data lake and data warehouse, so we had to comply with PCI guidelines. Encryption of data at rest and in transit was considered and configured at each service level (Amazon S3, AWS Glue Data Catalog, Amazon EMR, AWS Glue, Amazon Redshift, and QuickSight). Clear roles, responsibilities, and access permissions for different user groups and privileges were listed and configured in AWS Identity and Access Management (IAM) and the individual services.
    • Existing single sign-on (SSO) integration with Microsoft Active Directory was used for Amazon QuickSight user access.
  • Automation
    • We used AWS CloudFormation for the creation and modification of all the core and analytics services.
    • AWS Step Functions was used to orchestrate Spark jobs on Amazon EMR.
    • Scheduled jobs were configured in AWS Glue for uploading data in Amazon Redshift based on business needs.
    • Monitoring of the analytics services was done using Amazon CloudWatch metrics, which also guided right-sizing of instances and configuration. Spark job performance on Amazon EMR was analyzed using the native Spark logs and the Spark user interface (UI).
    • Lifecycle policies were applied to the data lake to optimize the data storage costs over time.
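
The following is a minimal sketch (referenced in the data warehouse steps above) of the two Amazon Redshift patterns we relied on: a materialized view over denormalized data, and an external schema that lets Amazon Redshift Spectrum query historical data in place in the Amazon S3 data lake. All table, view, database, and IAM role names here are hypothetical placeholders, not our production definitions.

-- Hypothetical materialized view used as a QuickSight data source
CREATE MATERIALIZED VIEW mv_daily_sales AS
SELECT booking_date, region, SUM(amount) AS total_sales
FROM sales.transactions  -- hypothetical fact table holding the last two years of data
GROUP BY booking_date, region;

-- Hypothetical external schema so Redshift Spectrum can query historical
-- data that remains in Amazon S3, cataloged in AWS Glue
CREATE EXTERNAL SCHEMA spectrum_history
FROM DATA CATALOG DATABASE 'datalake_db'
IAM_ROLE 'arn:aws:iam::111122223333:role/SpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

-- A historical query is then served from the data lake instead of cluster storage
SELECT region, SUM(amount) FROM spectrum_history.transactions_archive GROUP BY region;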

Benefits of a modern data architecture

A modern data architecture offered us the following benefits:

  • Scalability – We moved from a fixed infrastructure to the minimal infrastructure required, with configuration to scale on demand. Services like Amazon EMR and Amazon Redshift enable us to do this with just a few clicks.
  • Agility – We use purpose-built managed services instead of reinventing the wheel. Automation and monitoring were key considerations, which enable us to make changes quickly.
  • Serverless – Adoption of serverless services like Amazon S3, AWS Glue, Amazon Athena, AWS Step Functions, and AWS Lambda supports us when our business has sudden spikes, such as when new movies or events launch.
  • Cost savings – Our storage size was reduced by 90%. Our overall spend on analytics and ML was reduced by 80%.

Conclusion

In this post, we showed you how a modern data architecture on AWS helped BMS to easily share data across organizational boundaries. This allowed BMS to make decisions with speed and agility at scale; ensure compliance via unified data access, security, and governance; and scale systems at a low cost without compromising performance. Working with the AWS and Minfy Technologies teams helped BMS choose the correct technology services and complete the migration in four months. BMS achieved its scalability and cost-optimization goals with this updated architecture, which has set the stage for innovation using graph databases and enhanced ML projects to improve the customer experience.


About the Authors

Mahesh Vandi Chalil is Chief Technology Officer at BookMyShow, India’s leading entertainment destination. Mahesh has over two decades of global experience and is passionate about building scalable products that delight customers, keeping innovation as the top goal and motivating his team to constantly aspire to it. He invests his energy in creating and nurturing the next generation of technology leaders and entrepreneurs, both within the organization and outside of it. He is a proud husband and father of two daughters, and plays cricket during his leisure time.

Priya Jathar is a Solutions Architect working in the Digital Native Business segment at AWS. She has more than two decades of IT experience, with expertise in application development, databases, and analytics. She is a builder who enjoys innovating with new technologies to achieve business goals, and currently helps customers migrate, modernize, and innovate in the cloud. In her free time she likes to paint, and hone her gardening and cooking skills.

Vatsal Shah is a Senior Solutions Architect at AWS based out of Mumbai, India. He has more than nine years of industry experience, including leadership roles in product engineering, SRE, and cloud architecture. He currently focuses on enabling large startups to streamline their cloud operations and help them scale on the cloud. He also specializes in AI and Machine Learning use cases.

Code conversion from Greenplum to Amazon Redshift: Handling arrays, dates, and regular expressions

Post Syndicated from Jagrit Shrestha original https://aws.amazon.com/blogs/big-data/code-conversion-from-greenplum-to-amazon-redshift-handling-arrays-dates-and-regular-expressions/

Amazon Redshift is a fully managed cloud data warehouse service used by startups, medium enterprises, and large enterprises for data analytics, including querying data in their data lakes. Tens of thousands of businesses around the globe use Amazon Redshift to modernize their data analytics platforms.

Greenplum is an open-source, massively parallel database used for analytics, mostly for on-premises infrastructure. Greenplum is based on the PostgreSQL database engine.

Many customers have found migrating from on-premises Greenplum to Amazon Redshift an attractive option, rather than continuing to manage Greenplum infrastructure themselves.

Even though both Greenplum and Amazon Redshift use the open-source PostgreSQL database engine, migration still requires a lot of planning and manual intervention. This post covers the key functions and considerations while performing code conversion from Greenplum to Amazon Redshift. It is focused on the migration of procedures, functions, and views.

Solution overview

AWS Database Migration Service (AWS DMS) and the AWS Schema Conversion Tool (AWS SCT) can migrate most of the objects in a heterogeneous database migration from Greenplum to Amazon Redshift. But there are some situations where code conversion teams encounter errors and warnings for views, procedures, and functions while creating them in Amazon Redshift. To address this type of situation, manual conversion of the code is required.

This post focuses on how to handle the following while migrating from Greenplum to Amazon Redshift:

  • Arrays
  • Dates and timestamps
  • Regular expressions (regex)

Please note that for this post, we use Greenplum 4.3; Amazon Redshift is based on PostgreSQL 8.0.2.

Working with array functions

The AWS SCT doesn’t convert array functions while migrating from Greenplum or PostgreSQL to Amazon Redshift, so developers need to convert those functions manually. This post outlines the most common array functions:

  • ARRAY_UPPER
  • JSON_EXTRACT_ARRAY_ELEMENT_TEXT and JSON_ARRAY_LENGTH
  • UNNEST()
  • STRING_AGG()
  • ANY ARRAY()

ARRAY_UPPER()

This function returns the upper bound of an array. It can be used to extract the nth element from an array in PostgreSQL or Greenplum.

The Greenplum code is as follows:

With temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
array['"111-222-3333"','"101-201-3001"','"XXX-YYY-ZZZZ"','NULL'] as PhoneNumbers
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
array['222-333-4444','201-301-4001','AAA-BBB-CCCC'] as PhoneNumbers
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
array['333-444-5555','301-401-3001','DDD-EEE-FFFF'] as PhoneNumbers
)
Select Firstname, PhoneNumbers[ARRAY_UPPER(PhoneNumbers,1)]
from temp1

There is no function to extract an element from an array in Amazon Redshift; however, there are two JSON functions that can be used for this purpose:

  • JSON_EXTRACT_ARRAY_ELEMENT_TEXT() – Returns a JSON array element in the outermost array of a JSON string
  • JSON_ARRAY_LENGTH() – Returns the number of elements in the outer array of a JSON string

See the following code:

With temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
array['"111-222-3333"','"101-201-3001"','"XXX-YYY-ZZZZ"'] as PhoneNumbers
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
array['"222-333-4444"','"201-301-4001"','"AAA-BBB-CCCC"'] as PhoneNumbers
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
array['"333-444-5555"','"301-401-3001"','"DDD-EEE-FFFF"'] as PhoneNumbers
)

Select
FirstName
,('['+array_to_string(phoneNumbers,',')+']') as JSONConvertedField
,JSON_EXTRACT_ARRAY_ELEMENT_TEXT
(
'['+array_to_string(phoneNumbers,',')+']'
,JSON_ARRAY_LENGTH('['+array_to_string(phoneNumbers,',')+']')-1
) as LastElementFromArray
from temp1

UNNEST()

UNNEST() is PostgreSQL’s system function for semi-structured data, expanding an array, or a combination of arrays, to a set of rows. It was introduced to improve database performance when inserting, updating, or deleting thousands of records.

You can use UNNEST() for basic array, multiple arrays, and multiple arrays with different lengths.

Some of the Amazon Redshift functions used to unnest arrays are split_part, json_extract_path_text, json_array_length, and json_extract_array_element_text.

In Greenplum, the UNNEST function is used to expand an array to a set of rows:

Select 'A', unnest(array[1,2])

Output
A 1
A 2

with temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
'111-222-3333' as Mobilephone,'101-201-3001' as HomePhone
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
'222-333-4444' as Mobilephone,'201-301-4001' as HomePhone
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
'333-444-5555' as Mobilephone,'301-401-3001' as HomePhone
)

select
FirstName
,LastName
,unnest(array['Mobile'::text,'HomePhone'::text]) as PhoneType
,unnest(array[MobilePhone::text,HomePhone::text]) as PhoneNumber
from
temp1
order by 1,2,3

Amazon Redshift doesn’t support the UNNEST function; you can use the following workaround:

with temp1 as
(
Select 'John' as FirstName, 'Smith' as LastName ,
'111-222-3333' as Mobilephone,'101-201-3001' as HomePhone
union all
Select 'Bob' as FirstName, 'Haris' as LastName ,
'222-333-4444' as Mobilephone,'201-301-4001' as HomePhone
union all
Select 'Mary' as FirstName, 'Jane' as LastName ,
'333-444-5555' as Mobilephone,'301-401-3001' as HomePhone
),
ns as
(
Select row_number() over(order by 1) as n from pg_tables
)

Select
FirstName
,LastName
,split_part('Mobile,Home',',',ns.n::int) as PhoneType
,split_part(MobilePhone|| '&&' || HomePhone, '&&', ns.n::int) as PhoneNumber
from
temp1, ns
where
ns.n<=regexp_count('Mobile,Home',',')+1
order by 1,2,3

When the array is already in JSON array form, use the JSON_EXTRACT_ARRAY_ELEMENT_TEXT() function and JSON_ARRAY_LENGTH:

with ns as
(
Select row_number() over(order by 1) as n from pg_tables
)

Select JSON_EXTRACT_ARRAY_ELEMENT_TEXT('["arrayelement1","arrayelement2"]',ns.n-1)
from ns
where
ns.n<=JSON_ARRAY_LENGTH('["arrayelement1","arrayelement2"]')

STRING_AGG()

The STRING_AGG() function is an aggregate function that concatenates a list of strings and places a separator between them. The function doesn’t add the separator at the end of the string. See the following code:

STRING_AGG ( expression, separator [order_by_clause] )

The Greenplum code is as follows:

with temp1 as
(
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Doe'::text as LastName
union all
Select 'Finance'::text as Dept, 'Mary'::text as FirstName, 'Jane'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Bob'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Steve'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Account'::text as Dept, 'Phil'::text as FirstName, 'Adams'::text as LastName
union all
Select 'Account'::text as Dept, 'Jim'::text as FirstName, 'Smith'::text as LastName
)
Select dept,STRING_AGG(FirstName||' '||LastName,' ; ') as Employees from temp1 group by dept order by 1

The Amazon Redshift equivalent for the STRING_AGG() function is LISTAGG(). This aggregate function orders the rows for that group according to the ORDER BY expression, then concatenates the values into a single string:

LISTAGG(expression, separator [order_by_clause])

See the following code:

Create temporary Table temp1 as
(
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Finance'::text as Dept, 'John'::text as FirstName, 'Doe'::text as LastName
union all
Select 'Finance'::text as Dept, 'Mary'::text as FirstName, 'Jane'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Bob'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Marketing'::text as Dept, 'Steve'::text as FirstName, 'Smith'::text as LastName
union all
Select 'Account'::text as Dept, 'Phil'::text as FirstName, 'Adams'::text as LastName
union all
Select 'Account'::text as Dept, 'Jim'::text as FirstName, 'Smith'::text as LastName
)

Select dept,LISTAGG(FirstName||' '||LastName,' ; ') as Employees from temp1
group by dept
order by 1
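
The ORDER BY behavior mentioned above is expressed with a WITHIN GROUP clause. As an illustration (our own variant of the query above, not part of the original migration), the following orders each department’s employees by last name before concatenating:

Select dept, LISTAGG(FirstName||' '||LastName,' ; ') WITHIN GROUP (ORDER BY LastName) as Employees
from temp1
group by dept
order by 1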

ANY ARRAY()

The PostgreSQL ANY (ARRAY[...]) construct evaluates and compares the left-hand expression to each element in the array:

Select * from temp1 where DeptName = ANY (ARRAY['10-F','20-F','30-F'])

In Amazon Redshift, the evaluation can be achieved with an IN operator:

Select * from temp1 where DeptName IN ('10-F','20-F','30-F')

Working with date functions

In this section, we discuss calculating date differences using date_part in Greenplum and datediff in Amazon Redshift.

When the application needs to calculate the difference between the subfields of dates for Greenplum, it uses the function date_part, which allows you to retrieve subfields such as year, month, week, and day. In the following example queries, we calculate the number of completion_days by calculating the difference between originated_date and eco_date.

To calculate the difference between the subfields of the date, Amazon Redshift has the function datediff. The following queries show an example of how to calculate the completion_days as the difference between eco_date and originated_date. DATEDIFF determines the number of date part boundaries that are crossed between the two expressions.
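
The key behavior to keep in mind is that DATEDIFF counts boundary crossings, not elapsed time. The following minimal examples are our own illustrations of this Amazon Redshift behavior:

-- Returns 1: one year boundary is crossed, even though only one day elapsed
SELECT datediff(year, '2008-12-31', '2009-01-01');

-- Returns 0: no year boundary is crossed, even though 364 days elapsed
SELECT datediff(year, '2009-01-01', '2009-12-31');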

We compare the Greenplum and Amazon Redshift queries as follows:

  • Difference by year

The following Greenplum query returns 1 year between 2008-12-31 and 2009-01-01:

SELECT date_part('year', TIMESTAMP '2009-01-01') - date_part('year', TIMESTAMP '2008-12-31') as year;

The following Amazon Redshift query returns 1 year between 2008-12-31 and 2009-01-01:

SELECT datediff(year, '2008-12-31', '2009-01-01') as year;
  • Difference by month

The following Greenplum query returns 1 month between 2008-12-31 and 2009-01-01:

SELECT (date_part('year', '2009-01-01'::date) - date_part('year', '2008-12-31'::date)) * 12 +
(date_part('month', '2009-01-01'::date) - date_part('month', '2008-12-31'::date)) as month;

The following Amazon Redshift query returns 1 month between 2008-12-31 and 2009-01-01:

SELECT datediff(month, '2008-12-31', '2009-01-01') as month;
  • Difference by week

The following Greenplum query returns 0 weeks between 2008-12-31 and 2009-01-01:

SELECT date_part('week', timestamp '2009-01-01') - date_part('week', timestamp '2008-12-31') as week;

The following Amazon Redshift query returns 0 weeks between 2008-12-31 and 2009-01-01:

SELECT datediff(week, '2008-12-31', '2009-01-01') as week;
  • Difference by day

The following Greenplum query returns 1 day:

SELECT date_part('day', '2009-01-01 24:00:00'::timestamp - '2008-12-31 24:00:00'::timestamp) as day;

The following Amazon Redshift query returns 1 day:

SELECT datediff(day, '2008-12-31', '2009-01-01') as day;
  • Difference by hour

The following Greenplum query returns 1 hour:

SELECT date_part('hour', '2009-01-01 22:56:10'::timestamp - '2008-12-31 21:54:55'::timestamp) as hour;

The following Amazon Redshift query returns 1 hour:

SELECT datediff(hour, '2009-01-01 21:56:10', '2009-01-01 22:56:10') as hour;
  • Difference by minute

The following Greenplum query returns 3 minutes:

SELECT date_part('minute', '2009-01-01 22:56:10'::timestamp - '2009-01-01 21:53:10'::timestamp) as minutes;

The following Amazon Redshift query returns 1 minute:

SELECT datediff(minute, '2009-01-01 21:56:10', '2009-01-01 21:57:55') as minute;
  • Difference by second

The following Greenplum query returns 40 seconds:

SELECT date_part('second', '2009-01-01 22:56:50'::timestamp - '2009-01-01 21:53:10'::timestamp) as seconds;

The following Amazon Redshift query returns 45 seconds:

SELECT datediff(second, '2009-01-01 21:56:10', '2009-01-01 21:56:55') as seconds;

Now let’s look at how we use Amazon Redshift to calculate days and weeks in seconds.

The following Amazon Redshift query displays 2 days:

SELECT datediff(second, '2008-12-30 21:56:10', '2009-01-01 21:56:55')/(60*60*24) as days;

The following Amazon Redshift query displays 9 weeks:

SELECT datediff(second, '2008-10-30 21:56:10', '2009-01-01 21:56:55')/(60*60*24*7) as weeks;

For Greenplum, the date subfields need to be in single quotes, whereas for Amazon Redshift, we can use date subfields such as year, month, week, day, minute, and second without quotes. For Greenplum, we have to subtract one date part from the other, whereas for Amazon Redshift we pass the two dates as separate, comma-separated arguments.

Extract ISOYEAR from date

The ISO 8601 year is a week-numbering year. It begins with the Monday of the week containing January 4th, so for dates in early January or late December, the ISO year may differ from the Gregorian year. An ISO year has 52 or 53 full weeks (364 or 371 days). The extra week is called a leap week; a year with such a week is called a leap year.

The following Greenplum query displays the ISOYEAR 2020:

SELECT extract(ISOYEAR from '2019-12-30'::date) as ISOYEARS;

The following Amazon Redshift query displays the ISOYEAR 2020:

SELECT to_char('2019-12-30'::date, 'IYYY') as ISOYEARS;

Function to generate_series()

Greenplum has adopted the PostgreSQL function generate_series(). However, generate_series works differently in Amazon Redshift when retrieving records from tables, because it’s a leader node-only function.

To display a series of numbers in Amazon Redshift, run the following query on the leader node. In this example, it displays 10 rows, numbered 1–10:

SELECT generate_series(1,10);

To display a series of days for a given date, use the following query. It extracts the day from the given date and subtracts 1, to display a series of numbers from 0–6:

SELECT generate_series(0, extract(day from date '2009-01-07')::int - 1);

But for queries that fetch records from a table, join with other tables, and process data on the compute nodes, generate_series doesn’t work and generates an Invalid Operation error message. The following code is an example of a SQL statement that works for Greenplum but fails for Amazon Redshift:

SELECT column_1
FROM table_1 t1
JOIN table_2 t2
ON t2.code = t1.code
CROSS JOIN generate_series(1,12) gen(fiscal_month)
WHERE condition_1

For Amazon Redshift, the solution is to create a table to store the series data, and rewrite the code as follows:

SELECT column_1
FROM table_1 t1
JOIN table_2 t2
ON t2.code = t1.code
CROSS JOIN (select "number" as fiscal_month FROM table_t3 WHERE "number" <= 12) gen
WHERE condition_1
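
The numbers table itself is a one-time setup. The following is a minimal sketch under the assumption that table_t3 with a "number" column plays that role (the names match the example above but are otherwise arbitrary):

-- One-time creation of a small numbers table to stand in for generate_series
CREATE TABLE table_t3 ("number" INT);

-- Populate the 12 fiscal months used in the example
INSERT INTO table_t3 ("number")
VALUES (1),(2),(3),(4),(5),(6),(7),(8),(9),(10),(11),(12);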

Working with regular expressions (regex functions)

Amazon Redshift and Greenplum both support three conditions for pattern matching:

  • LIKE
  • SIMILAR TO
  • POSIX operators
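
For quick orientation, the following one-line comparisons are our own illustrative examples showing the same match expressed with each of the three conditions; each returns true:

SELECT 'abc123' LIKE 'abc%';
SELECT 'abc123' SIMILAR TO 'abc[0-9]+';
SELECT 'abc123' ~ '^abc[0-9]+$';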

In this post, we don’t discuss all of these pattern-matching conditions in detail. Instead, we discuss a few regex functions and regex escape characters that aren’t supported by Amazon Redshift.

Regexp_split_to_table function

The regexp_split_to_table function splits a string using a POSIX regular expression pattern as the delimiter.

This function has the following syntax:

regexp_split_to_table(string, pattern [, flags])

For Greenplum, we use the following query:

select regexp_split_to_table('bat,cat,hat', ',') as regexp_split_table_GP

For Amazon Redshift, the regexp_split_to_table function has to be converted using the Amazon Redshift split_part function, combined with a row number source such as the pg_tables trick shown in the UNNEST section:

with ns as
(
Select row_number() over(order by 1) as n from pg_tables
)
Select split_part('bat,cat,hat', ',', ns.n::int) as regexp_split_table_rs
from ns
where ns.n <= regexp_count('bat,cat,hat', ',') + 1

Another way to convert regexp_split_to_table is to treat the delimited string as a JSON array and reuse the JSON functions shown earlier:

with ns as
(
Select row_number() over(order by 1) as n from pg_tables
)
Select JSON_EXTRACT_ARRAY_ELEMENT_TEXT('["bat","cat","hat"]', ns.n-1) as regexp_split_table_rs
from ns
where ns.n <= JSON_ARRAY_LENGTH('["bat","cat","hat"]')

Substring from regex expressions

SUBSTRING (string FROM pattern) extracts the substring or value matching the POSIX pattern that is passed to it. If there is no match, null is returned. For more information, refer to Pattern Matching.

We use the following code in Greenplum:

create temp table data1 ( col1 varchar );
insert into data1 values ('hellohowareyou 12\687687abcd');
select substring( col1 from '[A-Za-z]+$') from data1;

We can use the regexp_substr function to convert this code to Amazon Redshift. It returns the characters extracted from a string by searching for a regular expression pattern. The syntax is as follows:

REGEXP_SUBSTR ( source_string, pattern [, position [, occurrence [, parameters ] ] ] )
select regexp_substr( col1, '[A-Za-z]+$') as substring_from_rs from data1
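
The optional position and occurrence arguments give finer control. For example, the following illustrative query (our own, not from the migration) returns 345, the second run of digits when scanning from position 1:

select regexp_substr('abc12def345', '[0-9]+', 1, 2);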

Key points while converting regular expression escapes

The Postgres escape character E doesn’t work in Amazon Redshift. Additionally, the following Greenplum regular expression constraints aren’t supported in Amazon Redshift:

  • \m – Matches only at the beginning of a word
  • \y – Matches only at the beginning or end of a word

For Amazon Redshift, use \\< and \\>, or [[:<:]] and [[:>:]] instead.

Use the following code for Greenplum:

select col1,
case
when (col1) ~ E'\\m[0-9]{2}[A-Z]{1}[0-9]{1}' then
regexp_replace(col1, E'([0-9]{2})([A-Z]{1})([0-9]{1})', E'\\2')
else 'nothing'
end as regex_test
from temp1123

Use the following code for Amazon Redshift:

select col1,
case
when (col1) ~ '\\<[0-9]{2}[A-Z]{1}[0-9]{1}\\>' then
regexp_replace(col1,'([0-9]{2})([A-Z]{1})([0-9]{1})','\\2')
else 'nothing'
end as regex_test
from temp1123

OR

select col1,
case
when (col1) ~ '[[:<:]][0-9]{2}[A-Z]{1}[0-9]{1}[[:>:]]' then
regexp_replace(col1,'([0-9]{2})([A-Z]{1})([0-9]{1})','\\2')
else 'nothing'
end as regex_test
from temp1123

Conclusion

For a heterogeneous database migration from Greenplum to Amazon Redshift, you can use AWS DMS and the AWS SCT to migrate most of the database objects, such as tables, views, stored procedures, and functions.

There are some situations in which a function is used in the source environment that the target environment doesn’t support. In this case, manual conversion is required to produce the same result set and complete the database migration.

In some cases, using a new window function supported by the target environment proves more efficient for analytical queries that process petabytes of data.

This post covered several situations where manual code conversion is required; these conversions can also make the code and queries more efficient.

If you have any questions or suggestions, please share your feedback.


About the Authors

Jagrit Shrestha is a Database Consultant at Amazon Web Services (AWS). He works as a database specialist, helping customers migrate their on-premises database workloads to AWS and providing technical guidance.

Ishwar Adhikary is a Database Consultant at Amazon Web Services (AWS). He works closely with customers to modernize their database and application infrastructures. His focus area is migration of relational databases from on-premises data centers to the AWS Cloud.

Shrenik Parekh works as a Database Consultant at Amazon Web Services (AWS). He specializes in database migration assessment and database migration, modernizing database environments with purpose-built databases using AWS cloud database services. He also focuses on AWS services for data analytics. In his spare time, he loves hiking, yoga, and other outdoor activities.

Santhosh Meenhallimath is a Data Architect at AWS. He works on building analytical solutions, building data lakes, and migrating databases into AWS.

How to query and visualize Macie sensitive data discovery results with Athena and QuickSight

Post Syndicated from Keith Rozario original https://aws.amazon.com/blogs/security/how-to-query-and-visualize-macie-sensitive-data-discovery-results-with-athena-and-quicksight/

Amazon Macie is a fully managed data security service that uses machine learning and pattern matching to help you discover and protect sensitive data in Amazon Simple Storage Service (Amazon S3). With Macie, you can analyze objects in your S3 buckets to detect occurrences of sensitive data, such as personally identifiable information (PII), financial information, personal health information, and access credentials.

In this post, we walk you through a solution to gain comprehensive and organization-wide visibility into which types of sensitive data are present in your S3 storage, where the data is located, and how much is present. Once enabled, Macie automatically starts discovering sensitive data in your S3 storage and builds a sensitive data profile for each bucket. The profiles are organized in a visual, interactive data map, and you can use the data map to run targeted sensitive data discovery jobs. Both automated data discovery and targeted jobs produce rich, detailed sensitive data discovery results. This solution uses Amazon Athena and Amazon QuickSight to deep-dive on the Macie results, and to help you analyze, visualize, and report on sensitive data discovered by Macie, even when the data is distributed across millions of objects, thousands of S3 buckets, and thousands of AWS accounts. Athena is an interactive query service that makes it simpler to analyze data directly in Amazon S3 using standard SQL. QuickSight is a cloud-scale business intelligence tool that connects to multiple data sources, including Athena databases and tables.

This solution is relevant to data security, data governance, and security operations engineering teams.

The challenge: how to summarize sensitive data discovered in your growing S3 storage

Macie issues findings when an object is found to contain sensitive data. In addition to findings, Macie keeps a record of each S3 object it analyzes. These records are known as sensitive data discovery results, and they include additional context about your data in Amazon S3. Because the results files are large, Macie exports them to an S3 bucket of your choice for long-term storage, so you need to take additional steps to query and visualize the results. We discuss the differences between findings and results in more detail later in this post.

With the increasing number of data privacy guidelines and compliance mandates, customers need to scale their monitoring to encompass thousands of S3 buckets across their organization. The growing volume of data to assess, and the growing list of findings from discovery jobs, can make it difficult to review and remediate issues in a timely manner. In addition to viewing individual findings for specific objects, customers need a way to comprehensively view, summarize, and monitor sensitive data discovered across their S3 buckets.

To illustrate this point, we ran a Macie sensitive data discovery job on a dataset created by AWS. The dataset contains about 7,500 files that have sensitive information, and Macie generated a finding for each sensitive file analyzed, as shown in Figure 1.

Figure 1: Macie findings from the dataset

Your security team could spend days, if not months, analyzing these individual findings manually. Instead, we outline how you can use Athena and QuickSight to query and visualize the Macie sensitive data discovery results to understand your data security posture.

The additional information in the sensitive data discovery results will help you gain comprehensive visibility into your data security posture. With this visibility, you can answer questions such as the following:

  • What are the top 5 most commonly occurring sensitive data types?
  • Which AWS accounts have the most findings?
  • How many S3 buckets are affected by each of the sensitive data types?

Your security team can write their own customized queries to answer questions such as the following:

  • Is there sensitive data in AWS accounts that are used for development purposes?
  • Is sensitive data present in S3 buckets that previously did not contain sensitive information?
  • Was there a change in configuration for S3 buckets containing the greatest amount of sensitive data?

How are findings different from results?

As a Macie job progresses, it produces two key types of output: sensitive data findings (or findings for short), and sensitive data discovery results (or results).

Findings provide a report of potential policy violations with an S3 bucket, or the presence of sensitive data in a specific S3 object. Each finding provides a severity rating, information about the affected resource, and additional details, such as when Macie found the issue. Findings are published to the Macie console, AWS Security Hub, and Amazon EventBridge.

In contrast, results are a collection of records for each S3 object that a Macie job analyzed. These records contain information about objects that do and do not contain sensitive data, including up to 1,000 occurrences of each sensitive data type that Macie found in a given object, and whether Macie was unable to analyze an object because of issues such as permissions settings or use of an unsupported format. If an object contains sensitive data, the results record includes detailed information that isn’t available in the finding for the object.

One of the key benefits of querying results is to uncover gaps in your data protection initiatives—these gaps can occur when data in certain buckets can’t be analyzed because Macie was denied access to those buckets, or was unable to decrypt specific objects. The following table maps some of the key differences between findings and results.

  • Enabled by default – Findings: Yes; Results: No
  • Location of published output – Findings: Macie console, Security Hub, and EventBridge; Results: S3 bucket
  • Details of S3 objects that couldn’t be scanned – Findings: No; Results: Yes
  • Details of S3 objects in which no sensitive data was found – Findings: No; Results: Yes
  • Identification of files inside compressed archives that contain sensitive data – Findings: No; Results: Yes
  • Number of occurrences reported per object – Findings: up to 15; Results: up to 1,000
  • Retention period – Findings: 90 days in the Macie console; Results: defined by customer

Architecture

As shown in Figure 2, you can build out the solution in three steps:

  1. Enable the results and publish them to an S3 bucket
  2. Build out the Athena table to query the results by using SQL
  3. Visualize the results with QuickSight
Figure 2: Architecture diagram showing the flow of the solution

Prerequisites

To implement the solution in this blog post, you must first complete the following prerequisites:

Figure 3: Sample data loaded into three different AWS accounts

Note: All data in this blog post has been artificially created by AWS for demonstration purposes and has not been collected from any individual person. Similarly, such data does not, nor is it intended, to relate back to any individual person.

Step 1: Enable the results and publish them to an S3 bucket

Publication of the discovery results to Amazon S3 is not enabled by default. The setup requires that you specify an S3 bucket to store the results (we also refer to this as the discovery results bucket), and use an AWS Key Management Service (AWS KMS) key to encrypt the bucket.

If you are analyzing data across multiple accounts in your organization, then you need to enable the results in your delegated Macie administrator account. You do not need to enable results in individual member accounts. However, if you’re running Macie jobs in a standalone account, then you should enable the Macie results directly in that account.

To enable the results

  1. Open the Macie console.
  2. Select the AWS Region from the upper right of the page.
  3. From the left navigation pane, select Discovery results.
  4. Select Configure now.
  5. Select Create Bucket, and enter a unique bucket name. This will be the discovery results bucket name. Make note of this name because you will use it when you configure the Athena tables later in this post.
  6. Under Encryption settings, select Create new key. This takes you to the AWS KMS console in a new browser tab.
  7. In the AWS KMS console, do the following:
    1. For Key type, choose symmetric, and for Key usage, choose Encrypt and Decrypt.
    2. Enter a meaningful key alias (for example, macie-results-key) and description.
    3. (Optional) For simplicity, set your current user or role as the Key Administrator.
    4. Set your current user/role as a user of this key in the key usage permissions step. This will give you the right permissions to run the Athena queries later.
    5. Review the settings and choose Finish.
  8. Navigate to the browser tab with the Macie console.
  9. From the AWS KMS Key dropdown, select the new key.
  10. To view KMS key policy statements that were automatically generated for your specific key, account, and Region, select View Policy. Copy these statements in their entirety to your clipboard.
  11. Navigate back to the browser tab with the AWS KMS console and then do the following:
    1. Select Customer managed keys.
    2. Choose the KMS key that you created, choose Switch to policy view, and under Key policy, select Edit.
    3. In the key policy, paste the statements that you copied. When you add the statements, do not delete any existing statements and make sure that the syntax is valid. Policies are in JSON format.
  12. Navigate back to the Macie console browser tab.
  13. Review the inputs in the Settings page for Discovery results and then choose Save. Macie will perform a check to make sure that it has the right access to the KMS key, and then it will create a new S3 bucket with the required permissions.
  14. If you haven’t run a Macie discovery job in the last 90 days, you will need to run a new discovery job to publish the results to the bucket.

In this step, you created a new S3 bucket and KMS key that you are using only for Macie. For instructions on how to enable and configure the results using existing resources, see Storing and retaining sensitive data discovery results with Amazon Macie. Make sure to review Macie pricing details before creating and running a sensitive data discovery job.

Step 2: Build out the Athena table to query the results using SQL

Now that you have enabled the discovery results, Macie will begin publishing them into your discovery results bucket in the form of jsonl.gz files. Depending on the amount of data, there could be thousands of individual files, with each file containing multiple records. To identify the top five most commonly occurring sensitive data types in your organization, you would need to query all of these files together.

In this step, you will configure Athena so that it can query the results using SQL syntax. Before you can run an Athena query, you must specify a query result bucket location in Amazon S3. This is different from the Macie discovery results bucket that you created in the previous step.

If you haven’t set up Athena previously, we recommend that you create a separate S3 bucket, and specify a query result location using the Athena console. After you’ve set up the query result location, you can configure Athena.

To create a new Athena database and table for the Macie results

  1. Open the Athena console, and in the query editor, enter the following data definition language (DDL) statement. In the context of SQL, a DDL statement is a syntax for creating and modifying database objects, such as tables. For this example, we named our database macie_results.
    CREATE DATABASE macie_results;
    

    After running this step, you’ll see a new database in the Database dropdown. Make sure that the new macie_results database is selected for the next queries.

    Figure 4: Create database in the Athena console

  2. Create a table in the database by using the following DDL statement. Make sure to replace <RESULTS-BUCKET-NAME> with the name of the discovery results bucket that you created previously.
    CREATE EXTERNAL TABLE maciedetail_all_jobs(
    	accountid string,
    	category string,
    	classificationdetails struct<jobArn:string,result:struct<status:struct<code:string,reason:string>,sizeClassified:string,mimeType:string,sensitiveData:array<struct<category:string,totalCount:string,detections:array<struct<type:string,count:string,occurrences:struct<lineRanges:array<struct<start:string,`end`:string,`startColumn`:string>>,pages:array<struct<pageNumber:string>>,records:array<struct<recordIndex:string,jsonPath:string>>,cells:array<struct<row:string,`column`:string,`columnName`:string,cellReference:string>>>>>>>,customDataIdentifiers:struct<totalCount:string,detections:array<struct<arn:string,name:string,count:string,occurrences:struct<lineRanges:array<struct<start:string,`end`:string,`startColumn`:string>>,pages:array<string>,records:array<string>,cells:array<string>>>>>>,detailedResultsLocation:string,jobId:string>,
    	createdat string,
    	description string,
    	id string,
    	partition string,
    	region string,
    	resourcesaffected struct<s3Bucket:struct<arn:string,name:string,createdAt:string,owner:struct<displayName:string,id:string>,tags:array<string>,defaultServerSideEncryption:struct<encryptionType:string,kmsMasterKeyId:string>,publicAccess:struct<permissionConfiguration:struct<bucketLevelPermissions:struct<accessControlList:struct<allowsPublicReadAccess:boolean,allowsPublicWriteAccess:boolean>,bucketPolicy:struct<allowsPublicReadAccess:boolean,allowsPublicWriteAccess:boolean>,blockPublicAccess:struct<ignorePublicAcls:boolean,restrictPublicBuckets:boolean,blockPublicAcls:boolean,blockPublicPolicy:boolean>>,accountLevelPermissions:struct<blockPublicAccess:struct<ignorePublicAcls:boolean,restrictPublicBuckets:boolean,blockPublicAcls:boolean,blockPublicPolicy:boolean>>>,effectivePermission:string>>,s3Object:struct<bucketArn:string,key:string,path:string,extension:string,lastModified:string,eTag:string,serverSideEncryption:struct<encryptionType:string,kmsMasterKeyId:string>,size:string,storageClass:string,tags:array<string>,embeddedFileDetails:struct<filePath:string,fileExtension:string,fileSize:string,fileLastModified:string>,publicAccess:boolean>>,
    	schemaversion string,
    	severity struct<description:string,score:int>,
    	title string,
    	type string,
    	updatedat string)
    ROW FORMAT SERDE
    	'org.openx.data.jsonserde.JsonSerDe'
    WITH SERDEPROPERTIES (
    	'paths'='accountId,category,classificationDetails,createdAt,description,id,partition,region,resourcesAffected,schemaVersion,severity,title,type,updatedAt')
    STORED AS INPUTFORMAT
    	'org.apache.hadoop.mapred.TextInputFormat'
    OUTPUTFORMAT
    	'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
    LOCATION
    	's3://<RESULTS-BUCKET-NAME>/AWSLogs/'
    

    After you complete this step, you will see a new table named maciedetail_all_jobs in the Tables section of the query editor.

  3. Query the results to start gaining insights. For example, to identify the top five most common sensitive data types, run the following query:
    select sensitive_data.category,
    	detections_data.type,
    	sum(cast(detections_data.count as INT)) total_detections
    from maciedetail_all_jobs,
    	unnest(classificationdetails.result.sensitiveData) as t(sensitive_data),
    	unnest(sensitive_data.detections) as t(detections_data)
    where classificationdetails.result.sensitiveData is not null
    and resourcesaffected.s3object.embeddedfiledetails is null
    group by sensitive_data.category, detections_data.type
    order by total_detections desc
    LIMIT 5
    

    Running this query on the sample dataset gives the following output.

    Figure 5: Results of a query showing the five most common sensitive data types in the dataset

  4. (Optional) The previous query ran on all of the results available for Macie. You can further query which accounts have the greatest amount of sensitive data detected.
    select accountid,
    	sum(cast(detections_data.count as INT)) total_detections
    from maciedetail_all_jobs,
    	unnest(classificationdetails.result.sensitiveData) as t(sensitive_data),
    	unnest(sensitive_data.detections) as t(detections_data)
    where classificationdetails.result.sensitiveData is not null
    and resourcesaffected.s3object.embeddedfiledetails is null
    group by accountid
    order by total_detections desc
    

    To test this query, we distributed the synthetic dataset across three member accounts in our organization, ran the query, and received the following output. If you enable Macie in just a single account, then you will only receive results for that one account.

    Figure 6: Query results for total number of sensitive data detections across all accounts in an organization

For a list of more example queries, see the amazon-macie-results-analytics GitHub repository.
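
As one more example, the detailed results let you find objects that Macie could not analyze, which is useful for the gap analysis discussed earlier. The following is a minimal sketch against the maciedetail_all_jobs table defined above; it simply counts analyzed objects by their result status code:

select classificationdetails.result.status.code as status_code,
	count(*) as object_count
from maciedetail_all_jobs
group by 1
order by 2 desc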

Step 3: Visualize the results with QuickSight

In the previous step, you used Athena to query your Macie discovery results. Although the queries were powerful, they only produced tabular data as their output. In this step, you will use QuickSight to visualize the results of your Macie jobs.

Before creating the visualizations, you first need to grant QuickSight the right permissions to access Athena, the results bucket, and the KMS key that you used to encrypt the results.

To allow QuickSight access to the KMS key

  1. Open the AWS Identity and Access Management (IAM) console, and then do the following:
    1. In the navigation pane, choose Roles.
    2. In the search pane for roles, search for aws-quicksight-s3-consumers-role-v0. If this role does not exist, search for aws-quicksight-service-role-v0.
    3. Select the role and copy the role ARN. You will need this role ARN to modify the KMS key policy to grant permissions for this role.
  2. Open the AWS KMS console and then do the following:
    1. Select Customer managed keys.
    2. Choose the KMS key that you created.
    3. Paste the following statement in the key policy. When you add the statement, do not delete any existing statements, and make sure that the syntax is valid. Replace <QUICKSIGHT_SERVICE_ROLE_ARN> and <KMS_KEY_ARN> with your own information. Policies are in JSON format.
	{ "Sid": "Allow Quicksight Service Role to use the key",
		"Effect": "Allow",
		"Principal": {
			"AWS": <QUICKSIGHT_SERVICE_ROLE_ARN>
		},
		"Action": "kms:Decrypt",
		"Resource": <KMS_KEY_ARN>
	}

To allow QuickSight access to Athena and the discovery results S3 bucket

  1. In QuickSight, in the upper right, choose your user icon to open the profile menu, and choose US East (N. Virginia). You can only modify permissions in this Region.
  2. In the upper right, open the profile menu again, and select Manage QuickSight.
  3. Select Security & permissions.
  4. Under QuickSight access to AWS services, choose Manage.
  5. Make sure that the S3 checkbox is selected, choose Select S3 buckets, and then do the following:
    1. Choose the discovery results bucket.
    2. You do not need to check the box under Write permissions for Athena workgroup. The write permissions are not required for this post.
    3. Select Finish.
  6. Make sure that the Amazon Athena checkbox is selected.
  7. Review the selections and be careful that you don’t inadvertently disable AWS services and resources that other users might be using.
  8. Select Save.
  9. In QuickSight, in the upper right, open the profile menu, and choose the Region where your results bucket is located.

Now that you’ve granted QuickSight the right permissions, you can begin creating visualizations.

To create a new dataset referencing the Athena table

  1. On the QuickSight start page, choose Datasets.
  2. On the Datasets page, choose New dataset.
  3. From the list of data sources, select Athena.
  4. Enter a meaningful name for the data source (for example, macie_datasource) and choose Create data source.
  5. Select the database that you created in Athena (for example, macie_results).
  6. Select the table that you created in Athena (for example, maciedetail_all_jobs), and choose Select.
  7. You can either import the data into SPICE or query the data directly. We recommend that you use SPICE for improved performance, but the visualizations will still work if you query the data directly.
  8. To create an analysis using the data as-is, choose Visualize.

You can then visualize the Macie results in the QuickSight console. The following example shows a delegated Macie administrator account that is running a visualization, with account IDs on the y axis and the count of affected resources on the x axis.

Figure 7: Visualize query results to identify total number of sensitive data detections across accounts in an organization

You can also visualize the aggregated data in QuickSight. For example, you can view the number of findings for each sensitive data category in each S3 bucket. The Athena table doesn’t provide the aggregated data necessary for this visualization, so you need to query the table and then visualize the output of the query.

To query the table and visualize the output in QuickSight

  1. On the Amazon QuickSight start page, choose Datasets.
  2. On the Datasets page, choose New dataset.
  3. Select the data source that you created in Athena (for example, macie_datasource) and then choose Create Dataset.
  4. Select the database that you created in Athena (for example, macie_results).
  5. Choose Use Custom SQL, enter the following query, and choose Confirm Query.
    	select resourcesaffected.s3bucket.name as bucket_name,
    		sensitive_data.category,
    		detections_data.type,
    		sum(cast(detections_data.count as INT)) total_detections
    	from macie_results.maciedetail_all_jobs,
    		unnest(classificationdetails.result.sensitiveData) as t(sensitive_data),unnest(sensitive_data.detections) as t(detections_data)
    where classificationdetails.result.sensitiveData is not null
    and resourcesaffected.s3object.embeddedfiledetails is null
    group by resourcesaffected.s3bucket.name, sensitive_data.category, detections_data.type
    order by total_detections desc
    	

  6. You can either import the data into SPICE or query the data directly.
  7. To create an analysis using the data as-is, choose Visualize.

Now you can visualize the output of the query that aggregates data across your S3 buckets. For example, we used the name of the S3 bucket to group the results, and then we created a donut chart of the output, as shown in Figure 8.

Figure 8: Visualize query results for total number of sensitive data detections across each S3 bucket in an organization

From the visualizations, we can identify which buckets or accounts in our organizations contain the most sensitive data, for further action. Visualizations can also act as a dashboard to track remediation.

If you encounter permissions issues, see Insufficient permissions when using Athena with Amazon QuickSight and Troubleshooting key access for troubleshooting steps.

You can replicate the preceding steps by using the sample queries from the amazon-macie-results-analytics GitHub repo to view data that is aggregated across S3 buckets, AWS accounts, or individual Macie jobs. Using these queries with your Macie results will help you get started with tracking the security posture of your data in Amazon S3.

Conclusion

In this post, you learned how to enable sensitive data discovery results for Macie, query those results with Athena, and visualize the results in QuickSight.

Because Macie sensitive data discovery results provide more granular data than the findings, you can pursue a more comprehensive incident response when sensitive data is discovered. The sample queries in this post provide answers to some generic questions that you might have. After you become familiar with the structure, you can run other interesting queries on the data.

We hope that you can use this solution to write your own queries to gain further insights into sensitive data discovered in S3 buckets, according to the business needs and regulatory requirements of your organization. You can consider using this solution to better understand and identify data security risks that need immediate attention. For example, you can use this solution to answer questions such as the following:

  • Is financial information present in an AWS account where it shouldn’t be?
  • Are S3 buckets that contain PII properly hardened with access controls and encryption?

You can also use this solution to understand gaps in your data security initiatives by tracking files that Macie couldn’t analyze due to encryption or permission issues. To further expand your knowledge of Macie capabilities and features, see the Amazon Macie documentation.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on Amazon Macie re:Post.

Want more AWS Security news? Follow us on Twitter.

Author

Keith Rozario

Keith is a Sr. Solution Architect at Amazon Web Services based in Singapore, where he helps customers develop solutions for their most complex business problems. He loves road cycling, reading comics from DC, and enjoying the sweet sound of music from AC/DC.

Author

Scott Ward

Scott is a Principal Solutions Architect with AWS External Security Services (ESS) and has been with Amazon for over 20 years. Scott provides technical guidance to the ESS services, such as GuardDuty, Security Hub, Macie, Inspector and Detective, and helps customers make their applications secure. Scott has a deep background in supporting, enhancing, and building global financial solutions to meet the needs of large companies, including many years of supporting the global financial systems for Amazon.com.

Author

Koulick Ghosh

Koulick is a Senior Product Manager in AWS Security based in Seattle, WA. He loves speaking with customers on how AWS Security services can help make them more secure. In his free-time, he enjoys playing the guitar, reading, and exploring the Pacific Northwest.