Tag Archives: Amazon S3

Top 10 security best practices for securing data in Amazon S3

Post Syndicated from Megan O'Neil original https://aws.amazon.com/blogs/security/top-10-security-best-practices-for-securing-data-in-amazon-s3/

With more than 100 trillion objects in Amazon Simple Storage Service (Amazon S3) and an almost unimaginably broad set of use cases, securing data stored in Amazon S3 is important for every organization. So, we’ve curated the top 10 controls for securing your data in S3. By default, all S3 buckets are private and can be accessed only by users who are explicitly granted access through ACLs, S3 bucket policies, and identity-based policies. In this post, we review the latest S3 features and Amazon Web Services (AWS) services that you can use to help secure your data in S3, including organization-wide preventative controls such as AWS Organizations service control policies (SCPs). We also provide recommendations for S3 detective controls, such as Amazon GuardDuty for S3, AWS CloudTrail object-level logging, AWS Security Hub S3 controls, and CloudTrail configuration specific to S3 data events. In addition, we provide data protection options and considerations for encrypting data in S3. Finally, we review backup and recovery recommendations for data stored in S3. Given the broad set of use cases that S3 supports, you should determine the priority of controls applied in accordance with your specific use case and associated details.

Block public S3 buckets at the organization level

Designate specific AWS accounts for public S3 use and prevent all other S3 buckets from inadvertently becoming public by enabling S3 Block Public Access. Use Organizations SCPs to ensure that the S3 Block Public Access setting cannot be changed. S3 Block Public Access provides a level of protection that works at the account level and also on individual buckets, including those that you create in the future. You can block existing public access, whether it was granted by an ACL or a policy, and ensure that public access isn't granted to newly created buckets and objects. This approach allows only designated AWS accounts to have public S3 buckets while blocking all other AWS accounts. To learn more about Organizations SCPs, see Service control policies.
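
For example, the following SCP is a minimal sketch that denies the API actions used to change Block Public Access settings. The statement ID is illustrative; in practice you would attach the SCP to the organizational units that should never host public buckets, leaving your designated public-bucket accounts outside its scope.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "DenyChangesToS3BlockPublicAccess",
          "Effect": "Deny",
          "Action": [
            "s3:PutAccountPublicAccessBlock",
            "s3:PutBucketPublicAccessBlock"
          ],
          "Resource": "*"
        }
      ]
    }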

Use bucket policies to verify all access granted is restricted and specific

Check that the access granted in the Amazon S3 bucket policy is restricted to the specific AWS principals, federated users, service principals, IP addresses, or VPCs that you intend. A bucket policy that allows a wildcard principal ("Principal": "*") can make the bucket accessible to anyone, and a bucket policy that allows a wildcard action ("Action": "*") can allow a user to perform any action on the bucket. For more information, see Using bucket policies.
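
As an illustration, the following bucket policy sketch grants read-only access to a single IAM role and only through a specific VPC endpoint; the account number, role name, bucket name, and VPC endpoint ID are placeholders for your own values.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowAppRoleReadThroughVpcEndpoint",
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::<111122223333>:role/<app-role>"
          },
          "Action": "s3:GetObject",
          "Resource": "arn:aws:s3:::<your-bucket-name>/*",
          "Condition": {
            "StringEquals": {
              "aws:SourceVpce": "<vpce-1a2b3c4d5e6f7890a>"
            }
          }
        }
      ]
    }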

Ensure that any identity-based policies don’t use wildcard actions

Identity policies are policies assigned to AWS Identity and Access Management (IAM) users and roles, and they should follow the principle of least privilege to help prevent inadvertent access or changes to resources. Establishing least privilege identity policies includes defining specific actions, such as s3:GetObject or s3:PutObject, instead of s3:*. In addition, you can use predefined AWS-wide condition keys and S3-specific condition keys to specify additional controls on specific actions. An example of an AWS-wide condition key commonly used for S3 is IpAddress: { "aws:SourceIp": "10.10.10.10" }, where you can specify your organization's internal IP space for specific actions in S3. See IAM.1 in Monitor S3 using Security Hub and CloudWatch Logs to detect whether policies with wildcard actions and wildcard resources are present in your accounts by using Security Hub.
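
Putting these pieces together, a least privilege identity policy might look like the following sketch, which limits a role to reading and writing objects in one bucket and only from an internal IP range; the bucket name and CIDR range are placeholders.

    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "AllowObjectReadWriteFromCorpNetwork",
          "Effect": "Allow",
          "Action": [
            "s3:GetObject",
            "s3:PutObject"
          ],
          "Resource": "arn:aws:s3:::<your-bucket-name>/*",
          "Condition": {
            "IpAddress": {
              "aws:SourceIp": "<10.10.10.0/24>"
            }
          }
        }
      ]
    }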

Consider splitting read, write, and delete access. Allow only write access to users or services that generate and write data to S3 but don't need to read or delete objects. Define an S3 lifecycle policy to remove objects on a schedule instead of through manual intervention (see Managing your storage lifecycle); this allows you to remove delete actions from your identity-based policies. Verify your policies with the IAM policy simulator. Use IAM Access Analyzer to help you identify, review, and design S3 bucket policies or IAM policies that grant access to your S3 resources from outside of your AWS account.
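
For example, the following AWS CLI command is a sketch of a lifecycle rule that expires objects under a given prefix after 90 days, so that no identity needs standing delete permissions; the bucket name and prefix are placeholders.

    aws s3api put-bucket-lifecycle-configuration --bucket <your-bucket-name> --lifecycle-configuration '{"Rules":[{"ID":"ExpireTempObjectsAfter90Days","Status":"Enabled","Filter":{"Prefix":"<tmp/>"},"Expiration":{"Days":90}}]}'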

Enable S3 protection in GuardDuty to detect suspicious activities

In 2020, GuardDuty announced coverage for S3. Turning this on enables GuardDuty to continuously monitor and profile S3 data access events (data plane operations) and S3 configuration (control plane APIs) to detect suspicious activities, such as requests coming from unusual geolocations, disabling of preventative controls, and API call patterns consistent with an attempt to discover misconfigured bucket permissions. To achieve this, GuardDuty uses a combination of anomaly detection, machine learning, and continuously updated threat intelligence. To learn more, including how to enable GuardDuty for S3, see Amazon S3 protection in Amazon GuardDuty.
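
If you already have a GuardDuty detector in a Region, one way to turn on S3 protection is with the AWS CLI, as in the following sketch; the detector ID is a placeholder that you can look up with the first command.

    aws guardduty list-detectors
    aws guardduty update-detector --detector-id <detector-id> --data-sources '{"S3Logs":{"Enable":true}}'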

Use Macie to scan for sensitive data outside of designated areas

In May 2020, AWS re-launched Amazon Macie. Macie is a fully managed service that helps you discover and protect your sensitive data by using machine learning to automatically review and classify your data in S3. Enabling Macie organization-wide is a straightforward and cost-efficient way to get a central, continuously updated view of your entire organization's S3 environment and to monitor your adherence to security best practices through a central console. Macie continually evaluates all buckets for encryption and access control, alerting you to buckets that are public, unencrypted, or shared or replicated outside of your organization. Macie evaluates objects for sensitive data by using a managed list of common sensitive data types and any custom data types that you create, and then issues findings for any object in which sensitive data is found.
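
As a sketch, you can enable Macie in an account (per Region) with the AWS CLI and, from your Organizations management account, designate a Macie administrator account for the organization; the account ID shown is a placeholder.

    aws macie2 enable-macie
    aws macie2 enable-organization-admin-account --admin-account-id <111122223333>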

Encrypt your data in S3

There are four options for encrypting data in S3, including client-side and server-side options. With server-side encryption, S3 encrypts your data at the object level as it writes it to disks in AWS data centers and decrypts it when you access it. As long as you authenticate your request and you have access permissions, there is no difference in the way you access encrypted or unencrypted objects.

Two of these options involve AWS Key Management Service (AWS KMS): SSE-KMS and client-side encryption with a key stored in AWS KMS. AWS KMS lets you create and manage cryptographic keys and control their use across a wide range of AWS services and in your applications. The options differ in who provides and manages the encryption keys used to encrypt your S3 data.

  • Server-side encryption with Amazon S3-managed encryption keys (SSE-S3). When you use SSE-S3, each object is encrypted with a unique key that’s managed by AWS. This option enables you to encrypt your data by checking a box with no additional steps. The encryption and decryption are handled for you transparently. SSE-S3 is a convenient and cost-effective option.
  • Server-side encryption with customer master keys (CMKs) stored in AWS KMS (SSE-KMS) is similar to SSE-S3, but with some additional benefits and costs. There are separate permissions for the use of a CMK that provide added protection against unauthorized access of your objects in S3. SSE-KMS also provides you with an audit trail that shows when your CMK was used and by whom. SSE-KMS gives you control of the key access policy, which might provide you with more granular control depending on your use case (see the example command after this list).
  • In server-side encryption with customer-provided keys (SSE-C), you manage the encryption keys, and S3 manages the encryption as it writes to disks and the decryption when you access your objects. This option is useful if you need to provide and manage your own encryption keys. Keep in mind that you are responsible for the creation, storage, and tracking of the keys used to encrypt each object, and AWS has no ability to recover customer-provided keys if they're lost. The main consideration with SSE-C is that you must provide the customer-provided key every time you PUT or GET an object.
  • Client-side encryption is another option to encrypt your data in S3. You can use a CMK stored in AWS KMS or use a master key that you store within your application. Client-side encryption means that you encrypt the data before you send it to AWS and that you decrypt it after you retrieve it from AWS. AWS doesn’t manage your keys and isn’t responsible for encryption or decryption. Usually, client-side encryption needs to be deeply embedded into your application to work.
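
As an example of the SSE-KMS option described above, the following AWS CLI sketch sets default encryption on a bucket with a KMS key and enables S3 Bucket Keys to reduce KMS request costs; the bucket name and key ARN are placeholders.

    aws s3api put-bucket-encryption --bucket <your-bucket-name> --server-side-encryption-configuration '{"Rules":[{"ApplyServerSideEncryptionByDefault":{"SSEAlgorithm":"aws:kms","KMSMasterKeyID":"<arn:aws:kms:us-east-1:111122223333:key/1234abcd-12ab-34cd-56ef-1234567890ab>"},"BucketKeyEnabled":true}]}'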

Protect data in S3 from accidental deletion using S3 Versioning and S3 Object Lock

Amazon S3 is designed for 99.999999999 percent durability of objects across multiple Availability Zones, is resilient against events that impact an entire zone, and is designed for 99.99 percent availability over a given year. In many cases, a backup strategy for your data in S3 is about protecting buckets and objects from accidental deletion, in which case S3 Versioning can be used to preserve, retrieve, and restore every version of every object stored in your buckets. S3 Versioning lets you keep multiple versions of an object in the same bucket and can help you recover objects from accidental deletion or overwrite. Keep in mind that this feature has associated costs. You might consider S3 Versioning in selective scenarios, such as S3 buckets that store critical backup data or sensitive data.

With S3 Versioning enabled on your S3 buckets, you can optionally add another layer of security by configuring a bucket to enable multi-factor authentication (MFA) delete. With this configuration, the bucket owner must include two forms of authentication in any request to delete a version or to change the versioning state of the bucket.
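
For example, the first AWS CLI command in the following sketch enables S3 Versioning on a bucket; the second shows the general shape of enabling MFA delete, which must be run with the bucket owner's root user credentials and MFA device. All values are placeholders.

    aws s3api put-bucket-versioning --bucket <your-bucket-name> --versioning-configuration Status=Enabled
    aws s3api put-bucket-versioning --bucket <your-bucket-name> --versioning-configuration Status=Enabled,MFADelete=Enabled --mfa "<arn:aws:iam::111122223333:mfa/root-account-mfa-device> <123456>"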

S3 Object Lock is a feature that helps you mitigate data loss by storing objects using a write-once-read-many (WORM) model. By using Object Lock, you can prevent an object from being overwritten or deleted for a fixed time or indefinitely. Keep in mind that there are specific use cases for Object Lock, including scenarios where it is imperative that data is not changed or deleted after it has been written.

Enable logging for S3 using CloudTrail and S3 server access logging

Amazon S3 is integrated with CloudTrail. CloudTrail captures a subset of API calls, including calls from the S3 console and code calls to the S3 APIs. In addition, you can enable CloudTrail data events for all your buckets or for a list of specific buckets. Keep in mind that a very active S3 bucket can generate a large amount of log data and increase CloudTrail costs. If cost is a concern, consider enabling this additional logging only for S3 buckets with critical data.
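
As a sketch, the following AWS CLI command enables data event logging for objects in a single critical bucket on an existing trail; the trail and bucket names are placeholders.

    aws cloudtrail put-event-selectors --trail-name <your-trail-name> --event-selectors '[{"ReadWriteType":"All","IncludeManagementEvents":true,"DataResources":[{"Type":"AWS::S3::Object","Values":["arn:aws:s3:::<your-critical-bucket>/"]}]}]'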

Server access logging provides detailed records of the requests that are made to a bucket. Server access logs can assist you in security and access audits.
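
A minimal sketch for turning on server access logging with the AWS CLI follows; the target bucket must already grant the S3 log delivery service permission to write to it, and the bucket names and prefix shown are placeholders.

    aws s3api put-bucket-logging --bucket <your-bucket-name> --bucket-logging-status '{"LoggingEnabled":{"TargetBucket":"<your-log-bucket>","TargetPrefix":"<s3-access-logs/your-bucket-name/>"}}'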

Backup your data in S3

Although S3 stores your data across multiple geographically diverse Availability Zones by default, your compliance requirements might dictate that you store data at even greater distances. Cross-Region Replication (CRR) allows you to replicate data between distant AWS Regions to help satisfy these requirements. CRR enables automatic, asynchronous copying of objects across buckets in different AWS Regions. For more information on object replication, see Replicating objects. Keep in mind that this feature has associated costs; you might consider CRR in selective scenarios, such as S3 buckets that store critical backup data or sensitive data.
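
The following AWS CLI sketch shows the general shape of a replication configuration. Both buckets must have S3 Versioning enabled, and the replication role must allow S3 to read from the source bucket and replicate to the destination bucket; the bucket names and role ARN are placeholders.

    aws s3api put-bucket-replication --bucket <your-source-bucket> --replication-configuration '{"Role":"arn:aws:iam::<111122223333>:role/<s3-replication-role>","Rules":[{"ID":"ReplicateAllObjects","Status":"Enabled","Priority":1,"DeleteMarkerReplication":{"Status":"Disabled"},"Filter":{},"Destination":{"Bucket":"arn:aws:s3:::<your-destination-bucket>"}}]}'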

Monitor S3 using Security Hub and CloudWatch Logs

Security Hub provides you with a comprehensive view of your security state in AWS and helps you check your environment against security industry standards and best practices. Security Hub collects security data from across AWS accounts, services, and supported third-party partner products and helps you analyze your security trends and identify the highest priority security issues.

The AWS Foundational Security Best Practices standard is a set of controls that detect when your deployed accounts and resources deviate from security best practices, and it provides clear remediation steps. The controls contain best practices from across multiple AWS services, including S3. We recommend that you enable the AWS Foundational Security Best Practices standard, because it includes the following detective controls for S3 and IAM:

IAM.1: IAM policies should not allow full "*" administrative privileges
S3.1: S3 Block Public Access setting should be enabled
S3.2: S3 buckets should prohibit public read access
S3.3: S3 buckets should prohibit public write access
S3.4: S3 buckets should have server-side encryption enabled
S3.5: S3 buckets should require requests to use Secure Socket Layer
S3.6: Amazon S3 permissions granted to other AWS accounts in bucket policies should be restricted
S3.8: S3 Block Public Access setting should be enabled at the bucket level

For details of each control, including remediation steps, please review the AWS Foundational Security Best Practices controls.

If there is a specific S3 API activity not covered by the preceding controls that you'd like to be alerted on, you can use CloudTrail logs together with Amazon CloudWatch for S3 to do so. CloudTrail integration with CloudWatch Logs delivers the S3 bucket-level API activity captured by CloudTrail to a CloudWatch log stream in the CloudWatch log group that you specify. You can then create CloudWatch alarms to monitor specific API activity and receive email notifications when that activity occurs.
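
For example, the following AWS CLI sketch creates a metric filter that counts S3 bucket policy changes in a CloudTrail log group, plus an alarm that notifies an SNS topic when one occurs; the log group name, metric namespace, and topic ARN are placeholders.

    aws logs put-metric-filter --log-group-name <CloudTrail/DefaultLogGroup> --filter-name S3BucketPolicyChanges --filter-pattern '{ ($.eventSource = s3.amazonaws.com) && (($.eventName = PutBucketPolicy) || ($.eventName = DeleteBucketPolicy)) }' --metric-transformations metricName=S3BucketPolicyChangeCount,metricNamespace=<CloudTrailMetrics>,metricValue=1
    aws cloudwatch put-metric-alarm --alarm-name S3BucketPolicyChanges --metric-name S3BucketPolicyChangeCount --namespace <CloudTrailMetrics> --statistic Sum --period 300 --threshold 1 --comparison-operator GreaterThanOrEqualToThreshold --evaluation-periods 1 --treat-missing-data notBreaching --alarm-actions <arn:aws:sns:us-east-1:111122223333:security-alerts>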

Conclusion

By using the ten practices described in this blog post, you can build strong protection mechanisms for your data in Amazon S3, including least privilege access, encryption of data at rest, blocking public access, logging, monitoring, and configuration checks.

Depending on your use case, you should consider additional protection mechanisms. For example, there are security-related controls available for large shared datasets in S3 such as Access Points, which you can use to decompose one large bucket policy into separate, discrete access point policies for each application that needs to access the shared data set. To learn more about S3 security, see Amazon S3 Security documentation.

Now that you’ve reviewed the top 10 security best practices to make your data in S3 more secure, make sure you have these controls set up in your AWS accounts—and go build securely!

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon S3 forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Megan O’Neil

Megan is a Senior Specialist Solutions Architect focused on threat detection and incident response. Megan and her team enable AWS customers to implement sophisticated, scalable, and secure solutions that solve their business challenges.

Author

Temi Adebambo

Temi is the Senior Manager for the Americas Security and Network Solutions Architect team. His team is focused on working with customers on cloud migration and modernization, cybersecurity strategy, architecture best practices, and innovation in the cloud. Before AWS, he spent over 14 years as a consultant, advising CISOs and security leaders at some of the largest global enterprises.

How to securely create and store your CRL for ACM Private CA

Post Syndicated from Tracy Pierce original https://aws.amazon.com/blogs/security/how-to-securely-create-and-store-your-crl-for-acm-private-ca/

In this blog post, I show you how to protect your Amazon Simple Storage Service (Amazon S3) bucket while still allowing access to your AWS Certificate Manager (ACM) Private Certificate Authority (CA) certificate revocation list (CRL).

A CRL is a list of certificates that have been revoked by the CA. Certificates can be revoked because they might have inadvertently been shared, or to discontinue their use, such as when someone leaves the company or an IoT device is decommissioned. In this solution, you use a combination of separate AWS accounts, Amazon S3 Block Public Access (BPA) settings, and a new parameter created by ACM Private CA called S3ObjectAcl to mark the CRL as private. This new parameter allows you to set the privacy of your CRL as PUBLIC_READ or BUCKET_OWNER_FULL_CONTROL. If you choose PUBLIC_READ, the CRL will be accessible over the internet. If you choose BUCKET_OWNER_FULL_CONTROL, then only the CRL S3 bucket owner can access it, and you will need to use Amazon CloudFront to serve the CRL stored in Amazon S3 using origin access identity (OAI). This is because most TLS implementations expect a public endpoint for access.

A best practice for Amazon S3 is to apply the principle of least privilege. To support least privilege, you want to ensure that you have the BPA settings for Amazon S3 enabled. These settings deny public access to your S3 objects through ACLs, bucket policies, or access point policies. I'm going to walk you through setting up your CRL as a private object in an isolated secondary account with BPA settings enabled, and a CloudFront distribution with OAI settings enabled. This ensures that access can be made only through the CloudFront distribution and not directly to your S3 bucket. This enables you to maintain your private CA in your primary account, accessible only by your public key infrastructure (PKI) security team.

As part of the private infrastructure setup, you will create a CloudFront distribution to provide access to your CRL. While not required, it allows access to private CRLs, and is helpful in the event you want to move the CRL to a different location later. However, this does come with an extra cost, so that’s something to consider when choosing to make your CRL private instead of public.

Prerequisites

For this walkthrough, you should have the following resources ready to use:

CRL solution overview

The solution consists of creating an S3 bucket in an isolated secondary account, enabling all BPA settings, creating a CloudFront OAI, and a CloudFront distribution.
 

Figure 1: Solution flow diagram

As shown in Figure 1, the steps in the solution are as follows:

  1. Set up the S3 bucket in the secondary account with BPA settings enabled.
  2. Create the CloudFront distribution and point it to the S3 bucket.
  3. Create your private CA in AWS Certificate Manager (ACM).

In this post, I walk you through each of these steps.

Deploying the CRL solution

In this section, you walk through each item in the solution overview above. This will allow access to your CRL stored in an isolated secondary account, away from your private CA.

To create your S3 bucket

  1. Sign in to the AWS Management Console of your secondary account. For Services, select S3.
  2. In the S3 console, choose Create bucket.
  3. Give the bucket a unique name. For this walkthrough, I named my bucket example-test-crl-bucket-us-east-1, as shown in Figure 2. Because S3 bucket names must be unique across all of AWS and not just within your account, you must create your own unique bucket name when completing this tutorial. Remember to follow the S3 naming conventions when choosing your bucket name.
     
    Figure 2: Creating an S3 bucket

  4. Choose Next, and then choose Next again.
  5. For Block Public Access settings for this bucket, make sure the Block all public access check box is selected, as shown in Figure 3.
     
    Figure 3: S3 block public access bucket settings

  6. Choose Create bucket.
  7. Select the bucket you just created, and then choose the Permissions tab.
  8. For Bucket Policy, choose Edit, and in the text field, paste the following policy (remember to replace each <user input placeholder> with your own value).
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "Service": "acm-pca.amazonaws.com"
          },
          "Action": [
            "s3:PutObject",
            "s3:PutObjectAcl",
            "s3:GetBucketAcl",
            "s3:GetBucketLocation"
          ],
          "Resource": [
              "arn:aws:s3:::<your-bucket-name>/*",
              "arn:aws:s3:::<your-bucket-name>"
          ]
        }
      ]
    }
    

  9. Choose Save changes.
  10. Next to Object Ownership choose Edit.
  11. Select Bucket owner preferred, and then choose Save changes.

To create your CloudFront distribution

  1. Still in the console of your secondary account, from the Services menu, switch to the CloudFront console.
  2. Choose Create Distribution.
  3. For Select a delivery method for your content, under Web, choose Get Started.
  4. On the Origin Settings page, do the following, as shown in Figure 4:
    1. For Origin Domain Name, select the bucket you created earlier. In this example, my bucket name is example-test-crl-bucket-us-east-1.s3.amazonaws.com.
    2. For Restrict Bucket Access, select Yes.
    3. For Origin Access Identity, select Create a New Identity.
    4. For Comment enter a name. In this example, I entered access-identity-crl.
    5. For Grant Read Permissions on Bucket, select Yes, Update Bucket Policy.
    6. Leave all other defaults.
       
      Figure 4: CloudFront Origin Settings page

  5. Choose Create Distribution.

To create your private CA

  1. (Optional) If you have already created a private CA, you can update your CRL pointer by using the update-certificate-authority API. You must do this step from the CLI because you can’t select an S3 bucket in a secondary account for the CRL home when you create the CRL through the console. If you haven’t already created a private CA, follow the remaining steps in this procedure.
  2. Use a text editor to create a file named ca_config.txt that holds your CA configuration information. In the following example ca_config.txt file, replace each <user input placeholder> with your own value.
    {
        "KeyAlgorithm": "<RSA_2048>",
        "SigningAlgorithm": "<SHA256WITHRSA>",
        "Subject": {
            "Country": "<US>",
            "Organization": "<Example LLC>",
            "OrganizationalUnit": "<Security>",
            "DistinguishedNameQualifier": "<Example.com>",
            "State": "<Washington>",
            "CommonName": "<Example LLC>",
            "Locality": "<Seattle>"
        }
    }
    

  3. From the CLI configured with a credential profile for your primary account, use the create-certificate-authority command to create your CA. In the following example, replace each <user input placeholder> with your own value.
    aws acm-pca create-certificate-authority --certificate-authority-configuration file://ca_config.txt --certificate-authority-type "ROOT" --profile <primary_account_credentials>
    

  4. With the CA created, use the describe-certificate-authority command to verify success. In the following example, replace each <user input placeholder> with your own value.
    aws acm-pca describe-certificate-authority --certificate-authority-arn <arn:aws:acm-pca:us-east-1:111122223333:certificate-authority/12345678-1234-1234-1234-123456789012> --profile <primary_account_credentials>
    

  5. You should see the CA in the PENDING_CERTIFICATE state. Use the get-certificate-authority-csr command to retrieve the certificate signing request (CSR), and sign it with your ACM private CA. In the following example, replace each <user input placeholder> with your own value.
    aws acm-pca get-certificate-authority-csr --certificate-authority-arn <arn:aws:acm-pca:us-east-1:111122223333:certificate-authority/12345678-1234-1234-1234-123456789012> --output text > <cert_1.csr> --profile <primary_account_credentials>
    

  6. Now that you have your CSR, use it to issue a certificate. Because this example sets up a ROOT CA, you will issue a self-signed RootCACertificate. You do this by using the issue-certificate command. In the following example, replace each <user input placeholder> with your own value. You can find all allowable values in the ACM PCA documentation.
    aws acm-pca issue-certificate --certificate-authority-arn <arn:aws:acm-pca:us-east-1:111122223333:certificate-authority/12345678-1234-1234-1234-123456789012> --template-arn arn:aws:acm-pca:::template/RootCACertificate/V1 --csr fileb://<cert_1.csr> --signing-algorithm SHA256WITHRSA --validity Value=365,Type=DAYS --profile <primary_account_credentials>
    

  7. Now that the certificate is issued, you can retrieve it. You do this by using the get-certificate command. In the following example, replace each <user input placeholder> with your own value.
    aws acm-pca get-certificate --certificate-authority-arn <arn:aws:acm-pca:us-east-1:111122223333:certificate-authority/12345678-1234-1234-1234-123456789012> --certificate-arn <arn:aws:acm-pca:us-east-1:111122223333:certificate-authority/12345678-1234-1234-1234-123456789012/certificate/6707447683a9b7f4055627ffd55cebcc> --output text --profile <primary_account_credentials> > ca_cert.pem
    

  8. Import the certificate ca_cert.pem into your CA to move it into the ACTIVE state for further use. You do this by using the import-certificate-authority-certificate command. In the following example, replace each <user input placeholder> with your own value.
    aws acm-pca import-certificate-authority-certificate --certificate-authority-arn <arn:aws:acm-pca:us-east-1:111122223333:certificate-authority/12345678-1234-1234-1234-123456789012> --certificate fileb://ca_cert.pem --profile <primary_account_credentials>
    

  9. Use a text editor to create a file named revoke_config.txt that holds your CRL information pointing to your CloudFront distribution ID. In the following example revoke_config.txt, replace each <user input placeholder> with your own value.
    {
        "CrlConfiguration": {
            "Enabled": <true>,
            "ExpirationInDays": <365>,
            "CustomCname": "<example1234.cloudfront.net>",
            "S3BucketName": "<example-test-crl-bucket-us-east-1>",
            "S3ObjectAcl": "<BUCKET_OWNER_FULL_CONTROL>"
        }
    }
    

  10. Update your CA CRL CNAME to point to the CloudFront distribution you created. You do this by using the update-certificate-authority command. In the following example, replace each <user input placeholder> with your own value.
    aws acm-pca update-certificate-authority --certificate-authority-arn <arn:aws:acm-pca:us-east-1:111122223333:certificate-authority/12345678-1234-1234-1234-123456789012> --revocation-configuration file://revoke_config.txt --profile <primary_account_credentials>
    

You can use the describe-certificate-authority command to verify that your CA is in the ACTIVE state. After the CA is active, ACM Private CA generates your CRL periodically for you and places it into your specified S3 bucket. It also generates a new CRL shortly after you revoke any certificate, so that you have the most up-to-date copy.

Now that the PCA, CRL, and CloudFront distribution are all set up, you can test to verify the CRL is served appropriately.

To test that the CRL is served appropriately

  1. Create a CSR to issue a new certificate from your PCA. In the following example, replace each <user input placeholder> with your own value. Enter a secure PEM password when prompted and provide the appropriate field data.

    Note: Do not enter any values for the unused attributes, just press Enter with no value.

    openssl req -new -newkey rsa:2048 -days 365 -keyout <test_cert_private_key.pem> -out <test_csr.csr>
    

  2. Issue a new certificate using the issue-certificate command. In the following example, replace each <user input placeholder> with your own value. You can find all allowable values in the ACM PCA documentation.
    aws acm-pca issue-certificate --certificate-authority-arn <arn:aws:acm-pca:us-east-1:111122223333:certificate-authority/12345678-1234-1234-1234-123456789012> --csr file://<test_csr.csr> --signing-algorithm <SHA256WITHRSA> --validity Value=<31>,Type=<DAYS> --idempotency-token 1 --profile <primary_account_credentials>
    

  3. After issuing the certificate, you can use the get-certificate command to retrieve it, parse it, and then get the CRL URL from the certificate just as a PKI client would. In the following example, replace each <user input placeholder> with your own value. This command uses the jq package.
    aws acm-pca get-certificate --certificate-authority-arn <arn:aws:acm-pca:us-east-1:111122223333:certificate-authority/12345678-1234-1234-1234-123456789012> --certificate-arn <arn:aws:acm-pca:us-east-1:111122223333:certificate-authority/12345678-1234-1234-1234-123456789012/certificate/6707447683a9b7f4055example1234> | jq -r '.Certificate' > cert.pem
    openssl x509 -in cert.pem -text -noout | grep crl
    

    You should see an output similar to the following, but with the domain names of your CloudFront distribution and your CRL file:

    http://<example1234.cloudfront.net>/crl/<7215e983-3828-435c-a458-b9e4dd16bab1.crl>
    

  4. Run the curl command to download your CRL file, and then inspect the downloaded file as shown below. In the following example, replace each <user input placeholder> with your own value.
    curl http://<example1234.cloudfront.net>/crl/<7215e983-3828-435c-a458-b9e4dd16bab1.crl>
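
    The CRL that ACM Private CA writes to S3 is DER encoded, so one way to inspect the downloaded file is with OpenSSL, as in the following sketch; the output file name is a placeholder.

    curl -o <revocation-list.crl> http://<example1234.cloudfront.net>/crl/<7215e983-3828-435c-a458-b9e4dd16bab1.crl>
    openssl crl -inform DER -in <revocation-list.crl> -text -noout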
    

Security best practices

The following are some of the security best practices for setting up and maintaining your private CA in ACM Private CA.

  • Place your root CA in its own account. You want your root CA to be the ultimate authority for your private certificates; limiting access to it is key to keeping it secure.
  • Minimize access to the root CA. This is one of the best ways to reduce the risk of intentional or unintentional inappropriate access or configuration. If the root CA were to be inappropriately accessed, all subordinate CAs and certificates would need to be revoked and recreated.
  • Keep your CRL in a separate account from the root CA. The reason for placing the CRL in a separate account is because some external entities—such as customers or users who aren’t part of your AWS organization, or external applications—might need to access the CRL to check for revocation. To provide access to these external entities, the CRL object and the S3 bucket need to be accessible, so you don’t want to place your CRL in the same account as your private CA.

For more information, see ACM Private CA best practices in the AWS Private CA User Guide.

Conclusion

You've now successfully set up your private CA and stored your CRL in an isolated secondary account. You configured your S3 bucket with Block Public Access settings, created a custom URL through CloudFront, enabled OAI settings, and pointed your CA's CRL CNAME to the distribution (if you use a custom domain, you can point your DNS to the distribution by using Route 53). This restricts access to your S3 bucket to CloudFront and your OAI only. You walked through the setup of each step, from bucket configuration and distribution setup to private CA configuration. You can now store your private CA in an account with limited access, while your CRL is hosted in a separate account that allows external entity access.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS Certificate Manager forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Tracy Pierce

Tracy is a Senior Security Consultant for Engagement Security. She enjoys the peculiar culture of Amazon and uses that to ensure that every day is exciting for her fellow engineers and customers alike. Customer obsession is her highest priority both internally and externally. She has her AS in Computer Security and Forensics from Sullivan College of Technology and Design, Systems Security Certified Practitioner (SSCP) certification, AWS Developer Associate certification, AWS Solutions Architect Associates certificate, and AWS Security Specialist certification. Outside of work, she enjoys time with friends, her fiancé, her Great Dane, and three cats. She also reads (a lot), builds Legos, and loves glitter.

Use IAM Access Analyzer to generate IAM policies based on access activity found in your organization trail

Post Syndicated from Mathangi Ramesh original https://aws.amazon.com/blogs/security/use-iam-access-analyzer-to-generate-iam-policies-based-on-access-activity-found-in-your-organization-trail/

In April 2021, AWS Identity and Access Management (IAM) Access Analyzer added policy generation to help you create fine-grained policies based on AWS CloudTrail activity stored within your account. Now, we’re extending policy generation to enable you to generate policies based on access activity stored in a designated account. For example, you can use AWS Organizations to define a uniform event logging strategy for your organization and store all CloudTrail logs in your management account to streamline governance activities. You can use Access Analyzer to review access activity stored in your designated account and generate a fine-grained IAM policy in your member accounts. This helps you to create policies that provide only the required permissions for your workloads.

Customers that use a multi-account strategy consolidate all access activity information in a designated account to simplify monitoring activities. By using AWS Organizations, you can create a trail that will log events for all Amazon Web Services (AWS) accounts into a single management account to help streamline governance activities. This is sometimes referred to as an organization trail. You can learn more from Creating a trail for an organization. With this launch, you can use Access Analyzer to generate fine-grained policies in your member account and grant just the required permissions to your IAM roles and users based on access activity stored in your organization trail.

When you request a policy, Access Analyzer analyzes your activity in CloudTrail logs and generates a policy based on that activity. The generated policy grants only the required permissions for your workloads and makes it easier for you to implement least privilege permissions. In this blog post, I'll explain how to set up the permissions for Access Analyzer to access your organization trail and analyze activity to generate a policy. To generate a policy in your member account, you need to grant Access Analyzer limited cross-account access to the Amazon Simple Storage Service (Amazon S3) bucket where the logs are stored so that it can review the access activity.

Generate a policy for a role based on its access activity in the organization trail

In this example, you will set fine-grained permissions for a role used in a development account. The example assumes that your company uses Organizations and maintains an organization trail that logs all events for all AWS accounts in the organization. The logs are stored in an S3 bucket in the management account. You can use Access Analyzer to generate a policy based on the actions required by the role. To use Access Analyzer, you must first update the permissions on the S3 bucket where the CloudTrail logs are stored, to grant access to Access Analyzer.

To grant permissions for Access Analyzer to access and review centrally stored logs and generate policies

  1. Sign in to the AWS Management Console using your management account and go to S3 settings.
  2. Select the bucket where the logs from the organization trail are stored.
  3. Change object ownership to bucket owner preferred. To generate a policy, all of the objects in the bucket must be owned by the bucket owner.
  4. Update the bucket policy to grant cross-account access to Access Analyzer by adding the following statement to the bucket policy. This grants Access Analyzer limited access to the CloudTrail data. Replace <organization-bucket-name> and <organization-id> with your own values, and then save the policy.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "PolicyGenerationPermissions",
                "Effect": "Allow",
                "Principal": {
                    "AWS": "*"
                },
                "Action": [
                    "s3:GetObject",
                    "s3:ListBucket"
                ],
                "Resource": [
                    "arn:aws:s3:::<organization-bucket-name>",
                    "arn:aws:s3:::<organization-bucket-name>/AWSLogs/<organization-id>/${aws:PrincipalAccount}/*"
                ],
                "Condition": {
                    "StringEquals": {
                        "aws:PrincipalOrgID": "<organization-id>"
                    },
                    "StringLike": {
                        "aws:PrincipalArn": "arn:aws:iam::${aws:PrincipalAccount}:role/service-role/AccessAnalyzerMonitorServiceRole*"
                    }
                }
            }
        ]
    }
    

By using the preceding statement, you're allowing s3:ListBucket and s3:GetObject on the bucket <organization-bucket-name> if the role accessing it belongs to an account in your organization and has a name that starts with AccessAnalyzerMonitorServiceRole. Using aws:PrincipalAccount in the resource section of the statement allows the role to retrieve only the CloudTrail logs belonging to its own account. If you are encrypting your logs, update your AWS Key Management Service (AWS KMS) key policy to grant Access Analyzer access to use your key.

Now that you’ve set the required permissions, you can use the development account and the following steps to generate a policy.

To generate a policy in the AWS Management Console

  1. Use your development account to open the IAM Console, and then in the navigation pane choose Roles.
  2. Select a role to analyze. This example uses AWS_Test_Role.
  3. Under Generate policy based on CloudTrail events, choose Generate policy, as shown in Figure 1.
     
    Figure 1: Generate policy from the role detail page

  4. In the Generate policy page, select the time window for which IAM Access Analyzer will review the CloudTrail logs to create the policy. In this example, specific dates are chosen, as shown in Figure 2.
     
    Figure 2: Specify the time period

  5. Under CloudTrail access, select the organization trail you want to use as shown in Figure 3.

    Note: If you're using this feature for the first time, select Create a new service role, and then choose Generate policy.

    This example uses an existing service role “AccessAnalyzerMonitorServiceRole_MBYF6V8AIK.”
     

    Figure 3: CloudTrail access

  6. After the policy is ready, you’ll see a notification on the role page. To review the permissions, choose View generated policy, as shown in Figure 4.
     
    Figure 4: Policy generation progress

After the policy is generated, you can see a summary of the services and associated actions in the generated policy. You can customize it by reviewing the services used and selecting additional required actions from the drop down. To refine permissions further, you can replace the resource-level placeholders in the policies to restrict permissions to just the required access. You can learn more about granting fine-grained permissions and creating the policy as described in this blog post.

Conclusion

Access Analyzer makes it easier to grant fine-grained permissions to your IAM roles and users by generating IAM policies based on the CloudTrail activity centrally stored in a designated account, such as your AWS Organizations management account. To learn more about how to generate a policy, see Generate policies based on access activity in the IAM User Guide.

If you have feedback about this blog post, submit comments in the Comments section below. If you have questions about this blog post, start a new thread on the IAM forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Mathangi Ramesh

Mathangi Ramesh

Mathangi is the product manager for AWS Identity and Access Management. She enjoys talking to customers and working with data to solve problems. Outside of work, Mathangi is a fitness enthusiast and a Bharatanatyam dancer. She holds an MBA degree from Carnegie Mellon University.

Strengthen the security of sensitive data stored in Amazon S3 by using additional AWS services

Post Syndicated from Jerry Mullis original https://aws.amazon.com/blogs/security/strengthen-the-security-of-sensitive-data-stored-in-amazon-s3-by-using-additional-aws-services/

In this post, we describe the AWS services that you can use to both detect and protect your data stored in Amazon Simple Storage Service (Amazon S3). When you analyze security in depth for your Amazon S3 storage, consider doing the following:

Using these additional AWS services along with Amazon S3 can improve your security posture across your accounts.

Audit and restrict Amazon S3 access with IAM Access Analyzer

IAM Access Analyzer allows you to identify unintended access to your resources and data. Users and developers need access to Amazon S3, but it’s important for you to keep users and privileges accurate and up to date.

Amazon S3 can often house sensitive and confidential information. To help secure your data within Amazon S3, you should be using AWS Key Management Service (AWS KMS) with server-side encryption at rest for Amazon S3. It is also important that you secure the S3 buckets so that you only allow access to the developers and users who require that access. Bucket policies and access control lists (ACLs) are the foundation of Amazon S3 security. Your configuration of these policies and lists determines the accessibility of objects within Amazon S3, and it is important to audit them regularly to properly secure and maintain the security of your Amazon S3 bucket.

IAM Access Analyzer can scan all the supported resources within a zone of trust. Access Analyzer then provides you with insight when a bucket policy or ACL allows access to any external entities that are not within your organization or your AWS account’s zone of trust.

To set up and use IAM Access Analyzer, follow the instructions for Enabling Access Analyzer in the AWS IAM User Guide.
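
If you prefer the AWS CLI, the following sketch creates an account-scoped analyzer; the analyzer name is a placeholder, and you can use --type ORGANIZATION from your management or delegated administrator account to set the organization as the zone of trust instead.

    aws accessanalyzer create-analyzer --analyzer-name <example-account-analyzer> --type ACCOUNT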

The example in Figure 1 shows creating an analyzer with the zone of trust as the current account, but you can also create an analyzer with the organization as the zone of trust.

Figure 1: Creating IAM Access Analyzer and zone of trust

After you create your analyzer, IAM Access Analyzer automatically scans the resources in your zone of trust and returns the findings from your Amazon S3 storage environment. The initial scan shown in Figure 2 shows the findings of an unsecured S3 bucket.

Figure 2: Example of unsecured S3 bucket findings

For each finding, you can decide which action you would like to take. As shown in Figure 3, you are given the option to archive (if the finding indicates intended access) or take action to modify bucket permissions (if the finding indicates unintended access).

Figure 3: Displays choice of actions to take

After you address the initial findings, Access Analyzer monitors your bucket policies for changes, and notifies you of access issues it finds. Access Analyzer is regional and must be enabled in each AWS Region independently.

Classify and secure sensitive data with Macie

Organizational compliance standards often require the identification and securing of sensitive data. Your organization’s sensitive data might contain personally identifiable information (PII), which includes things such as credit card numbers, birthdates, and addresses.

Macie is a data security and privacy service offered by AWS that uses machine learning and pattern matching to discover the sensitive data stored within Amazon S3. You can also define custom sensitive data types that might be unique to your business or use case. Macie automatically provides an inventory of S3 buckets and alerts you to unprotected sensitive data.

Figure 4 shows a sample result from a Macie scan in which you can see important information regarding Amazon S3 public access, encryption settings, and sharing.

Figure 4: Sample results from a Macie scan

In addition to finding potential sensitive data, Macie also gives you a severity score based on the privacy risk, as shown in the example data in Figure 5.

Figure 5: Example Macie severity scores

When you use Macie in conjunction with AWS Step Functions, you can also automatically remediate any issues found. You can use this combination to help meet regulations such as General Data Protection Regulation (GDPR) and Health Insurance Portability and Accountability Act (HIPAA). Macie allows you to have constant visibility of sensitive data within your Amazon S3 storage environment.

When you deploy Macie in a multi-account configuration, your usage is rolled up to the master account to provide the total usage for all accounts and a breakdown across the entire organization.

Detect malicious access patterns with GuardDuty

Your customers and users can commit thousands of actions each day on S3 buckets. Discerning access patterns manually can be extremely time consuming as the volume of data increases. GuardDuty uses machine learning, anomaly detection, and integrated threat intelligence to analyze billions of events across multiple accounts, drawing on AWS CloudTrail S3 data events as well as CloudTrail management events, VPC Flow Logs, and DNS logs. GuardDuty can be configured to analyze these logs and notify you of suspicious activity, such as unusual data access patterns, unusual discovery API calls, and more. After you receive a list of findings on these activities, you will be able to make informed decisions to secure your S3 buckets.

Figure 6 shows a sample list of findings returned by GuardDuty which shows the finding type, resource affected, and count of occurrences.

Figure 6: Example GuardDuty list of findings

You can select one of the results in Figure 6 to see the IP address and details associated with this potentially malicious IP caller, as shown in Figure 7.

Figure 7: GuardDuty Malicious IP Caller detailed findings

Monitor and remediate configuration changes with AWS Config

Configuration management is important when securing Amazon S3, to prevent unauthorized users from gaining access. It is important that you monitor the configuration changes of your S3 buckets, whether the changes are intentional or unintentional. AWS Config can track all configuration changes that are made to an S3 bucket. For example, if an S3 bucket had its permissions and configurations unexpectedly changed, using AWS Config allows you to see the changes made, as well as who made them.

With AWS Config, you can set up AWS Config managed rules that serve as a baseline for your S3 bucket. When any bucket has configurations that deviate from this baseline, you can be alerted by Amazon Simple Notification Service (Amazon SNS) of the bucket being noncompliant.

AWS Config can be used in conjunction with AWS Lambda. If an S3 bucket is noncompliant, AWS Config can trigger a preprogrammed Lambda function, and the Lambda function can then resolve those issues. This combination can be used to reduce your operational overhead in maintaining compliance within your S3 buckets.
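
For example, the following AWS CLI sketch adds two AWS Config managed rules that check your S3 buckets for public read access and for default server-side encryption.

    aws configservice put-config-rule --config-rule '{"ConfigRuleName":"s3-bucket-public-read-prohibited","Source":{"Owner":"AWS","SourceIdentifier":"S3_BUCKET_PUBLIC_READ_PROHIBITED"}}'
    aws configservice put-config-rule --config-rule '{"ConfigRuleName":"s3-bucket-server-side-encryption-enabled","Source":{"Owner":"AWS","SourceIdentifier":"S3_BUCKET_SERVER_SIDE_ENCRYPTION_ENABLED"}}'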

Figure 8 shows a sample of AWS Config managed rules selected for configuration monitoring and gives a brief description of what the rule does.

Figure 8: Sample selections of AWS Managed Rules

Figure 9 shows a sample result of a non-compliant configuration and resource inventory listing the type of resource affected and the number of occurrences.

Figure 9: Example of AWS Config non-compliant resources

Conclusion

AWS has many offerings to help you audit and secure your storage environment. In this post, we discussed the particular combination of AWS services that together will help reduce the amount of time and focus your business devotes to security practices. This combination of services will also enable you to automate your responses to any unwanted permission and configuration changes, saving you valuable time and resources to dedicate elsewhere in your organization.

For more information about pricing of the services mentioned in this post, see AWS Free Tier and AWS Pricing. For more information about Amazon S3 security, see Amazon S3 Preventative Security Best Practices in the Amazon S3 User Guide.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Jerry Mullis

Jerry is an Associate Solutions Architect at AWS. His interests are in data migration, machine learning, and device automation. Jerry has previous experience in machine learning research and healthcare management. His certifications include AWS Solutions Architect Pro, AWS Developer Associate, AWS Sysops Admin Associate and AWS Certified Cloud Practitioner. In his free time, Jerry enjoys hiking, playing basketball, and spending time with his wife.

Author

Dave Geyer

Dave is an Associate Solutions Architect at AWS. He has a background in data management and organizational design, and is interested in data analytics and infrastructure security. Dave has advised and worked for customers in the commercial and public sectors, providing them with architectural best practices and recommendations. Dave is interested in the aerospace and financial services industries. Outside of work, he is an adrenaline junkie, and is passionate about mountaineering and high altitudes.

Author

Andrew Chen

Andrew is an Associate Solutions Architect with an interest in data analytics, machine learning, and virtualization of infrastructure. Andrew has previous experience in management consulting in which he worked as a technical lead for various cloud migration projects. In his free time, Andrew enjoys fishing, hiking, kayaking, and keeping up with financial markets.

Implement tenant isolation for Amazon S3 and Aurora PostgreSQL by using ABAC

Post Syndicated from Ashutosh Upadhyay original https://aws.amazon.com/blogs/security/implement-tenant-isolation-for-amazon-s3-and-aurora-postgresql-by-using-abac/

In software as a service (SaaS) systems, which are designed to be used by multiple customers, isolating tenant data is a fundamental responsibility for SaaS providers. The practice of isolation of data in a multi-tenant application platform is called tenant isolation. In this post, we describe an approach you can use to achieve tenant isolation in Amazon Simple Storage Service (Amazon S3) and Amazon Aurora PostgreSQL-Compatible Edition databases by implementing attribute-based access control (ABAC). You can also adapt the same approach to achieve tenant isolation in other AWS services.

ABAC in Amazon Web Services (AWS), which uses tags to store attributes, offers advantages over the traditional role-based access control (RBAC) model. You can use fewer permissions policies, update your access control more efficiently as you grow, and last but not least, apply granular permissions for various AWS services. These granular permissions help you to implement an effective and coherent tenant isolation strategy for your customers and clients. Using the ABAC model helps you scale your permissions and simplify the management of granular policies. The ABAC model reduces the time and effort it takes to maintain policies that allow access to only the required resources.

The solution we present here uses the pool model of data partitioning. The pool model helps you avoid the higher costs of duplicated resources for each tenant and the specialized infrastructure code required to set up and maintain those copies.

Solution overview

In a typical customer environment where this solution is implemented, the tenant request for access might land at Amazon API Gateway, together with the tenant identifier, which in turn calls an AWS Lambda function. The Lambda function runs with a basic Lambda execution role, which should also have permission to assume the tenant roles. As the request progresses, the Lambda function assumes the tenant role and makes the necessary calls to Amazon S3 or to an Aurora PostgreSQL-Compatible database. This solution helps you to achieve tenant isolation for objects stored in Amazon S3 and data elements stored in an Aurora PostgreSQL-Compatible database cluster.
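
Inside the Lambda function you would use an AWS SDK to assume the tenant role, but the following AWS CLI sketch shows the equivalent call; the account number and session name are placeholders, and the role name matches the tenant role created later in this walkthrough. The temporary credentials that are returned carry the role's s3_home and dbuser tags, which the ABAC policies evaluate through aws:PrincipalTag.

    aws sts assume-role --role-arn arn:aws:iam::<111122223333>:role/assumeRole-tenant1 --role-session-name <tenant1-session>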

Figure 1 shows the tenant isolation architecture for both Amazon S3 and Amazon Aurora PostgreSQL-Compatible databases.

Figure 1: Tenant isolation architecture diagram

As shown in the numbered diagram steps, the workflow for Amazon S3 tenant isolation is as follows:

  1. AWS Lambda sends an AWS Security Token Service (AWS STS) assume role request to AWS Identity and Access Management (IAM).
  2. IAM validates the request and returns the tenant role.
  3. Lambda sends a request to Amazon S3 with the assumed role.
  4. Amazon S3 sends the response back to Lambda.

The diagram also shows the workflow steps for tenant isolation for Aurora PostgreSQL-Compatible databases, as follows:

  1. Lambda sends an STS assume role request to IAM.
  2. IAM validates the request and returns the tenant role.
  3. Lambda sends a request to IAM for database authorization.
  4. IAM validates the request and returns the database password token.
  5. Lambda sends a request to the Aurora PostgreSQL-Compatible database with the database user and password token.
  6. Aurora PostgreSQL-Compatible database returns the response to Lambda.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  1. An AWS account for your workload.
  2. An Amazon S3 bucket.
  3. An Aurora PostgreSQL-Compatible cluster with a database created.

    Note: Make sure to note down the default master database user and password, and make sure that you can connect to the database from your desktop or from another server (for example, from Amazon Elastic Compute Cloud (Amazon EC2) instances).

  4. A security group and inbound rules that are set up to allow an inbound PostgreSQL TCP connection (Port 5432) from Lambda functions. This solution uses regular non-VPC Lambda functions, and therefore the security group of the Aurora PostgreSQL-Compatible database cluster should allow an inbound PostgreSQL TCP connection (Port 5432) from anywhere (0.0.0.0/0).

Make sure that you’ve completed the prerequisites before proceeding with the next steps.

Deploy the solution

The following sections describe how to create the IAM roles, IAM policies, and Lambda functions that are required for the solution. These steps also include guidelines on the changes that you’ll need to make to the prerequisite components Amazon S3 and the Aurora PostgreSQL-Compatible database cluster.

Step 1: Create the IAM policies

In this step, you create two IAM policies with the required permissions for Amazon S3 and the Aurora PostgreSQL database.

To create the IAM policies

  1. Open the AWS Management Console.
  2. Choose IAM, choose Policies, and then choose Create policy.
  3. Use the following JSON policy document to create the policy. Replace the placeholder <111122223333> with your AWS account number so that the resource ARN matches the name of your bucket (in this example, the bucket is named sts-ti-demo-<111122223333>).
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": [
                    "s3:Get*",
                    "s3:List*"
                ],
                "Resource": "arn:aws:s3:::sts-ti-demo-<111122223333>/${aws:PrincipalTag/s3_home}/*"
            }
        ]
    }
    

  4. Save the policy with the name sts-ti-demo-s3-access-policy.

    Figure 2: Create the IAM policy for Amazon S3 (sts-ti-demo-s3-access-policy)

  5. Open the AWS Management Console.
  6. Choose IAM, choose Policies, and then choose Create policy.
  7. Use the following JSON policy document to create a second policy. This policy grants an IAM role permission to connect to an Aurora PostgreSQL-Compatible database through a database user that is IAM authenticated. Replace the placeholders with the appropriate Region, account number, and cluster resource ID of the Aurora PostgreSQL-Compatible database cluster, respectively.
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "rds-db:connect"
            ],
            "Resource": [
                "arn:aws:rds-db:<us-west-2>:<111122223333>:dbuser:<cluster- ZTISAAAABBBBCCCCDDDDEEEEL4>/${aws:PrincipalTag/dbuser}"
            ]
        }
    ]
}
  8. Save the policy with the name sts-ti-demo-dbuser-policy.

    Figure 3: Create the IAM policy for Aurora PostgreSQL database (sts-ti-demo-dbuser-policy)

Note: Make sure that you use the cluster resource ID for the clustered database. However, if you intend to adapt this solution for your Aurora PostgreSQL-Compatible non-clustered database, you should use the instance resource ID instead.
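
If you're unsure where to find the cluster resource ID, you can retrieve it from the RDS API. The following is a minimal boto3 sketch; the cluster identifier sts-ti-demo-cluster is a placeholder for your own cluster identifier.

    import boto3

    rds = boto3.client("rds", region_name="us-west-2")  # placeholder Region

    # DbClusterResourceId is the value that goes into the rds-db:connect resource ARN
    response = rds.describe_db_clusters(DBClusterIdentifier="sts-ti-demo-cluster")  # placeholder identifier
    print(response["DBClusters"][0]["DbClusterResourceId"])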

Step 2: Create the IAM roles

In this step, you create two IAM roles for the two different tenants, and also apply the necessary permissions and tags.

To create the IAM roles

  1. In the IAM console, choose Roles, and then choose Create role.
  2. On the Trusted entities page, choose the EC2 service as the trusted entity.
  3. On the Permissions policies page, select sts-ti-demo-s3-access-policy and sts-ti-demo-dbuser-policy.
  4. On the Tags page, add two tags with the following keys and values.

    Tag key Tag value
    s3_home tenant1_home
    dbuser tenant1_dbuser
  5. On the Review screen, name the role assumeRole-tenant1, and then choose Save.
  6. In the IAM console, choose Roles, and then choose Create role.
  7. On the Trusted entities page, choose the EC2 service as the trusted entity.
  8. On the Permissions policies page, select sts-ti-demo-s3-access-policy and sts-ti-demo-dbuser-policy.
  9. On the Tags page, add two tags with the following keys and values.

    Tag key Tag value
    s3_home tenant2_home
    dbuser tenant2_dbuser
  10. On the Review screen, name the role assumeRole-tenant2, and then choose Save.
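
If you'd rather script this step, the following boto3 sketch shows roughly equivalent calls for the tenant1 role; the account ID is a placeholder, and the initial EC2 trust policy mirrors the console choice above (you replace it with the Lambda role trust policy in Step 3).

    import json
    import boto3

    iam = boto3.client("iam")
    account_id = "111122223333"  # placeholder account ID

    # Initial trust policy (replaced with the Lambda role trust policy in Step 3)
    trust_policy = {
        "Version": "2012-10-17",
        "Statement": [{
            "Effect": "Allow",
            "Principal": {"Service": "ec2.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }],
    }

    iam.create_role(
        RoleName="assumeRole-tenant1",
        AssumeRolePolicyDocument=json.dumps(trust_policy),
    )

    # Attach the two customer managed policies created in Step 1
    for policy_name in ("sts-ti-demo-s3-access-policy", "sts-ti-demo-dbuser-policy"):
        iam.attach_role_policy(
            RoleName="assumeRole-tenant1",
            PolicyArn=f"arn:aws:iam::{account_id}:policy/{policy_name}",
        )

    # These tags drive the ${aws:PrincipalTag/...} variables in the IAM policies
    iam.tag_role(
        RoleName="assumeRole-tenant1",
        Tags=[
            {"Key": "s3_home", "Value": "tenant1_home"},
            {"Key": "dbuser", "Value": "tenant1_dbuser"},
        ],
    )

Run the same calls again with the tenant2 values (assumeRole-tenant2, tenant2_home, and tenant2_dbuser) to create the second role.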

Step 3: Create and apply the IAM policies for the tenants

In this step, you create a policy and a role for the Lambda functions. You also create two separate tenant roles, and establish a trust relationship with the role that you created for the Lambda functions.

To create and apply the IAM policies for tenant1

  1. In the IAM console, choose Policies, and then choose Create policy.
  2. Use the following JSON policy document to create the policy. Replace the placeholder <111122223333> with your AWS account number.
    {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "VisualEditor0",
                "Effect": "Allow",
                "Action": "sts:AssumeRole",
                "Resource": [
                    "arn:aws:iam::<111122223333>:role/assumeRole-tenant1",
                    "arn:aws:iam::<111122223333>:role/assumeRole-tenant2"
                ]
            }
        ]
    }
    

  3. Save the policy with the name sts-ti-demo-assumerole-policy.
  4. In the IAM console, choose Roles, and then choose Create role.
  5. On the Trusted entities page, select the Lambda service as the trusted entity.
  6. On the Permissions policies page, select sts-ti-demo-assumerole-policy and AWSLambdaBasicExecutionRole.
  7. On the review screen, name the role sts-ti-demo-lambda-role, and then choose Save.
  8. In the IAM console, go to Roles, and enter assumeRole-tenant1 in the search box.
  9. Select the assumeRole-tenant1 role and go to the Trust relationship tab.
  10. Choose Edit the trust relationship, and replace the existing value with the following JSON document. Replace the placeholder <111122223333> with your AWS account number, and choose Update trust policy to save the policy.
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::<111122223333>:role/sts-ti-demo-lambda-role"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    

To verify that the policies are applied correctly for tenant1

In the IAM console, go to Roles, and enter assumeRole-tenant1 in the search box. Select the assumeRole-tenant1 role and on the Permissions tab, verify that sts-ti-demo-dbuser-policy and sts-ti-demo-s3-access-policy appear in the list of policies, as shown in Figure 4.

Figure 4: The assumeRole-tenant1 Permissions tab

On the Trust relationships tab, verify that sts-ti-demo-lambda-role appears under Trusted entities, as shown in Figure 5.

Figure 5: The assumeRole-tenant1 Trust relationships tab

On the Tags tab, verify that the following tags appear, as shown in Figure 6.

Tag key Tag value
dbuser tenant1_dbuser
s3_home tenant1_home

Figure 6: The assumeRole-tenant1 Tags tab

To create and apply the IAM policies for tenant2

  1. In the IAM console, go to Roles, and enter assumeRole-tenant2 in the search box.
  2. Select the assumeRole-tenant2 role and go to the Trust relationship tab.
  3. Edit the trust relationship, replacing the existing value with the following JSON document. Replace the placeholder <111122223333> with your AWS account number.
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::<111122223333>:role/sts-ti-demo-lambda-role"
          },
          "Action": "sts:AssumeRole"
        }
      ]
    }
    

  4. Choose Update trust policy to save the policy.

To verify that the policies are applied correctly for tenant2

In the IAM console, go to Roles, and enter assumeRole-tenant2 in the search box. Select the assumeRole-tenant2 role and on the Permissions tab, verify that sts-ti-demo-dbuser-policy and sts-ti-demo-s3-access-policy appear in the list of policies, as you did for tenant1. On the Trust relationships tab, verify that sts-ti-demo-lambda-role appears under Trusted entities.

On the Tags tab, verify that the following tags appear, as shown in Figure 7.

Tag key Tag value
dbuser tenant2_dbuser
s3_home tenant2_home

Figure 7: The assumeRole-tenant2 Tags tab

Step 4: Set up an Amazon S3 bucket

Next, you’ll set up an S3 bucket that you’ll use as part of this solution. You can either create a new S3 bucket or re-purpose an existing one. The following steps show you how to create two user homes (that is, S3 prefixes, which are also known as folders) in the S3 bucket.

  1. In the AWS Management Console, go to Amazon S3 and select the S3 bucket you want to use.
  2. Create two prefixes (folders) with the names tenant1_home and tenant2_home.
  3. Place two test objects with the names tenant.info-tenant1_home and tenant.info-tenant2_home in the prefixes that you just created, respectively.
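
You can also create the prefixes and test objects with a short script instead of the console. The following boto3 sketch assumes a bucket named sts-ti-demo-111122223333 and placeholder object content; adjust both to your environment.

    import boto3

    s3 = boto3.client("s3")
    bucket_name = "sts-ti-demo-111122223333"  # placeholder: your bucket name

    for tenant in ("tenant1", "tenant2"):
        prefix = f"{tenant}_home"
        # Writing an object under the prefix implicitly creates the "folder"
        s3.put_object(
            Bucket=bucket_name,
            Key=f"{prefix}/tenant.info-{prefix}",
            Body=f"Sample data for {tenant}".encode("utf-8"),
        )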

Step 5: Set up test objects in Aurora PostgreSQL-Compatible database

In this step, you create a table in Aurora PostgreSQL-Compatible Edition, insert tenant metadata, create a row level security (RLS) policy, create tenant users, and grant permission for testing purposes.

To set up Aurora PostgreSQL-Compatible

  1. Connect to Aurora PostgreSQL-Compatible through a client of your choice, using the master database user and password that you obtained at the time of cluster creation.
  2. Run the following commands to create a table for testing purposes and to insert a couple of testing records.
    CREATE TABLE tenant_metadata (
        tenant_id VARCHAR(30) PRIMARY KEY,
        email     VARCHAR(50) UNIQUE,
        status    VARCHAR(10) CHECK (status IN ('active', 'suspended', 'disabled')),
        tier      VARCHAR(10) CHECK (tier IN ('gold', 'silver', 'bronze')));
    
    INSERT INTO tenant_metadata (tenant_id, email, status, tier) 
    VALUES ('tenant1_dbuser','tenant1@example.com','active','gold');
    INSERT INTO tenant_metadata (tenant_id, email, status, tier) 
    VALUES ('tenant2_dbuser','tenant2@example.com','suspended','silver');
    ALTER TABLE tenant_metadata ENABLE ROW LEVEL SECURITY;
    

  3. Run the following command to query the newly created database table.
    SELECT * FROM tenant_metadata;
    

    Figure 8: The tenant_metadata table content

  4. Run the following command to create the row level security policy.
    CREATE POLICY tenant_isolation_policy ON tenant_metadata
    USING (tenant_id = current_user);
    

  5. Run the following commands to establish two tenant users and grant them the necessary permissions.
    CREATE USER tenant1_dbuser WITH LOGIN;
    CREATE USER tenant2_dbuser WITH LOGIN;
    GRANT rds_iam TO tenant1_dbuser;
    GRANT rds_iam TO tenant2_dbuser;
    
    GRANT select, insert, update, delete ON tenant_metadata to tenant1_dbuser, tenant2_dbuser;
    

  6. Run the following commands to verify the newly created tenant users.
    SELECT usename AS role_name,
      CASE
         WHEN usesuper AND usecreatedb THEN
           CAST('superuser, create database' AS pg_catalog.text)
         WHEN usesuper THEN
            CAST('superuser' AS pg_catalog.text)
         WHEN usecreatedb THEN
            CAST('create database' AS pg_catalog.text)
         ELSE
            CAST('' AS pg_catalog.text)
      END role_attributes
    FROM pg_catalog.pg_user
    WHERE usename LIKE ('tenant%')
    ORDER BY role_name desc;
    

    Figure 9: Verify the newly created tenant users output

Step 6: Set up the AWS Lambda functions

Next, you’ll create two Lambda functions, one for Amazon S3 and one for Aurora PostgreSQL-Compatible. You also need to create a Lambda layer for the Python package pg8000.

To set up the Lambda function for Amazon S3

  1. Navigate to the Lambda console, and choose Create function.
  2. Choose Author from scratch. For Function name, enter sts-ti-demo-s3-lambda.
  3. For Runtime, choose Python 3.7.
  4. Change the default execution role to Use an existing role, and then select sts-ti-demo-lambda-role from the drop-down list.
  5. Keep Advanced settings as the default value, and then choose Create function.
  6. Copy the following Python code into the lambda_function.py file that is created in your Lambda function.
    import json
    import os
    import time 
    
    def lambda_handler(event, context):
        import boto3
        bucket_name     =   os.environ['s3_bucket_name']
    
        try:
            login_tenant_id =   event['login_tenant_id']
            data_tenant_id  =   event['s3_tenant_home']
        except:
            return {
                'statusCode': 400,
                'body': 'Error in reading parameters'
            }
    
        prefix_of_role  =   'assumeRole'
        file_name       =   'tenant.info' + '-' + data_tenant_id
    
        # create an STS client object that represents a live connection to the STS service
        sts_client = boto3.client('sts')
        account_of_role = sts_client.get_caller_identity()['Account']
        role_to_assume  =   'arn:aws:iam::' + account_of_role + ':role/' + prefix_of_role + '-' + login_tenant_id
    
        # Call the assume_role method of the STSConnection object and pass the role
        # ARN and a role session name.
        RoleSessionName = 'AssumeRoleSession' + str(time.time()).split(".")[0] + str(time.time()).split(".")[1]
        try:
            assumed_role_object = sts_client.assume_role(
                RoleArn         = role_to_assume, 
                RoleSessionName = RoleSessionName, 
                DurationSeconds = 900) #15 minutes
    
        except:
            return {
                'statusCode': 400,
                'body': 'Error in assuming the role ' + role_to_assume + ' in account ' + account_of_role
            }
    
        # From the response that contains the assumed role, get the temporary 
        # credentials that can be used to make subsequent API calls
        credentials=assumed_role_object['Credentials']
        
        # Use the temporary credentials that AssumeRole returns to make a connection to Amazon S3  
        s3_resource=boto3.resource(
            's3',
            aws_access_key_id=credentials['AccessKeyId'],
            aws_secret_access_key=credentials['SecretAccessKey'],
            aws_session_token=credentials['SessionToken']
        )
    
        try:
            obj = s3_resource.Object(bucket_name, data_tenant_id + "/" + file_name)
            return {
                'statusCode': 200,
                'body': obj.get()['Body'].read()
            }
        except:
            return {
                'statusCode': 400,
                'body': 'error in reading s3://' + bucket_name + '/' + data_tenant_id + '/' + file_name
            }
    

  7. Under Basic settings, increase the Timeout value to 29 seconds.
  8. Edit Environment variables to add a key called s3_bucket_name, with the value set to the name of your S3 bucket.
  9. Configure a new test event with the following JSON document, and save it as testEvent.
    {
      "login_tenant_id": "tenant1",
      "s3_tenant_home": "tenant1_home"
    }
    

  10. Choose Test to test the Lambda function with the newly created test event testEvent. You should see status code 200, and the body of the results should contain the data for tenant1.

    Figure 10: The result of running the sts-ti-demo-s3-lambda function
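
If you want to repeat this positive test outside the console, the following boto3 sketch invokes the function with the same testEvent payload; the Region is a placeholder for the Region where you created the function.

    import json
    import boto3

    lambda_client = boto3.client("lambda", region_name="us-west-2")  # placeholder Region

    response = lambda_client.invoke(
        FunctionName="sts-ti-demo-s3-lambda",
        Payload=json.dumps({
            "login_tenant_id": "tenant1",
            "s3_tenant_home": "tenant1_home",
        }).encode("utf-8"),
    )

    # A statusCode of 200 in the returned body indicates that tenant1 was able to read its own object
    print(json.loads(response["Payload"].read()))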

Next, create another Lambda function for Aurora PostgreSQL-Compatible. To do this, you first need to create a new Lambda layer.

To set up the Lambda layer

  1. Use the following commands to create a .zip file for Python package pg8000.

    Note: This example is created by using an Amazon EC2 instance running the Amazon Linux 2 Amazon Machine Image (AMI). If you’re using another version of Linux or don’t have the Python 3 or pip3 packages installed, install them by using the following commands.

    sudo yum update -y 
    sudo yum install python3 
    sudo pip3 install pg8000 -t build/python/lib/python3.8/site-packages/ 
    cd build 
    sudo zip -r pg8000.zip python/
    

  2. Copy the pg8000.zip file that you just created to your local desktop machine, or upload it to an S3 bucket location.
  3. Navigate to the Lambda console, choose Layers, and then choose Create layer.
  4. For Name, enter pgdb, and then upload pg8000.zip from your local desktop machine or from the S3 bucket location.

    Note: For more details, see the AWS documentation for creating and sharing Lambda layers.

  5. For Compatible runtimes, choose python3.6, python3.7, and python3.8, and then choose Create.
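
As an alternative to uploading the file in the console, you can publish the layer with boto3. The following sketch assumes that you uploaded pg8000.zip to an S3 bucket; the bucket name, key, and Region are placeholders.

    import boto3

    lambda_client = boto3.client("lambda", region_name="us-west-2")  # placeholder Region

    response = lambda_client.publish_layer_version(
        LayerName="pgdb",
        Description="pg8000 PostgreSQL client for Python",
        Content={
            "S3Bucket": "my-lambda-artifacts-bucket",  # placeholder bucket
            "S3Key": "pg8000.zip",                     # placeholder key
        },
        CompatibleRuntimes=["python3.6", "python3.7", "python3.8"],
    )
    print(response["LayerVersionArn"])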

To set up the Lambda function with the newly created Lambda layer

  1. In the Lambda console, choose Function, and then choose Create function.
  2. Choose Author from scratch. For Function name, enter sts-ti-demo-pgdb-lambda.
  3. For Runtime, choose Python 3.7.
  4. Change the default execution role to Use an existing role, and then select sts-ti-demo-lambda-role from the drop-down list.
  5. Keep Advanced settings as the default value, and then choose Create function.
  6. Choose Layers, and then choose Add a layer.
  7. Choose Custom layer, select pgdb with Version 1 from the drop-down list, and then choose Add.
  8. Copy the following Python code into the lambda_function.py file that was created in your Lambda function.
    import boto3
    import pg8000
    import os
    import time
    import ssl
    
    connection = None
    assumed_role_object = None
    rds_client = None
    
    def assume_role(event):
        global assumed_role_object
        try:
            RolePrefix  = os.environ.get("RolePrefix")
            LoginTenant = event['login_tenant_id']
        
            # create an STS client object that represents a live connection to the STS service
            sts_client      = boto3.client('sts')
            # Prepare input parameters
            role_to_assume  = 'arn:aws:iam::' + sts_client.get_caller_identity()['Account'] + ':role/' + RolePrefix + '-' + LoginTenant
            RoleSessionName = 'AssumeRoleSession' + str(time.time()).split(".")[0] + str(time.time()).split(".")[1]
        
            # Call the assume_role method of the STSConnection object and pass the role ARN and a role session name.
            assumed_role_object = sts_client.assume_role(
                RoleArn         =   role_to_assume, 
                RoleSessionName =   RoleSessionName,
                DurationSeconds =   900) #15 minutes 
            
            return assumed_role_object['Credentials']
        except Exception as e:
            print({'Role assumption failed!': {'role': role_to_assume, 'Exception': 'Failed due to :{0}'.format(str(e))}})
            return None
    
    def get_connection(event):
        global rds_client
        creds = assume_role(event)
    
        try:
            # create an RDS client using assumed credentials
            rds_client = boto3.client('rds',
                aws_access_key_id       = creds['AccessKeyId'],
                aws_secret_access_key   = creds['SecretAccessKey'],
                aws_session_token       = creds['SessionToken'])
    
            # Read the environment variables and event parameters
            DBEndPoint   = os.environ.get('DBEndPoint')
            DatabaseName = os.environ.get('DatabaseName')
            DBUserName   = event['dbuser']
    
            # Generates an auth token used to connect to a database with IAM credentials.
            pwd = rds_client.generate_db_auth_token(
                DBHostname=DBEndPoint, Port=5432, DBUsername=DBUserName, Region='us-west-2'
            )
    
            ssl_context             = ssl.SSLContext()
            ssl_context.verify_mode = ssl.CERT_REQUIRED
            ssl_context.load_verify_locations('rds-ca-2019-root.pem')
    
            # create a database connection
            conn = pg8000.connect(
                host        =   DBEndPoint,
                user        =   DBUserName,
                database    =   DatabaseName,
                password    =   pwd,
                ssl_context =   ssl_context)
            
            return conn
        except Exception as e:
            print ({'Database connection failed!': {'Exception': "Failed due to :{0}".format(str(e))}})
            return None
    
    def execute_sql(connection, query):
        try:
            cursor = connection.cursor()
            cursor.execute(query)
            columns = [str(desc[0]) for desc in cursor.description]
            results = []
            for res in cursor:
                results.append(dict(zip(columns, res)))
            cursor.close()
            retry = False
            return results    
        except Exception as e:
            print ({'Execute SQL failed!': {'Exception': "Failed due to :{0}".format(str(e))}})
            return None
    
    
    def lambda_handler(event, context):
        global connection
        try:
            connection = get_connection(event)
            if connection is None:
                return {'statusCode': 400, "body": "Error in database connection!"}
    
            response = {'statusCode':200, 'body': {
                'db & user': execute_sql(connection, 'SELECT CURRENT_DATABASE(), CURRENT_USER'), \
                'data from tenant_metadata': execute_sql(connection, 'SELECT * FROM tenant_metadata')}}
            return response
        except Exception as e:
            try:
                connection.close()
            except Exception as e:
                connection = None
            return {'statusCode': 400, 'statusDesc': 'Error!', 'body': 'Unhandled error in Lambda Handler.'}
    

  9. Add a certificate file called rds-ca-2019-root.pem into the Lambda project root by downloading it from https://s3.amazonaws.com/rds-downloads/rds-ca-2019-root.pem.
  10. Under Basic settings, increase the Timeout value to 29 seconds.
  11. Edit Environment variables to add the following keys and values.

    Key Value
    DBEndPoint Enter the database cluster endpoint URL
    DatabaseName Enter the database name
    RolePrefix assumeRole

    Figure 11: Example of environment variables display

  12. Configure a new test event with the following JSON document, and save it as testEvent.
    {
      "login_tenant_id": "tenant1",
      "dbuser": "tenant1_dbuser"
    }
    

  13. Choose Test to test the Lambda function with the newly created test event testEvent. You should see status code 200, and the body of the results should contain the data for tenant1.

    Figure 12: The result of running the sts-ti-demo-pgdb-lambda function

Step 7: Perform negative testing of tenant isolation

You already performed positive tests of tenant isolation during the Lambda function creation steps. However, it’s also important to perform some negative tests to verify the robustness of the tenant isolation controls.

To perform negative tests of tenant isolation

  1. In the Lambda console, navigate to the sts-ti-demo-s3-lambda function. Update the test event to the following, to mimic a scenario where tenant1 attempts to access other tenants’ objects.
    {
      "login_tenant_id": "tenant1",
      "s3_tenant_home": "tenant2_home"
    }
    

  2. Choose Test to test the Lambda function with the updated test event. You should see status code 400, and the body of the results should contain an error message.

    Figure 13: The results of running the sts-ti-demo-s3-lambda function (negative test)

  3. Navigate to the sts-ti-demo-pgdb-lambda function and update the test event to the following, to mimic a scenario where tenant1 attempts to access other tenants’ data elements.
    {
      "login_tenant_id": "tenant1",
      "dbuser": "tenant2_dbuser"
    }
    

  4. Choose Test to test the Lambda function with the updated test event. You should see status code 400, and the body of the results should contain an error message.

    Figure 14: The results of running the sts-ti-demo-pgdb-lambda function (negative test)

Cleaning up

To de-clutter your environment, remove the roles, policies, Lambda functions, Lambda layers, Amazon S3 prefixes, database users, and the database table that you created as part of this exercise. You can choose to delete the S3 bucket, as well as the Aurora PostgreSQL-Compatible database cluster that we mentioned in the Prerequisites section, to avoid incurring future charges.

Update the security group of the Aurora PostgreSQL-Compatible database cluster to remove the inbound rule that you added to allow a PostgreSQL TCP connection (Port 5432) from anywhere (0.0.0.0/0).

Conclusion

By taking advantage of attribute-based access control (ABAC) in IAM, you can more efficiently implement tenant isolation in SaaS applications. The solution we presented here helps to achieve tenant isolation in Amazon S3 and Aurora PostgreSQL-Compatible databases by using ABAC with the pool model of data partitioning.

If you run into any issues, you can use Amazon CloudWatch and AWS CloudTrail to troubleshoot. If you have feedback about this post, submit comments in the Comments section below.

To learn more, see these AWS Blog and AWS Support articles:

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Ashutosh Upadhyay

Ashutosh works as a Senior Security, Risk, and Compliance Consultant in AWS Professional Services (GFS) and is based in Singapore. Ashutosh started off his career as a developer, and for the past many years has been working in the Security, Risk, and Compliance field. Ashutosh loves to spend his free time learning and playing with new technologies.

Author

Chirantan Saha

Chirantan works as a DevOps Engineer in AWS Professional Services (GFS) and is based in Singapore. Chirantan has 20 years of development and DevOps experience working for large banking, financial services, and insurance organizations. Outside of work, Chirantan enjoys traveling, spending time with family, and helping his children with their science projects.

Security is the top priority for Amazon S3

Post Syndicated from Maddie Bacon original https://aws.amazon.com/blogs/security/security-is-the-top-priority-for-amazon-s3/

Amazon Simple Storage Service (Amazon S3) launched 15 years ago in March 2006, and became the first generally available service from Amazon Web Services (AWS). AWS marked the fifteenth anniversary with AWS Pi Week—a week of in-depth streams and live events. During AWS Pi Week, AWS leaders and experts reviewed the history of AWS and Amazon S3, and some of the key decisions involved in building and evolving S3.

As part of this celebration, Werner Vogels, VP and CTO for Amazon.com, and Eric Brandwine, VP and Distinguished Engineer with AWS Security, had a conversation about the role of security in Amazon S3 and all AWS services. They touched on why customers come to AWS, and how AWS services grow with customers by providing built-in security that can progress to protections that are more complex, based on each customer’s specific needs. They also touched on how, starting with Amazon S3 over 15 years ago and continuing to this day, security is the top priority at AWS, and how nothing can proceed at AWS without security that customers can rely on.

“In security, there are constantly challenging tradeoffs,” Eric says. “The path that we’ve taken at AWS is that our services are usable, but secure by default.”

To learn more about how AWS helps secure its customers’ systems and information through a culture of security first, watch the video, and be sure to check out AWS Pi Week 2021: The Birth of the AWS Cloud.

If you have feedback about this post, submit comments in the Comments section below.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Maddie Bacon

Maddie (she/her) is a technical writer for AWS Security with a passion for creating meaningful, inclusive content. She previously worked as a security reporter and editor at TechTarget and has a BA in Mathematics. In her spare time, she enjoys reading, traveling, and all things Harry Potter.

How to visualize multi-account Amazon Inspector findings with Amazon Elasticsearch Service

Post Syndicated from Moumita Saha original https://aws.amazon.com/blogs/security/how-to-visualize-multi-account-amazon-inspector-findings-with-amazon-elasticsearch-service/

Amazon Inspector helps to improve the security and compliance of your applications that are deployed on Amazon Web Services (AWS). It automatically assesses Amazon Elastic Compute Cloud (Amazon EC2) instances and applications on those instances. From that assessment, it generates findings related to exposure, potential vulnerabilities, and deviations from best practices.

You can use the findings from Amazon Inspector as part of a vulnerability management program for your Amazon EC2 fleet across multiple AWS Regions in multiple accounts. The ability to rank and efficiently respond to potential security issues reduces the time that potential vulnerabilities remain unresolved. Response is even faster when findings from all the accounts in your AWS environment are available in a single pane of glass.

Following AWS best practices, in a secure multi-account AWS environment, you can provision (using AWS Control Tower) a group of accounts, known as core accounts, for governing other accounts within the environment. One of the core accounts may be used as a central security account, which you can designate for governing the security and compliance posture across all accounts in your environment. Another core account is a centralized logging account, which you can provision and designate for central storage of log data.

In this blog post, I show you how to:

  1. Use Amazon Inspector, a fully managed security assessment service, to generate security findings.
  2. Gather findings from multiple Regions across multiple accounts using Amazon Simple Notification Service (Amazon SNS) and Amazon Simple Queue Service (Amazon SQS).
  3. Use AWS Lambda to send the findings to a central security account for deeper analysis and reporting.

In this solution, we send the findings to two services inside the central security account: an Amazon Elasticsearch Service (Amazon ES) domain, for analysis and visualization with Kibana, and an Amazon S3 bucket, for centralized storage.

Solution overview

Overall architecture

The flow of events to implement the solution is shown in Figure 1 and described in the following process flow.

Figure 1: Solution overview architecture

Process flow

The flow of this architecture is divided into two types of processes—a one-time process and a scheduled process. The AWS resources that are part of the one-time process are triggered the first time an Amazon Inspector assessment template is created in each Region of each application account. The AWS resources that are part of the scheduled process are triggered at the configured Amazon Inspector scan interval in each Region of each application account.

One-time process

  1. An event-based Amazon CloudWatch rule in each Region of every application account triggers a regional AWS Lambda function when an Amazon Inspector assessment template is created for the first time in that Region.

    Note: In order to restrict this event to trigger the Lambda function only the first time an assessment template is created, you must use a specific user-defined tag to trigger the Attach Inspector template to SNS Lambda function for only one Amazon Inspector template per Region. For more information on tags, see the Tagging AWS resources documentation.

  2. The Lambda function attaches the Amazon Inspector assessment template (created in application accounts) to the cross-account Amazon SNS topic (created in the security account). The function, the template, and the topic are all in the same AWS Region.

    Note: This step is needed because Amazon Inspector templates can only be attached to SNS topics in the same account via the AWS Management Console or AWS Command Line Interface (AWS CLI).

Scheduled process

  1. A scheduled Amazon CloudWatch Event in every Region of the application accounts starts the Amazon Inspector scan at a scheduled time interval, which you can configure.
  2. An Amazon Inspector agent conducts the scan on the EC2 instances of the Region where the assessment template is created and sends any findings to Amazon Inspector.
  3. Once the findings are generated, Amazon Inspector notifies the Amazon SNS topic of the security account in the same Region.
  4. The Amazon SNS topics from each Region of the central security account receive notifications of Amazon Inspector findings from all application accounts. The SNS topics then send the notifications to a central Amazon SQS queue in the primary Region of the security account.
  5. The Amazon SQS queue triggers the Send findings Lambda function (as shown in Figure 1) of the security account.

    Note: Each Amazon SQS message represents one Amazon Inspector finding.

  6. The Send findings Lambda function assumes a cross-account role to fetch the following information from all application accounts:
    1. Finding details from the Amazon Inspector API.
    2. Additional Amazon EC2 attributes—VPC, subnet, security group, and IP address—from EC2 instances with potential vulnerabilities.
  7. The Lambda function then sends all the gathered data to a central S3 bucket and a domain in Amazon ES—both in the central security account.

These Amazon Inspector findings, along with additional attributes on the scanned instances, can be used for further analysis and visualization via Kibana—a data visualization dashboard for Amazon ES. Storing a copy of these findings in an S3 bucket gives you the opportunity to forward the findings data to outside monitoring tools that don’t support direct data ingestion from AWS Lambda.
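
To make the enrichment in step 6 of the scheduled process more concrete, the following is a minimal boto3 sketch of the kind of calls the Send findings Lambda function makes, not the exact solution code. The cross-account role name and the Region are placeholders, and error handling is omitted for brevity.

    import boto3

    def fetch_finding_details(finding_arn, application_account_id, region):
        # Assume the cross-account role in the application account (placeholder role name)
        sts = boto3.client("sts")
        creds = sts.assume_role(
            RoleArn=f"arn:aws:iam::{application_account_id}:role/inspector-findings-read-role",
            RoleSessionName="inspector-findings-fetch",
        )["Credentials"]

        session = boto3.Session(
            aws_access_key_id=creds["AccessKeyId"],
            aws_secret_access_key=creds["SecretAccessKey"],
            aws_session_token=creds["SessionToken"],
            region_name=region,
        )

        # Finding details from the Amazon Inspector API
        inspector = session.client("inspector")
        finding = inspector.describe_findings(findingArns=[finding_arn])["findings"][0]

        # Additional EC2 attributes for the affected instance
        instance_id = finding["assetAttributes"]["agentId"]
        ec2 = session.client("ec2")
        instance = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"][0]["Instances"][0]

        return {
            "title": finding["title"],
            "severity": finding["severity"],
            "instance_id": instance_id,
            "vpc": instance.get("VpcId"),
            "subnet": instance.get("SubnetId"),
            "security_groups": [g["GroupId"] for g in instance.get("SecurityGroups", [])],
            "private_ip": instance.get("PrivateIpAddress"),
        }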

Prerequisites

The following resources must be set up before you can implement this solution:

  1. A multi-account structure. To learn how to set up a multi-account structure, see Setting up AWS Control Tower and AWS Landing zone.
  2. Amazon Inspector agents must be installed on all EC2 instances. See Installing Amazon Inspector agents to learn how to set up Amazon Inspector agents on EC2 instances. Additionally, keep note of all the Regions where you install the Amazon Inspector agent.
  3. An Amazon ES domain with Kibana authentication. See Getting started with Amazon Elasticsearch Service and Use Amazon Cognito for Kibana access control.
  4. An S3 bucket for centralized storage of Amazon Inspector findings.
  5. An S3 bucket for storage of the Lambda source code for the solution.

Set up Amazon Inspector with Amazon ES and S3

Follow these steps to set up centralized Amazon Inspector findings with Amazon ES and Amazon S3:

  1. Upload the solution ZIP file to the S3 bucket used for Lambda code storage.
  2. Collect the input parameters for AWS CloudFormation deployment.
  3. Deploy the base template into the central security account.
  4. Deploy the second template in the primary Region of all application accounts to create global resources.
  5. Deploy the third template in all Regions of all application accounts.

Step 1: Upload the solution ZIP file to the S3 bucket used for Lambda code storage

  1. From GitHub, download the file Inspector-to-S3ES-crossAcnt.zip.
  2. Upload the ZIP file to the S3 bucket you created in the central security account for Lambda code storage. This code is used to create the Lambda function in the first CloudFormation stack set of the solution.

Step 2: Collect input parameters for AWS CloudFormation deployment

In this solution, you deploy three AWS CloudFormation stack sets in succession. Each stack set should be created in the primary Region of the central security account. Underlying stacks are deployed across the central security account and in all the application accounts where the Amazon Inspector scan is performed. You can learn more in Working with AWS CloudFormation StackSets.

Before you proceed to the stack set deployment, you must collect the input parameters for the first stack set: Central-SecurityAcnt-BaseTemplate.yaml.

To collect input parameters for AWS CloudFormation deployment

  1. Fetch the account ID (CentralSecurityAccountID) of the AWS account where the stack set will be created and deployed. You can use the steps in Finding your AWS account ID to help you find the account ID.
  2. Values for the ES domain parameters can be fetched from the Amazon ES console.
    1. Open the Amazon ES Management Console and select the Region where the Amazon ES domain exists.
    2. Select the domain name to view the domain details.
    3. The value for ElasticsearchDomainName is displayed in the top-left corner of the domain details page.
    4. On the Overview tab in the domain details window, select and copy the URL value of the Endpoint to use as the ElasticsearchEndpoint parameter of the template. Make sure to exclude the https:// at the beginning of the URL.

      Figure 2: Details of the Amazon ES domain for fetching parameter values

  3. Get the values for the S3 bucket parameters from the Amazon S3 console.
    1. Open the Amazon S3 Management Console.
    2. Copy the name of the S3 bucket that you created for centralized storage of Amazon Inspector findings. Save this bucket name for the LoggingS3Bucket parameter value of the Central-SecurityAcnt-BaseTemplate.yaml template.
    3. Select the S3 bucket used for source code storage. Select the bucket name and copy the name of this bucket for the LambdaSourceCodeS3Bucket parameter of the template.

      Figure 3: The S3 bucket where Lambda code is uploaded

  4. On the bucket details page, select the source code ZIP file name that you previously uploaded to the bucket. In the detail page of the ZIP file, choose the Overview tab, and then copy the value in the Key field to use as the value for the LambdaCodeS3Key parameter of the template.

    Figure 4: Details of the Lambda code ZIP file uploaded in Amazon S3 showing the key prefix value

Note: All of the other input parameter values of the template are entered automatically, but you can change them during stack set creation if necessary.
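
If you prefer to collect these values with a script, the following boto3 sketch shows one way to fetch the account ID and the Amazon ES endpoint; the domain name and Region are placeholders, and the Endpoint attribute applies to public (non-VPC) domains.

    import boto3

    # CentralSecurityAccountID
    account_id = boto3.client("sts").get_caller_identity()["Account"]

    # ElasticsearchDomainName and ElasticsearchEndpoint (exclude the https:// prefix)
    es = boto3.client("es", region_name="us-east-1")  # placeholder: Region of your Amazon ES domain
    domain = es.describe_elasticsearch_domain(DomainName="my-inspector-domain")  # placeholder name
    endpoint = domain["DomainStatus"]["Endpoint"]  # public domains; VPC domains return Endpoints instead

    print(account_id, endpoint)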

Step 3: Deploy the base template into the central security account

Now that you’ve collected the input parameters, you’re ready to deploy the base template that will create the necessary resources for this solution implementation in the central security account.

Prerequisites for CloudFormation stack set deployment

There are two permission modes that you can choose from for deploying a stack set in AWS CloudFormation. If you’re using AWS Organizations and have all features enabled, you can use the service-managed permissions; otherwise, self-managed permissions mode is recommended. To deploy this solution, you’ll use self-managed permissions mode. To run stack sets in self-managed permissions mode, your administrator account and the target accounts must have two IAM roles—AWSCloudFormationStackSetAdministrationRole and AWSCloudFormationStackSetExecutionRole—as prerequisites. In this solution, the administrator account is the central security account and the target accounts are application accounts. You can use the following CloudFormation templates to create the necessary IAM roles:

To deploy the base template

  1. Download the base template (Central-SecurityAcnt-BaseTemplate.yaml) from GitHub.
  2. Open the AWS CloudFormation Management Console and select the Region where all the stack sets will be created for deployment. This should be the primary Region of your environment.
  3. Select Create StackSet.
    1. In the Create StackSet window, select Template is ready and then select Upload a template file.
    2. Under Upload a template file, select Choose file and select the Central-SecurityAcnt-BaseTemplate.yaml template that you downloaded earlier.
    3. Choose Next.
  4. Add stack set details.
    1. Enter a name for the stack set in StackSet name.
    2. Under Parameters, most of the values are pre-populated except the values you collected in the previous procedure for CentralSecurityAccountID, ElasticsearchDomainName, ElasticsearchEndpoint, LoggingS3Bucket, LambdaSourceCodeS3Bucket, and LambdaCodeS3Key.
    3. After all the values are populated, choose Next.
  5. Configure StackSet options.
    1. (Optional) Add tags as described in the prerequisites to apply to the resources in the stack set that these rules will be deployed to. Tagging is a recommended best practice, because it enables you to add metadata information to resources during their creation.
    2. Under Permissions, choose the Self-service permissions mode to be used for deploying the stack set, and then select the AWSCloudFormationStackSetAdministrationRole from the dropdown list.

      Figure 5: Permission mode to be selected for stack set deployment

    3. Choose Next.
  6. Add the account and Region details where the template will be deployed.
    1. Under Deployment locations, select Deploy stacks in accounts. Under Account numbers, enter the account ID of the security account that you collected earlier.

      Figure 6: Values to be provided during the deployment of the first stack set

    2. Under Specify regions, select all the Regions where the stacks will be created. This should be the list of Regions where you installed the Amazon Inspector agent. Keep note of this list of Regions to use in the deployment of the third template in an upcoming step.
      • Though an Amazon Inspector scan is performed in all the application accounts, the regional Amazon SNS topics that send scan finding notifications are created in the central security account. Therefore, this template is created in all the Regions where Amazon Inspector will notify SNS. The template has the logic needed to handle the creation of specific AWS resources only in the primary Region, even though the template executes in many Regions.
      • The order in which Regions are selected under Specify regions defines the order in which the stack is deployed in the Regions. So you must make sure that the primary Region of your deployment is the first one specified under Specify regions, followed by the other Regions of stack set deployment. This is required because global resources are created using one Region—ideally the primary Region—and so stack deployment in that Region should be done before deployment to other Regions in order to avoid any build dependencies.

        Figure 7: Showing the order of specifying the Regions of stack set deployment

  7. Review the template settings and select the check box to acknowledge the Capabilities section. This is required if your deployment template creates IAM resources. You can learn more at Controlling access with AWS Identity and Access Management.

    Figure 8: Acknowledge IAM resources creation by AWS CloudFormation

  8. Choose Submit to deploy the stack set.
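
The same deployment can be scripted. The following boto3 sketch shows roughly equivalent calls for a self-managed-permissions stack set; the stack set name, account ID, parameter values, and Region list are placeholders that you replace with the values collected in Step 2.

    import boto3

    cfn = boto3.client("cloudformation", region_name="us-east-1")  # placeholder: your primary Region

    with open("Central-SecurityAcnt-BaseTemplate.yaml") as f:
        template_body = f.read()

    cfn.create_stack_set(
        StackSetName="central-security-base",  # placeholder name
        TemplateBody=template_body,
        Capabilities=["CAPABILITY_NAMED_IAM"],
        PermissionModel="SELF_MANAGED",
        AdministrationRoleARN="arn:aws:iam::111122223333:role/AWSCloudFormationStackSetAdministrationRole",
        ExecutionRoleName="AWSCloudFormationStackSetExecutionRole",
        Parameters=[
            {"ParameterKey": "CentralSecurityAccountID", "ParameterValue": "111122223333"},
            # ... remaining parameter values collected in Step 2
        ],
    )

    # Deploy stacks into the security account; list the primary Region first, as described above
    cfn.create_stack_instances(
        StackSetName="central-security-base",
        Accounts=["111122223333"],
        Regions=["us-east-1", "us-west-2"],  # placeholder Region list
    )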

Step 4: Deploy the second template in the primary Region of all application accounts to create global resources

This template creates the global resources required for sending Amazon Inspector findings to Amazon ES and Amazon S3.

To deploy the second template

  1. Download the template (ApplicationAcnts-RolesTemplate.yaml) from GitHub and use it to create the second CloudFormation stack set in the primary Region of the central security account.
  2. To deploy the template, follow the steps used to deploy the base template (described in the previous section) through Configure StackSet options.
  3. In Set deployment options, do the following:
    1. Under Account numbers, enter the account IDs of your application accounts as comma-separated values. You can use the steps in Finding your AWS account ID to help you gather the account IDs.
    2. Under Specify regions, select only your primary Region.

      Figure 9: Select account numbers and specify Regions

  4. The remaining steps are the same as for the base template deployment.

Step 5: Deploy the third template in all Regions of all application accounts

This template creates the resources in each Region of all application accounts needed for scheduled scanning of EC2 instances using Amazon Inspector. Notifications are sent to the SNS topics of each Region of the central security account.

To deploy the third template

  1. Download the template InspectorRun-SetupTemplate.yaml from GitHub and create the final AWS CloudFormation stack set. Similar to the previous stack sets, this one should also be created in the central security account.
  2. For deployment, follow the same steps you used to deploy the base template through Configure StackSet options.
  3. In Set deployment options:
    1. Under Account numbers, enter the same account IDs of your application accounts (comma-separated values) as you did for the second template deployment.
    2. Under Specify regions, select all the Regions where you installed the Amazon Inspector agent.

      Note: This list of Regions should be the same as the Regions where you deployed the base template.

  4. The remaining steps are the same as for the second template deployment.

Test the solution and delivery of the findings

After successful deployment of the architecture, to test the solution you can wait until the next scheduled Amazon Inspector scan or you can use the following steps to run the Amazon Inspector scan manually.

To run the Amazon Inspector scan manually for testing the solution

  1. In any one of the application accounts, go to any Region where the Amazon Inspector scan was performed.
  2. Open the Amazon Inspector console.
  3. In the left navigation menu, select Assessment templates to see the available assessments.
  4. Choose the assessment template that was created by the third template.
  5. Choose Run to start the assessment immediately.
  6. When the run is complete, Last run status changes from Collecting data to Analysis Complete.

    Figure 10: Amazon Inspector assessment run

  7. You can see the recent scan findings in the Amazon Inspector console by selecting Assessment runs from the left navigation menu.

    Figure 11: The assessment run indicates total findings from the last Amazon Inspector run in this Region

  8. In the left navigation menu, select Findings to see details of each finding, or use the steps in the following section to verify the delivery of findings to the central security account.

Test the delivery of the Amazon Inspector findings

This solution delivers the Amazon Inspector findings to two AWS services—Amazon ES and Amazon S3—in the primary Region of the central security account. You can either use Kibana to view the findings sent to Amazon ES or you can use the findings sent to Amazon S3 and forward them to the security monitoring software of your preference for further analysis.

To check whether the findings are delivered to Amazon ES

  1. Open the Amazon ES Management Console and select the Region where the Amazon ES domain is located.
  2. Select the domain name to view the domain details.
  3. On the domain details page, select the Kibana URL.

    Figure 12: Amazon ES domain details page

  4. Log in to Kibana using your preferred authentication method as set up in the prerequisites.
    1. In the left panel, select Discover.
    2. In the Discover window, select a Region to view the total number of findings in that Region.

      Figure 13: The total findings in Kibana for the chosen Region of an application account

To check whether the findings are delivered to Amazon S3

  1. Open the Amazon S3 Management Console.
  2. Select the S3 bucket that you created for storing Amazon Inspector findings.
  3. Select the bucket name to view the bucket details. The total number of findings for the chosen Region is at the top right corner of the Overview tab.

    Figure 14: The total security findings as stored in an S3 bucket for us-east-1 Region

Visualization in Kibana

The data sent to the Amazon ES index can be used to create visualizations in Kibana that make it easier to identify potential security gaps and plan the remediation accordingly.

You can use Kibana to create a dashboard that gives an overview of the potential vulnerabilities identified in different instances of different AWS accounts. Figure 15 shows an example of such a dashboard. The dashboard can help you rank the need for remediation based on criteria such as:

  • The category of vulnerability
  • The most impacted AWS accounts
  • EC2 instances that need immediate attention

Figure 15: A sample Kibana dashboard showing findings from Amazon Inspector

You can build additional panels to visualize details of the vulnerability findings identified by Amazon Inspector, such as the CVE ID of the security vulnerability, its description, and recommendations on how to remove the vulnerabilities.

Figure 16: A sample Kibana dashboard panel listing the top identified vulnerabilities and their details

Conclusion

By using this solution to combine Amazon Inspector, Amazon SNS topics, Amazon SQS queues, Lambda functions, an Amazon ES domain, and S3 buckets, you can centrally analyze and monitor the vulnerability posture of EC2 instances across your AWS environment, including multiple Regions across multiple AWS accounts. This solution is built following least privilege access through AWS IAM roles and policies to help secure the cross-account architecture.

In this blog post, you learned how to send the findings directly to Amazon ES for visualization in Kibana. These visualizations can be used to build dashboards that security analysts can use for centralized monitoring. Better monitoring capability helps analysts to identify potentially vulnerable assets and perform remediation activities to improve security of your applications in AWS and their underlying assets. This solution also demonstrates how to store the findings from Amazon Inspector in an S3 bucket, which makes it easier for you to use those findings to create visualizations in your preferred security monitoring software.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Moumita Saha

Moumita is a Security Consultant with AWS Professional Services working to help enterprise customers secure their workloads in the cloud. She assists customers in secure cloud migration, designing automated solutions to protect against cyber threats in the cloud. She is passionate about cyber security, data privacy, and new, emerging cloud-security technologies.

Use Macie to discover sensitive data as part of automated data pipelines

Post Syndicated from Brandon Wu original https://aws.amazon.com/blogs/security/use-macie-to-discover-sensitive-data-as-part-of-automated-data-pipelines/

Data is a crucial part of every business and is used for strategic decision making at all levels of an organization. To extract value from their data more quickly, Amazon Web Services (AWS) customers are building automated data pipelines—from data ingestion to transformation and analytics. As part of this process, my customers often ask how to prevent sensitive data, such as personally identifiable information, from being ingested into data lakes when it’s not needed. They highlight that this challenge is compounded when ingesting unstructured data—such as files from process reporting, text files from chat transcripts, and emails. They also mention that identifying sensitive data inadvertently stored in structured data fields—such as in a comment field stored in a database—is also a challenge.

In this post, I show you how to integrate Amazon Macie as part of the data ingestion step in your data pipeline. This solution provides an additional checkpoint that sensitive data has been appropriately redacted or tokenized prior to ingestion. Macie is a fully managed data security and privacy service that uses machine learning and pattern matching to discover sensitive data in AWS.

When Macie discovers sensitive data, the solution notifies an administrator to review the data and decide whether to allow the data pipeline to continue ingesting the objects. If allowed, the objects will be tagged with an Amazon Simple Storage Service (Amazon S3) object tag to identify that sensitive data was found in the object before progressing to the next stage of the pipeline.

This combination of automation and manual review helps reduce the risk that sensitive data—such as personally identifiable information—will be ingested into a data lake. This solution can be extended to fit your use case and workflows. For example, you can define custom data identifiers as part of your scans, add additional validation steps, create Macie suppression rules to archive findings automatically, or only request manual approvals for findings that meet certain criteria (such as high severity findings).

Solution overview

Many of my customers are building serverless data lakes with Amazon S3 as the primary data store. Their data pipelines commonly use different S3 buckets at each stage of the pipeline. I refer to the S3 bucket for the first stage of ingestion as the raw data bucket. A typical pipeline might have separate buckets for raw, curated, and processed data representing different stages as part of their data analytics pipeline.

Typically, customers will perform validation and clean their data before moving it to a raw data zone. This solution adds validation steps to that pipeline after preliminary quality checks and data cleaning is performed, noted in blue (in layer 3) of Figure 1. The layers outlined in the pipeline are:

  1. Ingestion – Brings data into the data lake.
  2. Storage – Provides durable, scalable, and secure components to store the data—typically using S3 buckets.
  3. Processing – Transforms data into a consumable state through data validation, cleanup, normalization, transformation, and enrichment. This processing layer is where the additional validation steps are added to identify instances of sensitive data that haven’t been appropriately redacted or tokenized prior to consumption.
  4. Consumption – Provides tools to gain insights from the data in the data lake.

Figure 1: Data pipeline with sensitive data scan

The application runs on a scheduled basis (four times a day, every 6 hours by default) to process data that is added to the raw data S3 bucket. You can customize the application to perform a sensitive data discovery scan during any stage of the pipeline. Because most customers do their extract, transform, and load (ETL) daily, the application scans for sensitive data on a scheduled basis before any crawler jobs run to catalog the data and after typical validation and data redaction or tokenization processes complete.
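
For example, if you want to change how often the pipeline scans for sensitive data, you can update the schedule expression on the EventBridge rule that starts the workflow. The following boto3 sketch uses a placeholder rule name; look up the actual name of the rule that the application created in your account.

    import boto3

    events = boto3.client("events")

    # Change the scan schedule from every 6 hours to once a day (rule name is a placeholder)
    events.put_rule(
        Name="macie-sensitive-data-scan-schedule",
        ScheduleExpression="rate(1 day)",
        State="ENABLED",
    )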

You can expect that this additional validation will add 5–10 minutes to your pipeline execution at a minimum. The validation processing time will scale linearly based on object size, but there is a start-up time per job that is constant.

If sensitive data is found in the objects, an email is sent to the designated administrator requesting an approval decision, which they indicate by selecting the link corresponding to their decision to approve or deny the next step. In most cases, the reviewer will choose to adjust the sensitive data cleanup processes to remove the sensitive data, deny the progression of the files, and re-ingest the files in the pipeline.

Additional considerations for deploying this application for regular use are discussed at the end of the blog post.

Application components

The following resources are created as part of the application:

  • Amazon S3 buckets for the raw data, scan stage, manual review, and scanned data stages of the pipeline
  • An Amazon EventBridge rule and an AWS Step Functions state machine that schedule and orchestrate the scan workflow
  • AWS Lambda functions that move objects between buckets, start and monitor the Amazon Macie sensitive data discovery job, and process the approval decision
  • An Amazon SNS topic (ApprovalRequestNotification) and an Amazon API Gateway endpoint that handle the manual review notification and approval decision

Note: The application uses various AWS services, and there are costs associated with these resources beyond Free Tier usage. See AWS Pricing for details. The primary drivers of the solution cost are the amount of data ingested through the pipeline, both for Amazon S3 storage and for data processed for sensitive data discovery with Macie.

The architecture of the application is shown in Figure 2 and described in the text that follows.

Figure 2: Application architecture and logic

Application logic

  1. Objects are uploaded to the raw data S3 bucket as part of the data ingestion process.
  2. A scheduled EventBridge rule runs the sensitive data scan Step Functions workflow.
  3. triggerMacieScan Lambda function moves objects from the raw data S3 bucket to the scan stage S3 bucket.
  4. triggerMacieScan Lambda function creates a Macie sensitive data discovery job on the scan stage S3 bucket.
  5. checkMacieStatus Lambda function checks the status of the Macie sensitive data discovery job (a brief sketch of this status check follows this list).
  6. isMacieStatusCompleteChoice Step Functions Choice state checks whether the Macie sensitive data discovery job is complete.
    1. If yes, the getMacieFindingsCount Lambda function runs.
    2. If no, the Step Functions Wait state waits 60 seconds and then restarts Step 5.
  7. getMacieFindingsCount Lambda function counts all of the findings from the Macie sensitive data discovery job.
  8. isSensitiveDataFound Step Functions Choice state checks whether sensitive data was found in the Macie sensitive data discovery job.
    1. If there was sensitive data discovered, run the triggerManualApproval Lambda function.
    2. If there was no sensitive data discovered, run the moveAllScanStageS3Files Lambda function.
  9. moveAllScanStageS3Files Lambda function moves all of the objects from the scan stage S3 bucket to the scanned data S3 bucket.
  10. triggerManualApproval Lambda function tags and moves objects with sensitive data discovered to the manual review S3 bucket, and moves objects with no sensitive data discovered to the scanned data S3 bucket. The function then sends a notification to the ApprovalRequestNotification Amazon SNS topic as a notification that manual review is required.
  11. Email is sent to the email address that’s subscribed to the ApprovalRequestNotification Amazon SNS topic (from the application deployment template) for the manual review user with the option to Approve or Deny pipeline ingestion for these objects.
  12. Manual review user assesses the objects with sensitive data in the manual review S3 bucket and selects the Approve or Deny links in the email.
  13. The decision request is sent from the Amazon API Gateway to the receiveApprovalDecision Lambda function.
  14. manualApprovalChoice Step Functions Choice state checks the decision from the manual review user.
    1. If denied, run the deleteManualReviewS3Files Lambda function.
    2. If approved, run the moveToScannedDataS3Files Lambda function.
  15. deleteManualReviewS3Files Lambda function deletes the objects from the manual review S3 bucket.
  16. moveToScannedDataS3Files Lambda function moves the objects from the manual review S3 bucket to the scanned data S3 bucket.
  17. The next step of the automated data pipeline will begin with the objects in the scanned data S3 bucket.
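
To make step 5 concrete, the following minimal Python sketch checks the status of a Macie sensitive data discovery job with the Amazon Macie describe_classification_job API, the kind of call a polling Lambda function can make. The function name and structure are illustrative assumptions, not the source of the repository's checkMacieStatus function.

# Illustrative sketch only; the deployed checkMacieStatus Lambda function may differ.
import boto3

macie = boto3.client("macie2")

def check_macie_job_status(job_id: str) -> str:
    """Return the current status of a Macie classification job, such as RUNNING or COMPLETE."""
    response = macie.describe_classification_job(jobId=job_id)
    return response["jobStatus"]

The Step Functions Choice state in step 6 can then branch on the returned status, looping through the Wait state until the job reports COMPLETE.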

Prerequisites

For this application, you need the following prerequisites:

You can use AWS Cloud9 to deploy the application. AWS Cloud9 includes the AWS CLI and AWS SAM CLI to simplify setting up your development environment.

Deploy the application with AWS SAM CLI

You can deploy this application using the AWS SAM CLI. AWS SAM uses AWS CloudFormation as the underlying deployment mechanism. AWS SAM is an open-source framework that you can use to build serverless applications on AWS.

To deploy the application

  1. Initialize the serverless application using the AWS SAM CLI from the GitHub project in the aws-samples repository. This clones the project locally, which includes the source code for the Lambda functions, the Step Functions state machine definition file, and the AWS SAM template. On the command line, run the following:
    sam init --location gh:aws-samples/amazonmacie-datapipeline-scan
    

    Alternatively, you can clone the GitHub project directly.

  2. Deploy your application to your AWS account. On the command line, run the following:
    sam deploy --guided
    

    Complete the prompts during the guided interactive deployment. The first deployment prompt is shown in the following example.

    Configuring SAM deploy
    ======================
    
            Looking for config file [samconfig.toml] :  Found
            Reading default arguments  :  Success
    
            Setting default arguments for 'sam deploy'
            =========================================
            Stack Name [maciepipelinescan]:
    

  3. Settings:
    • Stack Name – Name of the CloudFormation stack to be created.
    • AWS Region – The Region to deploy the application to (for example, us-west-2, eu-west-1, or ap-southeast-1). This application was tested in the us-west-2 and ap-southeast-1 Regions. Before selecting a Region, verify that the services you need (for example, Macie and Step Functions) are available in that Region.
    • Parameter StepFunctionName – Name of the Step Functions state machine to be created (for example, maciepipelinescanstatemachine).
    • Parameter BucketNamePrefix – Prefix to apply to the S3 buckets to be created (S3 bucket names are globally unique, so choosing a random prefix helps ensure uniqueness).
    • Parameter ApprovalEmailDestination – Email address to receive the manual review notification.
    • Parameter EnableMacie – Whether you need Macie enabled in your account or Region. Select yes if you need Macie to be enabled for you as part of this template; select no if you already have Macie enabled.
  4. Confirm changes and provide approval for AWS SAM CLI to deploy the resources to your AWS account by responding y to prompts, as shown in the following example. You can accept the defaults for the SAM configuration file and SAM configuration environment prompts.
    #Shows you resources changes to be deployed and require a 'Y' to initiate deploy
    Confirm changes before deploy [y/N]: y
    #SAM needs permission to be able to create roles to connect to the resources in your template
    Allow SAM CLI IAM role creation [Y/n]: y
    ReceiveApprovalDecisionAPI may not have authorization defined, Is this okay? [y/N]: y
    ReceiveApprovalDecisionAPI may not have authorization defined, Is this okay? [y/N]: y
    Save arguments to configuration file [Y/n]: y
    SAM configuration file [samconfig.toml]: 
    SAM configuration environment [default]:
    

    Note: This application deploys an Amazon API Gateway with two REST API resources without authorization defined to receive the decision from the manual review step. You will be prompted to accept each resource without authorization. A token (Step Functions taskToken) is used to authenticate the requests.

  5. This creates an AWS CloudFormation changeset. Once the changeset creation is complete, you must provide a final confirmation by entering y at the Deploy this changeset? [y/N] prompt, as shown in the following example.
    Changeset created successfully. arn:aws:cloudformation:ap-southeast-1:XXXXXXXXXXXX:changeSet/samcli-deploy1605213119/db681961-3635-4305-b1c7-dcc754c7XXXX
    
    
    Previewing CloudFormation changeset before deployment
    ======================================================
    Deploy this changeset? [y/N]:
    

Your application is deployed to your account using AWS CloudFormation. You can track the deployment events in the command prompt or via the AWS CloudFormation console.

After the application deployment is complete, you must confirm the subscription to the Amazon SNS topic. An email will be sent to the email address entered in Step 3 with a link that you need to select to confirm the subscription. This confirmation provides opt-in consent for AWS to send emails to you via the specified Amazon SNS topic. The emails will be notifications of potentially sensitive data that need to be approved. If you don’t see the verification email, be sure to check your spam folder.

Test the application

The application uses an EventBridge scheduled rule to start the sensitive data scan workflow, which runs every 6 hours. You can manually start an execution of the workflow to verify that it’s working. To test the function, you will need a file that contains data that matches your rules for sensitive data. For example, it is easy to create a spreadsheet, document, or text file that contains names, addresses, and numbers formatted like credit card numbers. You can also use this generated sample data to test Macie.
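
If you want to script the creation of a test file, the following Python sketch writes a small CSV with names, email addresses, and well-known test card numbers that satisfy the Luhn check. The column layout and values are made up for illustration; whether Macie flags them depends on the managed and custom data identifiers in your job.

import csv

# Well-known test card numbers that pass the Luhn check; commonly used for testing.
TEST_CARD_NUMBERS = ["4111111111111111", "5555555555554444", "378282246310005"]

def write_sample_file(path: str = "sample-sensitive-data.csv") -> None:
    """Write a small CSV containing names and credit-card-like test numbers."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["name", "email", "credit card number"])
        for i, card in enumerate(TEST_CARD_NUMBERS):
            writer.writerow([f"Test User {i}", f"user{i}@example.com", card])

write_sample_file()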

We will test by uploading a file to our S3 bucket via the AWS Management Console. If you prefer to copy objects from the command line, that also works.

Upload test objects to the S3 bucket

  1. Navigate to the Amazon S3 console and upload one or more test objects to the <BucketNamePrefix>-data-pipeline-raw bucket. <BucketNamePrefix> is the prefix you entered when deploying the application in the AWS SAM CLI prompts. You can use any objects as long as they’re a supported file type for Amazon Macie. I suggest uploading multiple objects, some with and some without sensitive data, in order to see how the workflow processes each.

Start the Scan State Machine

  1. Navigate to the Step Functions state machines console. If you don’t see your state machine, make sure you’re connected to the same Region that you deployed your application to.
  2. Choose the state machine you created using the AWS SAM CLI as seen in Figure 3. The example state machine is maciepipelinescanstatemachine, but you might have used a different name in your deployment.
     
    Figure 3: AWS Step Functions state machines console

    Figure 3: AWS Step Functions state machines console

  3. Select the Start execution button and copy the value from the Enter an execution name – optional box. Change the Input – optional value by replacing <execution id> with the value you just copied, as follows:
    {
        "id": "<execution id>"
    }
    

    In my example, the <execution id> is fa985a4f-866b-b58b-d91b-8a47d068aa0c from the Enter an execution name – optional box, as shown in Figure 4. You can choose a different ID value if you prefer. This ID is used by the workflow to tag the objects being processed to ensure that only objects that are scanned continue through the pipeline. When the EventBridge scheduled event starts the workflow as scheduled, an ID is included in the input to the Step Functions workflow. Then select Start execution again. If you prefer to start the workflow from code, see the sketch after this list.
     

    Figure 4: New execution dialog box

    Figure 4: New execution dialog box

  4. You can see the status of your workflow execution in the Graph inspector as shown in Figure 5. In the figure, the workflow is at the pollForCompletionWait step.
     
    Figure 5: AWS Step Functions graph inspector

    Figure 5: AWS Step Functions graph inspector
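
As an alternative to the console steps above, you can start an execution from code. The following hedged Python sketch uses the Step Functions start_execution API; the state machine ARN is a placeholder that you would replace with the ARN of your deployed state machine.

import json
import uuid

import boto3

sfn = boto3.client("stepfunctions")
execution_id = str(uuid.uuid4())

sfn.start_execution(
    # Placeholder ARN; use the ARN of the state machine created by your deployment.
    stateMachineArn="arn:aws:states:us-west-2:111122223333:stateMachine:maciepipelinescanstatemachine",
    name=execution_id,
    # The workflow expects the execution ID in its input so it can tag the objects it processes.
    input=json.dumps({"id": execution_id}),
)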

The sensitive data discovery job should run for about 5 to 10 minutes. The jobs scale linearly based on object size, but there is a constant start-up time per job. If sensitive data is found in the objects uploaded to the <BucketNamePrefix>-data-pipeline-raw S3 bucket, an email is sent to the address provided during the AWS SAM deployment step notifying the recipient that an approval decision is needed, which they indicate by selecting the Approve or Deny link in the message, as shown in Figure 6.
 

Figure 6: Sensitive data identified email

Figure 6: Sensitive data identified email

When you receive this notification, you can investigate the findings by reviewing the objects in the <BucketNamePrefix>-data-pipeline-manual-review S3 bucket. Based on your review, you can either apply remediation steps to remove any sensitive data or allow the data to proceed to the next step of the data ingestion pipeline. You should define a standard response process to address discovery of sensitive data in the data pipeline. Common remediation steps include review of the files for sensitive data, deleting the files that you do not want to progress, and updating the ETL process to redact or tokenize sensitive data when re-ingesting into the pipeline. When you re-ingest the files into the pipeline without sensitive data, the files will not be flagged by Macie.

The workflow performs the following:

  • If you select Approve, the files are moved to the <BucketNamePrefix>-data-pipeline-scanned-data S3 bucket with an Amazon S3 SensitiveDataFound object tag with a value of true.
  • If you select Deny, the files are deleted from the <BucketNamePrefix>-data-pipeline-manual-review S3 bucket.
  • If no action is taken, the Step Functions workflow execution times out after five days and the file will automatically be deleted from the <BucketNamePrefix>-data-pipeline-manual-review S3 bucket after 10 days.

Clean up the application

You’ve successfully deployed and tested the sensitive data pipeline scan workflow. To avoid ongoing charges for resources you created, you should delete all associated resources by deleting the CloudFormation stack. In order to delete the CloudFormation stack, you must first delete all objects that are stored in the S3 buckets that you created for the application.

To delete the application

  1. Empty the S3 buckets created in this application (<BucketNamePrefix>-data-pipeline-raw S3 bucket, <BucketNamePrefix>-data-pipeline-scan-stage, <BucketNamePrefix>-data-pipeline-manual-review, and <BucketNamePrefix>-data-pipeline-scanned-data).
  2. Delete the CloudFormation stack used to deploy the application.

Considerations for regular use

Before using this application in a production data pipeline, you will need to stop and consider some practical matters. First, the notification mechanism used when sensitive data is identified in the objects is email. Email doesn’t scale: you should expand this solution to integrate with your ticketing or workflow management system. If you choose to use email, subscribe a mailing list so that the work of reviewing and responding to alerts is shared across a team.

Second, the application runs on a scheduled basis (every 6 hours by default). You should consider starting the workflow when your preliminary validations have completed and the data is ready for a sensitive data scan as part of your pipeline. You can modify the EventBridge rule to run the workflow in response to an Amazon EventBridge event instead of on a schedule.

Third, the application currently uses a 60-second Step Functions Wait state when polling for the Macie discovery job completion. In real-world scenarios, the discovery scan will take 10 minutes at a minimum, and likely much longer. You should evaluate the typical execution times for your datasets and tune the polling period accordingly. This will help reduce costs related to running Lambda functions and log storage within CloudWatch Logs. The polling period is defined in the Step Functions state machine definition file (macie_pipeline_scan.asl.json) under the pollForCompletionWait state.

Fourth, the application currently doesn’t account for false positives in the sensitive data discovery job results, and it progresses or deletes all identified objects based on a single decision by the reviewer. You should consider expanding the application to handle false positives through automation rather than manual review or intervention (such as deleting the files from the manual review bucket or removing the sensitive data tags applied).

Last, the solution will stop the ingestion of a subset of objects into your pipeline. This behavior is similar to other validation and data quality checks that most customers perform as part of the data pipeline. However, you should test to ensure that this will not cause unexpected outcomes and address them in your downstream application logic accordingly.

Conclusion

In this post, I showed you how to integrate sensitive data discovery using Macie as an additional validation step in an automated data pipeline. You’ve reviewed the components of the application, deployed it using the AWS SAM CLI, tested to validate that the application functions as expected, and cleaned up by removing deployed resources.

You now know how to integrate sensitive data scanning into your ETL pipeline. You can use automation and, where required, manual review to help reduce the risk of sensitive data, such as personally identifiable information, being inadvertently ingested into a data lake. You can take this application and customize it to fit your use case and workflows, such as using custom data identifiers as part of your scans, adding additional validation steps, creating Macie suppression rules to archive findings automatically in defined cases, or requesting manual approval only for findings that meet certain criteria (such as high severity findings).

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Macie forum.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Brandon Wu

Brandon is a security solutions architect helping financial services organizations secure their critical workloads on AWS. In his spare time, he enjoys exploring outdoors and experimenting in the kitchen.

How to delete user data in an AWS data lake

Post Syndicated from George Komninos original https://aws.amazon.com/blogs/big-data/how-to-delete-user-data-in-an-aws-data-lake/

General Data Protection Regulation (GDPR) is an important aspect of today’s technology world, and processing data in compliance with GDPR is a necessity for those who implement solutions within the AWS public cloud. One article of GDPR is the “right to erasure” or “right to be forgotten” which may require you to implement a solution to delete specific users’ personal data.

In the context of the AWS big data and analytics ecosystem, every architecture, regardless of the problem it targets, uses Amazon Simple Storage Service (Amazon S3) as the core storage service. Despite its versatility and feature completeness, Amazon S3 doesn’t come with an out-of-the-box way to map a user identifier to the S3 keys of objects that contain that user’s data.

This post walks you through a framework that helps you purge individual user data within your organization’s AWS hosted data lake, and an analytics solution that uses different AWS storage layers, along with sample code targeting Amazon S3.

Reference architecture

To address the challenge of implementing a data purge framework, we reduced the problem to the straightforward use case of deleting a user’s data from a platform that uses AWS for its data pipeline. The following diagram illustrates this use case.

We’re introducing the idea of building and maintaining an index metastore that keeps track of the location of each user’s records and allows us to locate them efficiently, reducing the search space.

You can use the following architecture diagram to delete a specific user’s data within your organization’s AWS data lake.

For this initial version, we created three user flows that map each task to a fitting AWS service:

Flow 1: Real-time metastore update

The S3 ObjectCreated or ObjectRemoved events trigger an AWS Lambda function that parses the object and performs an add/update/delete operation to keep the metadata index up to date. You can implement a simple workflow for any other storage layer, such as Amazon Relational Database Service (RDS), Amazon Aurora, or Amazon Elasticsearch Service (ES). We use Amazon DynamoDB and Amazon RDS for PostgreSQL as the index metadata storage options, but our approach is flexible to any other technology.

Flow 2: Purge data

When a user asks for their data to be deleted, we trigger an AWS Step Functions state machine through Amazon CloudWatch to orchestrate the workflow. Its first step triggers a Lambda function that queries the metadata index to identify the storage layers that contain user records and generates a report that’s saved to an S3 report bucket. A Step Functions activity is created and picked up by a Node.js-based Lambda worker that sends an email to the approver through Amazon Simple Email Service (SES) with approve and reject links.

The following diagram shows a graphical representation of the Step Functions state machine as seen on the AWS Management Console.

The approver selects one of the two links, which then calls an Amazon API Gateway endpoint that invokes Step Functions to resume the workflow. If you choose the approve link, Step Functions triggers a Lambda function that takes the report stored in the bucket as input, deletes the objects or records from the storage layer, and updates the index metastore. When the purging job is complete, Amazon Simple Notification Service (SNS) sends a success or fail email to the user.

The following diagram represents the Step Functions flow on the console if the purge flow completed successfully.

For the complete code base, see step-function-definition.json in the GitHub repo.

Flow 3: Batch metastore update

This flow refers to the use case of an existing data lake for which an index metastore needs to be created. You can orchestrate the flow through AWS Step Functions, which takes historical data as input and updates the metastore through a batch job. Our current implementation doesn’t include a sample script for this user flow.

Our framework

We now walk you through the two use cases we followed for our implementation:

  • You have multiple user records stored in each Amazon S3 file
  • A user has records stored in homogenous AWS storage layers

Within these two approaches, we demonstrate alternatives that you can use to store your index metastore.

Indexing by S3 URI and row number

For this use case, we use a free tier RDS Postgres instance to store our index. We created a simple table with the following code:

CREATE UNLOGGED TABLE IF NOT EXISTS user_objects (
    userid TEXT,
    s3path TEXT,
    recordline INTEGER
);

You can add an index on userid to optimize query performance. On object upload, for each record, you need to insert into the user_objects table a row that indicates the user ID, the URI of the target Amazon S3 object, and the row number that corresponds to the record. For instance, consider uploading the following JSON input:

{"user_id":"V34qejxNsCbcgD8C0HVk-Q","body":"…"}
{"user_id":"ofKDkJKXSKZXu5xJNGiiBQ","body":"…"}
{"user_id":"UgMW8bLE0QMJDCkQ1Ax5Mg","body ":"…"}

When this object is uploaded to the Amazon S3 location s3://gdpr-demo/year=2018/month=2/day=26/input.json, we insert the following tuples into user_objects:

("V34qejxNsCbcgD8C0HVk-Q", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 0)
("ofKDkJKXSKZXu5xJNGiiBQ", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 1)
("UgMW8bLE0QMJDCkQ1Ax5Mg", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 2)

You can implement the index update operation by using a Lambda function triggered on any Amazon S3 ObjectCreated event.
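
The following Python sketch shows one way such a Lambda handler could look, assuming newline-delimited JSON records with a user_id field and a PostgreSQL connection whose host, database, and credentials are placeholders (psycopg2 is assumed to be packaged with the function or provided as a layer). It is a sketch of the approach, not the repository's implementation.

import json

import boto3
import psycopg2  # assumed to be bundled with the function or provided as a Lambda layer

s3 = boto3.client("s3")

def handler(event, context):
    # Placeholder connection details; store real credentials in AWS Secrets Manager.
    conn = psycopg2.connect(host="your-rds-endpoint", dbname="gdpr", user="indexer", password="change-me")
    with conn, conn.cursor() as cur:
        for record in event["Records"]:  # records from the S3 ObjectCreated event
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")
            for line_number, line in enumerate(body.splitlines()):
                user_id = json.loads(line)["user_id"]
                cur.execute(
                    "INSERT INTO user_objects (userid, s3path, recordline) VALUES (%s, %s, %s)",
                    (user_id, f"s3://{bucket}/{key}", line_number),
                )
    conn.close()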

When we get a delete request from a user, we need to query our index to get some information about where we have stored the data to delete. See the following code:

SELECT s3path,
       ARRAY_AGG(recordline)
FROM user_objects
WHERE userid = 'V34qejxNsCbcgD8C0HVk-Q'
GROUP BY s3path;

The preceding example SQL query returns rows like the following:

("s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json", {2102,529})

The output indicates that lines 529 and 2102 of S3 object s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json contain the requested user’s data and need to be purged. We then need to download the object, remove those rows, and overwrite the object. For a Python implementation of the Lambda function that implements this functionality, see deleteUserRecords.py in the GitHub repo.

Having the record line available allows you to perform the deletion efficiently in byte format. For implementation simplicity, we purge the rows by replacing the deleted rows with an empty JSON object. You pay a slight storage overhead, but you don’t need to update subsequent row metadata in your index, which would be costly. To eliminate empty JSON objects, we can implement an offline vacuum and index update process.
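
A simplified Python sketch of that purge step follows; it is not the repository's deleteUserRecords.py, and the bucket, key, and line numbers are taken from the example query output above.

import boto3

s3 = boto3.resource("s3")

def purge_lines(bucket: str, key: str, lines_to_purge: set) -> None:
    """Replace the given record lines with empty JSON objects and overwrite the S3 object."""
    obj = s3.Object(bucket, key)
    lines = obj.get()["Body"].read().decode("utf-8").splitlines()
    cleaned = ["{}" if i in lines_to_purge else line for i, line in enumerate(lines)]
    obj.put(Body="\n".join(cleaned).encode("utf-8"))

# Based on the example output above:
purge_lines("gdpr-review", "year=2015/month=12/day=21/review-part-0.json", {529, 2102})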

Indexing by file name and grouping by index key

For this use case, we created a DynamoDB table to store our index. We chose DynamoDB because of its ease of use and scalability; you can use its on-demand pricing model so you don’t need to guess how many capacity units you might need. When files are uploaded to the data lake, a Lambda function parses the file name (for example, 1001-.csv) to identify the user identifier and populates the DynamoDB metadata table. Userid is the partition key, and each different storage layer has its own attribute. For example, if user 1001 had data in Amazon S3 and Amazon RDS, their records look like the following code:

{"userid:": 1001, "s3":{"s3://path1", "s3://path2"}, "RDS":{"db1.table1.column1"}}

For a sample Python implementation of this functionality, see update-dynamo-metadata.py in the GitHub repo.
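
As a rough illustration of the metadata update, the following Python sketch adds an S3 location to a user's entry in DynamoDB. The table name, key, and attribute names are assumptions based on the example record above, not the repository code.

import boto3

# Placeholder table name.
table = boto3.resource("dynamodb").Table("gdpr-index-metastore")

def index_uploaded_object(user_id: int, s3_uri: str) -> None:
    """Add an S3 location to the string set of locations stored for a user."""
    table.update_item(
        Key={"userid": user_id},
        UpdateExpression="ADD s3 :paths",
        ExpressionAttributeValues={":paths": {s3_uri}},  # a Python set maps to a DynamoDB string set
    )

index_uploaded_object(1001, "s3://gdpr-demo/year=2018/month=2/day=26/example.csv")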

On delete request, we query the metastore table, which is DynamoDB, and generate a purge report that contains details on what storage layers contain user records, and storage layer specifics that can speed up locating the records. We store the purge report to Amazon S3. For a sample Lambda function that implements this logic, see generate-purge-report.py in the GitHub repo.

After the purging is approved, we use the report as input to delete the required resources. For a sample Lambda function implementation, see gdpr-purge-data.py in the GitHub repo.

Implementation and technology alternatives

We explored and evaluated multiple implementation options, all of which present tradeoffs, such as implementation simplicity, efficiency, critical data compliance, and feature completeness:

  • Scan every record of the data file to create an index – Whenever a file is uploaded, we iterate through its records and generate tuples (userid, s3Uri, row_number) that are then inserted to our metadata storing layer. On delete request, we fetch the metadata records for requested user IDs, download the corresponding S3 objects, perform the delete in place, and re-upload the updated objects, overwriting the existing object. This is the most flexible approach because it supports a single object to store multiple users’ data, which is a very common practice. The flexibility comes at a cost because it requires downloading and re-uploading the object, which introduces a network bottleneck in delete operations. User activity datasets such as customer product reviews are a good fit for this approach, because it’s unexpected to have multiple records for the same user within each partition (such as a date partition), and it’s preferable to combine multiple users’ activity in a single file. It’s similar to what was described in the section “Indexing by S3 URI and row number” and sample code is available in the GitHub repo.
  • Store metadata as file name prefix – Adding the user ID as the prefix of the uploaded object under the different partitions that are defined based on query pattern enables you to reduce the required search operations on delete request. The metadata handling utility finds the user ID from the file name and maintains the index accordingly. This approach is efficient in locating the resources to purge but assumes a single user per object, and requires you to store user IDs within the filename, which might require InfoSec considerations. Clickstream data, where you would expect to have multiple click events for a single customer on a single date partition during a session, is a good fit. We covered this approach in the section “Indexing by file name and grouping by index key” and you can download the codebase from the GitHub repo.
  • Use a metadata file – Along with uploading a new object, we also upload a metadata file that’s picked up by an indexing utility to create and maintain the index up to date. On delete request, we query the index, which points us to the records to purge. A good fit for this approach is a use case that already involves uploading a metadata file whenever a new object is uploaded, such as uploading multimedia data, along with their metadata. Otherwise, uploading a metadata file on every object upload might introduce too much of an overhead.
  • Use the tagging feature of AWS services – Whenever a new file is uploaded to Amazon S3, we use the Put Object Tagging Amazon S3 operation to add a key-value pair for the user identifier. Whenever there is a user data delete request, it fetches objects with that tag and deletes them. This option is straightforward to implement using the existing Amazon S3 API and can therefore be a good first version of your implementation. However, it involves significant limitations. It assumes a 1:1 cardinality between Amazon S3 objects and users (each object only contains data for a single user), searching objects based on a tag is limited and inefficient, and storing user identifiers as tags might not be compliant with your organization’s InfoSec policy. A brief sketch of this tagging approach follows this list.
  • Use Apache Hudi – Apache Hudi is becoming a very popular option to perform record-level data deletion on Amazon S3. Its current version is restricted to Amazon EMR, and you can use it if you start to build your data lake from scratch, because you need to store your data as Hudi datasets. Hudi is a very active project, and additional features and integrations with more AWS services are expected.
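
To illustrate the tagging option and its limitations, the following Python sketch tags each uploaded object with a user identifier and then finds a user's objects by listing the bucket and reading tags back, which is exactly the inefficient search pattern noted above. Bucket names and tag keys are placeholders.

import boto3

s3 = boto3.client("s3")

def tag_object_with_user(bucket: str, key: str, user_id: str) -> None:
    # Note: put_object_tagging replaces any existing tags on the object.
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [{"Key": "userid", "Value": user_id}]},
    )

def objects_for_user(bucket: str, user_id: str):
    """Inefficient by design: list every object and inspect its tags."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket):
        for obj in page.get("Contents", []):
            tags = s3.get_object_tagging(Bucket=bucket, Key=obj["Key"])["TagSet"]
            if any(t["Key"] == "userid" and t["Value"] == user_id for t in tags):
                yield obj["Key"]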

The key implementation decision of our approach is separating the storage layer we use for our data and the one we use for our metadata. As a result, our design is versatile and can be plugged in any existing data pipeline. Similar to deciding what storage layer to use for your data, there are many factors to consider when deciding how to store your index:

  • Concurrency of requests – If you don’t expect too many simultaneous inserts, even something as simple as Amazon S3 could be a starting point for your index. However, if you get multiple concurrent writes for multiple users, you need to look into a service that copes better with transactions.
  • Existing team knowledge and infrastructure – In this post, we demonstrated using DynamoDB and RDS Postgres for storing and querying the metadata index. If your team has no experience with either of those but is comfortable with Amazon ES, Amazon DocumentDB (with MongoDB compatibility), or any other storage layer, use those instead. Furthermore, if you’re already running (and paying for) a MySQL database that’s not used to capacity, you could use that for your index at no additional cost.
  • Size of index – The volume of your metadata is orders of magnitude lower than your actual data. However, if your dataset grows significantly, you might need to consider going for a scalable, distributed storage solution rather than, for instance, a relational database management system.

Conclusion

GDPR has transformed best practices and introduced several extra technical challenges in designing and implementing a data lake. The reference architecture and scripts in this post may help you delete data in a manner that’s compliant with GDPR.

Let us know your feedback in the comments and how you implemented this solution in your organization, so that others can learn from it.

 


About the Authors

George Komninos is a Data Lab Solutions Architect at AWS. He helps customers convert their ideas to a production-ready data product. Before AWS, he spent 3 years at Alexa Information domain as a data engineer. Outside of work, George is a football fan and supports the greatest team in the world, Olympiacos Piraeus.

 

 

 

 

Sakti Mishra is a Data Lab Solutions Architect at AWS. He helps customers architect data analytics solutions, which gives them an accelerated path towards modernization initiatives. Outside of work, Sakti enjoys learning new technologies, watching movies, and travel.

Discover sensitive data by using custom data identifiers with Amazon Macie

Post Syndicated from Kayla Jing original https://aws.amazon.com/blogs/security/discover-sensitive-data-by-using-custom-data-identifiers-with-amazon-macie/

As you put more and more data in the cloud, you need to rely on security automation to keep it secure at scale. AWS recently launched Amazon Macie, a fully managed service that uses machine learning and pattern matching to help you detect, classify, and better protect your sensitive data stored in the AWS Cloud.

Many data breaches are not the result of malicious activity from unauthorized users, but rather from mistakes made by authorized users. To monitor and manage the security of sensitive data, you must first be able to identify it. In this post, we show you how to use custom data identifiers with Macie to identify sensitive data. Once you know what’s sensitive, you can start designing security controls that operate at scale to monitor and remediate risk automatically.

Macie comes with a set of managed data identifiers that you can use to discover many types of sensitive data. These are somewhat generic and broadly applicable to many organizations. What makes Macie unique is its ability to help you address specific data needs. Macie enables you to expand your sensitive data detection through the new custom data identifiers. Custom data identifiers can be used to highlight organizational proprietary data, intellectual property, and specific scenarios.

Custom Data Identifiers in Macie help you find and identify sensitive data based on your own organization’s specific needs. In this post, we show you a step-by-step walkthrough of how to define and run custom data identifiers to automatically discover specific, sensitive data. Before you begin using Custom Data Identifiers, you need to enable Macie and configure detailed logging; follow the instructions to enable Macie and to configure detailed logging if you haven’t done that already.

When to use the Custom Data Identifier resource

To begin, imagine you’re an IT administrator for a manufacturing company that’s headquartered in France. Your company has acquired a few additional local subsidiaries, including an R&D facility in São Paulo, Brazil. The company is migrating to AWS, and in the process is classifying registration information, employee information, and product data into encrypted and non-encrypted storage.

You want to identify sensitive data for the following three scenarios:

  • SIRET-NIC: SIRET-NIC is a unique number assigned to businesses in France. This number is issued by their National Institute of Statistics (INSEE) when a business is registered. A sample file that contains SIRET-NIC information is shown in the following figure. Each record in the file includes the GUID, employee name, employee email, the company name, the date it was issued, and the SIRET-NIC number.

    Figure 1: SIRET-NIC dataset

    Figure 1: SIRET-NIC dataset

  • Brazil CPF (Cadastro de Pessoas Físicas – Natural Persons Register): CPF is a unique number assigned by the Brazilian revenue agency to people subject to taxes in the country. Each of your employees residing in the Brazilian office has a CPF.
  • Prototyping naming convention: Your company has products that are publicly available, but also products that are still in the prototyping stage and should be kept confidential. A sample file that contains Brazil CPF numbers and the prototype names is shown in the following figure.

    Figure 2: Brazil CPF and prototype number dataset

    Figure 2: Brazil CPF and prototype number dataset

Configure the Custom Data Identifier resource in the Macie console

To use custom data identifiers to identify your organization’s sensitive information, you must:

  1. Create custom data identifiers.
  2. Create a job to scan your Amazon Simple Storage Service (Amazon S3) bucket to locate the data patterns that match your custom data identifiers.
  3. Respond to the returned results.

The following steps introduce you to the Custom Data Identifier resource in Macie.

Designing Custom Data Identifiers for use with Amazon Macie

In the previous section, you identified three scenarios that your company would like to protect: SIRET-NIC numbers, Brazil CPF numbers, and your prototyping naming convention. You now need to create a specific regex pattern for each of these scenarios. There are different syntaxes and dialects of regular expression languages. Amazon Macie supports a subset of the Perl Compatible Regular Expressions (PCRE) library; you can learn more about it in the Regex support in custom data identifiers section. Once the patterns are ready, follow the instructions below to create the custom data identifiers.

Creating Custom Data Identifiers in Amazon Macie

  1. Sign in to the AWS Management Console.
  2. Enter Amazon Macie in the AWS services search box.
  3. Choose Amazon Macie.
  4. In the navigation pane on the left-hand side, under Settings, choose Custom data identifiers as shown in the following figure.

    Figure 3: Custom data identifiers console

    Figure 3: Custom data identifiers console

Create a custom data identifier

  1. Choose Create on the custom data identifier console.
  2. Name: Enter a name for your custom data identifier. Make it descriptive so you know what it does. For example, enter SIRET-NIC for the SIRET-NIC number you use.
  3. Description: Enter a description of the custom data identifier.
  4. Regular expression (regex): Define the pattern you want to identify. Use a Regular Expression (“regex”) to create the desired pattern. For example, a SIRET-NIC number contains 14 digits—9 numbers followed by a hyphen and then 5 more numbers. The first part, 9 numbers, can stay together or separated by spaces into 3 groups of 3. The specific regex pattern for this is \b(\d{3}\s?){2}\d{3}\-\d{5}\b
  5. Keywords: Define expressions that identify the text to match. The SIRET-NIC number itself is publicly accessible information. But in your case, you want to encrypt the information about the company that was registered during the month the acquisition happened (April 2020), thus the information will not leak to your competitors. So, the keywords here will be all the days in April.
  6. (Optional) Ignore words: Use this box to enter text that you want to be ignored. In this example scenario, you know your security training materials always use example SIRET-NICs of 12345789-12345 and 000000000-00000. You can enter these values here so that your security training materials are not flagged as sensitive data containing SIRET-NICs.
  7. Maximum match distance: Use this box to define the proximity between the result and the keywords. If you enter 20, Macie will provide results that include the specified keyword and 20 characters on either side of it.

Note: Do not select Submit yet. After entering the settings and before selecting Submit, you should test your custom data identifier with sample data to confirm that it works.

With all the attributes set, your console will look like what is shown in Figure 4.

Figure 4: SIRET-NIC custom data identifier creation

Figure 4: SIRET-NIC custom data identifier creation

Test your SIRET-NIC custom data identifier

Use the Evaluate section on the right-hand panel of the Macie console to confirm that the regex pattern and other configurations for your custom data identifier are correct.

Follow the steps below to use the Evaluate section.

  1. Enter test data in the sample data box.
  2. Select Submit. There will be one match per record in the file if the configurations are correct and your custom data identifier is ready. The following figure is an example of the Evaluate section using test data. The test data has 3 records; each record has 6 fields: GUID, employee name, employee email, company name, date the SIRET-NIC was issued, and the SIRET-NIC number.

    Figure 5: Evaluate, showing sample data

    Figure 5: Evaluate, showing sample data

  3. After verifying that your SIRET-NIC custom data identifier works in the Evaluate section, select Submit on the New custom data identifier window to create the custom data identifier.

Create a Brazil CPF Custom Data Identifier

Congrats on creating your first custom data identifier! Now use the same steps to create and test custom data identifiers for the Brazil CPF and prototyping naming convention scenarios. The Brazil CPF number usually shows up in the format of 000.000.000-00.

Use the following values for the Brazil CPF scenario, as shown in the following figure:

  • Name: Brazil CPF
  • Description: The format for Brazil CPF in our sample data is 000.000.000-00
  • Regular expression: \b(\d{3}\.){2}\d{3}\-\d{2}\b

    Figure 6: Brazil CPF custom data identifier

    Figure 6: Brazil CPF custom data identifier

Create a Prototype Name Custom Data Identifier

Assume that your company has a very strict and regular naming scheme for prototype part numbers: P, followed by a hyphen, and then 2 capital letters and 4 digits (for example, P-AB1234). You want to identify objects in S3 that contain references to private prototype parts. This is a small pattern, so if we’re not careful it will cause Macie to flag objects that do not actually contain one of our prototype numbers. We suggest adding \b at the beginning and the end of the regular expression. The \b symbol means a word boundary; word boundaries are basically whitespace, punctuation, or other things that are not letters and numbers. With \b, you limit the pattern so that you only match if the entire word matches the pattern. For example, P-AB1234 will match the pattern, but STEP-AB123456 and P-XY123 will not. This gives you finer-grained control and reduces false positives.
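
If you want to confirm the word-boundary behavior before entering the pattern in the Macie console, a quick local check with Python's re module (which behaves like PCRE for this simple pattern) looks like the following.

import re

pattern = re.compile(r"\bP\-[A-Z]{2}\d{4}\b")

for candidate in ["P-AB1234", "STEP-AB123456", "P-XY123"]:
    print(candidate, bool(pattern.search(candidate)))
# Only P-AB1234 matches; the other two fail the word-boundary or digit-count checks.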

Use the following values for the prototyping name scenario, as shown in the following figure:

  • Name: Prototyping Naming
  • Description: Any prototype name that starts with P is private. The format for a private prototype name is P, a hyphen, 2 capital letters, and 4 numbers.
  • Regular expression: \bP\-[A-Z]{2}\d{4}\b

    Figure 7: Prototyping naming custom data identifier

Figure 7: Prototyping naming custom data identifier

You should now see a page like the following figure, indicating that the SIRET-NIC, Brazil CPF, and Prototyping Naming custom data identifiers are successfully configured.

Figure 8: Successfully configured custom data identifier

Figure 8: Successfully configured custom data identifier

Set up a Test Bucket to Demonstrate Macie

Before we can see Macie do its work, we have to create a bucket with some test data that we can scan. We’ve provided some sample data files that you can download. Follow these instructions to create a test bucket and load our test data into the test bucket.

  1. Download the sample data and unzip it.
  2. Sign in to the AWS Management Console and open the Amazon S3 console at https://console.aws.amazon.com/s3/.
  3. Choose Create bucket. The Create bucket wizard opens.
  4. In Bucket name, enter a DNS-compliant name for your bucket. The bucket name must:
    • Be unique across all of Amazon S3.
    • Be between 3 and 63 characters long.
    • Not contain uppercase characters.
    • Start with a lowercase letter or number.

    We created a bucket called bucketformacieuse; you have to choose another name because this one is already taken by us.

  5. In Region, choose the AWS Region where you want the bucket to reside.
  6. Select Create, to finish the bucket creation.
  7. Open the bucket you just created and upload the two Excel files you downloaded in step 1.

Use Macie to create a job to scan your data

Now you can create a job to scan your Amazon S3 bucket to detect and locate the data patterns defined in the SIRET-NIC, Brazil CPF, and Prototyping Naming custom data identifiers.

To create a job

  1. In the navigation pane, choose Jobs, and then select Create Job on the upper right.
  2. Select Amazon S3 buckets: Select the S3 bucket you want to analyze. In this case, we are using the bucket previously created, bucketformacieuse.
  3. Review Amazon S3 buckets: Verify that you selected the S3 bucket you want the job to scan and analyze.
  4. Scope: Select your scope. The scope specifies how often you want the job to run: either a one-time job or a scheduled job. For this example, choose One-time job. If you choose a scheduled job, you can define how often you want your job to scan your Amazon S3 bucket.
  5. Custom data identifiers: Select the 3 custom data identifiers you created to be associated with this job, and then select Next. This is shown in the following figure.

    Figure 9: Select your custom data identifiers

    Figure 9: Select your custom data identifiers

  6. Name and description: Enter a name and description for the job.
  7. Review and create: Review and verify all your settings, and then select Create.

You now have a job in Macie to scan the Amazon S3 buckets you’ve chosen using the 3 custom data identifiers you created. More information about creating jobs is available in Running sensitive data discovery jobs in Amazon Macie.

Respond to results

Macie helps you stay secure when you respond effectively to the findings that it produces. For our example, we’ll show you how to review your findings manually. You can look at your findings by bucket, type, or job, or see a collective summary of all findings. In this example, let’s look at all findings.

To review your results

  1. In the navigation pane on the left-hand side, choose Findings. Findings include the severity, the type, the resources affected, and when the findings were last updated.
  2. The following figure shows an example of the results you might see on the findings page. There are two findings for the selected job. The compagnie_français.csv and the empresa_brasileira.csv files contain the custom data identifiers that you created earlier and added to the job.

    Figure 10: Findings

    Figure 10: Findings

  3. Let’s look at the details of one of the findings so you can review the results. From the findings page, select the file that contains your custom data identifier for the Brazil CPF: empresa_brasileira.csv. The number of custom data identifiers found in the document is shown in the Result section on the right, as shown in the following figure.

    Figure 11: Findings detail page for the Brazil CPF custom data identifiers

    Figure 11: Findings detail page for the Brazil CPF custom data identifiers

  4. Now look at the findings details for the compagnie_français.csv file. It shows the number of custom data identifiers found in the file. In this case Macie found 13 SIRET-NIC numbers as shown in the following figure.

    Figure 12: Findings page for the French company file

    Figure 12: Findings page for the French company file

  5. If you configured detailed logging, the results will be saved in the Amazon S3 bucket you specified. The S3 bucket location can be found in the Details section after Detailed result location as shown in the preceding figure.

Now that you’ve used Macie and the Custom Data Identifiers resource to obtain these findings, you can identify what data to place in encrypted storage and what can be placed in non-encrypted storage when migrating to AWS. Macie and custom data identifiers give you an automated way to enhance the protection of your sensitive data by providing the information you need to detect and classify your data in the AWS Cloud.

Using Macie at Scale

Custom Data Identifiers help you tell Macie what to look for. As you move more and more data to the cloud, you’ll need to create new identifiers and new rules. As your rules and identifiers grow, you will need to create automation that responds to what is found. For example, perhaps a Lambda function turns on encryption in a bucket when sensitive data is found in that bucket, or a function automatically applies tags to buckets where sensitive data is found so that those buckets and their owners start to appear on reports for audit and compliance. Once you’ve done this at small scale, think about how you will automate responses at larger scale. A sketch of one such automation follows.
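
The following hedged Python sketch shows one shape this automation could take: an EventBridge rule for Macie findings invokes a Lambda function that tags the affected bucket for audit reporting. The event field names reflect the Macie finding format but should be verified against your own findings, and put_bucket_tagging replaces existing bucket tags, so a production version would merge them first.

import boto3

s3 = boto3.client("s3")

def handler(event, context):
    # Bucket name taken from the Macie finding carried in the EventBridge event.
    bucket = event["detail"]["resourcesAffected"]["s3Bucket"]["name"]
    s3.put_bucket_tagging(
        Bucket=bucket,
        Tagging={"TagSet": [{"Key": "sensitive-data-found", "Value": "true"}]},
    )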

Conclusion

The new Custom Data Identifier resource in the newly enhanced Macie can help you detect, classify, and protect sensitive data types unique to your organization. This post focused on the functionality and use of custom data identifiers to automatically discover sensitive data stored in Amazon S3. You can also review the managed data identifiers to see a list of personally identifiable information (PII) that Macie can detect by default. Visit What is Amazon Macie? to learn more.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon Macie forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Kayla Jing

Kayla is a Solutions Architect at Amazon Web Services based out of Seattle. She has experience in data science with a focus on Data Analytics and Machine Learning.

Author

Joshua Choung

Joshua is a Solutions Architect based out of Seattle. He works with customers to provide architectural and technical guidance and training on their AWS cloud journey.

Author

Laura Reith

Laura is a Solutions Architect at Amazon Web Services. Before AWS, she worked as a Solutions Architect in Taiwan focusing on physical security and retail analytics.

Anonymize and manage data in your data lake with Amazon Athena and AWS Lake Formation

Post Syndicated from Manos Samatas original https://aws.amazon.com/blogs/big-data/anonymize-and-manage-data-in-your-data-lake-with-amazon-athena-and-aws-lake-formation/

Organizations collect and analyze more data than ever before. They move as fast as they can on their journey to become more data driven by using the insights from their data.

Different roles use data for different purposes. For example, data engineers transform the data before further processing, data analysts access the data and produce reports, and data scientists with domain and technical expertise can train machine learning algorithms. Those roles require access to the data, and access has never been easier to grant.

At the same time, most organizations have to comply with regulations when dealing with their customer data. For that reason, datasets that contain personally identifiable information (PII) are often anonymized. Common examples of PII are tables and columns that contain personal information about an individual (such as first name and last name), or tables with columns that, if joined with another table, can trace back to an individual.

You can use AWS Analytics services to anonymize your datasets. In this post, I describe how to use Amazon Athena to anonymize a dataset.  You can then use AWS Lake Formation to provide the right access to the right personas.

Use case

To better understand the concept, we use a straightforward use case: analysts in your organization need access to a dataset with sales data, some of which contains PII information. As the data lake admin, you’re not comfortable with all personnel having access to customers’ PII. To address this, you can use an anonymized dataset.

This use case has two users:

  • datalake_admin – Responsible for data anonymization and for making sure the right permissions are enforced. They classify the data, generate anonymized datasets, and configure the required permissions.
  • datalake_analyst – Only has access to the anonymized dataset. They can extract patterns for users without tracing the request back to an individual customer.

The AWS CloudFormation template provided with this post generates the AWS Glue tables that you use later in this post.

However, the template doesn’t create the datalake_admin and datalake_analyst users. For more information about personas in Lake Formation, see Lake Formation Personas and IAM Permissions Reference.

Solution architecture

For this solution, you use the following services:

  • Lake Formation – Lake Formation makes it easy to set up a secure data lake—a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. The data lake admin can easily label the data and give users permission to access authorized datasets.
  • Athena – Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries you run. For this use case, the data lake admin uses Athena to anonymize the data, after which the data analyst can use Athena for interactive analytics over anonymized datasets.
  • Amazon S3 – Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. For this use case, you use Amazon S3 as storage for the data lake.

The following diagram illustrates the architecture for this solution.

In this architecture, there are no servers to manage. You only pay what you use. You can use the same solution for small or large datasets. The scaling happens behind the scenes but in a transparent way.

In the following sections, you look in more detail on how to do the following:

  • Label sensitive data with AWS Lake Formation
  • Anonymize data with Athena
  • Apply permissions with Lake Formation
  • Analyze the anonymized datasets

Labeling the sensitive data with Lake Formation

As a data lake admin, the first task is to label the personal information. Tags don’t enforce any security controls, but applying a good tagging strategy is a great way to describe the data. Tags are key-value pairs that you can apply for your AWS resources, including table and columns in your data lake. For this use case, you apply a very simple tagging strategy: for the columns that contain PII, you give the value PII.

You interact with the following tables from the tcp-ds dataset, which both have their data stored in Amazon S3 in CSV format:

  • store_sales – Stores sales data and references other tables that you can join together for more sophisticated business queries. The table has a foreign key with the customer table on the ss_customer_sk column. This key, when joined with the customer table, can uniquely identify a user. For that reason, treat this column as personal information.
  • customer – Stores customer data, a lot of which is PII. In addition to c_customer_sk, you could use data such as the customer ID (c_customer_id), first name (c_first_name), last name (c_last_name), login (c_login), and email address (c_email_address) to uniquely identify a customer.

To start tagging your columns (starting with the store_sales table), complete the following steps:

  1. As the data lake admin user, log in to the Lake Formation console.
  2. Choose Data Catalog Tables.
  3. Select store_sales.
  4. Choose Edit schema.
  5. Select the column you want to edit (ss_customer_sk).
  6. Choose Edit.
  7. For Key, enter Classification.
  8. For Value, enter PII.
  9. Choose Save.

To verify that the added column properties were applied, view the table description.

  1. On the Data Catalog Tables page, select store_sales.
  2. Choose View properties.

The table properties look like the following JSON object:

{
  "Name": "store_sales",
  "DatabaseName": "tcp-ds-1tb",
  "Owner": "owner",
  "CreateTime": "2019-09-13T10:15:04.000Z",
  "UpdateTime": "2020-03-18T16:10:34.000Z",
  "LastAccessTime": "2019-09-13T10:15:03.000Z",
  "Retention": 0,
  "StorageDescriptor": {
    "Columns": [
      {
        "Name": "ss_sold_date_sk",
        "Type": "bigint",
        "Parameters": {}
      },
      ...
      {
        "Name": "ss_customer_sk",
        "Type": "bigint",
        "Parameters": {
          "Classification": "PII"
        }
      },
      ...
}

The additional column properties are now in the table metadata.
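
Because the table properties live in the AWS Glue Data Catalog, you can also read them programmatically. The following Python sketch uses the Glue get_table API to list the columns tagged as PII; the database and table names match the example above.

import boto3

glue = boto3.client("glue")
table = glue.get_table(DatabaseName="tcp-ds-1tb", Name="store_sales")["Table"]

for column in table["StorageDescriptor"]["Columns"]:
    if column.get("Parameters", {}).get("Classification") == "PII":
        print(column["Name"])  # prints ss_customer_sk for this example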

  1. Repeat the preceding steps for the customer table and label the following columns:
    • c_customer_sk
    • c_customer_id
    • c_first_name
    • c_last_name
    • c_login
    • c_email_address

Adding a tag also allows you to perform metadata searches by tag attributes. For more information, see Discovering metadata with AWS Lake Formation: Part 1 and Discover metadata with AWS Lake Formation: Part 2.

Anonymizing data with Athena

The data lake admin now needs to provide the data analyst anonymized datasets for analytics. For this use case, you want to extract patterns on the customer table and the store_sales table separately, but you also want to join the two tables so you can perform more sophisticated queries.

The first step is to create a database in Lake Formation to organize tables in AWS Glue.

  1. On the Lake Formation console, under Data Catalog, choose Databases.
  2. Choose Create database.
  3. For Name, enter a name, such as anonymised_tcp_ds_1tb.
  4. Optionally, enter an Amazon S3 path for the database and a description.
  5. Choose Create database.

The next step is to create the tables that contain the anonymized data. Before you do so, consider the significance of each anonymized column from an analytics point of view. For columns that have little or no value in the analytics process, omitting the column altogether might be the right approach. You might use other columns as primary keys to join with other tables. To make sure that you can join the tables, you can apply a hash function to the table foreign keys.

A common approach to anonymize sensitive information is hashing. A hash function is any function that you can use to map data of arbitrary size to fixed-size values. For more information, see Hash function.

The following table summarizes your strategy for each column.

Table        Column            Strategy
customer     c_first_name      hash
customer     c_last_name       hash
customer     c_login           omit
customer     c_customer_id     hash
customer     c_email_address   omit
customer     c_customer_sk     hash
store_sales  ss_customer_sk    hash

A hash function always returns the same output for the same input, so hashed keys from different tables still match when you join them. In addition, unlike encryption, hashing can’t be reversed.
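The following quick Python illustration shows that determinism; it mirrors what Athena’s sha256 function does in the queries that follow, and the value 498372 is just a made-up surrogate key.

import hashlib

def anonymize(value):
    # The same input always produces the same 64-character digest,
    # so hashed foreign keys still match across tables.
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()

print(anonymize(498372))
print(anonymize(498372) == anonymize(498372))  # True: deterministic
# There is no practical way to recover 498372 from the digest.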

  1. Use Athena string functions to hash individual columns and generate anonymized datasets.
  2. After you create those datasets, you can use Lake Formation to apply security controls. See the following code:
CREATE table "tcp-ds-anonymized".customer
WITH (format='parquet',external_location = 's3://tcp-ds-eu-west-1-1tb-anonymised/2/customer_parquet/')
AS SELECT       
         sha256(to_utf8(cast(c_customer_sk AS varchar))) AS c_customer_sk_anonym,
         sha256(to_utf8(cast(c_customer_id AS varchar))) AS c_customer_id_anonym,
         sha256(to_utf8(cast(c_first_name AS varchar))) AS c_first_name_anonym,
         sha256(to_utf8(cast(c_last_name AS varchar))) AS c_last_name_anonym,
         c_current_cdemo_sk,
         c_current_hdemo_sk,
         c_first_shipto_date_sk,
         c_first_sales_date_sk,
         c_salutation,
         c_preferred_cust_flag,
         c_current_addr_sk,
         c_birth_day,
         c_birth_month,
         c_birth_year,
         c_birth_country,
         c_last_review_date_sk
FROM customer
  1. To preview the data, enter the following code:
SELECT c_first_name_anonym, c_last_name_anonym FROM "tcp-ds-anonymized"."customer" limit 10;

The following screenshot shows the output of your query.

  1. To repeat these steps for the store_sales table, enter the following code:
CREATE table "tcp-ds-anonymized".store_sales
WITH (format='parquet',external_location = 's3://tcp-ds-eu-west-1-1tb-anonymised/1/store_sales/')
AS SELECT sha256(to_utf8(cast(ss_customer_sk AS varchar))) AS ss_customer_sk_anonym,
         ss_sold_date_sk,
         ss_sales_price,
         ss_sold_time_sk,
         ss_item_sk,
         ss_hdemo_sk,
         ss_addr_sk,
         ss_store_sk,
         ss_promo_sk,
         ss_ticket_number,
         ss_quantity,
         ss_wholesale_cost,
         ss_list_price,
         ss_ext_discount_amt,
         ss_ext_sales_price,
         ss_ext_wholesale_cost,
         ss_ext_list_price,
         ss_ext_tax,
         ss_coupon_amt,
         ss_net_paid,
         ss_net_paid_inc_tax,
         ss_net_profit
FROM store_sales;

One of the challenges you need to overcome when working with CTAS queries is that the query’s Amazon S3 location must be unique for the table you’re creating. You can add an incremental value or timestamp to the path of the table, for example, s3://<bucket>/<table_name>/<version>, and make sure you use a different version number every time.

You can delete older data programmatically using Amazon S3 APIs or SDK. You can also use Amazon S3 lifecycle configuration to tell Amazon S3 to transition objects to another Amazon S3 storage class. For more information, see Object lifecycle management.
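For example, the following minimal sketch adds an expiration rule with boto3; the bucket name comes from the CTAS statements above, while the prefix and 30-day window are illustrative choices rather than part of the original setup.

import boto3

s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="tcp-ds-eu-west-1-1tb-anonymised",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-old-ctas-output",
                # Apply the rule only to older CTAS output under this prefix.
                "Filter": {"Prefix": "1/"},
                "Status": "Enabled",
                # Delete objects 30 days after creation (illustrative value).
                "Expiration": {"Days": 30},
            }
        ]
    },
)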

You can automate the anonymization of the CTAS query with AWS Glue jobs. AWS Glue provides a lightweight Python shell job option that can call the Amazon Athena API programmatically.
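A Python shell job along these lines could submit the CTAS statement through the Athena API and poll for completion. The results location, the shortened column list, and the versioned external_location path are assumptions you would replace with your own values.

import time
import boto3

athena = boto3.client("athena")

# A shortened version of the CTAS statement from the previous section;
# note the new version number (3) in the external_location path.
ctas_query = """
CREATE TABLE "tcp-ds-anonymized".customer
WITH (format = 'parquet',
      external_location = 's3://tcp-ds-eu-west-1-1tb-anonymised/3/customer_parquet/')
AS SELECT sha256(to_utf8(cast(c_customer_sk AS varchar))) AS c_customer_sk_anonym,
          c_current_addr_sk,
          c_birth_year,
          c_birth_country
FROM customer
"""

execution = athena.start_query_execution(
    QueryString=ctas_query,
    QueryExecutionContext={"Database": "tcp-ds-1tb"},
    ResultConfiguration={"OutputLocation": "s3://my-athena-query-results-bucket/"},
)

# Poll until the query succeeds or fails.
while True:
    state = athena.get_query_execution(
        QueryExecutionId=execution["QueryExecutionId"]
    )["QueryExecution"]["Status"]["State"]
    if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
        break
    time.sleep(5)

print(f"CTAS query finished with state {state}")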

Applying permissions with Lake Formation

Now that you have the table structures and anonymized datasets, you can apply the required permissions using Lake Formation.

  1. On the Lake Formation console, under Data Catalog, choose Tables.
  2. Select the tables that contain the anonymized data.
  3. From the Actions drop-down menu, under Permissions, choose Grant.
  4. For IAM users and roles, choose the IAM user for the data analyst.
  5. For Table permissions, select Select.
  6. Choose Grant.

You can now view all table permissions and verify the permissions granted to a particular principal.
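The same grant can be scripted with the Lake Formation API. The following is a minimal sketch; the account ID and analyst principal are placeholders, and the database name matches the one used in the CTAS statements above.

import boto3

lakeformation = boto3.client("lakeformation")

# Placeholder principal; use the data analyst's IAM user or role ARN.
analyst_arn = "arn:aws:iam::111122223333:user/data-analyst"

for table_name in ("customer", "store_sales"):
    lakeformation.grant_permissions(
        Principal={"DataLakePrincipalIdentifier": analyst_arn},
        Resource={
            "Table": {
                "DatabaseName": "tcp-ds-anonymized",
                "Name": table_name,
            }
        },
        Permissions=["SELECT"],
    )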

Analyzing the anonymized datasets

To verify that the role can access the right tables and query the anonymized datasets, complete the following steps:

  1. Sign in to the AWS Management Console as the data analyst.
  2. Under Analytics, choose Amazon Athena.

You should see a query field, similar to the following screenshot.

You can now test your access with queries. To see the top customers by revenue, grouped by anonymized last name, enter the following code:

SELECT c_last_name_anonym,
sum(ss_sales_price) AS total_sales
FROM store_sales
JOIN customer
ON store_sales.ss_customer_sk_anonym = customer.c_customer_sk_anonym
GROUP BY c_last_name_anonym
ORDER BY total_sales DESC limit 10;

The following screenshot shows the query output.

You can also try to query a table that you don’t have access to. You should receive an error message.

Conclusion

Anonymizing a dataset is often a prerequisite before users can start analyzing it. In this post, we discussed how data lake admins can use Athena and Lake Formation to label and anonymize data stored in Amazon S3. You can then use Lake Formation to apply permissions to the dataset and allow other users to access the data.

The services we discussed in this post are serverless. Building serverless applications means that your developers can focus on their core product instead of worrying about managing and operating servers or runtimes, either in the cloud or on-premises. This reduced overhead lets developers reclaim time and energy that they can spend on developing great products that scale and that are reliable.

About the Author

Manos Samatas is a Specialist Solutions Architect in Big Data and Analytics with Amazon Web Services. Manos lives and works in London. He specializes in architecting big data and analytics solutions for public sector customers in the EMEA region.

How to retroactively encrypt existing objects in Amazon S3 using S3 Inventory, Amazon Athena, and S3 Batch Operations

Post Syndicated from Adam Kozdrowicz original https://aws.amazon.com/blogs/security/how-to-retroactively-encrypt-existing-objects-in-amazon-s3-using-s3-inventory-amazon-athena-and-s3-batch-operations/

Amazon Simple Storage Service (S3) is an object storage service that offers industry-leading scalability, performance, security, and data availability. With Amazon S3, you can choose from three different server-side encryption configurations when uploading objects:

  • SSE-S3 – uses Amazon S3-managed encryption keys
  • SSE-KMS – uses customer master keys (CMKs) stored in AWS Key Management Service (KMS)
  • SSE-C – uses master keys provided by the customer in each PUT or GET request

These options allow you to choose the right encryption method for the job. But as your organization evolves and new requirements arise, you might find that you need to change the encryption configuration for all objects. For example, you might be required to use SSE-KMS instead of SSE-S3 because you need more control over the lifecycle and permissions of the encryption keys in order to meet compliance goals.

You could change the settings on your buckets to use SSE-KMS rather than SSE-S3, but the switch only impacts newly uploaded objects, not objects that existed in the buckets before the change in encryption settings. Manually re-encrypting older objects under master keys in KMS may be time-prohibitive depending on how many objects there are. Automating this effort is possible using the right combination of features in AWS services.

In this post, I’ll show you how to use Amazon S3 Inventory, Amazon Athena, and Amazon S3 Batch Operations to provide insights on the encryption status of objects in S3 and to remediate incorrectly encrypted objects in a massively scalable, resilient, and cost-effective way. The solution uses a similar approach to the one mentioned in this blog post, but it has been designed with automation and multi-bucket scalability in mind. Tags are used to target individual noncompliant buckets in an account, and any encrypted (or unencrypted) object can be re-encrypted using SSE-S3 or SSE-KMS. Versioned buckets are also supported, and the solution operates on a regional level.

Note: You can’t re-encrypt to or from objects encrypted under SSE-C. This is because the master key material must be provided during the PUT or GET request, and cannot be provided as a parameter for S3 Batch Operations.

Moreover, the entire solution can be deployed in under 5 minutes using AWS CloudFormation. Simply tag your buckets targeted for encryption, upload the solution artifacts into S3, and deploy the artifact template through the CloudFormation console. In the following sections, you will see that the architecture has been built to be easy to use and operate, while at the same time containing a large number of customizable features for more advanced users.

Solution overview

At a high level, the core of the architecture consists of three services interacting with one another: S3 Inventory reports (1) are delivered for targeted buckets, the report delivery events trigger an AWS Lambda function (2), and the Lambda function then runs S3 Batch Operations jobs (3) using the reports as input to encrypt the targeted buckets. Figure 1 below and the remainder of this section provide a more detailed look at what is happening underneath the surface. If that level of detail isn’t of interest to you, feel free to skip ahead to the Prerequisites and Solution deployment sections.

Figure 1: Solution architecture overview

Here’s a detailed overview of how the solution works, as shown in Figure 1 above:

  1. When the CloudFormation template is first launched, a number of resources are created, including:
    • An S3 bucket to store the S3 Inventory reports
    • An S3 bucket to store S3 Batch Job completion reports
    • A CloudWatch event that is triggered by changes to tags on S3 buckets
    • An AWS Glue Database and AWS Glue Tables that can be used by Athena to query S3 Inventory and S3 Batch report findings
    • A Lambda function that is used as a Custom Resource during template launch, and afterwards as a target for S3 event notifications and CloudWatch events
  2. During deployment of the CloudFormation template, a Lambda-backed Custom Resource lists all S3 buckets within the specified AWS Region and checks whether any of them carries the configurable tag (set via an AWS CloudFormation parameter). When a bucket with the specified tag is discovered, the Lambda function configures an S3 Inventory report for that bucket to be delivered to the newly created central report destination bucket.
  3. When a new S3 Inventory report arrives in the central report destination bucket (which can take 1 to 2 days) from any of the tagged buckets, an S3 Event Notification triggers the Lambda function to process it.
  4. The Lambda function first adds the path of the report CSV file as a partition to the AWS Glue table. This means that as each bucket delivers its report, it becomes instantly queryable by Athena, and any queries executed return the most recent information available on the status of the S3 buckets in the account.
  5. The Lambda function then checks the value of the EncryptBuckets parameter in the CloudFormation launch template to assess whether any re-encryption action should be taken. If it is set to yes, the Lambda function creates an S3 Batch Operations job and runs it. The job takes each object listed in the manifest report and copies it over itself in the same location. When the copy occurs, SSE-KMS or SSE-S3 encryption is specified in the job parameters, effectively re-encrypting all identified objects. (A sketch of this API call follows this list.)
  6. Once the batch job finishes for the S3 Inventory report, a completion report is sent to the central batch job report bucket. The CloudFormation template provides a parameter that controls the option to include either all successfully processed objects or only objects that were unsuccessfully processed. These reports can also be queried with Athena, since the reports are also added as partitions to the AWS Glue batch reports tables as they arrive.
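This is not the solution’s actual encrypt.py; it is only a minimal sketch, with placeholder ARNs and names, of how a Lambda function might create the S3 Batch Operations copy job described in step 5. The exact parameter combination for SSE-KMS is an assumption you should verify against the S3 Batch Operations documentation.

import boto3

s3control = boto3.client("s3control")

# All of the values below are placeholders for this sketch.
ACCOUNT_ID = "111122223333"
TARGET_BUCKET_ARN = "arn:aws:s3:::adams-lambda-functions"
BATCH_ROLE_ARN = "arn:aws:iam::111122223333:role/s3-batch-copy-role"
KMS_KEY_ARN = "arn:aws:kms:us-east-1:111122223333:key/1234abcd-example"
MANIFEST_ARN = "arn:aws:s3:::inventory-reports-bucket/destination/manifest.json"
MANIFEST_ETAG = "example-etag"
REPORT_BUCKET_ARN = "arn:aws:s3:::batch-job-reports-bucket"

response = s3control.create_job(
    AccountId=ACCOUNT_ID,
    ConfirmationRequired=False,
    RoleArn=BATCH_ROLE_ARN,
    Priority=10,
    Operation={
        "S3PutObjectCopy": {
            # Copy each object over itself, re-encrypting it with SSE-KMS
            # and tagging it so bucket policies can tell old and new apart.
            "TargetResource": TARGET_BUCKET_ARN,
            "NewObjectMetadata": {"SSEAlgorithm": "KMS"},
            "SSEAwsKmsKeyId": KMS_KEY_ARN,
            "NewObjectTagging": [{"Key": "__ObjectEncrypted", "Value": "true"}],
        }
    },
    Manifest={
        # The delivered S3 Inventory report acts as the job manifest.
        "Spec": {"Format": "S3InventoryReport_CSV_20161130"},
        "Location": {"ObjectArn": MANIFEST_ARN, "ETag": MANIFEST_ETAG},
    },
    Report={
        "Bucket": REPORT_BUCKET_ARN,
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "Prefix": "batch-reports",
        "ReportScope": "FailedTasksOnly",
    },
)
print("Created S3 Batch Operations job", response["JobId"])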

Prerequisites

To follow along with the sample deployment, your AWS Identity and Access Management (IAM) principal (user or role) needs administrator access or equivalent.

Solution deployment

For this walkthrough, the solution will be configured to encrypt objects using SSE-KMS, rather than SSE-S3, when an inventory report is delivered for a bucket. Please note that the key policy of the KMS key will be automatically updated by the custom resource during launch to allow S3 to use it to encrypt inventory reports. No key policies are changed if SSE-S3 encryption is selected instead. The configuration in this walkthrough also adds a tag to all newly encrypted objects. You’ll learn how to use this tag to restrict access to unencrypted objects in versioned buckets. I’ll make callouts throughout the deployment guide for when you can choose a different configuration from what is deployed in this post.

To deploy the solution architecture and validate its functionality, you’ll perform five steps:

  1. Tag target buckets for encryption
  2. Deploy the CloudFormation template
  3. Validate delivery of S3 Inventory reports
  4. Confirm that reports are queryable with Athena
  5. Validate that objects are correctly encrypted

If you are only interested in deploying the solution and encrypting your existing environment, Steps 1 and 2 are all you need to complete. Steps 3 through 5 are optional; they outline procedures you can perform to validate the solution’s functionality, and they are primarily for users who want to dive deep and take advantage of all of the features available.

With that being said, let’s get started with deploying the architecture!

Step 1: Tag target buckets

Navigate to the Amazon S3 console and identify which buckets should be targeted for inventorying and encryption. For each identified bucket, tag it with a designated key-value pair by selecting Properties > Tags > Add tag. This demo uses the tag __Inventory: true and tags only one bucket called adams-lambda-functions, as shown in Figure 2.

Figure 2: Tagging a bucket targeted for encryption in Amazon S3

Step 2: Deploy the CloudFormation template

  1. Download the S3 encryption solution. There will be two files that make up the backbone of the solution:
    • encrypt.py, which contains the Lambda microservices logic;
    • deploy.yml, which is the CloudFormation template that deploys the solution.
  2. Zip the file encrypt.py, rename it to encrypt.zip, and then upload it into any S3 bucket that is in the same Region as the one in which the CloudFormation template will be deployed. Your bucket should look like Figure 3:

    Figure 3: encrypt.zip uploaded into an S3 bucket

  3. Navigate to the CloudFormation console and then create the CloudFormation stack using the deploy.yml template. For more information, see Getting Started with AWS CloudFormation in the CloudFormation User Guide. Figure 4 shows the parameters used to achieve the configuration specified for this walkthrough, with the fields outlined in red requiring input. You can choose your own configuration by altering the appropriate parameters if the ones specified do not fit your use case.

    Figure 4: Set the parameters in the CloudFormation stack

Step 3: Validate delivery of S3 Inventory reports

After you’ve successfully deployed the CloudFormation template, select any of your tagged S3 buckets and check that it now has an S3 Inventory report configuration. To do this, navigate to the S3 console, select a tagged bucket, select the Management tab, and then select Inventory, as shown in Figure 5. You should see that an inventory configuration exists. An inventory report will be delivered automatically to this bucket within 1 to 2 days, depending on the number of objects in the bucket. Make a note of the name of the bucket where the inventory report will be delivered. The bucket is given a semi-random name during creation through the CloudFormation template, so making a note of this will help you find the bucket more easily when you check for report delivery later.

Figure 5: Check that the tagged S3 bucket has an S3 Inventory report configuration

Step 4: Confirm that reports are queryable with Athena

  1. After 1 to 2 days, navigate to the inventory reports destination bucket and confirm that reports have been delivered for buckets with the __Inventory: true tag. As shown in Figure 6, a report has been delivered for the adams-lambda-functions bucket.

    Figure 6: Confirm delivery of reports to the S3 reports destination bucket

  2. Next, navigate to the Athena console and select the AWS Glue database that contains the table holding the schema and partition locations for all of your reports. If you used the default values for the parameters when you launched the CloudFormation stack, the AWS Glue database will be named s3_inventory_database, and the table will be named s3_inventory_table. Run the following query in Athena:
    
    SELECT encryption_status, count(*) FROM s3_inventory_table GROUP BY encryption_status;
    

    The output of the query is a snapshot aggregate count of objects in the categories SSE-S3, SSE-C, SSE-KMS, and NOT-SSE across your tagged bucket environment, before encryption took place, as shown in Figure 7.

    Figure 7: Query results in Athena

    From the query results, you can see that the adams-lambda-functions bucket had only two items in it, both of which were unencrypted. At this point, you can choose to perform any other analytics with Athena on the delivered inventory reports.

Step 5: Validate that objects are correctly encrypted

  1. Navigate to any of your target buckets in Amazon S3 and check the encryption status of a few sample objects by selecting the Properties tab of each object. The objects should now be encrypted using the specified KMS CMK. Because you set the AddTagToEncryptedObjects parameter to yes during the CloudFormation stack launch, these objects should also have the __ObjectEncrypted: true tag present. As an example, Figure 8 shows the rules_present_rule.zip object from the adams-lambda-functions bucket. This object has been properly encrypted using the correct KMS key, which has an alias of blog in this example, and it has been tagged with the specified key value pair.

    Figure 8: Checking the encryption status of an object in S3

  2. For further validation, navigate back to the Athena console and select the s3_batch_table from the s3_inventory_database, assuming that you left the default names unchanged. Then, run the following query:
    
    SELECT * FROM s3_batch_table;
    

    If encryption was successful, this query should result in zero items being returned because the solution by default only delivers S3 batch job completion reports on items that failed to copy. After validating by inspecting both the objects themselves and the batch completion reports, you can now safely say that the contents of the targeted S3 buckets are correctly encrypted.

Next steps

Congratulations! You’ve successfully deployed and operated a solution for rectifying S3 buckets with incorrectly encrypted and unencrypted objects. The architecture is massively scalable because it uses S3 Batch Operations and Lambda, it’s fully serverless, and it’s cost effective to run.

Please note that if you selected no for the EncryptBuckets parameter during the initial launch of the CloudFormation template, you can retroactively perform encryption on targeted buckets by simply doing a stack update. During the stack update, switch the EncryptBuckets parameter to yes, and proceed with deployment as normal. The update will reconfigure S3 inventory reports for all target S3 buckets to get the most up-to-date inventory. After the reports are delivered, encryption will proceed as desired.

Moreover, with the solution deployed, you can target new buckets for encryption just by adding the __Inventory: true tag. CloudWatch Events will register the tagging action and automatically configure an S3 Inventory report to be delivered for the newly tagged bucket.

Finally, now that your S3 buckets are properly encrypted, you should take a few more manual steps to help maintain your newfound account hygiene:

  • Perform remediation on unencrypted objects that may have failed to copy during the S3 Batch Operations job. The most common reason that objects fail to copy is that their size exceeds 5 GiB. S3 Batch Operations uses the standard CopyObject API call underneath the surface, but this API call can only handle objects smaller than 5 GiB. To successfully copy these objects, you can modify the solution you learned in this post to launch an S3 Batch Operations job that invokes Lambda functions. In the Lambda function logic, you can make multipart upload and copy API calls on objects that failed with a standard copy; a sketch of such a copy appears after this list. The original batch job completion reports provide detail on exactly which objects failed to encrypt due to size.
  • Prohibit the retrieval of unencrypted object versions for buckets that had versioning enabled. When the object is copied over itself during the encryption process, the old unencrypted version of the object still exists. This is where the option in the solution to specify a tag on all newly encrypted objects becomes useful—you can now use that tag to draft a bucket policy that prohibits the retrieval of old unencrypted objects in your versioned buckets. For the solution that you deployed in this post, such a policy would look like this:
    
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect":     "Deny",
          "Action":     "s3:GetObject",
          "Resource":    "arn:aws:s3:::adams-lambda-functions/*",
          "Principal":   "*",
          "Condition": {  "StringNotEquals": {"s3:ExistingObjectTag/__ObjectEncrypted": "true" } }
        }
      ]
    }
    

  • Update bucket policies to prevent the upload of unencrypted or incorrectly encrypted objects. By updating bucket policies, you help ensure that in the future, newly uploaded objects will be correctly encrypted, which will help maintain account hygiene. The S3 encryption solution presented here is meant to be a onetime-use remediation tool, while you should view updating bucket policies as a preventative action. Proper use of bucket policies will help ensure that the S3 encryption solution is not needed again, unless another encryption requirement change occurs in the future. To learn more, see How to Prevent Uploads of Unencrypted Objects to Amazon S3.
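For the first item in this list, the following is a minimal sketch, not the solution’s actual code, of the kind of copy a manifest-driven Lambda function could perform on oversized objects. boto3’s managed copy switches to multipart automatically above the configured threshold, and the bucket, key, and KMS alias values are placeholders.

import boto3
from boto3.s3.transfer import TransferConfig

s3 = boto3.client("s3")

# Managed copy uses multipart uploads above this threshold, which is what
# allows objects larger than 5 GiB to be copied.
config = TransferConfig(multipart_threshold=5 * 1024 ** 3)

def reencrypt_large_object(bucket, key, kms_key_id):
    s3.copy(
        CopySource={"Bucket": bucket, "Key": key},
        Bucket=bucket,
        Key=key,
        ExtraArgs={
            "ServerSideEncryption": "aws:kms",
            "SSEKMSKeyId": kms_key_id,
        },
        Config=config,
    )
    # Re-apply the compliance tag after the copy so a bucket policy like the
    # one above keeps allowing access to the new, encrypted version.
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [{"Key": "__ObjectEncrypted", "Value": "true"}]},
    )

# Placeholder bucket, key, and KMS alias values.
reencrypt_large_object("adams-lambda-functions", "large-archive.zip", "alias/blog")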

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon S3 forum.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Adam Kozdrowicz

Adam is a Data and Machine Learning Engineer for AWS Professional Services. He works closely with enterprise customers building big data applications on AWS, and he enjoys working with frameworks such as AWS Amplify, SAM, and CDK. During his free time, Adam likes to surf, travel, practice photography, and build machine learning models.

How Wind Mobility built a serverless data architecture

Post Syndicated from Pablo Giner original https://aws.amazon.com/blogs/big-data/how-wind-mobility-built-a-serverless-data-architecture/

Guest post by Pablo Giner, Head of BI, Wind Mobility.

Over the past few years, urban micro-mobility has become a trending topic. With pollution indexes hitting historic highs, cities and companies worldwide have been introducing regulations and working on a wide spectrum of solutions to alleviate the situation.

We at Wind Mobility strive to make commuters’ life more sustainable and convenient by bringing short distance urban transportation to cities worldwide.

At Wind Mobility, we scale our services at the same pace as our users demand them, and we do it in an economically and environmentally viable way. We optimize our fleet distribution to avoid overcrowding cities with more scooters than those that are actually going to be used, and we position them just meters away from where our users need them and at the time of the day when they want them.

How do we do that? By optimizing our operations to their fullest. To do so, we need to be very well informed about our users’ behavior under varying conditions and understand our fleet’s potential.

Scalability and flexibility for rapid growth

We knew that before we could solve this challenge, we needed to collect data from many different sources, such as user interactions with our application, user demand, IoT signals from our scooters, and operational metrics. To analyze the numerous datasets collected and extract actionable insights, we needed to build a data lake. While the high-level goal was clear, the scope was less so. We were working hard to scale our operation as we continued to launch new markets. The rapid growth and expansion made it very difficult to predict the volume of data we would need to consume. We were also launching new microservices to support our growth, which resulted in more data sources to ingest. We needed an architecture that allowed us to be agile and quickly adapt to meet our growth. It became clear that a serverless architecture was best positioned to meet those needs, so we started to design our 100% serverless infrastructure.

The first challenge was ingesting and storing data from our scooters in the field, events from our mobile app, operational metrics, and partner APIs. We use AWS Lambda to capture changes in our operational databases and mobile app and push the events to Amazon Kinesis Data Streams, which allows us to take action in real time. We also use Amazon Kinesis Data Firehose to write the data to Amazon Simple Storage Service (Amazon S3), which we use for analytics.
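As a purely illustrative sketch (not Wind Mobility’s actual code), a Lambda function that forwards an application event to a Kinesis data stream can be as small as the following; the stream name and event fields are placeholders.

import json
import boto3

kinesis = boto3.client("kinesis")

def handler(event, context):
    # Forward each incoming application event to the stream, keyed by user
    # so that events for the same user stay ordered within a shard.
    kinesis.put_record(
        StreamName="app-events-stream",
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=str(event.get("user_id", "unknown")),
    )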

After the data was in Amazon S3 and adequately partitioned according to its most common use cases (we partition by date, region, and business line, depending on the data source), we had to find a way to query this data for both data profiling (understanding structure, content, and interrelationships) and ad hoc analysis. For that we chose AWS Glue crawlers to catalog our data and Amazon Athena to read from the AWS Glue Data Catalog and run queries. However, ad hoc analysis and data profiling are relatively sporadic tasks in our team, because most of the data processing computing hours are actually dedicated to transforming the multiple data sources into our data warehouse, consolidating the raw data, modeling it, adding new attributes, and picking the data elements, work that constitutes 95% of our analytics and predictive needs.

This is where all the heavy lifting takes place. We parse through millions of scooter and user events generated daily (over 300 events per second) to extract actionable insight. We selected AWS Glue to perform this task. Our primary ETL job reads the newly added raw event data from Amazon S3, processes it using Apache Spark, and writes the results to our Amazon Redshift data warehouse. AWS Glue plays a critical role in our ability to scale on demand. After careful evaluation and testing, we concluded that AWS Glue ETL jobs meet all our needs and free us from procuring and managing infrastructure.

Architecture overview

The following diagram represents our current data architecture, showing two serverless data collection, processing, and reporting pipelines:

  • Operational databases from Amazon Relational Database Service (Amazon RDS) and MongoDB
  • IoT and application events, followed by Athena for data profiling and Amazon Redshift for reporting

Our data is curated and transformed multiple times a day using an automated pipeline running on AWS Glue. The team can now focus on analyzing the data and building machine learning (ML) applications.

We chose Amazon QuickSight as our business intelligence tool to help us visualize and better understand our operational KPIs. Additionally, we use Amazon Elastic Container Registry (Amazon ECR) to store our Docker images containing our custom ML algorithms and Amazon Elastic Container Service (Amazon ECS) where we train, evaluate, and host our ML models. We schedule our models to be trained and evaluated multiple times a day. Taking as input curated data about demand, conversion, and flow of scooters, we run the models to help us optimize fleet utilization for a particular city at any given time.

The following diagram represents how data from the data lake is incorporated into our ML training, testing, and serving system. First, our developers work in the application code and commit their changes, which are built into new Docker images by our CI/CD pipeline and stored in the Amazon ECR registry. These images are pushed into Amazon ECS and tested in DEV and UAT environments before moving to PROD (where they are triggered by the Amazon ECS task scheduler). During their execution, the Amazon ECS tasks (some train the demand and usage forecasting models, some produce the daily and hourly predictions, and others optimize the fleet distribution to satisfy the forecast) read their configuration and pull data from Amazon S3 (which has been previously produced by scheduled AWS Glue jobs), finally storing their results back into Amazon S3. Executions of these pipelines are tracked via MLFlow (in a dedicated Amazon Elastic Compute Cloud (Amazon EC2) server) and the final result indicating the fleet operations required is fit into a Kepler map, which is then consumed by the operators on the field.

Conclusion

We at Wind Mobility place data at the forefront of our operations. For that, we need our data infrastructure to be as flexible as the industry and the context we operate in, which is why we chose serverless. Over the course of a year, we have built a data lake, a data warehouse, a BI suite, and a variety of (production) data science applications. All of that with a very small team.

Also, within the last 12 months, we have scaled up several of our data pipelines by a factor of 10, without slowing our momentum or redesigning any part of our architecture. When it came to double our fleet in 1 week and increase the frequency at which we capture data from scooters by a factor of 10, our serverless data architecture scaled with no issues. This allowed us to focus on adding value by simplifying our operation, reacting to changes quickly, and delighting our users.

We have measured our success in multiple dimensions:

  • Speed – Serverless is faster to deploy and expand; we believe we have reduced our time to market for the entire infrastructure by a factor of 2
  • Visibility – We have 360 degree visibility of our operations worldwide, accessible by our city managers, finance team, and management board
  • Optimized fleet deployment – We know, at any minute of the day, the number of scooters that our customers need over the next few hours, which reduces unsatisfied demand by more than 50%

If you face a similar challenge, our advice is clear: go fully serverless and use the spectrum of solutions available from AWS.

Follow us and discover more about Wind Mobility on Facebook, Instagram and LinkedIn.

About the Author

Pablo Giner is Head of BI at Wind Mobility. Pablo’s background is in wheels (motorcycle racing > vehicle engineering > collision insurance > eScooters sharing…) and for the last few years he has specialized in forming and developing data teams. At Wind Mobility, he leads the data function (data engineering + analytics + data science), and the project he is most proud of is what they call smart fleet rebalancing, an AI-backed solution to reposition their fleet in real time. “In God we trust. All others must bring data.” – W. Edwards Deming


Adding voice to a CircuitPython project using Amazon Polly

Post Syndicated from Moheeb Zara original https://aws.amazon.com/blogs/compute/adding-voice-to-a-circuitpython-project-using-amazon-polly/

An Adafruit PyPortal displaying a quote while synthesizing and playing speech using Amazon Polly.

As a natural means of communication, voice is a powerful way to humanize an experience. What if you could make anything talk? This guide walks through how to leverage the cloud to add voice to an off-the-shelf microcontroller. Use it to develop more advanced ideas, like a talking toaster that encourages healthy breakfast habits or a house plant that can express its needs.

This project uses an Adafruit PyPortal, an open-source IoT touch display programmed using CircuitPython, a lightweight version of Python that works on embedded hardware. You copy your code to the PyPortal like you would to a thumb drive and it runs. Random quotes from the PaperQuotes API are periodically displayed on the PyPortal LCD.

A microcontroller can’t do speech synthesis on its own, so I use Amazon Polly, a text-to-speech service, to generate audio. Adding speech also extends accessibility to the visually impaired. This project includes an example for requesting arbitrary speech in addition to random quotes. Use this example to add a voice to any CircuitPython project.

An Adafruit PyPortal, an external speaker, and a microSD card.

I deploy the backend to the AWS Cloud using the AWS Serverless Application Repository. The code on the PyPortal makes a REST call to the backend to fetch a quote and synthesize speech audio for playback on the device.

Prerequisites

You need the following to complete the project:

Deploy the backend application

An architecture diagram of the serverless backend when requesting speech synthesis of a text string.

The serverless backend consists of an Amazon API Gateway endpoint that invokes an AWS Lambda function. If called with a JSON object containing text and voiceId attributes, it uses Amazon Polly to synthesize speech and uploads an MP3 file as a public object to Amazon S3. Upon completion, it returns a JSON response containing the URL for downloading the audio file. It also processes the submitted text and adds return lines so that it appears text-wrapped when displayed on the PyPortal. For a full list of voices, see the Amazon Polly documentation.

To fetch quotes instead of a text field, call the endpoint with a comma-separated list of tags as shown in the following diagram. The Lambda function then calls the PaperQuotes API. It fetches up to 50 quotes per tag and selects a random one to synthesize as speech. As with arbitrary text, it returns a URL and a text-wrapped representation of the quote.

An architecture diagram of the serverless backend when requesting a random quote from the PaperQuotes API to synthesize as speech.

I use the AWS Serverless Application Model (AWS SAM) to create the backend template. While it can be deployed using the AWS SAM CLI, you can also deploy from the AWS Management Console:

  1. Generate a free PaperQuotes API key at paperquotes.com. The serverless backend requires this to fetch quotes.
  2. Navigate to the aws-serverless-pyportal-polly application in the AWS Serverless Application Repository.
  3. Under Application settings, enter the PaperQuotesAPIKey parameter.
  4. Choose Deploy.
  5. Once complete, choose View CloudFormation Stack.
  6. Select the Outputs tab and make a note of the SpeechApiUrl. This is required for configuring the PyPortal.
  7. Click the link listed for SpeechApiKey in the Outputs tab.
  8. Click Show to reveal the API key. Make a note of this. This is required for authenticating requests from the PyPortal to the SpeechApiUrl.

PyPortal setup

The following instructions walk through installing the latest version of the Adafruit CircuitPython libraries and firmware. They also show how to enable an external speaker module.

  1. Follow these instructions from Adafruit to install the latest version of the CircuitPython bootloader. At the time of writing, the latest version is 5.3.0.
  2. Follow these instructions to install the latest Adafruit CircuitPython library bundle. I use bundle version 5.x.
  3. Insert the microSD card in the slot located on the back of the device.
  4. Cut the jumper pad on the back of the device labeled A0. This enables you to use an external speaker instead of the built-in speaker.
  5. Plug the external speaker connector into the port labeled SPEAKER on the back of the device.
  6. Optionally install the Mu Editor, a multi-platform code editor and serial debugger compatible with Adafruit CircuitPython boards. This can help with troubleshooting issues.
  7. Optionally if you have a 3D printer at home, you can print a case for your PyPortal. This can protect and showcase your project.

Code PyPortal

As with regular Python, CircuitPython does not need to be compiled to execute. You can flash new firmware on the PyPortal by copying a Python file and necessary assets to a mounted volume. The bootloader runs code.py anytime the device starts or any files are updated.

  1. Use a USB cable to plug the PyPortal into your computer and wait until a new mounted volume CIRCUITPY is available.
  2. Download the project from GitHub. Inside the project, copy the contents of /circuit-python on to the CIRCUITPY volume.
  3. Inside the volume, open and edit the secrets.py file. Include your Wi-Fi credentials along with the SpeechApiKey and SpeechApiUrl API Gateway endpoint. These can be found under Outputs in the AWS CloudFormation stack created by the AWS Serverless Application Repository.
  4. Save the file, and the device restarts. It takes a moment to connect to Wi-Fi and make the first request.
    Optionally, if you installed the Mu Editor, you can click Serial to follow along with the device log.

The PyPortal takes a few moments to connect to the Wi-Fi network and make its first request. On success, you hear it greet you and describe itself. The default interval is set to then display and read a quote every five minutes.

Understanding the CircuitPython code

See the bottom of circuit-python/code.py from the GitHub project. When the PyPortal connects to Wi-Fi, the first thing it does is synthesize an arbitrary “hello world” text for display. It then begins periodically displaying and “speaking” quotes.

# Connect to WiFi
print("Connecting to WiFi...")
wifi.connect()
print("Connected!")

displayQuote("Ready!")

speakText('Hello world! I am an Adafruit PyPortal running Circuit Python speaking to you using AWS Serverless', 'Joanna')

while True:
    speakQuote('equality, humanity', 'Joanna')
    time.sleep(60*secrets['interval'])

Both the speakText and speakQuote functions call the synthesizeSpeech function. The difference is whether text or tags are passed to the API.

def speakText(text, voice):
    data = { "text": text, "voiceId": voice }
    synthesizeSpeech(data)

def speakQuote(tags, voice):
    data = { "tags": tags, "voiceId": voice }
    synthesizeSpeech(data)

The synthesizeSpeech function posts the data to the API Gateway endpoint, which invokes the Lambda function and returns the MP3 URL and the formatted text. The downloadfile function is called to fetch the MP3 file and store it on the SD card. displayQuote is called to display the quote on the LCD. Finally, playMP3 opens the file and plays the speech audio using the built-in or external speaker.

def synthesizeSpeech(data):
    response = postToAPI(secrets['endpoint'], data)
    downloadfile(response['url'], '/sd/cache.mp3')
    displayQuote(response['text'])
    playMP3("/sd/cache.mp3")

Modifying the Lambda function

The serverless application includes a Lambda function, SynthesizeSpeechFunction, which can be modified directly in the Lambda console. The AWS SAM template used to deploy the AWS Serverless Application Repository application adds policies for accessing the S3 bucket where audio is stored and grants access to Amazon Polly for synthesizing speech. It also adds the PaperQuotes API token as an environment variable and sets API Gateway as an event source.

SynthesizeSpeechFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: lambda_functions/SynthesizeSpeech/
      Handler: app.lambda_handler
      Runtime: python3.8
      Policies:
        - S3FullAccessPolicy:
            BucketName: !Sub "${AWS::StackName}-audio"
        - Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - polly:*
              Resource: '*'
      Environment:
        Variables:
          BUCKET_NAME: !Sub "${AWS::StackName}-audio"
          PAPER_QUOTES_TOKEN: !Ref PaperQuotesAPIKey
      Events:
        Speech:
          Type: Api
          Properties:
            RestApiId: !Ref SpeechApi
            Path: /speech
            Method: post

To edit the Lambda function, navigate back to the CloudFormation stack and choose the SynthesizeSpeechFunction under the Resources tab.

From here, you can edit the Lambda function code directly. Clicking Save deploys the new code.

The getQuotes function is called to fetch quotes from the PaperQuotes API. You can change this to call from a different source, such as a custom selection of quotes. Try modifying it to fetch social media posts or study questions.
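For example, a drop-in replacement along these lines would serve quotes from a hard-coded list instead of the PaperQuotes API. The function name, single-argument signature, and return type are assumptions about the solution’s code, so adjust them to match what you see in the Lambda console.

import random

STUDY_QUESTIONS = [
    "What are the three laws of motion?",
    "In what year did the first moon landing take place?",
    "Name the four chambers of the heart.",
]

def getQuotes(tags):
    # Ignore the tags and return a random item from the custom list.
    return random.choice(STUDY_QUESTIONS)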

Conclusion

I show how to add natural sounding text to speech on a microcontroller using a serverless backend. This is accomplished by deploying an application through the AWS Serverless Application Repository. The deployed API uses API Gateway to securely invoke a Lambda function that fetches quotes from the PaperQuotes API and generates speech using Amazon Polly. The speech audio is uploaded to S3.

I then show how to program a microcontroller, the Adafruit PyPortal, using CircuitPython. The code periodically calls the serverless API to fetch a quote and to download speech audio for playback. The sample code also demonstrates synthesizing arbitrary text to speech, meaning it can be used for any project you can conceive. Check out my previous guide on using the PyPortal to create a Martian weather display for inspiration.

Moovit embraces data lake architecture by extending their Amazon Redshift cluster to analyze billions of data points every day

Post Syndicated from Yonatan Dolan original https://aws.amazon.com/blogs/big-data/moovit-embraces-data-lake-architecture-by-extending-their-amazon-redshift-cluster-to-analyze-billions-of-data-points-every-day/

Amazon Redshift is a fast, fully managed, cloud-native data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence tools.

Moovit is a leading Mobility as a Service (MaaS) solutions provider and maker of the top urban mobility app. Guiding over 800 million users in more than 3,200 cities across 103 countries to get around town effectively and conveniently, Moovit has experienced exponential growth of their service in the last few years. The company amasses up to 6 billion anonymous data points a day to add to the world’s largest repository of transit and urban mobility data, aided by Moovit’s network of more than 685,000 local editors that help map and maintain local transit information in cities that would otherwise be unserved.

Like Moovit, many companies today are using Amazon Redshift to analyze data and perform various transformations on the data. However, as data continues to grow and become even more important, companies are looking for more ways to extract valuable insights from the data, such as big data analytics, numerous machine learning (ML) applications, and a range of tools to drive new use cases and business processes. Companies want all of their data to be accessible by all users, all the time, with fast answers. The best solution for all those requirements is for companies to build a data lake, which is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale.

With a data lake built on Amazon Simple Storage Service (Amazon S3), you can easily run big data analytics using services such as Amazon EMR and AWS Glue. You can also query structured data (such as CSV, Avro, and Parquet) and semi-structured data (such as JSON and XML) by using Amazon Athena and Amazon Redshift Spectrum. You can also use a data lake with ML services such as Amazon SageMaker to gain insights.

Moovit uses an Amazon Redshift cluster to allow different company teams to analyze vast amounts of data. They wanted a way to extend the collected data into the data lake and allow additional analytical teams to access more data to explore new ideas and business cases.

Additionally, Moovit was looking to manage their storage costs and evolve to a model that keeps cooler data at the lowest cost in Amazon S3 and the hottest data in Amazon Redshift for the most efficient query performance. The proposed solution implemented a hot/cold storage pattern using Amazon Redshift Spectrum and reduced the local disk utilization on the Amazon Redshift cluster to make sure costs are maintained. Moovit is currently evaluating the new RA3 node with managed storage as an additional level of flexibility that will allow them to easily scale the amount of hot/cold storage without limit.

In this post we demonstrate how Moovit, with the support of AWS, implemented a lake house architecture by employing the following best practices:

  • Unloading data into Amazon Simple Storage Service (Amazon S3)
  • Instituting a hot/cold pattern using Amazon Redshift Spectrum
  • Using AWS Glue to crawl and catalog the data
  • Querying data using Athena

Solution overview

The following diagram illustrates the solution architecture.

The solution includes the following steps:

  1. Unload data from Amazon Redshift to Amazon S3
  2. Create an AWS Glue Data Catalog using an AWS Glue crawler
  3. Query the data lake in Amazon Athena
  4. Query Amazon Redshift and the data lake with Amazon Redshift Spectrum

Prerequisites

To complete this walkthrough, you must have the following prerequisites:

  1. An AWS account.
  2. An Amazon Redshift cluster.
  3. The following AWS services and access: Amazon Redshift, Amazon S3, AWS Glue, and Athena.
  4. The appropriate AWS Identity and Access Management (IAM) permissions for Amazon Redshift Spectrum and AWS Glue to access Amazon S3 buckets. For more information, see IAM policies for Amazon Redshift Spectrum and Setting up IAM Permissions for AWS Glue.

Walkthrough

To demonstrate the process Moovit used to build their data architecture, we use the industry-standard TPC-H dataset provided publicly by the TPC organization.

The Orders table has the following columns:

Column Type
O_ORDERKEY int4
O_CUSTKEY int4
O_ORDERSTATUS varchar
O_TOTALPRICE numeric
O_ORDERDATE date
O_ORDERPRIORITY varchar
O_CLERK varchar
O_SHIPPRIORITY int4
O_COMMENT varchar
SKIP varchar

Unloading data from Amazon Redshift to Amazon S3

Amazon Redshift allows you to unload your data using a data lake export to an Apache Parquet file format. Parquet is an efficient open columnar storage format for analytics. Parquet format is up to twice as fast to unload and consumes up to six times less storage in Amazon S3, compared with text formats.

To unload cold or historical data from Amazon Redshift to Amazon S3, you need to run an UNLOAD statement similar to the following code (substitute your IAM role ARN):

UNLOAD ('select o_orderkey, o_custkey, o_orderstatus, o_totalprice, o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment, skip
FROM tpc.orders
ORDER BY o_orderkey, o_orderdate') 
TO 's3://tpc-bucket/orders/' 
CREDENTIALS 'aws_iam_role=arn:aws:iam::<account_number>:role/<Role>'
FORMAT AS parquet allowoverwrite PARTITION BY (o_orderdate);

It is important to define a partition key or column that minimizes Amazon S3 scans as much as possible based on the query patterns intended. The query pattern is often by date ranges; for this use case, use the o_orderdate field as the partition key.

Another important recommendation when unloading is to have file sizes between 128 MB and 512 MB. By default, the UNLOAD command splits the results into one or more files per node slice (virtual worker in the Amazon Redshift cluster), which allows you to use the Amazon Redshift MPP architecture. However, this can potentially cause the files created by every slice to be small. In Moovit’s use case, the default UNLOAD using PARALLEL ON yielded dozens of small (megabyte-sized) files. For Moovit, PARALLEL OFF yielded the best results because it aggregated all the slices’ work into the leader node and wrote it out as a single stream, controlling the file size using the MAXFILESIZE option.
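The PARALLEL OFF and MAXFILESIZE options can be added to the UNLOAD statement directly. The following sketch submits such a statement through the Amazon Redshift Data API; the cluster identifier, database, user, and 256 MB file size are placeholders, and verify the option combination against the UNLOAD documentation for your cluster version.

import boto3

redshift_data = boto3.client("redshift-data")

# UNLOAD with a single output stream (PARALLEL OFF) and a target file size.
unload_sql = """
UNLOAD ('SELECT * FROM tpc.orders ORDER BY o_orderkey, o_orderdate')
TO 's3://tpc-bucket/orders/'
IAM_ROLE 'arn:aws:iam::<account_number>:role/<Role>'
FORMAT AS PARQUET
ALLOWOVERWRITE
PARALLEL OFF
MAXFILESIZE 256 MB;
"""

redshift_data.execute_statement(
    ClusterIdentifier="analytics-cluster",  # placeholder cluster name
    Database="dev",
    DbUser="awsuser",
    Sql=unload_sql,
)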

Another performance enhancement applied in this use case was the use of Parquet’s min and max statistics. Parquet files have min_value and max_value column statistics for each row group that allow Amazon Redshift Spectrum to prune (skip) row groups that are out of scope for a query (range-restricted scan). To use row group pruning, you should sort the data by frequently-used columns. Min/max pruning helps scan less data from Amazon S3, which results in improved performance and reduced cost.

After unloading the data to your data lake, you can view your Parquet file’s content in Amazon S3 (assuming it’s under 128 MB). From the Actions drop-down menu, choose Select from.

You’re now ready to populate your Data Catalog using an AWS Glue crawler.

Creating a Data Catalog with an AWS Glue crawler

To query your data lake using Athena, you must catalog the data. The Data Catalog is an index of the location, schema, and runtime metrics of the data.

An AWS Glue crawler accesses your data store, extracts metadata (such as field types), and creates a table schema in the Data Catalog. For instructions, see Working with Crawlers on the AWS Glue Console.
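If you prefer to script this step, creating and starting a crawler takes only a couple of API calls. The crawler name, IAM role, and S3 path below are placeholders; the database name matches the external schema mapping used later in this post.

import boto3

glue = boto3.client("glue")

glue.create_crawler(
    Name="orders-datalake-crawler",
    Role="arn:aws:iam::<account_number>:role/GlueCrawlerRole",
    DatabaseName="datalake",
    Targets={"S3Targets": [{"Path": "s3://tpc-bucket/orders/"}]},
    # Keep the catalog in sync with new partitions on each run.
    SchemaChangePolicy={
        "UpdateBehavior": "UPDATE_IN_DATABASE",
        "DeleteBehavior": "LOG",
    },
)

glue.start_crawler(Name="orders-datalake-crawler")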

Querying the data lake in Athena

After you create the crawler, you can view the schema and tables in AWS Glue and Athena, and can immediately start querying the data in Athena. The following screenshot shows the table in the Athena Query Editor.

Querying Amazon Redshift and the data lake using a unified view with Amazon Redshift Spectrum

Amazon Redshift Spectrum is a feature of Amazon Redshift that allows multiple Redshift clusters to query the same data in the data lake. It enables the lake house architecture and allows data warehouse queries to reference data in the data lake as they would any other table. Amazon Redshift clusters transparently use the Amazon Redshift Spectrum feature when the SQL query references an external table stored in Amazon S3. Multiple large queries can run in parallel because Amazon Redshift Spectrum scans, filters, and aggregates data in external tables and returns only the needed rows from Amazon S3 to the Amazon Redshift cluster.

Following best practices, Moovit decided to persist all their data in their Amazon S3 data lake and only store hot data in Amazon Redshift. They could query both hot and cold datasets in a single query with Amazon Redshift Spectrum.

The first step is creating an external schema in Amazon Redshift that maps a database in the Data Catalog. See the following code:

CREATE EXTERNAL SCHEMA spectrum 
FROM data catalog 
DATABASE 'datalake' 
iam_role 'arn:aws:iam::<account_number>:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

After the crawler creates the external table, you can start querying in Amazon Redshift using the mapped schema that you created earlier. See the following code:

SELECT * FROM spectrum.orders;

Lastly, create a late binding view that unions the hot and cold data:

CREATE OR REPLACE VIEW lake_house_joint_view AS (
    SELECT * FROM public.orders WHERE o_orderdate >= dateadd('day', -90, date_trunc('day', getdate()))
    UNION ALL
    SELECT * FROM spectrum.orders WHERE o_orderdate < dateadd('day', -90, date_trunc('day', getdate()))
) WITH NO SCHEMA BINDING;

Summary

In this post, we showed how Moovit unloaded data from Amazon Redshift to a data lake. By doing that, they exposed the data to many additional groups within the organization and democratized the data. These benefits of data democratization are substantial because various teams within Moovit can access the data, analyze it with various tools, and come up with new insights.

As an additional benefit, Moovit reduced their Amazon Redshift utilized storage, which allowed them to maintain cluster size and avoid additional spending by keeping all historical data within the data lake and only hot data in the Amazon Redshift cluster. Keeping only hot data on the Amazon Redshift cluster prevents Moovit from deleting data frequently, which saves IT resources, time, and effort.

If you are looking to extend your data warehouse to a data lake and leverage various tools for big data analytics and machine learning (ML) applications, we invite you to try out this walkthrough.


About the Authors

Yonatan Dolan is a Business Development Manager at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to leverage data, gain insights, and derive value.


Alon Gendler is a Startup Solutions Architect at Amazon Web Services. He works with AWS customers to help them architect secure, resilient, scalable and high performance applications in the cloud.


Vincent Gromakowski is a Specialist Solutions Architect for Amazon Web Services.


Tighten S3 permissions for your IAM users and roles using access history of S3 actions

Post Syndicated from Mathangi Ramesh original https://aws.amazon.com/blogs/security/tighten-s3-permissions-iam-users-and-roles-using-access-history-s3-actions/

Customers tell us that when their teams and projects are just getting started, administrators may grant broad access to inspire innovation and agility. Over time, administrators need to restrict access to only the permissions required to achieve least privilege. Some customers have told us they need information to help them determine the permissions an application really needs, and which permissions they can remove without impacting applications. To help with this, AWS Identity and Access Management (IAM) reports the last time users and roles used each service, so you can know whether you can restrict access. This helps you to refine permissions to specific services, but we learned that customers also need to set more granular permissions to meet their security requirements.

We are happy to announce that we now include action-level last accessed information for Amazon Simple Storage Service (Amazon S3). This means you can tighten permissions to only the specific S3 actions that your application requires. The action-level last accessed information is available for S3 management actions. As you try it out, let us know how you’re using action-level information and what additional information would be valuable as we consider supporting more services.

The following is an example snapshot of S3 action last accessed information.
 

Figure 1: S3 action last accessed information snapshot

You can use the new action last accessed information for Amazon S3 in conjunction with other features that help you to analyze access and tighten S3 permissions. AWS IAM Access Analyzer generates findings when your resource policies allow access to your resources from outside your account or organization. Specifically for Amazon S3, when an S3 bucket policy changes, Access Analyzer alerts you if the bucket is accessible by users from outside the account, which helps you to protect your data from unintended access. You can use action last accessed information for your user or role, in combination with Access Analyzer findings, to improve the security posture of your S3 permissions. You can review the action last accessed information in the IAM console, or programmatically using the AWS Command Line Interface (AWS CLI) or a programmatic client.
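As a minimal sketch of the programmatic path, the following asks IAM for an action-level report and prints the S3 action details for one role. The role ARN is a placeholder, and the response field names are based on the service last accessed APIs, so double-check them against the IAM API reference.

import time
import boto3

iam = boto3.client("iam")

# Placeholder ARN; use the role or user you want to review.
role_arn = "arn:aws:iam::111122223333:role/PaymentAppTestRole"

job = iam.generate_service_last_accessed_details(
    Arn=role_arn,
    Granularity="ACTION_LEVEL",  # include S3 action-level details
)

# The report is generated asynchronously, so poll until it completes.
while True:
    details = iam.get_service_last_accessed_details(JobId=job["JobId"])
    if details["JobStatus"] != "IN_PROGRESS":
        break
    time.sleep(2)

for service in details["ServicesLastAccessed"]:
    if service["ServiceNamespace"] == "s3":
        for action in service.get("TrackedActionsLastAccessed", []):
            print(action["ActionName"], action.get("LastAccessedTime", "Not accessed"))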

Example use case for reviewing action last accessed details

Now I’ll walk you through an example to demonstrate how you identify unused S3 actions and reduce permissions for your IAM principals. In this example a system administrator, Martha Rivera, is responsible for managing access for her IAM principals. She periodically reviews permissions to ensure that teams follow security best practices. Specifically, she ensures that the team has only the minimum S3 permissions required to work on their application and achieve their use cases. To do this, Martha reviews the last accessed timestamp for each supported S3 action that the roles in her account have access to. Martha then uses this information to identify the S3 actions that are not used, and she restricts access to those actions by updating the policies.

To view action last accessed information in the AWS Management Console

  1. Open the IAM Console.
  2. In the navigation pane, select Roles, then choose the role that you want to analyze (for example, PaymentAppTestRole).
  3. Select the Access Advisor tab. This tab displays all the AWS services to which the role has permissions, as shown in Figure 2.
     
    Figure 2: List of AWS services to which the role has permissions

  4. On the Access Advisor tab, select Amazon S3 to view all the supported actions to which the role has permissions, when each action was last used by the role, and the AWS Region in which it was used, as shown in Figure 3.
     
    Figure 3: List of S3 actions with access data

In this example, Martha notices that PaymentAppTestRole has read and write S3 permissions. From the information in Figure 3, she sees that the role is using read actions for GetBucketLogging, GetBucketPolicy, and GetBucketTagging. She also sees that the role hasn’t used write permissions for CreateAccessPoint, CreateBucket, PutBucketPolicy, and others in the last 30 days. Based on this information, Martha updates the policies to remove write permissions. To learn more about updating permissions, see Modifying a Role in the AWS IAM User Guide.
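As a rough illustration of that cleanup step, the following boto3 sketch replaces the role's S3 permissions with only the read actions that were actually used. The bucket name, policy name, and the choice of an inline policy are hypothetical; adapt them to how your roles are actually granted S3 access.

import json
import boto3

iam = boto3.client("iam")

# Hypothetical read-only policy that keeps just the actions the role actually used.
read_only_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowUsedS3ReadActions",
            "Effect": "Allow",
            "Action": [
                "s3:GetBucketLogging",
                "s3:GetBucketPolicy",
                "s3:GetBucketTagging",
            ],
            "Resource": "arn:aws:s3:::payment-app-test-bucket",  # placeholder bucket
        }
    ],
}

# Replace the role's inline policy so the unused write actions are no longer granted.
iam.put_role_policy(
    RoleName="PaymentAppTestRole",
    PolicyName="PaymentAppS3ReadOnly",
    PolicyDocument=json.dumps(read_only_policy),
)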

At launch, you can review 50 days of access data; that is, any use of S3 actions in the preceding 50 days shows up as a last accessed timestamp. As this tracking period continues to increase, you can start making permissions decisions that apply to use cases with longer period requirements (for example, when 60 or 90 days of data is available).

Martha sees that the GetAccessPoint action shows Not accessed in the tracking period, which means that the action was not used since IAM started tracking access for the service, action, and AWS Region. Based on this information, Martha confidently removes this permission to further reduce permissions for the role.

Additionally, Martha notices that an action she expected does not show up in the list in Figure 3. This can happen for one of two reasons: either PaymentAppTestRole does not have permissions for the action, or IAM doesn’t yet track access for the action. In either situation, do not update permissions for those actions based on action last accessed information. To learn more, see Refining Permissions Using Last Accessed Data in the AWS IAM User Guide.

To view action last accessed information programmatically

The action last accessed data is available through updates to the following existing APIs. These APIs now generate action last accessed details, in addition to service last accessed details:

  • generate-service-last-accessed-details: Call this API first to start a job that generates the service and action last accessed data for a user or role. The API returns a JobId that you then use with get-service-last-accessed-details to check whether the job has completed and to retrieve the results.
  • get-service-last-accessed-details: Call this API to retrieve the service and action last accessed data for a user or role based on the JobID you pass in. This API is paginated at the service level.

To learn more, see GenerateServiceLastAccessedDetails in the AWS IAM User Guide.
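The following is a minimal boto3 sketch of that two-step flow; the role ARN is a placeholder, and the polling and output handling are simplified for illustration.

import time
import boto3

iam = boto3.client("iam")
role_arn = "arn:aws:iam::123456789012:role/PaymentAppTestRole"  # placeholder ARN

# Start the report job; ACTION_LEVEL requests per-action timestamps where supported (S3).
job = iam.generate_service_last_accessed_details(Arn=role_arn, Granularity="ACTION_LEVEL")
job_id = job["JobId"]

# Poll until the job completes, then inspect the S3 entry.
while True:
    details = iam.get_service_last_accessed_details(JobId=job_id)
    if details["JobStatus"] != "IN_PROGRESS":
        break
    time.sleep(2)

for service in details["ServicesLastAccessed"]:
    if service["ServiceNamespace"] == "s3":
        for action in service.get("TrackedActionsLastAccessed", []):
            print(action["ActionName"], action.get("LastAccessedTime", "Not accessed"))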

Conclusion

By using action last accessed information for S3, you can review access for supported S3 actions, remove unused actions, and restrict access to S3 to achieve least privilege. To learn more about how to use action last accessed information, see Refining Permissions Using Last Accessed Data in the AWS IAM User Guide.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS IAM forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Mathangi Ramesh

Mathangi is the product manager for AWS Identity and Access Management. She enjoys talking to customers and working with data to solve problems. Outside of work, Mathangi is a fitness enthusiast and a Bharatanatyam dancer. She holds an MBA degree from Carnegie Mellon University.

Running a high-performance SAS Grid Manager cluster on AWS with Amazon FSx for Lustre

Post Syndicated from Neelam original https://aws.amazon.com/blogs/big-data/running-a-high-performance-sas-grid-manager-cluster-on-aws-with-amazon-fsx-for-lustre/

SAS® is a provider of data science and analytics software used by enterprises and government organizations. SAS Grid is a highly available, fast-processing analytics platform that offers centralized management and balances workloads across different compute nodes. This application suite is capable of data management, visual analytics, governance and security, forecasting and text mining, statistical analysis, and environment management. SAS and AWS recently performed testing using the Amazon FSx for Lustre shared file system to determine how well a standard workload performs on AWS using SAS Grid Manager. For more information about the results, see the whitepaper Accelerating SAS Using High-Performing File Systems on Amazon Web Services.

In this post, we take a look at an approach to deploy underlying AWS infrastructure to run SAS Grid with FSx for Lustre that you can also apply to similar applications with demanding I/O requirements.

System design overview

Running high-performance workloads that are throughput heavy and sensitive to network latency requires approaches beyond those of typical applications. AWS generally recommends that applications span multiple Availability Zones for high availability; for latency-sensitive, high-throughput applications, however, traffic should stay local for optimal performance. To maximize throughput, you can do the following (a brief example follows the list):

  • Run in a virtual private cloud (VPC), using instance types that support enhanced networking
  • Run instances in the same Availability Zone
  • Run instances within a placement group
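As a rough illustration of the last two points, the following boto3 sketch creates a cluster placement group and launches instances into it within a single subnet (one Availability Zone). The AMI, subnet, instance type, and counts are placeholders, not recommendations.

import boto3

ec2 = boto3.client("ec2", region_name="us-west-2")  # placeholder Region

# Cluster placement groups pack instances close together for low-latency, high-throughput traffic.
ec2.create_placement_group(GroupName="sas-grid-pg", Strategy="cluster")

# Launch Grid compute nodes in a single subnet (one Availability Zone) inside the placement group.
ec2.run_instances(
    ImageId="ami-0123456789abcdef0",        # placeholder AMI
    InstanceType="m5n.8xlarge",
    MinCount=4,
    MaxCount=4,
    SubnetId="subnet-0123456789abcdef0",    # placeholder subnet in the chosen AZ
    Placement={"GroupName": "sas-grid-pg"},
)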

The following diagram illustrates the SAS Grid with FSx for Lustre architecture on AWS.

SAS Grid architecture consists of mid-tier nodes, metadata servers, and Grid compute nodes. The mid-tier nodes are responsible for running the Platform Web Services (PWS) and Load Sharing Facility (LSF) components. These components dispatch submitted jobs and return the status of each job.

To effectively run PWS and LSF on mid-tier nodes, you need Amazon Elastic Compute Cloud (Amazon EC2) instances with high memory. For this use case, the r5 instance family would meet this requirement.

Metadata servers contain the metadata repository that stores the metadata definitions of all SAS Grid manager products, which the r5 instance family can also serve effectively. We recommend either meeting or exceeding the recommended memory requirement of 24 GB of RAM or 8 MB per physical core (whichever is larger). Metadata servers don’t need compute-intensive resources or high I/O bandwidth; therefore, you can choose the r5 instance family for a balance of price and performance.

SAS Grid nodes are responsible for executing the jobs received by the grid, and the EC2 instances capable of handling these jobs depend on the size, complexity, and volume of the work the grid performs. To meet the minimum requirements of SAS Grid workloads, we recommend a minimum of 8 GB of physical RAM per core and a robust I/O throughput of 100–125 MB/second per physical core. For this use case, the m5n and r5n EC2 instance families suffice in meeting the RAM and throughput requirements. You can host the SASDATA, SASWORK, and UTILLOC libraries in a shared file system. If you choose to offload SASWORK to instance storage, the i3en instance family meets this need because it supports more than 1.2 TB of instance storage. In the next section, we take a look at how throughput testing was performed to arrive at the EC2 instance recommendations with FSx for Lustre.

Steps to maximize storage I/O performance

SAS Grid requires a shared file system, and we wanted to benchmark the performance of FSx for Lustre as the chosen shared file system against various EC2 instance families that meet the minimum requirements of 8 GB of physical RAM per core and 100–125 MB/second throughput per physical core.

FSx for Lustre is a fully managed file storage service designed for applications that require fast storage. Because it is a POSIX-compliant file system, you can use FSx for Lustre with your current Linux-based applications without having to make any changes. Although FSx for Lustre offers a choice between scratch and persistent file systems, we recommend a persistent FSx for Lustre file system for SAS Grid because you need to store the SASWORK, SASDATA, and UTILLOC data and libraries for longer periods with high availability and data durability. To meet the I/O throughput requirement, make sure you select the appropriate storage capacity for your file system's throughput per unit of storage to achieve the desired range of 100–125 MB/second per physical core.
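As a back-of-the-envelope sketch only, with assumed values for the throughput tier and the core count (they are illustrations, not sizing guidance for your workload), the arithmetic looks like this:

# Rough sizing arithmetic; the tier and core count below are assumptions for illustration.
per_unit_throughput_mbps = 200      # assumed persistent file system tier: 200 MB/s per TiB
physical_cores = 96                 # assumed total physical cores across SAS Grid compute nodes
target_per_core_mbps = 125          # upper end of the 100-125 MB/s per physical core range

required_aggregate_mbps = physical_cores * target_per_core_mbps            # 12,000 MB/s
required_capacity_tib = required_aggregate_mbps / per_unit_throughput_mbps

print(f"Aggregate throughput needed: {required_aggregate_mbps} MB/s")
print(f"Minimum FSx for Lustre capacity at this tier: {required_capacity_tib:.1f} TiB")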

After setting up the file system, we recommend mounting FSx for Lustre with the flock mount option. The following code example is a mount command and mount option for FSx for Lustre:

$ sudo mount -t lustre -o noatime,flock fs-0123456789abcd.fsx.us-west-2.amazonaws.com@tcp:/za3atbmv /fsx
$ mount -t lustre
fs-0123456789abcd.fsx.us-west-2.amazonaws.com@tcp:/za3atbmv on /fsx type lustre (rw,noatime,seclabel,flock,lazystatfs)

Throughput testing and results

To select the best-suited EC2 instances for running SAS Grid with FSx for Lustre, we ran a series of highly parallel network throughput tests from individual EC2 instances against a 100.8 TiB persistent file system that had an aggregate throughput capacity of 19.688 GB/second. We ran these tests in four AWS Regions using multiple EC2 instance families (c5, c5n, i3, i3en, m5, m5a, m5ad, m5n, m5dn, r5, r5a, r5ad, r5n, and r5dn). The tests ran for 3 hours for each instance, and the DataWriteBytes metric of the file system was recorded every minute. Only one instance accessed the file system at a time, and the p99.9 results were captured. The metrics were consistent across all four Regions.

We observed that the i3en, m5n, m5dn, r5n, and r5dn EC2 instance families meet or exceed the minimum network performance and memory recommendations. For more information about the performance results, see the whitepaper Accelerating SAS Using High-Performing File Systems on Amazon Web Services. The i3 instance family is just shy of meeting the minimum network performance. If you want to use the instance storage for SASWORK and UTILLOC libraries, you can consider i3en instances.

The m5n and r5n families offer a good blend of price and performance, and we recommend the m5n instance family for SAS Grid nodes. However, if your workload is memory bound, consider using r5n instances, which provide more memory per physical core at a higher price point than m5n instances.

We also ran rhel_iotest.sh, which is available from the SAS technical support samples tool repository (SASTSST), using the same FSx for Lustre configuration as mentioned earlier. The following table shows the read and write performance per physical core for a variety of instance sizes in the m5n and r5n families.

Variable network performance peak per physical core:

Instance Type    Read (MB/second)    Write (MB/second)
m5n.large        850.20              357.07
m5n.xlarge       519.46              386.25
m5n.2xlarge      283.01              446.84
m5n.4xlarge      202.89              376.57
m5n.8xlarge      154.98              297.71
r5n.large        906.88              429.93
r5n.xlarge       488.36              455.76
r5n.2xlarge      256.96              471.65
r5n.4xlarge      203.31              390.03
r5n.8xlarge      149.63              299.45

To take advantage of the elasticity, scalability, and flexibility of the cloud, we recommend spreading the SAS Grid and compute workload over a larger number of smaller instances versus using a smaller number of larger instances. For mid-tier, use a minimum of two instances, and for metadata servers, we recommend a minimum of three instances for the SAS Grid architecture.

Conclusions

Before the FSx for Lustre file system was available, you had to use either Amazon Elastic File System (Amazon EFS) or a third-party file system from AWS Marketplace together with Amazon Elastic Block Store (Amazon EBS) for the SASWORK, SASDATA, and UTILLOC libraries and storage data. Each storage option came with its own settings and limitations, which could cause a loss in performance. With FSx for Lustre, you have a single solution for all SAS Grid storage requirements, which allows you to focus on running your business instead of maintaining a file system. We recommend that SAS administrators deploy SAS Grid compute nodes on m5n and r5n instances when accessing an FSx for Lustre file system.

If you have questions or suggestions, please leave a comment.

Build an AWS Well-Architected environment with the Analytics Lens

Post Syndicated from Nikki Rouda original https://aws.amazon.com/blogs/big-data/build-an-aws-well-architected-environment-with-the-analytics-lens/

Building a modern data platform on AWS enables you to collect data of all types, store it in a central, secure repository, and analyze it with purpose-built tools. Yet you may be unsure of how to get started and of the impact of certain design decisions. To address the need for advice tailored to specific technology and application domains, AWS added the concept of well-architected lenses in 2017. AWS is now happy to announce the Analytics Lens for the AWS Well-Architected Framework. This post provides an introduction to its purpose, topics covered, common scenarios, and services included.

The new Analytics Lens offers comprehensive guidance to make sure that your analytics applications are designed in accordance with AWS best practices. The goal is to give you a consistent way to design and evaluate cloud architectures, based on the following five pillars:

  • Operational excellence
  • Security
  • Reliability
  • Performance efficiency
  • Cost optimization

The tool can help you assess the analytics workloads you have deployed in AWS by identifying potential risks and offering suggestions for improvements.

Using the Analytics Lens to address common requirements

The Analytics Lens models both the data architecture at the core of the analytics applications and the application behavior itself. These models are organized into the following six areas, which encompass the vast majority of analytics workloads deployed on AWS:

  1. Data ingestion
  2. Security and governance
  3. Catalog and search
  4. Central storage
  5. Processing and analytics
  6. User access

The following diagram illustrates these areas and their related AWS services.

There are a number of common scenarios where the Analytics Lens applies, such as the following:

  • Building a data lake as the foundation for your data and analytics initiatives
  • Efficient batch data processing at scale
  • Building a platform for streaming ingest and real-time event processing
  • Handling big data processing and streaming
  • Data-preparation operations

Whichever of these scenarios fits your needs, building to the principles of the Analytics Lens in the AWS Well-Architected Framework can help you implement best practices for success.

The Analytics Lens explains when and how to use the core services in the AWS analytics portfolio. These include Amazon Kinesis, Amazon Redshift, Amazon EMR, Amazon Athena, AWS Glue, and AWS Lake Formation. It also explains how Amazon Simple Storage Service (Amazon S3) can serve as the storage for your data lake and how to integrate with relevant AWS security services. With reference architectures, best practices advice, and answers to common questions, the Analytics Lens can help you make the right design decisions.

Conclusion

Applying the lens to your existing architectures can validate the stability and efficiency of your design (or provide recommendations to address the gaps that are identified). AWS is committed to the Analytics Lens as a living tool; as the analytics landscape evolves and new AWS services come online, we’ll update the Analytics Lens appropriately. Our mission will always be to help you design and deploy well-architected applications.

For more information about building your own Well-Architected environment using the Analytics Lens, see the Analytics Lens whitepaper.

Special thanks to the following individuals who contributed to building this resource, among many others who helped with review and implementation: Radhika Ravirala, Laith Al-Saadoon, Wallace Printz, Ujjwal Ratan, and Neil Mukerje.

Are there questions you’d like to see answered in the tool? Share your thoughts and questions in the comments.

About the Authors

Nikki Rouda is the principal product marketing manager for data lakes and big data at Amazon Web Services. Nikki has spent 20+ years helping enterprises in 40+ countries develop and implement solutions to their analytics and IT infrastructure challenges. Nikki holds an MBA from the University of Cambridge and an ScB in geophysics and math from Brown University.


Radhika Ravirala is a specialist solutions architect at Amazon Web Services, where she helps customers craft distributed analytics applications on the AWS platform. Prior to her cloud journey, she worked as a software engineer and designer for technology companies in Silicon Valley.

Simplify data pipelines with AWS Glue automatic code generation and Workflows

Post Syndicated from Mohit Saxena original https://aws.amazon.com/blogs/big-data/simplify-data-pipelines-with-aws-glue-automatic-code-generation-and-workflows/

In the previous post of the series, we discussed how AWS Glue job bookmarks help you to incrementally load data from Amazon S3 and relational databases. We also saw how using the AWS Glue optimized Apache Parquet writer can help improve performance and manage schema evolution.

In the third post of the series, we’ll discuss three topics. First, we’ll look at how AWS Glue can automatically generate code to help transform data in common use cases such as selecting specific columns, flattening deeply nested records, efficiently parsing nested fields, and handling column data type evolution.

Second, we’ll outline how to use AWS Glue Workflows to build and orchestrate data pipelines using different Glue components such as Crawlers, Apache Spark and Python Shell ETL jobs.

Third, we’ll see how to leverage SparkSQL in your ETL jobs to perform SQL based transformations on datasets stored in Amazon S3 and relational databases.

Automatic Code Generation & Transformations: ApplyMapping, Relationalize, Unbox, ResolveChoice

AWS Glue can automatically generate code to help perform a variety of useful data transformation tasks. These transformations provide a simple-to-use interface for working with complex and deeply nested datasets. For example, some relational databases or data warehouses do not natively support nested data structures. AWS Glue can automatically generate the code necessary to flatten those nested data structures before loading them into the target database, saving time and enabling non-technical users to work with data.

The following is a list of the popular transformations AWS Glue provides to simplify data processing:

  1. ApplyMapping is a transformation used to perform column projection and convert between data types. In this example, we use it to unnest several fields, such as actor.id, which we map to the top-level actor.id field. We also cast the id column to a long.
    medicare_output = medicare_src.apply_mapping(
        [('id', 'string', 'id', 'long'),
         ('type', 'string', 'type', 'string'),
         ('actor.id', 'int', 'actor.id', 'int'),
         ('actor.login', 'string', 'actor.login', 'string'),
         ('actor.display_login', 'string', 'actor.display_login', 'string'),
         ('actor.gravatar_id', 'long', 'actor.gravatar_id', 'long'),
         ('actor.url', 'string', 'actor.url', 'string'),
         ('actor.avatar_url', 'string', 'actor.avatar_url', 'string')]
    )

  2. Relationalize converts a nested dataset stored in a DynamicFrame to a relational (rows and columns) format. Nested structures are unnested into top-level columns, and arrays are decomposed into different tables with appropriate primary and foreign keys inserted. The result is a collection of DynamicFrames representing a set of tables that can be directly inserted into a relational database. More detail about Relationalize can be found here.
    ## An example relationalizing and writing to Redshift
    dfc = history.relationalize("hist_root", redshift_temp_dir)
    ## Cycle through results and write to Redshift.
    for df_name in dfc.keys():
        df = dfc.select(df_name)
        print("Writing to Redshift table: ", df_name, " ...")
        glueContext.write_dynamic_frame.from_jdbc_conf(frame = df, 
            catalog_connection = "redshift3", 
            connection_options = {"dbtable": df_name, "database": "testdb"}, 
            redshift_tmp_dir = redshift_temp_dir)

  3. Unbox parses a string field of a certain type, such as JSON, into individual fields with their corresponding data types and stores the result in a DynamicFrame. For example, you may have a CSV file with one field that is in JSON format {"a": 3, "b": "foo", "c": 1.2}. Unbox reformats the JSON string into three distinct fields: an int, a string, and a double. The Unbox transformation is commonly used to replace costly Python user-defined functions required to reformat data, which may result in Apache Spark out-of-memory exceptions. The following example shows how to use Unbox:
    df_result = df_json.unbox('json', 'json')

  4. ResolveChoice: AWS Glue Dynamic Frames support data where a column can have fields with different types. These columns are represented with the Dynamic Frame choice type. For example, the Dynamic Frame schema for the Medicare dataset shows up as follows:
    root
     |-- drg definition: string
     |-- provider id: choice
     |    |-- long
     |    |-- string
     |-- provider name: string
     |-- provider street address: string

    This is because the “provider id” column could be either a long or a string type. The Apache Spark DataFrame considers the whole dataset and is forced to cast the column to the most general type, namely string. Dynamic Frames allow you to cast the type using the ResolveChoice transform. For example, you can cast the column to long type as follows.

    medicare_res = medicare_dynamicframe.resolveChoice(specs = [('provider id','cast:long')])
    
    medicare_res.printSchema()
     
    root
     |-- drg definition: string
     |-- provider id: long
     |-- provider name: string
     |-- provider street address: string

    This transform would also insert a null where the value was a string that could not be cast. As a result, you can now also identify the records whose string values were cast to null. Alternatively, the choice type can be cast to struct, which keeps values of both types.

Build and orchestrate data pipelines using AWS Glue Workflows

AWS Glue Workflows provide a visual tool to author data pipelines by combining Glue crawlers for schema discovery with Glue Spark and Python shell jobs to transform the data. You can define relationships and pass parameters between task nodes to build pipelines of varying complexity. Workflows can run on a schedule or be triggered programmatically. You can track the progress of each node independently or of the entire workflow, which makes it easier to troubleshoot your pipelines.
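As a hedged sketch of how such a pipeline could be wired up programmatically, the following boto3 code creates a workflow with a scheduled crawler trigger and a conditional job trigger. The workflow, crawler, and job names are hypothetical placeholders.

import boto3

glue = boto3.client("glue")

# Hypothetical workflow: a crawler discovers new partitions, then an ETL job processes them.
glue.create_workflow(Name="cloudtrail-etl-workflow")

# Start the workflow on a schedule (hourly in this sketch).
glue.create_trigger(
    Name="start-hourly",
    WorkflowName="cloudtrail-etl-workflow",
    Type="SCHEDULED",
    Schedule="cron(0 * * * ? *)",
    Actions=[{"CrawlerName": "cloudtrail-partition-crawler"}],  # placeholder crawler
    StartOnCreation=True,
)

# Run the Spark ETL job only after the crawler succeeds.
glue.create_trigger(
    Name="run-etl-after-crawl",
    WorkflowName="cloudtrail-etl-workflow",
    Type="CONDITIONAL",
    Predicate={
        "Conditions": [
            {
                "LogicalOperator": "EQUALS",
                "CrawlerName": "cloudtrail-partition-crawler",
                "CrawlState": "SUCCEEDED",
            }
        ]
    },
    Actions=[{"JobName": "cloudtrail-etl-job"}],  # placeholder job
    StartOnCreation=True,
)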

A typical workflow for ETL workloads is organized as follows:

  1. A Glue Python shell command is triggered manually, on a schedule, or by an external CloudWatch event. It pre-processes or lists the partitions in Amazon S3 for a table under a base location. For example, a CloudTrail logs partition to process could be: s3://AWSLogs/ACCOUNTID/CloudTrail/REGION/YEAR/MONTH/DAY/HOUR/. The Python command can list all the Regions and schedule crawlers to create different Glue Data Catalog tables for each Region.
  2. Glue crawlers are triggered next to populate new partitions for every hour in the Glue Data Catalog for data recently ingested into Amazon S3.
  3. Concurrent Glue ETL jobs are triggered to separately filter and process each partition or group of partitions. For example, CloudTrail events corresponding to the last week can be read by a Glue ETL job by passing in the partition prefix as Glue job parameters and using Glue ETL push down predicates to read only the partitions in that prefix (a sketch of this push down predicate approach appears below). Partitioning and orchestrating concurrent Glue ETL jobs allows you to scale and reliably run individual Apache Spark applications by processing only a subset of partitions in the Glue Data Catalog table. The transformed data can then be concurrently written back by all individual Glue ETL jobs to a common target table in an Amazon S3 data lake, Amazon Redshift, or other databases.

Finally, a Glue Python shell command can be triggered to capture the completion status of the different Glue entities, including Glue crawlers and parallel Glue ETL jobs, and to post-process or retry any failed components.
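A minimal PySpark sketch of the push down predicate approach mentioned above follows; it assumes a Glue job or development endpoint, and the database, table, and partition key names are placeholders for illustration.

# Read only one week of CloudTrail partitions using a push down predicate,
# so Spark lists and loads just that subset of the catalog table's partitions.
from awsglue.context import GlueContext
from pyspark.context import SparkContext

glue_context = GlueContext(SparkContext.getOrCreate())

# Placeholder database/table names; partition keys assumed to be year/month/day.
events = glue_context.create_dynamic_frame.from_catalog(
    database="cloudtrail_db",
    table_name="cloudtrail_logs",
    push_down_predicate="year='2020' AND month='06' AND day>='01' AND day<='07'",
)
print(events.count())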

Executing SQL using SparkSQL in AWS Glue

AWS Glue Data Catalog as Hive Compatible Metastore

The AWS Glue Data Catalog is a managed metadata repository compatible with the Apache Hive Metastore API. You can follow the detailed instructions here to configure your AWS Glue ETL jobs and development endpoints to use the Glue Data Catalog. You also need to add the Hive SerDes to the class path of AWS Glue Jobs to serialize/deserialize data for the corresponding formats. You can then natively run Apache Spark SQL queries against your tables stored in the Data Catalog.

The following example assumes that you have crawled the US legislators dataset available at s3://awsglue-datasets/examples/us-legislators. We’ll use the Spark shell running on AWS Glue developer endpoint to execute SparkSQL queries directly on the legislators’ tables cataloged in the AWS Glue Data Catalog.

>>> spark.sql("use legislators")
DataFrame[]
>>> spark.sql("show tables").show()
+-----------+------------------+-----------+
|   database|         tableName|isTemporary|
+-----------+------------------+-----------+
|legislators|        areas_json|      false|
|legislators|    countries_json|      false|
|legislators|       events_json|      false|
|legislators|  memberships_json|      false|
|legislators|organizations_json|      false|
|legislators|      persons_json|      false|

>>> spark.sql("select distinct organization_id from memberships_json").show()
+--------------------+
|     organization_id|
+--------------------+
|d56acebe-8fdc-47b...|
|8fa6c3d2-71dc-478...|
+--------------------+

A similar approach would be to use the AWS Glue DynamicFrame API to read the data from S3. The DynamicFrame is then converted to a Spark DataFrame using the toDF method. Next, a temporary view can be registered for the DataFrame, which can be queried using SparkSQL. The key difference between the two approaches is the use of Hive SerDes for the first approach and native Glue/Spark readers for the second. The use of native Glue/Spark readers provides performance and flexibility benefits such as computation of the schema at runtime, schema evolution, and job bookmark support for Glue Dynamic Frames.

>>> memberships = glueContext.create_dynamic_frame.from_catalog(database="legislators", table_name="memberships_json")
>>> memberships.toDF().createOrReplaceTempView("memberships")
>>> spark.sql("select distinct organization_id from memberships").show()
+--------------------+
|     organization_id|
+--------------------+
|d56acebe-8fdc-47b...|
|8fa6c3d2-71dc-478...|
+--------------------+

Workflows and S3 Consistency

If you have a workflow of external processes ingesting data into S3, or upstream AWS Glue jobs generating input for a table used by downstream jobs in a workflow, you can encounter the following Apache Spark errors.

Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 10 in stage 16.0 failed 4 times, most recent failure: Lost task 10.3 in stage 16.0 (TID 761, ip-<>.ec2.internal, executor 1): 
java.io.FileNotFoundException: No such file or directory 's3://<bucket>/fileprefix-c000.snappy.parquet'
It is possible the underlying files have been updated.
You can explicitly invalidate the cache in Spark by running 
'REFRESH TABLE tableName' command in SQL or by recreating the Dataset/DataFrame involved.

These errors happen when upstream jobs overwrite the same S3 objects that the downstream jobs are concurrently listing or reading. This can also happen due to the eventual consistency of S3, which can result in overwritten or deleted objects being updated at a later time while the downstream jobs are reading them. A common manifestation of this error occurs when you create a SparkSQL view and run SQL queries in the downstream job. To avoid these errors, the best practice is to set up a workflow with upstream and downstream jobs scheduled at different times, and to read from and write to different S3 partitions based on time.

You can also enable the S3-optimized output committer for your Glue jobs by passing in a special job parameter: "--enable-s3-parquet-optimized-committer" set to true. This committer improves application performance by avoiding list and rename operations in Amazon S3 during job and task commit phases. It also avoids issues that can occur with Amazon S3’s eventual consistency during job and task commit phases, and helps to minimize task failures.
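As an illustrative sketch (the job name, IAM role, script location, and Glue version are placeholders), you can set this parameter as a default argument when defining the job with boto3:

import boto3

glue = boto3.client("glue")

# Hypothetical job definition; the committer is enabled through a default job argument.
glue.create_job(
    Name="cloudtrail-etl-job",
    Role="GlueServiceRole",                                          # placeholder IAM role
    Command={
        "Name": "glueetl",
        "ScriptLocation": "s3://my-glue-scripts/cloudtrail_etl.py",  # placeholder script
        "PythonVersion": "3",
    },
    DefaultArguments={
        "--enable-s3-parquet-optimized-committer": "true",
    },
    GlueVersion="2.0",
)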

Conclusion

In this post, we discussed how to leverage the automatic code generation process in AWS Glue ETL to simplify common data manipulation tasks such as data type conversion and flattening complex structures. We also explored using AWS Glue Workflows to build and orchestrate data pipelines of varying complexity. Lastly, we looked at how you can leverage the power of SQL, with the use of AWS Glue ETL and Glue Data Catalog, to query and transform your data.

In the final post, we will explore specific capabilities in AWS Glue and best practices to help you better manage the performance, scalability and operation of AWS Glue Apache Spark jobs.

About the Authors

Mohit Saxena is a technical lead manager at AWS Glue. His passion is building scalable distributed systems for efficiently managing data in the cloud. He also enjoys watching movies and reading about the latest technology.


IAM Access Analyzer flags unintended access to S3 buckets shared through access points

Post Syndicated from Andrea Nedic original https://aws.amazon.com/blogs/security/iam-access-analyzer-flags-unintended-access-to-s3-buckets-shared-through-access-points/

Customers use Amazon Simple Storage Service (S3) buckets to store critical data and manage access to data at scale. With Amazon S3 Access Points, customers can easily manage shared data sets by creating separate access points for individual applications. Access points are unique hostnames attached to a bucket and customers can set distinct permissions using access point policies. To help you identify buckets that can be accessed publicly or from other AWS accounts or organizations, AWS Identity and Access Management (IAM) Access Analyzer mathematically analyzes resource policies. Now, Access Analyzer analyzes access point policies in addition to bucket policies and bucket ACLs. This helps you find unintended access to S3 buckets that use access points. Access Analyzer makes it easier to identify and remediate unintended public, cross-account, or cross-organization sharing of your S3 buckets that use access points. This enables you to restrict bucket access and adhere to the security best practice of least privilege.

In this post, first I review Access Analyzer and how to enable it. Then I walk through an example of how to use Access Analyzer to identify an S3 bucket that is shared through an access point. Finally, I show you how to view Access Analyzer bucket findings in the S3 Management Console.

IAM Access Analyzer overview

Access Analyzer helps you determine which resources can be accessed publicly or from other accounts or organizations. Access Analyzer determines this by mathematically analyzing access control policies attached to resources. This form of analysis, called automated reasoning, applies logic and mathematical inference to determine all possible access paths allowed by a resource policy. This is how IAM Access Analyzer uses provable security to deliver comprehensive findings for unintended bucket access. You can enable Access Analyzer by navigating to the IAM console. From there, select Access Analyzer to create an analyzer for an account or an organization.
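If you prefer to do this programmatically, the following boto3 sketch creates an account-level analyzer; the analyzer name is arbitrary, and an organization-level analyzer would use type='ORGANIZATION' from the management or delegated administrator account instead.

import boto3

analyzer = boto3.client("accessanalyzer")

# Account-level analyzer; swap in type="ORGANIZATION" to analyze across an AWS organization.
response = analyzer.create_analyzer(
    analyzerName="account-analyzer",   # placeholder name
    type="ACCOUNT",
)
print(response["arn"])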

How to use IAM Access Analyzer to identify an S3 bucket shared through an access point

Once you’ve created your analyzer, you can view findings for resources that can be accessed publicly or from other AWS accounts or organizations. For your S3 bucket findings, the Shared through column indicates whether a bucket is shared through its S3 bucket policy, one of its access points, or the bucket ACL. Looking at the Shared through column in the image below, we see the first finding is shared through an Access point.

Figure 1: IAM Access Analyzer report of findings for resources shared outside of my account

If you use access points to manage bucket access and one of your buckets is shared through an access point, the bucket finding indicates Access point. In this example, I select the first finding to learn more. In the detail view shown below, you can see that the Shared through field lists the Amazon Resource Name (ARN) of the access point that grants access to the bucket, along with the details of the resources and principals. If this access wasn’t your intent, you can review the access point details in the S3 console, where you can modify the access point policy to remove access.

Figure 2: IAM Access Analyzer finding details for a bucket shared through an access point

How to use Access Analyzer for S3 to identify an S3 bucket shared through an access point

You can also view Access Analyzer findings for S3 buckets in the S3 Management Console with Access Analyzer for S3. This view reports S3 buckets that are configured to allow access to anyone on the internet or to other AWS accounts, including accounts outside of your AWS organization. For each public or shared bucket, Access Analyzer for S3 displays whether the bucket is shared through the bucket policy, access points, or the bucket ACL. In the example below, we see that my-test-public-bucket is set to public access through a bucket policy and a bucket ACL. Additionally, my-test-bucket shares access with other AWS accounts through a bucket policy and one or more access points. After you identify a bucket with unintended access using Access Analyzer for S3, you can Block Public Access to the bucket. Amazon S3 block public access settings override the bucket policies that are applied to the bucket. The settings also override the access point policies applied to the bucket’s access points.

Figure 3: Access Analyzer for S3 findings report in the S3 Management Console
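If you decide a flagged bucket should never be public, a minimal boto3 sketch for enabling Block Public Access on it looks like the following; the bucket name is taken from the example above, and all four settings are turned on.

import boto3

s3 = boto3.client("s3")

# Turn on all four Block Public Access settings for a bucket flagged by Access Analyzer for S3.
s3.put_public_access_block(
    Bucket="my-test-public-bucket",  # bucket name from the example above
    PublicAccessBlockConfiguration={
        "BlockPublicAcls": True,
        "IgnorePublicAcls": True,
        "BlockPublicPolicy": True,
        "RestrictPublicBuckets": True,
    },
)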

Next steps

To turn on IAM Access Analyzer at no additional cost, head over to the IAM console. IAM Access Analyzer is available in the IAM console and through APIs in all commercial AWS Regions and AWS GovCloud (US). To learn more about IAM Access Analyzer, visit the feature page.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the AWS IAM Forum or contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Andrea Nedic

Andrea is a Senior Technical Program Manager in the AWS Automated Reasoning Group. She enjoys hearing from customers about how they build on AWS. Outside of work, Andrea likes to ski, dance, and be outdoors. She holds a PhD from Princeton University.