Tag Archives: Amazon Simple Storage Service (S3)

Optimizing the cost of serverless web applications

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/optimizing-the-cost-of-serverless-web-applications/

Web application backends are one of the most common serverless use cases for customers. The pay-for-value model can make it cost-efficient to build web applications using serverless tools.

While serverless cost is generally correlated with level of usage, there are architectural decisions that impact cost efficiency. The impact of these choices is more significant as your traffic grows, so it’s important to consider the cost-effectiveness of different designs and patterns.

This blog post reviews some common areas in web applications where you may be able to optimize cost. It uses the Happy Path web application as a reference example, which you can read about in the introductory blog post.

Serverless web applications generally use a combination of the services in the following diagram. I cover each of these layers to highlight common opportunities for cost optimization.

Serverless architecture by AWS service

The API management layer: Selecting the right API type

Most serverless web applications use an API between the frontend client and the backend architecture. Amazon API Gateway is a common choice since it is a fully managed service that scales automatically. There are three types of API offered by the service – REST APIs, WebSocket APIs, and the more recent HTTP APIs.

HTTP APIs offer many of the features of REST APIs, often at around 70% lower cost. They support Lambda integrations, JWT authorization, CORS, and custom domain names, and they have a simpler deployment model than REST APIs. This feature set tends to work well for web applications, many of which mainly use these capabilities. Additionally, HTTP APIs will gain feature parity with REST APIs over time.

The Happy Path application is designed for 100,000 monthly active users. It uses HTTP APIs, and you can inspect the backend/template.yaml to see how to define these in the AWS Serverless Application Model (AWS SAM). If you have existing AWS SAM templates that are using REST APIs, in many cases you can change these easily:

REST to HTTP API

Content distribution layer: Optimizing assets

Amazon CloudFront is a content delivery network (CDN). It enables you to distribute content globally across 216 Points of Presence without deploying or managing any infrastructure. It reduces latency for users who are geographically dispersed and can also reduce load on other parts of your service.

A typical web application uses CDNs in a couple of different ways. First, there is the distribution of the application itself. For single-page application frameworks like React or Vue.js, the build processes create static assets that are ideal for serving over a CDN.

However, these builds may not be optimized and can be larger than necessary. Many frameworks offer optimization plugins, and the JavaScript community frequently uses Webpack to bundle modules and shrink deployment packages. Similarly, any media assets used in the application build should be optimized. You can use tools like Lighthouse to analyze your web apps to find images that can be resized or compressed.

Optimizing images

The second common CDN use-case for web apps is for user-generated content (UGC). Many apps allow users to upload images, which are then shared with other users. A typical photo from a 12-megapixel smartphone is 3–9 MB in size. This high resolution is not necessary when photos are rendered within web apps. Displaying the high-resolution asset results in slower download performance and higher data transfer costs.

The Happy Path application uses a Resizer Lambda function to optimize these uploaded assets. This process creates two different optimized images depending upon which component loads the asset.

Image sizes in front-end applications

The upload S3 bucket shows the original size of the upload from the smartphone:

The distribution S3 bucket contains the two optimized images at different sizes:

Optimized images in the distribution S3 bucket

The distribution file sizes are 98–99% smaller. For a busy web application, using optimized image assets can make a significant difference to data transfer and CloudFront costs.
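
For illustration, a resizer function along the lines of the Happy Path Resizer Lambda might look like the following Python sketch. It assumes the Pillow library is available to the function (for example, through a Lambda layer), and the distribution bucket name and target widths are placeholder values rather than the application’s actual settings:

import io
import os

import boto3
from PIL import Image

s3 = boto3.client("s3")
DISTRIBUTION_BUCKET = os.environ.get("DISTRIBUTION_BUCKET", "my-distribution-bucket")  # placeholder
TARGET_WIDTHS = [480, 1024]  # one size per consuming component (assumed values)


def handler(event, context):
    # Process each uploaded object in the S3 event notification
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]

        original = s3.get_object(Bucket=bucket, Key=key)["Body"].read()
        image = Image.open(io.BytesIO(original)).convert("RGB")

        for width in TARGET_WIDTHS:
            # Preserve the aspect ratio while shrinking to the target width
            ratio = width / float(image.width)
            resized = image.resize((width, int(image.height * ratio)))

            buffer = io.BytesIO()
            resized.save(buffer, format="JPEG", quality=80, optimize=True)
            buffer.seek(0)

            # Write the optimized variant to the distribution bucket
            s3.put_object(
                Bucket=DISTRIBUTION_BUCKET,
                Key=f"{width}/{key}",
                Body=buffer,
                ContentType="image/jpeg",
            )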

Additionally, you can convert images to highly optimized file formats such as WebP to reduce file size even further. Not all browsers support this format, but you can fall back to another format in the frontend markup if needed:

<img src="myImage.webp" onerror="this.onerror=null; this.src='myImage.jpg'">

The data layer

AWS offers many different database and storage options that can be useful for web applications. Billing models vary by service and Region. By understanding the data access and storage requirements of your app, you can make informed decisions about the right service to use.

Generally, it’s more cost-effective to store binary data in S3 than in a database. First, clients can upload data directly to S3 with presigned URLs instead of proxying the data via API Gateway or another service.
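
As a minimal sketch, a Lambda function behind your API could hand out a presigned upload URL like the following; the bucket name, key prefix, and expiry are assumptions for illustration:

import json
import uuid

import boto3

s3 = boto3.client("s3")
UPLOAD_BUCKET = "my-upload-bucket"  # placeholder


def handler(event, context):
    key = f"uploads/{uuid.uuid4()}.jpg"
    url = s3.generate_presigned_url(
        "put_object",
        Params={"Bucket": UPLOAD_BUCKET, "Key": key, "ContentType": "image/jpeg"},
        ExpiresIn=300,  # URL is valid for 5 minutes
    )
    # The client performs an HTTP PUT directly against this URL,
    # so the object bytes never pass through API Gateway or Lambda.
    return {"statusCode": 200, "body": json.dumps({"uploadUrl": url, "key": key})}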

If you are using Amazon DynamoDB, it’s best practice to store larger items in S3 and include a reference token in a table item. Part of DynamoDB pricing is based on read capacity units (RCUs). For binary items such as images, it is usually more cost-efficient to use S3 for storage.

Many web developers who are new to serverless are familiar with relational databases, so they choose Amazon RDS for their database needs. Depending upon your use case and data access patterns, it may be more cost-effective to use DynamoDB instead. RDS is not a serverless service, so there are monthly charges for the underlying compute instance. DynamoDB pricing is based upon usage and storage, so it may be a lower-cost choice for many web apps.

Integration layer

This layer includes services like Amazon SQS, Amazon SNS, and Amazon EventBridge, which are essential for decoupling serverless applications. Each of these has a request-based pricing component, where every 64 KB of a payload is billed as one request. For example, a single SQS message with a 256 KB payload is billed as four requests. There are several optimization methods common for web applications.

1. Combine messages

Many messages sent to these services are much smaller than 64 KB. In some applications, the publishing service can combine multiple messages to reduce the total number of publish actions to SNS. Additionally, by either eliminating unused attributes in the message or compressing the message, you can store more data in a single request.

For example, a publishing service may be able to combine multiple messages together in a single publish action to an SNS topic:

  • Before optimization, a publishing service sends 100,000,000 1KB-messages to an SNS topic. This is charged as 100 million messages for a total cost of $50.00.
  • After optimization, the publishing service combines messages to send 1,562,500 64KB-messages to an SNS topic. This is charged as 1,562,500 messages for a total cost of $0.78.
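
Building on the example above, the following Python sketch shows one way a publisher might batch small records into combined SNS messages while staying under the 64 KB billing increment. The topic ARN is a placeholder, and it assumes the subscriber knows how to unpack the batched payload:

import json

import boto3

sns = boto3.client("sns")
TOPIC_ARN = "arn:aws:sns:us-east-1:123456789012:my-topic"  # placeholder
MAX_BATCH_BYTES = 64 * 1024  # stay within one billed 64 KB request


def publish_combined(records):
    batch, batch_size = [], 0
    for record in records:
        encoded = json.dumps(record)
        # Flush the current batch if adding this record would exceed the limit
        if batch and batch_size + len(encoded) > MAX_BATCH_BYTES:
            sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(batch))
            batch, batch_size = [], 0
        batch.append(record)
        batch_size += len(encoded)
    if batch:
        sns.publish(TopicArn=TOPIC_ARN, Message=json.dumps(batch))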

2. Filter messages

In many applications, not every message is useful for a consuming service. For example, an SNS topic may publish to a Lambda function, which checks the content and discards the message based on some criteria. In this case, it’s more cost effective to use the native filtering capabilities of SNS. The service can filter messages and only invoke the Lambda function if the criteria is met. This lowers the compute cost by only invoking Lambda when necessary.

For example, an SNS topic receives messages about customer orders and forwards these to a Lambda function subscriber. The function is only interested in canceled orders and discards all other messages:

  • Before optimization, the SNS topic sends all messages to a Lambda function. It evaluates the message for the presence of an order canceled attribute. On average, only 25% of the messages are processed further. While SNS does not charge for delivery to Lambda functions, you are charged each time the Lambda service is invoked, for 100% of the messages.
  • After optimization, using an SNS subscription filter policy, the SNS subscription filters for canceled orders and only forwards matching messages. Since the Lambda function is only invoked for 25% of the messages, this may reduce the total compute cost by up to 75%.
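
As a sketch of that setup, you could attach a filter policy to the existing SNS subscription using boto3; the subscription ARN and the order_status message attribute name are assumptions for illustration:

import json

import boto3

sns = boto3.client("sns")

# Only deliver messages whose order_status attribute is "canceled"
sns.set_subscription_attributes(
    SubscriptionArn="arn:aws:sns:us-east-1:123456789012:orders:subscription-id",  # placeholder
    AttributeName="FilterPolicy",
    AttributeValue=json.dumps({"order_status": ["canceled"]}),
)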

3. Choose a different messaging service

For complex filtering options based upon matching patterns, you can use EventBridge. The service can filter messages based upon prefix matching, numeric matching, and other patterns, combining several rules into a single filter. You can create branching logic within the EventBridge rule to invoke downstream targets.

EventBridge offers a broader range of targets than SNS destinations. In cases where you publish from an SNS topic to a Lambda function to invoke an EventBridge target, you could use EventBridge instead and eliminate the Lambda invocation. For example, instead of routing from SNS to Lambda to AWS Step Functions, create an EventBridge rule that routes events directly to a state machine.
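
A minimal boto3 sketch of that pattern might look like the following; the event source, pattern fields, ARNs, and IAM role are placeholders for illustration:

import json

import boto3

events = boto3.client("events")

# Match only canceled-order events from the publishing application
events.put_rule(
    Name="canceled-orders-to-state-machine",
    EventPattern=json.dumps({
        "source": ["com.example.orders"],
        "detail": {"order_status": ["canceled"]},
    }),
)

# Route matching events straight to the Step Functions state machine
events.put_targets(
    Rule="canceled-orders-to-state-machine",
    Targets=[{
        "Id": "order-cancellation-workflow",
        "Arn": "arn:aws:states:us-east-1:123456789012:stateMachine:CancelOrder",
        # IAM role that allows EventBridge to start the state machine execution
        "RoleArn": "arn:aws:iam::123456789012:role/EventBridgeInvokeStepFunctions",
    }],
)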

Business logic layer

Step Functions allows you to orchestrate complex workflows in serverless applications while eliminating common boilerplate code. Standard Workflows are charged per state transition. Express Workflows were introduced in December 2019, with pricing based on requests and duration instead of transitions.

For workloads that process large numbers of events in shorter durations, Express Workflows can be more cost-effective. They are designed for high-volume event workloads, such as streaming data processing or IoT data ingestion. For these cases, compare the cost of the two workflow types to see if you can reduce cost by switching.

Lambda is the on-demand compute layer in serverless applications, which is billed by requests and GB-seconds. GB-seconds is calculated by multiplying duration in seconds by memory allocated to the function. For a function with a 1-second duration, invoked 1 million times, here is how memory allocation affects the total cost in the US East (N. Virginia) Region:

Memory (MB) | GB-seconds | Compute cost | Total cost
128         | 125,000    | $2.08        | $2.28
512         | 500,000    | $8.34        | $8.54
1024        | 1,000,000  | $16.67       | $16.87
1536        | 1,500,000  | $25.01       | $25.21
2048        | 2,000,000  | $33.34       | $33.54
3008        | 2,937,500  | $48.97       | $49.17
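
As a quick check on the table above, here is a small Python sketch of the billing arithmetic, assuming the public US East (N. Virginia) rates of $0.0000166667 per GB-second and $0.20 per million requests:

GB_SECOND_RATE = 0.0000166667
REQUEST_RATE_PER_MILLION = 0.20


def lambda_cost(memory_mb, duration_seconds, invocations):
    # GB-seconds = memory in GB x duration x number of invocations
    gb_seconds = (memory_mb / 1024) * duration_seconds * invocations
    compute_cost = gb_seconds * GB_SECOND_RATE
    request_cost = (invocations / 1_000_000) * REQUEST_RATE_PER_MILLION
    return round(compute_cost + request_cost, 2)


print(lambda_cost(128, 1, 1_000_000))   # 2.28
print(lambda_cost(1024, 1, 1_000_000))  # 16.87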

There are many ways to optimize Lambda functions, but one of the most important choices is memory allocation. You can choose between 128 MB and 3008 MB, and the allocation also determines the amount of virtual CPU available to the function. Since total cost is a combination of memory and duration, choosing more memory can often reduce duration and lower overall cost.

Instead of manually setting the memory for a Lambda function and running executions to compare duration, you can use the AWS Lambda Power Tuning tool. This uses Step Functions to run your function against varying memory configurations. It can produce a visualization to find the optimal memory setting, based upon cost or execution time.

Optimizing costs with the AWS Lambda Power Tuning tool

Conclusion

Web application backends are one of the most popular workload types for serverless applications. The pay-per-value model works well for this type of workload. As traffic grows, it’s important to consider the design choices and service configurations used to optimize your cost.

Serverless web applications generally use a common range of services, which you can logically split into different layers. This post examines each layer and suggests common cost optimizations helpful for web app developers.

To learn more about building web apps with serverless, see the Happy Path series. For more serverless learning resources, visit https://serverlessland.com.

Apply record level changes from relational databases to Amazon S3 data lake using Apache Hudi on Amazon EMR and AWS Database Migration Service

Post Syndicated from Ninad Phatak original https://aws.amazon.com/blogs/big-data/apply-record-level-changes-from-relational-databases-to-amazon-s3-data-lake-using-apache-hudi-on-amazon-emr-and-aws-database-migration-service/

Data lakes give organizations the ability to harness data from multiple sources in less time. Users across different roles are now empowered to collaborate and analyze data in different ways, leading to better, faster decision-making. Amazon Simple Storage Service (Amazon S3) is the highly performant object storage service for structured and unstructured data and the storage service of choice to build a data lake.

However, many use cases like performing change data capture (CDC) from an upstream relational database to an Amazon S3-based data lake require handling data at a record level. Performing an operation like inserting, updating, and deleting individual records from a dataset requires the processing engine to read all the objects (files), make the changes, and rewrite the entire dataset as new files. Furthermore, making the data available in the data lake in near-real time often leads to the data being fragmented over many small files, resulting in poor query performance. Apache Hudi is an open-source data management framework that enables you to manage data at the record level in Amazon S3 data lakes, thereby simplifying building CDC pipelines and making it efficient to do streaming data ingestion. Datasets managed by Hudi are stored in Amazon S3 using open storage formats, and integrations with Presto, Apache Hive, Apache Spark, and the AWS Glue Data Catalog give you near real-time access to updated data using familiar tools. Hudi is supported in Amazon EMR and is automatically installed when you choose Spark, Hive, or Presto when deploying your EMR cluster.

In this post, we show you how to build a CDC pipeline that captures the data from an Amazon Relational Database Service (Amazon RDS) for MySQL database using AWS Database Migration Service (AWS DMS) and applies those changes to a dataset in Amazon S3 using Apache Hudi on Amazon EMR. Apache Hudi includes the utility HoodieDeltaStreamer, which provides an easy way to ingest data from many sources, such as a distributed file system or Kafka. It manages checkpointing, rollback, and recovery so you don’t need to keep track of what data has been read and processed from the source, which makes it easy to consume change data. It also allows for lightweight SQL-based transformations on the data as it is being ingested. For more information, see Writing Hudi Tables. Support for AWS DMS with HoodieDeltaStreamer is provided with Apache Hudi version 0.5.2 and is available on Amazon EMR 5.30.x and 6.1.0.

Architecture overview

The following diagram illustrates the architecture we deploy to build our CDC pipeline.

In this architecture, we have a MySQL instance on Amazon RDS. AWS DMS pulls full and incremental data (using the CDC feature of AWS DMS) into an S3 bucket in Parquet format. HoodieDeltaStreamer on an EMR cluster is used to process the full and incremental data to create a Hudi dataset. As the data in the MySQL database gets updated, the AWS DMS task picks up the changes and takes them to the raw S3 bucket. The HoodieDeltaStreamer job can be run on the EMR cluster at a certain frequency or in a continuous mode to apply these changes to the Hudi dataset in the Amazon S3 data lake. You can query this data with tools such as SparkSQL, Presto, or Apache Hive running on the EMR cluster, or with Amazon Athena.

Deploying the solution resources

We use AWS CloudFormation to deploy these components in your AWS account. Choose an AWS Region for deployment where the services used in this solution are available.

You need to meet the following prerequisites before deploying the CloudFormation template:

  • Have a VPC with at least two public subnets in your account.
  • Have an S3 bucket where you want to collect logs from the EMR cluster. This should be in the same AWS Region where you launch the CloudFormation stack.
  • Have an AWS Identity and Access Management (IAM) role dms-vpc-role. For instructions on creating one, see Security in AWS Database Migration Service.
  • If you’re deploying the stack in an account using the AWS Lake Formation permission model, validate the following settings:
    • The IAM user used to deploy the stack is added as a data lake administrator under Lake Formation or the IAM user used to deploy the stack has IAM privileges to create databases in the AWS Glue Data Catalog.
    • The Data Catalog settings under Lake Formation are configured to use only IAM access control for new databases and new tables in new databases. This makes sure that all access to the newly created databases and tables in the Data Catalog are controlled solely using IAM permissions.
    • IAMAllowedPrincipals is granted database creator privilege on the Lake Formation Database creators page.

If this privilege is not in place, grant it by choosing Grant and selecting the Create database permission.

These Lake Formation settings are required so that all permissions to the Data Catalog objects are controlled using IAM only.

Launching the CloudFormation stack

To launch the CloudFormation stack, complete the following steps:

  1. Choose Launch Stack:
  2. Provide the mandatory parameters in the Parameters section, including an S3 bucket to store the Amazon EMR logs and a CIDR IP range from where you want to access Amazon RDS for MySQL.
  3. Follow through the CloudFormation stack creation wizard, leaving the rest of the default values unchanged.
  4. On the final page, select I acknowledge that AWS CloudFormation might create IAM resources with custom names.
  5. Choose Create stack.
  6. When the stack creation is complete, record the details of the S3 bucket, EMR cluster, and Amazon RDS for MySQL database on the Outputs tab of the CloudFormation stack.

The CloudFormation template uses m5.xlarge and m5.2xlarge instances for the EMR cluster. If these instance types aren’t available in the Region or Availability Zone you have selected for deployment, the creation of the CloudFormation stack fails. If that happens, choose a Region or subnet where the instance type is available. For more information about working around this issue, see Instance Type Not Supported.

CloudFormation also creates and configures the AWS DMS endpoints and tasks with requisite connection attributes such as dataFormat, timestampColumnName, and parquetTimestampInMillisecond. For more information, see Extra connection attributes when using Amazon S3 as a target for AWS DMS.

The database instance deployed as part of the CloudFormation stack has already been created with the settings needed for AWS DMS to work in CDC mode on the database. These are:

  • binlog_format=ROW
  • binlog_checksum=NONE

Also, automatic backups are enabled on the RDS DB instance. This is a required attribute for AWS DMS to do CDC. For more information, see Using a MySQL-compatible database as a source for AWS DMS.

Running the end-to-end data flow

Now that the CloudFormation stack is deployed, we can run our data flow to get the full and incremental data from MySQL into a Hudi dataset in our data lake.

  1. As a best practice, retain your binlogs for at least 24 hours. Log in to your Amazon RDS for MySQL database using your SQL client and run the following command:
    call mysql.rds_set_configuration('binlog retention hours', 24)

  2. Create a table in the dev database:
    create table dev.retail_transactions(
    tran_id INT,
    tran_date DATE,
    store_id INT,
    store_city varchar(50),
    store_state char(2),
    item_code varchar(50),
    quantity INT,
    total FLOAT);

  3. When the table is created, insert some dummy data into the database:
    insert into dev.retail_transactions values(1,'2019-03-17',1,'CHICAGO','IL','XXXXXX',5,106.25);
    insert into dev.retail_transactions values(2,'2019-03-16',2,'NEW YORK','NY','XXXXXX',6,116.25);
    insert into dev.retail_transactions values(3,'2019-03-15',3,'SPRINGFIELD','IL','XXXXXX',7,126.25);
    insert into dev.retail_transactions values(4,'2019-03-17',4,'SAN FRANCISCO','CA','XXXXXX',8,136.25);
    insert into dev.retail_transactions values(5,'2019-03-11',1,'CHICAGO','IL','XXXXXX',9,146.25);
    insert into dev.retail_transactions values(6,'2019-03-18',1,'CHICAGO','IL','XXXXXX',10,156.25);
    insert into dev.retail_transactions values(7,'2019-03-14',2,'NEW YORK','NY','XXXXXX',11,166.25);
    insert into dev.retail_transactions values(8,'2019-03-11',1,'CHICAGO','IL','XXXXXX',12,176.25);
    insert into dev.retail_transactions values(9,'2019-03-10',4,'SAN FRANCISCO','CA','XXXXXX',13,186.25);
    insert into dev.retail_transactions values(10,'2019-03-13',1,'CHICAGO','IL','XXXXXX',14,196.25);
    insert into dev.retail_transactions values(11,'2019-03-14',5,'CHICAGO','IL','XXXXXX',15,106.25);
    insert into dev.retail_transactions values(12,'2019-03-15',6,'CHICAGO','IL','XXXXXX',16,116.25);
    insert into dev.retail_transactions values(13,'2019-03-16',7,'CHICAGO','IL','XXXXXX',17,126.25);
    insert into dev.retail_transactions values(14,'2019-03-16',7,'CHICAGO','IL','XXXXXX',17,126.25);
    

    We now use AWS DMS to start pushing this data to Amazon S3.

  4. On the AWS DMS console, run the task hudiblogload.

This task does a full load of the table to Amazon S3 and then starts writing incremental data.

If you’re prompted to test the AWS DMS endpoints while starting the AWS DMS task for the first time, you should do so. It’s generally a good practice to test the source and target endpoints before starting an AWS DMS task for the first time.

In a few minutes, the status of the task changes to Load complete, replication ongoing, which means that the full load is complete and the ongoing replication has started. You can go to the S3 bucket created by the stack and you should see a .parquet file under the dmsdata/dev/retail_transactions folder in your S3 bucket.

  5. On the Hardware tab of your EMR cluster, choose the master instance group and note the EC2 instance ID for the master instance.
  6. On the Systems Manager console, choose Session Manager.
  7. Choose Start Session to start a session with the master node of your cluster.

If you face challenges connecting to the master instance of the EMR cluster, see Troubleshooting Session Manager.

  8. Switch the user to Hadoop by running the following command:
    sudo su hadoop

In a real-life use case, the AWS DMS task starts writing incremental files to the same Amazon S3 location when the full load is complete. The way to distinguish full load files from incremental load files is that the full load files have names starting with LOAD, whereas CDC file names have date timestamps, as you see in a later step. From a processing perspective, we want to process the full load into the Hudi dataset and then start incremental data processing. To do this, we move the full load files to a different S3 folder under the same S3 bucket and process those before we start processing incremental files.

  9. Run the following command on the master node of the EMR cluster (replace <s3-bucket-name> with your actual bucket name):
    aws s3 mv s3://<s3-bucket-name>/dmsdata/dev/retail_transactions/ s3://<s3-bucket-name>/dmsdata/data-full/dev/retail_transactions/  --exclude "*" --include "LOAD*.parquet" --recursive

With the full table dump available in the data-full folder, we now use the HoodieDeltaStreamer utility on the EMR cluster to populate the Hudi dataset on Amazon S3.

  10. Run the following command to populate the Hudi dataset to the hudi folder in the same S3 bucket (replace <s3-bucket-name> with the name of the S3 bucket created by the CloudFormation stack):
    spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
      --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.2-incubating,org.apache.spark:spark-avro_2.11:2.4.5 \
      --master yarn --deploy-mode cluster \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      --conf spark.sql.hive.convertMetastoreParquet=false \
      /usr/lib/hudi/hudi-utilities-bundle_2.11-0.5.2-incubating.jar \
      --table-type COPY_ON_WRITE \
      --source-ordering-field dms_received_ts \
      --props s3://<s3-bucket-name>/properties/dfs-source-retail-transactions-full.properties \
      --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
      --target-base-path s3://<s3-bucket-name>/hudi/retail_transactions --target-table hudiblogdb.retail_transactions \
      --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
      --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
      --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
      --enable-hive-sync
    

The preceding command runs a Spark job that runs the HoodieDeltaStreamer utility. For more information about the parameters used in this command, see Writing Hudi Tables.

When the Spark job is complete, you can navigate to the AWS Glue console and find a table called retail_transactions created under the hudiblogdb database. The input format for the table is org.apache.hudi.hadoop.HoodieParquetInputFormat.

Next, we query the data and look at the data in the retail_transactions table in the catalog.

  11. In the Systems Manager session established earlier, run the following command (make sure that you have completed all the prerequisites for the post, including adding IAMAllowedPrincipals as a database creator in Lake Formation):
    spark-shell --conf "spark.serializer=org.apache.spark.serializer.KryoSerializer" --conf "spark.sql.hive.convertMetastoreParquet=false" \
    --packages org.apache.hudi:hudi-spark-bundle_2.11:0.5.2-incubating,org.apache.spark:spark-avro_2.11:2.4.5 \
    --jars /usr/lib/hudi/hudi-spark-bundle_2.11-0.5.2-incubating.jar,/usr/lib/spark/external/lib/spark-avro.jar
    

  12. Run the following query on the retail_transactions table:
    spark.sql("Select * from hudiblogdb.retail_transactions order by tran_id").show()

You should see the same data in the table as the MySQL database with a few columns added by the HoodieDeltaStreamer process.

We now run some DML statements on our MySQL database and take these changes through to the Hudi dataset.

  13. Run the following DML statements on the MySQL database:
    insert into dev.retail_transactions values(15,'2019-03-16',7,'CHICAGO','IL','XXXXXX',17,126.25);
    update dev.retail_transactions set store_city='SPRINGFIELD' where tran_id=12;
    delete from dev.retail_transactions where tran_id=2;

In a few minutes, you see a new .parquet file created under dmsdata/dev/retail_transactions folder in the S3 bucket.

  14. Run the following command on the EMR cluster to get the incremental data to the Hudi dataset (replace <s3-bucket-name> with the name of the S3 bucket created by the CloudFormation template):
    spark-submit --class org.apache.hudi.utilities.deltastreamer.HoodieDeltaStreamer \
      --packages org.apache.hudi:hudi-utilities-bundle_2.11:0.5.2-incubating,org.apache.spark:spark-avro_2.11:2.4.5 \
      --master yarn --deploy-mode cluster \
      --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
      --conf spark.sql.hive.convertMetastoreParquet=false \
      /usr/lib/hudi/hudi-utilities-bundle_2.11-0.5.2-incubating.jar \
      --table-type COPY_ON_WRITE \
      --source-ordering-field dms_received_ts \
      --props s3://<s3-bucket-name>/properties/dfs-source-retail-transactions-incremental.properties \
      --source-class org.apache.hudi.utilities.sources.ParquetDFSSource \
      --target-base-path s3://<s3-bucket-name>/hudi/retail_transactions --target-table hudiblogdb.retail_transactions \
      --transformer-class org.apache.hudi.utilities.transform.SqlQueryBasedTransformer \
      --payload-class org.apache.hudi.payload.AWSDmsAvroPayload \
      --schemaprovider-class org.apache.hudi.utilities.schema.FilebasedSchemaProvider \
      --enable-hive-sync \
      --checkpoint 0

The key difference between this command and the previous one is in the properties file that was used as an argument to the --props and --checkpoint parameters. For the earlier command that performed the full load, we used dfs-source-retail-transactions-full.properties; for the incremental one, we used dfs-source-retail-transactions-incremental.properties. The differences between these two property files are:

  • The location of source data changes between full and incremental data in Amazon S3.
  • The SQL transformer query included a hard-coded Op field for the full load task, because an AWS DMS first-time full load doesn’t include the Op field for Parquet datasets. The Op field can have values of I, U, or D, indicating insert, update, and delete operations.

We cover the details of the --checkpoint parameter in the Considerations when deploying to production section later in this post.

  15. When the job is complete, run the same query in spark-shell.

You should see these updates applied to the Hudi dataset.

You can use the Hudi CLI to administer Hudi datasets to view information about commits, the filesystem, statistics, and more.

  16. To do this, in the Systems Manager session, run the following command:
    /usr/lib/hudi/cli/bin/hudi-cli.sh

  17. Inside the Hudi CLI, run the following command (replace <s3-bucket-name> with the S3 bucket created by the CloudFormation stack):
    connect --path s3://<s3-bucket-name>/hudi/retail_transactions

  18. To inspect commits on your Hudi dataset, run the following command:
    commits show

You can also query incremental data from the Hudi dataset. This is particularly useful when you want to take incremental data for downstream processing, such as aggregations. Hudi provides multiple ways of pulling data incrementally, which are documented in the Hudi documentation. An example of how to use this feature is available in the Hudi Quick Start Guide.

Considerations when deploying to production

The preceding setup showed an example of how to build a CDC pipeline from your relational database to your Amazon S3-based data lake. However, if you want to use this solution for production, you should consider the following:

  • To ensure high availability, you can set up the AWS DMS instance in a Multi-AZ configuration.
  • The CloudFormation stack deployed the required properties files needed by the deltastreamer utility into the S3 bucket at s3://<s3-bucket-name>/properties/. You may need to customize these based on your requirements. For more information, see Configurations. There are a few parameters that may need your attention:
    • deltastreamer.transformer.sql – This property exposes an extremely powerful feature of the deltastreamer utility: it enables you to transform data on the fly as it’s being ingested and persisted in the Hudi dataset. In this post, we have shown a basic transformation that casts the tran_date column to a string, but you can apply any transformation as part of this query.
    • parquet.small.file.limit – This field is specified in bytes and is a critical storage configuration that determines how Hudi handles small files on Amazon S3. Small files can occur due to the number of records being processed in each insert per partition. Setting this value allows Hudi to continue to treat inserts in a particular partition as updates to the existing files, causing files up to the size of this small.file.limit to be rewritten and keep growing in size.
    • parquet.max.file.size – This is the maximum size of a single Parquet file in your Hudi dataset, after which a new file is created to store more data. For Amazon S3 storage and data querying needs, we can keep this around 256 MB–1 GB (256x1024x1024 = 268435456).
    • [Insert|Upsert|bulkinsert].shuffle.parallelism – In this post, we dealt with a small dataset of only a few records. However, in real-life situations, you might want to bring in hundreds of millions of records in the first load, after which the incremental CDC volume can potentially be millions of records per day. This parameter is very important to set when you want predictable control over the number of files in each of your Hudi dataset partitions. It is also needed to ensure you don’t hit the Apache Spark limit of 2 GB for data shuffle blocks when processing large amounts of data. For example, if you plan to load 200 GB of data in the first load and want to keep file sizes of approximately 256 MB, set the shuffle parallelism parameter for this dataset to 800 (200×1024/256). For more information, see the Tuning Guide.
  • In the incremental load deltastreamer command, we used an additional parameter: --checkpoint 0. When deltastreamer writes a Hudi dataset, it persists checkpoint information in the .commit files under the .hoodie folder. It uses this information in subsequent runs and only reads data from Amazon S3 that was created after this checkpoint time. In a production scenario, after you start the AWS DMS task, the task keeps writing incremental data to the target S3 folder as soon as the full load is complete. In the steps that we followed, we ran a command on the EMR cluster to manually move the full load files to another folder and process the data from there. When we did that, the timestamp associated with the S3 objects changed to the most current timestamp. If we ran the incremental load without the checkpoint argument, deltastreamer wouldn’t pick up any incremental data written to Amazon S3 before we manually moved the full load files. To make sure that all incremental data is processed by deltastreamer the first time, set the checkpoint to 0, which makes it process all incremental data in the folder. However, only use this parameter for the first incremental load and let deltastreamer use its own checkpointing methodology from that point onwards.
  • For this post, we ran the spark-submit command manually. However, in production, you can run it as a step on the EMR cluster.
  • You can either schedule the incremental data load command to run at a regular interval using a scheduling or orchestration tool, or run it in a continuous fashion at a certain frequency by passing additional parameters to the spark-submit command: --min-sync-interval-seconds XX --continuous, where XX is the number of seconds between each run of the data pull. For example, if you want to run the processing every 5 minutes, replace XX with 300.

Cleaning up

When you are done exploring the solution, complete the following steps to clean up the resources deployed by CloudFormation:

  1. Empty the S3 bucket created by the CloudFormation stack.
  2. Delete any Amazon EMR log files generated under s3://<EMR-Logs-S3-Bucket>/HudiBlogEMRLogs/.
  3. Stop the AWS DMS task Hudiblogload.
  4. Delete the CloudFormation stack.
  5. Delete any Amazon RDS for MySQL database snapshots retained after the CloudFormation template is deleted.

Conclusion

More and more data lakes are being built on Amazon S3, and these data lakes often need to be hydrated with change data from transactional systems. Handling deletes and upserts of data into the data lake using traditional methods involves a lot of heavy lifting. In this post, we saw how to easily build a solution with AWS DMS and HoodieDeltaStreamer on Amazon EMR. We also looked at how to perform lightweight record-level transformations when integrating data into the data lake, and how to use this data for downstream processes like aggregations. We also discussed the important settings and command line options that were used and how you could modify them to suit your requirements.


About the Authors

Ninad Phatak is a Senior Analytics Specialist Solutions Architect with Amazon Internet Services Private Limited. He specializes in data engineering and data warehousing technologies and helps customers architect their analytics use cases and platforms on AWS.


Raghu Dubey is a Senior Analytics Specialist Solutions Architect with Amazon Internet Services Private Limited. He specializes in Big Data Analytics, Data warehousing and BI and helps customers build scalable data analytics platforms.


Amazon S3 Update – Three New Security & Access Control Features

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/amazon-s3-update-three-new-security-access-control-features/

A year or so after we launched Amazon S3, I was in an elevator at a tech conference and heard a couple of developers use “just throw it into S3” as the answer to their data storage challenge. I remember that moment well because the comment was made so casually, and it was one of the first times that I fully grasped just how quickly S3 had caught on.

Since that launch, we have added hundreds of features and multiple storage classes to S3, while also reducing the cost to store a gigabyte of data for a month by almost 85% (from $0.15 to $0.023 for S3 Standard, and as low as $0.00099 for S3 Glacier Deep Archive). Today, our customers use S3 to support many different use cases including data lakes, backup and restore, disaster recovery, archiving, and cloud-native applications.

Security & Access Control
As the set of use cases for S3 has expanded, our customers have asked us for new ways to regulate access to their mission-critical buckets and objects. We added IAM policies many years ago, and Block Public Access in 2018. Last year we added S3 Access Points (Easily Manage Shared Data Sets with Amazon S3 Access Points) to help you manage access in large-scale environments that might encompass hundreds of applications and petabytes of storage.

Today we are launching S3 Object Ownership as a follow-on to two other S3 security & access control features that we launched earlier this month. All three features are designed to give you even more control and flexibility:

Object Ownership – You can now ensure that newly created objects within a bucket have the same owner as the bucket.

Bucket Owner Condition – You can now confirm the ownership of a bucket when you create a new object or perform other S3 operations.

Copy API via Access Points – You can now access S3’s Copy API through an Access Point.

You can use all of these new features in all AWS regions at no additional charge. Let’s take a look at each one!

Object Ownership
With the proper permissions in place, S3 already allows multiple AWS accounts to upload objects to the same bucket, with each account retaining ownership and control over the objects. This many-to-one upload model can be handy when using a bucket as a data lake or another type of data repository. Internal teams or external partners can all contribute to the creation of large-scale centralized resources. With this model, the bucket owner does not have full control over the objects in the bucket and cannot use bucket policies to share objects, which can lead to confusion.

You can now use a new per-bucket setting to enforce uniform object ownership within a bucket. This will simplify many applications, and will obviate the need for the Lambda-powered self-COPY that has become a popular way to do this up until now. Because this setting changes the behavior seen by the account that is uploading, the PUT request must include the bucket-owner-full-control ACL. You can also choose to use a bucket policy that requires the inclusion of this ACL.

To get started, open the S3 Console, locate the bucket and view its Permissions, click Object Ownership, and Edit:

Then select Bucket owner preferred and click Save:

As I mentioned earlier, you can use a bucket policy to enforce object ownership (read About Object Ownership and this Knowledge Center Article to learn more).
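
As a sketch, the same setup could also be applied programmatically with boto3: the bucket owner enables the Bucket owner preferred setting, and the uploading account includes the required ACL on its PUT. The bucket name, key, and body below are placeholders:

import boto3

s3 = boto3.client("s3")

# Bucket owner: turn on the new per-bucket Object Ownership setting
s3.put_bucket_ownership_controls(
    Bucket="my-shared-bucket",
    OwnershipControls={"Rules": [{"ObjectOwnership": "BucketOwnerPreferred"}]},
)

# Uploading account: the PUT must include the bucket-owner-full-control ACL
s3.put_object(
    Bucket="my-shared-bucket",
    Key="partner-data/report.csv",
    Body=b"col_a,col_b\n1,2\n",
    ACL="bucket-owner-full-control",
)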

Many AWS services deliver data to the bucket of your choice, and are now equipped to take advantage of this feature. S3 Server Access Logging, S3 Inventory, S3 Storage Class Analysis, AWS CloudTrail, and AWS Config now deliver data that you own. You can also configure Amazon EMR to use this feature by setting fs.s3.canned.acl to BucketOwnerFullControl in the cluster configuration (learn more).

Keep in mind that this feature does not change the ownership of existing objects. Also, note that you will now own more S3 objects than before, which may cause changes to the numbers you see in your reports and other metrics.

AWS CloudFormation support for Object Ownership is under development and is expected to be ready before AWS re:Invent.

Bucket Owner Condition
This feature lets you confirm that you are writing to a bucket that you own.

You simply pass a numeric AWS Account ID to any of the S3 Bucket or Object APIs using the expectedBucketOwner parameter or the x-amz-expected-bucket-owner HTTP header. The ID indicates the AWS Account that you believe owns the subject bucket. If there’s a match, then the request will proceed as normal. If not, it will fail with a 403 status code.
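
Here is a minimal boto3 sketch of the bucket owner condition; the ExpectedBucketOwner parameter maps to the x-amz-expected-bucket-owner header, and the bucket name, key, and account ID are placeholders:

import boto3
from botocore.exceptions import ClientError

s3 = boto3.client("s3")

try:
    s3.put_object(
        Bucket="my-production-bucket",
        Key="reports/output.json",
        Body=b"{}",
        ExpectedBucketOwner="111122223333",  # account ID you believe owns the bucket
    )
except ClientError as error:
    # A mismatch fails with HTTP 403 (Access Denied)
    if error.response["ResponseMetadata"]["HTTPStatusCode"] == 403:
        print("Bucket is not owned by the expected account")
    else:
        raise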

To learn more, read Bucket Owner Condition.

Copy API via Access Points
S3 Access Points give you fine-grained control over access to your shared data sets. Instead of managing a single and possibly complex policy on a bucket, you can create an access point for each application, and then use an IAM policy to regulate the S3 operations that are made via the access point (read Easily Manage Shared Data Sets with Amazon S3 Access Points to see how they work).

You can now use S3 Access Points in conjunction with the S3 CopyObject API by using the ARN of the access point instead of the bucket name (read Using Access Points to learn more).
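
A minimal boto3 sketch of a copy through access points might look like the following, with placeholder access point ARNs and keys; the copy source is expressed as the access point ARN followed by /object/ and the object key, in line with S3 access point addressing:

import boto3

s3 = boto3.client("s3")

source_ap = "arn:aws:s3:us-east-1:111122223333:accesspoint/source-ap"  # placeholder
dest_ap = "arn:aws:s3:us-east-1:111122223333:accesspoint/dest-ap"      # placeholder

s3.copy_object(
    Bucket=dest_ap,                                  # destination via access point ARN
    Key="copied/data.csv",
    CopySource=f"{source_ap}/object/raw/data.csv",   # source object via access point ARN
)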

Use Them Today
As I mentioned earlier, you can use all of these new features in all AWS regions at no additional charge.

Jeff;


Amazon S3 on Outposts Now Available

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/amazon-s3-on-outposts-now-available/

AWS Outposts customers can now use Amazon Simple Storage Service (S3) APIs to store and retrieve data in the same way they would access or use data in a regular AWS Region. This means that many tools, apps, scripts, or utilities that already use S3 APIs, either directly or through SDKs, can now be configured to store that data locally on your Outposts.

AWS Outposts is a fully managed service that provides a consistent hybrid experience, with AWS installing the Outpost in your data center or colo facility. These Outposts are managed, monitored, and updated by AWS just like in the cloud. Customers use AWS Outposts to run services in their local environments, like Amazon Elastic Compute Cloud (EC2), Amazon Elastic Block Store (EBS), and Amazon Relational Database Service (RDS), and they are ideal for workloads that require low latency access to on-premises systems, local data processing, or local data storage.

Outposts are connected to an AWS Region and are also able to access Amazon S3 in AWS Regions; however, this new feature allows you to use the S3 APIs to store data on the AWS Outposts hardware and process it locally. You can use S3 on Outposts to satisfy demanding performance needs by keeping data close to on-premises applications. It also benefits you if you want to reduce data transfers to AWS Regions, since you can perform filtering, compression, or other pre-processing on your data locally without having to send all of it to a Region.

Speaking of keeping your data local, any objects and the associated metadata and tags are always stored on the Outpost and are never sent or stored elsewhere. However, it is essential to remember that if you have data residency requirements, you may need to put some guardrails in place to ensure no one has the permissions to copy objects manually from your Outposts to an AWS Region.

You can create S3 buckets on your Outpost and easily store and retrieve objects using the same Console, APIs, and SDKs that you would use in a regular AWS Region. Using the S3 APIs and features, S3 on Outposts makes it easy to store, secure, tag, retrieve, report on, and control access to the data on your Outpost.

S3 on Outposts provides a new Amazon S3 storage class, named S3 Outposts, which uses the S3 APIs, and is designed to durably and redundantly store data across multiple devices and servers on your Outposts. By default, all data stored is encrypted using server-side encryption with SSE-S3. You can optionally use server-side encryption with your own encryption keys (SSE-C) by specifying an encryption key as part of your object API requests.

When configuring your Outpost you can add 48 TB or 96 TB of S3 storage capacity, and you can create up to 100 buckets on each Outpost. If you have existing Outposts, you can add capacity via the AWS Outposts Console or speak to your AWS account team. If you are using no more than 11 TB of EBS storage on an existing Outpost today you can add up to 48 TB with no hardware changes on the existing Outposts. Other configurations will require additional hardware on the Outpost (if the hardware footprint supports this) in order to add S3 storage.

So let me show you how I can create an S3 bucket on my Outposts and then store and retrieve some data in that bucket.

Storing data using S3 on Outposts

To get started, I updated my AWS Command Line Interface (CLI) to the latest version. I can create a new bucket with the following command and specify which Outpost I would like the bucket created on by using the --outposts-id switch.

aws s3control create-bucket --bucket my-news-blog-bucket --outposts-id op-12345

In response to the command, I am given the ARN of the bucket. I take note of this as I will need it in the next command.

Next, I will create an access point. Access points are a relatively new way to manage access to an S3 bucket. Each access point enforces distinct permissions and network controls for any request made through it. S3 on Outposts requires an Amazon Virtual Private Cloud configuration, so I need to provide the VPC details along with the create-access-point command.

aws s3control create-access-point --account-id 12345 --name prod --bucket "arn:aws:s3-outposts:us-west-2:12345:outpost/op-12345/bucket/my-news-blog-bucket" --vpc-configuration VpcId=vpc-12345

S3 on Outposts uses endpoints to connect to Outposts buckets so that you can perform actions within your virtual private cloud (VPC). To create an endpoint, I run the following command.

aws s3outposts create-endpoint --outpost-id op-12345 --subnet-id subnet-12345 --security-group-id sg-12345

Now that I have set things up, I can start storing data. I use the put-object command to store an object in my newly created Amazon Simple Storage Service (S3) bucket.

aws s3api put-object --key my_news_blog_archives.zip --body my_news_blog_archives.zip --bucket arn:aws:s3-outposts:us-west-2:12345:outpost/op-12345/accesspoint/prod

Once the object is stored I can retrieve it by using the get-object command.

aws s3api get-object --key my_news_blog_archives.zip --bucket arn:aws:s3-outposts:us-west-2:12345:outpost/op-12345/accesspoint/prod my_news_blog_archives.zip

There we have it. I’ve managed to store an object and then retrieve it, on my Outposts, using S3 on Outposts.

Transferring Data from Outposts

Now that you can store and retrieve data on your Outposts, you might want to transfer results to S3 in an AWS Region, or transfer data from AWS Regions to your Outposts for frequent local access, processing, and storage. You can use AWS DataSync to do this with the newly launched support for S3 on Outposts.

With DataSync, you can choose which objects to transfer, when to transfer them, and how much network bandwidth to use. DataSync also encrypts your data in-transit, verifies data integrity in-transit and at-rest, and provides granular visibility into the transfer process through Amazon CloudWatch metrics, logs, and events.

Order today

If you want to start using S3 on Outposts, please visit the AWS Outposts Console, where you can add S3 storage to your existing Outposts or order an Outposts configuration that includes the desired amount of S3 capacity. If you’d like to discuss your Outposts purchase in more detail, contact our sales team.

Pricing with AWS Outposts works a little bit differently from most AWS services, in that it is not a pay-as-you-go service. You purchase Outposts capacity for a 3-year term and you can choose from a number of different payment schedules. There are a variety of AWS Outposts configurations featuring a combination of EC2 instance types and storage options. You can also increase your EC2 and storage capacity over time by upgrading your configuration. For more detailed information about pricing check out the AWS Outposts Pricing details.

Happy Storing

— Martin

How to delete user data in an AWS data lake

Post Syndicated from George Komninos original https://aws.amazon.com/blogs/big-data/how-to-delete-user-data-in-an-aws-data-lake/

General Data Protection Regulation (GDPR) is an important aspect of today’s technology world, and processing data in compliance with GDPR is a necessity for those who implement solutions within the AWS public cloud. One article of GDPR is the “right to erasure,” or “right to be forgotten,” which may require you to implement a solution to delete specific users’ personal data.

In the context of the AWS big data and analytics ecosystem, every architecture, regardless of the problem it targets, uses Amazon Simple Storage Service (Amazon S3) as the core storage service. Despite its versatility and feature completeness, Amazon S3 doesn’t come with an out-of-the-box way to map a user identifier to S3 keys of objects that contain a user’s data.

This post walks you through a framework that helps you purge individual user data within your organization’s AWS hosted data lake, and an analytics solution that uses different AWS storage layers, along with sample code targeting Amazon S3.

Reference architecture

To address the challenge of implementing a data purge framework, we reduced the problem to the straightforward use case of deleting a user’s data from a platform that uses AWS for its data pipeline. The following diagram illustrates this use case.

We’re introducing the idea of building and maintaining an index metastore that keeps track of the location of each user’s records and allows us to locate them efficiently, reducing the search space.

You can use the following architecture diagram to delete a specific user’s data within your organization’s AWS data lake.

For this initial version, we created three user flows that map each task to a fitting AWS service:

Flow 1: Real-time metastore update

The S3 ObjectCreated or ObjectRemoved events trigger an AWS Lambda function that parses the object and performs an add/update/delete operation to keep the metadata index up to date. You can implement a simple workflow for any other storage layer, such as Amazon Relational Database Service (RDS), Amazon Aurora, or Amazon Elasticsearch Service (ES). We use Amazon DynamoDB and Amazon RDS for PostgreSQL as the index metadata storage options, but our approach is flexible to any other technology.

Flow 2: Purge data

When a user asks for their data to be deleted, we trigger an AWS Step Functions state machine through Amazon CloudWatch to orchestrate the workflow. Its first step triggers a Lambda function that queries the metadata index to identify the storage layers that contain user records and generates a report that’s saved to an S3 report bucket. A Step Functions activity is created and picked up by a Node.js-based Lambda worker that sends an email to the approver through Amazon Simple Email Service (SES) with approve and reject links.

The following diagram shows a graphical representation of the Step Function state machine as seen on the AWS Management Console.

The approver selects one of the two links, which then calls an Amazon API Gateway endpoint that invokes Step Functions to resume the workflow. If you choose the approve link, Step Functions triggers a Lambda function that takes the report stored in the bucket as input, deletes the objects or records from the storage layer, and updates the index metastore. When the purging job is complete, Amazon Simple Notification Service (SNS) sends a success or fail email to the user.

The following diagram represents the Step Functions flow on the console if the purge flow completed successfully.

For the complete code base, see step-function-definition.json in the GitHub repo.

Flow 3: Batch metastore update

This flow refers to the use case of an existing data lake for which index metastore needs to be created. You can orchestrate the flow through AWS Step Functions, which takes historical data as input and updates metastore through a batch job. Our current implementation doesn’t include a sample script for this user flow.

Our framework

We now walk you through the two use cases we followed for our implementation:

  • You have multiple user records stored in each Amazon S3 file
  • A user has records stored in heterogeneous AWS storage layers

Within these two approaches, we demonstrate alternatives that you can use to store your index metastore.

Indexing by S3 URI and row number

For this use case, we use a free tier RDS Postgres instance to store our index. We created a simple table with the following code:

CREATE UNLOGGED TABLE IF NOT EXISTS user_objects (
				userid TEXT,
				s3path TEXT,
				recordline INTEGER
			);

You can index on userid to optimize query performance. On object upload, for each row, you need to insert into the user_objects table a row that indicates the user ID, the URI of the target Amazon S3 object, and the row number that corresponds to the record. For instance, when uploading the following JSON input, enter the following code:

{"user_id":"V34qejxNsCbcgD8C0HVk-Q","body":"…"}
{"user_id":"ofKDkJKXSKZXu5xJNGiiBQ","body":"…"}
{"user_id":"UgMW8bLE0QMJDCkQ1Ax5Mg","body":"…"}

We insert the tuples into user_objects in the Amazon S3 location s3://gdpr-demo/year=2018/month=2/day=26/input.json. See the following code:

("V34qejxNsCbcgD8C0HVk-Q", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 0)
("ofKDkJKXSKZXu5xJNGiiBQ", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 1)
("UgMW8bLE0QMJDCkQ1Ax5Mg", "s3://gdpr-demo/year=2018/month=2/day=26/input.json", 2)

You can implement the index update operation by using a Lambda function triggered on any Amazon S3 ObjectCreated event.
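
As a minimal sketch, such a Lambda function might look like the following. It assumes psycopg2 is packaged with the function, that the connection string comes from an environment variable, and that uploads are newline-delimited JSON as in the example above:

import json
import os

import boto3
import psycopg2

s3 = boto3.client("s3")


def handler(event, context):
    conn = psycopg2.connect(os.environ["POSTGRES_DSN"])
    with conn, conn.cursor() as cur:
        for record in event["Records"]:
            bucket = record["s3"]["bucket"]["name"]
            key = record["s3"]["object"]["key"]
            s3path = f"s3://{bucket}/{key}"

            # Read the uploaded object and index each record's line number
            body = s3.get_object(Bucket=bucket, Key=key)["Body"]
            lines = body.read().decode("utf-8").splitlines()

            for line_number, line in enumerate(lines):
                user_id = json.loads(line)["user_id"]
                cur.execute(
                    "INSERT INTO user_objects (userid, s3path, recordline) VALUES (%s, %s, %s)",
                    (user_id, s3path, line_number),
                )
    conn.close()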

When we get a delete request from a user, we need to query our index to get some information about where we have stored the data to delete. See the following code:

SELECT s3path,
       ARRAY_AGG(recordline)
FROM user_objects
WHERE userid = 'V34qejxNsCbcgD8C0HVk-Q'
GROUP BY s3path;

The preceding example SQL query returns rows like the following:

("s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json", {2102,529})

The output indicates that lines 529 and 2102 of S3 object s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json contain the requested user’s data and need to be purged. We then need to download the object, remove those rows, and overwrite the object. For a Python implementation of the Lambda function that implements this functionality, see deleteUserRecords.py in the GitHub repo.

Having the record line available allows you to perform the deletion efficiently in byte format. For implementation simplicity, we purge the rows by replacing the deleted rows with an empty JSON object. You pay a slight storage overhead, but you don’t need to update subsequent row metadata in your index, which would be costly. To eliminate empty JSON objects, we can implement an offline vacuum and index update process.
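
For reference, a simplified sketch of this purge step (deleteUserRecords.py in the GitHub repo is the complete implementation) could look like the following, using the report row shown above:

import boto3

s3 = boto3.client("s3")


def purge_rows(s3path, rows_to_delete):
    # Split an s3://bucket/key URI into bucket and key
    bucket, key = s3path.replace("s3://", "").split("/", 1)
    lines = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8").splitlines()

    for row in rows_to_delete:
        lines[row] = "{}"  # replace the user's record with an empty JSON object

    # Overwrite the object with the purged content
    s3.put_object(Bucket=bucket, Key=key, Body="\n".join(lines).encode("utf-8"))


# Example using the report row shown above
purge_rows("s3://gdpr-review/year=2015/month=12/day=21/review-part-0.json", [2102, 529])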

Indexing by file name and grouping by index key

For this use case, we created a DynamoDB table to store our index. We chose DynamoDB because of its ease of use and scalability; you can use its on-demand pricing model so you don’t need to guess how many capacity units you might need. When files are uploaded to the data lake, a Lambda function parses the file name (for example, 1001-.csv) to identify the user identifier and populates the DynamoDB metadata table. Userid is the partition key, and each storage layer has its own attribute. For example, if user 1001 had data in Amazon S3 and Amazon RDS, their record looks like the following code:

{"userid": 1001, "s3":{"s3://path1", "s3://path2"}, "RDS":{"db1.table1.column1"}}

For a sample Python implementation of this functionality, see update-dynamo-metadata.py in the GitHub repo.
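
As a minimal sketch, the upload-time index maintenance might look like the following. The table name, the file name convention, and the use of a DynamoDB string set per storage layer are assumptions for illustration:

import boto3

dynamodb = boto3.resource("dynamodb")
table = dynamodb.Table("user-metadata-index")  # placeholder table name


def handler(event, context):
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = record["s3"]["object"]["key"]
        # Assumes the user ID is the leading part of the file name, e.g. "1001-2020-10-01.csv"
        user_id = int(key.split("/")[-1].split("-")[0])

        # ADD appends the S3 path to the string set, creating the attribute if absent
        table.update_item(
            Key={"userid": user_id},
            UpdateExpression="ADD s3 :path",
            ExpressionAttributeValues={":path": {f"s3://{bucket}/{key}"}},
        )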

On delete request, we query the metastore table, which is DynamoDB, and generate a purge report that contains details on what storage layers contain user records, and storage layer specifics that can speed up locating the records. We store the purge report to Amazon S3. For a sample Lambda function that implements this logic, see generate-purge-report.py in the GitHub repo.

After the purging is approved, we use the report as input to delete the required resources. For a sample Lambda function implementation, see gdpr-purge-data.py in the GitHub repo.

Implementation and technology alternatives

We explored and evaluated multiple implementation options, all of which present tradeoffs, such as implementation simplicity, efficiency, critical data compliance, and feature completeness:

  • Scan every record of the data file to create an index – Whenever a file is uploaded, we iterate through its records and generate tuples (userid, s3Uri, row_number) that are then inserted into our metadata storage layer. On delete request, we fetch the metadata records for the requested user IDs, download the corresponding S3 objects, perform the delete in place, and re-upload the updated objects, overwriting the existing ones. This is the most flexible approach because it supports storing multiple users’ data in a single object, which is a very common practice. The flexibility comes at a cost because it requires downloading and re-uploading the object, which introduces a network bottleneck in delete operations. User activity datasets such as customer product reviews are a good fit for this approach, because it’s unusual to have multiple records for the same user within each partition (such as a date partition), and it’s preferable to combine multiple users’ activity in a single file. It’s similar to what was described in the section “Indexing by S3 URI and row number” and sample code is available in the GitHub repo.
  • Store metadata as file name prefix – Adding the user ID as the prefix of the uploaded object under the different partitions that are defined based on query pattern enables you to reduce the required search operations on delete request. The metadata handling utility finds the user ID from the file name and maintains the index accordingly. This approach is efficient in locating the resources to purge but assumes a single user per object, and requires you to store user IDs within the filename, which might require InfoSec considerations. Clickstream data, where you would expect to have multiple click events for a single customer on a single date partition during a session, is a good fit. We covered this approach in the section “Indexing by file name and grouping by index key” and you can download the codebase from the GitHub repo.
  • Use a metadata file – Along with uploading a new object, we also upload a metadata file that’s picked up by an indexing utility to create and maintain the index up to date. On delete request, we query the index, which points us to the records to purge. A good fit for this approach is a use case that already involves uploading a metadata file whenever a new object is uploaded, such as uploading multimedia data, along with their metadata. Otherwise, uploading a metadata file on every object upload might introduce too much of an overhead.
  • Use the tagging feature of AWS services – Whenever a new file is uploaded to Amazon S3, we use the Amazon S3 PutObjectTagging operation to add a key-value pair for the user identifier (see the sketch after this list). Whenever there is a user data delete request, we fetch objects with that tag and delete them. This option is straightforward to implement using the existing Amazon S3 API and can therefore be a good first version of your implementation. However, it involves significant limitations: it assumes a 1:1 cardinality between Amazon S3 objects and users (each object only contains data for a single user), searching objects based on a tag is limited and inefficient, and storing user identifiers as tags might not be compliant with your organization’s InfoSec policy.
  • Use Apache Hudi – Apache Hudi is becoming a very popular option for performing record-level data deletion on Amazon S3. Its current version is restricted to Amazon EMR, and it’s a good fit if you’re building your data lake from scratch, because you need to store your data as Hudi datasets. Hudi is a very active project, and additional features and integrations with more AWS services are expected.
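
As an example of the tagging approach, the following sketch adds a user identifier tag to an uploaded object with the PutObjectTagging API; the tag key name is illustrative:

import boto3

s3 = boto3.client("s3")

def tag_object_with_user(bucket, key, user_id):
    # Attach the user identifier to an object as an S3 tag (assumes one user per object)
    s3.put_object_tagging(
        Bucket=bucket,
        Key=key,
        Tagging={"TagSet": [{"Key": "user_id", "Value": str(user_id)}]},
    )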

The key implementation decision of our approach is separating the storage layer we use for our data from the one we use for our metadata. As a result, our design is versatile and can be plugged into any existing data pipeline. Similar to deciding what storage layer to use for your data, there are many factors to consider when deciding how to store your index:

  • Concurrency of requests – If you don’t expect too many simultaneous inserts, even something as simple as Amazon S3 could be a starting point for your index. However, if you get multiple concurrent writes for multiple users, you need to look into a service that copes better with transactions.
  • Existing team knowledge and infrastructure – In this post, we demonstrated using DynamoDB and RDS PostgreSQL for storing and querying the metadata index. If your team has no experience with either of those but is comfortable with Amazon ES, Amazon DocumentDB (with MongoDB compatibility), or any other storage layer, use those. Furthermore, if you’re already running (and paying for) a MySQL database that isn’t used to capacity, you could use it for your index at no additional cost.
  • Size of index – The volume of your metadata is orders of magnitude lower than your actual data. However, if your dataset grows significantly, you might need to consider going for a scalable, distributed storage solution rather than, for instance, a relational database management system.

Conclusion

GDPR has transformed best practices and introduced several extra technical challenges in designing and implementing a data lake. The reference architecture and scripts in this post may help you delete data in a manner that’s compliant with GDPR.

Let us know your feedback in the comments and how you implemented this solution in your organization, so that others can learn from it.

About the Authors

George Komninos is a Data Lab Solutions Architect at AWS. He helps customers convert their ideas to a production-ready data product. Before AWS, he spent 3 years at Alexa Information domain as a data engineer. Outside of work, George is a football fan and supports the greatest team in the world, Olympiacos Piraeus.

Sakti Mishra is a Data Lab Solutions Architect at AWS. He helps customers architect data analytics solutions, which gives them an accelerated path towards modernization initiatives. Outside of work, Sakti enjoys learning new technologies, watching movies, and travel.

Streaming data from Amazon S3 to Amazon Kinesis Data Streams using AWS DMS

Post Syndicated from Mahesh Goyal original https://aws.amazon.com/blogs/big-data/streaming-data-from-amazon-s3-to-amazon-kinesis-data-streams-using-aws-dms/

Stream processing is very useful in use cases where you need to detect a problem quickly and improve the outcome based on data, such as production line monitoring or supply chain optimization.

This blog post walks you through the process of streaming existing data files and ongoing changes from Amazon Simple Storage Service (Amazon S3) to Amazon Kinesis. You achieve this by using AWS Database Migration Service (AWS DMS). AWS DMS enables you to seamlessly migrate data from supported sources to relational databases, data warehouses, streaming platforms, and other data stores in the AWS Cloud.

Many SaaS and third-party applications already integrate with Amazon S3 and can deliver records to S3 buckets. In certain use cases, you need to further process this data in near-real time to generate alerts. Use cases like threat detection and application monitoring require generating insights in seconds. Waiting for batch processes often delays data analysis and reduces the ability of systems to respond quickly to critical situations. For such use cases, you need a way to convert batch processing to stream processing by expanding your applications’ existing integrations with Amazon S3.

You can use AWS DMS for such data-processing requirements. AWS DMS lets you expand your existing application’s use of Amazon S3 to produce data in Amazon Kinesis Data Streams for real-time analytics, without writing and maintaining new code. AWS DMS supports specifying Amazon S3 as the source and streaming services like Kinesis and Amazon Managed Streaming for Apache Kafka (Amazon MSK) as the target. AWS DMS allows migration of full load and change data capture (CDC) files to these services. AWS DMS performs this task out of the box without any complex configuration or code development. You can also configure an AWS DMS replication instance to scale up or down depending on the workload.

For this post, we focus on streaming data to Kinesis. We deploy an AWS CloudFormation template to get started in minutes and explore the streaming pipeline.

Architecture overview

Third-party applications such as web, API, and data-integration services produce data and log files in S3 buckets. Data lakes built on AWS process and store data in Amazon S3 at different stages. AWS DMS supports Amazon S3 as the source and Kinesis as the target, so data stored in an S3 bucket is streamed to Kinesis. Several consumers, such as AWS Lambda, Amazon Kinesis Data Firehose, Amazon Kinesis Data Analytics, and the Kinesis Consumer Library (KCL), can consume the data concurrently to perform real-time analytics on the dataset. Each AWS service in this architecture can scale independently as needed.

The following diagram shows the architecture of this solution.

Deploying AWS CloudFormation

To get started, you first deploy the CloudFormation template to create the core components of the architecture. AWS CloudFormation automates the deployment of technology and infrastructure in a safe and repeatable manner across multiple Regions and accounts with the least amount of effort and time. To create these resources, complete the following steps:

  1. Sign in to the AWS Management Console and choose the us-west-2 Region.
  2. Choose Launch Stack:
  3. Choose Next.

 This automatically launches AWS CloudFormation in your AWS account with a template. It prompts you to sign in as needed. You can view the CloudFormation template on the console.

  1. For Stack name, enter a stack name.
  2. On the next screen, choose your VPC and subnet IDs.
  3. For Does DMS VPC and Cloudwatch role Exists?, enter Y if the managed AWS Identity and Access Management (IAM) roles dms-vpc-role and dms-cloudwatch-logs-role exist in your account. Otherwise, leave at the default N.

If you want to deploy the AWS DMS endpoint in a private subnet, enable the VPC endpoints for Kinesis and Amazon S3 before deploying the template.

  1. Choose Next.
  2. Acknowledge resource creation under Capabilities on the final screen and choose Create.

The stack takes 5–10 minutes to complete and creates the core resources of the architecture.

The files required for this demo don’t come with the template. Download blog_sample_file.zip and upload it to the source bucket before starting the AWS DMS task.

Using Amazon S3 as the source

When you use Amazon S3 as the source, the data files (full load and CDC) must be in comma-separated value (CSV) format.

In addition to the data files, AWS DMS also requires an external table definition. An external table definition is a JSON document that describes how AWS DMS should interpret the data from Amazon S3.

Amazon S3 file paths for the full load and CDC files are required for AWS DMS to run the task. Make sure that file names are sequentially numbered to replicate the data in the correct order. In addition, AWS DMS allows you to specify the column delimiter, row delimiter, and other parameters using extra connection attributes.
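
The following boto3 sketch shows one way to create such an S3 source endpoint with a few of these settings; the role ARN, bucket, folder names, and table definition file are placeholders, and the delimiter values shown are the defaults:

import boto3

dms = boto3.client("dms")

# Illustrative values only: the role, bucket, folders, and table definition are placeholders
with open("table-definition.json") as f:
    table_definition = f.read()

response = dms.create_endpoint(
    EndpointIdentifier="s3-source-endpoint",
    EndpointType="source",
    EngineName="s3",
    S3Settings={
        "ServiceAccessRoleArn": "arn:aws:iam::123456789012:role/dms-s3-access-role",
        "BucketName": "blog-xxxxxxxx",
        "BucketFolder": "schema01",
        "CdcPath": "cdcfile",
        "CsvDelimiter": ",",      # column delimiter (default)
        "CsvRowDelimiter": "\n",  # row delimiter (default)
        "ExternalTableDefinition": table_definition,
    },
)
print(response["Endpoint"]["EndpointArn"])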

AWS DMS identifies the operation to perform for each load record from the record’s keyword value (for example, INSERT or I).

For more information, see Using Amazon S3 as a source for AWS DMS.

Using Amazon Kinesis as the target

AWS DMS publishes records to a Kinesis data stream as JSON. During conversion, AWS DMS serializes each record from the source Amazon S3 files into an attribute-value pair in JSON format.

AWS DMS publishes each record in the source Amazon S3 file as one JSON data record in a data stream regardless of the action specified in the source file.
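
As an illustration, a record published to the data stream is a JSON document shaped roughly like the following; the field values here are invented, and the exact metadata fields can vary with the DMS version and task settings:

{
  "data": {
    "id": 42,
    "value": "sample"
  },
  "metadata": {
    "timestamp": "2020-08-03T22:10:15.000000Z",
    "record-type": "data",
    "operation": "insert",
    "partition-key-type": "schema-table",
    "schema-name": "schema01",
    "table-name": "table01"
  }
}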

Additionally, AWS DMS allows object mapping to migrate data from source files to a data stream. Object mapping determines the structure of data records in the stream.

AWS DMS also supports multi-threaded migration for full load and CDC with task settings. You can improve performance by configuring the number of threads, the buffer size, and the parallel queue settings.

For more information, see Using Amazon Kinesis Data Streams as a target for AWS Database Migration Service.

Walkthrough

The AWS CloudFormation deployment takes care of all the infrastructure. Now you need files to complete this use case.

  1. Download blog_sample_file.zip, which contains full and CDC load files in CSV format.

If your source files aren’t in CSV, convert the file format to CSV. One conversion method is by using AWS Glue. For more information, see Format Options for ETL Inputs and Outputs in AWS Glue.

The following screenshot shows the sample records of the full load files that you use for this use case.

CDC files require additional attributes for AWS DMS to identify the action, table, and schema.

  1. Reformat the files as follows:
  • Operation – The change operation to be performed: INSERT or I, UPDATE or U, or DELETE or D.
  • Table name – The name of the source table.
  • Schema name – The name of the source schema.
  • Data – One or more columns that represent the data to be changed.

The following screenshot shows sample records of the CDC file.

External table definition is required in the source endpoint configuration. For this post, the definition is embedded in AWS CloudFormation.

  1. Enter the following code for the table definition for the full and CDC files:
    {
    	"TableCount": "1",
    	"Tables": [{
    		"TableName": "table01",
    		"TablePath": "schema01/table01/",
    		"TableOwner": "schema01",
    		"TableColumns": [{
    			"ColumnName": "ingest_time",
    			"ColumnType": "TIMESTAMP",
    			"ColumnNullable": "false",
    			"ColumnIsPk": "true"
    		}, {
    			"ColumnName": "doi",
    			"ColumnType": "STRING",
    			"ColumnLength": "30"
    		}, {
    			"ColumnName": "id",
    			"ColumnType": "INT8"
    		}, {
    			"ColumnName": "value",
    			"ColumnType": "NUMERIC",
    			"ColumnPrecision": "5",
    			"ColumnScale": "2"
    		}, {
    			"ColumnName": "data_sig",
    			"ColumnType": "STRING",
    			"ColumnLength": "10"
    		}],
    		"TableColumnsTotal": "5"
    	}]
    }
    

  2. Create folder structures under the source S3 bucket created through the CloudFormation template.
    1. Create folders schema01/table01/ for full load and cdcfile/ for CDC data files.
    2. Also, file names should be sequentially numbered, as listed in the following CLI output.
      $aws s3 ls s3://blog-xxxxxxxx/schema01/table01 --recursive --human-readable --summarize
      2020-08-03 22:05:57    5.0 MiB schema01/table01/full_000
      2020-08-03 22:05:51    5.0 MiB schema01/table01/full_001
      2020-08-03 22:06:00    5.0 MiB schema01/table01/full_002
      2020-08-03 22:05:56    5.0 MiB schema01/table01/full_003
      2020-08-03 22:05:59    3.1 MiB schema01/table01/full_004
      
      $aws s3 ls s3://blog-xxxxxxxx/cdcfile --recursive --human-readable --summarize
      2020-08-03 22:06:28    4.8 MiB cdc/cdc_000
      2020-08-03 22:06:28    4.8 MiB cdc/cdc_001
      2020-08-03 22:06:26    4.8 MiB cdc/cdc_002
      2020-08-03 22:06:19    4.8 MiB cdc/cdc_003
      

  3. After the files are copied, on the AWS DMS console, choose Replication.
  4. Validate the instance status and configuration.
  5. Choose Endpoints.
  6. Validate the status and configuration of the Amazon S3 source endpoint and make sure that the connection to the replication instance is successful.
  7. Similarly, validate the status and configuration of Kinesis target endpoint and make sure that the connection to the replication instance is successful.
  8. Choose Database migration task.
  9. Verify that the source and target are mapped correctly.
  10. After validating all the configurations, start the AWS DMS task. Because the task has been created but never started, choose Restart/Resume to start the full load and CDC.

After data migration starts, you can see it listed under Table statistics. For more information, see How do I use table statistics to monitor an AWS DMS task?

AWS DMS completes the full load first and migrates change data as files are uploaded to the bucket location specified in the cdcPath parameter.

  1. While the migration is in progress, on the Kinesis console, check the IncomingBytes metrics on the Monitoring tab to confirm the data is streaming to Kinesis Data Streams.
  2. To confirm that the data streamed is being consumed by the Lambda consumer, use the GetRecords.Bytes metric.

You’re now ready to validate the records in Lambda. Lambda is configured to read from Kinesis through a trigger.

The Lambda consumer for this post is a sample function that consumes the records from the Kinesis data stream, decodes the base64 encoded data, and prints the records to the Amazon CloudWatch log group.

  1. On the Monitoring tab, open the recent logstream under CloudWatch Log Insights to see the printed records.

For more information about monitoring, see Monitoring functions in the AWS Lambda console.

You can add processing logic to the Lambda function as per your requirements to aggregate or process the records. You can also configure a Lambda destination for further processing. Lambda asynchronous invocations can put an event or message on Amazon Simple Notification Service (Amazon SNS), Amazon Simple Queue Service (Amazon SQS), or Amazon EventBridge. For more information, see Introducing AWS Lambda Destinations.

Best practice considerations

When implementing this solution, consider the following best practices:

  • Full load allows you to stream existing data from an S3 bucket to Kinesis. You can use full load to migrate previously stored data before streaming CDC data. The full load data should already exist before the task starts. For new CDC files, the data is streamed to Kinesis on a file delivery event in real time.
  • For loading multiple tables, you can specify the table count and table properties in an external table definition file. The CDC path remains the same and AWS DMS maps the records to tables based on the metadata fields.
  • During a heavy workload, the AWS DMS instance can be constrained by resources such as CPU, memory, storage, and I/O. For optimal transfer speed, monitor the CloudWatch metrics and scale the replication instance.
  • For migrating a large number of tables, you can speed up the transfer by setting the multi-threading parameter to higher values.
  • The CloudFormation template creates a data stream with two shards. As the data flow rate to the stream increases, you can scale the number of shards in the stream to adapt to changes. Monitoring Kinesis with CloudWatch metrics for IncomingRecords and WriteProvisionedThroughputExceeded provides insights on how to scale the shards.
  • Object mapping in the AWS DMS task defines the partition key. This partition key is used to group data by shard within a stream. The default partition key AWS DMS uses is TableName. You can use attribute mapping to change the partition key to a value of one of the fields in the JSON, or the primary key of the table in the source database. You can also set the partition key to a constant value to stream all the data to a single shard in the stream.
  • By default, Lambda invokes the function as soon as records are available in the stream. To avoid invoking the function with a small number of records, configure the event source to buffer records for up to 5 minutes by configuring a batch window (see the sketch after this list). For more information, see Using AWS Lambda with Amazon Kinesis.
  • When Kinesis is configured as a trigger for Lambda, you can increase the concurrency to process multiple batches from each shard in parallel. Lambda can process up to 10 batches in each shard simultaneously. For more information about concurrency, see New AWS Lambda scaling controls for Kinesis and DynamoDB event sources.
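
If you manage the Lambda trigger with AWS SAM, the batch window and per-shard concurrency mentioned above map to event source properties like the following sketch; the function and stream names are placeholders:

  ConsumerFunction:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.handler
      Runtime: python3.8
      Events:
        KinesisStream:
          Type: Kinesis
          Properties:
            Stream: !GetAtt SourceStream.Arn        # SourceStream is a placeholder stream resource
            StartingPosition: LATEST
            BatchSize: 100
            MaximumBatchingWindowInSeconds: 300     # buffer records for up to 5 minutes
            ParallelizationFactor: 10               # process up to 10 batches per shard concurrently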

Cleaning up

After successful testing and validation, you should delete all the resources deployed through the CloudFormation template to avoid any unwanted costs. First empty the S3 bucket and stop the AWS DMS task. Then delete the appropriate stacks on the AWS CloudFormation console.

Summary

This post describes a solution for converting batch processing to near-real-time processing using AWS DMS. This solution greatly simplifies the process of migrating records from Amazon S3 to Kinesis for analysis. Kinesis as an AWS DMS target allows multiple systems to consume data simultaneously. Having a near-real-time streaming pipeline lets you make sense of all the changes as they happen, which ultimately improves your organization’s decision-making. All the resources used in this solution scale seamlessly and allow you to focus on analysis, alerting, reporting, and fraud detection instead of platform setup and maintenance. This promotes cost-effectiveness while reducing operational burden.


About the Author

Mahesh Goyal is a Data Architect in Big Data at AWS. He works with customers in their journey to the cloud with a focus on big data and data warehouses. In his spare time, Mahesh likes to listen to music and explore new food places with his family.

Charishma Makineni is a Technical Account Manager at AWS. She works with enterprise customers to help them build secure and scalable solutions on the AWS cloud. She is focused on Big data and Analytics technologies. Outside of work, Charishma enjoys being outdoors, gardening and experimenting with cooking.

Suresh Patnam is a Solutions Architect at AWS. He helps customers innovate on the AWS platform by building highly available, scalable, and secure architectures on Big Data and AI/ML. In his spare time, Suresh enjoys playing tennis and spending time with his family.

Analyzing Amazon S3 server access logs using Amazon ES

Post Syndicated from Mahesh Goyal original https://aws.amazon.com/blogs/big-data/analyzing-amazon-s3-server-access-logs-using-amazon-es/

When you use Amazon Simple Storage Service (Amazon S3) to store corporate data and host websites, you need additional logging to monitor access to your data and the performance of your application. An effective logging solution enhances security and improves the detection of security incidents. As your data storage needs grow, you can rely on Amazon S3 for a range of use cases while also looking for ways to analyze your logs to ensure compliance, perform audits, and discover risks.

Amazon S3 lets you monitor the traffic using the server access logging feature. With server access logging, you can capture and monitor the traffic to your S3 bucket at any time, with detailed information about the source of the request. The logs are stored in the S3 bucket you own in the same Region. This addresses the security and compliance requirements of most organizations. The logs are critical for establishing baselines, analyzing access patterns, and identifying trends. For example, the logs could answer a financial organization’s question about how many requests are made to a bucket and who is making what type of access requests to the objects.

You can discover insights from server access logs through several different methods. One common option is to use Amazon Athena or Amazon Redshift Spectrum to query the log files stored in Amazon S3. However, this approach can introduce high latency as log volume grows, and it requires further integration with Amazon QuickSight to add visualization capabilities.

You can address this by using Amazon Elasticsearch Service (Amazon ES). Amazon ES is a managed service that makes it easier to deploy, operate, and scale Elasticsearch clusters in the AWS Cloud. Elasticsearch is a popular open-source search and analytics engine for use cases such as log analytics, real-time application monitoring, and clickstream analysis. The service provides support for open-source Elasticsearch APIs, managed Kibana, and integration with other AWS services such as Amazon S3 and Amazon Kinesis for loading streaming data into Amazon ES.

This post walks you through automating ingestion of server access logs from Amazon S3 into Amazon ES using AWS Lambda and visualizing the data in Kibana.

Architecture overview

Server access logging is enabled on the source buckets, and logs are delivered to the access log bucket. The access log bucket is configured to send an event to the Lambda function when a log file is created. On an event trigger, the Lambda function reads the file, processes the access log, and sends it to Amazon ES. When the logs are available, you can use Kibana to create interactive visuals and analyze the logs over a time period.
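
The following is a simplified sketch of such a processing function; it assumes the requests library is packaged with the function, that the domain endpoint and fine-grained access control credentials are supplied through illustrative environment variables, and it parses only a few of the access log fields. The GitHub project referenced later in this post contains the full implementation.

import json
import os
import urllib.parse

import boto3
import requests  # assumes the requests library is packaged with the function

s3 = boto3.client("s3")

ES_ENDPOINT = os.environ["ES_ENDPOINT"]  # e.g. https://vpc-xxxx.us-east-1.es.amazonaws.com
ES_AUTH = (os.environ["ES_USER"], os.environ["ES_PASSWORD"])  # fine-grained access control user

def handler(event, context):
    # Triggered when a new access log object lands in the access log bucket
    for record in event["Records"]:
        bucket = record["s3"]["bucket"]["name"]
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        body = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

        for line in body.splitlines():
            fields = line.split(" ")
            if len(fields) < 8:
                continue
            doc = {
                "bucket": fields[1],
                "requestdatetime": fields[2].lstrip("["),  # simplified; the full code parses every field
                "remote_ip": fields[4],
                "operation": fields[7],
                "raw": line,
            }
            # One document per request; a production version would batch with the _bulk API
            requests.post(
                f"{ES_ENDPOINT}/access-log/_doc",
                auth=ES_AUTH,
                headers={"Content-Type": "application/json"},
                data=json.dumps(doc),
            )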

When designing a log analytics solution for high-frequency incoming data, you should consider buffering layers to avoid instability in the system. Buffering helps you streamline processes for unpredictable incoming log data. For such use cases, you can take advantage of managed services like Amazon Kinesis Data Streams, Amazon Kinesis Data Firehose, and Amazon Managed Streaming for Apache Kafka (Amazon MSK).

Streaming services buffer data before delivering it to Amazon ES. This helps you avoid overwhelming your cluster with spiky ingestion events. Kinesis Data Firehose can reliably load data into Amazon ES. Kinesis Data Firehose lets you choose a buffer size of 1–100 MiB and a buffer interval of 60–900 seconds when Amazon ES is selected as the destination. Kinesis Data Firehose also scales automatically to match the throughput of your data and requires no ongoing administration. For more information, see Ingest streaming data into Amazon Elasticsearch Service within the privacy of your VPC with Amazon Kinesis Data Firehose.

The following diagram illustrates the solution architecture.

Prerequisites

Before creating resources in AWS CloudFormation, you must enable server access logging on the source bucket. Open the S3 bucket properties and configure server access logging with the bucket where access logs should be delivered. See the following screenshot.

You also need an AWS Identity and Access Management (IAM) user with sufficient permissions to interact with the AWS Management Console and related AWS services. The user must have access to create IAM roles and policies via the CloudFormation template.

Setting up the resources with AWS CloudFormation

First, deploy the CloudFormation template to create the core components of the architecture. AWS CloudFormation automates the deployment of technology and infrastructure in a safe and repeatable manner across multiple Regions and multiple accounts with the least amount of effort and time.

  1. Sign in to the console and choose the Region of the bucket storing the access log. For this post, I use us-east-1.
  2. Launch the stack:
  3. Choose Next.
  4. For Stack name, enter a name.
  5. On the Parameters page, enter the following parameters:
    1. VPC Configuration – Select any VPC that has at least two private subnets. The template deploys the Amazon ES service domain and Lambda within the VPC.
    2. Private subnets – Select two private subnets of the VPC. The route tables associated with subnets must have a NAT gateway configuration and VPC endpoint for Amazon S3 to privately connect the bucket from Lambda.
    3. Access log S3 bucket – Enter the S3 bucket where access logs are delivered. The template configures event notification on the bucket to trigger the Lambda function.
    4. Amazon ES domain name – Specify the Amazon ES domain name to be deployed through the template.
  6. Choose Next.
  7. On the next page, choose Next.
  8. Acknowledge resource creation under Capabilities and transforms and choose Create.

The stack takes about 10–15 minutes to complete. The CloudFormation stack does the following:

  • Creates an Amazon ES domain with fine-grained access control enabled on it. Fine-grained access control is configured with a primary user in the internal user database.
  • Creates an IAM role for the Lambda function with the required permissions to read from the S3 bucket and write to Amazon ES.
  • Creates the Lambda function within the same VPC as the Amazon ES domain’s elastic network interfaces (ENIs). Amazon ES places an ENI in the VPC for each of your data nodes. The communication from Lambda to the Amazon ES domain goes through these ENIs.
  • Configures an object create event notification on the access log S3 bucket to trigger the Lambda function. The function code segments are discussed in detail in this GitHub project.

You must make several considerations before you proceed with a production-grade deployment. For this post, I use one primary shard with no replicas. As a best practice, we recommend deploying your domain into three Availability Zones with at least two replicas. This configuration lets Amazon ES distribute replica shards to different Availability Zones than their corresponding primary shards and improves the availability of your domain. For more information about sizing your Amazon ES domain, see Get started with Amazon Elasticsearch Service: T-shirt-size your domain.

We recommend setting the shard count based on your estimated index size, using 50 GB as a maximum target shard size. You should also define an index template to set the primary and replica shard counts before index creation. For more information about best practices, see Best practices for configuring your Amazon Elasticsearch Service domain.
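
As a sketch of what such an index template could look like, the following call applies a template with one primary shard and two replicas to indexes matching access-log*, reusing the illustrative ES_ENDPOINT and ES_AUTH conventions from the earlier sketch; adjust the shard counts to your own sizing:

import os
import requests  # assumes the requests library is available

ES_ENDPOINT = os.environ["ES_ENDPOINT"]  # illustrative, e.g. https://vpc-xxxx.us-east-1.es.amazonaws.com
ES_AUTH = (os.environ["ES_USER"], os.environ["ES_PASSWORD"])

# Apply the template before the first access-log index is created
requests.put(
    f"{ES_ENDPOINT}/_template/access-log",
    auth=ES_AUTH,
    headers={"Content-Type": "application/json"},
    json={
        "index_patterns": ["access-log*"],
        "settings": {"number_of_shards": 1, "number_of_replicas": 2},
    },
)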

For high-frequency incoming data, you can rotate indexes either per day or per week depending on the size of data being generated. You can use Index State Management to define custom management policies to automate routine tasks and apply them to indexes and index patterns.

Creating the Kibana user

With Amazon ES, you can use fine-grained access control to manage access to your data. Fine-grained access control adds multiple capabilities to give you tighter control over your data. This feature includes the ability to use roles to define granular permissions for indexes, documents, or fields, and to extend Kibana with read-only views and secure multi-tenant support. For more information on granular access control, see Fine-Grained Access Control in Amazon Elasticsearch Service.

For this post, you create a fine-grained role for Kibana access and map it to a user.

  1. Navigate to Kibana and enter the primary user credentials:
    1. User name – adminuser01
    2. Password – [email protected]

To access Kibana, you must have access to the VPC. For more information about accessing Kibana, see Controlling Access to Kibana.

  1. Choose Security, Roles.
  2. For Role name, enter kibana_only_role.
  3. For Cluster-wide permissions, choose cluster_composite_ops_ro.
  4. For Index patterns, enter access-log and kibana.
  5. For Permissions: Action Groups, choose read, delete, index, and manage.
  6. Choose Save Role Definition.
  7. Choose Security, Internal User Database, and Create a New User.
  8. For Open Distro Security Roles, choose kibana_only_role (created earlier).
  9. Choose Submit.

The user kibanauser01 now has full access to Kibana and the access-log indexes. You can log in to Kibana with this user and create the visuals and dashboards.

Building dashboards

You can use Kibana to build interactive visuals, analyze trends, and combine the visuals for different use cases into a dashboard. For example, you may want to see the number of requests made to the buckets in the last two days.

  1. Log in to Kibana using kibanauser01.
  2. Create an index pattern and set the time range.
  3. On the Visualize section of your Kibana dashboard, add a new visualization.
  4. Choose Vertical Bar.

You can select any time range and visual based on your requirements.

  1. Choose the index pattern and then configure your graph options.
  2. In the Metrics pane, expand Y-Axis.
  3. For Aggregation, choose Count.
  4. For Custom Label, enter Request Count.
  5. Expand the X-Axis.
  6. For Aggregation, choose Terms.
  7. For Field, choose bucket.
  8. For Order By, choose metric: Request Count.
  9. Choose Apply changes.
  10. Choose Add sub-bucket and expand the Split Series.
  11. For Sub Aggregation, choose Date Histogram.
  12. For Field, choose requestdatetime.
  13. For Interval, choose Daily.
  14. Apply the changes by choosing the play icon at the top of the page.

You should see the visual on the right side, similar to the following screenshot.

You can combine graphs of different use cases into a dashboard. I have built some example graphs for general use cases like the number of operations per bucket, user action breakdown for buckets, HTTP status rate, top users, and tabular formatted error details. See the following screenshots.

Cleaning up

Delete all the resources deployed through the CloudFormation template to avoid any unintended costs.

  1. Disable the access log on the source bucket.
  2. On the AWS CloudFormation console, identify the appropriate stacks and delete them.

Summary

This post detailed a solution to visualize and monitor Amazon S3 access logs using Amazon ES to ensure compliance, perform security audits, and discover risks and patterns at scale with minimal latency. To learn about best practices of Amazon ES, see Amazon Elasticsearch Service Best Practices. To learn how to analyze and create a dashboard of data stored in Amazon ES, see the AWS Security Blog.


About the Authors

Mahesh Goyal is a Data Architect in Big Data at AWS. He works with customers in their journey to the cloud with a focus on big data and data warehouses. In his spare time, Mahesh likes to listen to music and explore new food places with his family.

Uploading to Amazon S3 directly from a web or mobile application

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/uploading-to-amazon-s3-directly-from-a-web-or-mobile-application/

In web and mobile applications, it’s common to provide users with the ability to upload data. Your application may allow users to upload PDFs and documents, or media such as photos or videos. Every modern web server technology has mechanisms to allow this functionality. Typically, in the server-based environment, the process follows this flow:

Application server upload process

  1. The user uploads the file to the application server.
  2. The application server saves the upload to a temporary space for processing.
  3. The application transfers the file to a database, file server, or object store for persistent storage.

While the process is simple, it can have significant side effects on the performance of the web server in busier applications. Media uploads are typically large, so transferring these can represent a large share of network I/O and server CPU time. You must also manage the state of the transfer to ensure that the entire object is successfully uploaded, and manage retries and errors.

This is challenging for applications with spiky traffic patterns. For example, a web application that specializes in sending holiday greetings may experience most of its traffic only around holidays. If thousands of users attempt to upload media around the same time, you must scale out the application server and ensure that there is sufficient network bandwidth available.

By directly uploading these files to Amazon S3, you can avoid proxying these requests through your application server. This can significantly reduce network traffic and server CPU usage, and enable your application server to handle other requests during busy periods. S3 is also highly available and durable, making it an ideal persistent store for user uploads.

In this blog post, I walk through how to implement serverless uploads and show the benefits of this approach. This pattern is used in the Happy Path web application. You can download the code from this blog post in this GitHub repo.

Overview of serverless uploading to S3

When you upload directly to an S3 bucket, you must first request a signed URL from the Amazon S3 service. You can then upload directly using the signed URL. This is a two-step process for your application front end:

Serverless uploading to S3

  1. Call an Amazon API Gateway endpoint, which invokes the getSignedURL Lambda function. This gets a signed URL from the S3 bucket.
  2. Directly upload the file from the application to the S3 bucket.

To deploy the S3 uploader example in your AWS account:

  1. Navigate to the S3 uploader repo and install the prerequisites listed in the README.md.
  2. In a terminal window, run:
    git clone https://github.com/aws-samples/amazon-s3-presigned-urls-aws-sam
    cd amazon-s3-presigned-urls-aws-sam
    sam deploy --guided
  3. At the prompts, enter s3uploader for Stack Name and select your preferred Region. Once the deployment is complete, note the APIendpoint output.

CloudFormation stack outputs

Testing the application

I show two ways to test this application. The first is with Postman, which allows you to directly call the API and upload a binary file with the signed URL. The second is with a basic frontend application that demonstrates how to integrate the API.

To test using Postman:

  1. First, copy the API endpoint from the output of the deployment.
  2. In the Postman interface, paste the API endpoint into the box labeled Enter request URL.
  3. Choose Send.
  4. After the request is complete, the Body section shows a JSON response. The uploadURL attribute contains the signed URL. Copy this attribute to the clipboard.
  5. Select the + icon next to the tabs to create a new request.
  6. Using the dropdown, change the method from GET to PUT. Paste the URL into the Enter request URL box.
  7. Choose the Body tab, then the binary radio button.
  8. Choose Select file and choose a JPG file to upload.
    Choose Send. You see a 200 OK response after the file is uploaded.
  9. Navigate to the S3 console, and open the S3 bucket created by the deployment. In the bucket, you see the JPG file uploaded via Postman.

To test with the sample frontend application:

  1. Copy index.html from the example’s repo to an S3 bucket.
  2. Update the object’s permissions to make it publicly readable.
  3. In a browser, navigate to the public URL of the index.html file.
  4. Select Choose file and then select a JPG file to upload in the file picker. Choose Upload image. When the upload completes, a confirmation message is displayed.
  5. Navigate to the S3 console, and open the S3 bucket created by the deployment. In the bucket, you see the second JPG file you uploaded from the browser.

Understanding the S3 uploading process

When uploading objects to S3 from a web application, you must configure S3 for Cross-Origin Resource Sharing (CORS). CORS rules are defined as an XML document on the bucket. Using AWS SAM, you can configure CORS as part of the resource definition in the AWS SAM template:

   S3UploadBucket:
    Type: AWS::S3::Bucket
    Properties:
      CorsConfiguration:
        CorsRules:
        - AllowedHeaders:
            - "*"
          AllowedMethods:
            - GET
            - PUT
            - HEAD
          AllowedOrigins:
            - "*"

The preceding policy allows all headers and origins – it’s recommended that you use a more restrictive policy for production workloads.

In the first step of the process, the API endpoint invokes the Lambda function to make the signed URL request. The Lambda function contains the following code:

const AWS = require('aws-sdk')
AWS.config.update({ region: process.env.AWS_REGION })
const s3 = new AWS.S3()
const URL_EXPIRATION_SECONDS = 300

// Main Lambda entry point
exports.handler = async (event) => {
  return await getUploadURL(event)
}

const getUploadURL = async function(event) {
  const randomID = parseInt(Math.random() * 10000000)
  const Key = `${randomID}.jpg`

  // Get signed URL from S3
  const s3Params = {
    Bucket: process.env.UploadBucket,
    Key,
    Expires: URL_EXPIRATION_SECONDS,
    ContentType: 'image/jpeg'
  }
  const uploadURL = await s3.getSignedUrlPromise('putObject', s3Params)
  return JSON.stringify({
    uploadURL: uploadURL,
    Key
  })
}

This function determines the name, or key, of the uploaded object, using a random number. The s3Params object defines the accepted content type and also specifies the expiration of the signed URL. In this case, the URL is valid for 300 seconds. The signed URL is returned as part of a JSON object, along with the key, for the calling application.

The signed URL contains a security token with permissions to upload this single object to this bucket. To successfully generate this token, the code calling getSignedUrlPromise must have s3:putObject permissions for the bucket. This Lambda function is granted the S3WritePolicy policy to the bucket by the AWS SAM template.

The uploaded object must match the same file name and content type as defined in the parameters. An object matching the parameters may be uploaded multiple times, provided that the upload process starts before the token expires. The default expiration is 15 minutes, but you may want to specify shorter expirations depending upon your use case.

Once the frontend application receives the API endpoint response, it has the signed URL. The frontend application then uses the PUT method to upload binary data directly to the signed URL:

let blobData = new Blob([new Uint8Array(array)], {type: 'image/jpeg'})
const result = await fetch(signedURL, {
  method: 'PUT',
  body: blobData
})

At this point, the calling application is interacting directly with the S3 service and not with your API endpoint or Lambda function. S3 returns a 200 HTTP status code once the upload is complete.

For applications expecting a large number of user uploads, this provides a simple way to offload a large amount of network traffic to S3, away from your backend infrastructure.

Adding authentication to the upload process

The current API endpoint is open, available to any service on the internet. This means that anyone can upload a JPG file once they receive the signed URL. In most production systems, developers want to use authentication to control who has access to the API, and who can upload files to your S3 buckets.

You can restrict access to this API by using an authorizer. This sample uses HTTP APIs, which support JWT authorizers. This allows you to control access to the API via an identity provider, which could be a service such as Amazon Cognito or Auth0.

The Happy Path application only allows signed-in users to upload files, using Auth0 as the identity provider. The sample repo contains a second AWS SAM template, templateWithAuth.yaml, which shows how you can add an authorizer to the API:

  MyApi:
    Type: AWS::Serverless::HttpApi
    Properties:
      Auth:
        Authorizers:
          MyAuthorizer:
            JwtConfiguration:
              issuer: !Ref Auth0issuer
              audience:
                - https://auth0-jwt-authorizer
            IdentitySource: "$request.header.Authorization"
        DefaultAuthorizer: MyAuthorizer

Both the issuer and audience attributes are provided by the Auth0 configuration. By specifying this authorizer as the default authorizer, it is used automatically for all routes using this API. Read part 1 of the Ask Around Me series to learn more about configuring Auth0 and authorizers with HTTP APIs.

After authentication is added, the calling web application provides a JWT token in the headers of the request:

const response = await axios.get(API_ENDPOINT_URL, {
  headers: {
    Authorization: `Bearer ${token}`
        }
})

API Gateway evaluates this token before invoking the getUploadURL Lambda function. This ensures that only authenticated users can upload objects to the S3 bucket.

Modifying ACLs and creating publicly readable objects

In the current implementation, the uploaded object is not publicly accessible. To make an uploaded object publicly readable, you must set its access control list (ACL). There are preconfigured ACLs available in S3, including a public-read option, which makes an object readable by anyone on the internet. Set the appropriate ACL in the params object before calling s3.getSignedUrl:

const s3Params = {
  Bucket: process.env.UploadBucket,
  Key,
  Expires: URL_EXPIRATION_SECONDS,
  ContentType: 'image/jpeg',
  ACL: 'public-read'
}

Since the Lambda function must have the appropriate bucket permissions to sign the request, you must also ensure that the function has PutObjectAcl permission. In AWS SAM, you can add the permission to the Lambda function with this policy:

        - Statement:
          - Effect: Allow
            Resource: !Sub 'arn:aws:s3:::${S3UploadBucket}/*'
            Action:
              - s3:putObjectAcl

Conclusion

Many web and mobile applications allow users to upload data, including large media files like images and videos. In a traditional server-based application, this can create heavy load on the application server, and also use a considerable amount of network bandwidth.

By enabling users to upload files to Amazon S3, this serverless pattern moves the network load away from your service. This can make your application much more scalable, and capable of handling spiky traffic.

This blog post walks through a sample application repo and explains the process for retrieving a signed URL from S3. It explains how to test the URLs in both Postman and in a web application. Finally, I explain how to add authentication and make uploaded objects publicly accessible.

To learn more, see this video walkthrough that shows how to upload directly to S3 from a frontend web application. For more serverless learning resources, visit https://serverlessland.com.

Building a serverless document scanner using Amazon Textract and AWS Amplify

Post Syndicated from Moheeb Zara original https://aws.amazon.com/blogs/compute/building-a-serverless-document-scanner-using-amazon-textract-and-aws-amplify/

This guide demonstrates creating and deploying a production ready document scanning application. It allows users to manage projects, upload images, and generate a PDF from detected text. The sample can be used as a template for building expense tracking applications, handling forms and legal documents, or for digitizing books and notes.

The frontend application is written in Vue.js and uses the Amplify Framework. The backend is built using AWS serverless technologies and consists of an Amazon API Gateway REST API that invokes AWS Lambda functions. Amazon Textract is used to analyze text from uploaded images to an Amazon S3 bucket. Detected text is stored in Amazon DynamoDB.

An architectural diagram of the application.

An architectural diagram of the application.

Prerequisites

You need the following to complete the project:

Deploy the application

The solution consists of two parts: the frontend application and the serverless backend. The Amplify CLI deploys all the Amazon Cognito authentication and hosting resources for the frontend. The backend requires the Amazon Cognito user pool identifier to configure an authorizer on the API. This enables an authorization workflow, as shown in the following image.

A diagram showing how an Amazon Cognito authorization workflow works

A diagram showing how an Amazon Cognito authorization workflow works

First, configure the frontend. Complete the following steps using a terminal running on a computer or by using the AWS Cloud9 IDE. If using AWS Cloud9, create an instance using the default options.

From the terminal:

  1. Install the Amplify CLI by running this command.
    npm install -g @aws-amplify/cli
  2. Configure the Amplify CLI using this command. Follow the guided process to completion.
    amplify configure
  3. Clone the project from GitHub.
    git clone https://github.com/aws-samples/aws-serverless-document-scanner.git
  4. Navigate to the amplify-frontend directory and initialize the project using the Amplify CLI command. Follow the guided process to completion.
    cd aws-serverless-document-scanner/amplify-frontend
    
    amplify init
  5. Deploy all the frontend resources to the AWS Cloud using the Amplify CLI command.
    amplify push
  6. After the resources have finished deploying, make note of the StackName and UserPoolId properties in the amplify-frontend/amplify/backend/amplify-meta.json file. These are required when deploying the serverless backend.

Next, deploy the serverless backend. While it can be deployed using the AWS SAM CLI, you can also deploy from the AWS Management Console:

  1. Navigate to the document-scanner application in the AWS Serverless Application Repository.
  2. In Application settings, name the application and provide the StackName and UserPoolId from the frontend application for the UserPoolID and AmplifyStackName parameters. Provide a unique name for the BucketName parameter.
  3. Choose Deploy.
  4. Once complete, copy the API endpoint so that it can be configured on the frontend application in the next section.

Configure and run the frontend application

  1. Create a file, amplify-frontend/src/api-config.js, in the frontend application with the following content. Include the API endpoint and the unique BucketName from the previous step. The s3_region value must be the same as the Region where your serverless backend is deployed.
    const apiConfig = {
    	"endpoint": "<API ENDPOINT>",
    	"s3_bucket_name": "<BucketName>",
    	"s3_region": "<Bucket Region>"
    };
    
    export default apiConfig;
  2. In a terminal, navigate to the root directory of the frontend application and run it locally for testing.
    cd aws-serverless-document-scanner/amplify-frontend
    
    npm install
    
    npm run serve

    You should see an output like this:

  3. To publish the frontend application to cloud hosting, run the following command.
    amplify publish

    Once complete, a URL to the hosted application is provided.

Using the frontend application

Once the application is running locally or hosted in the cloud, navigating to it presents a user login interface with an option to register. The registration flow requires a code sent to the provided email address for verification. Once verified, you’re presented with the main application interface.

Once you create a project and choose it from the list, you are presented with an interface for uploading images by page number.

On mobile, it uses the device camera to capture images. On desktop, images are provided by the file system. You can replace an image and the page selector also lets you go back and change an image. The corresponding analyzed text is updated in DynamoDB as well.

Each time you upload an image, the page is incremented. Choosing “Generate PDF” calls the endpoint for the GeneratePDF Lambda function and returns a PDF in base64 format. The download begins automatically.

You can also open the PDF in another window, if viewing a preview in a desktop browser.

Understanding the serverless backend

An architecture diagram of the serverless backend.

An architecture diagram of the serverless backend.

In the GitHub project, the folder serverless-backend/ contains the AWS SAM template file and the Lambda functions. It creates an API Gateway endpoint, six Lambda functions, an S3 bucket, and two DynamoDB tables. The template also defines an Amazon Cognito authorizer for the API using the UserPoolID passed in as a parameter:

Parameters:
  UserPoolID:
    Type: String
    Description: (Required) The user pool ID created by the Amplify frontend.

  AmplifyStackName:
    Type: String
    Description: (Required) The stack name of the Amplify backend deployment. 

  BucketName:
    Type: String
    Default: "ds-userfilebucket"
    Description: (Required) A unique name for the user file bucket. Must be all lowercase.  


Globals:
  Api:
    Cors:
      AllowMethods: "'*'"
      AllowHeaders: "'*'"
      AllowOrigin: "'*'"

Resources:

  DocumentScannerAPI:
    Type: AWS::Serverless::Api
    Properties:
      StageName: Prod
      Auth:
        DefaultAuthorizer: CognitoAuthorizer
        Authorizers:
          CognitoAuthorizer:
            UserPoolArn: !Sub 'arn:aws:cognito-idp:${AWS::Region}:${AWS::AccountId}:userpool/${UserPoolID}'
            Identity:
              Header: Authorization
        AddDefaultAuthorizerToCorsPreflight: False

This only allows authenticated users of the frontend application to make requests with a JWT token containing their user name and email. The backend uses that information to fetch and store data in DynamoDB that corresponds to the user making the request.

Two DynamoDB tables are created. A Project table, which tracks all the project names by user, and a Pages table, which tracks pages by project and user. The DynamoDB tables are created by the AWS SAM template with the partition key and range key defined for each table. These are used by the Lambda functions to query and sort items. See the documentation to learn more about DynamoDB table key schema.

ProjectsTable:
    Type: AWS::DynamoDB::Table
    Properties: 
      AttributeDefinitions: 
        - 
          AttributeName: "username"
          AttributeType: "S"
        - 
          AttributeName: "project_name"
          AttributeType: "S"
      KeySchema: 
        - AttributeName: username
          KeyType: HASH
        - AttributeName: project_name
          KeyType: RANGE
      ProvisionedThroughput: 
        ReadCapacityUnits: "5"
        WriteCapacityUnits: "5"

  PagesTable:
    Type: AWS::DynamoDB::Table
    Properties: 
      AttributeDefinitions: 
        - 
          AttributeName: "project"
          AttributeType: "S"
        - 
          AttributeName: "page"
          AttributeType: "N"
      KeySchema: 
        - AttributeName: project
          KeyType: HASH
        - AttributeName: page
          KeyType: RANGE
      ProvisionedThroughput: 
        ReadCapacityUnits: "5"
        WriteCapacityUnits: "5"

When an API Gateway endpoint is called, it passes the user credentials in the request context to a Lambda function. This is used by the CreateProject Lambda function, which also receives a project name in the request body, to create an item in the Project Table and associate it with a user.
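
A minimal sketch of such a function follows; it assumes the Cognito claims are available on the request context, and the table name environment variable and request body shape shown here are illustrative rather than the repo’s exact implementation:

import json
import os

import boto3

table = boto3.resource("dynamodb").Table(os.environ["PROJECTS_TABLE_NAME"])  # illustrative name

def handler(event, context):
    # Create a project item keyed by the authenticated user's name
    claims = event["requestContext"]["authorizer"]["claims"]
    username = claims["cognito:username"]

    body = json.loads(event["body"])
    project_name = body["project_name"]  # illustrative request body shape

    table.put_item(Item={"username": username, "project_name": project_name})

    return {
        "statusCode": 200,
        "headers": {"Access-Control-Allow-Origin": "*"},
        "body": json.dumps({"username": username, "project_name": project_name}),
    }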

The endpoint for the FetchProjects Lambda function is called to retrieve the list of projects associated with a user. The DeleteProject Lambda function removes a specific project from the Project table and any associated pages in the Pages table. It also deletes the folder in the S3 bucket containing all images for the project.

When a user enters a Project, the API endpoint calls the FetchPageCount Lambda function. This returns the number of pages for a project to update the current page number in the upload selector. The project is retrieved from the path parameters, as defined in the AWS SAM template:

FetchPageCount:
    Type: AWS::Serverless::Function
    Properties:
      Handler: app.handler
      Runtime: python3.8
      CodeUri: lambda_functions/fetchPageCount/
      Policies:
        - DynamoDBCrudPolicy:
            TableName: !Ref PagesTable
      Environment:
        Variables:
          PAGES_TABLE_NAME: !Ref PagesTable
      Events:
        GetResource:
          Type: Api
          Properties:
            RestApiId: !Ref DocumentScannerAPI
            Path: /pages/count/{project+}
            Method: get  

The template creates an S3 bucket and two AWS IAM managed policies. The policies are applied to the AuthRole and UnauthRole created by Amplify. This allows users to upload images directly to the S3 bucket. To understand how Amplify works with Storage, see the documentation.

The template also sets an S3 event notification on the bucket for all object create events with a “.png” suffix. Whenever the frontend uploads an image to S3, the object create event invokes the ProcessDocument Lambda function.

The function parses the object key to get the project name, user, and page number. Amazon Textract then analyzes the text of the image. The object returned by Amazon Textract contains the detected text and detailed information, such as the positioning of text in the image. Only the raw lines of text are stored in the Pages table.

import os
import urllib.parse
import boto3

# Clients are created outside the handler so they are reused across invocations
dynamodb = boto3.resource('dynamodb')
textract = boto3.client('textract')

tableName = os.environ.get('PAGES_TABLE_NAME')

def handler(event, context):

  table = dynamodb.Table(tableName)

  # The object key in the event payload is URL-encoded
  key = urllib.parse.unquote(event['Records'][0]['s3']['object']['key'])
  bucket = event['Records'][0]['s3']['bucket']['name']

  # Parse the object key to extract the user, project name, and page number
  user = key.split('/')[2]
  project = key.split('/')[3]
  page = key.split('/')[4].split('.')[0]

  # Analyze the text in the uploaded image with Amazon Textract
  response = textract.detect_document_text(
    Document={
        'S3Object': {
            'Bucket': bucket,
            'Name': key
        }
    })

  # Keep only the raw lines of text, one line per detected LINE block
  fullText = ""
  for item in response["Blocks"]:
    if item["BlockType"] == "LINE":
        fullText = fullText + item["Text"] + '\n'

  # Log the detected text for troubleshooting
  print(fullText)

  # Store the detected text in the Pages table, keyed by user/project and page number
  table.put_item(Item={
    'project': user + '/' + project,
    'page': int(page),
    'text': fullText
    })

  return

The GeneratePDF Lambda function retrieves the detected text for each page in a project from the Pages table. It combines the text into a PDF and returns it as a base64-encoded string for download. This function can be modified if your document structure differs.
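
As a rough illustration of that approach, the following sketch queries the Pages table and assembles a PDF with the classic fpdf (PyFPDF) package. The environment variable, path parameter, and response format are assumptions rather than the repo's exact implementation.

import os
import base64
import boto3
from boto3.dynamodb.conditions import Key
from fpdf import FPDF  # assumes the classic fpdf (PyFPDF) package is bundled with the function

dynamodb = boto3.resource('dynamodb')
table = dynamodb.Table(os.environ['PAGES_TABLE_NAME'])  # assumed environment variable

def handler(event, context):
  project = event['pathParameters']['project']  # assumed path parameter

  # Fetch every page of detected text for this project
  pages = table.query(KeyConditionExpression=Key('project').eq(project))['Items']

  # Write each page of text onto its own PDF page, in page-number order
  pdf = FPDF()
  for page in sorted(pages, key=lambda p: p['page']):
    pdf.add_page()
    pdf.set_font('Arial', size=11)
    pdf.multi_cell(0, 5, page['text'])

  # With PyFPDF, output(dest='S') returns the document as a string
  pdf_bytes = pdf.output(dest='S').encode('latin-1')

  return {
    'statusCode': 200,
    'isBase64Encoded': True,
    'headers': {'Content-Type': 'application/pdf'},
    'body': base64.b64encode(pdf_bytes).decode('utf-8')
  }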

Understanding the frontend

In the GitHub repo, the folder amplify-frontend/src/ contains all the code for the frontend application. In main.js, the Amplify VueJS modules are configured to use the resources defined in aws-exports.js. It also configures the endpoint and S3 bucket of the serverless backend, defined in api-config.js.

In components/DocumentScanner.vue, the API module is imported and the API is defined.

API calls are defined as Vue methods that can be called by various other components and elements of the application.

In components/Project.vue, the frontend uses the Storage module for Amplify to upload images. For more information on how to use S3 in an Amplify project see the documentation.

Conclusion

This blog post shows how to create a multiuser application that can analyze text from images and generate PDF documents. This guide demonstrates how to do so in a secure and scalable way using a serverless approach. The example also shows an event-driven pattern for handling high-volume image processing using S3, Lambda, and Amazon Textract.

The Amplify Framework simplifies the process of implementing authentication, storage, and backend integration. Explore the full solution on GitHub to modify it for your next project or startup idea.

To learn more about AWS serverless and keep up to date on the latest features, subscribe to the YouTube channel.

#ServerlessForEveryone

Log your VPC DNS queries with Route 53 Resolver Query Logs

Post Syndicated from Martin Beeby original https://aws.amazon.com/blogs/aws/log-your-vpc-dns-queries-with-route-53-resolver-query-logs/

The Amazon Route 53 team has just launched a new feature called Route 53 Resolver Query Logs, which will let you log all DNS queries made by resources within your Amazon Virtual Private Cloud. Whether it’s an Amazon Elastic Compute Cloud (EC2) instance, an AWS Lambda function, or a container, if it lives in your Virtual Private Cloud and makes a DNS query, then this feature will log it; you are then able to explore and better understand how your applications are operating.

Our customers explained to us that DNS query logs were important to them. Some wanted the logs so that they could be compliant with regulations; others wished to monitor DNS querying behavior so they could spot security threats. Others simply wanted to troubleshoot application issues that were related to DNS. The team listened to our customers and has developed what I have found to be an elegant and easy-to-use solution.

From knowing very little about the Route 53 Resolver, I was able to configure query logging and have it working with barely a second glance at the documentation, which I assure you is a testament to the intuitiveness of the feature rather than me having any significant experience with Route 53 or DNS query logging.

You can choose to have the DNS query logs sent to one of three AWS services: Amazon CloudWatch Logs, Amazon Simple Storage Service (S3), and Amazon Kinesis Data Firehose. The target service you choose will depend mainly on what you want to do with the data. If you have compliance mandates (for example, Australia’s Information Security Registered Assessors Program), then storing the logs in Amazon Simple Storage Service (S3) may be a good option. If you plan to monitor and analyze DNS queries in real time, or you integrate your logs with a third-party data analysis tool like Kibana or a SIEM tool like Splunk, then perhaps Amazon Kinesis Data Firehose is the option for you. For those of you who want an easy way to search, query, monitor metrics, or raise alarms, Amazon CloudWatch Logs is a great choice, and this is what I will show in the following demo.

Over in the Route 53 Console, near the Resolver menu section, I see a new item called Query logging. Clicking on this takes me to a screen where I can configure the logging.

The dashboard shows the current configurations that are setup. I click Configure query logging to get started.

The console asks me to fill out some necessary information, such as a friendly name; I’ve named mine demoNewsBlog.

I am now prompted to select the destination where I would like my logs to be sent. I choose the CloudWatch Logs log group and select the option to Create log group. I give my new log group the name /aws/route/demothebeebsnet.

Next, I need to select which VPCs I would like to log queries for. Any resource that sits inside the VPCs I choose here will have its DNS queries logged. You are also able to add tags to this configuration. I am in the habit of tagging anything that I use as part of a demo with the tag demo. This is so I can easily distinguish between demo resources and live resources in my account.

Finally, I press the Configure query logging button, and the configuration is saved. Within a few moments, the service has successfully enabled the query logging in my VPC.

After a few minutes, I log into the Amazon CloudWatch Logs console and can see that the logs have started to appear.

As you can see below, I was quickly able to start searching my logs and running queries using Amazon CloudWatch Logs Insights.

There is a lot you can do with the Amazon CloudWatch Logs service. For example, I could use CloudWatch Metric Filters to automatically generate metrics or even create dashboards. While putting this demo together, I also discovered a feature inside of Amazon CloudWatch Logs called Contributor Insights that enables you to analyze log data and create time series that display top talkers. Very quickly, I was able to produce this graph, which lists out the most common DNS queries over time.
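
If you would rather run this kind of analysis programmatically, a minimal sketch with CloudWatch Logs Insights through boto3 follows; it assumes the log group created earlier and the query_name field emitted by Resolver query logs.

import time
import boto3

logs = boto3.client('logs')

# Find the ten most frequently queried domain names over the last hour
query_id = logs.start_query(
    logGroupName='/aws/route/demothebeebsnet',   # the log group created earlier
    startTime=int(time.time()) - 3600,
    endTime=int(time.time()),
    queryString='stats count(*) as queries by query_name | sort queries desc | limit 10',
)['queryId']

# Poll until the query finishes, then print each row of results
while True:
    response = logs.get_query_results(queryId=query_id)
    if response['status'] in ('Complete', 'Failed', 'Cancelled'):
        break
    time.sleep(1)

for row in response.get('results', []):
    print({field['field']: field['value'] for field in row})
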
Route 53 Resolver Query Logs is available in all AWS Commercial Regions that support Route 53 Resolver Endpoints, and you can get started using either the API or the AWS Console. You do not pay for the Route 53 Resolver Query Logs, but you will pay for handling the logs in the destination service that you choose. So, for example, if you decided to use Amazon Kinesis Data Firehose, then you will incur the regular charges for handling logs with the Amazon Kinesis Data Firehose service.

Happy Logging

— Martin

Using serverless backends to iterate quickly on web apps – part 1

Post Syndicated from James Beswick original https://aws.amazon.com/blogs/compute/using-serverless-backends-to-iterate-quickly-on-web-apps-part-1/

For many organizations, building applications is an iterative process where requirements change quickly. Traditional software architectures can be challenging to adapt to these changes. Often, early architectural decisions may limit the developers’ ability to deliver new features. Serverless architectural patterns are often much more adaptable, and can help developers keep pace with an evolving list of end-user requirements.

This blog series explores how to structure and build a serverless web app backend to enable the most flexibility for changing product requirements. It covers how to use serverless services in your architecture, and how to separate parts of the backend to make maintenance easier. I also show how you can use AWS Step Functions to encapsulate complex workflows and minimize the amount of custom code in your applications.

In this series:

  • Part 1: Deploy the application, test the upload process, and review the architecture.
  • Part 2: Understand how to use Step Functions, and deploy a custom workflow.
  • Part 3: Advanced workflows with custom branching and image moderation.

The code uses the AWS Serverless Application Model (AWS SAM), enabling you to deploy the application easily in your own AWS account. This walkthrough creates resources covered in the AWS Free Tier but you may incur cost for usage beyond development and testing.

To set up the example, visit the GitHub repo and follow the instructions in the README.md file.

Introducing the “Happy Path” web application

In this scenario, a startup creates a web application called Happy Path. This app is designed to help state parks and nonprofit organizations replace printed materials, such as flyers and maps, with user-generated content. It allows visitors to capture images of park notices and photos of hiking trails. They can share these with other users to reduce printed waste.

The frontend displays and captures images of different locations, and the backend processes this data according to a set of business rules. This web application is designed for smartphones so it’s used while visitors are at the locations. Here is the typical user flow:

Happy Path user interface

  1. When park visitors first navigate to the site’s URL, it shows their current location with parks highlighted in the vicinity.
  2. The visitor selects a park. It shows thumbnails of any maps, photos, and images already uploaded by other users.
  3. If the visitor is logged in, they can upload their own images directly from their smartphone.

The first production version of this application provides a simple way for users to upload photos. It does little more than provide an uploading and sharing process.

However, the developer team quickly realizes that they must make some improvements. The developers need a way to implement complex, changing workflows on the backend without refactoring the code that is running in production. The architecture must also scale for an expected 100,000 monthly active users.

First, they want to optimize the large uploaded images to improve the speed of downloads. Next, they must also determine the suitability of images to ensure that the app only shows appropriate photos. There is also a rapidly growing list of feature requirements from organizations using the app.

In this series, I show how the development team can design the app to provide this level of flexibility. This way, they can implement new features and even pivot the core application if needed.

Deploying the application

In the GitHub repo, there are detailed deployment instructions in the README. The repo contains separate directories for the frontend, backend, and workflows. You must deploy the backend first. Once you have completed the deployment, you can run the frontend code on your local machine.

To launch the frontend application:

  1. Change to the frontend directory.
  2. Run npm run serve to start the development server. After building the modules in the project, the terminal shows the local URL where the application is running:
    Vue build completed
  3. Open a web browser and navigate to http://localhost:8080 to see the application.
  4. Open the developer console in your browser (for Google Chrome, Mozilla Firefox and Microsoft Edge, press F12 on the keyboard). This displays the application in a responsive layout and shows console logging. This can help you understand the flow of execution in the application.

Happy Path browser developer console

Testing the application

Now that you have deployed the backend to your AWS account and are running the frontend locally, you can test the application.

To upload an image for a location:

  1. Choose Log In and sign into the application, creating a new account if necessary.
  2. Select a location on the map to open the information window.
    Select a location on the map
  3. Choose Show Details, then choose Upload Images.
    Uploading images in Happy Path
  4. In the file picker dialog, select any one of the images from the sample photos dataset.

At this stage, the image is now uploaded to the S3 Uploads bucket on the backend. To verify this:

  1. Navigate to the Amazon S3 console.
  2. Choose the application’s upload bucket, then choose the folder name to open its contents. This shows the uploaded image.
    S3 bucket contents
  3. Navigate to the Amazon DynamoDB console.
  4. Select the hp-application table, then select the Items tab.
    DynamoDB table contents

There are two records shown:

  • The place listing: this item contains details about the selected park, such as the name and address.
  • The file metadata: this stores information about who uploaded the file, the timestamp, and the state of the upload.

At this stage, you have successfully tested that the frontend can upload images to the backend.

Architecture overview

After deploying the application using the repo’s README instructions, the backend architecture looks like this:

Happy Path backend architecture

There are five distinct functional areas for the backend application:

  1. API layer: when users interact with one of the API endpoints, this is processed by the API layer. Each API route invokes a Lambda function to complete its task, storing and fetching data from the storage layer.
  2. Storage layer: information about user uploads is persisted durably here. The application uses Amazon S3 buckets to store the binary objects, and a DynamoDB table for associated metadata.
  3. Notification layer: when images are uploaded, the PUT event triggers a Lambda function. This publishes the event to the Amazon EventBridge default event bus (a minimal sketch of this step follows this list).
  4. Business logic layer: the customized business logic is encapsulated in AWS Step Functions workflows.
  5. Content distribution: the processed images are served via an Amazon CloudFront distribution to reduce latency and optimize delivery cost.
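
The sketch below shows how that notification function might publish the upload event to the default event bus; the source and detail-type values are illustrative, not the application's actual event schema.

import json
import boto3

eventbridge = boto3.client('events')

def handler(event, context):
    # Pull the bucket and object key out of the S3 PUT event record
    record = event['Records'][0]
    bucket = record['s3']['bucket']['name']
    key = record['s3']['object']['key']

    # Publish a custom event to the default event bus for the workflow layer to consume
    eventbridge.put_events(
        Entries=[{
            'Source': 'happy-path.uploads',      # illustrative source name
            'DetailType': 'ImageUploaded',       # illustrative detail type
            'Detail': json.dumps({'bucket': bucket, 'key': key}),
        }]
    )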

For future requirements, you can implement increasingly complex customized logic entirely within the business logic layer. All new workflow features are implemented here, without needing to modify other parts of the application.

Conclusion

This series is about using serverless backends to allow you to iterate quickly on web application functionality.

In this post, I introduce the Happy Path example web application. I show the main features of the application, enabling end-users to upload maps and photos to the backend application. I walk through the deployment of the backend and frontend applications. Finally, you test with a sample image upload.

In part 2, you will deploy the image processing and workflow part of the application. This series explores progressively more complicated workflows, and how to manage their deployment. I will discuss some architectural choices that help to build in flexibility and scalability when designing backend applications.

To learn more about building serverless web applications, see the Ask Around Me series.

Anonymize and manage data in your data lake with Amazon Athena and AWS Lake Formation

Post Syndicated from Manos Samatas original https://aws.amazon.com/blogs/big-data/anonymize-and-manage-data-in-your-data-lake-with-amazon-athena-and-aws-lake-formation/

Organizations collect and analyze more data than ever before. They move as fast as they can on their journey to become more data driven by using the insights from their data.

Different roles use data for different purposes. For example, data engineers transform the data before further processing, data analysts access the data and produce reports, and data scientists with domain and technical expertise can train machine learning algorithms. Those roles require access to the data, and access has never been easier to grant.

At the same time, most organizations have to comply with regulations when dealing with their customer data. For that reason, datasets that contain personally identifiable information (PII) are often anonymized. A common example of PII can be tables and columns that contain personal information about an individual (such as first name and last name), or tables with columns that, if joined with another table, can trace back to an individual.

You can use AWS Analytics services to anonymize your datasets. In this post, I describe how to use Amazon Athena to anonymize a dataset.  You can then use AWS Lake Formation to provide the right access to the right personas.

Use case

To better understand the concept, we use a straightforward use case: analysts in your organization need access to a dataset with sales data, some of which contains PII information. As the data lake admin, you’re not comfortable with all personnel having access to customers’ PII. To address this, you can use an anonymized dataset.

This use case has two users:

  • datalake_admin – Responsible for data anonymization and for making sure the right permissions are enforced. They classify the data, generate anonymized datasets, and configure the required permissions.
  • datalake_analyst – Only has access to the anonymized dataset. They can extract patterns for users without tracing the request back to an individual customer.

The following AWS CloudFormation template generates the AWS Glue tables that you use later in this post:

However, the template doesn’t create the datalake_admin and datalake_analyst users. For more information about personas in Lake Formation, see Lake Formation Personas and IAM Permissions Reference.

Solution architecture

For this solution, you use the following services:

  • Lake Formation – Lake Formation makes it easy to set up a secure data lake—a centralized, curated, and secured repository that stores all your data, both in its original form and prepared for analysis. The data lake admin can easily label the data and give users permission to access authorized datasets.
  • Athena – Athena is an interactive query service that makes it easy to analyze data in Amazon S3 using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries you run. For this use case, the data lake admin uses Athena to anonymize the data, after which the data analyst can use Athena for interactive analytics over anonymized datasets.
  • Amazon S3Amazon Simple Storage Service (Amazon S3) is an object storage service that offers industry-leading scalability, data availability, security, and performance. For this use case, you use Amazon S3 as storage for the data lake.

The following diagram illustrates the architecture for this solution.

In this architecture, there are no servers to manage. You only pay what you use. You can use the same solution for small or large datasets. The scaling happens behind the scenes but in a transparent way.

In the following sections, you look in more detail on how to do the following:

  • Label sensitive data with AWS Lake Formation
  • Anonymize data with Athena
  • Apply permissions with Lake Formation
  • Analyze the anonymized datasets

Labeling the sensitive data with Lake Formation

As a data lake admin, the first task is to label the personal information. Tags don’t enforce any security controls, but applying a good tagging strategy is a great way to describe the data. Tags are key-value pairs that you can apply to your AWS resources, including tables and columns in your data lake. For this use case, you apply a very simple tagging strategy: for the columns that contain PII, you give the value PII.

You interact with the following tables from the tcp-ds dataset, which both have their data stored in Amazon S3 in CSV format:

  • store_sales – Stores sales data and references other tables that you can join together for more sophisticated business queries. The table has a foreign key with the customer table on the ss_customer_sk column. This key, when joined with the customer table, can uniquely identify a user. For that reason, treat this column as personal information.
  • customer – Stores customer data, a lot of which is PII. In addition to c_customer_sk, you could use data such as the customer ID (c_customer_id), customer first name (c_first_name), customer last name (c_last_name), login (c_login), and email (c_email_address) to uniquely identify a customer.

To start tagging your columns (starting with the store_sales table), complete the following steps:

  1. As the data lake admin user, log in to the Lake Formation console.
  2. Choose Data Catalog Tables.
  3. Select store_sales.
  4. Choose Edit schema.
  5. Select the column you want to edit (ss_customer_sk).
  6. Choose Edit.
  7. For Key, enter Classification.
  8. For Value, enter PII.
  9. Choose Save.

To verify that you can apply the added column properties, use the Lake Formation API to get the table description.

  1. On the Data Catalog Tables page, select store_sales.
  2. Choose View properties.

The table properties look like the following JSON object:

{
  "Name": "store_sales",
  "DatabaseName": "tcp-ds-1tb",
  "Owner": "owner",
  "CreateTime": "2019-09-13T10:15:04.000Z",
  "UpdateTime": "2020-03-18T16:10:34.000Z",
  "LastAccessTime": "2019-09-13T10:15:03.000Z",
  "Retention": 0,
  "StorageDescriptor": {
    "Columns": [
      {
        "Name": "ss_sold_date_sk",
        "Type": "bigint",
        "Parameters": {}
      },
      ...
      {
        "Name": "ss_customer_sk",
        "Type": "bigint",
        "Parameters": {
          "Classification": "PII"
        }
      },
      ...
    ]
  },
  ...
}

The additional column properties are now in the table metadata.
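
If you prefer to script the tagging instead of using the console, a hedged sketch using the AWS Glue API (which backs the Data Catalog) is shown below. The database name comes from the table properties above; the set of keys copied into TableInput is trimmed because UpdateTable rejects read-only fields.

import boto3

glue = boto3.client('glue')

DATABASE = 'tcp-ds-1tb'
TABLE = 'store_sales'
PII_COLUMNS = {'ss_customer_sk'}

table = glue.get_table(DatabaseName=DATABASE, Name=TABLE)['Table']

# Add the Classification parameter to each PII column definition
for column in table['StorageDescriptor']['Columns']:
    if column['Name'] in PII_COLUMNS:
        column.setdefault('Parameters', {})['Classification'] = 'PII'

# UpdateTable only accepts the fields allowed in TableInput, so copy just those
allowed = {'Name', 'Description', 'Owner', 'Retention', 'StorageDescriptor',
           'PartitionKeys', 'TableType', 'Parameters'}
table_input = {k: v for k, v in table.items() if k in allowed}

glue.update_table(DatabaseName=DATABASE, TableInput=table_input)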

  1. Repeat the preceding steps for the customer table and label the following columns:
    • c_customer_sk
    • c_customer_id
    • c_first_name
    • c_last_name
    • c_login
    • c_email_address

Adding a tag also allows you to perform metadata searches by tag attributes. For more information, see Discovering metadata with AWS Lake Formation: Part 1 and Discover metadata with AWS Lake Formation: Part 2.

Anonymizing data with Athena

The data lake admin now needs to provide the data analyst anonymized datasets for analytics. For this use case, you want to extract patterns on the customer table and the store_sales table separately, but you also want to join the two tables so you can perform more sophisticated queries.

The first step is to create a database in Lake Formation to organize tables in AWS Glue.

  1. On the Lake Formation console, under Data Catalog, choose Databases.
  2. Choose Create database.
  3. For Name, enter a name, such as anonymised_tcp_ds_1tb.
  4. Optionally, enter an Amazon S3 path for the database and a description.
  5. Choose Create database.

The next step is to create the tables that contain the anonymized data. Before you do so, consider the significance of each anonymized column from an analytics point of view. For columns that have little or no value in the analytics process, omitting the column altogether might be the right approach. You might use other columns as primary keys to join with other tables. To make sure that you can join the tables, you can apply a hash function to the table foreign keys.

A common approach to anonymize sensitive information is hashing. A hash function is any function that you can use to map data of arbitrary size to fixed-size values. For more information, see Hash function.

The following table summarizes your strategy for each column.

Table          Column                  Strategy
customer       customer_first_name     hash
customer       customer_last_name      hash
customer       c_login                 omit
customer       customer_id             hash
customer       c_email_address         omit
customer       c_customer_sk           hash
store_sales    ss_customer_sk          hash

If you use the same value as the input of your hash function, it always returns the same result. In addition, and contrary to encryption, you can’t reverse hashing.
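
As a quick illustration of that determinism, the following snippet applies the same SHA-256 transformation in Python; the input values are arbitrary examples.

import hashlib

def anonymize(value):
    # Mirrors Athena's sha256(to_utf8(cast(value AS varchar)))
    return hashlib.sha256(str(value).encode('utf-8')).hexdigest()

print(anonymize(429871))   # the same customer key ...
print(anonymize(429871))   # ... always produces the same digest, so joins still work
print(anonymize(429872))   # a different key produces an unrelated digest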

  1. Use Athena string functions to hash individual columns and generate anonymized datasets.
  2. After you create those datasets, you can use Lake Formation to apply security controls. See the following code:
CREATE table "tcp-ds-anonymized".customer
WITH (format='parquet',external_location = 's3://tcp-ds-eu-west-1-1tb-anonymised/2/customer_parquet/')
AS SELECT       
         sha256(to_utf8(cast(c_customer_sk AS varchar))) AS c_customer_sk_anonym,
         sha256(to_utf8(cast(c_customer_id AS varchar))) AS c_customer_id_anonym,
         sha256(to_utf8(cast(c_first_name AS varchar))) AS c_first_name_anonym,
         sha256(to_utf8(cast(c_last_name AS varchar))) AS c_last_name_anonym,
         c_current_cdemo_sk,
         c_current_hdemo_sk,
         c_first_shipto_date_sk,
         c_first_sales_date_sk,
         c_salutation,
         c_preferred_cust_flag,
         c_current_addr_sk,
         c_birth_day,
         c_birth_month,
         c_birth_year,
         c_birth_country,
         c_last_review_date_sk
FROM customer
  1. To preview the data, enter the following code:
SELECT c_first_name_anonym, c_last_name_anonym FROM "tcp-ds-anonymized"."customer" limit 10;

The following screenshot shows the output of your query.

  1. To repeat these steps for the store_sales table, enter the following code:
CREATE table "tcp-ds-anonymized".store_sales
WITH (format='parquet',external_location = 's3://tcp-ds-eu-west-1-1tb-anonymised/1/store_sales/')
AS SELECT sha256(to_utf8(cast(ss_customer_sk AS varchar))) AS ss_customer_sk_anonym,
         ss_sold_date_sk,
         ss_sales_price,
         ss_sold_time_sk,
         ss_item_sk,
         ss_hdemo_sk,
         ss_addr_sk,
         ss_store_sk,
         ss_promo_sk,
         ss_ticket_number,
         ss_quantity,
         ss_wholesale_cost,
         ss_list_price,
         ss_ext_discount_amt,
         ss_ext_sales_price,
         ss_ext_wholesale_cost,
         ss_ext_list_price,
         ss_ext_tax,
         ss_coupon_amt,
         ss_net_paid,
         ss_net_paid_inc_tax,
         ss_net_profit
FROM store_sales;

One of the challenges you need to overcome when working with CTAS queries is that the query’s Amazon S3 location should be unique for the table you’re creating. You can add an incremental value or timestamp to the path of the table, for example, s3://<bucket>/<table_name>/<version>, and make sure you use a different version number every time.

You can delete older data programmatically using Amazon S3 APIs or SDK. You can also use Amazon S3 lifecycle configuration to tell Amazon S3 to transition objects to another Amazon S3 storage class. For more information, see Object lifecycle management.

You can automate the anonymization of the CTAS query with AWS Glue jobs. AWS Glue provides a lightweight Python shell job option that can call the Amazon Athena API programmatically.
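
A hedged sketch of such a job follows. It appends a timestamp to both the table name and the external_location so that each run writes to a unique S3 path; the column list is abbreviated, and the database, bucket, and query results location are placeholders.

import time
import boto3

athena = boto3.client('athena')

version = time.strftime('%Y%m%d%H%M%S')
external_location = f's3://tcp-ds-eu-west-1-1tb-anonymised/{version}/customer_parquet/'

ctas = f"""
CREATE TABLE "tcp-ds-anonymized".customer_{version}
WITH (format='parquet', external_location='{external_location}')
AS SELECT
    sha256(to_utf8(cast(c_customer_sk AS varchar))) AS c_customer_sk_anonym,
    sha256(to_utf8(cast(c_customer_id AS varchar))) AS c_customer_id_anonym,
    c_birth_year,
    c_birth_country
FROM customer
"""

response = athena.start_query_execution(
    QueryString=ctas,
    QueryExecutionContext={'Database': 'tcp-ds-1tb'},                        # source database
    ResultConfiguration={'OutputLocation': 's3://my-athena-query-results/'}  # placeholder bucket
)
print('Started CTAS query:', response['QueryExecutionId'])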

Applying permissions with Lake Formation

Now that you have the table structures and anonymized datasets, you can apply the required permissions using Lake Formation.

  1. On the Lake Formation console, under Data Catalog, choose Tables.
  2. Select the tables that contain the anonymized data.
  3. From the Actions drop-down menu, under Permissions, choose Grant.
  4. For IAM users and roles, choose the IAM user for the data analyst.
  5. For Table permissions, select Select.
  6. Choose Grant.

You can now view all table permissions and verify the permissions granted to a particular principal.
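
The same grant can also be scripted. A minimal boto3 sketch follows; the analyst's IAM ARN, database, and table name are placeholders.

import boto3

lakeformation = boto3.client('lakeformation')

# Grant the data analyst SELECT on the anonymized customer table
lakeformation.grant_permissions(
    Principal={
        'DataLakePrincipalIdentifier': 'arn:aws:iam::111122223333:user/datalake_analyst'
    },
    Resource={
        'Table': {
            'DatabaseName': 'anonymised_tcp_ds_1tb',
            'Name': 'customer'
        }
    },
    Permissions=['SELECT']
)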

Analyzing the anonymized datasets

To verify that the role can access the right tables and query the anonymized datasets, complete the following steps:

  1. Sign in to the AWS Management Console as the data analyst.
  2. Under Analytics, choose Amazon Athena.

You should see a query field, similar to the following screenshot.

You can now test your access with queries. To see the top customers by revenue and last name, enter the following code:

SELECT c_last_name_anonym,
sum(ss_sales_price) AS total_sales
FROM store_sales
JOIN customer
ON store_sales.ss_customer_sk_anonym = customer.c_customer_sk_anonym
GROUP BY c_last_name_anonym
ORDER BY total_sales DESC limit 10;

The following screenshot shows the query output.

You can also try to query a table that you don’t have access to. You should receive an error message.

Conclusion

Anonymizing a dataset is often a prerequisite before users can start analyzing it. In this post, we discussed how data lake admins can use Athena and Lake Formation to label and anonymize data stored in Amazon S3. You can then use Lake Formation to apply permissions to the dataset and allow other users to access the data.

The services we discussed in this post are serverless. Building serverless applications means that your developers can focus on their core product instead of worrying about managing and operating servers or runtimes, either in the cloud or on-premises. This reduced overhead lets developers reclaim time and energy that they can spend on developing great products that scale and that are reliable.

 


About the Author

Manos Samatas is a Specialist Solutions Architect in Big Data and Analytics with Amazon Web Services. Manos lives and works in London. He specialises in architecting Big Data and Analytics solutions for Public Sector customers in the EMEA region.

The serverless LAMP stack part 4: Building a serverless Laravel application

Post Syndicated from Benjamin Smith original https://aws.amazon.com/blogs/compute/the-serverless-lamp-stack-part-4-building-a-serverless-laravel-application/

In this post, you learn how to deploy a Laravel application with a serverless approach.

This is the fourth post in the “Serverless LAMP stack” series; previous posts covered:

Laravel is an open source web application framework for PHP. Using a framework helps developers to build faster by reusing generic components and modules. It also helps long-term maintenance by complying with development standards. However, there are still challenges when scaling PHP frameworks with a traditional LAMP stack. Deploying a framework using a serverless approach can help solve these challenges.

There are a number of solutions that simplify the deployment of a Laravel application onto a serverless infrastructure. The following solution uses an AWS Serverless Application Model (AWS SAM) template. This deploys a Laravel application into a single Lambda function. The function uses the Bref FPM custom runtime layer to run PHP. The AWS SAM template deploys the following architecture, explained in detail in “The Serverless LAMP stack Part 3: Replacing the web server”:

The serverless LAMP stack

Deploying Laravel and Bref with AWS SAM

Composer is a dependency management tool for PHP. It allows you to declare and manage your project libraries and dependencies such as Laravel and Bref.

Deploy Laraval and Bref with AWS SAM using the following steps:

  1. Download the Laravel installer using Composer:
    composer global require laravel/installer
  2. Install Laravel:
    composer create-project --prefer-dist laravel/laravel blog
  3. In the Laravel project, install Bref using Composer:
    composer require bref/laravel-bridge
  4. Clone the AWS SAM template in your application’s root directory:
    git clone https://github.com/aws-samples/php-examples-for-aws-lambda/
  5. Change directory into “0.4-Building-A-Serverless-Laravel-App-With-AWS-SAM”:
    cd 0.4-Building-A-Serverless-Laravel-App-With-AWS-SAM
  6. Deploy the application using the AWS SAM CLI guided deploy:
    sam deploy -g

Once AWS SAM deploys the application, it returns the Amazon CloudFront distribution’s domain name. This distribution serves the serverless Laravel application.


CloudFront domain name from AWS SAM template

Configuring Laravel for Lambda

There are some configuration changes required for Laravel to run in a Lambda function.

Session data store

While Lambda includes a 512 MB temporary file system, this is an ephemeral resource not intended for durable storage. This is because there is no guarantee of reusing the same Lambda function environment for each invocation.

For this reason, if you need Laravel session data, it must be stored outside of the Lambda function. There are a range of different options available for managing state with serverless applications. In this instance, it is recommended to store session data either in a database or using browser cookies.

Update the Laravel .env file to set the session_driver to cookie.

SESSION_DRIVER=cookie

Logging

Laravel implements a PHP logging library called Monolog as a common interface to write logs to a number of destinations. Laravel Monolog uses log channels to specify these destinations. Each channel is defined within the /config/logging.php file as an associative array.

Since the Lambda filesystem is not shared between multiple Lambda function invocations, application logs must be written to an external central location such as Amazon CloudWatch Logs. All errors, warnings, and notices emitted by PHP are forwarded onto CloudWatch Logs. This makes it easy to view, search, filter, or archive logs for future analysis from a single location. To configure this, add the following to the Laravel .env file:

LOG_CHANNEL=stderr

This ensures that the stderr channel is used to write all application logs which are automatically forwarded to CloudWatch Logs. This channel is defined in /config/logging.php:

'stderr' => 
    [ 
    'driver' => 'monolog', 
    'handler' => StreamHandler::class, 
    'formatter' => env('LOG_STDERR_FORMATTER'), 
    'with' => [ 
        'stream' => 'php://stderr', 
    ], 
],

CloudWatch Logs for a single Lambda invocation

Compiled views

Views contain the HTML served by an application, separating application logic from presentation logic. By default, views are compiled on demand inside the application’s storage directory.

As Lambda does not have write access to the storage directory, Laravel must be configured to write views to the function’s /tmp directory. This is a temporary file system for ephemeral data that’s only needed for the duration of each HTTP request.

In the .env file, add the following line to configure Laravel to use a new directory path for compiled views:

VIEW_COMPILED_PATH=/tmp/storage/framework/views

Laravel uses service providers to register or “bootstrap” components to your application. The AppServiceProvider.php file provides a central location to share data with all views. Add the following code to the Providers/AppServiceProvider.php file.

public function boot() { 
    // Make sure the directory for compiled views exists
    if (! is_dir(config('view.compiled'))) { 
        mkdir(config('view.compiled'), 0755, true); 
    } 
}

This ensures that the view directory is automatically created for each Lambda function invocation, if it does not already exist.

File system abstraction with Amazon S3

Laravel uses a filesystem abstraction package called Flysystem. This provides a simple driver mechanism to configure the filesystem location. As Lambda’s /tmp directory is ephemeral, the filesystem location must be outside of the Lambda function. Configure Laravel to use the Amazon S3 filesystem driver by adding the following line to the .env file:

FILESYSTEM_DRIVER=s3

The AWS SAM template deploys an S3 bucket to store these objects:

Storage:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: php-example-laravel-FileSystemBucket

The bucket name is provided to the Lambda function as an environment variable from within the AWS SAM template:

    Environment:
      Variables:
        AWS_BUCKET: !Ref Storage

The Lambda function is granted permission to read/write to the S3 bucket, using an IAM policy definition:

Policies:
        - S3FullAccessPolicy:
            BucketName: !Ref Storage

Laravel’s filesystem configuration is found at config/filesystems.php. This is where the S3 filesystem disk is defined using the AWS SAM environment variable.

's3' => [
            'driver' => 's3',
            'key' => env('AWS_ACCESS_KEY_ID'),
            'secret' => env('AWS_SECRET_ACCESS_KEY'),
            'token' => env('AWS_SESSION_TOKEN'),
            'region' => env('AWS_DEFAULT_REGION'),
            'bucket' => env('AWS_BUCKET'),
            'url' => env('AWS_URL'),
            'endpoint' => env('AWS_ENDPOINT'),
        ],

The AWS account information and bucket ARN are provided by the Lambda environment that is running PHP, using Laravel’s env() function.

Public asset files

Laravel has a public disk driver for storing publicly accessible files such as images
and CSS files. By default, the public disk driver stores these files in storage/app/public/. These files must instead be stored in S3. Change the configuration in config/filesystems.php to the following:

+ 'public' => env('FILESYSTEM_DRIVER_PUBLIC', 'public_local'),
    
    'disks' => [

        'local' => [
            'driver' => 'local',
            'root' => storage_path('app'),
        ],

- 'public' => [
+ 'public_local' => [
            'driver' => 'local',
            'root' => storage_path('app/public'),
            'url' => env('APP_URL').'/storage',
            'visibility' => 'public',
        ],

        's3' => [
            'driver' => 's3',
            'key' => env('AWS_ACCESS_KEY_ID'),
            'secret' => env('AWS_SECRET_ACCESS_KEY'),
            'token' => env('AWS_SESSION_TOKEN'),
            'region' => env('AWS_DEFAULT_REGION'),
            'bucket' => env('AWS_BUCKET'),
            'url' => env('AWS_URL'),
            'endpoint' => env('AWS_ENDPOINT'),
        ],

+ 's3_public' => [
+     'driver' => 's3',
+     'key' => env('AWS_ACCESS_KEY_ID'),
+     'secret' => env('AWS_SECRET_ACCESS_KEY'),
+     'token' => env('AWS_SESSION_TOKEN'),
+     'region' => env('AWS_DEFAULT_REGION'),
+     'bucket' => env('AWS_PUBLIC_BUCKET'),
+     'url' => env('AWS_URL'),
+ ],

    ],

This adds a new filesystem disk named s3_public, which uses an S3 driver. Laravel's env() function retrieves the AWS_PUBLIC_BUCKET environment variable to configure the bucket location. The bucket name is passed to the Lambda function as an environment variable.

Add the following line to the .env file to configure the public disk to use S3:

FILESYSTEM_DRIVER_PUBLIC=s3

Referencing static assets in view templates

Laravel’s asset() helper function generates a URL for an asset using the current scheme of the request (HTTP or HTTPS):

$url = asset('img/photo.jpg');

These assets must be stored on S3 and served via CloudFront’s global CDN. Configure the URL host by setting the ASSET_URL variable in your .env file:

ASSET_URL=https://{YourCloudFrontDomain}.cloudfront.net

This allows the application to correctly reference assets from S3, via the CloudFront domain. Laravel’s native asset() helper function is used from within the view templates with the following format:

<img src="{{ asset('assets/icons.png') }}">

Serverless Laravel App with Lambda

Alternative deployments methods for a serverless Laravel application

1. Bref, an open source custom runtime for PHP, recently merged a new pull request to automatically configure Laravel for Lambda. This new package also provides a way to integrate Amazon SQS with the Laravel Queues Jobs system.

2. Laravel Vapor is a serverless deployment platform for Laravel. This is a paid service, built by the Laravel team on the AWS Cloud.

Conclusion

This post explains how to deploy a PHP Laravel application using a serverless approach with AWS SAM. It explains the initial Laravel configuration steps required to implement a session store and centralised logging with an external filesystem and static assets in S3.

PHP development teams can focus on shipping code without changing the way they build. Start building serverless applications with PHP.

Visit this GitHub repository for accompanying code and instructions.

New – Using Amazon GuardDuty to Protect Your S3 Buckets

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/new-using-amazon-guardduty-to-protect-your-s3-buckets/

As we anticipated in this post, the anomaly and threat detection for Amazon Simple Storage Service (S3) activities that was previously available in Amazon Macie has now been enhanced and reduced in cost by over 80% as part of Amazon GuardDuty. This expands GuardDuty threat detection coverage beyond workloads and AWS accounts to also help you protect your data stored in S3.

This new capability enables GuardDuty to continuously monitor and profile S3 data access events (usually referred to as data plane operations) and S3 configurations (control plane APIs) to detect suspicious activities such as requests coming from an unusual geo-location, disabling of preventative controls such as S3 Block Public Access, or API call patterns consistent with an attempt to discover misconfigured bucket permissions. To detect possibly malicious behavior, GuardDuty uses a combination of anomaly detection, machine learning, and continuously updated threat intelligence. For your reference, here’s the full list of GuardDuty S3 threat detections.

When threats are detected, GuardDuty produces detailed security findings to the console and to Amazon EventBridge, making alerts actionable and easy to integrate into existing event management and workflow systems, or trigger automated remediation actions using AWS Lambda. You can optionally deliver findings to an S3 bucket to aggregate findings from multiple regions, and to integrate with third party security analysis tools.

If you are not using GuardDuty yet, S3 protection will be on by default when you enable the service. If you are already using GuardDuty, you can enable this new capability with one click in the GuardDuty console or through the API. For simplicity, and to optimize your costs, GuardDuty has now been integrated directly with S3. In this way, you don’t need to manually enable or configure S3 data event logging in AWS CloudTrail to take advantage of this new capability. GuardDuty also intelligently processes only the data events that can be used to generate threat detections, significantly reducing the number of events processed and lowering your costs.

If you are part of a centralized security team that manages GuardDuty across your entire organization, you can manage all accounts from a single account using the integration with AWS Organizations.

Enabling S3 Protection for an AWS Account
I already have GuardDuty enabled for my AWS account in this region. Now, I want to add threat detection for my S3 buckets. In the GuardDuty console, I select S3 Protection and then Enable. That’s it. To be more protected, I repeat this process for all regions enabled in my account.
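
If you prefer to enable it through the API instead, a minimal boto3 sketch might look like this; it assumes GuardDuty is already enabled and updates the existing detector in the current Region.

import boto3

guardduty = boto3.client('guardduty')

# Look up the detector for this Region, then turn on S3 protection
detector_id = guardduty.list_detectors()['DetectorIds'][0]

guardduty.update_detector(
    DetectorId=detector_id,
    DataSources={'S3Logs': {'Enable': True}}
)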

After a few minutes, I start seeing new findings related to my S3 buckets. I can select each finding to get more information on the possible threat, including details on the source actor and the target action.

After a few days, I select the Usage section of the console to monitor the estimated monthly costs of GuardDuty in my account, including the new S3 protection. I can also see which S3 buckets are contributing the most to the costs. Well, it turns out I didn’t have lots of traffic on my buckets recently.

Enabling S3 Protection for an AWS Organization
To simplify management of multiple accounts, GuardDuty uses its integration with AWS Organizations to allow you to delegate an account to be the administrator for GuardDuty for the whole organization.

Now, the delegated administrator can enable GuardDuty for all accounts in the organization in a Region with one click. You can also set Auto-enable to ON to automatically include new accounts in the organization. If you prefer, you can add accounts by invitation. You can then go to the S3 Protection page under Settings to enable S3 protection for the entire organization.

When selecting Auto-enable, the delegated administrator can also choose to enable S3 protection automatically for new member accounts.

Available Now
As always, with Amazon GuardDuty, you only pay for the quantity of logs and events processed to detect threats. This includes API control plane events captured in CloudTrail, network flow captured in VPC Flow Logs, DNS request and response logs, and with S3 protection enabled, S3 data plane events. These sources are ingested by GuardDuty through internal integrations when you enable the service, so you don’t need to configure any of these sources directly. The service continually optimizes logs and events processed to reduce your cost, and displays your usage split by source in the console. If configured in multi-account, usage is also split by account.

There is a 30-day free trial for the new S3 threat detection capabilities. This also applies to accounts that already have GuardDuty enabled and add the new S3 protection capability. During the trial, the estimated cost based on your S3 data event volume is calculated in the GuardDuty console Usage tab. In this way, while you evaluate these new capabilities at no cost, you can understand what your monthly spend would be.

GuardDuty for S3 protection is available in all regions where GuardDuty is offered. For regional availability, please see the AWS Region Table. To learn more, please see the documentation.

Danilo

Building a Self-Service, Secure, & Continually Compliant Environment on AWS

Post Syndicated from Japjot Walia original https://aws.amazon.com/blogs/architecture/building-a-self-service-secure-continually-compliant-environment-on-aws/

Introduction

If you’re an enterprise organization, especially in a highly regulated sector, you understand the struggle to innovate and drive change while maintaining your security and compliance posture. In particular, your banking customers’ expectations and needs are changing, and there is a broad move away from traditional branch and ATM-based services towards digital engagement.

With this shift, customers now expect personalized product offerings and services tailored to their needs. To achieve this, a broad spectrum of analytics and machine learning (ML) capabilities are required. With security and compliance at the top of financial service customers’ agendas, being able to rapidly innovate and stay secure is essential. To achieve exactly that, AWS Professional Services engaged with a major Global systemically important bank (G-SIB) customer to help develop ML capabilities and implement a Defense in Depth (DiD) security strategy. This blog post provides an overview of this solution.

The machine learning solution

The following architecture diagram shows the ML solution we developed for a customer. This architecture is designed to achieve innovation, operational performance, and security performance in line with customer-defined control objectives, as well as meet the regulatory and compliance requirements of supervisory authorities.

Machine learning solution developed for customer

This solution is built and automated using AWS CloudFormation templates with pre-configured security guardrails and abstracted through the service catalog. AWS Service Catalog allows you to quickly let your users deploy approved IT services ensuring governance, compliance, and security best practices are enforced during the provisioning of resources.

Further, it leverages Amazon SageMaker, Amazon Simple Storage Service (S3), and Amazon Relational Database Service (RDS) to facilitate the development of advanced ML models. As security is paramount for this workload, data in S3 is encrypted using client-side encryption, and column-level encryption is applied to columns in RDS. Our customer also codified their security controls via AWS Config rules to achieve continual compliance.

Compute and network isolation

To enable our customer to rapidly explore new ML models while achieving the highest standards of security, separate VPCs were used to isolate infrastructure, with access controlled by security groups. Core to this solution is Amazon SageMaker, a fully managed service that provides the ability to rapidly build, train, and deploy ML models. Amazon SageMaker notebooks are managed Jupyter notebooks that:

  1. Prepare and process data
  2. Write code to train models
  3. Deploy models to SageMaker hosting
  4. Test or validate models

In our solution, notebooks run in an isolated VPC with no egress connectivity other than VPC endpoints, which enable private communication with AWS services. When used in conjunction with VPC endpoint policies, you can control which services the notebooks can access. In our solution, this is used to allow the SageMaker notebook to communicate only with resources owned by accounts within the organization through the use of the aws:PrincipalOrgID condition key. AWS Organizations helps provide governance to meet strict compliance regulations, and you can use the aws:PrincipalOrgID condition key in your resource-based policies to restrict access to IAM principals from accounts within your organization.
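
A hedged sketch of applying such an endpoint policy with boto3 is shown below; the endpoint ID and organization ID are placeholders.

import json
import boto3

ec2 = boto3.client('ec2')

# Allow the endpoint to be used only by principals from accounts in the organization
policy = {
    'Version': '2012-10-17',
    'Statement': [{
        'Effect': 'Allow',
        'Principal': '*',
        'Action': '*',
        'Resource': '*',
        'Condition': {
            'StringEquals': {'aws:PrincipalOrgID': 'o-exampleorgid'}   # placeholder organization ID
        }
    }]
}

ec2.modify_vpc_endpoint(
    VpcEndpointId='vpce-0123456789abcdef0',   # placeholder endpoint ID
    PolicyDocument=json.dumps(policy)
)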

Data protection

Amazon S3 is used to store training data, model artifacts, and other data sets. Our solution uses server-side encryption with customer master keys (CMKs) stored in AWS Key Management Service (SSE-KMS) to protect data at rest. SSE-KMS leverages KMS and uses an envelope encryption strategy with CMKs. Envelope encryption is the practice of encrypting data with a data key and then encrypting that data key using another key – the CMK. CMKs are created in KMS and never leave KMS unencrypted. This approach allows fine-grained control over access to the CMK, and all access and attempted access to the key is logged in Amazon CloudTrail. In our solution, the age of the CMK is tracked by AWS Config and the key is regularly rotated. AWS Config enables you to assess, audit, and evaluate the configurations of deployed AWS resources by continuously monitoring and recording AWS resource configurations. This allows you to automate the evaluation of recorded configurations against desired configurations.
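
As an example of enforcing SSE-KMS by default on a bucket, a minimal boto3 sketch follows; the bucket name and key ARN are placeholders.

import boto3

s3 = boto3.client('s3')

# Encrypt every new object in the bucket with the specified CMK by default
s3.put_bucket_encryption(
    Bucket='example-ml-training-data',   # placeholder bucket name
    ServerSideEncryptionConfiguration={
        'Rules': [{
            'ApplyServerSideEncryptionByDefault': {
                'SSEAlgorithm': 'aws:kms',
                'KMSMasterKeyID': 'arn:aws:kms:eu-west-1:111122223333:key/EXAMPLE-KEY-ID'   # placeholder key ARN
            }
        }]
    }
)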

Amazon S3 Block Public Access is also used at the account level to ensure that bucket policies or access control lists (ACLs) on existing and newly created resources don’t allow public access. Service control policies (SCPs) are used to prevent users from modifying this setting. AWS Config continually monitors S3 and remediates any attempt to make a bucket public.
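
The account-level setting can be applied through the S3 Control API. A minimal sketch follows; the account ID is a placeholder.

import boto3

s3control = boto3.client('s3control')

# Block public ACLs and public bucket policies for every bucket in the account
s3control.put_public_access_block(
    AccountId='111122223333',   # placeholder account ID
    PublicAccessBlockConfiguration={
        'BlockPublicAcls': True,
        'IgnorePublicAcls': True,
        'BlockPublicPolicy': True,
        'RestrictPublicBuckets': True
    }
)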

Data in the solution is classified according to its sensitivity, corresponding to the customer’s data classification hierarchy. Classification in the solution is achieved through resource tagging, and tags are used in conjunction with AWS Config to ensure adherence to encryption, data retention, and archival requirements.

Continuous compliance

Our solution adopts a continuous compliance approach, whereby the compliance status of the architecture is continuously evaluated and auto-remediated if a configuration change attempts to violate the compliance posture. To achieve this, AWS Config and config rules are used to confirm that resources are configured in compliance with defined policies. AWS Lambda is used to implement a custom rule set that extends the rules included in AWS Config.
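
A skeletal custom rule handler is sketched below to show the reporting flow back to AWS Config; a real rule would replace the placeholder check with its own evaluation logic.

import json
import boto3

config = boto3.client('config')

def handler(event, context):
    invoking_event = json.loads(event['invokingEvent'])
    configuration_item = invoking_event['configurationItem']

    # Placeholder policy check: substitute your own evaluation of the configuration item
    compliance = 'COMPLIANT'

    # Report the evaluation result back to AWS Config
    config.put_evaluations(
        Evaluations=[{
            'ComplianceResourceType': configuration_item['resourceType'],
            'ComplianceResourceId': configuration_item['resourceId'],
            'ComplianceType': compliance,
            'OrderingTimestamp': configuration_item['configurationItemCaptureTime']
        }],
        ResultToken=event['resultToken']
    )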

Data exfiltration prevention

In our solution, VPC Flow Logs are enabled on all accounts to record information about the IP traffic going to and from network interfaces in each VPC. This allows us to watch for abnormal and unexpected outbound connection requests, which could be an indication of attempts to exfiltrate data. Amazon GuardDuty analyzes VPC Flow Logs, AWS CloudTrail event logs, and DNS logs to identify unexpected and potentially malicious activity within the AWS environment. For example, GuardDuty can detect compromised Amazon Elastic Cloud Compute (EC2) instances communicating with known command-and-control servers.

Conclusion

Financial services customers are using AWS to develop machine learning and analytics solutions to solve key business challenges while ensuring security and compliance needs. This post outlined how Amazon SageMaker, along with multiple security services (AWS Config, GuardDuty, KMS), enables building a self-service, secure, and continually compliant data science environment on AWS for a financial service use case.

 

Learn and use 13 AWS security tools to implement SEC recommended protection of stored customer data in the cloud

Post Syndicated from Sireesh Pachava original https://aws.amazon.com/blogs/security/learn-and-use-13-aws-security-tools-to-implement-sec-recommended-protection-stored-customer-data-cloud/

Most businesses collect, process, and store sensitive customer data that needs to be secured to earn customer trust and protect customers against abuses. Regulated businesses must prove they meet guidelines established by regulatory bodies. As an example, in the capital markets, broker-dealers and investment advisors must demonstrate they address the guidelines proposed by the Office of Compliance Inspections and Examinations (OCIE), a division of the United States Securities and Exchange Commission (SEC).

So what do you do as a business to secure and protect customer data in the cloud, and to provide assurance to an auditor or regulator on customer data protection?

In this post, I will introduce you to 13 key AWS tools that you can use to address different facets of data protection across different types of AWS storage services. As a structure for the post, I will explain the key findings and issues the SEC OCIE found, and will explain how these tools help you meet the toughest compliance obligations and guidance. These tools and use cases apply to other industries as well.

What SEC OCIE observations mean for AWS customers

The SEC established Regulation S-P (the primary rule for privacy notices and safeguard policies) and Regulation S-ID (the identity theft red flags rule) as compliance requirements for financial institutions, including securities firms. In 2019, the OCIE examined broker-dealers’ and investment advisors’ use of network storage solutions, including cloud storage, to identify gaps in effective practices to protect stored customer information. OCIE noted gaps in security settings, configuration management, and oversight of vendor network storage solutions. OCIE also noted that firms don’t always use the available security features on storage solutions. The gaps can be summarized into the three problem areas below, which are common to businesses in other industries as well.

  • Misconfiguration – Misconfigured network storage solution and missed security settings
  • Monitoring & Oversight – Inadequate oversight of vendor-provided network storage solutions
  • Data protection – Insufficient data classification policies and procedures

So how can you effectively use AWS security tools and capabilities to review and enhance your security and configuration management practices?

AWS tools and capabilities to help review, monitor and address SEC observations

I will cover the 13 key AWS tools that you can use to address different facets of data protection for storage under the same three broad headings as above: 1. Misconfiguration, 2. Monitoring & Oversight, 3. Data protection.

All of these 13 tools rely on automated monitoring alerts along with detective, preventative, and predictive controls to help enable the available security features and data controls. Effective monitoring, security analysis, and change management are key to help companies, including capital markets firms protect customers’ data and verify the effectiveness of security risk mitigation.

AWS offers a complete range of cloud storage services to help you meet your application and archival compliance requirements. Some of the AWS storage services for common industry use are:

I use Amazon S3 and Amazon EBS for examples in this post.

Establish control guardrails by operationalizing the shared responsibility model

Before covering the 13 tools, let me reinforce the foundational pillar of cloud security. The AWS shared responsibility model, where security and compliance is a shared responsibility between AWS and you as the AWS customer, is consistent with OCIE recommendations for ownership and accountability, and for use of all available security features.

We start with the baseline structure for operationalizing the control guardrails. A lack of clear understanding of the shared responsibility model can result in missed controls or unused security features. Clarifying and operationalizing this shared responsibility model and shared controls helps enable the controls to be applied to both the infrastructure layer and customer layers, but in completely separate contexts or perspectives.

Security of the cloud – AWS is responsible for protecting the infrastructure that runs all of the services offered in the AWS cloud.

Security in the cloud – Your responsibility as a user of AWS is determined by the AWS cloud services that you select. This determines the amount of configuration work you must perform as part of your security responsibilities. You’re responsible for managing data in your care (including encryption options), classifying your assets, and using IAM tools to apply the appropriate permissions.

Misconfiguration – Monitor, detect, and remediate misconfiguration with AWS cloud storage services

Monitoring, detection, and remediation are the specific areas noted by the OCIE. Misconfiguration of settings results in errors such as inadvertent public access, unrestricted access permissions, and unencrypted records. Based on your use case, you can use a wide suite of AWS services to monitor, detect, and remediate misconfiguration.

Access analysis via AWS Identity and Access Management (IAM) Access Analyzer – Identifying whether anyone can access your resources from outside your AWS account due to misconfiguration is critical. Access Analyzer identifies resources that can be accessed from outside your AWS account. For example, Access Analyzer continuously monitors for new or updated policies, and it analyzes permissions granted by policies for Amazon S3 buckets, AWS Key Management Service (AWS KMS) keys, and IAM roles. To learn more about using IAM Access Analyzer to flag unintended access to S3 buckets, see IAM Access Analyzer flags unintended access to S3 buckets shared through access points.
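As an illustration, the following minimal boto3 sketch lists active Access Analyzer findings for S3 buckets. The account-level analyzer lookup and filter values are illustrative assumptions, not part of the original guidance.

import boto3

# A minimal sketch: list active findings for S3 buckets from the first
# account-level analyzer in the Region.
analyzer_client = boto3.client("accessanalyzer")

analyzers = analyzer_client.list_analyzers(type="ACCOUNT")["analyzers"]
if analyzers:
    analyzer_arn = analyzers[0]["arn"]
    findings = analyzer_client.list_findings(
        analyzerArn=analyzer_arn,
        filter={
            "resourceType": {"eq": ["AWS::S3::Bucket"]},
            "status": {"eq": ["ACTIVE"]},
        },
    )["findings"]
    for finding in findings:
        # Each finding names the shared resource and the external principal.
        print(finding["resource"], finding.get("principal"))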

Actionable security checks via AWS Trusted Advisor – Unrestricted access increases opportunities for malicious activity such as hacking, denial-of-service attacks, and data theft. Trusted Advisor posts security advisories that should be regularly reviewed and acted on. Trusted Advisor can alert you to risks such as Amazon S3 buckets that aren’t secured and Amazon EBS volume snapshots that are marked as public. Bucket permissions that don’t limit who can upload or delete data create potential security vulnerabilities by allowing anyone to add, modify, or remove items in a bucket. Trusted Advisor examines explicit bucket permissions and associated bucket policies that might override the bucket permissions. It also checks security groups for rules that allow unrestricted access to a resource. To learn more about using Trusted Advisor, see How do I start using Trusted Advisor?
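If you want to pull these advisories programmatically, a hedged sketch using the AWS Support API follows. It assumes a Business or Enterprise Support plan (which the Support API requires) and simply filters checks whose names mention S3.

import boto3

# A minimal sketch; the Support API endpoint lives in us-east-1.
support = boto3.client("support", region_name="us-east-1")

checks = support.describe_trusted_advisor_checks(language="en")["checks"]
for check in checks:
    if "S3" in check["name"]:
        result = support.describe_trusted_advisor_check_result(
            checkId=check["id"], language="en"
        )["result"]
        for resource in result.get("flaggedResources", []):
            # Print the check name, the resource status, and its metadata row.
            print(check["name"], resource["status"], resource["metadata"])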

Encryption via AWS Key Management Service (AWS KMS) – Simplifying the process to create and manage encryption keys is critical to configuring data encryption by default. You can use AWS KMS master keys to automatically control the encryption of data stored within services integrated with AWS KMS, such as Amazon EBS and Amazon S3. AWS KMS gives you centralized control over the encryption keys used to protect your data. AWS KMS is designed so that no one, including the service operators, can retrieve plaintext master keys from the service. The service uses FIPS 140-2 validated hardware security modules (HSMs) to protect the confidentiality and integrity of keys. For example, you can specify that all newly created Amazon EBS volumes be created in encrypted form, with the option to use the default key provided by AWS KMS or a key you create. Amazon S3 inventory can be used to audit and report on the replication and encryption status of objects for business, compliance, and regulatory needs. To learn more about using KMS to enable data encryption on S3, see How to use KMS and IAM to enable independent security controls for encrypted data in S3.
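A minimal sketch of turning on encryption by default follows; the bucket name and key ARN are placeholders.

import boto3

s3 = boto3.client("s3")
ec2 = boto3.client("ec2")

# Require SSE-KMS by default for all new objects written to the bucket.
s3.put_bucket_encryption(
    Bucket="example-data-bucket",
    ServerSideEncryptionConfiguration={
        "Rules": [
            {
                "ApplyServerSideEncryptionByDefault": {
                    "SSEAlgorithm": "aws:kms",
                    "KMSMasterKeyID": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
                }
            }
        ]
    },
)

# Encrypt all newly created EBS volumes in this Region by default.
ec2.enable_ebs_encryption_by_default()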

Monitoring & Oversight – AWS storage services provide ongoing monitoring, assessment, and auditing

Continuous monitoring and regular assessment of control environment changes and compliance are key to data storage oversight. They help you validate whether security and access settings and permissions across your organization’s cloud storage are in compliance with your security policies and flag non-compliance. For example, you can use AWS Config or AWS Security Hub to simplify auditing, security analysis, monitoring, and change management.

Configuration compliance monitoring via AWS Config – You can use AWS Config to assess how well your resource configurations align with internal practices, industry guidelines, and regulations; it provides a detailed view of the configuration of AWS resources, including current and historical configuration snapshots and changes. AWS Config managed rules are predefined, customizable rules that evaluate whether your AWS resources align with common best practices. Config rules can be used to evaluate configuration settings, detect and remediate violations of the conditions in the rules, and flag non-compliance with internal practices. This helps demonstrate compliance against internal policies and best practices for data that requires frequent audits. For example, you can use a managed rule to quickly assess whether your EBS volumes are encrypted or whether specific tags are applied to your resources. Another example is an ongoing detective control that checks that your S3 buckets don’t allow public read access; the rule checks the block public access setting, the bucket policy, and the bucket access control list (ACL). You can configure the logic that determines compliance with internal practices, which lets you automatically mark IAM roles in use as compliant and inactive roles as non-compliant. To learn more about using AWS Config rules, see Setting up custom AWS Config rule that checks the OS CIS compliance.
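The following sketch deploys the AWS managed rule that flags S3 buckets allowing public read access and then lists its compliance results. The rule name is an illustrative choice.

import boto3

config = boto3.client("config")

# Deploy the managed rule (requires an AWS Config recorder in the account).
config.put_config_rule(
    ConfigRule={
        "ConfigRuleName": "s3-bucket-public-read-prohibited",
        "Source": {
            "Owner": "AWS",
            "SourceIdentifier": "S3_BUCKET_PUBLIC_READ_PROHIBITED",
        },
    }
)

# List compliance results once evaluations have run.
compliance = config.describe_compliance_by_config_rule(
    ConfigRuleNames=["s3-bucket-public-read-prohibited"]
)
for item in compliance["ComplianceByConfigRules"]:
    print(item["ConfigRuleName"], item["Compliance"]["ComplianceType"])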

Automated compliance checks via AWS Security Hub – Security Hub reduces the complexity and effort of managing and improving the security and compliance of your AWS accounts and workloads. It helps improve compliance by running continuous, automated account- and resource-level configuration checks against the rules in supported industry best practices and standards, such as the CIS AWS Foundations Benchmark. Security Hub insights are grouped findings that highlight emerging trends or possible issues. For example, insights help to identify Amazon S3 buckets with public read or write permissions. Security Hub also collects findings from partner security products using a standardized AWS security finding format, eliminating the need for time-consuming data parsing and normalization efforts. To learn more about Security Hub, see AWS Foundational Security Best Practices standard now available in Security Hub.
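A hedged sketch of pulling failed compliance findings from Security Hub follows; it assumes Security Hub is already enabled in the account and Region.

import boto3

securityhub = boto3.client("securityhub")

# List active findings whose compliance status is FAILED, such as failed
# CIS AWS Foundations Benchmark checks.
findings = securityhub.get_findings(
    Filters={
        "ComplianceStatus": [{"Value": "FAILED", "Comparison": "EQUALS"}],
        "RecordState": [{"Value": "ACTIVE", "Comparison": "EQUALS"}],
    },
    MaxResults=25,
)["Findings"]

for finding in findings:
    print(finding["Title"], [r["Id"] for r in finding["Resources"]])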

Security and compliance reports via AWS Artifact – As part of independent oversight, third-party auditors test more than 2,600 standards and requirements in the AWS environment throughout the year. AWS Artifact provides on-demand access to AWS security and compliance reports such as AWS Service Organization Control (SOC) reports, Payment Card Industry (PCI) reports, and certifications from accreditation bodies that validate the implementation and operating effectiveness of AWS security controls. You can access these attestations online under the artifacts section of the AWS Management Console. To learn more about accessing Artifact, see Downloading Reports in AWS Artifact.

Data Protection – Data classification policies and procedures for discovering and protecting data

It’s important to classify institutional data to support application of the appropriate level of security. Data discovery and classification enables the implementation of the correct level of security, privacy, and access controls. Discovery and classification are highly complex given the volume of data involved and the tradeoffs between a strict security posture and the need for business agility.

Controls via S3 Block Public Access – S3 Block Public Access can help you enforce controls across an entire AWS account or at the individual S3 bucket level to ensure that objects do not have public permissions. Block Public Access is a good second layer of protection to ensure you don’t inadvertently grant broader access to objects than intended. To learn more about using S3 Block Public Access, see Learn how to use two important Amazon S3 security features – Block Public Access and S3 Object Lock.
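A minimal sketch of applying both levels of the setting follows; the account ID and bucket name are placeholders.

import boto3

s3control = boto3.client("s3control")
s3 = boto3.client("s3")

block_all = {
    "BlockPublicAcls": True,
    "IgnorePublicAcls": True,
    "BlockPublicPolicy": True,
    "RestrictPublicBuckets": True,
}

# Account-wide Block Public Access setting.
s3control.put_public_access_block(
    AccountId="111122223333", PublicAccessBlockConfiguration=block_all
)

# Per-bucket Block Public Access setting.
s3.put_public_access_block(
    Bucket="example-data-bucket", PublicAccessBlockConfiguration=block_all
)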

S3 configuration monitoring and sensitive data discovery via Amazon Macie – You can use Macie to discover, classify, and protect sensitive data like personally identifiable information (PII) stored in Amazon S3. Macie provides visibility and continuous monitoring of S3 bucket configurations across all accounts within your AWS Organization, and alerts you to any unencrypted buckets, publicly accessible buckets, or buckets shared or replicated with AWS accounts outside your organization. For buckets you specify, Macie uses machine learning and pattern matching to identify objects that contain sensitive data. When sensitive data is located, Macie sends findings to EventBridge allowing for automated actions or integrations with ticketing systems. To learn more about using Macie, see Enhanced Amazon Macie.

WORM data conformance via Amazon S3 Object Lock – Object Lock can help you meet the technical requirements of financial services regulations that require write once, read many (WORM) data storage for certain types of books and records information. To learn more about using S3 Object Lock, see Learn how to use two important Amazon S3 security features – Block Public Access and S3 Object Lock.
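A hedged sketch of enabling Object Lock with a default retention period follows. Object Lock must be enabled when the bucket is created, and the bucket name and retention period shown here are illustrative assumptions, not regulatory recommendations.

import boto3

s3 = boto3.client("s3")

# Object Lock can only be enabled at bucket creation time.
# (In Regions other than us-east-1, also pass CreateBucketConfiguration.)
s3.create_bucket(
    Bucket="example-worm-records",
    ObjectLockEnabledForBucket=True,
)

# Apply a default WORM retention rule to every new object version.
s3.put_object_lock_configuration(
    Bucket="example-worm-records",
    ObjectLockConfiguration={
        "ObjectLockEnabled": "Enabled",
        "Rule": {"DefaultRetention": {"Mode": "COMPLIANCE", "Years": 7}},
    },
)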

Alerts via Amazon GuardDuty – GuardDuty is designed to raise alarms when someone is scanning for potentially vulnerable systems or moving unusually large amounts of data to or from unexpected places. To learn more about GuardDuty findings, see Visualizing Amazon GuardDuty findings.

Note: AWS strongly recommends that you never put sensitive identifying information into free-form fields or metadata, such as function names or tags, because any data entered into metadata might be included in diagnostic logs.

Effective configuration management program features and practices

OCIE also noted effective industry practices for storage configuration, including:

  • Policies and procedures to support the initial installation and ongoing maintenance and monitoring of storage systems
  • Guidelines for security controls and baseline security configuration standards
  • Vendor management policies and procedures for security configuration assessment after software and hardware patches

In addition to the services already covered, AWS offers several other services and capabilities to help you implement effective control measures.

Security assessments using Amazon Inspector – You can use Amazon Inspector to assess your AWS resources for vulnerabilities or deviations from best practices and produce a detailed list of security findings prioritized by level of severity. For example, Amazon Inspector security assessments can help you check for unintended network accessibility of your Amazon Elastic Compute Cloud (Amazon EC2) instances and for vulnerabilities on those instances. To learn more about assessing network exposure of EC2 instances, see A simpler way to assess the network exposure of EC2 instances: AWS releases new network reachability assessments in Amazon Inspector.

Configuration compliance via AWS Config conformance packs – Conformance packs help you manage configuration compliance of your AWS resources at scale—from policy definition to auditing and aggregated reporting—using a common framework and packaging model. This helps to quickly establish a common baseline for resource configuration policies and best practices across multiple accounts in your organization in a scalable and efficient way. Sample conformance pack templates such as Operational best practices for Amazon S3 can help you to quickly get started on evaluating and configuring your AWS environment. To learn more about AWS Config conformance packs, see Manage custom AWS Config rules with remediations using conformance packs.

Logging and monitoring via AWS CloudTrail – CloudTrail lets you track and automatically respond to account activity that threatens the security of your AWS resources. With Amazon CloudWatch Events integration, you can define workflows that execute when events that can result in security vulnerabilities are detected. For example, you can create a workflow to add a specific policy to an Amazon S3 bucket when CloudTrail logs an API call that makes that bucket public. To learn more about using CloudTrail to respond to unusual API activity, see Announcing CloudTrail Insights: Identify and Respond to Unusual API Activity.
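A hedged sketch of that kind of workflow follows: a CloudWatch Events (EventBridge) rule that matches CloudTrail-logged S3 calls which could make a bucket public and routes them to a remediation Lambda function. It assumes a CloudTrail trail is delivering management events, and the rule name and function ARN are placeholders.

import json
import boto3

events = boto3.client("events")

events.put_rule(
    Name="detect-public-bucket-changes",
    State="ENABLED",
    EventPattern=json.dumps(
        {
            "source": ["aws.s3"],
            "detail-type": ["AWS API Call via CloudTrail"],
            "detail": {
                "eventSource": ["s3.amazonaws.com"],
                "eventName": ["PutBucketAcl", "PutBucketPolicy", "DeleteBucketPolicy"],
            },
        }
    ),
)

# Route matching events to a remediation function. The function also needs a
# resource-based permission for events.amazonaws.com (not shown here).
events.put_targets(
    Rule="detect-public-bucket-changes",
    Targets=[
        {
            "Id": "remediation-function",
            "Arn": "arn:aws:lambda:us-east-1:111122223333:function:RemediatePublicBucket",
        }
    ],
)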

Machine learning based investigations via Amazon Detective – Detective makes it easy to analyze, investigate, and quickly identify the root cause of potential security issues or suspicious activities. Detective automatically collects log data from your AWS resources and uses machine learning, statistical analysis, and graph theory to build a linked set of data that helps you to conduct faster, more efficient security investigations. To learn more about Amazon Detective based investigation, see Amazon Detective – Rapid Security Investigation and Analysis.

Conclusion

AWS security and compliance capabilities are well suited to help you review the SEC OCIE observations and implement effective practices to safeguard your organization’s data in AWS cloud storage. To review and enhance the security of your cloud data storage, learn about these 13 AWS tools and capabilities. Implementing this wide variety of monitoring, auditing, security analysis, and change management capabilities will help you remediate potential gaps in security settings and configurations. Many customers engage AWS Professional Services to help define and implement their security, risk, and compliance strategy, governance structures, operating controls, shared responsibility model, control mappings, and best practices.

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, contact AWS Support.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Sireesh Pachava

Sai Sireesh is a Senior Advisor in Security, Risk, and Compliance at AWS. He specializes in solving complex strategy, business risk, security, and digital platform issues. A computer engineer with an MS and an MBA, he has held global leadership roles at Russell Investments, Microsoft, Thomson Reuters, and more. He’s a pro-bono director for the non-profit risk professional association PRMIA.

How to retroactively encrypt existing objects in Amazon S3 using S3 Inventory, Amazon Athena, and S3 Batch Operations

Post Syndicated from Adam Kozdrowicz original https://aws.amazon.com/blogs/security/how-to-retroactively-encrypt-existing-objects-in-amazon-s3-using-s3-inventory-amazon-athena-and-s3-batch-operations/

Amazon Simple Storage Service (S3) is an object storage service that offers industry-leading scalability, performance, security, and data availability. With Amazon S3, you can choose from three different server-side encryption configurations when uploading objects:

  • SSE-S3 – uses Amazon S3-managed encryption keys
  • SSE-KMS – uses customer master keys (CMKs) stored in AWS Key Management Service (KMS)
  • SSE-C – uses master keys provided by the customer in each PUT or GET request

These options allow you to choose the right encryption method for the job. But as your organization evolves and new requirements arise, you might find that you need to change the encryption configuration for all objects. For example, you might be required to use SSE-KMS instead of SSE-S3 because you need more control over the lifecycle and permissions of the encryption keys in order to meet compliance goals.

You could change the settings on your buckets to use SSE-KMS rather than SSE-S3, but the switch only impacts newly uploaded objects, not objects that existed in the buckets before the change in encryption settings. Manually re-encrypting older objects under master keys in KMS may be time-prohibitive depending on how many objects there are. Automating this effort is possible using the right combination of features in AWS services.

In this post, I’ll show you how to use Amazon S3 Inventory, Amazon Athena, and Amazon S3 Batch Operations to provide insights on the encryption status of objects in S3 and to remediate incorrectly encrypted objects in a massively scalable, resilient, and cost-effective way. The solution uses a similar approach to the one mentioned in this blog post, but it has been designed with automation and multi-bucket scalability in mind. Tags are used to target individual noncompliant buckets in an account, and any encrypted (or unencrypted) object can be re-encrypted using SSE-S3 or SSE-KMS. Versioned buckets are also supported, and the solution operates on a regional level.

Note: You can’t re-encrypt to or from objects encrypted under SSE-C. This is because the master key material must be provided during the PUT or GET request, and cannot be provided as a parameter for S3 Batch Operations.

Moreover, the entire solution can be deployed in under 5 minutes using AWS CloudFormation. Simply tag your buckets targeted for encryption, upload the solution artifacts into S3, and deploy the artifact template through the CloudFormation console. In the following sections, you will see that the architecture has been built to be easy to use and operate, while at the same time containing a large number of customizable features for more advanced users.

Solution overview

At a high level, the core of the architecture consists of three services interacting with one another: S3 Inventory reports (1) are delivered for targeted buckets, the report delivery events trigger an AWS Lambda function (2), and the Lambda function then runs S3 Batch Operations jobs (3) using the reports as input to encrypt the targeted buckets. Figure 1 below and the remainder of this section provide a more detailed look at what is happening underneath the surface. If this is not of interest to you, feel free to skip ahead to the Prerequisites and Solution deployment sections.

Figure 1: Solution architecture overview

Here’s a detailed overview of how the solution works, as shown in Figure 1 above:

  1. When the CloudFormation template is first launched, a number of resources are created, including:
    • An S3 bucket to store the S3 Inventory reports
    • An S3 bucket to store S3 Batch Job completion reports
    • A CloudWatch event that is triggered by changes to tags on S3 buckets
    • An AWS Glue Database and AWS Glue Tables that can be used by Athena to query S3 Inventory and S3 Batch report findings
    • A Lambda function that is used as a Custom Resource during template launch, and afterwards as a target for S3 event notifications and CloudWatch events
  2. During deployment of the CloudFormation template, a Lambda-backed Custom Resource lists all S3 buckets within the AWS Region specified and checks to see if any has a configurable tag present (configured via an AWS CloudFormation parameter). When a bucket with the specified tag is discovered, the Lambda configures an S3 Inventory report for the discovered bucket to be delivered to the newly-created central report destination bucket.
  3. When a new S3 Inventory report arrives in the central report destination bucket from any of the tagged buckets (which can take 1–2 days), an S3 Event Notification triggers the Lambda function to process it.
  4. The Lambda function first adds the path of the report CSV file as a partition to the AWS Glue table. This means that as each bucket delivers its report, it becomes instantly queryable by Athena, and any queries executed return the most recent information available on the status of the S3 buckets in the account.
  5. The Lambda function then checks the value of the EncryptBuckets parameter in the CloudFormation launch template to assess whether any re-encryption action should be taken. If it is set to yes, the Lambda function creates an S3 Batch Operations job and executes it (a hedged sketch of such a job follows this list). The job takes each object listed in the manifest report and copies it over itself in the exact same location. When the copy occurs, SSE-KMS or SSE-S3 encryption is specified in the job parameters, effectively re-encrypting all identified objects with the desired configuration.
  6. Once the batch job finishes for the S3 Inventory report, a completion report is sent to the central batch job report bucket. The CloudFormation template provides a parameter that controls the option to include either all successfully processed objects or only objects that were unsuccessfully processed. These reports can also be queried with Athena, since the reports are also added as partitions to the AWS Glue batch reports tables as they arrive.
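To make step 5 more concrete, here is a hedged sketch of the kind of S3 Batch Operations job the Lambda function creates. The real logic lives in the solution’s encrypt.py; every account ID, ARN, ETag, and key ID below is a placeholder, and the SSE algorithm and tag values are illustrative.

import uuid
import boto3

s3control = boto3.client("s3control")

s3control.create_job(
    AccountId="111122223333",
    ClientRequestToken=str(uuid.uuid4()),
    ConfirmationRequired=False,
    Priority=10,
    RoleArn="arn:aws:iam::111122223333:role/BatchOperationsRole",
    # The job manifest is the delivered S3 Inventory report.
    Manifest={
        "Spec": {"Format": "S3InventoryReport_CSV_20161130"},
        "Location": {
            "ObjectArn": "arn:aws:s3:::inventory-reports/adams-lambda-functions/manifest.json",
            "ETag": "example-etag",
        },
    },
    # Copy each object onto itself, re-encrypting with SSE-KMS and adding the
    # tag used later to restrict access to old unencrypted versions.
    Operation={
        "S3PutObjectCopy": {
            "TargetResource": "arn:aws:s3:::adams-lambda-functions",
            "NewObjectMetadata": {"SSEAlgorithm": "KMS"},
            "SSEAwsKmsKeyId": "arn:aws:kms:us-east-1:111122223333:key/example-key-id",
            "NewObjectTagging": [{"Key": "__ObjectEncrypted", "Value": "true"}],
        }
    },
    # Report only the objects that failed to copy, as the solution does.
    Report={
        "Bucket": "arn:aws:s3:::batch-job-reports",
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "Prefix": "reports",
        "ReportScope": "FailedTasksOnly",
    },
)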

Prerequisites

To follow along with the sample deployment, your AWS Identity and Access Management (IAM) principal (user or role) needs administrator access or equivalent.

Solution deployment

For this walkthrough, the solution will be configured to encrypt objects using SSE-KMS, rather than SSE-S3, when an inventory report is delivered for a bucket. Please note that the key policy of the KMS key will be automatically updated by the custom resource during launch to allow S3 to use it to encrypt inventory reports. No key policies are changed if SSE-S3 encryption is selected instead. The configuration in this walkthrough also adds a tag to all newly encrypted objects. You’ll learn how to use this tag to restrict access to unencrypted objects in versioned buckets. I’ll make callouts throughout the deployment guide for when you can choose a different configuration from what is deployed in this post.

To deploy the solution architecture and validate its functionality, you’ll perform five steps:

  1. Tag target buckets for encryption
  2. Deploy the CloudFormation template
  3. Validate delivery of S3 Inventory reports
  4. Confirm that reports are queryable with Athena
  5. Validate that objects are correctly encrypted

If you are only interested in deploying the solution and encrypting your existing environment, you only need to complete Steps 1 and 2. Steps 3 through 5, on the other hand, are optional and outline procedures that you would perform to validate the solution’s functionality. They are primarily for users who are looking to dive deep and take advantage of all of the features available.

With that being said, let’s get started with deploying the architecture!

Step 1: Tag target buckets

Navigate to the Amazon S3 console and identify which buckets should be targeted for inventorying and encryption. For each identified bucket, tag it with a designated key value pair by selecting Properties > Tags > Add tag. This demo uses the tag __Inventory: true and tags only one bucket called adams-lambda-functions, as shown in Figure 2.

Figure 2: Tagging a bucket targeted for encryption in Amazon S3
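As a scripted alternative to the console steps above, the following sketch tags the demo bucket. It assumes the bucket has no existing tags you need to preserve, because put_bucket_tagging replaces the entire tag set.

import boto3

s3 = boto3.client("s3")

# Apply the tag that marks the bucket for inventorying and encryption.
s3.put_bucket_tagging(
    Bucket="adams-lambda-functions",
    Tagging={"TagSet": [{"Key": "__Inventory", "Value": "true"}]},
)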

Step 2: Deploy the CloudFormation template

  1. Download the S3 encryption solution. There will be two files that make up the backbone of the solution:
    • encrypt.py, which contains the Lambda microservices logic;
    • deploy.yml, which is the CloudFormation template that deploys the solution.
  2. Zip the file encrypt.py, rename it to encrypt.zip, and then upload it into any S3 bucket that is in the same Region as the one in which the CloudFormation template will be deployed. Your bucket should look like Figure 3:

    Figure 3: encrypt.zip uploaded into an S3 bucket

  3. Navigate to the CloudFormation console and then create the CloudFormation stack using the deploy.yml template. For more information, see Getting Started with AWS CloudFormation in the CloudFormation User Guide. Figure 4 shows the parameters used to achieve the configuration specified for this walkthrough, with the fields outlined in red requiring input. You can choose your own configuration by altering the appropriate parameters if the ones specified do not fit your use case.

    Figure 4: Set the parameters in the CloudFormation stack

Step 3: Validate delivery of S3 Inventory reports

After you’ve successfully deployed the CloudFormation template, select any of your tagged S3 buckets and check that it now has an S3 Inventory report configuration. To do this, navigate to the S3 console, select a tagged bucket, select the Management tab, and then select Inventory, as shown in Figure 5. You should see that an inventory configuration exists. An inventory report will be delivered automatically to this bucket within 1 to 2 days, depending on the number of objects in the bucket. Make a note of the name of the bucket where the inventory report will be delivered. The bucket is given a semi-random name during creation through the CloudFormation template, so making a note of this will help you find the bucket more easily when you check for report delivery later.

Figure 5: Check that the tagged S3 bucket has an S3 Inventory report configuration

Step 4: Confirm that reports are queryable with Athena

  1. After 1 to 2 days, navigate to the inventory reports destination bucket and confirm that reports have been delivered for buckets with the __Inventory: true tag. As shown in Figure 6, a report has been delivered for the adams-lambda-functions bucket.

    Figure 6: Confirm delivery of reports to the S3 reports destination bucket

  2. Next, navigate to the Athena console and select the AWS Glue database that contains the table holding the schema and partition locations for all of your reports. If you used the default values for the parameters when you launched the CloudFormation stack, the AWS Glue database will be named s3_inventory_database, and the table will be named s3_inventory_table. Run the following query in Athena:
    
    SELECT encryption_status, count(*) FROM s3_inventory_table GROUP BY encryption_status;
    

    The output of the query will be a snapshot aggregate count of objects in the categories SSE-S3, SSE-C, SSE-KMS, or NOT-SSE across your tagged bucket environment, before encryption took place, as shown in Figure 7.

    Figure 7: Query results in Athena

    From the query results, you can see that the adams-lambda-functions bucket had only two items in it, both of which were unencrypted. At this point, you can choose to perform any other analytics with Athena on the delivered inventory reports.

Step 5: Validate that objects are correctly encrypted

  1. Navigate to any of your target buckets in Amazon S3 and check the encryption status of a few sample objects by selecting the Properties tab of each object. The objects should now be encrypted using the specified KMS CMK. Because you set the AddTagToEncryptedObjects parameter to yes during the CloudFormation stack launch, these objects should also have the __ObjectEncrypted: true tag present. As an example, Figure 8 shows the rules_present_rule.zip object from the adams-lambda-functions bucket. This object has been properly encrypted using the correct KMS key, which has an alias of blog in this example, and it has been tagged with the specified key value pair.

    Figure 8: Checking the encryption status of an object in S3

  2. For further validation, navigate back to the Athena console and select the s3_batch_table from the s3_inventory_database, assuming that you left the default names unchanged. Then, run the following query:
    
    SELECT * FROM s3_batch_table;
    

    If encryption was successful, this query should result in zero items being returned because the solution by default only delivers S3 batch job completion reports on items that failed to copy. After validating by inspecting both the objects themselves and the batch completion reports, you can now safely say that the contents of the targeted S3 buckets are correctly encrypted.

Next steps

Congratulations! You’ve successfully deployed and operated a solution for rectifying S3 buckets with incorrectly encrypted and unencrypted objects. The architecture is massively scalable because it uses S3 Batch Operations and Lambda, it’s fully serverless, and it’s cost effective to run.

Please note that if you selected no for the EncryptBuckets parameter during the initial launch of the CloudFormation template, you can retroactively perform encryption on targeted buckets by simply doing a stack update. During the stack update, switch the EncryptBuckets parameter to yes, and proceed with deployment as normal. The update will reconfigure S3 inventory reports for all target S3 buckets to get the most up-to-date inventory. After the reports are delivered, encryption will proceed as desired.

Moreover, with the solution deployed, you can target new buckets for encryption just by adding the __Inventory: true tag. CloudWatch Events will register the tagging action and automatically configure an S3 Inventory report to be delivered for the newly tagged bucket.

Finally, now that your S3 buckets are properly encrypted, you should take a few more manual steps to help maintain your newfound account hygiene:

  • Perform remediation on unencrypted objects that may have failed to copy during the S3 Batch Operations job. The most common reason that objects fail to copy is that the object size exceeds 5 GiB. S3 Batch Operations uses the standard CopyObject API call underneath the surface, but this API call can only handle objects smaller than 5 GiB. To successfully copy these objects, you can modify the solution you learned in this post to launch an S3 Batch Operations job that invokes Lambda functions. In the Lambda function logic, you can perform a multipart copy (CreateMultipartUpload, UploadPartCopy, and CompleteMultipartUpload API calls) on objects that failed with a standard copy; a hedged sketch follows this list. The original batch job completion reports provide detail on exactly which objects failed to encrypt due to size.
  • Prohibit the retrieval of unencrypted object versions for buckets that had versioning enabled. When the object is copied over itself during the encryption process, the old unencrypted version of the object still exists. This is where the option in the solution to specify a tag on all newly encrypted objects becomes useful—you can now use that tag to draft a bucket policy that prohibits the retrieval of old unencrypted objects in your versioned buckets. For the solution that you deployed in this post, such a policy would look like this:
    
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Deny",
          "Action": "s3:GetObject",
          "Resource": "arn:aws:s3:::adams-lambda-functions/*",
          "Principal": "*",
          "Condition": {
            "StringNotEquals": {
              "s3:ExistingObjectTag/__ObjectEncrypted": "true"
            }
          }
        }
      ]
    }
    

  • Update bucket policies to prevent the upload of unencrypted or incorrectly encrypted objects. By updating bucket policies, you help ensure that in the future, newly uploaded objects will be correctly encrypted, which will help maintain account hygiene. The S3 encryption solution presented here is meant to be a one-time remediation tool, while you should view updating bucket policies as a preventative action. Proper use of bucket policies will help ensure that the S3 encryption solution is not needed again, unless another encryption requirement change occurs in the future. To learn more, see How to Prevent Uploads of Unencrypted Objects to Amazon S3.
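Referring back to the first item in this list, here is a hedged sketch of re-encrypting an object larger than 5 GiB by copying it onto itself with a multipart copy. The bucket, key, KMS key ARN, and part size are placeholders.

import boto3

s3 = boto3.client("s3")
bucket = "adams-lambda-functions"
key = "large-object.bin"
kms_key = "arn:aws:kms:us-east-1:111122223333:key/example-key-id"
part_size = 512 * 1024 * 1024  # 512 MiB parts

# Determine the object size, then start a multipart upload that requests
# SSE-KMS on the new (re-encrypted) object.
size = s3.head_object(Bucket=bucket, Key=key)["ContentLength"]
upload = s3.create_multipart_upload(
    Bucket=bucket, Key=key, ServerSideEncryption="aws:kms", SSEKMSKeyId=kms_key
)

# Copy the object in ranges, one part per range, using UploadPartCopy.
parts = []
for number, start in enumerate(range(0, size, part_size), start=1):
    end = min(start + part_size, size) - 1
    result = s3.upload_part_copy(
        Bucket=bucket,
        Key=key,
        PartNumber=number,
        UploadId=upload["UploadId"],
        CopySource={"Bucket": bucket, "Key": key},
        CopySourceRange=f"bytes={start}-{end}",
    )
    parts.append({"ETag": result["CopyPartResult"]["ETag"], "PartNumber": number})

# Complete the upload; the new object version is encrypted with SSE-KMS.
s3.complete_multipart_upload(
    Bucket=bucket,
    Key=key,
    UploadId=upload["UploadId"],
    MultipartUpload={"Parts": parts},
)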

If you have feedback about this post, submit comments in the Comments section below. If you have questions about this post, start a new thread on the Amazon S3 forum.

Want more AWS Security how-to content, news, and feature announcements? Follow us on Twitter.

Author

Adam Kozdrowicz

Adam is a Data and Machine Learning Engineer for AWS Professional Services. He works closely with enterprise customers building big data applications on AWS, and he enjoys working with frameworks such as AWS Amplify, SAM, and CDK. During his free time, Adam likes to surf, travel, practice photography, and build machine learning models.

How Wind Mobility built a serverless data architecture

Post Syndicated from Pablo Giner original https://aws.amazon.com/blogs/big-data/how-wind-mobility-built-a-serverless-data-architecture/

Guest post by Pablo Giner, Head of BI, Wind Mobility.

Over the past few years, urban micro-mobility has become a trending topic. With the contamination indexes hitting historic highs, cities and companies worldwide have been introducing regulations and working on a wide spectrum of solutions to alleviate the situation.

We at Wind Mobility strive to make commuters’ life more sustainable and convenient by bringing short distance urban transportation to cities worldwide.

At Wind Mobility, we scale our services at the same pace as our users demand them, and we do it in an economically and environmentally viable way. We optimize our fleet distribution to avoid overcrowding cities with more scooters than those that are actually going to be used, and we position them just meters away from where our users need them and at the time of the day when they want them.

How do we do that? By optimizing our operations to their fullest. To do so, we need to be very well informed about our users’ behavior under varying conditions and understand our fleet’s potential.

Scalability and flexibility for rapid growth

We knew that before we could solve this challenge, we needed to collect data from many different sources, such as user interactions with our application, user demand, IoT signals from our scooters, and operational metrics. To analyze the numerous datasets collected and extract actionable insights, we needed to build a data lake. While the high-level goal was clear, the scope was less so. We were working hard to scale our operation as we continued to launch new markets. The rapid growth and expansion made it very difficult to predict the volume of data we would need to consume. We were also launching new microservices to support our growth, which resulted in more data sources to ingest. We needed an architecture that allowed us to be agile and adapt quickly to meet our growth. It became clear that a serverless architecture was best positioned to meet those needs, so we started to design our 100% serverless infrastructure.

The first challenge was ingesting and storing data from our scooters in the field, events from our mobile app, operational metrics, and partner APIs. We use AWS Lambda to capture changes in our operational databases and mobile app and push the events to Amazon Kinesis Data Streams, which allows us to take action in real time. We also use Amazon Kinesis Data Firehose to write the data to Amazon Simple Storage Service (Amazon S3), which we use for analytics.
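To illustrate the ingestion pattern described above (this is not Wind Mobility’s actual code), a Lambda handler that forwards change events to Kinesis Data Streams might look like the following sketch; the stream name, event shape, and partition key are assumptions.

import json
import boto3

kinesis = boto3.client("kinesis")

def handler(event, context):
    # Forward each captured change record to the stream so downstream
    # consumers (and Kinesis Data Firehose) can process it in real time.
    for record in event.get("records", []):
        kinesis.put_record(
            StreamName="scooter-events",
            Data=json.dumps(record).encode("utf-8"),
            PartitionKey=str(record.get("scooter_id", "unknown")),
        )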

After the data was in Amazon S3 and adequately partitioned per its most common use cases (we partition by date, region, and business line, depending on the data source), we had to find a way to query this data for both data profiling (understanding structure, content, and interrelationships) and ad hoc analysis. For that we chose AWS Glue crawlers to catalog our data and Amazon Athena to read from the AWS Glue Data Catalog and run queries. However, ad hoc analysis and data profiling are relatively sporadic tasks in our team, because most of the data processing computing hours are actually dedicated to transforming the multiple data sources into our data warehouse: consolidating the raw data, modeling it, adding new attributes, and picking the data elements, which constitute 95% of our analytics and predictive needs.

This is where all the heavy lifting takes place. We parse through millions of scooter and user events generated daily (over 300 events per second) to extract actionable insight. We selected AWS Glue to perform this task. Our primary ETL job reads the newly added raw event data from Amazon S3, processes it using Apache Spark, and writes the results to our Amazon Redshift data warehouse. AWS Glue plays a critical role in our ability to scale on demand. After careful evaluation and testing, we concluded that AWS Glue ETL jobs meet all our needs and free us from procuring and managing infrastructure.

Architecture overview

The following diagram represents our current data architecture, showing two serverless data collection, processing, and reporting pipelines:

  • Operational databases from Amazon Relational Database Service (Amazon RDS) and MongoDB
  • IoT and application events, followed by Athena for data profiling and Amazon Redshift for reporting

Our data is curated and transformed multiple times a day using an automated pipeline running on AWS Glue. The team can now focus on analyzing the data and building machine learning (ML) applications.

We chose Amazon QuickSight as our business intelligence tool to help us visualize and better understand our operational KPIs. Additionally, we use Amazon Elastic Container Registry (Amazon ECR) to store our Docker images containing our custom ML algorithms and Amazon Elastic Container Service (Amazon ECS) where we train, evaluate, and host our ML models. We schedule our models to be trained and evaluated multiple times a day. Taking as input curated data about demand, conversion, and flow of scooters, we run the models to help us optimize fleet utilization for a particular city at any given time.

The following diagram represents how data from the data lake is incorporated into our ML training, testing, and serving system. First, our developers work in the application code and commit their changes, which are built into new Docker images by our CI/CD pipeline and stored in the Amazon ECR registry. These images are pushed into Amazon ECS and tested in DEV and UAT environments before moving to PROD (where they are triggered by the Amazon ECS task scheduler). During their execution, the Amazon ECS tasks (some train the demand and usage forecasting models, some produce the daily and hourly predictions, and others optimize the fleet distribution to satisfy the forecast) read their configuration and pull data from Amazon S3 (which has been previously produced by scheduled AWS Glue jobs), finally storing their results back into Amazon S3. Executions of these pipelines are tracked via MLFlow (in a dedicated Amazon Elastic Compute Cloud (Amazon EC2) server) and the final result indicating the fleet operations required is fit into a Kepler map, which is then consumed by the operators on the field.

Conclusion

We at Wind Mobility place data at the forefront of our operations. For that, we need our data infrastructure to be as flexible as the industry and the context we operate in, which is why we chose serverless. Over the course of a year, we have built a data lake, a data warehouse, a BI suite, and a variety of (production) data science applications. All of that with a very small team.

Also, within the last 12 months, we have scaled up several of our data pipelines by a factor of 10, without slowing our momentum or redesigning any part of our architecture. When it came time to double our fleet in 1 week and increase the frequency at which we capture data from scooters by a factor of 10, our serverless data architecture scaled with no issues. This allowed us to focus on adding value by simplifying our operation, reacting to changes quickly, and delighting our users.

We have measured our success in multiple dimensions:

  • Speed – Serverless is faster to deploy and expand; we believe we have reduced our time to market for the entire infrastructure by a factor of 2
  • Visibility – We have 360 degree visibility of our operations worldwide, accessible by our city managers, finance team, and management board
  • Optimized fleet deployment – We know, at any minute of the day, the number of scooters that our customers need over the next few hours, which reduces unsatisfied demand by more than 50%

If you face a similar challenge, our advice is clear: go fully serverless and use the spectrum of solutions available from AWS.

Follow us and discover more about Wind Mobility on Facebook, Instagram and LinkedIn.

About the Author

Pablo Giner is Head of BI at Wind Mobility. Pablo’s background is in wheels (motorcycle racing > vehicle engineering > collision insurance > eScooters sharing…) and for the last few years he has specialized in forming and developing data teams. At Wind Mobility, he leads the data function (data engineering + analytics + data science), and the project he is most proud of is what they call smart fleet rebalancing, an AI backed solution to reposition their fleet in real-time. “In God we trust. All others must bring data.” – W. Edward Deming


Adding voice to a CircuitPython project using Amazon Polly

Post Syndicated from Moheeb Zara original https://aws.amazon.com/blogs/compute/adding-voice-to-a-circuitpython-project-using-amazon-polly/

An Adafruit PyPortal displaying a quote while synthesizing and playing speech using Amazon Polly.

As a natural means of communication, voice is a powerful way to humanize an experience. What if you could make anything talk? This guide walks through how to leverage the cloud to add voice to an off-the-shelf microcontroller. Use it to develop more advanced ideas, like a talking toaster that encourages healthy breakfast habits or a house plant that can express its needs.

This project uses an Adafruit PyPortal, an open-source IoT touch display programmed using CircuitPython, a lightweight version of Python that works on embedded hardware. You copy your code to the PyPortal like you would to a thumb drive and it runs. Random quotes from the PaperQuotes API are periodically displayed on the PyPortal LCD.

A microcontroller can’t do speech synthesis on its own, so I use Amazon Polly, a natural-sounding text-to-speech service, to generate audio. Adding speech also extends accessibility to the visually impaired. This project includes an example for requesting arbitrary speech in addition to random quotes. Use this example to add a voice to any CircuitPython project.

An Adafruit PyPortal, an external speaker, and a microSD card.

I deploy the backend to the AWS Cloud using the AWS Serverless Application Repository. The code on the PyPortal makes a REST call to the backend to fetch a quote and synthesize speech audio for playback on the device.

Prerequisites

You need the following to complete the project:

Deploy the backend application

An architecture diagram of the serverless backend when requesting speech synthesis of a text string.

The serverless backend consists of an Amazon API Gateway endpoint that invokes an AWS Lambda function. If called with a JSON object containing text and voiceId attributes, it uses Amazon Polly to synthesize speech and uploads an MP3 file as a public object to Amazon S3. Upon completion, it returns the URL for downloading the audio file. It also processes the submitted text and adds return lines so that it can appear text-wrapped when displayed on the PyPortal. For a full list of voices, see the Amazon Polly documentation. An example response:
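The exact payload isn’t reproduced here, but based on how the PyPortal code consumes it (a URL to the MP3 object plus text with line breaks added), it presumably looks something like the following; the values are illustrative.

{
  "url": "https://example-stack-audio.s3.amazonaws.com/speech/example.mp3",
  "text": "Hello world! I am an Adafruit PyPortal\nrunning Circuit Python speaking to you\nusing AWS Serverless"
}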

To fetch quotes instead of a text field, call the endpoint with a comma-separated list of tags as shown in the following diagram. The Lambda function then calls the PaperQuotes API. It fetches up to 50 quotes per tag and selects a random one to synthesize as speech. As with arbitrary text, it returns a URL and a text-wrapped representation of the quote.

An architecture diagram of the serverless backend when requesting a random quote from the PaperQuotes API to synthesize as speech.

I use the AWS Serverless Application Model (AWS SAM) to create the backend template. While it can be deployed using the AWS SAM CLI, you can also deploy from the AWS Management Console:

  1. Generate a free PaperQuotes API key at paperquotes.com. The serverless backend requires this to fetch quotes.
  2. Navigate to the aws-serverless-pyportal-polly application in the AWS Serverless Application Repository.
  3. Under Application settings, enter the parameter, PaperQuotesAPIKey.
  4. Choose Deploy.
  5. Once complete, choose View CloudFormation Stack.
  6. Select the Outputs tab and make a note of the SpeechApiUrl. This is required for configuring the PyPortal.
  7. Click the link listed for SpeechApiKey in the Outputs tab.
  8. Click Show to reveal the API key. Make a note of this. This is required for authenticating requests from the PyPortal to the SpeechApiUrl.

PyPortal setup

The following instructions walk through installing the latest version of the Adafruit CircuitPython libraries and firmware. They also show how to enable an external speaker module.

  1. Follow these instructions from Adafruit to install the latest version of the CircuitPython bootloader. At the time of writing, the latest version is 5.3.0.
  2. Follow these instructions to install the latest Adafruit CircuitPython library bundle. I use bundle version 5.x.
  3. Insert the microSD card in the slot located on the back of the device.
  4. Cut the jumper pad on the back of the device labeled A0. This enables you to use an external speaker instead of the built-in speaker.
  5. Plug the external speaker connector into the port labeled SPEAKER on the back of the device.
  6. Optionally install the Mu Editor, a multi-platform code editor and serial debugger compatible with Adafruit CircuitPython boards. This can help with troubleshooting issues.
  7. Optionally if you have a 3D printer at home, you can print a case for your PyPortal. This can protect and showcase your project.

Code PyPortal

As with regular Python, CircuitPython does not need to be compiled to execute. You can flash new firmware on the PyPortal by copying a Python file and necessary assets to a mounted volume. The bootloader runs code.py anytime the device starts or any files are updated.

  1. Use a USB cable to plug the PyPortal into your computer and wait until a new mounted volume CIRCUITPY is available.
  2. Download the project from GitHub. Inside the project, copy the contents of /circuit-python on to the CIRCUITPY volume.
  3. Inside the volume, open and edit the secrets.py file. Include your Wi-Fi credentials along with the SpeechApiKey and SpeechApiUrl API Gateway endpoint. These can be found under Outputs in the AWS CloudFormation stack created by the AWS Serverless Application Repository.
  4. Save the file, and the device restarts. It takes a moment to connect to Wi-Fi and make the first request.
    Optionally, if you installed the Mu Editor, you can click on “Serial” to follow along with the device log.

The PyPortal takes a few moments to connect to the Wi-Fi network and make its first request. On success, you hear it greet you and describe itself. The default interval is set to then display and read a quote every five minutes.

Understanding the CircuitPython code

See the bottom of circuit-python/code.py from the GitHub project. When the PyPortal connects to Wi-Fi, the first thing it does is synthesize an arbitrary “hello world” text for display. It then begins periodically displaying and “speaking” quotes.

# Connect to WiFi
print("Connecting to WiFi...")
wifi.connect()
print("Connected!")

displayQuote("Ready!")

speakText('Hello world! I am an Adafruit PyPortal running Circuit Python speaking to you using AWS Serverless', 'Joanna')

while True:
    speakQuote('equality, humanity', 'Joanna')
    time.sleep(60*secrets['interval'])

Both the speakText and speakQuote functions call the synthesizeSpeech function. The difference is whether text or tags are passed to the API.

def speakText(text, voice):
    data = { "text": text, "voiceId": voice }
    synthesizeSpeech(data)

def speakQuote(tags, voice):
    data = { "tags": tags, "voiceId": voice }
    synthesizeSpeech(data)

The synthesizeSpeech function posts the data to the API Gateway endpoint, which invokes the Lambda function and returns the MP3 URL and the formatted text. The downloadfile function is called to fetch the MP3 file and store it on the SD card. displayQuote is called to display the quote on the LCD. Finally, playMP3 opens the file and plays the speech audio using the built-in or external speaker.

def synthesizeSpeech(data):
    response = postToAPI(secrets['endpoint'], data)
    downloadfile(response['url'], '/sd/cache.mp3')
    displayQuote(response['text'])
    playMP3("/sd/cache.mp3")

Modifying the Lambda function

The serverless application includes a Lambda function, SynthesizeSpeechFunction, which can be modified directly in the Lambda console. The AWS SAM template used to deploy the AWS Serverless Application Repository application adds policies for accessing the S3 bucket where audio is stored, grants access to Amazon Polly for synthesizing speech, adds the PaperQuotes API token as an environment variable, and sets API Gateway as an event source.

SynthesizeSpeechFunction:
    Type: AWS::Serverless::Function
    Properties:
      CodeUri: lambda_functions/SynthesizeSpeech/
      Handler: app.lambda_handler
      Runtime: python3.8
      Policies:
        - S3FullAccessPolicy:
            BucketName: !Sub "${AWS::StackName}-audio"
        - Version: '2012-10-17'
          Statement:
            - Effect: Allow
              Action:
                - polly:*
              Resource: '*'
      Environment:
        Variables:
          BUCKET_NAME: !Sub "${AWS::StackName}-audio"
          PAPER_QUOTES_TOKEN: !Ref PaperQuotesAPIKey
      Events:
        Speech:
          Type: Api
          Properties:
            RestApiId: !Ref SpeechApi
            Path: /speech
            Method: post

To edit the Lambda function, navigate back to the CloudFormation stack and choose the SynthesizeSpeechFunction under the Resources tab.

From here, you can edit the Lambda function code directly. Clicking Save deploys the new code.

The getQuotes function is called to fetch quotes from the PaperQuotes API. You can change this to call from a different source, such as a custom selection of quotes. Try modifying it to fetch social media posts or study questions.
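A hedged sketch of swapping the PaperQuotes call for a hard-coded selection follows. It assumes the handler expects getQuotes to return a list of quote strings; check the deployed function code for the exact contract before editing.

import random

# An illustrative custom quote source used in place of the PaperQuotes API.
CUSTOM_QUOTES = [
    "Ship small, ship often.",
    "Measure twice, deploy once.",
]

def getQuotes(tags):
    # Ignore the requested tags and return the custom selection.
    return CUSTOM_QUOTES

def pickQuote(tags):
    # Pick one quote at random, mirroring the original random selection.
    return random.choice(getQuotes(tags))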

Conclusion

I show how to add natural sounding text to speech on a microcontroller using a serverless backend. This is accomplished by deploying an application through the AWS Serverless Application Repository. The deployed API uses API Gateway to securely invoke a Lambda function that fetches quotes from the PaperQuotes API and generates speech using Amazon Polly. The speech audio is uploaded to S3.

I then show how to program a microcontroller, the Adafruit PyPortal, using CircuitPython. The code periodically calls the serverless API to fetch a quote and to download speech audio for playback. The sample code also demonstrates synthesizing arbitrary text to speech, meaning it can be used for any project you can conceive. Check out my previous guide on using the PyPortal to create a Martian weather display for inspiration.

Moovit embraces data lake architecture by extending their Amazon Redshift cluster to analyze billions of data points every day

Post Syndicated from Yonatan Dolan original https://aws.amazon.com/blogs/big-data/moovit-embraces-data-lake-architecture-by-extending-their-amazon-redshift-cluster-to-analyze-billions-of-data-points-every-day/

Amazon Redshift is a fast, fully managed, cloud-native data warehouse that makes it simple and cost-effective to analyze all your data using standard SQL and your existing business intelligence tools.

Moovit is a leading Mobility as a Service (MaaS) solutions provider and maker of the top urban mobility app. Guiding over 800 million users in more than 3,200 cities across 103 countries to get around town effectively and conveniently, Moovit has experienced exponential growth of their service in the last few years. The company amasses up to 6 billion anonymous data points a day to add to the world’s largest repository of transit and urban mobility data, aided by Moovit’s network of more than 685,000 local editors that help map and maintain local transit information in cities that would otherwise be unserved.

Like Moovit, many companies today are using Amazon Redshift to analyze data and perform various transformations on the data. However, as data continues to grow and become even more important, companies are looking for ways to extract more valuable insights from it, such as big data analytics, numerous machine learning (ML) applications, and a range of tools to drive new use cases and business processes. Companies want all of their data to be accessible to all of their users, all the time, with fast answers. The best solution for all those requirements is for companies to build a data lake, which is a centralized repository that allows you to store all your structured, semi-structured, and unstructured data at any scale.

With a data lake built on Amazon Simple Storage Service (Amazon S3), you can easily run big data analytics using services such as Amazon EMR and AWS Glue. You can also query structured data (such as CSV, Avro, and Parquet) and semi-structured data (such as JSON and XML) by using Amazon Athena and Amazon Redshift Spectrum. You can also use a data lake with ML services such as Amazon SageMaker to gain insights.

Moovit uses an Amazon Redshift cluster to allow different company teams to analyze vast amounts of data. They wanted a way to extend the collected data into the data lake and allow additional analytical teams to access more data to explore new ideas and business cases.

Additionally, Moovit was looking to manage their storage costs and evolve to a model that allowed cooler data to be maintained at the lowest cost in S3, and maintain the hottest data in Redshift for the most efficient query performance. The proposed solution implemented a hot/cold storage pattern using Amazon Redshift Spectrum and reduced the local disk utilization on the Amazon Redshift cluster to make sure costs are maintained. Moovit is currently evaluating the new RA3 node with managed storage as an additional level of flexibility that will allow them to easily scale the amount of hot/cold storage without limit.

In this post we demonstrate how Moovit, with the support of AWS, implemented a lake house architecture by employing the following best practices:

  • Unloading data into Amazon Simple Storage Service (Amazon S3)
  • Instituting a hot/cold pattern using Amazon Redshift Spectrum
  • Using AWS Glue to crawl and catalog the data
  • Querying data using Athena

Solution overview

The following diagram illustrates the solution architecture.

The solution includes the following steps:

  1. Unload data from Amazon Redshift to Amazon S3
  2. Create an AWS Glue Data Catalog using an AWS Glue crawler
  3. Query the data lake in Amazon Athena
  4. Query Amazon Redshift and the data lake with Amazon Redshift Spectrum

Prerequisites

To complete this walkthrough, you must have the following prerequisites:

  1. An AWS account.
  2. An Amazon Redshift cluster.
  3. The following AWS services and access: Amazon Redshift, Amazon S3, AWS Glue, and Athena.
  4. The appropriate AWS Identity and Access Management (IAM) permissions for Amazon Redshift Spectrum and AWS Glue to access Amazon S3 buckets. For more information, see IAM policies for Amazon Redshift Spectrum and Setting up IAM Permissions for AWS Glue.

Walkthrough

To demonstrate the process Moovit used during their data architecture, we use the industry-standard TPC-H dataset provided publicly by the TPC organization.

The Orders table has the following columns:

Column           Type
O_ORDERKEY       int4
O_CUSTKEY        int4
O_ORDERSTATUS    varchar
O_TOTALPRICE     numeric
O_ORDERDATE      date
O_ORDERPRIORITY  varchar
O_CLERK          varchar
O_SHIPPRIORITY   int4
O_COMMENT        varchar
SKIP             varchar

Unloading data from Amazon Redshift to Amazon S3

Amazon Redshift allows you to unload your data using a data lake export to an Apache Parquet file format. Parquet is an efficient open columnar storage format for analytics. Parquet format is up to twice as fast to unload and consumes up to six times less storage in Amazon S3, compared with text formats.

To unload cold or historical data from Amazon Redshift to Amazon S3, you need to run an UNLOAD statement similar to the following code (substitute your IAM role ARN):

-- Unload cold data from the local tpc.orders table to partitioned Parquet files in Amazon S3.
-- Substitute your own bucket, account number, and IAM role name.
UNLOAD ('SELECT o_orderkey, o_custkey, o_orderstatus, o_totalprice, o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment, skip
FROM tpc.orders
ORDER BY o_orderkey, o_orderdate')
TO 's3://tpc-bucket/orders/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::<account_number>:role/<role_name>'
FORMAT AS PARQUET ALLOWOVERWRITE PARTITION BY (o_orderdate);

It is important to choose a partition key or column that minimizes Amazon S3 scans as much as possible for the intended query patterns. Queries often filter by date ranges, so this use case uses the o_orderdate field as the partition key.

Another important recommendation when unloading is to keep file sizes between 128 MB and 512 MB. By default, the UNLOAD command splits the results into one or more files per node slice (a virtual worker in the Amazon Redshift cluster), which takes advantage of the Amazon Redshift MPP architecture. However, this can result in many small files, one or more per slice. In Moovit’s use case, the default UNLOAD with PARALLEL ON yielded dozens of small files of only a few megabytes each. For Moovit, PARALLEL OFF yielded the best results because it aggregates the work of all slices on the leader node and writes the output as a single stream, with the file size controlled by the MAXFILESIZE option.
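
A minimal sketch of such an UNLOAD, assuming PARALLEL OFF and MAXFILESIZE are combined with the same partitioning and placeholders as before (the 512 MB value is illustrative):

UNLOAD ('SELECT o_orderkey, o_custkey, o_orderstatus, o_totalprice, o_orderdate, o_orderpriority, o_clerk, o_shippriority, o_comment, skip
FROM tpc.orders
ORDER BY o_orderkey, o_orderdate')
TO 's3://tpc-bucket/orders/'
CREDENTIALS 'aws_iam_role=arn:aws:iam::<account_number>:role/<role_name>'
FORMAT AS PARQUET ALLOWOVERWRITE
PARTITION BY (o_orderdate)
PARALLEL OFF        -- aggregate the output on the leader node as a single stream
MAXFILESIZE 512 MB; -- start a new file once the current one reaches roughly 512 MB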

Another performance enhancement applied in this use case was the use of Parquet’s min and max statistics. Parquet files store min_value and max_value column statistics for each row group, which allow Amazon Redshift Spectrum to prune (skip) row groups that are out of scope for a query (a range-restricted scan). To benefit from row group pruning, sort the data by frequently used columns, as the ORDER BY clause in the preceding UNLOAD statement does. Min/max pruning helps scan less data from Amazon S3, which results in improved performance and reduced cost.

After unloading the data to your data lake, you can view your Parquet file’s content in Amazon S3 (assuming it’s under 128 MB). From the Actions drop-down menu, choose Select from.
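
The Select from action runs an S3 Select query against the chosen object. A minimal preview query looks like the following sketch (s3object is the alias S3 Select assigns to the selected object):

SELECT * FROM s3object s LIMIT 5;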

You’re now ready to populate your Data Catalog using an AWS Glue crawler.

Creating a Data Catalog with an AWS Glue crawler

To query your data lake using Athena, you must catalog the data. The Data Catalog is an index of the location, schema, and runtime metrics of the data.

An AWS Glue crawler accesses your data store, extracts metadata (such as field types), and creates a table schema in the Data Catalog. For instructions, see Working with Crawlers on the AWS Glue Console.
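
If you prefer not to run a crawler, you can instead define the table manually with Athena DDL. The following sketch assumes a Data Catalog database named datalake and mirrors the column types unloaded earlier; MSCK REPAIR TABLE then registers the o_orderdate partition folders that UNLOAD created:

CREATE DATABASE IF NOT EXISTS datalake;

CREATE EXTERNAL TABLE datalake.orders (
  o_orderkey      int,
  o_custkey       int,
  o_orderstatus   string,
  o_totalprice    decimal(12,2),
  o_orderpriority string,
  o_clerk         string,
  o_shippriority  int,
  o_comment       string,
  `skip`          string
)
PARTITIONED BY (o_orderdate date)
STORED AS PARQUET
LOCATION 's3://tpc-bucket/orders/';

-- Register the partition folders written by UNLOAD
MSCK REPAIR TABLE datalake.orders;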

Querying the data lake in Athena

After you create the crawler, you can view the schema and tables in AWS Glue and Athena, and can immediately start querying the data in Athena. The following screenshot shows the table in the Athena Query Editor.
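
For example, a quick validation query similar to the following sketch can be run directly in the Athena Query Editor (the datalake database and orders table names are assumptions based on how the crawler is configured):

SELECT o_orderpriority,
       count(*)          AS order_count,
       sum(o_totalprice) AS total_price
FROM datalake.orders
WHERE o_orderdate BETWEEN date '1997-01-01' AND date '1997-12-31'
GROUP BY o_orderpriority
ORDER BY o_orderpriority;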

Querying Amazon Redshift and the data lake using a unified view with Amazon Redshift Spectrum

Amazon Redshift Spectrum is a feature of Amazon Redshift that allows multiple Amazon Redshift clusters to query the same data in the data lake. It enables the lake house architecture, allowing data warehouse queries to reference data in the data lake as they would any other table. Amazon Redshift clusters use the Amazon Redshift Spectrum feature transparently whenever a SQL query references an external table stored in Amazon S3. Multiple large queries can run in parallel, with Amazon Redshift Spectrum scanning, filtering, and aggregating external tables and returning only the resulting rows from Amazon S3 to the Amazon Redshift cluster.

Following best practices, Moovit decided to persist all their data in their Amazon S3 data lake and only store hot data in Amazon Redshift. They could query both hot and cold datasets in a single query with Amazon Redshift Spectrum.

The first step is to create an external schema in Amazon Redshift that maps to a database in the Data Catalog. See the following code:

-- Map the datalake database from the AWS Glue Data Catalog into Amazon Redshift
CREATE EXTERNAL SCHEMA spectrum
FROM DATA CATALOG
DATABASE 'datalake'
IAM_ROLE 'arn:aws:iam::<account_number>:role/mySpectrumRole'
CREATE EXTERNAL DATABASE IF NOT EXISTS;

After the crawler has created the table in the Data Catalog, you can query it from Amazon Redshift through the mapped external schema that you created earlier. See the following code:

SELECT * FROM spectrum.orders;
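
Because the unloaded data is partitioned by o_orderdate and sorted by o_orderkey, range-restricted queries such as the following sketch let Amazon Redshift Spectrum skip partitions and row groups outside the filters (the date and key ranges are illustrative):

SELECT o_orderstatus, count(*) AS order_count
FROM spectrum.orders
WHERE o_orderdate < '1995-01-01'              -- partition pruning on the partition column
  AND o_orderkey BETWEEN 1000000 AND 2000000  -- row group pruning via Parquet min/max statistics
GROUP BY o_orderstatus;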

Lastly, create a late-binding view that unions the hot and cold data:

CREATE OR REPLACE VIEW lake_house_joint_view AS (
  -- Hot data (last 90 days) from the local table, cold data from the data lake via Spectrum
  SELECT * FROM public.orders
  WHERE o_orderdate >= dateadd('day', -90, date_trunc('day', getdate()))
  UNION ALL
  SELECT * FROM spectrum.orders
  WHERE o_orderdate < dateadd('day', -90, date_trunc('day', getdate()))
) WITH NO SCHEMA BINDING;
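
Queries against this view transparently combine hot rows from the local table with cold rows that Amazon Redshift Spectrum reads from Amazon S3, for example:

SELECT date_trunc('month', o_orderdate) AS order_month,
       sum(o_totalprice)                AS monthly_revenue
FROM lake_house_joint_view
GROUP BY 1
ORDER BY 1;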

Summary

In this post, we showed how Moovit unloaded data from Amazon Redshift to a data lake. By doing so, they exposed the data to many additional groups within the organization and democratized it. The benefits of this data democratization are substantial: teams across Moovit can access the data, analyze it with the tools of their choice, and come up with new insights.

As an additional benefit, Moovit reduced the storage used on their Amazon Redshift cluster, which allowed them to keep the cluster size steady and avoid additional spending by keeping all historical data in the data lake and only hot data in the Amazon Redshift cluster. Because only hot data lives on the cluster, Moovit no longer needs to delete data frequently, which saves IT resources, time, and effort.

If you are looking to extend your data warehouse to a data lake and leverage various tools for big data analytics and machine learning (ML) applications, we invite you to try out this walkthrough.

About the Authors

Yonatan Dolan is a Business Development Manager at Amazon Web Services. He is located in Israel and helps customers harness AWS analytical services to leverage data, gain insights, and derive value.

Alon Gendler is a Startup Solutions Architect at Amazon Web Services. He works with AWS customers to help them architect secure, resilient, scalable and high performance applications in the cloud.

Vincent Gromakowski is a Specialist Solutions Architect for Amazon Web Services.