All posts by Sachin Thakkar

Backtest trading strategies with Amazon Kinesis Data Streams long-term retention and Amazon SageMaker

Post Syndicated from Sachin Thakkar original https://aws.amazon.com/blogs/big-data/backtest-trading-strategies-with-amazon-kinesis-data-streams-long-term-retention-and-amazon-sagemaker/

Real-time insight is critical when it comes to building trading strategies. Any delay in data insight can cost lot of money to the traders. Often, you need to look at historical market trends to predict future trading pattern and make the right bid. More the historical data you analyze, better trading prediction you get. Back tracking streaming data can be tricky as it requires sophisticated storage and analysis mechanisms.

Amazon Kinesis Data Streams allows our customer to store streaming data for up to one year. Amazon Kinesis Data Streams Long Term Retention (LTR) of streaming data enables to use the same platform for both real-time and older data retained in Amazon Kinesis Data Streams. For example, one can train machine learning algorithms for financial trading, marketing personalization, and recommendation models without moving the data into a different data store or writing a new application. Customers can also satisfy certain data retention regulations, including under HIPAA and FedRAMP, using long term retention. This thus simplifies the data ingestion architecture for our trading use case we will discuss in this post.

In the post Building algorithmic trading strategies with Amazon SageMaker, we demonstrated how to backtest trading strategies with Amazon SageMaker with historical stock price data stored in Amazon Simple Storage Service (Amazon S3). In this post, we expand this solution for streaming data and describe how to use Amazon Kinesis.

In addition, we want to use Amazon SageMaker automatic tuning to find the optimal configuration for a moving average crossover strategy. In this strategy, two moving averages for a slow and a fast time period are calculated, and a trade is executed when a crossover happens. If the fast moving average crosses above the slow moving average, the strategy places a long trade, otherwise the strategy goes short. We find the optimal period length for these moving averages by running multiple backtests with different lengths over a historical dataset.

Finally, we run the optimal configuration for this moving average crossover strategy over a different test dataset and analyze the performance results. If the performance measured in profit and loss (P&L) is positive for the test period, we can consider this trading strategy for a forward test.

To show you how easy and quick it is to get started on AWS, we provide a one-click deployment for an extensible trading backtesting solution that uses Kinesis long-term retention for streaming data.

Solution overview

We use Kinesis Data Streams to store real-time streaming as well as historical market data. We use Jupyter notebooks as our central interface for exploring and backtesting new trading strategies. SageMaker allows you to set up Jupyter notebooks and integrate them with AWS CodeCommit to store different versions of strategies and share them with other team members.

We use Amazon S3 to store model artifacts and backtesting results.

For our trading strategies, we create Docker containers that contain the required libraries for backtesting and the strategy itself. These containers follow the SageMaker Docker container structure in order to run them inside of SageMaker. For more information about the structure of SageMaker containers, see Using the SageMaker Training and Inference Toolkits.

The following diagram illustrates this architecture.

We run the data preparation step from a SageMaker notebook. This copies the historical market data to the S3 bucket.

We use AWS Data Migration Service (AWS DMS) to load the market data to the data stream. The

SageMaker notebook connects with Kinesis Data Streams and runs the trading strategy algorithm via a SageMaker training job. The algorithm uses part of the data for training to find the optimal strategy configuration.

Finally, we run the trading strategy using the previously determined configuration on a test dataset.

Prerequisites

Before getting started, we set up our resources. In this post, we use the us-east-2 Region as an example.

  1. Deploy the AWS resources using the provided AWS CloudFormation template.
  2. For Stack name, enter a name for your stack.
  3. Provide an existing S3 bucket name to store the historical market data.

The data is loaded in Kinesis Data Streams from this S3 bucket. Your bucket needs to be in same Region where your stack is set up.

  1. Accept all the defaults and choose Next.
  2. Acknowledge that AWS CloudFormation might create AWS Identity and Access Management (IAM) resources with custom names.
  3. Choose Create stack.

This creates all the required resources.

Data load to Kinesis Data Streams

To perform the data load, complete the following steps:

  1. On the SageMaker console, under Notebook in the navigation pane, choose Notebook instances.
  2. Locate the notebook instance AlgorithmicTradingInstance-*.
  3. Choose Open Jupyter for this instance.
  4. Go to the algorithmic-trading->4_Kinesis folder and choose Strategy_Kinesis_EMA_HPO.pynb.

You now run the data preparation step in the notebook.

  1. Load the dataset.

Specify the existing bucket where test data is stored. Make sure that the test bucket is in the same Region where you set up the stack.

  1. Run all the steps in the notebook until Step 2 Data Preparation.
  2. On the AWS DMS console, choose Database migration tasks.
  3. Select the AWS DMS task dmsreplicationtask-*.
  4. On the Actions menu, choose Restart/Resume.

This starts the data load from the S3 bucket to the data stream.

Wait until the replication task shows the status Load complete.

  1. Continue the steps in the Jupyter notebook.

Read data from Kinesis long-term retention

We read daily open, high, low, close price, and volume data from the stream’s long-term retention with the AWS SDK for Python (Boto3).

Although we don’t use enhanced fan-out (EFO) in this post, it may be advisable to do so an existing application is already reading from the stream. That way this backtesting app doesn’t interfere with the existing application.

You can visualize your data, as shown in the following screenshot.

Define your trading strategy

In this step, we define our moving average crossover trading strategy.

Build a Docker image

We build our backtesting job as a Docker image and push it to Amazon ECR.

Hyperparameter optimization with SageMaker on training data

For the moving average crossover trading strategy, we want to find the optimal fast period and slow period of this strategy, and we provide a range of days to search for.

We use profit and loss (P&L) of the strategy as the metric to find optimized hyperparameters.

You can see the tuning job recommended a value of 7 and 21 days for the fast and slow period for this trading strategy given the training dataset.

Run the strategy with optimal hyperparameters on test data

We now run this strategy with optimal hyperparameters on the test data.

When the job is complete, the performance results are stored in Amazon S3, and you can review the performance metrics in a chart and analyze the buy and sell orders for your strategy.

Conclusion

In this post, we described how to use the Kinesis Data Streams long-term retention feature to store the historical stock price data and how to use streaming data for backtesting of a trading strategy with SageMaker.

Long-term retention of streaming data enables you to use the same platform for both real-time and older data retained in Kinesis Data Streams. This allows you to use this data stream for financial use cases like backtesting or for machine learning without moving the data into a different data store or writing a new application. You can also satisfy certain data retention regulations, including under HIPAA and FedRAMP, using long-term retention. For more information, see Amazon Kinesis Data Streams enables data stream retention up to one year.

Risk disclaimer

This post is for educational purposes only and past trading performance does not guarantee future performance.


About the Authors

Sachin Thakkar is a Senior Solutions Architect at Amazon Web Services, working with a leading Global System Integrator (GSI). He brings over 22 years of experience as an IT Architect and as Technology Consultant for large institutions. His focus area is on Data & Analytics. Sachin provides architectural guidance and supports the GSI partner in building strategic industry solutions on AWS

Amogh Gaikwad is a Solutions Developer in the Prototyping Team. He specializes in machine learning and analytics and has extensive experience developing ML models in real-world environments and integrating AI/ML and other AWS services into large-scale production applications. Before joining Amazon, he worked as a software developer, developing enterprise applications focusing on Enterprise Resource Planning (ERP) and Supply Chain Management (SCM). Amogh has received his master’s in Computer Science specializing in Big Data Analytics and Machine Learning.

Dhiraj Thakur is a Solutions Architect with Amazon Web Services. He works with AWS customers and partners to provide guidance on enterprise cloud adoption, migration and strategy. He is passionate about technology and enjoys building and experimenting in Analytics and AI/ML space.

Oliver Steffmann is an Enterprise Solutions Architect at AWS based in New York. He brings over 18 years of experience as IT Architect, Software Development Manager, and as Management Consultant for international financial institutions. During his time as a consultant, he leveraged his extensive knowledge of Big Data, Machine Learning and cloud technologies to help his customers get their digital transformation off the ground. Before that he was the head of municipal trading technology at a tier-one investment bank in New York and started his career in his own startup in Germany.

Build operational metrics for your enterprise AWS Glue Data Catalog at scale

Post Syndicated from Sachin Thakkar original https://aws.amazon.com/blogs/big-data/build-operational-metrics-for-your-enterprise-aws-glue-data-catalog-at-scale/

Over the last several years, enterprises have accumulated massive amounts of data. Data volumes have increased at an unprecedented rate, exploding from terabytes to petabytes and sometimes exabytes of data. Increasingly, many enterprises are building highly scalable, available, secure, and flexible data lakes on AWS that can handle extremely large datasets. After data lakes are productionized, to measure the efficacy of the data lake and communicate the gaps or accomplishments to the business groups, enterprise data teams need tools to extract operational insights from the data lake. Those insights help answer key questions such as:

  • The last time a table was updated
  • The total table count in each database
  • The projected growth of a given table
  • The most frequently queried table vs. least queried tables

In this post, I walk you through a solution to build an operational metrics dashboard (like the following screenshot) for your enterprise AWS Glue Data Catalog on AWS.

Solution overview

This post shows you how to collect metadata information from your data lake’s AWS Glue Data Catalog resources (databases and tables) and build an operational dashboard on this data.

The following diagram illustrates the overall solution architecture and steps.

The steps are as follows:

  1. A data collector Python program runs on a schedule and collects metadata details about databases and tables from the enterprise Data Catalog.
  2. The following key data attributes are collected for each table and database in your AWS Glue Data Catalog.
Table Data Database Data
TableName DatabaseName
DatabaseName CreateTime
Owner SharedResource
CreateTime SharedResourceOwner
UpdateTime SharedResourceDatabaseName
LastAccessTime Location
TableType Description
Retention
CreatedBy
IsRegisteredWithLakeFormation
Location
SizeInMBOnS3
TotalFilesonS3
  1. The program reads each table’s file location and computes the number of files on Amazon Simple Storage Service (Amazon S3) and the size in MB.
  2. All the data for the tables and databases is stored in an S3 bucket for downstream analysis. The program runs every day and creates new files partitioned by year, month, and day on Amazon S3.
  3. We crawl the data created in Step 4 using an AWS Glue crawler.
  4. The crawler creates an external database and tables for our generated dataset for downstream analysis.
  5. We can query the extracted data with Amazon Athena.
  6. We use Amazon QuickSight to build our operational metrics dashboard and gain insights into our data lake content and usage.

For simplicity, this program crawls and collects data from the Data Catalog for the us-east-1 Region only.

Walkthrough overview

The walkthrough includes the following steps:

  1. Configure your dataset.
  2. Deploy the core solution resources with an AWS CloudFormation template, and set up and trigger the AWS Glue job.
  3. Crawl the metadata dataset and create external tables in the Data Catalog.
  4. Build a view and query the data through Athena.
  5. Set up and import data into QuickSight to create an operational metrics dashboard for the Data Catalog.

Configure your dataset

We use the AWS COVID-19 data lake for analysis. This data lake is comprised of data in a publicly readable S3 bucket.

To make the data from the AWS COVID-19 data lake available in your AWS account, create a CloudFormation stack using the following template. If you’re signed in to your AWS account, the following link fills out most of the stack creation form for you. Make sure to change the Region to us-east-1. For instructions on creating a CloudFormation stack, see Get started.

This template creates a COVID-19 database in your Data Catalog and tables that point to the public AWS COVID-19 data lake. You don’t need to host the data in your account, and you can rely on AWS to refresh the data as datasets are updated through AWS Data Exchange.

For more information about the COVID-19 dataset, see A public data lake for analysis of COVID-19 data.

Your environment may already have existing datasets in the Data Catalog. The program collects the aforementioned attributes for those datasets as well, which can be used for analysis.

Deploy your resources

To make it easier to get started, we created a CloudFormation template that automatically sets up a few key components of the solution:

  • An AWS Glue job (Python program) that is triggered based on a schedule
  • The AWS Identity and Access Management (IAM) role required by the AWS Glue job so the job can collect and store details about databases and tables in the Data Catalog
  • A new S3 bucket for the AWS Glue job to store the data files
  • A new database in the Data Catalog for storing our metrics data tables

The source code for the AWS Glue job and the CloudFormation template are available in the GitHub repo.

You must first download the AWS Glue Python code from GitHub and upload it to an existing S3 bucket. The path of this file needs to be provided when running the CloudFormation stack.

  1. Launch the stack:
  2. Provide values for your parameters as shown in the following screenshot.

After the stack is deployed successfully, you can check the resources created on the stack’s Resources tab.

You can verify and check the AWS Glue job setup and trigger, which is scheduled as per your specified time.

Now that we have verified that the stack is successfully set up, we can run our AWS Glue job manually and collect key attributes for our analysis.

  1. On the AWS Glue console, choose AWS Glue Studio in the navigation pane.
  2. In the AWS Glue Studio Console, click on Jobs and select the DataCollector job and Run the job.

The AWS Glue job collects data and stores it in the S3 bucket created for us through AWS CloudFormation. The job creates separate folders for database and table data, as shown in the following screenshot.

Crawl and set up external tables for the metrics data

Follow these steps to create tables in the database by using AWS Glue crawlers on the data stored on Amazon S3. Note that the database has been created for us using the CloudFormation stack.

  1. On the AWS Glue console, under Databases in the navigation pane, choose Tables.
  2. Choose Add tables.
  3. Choose Add tables using a crawler.
  4. Enter a name for the crawler and choose Next.
  5. For Add crawler, choose Create source type.
  6. Specify the crawler source type by choosing Data stores and choose Next.
  7. In the Add a data store section, for Choose a data store, choose S3.
  8. For Crawl data in, select Specified path.
  9. For Include path, enter the path to the tables folder generated by the AWS Glue job: s3://<data bucket created using CFN>/datalake/tables/.
  10. When asked if you want to create another data store, select No and then choose Next.
  11. On the Choose an IAM Role page, select Choose an Existing IAM Role.
  12. For IAM role, choose the IAM role created through the CloudFormation stack.
  13. Choose Next.
  14. On the Output page, for Database, choose the AWS Glue database you created earlier.
  15. Choose Next.
  16. Review your selections and choose Finish.
  17. Select the crawler you just created and choose Run crawler.

The crawler should take only a few minutes to complete. While it’s running, status messages may appear, informing you that the system is attempting to run the crawler and then is actually running the crawler. You can choose the refresh icon to check on the current status of the crawler.

  1. In the navigation pane, choose Tables.

The table called tables, which was created by the crawler, should be listed.

Query data with Athena

This section demonstrates how to query these tables using Athena. Athena is a serverless, interactive query service that makes it easy to analyze the data in the AWS COVID-19 data lake. Athena supports SQL, a common language that data analysts use for analyzing structured data. To query the data, complete the following steps:

  1. Sign in to the Athena console.
  2. If this is your first time using Athena, you must specify a query result location on Amazon S3.
  3. On the drop-down menu, choose the datalake360db database.
  4. Enter your queries and explore the datasets.

Set up and import data into QuickSight and create an operational metrics dashboard

Set up QuickSight before you import the dataset, and make sure that you have at least 512 MB of SPICE capacity. For more information, see Managing SPICE Capacity.

Before proceeding, make sure your QuickSight account has IAM permissions to access Athena (see Authorizing Connections to Amazon Athena) and Amazon S3.

Let’s first create our datasets.

  1. On the QuickSight console, choose Datasets in the navigation pane.
  2. Choose New dataset.
  3. Choose Athena from the list of data sources.
  4. For Data source name, enter a name.
  5. For Database, choose the database that you set up in the previous step (datalake360db).
  6. For Tables, select databases.
  7. Finish creating your dataset..
  8. Repeat same steps to create a tables dataset.

Now you edit the databases dataset.

  1. From the datasets list, choose the databases dataset.
  2. Choose Edit dataset.
  3. Change the createtime field type from string to date.
  4. Enter the date format as yy/MM/dd HH:mm:ss.
  5. Choose Update.
  6. Similarly, change the tables dataset fields createtime, updatetime, and lastaccessedtime to the date type.
  7. Choose Save and Publish to save the changes to the dataset.

Next, we add calculated fields for the count of databases and tables.

  1. For the tables dataset, choose Add calculation.
  2. Add the calculated field tablesCount as distinct_count({tablename}.
  3. Similarly, add a new calculated field databasesCount as distinct_count({databasename}.

Now let’s create a new analysis.

  1. In the navigation pane, choose Analysis.
  2. Choose the tables dataset.
  3. Choose Create analysis.

Let’s create our first visual for the count of number of databases and tables in our data lake Data Catalog.

  1. Create a new visual and add databasesCount from the fields list.

This provides us with a count of databases in our Data Catalog.

  1. Similarly, add a visual to display the total number of tables using the tablesCount field.

Let’s create second visual for the total number of files on Amazon S3 and total storage size on Amazon S3.

  1. Similar to the previous step, we add a new visual and choose the totalfilesons3 and sizeinmbons3 fields to display Amazon S3-related storage details.

Let’s create another visual to check which are the least used datasets.

  1. Add a visual using the LastAccessTime data element.

Finally, let’s create one more visual to check if databases are shared resources from different accounts.

  1. Select the databases dataset.
  2. We create a table visual type and add databasename, sharedresource, and description fields.

Now you have an idea of what types of visuals are possible using this data. The following screenshot is one example of a finished dashboard.

Clean up

To avoid ongoing charges, delete the CloudFormation stacks and output files in Amazon S3 that you created during deployment. You have to delete the data in the S3 buckets before you can delete the buckets.

Conclusion

In this post, we showed how you can set up an operational metrics dashboard for your Data Catalog. We set up our program to collect key data elements about our tables and databases from the AWS Glue Data Catalog. We then used this dataset to build our operational metrics dashboard and gained insights on our data lake.


About the Authors

Sachin Thakkar is a Senior Solutions Architect at Amazon Web Services, working with a leading Global System Integrator (GSI). He brings over 22 years of experience as an IT Architect and as Technology Consultant for large institutions. His focus area is on Data & Analytics. Sachin provides architectural guidance and supports the GSI partner in building strategic industry solutions on AWS