Tag Archives: Amazon Sagemaker

Announcing Amazon SageMaker Canvas – a Visual, No Code Machine Learning Capability for Business Analysts

Post Syndicated from Alex Casalboni original https://aws.amazon.com/blogs/aws/announcing-amazon-sagemaker-canvas-a-visual-no-code-machine-learning-capability-for-business-analysts/

For an organization that faces business problems and deals with data on a daily basis, the ability to build systems that can predict business outcomes becomes very important. This ability lets you solve problems and move faster by automating slow processes and embedding intelligence in your IT systems.

But how do you make sure that all teams and individual decision makers in the organization are empowered to create these machine learning (ML) systems at scale, and without depending on other data science and data engineering teams? As a business user or data analyst, you’d like to build and use prediction systems based on the data that you analyze and process every day, without having to learn about hundreds of algorithms, training parameters, evaluation metrics, and deployment best practices.

Today, I’m excited to announce the general availability of Amazon SageMaker Canvas, a new visual, no code capability that allows business analysts to build ML models and generate accurate predictions without writing code or requiring ML expertise. Its intuitive user interface lets you browse and access disparate data sources in the cloud or on-premises, combine datasets with the click of a button, train accurate models, and then generate new predictions once new data is available.

SageMaker Canvas leverages the same technology as Amazon SageMaker to automatically clean and combine your data, create hundreds of models under the hood, select the best performing one, and generate new individual or batch predictions. It supports multiple problem types such as binary classification, multi-class classification, numerical regression, and time series forecasting. These problem types let you address business-critical use cases, such as fraud detection, churn reduction, and inventory optimization, without writing a single line of code.

SageMaker Canvas in Action
Imagine that I’m an e-commerce manager who needs to predict whether or not a product will be shipped on time. The datasets at my disposal consist of a product catalog and the historical shipping dataset, both in CSV format.

First, I enter the SageMaker Canvas application where all of my models and datasets are created and inspected.

I select Import, and upload two CSV files: ProductData.csv and ShippingData.csv. I have 120 products and 10,000 shipping records.

I could also fetch data from Amazon Simple Storage Service (Amazon S3) or connect to other cloud or on-premises data sources, such as Amazon Redshift or Snowflake. For this use case, I prefer to upload 1.6 MB of data directly from my computer.

Before confirming the import, I have a chance to preview the two datasets, their columns, and their respective values. For example, each product has a ComputerBrand, ScreenSize, and PackageWeight. In addition to useful columns such as ShippingOrigin, OrderDate, and ShippingPriority, each record in the shipping dataset also contains OnTimeDelivery, which is either On Time or Late. This column will be used by SageMaker Canvas to generate a prediction model based on historical data.

After a few seconds of processing, the datasets are ready, and I decide to join them to create a single dataset containing both product and shipping information. This is an optional step that often lets you increase the precision of a prediction model.

Now I can simply drag and drop the two datasets: SageMaker Canvas will automatically identify the shared ProductId column and apply an Inner Join transformation.

The join preview lets me visualize the resulting columns, identify missing or invalid values, and optionally deselect unwanted columns.

I select Save joined data and provide a new name for this joined dataset, which now includes 16 columns and 10,000 records.

Next, I want to create a model and start by selecting New model in the Models section on the left menu. I call it On Time Prediction Model.

The first step is selecting a dataset.

I select a target column that my model will predict: OnTimeDelivery.

SageMaker Canvas shows me the value distribution and already recommends the most appropriate model type: two categories classification.

Before proceeding with the model training, I have the option to generate an analysis report. This analysis gives me two very important pieces of information: the estimated accuracy and the impact of each column.

The estimated accuracy of 99.9% gives me confidence, but then I notice that the highest impact is provided by the ActualShippingDays column. Unfortunately, this column is not available in advance and I can’t use it for my predictions. So I deselect it and run the analysis again.

The new estimated accuracy is 94.2%, which is still pretty high. The most impactful columns are ShippingPriority, YShippingDistance, XShippingDistance, and Carrier. This is great because all of this information is available in advance and can be used for a prediction. On the other hand, product-related columns, such as PackageWeight and ScreenSize, have very small impacts on the prediction. This means that in the future I could simplify the overall process by feeding only shipping information into the training and prediction phases.

I’m happy with the analysis insights. Therefore, I decide to proceed and build a prediction model by selecting the Standard build option.

Now I can go for a walk, attend a few productive meetings, or simply spend some time with family. SageMaker Canvas is doing all of the work for me, training hundreds of models behind the scenes. It will select the best performing one, so that I can start generating accurate predictions in a couple of hours. Of course, the training duration will vary depending on the dataset size and problem type.

After about an hour and a half, the model is ready and the console lets me analyze its accuracy and the column impacts visually. I’m also happy to see that the model predicts the correct value 95.8% of the time, which is even higher than the estimated accuracy.

Optionally, I could also inspect advanced metrics such as Precision, Recall, F1 Score, and so on. These metrics help me understand how the model is performing and what kind of false positives and false negatives I can expect from this model.
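As a rough illustration of how these metrics relate to false positives and false negatives, the following Python sketch computes them from confusion-matrix counts. The counts are hypothetical and are not taken from the model in this walkthrough, and treating Late as the positive class is an assumption made only for the example.

```python
def classification_metrics(tp: int, fp: int, fn: int) -> dict:
    """Compute precision, recall, and F1 from confusion-matrix counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

# Hypothetical counts: 80 late shipments caught, 20 false alarms, 20 missed
metrics = classification_metrics(tp=80, fp=20, fn=20)
print(metrics)
```

Precision drops as false positives grow (on-time shipments flagged as late), while recall drops as false negatives grow (late shipments that were missed). F1 is the harmonic mean of the two.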

From here, I could share the model into Amazon SageMaker Studio or continue using the Canvas UI to generate new predictions.

I decide to continue with the intuitive UI and select Predict. Now I can work with individual records or with a dataset for batch predictions.

When selecting Single prediction, SageMaker Canvas simplifies my life and lets me start from an existing record. I modify the column values and get immediate feedback on the prediction and the corresponding feature importance.

This quick feedback loop and intuitive UI allow me to use the ML model without having to write custom code. If I decide to integrate the model into an automated production system, the Amazon SageMaker Studio integration lets me share the model easily with other data scientists on my team.

Generally Available Today
SageMaker Canvas is generally available in US East (Ohio), US East (N. Virginia), US West (Oregon), Europe (Frankfurt), and Europe (Ireland). You can start using it with your local datasets, as well as data already stored on Amazon S3, Amazon Redshift, or Snowflake. With just a few clicks, you’ll prepare and join your datasets, analyze estimated accuracy, verify which columns are impactful, train the best performing model, and generate new individual or batch predictions. We’re excited to hear your feedback and help you solve even more business problems with ML.

Alex

Provide data reliability in Amazon Redshift at scale using Great Expectations library

Post Syndicated from Faizan Ahmed original https://aws.amazon.com/blogs/big-data/provide-data-reliability-in-amazon-redshift-at-scale-using-great-expectations-library/

Ensuring data reliability is one of the key objectives of maintaining data integrity and is crucial for building data trust across an organization. Data reliability means that the data is complete and accurate. It’s the catalyst for delivering trusted data analytics and insights. Incomplete or inaccurate data leads business leaders and data analysts to make poor decisions, which can lead to negative downstream impacts and subsequently may result in teams spending valuable time and money correcting the data later on. Therefore, it’s always a best practice to run data reliability checks before loading the data into any targets like Amazon Redshift, Amazon DynamoDB, or Amazon Timestream databases.

This post discusses a solution for running data reliability checks before loading the data into a target table in Amazon Redshift using the open-source library Great Expectations. You can automate the process for data checks via the extensive built-in Great Expectations glossary of rules using PySpark, and it’s flexible for adding or creating new customized rules for your use case.

Amazon Redshift is a cloud data warehouse solution and delivers up to three times better price-performance than other cloud data warehouses. With Amazon Redshift, you can query and combine exabytes of structured and semi-structured data across your data warehouse, operational database, and data lake using standard SQL. Amazon Redshift lets you save the results of your queries back to your Amazon Simple Storage Service (Amazon S3) data lake using open formats like Apache Parquet, so that you can perform additional analytics from other analytics services like Amazon EMR, Amazon Athena, and Amazon SageMaker.

Great Expectations (GE) is an open-source library that is available on GitHub for public use. It helps data teams eliminate pipeline debt through data testing, documentation, and profiling. Great Expectations helps build trust, confidence, and integrity of data across data engineering and data science teams in your organization. GE offers a variety of expectations that developers can configure. The tool defines expectations as statements describing verifiable properties of a dataset. Not only does it offer a glossary of more than 50 built-in expectations, it also allows data engineers and scientists to write custom expectation functions.

Use case overview

Before performing analytics or building machine learning (ML) models, cleaning data can take up a lot of time in the project cycle. Without automated and systematic data quality checks, we may spend most of our time cleaning data and hand-coding one-off quality checks. As most data engineers and scientists know, this process can be both tedious and error-prone.

Having an automated quality check system is critical to project efficiency and data integrity. Such systems help us understand data quality expectations and the business rules behind them, know what to expect in our data analysis, and make communicating the data’s intricacies much easier. For example, in a raw dataset of customer profiles of a business, if there’s a column for date of birth in format YYYY-mm-dd, values like 1000-09-01 would be correctly parsed as a date type. However, logically this value would be incorrect in 2021, because the age of the person would be 1021 years, which is impossible.
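To make the date of birth example concrete, here is a minimal Python sketch of a check that separates "parses as a date" from "is logically plausible". The function name and the 120-year limit are assumptions made for illustration; they don't come from the solution in this post.

```python
from datetime import date, datetime
from typing import Optional

MAX_PLAUSIBLE_AGE_YEARS = 120  # assumed business rule, not from the post

def is_plausible_birth_date(value: str, today: Optional[date] = None) -> bool:
    """Return True if value parses as YYYY-mm-dd and implies a believable age."""
    today = today or date.today()
    try:
        born = datetime.strptime(value, "%Y-%m-%d").date()
    except ValueError:
        return False  # not a valid date at all
    if born > today:
        return False  # birth dates in the future are rejected
    # Subtract one year if this year's birthday has not happened yet
    age = today.year - born.year - ((today.month, today.day) < (born.month, born.day))
    return age <= MAX_PLAUSIBLE_AGE_YEARS

reference_day = date(2021, 12, 1)
print(is_plausible_birth_date("1000-09-01", reference_day))  # parses, but implausible
print(is_plausible_birth_date("1985-09-01", reference_day))
```

A type check alone would accept both values; only a rule that encodes the business expectation catches the first one.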

Another use case is streaming analytics with GE, where you use AWS Database Migration Service (AWS DMS) to migrate a relational database management system. AWS DMS can export change data capture (CDC) files in Parquet format to Amazon S3, where an AWS Glue job using GE can cleanse them. The cleansed rows can then be written to a destination bucket for Athena consumption or streamed in Avro format to Amazon Kinesis or Kafka.

Additionally, automated data quality checks can be versioned and also bring benefit in the form of optimal data monitoring and reduced human intervention. Data lineage in an automated data quality system can also indicate at which stage in the data pipeline the errors were introduced, which can help inform improvements in upstream systems.

Solution architecture

This post comes with a ready-to-use blueprint that automatically provisions the necessary infrastructure and spins up a SageMaker notebook that walks you step by step through the solution. Additionally, it enforces the best practices in data DevOps and infrastructure as code. The following diagram illustrates the solution architecture.

The architecture contains the following components:

  1. Data lake – When we run the AWS CloudFormation stack, an open-source sample dataset in CSV format is copied to an S3 bucket in your account. As an output of the solution, the data destination is an S3 bucket. This destination consists of two separate prefixes, each of which contains files in Parquet format, to distinguish between accepted and rejected data.
  2. DynamoDB – The CloudFormation stack persists data quality expectations in a DynamoDB table. Four predefined column expectations are populated by the stack in a table called redshift-ge-dq-dynamo-blog-rules. Apart from the pre-populated rules, you can add any rule from the Great Expectations glossary according to the data model showcased later in the post.
  3. Data quality processing – The solution utilizes a SageMaker notebook instance powered by Amazon EMR to process the sample dataset using PySpark (v3.1.1) and Great Expectations (v0.13.4). The notebook is automatically populated with the S3 bucket location and Amazon Redshift cluster identifier via the SageMaker lifecycle config provisioned by AWS CloudFormation.
  4. Amazon Redshift – We create internal and external tables in Amazon Redshift for the accepted and rejected datasets produced from processing the sample dataset. The external dq_rejected.monster_com_rejected table, for rejected data, uses Amazon Redshift Spectrum and creates an external database in the AWS Glue Data Catalog to reference the table. The dq_accepted.monster_com table is created as a regular Amazon Redshift table by using the COPY command.

Sample dataset

As part of this post, we have performed tests on the Monster.com job applicants sample dataset to demonstrate the data reliability checks using the Great Expectations library and loading data into an Amazon Redshift table.

The dataset contains nearly 22,000 different sample records with the following columns:

  • country
  • country_code
  • date_added
  • has_expired
  • job_board
  • job_description
  • job_title
  • job_type
  • location
  • organization
  • page_url
  • salary
  • sector
  • uniq_id

For this post, we have selected four columns with inconsistent or dirty data, namely organization, job_type, uniq_id, and location, whose inconsistencies are flagged according to the rules we define from the GE glossary as described later in the post.

Prerequisites

For this solution, you should have the following prerequisites:

  • An AWS account if you don’t have one already. For instructions, see Sign Up for AWS.
  • For this post, you can launch the CloudFormation stack in the following Regions:
    • us-east-1
    • us-east-2
    • us-west-1
    • us-west-2
  • An AWS Identity and Access Management (IAM) user. For instructions, see Create an IAM User.
  • The user should have create, write, and read access for the following AWS services:
  • Familiarity with Great Expectations and PySpark.

Set up the environment

Choose Launch Stack to start creating the required AWS resources for the notebook walkthrough:

For more information about Amazon Redshift cluster node types, see Overview of Amazon Redshift clusters. For the type of workflow described in this post, we recommend using the RA3 Instance Type family.

Run the notebooks

When the CloudFormation stack is complete, complete the following steps to run the notebooks:

  1. On the SageMaker console, choose Notebook instances in the navigation pane.

This opens the notebook instances in your Region. You should see a notebook titled redshift-ge-dq-EMR-blog-notebook.

  2. Choose Open Jupyter next to this notebook to open the Jupyter notebook interface.

You should see the Jupyter notebook file titled ge-redshift.ipynb.

  3. Choose the file to open the notebook and follow the steps to run the solution.

Run configurations to create a PySpark context

When the notebook is open, make sure the kernel is set to Sparkmagic (PySpark). Run the following block to set up Spark configs for a Spark context.

Create a Great Expectations context

In Great Expectations, your data context manages your project configuration. We create a data context for our solution by passing our S3 bucket location. The S3 bucket’s name, created by the stack, should already be populated within the cell block. Run the following block to create a context:

from great_expectations.data_context.types.base import DataContextConfig, DatasourceConfig, S3StoreBackendDefaults
from great_expectations.data_context import BaseDataContext

# `spark` is the session provided by the Sparkmagic (PySpark) kernel
bucket_prefix = "ge-redshift-data-quality-blog"
bucket_name = "ge-redshift-data-quality-blog-region-account_id"
# Extract the Region portion from the bucket name
region_name = '-'.join(bucket_name.replace(bucket_prefix, '').split('-')[1:4])
dataset_path = f"s3://{bucket_name}/monster_com-job_sample.csv"
project_config = DataContextConfig(
    config_version=2,
    plugins_directory=None,
    config_variables_file_path=None,
    datasources={
        "my_spark_datasource": {
            "data_asset_type": {
                "class_name": "SparkDFDataset",  # set the dataset type to Spark
                "module_name": "great_expectations.dataset",
            },
            "spark_config": dict(spark.sparkContext.getConf().getAll()),  # pass the Spark session configs
            "class_name": "SparkDFDatasource",
            "module_name": "great_expectations.datasource",
        }
    },
    # Use the S3 bucket as the default store backend
    store_backend_defaults=S3StoreBackendDefaults(default_bucket_name=bucket_name),
)
context = BaseDataContext(project_config=project_config)

For more details on creating a GE context, see Getting started with Great Expectations.
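The region_name expression in the preceding block is compact, so the following sketch walks through it step by step. The bucket name here is a hypothetical value that follows the prefix-region-account_id pattern; the real name is populated by the stack.

```python
bucket_prefix = "ge-redshift-data-quality-blog"
# Hypothetical bucket name following the prefix-region-account_id pattern
bucket_name = "ge-redshift-data-quality-blog-us-east-1-123456789012"

suffix = bucket_name.replace(bucket_prefix, "")  # "-us-east-1-123456789012"
parts = suffix.split("-")                        # ["", "us", "east", "1", "123456789012"]
region_name = "-".join(parts[1:4])               # "us-east-1"
print(region_name)
```

The slice [1:4] assumes a Region name with exactly three hyphen-separated parts, which holds for the four Regions listed in the prerequisites.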

Get GE validation rules from DynamoDB

Our CloudFormation stack created a DynamoDB table with prepopulated rows of expectations. The data model in DynamoDB describes the properties related to each dataset and its columns and the number of expectations you want to configure for each column. The following code describes an example of the data model for the column organization:

{
 "id": "job_reqs-organization", 
 "dataset_name": "job_reqs", 
 "rules": [ //list of expectations to apply to this column
  {
   "kwargs": {
    "result_format": "SUMMARY|COMPLETE|BASIC|BOOLEAN_ONLY" //The level of detail of the result
   },
   "name": "expect_column_values_to_not_be_null",//name of GE expectation
   "reject_msg": "REJECT:null_values_found_in_organization"
  }
 ],
 "column_name": "organization"
}

The code contains the following parameters:

  • id – Unique ID of the document
  • dataset_name – Name of the dataset, for example monster_com
  • rules – List of GE expectations to apply:
    • kwargs – Parameters to pass to an individual expectation
    • name – Name of the expectation from the GE glossary
    • reject_msg – String to flag for any row that doesn’t pass this expectation
  • column_name – Name of dataset column to run the expectations on

Each column can have one or more expectations associated that it needs to pass. You can also add expectations for more columns or to existing columns by following the data model shown earlier. With this technique, you can automate verification of any number of data quality rules for your datasets without performing any code change. Apart from its flexibility, what makes GE powerful is the ability to create custom expectations if the GE glossary doesn’t cover your use case. For more details on creating custom expectations, see How to create custom Expectations.
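As a sketch of how rule documents in this data model could drive validation without code changes, the following Python groups rules by column for one dataset. The items list is a hypothetical in-memory stand-in for the DynamoDB table contents, the helper name rules_by_column is ours, and the job_type reject message is invented for the example.

```python
from collections import defaultdict

# Hypothetical items shaped like the data model; a real run would read them from DynamoDB
items = [
    {
        "id": "job_reqs-organization",
        "dataset_name": "job_reqs",
        "column_name": "organization",
        "rules": [
            {
                "name": "expect_column_values_to_not_be_null",
                "kwargs": {"result_format": "SUMMARY"},
                "reject_msg": "REJECT:null_values_found_in_organization",
            }
        ],
    },
    {
        "id": "job_reqs-job_type",
        "dataset_name": "job_reqs",
        "column_name": "job_type",
        "rules": [
            {
                "name": "expect_column_values_to_match_regex_list",
                "kwargs": {"regex_list": [".*Full.*Time", ".*Part.*Time", ".*Contract.*"]},
                "reject_msg": "REJECT:invalid_job_type",  # hypothetical message
            }
        ],
    },
]

def rules_by_column(items, dataset_name):
    """Map each column name to its list of rules for one dataset."""
    grouped = defaultdict(list)
    for item in items:
        if item["dataset_name"] == dataset_name:
            grouped[item["column_name"]].extend(item["rules"])
    return dict(grouped)

column_rules = rules_by_column(items, "job_reqs")
for column, rules in column_rules.items():
    for rule in rules:
        print(column, rule["name"], rule["reject_msg"])
```

Adding a rule then means adding an entry to a document's rules list, while the code that applies the rules stays unchanged.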

Now run the cell block to fetch the GE rules from the DynamoDB client:

Read the monster.com sample dataset and pass through validation rules.

After we have the expectations fetched from DynamoDB, we can read the raw CSV dataset. This dataset should already be copied to your S3 bucket location by the CloudFormation stack. You should see the following output after reading the CSV as a Spark DataFrame.

To evaluate whether a row passes each column’s expectations, we need to pass the necessary columns to a Spark user-defined function. This UDF evaluates each row in the DataFrame and appends the results of each expectation to a comments column.

Rows that pass all column expectations have a null value in the comments column.

A row that fails at least one column expectation is flagged with the string format REJECT:reject_msg_from_dynamo. For example, if a row has a null value in the organization column, then according to the rules defined in DynamoDB, the comments column is populated by the UDF as REJECT:null_values_found_in_organization.

The UDF recognizes a potentially erroneous column by evaluating the result dictionary generated by the Great Expectations library. The generation and structure of this dictionary depend on the result_format keyword argument. In short, if the count of unexpected values in any column is greater than zero, we flag the row as rejected.
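The decision described above can be sketched in a few lines of plain Python. The dictionaries below are hypothetical and only mimic the shape of a validation result that carries an unexpected_count; the helper name rejection_comment is ours, not the notebook's.

```python
def rejection_comment(result: dict, reject_msg: str):
    """Return reject_msg if the validation result reports unexpected values."""
    unexpected = result.get("result", {}).get("unexpected_count", 0)
    return reject_msg if unexpected > 0 else None

# Hypothetical result dictionaries for a single-row evaluation
passing = {"success": True, "result": {"element_count": 1, "unexpected_count": 0}}
failing = {"success": False, "result": {"element_count": 1, "unexpected_count": 1}}

msg = "REJECT:null_values_found_in_organization"
print(rejection_comment(passing, msg))
print(rejection_comment(failing, msg))
```

Because the structure of the dictionary depends on result_format, a format that omits the counts would need a different test, for example one based on the success field.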

Split the resulting dataset into accepted and rejected DataFrames.

Now that we have all the rejected rows flagged in the source DataFrame within the comments column, we can use this property to split the original dataset into accepted and rejected DataFrames. In the previous step, we mentioned that we append an action message in the comments column for each failed expectation in a row. With this fact, we can select rejected rows as those whose comments value starts with the string REJECT (alternatively, you can filter by non-null values in the comments column, because accepted rows have a null value there). When we have the set of rejected rows, we can get the accepted rows as a separate DataFrame by using the following PySpark except function.
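The split logic can be illustrated with plain Python lists standing in for DataFrames. This is only an analogy for the idea (select rejected rows by prefix, then take everything else as accepted); it isn't the PySpark code from the notebook, and the sample rows are hypothetical.

```python
# Hypothetical rows after the UDF has filled in the comments column
rows = [
    {"uniq_id": "a1", "organization": "Acme", "comments": None},
    {"uniq_id": "b2", "organization": None,
     "comments": "REJECT:null_values_found_in_organization"},
    {"uniq_id": "c3", "organization": "Initech", "comments": None},
]

def split_rows(rows):
    """Return (accepted, rejected) lists based on the comments column."""
    rejected = [r for r in rows if (r["comments"] or "").startswith("REJECT")]
    # Accepted rows are the original rows minus the rejected ones
    accepted = [r for r in rows if r not in rejected]
    return accepted, rejected

accepted, rejected = split_rows(rows)
print(len(accepted), len(rejected))
```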

Write the DataFrames to Amazon S3.

Now that we have the original DataFrame divided, we can write them both to Amazon S3 in Parquet format. We need to write the accepted DataFrame without the comments column because it’s only added to flag rejected rows. Run the cell blocks to write the Parquet files under appropriate prefixes as shown in the following screenshot.

Copy the accepted dataset to an Amazon Redshift table

Now that we have written the accepted dataset, we can use the Amazon Redshift COPY command to load this dataset into an Amazon Redshift table. The notebook outlines the steps required to create a table for the accepted dataset in Amazon Redshift using the Amazon Redshift Data API. After the table is created successfully, we can run the COPY command.
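For orientation, a COPY statement for Parquet files generally has the shape built by the following sketch. The S3 prefix and IAM role ARN are placeholders, and the helper name build_copy_statement is ours; the notebook may construct and submit its statement differently.

```python
def build_copy_statement(table: str, s3_prefix: str, iam_role_arn: str) -> str:
    """Build a Redshift COPY statement that loads Parquet files from S3."""
    return (
        f"COPY {table} "
        f"FROM '{s3_prefix}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "FORMAT AS PARQUET;"
    )

copy_sql = build_copy_statement(
    table="dq_accepted.monster_com",
    s3_prefix="s3://example-bucket/accepted/",  # hypothetical prefix
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",  # hypothetical role
)
print(copy_sql)
```

The resulting string is what would be submitted to the cluster, for example through the Amazon Redshift Data API. Plain string formatting like this is only suitable for trusted, fixed inputs.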

One advantage of the data quality approach described in this post is that the Amazon Redshift COPY command doesn’t fail due to schema or datatype errors for columns that have clear expectations defined to match the schema. Similarly, you can define expectations for every column in the table, so that any row satisfying them meets the schema constraints and can be considered a dq_accepted.monster_com row.

Create an external table in Amazon Redshift for rejected data

We need to have the rejected rows available to us in Amazon Redshift for comparative analysis. These comparative analyses can help inform upstream systems regarding the quality of data being collected and how they can be corrected to improve the overall quality of data. However, it isn’t wise to store the rejected data on the Amazon Redshift cluster, particularly for large tables, because it occupies extra disk space and increases cost. Instead, we use Redshift Spectrum to register an external table in an external schema in Amazon Redshift. The external schema lives in an external database in the AWS Glue Data Catalog and is referenced by Amazon Redshift. The following screenshot outlines the steps to create an external table.
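As a textual companion to those steps, the following sketch builds the two kinds of statements typically involved: one to create the external schema backed by the AWS Glue Data Catalog, and one to create the external table over the Parquet files. The Glue database name, S3 prefix, role ARN, and column types are all placeholders chosen for illustration, and the helper name build_external_ddl is ours.

```python
def build_external_ddl(schema, glue_database, table, s3_prefix, iam_role_arn, columns):
    """Return (create_schema_sql, create_table_sql) for a Redshift Spectrum table."""
    column_sql = ", ".join(f"{name} {sql_type}" for name, sql_type in columns)
    create_schema = (
        f"CREATE EXTERNAL SCHEMA IF NOT EXISTS {schema} "
        f"FROM DATA CATALOG DATABASE '{glue_database}' "
        f"IAM_ROLE '{iam_role_arn}' "
        "CREATE EXTERNAL DATABASE IF NOT EXISTS;"
    )
    create_table = (
        f"CREATE EXTERNAL TABLE {schema}.{table} ({column_sql}) "
        f"STORED AS PARQUET LOCATION '{s3_prefix}';"
    )
    return create_schema, create_table

schema_sql, table_sql = build_external_ddl(
    schema="dq_rejected",
    glue_database="example_glue_db",            # hypothetical Glue database
    table="monster_com_rejected",
    s3_prefix="s3://example-bucket/rejected/",  # hypothetical prefix
    iam_role_arn="arn:aws:iam::123456789012:role/ExampleRedshiftRole",  # hypothetical role
    columns=[("uniq_id", "varchar(256)"), ("organization", "varchar(256)"),
             ("comments", "varchar(1024)")],    # illustrative subset and types
)
print(schema_sql)
print(table_sql)
```

Because the table is external, the rejected rows stay in Amazon S3 and only their metadata is registered in the catalog.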

Verify and compare the datasets in Amazon Redshift.

Out of a total of 22,000 records in the input dataset, 12,160 were processed successfully and loaded to the monster_com table under the dq_accepted schema. These records passed all the validation rules configured in DynamoDB.

A total of 9,840 records were rejected for breaking one or more rules configured in DynamoDB and were loaded to the monster_com_rejected table in the dq_rejected schema. In this section, we describe the behavior of each expectation on the dataset.

  • Expect column values to not be null in organization – This rule is configured to reject a row if the organization is null. The following query returns the sample of rows, from the dq_rejected.monster_com_rejected table, that are null in the organization column, with their reject message.
  • Expect column values to match the regex list in job_type – This rule expects the column entries to be strings that can be matched to either any of or all of a list of regular expressions. In our use case, we have only allowed values that match a pattern within [".*Full.*Time", ".*Part.*Time", ".*Contract.*"]. The following query shows rows that are rejected due to an invalid job type.

Most of the records were rejected with multiple reasons, and all those mismatches are captured under the comments column.

  • Expect column values to not match regex for uniq_id – Similar to the previous rule, this rule aims to reject any row whose value matches a certain pattern. In our case, that pattern is having an empty space (\s++) in the primary column uniq_id. This means we consider a value to be invalid if it has empty spaces in the string. The following query returned an invalid format for uniq_id.
  • Expect column entries to be strings with a length between a minimum value and a maximum value (inclusive) – A length check rule is defined in the DynamoDB table for the location column. This rule rejects values or rows if the length of the value violates the specified constraints. The following query returns the records that are rejected due to a rule violation in the location column.
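To show what the four rules amount to for a single row, here is a plain-Python approximation that uses the re module rather than Great Expectations itself. The job_type patterns come from the rule described above. The whitespace pattern is a simplified stand-in for the uniq_id rule, the location length bounds are hypothetical because the post doesn't state them, and treating missing values as empty strings is a simplification of ours.

```python
import re

JOB_TYPE_PATTERNS = [".*Full.*Time", ".*Part.*Time", ".*Contract.*"]  # from the post
WHITESPACE = re.compile(r"\s+")      # simplified stand-in for the uniq_id pattern
LOCATION_MIN, LOCATION_MAX = 1, 100  # hypothetical bounds

def check_row(row: dict) -> list:
    """Return a list of failed checks for one row (empty if the row passes)."""
    failures = []
    if row.get("organization") is None:
        failures.append("organization is null")
    job_type = row.get("job_type") or ""
    if not any(re.search(p, job_type) for p in JOB_TYPE_PATTERNS):
        failures.append("job_type matches no allowed pattern")
    if WHITESPACE.search(row.get("uniq_id") or ""):
        failures.append("uniq_id contains whitespace")
    if not LOCATION_MIN <= len(row.get("location") or "") <= LOCATION_MAX:
        failures.append("location length out of range")
    return failures

good = {"organization": "Acme", "job_type": "Full Time Employee",
        "uniq_id": "abc123", "location": "Madison, WI"}
bad = {"organization": None, "job_type": "Intern",
       "uniq_id": "abc 123", "location": ""}
print(check_row(good))
print(check_row(bad))
```

A row with several failures collects several messages, which mirrors how the comments column captures every mismatch for a rejected record.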

You can continue to analyze the other columns’ predefined rules from DynamoDB or pick any rule from the GE glossary and add it to an existing column. Rerun the notebook to see the result of your data quality rules in Amazon Redshift. As mentioned earlier, you can also try creating custom expectations for other columns.

Benefits and limitations

The efficiency and efficacy of this approach derive from the fact that GE enables automation and configurability to a greater degree than other approaches. A brute-force alternative would be to write stored procedures in Amazon Redshift that perform data quality checks on staging tables before data is loaded into main tables. However, this approach might not scale: stored procedures can’t persist repeatable rules for different columns the way DynamoDB does here (nor can they call DynamoDB APIs), so you would have to write and store a rule for each column of every table. Furthermore, accepting or rejecting a row based on a single rule requires complex SQL statements, which may lengthen data quality checks or require more compute power and therefore incur extra costs. With GE, a data quality rule is generic, repeatable, and scalable across different datasets.

Another benefit of this approach, related to using GE, is that it supports multiple Python-based backends, including Spark, Pandas, and Dask. This provides flexibility across an organization where teams might have skills in different frameworks. If a data scientist prefers using Pandas to write their ML pipeline feature quality test, then a data engineer using PySpark can use the same code base to extend those tests due to the consistency of GE across backends.

Furthermore, GE is written natively in Python, which means it’s a good option for engineers and scientists who are more used to running their extract, transform, and load (ETL) workloads in PySpark in comparison to frameworks like Deequ, which is natively written in Scala over Apache Spark and fits better for Scala use cases (the Python interface, PyDeequ, is also available). Another benefit of using GE is the ability to run multi-column unit tests on data, whereas Deequ doesn’t support that (as of this writing).

However, the approach described in this post might not be the most performant in some cases for full table load batch reads for very large tables. This is due to the serde (serialization/deserialization) cost of using UDFs. Because the GE functions are embedded in PySpark UDFs, the performance of these functions is slower than native Spark functions. Therefore, this approach gives the best performance when integrated with incremental data processing workflows, for example using AWS DMS to write CDC files from a source database to Amazon S3.

Clean up

Some of the resources deployed in this post, including those deployed using the provided CloudFormation template, incur costs as long as they’re in use. Be sure to remove the resources and clean up your work when you’re finished in order to avoid unnecessary cost.

On the CloudFormation console, select the stack and choose Delete stack to remove all resources.

The resources in the CloudFormation template are not production ready. If you would like to use this solution in production, enable logging for all S3 buckets and ensure the solution adheres to your organization’s encryption policies through EMR Security Best Practices.

Conclusion

In this post, we demonstrated how you can automate data reliability checks using the Great Expectations library before loading data into an Amazon Redshift table. We also showed how you can use Redshift Spectrum to create external tables. If dirty data were to make its way into the accepted table, all downstream consumers, such as business intelligence reporting, advanced analytics, and ML pipelines, could be affected and produce inaccurate reports and results. The trends in such data can mislead business leaders when they make business decisions. Furthermore, flagging dirty data as rejected before loading into Amazon Redshift also helps reduce the time and effort a data engineer might have to spend investigating and correcting the data.

We are interested to hear how you would like to apply this solution for your use case. Please share your thoughts and questions in the comments section.


About the Authors

Faizan Ahmed is a Data Architect at AWS Professional Services. He loves to build data lakes and self-service analytics platforms for his customers. He also enjoys learning new technologies and solving, automating, and simplifying customer problems with easy-to-use cloud data solutions on AWS. In his free time, Faizan enjoys traveling, sports, and reading.

Bharath Kumar Boggarapu is a Data Architect at AWS Professional Services with expertise in big data technologies. He is passionate about helping customers build performant and robust data-driven solutions and realize their data and analytics potential. His areas of interests are open-source frameworks, automation, and data architecting. In his free time, he loves to spend time with family, play tennis, and travel.

Batch Inference at Scale with Amazon SageMaker

Post Syndicated from Ramesh Jetty original https://aws.amazon.com/blogs/architecture/batch-inference-at-scale-with-amazon-sagemaker/

Running machine learning (ML) inference on large datasets is a challenge faced by many companies. There are several approaches and architecture patterns to help you tackle this problem. But no single solution may deliver the desired results for efficiency and cost effectiveness. In this blog post, we will outline a few factors that can help you arrive at the most optimal approach for your business. We will illustrate a use case and architecture pattern with Amazon SageMaker to perform batch inference at scale.

ML inference can be done in real time on individual records, such as with a REST API endpoint. Inference can also be done in batch mode as a processing job on a large dataset. While both approaches push data through a model, each has its own target goal when running inference at scale.

With real-time inference, the goal is usually to optimize the number of transactions per second that the model can process. With batch inference, the goal is usually tied to time constraints and the service-level agreement (SLA) for the job. Table 1 shows the key attributes of real-time, micro-batch, and batch inference scenarios.

Attribute           | Real Time                    | Micro Batch                    | Batch
Execution Mode      | Synchronous                  | Synchronous/Asynchronous       | Asynchronous
Prediction Latency  | Subsecond                    | Seconds to minutes             | Indefinite
Data Bounds         | Unbounded/stream             | Bounded                        | Bounded
Execution Frequency | Variable                     | Variable                       | Variable/fixed
Invocation Mode     | Continuous stream/API calls  | Event-based                    | Event-based/scheduled
Examples            | Real-time REST API endpoint  | Data analyst running a SQL UDF | Scheduled inference job

Table 1. Key characteristics of real-time, micro-batch, and batch inference scenarios

Key considerations for batch inference jobs

Batch inference tasks are usually good candidates for horizontal scaling. Each worker within a cluster can operate on a different subset of data without the need to exchange information with other workers. AWS offers multiple storage and compute options that enable horizontal scaling. Table 2 shows some key considerations when architecting for batch inference jobs.

  • Model type and ML framework. Models built with frameworks such as XGBoost and SKLearn typically require smaller compute instances. Those built with deep learning frameworks, such as TensorFlow and PyTorch, typically require larger ones.
  • Complexity of the model. Simple models can run on CPU instances while more complex ensemble models and large-scale deep learning models can benefit from GPU instances.
  • Size of the inference data. While all approaches work on small datasets, larger datasets come with a unique set of challenges. The storage system must provide sufficient throughput and I/O to reliably run the inference workload.
  • Inference frequency and job concurrency. The volume of jobs within a fixed interval of time is an important consideration to address Service Quotas. The frequency and SLA requirements also proportionally impact the number of concurrent jobs. This might create additional pressure on the underlying Service Quotas.
ML Framework
  • Traditional
    • XGBoost
    • SKLearn
  • Deep Learning
    • TensorFlow
    • PyTorch
Model Complexity
  • Low (linear models)
  • Medium (complex ensemble models)
  • High (large scale DL models)
Inference Data Size
  • Small (<1 GB)
  • Medium (<100 GB)
  • Large (<1 TB)
  • Hyperscale (>1 TB)
Inference Frequency
  • Hourly
  • Daily
  • Weekly
  • Monthly
Job Concurrency
  • 1
  • <10
  • <100
  • >100

Table 2. Key considerations when architecting for batch inference jobs

Real world Batch Inference use case and architecture

Often customers in certain domains such as advertising and marketing or healthcare must make predictions on hyperscale datasets. This requires deploying an inference pipeline that can complete several thousand inference jobs on extremely large datasets. The individual models used are typically of low complexity from a compute perspective. They could include a combination of various algorithms implemented in scikit-learn, XGBoost, and TensorFlow, for example. Most of the complexity in these use cases stems from large volumes of data and the number of concurrent jobs that must run to meet the service level agreement (SLA).

The batch inference architecture for these requirements typically is composed of three layers:

  • Orchestration layer. Manages the submission, scheduling, tracking, and error handling of individual jobs or multi-step pipelines
  • Storage layer. Stores the data that will be inferenced upon
  • Compute layer. Runs the inference job

There are several AWS services available that can be used for each of these architectural layers. The architecture in Figure 1 illustrates a real world implementation. Amazon SageMaker Processing and training services are used for the compute layer and Amazon S3 for the storage layer. Amazon Managed Workflows for Apache Airflow (MWAA) and Amazon DynamoDB are used for the orchestration and job control layer.

Figure 1. Architecture for batch inference at scale with Amazon SageMaker

Orchestration and job control layer. Apache Airflow is used to orchestrate the training and inference pipelines with job metadata captured into DynamoDB. At each step of the pipeline, Airflow updates the status of each model run. A custom Airflow sensor polls the status of each pipeline. It advances the pipeline with the successful completion of each step, or resubmits a job in case of failure.
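The poll, advance, and resubmit behavior described above can be sketched in plain Python. This is only an illustration under stated assumptions, not the Airflow sensor used in the architecture: `submit`, `get_status`, the status strings, and the `max_retries` limit are hypothetical stand-ins for calls to SageMaker and the DynamoDB job table.

```python
from typing import Callable, List


def run_pipeline(steps: List[str],
                 submit: Callable[[str], None],
                 get_status: Callable[[str], str],
                 max_retries: int = 2) -> List[str]:
    """Run steps in order, resubmitting a failed step up to max_retries times.

    submit and get_status are hypothetical callables supplied by the caller.
    Returns a log of observed statuses; raises RuntimeError if a step keeps failing.
    """
    log = []
    for step in steps:
        attempts = 0
        submit(step)
        while True:
            # A real sensor would wait between polls instead of looping tightly.
            status = get_status(step)
            log.append(f"{step}:{status}")
            if status == "Completed":
                break  # advance to the next step
            if status == "Failed":
                if attempts >= max_retries:
                    raise RuntimeError(f"{step} failed after {attempts + 1} attempts")
                attempts += 1
                submit(step)  # resubmit the failed job
    return log
```

Keeping the status lookup behind a callable means the same loop can be exercised locally with canned statuses before it is wired to a real job store.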

Compute layer. SageMaker Processing is used as the compute option for running the inference workload. SageMaker has a purpose-built batch transform feature for running batch inference jobs. However, this feature often requires additional pre- and post-processing steps to get the data into the appropriate input and output format. SageMaker Processing offers a general-purpose managed compute environment to run a custom batch inference container with a custom script. In the architecture, the processing script takes the location of the model artifact generated by a SageMaker training job and the location of the inference data as inputs, and performs pre- and post-processing along with model inference.

Storage layer. Amazon S3 is used to store the large input dataset and the output inference data. The ShardedByS3Key data distribution strategy distributes the files across multiple nodes within a processing cluster. With this option enabled, SageMaker Processing will automatically copy a different subset of input files into each node of the processing job. This way you can horizontally scale batch inference jobs by requesting a higher number of instances when configuring the job.
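As a rough sketch of how this option is expressed, the helpers below build the input and cluster portions of a processing job request as plain dictionaries. The field names follow the CreateProcessingJob request shape as we understand it and should be checked against the current API reference; the bucket name, input name, instance type, and volume size are hypothetical placeholders, and nothing here calls AWS.

```python
def build_processing_input(s3_uri: str,
                           local_path: str = "/opt/ml/processing/input") -> dict:
    """Return one ProcessingInputs entry that shards S3 objects across instances."""
    return {
        "InputName": "inference-data",  # hypothetical name
        "S3Input": {
            "S3Uri": s3_uri,
            "LocalPath": local_path,
            "S3DataType": "S3Prefix",
            "S3InputMode": "File",
            # Each instance receives a different subset of the objects under the prefix.
            "S3DataDistributionType": "ShardedByS3Key",
        },
    }


def build_processing_resources(instance_count: int,
                               instance_type: str = "ml.m5.4xlarge",
                               volume_gb: int = 30) -> dict:
    """Return the cluster configuration; raise instance_count to scale out."""
    return {
        "ClusterConfig": {
            "InstanceCount": instance_count,
            "InstanceType": instance_type,  # hypothetical choice
            "VolumeSizeInGB": volume_gb,
        }
    }


# Example: shard the files under a hypothetical prefix across 8 instances.
job_input = build_processing_input("s3://example-bucket/inference-data/")
job_resources = build_processing_resources(instance_count=8)
```

With sharding enabled, each instance only sees its own subset of files, so the processing script can simply iterate over whatever lands in its local input directory.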

One caveat of this approach is that while many ML algorithms utilize multiple CPU cores during training, only one core is utilized during inference. This can be rectified by using Python’s native concurrency and parallelism frameworks, such as concurrent.futures. The following sketch illustrates how you can distribute the inference workload across all instance cores. This assumes the SageMaker Processing job has been configured to copy the input files into the /opt/ml/processing/input directory.

from concurrent.futures import ProcessPoolExecutor, as_completed
from multiprocessing import cpu_count
import os
from glob import glob

import joblib
import pandas as pd


def inference_fn(model_dir, file_path, output_dir):
    # Load the model inside the worker process and score one input file
    model = joblib.load(f"{model_dir}/model.joblib")
    data = pd.read_parquet(file_path)
    data["prediction"] = model.predict(data)

    # Write the predictions under the same file name in the output directory
    output_path = f"{output_dir}/{os.path.basename(file_path)}"
    data.to_parquet(output_path)
    return output_path


input_files = glob("/opt/ml/processing/input/*")
model_dir = "/opt/ml/model"
output_dir = "/opt/ml/output"

# One worker process per core; each task handles a single input file
with ProcessPoolExecutor(max_workers=cpu_count()) as executor:
    futures = [
        executor.submit(inference_fn, model_dir, file_path, output_dir)
        for file_path in input_files
    ]

    results = []
    for future in as_completed(futures):
        results.append(future.result())

Conclusion

In this blog post, we described ML inference options and use cases. We primarily focused on batch inference and reviewed key challenges faced when performing batch inference at scale. We provided a mental model of some key considerations and best practices to consider as you make various architecture decisions. We illustrated these considerations with a real world use case and an architecture pattern to perform batch inference at scale. This pattern can be extended to other choices of compute, storage, and orchestration services on AWS to build large-scale ML inference solutions.

Announcing Fully Managed RStudio on Amazon SageMaker for Data Scientists

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/announcing-fully-managed-rstudio-on-amazon-sagemaker-for-data-scientists/

Two years ago, we introduced Amazon SageMaker Studio, the industry’s first fully integrated development environment (IDE) for machine learning (ML). Amazon SageMaker Studio provides a single, web-based visual interface where you can perform all ML development steps, improving data science team productivity by up to 10 times.

Many data scientists love the R project, an open-source ecosystem with more than 18,000 packages that is not just a programming language but also an interactive environment for doing data science. RStudio is one of the most popular IDEs among R developers for ML and data science projects. RStudio provides open-source tools for R and enterprise-ready professional software for data science teams to develop and share their work in the organization. However, building, securing, scaling, and maintaining RStudio yourself is tedious and cumbersome.

Today, in collaboration with RStudio PBC, we are excited to announce the general availability of RStudio on Amazon SageMaker, the industry’s first fully managed RStudio Workbench IDE in the cloud. You can now bring your current RStudio license to easily migrate your self-managed RStudio environments to Amazon SageMaker in just a few simple steps. If you’d like to read more about this exciting collaboration, check out this blog from RStudio PBC.

With RStudio on Amazon SageMaker, administrators can easily migrate their RStudio environments into Amazon SageMaker and bring existing RStudio licenses to manage through AWS License Manager. They can onboard both R and Python developers to the same Amazon SageMaker domain using AWS Single Sign-On (SSO) or AWS Identity and Access Management (IAM), and use that domain as a centralized place to configure both RStudio and Amazon SageMaker Studio.

As a result, data scientists are free to choose between programming languages and coding interfaces, and can switch between RStudio and Amazon SageMaker Studio notebooks. All of their work, including code, datasets, repositories, and other artifacts, is synchronized between the two environments through the underlying Amazon EFS storage.

Getting Started with RStudio on SageMaker
You can now launch the familiar RStudio Workbench with a simple click from Amazon SageMaker. Before getting started, your administrator needs to buy an appropriate license from RStudio PBC for end users, set up the granted licenses in AWS License Manager, and create an Amazon SageMaker domain and user profile to launch RStudio on Amazon SageMaker. To learn about all the administrator tasks, including managing licenses and monitoring usage, see the blog post on the setup process, or Manage RStudio on Amazon SageMaker in the AWS documentation.

Once the required setup process is completed, you can open RStudio Workbench by choosing the new Launch app drop-down list next to your user in the user list and selecting RStudio.

You will immediately see the RStudio Workbench home page and a list of sessions, projects, and published content on the home page. To create a new session, select the New Session button on the page, select a desired instance in the Instance Type dropdown list, and choose Start Session.

For a lightweight analysis that can be powered by two vCPUs and 4 GiB of memory, you can use the default ml.t3.medium instance. For complex, large-scale ML modeling, you can choose a larger instance with the desired compute and memory from the wide array of ML instances available on Amazon SageMaker.

In a few minutes, your session will be ready for development in RStudio Workbench. When you launch your RStudio session, the Base R image serves as the basis of your instance. This Docker image includes R v4.0, AWS tools such as awscli and the sagemaker and boto3 Python packages, and the reticulate package for interoperability between Python and R.

Managing R Packages and Publishing your Analysis
Along with the RStudio Workbench, RStudio Connect and RStudio Package Manager are the most used products of RStudio.

RStudio Connect is designed to let data scientists easily publish insights, dashboards, and web applications from RStudio Workbench. RStudio Package Manager centrally manages the package repository for your organization so that data scientists can securely install packages faster while ensuring project reproducibility and repeatability.

Your administrator, for example, can create a repository and subscribe it to the built-in source named cran in RStudio Package Manager.

$ rspm sync --wait # Initiate a sync
$ rspm create repo --name=prod-cran --description='Access CRAN packages' # Create a repository:
$ rspm subscribe --repo=prod-cran --source=cran # Subscribe the repository to the cran source

When these steps are completed, you can use the prod-cran repository in the web interface of RStudio Package Manager.

Now, you can configure this repository to install and manage your packages in RStudio Workbench. You can also configure RStudio Connect to publish insights, dashboards, and web applications from RStudio Workbench so that your collaborators can easily consume your work.

For example, you can run an analysis inline to create an R Markdown document that can be published to your collaborators. You can preview the slides while writing code with the Preview button and publish them with the Publish icon in your RStudio session.

You can also publish Shiny applications, which make it easy to create interactive web interfaces, or Python-based content such as Streamlit applications to the RStudio Connect instance.

To learn more, see Host RStudio Connect and Package Manager for ML development in RStudio on Amazon SageMaker written by my colleagues, Michael Hsieh, Chayan Panda, and Farooq Sabir on the AWS Machine Learning Blog.

Integrating training jobs with Amazon SageMaker
One of the benefits of using RStudio on Amazon SageMaker is the integration with Amazon SageMaker features. Your RStudio and Jupyter Notebook instances in Amazon SageMaker share the same Amazon EFS file system. You can import R code written in Jupyter Notebook or use the same files in both Jupyter Notebook and RStudio without having to move your files between the two.

For example, you can run R sample code that imports libraries, creates an Amazon SageMaker session, gets the IAM role, and imports and visualizes sample data. The code then stores the data in an S3 bucket and prepares a training task with an XGBoost model by specifying the training container and defining an Amazon SageMaker Estimator. To learn more, see the R sample code in Amazon SageMaker.

# Import reticulate, readr and sagemaker libraries
library(reticulate)
library(readr)
sagemaker <- import('sagemaker')

# Create a sagemaker session
session <- sagemaker$Session()

# Use the session's default bucket, or set this to the name of your own bucket
my_s3_bucket <- session$default_bucket()

# Get execution role
role_arn <- sagemaker$get_execution_role()

# Read a csv file from UCI public repository
data_file <- 'http://archive.ics.uci.edu/ml/machine-learning-databases/abalone/abalone.data'

# Copy data to a dataframe, rename columns, and show dataframe head
data_csv <- read_csv(file = data_file, col_names = FALSE, col_types = cols())
names(data_csv) <- c('sex', 'length', 'diameter', 'height', 'whole_weight', 'shucked_weight', 'viscera_weight', 'shell_weight', 'rings')
head(data_csv)

# Plot rings against height; the plot reveals records with a height equal to 0
library(ggplot2)
options(repr.plot.width = 5, repr.plot.height = 4)
ggplot(data_csv, aes(x = height, y = rings, color = sex, alpha = 0.5)) + geom_point() + geom_jitter()

# Write the dataframe to a local file and upload it to the Amazon S3 bucket
# (upload_data expects a local file path, not a dataframe)
write_csv(data_csv, 'abalone.csv', col_names = FALSE)
s3_train <- session$upload_data(path = 'abalone.csv',
                                bucket = my_s3_bucket,
                                key_prefix = 'r_hello_world_demo/data')
s3_path <- paste('s3://', my_s3_bucket, '/r_hello_world_demo/data/abalone.csv', sep = '')

# Train a XGBoost model, specify the training containers, and define an Amazon SageMaker Estimator
container <- sagemaker$image_uris$retrieve(framework = 'xgboost',
                                           region = session$boto_region_name,
                                           version = 'latest')
# Argument names follow SageMaker Python SDK v2, which also provides image_uris
estimator <- sagemaker$estimator$Estimator(image_uri = container,
                                           role = role_arn,
                                           instance_count = 1L,
                                           instance_type = 'ml.m5.4xlarge',
                                           volume_size = 30L,
                                           max_run = 3600L,
                                           input_mode = 'File',
                                           output_path = s3_path)

Now Available
RStudio on Amazon SageMaker is available in all AWS Regions where both Amazon SageMaker Studio and AWS License Manager are available. You can bring your own RStudio license to Amazon SageMaker and pay for the underlying compute and storage resources within Amazon SageMaker or other AWS services, based on your usage.

To get started with RStudio on Amazon SageMaker, you can use the AWS Free Tier, which includes 250 hours of ml.t3.medium instance usage on Amazon SageMaker Studio per month for the first two months. To learn more, see the Amazon SageMaker Pricing page.

Give it a try, and please send us feedback either in the AWS forum for Amazon SageMaker or through your usual AWS support contacts.

Channy

WeatherBug reduced ETL latency to 30 times faster using Amazon Redshift Spectrum

Post Syndicated from Anton Morozov original https://aws.amazon.com/blogs/big-data/weatherbug-reduced-etl-latency-to-30-times-faster-using-amazon-redshift-spectrum/

This post is co-written with data engineers Anton Morozov and James Phillips from WeatherBug.

WeatherBug is a brand owned by GroundTruth, based in New York City, that provides location-based advertising solutions to businesses. WeatherBug consists of a mobile app reporting live and forecast data on hyperlocal weather to consumer users. The WeatherBug Data Engineering team has built a modern analytics platform to serve multiple use cases, including weather forecasting and location-based advertising, that is completely built on AWS. They use an Amazon Simple Storage Service (Amazon S3) data lake to store clickstream data and use Amazon Redshift as their cloud data warehouse platform.

In this post, we share how WeatherBug reduced their extract, transform, and load (ETL) latency using Amazon Redshift Spectrum.

Amazon Redshift Spectrum overview

Amazon Redshift is the most widely used cloud data warehouse. It makes it fast, simple, and cost-effective to analyze all your data using standard SQL and your existing business intelligence (BI) tools. It allows you to run complex analytic queries against terabytes to petabytes of structured and semi-structured data, using sophisticated query optimization, columnar storage on high-performance storage, and massively parallel query execution.

Redshift Spectrum allows you to query open format data directly in the S3 data lake without having to load the data or duplicate your infrastructure. With Redshift Spectrum, you can query open file formats such as Apache Parquet, ORC, JSON, Avro, and CSV. For more information, see Amazon Redshift Spectrum overview and Amazon Redshift Spectrum Extends Data Warehousing Out to Exabytes—No Loading Required.

Redshift Spectrum runs on a massive compute fleet independent of your Amazon Redshift cluster. Redshift Spectrum pushes many compute-intensive tasks, such as predicate filtering and aggregation, down to the Redshift Spectrum layer. Therefore, Redshift Spectrum queries use much less of your cluster’s processing capacity than other queries. With Redshift Spectrum, you can efficiently query and retrieve structured and semi-structured data from files in Amazon S3 without having to load the data into Amazon Redshift tables. Redshift Spectrum queries employ massive parallelism to run very fast against large datasets. Much of the processing occurs in the Redshift Spectrum layer, and most of the data remains in Amazon S3. Multiple clusters can concurrently query the same dataset in Amazon S3 without the need to make copies of the data for each cluster.

Previous solution

To summarize performance metrics for internal BI needs, WeatherBug had to move a lot of data between their S3 data lake and Amazon Redshift cluster using complex ETL processes. They used Apache Airflow to orchestrate their pipeline during the ETL process and used Apache Hive for large-scale ETL jobs in order to offload the data from the Amazon Redshift cluster. The Apache Hive data warehouse software facilitates reading, writing, and managing large datasets residing in distributed storage using SQL. Structure can be projected onto data already in storage. A command line tool and JDBC driver are provided to connect users to Hive.

WeatherBug chose Hive as a solution because it was easy to use with their Python/Airflow pipeline and they were able to launch multiple jobs in parallel. This solution worked well for WeatherBug, but it required more engineering effort to build the ETL processes and added operational overhead because it involved multiple technologies.

The following diagram illustrates this previous architecture.

New Amazon Redshift Spectrum-based solution

To optimize the current Hadoop-based solution to move data between their Amazon Redshift cluster and S3 buckets, the WeatherBug team considered Redshift Spectrum as an alternative solution. The following diagram shows their updated architecture.

WeatherBug created Redshift Spectrum external tables that pointed to the data stored in their S3 buckets. This helped them perform the data movement and transformations from Amazon Redshift to Amazon S3 using Insert into <external table> select from <Redshift table> and from Amazon S3 to Amazon Redshift using Insert into <Redshift table> select from <external table> along with the data transformations in the inline SQL queries.

During the pilot implementation of the new solution on a few sample tables, WeatherBug found it very easy to learn how to use the Amazon Redshift Spectrum features, and not long after they had a proof of concept far superior to their existing Hadoop-based solution. They reduced the targeted pipeline’s runtime from 17 minutes to 30 seconds, a 3300% improvement, with the additional benefit of eliminating the cost and management of the Hadoop cluster. They were excited to apply this approach to additional pipelines that support their Amazon Redshift clusters. It was a clear win: shorter processing times and lower cost and overhead, achieved with little engineering effort.

In addition to these improvements, they replaced some of their jobs that use Apache Hive to query tables with Amazon Redshift Spectrum.

In their initial testing, WeatherBug saw costs for Amazon Redshift Spectrum of $0.14 per day instead of $16.12 per day on Hive for a selected sample job, which is a 115-fold reduction in cost.
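The runtime and cost figures quoted above can be checked with a few lines of arithmetic. The snippet below only uses the numbers stated in the post (17 minutes, 30 seconds, $16.12, and $0.14); the function names are ours, and "improvement" is read here as the percentage increase in processing speed.

```python
def speedup(old_seconds: float, new_seconds: float) -> float:
    """How many times faster the new runtime is than the old one."""
    return old_seconds / new_seconds


def improvement_percent(old_seconds: float, new_seconds: float) -> float:
    """Speed increase expressed as a percentage over the old speed."""
    return (speedup(old_seconds, new_seconds) - 1) * 100


runtime_speedup = speedup(17 * 60, 30)                  # 34.0
runtime_improvement = improvement_percent(17 * 60, 30)  # 3300.0
cost_ratio = 16.12 / 0.14                               # roughly 115

print(f"{runtime_speedup:.0f}x faster, {runtime_improvement:.0f}% improvement")
print(f"cost reduced by a factor of about {cost_ratio:.0f}")
```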

Conclusion

When you have a data lake and cloud data warehouse built in Amazon S3 and Amazon Redshift, you may need frequent ETL jobs between the two systems for different use cases. Amazon Redshift Spectrum provides an easy-to-implement, cost-effective, and high-performance solution to interact between Amazon Redshift and Amazon S3 to query the Amazon S3 data from Amazon Redshift, join Amazon Redshift tables with S3 objects, and transform using simple SQL queries. Many Data Engineering use cases similar to the WeatherBug example discussed here can be optimized using Amazon Redshift Spectrum.


About the Authors

Anton Morozov is a WeatherBug data engineer working on analytics pipelines. He considers Amazon Redshift an essential technology for understanding data for business intelligence needs.

James Phillips is a WeatherBug data engineer who uses many AWS technologies on a daily basis. Some of these include Amazon Redshift, Amazon EMR, and Amazon SageMaker.

Avijit Goswami is a Principal Solutions Architect at AWS, helping his digital native and startup customers become tomorrow’s enterprises using AWS services. He is part of the analytics specialists field community in AWS and works closely with the WeatherBug Data Engineering team.

Field Notes: Build a Cross-Validation Machine Learning Model Pipeline at Scale with Amazon SageMaker

Post Syndicated from Wei Teh original https://aws.amazon.com/blogs/architecture/field-notes-build-a-cross-validation-machine-learning-model-pipeline-at-scale-with-amazon-sagemaker/

When building a machine learning algorithm, such as a regression or classification algorithm, a common goal is to produce a generalized model, that is, one that performs well on new data that the model has not seen before. Overfitting and underfitting are two fundamental causes of poor performance for machine learning models. A model is overfitted when it performs well on known data, but generalizes poorly on new data. In contrast, an underfit model performs poorly on both training data and new data. A reliable model validation technique gives a better estimate of how a model will perform in practice, and provides insight for training models to achieve the best accuracy.

Cross-validation is a standard model validation technique commonly used for assessing performance of machine learning algorithms. In general, it works by first sampling the dataset into groups of similar sizes, where each group contains a subset of data dedicated for training and model evaluation. After the data has been grouped, a machine learning algorithm will fit and score a model using the data in each group independently. The final score of the model is defined by the average score across all the trained models for performance metric representation.
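The procedure described above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not the implementation used later in the post: `kfold_indices` and `cross_validate` are hypothetical helper names, `fit` and `score` are caller-supplied callables, and the data is split in order without shuffling.

```python
from statistics import mean


def kfold_indices(n_samples: int, k: int):
    """Yield (train_idx, val_idx) pairs; fold sizes differ by at most one."""
    indices = list(range(n_samples))
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        val_idx = indices[start:start + size]
        train_idx = indices[:start] + indices[start + size:]
        yield train_idx, val_idx
        start += size


def cross_validate(X, y, k, fit, score):
    """Fit on each training split, score on the held-out split, and average."""
    scores = []
    for train_idx, val_idx in kfold_indices(len(X), k):
        model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
        scores.append(score(model,
                            [X[i] for i in val_idx],
                            [y[i] for i in val_idx]))
    return mean(scores), scores
```

Each sample lands in exactly one validation split, so every model is scored on data it was not fitted on, and the averaged score is the figure reported for the algorithm.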

Several cross-validation methods are commonly used, including k-fold, stratified k-fold, and leave-p-out. Although there are well-defined data science frameworks that can help simplify cross-validation, such as the Python scikit-learn library, these frameworks are designed to work in a monolithic, single compute environment. When it comes to training machine learning algorithms on large volumes of data, these frameworks become bottlenecks, with limited scalability and reliability.

In this blog post, we are going to walk through the steps for building a highly scalable, high-accuracy, machine learning pipeline, with the k-fold cross-validation method, using Amazon Simple Storage Service (Amazon S3), Amazon SageMaker Pipelines, SageMaker automatic model tuning, and SageMaker training at scale.

Overview of solution

To operate the k-fold cross-validation training pipeline at scale, we built an end-to-end machine learning pipeline using SageMaker native features. This solution implements the k-fold data processing, model training, and model selection processes as individual components to maximize parallelism. The pipeline is orchestrated through SageMaker Pipelines in a distributed manner to achieve scalability and performance efficiency. Let’s dive into the high-level architecture of the solution in the following section.

Figure 1. Solution architecture

The overall solution architecture is shown in Figure 1. There are four main building blocks in the k-fold cross-validation model pipeline:

  1. Preprocessing – Sample and split the entire dataset into k groups.
  2. Model training – Fit the SageMaker training jobs in parallel with hyperparameters optimized through the SageMaker automatic model tuning job.
  3. Model selection – Fit a final model, using the best hyperparameters obtained in step 2, with the entire dataset.
  4. Model registration – Register the final model with SageMaker Model Registry, for model lifecycle management and deployment.

The final output from the pipeline is a model that represents best performance and accuracy for the given dataset. The pipeline can be orchestrated easily using a workflow management tool, such as Pipelines.

Amazon SageMaker is a fully managed service that enables data scientists and developers to develop, train, tune, and deploy machine learning models quickly and at scale. When it comes to choosing the right machine learning and data processing frameworks to solve problems, SageMaker gives you the flexibility to use prebuilt containers bundled with commonly supported machine learning frameworks (such as TensorFlow, PyTorch, and MXNet) or to bring your own container images with custom scripts and libraries that fit your use cases, and to train on the highly available SageMaker model training environment. Additionally, Pipelines enables users to develop complete machine learning workflows using the Python SDK, and manage these workflows in SageMaker Studio.

For simplicity, we will use the public Iris flower data as the train and test dataset to build a multivariate classification model using a linear algorithm (SVM). The pipeline architecture is agnostic to the data and model; hence, it can be modified to adopt a different dataset or algorithm.

Prerequisites

To deploy the solution, you require the following:

  • SageMaker Studio
  • A command line (terminal) that supports building Docker images (for instance, AWS Cloud9)

Solution walkthrough

In this section, we are going to walk through the steps to create a cross-validation model training pipeline using Pipelines. The main components are as follows.

  1. Pipeline parameters
    Pipeline parameters are introduced as variables that allow the predefined values to be overridden at runtime. Pipelines supports the following parameter types: String, Integer, and Float (expressed as ParameterString, ParameterInteger, and ParameterFloat). The following are some examples of the parameters used in the cross-validation model training pipeline:
    • K-Fold – Value of k to be used in k-fold cross-validation
    • ProcessingInstanceCount – Number of instances for SageMaker processing job
    • ProcessingInstanceType – Instance type used for SageMaker processing job
    • TrainingInstanceType – Instance type used for SageMaker training job
    • TrainingInstanceCount – Number of instances for SageMaker training job
  2. Preprocessing

In this step, the original dataset is split into k equal-sized samples. One of the k samples is retained as the validation data for model evaluation, with the remaining k-1 samples to be used as training data. This process is repeated k times, with each of the k samples used as the validation set only one time. The k sample collections are uploaded to an S3 bucket, with the prefix corresponding to an index (0 – k-1) to be identified as the input path to the specified training jobs in the next step of the pipeline. The cross-validation split is submitted as a SageMaker processing job orchestrated through the Pipelines processing step. The processing flow is shown in Figure 2.

Figure 2. K-fold cross-validation: original data is split into k equal-sized samples uploaded to S3 bucket

The following code snippet splits the k-fold dataset in the preprocessing script:

import os

import numpy as np
from sklearn.model_selection import KFold

# base_dir is assumed to be defined elsewhere in the preprocessing script
# (the root directory under which the fold outputs are written).


def save_kfold_datasets(X, y, k):
    """ Splits the datasets (X,y) into k folds and saves the output from
    each fold into separate directories.

    Args:
        X : numpy array represents the features
        y : numpy array represents the target
        k : int value represents the number of folds to split the given datasets
    """

    # Shuffle and split the dataset into k folds.
    kf = KFold(n_splits=k, random_state=23, shuffle=True)

    fold_idx = 0
    for train_index, test_index in kf.split(X, y=y, groups=None):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]

        # Training data for this fold goes under train/<fold index>
        os.makedirs(f'{base_dir}/train/{fold_idx}', exist_ok=True)
        np.savetxt(f'{base_dir}/train/{fold_idx}/train_x.csv', X_train, delimiter=',')
        np.savetxt(f'{base_dir}/train/{fold_idx}/train_y.csv', y_train, delimiter=',')

        # Held-out validation data for this fold goes under test/<fold index>
        os.makedirs(f'{base_dir}/test/{fold_idx}', exist_ok=True)
        np.savetxt(f'{base_dir}/test/{fold_idx}/test_x.csv', X_test, delimiter=',')
        np.savetxt(f'{base_dir}/test/{fold_idx}/test_y.csv', y_test, delimiter=',')
        fold_idx += 1
  1.  Cross-validation training with SageMaker automatic model tuning

In a typical cross-validation training scenario, a chosen algorithm is trained k times, each time with a specific training and validation dataset sampled through the k-fold technique described in the previous step. Traditionally, the cross-validation model training process is performed sequentially on the same server. This method is inefficient and doesn’t scale well for models with large volumes of data. Because all the samples are uploaded to an S3 bucket, we can now run k training jobs in parallel. Each training job consumes the input samples in the bucket location corresponding to the index (between 0 and k-1) given to that job. Additionally, the hyperparameter values must be the same for all k jobs because cross-validation estimates the true out-of-sample performance of a model trained with this specific set of hyperparameters.

Although the cross-validation technique helps generalize the models, hyperparameter tuning for the model is typically performed manually. In this blog post, we take a heuristic approach to finding the best-performing hyperparameters using SageMaker automatic model tuning.

We start by defining a training script that accepts the hyperparameters as input for the specified model algorithm, and then implement the model training and evaluation steps.

The steps involved in the training script are summarized as follows:

    1. Parse hyperparameters from the input.
    2. Fit the model using the parsed hyperparameters.
    3. Evaluate model performance (score).
    4. Save the trained model.
import argparse
import os

# Assumed to be joblib's dump, given the model.joblib file name used below.
from joblib import dump

# train() and evaluate() are defined elsewhere in the training script.
if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    # Hyperparameters supplied by the tuning job.
    parser.add_argument('-c', '--c', type=float, default=1.0)
    parser.add_argument('--gamma', type=float)
    parser.add_argument('--kernel', type=str)
    # SageMaker-specific arguments. Defaults are set in the environment variables.
    parser.add_argument('--output-data-dir', type=str, default=os.environ['SM_OUTPUT_DATA_DIR'])
    parser.add_argument('--model-dir', type=str, default=os.environ['SM_MODEL_DIR'])
    parser.add_argument('--train', type=str, default=os.environ['SM_CHANNEL_TRAIN'])
    parser.add_argument('--test', type=str, default=os.environ.get('SM_CHANNEL_TEST'))
    args = parser.parse_args()
    model = train(train=args.train, test=args.test)
    evaluate(test=args.test, model=model)
    dump(model, os.path.join(args.model_dir, "model.joblib"))

Next, we create a Python script that performs cross-validation model training by submitting k SageMaker training jobs in parallel with the given hyperparameters. Additionally, the script monitors the progress of the training jobs, and calculates the objective metric by averaging the scores across the completed jobs.
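
The fan-out-and-average logic described above can be sketched without any AWS dependency. In this minimal sketch, `run_fold` is a hypothetical stand-in for submitting one SageMaker training job and reading back its score, and the scores it returns are placeholders rather than real results. The actual script would call the SageMaker SDK instead.

```python
from concurrent.futures import ThreadPoolExecutor
from statistics import mean


def run_fold(fold_idx, hyperparameters):
    """Hypothetical stand-in for one training job on fold `fold_idx`.

    A real implementation would launch a training job that reads the
    train/<fold_idx> and test/<fold_idx> prefixes and returns its score.
    """
    # Placeholder scores for illustration only; not real results.
    placeholder_scores = [0.90, 0.94, 0.92, 0.96, 0.93]
    return placeholder_scores[fold_idx % len(placeholder_scores)]


def cross_validate(k, hyperparameters):
    """Run k fold jobs concurrently with identical hyperparameters, then average."""
    with ThreadPoolExecutor(max_workers=k) as pool:
        scores = list(pool.map(lambda i: run_fold(i, hyperparameters), range(k)))
    return mean(scores), scores


if __name__ == "__main__":
    objective, per_fold = cross_validate(
        k=5, hyperparameters={"c": 1.0, "gamma": 0.1, "kernel": "rbf"}
    )
    print(f"per-fold scores: {per_fold}")
    print(f"objective metric (mean): {objective:.3f}")
```

Every fold receives the same hyperparameter dictionary, which mirrors the requirement that all k jobs share one set of hyperparameters.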

Now we create a Python script that uses a SageMaker automatic model tuning job to find the optimal hyperparameters for the trained models. The hyperparameter tuner works by running a specified number of training jobs using the ranges of hyperparameters specified. The number of training jobs and ranges of hyperparameters are given in the input parameters to the script. After the tuning job completes, the objective metrics and the hyperparameters from the best cross-validation model training job are captured and formatted as JSON for use in the next steps of the workflow. Figure 3 illustrates cross-validation training with automatic model tuning.

Figure 3. In cross-validation training step, a SageMaker HyperparameterTuner job invokes n training jobs. The metrics and hyperparameters are captured for downstream processes.

Finally, the training and cross-validation scripts are packaged and built as a custom container image, available for the SageMaker automatic model tuning job for submission. The following code snippet is for building the custom image:

FROM python:3.7
RUN apt-get update && pip install sagemaker boto3 numpy sagemaker-training
COPY cv.py /opt/ml/code/train.py
COPY scikit_learn_iris.py /opt/ml/code/scikit_learn_iris.py
ENV SAGEMAKER_PROGRAM train.py
  1. Model evaluation
    The objective metrics in the cross-validation training and tuning steps define the model quality. To evaluate the model performance, we created a conditional step that compares the metrics against a baseline to determine the next step in the workflow. The following code snippet illustrates the conditional step in detail. Specifically, this step first extracts the objective metric from the evaluation report uploaded in the previous step, and then compares the value with the baseline_model_objective_value provided to the pipeline job. The workflow continues if the model objective metric is greater than or equal to the baseline value, and stops otherwise.
from sagemaker.workflow.conditions import ConditionGreaterThanOrEqualTo
from sagemaker.workflow.condition_step import (
    ConditionStep,
    JsonGet,
)
cond_gte = ConditionGreaterThanOrEqualTo(
    left=JsonGet(
        step=step_cv_train_hpo,
        property_file=evaluation_report,
        json_path="multiclass_classification_metrics.accuracy.value",
    ),
    right=baseline_model_objective_value,
)
step_cond = ConditionStep(
    name="ModelEvaluationStep",
    conditions=[cond_gte],
    if_steps=[step_model_selection, step_register_model],
    else_steps=[],
)
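
For readers who want to see what this condition evaluates, the following plain-Python sketch mimics the lookup and comparison outside of Pipelines. The helper names `json_get` and `should_continue` are hypothetical, and the report contents are illustrative: only the `multiclass_classification_metrics.accuracy.value` path is taken from the snippet above.

```python
import json


def json_get(document, json_path):
    """Walk a dot-separated path such as 'a.b.c' through nested dicts."""
    value = document
    for key in json_path.split("."):
        value = value[key]
    return value


def should_continue(report_text, baseline,
                    json_path="multiclass_classification_metrics.accuracy.value"):
    """Return True when the reported metric is >= the baseline value."""
    metric = json_get(json.loads(report_text), json_path)
    return metric >= baseline


# Hypothetical evaluation report; the 0.95 accuracy is made up for illustration.
example_report = json.dumps(
    {"multiclass_classification_metrics": {"accuracy": {"value": 0.95}}}
)

if __name__ == "__main__":
    print(should_continue(example_report, baseline=0.8))   # workflow continues
    print(should_continue(example_report, baseline=0.99))  # workflow stops
```
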
  1. Model Selection
    At this stage of the pipeline, we’ve completed cross-validation and hyperparameter optimization steps to identify the best performing model trained with the specific hyperparameter values. In this step, we are going to fit a model using the same algorithm used in cross-validation training by providing the entire dataset and the hyperparameters from the best model. The trained model will be used for serving predictions for downstream applications. The following code snippet illustrates a Pipelines training step for model selection:
from sagemaker.inputs import TrainingInput
from sagemaker.workflow.steps import TrainingStep
from sagemaker.sklearn.estimator import SKLearn
sklearn_estimator = SKLearn("scikit_learn_iris.py", 
                           framework_version=framework_version, 
                           instance_type=training_instance_type,
                           py_version='py3', 
                           source_dir="code",
                           output_path=s3_bucket_base_path_output,
                           role=role)
step_model_selection = TrainingStep(
    name="ModelSelectionStep",
    estimator=sklearn_estimator,
    inputs={
        "train": TrainingInput(
            s3_data=f'{step_process.arguments["ProcessingOutputConfig"]["Outputs"][0]["S3Output"]["S3Uri"]}/all',
            content_type="text/csv"
        ),
        "jobinfo": TrainingInput(
            s3_data=f"{s3_bucket_base_path_jobinfo}",
            content_type="application/json"
        )
    }
)
  1. Model registration
    Because the cross-validation model training pipeline evolves, it’s important to have a mechanism for managing the version of model artifacts over time, so that the team responsible for the project can manage the model lifecycle, including tracking, deploying, or rolling back a model based on the version. Building your own model registry, with lifecycle management capabilities, can be complicated and challenging to maintain and operate. SageMaker Model Registry simplifies model lifecycle management by enabling model catalog, versioning, metrics association, model approval workflow, and model deployment automation.

In the final step of the pipeline, we register the trained model with Model Registry by associating the model objective metrics, the model artifact location in the S3 bucket, the estimator object used in the model selection step, model training and inference metadata, and approval status. The following code snippet illustrates the model registry step using ModelMetrics and RegisterModel.

from sagemaker.model_metrics import MetricsSource, ModelMetrics
from sagemaker.workflow.step_collections import RegisterModel
model_metrics = ModelMetrics(
    model_statistics=MetricsSource(
        s3_uri="{}/evaluation.json".format(
            step_cv_train_hpo.arguments["ProcessingOutputConfig"]["Outputs"][0]["S3Output"]["S3Uri"]
        ),
        content_type="application/json",
    )
)
step_register_model = RegisterModel(
    name="RegisterModelStep",
    estimator=sklearn_estimator,
    model_data=step_model_selection.properties.ModelArtifacts.S3ModelArtifacts,
    content_types=["text/csv"],
    response_types=["text/csv"],
    inference_instances=["ml.t2.medium", "ml.m5.xlarge"],
    transform_instances=["ml.m5.xlarge"],
    model_package_group_name=model_package_group_name,
    approval_status=model_approval_status,
    model_metrics=model_metrics,
)

Figure 4 shows a model version registered in SageMaker Model Registry upon a successful pipeline job through Studio.

Figure 4. Model version registered successfully in SageMaker

  1. Putting everything together
    Now that we’ve defined a cross-validation training pipeline, we can track, visualize, and manage the pipeline job directly from within Studio. The following code snippet and Figure 5 depict our pipeline definition:
from sagemaker.workflow.pipeline import Pipeline
from sagemaker.workflow.pipeline_experiment_config import PipelineExperimentConfig
from sagemaker.workflow.execution_variables import ExecutionVariables
pipeline_name = "CrossValidationTrainingPipeline"
pipeline = Pipeline(
    name=pipeline_name,
    parameters=[
        processing_instance_count,
        processing_instance_type,
        training_instance_type,
        training_instance_count,
        inference_instance_type,
        hpo_tuner_instance_type,
        model_approval_status,
        role,
        default_bucket,
        baseline_model_objective_value,
        bucket_prefix,
        image_uri,
        k,
        max_jobs,
        max_parallel_jobs,
        min_c,
        max_c,
        min_gamma,
        max_gamma,
        gamma_scaling_type
    ],    
    pipeline_experiment_config=PipelineExperimentConfig(
      ExecutionVariables.PIPELINE_NAME,
      ExecutionVariables.PIPELINE_EXECUTION_ID),
    steps=[step_process, step_cv_train_hpo, step_cond],
)
Figure 5. SageMaker Pipelines definition shown in SageMaker Studio

Finally, to kick off the pipeline, invoke the pipeline.start() function, with optional parameters specific to the job run:

execution = pipeline.start(
    parameters=dict(
        BaselineModelObjectiveValue=0.8,
        MinimumC=0,
        MaximumC=1
    ))

You can track the pipeline job from within Studio, or use SageMaker application programming interfaces (APIs). Figure 6 shows a screenshot of a pipeline job in progress from Studio.

Figure 6. SageMaker Pipelines job progress shown in SageMaker Studio

Conclusion

In this blog post, we showed you an architecture that orchestrates a complete workflow for cross-validation model training. We implemented the workflow using SageMaker Pipelines, incorporating preprocessing, hyperparameter tuning, model evaluation, model selection, and model registration. The solution addresses the common challenge of orchestrating cross-validation model pipelines at scale. The entire pipeline implementation, including a Jupyter notebook that defines the pipeline, a Dockerfile, and the Python scripts described in this blog post, can be found in the GitHub project.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

Scaling Ad Verification with Machine Learning and AWS Inferentia

Post Syndicated from Julien Simon original https://aws.amazon.com/blogs/aws/scaling-ad-verification-with-machine-learning-and-aws-inferentia/

Amazon Advertising helps companies build their brand and connect with shoppers, through ads shown both within and beyond Amazon’s store, including websites, apps, and streaming TV content in more than 15 countries. Businesses or brands of all sizes including registered sellers, vendors, book vendors, Kindle Direct Publishing (KDP) authors, app developers, and agencies on Amazon marketplaces can upload their own ad creatives, which can include images, video, audio, and of course products sold on Amazon. To promote an accurate, safe, and pleasant shopping experience, these ads must comply with content guidelines.

Here’s a simple example. Can you figure out why two of the following ads would not be compliant?

Amazon Ads

The ad in the center doesn’t feature the product in context. It also shows the same product multiple times. The ad on the right looks much better, but it contains text, which is not allowed for this ad format.

New ad creatives come in many sizes, shapes, and languages, and at very large scale. Assuming it would even be possible, verifying them manually would be a complex, slow, and error-prone process. Machine learning (ML) to the rescue!

Using Machine Learning to Verify Ad Creatives
Each ad must be evaluated against many rules, which no single model could reasonably learn. In fact, it takes many models to check ad properties, for example:

  • Media-specific models that analyze images, video, audio, and text that describe the advertised products.
  • Content-specific models that detect headlines, text, backgrounds, and objects.
  • Language-specific models that validate syntax and grammar, and flag unapproved language.

Some of these capabilities are readily available in AWS AI services. For example, Amazon Advertising teams use Amazon Rekognition to extract metadata information from images and videos.

Other capabilities require custom models trained on in-house datasets. For this purpose, Amazon teams labeled large ad datasets with Amazon SageMaker Ground Truth, using a combination of manual labeling, and automatic labeling with active learning. Using these datasets, teams then used Amazon SageMaker to train models, and deploy them automatically on real-time prediction endpoints with the AWS Cloud Development Kit (AWS CDK) and Amazon SageMaker Pipelines.

When a business uploads a new ad, relevant models are invoked simultaneously to process specific ad components, extract signals, and output a quality score. All scores are then consolidated, and sent to a final model that predicts whether the ad should be manually reviewed.

Thanks to this process, most new ads can be verified and published automatically, which means businesses can quickly promote their brand and products, and Amazon can maintain a high-quality shopping experience.

However, faced with a growing number of more complex models, Amazon Advertising teams started to look for a solution that could increase prediction throughput while reducing costs. They found it in AWS Inferentia.

What is AWS Inferentia?
Available in Amazon EC2 Inf1 instances, AWS Inferentia is a custom chip built by AWS to accelerate ML inference workloads, and optimize their cost. Each AWS Inferentia chip contains four NeuronCores. Each NeuronCore implements a high-performance systolic array matrix multiply engine, which massively speeds up typical deep learning operations such as convolution and transformers. NeuronCores are also equipped with a large on-chip cache, which helps to cut down on external memory accesses, reduce latency, and increase throughput.

Thanks to AWS Neuron, a software development kit for ML inference, AWS Inferentia can be used natively from ML frameworks like TensorFlow, PyTorch, and Apache MXNet. The SDK consists of a compiler, runtime, and profiling tools that enable you to run high-performance, low-latency inference. For many trained models, compilation is a one-liner with the Neuron SDK and requires no additional application code changes. The result is a high-performance inference deployment that can easily scale while keeping costs under control. You’ll find many examples in the Neuron documentation. Alternatively, thanks to Amazon SageMaker Neo, you can also compile models directly in SageMaker.

Scaling Ad Verification with AWS Inferentia
Amazon Advertising teams started compiling their models for Inferentia, and deploying them on SageMaker endpoints powered by Inf1 instances. They compared the Inf1 endpoints to the GPU endpoints they had been using so far. They found that large deep learning models like BERT run more effectively on Inferentia, which decreases latency by 30%, and reduces costs by 71%. A few months ago, ML teams working on Amazon Alexa came to the same conclusions.

What about prediction quality? GPU models are typically trained with single-precision floating-point data (FP32). Inferentia uses the shorter FP16, BF16, and INT8 data types, which can create slight differences in predicted output. Running both GPU and Inferentia models in parallel, teams analyzed probability distributions, tweaked prediction thresholds for their Inferentia models, and made sure that these models would predict ads just like GPU models did. You can learn more about these techniques in the Performance Tuning section of the documentation.
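
To make the effect of shorter data types concrete, here is a minimal sketch using only the Python standard library. It approximates BF16 by keeping the top 16 bits of the FP32 encoding (plain truncation, whereas hardware typically rounds to nearest), then picks the decision threshold for a reduced-precision model that best reproduces a reference model's decisions. The function names and all scores are hypothetical, and this is not the procedure the Amazon Advertising teams used, which the post does not detail.

```python
import struct


def to_bf16(x):
    """Approximate BF16 by zeroing the low 16 bits of the FP32 encoding.

    This truncates toward zero; treat it as an illustration of reduced
    precision only, not as an exact model of any accelerator.
    """
    packed = struct.pack(">f", x)  # 4-byte big-endian FP32
    return struct.unpack(">f", packed[:2] + b"\x00\x00")[0]


def agreement(reference_scores, candidate_scores, ref_threshold, cand_threshold):
    """Fraction of items where both models make the same accept/reject call."""
    same = sum(
        (r >= ref_threshold) == (c >= cand_threshold)
        for r, c in zip(reference_scores, candidate_scores)
    )
    return same / len(reference_scores)


def calibrate_threshold(reference_scores, candidate_scores, ref_threshold, candidates):
    """Pick the candidate threshold that best reproduces the reference decisions."""
    return max(
        candidates,
        key=lambda t: agreement(reference_scores, candidate_scores, ref_threshold, t),
    )


if __name__ == "__main__":
    fp32_scores = [0.12, 0.499, 0.503, 0.87]  # made-up scores
    print([to_bf16(s) for s in fp32_scores])
```

Comparing decisions rather than raw scores is the point of the sketch: small numeric differences matter only when they move a score across the threshold.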

With these final adjustments out of the way, the Amazon Advertising teams started phasing out GPU models. All text data is now predicted on Inferentia, and the migration of computer vision pipelines is in progress.

AWS Customers Are Successful with AWS Inferentia
In addition to Amazon teams, customers also report very nice results on scaling and optimizing their ML workloads with Inferentia.

Binghui Ouyang, Senior Data Scientist at Autodesk: “Autodesk is advancing the cognitive technology of our AI-powered virtual assistant, Autodesk Virtual Agent (AVA) by using Inferentia. AVA answers over 100,000 customer questions per month by applying natural language understanding (NLU) and deep learning techniques to extract the context, intent, and meaning behind inquiries. Piloting Inferentia, we are able to obtain a 4.9x higher throughput over G4dn for our NLU models, and look forward to running more workloads on the Inferentia-based Inf1 instances.”

Paul Fryzel, Principal Engineer, AI Infrastructure at Condé Nast: “Condé Nast’s global portfolio encompasses over 20 leading media brands, including Wired, Vogue, and Vanity Fair. Within a few weeks, our team was able to integrate our recommendation engine with AWS Inferentia chips. This union enables multiple runtime optimizations for state-of-the-art natural language models on SageMaker’s Inf1 instances. As a result, we observed a 72% reduction in cost than the previously deployed GPU instances.”

Getting Started
You can get started with Inferentia and Inf1 instances today, either on Amazon SageMaker or with the Neuron SDK. This self-paced workshop walks you through both options.

Give it a try, and let us know what you think. As always, we look forward to your feedback. You can send it through your usual AWS Support contacts, post it on the AWS Forum for SageMaker, or on the Neuron SDK GitHub repository.

– Julien

Emerging Solutions for Operations Research on AWS

Post Syndicated from Randy DeFauw original https://aws.amazon.com/blogs/architecture/emerging-solutions-for-operations-research-on-aws/

Operations research (OR) uses mathematical and analytical tools to arrive at optimal solutions for complex business problems like workforce scheduling. The mathematical techniques used to solve these problems, such as linear programming and mixed-integer programming, require the use of optimization software (solvers). There are several popular and powerful solvers available, ranging from commercial options like IBM CPLEX to open-source packages like ORTools. While these solvers incorporate decades of algorithmic expertise and can solve large and complex problems effectively, they have some scalability limitations.

In this post, we’ll describe three alternatives that you can consider for solving OR problems (see Figure 1). None of these are as general purpose as traditional solvers, but they should be on your “emerging technologies” radar.

Figure 1. OR optimization options

These include:

  1. A traditional solver running on a compute platform
  2. Reinforcement and machine learning (ML) algorithms running on Amazon SageMaker
  3. A quantum computing algorithm running on Amazon Braket. Experiments are collected in Amazon DynamoDB and the results are visualized in Amazon Elasticsearch Service.

A reference problem and solution

Let’s start with a reference problem and solve it with a traditional solver. We’ll tackle an inventory management issue (see Figure 2). We have a sales depot that supplies products for local sales outlets. For the depot’s Region, there are seven weeks of historical sales data for each product. We also know how much each product costs and for how much it can be sold. Finally, we know the overall weekly capacity of the depot. This depends on logistical constraints like the size of the warehouse and transportation availability. This scenario is loosely based on the Grupo Bimbo retailer’s Kaggle competition and dataset.

Figure 2. Sales depot inventory management scenario

Our job is to place an inventory order to restock our sales depot each week. We quantify our work through a reward function. We want to maximize our revenue:

revenue = (sale price * number of units sold)

(Note that the sample dataset does not include cost of goods sold, only sale price.)

We use these constraints:

total units sold <= depot capacity
0 <= quantity sold of any given item <= forecasted demand for that item
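
As a rough illustration of this formulation, the sketch below solves it in plain Python. It relies on an observation specific to these two constraints: every unit takes exactly one slot of depot capacity, so filling the depot with the highest-priced items first is optimal. This is not the ORTools model used in the post, and the function names, prices, demand figures, and capacity are hypothetical.

```python
def plan_order(prices, demand, capacity):
    """Maximize sum(price * units) subject to the two constraints above.

    Greedy by sale price is optimal here only because each unit consumes
    the same amount of capacity; a general solver is needed otherwise.
    """
    order = [0] * len(prices)
    remaining = capacity
    # Visit items from highest to lowest sale price.
    for i in sorted(range(len(prices)), key=lambda i: prices[i], reverse=True):
        order[i] = min(demand[i], remaining)
        remaining -= order[i]
        if remaining == 0:
            break
    return order


def revenue(prices, units):
    """revenue = sale price * number of units sold, summed over items."""
    return sum(p * u for p, u in zip(prices, units))


if __name__ == "__main__":
    prices = [4.0, 10.0, 7.5]   # hypothetical sale prices
    demand = [30, 20, 40]       # hypothetical forecasted demand
    order = plan_order(prices, demand, capacity=50)
    print(order, revenue(prices, order))
```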

There are many possible solutions to this problem. Using ORTools, we get an average reward (profit) of about $5,700, in about 1,000 simulations.

We can make the scenario slightly more realistic by acknowledging that our sales forecasts are not perfect. After we get the solution from the solver, we can penalize the reward (profit) by subtracting the cost of unsold goods. With this approach, we get a reward of about $2,450.
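
A minimal sketch of that penalized reward follows. Because the sample dataset has no cost of goods, the per-unit cost of unsold stock is passed in as an explicit, assumed input (`unit_costs`); the function name and all numbers are hypothetical.

```python
def penalized_reward(prices, unit_costs, ordered, actual_demand):
    """Revenue from units actually sold, minus the cost of units left unsold."""
    reward = 0.0
    for price, cost, units, demand in zip(prices, unit_costs, ordered, actual_demand):
        sold = min(units, demand)   # cannot sell more than was ordered or demanded
        unsold = units - sold       # stock left over when the forecast was too high
        reward += price * sold - cost * unsold
    return reward


if __name__ == "__main__":
    # Hypothetical week: the first item was over-forecast by 5 units.
    print(penalized_reward(
        prices=[10.0, 7.5],
        unit_costs=[6.0, 5.0],
        ordered=[20, 30],
        actual_demand=[15, 30],
    ))
```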

Solving OR problems with reinforcement learning

An alternative approach to the traditional solver is reinforcement learning (RL). RL is a field of ML that handles problems where the right answer is not immediately known, like playing a game of chess. RL fits our sales depot scenario, because we don’t know how well we will do until after we place the order and are able to view a week of sales activity.

Our sales depot problem resembles a knapsack problem. This is a common OR pattern where we want to fill a container (in this case, our sales depot) with as many items as possible until capacity is reached. Each item has a value (sales price) and a weight (cost). In RL we have to translate this into an observation space, an action space, a state, and a reward (see Figure 3).

The observation space is what our purchasing agent sees. This includes our depot capacity, the sales price, and the forecasted demand. The action space is what our agent can do. In the simplest case, it’s the number of each item to order for the depot, each week. The state is what the agent sees right now, and we model that as the sales results from last week. Finally, the reward function is our profit equation.

One important distinction between OR solvers and RL is that we can’t easily enforce hard constraints in RL. We can limit the amount of an individual product we purchase each week, but we can’t enforce an overall limit on the number of items purchased. We may exceed the capacity of our depot. The simplest way to handle that is to enforce a penalty. There are more sophisticated techniques available, such as interpreting our action as the percentage of budget to spend on each item. But let’s illustrate the simple case here.
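
The translation described above can be sketched as a small, dependency-free environment with a gym-style `reset`/`step` interface. This is an illustrative toy, not the environment used with Ray RLLib in the post: the class name, the noise model for demand, the penalty rate, and the episode length are all assumptions.

```python
import random


class DepotEnv:
    """Toy weekly restocking environment with a gym-style reset/step interface."""

    def __init__(self, prices, forecast, capacity, max_per_item,
                 overflow_penalty=1.0, noise=0.2, horizon=7, seed=0):
        self.prices = prices
        self.forecast = forecast
        self.capacity = capacity
        self.max_per_item = max_per_item          # per-item limit we can enforce
        self.overflow_penalty = overflow_penalty  # assumed cost per unit over capacity
        self.noise = noise                        # assumed forecast error (fraction)
        self.horizon = horizon                    # assumed weeks per episode
        self.rng = random.Random(seed)
        self.reset()

    def _observation(self):
        # Observation: capacity, prices, forecast, plus the state (last week's sales).
        return {
            "capacity": self.capacity,
            "prices": list(self.prices),
            "forecast": list(self.forecast),
            "last_sales": list(self.last_sales),
        }

    def reset(self):
        self.week = 0
        self.last_sales = [0] * len(self.prices)
        return self._observation()

    def step(self, action):
        """Apply one weekly order; `action` holds the units to order per item."""
        # Hard per-item bound: clip each quantity into [0, max_per_item].
        order = [max(0, min(int(a), self.max_per_item)) for a in action]
        # Actual demand deviates from the forecast by up to +/- noise.
        demand = [
            max(0, round(f * (1 + self.rng.uniform(-self.noise, self.noise))))
            for f in self.forecast
        ]
        sold = [min(o, d) for o, d in zip(order, demand)]
        revenue = sum(p * s for p, s in zip(self.prices, sold))
        # Soft overall constraint: penalize units ordered beyond depot capacity.
        overflow = max(0, sum(order) - self.capacity)
        reward = revenue - self.overflow_penalty * overflow
        self.last_sales = sold
        self.week += 1
        done = self.week >= self.horizon
        return self._observation(), reward, done, {"overflow": overflow}
```

The per-item limit is enforced by clipping the action, while the depot capacity appears only as a penalty term in the reward, which is the distinction the paragraph above draws.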

Using an RL algorithm from the Ray RLLib package, our reward was $7,000 on average, including penalties for ordering too much of any given item.

Figure 3. Translating OR problem to RL

Solving OR problems with machine learning

It’s possible to model a knapsack problem using ML rather than RL in some cases, and there are simple reference implementations available. The design assumes that we know, or can accurately estimate the reward for a given week. With our simple scenario, we can compute the reward using estimates of future sales. We can use this in a custom loss function to train a neural network.

Solving OR problems with quantum computing

Quantum computers are fundamentally different from the computers most of us use. The appeal of quantum computers is that they can tackle some types of problems much more efficiently than standard computers. Quantum computers can, in theory, solve prime number factoring for decryption orders of magnitude faster than a standard computer. But they are still in their infancy and limited in the size of problem they can handle, due to hardware limitations.

D-Wave Systems, which makes some of the types of quantum computers available through Amazon Braket, has a solver called QBSolv. QBSolv works on a specific type of optimization problem called quadratic unconstrained binary optimization (QUBO). It breaks large problems into smaller pieces that a quantum computer can handle. There is a reference pattern for translating a knapsack problem to a QUBO problem.
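
To give a feel for what such a translation involves, here is a minimal sketch of one common way to express a 0/1 knapsack as a QUBO, using binary slack bits to turn the capacity inequality into a penalized equality. It is not necessarily the reference pattern mentioned above, it does not call QBSolv or Amazon Braket, and it assumes non-negative integer values and weights. A brute-force search stands in for the solver, which only works for tiny instances.

```python
from itertools import product


def knapsack_qubo(values, weights, capacity):
    """Build QUBO coefficients for a 0/1 knapsack with integer weights.

    Variables: one bit per item, then slack bits s_j worth 2**j units of
    unused capacity. The energy being minimized is
        -sum(v_i * x_i) + P * (sum(w_i * x_i) + sum(2**j * s_j) - capacity)**2
    """
    n_slack = max(1, capacity.bit_length())
    coeffs = list(weights) + [2 ** j for j in range(n_slack)]
    gains = list(values) + [0] * n_slack
    penalty = sum(values) + 1  # any violation costs more than all items are worth
    q = {}
    for k, a_k in enumerate(coeffs):
        # z * z == z for a binary z, so squared and linear terms share the diagonal.
        q[(k, k)] = penalty * (a_k * a_k - 2 * capacity * a_k) - gains[k]
        for l in range(k + 1, len(coeffs)):
            q[(k, l)] = 2 * penalty * a_k * coeffs[l]
    offset = penalty * capacity * capacity
    return q, offset


def energy(q, offset, bits):
    """Evaluate the QUBO objective for one assignment of the binary variables."""
    return offset + sum(c * bits[k] * bits[l] for (k, l), c in q.items())


def brute_force_minimum(q, offset, n_vars):
    """Exhaustive search over all assignments; feasible only for tiny instances."""
    return min(product((0, 1), repeat=n_vars), key=lambda bits: energy(q, offset, bits))


if __name__ == "__main__":
    values, weights, capacity = [6, 10, 12], [1, 2, 3], 5  # made-up instance
    q, offset = knapsack_qubo(values, weights, capacity)
    n_vars = len(values) + max(1, capacity.bit_length())
    best = brute_force_minimum(q, offset, n_vars)
    print("chosen items:", best[:len(values)], "energy:", energy(q, offset, best))
```

The slack bits add variables beyond the items themselves, which hints at why problem size becomes a concern quickly on hardware with a limited number of qubits.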

Running the sales depot problem through QBSolv on Amazon Braket and using a subset of the data, I was able to obtain a reward of $900. When I tried to run on the full dataset, I was not able to complete the decomposition step, likely due to a hardware limitation.

Conclusion

In this blog post, I reviewed OR problems and traditional OR solvers. I then discussed three alternative approaches: RL, ML, and quantum computing. Each of these alternatives has drawbacks and none is a general-purpose replacement for traditional OR solvers.

However, RL and ML are potentially more scalable because you can train those solutions on a cluster of machines, rather than running an OR solver on a single machine. RL agents can also learn from experience, giving them flexibility to handle scenarios that may be difficult to incorporate into an OR solver. Quantum computing solutions are promising but the current state of the art for quantum computers limits their application to small-scale problems at the moment. All of these alternatives can potentially derive a solution more quickly than an OR solver.

How to Accelerate Building a Lake House Architecture with AWS Glue

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/how-to-accelerate-building-a-lake-house-architecture-with-aws-glue/

Customers are building databases, data warehouses, and data lake solutions in isolation from each other, each having its own separate data ingestion, storage, management, and governance layers. Often these disjointed efforts to build separate data stores end up creating data silos, data integration complexities, excessive data movement, and data consistency issues. These issues are preventing customers from getting deeper insights. To overcome these issues and easily move data around, a Lake House approach on AWS was introduced.

In this blog post, we illustrate the AWS Glue integration components that you can use to accelerate building a Lake House architecture on AWS. We will also discuss how to derive persona-centric insights from your Lake House using AWS Glue.

Components of the AWS Glue integration system

AWS Glue is a serverless data integration service that facilitates the discovery, preparation, and combination of data. It can be used for analytics, machine learning, and application development. AWS Glue provides all of the capabilities needed for data integration, so you can start analyzing your data and putting it to use in minutes, rather than months.

The following diagram illustrates the various components of the AWS Glue integration system.

Figure 1. AWS Glue integration components

Connect – AWS Glue allows you to connect to various data sources anywhere

Glue connector: AWS Glue provides built-in support for the most commonly used data stores. You can use Amazon Redshift, Amazon RDS, Amazon Aurora, Microsoft SQL Server, MySQL, MongoDB, or PostgreSQL using JDBC connections. AWS Glue also allows you to use custom JDBC drivers in your extract, transform, and load (ETL) jobs. For data stores that are not natively supported such as SaaS applications, you can use connectors. You can also subscribe to several connectors offered in the AWS Marketplace.

Glue crawlers: You can use a crawler to populate the AWS Glue Data Catalog with tables. A crawler can crawl multiple data stores in a single pass. Upon completion, the crawler creates or updates one or more tables in your Data Catalog. Extract, transform, and load (ETL) jobs that you define in AWS Glue use these Data Catalog tables as sources and targets.

Catalog – AWS Glue simplifies data discovery and governance

Glue Data Catalog: The Data Catalog serves as the central metadata catalog for the entire data landscape.

Glue Schema Registry: The AWS Glue Schema Registry allows you to centrally discover, control, and evolve data stream schemas. With AWS Glue Schema Registry, you can manage and enforce schemas on your data streaming applications.

Data quality – AWS Glue helps you author and monitor data quality rules

Glue DataBrew: AWS Glue DataBrew allows data scientists and data analysts to clean and normalize data. You can use a visual interface, reducing the time it takes to prepare data by up to 80%. With Glue DataBrew, you can visualize, clean, and normalize data directly from your data lake, data warehouses, and databases.

Curate data: You can use either Glue development endpoint or AWS Glue Studio to curate your data.

AWS Glue development endpoint is an environment that you can use to develop and test your AWS Glue scripts. You can choose either Amazon SageMaker notebook or Apache Zeppelin notebook as an environment.

AWS Glue Studio is a new visual interface for AWS Glue that supports extract-transform-and-load (ETL) developers. You can author, run, and monitor AWS Glue ETL jobs. You can now use a visual interface to compose jobs that move and transform data, and run them on AWS Glue.

AWS Data Exchange makes it easy for AWS customers to securely exchange and use third-party data in AWS. This is for data providers who want to structure their data across multiple datasets or enrich their products with additional data. You can publish additional datasets to your products using the AWS Data Exchange.

Deequ is an open-source data quality library developed internally at Amazon. It provides multiple features such as automatic constraint suggestions and verification, metrics computation, and data profiling.

Build a Lake House architecture faster, using AWS Glue

Figure 2 illustrates how you can build a Lake House using AWS Glue components.

Figure 2. Building Lake House architectures with AWS Glue

The architecture flow follows these general steps:

  1. Glue crawlers scan the data from various data sources and populate the Data Catalog for your Lake House.
  2. The Data Catalog serves as the central metadata catalog for the entire data landscape.
  3. Once data is cataloged, fine-grained access control is applied to the tables through AWS Lake Formation.
  4. Curate your data with business and data quality rules by using Glue Studio, Glue development endpoints, or Glue DataBrew. Place transformed data in a curated Amazon S3 bucket for purpose-built analytics downstream.
  5. Facilitate data movement with AWS Glue to and from your data lake, databases, and data warehouse by using Glue connections. Use AWS Glue Elastic Views to replicate the data across the Lake House.

Derive persona-centric insights from your Lake House using AWS Glue

Many organizations want to gather observations from increasingly larger volumes of acquired data. These insights help them make data-driven decisions with speed and agility. They must use a central data lake, a ring of purpose-built data services, and data warehouses based on persona or job function.

Figure 3 illustrates the Lake House inside-out data movement with AWS Glue DataBrew, Amazon Athena, Amazon Redshift, and Amazon QuickSight to perform persona-centric data analytics.

Figure 3. Lake House persona-centric data analytics using AWS Glue

This shows how Lake House components serve various personas in an organization:

  1. Data ingestion: Data is ingested to Amazon Simple Storage Service (S3) from different sources.
  2. Data processing: Data curators and data scientists use DataBrew to validate, clean, and enrich the data. Amazon Athena is also used to run ad hoc queries to analyze the data in the lake. The transformation is shared with data engineers to set up batch processing.
  3. Batch data processing: Data engineers or developers set up batch jobs in AWS Glue and AWS Glue DataBrew. Jobs can be initiated by an event, or can be scheduled to run periodically.
  4. Data analytics: Data and business analysts can now analyze the prepared dataset in Amazon Redshift or in Amazon S3 using Athena.
  5. Data visualizations: Business analysts can create visuals in QuickSight. Data curators can enrich data from multiple sources. Admins can enforce security and data governance. Developers can embed QuickSight dashboards in applications.

Conclusion

Using a Lake House architecture will help you get persona-centric insights quickly from all of your data based on user role or job function. In this blog post, we describe several AWS Glue components and AWS purpose-built services that you can use to build Lake House architectures on AWS. We have also presented persona-centric Lake House analytics architecture using AWS Glue, to help you derive insights from your Lake House.

Read more and get started on building Lake House Architectures on AWS.

How MEDHOST’s cardiac risk prediction successfully leveraged AWS analytic services

Post Syndicated from Pandian Velayutham original https://aws.amazon.com/blogs/big-data/how-medhosts-cardiac-risk-prediction-successfully-leveraged-aws-analytic-services/

MEDHOST has been providing products and services to healthcare facilities of all types and sizes for over 35 years. Today, more than 1,000 healthcare facilities are partnering with MEDHOST and enhancing their patient care and operational excellence with its integrated clinical and financial EHR solutions. MEDHOST also offers a comprehensive Emergency Department Information System with business and reporting tools. Since 2013, MEDHOST’s cloud solutions have been utilizing Amazon Web Services (AWS) infrastructure, data source, and computing power to solve complex healthcare business cases.

MEDHOST can utilize the data available in the cloud to provide value-added solutions for hospitals solving complex problems, like predicting sepsis, cardiac risk, and length of stay (LOS) as well as reducing re-admission rates. This requires a solid foundation of data lake and elastic data pipeline to keep up with multi-terabyte data from thousands of hospitals. MEDHOST has invested a significant amount of time evaluating numerous vendors to determine the best solution for its data needs. Ultimately, MEDHOST designed and implemented machine learning/artificial intelligence capabilities by leveraging AWS Data Lab and an end-to-end data lake platform that enables a variety of use cases such as data warehousing for analytics and reporting.

Getting started

MEDHOST’s initial objectives in evaluating vendors were to:

  • Build a low-cost data lake solution to provide cardiac risk prediction for patients based on health records
  • Provide an analytical solution for hospital staff to improve operational efficiency
  • Implement a proof of concept to extend to other machine learning/artificial intelligence solutions

The AWS team proposed AWS Data Lab to architect, develop, and test a solution to meet these objectives. The collaborative relationship between AWS and MEDHOST, AWS’s continuous innovation, excellent support, and technical solution architects helped MEDHOST select AWS over other vendors and products. AWS Data Lab’s well-structured engagement helped MEDHOST define clear, measurable success criteria that drove the implementation of the cardiac risk prediction and analytical solution platform. The MEDHOST team consisted of architects, builders, and subject matter experts (SMEs). By connecting MEDHOST experts directly to AWS technical experts, the MEDHOST team gained a quick understanding of industry best practices and available services, allowing the MEDHOST team to achieve most of the success criteria by the end of a four-day design session. MEDHOST is now in the process of moving this work from its lower to upper environment to make the solution available for its customers.

Solution

For this solution, MEDHOST and AWS built a layered pipeline consisting of ingestion, processing, storage, analytics, machine learning, and reinforcement components. The following diagram illustrates the Proof of Concept (POC) that was implemented during the four-day AWS Data Lab engagement.

Ingestion layer

The ingestion layer is responsible for moving data from hospital production databases to the landing zone of the pipeline.

The hospital data was stored in an Amazon RDS for PostgreSQL instance and moved to the landing zone of the data lake using AWS Database Migration Service (DMS). DMS made migrating databases to the cloud simple and secure. Using its ongoing replication feature, MEDHOST and AWS implemented change data capture (CDC) quickly and efficiently so the MEDHOST team could spend more time focusing on the most interesting parts of the pipeline.

Processing layer

The processing layer was responsible for performing extract, transform, load (ETL) on the data to curate it for subsequent uses.

MEDHOST used AWS Glue within its data pipeline for crawling its data layers and performing ETL tasks. The hospital data copied from RDS to Amazon S3 was cleaned, curated, enriched, denormalized, and stored in parquet format to act as the heart of the MEDHOST data lake and a single source of truth to serve any further data needs. During the four-day Data Lab, MEDHOST and AWS targeted two needs: powering MEDHOST’s data warehouse used for analytics and feeding training data to the machine learning prediction model. Data curation is a critical task that requires an SME, and it presented multiple challenges. AWS Glue’s serverless nature, along with the SME’s support during the Data Lab, made developing the required transformations cost efficient and uncomplicated. Scaling and cluster management were addressed by the service, which allowed the developers to focus on cleaning data coming from homogenous hospital sources and translating the business logic to code.

Storage layer

The storage layer provided low-cost, secure, and efficient storage infrastructure.

MEDHOST used Amazon S3 as a core component of its data lake. AWS DMS migration tasks saved data to S3 in .CSV format. Crawling data with AWS Glue made this landing zone data queryable and available for further processing. The initial AWS Glue ETL job stored the parquet formatted data to the data lake and its curated zone bucket. MEDHOST also used S3 to store the .CSV formatted data set that will be used to train, test, and validate its machine learning prediction model.

Analytics layer

The analytics layer gave the MEDHOST pipeline reporting and dashboarding capabilities.

The data was in parquet format and partitioned in the curation zone bucket populated by the processing layer. This made querying with Amazon Athena or Amazon Redshift Spectrum fast and cost efficient.

From the Amazon Redshift cluster, MEDHOST created external tables that were used as staging tables for the MEDHOST data warehouse and implemented UPSERT logic to merge new data into its production tables. To showcase the reporting potential that was unlocked by the MEDHOST analytics layer, a connection was made from the Redshift cluster to Amazon QuickSight. Within minutes, MEDHOST was able to create interactive analytics dashboards with filtering and drill-down capabilities, such as a chart that showed the number of confirmed disease cases per US state.
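
The UPSERT step can be sketched as follows. This is a minimal illustration, not MEDHOST's actual code: it assumes the common Redshift merge pattern of deleting matching rows and re-inserting from the staging table inside one transaction, and the table and key names are hypothetical.

```python
from typing import List


def build_upsert_statements(target: str, staging: str, key: str) -> List[str]:
    """Return SQL statements that merge staging rows into the target table.

    Target rows that share a key with a staging row are deleted, then every
    staging row is inserted, all inside a single transaction.
    Identifiers are interpolated directly, so they must come from trusted code.
    """
    return [
        "BEGIN;",
        f"DELETE FROM {target} USING {staging} WHERE {target}.{key} = {staging}.{key};",
        f"INSERT INTO {target} SELECT * FROM {staging};",
        "COMMIT;",
    ]


# Hypothetical names used only for illustration.
statements = build_upsert_statements(
    target="warehouse.patient_visits",
    staging="spectrum.patient_visits_staging",
    key="visit_id",
)
for statement in statements:
    print(statement)
```

The statements would be run in order over a single database connection so that readers never observe the table between the delete and the insert.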

Machine learning layer

The machine learning layer used MEDHOST’s existing data sets to train its cardiac risk prediction model and make it accessible via an endpoint.

Before getting into Data Lab, the MEDHOST team was not intimately familiar with machine learning. AWS Data Lab architects helped MEDHOST quickly understand concepts of machine learning and select a model appropriate for its use case. MEDHOST selected XGBoost as its model because cardiac risk prediction falls within regression techniques. MEDHOST’s well-architected data lake enabled it to quickly generate training, testing, and validation data sets using AWS Glue.

Amazon SageMaker abstracted the underlying complexity of setting up infrastructure for machine learning. With a few clicks, MEDHOST started a Jupyter notebook and coded the components leading to fitting and deploying its machine learning prediction model. Finally, MEDHOST created the endpoint for the model and ran REST calls to validate the endpoint and trained model. As a result, MEDHOST achieved the goal of predicting cardiac risk. Additionally, with Amazon QuickSight’s SageMaker integration, AWS made it easy to use SageMaker models directly in visualizations. QuickSight can call the model’s endpoint, send the input data to it, and put the inference results into the existing QuickSight data sets. This capability made it easy to display the results of the models directly in the dashboards. Read more about QuickSight’s SageMaker integration here.

Reinforcement layer

Finally, the reinforcement layer guaranteed that the results of the MEDHOST model were captured and processed to improve performance of the model.

The MEDHOST team went beyond the original goal and created an inference microservice to interact with the endpoint for prediction, enabled abstracting of the machine learning endpoint with the well-defined domain REST endpoint, and added a standard security layer to the MEDHOST application.

When there is a real-time call from the facility, the inference microservice gets an inference from the SageMaker endpoint. Records containing input and inference data are fed to the data pipeline again. MEDHOST used Amazon Kinesis Data Streams to push records in real time. However, since retraining the machine learning model does not need to happen in real time, Amazon Kinesis Data Firehose enabled MEDHOST to micro-batch records and efficiently save them to the landing zone bucket so that the data could be reprocessed.
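
As a rough sketch of the feedback step, the microservice could bundle each input with its inference and push the pair to a Kinesis data stream. The record layout, stream name, and partition key below are assumptions for illustration; only the `put_record` call and its `StreamName`, `Data`, and `PartitionKey` parameters come from the Kinesis API.

```python
import json
from datetime import datetime, timezone


def build_feedback_record(features: dict, prediction: float) -> dict:
    """Bundle model input and output so the pair can be replayed for retraining."""
    return {
        "features": features,
        "prediction": prediction,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
    }


def push_feedback(kinesis_client, stream_name: str, record: dict, partition_key: str) -> dict:
    """Send one feedback record to a Kinesis data stream as UTF-8 JSON."""
    return kinesis_client.put_record(
        StreamName=stream_name,
        Data=json.dumps(record).encode("utf-8"),
        PartitionKey=partition_key,
    )
```

In a deployment, `kinesis_client` would typically be `boto3.client("kinesis")`. Passing the client in as an argument keeps the function easy to exercise with a stub.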

Conclusion

Collaborating with AWS Data Lab enabled MEDHOST to:

  • Store a single source of truth in a low-cost storage solution (data lake)
  • Complete a data pipeline for a low-cost data analytics solution
  • Create almost production-ready code for cardiac risk prediction

The MEDHOST team learned many concepts related to data analytics and machine learning within four days. AWS Data Lab truly helped MEDHOST deliver results in an accelerated manner.


About the Authors

Pandian Velayutham is the Director of Engineering at MEDHOST. His team is responsible for delivering cloud solutions, integration and interoperability, and business analytics solutions. MEDHOST utilizes a modern technology stack to provide innovative solutions to its customers. Pandian Velayutham is a technology evangelist and public cloud technology speaker.

George Komninos is a Data Lab Solutions Architect at AWS. He helps customers convert their ideas to a production-ready data product. Before AWS, he spent 3 years at Alexa Information domain as a data engineer. Outside of work, George is a football fan and supports the greatest team in the world, Olympiacos Piraeus.

Building a Data Pipeline for Tracking Sporting Events Using AWS Services

Post Syndicated from Ashwini Rudra original https://aws.amazon.com/blogs/architecture/building-a-data-pipeline-for-tracking-sporting-events-using-aws-services/

In an evolving world that is increasingly connected, data-centric, and fast-paced, the sports industry is no exception. Amazon Web Services (AWS) has been helping customers in the sports industry gain real-time insights through analytics. You can re-invent and reimagine the fan experience by tracking sports actions and activities. In this blog post, we will highlight common architectural and design patterns for building a data pipeline to track sporting events in real time.

The sports industry is largely comprised of two subsegments: participatory and spectator sports. Participatory sports, for example fitness, golf, boating, and skiing, comprise the largest share of the market. Spectator sports, such as teams/clubs/leagues, individual sports, and racing, are expected to be the fastest growing segment. Sports teams/leagues/clubs comprise the largest share of the spectator sports segment, and are growing most rapidly.

IoT data pipeline architecture overview

Let’s discuss the infrastructure in three parts:

  1. Infrastructure at the arena itself
  2. Processing data using AWS services
  3. Leveraging this analysis using a graphics overlay (this can be especially useful for broadcasters, OTT channels, and arena users)

Data-gathering devices

Radio-frequency identification (RFID) chips or IoT devices can be worn by players or embedded in the playing equipment. These devices emit 20–50 messages per second. These messages are collected and output using JSON. This information may include player coordinate positions, player speed, statistics, health information, or more. To process the game, leagues, coaches, or broadcasters can analyze this data using analytics tools and/or machine learning.

Figure 1. Data pipeline architecture using AWS Services

Processing data, feature engineering, and model training at AWS

Use serverless services from AWS when possible in order to keep your solution scalable and cost-efficient. This also reduces operational overhead for teams. You can use the Kinesis family of services for stream ingestion and processing. The streaming data from hundreds to thousands of IoT sources (from equipment and clothing) can be fed to Amazon Kinesis Data Streams (KDS). KDS and Amazon Kinesis Data Firehose provide a buffering mechanism for streaming data before it lands on Amazon Simple Storage Service (S3). With Amazon Kinesis Data Analytics, you can process and analyze Kinesis stream data using powerful SQL, Apache Flink, or Apache Beam. Kinesis Data Analytics also supports building applications in SQL, Java, Scala, and Python. With this service, you can quickly author and run powerful SQL code against Amazon Kinesis Data Streams as your source. This way you can perform time series analytics, feed real-time dashboards, and create real-time metrics. Read more about Amazon Kinesis Data Analytics for SQL Applications.

You might want to transform or enhance the streaming data before it is delivered to Amazon S3. Amazon Kinesis Data Firehose can be used with an AWS Lambda function to do the transformation. Let’s say you have a player prediction timestamp that you want to represent in a different time format to different ML algorithms. Lambda can process and transform this data. Kinesis Data Firehose will deliver the transformed and raw data to the destination (Amazon S3). This occurs when the specified buffer size or the buffering interval is reached, whichever happens first.
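
A transformation function of this kind might look like the sketch below. Firehose passes a batch of base64-encoded records and expects each one back with the same `recordId`, a `result` status, and re-encoded `data`. The `prediction_ts` field (epoch seconds) and the added `prediction_ts_iso` field are hypothetical names chosen for this example.

```python
import base64
import json
from datetime import datetime, timezone


def lambda_handler(event, context):
    """Kinesis Data Firehose transformation: add an ISO 8601 copy of an epoch timestamp."""
    output = []
    for record in event["records"]:
        try:
            payload = json.loads(base64.b64decode(record["data"]))
            # "prediction_ts" is a hypothetical field holding epoch seconds.
            ts = datetime.fromtimestamp(payload["prediction_ts"], tz=timezone.utc)
            payload["prediction_ts_iso"] = ts.isoformat()
            # Newline-delimit the JSON so objects stay separable once batched in S3.
            encoded = base64.b64encode((json.dumps(payload) + "\n").encode("utf-8"))
            output.append({
                "recordId": record["recordId"],
                "result": "Ok",
                "data": encoded.decode("utf-8"),
            })
        except (ValueError, KeyError, TypeError, OverflowError, OSError):
            # Return the record unchanged and mark it as failed.
            output.append({
                "recordId": record["recordId"],
                "result": "ProcessingFailed",
                "data": record["data"],
            })
    return {"records": output}
```

Records marked `ProcessingFailed` are treated by Firehose as unsuccessfully processed, which keeps one malformed message from failing the whole batch.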

For more complex transformations, AWS Glue can be used. For example, once the data lands in Amazon S3, you can start preparing and aggregating the training dataset using Amazon SageMaker Data Wrangler. As part of the feature engineering process, you can do the following:

  • Transform the data
  • Delete unneeded columns
  • Impute missing values
  • Perform label encoding
  • Use the quick model option to get a sense of which features are adding predictive power as you progress with your data preparation

All the data preparation and feature engineering tasks can be performed from Data Wrangler’s single visual interface.

Once data is prepared in Amazon S3, Amazon SageMaker can be used for model training. In soccer, you can predict a goal percentage based on the player’s position, acceleration, and past performance history. SageMaker provides several built-in algorithms that can be trained. For real-time predictions, Amazon API Gateway provides an API layer to clients like an OTT, broadcasting service, or a web browser. API Gateway can invoke a Lambda function, with logic to call a SageMaker endpoint and persist the output to the database. This data can be used later on for further analysis or to fine-tune your models.
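
The Lambda function behind API Gateway could be sketched along these lines. The endpoint name, the request fields (`x`, `y`, `acceleration`), and the assumption that the model accepts one CSV row and returns a single number are all illustrative; persisting the result to a database is left out for brevity.

```python
import json
import os

# Hypothetical endpoint name, overridable through an environment variable.
ENDPOINT_NAME = os.environ.get("ENDPOINT_NAME", "goal-probability-endpoint")


def _runtime_client():
    import boto3  # imported lazily so the module loads without the AWS SDK
    return boto3.client("sagemaker-runtime")


def predict(features, client=None):
    """Send one CSV row to the endpoint and return the model's numeric score."""
    client = client or _runtime_client()
    response = client.invoke_endpoint(
        EndpointName=ENDPOINT_NAME,
        ContentType="text/csv",
        Body=",".join(str(value) for value in features),
    )
    return float(response["Body"].read().decode("utf-8"))


def lambda_handler(event, context, client=None):
    """API Gateway proxy handler: parse the request, call the model, return JSON."""
    body = json.loads(event["body"])
    # Feature names are assumptions made for this sketch.
    features = [body["x"], body["y"], body["acceleration"]]
    score = predict(features, client)
    return {"statusCode": 200, "body": json.dumps({"goal_probability": score})}
```

The optional `client` argument is only there so the handler can be tested with a stub. Lambda itself calls the handler with `event` and `context`.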

Figure 2. Deliver real-time prediction using Amazon SageMaker

Computer vision-based object detection techniques can be very useful in sports. These techniques use deep learning algorithms to predict the pass probability, game player face-off, or win prediction. For the sports industry, object detection technologies like these are crucial. They obviate the need for sensors. Real-time object identification can be used to:

  • Generate new advanced analytics regarding player and team performance
  • Aid game officials in making correct calls
  • Provide fans an improved and more data-rich viewing experience

Read Football tracking in the NFL with Amazon SageMaker for more information on how to track using broadcast video data. Using SageMaker, you can train object detection models that analyze thousands of images. You can then locate and classify the football itself, and distinguish it from background objects.

Creating a graphics overlay

When you have the ML inference data and video ingestion ready, you may want to represent this data on your broadcast video. The graphic overlay feature lets you insert an image (a BMP, PNG, or TGA file) at a specified time. It is displayed as a static overlay on the underlying video for a specified duration. The motion graphic overlay feature lets you insert an animation (a MOV or SWF file, or a series of PNG files) on the underlying video. This can be displayed at a specified time for a specified duration.

For example, a player’s motion prediction can be inserted on video during a game, through a RESTful API call of ML inferences. You can use AWS Elemental Live to achieve this. Read about AWS Elemental Live Graphic Overlay at AWS documentation.

Reducing latency

You may want to reduce latency for analytics such as for player health and safety. Use video, data, or machine learning processing at the arena using AWS Outposts. You can also use AWS Wavelength along with 5G infrastructure. For more information, read Catch Important Moments in Sports with 5G and AWS Wavelength.

Summary

In this blog, we’ve highlighted how customers in the sports industry are using AWS to increase the quality of the game, and enhance the sports fan’s experience. The following benefits can be achieved by building a data pipeline for tracking sporting events using AWS services:

  • Amazon Kinesis collects, processes, and analyzes in-game streaming data in real time. This way both teams and fans get timely insights and can react quickly to new information.
  • The serverless nature of this architecture enables a cost-effective, scalable, and operationally efficient environment for customers.
  • Amazon Machine Learning services like Amazon SageMaker can be used to enrich the fan viewing experience. They can present in-game predictions such as who will score next, or which team will win the game.

Visit our AWS Sports Partnerships page for more information on how AWS is changing the game.

Preventing Free Trial Abuse with AWS Managed Services

Post Syndicated from Katie Williams original https://aws.amazon.com/blogs/architecture/preventing-free-trial-abuse-with-aws-managed-services/

Free trial promotions are a popular marketing tactic, but they can also be a common source of fraud for ecommerce retailers. So, how do you identify fraudulent users? And what are some effective ways to prevent free trial abuse?

This blog post outlines common free trial abuse attack vectors and presents prevention techniques. We’ll show you how to incorporate Amazon Fraud Detector into your architecture to catch free trial abuse faster and more frequently.

Additionally, our fraud prevention solution shows you how to use managed AWS services and Serverless on AWS. This solution shows you how to quickly build a scalable prevention system that doesn’t require machine learning (ML) expertise and offers a cost-efficient, pay-as-you-go pricing model.

Common free trial abuse attack vectors

Attack vectors are how users gain fraudulent access to an application. For the determined attacker, or for high value services, methods are often combined and automated. In this section, we familiarize you with common attack vectors to help you protect against them.

Fake emails and disposable phone numbers

Applications generally limit free trials to one per account. They use a unique identifier such as an email address or phone number to enforce this restriction. To bypass this, fraudulent users create multiple fake email accounts or use disposable email services.

Browser cookies

Other applications use browser cookies to track if the user has been to the site. If the user has visited before, they are prompted to log into their account instead of creating a new one. Enabling a browser’s privacy mode can defeat cookie tracking, or a user may delete individual cookies.

Stolen/fake credit card information

Users may bypass credit card verification steps by obtaining a stolen credit card number. Or they can create fake credit card numbers using automatic credit card generator scripts.

Free trial abuse prevention techniques

Now that you’re aware of common attack vectors, this section will show you how you can prevent bad actors from taking advantage of free trial promotions.

Verify phone numbers and email addresses

Because it’s usually the easiest to implement, phone or email verification is a common prevention technique.

The application sends a confirmation code or link to the phone or email contact provided by a new user. The user then passes this confirmation token back to the application to verify that they have access to these contact methods.

To strengthen this approach, you can incorporate expiration times for the code or link used. With a confirmation link, applications can further block outside threats by including parameters in the URL that specify additional identification information.

Cross-check against lists of disposable email addresses/phone numbers

Disposable email addresses and phone numbers are commonly used to safeguard online privacy and anonymity. However, they also provide an easy way to create throwaway accounts for free trial abuse.

While it’s impossible to identify all disposable contact methods, you can reduce the number of disposable emails and phone numbers used in a few steps.

  • First, validate users’ email input. Applications should check the validity of the email format and prevent users from using plus addressing in their emails. For example, user@example.com and user+trial@example.com will have confirmation emails delivered to the same inbox.
  • For an additional layer of protection, you can scan actively maintained lists of disposable email addresses and phone numbers and cross-check them against new user information. These lists are available on open-source GitHub repositories or through third-party APIs.
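
The two checks above can be sketched as follows. This is a simplified illustration: rather than rejecting plus-addressed input outright, it normalizes the address so that variants collapse to one identifier. The format check is deliberately loose, and the domain set is a tiny hypothetical sample standing in for a maintained list.

```python
import re

# Deliberately simple format check; full address validation is more involved.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")

# Hypothetical sample; in practice this would be loaded from a maintained list.
DISPOSABLE_DOMAINS = {"mailinator.com", "example-disposable.test"}


def normalize_email(address: str) -> str:
    """Lowercase the address and drop any "+tag" suffix from the local part."""
    if not EMAIL_PATTERN.match(address):
        raise ValueError(f"not a valid email format: {address!r}")
    local, domain = address.lower().rsplit("@", 1)
    base = local.split("+", 1)[0]
    if not base:
        raise ValueError(f"empty local part after removing tag: {address!r}")
    return f"{base}@{domain}"


def is_disposable(address: str) -> bool:
    """Return True when the address's domain appears in the disposable list."""
    domain = normalize_email(address).rsplit("@", 1)[1]
    return domain in DISPOSABLE_DOMAINS
```

Storing the normalized form as the account's unique identifier means a second sign-up with a different tag maps to the same account.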

Our prevention solution architecture shows you how to periodically call these lists using AWS Lambda and update an Amazon DynamoDB table that is used as the main source of truth for fraudulent emails and phone numbers.

Verify credit card information 

Free trials often require a credit card that will be charged once the free trial period ends. You can verify these credit card details two ways:

  1. Ensure that the card number is valid using the Luhn algorithm, which validates credit card numbers upon entry.
  2. Once you have validated the credit card number, perform a second check to validate that your new user is the cardholder. To do this, request additional information from the user such as the credit card CVV, full name, or billing address. Payment vendor APIs will then securely pass this information along for an identity verification check.
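
The first check can be written in a few lines. The sketch below is one straightforward rendering of the Luhn checksum; tolerating spaces and hyphens in the input is an assumption made here for convenience.

```python
def luhn_valid(card_number: str) -> bool:
    """Return True if the digits pass the Luhn checksum."""
    cleaned = card_number.replace(" ", "").replace("-", "")
    if len(cleaned) < 2 or not all(ch in "0123456789" for ch in cleaned):
        return False
    total = 0
    # Walk right to left, doubling every second digit.
    for index, ch in enumerate(reversed(cleaned)):
        digit = int(ch)
        if index % 2 == 1:
            digit *= 2
            if digit > 9:
                digit -= 9  # same as summing the two digits of the product
        total += digit
    return total % 10 == 0
```

A passing checksum only shows that the number is well formed. It says nothing about whether the card exists or belongs to the user, which is why the second check is still needed.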

Note that this additional logic deals with personally identifiable information and payment card industry data. AWS Compliance provides resources and tooling to help you secure your customer data and adhere to compliance controls.

Use machine learning to learn and identify fraudulent behavior

While the previous techniques offer quick fixes, a custom machine learning model offers a more permanent and powerful solution.

ML provides you greater accuracy in fraud prevention. By building custom fraud detection ML models that detect patterns presented in historical user sign-up data, you can learn and adapt to changing attack vectors. To get started, Amazon SageMaker allows you to build, train, and deploy custom ML models, and Amazon Fraud Detector provides a simplified model building approach, as shown in the following section.

Using Amazon Fraud Detector for ML-based fraud prevention

Amazon Fraud Detector automates the time-consuming and expensive steps to build, train, and deploy an ML model for fraud detection. It customizes each model it creates based on your specific dataset, making the accuracy of models higher than current one-size-fits-all ML solutions.

Amazon Fraud Detector distinguishes between legitimate and high-risk customer account registrations so you can selectively introduce additional steps or checks based on risk. Figure 1 demonstrates how Amazon Fraud Detector can be used to train and validate a model and then host the model endpoint so that it can be called through the Amazon Fraud Detector API.

Figure 1. How Amazon Fraud Detector works

Building fraud prevention into your AWS environment

A well-architected application should incorporate a combination of the prevention techniques described in this post to block free trial abuse from multiple attack vectors.

Our sample architecture (Figure 2) demonstrates how you can apply these techniques using managed AWS services to build a comprehensive fraud prevention workflow into your application, which is summarized as follows:

  1. A user enrolls by providing an email address, phone number, and credit card number. A Luhn algorithm check is run on the credit card number input to verify that the number is legitimate.
  2. If the credit card number passes this check, a call is made to the payment processor API to verify the cardholder’s identity.
  3. Amazon Cognito creates a user sign-up workflow, stores user details, and employs a pre-sign-up action that invokes a Lambda function, which performs two additional fraud checks.
  4. The Lambda function checks if the email address or phone number provided matches those stored in a DynamoDB database. This database contains a list of known disposable contacts. An Amazon CloudWatch event initiates a Lambda function to update these lists hourly.
  5. Upon passing the first check, the Lambda function sends the enrollment data to Amazon Fraud Detector. The data is evaluated using ML to assign a fraud score.
  6. If the score is within the range allowing enrollment to proceed, Amazon Cognito sends the user an email with a confirmation link.

Figure 2. Fraud prevention sample architecture
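
Steps 3 through 6 could be sketched as an Amazon Cognito pre-sign-up Lambda trigger like the one below. Every name in it (table, key attribute, detector, event type, entity type, variables, and the `block` outcome) is a hypothetical placeholder. It also assumes the detector's rules translate the fraud score into outcomes, so the function inspects rule outcomes rather than comparing the raw score itself.

```python
import uuid
from datetime import datetime, timezone

# All identifiers below are hypothetical placeholders.
TABLE_NAME = "disposable-contacts"
DETECTOR_ID = "free_trial_detector"
EVENT_TYPE = "trial_signup"
BLOCK_OUTCOME = "block"


class SignUpRejected(Exception):
    """Raised to make Cognito abort the sign-up."""


def _clients():
    import boto3  # imported lazily so the module loads without the AWS SDK
    return boto3.client("dynamodb"), boto3.client("frauddetector")


def is_known_disposable(dynamodb, contact):
    """Look the contact up in the table of known disposable emails and numbers."""
    response = dynamodb.get_item(TableName=TABLE_NAME, Key={"contact": {"S": contact}})
    return "Item" in response


def fraud_outcomes(frauddetector, email, phone):
    """Ask Amazon Fraud Detector to evaluate the sign-up and collect rule outcomes."""
    response = frauddetector.get_event_prediction(
        detectorId=DETECTOR_ID,
        eventId=str(uuid.uuid4()),
        eventTypeName=EVENT_TYPE,
        entities=[{"entityType": "customer", "entityId": "unknown"}],
        eventTimestamp=datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        eventVariables={"email_address": email, "phone_number": phone},
    )
    return {o for rule in response.get("ruleResults", []) for o in rule.get("outcomes", [])}


def lambda_handler(event, context, clients=None):
    dynamodb, frauddetector = clients or _clients()
    attrs = event["request"]["userAttributes"]
    email, phone = attrs.get("email", ""), attrs.get("phone_number", "")
    if any(is_known_disposable(dynamodb, c) for c in (email, phone) if c):
        raise SignUpRejected("disposable contact")
    if BLOCK_OUTCOME in fraud_outcomes(frauddetector, email, phone):
        raise SignUpRejected("high fraud risk")
    return event  # returning the event lets Cognito continue the sign-up
```

Raising an exception from the trigger stops the registration, while returning the event unchanged lets Cognito proceed to send the confirmation message.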

Conclusion

In this blog post, we identified some of the common attack vectors that lead to free trial abuse, and we provided you techniques that you can use to prevent them. We showed you AWS services to quickly solve these problems and build a solution.

You can look at other ways Amazon Fraud Detector has been used to detect and prevent fraud in AWS environments on the AWS Machine Learning: Fraud Detector Blog channel.

Improving Retail Forecast Accuracy with Machine Learning

Post Syndicated from Soonam Jose original https://aws.amazon.com/blogs/architecture/improving-retail-forecast-accuracy-with-machine-learning/

The global retail market continues to grow larger and the influx of consumer data increases daily. The rise in volume, variety, and velocity of data poses challenges with demand forecasting and inventory planning. Outdated systems generate inaccurate demand forecasts. This results in multiple challenges for retailers. They are faced with over-stocking and lost sales, and often have to rely on increased levels of safety stock to avoid losing sales.

A recent McKinsey study indicates that AI-based forecasting improves forecasting accuracy by 10–20 percent. This translates to revenue increases of 2–3 percent. An accurate forecasting system can also help determine ideal inventory levels and better predict the impact of sales promotions. It provides a single view of demand across all channels and a better customer experience overall.

In this blog post, we will show you how to build a reliable retail forecasting system. We will use Amazon Forecast, and an AWS-vetted solution called Improving Forecast Accuracy with Machine Learning. This is an AWS Solutions Implementation that automatically produces forecasts and generates visualization dashboards. This solution can be extended to use cases across a variety of industries.

Improving Forecast Accuracy solution architecture

This post will illustrate a retail environment that has an SAP S/4 HANA system for overall enterprise resource planning (ERP). We will show a forecasting solution based on Amazon Forecast to predict demand across product categories. The environment also has a unified platform for customer experience provided by SAP Customer Activity Repository (CAR). Replenishment processes are driven by SAP Forecasting and Replenishment (F&R), and SAP Fiori apps are used to manage forecasts.

The solution is divided into four parts: Data extraction and preparation, Forecasting and monitoring, Data visualization, and Forecast import and utilization in SAP.

Figure 1. Notional architecture for improving forecasting accuracy solution and SAP integration

Data extraction and preparation

Historical demand data such as sales, web traffic, inventory numbers, and resource demand are extracted from SAP and uploaded to Amazon Simple Storage Service (S3). There are multiple ways to extract data from an SAP system into AWS. As part of this architecture, we will use operational data provisioning (ODP) extraction. ODP acts as a data source for OData services, enabling REST-based integrations with external applications. The ODP-Based Data Extraction via OData document details this approach. The steps involved are:

  1. Create a data source using transaction RSO2, and allow Change Data Capture for the specific data to be extracted
  2. Create an OData service using transaction SEGW
  3. Create a Data model for ODP extraction, which refers to the defined data source, then register the service
  4. Initiate the service from SAP gateway client
  5. In the AWS Management Console, create an AWS Lambda function to extract data and upload to S3. Check out the sample extractor code using Python, referenced in the blog Building data lakes with SAP on AWS
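As a rough sketch of step 5, the Lambda function might flatten the OData response into CSV rows of item, timestamp, and demand, and then write the file to S3. The entity field names (`Material`, `CalendarDay`, `Quantity`) and the sample values are hypothetical, and the sketch assumes an OData V2 JSON payload (`d.results`). The column order must match the dataset schema you define in Amazon Forecast. The sample extractor referenced in step 5 remains the better starting point for a real implementation.

```python
import csv
import io


def odata_to_csv(payload, item_field="Material", time_field="CalendarDay",
                 value_field="Quantity"):
    """Flatten an OData V2 JSON payload into CSV rows: item, timestamp, demand."""
    rows = payload.get("d", {}).get("results", [])
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for row in rows:
        writer.writerow([row[item_field], row[time_field], row[value_field]])
    return buffer.getvalue()


def upload_csv(bucket, key, body, s3_client=None):
    """Write CSV text to S3. boto3 is imported lazily so the helper above runs offline."""
    if s3_client is None:
        import boto3  # included in the AWS Lambda Python runtime
        s3_client = boto3.client("s3")
    s3_client.put_object(Bucket=bucket, Key=key, Body=body.encode("utf-8"))


# Hypothetical payload, shaped like an OData V2 response from the gateway service.
sample = {"d": {"results": [
    {"Material": "SKU-1", "CalendarDay": "2021-01-01", "Quantity": "12"},
    {"Material": "SKU-2", "CalendarDay": "2021-01-01", "Quantity": "7"},
]}}
csv_text = odata_to_csv(sample)
print(csv_text)
```

Keeping the transformation separate from the upload makes it possible to test the mapping without AWS credentials.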

Related data that can potentially affect demand levels can be uploaded to Amazon S3. These could include seasonal events, promotions, and item price. Additional item metadata, such as product descriptions, color, brand, and size, may also be uploaded. Amazon Forecast provides built-in related time series data for holidays and weather. These three components together form the forecast inputs.

Forecasting and monitoring

An S3 event notification will be initiated when new datasets are uploaded to the input bucket. This, in turn, starts an AWS Step Functions state machine. The state machine combines a series of AWS Lambda functions that build, train, and deploy machine learning models in Amazon Forecast. All AWS Step Functions logs are sent to Amazon CloudWatch. Administrators will be notified with the results of the AWS Step Functions through Amazon Simple Notification Service (SNS).
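The hand-off from the S3 event notification to the state machine could look roughly like the following sketch. It is not the solution's own code: the `STATE_MACHINE_ARN` environment variable and the shape of the execution input are assumptions made for illustration.

```python
import json
import os
from urllib.parse import unquote_plus


def objects_from_s3_event(event):
    """Return (bucket, key) pairs from an S3 event notification payload."""
    pairs = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        pairs.append((bucket, key))
    return pairs


def handler(event, context, sfn_client=None, state_machine_arn=None):
    """Start one state machine execution per uploaded object."""
    if sfn_client is None:
        import boto3
        sfn_client = boto3.client("stepfunctions")
    # Hypothetical environment variable name, used when no ARN is passed in.
    arn = state_machine_arn or os.environ["STATE_MACHINE_ARN"]
    started = []
    for bucket, key in objects_from_s3_event(event):
        response = sfn_client.start_execution(
            stateMachineArn=arn,
            input=json.dumps({"bucket": bucket, "dataset_file": key}),
        )
        started.append(response["executionArn"])
    return started
```

Lambda invokes `handler` with only `event` and `context`. The two extra keyword arguments exist so the function can be exercised locally with a stub client.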

An AWS Glue job combines raw forecast input data, metadata, predictor backtest exports, and forecast exports. These all go into an aggregated view of forecasts in an S3 bucket. It is then translated to the format expected by the External Forecast import interface. Amazon Athena can be used to query forecast output in S3 using standard SQL queries.
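To make the Athena step more concrete, here is a minimal sketch of submitting a query against the aggregated forecast view. The table name, the `item_id` column, the database, and the output location are all hypothetical. The actual names depend on how the Glue job catalogs the data.

```python
import re


def build_forecast_query(table, item_id, limit=100):
    """Build a simple SQL statement for one item. The column name is an assumption."""
    if not re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", table):
        raise ValueError("unexpected table name")
    safe_item = item_id.replace("'", "''")  # escape single quotes in the literal
    return (
        f'SELECT * FROM "{table}" '
        f"WHERE item_id = '{safe_item}' "
        f"LIMIT {int(limit)}"
    )


def run_query(sql, database, output_location, athena_client=None):
    """Submit the query to Athena and return the query execution ID."""
    if athena_client is None:
        import boto3
        athena_client = boto3.client("athena")
    response = athena_client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return response["QueryExecutionId"]


query = build_forecast_query("forecast_export", "SKU-1", limit=10)
print(query)
```

Athena runs queries asynchronously, so a caller would still need to poll for completion before reading the results.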

Data visualization

Amazon QuickSight analyses can be created on a per-forecast basis. This provides users with forecast output visualization across hierarchies and categories of forecasted items. It also displays item-level accuracy metrics. Dashboards can be created from these analyses and shared within the organization. Additionally, data scientists and developers can prepare and process data, and evaluate Forecast outputs using an Amazon SageMaker Notebook Instance.

Forecast import and utilization in SAP

Amazon Forecast outputs located in Amazon S3 will be imported into the Unified Demand Forecast (UDF) module within the SAP Customer Activity Repository (CAR). Refer to the SAP documentation for details on how to import external forecasts. An AWS Lambda function will be initiated when aggregated forecasts are uploaded to the S3 bucket. The Lambda function performs a remote function call (RFC) to the SAP system through the official SAP JCo Library. The SAP RFC credentials and connection information may be stored securely inside AWS Secrets Manager and read on demand to establish connectivity.
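Reading the connection settings on demand might look like the sketch below. The secret layout (`host`, `system_number`, `client`, `user`, `password`) is a hypothetical convention rather than something SAP or the solution prescribes. The RFC call itself goes through SAP JCo and is not shown here.

```python
import json

# Hypothetical keys expected inside the JSON secret.
REQUIRED_KEYS = ("host", "system_number", "client", "user", "password")


def load_rfc_settings(secret_id, secrets_client=None):
    """Fetch and validate SAP RFC connection settings stored as a JSON secret."""
    if secrets_client is None:
        import boto3
        secrets_client = boto3.client("secretsmanager")
    response = secrets_client.get_secret_value(SecretId=secret_id)
    settings = json.loads(response["SecretString"])
    missing = [key for key in REQUIRED_KEYS if key not in settings]
    if missing:
        raise KeyError(f"secret is missing keys: {missing}")
    return settings
```

Fetching the secret at invocation time, rather than baking it into the function configuration, means rotated credentials are picked up without redeploying the function.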

Once imported, forecast values from the solution can be retrieved by SAP Forecasting and Replenishment (F&R). They will be consumed as an input to replenishment processes, which consist of requirements calculation and requirement quantity optimization. SAP F&R calculates requirements based on the forecast, the current stock, and the open purchase orders. The requirement quantity then may be improved in accordance with optimization settings defined in SAP F&R.

Additionally, you have the flexibility to adjust the system forecast as required by the demand situation or analyze forecasts via respective SAP Fiori Apps.
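The requirements calculation described above can be pictured with a deliberately simplified model. This is not SAP F&R's actual algorithm: the net-requirement formula and the pack-size rounding, which stands in for one possible optimization setting, are illustrative assumptions.

```python
def net_requirement(forecast_qty, current_stock, open_order_qty, safety_stock=0):
    """Quantity still needed after counting stock on hand and open purchase orders."""
    return max(0, forecast_qty + safety_stock - current_stock - open_order_qty)


def round_up_to_pack(qty, pack_size):
    """Round an order quantity up to a whole number of packs (ceiling division)."""
    return -(-qty // pack_size) * pack_size


requirement = net_requirement(forecast_qty=120, current_stock=30, open_order_qty=50)
order_qty = round_up_to_pack(requirement, pack_size=12)
print(requirement, order_qty)  # 40 48
```

With a forecast of 120 units, 30 in stock, and 50 on order, 40 units remain to be ordered, which becomes 48 once rounded up to packs of 12.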

Sample use case: AnyCompany Stores, Inc.

To illustrate how beneficial this solution can be for retail organizations, let’s consider AnyCompany Stores, Inc. This is a hypothetical customer and leader in the retail industry with 985 stores across the United States. They struggle with issues stemming from their existing forecasting implementation. That implementation only understands certain categories and does not factor in the entire product portfolio. Additionally, it is limited to available demand history and does not consider related information that may affect forecasts. AnyCompany Stores is looking to improve their demand forecasting system.

Using Improving Forecast Accuracy with Machine Learning, AnyCompany Stores can easily generate AI-based forecasts at appropriate quantiles to address sensitivities associated with respective product categories. This mitigates inconsistent inventory buys, overstocks, out-of-stocks, and margin erosion. The solution also considers all relevant related data in addition to the historical demand data. This ensures that generated forecasts are accurate for each product category.
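One common way to reason about an "appropriate quantile" is the newsvendor critical ratio, which weighs the cost of running out against the cost of holding too much. The sketch below is an illustration under that assumption, with made-up cost figures. It is not part of the solution itself. The result is clamped to the 0.01 to 0.99 range that Amazon Forecast accepts for quantiles.

```python
def service_quantile(understock_cost, overstock_cost):
    """Critical ratio: cost of a lost sale relative to the total cost of a miss."""
    ratio = understock_cost / (understock_cost + overstock_cost)
    return min(0.99, max(0.01, round(ratio, 2)))


# Hypothetical per-unit costs for two product categories.
staples = service_quantile(understock_cost=9, overstock_cost=1)      # high cost of stockouts
perishables = service_quantile(understock_cost=1, overstock_cost=3)  # high cost of spoilage
print(staples, perishables)  # 0.9 0.25
```

A category where stockouts are expensive ends up with a high quantile, while a category where spoilage is expensive ends up with a low one.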

The generated forecasts may be used to complement existing forecasting and replenishment processes. With an improved forecasting solution, AnyCompany Stores will be able to meet demand, while holding less inventory and improving customer experience. This also helps ensure that potential demand spikes are accurately captured, so staples will always be in stock. Additionally, the company will not overstock expensive items with short shelf lives that are likely to spoil.

Conclusion

In this post, we explored how to implement an accurate retail forecasting solution using a ready-to-deploy AWS Solution. We use generated forecasts to drive inventory replenishment optimization and improve customer experience. The solution can be extended to inventory, workforce, capacity, and financial planning.

We showcase one of the ways in which Improving Forecast Accuracy with Machine Learning may be extended for a use case in the retail industry. If your organization would like to improve business outcomes with the power of forecasting, explore customizing this solution to fit your unique needs.


Integrating Redaction of FinServ Data into a Machine Learning Pipeline

Post Syndicated from Ravikant Gupta original https://aws.amazon.com/blogs/architecture/integrating-redaction-of-finserv-data-into-a-machine-learning-pipeline/

Financial companies process hundreds of thousands of documents every day. These include loan and mortgage statements that contain large amounts of confidential customer information.

Data privacy requires that sensitive data be redacted to protect the customer and the institution. Redacting digital and physical documents is time-consuming and labor-intensive. The accidental or inadvertent release of personal information can be devastating for the customer and the institution. Having automated processes in place reduces the likelihood of a data breach.

In this post, we discuss how to automatically redact personally identifiable information (PII) data fields from your financial services (FinServ) data through machine learning (ML) capabilities of Amazon Comprehend and Amazon Athena. This will ensure you comply with federal regulations and meet customer expectations.

Protecting data and complying with regulations

Protecting PII is crucial to complying with regulations like the California Consumer Privacy Act (CCPA), Europe’s General Data Protection Regulation (GDPR), and Payment Card Industry’s data security standards (PCI DSS).

In Figure 1, we show how structured and non-structured sensitive data stored in AWS data stores can be redacted before it is made available to data engineers and data scientists for feature engineering and building ML models in compliance with the organization’s data security policies.


Figure 1. How to redact confidential information in your ML pipeline

Architecture walkthrough

This section explains each step presented in Figure 1 and the AWS services used:

  1. By using services like AWS DataSync, AWS Storage Gateway, and AWS Transfer Family, data can be ingested into AWS using a batch or streaming pattern. This data lands in an Amazon Simple Storage Service (Amazon S3) bucket, which we call “raw data” in Figure 1.
  2. To detect if the raw data bucket has any sensitive data, use Amazon Macie. Macie is a fully managed data security and data privacy service that uses ML and pattern matching to discover and protect your sensitive data in AWS. When Macie discovers sensitive data, you can configure it to tag the objects with an Amazon S3 object tag to identify that sensitive data was found in the object before progressing to the next stage of the pipeline. Refer to the Use Macie to discover sensitive data as part of automated data pipelines blog post for detailed instructions on building such a pipeline.
  3. This tagged data lands in a “scanned data” bucket, where we use Amazon Comprehend, a natural language processing (NLP) service that uses ML to uncover information in unstructured data. Amazon Comprehend works for unstructured text document data and redacts sensitive fields like credit card numbers, date of birth, social security number, passport number, and more. Refer to the Detecting and redacting PII using Amazon Comprehend blog post for step-by-step instructions on building such a capability.
  4. If your pipeline requires redaction for specific use cases only, you can use the information in Introducing Amazon S3 Object Lambda – Use Your Code to Process Data as It Is Being Retrieved from S3 to redact sensitive data. Using this operation, an AWS Lambda function will intercept each GET request. It will redact data as necessary before it goes back to the requestor. This allows you to keep one copy of all the data and redact the data as it is requested for a specific workload. For further details, refer to the Amazon S3 Object Lambda Access Point to redact personally identifiable information (PII) from documents developer guide.
  5. When you want to join multiple datasets from different data sources, use an Athena federated query. Using user-defined functions (UDFs) with Athena federated query will help you redact data in Amazon S3 or from other data sources such as an online transaction store like Amazon Relational Database Service (Amazon RDS), a data warehouse solution like Amazon Redshift, or a NoSQL store like Amazon DocumentDB. Athena supports UDFs, which enable you to write custom functions and invoke them in SQL queries. UDFs allow you to perform custom processing such as redacting sensitive data, compressing and decompressing data, or applying customized decryption. To read further on how you can get this set up, refer to the Redacting sensitive information with user-defined functions in Amazon Athena blog post.
  6. Redacted data lands in another S3 bucket that is now ready for any ML pipeline consumption.
  7. Using AWS Glue DataBrew, you can prepare the data without writing any code. You can choose reusable recipes from over 250 pre-built transformations to automate data preparation tasks with jobs that can be scheduled based on your requirements.
  8. Data is then used by Amazon SageMaker Data Wrangler to do feature engineering on the data curated during data preparation (step 7). SageMaker Data Wrangler offers over 300 pre-configured data transformations, such as convert column type, one hot encoding, impute missing data with mean or median, rescale columns, and date/time embedding, so you can transform your data into formats that can be effectively used for models without writing a single line of code.
  9. The output of the SageMaker Data Wrangler job is stored in Amazon SageMaker Feature Store, a purpose-built repository where you can store and access features to name, organize, and reuse them across teams. SageMaker Feature Store provides a unified store for features during training and real-time inference without the need to write additional code or create manual processes to keep features consistent.
  10. Use ML features in SageMaker notebooks or SageMaker Studio for ML training on your redacted data. A SageMaker notebook instance is an ML compute instance running the Jupyter Notebook App. Amazon SageMaker Studio is a web-based, integrated development environment for ML that lets you build, train, debug, deploy, and monitor your ML models. SageMaker Studio is integrated with SageMaker Data Wrangler.
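The redaction in step 3 can be sketched as follows. Amazon Comprehend's PII detection returns character offsets for each entity, and the caller replaces those spans. The masking format (`[TYPE]`) and the `min_score` threshold are illustrative choices, and the sample text is made up.

```python
def redact(text, entities, mask="[{type}]"):
    """Replace each detected PII span with a type label.

    Spans are processed right to left so that earlier offsets stay valid.
    """
    redacted = text
    for entity in sorted(entities, key=lambda e: e["BeginOffset"], reverse=True):
        label = mask.format(type=entity["Type"])
        redacted = redacted[:entity["BeginOffset"]] + label + redacted[entity["EndOffset"]:]
    return redacted


def redact_with_comprehend(text, comprehend_client=None, min_score=0.5):
    """Detect PII with Amazon Comprehend, then mask the confident matches."""
    if comprehend_client is None:
        import boto3
        comprehend_client = boto3.client("comprehend")
    response = comprehend_client.detect_pii_entities(Text=text, LanguageCode="en")
    confident = [e for e in response["Entities"] if e["Score"] >= min_score]
    return redact(text, confident)


# Offline demonstration with hand-built entities in the shape Comprehend returns.
sample_text = "Card 4111111111111111 belongs to Jane Doe."
card_start = sample_text.index("4111")
name_start = sample_text.index("Jane")
sample_entities = [
    {"Type": "CREDIT_DEBIT_NUMBER", "BeginOffset": card_start,
     "EndOffset": card_start + 16, "Score": 0.99},
    {"Type": "NAME", "BeginOffset": name_start,
     "EndOffset": name_start + 8, "Score": 0.99},
]
print(redact(sample_text, sample_entities))
```

The same `redact` helper could sit inside an S3 Object Lambda function (step 4) so that data is masked only when it is requested.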

Conclusion

Federal regulations require that financial institutions protect customer data. To achieve this, redact sensitive fields in your data.

In this post, we showed you how to use AWS services to meet these requirements with Amazon Comprehend and Amazon Athena. These services allow data engineers and data scientists in your organization to safely consume this data for machine learning pipelines.

Field Notes: Launch a Fully Configured AWS Deep Learning Desktop with NICE DCV

Post Syndicated from Ajay Vohra original https://aws.amazon.com/blogs/architecture/field-notes-launch-a-fully-configured-aws-deep-learning-desktop-with-nice-dcv/

You want to start quickly when doing deep learning using GPU-enabled Amazon Elastic Compute Cloud (Amazon EC2) instances in the AWS Cloud. Although AWS provides end-to-end machine learning (ML) in Amazon SageMaker, when you work at the deep learning framework level, the quickest way to start is with AWS Deep Learning AMIs (DLAMIs), which provide preconfigured Conda environments for most of the popular frameworks.

DLAMIs make it straightforward to launch Amazon EC2 instances, but these instances do not automatically provide the high-performance graphics visualization required during deep learning research. Additionally, they are not preconfigured to use AWS storage services or SageMaker. This post explains how you can launch a fully-configured deep learning desktop in the AWS Cloud. Not only is this desktop preconfigured with the popular frameworks such as TensorFlow, PyTorch, and Apache MXNet, but it is also enabled for high-performance graphics visualization. NICE DCV is the remote display protocol used for visualization. In addition, it is preconfigured to use AWS storage services and SageMaker.

Overview of the Solution

The deep learning desktop described in this solution is ready for research and development of deep neural networks (DNNs) and their visualization. You no longer need to set up low-level drivers, libraries, and frameworks, or configure secure access to AWS storage services and SageMaker. The desktop has preconfigured access to your data in a Simple Storage Service (Amazon S3) bucket, and a shared Amazon Elastic File System (Amazon EFS) is automatically attached to the desktop. It is automatically configured to access SageMaker for ML services, and provides you with the ability to prepare the data needed for deep learning, and to research, develop, build, and debug your DNNs. You can use all the advanced capabilities of SageMaker from your deep learning desktop. The following diagram shows the reference architecture for this solution.


Figure 1 – Architecture overview of the solution to launch a fully configured AWS Deep Learning Desktop with NICE DCV

The deep learning desktop solution discussed in this post is contained in a single AWS CloudFormation template. To launch the solution, you create a CloudFormation stack from the template. Before we provide a detailed walkthrough for the solution, let us review the key benefits.

DNN Research and Development

During the DNN research phase, there is iterative exploration until you choose the preferred DNN architecture. During this phase, you may prefer to work in an isolated environment (for example, a dedicated desktop) with your favorite integrated development environment (IDE) (for example, Visual Studio Code or PyCharm). Developers like the ability to step through code in the IDE debugger. With the increasing support for imperative programming in modern ML frameworks, the ability to step through code in the research phase can accelerate DNN development.

The DLAMIs are preconfigured with NVIDIA GPU drivers, NVIDIA CUDA Toolkit, and low-level libraries such as Deep Neural Network library (cuDNN). Deep learning ML frameworks such as TensorFlow, PyTorch, and Apache MXNet are preconfigured.

After you launch the deep learning desktop, you need to install and open your favorite IDE, clone your GitHub repository, and you can start researching, developing, debugging, and visualizing your DNN. The acceleration of DNN research and development is the first key benefit for the solution described in this post.


Figure 2 – Developing on deep learning desktop with Visual Studio Code IDE

Elasticity in number of GPUs

During the research phase, you need to debug any issues using a single GPU. However, as the DNN is stabilized, you horizontally scale across multiple GPUs in a single machine, followed by scaling across multiple machines.

Most modern deep learning frameworks support distributed training across multiple GPUs in a single machine, and also across multiple machines. However, when you use a single GPU in an on-premises desktop equipped with multiple GPUs, the idle GPUs are wasted. With the deep learning desktop solution described in this post, you can stop the deep learning desktop instance, change its Amazon EC2 instance type to another compatible type, restart the desktop, and get the exact number of GPUs you need at the moment. The elasticity in the number of GPUs in the deep learning desktop is the second key benefit for the solution described in this post.
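The stop, resize, and restart cycle described above can also be scripted. The following is a minimal sketch using boto3 EC2 calls. The instance ID and target type in the usage note are placeholders, and the sketch assumes the new type is compatible with the desktop's AMI and volumes.

```python
def change_instance_type(instance_id, new_type, ec2_client=None):
    """Stop an instance, change its type, and start it again."""
    if ec2_client is None:
        import boto3
        ec2_client = boto3.client("ec2")
    ec2_client.stop_instances(InstanceIds=[instance_id])
    # The instance type can only be modified while the instance is stopped.
    ec2_client.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])
    ec2_client.modify_instance_attribute(
        InstanceId=instance_id,
        InstanceType={"Value": new_type},
    )
    ec2_client.start_instances(InstanceIds=[instance_id])
    ec2_client.get_waiter("instance_running").wait(InstanceIds=[instance_id])
```

For example, `change_instance_type("i-0123456789abcdef0", "p3.8xlarge")` would move a hypothetical desktop from a single-GPU p3.2xlarge to a four-GPU instance once debugging is done.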

Integrated access to storage services 

Since the deep learning desktop is running in the AWS Cloud, you have access to all of the AWS data storage options, including the S3 object store, Amazon EFS, and the Amazon FSx for Lustre file system. You can build your favorite data pipeline and it will be supported by one or more data storage options. You can also easily use the ML-IO library, which is a high-performance data access library for ML tasks with support for multiple data formats. The integrated access to highly durable and scalable object and file system storage services for accessing ML data is the third key benefit for the solution described in this post.

Integrated access to SageMaker

Once you have a stable version of your DNN, you need to find the right hyperparameters that lead to model convergence during training. Having tuned the hyperparameters, you need to run multiple trials over permutations of datasets and hyperparameters to fine-tune your models. Finally, you may need to prune and compile the models to optimize inference. To compress the training time, you may need to do distributed data parallel training across multiple GPUs in multiple machines. For all of these activities, the deep learning desktop is preconfigured to use SageMaker. You can use jupyter-lab notebooks running on the desktop to launch SageMaker training jobs for distributed training in infrastructure automatically managed by SageMaker.


Figure 3 – Submitting a SageMaker training job from deep learning desktop using Jupyter Lab notebook

The SageMaker training logs, TensorBoard summaries, and model checkpoints can be configured to be written to the Amazon EFS attached to the deep learning desktop. You can use the Linux command tail to monitor the logs, or start a TensorBoard server from the Conda environment on the deep learning desktop, and monitor the progress of your SageMaker training jobs. You can use a Jupyter Lab notebook running on the deep learning desktop to load a specific model checkpoint available on the Amazon EFS, and visualize the predictions from the model checkpoint, even while the SageMaker training job is still running.


Figure 4 – Locally monitoring the TensorBoard summaries from SageMaker training job
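Picking up the newest checkpoint from the shared file system, as described above, might start with a helper like this one. The `*.ckpt*` naming pattern and the file names in the demonstration are assumptions. Your training script determines the real layout under the Amazon EFS mount path.

```python
import os
import tempfile
from pathlib import Path


def latest_checkpoint(directory, pattern="*.ckpt*"):
    """Return the most recently modified checkpoint file, or None if there is none."""
    candidates = [p for p in Path(directory).glob(pattern) if p.is_file()]
    if not candidates:
        return None
    return max(candidates, key=lambda p: p.stat().st_mtime)


# Demonstration in a temporary directory standing in for the EFS mount.
with tempfile.TemporaryDirectory() as demo_dir:
    older = Path(demo_dir) / "model-100.ckpt"
    newer = Path(demo_dir) / "model-200.ckpt"
    older.write_text("step 100")
    newer.write_text("step 200")
    os.utime(older, (1000, 1000))  # force distinct modification times
    os.utime(newer, (2000, 2000))
    found = latest_checkpoint(demo_dir).name
print(found)  # model-200.ckpt
```

Because the training job may still be writing, a notebook would typically load the second-newest file or retry if the newest one is incomplete.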

SageMaker offers many advanced capabilities, such as profiling ML training jobs using Amazon SageMaker Debugger, and these services are easily accessible from the deep learning desktop. You can manage the training input data, training model checkpoints, training logs, and TensorBoard summaries of your local iterative development, in addition to the distributed SageMaker training jobs, all from your deep learning desktop. The integrated access to SageMaker services is the fourth key benefit for the solution described in this post.

Prerequisites

To get started, complete the following steps:

Walkthrough

The complete source and reference documentation for this solution is available in the repository accompanying this post. Following is a walkthrough of the steps.

Create a CloudFormation stack

Create a stack on the CloudFormation console in your selected AWS Region using the CloudFormation template in your cloned GitHub repository. This CloudFormation stack creates IAM resources. When you are creating a CloudFormation stack using the console, you must confirm: I acknowledge that AWS CloudFormation might create IAM resources.

To create the CloudFormation stack, you must specify values for the following input parameters (for the rest of the input parameters, default values are recommended):

  • DesktopAccessCIDR – Use the public internet address of your laptop as the base value for the CIDR.
  • DesktopInstanceType – For deep learning, the recommended value for this parameter is p3.2xlarge, or larger.
  • DesktopVpcId – Select an Amazon Virtual Private Cloud (VPC) with at least one public subnet.
  • DesktopVpcSubnetId – Select a public subnet in your VPC.
  • DesktopSecurityGroupId – The specified security group must allow inbound access over ports 22 (SSH) and 8443 (NICE DCV) from your DesktopAccessCIDR, and must allow inbound access from within the security group to port 2049 and all network ports required for distributed SageMaker training in your subnet. If you leave it blank, the automatically created security group allows inbound access for SSH and NICE DCV from your DesktopAccessCIDR, and allows inbound access to all ports from within the security group.
  • KeyName – Select your SSH key pair name.
  • S3Bucket – Specify your S3 bucket name. The bucket can be empty.
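If you prefer to script stack creation, the input parameters above could be assembled as in the sketch below. The narrowest choice for DesktopAccessCIDR is a single-host block, which is an assumption about how you want to scope access. All IDs and names in the example are placeholders.

```python
import ipaddress


def access_cidr(public_ip):
    """Return a single-host CIDR block for the given public IP address."""
    address = ipaddress.ip_address(public_ip)
    prefix = 32 if address.version == 4 else 128
    return f"{address}/{prefix}"


def stack_parameters(values):
    """Convert a plain dict into the ParameterKey/ParameterValue list CloudFormation expects."""
    return [{"ParameterKey": key, "ParameterValue": str(value)}
            for key, value in values.items()]


# Placeholder values for illustration only.
parameters = stack_parameters({
    "DesktopAccessCIDR": access_cidr("203.0.113.10"),
    "DesktopInstanceType": "p3.2xlarge",
    "DesktopVpcId": "vpc-0example",
    "DesktopVpcSubnetId": "subnet-0example",
    "KeyName": "my-key-pair",
    "S3Bucket": "my-bucket",
})
print(parameters[0])
```

The resulting list has the shape accepted by the `Parameters` argument of the boto3 CloudFormation `create_stack` call.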

Visit the documentation on all the input parameters.

Connect to the deep learning desktop

  • When the status for the stack in the CloudFormation console is CREATE_COMPLETE, find the deep learning desktop instance launched in your stack in the Amazon EC2 console.
  • Connect to the instance using SSH as user ubuntu, using your SSH key pair. When you connect using SSH, if you see the message, “Cloud init in progress. Machine will REBOOT after cloud init is complete!!”, disconnect and try again in about 15 minutes.
  • The desktop installs the NICE DCV server on first-time startup, and automatically reboots after the install is complete. If instead you see the message, “NICE DCV server is enabled!”, the desktop is ready for use.
  • Before you can connect to the desktop using the NICE DCV client, you need to set a new password for user ubuntu using the Bash command:
    sudo passwd ubuntu 
  • After you successfully set the new password for user ubuntu, exit the SSH connection. You are now ready to connect to the desktop using a suitable NICE DCV client (a non–web browser client is recommended) using the user ubuntu, and the new password.
  • NICE DCV client asks you to specify the server host and port to connect. For the server host, use the public IPv4 DNS address of the desktop Amazon EC2 instance available in Amazon EC2 console.
  • You do not need to specify the port, because the desktop is configured to use the default NICE DCV server port of 8443.
  • When you first log in to the desktop using the NICE DCV client, you will be asked if you would like to upgrade the OS version. Do not upgrade the OS version!

Develop on the deep learning desktop

When you are connected to the desktop using the NICE DCV client, use the Ubuntu Software Center to install Visual Studio Code, or your favorite IDE. To view the available Conda environments containing the popular deep learning frameworks preconfigured on the desktop, open a desktop terminal, and run the Bash command:

conda env list

The deep learning desktop instance has secure access to the S3 bucket you specified when you created the CloudFormation stack. You can verify access to the S3 bucket by running the Bash command (replace ‘your-bucket-name’ following with your S3 bucket name):

aws s3 ls your-bucket-name 

If your bucket is empty, the previous command will produce no output on success, which is normal.

An Amazon Elastic Block Store (Amazon EBS) root volume is attached to the instance. In addition, an Amazon EFS is mounted on the desktop at the value of EFSMountPath input parameter, which by default is /home/ubuntu/efs. You can use the Amazon EFS for staging deep learning input and output data.

Use SageMaker from the deep learning desktop

The deep learning desktop is preconfigured to use SageMaker. To get started with SageMaker examples in a JupyterLab notebook, run the following Bash commands in a desktop terminal:

mkdir ~/git
cd ~/git
git clone https://github.com/aws/amazon-sagemaker-examples.git
jupyter-lab

This will start a ‘jupyter-lab’ notebook server in the terminal, and open a tab in your web browser. You can explore any of the SageMaker example notebooks. We recommend starting with the example Distributed Training of Mask-RCNN in SageMaker using Amazon EFS found at the following path in the cloned repository:

advanced_functionality/distributed_tensorflow_mask_rcnn/mask-rcnn-scriptmode-efs.ipynb

The preceding SageMaker example requires you to specify a subnet and a security group. Use the preconfigured OS environment variables as follows:

import os  # needed if the notebook has not already imported it

security_group_ids = [os.environ['desktop_sg_id']]
subnets = [os.environ['desktop_subnet_id']]

Stopping and restarting the desktop

You may safely reboot, stop, and restart the desktop instance at any time. The desktop will automatically mount the Amazon EFS at restart.

Clean Up

When you no longer need the deep learning desktop, you may delete the CloudFormation stack from the CloudFormation console. Deleting the stack will shut down the desktop instance, and delete the root Amazon EBS volume attached to the desktop. The Amazon EFS is not automatically deleted when you delete the stack.

Conclusion

In this post, we showed how to launch a desktop preconfigured with the popular machine learning frameworks for research and development of deep neural networks. NICE DCV was used for high-performance visualization related to deep learning. AWS storage services were used for highly scalable access to deep learning data. Finally, Amazon SageMaker was used for the distributed training of deep learning models.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

ERGO Breaks New Frontiers for Insurance with AI Factory on AWS

Post Syndicated from Piotr Klesta original https://aws.amazon.com/blogs/architecture/ergo-breaks-new-frontiers-for-insurance-with-ai-factory-on-aws/

This post is co-authored with Piotr Klesta, Robert Meisner and Lukasz Luszczynski of ERGO

Artificial intelligence (AI) and related technologies are already finding applications in our homes, cars, industries, and offices. The insurance business is no exception to this. When AI is implemented correctly, it adds a major competitive advantage. It enhances the decision-making process, improves efficiency in operations, and provides hassle-free customer assistance.

At ERGO Group, we realized early on that innovation using AI required more flexibility in data integration than most of our legacy data architectures allowed. Our internal governance, data privacy processes, and IT security requirements posed additional challenges towards integration. We had to resolve these issues in order to use AI at the enterprise level, and allow for sensitive data to be used in a cloud environment.

We aimed for a central system that introduces ‘intelligence’ into other core application systems, and thus into ERGO’s business processes. This platform would support the process of development, training, and testing of complex AI models, in addition to creating more operational efficiency. The goal of the platform is to take the undifferentiated heavy lifting away from our data teams so that they can focus on what they do best – harness data insights.

Building ERGO AI Factory to power AI use cases

Our quest for this central system led to the creation of AI Factory built on AWS Cloud. ERGO AI Factory is a compliant platform for running production-ready AI use cases. It also provides a flexible model development and testing environment. Let’s look at some of the capabilities and services we offer to our advanced analytics teams.

Figure 1: AI Factory imperatives


  • Compliance: Enforcing security measures (for example, authentication, encryption, and least privilege) was one of our top priorities for the platform. We worked closely with the security teams to meet strict domain and geo-specific compliance requirements.
  • Data governance: Data lineage and deep metadata extraction are important because they support proper data governance and auditability. They also allow our users to navigate a complex data landscape. Our data ingestion frameworks include a mixture of third party and AWS services to capture and catalog both technical and business metadata.
  • Data storage and access: AI Factory stores data in Amazon Simple Storage Service (S3) in a secure and compliant manner. Access rights are only granted to individuals working on the corresponding projects. Roles are defined in Active Directory.
  • Automated data pipelines: We sought to provide a flexible and robust data integration solution. An ETL pipeline using Apache Spark, Apache Airflow, and Kubernetes pods is central to our data ingestion. We use this for AI model development and subsequent data preparation for operationalization and model integration.
  • Monitoring and security: AI Factory relies on open-source cloud monitoring solutions like Grafana to detect security threats and anomalies. It does this by collecting service and application logs, tracking metrics, and generating alarms.
  • Feedback loop: We store model inputs/outputs and use BI tools, such as Amazon QuickSight, to track the behavior and performance of productive AI models. It’s important to share such information with our business partners so we can build their trust and confidence with AI.
  • Developer-friendly environment: Creating AI models is possible in a notebook-style or integrated development environment. Because our data teams use a variety of machine learning (ML) frameworks and libraries, we keep our platform extensible and our framework agnostic. We support Python/R, Apache Spark, PyTorch and TensorFlow, and more. All this is bolstered by CI/CD processes that accelerate delivery and reduce errors.
  • Business process integration: AI Factory offers services to integrate ML models into existing business processes. We focus on standardizing processes and close collaboration with business and technical stakeholders. Our overarching goal is to operationalize the AI model in the shortest possible timeframe, while preserving high quality and security standards.

AI Factory architecture

So far, we have looked at the functional building blocks of the AI Factory. Let’s take an architectural view of the platform using a five-step workflow:

Figure 2: AI Factory high-level architecture

  1. Data ingestion environment: We use this environment to ingest data from the prominent on-premises ERGO data sources. We can schedule batch or delta data transfers to various cloud destinations using multiple Kubernetes-hosted microservices. Once ingested, data is persisted and cataloged in ERGO’s data lake on Amazon S3. It is prepared for processing by the downstream environments.
  2. Model development environment: This environment is used primarily by data scientists and data engineers. We use Amazon EMR and Amazon SageMaker extensively for data preparation, data wrangling, experimentation with predictive models, and development through rapid iterations.
  3. Model operationalization environment: Trained models with satisfactory KPIs are promoted from the model development to the operationalization environment. This is where we integrate AI models in business processes. The team focuses on launching and optimizing the operation of services and algorithms.
    • Integration with ERGO business processes is achieved using Kubernetes-hosted ‘Model Service.’ This allows us to infuse AI models provided by data scientists in existing business processes.
    • An essential part of model operationalization is to continuously monitor the quality of the deployed ML models using the ‘feedback loop service.’
  4. Model insights environment: This environment is used for displaying information about platform performance, processes, and analytical data. Data scientists use its services to check for unexpected bias or performance drifts that the model could exhibit. Feedback coming from the business through the ‘feedback loop service’ allows them to identify problems fast and retrain the model.
  5. Shared services: Though shown as the fifth step of the workflow, the shared services environment supports almost every step in the process. It provides common, shared components between different parts of the platform managing CI/CD and orchestration processes within the AI Factory. Additional services like platform logging and monitoring, authentication, and metadata management are also delivered from the shared services environment.

A binding theme across the various subplatforms is that all provisioning and deployment activities are automated using Infrastructure as Code (IaC) practices. This reduces the potential for human error, provides architectural flexibility, and greatly speeds up software development and our infrastructure-related operations.

All components of the AI Factory run in the AWS Cloud and can be scaled and adapted as needed. The connection between model development and operationalization happens at well-defined interfaces to prevent unnecessary coupling of components.

Lessons learned

Security first

  • Align with security early and often
  • Understand all the regulatory obligations and document them as critical, non-functional requirements

Modular approach

  • Combine modern data science technology and professional IT with a cross-functional, agile way of working
  • Apply loosely coupled services with an API-first approach

Data governance

  • Tracking technical metadata is important but not sufficient; you need business attributes too
  • Determine data ownership in operational systems to map upstream data governance workflows
  • Establish solutions to data masking as the data moves across sub-platforms
  • Define access rights and permissions boundaries among various personas

FinOps strategy

  • Carefully track platform cost
  • Assign owners responsible for monitoring and cost improvements
  • Provide regular feedback to platform stakeholders on usage patterns and associated expenses

Working with our AWS team

  • Establish cadence for architecture review and new feature updates
  • Plan cloud training and enablement

The future for the AI factory

The creation of the AI Factory was an essential building block of ERGO’s strategy. Now we are ready to embrace the next chapter in our advanced analytics journey.

We plan to focus on important use cases that will deliver the highest business value. We want to make the AI Factory available to ERGO’s international subsidiaries. We are also enhancing and scaling its capabilities. We are creating an ‘analytical content hub’ based on automated text extraction, improving speech to text, and developing translation processes for all unstructured and semistructured data using AWS AI services.

How Banks Can Use AWS to Meet Compliance

Post Syndicated from Jiwan Panjiker original https://aws.amazon.com/blogs/architecture/how-banks-can-use-aws-to-meet-compliance/

Since the 2008 financial crisis, banking supervisory institutions such as the Basel Committee on Banking Supervision (BCBS) have strengthened regulations. There is now increased oversight over the financial services industry. For banks, making the necessary changes to comply with these rules is a challenging, multi-year effort.

Basel IV, a massive update to existing rules, is due for implementation in January 2023. Basel IV standardizes the approach to calculating credit risk, increases the impact of risk-weighted assets (RWAs) and emphasizes data transparency.

Given the complexity of data, modeling, and numerous assumptions that have to be made, compliance under Basel IV implementation will be challenging. Standardization omits nuances unique to your business, which can drive up costs, but violating guidelines will result in steep penalties.

This post will address these challenges by outlining a mechanism that facilitates a healthy, data-driven dialogue between banks and regulators to better achieve compliance objectives. The reference architecture will focus on enabling fast, iterative releases with the help of serverless AWS services.

There are four key actions to take in order to support this mechanism:

  1. Automate data management
  2. Establish a continuous integration/continuous delivery (CI/CD) pipeline
  3. Enable fast, point-in-time audit replays
  4. Set up proactive monitoring and notifications

Automate data management

Due to frequent merger activity, banks are typically comprised of a web of integrated systems and siloed business units, making it difficult to consolidate data. Under Basel IV guidelines, auditors want banks to provide detailed data in a presentable way.

You can tackle this first challenge by establishing a data pipeline as shown in Figure 1. Take inventory of each data source as it is incorporated into the pipeline. Identify the critical internal and external data sources that will be used to populate the initial landing area. Amazon Simple Storage Service (S3) is a great choice for this.

Figure 1. Data pipeline that cleans, processes, and segments data

Amazon S3 is a highly available, durable service that is a popular data lake solution. S3 offers write once, read many (WORM) storage capabilities like S3 Glacier Vault Lock and S3 Object Lock to protect the integrity of your archived data in accordance with U.S. SEC and FINRA rules.

Basel IV regulations also require banks to use many attributes to develop accurate credit risk models. The attributes can be a mix of datasets such as financial statements, internal balanced scorecards, macro-economic data, and credit ratings. The risk models themselves can also be segmented by portfolio types, industry segments, asset types and much more.

You can split data into different domains and designate data owners with separate S3 buckets. Credit risk model developers, analysts, and data scientists can then use the structure of the S3 buckets to pull together relevant datasets. They can then store the outputs in S3 buckets.

To support fast, automated data retrieval, store object metadata in a highly scalable, queryable database. You can set up Amazon S3 so that an event initiates a function that populates Amazon DynamoDB. Developers can use AWS Lambda to write these functions in popular languages like Python.
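As a minimal sketch of such a function, the handler below turns an S3 event notification into one DynamoDB item per object. The table name "object-metadata" and the attribute names are assumptions made for illustration, not part of the reference architecture.

```python
from urllib.parse import unquote_plus


def metadata_items(event):
    """Turn an S3 event notification into one metadata item per object."""
    items = []
    for record in event.get("Records", []):
        s3 = record["s3"]
        items.append({
            # Object keys arrive URL-encoded in S3 event notifications.
            "object_key": unquote_plus(s3["object"]["key"]),
            "bucket": s3["bucket"]["name"],
            "size_bytes": s3["object"].get("size", 0),
            "etag": s3["object"].get("eTag", ""),
            "event_time": record.get("eventTime", ""),
        })
    return items


def lambda_handler(event, context, table=None):
    """Lambda entry point: write one DynamoDB item per object in the event."""
    if table is None:
        import boto3  # imported lazily so the parsing logic runs without AWS
        # Hypothetical table name; replace with your own metadata table.
        table = boto3.resource("dynamodb").Table("object-metadata")
    items = metadata_items(event)
    for item in items:
        table.put_item(Item=item)
    return {"stored": len(items)}
```

Passing the table as an optional argument keeps the parsing logic testable locally; in Lambda, only event and context are supplied and the table is looked up with boto3.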

With AWS Glue, you can automate extract, transform, and load (ETL) processes to clean and move data to the different S3 buckets. AWS Glue can also support data operations by automatically cataloging your various data sources.

Taking on a structured approach will simplify data governance and transparency as the business continues to grow and operate.

Establish a CI/CD pipeline

Adopt tools that machine learning teams can use to build a streamlined CI/CD solution as demonstrated in Figure 2.

Figure 2. An end-to-end machine learning development and deployment pipeline

Using tightly integrated AWS services, your teams can minimize time spent managing tools and deployment processes, and instead, focus on tuning the models and analyzing the results.

Amazon SageMaker brings together a powerful set of machine learning capabilities on the AWS Cloud. It helps data scientists and engineers build insightful models. Figure 2 depicts the high-level architecture and shows how Amazon SageMaker Pipelines helps teams orchestrate the automation and deployment processes.

The core of the pipeline uses a set of AWS deployment services so that your teams can collaborate and review effectively. With AWS CodeCommit, your teams can set up git-based repository to store and version models for data processing, training, and evaluation. The repository can also store code and configuration files using AWS CloudFormation for deployment. You can use AWS CodePipeline and AWS CodeBuild to create and update a model endpoint based on the approved/reviewed changes.

A deployment is initiated when updates are detected in the AWS CodeCommit repository or when a new model version is added to the Model Registry. Amazon S3 can be used to store generated model artifacts, historical data, and models.

Enable fast, point-in-time audit replays

Figure 3. Containers offer a lightweight, powerful solution to run audits using historical assets

One of the main themes of Basel IV is transparency. Figure 3 illustrates a solution to build trust with regulators by allowing them to verify and understand modeling activity.

A lightweight application is hosted in AWS Fargate and enables auditors to re-run Basel credit risk models under specified conditions. With AWS Fargate, you don’t need to manually manage instances or container orchestration. Configure the CPU or memory specifications at the task level and set guidelines around scalability for your service. Your tasks then scale up and down automatically, based on demand, and will optimize cost efficiency and availability.

Figure 3 shows the following:

  1. The application takes inputs such as date, release version, and model type.
  2. It then queries DynamoDB with this information.
  3. The query will return the data necessary to retrieve model artifacts from previous CI/CD deployments and relevant datasets from historical S3 buckets.
  4. Using this information, it can spin up as many containers as needed to run the model.
  5. It then stores the outputs in a separate S3 bucket.
  6. Auditors will have a detailed trace of all the attributes, assumptions, and data that went into the modeling effort. To streamline this process, the app can also compare the outputs of the historical runs to the recent replay and highlight any significant deviations.
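The lookup in step 2 and the comparison in step 6 can be sketched as follows. This is an illustration only: the key schema (model_id, run_date), the metric names, and the 1% tolerance are assumptions, not details of the application shown in Figure 3.

```python
def replay_key(model_type, release_version, run_date):
    """Build the lookup key for a past deployment (hypothetical key schema)."""
    return {"model_id": f"{model_type}#{release_version}", "run_date": run_date}


def significant_deviations(historical, replay, tolerance=0.01):
    """Return metrics whose relative change between runs exceeds tolerance."""
    flagged = {}
    for name, old in historical.items():
        new = replay.get(name)
        if new is None:
            flagged[name] = (old, None)  # metric missing from the replay
            continue
        # Relative change; fall back to absolute change when the baseline is 0.
        change = abs(new - old) / abs(old) if old else abs(new - old)
        if change > tolerance:
            flagged[name] = (old, new)
    return flagged
```

The dictionary returned by replay_key could be passed as the Key argument of a DynamoDB get_item call, and significant_deviations gives auditors a short list of metrics to investigate instead of a full output diff.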

Though internal models will be de-emphasized under Basel IV, banks will continue to run internal models as a benchmark against the broader standards. Schedule AWS Fargate tasks to run these models regularly to capitalize on highly performant compute services while minimizing costs.

Set up proactive monitoring and notifications

Figure 4. Scheduled jobs can send out notifications using Amazon SNS when certain thresholds are breached

The last principle is based around establishing an early warning system, enabling banks to take on a more proactive role in maintaining compliance.

With automated monitoring and notifications, banks will be able to respond quickly to potential concerns. For instance, there can be a daily scheduled job that launches containers and runs the models against the latest data. If any thresholds are breached, alerts can be sent out via SMS or email. Operational teams can be subscribed to certain message topics using Amazon Simple Notification Service (SNS). They can then respond before actual compliance issues emerge.
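A threshold check of this kind might look like the sketch below. The metric names and limits are hypothetical; the only AWS call used is the SNS publish operation, and the client is passed in so the logic can be exercised without an AWS account.

```python
def breached(metrics, thresholds):
    """Return {name: (value, limit)} for every metric above its limit."""
    return {
        name: (value, thresholds[name])
        for name, value in metrics.items()
        if name in thresholds and value > thresholds[name]
    }


def notify(metrics, thresholds, sns_client, topic_arn):
    """Publish one SNS message if any threshold is breached; return the breaches."""
    breaches = breached(metrics, thresholds)
    if breaches:
        lines = [f"{name}: {value} exceeds limit {limit}"
                 for name, (value, limit) in sorted(breaches.items())]
        sns_client.publish(
            TopicArn=topic_arn,
            Subject="Risk model threshold breached",
            Message="\n".join(lines),
        )
    return breaches
```

In a scheduled job you would call notify(metrics, thresholds, boto3.client("sns"), topic_arn) after each model run. Teams subscribed to the topic then receive the message by SMS or email.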

Conclusion

With a Well-Architected approach, AWS helps you control your data, deploy new features, and embrace a serverless approach. This frees you to innovate quickly and address regulatory challenges.

You can iterate with new AWS services and bring machine learning to bear on various streams of data to identify high impact pools of value. You can get a clearer picture of the data to make it easier to identify areas where you can reduce RWAs. Using Amazon S3, you can turn on AWS analytics services such as Amazon QuickSight and Amazon Athena to visualize the data. You’ll be able to fulfill reporting requirements such as those found in regulatory studies like CCAR, DFAST, CECL, and IFRS9.

For more information about establishing a data pipeline, read Lake House Formation Architecture. It is a powerful pattern that combines a few concepts that will help bring your data together cohesively. To set up a robust CI/CD pipeline, explore the AWS Serverless CI/CD Reference Architecture.

Architecting Persona-centric Data Platform with On-premises Data Sources

Post Syndicated from Raghavarao Sodabathina original https://aws.amazon.com/blogs/architecture/architecting-persona-centric-data-platform-with-on-premises-data-sources/

Many organizations are moving their data from silos and aggregating it in one location. Collecting this data in a data lake enables you to perform analytics and machine learning on that data. You can store your data in purpose-built data stores, like a data warehouse, to get quick results for complex queries on structured data.

In this post, we show how to architect a persona-centric data platform with on-premises data sources by using AWS purpose-built analytics services and Apache NiFi. We will also discuss Lake House architecture on AWS, which is the next evolution from data warehouse and data lake-based solutions.

Data movement services

AWS provides a wide variety of services to bring data into a data lake.

You may want to bring on-premises data into the AWS Cloud to take advantage of AWS purpose-built analytics services, derive insights, and make timely business decisions. Apache NiFi is an open source tool that enables you to move and process data using a graphical user interface.

For this use case and solution architecture, we use Apache NiFi to ingest data into Amazon S3 and AWS purpose-built analytics services, based on user personas.

Building persona-centric data platform on AWS

When you are building a persona-centric data platform for analytics and machine learning, you must first identify your user personas. Who will be using your platform? Then choose the appropriate purpose-built analytics services. Envision a data platform analytics architecture as a stack of seven layers:

  1. User personas: Identify your user personas for data engineering, analytics, and machine learning
  2. Data ingestion layer: Bring the data into your data platform and maintain a data lineage lifecycle view while ingesting data into your storage layer
  3. Storage layer: Store your structured and unstructured data
  4. Cataloging layer: Store your business and technical metadata about datasets from the storage layer
  5. Processing layer: Create data processing pipelines
  6. Consumption layer: Enable your user personas for purpose-built analytics
  7. Security and Governance: Protect your data across the layers

Reference architecture

The following diagram illustrates how to architect a persona-centric data platform with on-premises data sources by using AWS purpose-built analytics services and Apache NiFi.

Figure 1. Example architecture for persona-centric data platform with on-premises data sources

Architecture flow:

    1. Identify user personas: You must first identify user personas to derive insights from your data platform. Let’s start with identifying your users:
      • Enterprise data service users who would like to consume data from your data lake into their respective applications.
      • Business users who would like to create business intelligence dashboards by using your data lake datasets.
      • IT users who would like to query data from your data lake by using traditional SQL queries.
      • Data scientists who would like to run machine learning algorithms to derive recommendations.
      • Enterprise data warehouse users who would like to run complex SQL queries on your data warehouse datasets.
    2. Data ingestion layer: Apache NiFi scans the on-premises data stores and ingests the data into your data lake (Amazon S3). Apache NiFi can also transform the data in transit. It supports both Extract, Transform, Load (ETL) and Extract, Load, Transform (ELT) data transformations. Apache NiFi also supports data lineage lifecycle while ingesting data into Amazon S3.
    3. Storage layer: For your data lake storage, we recommend using Amazon S3. It is designed for 99.999999999% (11 nines) of durability and 99.99% availability. You can also create raw, transformed, and enriched storage layers depending upon your use case.
    4. Cataloging layer: AWS Lake Formation provides the central catalog to store and manage metadata for all datasets hosted in the data lake by AWS Glue Data Catalog. AWS services such as AWS Glue, Amazon EMR, and Amazon Athena natively integrate with Lake Formation. They automate discovering and registering dataset metadata into the Lake Formation catalog.
    5. Processing layer: Amazon EMR processes your raw data and places it into a new S3 bucket. Use AWS Glue DataBrew and AWS Glue to process the data as needed.
    6. Consumption layer or persona-centric analytics: Once data is transformed:
      • AWS Lambda and Amazon API Gateway will allow you to develop data services for enterprise data service users
      • You can develop user-friendly dashboards for your business users using Amazon QuickSight
      • Use Amazon Athena to query transformed data for your IT users
      • Your data scientists can utilize AWS Glue DataBrew to clean and normalize the data and Amazon SageMaker for machine learning models
      • Your enterprise data warehouse users can use Amazon Redshift to derive business intelligence
    7. Security and governance layer: AWS IAM provides user-, group-, and role-level identity, in addition to the ability to configure coarse-grained access control for resources managed by AWS services in all layers. AWS Lake Formation provides fine-grained access controls, so you can grant or revoke permissions at the database, table, or column level.
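To make the column-level case concrete, here is a minimal sketch that builds the arguments for the Lake Formation grant_permissions API call. The role ARN, database, table, and column names in the usage note are hypothetical.

```python
def column_grant(principal_arn, database, table, columns):
    """Build keyword arguments for Lake Formation's grant_permissions call."""
    if not columns:
        raise ValueError("list at least one column to grant access to")
    return {
        "Principal": {"DataLakePrincipalIdentifier": principal_arn},
        "Resource": {
            "TableWithColumns": {
                "DatabaseName": database,
                "Name": table,
                "ColumnNames": list(columns),
            }
        },
        # SELECT limits the principal to reading the listed columns.
        "Permissions": ["SELECT"],
    }
```

With boto3 installed and credentials configured, the grant could be issued as boto3.client("lakeformation").grant_permissions(**column_grant(role_arn, "sales_db", "orders", ["order_id", "amount"])).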

Lake House architecture on AWS

The vast majority of data lakes are built on Amazon S3. At the same time, customers are leveraging purpose-built analytics stores that are optimized for specific use cases. Customers want the freedom to move data between their centralized data lakes and the surrounding purpose-built analytics stores. And they want to get insights with speed and agility in a seamless, secure, and compliant manner. We call this modern approach to analytics the Lake House architecture.

Figure 2. Lake House architecture on AWS

Refer to the whitepaper Derive Insights from AWS Lake House for various design patterns to derive persona-centric analytics by using the AWS Lake House approach. Check out the blog post Build a Lake House Architecture on AWS for a Lake House reference architecture on AWS.

Conclusion

In this post, we showed you how to build a persona-centric data platform on AWS with a seven-layered approach. This uses Apache NiFi as a data ingestion tool and AWS purpose-built analytics services for persona-centric analytics and machine learning. We have also shown how to build persona-centric analytics by using the AWS Lake House approach.

With the information in this post, you can now build your own data platform on AWS to gain faster and deeper insights from your data. AWS provides you the broadest and deepest portfolio of purpose-built analytics and machine learning services to support your business needs.

Field Notes: Develop Data Pre-processing Scripts Using Amazon SageMaker Studio and an AWS Glue Development Endpoint

Post Syndicated from Sam Mokhtari original https://aws.amazon.com/blogs/architecture/field-notes-develop-data-pre-processing-scripts-using-amazon-sagemaker-studio-and-an-aws-glue-development-endpoint/

This post was co-written with Marcus Rosen, a Principal – Machine Learning Operations with Rio Tinto, a global mining company.

Data pre-processing is an important step in setting up Machine Learning (ML) projects for success. Many AWS customers use Apache Spark on AWS Glue or Amazon EMR to run data pre-processing scripts while using Amazon SageMaker to build ML models. To develop Spark scripts in AWS Glue, you can create an environment called a Glue Development (Dev) Endpoint that lets you author and test your data pre-processing scripts iteratively. When you’re satisfied with the results of your development, you can create a Glue ETL job that runs the final script as part of your automation framework.

With the introduction of Amazon SageMaker Studio at AWS re:Invent 2019, you can now use a single web-based IDE to spin up a notebook and perform all ML development steps. These include data pre-processing, ML model training, ML model deployment and monitoring.

This post walks you through how to connect a SageMaker Studio notebook to an AWS Glue Dev Endpoint, so you can use a single tool to iteratively develop both data pre-processing scripts and ML models.

Solution Overview

The following diagram shows the components that are used in this solution.

  • First, we use an AWS CloudFormation template to set up the required networking components (for example, VPC, subnets).
  • Then, we create an AWS Glue Dev Endpoint and use a security group to allow SageMaker Studio to securely access the endpoint.
  • Finally, we create a Studio domain and use a SparkMagic kernel to connect to the AWS Glue Dev Endpoint and run Spark scripts.

In the Amazon SageMaker Studio notebook, SparkMagic will call a REST API against a Livy server running on the AWS Glue Dev Endpoint. Apache Livy is a service that enables interaction with a remote Spark cluster over a REST API.

 

Set up the VPC

You can use the following CloudFormation template to set up the environment needed for this solution.

This template deploys the following resources in your account:

  • A new VPC, with both public and private subnets
  • VPC endpoints
  • Security groups for SageMaker Studio, Glue endpoint and VPC endpoints
  • SageMaker Service IAM role
  • AWS Glue Dev Endpoint IAM role

Set up AWS Glue Dev Endpoint

Review this Developer Guide: Adding a Development Endpoint for instructions to create an AWS Glue Dev Endpoint.

Note: you must use the AWS Glue Dev Endpoint IAM role provisioned by the CloudFormation template.

  • In the Networking section, select Choose a VPC, subnet, and security groups.

Then choose the VPC glue security group, which you provisioned through the CloudFormation template.

The AWS Glue Dev Endpoint needs to be secured with an SSH public key, which should be generated within your local environment. An SSH key pair (public/private) can be generated using ssh-keygen on Linux or PuTTYgen on Windows.

Glue Dev Endpoint screenshot

The final review page looks similar to the following screenshot.

Final review page

Once the AWS Glue Dev Endpoint is in Ready status, keep note of its private IP address (Glue -> ETL -> Dev Endpoints). You will use this IP for the Livy port forwarding.

Set up SageMaker Studio

We recommend launching the SageMaker Studio resource by following the instructions in Securing Amazon SageMaker Studio connectivity using a private VPC.

Follow these steps when you provision the SageMaker Studio resources:

  • Select Standard setup with the AWS Identity and Access Management (IAM) authentication method.
  • Attach the SageMaker Service IAM role created by the CloudFormation template to SageMaker Studio.
  • Under Network and storage, select the same VPC and private subnet as the AWS Glue endpoint.
  • For the Network Access for Studio option, select VPC Only — SageMaker Studio will use your VPC. Direct internet access is disabled.

Then ensure that the security group with the self-referencing rule is attached. Also check that the other required security groups for SageMaker Studio, listed in the CloudFormation template output, are attached.

Connect the SageMaker Studio notebook to the AWS Glue Dev Endpoint

Once you launch SageMaker Studio and add users, follow these steps to connect the SageMaker Studio notebook to the AWS Glue Dev Endpoint:

  1. Open Studio and go to the launcher page (by pressing the “+” icon on the top-left of the page).
  2. Under Notebooks and compute resources, select SparkMagic in the dropdown menu and select Notebook.
  3. Then open another launcher page, select SparkMagic in the same dropdown menu and select Image terminal. One thing to note is that the SparkMagic app will take some time to initialize. Proceed once the apps are in Ready status (2-3 minutes).

Notebooks and compute resources screenshot

4. Upload the private key into the SparkMagic image terminal. In other words, copy the private key to the “.ssh” directory and update its permissions using “chmod 400”.

Note: the private key must correspond to the public key you used when you created the AWS Glue Dev Endpoint.

5. Now, set up port forwarding of the Livy service so that the SparkMagic kernel can connect to the AWS Glue Dev Endpoint. Run the following command in the image terminal:

/usr/bin/ssh -4 -N -o ServerAliveInterval=60 -o ServerAliveCountMax=3 -o StrictHostKeyChecking=no -i /root/.ssh/{PRIVATE_KEY} -L 8998:169.254.76.1:8998 glue@{GLUE_ENDPOINT_PRIVATE_IP_ADDRESS}

In this command:

  • {PRIVATE_KEY} is the private key file name that you copied into the .ssh directory.
  • {GLUE_ENDPOINT_PRIVATE_IP_ADDRESS} is the private IP address of the AWS Glue Dev Endpoint.
  • “8998” is the Livy port we are using for port forwarding.
  • “169.254.76.1” is the remote IP address defined by AWS Glue. This IP address does not change.

Note: Keep this terminal open and the SSH command running in order to keep the Livy session active.
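With the tunnel running, the Livy REST API is reachable on local port 8998, and this is what the SparkMagic kernel talks to. The sketch below only builds the two core Livy requests (create a session, submit a statement) with the Python standard library. It is meant to illustrate the protocol, not to replace SparkMagic, and the helper names are ours.

```python
import json
from urllib.request import Request

LIVY_URL = "http://localhost:8998"  # local end of the SSH tunnel


def livy_request(path, payload):
    """Build a JSON POST request for the Livy REST API without sending it."""
    return Request(
        LIVY_URL + path,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def new_session():
    """Request a PySpark session: POST /sessions."""
    return livy_request("/sessions", {"kind": "pyspark"})


def run_statement(session_id, code):
    """Submit code to an existing session: POST /sessions/{id}/statements."""
    return livy_request(f"/sessions/{session_id}/statements", {"code": code})
```

To send a request, pass it to urllib.request.urlopen while the SSH command is still running. The JSON response to the session request contains the session id to use for later statements.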

6. Go to the SparkMagic notebook and restart the kernel, by going to the top menu and selecting Kernel > Restart Kernel.

7. Once the notebook kernel is restarted, the connection between the Studio Notebook and the AWS Glue Dev Endpoint is ready. To test the integration, you can run the following example command to list the tables in the AWS Glue Data Catalog.

spark.sql("show tables").show()

Cleaning up

To avoid incurring future charges, delete the resources you created.

Conclusion

Our customers needed a single web-based IDE to spin up a notebook and perform all ML development steps including data pre-processing, ML model training, ML model deployment and monitoring. This blog post demonstrated how you can configure a SageMaker Studio notebook and connect to AWS Glue Dev Endpoint. This provides a framework for you to use when developing both data preprocessing scripts and ML models.

To learn more about how to develop data pre-processing scripts and ML models in Amazon SageMaker, you can check out the examples in this repository.

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

 

 

Building a Cloud-based OLAP Cube and ETL Architecture with AWS Managed Services

Post Syndicated from Peter Chung original https://aws.amazon.com/blogs/architecture/building-a-cloud-based-olap-cube-and-etl-architecture-with-aws-managed-services/

For decades, enterprises used online analytical processing (OLAP) workloads to answer complex questions about their business by filtering and aggregating their data. These complex queries were compute and memory-intensive. This required teams to build and maintain complex extract, transform, and load (ETL) pipelines to model and organize data, oftentimes with commercial-grade analytics tools.

In this post, we discuss building a cloud-based OLAP cube and ETL architecture that will yield faster results at lower costs without sacrificing performance by:

  • Connecting your on-premises database to the cloud for data profiling, discovery, and transformation
  • Running OLAP workloads without costly third-party software licenses, dedicated infrastructure, or the need to migrate data
  • Using AWS Glue Data Catalog, Amazon Athena, Amazon QuickSight, and Amazon SageMaker to catalog and visualize data with machine learning (ML)

Data analytics pipeline with AWS Managed Services

The proposed architecture in Figure 1 relies on AWS Managed Services. AWS Glue DataBrew is a no-code data transformation service that you can use to quickly build your transformation jobs. AWS Glue crawlers collect metadata from the transformed data and catalog it for analytics and visualization using Athena and QuickSight. SageMaker will build, train, and deploy ML models.

This architecture will help you get answers from your data to your users as fast as possible without needing to migrate your data to AWS. There is no coding required, so you can leverage data transformation, cataloging, analytics, and ML quickly.

Figure 1. Example architecture using AWS Managed Services

Benefits of AWS Managed Services for data analytics

Immediate connectivity to on-premises databases

The example architecture in Figure 1 begins with an online transaction processing (OLTP) database running in your corporate data center. Figure 2 shows how you can establish a Java database connectivity (JDBC) connection from the OLTP database to DataBrew running in AWS to run OLAP workloads. DataBrew supports data sources using JDBC for common data stores such as Microsoft SQL Server, MySQL, Oracle, and PostgreSQL.

Figure 2. DataBrew – JDBC connection to data source

Automatic data discovery

Figures 3 through 6 show how DataBrew summarizes your data for discovery. You can profile your data to understand patterns and detect anomalies. You can also run transformations called “jobs” in DataBrew without writing any code using over 250 built-in transforms.

Figure 3. DataBrew – dataset profiling overview

 

DataBrew - data correlation patterns

Figure 4. DataBrew – data correlation patterns

 

DataBrew - data points distribution

Figure 5. DataBrew – data points distribution

No-code data transformation and cataloging

To run OLAP-type transactions, you can create jobs based on the transformation steps shown in Figure 6. Collectively, these steps are referred to as a DataBrew recipe. You can run a recipe as a job and write its output to an Amazon Simple Storage Service (Amazon S3) bucket.

Figure 6. A DataBrew project user interface view with sample data and transformation functions
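
Although the post builds recipes in the console, the same idea can be sketched with the DataBrew API: a recipe is an ordered list of steps, and a recipe job applies it to a dataset and writes the result to Amazon S3. The column names, the "curated/" key prefix, and the recipe version "1.0" are hypothetical, and the operation names and parameters are assumptions to verify against the DataBrew recipe actions reference.

```python
def build_recipe_steps():
    """Return an illustrative two-step recipe (column names are hypothetical)."""
    return [
        {
            "Action": {
                "Operation": "RENAME",
                "Parameters": {"sourceColumn": "cust_nm", "targetColumn": "customer_name"},
            }
        },
        {
            "Action": {
                "Operation": "UPPER_CASE",
                "Parameters": {"sourceColumn": "customer_name"},
            }
        },
    ]


def build_recipe_job_request(job_name, dataset_name, recipe_name, role_arn, bucket):
    """Build keyword arguments for the DataBrew CreateRecipeJob API."""
    return {
        "Name": job_name,
        "DatasetName": dataset_name,
        # Assumption: "1.0" is the first published version of the recipe.
        "RecipeReference": {"Name": recipe_name, "RecipeVersion": "1.0"},
        "RoleArn": role_arn,
        "Outputs": [
            {
                "Location": {"Bucket": bucket, "Key": f"curated/{job_name}/"},
                "Format": "PARQUET",
            }
        ],
    }


def create_recipe_and_job(recipe_name, steps, job_request):
    """Create and publish the recipe, then create and start the job.

    Requires boto3 and AWS credentials.
    """
    import boto3  # imported lazily so the builders above run offline

    client = boto3.client("databrew")
    client.create_recipe(Name=recipe_name, Steps=steps)
    client.publish_recipe(Name=recipe_name)
    client.create_recipe_job(**job_request)
    return client.start_job_run(Name=job_request["Name"])["RunId"]
```

Writing the output as Parquet is a design choice for this sketch: a columnar format generally keeps later Athena queries cheaper than row-oriented CSV.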

Scheduled DataBrew jobs act similarly to scheduled ETL pipelines in OLAP. Based on data refresh and business requirements, DataBrew can run a job on a recurring basis (for example, every 12 hours), at a particular time of day, or as defined by a valid cron expression. This helps you automate your transformation workflows.
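
The "every 12 hours" example can be expressed as a schedule through the DataBrew CreateSchedule API. The sketch below assumes the six-field AWS cron syntax (minutes, hours, day of month, month, day of week, year) wrapped in "Cron(...)"; the exact accepted form should be checked against the DataBrew documentation, and the schedule and job names are hypothetical.

```python
def every_n_hours(hours):
    """Return a cron expression that fires at minute 0 every `hours` hours."""
    if not 1 <= hours <= 23:
        raise ValueError("hours must be between 1 and 23")
    # Fields: minutes, hours, day-of-month, month, day-of-week, year.
    return f"Cron(0 0/{hours} * * ? *)"


def build_schedule_request(schedule_name, job_names, cron_expression):
    """Build keyword arguments for the DataBrew CreateSchedule API."""
    return {
        "Name": schedule_name,
        "JobNames": list(job_names),
        "CronExpression": cron_expression,
    }


def create_schedule(request):
    """Create the schedule. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builders above run offline

    return boto3.client("databrew").create_schedule(**request)
```

One schedule can drive several jobs, so jobs that should refresh together can share a single cron expression.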

The OLAP catalog is a set of metadata that sits between the stored OLAP data and applications. To create a Data Catalog, you can use AWS Glue crawlers to automatically classify your data and determine its format, schema, and associated properties. Figure 7 shows a crawler’s results written to the Data Catalog as metadata to help data users find the data they need.

Figure 7. AWS Glue crawler metadata table output of column names and data types
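
A crawler like the one behind Figure 7 could be defined with the AWS Glue CreateCrawler API, pointed at the S3 prefix where the DataBrew job writes its output. This is a minimal sketch: the crawler name, role ARN, database name, and S3 path are hypothetical.

```python
def build_crawler_request(name, role_arn, database_name, s3_path):
    """Build keyword arguments for the AWS Glue CreateCrawler API."""
    if not s3_path.startswith("s3://"):
        raise ValueError("s3_path must look like s3://bucket/prefix/")
    return {
        "Name": name,
        "Role": role_arn,
        # Data Catalog database that receives the discovered tables.
        "DatabaseName": database_name,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }


def run_crawler(request):
    """Create the crawler and start one crawl. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs offline

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
```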

Data analytics without third-party software licenses

You can run analytics on your data using Athena, which uses the metadata definitions in the Data Catalog as references to the actual data in Amazon S3. Athena is well suited for running one-time queries using standard SQL to query the transformed data directly in Amazon S3 without having to move data around. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run.
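
To make the one-time query pattern concrete, the following sketch submits a SQL statement to Athena against a Data Catalog database and waits for it to finish. The database, table, column names, and results location are hypothetical, and the polling loop is a simple illustration rather than a production-ready client.

```python
import time

# Hypothetical OLAP-style aggregate over a cataloged table.
SAMPLE_SQL = (
    "SELECT region, SUM(amount) AS total_sales "
    "FROM orders GROUP BY region ORDER BY total_sales DESC"
)


def build_query_request(sql, database, output_location):
    """Build keyword arguments for the Athena StartQueryExecution API."""
    return {
        "QueryString": sql,
        "QueryExecutionContext": {"Database": database},
        # Athena writes query results to this S3 location.
        "ResultConfiguration": {"OutputLocation": output_location},
    }


def run_query(request, poll_seconds=2.0):
    """Run the query and return its result rows. Requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above runs offline

    athena = boto3.client("athena")
    query_id = athena.start_query_execution(**request)["QueryExecutionId"]
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]["Status"]
        if status["State"] in {"SUCCEEDED", "FAILED", "CANCELLED"}:
            break
        time.sleep(poll_seconds)
    if status["State"] != "SUCCEEDED":
        raise RuntimeError(status.get("StateChangeReason", status["State"]))
    return athena.get_query_results(QueryExecutionId=query_id)["ResultSet"]["Rows"]
```

Because Athena bills per query, selecting only the columns an analysis needs is usually cheaper than selecting every column.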

Enterprises often supplement their OLAP workloads with separate visualization and business intelligence (BI) tools. These tools often come with their own licensing, server management, and security considerations.

You can visualize curated data using QuickSight, a scalable, serverless, embeddable, ML-powered BI service. QuickSight lets you easily create and publish interactive BI dashboards that include ML-powered insights, as shown in Figure 8. These dashboards can be shared with other users and embedded within your own applications.

Figure 8. A sample of data visualization options with Amazon QuickSight

Finally, you can incorporate ML workloads into OLAP workloads using SageMaker. In the past, ML workloads were often expensive, resource-intensive, and inaccessible. SageMaker provides a fully managed ML service to quickly and easily build and train ML models and directly deploy them into a production-ready hosted environment.

Conclusion

In this post, we showed you how to connect your on-premises database to DataBrew using a JDBC connection for data profiling, discovery, and transformation. We looked at how you can use DataBrew recipes and jobs to run OLAP workloads without costly third-party software licenses, dedicated infrastructure, or the need to migrate any data. We also looked at AWS capabilities in data cataloging, visualization, and machine learning using Data Catalog, Athena, QuickSight, and SageMaker without having to manage any servers.

Laying the foundation to modernize an analytics workflow is critical for many enterprises that are looking to reduce the time it takes to understand their business. With AWS, you can perform enterprise-scale analytics with our portfolio of analytics services.