Tag Archives: Amazon Neptune

Analyze large amounts of graph data to get insights and find trends with Amazon Neptune Analytics

Post Syndicated from Sébastien Stormacq original https://aws.amazon.com/blogs/aws/introducing-amazon-neptune-analytics-a-high-performance-graph-analytics/

I am happy to announce the general availability of Amazon Neptune Analytics, a new analytics database engine that makes it faster for data scientists and application developers to analyze large amounts of graph data. With Neptune Analytics, you can quickly load your dataset from Amazon Neptune or your data lake on Amazon Simple Storage Service (Amazon S3), run your analysis tasks in near real time, and optionally terminate your graph afterward.

Graph data enables the representation and analysis of intricate relationships and connections within diverse data domains. Common applications include social networks, where it aids in identifying communities, recommending connections, and analyzing information diffusion. In supply chain management, graphs facilitate efficient route optimization and bottleneck identification. In cybersecurity, they reveal network vulnerabilities and identify patterns of malicious activity. Graph data finds application in knowledge management, financial services, digital advertising, and network security, performing tasks such as identifying money laundering networks in banking transactions and predicting network vulnerabilities.

Since the launch of Neptune in May 2018, thousands of customers have embraced the service for storing their graph data and performing updates and deletions on specific subsets of the graph. However, analyzing data for insights often involves loading the entire graph into memory. For instance, a financial services company aiming to detect fraud may need to load and correlate all historical account transactions.

Performing analyses on extensive graph datasets, such as running common graph algorithms, requires specialized tools. Utilizing separate analytics solutions demands the creation of intricate pipelines to transfer data for processing, which is challenging to operate, time-consuming, and prone to errors. Furthermore, loading large datasets from existing databases or data lakes to a graph analytic solution can take hours or even days.

Neptune Analytics offers a fully managed graph analytics experience. It takes care of the infrastructure heavy lifting, enabling you to concentrate on problem-solving through queries and workflows. Neptune Analytics automatically allocates compute resources according to the graph’s size and quickly loads all the data in memory to run your queries in seconds. Our initial benchmarking shows that Neptune Analytics loads data from Amazon S3 up to 80x faster than existing AWS solutions.

Neptune Analytics supports 5 families of algorithms covering 15 different algorithms, each with multiple variants. For example, we provide algorithms for path-finding, detecting communities (clustering), identifying important data (centrality), and quantifying similarity. Path-finding algorithms are used for use cases such as route planning for supply chain optimization. Centrality algorithms like page rank identify the most influential sellers in a graph. Algorithms like connected components, clustering, and similarity algorithms can be used for fraud-detection use cases to determine whether the connected network is a group of friends or a fraud ring formed by a set of coordinated fraudsters.

Neptune Analytics facilitates the creation of graph applications using openCypher, presently one of the most widely adopted graph query languages. Developers, business analysts, and data scientists appreciate openCypher’s SQL-inspired syntax, finding it familiar and structured for composing graph queries.
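To give a flavor of what this looks like in practice, here is a minimal sketch that runs one of the centrality algorithms through openCypher from Python. It assumes the boto3 neptune-graph client and its execute_query operation, and that PageRank is exposed as a neptune.algo procedure similar to the vector search call shown later in this post; confirm the exact procedure name, signature, and SDK parameters in the Neptune Analytics documentation before relying on them.

# A minimal sketch of invoking a Neptune Analytics algorithm from Python.
# Assumptions: boto3 exposes a "neptune-graph" client with execute_query, and
# PageRank is callable as neptune.algo.pageRank; confirm both against the
# current Neptune Analytics documentation.
import json
import boto3

client = boto3.client("neptune-graph", region_name="us-east-1")

query = """
MATCH (n)
CALL neptune.algo.pageRank(n)   // assumed procedure name and signature
YIELD rank
RETURN n.name AS name, rank
ORDER BY rank DESC
LIMIT 10
"""

response = client.execute_query(
    graphIdentifier="g-1234567890",  # placeholder graph identifier
    queryString=query,
    language="OPEN_CYPHER",
)
print(json.loads(response["payload"].read()))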

Let’s see it at work
As we usually do on the AWS News blog, let’s show how it works. For this demo, I first navigate to Neptune in the AWS Management Console. There is a new Analytics section on the left navigation pane. I select Graphs and then Create graph.

Neptune Analytics - create graph 1

On the Create graph page, I enter the details of my graph analytics database engine. I won’t detail each parameter here; their names are self-explanatory.

Neptune Analytics - Create graph 1

Pay attention to Allow from public: the vast majority of the time, you want to keep your graph reachable only from within the boundaries of your VPC. I also create a Private endpoint to allow private access from machines and services inside my account VPC network.

Neptune Analytics - Create graph 2

In addition to network access control, users will need proper IAM permissions to access the graph.

Finally, I enable Vector search to perform similarity search using embeddings in the dataset. The dimension of the vector depends on the large language model (LLM) that you use to generate the embedding.

Neptune Analytics - Create graph 3

When I am ready, I select Create graph (not shown here).

After a few minutes, my graph is available. Under Connectivity & security, I take note of the Endpoint. This is the DNS name I will use later to access my graph from my applications.

I can also create Replicas. A replica is a warm standby copy of the graph in another Availability Zone. You might decide to create one or more replicas for high availability. By default, we create one replica, and depending on your availability requirements, you can choose not to create replicas.

Neptune Analytics - create graph 3

Business queries on graph data
Now that the Neptune Analytics graph is available, let’s load and analyze data. For the rest of this demo, imagine I’m working in the finance industry.

I have a dataset obtained from the US Securities and Exchange Commission (SEC). This dataset contains the list of positions held by investors that have more than $100 million in assets. Here is a diagram to illustrate the structure of the dataset I use in this demo.

Neptune graph analytics - dataset structure

I want to get a better understanding of the positions held by one investment firm (let’s name it “Seb’s Investments LLC”). I wonder what its top five holdings are and who else holds more than $1 billion in the same companies. I am also curious to know which other investment companies have a portfolio similar to Seb’s Investments LLC.

To start my analysis, I create a Jupyter notebook in the Neptune section of the AWS Management Console. In the notebook, I first define my analytics endpoint and load the data set from an S3 bucket. It takes only 18 seconds to load 17 million records.

Neptune Analytics - load data

Then, I start to explore the dataset using openCypher queries. I start by defining my parameters:

params = {'name': "Seb's Investments LLC", 'quarter': '2023Q4'}

First, I want to know what the top five holdings are for Seb’s Investments LLC in this quarter and who else holds more than $1 billion in the same companies. In openCypher, it translates to the query hereafter. The $name parameter’s value is “Seb’s Investments LLC” and the $quarter parameter’s value is 2023Q4.

MATCH p=(h:Holder)-->(hq1)-[o:owns]->(holding)
WHERE h.name = $name AND hq1.name = $quarter
WITH DISTINCT holding as holding, o ORDER BY o.value DESC LIMIT 5
MATCH (holding)<-[o2:owns]-(hq2)<--(coholder:Holder)
WHERE hq2.name = '2023Q4'
WITH sum(o2.value) AS totalValue, coholder, holding
WHERE totalValue > 1000000000
RETURN coholder.name, collect(holding.name)

Neptune Analytics - query 1

Then, I want to know the top five other companies that have holdings similar to “Seb’s Investments LLC.” I use the topKByNode() function to perform a vector search.

MATCH (n:Holder)
WHERE n.name = $name
CALL neptune.algo.vectors.topKByNode(n)
YIELD node, score
WHERE score > 0
RETURN node.name LIMIT 5

This query identifies a specific Holder node with the name “Seb’s Investments LLC.” Then, it utilizes the Neptune Analytics custom vector similarity search algorithm on the embedding property of the Holder node to find other nodes in the graph that are similar. The results are filtered to include only those with a positive similarity score, and the query finally returns the names of up to five related nodes.
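If you prefer to run these parameterized queries from plain Python instead of the notebook, a sketch along the following lines works. It reuses the same neptune-graph client assumed in the earlier sketch, and it assumes execute_query accepts a parameters map for the $name placeholder; if your SDK version does not, substitute the values into the query string instead.

# A sketch of submitting the parameterized vector search query from Python.
# The graph identifier is a placeholder, and the `parameters` argument is an
# assumption about the execute_query API; fall back to string substitution
# if your SDK version does not support it.
import json
import boto3

client = boto3.client("neptune-graph", region_name="us-east-1")

params = {"name": "Seb's Investments LLC", "quarter": "2023Q4"}

query = """
MATCH (n:Holder)
WHERE n.name = $name
CALL neptune.algo.vectors.topKByNode(n)
YIELD node, score
WHERE score > 0
RETURN node.name LIMIT 5
"""

response = client.execute_query(
    graphIdentifier="g-1234567890",  # placeholder graph identifier
    queryString=query,
    language="OPEN_CYPHER",
    parameters=params,
)
print(json.loads(response["payload"].read()))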

Neptune Analytics - query 2

Pricing and availability
Neptune Analytics is available today in seven AWS Regions: US East (Ohio, N. Virginia), US West (Oregon), Asia Pacific (Singapore, Tokyo), and Europe (Frankfurt, Ireland).

AWS charges for the usage on a pay-as-you-go basis, with no recurring subscriptions or one-time setup fees.

Pricing is based on configurations of memory-optimized Neptune capacity units (m-NCU). Each m-NCU corresponds to one hour of compute and networking capacity and 1 GiB of memory. You can choose configurations starting with 128 m-NCUs and up to 4096 m-NCUs. In addition to m-NCU, storage charges apply for graph snapshots.
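To make the capacity math concrete, here is a small back-of-the-envelope sketch. The per-hour price of an m-NCU varies by Region and is not quoted in this post, so the value below is a placeholder; take the real rate from the Neptune pricing page.

# Back-of-the-envelope monthly cost estimate for a Neptune Analytics graph.
# PRICE_PER_M_NCU_HOUR is a placeholder, not an actual AWS price; look up the
# per-Region rate on the Neptune pricing page. Snapshot storage is excluded.
PRICE_PER_M_NCU_HOUR = 0.10  # placeholder USD value

def estimate_monthly_cost(m_ncus, hours=730):
    return m_ncus * hours * PRICE_PER_M_NCU_HOUR

for size in (128, 256, 1024, 4096):  # valid configurations range from 128 to 4096 m-NCUs
    print(f"{size:>5} m-NCUs ~= ${estimate_monthly_cost(size):,.2f} per month")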

I invite you to read the Neptune pricing page for more details.

Neptune Analytics is a new analytics database engine to analyze large graph datasets. It helps you discover insights faster for use cases such as fraud detection and prevention, digital advertising, cybersecurity, transportation logistics, and bioinformatics.

Get started
Log in to the AWS Management Console to give Neptune Analytics a try.

— seb

Harmonize data using AWS Glue and AWS Lake Formation FindMatches ML to build a customer 360 view

Post Syndicated from Nishchai JM original https://aws.amazon.com/blogs/big-data/harmonize-data-using-aws-glue-and-aws-lake-formation-findmatches-ml-to-build-a-customer-360-view/

In today’s digital world, data is generated by a large number of disparate sources and growing at an exponential rate. Companies are faced with the daunting task of ingesting all this data, cleansing it, and using it to provide outstanding customer experience.

Typically, companies ingest data from multiple sources into their data lake to derive valuable insights from the data. These sources are often related but use different naming conventions, which will prolong cleansing, slowing down the data processing and analytics cycle. This problem particularly impacts companies trying to build accurate, unified customer 360 profiles. There are customer records in this data that are semantic duplicates, that is, they represent the same user entity, but have different labels or values. It’s commonly referred to as a data harmonization or deduplication problem. The underlying schemas were implemented independently and don’t adhere to common keys that can be used for joins to deduplicate records using deterministic techniques. This has led to so-called fuzzy deduplication techniques to address the problem. These techniques utilize various machine learning (ML) based approaches.

In this post, we look at how we can use AWS Glue and the AWS Lake Formation ML transform FindMatches to harmonize (deduplicate) customer data coming from different sources to get a complete customer profile to be able to provide better customer experience. We use Amazon Neptune to visualize the customer data before and after the merge and harmonization.

Overview of solution

In this post, we go through the various steps to apply ML-based fuzzy matching to harmonize customer data across two different datasets for auto and property insurance. These datasets are synthetically generated and represent a common problem for entity records stored in multiple, disparate data sources with their own lineage that appear similar and semantically represent the same entity but don’t have matching keys (or keys that work consistently) for deterministic, rule-based matching. The following diagram shows our solution architecture.

We use an AWS Glue job to transform the auto insurance and property insurance customer source data to create a merged dataset containing fields that are common to both datasets (identifiers) that a human expert (data steward) would use to determine semantic matches. The merged dataset is then used to deduplicate customer records using an AWS Glue ML transform to create a harmonized dataset. We use Neptune to visualize the customer data before and after the merge and harmonization to see how the FindMatches transform can bring all related customer data together to get a complete customer 360 view.

To demonstrate the solution, we use two separate data sources: one for property insurance customers and another for auto insurance customers, as illustrated in the following diagram.

The data is stored in an Amazon Simple Storage Service (Amazon S3) bucket, labeled as Raw Property and Auto Insurance data in the following architecture diagram. The diagram also describes detailed steps to process the raw insurance data into harmonized insurance data to avoid duplicates and build logical relations with related property and auto insurance data for the same customer.

The workflow includes the following steps:

  1. Catalog the raw property and auto insurance data, using an AWS Glue crawler, as tables in the AWS Glue Data Catalog.
  2. Transform raw insurance data into CSV format acceptable to Neptune Bulk Loader, using an AWS Glue extract, transform, and load (ETL) job.
  3. When the data is in CSV format, use an Amazon SageMaker Jupyter notebook to run a PySpark script to load the raw data into Neptune and visualize it in a Jupyter notebook.
  4. Run an AWS Glue ETL job to merge the raw property and auto insurance data into one dataset and catalog the merged dataset. This dataset will have duplicates and no relations are built between the auto and property insurance data.
  5. Create and train an AWS Glue ML transform to harmonize the merged data to remove duplicates and build relations between the related data.
  6. Run the AWS Glue ML transform job. The job also catalogs the harmonized data in the Data Catalog and transforms the harmonized insurance data into CSV format acceptable to Neptune Bulk Loader.
  7. When the data is in CSV format, use a Jupyter notebook to run a PySpark script to load the harmonized data into Neptune and visualize it in a Jupyter notebook.

Prerequisites

To follow along with this walkthrough, you must have an AWS account. Your account should have permission to provision and run an AWS CloudFormation script to deploy the AWS services mentioned in the architecture diagram of the solution.

Provision required resources using AWS CloudFormation

To launch the CloudFormation stack that configures the required resources for this solution in your AWS account, complete the following steps:

  1. Log in to your AWS account and choose Launch Stack:

  2. Follow the prompts on the AWS CloudFormation console to create the stack.
  3. When the launch is complete, navigate to the Outputs tab of the launched stack and note all the key-value pairs of the resources provisioned by the stack.

Verify the raw data and script files S3 bucket

On the CloudFormation stack’s Outputs tab, choose the value for S3BucketName. The S3 bucket name should be cloud360-s3bucketstack-xxxxxxxxxxxxxxxxxxxxxxxx and should contain folders similar to the following screenshot.

The following are some important folders:

  • auto_property_inputs – Contains raw auto and property data
  • merged_auto_property – Contains the merged data for auto and property insurance
  • output – Contains the delimited files (separate subdirectories)

Catalog the raw data

To help walk through the solution, the CloudFormation stack created and ran an AWS Glue crawler to catalog the property and auto insurance data. To learn more about creating and running AWS Glue crawlers, refer to Working with crawlers on the AWS Glue console. You should see the following tables created by the crawler in the c360_workshop_db AWS Glue database:

  • source_auto_address – Contains address data of customers with auto insurance
  • source_auto_customer – Contains auto insurance details of customers
  • source_auto_vehicles – Contains vehicle details of customers
  • source_property_addresses – Contains address data of customers with property insurance
  • source_property_customers – Contains property insurance details of customers

You can review the data using Amazon Athena. For more information about using Athena to query an AWS Glue table, refer to Running SQL queries using Amazon Athena. For example, you can run the following SQL query:

SELECT * FROM "c360_workshop_db"."source_auto_address" limit 10;

Convert the raw data into CSV files for Neptune

The CloudFormation stack created and ran the AWS Glue ETL job prep_neptune_data to convert the raw data into CSV format acceptable to Neptune Bulk Loader. To learn more about building an AWS Glue ETL job using AWS Glue Studio and to review the job created for this solution, refer to Creating ETL jobs with AWS Glue Studio.

Verify the completion of the job run by navigating to the Runs tab and checking the status of the most recent run.

Verify the CSV files created by the AWS Glue job in the S3 bucket under the output folder.
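If you are curious what “CSV format acceptable to Neptune Bulk Loader” means, the following minimal sketch writes a vertex file and an edge file in the Gremlin bulk load layout (~id, ~label, and typed property headers for vertices; ~id, ~from, ~to, ~label for edges). The rows are illustrative placeholders rather than records from the workshop dataset.

# A minimal sketch of the Gremlin bulk load CSV layout that Neptune expects.
# The rows below are illustrative placeholders, not workshop data.
import csv

with open("vertices.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["~id", "~label", "first_name:String", "last_name:String"])
    writer.writerow(["11376-property", "Customer", "James", "Sanchez"])

with open("edges.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["~id", "~from", "~to", "~label"])
    writer.writerow(["e-1", "11376-property", "address-1", "locatedAt"])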

Load and visualize the raw data in Neptune

This section uses SageMaker Jupyter notebooks to load, query, explore, and visualize the raw property and auto insurance data in Neptune. Jupyter notebooks are web-based interactive platforms. We use Python scripts to analyze the data in a Jupyter notebook. A Jupyter notebook with the required Python scripts has already been provisioned by the CloudFormation stack.

  1. Start Jupyter Notebook.
  2. Choose the Neptune folder on the Files tab.
  3. Under the Customer360 folder, open the notebook explore_raw_insurance_data.ipynb.
  4. Run Steps 1–5 in the notebook to analyze and visualize the raw insurance data.

The rest of the instructions are inside the notebook itself. The following is a summary of the tasks for each step in the notebook:

  • Step 1: Retrieve Config – Run this cell to set up the connection that Neptune Bulk Loader uses.
  • Step 2: Load Source Auto Data – Load the auto insurance data into Neptune as vertices and edges (a sketch of the underlying bulk loader call follows this list).
  • Step 3: Load Source Property Data – Load the property insurance data into Neptune as vertices and edges.
  • Step 4: UI Configuration – This block sets up the UI config and provides UI hints.
  • Step 5: Explore entire graph – The first block builds and displays a graph for all customers with more than four coverages of auto or property insurance policies. The second block displays the graph for four different records for a customer with the name James.
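Under the hood, Steps 2 and 3 call the Neptune bulk loader HTTP API. The following is a minimal sketch of that call using the requests library; the endpoint, S3 path, and IAM role ARN are placeholders for the values your CloudFormation stack created, and if IAM authentication is enabled on the cluster, the request additionally needs to be SigV4-signed (omitted here).

# A minimal sketch of the Neptune bulk loader call that the notebook issues for you.
# Endpoint, S3 source, and role ARN are placeholders for your stack's outputs.
import requests

NEPTUNE_ENDPOINT = "https://<your-neptune-endpoint>:8182"

load_request = {
    "source": "s3://<your-bucket>/output/<csv-folder>/",
    "format": "csv",
    "iamRoleArn": "arn:aws:iam::<account-id>:role/<neptune-load-role>",
    "region": "us-east-1",
    "failOnError": "FALSE",
}

response = requests.post(f"{NEPTUNE_ENDPOINT}/loader", json=load_request, timeout=30)
print(response.json())  # returns a loadId that you can poll at GET /loader/<loadId>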

These are all records for the same customer, but because they’re not linked in any way, they appear as different customer records. The AWS Glue FindMatches ML transform job will identify these records as customer James, and the records provide complete visibility on all policies owned by James. The Neptune graph looks like the following example. The vertex covers represents the coverage of auto or property insurance by the owner (James in this case) and the vertex locatedAt represents the address of the property or vehicle that is covered by the owner’s insurance.

Merge the raw data and crawl the merged dataset

The CloudFormation stack created and ran the AWS Glue ETL job merge_auto_property to merge the raw property and auto insurance data into one dataset and catalog the resultant dataset in the Data Catalog. The AWS Glue ETL job does the following transforms on the raw data and merges the transformed data into one dataset:

  • Changes the following fields on the source table source_auto_customer:
    1. Changes policyid to id and data type to string
    2. Changes fname to first_name
    3. Changes lname to last_name
    4. Changes work to company
    5. Changes dob to date_of_birth
    6. Changes phone to home_phone
    7. Drops the fields birthdate, priority, policysince, and createddate
  • Changes the following fields on the source_property_customers:
    1. Changes customer_id to id and data type to string
    2. Changes social to ssn
    3. Drops the fields job, email, industry, city, state, zipcode, netnew, sales_rounded, sales_decimal, priority, and industry2
  • After converting the unique ID field in each table to string type and renaming it to id, the AWS Glue job appends the suffix -auto to all id fields in the source_auto_customer table and the suffix -property to all id fields in the source_property_customers table before copying all the data from both tables into the merged_auto_property table (a PySpark sketch of this step follows the list).
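The following PySpark fragment sketches the rename-and-suffix step for the auto table only. It assumes a Spark session backed by the Glue Data Catalog is available as spark; reading via Glue DynamicFrames and the matching property-side transform are omitted for brevity.

# A sketch of the rename-and-suffix transform for source_auto_customer.
# Assumes `spark` is a session that can resolve Glue Data Catalog tables;
# the equivalent transform for source_property_customers is omitted.
from pyspark.sql import functions as F

auto_df = spark.table("c360_workshop_db.source_auto_customer")

auto_df = (
    auto_df
    .withColumn("id", F.concat(F.col("policyid").cast("string"), F.lit("-auto")))
    .withColumnRenamed("fname", "first_name")
    .withColumnRenamed("lname", "last_name")
    .withColumnRenamed("work", "company")
    .withColumnRenamed("dob", "date_of_birth")
    .withColumnRenamed("phone", "home_phone")
    .drop("policyid", "birthdate", "priority", "policysince", "createddate")
)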

Verify the new table created by the job in the Data Catalog and review the merged dataset in Athena using the following SQL query:

SELECT * FROM "c360_workshop_db"."merged_auto_property" limit 10

For more information about how to review the data in the merged_auto_property table, refer to Running SQL queries using Amazon Athena.

Create, teach, and tune the Lake Formation ML transform

The AWS Glue merge job created a Data Catalog table called merged_auto_property. Preview the table in the Athena query editor and download the dataset as a CSV from the Athena console. You can open the CSV file for a quick comparison of duplicates.

The rows with IDs 11376-property and 11377-property are mostly the same except for the last two digits of their SSN, which are likely human errors. Such fuzzy matches are easy for a human expert or data steward to spot when they have domain knowledge of how this data was generated, cleansed, and processed in the various source systems. Although a human expert can identify those duplicates in a small dataset, it becomes tedious when dealing with thousands of records. The AWS Glue ML transform builds on this intuition and provides an easy-to-use ML-based algorithm to automatically apply this approach to large datasets efficiently.

Create the FindMatches ML transform

  1. On the AWS Glue console, expand Data Integration and ETL in the navigation pane.
  2. Under Data classification tools, choose Record Matching.

This will open the ML transforms page.

  3. Choose Create transform.
  4. For Name, enter c360-ml-transform.
  5. For Existing IAM role, choose GlueServiceRoleLab.
  6. For Worker type, choose G.2X (Recommended).
  7. For Number of workers, enter 10.
  8. For Glue version, choose Spark 2.4 (Glue Version 2.0).
  9. Keep the other values as default and choose Next.

  10. For Database, choose c360_workshop_db.
  11. For Table, choose merged_auto_property.
  12. For Primary key, select id.
  13. Choose Next.

  14. In the Choose tuning options section, you can tune performance and cost metrics available for the ML transform. We stay with the default trade-offs for a balanced approach.

We have specified these values to achieve balanced results. If needed, you can adjust these values later by selecting the transform and using the Tune menu.

  15. Review the values and choose Create ML transform.

The ML transform is now created with the status Needs training.

Teach the transform to identify the duplicates

In this step, we teach the transform by providing labeled examples of matching and non-matching records. You can create your labeling set yourself or allow AWS Glue to generate the labeling set based on heuristics. AWS Glue extracts records from your source data and suggests potential matching records. The file will contain approximately 100 data samples for you to work with.
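To make the labeling file concrete, the following sketch shows the shape these files take: each row carries the columns of the merged dataset plus a labeling_set_id and a label, rows that share both values are treated as matches, and rows that share a labeling_set_id but differ in label are treated as non-matches. The sample rows are illustrative, not the contents of Label-1-iteration.csv.

# Illustrative shape of a FindMatches labeling file (not the real workshop labels).
# Same labeling_set_id + same label => match; same set, different label => non-match.
import pandas as pd

labels = pd.DataFrame(
    [
        ("set-1", "A", "11376-property", "James", "Sanchez"),
        ("set-1", "A", "29801-auto",     "James", "Sanchez"),  # match with the row above
        ("set-1", "B", "30122-auto",     "Jane",  "Sanchez"),  # same set, different label
    ],
    columns=["labeling_set_id", "label", "id", "first_name", "last_name"],
)
labels.to_csv("my-labels.csv", index=False)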

  1. On the AWS Glue console, navigate to the ML transforms page.
  2. Select the transform c360-ml-transform and choose Train model.

  3. Select I have labels and choose Browse S3 to upload labels from Amazon S3.


Two labeled files have been created for this example. We upload these files to teach the ML transform.

  4. Navigate to the label folder in your S3 bucket, select the first labeled file (Label-1-iteration.csv), choose Choose, and then choose Upload labeling file from S3.
  5. A green banner appears when the upload succeeds.
  6. Upload the second label file (Label-2-iteration.csv) and select Append to my existing labels.
  7. Wait for the successful upload, then choose Next.

  8. Review the details in the Estimate quality metrics section and choose Close.

Verify that the ML transform status is Ready for use. Note that the label count is 200 because we successfully uploaded two labeled files to teach the transform. Now we can use it in an AWS Glue ETL job for fuzzy matching of the full dataset.

Before proceeding to the next steps, note the transform ID (tfm-xxxxxxx) for the created ML transform.

Harmonize the data, catalog the harmonized data, and convert the data into CSV files for Neptune

In this step, we run an AWS Glue ML transform job to find matches in the merged data. The job also catalogs the harmonized dataset in the Data Catalog and converts the merged dataset into CSV files for Neptune to show the relations in the matched records.

  1. On the AWS Glue console, choose Jobs in the navigation pane.
  2. Choose the job perform_ml_dedup.

  3. On the job details page, expand Additional properties.
  4. Under Job parameters, enter the transform ID you saved earlier and save the settings.
  5. Choose Run and monitor the job status for completion.

  6. Run the following query in Athena to review the data in the new table ml_matched_auto_property, created and cataloged by the AWS Glue job, and observe the results:
SELECT * FROM c360_workshop_db.ml_matched_auto_property WHERE first_name like 'Jam%' and last_name like 'Sanchez%';

The job has added a new column called match_id. If multiple records follow the match criteria, then all matching records have the same match_id.

Match IDs play a crucial role in data harmonization using Lake Formation FindMatches. Each row is assigned an integer match ID based on matching criteria such as first_name, last_name, SSN, or date_of_birth, as learned from the uploaded label files, and rows that satisfy the same criteria share the same match ID. For instance, match ID 25769803941 is assigned to all records that meet the match criteria, such as rows 1, 2, 4, and 5, which share the same last_name, SSN, and date_of_birth. Consequently, the records with IDs 19801-property, 29801-auto, 19800-property, and 29800-auto all share the same match ID. It’s important to take note of the match ID because it will be used for Neptune Gremlin queries.
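If you want to inspect the grouping programmatically rather than eyeballing the Athena results, a query like the following lists the match IDs that grouped more than one record. It is issued here with the AWS SDK for pandas (awswrangler), assuming the package is installed and Athena can see the table cataloged by the job.

# Count how many source records were grouped under each match_id.
# Assumes awswrangler (AWS SDK for pandas) is installed and configured.
import awswrangler as wr

sql = """
SELECT match_id, COUNT(*) AS record_count
FROM ml_matched_auto_property
GROUP BY match_id
HAVING COUNT(*) > 1
ORDER BY record_count DESC
"""

df = wr.athena.read_sql_query(sql, database="c360_workshop_db")
print(df.head())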

The AWS Glue job also created two files, master_vertex.csv and master_edge.csv, in the S3 bucket under output/master_data. We use these files to load data into the Neptune database to find the relationships among the different entities.

Load and visualize the harmonized data in Neptune

This section uses Jupyter notebooks to load, query, explore, and visualize the ML matched auto and property insurance data in Neptune. Complete the following steps:

  1. Start Jupyter Notebook.
  2. Choose the Neptune folder on the Files tab.
  3. Under the Customer360 folder, open the notebook explore_harmonized_insurance_data.ipynb.
  4. Run Steps 1–4 in the notebook to analyze and visualize the harmonized insurance data.

The rest of the instructions are inside the notebook itself. The following is a summary of the tasks for each step in the notebook:

  • Step 1. Retrieve Config – Run this cell to set up the connection that Neptune Bulk Loader uses.
  • Step 2. Load Harmonized Customer Data – Load the final vertex and edge files into Neptune.
  • Step 3. Initialize Neptune node traversals – This block sets up the UI config and provides UI hints.
  • Step 4. Exploring Customer 360 graph – Replace REPLACE_ME in g.V('REPLACE_ME') with the Match_id 25769803941 copied from the previous step (if it isn’t already replaced) and run the cell.

This displays the graph for four different records for a customer whose first_name values, James and JamE, are now connected through the SameAs vertex. The Neptune graph helps connect different entities through match criteria; the AWS Glue FindMatches ML transform job has identified these records as customer James, and the records show that the Match_id is the same for them. The following diagram shows an example of the Neptune graph. The vertex covers represents the coverage of auto or property insurance by the owner (James in this case) and the vertex locatedAt represents the address of the property or vehicle that is covered by the owner’s insurance.
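If you want to run the same exploration outside the notebook magics, a gremlin_python traversal along these lines works. The endpoint is a placeholder, and starting the traversal at the match ID vertex with these property names reflects this solution’s data model as shown in the notebook, so adjust them if your graph is modeled differently.

# A sketch of exploring the harmonized graph with gremlin_python instead of the
# notebook magics. The endpoint is a placeholder; the match-ID vertex and the
# property names reflect this solution's data model and may need adjusting.
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection

conn = DriverRemoteConnection("wss://<your-neptune-endpoint>:8182/gremlin", "g")
g = traversal().withRemote(conn)

# Start from the match vertex noted earlier and walk to the records it links.
linked_records = (
    g.V("25769803941")
     .both()
     .valueMap("first_name", "last_name")
     .toList()
)
print(linked_records)

conn.close()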

Clean up

To avoid incurring additional charges to your account, on the AWS CloudFormation console, select the stack that you provisioned as part of this post and delete it.

Conclusion

In this post, we showed how to use the AWS Lake Formation FindMatches transform for fuzzy matching data in a data lake to link records when there are no join keys, and to group records with similar match IDs. You can use Amazon Neptune to establish the relationship between records and visualize the connected graph for deriving insights.

We encourage you to explore our range of services and see how they can help you achieve your goals. For more data and analytics blog posts, check out AWS Blogs.


About the Authors

Nishchai JM is an Analytics Specialist Solutions Architect at Amazon Web Services. He specializes in building big data applications and helping customers modernize their applications on the cloud. He thinks data is the new oil and spends most of his time deriving insights from data.

Varad Ram is a Senior Solutions Architect at Amazon Web Services. He likes to help customers adopt cloud technologies and is particularly interested in artificial intelligence. He believes deep learning will power future technology growth. In his spare time, he likes to be outdoors with his daughter and son.

Narendra Gupta is a Specialist Solutions Architect at AWS, helping customers on their cloud journey with a focus on AWS analytics services. Outside of work, Narendra enjoys learning new technologies, watching movies, and visiting new places.

Arun A K is a Big Data Solutions Architect with AWS. He works with customers to provide architectural guidance for running analytics solutions on the cloud. In his free time, Arun loves to enjoy quality time with his family.

Automate discovery of data relationships using ML and Amazon Neptune graph technology

Post Syndicated from Moira Lennox original https://aws.amazon.com/blogs/big-data/automate-discovery-of-data-relationships-using-ml-and-amazon-neptune-graph-technology/

Data mesh is a new approach to data management. Companies across industries are using a data mesh to decentralize data management to improve data agility and get value from data. However, when a data producer shares data products on a data mesh self-serve web portal, it’s neither intuitive nor easy for a data consumer to know which data products they can join to create new insights. This is especially true in a large enterprise with thousands of data products.

This post shows how to use machine learning (ML) and Amazon Neptune to create automated recommendations to join data products and display those recommendations alongside the existing data products. This allows data consumers to easily identify new datasets and provides agility and innovation without spending hours doing analysis and research.

Background

A successful data-driven organization recognizes data as a key enabler to increase and sustain innovation, and it follows what is called a distributed system architecture. The goal of a data product is to solve the long-standing issue of data silos and data quality. Independent data products often only have value if you can connect them, join them, and correlate them to create a higher order data product that creates additional insights. A modern data architecture is critical in order to become a data-driven organization. It allows stakeholders to manage and work with data products across the organization, enhancing the pace and scale of innovation.

Solution overview

A data mesh architecture addresses a common challenge in traditional data architectures by decoupling the data infrastructure from the application infrastructure. It focuses on decentralized ownership, domain design, data products, and self-serve data infrastructure. This allows for a new way of thinking and new organizational elements, namely a modern data community.

However, today’s data mesh platform contains largely independent data products. Even with well-documented data products, knowing how to connect or join data products is a time-consuming job. Data consumers spend hours, days, or months to understand and analyze the data. Identifying links or relationships between data products is critical to create value from the data mesh and enable a data-driven organization.

The solution in this post illustrates an approach to solving these challenges. It uses a fictional insurance company with several data products shared on their data mesh marketplace. The following figure shows the sample data products used in our solution.

Suppose a consumer is browsing the customer data product in the data mesh marketplace. The consumer wonders if the customer data could be linked to claim, policy, or encounter data. Because these data products come from different lines of business (LOBs) or silos, it’s hard to know. A consumer would have to review each data product and do the necessary analysis and research to know this with any certainty.

To solve this problem, our solution uses ML and Neptune to create recommendations for the data consumer. The solution generates a list of data products, product attributes, and the associated probability scores to show join ability. This reduces the time to discover, analyze, and create new insights.

We use Valentine, a data science algorithm for comparing datasets, to improve data product recommendations. Neptune, the managed AWS graph database service, stores information about explicit connections between datasets, improving the recommendations.

Example use case

Let’s walk through a concrete example. Suppose a consumer is browsing the Customer data product in the data mesh marketplace. Customer is similar to the Policy and Encounter data products, but these products come from different silos. Their similarity to the Customer is hard to gauge. To expedite the consumer’s work, the mesh recommends how the Policy and Encounter products can be connected to the Customer product.

Let’s consider two cases. First, is Customer similar to Claim? The following is a sample of the data in each product.

Intuitively, these two products have lots of overlap. Every Cust_Nbr in Claim has a corresponding Customer_ID in Customer. There is no foreign key constraint in Claim that assures us it points to Customer, but we think there is enough similarity to infer a join relationship.

The data science algorithm Valentine is an effective tool for this. Valentine is presented in the paper Valentine: Evaluating Matching Techniques for Dataset Discovery (2021, Koutras et al.). Valentine determines if two datasets are joinable or unionable. We focus on the former. Two datasets are joinable if a record from one dataset has a link to a record in the other dataset using one or more columns. Valentine addresses the use case where data is messy: there is no foreign key constraint in place, and data doesn’t match perfectly between datasets. Valentine looks for similarities, and its findings are probabilistic. It scores its proposed matches.

This solution uses an implementation of Valentine available in the following GitHub repo. The first step is to load each data product from its source into a Pandas data frame. If the data is large, load a representative subset of it, at most a few million records. Pass the frames to the valentine_match() function and select the matching method. We use COMA, one of several methods that Valentine supports. The function’s result indicates the similarity of columns and the score. In this case, it tells us that the Customer_ID for Customer matches the Cust_Nbr for Claim, with a very high score. We then instruct the data mesh to recommend Claim to the consumer browsing Customer.
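A sketch of that flow follows, assuming the valentine package from the referenced repository with its valentine_match function and Coma matcher, and with the data product loading reduced to reading two representative CSV extracts. The exact shape of the returned scores may differ between package versions, so check the repository’s README.

# A sketch of comparing two data products with Valentine.
# Assumes the `valentine` package from the referenced repo; the CSV paths are
# placeholders for representative extracts of the Customer and Claim products.
import pandas as pd
from valentine import valentine_match
from valentine.algorithms import Coma

customer_df = pd.read_csv("customer_sample.csv")  # placeholder extract
claim_df = pd.read_csv("claim_sample.csv")        # placeholder extract

matches = valentine_match(customer_df, claim_df, Coma())

# The result maps column pairs to similarity scores, for example a high score
# between Customer_ID and Cust_Nbr. Print the proposed matches, best first.
for column_pair, score in sorted(matches.items(), key=lambda kv: -kv[1]):
    print(column_pair, round(score, 3))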

A graph database isn’t required to recommend Claim; the two products could be directly compared. But let’s consider Encounter. Is Customer similar to Encounter? This case is more complicated. Many encounters in the Encounter product don’t link to a customer. An encounter occurs when someone contacts the contact center, which could be by phone or email. The party may or may not be a customer, and if they are a customer, we may not know their customer ID during this encounter. Additionally, sometimes the phone or email they use isn’t the same as the one from a customer record in the Customer product.

In the following sample encounter set, encounters 1 and 2 match to Customer_ID 4. Note that encounter 2’s inbound_email doesn’t exactly match the inbound_email in that customer’s record in the Customer product. Encounter 3 has no Customer_ID, but its inbound_email matches the customer with ID 8. Encounter 4 appears to refer to the customer with ID 8, but the email doesn’t match, and no Customer_ID is given. Encounter 5 only has Inbound_Phone, but that matches the customer with ID 1. Encounter 6 only has an Inbound_Phone, and it doesn’t appear to match any of the customers we’ve listed so far.

We don’t have a strong enough comparison to infer similarity.

But we know more about the customer than the Customer product tells us. In the Neptune database, we maintain a knowledge graph that combines multiple products and links them through relationships. A knowledge graph allows us to combine data from different sources to gain a better understanding of a specific problem domain. In Neptune, we combine the Customer product data with an additional data product: Sales Opportunity. We ingest each product from its source into the knowledge graph and model a hasSalesOpportunity relationship between Customer and SalesOpportunity resources. The following figure shows these resources, their attributes, and their relationship.

With the AWS SDK for Pandas, we combine this data by running a query against the Neptune graph. We use a graph query language (such as SPARQL) to wrangle a representative subset of customer and sales opportunity data into a Pandas data frame (shown as Enhanced Customer View in the following figure). In the following example, we enhance customers 7 and 8 with alternate phone or email contact data from sales opportunities.
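The following sketch shows how that wrangling step might look with the AWS SDK for pandas (awswrangler) Neptune module. The endpoint, class, and property IRIs are placeholders shaped around the Customer and SalesOpportunity model described above, not the actual graph schema.

# A sketch of building the enhanced customer view with the AWS SDK for pandas.
# The endpoint and the IRIs in the SPARQL query are placeholders for this
# solution's Customer / SalesOpportunity model; adjust to your graph's schema.
import awswrangler as wr

client = wr.neptune.connect("<your-neptune-endpoint>", 8182, iam_enabled=False)

sparql = """
PREFIX ex: <http://example.com/schema/>
SELECT ?customer_id ?email ?alt_email ?alt_phone
WHERE {
  ?c a ex:Customer ;
     ex:customerId ?customer_id ;
     ex:inboundEmail ?email .
  OPTIONAL {
    ?c ex:hasSalesOpportunity ?opp .
    ?opp ex:contactEmail ?alt_email ;
         ex:contactPhone ?alt_phone .
  }
}
"""

enhanced_customer_df = wr.neptune.execute_sparql(client, sparql)
print(enhanced_customer_df.head())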

We pass that frame to Valentine and compare it to Encounter. This time, two additional encounters match a customer.

The score meets our threshold, and is high enough to share with the consumer as a possible match. To the customer browsing Customer in the mesh marketplace, we present the recommendation of Encounter, along with scoring details to support the recommendation. With this recommendation, the consumer can explore the Encounter product with greater confidence.

Conclusion

Data-driven organizations are transitioning to a data product way of thinking. Utilizing strategies like data mesh generates value on a large scale. We took this a step further by creating a blueprint to create smart recommendations by linking similar data products using graph technology and ML. In this post, we showed how an organization can augment a data catalog with additional metadata by using ML and Neptune with an automated process.

This solution solves the interoperability and linkage problem for data products. Additionally, it gives organizations real-time insights, agility, and innovation without spending time on data analysis and research. This approach creates a truly connected ecosystem with simplified access to delight your data consumers. The current solution is platform agnostic; however, in a future post we will show how to implement this using data.all (open-source software) and Amazon DataZone.

To learn more about ML in Neptune, refer to Amazon Neptune ML for machine learning on graphs. You can also explore Neptune notebooks demonstrating ML and data science for graphs. For more information about the data mesh architecture, refer to Design a data mesh architecture using AWS Lake Formation and AWS Glue. To learn more about Amazon DataZone and how you can share, search, and discover data at scale across organizational boundaries, visit the Amazon DataZone page.


About the Authors


Moira Lennox is a Senior Data Strategy Technical Specialist for AWS with 27 years’ experience helping companies innovate and modernize their data strategies to achieve new heights and allow for strategic decision-making. She has experience working in large enterprises and technology providers, in both business and technical roles across multiple industries, including healthcare and life sciences, financial services, communications, digital entertainment, energy, and manufacturing.

Joel Farvault is Principal Specialist SA Analytics for AWS with 25 years’ experience working on enterprise architecture, data strategy, and analytics, mainly in the financial services industry. Joel has led data transformation projects on fraud analytics, claims automation, and data governance.

Mike Havey is a Solutions Architect for AWS with over 25 years of experience building enterprise applications. Mike is the author of two books and numerous articles; you can find his books on his Amazon author page.

AWS Week in Review – March 20, 2023

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/aws-week-in-review-march-20-2023/

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

A new week starts, and Spring is almost here! If you’re curious about AWS news from the previous seven days, I got you covered.

Last Week’s Launches
Here are the launches that got my attention last week:

Amazon S3 – Last week was AWS Pi Day 2023, celebrating 17 years of innovation since Amazon S3 was introduced on March 14, 2006. For the occasion, the team released many new capabilities.

Amazon Linux 2023 – Our new Linux-based operating system is now generally available. Sébastien’s post is full of tips and info.

Application Auto Scaling – You can now use arithmetic operations and mathematical functions to customize the metrics used with Target Tracking policies, so you can scale based on your own application-specific metrics. Read how it works with Amazon ECS services.

AWS Data Exchange for Amazon S3 is now generally available – You can now share and find data files directly from S3 buckets, without the need to create or manage copies of the data.

Amazon Neptune – Now offers a graph summary API to help understand important metadata about property graphs (PG) and resource description framework (RDF) graphs. Neptune added support for Slow Query Logs to help identify queries that need performance tuning.

Amazon OpenSearch Service – The team introduced security analytics that provides new threat monitoring, detection, and alerting features. The service now supports OpenSearch version 2.5 that adds several new features such as support for Point in Time Search and improvements to observability and geospatial functionality.

AWS Lake Formation and Apache Hive on Amazon EMR – Introduced fine-grained access controls that allow data administrators to define and enforce fine-grained table and column level security for customers accessing data via Apache Hive running on Amazon EMR.

Amazon EC2 M1 Mac Instances – You can now update guest environments to a specific or the latest macOS version without having to tear down and recreate the existing macOS environments.

AWS Chatbot – Now integrates with Microsoft Teams to simplify the way you troubleshoot and operate your AWS resources.

Amazon GuardDuty RDS Protection for Amazon Aurora – Now generally available to help profile and monitor access activity to Aurora databases in your AWS account without impacting database performance.

AWS Database Migration Service – Now supports validation to ensure that data is migrated accurately to S3 and can now generate an AWS Glue Data Catalog when migrating to S3.

AWS Backup – You can now back up and restore virtual machines running on VMware vSphere 8 and with multiple vNICs.

Amazon Kendra – There are new connectors to index documents and search for information across these new content sources: Confluence Server, Confluence Cloud, Microsoft SharePoint OnPrem, and Microsoft SharePoint Cloud. This post shows how to use the Amazon Kendra connector for Microsoft Teams.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
A few more blog posts you might have missed:

Women founders Q&A – We’re talking to six women founders and leaders about how they’re making impacts in their communities, industries, and beyond.

What you missed at the 2023 IMAGINE: Nonprofit conference – Where hundreds of nonprofit leaders, technologists, and innovators gathered to learn and share how AWS can drive a positive impact for people and the planet.

Monitoring load balancers using Amazon CloudWatch anomaly detection alarms – The metrics emitted by load balancers provide crucial and unique insight into service health, service performance, and end-to-end network performance.

Extend geospatial queries in Amazon Athena with user-defined functions (UDFs) and AWS Lambda – Using a solution based on Uber’s Hexagonal Hierarchical Spatial Index (H3) to divide the globe into equally-sized hexagons.

How cities can use transport data to reduce pollution and increase safety – A guest post by Rikesh Shah, outgoing head of open innovation at Transport for London.

For AWS open-source news and updates, here’s the latest newsletter curated by Ricardo to bring you the most recent updates on open-source projects, posts, events, and more.

Upcoming AWS Events
Here are some opportunities to meet:

AWS Public Sector Day 2023 (March 21, London, UK) – An event dedicated to helping public sector organizations use technology to achieve more with less through the current challenging conditions.

Women in Tech at Skills Center Arlington (March 23, VA, USA) – Let’s celebrate the history and legacy of women in tech.

The AWS Summits season is warming up! You can sign up here to be notified when registration opens in your area.

That’s all from me for this week. Come back next Monday for another Week in Review!

Danilo

Happy New Year! AWS Week in Review – January 9, 2023

Post Syndicated from Channy Yun original https://aws.amazon.com/blogs/aws/happy-new-year-aws-week-in-review-january-9-2023/

Happy New Year! As we kick off 2023, I wanted to take a moment to remind you of some 2023 predictions by AWS leaders to help you prepare for the new year.

You can also read about the nine best things Amazon announced and AWS for Automotive at the Consumer Electronics Show (CES) 2023 last week to see the latest offerings from Amazon and AWS that are helping innovate at speed and create new customer experiences at the forefront of technology.

Last Year-End Launches
We skipped two weeks since the last Week in Review on December 19, 2022. I want to highlight some important launches from that period.

Last Week’s Launches
As usual, let’s take a look at some launches from the last week that I want to remind you of:

  • Amazon S3 Encrypts New Objects by Default – Amazon S3 encrypts all new objects by default. Now, S3 automatically applies server-side encryption (SSE-S3) for each new object, unless you specify a different encryption option. There is no additional cost for default object-level encryption.
  • Amazon Aurora MySQL Version 3 Backtrack Support – Backtrack allows you to move your MySQL 8.0 compatible Aurora database to a prior point in time without needing to restore from a backup, and it completes within seconds, even for large databases.
  • Amazon EMR Serverless Custom Images – Amazon EMR Serverless now allows you to customize images for Apache Spark and Hive. This means that you can package application dependencies or custom code in the image, simplifying running Spark and Hive workloads.
  • The Graph Explorer, Open-Source Low-Code Visual Exploration Tool – Amazon Neptune announced the graph-explorer, a React-based web application that enables users to visualize both property graph and Resource Description Framework (RDF) data and explore connections between data without having to write graph queries. To learn more about open source updates at AWS, see Ricardo’s OSS newsletter.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some other news items that you may find interesting in the new year:

  • AWS Collective on Stack Overflow – Please join the AWS Collective on Stack Overflow, which provides builders a curated space to engage and learn from this large developer community.
  • AWS Fundamentals Book – This upcoming AWS online book is intended to focus on AWS usage in the real world, and goes deeper with amazing per-service infographics.
  • AWS Security Events Workshops – The AWS Customer Incident Response Team (CIRT) released five real-world workshops that simulate security events, such as server-side request forgery, ransomware, and cryptominer-based security events, to help you learn the tools and procedures that AWS CIRT uses.

Upcoming AWS Events
Check your calendars and sign up for these AWS events in the new year:

  • AWS Builders Online Series on January 18 – This online conference is designed for you to learn core AWS concepts, and step-by-step architectural best practices, including demonstrations to help you get started and accelerate your success on AWS.
  • AWS Community Day Singapore on January 28 – Come and join AWS User Group Singapore’s first AWS Community Day, a community-led conference for AWS users. See Events for Developers to learn about developer events hosted by AWS and the AWS Community.
  • AWS Cloud Practitioner Essentials Day in January and February – This online workshop provides a detailed overview of cloud concepts, AWS services, security, architecture, pricing, and support. This course also helps you prepare for the AWS Certified Cloud Practitioner examination.

You can browse all upcoming in-person and virtual events.

That’s all for this week. Check back next Monday for another Week in Review!

— Channy

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Introducing Amazon Neptune Serverless – A Fully Managed Graph Database that Adjusts Capacity for Your Workloads

Post Syndicated from Danilo Poccia original https://aws.amazon.com/blogs/aws/introducing-amazon-neptune-serverless-a-fully-managed-graph-database-that-adjusts-capacity-for-your-workloads/

Amazon Neptune is a fully managed graph database service that makes it easy to build and run applications that work with highly connected datasets. With Neptune, you can use open and popular graph query languages to execute powerful queries that are easy to write and perform well on connected data. You can use Neptune for graph use cases such as recommendation engines, fraud detection, knowledge graphs, drug discovery, and network security.

Neptune has always been fully managed and handles time-consuming tasks such as provisioning, patching, backup, recovery, failure detection and repair. However, managing database capacity for optimal cost and performance requires you to monitor and reconfigure capacity as workload characteristics change. Also, many applications have variable or unpredictable workloads where the volume and complexity of database queries can change significantly. For example, a knowledge graph application for social media may see a sudden spike in queries due to sudden popularity.

Introducing Amazon Neptune Serverless
Today, we’re making that easier with the launch of Amazon Neptune Serverless. Neptune Serverless scales automatically as your queries and your workloads change, adjusting capacity in fine-grained increments to provide just the right amount of database resources that your application needs. In this way, you pay only for the capacity you use. You can use Neptune Serverless for development, test, and production workloads and optimize your database costs compared to provisioning for peak capacity.

With Neptune Serverless you can quickly and cost-effectively deploy graphs for your modern applications. You can start with a small graph, and as your workload grows, Neptune Serverless will automatically and seamlessly scale your graph databases to provide the performance you need. You no longer need to manage database capacity and you can now run graph applications without the risk of higher costs from over-provisioning or insufficient capacity from under-provisioning.

With Neptune Serverless, you can continue to use the same query languages (Apache TinkerPop Gremlin, openCypher, and RDF/SPARQL) and features (such as snapshots, streams, high availability, and database cloning) already available in Neptune.

Let’s see how this works in practice.

Creating an Amazon Neptune Serverless Database
In the Neptune console, I choose Databases in the navigation pane and then Create database. For Engine type, I select Serverless and enter my-database as the DB cluster identifier.

Console screenshot.

I can configure the range of capacity, expressed in Neptune capacity units (NCUs), that Neptune Serverless can use based on my workload. I then choose a template that will configure some of the next options for me. I choose the Production template, which by default creates a read replica in a different Availability Zone. The Development and Testing template would optimize my costs by not having a read replica and giving access to DB instances that provide burstable capacity.

Console screenshot.

For Connectivity, I use my default VPC and its default security group.

Console screenshot.

Finally, I choose Create database. After a few minutes, the database is ready to use. In the list of databases, I choose the DB identifier to get the Writer and Reader endpoints that I am going to use later to access the database.

Using Amazon Neptune Serverless
There is no difference in the way you use Neptune Serverless compared to a provisioned Neptune database. I can use any of the query languages supported by Neptune. For this walkthrough, I choose to use openCypher, a declarative query language for property graphs originally developed by Neo4j that was open-sourced in 2015 and contributed to the openCypher project.

To connect to the database, I start an Amazon Linux Amazon Elastic Compute Cloud (Amazon EC2) instance in the same AWS Region and associate the default security group and a second security group that gives me SSH access.

With a property graph I can represent connected data. In this case, I want to create a simple graph that shows how some AWS services are part of a service category and implement common enterprise integration patterns.

I use curl to access the Writer openCypher HTTPS endpoint and create a few nodes that represent patterns, services, and service categories. The following commands are split into multiple lines in order to improve readability.

curl https://<my-writer-endpoint>:8182/openCypher \
-d "query=CREATE (mq:Pattern {name: 'Message Queue'}),
(pubSub:Pattern {name: 'Pub/Sub'}),
(eventBus:Pattern {name: 'Event Bus'}),
(workflow:Pattern {name: 'WorkFlow'}),
(applicationIntegration:ServiceCategory {name: 'Application Integration'}),
(sqs:Service {name: 'Amazon SQS'}), (sns:Service {name: 'Amazon SNS'}),
(eventBridge:Service {name: 'Amazon EventBridge'}), (stepFunctions:Service {name: 'AWS StepFunctions'}),
(sqs)-[:IMPLEMENT]->(mq), (sns)-[:IMPLEMENT]->(pubSub),
(eventBridge)-[:IMPLEMENT]->(eventBus),
(stepFunctions)-[:IMPLEMENT]->(workflow),
(applicationIntegration)-[:CONTAIN]->(sqs),
(applicationIntegration)-[:CONTAIN]->(sns),
(applicationIntegration)-[:CONTAIN]->(eventBridge),
(applicationIntegration)-[:CONTAIN]->(stepFunctions);"

This is a visual representation of the nodes and their relationships for the graph created by the previous command. The type (such as Service or Pattern) and properties (such as name) are shown inside each node. The arrows represent the relationships (such as CONTAIN or IMPLEMENT) between the nodes.

Visualization of graph data.

Now, I query the database to get some insights. To query the database, I can use either a Writer or a Reader endpoint. First, I want to know the name of the service implementing the “Message Queue” pattern. Note how the syntax of openCypher resembles that of SQL with MATCH instead of SELECT.

curl https://<my-endpoint>:8182/openCypher \
-d "query=MATCH (s:Service)-[:IMPLEMENT]->(p:Pattern {name: 'Message Queue'}) RETURN s.name;"
{
  "results" : [ {
    "s.name" : "Amazon SQS"
  } ]
}

I use the following query to see how many services are in the “Application Integration” category. This time, I use the WHERE clause to filter results.

curl https://<my-endpoint>:8182/openCypher \
-d "query=MATCH (c:ServiceCategory)-[:CONTAIN]->(s:Service) WHERE c.name='Application Integration' RETURN count(s);"
{
  "results" : [ {
    "count(s)" : 4
  } ]
}
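If you prefer to query from application code instead of curl, here is a small Python sketch using the requests library. The endpoint is a placeholder, and it assumes IAM database authentication is not enabled on the cluster (otherwise the request would need to be signed with SigV4).

import requests

ENDPOINT = "https://<my-endpoint>:8182/openCypher"

query = """
MATCH (c:ServiceCategory)-[:CONTAIN]->(s:Service)
WHERE c.name = 'Application Integration'
RETURN s.name
"""

# The openCypher endpoint accepts the query as a form-encoded 'query' parameter.
response = requests.post(ENDPOINT, data={"query": query})
response.raise_for_status()

for record in response.json()["results"]:
    print(record["s.name"])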

There are many options now that I have this graph database up and running. I can add more data (services, categories, patterns) and more relationships between the nodes. I can focus on my application and let Neptune Serverless manage capacity and infrastructure for me.

Availability and Pricing
Amazon Neptune Serverless is available today in the following AWS Regions: US East (Ohio, N. Virginia), US West (N. California, Oregon), Asia Pacific (Tokyo), and Europe (Ireland, London).

With Neptune Serverless, you only pay for what you use. Database capacity is adjusted automatically, in terms of Neptune capacity units (NCUs), to provide the resources your workload needs. Each NCU is a combination of approximately 2 gibibytes (GiB) of memory with corresponding CPU and networking. NCU usage is billed per second. For more information, see the Neptune pricing page.

Having a serverless graph database opens many new possibilities. To learn more, see the Neptune Serverless documentation. Let us know what you build with this new capability!

Simplify the way you work with highly connected data using Neptune Serverless.

Danilo

AWS Week in Review – August 1, 2022

Post Syndicated from Jeff Barr original https://aws.amazon.com/blogs/aws/aws-week-in-review-august-1-2022/

AWS re:Inforce returned to Boston last week, kicking off with a keynote from Amazon Chief Security Officer Steve Schmidt and AWS Chief Information Security Officer C.J. Moses:

Be sure to take some time to watch this video and the other leadership sessions, and to use what you learn to take some proactive steps to improve your security posture.

Last Week’s Launches
Here are some launches that caught my eye last week:

AWS Wickr uses 256-bit end-to-end encryption to deliver secure messaging, voice, and video calling, including file sharing and screen sharing, across desktop and mobile devices. Each call, message, and file is encrypted with a new random key and can be decrypted only by the intended recipient. AWS Wickr supports logging to a secure, customer-controlled data store for compliance and auditing, and offers full administrative control over data: permissions, ephemeral messaging options, and security groups. You can now sign up for the preview.

AWS Marketplace Vendor Insights helps AWS Marketplace sellers to make security and compliance data available through AWS Marketplace in the form of a unified, web-based dashboard. Designed to support governance, risk, and compliance teams, the dashboard also provides evidence that is backed by AWS Config and AWS Audit Manager assessments, external audit reports, and self-assessments from software vendors. To learn more, read the What’s New post.

GuardDuty Malware Protection protects Amazon Elastic Block Store (EBS) volumes from malware. As Danilo describes in his blog post, a malware scan is initiated when Amazon GuardDuty detects that a workload running on an EC2 instance or in a container appears to be doing something suspicious. The new malware protection feature creates snapshots of the attached EBS volumes, restores them within a service account, and performs an in-depth scan for malware. The scanner supports many types of file systems and file formats and generates actionable security findings when malware is detected.

Amazon Neptune Global Database lets you build graph applications that run across multiple AWS Regions using a single graph database. You can deploy a primary Neptune cluster in one region and replicate its data to up to five secondary read-only database clusters, with up to 16 read replicas each. Clusters can recover in minutes in the event of an (unlikely) regional outage, with a Recovery Point Objective (RPO) of 1 second and a Recovery Time Objective (RTO) of 1 minute. To learn a lot more and see this new feature in action, read Introducing Amazon Neptune Global Database.

Amazon Detective now Supports Kubernetes Workloads, with the ability to scale to thousands of container deployments and millions of configuration changes per second. It ingests EKS audit logs to capture API activity from users, applications, and the EKS control plane, and correlates user activity with information gleaned from Amazon VPC flow logs. As Channy notes in his blog post, you can enable Amazon Detective and take advantage of a free 30-day trial of the EKS capabilities.

AWS SSO is Now AWS IAM Identity Center in order to better represent the full set of workforce and account management capabilities that are part of IAM. You can create user identities directly in IAM Identity Center, or you can connect your existing Active Directory or standards-based identity provider. To learn more, read this post from the AWS Security Blog.

AWS Config Conformance Packs now provide you with percentage-based scores that will help you track resource compliance within the scope of the resources addressed by the pack. Scores are computed based on the product of the number of resources and the number of rules, and are reported to Amazon CloudWatch so that you can track compliance trends over time. To learn more about how scores are computed, read the What’s New post.

Amazon Macie now lets you perform one-click temporary retrieval of sensitive data that Macie has discovered in an S3 bucket. You can retrieve up to ten examples at a time, and use these findings to accelerate your security investigations. All of the data that is retrieved and displayed in the Macie console is encrypted using customer-managed AWS Key Management Service (AWS KMS) keys. To learn more, read the What’s New post.

AWS Control Tower was updated multiple times last week. CloudTrail Organization Logging creates an org-wide trail in your management account to automatically log the actions of all member accounts in your organization. Control Tower now reduces redundant AWS Config items by limiting recording of global resources to home regions. To take advantage of this change you need to update to the latest landing zone version and then re-register each Organizational Unit, as detailed in the What’s New post. Lastly, Control Tower’s region deny guardrail now includes AWS API endpoints for AWS Chatbot, Amazon S3 Storage Lens, and Amazon S3 Multi Region Access Points. This allows you to limit access to AWS services and operations for accounts enrolled in your AWS Control Tower environment.

For a full list of AWS announcements, be sure to keep an eye on the What’s New at AWS page.

Other AWS News
Here are some other news items and customer stories that you may find interesting:

AWS Open Source News and Updates – My colleague Ricardo Sueiras writes a weekly open source newsletter and highlights new open source projects, tools, and demos from the AWS community. Read installment #122 here.

Growy Case Study – This Netherlands-based company is building fully-automated robot-based vertical farms that grow plants to order. Read the case study to learn how they use AWS IoT and other services to monitor and control light, temperature, CO2, and humidity to maximize yield and quality.

Journey of a Snap on Snapchat – This video shows you how a snap flows end-to-end from your camera, to AWS, to your friends. With over 300 million daily active users, Snap takes advantage of Amazon Elastic Kubernetes Service (EKS), Amazon DynamoDB, Amazon Simple Storage Service (Amazon S3), Amazon CloudFront, and many other AWS services, storing over 400 terabytes of data in DynamoDB and managing over 900 EKS clusters.

Cutting Cardboard Waste – Bin packing is almost certainly a part of every computer science curriculum! In the linked article from the Amazon Science site, you can learn how an Amazon Principal Research Scientist developed PackOpt to figure out the optimal set of boxes to use for shipments from Amazon’s global network of fulfillment centers. This is an NP-hard problem, and the article describes how they built a parallelized solution that explores a multitude of alternative solutions, all running on AWS.

Upcoming Events
Check your calendar and sign up for these online and in-person AWS events:

AWS Global Summits – AWS Global Summits are free events that bring the cloud computing community together to connect, collaborate, and learn about AWS. Registrations are open for the following AWS Summits in August:

IMAGINE 2022 – The IMAGINE 2022 conference will take place on August 3 at the Seattle Convention Center, Washington, USA. It’s a no-cost event that brings together education, state, and local leaders to learn about the latest innovations and best practices in the cloud. You can register here.

That’s all for this week. Check back next Monday for another Week in Review!

Jeff;

This post is part of our Week in Review series. Check back each week for a quick roundup of interesting news and announcements from AWS!

Build data lineage for data lakes using AWS Glue, Amazon Neptune, and Spline

Post Syndicated from Khoa Nguyen original https://aws.amazon.com/blogs/big-data/build-data-lineage-for-data-lakes-using-aws-glue-amazon-neptune-and-spline/

Data lineage is one of the most critical components of a data governance strategy for data lakes. Data lineage helps ensure that accurate, complete and trustworthy data is being used to drive business decisions. While a data catalog provides metadata management features and search capabilities, data lineage shows the full context of your data by capturing in greater detail the true relationships between data sources, where the data originated from and how it gets transformed and converged. Different personas in the data lake benefit from data lineage:

  • For data scientists, the ability to view and track data flow as it moves from source to destination helps you easily understand the quality and origin of a particular metric or dataset
  • Data platform engineers can get more insights into the data pipelines and the interdependencies between datasets
  • Changes in data pipelines are easier to apply and validate because engineers can identify a job’s upstream dependencies and downstream usage to properly evaluate service impacts

As the complexity of the data landscape grows, customers are facing significant manageability challenges in capturing lineage in a cost-effective and consistent manner. In this post, we walk you through three steps in building an end-to-end automated data lineage solution for data lakes: lineage capture; modeling and storage; and visualization.

In this solution, we capture both coarse-grained and fine-grained data lineage. Coarse-grained data lineage, which often targets business users, focuses on capturing the high-level business processes and overall data workflows. Typically, it captures and visualizes the relationships between datasets and how they’re propagated across storage tiers, including extract, transform and load (ETL) jobs and operational information. Fine-grained data lineage gives access to column-level lineage and the data transformation steps in the processing and analytical pipelines.

Solution overview

Apache Spark is one of the most popular engines for large-scale data processing in data lakes. Our solution uses the Spline agent to capture runtime lineage information from Spark jobs, powered by AWS Glue. We use Amazon Neptune, a purpose-built graph database optimized for storing and querying highly connected datasets, to model lineage data for analysis and visualization.

The following diagram illustrates the solution architecture. We use AWS Glue Spark ETL jobs to perform data ingestion, transformation, and load. The Spline agent is configured in each AWS Glue job to capture lineage and run metrics, and sends such data to a lineage REST API. This backend consists of producer and consumer endpoints, powered by Amazon API Gateway and AWS Lambda functions. The producer endpoints process the incoming lineage objects before storing them in the Neptune database. We use consumer endpoints to extract specific lineage graphs for different visualizations in the frontend application. We perform ad hoc interactive analysis on the graph through Neptune notebooks.

Solution Architecture

We provide sample code and Terraform deployment scripts on GitHub to quickly deploy this solution to the AWS Cloud.

Data lineage capturing

The Spline agent is an open-source project that can harvest data lineage automatically from Spark jobs at runtime, without the need to modify the existing ETL code. It listens to Spark’s query run events, extracts lineage objects from the job run plans and sends them to a preconfigured backend (such as HTTP endpoints). The agent also automatically collects job run metrics such as the number of output rows. As of this writing, the Spline agent works only with Spark SQL (DataSet/DataFrame APIs) and not with RDDs/DynamicFrames.

The following screenshot shows how to integrate the Spline agent with AWS Glue Spark jobs. The Spline agent is an uber JAR that needs to be added to the Java classpath. The following configurations are required to set up the Spline agent:

  • spark.sql.queryExecutionListeners configuration is used to register a Spline listener during its initialization.
  • spark.spline.producer.url specifies the address of the HTTP server that the Spline agent should send lineage data to.

Spline Agent Configuration on AWS Glue
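As a rough illustration, the following Python sketch shows how those two settings could be supplied when building the Spark session; the listener class name comes from the open-source Spline Spark agent, and the producer URL is a placeholder. In an AWS Glue job, these values are more commonly passed as job parameters, with the Spline uber JAR added to the classpath.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    # Register the Spline listener so lineage is harvested from query run events.
    .config(
        "spark.sql.queryExecutionListeners",
        "za.co.absa.spline.harvester.listener.SplineQueryExecutionListener",
    )
    # Tell the agent where to send the captured lineage (the producer endpoint).
    .config("spark.spline.producer.url", "https://<lineage-api>/prod/producer")
    .getOrCreate()
)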

We build a data lineage API that is compatible with the Spline agent. This API facilitates the insertion of lineage data into the Neptune database and the extraction of graphs for visualization. The Spline agent requires three HTTP endpoints:

  • /status – For health checks
  • /execution-plans – For sending the captured Spark execution plans after the jobs are submitted to run
  • /execution-events – For sending the job’s run metrics when the job is complete

We also create additional endpoints to manage various metadata of the data lake, such as the names of the storage layers and dataset classification.
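The following is a hypothetical sketch of how the producer Lambda function behind API Gateway could route these calls; the helper functions that write to Neptune are placeholders for the logic described in the next section.

import json

def store_execution_plan(plan):
    # Placeholder: parse the plan into jobs, tables, and operations and write them to Neptune.
    print("received execution plan", plan.get("id"))

def store_execution_event(event_body):
    # Placeholder: attach run metrics (such as output row counts) to the stored plan.
    print("received execution event")

def handler(event, context):
    path = event.get("path", "")
    method = event.get("httpMethod", "")

    if path.endswith("/status"):
        # Health check used by the Spline agent.
        return {"statusCode": 200, "body": json.dumps({"status": "ok"})}

    body = json.loads(event.get("body") or "{}")

    if method == "POST" and path.endswith("/execution-plans"):
        store_execution_plan(body)
        return {"statusCode": 201, "body": ""}

    if method == "POST" and path.endswith("/execution-events"):
        store_execution_event(body)
        return {"statusCode": 201, "body": ""}

    return {"statusCode": 404, "body": ""}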

When a Spark SQL statement is run or a DataFrame action is called, Spark’s optimization engine, Catalyst, generates different query plans: a logical plan, an optimized logical plan, and a physical plan, all of which can be inspected using the EXPLAIN statement. During a job run, the Spline agent parses the analyzed logical plan to construct a JSON lineage object. The object consists of the following:

  • A unique job run ID
  • A reference schema (attribute names and data types)
  • A list of operations
  • Other system metadata such as Spark version and Spline agent version

A run plan specifies the steps the Spark job performs, from reading data sources, applying different transformations, to finally persisting the job’s output into a storage location.

To sum up, the Spline agent captures not only the metadata of the job (such as job name and run date and time) and the input and output tables (such as data format, physical location, and schema), but also detailed information about the business logic (the SQL-like operations that the job performs, such as join, filter, project, and aggregate).
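To make the shape of that object more concrete, here is a simplified, illustrative structure; it is not the exact Spline schema, and the field names and values are placeholders.

# Simplified, illustrative lineage object -- not the exact Spline schema.
lineage_object = {
    "id": "d2c1a7e0-1111-2222-3333-444455556666",   # unique job run ID
    "systemInfo": {"name": "spark", "version": "3.1"},
    "agentInfo": {"name": "spline-agent", "version": "1.x"},
    "schema": [                                      # reference schema
        {"name": "customer_id", "dataType": "string"},
        {"name": "order_total", "dataType": "double"},
    ],
    "operations": [                                  # read -> transform -> write
        {"type": "Read", "source": "s3://raw-zone/orders/"},
        {"type": "Filter", "condition": "order_total > 0"},
        {"type": "Aggregate", "groupBy": ["customer_id"]},
        {"type": "Write", "target": "s3://curated-zone/orders_agg/"},
    ],
}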

Data modeling and storage

Data modeling starts with the business requirements and use cases and maps those needs into a structure for storing and organizing our data. In data lineage for data lakes, the relationships between data assets (jobs, tables, and columns) are as important as the metadata of those assets. As a result, graph databases are well suited to modeling such highly connected entities, making it efficient to understand the complex and deep network of relationships within the data.

Neptune is a fast, reliable, fully managed graph database service that makes it easy to build and run applications with highly connected datasets. You can use Neptune to create sophisticated, interactive graph applications that can query billions of relationships in milliseconds. Neptune supports three popular graph query languages: Apache TinkerPop Gremlin and openCypher for property graphs and SPARQL for W3C’s RDF data model. In this solution, we use the property graph’s primitives (including vertices, edges, labels and properties) to model the objects and use the gremlinpython library to interact with the graphs.

The objective of our data model is to provide an abstraction for data assets and their relationships within the data lake. In the producer Lambda functions, we first parse the JSON lineage objects to form logical entities such as jobs, tables and operations before constructing the final graph in Neptune.
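The following is a minimal gremlinpython sketch of how a producer function might create a job vertex, a table vertex, and the edge between them. The labels, property names, and endpoint are illustrative assumptions rather than the exact data model shown below.

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __

conn = DriverRemoteConnection("wss://<neptune-endpoint>:8182/gremlin", "g")
g = traversal().with_remote(conn)

# Add a job vertex, a table vertex, and a WRITES_TO edge between them.
job = g.add_v("Job").property("name", "orders_etl").next()
table = g.add_v("Table").property("name", "orders_agg").next()
g.V(job).add_e("WRITES_TO").to(__.V(table)).iterate()

conn.close()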

Lineage Processing

The following diagram shows a sample data model used in this solution.

Lineage Data Model

This data model allows us to easily traverse the graph to extract coarse-grained and fine-grained data lineage, as mentioned earlier.

Data lineage visualization

You can extract specific views of the lineage graph from Neptune using the consumer endpoints backed by Lambda functions. Hierarchical views of lineage at different levels make it easy for the end-user to analyze the information.
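As an illustrative example of such an extraction (again using hypothetical labels and edge names), a consumer function could walk upstream from a table to find the jobs and source tables that feed it:

from gremlin_python.driver.driver_remote_connection import DriverRemoteConnection
from gremlin_python.process.anonymous_traversal import traversal
from gremlin_python.process.graph_traversal import __

conn = DriverRemoteConnection("wss://<neptune-endpoint>:8182/gremlin", "g")
g = traversal().with_remote(conn)

# Walk upstream: which job wrote this table, and which tables did that job read from?
upstream = (
    g.V().has("Table", "name", "orders_agg")
    .repeat(__.in_("WRITES_TO").out("READS_FROM")).emit().times(5)
    .path().by("name")
    .to_list()
)
print(upstream)

conn.close()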

The following screenshot shows a data lineage view across all jobs and tables.

Lineage Visualization Overall

The following screenshot shows a view of a specific job plan.

Lineage Visualization at Job Level

The following screenshot shows a detailed look into the operations taken by the job.

Lineage Visualization at Execution Level

The graphs are visualized using the vis.js network open-source project. You can interact with the graph elements to learn more about the entity’s properties, such as data schema.

Conclusion

In this post, we showed you architectural design options to automatically collect end-to-end data lineage for AWS Glue Spark ETL jobs across a data lake in a multi-account AWS environment using Neptune and the Spline agent. This approach enables searchable metadata, helps to draw insights and achieve an improved organization-wide data lineage posture. The proposed solution uses AWS managed and serverless services, which are scalable and configurable for high availability and performance.

For more information about this solution, see Github. You may modify the code to extend the data model and APIs.


About the Authors

Khoa Nguyen is a Senior Big Data Architect at Amazon Web Services. He works with large enterprise customers and AWS partners to accelerate customers’ business outcomes by providing expertise in Big Data and AWS services.

Krithivasan Balasubramaniyan is a Principal Consultant at Amazon Web Services. He enables global enterprise customers in their digital transformation journey and helps architect cloud native solutions.

Rahul Shaurya is a Senior Big Data Architect at Amazon Web Services. He helps and works closely with customers building data platforms and analytical applications on AWS.

Field Notes: Data-Driven Risk Analysis with Amazon Neptune and Amazon Elasticsearch Service

Post Syndicated from Adriaan de Jonge original https://aws.amazon.com/blogs/architecture/field-notes-data-driven-risk-analysis-with-amazon-neptune-and-amazon-elasticsearch-service/

This blog post is co-authored with Charles Crouspeyre and Angad Srivastava. Charles is Director at Accenture Applied Intelligence and ASEAN AI SME (Subject Matter Expert) and Angad is Data and Analytics Consultant at AWS and NLP (Natural Language Processing) expert. Together, they are the lead architects of the solution presented in this blog.

In this blog, you learn how Amazon Neptune as a graph database, combined with Amazon Elasticsearch Service (Amazon ES) for full text indexing helps you shorten risk analysis processes from weeks to minutes. We give a walk-through of the steps involved in creating this knowledge management solution, which includes natural language processing components.

The business problem

Our Energy customer needs to do a risk assessment before acquiring raw materials that will be processed in their equipment. The process includes assessing the inventory of raw materials, the capacity of storage units, analyzing the performance of the processing units, and quality assurance of the end product. The cycle time for a comprehensive risk analysis across different teams working in silos is more than 2 weeks, while the window of opportunity for purchasing is a couple of days. So, the customer either puts their equipment and personnel at risk or misses good buying opportunities.

The solution described in this blog helps our customer improve and speed up their decision making. This is done through an automated analysis and understanding of the documents and information they have gathered over the years. They use Natural Language Processing (NLP) to analyze and better understand the documents which is discussed later on in this blog.

Our customer has accumulated years of documents that were mostly in silos across the organization: emails, SharePoint, local computer, private notes, and more.

The data is so heterogenous and widespread that it became hard for our customer to retrieve the right information in a timely manner. Our objective was to create a platform centralizing all this information, and to facilitate present and future information retrieval. Making informed decisions on time helps our customer to purchase raw materials at a better price, increasing their margins significantly.

Overview of business solution

To understand the tasks involved, let’s look at the high-level platform workflow:


Figure 1: This illustration visualizes a 4-step process consisting of Hydrate, Analyze, Search and Feedback.

We can summarize our workflow as a 4-step process:

  1. Hydrate: where we extract the information from multiple sources and do a first level of processing such as document scanning and natural language processing (NLP).
  2. Analyze: where the information extracted from the hydration step is ingested and merged with existing information.
  3. Search: where information is retrieved from the system based on user queries, by leveraging our knowledge graph and the concept map representation that we have created.
  4. Feedback: where users can rate the results for the system as good or bad. The feedback is collected and used to update the Knowledge graph, to re-train our models or to improve our query matching layer.

High-level technical architecture

The following architecture consists of a traditional data layer, combined with a knowledge layer. The compute part of the solution is serverless. The database storage part requires long-running solutions.


Figure 2: A diagram visualizing the steps involved in data processing across two layers, the data layer and the knowledge layer and their implementations with AWS services.

The data layer of our application is similar to many common data analytics setups, and includes:

  • An ingestion and normalization component, implemented with AWS Lambda, fronted by Amazon API Gateway and AWS Transfer Family
  • An ETL component, implemented with AWS Glue and AWS Lambda
  • A data enhancement component, implemented with Lambda
  • An analytics component, implemented with Amazon Redshift
  • A knowledge query component, implemented with Lambda
  • A user interface, a custom implementation based on React

Where our solution really adds value, is the knowledge layer, which is what we will focus on in this blog. We created this specifically for our knowledge representation and management. This layer consists of the following:

  • The knowledge extraction block, where the raw text is extracted, analyzed and classified into facts and structured data. This is implemented using Amazon SageMaker and Amazon Comprehend.
  • The knowledge repository, where the raw data is saved, is Amazon Simple Storage Service (Amazon S3).
  • The relationship extraction and indexing block, where the facts extracted earlier are analyzed and added to our knowledge graph. This is implemented with a combination of Neptune, Amazon S3, Amazon DocumentDB (with MongoDB compatibility) and Amazon ES. Neptune is used as a property graph, queried with the Gremlin graph traversal language.
  • The knowledge aggregator, where we leverage both our knowledge graph and business representation to extract facts associated with the user query and rank information based on its relevance. This is implemented using Amazon ES.

The last component, the knowledge aggregator, is fundamental for our infrastructure. In general, when we talk about an information retrieval system – a system designed to put the right information in the hands of users at the right time – there are two common approaches:

  1. Keyword-based search: take the user query and search for the presence of certain keywords from the query in the available documents.
  2. Concept-based search: build a business-related taxonomy to extend the keyword-based search into a business-related concept-based search.

The downside of a keyword-based search is that it does not capture the complexity and specificity of the business domain in which the query occurs. Due to this limitation, we chose a concept-based search approach, as it allows us to inject a layer of business understanding into our ingestion and information retrieval.

Knowledge layer deep-dive

Because the value added from our solution is in the knowledge layer, let’s dive deeper into the details of this layer.


Figure 3: An architecture diagram of the knowledge layer of the solution, classified in 3 categories: ingestion, knowledge representation and retrieval

The architecture in Figure 3 describes the technical solution architecture broken down into 3 key steps. The 3 steps are:

  1. Ingestion
  2. Knowledge representation
  3. Retrieval

Another way to approach the problem definition is by looking at the process flow for how raw data and information move through the system to generate the knowledge layer. Figure 4 gives an example of how the information is broadly treated as it progresses through the logical phases of the process flow.

 


Figure 4: An illustration of how information is extracted from an unstructured document, modeled as a graph and visualized in a business-friendly and concise format.

In this example, we can recognize a raw material of type “Champion” and detect a relationship between this entity and another entity of type “basic nitrogen”. This relationship is classified as the type “is characterized by”.

The facts in the parsed content are then classified into different categories of relevance based on the contextual information contained in the text. For example, an important paragraph that mentions a potential issue is classified as a key highlight with a high degree of relevance.

This paragraph text is further analyzed to recognize and extract the entities mentioned, such as “Champion” and “basic nitrogen”, and to determine the semantic relationship between these entities based on the context of the paragraph, such as “characterized by” and “incompatibility due to low levels of basic nitrogen”.

There is a correlation between the steps of the technical architecture and the phases in the information analysis process, so we will present them together.

The following list shows how the steps of the technical architecture map to the phases of the information analysis process:

  • During the Ingestion step in the technical solution architecture, the aim is to process the incoming raw data in any format as defined in the Extract Information phase of the information analysis process flow.
  • Once the data ingestion has occurred, the next step is to capture the knowledge representation. The contextualize information phase of the information analysis process flow helps ensure that comprehensive and accurate knowledge representation occurs within the system.
  • The last step is to facilitate the retrieval of information by providing appropriate interfaces for interacting with the knowledge representation within the system. This corresponds to the assemble information phase of the information analysis process.

To further understand the proposed solution, let us review the steps and the associated process flow phases.

Technical Architecture Step 1: Ingestion

Information comes in through the ingestion pipeline from various sources, such as websites, reports, news, blogs and internal data. Raw data enters the system either through automated API-based integrations with external websites or internal systems like Microsoft SharePoint, or can be ingested manually through AWS Transfer Family. Once a new piece of data has been ingested into the solution, it initiates the process for extracting information from the raw data.

Information Analysis Phase 1: Extract information

Once the information lands in our system, the knowledge representation process starts with our Lambda functions acting as the orchestrator between other components. Amazon SageMaker was initially used to create custom models for document categorization and classification of ingested unstructured files.

For example, an unstructured file that is ingested into our system gets recognized as a new email (one of the acceptable data sources) and is classified as “compatibility highlights” based on the email contents. But with improvements in the capabilities of the Amazon Comprehend managed service, the need for custom model development, maintenance, and machine learning operations (MLOps) could be reduced. The solution now uses Amazon Comprehend with custom training for the initial step of document categorization and information extraction. Amazon Comprehend was also used to create custom named-entity recognition models, trained to recognize custom raw materials and their properties.
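As a rough sketch of that flow, the orchestrating Lambda might call Amazon Comprehend as follows; the endpoint ARNs are placeholders for custom classification and custom entity recognition models that would have been trained separately.

import boto3

comprehend = boto3.client("comprehend")

text = "Champion is characterized by low levels of basic nitrogen."

# Custom classification: categorize the content (for example, "compatibility highlights").
classes = comprehend.classify_document(
    Text=text,
    EndpointArn="arn:aws:comprehend:us-east-1:111122223333:document-classifier-endpoint/doc-categories",
)["Classes"]

# Custom named-entity recognition: extract raw materials and their properties.
entities = comprehend.detect_entities(
    Text=text,
    EndpointArn="arn:aws:comprehend:us-east-1:111122223333:entity-recognizer-endpoint/raw-materials",
)["Entities"]

print(classes, entities)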

In this example, an unstructured pdf document is ingested into our system as illustrated in Figure 5.


Figure 5: Phase 1 – Information Extraction

Amazon Comprehend analyzes the unstructured document, classifies its contents and extracts a specific piece of information regarding a type of raw material called “Champion”. This has an inherent property called “low basic nitrogen” associated with it.

Technical Architecture Step 2: Knowledge representation

Knowledge representation is the process of extracting semantic relationships between the various information and data elements within a piece of raw data, and incorporating them into the existing layers of knowledge already identified and stored. Based on the categorization of the document, the raw text is pre-processed and parsed into logical units. The parsed data is then analyzed in our NLP layer for content identification and fact classification.

The facts and key relationships deduced from the Amazon Comprehend results are returned to the Lambda functions, which in turn store the detected facts in the knowledge graph.

Information Analysis Phase 2: Contextualize information

Once the information is extracted from the document; our first step is to contextualize the information using our business representation in the form of a taxonomy. The system detects different parts and entities that the paragraph is composed of, and structures the information into our knowledge graph as illustrated in Figure 6.


Figure 6: Context based Knowledge Graph Generation

This data extraction process is repeated iteratively, so that the knowledge graph grows over time through the detection of new facts and relationships. When we ingest new data into our knowledge graph, we search our knowledge graph for similar entities. If a similar entity exists, we analyze the type of relationships and properties both entities have. When we observe sufficient similarities between the entities, we associate relationships from one entity to the other.

For example, a new entity, “Crude A”, which has the properties level of basic nitrogen and level of sulfur, is ingested. Next, we have “Champion”, as described above, which has similar levels of basic nitrogen and a property “risk” associated with it. Based on the existing knowledge graph, we can now infer that there is a high probability that “Crude A” has a similar risk associated with it, as shown in Figure 7.


Figure 7: Crude Knowledge Graph Representation

The probability calculations can take multiple factors into consideration to make the process more accurate. This makes the structure of the knowledge graph quite dynamic, allowing it to evolve automatically.

The complete raw data is also stored in Amazon ES as a secondary implementation to perform free-form queries. This helps ensure that all the relevant information for any extracted fact associated with an entity in the knowledge graph is completely represented within the system. Some of this information may not exist in the knowledge graph because the document data extraction model can’t capture all the relevant information. One reason for this can be poor quality of the source document, which makes automated reading and data extraction difficult. Another reason can be the sophistication of the Amazon Comprehend models.
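As a simple sketch of that secondary store, the raw content could be indexed and searched as shown below; the index name, document fields, and unauthenticated connection are assumptions, and a real deployment would typically sign requests or use fine-grained access control.

from elasticsearch import Elasticsearch

es = Elasticsearch("https://<amazon-es-domain>:443")

document = {
    "source": "email",
    "entity": "Champion",
    "property": "basic nitrogen",
    "text": "Champion is characterized by low levels of basic nitrogen.",
}

# Index the raw extracted content so it remains searchable even if it is not in the graph.
es.index(index="knowledge-raw", body=document)

# Later, a free-form query can retrieve everything mentioning an entity.
hits = es.search(index="knowledge-raw", body={"query": {"match": {"text": "Champion"}}})
print(hits["hits"]["total"])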

Technical Architecture Step 3: Retrieval

To retrieve information, the user query is analyzed by the Lambda function on the right side of Figure 3. Based on the analysis, key terms are identified from the user query for which a search needs to be performed. For example, if the query is “What is the likelihood of damage due to processing Champion in location A”, semantic analysis of the query indicates that we are looking for relationships between the entity Champion, any type of risk, any known incidents at location A, and possible mitigations to reduce the identified risks.

To address the query, the information then needs to be compiled from the existing knowledge graph as well as Amazon ES to provide a complete answer.

Information Analysis Phase 3: Assemble information

Figure 8 illustrates the output of the information assembly process.

Figure 8: "Champion" crude assembled information for property Nitrogen

Figure 8: “Champion” crude assembled information for property Nitrogen

Based on the facts available within the knowledge graph, we have identified that for “Champion” there is a possibility of damage occurring “due to increased pressure drop and loss of heat transfer” but this can be mitigated by “blending champion to meet basic nitrogen levels”.

In addition, suppose there was information available about “Crude B”, which has been processed at “location A”. It also originated from “Brunei”, had similar “Nitrogen” levels and properties such as “Kerogen3” and “napthenic”, and had a processing incident causing damage. We can then conclude, by looking at the information stored within the knowledge graph and Amazon ES, that there is also a possibility of damage occurring due to processing of “Champion” at “Location A”.

Once all the relevant pieces of information have been collected, a sorted list of information is sent back to the user interface to be displayed.

Fact reconciliation

In reality, it is possible that new information contradicts existing information, which causes conflicts during ingestion. There are various ways to handle such contradictions, for example:


Figure 9: Visualizations of four illustrative ways to deal with contradictory new facts.

  1. Assume the latest data is the most accurate, by looking at the timestamp of each data point. This makes it possible to update the list of facts in our knowledge graph
  2. Analyze how new facts alter the properties or relationships of existing facts and update them or create a relationship between nodes
  3. Calculate a reliability score for the available sources, to rank facts based on who provided them
  4. Ask for end user feedback through the user interface

In our solution, we implemented mechanisms 1, 2, and 4. Mechanisms 1 and 2 are implemented within the contextualize information phase of the information analysis process.

Mechanism 4 is implemented in the search results user interface, where the user has ‘thumbs up’ and ‘thumbs down’ buttons to classify the different search results as relevant or not. This information is then fed back into the Amazon Comprehend model and the knowledge graph, and captured within Amazon ES, to optimize subsequent search results.

Over time, mechanism 4 can be expanded to capture more detailed feedback, including corrections to the search results, instead of a simple yes/no response. Such enhancements, along with the implementation of mechanism 3, are possible future improvements to the proposed solution.

Conclusion

Our customer needed help to shorten their risk analysis process to make high-impact purchase decisions for raw materials. Our knowledge management solution helped them extract knowledge from their vast set of documents and make it available in knowledge graph format for risk analysts to analyze. Knowledge graphs are a great way to handle this domain specificity: they help extract information during the ingestion phase and contextualize queries during the retrieval phase.

The possibilities are endless. One thing is certain: we encourage you to use Neptune as a graph database, supported by Amazon ES, for your use cases with highly connected data!

Field Notes provides hands-on technical guidance from AWS Solutions Architects, consultants, and technical account managers, based on their experiences in the field solving real-world business problems for customers.

 

Charles Crouspeyre

Charles Crouspeyre is leading the AI Engineering practice for Accenture in ASEAN, where he is helping companies from all industries think through and deploy their AI ambitions. When not working, he likes to spend time with his young daughter reading/drawing/cooking/singing/playing hide & seek with her as she “requests”.


Angad Srivastava

Angad Srivastava is a Data and Analytics Consultant at AWS in Singapore, where he consults with clients in ASEAN to develop robust AI solutions. When not at his desk, he can be found planning his next budget-friendly backpacking trip to check off yet another country from his bucket list.

Bringing machine learning to more builders through databases and analytics services

Post Syndicated from Swami Sivasubramanian original https://aws.amazon.com/blogs/big-data/bringing-machine-learning-to-more-builders-through-databases-and-analytics-services/

Machine learning (ML) is becoming more mainstream, but even with the increasing adoption, it’s still in its infancy. For ML to have the broad impact that we think it can have, it has to get easier to do and easier to apply. We launched Amazon SageMaker in 2017 to remove the challenges from each stage of the ML process, making it radically easier and faster for everyday developers and data scientists to build, train, and deploy ML models. SageMaker has made ML model building and scaling more accessible to more people, but there’s a large group of database developers, data analysts, and business analysts who work with databases and data lakes where much of the data used for ML resides. These users still find it too difficult and involved to extract meaningful insights from that data using ML.

This group is typically proficient in SQL but not Python, and must rely on data scientists to build the models needed to add intelligence to applications or derive predictive insights from data. And even when you have the model in hand, there’s a long and involved process to prepare and move data to use the model. The result is that ML isn’t being used as much as it can be.

To meet the needs of this large and growing group of builders, we’re integrating ML into AWS databases, analytics, and business intelligence (BI) services.

AWS customers generate, process, and collect more data than ever to better understand their business landscape, market, and customers. And you don’t just use one type of data store for all your needs. You typically use several types of databases, data warehouses, and data lakes, to fit your use case. Because all these use cases could benefit from ML, we’re adding ML capabilities to our purpose-built databases and analytics services so that database developers, data analysts, and business analysts can train models on their data or add inference results right from their database, without having to export and process their data or write large amounts of ETL code.

Machine Learning for database developers

At re:Invent last year, we announced ML integrated inside Amazon Aurora for developers working with relational databases. Previously, adding ML using data from Aurora to an application was a very complicated process. First, a data scientist had to build and train a model, then write the code to read data from the database. Next, you had to prepare the data so it could be used by the ML model. Then, you called an ML service to run the model, reformatted the output for your application, and finally loaded it into the application.

Now, with a simple SQL query in Aurora, you can add ML to an enterprise application. When you run an ML query in Aurora using SQL, it can directly access a wide variety of ML models from Amazon SageMaker and Amazon Comprehend. The integration between Aurora and each AWS ML service is optimized, delivering up to 100 times better throughput when compared to moving data between Aurora and SageMaker or Amazon Comprehend without this integration. Because the ML model is deployed separately from the database and the application, each can scale up or scale out independently of the other.

In addition to making ML available in relational databases, combining ML with certain types of non-relational database models can also lead to better predictions. For example, database developers use Amazon Neptune, a purpose-built, high-performance graph database, to store complex relationships between data in a graph data model. You can query these graphs for insights and patterns and apply the results to implement capabilities such as product recommendations or fraud detection.

However, human intuition and analyzing individual queries are not enough to discover the full breadth of insights available from large graphs. ML can help, but, as was the case with relational databases, it requires you to do a significant amount of heavy lifting upfront to prepare the graph data and then select the best ML model to run against that data. The entire process can take weeks.

To help with this, today we announced the general availability of Amazon Neptune ML to provide database developers access to ML purpose-built for graph data. This integration is powered by SageMaker and uses the Deep Graph Library (DGL), a framework for applying deep learning to graph data. It does the hard work of selecting the graph data needed for ML training, automatically choosing the best model for the selected data, exposing ML capabilities via simple graph queries, and providing templates to allow you to customize ML models for advanced scenarios. The following diagram illustrates this workflow.

And because the DGL is purpose-built to run deep learning on graph data, you can improve the accuracy of most predictions by over 50% compared to that of traditional ML techniques.

Machine Learning for data analysts

At re:Invent last year, we announced ML integrated inside Amazon Athena for data analysts. With this integration, you can access more than a dozen built-in ML models or use your own models in SageMaker directly from ad-hoc queries in Athena. As a result, you can easily run ad-hoc queries in Athena that use ML to forecast sales, detect suspicious logins, or sort users into customer cohorts.

Similarly, data analysts also want to apply ML to the data in their Amazon Redshift data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day. These Amazon Redshift users want to run ML on their data in Amazon Redshift without having to write a single line of Python. Today we announced the preview of Amazon Redshift ML to do just that.

Amazon Redshift now enables you to run ML algorithms on Amazon Redshift data without manually selecting, building, or training an ML model. Amazon Redshift ML works with Amazon SageMaker Autopilot, a service that automatically trains and tunes the best ML models for classification or regression based on your data while allowing full control and visibility.

When you run an ML query in Amazon Redshift, the selected data is securely exported from Amazon Redshift to Amazon Simple Storage Service (Amazon S3). SageMaker Autopilot then performs data cleaning and preprocessing of the training data, automatically creates a model, and applies the best model. All the interactions between Amazon Redshift, Amazon S3, and SageMaker are abstracted away and automatically occur. When the model is trained, it becomes available as a SQL function for you to use. The following diagram illustrates this workflow.

Rackspace Technology, a leading end-to-end multicloud technology services company, and Slalom, a modern consulting firm focused on strategy, technology, and business transformation, are both using Amazon Redshift ML in preview.

Nihar Gupta, General Manager for Data Solutions at Rackspace Technology, says, “At Rackspace Technology, we help companies elevate their AI/ML operations. The seamless integration with Amazon SageMaker will empower data analysts to use data in new ways, and provide even more insight back to the wider organization.”

And Marcus Bearden, Practice Director at Slalom, shared, “We hear from our customers that they want to have the skills and tools to get more insight from their data. Amazon Redshift is a popular cloud data warehouse that many of our customers depend on to power their analytics, and the new Amazon Redshift ML feature will make it easier for SQL users to get new types of insight from their data with machine learning, without learning new skills.”

Machine Learning for business analysts

To bring ML to business analysts, we launched new ML capabilities in Amazon QuickSight earlier this year called ML Insights. ML Insights uses SageMaker Autopilot to enable business analysts to perform ML inference on their data and visualize it in BI dashboards with just a few clicks. You can get results for different use cases that require ML, such as anomaly detection to uncover hidden insights by continuously analyzing billions of data points, and forecasting to predict growth and other business trends. In addition, QuickSight can also give you an automatically generated summary in plain language (a capability we call auto-narratives), which interprets and describes what the data in your dashboard means. See the following screenshot for an example.

Customers like Expedia Group, Tata Consultancy Services, and Ricoh Company are already benefiting from ML out of the box with QuickSight. These human-readable narratives enable you to quickly interpret the data in a shared dashboard and focus on the insights that matter most.

In addition, customers have also been interested in asking questions of their business data in plain language and receiving answers in near-real time. Although some BI tools and vendors have attempted to solve this challenge with Natural Language Query (NLQ), the existing approaches require that you first spend months in advance preparing and building a model on a pre-defined set of data, and even then, you still have no way of asking ad hoc questions when those questions require a new calculation that wasn’t pre-defined in the data model. For example, the question “What is our year-over-year growth rate?” requires that “growth rate” be pre-defined as a calculation in the model. With today’s BI tools, you need to work with your BI teams to create and update the model to account for any new calculation or data, which can take days or weeks of effort.

Last week, we announced Amazon QuickSight Q. ‘Q’ gives business analysts the ability to ask any question of all their data and receive an accurate answer in seconds. To ask a question, you simply type it into the QuickSight Q search bar using natural language and business terminology that you’re familiar with. Q uses ML (natural language processing, schema understanding, and semantic parsing for SQL code generation) to automatically generate a data model that understands the meaning of and relationships between business data, so you can get answers to your business questions without waiting weeks for a data model to be built. Because Q eliminates the need to build a data model, you’re also not limited to asking only a specific set of questions. See the following screenshot for an example.

Best Western Hotels & Resorts is a privately-held hotel brand with a global network of approximately 4,700 hotels in over 100 countries and territories worldwide. “With Amazon QuickSight Q, we look forward to enabling our business partners to self-serve their ad hoc questions while reducing the operational overhead on our team for ad hoc requests,” said Joseph Landucci, Senior Manager of Database and Enterprise Analytics at Best Western Hotels & Resorts. “This will allow our partners to get answers to their critical business questions quickly by simply typing and searching their questions in plain language.”

Summary

For ML to have a broad impact, we believe it has to get easier to do and easier to apply. Database developers, data analysts, and business analysts who work with databases and data lakes have found it too difficult and involved to extract meaningful insights from their data using ML. To meet the needs of this large and growing group of builders, we’ve added ML capabilities to our purpose-built databases and analytics services so that database developers, data analysts, and business analysts can all use ML more easily without the need to be an ML expert. These capabilities put ML in the hands of every data professional so that they can get the most value from their data.


About the Authors

Swami Sivasubramanian is Vice President at AWS in charge of all Amazon AI and Machine Learning services. His team’s mission is “to put machine learning capabilities in the hands of every developer and data scientist.” Swami and the AWS AI and ML organization work on all aspects of machine learning, from ML frameworks (TensorFlow, Apache MXNet, and PyTorch) and infrastructure, to Amazon SageMaker (an end-to-end service for building, training and deploying ML models in the cloud and at the edge), and finally AI services (Transcribe, Translate, Personalize, Forecast, Rekognition, Textract, Lex, Comprehend, Kendra, etc.) that make it easier for app developers to incorporate ML into their apps with no ML experience required.

Previously, Swami managed AWS’s NoSQL and big data services. He managed the engineering, product management, and operations for AWS database services that are the foundational building blocks for AWS: DynamoDB, Amazon ElastiCache (in-memory engines), Amazon QuickSight, and a few other big data services in the works. Swami has been awarded more than 250 patents, authored 40 refereed scientific papers and journal articles, and participates in several academic circles and conferences.

 

Herain Oberoi leads Product Marketing for AWS’s Databases, Analytics, BI, and Blockchain services. His team is responsible for helping customers learn about, adopt, and successfully use AWS services. Prior to AWS, he held various product management and marketing leadership roles at Microsoft and a successful startup that was later acquired by BEA Systems. When he’s not working, he enjoys spending time with his family, gardening, and exercising.