All posts by Swami Sivasubramanian

With a zero-ETL approach, AWS is helping builders realize near-real-time analytics

Post syndicated from Swami Sivasubramanian. Original: https://aws.amazon.com/blogs/big-data/realizing-near-real-time-analytics-with-a-zero-etl-future/

Data is at the center of every application, process, and business decision. When data is used to improve customer experiences and drive innovation, it can lead to business growth. According to Forrester, advanced insights-driven businesses are 8.5 times more likely than beginners to report at least 20% revenue growth. However, to realize this growth, managing and preparing the data for analysis has to get easier.

That’s why AWS is investing in a zero-ETL future so that builders can focus more on creating value from data, instead of preparing data for analysis.

Challenges with ETL

What is ETL? Extract, transform, and load (ETL) is the process data engineers use to combine data from different sources. ETL can be challenging, time-consuming, and costly. First, it invariably requires data engineers to write custom code. Next, DevOps engineers have to deploy and manage the infrastructure to make sure the pipelines scale with the workload. If the data sources change, data engineers have to manually update their code and deploy it again. While all of this is happening (a process that can take days), data analysts can't run interactive analysis or build dashboards, data scientists can't build machine learning (ML) models or run predictions, and end-users can't make data-driven decisions.

Furthermore, the time required to build or change pipelines makes the data unfit for near-real-time use cases such as detecting fraudulent transactions, placing online ads, and tracking passenger train schedules. In these scenarios, the opportunity to improve customer experiences, address new business opportunities, or lower business risks can simply be lost.

On the flip side, when organizations can quickly and seamlessly integrate data that is stored and analyzed in different tools and systems, they get a better understanding of their customers and business. As a result, they can make data-driven predictions with more confidence, improve customer experiences, and promote data-driven insights across the business.

AWS is bringing its zero-ETL vision to life

We have been making steady progress toward bringing our zero-ETL vision to life. For example, customers told us that they want to ingest streaming data into their data stores for analytics, all without taking on the complexities of ETL.

With Amazon Redshift Streaming Ingestion, organizations can configure Amazon Redshift to directly ingest high-throughput streaming data from Amazon Managed Streaming for Apache Kafka (Amazon MSK) or Amazon Kinesis Data Streams and make it available for near-real-time analytics in just a few seconds. They can connect to multiple data streams and pull data directly into Amazon Redshift without staging it in Amazon Simple Storage Service (Amazon S3). After running analytics, the insights can be made available broadly across the organization with Amazon QuickSight, a cloud-native, serverless business intelligence service. QuickSight makes it incredibly simple and intuitive to get to answers with Amazon QuickSight Q, which allows users to ask business questions about their data in natural language and receive answers quickly through data visualizations.
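As a concrete illustration of the ingestion side, here is a minimal sketch of how streaming ingestion is configured in Amazon Redshift: an external schema is mapped onto Kinesis Data Streams, and a materialized view over the stream makes records queryable seconds after they arrive. The IAM role ARN and the stream name are hypothetical placeholders.

```sql
-- Map an external schema onto Kinesis Data Streams
-- (the IAM role ARN below is a hypothetical placeholder).
CREATE EXTERNAL SCHEMA kinesis_schema
FROM KINESIS
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-streaming-role';

-- Materialized view over a stream named "clickstream" (hypothetical);
-- AUTO REFRESH keeps the view current as new records arrive.
CREATE MATERIALIZED VIEW clickstream_mv AUTO REFRESH YES AS
SELECT approximate_arrival_timestamp,
       JSON_PARSE(kinesis_data) AS event_payload
FROM kinesis_schema."clickstream";
```

Once the view exists, analysts query it with ordinary SQL, with no staging in Amazon S3 and no pipeline to operate.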

Another example of AWS’s investment in zero-ETL is providing the ability to query a variety of data sources without having to worry about data movement. Using federated query in Amazon Redshift and Amazon Athena, organizations can run queries across data stored in their operational databases, data warehouses, and data lakes so that they can create insights from across multiple data sources with no data movement. Data analysts and data engineers can use familiar SQL commands to join data across several data sources for quick analysis, and store the results in Amazon S3 for subsequent use. This provides a flexible way to ingest data while avoiding complex ETL pipelines.
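To make this concrete, the following is a hedged sketch of a federated query in Amazon Redshift against a live Aurora PostgreSQL database. The endpoint, IAM role, secret ARN, and table names are all hypothetical placeholders.

```sql
-- External schema over a live Aurora PostgreSQL database
-- (endpoint, IAM role, and secret ARN are hypothetical placeholders).
CREATE EXTERNAL SCHEMA orders_live
FROM POSTGRES
DATABASE 'orders' SCHEMA 'public'
URI 'orders-db.cluster-abc123.us-east-1.rds.amazonaws.com'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-federated-role'
SECRET_ARN 'arn:aws:secretsmanager:us-east-1:123456789012:secret:orders-db-creds';

-- Join live operational rows with warehouse history, with no data movement.
SELECT o.order_id, o.order_total, c.lifetime_value
FROM orders_live.orders o
JOIN analytics.customer_history c ON c.customer_id = o.customer_id
WHERE o.order_date >= CURRENT_DATE - 7;
```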

More recently, AWS introduced Amazon Aurora zero-ETL integration with Amazon Redshift at AWS re:Invent 2022. Check out the following video:

We learned from customers that they spend significant time and resources building and managing ETL pipelines between transactional databases and data warehouses. For example, let’s say a global manufacturing company with factories in a dozen countries uses a cluster of Aurora databases to store order and inventory data in each of those countries. When the company executives want to view all of the orders and inventory, the data engineers would have to build individual data pipelines from each of the Aurora clusters to a central data warehouse so that the data analysts can query the combined dataset. To do this, the data integration team has to write code to connect to 12 different clusters and manage and test 12 production pipelines. After the team deploys the code, it has to constantly monitor and scale the pipelines to optimize performance, and when anything changes, they have to make updates across 12 different places. It is quite a lot of repetitive work.

No more need for custom ETL pipelines between Aurora and Amazon Redshift

The Aurora zero-ETL integration with Amazon Redshift brings together Aurora's transactional data and Amazon Redshift's analytics capabilities. It minimizes the work of building and managing custom ETL pipelines between the two services. Unlike traditional systems, where data is siloed in one database and users must trade off between unified analysis and performance, data engineers can replicate data from multiple Aurora database clusters into the same or a new Amazon Redshift instance to derive holistic insights across many applications or partitions. Updates in Aurora are automatically and continuously propagated to Amazon Redshift, so data engineers have the most recent information in near-real time. The entire system can be serverless and dynamically scales up and down based on data volume, so there's no infrastructure to manage. Organizations get the best of both worlds: fast, scalable transactions in Aurora together with fast, scalable analytics in Amazon Redshift, all in one seamless system. With near-real-time access to transactional data, organizations can use Amazon Redshift's capabilities, such as built-in ML, materialized views, data sharing, and federated access to multiple data stores and data lakes, to derive insights from transactional and other data.
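On the Amazon Redshift side, the preview workflow boils down to surfacing the integration as a database and then querying it like any other. The sketch below follows the preview documentation; the integration ID, database, and table names are hypothetical placeholders.

```sql
-- Surface the replicated Aurora data as a database in Amazon Redshift
-- (the integration ID is a hypothetical placeholder).
CREATE DATABASE aurora_orders FROM INTEGRATION 'a1b2c3d4-5678-90ab-cdef-example11111';

-- Replicated tables are queried like any other Redshift table, with
-- changes from Aurora propagated continuously in near-real time.
SELECT country, SUM(order_total) AS revenue
FROM aurora_orders.public.orders
GROUP BY country
ORDER BY revenue DESC;
```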

Improving zero-ETL performance is a continuous goal for AWS. For instance, one of our early zero-ETL preview customers observed that the hundreds of thousands of transactions produced every minute by their Amazon Aurora MySQL databases appeared in their Amazon Redshift warehouse in less than 10 seconds. Previously, they had more than a 2-hour delay moving data through their ETL pipeline into Amazon Redshift. With the zero-ETL integration between Aurora and Amazon Redshift, they can now achieve near-real-time analytics.

This integration is now available in Public Preview. To learn more, refer to Getting started guide for near-real-time analytics using Amazon Aurora zero-ETL integration with Amazon Redshift.

Zero-ETL makes data available to data engineers at the point of use through direct integrations between services and direct querying across a variety of data stores. This frees the data engineers to focus on creating value from the data, instead of spending time and resources building pipelines. AWS will continue investing in its zero-ETL vision so that organizations can accelerate their use of data to drive business growth.


About the Author

Swami Sivasubramanian is the Vice President of AWS Data and Machine Learning.

Data: The genesis for modern invention

Post syndicated from Swami Sivasubramanian. Original: https://aws.amazon.com/blogs/big-data/data-the-genesis-for-modern-invention/

It only takes one groundbreaking invention—one iconic idea that solves a widespread pain point for customers—to create or transform an industry forever. From the invention of the telegraph, to the discovery of GPS, to the earliest cloud computing services, history is filled with examples of these “eureka” moments that continue to have long-lasting impacts on the way we do business today.

Cognitive scientists John Kounios and Mark Beeman demonstrated that great inventors don’t simply stumble upon their epiphanies; in reality, an idea is preceded by a collection of life experiences, educational knowledge, or even past failures the human brain processes and assimilates over time. Their ideas are preceded by a collection of data points.

When we apply this concept to organizations, and the vast amount of data being produced on a daily basis, we realize there’s an incredible opportunity to ingest, store, process, analyze, and visualize data to create the next big thing.

Today—more than ever before—data is the genesis for modern invention. But to produce new ideas with our data, we need to build dynamic, end-to-end data strategies that lead to new customer experiences as the final output. Some of the biggest brands in the world like Formula 1, Toyota, and Georgia-Pacific are already leveraging AWS to do just that.

This week at AWS re:Invent 2022, I shared several key learnings we’ve collected after working with these brands and more than 1.5 million customers who are using AWS to build their data strategies.

I also revealed several new services and innovations for our customers. Here are a few highlights.

You need a comprehensive set of services to get the job done

Creating a data lake to perform analytics and machine learning (ML) is not an end-to-end data strategy. Your needs will inevitably grow and change over time, which is why we believe every customer should have access to a wide variety of tools based on data types, personas, and their specific use cases.

And our data supports this, with 94% of the top 1,000 AWS customers using more than 10 of our databases and analytics services. A one-size-fits-all approach just doesn’t work in the long run.

You need a comprehensive set of services that enable you to store and query data in your databases, data lakes, and data warehouses; services that help you act on your data with analytics, business intelligence, and machine learning; and services that help you catalog and govern your data across your organization.

You should also have access to services that support a variety of data types for your future use cases, whether you’re working with financial data, clinical data, or retail data. Many of our customers are also using their data to create machine learning models, but some data types are still too cumbersome to work with and prepare for ML.

For example, geospatial data, which supports use cases like self-driving cars, urban planning, or even crop yield in agricultural farms, can be incredibly difficult to access, prepare, and visualize for ML. That’s why this week we announced new capabilities for Amazon SageMaker that make it easier for data scientists to work with geospatial data.

Performance and security are paramount

Performance and security continue to be critical components of our customers’ data strategies.

You’ll need to perform at scale across your data warehouses, databases, and data lakes, or when you want to quickly analyze and visualize your data. We’ve built our business on high-performing services like Amazon Aurora, Amazon DynamoDB, and Amazon Redshift, and this week, we announced several new capabilities to continue building on our performance innovations to date.

For our serverless, interactive query service, Amazon Athena, we announced a new integration with Apache Spark that enables you to spin up Spark workloads up to 75 times faster than other serverless Spark offerings. We also introduced a new feature called Elastic Clusters within our fully managed document database, Amazon DocumentDB, that enables customers to easily scale out or shard their data across multiple database instances.

To help our customers protect their data from potential compromises, we announced Amazon GuardDuty RDS Protection to intelligently detect potential threats for their data stored in Aurora, as well as a new open-source project that allows developers to safely use PostgreSQL extensions in their core databases without worrying about unintended security impacts.

Connecting data is critical for deeper insights

To get the most from your data, you need to connect data across silos for deeper insights. However, connecting data across silos typically requires complex extract, transform, and load (ETL) pipelines, which means creating a manual integration every time you want to ask a different question of your data or build a different ML model. This isn't fast enough to keep up with the speed businesses need to move today.

Zero-ETL is the future. We've been making strides toward this zero-ETL future for several years by deepening integrations between our services. And this week, we got closer still by announcing that Aurora now supports zero-ETL integration with Amazon Redshift, bringing together transactional data in Aurora and the analytics capabilities of Amazon Redshift.

We also announced a new auto-copy feature from Amazon Simple Storage Service (Amazon S3) to Amazon Redshift that removes the need for you to build and manage ETL pipelines whenever you want to use your data for analytics. And we’re not stopping here. With AWS, you can now connect to hundreds of data sources, from software as a service (SaaS) applications to on-premises data stores.
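To illustrate the auto-copy feature, here is a sketch based on the preview syntax: a standard COPY statement is extended into a job that watches an S3 prefix. The bucket, table, IAM role, and job name are hypothetical placeholders.

```sql
-- COPY statement registered as a job; AUTO ON tells Amazon Redshift to
-- load new files automatically as they land under the prefix.
-- (Bucket, IAM role, and job name are hypothetical placeholders.)
COPY sales
FROM 's3://example-bucket/sales/'
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-copy-role'
FORMAT AS CSV
JOB CREATE sales_autocopy_job AUTO ON;
```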

We'll continue to build zero-ETL capabilities into our services to help our customers easily analyze all their data, no matter where it resides.

Data governance unleashes innovation

Governance was historically used as a defensive measure, which meant locking down data in silos. But in reality, the right governance strategy helps you move and innovate faster with guardrails that give the right people access to your data, when and where they need it.

In addition to fine-grained access controls within AWS Lake Formation, this week we’re making it easier for customers to govern access and privileges within more of our data services with new capabilities announced in Amazon Redshift and Amazon SageMaker.

Our customers also told us they want an end-to-end strategy that enables them to govern their data across the entire data journey. That’s why this week we announced Amazon DataZone, a new data management service that helps you catalog, discover, analyze, share, and govern data across the organization.

When you properly manage secure access to your data, it can flow to the right places and connect the dots across siloed teams and departments.

Build with AWS

With the introduction of these new services and features this week, as well as our comprehensive set of data services, it's important to remember that support is available as you build your end-to-end data strategy. In fact, we have an entire team at AWS, as well as an extensive network of partners, to help our customers build data foundations that will meet their needs now and well into the future.

For more information about re:Invent 2022, please visit our event page.


About the Author

Swami Sivasubramanian is the Vice President of AWS Data and Machine Learning.

Bringing machine learning to more builders through databases and analytics services

Post syndicated from Swami Sivasubramanian. Original: https://aws.amazon.com/blogs/big-data/bringing-machine-learning-to-more-builders-through-databases-and-analytics-services/

Machine learning (ML) is becoming more mainstream, but even with the increasing adoption, it’s still in its infancy. For ML to have the broad impact that we think it can have, it has to get easier to do and easier to apply. We launched Amazon SageMaker in 2017 to remove the challenges from each stage of the ML process, making it radically easier and faster for everyday developers and data scientists to build, train, and deploy ML models. SageMaker has made ML model building and scaling more accessible to more people, but there’s a large group of database developers, data analysts, and business analysts who work with databases and data lakes where much of the data used for ML resides. These users still find it too difficult and involved to extract meaningful insights from that data using ML.

This group is typically proficient in SQL but not Python, and must rely on data scientists to build the models needed to add intelligence to applications or derive predictive insights from data. And even when you have the model in hand, there’s a long and involved process to prepare and move data to use the model. The result is that ML isn’t being used as much as it can be.

To meet the needs of this large and growing group of builders, we’re integrating ML into AWS databases, analytics, and business intelligence (BI) services.

AWS customers generate, process, and collect more data than ever to better understand their business landscape, market, and customers. And you don't just use one type of data store for all your needs. You typically use several types of databases, data warehouses, and data lakes to fit your use case. Because all these use cases can benefit from ML, we're adding ML capabilities to our purpose-built databases and analytics services so that database developers, data analysts, and business analysts can train models on their data or add inference results right from their database, without having to export and process their data or write large amounts of ETL code.

Machine Learning for database developers

At re:Invent last year, we announced ML integrated inside Amazon Aurora for developers working with relational databases. Previously, adding ML to an application using data from Aurora was a very complicated process. First, a data scientist had to build and train a model, then write code to read data from the database. Next, the data had to be prepared so the ML model could use it. Then, you called an ML service to run the model, reformatted the output for your application, and finally loaded it into the application.

Now, with a simple SQL query in Aurora, you can add ML to an enterprise application. When you run an ML query in Aurora using SQL, it can directly access a wide variety of ML models from Amazon SageMaker and Amazon Comprehend. The integration between Aurora and each AWS ML service is optimized, delivering up to 100 times better throughput when compared to moving data between Aurora and SageMaker or Amazon Comprehend without this integration. Because the ML model is deployed separately from the database and the application, each can scale up or scale out independently of the other.
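For a flavor of what this looks like in practice, here is a minimal sketch using Aurora MySQL syntax. The SageMaker endpoint name, tables, and columns are hypothetical placeholders.

```sql
-- Expose a SageMaker endpoint as a SQL function (Aurora MySQL syntax;
-- the endpoint name and feature columns are hypothetical placeholders).
CREATE FUNCTION predict_churn (age INT, tenure INT, monthly_spend DOUBLE)
RETURNS FLOAT
ALIAS AWS_SAGEMAKER_INVOKE_ENDPOINT ENDPOINT NAME 'customer-churn-endpoint';

-- Inference happens inside the query; Aurora batches rows to the endpoint.
SELECT customer_id, predict_churn(age, tenure, monthly_spend) AS churn_score
FROM customers;

-- The built-in Amazon Comprehend integration needs no endpoint at all.
SELECT review_id, aws_comprehend_detect_sentiment(review_text, 'en') AS sentiment
FROM product_reviews;
```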

In addition to making ML available in relational databases, combining ML with certain types of non-relational database models can also lead to better predictions. For example, database developers use Amazon Neptune, a purpose-built, high-performance graph database, to store complex relationships between data in a graph data model. You can query these graphs for insights and patterns and apply the results to implement capabilities such as product recommendations or fraud detection.

However, human intuition and analysis of individual queries are not enough to discover the full breadth of insights available in large graphs. ML can help, but, as was the case with relational databases, it requires a significant amount of upfront heavy lifting to prepare the graph data and then select the best ML model to run against that data. The entire process can take weeks.

To help with this, today we announced the general availability of Amazon Neptune ML to provide database developers access to ML purpose-built for graph data. This integration is powered by SageMaker and uses the Deep Graph Library (DGL), a framework for applying deep learning to graph data. It does the hard work of selecting the graph data needed for ML training, automatically choosing the best model for the selected data, exposing ML capabilities via simple graph queries, and providing templates to allow you to customize ML models for advanced scenarios. The following diagram illustrates this workflow.

And because the DGL is purpose-built to run deep learning on graph data, you can improve the accuracy of most predictions by over 50% compared to traditional ML techniques.

Machine Learning for data analysts

At re:Invent last year, we announced ML integrated inside Amazon Athena for data analysts. With this integration, you can access more than a dozen built-in ML models or use your own models in SageMaker directly from ad-hoc queries in Athena. As a result, you can easily run ad-hoc queries in Athena that use ML to forecast sales, detect suspicious logins, or sort users into customer cohorts.
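As an example, an ad-hoc anomaly score can be computed inline with Athena's USING EXTERNAL FUNCTION clause. The sketch below assumes a hypothetical SageMaker endpoint, table, and columns.

```sql
-- Athena invokes the SageMaker endpoint on the query's rows
-- (the endpoint, table, and columns are hypothetical placeholders).
USING EXTERNAL FUNCTION detect_anomaly(login_count INT, distinct_ips INT)
RETURNS DOUBLE
SAGEMAKER 'login-anomaly-endpoint'
SELECT user_id,
       detect_anomaly(login_count, distinct_ips) AS anomaly_score
FROM daily_logins
WHERE event_date = current_date;
```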

Similarly, data analysts also want to apply ML to the data in their Amazon Redshift data warehouse. Tens of thousands of customers use Amazon Redshift to process exabytes of data per day. These Amazon Redshift users want to run ML on their data in Amazon Redshift without having to write a single line of Python. Today we announced the preview of Amazon Redshift ML to do just that.

Amazon Redshift now enables you to run ML algorithms on Amazon Redshift data without manually selecting, building, or training an ML model. Amazon Redshift ML works with Amazon SageMaker Autopilot, a service that automatically trains and tunes the best ML models for classification or regression based on your data while allowing full control and visibility.

When you run an ML query in Amazon Redshift, the selected data is securely exported from Amazon Redshift to Amazon Simple Storage Service (Amazon S3). SageMaker Autopilot then performs data cleaning and preprocessing of the training data, automatically creates a model, and applies the best model. All the interactions between Amazon Redshift, Amazon S3, and SageMaker are abstracted away and automatically occur. When the model is trained, it becomes available as a SQL function for you to use. The following diagram illustrates this workflow.
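In SQL terms, a minimal sketch of the two halves of that workflow might look like the following; all table, column, role, and bucket names are hypothetical placeholders.

```sql
-- Train: Redshift ML exports the SELECT result to Amazon S3, where
-- SageMaker Autopilot picks and tunes the best model, then registers
-- it back in Redshift as a SQL function.
CREATE MODEL customer_churn
FROM (SELECT age, tenure_months, monthly_spend, churned
      FROM customer_activity)
TARGET churned
FUNCTION predict_churn
IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-ml-role'
SETTINGS (S3_BUCKET 'example-redshift-ml-bucket');

-- Predict: once training completes, inference is plain SQL.
SELECT customer_id,
       predict_churn(age, tenure_months, monthly_spend) AS churn_prediction
FROM customers;
```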

Rackspace Technology, a leading end-to-end multicloud technology services company, and Slalom, a modern consulting firm focused on strategy, technology, and business transformation, are both using Amazon Redshift ML in preview.

Nihar Gupta, General Manager for Data Solutions at Rackspace Technology, says, “At Rackspace Technology, we help companies elevate their AI/ML operations. The seamless integration with Amazon SageMaker will empower data analysts to use data in new ways, and provide even more insight back to the wider organization.”

And Marcus Bearden, Practice Director at Slalom, shared, “We hear from our customers that they want to have the skills and tools to get more insight from their data, and Amazon Redshift is a popular cloud data warehouse that many of our customers depend on to power their analytics. The new Amazon Redshift ML feature will make it easier for SQL users to get new types of insight from their data with machine learning, without learning new skills.”

Machine Learning for business analysts

To bring ML to business analysts, we launched new ML capabilities in Amazon QuickSight earlier this year called ML Insights. ML Insights uses SageMaker Autopilot to enable business analysts to perform ML inference on their data and visualize it in BI dashboards with just a few clicks. You can get results for different use cases that require ML, such as anomaly detection, which continuously analyzes billions of data points to uncover hidden insights, and forecasting, which predicts growth and other business trends. In addition, QuickSight can give you an automatically generated summary in plain language (a capability we call auto-narratives), which interprets and describes what the data in your dashboard means. See the following screenshot for an example.

Customers like Expedia Group, Tata Consultancy Services, and Ricoh Company are already benefiting from ML out of the box with QuickSight. These human-readable narratives enable you to quickly interpret the data in a shared dashboard and focus on the insights that matter most.

In addition, customers have also been interested in asking questions of their business data in plain language and receiving answers in near-real time. Although some BI tools and vendors have attempted to solve this challenge with Natural Language Query (NLQ), the existing approaches require that you first spend months preparing and building a model on a pre-defined set of data, and even then, you still have no way of asking ad hoc questions when those questions require a new calculation that wasn’t pre-defined in the data model. For example, the question “What is our year-over-year growth rate?” requires that “growth rate” be pre-defined as a calculation in the model. With today’s BI tools, you need to work with your BI teams to create and update the model to account for any new calculation or data, which can take days or weeks of effort.

Last week, we announced Amazon QuickSight Q. ‘Q’ gives business analysts the ability to ask any question of all their data and receive an accurate answer in seconds. To ask a question, you simply type it into the QuickSight Q search bar using natural language and business terminology that you’re familiar with. Q uses ML (natural language processing, schema understanding, and semantic parsing for SQL code generation) to automatically generate a data model that understands the meaning of and relationships between business data, so you can get answers to your business questions without waiting weeks for a data model to be built. Because Q eliminates the need to build a data model, you’re also not limited to asking only a specific set of questions. See the following screenshot for an example.

Best Western Hotels & Resorts is a privately-held hotel brand with a global network of approximately 4,700 hotels in over 100 countries and territories worldwide. “With Amazon QuickSight Q, we look forward to enabling our business partners to self-serve their ad hoc questions while reducing the operational overhead on our team for ad hoc requests,” said Joseph Landucci, Senior Manager of Database and Enterprise Analytics at Best Western Hotels & Resorts. “This will allow our partners to get answers to their critical business questions quickly by simply typing and searching their questions in plain language.”

Summary

For ML to have a broad impact, we believe it has to get easier to do and easier to apply. Database developers, data analysts, and business analysts who work with databases and data lakes have found it too difficult and involved to extract meaningful insights from their data using ML. To meet the needs of this large and growing group of builders, we’ve added ML capabilities to our purpose-built databases and analytics services so that database developers, data analysts, and business analysts can all use ML more easily without the need to be an ML expert. These capabilities put ML in the hands of every data professional so that they can get the most value from their data.


About the Authors

Swami Sivasubramanian is Vice President at AWS in charge of all Amazon AI and Machine Learning services. His team’s mission is “to put machine learning capabilities in the hands of every developer and data scientist.” Swami and the AWS AI and ML organization work on all aspects of machine learning, from ML frameworks (TensorFlow, Apache MXNet, and PyTorch) and infrastructure, to Amazon SageMaker (an end-to-end service for building, training, and deploying ML models in the cloud and at the edge), and finally AI services (Transcribe, Translate, Personalize, Forecast, Rekognition, Textract, Lex, Comprehend, Kendra, etc.) that make it easier for app developers to incorporate ML into their apps with no ML experience required.

Previously, Swami managed AWS’s NoSQL and big data services. He managed the engineering, product management, and operations for AWS database services that are the foundational building blocks for AWS: DynamoDB, Amazon ElastiCache (in-memory engines), Amazon QuickSight, and a few other big data services in the works. Swami has been awarded more than 250 patents, authored 40 refereed scientific papers and journal articles, and participates in several academic circles and conferences.


Herain Oberoi leads Product Marketing for AWS’s Databases, Analytics, BI, and Blockchain services. His team is responsible for helping customers learn about, adopt, and successfully use AWS services. Prior to AWS, he held various product management and marketing leadership roles at Microsoft and a successful startup that was later acquired by BEA Systems. When he’s not working, he enjoys spending time with his family, gardening, and exercising.