All posts by Shailesh Chauhan

Iterate confidently on Amazon QuickSight datasets with new Dataset Versions capability

Post Syndicated from Shailesh Chauhan original https://aws.amazon.com/blogs/big-data/iterate-confidently-on-amazon-quicksight-datasets-with-new-dataset-versions-capability/

Amazon QuickSight allows data owners and authors to create and model their data using datasets, which contain logical and semantic information about the data. Datasets can be created from one or multiple data sources, can be shared across the organization with strong controls around data access (object-, row-, and column-level security) and included metadata, and can be created or modified programmatically. QuickSight now supports dataset versioning, which allows dataset owners to see how a dataset has progressed over time, preview a version, or revert to a stable working version if something goes wrong. Dataset Versions gives you the confidence to experiment with your content, knowing that your older versions are available and you can easily revert to them if needed. For more details, see Dataset Versions.
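
Datasets can also be inspected and republished through the QuickSight API. The following is a minimal sketch using boto3 (the AWS SDK for Python), not a prescribed workflow: the account ID and dataset ID are placeholders, and the change shown (renaming the dataset) is only illustrative.

```python
import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

ACCOUNT_ID = "111122223333"   # placeholder AWS account ID
DATASET_ID = "sales-dataset"  # placeholder dataset ID

# Read the dataset's current definition.
current = quicksight.describe_data_set(
    AwsAccountId=ACCOUNT_ID, DataSetId=DATASET_ID
)["DataSet"]

# Republish the dataset with a modified definition (here, only the name changes).
quicksight.update_data_set(
    AwsAccountId=ACCOUNT_ID,
    DataSetId=DATASET_ID,
    Name=current["Name"] + " (updated)",
    PhysicalTableMap=current["PhysicalTableMap"],
    LogicalTableMap=current["LogicalTableMap"],
    ImportMode=current["ImportMode"],
)
```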

In this post, we look at a use case of an author editing a dataset and how QuickSight makes it easy to iterate on your dataset definitions.

What is Dataset Versions?

Previously, changes made to a dataset weren’t tracked. Dataset authors would sometimes make a change that broke the dependent dashboards, and were therefore wary of modifying dataset definitions. When something did break, authors had to figure out how to fix the dataset, which could take significant time.

With Dataset Versions, each publish event associated with a dataset is tracked. Dataset authors can review previous versions of the dataset and see how it has progressed. Each time someone publishes the dataset, QuickSight creates a new version, which becomes the active version; the previously active version becomes the most recent entry in the version list. If authors encounter an issue with the current version, they can revert to a previous one.
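
QuickSight manages the version list for you, but the semantics described above are easy to picture. The following is a small conceptual sketch in Python (not the QuickSight API): every publish appends a new active version, and a revert republishes an older definition.

```python
class DatasetVersionHistory:
    """Conceptual model only: QuickSight manages versions for you."""

    def __init__(self):
        self.versions = []  # oldest first; the last entry is the active version

    def publish(self, definition):
        # Every publish creates a new version, which becomes the active version.
        self.versions.append(definition)
        return len(self.versions)  # the new version number

    def active(self):
        return self.versions[-1]

    def revert_to(self, version_number):
        # Reverting republishes an older definition as a new active version.
        return self.publish(self.versions[version_number - 1])
```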

To help you understand versions better, let’s take the following scenario. Imagine you have a dataset that you have iterated on over time, and you have multiple dashboards based on it. You just added a new table called regions to the dataset. QuickSight saves a new version, and the dependent dashboards break because of this addition. You then realize that you added the wrong table: you were supposed to add the stateandcity table instead. Let’s see how the Dataset Versions feature comes to your rescue.

Access versions

To access your dataset versions, on the data prep page of the dataset, choose the Manage menu and then Publishing History.

A panel opens on the right for you to see all the versions. In the following screenshot, the current active version of the dataset is version 38—published on November 10, 2021. This is the version that is breaking your dependent dashboards.

See publishing history

As you make changes to the dataset and publish them, QuickSight builds a timeline of all the publish events, each tracked as a tile in the publishing history. You can choose a tile to preview a particular version and see the dataset definition as it was at that time. You know that the dataset was working fine on October 18, 2021 (the previous version), so you choose Preview to verify that version’s definition.

Revert back

After you confirm the dataset definition, choose Revert to go back to the previous stable version (published on October 18, 2021). QuickSight asks you to confirm, and you choose Publish. The dataset reverts to the old working definition, and the dependent dashboards are fixed.

Start a new version

Alternatively, as you’re previewing the previously published good version (version 37, published October 18, 2021), you can start fresh from that version. The previous version just had the retail_sales_new table, and you can add the correct table, stateandcity, to the dataset definition. When you choose Publish, a new version (version 39) is created, and all the dependent dashboards use this new working version, which fixes them.

Conclusion

This post showed how the new Dataset Versions feature in QuickSight helps you easily iterate on your datasets, showing you how a dataset has progressed over time and allowing you to revert to a specific version. Dataset Versions gives you the freedom to experiment with your content, knowing that your older versions are available and you can revert to them if required. Dataset Versions is now generally available in QuickSight Standard and Enterprise Editions in all QuickSight Regions. For further details, see Dataset Versions.


About the Authors

Shailesh Chauhan is a product manager for Amazon QuickSight, AWS’s cloud-native, fully managed SaaS BI service. Before QuickSight, Shailesh was global product lead at Uber for all data applications built from the ground up. Earlier, he was a founding team member at ThoughtSpot, where he created the world’s first analytics search engine. Shailesh is passionate about building meaningful and impactful products from scratch. He looks forward to helping customers while working with people with great minds and big hearts.

Mayank Jain is a Software Development Manager at Amazon QuickSight. He leads the data preparation team that delivers an enterprise-ready platform to transform, define, and organize data. Before QuickSight, he was a Senior Software Engineer at Microsoft Bing, where he developed core search experiences. Mayank is passionate about solving complex problems with simple user experiences that empower customers to be more productive.

Create larger SPICE datasets and refresh data faster in Amazon QuickSight with new SPICE features

Post Syndicated from Shailesh Chauhan original https://aws.amazon.com/blogs/big-data/create-larger-spice-datasets-and-refresh-data-faster-in-amazon-quicksight-with-new-spice-features/

Amazon QuickSight is a scalable business intelligence (BI) service built for the cloud, which allows insights to be shared with all users in the organization. QuickSight offers SPICE, an in-memory, cloud-native data store that allows end-users to interactively explore data. SPICE provides consistently fast query performance and automatically scales for high concurrency. With SPICE, you save time and cost because you don’t need to retrieve data from the data source (whether a database or data warehouse) every time you change an analysis or update a visual, and you offload concurrent access and analytical complexity from the underlying data source.

Today, we’re introducing incremental refresh in SPICE, with refreshes as frequent as every 15 minutes (four times faster than before), which improves the freshness of data in SPICE. In addition, we’re doubling the SPICE limit on a per-dataset basis to 500 million rows (up from the previous 250 million row limit). In this post, we walk through these new capabilities and how you can use them to create SPICE datasets that can help you scale your data to all your users.

What’s new with QuickSight SPICE?

We’ve added the following capabilities to QuickSight:

  • Incremental refresh – QuickSight now supports incrementally loading new data into SPICE datasets without needing to refresh the full set of data. With incremental refresh, you can update SPICE datasets in a fraction of the time a full refresh would take, enabling access to the most recent insights much sooner. You can schedule incremental refresh to run as often as every 15 minutes on datasets backed by SQL-based data sources, such as Amazon Redshift, Amazon Athena, PostgreSQL, Microsoft SQL Server, or Snowflake.
  • 500 million row SPICE capacity – The QuickSight SPICE engine now supports datasets up to 500 million rows or 500 GB in size. This change lets you use SPICE for datasets twice as large as before.

In the next sections, we show you how to get started with incremental refresh and 500 million row SPICE capacity.

Create large datasets

Let’s say you’re part of the central data team that has access to data tables in data sources, and you want to create a central dataset for analysts. SPICE can now scale to double the capacity, so you can create a single large, scaled dataset rather than create and maintain several unconnected datasets. You can bring up to 32 tables (from different data sources) into a single dataset, up to a total of 500 million rows. The doubled SPICE capacity requires no extra steps; it’s automatically available. To create a dataset, simply choose New Dataset on the Data page. On the Data Prep page for the new dataset, choose Add data to add tables to the dataset.
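
If you prefer to create the dataset programmatically, the following is a minimal boto3 sketch that creates a SPICE dataset from one relational table; the account ID, data source ARN, schema, table, and columns are placeholders, and more tables can then be added and joined on the Data Prep page.

```python
import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

ACCOUNT_ID = "111122223333"  # placeholder AWS account ID
DATA_SOURCE_ARN = (
    "arn:aws:quicksight:us-east-1:111122223333:datasource/sales-db"  # placeholder
)

# Create a SPICE dataset from a single relational table. More tables (up to 32)
# can be added and joined afterward on the Data Prep page.
quicksight.create_data_set(
    AwsAccountId=ACCOUNT_ID,
    DataSetId="central-sales",
    Name="Central Sales",
    ImportMode="SPICE",
    PhysicalTableMap={
        "orders": {
            "RelationalTable": {
                "DataSourceArn": DATA_SOURCE_ARN,
                "Schema": "public",
                "Name": "orders",
                "InputColumns": [
                    {"Name": "order_id", "Type": "INTEGER"},
                    {"Name": "order_date", "Type": "DATETIME"},
                    {"Name": "amount", "Type": "DECIMAL"},
                ],
            }
        }
    },
    LogicalTableMap={
        "orders": {"Alias": "orders", "Source": {"PhysicalTableId": "orders"}}
    },
)
```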

Set up incremental refresh

With incremental refresh, QuickSight now allows you to ingest data incrementally from your SQL-based sources (such as Amazon Redshift, Athena, PostgreSQL, or Snowflake) over a specified time window. On the Datasets page, choose the dataset, and then choose Refresh now or Schedule a refresh.

For Refresh type, select Incremental refresh.
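
If you manage refreshes through the API instead of the console, an on-demand incremental refresh can be started with a sketch like the following; the account and dataset IDs are placeholders, and this assumes the look-back window described in the next section has already been configured.

```python
import uuid

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

ACCOUNT_ID = "111122223333"   # placeholder AWS account ID
DATASET_ID = "sales-dataset"  # placeholder dataset ID

# Start an on-demand incremental refresh of the SPICE dataset.
response = quicksight.create_ingestion(
    AwsAccountId=ACCOUNT_ID,
    DataSetId=DATASET_ID,
    IngestionId=str(uuid.uuid4()),
    IngestionType="INCREMENTAL_REFRESH",
)
print(response["IngestionStatus"])  # for example, INITIALIZED
```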

Configure look-back window

When setting up incremental refresh, you have to specify a look-back window (for example, 6 hours, 1 day, or 1 week) within which new rows are picked up and modified and deleted rows are synced. This means that less data needs to be queried and transferred for each refresh, which speeds up ingestion.

Let’s walk through an example to illustrate the concept. Suppose a dataset contains 6 months’ worth of sales records: 180,000 records (1,000 records per day). The dataset currently holds data from January 1 to June 30, and today is July 1. You run an incremental refresh with a look-back window of 7 days. QuickSight queries the database for all data since June 24 (7 days ago): 7,000 records. All the changes since June 24, including deleted, updated, and added data, are propagated into SPICE. The next day, July 2, QuickSight does the same, but queries from June 25 (again 7,000 records). The end result is that rather than ingesting 180,000 records every day, you only process 7,000.

You can set up a look-back window as part of setting up your incremental refresh. After you select Incremental refresh from the steps in the preceding section, choose Configure.

You can choose the date column to use for the look-back (from all eligible date columns) and the window size, which QuickSight uses to query for that range. Then choose Submit.
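
In recent SDK versions, the look-back window can also be configured through the API. The following boto3 sketch sets a 7-day window on a hypothetical order_date column; the IDs and column name are placeholders.

```python
import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

ACCOUNT_ID = "111122223333"   # placeholder AWS account ID
DATASET_ID = "sales-dataset"  # placeholder dataset ID

# Configure a 7-day look-back window on the dataset's date column.
quicksight.put_data_set_refresh_properties(
    AwsAccountId=ACCOUNT_ID,
    DataSetId=DATASET_ID,
    DataSetRefreshProperties={
        "RefreshConfiguration": {
            "IncrementalRefresh": {
                "LookbackWindow": {
                    "ColumnName": "order_date",  # hypothetical date column
                    "Size": 7,
                    "SizeUnit": "DAY",
                }
            }
        }
    },
)
```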

Schedule an incremental refresh

A scheduled SQL incremental refresh allows you to regularly and incrementally ingest data from a SQL data source into SPICE. As with a manual incremental refresh, if this is a first-time setup, you’re prompted to configure a look-back window. After that, choose the time zone, repetition interval, and starting time, and then choose Create.

The scheduled refresh begins at the time you specified.
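
In recent SDK versions, the same schedule can be created through the API. The following boto3 sketch schedules an incremental refresh every 15 minutes; the IDs, time zone, and start time are placeholders.

```python
from datetime import datetime, timedelta

import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

ACCOUNT_ID = "111122223333"   # placeholder AWS account ID
DATASET_ID = "sales-dataset"  # placeholder dataset ID

# Schedule an incremental refresh to run every 15 minutes.
quicksight.create_refresh_schedule(
    AwsAccountId=ACCOUNT_ID,
    DataSetId=DATASET_ID,
    Schedule={
        "ScheduleId": "incremental-every-15-min",  # placeholder schedule ID
        "RefreshType": "INCREMENTAL_REFRESH",
        "ScheduleFrequency": {
            "Interval": "MINUTE15",
            "Timezone": "America/Los_Angeles",
        },
        "StartAfterDateTime": datetime.utcnow() + timedelta(minutes=10),
    },
)
```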

Set up full ingestion

Previously, for SPICE datasets, the only update mechanism in QuickSight was a full refresh: all the data defined by the dataset was queried and transferred into the dataset from its source, fully replacing what previously existed. With incremental refresh, you can update your data as often as every 15 minutes. However, we still recommend a periodic full refresh to make sure your dataset is in sync with the source. For example, you can set up a full ingestion every week on a weekend so that it doesn’t disrupt any business workflows.
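
As a companion to the incremental schedule shown earlier, the following sketch schedules a weekly full refresh for the weekend; the IDs and times are placeholders.

```python
import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

ACCOUNT_ID = "111122223333"   # placeholder AWS account ID
DATASET_ID = "sales-dataset"  # placeholder dataset ID

# Schedule a full refresh every Sunday at 02:00 to keep SPICE in sync with the source.
quicksight.create_refresh_schedule(
    AwsAccountId=ACCOUNT_ID,
    DataSetId=DATASET_ID,
    Schedule={
        "ScheduleId": "weekly-full-refresh",  # placeholder schedule ID
        "RefreshType": "FULL_REFRESH",
        "ScheduleFrequency": {
            "Interval": "WEEKLY",
            "RefreshOnDay": {"DayOfWeek": "SUNDAY"},
            "TimeOfTheDay": "02:00",
            "Timezone": "America/Los_Angeles",
        },
    },
)
```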

Conclusion

With incremental refresh and double SPICE capacity, QuickSight enables you to create datasets to cater to your scaling business needs in the following ways:

  • Faster and more reliable refreshes – Incremental refreshes are faster because only the most recent data needs to be refreshed, not the entire dataset. The refreshes are also more reliable because they avoid long-running queries and reduce exposure to potential network disruptions.
  • Large datasets – SPICE can now scale up to 500 million rows per dataset, and because you can update incrementally, you don’t have to spend time refreshing the entire dataset to keep it current.
  • Easy setup with fewer resources – With incremental refresh, you have less data to refresh, which reduces the overall resources needed. The setup process is also much simpler with scheduled incremental refresh.

QuickSight’s SPICE incremental refresh and 500 million row SPICE capacity can help you create scalable and reliable datasets without putting a strain on underlying data sources. These features are now generally available in QuickSight Enterprise Edition in all Regions. Go ahead and try them out! To learn more about refreshing data in QuickSight, see Refreshing Data.


About the Authors

Shailesh Chauhan is a Product Manager at Amazon QuickSight, AWS’ cloud-native, fully managed BI service. Before QuickSight, Shailesh was global product lead at Uber for all data applications built from the ground up. Earlier, he was a founding team member at ThoughtSpot, where he created the world’s first analytics search engine. Shailesh is passionate about building meaningful and impactful products from scratch. He looks forward to helping customers while working with people with great minds and big hearts.

Anilkumar Senesetti is a Software Development Manager at AWS QuickSight. He leads the Data Ingestion (DI) team that delivers solutions to accelerate the ingestion of customers’ data into SPICE while ensuring the correctness, durability, consistency, and security of the data. With 15 years of industry experience in the business intelligence domain, he provides valuable insights across layers to deliver solutions that improve the customer experience. He is passionate about predictive analytics, and outside of work, he enjoys building features for an astrology website that he owns.

Create and reuse governed datasets in Amazon QuickSight with new Dataset-as-a-Source feature

Post Syndicated from Shailesh Chauhan original https://aws.amazon.com/blogs/big-data/create-and-reuse-governed-datasets-in-amazon-quicksight-with-new-dataset-as-a-source-feature/

Amazon QuickSight is a fast, cloud-powered business intelligence (BI) service that makes it easy to deliver insights to everyone in your organization. QuickSight recently introduced Dataset-as-a-Source, a new feature that allows data owners to create authoritative datasets that can then be reused and further extended by thousands of users across the enterprise. This post walks through an example of how QuickSight makes it easy to create datasets that are reusable and easy to govern with Dataset-as-a-Source.

Introducing Dataset-as-a-Source

Dataset-as-a-Source allows QuickSight authors and data owners to create authoritative datasets, as a single source of truth, that either use the QuickSight SPICE in-memory engine, or directly query the underlying database. These datasets may contain data from a single table from a single database, or a combination of data across multiple data sources, including flat files, software as a service (SaaS) sources, databases, and data warehouses. Data owners with a deep understanding of the data can predefine metrics and calculations in the dataset with metadata that makes it easy for authors to understand and consume this data.

After you create these authoritative datasets, you can share them with authors who want to consume this data – either directly on a per-user or group basis, or by using shared folders in QuickSight. Authors can simply use these datasets for their dashboard creation activities, or choose to further augment them by adding calculated fields or joining them with other data that is relevant to them. Any updates made by the data owner to the authoritative dataset automatically cascade to the downstream datasets that individual authors create from it. This provides the organization with a secure, governed, and effortless data sharing process that can scale to thousands of authors as needed, without any server setup or data silos on client desktops.
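
Child datasets can also be created programmatically. The following boto3 sketch creates a dataset whose source is an existing (parent) dataset, referenced by its ARN in the logical table; the ARN and IDs are placeholders, and the empty physical table map reflects an assumption that no physical tables are needed when the source is another dataset.

```python
import boto3

quicksight = boto3.client("quicksight", region_name="us-east-1")

ACCOUNT_ID = "111122223333"  # placeholder AWS account ID
PARENT_DATASET_ARN = (
    "arn:aws:quicksight:us-east-1:111122223333:dataset/product-customer"  # placeholder
)

# Create a child dataset that uses the authoritative parent dataset as its source.
quicksight.create_data_set(
    AwsAccountId=ACCOUNT_ID,
    DataSetId="marketing-product-customer",
    Name="Marketing - Product & Customer Rating",
    ImportMode="SPICE",
    PhysicalTableMap={},  # assumption: no physical tables when the source is a dataset
    LogicalTableMap={
        "parent": {
            "Alias": "product_customer",
            "Source": {"DataSetArn": PARENT_DATASET_ARN},
        }
    },
)
```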

Dataset-as-a-Source example use case

This example use case uses the Amazon Customer Reviews Data, which is public data stored in the us-east-1 Region. You have the following three tables:

  • product_reviews – Customer reviews for a specific product
  • customer – Customer profile data
  • customer_address – Customer address information

The following diagram shows the relationship of the three tables.

In the following sections, we explain the process of using Dataset-as-a-Source using the tables from the preceding schema.

Create central datasets

Let’s say you’re part of the central data engineering team that has access to the preceding data with three tables in a database. Your team serves over 100 business analysts across 10 different teams in eight countries, including the Coupons, Outbound Marketing, Mobile Platform, and Recommendations teams. Analysts on each team want to analyze product and customer data along with data specific to their teams.

Rather than manually creating and maintaining several unconnected datasets for each team, your team created a central dataset. You created the central Product and Customer dataset by joining the three tables in the schema. Dashboard performance and query costs are optimized for the large central dataset with Dynamic Querying, which enables direct query datasets that contain joined tables to run more efficiently: joins are applied dynamically so that only the subset of tables needed to render a visual is used in the join. Your team also created a key metric (calculated field): Average Rating. All of these teams use the Average Rating metric as a base to analyze their own business line. Its definition is: Average Rating = Sum(Star_rating) / Unique_Count(Star_rating).
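
As a hedged sketch, the shared metric can be written in QuickSight’s calculated-field syntax, and, assuming dataset-level calculated fields can be attached through the API, included as a CreateColumnsOperation transform on the central dataset’s logical table (the console’s calculated field editor is the documented path; the column ID below is a placeholder).

```python
# QuickSight calculated-field expression for the shared metric.
AVERAGE_RATING_EXPRESSION = "sum({star_rating}) / distinct_count({star_rating})"

# Assumption: the metric is attached as a data transform on the central dataset's
# logical table, for example inside a create_data_set or update_data_set call.
average_rating_transform = {
    "CreateColumnsOperation": {
        "Columns": [
            {
                "ColumnName": "Average Rating",
                "ColumnId": "average-rating",  # placeholder column ID
                "Expression": AVERAGE_RATING_EXPRESSION,
            }
        ]
    }
}
```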

Reuse and join with other datasets

Individual teams reused the central Product and Customer dataset and joined it with their own data to create their own datasets. For example, the Marketing team wanted to understand how their team helped improve the product ratings, so they combined the central Product and Customer dataset with campaign data to create a new dataset: Marketing – Product & Customer Rating. Similarly, the Mobile Platform team combined mobile data with the Product and Customer dataset to understand the impact of ratings on mobile.

With Dataset-as-a-Source, you can centralize data management while allowing each team to customize the data for their own needs, all while syncing updates to the data, such as updates to metric definitions, and maintaining dataset definitions like column names, descriptions, and field folders. Additionally, if these are SPICE datasets, the newly created datasets use SPICE as their source, so QuickSight doesn’t reach back to the original data source every time a child dataset is created or refreshed.

Govern central datasets

You and the central team defined business metrics like Average Rating in the central dataset. With Dataset-as-a-Source, individual teams can use the central dataset without having to redo the work of recreating the field themselves, while extending it with data representing their specific business needs. All teams are able to use a uniform Average Rating metric definition.

Suppose you now want to make a change to the Average Rating definition because of a business operation change. You want to exclude the Digital_Video_Download product category, and need to provide an updated definition to each team. You only have to modify the central dataset, and the associated datasets get the update automatically. This saves you time and prevents errors in business metric definitions.
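
As a hedged sketch of what that change could look like in calculated-field syntax (column names follow the schema in this example), the revised expression simply ignores rows in the excluded category; updating it once on the central dataset propagates the new definition to every downstream dataset.

```python
# Revised "Average Rating" expression that ignores the Digital_Video_Download category.
UPDATED_AVERAGE_RATING_EXPRESSION = (
    "sum(ifelse({product_category} = 'Digital_Video_Download', 0, {star_rating})) / "
    "distinct_count(ifelse({product_category} = 'Digital_Video_Download', NULL, {star_rating}))"
)
```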

Conclusion

Creating datasets from existing datasets using Dataset-as-a-Source helps you with the following:

  • Dataset governance – Data engineers can easily scale to the needs of multiple teams within their organization by developing and maintaining a few general-purpose datasets that describe the organization’s main data models—without compromising on query performance.
  • Data source management reduction – Analysts spend considerable amounts of time and effort requesting access to databases, managing database credentials, finding the right tables, and managing data refresh schedules. Building new datasets from existing datasets means that analysts don’t have to start from scratch with raw data from databases. They can start with pre-curated data, while on the data source side, optimizations such as workload management can be put in place to keep backend stores performing well.
  • Metrics accuracy – Creating datasets from existing datasets allows data engineers to centrally define and maintain critical data definitions, such as sales growth and net marginal return, across their company’s many organizations. It also allows them to distribute changes to those definitions, and gives their analysts the ability to get started with visualizing the right data more quickly and reliably.
  • Dataset customization and flexibility – Creating datasets from existing datasets gives analysts more flexibility to customize datasets for their own business needs without worrying about disrupting data for other teams.

This post showed how QuickSight’s Dataset-as-a-Source can help your data management workflows. This feature greatly improves the governance and reusability of datasets. Dataset-as-a-Source is now generally available in Amazon QuickSight Standard and Enterprise Editions in all QuickSight Regions. For further details, see the Dataset-as-a-Source documentation.


About the Author

Shailesh Chauhan is a product manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service. Before QuickSight, Shailesh was global product lead at Uber for all data applications built from the ground up. Earlier, he was a founding team member at ThoughtSpot, where he created the world’s first analytics search engine. Shailesh is passionate about building meaningful and impactful products from scratch. He looks forward to helping customers while working with people with great minds and big hearts.