Tag Archives: Analytics

Query expansion based on user behaviour

Post Syndicated from Grab Tech original https://engineering.grab.com/query-expansion-based-on-user-behaviour

Introduction

Our consumers used to face a few common pain points while searching for food with the Grab app. Sometimes, the results would include merchants that were not yet operational or locations that were out of the delivery radius. Other times, no alternatives were provided. The search system would also have difficulties handling typos, keywords in different languages, synonyms, and even word spacing issues, resulting in a suboptimal user experience.

Over the past few months, our search team has been building a query expansion framework that can solve these issues. When a user query comes in, it expands the query to a few related keywords based on semantic relevance and user intention. These expanded words are then searched with the original query to recall more results that are high-quality and diversified. Now let’s take a deeper look at how it works.

Query expansion framework

Building the query expansion corpus

We used two different approaches to produce query expansion candidates: manual annotation for top keywords and data mining based on user rewrites.

Manual annotation for top keywords

Search has a pronounced fat head phenomenon. The most frequent thousand of keywords account for more than 70% of the total search traffic. Therefore, handling these keywords well can improve the overall search quality a lot. We manually annotated the possible expansion candidates for these common keywords to cover the most popular merchants, items and alternatives. For instance, “McDonald’s” is annotated with {“burger”, “western”}.

Data mining based on user rewrites

We observed that sometimes users tend to rewrite their queries if they are not satisfied with the search result. As a pilot study, we checked the user rewrite records within some user search sessions and found several interesting samples:

{Ya Kun Kaya Toast,Starbucks}
{healthy,Subway}
{Muni,Muji}
{奶茶,koi}
{Roti,Indian}

We can see that besides spelling corrections, users’ rewrite behaviour also reveals deep semantic relations between these pairs that cannot be easily captured by lexical similarity, such as similar merchants, merchant attributes, language differences, cuisine types, and so on. We can leverage the user’s knowledge to build a query expansion corpus to improve the diversity of the search result and user experience. Furthermore, we can use the wisdom of the crowd to find some common patterns with higher confidence.

Based on this intuition, we leveraged the high volume of search click data available in Grab to generate high-quality expansion pairs at the user session level. To augment the original queries, we collected rewrite pairs that happened to multiple users and multiple times in a time period. Specifically, we used the heuristic rules below to collect the rewrite pairs:

  • Select the sessions where there are at least two distinct queries (rewrite session)
  • Collect adjacent query pairs in the search session where the second query leads to a click but the first does not (effective rewrite)
  • Filter out the sample pairs with time interval longer than 30 seconds in between, as users are more likely to change their mind on what to look for in these pairs (single intention)
  • Count the occurrences and filter out the low-frequency pairs (confidence management)

After we have the mining pairs, we categorised and annotated the rewrite types to gain a deeper understanding of the user’s rewrite behaviour. A few samples mined from the Singapore area data are shown in the table below.

Original query Rewrite query Frequency in a month Distinct user count Type
playmade by 丸作 playmade 697 666 Drop keywords
mcdonald’s burger 573 535 Merchant -> Food
Bubble tea koi 293 287 Food -> Merchant
Kfc McDonald’s 238 234 Merchant -> Merchant
cake birthday cake 206 205 Add words
麦当劳 mcdonald’s 205 199 Locale change
4 fingers 4fingers 165 162 Space correction
krc kfc 126 124 Spelling correction
5 guys five guys 120 120 Number synonym
koi the koi thé 45 44 Tone change

We further computed the percentages of some categories, as shown in the figure below.

Figure 1. The donut chart illustrates the percentages of the distinct user counts for different types of rewrites.

Apart from adding words, dropping words and spelling corrections, a significant portion of the rewrites are in the category of Other. It is more semantic driven, such as merchant to merchant, or merchant to cuisine. Those rewrites are useful for capturing deeper connections between queries and can be a powerful diversifier to query expansion.

Grouping

After all the rewrite pairs were discovered offline through data mining, we grouped the query pairs by the original query to get the expansion candidates of each query. For serving efficiency, we limited the max number of expansion candidates to three.

Query expansion serving

Expansion matching architecture

The expansion matching architecture benefits from the recent search architecture upgrade, where the system flow is changed to a query understanding, multi-recall and result fusion flow. In particular, a query goes through the query understanding module and gets augmented with additional information. In this case, the query understanding module takes in the keyword and expands it to multiple synonyms, for example, KFC will be expanded to fried chicken. The original query together with its expansions are sent together to the search engine under the multi-recall framework. After that, results from multiple recallers with different keywords are fused together.

Continuous monitoring and feedback loop

It’s important to make sure the expansion pairs are relevant and up-to-date. We run the data mining pipeline periodically to capture the new user rewrite behaviours. Meanwhile, we also monitor the expansion pairs’ contribution to the search result by measuring the net contribution of recall or user interaction that the particular query brings, and eliminate the obsolete pairs in an automatic way. This reflects our effort to build an adaptive system.

Results

We conducted online A/B experiments across 6 countries in Southeast Asia to evaluate the expanded queries generated by our system. We set up 3 groups:

  • Control group, where no query is expanded.
  • Treatment group 1, where we expanded the queries based on manual annotations only.
  • Treatment group 2, where we expanded the queries using the data mining approach.

We observed decent uplift in click-through rate and conversion rate from both treatment groups. Furthermore, in treatment group 2, the data mining approach produced even better results.

Future work

Data mining enhancement

Currently, the data mining approach can only identify the pairs from the same search session by one user. This limits the number of linked pairs. Some potential enhancements include:

  • Augment expansion pairs by associating queries from different users who click on the same merchant/item, for example, using a click graph. This can capture relevant queries across user sessions.
  • Build a probabilistic model on top of the current transition pairs. Currently, all the transition pairs are equally weighted but apparently, the transitions that happen more often should carry higher probability/weights.

Ads application

Query expansion can be applied to advertising and would increase ads fill rate. With “KFC” expanded to “fried chicken”, the sponsored merchants who buy the keyword “fried chicken” would be eligible to show up when the user searches “KFC”. This would enable Grab to provide more relevant sponsored content to our users, which helps not only the consumers but also the merchants.

Special thanks to Zhengmin Xu and Daniel Ng for proofreading this article.

Join us

Grab is the leading superapp platform in Southeast Asia, providing everyday services that matter to consumers. More than just a ride-hailing and food delivery app, Grab offers a wide range of on-demand services in the region, including mobility, food, package and grocery delivery services, mobile payments, and financial services across 428 cities in eight countries.

Powered by technology and driven by heart, our mission is to drive Southeast Asia forward by creating economic empowerment for everyone. If this mission speaks to you, join our team today!

Your guide to streaming data & real-time analytics at re:Invent 2022

Post Syndicated from Anna Montalat original https://aws.amazon.com/blogs/big-data/your-guide-to-streaming-data-real-time-analytics-at-reinvent-2022/

Mark your calendars for November 28 through December 2, 2022 to attend AWS re:Invent in Las Vegas – a learning conference hosted by AWS for the global cloud computing community.

To maximize the value of your data, you need to act upon it in real time, instead of waiting for hours, days, or week. AWS streaming data services offer unmatched, end to end capabilities to build real-time streaming data pipelines and applications to maximize the value of your data and act upon it in real time. You can leverage Kinesis Data Streams, Kinesis Video Streams and Amazon Managed Streaming for Apache Kafka (MSK) to collect and store data streams at scale; Kinesis Data Firehose to load real-time streams into data lakes, warehouses, and analytics services; and Kinesis Data Analytics to analyze streaming data in real time using Apache Flink. With streaming data architectures, customers can analyze data as soon as it is produced, get timely insights and make real-time decisions to capitalize on opportunities, enhance customer experiences, prevent networking failures, or update critical business metrics in real-time, just to name a few. This post walks you through the key sessions on streaming data and analytics that you cannot miss this year at reInvent to help you plan your conference week accordingly.

To access the session catalog and reserve your seat for one of our streaming data and analytics sessions, you must be registered for re:Invent. Register now!

Keynotes and leadership sessions you cannot miss!

Speakers have always been a key piece of the re:Invent puzzle. This year is no different, and you’ll have the chance to hear from some of the leading voices at AWS.

Adam Selipsky, Chief Executive Officer of Amazon Web Services – Keynote

Tuesday November 29 | 8:30 AM – 10:30 AM PST | The Venetian

Join Adam Selipsky, CEO of Amazon Web Services, as he looks at the ways that forward-thinking builders are transforming industries and even our future, powered by AWS. He highlights innovations in data, infrastructure, security, and more that are helping customers achieve their goals faster, take advantage of untapped potential, and create a better future with AWS.

Reserve your seat now!

Swami Sivasubramanian, Vice President of AWS Data and Machine Learning – Keynote

Wednesday November 30 | 8:30 AM – 10:30 AM PST | The Venetian

Join Swami Sivasubramanian, Vice President of AWS Data and Machine Learning, as he unveils some of the latest AWS innovations, designed to help you transform data into meaningful insights. Hear from leading AWS customers who are using data to bring new experiences to life for their customers.

Reserve your seat now!

AWS storage innovations at exabyte scale – Leadership session

Tuesday November 29 | 11:00 – 12:00 PM PST | The Venetian

Data is the change agent driving digital transformation. In this session, Mai-Lan Tomsen Bukovec, AWS Tech VP, and Andy Warfield, AWS Distinguished Engineer, share the latest AWS storage innovations and an inside look at how customers drive modern business on data lakes and with high-performance data.

Reserve your seat now!

Unlock the value of your data with AWS analytics – Leadership session

Wednesday November 30 | 2:30 – 3:30 PM PST | The Venetian

Data fuels digital transformation and drives effective business decisions. In this session, G2 Krishnamoorthy, VP of AWS Analytics, addresses the current state of analytics on AWS, covers the latest service innovations around data, and highlights customer successes with AWS analytics.

Reserve your seat now!

Customer sessions

Join our customer sessions to learn first-hand how other organizations are maximizing the value of their data with real-time streaming data architectures, enabling them to untap new business opportunities, enhance processes, and deliver delightful customer experiences.

  • How Riot Games processes 20 TB of analytics data daily on AWS – Riot Games ingests about 20 TB of data every day on AWS. This data powers a wide range of services, including game matchmaking, in-game personalization, analytics, security, and player behavior management. Join this session to learn how Riot Games transformed their data ingestion pipeline to query data from 6 hours after it was produced down to just 5 minutes. Reserve your seat now!
  • How Samsung modernized architecture for real-time analytics – In this session, Samsung SmartThings shares how they modernized their streaming data analytics architecture for real-time analytics. Originally, Samsung developers spent most of their time managing infrastructure. After migrating to Amazon Kinesis Data Analytics, developers were able to focus on delivering business value without needing to worry about infrastructure management. Reserve your seat now!
  • Leveling up computer vision and artificial intelligence development – Seeing is believing, and Kami Vision is proof! In this session, Kami Vision speaks to how they utilized Amazon Kinesis Video Streams to do the undifferentiated video lifting so that they could develop KamiCare fall detection—an accurate way to monitor if a person has fallen to the floor and cannot get up. Reserve your seat now!
  • How Sony Orchard accelerated innovation with Amazon MSK – The Orchard, a subsidiary of Sony Music Entertainment, built a high-performing data synchronization solution using Amazon Managed Streaming for Apache Kafka (Amazon MSK). Learn how their data synchronization and search capabilities improved using this solution. Reserve your seat now!
  • How Poshmark accelerates growth via real-time analytics & personalization – Find out how Poshmark designed real-time personalization using real-time event capture to deliver tailored customer experiences, reduce security risks, and enable end-users to more confidently interact with the Poshmark app. Reserve your seat now!
  • Building and operating at scale with feature management (sponsored by LaunchDarkly) – LaunchDarkly customers deliver software applications that support millions of end-users at any given time. They rely on LaunchDarkly to launch, control, and measure those applications in real time without negative customer impact. In this session, we’ll discuss key architecture decisions and LaunchDarkly best practices. Reserve your seat now!

Breakout sessions

AWS re:Invent breakout sessions are lecture-style and one hour long. These sessions take place across the re:Invent campus and cover all topics at all levels.

  • What’s new in AWS streaming – Streaming data and analytics help your business make real-time contextual decisions, deliver personalized customer experiences, and untap new opportunities in real time. Join us to find out about the latest innovations in the AWS streaming portfolio. Reserve your seat now!
  • Build a managed analytics platform for your ecommerce business – With the increase in popularity of online shopping, building an analytics platform for ecommerce is important for any organization because it provides insights about the business, trends, and customer behavior. Join us to learn how to build a complete analytics platform in batch and real-time mode. Reserve your seat now!
  • Publishing real-time financial data feeds using Kafka – This session describes how to offer a real-time financial data feed as a service on AWS. With Amazon MSK, you can use Kafka to allow your customers to subscribe to message topics containing the financial data of interest. We will cover connectivity best practices for scalability, security options for a secure architecture, and lessons learned from customers that are using AWS to distribute financial data on AWS. Reserve your seat now!
  • Interact with streaming data using Amazon Kinesis Data Analytics Studio – Join us in this theater session to learn how analyzing streaming data provides the timely, actionable insights a business needs to grow. Reserve your seat now!

Chalk talks

Chalk talks are a highly interactive content format with a small audience. Each begins with a short lecture delivered by an AWS expert followed by a Q&A session with the audience.

  • Modern data exchange using AWS data streaming – We’ll explore how different systems sync low-latency data changes using Apache Hudi backed by Amazon Simple Storage Service (Amazon S3) in a data mesh architecture. This modern architecture allows developers to build streaming jobs that read, join, and aggregate data from multiple datasets and sync data changes to downstream data stores. Reserve your seat now!
  • Build a serverless streaming workload with Amazon Kinesis – Collecting, processing, and analyzing streaming data is easy with Amazon Kinesis services. Make plans for this chalk talk that will take your streaming capabilities to the next level. Reserve your seat now!

Workshops

Workshops are two-hour hands-on sessions where you work in teams to solve problems using AWS services. Workshops organize attendees into small groups and provide scenarios to encourage interaction, giving you the opportunity to learn from and teach each other. Don’t forget to bring your laptop!

  • Building a serverless Apache Kafka data pipeline – Serverless means “focus on what matters”! In this workshop, we’ll show how you can build a serverless data pipeline using Amazon MSK Serverless, deploy a Kafka client container-based AWS Lambda function, and much more! Reserve your seat now!
  • Event detection with Amazon MSK and Amazon Kinesis Data Analytics – When in Las Vegas, you do as Las Vegans do! In this workshop, you’ll see how casinos use Amazon MSK, Amazon Kinesis Data Analytics Studio, and AWS Lambda to enhance customer experiences. Reserve your seat now!
  • Build smart camera applications using Amazon Kinesis Video Streams WebRTC – Amazon Kinesis Video Streams WebRTC helps users to easily build low-latency video solutions such as smart doorbells, connected vehicles, surveillance cameras, and more. Join this workshop for hands-on experience building a complete real-world video solution, including setting up a device with a camera to transmit video. Reserve your seat now!

Fun, fun, and more fun!

All work and no play … not at re:Invent! Sure, we’ll work hard and learn a lot, but we also plan to have a great time while we’re together. Our gamified learning sessions will give you real-life learning opportunities through interactive events that promise to be fun and entertaining!

The fun continues with AWS Builder Labs, where you’ll have the opportunity to test your skills in sandbox settings while working alongside some of the leading minds from AWS!

And don’t forget to visit the Analytics kiosk within the AWS Village to meet with experts to dive deeper into AWS streaming data services such as Kinesis Data Streams, Kinesis Data Firehose, Kinesis Data Analytics and Amazon MSK. You will be able to ask our experts questions and experience live demos for our newly launched capabilities. Make sure to stop by the swag distribution table to grab free Analytics swag if you have attended either the Analytics kiosk or one of our breakout sessions, chalk talks, or workshops.

Register today

Keep your eyes on this post for more updates and exciting news. It’s going to be a simply amazing event and we can’t wait to see you at re:Invent 2022, the world’s premier tech event! Register now to secure your spot!


About the author

Anna Montalat is a Senior Product Marketing Manager for AWS streaming data services which includes Amazon Managed Streaming for Apache Kafka (MSK), Kinesis Data Streams, Kinesis Video Streams, Kinesis Data Firehose, and Kinesis Data Analytics. She is passionate about bringing new and emerging technologies to market, working closely with service teams and enterprise customers. Outside of work, Anna skis through winter time and sails through summer.

Use an event-driven architecture to build a data mesh on AWS

Post Syndicated from Jan Michael Go Tan original https://aws.amazon.com/blogs/big-data/use-an-event-driven-architecture-to-build-a-data-mesh-on-aws/

In this post, we take the data mesh design discussed in Design a data mesh architecture using AWS Lake Formation and AWS Glue, and demonstrate how to initialize data domain accounts to enable managed sharing; we also go through how we can use an event-driven approach to automate processes between the central governance account and data domain accounts (producers and consumers). We build a data mesh pattern from scratch as Infrastructure as Code (IaC) using AWS CDK and use an open-source self-service data platform UI to share and discover data between business units.

The key advantage of this approach is being able to add actions in response to data mesh events such as permission management, tag propagation, search index management, and to automate different processes.

Before we dive into it, let’s look at AWS Analytics Reference Architecture, an open-source library that we use to build our solution.

AWS Analytics Reference Architecture

AWS Analytics Reference Architecture (ARA) is a set of analytics solutions put together as end-to-end examples. It regroups AWS best practices for designing, implementing, and operating analytics platforms through different purpose-built patterns, handling common requirements, and solving customers’ challenges.

ARA exposes reusable core components in an AWS CDK library, currently available in Typescript and Python. This library contains AWS CDK constructs (L3) that can be used to quickly provision analytics solutions in demos, prototypes, proofs of concept, and end-to-end reference architectures.

The following table lists data mesh specific constructs in the AWS Analytics Reference Architecture library.

Construct Name Purpose
CentralGovernance Creates an Amazon EventBridge event bus for central governance account that is used to communicate with data domain accounts (producer/consumer). Creates workflows to automate data product registration and sharing.
DataDomain Creates an Amazon EventBridge event bus for data domain account (producer/consumer) to communicate with central governance account. It creates data lake storage (Amazon S3), and workflow to automate data product registration. It also creates a workflow to populate AWS Glue Catalog metadata for newly registered data product.

You can find AWS CDK constructs for the AWS Analytics Reference Architecture on Construct Hub.

In addition to ARA constructs, we also use an open-source Self-service data platform (User Interface). It is built using AWS Amplify, Amazon DynamoDB, AWS Step Functions, AWS Lambda, Amazon API Gateway, Amazon EventBridge, Amazon Cognito, and Amazon OpenSearch. The frontend is built with React. Through the self-service data platform you can: 1) manage data domains and data products, and 2) discover and request access to data products.

Central Governance and data sharing

For the governance of our data mesh, we will use AWS Lake Formation. AWS Lake Formation is a fully managed service that simplifies data lake setup, supports centralized security management, and provides transactional access on top of your data lake. Moreover, it enables data sharing across accounts and organizations. This centralized approach has a number of key benefits, such as: centralized audit; centralized permission management; and centralized data discovery. More importantly, this allows organizations to gain the benefits of centralized governance while taking advantage of the inherent scaling characteristics of decentralized data product management.

There are two ways to share data resources in Lake Formation: 1) Named Based Access Control (NRAC), and 2) Tag-Based Access Control (LF-TBAC). NRAC uses AWS Resource Access Manager (AWS RAM) to share data resources across accounts. Those are consumed via resource links that are based on created resource shares. Tag-Based Access Control (LF-TBAC) is another approach to share data resources in AWS Lake Formation, that defines permissions based on attributes. These attributes are called LF-tags. You can read this blog to learn about LF-TBAC in the context of data mesh.

The following diagram shows how NRAC and LF-TBAC data sharing works. In this example, data domain is registered as a node on mesh and therefore we create two databases in the central governance account. NRAC database is shared with data domain via AWS RAM. Access to data products that we register in this database will be handled through NRAC. LF-TBAC database is tagged with data domain N line of business (LOB) LF-tag: <LOB:N>. LOB tag is automatically shared with data domain N account and therefore database is available in that account. Access to Data Products in this database will be handled through LF-TBAC.

BDB-2279-ram-tag-share

In our solution we will demonstrate both NRAC and LF-TBAC approaches. With the NRAC approach, we will build up an event-based workflow that would automatically accept RAM share in the data domain accounts and automate the creation of the necessary metadata objects (eg. local database, resource links, etc). While with the LF-TBAC approach, we rely on permissions associated with the shared LF-Tags to allow producer data domains to manage their data products, and consumer data domains read access to the relevant data products associated with the LF-Tags that they requested access to.

We use CentralGovernance construct from ARA library to build a central governance account. It creates an EventBridge event bus to enable communication with data domain accounts that register as nodes on mesh. For each registered data domain, specific event bus rules are created that route events towards that account. Central governance account has a central metadata catalog that allows for data to be stored in different data domains, as opposed to a single central lake. For each registered data domain, we create two separate databases in central governance catalog to demonstrate both NRAC and LF-TBAC data sharing. CentralGovernance construct creates workflows for data product registration and data product sharing. We also deploy a self-service data platform UI  to enable good user experience to manage data domains, data products, and to simplify data discovery and sharing.

BDB-2279-central-gov

A data domain: producer and consumer

We use DataDomain construct from ARA library to build a data domain account that can be either producer, consumer, or both. Producers manage the lifecycle of their respective data products in their own AWS accounts. Typically, this data is stored in Amazon Simple Storage Service (Amazon S3). DataDomain construct creates a data lake storage with cross-account bucket policy that enables central governance account to access the data. Data is encrypted using AWS KMS, and central governance account has a permission to use the key. Config secret in AWS Secrets Manager contains all the necessary information to register data domain as a node on mesh in central governance. It includes: 1) data domain name, 2) S3 location that holds data products, and 3) encryption key ARN. DataDomain construct also creates data domain and crawler workflows to automate data product registration.

BDB-2279-data-domain

Creating an event-driven data mesh

Data mesh architectures typically require some level of communication and trust policy management to maintain least privileges of the relevant principals between the different accounts (for example, central governance to producer, central governance to consumer). We use event-driven approach via EventBridge to securely forward events from one event bus to event bus in another account while maintaining the least privilege access. When we register data domain to central governance account through the self-service data platform UI, we establish bi-directional communication between the accounts via EventBridge. Domain registration process also creates database in the central governance catalog to hold data products for that particular domain. Registered data domain is now a node on mesh and we can register new data products.

The following diagram shows data product registration process:

BDB-2279-register-dd-small

  1. Starts Register Data Product workflow that creates an empty table (the schema is managed by the producers in their respective producer account). This workflow also grants a cross-account permission to the producer account that allows producer to manage the schema of the table.
  2. When complete, this emits an event into the central event bus.
  3. The central event bus contains a rule that forwards the event to the producer’s event bus. This rule was created during the data domain registration process.
  4. When the producer’s event bus receives the event, it triggers the Data Domain workflow, which creates resource-links and grants permissions.
  5. Still in the producer account, Crawler workflow gets triggered when the Data Domain workflow state changes to Successful. This creates the crawler, runs it, waits and checks if the crawler is done, and deletes the crawler when it’s complete. This workflow is responsible for populating tables’ schemas.

Now other data domains can find newly registered data products using the self-service data platform UI and request access. The sharing process works in the same way as product registration by sending events from the central governance account to consumer data domain, and triggering specific workflows.

Solution Overview

The following high-level solution diagram shows how everything fits together and how event-driven architecture enables multiple accounts to form a data mesh. You can follow the workshop that we released to deploy the solution that we covered in this blog post. You can deploy multiple data domains and test both data registration and data sharing. You can also use self-service data platform UI to search through data products and request access using both LF-TBAC and NRAC approaches.

BDB-2279-arch-diagram

Conclusion

Implementing a data mesh on top of an event-driven architecture provides both flexibility and extensibility. A data mesh by itself has several moving parts to support various functionalities, such as onboarding, search, access management and sharing, and more. With an event-driven architecture, we can implement these functionalities in smaller components to make them easier to test, operate, and maintain. Future requirements and applications can use the event stream to provide their own functionality, making the entire mesh much more valuable to your organization.

To learn more how to design and build applications based on event-driven architecture, see the AWS Event-Driven Architecture page. To dive deeper into data mesh concepts, see the Design a Data Mesh Architecture using AWS Lake Formation and AWS Glue blog.

If you’d like our team to run data mesh workshop with you, please reach out to your AWS team.


About the authors


Jan Michael Go Tan is a Principal Solutions Architect for Amazon Web Services. He helps customers design scalable and innovative solutions with the AWS Cloud.

Dzenan Softic is a Senior Solutions Architect at AWS. He works with startups to help them define and execute their ideas. His main focus is in data engineering and infrastructure.

David Greenshtein is a Specialist Solutions Architect for Analytics at AWS with a passion for ETL and automation. He works with AWS customers to design and build analytics solutions enabling business to make data-driven decisions. In his free time, he likes jogging and riding bikes with his son.
Vincent Gromakowski is an Analytics Specialist Solutions Architect at AWS where he enjoys solving customers’ analytics, NoSQL, and streaming challenges. He has a strong expertise on distributed data processing engines and resource orchestration platform.

Build an optimized self-service interactive analytics platform with Amazon EMR Studio

Post Syndicated from Pablo Redondo Sanchez original https://aws.amazon.com/blogs/big-data/build-an-optimized-self-service-interactive-analytics-platform-with-amazon-emr-studio/

Data engineers and data scientists are dependent on distributed data processing infrastructure like Amazon EMR to perform data processing and advanced analytics jobs on large volumes of data. In most mid-size and enterprise organizations, cloud operations teams own procuring, provisioning, and maintaining the IT infrastructures, and their objectives and best practices differ from the data engineering and data science teams. Enforcing infrastructure best practices and governance controls present interesting challenges for analytics teams:

  • Limited agility – Designing and deploying a cluster with the required networking, security, and monitoring configuration requires significant expertise in cloud infrastructure. This results in high dependency on operations teams to perform simple experimentation and development tasks. This typically results in weeks or months to deploy an environment.
  • Security and performance risks – Experimentation and development activities typically require sharing existing environments with other teams, which presents security and performance risks due to lack of workload isolation.
  • Limited collaboration – The security complexity of running shared environments and the lack of a shared web UI limits the analytics team’s ability to share and collaborate during development tasks.

To promote experimentation and solve the agility challenge, organizations need to reduce deployment complexity and remove dependencies to cloud operations teams while maintaining guardrails to optimize cost, security, and resource utilization. In this post, we walk you through how to implement a self-service analytics platform with Amazon EMR and Amazon EMR Studio to improve the agility of your data science and data engineering teams without compromising on the security, scalability, resiliency, and cost efficiency of your big data workloads.

Solution overview

A self-service data analytics platform with Amazon EMR and Amazon EMR Studio provides the following advantages:

  • It’s simple to launch and access for data engineers and data scientists.
  • The robust integrated development environment (IDE) is interactive, makes data easy to explore, and provides all the tooling necessary to debug, build, and schedule data pipelines.
  • It enables collaboration for analytics teams with the right level of workload isolation for added security.
  • It removes dependency from cloud operations teams by allowing administrators within each analytics organization to self-provision, scale, and de-provision resources from within the same UI, without exposing the complexities of the EMR cluster infrastructure and without compromising on security, governance, and costs.
  • It simplifies moving from prototyping into a production environment.
  • Cloud operations teams can independently manage EMR cluster configurations as products and continuously optimize for cost and improve the security, reliability, and performance of their EMR clusters.

Amazon EMR Studio is a web-based IDE that provides fully managed Jupyter notebooks where teams can develop, visualize, and debug applications written in R, Python, Scala, and PySpark, and tools such as Spark UI to provide an interactive development experience and simplify debugging of jobs. Data scientists and data engineers can directly access Amazon EMR Studio through a single sign-on enabled URL and collaborate with peers using these notebooks within the concept of an Amazon EMR Studio Workspace, version code with repositories such as GitHub and Bitbucket, or run parameterized notebooks as part of scheduled workflows using orchestration services. Amazon EMR Studio notebook applications run on EMR clusters, so you get the benefit of a highly scalable data processing engine using the performance optimized Amazon EMR runtime for Apache Spark.

The following diagram illustrates the architecture of the self-service analytics platform with Amazon EMR and Amazon EMR Studio.

Self Service Analytics Architecture

Cloud operations teams can assign one Amazon EMR Studio environment to each team for isolation and provision Amazon EMR Studio developer and administrator users within each team. Cloud operations teams have full control on the permissions each Amazon EMR Studio user has via Amazon EMR Studio permissions policies and control the EMR cluster configurations that Amazon EMR Studio administrators can deploy via cluster templates. Amazon EMR Studio administrators within each team can assign workspaces to each developer and attached to existing EMR clusters or, if allowed, self-provision EMR clusters from predefined templates. Each workspace is a serverless Jupyter instance with notebook files backed up continuously into an Amazon Simple Storage Service (Amazon S3) bucket. Users can attach or detach to provisioned EMR clusters and you only pay for the EMR cluster compute capacity used.

Cloud operations teams organize EMR cluster configurations as products within the AWS Service Catalog. In AWS Service Catalog, EMR cluster templates are organized as products in a portfolio that you share with Amazon EMR Studio users. Templates hide the complexities of the infrastructure configuration and can have custom parameters to allow for further optimization based on the workload requirement. After you publish a cluster template, Amazon EMR Studio administrators can launch new clusters and attach to new or existing workspaces within an Amazon EMR Studio without dependency to cloud operations teams. This makes it easier for teams to test upgrades, share predefined templates across teams, and allow analytics users to focus on achieving business outcomes.

The following diagram illustrates the decoupling architecture.

Decoupling Architecture

You can decouple the definition of the EMR clusters configurations as products and enable independent teams to deploy serverless workspaces and attach self-provisioned EMR clusters within Amazon EMR Studio in minutes. This enables organizations to create an agile and self-service environment for data processing and data science at scale while maintaining the proper level of security and governance.

As a cloud operations engineer, the main task is making sure your templates follow proper cluster configurations that are secure, run at optimal cost, and are easy to use. The following sections discuss key recommendations for security, cost optimization, and ease of use when defining EMR cluster templates for use within Amazon EMR Studio. For additional Amazon EMR best practices, refer to the EMR Best Practices Guide.

Security

Security is mission critical for any data science and data prep workload. Ensure you follow these recommendations:

  • Team-based isolation – Maintain workload isolation by provisioning an Amazon EMR Studio environment per team and a workspace per user.
  • Authentication – Use AWS IAM Identity Center (successor for AWS Single Sign-On) or federated access with AWS Identity and Access Management (IAM) to centralize user management.
  • Authorization – Set fine-grained permissions within your Amazon EMR Studio environment. Set limited (1–2) users with the Amazon EMR Studio admin role to allow workspace and cluster provisioning. Most data engineers and data scientists will have a developer role. For more information on how to define permissions, refer to Configure EMR Studio user permissions.
  • Encryption – When defining your cluster configuration templates, ensure encryption is enforced both in transit and at rest. For example, traffic between data lakes should use the latest version of TLS, data is encrypted with AWS Key Management Service (AWS KMS) at rest for Amazon S3, Amazon Elastic Block Store (Amazon EBS), and Amazon Relational Database Service (Amazon RDS).

Cost

To optimize cost of your running EMR cluster, consider the following cost-optimization options in your cluster templates:

  • Use EC2 Spot Instances – Spot Instances let you take advantage of unused Amazon Elastic Compute Cloud (Amazon EC2) capacity in the AWS Cloud and offer up to a 90% discount compared to On-Demand prices. Spot is best suited for workloads that can be interrupted or have flexible SLAs, like testing and development workloads.
  • Use instance fleets – Use instance fleets when using EC2 Spot to increase the likelihood of Spot availability. An instance fleet is a group of EC2 instances that host a particular node type (primary, core, or task) in an EMR cluster. Because instance fleets can consist of a mix of instance types, both On-Demand and Spot, this will increase the likelihood of Spot Instance availability when reaching your target capacity. Consider at least 10 instance types across all Availability Zones.
  • Use Spark cluster mode and ensure that application masters run on On-Demand nodes – The application master (AM) is the main container launching and monitoring the application executors. Therefore, it’s important to ensure the AM is as resilient as possible. In an Amazon EMR Studio environment, you can expect users running multiple applications concurrently. In cluster mode, your Spark applications can run as independent sets of processes spread across your worker nodes within the AMs. By default, an AM can run on any of the worker nodes. Modify the behavior to ensure AMs run only in On-Demand nodes. For details on this setup, see Spot Usage.
  • Use Amazon EMR managed scaling – This avoids overprovisioning clusters and automatically scales your clusters up or down based on resource utilization. With Amazon EMR managed scaling, AWS manages the automatic scaling activity by continuously evaluating cluster metrics and making optimized scaling decisions.
  • Implement an auto-termination policy – This avoids idle clusters or the need to manually monitor and stop unused EMR clusters. When you set an auto-termination policy, you specify the amount of idle time after which the cluster should automatically shut down.
  • Provide visibility and monitor usage costs – You can provide visibility of EMR clusters to Amazon EMR Studio administrators and cloud operations teams by configuring user-defined cost allocation tags. These tags help create detailed cost and usage reports in AWS Cost Explorer for EMR clusters across multiple dimensions.

Ease of use

With Amazon EMR Studio, administrators within data science and data engineering teams can self-provision EMR clusters from templates pre-built with AWS CloudFormation. Templates can be parameterized to optimize cluster configuration according to each team’s workload requirements. For ease of use and to avoid dependencies to cloud operations teams, the parameters should avoid requesting unnecessary details or expose infrastructure complexities. Here are some tips to abstract the input values:

  • Maintain the number of questions to a minimum (less than 5).
  • Hide network and security configurations. Be opinionated when defining your cluster according to your security and network requirements following Amazon EMR best practices.
  • Avoid input values that require knowledge of AWS Cloud-specific terminology, such as EC2 instance types, Spot vs. On-Demand Instances, and so on.
  • Abstract input parameters considering information available to data engineering and data science teams. Focus on parameters that will help further optimize the size and costs of your EMR clusters.

The following screenshot is an example of input values you can request from a data science team and how to resolve them via CloudFormation template features.

EMR Studio IDE

The input parameters are as follows:

  • User concurrency – Knowing how many users are expected to run jobs simultaneously will help determine the number of executors to provision
  • Optimized for cost or reliability – Use Spot Instances to optimize for cost; for SLA sensitive workloads, use only On-Demand nodes
  • Workload memory requirements (small, medium, large) – Determine the ratio of memory per Spark executor in your EMR cluster

The following sections describe how to resolve the EMR cluster configurations from these input parameters and what features to use in your CloudFormation templates.

User concurrency: How many concurrent users do you need?

Knowing the expected user concurrency will help determine the target node capacity of your cluster or the min/max capacity when using the Amazon EMR auto scaling feature. Consider how much capacity (CPU cores and memory) each data scientist needs to run their average workload.

For example, let’s say you want to provision 10 executors to each data scientist in the team. If the expected concurrency is set to 7, then you need to provision 70 executors. An r5.2xlarge instance type has 8 cores and 64 Gib of RAM. With the default configuration, the core count (spark.executor.cores) is set to 1 and memory (spark.executor.memory) is set to 6 Gib. One core will be reserved for running the Spark application, therefore leaving seven executors per node. You will need a total of 10 r5.2xlarge nodes to meet the demand. The target capacity can dynamically resolve to 10 from the user concurrency input, and the capacity weights in your fleet make sure the same capacity is met if different instance sizes are provisioned to meet the expected capacity.

Using an CloudFormation transform allows you to resolve the target capacity based on a numeric input value. A transform passes your template script to a custom AWS Lambda function so you can replace any placeholder in your CloudFormation template with values resolved from your input parameters.

The following CloudFormation script calls the emr-size-macro transform that replaces the custom::Target placeholder in the TargetSpotCapacity object based on the UserConcurrency input value:

Parameters:
...
 UserConcurrency: 
  Description: "How many users you expect to run jobs simultaneously" 
  Type: "Number" 
  Default: "5"
...
Resources
   EMRClusterTaskSpot: 
    'Fn::Transform': 
      Name: emr-size-macro Parameters: 
      FleetType: task 
      InputSize: !Ref TeamSize
    Type: AWS::EMR::InstanceFleetConfig
    Condition: UseSpot
    Properties:
      ClusterId: !Ref EMRCluster
      Name: cfnTask
      InstanceFleetType: TASK
      TargetOnDemandCapacity: 0
      TargetSpotCapacity: "custom::Target"
      LaunchSpecifications:
        OnDemandSpecification:
          AllocationStrategy: lowest-price
        SpotSpecification:
          AllocationStrategy: capacity-optimized
          TimeoutAction: SWITCH_TO_ON_DEMAND
          TimeoutDurationMinutes: 5
     InstanceTypeConfigs: !FindInMap [ InstanceTypes, !Ref MemoryProfile, taskfleet]

Optimized for cost or reliability: How do you optimize your EMR cluster?

This parameter determines if the cluster should use Spot Instances for task nodes to optimize cost or provision only On-Demand nodes for SLA sensitive workloads that need to be optimized for reliability.

You can use the CloudFormation Conditions feature in your template to resolve your desired instance fleet configurations. The following code shows how the Conditions feature looks in a sample EMR template:

Parameters:
  ...
  Optimization: 
    Description: "Choose reliability if your jobs need to meet specific SLAs" 
    Type: "String" 
    Default: "cost" 
    AllowedValues: [ 'cost', 'reliability']
...
Conditions: 
  UseSpot: !Equals 
    - !Ref Optimization 
    - cost 
  UseOnDemand: !Equals 
    - !Ref Optimization 
    - reliability
Resources:
...
EMRClusterTaskSpot:
    Type: AWS::EMR::InstanceFleetConfig
    Condition: UseSpot
    Properties:
      ClusterId: !Ref EMRCluster
      Name: cfnTask
      InstanceFleetType: TASK
      TargetOnDemandCapacity: 0
      TargetSpotCapacity: 6
      LaunchSpecifications:
        OnDemandSpecification:
          AllocationStrategy: lowest-price
        SpotSpecification:
          AllocationStrategy: capacity-optimized
          TimeoutAction: SWITCH_TO_ON_DEMAND
          TimeoutDurationMinutes: 5
      InstanceTypeConfigs:
        - InstanceType: !FindInMap [ InstanceTypes, !Ref ClusterSize, taskfleet]
          WeightedCapacity: 1
 EMRClusterTaskOnDemand:
    Type: AWS::EMR::InstanceFleetConfig
    Condition: UseOnDemand
    Properties:
      ClusterId: !Ref EMRCluster
      Name: cfnTask
      InstanceFleetType: TASK
      TargetOnDemandCapacity: 6
      TargetSpotCapacity: 0
 ...

Workload memory requirements: How big a cluster do you need?

This parameter helps determine the amount of memory and CPUs to allocate to each Spark executor. The specific memory to CPU ratio allocated to each executor should be set appropriately to avoid out of memory errors. You can map the input parameter (small, medium, large) to specific instance types to select the CPU/memory ratio. Amazon EMR has default configurations (spark.executor.cores, spark.executor.memory) based on each instance type. For example, a small size cluster request could resolve to general purpose instances like m5 (default: 2 cores and 4 gb per executor), whereas a medium workflow can resolve to an R type (default: 1 core and 6 gb per executor). You can further tune the default Amazon EMR memory and CPU core allocation to each instance type by following the best practices outlined in the Spark section of the EMR Best Practices Guides.

Use the CloudFormation Mappings section to resolve the cluster configuration in your template:

Parameters:
…
   MemoryProfile: 
    Description: "What is the memory profile you expect in your workload." 
    Type: "String" 
    Default: "small" 
    AllowedValues: ['small', 'medium', 'large']
…
Mappings:
  InstanceTypes: small:
      master: "m5.xlarge"
      core: "m5.xlarge"
      taskfleet:
        - InstanceType: m5.2xlarge
          WeightedCapacity: 1
        - InstanceType: m5.4xlarge
          WeightedCapacity: 2
        - InstanceType: m5.8xlarge
          WeightedCapacity: 3
          ...
    medium:
      master: "m5.xlarge"
      core: "r5.2xlarge"
      taskfleet:
        - InstanceType: r5.2xlarge
          WeightedCapacity: 1
        - InstanceType: r5.4xlarge
          WeightedCapacity: 2
        - InstanceType: r5.8xlarge
          WeightedCapacity: 3
...
Resources:
...
  EMRClusterTaskSpot:
    Type: AWS::EMR::InstanceFleetConfig
    Properties:
      ClusterId: !Ref EMRCluster
      InstanceFleetType: TASK    
      InstanceTypeConfigs: !FindInMap [InstanceTypes, !Ref MemoryProfile, taskfleet]
      ...

Conclusion

In this post, we showed how to create a self-service analytics platform with Amazon EMR and Amazon EMR Studio to take full advantage of the agility the AWS Cloud provides by considerably reducing deployment times without compromising governance. We also walked you through best practices in security, cost, and ease of use when defining your Amazon EMR Studio environment so data engineering and data science teams can speed up their development cycles by removing dependencies from Cloud Operations teams when provisioning their data processing platforms.

If this is your first time exploring Amazon EMR Studio, we recommend checking out the Amazon EMR workshops and referring to Create an EMR Studio. Continue referencing the Amazon EMR Best Practices Guide when defining your templates and check out the Amazon EMR Studio sample repo for EMR cluster template references.


About the Authors

Pablo Redondo is a Principal Solutions Architect at Amazon Web Services. He is a data enthusiast with over 16 years of fintech and healthcare industry experience and is a member of the AWS Analytics Technical Field Community (TFC). Pablo has been leading the AWS Gain Insights Program to help AWS customers achieve better insights and tangible business value from their data analytics initiatives.

Malini Chatterjee is a Senior Solutions Architect at AWS. She provides guidance to AWS customers on their workloads across a variety of AWS technologies with a breadth of expertise in data and analytics. She is very passionate about semi-classical dancing and performs in community events. She loves traveling and spending time with her family.

Avijit Goswami is a Principal Solutions Architect at AWS, specialized in data and analytics. He supports AWS strategic customers in building high-performing, secure, and scalable data lake solutions on AWS using AWS-managed services and open-source solutions. Outside of his work, Avijit likes to travel, hike San Francisco Bay Area trails, watch sports, and listen to music.

How Hudl built a cost-optimized AWS Glue pipeline with Apache Hudi datasets

Post Syndicated from Indira Balakrishnan original https://aws.amazon.com/blogs/big-data/how-hudl-built-a-cost-optimized-aws-glue-pipeline-with-apache-hudi-datasets/

This is a guest blog post co-written with Addison Higley and Ramzi Yassine from Hudl.

Hudl Agile Sports Technologies, Inc. is a Lincoln, Nebraska based company that provides tools for coaches and athletes to review game footage and improve individual and team play. Its initial product line served college and professional American football teams. Today, the company provides video services to youth, amateur, and professional teams in American football as well as other sports, including soccer, basketball, volleyball, and lacrosse. It now serves 170,000 teams in 50 different sports around the world. Hudl’s overall goal is to capture and bring value to every moment in sports.

Hudl’s mission is to make every moment in sports count. Hudl does this by expanding access to more moments through video and data and putting those moments in context. Our goal is to increase access by different people and increase context with more data points for every customer we serve. Using data to generate analytics, Hudl is able to turn data into actionable insights, telling powerful stories with video and data.

To best serve our customers and provide the most powerful insights possible, we need to be able to compare large sets of data between different sources. For example, enriching our MongoDB and Amazon DocumentDB (with MongoDB compatibility) data with our application logging data leads to new insights. This requires resilient data pipelines.

In this post, we discuss how Hudl has iterated on one such data pipeline using AWS Glue to improve performance and scalability. We talk about the initial architecture of this pipeline, and some of the limitations associated with this approach. We also discuss how we iterated on that design using Apache Hudi to dramatically improve performance.

Problem statement

A data pipeline that ensures high-quality MongoDB and Amazon DocumentDB statistics data is available in our central data lake, and is a requirement for Hudl to be able to deliver sports analytics. It’s important to maintain the integrity of the data between MongoDB and Amazon DocumentDB transactional data with the data lake capturing changes in near-real time along with upserts to records in the data lake. Because Hudl statistics are backed by MongoDB and Amazon DocumentDB databases, in addition to a broad range of other data sources, it’s important that relevant MongoDB and Amazon DocumentDB data is available in a central data lake where we can run analytics queries to compare statistics data between sources.

Initial design

The following diagram demonstrates the architecture of our initial design.

Intial Ingestion Pipeline Design

Let’s discuss the key AWS services of this architecture:

  • AWS Data Migration Service (AWS DMS) allowed our team to move quickly in delivering this pipeline. AWS DMS gives our team a full snapshot of the data, and also offers ongoing change data capture (CDC). By combining these two datasets, we can ensure our pipeline delivers the latest data.
  • Amazon Simple Storage Service (Amazon S3) is the backbone of Hudl’s data lake because of its durability, scalability, and industry-leading performance.
  • AWS Glue allows us to run our Spark workloads in a serverless fashion, with minimal setup. We chose AWS Glue for its ease of use and speed of development. Additionally, features such as AWS Glue bookmarking simplified our file management logic.
  • Amazon Redshift offers petabyte-scale data warehousing. Amazon Redshift provides consistently fast performance, and easy integrations with our S3 data lake.

The data processing flow includes the following steps:

  1. Amazon DocumentDB holds the Hudl statistics data.
  2. AWS DMS gives us a full export of statistics data from Amazon DocumentDB, and ongoing changes in the same data.
  3. In the S3 Raw Zone, the data is stored in JSON format.
  4. An AWS Glue job merges the initial load of statistics data with the changed statistics data to give a snapshot of statistics data in JSON format for reference, eliminating duplicates.
  5. In the S3 Cleansed Zone, the JSON data is normalized and converted to Parquet format.
  6. AWS Glue uses a COPY command to insert Parquet data into Amazon Redshift consumption base tables.
  7. Amazon Redshift stores the final table for consumption.

The following is a sample code snippet from the AWS Glue job in the initial data pipeline:

from awsglue.context import GlueContext 
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate() 
spark_context = spark.sparkContext 
gc = GlueContext(spark_context)
   full_df = read_full_data()#Load entire dataset from S3 Cleansed Zone


cdc_df = read_cdc_data() # Read new CDC data which represents delta in the source MongoDB/DocumentDB


joined_df = full_df.join(cdc_df, '_id', 'full_outer') #Calculate final snapshot by joining the existing data with delta


result = joined_df.filter((joined_df.Op != 'D') | (joined_df.Op.isNull())) .select(coalesce(cdc_df._doc, full_df._doc).alias('_doc'))

gc.write_dynamic_frame.from_options(frame=DynamicFrame.fromDF(result, gc) , connection_type = "s3", connection_options = {"path": output_path}, format = "parquet", transformation_ctx = "ctx4")

Challenges

Although this initial solution met our need for data quality, we felt there was room for improvement:

  • The pipeline was slow – The pipeline ran slowly (over 2 hours) because for each batch, the whole dataset was compared. Every record had to be compared, flattened, and converted to Parquet, even when only a few records were changed from the previous daily run.
  • The pipeline was expensive – As the data size grew daily, the job duration also grew significantly (especially in step 4). To mitigate the impact, we needed to allocate more AWS Glue DPUs (Data Processing Units) to scale the job, which led to higher cost.
  • The pipeline limited our ability to scale – Hudl’s data has a long history of rapid growth with increasing customers and sporting events. Given this trend, our pipeline needed to run as efficiently as possible to handle only changing datasets to have predictable performance.

New design

The following diagram illustrates our updated pipeline architecture.

Although the overall architecture looks roughly the same, the internal logic in AWS Glue was significantly changed, along with addition of Apache Hudi datasets.

In step 4, AWS Glue now interacts with Apache HUDI datasets in the S3 Cleansed Zone to upsert or delete changed records as identified by AWS DMS CDC. The AWS Glue to Apache Hudi connector helps convert JSON data to Parquet format and upserts into the Apache HUDI dataset. Retaining the full documents in our Apache HUDI dataset allows us to easily make schema changes to our final Amazon Redshift tables without needing to re-export data from our source systems.

The following is a sample code snippet from the new AWS Glue pipeline:

from awsglue.context import GlueContext 
from pyspark.sql.session import SparkSession

spark = SparkSession.builder.getOrCreate() 
spark_context = spark.sparkContext 
gc = GlueContext(spark_context)

upsert_conf = {'className': 'org.apache.hudi', '
hoodie.datasource.hive_sync.use_jdbc': 'false', 
'hoodie.datasource.write.precombine.field': 'write_ts', 
'hoodie.datasource.write.recordkey.field': '_id', 
'hoodie.table.name': 'glue_table', 
'hoodie.consistency.check.enabled': 'true', 
'hoodie.datasource.hive_sync.database': 'glue_database', 'hoodie.datasource.hive_sync.table': 'glue_table', 'hoodie.datasource.hive_sync.enable': 'true', 'hoodie.datasource.hive_sync.support_timestamp': 'true', 'hoodie.datasource.hive_sync.sync_as_datasource': 'false', 
'path': 's3://bucket/prefix/', 'hoodie.compact.inline': 'false', 'hoodie.datasource.hive_sync.partition_extractor_class':'org.apache.hudi.hive.NonPartitionedExtractor, 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.NonpartitionedKeyGenerator', 'hoodie.upsert.shuffle.parallelism': 200, 
'hoodie.datasource.write.operation': 'upsert', 
'hoodie.cleaner.policy': 'KEEP_LATEST_COMMITS', 
'hoodie.cleaner.commits.retained': 10 }

gc.write_dynamic_frame.from_options(frame=DynamicFrame.fromDF(cdc_upserts_df, gc, "cdc_upserts_df"), connection_type="marketplace.spark", connection_options=upsert_conf)

Results

With this new approach using Apache Hudi datasets with AWS Glue deployed after May 2022, the pipeline runtime was predictable and less expensive than the initial approach. Because we only handled new or modified records by eliminating the full outer join over the entire dataset, we saw an 80–90% reduction in runtime for this pipeline, thereby reducing costs by 80–90% compared to the initial approach. The following diagram illustrates our processing time before and after implementing the new pipeline.

Conclusion

With Apache Hudi’s open-source data management framework, we simplified incremental data processing in our AWS Glue data pipeline to manage data changes at the record level in our S3 data lake with CDC from Amazon DocumentDB.

We hope that this post will inspire your organization to build AWS Glue pipelines with Apache Hudi datasets that reduce cost and bring performance improvements using serverless technologies to achieve your business goals.


About the authors

Addison Higley is a Senior Data Engineer at Hudl. He manages over 20 data pipelines to help ensure data is available for analytics so Hudl can deliver insights to customers.

Ramzi Yassine is a Lead Data Engineer at Hudl. He leads the architecture, implementation of Hudl’s data pipelines and data applications, and ensures that our data empowers internal and external analytics.

Swagat Kulkarni is a Senior Solutions Architect at AWS and an AI/ML enthusiast. He is passionate about solving real-world problems for customers with cloud-native services and machine learning. Swagat has over 15 years of experience delivering several digital transformation initiatives for customers across multiple domains, including retail, travel and hospitality, and healthcare. Outside of work, Swagat enjoys travel, reading, and meditating.

Indira Balakrishnan is a Principal Solutions Architect in the AWS Analytics Specialist SA Team. She is passionate about helping customers build cloud-based analytics solutions to solve their business problems using data-driven decisions. Outside of work, she volunteers at her kids’ activities and spends time with her family.

How SOCAR built a streaming data pipeline to process IoT data for real-time analytics and control

Post Syndicated from DoYeun Kim original https://aws.amazon.com/blogs/big-data/how-socar-built-a-streaming-data-pipeline-to-process-iot-data-for-real-time-analytics-and-control/

SOCAR is the leading Korean mobility company with strong competitiveness in car-sharing. SOCAR has become a comprehensive mobility platform in collaboration with Nine2One, an e-bike sharing service, and Modu Company, an online parking platform. Backed by advanced technology and data, SOCAR solves mobility-related social problems, such as parking difficulties and traffic congestion, and changes the car ownership-oriented mobility habits in Korea.

SOCAR is building a new fleet management system to manage the many actions and processes that must occur in order for fleet vehicles to run on time, within budget, and at maximum efficiency. To achieve this, SOCAR is looking to build a highly scalable data platform using AWS services to collect, process, store, and analyze internet of things (IoT) streaming data from various vehicle devices and historical operational data.

This in-car device data, combined with operational data such as car details and reservation details, will provide a foundation for analytics use cases. For example, SOCAR will be able to notify customers if they have forgotten to turn their headlights off or to schedule a service if a battery is running low. Unfortunately, the previous architecture didn’t enable the enrichment of IoT data with operational data and couldn’t support streaming analytics use cases.

AWS Data Lab offers accelerated, joint-engineering engagements between customers and AWS technical resources to create tangible deliverables that accelerate data and analytics modernization initiatives. The Build Lab is a 2–5-day intensive build with a technical customer team.

In this post, we share how SOCAR engaged the Data Lab program to assist them in building a prototype solution to overcome these challenges, and to build the basis for accelerating their data project.

Use case 1: Streaming data analytics and real-time control

SOCAR wanted to utilize IoT data for a new business initiative. A fleet management system, where data comes from IoT devices in the vehicles, is a key input to drive business decisions and derive insights. This data is captured by AWS IoT and sent to Amazon Managed Streaming for Apache Kafka (Amazon MSK). By joining the IoT data to other operational datasets, including reservations, car information, device information, and others, the solution can support a number of functions across SOCAR’s business.

An example of real-time monitoring is when a customer turns off the car engine and closes the car door, but the headlights are still on. By using IoT data related to the car light, door, and engine, a notification is sent to the customer to inform them that the car headlights should be turned off.

Although this real-time control is important, they also want to collect historical data—both raw and curated data—in Amazon Simple Storage Service (Amazon S3) to support historical analytics and visualizations by using Amazon QuickSight.

Use case 2: Detect table schema change

The first challenge SOCAR faced was existing batch ingestion pipelines that were prone to breaking when schema changes occurred in the source systems. Additionally, these pipelines didn’t deliver data in a way that was easy for business analysts to consume. In order to meet the future data volumes and business requirements, they needed a pattern for the automated monitoring of batch pipelines with notification of schema changes and the ability to continue processing.

The second challenge was related to the complexity of the JSON files being ingested. The existing batch pipelines weren’t flattening the five-level nested structure, which made it difficult for business users and analysts to gain business insights without any effort on their end.

Overview of solution

In this solution, we followed the serverless data architecture to establish a data platform for SOCAR. This serverless architecture allowed SOCAR to run data pipelines continuously and scale automatically with no setup cost and without managing servers.

AWS Glue is used for both the streaming and batch data pipelines. Amazon Kinesis Data Analytics is used to deliver streaming data with subsecond latencies. In terms of storage, data is stored in Amazon S3 for historical data analysis, auditing, and backup. However, when frequent reading of the latest snapshot data is required by multiple users and applications concurrently, the data is stored and read from Amazon DynamoDB tables. DynamoDB is a key-value and document database that can support tables of virtually any size with horizontal scaling.

Let’s discuss the components of the solution in detail before walking through the steps of the entire data flow.

Component 1: Processing IoT streaming data with business data

The first data pipeline (see the following diagram) processes IoT streaming data with business data from an Amazon Aurora MySQL-Compatible Edition database.

Whenever a transaction occurs in two tables in the Aurora MySQL database, this transaction is captured as data and then loaded into two MSK topics via AWS Database Management (AWS DMS) tasks. One topic conveys the car information table, and the other topic is for the device information table. This data is loaded into a single DynamoDB table that contains all the attributes (or columns) that exist in the two tables in the Aurora MySQL database, along with a primary key. This single DynamoDB table contains the latest snapshot data from the two DB tables, and is important because it contains the latest information of all the cars and devices for the lookup against the streaming IoT data. If the lookup were done on the database directly with the streaming data, it would impact the production database performance.

When the snapshot is available in DynamoDB, an AWS Glue streaming job runs continuously to collect the IoT data and join it with the latest snapshot data in the DynamoDB table to produce the up-to-date output, which is written into another DynamoDB table.

The up-to-date data in DynamoDB is used for real-time monitoring and control that SOCAR’s Data Analytics team performs for safety maintenance and fleet management. This data is ultimately consumed by a number of apps to perform various business activities, including route optimization, real-time monitoring for oil consumption and temperature, and to identify a driver’s driving pattern, tire wear and defect detection, and real-time car crash notifications.

Component 2: Processing IoT data and visualizing the data in dashboards

The second data pipeline (see the following diagram) batch processes the IoT data and visualizes it in QuickSight dashboards.

There are two data sources. The first is the Aurora MySQL database. The two database tables are exported into Amazon S3 from the Aurora MySQL cluster and registered in the AWS Glue Data Catalog as tables. The second data source is Amazon MSK, which receives streaming data from AWS IoT Core. This requires you to create a secure AWS Glue connection for an Apache Kafka data stream. SOCAR’s MSK cluster requires SASL_SSL as a security protocol (for more information, refer to Authentication and authorization for Apache Kafka APIs). To create an MSK connection in AWS Glue and set up connectivity, we use the following CLI command:

aws glue create-connection —connection-input
'{"Name":"kafka-connection","Description":"kafka connection example",
"ConnectionType":"KAFKA",
"ConnectionProperties":{
"KAFKA_BOOTSTRAP_SERVERS":"<server-ip-addresses>",
"KAFKA_SSL_ENABLED":"true",
// "KAFKA_CUSTOM_CERT": "s3://bucket/prefix/cert.pem",
"KAFKA_SECURITY_PROTOCOL" : "SASL_SSL",
"KAFKA_SKIP_CUSTOM_CERT_VALIDATION":"false",
"KAFKA_SASL_MECHANISM": "SCRAM-SHA-512",
"KAFKA_SASL_SCRAM_USERNAME": "<username>",
"KAFKA_SASL_SCRAM_PASSWORD: "<password>"
},
"PhysicalConnectionRequirements":
{"SubnetId":"subnet-xxx","SecurityGroupIdList":["sg-xxx"],"AvailabilityZone":"us-east-1a"}}'

Component 3: Real-time control

The third data pipeline processes the streaming IoT data in millisecond latency from Amazon MSK to produce the output in DynamoDB, and sends a notification in real time if any records are identified as an outlier based on business rules.

AWS IoT Core provides integrations with Amazon MSK to set up real-time streaming data pipelines. To do so, complete the following steps:

  1. On the AWS IoT Core console, choose Act in the navigation pane.
  2. Choose Rules, and create a new rule.
  3. For Actions, choose Add action and choose Kafka.
  4. Choose the VPC destination if required.
  5. Specify the Kafka topic.
  6. Specify the TLS bootstrap servers of your Amazon MSK cluster.

You can view the bootstrap server URLs in the client information of your MSK cluster details. The AWS IoT rule was created with the Kafka topic as an action to provide data from AWS IoT Core to Kafka topics.

SOCAR used Amazon Kinesis Data Analytics Studio to analyze streaming data in real time and build stream-processing applications using standard SQL and Python. We created one table from the Kafka topic using the following code:

CREATE TABLE table_name (
column_name1 VARCHAR,
column_name2 VARCHAR(100),
column_name3 VARCHAR,
column_name4 as TO_TIMESTAMP (`time_column`, 'EEE MMM dd HH:mm:ss z yyyy'),
 WATERMARK FOR column AS column -INTERVAL '5' SECOND
)
PARTITIONED BY (column_name5)
WITH (
'connector'= 'kafka',
'topic' = 'topic_name',
'properties.bootstrap.servers' = '<bootstrap servers shown in the MSK client info dialog>',
'format' = 'json',
'properties.group.id' = 'testGroup1',
'scan.startup.mode'= 'earliest-offset'
);

Then we applied a query with business logic to identify a particular set of records that need to be alerted. When this data is loaded back into another Kafka topic, AWS Lambda functions trigger the downstream action: either load the data into a DynamoDB table or send an email notification.

Component 4: Flattening the nested structure JSON and monitoring schema changes

The final data pipeline (see the following diagram) processes complex, semi-structured, and nested JSON files.

This step uses an AWS Glue DynamicFrame to flatten the nested structure and then land the output in Amazon S3. After the data is loaded, it’s scanned by an AWS Glue crawler to update the Data Catalog table and detect any changes in the schema.

Data flow: Putting it all together

The following diagram illustrates our complete data flow with each component.

Let’s walk through the steps of each pipeline.

The first data pipeline (in red) processes the IoT streaming data with the Aurora MySQL business data:

  1. AWS DMS is used for ongoing replication to continuously apply source changes to the target with minimal latency. The source includes two tables in the Aurora MySQL database tables (carinfo and deviceinfo), and each is linked to two MSK topics via AWS DMS tasks.
  2. Amazon MSK triggers a Lambda function, so whenever a topic receives data, a Lambda function runs to load data into DynamoDB table.
  3. There is a single DynamoDB table with columns that exist from the carinfo table and the deviceinfo table of the Aurora MySQL database. This table consists of all the data from two tables and stores the latest data by performing an upsert operation.
  4. An AWS Glue job continuously receives the IoT data and joins it with data in the DynamoDB table to produce the output into another DynamoDB target table.
  5. This target table contains the final data, which includes all the device and car status information from the IoT devices as well as metadata from the Aurora MySQL table.

The second data pipeline (in green) batch processes IoT data to use in dashboards and for visualization:

  1. The car and reservation data (in two DB tables) is exported via a SQL command from the Aurora MySQL database with the output data available in an S3 bucket. The folders that contain data are registered as an S3 location for the AWS Glue crawler and become available via the AWS Glue Data Catalog.
  2. The MSK input topic continuously receives data from AWS IoT. Each car has a number of IoT devices, and each device captures data and sends it to an MSK input topic. The Amazon MSK S3 sink connector is configured to export data from Kafka topics to Amazon S3 in JSON formats. In addition, the S3 connector exports data by guaranteeing exactly-once delivery semantics to consumers of the S3 objects it produces.
  3. The AWS Glue job runs in a daily batch to load the historical IoT data into Amazon S3 and into two tables (refer to step 1) to produce the output data in an Enriched folder in Amazon S3.
  4. Amazon Athena is used to query data from Amazon S3 and make it available as a dataset in QuickSight for visualizing historical data.

The third data pipeline (in blue) processes streaming IoT data from Amazon MSK with millisecond latency to produce the output in DynamoDB and send a notification:

  1. An Amazon Kinesis Data Analytics Studio notebook powered by Apache Zeppelin and Apache Flink is used to build and deploy its output as a Kinesis Data Analytics application. This application loads data from Amazon MSK in real time, and users can apply business logic to select particular events coming from the IoT real-time data, for example, the car engine is off and the doors are closed, but the headlights are still on. The particular event that users want to capture can be sent to another MSK topic (Outlier) via the Kinesis Data Analytics application.
  2. Amazon MSK triggers a Lambda function, so whenever a topic receives data, a Lambda function runs to send an email notification to users that are subscribed to an Amazon Simple Notification Service (Amazon SNS) topic. An email is published using an SNS notification.
  3. The Kinesis Data Analytics application loads data from AWS IoT, applies business logic, and then loads it into another MSK topic (output). Amazon MSK triggers a Lambda function when data is received, which loads data into a DynamoDB Append table.
  4. Amazon Kinesis Data Analytics Studio is used to run SQL commands for ad hoc interactive analysis on streaming data.

The final data pipeline (in yellow) processes complex, semi-structured, and nested JSON files, and sends a notification when a schema evolves.

  1. An AWS Glue job runs and reads the JSON data from Amazon S3 (as a source), applies logic to flatten the nested schema using a DynamicFrame, and pivots out array columns from the flattened frame.
  2. The output is stored in Amazon S3 and is automatically registered to the AWS Glue Data Catalog table.
  3. Whenever there is a new attribute or change in the JSON input data at any level in the nested structure, the new attribute and change are captured in Amazon EventBridge as an event from the AWS Glue Data Catalog. An email notification is published using Amazon SNS.

Conclusion

As a result of the four-day Build Lab, the SOCAR team left with a working prototype that is custom fit to their needs, gaining a clear path to production. The Data Lab allowed the SOCAR team to build a new streaming data pipeline, enrich IoT data with operational data, and enhance the existing data pipeline to process complex nested JSON data. This establishes a baseline architecture to support the new fleet management system beyond the car-sharing business.


About the Authors

DoYeun Kim is the Head of Data Engineering at SOCAR. He is a passionate software engineering professional with 19+ years experience. He leads a team of 10+ engineers who are responsible for the data platform, data warehouse and MLOps engineering, as well as building in-house data products.

SangSu Park is a Lead Data Architect in SOCAR’s cloud DB team. His passion is to keep learning, embrace challenges, and strive for mutual growth through communication. He loves to travel in search of new cities and places.

YoungMin Park is a Lead Architect in SOCAR’s cloud infrastructure team. His philosophy in life is-whatever it may be-to challenge, fail, learn, and share such experiences to build a better tomorrow for the world. He enjoys building expertise in various fields and basketball.

Younggu Yun is a Senior Data Lab Architect at AWS. He works with customers around the APAC region to help them achieve business goals and solve technical problems by providing prescriptive architectural guidance, sharing best practices, and building innovative solutions together. In his free time, his son and he are obsessed with Lego blocks to build creative models.

Vicky Falconer leads the AWS Data Lab program across APAC, offering accelerated joint engineering engagements between teams of customer builders and AWS technical resources to create tangible deliverables that accelerate data analytics modernization and machine learning initiatives.

Learn more about Apache Flink and Amazon Kinesis Data Analytics with three new videos

Post Syndicated from Deepthi Mohan original https://aws.amazon.com/blogs/big-data/learn-more-about-apache-flink-and-amazon-kinesis-data-analytics-with-three-new-videos/

Amazon Kinesis Data Analytics is a fully managed service for Apache Flink that reduces the complexity of building, managing, and integrating Apache Flink applications with other AWS services. Apache Flink is an open-source framework and engine for stateful processing of data streams. It’s highly available and scalable, delivering high throughput and low latency for the most demanding stream-processing applications.

In this post, we highlight three new videos for you to learn more about Apache Flink and Kinesis Data Analytics, including open-source contributions to Apache Flink, our learnings from running thousands of Flink jobs on a managed service, and how we use Kinesis Data Analytics and Apache Flink to enable machine learning (ML) in Alexa.

In Introducing the new Async Sink, we present the new Async Sink framework, an open-source contribution to make it easier than ever to build sink connectors for Apache Flink. You can learn about the need for the Async Sink framework and how we built it, followed by a demo of building a new sink to Amazon CloudWatch to deliver CloudWatch metrics, in under 20 minutes! The Async Sink framework bootstraps development of Flink sinks, is compatible with Apache Flink 1.15 and above, and has already seen usage by the community beyond building new sinks to AWS services.

The video Practical learnings from running thousands of Flink jobs shares insight from running Kinesis Data Analytics, a managed service for Apache Flink that runs tens of thousands of Flink jobs. You can learn lessons based on our experience of operating Apache Flink at very large scale, touching on issues such as out-of-memory errors, timeouts, and stability challenges. The video also covers improving application performance with memory tuning and configuration changes and the approaches to automating job health monitoring and management of Flink jobs at scale.

“Alexa, be quiet!” End-to-end near-real time model building and evaluation in Amazon Alexa discusses how Alexa has built an automated end-to-end solution for incremental model building or fine-tuning ML models through continuous learning, continual learning, or semi-supervised active learning. Alexa uses Apache Flink to transform and discover metrics in real time. In this video, you learn about how Alexa scales infrastructure to meet the needs of ML teams across Alexa, and explore specific use cases that use Apache Flink and Kinesis Data Analytics to improve Alexa experiences to delight customers.

To learn more about Kinesis Data Analytics for Apache Flink, visit our product page.


About the author

Deepthi Mohan is a Principal Product Manager on the Kinesis Data Analytics team.

How Kyligence Cloud uses Amazon EMR Serverless to simplify OLAP

Post Syndicated from Daniel Gu original https://aws.amazon.com/blogs/big-data/how-kyligence-cloud-uses-amazon-emr-serverless-to-simplify-olap/

This post was co-written with Daniel Gu and Yolanda Wang, from Kyligence.

Today, more than ever, organizations realize that modern business runs on data—almost all our interactions with business are based on data, and organizations must use analytics to understand, plan, and improve their operations. That is where Online Analytical Processing (OLAP) comes in. OLAP is designed to manage and analyze big data, enabling organizations to use their data to extract business insights in multiple dimensions.

Kyligence Cloud OLAP solution offers an Intelligent OLAP Platform to simplify multi-dimensional analytics for cloud data lakes. In the past, Kyligence used to deploy and maintain its own Spark clusters based on Amazon Elastic Compute Cloud (Amazon EC2) to handle the multi-dimensional model pre-computing process that required users to build their monitoring and alerting systems to improve the observability and reliability of the Spark cluster. In this post, we present how Kyligence built and end-to-end Kyligence Cloud OLAP solution with Amazon EMR Serverless to simplify deployment and operations, reduce costs, and accelerate time-to-value over the data lake.

What is Amazon EMR Serverless?

Amazon EMR Serverless is a big data cloud platform for running large-scale distributed data processing jobs, and machine learning (ML) applications using open-source analytics frameworks like Apache Spark and Apache Hive. Amazon EMR Serverless makes it easy and cost-effective for data engineers and analysts to run applications without having to tune, operate, optimize, secure, or manage clusters.

What is OLAP?

OLAP is an approach to quickly answer analytics queries at high speeds on large volumes of data, providing capabilities for precomputation, sophisticated data modeling, and multi-dimensional analytics by rolling up large, sometimes separate datasets into a multi-dimensional database known as an OLAP Cube that enables “slicing and dicing” of data from different viewpoints for a streamlined query experience. Apache Kylin, Apache Druid, and ClickHouse are some of the popular OLAP tools.

Although OLAP tools have been successfully used in various industries, they still face many challenges:

  • Dependence on IT organizations – Traditional OLAP tools require complex infrastructure to run large-scale data computing. It requires a large team of IT professionals to operate and maintain this infrastructure, resulting in high costs.
  • Need for large compute resources – Traditional OLAP tools need huge amount of computing resources for processing, and transforming data through a series of specific steps toward a concrete goal. Lack of computational capabilities leads to longer response times, limits the amount of data that can be processed, and impedes the flexibility of the OLAP tool greatly . As a result, data analysts are often confined to narrow datasets, incapable of analyzing all the data freely.
  • Inefficient usage of resources in the cloud – When a large-scale data modeling calculation is performed in the cloud, the cost estimation tools estimate and deploy the corresponding computing resources. However, the utilization rate of resources is often not very high, resulting in inefficient usage of resources.

With OLAP integrated with Amazon EMR Serverless, OLAP tools can use Amazon EMR Serverless as a serverless computing resource pool to complete data processing jobs, which simplifies and enhances user experience.

Kyligence approach to OLAP using Amazon EMR Serverless

Kyligence is an AWS ISV partner that offers an Intelligent OLAP Platform to simplify multi-dimensional analytics for cloud data lakes. As a cloud-native OLAP platform, Kyligence Cloud now integrates with Amazon EMR Serverless to automatically provision Spark to run indexing and building jobs. This empowers you to use all the features and benefits of Kyligence’s OLAP with Amazon EMR Serverless.

Kyligence seamlessly connects to major AWS-native data sources including Amazon Simple Storage Service (Amazon S3), Amazon Redshift, and Amazon Relational Database Service (Amazon RDS) to get the most out of your data on AWS, building a comprehensive AWS big data solution. During data modeling, Kyligence uses Amazon S3 to store the pre-computed data, and serves it for high concurrency queries. Kyligence also seamlessly interfaces with popular business intelligence (BI) tools such as Tableau, Microsoft Power BI, and Microsoft Excel to provide rich, built-in data visualization and self-service tools.

The following diagram illustrates the Kyligence Cloud architecture on AWS.

What you can expect from Kyligence Cloud on AWS

This solution offers the following benefits:

  • High performance – With AWS’s global infrastructure and the distributed computing capabilities of Amazon EMR, Kyligence offers a scalable, cost-effective, high-performance OLAP engine for multi-dimensional analytics. It enables critical data applications and large-scale interactive analytics, and helps you achieve sub-second query response times and high concurrency on PB-scale data.
  • Auto-scaling – Kyligence Cloud’s computing resources can be expanded with one click, and as load decreases, cluster size can be automatically reduced. This auto-scaling capability provides optimized costs with service stability.
  • High compatibility – Kyligence Cloud provides a rich set of APIs (ODBC, JDBC, Rest API, Python Client) and standard ANSI-SQL and XMLA/MDX interface, which can be easily integrated with popular analytics tools like Tableau, Microsoft Excel, Microsoft Power BI, and data science tools like Python.
  • Security and reliability – With Amazon S3, Amazon RDS, Kyligence enterprise-level security features, and AWS Identity and Access Management (IAM) support, Kyligence Cloud safely manages access to the services and resources deployed on AWS while supporting multi-level access control of data models, tables, and cells to ensure data security and privacy protection.
  • One-click deployment on AWS – Kyligence Cloud is available in AWS Marketplace. The deployment is completed automatically based on an AWS CloudFormation template and parameter settings. Kyligence performs automated cluster operation and maintenance, and elastic rule-based cluster scaling, which lightens the workload for IT administrators and cloud infrastructure teams. Kyligence also offers a quick deployment method in the Kyligence Cloud Portal.

How Amazon EMR Serverless integrates with OLAP

With Amazon EMR Serverless, Kyligence Cloud provides out-of-the-box managed Apache Spark services. The Kyligence engine can distribute the compute job to Apache Spark in Amazon EMR Serverless. With the automatic on-demand provisioning and scaling capabilities of Amazon EMR Serverless, Kyligence can quickly meet changing processing requirements at any data volume.

The following diagram illustrates Kyligence Cloud integrated with Amazon EMR Serverless.

Benefits of using Kyligence Cloud with Amazon EMR Serverless

In the past, Kyligence used to deploy and maintain its own Spark clusters based on Amazon Elastic Compute Cloud (Amazon EC2) to handle the multi-dimensional model pre-computing process that required Kyligence users to build their monitoring and alerting systems to improve the observability and reliability of the Spark clusters.

Now, running Kyligence on Amazon EMR Serverless offers a more cost-effective, and high-performance way to run cloud analytics on AWS:

  • Simplified deployment on the cloud – With managed services, you don’t need to consider the lifecycle of the underlying infrastructure and resources. This greatly reduces application complexity and simplifies the deployment of Kyligence Cloud.
  • Improve performance on the cloud – With the help of Amazon EMR Serverless, it provides a refined scaling strategy, which can help Kyligence Cloud spin up and recycle resources faster. In Kyligence performance benchmark testing, we observed 15–20% faster performance compared to open-source Spark cluster for index building.
  • Reduce the difficulty of operation and maintenance – With the help of Amazon EMR Serverless capabilities, operation and maintenance personnel can easily maintain the capacity and running status of computing resources without having to understand the underlying analysis framework.
  • Cost-optimization on the cloud – Amazon EMR Serverless provides a refined scaling strategy that can automatically determine the resources that the application needs, acquires these resources to process your jobs, and releases the resources when the jobs complete. You only pay for the resources used by the application, which helps reduce the Total Cost of Operations (TCO) on the cloud.

Get started with Kyligence Cloud on Amazon EMR Serverless

You can get started with the full potential of Kyligence Cloud on the AWS Marketplace or quickly test drive Kyligence.

To use Amazon EMR Serverless, you just need to select Serverless Spark on the Build Cluster tab during deployment.

Conclusion

Using managed and scalable services like Amazon EMR Serverless allows Kyligence users to speed up self-service analytics on large volumes of data, and maintain a relatively simplified architecture. With this solution, you can now concentrate on business demands instead of technical issues.

About Kyligence

Kyligence was founded in 2016 by the original creators of Apache Kylin™, the leading open-source OLAP for big data. Kyligence offers an Intelligent OLAP Platform to simplify multi-dimensional analytics for cloud data lakes.

For more information, visit Kyligence.


About the authors

Daniel Gu is a senior product manager on the Kyligence Cloud Team, who manages products and services and conducts research to determine the viability of products in the cloud.

Yolanda Wang is a senior product marketing manager at Kyligence, who owns the positioning, messaging, and branding of Kyligence products and works with various teams to drive go-to-market strategies.

Kiran Guduguntla is a WW Go-to-Market Specialist for Amazon EMR at AWS. He works with AWS customers across the globe to strategize, build, develop, and deploy modern data analytics solutions.

Field-level security in Amazon OpenSearch Service

Post Syndicated from Satyanarayana Adimula original https://aws.amazon.com/blogs/big-data/field-level-security-in-amazon-opensearch-service/

Amazon OpenSearch Service is fully open-source search and analytics engine that securely unlocks real-time search, monitoring, and analysis of business and operational data for use cases like application monitoring, log analytics, observability, and website search.

But what if you have personal identifiable information (PII) data in your log data? How do you control and audit access to that data? For example, what if you need to exclude fields from log search results or anonymize them? Fine-grained access control can manage access to your data depending on the use case—to return results from only one index, hide certain fields in your documents, or exclude certain documents altogether.

Let’s say you have users that work on the logistics of online orders placed on Sunday. The users must not have the access to a customer’s PII data and must be restricted from seeing the customer’s email. Additionally, the customer’s full name and first name must be anonymized. The post demonstrates implementing this field-level security with OpenSearch Service security controls.

Solution overview

The solution has the following steps to provision OpenSearch Service with Amazon Cognito federation within Amazon Virtual Private Cloud (Amazon VPC), use a proxy server to sign in to OpenSearch Dashboards, and demonstrate the field-level security:

  1. Create an OpenSearch Service domain with VPC access and fine-grained access enabled.
  2. Access OpenSearch Service from outside the VPC and load the sample data.
  3. Create an OpenSearch Service role for field-level security and map it to a backend role.

OpenSearch Service security has three main layers:

  • Network – Determines whether a request can reach an OpenSearch Service domain. Placing an OpenSearch Service domain within a VPC enables secure communication between OpenSearch Service and other services within the VPC without the need for an internet gateway, NAT device, or VPN connection. The associated security groups must permit clients to reach the OpenSearch Service endpoint.
  • Domain access policy – After a request reaches a domain endpoint, the domain access policy allows or denies the request access to a given URI at the edge of the domain. The domain access policy specifies which actions a principal can perform on the domain’s sub-resources, which include OpenSearch Service indexes and APIs. If a domain access policy contains AWS Identity and Access Management (IAM) users or roles, clients must send signed requests using AWS Signature Version 4.
  • Fine-grained access control – After the domain access policy allows a request to reach a domain endpoint, fine-grained access control evaluates the user credentials and either authenticates the user or denies the request. If fine-grained access control authenticates the user, the request is handled based on the OpenSearch Service roles mapped to the user. Additional security levels include:
    • Cluster-level security – To make broad requests such as _mget, _msearch, and _bulk, monitor health, take snapshots, and more. For details, see Cluster permissions.
    • Index-level security – To create new indexes, search indexes, read and write documents, delete documents, manage aliases, and more. For details, see Index permissions.
    • Document-level security – To restrict the documents in an index that a user can see. For details, see Document-level security.
    • Field-level security – To control the document fields a user can see. When creating a role, add a list of fields to either include or exclude. If you include fields, any users you map to that role can see only those fields. If you exclude fields, they can see all fields except the excluded ones. Field-level security affects the number of fields included in hits when you search. For details, see Field-level security.
    • Field masking – To anonymize the data in a field. If you apply the standard masking to a field, OpenSearch Service uses a secure, random hash that can cause inaccurate aggregation results. To perform aggregations on masked fields, use pattern-based masking instead. For details, see Field masking.

The following figure illustrates these layers.

Prerequisites

For this walkthrough, you should have the following prerequisites:

  • An AWS account
  • An Amazon Cognito user pool and identity pool

Create an OpenSearch Service domain with VPC access

You first create an OpenSearch Service domain with VPC access, enabling fine-grained access control and choosing the IAM ARN as the master user.

When you use IAM for the master user, all requests to the cluster must be signed using AWS Signature Version 4. For sample code, see Signing HTTP requests to Amazon OpenSearch Service. IAM is recommended if you want to use the same users on multiple clusters, to use Amazon Cognito to access OpenSearch Dashboards, or if you have OpenSearch Service clients that support Signature Version 4 signing.

Fine-grained access control requires HTTPS, node-to-node encryption, and encryption at rest. Node-to-node encryption enables TLS 1.2 encryption for all communications within the VPC. If you send data to OpenSearch Service over HTTPS, node-to-node encryption helps ensure that your data remains encrypted as OpenSearch Service distributes (and redistributes) it throughout the cluster.

Add a domain access policy to allow the specified IAM ARNs to the URI at the edge of the domain.

Set up Amazon Cognito to federate into OpenSearch Service

You can authenticate and protect your OpenSearch Service default installation of OpenSearch Dashboards using Amazon Cognito. If you don’t configure Amazon Cognito authentication, you can still protect Dashboards using an IP-based access policy and a proxy server, HTTP basic authentication, or SAML. For more details, see Amazon Cognito authentication for OpenSearch Dashboards.

Create a user called masteruser in the Amazon Cognito user pool that was configured for the OpenSearch Service domain and associate the user with the IAM role Cognito_<Cognito User Pool>Auth_Role, which is a master user in OpenSearch Service. Create another user called ecomuser1 and associate it with a different IAM role, for example OpenSearchFineGrainedAccessRole. Note that ecomuser1 doesn’t have any access by default.

If you want to configure SAML authentication, see SAML authentication for OpenSearch Dashboards.

Access OpenSearch Service from outside the VPC

When you place your OpenSearch Service domain within a VPC, your computer must be able to connect to the VPC. This connection can be VPN, transit gateway, managed network, or proxy server.

Fine-grained access control has an OpenSearch Dashboards plugin that simplifies management tasks. You can use Dashboards to manage users, roles, mappings, action groups, and tenants. The Dashboards sign-in page and underlying authentication method differs depending on how you manage users and configured your domain.

Load sample data into OpenSearch

Sign in as masteruser to access OpenSearch Dashboards and load the sample data for ecommerce orders, flight data, and web logs.

Create an OpenSearch Service role and user mapping

OpenSearch Service roles are the core ways of controlling access to your cluster. Roles contain any combination of cluster-wide permissions, index-specific permissions, document-level and field-level security, and tenants.

You can create new roles for fine-grained access control and map roles to users using OpenSearch Dashboards or the _plugins/_security operation in the REST API. For more information, see Create roles and Map users to roles. Fine-grained access control also includes a number of predefined roles.

Backend roles offer another way of mapping OpenSearch Service roles to users. Rather than mapping the same role to dozens of different users, you can map the role to a single backend role, and then make sure that all users have that backend role. Note that the master user ARN is mapped to the all_access and security_manager roles by default to give the user full access to the data.

Create an OpenSearch Service role for field-level security

For our use case, an ecommerce company has requirements for certain users to see the online orders placed on Sunday. The users need to look at the order fulfilment logistics for only those orders. They don’t need to see customer’s email. They also don’t have to know the actual first name and last name of the customer; the customer’s first name and last name must be anonymized when displayed to the user.

Create a role in OpenSearch Service with the following steps:

  1. Log in to OpenSearch Dashboards as masteruser.
  2. Choose Security, Roles, and Create role.
  3. Name the role Orders-placed-on-Sunday.
  4. For Index permissions, specify opensearch_dashboards_sample_data_ecommerce.
  5. For the action group, choose read.
  6. For Document-level security, specify the following query:
    {
      "match": {
        "day_of_week" : "Sunday"
      }
    }

  7. For Field-level security, choose Exclude and specify email.
  8. For Anonymization, specify customer_first_name and customer_full_name.
  9. Choose Create.

You can see the following permissions to the role Orders-placed-on-Sunday.

Choose View expression to see the document-level security.

Map the OpenSearch Service role to the backend role of the Amazon Cognito group

To perform user mapping, complete the following steps:

  1. Go to the OpenSearch Service role Orders-placed-on-Sunday.
  2. Choose Mapped users, Manage mapping.
  3. For Backend roles, enter arn:aws:iam::<account-id>:role/OpenSearchFineGrainedAccessRole.
  4. Choose Map.
  5. Return to the list of roles and choose the predefined role opensearch_dashboards_user, which includes the permissions a user needs to work with index patterns, visualizations, dashboards, and tenants.
  6. Map the opensearch_dashboards_user role to arn:aws:iam::<account-id>:role/OpenSearchFineGrainedAccessRole.

Test the solution

To test your fine-grained access control, complete the following steps:

  1. Log in to the OpenSearch Dashboards URL as ecomuser1.
  2. Go to OpenSearch Plugins and choose Query Workbench.
  3. Run the following SQL queries in OpenSearch Workbench to verify the fine-grained access applied to ecomuser1 as compared to the same queries run by masteruser.
SQL Results when signed-in as masteruser
SHOW tables LIKE %sample%; opensearch_dashboards_sample_data_ecommerce
opensearch_dashboards_sample_data_flights
opensearch_dashboards_sample_data_logs
SELECT COUNT(*) FROM opensearch_dashboards_sample_data_flights ; 13059
SELECT day_of_week, count(*) AS total_records FROM opensearch_dashboards_sample_data_ecommerce GROUP BY day_of_week_i,day_of_week ORDER BY day_of_week_i;
day_of_week total_records
Monday 579
Tuesday 609
Wednesday 592
Thursday 775
Friday 770
Saturday 736
Sunday 614
SELECT customer_last_name AS last_name, customer_full_name AS full_name, email FROM opensearch_dashboards_sample_data_ecommerce WHERE day_of_week = ‘Sunday’ AND order_id = ‘582936’;
last_name full_name email
Miller Gwen Miller [email protected]

..

SQL Results when signed-in as ecomuser1 Observations
SHOW tables LIKE %sample%; no permissions for [indices:admin/get] and User [name=Cognito/<cognito pool-id>/ecomuser1, backend_roles=[arn:aws:iam::<account-id>:role/OpenSearchFineGrainedAccessRole] ecomuser1 can’t list tables.
SELECT COUNT(*) FROM opensearch_dashboards_sample_data_flights ; no permissions for [indices:data/read/search] and User [name=Cognito/<cognito pool-id>/ecomuser1, backend_roles=[arn:aws:iam::<account-id>:role/OpenSearchFineGrainedAccessRole] ecomuser1 can’t see flights data.
SELECT day_of_week, count(*) AS total_records  FROM opensearch_dashboards_sample_data_ecommerce GROUP BY day_of_week_i,day_of_week ORDER BY day_of_week_i;
day_of_week total_records
Sunday 614
ecomuser1 can see ecommerce orders placed on Sunday only.
SELECT customer_last_name AS last_name, customer_full_name AS full_name, email FROM opensearch_dashboards_sample_data_ecommerce WHERE day_of_week = ‘Sunday’ AND order_id = ‘582936’;
last_name full_name email
Miller f1493b0f9039531ed02c9b1b7855707116beca01c6c0d42cf7398b8d880d555f .
For ecomuser1, customer’s email is excluded and customer_full_name is anonymized.

From these results, you can see OpenSearch Service field-level access controls were applied to ecomuser1, restricting the user from seeing the customer’s email. Additionally, the customer’s full name and first name were anonymized when displayed to the user.

Conclusion

When OpenSearch Service fine-grained access control authenticates a user, the request is handled based on the OpenSearch Service roles mapped to the user. This post demonstrated fine-grained access control restricting a user from seeing a customer’s PII data, as per the business requirements.

Role-based fine-grained access control enables you to control access to your data on OpenSearch Service at the index level, document level, and field level. When your logs or applications data has sensitive data, the field-level security permissions can help you provision the right level of access for your users.


About the author

Satya Adimula is a Senior Data Architect at AWS based in Boston. With extensive experience in data and analytics, Satya helps organizations derive their business insights from the data at scale.

Reduce cost and improve query performance with Amazon Athena Query Result Reuse

Post Syndicated from Theo Tolv original https://aws.amazon.com/blogs/big-data/reduce-cost-and-improve-query-performance-with-amazon-athena-query-result-reuse/

Amazon Athena is an interactive query service that makes it easy to analyze data in Amazon Simple Storage Service (Amazon S3) using standard SQL. Athena is serverless, so there is no infrastructure to manage, and you pay only for the queries that you run on datasets at petabyte scale. You can use Athena to query your S3 data lake for use cases such as data exploration for machine learning (ML) and AI, business intelligence (BI) reporting, and ad hoc querying.

It’s not uncommon for datasets in data lakes to update only daily, or at most a few times per day, yet queries running on these datasets may be repeated more frequently. Previously, all queries resulted in a data scan, even if the same query was repeated again. When the source data hasn’t changed, repeat queries run needlessly, leading to the same results with higher data scan costs and query latency. Wouldn’t it be better if the results of a recent query could be reused instead?

Query Result Reuse is a new feature available in Athena engine version 3 that makes it possible to reuse the results of a previous query. This can improve performance and reduce cost for frequently run queries, by skipping scanning the source data and instead returning a previously calculated result directly. With Query Result Reuse, you can tell Athena that you want to reuse results of a previous query run, with a maximum age setting that controls how recent a previous result has to be.

Athena automatically reuses any previous results that match your query and maximum age setting, or transparently runs the query again if no match is found. If you know that a dataset changes a few times per day, you can, for example, tell Athena to reuse results that are up to an hour old to avoid rerunning most queries, but still get new results when you run a query soon after new data has become available.

In this post, we demonstrate how to reduce cost and improve query performance with the new Query Result Reuse feature.

When should you use Query Result Reuse?

We recommend using Query Result Reuse for every query where the source data doesn’t change frequently. You can configure the maximum age of results to reuse per query, or use the default, which is 60 minutes. In certain cases where queries include non-deterministic functions such as RAND(), the query fetches fresh data from the input source even if the Query Result Reuse feature is enabled.

Query Result Reuse allows results to be shared among users in a workgroup, as long as they have access to the tables and data. This means Query Result Reuse can benefit not only a single user, but also other users in the workgroup who might be running the same queries. One example where this may be especially beneficial is when you have dashboards that are viewed by many users. The dashboard widgets run the same queries for all users, and are therefore accelerated by Query Result Reuse, when enabled.

Another example is if you have a dataset that is updated daily, and many users who all query the most recent data to create reports. Different people might run the same queries as part of their work; with Query Result Reuse, they can collectively avoid running the same query more than once, making everyone more productive and lowering overall cost by avoiding repeated scans of the same data.

Finally, if you have a historical dataset that is frequently queried, but never or very rarely updated, you can configure queries to reuse results that are up to 7 days old to maximize the chances of reusing results and avoid unnecessary costs.

How does Query Result Reuse work?

Query Result Reuse takes advantage of the fact that Athena writes query results to Amazon S3 as a CSV file. Before the introduction of Query Result Reuse, it was possible to reuse query results by reading these files directly. You could also use the ClientRequestToken parameter of the StartQueryExecution API to ensure queries are run only once, and subsequent runs return the same results. With Query Result Reuse, the process of reusing query results is easier and more versatile.

When Athena receives a query with Query Result Reuse enabled, it looks for a result for a query with the same query string that was run in the same workgroup. The query string has to be identical in order to match.

Query Result Reuse is enabled on a per query basis. When you run a query, you specify how old a result can be for it to be reused, from 1 minute up to 7 days. If the query has been run before, and a result exists that matches the request, it’s returned, otherwise the query is run and a new result is calculated. This new result is then available to be reused by subsequent queries.

You can run the query multiple times with different settings for how old a result you can accept. Results can be reused within the same workgroup, even if a different user ran the query previously.

Before a query result is reused, Athena does a few checks to make sure that the user is still allowed to see the results. It checks that the user has access to the tables involved in the query and permission to read the result file on Amazon S3.

There are some situations where query results can’t be reused, for example if the query uses non-deterministic functions, or has AWS Lake Form ation fine-grained access controls enabled. These limitations are described in more detail later in this post.

Run queries with Query Result Reuse

In this section, we demonstrate how to run queries with the Query Result Reuse feature via the Athena API, the Athena console, and the JDBC and ODBC drivers.

Run queries using the Athena API

For applications that use the Athena API through the AWS Command Line Interface (AWS CLI) or the AWS SDKs, the StartQueryExecution API call now has the additional parameter ResultReuseConfiguration, where you can enable Query Result Reuse and specify the maximum age of results. For example, when using the AWS CLI, you can run a query with Query Result Reuse enabled as follows:

aws athena start-query-execution \
  --work-group "my_work_group" \
  --query-string "SELECT * FROM my_table LIMIT 10" \
  --result-reuse-configuration \
    "ResultReuseByAgeConfiguration={Enabled=true,MaxAgeInMinutes=60}"

The following code shows how to do this with the AWS SDK for Python:

import boto3

client = boto3.client('athena')
response = client.start_query_execution(
    WorkGroup='my_work_group',
    QueryString='SELECT * FROM my_table LIMIT 10',
    ResultReuseConfiguration={
        'ResultReuseByAgeConfiguration': {
   	    	'Enabled': True,
     		'MaxAgeInSeconds': 60
        }
    }
)

These examples assume that my_work_group uses Athena engine v3, that the workgroup has an output location configured, and that the AWS Region has been set in the AWS CLI configuration.

When a query result is reused, you can see in the statistics section of the response from the GetQueryExecution API call that no data was scanned and that results were reused:

{
    "QueryExecution": {
        …
        "Statistics": {
            "EngineExecutionTimeInMillis": 272,
            "DataScannedInBytes": 0,
            "TotalExecutionTimeInMillis": 445,
            "QueryQueueTimeInMillis": 143,
            "ServiceProcessingTimeInMillis": 30,
            "ResultReuseInformation": {
               	"ReusedPreviousResult": true
           	}
        }
    }
}

Run queries using the Athena console

When you run queries on the Athena console, Query Result Reuse is now enabled by default. You can enable and disable Query Result Reuse in the query editor. You can also choose the pen icon to change the maximum age of results. This setting applies to all queries run on the Athena console.

The following screenshot shows an example query run against AWS CloudTrail logs with Query Result Reuse enabled.

When we ran the query again, the results showed up immediately, and we could see the message “using reused query results” in the Query results pane as a confirmation that the results of our first query had been reused. The Data scanned statistic also showed “-” to indicate that no data was scanned.

Run queries using the JDBC and ODBC drivers

If you use the JDBC or ODBC driver to query Athena, you can now add enableResultReuse=1 to your connection parameters to enable Query Result Reuse, and use ageforResultReuse=60 to set the maximum age to 60 minutes. The drivers automatically apply the setting to all queries running in the context of the connection.

For more information on how to connect to Athena via JDBC and ODBC, refer to Connecting to Amazon Athena with ODBC and JDBC drivers.

Limitations and considerations

Query Result Reuse is supported for most Athena queries, but there are some limitations. We want to ensure that reusing results doesn’t create surprising situations, or expose results that a user shouldn’t have access to. For that reason, Athena always runs a fresh query in the following situations:

  • Non-deterministic functions – Some functions and expressions produce different results from query to query, such as CURRENT_TIME and RAND(). Results for queries that use temporal and non-deterministic expressions and functions aren’t reusable because that could create surprising and inconsistent results.
  • Fine-grained access controls – Row-level and column-level permissions are configured in Lake Formation, and Athena can’t know if these have changed since a previous query result was created. Users using the same workgroup can also have different permissions, and checking all permissions would undo many of the cost and performance savings you get from Query Result Reuse.
  • Federated queries, user-defined functions (UDFs), and external Hive metastores – Users using the same workgroup can have different permissions to invoke the AWS Lambda functions that these features rely on. Athena isn’t able to check that a user that wants to reuse a result has permission to invoke these Lambda functions without running the query, which would negate the cost and performance savings.

Athena detects these conditions automatically and runs the query as if Query Result Reuse wasn’t enabled. You won’t get errors, but you can determine that Query Result Reuse wasn’t in effect by inspecting the query status (see our earlier examples).

Query Result Reuse is available in Athena engine version 3 only.

Conclusion

Query Result Reuse is a new feature in Athena that aims to reduce cost and query response times for datasets that change less frequently than they are queried. For teams that often run the same query, or have dashboards that are used more often than the data changes, Query Result Reuse can result in lower costs and faster results. It’s easy to get started with Query Result Reuse via the Athena console, API, and JDBC/ODBC; all you have to do is set the maximum age of results, and run your queries as usual.

We hope that you will like this new feature, and that it will save cost and improve performance for you and your team!


About the authors

Theo Tolv is a Senior Big Data Architect in the Athena team. He’s worked with small and big data for most of his career and often hangs out on Stack Overflow answering questions about Athena.

Vijay Jain is a Senior Product Manager in Amazon Web Services (AWS) Athena team. He is passionate about building scalable analytics technologies and products working closely with enterprise customers. Outside of work, Vijay likes running and spending time with his family.

What’s new with Amazon QuickSight at AWS re:Invent 2022

Post Syndicated from Mia Heard original https://aws.amazon.com/blogs/big-data/whats-new-with-amazon-quicksight-at-aws-reinvent-2022/

AWS re:Invent is a learning conference hosted by AWS for the global cloud computing community. This year’s re:Invent will be held in Las Vegas, Nevada, from November 28 to December 2.

Amazon QuickSight is the most popular cloud-native serverless BI service. This post walks you through the details of all QuickSight-related sessions and activities to help you plan your conference week accordingly. These sessions should appeal to data and analytics teams, product and engineering teams, and line of business and technology leaders interested in modernizing their BI capabilities to transform data into actionable insights for all.

To access the session catalog and reserve your seat for one of our BI sessions, you must be registered for re:Invent. Register now!

Keynotes

Adam Selipsky, Chief Executive Officer of Amazon Web Services – Keynote

Tuesday November 29 | 8:30 AM – 10:30 AM PST | The Venetian

Join Adam Selipsky, Chief Executive Officer of Amazon Web Services, as he looks at the ways that forward-thinking builders are transforming industries and even our future, powered by AWS. He highlights innovations in data, infrastructure, and more that are helping customers achieve their goals faster, take advantage of untapped potential, and create a better future with AWS.

Swami Sivasubramanian, Vice President of AWS Data and Machine Learning – Keynote

Wednesday November 30 | 8:30 AM – 10:30 AM PST | The Venetian

Join Swami Sivasubramanian, Vice President of AWS Data and Machine Learning, as he reveals the latest AWS innovations that can help you transform your company’s data into meaningful insights and actions for your business. In this keynote, several speakers discuss the key components of a future-proof data strategy and how to empower your organization to drive the next wave of modern invention with data. Hear from leading AWS customers who are using data to bring new experiences to life for their customers.

Leadership sessions

ANT203-L (LVL 200) Unlock the value of your data with AWS analytics

Wednesday November 30 | 2:30 – 3:30 PM PST | The Venetian

Data fuels digital transformation and drives effective business decisions. To survive in an ever-changing world, organizations are turning to data to derive insights, create new experiences, and reinvent themselves so they can remain relevant today and in the future. AWS offers analytics services that allow organizations to gain faster and deeper insights from all their data. In this session, G2 Krishnamoorthy, VP of AWS Analytics, addresses the current state of analytics on AWS, covers the latest service innovations around data, and highlights customer successes with AWS analytics. Also, learn from organizations like FINRA and more who have turned to AWS for their digital transformation journey.
Reserve your seat now!

BSI201 (LVL 200) Reinvent how you derive value from your data with Amazon QuickSight

Tuesday November 29 | 2:00 PM – 3:00 PM PST | Mandalay Bay

In this session, learn how you can use AWS-native business analytics to provide your users with machine learning-powered interactive dashboards, natural language query (NLQ), and embedded analytics to provide insights to users at scale, when and where they need it. Join this session to also learn more about how Amazon uses QuickSight internally.
Reserve your seat now!

Breakout sessions

BSI202 (LVL 200) Migrate to cloud-native business analytics with Amazon QuickSight

Wednesday November 30 | 2:30 PM – 3:30 PM PST | Encore

Legacy BI systems can hurt agile decision-making in the modern organization, with expensive licensing, outdated capabilities, and expensive infrastructure management. In this session, discover how migrating your BI to the cloud with cloud-native, fully managed business analytics capabilities from QuickSight can help you overcome these challenges. Learn how you can use QuickSight’s interactive dashboards and reporting capabilities to provide insights to every user in the organization, lowering your costs and enabling better decision-making. Join this session to also learn more about Siemens QuickSight use case.
Reserve your seat now!

BSI207 (LVL 200) Get clarity on your data in seconds with Amazon QuickSight Q

Wednesday November 30 | 4:45 PM – 5:45 PM PST | MGM Grand

Amazon QuickSight Q is a machine learning–powered natural language capability that empowers business users to ask questions about all of their data using everyday business language and get answers in seconds. Q interprets questions to understand their intent and generates an answer instantly in the form of a visual without requiring authors to create graphics, dashboards, or analyses. In this session, the QuickSight Q team provides an overview and demonstration of Q in action. Join this session to also learn more about NASDAQ’s QuickSight use case.
Reserve your seat now!

BSI203 (LVL 200) Differentiate your apps with Amazon QuickSight embedded analytics

Thursday December 1 | 12:30 PM – 1:30 PM PST | Caesars Forum

In this session, learn how to enable new monetization opportunities and grow your business with QuickSight embedded analytics. Discover how you can differentiate your end-user experience by embedding data visualizations, dashboards, and ML-powered natural language query into your applications at scale with no infrastructure to manage. Join this session to also learn more about Guardian Life and Showpad’s QuickSight use case.
Reserve your seat now!

BSI304 (LVL 300) Optimize your AWS cost and usage with Cloud Intelligence Dashboards

Thursday December 1 | 3:30 PM – 4:30 PM PST | Encore

Do your engineers know how much they’re spending? Do you have insight into the details of your cost and usage on AWS? Are you taking advantage of all your cost optimization opportunities? Attend this session to learn how organizations are using the Cloud Intelligence Dashboards to start their FinOps journeys and create cost-aware cultures within their organizations. Dive deep into specific use cases and learn how you can use these insights to drive and measure your cost optimization efforts. Discover how unit economics, resource-level visibility, and periodic spend updates make it possible for FinOps practitioners, developers, and business executives to come together to make smarter decisions. Join this session to also learn more about Dolby laboratories’ QuickSight use case.
Reserve your seat now!

Chalk talks

BSI302 (LVL 300) Deploy your BI assets at scale to thousands with Amazon QuickSight

Tuesday November 29 | 11:45 AM – 12:45 AM PST | Wynn
As your user bases grow to hundreds or thousands of users, managing assets and user permissions at scale becomes increasingly important. In this chalk talk, learn about best practices for content development, promotion, authorization, organization, and cleanup to help ensure that your users are developing and sharing content in a safe and scalable manner.
Reserve your seat now!

BSI301 (LVL 300) Architecting multi-tenancy for your apps with Amazon QuickSight

Tuesday November 29 | 2:45 PM – 3:45 PM PST | Caesars Forum

Whether you are deploying QuickSight internally in a centrally managed single account or developing a SaaS application with multiple external tenants, it is paramount to focus on security and governance and to isolate tenants from each other. In this chalk talk, learn about different architectures and security frameworks that you can use to deploy QuickSight to thousands of departments or clients in a scalable and controlled manner.
Reserve your seat now!

*This session will also be repeated Wednesday November 30 | 7:45 PM – 8:45 PM PST | Wynn

BSI401 (LVL 400) Insightful dashboards through advanced calculations with QuickSight

Monday November 28 | 12:15 PM – 1:15 PM PST | MGM Grand
Loading data into various charting types is very rarely the end goal for your users. When they find interesting patterns or trends, they tend to dig deeper into their data and use calculations to surface more underlying insights. In this chalk talk, learn about various ways to build insightful and creative dashboards using QuickSight’s new advanced calculation capabilities, such as level-aware calculation and period functions.
Reserve your seat now!

Workshops

BSI205 (LVL 200) Build stunning customized dashboards with Amazon QuickSight

Monday November 28 | 10:45 AM – 12:45 PM PST | Wynn

Want to grow your dashboard building skills? In this workshop, the QuickSight team demonstrates the latest authoring functionality designed to empower you to build beautiful layouts and robust interactive experiences with other applications, right from within your dashboard. You must bring your laptop to participate.
Reserve your seat now!

*This session will be also be repeated Thursday December 1 | 11:45 AM – 1:45 PM PST | Caesars Forum

BSI303 (LVL 300) Seamlessly embed analytics into your apps with Amazon QuickSight
Wednesday November 30 | 5:30 PM – 7:30 PM PST | Wynn

In this workshop, learn how you can bring data insights to your internal teams and end customers by simply and seamlessly embedding rich, interactive data visualizations and dashboards into your web applications and portals. You must bring your laptop to participate.
Reserve your seat now!

Partner session

PEX307 (LVL 300) Migrating business intelligence systems to Amazon QuickSight

Wednesday November 30 | 9:15 AM – 10:15 AM PST | Encore

QuickSight is a scalable, serverless, embeddable, machine learning–powered BI tool built for the cloud. If you’re building a cloud-native BI solution and are unsure how to migrate on AWS, this session is for you. Dive deep into BI best practices, tools, and methodologies for migrating BI dashboards to QuickSight, and learn how to use APIs and the AWS CLI to automate common migration tasks required to perform BI dashboard migration. This session is intended for AWS Partners.
Reserve your seat now!

Additional activities

Business Intelligence kiosk in the AWS Village

Visit the Business Intelligence kiosk within the AWS Village to meet with experts to dive deeper into QuickSight capabilities such as Q and embedded analytics. You will be able to ask our experts questions and experience live demos for our newly launched capabilities.

Free QuickSight swag

Make sure to stop by the swag distribution table to grab free QuickSight swag if you have attended either the Business Intelligence kiosk or one of our breakout sessions, chalk talks, or workshops.

Useful resources

Whether you plan on attending re:Invent in person or view available content virtually, you can always learn more about QuickSight through these helpful resources:

QuickSight Community Hub – Ask, answer, and learn with others in the QuickSight Community.

QuickSight YouTube channel – Subscribe to stay up to date on the latest QuickSight workshops, how tos, and demo videos.

QuickSight DemoCentral – Experience QuickSight first-hand through interactive dashboards and demos


About the authors

Mia Heard is a Product Marketing Manager for Amazon QuickSight, AWS’ cloud-native, fully managed BI service.

California State University Chancellor’s Office reduces cost and improves efficiency using Amazon QuickSight for streamlined HR reporting in higher education

Post Syndicated from Madi Hsieh original https://aws.amazon.com/blogs/big-data/california-state-university-chancellors-office-reduces-cost-and-improves-efficiency-using-amazon-quicksight-for-streamlined-hr-reporting-in-higher-education/

The California State University Chancellor’s Office (CSUCO) sits at the center of America’s most significant and diverse 4-year universities. The California State University (CSU) serves approximately 477,000 students and employs more than 55,000 staff and faculty members across 23 universities and 7 off-campus centers. The CSU provides students with opportunities to develop intellectually and personally, and to contribute back to the communities throughout California. For this large organization, managing a wide system of campuses while maintaining the decentralized autonomy of each is crucial. In 2019, they needed a highly secure tool to streamline the process of pulling HR data. The CSU had been using a legacy central data warehouse based on data from their financial system, but it lacked the robustness to keep up with modern technology. This wasn’t going to work for their HR reporting needs.

Looking for a tool to match the cloud-based infrastructure of their other operations, the Business Intelligence and Data Operations (BI/DO) team within the Chancellor’s Office chose Amazon QuickSight, a fast, easy-to-use, cloud-powered business analytics service that makes it easy for all employees within an organization to build visualizations, perform ad hoc analysis, and quickly get business insights from their data, any time, on any device. The team uses QuickSight to organize HR information across the CSU, implementing a centralized security system.

“It’s easy to use, very straightforward, and relatively intuitive. When you couple the experience of using QuickSight, with a huge cost difference to [the BI platform we had been using], to me, it’s a simple choice,”

– Andy Sydnor, Director Business Intelligence and Data Operations at the CSUCO.

With QuickSight, the team has the capability to harness security measures and deliver data insights efficiently across their campuses.

In this post, we share how the CSUCO uses QuickSight to reduce cost and improve efficiency in their HR reporting.

Delivering BI insights across the CSU’s 23 universities

The CSUCO serves the university system’s faculty, students, and staff by overseeing operations in several areas, including finance, HR, student information, and space and facilities. Since migrating to QuickSight in 2019, the team has built dashboards to support these operations. Dashboards include COVID-related leaves of absence, historical financial reports, and employee training data, along with a large selection of dashboards to track employee data at an individual campus level or from a system-wide perspective.

The team created a process for reading security roles from the ERP system and then translating them using QuickSight groups for internal HR reporting. QuickSight allowed them to match security measures with the benefits of low maintenance and familiarity to their end-users.

With QuickSight, the CSUCO is able to run a decentralized security process where campus security teams can provision access directly and users can get to their data faster. Before transitioning to QuickSight, the BI/DO team spent hours trying to get to specific individual-level data, but with QuickSight, the retrieval time was shortened to just minutes. For the first time, Sydnor and his team were able to pinpoint a specific employee’s work history without having to take additional actions to find the exact data they needed.

Cost savings compared to other BI tools

Sydnor shares that, for a public organization, one of the most attractive qualities of QuickSight is the immense cost savings. The BI/DO team at the Chancellor’s Office estimates that they’re saving roughly 40% on costs since switching from their previous BI platform, which is a huge benefit for a public organization of this scale. Their previous BI tool was costing them extensive amounts of money on licensing for features they didn’t require; the CSUCO felt they weren’t getting the best use of their investment.

The functionality of QuickSight to meet their reporting needs at an affordable price point is what makes QuickSight the CSUCO’s preferred BI reporting tool. Sydnor likes that with QuickSight, “we don’t have to go out and buy a subscription or a license for somebody, we can just provision access. It’s much easier to distribute the product.” QuickSight allows the CSUCO to focus their budget in other areas rather than having to pay for charges by infrequent users.

Simple and intuitive interface

Getting started in QuickSight was a no-brainer for Sydnor and his team. As a public organization, the procurement process can be cumbersome, thereby slowing down valuable time for putting their data to action. As an existing AWS customer, the CSUCO could seamlessly integrate QuickSight into their package of AWS services. An issue they were running into with other BI tools was encountering roadblocks to setting up the system, which wasn’t an issue with QuickSight, because it’s a fully managed service that doesn’t require deploying any servers.

The following screenshot shows an example of the CSUCO security audit dashboard.

example of the CSUCO security audit dashboard.

Sydnor tells us, “Our previous BI tool had a huge library of visualization, but we don’t need 95% of those. Our presentations look great with the breadth of visuals QuickSight provides. Most people just want the data and ultimately, need a robust vehicle to get data out of a database and onto a table or visualization.”

Converting from their original BI tool to QuickSight was painless for his team. Sydnor tells us that he has “yet to see something we can’t do with QuickSight.” One of Sydnor’s employees who was a user of the previous tool learned QuickSight in just 30 minutes. Now, they conduct QuickSight demos all the time.

Looking to the future: Expanding BI integration and adopting Amazon QuickSight Q

With QuickSight, the Chancellor’s Office aims to roll out more HR dashboards across its campuses and extend the tool for faculty use in the classroom. In the upcoming year, two campuses are joining CSUCO in building their own HR reporting dashboards through QuickSight. The organization is also making plans to use QuickSight to report on student data and implement external-facing dashboards. Some of the data points they’re excited to explore are insights into at-risk students and classroom scheduling on campus.

Thinking ahead, CSUCO is considering Amazon QuickSight Q, a machine learning-powered natural language capability that gives anyone in an organization the ability to ask business questions in natural language and receive accurate answers with relevant visualizations. Sydnor says, “How cool would that be if professors could go in and ask simple, straightforward questions like, ‘How many of my department’s students are taking full course loads this semester?’ It has a lot of potential.”

Summary

The CSUCO is excited to be a champion of QuickSight in the CSU, and are looking for ways to increase its implementation across their organization in the future.

To learn more, visit the website for the California State University Chancellor’s Office. For more on QuickSight, visit the Amazon QuickSight product page, or browse other Big Data Blog posts featuring QuickSight.


About the authors

Madi Hsieh, AWS 2022 Summer Intern, UCLA.

Tina Kelleher, Program Manager at AWS.

Build the next generation, cross-account, event-driven data pipeline orchestration product

Post Syndicated from Maria Guerra original https://aws.amazon.com/blogs/big-data/build-the-next-generation-cross-account-event-driven-data-pipeline-orchestration-product/

This is a guest post by Mehdi Bendriss, Mohamad Shaker, and Arvid Reiche from Scout24.

At Scout24 SE, we love data pipelines, with over 700 pipelines running daily in production, spread across over 100 AWS accounts. As we democratize data and our data platform tooling, each team can create, maintain, and run their own data pipelines in their own AWS account. This freedom and flexibility is required to build scalable organizations. However, it’s full of pitfalls. With no rules in place, chaos is inevitable.

We took a long road to get here. We’ve been developing our own custom data platform since 2015, developing most tools ourselves. Since 2016, we have our self-developed legacy data pipeline orchestration tool.

The motivation to invest a year of work into a new solution was driven by two factors:

  • Lack of transparency on data lineage, especially dependency and availability of data
  • Little room to implement governance

As a technical platform, our target user base for our tooling includes data engineers, data analysts, data scientists, and software engineers. We share the vision that anyone with relevant business context and minimal technical skills can create, deploy, and maintain a data pipeline.

In this context, in 2015 we created the predecessor of our new tool, which allows users to describe their pipeline in a YAML file as a list of steps. It worked well for a while, but we faced many problems along the way, notably:

  • Our product didn’t support pipelines to be triggered by the status of other pipelines, but based on the presence of _SUCCESS files in Amazon Simple Storage Service (Amazon S3). Here we relied on periodic pulls. In complex organizations, data jobs often have strong dependencies to other work streams.
  • Given the previous point, most pipelines could only be scheduled based on a rough estimate of when their parent pipelines might finish. This led to cascaded failures when the parents failed or didn’t finish on time.
  • When a pipeline fails and gets fixed, then manually redeployed, all its dependent pipelines must be rerun manually. This means that the data producer bears the responsibility of notifying every single team downstream.

Having data and tooling democratized without the ability to provide insights into which jobs, data, and dependencies exist diminishes synergies within the company, leading to silos and problems in resource allocation. It became clear that we needed a successor for this product that would give more flexibility to the end-user, less computing costs, and no infrastructure management overhead.

In this post, we describe, through a hypothetical case study, the constraints under which the new solution should perform, the end-user experience, and the detailed architecture of the solution.

Case study

Our case study looks at the following teams:

  • The core-data-availability team has a data pipeline named listings that runs every day at 3:00 AM on the AWS account Account A, and produces on Amazon S3 an aggregate of the listings events published on the platform on the previous day.
  • The search team has a data pipeline named searches that runs every day at 5:00 AM on the AWS account Account B, and exports to Amazon S3 the list of search events that happened on the previous day.
  • The rent-journey team wants to measure a metric referred to as X; they create a pipeline named pipeline-X that runs daily on the AWS account Account C, and relies on the data of both previous pipelines. pipeline-X should only run daily, and only after both the listings and searches pipelines succeed.

User experience

We provide users with a CLI tool that we call DataMario (relating to its predecessor DataWario), and which allows users to do the following:

  • Set up their AWS account with the necessary infrastructure needed to run our solution
  • Bootstrap and manage their data pipeline projects (creating, deploying, deleting, and so on)

When creating a new project with the CLI, we generate (and require) every project to have a pipeline.yaml file. This file describes the pipeline steps and the way they should be triggered, alerting, type of instances and clusters in which the pipeline will be running, and more.

In addition to the pipeline.yaml file, we allow advanced users with very niche and custom needs to create their pipeline definition entirely using a TypeScript API we provide them, which allows them to use the whole collection of constructs in the AWS Cloud Development Kit (AWS CDK) library.

For the sake of simplicity, we focus on the triggering of pipelines and the alerting in this post, along with the definition of pipelines through pipeline.yaml.

The listings and searches pipelines are triggered as per a scheduling rule, which the team defines in the pipeline.yaml file as follows:

trigger: 
    schedule: 
        hour: 3

pipeline-x is triggered depending on the success of both the listings and searches pipelines. The team defines this dependency relationship in the project’s pipeline.yaml file as follows:

trigger: 
    executions: 
        allOf: 
            - name: listings 
              account: Account_A_ID 
              status: 
                  - SUCCESS 
            - name: searches 
              account: Account_B_ID 
              status: 
                  - SUCCESS

The executions block can define a complex set of relationships by combining the allOf and anyOf blocks, along with a logical operator operator: OR / AND, which allows mixing the allOf and anyOf blocks. We focus on the most basic use case in this post.

Accounts setup

To support alerting, logging, and dependencies management, our solution has components that must be pre-deployed in two types of accounts:

  • A central AWS account – This is managed by the Data Platform team and contains the following:
    • A central data pipeline Amazon EventBridge bus receiving all the run status changes of AWS Step Functions workflows running in user accounts
    • An AWS Lambda function logging the Step Functions workflow run changes in an Amazon DynamoDB table to verify if any downstream pipelines should be triggered based on the current event and previous run status changes log
    • A Slack alerting service to send alerts to the Slack channels specified by users
    • A trigger management service that broadcasts triggering events to the downstream buses in the user accounts
  • All AWS user accounts using the service – These accounts contain the following:
    • A data pipeline EventBridge bus that receives Step Functions workflow run status changes forwarded from the central EventBridge bus
    • An S3 bucket to store data pipelines artifacts, along their logs
    • Resources needed to run Amazon EMR clusters, like security groups, AWS Identity and Access Management (IAM) roles, and more

With the provided CLI, users can set up their account by running the following code:

$ dpc setup-user-account

Solution overview

The following diagram illustrates the architecture of the cross-account, event-driven pipeline orchestration product.

In this post, we refer to the different colored and numbered squares to reference a component in the architecture diagram. For example, the green square with label 3 refers to the EventBridge bus default component.

Deployment flow

This section is illustrated with the orange squares in the architecture diagram.

A user can create a project consisting of a data pipeline or more using our CLI tool as follows:

$ dpc create-project -n 'project-name'

The created project contains several components that allow the user to create and deploy data pipelines, which are defined in .yaml files (as explained earlier in the User experience section).

The workflow of deploying a data pipeline such as listings in Account A is as follows:

  • Deploy listings by running the command dpc deploy in the root folder of the project. An AWS CDK stack with all required resources is automatically generated.
  • The previous stack is deployed as an AWS CloudFormation template.
  • The stack uses custom resources to perform some actions, such as storing information needed for alerting and pipeline dependency management.
  • Two Lambda functions are triggered, one to store the mapping pipeline-X/slack-channels used for alerting in a DynamoDB table, and another one to store the mapping between the deployed pipeline and its triggers (other pipelines that should result in triggering the current one).
  • To decouple alerting and dependency management services from the other components of the solution, we use Amazon API Gateway for two components:
    • The Slack API.
    • The dependency management API.
  • All calls for both APIs are traced in Amazon CloudWatch log groups and two Lambda functions:
    • The Slack channel publisher Lambda function, used to store the mapping pipeline_name/slack_channels in a DynamoDB table.
    • The dependencies publisher Lambda function, used to store the pipelines dependencies (the mapping pipeline_name/parents) in a DynamoDB table.

Pipeline trigger flow

This is an event-driven mechanism that ensures that data pipelines are triggered as requested by the user, either following a schedule or a list of fulfilled upstream conditions, such as a group of pipelines succeeding or failing.

This flow relies heavily on EventBridge buses and rules, specifically two types of rules:

  • Scheduling rules.
  • Step Functions event-based rules, with a payload matching the set of statuses of all the parents of a given pipeline. The rules indicate for which set of statuses all the parents of pipeline-X should be triggered.

Scheduling

This section is illustrated with the black squares in the architecture diagram.

The listings pipeline running on Account A is set to run every day at 3:00 AM. The deployment of this pipeline creates an EventBridge rule and a Step Functions workflow for running the pipeline:

  • The EventBridge rule is of type schedule and is created on the default bus (this is the EventBridge bus responsible for listening to native AWS events—this distinction is important to avoid confusion when introducing the other buses). This rule has two main components:
    • A cron-like notation to describe the frequency at which it runs: 0 3 * * ? *.
    • The target, which is the Step Functions workflow describing the workflow of the listings pipeline.
  • The listings Step Function workflow describes and runs immediately when the rule gets triggered. (The same happens to the searches pipeline.)

Each user account has a default EventBridge bus, which listens to the default AWS events (such as the run of any Lambda function) and scheduled rules.

Dependency management

This section is illustrated with the green squares in the architecture diagram. The current flow starts after the Step Functions workflow (black square 2) starts, as explained in the previous section.

As a reminder, pipeline-X is triggered when both the listings and searches pipelines are successful. We focus on the listings pipeline for this post, but the same applies to the searches pipeline.

The overall idea is to notify all downstream pipelines that depend on it, in every AWS account, passing by and going through the central orchestration account, of the change of status of the listings pipeline.

It’s then logical that the following flow gets triggered multiple times per pipeline (Step Functions workflow) run as its status changes from RUNNING to either SUCCEEDED, FAILED, TIMED_OUT, or ABORTED. The reason being that there could be pipelines downstream potentially listening on any of those status change events. The steps are as follows:

  • The event of the Step Functions workflow starting is listened to by the default bus of Account A.
  • The rule export-events-to-central-bus, which specifically listens to the Step Function workflow run status change events, is then triggered.
  • The rule forwards the event to the central bus on the central account.
  • The event is then caught by the rule trigger-events-manager.
  • This rule triggers a Lambda function.
  • The function gets the list of all children pipelines that depend on the current run status of listings.
  • The current run is inserted in the run log Amazon Relational Database Service (Amazon RDS) table, following the schema sfn-listings, time (timestamp), status (SUCCEEDED, FAILED, and so on). You can query the run log RDS table to evaluate the running preconditions of all children pipelines and get all those that qualify for triggering.
  • A triggering event is broadcast in the central bus for each of those eligible children.
  • Those events get broadcast to all accounts through the export rules—including Account C, which is of interest in our case.
  • The default EventBridge bus on Account C receives the broadcasted event.
  • The EventBridge rule gets triggered if the event content matches the expected payload of the rule (notably that both pipelines have a SUCCEEDED status).
  • If the payload is valid, the rule triggers the Step Functions workflow pipeline-X and triggers the workflow to provision resources (which we discuss later in this post).

Alerting

This section is illustrated with the gray squares in the architecture diagram.

Many teams handle alerting differently across the organization, such as Slack alerting messages, email alerts, and OpsGenie alerts.

We decided to allow users to choose their preferred methods of alerting, giving them the flexibility to choose what kind of alerts to receive:

  • At the step level – Tracking the entire run of the pipeline
  • At the pipeline level – When it fails, or when it finishes with a SUCCESS or FAILED status

During the deployment of the pipeline, a new Amazon Simple Notification Service (Amazon SNS) topic gets created with the subscriptions matching the targets specified by the user (URL for OpsGenie, Lambda for Slack or email).

The following code is an example of what it looks like in the user’s pipeline.yaml:

notifications:
    type: FULL_EXECUTION
    targets:
        - channel: SLACK
          addresses:
               - data-pipeline-alerts
        - channel: EMAIL
          addresses:
               - [email protected]

The alerting flow includes the following steps:

  1. As the pipeline (Step Functions workflow) starts (black square 2 in the diagram), the run gets logged into CloudWatch Logs in a log group corresponding to the name of the pipeline (for example, listings).
  2. Depending on the user preference, all the run steps or events may get logged or not thanks to a subscription filter whose target is the execution-tracker-lambda Lambda function. The function gets called anytime a new event gets published in CloudWatch.
  3. This Lambda function parses and formats the message, then publishes it to the SNS topic.
  4. For the email and OpsGenie flows, the flow stops here. For posting the alert message on Slack, the Slack API caller Lambda function gets called with the formatted event payload.
  5. The function then publishes the message to the /messages endpoint of the Slack API Gateway.
  6. The Lambda function behind this endpoint runs, and posts the message in the corresponding Slack channel and under the right Slack thread (if applicable).
  7. The function retrieves the secret Slack REST API key from AWS Secrets Manager.
  8. It retrieves the Slack channels in which the alert should be posted.
  9. It retrieves the root message of the run, if any, so that subsequent messages get posted under the current run thread on Slack.
  10. It posts the message on Slack.
  11. If this is the first message for this run, it stores the mapping with the DB schema execution/slack_message_id to initiate a thread for future messages related to the same run.

Resource provisioning

This section is illustrated with the light blue squares in the architecture diagram.

To run a data pipeline, we need to provision an EMR cluster, which in turn requires some information like Hive metastore credentials, as shown in the workflow. The workflow steps are as follows:

  • Trigger the Step Functions workflow listings on schedule.
  • Run the listings workflow.
  • Provision an EMR cluster.
  • Use a custom resource to decrypt the Hive metastore password to be used in Spark jobs relying on central Hive tables or views.

End-user experience

After all preconditions are fulfilled (both the listings and searches pipelines succeeded), the pipeline-X workflow runs as shown in the following diagram.

As shown in the diagram, the pipeline description (as a sequence of steps) defined by the user in the pipeline.yaml is represented by the orange block.

The steps before and after this orange section are automatically generated by our product, so users don’t have to take care of provisioning and freeing compute resources. In short, the CLI tool we provide our users synthesizes the user’s pipeline definition in the pipeline.yaml and generates the corresponding DAG.

Additional considerations and next steps

We tried to stay consistent and stick to one programming language for the creation of this product. We chose TypeScript, which played well with AWS CDK, the infrastructure as code (IaC) framework that we used to build the infrastructure of the product.

Similarly, we chose TypeScript for building the business logic of our Lambda functions, and of the CLI tool (using Oclif) we provide for our users.

As demonstrated in this post, EventBridge is a powerful service for event-driven architectures, and it plays a central and important role in our products. As for its limitations, we found that pairing Lambda and EventBridge could fulfill all our current needs and granted a high level of customization that allowed us to be creative in the features we wanted to serve our users.

Needless to say, we plan to keep developing the product, and have a multitude of ideas, notably:

  • Extend the list of core resources on which workloads run (currently only Amazon EMR) by adding other compute services, such Amazon Elastic Compute Cloud (Amazon EC2)
  • Use the Constructs Hub to allow users in the organization to develop custom steps to be used in all data pipelines (we currently only offer Spark and shell steps, which suffice in most cases)
  • Use the stored metadata regarding pipeline dependencies for data lineage, to have an overview of the overall health of the data pipelines in the organization, and more

Conclusion

This architecture and product brought many benefits. It allows us to:

  • Have a more robust and clear dependency management of data pipelines at Scout24.
  • Save on compute costs by avoiding scheduling pipelines based approximately on when its predecessors are usually triggered. By shifting to an event-driven paradigm, no pipeline gets started unless all its prerequisites are fulfilled.
  • Track our pipelines granularly and in real time on a step level.
  • Provide more flexible and alternative business logic by exposing multiple event types that downstream pipelines can listen to. For example, a fallback downstream pipeline might be run in case of a parent pipeline failure.
  • Reduce the cross-team communication overhead in case of failures or stopped runs by increasing the transparency of the whole pipelines’ dependency landscape.
  • Avoid manually restarting pipelines after an upstream pipeline is fixed.
  • Have an overview of all jobs that run.
  • Support the creation of a performance culture characterized by accountability.

We have big plans for this product. We will use DataMario to implement granular data lineage, observability, and governance. It’s a key piece of infrastructure in our strategy to scale data engineering and analytics at Scout24.

We will make DataMario open source towards the end of 2022. This is in line with our strategy to promote our approach to a solution on a self-built, scalable data platform. And with our next steps, we hope to extend this list of benefits and ease the pain in other companies solving similar challenges.

Thank you for reading.


About the authors

Mehdi Bendriss is a Senior Data / Data Platform Engineer, MSc in Computer Science and over 9 years of experience in software, ML, and data and data platform engineering, designing and building large-scale data and data platform products.

Mohamad Shaker is a Senior Data / Data Platform Engineer, with over 9 years of experience in software and data engineering, designing and building large-scale data and data platform products that enable users to access, explore, and utilize their data to build great data products.

Arvid Reiche is a Data Platform Leader, with over 9 years of experience in data, building a data platform that scales and serves the needs of the users.

Marco Salazar is a Solutions Architect working with Digital Native customers in the DACH region with over 5 years of experience building and delivering end-to-end, high-impact, cloud native solutions on AWS for Enterprise and Sports customers across EMEA. He currently focuses on enabling customers to define technology strategies on AWS for the short- and long-term that allow them achieve their desired business objectives, specializing on Data and Analytics engagements. In his free time, Marco enjoys building side-projects involving mobile/web apps, microcontrollers & IoT, and most recently wearable technologies.

Your guide to AWS Analytics at re:Invent 2022

Post Syndicated from Imtiaz Sayed original https://aws.amazon.com/blogs/big-data/your-guide-to-aws-analytics-at-reinvent-2022/

Join the global cloud community at AWS re:Invent this year to meet, get inspired, and rethink what’s possible!

Reserved seating is available for registered attendees to secure seats in the sessions of their choice. You can reserve a seat in your favorite sessions by signing in to the attendee portal and navigating to Event Sessions. For those who can’t make it in person, you can get your free online pass to watch live keynotes and leadership sessions by registering for a virtual-only access. This curated attendee guide helps data and analytics enthusiasts manage their schedule*, as well as navigate the AWS analytics and business intelligence tracks to get the best out of re:Invent.

For additional session details, visit the AWS Analytics splash page.

#AWSanalytics, #awsfordata, #reinvent22

Keynotes

KEY002 | Adam Selipsky (CEO, Amazon Web Services) | Tuesday, November 29 | 8:30 AM – 10:30 AM

Join Adam Selipsky, CEO of Amazon Web Services, as he looks at the ways that forward-thinking builders are transforming industries and even our future, powered by AWS.

KEY003 | Swami Sivasubramanian (Vice President, AWS Data and Machine Learning) | Wednesday, November 30 | 8:30 AM – 10:30 AM

Join Swami Sivasubramanian, Vice President of AWS Data and Machine Learning, as he reveals the latest AWS innovations that can help you transform your company’s data into meaningful insights and actions for your business.

Leadership sessions

ANT203-L | Unlock the value of your data with AWS analytics | G2 Krishnamoorthy, VP of AWS Analytics | Wednesday, November 30 | 2:30 PM – 3:30 PM

G2 addresses the current state of analytics on AWS, covers the latest service innovations around data, and highlights customer successes with AWS analytics. Also, learn from organizations like FINRA and more who have turned to AWS for their digital transformation journey.

Breakout sessions

AWS re:Invent breakout sessions are lecture-style and one hour long sessions delivered by AWS experts, customers, and partners.

Monday, Nov 28 Tuesday, Nov 29 Wednesday, Nov 30 Thursday, Dec 1 Friday, Dec 2

10:00 AM – 11:00 AM

ANT326 | How BMW, Intuit, and Morningstar are transforming with AWS and Amazon Athena

11:00 AM – 12:00 PM

ANT301 | Democratizing your organization’s data analytics experience

10:00 AM – 11:00 AM

ANT212 | How JPMC and LexisNexis modernize analytics with Amazon Redshift

12:30 PM – 1:30 PM

ANT207 | What’s new in AWS streaming

8:30 AM – 9:30 AM

ANT311 | Building security operations with Amazon OpenSearch Service

11:30 AM – 12:30 PM

ANT206 | What’s new in Amazon OpenSearch Service

12:15 PM – 1:15 PM

ANT334 | Simplify and accelerate data integration and ETL modernization with AWS Glue

10:00 AM – 11:00 AM

ANT209 | Build interactive analytics applications

12:30 PM – 1:30 PM

BSI203 | Differentiate your apps with Amazon QuickSight embedded analytics

.

12:15 PM – 1:15 PM

ANT337 | Migrating to Amazon EMR to reduce costs and simplify operations

1:15 PM – 2:15 PM

ANT205 | Achieving your modern data architecture

10:45 AM – 11:45 AM

ANT218 | Leveling up computer vision and artificial intelligence development

1:15 PM – 2:15 PM

ANT336 | Building data mesh architectures on AWS

.

1:00 PM – 2:00 PM

ANT341 | How Riot Games processes 20 TB of analytics data daily on AWS

2:00 PM – 3:00 PM

BSI201 | Reinvent how you derive value from your data with Amazon QuickSight

11:30 AM – 12:30 PM

ANT340 | How Sony Orchard accelerated innovation with Amazon MSK

2:00 PM – 3:00 PM

ANT342 | How Poshmark accelerates growth via real-time analytics and personalization

.

1:45 PM – 2:45 PM

BSI207 | Get clarity on your data in seconds with Amazon QuickSight Q

2:45 PM – 3:45 PM

ANT339 | How Samsung modernized architecture for real-time analytics

1:00 PM – 2:00 PM

ANT201 | What’s new with Amazon Redshift

3:30 PM – 4:30 PM

ANT219 | Dow Jones and 3M: Observability with Amazon OpenSearch Service

.

3:15 PM – 4:15 PM

ANT302 | What’s new with Amazon EMR

3:30 PM – 4:30 PM

ANT204 | Enabling agility with data governance on AWS

2:30 PM – 3:30 PM

BSI202 | Migrate to cloud-native business analytics with Amazon QuickSight

. .

4:45 PM – 5:45 PM

ANT335 | How Disney Parks uses AWS Glue to replace thousands of Hadoop jobs

5:00 PM – 6:00 PM

ANT338 | Scaling data processing with Amazon EMR at the speed of market volatility

4:45 PM – 5:45 PM

ANT324 | Modernize your data warehouse

. .

5:30 PM – 6:30 PM

ANT220 | Using Amazon AppFlow to break down data silos for analytics and ML

5:45 PM – 6:45 PM

ANT325 | Simplify running Apache Spark and Hive apps with Amazon EMR Serverless

5:30 PM – 6:30 PM

ANT317 | Self-service analytics with Amazon Redshift Serverless

. .

Chalk talks

Chalk talks are an hour long, highly interactive content format with a small audience. Each begins with a short lecture delivered by an AWS expert, followed by a Q&A session with the audience.

Monday, Nov 28 Tuesday, Nov 29 Wednesday, Nov 30 Thursday, Dec 1 Friday, Dec 2

12:15 PM – 1:15 PM

ANT303 | Security and data access controls in Amazon EMR

11:00 AM – 12:00 PM

ANT318 [Repeat] | Build event-based microservices with AWS streaming services

9:15 AM – 10:15 AM

ANT320 [Repeat] | Get better price performance in cloud data warehousing with Amazon Redshift

11:45 AM – 12:45 PM

ANT329 | Turn data to insights in seconds with secure and reliable Amazon Redshift

9:15 AM – 10:15 AM

ANT314 [Repeat] | Why and how to migrate to Amazon OpenSearch Service

12:15 PM – 1:15 PM

BSI401 | Insightful dashboards through advanced calculations with QuickSight

11:45 AM – 12:45 PM

BSI302 | Deploy your BI assets at scale to thousands with Amazon QuickSight

10:45 AM – 11:45 AM

ANT330 [Repeat] | Run Apache Spark on Kubernetes with Amazon EMR on Amazon EKS

1:15 PM – 2:15 PM

ANT401 | Ingest machine-generated data at scale with Amazon OpenSearch Service

10:00 AM – 11:00 AM

ANT322 [Repeat] | Simplifying ETL migration and data integration with AWS Glue

1:00 PM – 2:00 PM

ANT323 [Repeat] | Break through data silos with Amazon Redshift

1:15 PM – 2:15 PM

ANT327 | Modernize your analytics architecture with Amazon Athena

12:15 PM – 1:15 PM

ANT323 [Repeat] | Break through data silos with Amazon Redshift

2:00 PM – 3:00 PM

ANT333 [Repeat] | Build a serverless data streaming workload with Amazon Kinesis

..

1:45 PM – 2:45 PM

ANT319 | Democratizing ML for data analysts

2:45 PM – 3:45 PM

ANT320 [Repeat] | Get better price performance in cloud data warehousing with Amazon Redshift

4:00 PM – 5:00 PM

ANT314 [Repeat] | Why and how to migrate to Amazon OpenSearch Service

.2:00 AM – 3:00 PM

ANT330 [Repeat] | Run Apache Spark on Kubernetes with Amazon EMR on Amazon EKS

.

1:45 PM – 2:45 PM

ANT322 [Repeat] | Simplifying ETL migration and data integration with AWS Glue

2:45 PM – 3:45 PM

BSI301 | Architecting multi-tenancy for your apps with Amazon QuickSight

4:45 PM – 5:45 PM

ANT333 [Repeat] | Build a serverless data streaming workload with Amazon Kinesis

. .

5:30 PM – 6:30 PM

ANT315 | Optimizing Amazon OpenSearch Service domains for scale and cost

4:15 PM – 5:15 PM

ANT304 | Run serverless Spark workloads with AWS analytics

4:45 PM – 5:45 PM

ANT331 | Understanding TCO for different Amazon EMR deployment models

. .
.

5:00 PM – 6:00 PM

ANT328 | Build transactional data lakes using open-table formats in Amazon Athena

4:45 PM – 5:45 PM

ANT321 | What’s new in AWS Lake Formation

. .
. .

7:00 PM – 8:00 PM

ANT318 [Repeat] | Build event-based microservices with AWS streaming services

. .

Builders’ sessions

These are one-hour small-group sessions with up to nine attendees per table and one AWS expert. Each builders’ session begins with a short explanation or demonstration of what you’re going to build. Once the demonstration is complete, bring your laptop to experiment and build with the AWS expert.

Monday, Nov 28 Tuesday, Nov 29 Wednesday, Nov 30 Thursday, Dec 1 Friday, Dec 2
………………………….

11:00 AM – 12:00 PM

ANT402 | Human vs. machine: Amazon Redshift ML inferences

1:00 PM – 2:00 PM

ANT332 | Build a data pipeline using Apache Airflow and Amazon EMR Serverless

11:00 AM – 12:00 PM

ANT316 [Repeat] | How to build dashboards for machine-generated data

………………………
. .

7:00 PM – 8:00 PM

ANT316 [Repeat] | How to build dashboards for machine-generated data

. .

Workshops

Workshops are two-hour interactive sessions where you work in teams or individually to solve problems using AWS services. Each workshop starts with a short lecture, and the rest of the time is spent working the problem. Bring your laptop to build along with AWS experts.

Monday, Nov 28 Tuesday, Nov 29 Wednesday, Nov 30 Thursday, Dec 1 Friday, Dec 2

10:00 AM – 12:00 PM

ANT306 [Repeat] | Beyond monitoring: Observability with operational analytics

11:45 AM – 1:45 PM

ANT313 | Using Apache Spark for data science and ML workflows with Amazon EMR

8:30 AM – 10:30 AM

ANT307 | Improve search relevance with ML in Amazon OpenSearch Service

11:00 AM – 1:00 PM

ANT403 | Event detection with Amazon MSK and Amazon Kinesis Data Analytics

8:30 AM – 10:30 AM

ANT309 [Repeat]| Build analytics applications using Apache Spark with Amazon EMR Serverless

4:00 PM – 6:00 PM

ANT309 [Repeat]| Build analytics applications using Apache Spark with Amazon EMR Serverless

2:45 PM – 4:45 PM

ANT310 [Repeat] | Build a data mesh with AWS Lake Formation and AWS Glue

12:15 PM – 2:15 PM

ANT306 [Repeat] | Beyond monitoring: Observability with operational analytics

11:45 AM – 1:45 PM

BSI205 | Build stunning customized dashboards with Amazon QuickSight

.
. .

12:15 PM – 2:15 PM

ANT312 | Near real-time ML inferences with Amazon Redshift

2:45 PM – 4:45 PM

ANT308 | Seamless data sharing using Amazon

.
. .

5:30 PM – 7:30 PM

ANT310 [Repeat] | Build a data mesh with AWS Lake Formation and AWS Glue

. .
. .

5:30 PM – 7:30 PM

BSI303 | Seamlessly embed analytics into your apps with Amazon QuickSight

. .

* All schedules are in PDT time zone.

AWS Analytics & Business Intelligence kiosks

Join us at the AWS Analytics Kiosk in the AWS Village at the Expo. Dive deep into AWS Analytics with AWS subject matter experts, see the latest demos, ask questions, or just drop by to socially connect with your peers.


About the author

Imtiaz (Taz) Sayed is the WW Tech Leader for Analytics at AWS. He enjoys engaging with the community on all things data and analytics. He can be reached via
LinkedIn.

Retain more for less with tiered storage for Amazon MSK

Post Syndicated from Masudur Rahaman Sayem original https://aws.amazon.com/blogs/big-data/retain-more-for-less-with-tiered-storage-for-amazon-msk/

Organizations are adopting Apache Kafka and Amazon Managed Streaming for Apache Kafka (Amazon MSK) to capture and analyze data in real-time. Amazon MSK allows you to build and run production applications on Apache Kafka without needing Kafka infrastructure management expertise or having to deal with the complex overheads associated with running Apache Kafka on your own. With increasing maturity, customers seek to build sophisticated use cases that combine aspects of real time and batch processing. For instance, you may want to train machine learning (ML) models based on historic data and then use these models to do real time inferencing. Or you may want to be able to recompute previous results when the application logic changed, e.g., when a new KPI is added to a streaming analytics application or when a bug was fixed that caused incorrect output. These use cases often require storing data for several weeks, months, or even years.

Apache Kafka is well positioned to support these kind of use cases. Data is retained in the Kafka cluster as long as required by configuring the retention policy. In this way, the most recent data can be processed in real time for low-latency use cases while historic data remains accessible in the cluster and can be processed in a batch fashion.

However, retaining data in a Kafka cluster can become expensive because storage and compute are tightly coupled in a cluster. To scale storage, you need to add more brokers. But adding more brokers with the sole purpose of increasing the storage squanders the rest of the compute resources like CPU and memory. Also, a large cluster with more nodes adds operational complexity with a longer time to recover and rebalance when a broker fails. To avoid that operational complexity and higher cost, you can move your data to Amazon Simple Storage Service (Amazon S3) for long-term access and with cost-effective storage classes in Amazon S3 you can optimize your overall storage cost. This solves cost challenges, but now you have to build and maintain that part of the architecture for data movement to a different data store. You also need to build different data processing logic using different APIs for consuming data (Kafka API for streaming, Amazon S3 API for historic reads).

Today, we’re announcing Amazon MSK tiered storage, which brings a virtually unlimited and low-cost storage tier for Amazon MSK, making it simpler and cost-effective for developers to build streaming data applications. Since the launch of Amazon MSK in 2019, we have enabled capabilities such as vertical scaling and automatic scaling of broker storage so you can operate your Kafka workloads in a cost-effective way. Earlier this year, we launched provisioned throughput which enables seamlessly scaling I/O without having to provision additional brokers. Tiered storage makes it even more cost-effective for you to run Kafka workloads. You can now store data in Apache Kafka without worrying about limits. You can effectively balance your performance and costs by using the performance-optimized primary storage for real-time data and the new low-cost tier for the historical data. With a few clicks, you can move streaming data into a lower-cost tier to store data and only pay for what you use.

Tiered storage frees you from making hard trade-offs between supporting the data retention needs of your application teams and the operational complexity that comes with it. This enables you to use the same code to process both real-time and historical data to minimize redundant workflows and simplify architectures. With Amazon MSK tiered storage, you can implement a Kappa architecture – a streaming-first software architecture deployment pattern – to use the same data processing pipeline for correctness and completeness of data over a much longer time horizon for business analysis.

How Amazon MSK tiered storage works

Let’s look at how tiered storage works for Amazon MSK. Apache Kafka stores data in files called log segments. As each segment completes, based on the segment size configured at cluster or topic level, it’s copied to the low-cost storage tier. Data is held in performance-optimized storage for a specified retention time, or up to a specified size, and then deleted. There is a separate time and size limit setting for the low-cost storage, which must be longer than the performance-optimized storage tier. If clients request data from segments stored in the low-cost tier, the broker reads the data from it and serves the data in the same way as if it were being served from the performance-optimized storage. The APIs and existing clients work with minimal changes. When your application starts reading data from the low-cost tier, you can expect an increase in read latency for the first few bytes. As you start reading the remaining data sequentially from the low-cost tier, you can expect latencies that are similar to the primary storage tier. With tiered storage, you pay for the amount of data you store and the amount of data you retrieve.

For a pricing example, let’s consider a workload where your ingestion rate is 15 MB/s, with a replication factor of 3, and you want to retain data in your Kafka cluster for 7 days. For such a workload, it requires 6x m5.large brokers, with 32.4 TB EBS storage, which costs $4,755. But if you use tiered storage for the same workload with local retention of 4 hours and overall data retention of 7 days, it requires 3x m5.large brokers, with 0.8 TB EBS storage and 9 TB of tiered storage, which costs $1,584. If you want to read all the historic data at once, it costs $13 ($0.0015 per GB retrieval cost). In this example with tiered storage, you save around 66% of your overall cost.

Get started using Amazon MSK tiered storage

To enable tiered storage on your existing cluster, upgrade your MSK cluster to Kafka version 2.8.2.tiered and then choose Tiered storage and EBS storage as your cluster storage mode on the Amazon MSK console.

After tiered storage is enabled on the cluster level, run the following command to enable tiered storage on an existing topic. In this example, you’re enabling tiered storage on a topic called msk-ts-topic with 7 days’ retention (local.retention.ms=604800000) for a local high-performance storage tier, setting 180 days’ retention (retention.ms=15550000000) to retain the data in the low-cost storage tier, and updating the log segment size to 48 MB:

bin/kafka-configs.sh --bootstrap-server $bsrv --alter --entity-type topics --entity-name msk-ts-topic --add-config 'remote.storage.enable=true, local.retention.ms=604800000, retention.ms=15550000000, segment.bytes=50331648'

Availability and pricing

Amazon MSK tiered storage is available in all AWS regions where Amazon MSK is available excluding the AWS China, AWS GovCloud regions. This low-cost storage tier scales to virtually unlimited storage and requires no upfront provisioning. You pay only for the volume of data retained and retrieved in the low-cost tier.

For more information about this feature and its pricing, see the Amazon MSK developer guide and Amazon MSK pricing page. For finding the right sizing for your cluster, see the best practices page.

Summary

With Amazon MSK tiered storage you don’t need to provision storage for the low-cost tier or manage the infrastructure. Tiered storage enables you to scale to virtually unlimited storage. You can access data in the low-cost tier using the same clients you currently use to read data from the high-performance primary storage tier. Apache Kafka’s consumer API, streams API, and connectors consume data from both tiers without changes. You can modify the retention limits on the low-cost storage tier similarly as to how you can modify the retention limits on the high-performance storage.

Enable tiered storage on your MSK clusters today to retain data longer at a lower cost.


About the Author

Masudur Rahaman Sayem is a Streaming Architect at AWS. He works with AWS customers globally to design and build data streaming architecture to solve real-world business problems. He is passionate about distributed systems. He also likes to read, especially classic comic books.

Measure the adoption of your Amazon QuickSight dashboards and view your BI portfolio in a single pane of glass

Post Syndicated from Maitri Brahmbhatt original https://aws.amazon.com/blogs/big-data/measure-the-adoption-of-your-amazon-quicksight-dashboards-and-view-your-bi-portfolio-in-a-single-pane-of-glass/

Amazon QuickSight is a fully managed, cloud-native business intelligence (BI) service. If you plan to deploy enterprise-grade QuickSight dashboards, measuring user adoption and usage patterns is an important ingredient for the success of your BI investment. For example, knowing the usage patterns like geo location, department, and job role can help you fine-tune your dashboards to the right audience. Furthermore, to return the investment of your BI portfolio, with dashboard usage, you can reduce license costs by identifying inactive QuickSight authors.

In this post, we introduce the latest Admin Console, an AWS packaged solution that you can easily deploy and use to create a usage and inventory dashboard for your QuickSight assets. The Admin Console helps identify usage patterns of an individual user and dashboards. It can also help you track which dashboards and groups you have or need access to, and what you can do with that access, by providing more details on QuickSight group and user permissions and activities and QuickSight asset (dashboards, analyses, and datasets) permissions. With timely access to interactive usage metrics, the Admin Console can help BI leaders and administrators make a cost-efficient plan for dashboard improvements. Another common use case of this dashboard is to provide a centralized repository of the QuickSight assets. QuickSight artifacts consists of multiple types of assets (dashboards, analyses, datasets, and more) with dependencies between them. Having a single repository to view all assets and their dependencies can be an important element in your enterprise data dictionary.

This post demonstrates how to build the Admin Console using a serverless data pipeline. With basic AWS knowledge, you can create this solution in your own environment within an hour. Alternatively, you can dive deep into the source code to meet your specific needs.

Admin Console dashboard

The following animation displays the contents of our demo dashboard.

The Admin Console dashboard includes six sheets:

  • Landing Page – Provides drill-down into each detailed tabs.
  • User Analysis – Provides detailed analysis of the user behavior and identifies active and inactive users and authors.
  • Dashboard Analysis – Shows the most commonly viewed dashboards.
  • Assets Access Permissions – Provides information on permissions applied to each asset, such as dashboard, analysis, datasets, data source, and themes.
  • Data Dictionary – Provides information on the relationships between each of your assets, such as which analysis was used to build each dashboard, and which datasets and data sources are being used in each analysis. It also provides details on each dataset, including schema name, table name, columns, and more.
  • Overview – Provides instructions on how to use the dashboard.

You can interactively play with the sample dashboard in the following Interactive Dashboard Demo.

Let’s look at Forwood Safety, an innovative, values-driven company with a laser focus on fatality prevention. An early adopter of QuickSight, they collaborated with AWS to deploy this solution to collect BI application usage insights.

“Our engineers love this admin console solution,” says Faye Crompton, Leader of Analytics and Benchmarking at Forwood. “It helps us to understand how users analyze critical control learnings by helping us to quickly identify the most frequently visited dashboards in Forwood’s self-service analytics and reporting tool, FAST.”

Solution overview

The following diagram illustrates the workflow of the solution.

The workflow involves the following steps:

  1. The AWS Lambda function Data_Prepare is scheduled to run hourly. This function calls QuickSight APIs to get the QuickSight namespace, group, user, and asset access permissions information.
  2. The Lambda function Dataset_Info is scheduled to run hourly. This function calls QuickSight APIs to get dashboard, analysis, dataset, and data source information.
  3. Both the functions save the results to an Amazon Simple Storage Service (Amazon S3) bucket.
  4. AWS CloudTrail logs are stored in an S3 bucket.
  5. Based on the file in Amazon S3 that contains user-group information, dataset information, QuickSight assets access permissions information, as well as dashboard views and user login events from the CloudTrail logs, five Amazon Athena tables are created. Optionally, the BI engineer can combine these tables with employee information tables to display human resource information of the users.
  6. Four QuickSight datasets fetch the data from the Athena tables created in Step 5 and import them into SPICE. Then, based on these datasets, a QuickSight dashboard is created.

Prerequisites

For this walkthrough, you should have the following prerequisites:

Create solution resources

We can create all the resources needed for this dashboard using three CloudFormation templates: one for Lambda functions, one for Athena tables, and one for QuickSight objects.

CloudFormation template for Lambda functions

This template creates the Lambda functions data_prepare and dataset_info.

  • Choose Launch Stack and follow the steps to create these resources.

After the stack creation is successful, you have two Lambda functions, data_prepare and dataset_info, and one S3 bucket named admin-console[AWS-account-ID]. You can verify if the Lambda function can run successfully and if the group_membership, object_access, datasets_info, and data_dictionary folders are created in the S3 bucket under admin-console[AWS-account-ID]/monitoring/quicksight/, as shown in the following screenshots.

The Data_Prepare Lambda function is scheduled to run hourly with the CloudWatch Events rule admin-console-every-hour. This function calls the QuickSight Assets APIs to get QuickSight users, assets, and the access permissions information. Finally, this function creates two files, group_membership.csv and object_access.csv, and saves these files to an S3 bucket.

The Dataset_Info Lambda function is scheduled to run hourly and calls the QuickSight Assets APIs to get datasets, schemas, tables, and fields (columns) information. Then this function creates two files, datasets_info.csv and data_dictionary.csv, and saves these files to an S3 bucket.

  •  Create a CloudTrail log if you don’t already have one and note down the S3 bucket name of the log files for future use.
  •  Note down all the resources created from the previous steps. If the S3 bucket name for the CloudTrail log from step 2 is different from the one in step 1’s output, use the S3 bucket from step 2.

The following table summarizes the keys and values you use when creating the Athena tables with the next CloudFormation stack.

Key Value Description
cloudtraillog s3://cloudtrail-awslogs-[aws-account-id]-do-not-delete/AWSLogs/[aws-account-id]/CloudTrail The Amazon S3 location of the CloudTrail log
cloudtraillogtablename cloudtrail_logs The table name of CloudTrail log
groupmembership s3://admin-console[aws-account-id]/monitoring/quicksight/group_membership The Amazon S3 location of group_membership.csv
objectaccess s3://admin-console[aws-account-id]/monitoring/quicksight/object_access The Amazon S3 location of object_access.csv
dataset info s3://admin-console[aws-account-id]/monitoring/quicksight/datsets_info The Amazon S3 location of datsets_info.csv
datadict s3://admin-console[aws-account-id]/monitoring/quicksight/data_dictionary The Amazon S3 location of data_dictionary.csv

CloudFormation template for Athena tables

To create your Athena tables, complete the following steps:

  • Download the following JSON file.
  • Edit the file and replace the corresponding fields with the keys and values you noted in the previous section.

For example, search for the groupmembership keyword.

Then replace the location value with the Amazon S3 location for the groupmembership folder.

  • Create Athena tables by deploying this edited file as a CloudFormation template. For instructions, refer to Get started.

After a successful deployment, you have a database called admin-console created in AwsDataCatalog in Athena and three tables in the database: cloudtrail_logs, group_membership, object_access, datasets_info and data_dict

  • Confirm the tables via the Athena console.

The following screenshot shows sample data of the group_membership table.

The following screenshot shows sample data of the object_access table.

For instructions on building an Athena table with CloudTrail events, see Amazon QuickSight Now Supports Audit Logging with AWS CloudTrail. For this post, we create the table cloudtrail_logs in the default database.

  • After all five tables are created in Athena, go to the security permissions on the QuickSight console to enable bucket access for s3://admin-console[AWS-account-ID] and s3://cloudtrail-awslogs-[aws-account-id]-do-not-delete.
  • Enable Athena access under Security & Permissions.

Now QuickSight can access all five tables through Athena.

CloudFormation template for QuickSight objects

To create the QuickSight objects, complete the following steps:

  • Get the QuickSight admin user’s ARN by running following command in the AWS Command Line Interface (AWS CLI):
    aws quicksight describe-user --aws-account-id [aws-account-id] --namespace default --user-name [admin-user-name]

    For example: arn:aws:quicksight:us-east-1:12345678910:user/default/admin/xyz.

  • Choose Launch Stack to create the QuickSight datasets and dashboard:

  • Provide the ARN you noted earlier.

After a successful deployment, four datasets named Admin-Console-Group-Membership, Admin-Console-dataset-info, Admin-Console-Object-Access, and Admin-Console-CFN-Main are created and you have the dashboard named admin-console-dashboard. If modifying the dashboard is preferred, use the dashboard save-as option, then recreate the analysis, make modifications, and publish a new dashboard.

  • Set your preferred SPICE refresh schedule for the four SPICE datasets, and share the dashboard in your organization as needed.

Dashboard demo

The following screenshot shows the Admin Console Landing page.

The following screenshot shows the User Analysis sheet.

The following screenshot shows the Dashboards Analysis sheet.

The following screenshot shows the Access Permissions sheet.

The following screenshot shows the Data Dictionary sheet.

The following screenshot shows the Overview sheet.

You can interactively play with the sample dashboard in the following Interactive Dashboard Demo.

You can reference the public template of the preceding dashboard in create-template, create-analysis, and create-dashboard API calls to create this dashboard and analysis in your account. The public template of this dashboard with the template ARN is 'TemplateArn': 'arn:aws:quicksight:us-east-1:889399602426:template/admin-console'.

Tips and tricks

Here are some advanced tips and tricks to build the dashboard as the Admin Console to analyze usage metrics. The following steps are based on the dataset admin_console. You can apply the same logic to create the calculated fields to analyze user login activities.

  • Create parameters – For example, we can create a parameter called InActivityMonths, as in the following screenshot. Similarly, we can create other parameters such as InActivityDays, Start Date, and End Date.

  • Create controls based on the parameters – In the following screenshot, we create controls based on the start and end date.

  • Create calculated fields – For instance, we can create a calculated field to detect the active or inactive status of QuickSight authors. If the time span between the latest view dashboard activity and now is larger or equal to the number defined in the Inactivity Months control, the author status is Inactive. The following screenshot shows the relevant code. According to the end-user’s requirements, we can define several calculated fields to perform the analysis.

  • Create visuals – For example, we create an insight to display the top three dashboard views by reader and a visual to display the authors of these dashboards.

  • Add URL actions – You can add an URL action to define some extra features to email inactive authors or check details of users.

The following sample code defines the action to email inactive authors:

mailto:<<email>>?subject=Alert to inactive author! &body=Hi, <<username>>, any author without activity for more than a month will be deleted. Please log in to your QuickSight account to continue accessing and building analyses and dashboards!

Clean up

To avoid incurring future charges, delete all the resources you created with the CloudFormation templates.

Conclusion

This post discussed how BI administrators can use QuickSight, CloudTrail, and other AWS services to create a centralized view to analyze QuickSight usage metrics. We also presented a serverless data pipeline to support the Admin Console dashboard.

If you would like to have a demo, please email us.

Appendix

We can perform some additional sophisticated analysis to collect advanced usage metrics. For example, Forwood Safety raised a unique request to analyze the readers who log in but don’t view any dashboard actions (see the following code). This helps their clients identify and prevent any wasting of reader sessions fees. Leadership teams value the ability to minimize uneconomical user activity.

CREATE OR REPLACE VIEW "loginwithoutviewdashboard" AS
with login as
(SELECT COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) AS "user_name", awsregion,
date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time
FROM cloudtrail_logs
WHERE
eventname = 'AssumeRoleWithSAML'
GROUP BY  1,2,3),
dashboard as
(SELECT COALESCE("useridentity"."username", "split_part"("useridentity"."arn", '/', 3)) AS "user_name", awsregion,
date_parse(eventtime, '%Y-%m-%dT%H:%i:%sZ') AS event_time
FROM cloudtrail_logs
WHERE
eventsource = 'quicksight.amazonaws.com'
AND
eventname = 'GetDashboard'
GROUP BY  1,2,3),
users as 
(select Namespace,
Group,
User,
(case
when Group in (‘quicksight-fed-bi-developer’, ‘quicksight-fed-bi-admin’)
then ‘Author’
else ‘Reader’
end)
as author_status
from "group_membership" )
select l.* 
from login as l 
join dashboard as d 
join users as u 
on l.user_name=d.user_name 
and 
l.awsregion=d.awsregion 
and 
l.user_name=u.user_name
where d.event_time>(l.event_time + interval '30' minute ) 
and 
d.event_time<l.event_time 
and 
u.author_status='Reader'

About the Authors

Ying Wang is a Manager of Software Development Engineer. She has 12 years of expertise in data analytics and science. She assisted customers with enterprise data architecture solutions to scale their data analytics in the cloud during her time as a data architect. Currently, she helps customer to unlock the power of Data with QuickSight from engineering by delivering new features.

Ian Liao is a Senior Data Visualization Architect at AWS Professional Services. Before AWS, Ian spent years building startups in data and analytics. Now he enjoys helping customer to scale their data application on the cloud.

Maitri Brahmbhatt is a Business Intelligence Engineer at AWS. She helps customers and partners leverage their data to gain insights into their business and make data driven decisions by developing QuickSight dashboards.

Simplify data analysis and collaboration with SQL Notebooks in Amazon Redshift Query Editor V2.0

Post Syndicated from Ranjan Burman original https://aws.amazon.com/blogs/big-data/simplify-data-analysis-and-collaboration-with-sql-notebooks-in-amazon-redshift-query-editor-v2-0/

Amazon Redshift Query Editor V2.0 is a web-based analyst workbench that you can use to author and run queries on your Amazon Redshift data warehouse. You can visualize query results with charts, and explore, share, and collaborate on data with your teams in SQL through a common interface.

With SQL Notebooks, Amazon Redshift Query Editor V2.0 simplifies organizing, documenting, and sharing of data analysis with SQL queries. The notebook interface enables users such as data analysts, data scientists, and data engineers to author SQL code more easily, organizing multiple SQL queries and annotations on a single document. You can also collaborate with your team members by sharing notebooks. With SQL Notebooks, you can visualize the query results using charts. SQL Notebooks support provides an alternative way to embed all queries required for a complete data analysis in a single document using SQL cells. Query Editor V2.0 simplifies development of SQL notebooks with query versioning and export/import features. You can use the built-in version history feature to track changes in your SQL and markdown cells. With the export/import feature, you can easily move your notebooks from development to production accounts or share with team members cross-Region and cross-account.

In this post, we demonstrate how to use SQL Notebooks using Query Editor V2.0 and walk you through some of the new features.

Use cases for SQL Notebooks

Customers want to use SQL notebooks when they want reusable SQL code with multiple SQL statements and annotations or documentations. For example:

  • A data analyst might have several SQL queries to analyze data that create temporary tables, and runs multiple SQL queries in sequence to derive insights. They might also perform visual analysis of the results.
  • A data scientist might create a notebook that creates some training data, creates a model, tests the model, and runs sample predictions.
  • A data engineer might have a script to create schema and tables, load sample data, and run test queries.

Solution overview

For this post, we use the Global Database of Events, Language, and Tone (GDELT) dataset, which monitors news across the world, and the data is stored for every second of every day. This information is freely available as part of the Registry of Open Data on AWS.

For our use case, a data scientist wants to perform unsupervised learning with Amazon Redshift ML by creating a machine learning (ML) model, and then generate insights from the dataset, create multiple versions of the notebook, visualize using charts, and share the notebook with other team members.

Prerequisites

To use the SQL Notebooks feature, you must add a policy for SQL Notebooks to a principal—an AWS Identity and Access Management (IAM) user or role—that already has one of the Query Editor V2.0 managed policies. For more information, see Accessing the query editor V2.0.

Import the sample notebook

To import the sample SQL notebook in Query Editor V2.0, complete the following steps:

  1. Download the sample SQL notebook.
  2. On the Amazon Redshift console, choose Query Editor V2 in the navigation pane. Query Editor V2.0 opens in a new browser tab.
  3. To connect to a database, choose the cluster or workgroup name.
  4. If prompted, enter your connection parameters.  For more information about different authentication methods, refer to Connecting to an Amazon Redshift database.
  5. When you’re connected to the database, choose Notebooks in the navigation pane.
  6. Choose Import to use the SQL notebook downloaded in the first step.
    After the notebook is imported successfully, it will be available under My notebooks.
  7. To open the notebook, right-click on the notebook and choose Open notebook, or double-click on the notebook.

Perform data analysis

Let’s explore how you can run different queries from the SQL notebook cells for your data analysis.

  1. Let’s start by creating the table.
  2. Next, we load data into the table using COPY command. Before running the COPY command in the notebook, you need to have a default IAM role attached to your Amazon Redshift cluster, or replace the default keyword with the IAM role ARN attached to the Amazon Redshift cluster:
    COPY gdelt_data FROM 's3://gdelt-open-data/events/1979.csv'
    region 'us-east-1' iam_role 'arn:aws:iam::<account-id>:role/<role-name>' csv delimiter '\t';

    For more information, refer to Creating an IAM role as default in Amazon Redshift.

    Before we create the ML model, let’s examine the training data.

  3. Before you run the cell to create the ML model, replace the <your-amazon-s3-bucket-name> with the S3 bucket of your account to store intermediate results.
  4. Create the ML model.
  5. To check the status of the model, run the notebook cell Show status of the model.  The model is ready when the Model State key value is READY.
  6. Let’s identify the clusters associated with each GlobalEventId.
  7. Let’s get insights into the data points assigned to one of the clusters.

In the preceding screenshot, we can observe the data points assigned to the clusters. We see clusters of events corresponding to interactions between the US and China (probably due to the establishment of diplomatic relations), between the US and RUS (probably corresponding to the SALT II Treaty), and those involving Iran (probably corresponding to the Iranian Revolution).

To add text and format the appearance to provide context and additional information for your data analysis tasks, you can add a markdown cell. For example, in our sample notebook, we have provided a description about the query in the markdown cells to make it simpler to understand. For more information on markdown cells, refer to Markdown Cells.

To run all the queries in the SQL notebook at once, choose Run all.

Add new SQL and markdown cells

To add new SQL queries or markdown cells, complete the following steps:

  1. After you open the SQL notebook, hover over the cell and choose Insert SQL to add a SQL cell or Insert markdown to add a markdown cell.
  2. The new cell is added before the cell you selected.
  3. You can also move the new cell after a specific cell by choosing the up or down icon.

Visualize notebook results using charts

Now that you can run the SQL notebook cell and get the results, you can display a graphic visualization of the results by using the chart option in Query Editor V2.0.

Let’s run the following query to get more insights into the data points assigned to one of the cluster’s results and visualize using charts.

To visualize the query results, configure a chart on the Results tab. Choose actor2name for the X-axis and totalarticles for the Y-axis dropdown. By default, the graph type is a bar chart.

Charts can be plotted in every cell, and each cell can have multiple result tables, but only one of them can have a chart. For more information about working with charts in Query Editor V2.0, refer to Visualizing query results.

Versioning in SQL Notebooks

Version control enables easier collaboration with your peers and reduces the risks of any mistakes. You can create multiple versions of the same SQL notebook by using the Save version option in Query Editor V2.0.

  1. In the navigation pane, choose Notebooks.
  2. Choose the SQL notebook that you want to open.
  3. Choose the options menu (three dots) and choose Save version.

    SQL Notebooks creates the new version and displays a message that the version has been created successfully.

    Now we can view the version history of the notebook.
  4. Choose the SQL notebook for which you created the version (right-click) and choose Version history.

    You can see a list of all the versions of the SQL notebook.
  5. To revert to a specific version of the notebook, choose the version you want and choose Revert to version.
  6. To create a new notebook from a version, choose the version you want and choose Create a new notebook from the version.

Duplicate the SQL notebook

While working with your peers, you might need to share your notebook, but you also need to continue making changes in your notebook. To avoid any impact with the shared version, you can duplicate the notebook and keep working on your changes in the duplicate copy of the notebook.

  1. In the navigation pane, choose Notebooks.
  2. Open the SQL notebook.
  3. Choose the options menu (three dots) and choose Duplicate.
  4. Provide the duplicate notebook name.
  5. Choose Duplicate.

Share notebooks

You often need to collaborate with other teams, for example to share the queries for integration testing, deploy the queries from dev to the production account, and more. You can achieve this by sharing the notebook with your team.

A team is defined for a set of users who collaborate and share Query Editor V2.0 resources. An administrator can create a team by adding a tag to an IAM role.

Before you start sharing your notebook with your team, make sure that you have the principal tag sqlworkbench-team set to the same value as the rest of your team members in your account. For example, an administrator might set the value to accounting-team for everyone in the accounting department. To create a team and tag, refer to Permissions required to use the query editor v2.0.

To share a SQL notebook with a team in the same account, complete the following steps:

  1. Open the SQL notebook you want to share.
  2. Choose the options menu (three dots) and choose Share with my team.Notebooks that are shared to the team can be seen in the notebooks panel’s Shared to my team tab, and the notebooks that are shared by the user can be seen in Shared by me tab.You can also use the export/import feature for other use cases. For example, developers can deploy notebooks from lower environments to production, or customers can provide a SAAS solution sharing notebook with their end-users in different accounts or Regions. Complete the following steps to export and import SQL notebooks:
  3. Open the SQL notebook you want to share.
  4. Choose the options menu (three dots) and choose Export. SQL Notebooks saves the notebook in your local desktop as a .ipynb file.
  5. Import the notebook into another account or Region.

Run parameterized queries in a SQL notebook

Database users often need to pass parameters to the queries with different values at runtime. You can achieve this in SQL Notebooks by using parameterized queries. It can be defined in the query as ${parameter_name}, and when the query is run, it prompts to set a value for the parameter.

Let’s look at the following example, in which we pass the events_cluster parameter.

  1. Insert a SQL cell in the SQL notebook and add the following SQL query:
    select news_monitoring_cluster ( AvgTone, EventCode, NumArticles, Actor1Geo_Lat, Actor1Geo_Long, Actor2Geo_Lat, Actor2Geo_Long ) as events_cluster, eventcode, actor1name, actor2name, sum(numarticles) as totalarticles
    from gdelt_data
    where events_cluster = ${events_cluster}
    and actor1name <> ' 'and actor2name <> ' '
    group by 1,2,3,4
    order by 5 desc

  2. When prompted, input the value of the parameter events_cluster, (for this post, we set the value as 4).
  3. Choose Run now to run the query.

The following screenshot shows the query results with the events_cluster parameter value set to 4.

Conclusion

In this post, we introduced SQL Notebooks using the Amazon Redshift Query Editor V2.0. We used a sample notebook to demonstrate how it simplifies data analysis tasks for a data scientist and how you can collaborate using notebooks with your team.


About the Authors

Ranjan Burman is an Analytics Specialist Solutions Architect at AWS. He specializes in Amazon Redshift and helps customers build scalable analytical solutions. He has more than 15 years of experience in different database and data warehousing technologies. He is passionate about automating and solving customer problems with the use of cloud solutions.

Erol Murtezaoglu, a Technical Product Manager at AWS, is an inquisitive and enthusiastic thinker with a drive for self-improvement and learning. He has a strong and proven technical background in software development and architecture, balanced with a drive to deliver commercially successful products. Erol highly values the process of understanding customer needs and problems in order to deliver solutions that exceed expectations.

Cansu Aksu is a Frontend Engineer at AWS. She has several years of experience in building user interfaces that simplify complex actions and contribute to a seamless customer experience. In her career in AWS, she has worked on different aspects of web application development, including front end, backend, and application security.

Andrei Marchenko is a Full Stack Software Development Engineer at AWS. He works to bring notebooks to life on all fronts—from the initial requirements to code deployment, from the database design to the end-user experience. He uses a holistic approach to deliver the best experience to customers.

Debu-PandaDebu Panda is a Senior Manager, Product Management at AWS. He is an industry leader in analytics, application platform, and database technologies, and has more than 25 years of experience in the IT world. Debu has published numerous articles on analytics, enterprise Java, and databases and has presented at multiple conferences such as re:Invent, Oracle Open World, and Java One. He is lead author of the EJB 3 in Action (Manning Publications 2007, 2014) and Middleware Management (Packt, 2009)

How The Mill Adventure enabled data-driven decision-making in iGaming using Amazon QuickSight

Post Syndicated from Deepak Singh original https://aws.amazon.com/blogs/big-data/how-the-mill-adventure-enabled-data-driven-decision-making-in-igaming-using-amazon-quicksight/

This post is co-written with Darren Demicoli from The Mill Adventure.

The Mill Adventure is an iGaming industry enabler offering customizable turnkey solutions to B2B partners and custom branding enablement for its B2C partners. They provide a complete gaming platform, including licenses and operations, for rapid deployment and success in iGaming, and are committed to improving the iGaming experience by being a differentiator through innovation. The Mill Adventure already provides its services to a number of iGaming brands and seeks to continuously grow through the ranks of the industry.

In this post, we show how The Mill Adventure is helping its partners answer business-critical iGaming questions by building a data analytics application using modern data strategy using AWS. This modern data strategy approach has led to high velocity innovation while lowering the total operating cost.

With a gross market revenue exceeding $70 billion and a global player base of around 3 billion players (per a recent imarc Market Overview 2022-2027), the iGaming industry has, without a doubt, been booming over the past few years. This presents a lucrative opportunity to an ever-growing list of businesses seeking to tap into the market and attract a bigger share as their audience. Needless to say, staying competitive in this somewhat saturated market is extremely challenging. Making data-driven decisions is critical to the growth and success of iGaming businesses.

Business challenges

Gaming companies typically generate a massive amount of data, which could potentially enable meaningful insights and answer business-critical questions. Some of the critical and common business challenges in iGaming industry are:

  • What impacts the brand’s turnover—its new players, retained players, or a mix of both?
  • How to assess the effectiveness of a marketing campaign? Should a campaign be reinstated? Which games to promote via campaigns?
  • Which affiliates drive quality players that have better conversion rates? Which paid traffic channels should be discontinued?
  • For how long does the typical player stay active within a brand? What is the lifetime deposit from a player?
  • How to improve the registration to first deposit processes? What are the most pressing issues impacting player conversion?

Though sufficient data was captured, The Mill Adventure found two key challenges in their ability to generate actionable insights:

  • Lack of analysis-ready datasets (not raw and unusable data formats)
  • Lack of timely access to business-critical data

For example, The Mill Adventure generates over 50 GB of data daily. Its partners have access to this data. However, due to the data being in a raw form, they find it of little value in answering their business-critical questions. This affects their decision-making processes.

To address these challenges, The Mill Adventure chose to build a modern data platform on AWS that was not only capable of providing timely and meaningful business insights for the iGaming industry, but also efficiently manageable, low-cost, scalable, and secure.

Modern data architecture

The Mill Adventure wanted to build a data analytics platform using a modern data strategy that would grow as the company grows. Key tenets of this modern data strategy are:

  • Build a modern business application and store data in the cloud
  • Unify data from different application sources into a common data lake, preferably in its native format or in an open file format
  • Innovate using analytics and machine learning, with an overarching need to meet security and governance compliance requirements

A modern data architecture on AWS applies these tenets. Two key features that form the basic foundation of a modern data architecture on AWS are serverless and microservices.

The Mill Adventure solution

The Mill Adventure built a serverless iGaming data analytics platform that allows its partners to have quick and easy access to a dashboard with data visualizations driven by the varied sources of gaming data, including real-time streaming data. With this platform, stakeholders can use data to devise strategies and plan for future growth based on past performance, evaluate outcomes, and respond to market events with more agility. Having the capability to access insightful information in a timely manner and respond promptly has substantial impact on the turnover and revenue of the business.

A serverless iGaming platform on AWS

In building the iGaming platform, The Mill Adventure was quick to recognize the benefits of having a serverless microservice infrastructure. We wanted to spend time on innovating and building new applications, not managing infrastructure. AWS services such as Amazon API Gateway, AWS Lambda, Amazon DynamoDB, Amazon Kinesis Data Streams, Amazon Simple Storage Service (Amazon S3), Amazon Athena, and Amazon QuickSight are at the core of this data platform solution. Moving to AWS serverless services has saved time, reduced cost, and improved productivity. A microservice architecture has enabled us to accelerate time to value, increase innovation speed, and reduce the need to re-platform, refactor, and rearchitect in the future.

The following diagram illustrates the data flow from the gaming platform to QuickSight.

The data flow includes the following steps:

  1. As players access the gaming portal, associated business functions such as gaming activity, payment, bonus, accounts management, and session management capture the relevant player actions.
  2. Each business function has a corresponding Lambda-based microservice that handles the ingestion of the data from that business function. For example, the Session service handles player session management. The Payment service handles player funds, including deposits and withdrawals from player wallets. Each microservice stores data locally in DynamoDB and manages the create, read, update, and delete (CRUD) tasks for the data. For event sourcing implementation details, see How The Mill Adventure Implemented Event Sourcing at Scale Using DynamoDB.
  3. Data records resulting from the CRUD outputs are written in real time to Kinesis Data Streams, which forms the primary data source for the analytics dashboards of the platform.
  4. Amazon S3 forms the underlying storage for data in Kinesis Data Streams and forms the internal real-time data lake containing raw data.
  5. The raw data is transformed and optimized through custom-built extract, transform, and load (ETL) pipelines and stored in a different S3 bucket in the data lake.
  6. Both raw and processed data are immediately available for querying via Athena and QuickSight.

Raw data is transformed, optimized, and stored as processed data using an hourly data pipeline to meet analytics and business intelligence needs. The following figure shows an example of record counts and the size of the data being written into Kinesis Data Streams, which eventually needs to be processed from the data lake.

These data pipeline jobs can be broadly classified into six main stages:

  • Cleanup – Filtering out invalid records
  • Deduplicate – Removing duplicate data records
  • Aggregate at various levels – Grouping data at various aggregation levels of interest (such as per player, per session, or per hour or day)
  • Optimize – Writing files to Amazon S3 in optimized Parquet format
  • Report – Triggering connectors with updated data (such as updates to affiliate providers and compliance)
  • Ingest – Triggering an event to ingest data in QuickSight for analytics and visualizations

The output of this data pipeline is two-fold:

  • A transformed data lake that is designed and optimized for fast query performance
  • A refreshed view of data for all QuickSight dashboards and analyses

Cultivating a data-driven mindset with QuickSight

The Mill Adventure’s partners access their data securely via QuickSight datasets. These datasets are purposefully curated views on top of the transformed data lake. Each partner can access and visualize their data immediately. With QuickSight, partners can build useful dashboards without having deep technical knowledge or familiarity with the internal structure of the data. This approach significantly reduces the time and effort required and speeds up access to valuable gaming insights for business decision-making.

The Mill Adventure also provides each partner with a set of readily available dashboards. These dashboards are built on the years of experience that The Mill Adventure has in the iGaming industry, cover the most common business intelligence requirements, and jumpstart a data-driven mindset.

In the following sections, we provide a high-level overview of some of The Mill Adventure iGaming dashboard features and how these are used to meet the iGaming business analytics needs.

Key performance indicators

This analysis provides a comprehensive set of iGaming key performance indicators (KPIs) across different functional areas, including but not limited to payment activity (deposits and withdrawals), game activity (bets, gross game wins, return to player) and conversion metrics (active customers, active players, unique depositing customers, newly registered customers, new depositing customers, first-time depositors). These are presented concisely in both a quantitative view and in more visual forms.

In the following example KPI report, we can see how by presenting different iGaming metrics for key periods and lifetime, we can identify the overall performance of the brand.

Affiliates analysis

This analysis presents metrics related to the activity generated by players acquired through affiliates. Affiliates usually account for a large share of the traffic driven to gaming sites, and such a report helps identify the most effective affiliates. It informs performance trends per affiliate and compares across different affiliates. By combining data from multiple sources via QuickSight cross-data source joins, affiliate provider-related data such as earnings and clicks can be presented together with other key gaming platform metrics. By having these metrics broken down by affiliate, we can determine which affiliates contribute the most to the brand, as shown in the following example figure.

Cohort analysis

Cohort analyses track the progression of KPIs (such as average deposits) over a period of time for groups of players after their first deposit day. In the following figure, the average deposits per user (ADPU) is presented for players registering in different quarters within the last 2 years. By moving horizontally along each row on the graph, we can see how the ADPU changes for successive quarters for the same group of players. In the following example, the ADPU decreases substantially, indicating higher player churn.

We can use cohort analyses to calculate the churn rate (rate of players who become inactive). Additionally, by averaging the ADPU figures from this analysis, you can extract the lifetime value (LTV) of the ADPU. This shows the average deposit that can be expected to be deposited by players over their lifetime with the brand.

Player onboarding journey

Player onboarding is not a single-step process. In particular, jurisdictional requirements impose a number of compliance checks that need to be fulfilled along various stages during registration flow. All these, plus other steps along the registration (such as email verification), could pose potential pitfalls for players, leading them to fail to complete registration. Showing these steps in QuickSight funnel visuals helps identify such issues and pinpoint any bottlenecks in such flows, as shown in the following example. Additionally, Sankey visuals are used to monitor player movement across registration steps, identifying steps that need to be optimized.

Campaign outcome analysis

Bonus campaigns are a valuable promotional technique used to reward players and boost engagement. Campaigns can drive turnover and revenue, but there is always an inherent cost associated. It’s critical to assess the performance of campaigns and determine the net outcome. We have built a specific analysis to simplify the task of evaluating these promotions. A number of key metrics related to players activated by campaigns are available. These include both monetary incentives for game activity and deposits and other details related to player demographics (such as country, age group, gender, and channel). Individual campaign performance is analyzed and high-performance ones are identified.

In the following example, the figure on the left shows a time series distribution of deposits coming from campaigns in comparison to the global ones. The figure on the right shows a geographic plot of players activated from selected campaigns.

Demographics distribution analysis

Brands may seek to improve player engagement and retention by tailoring their content for their player base. They need to collect and understand information about their players’ demographics. Players’ demographic distribution varies from brand to brand, and the outcome of actions taken on different brands will vary due to this distribution. Keeping an eye on this demographic (age, country, gender) distribution helps shape a brand strategy in the best way that suits the player base and helps choose the right promotions that appeal most to its audience.

Through visuals such as the following example, it’s possible to quickly analyze the distribution of the selected metric along different demographic categories.

In addition, grouping players by the number of days since registration indicates which players are making a higher contribution to revenue, whether it is existing players or newly registered players. In the following figure, we can see that players who registered in the last 3 months continually account for the highest contribution to deposits. In addition, the proportion of deposits coming from the other two bands of players isn’t increasing, indicating an issue with player retention.

Compliance and responsible gaming

The Mill Adventure treats player protection with the utmost priority. Each iGaming regulated market has its own rules that need to be followed by the gaming operators. These include a number of compliance reports that need to be regularly sent to authorities in the respective jurisdictions. This process was simplified for new brands by creating a common reports template and automating the report creation in QuickSight. This helps new B2B brands meet these reporting requirements quickly and with minimal effort.

In addition, a number of control reports highlighting different areas of player protection are in place. As shown in the following example, responsible gaming reports such as those outlining player behavior deviations help identify accounts with problematic gambling patterns.

Players whose gaming pattern varies from the identified norm are flagged for inspection. This is useful to identify players who may need intervention.

Assessing game play and releases

It’s important to measure the performance and popularity of new games post release. Metrics such as unique player participation and player stakes are monitored during the initial days after the release, as shown in the following figures.

Not only does this help evaluate the overall player engagement, but it can also give a clear indication of how these games will perform in the future. By identifying popular games, a brand may choose to focus marketing campaigns on those games, and therefore ensure that it’s promoting games that appeal to its player base.

As shown in these example dashboards, we can use QuickSight to design and create business analytics insights of the iGaming data. This helps us answer real-life business-critical questions and take measurable actions using these insights.

Conclusion

In the iGaming industry, decisions not backed up by data are like an attempt to hit the bullseye blindfolded. With QuickSight, The Mill Adventure empowers its B2B partners and customers to harness data in a timely and convenient manner and support decision-making with winning strategies. Ultimately, in addition to gaining a competitive edge in maximizing revenue opportunities, improved decision-making will also lead to enhanced player experiences.

Reach out to The Mill Adventure and kick-start your iGaming journey today.

Explore rich set of out-of-the-box Amazon QuickSight ML Insights. Amazon QuickSight Q enables dashboards with natural language querying capabilities. For more information and resources on how to get started with free trial, visit Amazon QuickSight.


About the authors

Darren Demicoli is a Senior Devops and Business Intelligence Engineer at The Mill Adventure. He has worked in different roles in technical infrastructure, software development and database administration and has been building solutions for the iGaming sector for the past few years. Outside work, he enjoys travelling, exploring good food and spending time with his family.

Padmaja Suren is a Technical Business Development Manager serving the Public Sector Field Community in Market Intelligence on Analytics. She has 20+ years of experience in building scalable data platforms using a variety of technologies. At AWS she has served as Specialist Solution Architect on services such as Database, Analytics and QuickSight. Prior to AWS, she has implemented successful data and BI initiatives for diverse industry sectors in her capacity as Datawarehouse and BI Architect. She dedicates her free time on her passion project SanghWE which delivers psychosocial education for sexual trauma survivors to heal and recover.

Deepak Singh is a Solution Architect at AWS with specialization in business intelligence and analytics. Deepak has worked across a number of industry verticals such as Finance, Healthcare, Utilities, Retail, and High Tech. Throughout his career, he has focused on solving complex business problems to help customers achieve impactful business outcomes using applied intelligence solutions and services.